PATTIE: Publication access through tiered interaction & exploration

In this work we present Publication Access Through Tiered Interaction & Exploration (PATTIE), an information foraging, sense-making, and exploratory spatial-semantic information retrieval (IR) system (http://pattie.unc.edu/plos). Non-spatial and spatial IR systems, and some recent studies focused on their principal functions, are discussed and compared. To interactively work through a use-case from the biomedical domain, instructions are provided for readers to conduct exploratory searches directly on the PLOS archive based on the software embedded in the online version of this paper (http://vzlib.unc.edu/software/). To carefully evaluate some of the critical parameters of the PATTIE algorithm, and the core functions of the implemented system, a set of experiments was conducted. Along with details on the experimental methods and their rationale, key findings from the experiments are analyzed and presented. Finally, with an eye toward the future of software-embedded scientific papers, their potential benefits for supporting direct engagement with scientific content, replication, and validation are discussed.


Introduction
Information retrieval (IR) systems are essential tools for finding relevant documents. Current IR systems predominantly adopt the ranking-based retrieval model, which returns a list of documents ranked in descending order of predicted relevance for a user query (i.e., search keywords). Such an architecture and information access point rely on the user understanding how and what to search for. The evidence base suggests that this is often an incorrect assumption to make [1,2]. Despite [...] a user to locate and comprehend all the relevant information. This is especially true for the biomedical domain given the ever-growing body of literature, a position that IR researchers have been discussing for decades now.
[...] assures that the users can explore the most updated data. The client side is built on the JavaScript visualization library D3.js [31]. Ajax is implemented on the client side to communicate asynchronously with the server while content is dynamically explored, maintaining the user's workspace without reloading the web page.

As is standard in modern search systems, PATTIE presents a text box for a user to type in a query (Fig 3), although it can also initiate a process of information exploration without a query. When search terms are provided, PATTIE retrieves the N latest articles that map to the search terms in any textual field, including titles, abstracts, and body texts. When no search terms are provided, PATTIE retrieves the N latest articles indexed in the PLOS archive in order to provide the user a mechanism for archive sense-making. N is fixed to a constant so that archival content can be clustered dynamically in constant time, which facilitates real-time processing.
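The fixed-N, recency-first retrieval described above can be illustrated with a small in-memory sketch. It does not call the PLOS API; the `Article` record, its `indexed` field, the `retrieve_latest` helper, and the "every term must appear as a substring" matching rule are all assumptions made for illustration, not details taken from the PATTIE implementation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Article:
    # Hypothetical record layout; the real index fields may differ.
    title: str
    abstract: str
    body: str
    indexed: date  # date the article entered the index (assumed field)


def retrieve_latest(articles, query="", n=2000):
    """Return at most n matching articles, most recently indexed first.

    An empty query matches every article, which mirrors the
    no-query exploration mode described in the text.
    """
    terms = query.lower().split()

    def matches(article):
        # Search all textual fields: title, abstract, and body.
        text = " ".join((article.title, article.abstract, article.body)).lower()
        return all(term in text for term in terms)

    hits = [a for a in articles if matches(a)]
    hits.sort(key=lambda a: a.indexed, reverse=True)
    return hits[:n]  # the cap keeps the downstream clustering cost bounded


if __name__ == "__main__":
    corpus = [
        Article("CRISPR screen", "Cas9 off-target effects", "", date(2020, 1, 1)),
        Article("Malaria", "vector control", "", date(2021, 1, 1)),
        Article("Gene editing", "CRISPR base editors", "", date(2022, 1, 1)),
    ]
    for article in retrieve_latest(corpus, "crispr", n=2):
        print(article.indexed, article.title)
```

Because the result size never exceeds n, every later step operates on a bounded input regardless of how large the archive grows.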

To some extent, this is similar to the idea of mini-batch k-means [32], which has been observed to perform significantly faster than k-means while still converging on a similar clustering solution. Instead of complete randomness, however, PATTIE focuses on index recency, as biomedical researchers are generally concerned with emerging concepts in their field. Limiting the number of documents to N may have an impact on the resulting cluster structure and its quality. However, we assume that the effect is limited when N is set to a sufficiently large value. We will empirically investigate the validity of this assumption in the Discussion section. [...] the underlying PLOS API. Currently, the system retrieves concatenated titles and abstracts, but other indexed fields are available and we plan to study their use in future work.
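For readers unfamiliar with the mini-batch idea referred to above, the following is a minimal pure-Python sketch of the generic mini-batch k-means update (random batches, per-center learning rate 1/count). It is not PATTIE's code: the function names, default parameters, and toy data are assumptions for illustration, and PATTIE replaces random sampling with a recency-based selection.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda j: sq_dist(point, centers[j]))


def mini_batch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    """Cluster points with mini-batch updates and return the k centers."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # random initial centers
    counts = [0] * k  # number of points each center has absorbed so far
    for _ in range(n_iter):
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assign the whole batch against the current centers first ...
        assigned = [nearest(p, centers) for p in batch]
        # ... then move each center toward its points with step 1/count,
        # which keeps the center equal to the running mean of those points.
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]
            centers[j] = [(1 - eta) * c + eta * x for c, x in zip(centers[j], p)]
    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
            (10.0, 10.0), (10.1, 10.2), (10.2, 10.1)]
    print(mini_batch_kmeans(data, k=2, batch_size=3, n_iter=50))
```

Each iteration touches only `batch_size` points, so the per-iteration cost does not depend on the total collection size; that is the property the text compares against PATTIE's fixed-N selection.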

After mapping the query to the archive and retrieving a document set, PATTIE executes an unsupervised machine learning pipeline whose steps are listed in sequential order below.

The pipeline was evaluated, and it was concluded that PATTIE can partition information spaces into coherent clusters [33]. There are two panels and buttons for Scatter/Gather (Fig 4) [...] Map and immediately "scattered" to their coordinates.

In the PATTIE Map, a circle represents a cluster, and the area of the cluster is [...] users conceptualize the Scatter/Gather process, although they are new to the idea.

Users can iterate and/or restart this Scatter/Gather process until their information needs are satisfied.

The user can also choose to go back to the previous state by clicking the "Back" [...] sharply improved as the sample size increased up to 2,000 for Title and Abstract and then stabilized, while full-text data was not as effective.

The result indicates that more information does not necessarily translate to greater performance in clustering documents, potentially due to more irrelevant words brought in with full text. A difference from our previous work [23] is that using titles only did not yield as good clusters as using abstracts. In fact, the difference between Abstract [...]

Here, it should be noted that the above experiment only examined cluster quality, [...] to process in real time depending on the data size, as observed in Fig 6. N is currently limited to a manageable size balancing cluster quality and processing time, but we plan to increase it by employing more efficient data structures and distributed processing in future work.

[...] Although k-means achieves slightly higher AMI, the difference was found to be not statistically significant, and mini-batch k-means runs slightly faster (approximately 0.5 seconds) irrespective of N. Their processing times did not differ greatly because a large portion of the processing time (64-90% for N = 2,000) is accounted for by constructing a tf-idf matrix and applying VCGS and SVD.

Overall, the mini-batch k-means algorithm brings a slight increase in speed with an insignificant difference in clustering performance. While it is a valid alternative, more work needs to be done on the other processes, including tf-idf matrix construction, for further improvement. Therefore, our current system adopts the standard k-means.
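To give a sense of what the tf-idf matrix construction step involves, here is a minimal stdlib-only sketch. The weighting variant (raw term frequency times log(n_docs / document frequency), followed by L2 row normalization), the whitespace tokenizer, and the `tfidf_matrix` name are assumptions for illustration; the exact variant used by PATTIE is not specified in this section, and the VCGS and SVD steps are not shown.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the tf-idf weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        row = [tf[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in row])
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = tfidf_matrix([
        "crispr cas9 editing",
        "crispr off-target effects",
        "malaria vector control",
    ])
    print(vocab)
    print([round(w, 3) for w in rows[0]])
```

Every token of every document is visited at least once, so the cost of this step grows with the amount of text per document, which is consistent with full text being more expensive to process than titles and abstracts.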

In the following, we will demonstrate how PATTIE can be used to explore the PLOS digital archive with respect to a use-case involving a university student's workflow for creating an outline for a review article on the latest advancements in modulating CRISPR guided gene editing. The following items are logically ordered thought processes of the student while anticipating, and engaging in, this information-seeking task. [...] understands that CRISPR associated protein 9 (Cas9) is the essential mechanism for cutting DNA, and is more precisely an endonuclease enzyme [39]. Moreover, [...] editing, the engineering involved in CRISPR technology [40].

• off-target, gene-edited - If CRISPR-guided therapy can result in off-target effects with unintended gene edits, then these studies are crucial to understanding how modulation of CRISPR activity will need to be further investigated.

The student has now scoped the information space by iterating through