Abstract
We present Peax, a novel feature-based technique for interactive visual pattern search in sequential data, such as time series or data mapped to a genome sequence. Visually searching for patterns by similarity is often challenging because of the large search space, the visual complexity of patterns, and the user’s perception of similarity. For example, in genomics, researchers try to link patterns in multivariate sequential data to cellular or pathogenic processes, but a lack of ground truth and high variance make automatic pattern detection unreliable. We have developed a convolutional autoencoder for unsupervised representation learning of regions in sequential data that can capture more visual details of complex patterns than existing similarity measures. Using this learned representation as features of the sequential data, our accompanying visual query system enables interactive, feedback-driven adjustments of the pattern search to adapt to the user’s perceived similarity. Using an active learning sampling strategy, Peax collects user-generated binary relevance feedback. This feedback is used to train a binary classifier that ultimately finds other regions exhibiting patterns similar to the search target. We demonstrate Peax’s features through a case study in genomics and report on a user study with eight domain experts to assess the usability and usefulness of Peax. Moreover, we evaluate the effectiveness of the learned feature representation for visual similarity search in two additional user studies. We find that our models retrieve significantly more similar patterns than other commonly used techniques.
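The feedback loop summarized above can be illustrated with a minimal sketch. It is not Peax's implementation: the 2-D feature vectors are synthetic stand-ins for autoencoder encodings, the classifier is a small hand-written logistic regression, and the sampling rule is generic uncertainty sampling, swapped in here as an assumption in place of the sampling strategies the paper describes. All function names (`predict_proba`, `train_logistic`, `most_uncertain`, `search`) and the simulated `oracle` are hypothetical.

```python
import math
import random


def predict_proba(weights, bias, x):
    """Probability that region x is relevant under a logistic model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, epochs=200, lr=0.5):
    """Fit a binary classifier with plain stochastic gradient descent."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            err = predict_proba(weights, bias, x) - y
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def most_uncertain(weights, bias, features, labeled, k):
    """Pick k unlabeled regions whose predicted probability is closest to 0.5."""
    candidates = [i for i in range(len(features)) if i not in labeled]
    candidates.sort(
        key=lambda i: abs(predict_proba(weights, bias, features[i]) - 0.5)
    )
    return candidates[:k]


def fit_on_labeled(features, labeled):
    """Train on every region that has received feedback so far."""
    idx = list(labeled)
    return train_logistic([features[i] for i in idx], [labeled[i] for i in idx])


def search(features, oracle, seed_indices, rounds=3, k=4):
    """Alternate between training and asking for binary relevance feedback."""
    labeled = {i: oracle(i) for i in seed_indices}
    for _ in range(rounds):
        weights, bias = fit_on_labeled(features, labeled)
        for i in most_uncertain(weights, bias, features, labeled, k):
            labeled[i] = oracle(i)  # the user marks the region as relevant or not
    weights, bias = fit_on_labeled(features, labeled)
    # Rank all regions by predicted relevance, most relevant first.
    ranked = sorted(
        range(len(features)),
        key=lambda i: -predict_proba(weights, bias, features[i]),
    )
    return ranked, labeled


# Synthetic stand-in for learned features: indices 0-19 resemble the target.
rng = random.Random(0)
features = [[rng.gauss(2, 0.3), rng.gauss(2, 0.3)] for _ in range(20)]
features += [[rng.gauss(-2, 0.3), rng.gauss(-2, 0.3)] for _ in range(20)]


def oracle(i):
    """Simulated user feedback (assumption): 1 = relevant, 0 = not relevant."""
    return 1 if i < 20 else 0


ranked, labeled = search(features, oracle, seed_indices=[0, 20])
```

In this toy setting the two clusters are well separated, so a handful of labels suffices. With real encodings the number of feedback rounds and the choice of classifier would need to be tuned.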
Footnotes
We rewrote several paragraphs and simplified the equations in Section 5 to clarify our initial and active learning sampling strategies. We also updated the related Supplementary Figures S1 and S2 to provide better intuition for these strategies. In addition, we extended several discussion points in Section 10 to provide more information about learnability, the types of queries that are most effective, and scalability in terms of dataset size and labeling effort. We also improved the clarity of several other sections and did a careful pass to correct language errors and typos. To make space for these changes and additions, we moved Figure 11 to the supplementary material and shortened some paragraphs in Sections 8 and 9.