StainedGlass: Interactive visualization of massive tandem repeat structures with identity heatmaps

Summary: Visualization and analysis of genomic repeats is typically accomplished through the use of dot plots; however, the emergence of telomere-to-telomere assemblies with multi-megabase repeats requires new visualization strategies. Here, we introduce StainedGlass, which can generate publication-quality figures and interactive visualizations that depict the identity and orientation of multi-megabase repeat structures at a genome-wide scale. The tool can rapidly reveal higher-order structures and improve the inference of evolutionary history for some of the most complex regions of genomes.
Availability and implementation: StainedGlass is implemented using Snakemake and is available open source under the MIT license at https://mrvollger.github.io/StainedGlass/.
Contact: mvollger@uw.edu


Introduction
Dot plot analyses are often used to reveal the underlying structure of complex repeats including differences in sequence identity and orientation (Gibbs and McIntyre, 1970; Cabanettes and Klopp, 2018; Rice et al., 2000; Sonnhammer and Durbin, 1995). Advances in long-read sequencing technology, however, have recently made more complex repeat structures and genetic variation available for sequence analysis (Chaisson et al., 2015; Audano et al., 2019; Ebert et al., 2021). With increasingly contiguous assemblies of reference genomes (Rhie et al., 2021) and complete human chromosomes (Miga et al., 2020; Logsdon et al., 2021; Nurk et al., 2021), complete centromeres, tandem duplications, and other heterochromatic arrays can now be systematically analyzed in their entirety. The size and complexity of these structures, often many megabase pairs in size, elude traditional dot plot analyses for three reasons: 1) current visualization methods are largely based on perfect or k-mer matches, which do not lend themselves to the complex higher-order repeats found in centromeres (Willard and Waye, 1987) and the expected mismatches between these large repeats; 2) dot plots offer limited resolution for tandem arrays consisting of megabases of sequence data, frequently reducing them to black squares that relay little information other than the size and presence of a repeat; and 3) the number of possible pairwise matches increases rapidly when identifying exact matches in tandem arrays (e.g., in MUMmer), and this problem is exacerbated further when using a small minimum match length for comparing more divergent arrays.
In this work, we present StainedGlass, which generates identity heatmaps based on sequence alignment rather than small k-mers using an easy, scalable, and customizable workflow that allows for interactive use as well as publication-ready figures. The tool can be applied to study repeat structures at a genome-wide scale or focused at particular regions to characterize complex higher-order repeat structures. As part of our recent analysis of chromosome 8, we developed a prototype of this method which rendered the higher-order repeat structure of the 2 Mbp centromere as an identity heatmap. This prototype facilitated the discovery of higher level symmetry and a layered organization in the centromere, which assisted in the development of a more refined model for centromere evolution, as well as the discovery of hotspots for copy number variation (Logsdon et al., 2021).

Methods
To generate pairwise sequence identity heatmaps for StainedGlass, the input sequence is fragmented into windows of a configurable size (default 1 kbp). All possible pairwise alignments between the fragments are calculated using minimap2 (Li, 2018). The color gradient used in the heatmap is then determined by the sequence identities of the alignments, which are calculated as:

$$\text{identity} = 100 \times \frac{M}{M + X + I + D}$$

where $M$ is the number of matches, $X$ the number of mismatches, $I$ the number of insertion events, and $D$ the number of deletion events. When there are multiple alignments between the same two sequence fragments, all alignments other than the one with the most matches are filtered out regardless of their sequence identity. This situation often arises when aligning tandem repeats, where there can be multiple valid alignments between fragments, and this strategy assists in highlighting the most representative alignment.
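The windowing, identity calculation, and filtering steps described above can be illustrated with a minimal Python sketch. This is not the StainedGlass implementation: the `Alignment` record, the function names, and the choice to report identity as a percentage are assumptions made for illustration, and in the actual workflow the alignment counts would come from minimap2 output rather than being entered by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical summary of one alignment between two windows."""
    query: str       # name of the query window
    target: str      # name of the target window
    matches: int     # number of matching bases
    mismatches: int  # number of mismatched bases
    insertions: int  # number of insertion events
    deletions: int   # number of deletion events


def windows(seq_len, size=1000):
    """Yield (start, end) coordinates of consecutive windows.

    The final window is shorter when seq_len is not a multiple of size.
    """
    for start in range(0, seq_len, size):
        yield start, min(start + size, seq_len)


def percent_identity(aln):
    """Matches divided by matches + mismatches + insertion and deletion
    events, scaled to a percentage (the scaling is an assumption)."""
    denom = aln.matches + aln.mismatches + aln.insertions + aln.deletions
    return 100.0 * aln.matches / denom if denom else 0.0


def best_per_pair(alignments):
    """Keep, for each (query, target) pair, the alignment with the most
    matches, regardless of its sequence identity."""
    best = {}
    for aln in alignments:
        key = (aln.query, aln.target)
        if key not in best or aln.matches > best[key].matches:
            best[key] = aln
    return best
```

In this sketch, a short alignment with perfect identity loses to a longer alignment with more matching bases, which mirrors the filtering rule stated in the text: the retained alignment is the one with the most matches, not the one with the highest identity.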
StainedGlass generates two types of outputs: fixed resolution static figures as well as multi-resolution Cooler files (Abdennur and Mirny, 2020) suitable for interactive multiscale visualization using HiGlass (Kerpedjiev et al., 2018). Sequence identity in Cooler files is calculated using the method described above for the highest resolution and interpolated by averaging values for lower resolutions. The static figures are more appropriate for visualization of relatively small regions (30 Mbp or less) at publication quality (Figure 1) while the HiGlass visualization is better suited for data exploration of whole genome alignments, such as the sequence identity relationships among the short arms of human acrocentric chromosomes (interactive HiGlass browser) (Nurk et al., 2021;Kerpedjiev et al., 2018).
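To make the idea of deriving lower resolutions by averaging concrete, the following pure-Python sketch coarsens a small identity matrix by averaging square blocks of bins. It is only an illustration of block averaging under stated assumptions: the function name, the use of `None` for window pairs with no alignment, and the decision to skip such cells when averaging are all hypothetical, and the sketch does not reproduce how Cooler files are actually aggregated.

```python
def coarsen_by_mean(matrix, factor=2):
    """Return a lower-resolution matrix in which each cell is the mean of a
    factor x factor block of the input.

    Cells equal to None (assumed here to mean "no alignment") are skipped;
    a block containing only None values yields None.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    out = []
    for r in range(0, n_rows, factor):
        row = []
        for c in range(0, n_cols, factor):
            # Collect the non-missing values in this block, clipping at edges.
            block = [
                matrix[i][j]
                for i in range(r, min(r + factor, n_rows))
                for j in range(c, min(c + factor, n_cols))
                if matrix[i][j] is not None
            ]
            row.append(sum(block) / len(block) if block else None)
        out.append(row)
    return out


# Example: a 3 x 3 matrix of percent identities reduced to 2 x 2.
identity = [
    [100.0, 90.0, 80.0],
    [90.0, 100.0, None],
    [80.0, None, 100.0],
]
coarse = coarsen_by_mean(identity, factor=2)
```

Applying the function repeatedly with `factor=2` would give a series of progressively coarser matrices, which is the general shape of a multi-resolution representation.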
The tool is made available using Snakemake (Köster and Rahmann, 2012, 2018; Mölder et al., 2021), which allows for reproducible and scalable data analyses. The stability of new changes is automatically tested with each new addition using continuous integration via GitHub Actions.
Finally, StainedGlass is a Snakemake standard-compliant workflow and therefore has automated usage documentation on the Snakemake website.

Usage and examples
StainedGlass is run using the Snakemake workflow language. To run StainedGlass, clone the repository from https://github.com/mrvollger/StainedGlass and follow the installation instructions. From the cloned directory you can then generate the alignments for StainedGlass using the following command.
snakemake --use-conda --cores 24 \
    --config fasta={path.to.fasta} sample={sample.prefix.for.output}

All configuration options are described in config/README.md, and parameters can be specified on the command line as done above or by modifying the configuration file (config/config.yaml). You can also preview the jobs that will be run without executing the pipeline by adding -n to the command line. To generate static identity heatmaps from the StainedGlass alignments, add make_figures to the Snakemake command. For an example of the output of this command, see Figure 1.