A multiple genome alignment workflow shows the impact of repeat masking and parameter tuning on alignment of functional regions in plants

Alignments of multiple genomes are a cornerstone of comparative genomics, but generating these alignments remains technically challenging and often impractical. We developed the msa_pipeline workflow (https://bitbucket.org/bucklerlab/msa_pipeline) based on the LAST aligner to allow practical and sensitive multiple alignment of diverged plant genomes with minimal user inputs. Our workflow only requires a set of genomes in FASTA format as input. The workflow outputs multiple alignments in MAF format, and includes utilities to help calculate genome-wide conservation scores. As high repeat content and genomic divergence are substantial challenges in plant genome alignment, we also explored the impact of different masking approaches and alignment parameters using genome assemblies of 33 grass species. Compared to conventional masking with RepeatMasker, a k-mer masking approach increased the alignment rate of CDS and non-coding functional regions by 25% and 14% respectively. We further found that default alignment parameters generally perform well, but parameter tuning can increase the alignment rate for non-coding functional regions by over 52% compared to default LAST settings. Finally, by increasing alignment sensitivity from the default baseline, parameter tuning can increase the number of non-coding sites that can be scored for conservation by over 76%.


Introduction
Multiple sequence alignment is a key challenge in comparative genomics and evolutionary studies (Chowdhury and Garai 2017;Armstrong et al. 2019). As the number of novel genomes being generated is rapidly accelerating, researchers rely on robust tools that can scale from dozens to hundreds of genomes. Many tools are available for pairwise or multiple alignment of genome sequences (Frith and Kawaguchi 2015;Marçais et al. 2018;Armstrong et al. 2020;Minkin and Medvedev 2020). However, these tools generally require a range of inputs such as a phylogenetic tree and repeat masking information. Pairwise alignment tools such as LASTZ and LAST also need their outputs to be post-processed before subjecting them to multiple alignment using a different tool. In addition, many tools do not scale well to large sets of plant genomes. The many requirements and types of software involved can make the seemingly straightforward task of multiple sequence alignment technically challenging for individual researchers. In this work, we therefore developed the practical msa_pipeline to generate multiple sequence alignments from a reference genome and a set of query genomes. The msa_pipeline relies on the LAST aligner and aims to minimize the amount of user effort required to rapidly produce a high-quality multiple alignment. We tested the computational efficiency of the pipeline and the impact of a range of repeat masking and alignment parameters using public grass genome sequences. Overall, we present the publicly available msa_pipeline and recommend repeat masking and alignment strategies that enhance alignment of genic and intergenic regions of diverged plant genomes.

Features and implementation of msa_pipeline
The msa_pipeline only requires a set of masked genomes in FASTA format as input, outputting a multiple sequence alignment in MAF format ( Figure 1). Dependencies are handled using Docker/Singularity and snakemake is deployed as a workflow manager. We used the LAST alignment tool for pairwise alignment, rather than the faster minimap2, because the high sensitivity of LAST (Frith and Noé 2014) makes it more suitable for comparison of diverged genomes. High sensitivity is important for many downstream analyses of the alignment, because it facilitates alignment of functional sequences such as promoters and enhancers that are located in more variable intergenic regions.
Pairwise alignment to the reference genome is conducted in parallel, with the main pipeline bottleneck being multiple sequence alignment using the single-threaded ROAST program. The pipeline outputs multiple alignments in MAF format. We provide scripts to use the alignment to generate genome-wide conservation scores calculated with GERP++ (Davydov et al. 2010), phastCons or phyloP (Siepel et al. 2005). The runtime and memory usage of msa_pipeline shows its efficiency compared to the powerful but resource-intensive Cactus aligner (Table 1).
Benchmarking and improving multiple alignment in plant genomes Selecting alignment benchmarking metrics Measuring the accuracy of alignments between distantly related species is challenging because ground-truth alignments are generally unknown. Studies have therefore measured alignment accuracy by focusing on partial alignments of conserved functional sequences such as exons (Sharma and Hiller 2017;Frith, Hamada, and Horton 2010) or by relying on simulated sequences (Armstrong et al. 2020). To reduce biases caused by simulation parameters or by an exclusive focus on coding sequence, we measured accuracy based on alignments of functional sequences in coding and non-coding regions. Specifically, we calculated precision, recall and F1 score (harmonic mean of precision and recall) of functional regions, assuming that alignments of nonfunctional regions were false positives (see Methods for further details). Although this simplifying assumption is unlikely to generally be the case, the resulting approximate measures are useful for benchmarking alignment quality in the functional regions of the genome that are most important for the majority of downstream analyses.
Appropriate repeat masking can improve multiple alignment performance A major obstacle to accurate and efficient alignment is the large proportion of repetitive sequence found in most plant genomes. In contrast to masking tools like RepeatMasker that rely on repeat databases, approaches such as RED (Girgis 2015) or KMER (Song et al. 2020) try to avoid database bias by using repetitive k-mers (nucleotide sequences of k length) in the genome to identify repeats. Here, we compared RepeatMasker, RED and KMER and tested their impact on subsequent multiple sequence alignment in grasses. We selected species from the PACMAD grass clade (subfamilies Panicoideae, Aristidoideae, Chloridoideae, Micrairoideae, Arundinoideae, and Danthonioideae) which diverged ~32.4 mya (Cotton et al. 2015) as well as species from the BOP grass clade (subfamilies Bambusoideae, Oryzoideae, and Pooideae) which diverged ~80 mya (Christin et al. 2014). We found substantial differences between all three masking methods, In maize, compared to KMER, RepeatMasker masked an additional 28.89% of CDS and 38.96% of non-coding functional regions ( Figure 2A, Table S1). KMER also masked substantially less sequence than RED ( Figure 2A and Table S1). Overall, KMER displayed the most favorable tradeoff between the masking rate and the rate of masked coding and non-coding functional sequence across most genomes ( Figure 2B, Figure S1B, Table S2; see Supplementary Results and Supplementary Data). KMER, however, failed to mask substantial numbers of repeats in fragmented genome assemblies such as those of Dichanthelium oligosanthes and Eragrostis tef.
Based on analysis in the PACMAD clade, genomes masked with KMER produced sensitive alignments (mean F1= 0.4670 for pairwise alignment; multiple alignment F1= 0.4809) with higher alignment rates of functional sequence than those masked with RepeatMasker (mean F1= 0.3569 for pairwise alignment; multiple alignment F1= 0.3686 ) and those masked with RED (mean F1= 0.4284 for pairwise alignment; multiple alignment F1= 0.4506) ( Figure 2C, Figure S1C and S1D; see Supplementary Data). Our results suggest that using k-mer-based masking improves alignment, with hard-masking performing comparably to soft-masking while also providing minor improvements in runtime (Table 1 and Table S3).
Exploration of alignment parameter space shows potential for improving intergenic alignment rates Alignment parameters such as substitution matrices and gap penalties can have a substantial effect on alignment (Frith, Hamada, and Horton 2010). Often default alignment settings are based on testing in mammalian genomes that are less repetitive and diverse than those of many plants. To explore the alignment parameter space for grass genomes, we tested 750 different combinations of ten LAST parameters including gap penalties and substitution matrices for multiple alignments (Table S4). By approximating recall and precision as measures of alignment performance, we assessed 750 multiple alignments of a 5Mb and 1.4 Mb syntenic region in the grass clades known as PACMAD and BOP ( Figures S2-S5). Although we found that some of the best alignments were generated using default LAST alignment parameters (recall = 0.2823, precision = 0.9095, F1 = 0.4309) and Cactus alignment (recall = 0.4040, precision = 0.8478, F1 = 0.5472), alternative LAST parameter combinations showed substantial differences including some improvements in alignment performance compared to the default parameters (Table S5, Table S6 and Table S7).
The default LAST penalty matrix and parameters favor precision over recall, which we found leads to low alignment rates in intergenic regions for divergent genomes like those in the PACMAD grass clade. In this study, we selected the parameter combination 'LAST strict' (Table S6) with equal precision compared to LAST default parameters but a recall of 0.35, corresponding to a 23% increase from the default (Table S7). This gain in recall is mainly attributable to use of the HOXD70 penalty matrix and lower penalization of gaps (Table S6). The parameter combination 'LAST relaxed' (Table S6) further decreases the gap existence cost (parameter -a), elevating the recall to 0.57 while maintaining a precision over 0.85. This parameter combination produces an alignment with similar precision and recall but substantially lower computational cost compared to the Cactus alignment in both the PACMAD and BOP clade ( Figure 3, Figure S5, and Table S7).

Multiple alignment parametrization facilitates detection of genomic conservation
To evaluate how much the multiple alignment affects estimates of genomic conservation, we calculated the GERP conservation score based on the previously introduced 750 alignments of PACMAD and BOP generated with different alignment parameter combinations. In PACMAD, the number of sites that had sufficient alignment depth (>=3 species) to produce a conservation score ranged from 92,437 to 3,843,983 ( Figure

Outlook
The msa_pipeline leverages existing tools to provide a practical solution for rapid multiple alignment of genomes with minimal user effort. For divergent plant genomes, different repeat masking approaches had limited impact on the alignment rate, but reduction of gap-related alignment penalties boosted alignment rates of non-coding functional elements. We anticipate that the accelerating pace of genome sequencing and assembly will generate rich resources for genomescale multiple alignments that drive biological discovery in plants.

Repeat masking approaches
Repeats often cannot be aligned accurately between genomes. For this reason, repetitive sequences are often replaced with 'N's (hard-masked) or set to lowercase (soft-masked) and treated differently from non-repetitive sequences during alignment. Repeat masking with popular methods such as RepeatMasker generally relies on libraries of repeat elements that are aligned to genomes to identify known repeats. Here, repeat masking was carried out on the genome assemblies using RepeatMasker 4.1.1 with the RepBase 20181026 database of Viridiplantae and a custom set of repeats mined from each genome using RepeatModeler. A drawback is that repeat elements not similar to those in the library will not be masked and, conversely, non-repetitive functional elements with similarities to repeats may be erroneously masked (Bayer, Edwards, and Batley 2018). To compare kmer-based masking approaches to RepeatMasker, we therefore also conducted masking with RED and a novel kmer-based approach (Song et al., 2020) that we refer to as KMER.

Selection of syntenic regions for alignment analysis
Whole genome alignment is computationally demanding (Table 1). To accelerate comparison of multiple alignments constructed with a range of parameters, we used a subset of genomic sequences from our target species. Specifically, we used MCScan (Wang et al. 2012) to select a syntenic region that is common to grass genomes and contains 100 genes based on the Sorghum bicolor GCF_000003195.3 genome ( Figure S2; see Supplementary Data). This allowed us to compare alignment results from two distinct clades of grasses known as the BOP and PACMAD clades. Oryza longistaminata was excluded from mini-genome analyses due to poor alignment rates (see Supplementary Data). The reference genome used for the 18 selected BOP species was rice (version IRGSP-1.0) and the reference genome for the 14 selected PACMAD species was maize (version B73V4).

Sampling the alignment parameter space
We performed multiple alignment with the msa_pipeline for PACMAD clade species and BOP clade species using three sets of differently masked sequences (RepeatMasker, RED, KMER) for each clade. Each masking approach was furthermore tested with hard-masked and soft-masked sequences. We varied 10 LAST pairwise alignment parameters to explore the parameter space (Table S4)

Evaluation of alignments
We focus on the alignment of functional elements of the genome as a measure of alignment quality.
We use a broad definition of these functional elements, including non-coding functional regions (promotors, UTRs, introns, open chromatin) and coding regions (CDS).
We define as the number of bases of Zea mays functional elements with at least half of the query species aligned, while is defined as the total number of bases of Zea mays functional elements.

=
(1) Thus we define approximate alignment recall as shown in equation 1.

= (2)
In equation 2, we define the number of aligned non-functional intergenic bases as and use them to help calculate approximate alignment precision. A key assumption here is that intergenic regions distant from genes and with inaccessible chromatin are enriched for erroneous alignments compared to our defined functional regions. This assumption is a caveat for our calculation of precision, because false positives are identified based on this assumption rather than a ground truth.
Finally, we can calculate the F1 score using our calculations of alignment recall and precision.

Alignments affect the detection of genomic conservation
To assess how the alignment affects the inference of genomic conservation, we calculated conservation using GERP with the msa_pipeline in the PACMAD and BOP clade respectively.
For each alignment generated from the 750 parameter combinations, we used a fixed neutral tree and considered all sites with Rejected Substitution (RS) scores greater than 80% of the maximum RS score to be conserved. The threshold for considering a site conserved in BOP was RS=1.568 and the threshold in PACMAD was RS=1.072.
To further explore the site to site alignment, we used Pearson's correlation of GERP RS scores between the PACMAD and BOP clades. We expect a substantial proportion of conservation to be clade-specific and thus uncorrelated, limiting the maximum correlation possible. However, we cautiously consider an increase in correlation as a potential indicator for improvements in alignment of functional sequences conserved across grass clades.
We used LAST alignment to lift-over genomic coordinates between the rice genome (the reference for BOP) and the maize genome (the reference for PACMAD). For the sites that could be lifted over between rice and maize, we then calculated the correlation of GERP RS scores between PACMAD and BOP across the genome and for different functional genomic regions.    *Due to its high resource requirements, the Cactus aligner resource use estimate is based on an alignment prepared for a separate study using 10 PACMAD species (including three shared with the msa_pipeline set). The alignment was conducted with 96 threads and runtime per thread is conservatively estimated by dividing total runtime by the thread count. Runtime was rounded down to the nearest hundred hours and memory was rounded down to the nearest 50 Gb.
Supplementary Data S1. Repeat masking results for grass genomes using three masking RepeatMasker thus had the highest masking rate but for many species, it also masked the highest proportion of functional elements occurring in coding and open chromatin sequence, and noncoding functional region ( Figure 1A). KMER displayed the most favorable trade-off between the masking rate and the rate of masked coding and open chromatin sequence. This difference between RepeatMasker and KMER was not specific to maize, it was noticeable in most grass genomes ( Figure 1B, Figure S1). However, KMER performance declined in cases where the genome was poorly assembled, as is the case for Dichanthelium oligosanthes and Eragrostis tef.
To further investigate the impact of masking on pairwise alignment, we analyzed the alignment recall, precision and F1 score for CDS, open chromatin regions, non-coding functional regions and all functional regions ( Figure 1C). Using pairwise alignments of 32 grass species, the average F1 scores were 0.40 (KMER), 0.35 (RED), 0.29 (RepeatMasker), and 0.32 (unmasked). This showed that KMER produces the highest F1 score and that hard-masked alignment performs similarly to soft-masked alignment but with a significant speed-up ( Figure 1, Figure S1, Table 1