Efficient masking of plant genomes by combining kmer counting and curated repeats

The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis or pangenome exploration. While homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here we benchmark a two-step approach, where repeats are first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, using the kmer-based Repeat Detector (Red) and two repeat libraries (REdat and nrTEplants, curated for this work). We obtained repeated genome fractions that match those reported in the literature, but with shorter repeated elements than those produced with conventional annotators. Inspection of masked regions overlapping genes revealed no preference for specific protein domains. Half of Red masked sequences can be successfully classified with nrTEplants, with the complete protocol taking less than 2h on a desktop Linux box. The repeat library and the scripts to mask and annotate plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.


Introduction
Besides genes, plant genomes contain intergenic sequences, with increasing repetitive sequences as genome size grows. The growth in repeat content is roughly linear up to a genome size of 10Gbp, including most known angiosperms, and then plateaus (1). The repetitive fraction of the genome is made up of low-copy repeats, simple repeats (such as satellite DNA), and transposable elements (TEs), which were discovered by Barbara McClintock in maize (2).
TEs can be important to explain observed phenotypes or domestication (see for instance reference (3) ) and are used as a source of genetic variability in breeding programs (4). The hypothesis is that the copy-and-paste and cut-and-paste mechanisms of TEs might leave footprints in the genome and can potentially affect the expression, regulation or coding sequence of neighboring genes. Moreover, TEs are increasingly receiving attention in studies tackling plant pangenomes (see for instance (5)). According to the Wicker classification, plant TEs can be classified either as class I RNA retrotransposons or class II DNA transposons (6). Software resources such as RepeatMasker (RM) (7), RepBase (8) or RepetDB (9) that are typically used to annotate TEs in plant genomes use the Wicker classification rules (see (10) for a software review). However, it has been reported that disease resistance (R) genes, which are of great interest for plant breeding, are often masked by these annotation strategies (11). Furthermore, in our experience creating the Ensembl Plants resources, repeat annotation jobs can take up to several days on a computer cluster depending on the genome size. Moreover, the RepBase library requires subscription.
In addition to the intrinsic biological value of TEs, the annotation of repeats can be used to estimate assembly quality (12), as an alternative to gene completeness (13). For other genomic analyses, the bulk of repeated sequences may disrupt common computational genomic analyses and are thus often masked out without any classification attempt. For instance, whole-genome alignment (WGA), promoter analysis or the construction of graph genomes require the computation of resulting in frequency tables of k-mers, which are nucleotide words of size k. If repeated sequences are not masked, the frequency tables are severely biased and can affect the obtained results (see for instance (14)). While annotation approaches based on sequence similarity are computationally expensive, k-mer masking strategies are orders of magnitude faster (15)(16)(17)(18) and, in our experience, much better to prepare WGAs of barley and wheat cultivars using LASTZ (19).
In this paper we benchmark a two-step approach for the annotation of plant repeated sequences. First, repeats are called by k-mer counting with the Repeat Detector (Red).
Second, the discovered repeated sequences are annotated by sequence alignment to a newly curated metacollection of repeats called nrTEplants. We compare this approach to the conventional RM pipeline, with both nrTEplants and the REdat (20) library, on a set of 20 angiosperms from Ensembl. We then compare their performance and discuss the results.
The nrTEplants library is bundled with scripts to mask and annotate genomes of angiosperms, enabling interoperability, reuse and reproducible analyses (21).

Plant repeat libraries
We searched the literature for plant-specific libraries of repeated sequences and selected those in Table 1. While some are specific for a species or repeat family, others comprise repeats from mixed species, such as REdat from PGSB PlantsDB (20) or RepetDB (9).
FASTA files with nucleotide sequences of repeats were downloaded from the indicated URLs or obtained from the authors.

Plant cDNA sequences
Plant species in Ensembl Plants release 46 (November 2020) (22) were ranked in terms of the number of proteins reviewed in Uniprot by February 22nd, 2020 (23

Sequence clustering
All cDNA and TE sequences were clustered with GET_HOMOLOGUES-EST version 10042020 (24). This software runs BLASTN and the MCL algorithm, and computes coverage by combining local alignments. The sequence identity cutoff was 95% and the alignment coverage 75%. Global variables in script get_homologues-est.pl lines L36-7 were set to $MAXSEQLENGTH = 55000 and $MINSEQLENGTH = 90. Sequences were clustered with command get_homologues-est.pl -d repeats -m cluster -M -t 0 -i 100 . Check https://github.com/Ensembl/plant_tools/tree/master/bench/repeat_libs for more details.

Negative control: Pfam domains of disease resistance (R) genes
For the identification and curation of Pfam domains encoded by R genes, the following steps were performed. First, a set of 153 protein sequences encoded by reference R genes (i.e. cloned and/or with robust evidence) was retrieved from http://www.prgdb.org/prgdb (26). aestivum). From the initial set of Pfam domains, only 43 were consistently identified in our final panel of potential R gene's encoded proteins, and used as negative control. Note that one of them (PF02892, zf-BED) is often found in transposases (25). The list is available at https://github.com/Ensembl/plant_tools/blob/master/bench/repeat_libs/control_neg_NLR.list.

De novo annotation of nucleotide-binding and leucine-rich repeat immune receptor (NLR) genes
The NLR-annotator software was used for de novo annotation of NLR genes, which are the most abundant R genes characterized to date, in whole genome sequences (29). Briefly, the set of 20 plant genomes were dissected into fragments of 20 kb length, with 5 kb overlaps, using the ChopSequence.jar routine. The cut sequences were then scanned to find NLR-associated sequence motifs using NLR-Parser.jar. Finally, NLR-Annotator.jar was used to integrate the annotated motifs and retrieve the actual NLR loci in BED format. In order to compute intersections with repeats only NLR loci with overlap > 50bp were considered.
Moreover, to account for the fact that the tested masking strategies cover different fractions of the genome, odd ratios of NLR masking were computed with Equation 1:

Masking and annotation of repeats in plant genomes
RepeatMasker version v4.0.5 and a fork of Repeat Detector (Red) v2.0 adapted for Ensembl, available at https://github.com/EnsemblGenomes/Red, were used to call repeats in plant genomes with libraries REdat v9.3 and nrTEplanst v0.3. Red was called from script https://github.com/Ensembl/plant-scripts/blob/master/repeats/Red2Ensembl.py, which can run several sequences in parallel and feed the results into a Ensembl core database (30).
In addition, minimap2 version 2.17-r974-dirty (31) was used to annotate repeats called by Neighbor genes were also subtracted from downstream/upstream windows.

Enrichment of Pfam domains
Enrichment was computed with R function fisher.test (34) and Pfam domains (25) retrieved with recipe B4 of https://github.com/Ensembl/plant-scripts (35). Pfam domain counts for the complete proteome were used as expected frequencies. Only genes with an overlap > 50bp and domains with False Discovery Rate adjust values (p<0.05) were considered.

Construction and benchmark of a non-redundant library of repeats: nrTEplants
A set of plant TE libraries and annotated repeats from selected species, listed on Table 1, plus cDNA sequence sets from the best annotated plant species in Ensembl, were curated and their TE classification terms uniformized. Then, they were merged and clustered (95% identity, 75% coverage of shortest sequence). From the resulting 994,349 clusters, a total 174,426 clusters contained TE sequences and were 6-frame translated and assigned Pfam domains. Of these, a subset of 8,910 mixed clusters comprised both TE and cDNA sequences and required further processing (see example on Figure S1). After empirical assessment, we decided to take only clusters i) containing sequences from at least 6 different TE libraries (6 replicates), which eventually left out RosaTE repeats; and ii) with a fraction of sequences marked as 'Potential Host Gene' in RepetDB < 0.00. The resulting nrTElibrary contains 171,104 sequences (see Tables S1 and S3).
In order to benchmark the newly constructed library we compiled a positive control, comprising 22 Pfam domains found in transposable elements, and a negative control, a list of 43 Pfam domains found in disease resistance NLR genes. With these controls, we estimated the sensitivity (0.909) and specificity (0.947). The nrTEplants library can be obtained at https://github.com/Ensembl/plant-scripts/releases/tag/v0.3 .

Masking repeats within plant genomes
A set of 20 plant genomes were selected from Ensembl (22) to benchmark repeat calling strategies. These are listed on Table 2 next to the genomic fraction of repeats reported in the literature and their GC content. All these genome sequences were annotated with RepeatMasker (7), nrTEplants and REdat (20), a repeat library used in Ensembl Plants. In addition, the fraction of repeats called by Red, based on K-mer enrichment, is also shown.
Note that Red automatically selected k values from 13 to 16.
On Figure 1 the resulting percentages of repeated sequences are plotted next to the values reported in the literature. The median difference between REdat repeated fraction and the literature reports is 26.5%. This number is 9.8% for nrTEplants and 4.3% for Red. These results suggest that Red can successfully mask any genomes without previous knowledge of the repetitive sequence repertoire of a species. Moreover, repeats called by Red generally overlap sequences masked with REdat (66.6%) and nrTEplants (73.8%). Table 3 summarizes the number of repeats, and their length, called by all tested strategies.
We observe that Red calls more (a median of 845 per Mbp, compared to 391 for nrTEplants and 221 for REdat) but shorter repeats (median 143, compared to 284 and 217, respectively). Note that the numbers below are shown in the same REd, nrTEplants, REdat order for clarity. Figure 2 summarizes how called repeats overlap with genes, exons and 500pb windows upstream and downstream genes. It can be seen that Red overlaps more with genes (median 17.8%, compared to 13.7% and 5%, respectively), but when exons are considered these numbers change to 9%, 15% and 4%, respectively. The figure also shows that Red masks more proximal upstream and downstream space, which will likely have a positive impact on k-mer counting strategies for promoter analysis (39). The analysis on Table S4 shows that Red identifies four times more K-mers with 20+ copies in this regulatory space, which agrees with recent work that found that unidentified TEs are over-represented in specific regulatory networks (40).
In order to check whether the compared approaches masked preferentially genes from certain families, a Pfam enrichment analysis was carried out and summarized on  Table S5 show that nrTEplants performs better than REdat in this respect, with 87 domains compared to 153 (Red had 39 in total). As gene annotation is frequently performed after repeat masking, we reasoned this could affect the Pfam enrichment analyses. Therefore we carried out a complementary analysis where NLR genes were called de novo on the genomic sequences instead of using the Ensembl gene annotation. The results, summarized on Table S5, confirm that Red tends to mask less NLR genes than expected at genomic scale, with only one species (Trifolium pratense) with an odds ratio > 1. In contrast, we obtained odd ratios greater than 1 for several species with REdat (n=7) and nrTEplants (n=12).
Overall, we observe that Red consistently produces repeated fractions similar to the expected values from the literature. The library nrTEplants also shows good performance in most species, but fails to recover the expected repeat fraction in cases such as melon or sunflower. Red also performs better than nrTEplants in terms of exon and upstream/downstream overlap. Furthermore, Red does not seem to systematically mask certain gene families. The lower values observed for REdat are a consequence of it underpredicting repeats, showing less sensitivity than the other approaches.

Annotating Red-masked repeats within genomes with nrTEplants and minimap2
In the previous analyses we showed that Red-base masking is an effective way of calling repeats in plant genomes. Moreover, we observed that nrTEplants behaves better than REdat in most cases. Therefore, we wanted to check whether repeats called with Red can be annotated and classified. For that, we aligned the repeat sequences against the non-redundant nrTElibrary with minimap2. The results are plotted on Figure 4, where it can be seen that, in most species, more than half of the repeat space can be annotated (median 65.9%). As our library contains only transposable elements, we expected a fraction of unmapped space containing simple repeats or satellite DNA. However, as also observed on Figure 1, in a few species, only a small repeat fraction could be classified. We reasoned this was due to repeat consensus not represented in the library. This was confirmed in a separate experiment where olive and R. chinensis repeated sequences obtained from their authors were mapped to Red repeats, as seen in Figure 4 in a grey dashed line. A positive control was also carried out with sunflower repeated sequences, in order to confirm that no valuable repeats had been lost during the construction of nrTEplants.
These results were obtained with the default map-ont settings of minimap2. We also tried map-pab and asm20 settings, but obtained similar results. Red clover (Trifolium pratense) was re-analyzed replacing minimap2 with BLAST algorithms megablast, dc-megablast, blastn and rmblastn (41). Compared to the mapped fraction produced by minimap2 (0.4%), a maximum value of 6.1% was obtained with blastn. This modest gain in sensitivity required 1,412 minutes. The algorithm rmblastn, used by RM, yielded a mapped fraction of 0.7%. We concluded that the alternatives to minimap2 offered little gain at the cost of spiralling computing time. Figure 5 shows the runtime and RAM required by the two-step protocol presented in this paper, measured on a CentOS7.9 computer using 4 cores of a Xeon E5-2620 v4 (2.10GHz) CPU. Panels A and B correspond to the first step, Red masking. It can be seen that all genomes tested take less than 40 min to run, with the exception of tetraploid Triticum turgidum, which took 71 min. The memory consumption was below 20GB in all cases, but climbed to 22.7GB and 29.9GB in A. tauschii and T. turgidum. Panel C illustrates the runtime of the second step, the mapping of nrTEplants. It can be seen that all plants required less than 27 min, except A. tauschii and T. turgidum, that took 3h and 1h respectively. The memory consumed by minimap2 was 3.8-4.0GB in all cases.

Conclusions
The hybrid, two-step methodology presented in this paper was tested on 20 angiosperms with genome sizes from 0.12 to 10.46 Gbp. Compared to RM, Red calls more and shorter repeats, and the obtained repeated genome fractions agree closely with those reported in the literature. Moreover, we find that Red k-mer masking does not have a preference for particular protein-coding families, in contrast to repeats annotated with RM using REdat and nrTEplants. Overall, more than half of Red masked sequences can be classified with nrTEplants, except in species with repeats not present in that library. Our protocol takes less than 2h to run and up to 30GB of RAM, and can use nrTEplants or any repeat library in FASTA format.    (7) and libraries REdat (20) and nrTEplants. The resulting percentage of repeated sequences is plotted next to the values reported in the literature for those genomes and the fraction of repeats provided by Repeat Detector (Red), based on K-mer enrichment (15). Species are sorted from small to large genome size.  (15) or RepeatMasker (7) with libraries REdat (20) and nrTEplants. The median genome space occupied by genes, exons and proximal upstream/downstream 500bp windows in these species are 126.7Mbp, 62.9Mbp and 37.7Mbp, respectively. This plot was generated with R package vioplot (42).   (15). The resulting repeats were subsequently mapped to library nrTEplants with minimap2 (31), producing the genome fractions shown. Repeats from three species (R. chinensis, O. europaea and H. annuus) were also mapped to annotated repeats provided by the respective sequencing consortia as a control. Species are sorted from small to large genome size.

Figure 5.
Runtime and memory requirements of a two-step repeat annotation protocol based on the Repeat Detector (15), minimap2 (31) and the nrTEplants library. The protocol was tested on twenty genomes from release 49 (November 2020) of Ensembl Plants. Similar values were measured on a Ubuntu box with four Core i5-6600 (3.30GHz) CPU cores. Figure S1. Cluster with two Arabidopsis thaliana cDNA sequences (AT4G16920.2 and AT4G16950.1) and transposable element TEdenovo-B-R2288-Map4 from library repetDB. These sequences contain Pfam domain NB-ARC (PF00931), which is part of NLR defense proteins. Figure generated with Bioedit (43) from cluster 269_AT4G16920.2.