The effect of removing repeat-induced overlaps in de novo assembly

Determining accurate genotypes is important for associating phenotypes to genotypes. De novo genome assembly is a critical step to determine the complete genotype for species for which no reference exists yet. The main challenge of de novo eukaryote genome assembly, particularly for plant genomes, is the repetitive DNA sequences within their genomes. The introduction of third generation sequencing and corresponding long reads has promised to resolve repeat-related problems. While there have been notable improvements, reads originating from these repeats still create errors because they introduce false overlaps in the assembly graph. This study focuses on analyzing the effect of repeats on de novo assembly and improving the performance of existing de novo assembly algorithms by removing repeat-induced overlaps. First, we show the improvements in de novo assembly that are possible when repeat-induced overlaps are removed. Then we propose several methods for detecting and removing repeat-induced overlaps and evaluate their performance on several simulated datasets.


Introduction
The goal of de novo genome assembly is to reconstruct a species' genome sequence as completely as possible using a large number of relatively short sequences referred to as "reads" that are read from the species' genome. While high-quality assemblies are already available for many species, many branches of the tree of life still need representative genome sequences. Recently, due to the popularity of long-read sequencing technologies, de novo assembly has once more become of interest. In this paper, we focus on improving the standard long read de novo assembly pipeline.

Most de novo assembly pipelines suitable for long reads follow the OLC paradigm: overlap-layout-consensus. First, in the overlap step, pairwise alignments between the reads are identified. The output of the overlap step is a set of pairwise read overlaps that can be represented as a graph, where nodes are the reads, and edges indicate overlaps between the reads. This graph will be referred to as the assembly graph. Second, the layout step tries to identify bundles of overlaps that belong together. This is done by pruning unwanted edges from the graph such that it becomes more linear through several graph cleaning procedures. Once all procedures are done, the graph is split up into contigs. Finally, the consensus step of the assembly pipeline identifies the most likely base for each position. The layout step is arguably the most differentiating step between the various de novo assembly methods that exist. This can go from extremely simple, e.g. miniasm (1) to very intricate with many manually optimized rules and corresponding specific data types, e.g. DISCOvar

A problem that has plagued de novo assembly since the beginning is interspersed repeats in the two or more distinct genomic locations. Reads originating from any of the repeat instances introduce pairwise overlaps with all instances of the repeat across the genome, which leads to cross-connections in the assembly graph. This will confuse the 'layout' step in the OLC assembly paradigm. Reads spanning the repetitive region can resolve the confusion by connecting the two sides of the repetitive region together. While read lengths have been increasing dramatically for third-generation sequencing (TGS) technologies, for the vast majority of eukaryotic species, the read length is still orders of magnitude smaller than the genome size. Moreover, it is unlikely that we will soon experience the luxury of chromosome-spanning reads like the ones observed for some microbial genomes (3-5). Finally, TGS reads are often still not (yet) long enough to span most of the repetitive regions in eukaryotic genomes.

In this paper, we analyze the effect of interspersed repeats on de novo assembly. Next, we show that removing repeat-induced overlaps can improve the performance of de novo assembly in different eukaryotic genomes, e.g. yeast, human, and potato. We demonstrate that a perfect classifier can increase the coverage of genome assembly by 0.1%, 4% and 7% in yeast, potato, and human chromosome 9, respectively. Finally, we also investigate some methods to detect and remove repeat-induced overlaps and compare their performance to the standard de novo assembly pipeline.

Initially, we tried a baseline method and removed overlaps based on their degree in the assembly graph. Second, we trained a machine-learning model to detect and remove repeat-induced overlaps based on GraphSage node embeddings (6). While this method makes the overlaps set much smaller,

Detecting interspersed repeats
We use Generic Repeat Finder (7) version 1.0 with the default parameters to detect interspersed repeats in these three reference sequences.

Simulating reads and genomes
We use aneusim (8) version 0.4.1 with default parameters to simulate diploid sequences (ploidy=2) close to the reference sequences but with mutations and translocations. We use the simulated haplotype 1 and 2 sequences as genomes of two other individuals of these organisms for further analysis.

We use SimLoRD (9) version 1.0.2 to simulate PacBio-like reads at 40x coverage (-c 40) from the reference and the simulated sequences. Using simulated reads allows us to label the alignments between the reads since we know where the reads originated from.
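
The labeling that simulated reads make possible can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: it assumes the origin of each read is available as a sequence name with start and end coordinates (the `ReadOrigin` record and the example reads are hypothetical), and it applies the rule used in this paper that an alignment is repeat-induced when the two reads' origin coordinates do not overlap.

```python
from typing import NamedTuple


class ReadOrigin(NamedTuple):
    """Hypothetical record of where a simulated read was sampled from."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def label_alignment(a: ReadOrigin, b: ReadOrigin) -> str:
    """Label a read-to-read alignment from the true origins of both reads."""
    same_sequence = a.chrom == b.chrom
    # Two half-open intervals overlap when each starts before the other ends.
    intervals_overlap = a.start < b.end and b.start < a.end
    return "normal" if same_sequence and intervals_overlap else "repeat-induced"


# Toy example: read1 and read2 share 1.5 kb of origin, read3 comes from elsewhere.
origins = {
    "read1": ReadOrigin("chrI", 1_000, 9_000),
    "read2": ReadOrigin("chrI", 7_500, 15_000),
    "read3": ReadOrigin("chrI", 120_000, 128_000),
}
alignments = [("read1", "read2"), ("read1", "read3")]
labels = {
    pair: label_alignment(origins[pair[0]], origins[pair[1]])
    for pair in alignments
}
```

In this sketch, reads sampled from different sequences are always treated as non-overlapping, which is an assumption made for the illustration.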

Alignments and labeling
We use minimap2 (10) version 2.13-r858-dirty with the default parameters to find the pairwise alignments between the reads. We label each alignment according to the origination coordinates of

Genome assembly and evaluation
We use the miniasm (1) version 0.3-r179 with default parameters to assemble the sets of overlaps before and after intervening and removing the candidate alignments.

We use compass (11,12) to evaluate the de novo assemblies. While compass reports many metrics, we only report coverage, validity, multiplicity, the number of contigs and the longest contig.

GraphSage layers to get the node embeddings in other graphs.

However, because the assembly graphs are huge, we need to subsample the graph for training and testing the model. We use the edgesampler module in the StellarGraph library to get the subgraphs.

For yeast sequences, we take 20% of the nodes for training and 20% of the nodes for testing, while for human sequences, we use 2% of the nodes for training and 2% for testing.

Then, we use GraphSage embeddings to train a logistic regression classifier for separating repeat-induced and normal overlaps. We use the first simulated dataset to train this classifier. First, we create the assembly graph of the simulated dataset, and then extract the node embeddings using the previously trained GraphSage model.

We use the GraphSage model to extract node embedding for every node in the assembly graph, and we concatenate embeddings of the two nodes participating in an edge, to get embedding of that

Finally, we use the GraphSage model to extract the embeddings of the second simulated dataset. Then we use the selected model from the previous step to remove overlaps classified as repeat-induced. Next, we use miniasm (1) version 0.3-r179 to assemble the remaining overlap set and compare the results with the standard genome assembly pipeline.
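
The prediction-and-removal step can be sketched as follows. This is an illustration only: the edge embedding is formed by concatenating the embeddings of the two reads in an overlap, as described in this section, while the `is_repeat` callable is a toy stand-in for the trained logistic regression classifier, and all names and values are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Edge = Tuple[str, str]


def edge_embedding(node_emb: Dict[str, Sequence[float]], u: str, v: str) -> List[float]:
    """Concatenate the embeddings of the two reads joined by an overlap."""
    return list(node_emb[u]) + list(node_emb[v])


def remove_predicted_repeats(
    edges: List[Edge],
    node_emb: Dict[str, Sequence[float]],
    is_repeat: Callable[[List[float]], bool],
) -> List[Edge]:
    """Keep only the overlaps that the classifier does not flag as repeat-induced."""
    return [
        (u, v) for u, v in edges
        if not is_repeat(edge_embedding(node_emb, u, v))
    ]


# Toy node embeddings and a toy decision rule standing in for a trained model.
node_emb = {"r1": [0.1, 0.2], "r2": [0.0, 0.3], "r3": [0.9, 0.8]}
toy_classifier = lambda x: sum(x) / len(x) > 0.4
edges = [("r1", "r2"), ("r1", "r3")]
kept = remove_predicted_repeats(edges, node_emb, toy_classifier)
```

Passing the classifier as a callable keeps the filtering logic independent of the library used to train the model.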

Results and discussions
Characteristics of interspersed repeats in yeast, potato, and human genomes

In the first step, we used Generic Repeat Finder to detect interspersed repeats in the genomes of yeast, potato, and human chromosome 9. Table 1 shows the statistics of the interspersed repeats found in these genomes. There are gaps in the potato reference sequence, which are indicated by Ns in the sequence. To simplify the analysis, we removed Ns from the reference sequence. Unresolved repeats are usually responsible for most Ns in the sequence. Consequently, in Table 1, we report fewer interspersed repeats for the potato genome than are present. The analysis is also simplified for human chromosome 9 since it is separated from the rest of the chromosomes, thereby excluding the occurrence of interspersed repeats in the other chromosomes from the analysis.

As shown in Table 1, the repeat content is much higher in human chromosome 9 and potato than in

The distribution of interspersed repeats follows a similar pattern in the three test organisms. However, human chromosome 9 has many longer repeats than the other two organisms (see Figure 1). As mentioned before, the count of repeats in the human genome can be even higher than what is shown in Figure 1 because they might also be present in other chromosomes, which we did not consider in this study. Interestingly, although yeast has lower repeat content (see Table 1) than the other two organisms, it has some very long repeats. The longest repeats in the yeast genome are even longer than the potato's longest repeats. However, this is likely because the potato reference sequence is incomplete and the Ns represent unresolved repeats.

The number of times each repeat occurs varies from 2 to more than 1000 times in the three model organisms (see Figure 2). There are interspersed repeats in human chromosome 9 that occur more than 40000 times, without considering other chromosomes in which these repeats might be present. It is worth noting that the smaller repeats occur more often throughout the genome (see Supplementary Figure 1).

The effect of interspersed repeats in genome assembly
Next, we inspected the effect of interspersed repeats in genome assembly based on simulated reads from the reference genomes. Since the simulator reports the coordinates where a simulated read originated from, it is possible to label the pairwise alignment of reads. If there is an alignment between two reads but the coordinates these reads are sampled from do not overlap, we considered the alignment as repeat-induced. Otherwise, we labeled the alignment as normal. Table 2 shows the number of repeat-induced edges in yeast, human chromosome 9, and potato.

Reads that originate from one of the interspersed repeats align with reads from all other instances, which creates repeat-induced edges in the assembly graph. The human and potato reference sequences are highly repetitive. Therefore, the majority of the edges in the human and potato assembly graphs are repeat-induced (see Table 2). Consequently, the reads originating from interspersed repeat regions also have a high degree in the assembly graph. Figure 3 shows the degree of the normal and repeat-induced edges in the assembly graphs. We define the degree of an edge as the sum of the degrees of the two nodes connected by the edge. Figure 3 shows that most edges with a degree greater than 1000 are repeat-

To analyze the effect of repeat-induced overlaps in the assembly, we evaluated assemblies in the three model organisms before and after removing repeat-induced overlaps. In the normal scenario, we aligned the reads with minimap2 and assembled the genome with miniasm, using the reads and the overlaps from the alignment step. In the removing repeat-induced overlaps scenario, we intervened in the assembly process, removed all the alignments labeled as repeat-induced, and used miniasm to assemble the remaining overlap set. Table 4 shows the results of these two scenarios in the three model organisms. In all three datasets, removing repeat-induced overlaps improves genome assembly. In the yeast genome, removing repeat-induced overlaps leads to 6% more coverage. In the potato genome, removing repeat-induced overlaps leads to 8% more coverage. This is expected since the potato genome is much more repetitive than yeast and suffers from more repeat-induced edges. In the human chromosome 9 dataset, removing repeat-induced edges leads to 3% more coverage.
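
The intervention in the second scenario amounts to filtering the overlap set before it reaches the assembler. The sketch below is one way to do this, assuming the overlaps are stored in minimap2's PAF format (column 1 is the query read name, column 6 the target read name) and that the repeat-induced read pairs are already available as a set. The function name and the example records are hypothetical.

```python
from typing import FrozenSet, Iterable, Iterator, Set


def drop_repeat_induced(
    paf_lines: Iterable[str],
    repeat_pairs: Set[FrozenSet[str]],
) -> Iterator[str]:
    """Yield PAF records whose read pair is not labeled repeat-induced."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        # PAF columns: 1 = query name, 6 = target name (0-based indices 0 and 5).
        pair = frozenset((fields[0], fields[5]))
        if pair not in repeat_pairs:
            yield line


# Two toy PAF records (12 mandatory columns each); the second is repeat-induced.
paf = [
    "read1\t8000\t6500\t8000\t+\tread2\t7500\t0\t1500\t1400\t1500\t0",
    "read1\t8000\t100\t2100\t+\tread3\t8000\t4000\t6000\t1900\t2000\t0",
]
repeat_pairs = {frozenset(("read1", "read3"))}
kept = list(drop_repeat_induced(paf, repeat_pairs))
```

The kept records can then be written back to a PAF file and given to the assembler in place of the original overlap set. Using an unordered pair as the key means the label applies regardless of which read appears as query or target.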

We tested whether removing only a percentage of repeat-induced overlaps would still improve assembly performance in another experiment, where we removed 25%, 50%, and 75% of repeat-induced overlaps in the human chr9 genome and compared the final assemblies. It is clear from Table 3 that removing more repeat-induced overlaps improves coverage and validity and increases the length of the longest contig. However, the multiplicity, number of contigs and assembly size increase after removing 25%, 50%, and 75% of repeat-induced overlaps, and finally drop, with multiplicity getting closer to one, after removing all of the repeat-induced overlaps. This means that when only a portion of repeat-induced overlaps is removed, the assembler replicates some of the repetitive regions, which are valid sequences but increase multiplicity and assembly size. Finally, when all of the repeat-induced overlaps are removed, the assembler can fully resolve these repetitive regions and merge the corresponding contigs together, which results in multiplicity closer to one, assembly size closer to the reference size, and a reduced number of contigs. In conclusion, compared to the standard de novo assembly pipeline, removing 25%, 50%, and 75% of repeat-induced overlaps produces more contigs, while coverage, validity and the longest contig improve as more overlaps are removed. This means even removing a subset of repeat-induced overlaps accurately, without false positives, can improve de novo assembly performance.

Training a classifier to remove repeat-induced overlaps
Since the sequences of the interspersed repeats are almost identical, we relied only on graph-based features to find and remove them. One of the graph-based features that can be informative for detecting repeat-induced overlaps is degree. We expect the edges in the assembly graph representing repeat-induced overlaps to have a high degree since they connect two reads from the repetitive regions and those reads also align to reads originating from all other instances of the repeat. Figure 3 compares the degree of repeat-induced and normal edges in the assembly graphs. Based on Figure 3, there are more repeat-induced edges than normal edges with a degree greater than 1000. However, considering edges with a degree greater than 10000, the difference is much larger, and the number of repeat-induced edges is considerably higher. Therefore, we intervened in the de novo assembly process and removed the edges representing overlaps with a degree greater than 10000 to see if removing them can improve the final assembly result. Table 4 shows the result of removing repeat-induced overlaps based on degree. No improvements are observed using this method over the standard assembly pipeline. Since the yeast assembly graph does not have any edge with a degree greater than 10000, we did not apply this method to it.
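
A minimal sketch of this degree-based filter, using the definition given in this paper (the degree of an edge is the sum of the degrees of its two end nodes). The assembly graph is assumed to be a simple undirected graph supplied as a list of read-name pairs; the function names, the toy graph, and the toy threshold are illustrative only.

```python
from collections import Counter
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def edge_degrees(edges: List[Edge]) -> Dict[Edge, int]:
    """Edge degree = degree of the first node + degree of the second node."""
    node_degree: Counter = Counter()
    for u, v in edges:
        node_degree[u] += 1
        node_degree[v] += 1
    return {(u, v): node_degree[u] + node_degree[v] for u, v in edges}


def filter_by_edge_degree(edges: List[Edge], threshold: int) -> List[Edge]:
    """Drop every edge whose degree is greater than the threshold."""
    degrees = edge_degrees(edges)
    return [edge for edge in edges if degrees[edge] <= threshold]


# Toy graph: "hub" mimics a read from a repeat that overlaps many other reads.
edges = [
    ("a", "b"), ("b", "c"),
    ("hub", "a"), ("hub", "x"), ("hub", "y"), ("hub", "z"),
]
degrees = edge_degrees(edges)
kept = filter_by_edge_degree(edges, threshold=4)
```

In the toy graph every edge touching `hub` exceeds the threshold and is removed, which also illustrates the risk of this heuristic: a normal overlap that involves a high-degree read is discarded along with the repeat-induced ones.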
Next, we used the extracted embeddings of overlaps in the second simulated dataset to train a classifier for separating normal and repeat-induced overlaps. Since the dataset is imbalanced, and the graphs have more normal edges in the yeast genome and more repeat-induced edges in human, we up-sampled and down-sampled repeat-induced edges in the yeast and human datasets, respectively. validation (see Table 5). While the GraphSage embedding model failed to separate the three classes of edges in the yeast dataset, the logistic regression classifier achieved impressive results in separating repeat-induced and normal edges using the same embedding model on the second simulated dataset. Interestingly, the GraphSage model performed much better on the human chromosome 9 assembly graph and achieved 76% validation accuracy.
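
The class rebalancing mentioned above can be sketched as below. The sketch assumes that edges are held as (embedding, label) pairs with the labels "repeat" and "normal" (names chosen for illustration) and resamples only the repeat-induced class until it matches the size of the normal class. The resampling ratio used in the study is not stated, so the 1:1 target is an assumption.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]


def rebalance_repeat_class(samples: List[Sample], seed: int = 0) -> List[Sample]:
    """Up- or down-sample the 'repeat' class to the size of the 'normal' class."""
    rng = random.Random(seed)
    repeat = [s for s in samples if s[1] == "repeat"]
    normal = [s for s in samples if s[1] == "normal"]
    if not repeat or not normal:
        return list(samples)  # nothing to balance against
    if len(repeat) < len(normal):
        # Up-sampling: keep every original and add random duplicates.
        repeat = repeat + rng.choices(repeat, k=len(normal) - len(repeat))
    else:
        # Down-sampling: draw a subset without replacement.
        repeat = rng.sample(repeat, k=len(normal))
    balanced = repeat + normal
    rng.shuffle(balanced)
    return balanced


# Yeast-like case: few repeat-induced edges. Human-like case: mostly repeat-induced.
yeast_like = [([0.1], "repeat")] + [([float(i)], "normal") for i in range(5)]
human_like = [([float(i)], "repeat") for i in range(6)] + [([9.0], "normal")] * 2
yeast_balanced = rebalance_repeat_class(yeast_like)
human_balanced = rebalance_repeat_class(human_like)
```

Rebalancing of this kind should be applied to the training split only, so that validation metrics still reflect the original class proportions.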

Last, we extracted the embeddings of overlaps in the last dataset and used the classifier trained in the previous step that achieved the highest F1 score to predict the repeat-induced overlaps. After removing the overlaps predicted as repeat-induced, we assembled the remaining overlaps and evaluated the results (see Table 4). The performance of yeast assembly drops after removing the overlaps predicted as repeat-induced. That means that the disadvantage of losing some of the normal edges in the yeast assembly graph because of prediction errors outweighs the advantage of removing repeat-induced overlaps. Since the yeast genome does not have many interspersed repeats and repeat-induced edges (see Tables 1 and 2), this is not surprising. On top of that, the only feature we assigned to the nodes before training the GraphSage model is the degree of nodes, while in the yeast assembly graph, the degree of repeat-induced and normal edges is not significantly different (see Figure 3.a). However, the length of the longest contig is increased, and the number of contigs is reduced, which shows that the method resolved some previously challenging repetitive regions.

Similar to yeast, human chromosome 9 assembly performance is lower than baseline after removing overlaps predicted to be repeat-induced (see Table 4). The coverage is ~12% lower and the assembly size is ~40Mbp smaller than the actual chromosome 9 size. The number of contigs is smaller than in all the other cases, and the multiplicity and validity are close to one, which means the assembly and reference map are nearly one-to-one. As a result, the machine learning method is successful in removing some essential repeat-induced overlaps, which enables the assembler to critical normal overlaps as repeat-induced, resulting in decreased coverage and assembly size when they are removed. Despite our best efforts, we were unable to apply the machine-learning method to the potato dataset due to its large size and memory requirements.

Conclusion
In this study, we examine the effect of interspersed repeats on de novo genome assemblies of three organisms, i.e., yeast, human chromosome 9, and potato. The reads originating from interspersed repeat regions align with those from all instances. Therefore, it is possible to label the alignments with non-overlapping originating coordinates as repeat-induced overlaps. Here, we analyze the effect of repeat-induced overlaps in the assembly graph and de novo assembly. Finally, we investigate some strategies to detect and remove repeat-induced overlaps.

Interspersed repeats make up approximately 1, 6, and 10% of the yeast, human chromosome 9, and potato genomes, respectively. Although the repeats cause only 1% of the overlaps in the yeast dataset, they account for 76% and 96% of overlaps in the human and potato datasets, respectively. Since most of the overlaps in the assembly graphs of these two genomes are repeat-induced, this is the most challenging problem to solve in genome assembly.

To investigate the effect of repeat-induced edges in the assembly graph on the final assembly result, we removed all of the repeat-induced overlaps and compared the results to the normal de novo assembly pipeline. We observed that removing repeat-induced overlaps improved coverage and contiguity of the assembly, even in yeast with much lower repetitive content. In potato, which has the most repetitive content among the test organisms, removing repeat-induced edges leads to a 9% improvement in coverage.

We investigate whether it is possible to detect repeat-induced overlaps based on the degree of their corresponding edges in the assembly graph. We define the degree of an edge as the sum of the degrees of the two nodes connected by the edge. As shown in Figure 3, most of the repeat-induced overlaps in the human chromosome 9 and potato assembly graphs have a degree greater than 10000. Therefore, we remove edges with a degree greater than 10000 and examine the effect on the final assemblies. As shown in Table 4, there is no improvement in the assemblies after removing edges with a degree greater than 10000, and the final assemblies are very close to those of the standard assembly pipeline.

We also attempt to train a classifier to detect repeat-induced edges based on graph-based features. Although we achieved some improvement after removing repeat-induced edges with the classifier, the results are far from those obtained when all of the repeat-induced edges are removed. This shows great potential for a follow-up project to detect and remove repeat-induced overlaps accurately.

We suggest that detecting and removing repeat-induced overlaps can be a smart edge-filtering method during assembly. Our attempt to train a classifier that accurately detects and removes repeat-induced overlaps did not achieve significant results. However, our results show that a perfect classifier that removes all the repeat-induced overlaps can make impressive improvements in the genome assembly process.