Mitochondrial sequences or Numts – By-catch differs between sequencing methods

Nuclear inserts derived from mitochondrial DNA (Numts) encode valuable information. Being mostly non-functional, and accumulating mutations more slowly than mitochondrial sequence, they act like molecular fossils – they preserve information on the ancestral sequences of the mitochondrial DNA. In addition, changes to the Numt sequence since their insertion into the nuclear genome carry information about the nuclear phylogeny. These attributes cannot be reliably exploited if Numt sequence is confused with the mitochondrial genome (mtDNA). The analysis of mtDNA would be similarly compromised by any confusion, for example producing misleading results in DNA barcoding that used mtDNA sequence. We propose a method to distinguish Numts from mtDNA, without the need for comprehensive assembly of the nuclear genome or the physical separation of organelles and nuclei. It exploits the different biases of long and short-read sequencing. We find that short-read data yield mainly mtDNA sequences, whereas long-read sequencing strongly enriches for Numt sequences. We demonstrate the method using genome-skimming (coverage < 1x) data obtained on Illumina short-read and PacBio long-read technology from DNA extracted from six grasshopper individuals. The mitochondrial genome sequences were assembled from the short-read data despite the presence of Numts. The PacBio data contained a much higher proportion of Numt reads (over 16-fold), making us caution against the use of long-read methods for studies using mitochondrial loci. We obtained two estimates of the genomic proportion of Numts. Finally, we introduce “tangle plots”, a way of visualising Numt structural rearrangements and comparing them between samples.


34
Sequences of mitochondrial DNA have proved indispensable markers for population genetics 35 and phylogenetics for decades (Avise et al., 1987;Ballard & Rand, 2005). More recently, 36 numerous ecological experiments have exploited the universal animal barcoding marker, 37 COX1, which is a mitochondrial gene (Hebert, Ratnasingham, & deWaard, 2003). One 38 valuable property of mitochondrial sequences is that, being more abundant, they tend to be 2017). A second advantage is that, thanks to their comparatively small size (approximately 44 16 kbp) and conserved structure (Boore, 1999) animal mitochondrial genomes are easy to 45 assemble. Similar considerations make plastid genomes particularly valuable in genetic 46 analysis of plants (Twyford & Ness, 2016). 47 The advantages of mtDNA analysis can be negated by the presence of Numts: nuclear inserts 48 derived from mitochondrial DNA (and in plants, the plastid equivalent, Nupts). Evidence of 49 such insertions was first found shortly after mitochondria were discovered to contain their 50 own genetic material (Du Buy & Riley, 1967) and it has since become clear that Numts are 51 present in many species (Bensasson, Zhang, Hartl, & Hewitt, 2001), often in multiple copies. 52 The abundance of Numts strongly depends on whether a species has one or more 53 mitochondria per cell, an observation which led Barbrook  While Numts are commonly seen as a nuisance, they are fascinating study objects in their 61 own right. Accruing substitutions more slowly than the mitochondrial DNA lineage, they act 62 as "molecular fossils" providing information about the ancestral mitochondrial sequence 63 Before high-throughput sequencing data became readily available, Numts could be detected, 70 albeit with some difficulty, by PCR-based methods (Bensasson et al., 2000) or cytologically, 71 by in situ hybridisation of mitochondrial sequences to chromosomal preparations (Gellissen,72 Bradfield, White, & Wyatt, 1983). To physically separate mitochondrial and nuclear DNA, 73 (ultra) centrifugation can be used (Garber & Yoder, 1983 In this paper, we develop a further approach that investigates Numts by making use of the 88 "by-catch" from High-throughput sequencing. We use this term to emphasise that much 89 sequencing data is superfluous to the aims of a specific experiment, often the huge majority. 90 The low price of sequencing data means these data could be discarded; yet this genomic by-91 catch can be mined for valuable incidental information. In particular it can be used to 92 assemble genomes of organelles such as mitochondria. The choice of sequencing platform 93 may influence the type of by-catch produced, particularly because of differences in 94 fragmentation and size-selection protocols. We show that this difference can be exploited to 95 investigate the Numt content of the genome. 96 In order to outline our approach, it is helpful to divide the data generated from a sequencing 97 library into fractions (see Figure 1). The first major division is between A -reads without 98 similarity to mitochondrial sequences and ML -mitochondrial-like sequences (i.e. those 99 which align to the mitochondrial genome). This ML fraction may conceptually be subdivided 100 further into D -those which can be identified as Numt sequences (i.e. sequences aligning for 101 their full length but having a different sequence, or aligning for part of their length), C - Here we demonstrate two complementary approaches for estimating the proportion the 113 nuclear genome made up of Numts. One exploits the variation in M from sample to sample 114 in short-read data, which arises because of differences in the mitochondrial composition 115 with tissue, sex, or developmental stage. Secondly in some types of long-read data M is 116 minimal, so the ratio can be calculated directly. We demonstrate these approaches using 117 genome skimming data (coverage < 1/3x) generated by short-read (Illumina's NextSeq) and 118 long-read (PacBio's RSII) platforms from the grasshopper, Podisma pedestris. We assemble 119 the species's mitochondrial genome sequence and calculate these two estimates of the 120 proportion of Numts in the nuclear genome. We also introduce a method, which we call 121 "tangle plots", for the visualisation of Numts with structural re-arrangements 122 123 124

126
Information about the samples can be found in Table 1. Before DNA extraction, the legs were dipped into boiling water to inactivate DNases. 132 Subsequently, the denatured femur muscle was dissected out. DNA was then extracted using 133 a Qiagen Blood and Tissue kit following the manufacturer's instructions. Using a Covaris ultra 134 sonicator the DNA was sheared aiming to achieve a median size of 550 bp. Libraries for 135 sequencing were prepared using an Illumina TruSeq DNA PCR-Free kit. Sequencing was 136 carried out at QMUL's Genome Centre on Illumina's NextSeq Platform using v2 chemistry. 137

138
Freshly removed hindlegs were stored in pure ethanol. DNA was extracted from four 139 samples using a Qiagen Gentra HMW kit resulting in molecules with a length of 140 mainly > 48 kbp (TapeStation, Agilent Genomics). Further work was carried out by The 141 University of Liverpool's Centre for Genomic Research. The aimed size for DNA 142 fragmentation was 10 kbp. The libraries' median (non-redundant) insert sizes were 3125, 143 3167, and 2097 bp. Sequencing was carried out on a PacBio RSII machine using C6 chemistry. 144 All PacBio data were obtained as circular consensus sequences (CSSs) in FASTQ format. 145 These are of a higher per-base quality than the raw reads, because they are generated from 146 multiple reads generated from the same circular template. 147

Data cleaning
148 Two sets of clean NextSeq data were prepared. For the RepeatExplorer analysis (see below), 149 the data were filtered using a custom python script keeping only read pairs where 90 % of 150 the bases had a phred quality score > 20. Pairs matching the TruSeq adapters (detected by 151 BLASTn num_aligments 1) were discarded to remove adapter dimers. 152 A second cleaned set of NextSeq data was generated for mapping and variant calling. Here, 153 we aimed to remove as many low-quality base calls as possible. The first 5 bp of each read 154 were removed and, using Skewer RepeatExplorer was run twice. The first run was used to assemble a reference sequence for 167 the Podisma pedestris mitochondrial genome, from the NextSeq reads. In the second run,

Mapping and variant detection 188
In order to detect individual-specific variants, a second round of mapping was carried out 189 with NextSeq data. Polymorphisms were called using Geneious's function "Find 190 Variations/SNPs" with default settings and a minimum allele frequency set to 0.01. The 191 resulting tables were exported to CSV format and were processed interactively in R 3.3.1 (R 192 Core Team, 2016). 193 All PacBio reads were aligned to the mitochondrial assembly using the LAST suite (Kielbasa,194 Wan, Sato, Horton, & Frith, 2011). In brief, the reference genome was masked in regions 195 with GC-content below 10 % and was subsequently converted to a LAST database using the 196 scoring scheme NEAR, preserving all masked regions and additionally masking simple 197 repeats (optimised for high AT-content). Lastal was then run with parameter D set to one 198 thousand times the length of the assembly (corresponding to an e-value of 1e-3 in BLAST). 199 Of the resulting hits, only those with alignment lengths above 100 bp were kept. Shorter

235
Six mitochondrial genome assemblies 236 We assembled the mitochondrial genome sequences of six individuals of Podisma pedestris 237 (each of our short-read genome-skimming datasets) using contigs produced by the 238 RepeatExplorer pipeline (Novák et al., 2013). RepeatExplorer generates "clusters" of 239 decreasing size corresponding to repetitive DNA sequences in the samples analysed. 240 RepeatExplorer contigs with similarity to the mitochondrial genome of the desert locust, 241 Schistocerca gregaria, were merged in Geneious R9 and a 383-pb direct repeat, which had 242 been collapsed, was adjusted manually after mapping each sample's reads back to the 243 respective assembly. To check the reliability of this approach, we re-assembled the 244 mitochondrial genome from each data set with NOVOPlasty, yielding essentially the same 245 sequences, the differences being around the control region; for example NOVOPlasty did not 246 assemble the repeat in 3 cases. 247 Each of our assemblies is 16,008 bp in length. The average mapping depth varies between 248 samples from several hundred to few thousand-fold, which could be due to differences in 249 cellular content of mitochondria between individuals (Tab. 1). All genes typically found in 250 animals were identified using the MITOS WebServer v2beta (Fig. 3A). The gene order is 251 collinear with other grasshopper mitochondrial genomes, and the sequences align readily 252 (see alignment in supplementary file S1). 253 The alignment of all six consensus sequences contains 18 variable sites, five of which show 254 population-specific polymorphisms (Fig. 3B). A neighbour-joining tree shows that each is between ¼ and 4-fold for most clusters, the mitochondrial-like clusters (circles in Fig. 3) 277 show the most extreme values (at least 16-fold enrichment in our short-read data). 278 279

Short reads: Polymorphism in individual-specific alignments 280
SNPs were called within each alignment of individual-specific short ML reads. The minimum 281 minor allele frequency set to 1 % to avoid erroneous calls due to sequencing errors. All 282 assemblies contain numerous polymorphic sites with low to medium minor allele 283 frequencies, which can be interpreted as variants present in Numts (dots in Fig. 4). The fact 284 that we find appreciable allele frequencies even though we sequenced only a fraction of 285 each genome, strongly suggests that there a multiple Numt insertions present in each 286 sample. The distributions of these allele frequencies are skewed towards 0 with maxima 287 varying between samples (the extremes are 7 % and 20 % in samples N5 and N6, 288 corresponding to the narrowest and widest band in Fig. 4). Over all samples, there is a 289 correlation between fraction D (distinguishable Numts) and fraction A (without sequence 290 similarity to mitochondrial genomes). The slope of the linear regression represents the 291 genomic proportion of distinguishable Numts in P. pedestris. It is 9×10 -04 (p=1.04×10 -3 ) with a 292 standard error of 1×10 -4 . As expected, the intercept is not significantly different from zero 293 (p=0.805), see Supplemental figure S6. 294 In total, there are four SNPs with frequencies that are clear outliers from the frequency 295 distribution (shown by arrows in Fig. 4). Interestingly, individuals N1 and N3 from Le Blayeul Numts and also chimeric PacBio read (which we expect to be rare). The alignments cover 327 263,635 bp, which corresponds to 0.027 % of the total CCS data generated. There were only 328 59 PacBio CCSs matching full-length, of which 41 were sufficiently diverged from the 329 mitochondrial sequence to meet or criterion for classification as Numts. These align along 330 96,210 bp corresponding to 0.01 % of the (non-redundant) PacBio data generated. The 331 remaining 18 full-length matches could be derived from mitochondrial genomes, but they 332 may well be derived from Numts inserted recently. Covering 36,925 bp, these ambiguous 333 CCSs represent only 9.3 % of the ML fraction in the PacBio data. 334 Interestingly, the mapping depth of full-length matches has a bimodal distribution. While the 335 18 ambiguous matches contribute mostly to the first peak, Numt-derived CCSs map to the 336 areas under both peaks (see Fig. 6). 337

338
Both mitochondrial (M) and Numt (C+D) sequence are generated as side-products of 339 sequencing experiments, analgous to by-catch on fishing trawlers. We investigated these 340 sequences using genome-skimming data (less than ⅓x genomic coverage) from the 341 grasshopper, Podisma pedestris, using Illumina's NextSeq and PacBio's RSII platforms with 342 six and three biological replicates, respectively 343

345
One of the most striking results is that the Illumina data had over 16-fold higher frequency of 346 reads mapping to the mitochondrial clusters than the PacBio data, suggesting that the 347 Illumina protocol produced a correspondingly higher proportion of sequences from the 348 mitochondria (the M fraction), at some point between extraction and data interpretation. proportion of Numts in genome skimming data, which are possible even in the absence of a 370 genome assembly. It is shown in Supplemental Information S6 that for our short-read data, 371 there is a good correlation (R²=0.93) between the proportions of Numt-derived data reads 372 (fraction D) and non-ML data (fraction A). The slope of this regression corresponds to the 373 estimated genomic proportion. It is 0.09 %. This is a lower-bound estimate, because it is 374 based on sequence divergence between Numts and the mitochondrial genomes sequence. 375 The estimate based on PacBio CCSs is somewhat lower; ML CCSs with divergent sequences 376 amount for 0.01% of the PacBio data. Another class of CCSs, which match only along part of 377 their sequence, are likely to represent Numts, too, however there is a small chance that 378 some of them are derived from chimeric SMRT bells. These sequences amount for 0.027% of 379 the PacBio data, giving a total of 0.037%. including transposable elements able to excise and re-insert themselves, providing a 392 mechanism for copy-number increase. 393

394
In Fig. 5, we show tangle plots, which allow visual comparisons between Numts in multiple 395 samples. The repeated occurrence of the same links within a sample suggest that rearranged 396 mitochondrial sequence has been replicated within a single genome (the low coverage of < 397 ⅓x makes repeated sequencing of the same region unlikely). The similar patterns in different 398 individuals and populations suggest that many of these replicated insertions are fixed or 399 occur at a high frequency. Given that they are unlikely to be functional, it is most plausible 400 that they have spread by genetic drift. 401 Tangle plots can be used to visualise any short-read data sets mapped to a circular (or 402  sample, some fraction N will originate from the nuclear DNA and some fraction M from the 580 mitochondrial DNA, and other sequences. The nuclear fraction N may contain Numts. Some 581 of these will have sequences divergent from M and hence are easily identifiable (fraction D). 582 Other Numts may be cryptic, e.g. recent insertions, (fraction C). Together, Numts and 583 mitochondrial sequences for the fraction ML "mitochondrial-like". 584 585 in the short-read data. There is a general trend for enrichment in PacBio data for clusters 596 with higher GC-content (regression line). Mitochondrial-like clusters, which show the 597 strongest bias, are indicated as circles, rDNA as crosses. Data points with a leverage greater 598 than three times the mean leverage are shown in red. These were excluded from the final 599 regression (dashed line). 600