Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies

High quality reference genome sequences are the core of modern genomics. Oxford Nanopore Technologies (ONT) produces inexpensive DNA sequences in excess of 100,000 nucleotides but error rates remain >10% and assembling these sequences, particularly for eukaryotes, is a non-trivial problem. To date there has been no comprehensive attempt to generate experimental design for ONT genome sequencing and assembly. Here, we simulate ONT and Illumina DNA sequence reads for Escherichia coli, Caenorhabditis elegans, Arabidopsis thaliana, and Drosophila melanogaster. We quantify the influence of sequencing coverage, assembly software and experimental design on de novo genome assembly and error correction to predict the optimum sequencing strategy for these organisms. We show proof of concept using real ONT data generated for the nematode Caenorhabditis remanei. ONT sequencing is inexpensive and accessible, and our quantitative results will be helpful for a broad array of researchers seeking guidance for de novo genome assembly projects.


Introduction 23
The ability to sequence molecular fragments has created an entirely new field of biology, 24 genomics. In 1951, Frederick Sanger first sequenced amino acids (Sanger andTuppy, 1951a, 25 Sanger andTuppy, 1951b); in 1964 Robert Holley sequenced RNA and extensions of these 26 works led to DNA sequencing being possible (Holley et al., 1965). The first forms of DNA 27 sequencing would follow in the Wu Lab at Cornell University in 1970 (Wu andTaylor, 1971). 28 Wu's methods were then expanded upon by Sanger in the mid 1970's (Sanger et al., 1973) and 29 later commercialized making sequencing technology available for scientific discovery. The 30 advent of Illumina's high-throughput sequencing-by-synthesis technology resulted in next-31 generation sequencing and opened the door for rapid expansion of the genomics field (Zhang et 32 al., 2011). 33 The higher accuracy of Illumina data is essential for single nucleotide polymorphism 34 (SNP) detection or other fine-scale analyses, but the short read-length (between 50-250 35 nucleotides) is a challenge for genome assembly algorithms and detecting structural variants. 36 The third generation of sequencing focuses on long sequence reads (>10,000 nucleotides) and 37 reading nucleotides from single molecules. Currently available long-read sequencing 38 technologies prioritize read length at the expense of accuracy. Pacific Biosciences (PacBio) and 39 Oxford Nanopore Technologies (ONT) are the current front-runners in long-read sequencing 40 platforms; both are capable of average read lengths in the tens of thousands of base pairs and, 41 theoretically, entire chromosomes can be sequenced in a single read (Quail et al., 2012), 42 (Schneider andDekker, 2012) platforms and parameters are given in Table 1). 43 Many factors affect the de novo genome assembly. Genome size increases the size of the 44 "puzzle" to put together, while the size of the pieces (sequences) remains the same. Polyploidy 45 can create scenarios where many sites in a genome look highly similar to one another, making it 46 tough to place these regions within an assembled genome (Kyriakidou et al., 2018, Claros et al., 47 2012. For non-haploid organisms, there is a potential for heterozygosity and at these sites the 48 effective sequencing coverage is cut in half. 49 Repetitive regions, mobile genetic elements, and diversity between individuals in a 50 population also create unique challenges. Repetitive regions introduce a major hurdle for 51 Illumina data (Treangen and Salzberg, 2011) as repeats longer than the maximum read length 52 (often 150 bp) cannot be properly placed by the assembler, thus creating a break in the assembly. 53 Often these repeat regions are present in more than one location in the genome and without 54 contextual information it is difficult to identify how many copies exist in the complete genome. 55 For example, Alu repeat elements reach >1 million copies in the human genome (Consortium, 56 2001). With repeats making up a significant portion of larger genomes, it is possible to over-or 57 under-assemble in de novo genome sequences. 58 Individual diversity in a population plays a key role when pooled data must be used. This 59 is often encountered when working with small, non-clonal metazoans where the necessary 60 amount of DNA cannot be acquired from a single individual. This pooled-data compounds the 61 issues of ploidy and heterozygosity. 62 The read sizes of PacBio or ONT data can theoretically solve these problems. The long 63 sequence reads span repetitive regions, potentially allowing for the identification of the exact 64 size and location of these repeats on a chromosome. Long sequence reads increase the puzzle 65 piece size for assembly and require less sequencing effort to span the entire genome. In fact, with 66 microbial genomes, it is possible to assemble highly accurate, complete genomes with just long-67 read sequence data (Koren et al., 2013). Since Illumina short-read data are orders of magnitude 68 We find that ONT approaches can produce highly contiguous genome assemblies with 114 relatively high sequencing coverage of >100x, 5x higher than the current recommendations. Pure 115 ONT sequencing and assembly outperforms our tested hybrid approach. For organisms where 116 >100x ONT coverage cannot be generated, we find the success of hybrid assembly is determined 117 by the sequencing coverage of the Illumina data. We also find that the use of Illumina data, even 118 at low 20x sequencing depths, increases accuracy through iterative polishing. 119 120

121
We simulated 150bp paired-end Illumina DNA libraries with the software ART (Huang et al., 122 2012) and ONT DNA libraries with the software NanoSim . Both software 123 programs utilize an assembled sequence to generate simulated libraries with read profiles similar 124 to an empirically generated DNA library. NanoSim additionally requires real ONT flowcell data 125 to simulate the unique, organism-specific ONT mismatch, insertion and deletion rates. 126 We assembled ONT libraries with the Canu software package (Koren et al., 2017) and 127 'hybrid' ONT and Illumina libraries with the MaSuRCA software package (Zimin et al., 2017). 128 We measured genome statistics relative to the reference sequence of each organism with QUAST 129 (Gurevich et al., 2013). We assessed contiguity and accuracy of the assembled sequence through 130 eight statistics: 131 (1) NG50 is a size median statistic and indicates that 50% of the expected assembled genome 132 sequence (where the expectation is based on a known reference) is contained in contiguous 133 sequences that large or larger. 134 (2) NGA50 is a similar size median but indicates that 50% of the expected assembled genome 135 sequence that aligns to the reference genome is contained in contiguous sequences that large or 136 larger. 137 ( 3) LG50 is the number of linkage groups or contiguous assembled sequences containing 50% of 138 the expected assembled genome sequence. 139 (4) LGA50 is similar but again measured in the portion of the assembled sequence that aligns to 140 the reference genome. 141 For our metazoan organisms we also used the software package BUSCO to search for a set of 149 unique genes that are expected to be conserved in single copy in an evolutionarily related group 150 of organisms (Simão et al., 2018). 151 152

Escherichia coli 153
Canu assemblies (Koren et al., 2017) resulted in single contigs that could be circularized when 154 using high levels of coverage. The number of assembled contigs decreased with increasing 155 coverage levels ( Fig. 2A, B). The best overall performance was produced by a dataset containing 156 50,000 reads (~62x coverage or ~290Mbp). This produces a single circular contig that matched 157 the reference genome 100% and had a duplication ratio of just 1.001 (1 duplicated base per 158 1000bp). The worst Canu assembly used 15,000 reads (~19x coverage) and produced 6 total 159 contigs. 160 The hybrid assembly approach using MaSuRCA (Zimin et al., 2017) performed well for 161 E. coli. Two of the tested hybrid sets were able to assemble the genome into a single contig. 162 These two sets had differing coverage depths for both long-and short-reads ( Fig. 2A, B). Here 163 we also noted that MaSuRCA consistently was unable to perform as well as Canu (Koren et al.,164 2017) when given the same long-read dataset. For instance, the top performing Canu run used 165 50,000 ONT reads (~62x coverage); the same reads passed through MaSuRCA produced 2 166 contigs, regardless of coverage from Illumina data (Fig. 2B). 167 We used MaSuRCA to assemble a set of paired end Illumina sequences (DNA reads from 168 other side of a single molecule) to test the influence of adding ONT data to the assembly process. 169 The resulting sequence had a genome fraction of 98.76% and was fragmented into 74 contiguous 170 pieces. The accuracy, 1.7 mismatches and 0.04 indels per 100kbp, was higher than many of the 171 hybrid assembled sequences. This indicates that the inclusion of ONT data can introduce 172 misassemblies in hybrid approaches. 173 174

Caenorhabditis elegans 175
The ONT assemblies for C. elegans show a general improvement in contiguity as coverage depth 176 increases (Fig. 3A,B). However, it does take more overall coverage to approach a chromosome-177 level assembly. This is likely due to increased genome size, increased genome complexity, and 178 the addition of heterozygosity in the data compared to haploid E coli. The top-performing 179 assembly for C. elegans from Canu (Koren et al., 2017) produced 14 contigs (6 chromosomes) 180 from ~336x ONT (33.6Gbp) coverage. One assembly produced fewer contigs with 264x ONT 181 coverage but had lower overall accuracy and quality scores (Fig. 3C, D). It is also worth noting 182 that smaller data sets (ones that realistically can acquired from a single ONT flow cell) still 183 produced highly contiguous assemblies with 21 or fewer contigs. We found that error correcting 184 the more contiguous assembly with Illumina paired-end sequences and the Pilon software  (Table 2). 194

Drosophila melanogaster 196
Canu (Koren et al., 2017) assemblies of ONT data simulated from D. melanogaster repeated the 197 pattern seen with the C. elegans dataset; the most contiguous assembly was produced with 113x 198 ONT coverage (~16Gbp; Fig. 4A,B). This assembly produced 145 contiguous pieces but many 199 of these were small and the LG50 was low ( Figure 4B). We identified 91.7% of the metazoan 200 genes expected to be conserved in single copy with BUSCO (Simão et al., 2018). Additional data 201 decreased contiguity with a slight increase in accuracy and increased NG50 ( Fig. 4 A,D). markedly worse than the pure ONT assemblies in regard to contiguity ( Figure 4B). While the 209 hybrid assemblies do have an initial advantage in accuracy, this disappears after polishing with 210 Illumina sequences (Table 2). We again note that with smaller datasets, the hybrid approach 211 produced assemblies that were much smaller than the expected genome size (Fig. 4C). While not 212 as drastic as the discrepancies seen with C. elegans, assembly completion vs. the reference 213 ranged from 61.5%-94.5%. The top performing MaSuRCA assembly produced 220 contigs and 214 93.9% of the expected metazoan genes were identified in the sequence (Simão et al., 2018). 215 216

Arabidopsis thaliana 217
The assemblies produced by Canu (Koren et al., 2017) for A. thaliana consistently improved 218 contiguity with increasing data. The top Canu assembly started with 420x coverage (56.7Gbp) 219 and produced 30 contiguous pieces (LG50 5 chromosomes) with 98.8% of the expected 220 Viridiplantae genes ( The hybrid assemblies for A. thaliana performed the best of the three eukaryotes. They 225 produced comparable, yet slightly less contiguous assemblies when compared the long-read only 226 approach (SFig. 1A,B). Here, we noted a similar pattern to that seen in C. elegans and D.

Caenorhabditis remanei 233
We attempted to follow the trends found from the simulated data in our approach to creating a de 234 novo assembled sequence for the nematode C. remanei, strain PX356. C. remanei is an obligate 235 outcrossing species with high levels of nucleotide diversity that have hobbled previous assembly 236 attempts (Fierst et al., 2015, Barriere et al., 2009). The best assembly was achieved using 102x 237 polishing with Illumina paired end reads at ~225x average depth (Fig. 6). We tested the influence 244 of Illumina coverage on polishing and found that 97.6% of the expected conserved nematode 245 genes could be identified after polishing with just ~20x Illumina coverage (Table 3), indicating 246 that a large amount of data is not necessary to correct the majority of errors in an assembly. 247 However, this was achieved after three successive rounds of error correction with the Pilon 248 software package (Walker et al., 2014) utilizing the same Illumina DNA sequence reads. This 249 assembled sequence is less fragmented and contains a higher percentage of conserved nematode 250 genes ( Fig. 6) when compared with the previously published assembly for C. remanei PX356 251 produced using paired-end and mate-pair data as well as a linkage map (Fierst et al., 2015). across the genome. This is far higher than the current recommendations of 20x as a minimum 260 and 30-60x for ideal results. We found the discrepancy to be caused by differences in idealized 261 vs. actual read lengths. Although ONT can theoretically produce megabase-sized reads in reality 262 many of the sequence reads in 'real' projects are shorter due to handling techniques that result in 263 library fragmentation and truncated DNA molecules. During assembly the Canu software will 264 discard shorter reads and create a dataset that ideally has 40x coverage of the estimated genome 265 in reads 10,000bp or longer. Real ONT libraries, like the ones we used for simulations, have 266 many more small reads and >100x real coverage can be required to achieve effective high-267 confidence, long read coverage. Intense effort has gone into developing robust high molecular 268 weight DNA extraction protocols that can alleviate some of these issues. However, 'real' 269 sequencing projects should aim to generate >100x coverage for reliable sequencing. were fragmented with more contigs, lower NG50 values and assembled a smaller fraction of the 272 expected genome. We found, surprisingly, that given the same amount of ONT data the Canu 273 software (Koren et al., 2017) assembled a higher contiguity genome sequence when compared 274 with a hybrid MaSuRCA assembly. These hybrid assemblies contained a higher proportion of 275 expected conserved genes when compared with the raw Canu-assembled ONT read sets and had 276 fewer mismatches and insertion/deletion errors. However, Illumina sequences, even in small 277 amounts, can be used to error correct the draft assemblies produced by Canu and improve the 278 accuracy to be on par with, or better than, those produced by MaSuRCA. 279 In light of our findings, we suggest that long-read data be prioritized when undertaking Our results demonstrate that chromosome-level genome sequences are achievable with 292 sufficient ONT data. However, chromosome-level genome assemblies are often not necessary to 293 address many research questions, particularly those focused on small numbers of genes or 294 phylogenomic information. Researchers should approach genome sequencing by first 295 determining what genome-completion level will be sufficient for their research goals. To aid in 296 this approach, we hope our study will help researchers determine the amount of sequencing 297 effort, and the sequencing approaches, that will best suit their needs. 298 299

Limitations of the Study 300
Our study has two central limitations. First, our study is based on simulated data. Our simulated 301 Illumina and ONT DNA sequences and resulting assembled genome sequences are limited by the 302 quality of the available reference. Each of the model organisms we targeted has chromosome-303 level assemblies, but they may still have issues with contiguity and completeness. For example, 304 they may not reach one contig per chromosome or 100% conservation of expected genes for that 305 taxonomic division. We have made our measurements relative to the reference genome to 306 account for this. Second, because of the time and resources necessary to assemble ONT and 307 hybrid ONT and Illumina read sets we were not able to exhaustively search parameter space or 308 assemble very high depth ONT read sets for our metazoan model organisms. Despite these 309 limitations we have presented a quantitative approach to experimental design for genome 310 sequencing and assembly that will be useful to a broad array of researchers interested in genomic 311

Declaration of Interests 336
The authors declare no competing interests.

Assembly & Polishing
Genomes were assembled using two approaches. The first used the simulated ONT read sets and Canu v1.9 (Koren et al., 2017). Each genome was assembled at decreasing coverage depths until the assembler was unable to complete an assembly with the given data. In order to minimize the influence of individual reads and stochastic assembly artifacts, each read set was generated by selecting a random subset of the full simulated dataset.
The long-read datasets that performed the best were then paired with simulated pairedend Illumina data and assembled using MaSuRCA version 3.3.9 (Zimin et al., 2017); coverage depths were adjusted for both datasets to better understand the effects of increasing or decreasing coverage on the final assembly. Here, we retained the ONT dataset to maximize our ability to draw parallels between assembly approaches. For example, the minimum ONT dataset that assembled with Canu (Koren et al., 2017) was an average of 60x coverage across the genome and this readset was used in the MaSuRCA trials with 50x and 100x Illumina coverage, respectively.
The most contiguous assemblies from the long-read only and hybrid categories were error corrected using Pilon version 1.23 (Walker et al., 2014) to determine the effect of short-read polishing on the accuracy of the draft assemblies. Each simulated assembled sequence was polished with the entire simulated paired-end data set. Four rounds of polishing were completed for each assembly with statistics measured after each round with QUAST (Gurevich et al., 2013) and BUSCO (Simão et al., 2018).