High contiguity long read assembly of Brassica nigra allows localization of active centromeres and provides insights into the ancestral Brassica genome


Abstract
High-quality nanopore genome assemblies were generated for two Brassica nigra genotypes (Ni100 and CN115125), a member of the agronomically important Brassica species. The N50 contig lengths for the two assemblies were 17.1 Mb (58 contigs) and 0.29 Mb (963 contigs), respectively, reflecting recent improvements in the technology. Comparison with a de novo short read assembly for Ni100 corroborated genome integrity and quantified sequence-related error rates (0.002%). The contiguity and coverage allowed unprecedented access to low complexity regions of the genome. Pericentromeric regions and coincidence of hypomethylation enabled localization of active centromeres and identified a novel centromere-associated ALE class I element, which appears to have proliferated through relatively recent nested transposition events (<1 million years ago). Computational abstraction was used to define a post-triplication Brassica specific ancestral genome and to calculate the extensive rearrangements that define the genomic distance separating B. nigra from its diploid relatives.
Decoding complete genome information is vital for understanding genome structure, providing a full complement of both the genic and repeat repertoire and uncovering structural variation. Such information also provides a foundational tool for crop improvement to facilitate the rapid selection of agronomically important traits and to exploit modern breeding tools such as genome editing 1,2,3. Whole genome duplication and abundant repeat expansion have led to an approximately 660-fold variation in genome size among angiosperms 4, and in particular the low complexity of repetitive regions creates challenges for complete genome assembly using short read sequence data 5.
Recent advances in long read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) 6, combined with genome scaffolding methods such as optical mapping and chromosome conformation capture (Hi-C), have led to a paradigm shift in our ability to obtain complete and contiguous genome assemblies 7,8,9. Both approaches can produce remarkably long reads, although the error rate is significantly higher than that of the more accurate Illumina short reads, which until recently limited their use to scaffolding to improve assembly contiguity 10. However, correction algorithms to reduce error rates and recent technological improvements have increased the output and quality of ONT sequence data, making routine and cost-effective assembly of large eukaryotic genomes possible 11.
The Brassicaceae is an important plant family with approximately 3,800 species including commercially important vegetable, fodder, oilseed and ornamental crops. The Brassiceae tribe has a history of extensive whole genome duplication events, including the Brassica genus specific whole genome triplication (WGT), which occurred approximately 22.5 million years ago (Mya) 12,13 and is assumed to be shared by the three important diploids (B. rapa, AA, 2n=2x=20; B. nigra, BB, 2n=2x=16; and B. oleracea, CC, 2n=2x=18) that form the vertices of U's triangle 14 . Among these three, B. nigra (B-genome) has been neglected with regards to both genetic analyses and selection through breeding. Due to its limited domestication and its production as out-crossing populations it has retained valuable allelic diversity compared to its relatives, making it an untapped repository for Brassica breeding 15 . Among the six species of U's triangle, five have been sequenced including most recently B. nigra, but the assemblies cover at most 80% of the estimated genome size and almost all were very highly fragmented due to the use of short-reads alone 16,17,18,19,20 . Recently, the B. rapa reference genome was improved using PacBio sequencing 19 and one genotype each of B. rapa and B. oleracea was sequenced using a combination of ONT and optical maps demonstrating the utility of these technologies for complex duplicated genomes 21 .
The work described represents the nearly complete assembly of two B. nigra genomes (Ni100 and CN115125) using a combination of ONT sequencing, Hi-C and genetic map-based scaffolding. A short-read assembly of Ni100 allowed comprehensive benchmarking of the long-read assemblies. Remarkably, direct methylome profiling utilising the ONT data allowed candidate active centromeres of the chromosomes to be resolved, a feature previously unannotated in short read assemblies. In addition, computationally defined genomic distances between the three Brassica diploid genomes allowed the construction of an ancestral Brassica specific genome.

Results
A combination of nanopore sequencing, Illumina error correction, chromosome conformation capture (Hi-C) and genetic mapping was used to generate two de novo assemblies for the diploid Brassica species, B. nigra (Ni100 and CN115125 genotypes). Identical sequential steps were followed to assemble the contigs for each genome, including the development of high-quality sequencing data sets, genome assembly and polishing with short reads (Supplementary Table 2). For both genotypes the read alignment rate was high (>98%) and both tools indicated a significant improvement after correction, suggesting final error rates of 0.8% (CN115125) to 0.2% (Ni100) at the base pair level. The two long read (LR) assemblies were generated over a period of approximately 12 months, during which time ONT upgraded their library construction kits, pore chemistry and base calling software. The combined impact of these upgrades was noted in an overall improvement in quality, average read length and usable data output for Ni100 and in final assembly contiguity (Supplementary Table 3).
A short read Illumina de novo assembly for B. nigra genotype Ni100 (Ni100-SR) was used for further validation of the nanopore assemblies. The Ni100-SR assembly has a total length of 446.5 Mb from 19,195 contigs, of which 367.5 Mb was anchored to eight pseudo-chromosomes (Table 1; Supplementary Table 6). Alignment and visualisation of corresponding pseudo-chromosome sequence from the three B. nigra assemblies revealed high levels of collinearity (Figure 2A). A number of inversions were noted; in particular, a large inversion at the bottom of B4 distinguished the SR assembly. The B4 region was difficult to scaffold in the SR assembly due to limited recombination and was largely ordered by exploiting synteny data from Arabidopsis thaliana. It was apparent that there was expansion of the ONT assemblies in regions presumed to be pericentromeric, as shown in Figure 2A and 2B.
The level of coverage of these regions also varied between the long read assemblies.
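Assembly contiguity in this work is summarised by the N50 contig length. The following is a minimal sketch of how that statistic is conventionally computed; the function name and the toy contig lengths are illustrative assumptions and are not taken from the assemblies described here.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical toy assembly (lengths in bp), not data from this study.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted lengths, it can be compared directly between assemblies of similar total size, which is how the two long read assemblies are contrasted here.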
Gene annotations from the two LR and two SR assemblies (Ni100-SR and the previously published YZ12151 20) were rationalized to generate a final B. nigra gene complement of 67,030 and 59,877 gene models in the two genotypes, CN115125 and Ni100, respectively. These numbers are in line with the predicted pan-gene content of diploid B. oleracea, with 63,865±31 genes 26. An additional 3,546 genes were annotated in Ni100-LR compared to the Ni100-SR assembly. A homology search performed using GMAP 27 (minimum identity and coverage of 95%) indicated only 914 of the additional genes were unique to the Ni100-LR assembly (Supplementary Figure 5A). This discrepancy was due to both co-assembly of highly similar genes in the SR data and assembly errors that precluded accurate gene annotation. Read mapping of Illumina data back to the SR and LR assemblies showed a marked increase of 9% multi-mapping reads in the latter, with a concomitant reduction in non-concordant matches, suggesting the resolution of duplicated or highly homologous sequences in the LR assembly (Supplementary Table 7). The recent study of ONT assemblies of B. rapa and B. oleracea used the self-incompatibility locus, or S locus region, which is notoriously difficult to assemble owing to its repetitive structure, to infer the enhanced contiguity of the LR-derived genome sequences 28. The S locus region was identified and compared in the two B. nigra long read assemblies, showing complete assembly of two differing S locus haplotypes (Supplementary Figure 8B). Since it is often noted that gene families related to stress are more prone to differential copy number variation, differences in R-genes, transcription factors (TF) and protein kinase families were assessed in each of the genomes.
The distribution of R gene families across the species appeared to be directly related to genome size and/or expansion of
Structural variations (SVs), including deletions, insertions, duplications, inversions and translocations, that differentiate genotypes were cataloged between both genomes using ONT reads. The raw ONT reads from Ni100 and CN115125 were aligned to both LR assemblies and SVs were quantified using two different SV callers (Sniffles 31 and Picky 32). Self-alignment was used to estimate a false positive rate for each genome, which was higher for the CN115125 assembly (6,307 vs. 2,230 events) (Supplementary Table 14; Figure 3). High quality SVs were considered to be those identified with both software packages (Figure 3D and 3E). At least fifteen random SVs of each type were assayed manually and suggested almost 100% prediction accuracy for deletions and insertions; however, some of the larger predicted events seemed less reliable and it was apparent that a number of deletions were overlooked, suggesting the criteria may have been too stringent. Overall, 7,181 SVs were identified by comparing the CN115125 ONT reads against the Ni100-LR reference (C2onNL) (Supplementary Table 14; Figure 3). Among the 7,181 SVs, 70% (5,059) were deletions, of which 45% were closely associated with genic regions (Supplementary Table 14; Figure 3B). Likewise, 6,078 SVs were found by mapping Ni100 ONT reads against the C2-LR assembly (NLonC2), with deletions (3,856) again prevailing, followed by insertions (1,921) (Figure 3A). In general deletions were more prevalent, with shorter deletions and insertions (<1 Kb) being more evenly balanced.
Since the reciprocal read mapping should identify effectively the same events, this could suggest limitations to the automated pipeline for curating larger rearrangements, but this would be exacerbated by the lower overall average read length of the CN115125 ONT reads (10.9 Kb vs. 20.4 Kb) and some variation in the genomic regions assembled (Supplementary Table 4).
A Brassica B genome specific repeat library with 1324 families was developed using multiple annotation tools and was used to survey the repetitive genome fraction of the long-read (Ni100-LR, C2-LR) and short-read B. nigra assemblies (Ni100-SR and YZ12151-SR) (Supplementary Table 15). Repeats spanned 49% and 54% of the CN115125 and Ni100-LR genome assemblies, respectively, compared to 33% (YZ12151) and 41% (Ni100) in the two short read assemblies.
The increase in repeat content of the long read assemblies predominantly resulted from a rise in annotated Class I transposons, in particular Gypsy and Copia elements, which increased by 27.8% and 40.5%, respectively in the Ni100 assembly (Supplementary Table 15). The distribution of repeats revealed class I transposons were more common in traditionally heterochromatic regions such as centromeric, pericentromeric and sub-telomeric regions, while class II DNA transposons were more evenly distributed across the genome (Figure 1; Supplementary Figure 4). The identification of centromere and telomere specific repeats suggested the ONT assemblies provided more complete access to the chromosome structure (Supplementary Figure 9). The repeat fraction appears to reflect the estimated genome sizes of the diploid Brassicas with B. nigra lying between B. oleracea (~60%) 21 and B. rapa (~38%) 19,21 .
Almost all families were similarly distributed in the two long-read assemblies apart from LTR-Gypsy elements, which were ~5% higher in Ni100, suggesting either Ni100 specific amplification or assembly of these elements (Supplementary Table 15). Full-length long terminal repeat retrotransposons (FL-LTR-RTs) were annotated and compared in Ni100-SR and the two long-read assemblies. A total of 1220, 4491 and 3381 FL-LTR-RTs were identified in Ni100-SR, Ni100-LR and C2-LR assemblies, respectively, with an average length of ~6 Kb (Supplementary Table 16).
ONT technology allows direct identification of base modifications such as 5-methylcytosine (5-mC) 33, although this had yet to be shown in plant genomes. Nanopolish was used to detect 5-mC in the CG context in the Ni100 ONT unassembled reads, which provided higher coverage and quality compared to the CN115125 reads. The resultant calls were compared with methylation status detected using whole-genome bisulfite sequence data and showed excellent correlation.
Of note, there were islands of reduced methylation observed for each chromosome that were also associated with lower gene and higher repeat density regions. Centromeric regions have been associated in Brassica species, and more specifically in B. nigra, with particular sequences including CRB (centromeric retrotransposon of Brassica) and a B-genome specific short repeat fragment (pBN35) 34,35. The distribution of these centromere-associated repeats aligned with the detected hypomethylated regions. Furthermore, members of the more prevalent ALE family, which also has >70% homology with CRB, localized to the same region (Figure 5A). More recently, sequences identified through interaction with the centromere-specific histone CENH3 have been reported for B. nigra; these co-aligned with the hypomethylated regions, suggesting capture of much of the active centromere 36.
Although the analysis of nested LTRs has generally been limited to cereal genomes, they would be expected to play a significant role in the evolution of chromosome structure and specifically of repeat dense regions such as centromeres. ALE LTRs were prominent in the centromeric regions and showed high levels of nested insertion (Figure 6). Overall 262 nested TE events were found throughout the Ni100-LR genome of which 68% (179)
Genomic differences or similarities among species, as well as the mechanisms by which genomes evolve, can be identified by comparing the order in which genes or syntenic blocks appear in both close and distant relatives 39. The changes to block orders are defined in terms of certain rearrangement operations, both within a chromosome and between chromosomes, such as reversal, transposition, fusion, fission and translocation. These types of operations, which occur often throughout the evolution of a species, can be abstracted computationally as a series that results in a change to the linear ordering of genes. One approach to measuring the degree of dissimilarity between species is to find the series with the fewest possible operations, the most parsimonious evolutionary process, that transforms the genome from one "version" to another; the length of this series is the "genomic distance". In the Double-Cut-and-Join (DCJ) model 40, the two genomes being compared are represented as a "breakpoint" graph, allowing the DCJ distance between the genomes to be calculated. The pairwise DCJ distances between the three Brassica diploid genomes were thus calculated to be: $d_{rapa,nigra} = 96$; $d_{nigra,oleracea} = 98$; $d_{rapa,oleracea} = 52$. In addition to measuring genomic difference or similarity, the order of blocks in extant genomes provides rich information that can be used in reconstruction of ancestral gene orders. The median problem for genome rearrangements is the simplest instance of a problem of reconstructing the gene order of an ancestral species.
Given gene orders in a set of genomes G and a distance measure d, the median problem finds a genome m that minimizes the sum of distances $d_\Sigma = \sum_{g \in G} d(m, g)$. The ASMedian-Linear algorithm 41 Table 19).
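To make the DCJ distance concrete, the sketch below computes it for small toy genomes using the widely used closed form d = N - (C + I/2), where N is the number of blocks, C the number of cycles and I the number of odd paths in the graph linking block extremities through the adjacencies of both genomes. This is an illustration under stated assumptions, not the pipeline used in this study: it assumes linear chromosomes, that both genomes contain exactly the same blocks once each, and the genome encoding (lists of signed integers) is hypothetical.

```python
def _adjacencies(genome):
    """Map each block extremity to its neighbouring extremity.

    A genome is a list of linear chromosomes; a chromosome is a list of
    signed integers. Block g has a tail (g, 't') and a head (g, 'h').
    """
    partner = {}
    for chrom in genome:
        ends = []
        for g in chrom:
            left, right = (abs(g), "t"), (abs(g), "h")
            if g < 0:  # reversed block: head comes first
                left, right = right, left
            ends.append((left, right))
        for (_, right), (left, _) in zip(ends, ends[1:]):
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance via d = N - (C + I/2) for genomes with equal content."""
    blocks_a = sorted(abs(g) for c in genome_a for g in c)
    blocks_b = sorted(abs(g) for c in genome_b for g in c)
    if blocks_a != blocks_b or len(set(blocks_a)) != len(blocks_a):
        raise ValueError("genomes must contain the same blocks exactly once")
    adj_a, adj_b = _adjacencies(genome_a), _adjacencies(genome_b)
    seen = set()
    cycles = odd_paths = 0
    for start in [(g, end) for g in blocks_a for end in "th"]:
        if start in seen:
            continue
        # Collect the connected component that contains `start`.
        stack, component = [start], []
        seen.add(start)
        while stack:
            x = stack.pop()
            component.append(x)
            for adj in (adj_a, adj_b):
                y = adj.get(x)
                if y is not None and y not in seen:
                    seen.add(y)
                    stack.append(y)
        if all(x in adj_a and x in adj_b for x in component):
            cycles += 1  # no telomere in the component: a cycle
        elif len(component) % 2 == 1:
            odd_paths += 1  # path with one telomere from each genome
    return len(blocks_a) - (cycles + odd_paths // 2)


# Toy example: a single inversion of block 2 costs one DCJ operation.
print(dcj_distance([[1, 2, 3]], [[1, -2, 3]]))
```

A translocation between two chromosomes, or a fusion of two chromosomes into one, likewise evaluates to a distance of one under this model.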

Discussion
Recent advancements and cost reductions in long-read sequencing technologies are facilitating the generation of high-quality genome assemblies even for species that have evolved through recursive whole genome duplication events (WGD) 42. High quality and highly contiguous assemblies were generated for two genotypes of the mesopolyploid B. nigra using nanopore
It is well established that long-read sequence data provide a more comprehensive coverage of the genome 46, perhaps most obviously reflected in the increased capture of low complexity repeat sequences. Repeat analysis revealed about 14% more repeats in the long read assembly of Ni100 compared to the short read assembly (54% vs 41.2%) and in particular a more complete assembly of the repeat rich centromeric and pericentromeric space. Centromeres are essential structures for the maintenance of karyotype integrity during meiosis, ensuring fertility of developed gametes through strict inheritance of full chromosome complements; yet centromeres still remain underexplored, especially in larger genomes. Although the active centromere is incredibly diverse in size and sequence among species, it is characterized through its cohesion with the centromere-specific histone H3-like protein, CENH3, and it has been suggested that association with CENH3 is controlled through epigenetic means, including a decrease in CG methylation 34. Direct CG methylation profiling utilizing the ONT data not only demonstrated the efficacy of this approach (93-97% correlation with WGBS) but also demarcated the active centromere in the assembly, with hypomethylated regions being co-located with known and novel centromeric repeat sequences. At least three of the chromosomes for Ni100 (B1, B3 and B8) showed multiple hypomethylated islands within or adjacent to the putative centromere region, which also coincided with centromere-specific repeats (Supplementary Figure 18). It was noted for B. rapa that such repeats found outside the presumed centromeric region may be evidence of ancient paleo-centromeres, remnants of the WGD events 19. However, all additional sites coincided with hypomethylation, suggesting functionality of the regions. This could imply potential scaffolding errors remaining in the dense repeat regions, although interestingly, even though the data were more limited, the same pattern was apparent for the CN115125 genotype, which could suggest a dispersed structure for the active centromeric region 47. Where comparison was feasible the two genotypes showed a common dichotomy of centromeric regions, with conservation of gene content but rapid divergence in sequence constitution driven by changes in retrotransposon composition.
Recent work in B. nigra to uncover centromere-specific sequences through their association with CENH3 indicated that, unlike its diploid relatives and almost all analysed plant genomes, B. nigra contains no tandemly repeated satellite DNA 36. Similarly, no characteristic tandem repeat was found in the long-read assemblies; however, analyses of assembled full length LTRs revealed recently amplified (<1 Mya) elements, in particular ALE-LTRs in the Ni100-LR genome (Figure 6). Rapid amplification of the young LTRs in a nested insertion fashion was observed in all of the Ni100 centromeric regions (Supplementary Table 17). Nested TE insertion is a prevalent phenomenon among monocots, but has only been identified infrequently among dicots 48. The detected recent nested insertion events involving a single family suggest that ALE or related LTRs might play an important role in rapid divergence of centromeres in B. nigra, similar to that found when comparing the centromeric region of two rice genotypes 49. Further studies to fully establish the role of these elements in centromere function in B. nigra are required, and indeed the long read assembly resources developed for the Ni100 genotype could be leveraged as a model for centromere function research in future.
Finally, the improved assemblies for all three diploid Brassica genomes allowed a median ancestral Brassica genome (n = 9) to be resolved based on 178 syntenic blocks. The calculated DCJ distance between the genomes reflects the age of divergence between the B genome and A/C genome lineages (Figure 7A). While B. rapa and B. oleracea have chromosomes which share extensive homology with ancestral chromosomes, the extent of the rearrangements separating the B genome would explain the limited genic exchange that has been possible across the two lineages. Therefore, capturing novel diversity from the third Brassica genome for crop improvement strategies in its related species may be more efficient using next generation breeding techniques such as CRISPR/Cas9.
The ability to relatively quickly and affordably generate contiguous genome assemblies provides a platform for developing true pan-genomes for many species. Such assemblies will allow an accurate comparison of not only gene content, but repeat composition and distribution and reveal the range and complexity of structural variation. There are still some limitations; however, with the continuing improvements to the technology and the optimisation of software dedicated to analyses of these new data types, resolution of these problems should be swift.

Plant material and DNA extraction
Brassica nigra CN115125 (C2) and Ni100 were grown in a greenhouse at Agriculture and Agri-

Oxford Nanopore sequencing and reads processing
The C2 genome was sequenced on a MinION while the Ni100 genome was sequenced on a  Figure 19).

Nanopore sequence assembly and polishing
Raw ONT fastq data were filtered for quality at q10 and q7 for C2 and Ni100, respectively, and the resulting reads were error corrected using CANU 1.6 with default parameters 22. The C2 filtered data were assembled with three different assemblers (SMARTDenovo, wtdbg, Miniasm). Minimap2 was used to generate overlaps of corrected reads with k-value 24 and other default parameters (-csw5 -L100 -m0), followed by assembly using miniasm 22,55. SMARTDenovo  Table 1). Based on these preliminary analyses the Ni100 genome was assembled using SMARTDenovo with kmer 24 and default parameters. Both draft assemblies were polished using eight iterations of PILON 23 with available Illumina reads.
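The read-level quality filter can be illustrated with a short sketch. The source does not state which tool applied the q10 and q7 cut-offs or how per-read quality was summarised, so the convention used here (averaging in error-probability space, Phred+33 encoding) and all function names are assumptions made for illustration only.

```python
import math


def mean_read_quality(quality_string, phred_offset=33):
    """Mean Phred quality of one read, averaged in error-probability space."""
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [
        10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string
    ]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def filter_reads(records, min_quality):
    """Yield (name, sequence, quality) records whose mean quality passes."""
    for name, seq, qual in records:
        if mean_read_quality(qual) >= min_quality:
            yield name, seq, qual


# Hypothetical reads: '5' encodes Q20 and '#' encodes Q2 in Phred+33.
toy_reads = [
    ("read_high", "ACGTACGTAC", "5" * 10),
    ("read_low", "ACGTACGTAC", "#" * 10),
]
kept = [name for name, _, _ in filter_reads(toy_reads, min_quality=7)]
print(kept)
```

Averaging error probabilities rather than raw Phred scores penalises reads with a few very poor stretches, which is why the two conventions can disagree near a threshold.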

Contig Scaffolding
Leaf tissue of C2 was provided to Dovetail Genomics (Santa Cruz, CA, USA), who prepared and sequenced CHiCAGO™ and Hi-C libraries. The polished assemblies, CHiCAGO™ and Dovetail Hi-C library reads were used as input for scaffolding using Dovetail's HiRise™ pipeline 56. A modified SNAP read mapper was used to align the CHiCAGO™ and Hi-C reads to the draft assembly; HiRise™ then produces a likelihood model for the genomic distance between read pairs, computing the optimum threshold to join contigs and to identify putative misjoins.
A genetic map derived from genotyping-by-sequencing data of a backcross population of 72 B. nigra lines derived from the cross Ni100/doubled-haploid line A1//Ni100 was used to anchor contigs from all assemblies to the pseudomolecules. In total, 20,689, 19,666 and 21,034 loci were anchored to the genome assemblies of C2, Ni100-SR and Ni100-LR, respectively. The assembly was confirmed using Genome-Ordered Graphical Genotypes (GOGGs) 57 Table 7.

Assembly quality assessment
Quality of the assembly was estimated using the single-copy orthologous gene analysis
The gene naming convention proposed for B. rapa V3 19 was used with minor modifications: Bni (for Brassica nigra) followed by the chromosome number with leading zero, and a letter "g" (for gene), e.g. B01g (for B genome chromosome 1). Six-digit gene numbers were assigned in steps of 10 with leading zeros from top to bottom of chromosomes. Following the gene number and separated by a period, a suffix distinguishes genome versions and genotypes: "2N" was assigned to Ni100 LR (genome version 2), and "1C2" to C2 (genome version 1); for example, BniB01g023500.2N. Low confidence genes were defined as those models with neither transcriptome evidence support nor significant hits to the UniProt Plant database. The low confidence genes were named similarly as described above but with a letter "p" to distinguish them.
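The naming convention can be expressed as a small formatter and parser. This is a sketch only: the function names are hypothetical, and it assumes that the letter "p" replaces the "g" in low confidence gene names, which the text does not state explicitly.

```python
import re

# Pattern for IDs such as BniB01g023500.2N; "p" in place of "g" is an
# assumption about how low confidence genes are marked.
GENE_ID = re.compile(r"^Bni(B\d{2})([gp])(\d{6})\.(2N|1C2)$")


def format_gene_id(chromosome, ordinal, version_tag, low_confidence=False):
    """Build an ID for the `ordinal`-th gene (numbers advance in steps of 10)."""
    kind = "p" if low_confidence else "g"
    return f"BniB{chromosome:02d}{kind}{ordinal * 10:06d}.{version_tag}"


def parse_gene_id(gene_id):
    """Split a gene ID into its components, or raise ValueError."""
    match = GENE_ID.match(gene_id)
    if match is None:
        raise ValueError(f"not a recognised gene ID: {gene_id!r}")
    chrom, kind, number, tag = match.groups()
    return {
        "chromosome": chrom,
        "low_confidence": kind == "p",
        "number": int(number),
        "version_tag": tag,
    }


print(format_gene_id(1, 2350, "2N"))
print(parse_gene_id("BniB01g023500.2N"))
```

Numbering in steps of 10 leaves room to insert newly annotated genes between existing models without renumbering a chromosome.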

Repeat annotation
A de novo repeat library was developed using RepeatModeler (Version 1.0.11; http://www.repeatmasker.org/RepeatModeler/), which employs two de novo repeat finding programs (RECON and RepeatScout) for identification of repeat families. After removing potential false positives based on homology with A. thaliana gene models, a total of 374 repeat models were retained. In addition, a previously developed repeat library for B. nigra Ni100, which contains 950 repetitive elements, was merged to develop a final repeat library with 1324 families that was used for repeat annotation in the whole genome. RepeatMasker was employed to estimate the repeat copies, proportion and distribution in the genome 63,64,65,66.
Centromeric location was identified based on the distribution of centromere-associated repeats such as the centromere-specific retrotransposon of Brassica (CRB) and the B. nigra specific centromere-associated repeat (pBN35 - X16588.1) 35,67. Both the CRB and the B. nigra specific centromere-associated repeat are expected to be associated with the centromere and adjacent regions. Based on the distribution of these two elements using BLAST, centromere regions were located in the assembly.
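One simple way to turn a distribution of repeat hits into a candidate centromere position is to bin hit coordinates into fixed windows and report the densest window. The sketch below illustrates that idea only; the window size, the function name and the toy coordinates are assumptions, and the source does not describe how hit density was summarised.

```python
from collections import Counter


def densest_window(hit_positions, window_size):
    """Return (window_start, hit_count) for the fixed window with most hits."""
    if not hit_positions:
        raise ValueError("no hits supplied")
    counts = Counter(
        (pos // window_size) * window_size for pos in hit_positions
    )
    # Ties are broken in favour of the leftmost window.
    start, count = min(counts.items(), key=lambda item: (-item[1], item[0]))
    return start, count


# Hypothetical BLAST hit start coordinates (bp) on one chromosome.
toy_hits = [100, 5_200_000, 5_250_000, 5_900_000, 9_000_000]
print(densest_window(toy_hits, window_size=1_000_000))
```

In practice the densest window would be treated as a starting point to be checked against other evidence, such as the hypomethylated islands discussed in the Results.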
Full length long terminal repeat retrotransposons (FL-LTR-RTs) were identified from both genome assemblies using LTR_harvest 68  were manually analyzed to identify nested TE insertion.

Gene and genome evolution
OrthoFinder v2.2.7 29 was used to identify members of gene families and assess expansion of gene families in C2 and Ni100-LR, by clustering the annotated genes with those of closely related species, B. rapa 19, B. oleracea 16
Synteny analysis was performed to identify syntenic genes between B. nigra Ni100-LR/C2 and A. thaliana using the A. thaliana proteome (Araport10) as described previously 16. Briefly, syntenic gene pairs between B. nigra and A. thaliana, selected on the basis of best BLASTP values (E-value 1e-20 or better), were used in DAGChainer with default parameters to compute the chain score 72. Manual curation based on the better chain score was done to create the final syntelog table (Supplementary Table 21). Tandemly duplicated genes and proximal genes were identified following the previously reported approach 73. Briefly, potential homologous pairs between each of three genomes were identified by all-versus-all BLASTP with E < 1e-10. Then MCScanX (default parameters) was used to identify duplicated pairs from Ni100 and C2 that formed intraspecies syntenic chains. These pairs were set aside and classified as WGD-derived gene pairs.
For ancestral genome reconstruction, given gene orders in a set of genomes G and a distance measure d, the median problem is to find a genome m that minimizes the sum of distances $d_\Sigma = \sum_{g \in G} d(m, g)$. The median problem is known to be NP-hard under the DCJ distance 78. Given three genomes and $d_{i,j}$ as the pairwise DCJ distance between genome i and genome j, the metric $d_\Sigma$ has the following property, the lower bound is  83.
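Generic bounds on the median score follow from the triangle inequality alone, and the sketch below computes them from pairwise distances. This is offered as an illustration under assumptions, not as the expression used by the authors: for three genomes, summing d(a, b) <= d(a, m) + d(m, b) over all pairs gives a lower bound of half the sum of pairwise distances, and choosing one of the input genomes as the median gives an upper bound. The function name and input encoding are hypothetical; the distances are the pairwise DCJ values reported in the Results.

```python
from itertools import combinations


def median_score_bounds(pairwise):
    """Bounds on the median score from pairwise distances.

    pairwise maps frozenset({genome_a, genome_b}) to their distance.
    Lower bound: triangle inequality summed over all pairs.
    Upper bound: best score obtained by using an input genome as median.
    """
    genomes = sorted(set().union(*pairwise))
    total = sum(pairwise[frozenset(p)] for p in combinations(genomes, 2))
    lower = total / (len(genomes) - 1)
    upper = min(
        sum(pairwise[frozenset((g, h))] for h in genomes if h != g)
        for g in genomes
    )
    return lower, upper


# Pairwise DCJ distances reported for the three diploid genomes.
brassica = {
    frozenset(("rapa", "nigra")): 96,
    frozenset(("nigra", "oleracea")): 98,
    frozenset(("rapa", "oleracea")): 52,
}
print(median_score_bounds(brassica))
```

With the reported distances the median score therefore lies between 123 and 148 under this reasoning, the upper value corresponding to B. rapa taken as the median.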

Structural variants analysis
SVs such as insertions, deletions, inversions, duplications and translocations were identified using both Ni100-LR and C2-LR assemblies. Raw long reads of both genomes were mapped with the NGMLR long-read aligner using Ni100-LR as a reference, and SVs were called using Sniffles with a minimum read depth of 20 31. Likewise, SVs were predicted using C2-LR as the reference assembly. Furthermore, SVs identified by Sniffles were cross-validated using another SV caller, Picky, with the same read depth of 20 32. SVs shared by both callers were identified as high-quality SVs and used for further analysis.
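The selection of SVs shared by both callers can be sketched as an intersection of two call sets. The matching rule used here (same chromosome and SV type, with both breakpoints within a fixed tolerance) and the 500 bp tolerance are assumptions for illustration; the source does not specify how calls from Sniffles and Picky were matched, and the tuples below are hypothetical rather than the callers' native output formats.

```python
def consensus_svs(calls_a, calls_b, max_offset=500):
    """Return calls from calls_a that are supported by a call in calls_b.

    Each call is a tuple (chromosome, start, end, sv_type). Two calls
    match when chromosome and type agree and both breakpoints lie within
    max_offset bp of each other.
    """
    by_key = {}
    for chrom, start, end, sv_type in calls_b:
        by_key.setdefault((chrom, sv_type), []).append((start, end))
    shared = []
    for chrom, start, end, sv_type in calls_a:
        for other_start, other_end in by_key.get((chrom, sv_type), []):
            if (abs(start - other_start) <= max_offset
                    and abs(end - other_end) <= max_offset):
                shared.append((chrom, start, end, sv_type))
                break
    return shared


# Hypothetical calls from two callers.
caller_one = [("B1", 10_000, 12_000, "DEL"), ("B2", 50_000, 50_300, "INS")]
caller_two = [("B1", 10_100, 12_050, "DEL"), ("B2", 80_000, 80_300, "INS")]
print(consensus_svs(caller_one, caller_two))
```

A stricter tolerance reduces false positives at the cost of discarding true events whose breakpoints are placed differently by the two callers, which is consistent with the observation that some deletions were overlooked.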

Whole genome Bisulfite sequencing (WGBS)
Genomic DNA was isolated from leaf tissue of B. nigra Ni100 with two biological replications
Quality filtered WGBS reads were used to analyze cytosine methylation ratios following alignment using BSMAP (v2.9) (Supplementary Table 21) 84. Lambda DNA was included in each library as a control to estimate bisulfite conversion efficiency. In all instances the conversion rate was estimated to exceed 99%. The methylation status of each cytosine surveyed was assigned using the binomial probability distribution.
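A binomial test of this kind asks whether the number of reads reporting a methylated cytosine at a site exceeds what incomplete bisulfite conversion alone would produce. The sketch below is a minimal illustration under assumptions: the non-conversion rate of 0.01 is taken as an upper bound implied by the >99% conversion estimate, the significance threshold is hypothetical, and no multiple-testing correction is applied.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def is_methylated(methylated_reads, total_reads,
                  non_conversion_rate=0.01, alpha=0.01):
    """Call a cytosine methylated if the observed count of unconverted
    reads is unlikely under non-conversion alone (one-sided test)."""
    if total_reads == 0:
        return False
    p_value = binomial_tail(methylated_reads, total_reads, non_conversion_rate)
    return p_value < alpha


# Hypothetical site-level read counts.
print(is_methylated(8, 10))   # many unconverted reads
print(is_methylated(1, 20))   # a single unconverted read
```

Sites with low coverage rarely reach significance under this test, so a minimum depth filter is usually applied alongside it.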
Methylation patterns were determined and summarized using the support from available genome annotation. Methylation patterns were partitioned by context (CG, CHH, CHG), reflecting the underlying biochemistry underpinning their maintenance. Statistical analyses and data organization were performed using custom Perl and R scripts with the support from Datatable, dplyr, stringr, genomation and MethylKit. All graphical summaries were developed using ggplot2.

CpG context in nanopore reads by Nanopolish
Since the raw nanopore signal differs between methylated and unmethylated cytosine bases, Nanopolish was used to detect methylation in the CpG context across the whole genome of Ni100 33. Nanopolish 0.10.1 was used to call methylated and unmethylated bases from raw nanopore reads, and results were filtered as described based on either log-likelihood ratio or read depth.
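Filtering on log-likelihood ratio and read depth can be sketched as follows. The sketch assumes per-read calls reduced to (chromosome, position, log-likelihood ratio) tuples, with a positive ratio supporting methylation; the threshold of 2.0 and the minimum depth are illustrative assumptions and are not values reported in this work.

```python
from collections import defaultdict


def methylation_frequency(calls, llr_threshold=2.0, min_depth=1):
    """Per-site methylation frequency from per-read calls.

    calls: iterable of (chromosome, position, log_lik_ratio).
    Calls with |log_lik_ratio| below the threshold are treated as
    ambiguous and skipped; sites with fewer than min_depth confident
    calls are dropped.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [methylated, total]
    for chrom, pos, llr in calls:
        if abs(llr) < llr_threshold:
            continue
        site = counts[(chrom, pos)]
        site[1] += 1
        if llr > 0:
            site[0] += 1
    return {
        site: meth / total
        for site, (meth, total) in counts.items()
        if total >= min_depth
    }


# Hypothetical per-read calls at two CpG sites.
toy_calls = [
    ("B1", 100, 5.0),
    ("B1", 100, -3.0),
    ("B1", 100, 0.5),   # ambiguous, ignored
    ("B1", 200, -4.0),
]
print(methylation_frequency(toy_calls))
```

Raising the minimum depth trades genome coverage for more stable per-site frequencies, which matters when comparing against bisulfite-derived ratios.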
(B) Alignment of the short (NS) and long-read (NL) assemblies for chromosome B5 of Ni100.