Abstract
Background Structural Variations (SVs) are very diverse genomic rearrangements. In the past, their detection was restricted to cytological approaches, then limited by NGS read size and partitioned assemblies. Owing to the current capabilities of technologies such as long-read sequencing and optical mapping, the detection of larger SVs is becoming increasingly accessible.
This study compares SV detection and characterization from long-read sequencing obtained with the MinION device developed by Oxford Nanopore Technologies (ONT) and from optical mapping produced by the Saphyr device commercialized by Bionano Genomics. The genomes of the two Arabidopsis thaliana ecotypes Columbia-0 (Col-0) and Landsberg erecta 1 (Ler-1) were chosen to guide the use of one or the other technology.
Results We described the SVs detected from the alignment of the best ONT assembly and DLE-1 optical maps of A. thaliana Ler-1 on the public reference Col-0 TAIR10.1. After filtering, 1 184 and 591 Ler-1 SVs were retained from ONT and Bionano technologies respectively. A total of 948 Ler-1 ONT SVs (80.1%) corresponded to 563 Bionano SVs (95.3%), leading to 563 locations common to both technologies. The specific locations were scrutinized to assess improvement in SV detection by either technology. The ONT SVs were mostly detected near TE and gene features, and resistance genes seemed particularly impacted.
Conclusions Structural variations linked to ONT sequencing errors were removed and false positives limited, while high-quality Bionano SVs were retained. When compared with the Col-0 TAIR10.1 reference, most of the detected SVs were found at the same locations by both technologies. The ONT assembly sequence yields more specific SVs than the Bionano maps, the latter being more efficient at characterizing large SVs. Although the two technologies are clearly complementary, ONT data appear better suited to large-scale population studies, while Bionano performs better in improving assemblies and describing the specificity of a genome compared to a reference.
Background
Structural variations (SVs) are defined as genomic variations involving segments of DNA from 50 bp to several megabases. SVs consist of unbalanced rearrangements such as copy number variations (CNV) including insertions/deletions (Indels) and presence/absence variations (PAV), and balanced events like inversions and translocations [1,2,3,4]. Several mechanisms explain the formation of SVs, such as recombination errors generated by non-homologous end-joining and non-allelic homologous recombination, genome duplication and transposition [1,2]. Structural variations in humans have been extensively studied and, recently, Ho et al. reviewed the impact of SVs in human diseases [4]. In plants, it has been shown that structural variations play a key role in the evolution of genomes and are responsible for phenotypic variations by impacting TEs and genes [3,5,6,7,8]. In particular, SVs were found in stress-related and resistance genes [9,10,11,12,13], and were found to be related to local adaptation [14,15] or linked to other traits of agronomical interest such as tomato fruit flavor, rice grain size or poplar wood formation [16,17,18].
Nowadays, the identification of SVs contributes to the construction of the pan-reference genome or super-pangenome [19,20]. This new approach to building a reference will better reflect the genetic diversity of the species and, at the same time, deepen the understanding of genome evolution and enhance knowledge of adaptive traits [21,22,23,24,25].
The development of new sequencing technologies has boosted studies of the SVs found in a genome, which until recently were detected only by CGH or SNP arrays [26,27,28]. Short-read sequencing technologies have made possible the identification of SVs in several species [29,30,31,32,33,34,35,36]. However, read size is a limiting factor for the detection of large SVs and of SVs in highly repetitive regions. Third-generation sequencing offers new opportunities to identify SVs at a larger scale with two kinds of methods: the first are based on linked short reads, as in the 10x Genomics and Hi-C approaches [37]; the second directly generate long reads, as proposed by Pacific Biosciences [38] and Oxford Nanopore Technologies (ONT) [39,40]. These approaches provide access to complex regions and are increasingly used to produce genome assemblies and to detect structural variations in human [4,41,42,43], in Arabidopsis thaliana [24,44,45] and in other plants [46,47]. In parallel, a technology based on physical maps, developed by Bionano Genomics [48], generates information from very large DNA molecules. These maps, named optical maps, are frequently generated to improve and validate sequence assemblies, and to detect SVs in human genomes [43,49,50,51,52] and more recently in plants [7,46,47,53]. These third-generation data have made possible the identification of genetic rearrangements between individuals at the intraspecific level [53].
Herein, we obtained draft assemblies using Oxford Nanopore technology and Bionano Genomics optical maps, in order to compare the detection and characterization of structural variations by both methods. Although comparisons between two sequencing technologies or SV detection software are no longer uncharted territory [24,43,44,54], the comparison of two fundamentally different technologies like ONT and Bionano has only been performed in animals (chimpanzee [52] and Drosophila [55]), and not yet in plants. A. thaliana is a model organism with a small genome (130 Mb). For this study, we selected Columbia-0 (Col-0) and Landsberg erecta 1 (Ler-1), two of the most studied ecotypes.
Results
ONT sequencing and genome assembly
The ONT sequence data of Arabidopsis thaliana ecotypes, Columbia (Col-0) and Landsberg erecta 1 (Ler-1), were cleaned using the correction and trimming steps of Canu assembler [56]. A total of 9.8 Gb (N50=12.7 kb, 75X coverage) and 6.1 Gb (N50=16.5 kb, 47X coverage) were obtained for Col-0 and Ler-1, respectively (Additional file 1: Tables S1 and S2).
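The coverage figures quoted above can be reproduced with a short calculation. This is only a sketch: it assumes that fold coverage was computed as total cleaned bases divided by the theoretical 130 Mb genome size; the function name is illustrative and not taken from the study.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return the sequencing depth as total bases divided by genome size."""
    return total_bases / genome_size


# Assumption: the theoretical 130 Mb genome size is the denominator.
GENOME_SIZE_BP = 130e6

col0_coverage = fold_coverage(9.8e9, GENOME_SIZE_BP)  # 9.8 Gb of cleaned Col-0 reads
ler1_coverage = fold_coverage(6.1e9, GENOME_SIZE_BP)  # 6.1 Gb of cleaned Ler-1 reads

print(f"Col-0: {col0_coverage:.1f}X, Ler-1: {ler1_coverage:.1f}X")
```

Rounded to the nearest integer, these values agree with the 75X and 47X reported for Col-0 and Ler-1.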
To estimate ONT data completeness, the cleaned Ler-1 ONT reads were aligned against the Ler-1 reference sequence with Minimap2 [57]. A total of 98.9% of the Ler-1 reference sequence was covered by ONT reads. These Ler-1 ONT data were also mapped against the Col-0 TAIR10.1 genome that was 95.2% covered (Additional file 1: Table S3). Samtools depth tool was then used on the Ler-1 ONT reads mapping against the Col-0 TAIR10.1 reference to estimate the coverage at each position. The average coverage of 100 kb windows was 46.9X, with depth fluctuations in centromeric regions (Fig. 1).
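As an illustration of the per-window coverage summary, the sketch below averages per-position depths over fixed windows. It assumes tab-separated input with chromosome, 1-based position and depth columns, as written by `samtools depth`, and that every position is reported (for example with the `-a` option); otherwise positions without coverage would be left out of the mean. The function name and the toy input are hypothetical.

```python
from collections import defaultdict


def windowed_mean_depth(lines, window=100_000):
    """Average per-position depth over fixed-size windows.

    `lines` is an iterable of 'chrom<TAB>pos<TAB>depth' strings (1-based pos).
    Returns {(chrom, window_index): mean depth of the reported positions}.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom, pos, depth = line.rstrip("\n").split("\t")[:3]
        key = (chrom, (int(pos) - 1) // window)
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Toy example with a window of 2 bp instead of 100 kb.
toy = ["Chr1\t1\t40", "Chr1\t2\t50", "Chr1\t3\t60"]
toy_means = windowed_mean_depth(toy, window=2)
print(toy_means)
```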
To identify the best assembler for our data, de novo assemblies for Col-0 and Ler-1 were performed with Canu, RA and SMARTdenovo (SDN). Based on general statistics (assembly size, contig number, N50 size), SMARTdenovo generated better assemblies for both ecotypes than Canu or RA (Additional file 1: Tables S4 and S5). Indeed, the SDN assemblies resulted in 79 contigs for Col-0 (cumulative size = 117 Mb, N50 = 12.5 Mb with 5 contigs) and 101 contigs for Ler-1 (cumulative size = 117 Mb, N50 = 10.7 Mb with 5 contigs). In addition, chimeric contigs were observed with Canu, while assemblies were more fragmented using RA (Additional file 2: Figures S1A-C and S2A-C). For all assemblers, centromeric regions were covered by many small contigs. These results were also supported by the alignments of the Col-0 and Ler-1 assemblies on the respective reference chromosomes Col-0 TAIR10.1 [58] and Ler [44]. The SDN assemblies, named ONT Evry.Col-0 and Evry.Ler-1, were used to carry out subsequent SV analyses.
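For readers less familiar with the N50 statistic used to rank the assemblies, a minimal sketch of its computation follows. The contig lengths are toy values, not data from this study, and the function name is illustrative.

```python
def n50(lengths):
    """Return (N50, number of contigs needed to reach half the assembly size).

    Contigs are sorted from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of the total size.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for rank, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, rank
    raise ValueError("no contig lengths given")


# Toy contig lengths (arbitrary units).
toy_n50, toy_rank = n50([50, 40, 30, 20, 10])
print(toy_n50, toy_rank)
```

The second returned value corresponds to the contig count reported alongside the N50 in the text (for example "N50 = 12.5 Mb with 5 contigs").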
Optical maps generation
Genomic DNA was labelled with the DLE-1 enzyme and stained according to the manufacturer’s protocol. One run per ecotype was performed on the Saphyr device, yielding 577.5 Gb and 610.9 Gb of molecules for Col-0 and Ler-1 respectively. Molecules larger than 150 kb were selected, leading to about 600-fold final coverage based on the theoretical 130 Mb Arabidopsis genome size (Additional file 1: Tables S6 and S7). A total of 17 and 14 optical maps with N50 of 14.6 Mb and 14.7 Mb were generated for Col-0 and Ler-1 respectively, giving an assembled genome size of 125 Mb for both ecotypes (Additional file 1: Tables S8 and S9).
The average label density of the Ler-1 optical maps was estimated at 18.47 per 100 kb (Additional file 1: Table S7). However, this DLE-1 density decreases in the centromeric regions due to molecule depth diminution and optical map breaks (Additional file 2: Figures S3A-E, Fig. 1).
Structural Variations detection
Structural variations were detected independently from the ONT and Bionano data, in two ways: Ler-1 versus the Col-0 TAIR10.1 reference and Col-0 versus the Ler reference. The different types of structural variations detected in our study are described in Additional file 2: Figure S4. Because general SV characteristics (number, types and location) are similar in both types of analyses, only SV detection results from the Evry.Ler-1 assembly and optical maps against the Col-0 TAIR10.1 reference are presented in detail. Descriptions of the SVs detected by comparing the Col-0 SDN assembly and optical maps with the Ler reference are provided in Additional file 1: Tables S10-S14 and Additional file 2: Figures S5A-E.
The comparison of Evry.Ler-1 assembly to Col-0 TAIR10.1 reference using MUMmer show-diff utility [59] revealed 2 186 potential SVs (Table 1).
A total of 119 SVs, called reference sequence junction (SEQ), break (BRK) and jump (JMP), found in centromeric and telomeric regions and near rDNA clusters, were considered to correspond to unresolved regions in the Evry.Ler-1 assembly compared to the Col-0 TAIR10.1 reference and were filtered out.
To avoid false positive SV detection due to the high ONT sequencing error rate of 7.5% [60], a filter on the size of the ONT structural variations in the query (> 1 kb) was applied. Out of the 2 186 SVs initially detected, 1 184 SVs remained (54.2% of total SVs), corresponding to 591 insertions (INS), 581 deletions (DEL) and 12 inversions (INV). No duplication was detected (Table 2).
A 5 Mb insertion in the Evry.Ler-1 assembly was detected on Chr3 of the Col-0 TAIR10.1 reference (14 272 986..14 284 724) due to a detection error of MUMmer in a complex region associated with an rDNA cluster. This insertion was therefore removed from the final data and not considered in the results. The ONT structural variations had a median size of 3 455 bp and a cumulated size of 7.7 Mb. The SVs were equally distributed in size and number between INS and DEL. The INV category had larger median and average sizes than INS and DEL. With a cumulated size of 0.3 Mb, INV represented 3.9% of the ONT variation size (Table 2). Structural variations were detected on all chromosomes, with a preferential location on chromosome arms and with no confident SV on the Chr1, 3 and 4 centromeres (Fig. 1).
Bionano data analysis (optical map construction and SV detection) was carried out on the Bionano Solve interface. Structural variations were highlighted by comparing the optical maps to the in silico DLE-1 labelling of the Col-0 TAIR10.1 reference genome. A total of 797 SVs were identified during the analysis of Ler-1 optical maps versus the Col-0 TAIR10.1 reference (Table 1). When Bionano Solve tools detected one SV embedded in a second one, the largest SV was kept. This case was found at two independent locations on Chr1 (INS:19 432 310..19 468 513 and DEL:24 688 666..24 736 849). A 1 kb size filter was applied to the Bionano SVs, which was equivalent to removing deletions and insertions with a Bionano quality score < 10 (defined as poor quality by the manufacturer) (Additional file 1: Table S15). Additionally, on Chr2, the INV SV (3 433 371..3 490 731) with no quality score was discarded. Thus, 591 SVs representing 74.2% of total Bionano SVs were further considered in this analysis. INS and DEL types constituted the main part of the Bionano SVs (48.9% and 49.9% of the SVs respectively), the remaining 1.2% corresponding to translocations (TRA) and INV (Table 2). The median SV size was 4 383 bp and the cumulated SV size represented 7.2 Mb of the genome. The TRA and INV types corresponded to nearly one third (2.0 Mb) of the cumulated size of the structural variations. In our study, translocations were only detected using the Bionano assembly. The two Ler-1 TRA were located on Chr2 (3 378 844..3 397 121; 3 484 209..3 844 839) (Additional file 2: Figure S6). The largest SV identified was a 1.1 Mb Ler-1 INV located on Chr4 of the Col-0 TAIR10.1 reference (1 435 832..2 593 360) (Additional file 2: Figure S7). SVs were distributed preferentially along the chromosome arms and their detection was limited in centromeric regions due to decreased labelling in these regions (Fig. 1).
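The size filter and the retained proportions can be sketched as follows. The record layout (a dictionary with `type` and `size` keys) is a hypothetical simplification of the MUMmer output, and the toy records are illustrative; only the counts in `retained_counts` come from the text.

```python
MIN_SV_SIZE_BP = 1_000  # threshold stated in the text (> 1 kb)


def filter_by_size(svs, min_size=MIN_SV_SIZE_BP):
    """Keep only SVs whose size in the query is strictly above min_size."""
    return [sv for sv in svs if sv["size"] > min_size]


# Toy records in a hypothetical layout, not data from the study.
toy_svs = [
    {"type": "INS", "size": 350},
    {"type": "DEL", "size": 1_000},
    {"type": "DEL", "size": 4_200},
    {"type": "INV", "size": 25_000},
]
toy_kept = filter_by_size(toy_svs)

# Consistency check on the counts reported in the text.
retained_counts = {"INS": 591, "DEL": 581, "INV": 12}
total_retained = sum(retained_counts.values())
retained_pct = round(100 * total_retained / 2186, 1)
print(len(toy_kept), total_retained, retained_pct)
```

Note that the 2 186 to 1 184 reduction also includes the removal of the SEQ, BRK and JMP calls described above, so the size filter alone does not account for the whole difference.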
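The rule of keeping the largest SV when one call is embedded in another can be sketched as below. This is an illustrative reading of the rule, not the Bionano Solve implementation; the record layout is hypothetical, and the inner coordinates in the example are invented for the demonstration (only the outer Chr1 insertion coordinates come from the text).

```python
def drop_embedded(svs):
    """Remove any SV fully contained in a strictly larger SV on the same chromosome."""
    kept = []
    for sv in svs:
        embedded = any(
            other is not sv
            and other["chrom"] == sv["chrom"]
            and other["start"] <= sv["start"]
            and sv["end"] <= other["end"]
            and (other["end"] - other["start"]) > (sv["end"] - sv["start"])
            for other in svs
        )
        if not embedded:
            kept.append(sv)
    return kept


toy_calls = [
    {"chrom": "Chr1", "start": 19_432_310, "end": 19_468_513, "type": "INS"},
    # Hypothetical inner call, invented to illustrate the rule.
    {"chrom": "Chr1", "start": 19_440_000, "end": 19_445_000, "type": "DEL"},
    {"chrom": "Chr2", "start": 100_000, "end": 105_000, "type": "DEL"},
]
toy_result = drop_embedded(toy_calls)
print([(sv["chrom"], sv["type"]) for sv in toy_result])

# Proportion of Bionano SVs retained after all filters, from the counts in the text.
bionano_retained_pct = round(100 * 591 / 797, 1)
```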
SVs comparison
The comparison of SVs was based on their absolute start and end positions on the Col-0 TAIR10.1 reference sequence. We considered structural variations comparable between the two technologies when their locations overlapped by at least 1 bp. SVs identified by ONT and Bionano technologies were then assigned a two-letter svID code, with the first letter used for ONT SVs and the second for Bionano SVs, leading to common (svID UU and MU) and specific (svID UN and NU) locations (with “U” for “Unique SV”, “M” for “Multiple SVs” and “N” for “None SV”; Additional file 1: Tables S16 and S17).
SV comparison metrics are presented in Table 3. A total of 563 common locations were identified, representing 948 (80.1%) of the ONT SVs and 563 (95.3%) of the Bionano SVs. The cumulated sizes of these common SVs are 5.9 Mb and 6.9 Mb for ONT and Bionano detection respectively. ONT SVs tend to be smaller than Bionano SVs (Table 3, Additional file 1: Tables S16 and S17).
Among the 563 common regions, 410 (72.8% of the common regions) coincided with svID UU, i.e. one ONT structural variation corresponding to one Bionano SV. In most cases, the overlap of these SVs was at least 50.0% of the ONT SV size, and 405 (98.8% of the svID UU) SVs were of the same type in both technologies (Additional file 1: Table S16). The remaining five svID UU (1.2%) were identified as deletions by ONT and insertions by Bionano technologies (svID UU_035, UU_038, UU_057, UU_073, UU_358; Additional file 1: Tables S16 and S17).
In the remaining 153 (27.2%) common locations, a total of 531 ONT SVs (56.8% of common ONT SVs) related to the Bionano SVs (27.0% of common Bionano SVs) were pinpointed (Table 4). These structural variations had a svID MU. The cumulative size of this SV category is approximately 4 Mb for both technologies, although the number of ONT variants is 3.5 times higher than for Bionano (531 vs 153). Nevertheless, the Bionano median and average sizes are 2- and 4-fold larger respectively.
The largest ONT SV was a complex SV (svID MU_102; MU meaning that several ONT SVs match one Bionano SV) consisting of four contiguous deletions located on Chr4. These four deletions coincided with one Bionano deletion (Additional file 1: Tables S16 and S17). The largest Bionano SV (svID MU_097) was an inversion on Chr4 of 1 143 224 bp overlapping 22 ONT SVs (corresponding to INS and DEL) (Additional file 1: Table S17).
Specific locations were more abundant with the ONT technology (236 SVs - 19.9%) than with Bionano (28 SVs - 4.7%), leading to cumulated sizes of 1.8 Mb and 0.3 Mb respectively, and with a median size about twice as large (2 656 bp for ONT SVs vs 1 374 bp for Bionano SVs). The distribution of the specific ONT SVs along the Col-0 TAIR10.1 chromosomes showed a clear tendency to cluster at the NOR and centromeres (Fig. 1). The largest specific ONT variant is located on Chr3 and corresponds to a DEL (svID UN_124; SV detected with ONT only, Additional file 1: Table S16). The largest specific Bionano SV is located on Chr3 and corresponds to an INV (svID NU_017; SV detected with Bionano only, Additional file 1: Table S17). A focus on the TRA revealed an 18.2 kb specific Ler-1 Bionano SV (svID NU_007), close to the second TRA of 360 kb positioned around 3.6 Mb (MU_153). This TRA coincided with seven SV events (1 INV, 5 INS and 1 DEL) in the Ler-1 SDN assembly (Additional file 1: Tables S16 and S17).
Using the Araport11 annotation of the Col-0 reference (The Arabidopsis Information Resource - TAIR), a comparison using only ONT SVs is shown in Table 5. Since the Bionano events represent a large-scale observation, they were not taken into account in this analysis. A total of 893 (75.4%) out of 1 184 ONT SVs overlapped TE features, of which 579 also overlapped genes. Only 291 (24.6%) SVs are located outside a TE feature, overlapping genes [125 (10.6%)] or not [166 (14.0%)] (Table 5). Focusing on ONT-specific SVs (svID=UN), their overlap with the Col-0 reference annotation showed percentages similar to those of the common SVs.
To better characterize the genes affected by ONT SVs in common locations, a GO-term overrepresentation test was performed with the PANTHER tool [61] available on the TAIR website (https://www.arabidopsis.org/tools/go_term_enrichment.jsp). Among the 1 764 genes identified in common locations, 47.2% (832 genes) were uniquely assigned to a GO term and used in PANTHER (Additional file 1: Tables S18 and S19). Overrepresentations in defense response and ADP-binding terms were detected (Additional file 1: Table S20), but no GO-term enrichment was found for genes in specific ONT locations (Additional file 1: Tables S21-S23).
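The overlap rule and the svID categories can be illustrated with the following sketch. It is an assumed reading of the scheme, not the authors' script: intervals are treated as closed `(chromosome, start, end)` tuples on the reference, each Bionano SV is labelled according to the number of ONT SVs it overlaps, and the case of one ONT SV overlapping several Bionano SVs, which the text does not describe, is not given a separate category.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) closed intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def classify(ont_svs, bionano_svs):
    """Group SVs into UU, MU (common) and UN, NU (specific) locations."""
    labels = {"UU": [], "MU": [], "UN": [], "NU": []}
    matched_ont = set()
    for bio in bionano_svs:
        hits = [i for i, ont in enumerate(ont_svs) if overlaps(ont, bio)]
        matched_ont.update(hits)
        if not hits:
            labels["NU"].append(bio)  # detected by Bionano only
        elif len(hits) == 1:
            labels["UU"].append((ont_svs[hits[0]], bio))  # one ONT SV, one Bionano SV
        else:
            labels["MU"].append(([ont_svs[i] for i in hits], bio))  # several ONT SVs
    # ONT SVs that overlap no Bionano SV are ONT-specific.
    labels["UN"] = [ont for i, ont in enumerate(ont_svs) if i not in matched_ont]
    return labels


# Toy coordinates, not data from the study.
toy_ont = [("Chr1", 100, 200), ("Chr1", 1000, 1100), ("Chr1", 1150, 1300), ("Chr2", 50, 80)]
toy_bionano = [("Chr1", 150, 400), ("Chr1", 900, 1200), ("Chr3", 1, 10)]
toy_labels = classify(toy_ont, toy_bionano)
print({code: len(items) for code, items in toy_labels.items()})
```

With this reading, the number of common locations equals the number of Bionano SVs labelled UU or MU, which is consistent with the 563 common locations corresponding to 563 Bionano SVs.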
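The breakdown by annotation category can be checked with a few lines. The counts come from the text; the split of TE-overlapping SVs into "TE and gene" and "TE only" is derived by subtraction, and the helper function and category names are illustrative.

```python
def annotation_category(overlaps_te: bool, overlaps_gene: bool) -> str:
    """Assign an SV to one of four annotation categories."""
    if overlaps_te and overlaps_gene:
        return "TE and gene"
    if overlaps_te:
        return "TE only"
    if overlaps_gene:
        return "gene only"
    return "neither"


# Counts reported in the text; "TE only" is derived as 893 - 579.
category_counts = {
    "TE and gene": 579,
    "TE only": 893 - 579,
    "gene only": 125,
    "neither": 166,
}
total_svs = sum(category_counts.values())
te_pct = round(100 * 893 / total_svs, 1)
non_te_pct = round(100 * (125 + 166) / total_svs, 1)
print(total_svs, te_pct, non_te_pct)
```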
Discussion
Herein, we compare the performance of Oxford Nanopore and Bionano Genomics technologies for structural variation detection. For this, we performed long read sequencing and optical mapping of two A. thaliana ecotypes, namely Columbia-0 (Col-0) and Landsberg erecta 1 (Ler-1). Long read de novo assemblies were constructed using three different assemblers and optical maps were assembled with Bionano Solve tools. Structural variations detected using the Col-0 TAIR10.1 [58] and Ler [44] genomic sequences as references, were described and compared to each other, to reveal the relative strengths of the two technologies in highlighting SVs.
Assemblies based on ONT and Bionano data for SV analyses
To obtain the best assembly based only on long-read data, we used three different assemblers. After comparison of assembly metrics, calculation time and collinearity against reference genomes, SDN provided the best assembly, even if some collinearity breaks were observed, especially in centromeric regions. The metrics of the Evry.Col-0 and Evry.Ler-1 SDN assemblies were comparable to those of similar assemblies in previous studies [24,44,45,62].
Continuous improvement in protocols and new developments in genome assembly strategies and algorithms have resulted in increasingly high-quality genomic sequences for subsequent analyses. The previously published Bionano A. thaliana optical map (KBS-Mac-74) genome [45] used a BspQI staining protocol for labelling, generating about 10 times more maps to cover the entire genome of KBS-Mac-74 than in our study (DLE-1 Bionano staining protocol), highlighting the improvement in Bionano’s protocol. In addition, no optical map was previously available for Columbia (Col-0) and Landsberg erecta 1 (Ler-1), making our map assemblies especially valuable for further studies.
Our high-quality maps allowed us to define centromeric and nucleolar organizer regions (NOR), despite lower molecule density and a loss of label concordance between the Ler-1 maps and the Col-0 TAIR10.1 in silico reference maps. Moreover, fluctuations in ONT coverage density and the accumulation of repetitive alignments in the same regions provide additional evidence for the approximate locations of the centromeres and NOR. However, we identified several misassemblies in the course of our SV analyses between the ONT SDN Ler-1 assembly and the Col-0 TAIR10.1 reference, highlighting how difficult it can be to obtain a reliable assembly, and thus to detect SVs, in these complex regions.
SV detection and comparison between the two technologies
Herein, we compared structural variations in Evry.Ler-1 and the reference genome Col-0 TAIR10.1. We chose this reference because of its high quality and the richness of the associated studies [24,44,45].
The cumulated SV sizes obtained for ONT and Bionano in our study are smaller than in previous studies [24,44]. Filtering on SV size (SVs > 1 kb vs no size filter) could explain this difference. In addition, the lack of duplications detected in the ONT assembly could depend on MUMmer’s ability to detect this type of SV, reflecting the complexity of detecting duplication events, as mentioned in Goel et al. (2019). In contrast, the absence of duplications detected by Bionano could be explained by polymorphic duplications between the Ler-1 maps and the Col-0 TAIR10.1 reference, which would break the collinearity, as described in Jiao et al. (2020), and by the size of duplications (< 5 kb, [62]) identified as the limit of Bionano detection. Analyses by the two technologies revealed a predominance of insertions, deletions and inversions, with larger median and average sizes for Bionano SVs. The distribution of these types of SV is homogeneous along the chromosome arms. Even if most of the specific ONT SVs are located in the centromeric and pericentromeric regions, the decreased coverage of SVs in these regions is probably due to technical problems such as assembly errors (for ONT SMARTdenovo). This decrease in SV coverage is also observed with Bionano technology, which shows a lower labelling density in these complex regions. This contrasts with previous results identifying more SVs in regions where the meiotic recombination rate decreases [24]. The filtering of ONT SVs smaller than 1 kb could again explain this contradiction. On the other hand, Bionano Solve tools correctly identified the translocation previously characterized on Chr2 and three inversions larger than 50 kb present on Chr3, Chr4 and Chr5 [24,35,44]. For example, compared to the Col-0 TAIR10.1 reference, the Ler-1 maps support a 360 kb translocation of mitochondrial sequence on Chr2 around the 3.6 Mb Col-0 TAIR10.1 position (svID MU_153). This observation is concordant with Stupar et al. (2001), who first described the mtDNA insertion in the Col-0 reference. In this Chr2 region (3.29 Mbp to 3.48 Mbp), Pucker et al. (2019) identified a second 300 kb highly divergent region between A. thaliana Nd-1 and the Col-0 reference. In the same study, Pucker et al. also described the lack of the entire region between 3.29 Mbp and 3.48 Mbp in the Ler genome, corresponding to the specific translocation of 18.2 kb detected in the Ler-1 map (svID NU_007). Zooming in on this Col-0 TAIR10.1 Chr2 region (3.2 Mb to 3.5 Mb) in the Ler-1 SDN assembly, many small contigs are observed, with a missing sequence of 110 kb. This observation explains the absence of SV detection, confirming the great complexity of this region and the sequence divergence between the Ler-1 and Col-0 genomes described by Pucker et al. (2019). Even if the Col-0 reference sequence has been improved since 2000 [58], our assembly (Evry.Col-0) confirms its value for re-evaluating the assembly of complex regions and provides new high-quality optical map data.
The number, type and location of SVs in the largest common ONT (svID MU_102) and Bionano (svID MU_097) SVs, as well as the Chr2 ONT SVs matching the second Bionano translocation (svID MU_153), reflect that the structural variations brought out by ONT were more numerous and smaller, which allows an identification at finer scale. In contrast, Bionano variants were larger and their sizes depend on restriction sites distribution.
To globally estimate the consistency of the SV analyses between the ONT SDN or Bionano Ler-1 assemblies and the Col-0 TAIR10.1 reference, we compared the structural variations we identified to those of Zapata et al. (2016) (mapping and SV detection tools and parameters being the same). Although the local variations cannot be directly compared due to genome sequence accuracy (complete genome vs whole genome sequencing) and the SV filtering differences (no size filter vs > 1 kb), the majority of events are shared by both studies.
Comparing locations of the Ler-1 ONT SVs with Araport11 annotations, we found that common and specific ONT SVs were preferentially linked to TE features and genes, as reported in Jiao et al (2020). Looking at the GO-term enrichment in genes overlapping common ONT SVs, an overrepresentation in defense response and ADP-binding terms corresponding to resistance genes was observed. This result is concordant with previous studies [13,24,44,63,64,65] in which an association between structural variations and the cluster organisation of resistance genes was described.
General conclusion
Because analyses of SVs and their consequences rely heavily on the quality of their identification and of the underlying assembly/mapping data, we aimed to compare the performance of ONT and Bionano technologies for structural variation detection. By applying stringent filters to the ONT assembly mapping approach and size filters to the SVs, we have shown that this methodology is an easy and efficient way to detect reliable SVs. Most of the detected SVs were also identified with Bionano optical maps, with high concordance despite different characteristics (average and median sizes). Nevertheless, long-read sequencing technologies make it possible to detect SVs more accurately, while Bionano offers a broad overview of structural rearrangements. In addition, whole-genome SV analysis is currently mostly limited to model organisms. However, because neither Oxford Nanopore long-read assemblies nor Bionano Genomics map assemblies require previous knowledge of the genomic architecture or sequence of the studied taxa, this approach expands the range of plant species or species complexes in which in-depth SV analyses can be performed.
Thereby, ONT appears to be especially suitable for SV studies in population or species complex, and Bionano more relevant for characterization of genome specificity and genome evolution, leading to an obvious complementarity of these two technologies in SVs analyses.
Methods
Plants
Arabidopsis thaliana Columbia-0 (accession number 186AV) and Landsberg erecta-1 (accession number 213AV) seeds were obtained from the Versailles Arabidopsis Stock Center, INRAE. They were sown directly in soil and transplanted after 10 days. Plantlets were grown under a 16h light/8 h night photoperiod in a growth chamber at 20°C for 4-5 weeks. Prior to harvest, the plants were dark-treated for 3 days.
Oxford Nanopore Sequencing (MinION) HMW DNA extraction
High Molecular Weight (HMW) DNA extraction was performed using a modified salting-out protocol. A total of 5g of freshly harvested leaves was ground in liquid nitrogen with a mortar and pestle and transferred to 10ml of 50°C prewarmed extraction buffer in a 50ml tube containing 1.25% SDS, 100mM Tris-HCl, pH 8, 50mM EDTA, 0.01% w/v PVP40. Then 37.5μl of beta-mercaptoethanol (0.375% final) and 10μl RNAse A (Qiagen® 100mg/mL) were added. This solution was incubated for 30 min at 50°C, under agitation (10 sec at 300rpm every 10 min). After incubation, 20ml TE (10:1) were added and slowly homogenized, followed by 10ml of KAc 5M. The tube was kept on ice for 5 minutes, then centrifuged at 4°C for 10 min at 5000g. The solution was transferred to two 15ml tubes and centrifuged again as previously. The supernatant was transferred to a 50ml tube containing 1 volume of isopropanol, slowly inverted 10 times, then centrifuged at 4°C for 10 min at 500g. Pellets were washed with 20ml ethanol 70% then centrifuged at 4°C for 5 min at 500g. The supernatant was removed and pellets were not completely dried before solubilization in 100μl of TE (10:1) prewarmed at 50°C. The DNA solution was then incubated at 50°C for 10 min. Field Inverted Gel Electrophoresis (Program 50-150 kb on Pippin Pulse from Sage Science) was used for DNA size estimation and DNA samples with molecule size above 50 kb were kept. Purity of DNA was evaluated by spectrophotometry (OD260/280 and OD260/230 ratio).
Bionano Optical Maps ultra HMW DNA extraction
We performed the DNA extraction using the Base protocol n°30068 vD (Bionano Genomics) with minor adaptations. Three grammes of very young fresh leaves from each genotype were harvested from the dark-treated rosettes. The samples were placed on aluminium foil on ice then transferred to a 50ml tube closed with a screened cap allowing pouring without loss of sample (Bio-Rad). The tubes were kept on ice during the nuclear isolation. Samples were treated in fixing solution containing 2% formaldehyde under a fume hood then rinsed with fixing solution without formaldehyde. Fixed leaves were transferred to a square Petri dish with 4ml of Plant Homogenization Buffer plus (HB+ is HB supplemented with 1mM spermine tetrahydrochloride, 1mM spermidine trihydrochloride, and 0.2% 2-mercaptoethanol). Entire leaves were chopped with a razor blade into 2×2mm pieces then transferred to a new tube on ice, and 7.5ml HB+ was added. Using a TissueRuptor (Qiagen), the 2×2mm pieces were blended for a total of four cycles (20 sec at maximum speed then resting 30 sec). Plant homogenates were filtered, first through a 100μm then through a 40μm cell strainer, and volumes were adjusted to 45ml. Nuclei were centrifuged at 3840g at 4°C for 20 min and supernatants were discarded. Nuclei were gently re-suspended in residual buffer, 3ml of HB+ were added, then tubes were swirled on ice and the volumes were adjusted to 35ml. Homogenates were centrifuged at 60g at 4°C for 3 min using minimum deceleration. Solutions were very carefully transferred to a new tube in order to avoid carry-over of debris, and filtered again through a 40μm cell strainer.
Nuclei were centrifuged at 3840g at 4°C for 20 min, 3ml of HB+ were added and tubes were swirled on ice. Using Bionano Nuclei Purification by Density Gradient, the nuclei homogenate was laid on top of two solutions with different densities. After a 4500g centrifugation at 4°C for 40 min, the nuclei were at the interface of the two solutions. They were recovered with a wide-bore tip in about 1ml of solution, transferred to a 15ml tube and adjusted to 14ml with HB+. Nuclei were centrifuged at 2500g at 4°C for 15 min. All the buffer was removed and nuclei were re-suspended in 60μl HB+.
The nuclei solution was adjusted to 43°C for 3 min and melted 2% agarose from CHEF Genomic DNA Plug Kits (Bio-Rad) was added to reach a 0.82% agarose plug concentration. Plugs were cooled on aluminum blocks refrigerated on ice. Purification of the plugs was performed with Bionano Lysis Buffer adjusted to pH 9 and supplemented with proteinase K and 0.4% 2-mercaptoethanol. Plugs were digested for 2h at 50°C in a Thermomixer, then the solution was refreshed and incubated again overnight. Plugs were treated with RNAse for 1h at 37°C in the remaining solution. Plugs were washed three times in Wash Buffer (Bionano Genomics) then four times in TE 10:1. DNA retrieval was performed as recommended by Bionano Genomics, as follows: plugs were melted at 70°C for 2 min then transferred immediately to 43°C and incubated for 45 min at 43°C with 2μl Agarase (0.5 unit/μl). The melted plugs were recovered with wide-bore tips and dialyzed on a 0.1μm membrane disk (Millipore) floating on 10ml TE for 1h. DNA was quantified in triplicate with Qubit according to the Bionano protocol. Two methods were used to estimate the size of DNA molecules: Pippin Pulse and the Qcard Argus System (Opgen), which allows DNA combing on a lane and visualization of molecules after staining under a fluorescence microscope. Samples with molecules above 150 kb were kept for labeling. Protocols were performed according to Bionano Genomics with 600ng of DNA for both Col-0 and Ler-1 ecotypes. The direct label and stain (DLS) labeling consisted of a single enzymatic labelling reaction with the DLE-1 enzyme followed by DNA staining with a fluorescent marker. It was performed with 750ng DNA. Chip loading was performed as recommended by Bionano Genomics.
ONT Sequencing (MinION) and assembly
ONT libraries were prepared according to the following protocol, using the Oxford Nanopore SQK-LSK109 kit. Genomic DNA, or DNA previously fragmented to 50 kb with a Megaruptor (Diagenode S.A., Liege, Belgium), was first size-selected using a BluePippin (Sage Science, Beverly, MA, USA). The selected DNA fragments were end-repaired and 3’-adenylated with the NEBNext® Ultra™ II End Repair/dA-Tailing Module (New England Biolabs, Ipswich, MA, USA). The DNA was then purified with AMPure XP beads (Beckman Coulter, Brea, CA, USA) and ligated with sequencing adapters provided by Oxford Nanopore Technologies (Oxford Nanopore Technologies Ltd, Oxford, UK) using Blunt/TA Ligase Master Mix (NEB). After purification with AMPure XP beads, the library was mixed with Running Buffer with Fuel Mix (ONT) and Library Loading Beads (ONT) and loaded on 4 MinION R9.4 SpotON Flow Cells per Arabidopsis thaliana ecotype. The resulting FAST5 files were base-called using albacore (versions 2.1.10 and 2.3.1) and FASTA files were produced as described in Istace et al. (2017). Canu version 1.5 (github commit ae9eecc) was used for initial read correction and trimming with the parameters minMemory=100G and corOutCoverage=10000. The corrected sequences were merged into one final FASTA file per ecotype, which was later used as input for the assemblers.
Assemblies were performed with the relevant genome size parameter set to, or coverage calculation based on, a 130 Mb genome size. Assemblers used with default parameters were Canu version 1.5 ([56], github commit 69b5f32), Rapid Assembler (RA, https://github.com/lbcb-sci/ra commit 07364a1) and SMARTdenovo version 1.0 (with the option -c 1 to run the consensus step) (https://github.com/ruanjue/smartdenovo commit 61cf13d). The MUMmer suite version 3.0 [59] was run with the parameters used in Zapata et al. 2016. To analyze the assemblies, they were aligned to the reference genome of Arabidopsis thaliana Columbia 0 (Col-0, TAIR10.1 GCF_000001735.4) and to the sequence of Arabidopsis thaliana Landsberg erecta (Ler, Genbank LUHQ00000000.1) using nucmer with the options -c 100 -b 500 -l 50 -g 100 -L 50. The alignments were filtered with delta-filter (options -1 -l 10000 -i 0.95) and visualized with mummerplot (options --fat --large --layout --png) or with the DNAnexus dot viewer (github commit 78e3317). These MUMmer parameters [44] retained exact matches larger than 50bp and alignments longer than 10 kb with a minimal identity of 95%. To check the completeness and fragmentation of the assemblies, they were compared to each other based on their metrics (number of contigs, N50, cumulative genome size) and on the genome alignments to the references generated with MUMmer and viewed with the DNAnexus dot viewer (https://dnanexus.github.io/dot/).
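The contiguity metrics used to compare the assemblies can be illustrated with a minimal Python sketch. The function name and the toy contig lengths are hypothetical, and N50 is computed with its usual definition (the length of the contig at which the cumulative size of the contigs, sorted by decreasing length, reaches half of the total assembly size). The tool actually used to compute these metrics is not stated here, so this is only an illustration.

```python
def assembly_metrics(contig_lengths):
    """Return (number of contigs, cumulative size, N50) for contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50 is the first contig length at which half of the assembly is reached
        if 2 * running >= total:
            n50 = length
            break
    return len(lengths), total, n50


# Hypothetical contig lengths, for illustration only
toy_contigs = [4_000_000, 3_000_000, 2_000_000, 1_000_000]
print(assembly_metrics(toy_contigs))  # (4, 10000000, 3000000)
```

In practice the lengths would be read from the assembly FASTA file of each assembler, so that the three assemblies can be compared on the same basis.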
To evaluate the completeness of our ONT data, the corrected ONT reads were mapped on the Col-0 TAIR10.1 reference with the Minimap2/2.15 aligner [57] using the -a -x map-ont parameters. The Samtools/1.6 depth tool with the -a option [66] gave the alignment depth at each position of the Col-0 TAIR10.1 reference.
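As an illustration of how covered and uncovered reference bases can be counted from such a depth file, the following Python sketch parses the three tab-separated columns written by samtools depth (sequence name, 1-based position, depth). The function name and the toy lines are hypothetical, and the scripts used in this study may have proceeded differently.

```python
from collections import defaultdict


def coverage_summary(depth_lines):
    """Count covered and uncovered reference positions per sequence.

    Each line is expected to hold three tab-separated fields:
    sequence name, 1-based position and depth.
    """
    summary = defaultdict(lambda: {"covered": 0, "uncovered": 0})
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, _position, depth = line.split("\t")[:3]
        key = "covered" if int(depth) > 0 else "uncovered"
        summary[name][key] += 1
    return {name: dict(counts) for name, counts in summary.items()}


# Toy depth lines, for illustration only
toy_depth = ["Chr1\t1\t0", "Chr1\t2\t12", "Chr1\t3\t15", "Chr2\t1\t0"]
print(coverage_summary(toy_depth))
# {'Chr1': {'covered': 2, 'uncovered': 1}, 'Chr2': {'covered': 0, 'uncovered': 1}}
```

The function accepts any iterable of lines, so an open file handle on the depth output can be passed directly without loading the whole file in memory.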
Bionano Optical Map assembly
Because it can benefit the assembly steps, molecules were sub-sampled when a flowcell yielded more than 90 Gb and 600X of data. This selection of molecules was made for each run with the Bionano RefAligner tool on the command line (version 1.3.8041.8044 with the options -minlen 180 -randomize 1 -subset 1 nb_molec) or with Bionano Access (version Solve3.3 with the Filter Molecule Object utility) (Additional file 1: Tables S6 and S7).
Maps were then constructed with the Generate de novo Assembly tool of Bionano Solve (version 3.3) using the options recommended by Bionano (With pre-assembly, Non haplotype without extend and split) and a 0.115 Gb genome size. The pre-assembly step calculates noise parameters that optimize the quality of the assembly (fewer and larger maps). When a reference FASTA file is provided, the noise parameters are calculated by aligning the molecules to the reference.
Otherwise, the noise parameters are estimated from a first rough assembly of the molecules. For the Col-0 and Ler-1 ecotypes, three maps were obtained: one without reference, one with the Col-0 reference and one with the Ler reference (Additional file 1: Tables S8 and S9). In our study, the metrics of these assemblies were very similar. This stability shows that the noise parameters estimated either from the reference FASTA sequences or from our data were comparable, which supports the quality of the Bionano data and assemblies.
ONT variation detection
Structural variations were obtained with MUMmer’s show-diff utility on the filtered alignments of the SMARTdenovo assemblies against the Col-0 and Ler references. One DIFF file was obtained per comparison. The six SV types (Gap, Duplication, Break, Jump, Inversion, Sequence) are described in Additional file 2: Figure S4.
Bionano variation detection
SV detection was performed on the optical maps built with the public reference and with our SMARTdenovo ONT assemblies, using the tool Convert SMAP to VCF file. VCF files were recovered, describing all the structural variations between the optical maps and the considered reference. The variations were classified into six types: deletion, insertion, duplication, inverted duplication, translocation and inversion (Additional file 2: Figure S4). The stringency of SV detection is intrinsic, based on the number of aligned molecules (at least nine by default) and the number of labels across each variant breakpoint on the genome map (at least two by default) (Bionano tutorial: https://bionanogenomics.com/support-page/data-analysis-documentation/). The technology gives an interval of uncertainty around the breakpoint positions (CIPOS and CIEND in VCF files). In this study, these values were used to calculate the most extended positions of the Bionano SVs and to avoid the effect of label fluzz.
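The use of CIPOS and CIEND to obtain the most extended positions of a Bionano SV can be sketched as follows in Python. This is a minimal illustration and not the scripts used in this study: the function name and the example record are hypothetical, and it assumes that the INFO field carries END, CIPOS and CIEND with the usual VCF convention of "lower,upper" offsets around POS and END.

```python
def extended_interval(pos, info):
    """Return the widest (start, end) interval of an SV from VCF POS and INFO.

    Assumption: INFO holds END, CIPOS and CIEND, where CIPOS and CIEND are
    "lower,upper" offsets around POS and END respectively.
    """
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    end = int(fields.get("END", pos))
    cipos_low, _cipos_high = (int(v) for v in fields.get("CIPOS", "0,0").split(","))
    _ciend_low, ciend_high = (int(v) for v in fields.get("CIEND", "0,0").split(","))
    start = max(1, pos + cipos_low)  # positions are 1-based in VCF
    return start, end + ciend_high


# Hypothetical record, for illustration only
print(extended_interval(100000, "SVTYPE=DEL;END=105000;CIPOS=-1500,1500;CIEND=-2000,2000"))
# (98500, 107000)
```

Taking the lower bound of CIPOS and the upper bound of CIEND gives the largest interval compatible with the breakpoint uncertainty, which makes the later overlap comparison permissive rather than strict.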
The low number of structural variations between the Col-0 optical maps and the Col-0 TAIR10.1 reference (and likewise between the Ler-1 maps and the Ler reference) reflects the good collinearity between the maps and their references. These SVs indicate the location of conflicts that could be due to mis-assemblies or to intra-ecotype variations. Inter-ecotype detection allowed us to describe the variations between Col-0 and Ler-1.
Quality and length characteristics were used to better describe and filter the SVs. Bionano Solve associates a quality score to each INS and DEL based on sensitivity and the fraction of alternative calls in mix assemblies that were called in the alternative genome assembly [from no quality (.) or poor (0) to confident quality (20)]. We observed that this indicator follows the same trend as the SV size (Additional file 1: Tables S11 and S15). Moreover, the size ranges where SV abundances differ the most between the two technologies are the extremes: the smallest (< 1 kb), where ONT technology detected many more SVs, and the largest (> 5 kb), where Bionano technology detected proportionally more SVs. Therefore, in our comparison analysis, a filter on query SV size (> 1 kb) was applied to remove poor-quality Bionano SVs as well as ONT SVs arising from sequencing errors and high sensitivity. Confidence scores for translocation and inversion breakpoints were computed as p-values, giving true confidence (in Mahalanobis distance) to positive calls. The recommended cutoffs are 0.1 and 0.01 for translocation and inversion breakpoint calls respectively, and they were used to eliminate uncertain inversion on Chr2.
SV description
Custom-made R and Perl scripts were used to edit the outputs of other tools, describe ONT and Bionano SVs (type, size), locate SVs along the chromosomes and filter them. For ONT technology, SVs identified as assembly discordances were briefly described and discarded before comparison. These included sequence (SEQ), break (BRK) and jump (JMP) ONT SVs, because they correspond to assembly or reference artefacts. Finally, size filters (more than 1 kb) were applied to take into account the high sequencing error rate of ONT and the low-quality Bionano SVs. For Bionano SVs, the largest absolute positions of each SV were conserved, taking into account the uncertainty around breakpoints due to the distance between two labels.
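The filtering steps described above were implemented in custom R and Perl scripts that are not reproduced here. As a hedged illustration only, the same logic can be sketched in Python, assuming each SV is a record with hypothetical "type" and "size" fields (size in bp, possibly negative, as for the GAP calls of show-diff).

```python
DISCARDED_ONT_TYPES = {"SEQ", "BRK", "JMP"}  # assembly or reference artefacts
MIN_SIZE_BP = 1000  # SVs must be larger than 1 kb to be kept


def filter_svs(svs, technology):
    """Keep SVs larger than 1 kb and, for ONT, drop SEQ, BRK and JMP calls."""
    kept = []
    for sv in svs:
        if technology == "ONT" and sv["type"] in DISCARDED_ONT_TYPES:
            continue
        # show-diff reports some sizes as negative values, hence abs()
        if abs(sv["size"]) <= MIN_SIZE_BP:
            continue
        kept.append(sv)
    return kept


# Toy records, for illustration only
toy_ont = [
    {"type": "GAP", "size": -2500},
    {"type": "GAP", "size": 300},
    {"type": "JMP", "size": 8000},
    {"type": "INV", "size": 12000},
]
print([sv["type"] for sv in filter_svs(toy_ont, "ONT")])  # ['GAP', 'INV']
```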
SV comparison
The comparison of the SVs obtained with ONT and Bionano technologies was based on the overlap of their absolute positions.
ONT and Bionano SV files were converted to BED format and used to identify overlapping regions with BEDtools (version 2.27.1, github commit cd82ed5, “bedtools intersect -wa -wb -a INPUT1.bed -b INPUT2.bed -loj > OUTPUT.bed”). Raw comparisons were then compared, compiled and formatted in one final output file using custom-made R scripts. For each SV location, this file contained descriptors (SV size, type, quality) for both technologies, information on the type of conflict and a two-letter code. This code characterized the SV location as follows: the first letter corresponds to the ONT SV characterization, the second to the Bionano SV characterization. M (“Multiple”) means more than one SV, U (“Unique”) one SV, N (“None”) no SV. For example, the code “MU” means that this location harbored multiple ONT SVs corresponding to a unique Bionano SV. The visualization of the landscapes and SV occurrences was performed with the Circos/0.69.9 tool (perl/5.16.3) [67].
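The two-letter code can be illustrated with a short Python sketch. It is not the R implementation used in this study: the function names and SV identifiers are hypothetical, and it assumes that a location groups all the SVs connected, directly or transitively, by at least one overlap reported by bedtools intersect.

```python
from collections import defaultdict


def letter(count):
    """Map a number of SVs to N (none), U (unique) or M (multiple)."""
    if count == 0:
        return "N"
    return "U" if count == 1 else "M"


def location_codes(ont_ids, bionano_ids, overlaps):
    """Group overlapping SVs into locations and give each a two-letter code.

    overlaps is an iterable of (ont_id, bionano_id) pairs. Returns a sorted
    list of (ONT ids, Bionano ids, code) tuples, one per location.
    """
    parent = {("ONT", i): ("ONT", i) for i in ont_ids}
    parent.update({("BNG", i): ("BNG", i) for i in bionano_ids})

    def find(node):
        # Follow parent links up to the representative of the group
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for ont_id, bng_id in overlaps:
        parent[find(("ONT", ont_id))] = find(("BNG", bng_id))

    groups = defaultdict(lambda: {"ONT": [], "BNG": []})
    for node in parent:
        groups[find(node)][node[0]].append(node[1])

    results = []
    for members in groups.values():
        code = letter(len(members["ONT"])) + letter(len(members["BNG"]))
        results.append((sorted(members["ONT"]), sorted(members["BNG"]), code))
    return sorted(results)


# Toy identifiers: two ONT SVs overlap one Bionano SV, the others are specific
for location in location_codes(["o1", "o2", "o3"], ["b1", "b2"], [("o1", "b1"), ("o2", "b1")]):
    print(location)
# ([], ['b2'], 'NU')
# (['o1', 'o2'], ['b1'], 'MU')
# (['o3'], [], 'UN')
```

Codes containing N correspond to locations specific to one technology, while the other codes describe common locations.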
SV and annotation
SVs overlapping a gene and/or a TE were identified with bedtools intersect by comparing their absolute positions to the A. thaliana Col-0 annotations (11 July 2019 release, TAIR10_GFF3_genes_transposons.gff). Lists of genes impacted by SVs were extracted for both technologies and a GO-term enrichment analysis was performed using Fisher’s Exact test with a Bonferroni correction in PANTHER (released 20200407 with GO Ontology database DOI: 10.5281/zenodo.3873405 Released 2020-06-01, [61], http://go.pantherdb.org/). Significance was evaluated based on a P-value ≤ 10^-5 and an FDR value ≤ 0.01 [67].
Additional files description
Additional_file_1.xlsx : Additional tables results
Table S1. Metrics of the ONT run flowcells for A. thaliana Columbia (Col-0).
Note : All sizes are in base pairs. ALL_COL is all Col-0 trimmed merged data.
Table S2. Metrics of the ONT run flowcells for A. thaliana Landsberg erecta (Ler-1).
Note : All sizes are in base pairs. ALL_LER is all Ler-1 trimmed merged data.
Table S3. Number of the Col-0 TAIR10.1 and Ler references bases covered and uncovered by ONT reads.
Note : ONT reads are A. thaliana Col-0 and Ler-1 corrected, trimmed and merged ONT sequences (respectively ALL_COL and ALL_LER).
Table S4. A. thaliana Col-0 Assembly Metrics for contigs only, obtained with SMARTdenovo, Canu and RA.
Note : All sizes are in base pairs. Assemblies are obtained with corrected, trimmed and merged Col-0 ONT sequences.
Table S5. A. thaliana Ler-1 Assembly Metrics for contigs only, obtained with SMARTdenovo, Canu and RA.
Note : All sizes are in base pairs. Assemblies are obtained with corrected, trimmed and merged Ler-1 sequences.
Table S6. Metrics of the Bionano run chips for A. thaliana Columbia (Col-0) and Landsberg erecta (Ler-1).
Note : All sizes are in base pairs. Results obtained with DLE-1 labelling.
Table S7. Metrics of the sampled Bionano run chips for A. thaliana Columbia (Col-0) and Landsberg erecta (Ler-1).
Note : All sizes are in base pairs. Results obtained with DLE-1 labelling.
Table S8. Assembly Metrics of A. thaliana Columbia (Col-0) sampled molecules.
Note : The options used were “Pre-assembly”, “Non Haplotype” and “Without Extend and Split”.
Table S9. Assembly Metrics of A. thaliana Landsberg erecta (Ler-1) sampled molecules.
Note : The options used were “Pre-assembly”, “Non Haplotype” and “Without Extend and Split”.
Table S10. Types of Col-0 ONT and Bionano SVs obtained against Ler reference.
Table S11. Size repartition of Col-0 ONT and Bionano insertions, deletions, inversions and translocations obtained against Ler reference.
Table S12. Characteristics of Evry.Col-0 ONT and Bionano SVs, obtained after alignment against Ler reference.
Table S13. Characteristics of compared ONT Col-0 SVs with query size > 1kb.
Table S14. Characteristics of compared Bionano Col-0 SVs with query size > 1kb.
Table S15. Size repartition of Ler-1 ONT and Bionano insertions and deletions obtained against Col-0 TAIR10.1 reference.
Table S16. Characteristics of compared ONT Ler-1 SVs with query size > 1kb.
Table S17. Characteristics of compared Bionano Ler-1 SVs with query size > 1kb.
Table S18. Genes overlapping Ler-1 SV in common locations (query size >1 kb).
Table S19. Gene annotation overlapping Ler-1 SV in common locations (query size >1kb).
Note : PANTHER (released 20200407) was used with the GO Ontology database (DOI: 10.5281/zenodo.3873405, released 2020-06-01) [61], http://go.pantherdb.org/.
Table S20. PANTHER Overrepresentation results on Genes overlapping common Ler-1 SVs (query size >1kb).
Note : The PANTHER version is described in Mi et al. 2019.
Table S21. Genes overlapping specific ONT Ler-1 SVs (query size >1 kb).
Table S22. Gene annotation overlapping specific ONT Ler-1 SVs (query size >1kb).
Note : PANTHER (released 20200407) was used with the GO Ontology database (DOI: 10.5281/zenodo.3873405, released 2020-06-01) [61], http://go.pantherdb.org/.
Table S23. PANTHER Overrepresentation results on Genes overlapping specific ONT Ler-1 SVs (query size >1 kb).
Note : The PANTHER version is described in Mi et al. 2019.
Additional_file_2.pdf : Additional figures results :
Figure S1A-C. Views of Col-0 contigs alignments on Col-0 TAIR10.1 reference (dotted end).
(A) Contigs obtained with SMARTdenovo, (B) with Canu and (C) with RA. Blue, green and orange dots and lines represent unique forward, unique reverse and repetitive alignments respectively.
Figure S2A-C. Views of Ler-1 contigs alignments on Ler reference (dotted end).
(A) Contigs obtained with SMARTdenovo, (B) with Canu and (C) with RA. Blue, green and orange dots and lines represent unique forward, unique reverse and repetitive alignments respectively.
Figure S3A-E. Bionano Access view of Ler-1 cmaps aligned on Col-0 TAIR10.1 reference.
(A) to (E) are alignments on Col-0 TAIR10.1 Chr1 to Chr5. Maps are in green for the Col-0 TAIR10.1 reference and in light blue for the Ler-1 genome, with the molecule depth curve in blue. Consistent DLE-1 enzyme labels between the reference and Ler-1 maps are represented by dark blue bars with grey links between the genome maps. Inconsistent DLE-1 enzyme labels are shown as yellow bars on the two genome maps.
Figure S4. Description of SVs detected by MUMmer show-diff and Bionano Access tools.
Insertions in the query are called GAP with a negative size by MUMmer show-diff and INS by Bionano Access. Deletions in the query are called GAP with a positive size by MUMmer show-diff and DEL by Bionano Access. Inversions in the query are called INV by both MUMmer show-diff and Bionano Access. Duplications in the query are called DUP by both MUMmer show-diff and Bionano Access. Rearrangements of reference sequence in the query are called jump (JMP) by MUMmer show-diff and translocation (TRA) by Bionano Access. Inverted duplications are not described by MUMmer show-diff and are called INVDUP by Bionano Access. Reference sequence junctions between the alignments of two assembly contigs are called SEQ by MUMmer show-diff and are not described by Bionano Access. Query sequence junctions between alignments to two reference chromosomes are called break (BRK) by MUMmer show-diff and are not described by Bionano Access. « − » means no detection with the technology.
Figure S5A-E. Col-0 SVs (>1kb) occurrences.
All comparisons were performed against the Ler reference sequence per 100kb bins, and black rectangles symbolize the Ler centromeric regions. Average mapping coverage for Col-0 ONT reads (red line called COV), average DLE-1 labelling density (green line called DLE), and ONT and Bionano occurrences (red and green bars respectively) are represented for each Ler chromosome in sections A to E, for Chr1 to Chr5 respectively.
Figure S6. Bionano Solve zoom in the Chr2 Ler-1 translocations against Col-0 TAIR10.1 reference.
Maps are in green for the Col-0 TAIR10.1 reference and in light blue for the Ler-1 genome. Consistent DLE-1 enzyme labels between the reference and Ler-1 maps are represented by dark blue bars with grey links between the genome maps. Inconsistent DLE-1 enzyme labels are shown as yellow bars on the two genome maps. The purple bar locates the translocation events on the Ler-1 map. The red box and lines highlight the zoom.
Figure S7. Bionano Solve capture of the Ler-1 Chr4 extra-range size inversion against Col-0 TAIR10.1 reference.
Maps are in green for the Col-0 TAIR10.1 reference and in light blue for the Ler-1 genome. Consistent DLE-1 enzyme labels between the reference and Ler-1 maps are represented by dark blue bars with grey links between the genome maps. Inconsistent DLE-1 enzyme labels are shown as yellow bars on the two genome maps. The red box and lines highlight the zoom.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
The ONT reads files and the Bionano molecules files have been submitted to the European Nucleotide Archive (http://www.ebi.ac.uk) and are publicly available with the accession numbers ERP128342 and ERZ1959921 respectively. Assemblies and optical maps of the Col-0 and Ler-1 genomes are publicly available in separate ENA studies under the accession number PRJEB44316.
Competing interests
The authors declare that they have no competing interests.
Funding
This work was supported by INRAE (Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement) and Genoscope-CEA (Commissariat à l’Energie Atomique et aux Energies Alternatives).
Authors’ contributions
The project was conceived by VB, MCLP and PFR. Plant cultures and the HMW DNA extraction for Oxford Nanopore and Bionano Technologies were carried out by ED and GM, data acquisition by ED, GM, CC, CB and BI, and equipment was provided by PW, MCLP, PFR and VB. Data analyses were performed by AC, BI and CB for assemblies, optical maps and SV detection, and by AC and RG for SV comparisons. PFR and VB contributed to data interpretation with AC and RG. The manuscript was written by AC, RG, PFR and VB with inputs from PW and MCLP. All authors read and approved the final manuscript.
Footnotes
Not applicable.
Acknowledgements
The authors would like to thank the Versailles Arabidopsis Stock Center, INRAE, for providing the Arabidopsis thaliana ecotypes and the Institute of Plant Sciences Paris-Saclay (IPS2), where the Arabidopsis cultures were carried out.
We also greatly thank Damien Hinsinger for proofreading and advice on the manuscript and the INRAE PepiAnnot Group (https://pepi-ibis.inra.fr/annotation-genomes) for helpful discussions.
Footnotes
aurelie.canaguier{at}inrae.fr, romane.guilbaud{at}inrae.fr, erwandenis{at}hotmail.com, gmagdele{at}genoscope.cns.fr, cbelser{at}genoscope.cns.fr, bistace{at}genoscope.cns.fr, cruaud{at}genoscope.cns.fr, marie-christine.le-paslier{at}inrae.fr, pwincker{at}genoscope.cns.fr, vbarbe{at}genoscope.cns.fr
List of abbreviations
- bp: base pairs
- BRK: Break
- CGH: Comparative Genomic Hybridization
- CNV: copy number variations
- Col-0: Arabidopsis thaliana ecotype Columbia-0
- DEL: Deletion
- DLE-1: Direct Label Enzyme – 1
- DLS: Direct Label and Stain
- DNA: Deoxyribonucleic Acid
- DUP: Duplication
- Gb: Gigabases
- Hi-C: High-throughput chromatin conformation Capture
- Indels: insertions/deletions
- INS: Insertion
- INV: Inversion
- JMP: Jump
- kb: kilobases
- Ler-1: Arabidopsis thaliana ecotype Landsberg erecta 1
- LER: Arabidopsis thaliana Ler-1 reference genome published by Zapata et al. 2016
- NA: Not Available
- NGS: Next Generation Sequencing
- ONT: Oxford Nanopore Technologies
- PAV: presence/absence variations
- RA: Rapid Assembler
- SDN: SMARTdenovo
- SEQ: Sequence
- SNP: Single Nucleotide Polymorphism
- SV: Structural Variation
- TAIR10.1: latest version of the Arabidopsis thaliana Col-0 reference genome available at The Arabidopsis Information Resource repository (TAIR)
- TE: Transposable Element
- TRA: Translocation