Keywords
chimerism, MinION, interferon, amplicon, R9.4, signal, nanopore
This article is included in the Nanopore Analysis gateway.
High-throughput DNA sequencing is a rapidly evolving field with new methods and applications introduced almost weekly1. One of the most recent sequencing technologies available on the market is the MinION sequencing device from Oxford Nanopore Technologies (ONT)2. A brief overview of MinION sequencing technology is given in our previous study on mitochondrial genome assembly3.
Instead of exploiting base-pairing as in the sequencing-by-synthesis approach used by Illumina and others, nanopore sequencing uses an electronic sensor to detect DNA via a change in electric current (reviewed in 4). The MinION’s flow cell comprises 2048 wells containing a membrane perforated by nanopores. Ligated with a molecular motor, a single-stranded DNA molecule passes through the pore, altering the recorded current. After the electronic sequencing is carried out, a software basecalling algorithm transforms the current trace into a modelled DNA sequence. The advantages of the MinION are rapid library preparation, portability5,6, long molecule sequencing7, and sequencing of non-model modifications of the DNA strand8. With the recent improvement in the chemistry of the MinION, ONT has overcome the majority of issues associated with low yield and high error rates that have limited the range of its application. The MinION sequencing device has now been successfully applied to sequence genomes of a wide range of sizes, from bacterial and viral genomes9,10 to, more recently, a human genome12, as well as to amplicon sequencing such as bacterial 16S rRNA sequencing11. The MinION has also been used for cDNA sequencing13, for detecting DNA methylation patterns without chemical treatment8,14, and for direct RNA sequencing with detection of modified 16S rRNA nucleotides15.
Using the most recent R9.4 flow cells, we have evaluated the MinION technology for the amplicon sequencing of highly similar genes. Since we have an interest in interferon response during large parasitic infection16, we sequenced the type I Interferon (IFN) family. Type I IFNs are a family of intronless antiviral response genes comprised, in mice, of 14 highly homologous Ifna members, as well as the genes Ifnb, Ifnk and Ifne17. In humans, sequence similarity across the 14 members of the Ifna genes is 70–80%, with a further 35% sequence similarity between Ifna and Ifnb. Type I IFN has both an important role in innate antiviral immunity and in mounting adaptive T helper cell responses16,18. Based on previous observations, we aimed to identify which type I IFN member(s) were responsible for driving the type I IFN signalling in our infection model.
Due to the high homology within the Ifna family, accurately quantifying expression of the different gene members by Sanger sequencing or next-generation sequencing is difficult. We instead employed nanopore sequencing, which allowed us to acquire full-length reads from each individual sequence that was amplified by the PCR reaction. We aimed to determine the relative quantities of the various Ifna family and Ifnb transcripts in mouse ear tissue treated with Nippostrongylus brasiliensis larvae (Nb) using the MinION, thereby enabling both differentiation between the various Ifna genes and the potential to perform quantitative analysis.
Nippostrongylus brasiliensis was originally sourced from Lindsey Dent of the University of Adelaide, South Australia and has been maintained for 22 years by serial passage at the Malaghan Institute. Female Lewis rats were bred and used for maintenance of the N. brasiliensis life cycle when 4 months of age (and weight over 150g), as outlined in Camberis et al.19.
Two 8-week-old C57BL/6J male mice (Jackson Laboratories, approx. 23 g), housed and bred at the Malaghan Institute of Medical Research under specific pathogen-free conditions, respecting the local and New Zealand ethics guidelines, were chosen for the investigation. 300 dead infective N. brasiliensis L3 larvae (Nb) were injected intradermally in each ear of one mouse in 30 µl PBS after anaesthesia with an intraperitoneal injection of 200 µl ketamine/xylazine. Another mouse was similarly anaesthetised and injected intradermally in each ear with 30 µl PBS. The mice were euthanised in a CO2 chamber 3 h post injection and ears (approx. 27–30 mg in weight) were immediately harvested and conserved in RNALater at 4 °C for <1 h. RNA extraction of each whole ear (30 mg) was done in 1 ml of Trizol following the manufacturer’s guidelines (Thermo Fisher). cDNA was synthesised using the High Capacity RNA-to-cDNA kit (Applied Biosystems), according to the manufacturer’s instructions. Only the cDNA from the N. brasiliensis-treated mouse was used for this investigation. Ifna, Ifnb, and Actb amplicons were generated using specific primers: IfnaF (ATGGCTAGRCTCTGTGCTTTCCT) and IfnaR (AGGGCTCTCCAGAYTTCTGCTCTG)20; IfnbF (CTGGCTTCCATCATGAACAA) and IfnbR (GCAACCACCACTCATTCTGA); and ActbF (AGGGAAATCGTGCGTGACAT) and ActbR (ACGCAGCTCAGTAACAGTCC), which were purchased from Integrated DNA Technologies. PCR amplification was performed using the Phusion High-Fidelity PCR Kit (Thermo Scientific) with 25 ng cDNA (see Figure 1). The cycling conditions were as follows: denaturation at 98 °C for 30 seconds; 35 cycles of 98 °C for 10 seconds, 61 °C for 30 seconds, 72 °C for 30 seconds; final extension at 72 °C for 10 minutes. Samples were held at 4 °C until use for PCR clean-up and gel electrophoresis. PCR products were cleaned using the QIAquick PCR Purification Kit (QIAGEN) and verified by gel electrophoresis.
Ifna cDNA was amplified by PCR using primers designed across a highly conserved region of all Ifna coding sequences, which resulted in a mixed PCR product containing all 14 Ifna genes. cDNAs of Ifnb and Actb were amplified separately and used as quantification controls. The three amplicons were then pooled, loaded into a flow cell and sequenced. Among the reads that we obtained, we noticed long chimeric reads comprising two or more sequences from different amplicons. We decided to further examine this phenomenon.
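The two Ifna primers contain the IUPAC degenerate codes R (A or G) and Y (C or T), so each primer stands for more than one concrete oligonucleotide. As a minimal sketch that is not part of the original methods, the following Python expands a degenerate primer into the sequences it represents. The function name `expand_degenerate` and the reduced code table are our own illustrative choices.

```python
from itertools import product

# Subset of the IUPAC nucleotide codes; only R and Y occur in the primers above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",  # purine
    "Y": "CT",  # pyrimidine
}

def expand_degenerate(primer):
    """Return every concrete sequence encoded by a degenerate primer."""
    choices = [IUPAC[base] for base in primer.upper()]
    return ["".join(combo) for combo in product(*choices)]

# Primer sequences copied from the text.
ifna_f = expand_degenerate("ATGGCTAGRCTCTGTGCTTTCCT")
ifna_r = expand_degenerate("AGGGCTCTCCAGAYTTCTGCTCTG")

for seq in ifna_f + ifna_r:
    print(seq)
```

Each Ifna primer contains a single degenerate position, so each expands to two sequences.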
Ethics approval for maintenance of the N. brasiliensis life cycle is overseen and approved by the Victoria University of Wellington Animal Ethics Committee. C57BL/6J mice were originally obtained from The Jackson Laboratory, Bar Harbor, Maine, USA, and maintained at the Biomedical Research Unit of the Malaghan Institute of Medical Research by brother × sister mating. Breeding pairs were refreshed regularly to maintain the genetic integrity of the strain. Mice were maintained in specific pathogen-free conditions, and housed and cared for according to the concepts of “A Culture of Care” of the Ministry of Primary Industries, NZ. All mouse experiments were approved by the Victoria University Animal Ethics Committee (permit number 23907) and carried out according to institutional guidelines.
The ONT Native Barcoding Kit (EXP-NBD002) and 2D Ligation Sequencing Kit (SQK-LSK208) were used to prepare the samples for sequencing, as per the manufacturer’s protocol. Briefly, purified PCR amplicon products were blunt-ended, ligated with barcode sequences, pooled in approximately equimolar amounts, then ligated with flow cell adapters and a hairpin linker. In order to explore the effect of ligation method on the degree of chimerism, two different adapter/hairpin ligation reactions were carried out: one using the standard quick (10-minute) ligation, and the other using an overnight ligation at 4 °C. No additional adapter-free controls were used; it has been our prior experience that sequencing does not proceed in a callable fashion unless adapter sequences are present. The barcoding scheme used in the library preparation is shown in Figure 2. Samples were quantified after barcoding for overnight ligation (2.14 ng/µl, 2.54 ng/µl and 2.56 ng/µl for Ifna, Ifnb, and Actb respectively) and for quick ligation (2.13 ng/µl, 2.68 ng/µl and 2.45 ng/µl for Ifna, Ifnb, and Actb respectively). These samples were normalised and pooled together to give 26.6 ng each in 33.1 µl distilled water for ligation. After adapter ligation, the quick ligation preparation had no detectable nucleic acid, as measured by fluorescence quantitation with the Quantus fluorometer (Promega), while the overnight ligation preparation quantified at 0.239 ng/µl. We decided to pool the samples together anyway, and were pleasantly surprised to discover a substantial proportion of reads from quick-barcoded sequences.
Reads were initially basecalled during the sequencing runs in January 2017 using Metrichor 2D basecalling, from MinKNOW v1.3.25. An initial analysis of called reads demonstrated substantial disagreement between base calls and the raw signal (e.g. hairpin adapter sequences matching multiple times when the signal showed only one present), so reads were re-basecalled in March 2017 using Albacore v0.7.5.
During the initial MinION sequencing run to investigate the expression of Ifna-family members in mice (comparing with Ifnb and Actb transcripts), we encountered issues with 2D basecalling through the Metrichor web service, which seemed to be due to failed alignment of component 1D strands. A BLAST search on some of the longest basecalled 1D reads led to a discovery that some reads had multiple mappings to our target Ifna-family members. Further exploration of the data demonstrated a situation in which both Ifna and Actb sequences were present in the same read (see Figure 3). This was an unexpected result; we had carried out separate PCR reactions for each transcript, so were not expecting reads to appear that mapped to different transcripts. Our conclusion was that chimeric ligation of input DNA was occurring at some stage during the sample preparation process, but all we were able to determine at the time was that this chimerism was happening some time after the PCR, but before the sequencing. The present experiment was designed in light of these prior results to more easily quantify the degree of ligation that was happening.
Despite using a 2D ligation chemistry in the sample preparation, and selecting out hairpin-containing reads using streptavidin beads, the majority of reads could not be called as an aligned 2D sequence: of 329,591 sequenced reads, 299,124 were basecalled by Albacore, and 1005 (0.3%) of these basecalled reads had an aligned 2D sequence (see Supplementary File 1). The reasons behind this basecall failure were not investigated. Any called reads that were not called as 2D were processed further as 1D sequence, i.e. the remaining 298,119 (99.7%) of called reads.
Called 1D reads were mapped to Actb, Ifnb1, an Ifna consensus sequence, additional interferon sequences, the ONT control strand sequence, and known ONT adapter sequences (see Supplementary File 2) using LAST v83321. A total of 261,183 reads (87.6% of called 1D reads) were discovered that mapped to at least one known amplicon and/or barcode sequence.
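The read counts reported so far form a simple funnel from sequenced reads to mappable 1D reads. The short Python sketch below only re-derives the percentages quoted in the text from the stated counts; the variable names and the helper `pct` are ours.

```python
# Read counts as reported in the text.
sequenced = 329_591
basecalled = 299_124
called_2d = 1_005
mapped_1d = 261_183

# Reads that were basecalled but had no aligned 2D sequence.
called_1d = basecalled - called_2d

def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(called_1d)                   # 298119
print(pct(called_2d, basecalled))  # 0.3
print(pct(called_1d, basecalled))  # 99.7
print(pct(mapped_1d, called_1d))   # 87.6
```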
Using a process of elimination, a total of 4563 reads (1.7% of amplicon or barcode-mappable 1D reads) were discovered with basecalled sequences that were definitively chimeric (see Supplementary File 5). These reads mapped at least once to either one of the three amplicon sequences, or at least once to one of the six barcode sequences. These were broken into four categories (with some overlap) based on the observed combinations of barcode and amplicon sequences (see Figure 4):
1. Repeated identical amplicons aligned in the same direction
2. At least two distinct amplicons
3. At least two distinct barcodes
4. Disagreement between barcode and amplicon
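The four categories above can be expressed as simple rules over the features that map to each read. The following Python is a minimal sketch under stated assumptions, not the R script used for the study (that is provided as Supplementary File 5). It assumes each read has already been reduced to a list of `(kind, name, strand)` hits, and the barcode names and barcode-to-amplicon assignment in `EXPECTED` are hypothetical placeholders for the scheme shown in Figure 2.

```python
from collections import Counter

# Hypothetical barcode -> expected amplicon assignment (placeholder names;
# the real barcoding scheme is the one shown in Figure 2).
EXPECTED = {
    "bc1": "Ifna", "bc2": "Ifnb", "bc3": "Actb",
    "bc4": "Ifna", "bc5": "Ifnb", "bc6": "Actb",
}

def chimera_categories(hits):
    """Classify one read.

    hits: list of (kind, name, strand) tuples, where kind is "amplicon"
    or "barcode" and strand is "+" or "-".
    Returns the set of chimeric categories the read falls into
    (empty if none of the four rules applies).
    """
    amplicons = [(name, strand) for kind, name, strand in hits if kind == "amplicon"]
    amplicon_names = {name for name, _ in amplicons}
    barcodes = {name for kind, name, _ in hits if kind == "barcode"}

    categories = set()
    # 1. The same amplicon seen at least twice in the same orientation.
    if any(count >= 2 for count in Counter(amplicons).values()):
        categories.add("repeated_amplicon")
    # 2. At least two distinct amplicons.
    if len(amplicon_names) >= 2:
        categories.add("distinct_amplicons")
    # 3. At least two distinct barcodes.
    if len(barcodes) >= 2:
        categories.add("distinct_barcodes")
    # 4. A known barcode alongside an amplicon it was not assigned to.
    if any(b in EXPECTED and EXPECTED[b] != a for b in barcodes for a in amplicon_names):
        categories.add("barcode_amplicon_mismatch")
    return categories

example = [("barcode", "bc2", "+"), ("amplicon", "Ifnb", "+"), ("amplicon", "Ifna", "+")]
print(sorted(chimera_categories(example)))
```

As in the text, the categories are not mutually exclusive: the example read falls into both the distinct-amplicon and the barcode/amplicon disagreement categories.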
The highest proportion of chimeric reads was associated with repeated identical amplicons, with 3441 reads seen (75% of all definitively chimeric reads). This suggests that an amplicon sequencing procedure will be particularly susceptible to read chimerism, as the same sequence will appear in increased abundance compared to an untargeted sequencing approach. The low-temperature overnight ligation had a much higher proportion of repeated amplicons than the quick ligation; for this category of chimeric read it appears that the quick ligation was better at reducing their occurrence, despite prior expectations. Of the definitively chimeric reads, 2869 included at least one overnight barcode (1.8% of 159,188 amplicon-mapped reads with an overnight barcode), and 1203 included at least one quick barcode (2.6% of 45,850 amplicon-mapped reads with a quick barcode). While it appears that the use of overnight ligation has helped somewhat to reduce chimeric reads overall, a substantial proportion of chimeric reads still remains.
If a cassette of adjacent Ifna genes were transcribed together, it is possible that this cassette could be amplified together as a single sequence. These sequences would appear to be chimeric (and fall into the "Repeated amplicons" category), but would not have any intermediate barcodes. The count similarities for repeated Ifna, Ifnb1 and Actb genes in Table 1 suggest that this is not happening at any significant level.
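The chimerism percentages quoted above follow directly from the reported counts. The Python sketch below re-derives them as a consistency check; it adds no new data, and the variable names are ours.

```python
# Counts as reported in the text.
mappable_1d = 261_183        # 1D reads mapping to an amplicon and/or barcode
chimeric = 4_563             # definitively chimeric reads
repeated_amplicon = 3_441    # chimeric reads with repeated identical amplicons
overnight = (2_869, 159_188) # (chimeric, amplicon-mapped) with an overnight barcode
quick = (1_203, 45_850)      # (chimeric, amplicon-mapped) with a quick barcode

def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(pct(chimeric, mappable_1d))        # 1.7
print(pct(repeated_amplicon, chimeric))  # 75.4
print(pct(*overnight))                   # 1.8
print(pct(*quick))                       # 2.6
```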
After elimination of definitively chimeric reads, 256,620 reads remained that appeared to map uniquely to single sequences (see Figure 5). A small proportion of these sequences (14,223; or 5.5%) had detectable barcode sequences, but did not map to any amplicons (i.e. mappable to an overnight or quick barcode sequence only). It is expected that these unmapped barcoded sequences were unamplified mouse cDNA sequences.
A difference in read counts was observed between overnight-barcoded sequences and quick-barcoded sequences (77.8% overnight, 22.2% quick), which was consistent with the difference in input amount observed during sample preparation. An attempt was made during sample preparation to add in the three different amplicon preparations in equimolar quantities, which was more successful for the Actb preparation (33.6%) than it was for the Ifna and Ifnb preparations (42.7% and 23.7%, respectively).
An additional categorisation of Ifna family members (see Supplementary File 3) was attempted, but is not presented here as it detracts from the main chimeric read investigation. Intermediate results and a processing script from this categorisation are available in verbose form as Supplementary File 4.
A few of the reads were investigated at the raw signal level to make sure that the electrical trace was in agreement with the base-called signal. A demonstrative signal trace for a non-chimeric 2D read comprising a single barcode-adapted amplicon is shown in Figure 6. Read traces typically began with a high-current (but relatively uniform) open pore state, followed by an intermediate stall signal (also fairly uniform), after which the highly variable sequence trace begins. Hairpin adapters could be easily identified in the raw signal as a bridge structure a little over halfway through a 2D sequence.
A number of situations were observed in the basecalled sequence where ligation during sample prep seems to have occurred, and in some cases this ligation resulted in multiple hairpin adapters being ligated in the same sequence. One such occurrence of this is seen in Figure 7, where two barcoded overnight sequences from two different amplicons (Ifnb1 and Ifna2) were joined together. Because two amplicons were concatenated, this ligation must have happened after the barcoding step of sample preparation (i.e. during adapter ligation).
This finding has potential implications for other sequencing technologies, as the ligation process used for sample preparation is unlikely to be specific for nanopore sequencing. The formation of chimeric reads during sample preparation may be one explanation for the index switching phenomenon seen in Illumina-sequenced reads (e.g. see22–24), and presents a substantial problem for dual-indexed reads where identical indexes are used for different samples. Where dual-indexed reads are not used, ligation of reads with the same index may still be problematic depending on the particular sequencing application.
There were 8 instances where both an overnight and a quick barcode were observed in the basecalled sequence. In all such cases, there appears to have been a very short pore-protein dissociation between the sequencing of the two sequences (i.e. these were chimeric reads formed from in-silico ligation). The dissociation was only noticeable after inspecting the raw signal: a very short blip in the signal that matched the open pore current (e.g. see Figure 8).
It is likely to be the case that similar situations involving fast pore reloading are present in other reads, but not easily detectable from the called sequence because other barcode/amplicon combinations fit the expected base calling pattern. Considering that this situation can happen with non-identical sequences, software that is able to flag the presence of dissociation and/or stall events that are not at the start of the raw signal would be useful, as these features suggest that the base call is not likely to be a correct single sequence.
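One way to picture the kind of flagging described above is a scan of the raw signal for samples that return to the open pore level after sequencing has started. The Python below is a rough sketch on a toy signal, not an existing tool: the open pore level, the tolerance, and the `min_start` cut-off are hypothetical parameters that would need to be estimated per read, and real traces are noisy enough that smoothing would likely be required first.

```python
def internal_open_pore_runs(signal, open_pore_level, tolerance, min_start):
    """Find runs of samples near the open pore current inside a read.

    Returns (start, end) sample index pairs (end exclusive) for runs of
    samples within `tolerance` of `open_pore_level` that begin at or
    after sample `min_start`, so the expected open pore state at the
    very start of the trace is ignored.
    """
    runs = []
    run_start = None
    for i, value in enumerate(signal):
        near_open = abs(value - open_pore_level) <= tolerance
        if near_open and run_start is None:
            run_start = i
        elif not near_open and run_start is not None:
            if run_start >= min_start:
                runs.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the trace.
    if run_start is not None and run_start >= min_start:
        runs.append((run_start, len(signal)))
    return runs

# Toy trace (arbitrary current units): open pore, sequence, short blip, sequence.
toy_signal = [220] * 5 + [90, 110, 80, 100] * 5 + [221, 219] + [95, 105, 85] * 5
print(internal_open_pore_runs(toy_signal, open_pore_level=220, tolerance=10, min_start=5))
# [(25, 27)]
```

A read with one or more internal runs would be a candidate for manual inspection or for splitting before base calling.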
The imminent release of ONT’s R9.5 flow cells and 1D² base calling will exploit this phenomenon of fast sequence loading into pores in order to produce high-accuracy reads derived from a combined template/complement base call (i.e. replacing the current hairpin-based 2D call). This replaces the 2D sample preparation process that we used for this investigation (see 25).
It is apparent from our investigation that chimeric reads can exist in the output of sequencing runs, and we recommend that researchers consider this possibility when interpreting their own results. As a result, it is a good idea to include easily-detectable adapters when sequencing DNA. These adapters, particularly if present at both ends of a sequence, will help substantially in the identification (and if necessary, filtering) of concatenated sequences that are not native to the sample.
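As an illustration of how end adapters can help with filtering, the sketch below reports adapter occurrences that lie away from both ends of a read, which would suggest a concatenated sequence. It is deliberately simplified: it uses exact string matching, whereas error-containing nanopore reads would need an alignment-based search (as done with LAST in this study), and the adapter sequence and `end_margin` value are made up for the example.

```python
def internal_adapter_positions(read, adapter, end_margin):
    """Return start positions of exact `adapter` matches that lie at
    least `end_margin` bases away from both ends of `read`."""
    positions = []
    start = read.find(adapter)
    while start != -1:
        end = start + len(adapter)
        if start >= end_margin and len(read) - end >= end_margin:
            positions.append(start)
        start = read.find(adapter, start + 1)
    return positions

# Hypothetical adapter and a toy read with adapters at both ends and one in the middle.
adapter = "GATTACA"
toy_read = adapter + "ACGTACGTAC" + adapter + "TTGCATTGCA" + adapter
print(internal_adapter_positions(toy_read, adapter, end_margin=5))
# [17]
```

Reads returning a non-empty list could be flagged for exclusion or split at the internal adapter.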
Although a non-negligible 1.7% of reads were found to have post-amplification chimeric elements, careful quality control of reads after long-read sequencing should be able to identify and exclude the majority of chimeric reads that are produced during a sequencing run.
Raw read signal and basecalled reads have been uploaded to ENA under accession number PRJEB20601. Additional supplementary scripts used for FASTQ file filtering, mapping, and raw signal investigation are available as part of David Eccles’ bioinformatics script repository (doi, 10.5281/zenodo.556966)26. The following scripts from that repository were used for intermediate discovery and result generation:
maf_bcsplit.pl: Converting MAF format to machine-readable CSV with forward-oriented location information
pos_aggregate.pl: Merging adjacent MAF matches to the same target sequence in the same orientation
fastx-fetch.pl: Retrieving sequences from a FASTQ/FASTA file given a list of identifiers (possibly as a text file)
fastx-length.pl: Generating length information and aggregate statistics for a FASTQ/FASTA file
length_plot.r: Generating "digital electrophoresis" image and read density plots given a file containing length information
porejuicer.py: Extracting raw data and called FASTQ files from FAST5 files
A rough shell command script (including additional dead-end attempts at discovery & analysis) is provided for reproduction and/or extension of these findings to other investigations (see Supplementary File 6).
RW: Sample preparation and QC; CP: Mouse injections, RNA extraction; FR: Project oversight; OL: Sample preparation, project design and oversight; DE: DNA sequencing and bioinformatics analysis. All authors contributed towards the preparation of the manuscript.
The R9.4 flow cell and sequencing kit (SQK-LSK208) used for this experiment were provided free of charge by ONT as replacements for a purchased kit and flow cell where the phenomenon of chimeric reads was initially discovered. ONT provided advice regarding the sample preparation protocols, including the suggestion of a slow overnight ligation step.
This work was funded in full by Health Research Council of NZ Independent Research Organization (IRO) funding to the Malaghan Institute of Medical Research (grant number HRC14/1003).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We would like to thank the Nanopore Community for providing help and insightful discussion for this investigation.
Supplementary File 1: Base calling summary from Albacore v0.7.5.
Supplementary File 2: Reference sequences used for the initial amplicon mapping.
Supplementary File 3: Reference sequences used for Ifna paralog mapping.
Supplementary File 4: R script and intermediate data files used for Ifna-family gene counting.
Supplementary File 5: R script and intermediate data files used for chimeric read filtering.
Supplementary File 6: Shell/process script for reproducing the data analysis.
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Not applicable
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Partly
Competing Interests: I have two patents licensed to Oxford Nanopore.
Reviewer Expertise: Biophysics, Sequencing, Epigenetics
Is the work clearly and accurately presented and does it cite the current literature?
Yes
Is the study design appropriate and is the work technically sound?
Yes
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Competing Interests: No competing interests were disclosed.
Version 1: 05 May 17. Version 2 (revision): 16 Aug 17.