Abstract
The high throughput capacities of the Illumina sequencing platforms and the possibility to label samples with unique identifiers have encouraged a wide use of sample multiplexing. However, this practice results in low rates of read misassignment (<1%) across samples sequenced on the same lane on all Illumina sequencing platforms that rely on the traditional bridge amplification. Alarmingly high rates of read misassignment of up to 10% were recently reported for the newest Illumina machines (HiSeq X and HiSeq 4000). This potentially calls into question previously generated and published results and may make future use of these platforms prohibitive for many applications in biology and medicine. In this study we rely on inline barcodes that are ligated to both ends of the DNA insert to directly quantify the amount of index hopping in historical museum-preserved samples. As the barcodes become part of the sequencing read, they allow us to reliably infer the read origin even in the presence of index hopping. After sequencing the same pooled library of seven samples on three independent HiSeq X lanes and accounting for multiple possible sources of error, including barcode and index cross-contamination, we identified on average only 0.470% hopped reads. We conclude that index hopping happens on the newest generation of Illumina sequencing platforms, but results in a similar rate of read misassignment as reported for older Illumina machines. We nonetheless recommend using inline barcodes in multiplexing studies that rely on low-coverage data, require absolute certainty and/or aim to characterize rare variants.
Introduction
Multiplexing samples for next-generation sequencing is a common practice in many biological and medical applications (Craig et al. 2008; Meyer and Kircher 2010; Smith et al. 2010; Caporaso et al. 2012; Rohland and Reich 2012). The high throughput capacities of most sequencing platforms clearly encourage multiplexing, and optimized sequencing protocols with greater data output are continuously being developed. During multiplexing, samples are individually labelled with unique identifiers (indices) that are frequently embedded within one or both sequencing platform-specific adapters and are separated from the actual template (Meyer and Kircher 2010; Kircher et al. 2012, TruSeq Nano DNA kit (Illumina), NEBNext Ultra DNA kit (New England Biolabs)). The samples are subsequently pooled into a single sequencing library and sequenced on the same lane. Following sequencing, computational demultiplexing based on the sample-specific indices allows for assignment of the sequenced reads to the respective sample of origin. However, ever since multiplexing approaches were introduced, low rates of read misassignment across samples sequenced on the same lane were reported on all Illumina platforms (Kircher et al. 2012; Nelson et al. 2014; D’Amore et al. 2016; Wright and Vetsigian 2016b), the most frequently used next-generation sequencing technology (Research & Markets 2017). Read misassignment results in reads from one sample carrying a wrong index and consequently being attributed to the wrong sample of origin. The reported rate of read misassignment is low (<1%) on Illumina platforms that rely on the traditional bridge amplification for cluster generation (Illumina Inc. 2017) and therefore this source of error has been readily ignored.
The use of the exclusion amplification chemistry (ExAmp) in combination with patterned flow cells on the newest generation of the Illumina sequencing platforms (HiSeq X and HiSeq 4000) was an important improvement, as it significantly increased data throughput and lowered sequencing cost (Illumina Inc. 2017). However, recently reported high rates of read misassignment of up to 10% observed for single cell RNA libraries sequenced on the Illumina HiSeq 4000 platform (Griffiths et al. 2017; Sinha et al. 2017) have shaken the scientific community, potentially calling into question many generated and published results. This finding is particularly worrying in light of the recently introduced NovaSeq sequencing platform, which offers even higher throughput while relying on the same technology as HiSeq X and HiSeq 4000. As even more samples can be multiplexed on a single lane, the potential bias from read misassignment would be further increased.
Several different processes can lead to read misassignment, i.e. presence of reads with a switched index. The effect of sequencing errors that can convert one index sequence into another is well known and has led to a series of recommendations for designing highly distinct indices (e.g. Meyer and Kircher 2010). Jumping PCR during bulk amplification of library molecules that carry different indices can generate chimeric sequences and should be avoided (Meyerhans et al. 1990; Odelberg et al. 1995; Lahr and Katz 2009; Holcomb et al. 2014; McDevitt et al. 2016). Similarly, cross-contamination of indexing adapters during oligonucleotide synthesis or laboratory work can lead to reads being attributed to the wrong sample of origin. Mixed clusters, which can form on the flow cell if colonies from different template molecules grow into each other during cluster generation, were identified as a source of misassigned reads on older Illumina platforms (Kircher et al. 2012). For the Illumina platforms with patterned flow cells and ExAmp chemistry, read misassignment has been suggested to be caused by the presence of free-floating indexing primers in the final sequencing library. These primers can anneal to the pooled library molecules and get extended by DNA polymerase before the rapid exclusion amplification on the flow cell, creating a new library molecule with a wrong index (Illumina Inc. 2017; Sinha et al. 2017). We refer to this particular process of generating misassigned reads as index hopping.
The preprint by Sinha and colleagues (2017) has started an active discussion about the prevalence of index hopping on the Illumina platforms with ExAmp chemistry. Illumina acknowledged a higher rate of index hopping on platforms with ExAmp chemistry compared to platforms that rely on bridge amplification for cluster generation, reporting up to 2% compared to ≤1% read misassignment (Illumina Inc. 2017). However, another study found no evidence for index hopping on either the HiSeq X or the HiSeq 2500 platform (Owens et al. 2017). Rigorously removing free-floating primers and adapters during library preparation by means of size-specific library clean-up was suggested to be the most efficient way to avoid index hopping (Illumina Inc. 2017; Griffiths et al. 2017; Sinha et al. 2017).
Due to the conflicting reports, the prevalence and severity of index hopping on Illumina HiSeq X and HiSeq 4000 platforms remain unclear. This is partly due to the difficulty of reliably identifying misassigned reads in sequencing experiments, particularly when pooling similar sample types (e.g. multiple individuals from the same population that have high sequence similarity). However, some research questions clearly require high confidence in read identity, as presence of rare sequence variants can influence biological and medical conclusions. For instance, detection of low abundance transcripts or rare mutations can influence diagnostic inferences (Greenman et al. 2007; Schmitt et al. 2012; Flaherty et al. 2012; Trapnell et al. 2013). Studies with low input DNA quantities are particularly susceptible to such errors. Besides single cell RNA sequencing, these include ancient and historical samples (Kircher et al. 2012). Similarly, population genomics studies frequently rely on low-coverage genomic data, and presence of shared rare alleles across several populations or species can be interpreted as evidence for gene flow (Green et al. 2010; Nielsen et al. 2012; Fumagalli et al. 2013; Allentoft et al. 2015; Wall et al. 2016; Therkildsen and Palumbi 2017).
In this study we make use of inline barcodes, short unique 7-bp sequences ligated to both ends of the DNA fragments (Rohland and Reich 2012), in combination with indexed primers that subsequently were used to amplify the libraries. This enabled us to directly quantify the amount of index hopping in historical museum-preserved samples. These barcodes become part of the sequencing read and thus allow for identification of the read origin, even in the presence of index hopping. Historical samples are characterized by low DNA quantity and quality (the DNA is degraded, chemically modified and shows single-strand overhangs (Mulligan 2005; Sawyer et al. 2012)). We purposefully use this low-quality sample source, as it has been suggested that libraries constructed from difficult samples may be more prone to index hopping than libraries constructed from high-quality and high-quantity samples (Froenicke, 2017). Following sequencing on the HiSeq X platform, we identified a small fraction of reads (<1%) with a wrong combination of barcodes and indices. After excluding several possible explanations, we conclude that index hopping likely happens in this system, but results in a similar rate of read misassignment as reported for older versions of Illumina sequencing platforms. After demonstrating how the use of inline barcode-containing sequencing adapters enables detection and removal of falsely indexed reads, we recommend using this approach independent of the sequencing platform in studies that rely on low-coverage data, require absolute certainty and/or aim to characterize rare variants.
Methods
Library preparation and sequencing
DNA extracts from seven historical eastern gorilla samples that previously yielded good sequencing results on the Illumina HiSeq 2500 platform and showed high endogenous content were turned into sequencing libraries following the strategy outlined in Rohland and Reich (2012) and Rohland et al. (2015), as detailed below. All library preparation steps except indexing PCR were performed in a dedicated ancient DNA facility to minimize contamination. Briefly, 20 µl DNA extract was used in a 50 µl blunting reaction together with USER enzyme treatment to remove uracil bases resulting from aDNA damage (final concentrations: 1× buffer Tango, 100 µM each dNTP, 1 mM ATP, 25 U T4 polynucleotide kinase (Thermo Scientific), 3 U USER enzyme (NEB)). Samples were incubated for 3 h at 37°C, followed by the addition of 1 µl T4 DNA polymerase (Thermo Scientific) and incubation at 25°C for 15 min and 12°C for 5 min (Fig. 1). DNA fragments within each sample were then ligated to a unique combination of incomplete, partially double-stranded P5- and P7-adapters (10 µM each), each containing a unique seven base pair sequence. We refer to these as the P5 and P7 barcodes from here on. All barcode sequences were at least three nucleotides apart from each other to ensure high certainty during demultiplexing and avoid converting one barcode into another through sequencing error (Rohland et al. 2015, Table S1). To increase the complexity of the pooled sequencing library, one sample received two different barcode combinations (Table 1). Adapter ligation was performed in 40 µl volume using 20 µl of blunted DNA and 1 µl of unique P5 and P7 barcodes per sample (final concentrations: 1× T4 DNA ligase buffer, 5% PEG-4000, 5 U T4 DNA ligase (Thermo Scientific), Fig. 1). Samples were incubated for 30 minutes at room temperature and cleaned using MinElute spin columns following the manufacturer’s protocol.
Adapter fill-in was performed in 40 µl final volume using 20 µl adapter ligated DNA (final concentrations: 1× T4 DNA ligase buffer, 5% PEG-4000, 5 U T4 DNA ligase (Thermo Scientific), Fig. 1), incubated at 37°C for 20 minutes, heat-inactivated at 80°C for 20 minutes, and cleaned using MinElute spin columns as above.
Outcome of index hopping. A) The library pool, containing barcoded and indexed library molecules and free-floating indexing primers, is mixed with ExAmp reagents before loading on the patterned flow cell. B) Free-floating adapters anneal to the adapter sequence of a library molecule and C) the library molecule subsequently gets extended by DNA polymerase forming a new library molecule containing a wrong index. D) The library molecules are denatured, separating the strands, and each library molecule is allowed to graft into a nanowell on the patterned flow cell.
Sequencing statistics and estimates of contamination and index hopping.
Indexing PCR was performed for 10 cycles in 125 µl volume using a unique P7 indexing primer for each sample, as in Meyer and Kircher (2010) (final concentrations: 1× AccuPrime reaction mix, 0.3 µM IS4 primer, 0.3 µM P7 indexing primer, 7 U AccuPrime Pfx (Thermo Scientific); cycling protocol: 95°C for 2 min, 10 cycles at 95°C for 30 s, 55°C for 30 s and 72°C for 1 min, and a final extension at 72°C for 5 min; Fig. 1). Note that indexing PCR for sample 7, which received two different barcode pairs, was performed in a single reaction combining both fractions of this sample. All index sequences differed by at least three base pairs from each other (Table S1). Following the indexing PCR, each DNA fragment contained three unique identifiers: the P5 and P7 barcodes directly ligated to the ends of the DNA fragments, and the P7 index, which is part of the Illumina sequencing adapter (Fig. 1). Sample libraries were cleaned using MinElute spin columns, and fragment length distribution and concentrations were measured on the Bioanalyzer. We then pooled all seven sample libraries in a ratio of 2:1:2:1:1:1:2 for samples 1 to 7 and performed two rounds of AMPure XP bead clean-up using 0.5× and 1.8× bead:DNA ratios, respectively. We confirmed that indexing primers were successfully removed during clean-up by running the final library on a Bioanalyzer (Fig. S1). The pooled library, with a final concentration of 18 nM, was sequenced on three HiSeq X lanes (150 bp paired-end, 1% PhiX) that were part of independent runs, at the SciLife sequencing facility in Stockholm.
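The minimum-distance criterion applied to barcodes and indices can be checked programmatically. The following is a minimal sketch in Python; the function names are ours and the four sequences are hypothetical placeholders chosen for illustration, not the barcodes or indices listed in Table S1.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(seqs: list[str]) -> int:
    """Smallest Hamming distance over all pairs in a set of sequences."""
    return min(hamming(a, b) for a, b in combinations(seqs, 2))


# Hypothetical 7-bp sequences for illustration only (not from Table S1).
example_barcodes = ["ACGTACG", "TGCATGC", "GATCGAT", "CTAGCTA"]

# A set meets the design criterion if every pair differs at >= 3 positions.
assert min_pairwise_distance(example_barcodes) >= 3
```

With a minimum distance of three, a single sequencing error cannot convert one identifier into another, and allowing one mismatch during demultiplexing still leaves every identifier unambiguous.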
Data processing
All reads were demultiplexed based on their unique indices using Illumina’s bcl2fastq (v2.17.1) software with default settings, allowing for one mismatch per index and only retaining “pass filter” reads (Illumina Inc.). All unidentified reads, i.e. reads with indices that were not used in our experiment, were subjected to the same filtering steps, as described below. We removed adapter sequences using AdapterRemoval V2.1.7 with standard parameters and subsequently merged the reads, requiring a minimal overlap of 11 bp and allowing for a 10% sequencing error rate (Schubert et al. 2016). Unmerged reads and reads shorter than 29 bp were removed, leaving only merged reads with an original insert size of at least 15 bp (7 bp barcodeP7 + 7 bp barcodeP5 + 15 bp DNA fragment = 29 bp). To increase certainty, we only retained reads with intact and error-free P5 and P7 barcodes (assessed using an in-house Python script) and an average quality score of at least 30 (assessed using prinseq V0.20.4; Schmieder and Edwards 2011).
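The in-house script is not reproduced here. As a rough illustration of the length and barcode checks described above, the following Python sketch shows one way such a filter could look. The function name, the barcode sets passed in, and the assumption that the P5 barcode occupies the first seven bases of the merged read and the P7 barcode the last seven (in the orientation stored in the lookup set) are ours.

```python
BARCODE_LEN = 7       # length of each inline barcode (bp)
MIN_INSERT_LEN = 15   # minimal DNA fragment length retained (bp)
MIN_READ_LEN = 2 * BARCODE_LEN + MIN_INSERT_LEN  # 7 + 7 + 15 = 29 bp


def split_barcodes(read, p5_barcodes, p7_barcodes):
    """Return (p5, p7, insert) if both barcodes match exactly, else None.

    Assumption: the P5 barcode is at the start of the merged read and the
    P7 barcode at its end, as stored in p5_barcodes and p7_barcodes.
    """
    if len(read) < MIN_READ_LEN:
        return None  # insert would be shorter than 15 bp
    p5 = read[:BARCODE_LEN]
    p7 = read[-BARCODE_LEN:]
    if p5 not in p5_barcodes or p7 not in p7_barcodes:
        return None  # barcode missing or carrying a sequencing error
    return p5, p7, read[BARCODE_LEN:-BARCODE_LEN]
```

Requiring exact matches discards some reads that could have been rescued by error-tolerant matching, but it keeps the barcode assignment independent of any error model.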
Estimating barcode and index cross-contamination and index hopping across sequencing runs
To estimate the rate of barcode cross-contamination, we identified reads with wrong barcode pairs for each sample within each run. We also included unidentified reads with wrong barcode pair combinations into this calculation. The proportion of cross-contaminated reads within a given sequencing run was determined as the ratio between the sum of all reads with wrong barcode pairs and the sum of all sequenced reads that passed the filtering criteria. Given that we used a total of eight different barcodes, we calculated the probability that barcode cross-contamination results in a valid barcode pair (i.e. barcode pair that is actually used in the experiment) as 7*(x/7 * x/7), where x corresponds to the estimated percentage of wrong barcode pairs present in our experiment.
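The expression above can be written as a small helper, shown here as a sketch in Python. The function name is hypothetical, and we assume that x is supplied as a proportion between 0 and 1 rather than as a percentage; the default n = 7 is taken from the formula as written.

```python
def valid_pair_by_contamination(x: float, n: int = 7) -> float:
    """Evaluate n * (x/n * x/n), the expression given in the text.

    x: proportion of filtered reads carrying a wrong barcode pair
       (assumed here to be a fraction between 0 and 1).
    n: divisor and multiplier used in the formula (7 as written).
    """
    return n * (x / n) * (x / n)
```

Algebraically the expression reduces to x²/n, so the result shrinks quadratically as barcode cross-contamination decreases.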
Reads with a correct barcode combination but wrong index can result from index cross-contamination and/or index hopping. To distinguish between these two possibilities, we relied on the fact that only seven different indices were used in our experiment, whereas 40 different indices are routinely used in the ancient DNA laboratory. Therefore, we quantified index cross-contamination as the fraction of reads containing indices that were not included in our experiment. These reads are present within the unidentified reads and carry a valid barcode combination but an unused index.
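The read categories distinguished above can be summarized as a decision procedure. The Python sketch below is illustrative only; the function name, the category labels and the two lookup dictionaries are hypothetical and do not correspond to the scripts used in the study.

```python
def classify_read(p5, p7, index, sample_by_barcodes, sample_by_index):
    """Assign a filtered read to one of four categories.

    sample_by_barcodes maps each valid (P5, P7) barcode pair to its sample;
    sample_by_index maps each index used in the experiment to its sample.
    """
    barcode_sample = sample_by_barcodes.get((p5, p7))
    if barcode_sample is None:
        return "barcode_cross_contamination"    # wrong barcode pair
    index_sample = sample_by_index.get(index)
    if index_sample is None:
        return "index_cross_contamination"      # valid pair, unused index
    if index_sample == barcode_sample:
        return "correct"
    return "wrong_index_barcode_combination"    # candidate hopped read
```

Counting the reads in each category per run and dividing by the number of filtered reads gives the proportions used in the estimates described in this section.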
To determine the proportion of hopped reads, for each sequencing run we calculated the ratio between the sum of all reads showing a wrong index-barcode combination and the sum of all sequenced reads that passed the filtering criteria. To account for the possibility of barcode cross-contamination that produces valid barcode combinations and index cross-contamination, we subtracted these two estimates from the proportion of reads with wrong barcode-index combination.
Statistical analyses
Statistical analyses were performed in R 2.15.3 (R Core Team 2016). Significant global chi-square tests were followed by a post hoc procedure as implemented in the R package polytomous (https://artax.karlin.mff.cuni.cz/r-help/library/polytomous/html/00Index.html). The minimum value of the chi-square test statistic for the given degrees of freedom was used to assess whether individual observed values differ significantly from an overall hypothetical homogeneous distribution. The test also identified the direction of these differences.
Results
Our sequencing libraries were made from degraded historical samples containing a large proportion of short DNA fragments (Fig. 2A), the majority of which could be confidently merged (95.3% SE ± 1.0%). After filtering (see Methods), the final dataset contained 89.3% ±1.9% of the original sequence reads.
A) Read length distribution and the proportion of index hopping by read length. B) Read GC-content distribution and the proportion of index hopping by read GC content. Vertical bars depict 95% confidence intervals.
Barcode cross-contamination
We observed low levels of barcode cross-contamination (0.0276% SE ± 0.0026 across all three runs, Table 1, Table S2). The rate of barcode cross-contamination differed significantly by sample (global chi-square test, P < 10⁻¹⁵). The implemented post hoc procedure suggested that samples 5 and 7 had significantly more reads with wrong barcode combinations than expected, whereas all the other samples had significantly fewer such reads. Among reads with barcode cross-contamination we found an overrepresentation of incorrectly paired barcodes #9 and #14 (Figure 3, Table S2), both of which were used for sample 7 in the following combinations: P5-#9 with P7-#9 and P5-#14 with P7-#14 (Table 1). Elevated cross-contamination between these two barcodes during laboratory procedures could explain the results. However, the observed high rate of wrong barcode pairs (P5-#9 with P7-#14, P5-#14 with P7-#9, Figure 3) is more likely the result of jumping PCR during the 10 rounds of indexing PCR, as both fractions of sample 7 were indexed in a pooled reaction. The equal frequency of the wrong barcode pairs further supports this notion (Table S2) and can be explained by jumping PCR happening randomly among the reads. In contrast, it is rather unlikely that all four barcodes would have received equal amounts of cross-contamination during laboratory procedures. Assuming that adapter ligation of barcodes is unbiased with respect to the barcode sequence (Rohland et al. 2015), the detected low average percentage of cross-contamination will lead to 1.55 × 10⁻⁵% of reads (7*(0.00276/7 * 0.00276/7) * 100% = 0.0000155%) that carry a valid barcode pair but wrongly appear to have undergone index hopping.
Index cross-contamination
The Illumina HiSeq X platform does not support a double-indexing design. Therefore, in contrast to barcode cross-contamination, index cross-contamination cannot be directly quantified from the sequencing data. Instead, we focused on the fraction of unidentified reads, which contain indices that were not used in our experiment (Methods, Table S3). The fraction of such reads was nearly identical among the three sequencing runs, ranging from 0.12% to 0.13% (mean = 0.124% SE ± 0.0023).
Index hopping
Index hopping will not affect the barcodes that are directly attached to the DNA fragments. Therefore, it can be readily distinguished from barcode cross-contamination by the presence of reads containing a wrong combination of an index and a barcode pair. Across all three sequencing runs, we detected a low proportion of reads with wrong index-barcode combinations (mean = 0.594%, SE ± 0.0434%, Table 1). As detailed in Methods, to obtain the proportion of reads that result from index hopping, but not from barcode or index cross-contamination, we subtracted our estimates of barcode cross-contamination and index cross-contamination from this value. The estimated rate of index hopping in our experiment across all three sequencing runs is therefore 0.470% SE ± 0.044 (0.594% − 1.55 × 10⁻⁵% − 0.124%). The proportion of hopped reads differed significantly by sample (chi-square test, P < 10⁻¹⁵). We observed a significant positive correlation between the number of sequenced reads per sample and the number of reads that hopped from this sample to other samples (Pearson’s r = 0.96, P = 0.0005), suggesting that samples with a higher number of sequenced reads will serve as a dominant source of hopped reads (Fig. 3). Therefore, even though the overall rate of index hopping is low, samples with a low number of sequenced reads are more affected by index hopping, leading to 1.47% SE ± 0.11% and 2.49% SE ± 0.29% of index hopped reads within these samples in our experiment (e.g. samples 2 and 4 in Table 1, Table S4, Fig. 3).
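The subtraction behind the reported estimate can be reproduced with a few lines of Python. This is a minimal sketch using the mean values quoted in this section; the function name is hypothetical.

```python
def index_hopping_rate(wrong_combo, valid_pair_contam, index_contam):
    """Wrong index-barcode proportion minus the two contamination estimates.

    All arguments are percentages of filtered reads.
    """
    return wrong_combo - valid_pair_contam - index_contam


# Mean values quoted in the text (percent of filtered reads).
rate = index_hopping_rate(0.594, 1.55e-5, 0.124)
print(f"{rate:.3f}%")  # prints 0.470%
```

The barcode term is several orders of magnitude smaller than the other two, so the estimate is effectively determined by the proportion of wrong index-barcode combinations and the index cross-contamination estimate.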
Barcode cross contamination and index hopping by sample. A) Proportion of a given wrong barcode pair in the data out of all erroneous barcode pairs. Barcodes 9 and 14 are paired significantly more often and at equal frequencies, which is likely explained by jumping PCR. B) Proportions of hopped reads by sample. Samples in the top row contribute hopped reads, whereas samples on the left receive hopped reads.
The rate of index hopping differed significantly by read length and GC content (chi-square test, P < 10⁻¹⁵, Figure 2). Reads shorter than 90 bp and reads with GC content above 40% showed a significantly higher proportion of hopped reads than expected.
Discussion
We show that index hopping is a real phenomenon occurring on the Illumina HiSeq X platform, but its rate is below 1% in our study. Multiple sources of error can result in read misassignment on the HiSeq X platform, including barcode and index cross-contamination, jumping PCR, sequencing errors, and index hopping. However, through a careful experimental design, we can exclude these error sources and reliably quantify the rate of index hopping. First, we show that the rate of cross-contamination of barcodes is very low (on average, only 0.0027%). A slightly higher level of observed barcode cross-contamination in sample 7 is likely due to jumping PCR. However, jumping PCR can be eliminated as an explanation for wrong index-barcode combinations, as we prepared all libraries individually and avoided amplification of pooled libraries from different samples. Library pooling only occurred directly prior to sequencing. Second, we detect low levels of index cross-contamination by quantifying the presence of indices that are routinely used in the lab among our sequenced reads (0.124%). This further suggests that the presence of wrong index-barcode pairs cannot be explained by index cross-contamination. Third, we employed a very stringent procedure to control for sequencing error: we did not allow for mismatches in the 7-bp P5 and P7 barcodes, required high average read quality and only retained merged reads. By using the library preparation protocol described in Rohland et al. (2015), we can thus accurately identify and quantify reads containing wrong index-barcode combinations that are the result of index hopping and not the effect of other sources of error.
Read misassignment is not a novel phenomenon for the Illumina sequencing platforms. Reported error rates range from 0.1% to 0.582% for HiSeq 2500 (Kircher et al. 2012; Wright and Vetsigian 2016a, Wright and Vetsigian 2016b) and from 0.06% to 0.21% for the MiSeq platforms (Nelson et al. 2014; D’Amore et al. 2016). It is therefore noteworthy that the fraction of hopped reads as estimated in our study (0.470%) is similar to that reported for other platforms. However, it markedly differs from the recent estimates for the Illumina HiSeq X/4000 platforms (Griffiths et al. 2017; Owens et al. 2017; Sinha et al. 2017). While Owens et al. (2017) failed to detect any index hopping in libraries sequenced on both Illumina HiSeq X and HiSeq 2500, Griffiths et al. (2017) and Sinha et al. (2017) reported >1% and up to 10% of misassigned reads for single-cell RNA libraries on the HiSeq 4000 platform. Our low observed rate of index hopping might be explained by the low amounts of free-floating adapters in the final library, since these had been rigorously removed through size selection and cleaning (Figure S1).
The number of reads with hopped indices is proportional to the total number of reads contributed by a given sample to the pooled sequencing library. Pooling samples in unequal amounts leads to a greater proportion of hopped reads in samples with fewer sequenced reads. In this study, libraries with the lowest number of sequenced reads displayed up to 3.2% of misassigned reads (Table 1). When working with low-quality samples, the effect of unequal amounts of index hopping can become even more severe if the endogenous content is markedly different between samples, as is often observed in aDNA studies (Damgaard et al. 2015; Pinhasi et al. 2015; van der Valk et al. 2017). In this case, hopping of endogenous reads will occur from samples with high endogenous content into samples with low endogenous content, potentially leading to pronounced biases. The interplay between endogenous content and the number of sequenced reads may result in libraries in which the proportion of falsely assigned endogenous reads is considerably higher than reported here (Fig. S2).
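The effect of unequal pooling can be illustrated with a toy calculation in Python. This is a sketch under simplifying assumptions that are ours and not part of the analysis: every read hops with the same probability, a hopped read is equally likely to acquire any of the other samples' indices, and read numbers follow the 2:1:2:1:1:1:2 pooling ratio exactly. The function name is hypothetical, and the output is not expected to reproduce the per-sample values in Table 1.

```python
def hopped_fraction_per_sample(reads, hop_rate):
    """Expected fraction of each sample's assigned reads that hopped in.

    Simplifying assumptions (not from the study): every read hops with the
    same probability hop_rate, and a hopped read is equally likely to pick
    up any of the other samples' indices.
    """
    n = len(reads)
    fractions = []
    for i, own in enumerate(reads):
        stay = own * (1 - hop_rate)
        incoming = sum(r * hop_rate / (n - 1)
                       for j, r in enumerate(reads) if j != i)
        fractions.append(incoming / (stay + incoming))
    return fractions


# Relative read numbers following the 2:1:2:1:1:1:2 pooling ratio.
pool = [2, 1, 2, 1, 1, 1, 2]
for sample, frac in enumerate(hopped_fraction_per_sample(pool, 0.0047), 1):
    print(f"sample {sample}: {100 * frac:.2f}% hopped-in reads")
```

Under this model, samples pooled at the lower ratio receive a larger share of hopped reads than samples pooled at the higher ratio, and with equal pooling every sample's share equals the overall hopping rate.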
Our study shows that while index hopping occurs on the Illumina HiSeq X platform, it results in a low proportion of erroneous reads. Importantly, these reads can be readily identified using a library preparation protocol that combines two separate inline barcodes and a unique index (or index pair on the HiSeq 4000). For studies generating high-coverage data, the low detected rate of read misassignment, which is similar to that of the older sequencing platforms, might be insignificant. However, in cases where low-coverage data is generated or absolute certainty is required, even low-rate index hopping might represent a major problem. Using short barcode adapters allows for the filtering of misassigned reads, and in the case of short read lengths (such as in aDNA studies) will lead to only a minimal loss of sequencing data. We therefore recommend the use of the 7-bp barcode adapters when preparing pooled ancient DNA libraries or in studies where absolute certainty is required.
Funding sources
This project was supported by FORMAS grant 2015-676 to LD and the Jan Löfqvist and the Nilsson-Ehle Endowments of the Royal Physiographic Society of Lund to KG. Sequencing consumables were supplied by Illumina. Illumina had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We also acknowledge the support from the Science for Life Laboratory, the Knut and Alice Wallenberg Foundation, the National Genomics Infrastructure funded by the Swedish Research Council, and Uppsala Multidisciplinary Center for Advanced Computational Science for assistance with massively parallel sequencing and access to the UPPMAX computational infrastructure.