ABSTRACT
The constant threat of emergence for novel pathogenic influenza A viruses with pandemic potential, makes full-genome characterization of circulating influenza viral strains a high priority, allowing detection of novel and re-assorting variants. Sequencing the full-length genome of influenza A virus traditionally required multiple amplification rounds, followed by the subsequent sequencing of individual PCR products. The introduction of high-throughput sequencing technologies has made whole genome sequencing easier and faster. We present a simple protocol to obtain whole genome sequences of hypothetically any influenza A virus, even with low quantities of starting genetic material. The complete genomes of influenza A viruses of different subtypes and from distinct sources (clinical samples of pdmH1N1, tissue culture-adapted H3N2 viruses, or avian influenza viruses from cloacal swabs) were amplified with a single multisegment reverse transcription-PCR reaction and sequenced using Illumina sequencing platform. Samples with low quantity of genetic material after initial PCR amplification were re-amplified by an additional PCR using random primers. Whole genome sequencing was successful for 66% of the samples, whilst the most relevant genome segments for epidemiological surveillance (corresponding to the hemagglutinin and neuraminidase) were sequenced with at least 93% coverage (and a minimum 10x) for 98% of the samples. Low coverage for some samples is likely due to an initial low viral RNA concentration in the original sample. The proposed methodology is especially suitable for sequencing a large number of samples, when genetic data is urgently required for strains characterization, and may also be useful for variant analysis.
INTRODUCTION
Influenza A viruses (IAVs) infect a wide range of avian and mammalian species, including humans. IAVs are an important cause of human respiratory diseases, generating seasonal yearly infections and occasional pandemics1. These enveloped RNA viruses belong to the Orthomyxoviridae family, having a segmented genome composed of eight independent RNA segments, ranging in size from 890 to 2340 nucleotides. The eight different genome segments encode for over ten structural and no-structural viral proteins: PB2, PB1 and PA (forming the polymerase complex), HA (hemagglutinin protein, responsible for binding to cell receptors and is the main antiviral target), NP (nucleoprotein), NA (neuraminidase protein, that cleaves sialic acid allowing virion liberation, and is also an antiviral target), M1 and M2 (matrix and viral ion channel proteins) and NS1 and NS2 (non-structural proteins, major host immune modulators). Other nonessential accessory proteins, with a wide variety of functions, include PB1-F2, PB1-N40, PA-X, PA-N182 and PA-N1552. Based on the variability of the two surface glycoproteins (HA and NA), 18 HA and 11 NA varieties have been described so far, that combined generate different virus subtypes3–5. While almost all viral subtypes are believed to circulate in waterfowl, only three IAVs subtypes have been established in human populations (H1N1, H2N2, and H3N2)1.
IAVs evolve mainly by two mechanisms: single point mutations introduced into different genes by a low-fidelity RNA polymerase that may be fixed under varying selective pressures (antigenic drift), or by the exchange of whole segments between different viral strains during co-infection (antigenic shift)1. Both of these processes contribute to the long-term evolution of IAVs, and are associated with changes in antigenic or biological properties of emerging strains (such as an increased virulence, change of host tropism, resistance to antiviral drugs and antigenic escape, among others)6–8.
The evolution of circulating IAVs has been mainly studied through the HA and NA genes9, as whole genome sequences only became widely available within the last decade, after the introduction of the high-throughput sequencing (HTS) technologies 10–12. Before HTS, the most common method for obtaining complete IAVs genomes was by the amplification of overlapping genome regions using a reverse transcription polymerase chain reaction (RT-PCR), followed by Big Dye Terminator sequencing chemistry10,12–14. Since then, HTS has been successfully used to obtain whole genome sequences directly from clinical samples, after cell culture adaptation, or directly after RT-PCR amplification of individual genome segments (namely HA and NA)15–20. Multisegment reverse transcription-PCR reaction (M-RTPCR) is often used to amplify the genetic material to be used for preparation of sequencing libraries21–24. However, the quality and coverage of the viral sequences obtained can vary greatly, ranging from a high genome coverage with > 100x depth, to only a few reads per sample15,18,20–24. We present here a simple and cost-efficient methodology to obtain whole genome sequence from IAVs with a good success rate, even when started from a non-optimal amount of initial genetic material.
MATERIAL AND METHODS
Samples collection and viral RNA extraction
The Bioethics Committee of Instituto de Biotecnologia UNAM, the INER (Instituto Nacional de Enfermedades Respiratorias), and the Hospital Juarez approved human samples collection and all handling protocols used in this work. All hospitalized patients read, agreed and signed to written consent forms. The twenty-nine viral samples from different sources used for this work are listed in Table 1. For tissue culture-adapted H3N2 viruses and for avian influenza A viruses collected from cloacal swabs, 200 μl of the supernatants were treated with Turbo DNase (Ambion) for 30 minutes at 37°C to deplete foreign DNA. Viral RNA was then extracted with the Purelink Viral RNA/DNA extraction kit (Invitrogen), using 25 μg of linear acrylamide (Ambion) as a carrier. RNA was then stored at -70°C for further use. For clinical H1N1 samples, viral RNA was extracted from 200 μl of the sample using a high-throughput system (MagNA Pure®, Roche, Indianapolis, IN, USA), as described by the manufacturer. We did not observe important differences among samples, as both high and low amounts of genomic material were obtained using both extraction systems as quantified by qPCR (data not shown), indicating that the different extraction processes do not affect final results, and that the genomic material quality and quantity is highly dependent on the quality of the initial sample.
Whole genome amplification
Complete genomes for all IAV samples were amplified with a single multisegment reverse transcription-PCR reaction (M-RTPCR) as described by Zhou et al 25, using 5 μl of extracted RNA as input for each reaction. PCR products were visualized on 1% agarose gels and then purified using firstly by the AMPure magnetic beads (Beckman Coulter Genomics), and later by the DNA Clean & Concentrator kit (Zymo Research). All final products were quantified using the NanoDrop ND1000 (NanoDrop Technologies) and the Agilent 2100 Bioanalyzer.
Preparation of Illumina libraries and sequencing
Samples with more than 150ng of amplified material after M-RTPCR were subjected directly to the preparation of HTS libraries, as described below. DNA was fragmented to approximately 250-500 bp using the NEBNext Fragmentase enzyme (New England Biolabs), and further purified with the DNA Clean & Concentrator kit (Zymo Research). Products were used for building individual single-indexed Illumina libraries using the Genomic DNA Sample Preparation kit - Multiplex Sample Prep Oligo kit (Illumina), as instructed by the manufacturer. At the time, Illumina protocols recommended between 1 and 5 μg of DNA as starting material. Thus, when less than 150ng of total genomic material was obtained after initial M-RTPCR, these samples were additionally amplified using a PCR with random primers, as described in Taboada et al 26. In this protocol, no additional enzymatic fragmentation step is needed, given the nature of the reaction. Briefly, two rounds of synthesis with Sequenase 2.0 (USB, USA) were performed using primer A (5’-GTTTCCCAGTAGGTCTCN-3’), followed by ten amplification rounds using the Phusion DNA polymerase (Finnzymes) with primer B (5’-GTTTCCCAGTAGGTCTC-3’). DNA was then digested using the GsuI enzyme to remove the 16 additional nucleotides from 5’ and 3’ ends generated during the PCR reaction. Digested DNA was purified as described above, and then used to prepare Illumina sequencing libraries, as described above.
Between 4 to 6 libraries of approximately 350 bp were equimolarly pooled and loaded per lane to generate sequencing clusters, followed by 36 or 45 cycles of single base pair extensions using the Illumina Genome Analyzer II platform at the HTS Core Facility-UNAM. Image analysis was done using the Genome Analyzer Pipeline Version 1.4 (Illumina, San Diego, CA). As a negative control, the pure distilled water was included from the initial M-RTPCR and further processes as a sample for the preparation of a blank library (data not shown).
Bioinformatic analysis of sequencing data
Data analysis was performed using the computational cluster of the Instituto de Biotecnologia-UNAM, as described in Escalera et al 27,28. High-quality reads (Q30) were filtered and identical reads were collapsed. The reference genomes of the A/Netherlands/602/2009 (H1N1) and A/New York/392/2004 (H3N2) strains were used to filter out H1N1 and H3N2 viral sequences, using the mapping and assembly software MAQ v0.7.1 29. For viruses of avian origin, the reads obtained per sample were first mapped against an in-house curated database, as described by Paulin et al 30. The reference strains with most reads mapped were selected for further genome assembly. Mapping of reads to the reference genomes was performed using SMALT v0.7.1, whilst consensus sequences were called using SAMtools version 0.1.1831. Given that avian IAVs genomes could have a higher variability when compared to human H1N1 or H3N2 samples, two separate mapping rounds were performed. In the first round, 15 mismatches were allowed to generate a ‘relaxed’ consensus sequence, whilst during the second round only 5 mismatches were allowed for mapping to allow for a ‘strict’ consensus using as a reference the consensus sequence generated during the first round. Reads that did not mapped during the second round were not considered for further analysis. Finally, influenza-specific reads were used to plot genome coverages, using the R package. The sequences generated in this work are available in GenBank under the following accession numbers: CY100445-CY100622, KY575169-KY575224, KY593192-KY593200.
Statistical analysis
The differences in i) the ratio of total to unique reads (after collapsing identical), ii) in the ratio of unique to influenza-specific reads, and iii) in the mean of total normalized reads as a function of the sample processing method (M-RTPCR/enzymatic fragmentation [FR] vs. additional PCR amplification with random primers and no enzymatic fragmentation [PCR]), were statistically assessed under an unpaired (two sample means) two-tailed t-test in Prism 6.0. Given the sample size for each group (n=14 for FR and n=15 for PCR) and the mean and standard deviation values for each class, a statistical power of >80% was estimated with a Type 1 error rate of 5%.
RESULTS AND DISCUSSION
Viral RNA enrichment
The M-RTPCR described by Zhou et al 25 is based on all IAVs having conserved 5’ and 3’ ends for all genome segments. Thus, all segments can be amplified using the same pair of oligonucleotides in a single multiplex reaction. Given the diverse origin of the samples used in this study (culture-grown virus, clinical isolates and avian-derived samples), variable amounts of PCR-amplified whole genome products were obtained, ranging from 10 ng up to 2.7 μg of total DNA/sample (Table 1). When comparing the differential amplification for each genome segment, PCR products corresponding to the smaller segments (M and NS, followed by HA, NP and NA) were favored against the larger ones (PB2, PB1, PA). In general, the efficiency (yield) of PCR was greater for low-molecular than for a high-molecular weight DNA templates (Figure 1). For 15 samples, less than 150 ng of total genomic DNA was obtained (Table 1), thus these were subjected to a secondary random PCR amplification as described in material and methods.
Sequencing specificity
In total, 29 IAV genomes were sequenced, with a total number of reads per sample ranging from 90,000 to 19×106 (Table 1). The lowest number of raw reads was obtained for sample A/Mexico/InDRE756/2006 (H3N2), whilst the highest one was for sample A/Mexico/InDRE2246/2005 (H3N2). Given the low concentration of initially amplified genomic products obtained after M-RTPCR, these two samples were subjected to additional amplification by random PCR (Table 1). Identical reads were collapsed, reducing the number of reads by a mean of 84% (ranging from 70% to 95%). When comparing the mean percentage of unique reads with respect to the method of library preparation, a significant difference (p-value <0.0001) was observed between samples prepared by the FR method (mean: 11.25 ± 3.9), and samples processed under the PCR method (mean: 21.19 ± 6.5), with PCR method having higher proportion of unique reads (Figure 2A). When comparing the number of influenza-specific reads obtained under the different methods, samples processed under the PCR method showed a significantly smaller number of influenza-specific reads (14.7 ± 2.7, p-value < 0.01), when compared to samples prepared by the FR method (30.3 ± 5.1) (Figure 2B). This observation suggests that despite yielding a reduced number of unique reads after sequencing, the FR method increased specificity (as shown by a larger number of influenza-specific reads), rendering it better method when compared to the PCR method.
Additionally, when the number of influenza-specific reads was compared to genome coverage, samples prepared under the FR method showed a larger number of reads per nucleotide sequenced, as the majority of samples displayed between 10-20 reads per nucleotide. Contrastingly, >50% of the samples prepared by the PCR method showed < 5 reads per nucleotide sequenced (Figure 2C). A lower percentage of influenza specific reads obtained under the PCR method could be partially explained by non-specific amplification of non-viral genetic material during the random PCR. Thus, it is important to note that too many rounds of additional amplification by random PCR may introduce sequencing bias, and can also increase the relative quantities of non-viral genetic material. However, the PCR method does facilitate whole genome sequencing when samples have an initial low concentration of genomic material after M-RTPCR product (Table 1).
Genome coverage
IAV whole genome sequencing studies have shown variable results, as obtaining full genomic sequences can be difficult, especially for clinical or tissue-derived samples15–20. Whole genome sequencing was highly efficient, with > 99% genome coverage obtained for 66% of the samples (19/29): 11 (79%) for FR method and 8 (53%) for PCR. Many samples (59%, 17/29) also showed good overall sequencing depth across the whole genome (>50x) (Table 1). For the majority of the samples (20/29), more than 100,000 unique influenza-specific reads were obtained, supporting for the methodology proposed here as suitable for detection viral variants within a single sample. For some samples, gaps were observed within genome assemblies, mainly within the large genome segments (namely in PB2, PB1 and PA) (Figure 3). This can be explained to some extent by the M-RTPCR bias towards the amplification of smaller segments. Incomplete genome sequencing is also correlated to low viral RNA concentrations, in relation to the quality of the original sample.
Nonetheless, even in samples with lower genome coverage, the antigenically relevant segments (HA and NA), which are of mayor importance for epidemiological surveillance, were fully sequenced for 79% of the samples (23/29), with over 99.3% coverage and a minimum depth of 10X (Table 1). Samples with good quality and quantity of initial genomic material after M-RTPCR showed an overall good genome coverage and depth (Table 1). Comparably, samples with a low amount of starting genomic material after M-RTPCR (and thus reamplified by random PCR), also showed good genome coverage and depth, confirming the usefulness of both methods for obtaining whole genome sequences.
CONCLUSIONS
In this work, we present a simple methodology to improve the likelihood of obtaining whole genome sequences with good genome coverage and depth, for non-optimal samples. This methodology is especially suitable for sequencing a large number of samples, and is useful when genetic data is needed to perform evolutionary or epidemiological analyses during outbreaks. The proposed methodology also allows the identification and characterization of field samples difficult to sequence by conventional molecular biology methods. This methodology is also compatible with improved library preparation kits and sequencing technologies, such as the MinION portable device, that require less initial genomic material.
ACKNOWLEDGMENT
We thank Dr. Ricardo Grande and M.Sc. Verónica Jiménez and the HTS core facility of the National University of Mexico for their technical support. Computational analysis was performed using the cluster of the Instituto de Biotecnología-UNAM. We thank Dr. Daniel Aguilar Angeles from the Hospital Juárez de México for facilitating clinical samples. This work was supported by grants 5549 and I0110/184/09 from the National Council for Science and Technology CONACyT-Mexico, and grant PICOSI09-209 from the Instituto de Ciencia y Tecnologia del Distrito Federal. Marina Escalera-Zamudio and Georgina Cobián-Güemes were supported by a scholarship from CONACyT-Mexico. Marina Escalera-Zamudio is currently supported by an EMBO Long Term Fellowship (ALTF376-2017).
Footnotes
Marina Escalera: marina.escalerazamudio{at}zoo.ox.ac.uk
Ana Georgina Cobián Güemes: acobian{at}ibt.unam.mx
Blanca Taboada btaboada{at}ibt.unam.mx
Irma López-Martínez: lopezmi{at}yahoo.com
Joel Armando Vázquez-Pérez: joel.vazquez{at}cieni.org.mx
Maricela Montalvo-Corral maricela.montalvo{at}ciad.mx
Jesús Hernández jhdez{at}ciad.mx
José Alberto Díaz-Quiñonez adiazq{at}unam.mx
Gisela Barrera Badillo: gisvir20{at}yahoo.com.mx
Susana López: susana{at}ibt.unam.mx
Carlos F. Arias: arias{at}ibt.unam.mx