Abstract
Once a suitable reference sequence is generated, genomic differences within a species are often assessed by re-sequencing. Variant calling processes can reveal all differences between two strains, accessions, genotypes, or individuals. These variants can be enriched with predictions about their functional implications based on available structural annotations, i.e. gene models. Although these functional impact predictions on a per-variant basis are often accurate, some challenging cases require the simultaneous incorporation of multiple adjacent variants into this prediction process. Examples are neighboring variants which modify each other's functional impact. Neighborhood-Aware Variant Impact Predictor (NAVIP) considers all variants within a given protein coding sequence when predicting the functional consequences. As a proof of concept, variants between the Arabidopsis thaliana accessions Columbia-0 and Niederzenz-1 were annotated. NAVIP is freely available on GitHub: https://github.com/bpucker/NAVIP.
Introduction
Re-sequencing projects e.g. investigating many individuals or accessions of one species [1–3] are gaining relevance in plant research. Approaches similar to genome-wide association studies which are based on mapping-by-sequencing (MBS) were frequently applied [4–6]. They are boosted by an increasing availability of high quality reference genome sequences [7–12] and dropping sequencing costs [13, 14]. De novo assemblies are still beneficial for the detection of large structural variants [8, 11, 12, 15–17] and especially to reveal novel sequences [8, 11, 12, 18], but the reliable detection of modifying single nucleotide variants (SNVs) can be achieved based on (short) read mappings.
Once identified, the annotation of sequence variants in most species is performed by predicting their functional implications based on the available annotation of genes. Leading tools like ANNOVAR [19] and SnpEff [20] currently perform this prediction by focusing on a single variant at a time. An impact prediction facilitates the identification of targets for post-GWAS analyses [21, 22]. Although the effect prediction for single variants is very efficient and usually correct, there is a minority of challenging cases in which predictions cannot be accurate based on a single variant alone. Multiple InDels could either lead to frameshifts or compensate for each other's effects, leaving the sequence with minimal modifications [23–25]. Two SNVs occurring in the same codon could lead to a different amino acid substitution compared to the apparent effect resulting from an isolated analysis of each of these SNVs.
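The same-codon case can be illustrated with a minimal Python sketch. The reference codon TGG and the two SNVs are hypothetical values chosen for illustration, and the helper names are not taken from any existing tool; only the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


# Hypothetical example: reference codon TGG (Trp) carrying two SNVs.
reference = "TGG"
snv_a = (1, "A")  # second base G>A
snv_b = (2, "T")  # third base G>T

isolated_a = CODON_TABLE[apply_snvs(reference, [snv_a])]       # TAG
isolated_b = CODON_TABLE[apply_snvs(reference, [snv_b])]       # TGT
combined = CODON_TABLE[apply_snvs(reference, [snv_a, snv_b])]  # TAT

print(isolated_a, isolated_b, combined)  # prints "* C Y"
```

Analyzed in isolation, the first SNV looks like a premature stop codon and the second like a Trp-to-Cys substitution, whereas the codon actually present on the haplotype encodes tyrosine.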
Here we present a new tool to accurately predict the combined effect of phased variants on annotated coding sequences. Neighborhood-Aware Variant Impact Predictor (NAVIP) was developed to investigate large variant data sets of plant re-sequencing projects, but is not limited to the annotation of variants in plants. As a proof of concept, NAVIP was deployed to identify cases between the A. thaliana accessions Columbia-0 (Col-0) and Niederzenz-1 (Nd-1) where an accurate impact prediction needs to consider multiple variants at a time [15].
Materials & Methods
Variant detection
Sequencing reads of Nd-1 [15] were mapped to the Col-0 reference genome sequence (TAIR9) [26] via BWA MEM v.0.7.13 [27] using the −m option to avoid spurious hits. Variant calling was performed via GATK v3.8 [28] based on the developers’ recommendation. All processes were wrapped into custom Python scripts (https://github.com/bpucker/variant_calling) to facilitate automatic execution on a high performance compute cluster. An initial variant set was generated based on hard filtering criteria recommended by the GATK developers. The two following variant calling runs considered the set of surviving variants of the previous round as gold standard to avoid the need for hard filtering.
Variant validation
Since a high quality genome sequence assembly of Nd-1 was recently generated [12], we harnessed this sequence to validate all variants identified by short read mapping. Starting at the north end of each chromosome sequence, sorted variants were tested one after the other by taking the upstream sequence from Col-0, modifying it according to all upstream bona fide variants, and searching for it in the Nd-1 assembly (AdditionalFile1). Variants were admitted to the following analysis if the assembly supports them. This consecutive inspection of all variants enabled a reliable removal of false positives.
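The consecutive inspection can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the script used in this study: sequences are plain strings, variants are (0-based position, reference allele, alternative allele) tuples sorted by position, and the function name, the flank length, and the short downstream anchor (added here so that deletions are tested as well) are hypothetical choices.

```python
def validate_variants(reference, assembly, variants, flank=6, downstream=2):
    """Return the variants whose modified sequence context occurs in the assembly.

    variants: list of (position, ref_allele, alt_allele), sorted by position,
    with 0-based positions on the reference. Overlapping variants are skipped.
    """
    accepted = []
    built = ""   # reference sequence with all accepted variants applied
    cursor = 0   # next reference position not yet copied into `built`
    for pos, ref_allele, alt_allele in variants:
        if pos < cursor:
            continue  # overlaps the previous variant
        built += reference[cursor:pos]
        end = pos + len(ref_allele)
        # Upstream flank (already modified), variant allele, short anchor.
        query = built[-flank:] + alt_allele + reference[end:end + downstream]
        if query in assembly:
            accepted.append((pos, ref_allele, alt_allele))
            built += alt_allele
        else:
            built += ref_allele
        cursor = end
    return accepted


# Toy data (hypothetical): the assembly carries an A>T SNV and a 1 bp deletion.
reference = "ACGTTGCAAGGCTTACGGAT"
assembly = "ACGTTGCATGGCTTACGAT"
candidates = [(8, "A", "T"), (13, "T", "G"), (15, "CG", "C")]

validated = validate_variants(reference, assembly, candidates)
print(validated)  # prints [(8, 'A', 'T'), (15, 'CG', 'C')]
```

Because accepted variants are written into the growing upstream sequence, each later query already reflects the bona fide variants before it. A real implementation would need much longer flanks to obtain unique matches in a genome-scale assembly.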
Variant impact prediction
Our Neighborhood-Aware Variant Impact Predictor (NAVIP, https://github.com/bpucker/NAVIP) takes a VCF file containing sequence variants, a FASTA file containing the reference sequence, and a GFF3 file containing the annotation as input. Provided variants need to be homozygous or in a phased state to allow an accurate impact prediction per allele. Effects on all annotated transcripts are assessed per gene by taking the presence of all given variants into account. NAVIP generates a new VCF file with an additional annotation field and additional report files including FASTA files with the resulting sequences (see manual for details: https://github.com/bpucker/NAVIP/wiki).
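The requirement for homozygous or phased variants can be checked on the GT value of a VCF record before running a per-allele prediction. The following Python sketch relies only on the VCF convention that "/" separates unphased and "|" separates phased alleles; the function name is hypothetical, and whether NAVIP performs such a check internally is not stated here.

```python
def usable_for_allele_prediction(genotype):
    """Return True if a VCF GT value is homozygous or fully phased."""
    if "." in genotype:
        return False  # at least one allele call is missing
    if "/" in genotype:
        # Unphased: only a homozygous call assigns alleles unambiguously.
        return len(set(genotype.split("/"))) == 1
    # "|" marks phased alleles; a single haploid allele is unambiguous too.
    return True


for gt in ("1/1", "0/1", "0|1", "./.", "1"):
    print(gt, usable_for_allele_prediction(gt))
```

Unphased heterozygous calls such as 0/1 are rejected because it is unknown which neighboring variants share a haplotype with them.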
Assessing predicted premature stop codons and frameshifts
SnpEff [20] was applied to the validated variant data set to predict the effects of single variants. To assess the influence of the underlying annotation, this prediction was performed based on TAIR10 [26] and Araport11 [29]. Predicted premature stop codons with two variants within the same codon were selected for comparison to the NAVIP prediction, because these cases have the potential to show different results.
Transcripts with predicted frameshifts were analyzed to identify downstream insertions/deletions which compensate each other's effect, i.e. the second frameshift reverts an upstream frameshift. The distance between these events was analyzed by the third module of NAVIP.
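The idea of compensating frameshifts can be sketched by tracking the cumulative length change modulo three along a coding sequence. The Python code below is an illustration under assumptions and not the implementation of the NAVIP module: InDels are given as (CDS position, reference allele, alternative allele) tuples sorted by position, and the function name and toy values are hypothetical.

```python
def find_compensating_indels(indels):
    """Report InDel groups that shift the reading frame and restore it again.

    indels: list of (cds_position, ref_allele, alt_allele), sorted by position.
    Returns a list of (first_position, last_position, distance).
    """
    groups = []
    frame = 0     # cumulative length change modulo 3
    start = None  # position of the InDel that opened the current frameshift
    for pos, ref_allele, alt_allele in indels:
        shift = (len(alt_allele) - len(ref_allele)) % 3
        if shift == 0:
            continue  # in-frame InDel, the reading frame is unchanged
        if start is None:
            start = pos
        frame = (frame + shift) % 3
        if frame == 0:
            # Reading frame restored: the frameshift has been compensated.
            groups.append((start, pos, pos - start))
            start = None
    return groups


# Toy data (hypothetical): a 1 bp deletion, a 1 bp insertion 6 bp
# downstream, and a later 2 bp insertion that stays uncompensated.
toy_indels = [(10, "AT", "A"), (16, "G", "GC"), (40, "C", "CAA")]
print(find_compensating_indels(toy_indels))  # prints [(10, 16, 6)]
```

Only the sequence between the first and the last InDel of a group is translated in a shifted frame, which is why the distance between them matters for the encoded protein.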
Experimental validation of variants
A. thaliana Nd-1 plants were grown as previously described [15]. DNA for PCR experiments was extracted from leaf tissue using a cetyltrimethylammonium bromide (CTAB)-based method as previously described [30]. Oligonucleotides flanking regions with variants of interest were designed manually (AdditionalFile2) and purchased from Metabion (http://www.metabion.com/). Amplification via PCR, analysis of PCR products, purification of PCR products, Sanger sequencing, and evaluation of results was performed as previously described [31].
Results
Variant detection and validation
Nd-1 reads were mapped against the Col-0 reference genome sequence (TAIR9). Based on 124,662,140 mapped paired-end reads, 384,622 variants were detected in the first variant calling round of this study. This initial set was extended over three additional rounds of variant calling, leading to over one million variants. The variant calling was stopped because no substantial increase in the number of novel variants was observed during the last rounds. An assembly based on independent Single Molecule Real Time (SMRT) sequencing reads supported 772,644 (76.6%) of all variants detected during the last iteration (AdditionalFile3, Fig. 1). On average, one variant was observed every 154 bp between Col-0 and Nd-1. SNV frequencies ranged from one event in 225 bp on Chr5 to one event in 158 bp on Chr4. InDel frequencies ranged from one event in 1,051 bp on Chr5 to one event in 809 bp on Chr4.
Distributions of SNVs and InDels over the chromosome sequences of Col-0 were visualized as previously described [15].
Although the repeated variant calling processes were intended to increase the sensitivity, we did not observe a substantial improvement between the second and third round. This saturation indicates that no additional variants would be detected in further variant calling rounds. The number of detected variants as well as the validation rate was almost constant (Table 1).
Experimental validation
Randomly selected loci with two SNVs within one codon were experimentally validated via PCR and amplicon sequencing (Table 2). Successful sequencing reactions show a validation rate of >95%.
Relevance of NAVIP
Running NAVIP on this A. thaliana data set (AdditionalFile4) took about 5 minutes with a single core and a peak memory usage of about 3 GB RAM. Since SnpEff is one of the most frequently applied tools for the annotation of variants, the NAVIP output was compared with SnpEff predictions. SnpEff was applied to the same data set based on the Araport11 annotation. Interesting cases for comparison are codons containing at least two SNVs. Of 75 premature stop codons predicted in such codons by SnpEff, 73 were predicted as amino acid substitutions by NAVIP (Fig. 2). While a single SNV would cause a premature stop codon, the simultaneous presence of two SNVs results in an amino acid encoding codon. In total, 702 premature stop codons were predicted by SnpEff; thus, 9.6% of them were false positives. NAVIP revealed that tyrosine frequently occurs instead of a premature stop codon, because the tyrosine codons are very similar to two of the three stop codons.
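The similarity between tyrosine codons and stop codons can be checked directly by counting nucleotide differences. The short Python sketch below uses only the standard genetic code; the variable and function names are chosen for illustration.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")
TYROSINE_CODONS = ("TAT", "TAC")


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Stop codons that are exactly one substitution away from each Tyr codon.
neighbours = {
    codon: [stop for stop in STOP_CODONS if hamming(codon, stop) == 1]
    for codon in TYROSINE_CODONS
}
print(neighbours)  # prints {'TAT': ['TAA', 'TAG'], 'TAC': ['TAA', 'TAG']}
```

Both tyrosine codons differ from TAA and TAG only at the third position, so a second SNV at that position is sufficient to turn an apparent premature stop codon into a tyrosine codon.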
Premature stop codons predicted by SnpEff (pink) are frequently amino acid substitutions if a second variant is located within the same codon. NAVIP revealed 73 false positive predictions of premature stop codons by SnpEff which are in fact amino acid substitutions (green).
InDels can compensate each other's frameshift when occurring together. Since premature stop codons can emerge by chance following a frameshift, the distance between such InDels was analyzed. This length distribution revealed that most compensating InDels (cInDels) occur within a short distance of 2–8 bp (Fig. 3). Multiples of three are more frequent than other distances of a similar size.
An InDel can compensate the frameshift caused by an upstream InDel. Distances between such cInDels are short and frequently multiples of three. In total, 484 genes contain cInDels.
Discussion
Variant validation, frequency, and distribution
Although differentiation between bona fide variants (true positives) and false positives based on a high quality genome sequence assembly worked very well, false negatives were not taken into account and might even bias this classification approach by preventing the validation of neighboring variants (AdditionalFile1). If a variant is missed by the initial variant calling, its presence in the flanking sequence used during the validation process will prevent a proper match. Therefore, the number of variants could be slightly higher than reported here. Nevertheless, this conservative approach was selected to minimize the risk of keeping false positive variants. There is always a trade-off between sensitivity and specificity in the variant calling process [32], and our approach strongly favors specificity. However, the number of identified and validated variants exceeds previous reports of 485,887 variants between Col-0 and Nd-1 [15]. Instead, the observed variant frequency is closer to the results of a comparison between Bur-0 and Col-0 [33]. Despite the difference in total numbers, the distribution on the chromosome scale is similar to the previous comparison of Col-0 and Nd-1 [15]. It seems that Chr4 is the most variable chromosome, while Chr5 is the least variable one between both compared accessions.
Successful validation via PCR and amplicon sequencing supported the presence of two SNVs within one codon. Although these variants are perceived as two SNVs, the underlying mechanism could be a multiple nucleotide polymorphism (MNP). It would be interesting to see if these SNVs occur independently in other accessions in the A. thaliana population.
Functional implications of variants
We developed NAVIP to assess the impact of neighboring variants on protein coding sequences. The presence of the 557 cases described here for the comparison of two A. thaliana accessions demonstrates the necessity to have such a tool at hand. NAVIP revealed the presence of second site mutations that compensate other variants, e.g. turning a premature stop codon into an amino acid substitution or compensating a frameshift. The purpose of NAVIP is not to replace existing tools, but to add novel functionalities to established tools like SnpEff [20]. This could boost the power of re-sequencing studies by opening up the field of compensating or, in general, mutually influencing variants. Such variants have the potential to reveal new insights into patterns of molecular evolution and especially co-evolution of sites. Although the number of cases is probably small, the consideration of multiple variants during the effect prediction could reveal novel targets in GWAS-like approaches. The remaining challenge is now the reliable detection of sequence variants prior to the application of NAVIP. For heterozygous species, phasing of these variants is another task that needs to be addressed. The correct prediction of functional implications relies on the correct assignment of variants to respective haplophases. If provided with accurately phased variants, NAVIP can perform predictions for highly heterozygous and even polyploid species.
Availability of data
The data sets supporting the results of this article are included within the article and its additional files. Python scripts developed and applied for this study are available on GitHub: https://github.com/bpucker/NAVIP (https://doi.org/10.5281/zenodo.2620396) and https://github.com/bpucker/variant_calling (https://doi.org/10.5281/zenodo.2616418).
Authors’ contribution
BP designed research. JSB wrote the NAVIP code. JSB, DH, and BP conducted bioinformatic analyses. DH and BP performed experimental validation. BP wrote the manuscript. All authors read and approved the final version.
Additional Files
AdditionalFile1: Schematic illustration of the variant validation process.
AdditionalFile2: Oligonucleotide sequences used for the validation of randomly selected variants.
AdditionalFile3: Final set of validated variants.
AdditionalFile4: NAVIP annotation of variants between Nd-1 and Col-0.
Acknowledgements
We acknowledge support by members of Genetics and Genomics of Plants, Bioinformatics Resource Facility, and Sequencing Core Facility at the Center of Biotechnology. We thank Hanna Schilbert for critical reading of the manuscript.
Footnotes
JSB: janbaas{at}cebitec.uni-bielefeld.de DH: dhoward{at}cebitec.uni-bielefeld.de BP: bpucker{at}cebitec.uni-bielefeld.de