NAVIP: Unraveling the Influence of Neighboring Small Sequence Variants on Functional Impact Prediction

Once a suitable reference sequence has been generated, intraspecific variation is often assessed by re-sequencing. Variant calling processes can reveal all differences between strains, accessions, genotypes, or individuals. These variants can be enriched with predictions about their functional implications based on available structural annotations, i.e. gene models. Although these functional impact predictions on a per-variant basis are often accurate, some challenging cases require the simultaneous incorporation of multiple adjacent variants into this prediction process. Examples include neighboring variants which modify each other’s functional impact. The Neighborhood-Aware Variant Impact Predictor (NAVIP) considers all variants within a given protein coding sequence when predicting the functional consequences. As a proof of concept, variants between the Arabidopsis thaliana accessions Columbia-0 and Niederzenz-1 were annotated. NAVIP is freely available on GitHub (https://github.com/bpucker/NAVIP) and accessible through a web server (https://pbb-tools.de). Author Summary Intraspecific variation gains increasing relevance as reference genome sequences are available for many investigated (plant) species. Understanding the functional consequences of sequence variants between individuals of a population is a challenge. SnpEff (Cigolnai et al., 2012) is the current standard tool to predict the functional impact of sequence variants, but does only consider one sequence variant at the time. We developed NAVIP to properly handle cases in which multiple sequence variants are clustering together and influence each other’s functional impact. A comparison of two Arabidopsis thaliana accessions demonstrates the relevance of considering multiple sequence variants simultaneously for the prediction of changes in encoded proteins. NAVIP is universally applicable to non-model organisms. All underlying code is freely available from GitHub and we operate a web server for users’ convenience.

Once identified, the annotation of sequence variants is performed by predicting their functional implications based on the available gene models (structural annotation).Leading tools such as ANNOVAR [25] and SnpEff [26] currently perform this prediction efficiently by focusing on a single variant at a time.An impact prediction facilitates the identification of targets for post-GWAS analyses and can lead to the identification of small sequence variants that form the molecular basis of economically relevant phenotypic differences [6,27,28].Although the effect prediction for single variants is computationally efficient and usually correct, there is a minority of challenging cases in which predictions based on a single variant alone cannot be accurate.
Multiple InDels could either lead to frameshifts or they could compensate for each other's effect leaving the sequence with minimal modifications [29][30][31].Two SNVs occurring in the same codon could lead to a different amino acid substitution compared to the apparent effects resulting from an isolated analysis of each of these SNVs.
Here we present a computational tool for accurately predicting the combined effect of phased variants on annotated coding sequences.The Neighborhood-Aware Variant Impact Predictor (NAVIP) was developed to investigate large variant data sets of plant re-sequencing projects, but is not limited to the annotation of variants in plants.As a proof of concept, NAVIP was used to identify cases between the A. thaliana accessions Columbia-0 (Col-0) and Niederzenz-1 (Nd-1) where an accurate impact prediction needs to consider multiple variants at a time.

Features of NAVIP
NAVIP predicts the functional impact of sequence variants by considering all sequence variants affecting the coding sequence of a gene simultaneously.Users need to supply a set of sequence variants (VCF), a reference genome sequence (FASTA), and a structural annotation (GFF3).NAVIP returns an annotated VCF file and FASTA files with corrected coding and polypeptide sequences.If phased sequence variants are provided in the VCF file, NAVIP performs separate analyses for the different haplophases.
There are multiple installation options described in the GitHub repository (https://github.com/bpucker/NAVIP)and NAVIP is also available free of charge through a web server (https://pbb-tools.de/NAVIP).This makes NAVIP accessible to a wide range of users and applicable to data sets of various sizes.Uploaded files are only used for the intended analysis and deleted 48 hours after offering the results for download.The web server is able to send notification emails upon completion of a job which can serve as documentation and facilitates the analysis of large data sets.

Relevance of NAVIP for predicting premature stop codons
Running NAVIP on an A. thaliana Nd-1 data set with 644,261 SNVs (S1 File, S2 File) took about 5 minutes with a single core and a peak memory usage of about 3 GB RAM.To the best of our knowledge, SnpEff is the most frequently used tool for the annotation of variants and also universally applicable.Consequently, the NAVIP output was compared with the SnpEff predictions produced for the same data set and structural annotation.The results are largely congruent, but interesting cases for comparison are predictions of premature stop codons as these have severe biological consequences.While a single SNV would cause a premature stop codon, the simultaneous presence of two SNVs can result in an amino acid encoding codon (Figure 1a).Of 600 premature stop codons predicted by SnpEff, 144 were identified as amino acid substitutions when considering multiple SNVs in the same codon via NAVIP (Figure 1b).
Given the total of 600 predicted premature stop codons in this Nd-1 data set, 24% were false positive predictions.NAVIP revealed that tyrosine occurs frequently instead of a premature stop codon, because the tyrosine codons are very similar to two of the three stop codons.There are also additional 17 premature stop codons predicted by NAVIP, which are the consequence of two sequence variants influencing the same codon.Despite the surprisingly large difference between the SnpEff and NAVIP results when it comes to predicting premature stop codons, the differences in affected genes are smaller.Many genes with a predicted premature stop codon have multiple downstream premature stop codons.While an individual prediction of a premature stop codon might be wrong, the gene can still be correctly identified by both tools (S3 File).
When a premature stop codon results in a loss-of-function event, the accumulation of additional variants is likely due to a lack of purifying selection.To support the assumption that genes with premature stop codons lost their function, the rate of amino acid changing variants in these genes was compared to all other genes (Figure 1c, Figure 1d).The number of variants changing amino acids (dN) to those resulting in the same amino acid (dS) were calculated for all genes.A significantly higher proportion of amino acid changing variants were observed in genes with predicted premature stop codons compared to all other genes (Mann-Whitney U test, p-value=10 -167 ).An additional comparison of the average expression of genes with a premature stop codon against all other protein encoding genes (Figure 1e and f) revealed also a significant difference (Mann-Whitney U text, p-value=10 -70 ).changing variants is significantly higher in genes with premature stop codons predicted by NAVIP (blue) compared to all other genes (red).(d) The proportion of amino acid changing variants is significantly higher in genes with premature stop codons predicted by SnpEff (blue) compared to all other genes (red).Data underlying these visualizations are available in S3 File.
(e) Comparison of the average expression of genes with a premature stop codon predicted by NAVIP against all other protein encoding genes with available expression data.(f) Comparison of the average expression of genes with a premature stop codon predicted by SnpEff against all other protein encoding genes with available expression data.

Role of compensating InDels (cInDels)
InDels can compensate for each others' frameshift when occurring together (Figure 2a).While the first InDel can alter the reading frame, the second one could revert the reading frame back to the original one thus resulting in only a few altered codons enclosed by the two events.Since premature stop codons can emerge in the novel codons following the first frameshift, the distance between such InDels was is expected to be very small.An analysis of the distance distribution of the InDels between Nd-1 and Col-0 (S4 File) revealed that most compensating InDels (cInDels) occur within a very short distance of 2-8 bp (Figure 2b).Multiples of three are more frequent than other distances of a similar size, which might be connected to the length of codons.

Discussion
This study demonstrates features of NAVIP by utilizing a previously generated set of high confidence sequence variants [35].There is always a trade-off between sensitivity and specificity in the variant calling process [35,40] and the aim in this study was to obtain a high specificity data set (see S1 File for details).The benchmarking of NAVIP through comparison with SnpEff neutralizes the influence of the sequence variant data set quality on the outcome.
As an additional validation of the outcome, NAVIP results were analyzed for additional amino acid substitutions in genes with premature stop codons.The frequency of such variants was higher in genes with premature stop codons compared to others suggesting a lack of purification selection in these genes which could point to pseudogenization.The comparison against all other genes also clearly revealed the increased frequency of amino acid substitutions in genes with premature stop codons.Additionally, a low expression of genes with premature stop codons compared to other genes suggests a pseudogenization.In summary, the properties observed for genes with premature stop codons match the expectations thus supporting the biological validity of the data set.
We developed NAVIP to simultaneously assess the impact of all neighboring sequence variants in protein encoding sequences and to be universally applicable.The described cases in the comparison of two A. thaliana accessions demonstrate the necessity to have such a tool at hand.NAVIP revealed the presence of second site mutations that compensate for other variants e.g.turning a presumed premature stop codon into an amino acid substitution or vice versa.
Another example are frameshifts resulting from InDels that are compensated by downstream InDels, which shift the reading frame back to the original pattern.Neglecting these interactions of sequence variants during the functional impact prediction can lead to mis-annotation.While NAVIP was developed to accurately predict changes in the polypeptide sequence based on DNA sequence variants, downstream tools are needed to predict consequences of these changes on the function of proteins.Tools like SIFT [45], PolyPhen-2 [49], or SNAP2 [50] could be applied for this next step.Many computational tools for the assessment of DNA sequence variant impact focus on human data sets [41][42][43][44].The objective is often to identify pathogenic variants [45,46].Universally applicable tools like SnpEff [26], which are also suitable to analyze plant data sets, predict the impact of isolated sequence variants.The purpose of NAVIP is to offer novel functionalities to the plant science community and other communities working on non-model organisms.NAVIP could boost the power of re-sequencing studies by opening up the field of compensating or in general mutually influencing variants.Such variants have the potential to reveal new insights into patterns of molecular evolution and especially co-evolution of sites.The consideration of multiple variants during the effect prediction could reveal novel targets in GWAS-like approaches.The availability through a web server enables a large community of scientists without computational skills to benefit from NAVIP.
The remaining challenge is now the reliable detection of sequence variants prior to the application of NAVIP.A range of tools is available for the mapping of short reads and the following identification of sequence variants [35].There is also rapid progress in the development of long read mapping tools [51,52] and the following variant identification [53][54][55][56].
For heterozygous and polyploid species, phasing of these variants is another task that needs to be addressed in the future.The correct prediction of functional implications relies on the correct assignment of variants to respective haplophases.If provided with accurately phased variants, NAVIP can perform predictions for highly heterozygous and even polyploid species.Previous studies demonstrated that sequence variants might only affect individual isoforms in a negative way [46].NAVIP analyzes all annotated transcript isoforms and would be able to discover such cases.Currently, a major limitation is the lack of isoform-resolved annotation for non-model plant species.Given the rapid progress in long read sequencing [13,47,48], it is likely that highly accurate structural annotation will become available for most plant species in the next few years.

Materials and Methods
Implementation of the Neighborhood-Aware Variant Impact Predictor (NAVIP) The Neighborhood-Aware Variant Impact Predictor (NAVIP) (https://github.com/bpucker/NAVIP) is implemented in Python3.NAVIP requires a VCF file containing sequence variants, a FASTA file containing the reference sequence, and a GFF3 file containing the structural annotation (gene models) as input.The variants provided must be homozygous or in a phased state to allow an accurate impact prediction per allele.If no information about the phasing is provided, all variants are assumed to be in the same haplophase.Effects on all annotated transcripts are evaluated per gene by taking into account the presence of all given variants simultaneously.
NAVIP generates a new VCF file with an additional annotation field and additional report files.
One annotation string in the VCF output file matches the SnpEff result format, but also has a NAVIP-specific string with additional information (see the manual for details: https://github.com/bpucker/NAVIP/wiki).NAVIP also produces FASTA files with sequences harboring all variants.An automatic assessment of complementing InDels (cInDels) reveals the relevance of simultaneously considering all InDels within a coding sequence when predicting their impact.All NAVIP scripts can be downloaded from the above-mentioned GitHub repository and do not require the installation of any dependencies other than the Python packages.NAVIP is also available through a web server (https://pbb-tools.de/NAVIP)free of charge.Files are kept confidential and will be deleted 48 h after offering the results for download.

Identification and validation of sequence variants
Illumina sequencing reads of A. thaliana Nd-1 [16] were mapped to the A. thaliana Col-0 reference genome sequence (TAIR10) [32] via BWA MEM v.0.7.13 [33] using the -m option to avoid spurious hits.Variant calling was performed via GATK v3.8 [34] based on the developers' recommendation.This combination of BWA MEM and GATK was previously identified as a reliable approach for this particular data set [35].All processes were wrapped into Python scripts (https://github.com/bpucker/variant_calling) to facilitate automatic execution on a high-performance compute cluster.An initial variant set was generated based on hard filtering criteria recommended by the GATK developers.The two following variant calling runs considered the set of surviving variants from the previous round as the gold standard to avoid the need for hard filtering.
Since a high-quality genome sequence assembly of Nd-1 was previously generated [17], we harnessed this sequence to validate all variants identified by short-read mapping.Starting at the north end of each chromosome sequence, variants sorted by genomic position were successively tested by taking the upstream sequence from Col-0, modifying it according to all upstream bona fide variants, and searching for it in the Nd-1 assembly (S5 File).Variants were admitted to the following analysis if the assembly supported them.This consecutive inspection of all variants enabled a reliable removal of false positives, leading to a set of high-confidence variants.The genome-wide distribution of the sequence variants was assessed using a previously developed Python script [16].
An independent confirmation of randomly selected sequence variants was performed using Sanger sequencing.A. thaliana Nd-1 plants were grown as previously described [16] to extract DNA from leaf tissue using a cetyltrimethylammonium bromide (CTAB)-based method [36].
Oligonucleotides flanking the regions that harbor the variants of interest were designed manually (S6 File).Amplification via PCR, analysis of PCR products via agarose gel electrophoresis, purification of PCR products, Sanger sequencing, and evaluation of results were following previously established protocols [37].

Comparison of NAVIP and SnpEff stop gain prediction
To the best of our knowledge, SnpEff [26] is the most widely used tool for predicting the functional consequences of sequence variants, thus it was selected for comparison.NAVIP can only provide more accurate predictions about functional consequences if multiple sequence variants interfere, e.g. if multiple SNVs are located within the same codon.Otherwise, the predictions of NAVIP and SnpEff would be the same.Consequently, the following comparison focuses only on cases of multiple sequence variants that might interfere with each other.SnpEff v4.1f [26] was applied with default parameters to the A. thaliana Nd-1 variant data set to predict the effects of SNVs based on the Araport11 [38] structural annotation of the TAIR10 genome sequence of A. thaliana Columbia-0.NAVIP was also applied to the same data set for benchmarking.Predictions of premature stop codons were compared between NAVIP and SnpEff results, as these cases have the potential to show biologically important differences.This analysis was performed exclusively on SNVs to avoid the influence of frameshifts that would be caused by InDels.Only the most upstream predicted premature stop codon within any gene was considered in this analysis.To support the loss of function of the affected genes, the frequency of amino acid changing variants (dN) was compared to the number of variants that did not alter the encoded amino acid (dS).This ratio was compared between genes with premature stop codons and all other genes, expecting a higher ratio of variants that change the encoded amino acids if the gene undergoes pseudogenization.The Python package plotly was used to visualize these data distributions in violin plots.If no dS events were discovered in an analyzed gene, a pseudocount was added to both dN and dS to enable the ratio calculation.dN/dS ratios greater than 10 were set to this maximum value to enable visualization.A Mann-Withney U test was performed using Python to test for significant differences between the two groups.When genes with a premature stop codon undergo pseudogenization, they may show lower than average gene expression.Therefore, a comparison of the expression of genes with a premature stop codon against all other protein-coding genes was performed.A previously compiled count table based on all publicly available paired-end RNA-seq data sets of A. thaliana [39] was harnessed for this analysis.Differences were visualized using the Python package plotly as described above, with the expression values clipped at 50 to enable an informative visualization.All Python scripts developed for these analyses are freely available on GitHub (https://github.com/bpucker/variant_calling).

Assessment of compensating InDels (cInDels)
An independent analysis of insertions/deletions (InDels) was performed by NAVIP to understand the relevance of considering all InDels within a CDS simultaneously.Transcripts with predicted frameshifts were analyzed to identify downstream insertions/deletions which are compensating each other's effect, i.e. the second frameshift is reverting an upstream frameshift.The distance between these events was analyzed by NAVIP and is included in the standard output.This analysis is not restricted to pairs of cInDels, but can also handle multiple InDels compensating each other's frameshifts.

Figure 1 :
Figure 1: (a) This illustration shows the concept of two SNVs affecting the same codon resulting

Figure 2 :
Figure 2: (a) Theoretical concept of two InDels compensating each others' frameshift.The first