Abstract
Variable Number Tandem Repeats (VNTRs) account for a significant amount of human genetic variation. VNTRs have been implicated in both Mendelian and Complex disorders, but are largely ignored by whole genome analysis pipelines due to the complexity of genotyping and the computational expense. We describe adVNTR-NN, a method that uses shallow neural networks for fast read recruitment. On 55X whole genome data, adVNTR-NN genotyped each VNTR in less than 18 cpu-seconds, while maintaining 100% accuracy on 76% of VNTRs.
We used adVNTR-NN to genotype 10,264 VNTRs in 652 individuals from the GTEx project and associated VNTR length with gene expression in 46 tissues. We identified 163 ‘eVNTR’ loci that were significantly associated with gene expression. Of the 22 eVNTRs in blood where independent data was available, 21 (95%) were replicated in terms of significance and direction of association. 49% of the eVNTR loci showed a strong and likely causal impact on the expression of genes and 80% had maximum effect size at least 0.3. The impacted genes have important role in complex phenotypes including Alzheimer’s, obesity and familial cancers. Our results point to the importance of studying VNTRs for understanding the genetic basis of complex diseases.
1 Introduction
The human genome consists of millions of tandem repeats (TRs) of short nucleotide sequences. These are often termed as Short Tandem Repeats (STRs) if the repeating unit is < 6bp, and Variable Number Tandem Repeats (VNTRs) otherwise. Together, they represent one of the largest sources of polymorphisms in humans1–3. While multiple resources have been developed for genome-wide analysis of STRs, here we focus specifically on VNTRs, which have been largely missing from genome-wide studies due to technical challenges of genotyping and the computational expense.
We define VNTR genotyping in the narrower sense of determining VNTR length (number of repeating units). As VNTRs can be located in coding regions4, untranslated regions5, and regulatory regions proximal to a gene6, 7, the variation in length can have an outsized functional impact. Not surprisingly, VNTRs have been implicated in a large number of Mendelian diseases that affect millions of people world-wide8–10. They also are known to modulate quantitative phenotypes in several other organisms11, and have shown pathogenic effects in other vertebrates12, 13. VNTRs have influenced the evolution of primates14 and specifically contributed to human evolution through gene regulation and differentiation of the great ape population15. Recent studies have identified VNTRs that have expanded in the human lineage or are differentially spliced or expressed between human and chimpanzee brains16.
Single nucleotide polymorphisms (SNPs) that associate with gene expression, often referred to as expression Quantitative Trait Loci (eQTLs), are molecular intermediates that drive disease and variation in complex traits17–19. Studies have shown that causal variants for diseases often overlap with cis-eQTL variants in the affected tissue20, 21. Therefore, we focus on the specific application of identifying expression mediating VNTRs (‘eVNTRs’), or VNTRs located in regulatory regions whose length is correlated with the expression of a proximal gene. Examples of ‘eVNTRs’ are: a) a VNTR in the 5’ UTR of AS3MT which is strongly associated with AS3MT gene expression and lies in a schizophrenia associated locus5; and b) a 12-mer expansion upstream of the cystatin B (CSTB) gene is associated with gene expression and with progressive myoclonus epilepsy10, 22.
Despite their importance, the full extent of VNTRs in mediating Mendelian and complex phenotypes is not known due to genotyping challenges. Traditionally, VNTR genotyping used capillary electrophoresis which did not scale to large cohorts. Despite the advent of sequence based genotyping, repetitive sequences continue to be challenging for genomic analysis23. For example, ‘stutter errors’ due to polymerase slippage during PCR amplification change VNTR length and reduce geno-typing accuracy24. While tools for genotyping STRs have been developed1,25,26, they generally do not detect or genotype VNTRs, which have non-identical and larger repeat units. Recently, a few specialized computational methods (including our own) have been published to tackle the problem of genotyping VNTRs from sequence data27, 28. However, these methods are too computationally intensive to scale to functional studies with hundreds of individuals and 104 VNTR loci (Results).
For these reasons, large-scale studies of VNTRs and their association with gene expression have been limited when compared to other sources of human variation such as SNPs and CNVs21,29–31. While the standard whole genome sequencing (WGS) frameworks often ignore repetitive regions23,there is some progress towards ‘harder’ variant classes such as eSTRs32–34 and ‘eSVs’31, 35. Therefore, ‘missing heritability’–the gap between estimates of heritability, measured for example by twin studies36, 37, and phenotypic variation explained by genomic variation– remains a limitation for eQTL studies38. It has been speculated that the inclusion of tandem repeats in association analyses may reduce this heritability gap8,38,39.
Here, we describe adVNTR-NN, a method that uses shallow neural networks for fast read recruitment followed by sensitive Hidden Markov Models for genotyping. We tested the speed and accuracy of adVNTR-NN on extensive simulations to demonstrate accuracy. We used adVNTR-NN to genotype over 10,000 VNTRs in 652 individuals from the GTEx project and associate VNTR length with gene expression in 46 tissues. We additionally validated eVNTRs in blood tissues in 903 samples from an Icelandic cohort and 462 samples from the 1000 genome project with Gene expression data (Geuvadis cohort). We compared the strength of eVNTR association against proximal SNPs to understand causality, and tested association with complex phenotypes. Our results suggest that it is computationally feasible to genotype VNTRs accurately in thousands of individuals, and multiple eVNTRs are likely to causally impact the expression of key genes involved in common and complex diseases.
2 Results
Target VNTR Loci
Using Tandem Repeat Finder40, 502,491 VNTRs were identified in the GRCh38 human assembly. Over 80% of these had total length < 140bp (Fig. 1a) and could be genotyped using Illumina sequencing. As genotyping VNTRs remains computationally expensive, we focused on the 13, 081 VNTRs located within coding, untranslated, or promoter regions of genes (Methods) as they are most likely to be involved in gene regulation. Of those, we identified 10, 262 VNTRs that were within the size range for short-read genotyping (Fig. 1a). We added two additional VNTRs that were previously linked to a human disease (Supp Table S1) to obtain 10, 264 target loci41, 42.
2.1 adVNTR-NN improves genotyping speed
Our previously published tool, adVNTR, used customized Hidden Markov Models (HMMs) for each VNTR and showed excellent genotyping accuracy, based on trio-analysis, simulations and PCR27. However, HMMs are compute-intensive, and despite some filtering strategies used by adVNTR(Methods), the time to genotype n=10K VNTRs was about 631 hours per individual. In developing adVNTR-NN, we first made significant improvements to pre-processing time. Next, we deployed a second filtering step with a 2-layer feed-forward network trained separately for each VNTR that accepted the k-mer composition for each read and filtered it specifically for that VNTR (Fig. 1b,c and Methods). The neural-network filter required 0.03s per read, and filtered reads with high efficiency in filtering reads. For 55X whole genome sequencing (WGS) with r = 4.2 × 106 unmapped reads, the NN supplied an average of 14 previously unmapped reads to each VNTR HMM. Combining with the mapped reads, each HMM received an average of 32 reads per VNTR locus. This reduced the running time for n VNTR loci to allowing each individual to be genotyped at n=10K VNTRs in 50 CPU hours, a 13× speedup over adVNTR.
adVNTR-NN outperforms alternative alignment strategies at VNTRs
While adVNTR was highly accurate by itself, its final accuracy depended upon reads filtered for genotyping, and specifically on false negatives–reads that were incorrectly removed by a filter. Formally, a read sampled from a VNTR was considered to be true positive (TP) if it passed the filter for that VNTR, and false negative (FN) otherwise. False positives (FP)–reads that passed the filter despite not being from the VNTR locus–were a lesser concern because they would eventually be discarded by the HMM for not aligning well to the model. However, high false-positives increase the running time. To account for this, we measured the trade-off between efficiency (1 −(TP+FP)/r) and recall TP/(TP+FN).
For comparisons with alternative filters, we used Bowtie2 as a representative read-mapping tool43. These tools are designed for fast mapping of reads and are accurate for most of the genome, but are not specifically designed for VNTR mapping genotyping (could have high FN). As a second comparison, we used adVNTR27, which has high recall (low FN), for VNTR mapping, and other graph based models in terms of sensitivity44, 45. We used a mix of real and simulated reads to test performance (Methods).
In terms of efficiency (1-(TP+FP)/r), Bowtie2 was the most efficient retaining only 0.9 in 106 reads for further processing for 90% of the VNTRs. Both adVNTR and adVNTR-NN were slightly less efficient retaining about 1.2 reads per million for 90% of the VNTRs. However, they had significantly better recall. adVNTR-NN filtered reads with at least 90% recall for 99% of the target VNTR loci (Fig. 1e). In comparison, 80% of the loci achieved that recall for adVNTR, and only 27% of the loci had a recall of 90% for Bowtie2. Notably, adVNTR-NN had much better recall compared to adVNTR while also being more efficient, and therefore faster.
We had previously shown that improvement in recruitment improves genotyping accuracy27. Here, we used a mix of whole genome sequencing data and simulated reads (Methods) to compare the overall running time and accuracy of adVNTR-NN genotyping with VNTRseek28, which was not available at the time of original release of adVNTR. Notably, VNTRseek combines VNTR discovery and genotyping and does not customize genotyping for each VNTR. Therefore, its running time on 55X WGS ranged from 9640-9686 minutes, and was largely independent of the number of target VNTRs (Supp. Fig. S1). This was in contrast to the 1,696 minutes required by adVNTR-NN. The speed advantage for adVNTR-NN could largely be attributed to filtering strategies which could potentially be used to improve VNTRseek genotyping time as well. On simulated heterozygous reads with 30X coverage (Methods), adVNTR-NN was highly accurate. It achieved 100% accuracy in 7343 (76%) of 9638 VNTRs compared to VNTRseek’s median accuracy of 60% (Supp. Fig. S3). In contrast with adVNTR-NN, VNTRseek’s genotyping accuracy was sharply asymmetric, with much lower accuracy for decreasing VNTR length (Supp. Fig. S2).
2.2 Profiling eVNTRs
Data
To identify expression-mediating VNTR Loci (eVNTRs), we primarily used data from the GTEx project21 (Methods). The GTEx project provided WGS for 652 individuals as well as RNA-seq for each of these individuals from 46 tissue types including whole-blood. A majority (86.0%) of the donors were of European origin; another 11.5% were African American and the remaining were Asian and American Indian. For validation, we used a second cohort of 903 Icelandic individuals46 with associated whole blood RNA expression data and WGS. We also chose a smaller, third cohort from the Geuvadis30 project which provided gene-expression data in lymphoblastoid cell-lines for 462 samples, where the WGS for the samples was available from the 1000 genomes project47. 80.7% of the Geuvadis cohort was individuals with predominantly European (80.7%) ancestry and the remaining had African ancestry (19.3%). Due to the match of tissue type and ethnicity, the Icelandic and Geuvadis whole blood data were used for validation of methods for identifying eVNTRs discovered from the GTEx project.
eVNTR identification
We genotyped 10,264 VNTR loci in all 652 samples from GTEx to study the role of VNTRs in mediating gene expression of proximal genes. As expected, the most frequent allele matched the reference allele in 96.8% of the cases (Supp. Fig. S4).
Despite the GTEx data being predominatly European, 51% of the target VNTRs were polymorphic. Consistent with evolutionary constraints, VNTRs in promoters were most likely to be polymorphic (57%) followed by Untranslated regions (UTRs) (51%) and coding exons (47%) (Fig. 2a). Each individual in the GTEx cohort had a non-reference allele in at least 839 of the tested VNTR loci, with an average of 1,259 non-reference VNTRs per individual. Altogether, the 10,264 VNTRs inserted or deleted an average of 47,197bp per individual (Fig. 1f). As this represents < 10% of all VNTRs, the results highlight VNTRs as an important source of genomic variation. The minimum variation in a non-reference VNTR allele involved at least 6 basepairs and the average change in each variant site was 37bp or about 3 repeat units (Suppl. Fig. S5).
We excluded VNTRs that were monomorphic (1817), violated Hardy-Weinberg equilibrium constraints (1445) or had low minor allele frequency (<1%) after removing individuals in the GTEx cohort with no expression data for the specific gene (4330) (Methods), resulting in 2,672 genotyped VNTRs for association analysis. We used linear regression to measure the strength of association between average VNTR length of the two haplotypes, and adjusted gene expression level of the closest gene (Fig. 2b and Methods). To account for confounding factors, we included sex and population principal components of each individual as covariates. We also added PEER (probabilistic estimation of expression residuals) factors to account for experimental variations in measuring RNA expression levels (e.g batch effects, environmental variables)48. Briefly, PEER infers hidden covariates influencing gene expression levels, and we removed their effect by producing a residual gene expression matrix and using it for linear regression (See Methods).
We measured association with gene expression in each of the 46 tissues. To control False Discovery Rate (FDR), we used the Benjamini-Hochberg procedure to identify a tissue-specific 5% FDR cutoff (Supp. Fig. S6 and Methods). Combining data from all tissues, 759 tests tied to 163 unique VNTR loci passed the significance threshold (Fig. 2c). We refer to these (VNTR, gene) pairs as eVNTRs. The strength of association did not depend upon the location of the VNTRs in either core promoter, UTR, or coding regions. (Supp. Fig. S7). However, we VNTRs within 100bp of the Transcription Start Sites (TSS) were twice as likely to be eVNTRs compared to other locations (P = 6 × 10−6; Fisher’s exact test), consistent with their known roles in core-promoters49.
The number of eVNTRs observed in each tissue type was different but mostly consistent with the number of individuals samples for each tissue type. Only 4% of the eVNTRs were tissue specific, with each tissue containing a similar number of tissue specific eVNTRs (Fig 2d). An analysis using mash50 showed that many (38%) eVNTRs were significant in at least half (23) of the tissues tested (Fig. S8).
Twenty-three of the 163 unique eVNTRs showed significant association in whole blood (Table 1), a tissue type in which we could validate the eVNTRs using independent data from the Icelandic cohort of 903 individuals. Two of the 23 VNTR loci could not be genotyped in the Icelandic cohort due to missing data. 18 (86%) of the 21 VNTRs showed significance at a similar level and same direction of effect in Icelanders, highlighting the strong reproducibility of the associations. The Geuvadis data were acquired for a smaller cohort compared to the Icelandic data and measured expression in lymphoblastoid cells–transformed B cells, which are a component of whole blood tissue. Nevertheless, 12 of the eVNTRs were replicated. Combined, 91% (20/22) of eVNTRs could be replicated in an independent cohort where data was available.
In 65% of the cases, VNTR length had a positive correlation with gene expression; the remaining cases had a negative correlation (Fig. 2e). This was consistent with the hypothesis that many VNTRs encode transcription factor binding sites and increasing length improved the TF binding affinity. Moreover, the overall effect size was also large and 80% of the eVNTRs had a maximum effect-size 0.3 or higher.
We computed correlation of eVNTR effect size between each pair of tissues using the Spearman rank test. Despite the multi-tissue activity of most eVNTRs, each tissue showed distinct behavior with low correlation to most other tissues (Fig. 2f). Similar tissue types were expectedly correlated (e.g. brain). Some correlations were seen among glandular tissues (salivary, prostate, pitutary) and also between adipose tissue and nearby tissues and organs (heart, esophagus, artery, breast). Thus, even though most eVNTRs are shared across tissues, we hypothesize that the combined effect of active eVNTRs is tissue-specific and leads to unique regulatory program for each tissue type.
Similar to SNPs, VNTR loci generally showed a negative correlation between Minor Allele Frequency (MAF) and effect size, so that common variants generally had low effect size with larger effects mainly shown by rare variants51 (Fig 2g). However, we still observed many eVNTRs where common VNTR (MAF > 0.05) showed large effects. These eVNTRs had highly significant p-values (Supp. Fig. S9) and in many cases, the proximal genes were associated with known diseases or phenotypes (Table 2). As these represent potentially the most interesting eVNTR findings, we tested them further for causality and function.
VNTRs mediate expression of key genes
Only a small number of examples have been reported where VNTR repeat unit counts have a causative on gene expression5. One well known example is the AS3MT gene which is involved in early brain development, where the VNTR was associated with expression and was in LD with SNPs associating with schizophrenia5.
To investigate causality, we ranked each eVNTR against all SNPs within 100kbp by (a) comparing the relative significance of association with gene expression; and (b) using the tool CAVIAR52 to measure the causality of association (Methods). Remarkably, the two rankings were very similar with mean discrepancy 2|r1 − r2|/(r1 + r2) = 2.3 × 10−3 across the 163 eVNTRs. We used the harmonic mean (2/ (1/r1 + 1/r2)) of the two ranks to order the eVNTRs. Of the 163 VNTRs, 111 had a harmonic rank ≤ 10 and 81 of the eVNTRs were ranked 1 (Supp. Fig. S10), indicating that the majority of the eVNTRs could be considered causal in some tissue. Separating tissue types, 170 (22%) of the 759 significant associations were likely causal. These results suggest a much larger causality fraction compared to SNPs, structural variants31 and even STRs34, even with the caveat that we only tested ‘genic’ VNTRs.
Looking at individual eVNTRs, we recapitulated a previous result by identifying an eVNTR in the AS3MT gene. The lowest association p-value measured in any tissue using 652 samples was 4.1 × 10−54, which was orders of magnitude higher than the significance reported with 322 samples5(Fig. 3a,b). Its harmonic rank for the two causality tests was 1. Finally, the VNTR is located in a regulatory region of the genome as identified by H3K27Ac and DNase marks (Fig. 3c).
Proopiomelanocortin (POMC) is a precursor of many peptide hormones with multiple roles including regulation of appetite and satiety53. Hypermethylation of POMC (and reduced expression) in peripheral blood cells and melanocyte-stimulating hormone positive neurons was strongly associated with obesity and body mass index54, 55. Surprisingly, POMC over-expression also predisposed lean rats into diet-induced obesity56. Our analysis identified a VNTR in the coding region of the POMC gene as the causal variant governing expression levels in 15 tissues, including adipose and nerve tissues. The 6R allele had 1.8-fold higher expression in blood and nerve cells (Fig 3d), and the correlation with expression was much stronger than neighboring SNPs (Fig 3e). Moreover, the VNTR was located within an H3K27Ac mark that was topologically close to the promoter of the gene based on chromatin conformation (Fig 3f).
The ZNF232 gene is differentially expressed in ovarian and breast cancers57, 58. Also, the chr17 locus containing the gene has been associated with Alzheimer’s in a recent large meta-GWAS study on the UK Biobank data59. We identified an eVNTR in the promoter region where expanded alleles (RU5+) had 2-fold higher median expression relative to RU3 (Fig. 3g). The VNTR was ranked 1 in 40 of 46 tissues including 7 brain sections, and specifically the Hippocampus, which is the affected region in Alzheimer’s60, 61 (Fig. 3h). It was also ranked 1 in ovary and breast with a normalized effect size that was twice the effect size of the best SNP (Table 2).
The RPA2 gene product is part of the Replication Protein A complex involved in DNA damage checkpointing62. Its over-expression is identified as a prognostic marker for colon cancer63 and bladder cancers64. A VNTR that overlapped the Transcription Start Site (TSS) of RP2A with lower VNTR length showed 1.9-fold higher expression of RPA2 in multiple tissues including colon (Supp. Fig. S11 and Table 2). Table 2 identifies other important genes including NBPF3 (Neuroblastoma65), TBC1D7 (lung cancer66), ZNF490 (colorectal cancer67), MSH3 (myotonic dystrophy68) and others. Taken together, our results suggest that VNTRs mediate the expression of key genes.
3 Discussion
VNTRs are the “hidden polymorphisms.” Despite high mutation rates and known examples of function modifications, VNTR analysis is not a component of Mendelian or GWAS analysis. This primarily is due to technical challenges in VNTR genotyping. Here, we use a combination of fast filtering followed by a hidden markov model-based genotyping to accurately determine VNTR genotypes. Our method can genotype 10K VNTRs for an individual in 50 hours making the time problem tractable. We used it to genotype close to 2, 000 human samples. The use of neural networks as a filtering strategy is novel, and we believe that further improvements could lead to another order of magnitude reduction in compute time, making it practical to genotype ≥ 105 individuals in the future.
Some VNTRs have complex multi-repeat structure making it difficult to map reads and count the repeating units. However, unlike other VNTR genotyping methods, our method customizes the genotyping for each VNTR. Future research will focus on improving the genotyping for the hard cases, possibly by building HMMs with separate profiles for each distinct repeating unit, as well as the use of long-reads to improve anchoring to the correct locations. We pursue a targeted genotyping approach which has the disadvantage of not being able to discover new VNTRs, and we rely on other methods for the initial discovery of VNTRs. However, we note that the discovery is a one-time process while genotyping must be repeated for each cohort, and therefore, it makes sense to separate the two problems.
We found that VNTRs were strongly associated with the expression of proximal genes with over 6.1% of the tested VNTRs showing genome wide significant association. Importantly, nearly half of the eVNTR loci were more significant compared to neighboring SNPs, suggesting that a much higher fraction of eVNTRs are causal relative to other variant classes such as SNPs, structural variants, and even STRs. While the high fraction of causal eVNTRs can partly be explained by the choice of ‘genic’ VNTRs for testing, we note that it was computed only for eVNTRs, and speculate that eVNTRs identified in non-genic regions are located in regulatory regions and will continue to have stronger associations compared to neighboring SNPs. In summary, ongoing technical innovations in speed and accuracy of VNTR genotyping are likely to improve our understanding of human genetic variation, and provide novel insights into the function and regulation of key genes and complex phenotypes.
Conflict
V.B. is a co-founder, serves on the scientific advisory board, and has equity interest in Boundless Bio, inc. (BB) and Digital Proteomics, LLC (DP), and receives income from DP and BB. The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. BB and DP were not involved in the research presented here.
4 Method
4.1 Genotyping in adVNTR-NN
Filtering trade-off calculations
Let A(r) denote the HMM genotyping time using r reads. The goal of filtering is to reduce the number of reads supplied to each VNTR HMM. Any filter is characterized by three parameters:
run-time: Let P (r) denote the running time of the filter for r reads for each VNTR locus;
efficiency: Let fk denote the fraction of reads that were retained for any VNTR. The efficiency is defined as 1 − fk so that high efficiency implies only a small fraction being retained by the filter.
sensitivity/recall: The fraction of true VNTR overlapping reads that were accepted for each VNTR.
Consider a data-set with r unmapped reads and among the mapped reads, an average of r′ reads are assigned to each VNTR locus. Assuming that the filtered reads are distributed equally among the VNTRs, each HMM will receive fkr + r′ reads on the average. The total genotyping time for n VNTRs is given by:
Empirically, A(r) = 0.32r seconds per VNTR. The keyword match filter for adVNTR achieved fk = 7.7 × 10−5. For a 55X coverage WGS with r = 4.2 × 106 reads, P (r) = 111.22(s), r′ = 18, we run the HMM on an average of fkr + r′ = 341 reads per VNTR on the average. The running time is:
The genotyping time for n=10K VNTRs is about 631 hours per individual.
Read Filtering
For each VNTR locus V, and each read R, consider a binary classification function f: V × R → {0, 1}, where f (R, V) = 1 if and only if read R maps to locus V. For each read and each of N loci V1,…, VN, the neural recruitment method computes independent classification functions fi(Vi, R). Note that a read can be assigned to multiple VNTR loci, or to none. As an initial step toward this task, we perform a fast string matching based on prefix tree (trie) to assign each read to the VNTR loci that share an exact match with the read. For an efficient matching, we generate a separate aho-corasick trie using every k-mer in VNTR loci as dictionary X. A trie is a rooted tree where each edge is labeled with a symbol and the string concatenation of the edge symbols on the path from the root to a leaf gives a unique word (k-mer) X. We label each leaf with a set of T VNTRs that contain corresponding k-mer. On the other hand, the string concatenation of the edge symbols from the root to a middle node gives a unique substring of X, called the string represented by the node. We add extra internal edges called failure edges to other branches of the trie that share a common prefix which allow fast transitions between failed string matches without the need for backtracking80. Testing whether a query q has an exact match in the trie can be done in O(|q|) and we require additional O(|T |) time to assign read q to all T VNTR loci that share the keyword. The overall complexity of this algorithm is linear based in the length of original dictionary (VNTRs in the database) to build the Trie and recover matches plus the length of queries (sequencing reads). Hence, after construction of the trie, the running time is proportional to just reading in the sequences.
Neural Recruitment
To further reduce the set of reads assigned to each VNTR, we use a 2-layer feedforward Neural Network to compute fi, using a k-mer based embedding to encode DNA strings. Specifically, we use a DNA string w of length k, consider an bijection ϕ that maps w to a unique number in [0, 4k − 1]. Each read R can be defined by a collection of overlapping k-mers. We map read R to a unique vector vR ∈ {0, 1}4k, such that vR[i] = 1 if and only if ϕ−1(i) ∈ R. Deatils of the neural network architecture and hyper-parameters are presented below.
Network Architecture
Let v denote the mapping of a read. We use a shallow architecture with an input layer used to present v to the network. We add two layers of fully connected nodes as the hidden layers, with each node being a Relu function72. In the output layer, there are two nodes zero and one which specify that whether read should be classified as true (containing VNTR) or false (Fig. 1). We used the training set to train the network with Adam optimization algorithm78. The number of hidden layers N1 and N2 were chosen empirically. Too many nodes would increase both training time and test time and possibly cause over-fitting. We performed the training with the number hidden nodes of each layer varying from 10 to 100 with 10 increase in each step and selected N1 = 100 and N2 = 50 as the best parameters according to validation performance.
Choosing the optimal k-mer length
The choice of k-mer length is important. Increasing the k-mer size could decrease sensitivity in our case as small variation will significantly change the k-mer composition, whereas lowering k-mer size reduces the features that are discriminative for a pattern70. In addition, our embedding size exponentially grows with respect to the k so there is also a practical upper bound on the k. Following Zhang70 and Dubinkina71, we trained and tested in the range 4 ≤ k < 9. The accuracy remains comparable in this range (Fig. S12), and we chose k = 6 as its mean validation accuracy is the highest compared to four other values of k.
Effect of different loss functions
To choose the best loss function, we examined three regression loss functions: Mean Squared Error (MSE), Mean Squared Logarithmic Error (MSLE), and Mean Absolute Error (MAE), as well as three binary classification loss functions Hinge, Squared Hinge, and Binary Cross-Entropy. We compared the validation performance of our models for these 6 different loss functions. Each distribution in Supplementary Fig. S13 shows the accuracy on validation set across 1905 genomic loci. We analyzed these distributions using one-way analysis of variance (ANOVA) and none of them were significantly better than others. We chose binary cross-entropy as it obtained the highest mean accuracy (99.95%) among loss functions and its binary classification nature fits our requirement.
Speed and efficiency of neural network filtering
The neural-network filtering achieved a speed of N (r) ≃ 0.03r seconds for r reads, greatly increasing filtering efficiency (fnfk′ < 10−6) to input only 14 reads per VNTR on the average when r = 4.2 × 106. The running time using the two filters could be modeled as
Simulated data for training and testing
We used ART79 to generate r = 6 × 108 reads from human reference genome (30X coverage) with Illumina HiSeq 2500 error profile. For each target locus, we modified the number of the repeats to be ±3 of the original count in the reference with setting 1 as minimum number of repeats, and simulated reads from those regions. For each locus, we assigned labels to reads as being true reads or not, based on exact location. We divided the original set of reads into three parts: 70% for training, 10% for validation and 20% for testing. We trained all neural network models using the training and validation sets, and reported performance on the test dataset.
To augment the data, we added random single nucleotide variations in the genome sequences of the dataset74. For each sequence in the dataset, we replaced its nucleotides with a random one with probability re. We set re = 10−5, the novel base substitution mutation rate within VNTRs81. This method of dataset augmentation helps include ‘mutated’ k-mers in the embedding of reads, making the method more robust.
To test and compare genotyping accuracy against VNTRseek, we started with a random selection of 10,000 target VNTR loci (< 140 bp) and filtered them out if a VNTR locus was marked as indistinguishable in VNTRseek. As a result, 9,638 target VNTRs remained. We used ART79 to generate heterozygous samples by simulating 15X coverage reads from each modified haplotype which contained a non-reference allele and combined those with 15X reads that were simulated from reference. The non-reference allele for each VNTR was chosen to be in the range [c − 3, c + 3], where c is the reference count. Together, this provided six diploid simulated data-sets for each locus, at 30X coverage.
Performance test
We measured running time of adVNTR-NN and VNTRseek by running them with default parameters on a single core of Intel Xeon CPU E5-2643 v2 3.50GHz CPU. To measure the accuracy of genotyping, we ran adVNTR-NN and VNTRseek on diploid simulated data of heterozygous VNTRs and measured the number of correct calls divided by total number of VNTR loci.
4.2 Data and preprocessing
We accessed 30X Illumina WGS data from the GTEx cohort (652 individuals) through dbGaP (accession id phs000424.v8.p2). Specifically, we accessed CRAM files containing read alignments to the GRCh38 reference genome through cloud-hosted SRA data using fusera and downloaded VCF files containing SNP genotype calls from dbGaP.
As genotyping VNTRs remains computationally expensive, we focused on the smaller set of VNTRs located within coding, untranslated, or promoter regions of genes, which are most likely to be involved in regulation. We identified VNTRs in coding exons and UTRs by intersecting VNTR coordinates with refseq gene coordinates downloaded from UCSC Table Browser86. To identify VNTRs that appear within promoter regions, we considered 500bp upstream of the transcription start site of genes as the promoter regions. Overall, this procedure identified 13, 081 VNTRs, of which 10, 262 were within the size range for short-read genotyping (Fig. 1A). We subsequently added two VNTRs previously linked to a human disease to obtain 10, 264 target loci42, 42. We genotyped these VNTR loci in 652 individuals from GTEx cohort using adVNTR-NN on Amazon Web Services (AWS) cloud, which allowed us to do the computation in parallel for different samples.
We compared the most common allele of each VNTR with the reference allele (GRCh38) to observe representation of each VNTR in the reference. We also searched for VNTRs with multiple observed alleles to estimate a rate of polymorphism for VNTRs and find how common each allele was. To call a VNTR polymorphic, we set the minor allele frequency at 5% and any variation below that frequency was discarded. In addition, we identified the amount of base-pair difference that they make in genome of each individual by comparing the copy number difference of VNTRs between reference and the sample and multiplied that by the pattern length of each locus. We computed how many loci on average differed between an individual and reference by combining all non-reference calls in at least one haplotype from all individuals and dividing it by all called variants. VNTRs whose allele frequencies did not meet the expected percentage of homozygous versus heterozygous calls under HardyWeinberg equilibrium (P < 0.05 for two-sided binomial test) were eliminated. We further removed VNTRs that were monomorphic (only one allele) in the entire GTEx cohort or had minor allele frequency lower than 1% among the individuals with expression data in every tissue. We used the resulting 2,672 VNTRs for subsequent analysis (Supp Table S1).
We obtained processed RNA-expression data (RPKM values) from 54 tissues from dbGaP (phe000020.v1) and limited analysis to 46 tissues which had data for at least 100 individuals. ‘Non-expressed genes’– genes with median RPKM level zero– in each tissue were removed from analysis. For the remaining genes, we quantile-normalized RPKM values of each tissue to a normal distribution. We analyzed VNTR-Gene pairs for each VNTR and its closest gene based on refseq annotations85 in each of the 46 tissues.
4.3 Identification of eVNTRs
Before the analysis of the association of VNTR genotypes and gene expression levels, we adjusted gene expression levels for each tissue in order to control for covariates of sex, population structure, and technical variations in measuring expression. For population structure, we used the top ten principal components (PCs) from a principal components analysis (PCA) on the matrix of SNP genotypes using smartpca82 to provide a correction for population structure. To generate the SNP genotype matrix, we used the VCF files for GTEx cohort (accession phg001219.v1) and filtered biallelic SNP sites MAF > 0.05 using plink83. To correct for non-genetic factors such as technical variations in measuring RNA expression levels (e.g batch effects, environmental variables), we applied PEER factor correction and used the top 15 factors48. We removed the effect of covariates by regressing them out from the RNA expression matrix of each tissue and subtracting their factor contributions and used the residuals for all eQTL association analyses.
Let v denote a VNTR-gene pair, yiv denote the normalized expression value of gene in v for individual i and xiv denote the genotype of the VNTR in v for individual i. Then, where, PCik denotes the strength of the k-th principal component, and Rik the value of the k-th PEER factor. We performed the association test for each VNTR-gene pair separately for each tissue type using Python statsmodels linear regression, Ordinary Least Squares (OLS)84, and computed a nominal p-value of the strength of association for each VNTR-gene pair.
Multiple Testing Correction
We used permutation tests and the BenjaminiHochberg procedure to estimate a 5% False Discovery Rate (FDR) significance cut-off for each tissue. The significance thresholds for each of the 46 tissues ranged from 10−3 to 3.8 × 10−5 (Fig. S6). Overall, 759 significant tests were observed from total of 73,609 tests in all tissues and 163 unique VNTRs passed the significance test in at least one tissue.
Fine-mapping of Causal Variants
To compare the strength of the VNTR association relative to proximal SNPs, we extracted all SNPs from 50kb 5’ to the transcription start, from the gene body, and up to 50kb 3’ to the end of the transcript using the GTEx variant calls. To perform a fair comparison, we used the same test and covariates for VNTRs and repeated it for each SNP by replacing the genotype to obtain the strength of association for each SNP. Then, we ranked all variants based on their association P value.
We further used a fine-mapping method, CAVIAR, as an orthogonal method to identify the causal variant for the change in gene expression level. CAVIAR is a statistical method that quantifies the probability that a variant is causal by combining association signals (i.e., summary level Z-scores) and linkage disequilibrium (LD) structure between every pair of variants52. We ran CAVIAR with parameter -c 1 to identify the most likely causal variant, along with the causality probability distribution for each variant site. We ranked variants based on their causality probability given by CAVIAR and called it the causality rank.
Supplementary Figures
S0.4 Neural Network Parameter Tuning
Table S1: A list of 10,264 target VNTR loci used in this study (supplementary table s1.xlsx)
Acknowledgements
The research was supported in part by grants HG010149, and R01GM114362 from the NIH. The analyses presented in this paper are based on the use of study data downloaded from the dbGaP web site, under phs001095.v1.p1, phs001096.v1.p1 and phs001097.v1.p1. This work used data from UK Biobank (project 46122). The 30X whole genome sequencing data of 1000 Genomes Project samples used in this research were generated at the New York Genome Center with funds provided by NHGRI Grant 3UM1HG008901-03S1.
References
- [1].↵
- [2].
- [3].↵
- [4].↵
- [5].↵
- [6].↵
- [7].↵
- [8].↵
- [9].
- [10].↵
- [11].↵
- [12].↵
- [13].↵
- [14].↵
- [15].↵
- [16].↵
- [17].↵
- [18].
- [19].↵
- [20].↵
- [21].↵
- [22].↵
- [23].↵
- [24].↵
- [25].↵
- [26].↵
- [27].↵
- [28].↵
- [29].↵
- [30].↵
- [31].↵
- [32].↵
- [33].
- [34].↵
- [35].↵
- [36].↵
- [37].↵
- [38].↵
- [39].↵
- [40].↵
- [41].↵
- [42].↵
- [43].↵
- [44].↵
- [45].↵
- [46].↵
- [47].↵
- [48].↵
- [49].↵
- [50].↵
- [51].↵
- [52].↵
- [53].↵
- [54].↵
- [55].↵
- [56].↵
- [57].↵
- [58].↵
- [59].↵
- [60].↵
- [61].↵
- [62].↵
- [63].↵
- [64].↵
- [65].↵
- [66].↵
- [67].↵
- [68].↵
- [69].
- [70].↵
- [71].↵
- [72].↵
- [73].
- [74].↵
- [75].
- [76].
- [77].
- [78].↵
- [79].↵
- [80].↵
- [81].↵
- [82].↵
- [83].↵
- [84].↵
- [85].↵
- [86].↵
- [87].
- [88].
- [89].
- [90].
- [91].
- [92].
- [93].