Abstract
Short tandem repeats (STRs) are involved in dozens of Mendelian disorders and have been implicated in a variety of complex traits. However, existing technologies focusing on single nucleotide polymorphisms (SNPs) have not allowed for systematic STR association studies. Here, we leverage next-generation sequencing data from 479 families to create a SNP+STR reference haplotype panel for genome-wide imputation of STRs into SNP data. Imputation achieved an average of 97% concordance between genotyped and imputed STR genotypes in an external dataset compared to 63% expected under a random model. Performance varied widely across STRs, with near perfect concordance at bi-allelic STRs vs. 70% at highly polymorphic forensics markers. We demonstrate that imputation increases power over individual SNPs to detect STR associations using simulated phenotypes and gene expression data. This resource will enable the first large-scale STR association studies using existing SNP datasets, and will likely yield new insights into complex traits.
Introduction
Genome-wide association studies (GWAS) have become increasingly successful at identifying genetic loci significantly associated with complex traits in humans, largely due to the enormous growth in available sample sizes1–3. Hundreds of thousands of individuals have been genotyped using commodity genotyping arrays. These arrays take advantage of the correlation structure between nearby variants induced by linkage disequilibrium (LD), which allows genome-wide imputation based on genotypes of only a small subset of loci4. However, GWAS based on single nucleotide polymorphism (SNP) associations face important limitations. Even with sample sizes of up to 100,000 individuals, common SNPs still fail to explain the majority of heritability for many complex traits1,5.
One compelling hypothesis explaining the “missing heritability” dilemma is that complex variants, such as multi-allelic repeats not in strong LD with common SNPs, are important drivers of complex traits but are largely invisible to current analyses. Indeed, dissection of the strongest schizophrenia association, located in the major histocompatibility complex, revealed a poorly tagged polymorphic copy number variant (CNV) to be the causal variant6. The signal could not be localized to a single SNP and could only be explained after deep characterization of the underlying CNV. This and subsequent discoveries7,8 highlight the importance of considering alternative variant classes.
Short tandem repeats (STRs), consisting of repeated motifs of 1-6bp in tandem, comprise more than 3% of the human genome9. Multiple lines of evidence support a role of STRs in complex traits10–12, particularly in neurological and psychiatric phenotypes. Due to their rapid mutation rates13, STRs exhibit high rates of heterozygosity14 and likely contribute more de novo mutations per generation than all other known sources of genetic variation. Furthermore, STRs have been shown to play a significant role in regulating gene expression15,16, splicing17–19, and DNA methylation16. Intriguingly, more than 30 Mendelian disorders are caused by STR expansions via a range of mechanisms, including polyglutamine aggregation (Huntington’s Disease, ataxias20), hypermethylation (Fragile X Syndrome21), and RNA toxicity (ALS/FTD22). Furthermore, causal STRs driving existing GWAS signals have already been identified23.
Existing technologies have not allowed for systematic STR association studies. Next-generation sequencing (NGS) can be used to directly genotype short STRs, but NGS is still too expensive to perform on sufficiently large cohorts for GWAS of most complex traits. An alternative approach is to impute STRs into existing SNP array datasets. Previous studies have demonstrated that STRs are often in significant LD with nearby SNPs24–26 and found that STRs and SNPs provide complementary information about the evolutionary history of a genomic region. Despite widespread SNP-STR LD, statistical phasing of STRs and SNPs is challenging for several reasons: SNP-STR LD is notably weaker than SNP-SNP LD24 due to the rapid mutation rates27,28 and high prevalence of recurrent mutations in STRs. As a result, the relationship between STR repeat number and SNP haplotype can be complicated and nonlinear, with the same STR allele present on multiple SNP haplotypes and vice versa. Furthermore, LD patterns at STRs vary widely as a function of properties of the repeat, such as the repeat unit length, mutation rate, and mutation step size24. Finally, STRs are prone to genotyping errors induced during PCR amplification29,30, further ambiguating phase information.
Sequencing related samples allows haplotype resolution by directly tracing inheritance patterns. The recent generation of deep NGS using PCR-free protocols for hundreds of nuclear families in combination with accurate tools for genotyping STRs from NGS31 now enables applying this technique genome-wide. Here, we profiled STRs in 479 families and used pedigree information to phase STR genotypes onto SNP haplotypes to create a genome-wide reference for imputation. We used this panel to impute STRs into an external dataset of similar ethnic background with average 97% concordance with observed STR genotypes. Imputation accuracy varied across STRs, ranging from nearly perfect concordance at bi-allelic STRs to around 70% for highly polymorphic forensics markers. We show that STR imputation achieves greater power than individual SNPs to detect underlying STR associations and demonstrate the utility of our panel by detecting novel STRs associated with gene expression. Finally, we imputed genotypes at STRs previously implicated in human disorders and found that we could accurately identify specific SNP haplotypes associated with long normal alleles most at risk for expansion.
To facilitate use by the community, we have released a phased SNP+STR haplotype panel for samples genotyped as part of the 1000 Genomes Project (see Data availability). This resource will enable the first large-scale studies of STR associations in hundreds of thousands of available SNP datasets, and will likely yield significant new insights into complex traits.
Results
A catalog of STR variation in 479 families
We first generated the deepest catalog of STR variation to date in a large cohort of families included in the Simons Simplex Collection (SSC) (see URLs). We focused on 1,916 individuals from 479 family quads (parents and two children) that were sequenced to an average depth of 30x using Illumina’s PCR-free protocol. Based on comparison to 1000 Genomes Project samples, we estimated the cohort to consist primarily of Europeans (83%), with 2.0%, 9.0%, and 3.6% of East Asian, South Asian, and African ancestry respectively (Supplementary Figure 1). We used HipSTR31 to profile autosomal STRs in each sample. HipSTR takes aligned reads and a reference set of STRs as input and outputs maximum likelihood diploid genotypes for each STR in the genome. While HipSTR infers the entire sequence of each STR allele, we focus here on differences in repeat copy number rather than sequence variation within the repeat itself. To maximize the quality of genotype calls, individuals were genotyped jointly with HipSTR’s multi-sample calling mode using phased SNP genotypes and aligned reads as input (Online Methods). Multi-sample calling allows HipSTR to leverage information on haplotypes discovered across all samples in the dataset to estimate per-locus error parameters and output genotype likelihoods for each possible diploid genotype. Notably, our HipSTR catalog excluded most known STRs implicated in expansion disorders such as Huntington’s Disease and hereditary ataxias, since even the normal allele range for these STRs is above or near the length of Illumina reads32–35. To supplement our panel, we additionally used Tredparse36 to genotype a targeted set of known pathogenic STRs in our cohort (Supplementary Table 1). Tredparse incorporates multiple features of paired-end reads to estimate the size of repeats longer than the read length.
An average of 1.14 million STRs passed HipSTR’s default filtering settings in each sample (Figure 1A). We obtained at least one call for 97% of all STRs in the HipSTR reference of 1.6 million STRs and for 15 of 25 STRs in the Tredparse reference with an average overall call rate of 90% (Figure 1B). We applied additional stringent genotype quality filters to ensure accurate calls for downstream phasing and imputation analysis. STRs overlapping segmental duplications, with call rates less than 80%, or with genotype frequencies unexpected under Hardy-Weinberg Equilibrium were removed (Online Methods). We further removed STRs with low heterozygosity (Figure 1C). After filtering, 453,671 and 9 STRs from the HipSTR and Tredparse panels, respectively, remained in our catalog.
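The call-rate and heterozygosity filters described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, heterozygosity is assumed to be the expected heterozygosity 1 − Σp_i² (the exact definition is in Online Methods, which is not reproduced here), and the 0.095 cutoff is borrowed from the Figure 1D caption as an assumed threshold. The segmental-duplication and Hardy-Weinberg filters are omitted.

```python
from collections import Counter

def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype (None means no call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)

def heterozygosity(genotypes):
    """Expected heterozygosity 1 - sum(p_i^2), an assumed definition."""
    alleles = [a for g in genotypes if g is not None for a in g]
    if not alleles:
        return 0.0
    counts = Counter(alleles)
    n = len(alleles)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def passes_filters(genotypes, min_call_rate=0.80, min_het=0.095):
    """Keep a locus only if it is well called and sufficiently polymorphic.

    min_call_rate follows the 80% cutoff in the text; min_het is an
    assumption taken from the Figure 1D caption.
    """
    return (call_rate(genotypes) >= min_call_rate
            and heterozygosity(genotypes) > min_het)

# Toy locus: each genotype is a pair of repeat copy numbers, None = no call.
locus = [(10, 10), (10, 11), (11, 11), (10, 11), None]
monomorphic = [(10, 10)] * 5

print(call_rate(locus), heterozygosity(locus), passes_filters(locus))
print(passes_filters(monomorphic))
```

Treating each genotype as an unordered pair of repeat counts keeps the sketch independent of any particular VCF parser.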
Dashed line represents the mean of 1.14 million STRs per sample. B. Call rate per locus. Dashed line represents the mean call rate of 90%. C. Mendelian inheritance rate at filtered vs. unfiltered STRs. The x-axis gives the posterior genotype score (Q) returned by HipSTR. The y-axis gives the average Mendelian inheritance rate for each bin across all calls on chromosome 21. STRs that were homozygous for the reference allele in all members of a family were removed. Colors represent different motif lengths. D. Per-locus heterozygosity in SSC vs. 1000 Genomes. Only STRs with heterozygosity >0.095 in SSC are included. Color scale gives the log10 number of STRs represented in each bin. E. Allele frequencies at pathogenic STRs obtained by Tredparse vs. previously reported normal alleles. Blue=Tredparse, Gold=Previously reported. Boxes span the interquartile range and horizontal lines give the medians. Whiskers extend to the minimum and maximum data points. The y-axis gives the number of repeat units. Tredparse alleles are based on calls in the SSC panel. Sources of previously reported allele frequencies are described in detail in Online Methods.
We further assessed the quality of our STR genotypes by comparing patterns of variation from SSC to previous catalogs of STR variation obtained using a distinct set of samples and STR genotyping methods. For HipSTR calls, we found that per-locus heterozygosities (Online Methods) were highly concordant with a catalog generated from the 1000 Genomes Project37 data using lobSTR38 (r=0.96; p<10^−200; n=386,100) (Figure 1D). For Tredparse calls, allele frequency spectra observed in SSC matched closely to previously reported normal allele frequencies at each STR (Figure 1E). For STRs genotyped both by HipSTR and Tredparse, estimated repeat lengths were highly concordant (average concordance 99.4%, Supplementary Table 1). Overall, these results show that our catalog consists of robust STR genotypes suitable for downstream phasing and imputation analysis.
A genome-wide SNP+STR haplotype reference panel
We examined the extent of linkage disequilibrium between STRs and nearby SNPs using two metrics. The first, termed “length r2”, is defined as the squared Pearson correlation between STR allele length and the SNP genotype. The second, termed “allelic r2”, treats each STR allele as a separate bi-allelic locus and is computed similar to traditional SNP-SNP LD (Online Methods). Similar to previous studies24, SNP-STR LD was dramatically weaker than SNP-SNP LD by both metrics (Supplementary Figure 2A) with length r2 generally stronger than allelic r2. We additionally determined the best tag SNP (Online Methods) for each STR, which was on average 5.5kb away (Supplementary Figure 2B). Nearly all STRs were in significant LD (Length r2 p<0.05) with the best tag SNP, suggesting that phasing would result in informative haplotypes.
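The two LD metrics can be illustrated with a short sketch. It assumes phased haplotypes, with one STR repeat count and one 0/1 SNP allele per haplotype. The helper names and the toy data are hypothetical, and the exact computation used in the study is described in Online Methods.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def length_r2(str_lengths, snp_alleles):
    """Length r2: correlate STR allele length with the SNP allele."""
    return pearson_r2(str_lengths, snp_alleles)

def allelic_r2(str_lengths, snp_alleles, allele):
    """Allelic r2: treat one STR allele as a bi-allelic (0/1) locus."""
    indicator = [1 if a == allele else 0 for a in str_lengths]
    return pearson_r2(indicator, snp_alleles)

# Toy haplotypes (made-up values): repeat count and SNP allele per haplotype.
str_hap = [10, 10, 11, 11, 12, 12]
snp_hap = [0, 0, 0, 1, 1, 1]

print(length_r2(str_hap, snp_hap))
for allele in sorted(set(str_hap)):
    print(allele, allelic_r2(str_hap, snp_hap, allele))
```

In this toy example the middle allele sits equally often on both SNP backgrounds, so its allelic r2 is essentially zero even though length r2 for the locus is substantial. This mirrors the observation that length r2 is generally stronger than allelic r2.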
We developed a pipeline to phase STRs onto SNP haplotypes leveraging the quad family structure (Figure 2A). Based on our LD analysis, we used a window size of ±50kb to phase each STR separately using Beagle39, which was recently demonstrated to perform well in phasing multi-allelic STRs40 and can incorporate pedigree information. Resulting phased haplotypes from the parent samples were merged into a single genome-wide reference panel for downstream imputation.
A. Schematic of phasing pipeline in the SSC cohort. To create the phased panel, STR genotypes were placed onto phased SNP haplotypes using Beagle. Any missing STR genotypes were imputed. The resulting panel was then used for downstream imputation from orthogonal SNP genotypes. Blue and red denote phased and unphased variants, respectively. Positions in gray are homozygous. B. Concordance of imputed STR genotypes vs. heterozygosity. Blue denotes observed per-locus values and green denotes values expected under a random model. Solid lines give median values for each bin and filled areas span the 25th to 75th percentile of values in each bin. X-axis values were binned by 0.1. Upper gray plot gives the distribution of heterozygosity values in our panel. Concordance values are based on the leave-one-out analysis in the SSC cohort. C. Per-locus imputation concordance in SSC vs. 1000 Genomes cohorts. Color scale gives the log10 number of STRs represented in each bin. Concordance values are based on the subset of samples from the 1000 Genomes deep WGS cohort with European ancestry. D. Per-locus imputation concordance using HipSTR vs. capillary electrophoresis genotypes. Each dot represents one locus. The x-axis and y-axis give imputation concordance using capillary electrophoresis or HipSTR genotypes as a ground truth, respectively. Concordance was measured in separate sets of European samples for each technology. E. Concordance of imputed vs. 10X STR genotypes in NA12878 stratified by concordance in SSC. STRs were binned by concordance value based on the leave-one-out analysis. Concordance in NA12878 was measured across all STRs in each bin. Dots give mean values for each bin and lines denote +/− 1 standard deviation. In all cases LOO refers to the leave-one-out analysis in the SSC cohort.
We evaluated the utility of our phased panel for imputation using a “leave-one-out” analysis in the SSC samples. For each sample, we constructed a modified reference panel with that sample’s haplotypes removed and then performed genome-wide imputation. We measured concordance, length r2, and allelic r2 between imputed vs. observed genotypes at each STR, where “observed” refers to genotypes obtained by HipSTR or Tredparse. For each of these metrics, we additionally computed the value expected under a model where genotypes are imputed randomly (Online Methods) for comparison. Imputed genotypes showed an average of 96.7% concordance with observed genotypes, compared to 61.0% expected under a random model (Table 1). As expected, concordance was strongest at the least polymorphic STRs (Figure 2B, Supplementary Figures 3A, 4) and allelic r2 was highest for the most common alleles (Supplementary Figure 3B). Length r2 was not strongly associated with heterozygosity, although the least and most heterozygous STRs tended to have lower length r2 (Supplementary Figure 3C). Imputation metrics were weakly negatively correlated with distance to the best tag SNP (r=−0.06, p=0.06; r=−0.04, p=0.27; and r=−0.06, p=7.5×10^−5 for concordance, length r2, and allelic r2, respectively). To further evaluate imputation performance at highly polymorphic STRs, we examined the CODIS STRs used in forensics analysis (Supplementary Table 2). Per-locus concordances were highly correlated with imputation results recently reported by Edge et al.40 (Pearson r2=0.93; p=6.3×10^−6; n=10), but were on average 8.8% higher, likely as a result of our larger and more homogeneous cohort. Per-locus imputation statistics for all STRs are reported in Supplementary Tables 3 and 4.
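A minimal sketch of the per-locus concordance metric and a random-model baseline follows. Both definitions are assumptions made for illustration: concordance is taken as the fraction of samples whose imputed unordered genotype exactly matches the observed one, and the random expectation is taken as the probability that two genotypes drawn independently from the observed genotype frequencies agree. The definitions actually used are given in Online Methods, and the function names here are hypothetical.

```python
from collections import Counter

def normalize(genotype):
    """Ignore phase: (11, 10) and (10, 11) are the same diploid genotype."""
    return tuple(sorted(genotype))

def concordance(observed, imputed):
    """Fraction of samples with an exact genotype match (assumed definition)."""
    matches = sum(normalize(o) == normalize(i)
                  for o, i in zip(observed, imputed))
    return matches / len(observed)

def random_concordance(observed):
    """Chance that two genotypes drawn from the observed frequencies agree."""
    counts = Counter(normalize(g) for g in observed)
    n = len(observed)
    return sum((c / n) ** 2 for c in counts.values())

# Toy locus with four samples (made-up repeat counts).
observed = [(10, 11), (10, 10), (11, 10), (12, 10)]
imputed = [(11, 10), (10, 10), (11, 11), (10, 12)]

print(concordance(observed, imputed))
print(random_concordance(observed))
```

Comparing the two numbers per locus shows why raw concordance alone can flatter nearly monomorphic STRs: their random baseline is already high.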
Results indicate mean across all STRs analyzed. Allelic r2 values include all common alleles (frequency at least 5%). Values in parentheses for each metric give expected values under a random imputation model based on allele frequencies in each population. “Multi-allelic” refers to STRs with three or more common alleles.
We next evaluated our ability to impute STR genotypes into an external dataset. For this, we focused on samples from the 1000 Genomes Project37 with high quality SNP genotypes obtained from low coverage whole genome sequencing (WGS) (n=2,504) or genotyping arrays (n=2,486 for Affy 6.0, and n=2,318 for Omni 2.5). We validated imputed genotypes for subsets of 1000 Genomes samples using three orthogonal technologies: Illumina WGS+HipSTR, capillary electrophoresis, and 10X Genomics+HipSTR. In each case we evaluated performance using the orthogonal data as the “truth” set.
First, we used HipSTR to genotype STRs in separate high-coverage (30x) WGS datasets available for 150 of the samples (see URLs) from European (n=50), African (n=50), and East Asian (n=50) backgrounds. Per-locus concordance, length r2, and allelic r2 were highly concordant between the SSC panel and 1000 Genomes samples of European origin (Pearson r=0.94, 0.63, and 0.85, respectively) (Figure 2C; Supplementary Figure 5; Table 1). Overall imputation performance did not vary when using phased genotypes obtained from WGS vs. Omni2.5 for imputation (Supplementary Table 5). Concordance was noticeably weaker in African and East Asian samples, likely due to different population background compared to the SSC samples and lower LD in African populations41.
Next, we compared imputed genotypes to capillary electrophoresis data42 (see URLs) available for a subset of samples in our panel at highly polymorphic STRs. After filtering non-European samples and STRs that could not be reliably mapped to HipSTR notation (Online Methods), 41 samples and 206 STRs remained for comparison. We obtained an average overall concordance of 76.9% with capillary genotypes compared with 76.4% expected based on HipSTR analysis. Per-locus concordances based on HipSTR vs. capillary genotypes were strongly correlated (r=0.83; p=1.05×10^−53; n=206) (Figure 2D).
Finally, we compared imputed genotypes from the highly characterized NA12878 genome to phased data available from 10X Genomics (see URLs), a synthetic long read technology. We constructed a phased validation panel by calling HipSTR separately on reads from each phase and combining with phased SNP genotypes (Online Methods, Supplementary Figure 6). We could obtain phased 10X calls for 116,764 of the STRs in our panel. We used the nearest heterozygous SNP to each STR to match phase order between our panel and the 10X data, which allowed us to directly compare imputed alleles and evaluate phase accuracy. Overall, imputed STR alleles showed 96% concordance with those obtained from 10X and per-locus genotype concordance was consistent with concordance metrics measured in SSC (Figure 2E). Taken together, validation of imputed STR genotypes against three separate “truth” sets demonstrates the accuracy of our original SNP+STR haplotype panel and shows that our quality metrics are reliable indicators of per-locus imputation performance across datasets.
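The phase-matching step can be sketched as below. This is an illustration under stated assumptions rather than the procedure from Online Methods: each dataset is represented by an ordered pair (haplotype 1, haplotype 2), the same heterozygous SNP is available in both, and the function names are hypothetical.

```python
def align_phase(panel_snp, truth_snp, truth_str):
    """Reorder the truth STR alleles so haplotype order matches the panel.

    panel_snp, truth_snp: (allele on hap 1, allele on hap 2) at the nearest
    heterozygous SNP. truth_str: phased STR alleles from the truth data.
    """
    if panel_snp[0] == panel_snp[1]:
        raise ValueError("anchor SNP must be heterozygous")
    if truth_snp == panel_snp:
        return truth_str
    if truth_snp == panel_snp[::-1]:
        return truth_str[::-1]  # haplotype labels are swapped; flip them
    raise ValueError("SNP genotypes disagree between datasets")

def allele_concordance(imputed_str, truth_str):
    """Fraction of the two phased alleles that match after alignment."""
    return sum(a == b for a, b in zip(imputed_str, truth_str)) / 2

# Toy example: the truth data lists the haplotypes in the opposite order.
aligned = align_phase(panel_snp=(0, 1), truth_snp=(1, 0), truth_str=(12, 14))
print(aligned)
print(allele_concordance((14, 12), aligned))
print(allele_concordance((14, 13), aligned))
```

Comparing allele by allele after alignment checks both the imputed lengths and their assignment to haplotypes, which an unordered genotype comparison would not.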
Imputation increases power to detect STR associations
We sought to determine whether our SNP+STR haplotype panel could increase power to detect underlying STR associations over standard GWAS. First, we simulated phenotypes based on a single causal STR and examined the power of the imputed STR genotypes vs. nearby SNPs to detect associations. We focused primarily on a linear additive model relating STR dosage, defined as the average allele length, to quantitative phenotypes (Figure 3A), since the majority of known functional STRs follow similar models (e.g.17,43–45). Association testing simulations were performed 100 times for each STR on chromosome 21 in our dataset (Online Methods). As expected, the strength of association for each variant as measured by the negative log10 p-value was linearly related with its length r2 with the causal variant (Figure 3B). On average, imputed STR genotypes explained 17.7% more variation in STR allele length compared to the best tag SNP (mean r2=0.92 and 0.74 for imputed STRs vs. SNPs, respectively). The advantage from STR imputation grew as a function of the number of common STR alleles (Supplementary Figure 7). Imputed genotypes showed a corresponding increase in power to detect associations at a given p-value threshold (Figure 3C). Similar trends were observed for case-control traits (Supplementary Figure 8). We additionally tested the ability of imputed STR genotypes to identify associations due to non-linear models relating STR genotype to phenotype (Supplementary Figure 9). While both STR and SNP-based tests had limited power to detect non-linear associations, per-allele STR association tests had higher power than the best tag SNP in 60% of simulations. Importantly, testing for complex models relating repeat length to phenotype will only be possible when allele lengths are available, thus demonstrating an additional need for STR imputation over SNP-based tests to detect these associations.
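The simulation logic for the linear additive model can be sketched as follows. Everything specific here is an assumption made for illustration: the haplotype frequencies are made up, the true STR dosage stands in for a well-imputed STR, and p-values come from a Fisher z approximation to the correlation test rather than the regression framework used in the study.

```python
import math
import random

# (SNP allele, STR repeat count, frequency): made-up haplotype frequencies.
HAPLOTYPES = [
    (0, 10, 0.35), (0, 11, 0.20), (0, 12, 0.05),
    (1, 11, 0.05), (1, 12, 0.20), (1, 13, 0.15),
]

def pearson_r(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def approx_p(r, n):
    """Two-sided p-value from the Fisher z approximation (an approximation)."""
    r = max(min(r, 0.999999), -0.999999)
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_power(beta, n_samples=500, n_sims=100, alpha=0.05, seed=0):
    """Return (power using STR dosage, power using the tag SNP genotype)."""
    rng = random.Random(seed)
    haps = [(snp, length) for snp, length, _ in HAPLOTYPES]
    weights = [freq for _, _, freq in HAPLOTYPES]
    hits_str = hits_snp = 0
    for _sim in range(n_sims):
        dosage, snp_gt, pheno = [], [], []
        for _sample in range(n_samples):
            (s1, l1), (s2, l2) = rng.choices(haps, weights=weights, k=2)
            d = (l1 + l2) / 2  # STR dosage: average allele length
            dosage.append(d)
            snp_gt.append(s1 + s2)
            pheno.append(beta * d + rng.gauss(0, 1))  # additive model + noise
        hits_str += approx_p(pearson_r(dosage, pheno), n_samples) < alpha
        hits_snp += approx_p(pearson_r(snp_gt, pheno), n_samples) < alpha
    return hits_str / n_sims, hits_snp / n_sims

power_str, power_snp = simulate_power(beta=0.1)
print(power_str, power_snp)
```

Because the SNP only partially tags repeat length in this toy haplotype table, the SNP-based test sees an attenuated version of the same signal.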
A. Example simulated quantitative phenotype based on SSC genotypes. A quantitative phenotype was simulated assuming a causal STR (red). Power to detect the association was compared between the causal STR, imputed STR genotypes, and all common SNPs (MAF>0.05) within a 50kb window of the STR (gray). B. Strength of association (-log10 p) is linearly related with LD with the causal variant. For SNPs, the x-axis gives the length r2 calculated using observed genotypes. For the imputed STR (blue), the x-axis gives the length r2 from leave-one-out analysis. C. The gain in power using imputed genotypes is linearly related to the gain in r2 compared to the best tag SNP. Gray contours give the bivariate kernel density estimate. Top and right gray area gives the distribution of points along the x- and y-axes, respectively. Power was calculated based on the number of simulations out of 100 with nominal p<0.05. D. Quantile-quantile plot for eSTR association tests. Each dot represents a single STR X gene test. The x-axis gives the expected log10 p-value distribution under a null model of no eSTR associations. Red and blue dots give log10 p-values for association tests using HipSTR genotypes and imputed STR genotypes, respectively. Black dashed line gives the diagonal. E. Comparison of eSTR effect sizes using observed vs. imputed genotypes. Each dot represents a single STR X gene test. The x-axis gives effect sizes obtained using imputed genotypes. Gray dots give the effect size in GTEx whole blood using HipSTR genotypes. Purple dots give effect sizes reported previously15 in lymphoblastoid cell lines. F., G. Example putative causal eSTRs identified using imputed STR genotypes. Left, middle, and right plots give HipSTR STR dosage (red), imputed STR dosage (blue), and the best tag SNP genotype (gray) vs. normalized gene expression, respectively. STR dosage is defined as the average length difference from hg19. One dot represents one sample. 
P-values are obtained using linear regression of genotype vs. gene expression. STR and SNP sequence information is shown for the coding strand. Gene diagrams are not drawn to scale.
We next determined whether STR imputation could identify STR associations using real phenotypes. We focused on gene expression, given the large number of reported associations between STR length and expression of nearby genes in cis15–16 (termed eSTRs). To this end, we analyzed eSTRs from samples in the Genotype-Tissue Expression46 (GTEx) dataset for which RNA-sequencing, WGS, and SNP array data were available. As a test case, we imputed STR genotypes using SNP data for chromosome 21 and tested for association with genes expressed in whole blood. For comparison, we additionally performed each association using genotypes obtained from WGS using HipSTR (Online Methods). A total of 2,452 STR x gene tests were performed in each case. Association p-values were similarly distributed across both analyses and showed a strong departure from the uniform distribution expected under a null hypothesis of no eSTR associations (Figure 3D). For all nominally significant associations (p<0.05), effect sizes were strongly correlated when using imputed vs. HipSTR genotypes (r=0.99; p=1.01×10^−79, n=97). Furthermore, effect sizes obtained from imputed data were concordant with previously reported effect sizes in a separate cohort using a different cell type (lymphoblastoid cell lines)15 (r=0.79; p=0.0042, n=11) (Figure 3E).
We identified genes for which the STR is most likely the causal variant and tested whether STR imputation had greater power to identify causal eSTRs compared to SNP-based analyses. We used ANOVA model comparison to determine genes for which the STR explained additional variation over the top SNP (Online Methods). We additionally applied CAVIAR47 to fine-map associations using the most strongly associated STR and the top 100 associated SNPs for each gene (Online Methods). We identified 3 genes with ANOVA p<0.05 for which the STR was the top variant returned by CAVIAR. One example, a CG-rich STR in the promoter of CSTB, was previously demonstrated to act as an eSTR48 and expansions of this repeat are implicated in myoclonus epilepsy49. In each case, imputed STR genotypes were more strongly associated with gene expression compared to the best tag SNP (Figure 3F-G, Supplementary Table 6).
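The nested-model comparison can be sketched as an F-test of "expression ~ SNP" against "expression ~ SNP + STR". This is a minimal illustration under assumptions, not the authors' code: the function names and toy data are hypothetical, and the F statistic is obtained by residualizing both expression and STR dosage on the SNP, which for this pair of nested linear models gives the same residual sums of squares as fitting them directly. Converting F to a p-value requires the F distribution with (1, n−3) degrees of freedom, for example via `scipy.stats.f.sf`, which is not used here to keep the sketch dependency-free.

```python
def residuals(y, x):
    """Residuals of a simple linear regression of y on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

def f_stat_str_given_snp(expr, snp, str_dosage):
    """F statistic for adding STR dosage to a model that already has the SNP."""
    n = len(expr)
    r_expr = residuals(expr, snp)       # expression not explained by the SNP
    r_str = residuals(str_dosage, snp)  # STR variation not tagged by the SNP
    rss_snp_only = sum(r * r for r in r_expr)
    ss_str = sum(r * r for r in r_str)
    if ss_str == 0:
        return 0.0  # STR is perfectly predicted by the SNP; nothing to add
    explained = sum(a * b for a, b in zip(r_expr, r_str)) ** 2 / ss_str
    rss_full = rss_snp_only - explained
    if rss_full <= 0:
        return float("inf")
    return explained / (rss_full / (n - 3))

# Toy data (made-up): expression tracks STR dosage more closely than the SNP.
snp = [0, 0, 1, 1, 2, 2]
str_dosage = [10, 11, 11, 12, 12, 14]
expr = [10.1, 10.9, 11.2, 11.8, 12.1, 13.9]

print(f_stat_str_given_snp(expr, snp, str_dosage))
```

A large F statistic indicates that the STR explains variation in expression beyond what the top SNP captures, which is the criterion described above.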
Phasing and imputing normal alleles at known pathogenic STRs
Finally, to determine whether alleles at known pathogenic STRs could be accurately imputed, we examined results of our imputation pipeline at 12 STRs previously implicated in expansion disorders that were included in our panel (Table 2). Our analysis focused on alleles in the normal repeat range for each STR, since pathogenic repeat expansions at these STRs are unlikely to be present in the SSC cohort. Notably, accurate imputation of non-pathogenic allele ranges is still informative because (1) long normal or intermediate size alleles may result in mild symptoms in some expansion disorders50,51,52; (2) longer alleles are more at risk for expansion53; and (3) allele lengths below the pathogenic range could potentially be associated with more complex phenotypes51.
a HD=Huntington’s Disease; SCA=spinocerebellar ataxia; DRPLA=Dentatorubral-pallidoluysian Atrophy; DM1=Myotonic Dystrophy Type 1; HDL=Huntington’s Disease-Like 2. LOO refers to the leave-one-out analysis in the SSC cohort. The best tag SNP for an STR is defined as the SNP within 50kb with the highest length r2. STRs above the black bar were genotyped using Tredparse and below the bar were genotyped using HipSTR. Values in parentheses for concordance give the expectation under a random model.
Similar to the CODIS markers, these STRs are highly polymorphic with 10 or more alleles per locus. In all cases, imputed genotypes were more strongly correlated with HipSTR or Tredparse genotypes compared to the best tag SNP. Where both HipSTR and Tredparse genotypes were available, concordance results were nearly identical across all STRs (Supplementary Table 7). Visualization of SNP-STR haplotypes at the CAG repeat implicated in dentatorubral-pallidoluysian atrophy (DRPLA)54 reveals a typical complex relationship between STR allele length and local SNP haplotype (Figure 4A), with the same STR allele often present on multiple SNP haplotype backgrounds. Still, for most STRs there is a clear association of specific haplotypes with different allele length ranges allowing accurate imputation across a large range of allele sizes (Figure 4B, Supplementary Figure 10).
A. Example SNP-STR haplotypes inferred in European samples at a polyglutamine repeat in ATN1 implicated in DRPLA. Each column represents a SNP from the founder haplotype reported by Veneziano, et al. Each row represents a single haplotype inferred in 1000 Genomes Project phase 3 European samples, with gray and black boxes denoting major and minor alleles, respectively. Haplotypes are grouped by the corresponding STR allele. The number of SNP haplotypes for each group of STR alleles is annotated to the left of each box. Alleles seen fewer than 10 times in 1000 Genomes samples were excluded from the visualization. B. Comparison of imputed vs. observed STR genotypes in SSC samples at the DRPLA locus. The x-axis gives the maximum likelihood genotype dosage returned by HipSTR and the y-axis gives the imputed dosage. Dosage is defined as the sum of the two allele lengths of each genotype relative to the hg19 reference genome. The bubble size represents the number of samples summarized by each data point. C. Distribution of DRPLA repeat length vs. similarity to the pathogenic founder haplotype. The founder haplotype refers to the SNP haplotype reported by Veneziano, et al. on which a pathogenic expansion in ATN1 implicated in DRPLA likely originated. The x-axis gives the Hamming distance between observed haplotypes and the founder haplotype, computed as the number of positions with discordant alleles. White dots represent the median length.
Resolution of SNP-STR haplotypes can be used to infer the mutation history of a specific STR locus25,26. Notably, for many STR expansion disorders it has been shown that pathogenic expansion alleles originated from a founder haplotype55–58 associated with a long allele. We compared SNP haplotypes at the DRPLA locus in our dataset to a previously reported founder haplotype58. In concordance with the hypothesis of a single founder haplotype, we found that SNP haplotypes with smaller Hamming distance to the known founder haplotype had longer CAG tracts (r=−0.79; p<10^−200). This finding demonstrates that while we were unable to directly impute pathogenic expansion alleles, STR imputation can accurately identify which individuals are at risk for carrying expansions or pre-pathogenic mutations and the inferred haplotypes can reveal the history by which such mutations arise.
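The founder-haplotype comparison reduces to a Hamming distance between SNP haplotypes followed by a correlation with repeat length. The sketch below uses made-up haplotype strings and repeat counts. The founder string is hypothetical and does not reproduce the haplotype reported by Veneziano et al.

```python
import math

def hamming(hap_a, hap_b):
    """Number of SNP positions at which two haplotypes carry different alleles."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNPs")
    return sum(a != b for a, b in zip(hap_a, hap_b))

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical founder haplotype and observed (SNP haplotype, CAG repeats).
founder = "1101"
observed = [("1101", 19), ("1100", 17), ("0100", 15), ("0010", 12)]

distances = [hamming(hap, founder) for hap, _ in observed]
lengths = [length for _, length in observed]

print(distances)
print(pearson_r(distances, lengths))
```

A negative correlation, as in this toy example, corresponds to haplotypes closer to the founder carrying longer repeat tracts.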
Discussion
Our study combines available whole genome sequencing datasets with existing bioinformatics tools to generate the first phased SNP+STR haplotype panel allowing genome-wide imputation of STRs into SNP data. Despite their exceptionally high rates of polymorphism, 92% of STRs in our panel could be imputed with at least 90% concordance, and 38% achieved greater than 99% concordance. Imputation performance varied widely across STRs, primarily due to differences in polymorphism levels across loci. Bi-allelic STRs could be imputed nearly perfectly (average concordance >99%, compared to 74% expected by chance), whereas STRs with the highest heterozygosity, including forensics markers and known pathogenic repeats, could be imputed to around 70% concordance (compared to around 35% expected by chance). We additionally show that imputation improves power to detect STR associations over standard SNP-based GWAS and could detect both known and novel associations between STR lengths and expression of nearby genes.
A widely recognized limitation of GWAS is the fact that common SNP associations still explain only a small fraction of heritability of most traits. Multiple explanations for this have been proposed, including minute effect sizes of individual variants and a potential role for high-impact rare variation59. However, studies in large cohorts reaching hundreds of thousands of samples1–3, as well as deep sequencing studies to detect rare variants60, have so far not confirmed these hypotheses. An increasingly supported idea is that complex variants not well tagged by SNPs may comprise an important component of the “missing heritability.”10,12,61
GWAS is essentially blind to contributions from highly polymorphic STRs and other repeats, despite their known importance to human disease and molecular phenotypes. Thus STR association studies will undoubtedly uncover additional heritability that is so far unaccounted for. Notably, while autism phenotypes are available for the SSC families, this cohort is too small to perform a GWAS and was specifically ascertained for families enriched for de novo, rather than inherited, pathogenic mutations. In future work our panel can be applied to impute STRs into larger cohorts for autism and other complex traits for which tens of thousands of SNP array datasets are available.
Our initial haplotype panel faces several important limitations. First, the majority of samples are of European origin, limiting imputation accuracy in other population groups. Second, imputation accuracy is mediocre for the most highly polymorphic STRs, some of which will ultimately have to be directly genotyped to adequately test for associations. Notably, our work relied on existing tools originally designed for SNP imputation. Further work on computational methods specifically for imputing repeats may be able to improve performance. Finally, thousands of long STRs are filtered from our panel due to the limitation imposed by short read lengths. While we have included target STRs implicated in STR expansion disorders, many long STRs are still inaccessible using current tools. New methods are now being developed for genome-wide genotyping of more complex STRs62 and longer variable number tandem repeats (VNTRs)63 from short reads and can be used to expand our panel in the future.
Overall, our STR imputation framework will enable an entire new class of variation to be interrogated by reanalyzing hundreds of thousands of existing datasets, with the potential to lead to novel genetic discoveries across a broad range of phenotypes.
URLs
Simons Simplex Collection, https://base.sfari.org/
HipSTR, https://github.com/tfwillems/HipSTR
Beagle, https://faculty.washington.edu/browning/beagle/b4_0.html
1000 Genomes phased Affy6.0 and Omni2.5 SNP data, ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/shapeit2_scaffolds/hd_chip_scaffolds/
1000 Genomes Phase 3, http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
1000 Genomes STR data, http://strcat.teamerlich.org/download
Marshfield capillary electrophoresis data, https://payseur.genetics.wisc.edu/strpData.htm
Marshfield marker annotations, https://web.stanford.edu/group/rosenberglab/data/pembertonEtAl2009/Pemberton_AdditionalFile1_11242009.txt
NA12878 10X Genomics data, https://support.10xgenomics.com/genome-exome/datasets/2.2.1/NA12878_WGS_v2
High-coverage Illumina sequencing for 1000 Genomes samples, https://www.ebi.ac.uk/ena/data/view/PRJEB20654
PyVCF, https://github.com/jamescasbon/PyVCF
Python statsmodels, http://www.statsmodels.org/stable/index.html
Author Contributions
M.G. conceived the study, helped design and perform analyses, and drafted the initial manuscripts. S.S. generated the reference haplotype panel, performed downstream analyses, and participated in writing the manuscript. I.M. performed simulation analyses. S.F.F. performed analyses of expression data. N.M. performed analyses of pathogenic STRs. All authors have read and approved the final manuscript.
Competing and Financial Interests
The authors have no competing financial interests to disclose.
Online Methods
SSC Dataset
The SSC Phase 1 dataset consists of 1,916 individuals from 479 quad families. Aligned BAM and gVCF files for whole genome sequencing data of individuals were obtained through SFARI base (see URLs) and processed on Amazon Web Services (AWS). SNP genotypes were called from gVCF files using the GATK version 3 joint calling pipeline64. A total of 27,185,239 variants that passed the default GATK filters and overlapped with sites reported in the 1000 Genomes Project37 phase 3 data were retained for downstream analysis.
We performed principal components analysis (PCA) using SNPs from 2,504 samples from Phase 3 of the 1000 Genomes Project37 and projected SSC samples onto the resulting PCs to infer sample ancestry (Supplementary Figure 1). We estimated that the SSC cohort consists of 1,585 European, 39 East Asian, 172 South Asian, and 69 African samples, plus 51 individuals who did not clearly belong to any single population group.
Genome-wide multi-sample STR genotyping
STRs were jointly genotyped on the AWS EC2 platform in batches of 500 STRs. We streamed the corresponding region of each BAM file and of the phased SNP VCF files to a local EBS volume attached to each EC2 instance using samtools65 version 1.4 and tabix66 version 1.2, respectively. HipSTR31 version v0.5 was called individually per locus with default parameters. Phased SNPs were provided as input to allow HipSTR to perform physical phasing when possible. Resulting VCF files from each batch were merged to create a genome-wide callset in VCF format.
HipSTR calls were filtered using the filter_vcf.py script in the HipSTR package with suggested parameters (--min-call-qual 0.9 --max-call-flank-indel 0.15 --max-call-stutter 0.15). We then removed problematic STRs from the callset using the following criteria: (i) STRs overlapping segmental duplications (UCSC Table Browser67 hg19.genomicSuperDups table) were removed using intersectBed68 v2.25.0; (ii) pentanucleotides and hexanucleotides containing homopolymer runs of at least 5 or 6 nucleotides, respectively, in the hg19 reference genome were removed, as they were found to contain an excess of indels in the homopolymer regions; (iii) STRs with call rate <80% were removed; (iv) STRs with heterozygosity <0.095, corresponding to a minor allele frequency of 5% for bi-allelic markers, were removed to restrict to polymorphic STRs; (v) STRs with significantly more or fewer heterozygous genotypes than expected under Hardy-Weinberg equilibrium (p<0.01) were removed, as described previously69,70. After filtering, 453,671 STRs remained in our panel.
Genotyping clinically relevant STRs
A total of 25 clinically relevant STRs were called using Tredparse71 v0.75 from the aligned BAM files obtained through SFARI base on Amazon EC2. Default profiles containing information about the genomic position, reference repeat length, and repeat motif supplied with the software were used. We removed STRs with call rate less than 80% or for which only a single allele was identified (Supplementary Table 1). Nine STRs remained after filtering.
Computing STR heterozygosity
For an STR with alleles {1…n}, let $p_i$ be the frequency of the ith allele computed from observed genotypes. STR heterozygosity is defined as: $H = 1 - \sum_{i=1}^{n} p_i^2$. For this study all alleles with identical length are treated as the same allele. On average each length-based allele corresponded to 1.8 sequence-based alleles.
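A minimal sketch of this calculation in Python follows, assuming the usual expected-heterozygosity form (one minus the sum of squared allele frequencies) and genotypes already expressed as pairs of allele lengths. The function name and input layout are illustrative and are not taken from the study's scripts.

```python
from collections import Counter

def str_heterozygosity(genotypes):
    """Return 1 - sum(p_i^2) over length-based alleles.

    `genotypes` is a list of (length1, length2) pairs; a missing call is None.
    Alleles are keyed by length, so sequence variants of equal length collapse.
    """
    counts = Counter()
    for gt in genotypes:
        if gt is None:
            continue  # skip no-calls
        counts.update(gt)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no called genotypes")
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

# Bi-allelic example with a 5% minor allele: 190 alleles of length 0, 10 of length 2.
biallelic = [(0, 0)] * 90 + [(0, 2)] * 10
print(round(str_heterozygosity(biallelic), 4))
```

The bi-allelic example reproduces the 0.095 filtering threshold quoted above for a minor allele frequency of 5%.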
Comparison to 1000G catalog
STRs for 1000 Genomes samples as described in Willems et al.14 were downloaded from the strcat site (see URLs). Heterozygosity was computed using the PyVCF package (see URLs) for the 1000 Genomes calls and using a custom script for the SSC data to collapse alleles of identical length into a single allele. STRs passing all filters described above were included in the comparison. Analysis was restricted to STRs with at least 500 calls in the 1000 Genomes dataset.
Comparison to normal allele frequency spectra at clinically relevant STRs
Control distributions for Figure 1E were obtained from previous studies of normal alleles at known pathogenic STRs. Allele frequencies for SCA1, SCA2, SCA3, SCA6, SCA12, SCA8, SCA17, and DRPLA were obtained from Figure 1 of Majounie, et al.32 and are based on 307 controls of Welsh origin. Frequencies for DM1 were obtained from Figure 1 of Ambrose, et al.33 and are based on 254 controls of Chinese origin. Frequencies for HDL were obtained from Figure 1 of Figley, et al.34 and are based on 352 controls of North American Caucasian origin. Frequencies for SCA7 were obtained from Figure 1 of Gouw, et al.35 and are based on 180 controls of European origin. Frequencies for HTT are based on data in the phv00173896.v1.p1 variable of dbGaP study phs000371.v1.p1 (“Genetic modifiers of Huntington’s Disease”) based on the shorter allele of 2,802 patients with Huntington’s Disease.
Phasing SNPs in the SSC
SNP genotypes were phased using SHAPEIT72 version 2.r837 with 1000 Genomes Phase 3 genotypes as a reference panel and ignoring pedigree information. SHAPEIT’s duoHMM73 version 0.1.7 method was used to refine phased haplotypes using pedigree structure and correcting for Mendelian errors.
Phasing STRs
Beagle39 version 4.0 was used to phase each STR separately using phased SNP genotypes, pedigree information, and unphased STR genotypes as input. In order to leverage the HipSTR genotype likelihoods (GL field), Beagle requires all samples to have GL information. To accommodate this, phasing was performed in two steps. First, samples with missing data were removed and the remaining samples were phased using the “-gl” Beagle flag. Next, missing samples were added back to the VCF and all samples were jointly phased in a second Beagle round using default parameters. In this step Beagle additionally imputed any calls with missing genotypes. Genotype values (GT field) were used for the STRs genotyped using Tredparse as it does not report genotype likelihoods, and phasing and imputation of STRs was done in a single step. Phased STRs and SNPs for only the unrelated parent samples from each locus were then merged into a single genome-wide reference panel in VCF format.
Imputation performance metrics
Let X = {x1, x2, …, xn} be the true STR genotypes for samples 1…n and Y = {y1, y2, …, yn} be the imputed STR genotypes. Each genotype xi is defined as xi = (xi1, xi2) where xi1 and xi2 give the (unordered) lengths of the two STR alleles for a diploid sample and similarly for Y. We then define the following metrics:
Genotype concordance
Concordance ci was defined as: 1 if both genotypes match (xi1 = yi1 and xi2 = yi2 or xi2 = yi1 and xi1 = yi2); 0 if neither imputed allele matched a true allele; else 0.5 if one but not both imputed alleles matched the true alleles. Genotype concordance for an STR is the average over all samples: $\frac{1}{n}\sum_{i=1}^{n} c_i$.
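The concordance rule can be sketched in a few lines of Python. The function names are illustrative. One assumption is made for a case the definition leaves open: any genotype pair that is not an exact match but shares at least one allele length (for example true (10, 12) vs. imputed (10, 10)) is scored 0.5.

```python
def genotype_concordance(true_gt, imputed_gt):
    """Score one sample: 1.0 exact match, 0.5 partial match, 0.0 no match."""
    if sorted(true_gt) == sorted(imputed_gt):
        return 1.0  # unordered genotypes are identical
    if any(allele in true_gt for allele in imputed_gt):
        return 0.5  # assumption: any shared allele length counts as a partial match
    return 0.0

def locus_concordance(true_gts, imputed_gts):
    """Average the per-sample scores over all samples at one STR."""
    scores = [genotype_concordance(x, y) for x, y in zip(true_gts, imputed_gts)]
    return sum(scores) / len(scores)

true_gts = [(10, 12), (10, 10), (12, 14)]
imputed_gts = [(12, 10), (10, 12), (16, 18)]
print(locus_concordance(true_gts, imputed_gts))  # scores 1, 0.5, 0 average to 0.5
```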
Length r2
Define the STR genotype dosage as the sum of the lengths of the two alleles at a given site: di = xi1 + xi2 and Xd = {d1, d2, …, dn}, with Yd defined analogously from the imputed genotypes. Length r2 is computed as cov2(Xd, Yd)/(Var(Xd)Var(Yd)).
Allelic r2
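For a given allele length a, define Xa = {a1, a2, …, an} where ai is the number of copies of allele a (0, 1, or 2) in genotype xi, and define Ya analogously from the imputed genotypes. Allelic r2 is computed as cov2(Xa, Ya)/(Var(Xa)Var(Ya)).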
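Both r2 metrics reduce to a squared Pearson correlation between two per-sample vectors, which the following sketch makes explicit. It assumes that the allelic vector holds the number of copies of the chosen allele length carried by each sample, and it returns NaN when either vector has zero variance. The function names are illustrative.

```python
def squared_correlation(xs, ys):
    """Return cov(x, y)^2 / (var(x) * var(y)), or NaN if a variance is zero."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov * cov / (var_x * var_y)

def length_r2(true_gts, imputed_gts):
    """r2 between dosages, where dosage is the sum of the two allele lengths."""
    true_dosage = [a + b for a, b in true_gts]
    imputed_dosage = [a + b for a, b in imputed_gts]
    return squared_correlation(true_dosage, imputed_dosage)

def allelic_r2(true_gts, imputed_gts, allele):
    """r2 between per-sample copy counts (0, 1, or 2) of one allele length."""
    true_counts = [tuple(gt).count(allele) for gt in true_gts]
    imputed_counts = [tuple(gt).count(allele) for gt in imputed_gts]
    return squared_correlation(true_counts, imputed_counts)

true_gts = [(10, 10), (10, 12), (12, 12), (10, 12)]
imputed_gts = [(10, 10), (10, 10), (12, 12), (10, 12)]
print(length_r2(true_gts, imputed_gts), allelic_r2(true_gts, imputed_gts, 12))
```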
Best tag SNP
The best tag SNP for an STR is defined as the SNP within 50kb with the highest length r2.
For all concordance metrics, outlier genotypes containing alleles seen fewer than 3 times in the entire cohort were removed from the analysis.
For each STR, we additionally computed the expected value of each metric under a model where genotypes are imputed randomly based on the frequency of underlying alleles. Expected genotype concordance was calculated as $\sum_{(i,j)} \sum_{(k,l)} f_i f_j f_k f_l \, C(i,j,k,l)$, where (i, j) ∈ {1, …, n}2 and (k, l) ∈ {1, …, n}2, n is the number of alleles, fx gives the frequency of allele x, and C(i,j,k,l) gives the concordance between genotypes (i,j) and (k,l) as defined above. For example, for a bi-allelic marker with allele frequencies f1 and f2, expected genotype concordance is given by $f_1^4 + f_2^4 + 4 f_1^2 f_2^2 + 2 f_1^3 f_2 + 2 f_1 f_2^3$. Random model values for length r2 and allelic r2 were computed by comparing randomly imputed genotypes to true genotypes at each locus.
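The random-model expectation for genotype concordance can be checked numerically by enumerating every pair of ordered genotypes and weighting the concordance score by the product of the four allele frequencies. The sketch below does this and compares the result with a closed form for the bi-allelic case, obtained by enumerating the genotype pairs by hand. It assumes that any non-identical genotype pair sharing an allele scores 0.5; the function names are illustrative.

```python
from itertools import product

def concordance(g1, g2):
    """1.0 for identical unordered genotypes, 0.5 for a shared allele, else 0.0."""
    if sorted(g1) == sorted(g2):
        return 1.0
    return 0.5 if any(a in g1 for a in g2) else 0.0

def expected_concordance(freqs):
    """Expected concordance when genotypes are drawn at random from `freqs`.

    `freqs` maps each allele length to its frequency (frequencies sum to 1).
    """
    alleles = list(freqs)
    total = 0.0
    for i, j in product(alleles, repeat=2):        # true genotype (i, j)
        for k, l in product(alleles, repeat=2):    # randomly imputed genotype (k, l)
            weight = freqs[i] * freqs[j] * freqs[k] * freqs[l]
            total += weight * concordance((i, j), (k, l))
    return total

f1, f2 = 0.8, 0.2
closed_form = f1**4 + f2**4 + 4 * f1**2 * f2**2 + 2 * f1**3 * f2 + 2 * f1 * f2**3
print(expected_concordance({0: f1, 2: f2}), closed_form)
```

The enumeration grows with the fourth power of the number of alleles, which is acceptable for a per-locus calculation but would need pruning of rare alleles at very polymorphic STRs.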
Evaluating imputation performance in the 1000 Genomes data
STRs were imputed into SNP data downloaded from the 1000 Genomes Project site from three sources (WGS; phased SNPs from the Affy6.0 array; and phased SNPs from the Omni2.5 array; see URLs and Supplementary Table 5) with Beagle version 4.1 using the SSC SNP-STR haplotype panel. For comparison to WGS, STRs were jointly genotyped in high-coverage WGS datasets for 150 of the 1000 Genomes Project samples (see URLs) using HipSTR version 0.6 followed by the filtering steps described above for the SSC cohort.
Capillary electrophoresis genotypes for 209 samples at 721 Marshfield STRs were downloaded from the Payseur Lab website (see URLs). PCR product sizes were converted to length differences in bp from the reference genome using product size annotations74 available from the Rosenberg Lab website (see URLs). Prior to comparing genotypes, offsets were calculated to match HipSTR lengths to the length of Marshfield STRs as previously described14. STRs with imperfect repeat structures were removed. Capillary genotypes were rounded down to the nearest number of repeat units.
10X Genomics data for NA12878 was obtained from the NA12878 Germline Genome v2 available on the 10X Genomics website (see URLs). We extracted reads belonging to phase 1 or 2 from the phased, barcoded BAM based on the HP tag into separate BAM files. HipSTR v0.6.1 was run separately on each BAM with non-default parameters --def-stutter-model --min-reads 5 --use-unpaired and with --haploid-chrs containing a list of all autosomal chromosomes to force a haploid genotyping model. Haploid STR calls were obtained for both phases at 118,353 STRs. We identified the nearest heterozygous SNP to each STR that was genotyped in both the 10X data and in our phased panel. STRs for which the nearest SNP had discordant genotypes in the two datasets were discarded, leaving 116,764 STRs for analysis.
Simulations for power analysis
We analyzed parental genotypes for 5,838 STRs across chromosome 21 that passed filtering and quality control as described above. For each STR, we simulated quantitative phenotype datasets under the model: P = βG + E, where P is a vector of standard normalized phenotypes, β gives the effect size, E gives the error term drawn from a normal distribution N(0, 1 – β), and G is a vector of the sum of genotype lengths for each individual scaled to have mean 0 and variance 1. For each simulated phenotype dataset, we tested the causal STR, the imputed STR genotypes, and the best tag SNP (strongest length r2) within 50kb of the STR for association. Association tests were performed using the Python statsmodels library OLS method (see URLs).
We performed additional simulations under a case-control model shown in Supplementary Figure 8. Phenotypes (0=control, 1=case) were drawn for each sample according to the model logit(pi) = βXi where pi is the probability that sample i is a case and Xi is the scaled genotype for individual i as described above. Association tests were performed using the Python statsmodels Logit method.
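As an illustration of the case-control draw, the sketch below converts each scaled genotype into a case probability with the inverse logit and samples a 0/1 phenotype. The function name, the random seed, and the use of normally distributed stand-in genotypes are assumptions made for the example only.

```python
import math
import random

def draw_case_control(scaled_genotypes, beta, rng):
    """Draw 0/1 phenotypes under logit(p_i) = beta * X_i."""
    phenotypes = []
    for x in scaled_genotypes:
        p_case = 1.0 / (1.0 + math.exp(-beta * x))  # inverse logit
        phenotypes.append(1 if rng.random() < p_case else 0)
    return phenotypes

rng = random.Random(42)
# Stand-in for scaled STR dosages (mean 0, variance 1); real input would be genotypes.
scaled = [rng.gauss(0.0, 1.0) for _ in range(1000)]
cases = draw_case_control(scaled, beta=0.1, rng=rng)
print(sum(cases) / len(cases))  # fraction of samples drawn as cases
```

Because the model has no intercept, roughly half of the samples are drawn as cases when genotypes are centered at zero.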
For the non-additive phenotype example (Supplementary Figure 9), we performed simulations under a quadratic model: P = βG2 + E where G is a vector of the squared sum of allele lengths scaled by the mean allele length, and P, β, E are as described above. Two sets of association tests were performed: the first tested for association between STR length and phenotype (Supplementary Figure 9B) and the second set performed a separate association test for each STR allele treating the allele as a bi-allelic locus (Supplementary Figure 9C).
In all cases 100 separate simulations were performed and power was defined as the percent of simulations for which the nominal association p-value was less than 0.05. Figures show results for all simulations with β set to 0.1.
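The quantitative-trait simulation and the power definition can be sketched with the standard library alone. Two assumptions are made and flagged in the comments: the error term N(0, 1 – β) is read as having variance 1 – β, and the statsmodels OLS test is replaced by a simple linear regression whose p-value uses a normal approximation to the t statistic, which is adequate only for large samples. All function names are illustrative.

```python
import math
import random

def scale(values):
    """Center to mean 0 and scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def simulate_phenotype(scaled_dosages, beta, rng):
    """P = beta * G + E, with E ~ N(0, 1 - beta) read as a variance (assumption)."""
    noise_sd = math.sqrt(1.0 - beta)
    return [beta * g + rng.gauss(0.0, noise_sd) for g in scaled_dosages]

def association_pvalue(x, y):
    """Two-sided slope p-value; normal approximation swapped in for statsmodels OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return math.erfc(abs(slope / std_err) / math.sqrt(2.0))

def power(dosages, beta=0.1, n_sim=100, alpha=0.05, seed=0):
    """Percent of simulations with nominal p-value below alpha."""
    rng = random.Random(seed)
    g = scale(dosages)
    hits = 0
    for _ in range(n_sim):
        phenotype = simulate_phenotype(g, beta, rng)
        if association_pvalue(g, phenotype) < alpha:
            hits += 1
    return 100.0 * hits / n_sim

# Hypothetical dosages for 500 samples; real input would be true or imputed STR dosages.
demo_rng = random.Random(1)
dosages = [demo_rng.choice([0, 2, 4]) + demo_rng.choice([0, 2, 4]) for _ in range(500)]
print(power(dosages, beta=0.1))
```

Passing the imputed STR dosages or the best tag SNP genotypes as the test variable, while simulating from the true STR dosages, would give the corresponding power comparison.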
eSTR analysis
Data for eSTR analysis was obtained from the Genotype-Tissue Expression (GTEx) project through dbGaP under phs000424.v7.p2. This included high coverage (30x) Illumina whole genome sequencing (WGS) data from 650 unrelated samples, Omni 2.5 SNP genotypes for 450 samples, and gene-level RPKM values for whole blood in 336 samples. STRs were genotyped from WGS data using HipSTR v0.5 and subject to the same quality filtering as SSC samples. STRs were additionally imputed into Omni2.5 data with Beagle as described above. Downstream analyses were restricted to the 336 samples with available whole blood expression data. These samples consisted of 284 European, 45 African American, 3 Asian, and 3 Amerindian samples and 2 samples with no population label available.
We performed separate eSTR analyses using HipSTR and imputed genotypes. In each case, as described previously15, we performed a separate association test between gene expression and each STR within 100kb of the gene using a model Y = βX + C + ε, where X denotes STR genotype lengths, Y denotes expression values, β denotes the effect size, C denotes various covariates, and ε is the error term. Following our previous study75, we used “STR dosage”, defined as the sum of repeat lengths of the two alleles for each sample, to define STR genotypes. All repeat lengths are reported as length difference from the hg19 reference, with 0 representing the reference allele. STR dosages were scaled to have mean 0 and variance 1. Genes with median expression of 0 were excluded and expression values for remaining genes were quantile normalized to a standard normal distribution. We included sex, population structure, and technical variation in expression as covariates. For population structure, we used the top 15 principal components resulting from performing principal components analysis on the matrix of SNP genotypes from each sample. To control for technical variation in expression, we applied PEER factor correction76,77 using 83 PEER factors.
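The two preprocessing steps applied before each association test, scaling STR dosages and quantile normalizing expression, can be sketched as follows. The rank-to-quantile mapping (rank − 0.5)/n and the handling of ties by input order are assumptions, since the text does not specify them; the function names are illustrative.

```python
from statistics import NormalDist, mean, pstdev

def str_dosage(genotypes):
    """Sum of the two allele lengths, each given as bp difference from the reference."""
    return [a + b for a, b in genotypes]

def standardize(values):
    """Scale to mean 0 and variance 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def quantile_normalize(values):
    """Map values to standard normal quantiles by rank.

    Assumption: rank r (1-based) maps to the (r - 0.5) / n quantile, and ties
    are ordered by their position in the input.
    """
    n = len(values)
    order = sorted(range(n), key=lambda idx: values[idx])
    normal = NormalDist()
    result = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        result[idx] = normal.inv_cdf((rank - 0.5) / n)
    return result

dosages = standardize(str_dosage([(0, 0), (0, 2), (2, 2), (-2, 0)]))
expression = quantile_normalize([5.0, 1.0, 3.0])
print(dosages, expression)
```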
We used model comparison to determine whether the best eSTR for each gene explained variation in gene expression beyond a model consisting of the best eSNP. As described previously75, for each gene with an eSTR we determined the lead eSNP with the strongest p-value. We then compared two linear models: Y~eSNP (SNP-only model) vs. Y~eSNP+eSTR (SNP+STR model) using the anova_lm function in the Python statsmodels.api.stats module. We used CAVIAR v1.0 to further fine-map eSTR signals against the top 100 eSNPs within 100kb of each gene. Pairwise LD between the eSTR and eSNPs was estimated using the Pearson correlation between SNP dosages (0, 1, or 2) and STR dosages (sum of the two repeat allele lengths).
Comparison to DRPLA founder haplotypes
The founder haplotype for the expansion allele in ATN1 implicated in DRPLA was taken from Table 1 of Veneziano et al.58 and consists of rs4963516, rs1007924, rs7310941, rs7303722, rs2239167, rs34199021, rs2071075, rs2071076, and rs2159887 with hg19 alleles G, A, G, T, A, A, T, C, and C respectively. Distance from the founder haplotype was calculated as the number of mismatches.
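The mismatch count is a Hamming distance over the nine founder SNPs. The sketch below assumes each phased haplotype is available as a mapping from rsID to allele and counts a SNP missing from the haplotype as a mismatch; both choices, and the function name, are illustrative.

```python
# Founder haplotype as listed in the text (hg19 alleles).
FOUNDER_SNPS = ["rs4963516", "rs1007924", "rs7310941", "rs7303722", "rs2239167",
                "rs34199021", "rs2071075", "rs2071076", "rs2159887"]
FOUNDER_ALLELES = ["G", "A", "G", "T", "A", "A", "T", "C", "C"]

def founder_distance(haplotype):
    """Number of founder SNPs at which `haplotype` differs from the founder allele.

    `haplotype` maps rsID -> allele on one phased chromosome.
    Assumption: a SNP absent from the mapping counts as a mismatch.
    """
    return sum(
        1
        for snp, allele in zip(FOUNDER_SNPS, FOUNDER_ALLELES)
        if haplotype.get(snp) != allele
    )

founder = dict(zip(FOUNDER_SNPS, FOUNDER_ALLELES))
variant = dict(founder, rs1007924="G", rs2159887="T")  # two hypothetical differences
print(founder_distance(founder), founder_distance(variant))
```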
Data Availability
Phased SNP-STR haplotypes for 1000 Genomes Project phase 3 samples and example commands for imputation are available at https://gymreklab.github.io/2018/03/05/snpstr_imputation.html. Upon acceptance for publication STR genotypes and phased SNP-STR haplotypes for the SSC samples will be made available at https://base.sfari.org/.
Code Availability
Analysis scripts and Jupyter notebooks for reproducing the figures in this study are provided in the Github repository https://github.com/gymreklab/snpstr-imputation.
Acknowledgements
Research reported in this publication was supported in part by the Office Of The Director, National Institutes of Health under Award Number DP5OD024577 and by a SFARI Explorer Award Number 515568. Access to SSC data was approved for this project under request id 2405.1.1. M.G. was supported in part by NIH/NIMH grant R01 MH113715. This work used the Extreme Science and Engineering Discovery Environment (XSEDE) comet resource at the San Diego Supercomputing Center through allocations ddp268 and csd568. XSEDE is supported by National Science Foundation grant number ACI-1548562. We thank Alon Goren for helpful comments on the manuscript. We additionally thank Vineet Bafna and Vikas Bansal for helpful discussions and providing access to compute resources. We are grateful to all of the families that participated in the Simons Simplex Collection as well as the principal investigators.