Abstract
Telomeres shorten in replicating somatic cells and with age; in human leukocytes, telomere length (TL) is associated with a host of aging-related diseases1,2. To date, 16 genome-wide association studies (GWAS) have identified twenty-three loci associated with leukocyte TL3–18, but prior studies were primarily in individuals of European and Asian ancestry and relied on laboratory assays including Southern Blot and qPCR to quantify TL. Here, we estimated TL bioinformatically, leveraging whole genome sequencing (WGS) of whole blood from n=75,176 subjects in the Trans-Omics for Precision Medicine (TOPMed) Program. We performed the largest multi-ethnic and only WGS-based genome-wide association analysis of TL to date. We identified 22 associated loci (p-value <5×10−8), including 10 novel loci. Three of the novel loci map to genes involved in telomere maintenance and/or DNA damage repair: TERF2, RFWD3, and SAMHD1. Many of the 99 pathways identified in gene set enrichment analysis for the 22 loci (multiple-testing corrected false discovery rate (FDR) <0.05) pertain to telomere biology, including the top five (FDR<1×10−9). Importantly, several loci, including the recently identified TINF2 and ATM6 loci, showed strong ancestry-specific associations.
Results
High throughput sequencing with decreasing sequencing cost per sample has enabled the generation of WGS data at an unprecedented scale, and the National Heart, Lung and Blood Institute’s TOPMed Program offers the opportunity to address both sample size and population diversity limitations of prior TL GWAS. To optimize the computational task of estimating TL on the full set of TOPMed WGS samples, we compared two estimation methods, TelSeq 19 and Computel 20, on a subset of samples for which we had prior laboratory-based telomere length measurements from Southern blot. TelSeq and Computel estimates were highly correlated (Pearson correlation r=0.98, Supplementary Figure S1A) and had similar correlation with Southern blot data (Pearson correlation r=0.57 and 0.55 for TelSeq and Computel, respectively, Supplementary Figure S1B); this is similar to what has been reported previously 21. We selected TelSeq due to its computational efficiency (Supplementary Figure S1C). Given the sample heterogeneity and complexity of generating WGS across the large number of cohorts in the TOPMed program 22 (Nature, submitted, 2019), not unexpectedly, we observed cross-study and cross-sequencing center effects (Supplementary Figures S2A and S2B), and we chose a statistical approach to minimize them (see Materials and Methods and Supplementary Figures S2C and S2D). The final sample set analyzed included 38,193 European ancestry, 21,179 African ancestry, 9,808 Hispanic/Latino, 4,754 Asian ancestry, and 1,242 Samoan individuals. 42% of participants were male and age ranged from <1 to 98, median 55 years (Supplementary Tables S1A and S1B).
Genome-wide tests for association across 93M variants (genotype calling pipeline, sample selection and sequencing details described under Materials and Methods) were performed in multiple stages, reflecting different WGS freezes; the final analysis included a discovery set (n=46,458), a replication set (n=28,718), and a meta-analysis of both sets (n=75,176). We identified 22 loci reaching a meta-analysis p-value <5×10−8 (Figures 1 and 2, Table 1), of which 12 loci met the threshold of 5×10−9 recently suggested for WGS-based GWAS analyses 23. Of the 23 prior loci discovered through GWAS of TL assessed with laboratory assays, we confirmed twelve (TERC, TERT, NAF1, RTEL1, OBFC1, DCAF4, ZNF676, ACYP2, and the recently identified TERF1, TINF2, POT1 and ATM loci) at a significance threshold of p-value <5×10−8 (Table 1, Supplementary Tables S2A and S2B). Nominal evidence (p-value < 0.05/23) was noted for an additional 5 previously known loci at the specific reported variants (PARP1, NKX2-3, MPHOSPH6, TYMS, and ZNF208; see Supplementary Table S3).
Among the 22 loci reaching traditional GWAS thresholds in the multi-ethnic TOPMed samples, we also identified 10 novel loci (Table 1, Supplementary Tables S2A and S2B), three of which include genes encoding proteins that have plausible roles in telomere biology (the index gene definition for each locus is described in Table 1). RFWD324 plays a key role in DNA damage repair; TERF2 is a component of the telomere shelterin complex; and depletion of SAMHD1, which has reported roles in DNA resection and homology-directed repair, has been shown to lead to telomere breakage events in cells deprived of the shelterin component TERF125, a recently reported GWAS locus that we also identify as a Tier 1 locus. Gene set enrichment analysis 26,27 including the index gene(s) for each of the 22 loci resulted in 99 sets with an FDR < 0.05 (Supplementary Table S4). The top 5 gene sets, all with an FDR <1×10−9, were: regulation of telomere maintenance via telomere lengthening (GO:1904356), regulation of telomere maintenance (GO:0032204), negative regulation of telomere maintenance (GO:0032205), telomere maintenance (GO:0000723), and telomere organization (GO:0032200).
Each peak variant at a locus, henceforth referred to as the sentinel variant for that locus, accounts for a small proportion of phenotypic variation (Table 1), consistent with prior GWAS of telomere length. Prior GWAS SNPs cumulatively account for 2% - 3% of trait variance 28, with allelic effects ranging from ~ 49-117 base pairs. In the TOPMed data, effect sizes for common variants (minor allele frequency, MAF ≥5%) range from 22-71 bp/allele. Rare and low frequency variants (MAF <5%) show larger effects (152-631 bp/allele). Cumulatively, the 22 sentinel variants from the TOPMed WGS-based GWAS account for ~1.5% of phenotypic variance. Individually, TERC, TERT and OBFC1 each account for the largest phenotypic variance (~0.2%) and have similar effect sizes (~60-70bp/allele).
In an attempt to look beyond the single variant approaches, gene-based tests identified five protein coding genes with deleterious rare and low frequency (MAF <5%, including singletons) coding variants associated with telomere length in the discovery samples (see Materials and Methods and Supplementary Figure S3A, Supplementary Table S5A): RTEL1, RTEL1-TNFRSF6B, ATM, KDELC2, and NAF1. For each of these genes, a leave-one-out approach iterating over each deleterious variant identified one to three driver variants accounting for the association signal at the gene (Supplementary Figures S3B-S3F). Testing for evidence at these specific driver variants in the independent replication sample provides confirmation for RTEL1, RTEL1-TNFRSF6B and NAF1 (Supplementary Table S5B), with the same shared variants for RTEL1 and RTEL1-TNFRSF6B. All three genes are noted to be index genes from the GWAS loci identified (Table 1). Linkage disequilibrium with the peak sentinel variant supports overlap between the gene-based and single variant signals. Importantly, while the KDELC2 driver variant was not replicated (while showing a consistent direction of effect between the discovery and replication), the minor allele (rs74911261/A) of the driver variant from the discovery sample has previously been shown to be associated with decreased risk of breast cancer 29 and increased risk of renal cell carcinoma 30 in European ancestry individuals. Notably, the A allele is associated with lower telomere length (−80.4 bp/allele) and supports the prior observation that shorter telomere length is strongly associated with increased risk for renal cancers 31.
An evaluation of the 22 loci by race/ethnicity demonstrates that many of these loci are associated with TL in multiple groups. As illustrated in Figure 3 (see also Supplementary Figures S4A – S4V, Supplementary Table S2B) the previously reported TERC, TERT, RTEL1, TERF1, TINF2 and OBFC1 loci have p-values <10−5 among non-European populations. Among the novel loci identified, RFWD3 and TERF2 have p-values < 10−5 in non-European groups. Not surprisingly, most of the 22 loci had strong evidence of association in the European ancestry sample, which also had the largest sample size. In fact, there were several loci (ATM, CHKB-AS1/MAPK8IP2, LINC01592/LOC100505739, OPRK1/ATP6V1H, RPN1 and YY1P2/LRP1B) where association was limited to the European ancestry sample (p-values<10−5); no variant mapping to these loci reached a p-value < 0.0023 (0.05/22 for the total of 22 loci evaluated) in any other population. One notable exception was the TINF2 locus, where the sentinel variant is highly differentiated between ancestral populations. The TINF2 association was not observed in the European population, where the allele frequency for the alternate allele is extremely low (AAF=0.05%, p-value=0.04), as compared to the Asian (AAF=9%, p-value=1.3×10−5), African (AAF=1%, p-value=2.6×10−5) and Samoan (AAF=23%, p-value=1.3×10−7) samples (Supplementary Table S2B).
Leukocyte telomere length (LTL) is associated with mortality and aging-related diseases such as cancer32, and genetic variants associated with LTL have previously been associated with risk of cancers as well as other non-neoplastic diseases of aging33. We analyzed 1403 international classification of disease (ICD)-based phenotypes in ~402,000 Europeans from the UK Biobank (Supplementary Table S6, Supplementary Figures S5A-S5C), and we noted that the sentinel variants at TERT and TERC each had multiple phenome-wide disease associations (PheWAS), including myeloproliferative neoplasms, cancers of skin and brain, and leiomyoma/benign neoplasms of the uterus (all p-values<1.8×10−6). The associations with uterine leiomyomata are consistent with recently published GWAS which found that several telomere length-associated genes and variants (TERT, TERC, OBFC1, ATM) have genome-wide significant associations with uterine fibroids 34. Notably, several of our TOPMed sentinel variants (NAF1, TERF1, ZNF729, POT1, CHKB-AS1) had uterine fibroid p-values in the range of 0.008 to 0.07 in our UK Biobank PheWAS analysis. Additionally, several of the sentinel telomere length variants or their proxies (TERT, TERC, RFWD3, TCL1A, RPN1) were associated with quantitative hematologic traits or myeloproliferative disorders and malignancies either in the UK Biobank or in recently published GWAS 30,35,36. As a follow-up to assess functional relevance, we used a set of 31,684 blood samples from eQTLGen 37 and found that 17 of the 18 sentinel variants present in the data set were eQTLs for at least one local eGene. For many of these, the top eGene is the index gene we identified at the locus (Supplementary Table S7), but we recognize the limitation in the use of whole blood from adult samples as the sole tissue interrogated.
Leveraging WGS available through NHLBI’s TOPMed program, we have illustrated the feasibility of generating high quality TL from WGS data. We were able to take advantage of the well-powered sample size and multi-ethnic nature of the sample to confirm known GWAS loci and identify an additional set of novel loci that map to genes with plausible biological validity. We also explored loci across populations of diverse ancestry. The ability to implement this phenotype assessment of TL in large, multi-ethnic datasets with pre-existing WGS creates opportunities beyond the genetics of TL; it will expand our ability to evaluate the role of TL and genes determining TL in health and human disease.
Supplementary Information is linked to the online version of the paper.
Author Contributions
M.A.T., R.A.M. conceived of and led the study. M.A.T., K.R.I., L.R.Y., M.P.C., A.K., M.Arvanitis, Y.C.C., L.M.R., M.Armanios, M.H.C., M.D., D.L., A.B., T.W.B., I.R., J.A.P., A.P.R., R.A.M. drafted the manuscript. M.A.T., J.S.W., M.P.C., J.A.B., A.K., C.C.L., G.A., A.A., D.A.N., J.G.W., S.S.R., D.L., A.B., T.W.B., I.R., T.T., J.O., J.A.P., N.P., A.P.R., R.A.M. contributed substantive analytical guidance. M.A.T., J.S.W., K.R.I., L.R.Y., M.P.C., J.A.B., A.K., C.A.L., M.Arvanitis, A.V.S., J.Lane, A.P.R., R.A.M. performed and led analysis. L.R.Y., L.C.B., J.C.B., J.B., E.R.B., E.G.B., J.C.C., Y.C.C., B.C., D.D., L.d., D.L.D., B.I.F., M.E.G., M.T.G., S.R.H., B.A.H., C.I., M.R.I., W.C.J., S.Kaab, L.L., J.Lee, S.L., A.M., K.E.N., P.A.P., N.R., L.M.R., D.E.W., M.M.W., L.W., W.Z., M.Armanios, S.A., P.L.A., D.W.B., B.E.C., I.Y.C., M.H.C., L.A.C., J.E.C., M.D., R.D., X.G., L.H., S.H., J.M.J., E.E.K., A.M.L., C.L., R.L.M., M.N., E.C.S., J.A.S., N.L.S., J.L.S., M.J.T., H.K.T., R.P.T., M.J.W., Y.Z., K.L.W., S.T.W., R.S.V., K.D.T., M.F.S., E.K.S., M.S., W.H.S., J.I.R., S.R., B.M.P., J.M.P., N.D.P., R.J.L., C.G.M., B.D.M., D.A.M., S.T.M., A.C.M., R.Kumar, C.K., B.A.K., S.Kelly, S.L.K., R.Kaplan, J.H., H.G., M.F., P.T.E., M.d., A.C., E.B., K.C.B., A.E.A., D.K.A., C.A., A.A., J.G.W., S.S.R., D.L., J.O., A.P.R., R.A.M. were involved in the guidance, collection and analysis for one or more of the studies which contributed data to this manuscript. All authors read and approved the final draft.
Author Information
Data Deposition Statement
TOPMed genomic data and pre-existing parent study phenotypic data are made available to the scientific community in study-specific accessions in the database of Genotypes and Phenotypes (dbGaP) (https://www.ncbi.nlm.nih.gov/gap/?term=TOPMed). Telomere length calls were derived from the raw sequence data as described in the Online Methods, and the phenotype covariates of age, sex, and race/ethnicity are available through the study-specific dbGaP accession IDs as listed in the Supplementary Information.
Competing Interests
The authors declare the following competing interests:
J.C.C. has received research materials from GSK and Merck (inhaled steroids) and Pharmavite (vitamin D and placebo capsules) to provide medications free of cost to participants in NIH-funded studies, unrelated to the current work.
B.I.F. is a consultant for Ionis and AstraZeneca Pharmaceuticals.
L.W. is on the advisory board for GSK and receives grant funding from NIAID, NHLBI, and NIDDK, NIH.
S.A. receives equity and salary from 23andMe, Inc.
M.H.C. receives grant support from GlaxoSmithKline.
S.T.W. receives royalties from UpToDate.
E.K.S. received grant and travel support from GlaxoSmithKline in the past three years.
B.M.P. serves on the Steering Committee of the Yale Open Data Access Project funded by Johnson & Johnson.
P.T.E. is supported by a grant from Bayer AG to the Broad Institute focused on the genetics and therapeutics of cardiovascular diseases; has served on advisory boards or consulted for Bayer AG, Quest Diagnostics, and Novartis.
K.C.B. receives royalties from UpToDate.
Correspondence and requests for materials should be addressed to Rasika A Mathias, ScD, rmathias{at}jhmi.edu, 410-550-2487
Online Methods
TOPMed study populations
To perform this multi-ethnic genome-wide association study of telomere length, we leveraged the whole genome sequence samples available through the NHLBI’s Trans-Omics for Precision Medicine (TOPMed) 22 (Taliun, Nature, Submitted, 2019) program. The program currently consists of more than 80 participating studies 38 across a range of study designs as described in Taliun et al 22 (Nature, submitted, 2019). These participants are mainly U.S. residents with diverse ancestry and ethnicity (European, African, Hispanic/Latino, Asian, and other). Details on the specific samples included for telomere length analysis are outlined below, summarized in Supplementary Tables S1A and S1B, and described by TOPMed 38.
TOPMed whole genome sequencing (WGS)
WGS was performed to an average depth of 38X using DNA isolated from blood, PCR-free library construction, and Illumina HiSeq X technology. Details for variant calling and quality control are described in Taliun et al. 22 (Nature, submitted, 2019). Briefly, variant discovery and genotype calling were performed jointly, across all the available TOPMed Freeze 5b (September 2017) and Freeze 6a studies (August 2018), using the GotCloud 39 pipeline resulting in a single, multi-study, genotype call set.
Estimating telomere length for whole-genome sequencing (WGS) samples
A variety of computational tools exist that leverage WGS data to generate an estimate of telomere length 40. Here, we performed a thorough comparison of two leading methods for estimating telomere length from WGS data to choose the preferred scalable method for performing the estimation on all available samples from TOPMed. The first method, TelSeq 19, calculates an estimate of individual telomere length using counts of sequencing reads containing a fixed number of repeats of the telomeric nucleotide motif TTAGGG. Given that 98% of our data was sequenced using read lengths of 151 or 152 (as confirmed from the SEQ field in the analyzed CRAM files), we chose to use a repeat number of 12. These read counts are then normalized according to the number of reads in the individual WGS data set with between 48% and 52% GC content, to adjust for potential technical artifacts related to GC content. The second method, Computel 20 uses an alignment-based method to realign all sequenced reads from an individual to a “telomeric reference sequence”. Reads aligning to this reference sequence are considered to be telomeric and are included in the estimate of telomere length. Because Computel performs a complete realignment, additional computational steps are involved compared to those needed for TelSeq.
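The read-counting idea behind a TelSeq-style estimate can be sketched in a few lines. This is a minimal illustration, not the TelSeq implementation: the function names are hypothetical, the repeat-counting rule (non-overlapping matches of TTAGGG or its reverse complement) is a simplification, and the two scaling constants (genome length falling in the 48-52% GC window and 46 telomere ends) are assumptions taken from the general description of the method and should be checked against the tool itself.

```python
import re

# Telomeric hexamer and its reverse complement (reads may come from either strand).
MOTIFS = ("TTAGGG", "CCCTAA")


def motif_count(read: str) -> int:
    """Number of non-overlapping motif occurrences, taking the better strand."""
    return max(len(re.findall(m, read)) for m in MOTIFS)


def gc_fraction(read: str) -> float:
    """Fraction of G or C bases in a read."""
    return (read.count("G") + read.count("C")) / len(read) if read else 0.0


def telseq_like_estimate(reads, k=12, gc_low=0.48, gc_high=0.52,
                         gc_window_genome_bp=332_720_800, telomere_ends=46):
    """Rough telomere length in kb from an iterable of read sequences.

    k is the repeat threshold (12 in the text for 151/152 bp reads).
    gc_window_genome_bp and telomere_ends are ASSUMED constants.
    """
    reads = list(reads)
    telomeric = sum(1 for r in reads if motif_count(r) >= k)
    gc_matched = sum(1 for r in reads if gc_low <= gc_fraction(r) <= gc_high)
    if gc_matched == 0:
        raise ValueError("no reads in the GC normalization window")
    # Telomeric reads per GC-matched read, scaled to bp per telomere end, in kb.
    return telomeric / gc_matched * gc_window_genome_bp / telomere_ends / 1000.0


# Toy input: one telomeric read, two GC-matched non-telomeric reads, one AT-only read.
toy_reads = ["TTAGGG" * 25 + "T", "ACGT" * 37 + "ACG", "ACGT" * 37 + "ACG", "AT" * 75 + "A"]
toy_estimate = telseq_like_estimate(toy_reads)
```

Normalizing by reads in a narrow GC window, rather than by all reads, is what makes the estimate less sensitive to GC-related coverage artifacts, since telomeric repeats themselves have roughly 50% GC content.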
To compare the results and scalability from these two methods, we first directly compared estimates obtained from TelSeq and Computel on 3362 samples from the Jackson Heart Study (JHS) and found them to be highly correlated with one another (Pearson correlation r=0.98, Supplementary Figure S1A). We also compared computational time to generate the telomere length estimates on these samples and show that Computel is an order of magnitude more time-consuming (Supplementary Figure S1C). This is in part due to the fact that Computel requires CRAM-formatted files (as the WGS data are currently stored) to first be converted back to Fastq format (while TelSeq requires a CRAM to BAM conversion), but also due to the computationally expensive step of realignment to the telomeric reference genome that the Computel algorithm employs.
As a further comparison to orthogonally measured telomere length values, we used data from 2429 samples from JHS with Southern blot41 telomere length estimates 42. For these samples, the Southern blot assay was performed on the same source DNA sample that was used to generate the WGS in TOPMed. The Pearson correlation values between the TelSeq and Computel estimates and the Southern blot estimates did not differ (r=0.57 and 0.55 for TelSeq and Computel, respectively, Supplementary Figure S1B). We do note some technical sources of variability in our data, both within a study (colors in Supplementary Figures S1A and S1B indicate grouping by shared 96-well plate for shipment to sequencing center for these JHS samples) and across studies (Supplementary Figures S2A and S2B). Cross-study differences are accounted for in our modeling process (see Supplementary Figures S2C and S2D, and Single variant tests for association, below).
Based on our observation that both Computel and TelSeq showed similar correlation to the Southern blot estimates and high correlation with each other, and that TelSeq was an order of magnitude more computationally efficient, we chose to use TelSeq to perform telomere length estimation on our data.
Final telomere length estimation was performed on a set of 93,219 samples whose CRAM-files were available for analysis at the TOPMed IRC at the time of analysis.
Samples included in genetic analysis
Samples with telomere length estimated from the WGS data from the TOPMed Studies described above were included in either a discovery or replication dataset (Supplementary Table S1A and S1B) based primarily on their release as part of the TOPMed WGS data processing “Freezes” (Taliun, Nature, Submitted, 2019) 22. The discovery dataset (n=46,458, Supplementary Table S1A) is comprised of samples that were included in the TOPMed freeze5b data set (Taliun, Nature, Submitted, 2019) 22, released in September 2017, passing sample-level quality control (QC) checks as determined by the TOPMed Data Coordinating Center (DCC) (e.g. concordance of annotated and genetic sex, comparisons of genetically inferred and pedigree reported relatedness, and concordance of WGS genotype calls with prior array data), and with consent groups that allowed for genetic analysis of telomere length. Only samples with sequencing read lengths of 151 or 152 basepairs and having age at blood draw and reported race/ethnicity data available were included. For the set of samples that were part of a duplicate pair (either part of the intended duplicates designed by TOPMed, or a duplicate identified across the studies through sample QC) only one sample from each duplicated pair/group was retained. Relying on the same set of criteria, samples were included in the replication dataset (n=28,718, Supplementary Table S1B) if they were available as additional samples in the freeze6a TOPMed data release available in August 2018.
Race/ethnicity classifications as presented in Supplementary Table S1 were harmonized by the TOPMed DCC across studies based on study-specific self-reported questionnaire data. We included samples belonging to the following five race/ethnicity categories for our analysis: African ancestry, Asian ancestry, European ancestry, Hispanic/Latino and Samoan. For inclusion within the final set of samples described above, the minimum sample size for any study-race-sequencing center stratum had to be n=50. Samples belonging to a smaller stratum were not included in any analyses.
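The minimum stratum size rule can be illustrated with a short filter. This is a sketch under assumed field names (`study`, `race_ethnicity`, `center`), which are hypothetical and not taken from the TOPMed data model.

```python
from collections import Counter


def stratum_key(sample):
    """Study-race/ethnicity-sequencing center stratum of one sample record."""
    return (sample["study"], sample["race_ethnicity"], sample["center"])


def filter_small_strata(samples, min_size=50):
    """Keep only samples whose stratum has at least min_size members."""
    counts = Counter(stratum_key(s) for s in samples)
    return [s for s in samples if counts[stratum_key(s)] >= min_size]


# Toy data: one stratum of exactly 50 samples (kept) and one of 49 (dropped).
toy_samples = (
    [{"study": "A", "race_ethnicity": "European", "center": "X"}] * 50
    + [{"study": "A", "race_ethnicity": "Asian", "center": "X"}] * 49
)
kept = filter_small_strata(toy_samples)
```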
Single variant tests for association
The genome-wide tests for association were performed on the Analysis Commons43. Variants with minor allele count (MAC) of at least 5 and passing IRC quality filters were included for single variant analyses. Individual genotype calls with a read depth less than 10 at a particular variant were considered “missing” and were imputed using the sample allele frequency.
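The genotype handling described above can be sketched as follows. This is an illustration only: it assumes genotypes are coded as alternate-allele counts (0, 1, 2), that a low-depth call is replaced by twice the alternate allele frequency estimated from the remaining calls, and that the minor allele count is computed on the called genotypes. The function names are hypothetical, and the order of filtering and imputation in the actual pipeline is not stated in the text.

```python
def impute_low_depth(dosages, depths, min_depth=10):
    """Replace calls with read depth < min_depth by the expected dosage 2 * AF.

    dosages: alternate-allele counts (0, 1 or 2) per sample.
    depths:  read depth per sample at this variant.
    """
    observed = [g for g, d in zip(dosages, depths) if d >= min_depth]
    if not observed:
        raise ValueError("no genotype call passes the depth filter")
    alt_freq = sum(observed) / (2 * len(observed))  # sample allele frequency
    fill = 2 * alt_freq                             # expected dosage
    return [g if d >= min_depth else fill for g, d in zip(dosages, depths)]


def minor_allele_count(dosages):
    """Count of the less frequent allele among called diploid genotypes."""
    alt = sum(dosages)
    return min(alt, 2 * len(dosages) - alt)


def passes_mac_filter(dosages, min_mac=5):
    """MAC of at least 5, as in the single variant analyses."""
    return minor_allele_count(dosages) >= min_mac


imputed = impute_low_depth([0, 1, 2, 1], [30, 5, 20, 15])
```

Filling with the expected dosage leaves the estimated allele frequency unchanged, so a missing call contributes no information for or against association.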
A two stage procedure8 was performed to test for association genome-wide in the discovery dataset; the steps were as follows:
1. Telomere length was regressed on age and sex separately within each study-race/ethnicity-sequencing center stratum for the n=46,458 discovery samples. Within each stratum, the regression residuals were then inverse-normal transformed and subsequently scaled by their original variances. This rescaling returns the within-stratum variance back to its original value, allowing for clearer interpretation of estimated genotype effect sizes (see Supplementary Figure S6). These inverse-normalized and scaled residuals were then combined across all strata for the discovery dataset, and tests for association were performed as follows.
2. Given the large sample size of the discovery dataset, a mega-analysis including all n=46,458 samples was performed in two steps:
   a. All genetic loci were tested for association with the inverse-normalized residuals using a standard additive linear model again adjusting for age, sex, and study.
   b. All loci with p-values for association between genotype and outcome < 0.01 from this standard additive linear model were then re-analyzed using a linear-mixed model (described below) that included a genetic relationship matrix (GRM) estimated using MMAP 44 to account for ancestry differences as well as within and between study relatedness among individuals, included age, sex, and study as model covariates, and allowed for heterogeneous residual variances across sample groups defined by study.
The final reported p-value for association is the value from b, if available, and is otherwise the value from a.
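The within-stratum transformation in the first stage can be sketched as below, taking the regression residuals for one stratum as given. This is a minimal illustration with assumptions: a rank-based inverse-normal transform with quantiles (rank − 0.5)/n, average ranks for ties, and rescaling by the sample standard deviation of the original residuals. The exact offset and tie handling used in the analysis are not stated in the text, and the function name is hypothetical.

```python
from statistics import NormalDist, stdev


def inverse_normal_rescaled(residuals, offset=0.5):
    """Rank-based inverse-normal transform, rescaled to the original SD.

    residuals: regression residuals for ONE stratum (at least two values).
    offset:    rank offset c in (rank - c) / (n - 2c + 1); 0.5 is an assumption.
    """
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values and give each the average 1-based rank.
        j = i
        while j + 1 < n and residuals[order[j + 1]] == residuals[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1
    normal = NormalDist()
    z_scores = [normal.inv_cdf((r - offset) / (n - 2 * offset + 1)) for r in ranks]
    original_sd = stdev(residuals)
    # Multiplying by the original SD puts the values back on the trait scale,
    # so that effect sizes read approximately in base pairs.
    return [z * original_sd for z in z_scores]


toy_residuals = [3.0, -1.0, 0.5, 10.0, -4.0]
toy_transformed = inverse_normal_rescaled(toy_residuals)
```

The transform preserves the ordering of individuals within a stratum while pulling in outliers such as the value 10.0 in the toy input.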
A two stage procedure similar to that used for the discovery dataset described above was performed to test for association genome-wide in the replication dataset:
1. Residuals from a linear model of telomere length regressed on age, sex, and 11 principal components (PCs) of ancestry were calculated within each study-race/ethnicity-sequencing center stratum for the n=28,718 replication samples. Within each stratum, the residuals were then inverse-normal transformed, and subsequently scaled by their original variances to return the within-stratum variance back to its original value.
2. A mega-analysis including all n=28,718 samples was performed using a linear-mixed model (described below) that included an empirical kinship matrix to account for all relatedness among individuals, included sex, age, 11 PCs of ancestry, and study as model covariates, and allowed for heterogeneous residual variances across sample groups defined by study.
Implementation of the Linear Mixed Model used for association tests
The tests for association were conducted using linear mixed models as implemented in the GENESIS [“Genetic association testing using the GENESIS R/Bioconductor package”, Gogarten et al., Bioinformatics, in press] application on the Analysis Commons. For both the discovery and replication analyses, the genesis_nullmodel app (versions v0.3 for discovery and v1.0.5 for replication) was used to fit the linear mixed model under the null hypothesis of no genetic association (i.e. without any individual genotype terms in the model), where the transformed residuals from step 1 above were used as the outcome, and the model was specified as described above. The output from the null model analysis was then used to perform single variant score tests of association with the genesis_tests app (versions genesis_dscan_single for discovery, genesis_tests_v.1.3.2 for replication). In the discovery analysis, the GRM used to account for both individual ancestry differences and relatedness was computed using MMAP 44. In the replication analysis, ancestry-representative PCs generated using PC-AiR 45 were included in the two steps of analysis to adjust for individual ancestry differences, and an empirical kinship matrix generated using PC-Relate 46 was used in step 2 to account for relatedness among individuals. The switch from a GRM to a kinship matrix for the TOPMed wide sample set on the Analysis Commons was done to accommodate the increased sample size in freeze 6a relative to freeze5b.
Meta-analysis
Meta-analysis was performed genome-wide combining the Discovery and Replication association results using the sample size weighted approach implemented in METAL (version 2018-08-28) 47.
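For a single variant, a sample-size weighted combination of two sets of results can be sketched as follows. This follows the commonly described scheme in which each two-sided p-value is converted to a signed Z-score and weighted by the square root of the sample size; it is an illustration rather than METAL itself, and the function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()


def signed_z(p_value, effect_direction):
    """Convert a two-sided p-value and an effect sign into a signed Z-score."""
    z = -_NORMAL.inv_cdf(p_value / 2)  # stable for very small p-values
    return z if effect_direction >= 0 else -z


def sample_size_weighted_meta(results):
    """Combine per-dataset results given as (p_value, effect_direction, n).

    Returns the combined Z-score and its two-sided p-value.
    """
    numerator = sum(sqrt(n) * signed_z(p, d) for p, d, n in results)
    denominator = sqrt(sum(n for _, _, n in results))
    z = numerator / denominator
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# Two datasets with the same nominal evidence and direction, using the
# discovery and replication sample sizes from the text.
z_meta, p_meta = sample_size_weighted_meta([(0.05, +1, 46458), (0.05, +1, 28718)])
```

Because the weights depend only on sample size, this approach does not require effect estimates to be on the same scale in both datasets, only that the effect allele is aligned.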
Assessing significance and defining genetic loci
All variants with meta-analysis p-value < 5×10−8 were considered as significant in the meta-analysis. All variants passing this threshold were examined in BRAVO 48 to assess quality, and a set of 154 variants were filtered out due to variant call quality issues. Using the remaining significant variants, we determined which belonged to a “locus” (and were not just one-off singleton variants) by taking each peak variant and identifying if there were additional variants with a linkage disequilibrium (LD) r2 > 0.5 with this variant (across all samples) that also achieved a level of significance < 5×10−8 in the meta-analysis. From each set of variants at a locus, the sentinel variant was determined by selecting the position which was present in both the discovery and replication analysis (i.e., had minor allele count > 5 in both data sets) and which showed the smallest meta-analysis p-value of any variants falling in that locus. Index genes for each locus were selected based on (i) prior GWAS study definition for known loci, (ii) the specific gene annotation for each variant mapping directly to a gene in Supplementary Table S2A for novel loci, and (iii) the exception of the OBFC1 and ATM loci: For the OBFC1 locus three index genes were selected SH3PXD2A, OBFC1(STN1), SLK as all three had strong SNV signal not in LD with the sentinel variant (Supplementary Figure S4D); and for ATM, the sentinel variant mapped to NPAT, but was a peak eQTL for ATM (Supplementary Figure S4U).
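The locus and sentinel rules above can be written as two small functions. This is a sketch on assumed inputs: each variant is a dictionary with hypothetical keys (`id`, `p_meta`, `mac_discovery`, `mac_replication`), and the LD values with the peak variant are supplied as a precomputed mapping.

```python
def supporting_variants(peak, variants, r2_with_peak, p_threshold=5e-8, r2_min=0.5):
    """Other significant variants in LD (r2 > r2_min) with the peak variant.

    An empty result means the peak is an isolated variant, not a locus.
    """
    return [
        v for v in variants
        if v["id"] != peak["id"]
        and v["p_meta"] < p_threshold
        and r2_with_peak.get(v["id"], 0.0) > r2_min
    ]


def pick_sentinel(locus_variants, min_mac=5):
    """Smallest meta-analysis p-value among variants present in both datasets."""
    eligible = [
        v for v in locus_variants
        if v["mac_discovery"] > min_mac and v["mac_replication"] > min_mac
    ]
    return min(eligible, key=lambda v: v["p_meta"]) if eligible else None


# Toy locus: v2 supports the peak; v3 is not in LD; v4 is not significant.
v1 = {"id": "v1", "p_meta": 1e-12, "mac_discovery": 100, "mac_replication": 80}
v2 = {"id": "v2", "p_meta": 1e-9, "mac_discovery": 90, "mac_replication": 70}
v3 = {"id": "v3", "p_meta": 1e-9, "mac_discovery": 90, "mac_replication": 70}
v4 = {"id": "v4", "p_meta": 1e-5, "mac_discovery": 90, "mac_replication": 70}
toy_r2 = {"v2": 0.8, "v3": 0.2, "v4": 0.9}
support = supporting_variants(v1, [v1, v2, v3, v4], toy_r2)
sentinel = pick_sentinel([v1] + support)
```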
Estimation of ancestry-specific p-values
Single variant tests for association were performed as described above for each of the five race/ethnicity subgroups within the discovery and replication data sets, splitting the samples after the first step (i.e., after calculating, inverse-normal transforming and rescaling residuals). Meta-analysis to combine the discovery and replication results within a race/ethnicity group was also performed as described above.
Estimation of effect sizes and percent of variance explained
To estimate the effect size and percent variance explained for individual variants, we performed the same two stage procedure as described for association testing with the replication dataset, but with two differences: we used the full set of 75,176 samples, and we only computed score test statistics for the 22 associated variants identified through the meta-analysis. Estimates of the additive effect size per copy of the alternate allele for each variant were approximated from the score test statistics using the approach illustrated in Zhou et al. 49 (i.e. $\hat{\beta} = U_\beta / V_\beta$, where $U_\beta$ is the covariate-adjusted score for testing the variant, and $V_\beta$ is its variance). Despite using inverse-normalized residuals as the outcome variable, we expect these effect size estimates to be approximately on the original trait scale (i.e. number of basepairs) because the distribution of residuals pre-inverse-normalization was not too far from Normal (Supplementary Figure S6), and we re-scaled the variance back to its original value50. To estimate the percent of phenotypic variance explained (PVE) by each individual variant, we used the formula $\mathrm{PVE} = 1 - \mathrm{RSS}_1/\mathrm{RSS}_0$, where $\mathrm{RSS}_0$ and $\mathrm{RSS}_1$ are the residual sums of squares computed from the null model, and the model including the variant of interest, respectively. Following the idea of Zhou et al., we derived a similar approximation for PVE using only estimates from the null model: $\mathrm{PVE} \approx (U_\beta^2 / V_\beta)/\mathrm{RSS}_0$.
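The relationship between score statistics, effect size and PVE can be checked numerically in the simplest setting. The sketch below assumes an ordinary linear model with no relatedness, a score U computed as the cross-product of the centered genotype with the null-model residuals, and a score variance V computed as the centered genotype sum of squares, with U, V and RSS0 all on the same variance scaling. Under those assumptions the approximations β ≈ U/V and PVE ≈ (U²/V)/RSS0 coincide with the values obtained by fitting the variant directly. The function names are hypothetical, and the mixed-model case in the analysis involves additional weighting not shown here.

```python
def centered(values):
    """Subtract the mean (stands in for adjusting out the intercept)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def score_based_estimates(genotypes, null_residuals):
    """Effect size and PVE from null-model quantities only."""
    g = centered(genotypes)
    r = centered(null_residuals)
    u = sum(gi * ri for gi, ri in zip(g, r))  # score U
    v = sum(gi * gi for gi in g)              # score variance V
    rss0 = sum(ri * ri for ri in r)           # null residual sum of squares
    return u / v, (u * u / v) / rss0


def fitted_pve(genotypes, null_residuals):
    """PVE = 1 - RSS1/RSS0 from explicitly fitting the variant."""
    g = centered(genotypes)
    r = centered(null_residuals)
    beta = sum(gi * ri for gi, ri in zip(g, r)) / sum(gi * gi for gi in g)
    rss0 = sum(ri * ri for ri in r)
    rss1 = sum((ri - beta * gi) ** 2 for gi, ri in zip(g, r))
    return 1 - rss1 / rss0


# Toy data: alternate-allele counts and residual telomere length in bp.
toy_genotypes = [0, 1, 2, 1, 0, 2]
toy_residuals_bp = [-30.0, 10.0, 55.0, -5.0, -20.0, 40.0]
beta_hat, pve_hat = score_based_estimates(toy_genotypes, toy_residuals_bp)
```

The practical benefit is that only the null model needs to be fitted once; per-variant estimates then follow from the score test output.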
Gene-based coding variant tests - Variant annotation
For their use in the gene-based tests for association, variant annotation was performed using WGSA751 and dbNSFP 52. Variants were annotated as exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, or intergenic. Exonic variants were further annotated as frameshift insertion, frameshift deletion, frameshift block substitution, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion, nonframeshift block substitution, nonsynonymous variant, synonymous variant, or unknown. Additional scores available included REVEL 53, MCAP 54 or CADD 55 effect prediction algorithms.
Gene-based coding variant tests - Tests for association
Gene-based analysis was performed on the discovery samples only (n=46,458). To improve the power of identifying rare variant associations in coding regions, we aggregated deleterious rare coding variants in 19,387 protein-coding genes and then tested for association with telomere length. To enrich for functional variants, only variants with a “deleterious” consequence for their corresponding gene or genes 56 were included. For each protein-coding gene, a set of rare coding variants (MAF < 0.05, including singletons where MAC=1) was constructed, which was composed of all stop-gain, stop-loss, and frameshift variants, as well as the exonic missense variants that fulfilled one of these criteria: 1) REVEL score > 0.5, 2) M_CAP score was “Deleterious”, or 3) CADD score > 20. We applied the Sequence Kernel Association Test (SKAT) 57 as implemented in GENESIS, using the genesis_tests app on the Analysis Commons, with minor allele frequency based variant weights given by a beta-distribution with parameters of 1 and 25, as proposed by Wu et al 57, using the same null model products/objects used in single variant analysis. Significance was evaluated after a Bonferroni correction for multiple testing (0.05 / 19387 = 2.58×10−6).
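The variant selection rule, the Beta(1, 25) weights and the Bonferroni threshold can be made concrete as below. This is a sketch on an assumed record layout: the keys `consequence`, `maf`, `revel`, `mcap` and `cadd` and the consequence labels are hypothetical stand-ins for the annotation fields, and the SKAT test itself is not reproduced.

```python
LOSS_OF_FUNCTION = {"stopgain", "stoploss", "frameshift"}


def is_deleterious(variant):
    """Stop-gain, stop-loss and frameshift variants, or predicted-damaging missense."""
    if variant["consequence"] in LOSS_OF_FUNCTION:
        return True
    if variant["consequence"] == "missense":
        return (
            (variant.get("revel") or 0.0) > 0.5
            or variant.get("mcap") == "Deleterious"
            or (variant.get("cadd") or 0.0) > 20
        )
    return False


def in_gene_set(variant, max_maf=0.05):
    """Rare or low-frequency (MAF < 0.05, singletons included) and deleterious."""
    return variant["maf"] < max_maf and is_deleterious(variant)


def beta_1_25_weight(maf):
    """Beta(1, 25) density at the MAF: 25 * (1 - maf)**24."""
    return 25.0 * (1.0 - maf) ** 24


N_GENES = 19387
BONFERRONI_THRESHOLD = 0.05 / N_GENES  # about 2.58e-6

toy_variants = [
    {"consequence": "stopgain", "maf": 0.001},
    {"consequence": "missense", "maf": 0.01, "revel": 0.7},
    {"consequence": "missense", "maf": 0.01, "revel": 0.2, "cadd": 12.0},
    {"consequence": "missense", "maf": 0.20, "cadd": 30.0},
    {"consequence": "synonymous", "maf": 0.001},
]
selected = [v for v in toy_variants if in_gene_set(v)]
```

The Beta(1, 25) density falls steeply with allele frequency, so a variant with MAF 0.001 receives roughly three times the weight of one with MAF 0.049.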
Next, we sought to determine which rare deleterious variants in each significant gene were driving the association signal. We iterated through the variants, removing one variant at a time (leave-one-out [LOO] approach) 58, and repeated the SKAT analysis. If a variant made a large contribution to the original association signal, one would expect the signal to be substantially weakened when that variant is removed from the set.
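The leave-one-out idea can be illustrated with a small Python sketch. As a simplification, it tracks the change in a SKAT-style Q statistic (the weighted sum of squared per-variant scores) rather than recomputing the SKAT p-value as described above; the function names, residuals, genotypes, and weights are all hypothetical toy values.

```python
def skat_q(genotypes, residuals, weights):
    """SKAT-style Q: sum over variants of (weight * score)^2.

    `genotypes` is a list of per-variant allele-count lists (one entry per
    sample); `residuals` are null-model residuals for the same samples.
    """
    q = 0.0
    for g_j, w_j in zip(genotypes, weights):
        s_j = sum(g * r for g, r in zip(g_j, residuals))  # per-variant score
        q += (w_j * s_j) ** 2
    return q

def leave_one_out(variant_ids, genotypes, residuals, weights):
    """Fractional drop in Q when each variant is removed in turn."""
    full = skat_q(genotypes, residuals, weights)
    drops = {}
    for k, vid in enumerate(variant_ids):
        kept_g = genotypes[:k] + genotypes[k + 1:]
        kept_w = weights[:k] + weights[k + 1:]
        drops[vid] = 1.0 - skat_q(kept_g, residuals, kept_w) / full
    return full, drops

# Toy data: six samples, three singleton variants (all values hypothetical).
residuals = [2.0, -1.0, 0.5, -1.5, 0.0, 0.0]
variant_ids = ["var_a", "var_b", "var_c"]
genotypes = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
weights = [1.0, 1.0, 1.0]

full_q, drops = leave_one_out(variant_ids, genotypes, residuals, weights)
for vid, frac in sorted(drops.items(), key=lambda kv: -kv[1]):
    print(vid, round(frac, 3))
```

A variant whose removal produces the largest drop is the strongest single contributor to the statistic. In a full analysis the comparison would be made on the recomputed p-values, which also depend on the null distribution of Q for the reduced variant set.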
Mining association analysis results
The “Omics Analysis, Search and Information System” (OASIS) 59 is a web-based application for transforming massive volumes of association results, such as those generated by investigators in the Trans-Omics for Precision Medicine program (TOPMed) Telomere Length Working Group, into biological discovery. OASIS enables fast, efficient data mining integrated with a broad spectrum of functional annotation, online resources (e.g. dbSNP 60, gnomAD [Genome Aggregation Database] 61, GTEx 62, Open Targets Genetics 63, UK Biobank 64), and user-provided “known loci” lists to facilitate identification of novel genetic discoveries. Real-time analysis tools include linkage disequilibrium (LD) calculations, on-demand visualizations (e.g. boxplots, bar charts, histograms, Haploview 65 and LocusZoom 66 plots), and direct integration of selected variants with the UCSC Genome Browser 67 to visualize their proximity to functional regions (e.g. binding sites, DNase hypersensitivity sites, enhancer/promoter regions). For the telomere length research, OASIS provided customized LD calculations based on genotypes for the actual TOPMed subjects with telomere length phenotypes and for multiple ancestry-based subsets. OASIS automatically fed the customized LD calculations directly to LocusZoom and thus provided an efficient method for producing multiple LocusZoom visualizations for inspection and comparison.
Gene-set enrichment analysis
Gene set enrichment for index gene(s) mapping to the 22 GWAS loci was performed using PANTHER 26,27. Gene set over-representation was evaluated against the GO Ontology Database for all genes in the Homo sapiens database using Fisher's exact test, and all sets with an FDR <0.05 are listed. Input genes were: TERT, TERC, RTEL1, SH3PXD2A, OBFC1(STN1), SLK, RFWD3, NAF1, ACYP2, TERF1, LINC01592, LOC100505739, TINF2, SAMHD1, TERF2, ZNF676, ZNF729, TCL1A, YY1P2, OPRK1, LRP1B, LINC01429, ATP6V1H, RPN1, DCAF4, POT1, ATM, CHKB-AS1, MAPK8IP2. There were six unmapped IDs: TERC, LINC01592, LOC100505739, YY1P2, LINC01429 and CHKB-AS1. Index genes were selected based on (i) the prior GWAS definition for known loci, (ii) the annotation for each variant mapping directly to a gene in Supplementary Table S2A for novel loci, and (iii) two exceptions, the OBFC1 and ATM loci. For the OBFC1 locus, three index genes were selected (SH3PXD2A, OBFC1(STN1) and SLK) because all three had strong SNV signal not in LD with the sentinel variant (Supplementary Figure S4D). For the ATM locus, the sentinel variant mapped to NPAT but was a peak eQTL for ATM (Supplementary Figure S4U).
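For readers unfamiliar with over-representation testing, the following Python sketch shows the underlying calculation: a one-sided Fisher's exact test (the hypergeometric upper tail) per gene set, followed by a Benjamini-Hochberg adjustment. It is not PANTHER's code. The count of 23 input genes follows from the 29 input genes minus the six unmapped IDs listed above; the background size, set sizes, overlaps, and set names are hypothetical.

```python
from math import comb

def overrep_pvalue(k, n, K, N):
    """One-sided Fisher's exact p-value, P(X >= k).

    k: input genes in the set, n: input genes,
    K: background genes in the set, N: background genes.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from largest p to smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

N_BACKGROUND = 20000   # hypothetical background size
N_INPUT = 23           # 29 input genes minus 6 unmapped IDs
gene_sets = {          # hypothetical: name -> (overlap k, set size K)
    "hypothetical_set_A": (6, 150),
    "hypothetical_set_B": (1, 900),
    "hypothetical_set_C": (2, 400),
}

names = list(gene_sets)
pvals = [overrep_pvalue(gene_sets[s][0], N_INPUT, gene_sets[s][1], N_BACKGROUND)
         for s in names]
fdr = benjamini_hochberg(pvals)
for name, p, q in zip(names, pvals, fdr):
    print(name, p, q, "listed" if q < 0.05 else "not listed")
```

Exact integer binomial coefficients avoid the floating-point underflow that can affect very small tail probabilities when they are computed from factorials in floating point.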
Phenome-wide association tests (PheWAS)
We queried United Kingdom Biobank (UKBB) GWAS results using the University of Michigan PheWeb web interface (http://pheweb.sph.umich.edu/SAIGE-UKB/). The UKBB PheWeb interface contains results from a SAIGE 68 genetic analysis of 1403 ICD-based traits in 408,961 UKBB participants of European ancestry. PheWeb is a publicly accessible database that allows querying of genome-wide association results for 28 million imputed genetic variants. Twenty of our 22 sentinel variants were present in PheWeb. We report all hits passing a Bonferroni correction for the number of tests performed (0.05/(20*1403) = 1.8×10−6).
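The threshold arithmetic can be checked with a few lines of Python. The variant and trait counts come from the text; the function name and the two result records are hypothetical placeholders, not actual PheWeb output.

```python
ALPHA = 0.05
N_VARIANTS = 20    # sentinel variants present in PheWeb
N_TRAITS = 1403    # ICD-based traits

# Bonferroni threshold for 20 x 1403 = 28,060 tests.
threshold = ALPHA / (N_VARIANTS * N_TRAITS)

def significant_hits(results, cutoff=threshold):
    """Keep variant-trait pairs whose p-value passes the cutoff."""
    return [r for r in results if r["pvalue"] < cutoff]

example_results = [  # hypothetical records
    {"variant": "variant_1", "trait": "trait_A", "pvalue": 3.0e-9},
    {"variant": "variant_2", "trait": "trait_B", "pvalue": 5.0e-5},
]
print(threshold)
print(significant_hits(example_results))
```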
Expression quantitative trait locus (eQTL) analysis using eQTLGen
The sentinel variants from the meta-analysis results were assessed for their role as eQTLs using the eQTLGen 37 data set, which includes eQTLs found in blood from a set of n=31,684 individuals. For all sentinel variants which were present in eQTLGen, we report all eGenes associated with these variants, as well as the most significant eGene and its FDR-corrected eQTL p-value.
Acknowledgements
Whole genome sequencing (WGS) for the Trans-Omics for Precision Medicine (TOPMed) program was supported by the National Heart, Lung and Blood Institute (NHLBI). Specific funding sources for each study and genomic center are given in the Supplementary Information. Centralized read mapping and genotype calling, along with variant quality metrics and filtering, were provided by the TOPMed Informatics Research Center (3R01HL-117626-02S1; contract HHSN268201800002I). Phenotype harmonization, data management, sample-identity QC, and general study coordination were provided by the TOPMed Data Coordinating Center (3R01HL-120393-02S1; contract HHSN268201800001I). We gratefully acknowledge the studies and participants who provided biological samples and data for TOPMed. The full study-specific acknowledgements as well as individual acknowledgements are detailed in the Supplementary Information.
The views expressed in this manuscript are those of the authors and do not necessarily represent the views of the National Heart, Lung, and Blood Institute; the National Institutes of Health; or the U.S. Department of Health and Human Services.
Footnotes
↵† https://www.nhlbiwgs.org/topmed-banner-authorship; Full banner author list is included in Supplementary Information.
↵# Full working group author list is included in Supplementary Information.
↵§ Full working group author list is included in Supplementary Information.