Abstract
Recent reports have identified differences in the mutational spectra across human populations. While some of these reports have been replicated in other cohorts, most have been reported only in the 1000 Genomes Project (1kGP) data. While investigating an intriguing putative population stratification within the Japanese population, we identified a previously unreported batch effect leading to spurious mutation calls in the 1kGP data and to the apparent population stratification. Because the 1kGP data is used extensively, we find that the batch effects also lead to incorrect imputation by leading imputation servers and suspicious GWAS associations. Lower-quality data from the early phases of the 1kGP thus continues to contaminate modern studies in hidden ways. It may be time to retire or upgrade such legacy sequencing data.
Introduction
Batch Effects in Aging Reference Cohort Data
The last 5 years have seen a drastic increase in the amount and quality of human genome sequence data. Reference cohorts such as the International HapMap Project (International HapMap Consortium, 2005), the 1000 Genomes Project (1kGP) (1000 Genomes Project Consortium, 2010, 2012; Consortium et al., 2015), and the Simons Diversity project (Mallick et al., 2016) have made thousands of genome sequences publicly available for population and medical genetic analyses. Many more genomes are available indirectly through servers providing imputation services (McCarthy et al., 2016) or summary statistics for variant frequency estimation (Lek et al., 2016).
The first genomes in the 1kGP were sequenced 10 years ago (van Dijk et al., 2014). Since then, sequencing platforms have rapidly improved. The second phase of the 1kGP implemented multiple technological and analytical improvements over its earlier phases (1000 Genomes Project Consortium, 2012; Consortium et al., 2015), leading to heterogeneous sample preparations and data quality over the course of the project.
Yet, because of the extraordinary value of freely available data, early data from the 1kGP is still widely used to impute untyped variants, to estimate allele frequencies, and to answer a wide range of medical and evolutionary questions. This raises the question of whether and how such legacy data should be included in contemporary analyses alongside more recent cohorts. Here we point out how large and previously unreported batch effects in the early phases of the 1kGP still lead to incorrect genetic conclusions through population genetic analyses and spurious GWAS associations as a result of imputation using the 1kGP as a reference.
Mutational Signatures
Different mutagenic processes may preferentially affect different DNA motifs. Certain mutagens in tobacco smoke, for example, have been shown to preferentially bind to certain genomic motifs leading to an excess of G to T transversions (Pfeifer et al., 2002; Pleasance et al., 2010). Thus, exposure of populations to different mutational processes can be inferred by considering the DNA context of polymorphism in search of signatures of different mutational processes (Alexandrov et al., 2013; Shiraishi et al., 2015). Such genome-wide mutational signatures have been used as diagnostic tools for cancers (e.g., Alexandrov et al. (2013); Shiraishi et al. (2015)).
In addition to somatic mutational signatures, there has been recent interest in population variation in germline mutational signatures, which can be revealed in large sequencing panels. In 2015, Harris reported 50% more TCC → TTC mutations in European populations compared to African populations, and this was replicated in a different cohort in 2017 (Harris, 2015; Harris and Pritchard, 2017; Mathieson and Reich, 2017). Strong population enrichments of a mutational signature suggest important genetic or environmental differences in the history of each population (Harris, 2015; Harris and Pritchard, 2017). Harris and Pritchard further identified distinct mutational spectra across a range of populations, which were further examined in a recent publication by Aikens et al. (Harris and Pritchard, 2017; Aikens et al., 2019).
In particular, the latter two studies identified a heterogeneous mutational signature within 1kGP Japanese individuals. This heterogeneity is intriguing because differences in germline signatures accumulate over many generations. A systematic difference within the Japanese population would suggest sustained environmental or genetic differences across sub-populations within Japan with little to no gene flow. We therefore decided to follow up on this observation, by using a newly sequenced dataset of Japanese individuals from Nagahama.
While we were unable to reproduce the mutational heterogeneity within the Japanese population, we could trace the source of the discrepancy back to a technical artefact in the 1kGP data. In addition to creating biases in mutational signatures, this artefact leads to spurious imputation results which have found their way into a number of recent publications.
The results section is organized as follows. We first attempt to reproduce the original signal and identify problematic variants in the JPT cohort from the 1kGP. Next, we expand our analysis to the other populations in the 1kGP and identify lists of variants that show evidence for technical bias. Finally, we investigate how these variants have impacted modern genomics analyses.
Results
A peculiar mutational signature in Japan
Harris and Pritchard reported an excess of the 3-mer substitution pattern *AC→*CC in a portion of the Japanese individuals in the 1kGP (Harris and Pritchard, 2017). While trying to follow up on this observation in a larger and more recent Japanese cohort from Nagahama, we did not find this particular signature. When comparing the allele frequencies between the Japanese individuals from the 1kGP and this larger dataset, we observed a number of single nucleotide polymorphisms (SNPs) private to one of the two groups (Figure 1). Given the similarity of the two populations, this strongly suggests a technical difference rather than a population structure effect. These mismatches were maintained despite only considering sites that satisfied strict quality masks and Hardy-Weinberg equilibrium in both cohorts.
When mismatch sites are removed from the 1kGP data, the *AC→*CC signal disappears (Figure 1). To identify possible technical reasons for the difference, we performed regressions of the prevalence of the *AC→*CC mutational signature against different individual-level quality metrics provided by the 1kGP (see Figure S14). The average quality of mapped bases Q per individual stood out as a strong correlate: individuals with low Q show elevated rates of the signature. Thus, sequences called from low-Q data contain variants that reproduce poorly across studies and exhibit a particular mutational signature.
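The kind of per-individual regression described above can be illustrated with a minimal least-squares sketch. The function name and the toy numbers are hypothetical and are not taken from the 1kGP data; the sketch only shows how a negative slope and correlation would arise when individuals with lower Q carry a larger fraction of the signature.

```python
def ols_slope_and_r(x, y):
    """Least-squares slope and Pearson correlation of y against x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


# Toy values (illustrative only): Q per individual and the fraction of
# that individual's variants carrying the *AC->*CC pattern.
q_scores = [20.0, 25.0, 30.0, 35.0]
signature_fraction = [0.04, 0.03, 0.02, 0.01]

slope, r = ols_slope_and_r(q_scores, signature_fraction)
print(slope, r)
```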
To identify SNPs that are likely to reproduce poorly across cohorts without having access to a second cohort, we performed an association study in the JPT for SNPs that associate strongly with low Q (Figure 1). Traditionally, genome-wide association studies use genotypes as the independent variable. Here we perform a “reverse GWAS”, in the sense that genotypes are now the dependent variable that we attempt to predict using the continuous variable Q as the independent variable (Song et al., 2015). We use logistic regression of the genotypes on Q and identify 587 SNPs with p < 10^-8 and 1034 SNPs with p < 10^-6. When these SNPs are used as an exclusion list, a higher p-value threshold gives a more stringent filter because more SNPs are excluded (i.e., excluding SNPs with p < 10^-6 is more stringent than excluding SNPs with p < 10^-8). The variants that are associated to Q are enriched in *AC→*CC, GA*→GG*, and GC*→GG* mutations (Figure 1A). These three enrichments can be summarized as an excess of G**→GG* in individuals with low Q.
Thus, this mutational signal is heavily enriched in Q-associated SNPs, but residual signal remains in non-significant SNPs, presumably because many rare alleles found in individuals with low Q remain unidentifiable using association techniques (Figure S15). The removal of individuals with Q below 30 successfully removes the *AC→*CC signal; however, other signals identified by Harris and Pritchard appear unchanged (Figure S15). For population genetic analyses sensitive to the accumulation of rare variants, the removal of individuals with low Q appears preferable to filtering specific low-quality SNPs. For other analyses where quality of imputation matters, identifying Q-associated variants may be preferable.
Identifying suspicious variants in the 1000 Genomes Project
The distribution of Q across 1kGP populations shows that many populations have distributions of Q scores comparable to that of the JPT, especially populations sequenced in phase 1 of the project: sequencing done in the early phases of the 1kGP was more variable and overall tended to include lower-quality sequencing data (Figure 2). This variability could result from evolving sequencing platforms and protocols or from variation between sequencing centres. By 2011, older sequencing technologies were phased out, and methods became more consistent, resulting in higher and more uniform quality.
We therefore applied the same reverse GWAS approach to each population independently and identified Q-associated SNPs in 23 of the 26 populations in the 1kGP. The phase 1 populations were most affected, with on average four times as many significantly associated sites as the phase 3 populations. Over 812 variants were independently associated to low Q in at least two populations with p < 10^-6 in each (Figure 3).
To build a test statistic to represent the association across all populations simultaneously, we performed a logistic regression predicting genotype from Q, using logistic factor analysis (LFA) as an offset to account for population structure. This is the Genotype-Conditional Association Test (GCAT) proposed by Song et al. (2015). We also considered two alternative approaches to account for confounders, namely using the leading five principal components, and using population membership as covariates. These models were broadly consistent (see Figure S1).
This method identifies a total of 24,390 variants associated to Q distributed across the genome, with 15,270 passing the 1kGP strict mask filter (Figures S9, S10, S11, and S12). Most analyses below focus on the 15,270 variants satisfying the strict mask, since these variants are unlikely to be filtered by standard pipelines. To account for the large number of tests, we used a two-stage Benjamini & Hochberg step-up FDR-controlling procedure to adjust the p-values using a nominal Type-I error rate α = 0.01 (Benjamini et al., 2006). We tested SNPs, INDELs and repetitive regions separately as they may have different error rates (Table 1). Lists of Q-associated variants and individuals with low Q are provided in Supplementary Data.
Q-associated variants are distributed across the genome, with chromosome 1 showing an excess of such variants, and other chromosomes being relatively uniform (Figure S2). At a 10kb scale, we also see a rather uniform distribution, with a small number of regions showing an enrichment for such variants (Figure S3). An outlying 10kb region in chromosome 17 (bases 22,020,000 to 22,030,000) has 35 Q-associated variants. The distribution of association statistics in this region is provided in Figure S4. By contrast, variants that do not pass the 1kGP strict mask are more unevenly distributed across the genome (Figure S3).
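As an illustration of the multiple-testing step, the following is a minimal NumPy sketch of a two-stage step-up procedure in the style of Benjamini et al. (2006): a first Benjamini-Hochberg pass at level q/(1+q) estimates the number of true nulls, and a second pass uses that estimate to rescale the level. The function names are hypothetical, the sketch returns rejection decisions rather than adjusted p-values, and it is not the code used in this study.

```python
import numpy as np


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up: boolean mask of rejected hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject


def two_stage_bh(pvals, q=0.01):
    """Two-stage step-up procedure (sketch); q is the nominal FDR level."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    q1 = q / (1.0 + q)
    stage1 = bh_reject(p, q1)
    r1 = int(stage1.sum())
    if r1 == 0 or r1 == m:
        return stage1  # nothing, or everything, rejected at stage 1
    m0_hat = m - r1  # estimated number of true null hypotheses
    return bh_reject(p, q1 * m / m0_hat)


# Toy p-values (illustrative only).
print(two_stage_bh([1e-10, 1e-9, 0.5, 0.8, 0.9], q=0.01).tolist())
```

Testing SNPs, INDELs and repetitive regions separately, as described above, would amount to calling such a function once per variant class.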
Cell line or technical artifact
In 2017, Lan et al. resequenced 83 Han Chinese individuals from the 1kGP (Lan et al., 2017). To assess consistency between the two datasets, we consider consistency of genotype calls for Q-associated variants that are predicted to be polymorphic in these 83 individuals according to the 1kGP. Among the 296 such variants that were Q-associated in the CHB or CHS, only 6 are present in the resequenced data (Figure S7). This is more than our nominal false positive rate of 1% of the sites. Thus a small number of variants associated to Q are present in the population but with somewhat biased genotypes.
We did a similar analysis using all variants identified in the GCAT model (rather than only variants significantly associated to Q within the CHB and CHS). Of the 15,270 Q-associated variants identified globally, 6,307 are polymorphic in the 1kGP for the 83 resequenced individuals (see Figure S5). From this subset, only 1,139 (or 18%) are present in the resequenced data. The allele frequencies of these variants are nearly identical between datasets, suggesting that among these 83 individuals, these variants are properly genotyped in the 1kGP. There are 5 alleles that show differing frequencies between the two datasets, which is likely explained by biased genotypes. The vast majority of polymorphisms associated with Q are not present at all in the resequencing dataset, supporting sequencing rather than cell line artifacts.
Among the 15,270 Q-associated variants, 613 are present on Illumina’s Omni 2.5 chip (see Figure S13). These are likely among the small number of variants that are present in the data but exhibit biased genotyping in the 1kGP.
Suspicious variants impact modern genomics analyses
State of the art imputation servers use a combination of many databases including some that are not freely available. From the perspective of researchers, they act as black-box imputation machines that take observed genotypes as input and return imputed genotypes.
To investigate whether suspicious calls from the 1kGP are imputed into genotyping studies, we submitted genotype data for the first two chromosomes of the 1kGP genotype data to the Michigan Imputation Server. We found that all of the variants associated with Q were imputed back into the samples. This suggests that the imputation reference panel still includes individuals with low Q, and that the dubious variants will be imputed into individuals who most closely match the low-quality individuals.
We searched the literature for any GWAS that might have reported these dubious variants as being significantly associated with some biological trait, even though there is no particular reason for these variants to be associated with phenotypes. The NHGRI-EBI Catalog of published genome-wide association studies identified seventeen recent publications that had reported these variants as close to or above the genome-wide significance threshold (Table 2).
Eleven of these studies included the 1kGP in their reference panel for imputation (Xu et al., 2012; Lutz et al., 2015; Park et al., 2015; Astle et al., 2016; Herold et al., 2016; Suhre et al., 2017; López-Mejías et al., 2017; Tian et al., 2017; Spracklen et al., 2017; Nagy et al., 2017; Gao et al., 2018) and another used the 1kGP sequence data and cell lines directly (Mandage et al., 2017). One study used an in-house reference panel for imputation (Nishida et al., 2018), two studies genotyped individuals and imputed the data using the HapMap II as a reference database for imputation (Kraja et al., 2011; Ebejer et al., 2013) and two studies used genotyping chip data (Yucesoy et al., 2015; Ellinghaus et al., 2016).
These articles used a variety of strict quality filters, including Hardy-Weinberg equilibrium tests, deviations in expected allele frequency and sequencing data quality thresholds. They also removed rare alleles and alleles with high degrees of missingness. Despite these state-of-the-art quality controls, the variants were not only imputed onto real genotype data but also reached genome-wide significance for association with biological traits.
These associations are not necessarily incorrect: a weak but significant bias in imputation may still result in a correct association. To separate variants with a weak but significant association with Q from variants with strong biases, we classified variants according to whether the allele frequency difference between individuals with low and high Q is larger than a factor of two (which naturally separates two clusters of variants on Figure S5). The majority (92.7%) of the Q-associated variants are strongly biased in that they are more than twice as frequent in individuals with low Q compared to individuals with high Q. By contrast, most Q-associated variants reported in the GWAS catalogue had weak bias (see Figure S6), with three exceptions. One study reports associations with seven Q-associated variants that we find to be highly biased (Mandage et al., 2017). That study considered copy number of Epstein-Barr virus sequence in the 1kGP as a phenotype. Thus the phenotype in that study is likely confounded by the same technical artefacts that lead to biased SNP calling.
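The factor-of-two criterion can be sketched as follows. The split of individuals at Q = 30 and the function names are assumptions made for illustration (the text does not state how the low- and high-Q groups were delimited for this comparison), and the sketch assumes both groups are non-empty.

```python
def allele_frequency(genotypes):
    """Alternate allele frequency from diploid genotypes coded 0, 1 or 2."""
    return sum(genotypes) / (2 * len(genotypes))


def is_strongly_biased(genotypes, q_scores, q_cutoff=30.0, fold=2.0):
    """True if the allele is more than `fold` times as frequent in low-Q individuals.

    q_cutoff is an assumed boundary between low- and high-Q individuals.
    """
    low = [g for g, q in zip(genotypes, q_scores) if q < q_cutoff]
    high = [g for g, q in zip(genotypes, q_scores) if q >= q_cutoff]
    # An allele seen only in low-Q individuals also counts as strongly biased.
    return allele_frequency(low) > fold * allele_frequency(high)


# Toy data (illustrative only): two low-Q and four high-Q individuals.
q_toy = [25.0, 28.0, 35.0, 36.0, 37.0, 38.0]
print(is_strongly_biased([1, 1, 0, 0, 0, 0], q_toy))  # allele only in low-Q group
print(is_strongly_biased([1, 0, 1, 0, 1, 0], q_toy))  # equal frequency in both groups
```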
Discussion
The variants identified in this study are likely to be technical artifacts from legacy technologies. Different sequencing technologies will have different error profiles. A report comparing the Genome Analyzer II (GAII) to the Illumina HiSeq found that the GAII had much higher rates of reads below a quality score of 30 (Minoche et al., 2011) with, for instance, different patterns of quality decrease along reads. Differences in read quality and error profiles in turn require different calling pipelines.
To pinpoint the precise technical source of the discrepancy would require further forensic inquiries into the details of the heterogeneous sample preparation and data processing pipelines used throughout the 1kGP. Given the progress in sequencing and calling that occurred since the early phases of the 1kGP (Figure 2), it is likely that the source of these biases is no longer being actively introduced in recent sequence data.
However, because the 1kGP data is widely used as a reference database, these variants are still being imputed onto new genotype data and can then impact association studies for a variety of phenotypes. Even though significant association of a variant with a quality metric is not in itself an indication that the variant is spurious, we recommend carefully examining GWAS associations for such variants, e.g. by repeating the analysis without the 1kGP as part of the imputation panel.
For analyses where variants cannot be examined individually (mutation profiles, distributions of allele frequencies, polygenic risk scores), we recommend simply discarding the Q-associated SNPs or the individuals with Q < 30 (lists of such variants and sample IDs are provided in the Supplementary Data). We also recommend that imputation servers discard individuals with low Q (or at least provide the option of performing the imputation without them). Given the value of freely accessible data, resequencing individuals with low Q would also likely be a worthwhile investment for the community.
Conclusion
On a technical front, we were surprised that strong association between variants and technical covariates in the 1kGP project had not been identified before. The genome-wide logistic regression analysis of genotype on quality metric is straightforward, and should probably be a standard in a variety of -omics studies. The logistic factor analysis is more computationally demanding but produces more robust results (Song et al., 2015). Both approaches produce comparable results.
More generally, to improve the quality of genomic reference datasets, we can proceed by addition of new and better data and by better curation of existing data. Given rapid technological progress, the focus of genomic research is naturally on the data generation side. However, cleaning up existing databases is also important to avoid generating spurious results. The present findings suggest that a substantial fraction of data from the final release of the 1kGP project is overdue for retirement or re-sequencing.
Methods
Code and data availability
Since this analysis is primarily performed using publicly available data, we provide fully reproducible and publicly available code on GitHub. This repository includes scripts used for data download, processing, analysis and plotting.
Metadata
The metadata used in this analysis was compiled from the index files of the 1kGP file system. The average quality of mapped bases Q per sample was obtained from the BAS files associated with each alignment file. Each BAS file has metadata regarding each sequencing event for each sample. If a sample was sequenced more than once, we averaged the Q scores across sequencing instances. The submission dates and sequencing centres for each sample in the analysis were available in the sequence index files.
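The per-sample averaging can be sketched with the standard library as below. The column names `sample` and `average_quality_of_mapped_bases` are assumed to match the tab-delimited BAS header and should be checked against the actual files; the sample IDs and values are toy data.

```python
import csv
import io
from collections import defaultdict


def mean_q_per_sample(bas_text,
                      sample_col="sample",
                      q_col="average_quality_of_mapped_bases"):
    """Average Q over all sequencing instances (rows) of each sample."""
    totals = defaultdict(lambda: [0.0, 0])  # sample -> [sum of Q, row count]
    for row in csv.DictReader(io.StringIO(bas_text), delimiter="\t"):
        acc = totals[row[sample_col]]
        acc[0] += float(row[q_col])
        acc[1] += 1
    return {sample: total / n for sample, (total, n) in totals.items()}


# Toy BAS-like content (illustrative only): SAMPLE_A was sequenced twice.
example = (
    "sample\taverage_quality_of_mapped_bases\n"
    "SAMPLE_A\t28.0\n"
    "SAMPLE_A\t32.0\n"
    "SAMPLE_B\t35.5\n"
)
print(mean_q_per_sample(example))  # {'SAMPLE_A': 30.0, 'SAMPLE_B': 35.5}
```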
Quality Controls
For the mutation spectrum analysis, we reproduced the quality control and data filtering pipelines used by Harris et al., as they applied the current state-of-the-art quality thresholds to remove questionable sequences for detecting population-level differences. Several mask files were applied to remove regions of the genome that might be lower quality, or might have very different mutation rates or base pair complexity compared to the rest of the genome. The 1kGP strict mask was used to remove low-quality regions of the genome, highly conserved regions were removed using the phastCons100way mask file and highly repetitive regions were removed using the NestedRepeats mask file from RepeatMasker. Furthermore, only sites with missingness below 0.01, MAF less than 0.1, and MAF greater than 0.9 were considered. In total, 7,786,023 diallelic autosomal variants passed our quality controls for the mutation spectrum analysis. We calculated the mutation spectrum of base pair triplets for the list of significant variants for the JPT population using a method similar to that described in Harris and Pritchard (2017).
For the reverse GWAS, the only filter applied was a minor and major allele frequency cutoff of 0.000599 (removing singletons, doubletons and tripletons), resulting in a total of S = 28,516,063 variants included in the test. We also used the NestedRepeats mask file to flag variants inside repetitive regions, as these were analyzed separately for false discovery rate estimation. Variants flagged by the 1kGP strict mask are included in the association test and in the FDR adjustment. These variants are only removed after the FDR adjustment and are excluded from downstream discussion of error patterns, since most population genetics analyses use the strict mask as a filter, and we expect to find problematic variants in filtered regions.
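A triplet mutation spectrum can be tallied as in the following simplified sketch. It labels each SNP by its reference 3-mer context and reports the fraction of SNPs in each class. It does not polarize alleles by ancestral state or merge reverse-complement classes, both of which a full analysis would need, and the function names and toy sequence are hypothetical.

```python
from collections import Counter


def triplet_mutation(ref_seq, pos, alt):
    """Label a SNP by its 3-mer context, e.g. 'TCC→TTC' (pos is 0-based)."""
    if pos < 1 or pos + 1 >= len(ref_seq):
        raise ValueError("SNP needs one flanking base on each side")
    context = ref_seq[pos - 1 : pos + 2]
    return f"{context}→{context[0]}{alt}{context[2]}"


def mutation_spectrum(ref_seq, snps):
    """Fraction of SNPs in each triplet class; snps is a list of (pos, alt)."""
    counts = Counter(triplet_mutation(ref_seq, pos, alt) for pos, alt in snps)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy reference and SNPs (illustrative only).
toy_ref = "ATCCG"
print(triplet_mutation(toy_ref, 2, "T"))  # TCC→TTC
print(mutation_spectrum(toy_ref, [(2, "T"), (2, "T"), (3, "A")]))
```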
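The cutoff of 0.000599 can be related to allele counts with a short calculation. Assuming I = 2,504 diploid individuals (the sample size given for the association test) and no missing genotypes, there are 5,008 chromosomes, and three copies of an allele correspond to a frequency of about 0.000599. The helper below expresses the same filter directly in counts; its name and the count-based formulation are illustrative assumptions, not the authors' implementation.

```python
N_INDIVIDUALS = 2504                 # I in the Methods
N_CHROMOSOMES = 2 * N_INDIVIDUALS    # 5,008 chromosomes for diploid autosomes


def passes_count_filter(alt_count, max_removed=3):
    """Keep a variant only if both alleles are seen more than max_removed times."""
    minor_count = min(alt_count, N_CHROMOSOMES - alt_count)
    return minor_count > max_removed


# Frequencies of singletons, doubletons, tripletons and the first retained class.
for copies in (1, 2, 3, 4):
    print(copies, round(copies / N_CHROMOSOMES, 6), passes_count_filter(copies))
```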
Testing the association of quality to genotype
When conducting a statistical analysis of population genetics data, we must account for population structure. In a typical GWAS, we are interested in modelling the phenotype as a function of the genotype. Here we have the opposite situation, where the quantitative variable (Q) is used as an explanatory variable. So we consider models where the genotype y is a function of an expected frequency \(\pi_{si}\), based on population structure, and Q. The null model is
\[ y_{si} \mid \pi_{si} \sim \mathrm{Binomial}(2, \pi_{si}). \tag{1} \]
The expected frequency for a SNP s and individual i can be estimated using principal component analysis, categorical population labels, or logistic factor analysis (Song et al., 2015). The alternative model then takes in Q as a covariate:
\[ y_{si} \mid q_i, \pi_{si} \sim \mathrm{Binomial}\left(2, \operatorname{logit}^{-1}\left(\operatorname{logit}(\pi_{si}) + \beta_s q_i\right)\right), \tag{2} \]
where \(q_i\) denotes the value of Q for individual i.
Under the null hypothesis, the slope coefficient \(\beta_s\) is zero and Model (2) reduces to Model (1). \(\beta_s\) denotes the association of the average quality of mapped bases Q with the genotype \(y_s\). To test the null hypothesis, we use the generalized likelihood ratio test statistic, whose deviance is a measure of the marginal importance of adding Q to the model. Under the null model, the deviance test statistic is approximately chi-square distributed with one degree of freedom.
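The likelihood ratio test described above can be sketched for a single SNP as follows. The sketch assumes that Model (2) takes the form y ~ Binomial(2, logit^-1(logit(π) + β q)), with the individual-specific frequencies π held fixed as an offset and β fitted by Newton-Raphson. These choices, the function names, and the toy data are illustrative assumptions; the GCAT implementation of Song et al. may differ, for example by refitting the structure coefficients jointly with β.

```python
import math

import numpy as np


def binom_loglik(y, eta):
    """Log-likelihood of Binomial(2, sigmoid(eta)), up to an additive constant."""
    return float(np.sum(y * eta - 2.0 * np.logaddexp(0.0, eta)))


def lrt_quality_association(y, q, pi, n_iter=50, tol=1e-10):
    """Return (beta_hat, deviance, p_value) for one SNP.

    y: genotypes in {0, 1, 2}; q: quality covariate; pi: expected frequencies.
    """
    y = np.asarray(y, dtype=float)
    q = np.asarray(q, dtype=float)
    pi = np.asarray(pi, dtype=float)
    offset = np.log(pi) - np.log1p(-pi)  # logit(pi), fixed under both models
    beta = 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(offset + beta * q)))
        grad = np.sum(q * (y - 2.0 * p))
        hess = -np.sum(2.0 * q**2 * p * (1.0 - p))
        if hess == 0.0:
            break
        step = grad / hess
        beta -= step  # Newton-Raphson update
        if abs(step) < tol:
            break
    deviance = 2.0 * (binom_loglik(y, offset + beta * q) - binom_loglik(y, offset))
    deviance = max(deviance, 0.0)
    # Survival function of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(deviance / 2.0))
    return beta, deviance, p_value


# Toy data (illustrative only): alternate alleles concentrated at low, centred Q.
y_toy = [2, 2, 1, 1, 0, 0, 0, 0]
q_toy_centred = [-2.0, -2.0, -1.0, -1.0, 1.0, 1.0, 2.0, 2.0]
print(lrt_quality_association(y_toy, q_toy_centred, 0.375))
```

Centring Q before fitting, as in the toy data, is a choice made here to keep the Newton iterations well behaved; it is not something the text specifies.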
We run a total of S regressions, where S is the total number of genomic loci. Given the large number of tests, the large proportion of expected null hypotheses and the positive dependencies across the genome, we used the two-stage Benjamini & Hochberg step-up FDR-controlling procedure to adjust the p-values (Benjamini et al., 2006). Using a nominal Type-I error rate α = 0.01, a total of 15,270 variants were found to be statistically significant. See Supplementary Data for a list of variants and adjusted p-values.
Individual-specific allele frequency
Examples of models that are widely used to account for population structure include the Balding-Nichols model (Balding and Nichols, 1995) and the Pritchard-Stephens-Donnelly model (Pritchard et al., 2000). These and several other similar models used in GWAS can be understood in terms of the following matrix factorization:
\[ L = AH, \]
where the ith column \(h_i\) of the K × I matrix H encodes the population structure of the ith individual and the sth row of the S × K matrix A determines how that structure is manifested in SNP s. When Hardy-Weinberg equilibrium holds, the observed genotypes can be assumed to be generated by the following Binomial model:
\[ y_{si} \mid \pi_{si} \sim \mathrm{Binomial}(2, \pi_{si}), \]
for s = 1, …, S and i = 1, …, I, where \(y_{si} \in \{0, 1, 2\}\) and \(\operatorname{logit}(\pi_{si})\) is the (s, i) element of the matrix L, such that \(\pi_{si}\) is the individual-specific allele frequency.
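To make the factor model concrete, the following NumPy sketch simulates genotypes from it. It assumes that the factorization takes the form L = AH, with logit(π_si) given by the (s, i) element of L, and it draws A and H at random purely for illustration. The dimensions and names are toy choices, not those of the study.

```python
import numpy as np


def simulate_genotypes(A, H, rng):
    """Draw genotypes y_si ~ Binomial(2, pi_si) with logit(pi) = A @ H."""
    L = A @ H                          # S x I matrix of logit allele frequencies
    pi = 1.0 / (1.0 + np.exp(-L))      # individual-specific allele frequencies
    return rng.binomial(2, pi), pi


rng = np.random.default_rng(0)
S, K, I = 4, 2, 6                      # toy numbers of SNPs, factors, individuals
A = rng.normal(size=(S, K))            # how structure is manifested in each SNP
H = rng.normal(size=(K, I))            # population structure of each individual
Y, pi = simulate_genotypes(A, H, rng)
print(Y.shape)
```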
To test whether quality is associated to genotype while adjusting for population structure, we performed the Genotype-Conditional Association Test (GCAT) proposed by Song et al. (2015). The GCAT is a regression approach that assumes the following model:
\[ y_{si} \mid q_i, h_i \sim \mathrm{Binomial}\left(2, \operatorname{logit}^{-1}\left(\sum_{k=0}^{K} a_{sk} h_{ki} + \beta_s q_i\right)\right), \]
for s = 1, …, S and i = 1, …, I (S = 28,516,063 and I = 2,504), where \(q_i\) is the value of Q for individual i, \(h_{0i} = 1\) so that \(a_{s0}\) is the intercept term, and \(\sum_{k=0}^{K} a_{sk} h_{ki} = \operatorname{logit}(\pi_{si})\). The vectors \(h_i\) of the matrix H are unobserved but can be estimated using Logistic Factor Analysis (LFA) (Song et al., 2015), and these estimates are used directly in the model. We approximated the population structure using K = 5 latent components from a subsampled genotype matrix consisting of M = 2,306,130 SNPs (we picked SNPs from the 1kGP OMNI 2.5). To avoid possible biases in computing PCA from the biased variants, we considered the genotype matrix L obtained by downsampling 1kGP variants to the positions on the OMNI 2.5M chip.
Imputation
Using the Michigan Imputation Server, we imputed the genotype data from the 1kGP for chromosomes 1 and 2. As input, we used the 1kGP Omni 2.5M chip genotype data. The VCF file returned from the server was then downloaded and searched to count the Q-associated variants that had been imputed.
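Counting flagged variants in a returned VCF can be sketched as below. Matching on chromosome, position, reference and alternate allele is an assumption, as is the chromosome naming ("1" rather than "chr1"); the function name and the toy records are hypothetical. A real imputed VCF would typically be compressed and much larger, so it would be read line by line from the file rather than from a list.

```python
def count_imputed(vcf_lines, flagged):
    """Count flagged (chrom, pos, ref, alt) variants that appear in a VCF."""
    found = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):  # handle multi-allelic records
            key = (chrom, int(pos), ref, allele)
            if key in flagged:
                found.add(key)
    return len(found)


# Toy VCF records and flagged list (illustrative only).
toy_vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t1000\t.\tA\tC\t.\tPASS\t.\n",
    "1\t2000\t.\tG\tT,A\t.\tPASS\t.\n",
    "2\t3000\t.\tC\tG\t.\tPASS\t.\n",
]
toy_flagged = {("1", 1000, "A", "C"), ("1", 2000, "G", "A"), ("2", 9999, "T", "C")}
print(count_imputed(toy_vcf, toy_flagged))  # 2
```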
Acknowledgments
We would like to thank Kelley Harris for sharing her mutation spectrum scripts. We would also like to thank the members of the Gravel lab for their help with coding and useful discussions.