Across-cohort QC analyses of genome-wide association study summary statistics from complex traits

Genome-wide association studies (GWASs) have been successful in discovering replicable SNP-trait associations for many quantitative traits and common diseases in humans. Typically, the effect sizes of SNP alleles are very small, and this has led to large genome-wide association meta-analyses (GWAMA) to maximize statistical power. The trend towards ever-larger GWAMA is likely to continue, yet dealing with summary statistics from hundreds of cohorts increases logistical and quality control problems, including unknown sample overlap, and these can lead to both false positive and false negative findings. In this study we propose a new set of metrics and visualization tools for GWAMA, using summary statistics from cohort-level GWASs. We propose a pair of methods for examining the concordance between demographic information and summary statistics. In method I, we use the population genetics Fst statistic to verify the genetic origin of each cohort and its geographic location, and demonstrate using GWAMA data from the GIANT Consortium that geographic locations of cohorts can be recovered and outlier cohorts can be detected. In method II, we conduct principal component analysis based on reported allele frequencies, and are able to recover the ancestral information for each cohort. In addition, we propose a new statistic that uses the reported allelic effect sizes and their standard errors to identify significant sample overlap or heterogeneity between pairs of cohorts. Finally, to quantify unknown sample overlap across all pairs of cohorts, we propose a method that uses randomly generated genetic predictors, does not require the sharing of individual-level genotype data, and does not breach individual privacy.


Introduction
approach explores the genetic and QC context of all the cohorts in GWAMA together rather than by treating

We use WTCCC data as an illustration to detect 2,934 shared controls between any two of the diseases by

When using 200 and 500 random SNPs, all the known 2,934 shared controls were detected from 21 cohort-pairwise comparisons; when using 100 random SNPs, on average 2,931 shared samples were identified, which is more accurate than using λmeta constructed using either genetic effects or allele frequencies (

SNPs as used in PPSR. Although Gencrypt guidelines suggest use of at least 20,000 random SNPs 10 ,

average about 2,920 (99.6% of the shared controls) overlapping samples were detected, only slightly lower than PPSR. For example, for BP and CAD, Gencrypt detected 2,912 shared controls, but was unable to identify about 20 overlapping controls, due to missing data (on average 1% missing rate). Increasing the number of SNPs when using Gencrypt is likely to overcome the problem of missing data.

Furthermore, PPSR is able to detect pairs of relatives. For example, between the BD and CAD cohorts, two pairs of apparent first-degree relatives were detected (Fig. 9a). In order to find additional first-degree

to any specific ethics requirements. We suggest to protect the privacy with sufficient accuracy (Fig 9c). Of note, if a meta-analysis is conducted within a research consortium, the application of PPSR is even safer because the exchange of information is between the consortium analysis hub and each cohort independently.

In this study, we provide a set of metrics for monitoring and improving the quality of large-scale GWAMA based on summary statistics. These tools not only enrich the toolkit available to analysts for GWAMA, but also

cohorts, but the coordinates of a cohort may be slightly shifted with inclusion or exclusion of other cohorts.
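As a rough illustration of how PPSR can flag shared samples without exchanging genotypes, the sketch below assumes that both cohorts apply the same set of random SNP weights and pass only the resulting scores to an analysis hub, which regresses the scores of each individual in one cohort on those of each individual in the other. The function names, the standard-normal weights, the standardization by shared reference allele frequencies, the simulated genotypes, and the slope threshold of 0.9 are all assumptions made for this example and are not taken from GEAR. Under these assumptions a slope near 1 points to the same individual, and a slope near 0.5 would be expected for first-degree relatives.

```python
import numpy as np

def pseudo_profile_scores(genotypes, ref_freq, weights):
    """Turn 0/1/2 genotypes into K pseudo profile scores per individual."""
    # Standardize each SNP with reference allele frequencies
    # (assumed here to be shared by all cohorts).
    z = (genotypes - 2 * ref_freq) / np.sqrt(2 * ref_freq * (1 - ref_freq))
    return z @ weights  # shape: (n_individuals, K)

def pairwise_slopes(scores_a, scores_b):
    """Slope of regressing each B individual's scores on each A individual's."""
    a = scores_a - scores_a.mean(axis=1, keepdims=True)
    b = scores_b - scores_b.mean(axis=1, keepdims=True)
    return (a @ b.T) / (a ** 2).sum(axis=1, keepdims=True)

def find_overlap(scores_a, scores_b, threshold=0.9):
    """Return (index in A, index in B) pairs whose slope exceeds the threshold."""
    slopes = pairwise_slopes(scores_a, scores_b)
    return [(int(i), int(j)) for i, j in zip(*np.where(slopes > threshold))]

# Simulated example (hypothetical data): 500 random SNPs, 200 random scores.
rng = np.random.default_rng(0)
n_snps, n_scores = 500, 200
ref_freq = rng.uniform(0.1, 0.9, n_snps)
cohort_a = rng.binomial(2, ref_freq, size=(40, n_snps))
cohort_b = rng.binomial(2, ref_freq, size=(40, n_snps))
cohort_b[:5] = cohort_a[:5]  # five individuals appear in both cohorts

# The random weights are the only thing the cohorts need to agree on.
weights = rng.standard_normal((n_snps, n_scores))
scores_a = pseudo_profile_scores(cohort_a, ref_freq, weights)
scores_b = pseudo_profile_scores(cohort_b, ref_freq, weights)
shared_pairs = find_overlap(scores_a, scores_b)
```

Only `scores_a` and `scores_b` leave the cohorts in this sketch; the genotype matrices stay local, which is the property the text relies on for privacy.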
application of the data, we believe the impact will not influence the inference of the genetic background of

purpose of both methods is to find the discordance between demographic information and genetic information, or outliers. The projection does not attempt to discover the detailed demographic past that shapes a

is likely to be slightly greater than 1 solely due to unknown heterogeneity, slight as observed, in generating

It is well recognised that overlapping samples may inflate the type-I error rate of GWAMA and therefore lead to false positives. Although post-hoc correction of the test statistic is possible 26-28 , stringent quality control that rules out overlapping samples makes the whole analysis easier and lowers the risk of false positives.
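To make the pairwise overlap statistic concrete, the sketch below assumes it takes the form of a genomic-control-style ratio: for each SNP, the squared difference between the two cohorts' reported effect sizes is divided by the sum of their squared standard errors, and the median of these values is divided by the median of a chi-square distribution with one degree of freedom. The function name, the simulated effect sizes, and the correlation of 0.5 used to mimic overlap are assumptions for illustration only. Under this form, values near 1 suggest independent cohorts, values below 1 suggest overlapping samples, and values above 1 suggest heterogeneity.

```python
import numpy as np

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def lambda_meta(beta1, se1, beta2, se2):
    """Median of per-SNP difference statistics, scaled by the chi-square(1) median."""
    t = (beta1 - beta2) ** 2 / (se1 ** 2 + se2 ** 2)
    return float(np.median(t) / CHI2_1DF_MEDIAN)

# Simulated null SNPs (hypothetical data) with unit standard errors.
rng = np.random.default_rng(0)
n_snps = 30000
se = np.ones(n_snps)
beta1 = rng.standard_normal(n_snps)

# Independent cohorts: estimates are uncorrelated.
beta2_independent = rng.standard_normal(n_snps)

# Overlapping cohorts: shared samples induce positively correlated estimates
# (correlation 0.5 is an arbitrary choice for this example).
beta2_overlap = 0.5 * beta1 + np.sqrt(0.75) * rng.standard_normal(n_snps)

lam_independent = lambda_meta(beta1, se, beta2_independent, se)
lam_overlap = lambda_meta(beta1, se, beta2_overlap, se)
```

In this simulation the correlated estimates shrink the differences between cohorts, so `lam_overlap` falls well below `lam_independent`.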

A better solution would be to rule out shared samples at the start, for pairs of cohorts that show deflated λmeta, and we propose PPSR to accomplish this.

In summary, to maximize the inference from multi-cohort GWAMA, accurate cohort-level information on

for gene discovery. All methods proposed are implemented in the freely available software GEAR.

generation of the data is available from www.wtccc.org.uk. We also acknowledge High Performance Computing support from the Information Technology group at the Queensland Brain Institute, The University of Queensland.

NW discussed results and methods, and provided comments that improved earlier versions of the manuscript.

Other authors provided cohort-level summary statistics and contributed to improving the study and

Competing financial interests: The authors declare no competing financial interests.

ty control and conduct of genome-wide association meta-analyses. Nat Protoc 2014; 9: 1192-212.
3 Wood AR, Esko T, Yang J, Vedantam S, Pers TH, Gustafsson S et al.
Nat Genet 2014; 46:

for the GIANT height GWAS cohorts.

Fst is a measure of genetic differentiation between populations. It is usually estimated using individual-level genotype data from multiple samples in two or more populations 1 . Here, we use summary data on allele frequencies, which implicitly assumes Hardy-Weinberg equilibrium genotype frequencies within populations. We use the summary statistic as a metric for quality control for each cohort. If the allele frequencies reported for a cohort depart genome-wide from their expectation based on known ancestry due to technical artifacts, then we may observe an unexpected Fst value when comparing to a reference panel of known ancestry.

Europe diversity, we chose CEU, FIN, and TSI as the reference panels. As the different allele frequencies across three samples reflected the real diversity among these reference panels, we

These 30,000 markers are quasi-independent and evenly distributed across

Another reason we chose 30,000 markers is that there are around 30,000 quasi-independent markers for GWAS data as observed in empirical data and expected from theory 2,3 .

In this study, Fst is calculated from the allele frequencies estimated from cohorts, provided as

is treated as a data statistic for measuring allele frequency can vary with context 4 .

is the weighted average frequency in the entire sample, and r is the number of populations.
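The idea of computing Fst from reported allele frequencies, rather than from individual-level genotypes, can be sketched for the two-population case (one cohort against one reference panel). The example below uses Hudson's estimator as a stand-in, which is not necessarily the estimator used in this study; the function name, the simulated frequencies, and the choice to pass sample sizes as numbers of sampled chromosomes are assumptions made for illustration.

```python
import numpy as np

def hudson_fst(p1, p2, n1, n2):
    """Two-population Fst from allele frequencies (Hudson's estimator).

    p1, p2: per-SNP allele frequencies in the cohort and the reference panel.
    n1, n2: sample sizes, taken here as numbers of sampled chromosomes.
    Returns the ratio of summed numerators to summed denominators across SNPs.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    # Squared frequency difference, corrected for sampling variance in each sample.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return float(num.sum() / den.sum())

# Simulated example (hypothetical data): 30,000 markers, one well-matched
# cohort and one cohort whose reported frequencies deviate from the reference.
rng = np.random.default_rng(1)
ref = rng.uniform(0.05, 0.95, 30000)
matched = np.clip(ref + rng.normal(0, 0.01, ref.size), 0, 1)
shifted = np.clip(ref + rng.normal(0, 0.10, ref.size), 0, 1)

fst_matched = hudson_fst(matched, ref, 2000, 2000)
fst_shifted = hudson_fst(shifted, ref, 2000, 2000)
```

A cohort whose reported frequencies are distorted, whether by ancestry mismatch or by a technical artifact such as allele coding errors, would show a larger value against the reference panel than a well-matched cohort, as `fst_shifted` does relative to `fst_matched` here.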

Here, we only compared each cohort to the 1KG reference panel, so r = 2 and the equation

On the right side of the equation, the first term represents the sampling variance for allele frequency for a pair of cohorts, and the second term represents the allele frequency difference

Create the coordinates for the reference samples. Without loss of generality, these three

Step . Connecting the three coordinates created a "FTC" triangle inside the reference triangle.

Step 2 Find the center of gravity of the cohort triangle using Equation 5. The center of gravity of the "FTC"

Method II: Principal component analysis for cohort-level allele frequencies

Method III: The detection of overlapping samples with

Inference of cohort origins at the within-Europe level. To assess genetic background, for

for a locus associated with disease its correlation of the regression coefficient is

can construct a statistic
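For Method II, a principal component analysis on cohort-level allele frequencies can be sketched as follows. This is a minimal example that treats each cohort as one row of a cohorts-by-SNPs frequency matrix, centers each SNP across cohorts, and takes the leading components from a singular value decomposition. The function name, the absence of any per-SNP scaling, and the simulated two-group data are assumptions for illustration and may differ from the procedure implemented in GEAR.

```python
import numpy as np

def cohort_pca(freqs, n_components=2):
    """Project cohorts onto leading principal components of allele frequencies.

    freqs: array of shape (n_cohorts, n_snps) with reported allele frequencies.
    Returns an array of shape (n_cohorts, n_components) with cohort coordinates.
    """
    x = np.asarray(freqs, dtype=float)
    x = x - x.mean(axis=0)  # center each SNP across cohorts
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Simulated example (hypothetical data): six cohorts from two ancestral groups.
rng = np.random.default_rng(2)
n_snps = 5000
group1 = rng.uniform(0.1, 0.9, n_snps)
group2 = np.clip(group1 + rng.normal(0, 0.05, n_snps), 0.01, 0.99)

# Each cohort reports its group's frequencies plus small sampling noise.
cohorts = np.vstack(
    [group1 + rng.normal(0, 0.01, n_snps) for _ in range(3)]
    + [group2 + rng.normal(0, 0.01, n_snps) for _ in range(3)]
)
coords = cohort_pca(cohorts)
pc1 = coords[:, 0]
```

Because the SNPs are centered across the cohorts included in the analysis, the coordinates of a given cohort depend on which other cohorts are present, which is consistent with the remark that coordinates may shift slightly when cohorts are added or removed.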