Phenotype prediction from genome-wide genotyping data: a crowdsourcing experiment

Background The increasing statistical power of genome-wide association studies is fostering the development of precision medicine through genomic predictions of complex traits. Nevertheless, it has been shown that the results remain relatively modest. A reason might be the nature of the methods typically used to construct genomic predictions. Recent machine learning techniques have properties that could help to capture the architecture of complex traits better and improve genomic prediction accuracy.

Methods We relied on crowd-sourcing to efficiently compare multiple genomic prediction methods. This represents an innovative approach in the genomic field because of the privacy concerns linked to human genetic data. Our study builds on two crowd-sourcing elements. First, we constructed a dataset from openSNP (opensnp.org), an open repository where people voluntarily share their genotyping data and phenotypic information in an effort to participate in open science. To leverage this resource, we release the ‘openSNP Cohort Maker’, a tool that builds a homogeneous and up-to-date cohort based on the data available on opensnp.org. Second, we organized an open online challenge on the CrowdAI platform (crowdai.org) aiming at predicting height from genome-wide genotyping data.

Results The ‘openSNP Height Prediction’ challenge lasted for three months. A total of 138 challengers contributed 1275 submissions. The winner computed a polygenic risk score using the publicly available summary statistics of the GIANT study to achieve the best result (r² = 0.53 versus r² = 0.49 for the second-best).

Conclusion We report here the first crowd-sourced challenge on publicly available genome-wide genotyping data. We also deliver the ‘openSNP Cohort Maker’, which will allow people to make use of the data available on opensnp.org.

As costs for genetic analyses keep dropping, genetic testing is becoming more available and affordable for increasing numbers of people, a trend that can be seen in the rising number of customers that use Direct-To-Consumer (DTC) genetic testing services like 23andMe and AncestryDNA [1]. Decreasing costs and increased availability have led to the creation of a number of genomic data resources, such as the Personal Genome Project [2], DNA.land [3] and openSNP [4]. Amongst these data resources, openSNP is unique in that it offers open participation and open access to the data: participants of openSNP can use the platform to openly share their existing DTC genetic test data, putting their data in the public domain. In addition, participants can share phenotypic traits, such as eye color, hair color or height. Since its start in 2011, over 5,000 people have used the platform to make their genetic data available.

Crowd-sourced data analysis competitions have become increasingly popular in the past few years, allowing data science experts and enthusiasts to collaboratively solve real-world problems through online challenges. This approach allows a broad exploration of the model space on a specific dataset by people with data analysis skills coming from very different backgrounds. In the context of genomic prediction of complex diseases, it is unprecedented. While the most widely used platform, kaggle.com, offers monetary rewards, crowdai.org is more academic-centered and offered the winner the opportunity to present her work at a scientific conference.

We hereby present a crowd-sourcing experiment where participants could compete on crowdai.org to produce the best possible prediction of the height phenotype using data from opensnp.org.
Materials and methods

openSNP Cohort Maker

Because opensnp.org enforces no restrictions on what users can upload, the data dump of the whole community requires in-depth curation before it yields a clean cohort of genome-wide genotyped individuals. To make these data accessible to anyone, we developed the openSNP Cohort Maker, a tool that systematically produces a clean and up-to-date openSNP cohort of genome-wide genotyped individuals.

When running the openSNP Cohort Maker, the data processing starts by downloading the archive containing all data that were uploaded on opensnp.org by the community. Then, files are removed if: they are not text or compressed text; they correspond to exome sequencing; they are genotyping data from deCODEme; they are set on the GRCh38 reference; they are corrupted. For individuals who submitted multiple genome-wide genotyping files, either as duplicates or from different DTC companies, only the largest file is kept. A set of tools is integrated into the pipeline: genotyping data with coordinates based on NCBI36 are upgraded to match the GRCh37 reference [5] with liftOver [6]; PLINK [7][8][9] is used to convert file formats.

VCFtools [10] is used to sort variants; BCFtools [11] is used to normalize reference and alternate alleles on the GRCh37 reference genome, rename samples, index files, and finally merge all individuals into one file. The output file can be directly imputed on the Sanger Imputation Service [12]. The openSNP Cohort Maker is available on GitHub (S1 Software). Leveraging parallel computing, with 28 CPUs it takes 16 hours to [...]

The dataset that we used for the challenge was produced by the openSNP Cohort Maker and imputed on the Sanger Imputation Service with HRC (r1.1). We sent the opensnp.org community a survey asking for their height, allowing us to create a dataset of 921 individuals with both height phenotype and genotyping data.
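The rule described for the openSNP Cohort Maker, where only the largest genotyping file is kept for individuals with several uploads, could be sketched as follows. This is an illustration only, not the tool's actual code: the `GenotypeFile` record, its field names, and the file names are hypothetical, and the real layout of the openSNP data dump may differ.

```python
from collections import namedtuple

# Hypothetical record type; the real data dump layout is not assumed here.
GenotypeFile = namedtuple("GenotypeFile", ["user_id", "path", "size_bytes"])


def keep_largest_per_user(files):
    """Return one file per user, keeping the largest file by size."""
    best = {}
    for f in files:
        current = best.get(f.user_id)
        if current is None or f.size_bytes > current.size_bytes:
            best[f.user_id] = f
    # Sort by user id so the output order is deterministic.
    return sorted(best.values(), key=lambda f: f.user_id)


# Toy input: user 1 uploaded two files, user 2 uploaded one.
files = [
    GenotypeFile(1, "user1_companyA.txt", 24_000_000),
    GenotypeFile(1, "user1_companyB.txt", 17_000_000),
    GenotypeFile(2, "user2_companyA.txt", 15_500_000),
]
kept = keep_largest_per_user(files)
for f in kept:
    print(f.user_id, f.path)
```

File size is used here as a simple proxy for the number of genotyped variants, which is presumably why the largest file is preferred.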

Challenge participants could use two versions of the genotyping data. One version was a sub-dataset containing 9,894 genetic variants, including the top 9,207 variants (p < 5 × 10⁻³) associated with height in the GIANT study [13], and 687 Y chromosome [...] predictions for the samples of the test set. The test set predictions were then submitted to the CrowdAI platform for evaluation and scoring. The score was based on Pearson's correlation (r²) between the predicted and true height. The challengers could submit as many prediction models as they wanted in an attempt to improve their method and beat their best score. The scoring method was protected from known exploits [14]. The data are available online on the Zenodo platform (S1 Dataset), and the webpages presenting the challenge (S1 Appendix) and the leaderboard (S2 Appendix) have been saved to PDF from the CrowdAI platform.
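To make the scoring criterion concrete, the sketch below computes the squared Pearson correlation between predicted and true heights. It is a minimal illustration under stated assumptions, not the CrowdAI scoring code: the function names and the toy heights are invented for the example, and the protection against known exploits [14] is not reproduced.

```python
import math


def pearson_r(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


def score(predicted, observed):
    """Squared Pearson correlation (r^2) used as the illustrative score."""
    r = pearson_r(predicted, observed)
    return r * r


# Toy heights in centimetres (made up for illustration).
observed = [160.0, 170.0, 180.0, 190.0]
predicted = [162.0, 169.0, 181.0, 188.0]
print(round(score(predicted, observed), 4))
```

Because r² is invariant to linear rescaling, a prediction that is systematically offset or scaled receives the same score as a perfectly calibrated one; squaring also discards the sign of the correlation.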

A total of 138 challengers participated, contributing 1275 submissions. The winner computed a polygenic risk score (PRS) using the publicly available summary statistics of the GIANT study to achieve the best result (r² = 0.53 versus r² = 0.49 for the second-best).

The winning method was based on PRS. The training set and testing set were combined for quality control and data preparation. As self-reported sex was not provided, participants' chromosomal sex (i.e. XX vs XY) was imputed using PLINK, which uses the X chromosome inbreeding coefficient (F) to impute sex. Standard cutoffs were used, whereby F < 0.2 yielded an XX call, while F > 0.8 yielded an XY call. One participant yielded an F of exactly 0.2 and was removed from subsequent analyses (this participant was in the training data). Of the remaining 920 individuals, 396 (43%) were XX and 524 (57%) were XY.
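The cutoff rule stated above can be written as a small function. This sketch only restates the thresholds given in the text and does not reproduce PLINK's computation of F; the function name and the example F values are assumptions made for illustration.

```python
def call_chromosomal_sex(f_x, xx_max=0.2, xy_min=0.8):
    """Classify chromosomal sex from the X chromosome inbreeding coefficient F.

    Returns "XX" if F < xx_max, "XY" if F > xy_min, and None when F falls
    between (or exactly on) the cutoffs, i.e. no call is made.
    """
    if f_x < xx_max:
        return "XX"
    if f_x > xy_min:
        return "XY"
    return None


# Made-up F values; the third sits exactly on the lower cutoff.
for f_value in (0.03, 0.97, 0.2):
    print(f_value, call_chromosomal_sex(f_value))
```

With strict inequalities, a value of exactly 0.2 receives no call, which matches the single participant excluded from the analysis.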

The openSNP platform contains genomic data of relatives. The presence of relatives has the potential to bias results, as closely related individuals will dominate the estimation of principal components and will inflate prediction accuracy statistics [15].

The genetic relationship between participants was calculated using the PLINK [...] well-established that the frequency of genetic variants and correlational structure of the genome differ across ancestral populations [15][16][17]. These differences are the major barrier to combining genomic data across ancestries in genome-wide association studies [18,19]. Genome-wide principal components were computed using PLINK. A scree plot of eigenvalues indicated an elbow at three components. The large eigenvalue of the first principal component, and the shape of all three components, clearly showed that both the training and testing data contained participants of multiple ancestries (i.e. participants of European, African, and Asian ancestry were present in both data sets), though the majority of participants were of European descent.

Genomic data were further processed in PLINK, following the steps outlined by [...] only the variant with the lowest p-value in the GIANT genome-wide association study of height [13] was retained.

A PRS is a metric reflecting an individual's genetic burden for a disease or trait of interest [21,22]. Prior work on the genetic basis of height has found that a PRS for height captures over 20% of the variance in independent samples [13]. PRS are calculated by averaging the number of disease-associated alleles, weighted by their effect sizes from an independent study [23]. Put differently, a linear regression predicting the outcome trait is modeled at each individual variant, using the effect size from an independent study. These predictions are then averaged across all models. The one free parameter is the choice of which variants to include in the calculation of the PRS.
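The PRS definition above, an effect-size-weighted average of allele counts with the set of included variants as the free parameter, can be illustrated with a short sketch. It is not the PRSice or PLINK implementation: the variant identifiers, effect sizes, p-values, and the choice of a p-value threshold as the inclusion rule are assumptions made for the example.

```python
def polygenic_score(dosages, effect_sizes, p_values, p_threshold):
    """Average of effect-allele counts weighted by external effect sizes.

    dosages:      variant id -> effect-allele count (0, 1 or 2) for one person
    effect_sizes: variant id -> effect size from an independent study
    p_values:     variant id -> association p-value from that study
    p_threshold:  only variants with p < p_threshold enter the score
    """
    selected = [
        v for v in effect_sizes
        if v in dosages and p_values[v] < p_threshold
    ]
    if not selected:
        return 0.0
    total = sum(effect_sizes[v] * dosages[v] for v in selected)
    return total / len(selected)


# Hypothetical summary statistics for three variants.
effect_sizes = {"var1": 0.04, "var2": -0.02, "var3": 0.01}
p_values = {"var1": 1e-9, "var2": 3e-6, "var3": 0.2}
# Hypothetical genotype of one individual (effect-allele counts).
dosages = {"var1": 2, "var2": 1, "var3": 0}

for threshold in (1e-8, 1e-5, 1.0):
    print(threshold, polygenic_score(dosages, effect_sizes, p_values, threshold))
```

Loosening the threshold adds more variants, each with a smaller and noisier effect, so the score changes with the inclusion rule even for a fixed genotype.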

PRSice and PLINK were used to compute PRS for height in the openSNP sample, using the results from the GIANT study of height. PRS were computed at 14 different p-value thresholds (p < 10⁻⁸ to p < 1.0), shown in Fig 1. Linear regressions predicting height in the training data were fit in R [26]. Chromosomal sex was the first variable included in the model, followed by the top three genome-wide principal components, which help to control for differences in ancestral background [27]. Chromosomal sex predicted 46.81% of the variance in the training data, and the addition of the three principal components subsequently explained 0.91% of variance. Finally, each of the 14 PRS was added to the model and compared. The PRS at p < 1 × 10⁻⁵ was observed to perform best, and captured an additional 10.08% of variance. Thus, the final linear [...]

Fig 1. Variance explained as a function of p-value threshold. PRS were produced at a range of p-value thresholds (x-axis). Y-axis represents Nagelkerke's r-squared from training-sample linear regressions. The model with the best performance in the training data (p < 5 × 10⁻⁴) was then used to predict height in the test sample.

Height is an extremely polygenic trait where even the hundreds of genome-wide significant variants together account for only a small portion of heritability [28].

Because of the modest size of the openSNP cohort, the lack of statistical power was the main difficulty for the challengers in capturing the association signals coming from the genetic variants. The winning model of the challenge incorporated the GWAS summary statistics from the GIANT study to compute a PRS, in addition to deriving each participant's sex. It should be noted that PRS is a standard and widely used technique in the field of statistical genetics. While cross-population PRS have been shown to be unreliable in multiple cases, such as type 2 diabetes [29], coronary artery disease [30], and height [31], the similarities between the GIANT and openSNP cohorts were sufficient to provide a winning strategy. This is likely because only a small portion of samples were of non-European ancestry (7%). [...] weakness. In this case, the effect of a variant depends on the presence or absence of another variant, a mechanism that is not captured by additive models and accounts for an unknown part of the phenotypic variance [33]. Eventually, more advanced statistical approaches relying on machine learning could improve on the prediction accuracy provided by purely additive risk scores. Because of the diversity in available methods and the world-wide distribution of excellent data scientists, we believe that crowd-sourcing approaches represent a promising strategy to help improve phenotypic prediction from large-scale genomic data.
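The winning model was assessed by adding predictors in sequence (chromosomal sex, then principal components, then the PRS) and reporting the extra variance explained at each step. The sketch below illustrates that nested-regression bookkeeping on simulated data. It is not the authors' R analysis: the data are randomly generated, the effect sizes are invented, a single principal component stands in for the three used in the study, and ordinary R² from least squares is used as the variance-explained measure.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R^2 of a least-squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Simulated cohort (all numbers invented for illustration).
rng = random.Random(0)
n = 500
sex = [rng.choice([0.0, 1.0]) for _ in range(n)]
pc1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
prs = [rng.gauss(0.0, 1.0) for _ in range(n)]
height = [163.0 + 13.0 * s + 0.8 * p + 3.0 * g + rng.gauss(0.0, 6.0)
          for s, p, g in zip(sex, pc1, prs)]

steps = [("sex", [sex]),
         ("sex + PC1", [sex, pc1]),
         ("sex + PC1 + PRS", [sex, pc1, prs])]
r2_values = [r_squared(cols, height) for _, cols in steps]

previous = 0.0
for (label, _), r2 in zip(steps, r2_values):
    print(f"{label}: R2 = {r2:.3f} (increment {r2 - previous:.3f})")
    previous = r2
```

Repeating the last step once per p-value threshold and keeping the PRS with the largest increment would mirror the threshold comparison described for the winning entry.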

Because of privacy concerns, studies relying on crowd-sourcing are almost impossible to set up in the field of human genomics. A first experiment was carried out in 2016 to predict anti-TNF treatment response in rheumatoid arthritis [34], but participants had to apply to participate in the challenge. Here, thanks to the openSNP community, we released the first crowd-sourced and fully open challenge based on publicly available [...] backgrounds, from the vibrant machine learning community. It resulted in the assessment of a broad variety of methods for genotype-based phenotypic prediction through a total of 1275 submissions. We hereby also report a tool to create an