Abstract
Motivation Polygenic risk scores describe the genomic contribution to complex phenotypes and consistently account for a larger proportion of the variance than single nucleotide polymorphisms (SNPs) alone. However, there is little consensus on the optimal data input for generating polygenic risk scores, and existing approaches largely preclude the use of imputed posterior probabilities and strand-ambiguous SNPs.
Results We developed PRS-on-Spark (PRSoS), a polygenic risk score software implemented in Apache Spark and Python that accommodates a variety of data inputs (e.g., observed genotypes, imputed genotypes, or imputed posterior probabilities) as well as strand-ambiguous SNPs. We show that PRSoS is flexible and efficient, and that it computes polygenic risk scores at a range of p-value thresholds more quickly than existing software (PRSice). We also show that the use of imputed posterior probabilities and the inclusion of strand-ambiguous SNPs increase the proportion of variance explained by polygenic risk scores for major depression.
Availability and Implementation PRSoS is written in Apache Spark and Python and is freely available (see https://github.com/MeaneyLab/PRSoS).