Abstract
Polygenic risk scores represent an individual’s genetic susceptibility to a phenotype. As with any model, the statistical models commonly used to fit polygenic risk scores and assess their accuracy rest on several assumptions. The effects of these assumptions on polygenic risk score models have not been thoroughly assessed.
We assessed 26 variations of the traditional polygenic risk score model, each of which mitigates assumptions in one of five facets of disease modelling: representation of age (6 variations), censorship (3 variations), competing risks (7 variations), formation of disease labels (6 variations), and selection of covariates (4 variations). Each model variation was fitted with data from the UK Biobank to predict 18 diseases and included age, sex, and a polygenic risk score derived from the PGS Catalog. Compared with the plain model, which retained the assumptions in all five facets, the model variations often fit the data better and generated predictions that differed largely from those of the plain model. Royston’s R2 measured each model’s goodness of fit and thereby determined whether a variation improved upon the plain model. For 15 of the 26 model variations, Royston’s R2 was greater than that of the plain model for >50% of diseases. The reclassification rate, defined as the fraction of individuals in the top five percentiles of the plain model’s predictions who are not in the top five percentiles of a model variation’s predictions, was used to determine whether the variation led to significantly different predictions. For 20 of the 26 model variations, the median reclassification rate across the 18 diseases was greater than 10%. Comparisons of accuracy statistics further illustrated how much each model variation’s predictions differed from the plain model’s predictions.
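The reclassification rate defined above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the rank-based cut-off, the rounding of the group size, and the arbitrary handling of tied predictions are assumptions made only for illustration.

```python
import numpy as np


def top_fraction_indices(predictions, fraction=0.05):
    """Return the positions of the individuals with the highest predictions.

    Assumption: the "top five percentiles" are taken as the round(fraction * n)
    highest-ranked individuals; ties are broken by array order.
    """
    predictions = np.asarray(predictions, dtype=float)
    k = max(1, int(round(fraction * predictions.size)))
    order = np.argsort(predictions, kind="stable")  # ascending order
    return set(order[-k:].tolist())


def reclassification_rate(plain_pred, variant_pred, fraction=0.05):
    """Fraction of the plain model's top group that is absent from the
    top group of the model variation."""
    plain_top = top_fraction_indices(plain_pred, fraction)
    variant_top = top_fraction_indices(variant_pred, fraction)
    return len(plain_top - variant_top) / len(plain_top)


# Toy example with 100 individuals (made-up numbers, not study data).
plain = np.arange(100, dtype=float)
variant = plain.copy()
variant[95], variant[96] = -1.0, -2.0    # two individuals leave the top group
variant[10], variant[11] = 200.0, 201.0  # two others enter it
rate = reclassification_rate(plain, variant)
```

Under this definition the rate is 0 when both models select the same top group and 1 when the two groups share no individuals.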
Models containing polygenic risk scores appear to be significantly affected by many common modelling assumptions. Future investigations should therefore consider taking steps to mitigate these assumptions.
Author Summary
An individual’s genetics can increase their risk of experiencing a disease. The exact magnitude of the increased risk is estimated within a statistical model. The traditional model type employed in this process is relatively plain and contains several assumptions. The predicted risk estimates from this plain model may be unnecessarily inaccurate. To test this possibility, we searched the literature for model variations that reduce the assumptions of the plain model, ultimately creating 26 distinct model variations that may improve upon the plain model. Each model variation was fitted with data from the UK Biobank to predict 18 diseases. We found that 15 of the 26 model variations fit the data better than the plain model for a majority of diseases, with goodness of fit measured by Royston’s R2 statistic. Further calculations found that the predictions of the model variations were often significantly more or less accurate than the predictions of the plain model. We believe these results indicate that future investigations of polygenic risk scores should not employ the plain model, as unreliable risk predictions will likely result.
Competing Interest Statement
The authors have declared no competing interest.