## Abstract

Polygenic risk scores quantify the individual genetic predisposition regarding a particular trait. We propose and illustrate the application of existing statistical learning methods to derive sparser models for genome-wide data with a polygenic signal. Our approach is based on three consecutive steps. First, potentially informative loci are identified by a marginal screening approach. Then, fine-mapping is independently applied for blocks of variants in linkage disequilibrium, where informative variants are retrieved by using variable selection methods including boosting with probing and stochastic searches with the Adaptive Subspace method. Finally, joint prediction models with the selected variants are derived using statistical boosting. In contrast to alternative approaches relying on univariate summary statistics from genome-wide association studies, our three-step approach enables the selection and fitting of multivariable regression models on large-scale genotype data. Based on UK Biobank data, we develop prediction models for LDL-cholesterol as a continuous trait. Additionally, we consider a recent scalable algorithm for the Lasso. Results show that statistical learning approaches based on fine-mapping of genetic signals achieve competitive prediction performance compared to classical polygenic risk approaches, while yielding sparser risk models that tend to be more robust regarding deviations from the target population.

## 1 Introduction

Polygenic risk scores (PRS) represent a quantification of the individual genetic predisposition for a particular phenotype (e.g., a clinical trait). PRS have already been developed for a large variety of human traits and have been integrated into prediction models for common diseases together with traditional risk factors and clinical variables (Khera et al., 2018; Kachuri et al., 2020).

Polygenic modelling has to deal with two major issues: the high-dimensionality of the genetic signal (typically characterized by millions of common variants) and the presence of high correlations among variants (i.e., linkage disequilibrium LD) within the genome (Ardlie et al., 2002). Therefore, a classical approach to PRS is based on the cumulative marginal effects from simple univariate regression models for individual variants, incorporating a variant selection filter according to the level of LD and to the level of significance with respect to the considered phenotype. This traditional method — usually referred to as clumping plus thresholding (C+T) — is still one of the most popular approaches given its computational efficiency (Euesden et al., 2015). More refined methods adapt penalized regression methods for summary statistics from genome-wide association studies (GWAS), including lassosum (Mak et al., 2017) implementing Lasso regression and penRegSum (Pattee & Pan, 2020) considering an elastic net penalty. Alternatively, Bayesian approaches can be employed to induce shrinkage of effect estimates, such as LDpred (Vilhjálmsson et al., 2015) and PRS-CS (Ge et al., 2019).

All these methods are based on summary statistics from GWAS, typically incorporating a particular way of external reference LD-panel matching to account for LD (Vilhjálmsson et al., 2015). While such approaches facilitate the exchange of research data and the computational feasibility, particularly regarding memory issues, a major limitation of using summary statistics is that the resulting polygenic scores do not fully utilize the joint information of the variants regarding the phenotype of interest. Furthermore, while transethnic GWAS indicate that causal variants are mainly shared across populations (Li & Keating, 2014), the poor generalization of classical PRS models to different populations is largely driven by different allele frequencies and LD-patterns across populations (Wang et al., 2020). Therefore, it can be hypothesized that PRS based only on the most informative (“fine-mapped”) variants may be less sensitive to LD differences among populations (Weissbrod et al., 2020).

From a statistical point of view, a more effective approach would be to apply modern multivariable regression models directly on the original genotype data, instead of relying on univariate summary statistics. Multivariable regression models enable the joint estimation of associations and could lead to improved fine-mapping (i.e., the identification of potentially causal variants; Benner et al., 2016). Although there exists a huge literature on statistical prediction models for high-dimensional data (a large number of predictors *p* relative to the sample size *n*), their application for PRS is often hindered by computational aspects and memory restrictions due to the size of the datasets (large *p* and large *n*). Recently, Qian et al. (2020) proposed a so-called batch screening iterative Lasso (BASIL) algorithm to apply penalized regression for large GWAS cohorts like UK Biobank, showing that one can compute the full Lasso solution path for the original genotype data by successively solving lower-dimensional sub-problems without having to load the full dataset into memory.

In this work we propose and illustrate an alternative technique to allow for the application of existing statistical learning methods for multivariable regression on original genotype data, by exploiting the characteristic correlation structure between the variants in the genome. In particular, variants within the same genomic region are likely to be correlated due to LD, while variants in distant genomic regions can be considered as independent, implying that variable selection methods can focus on the identification of the most informative variants within each region of high LD. Thus, in the proposed approach we first split the genome into independent LD blocks (chunks), then apply modern variable selection techniques on relevant blocks to fine-map the genetic signal regarding the particular phenotype, and finally combine the selected variants to fit a joint prediction model using statistical boosting. Particularly, we illustrate our approach with several different statistical learning methods for variant selection (fine-mapping) in the independent blocks. As the first alternative, we make use of a stochastic search algorithm called the Adaptive Subspace (AdaSub) method, aiming to select the best model according to an *ℓ*_{0}-type information criterion by adaptively solving lower-dimensional sub-problems (Staerk et al., 2021). As the second alternative, we consider statistical boosting in combination with probing (Thomas et al., 2017), yielding automatic stopping and sparse prediction models. Finally, we use the recently proposed BASIL algorithm (Qian et al., 2020) to compute Lasso estimates based on chunk-based pre-filtered variants and compare it to the direct application of the Lasso without pre-filtering.

We apply the proposed approach on UK Biobank data with *n* = 487,410 samples and *p* = 9,812,717 variants, considering LDL-cholesterol as a continuous trait characterized by a polygenic architecture. Our results suggest that the statistical learning methods AdaSub and probing yield sparser and more interpretable polygenic prediction models, while showing a competitive prediction performance on independent test data compared to classical C+T and LDpred approaches based on summary statistics. The direct application of the Lasso on full genotype data yields the best prediction performance with considerably larger numbers of selected variants compared to the sparse fine-mapped models via AdaSub and boosting with probing. However, we find that the Lasso still yields similar prediction accuracy with fewer selected variants when applied only on chunk-based pre-filtered variants. Furthermore, the evaluation of the final scores on data from other populations indicates potential benefits regarding the generalizability of the developed fine-mapped models.

## 2 Methods

### 2.1 Data preprocessing

This research has been conducted using the UK Biobank Resource under Application Number 81202. The UK Biobank (Bycroft et al., 2018) is a large-scale cohort study covering a huge prospective sample (*n >* 500,000) of the British general population, including both genotype as well as phenotype (health-related outcomes) data. In this work we focus on LDL-cholesterol as the phenotype of interest. Only individuals with both self-reported white British origin (Field 21000) and Caucasian origin according to the principal components provided by UK Biobank (Field 22006) were used in the training cohort. Variants with a genotyping rate below 99% or with a minor allele frequency (MAF) *<* 1% were removed. Variants not in Hardy-Weinberg equilibrium (p-value *<* 10^{−6}) were also excluded.

All data preprocessing steps were performed with PLINK 1.9 (Chang et al., 2015). The models were trained considering imputed dosages provided by UK Biobank after filtering for MAF *>* 0.01 and post-imputation info *>* 0.8. UK Biobank is enriched for related individuals; specifically, 147,731 individuals were inferred to be related up to third degree or closer (Bycroft et al., 2018). We applied no filter on relatedness, as removing related individuals would have led to a nontrivial decrease of the sample size. Previous work has shown that sample filtering according to the coefficients of relationship (kinship) leads to similar PRS associations (Lello et al., 2018).

Since one of the aims of the present work was the evaluation of the generalizability of the derived PRS, no population-based filter was applied for the test dataset. In total, 318,258 samples were included in the training cohort and 150,521 samples were considered as an independent test cohort. In order to reduce the memory demand and allow for parallelization, we split the genome into independent LD blocks (i.e., chunks) according to the ldetect method as described in Berisa & Pickrell (2016). Specifically, we considered the 1,703 blocks in autosomal chromosomes identified by ldetect considering the European population of the 1000 Genomes Project. As the proposed approaches are based on fine-mapping of the significant regions, a univariate linear-regression association test was first performed between the variant dosages and LDL-cholesterol (Field 30780). The chunks with at least one genome-wide significant association (i.e., p-value of association lower than 5 · 10^{−8}) were then further processed to fine-map the genetic signal. Within each chunk we additionally filtered for suggestively significant variants (*p <* 10^{−5}), compare also Fan & Lv (2008) and Hoffman et al. (2013).
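This chunk-based screening step can be sketched as follows. The sketch below is purely illustrative (it uses a normal approximation to the univariate t-test, which is adequate at biobank-scale *n*) and is not the PLINK-based pipeline used in our analysis; the thresholds mirror those stated above.

```python
import math

import numpy as np

def marginal_pvalues(y, X):
    """Univariate linear-regression association tests for each variant.

    y : (n,) phenotype vector; X : (n, p) dosage matrix.
    Returns a two-sided p-value per variant (normal approximation).
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)              # center dosages
    yc = y - y.mean()                    # center phenotype
    ssx = (Xc ** 2).sum(axis=0)
    beta = Xc.T @ yc / ssx               # univariate slope estimates
    # residual sum of squares and standard error per variant
    rss = np.array([np.sum((yc - beta[j] * Xc[:, j]) ** 2) for j in range(p)])
    se = np.sqrt(rss / (n - 2) / ssx)
    z = beta / se
    return np.array([math.erfc(abs(zj) / math.sqrt(2)) for zj in z])

def screen_chunk(y, X, genome_wide=5e-8, suggestive=1e-5):
    """Keep a chunk only if it contains a genome-wide significant variant;
    within a kept chunk, retain the suggestively significant variants."""
    pv = marginal_pvalues(y, X)
    if pv.min() >= genome_wide:
        return np.array([], dtype=int)
    return np.flatnonzero(pv < suggestive)
```

Chunks for which `screen_chunk` returns an empty index set are discarded; the remaining pre-filtered variants enter the fine-mapping step.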

### 2.2 Fine-mapping of variants

#### Stochastic search with AdaSub

As the first fine-mapping approach, we considered *ℓ*_{0}-type information criteria to identify the most informative variants in each of the chunks. Such variable selection criteria provide a natural trade-off between goodness-of-fit and model complexity, by explicitly penalizing the number of variants included in the model. As an important example, the extended Bayesian information criterion (EBIC; J. Chen & Chen, 2008) has been proposed for high-dimensional data situations with many possible covariates (variants) and has been shown to yield variable selection consistency even when the number of covariates *p* exceeds the sample size *n*. In particular, for a subset of variants indexed by *S* ⊆ {1, …, *p*}, the EBIC_{γ} is given by

$$\mathrm{EBIC}_{\gamma}(S) = n \cdot \log\left(\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \hat{\mu} - \sum_{j \in S}\hat{\beta}_j\, x_i^{(j)}\Big)^2\right) + |S| \cdot \big(\log(n) + 2\gamma \log(p)\big), \tag{1}$$

where *y*_{i} ∈ ℝ denotes the observed phenotype and *x*_{i} ∈ {0, 1, 2}^{p} the genotype for subjects *i* = 1, …, *n*. Furthermore, $\hat{\mu}$ denotes the estimated intercept and $\hat{\beta}_S$ the least squares estimate under model *S* (i.e., $\hat{\beta}_j = 0$ for *j* ∉ *S*). The EBIC_{γ} penalizes the number |*S*| of selected variants with the factor log(*n*) + 2*γ* log(*p*), where *n* refers to the sample size of the training data and *p* to the total number of variants in the training data. The additional constant *γ* ∈ [0, 1] controls the induced sparsity, with *γ* = 1 resulting in the sparsest models and *γ* = 0 corresponding to the classical Bayesian information criterion. Based on the EBIC_{γ}, the “optimal” set *Ŝ* of variants is defined as the one which yields the smallest criterion value among all possible sets *S* ⊆ {1, …, *p*} of variants, i.e.

$$\hat{S} = \underset{S \subseteq \{1,\dots,p\}}{\operatorname{argmin}} \; \mathrm{EBIC}_{\gamma}(S). \tag{2}$$
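The EBIC_{γ} criterion described above can be computed directly from a phenotype vector and a dosage matrix; a minimal sketch, not tied to any particular package:

```python
import numpy as np

def ebic(y, X, S, gamma=1.0):
    """EBIC_gamma for the linear model restricted to the variant subset S.

    Computes n * log(RSS_S / n) + |S| * (log(n) + 2 * gamma * log(p)),
    where RSS_S is the residual sum of squares of the least squares fit
    with intercept on the columns of X indexed by S.
    """
    n, p = X.shape
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in S])  # intercept + selected variants
    coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)              # least squares fit
    rss = np.sum((y - Xs @ coef) ** 2)
    return n * np.log(rss / n) + len(S) * (np.log(n) + 2 * gamma * np.log(p))
```

Adding an informative variant to `S` lowers the criterion, while each additional variant pays the penalty log(*n*) + 2*γ* log(*p*).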

Due to the combinatorial nature of the optimization problem (2), it is computationally challenging to find the model minimizing the EBIC_{γ}. Here we make use of the Adaptive Subspace (AdaSub) method (Staerk et al., 2021) as a stochastic search algorithm, which is based on reducing the high-dimensional search problem (2) to low-dimensional sub-problems

$$\hat{S}^{[m]} = \underset{S \subseteq V^{[m]}}{\operatorname{argmin}} \; \mathrm{EBIC}_{\gamma}(S) \tag{3}$$

for smaller random subspaces *V*^{[m]} ⊆ {1, …, *p*}, which can be solved exactly with branch-and-bound algorithms (Furnival & Wilson, 2000). After each iteration *m* of AdaSub, the information from the solved sub-problems (3) is adaptively combined, so that variables *x*^{(j)} which have already proven to be informative in many of the solved sub-problems receive a larger sampling probability

$$p_j^{[m+1]} = \frac{q + K \cdot \sum_{k=1}^{m} 1_{\hat{S}^{[k]}}(j)}{p + K \cdot \sum_{k=1}^{m} 1_{V^{[k]}}(j)}$$

in the subsequent stochastic search (where 1_{S} denotes the indicator function for a set *S*). The initial expected size of the sampled subspaces is given by *q* (with *q* ≪ *p*), while the adaptation rate of the algorithm is controlled by the parameter *K >* 0. Under certain conditions it can be guaranteed that AdaSub converges to the optimal solution of the original problem (Staerk et al., 2021). Particularly in sparse and highly-correlated data situations it has been observed that AdaSub shows favorable variable selection properties in comparison to *ℓ*_{1}-type regularization methods such as the Lasso (Tibshirani, 1996). As high correlations occur frequently among variants within the same genomic region due to linkage disequilibrium, the AdaSub method is a well-suited candidate to identify sparse sets of informative variants explaining the genomic signal for a particular phenotype.
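A compact, self-contained sketch of this search scheme is given below. It is deliberately simplified relative to the reference implementation of Staerk et al. (2021): the low-dimensional sub-problems are solved by exhaustive enumeration instead of branch-and-bound, and oversized subspaces are simply skipped.

```python
import itertools

import numpy as np

def adasub_sketch(y, X, gamma=1.0, q=5, K=100, iters=150, seed=3):
    """Simplified AdaSub-style stochastic search minimizing the EBIC.

    Repeatedly samples a small random subspace V of variants, solves the
    low-dimensional sub-problem within V by exhaustive enumeration, and
    increases the sampling probability of variants that keep being selected.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    penalty = np.log(n) + 2 * gamma * np.log(p)

    def ebic(S):
        Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in S])
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        return n * np.log(np.sum((y - Xs @ coef) ** 2) / n) + len(S) * penalty

    n_sel = np.zeros(p)    # times variant j was selected in a solved sub-problem
    n_drawn = np.zeros(p)  # times variant j was part of a sampled subspace
    best_S, best_val = (), ebic([])
    for _ in range(iters):
        prob = (q + K * n_sel) / (p + K * n_drawn)  # adaptive sampling probabilities
        V = np.flatnonzero(rng.random(p) < prob)
        if len(V) == 0 or len(V) > 12:              # keep exhaustive search feasible
            continue
        n_drawn[V] += 1
        sub_S, sub_val = (), ebic([])               # solve the sub-problem within V
        for r in range(1, len(V) + 1):
            for S in itertools.combinations(V, r):
                val = ebic(list(S))
                if val < sub_val:
                    sub_S, sub_val = S, val
        n_sel[list(sub_S)] += 1                     # reward selected variants
        if sub_val < best_val:
            best_S, best_val = sub_S, sub_val
    return sorted(best_S)
```

The initial sampling probability of each variant is *q*/*p*, so the expected subspace size starts at *q*; repeatedly selected variants are sampled with increasing probability as the search proceeds.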

For each chunk, we independently applied AdaSub using the EBIC_{γ} with constant *γ* = 1 as the selection criterion. As the penalty of the EBIC_{1} incorporates the total number of considered variants, it accounts for the fact that AdaSub is only applied on the pre-filtered chunks and for the resulting multiple testing issue. In AdaSub, we set the initial expected search size to *q* = 5 and the adaptation-rate parameter to *K* = 2000. For each chunk, the best model identified after 10,000 iterations of AdaSub was returned as the set of selected variants.

#### Statistical gradient boosting with probing

As an alternative to the explicit regularization imposed by information criteria, we implemented a statistical boosting algorithm (Mayr et al., 2014) in combination with probing (Thomas et al., 2017) for early stopping, providing implicit regularization and variable selection. The concept of boosting emerged from machine-learning (Freund & Schapire, 1996), where it is often used in combination with trees as base-learners to form a powerful and flexible classifier (T. Chen & Guestrin, 2016). For statistical boosting approaches, univariate regression functions are implemented as base-learners and are iteratively fitted to the current gradients of the loss function, yielding a gradient descent procedure in function space (Bühlmann & Hothorn, 2007).

More formally, the gradient vector *u*^{[m]} at iteration *m* = 1, …, *m*_{stop} is given by the negative first derivative of the loss function *ρ*(*y*_{i}, *η*_{i}) w.r.t. the model *η*, evaluated at the previous iteration *m* − 1, i.e.

$$u^{[m]} = \left( u_i^{[m]} \right)_{i=1,\dots,n} = \left( - \frac{\partial}{\partial \eta}\, \rho(y_i, \eta) \,\Big|_{\eta = \hat{\eta}_i^{[m-1]}} \right)_{i=1,\dots,n}.$$

In the context of linear regression, *η*_{i} represents the linear predictor for observation *i* at iteration *m* − 1:

$$\hat{\eta}_i^{[m-1]} = \hat{\mu}^{[m-1]} + \sum_{j=1}^{p} \hat{\beta}_j^{[m-1]}\, x_i^{(j)}.$$

The gradient vector *u*^{[m]} is then fitted one-by-one to the base-learners *h*_{j}(*x*^{(j)}), *j* = 1, …, *p*, which typically represent univariate regression functions for the different covariates *x*^{(1)}, …, *x*^{(p)}. In case of the *L*_{2} loss given by *ρ*(*y*_{i}, *η*_{i}) = (*y*_{i} − *η*_{i})^{2}, this procedure effectively corresponds to a component-wise re-fitting of residuals by the base-learners. Although all base-learners are fitted to *u*^{[m]}, only the best-fitting base-learner *h*_{j∗}(*x*^{(j∗)}) is selected and its fit is added to the current model via a small step-length (e.g., sl = 0.1):

$$\hat{\eta}^{[m]} = \hat{\eta}^{[m-1]} + \mathrm{sl} \cdot \hat{h}_{j^*}\big(x^{(j^*)}\big).$$

In the framework of classical linear regression with univariate linear base-learners *h*_{j}(*x*^{(j)}) = *μ* + *β*_{j}*x*^{(j)} for *j* = 1, …, *p*, this leads to the simple coefficient update

$$\hat{\beta}_{j^*}^{[m]} = \hat{\beta}_{j^*}^{[m-1]} + \mathrm{sl} \cdot \hat{\beta}_{j^*},$$

so that the final model remains a linear model in the variants.

As in every iteration only the best-fitting base-learner *h*_{j∗} (*x*^{(j∗)}) is selected, variables that have not been included in any of the selected base-learners are effectively excluded from the final model when the algorithm is stopped. As a result, statistical boosting yields multivariable regression models while incorporating automated variable selection for potentially high-dimensional data situations, where classical statistical inference procedures become infeasible (Mayr et al., 2014).

For each chunk, we applied statistical boosting separately using linear models with single variants as base-learners. Instead of tuning the stopping iteration with respect to the prediction accuracy via resampling techniques, we incorporated a probing approach (Wu et al., 2007): for each variant *x*^{(j)}, we additionally included a base-learner with a shadow-variable (probe). Each probe is a randomly permuted sibling of an original variant and, due to the permutation, not associated with the outcome. Once the first base-learner corresponding to one of these probes is (falsely) selected, the boosting procedure is stopped. This combination of statistical boosting with probing has been shown to yield particularly sparse models with low numbers of false positives, shifting the focus of the tuning procedure from prediction accuracy to variable selection (Thomas et al., 2017).
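The combination of component-wise *L*_{2} boosting with probing can be sketched as follows; this is an illustrative re-implementation of the scheme described above, not the production code used in our analysis (the sketch centers the covariates instead of fitting a per-base-learner intercept).

```python
import numpy as np

def l2boost_probing(y, X, sl=0.1, max_iter=1000, seed=7):
    """Component-wise L2 boosting with probing for automatic early stopping.

    A permuted copy (probe) of every column is appended to X; since probes
    are unrelated to the outcome by construction, boosting stops as soon
    as a probe would be selected as the best-fitting base-learner.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                                           # center genotypes
    probes = np.column_stack([rng.permutation(Xc[:, j]) for j in range(p)])
    Z = np.hstack([Xc, probes])                                       # originals + probes
    norms = (Z ** 2).sum(axis=0) + 1e-12
    beta = np.zeros(p)
    resid = y - y.mean()                                              # residuals after intercept fit
    for _ in range(max_iter):
        fits = Z.T @ resid / norms                                    # univariate LS coefficients
        score = fits ** 2 * norms                                     # RSS reduction per base-learner
        j = int(np.argmax(score))
        if j >= p:                                                    # a probe was (falsely) selected: stop
            break
        beta[j] += sl * fits[j]                                       # update via small step-length
        resid -= sl * fits[j] * Z[:, j]
    return beta
```

Variants whose coefficient remains exactly zero were never selected before the stopping criterion was triggered, yielding the sparse selection described above.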

#### Lasso via the BASIL algorithm

Qian et al. (2020) recently proposed a batch screening iterative Lasso (BASIL) algorithm to apply the Lasso (Tibshirani, 1996) directly on large-scale genotype data. The Lasso is defined as a solution to the penalized regression problem

$$\big(\hat{\mu}, \hat{\beta}\big) \in \underset{\mu \in \mathbb{R},\, \beta \in \mathbb{R}^p}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} \Big( y_i - \mu - \sum_{j=1}^{p} \beta_j\, x_i^{(j)} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$
with penalty parameter *λ* ≥ 0 controlling the sparsity and shrinkage of regression coefficients induced by the *ℓ*_{1}-regularization. For large-scale applications with a high memory demand, the BASIL algorithm enables the construction of the full Lasso path (for decreasing *λ*) by iteratively working on smaller adaptive batches of genomic variants. In more detail, the algorithm consists of three repetitive steps: in a screening step, variants which are most correlated with the current residuals are added to an active set of variants. Then, the Lasso is fitted only on the active set for a consecutive range of *λ* values. Finally, the found solutions are checked for validity using the Karush-Kuhn-Tucker conditions, before those three steps are repeated. By checking the validity of each solution, the exact Lasso path is retrieved (Qian et al., 2020). The final Lasso estimate is derived by choosing the penalty parameter *λ* yielding the best prediction performance on a validation set.
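A toy version of this screening loop illustrates the idea. The sketch below is deliberately simplified (plain coordinate descent on a single in-memory λ value); the snpnet implementation of Qian et al. (2020) additionally builds the full λ path and works out-of-memory.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(y, X, lam, iters=500):
    """Coordinate descent for (1/(2n))*||y - Xb||^2 + lam*||b||_1 (X centered)."""
    n, p = X.shape
    b = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n
    r = y - X @ b
    for _ in range(iters):
        for j in range(p):
            r += X[:, j] * b[j]                            # put coordinate j back
            b[j] = soft_threshold(X[:, j] @ r / n, lam) / col_ss[j]
            r -= X[:, j] * b[j]
    return b

def basil_like(y, X, lam, batch=10, max_rounds=20):
    """BASIL-style loop: screen, fit on the active set, verify KKT."""
    n, p = X.shape
    active = np.zeros(p, dtype=bool)
    b = np.zeros(p)
    for _ in range(max_rounds):
        r = y - X @ b
        grad = np.abs(X.T @ r) / n
        grad[active] = -np.inf                             # only screen inactive variants
        active[np.argsort(grad)[-batch:]] = True           # screening step
        b = np.zeros(p)
        b[active] = lasso_cd(y, X[:, active], lam)         # fit on the active set only
        r = y - X @ b                                      # KKT check on left-out variants:
        if np.all(np.abs(X[:, ~active].T @ r) / n <= lam + 1e-6):
            return b                                       # |X_j' r| / n <= lam for all j inactive
        # otherwise: some left-out variants violate the KKT conditions; repeat
    return b
```

Because the Karush-Kuhn-Tucker conditions are verified for all variants outside the active set, the returned solution coincides with the Lasso fitted on all variants, even though only small batches were ever processed.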

### 2.3 Final estimation and comparison to classical PRS

After identifying the informative variants separately for the different chunks using the two fine-mapping approaches AdaSub and boosting with probing based on the training cohort with *n* = 281,843 samples, we combined the respective variants in a PRS by fitting a multivariable regression model on the training cohort via statistical boosting (component-wise *L*_{2}Boosting, Bühlmann & Yu, 2003). This procedure was performed once for the variants selected via stochastic search with AdaSub and once for the variants selected via boosting with probing. In both cases, we re-estimated all coefficients when fitting the complete PRS, so that the estimates take the combined multivariable effects of the selected variants, also across different chunks, into account. In this final estimation step, we let the boosting algorithm converge by fixing a large number of 10,000 iterations. We employed the Lasso using the BASIL algorithm (Qian et al., 2020) for two different sets of variants. First, similarly as in Qian et al. (2020), we applied the Lasso on all genome-wide genotyped variants. Second, similarly as for AdaSub and boosting with probing, we applied the Lasso only on the chunk-based pre-filtered variants (see Section 2.1). For both sets, we first computed the Lasso path using 87.5% of our training cohort for the fitting and the remaining 12.5% as validation data. We then refitted the Lasso on the complete training data for the penalty parameter *λ* corresponding to the highest prediction accuracy on the validation data.

The derived models were also compared with the classical PRS obtained with clumping plus thresholding (C+T), considering the full genome-wide signal (without pre-filtering and fine-mapping of variants). In the PRS derived via C+T, the regression coefficients reflect the univariate (marginal) associations between the allele dosages and the phenotype (as also derived from GWAS summary statistics). Different p-value thresholds were considered (i.e., 25 values equally distributed on the log scale between 5 × 10^{−8} and 0.05), as well as different LD thresholds for clumping (i.e., 8 correlation values equally distributed between 0.1 and 0.8), as implemented in PRSice (Euesden et al., 2015). Moreover, in order to further evaluate the potential influence of model sparsity, we also computed genome-wide PRS based on genotyped variants, assuming a genetic architecture in which all variants are causal. To this aim, we implemented the infinitesimal model of LDpred2 based on the shrinkage of effect sizes according to heritability estimation (Privé et al., 2020). In addition, we also derived a PRS based on continuous shrinkage of effect sizes according to the UKB-EUR reference LD panel as implemented in PRS-CS (Ge et al., 2019). Similarly to the C+T method, both LDpred2 and PRS-CS are based on univariate (marginal) associations which are usually available as summary statistics from GWAS.
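The C+T selection rule itself is simple. The sketch below operates on a precomputed p-value vector and pairwise LD (r²) matrix; it is illustrative only, as PRSice additionally restricts clumping to a physical window around each index variant.

```python
import numpy as np

def clump_threshold(pvals, R2, p_thresh=5e-8, r2_thresh=0.1):
    """Greedy clumping plus thresholding.

    Variants are visited by increasing p-value; each kept index variant
    removes all variants in LD with it (r^2 above r2_thresh), and only
    variants passing the p-value threshold are retained.
    """
    removed = np.zeros(len(pvals), dtype=bool)
    kept = []
    for j in np.argsort(pvals):
        if removed[j] or pvals[j] > p_thresh:
            continue
        kept.append(int(j))
        removed |= R2[j] > r2_thresh        # clump away correlated variants
    return kept
```

The retained index variants are then weighted by their marginal GWAS effect estimates to form the C+T score.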

As LDL blood levels are strongly influenced by lipid-lowering drugs, we adjusted LDL values by a factor of 0.684 in individuals taking statins as estimated in a recent work (Sinnott-Armstrong et al., 2021). The final PRS for individual *i* is then computed as the weighted sum of effect alleles:

$$\mathrm{PRS}_i = \sum_{j \in S} \hat{\beta}_j\, x_i^{(j)},$$

where $\hat{\beta}_j$ is the estimated weight for variant *j* (obtained from the univariate association or derived from the final multivariable models after fine-mapping), $x_i^{(j)}$ is the corresponding genotype for individual *i*, and *S* denotes the respective set of selected (fine-mapped) variants.
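In code, this weighted sum is a single matrix-vector product (a minimal sketch; `weights` holds the estimated coefficients of the selected variants, in the same order as the index set `S`):

```python
import numpy as np

def prs(X, weights, S):
    """PRS_i = sum over j in S of beta_j * x_i^(j) (weighted dosage sum).

    X : (n, p) dosage matrix; weights : (|S|,) estimated coefficients;
    S : list of selected variant indices. Returns one score per individual.
    """
    return X[:, S] @ weights
```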

## 3 Results

### Fine-mapping

The considered polygenic risk approaches for LDL-cholesterol yield substantially different numbers of selected variants in the final models (see Table 1). Here, we specifically focus on the variants selected by the fine-mapping approaches via AdaSub and statistical boosting with probing as well as by the classical genome-wide C+T approach. In particular, the C+T approach selects 1,588 variants (best fitting model obtained with a p-value threshold of 5.25 *·* 10^{−5} and clumping *R*^{2} of 0.1), boosting with probing selects 792 variants and AdaSub with EBIC_{1} leads to only 108 selected variants. In Figure 1, the top significant locus (referring to univariate associations with LDL-cholesterol) is displayed on chromosome 16, highlighting the variants that were selected for the final PRS with the statistical learning methods AdaSub and probing as well as with the classical C+T approach. It can be observed that the statistical learning approaches for fine-mapping tend to select fewer variants than the classical C+T approach. Furthermore, this example illustrates that AdaSub with EBIC_{1} does not necessarily select the variant with the highest marginal association (smallest p-value).

Figure 2 displays the final estimated absolute regression coefficients by chromosome. The top variants (largest absolute coefficients) in all three models are mainly located in genomic loci corresponding to well-known cholesterol-associated genes (the main signal colocalizes with the LDL receptor and APOE genes on chromosome 19; a second peak is present on chromosome 1 at the PCSK9 gene, whose inhibition leads to a decrease of LDL; Sabatine, 2019).

The comparison of the included variants across the considered models shows that several variants were selected by all three approaches (mainly leading variants in significant loci). Concerning the two fine-mapped PRS, 9 and 36 variants were selected in the region surrounding the LDL receptor gene (1 MB upstream and downstream) using AdaSub with EBIC_{1} and boosting with probing, respectively (Supplementary Table 1). Annotation using the variant effect predictor (McLaren et al., 2016) revealed that four out of the nine variants selected by AdaSub were located in regulatory regions and one in a splice region for LDL. Notably, the upstream variant rs12151108 was previously reported in a GWAS on LDL levels in an African population (Gurdasani et al., 2019) as well as in a combined East Asian/European population (Nielsen et al., 2020). In addition to selecting eight out of the nine variants included in the AdaSub model, boosting with probing selected 28 further variants covering 11 genes (Supplementary Table 1).

When comparing the sizes of the estimated coefficients for the selected loci (see Figure 2), one can generally observe a very similar pattern among the three models, while there is a tendency towards smaller (shrunk) regression coefficients for the two PRS that were finally estimated via statistical boosting (see Section 2.3). This is particularly true for the PRS based on variants selected via probing, which may be due to the larger number of selected variants compared to those selected by AdaSub, while using the same number of *m*_{stop} = 10,000 iterations for the final boosting fit. Overall, the two fine-mapping methods via statistical learning identify fewer variants for the PRS than the classical C+T approach, which could also increase the interpretability of the underlying models.

Of note, no fine-mapping takes place when applying LDpred2-inf and PRS-CS, since these methods assume a genetic architecture in which all variants are causal, leading to omnigenic models based on all overlapping variants between the analyzed target data and the reference LD dataset. On the other hand, multivariable Lasso regression via the BASIL algorithm (Qian et al., 2020) yields a final model with 12,492 selected variants based on genotyped data, which is sparser compared to the omnigenic models, but considerably larger compared to the fine-mapped AdaSub and probing models. Compared to the genome-wide analysis, the application of the Lasso based on only the chunk-based pre-filtered variants results in a sparser model with 1,821 selected variants, which is, however, still larger than both the AdaSub and probing models.

### Prediction performance

A sparser PRS model might be advantageous for interpretation; however, an important aim of PRS modelling remains prediction. In order to assess the predictive performance of the differently derived polygenic scores for LDL, we computed the *R*^{2} value given by the squared correlation between the observed and the predicted phenotypes on the European test cohort. The test cohort was composed of the remaining *n* = 120,495 self-reported British individuals with Caucasian genetic origin that were not used in the training cohort.

We considered full models (*M*_{f}) including the estimated PRS:

$$M_f: \quad Y = \alpha + \theta \cdot \mathrm{PRS} + \sum_{k=1}^{10} \gamma_k \cdot \mathrm{PC}_k + \varepsilon,$$

and the corresponding covariate-only model (*M*_{c}):

$$M_c: \quad Y = \alpha + \sum_{k=1}^{10} \gamma_k \cdot \mathrm{PC}_k + \varepsilon.$$

Here, the variable *Y* denotes the outcome (LDL level as a continuous phenotype) and PC_{k} represents the *k*-th principal component for *k* = 1, …, 10. The *R*^{2} attributable to the PRS (partial *R*^{2}) is defined as the difference between the *R*^{2} of the full model *M*_{f} and the *R*^{2} of the covariate-only model *M*_{c}.
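The partial *R*^{2} computation can be sketched as follows, with both models fitted by ordinary least squares (illustrative only):

```python
import numpy as np

def r2(y, yhat):
    """Squared correlation between observed and predicted phenotype."""
    return np.corrcoef(y, yhat)[0, 1] ** 2

def partial_r2(y, covars, prs_values):
    """R^2 attributable to the PRS: R^2 of the full model M_f minus
    R^2 of the covariate-only model M_c.

    covars : (n, k) covariate matrix (e.g., the first ten PCs);
    prs_values : (n,) polygenic score per individual.
    """
    n = len(y)
    Xc = np.column_stack([np.ones(n), covars])   # covariate-only design
    Xf = np.column_stack([Xc, prs_values])       # full design including the PRS
    fit = lambda A: A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return r2(y, fit(Xf)) - r2(y, fit(Xc))
```

Since *M*_{c} is nested in *M*_{f}, the partial *R*^{2} is non-negative on the data used for fitting and quantifies the predictive contribution of the PRS beyond the covariates.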

There is large heterogeneity both in the sparsity of the final polygenic models *M*_{f} and in the prediction performance reached in terms of *R*^{2} on the European population (see Table 1). Overall, comparisons with the covariate-only model *M*_{c} suggest that the prediction accuracy for LDL-cholesterol is largely driven by the genetic predisposition as summarized by the PRS. In particular, the sparse multivariable AdaSub and probing approaches based on chunk-based fine-mapping yield a very competitive prediction performance compared to the univariate C+T and LDpred2 approaches based on summary statistics, which include more variants in their final models. In this situation, the non-sparse PRS-CS model including all considered variants yields improved predictions, showing that methods based on summary statistics can also outperform chunk-based fine-mapping approaches when considering the full genome-wide signal.

In our analysis, multivariable Lasso regression applied on all genotyped variants reached the best prediction performance on the European population, further indicating that multivariable regression models are favorable compared to methods based on univariate summary statistics. As the Lasso selects substantially larger numbers of variants than the fine-mapping approaches via AdaSub and probing, this illustrates that the identification of sparser models based on the most informative variants can yield lower prediction performance compared to larger models including many variants with smaller effect sizes. However, compared to the direct application of the Lasso on all genotyped variants, the Lasso still yields very similar prediction accuracy with fewer selected variants when applied only on the chunk-based pre-filtered variants. Overall, the competitive performances that are obtained using smaller numbers of variants suggest that fine-mapping approaches via multivariable statistical learning are able to detect the most predictive variants, which can then be further analyzed for biological interpretation.

### Generalizability

One could hypothesize that sparser PRS might be more robust towards deviations from the target population. To test the generalizability of the derived PRS models, we considered the complete UK Biobank data as the test cohort after removing the samples used in the training dataset. As in similar works, to adjust for population stratification we first fit a linear regression model considering the first ten principal components (PCs) in the training dataset to predict the PRS in the test cohort based only on the genetic ancestry (Khera et al., 2019; Fahed et al., 2020). The residualized PRS are then used to fit the full model with covariates. The prediction performance was evaluated by splitting the complete test cohort into different ethnic groups according to the individual genetic background. The estimation of the genetic ethnicity via PC projection with respect to 1000 Genomes Project samples was performed using the bigsnpr package (Privé et al., 2018). Samples were assigned to one of the five 1000 Genomes Project superpopulations (European: EUR, African: AFR, East Asian: EAS, South Asian: SAS, American: AMR) according to the Euclidean distance in the 10-PC space with respect to the population centers, as described in Privé (2020). Only cohorts with more than 1,000 individuals were included in the analysis (i.e., the AMR superpopulation, which included only 230 samples, was excluded).

The *R*^{2} values for the different PRS as well as for the covariate-only model on all considered populations are reported in Table 1 (13,585 samples were excluded from this analysis because they did not cluster to any superpopulation, likely due to a partially mixed ancestry). Overall, one can observe that the highest total *R*^{2} values are reached on the European population (after adjusting for population stratification).

On the other hand, all PRS models performed substantially worse in non-European populations, though to differing extents. In particular, the sparse fine-mapped AdaSub and probing models tend to show a smaller reduction in prediction performance in out-of-target populations compared to the univariate PRS approaches: for example, for the African population, the three univariate PRS approaches C+T, LDpred2 and PRS-CS achieve only between 35% and 39.5% of their prediction performance on the European population, while the fine-mapping approaches AdaSub and probing retain 59.1% and 72.7% of their respective performance on the European population. Furthermore, the Lasso model without pre-filtering appears to be less robust regarding the generalizability on different populations compared to the sparser Lasso model based on chunk-based pre-filtered variants: for example, for the African population, the genome-wide Lasso model achieves a relative prediction performance of 50.8% compared to its performance on the European population, while the sparser Lasso model based on pre-filtering yields a relative performance of 83.2%.
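These relative performances are simply ratios of *R*^{2} values. As a small worked example (the absolute *R*^{2} values below are hypothetical placeholders chosen only to reproduce two of the quoted percentages, not the actual Table 1 entries):

```python
# Hypothetical R^2 values (placeholders, not the actual Table 1 entries)
r2_eur = {"C+T": 0.060, "AdaSub": 0.055}   # European population
r2_afr = {"C+T": 0.021, "AdaSub": 0.0325}  # African population

# Relative performance: R^2 in the target population divided by R^2 in EUR
relative = {m: r2_afr[m] / r2_eur[m] for m in r2_eur}
# e.g. C+T retains 35.0% and AdaSub about 59.1% of the European performance
```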

Despite these indicative results, the general relationship between sparsity and generalizability remains unclear across the different approaches implemented to model the polygenic architecture of LDL. Our results suggest that further analyses are required to investigate the hypothesis that sparser and more carefully fine-mapped models tend to be more robust with respect to LD differences among populations (Weissbrod et al., 2020).

## 4 Discussion

In this work we have proposed and illustrated the application of existing statistical learning approaches for sparser fine-mapped polygenic risk models. These methods take advantage of the full genotype data and incorporate modern statistical modelling approaches as well as data-driven variable selection strategies in the fitting of PRS.

PRS are usually constructed via estimated effects from simple linear models representing the cumulative univariate/marginal effects of many common variants from GWAS (Choi et al., 2020; Wand et al., 2021). One of the major methodological limitations of relying only on univariate summary statistics is that the joint information and inter-dependence of effects of multiple variants cannot be fully assessed. Our proposed approaches, in contrast, directly apply multivariable regression models on the actual genotype data. For genetic fine-mapping, we consider the recently developed stochastic search method AdaSub (Staerk et al., 2021) as well as statistical gradient boosting (Bühlmann & Hothorn, 2007). We employ these methods to yield particularly sparse models by incorporating the *ℓ*_{0}-penalized EBIC_{1} (J. Chen & Chen, 2008) for AdaSub and by enforcing early stopping via probing for the statistical boosting algorithm (Thomas et al., 2017). To overcome the high computational burden and memory demand of applying these existing methods on large-scale data (large *p* and large *n*), we first split the genome into independent LD blocks incorporating a chunk-based screening approach and then identify the most informative variants for the phenotype inside these chunks. Afterwards, the selected variants are combined into a final multivariable regression model, again fitted by a statistical boosting algorithm. Additionally, we consider the recent BASIL algorithm (Qian et al., 2020) to fit the *ℓ*_{1}-penalized Lasso (Tibshirani, 1996) directly on the full genotype data as well as on chunk-based pre-filtered variants to further encourage the sparsity of the resulting model.
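The three-step structure can be sketched on synthetic data as follows (a toy Python illustration only: a plain Lasso stands in for the AdaSub/probing fine-mapping step, ordinary least squares for the final boosting fit, and contiguous variant blocks for the LD-based chunks):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)
n, p = 300, 200
X = rng.normal(size=(n, p))           # synthetic "genotypes"
beta = np.zeros(p)
beta[[5, 50, 120]] = [1.0, -0.8, 0.6]  # three informative variants
y = X @ beta + rng.normal(size=n)

# Step 1: marginal screening -- keep variants whose absolute
# univariate correlation with the phenotype exceeds a threshold
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
screened = np.where(corr > 0.1)[0]

# Step 2: fine-mapping within chunks (contiguous blocks here, as a
# stand-in for LD blocks), selecting variants via a sparse fit per chunk
selected = []
chunk_size = 50
for start in range(0, p, chunk_size):
    chunk = [j for j in screened if start <= j < start + chunk_size]
    if not chunk:
        continue
    fit = Lasso(alpha=0.1).fit(X[:, chunk], y)
    selected += [j for j, b in zip(chunk, fit.coef_) if b != 0]

# Step 3: joint multivariable model on the selected variants
# (ordinary least squares here, standing in for statistical boosting)
final = LinearRegression().fit(X[:, selected], y)
```

In the actual pipeline, step 2 operates on LD-derived chunks and uses AdaSub with the EBIC_{1} or boosting with probing instead of the Lasso shown here.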

The proposed statistical learning methods based on AdaSub and probing are able to yield sparser and hence more interpretable multivariable risk models than classical methods based on univariate summary statistics. Even though the final models fitted via statistical boosting after fine-mapping with AdaSub and probing were not optimized for prediction performance, they showed a competitive performance on UK Biobank data regarding the prediction of LDL-cholesterol compared to classical PRS methods based on summary statistics such as C+T and LDpred. With the sparse, fine-mapped AdaSub and probing models, around 5%-6% of the variability of LDL-cholesterol in our test cohort could be explained via genetic predisposition, after adjusting for age, sex and population stratification. These results are in line with the *R*^{2} = 0.054 obtained in the FinnGen cohort with a much larger PRS model including around 6M variants (Ripatti et al., 2020).

The Lasso fitted on full genotype data via the recently proposed BASIL algorithm (Qian et al., 2020) yields even better prediction performance on test data, which further demonstrates that the estimation and selection of variants in multivariable regression models is favorable compared to methods based on univariate summary statistics. However, the Lasso selected considerably more variants in the polygenic score for LDL-cholesterol compared to the fine-mapping approaches via AdaSub and probing. We further demonstrated that the number of selected variants in the Lasso PRS model can be substantially reduced when the Lasso is applied on the chunk-based pre-filtered variants, resulting in a sparser model with still very competitive prediction performance.
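The effect of pre-filtering on model size can be illustrated with a plain Lasso (here scikit-learn's `Lasso` on synthetic data stands in for the BASIL algorithm, and a simple marginal covariance screen stands in for the chunk-based pre-filtering):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 300, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                         # five informative variants
y = X @ beta + rng.normal(size=n)

# Lasso on all variants (stand-in for the genome-wide fit)
full = Lasso(alpha=0.05).fit(X, y)

# Lasso on marginally pre-filtered variants only
# (stand-in for the chunk-based pre-filtering step)
cov = np.abs(X.T @ (y - y.mean())) / n
kept = np.where(cov > 0.2)[0]
pre = Lasso(alpha=0.05).fit(X[:, kept], y)

n_full = int(np.sum(full.coef_ != 0))  # variants selected without filtering
n_pre = int(np.sum(pre.coef_ != 0))    # variants selected with filtering
```

In this toy setting the pre-filtered fit selects no more variants than the genome-wide fit while still retaining the informative ones, mirroring the behavior reported above.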

Additionally, results of the sparser scores tended to be more robust when applied to different populations. The dependency of PRS performance on ancestry (Curtis, 2018) and the inherent disadvantage for individuals from populations with less genetic data (Duncan et al., 2019) or with multi-ethnic origin are urgent practical and ethical problems for the application of PRS in clinical practice (Lewis & Green, 2021). While our proposed techniques might not be the ultimate solution to these problems, further research is warranted on robust methods for developing PRS models in the presence of multi-ethnic populations and for incorporating scores in distant populations (Grinde et al., 2019; Ji et al., 2021). Limitations of our approach include that the computational burden is still relatively high and that high-performance computing clusters are necessary to apply the proposed techniques on large cohorts. Furthermore, the considered fine-mapping methods are only feasible with access to full genotype data, which are associated with a higher memory demand and lower availability compared to summary statistics that are used for classical techniques. However, it can be expected that with the development of large population-based cohorts, such as UK Biobank (Bycroft et al., 2018), FinnGen (Locke et al., 2019) or Biobank Japan (Matoba et al., 2020), researchers will increasingly have access to full genotype data. Furthermore, this work focused on fine-mapping of variants in linkage disequilibrium, aiming to select only the most informative variants for LDL-cholesterol. Yet, in general, polygenic risk models with the best prediction performance may not be sparse, depending on the considered phenotype (cf. Qian et al., 2020; Tanigawa et al., 2022). Future research should be targeted at adapting the considered methods when the main focus is not on the selection of variants but on the prediction performance.

Our three-step PRS approach, consisting of (1) screening, (2) fine-mapping (variant selection) and (3) final estimation via statistical boosting, has similarities with the recent batch screening approach for the Lasso by Qian et al. (2020), which we also incorporated into our framework as an alternative selection and multivariable estimation method after the initial screening. An important feature of this work is that we specifically aimed at fine-mapping in chunks of highly-correlated variants to obtain particularly sparse PRS models. In contrast, the Lasso approach of Qian et al. (2020) employs variant selection and estimation in a single step derived from the *ℓ*_{1}-regularized optimization problem based on all genotyped variants. As a consequence, while our fine-mapping approaches via AdaSub and boosting with probing yield sparser PRS models, the joint estimation and selection of the Lasso yields improved predictions in the considered situation. Nevertheless, it has been shown that, in many practical situations, statistical boosting with implicit penalization can yield very similar coefficient paths as the direct penalization in the Lasso (Hepp et al., 2016), indicating that there may be room for improved predictions when considering a direct application of boosting without chunk-based fine-mapping via AdaSub and probing.

In future research we want to extend the considered boosting and stochastic search methods for their efficient application on large-scale genotype data without considering pre-filtered chunks for fine-mapping, aiming for an improved prediction performance at the potential cost of less sparse models. In this context, a main advantage of statistical boosting compared to the Lasso is the modular nature of the algorithm (Mayr et al., 2014; Hofner et al., 2014): any base-learner can be easily combined with any type of objective function. This leads to vast possibilities to extend the algorithm to further advanced statistical modelling techniques. In the current work, we have focused on the most classical combination of linear base-learners with the *L*_{2} loss (Bühlmann & Yu, 2003), leading to linear regression models. Future research will be focused on considering other combinations of base-learners and loss functions. The most obvious choices are loss functions leading to logistic regression (for binary traits) and time-to-event models (for time-to-event traits). However, also the application of more robust loss functions (e.g., *L*_{1}) or objective functions that might be better suited to identify patients at particularly high risk (e.g., the check-function leading to quantile regression) might be promising.
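For concreteness, the classical combination of componentwise linear base-learners with the *L*_{2} loss can be implemented in a few lines (a minimal Python sketch assuming mean-centered predictors; not the implementation used in this work):

```python
import numpy as np

def l2_boost(X, y, n_steps=200, nu=0.1):
    """Componentwise L2-boosting with linear base-learners: each step
    fits every single (centered) predictor to the current residuals by
    least squares and updates only the best-fitting component."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()
    resid = y - intercept
    ssx = (X ** 2).sum(axis=0)       # per-predictor sums of squares
    for _ in range(n_steps):
        b = X.T @ resid / ssx        # univariate least-squares fits
        j = np.argmax(b ** 2 * ssx)  # base-learner reducing RSS the most
        coef[j] += nu * b[j]         # shrunken update (learning rate nu)
        resid -= nu * b[j] * X[:, j]
    return intercept, coef

# Toy example: two informative predictors out of twenty
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 20))
X -= X.mean(axis=0)                  # base-learners assume centered X
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=200)
intercept, coef = l2_boost(X, y)
```

Sparsity arises from stopping the loop early (small `n_steps`); in our pipeline the stopping iteration is determined via probing. Swapping the residual computation for the negative gradient of another loss (e.g., the *L*_{1} loss or the check-function) yields the extensions discussed above.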