## Abstract

Polygenic scores (PGS) are individual-level measures that quantify the genetic contribution to a given trait. PGS have predominantly been developed using European ancestry samples, and recent studies have shown that the predictive performance of European ancestry-derived PGS is lower in non-European ancestry samples, reflecting differences in linkage disequilibrium, variant frequencies, and variant effects across populations. However, how best to maximize performance within any one ancestry group given the data available, and the extent to which this varies between traits, are largely unexplored. Here, we investigate the effect of sample size and ancestry composition on the predictive performance of PGS for fifteen traits in UK Biobank and evaluate an importance reweighting approach that aims to counteract the under-representation of certain groups within training data. We find that, for a minority of the traits, PGS estimated using a relatively small number of Black/Black British individuals outperformed, on a Black/Black British test set, scores estimated using a much larger number of White individuals. For example, a PGS for mean corpuscular volume trained only on Black/Black British individuals achieved a 4-fold improvement over a corresponding PGS trained only on White individuals. For the remainder of the traits, the reverse was true; a PGS for height trained only on Black/Black British individuals explained less than 0.5% of the variance in height in a Black/Black British test set, compared to 3.9% for a PGS trained on a much larger training set consisting only of White individuals. We find that while importance weighting provides moderate benefit for some traits (for example, a 40% improvement for mean corpuscular volume compared to no reweighting), the improvement is modest in most cases, arguing that only targeted collection of data from underrepresented groups can address differences in PGS performance.

## Introduction

Polygenic scores (PGS) are quantitative measures that aim to predict complex traits from genetic data. As well as providing insights into the genetic architecture of complex traits, PGS have considerable clinical potential for screening and prevention strategies [9, 40]. Largely driven by significant increases in sample sizes, the predictive utility of PGS has improved substantially in recent years for a variety of traits [19], including cardiovascular disease [20], breast cancer [22], and Type I diabetes [37].

These improvements, however, have largely been limited to populations of European ancestry [26, 12, 25], reflecting the lack of ethnic diversity in genomic samples collected to date [25]. The lack of transferability of PGS across ancestries may be due to a number of factors, including population differences in allele frequencies and linkage disequilibrium [25]. There is also increasing evidence that the underlying variant effects differ across ancestries [5, 3], which may be due to gene-by-gene or gene-by-environment interactions [2]. This lack of transferability poses one of the most important technical and ethical challenges for the clinical application of PGS, given its potential to have negative impacts on health inequalities [25].

There are major, ongoing initiatives to collect genomic data from traditionally under-represented groups, such as H3Africa [11], that aim to address the lack of global genetic diversity in research data. However, it may take many years to collect sufficient data to reduce the disparities in PGS performance. Statistical methods provide an alternative: a short-term, cost-effective and complementary means of mitigating the negative effects of the lack of diversity in genomic datasets, by using modelling techniques to make use of all the existing data while allowing for some differences between groups. There has been growing interest in statistical methods to improve the transferability of PGS, which have thus far focused on GWAS-derived PGS, i.e. PGS based on summary statistics from a genome-wide association study. Grinde et al. [15] suggest using European GWAS results to select variants, then using the non-European GWAS to estimate the variant weights. Marquez-Luna et al. [24] propose a multi-ethnic PGS that combines scores trained separately on different populations. In a related approach, Cavazos & Witte [6] use local ancestry weighting to construct PGS for admixed individuals. These efforts have yielded promising improvements in PGS performance, though there remains a significant gap in predictive accuracy between European and non-European target populations.

Here, we investigate the use of multiple-ancestry datasets, such as population-scale biobanks, to estimate PGS for a range of anthropometric and blood-sample traits, with the explicit objective of improving predictive accuracy for under-represented ancestries. Specifically, we ask whether there are consistent optimal strategies for borrowing information across ancestry groups to maximize prediction accuracy in groups (defined either by self-identified ethnicity or by genetically-inferred ancestry) that have small sample sizes in available resources. For each trait, we construct training sets with varying numbers of individuals from each ancestry to assess the effect of sample size and composition on PGS accuracy, using both simulated and real data from UK Biobank [4]. To counteract the imbalance of ancestries in a multiple-ancestry training set, we also investigate an importance reweighting approach that places more weight on underrepresented ancestries during training. Given the availability of individual-level information in biobanks, we estimate PGS using regularised regression applied to full genotype and covariate data (as opposed to genome-wide association summary statistics), thereby avoiding artefacts that can arise from assumptions about genotype and covariate correlation structure (including linkage disequilibrium) and about GWAS methodology.

Our results show that the effect of sample size and composition on predictive performance is highly variable across traits. For some traits, polygenic scores estimated using a relatively small number of Black/Black British individuals outperformed, on a Black/Black British test set, scores estimated using a much larger number of White individuals. Moreover, adding White individuals to the training set did not always improve performance and in some cases even led to poorer performance. Although importance weighting yields moderate improvement in performance for some traits, we find that sample size is a much more prominent factor, highlighting the limitations of statistical corrections and the importance of collecting more data from diverse sources.

## Results

### Overview of methods

To investigate the effect of sample size and ancestry composition on polygenic scores for different traits in UK Biobank [4], we constructed a range of training sets controlling the number of White and Black/Black British individuals. We considered three types of training sets: a single-ancestry set consisting only of White individuals, a single-ancestry set consisting of Black/Black British individuals, and a dual-ancestry set consisting of both White and Black/Black British individuals.

For each training set, we estimated PGS using L1-regularised regression, also known as the LASSO [39], which has previously been used in the context of genetic risk prediction (see for example [44, 31, 34, 35]). For the dual-ancestry training sets, we also estimated PGS using an importance reweighted LASSO, upweighting the Black/Black British individuals. Following Martin et al. [25], we assessed the predictive performance of a PGS using partial *r*^{2} relative to a covariate-only model. See Materials and Methods for full details.
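The partial *r*^{2} metric can be illustrated with a short sketch. This is a toy example with simulated data (not the study's pipeline): we fit a covariate-only model and a covariates-plus-PGS model by least squares and take the difference in *r*^{2}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + covariates
G = rng.binomial(2, 0.3, size=(n, 50)).astype(float)        # toy genotypes
beta = rng.normal(scale=0.2, size=50)
y = X @ np.array([0.1, 0.5, -0.3, 0.1]) + G @ beta + rng.normal(size=n)

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

def ols_pred(Z, y):
    # Least-squares fit and in-sample prediction
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return Z @ coef

# Partial r^2: gain in r^2 of the covariates+PGS model over covariates alone
pgs = G @ beta  # here the true score, purely for illustration
partial_r2 = r2(y, ols_pred(np.column_stack([X, pgs]), y)) - r2(y, ols_pred(X, y))
```

In practice the score would of course be built from estimated, not true, effects; the metric itself is unchanged.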

### Adding individuals from one ancestry does not always improve PGS performance for a different ancestry

We first set out to examine whether prediction performance in an underrepresented ancestry group can be boosted by including individuals from a different ancestry among the training samples. To begin, we held the number of Black/Black British individuals fixed at approximately 5 000, corresponding to 80% of Black/Black British individuals with non-missing data for the trait in question. We then constructed six training sets, varying the number of White individuals from 0 to just under 50 000, so that the proportion of White individuals in the training set ranged from 0% to 90%. For each training set, we used the unweighted LASSO to generate polygenic scores.

Predictive performance on White individuals improved for each trait as the number of White individuals in the training set increased (Figure 1A). In contrast, predictive performance on Black/Black British individuals only increased markedly for height and platelet count, with the improvement for platelet count appearing to tail off between the two largest training sets. For three traits (mean corpuscular haemoglobin, white blood cell count, mean corpuscular volume), predictive performance worsened as White individuals were added to the training set. For the remainder of the traits, predictive performance stayed mostly constant.

Next, we performed reciprocal analyses in which we held the number of White individuals fixed at approximately 50 000, and instead varied the number of Black/Black British individuals from 0 to approximately 5 000. The predictive performance on Black/Black British individuals increased for five traits: mean corpuscular haemoglobin (MCH), white blood cell count (WBC), platelet count, red blood cell count (RBC), and mean corpuscular volume (MCV), with the largest increase (Δ*r*^{2} = 0.6) observed for MCH (Figure 1B). For the remainder of the traits, predictive performance stayed largely stable. Predictive performance on White individuals also remained largely constant for each of the fifteen traits. These results demonstrate the potential for increasing prediction performance by including samples from a different population, but also high between-trait heterogeneity in response, suggestive of major differences in genetic architecture. This raises the possibility that more flexible schemes, which combine all samples but with reweighting, may enable optimal prediction performance.

### Importance reweighting can provide modest improvement in PGS performance

Next, we set out to evaluate the potential of reweighting through simulation. We generated 50 quantitative traits with cross-ancestry genetic effect correlations *ρ* = 0.5, 0.6, …, 0.9, and for each trait we created five training sets comprising *n*_{EUR} = 1800 randomly selected individuals of European ancestry and *n*_{AFR} = 200, 450, 771, 1200, 1800 randomly selected individuals of African ancestry. We applied the importance reweighted LASSO with *γ* = 0, 0.1, …, 1 to obtain estimates for (*θ*, *β*). See Simulation Study for full details.
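One simple way to obtain variant effects whose correlation across ancestries is *ρ* is to draw per-variant effect pairs from a bivariate normal. This is an illustrative construction under stated assumptions (the study's exact generative model is given in its Simulation Study):

```python
import numpy as np

rng = np.random.default_rng(1)
p, rho, h2 = 1000, 0.7, 0.5  # no. of variants, effect correlation, heritability

# Per-variant effects for the two ancestries drawn from a bivariate normal
# with correlation rho -- one simple way to model partially shared effects.
cov = (h2 / p) * np.array([[1.0, rho], [rho, 1.0]])
effects = rng.multivariate_normal(np.zeros(2), cov, size=p)
beta_eur, beta_afr = effects[:, 0], effects[:, 1]

# Empirical correlation should be close to the target rho
emp_rho = np.corrcoef(beta_eur, beta_afr)[0, 1]
```

Scaling the per-variant variance by *h*^{2}/*p* keeps total genetic variance comparable as *p* changes.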

The predictive performance for African individuals improved substantially as the number of African individuals in the training set increased (Figure 2). For example, when *n*_{AFR} = 200 and the correlation between genetic effects *ρ* was 0.7, the mean proportion of variance explained by the unweighted LASSO (*γ* = 0) was just under 0.2 for the African test sets, compared to 0.4 for the European test sets. While the performance on Europeans did not change markedly as the number of African training-set individuals increased, performance on Africans increased to approximately 0.3 and 0.4 when *n*_{AFR} = 771 and *n*_{AFR} = 1800 respectively. As expected, the discrepancy between the two ancestry groups decreased as the correlation *ρ* between the European genetic effects and the African genetic effects increased.

Importance reweighting generally had a positive impact on predictive performance for African individuals for each set of simulations. The effect on predictive performance depended on the degree of reweighting, quantified by *γ*. As *γ* increased from 0 to 1, predictive performance typically increased up to an optimal value *γ** before decreasing. This reflects a crucial trade-off of importance reweighting in this context: while the bias of African ancestry genetic effect estimates may be lower with more reweighting, the increased variance of the weights results in a lower *effective* sample size.
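The effective-sample-size trade-off can be made concrete with Kish's approximation *n*_{eff} = (Σ*w*)²/Σ*w*², a standard diagnostic used here purely for illustration; the weight scheme below (*w*_{j} ∝ *n*_{j}^{−γ}) is our illustrative parameterisation, not necessarily the paper's exact one:

```python
import numpy as np

def effective_sample_size(w):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Hypothetical gamma-weights w_j proportional to n_j^-gamma for two groups,
# normalised so the weights sum to the total sample size.
n1, n2 = 1800, 200
n_eff = {}
for gamma in (0.0, 0.5, 1.0):
    w1, w2 = n1 ** -gamma, n2 ** -gamma
    w = np.r_[np.full(n1, w1), np.full(n2, w2)]
    w *= (n1 + n2) / w.sum()
    n_eff[gamma] = effective_sample_size(w)
# More reweighting -> more variable weights -> smaller effective sample size
```

With these group sizes, *n*_{eff} falls from 2000 at *γ* = 0 to well under half that at *γ* = 1, illustrating why the optimal *γ** sits strictly between 0 and 1.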

The effect of reweighting was relatively small compared to the effect of increased sample size. For *n*_{AFR} = 200 and *ρ* = 0.7, the optimal amount of reweighting was *γ** = 0.7, with predictive performance of *r*^{2} = 0.21, compared to *r*^{2} = 0.18 for the unweighted LASSO (*γ* = 0). The amount of improvement with reweighting decreased only slightly as *n*_{AFR} increased, and decreased more markedly as the correlation between genetic effects *ρ* increased.

This simulation study illustrates the relative influence of sample size and importance weighting on the transferability of PGS as a function of the correlation of genetic effects between ancestries. Reducing the disparity in the numbers of samples in a multiple-ancestry dataset reduces the gap in predictive performance between the majority ancestry and the minority ancestry. On the other hand, we find that importance weighting has a relatively limited effect on performance, particularly when the imbalance of ancestries is high.

### Optimal ancestry composition of training sets varies among traits

Next, we sought to explore the value of the reweighting approach in empirical data by considering a range of traits in UK Biobank. We considered three training sets: a White training set of approximately 300 000 White individuals, a Black-only training set of approximately 5 000 Black/Black British individuals, and a dual-ancestry training set made up of the Black training set combined with White individuals so that the proportion of Black individuals was 10%. For the dual-ancestry training set, we used the importance weighted LASSO with *γ* = 0, 0.2,…, 1 to construct the PGS, while for the single-ancestry training sets we just considered the standard, unweighted LASSO.

Figure 3A illustrates the predictive performance of the PGS for mean corpuscular volume. For the Black/Black British test set, the PGS estimated using only Black/Black British individuals outperformed the PGS estimated using only White individuals, despite the former sample size being much smaller (*n* = 5 304 versus *n* = 295 774). With no reweighting, the PGS estimated using both White and Black/Black British individuals performed slightly worse than the PGS estimated using only Black/Black British individuals. Importance reweighting yielded a moderate improvement in partial *r*^{2}, with the optimal amount of reweighting at *γ** = 0.6. For the White test set, the PGS estimated using only White individuals performed best, yielding a partial *r*^{2} of 0.18. The PGS estimated using only Black/Black British individuals offered no predictive power on top of a covariate-only model. Importance reweighting on the dual-ancestry training set reduced predictive performance for White individuals, reflecting the reduced effective sample size. The number of variants selected in the White PGS (i.e. those with non-zero effect estimates) was approximately 20 000, compared to approximately 2 100 for the unweighted dual-ancestry PGS and 83 for the Black PGS (Figure S1).

The predictive performance of the polygenic scores for height displayed somewhat different characteristics for the Black/Black British test set (Figure 3B). In this case, the score estimated using only White individuals performed best, followed by the score estimated using both White and Black/Black British individuals. The score estimated using only Black/Black British individuals performed worst, offering very little predictive power over a covariate-only model. Importance reweighting had very little effect, with partial *r*^{2} staying largely constant over the range *γ* = 0, 0.2, 0.4, 0.6 and decreasing slightly for *γ* = 0.8, 1. The number of variants selected in the White PGS was approximately 25 000, compared to approximately 9 000 for the unweighted dual-ancestry PGS and 250 for the Black PGS (Figure S2).

The fifteen traits we considered could be split into two groups according to whether the score estimated using only a relatively small number of Black/Black British individuals outperformed the score estimated using a much larger number of White individuals (Figure 4). The first group consisted of mean corpuscular haemoglobin, white blood cell count, mean corpuscular volume, red blood cell count, and C-reactive protein for which the Black score outperformed the White score. For these traits, importance reweighting had a noticeably positive effect for the dual-ancestry scores but did not significantly outperform the Black scores. For the remaining ten traits in the second group, the White score outperformed the Black score. For these traits, importance reweighting had a much reduced effect on predictive performance.

### Differences in trait architecture explain variable performance by ancestry

Finally, to investigate why optimal training approaches vary across traits, we examined the contribution of variants at different allele frequencies to prediction accuracy. Specifically, we measured the partial *r*^{2} for different subsets of variants grouped by minor allele frequency (MAF) in a given ethnic group. We grouped variants into four sets: ‘rare’ (MAF ≤ 1%), ‘uncommon’ (1% < MAF ≤ 5%), ‘intermediate’ (5% < MAF ≤ 20%), and ‘common’ (MAF > 20%). Given a set of variants *S* and the genotype matrix of the test set *G*, let *G*_{S} denote the submatrix given by the columns of *G* that are in *S*, and define PGS_{S} = *G*_{S}*β̂*_{S} to be the score associated with the set *S*, where *β̂* are the LASSO estimates. Letting *X* denote the test covariate matrix, we defined the partial *r*^{2} of *S* to be the difference in *r*^{2} between a model including both the covariates and PGS_{S} and the covariate-only model.
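The MAF binning can be sketched as follows, on toy genotypes. The cut-offs mirror the four sets described in the text, and `np.digitize` with `right=True` places MAF = 1% in the ‘rare’ bin:

```python
import numpy as np

rng = np.random.default_rng(2)
freqs = rng.uniform(0.001, 0.5, size=300)
G = rng.binomial(2, freqs, size=(500, 300)).astype(float)  # toy test genotypes

# MAF per variant, then the four bins used in the text:
# rare (MAF <= 1%), uncommon (1-5%], intermediate (5-20%], common (> 20%)
freq = G.mean(axis=0) / 2.0
maf = np.minimum(freq, 1.0 - freq)
labels = np.array(["rare", "uncommon", "intermediate", "common"])
bins = labels[np.digitize(maf, [0.01, 0.05, 0.20], right=True)]

# Variant indices for one bin, e.g. the 'common' set S:
common_idx = np.flatnonzero(bins == "common")
G_S = G[:, common_idx]  # submatrix used to form the score PGS_S = G_S @ beta_S
```

The partial *r*^{2} for each bin is then computed exactly as for the full score, comparing covariates-plus-PGS_{S} against covariates alone.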

We note that, because we consider the same set of genotyped variants for each trait, observed differences among traits in the allele frequency composition of their scores likely reflect differences in trait architecture rather than differences in how the studies were performed.

Figure 5 illustrates the contribution to prediction accuracy of different segments of the allele frequency spectrum for the mean corpuscular volume and height PGS. We observe striking differences between these traits in the contribution of different allele frequency segments. For MCV in Black/Black British individuals, the majority of the predictive performance of each polygenic score was due to variants that were intermediate or common in Black/Black British individuals (top row, LHS). Notably, over half the contribution to prediction accuracy of the Black and dual-ancestry PGS could be attributed to variants that have MAF < 5% in the White subgroup (second row, LHS). Conversely, prediction accuracy in the White subgroup was driven by variants that are intermediate or common in White individuals (fourth row, LHS), and approximately one quarter of the accuracy could be attributed to variants with MAF < 5% in Black/Black British individuals.

In contrast, for height we find that predictive performance in White individuals has an allele frequency decomposition with a slightly greater contribution from variants that are common in White individuals (bottom row, RHS), while predictive performance in Black/Black British individuals is dominated by variants that are common (MAF > 5%) in both populations (top two rows, RHS). The effect of reweighting follows a similar pattern: for MCV it shifts predictive power towards variants that are higher-frequency in Black/Black British individuals (and considerably rarer in White individuals), while for height it tends to favour variants that are common in both groups.

These results suggest differences in genetic architecture between the two traits. For mean corpuscular volume, the variance explained by the dual-ancestry PGS and the Black PGS could largely be attributed to variants that were relatively common in Black/Black British individuals but rare in White individuals. Since the number of Black/Black British individuals in these training sets was relatively small, this suggests that the corresponding effect sizes are comparatively large. On the other hand, the variance explained by each of the height PGS could be attributed to variants that were common in both White and Black/Black British individuals. Such differences in trait architecture are indicative of variation in the evolutionary and selective forces that have shaped trait variation [38, 45, 36].

## Discussion

The lack of diversity in human genetic studies has been brought into focus by a number of articles (e.g. [29, 33, 25]), revealing that around 80% of all GWAS participants are of European descent. In the context of polygenic scores, this bias is reflected in the lack of transferability of scores across ancestries: PGS developed using European-ancestry samples tend to perform poorly in non-European test sets [26, 25, 12]. This raises serious technical, clinical and ethical issues, with likely substantial impacts on health disparities and inequalities if left unaddressed [25]. Consistent with this, we found that, for a range of quantitative traits, PGS estimated using individuals self-reporting as White performed relatively poorly on individuals self-reporting as Black/Black British.

Here, we have considered the extent to which PGS performance in underrepresented ethnic groups can be boosted by training on datasets of multiple ancestries, where statistical methods are used to focus weight on the ethnic group in question. The ability to ask such a question has been made possible by the advent of large-scale biobanks, which combine population-scale individual-level genetic data from multiple ancestries with a wide range of phenotypic measurements. Previous studies on the lack of transferability of PGS have generally estimated scores using summary statistics from genome-wide association studies of single-ancestry populations [26, 25, 12]. These summary statistic approaches are often highly efficient computationally and typically achieve highly competitive predictive performance relative to full-genotype approaches [21, 23, 43, 35]. However, combining such data raises complexities, including the assumptions that must be made about the correlation structure of untyped variants and the comparability of phenotype definitions. Utilising a common set of genotyped variants with individual-level data across multiple ancestries, where phenotypic data are (at least approximately) comparable, enables us to focus on the causes of differences in performance and to consider strategies for their mitigation.

Our primary finding is that traits vary substantially in their optimal strategy for combining data across ancestries. Through simulation, we have shown that there are plausible regions of parameter space, notably where effect sizes are correlated across ethnic groups but not identical, in which reweighting strategies can boost prediction performance using a mixed-ancestry training set. However, when applied to empirical data from the UK Biobank, relatively weak benefit from reweighting was observed. Rather, traits fell into two broad categories. For those in the first category (such as mean corpuscular volume), the most valuable training data was the small dataset from the minority group - here Black/Black British - alone (and performance was decreased by the inclusion of any individuals of the majority - here White - ethnicity). For those in the second category (such as height), the best performance in the underrepresented group was achieved by training on the much larger majority group. Although the first category was smaller, it included important biomarkers (for example, CRP), and even when the best strategy was to use the PGS trained in White individuals, the performance of that PGS in Black/Black British individuals was considerably worse than its performance in White individuals.

Our findings highlight the limitations of statistical corrections alone in reducing the disparities in PGS performance between ancestries. There are numerous sources of potential bias that adversely affect the application of an analytical pipeline to individuals of minority and/or underrepresented groups, including problem selection, data collection, outcome definition, model development, and the lack of real-world impact assessments [10]. Statistical solutions must therefore be considered alongside efforts to address the global health research funding gap [42], to diversify the bioinformatics workforce [17], and to assess the impact and translation of analyses to real-world data [30], in addition to efforts to increase the number of non-European participants in genetic research. Each of these partial solutions, in combination, provides an essential contribution to reducing and removing opportunities for negative effects on health inequalities, particularly amongst those from different ethnic backgrounds [32].

It is becoming increasingly clear that the vast disparities in PGS performance can only be bridged by collecting more samples from minority and under-represented groups. Perhaps counter-intuitively, with more diverse data, statistical tools such as importance reweighting may eventually play a more important role as we seek to boost predictive utility by using all the available data. Reweighting strategies have the benefit of overcoming the lack of universal definitions of race, ethnicity and ancestry, which causes considerable confusion and imprecision [27, 28]. The categorisation of individuals into discrete ethnic groups to explain differences in behaviours and exposures may be unhelpful [13], whereas continuous representations of genetic ancestry, such as genetic principal components (Figure 6), enable identification of areas of genetic ancestry where performance is stronger or weaker, as well as of how the trait itself varies across ancestry. Ultimately, approaches to genetic prediction must acknowledge both the many similarities of human biology and the differences in history, cultural heritage, exposure, and behaviour that can lead to certain factors being of greater relevance for particular groups of individuals.

## Materials and Methods

### PGS Estimation using the LASSO

To construct PGS from full genotype data, we use L1-penalised regression, also known as the LASSO [39]. The LASSO has previously been used in the context of genetic prediction (see for example [44, 31, 34, 35] and references therein) and is suitable for high-dimensional problems where one expects the number of non-zero predictors to be small relative to the total number of predictors. Although there exist other full-genotype PGS methods such as linear mixed models (see e.g. [46]), we focus on the LASSO largely for its computational efficiency. We provide an analysis of predictive performance versus sample size, focusing on differences in predictive performance between individuals of European ancestry and individuals of African and Afro-Caribbean ancestry.

We first briefly recap the LASSO algorithm for constructing polygenic scores. Let *n* be the number of individuals in the training set. Denote *X ∈* ℝ^{n×q} to be the matrix of *q* covariates, *G* to be the *n × p* genotype matrix, and *y* to be the *n*-vector of observed phenotype values. We assume a linear relationship between the phenotype and predictors,

*y* = *Xθ* + *Gβ* + *∊*,

where *θ* is a *q*-vector of covariate effects, *β* is a *p*-vector of variant effects, and *∊* is an environmental noise term with mean 0.

The LASSO aims to minimise the objective function,

(1/2*n*) ||*y* − *Xθ* − *Gβ*||_{2}^{2} + λ||*β*||_{1},    (4)

with respect to (*θ, β*), where λ is a regularisation parameter. The purpose of the ||*β*||_{1} term is to penalise large values of |*β*| and thus encourage *sparse* solutions, that is, solutions with a relatively low number of non-zero coefficients. Note that the covariate effects *θ* are not penalised. The choice of λ controls the degree of penalisation, with higher values of λ encouraging smaller values of *β*. To select λ, we use 5-fold (on real data) or 10-fold (on simulated data) cross-validation. We optimise (4) using the R packages `glmnet` [14] and `snpnet` [35] on simulated and real data respectively. The `snpnet` package is an extension of `glmnet` designed to interface directly with the PLINK software [8, 7] to handle large-scale single nucleotide polymorphism datasets such as the genotype data in UK Biobank.
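For intuition, a minimal coordinate-descent LASSO with a crude held-out λ selection can be sketched in Python. This is an illustration only: the study itself uses `glmnet`/`snpnet` in R with proper cross-validation, and the unpenalised covariates are omitted here for brevity.

```python
import numpy as np

def lasso_cd(Z, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Z b||_2^2 + lam * ||b||_1 by coordinate descent."""
    n, p = Z.shape
    b = np.zeros(p)
    col_ms = (Z ** 2).sum(axis=0) / n  # per-column mean square
    r = y.astype(float).copy()         # residual y - Z @ b (b starts at 0)
    for _ in range(n_iter):
        for j in range(p):
            r += Z[:, j] * b[j]        # add coordinate j back into the residual
            rho = Z[:, j] @ r / n
            # soft-thresholding update for coordinate j
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ms[j]
            r -= Z[:, j] * b[j]
    return b

rng = np.random.default_rng(3)
n, p = 300, 100
G = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:5] = 1.0     # sparse true signal
y = G @ beta + rng.normal(size=n)

# Crude held-out selection of lambda (stand-in for the paper's k-fold CV)
tr, va = slice(0, 240), slice(240, 300)
lams = [0.01, 0.05, 0.1, 0.5]
errs = [np.mean((y[va] - G[va] @ lasso_cd(G[tr], y[tr], lam)) ** 2) for lam in lams]
best_lam = lams[int(np.argmin(errs))]
b_hat = lasso_cd(G, y, best_lam)       # refit at the selected lambda
```

The soft-thresholding step is what produces exact zeros, i.e. the sparsity that makes the resulting PGS use only a subset of variants.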

#### Multiple-ancestry Datasets

The above description assumes that the genetic effects *β* are the same for all individuals. While this assumption may be reasonable within a single ancestry or ethnic group, it is unlikely to hold more generally. We therefore consider a hypothetical model that allows genetic effects to vary among individuals. To this end, we introduce a low-dimensional latent variable *z* representing the ancestry of a given individual. For an individual *i*,

*y*_{i} = *X*_{i}*θ*(*z*_{i}) + *G*_{i}*β*(*z*_{i}) + *∊*_{i},    (5)

where (*θ*(·), *β*(·)) are now *functions* parameterised by *z*. The only assumption we make about these functions is one of *smoothness*, so that individuals with similar ancestry have similar covariate and genetic effects. Otherwise, we do not make any explicit statements regarding the form of the functions.

In this paper, we consider the special case with *J* = 2 ancestry groups, such that *z*_{i} ∈ {1, 2}. We write (*θ*(*z*_{i}), *β*(*z*_{i})) = (*θ*^{(j)}, *β*^{(j)}) for an individual *i* in ancestry group *j*. Using superscripts to denote the ancestry group (for example, *y*^{(j)} is the vector of observed phenotypes for group *j*), the model becomes

*y*^{(j)} = *X*^{(j)}*θ*^{(j)} + *G*^{(j)}*β*^{(j)} + *∊*^{(j)}, *j* = 1, 2.    (6)

Denoting *n*_{j} to be the number of individuals of ancestry *j* in the training set, we assume *n*_{1} > *n*_{2} to reflect a European-ancestry dominated training set. Applying the standard LASSO to such a training set, we would expect the estimate to be closer to (*θ*^{(1)}, *β*^{(1)}) than to (*θ*^{(2)}, *β*^{(2)}). All else being equal, this would result in better predictive performance for test individuals in group 1.

#### Importance Reweighting

We return for a moment to the more general model in Equation (5). Suppose we wish to construct a polygenic score for a test set with ancestry distribution *π*^{TE}(*z*). Given a training set of observations from individuals with ancestries *z*_{1}, …, *z*_{n}, how should one obtain a single estimate for (*θ, β*)? Our proposed approach is to use *importance reweighting*, assigning each training individual a weight *w*_{i} that quantifies the ‘similarity’ of the ancestry variable *z*_{i} to the test-set ancestries, with individuals whose ancestry is similar to that of the test set assigned a higher weight. More concretely, assuming an ancestry distribution *π*^{TR}(·) on the training set, we set

*w*_{i} = *π*^{TE}(*z*_{i}) / *π*^{TR}(*z*_{i}).    (7)

Note that we do not have access to the distributions *π*^{TR}, *π*^{TE}, so in practice we must approximate (7). Given weights *w*_{i}, to estimate (*θ, β*) we minimise the following *weighted* LASSO objective function:

(1/2*n*) Σ_{i=1}^{n} *w*_{i}(*y*_{i} − *X*_{i}*θ* − *G*_{i}*β*)^{2} + λ||*β*||_{1}.

For the special case with two ancestry groups (Eqn. 6), we consider importance weights *w*_{1}, *w*_{2} such that *w*_{2} ≥ *w*_{1}, with the aim of improving estimates for group 2, i.e. the underrepresented group. Specifically, we use weights of the form

*w*_{2}/*w*_{1} = (*n*_{1}/*n*_{2})^{γ},

where *γ* ∈ [0, 1] is a hyperparameter that controls the degree of reweighting. We normalised the weights so that *n*_{1}*w*_{1} + *n*_{2}*w*_{2} = *n*_{1} + *n*_{2}. With these weights, we thus minimise the objective function,

(1/2*n*) Σ_{j=1,2} *w*_{j}||*y*^{(j)} − *X*^{(j)}*θ* − *G*^{(j)}*β*||_{2}^{2} + λ||*β*||_{1},

with respect to (*θ, β*).
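The two weighting schemes can be sketched numerically. The group sizes are hypothetical, and the γ parameterisation shown (weight ratio *w*_{2}/*w*_{1} = (*n*_{1}/*n*_{2})^{γ}) is an illustrative implementation rather than a definitive one:

```python
import numpy as np

# Hypothetical two-group training set: group 1 majority, group 2 minority
n1, n2 = 45_000, 5_000
n = n1 + n2

# Pure importance weights w_i = pi_TE(z_i) / pi_TR(z_i) for a test set
# drawn entirely from group 2: group 1 is effectively discarded.
pi_tr = np.array([n1 / n, n2 / n])
pi_te = np.array([0.0, 1.0])
w_pure = pi_te / pi_tr

# Tempered per-group weights with ratio w2/w1 = (n1/n2)^gamma,
# normalised so that n1*w1 + n2*w2 = n.
def gamma_weights(gamma):
    w = np.array([1.0, (n1 / n2) ** gamma])
    return w * n / (n1 * w[0] + n2 * w[1])

w_unweighted = gamma_weights(0.0)  # both groups get weight 1
w_full = gamma_weights(1.0)        # equal total weight per group
```

At *γ* = 1 the two groups contribute equal total weight to the objective, while *γ* = 0 recovers the standard LASSO.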

### UK Biobank

The UK Biobank is a large-scale cohort study with genomic and phenotypic data collected on approximately 500 000 individuals aged 40-69 at the time of recruitment [4]. Participants reported their ethnic backgrounds, with the majority (94%) self-identifying as a subgroup within the broad-level group ‘White’. Approximately 7 000 (1.7%) individuals reported an ethnic background of Black/Black British. For consistency across analyses of UK Biobank data and the simulated data used in this study, we use the terms self-reported ethnicity and ancestry interchangeably.

We investigated a range of quantitative traits that are likely to be influenced by or correlated with a number of polygenic diseases. These included three common anthropometric traits (height, body mass index, waist circumference) and twelve blood-related traits (systolic blood pressure, diastolic blood pressure, C-reactive protein, platelet count, white blood cell count, red blood cell count, haemoglobin concentration, haematocrit percentage, mean corpuscular volume, mean corpuscular haemoglobin, lymphocyte count, and monocyte count). We used genotypes from individuals self-identifying as “White” or “Black/Black British” who satisfied certain quality-control criteria, namely those individuals used in the PCA calculations of [4] and not identified as displaying sex chromosome aneuploidy. To adjust for demographic effects and population structure, we included the following covariates in our models: age, sex, the first ten genetic principal components (PCs), and interactions between sex and each of the ten genetic PCs.
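The covariate design described above (age, sex, ten genetic PCs, and sex-by-PC interactions) can be assembled as in the following minimal numpy sketch; the function and argument names are ours, not from the study's code.

```python
import numpy as np

def build_covariates(age, sex, pcs):
    """Stack age, sex, the first ten genetic PCs, and sex-by-PC interactions.

    age, sex : arrays of shape (n,); pcs : array of shape (n, >=10).
    Returns an (n, 22) design matrix (no intercept column).
    """
    age = np.asarray(age, dtype=float)
    sex = np.asarray(sex, dtype=float)
    pcs = np.asarray(pcs, dtype=float)[:, :10]
    interactions = sex[:, None] * pcs  # sex x PC_1, ..., sex x PC_10
    return np.column_stack([age, sex, pcs, interactions])
```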

#### Construction of Training Sets

To assess the effect of sample size and composition on PGS performance, we used subsets of the above data as training sets, controlling the number of White and Black/Black British individuals. The remaining individuals were then used as a held-out test set. For each trait, we constructed three types of training sets using individuals with non-missing data for that particular trait. The first two types of training sets were single-ancestry datasets, one consisting solely of White individuals and the other consisting solely of Black/Black British individuals.

*White single-ancestry.* This set consisted of a randomly sampled 80% of the quality-controlled White individuals (approximately 300 000 individuals, with 3/4 of these used for training and the remaining 1/4 used as a validation set to select the regularisation parameter *λ*).

*Black single-ancestry.* This set consisted of a randomly sampled 80% of the quality-controlled Black/Black British individuals, corresponding to approximately 5 500 individuals. Given the relatively small number of individuals, we used 5-fold cross-validation to select *λ*.

*Dual-ancestry.* This set consisted of both Black/Black British individuals and White individuals. The basic form of this training set was made up of the Black single-ancestry training set described above, combined with White individuals so that the proportion of Black/Black British individuals was 10%. The White individuals were matched to the Black/Black British individuals on age and sex. We also considered variants of this training set, formed by removing a proportion of either the White individuals or the Black/Black British individuals (see Results for more details).

To each dataset, we applied a minor allele frequency threshold of 0.1% and a missing genotype call rate filter of 5%. Note that, as a result, the sets of variants generally differed between training sets. For the single-ancestry datasets, we used the standard, unweighted LASSO to construct the PGS. For the dual-ancestry datasets, we used the weighted LASSO with *γ* = 0, 0.2, 0.4, 0.6, 0.8, 1; note that *γ* = 0 corresponds to the unweighted LASSO.
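The per-dataset variant filters (minor allele frequency ≥ 0.1%, missing call rate ≤ 5%) can be sketched as follows, assuming a genotype matrix coded 0/1/2 with `np.nan` marking missing calls; this is an illustrative sketch, not the study's pipeline.

```python
import numpy as np

def variant_filter(G, maf_min=0.001, max_missing=0.05):
    """Return a boolean mask of variants passing MAF and call-rate filters.

    G : (n_individuals, n_variants) genotype matrix with entries 0/1/2 or np.nan.
    """
    missing = np.isnan(G)
    call_rate_ok = missing.mean(axis=0) <= max_missing
    # Allele frequency from non-missing calls; fold to the minor allele.
    af = np.nanmean(G, axis=0) / 2.0
    maf = np.minimum(af, 1.0 - af)
    return call_rate_ok & (maf >= maf_min)
```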

#### Predictive Accuracy

We evaluated the predictive accuracy of each PGS using individuals that were not included in the corresponding training sets. To assess the genetic predictive accuracy of a PGS, we calculated the partial *r*^{2} attributable to the PGS, relative to a covariate-only model, following Martin et al. [25]. Specifically, we fit the following nested linear models,

*y* = *Xα* + *ε*, (11)

*y* = *Xα* + *sδ* + *ε*, (12)

where *X* is the covariate matrix and *s* is the vector consisting of polygenic scores for each individual in the test set. We defined the partial *r*^{2} to be the difference in *r*^{2} between models (11) and (12). To obtain a more reliable estimate of predictive accuracy, we performed 5-fold cross-validation: we repeated the process of training set construction and PGS estimation five times to obtain five estimates of partial *r*^{2}, and took the mean across these five folds as our measure of predictive accuracy.
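The partial *r*^{2} calculation amounts to fitting the covariate-only and covariate-plus-PGS models by least squares and differencing their *r*^{2} values. A minimal numpy sketch (function names ours):

```python
import numpy as np

def r_squared(X, y):
    """r^2 of an OLS fit of y on X (intercept added internally)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ coef
    return 1.0 - resid.var() / y.var()

def partial_r2(X, pgs, y):
    """Partial r^2 of the PGS relative to a covariate-only model."""
    return r_squared(np.column_stack([X, pgs]), y) - r_squared(X, y)
```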

### Simulation Study

To evaluate the effect of importance reweighting on PGS performance under various settings, we undertook a simulation study. We used the simulation engine msprime [18] in the standard library of population genetic simulation models, stdpopsim v0.1.2 [1], to generate African-ancestry and European-ancestry genotypes, following a simulation framework similar to that of Martin et al. [26]. For each ancestry group, we generated 200 000 genotypes for chromosome 20, based on a three-population ‘out-of-Africa’ demographic model [16], using a mean mutation rate of 1.29 × 10^{−8} and a recombination map for GRCh37. Note that this model also generates genotypes for East Asian individuals, but we do not use these in our analysis. To reduce the computational burden, we used only the first 10% of the chromosome, applying a minor allele frequency threshold of 0.1% to each ancestry group separately, resulting in a total of 7 563 variants.

We simulated phenotypes from these genotypes assuming a normal linear model,

*y*_{i} = *g*_{i}^{T}*β* + *ε*_{i},  *ε*_{i} ∼ N(0, *σ*^{2}),

where *g*_{i} denotes the genotype vector of individual *i* and *β* the vector of ancestry-specific effect sizes. The noise variance parameter *σ*^{2} controls the SNP heritability for each ancestry group and was chosen to yield an average SNP heritability of 0.5 across the groups. To reflect the sparsity of genetic effects, we randomly selected *p*_{0} = 100 of the *p* = 7 563 variants to be causal. Motivated by Trochet et al. [41], we drew the causal effect sizes from a bivariate normal distribution to model the similarity of genetic effects between ancestries. Denoting *C* ⊂ {1,…, *p*} to be the indices of the causal SNPs, for each *i* ∈ *C* we have

(*β*_{i}^{EUR}, *β*_{i}^{AFR}) ∼ N(0, Σ),

where Σ has diagonal entries *σ*_{β}^{2} and off-diagonal entries *ρσ*_{β}^{2}, and *ρ* is the correlation between the ancestry-specific genetic effects. We set *β*_{i}^{EUR} = *β*_{i}^{AFR} = 0 for *i* ∉ *C*. For each value of *ρ* = 0.5, 0.6,…, 0.9, we repeated the above procedure 50 times to generate 50 quantitative traits.
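The correlated, sparse effect-size draw can be sketched as follows; a minimal numpy sketch in which the per-SNP effect variance `sigma_b` and the function name are our own choices.

```python
import numpy as np

def draw_effects(p, p0, rho, sigma_b=1.0, seed=0):
    """Draw sparse, cross-ancestry-correlated effect sizes.

    Returns (beta_eur, beta_afr): length-p vectors that are zero outside a
    random causal set C of size p0; within C each pair is bivariate normal
    with common variance sigma_b**2 and correlation rho.
    """
    rng = np.random.default_rng(seed)
    causal = rng.choice(p, size=p0, replace=False)
    cov = sigma_b**2 * np.array([[1.0, rho], [rho, 1.0]])
    draws = rng.multivariate_normal(np.zeros(2), cov, size=p0)
    beta_eur = np.zeros(p)
    beta_afr = np.zeros(p)
    beta_eur[causal] = draws[:, 0]
    beta_afr[causal] = draws[:, 1]
    return beta_eur, beta_afr
```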

To investigate the effect of sample size, for each trait we created five training sets with *n*_{EUR} = 1800 randomly selected individuals of European ancestry and *n*_{AFR} = 200, 450, 771, 1200, 1800 randomly selected individuals of African ancestry. For each of these training sets, we computed weights *w*_{EUR} ∝ 1 and *w*_{AFR} ∝ 9^{γ}, where *γ* is a hyperparameter controlling the degree of reweighting. Note that *γ* = 1 corresponds to inverse proportion reweighting when *n*_{AFR} = 200 and *n*_{EUR} = 1800. We normalised the weights to ensure that their sum equalled the total sample size *n*_{AFR} + *n*_{EUR}, in line with the unweighted case.

To assess predictive performance for each ancestry, we constructed an African test set and a European test set by randomly selecting 2 000 individuals from each ancestry out of those not included in the training set. We used the proportion of variance explained, denoted *r*^{2}, as the measure of predictive performance,

*r*^{2} = 1 − Var(*ε̂*) / Var(*y*),

where *ε̂* is the residual from regressing the phenotype on the PGS (with an intercept) and Var(·) denotes the sample variance. Note that this is equivalent to the partial *r*^{2} relative to an intercept-only covariate model. For *ρ* = 0.5, 0.6,…, 0.9, we repeated the above process 20 times with different randomly selected individuals in the training and test sets to calculate a mean *r*^{2} for each trait.

## Acknowledgements

BCLL and CM were supported by the UK Engineering and Physical Sciences Research Council through the Bayes4Health programme [Grant number EP/R018561/1]. BCLL also gratefully acknowledges funding from Jesus College, Oxford. CCH was also supported by The Alan Turing Institute, Health Data Research UK, the Medical Research Council UK, and AI for Science and Government UK Research and Innovation (UKRI). GM was supported by the Wellcome Trust [Grant Number 100956/Z/13/Z] and the Li Ka Shing Foundation. This research has been conducted using data from UK Biobank, a major biomedical database, under application 12788. The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health.

## Footnotes

↵† Joint supervision