## ABSTRACT

We present a theoretical framework for genomic prediction accuracy when the reference data consists of information sources with varying degrees of relationship to the target individuals. A reference set can contain both close and distant relatives as well as ‘unrelated’ individuals from the wider population, assuming they all come from the same homogeneous population. The various sources of information were modeled as different populations with different effective population sizes (*N _{e}*). With a similar amount of data available for each source, we show that close relatives can have a substantially larger effect on genomic prediction accuracy than lesser related individuals. However, the number of individuals from the wider population can be far greater than that of close relatives. We validate our theory with analysis of real data, and illustrate that the variation in genomic relationships with the target, rather than the variation in genomic relationship as a deviation for the expected relationship, is a predictor of the information content of the reference set and information from pedigree relationships is then naturally included in the prediction framework. Both the effective number of chromosome segments (

*M*) and

_{e}*N*are considered to be a function of the data used for prediction rather than being population parameters. We illustrate that when prediction also relies on closer relatives, there is less improvement in prediction accuracy with an increase in training data or marker panel density. We release software that can estimate the expected prediction accuracy and power when combining different reference sources with various degrees of relationship to the target, which is useful when planning genomic prediction (i.e. before collecting data) in animal, plant and human genetics.

_{e}## INTRODUCTION

Genomic prediction of (additive) genetic effects and phenotypes is emerging in a wide range of fields including animal and plant breeding, risk prediction in human medicine and forensics^{1–4}. Genomic prediction requires modeling of the association between genome-wide single nucleotide polymorphisms (SNPs) and phenotypes. The success of genomic prediction is measured by its accuracy, i.e. how reliable a future phenotype of target individuals can be predicted.

Genomic prediction requires a reference population of individuals having information on both genotype and phenotype. The accuracy of genomic prediction depends on various parameters, including sample size of the reference and its genetic structure. An important parameter in relation to the latter is the effective size of the population. The effective population size is a predictor of the effective number of chromosome segments that are represented in the population^{5–7}. Theoretical predictions have usually considered a homogeneous population of individuals that are essentially unrelated. However, in most practical applications, the reference population used for genomic predictions possibly consists of many sub-groups with individuals having a variety of relatedness to the target individual, e.g. direct relatives, more distant relatives, and individuals that are considered unrelated, but still part of the same population as the target individual. It is relevant to assess the contribution of these various sources to prediction accuracy before actually conducting an experiment.

A number of studies have shown that genomic predictions are more accurate if the genomic relationship between the proband and the reference population is higher, both in humans^{8–11} and in other species^{12–14}. Habier et al (2013)^{15} distinguished between three types of information in genomic prediction; linkage disequilibrium, additive-genetic relationships and co-segregation of QTL predicted from markers genotypes with a pedigree. They argued that it would be useful to understand how these sources contribute to the accuracy of genomic predictions, especially when designing reference populations for breeding programs. They show these contributions via simulated examples but did not provide methods that allow simple predictions for their contribution to accuracy. Pszczola et al. (2012)^{16} showed that the relationship between the reference population and the proband should be maximized to achieve an optimal design using a simulation study. However, they also did not attempt to derive the expected prediction accuracy from an optimal design in advance. Hayes et al. (2009)^{17} considered the influence of direct relatives on genomic prediction. They followed the same approach as the general theory, i.e. by considering the number of independently segregating chromosome segments within families. They showed the accuracy of genomic prediction from varying sizes of the first and second degree of relatives, but did not consider the information from combined sources^{18}. It should also be noted that those studies that derived genomic prediction accuracy from theory using effective number of chromosome segments (*M _{e}*)

^{5,6,19–21}, did not consider the correlation between relatedness at different chromosomes, therefore overestimating

*M*and underestimating whole-genome prediction accuracy

_{e}^{7}.

Wientjes et al (2016)^{22} proposed a simple selection index approach to combine information from different populations. They considered a genetic correlation between genetic effects expressed in different populations. We propose to use the same approach to combine different sources of information from within a population, where the different cohorts have a different degree of relationship with the target individual. To predict the accuracy, we derive the number of effective chromosome segments from a hypothetical *N _{e}* associated with each subset, and we show that that combining such subsets using selection index theory gives the same result as using a prediction from an

*M*derived from the variation in genomic relationships between the reference data and the target. Prediction accuracy is derived from variation in genomic relationship rather than the variation in genomic relationship as a deviation for the expected relationship among members of the reference set, as was proposed by Goddard et al (2011) and also applied by Wientjes et al (2016). This approach leads to a theoretical concept useful for assessing the accuracy of genomic predictions in advance, and we illustrate this with examples based on real data.

_{e}## MATERIALS AND METHODS

### Predicting genomic selection accuracy

The accuracy of genomic breeding values (GBV) or (genomic profile score in the context of human risk prediction^{23}) based on genome-wide SNP genotypes can be predicted from theory^{5–7, 24}, assuming that prediction is based on a reference population with phenotypes and genotypes for the same genome-wide SNPs that are linked to quantitative trait loci (QTL). The accuracy depends on i) the proportion of genetic variance at QTL captured by markers and ii) the accuracy of estimating marker effects. The proportion of genetic variance at QTL captured by markers (b) depends on linkage disequilibrium (LD) between markers and QTL, which in turn depends on the number of markers *(M)* and the number of ‘effective chromosome segments’ (*M*_{e})^{5}, that is

Various forms of prediction of *M _{e}* have been presented

^{5,6,21}that were however inconsistent to each other, and without considering the correlation between chromosomes. Recently, we presented a prediction formula with the form

^{7}where

*N*= effective population size;

_{e}*L*= average chromosome length;

*N*= number of chromosomes. This formula accounts for mutation, and that without considering mutation should be referred to equation (10) in Lee et al. (2017)

_{chr}^{7}The accuracy of the genomic prediction of a phenotype can be written as

^{5}where

*r*is the correlation coefficient between the true phenotypes (

_{y,ĝ}*y*) and estimated GBV,

*h*

^{2}is the heritability of the trait, and

*N*is the number of phenotypic observations. Other measures for genomic prediction accuracy, particularly for human risk prediction, such as the area under the receiver operating characteristic curve or odds ratio of case-control status contrasting the higher or lower risk group are described elsewhere

^{7}.

*M*_{e} and genomic relationship

_{e}

After collecting genotypic information of the reference data and the target individual, it is possible to obtain an empirical *M _{e}* from a genomic relationship matrix (GRM). In this derivation, the elements in the GRM are where

*x*and

_{Tm}*x*are the standardised genotype coefficients (mean 0 and variance 1) for the target individual (

_{jm}*T*) and

*j*individual in the reference data at the

^{th}*m*th locus. It is possible to construct a GRM for each locus, and the elements in the GRM at the

*m*th locus are

*G*

_{Tj}_{(m)}=

*G*

_{T*}_{(m)}=

*x*where * denotes the set of all reference individuals. Then, the variance of the mean of

_{Tm}x_{jm}*G*

_{T*}_{(m)}across all

*M,*SNPs in a single chromosome is where cov(

*x*

_{*(l)},

*x*

_{*(m)}) =

*x*

_{T(l)}⋅

*x*

_{T(m)}=

*r*, which is a correlation between the

_{m,l}*m*

^{th}and

*l*

^{th}SNP-genotype, because of var(

*x*) = 1 and mean(

*x*) = 0, i.e. the genotype coefficients are standardized in the population,

*M*is the number of SNPs in the

_{i}*i*th chromosome and

*Μ*(

_{e}*i*) is the effective number of chromosome segments for the

*i*th chromosome, which is the inverse of the expectation of the squared correlation between SNPs

^{6,7}. When considering multiple chromosomes, the covariance of the pairwise relationship between two chromosomes is not negligible

^{7}. Assuming equal length and number of SNPs for

*N*chromosomes,

_{chr}*M*for the whole genome can be written as where

_{e}*N*is the number of chromosomes.

_{chr}In Goddard et al. (2011)^{5}, their theoretical derivation had to assume a homogeneous population of individuals that are essentially unrelated. However, here we show that the assumption about unrelated individuals can be relaxed so that any random samples from the population can be used for Eq. (4), irrespective of they are related or not.

### Effective population size in a reference data set

One of critical parameters to determine the accuracy of genomic prediction is the effective population size (*N _{e}*). It is not very common to represent a reference population by a single value of

*N*when it consists of several cohorts of individuals with different relationships to the target individual. The only study using a single value for

_{e}*M*representing a reference set consisting of two population is by Wientjes et al. (2016)

_{e}^{22}. Here, we introduce a novel concept based on the relationship between

*N*and var(

_{e}, M_{e}*G*

_{T}_{*}), which can assign a value of

*N*for a reference population consisting of several cohorts. For any subset of the reference data set, there are realized relationships with the target sample. From Eq. (4), a value of

_{e}*M*which is the inverse of the variance of the genomic relationships between the proband and the reference sample, can be assigned to the reference data. Then, a single value of

_{e},*N*which is a function of

_{e},*M*from Eq. (1), can be obtained for the reference data. The effective population size of the reference set is therefore a parameter specific to the data used. It can be smaller, but also larger than the actual effective size that is often contributed to the population from which the data is derived, depending on whether closer or more distant individuals are chosen for the reference set.

_{e}Based on this concept of *N _{e},* reflecting information content of the reference sample in relation to the target sample, we consider three information sources consisting of 1) close relatives of the proband, e.g.

*N*= 10, 2) distant relatives or individuals from the local area of the proband, e.g.

_{e}*N*= 100 and 3) a wider population sample of individuals that are not related to the proband, e.g.

_{e}*N*= 1,000.

_{e}The GBV can be estimated based on each of these information sources, and the accuracy of the estimation can be calculated as above, e.g. *r _{g},_{ĝ(i)}* from Eq. (1) where

*i*represents the

*i*th information source. It is also possible to estimate GBV based on combined data of all three information sources. Assuming a random sample from the same population for each source, the accuracy of the GBV based on the combined data set can then be calculated using standard selection index theory as where

**g**is the vector with covariances between each of the GBV and the true breeding value, and

**P**is the variance-covariance matrix of the set of GBV. The accuracy of the GBV based on the combined data set can also be estimated based on the weighted

*M*from the three information sources. Assuming a random sample from the same population for each source, the weighted

_{e}*M*can be obtained as where

_{e}*p*is the proportion of the sample size over the total sample for each information source. The accuracy of the GBV based on the weighted

_{k}*M*is identical with that using standard selection index theory above (Eq. (5)). Following Wientjes et al. (2016)

_{e}^{22}we can further generalize for a case where genetic correlations among multiple reference populations and those between reference populations and the target are not one. Equation (5) can be generalized as where

*r*is the genetic correlation between the

_{Gk,T}*k*

^{th}reference population and the target set, and similarly,

*r*is the genetic correlation between the

_{Gi,j}*i*

^{th}and

*j*

^{th}reference population (

*i*=

*j*= 1 ~ k).

### Simulation

In a simulation, a stochastic gene-dropping method ^{25,26} was used to simulate 4,000 SNPs for each of 30 chromosomes, each of length *L=1* Morgan with *N _{e}* = 50, 500 and 1000 for 50, 500 and 1000 generations, respectively. Recombination and mutations were modelled according to the genetic distance between SNPs and the mutation rate of 1e-08 per site per generation

^{27}. In the final generation, we constructed a genomic relationship matrix for a random set of 3000 individuals. Among the 3000 individuals, we randomly selected 1000 individuals as target data and 2000 individual as reference data, and estimated variance of the genomic relationships between the target and reference data to validate Eq. (3).

### Evaluation of the formulas

For each of the three information sources contributing to genomic prediction we varied values for *N _{e},* sample size in reference data and marker density. We compared the expected accuracy of GBV from the sample of

*N*=1000 with predictions that additionally included information from the sample of

_{e}*N*=100 and

_{e}*N*=10. The total number in the reference population was kept equal between the comparisons.

_{e}### Real data analysis

We used publicly available data from the Framingham heart study (phs000007.v26.p10.c1)^{28}. There were 6950 individuals genotyped for 500,568 SNPs. Stringent quality control for genotype data and phenotype adjustment for confounders were applied to the data (the details can be found in Lee et al. (2016)^{7}). The quality control included SNP call rate > 0.95, individual call rate > 0.95, HWE p-value > 0.0001, MAF > 0.01 and individual population outliers < 6 SD from the first and second principal components (PC). After QC, 6920 individuals and 389,265 SNPs remained. Among them, 4243 individuals were phenotyped for height and body mass index (BMI).

We made three different information sources to form the reference data that were tested in 100 replicated analyses (Table 1). Initially, we randomly selected 800 individuals out of 4243 phenotyped individuals as a target data set. For reference data set #1, we selected 50% of individuals that were highly related (> relatedness of 0.3) to the 800 target individuals (*N _{1}* = 617 ± 19). For reference data set #2, we selected 80% of moderately related individuals (> relatedness of 0.1) of the 800 target individuals (

*N*= 1254 ± 30). For reference data set #3, we took the rest of the individuals that were not selected for reference data set #1 and #2 (

_{2}*N*= 1572 ± 33). There was no overlap sample between target data set and reference data sets #1, #2 and #3.

_{3}Using the real genotype data, the genomic relationships between the reference and target sample were constructed. Empirical *M _{e}* was estimated from equation (3) for reference #1, 2 and 3, and that for combined data. We took a median rather than mean because the distribution of variance of the genomic relationship between target and reference sample was skewed. The correlation between the true phenotypes (that were not used in the analyses) and estimated GBV in the target data set was estimated for the combined data set, which was used as the genomic prediction accuracy (

*r*). Phenotypes were adjusted for birth year, sex, and the first 10 PCs were used to control non-genetic confounding effects, e.g. population stratification.

_{y,ĝ}## RESULTS

In the simulation study, as shown in Figure 1A, 1B and 1C, the expected (from Eq. (1)) and empirically observed *M _{e}* from the simulated genotyped data (using Eq. (4)) are in good agreement, however, they are considerably lower than the expectation from the previous formulas

^{5,6,21}, which confirms the result from Lee et al. (2016)

^{7}. It is noted that Eq. (4) is still valid in the subset with a smaller

*N*= 50 that has a significant proportion of high related individuals, indicating that the assumption about unrelated individuals (made in Godard et al. (2011)

_{e}^{5}) can be relaxed.

In the evaluation of the formulas, we first tested how the prediction accuracy was changed with varying marker density, using formula (1) and (2) and b = *M* / *(M _{e}* +

*M)*(Figure 2). For

*N*=10,000, the accuracy gradually increased with marker density, but the slope became flat when using the number of SNPs exceeded 100,000 (Figure 2A). For

_{e}*N*=1,000, the accuracy did not increase with marker density as long as the number of SNPs was higher than 50,000 (Figure 2B). For N

_{e}_{e}=100, there was no improvement of the accuracy if the number of SNPs was more than 10,000 (Figure 2C). This would be expected because the proportion of genetic variance at QTL captured by markers (b =

*M*/ (

*M*+

_{e}*M*) approached one when the number of SNPs (

*M*) was more than 100,000, 50,000 and 10,000 for

*N*= 10,000, 1000 and 100, respectively (Figure 3), as

_{e}*M*was equal to 21,248, 2,313 and 254 for these three cases.

_{e}Next, we quantified the contribution of each information source when varying sample size in the reference data using formula (1), (2) and (4) (Figure 4). It was assumed that the number of SNPs was sufficient to capture most of causal variants (e.g. > 50,000). When adding 100 individuals of *N _{e}*=100 or

*N*=10 to the reference sample with

_{e}*N*=1000, the accuracy was slightly or substantially improved (Figure 4A). The improvement was larger when adding more individuals (500) (Figure 4B). Results showed that an information source of a smaller

_{e}*N*was more important when the samples sizes of each information source were the same. When the total number of reference data was increased, the importance of adding an information source of a smaller

_{e}*N*was relatively decreased (Figure 4). When heritability was higher, overall accuracy was increased, and the relative contribution from an information source of a smaller

_{e}*N*i.e. the close relatives, was reduced (Figure 5).

_{e},Figure 6 confirms again that the smaller *N _{e},* the better the prediction accuracy when using each information source separately. However, the sample sizes can be also varied across the information sources, as there are generally a lot fewer close relatives than individuals from the wider population. In Figure 6A, the accuracy at a sample size of 100 for

*N*=10 was 0.73, which was lower than that of a sample size of 1,000 for

_{e}*N*=100 (0.81) or that of a sample size of 20,000 for

_{e}*N*=1,000 (0.83). With a higher heritability, the result is similar in that the 20,000 records in the information source of

_{e}*N*=1000 gave a better accuracy than the 100 records of close relatives (

_{e}*N*=10).

_{e}In real situations, the most common and desirable design may combine all of the information sources to maximize the prediction accuracy. We plotted the accuracy using a composite design consisting of *N _{e}*=1000 +

*N*=100 (

_{e}*N*=500) +

*N*=10 (

_{e}*N*=50), compared to that using

*N*=1000 (Figure 7). The accuracy for a composite design was substantially increased especially when the total number of reference sample is low (Figure 7).

_{e}Figure 8 illustrates the real data analyses. The median of empirically estimated *M _{e}* from the inverse of the variance of the genomic relationship (Eq. 3) over 100 replicates was 2254 (SD=50), 3989 (SD=104) and 28848 (SD=920) for reference #1, #2 and #3, respectively (Table 1). Empirically estimated

*M*based on the combined data was 4836 (SD=106) while expected

_{e}*M*was 5309 (SD=88), approximately confirming Eq. (4). The (small) difference between empirical observation and expectation was probably due to skewed distribution of the variance of the genomic relationships.

_{e}Given *M _{e}, N* and

*h*the expected accuracy (from Eq. (2)) agreed well with the observed accuracy when using Framingham data (Figure 9). The reported heritabilities,

^{2},*h*

^{2}=0.8

^{29–31}for height and

*h*

^{2}=0.46

^{32,33}for BMI, were used.

Finally, we quantified the importance of marker density using the real data. In agreement with Figure 2, the prediction accuracy is not much decreased even with 50,000 SNPs that were randomly selected from 389,265 SNPs (Figure 10).

## DISCUSSION

This work shows a simple approach for modeling genomic prediction in a reference data set that contained several subpopulations that differ in relatedness to the target set, and by modeling these subpopulations as having different effective population size. The model allows assessing the prediction accuracy before actually conducting an experiment so that designing genomic prediction can be precise and effective in animal, plant and human genetics. For example, it can address a question how much the prediction accuracy can be increased by adding 10,000 (conventionally) unrelated individuals into the current experiment consisting of 100 relatives in the reference data. The value for *N _{e}* in equation (1) can be approximated based on prior knowledge of a population, and the relatedness of the sample with the target, possibly supported by some genotype information that maybe available on cohorts, or samples thereof. Prediction in advance indeed relies on arbitrary modelling a number of cohorts, but it would be a useful exercise, as illustrated in the results when considering marker density and various sizes of the subsets of the training data. The theory is also useful for an animal breeder to predict the value of genotyped animals in an own herd versus those in a wider references population consisting of a larger number of more distantly related individuals.

The genotypic and phenotypic information of close and distant relatives of the proband can be effectively used as a part of the unified reference panel that also include a large number of individuals that are not related to the predicted subject to improve the accuracy further as illustrated in Figure 7. For a random sample from the same homogenous population, e.g. within the same breed or ethnicity, an optimal design should consist of both close and distant relatives and unrelated individuals, e.g. a composite design, to maximise the prediction accuracy (Figure 7). That is, the composite design takes advantage of effective information from smaller number of relatives while it also use information from a greater number of unrelated individual.

Using equation (1) and (4), we showed that the prediction accuracy derived for a population with unrelated individuals turns out to be higher, compared to previous quantifications that overestimated *M _{e}* for a larger number of chromosomes

^{5,6,21,34}.

Using the same theory, we also showed that the information from close relatives could increase the accuracy even further, especially for smaller reference populations (Figures 3–6). It is important to note that the assumption about using unrelated individuals in estimating empirical *M _{e}* from genomic relationship

^{5}is not strictly necessary and can be relaxed (Eq. (3) and (4)). The theory and empirical observation from simulation study agreed well (Figure 1) even when using a population with a smaller effective population size (

*N*50) that consisted of a significant proportion of high relatedness.

_{e}=Previous studies related to genomic prediction accuracy have suggested that *M _{e}* can be derived from the variation in the differences between realized and expected relationships

^{6,22}, i.e.

**D**=

**G**–

**A**where

**G**is a genomic relationship matrix and

**A**is a numerator relationship matrix based on pedigree. Those studies validated their results also with simulation. If the individuals used in the training set have a low expected relationship to the target individuals, then there is not much difference between the variations in

**D**versus

**G**. However, when some closer relatives are used, var(

**G**) is larger than var(

**D**) and

*M*is therefore smaller. Note that non-random sampling of individuals used for the training set can cause a difference between the

_{e}*N*of the population that was simulated, and the

_{e}*N*of the data set that was used for prediction.

_{e}We have not tested the theory for multi-breed reference populations, i.e. those that are heterogeneous in the sense of consisting of populations from different genetic background, i.e. different breeds or ethnicities, each with different minor allele frequencies, different LD structure and different effects for causal variants. Wientjes et al. (2016)^{22} explicitly addressed the problem of different effects for causal variants (i.e. genetic correlation less than one) when combining data from two populations. Individuals from different populations share genomic relationships that are lower than those among members within each population. Evidence in literature suggests low prediction accuracies when using information from different breeds or populations, which could be viewed as predicting from populations with very large *N _{e}.* Moreover, we have not considered historical population dynamics such as bottleneck and admixture, but assumed a constant

*N*over the historical generations, which leads to simplifications that make the formulae tractable and easy to derive. Further work is required to extend the theory accounting for admixture populations and historical population dynamics.

_{e}We have shown an improved theory for the prediction of the effective number of chromosome segments, which is a key parameter in genomic prediction accuracy^{7}. The theory accounts for the correlation between relationships at different chromosomes and as a result the effective number of chromosome segments is smaller than predicted from previous theory^{5,6,21}. As a result, the increase of the genomic prediction accuracy appears to be less reliant on higher marker density unless *N _{e}* is very large (e.g. > 10,000) (Figure 2), compared to what have been quantified by previous theory

^{5,6,21}. The previous theory overestimates

*M*(mostly due to neglecting correlation between chromosomes), therefore underestimates the proportion of genetic variance at QTL captured by markers. Little improvement of prediction accuracy with increasing SNP marker density has been empirically observed in a number of studies

_{e}^{35–37}. This may also have important implication in genomic prediction as to designing marker density in animal, plant and human genetics.

The ability to quantify the accuracy in relation to various degrees of relationships (e.g. close relatives, distant relatives, local or extensive population sample) is important for predicting outcomes of genomic prediction for specific designs. This study has addressed this question, and the theory has been implemented in MTG2 software (https://sites.google.com/site/honglee0707/mtg2). Therefore, a user can know the expected prediction accuracy and the power^{38} before designing an experiment of genomic prediction. Our approach can be applied both before and after collecting the data.

## DISCLOSURE DECLARATION

The authors declare no competing financial interests.

## ACKNOWLEDGEMENTS

This research is supported by the Australian National Health and Medical Research Council (APP1080157), the Australian Research Council (DP160102126, FT160100229) and the Australian Sheep Industry Cooperative Research Centre. The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02- HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University. The authors acknowledge useful discussion with Han Mulder that contributed to a clearer paper.