Abstract
We define and study the problem of chromosomal selection for multiple complex traits. In this problem, it is assumed that one can construct a genome by selecting different genomic parts (e.g. chromosomes) from different cells. The constructed genome is associated with a vector of polygenic scores, obtained by summing the polygenic scores of the different genomic parts, and the goal is to minimize a loss function of this vector. While out of reach today, the problem may become relevant with emerging technologies, and may yield far greater gains in the loss compared to the present-day technology of embryo selection, provided that technological and ethical barriers are overcome. We suggest and study several natural loss functions relevant for both quantitative traits and disease. We propose two algorithms: a Branch-and-Bound technique that solves the problem for multiple traits and any monotone loss function, and a convex-relaxation algorithm applicable to any differentiable loss. Finally, we use the infinitesimal model for genetic architecture to approximate the potential gain achieved by chromosomal selection for multiple traits.
1 Introduction
Polygenic Scores (PS) are genetic scores predicting a phenotype of interest by combining the contributions of multiple genomic alleles. In the last few years, hundreds of polygenic scores were developed for predicting complex diseases and quantitative traits in humans [25], with the score coefficients typically fitted in large Genome-Wide Association Studies (GWAS) [22, 39]. The accuracy of PSs is expected to improve significantly in the upcoming years due to increases in GWAS sample sizes, inclusion of additional populations [32], usage of whole-genome sequencing (rather than genotyping using SNP arrays), which enables profiling of additional (in particular rare) variants [14], and improvements in statistical methods for fitting such scores [5, 11, 29].
These recent advances make it possible to screen embryos for common, complex conditions and traits when using in vitro fertilization (IVF), in a technology termed Polygenic Embryo Screening (PES). Conceptually, PES is quite straightforward: compute the PS of the available embryos for the traits of interest, and select the embryo optimizing a PS (e.g. minimizing disease risk); the technology is already offered commercially and is in use in the clinic [35]. Several works have analyzed the potential benefits and risks using theoretical and empirical analysis, but the effectiveness of the current technology is debated [21, 26, 35, 36]. In particular, the limited number of embryos to select from (typically not more than 5–7) may limit the potential benefits of the technology, especially when selecting for multiple traits of interest simultaneously [26].
With novel technological advances, it might be possible in the future to go beyond embryo selection, and to select and combine parts of the genomes of different cells. Such flexibility, if made possible, will lead to a far greater space of possibilities for selection compared to PES, with potentially significantly larger benefits in disease risk reduction. For example, chromosomal transplantation, in which a defective chromosome is replaced by a normal one, was recently demonstrated in vitro [30, 31]. Similar technologies are used in the lab to study humanized animal models [37, 41]. For complex polygenic traits, selection of individual chromosomes or large-scale genomic regions from available cells may be more effective than genome-engineering approaches such as gene editing with the CRISPR-Cas9 system [18], which affect only one or a handful of genes.
A major technological challenge will be to determine the PS of individual cells and chromosomes in a non-invasive manner. Such methods may be available for oocytes [17]. Alternatively, phenotypes of individual cells (measured e.g. using imaging techniques [20]) may be correlated with DNA quality [28], and may provide an indirect proxy for polygenic scores. Assuming that technological and ethical issues are resolved, allowing chromosomal selection in humans, animal models or agriculture, computational methods and statistical analysis are needed in order to fully utilize the potential of the different selection methods. The goal of this paper is to develop these methods and analysis. Specifically, we focus on chromosomal selection for multiple quantitative traits and diseases. In this problem, multiple copies are available for each chromosome (or possibly a smaller genomic part), and based on these copies' PSs we select one of them. The selected parts are then combined to generate an embryo. We address the following two main questions:
How should the chromosomes be selected in order to maximize utility across multiple traits? When selecting for T traits, each selection c of chromosomes leads to an overall genomic score vector $X_c \in \mathbb{R}^T$. A loss function $\mathcal{L}: \mathbb{R}^T \to \mathbb{R}$ is defined, and our goal is to find the selection c minimizing the loss $\mathcal{L}(X_c)$, and to compare it to the loss obtained under random selection. When C copies are available for each of the M chromosomes, the total number of possible selections $C^M$ is exponential in M, hence the need to design efficient general algorithms for the problem.
What is the expected gain when using optimal chromosomal selection? How does it compare to the baseline, i.e. selecting embryos at random, and to the embryo selection procedure that is enabled by current technology and already offered to patients [26, 35]?
Our main contributions are threefold: first, we formulate the chromosomal selection problem mathematically. Second, we provide two algorithms for chromosomal selection for multiple traits and general loss functions, and investigate their empirical performance. Third, we estimate the expected gain achieved by chromosomal selection, both for linear loss functions, where we establish an approximate analytic formula, and for several nonlinear loss functions using simulations.
2 The Chromosomal Selection Problem
2.1 Background and Selection Problems for Polygenic Scores
Consider a genome composed of M distinct chromosomes, where for diploid cells we count the maternal and paternal chromosomes separately; hence, for example, for a human diploid cell M = 45, with 22 pairs of autosomes and the 'XX' pair (as explained later, in most plausible scenarios selection for the 'XY' pair is determined by the sex and not by the scores). We associate with each chromosome a Polygenic Score (PS) vector representing the genetic contribution of the chromosome to T traits of interest. Suppose that we have C distinct cells, each with its own genome; hence C copies are available for each chromosome overall. Our goal is to select one copy for each chromosome, possibly under constraints, yielding a full genome with desired properties in terms of the resulting polygenic scores. For example, in embryo selection, the C cells are C embryos obtained from the same parents, and the selection is performed by simply choosing one of the cells, such that all selected chromosomes belong to the same cell. In chromosomal selection, it may be possible to select different chromosomes from different cells. For example, suppose that a diploid paternal cell and a diploid maternal cell are available, and both are reprogrammed to create haploid sperm and oocyte cells, respectively [7, 16], and later to yield a viable (diploid) embryo after fertilization. Suppose further that it will be possible to select the maternal or paternal copy independently for each chromosome in the created haploid cells. Hence, there are overall C = 2 copies of each chromosome, with M chromosomes in each haploid cell (ignoring for simplicity the uniqueness of the sex chromosomes), and a space of 2^M possible resulting genomes for the embryo, depending on the copy selected for each chromosome. Figure 1(b) shows an illustration of chromosomal selection for sperm cells and oocytes.
A simplified variant of the problem is obtained when an oocyte is already available, and we only select chromosomes from a diploid paternal cell to form a sperm cell, or vice versa; hence M = 22 or 23, and the score vector of the selected gamete can then be added to the scores of the available gamete (assumed w.l.o.g. to be $0_T$), yielding a chromosomal selection problem with a smaller M value. Additional scenarios in which chromosomal selection may be possible are described in Appendix Section C.
2.2 Preliminaries
We first introduce mathematical notations used throughout the paper. For a natural number n, denote the set {1, .., n} by [n]. For two natural numbers m ≤ n, denote the set {m, m + 1, .., n} by [m, n]. The vector of all zeros (ones) of length n is denoted $0_n$ ($1_n$).
For a polytope $K \subseteq \mathbb{R}^n$, we define the projection operator $P_K(x) \equiv \arg\min_{y \in K} \|x - y\|_2$.
Tensors
Let $X \in \mathbb{R}^{m \times n \times p}$ be a 3rd-order tensor with elements $X_{ijk}$. We use the $\bullet$ notation to denote lower-dimensional fibers of a tensor. For example, $X_{ij\bullet}$ denotes the vector $(X_{ij1}, .., X_{ijp})$. Similarly, $X_{\bullet j \bullet}$ is a matrix of size m × p containing all elements $X_{ijk}$, ∀i ∈ [m], ∀k ∈ [p]. We can also describe sub-tensors obtained by taking subsets of the indices across each dimension. For example, $X_{[i][j]\bullet}$ is a 3rd-order tensor of size i × j × p obtained by taking the first i and first j coordinates of the first and second dimensions, respectively.
For a 3rd-order tensor $X \in \mathbb{R}^{m \times n \times p}$ and a matrix $W \in \mathbb{R}^{m \times n}$, define the 2-mode tensor-by-matrix product [1, 23] as the matrix $X \times_2 W \in \mathbb{R}^{m \times p}$, with elements defined by:
$$[X \times_2 W]_{ik} \equiv \sum_{j=1}^{n} X_{ijk} W_{ij}. \qquad (1)$$
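In code, this contraction is a single einsum (a sketch under our reading of the definition, collapsing the copy dimension j; note that with one-hot rows of W, the product recovers an ordinary selection of one copy per chromosome):

```python
import numpy as np

def mode2_product(X, W):
    """[X x_2 W]_{ik} = sum_j X_{ijk} W_{ij}: X is (m, n, p), W is (m, n)."""
    return np.einsum('ijk,ij->ik', X, W)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3, 2))            # m=4 chromosomes, n=3 copies, p=2 traits
W = np.zeros((4, 3)); W[:, 0] = 1.0       # one-hot rows: always pick copy 0
S = mode2_product(X, W)
assert np.allclose(S, X[:, 0, :])         # row i equals X_{i, 0, .}
```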
Gaussian Distribution
For the multivariate Gaussian distribution with mean μ and covariance Σ, denote by ϕ(x; μ, Σ) and Φ(x; μ, Σ) the density function and cumulative distribution function, respectively. When μ, Σ are omitted, Φ and ϕ refer to the standard multivariate Gaussian distribution with mean zero and identity covariance matrix. For a measurable set $A \subseteq \mathbb{R}^n$, denote by $\Phi(A)$ the probability of the set under the Gaussian distribution, i.e. $\Phi(A) \equiv \int_A \phi(x)\,dx$.
2.3 Chromosome selection for Multiple Traits
Let $X \in \mathbb{R}^{M \times C \times T}$ be a 3rd-order tensor of polygenic scores, where $X_{ijk}$ denotes the score of chromosome i of copy j for trait k. Let $c = (c_1,.., c_M) \in \{1,.., C\}^M$ be a selection vector. The associated selected polygenic score vector is defined as $X_c \equiv \sum_{i=1}^{M} X_{i c_i \bullet} \in \mathbb{R}^T$, with $X_{ck}$ denoting its k-th element $(X_c)_k$, ∀k ∈ [T]. Our goal is to find the selected score vector $X_c$ minimizing a loss function of our choice. The multi-trait chromosomal selection problem is defined as follows:
Problem 1. Given a 3rd-order tensor of scores $X \in \mathbb{R}^{M \times C \times T}$ and a loss function $\mathcal{L}: \mathbb{R}^T \to \mathbb{R}$, find a vector $c^* \in \{1,.., C\}^M$ minimizing the loss: $c^* = \arg\min_{c} \mathcal{L}(X_c)$.
Table 1 lists several examples of natural loss functions of interest for both quantitative and disease traits. The computational difficulty of Problem 1 above depends on the loss function $\mathcal{L}$. For example, when the loss is linear in the polygenic risk vector $X_c$, we can select the optimal copy for each of the M chromosomes independently, and the computational problem becomes trivial. This is the case when selecting to maximize a weighted combination of quantitative traits. However, for non-linear loss functions (e.g. minimizing the overall disease probability over multiple diseases), selecting the best chromosomes may be computationally challenging, since we need to take into account the scores of all chromosomes jointly. In Section 4, we propose two algorithms for finding the optimal selection, applicable to general classes of loss functions.
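For concreteness, the exhaustive-search formulation of Problem 1 can be sketched as follows (a minimal illustration on a toy instance, with an assumed sum-of-squares loss standing in for any loss of interest):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
M, C, T = 4, 3, 2                        # chromosomes, copies, traits
X = rng.normal(size=(M, C, T))           # scores tensor X_{ijk}

def loss(v):
    """Example nonlinear loss on the total score vector X_c."""
    return float(np.sum(v ** 2))

# Exhaustive search over all C**M selections (feasible only for tiny M).
best_c, best_val = None, np.inf
for c in product(range(C), repeat=M):
    Xc = X[np.arange(M), c, :].sum(axis=0)   # X_c = sum_i X_{i, c_i, .}
    if loss(Xc) < best_val:
        best_c, best_val = c, loss(Xc)
```

The C^M blow-up (3^4 = 81 here, but 2^45 in the human setting described later) is what the algorithms of Section 4 are designed to avoid.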
In particular, many natural loss functions are monotone functions of the scores vector. This monotonicity may be used when solving Problem 1. Formally, we define monotone loss functions with respect to the (partial) product order between vectors as follows:
A vector $x \in \mathbb{R}^n$ dominates a vector $y \in \mathbb{R}^n$, denoted $x \prec y$, if $x_i < y_i$, ∀i ∈ [n]. We say that a loss function $\mathcal{L}$ is monotonically non-increasing if for any two vectors $X_c \prec Y_c$ we have $\mathcal{L}(X_c) \geq \mathcal{L}(Y_c)$.
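Dominance checks of this kind can be sketched as follows (a minimal illustration; the convention that larger score vectors are preferable, matching a monotonically non-increasing loss, is assumed):

```python
def dominates(x, y):
    """True if x dominates y: at least as large in every coordinate,
    strictly larger in at least one (larger scores assumed preferable)."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def prune_dominated(vectors):
    """Keep only vectors not dominated by any other vector (the Pareto front)."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]

front = prune_dominated([(3, 1), (1, 3), (2, 2), (1, 1), (0, 2)])
# (1, 1) and (0, 2) are dominated and removed; the rest are incomparable.
```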
The Gain due to Selection
For a tensor X of scores and a loss function $\mathcal{L}$, we define the gain G due to chromosomal selection and the gain $G_e$ due to embryo selection as the differences between the expected loss when selecting at random, i.e. with respect to the uniform distribution over all $C^M$ possible choices of chromosomes, and the optimal loss:
$$G(X) \equiv \mathbb{E}_{c}[\mathcal{L}(X_c)] - \min_{c \in [C]^M} \mathcal{L}(X_c), \qquad G_e(X) \equiv \mathbb{E}_{c}[\mathcal{L}(X_c)] - \min_{j \in [C]} \mathcal{L}\Big(\sum_{i=1}^{M} X_{ij\bullet}\Big). \qquad (2)$$
The gain is similar in spirit to previous definitions given in [21, 26], but with two major differences. First, it is defined for a general loss, whereas [21, 26] defined the gain only for additive losses for quantitative and disease traits. Second, the gain in [21, 26] was defined with respect to the actual trait value, which is determined by the score as well as by non-score genetic and environmental components. Here, the gain and the loss are defined in terms of the scores only; hence this gain can be viewed as the expectation of the previous gain over a latent variable representing the phenotype value. By definition $G \geq G_e \geq 0$, and we are interested in the expected gains of both approaches compared to random selection, and in their expected (non-negative) difference. In Section B we use a statistical model for the scores to obtain approximate formulas for the expected gain, and its dependence on model parameters, for the linear loss.
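On a small instance, both gains can be estimated directly from their definitions. The sketch below uses a linear loss (an illustrative choice), for which the random-selection baseline is the same whether averaged over all C^M chromosome selections or over the C whole genomes:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
M, C, T = 5, 2, 2
X = rng.normal(size=(M, C, T))
w = np.ones(T)
loss = lambda v: float(w @ v)            # linear loss for illustration

all_losses = [loss(X[np.arange(M), c, :].sum(axis=0))
              for c in product(range(C), repeat=M)]            # every selection
cell_losses = [loss(X[:, j, :].sum(axis=0)) for j in range(C)]  # whole cells only

expected_random = float(np.mean(all_losses))
G = expected_random - min(all_losses)     # gain of chromosomal selection
Ge = expected_random - min(cell_losses)   # gain of embryo (whole-cell) selection
assert G >= Ge >= -1e-12                  # G >= G_e >= 0
```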
3 The Expected Gain
The gain defined in eq. (2) represents the utility of chromosomal selection for a concrete set of chromosomes with scores. We are interested in statistical properties of the gain in a population, hence the need for a statistical model for the scores tensor X. That is, suppose that X ~ P_X. Hence G(X) is also a random variable, determined by the scores distribution P_X and the loss $\mathcal{L}$. We study here the expected gain $\bar{G} \equiv \mathbb{E}_X[G(X)]$, and similarly the expected gain due to embryo selection $\bar{G}_e \equiv \mathbb{E}_X[G_e(X)]$. In Section B.1 we derive a statistical model for X based on quantitative-genetics principles, extending the models for whole-genome scores in [21, 26]. Under this model, the scores for the different chromosomes are independent, and the covariance matrix of a randomly selected vector $X_c$ is denoted by $\Sigma^{(X)}$. For the linear loss $\mathcal{L}(x) = w^\top x$, we showed in [21] that the gain due to embryo selection is
$$\bar{G}_e \approx \sqrt{\tfrac{1}{2} w^\top \Sigma^{(X)} w}\; \mathbb{E}\Big[\max_{j \in [C]} z_j\Big], \qquad (3)$$
where $z_1, .., z_C$ are i.i.d. standard Gaussians.
Chromosomal selection is simple in this case, and is achieved by selecting for each chromosome i the copy j minimizing $w^\top X_{ij\bullet}$. Using this property, we derive the approximate gain for chromosomal selection as (see details in Appendix B.1)
$$\bar{G} \approx \Big(\sum_{i=1}^{M} \sqrt{\alpha_i}\Big)\, \bar{G}_e,$$
where $\alpha_i$ is the proportion of score variance explained by chromosome i, satisfying $\sum_{i=1}^{M} \alpha_i = 1$. The expected gain for chromosomal selection is thus roughly $\sum_{i=1}^{M} \sqrt{\alpha_i}$-fold higher than the expected gain for embryo selection in eq. (3). For the $\alpha_i$ values in Table 2 in the Appendix, representing human chromosome lengths, this gives a 4.68-fold improvement. For general (nonlinear) loss functions, we compute the expected gain numerically using simulations, as shown in Section 5.
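The per-chromosome greedy rule for a linear loss can be checked against exhaustive search (a minimal sketch with random scores and weights):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)
M, C, T = 5, 2, 3
X = rng.normal(size=(M, C, T))
w = rng.normal(size=T)

# Greedy: for the linear loss w^T X_c, pick for each chromosome i the copy j
# minimizing w^T X_{ij.}, independently of the other chromosomes.
greedy_c = np.argmin(X @ w, axis=1)                      # (M, C) -> best copy per row
greedy_val = float(w @ X[np.arange(M), greedy_c, :].sum(axis=0))

# Compare to exhaustive search over all C**M selections.
brute_val = min(float(w @ X[np.arange(M), c, :].sum(axis=0))
                for c in product(range(C), repeat=M))
assert np.isclose(greedy_val, brute_val)                 # greedy is exact here
```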
4 Algorithms for Chromosome Selection
The optimization Problem 1 is difficult due to the exponential search space of size $C^M$. For example, when selecting the 22 + 23 = 45 chromosomes of a single sperm and a single oocyte in humans, the number of possible selections is $2^{45} \approx 3.5 \times 10^{13}$. We describe next two classes of algorithms for the problem: a Branch-and-Bound approach that eliminates dominated selections, and a relaxation of the discrete selector variables $c_i$ to continuous vectors in the simplex. The two algorithms are applicable in different scenarios: the Branch-and-Bound techniques can be applied to any monotone loss and are guaranteed to yield an optimal solution, but their worst-case computational complexity is exponential in M. The relaxation approach is polynomial and can be applied to any differentiable loss, but has no optimality guarantees.
4.1 A Branch-and-Bound algorithm
In the Branch-and-Bound algorithm for selecting chromosomes under monotone loss functions, we grow a tree of all possible chromosome selections, and at each level keep only leaves not dominated by other leaves. Finally, we evaluate the loss of all leaves at the last level. A tree of depth b is represented as a collection of paths from the root to the leaves, Γ = {c^(1), .., c^(m)}, where each c^(j) ∈ {1,.., C}^b represents the choices of chromosomes in the first b levels. The partial score sum $X^{(b)}_{c^{(j)}} \equiv \sum_{i=1}^{b} X_{i c_i^{(j)} \bullet}$ is calculated, and dominated partial score vectors are pruned. Then, each of the remaining c^(j)'s is expanded into C paths of length b + 1. A formal step-by-step description is shown in Algorithm 1. The computational complexity is determined by the number of leaves corresponding to non-dominated vectors considered at each step b, with $C^b$ possible leaves to consider. In the worst case, the Branch-and-Bound algorithm enumerates all leaves, hence it may run in time exponential in M, as shown in the Appendix, Section A.2.
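The level-by-level pruning can be sketched as follows (a simplified rendering, not the paper's exact Algorithm 1; the dominance convention, that larger score vectors are preferable under a monotone non-increasing loss, is an assumption):

```python
import numpy as np
from itertools import product

def dominated(p, q):
    """q strictly dominates p coordinate-wise (larger scores assumed better)."""
    return bool(np.all(q >= p) and np.any(q > p))

def branch_and_bound(X, loss):
    """Level-by-level search keeping only non-dominated partial score sums.
    Valid for monotone non-increasing losses. X has shape (M, C, T)."""
    M, C, T = X.shape
    partial = [np.zeros(T)]                              # root: empty selection
    for i in range(M):                                   # branch on chromosome i
        expanded = [p + X[i, j] for p in partial for j in range(C)]
        partial = [p for p in expanded
                   if not any(dominated(p, q) for q in expanded if q is not p)]
    return min(loss(p) for p in partial)

rng = np.random.default_rng(5)
M, C, T = 6, 2, 3
X = rng.normal(size=(M, C, T))
loss = lambda v: float(-np.min(v))                       # monotone non-increasing

brute = min(loss(X[np.arange(M), c, :].sum(axis=0))
            for c in product(range(C), repeat=M))
assert np.isclose(branch_and_bound(X, loss), brute)
```

Discarding a dominated partial sum is safe because any completion of it is dominated, coordinate-wise, by the same completion of its dominator, and hence has no smaller loss.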
While the worst-case computational complexity of Algorithm 1 is exponential in M, the number of vectors considered may be far lower than $C^M$ in practice. Further pruning can also be achieved by computing upper and lower bounds for the optimal loss, as follows: let $X^{\vee}$ ($X^{\wedge}$) be the vector obtained by summing over all chromosomes i the vector obtained by taking, for each coordinate k, the maximum (minimum) of $X_{ijk}$ over j ∈ [C]. Then, for a monotonically non-increasing loss, $\mathcal{L}(X^{\vee}) \leq \mathcal{L}(X_c) \leq \mathcal{L}(X^{\wedge})$ for every selection c. Solutions violating these bounds are also pruned as part of the algorithm, in Step 5.
For simulated problem instances using the model in eq. (26), the average number of leaves at each stage b was far lower than $C^b$, growing roughly as $C^{b/2}$, as shown for example in Figure 2(a,b), enabling the use of Algorithm 1 in practice for large problem instances (e.g. C = 2, M = 45).
Divide-and-conquer
We can improve the speed of our algorithm by dividing the M chromosomes into groups, optimizing over each group separately, and then combining the solutions in a manner that filters out sub-vectors that cannot be extended to an optimal solution. This procedure significantly improves performance, while still being guaranteed to yield an optimal solution for monotone losses. Due to its technical details, it is described in Appendix A.3.
4.2 A Relaxation Algorithm
Algorithm 1 (Branch-and-Bound) is inapplicable to non-monotone loss functions. Moreover, even for monotone losses, the Branch-and-Bound algorithm can be computationally intensive, taking exponential time in the worst case; hence the need for alternative algorithms.
We encode each selection $c_i \in [C]$ using a one-hot vector $C_{i\bullet} = (C_{i1}, .., C_{iC})$, with $C_{i c_i} = 1$ and $C_{ij} = 0$, ∀j ≠ $c_i$. Next, we relax the requirement that each $C_{ij} \in \{0, 1\}$, and instead require only $C_{i\bullet} \in \Delta_C$, where $\Delta_C$ denotes the C-dimensional simplex. Stacking all selection vectors yields a (row-)stochastic matrix $C \in \mathbb{R}^{M \times C}$, and the score is given by $X_c = (X \times_2 C)^\top 1_M$. This leads to the following relaxed problem:
Problem 2 (relaxation):
Given a 3rd-order tensor of scores $X \in \mathbb{R}^{M \times C \times T}$ and a loss function $\mathcal{L}: \mathbb{R}^T \to \mathbb{R}$, find a matrix $C^* \in (\Delta_C)^M$ minimizing the loss: $C^* = \arg\min_{C} \mathcal{L}\big((X \times_2 C)^\top 1_M\big)$.
We solve Problem 2 using projected gradient descent, where each row of C is projected separately onto the simplex $\Delta_C$, as described in [4]. Then, the closest vertex of the polytope to the solution $C^*$ is returned as an approximate solution of the original Problem 1. The details are shown in Algorithm 2. When the loss is convex, it is possible to establish convergence guarantees for the relaxed Problem 2 (see e.g. [6]), yet the original Problem 1 is computationally hard in general. For smooth losses, it may be possible to obtain a closed-form solution using Lagrange multipliers, as demonstrated for the stabilizing-selection loss in Appendix Section A.5.
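The projection-plus-rounding loop can be sketched as follows (a minimal rendering of the idea, with a sort-based Euclidean projection onto the simplex in the spirit of [4]; the quadratic loss and step size are illustrative assumptions):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def relaxed_selection(X, grad_loss, steps=500, lr=0.02):
    """Projected gradient descent for the relaxed problem; rows of W live on
    the simplex and are finally rounded to the nearest vertex (argmax)."""
    M, C, T = X.shape
    W = np.full((M, C), 1.0 / C)                 # start at the barycenter
    for _ in range(steps):
        W = W - lr * grad_loss(W)
        W = np.apply_along_axis(project_simplex, 1, W)   # project each row
    return np.argmax(W, axis=1)

rng = np.random.default_rng(6)
M, C, T = 5, 2, 3
X = rng.normal(size=(M, C, T))

def grad_loss(W):                                # gradient of ||X_c||^2 (assumed loss)
    xc = np.einsum('ijk,ij->k', X, W)            # relaxed score vector
    return 2.0 * np.einsum('ijk,k->ij', X, xc)

c_hat = relaxed_selection(X, grad_loss)
assert c_hat.shape == (M,) and all(0 <= j < C for j in c_hat)
```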
5 Simulation Results
To examine the utility of the two algorithms, we have implemented them as part of an R package called "EmbryoSelectionCalculator", available at https://github.com/orzuk/EmbryoSelectionCalculator (see additional details in Appendix D). We simulated embryo scores from a matrix Gaussian distribution (see [19]). We mimicked selection from a single sperm cell and a single oocyte, giving M = 22 + 23 = 45 and C = 2, i.e. a search space of size $2^{45} \approx 3.5 \times 10^{13}$. We selected for T = 5 diseases with equal prevalence of 0.1, and assumed that the polygenic scores explain 20% of the liability variance for each disease. The relative proportions of variance explained by each chromosome were set according to Table 2, for all traits. We assumed a heritability of h² = 50% for all disease liabilities, and consequently defined the covariance matrix Σ of the non-score component ε to have 0.8 on the diagonal and 0.65 in the off-diagonal elements.
We used the sum of disease probabilities loss function with equal weights, and the (minus) probability of being disease-free loss function (lines 3 and 4 in Table 1, respectively). The baseline loss under random selection was, as expected, 0.1 × 5 = 0.5 for the first loss, and 0.72 for the second, disease-free probability loss, slightly higher than the $0.9^5 \approx 0.59$ expected if the diseases were uncorrelated.
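These baseline numbers can be sanity-checked directly (a sketch; the pairwise liability correlation of 0.65 used below is an assumption for illustration):

```python
import numpy as np

# Baselines for T = 5 diseases of prevalence K = 0.1 each: the expected
# sum-of-probabilities loss is additive, and under independence the
# disease-free probability would be 0.9**5.
T, K = 5, 0.1
assert np.isclose(T * K, 0.5)                 # sum-of-probabilities baseline
assert round(0.9 ** T, 4) == 0.5905           # independent disease-free prob.

# Monte Carlo check that positively correlated Gaussian liabilities raise the
# joint disease-free probability above the independent product.
rng = np.random.default_rng(4)
z = 1.2815516                                 # Phi^{-1}(1 - K) for K = 0.1
rho, n = 0.65, 200_000                        # assumed liability correlation
Sigma = np.full((T, T), rho)
np.fill_diagonal(Sigma, 1.0)
liab = rng.multivariate_normal(np.zeros(T), Sigma, size=n)
pfree = float(np.mean(np.all(liab < z, axis=1)))   # P(no disease)
assert 0.9 ** T < pfree < 0.85
```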
We repeated the simulation 100 times, and each time computed the optimal selection strategy using Algorithms 1 and 2. The results are shown in Figure 2(c,d), as a function of the number of available copies for selection C. For the first loss, the outputs of the two algorithms usually coincided, and on average the loss was reduced by 37%. For the second loss, the relaxation algorithm matched the exact Branch-and-Bound solution in only 62 out of 100 simulations, and performed worse in the remaining 38, as can be expected for a non-convex loss. The average reduction for this loss was smaller, at 29%. Perhaps surprisingly, the Branch-and-Bound algorithm was faster for both losses, indicating that the trees grown for this model always remained small. We therefore recommend using the Branch-and-Bound algorithm, and only if the tree size explodes, either pruning the tree heuristically by keeping only the top paths at each step, or resorting to the relaxation algorithm.
6 Discussion
We have defined and formulated the chromosomal selection problem, and provided two algorithms for solving it. Our Branch-and-Bound algorithm, while exponential in the worst case, can easily be applied in practice to the problem of selecting chromosomes from a single sperm cell and a single oocyte in humans, for monotone loss functions. The relaxation algorithm can handle much larger selection problems, yet the quality of the solution obtained by this algorithm may vary. Developing an efficient algorithm with optimality guarantees for major classes of loss functions is an interesting direction for future research.
While the technology for chromosomal selection is not currently available, we believe that our analysis is insightful, as it may guide practitioners in the future regarding the potential utility of such technologies. As technologies improve, it may be necessary to reformulate the selection problem and adjust the algorithms to the availability of scores and to the constraints on selection imposed by the technology. For example, recent imaging studies of embryos may provide information on their viability, and possibly disease risk, without the need to destructively sequence the embryos. If such techniques mature, they can be combined with our computational methods to estimate the score of each chromosomal copy and select based on these estimates.
Finally, while current polygenic scores are linear, improved risk predictions may be achieved in the future using nonlinear scores. Formulating the chromosomal selection problem for such nonlinear scores, and dealing with the increased combinatorial complexity, will pose algorithmic challenges.
Appendix
A Algorithms and Optimization Details
A.1 Notations
For a matrix X, we denote by vec(X) the column vector obtained by stacking the columns of X, from first to last. Similarly, for a 3rd-order tensor X, we denote by mat(X) the matrix obtained by stacking the 2nd-order fibers of X, from first to last.
There are $2^T$ possible binary vectors of length T, with each such vector $d \in \{0, 1\}^T$ corresponding to an orthant defined as $O_d \equiv \{(x_1,.., x_T) \text{ s.t. } (-1)^{d_i} x_i < 0, \forall i \in [T]\}$. These $2^T$ orthants form a disjoint union of $\mathbb{R}^T$ (ignoring equalities with the axes).
Similarly to eq. (1), for a vector $v \in \mathbb{R}^p$, define the 3-mode tensor-by-vector product as the matrix $X \times_3 v \in \mathbb{R}^{m \times n}$, with elements defined by: $[X \times_3 v]_{ij} \equiv \sum_{k=1}^{p} X_{ijk} v_k$.
Element-wise notations
For two matrices A, B of the same size, their Hadamard product A ⊙ B is defined by element-wise multiplication, i.e. $[A \odot B]_{ij} = a_{ij} b_{ij}$. Similarly, we define their entry-wise minimum and maximum, $A \wedge B$ and $A \vee B$, as $[A \wedge B]_{ij} = \min(a_{ij}, b_{ij})$ and $[A \vee B]_{ij} = \max(a_{ij}, b_{ij})$. For a real number α, the Hadamard power of A is defined by raising each element to the power α, i.e. $[A^{\circ \alpha}]_{ij} = a_{ij}^{\alpha}$.
In the same spirit, for $A \in \mathbb{R}^{m \times n}$, the row-wise maximum and minimum vectors are denoted $A^{\vee}, A^{\wedge} \in \mathbb{R}^m$, where $A^{\vee}_i = \max_{j \in [n]} a_{ij}$ and $A^{\wedge}_i = \min_{j \in [n]} a_{ij}$. Finally, we can similarly define vectors of indices obtained by taking the index maximizing/minimizing the elements of A in each row, i.e. $\arg A^{\vee} \in [n]^m$ is defined by $[\arg A^{\vee}]_i = \arg\max_{j \in [n]} a_{ij}$, and similarly for $\arg A^{\wedge}$.
A.2 Branch-and-Bound
Claim
In the worst case, the number of non-dominated vectors at stage b of Algorithm 1 is $C^b$.
Proof. We construct the chromosome scores as follows: draw $u_{ij}$ i.i.d. from a continuous distribution, ∀i ∈ [M], ∀j ∈ [C]. For each $u_{ij}$, set the vector $X_{ij\bullet} \equiv (u_{ij}, u_{ij}, .., u_{ij}, -(T-1)u_{ij})$. At stage b of Algorithm 1, any vector present is of the form $\sum_{i=1}^{b} X_{i c_i \bullet} = (u, u, .., u, -(T-1)u)$ for some choices $c_i \in [C]$, where $u = \sum_{i=1}^{b} u_{i c_i}$. With probability one, the u values of the different linear combinations are all distinct. Since the coordinates of any vector $(u, u, .., u, -(T-1)u)$ sum to zero, two such vectors with distinct u values are incomparable; hence at stage b we get a set of $C^b$ distinct, Pareto-optimal vectors, and in this case the Branch-and-Bound algorithm does not exclude any of them, reaching all $C^M$ linear combinations at the final stage.
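The construction can be verified numerically (a sketch; the sign making each vector's coordinates sum to zero, which forces pairwise incomparability, is our reading of the construction):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(8)
M, C, T = 4, 2, 3
u = rng.normal(size=(M, C))

# X_{ij.} = (u_ij, ..., u_ij, -(T-1) u_ij): every coordinate sum is zero.
X = np.zeros((M, C, T))
X[:, :, : T - 1] = u[:, :, None]
X[:, :, T - 1] = -(T - 1) * u

def dominates(x, y):
    return bool(np.all(x >= y) and np.any(x > y))

sums = [X[np.arange(M), c, :].sum(axis=0) for c in product(range(C), repeat=M)]
# All C**M partial sums lie on the same zero-sum line and are incomparable,
# so none would ever be pruned by a dominance test.
assert len(sums) == C ** M
assert not any(dominates(a, b) for a in sums for b in sums if a is not b)
```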
A.3 Divide-and-conquer
Remark 1. Suppose that we divide the M chromosomes into b blocks $B_1, .., B_b$, forming a disjoint partition of [M]. Let $P_i$ be the set of Pareto-optimal vectors obtained by running Algorithm 1 on $X_{B_i \bullet\bullet}$, for i = 1,.., b. Furthermore, define $P_i^{\vee}$ ($P_i^{\wedge}$) as the vector obtained by taking in each coordinate j the maximum (minimum) over all vectors in $P_i$, ∀j ∈ [T]. Then, for a monotonically non-increasing loss:
$$\mathcal{L}\Big(\sum_{i=1}^{b} P_i^{\vee}\Big) \leq \mathcal{L}(X_{c^*}) \leq \mathcal{L}\Big(\sum_{i=1}^{b} P_i^{\wedge}\Big). \qquad (6)$$
Based on eq. (6), it is possible to design an algorithm that approximates the true loss by providing upper and lower bounds. When these bounds are close to each other, we may stop the algorithm, while if they are far from each other, we may continue by taking the union of Bi’s to get fewer and larger blocks.
We can also exclude some vectors, similarly to the above. Namely, let $v_i^*$ be the optimal vector for $X_{B_i \bullet\bullet}$, i.e. the vector in $P_i$ minimizing the loss. Then:
$$\mathcal{L}(X_{c^*}) \leq \mathcal{L}\Big(\sum_{i=1}^{b} v_i^*\Big). \qquad (7)$$
The upper bound $\mathcal{L}\big(\sum_{i=1}^{b} v_i^*\big)$ is tighter (smaller) than the upper bound $\mathcal{L}\big(\sum_{i=1}^{b} P_i^{\wedge}\big)$.
We can use the bound to get a divide-and-conquer approach detailed in Algorithm.
A.4 Computing the Gradient
We show, as examples, the gradient computations for the stabilizing-selection loss and for the disease loss. We also show that the relaxed optimization Problem 2 is convex in the first case, but not in the second.
Consider the stabilizing selection loss $\mathcal{L}(X_c) = \|X_c\|_2^2 = \sum_{k=1}^{T} X_{ck}^2$. In terms of the relaxed variables, the loss becomes:
$$\mathcal{L}(C) = \big\|(X \times_2 C)^\top 1_M\big\|_2^2 = \sum_{k=1}^{T} \Big(\sum_{i=1}^{M} \sum_{j=1}^{C} C_{ij} X_{ijk}\Big)^2. \qquad (8)$$
We next compute the gradient and show that the problem is convex:
Claim. The loss in eq. (8) is convex in C.
Proof. Denote the relaxed score vector $X_c = (X \times_2 C)^\top 1_M$. The partial derivatives are given by:
$$\frac{\partial \mathcal{L}}{\partial C_{ij}} = 2 \sum_{k=1}^{T} X_{ck} X_{ijk}.$$
Therefore, the gradient is, in matrix form:
$$\nabla_C \mathcal{L} = 2\, (X \times_3 X_c).$$
The Hessian elements are given by:
$$\frac{\partial^2 \mathcal{L}}{\partial C_{ij}\, \partial C_{i'j'}} = 2 \sum_{k=1}^{T} X_{ijk} X_{i'j'k}.$$
If we vectorize each matrix $X_{\bullet\bullet k}$ to get a vector $vec(X_{\bullet\bullet k}) \in \mathbb{R}^{MC}$, the Hessian can be written in matrix form as:
$$\nabla^2_C \mathcal{L} = 2 \sum_{k=1}^{T} vec(X_{\bullet\bullet k})\, vec(X_{\bullet\bullet k})^\top.$$
Hence the Hessian is a sum of positive semi-definite rank-one matrices, and is therefore positive semi-definite, so the loss is convex in C.
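The derivation can be checked numerically (a sketch under the assumption that the relaxed loss is the squared norm of the relaxed score vector):

```python
import numpy as np

rng = np.random.default_rng(7)
M, C, T = 4, 3, 2
X = rng.normal(size=(M, C, T))

def loss(W):                                   # ||X_c||^2 for the relaxed W
    return float(np.sum(np.einsum('ijk,ij->k', X, W) ** 2))

def grad(W):                                   # 2 * (X x_3 X_c)
    xc = np.einsum('ijk,ij->k', X, W)
    return 2.0 * np.einsum('ijk,k->ij', X, xc)

W = rng.random((M, C))
G = grad(W)
eps = 1e-6
for _ in range(5):                             # finite-difference gradient check
    i, j = rng.integers(M), rng.integers(C)
    E = np.zeros((M, C)); E[i, j] = eps
    fd = (loss(W + E) - loss(W - E)) / (2 * eps)
    assert abs(fd - G[i, j]) < 1e-4

# Hessian = 2 * sum_k vec(X..k) vec(X..k)^T is PSD, so the loss is convex.
V = X.reshape(M * C, T)                        # column k holds vec(X..k)
H = 2.0 * V @ V.T
assert np.min(np.linalg.eigvalsh(H)) > -1e-10
```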
Recall the disease loss $\mathcal{L}(X_c) = \sum_{k=1}^{T} w_k P(d_k = 1 \mid X_{ck})$, with the conditional disease probabilities given by the liability-threshold model, $P(d_k = 1 \mid X_{ck}) = \Phi\big((X_{ck} - z_{K_k})/\sigma_k\big)$, where $z_{K_k}$ is the liability threshold for disease k and $\sigma_k^2$ is the variance of the residual (non-score) liability. Taking the partial derivatives with respect to the relaxation variables yields:
$$\frac{\partial \mathcal{L}}{\partial C_{ij}} = \sum_{k=1}^{T} \frac{w_k}{\sigma_k}\, \phi\Big(\frac{X_{ck} - z_{K_k}}{\sigma_k}\Big) X_{ijk},$$
and the gradient is, in matrix form and using the tensor-by-vector product, $\nabla_C \mathcal{L} = X \times_3 v$, with $v_k = \frac{w_k}{\sigma_k} \phi\big((X_{ck} - z_{K_k})/\sigma_k\big)$.
We next compute the Hessian matrix, $\nabla_C^2 \mathcal{L} = \sum_{k=1}^{T} \alpha_k\, vec(X_{\bullet\bullet k})\, vec(X_{\bullet\bullet k})^\top$, where $\alpha_k = \frac{w_k}{\sigma_k^3} (z_{K_k} - X_{ck})\, \phi\big((X_{ck} - z_{K_k})/\sigma_k\big)$. We have $sign(\alpha_k) = sign(z_{K_k} - X_{ck})$; therefore the sign of $\alpha_k$ changes as we change $X_{ck}$, hence the loss is not convex in C.
A.5 A Closed-form Relaxation
For the stabilizing selection loss, it is possible to obtain a closed-form solution to the relaxed Problem 2 of minimizing a quadratic loss under linear constraints by adding Lagrange multipliers [12]. Define:
Then:
Taking , ∀j ∈ [M], we can represent the above in matrix form: where is defined by: .
We can stack the columns of C, and similarly stack the fibers of the fourth-order tensor A, to get the following problem: where mat(A) is a matrix in which each row contains the rows of the matrix A(i, j) concatenated, vec(C) is a vector obtained by stacking the columns of C, and E is a matrix encoding the equality constraints, given by $e_{ij} = 1$, ∀i ∈ [M], ∀j ∈ [C(i − 1) + 1, Ci], and $e_{ij} = 0$ otherwise.
The solution for the above problem is given via Lagrange multipliers as a solution of the linear system:
Since mat(A) is a sum of T matrices of rank 1, we have rank(mat(A)) ≤ T.
When T + 2M < (C + 1)M, the above system has an infinite subspace of solutions. When T + 2M ≥ (C + 1)M, there is typically a unique solution for C, obtained by solving the above system; the selection vector c is then recovered by reshaping the solution vector vec(C) into an M × C matrix (the mat operation, stacking M consecutive equally-sized sub-vectors as rows) and taking the index of the maximal entry in each row.
The closed-form solution in eqs. (22,23) can be used instead of Algorithm 2 for the stabilizing selection loss.
When T + 2M < (C + 1)M the inverse in eq. (22) can be replaced by the Moore-Penrose pseudo-inverse, yielding the minimal Euclidean-norm solution vec(C) which is rounded to get c.
Regularization
The relaxation in the previous section may yield outputs with many non-zero entries that are far from the vertices of the polytope of stochastic matrices. To obtain a solution closer to one of the vertices, we add a sparsity-promoting term to the optimization Problem 2. The standard L1 regularization often employed to promote sparsity is inappropriate here, since every row already satisfies $\sum_j C_{ij} = 1$, hence the sum of elements is constant. Several previous works have suggested algorithms for sparse projection and optimization over the simplex [24, 27]. Our space is a Cartesian product of multiple simplices, where each simplex represents a different row of the matrix C, and we employ a similar technique to promote sparsity. Specifically, we add to the optimization criterion of Problem 2 a negative quadratic loss term [27], $\eta \|C\|_F^2$, where η < 0 and $\|C\|_F^2$ is the squared Frobenius norm. This term promotes solutions with a high Frobenius norm, which are likely to be concentrated on a few entries. Incorporating the additional term in the optimization is straightforward. For example, the term −2ηC is added to the gradient of the loss in Algorithm 2. The closed-form solution for the regularized problem with the stabilizing-selection loss is obtained by simply replacing the term mat(A) with mat(A) − ηI_{MC} in eqs. (19–22). Similarly to ridge regression, the addition of the regularization term yields a unique solution even when the matrix on the left-hand side of eq. (21) is singular and the least-squares solution is not unique, as is the case whenever T + 2M < (C + 1)M.
B Quantitative Genetics
To put the abstract problem presented in the previous section in the context of current practice in embryo selection, we describe here a simple quantitative genetics model for embryo selection for multiple quantitative traits and diseases.
B.1 A Statistical Model for Chromosomal Selection
We describe here a statistical model for the joint distribution of the scores and the non-score components determining a phenotype, in a set of C genomes and for T complex traits. The model is related to and extends the models used in [21, 26] for multiple traits, with two main differences: first, we model explicitly the joint distribution of the individual chromosomes' scores, whereas in [21, 26] a model was given for the entire genomic score. Second, [21, 26] considered embryos derived from the same two parents, yielding a specific genetic relationship matrix $\Sigma^{(C)}$ representing the Identity-by-Descent sharing of siblings, while in our case the genetic relationship matrix may be more general, depending on the selection scenario.
We assume that the genetic architecture of the traits is infinitesimal, namely that there are numerous causal variants, uniformly distributed along the genome. Denote the matrix of quantitative trait values by $Z \in \mathbb{R}^{C \times T}$, where $Z_{ij}$ denotes the value of trait j for the i-th copy. We can decompose Z as follows:
$$Z = \sum_{i=1}^{M} X_{i\bullet\bullet} + Y + \varepsilon,$$
where the term Y represents the genetic components not accounted for by the scores X, and the term ε represents a matrix of environmental components, both having zero mean.
We assume that all the traits have mean zero and variance 1, and further that the individual chromosome scores also have zero mean.
We further assume that the distribution of the polygenic scores X is approximately Normal in each embryo (due to the polygenic nature of most complex traits [38]), and that the joint distribution of the polygenic scores over n embryos is multivariate Gaussian,
Consider T traits normalized to have zero mean and unit variance. Let be a matrix of polygenic scores for the C copies, obtained by summing the individual chromosome scores Xi••, and similarly let . The vector of polygenic scores for a single genomic copy for all traits has a covariance matrix Σ(X) under a Normal model: where contains the variance explained by the polygenic scores of each trait, and the off-diagonal elements of Σ(X) represent pleiotropic effects. For C full-genome copies we obtain a C × T matrix of polygenic scores with a matrix Normal distribution [13]:
The matrix Σ(C) represents (twice) the kinship coefficients between the C full-genome copies. For example, when the copies represent sibling embryos (as is the case for embryo selection), . We assume that the chromosome scores are independent, with the scores matrix of each chromosome having the matrix Normal distribution: where is the proportion of genetic variance explained by chromosome i. The genetic variances satisfy , and this proportion is assumed to be the same for all traits, a consequence of the infinitesimal model and provided that the relative density of causal variants across the genome is similar across traits.
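The matrix-normal structure of the chromosome scores can be checked numerically. The following Python sketch uses toy values for Σ(C), Σ(X), and the variance shares αi (not the paper's values): it draws per-chromosome score matrices Xi ~ MN(0, Σ(C), αiΣ(X)) via Cholesky factors and verifies that the trait covariance of the summed genome scores recovers Σ(X).

```python
import numpy as np

rng = np.random.default_rng(3)
C, T, M = 4, 2, 5
Sigma_C = 0.5 * np.eye(C) + 0.5                 # toy sibling-like relatedness matrix
Sigma_X = np.array([[0.3, 0.1], [0.1, 0.2]])    # toy per-trait score (co)variances
alpha = np.ones(M) / M                          # equal variance share per "chromosome"

A = np.linalg.cholesky(Sigma_C)
B = np.linalg.cholesky(Sigma_X)

def draw_scores(rng):
    """One C x T genome score matrix: sum over chromosomes of independent
    matrix-normal draws X_i ~ MN(0, Sigma_C, alpha_i * Sigma_X)."""
    return sum(np.sqrt(a) * (A @ rng.normal(size=(C, T)) @ B.T) for a in alpha)

S = np.stack([draw_scores(rng) for _ in range(50000)])
emp_cov = np.cov(S[:, 0, :].T)   # trait covariance for one genome copy, ~ Sigma_X
```

Since the chromosome scores are independent and the αi sum to one, the empirical covariance of a single copy's genome-wide scores should match Σ(X) up to Monte Carlo error.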
These contributions determine the utility of chromosomal selection, and are expected to be roughly proportional to the chromosomes' lengths or to their number of genes. Here, we show a numerical analysis with the (normalized) chromosome lengths as in [2], shown in Table 2. The actual coefficients may deviate from this rough estimate and vary from trait to trait, based on the distribution of causal alleles along the genome for each trait. Methods for partitioning heritability [10, 40] can be used to estimate these coefficients for specific traits, and in case of significant deviations, eq. (27) can be modified accordingly.
Similarly, the non-score genetic components are modeled as: where Σ(X) + Σ(Y) + Σ(e) = IT. The matrix Σ(X) + Σ(Y) is known as the genetic covariance matrix, and can be estimated from GWAS data using methods such as LD Score Regression [3]. Its diagonal elements are the narrow-sense heritabilities of the traits.
The matrix Σ(Y) + Σ(e) is the covariance matrix of the residuals, and determines the conditional distribution of the phenotype vector given the score vector. For simplicity, our model makes several standard assumptions: no shared environment (hence the identity IC is used as a covariance matrix for ε), and no assortative mating. Violations of these assumptions can be encoded in the covariance matrices of our model.
The expected gain
The gain defined in eq. (2) is a random variable, with a sample space over all theoretical sets of C copies. In the following, we will derive the approximate mean of the gain for linear loss functions , as a function of the loss parameters, and of C, Σ(C), and Σ(X).
For embryo selection with a linear loss, selection is performed on the vector of scores , with the joint distribution:
It was shown in [21] that . Moreover, and
Using extreme value theory for the above, we get as in [21] the approximate gain from embryo selection:
Next, we will compare this result to the gain obtained from chromosomal selection. For each individual chromosome i, the distribution of the scores vectors is
Since selection is performed for each block separately, and using again the asymptotic approximation from [21] for the covariance matrices of the individual chromosomes' scores, the gain can be written as:
Hence the expected gain due to chromosomal selection is roughly ∑i √αi-fold higher compared to the expected gain from embryo selection in eq. (32). For the αi values in Table 2, this gives a 4.68-fold difference between the gains.
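This ratio is easy to probe by simulation. The Python sketch below treats the C copies as independent for simplicity (ignoring the relatedness matrix Σ(C)) and uses arbitrary stand-in values for the variance shares αi rather than the lengths in Table 2; under these assumptions, the empirical ratio of mean gains approaches ∑i √αi.

```python
import numpy as np

rng = np.random.default_rng(0)
M, C, n_rep = 23, 10, 20000
alpha = np.arange(M, 0, -1, dtype=float)    # stand-in for chromosome length shares
alpha /= alpha.sum()

emb, chrom = np.empty(n_rep), np.empty(n_rep)
for r in range(n_rep):
    # independent chromosome scores: X_ic ~ N(0, alpha_i), for C copies
    X = rng.normal(0.0, np.sqrt(alpha)[:, None], size=(M, C))
    emb[r] = X.sum(axis=0).max()     # embryo selection: best whole genome
    chrom[r] = X.max(axis=1).sum()   # chromosomal selection: best copy per chromosome

ratio = chrom.mean() / emb.mean()    # should approach sum_i sqrt(alpha_i)
```

The comparison works because each chromosome's maximum scales with its standard deviation √αi, while the whole-genome maximum scales with the total standard deviation, here normalized to one.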
B.2 Disease Traits
Consider a disease with population prevalence K, and let X be the polygenic score with variance explained on the liability scale, using the liability threshold model: with z = X + ε. The polygenic scores X1, …, Xn can be thought of as liabilities, where the actual disease score is modeled as zi = Xi + εi, with the εi being random variables representing both the environmental contribution and unaccounted-for genetic effects. The resulting disease status of each individual is given by thresholding the zi's: Di = 1{zi < Φ−1(K)}.
We select the embryo with maximal score Xmax as in the quantitative trait example, and denote by imax the index of this embryo. As shown in [26], the risk for disease for the embryo with maximal polygenic score is given by a convolution: and the expected (absolute) gain for the single disease loss is:
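To make this concrete, here is a small Monte Carlo sketch in Python of the absolute gain for a single disease. The prevalence, variance explained, and number of embryos are toy values, and the embryos' scores are treated as independent for simplicity (the actual model includes the sibling covariance Σ(C)).

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
K, r2, C, n_rep = 0.05, 0.3, 5, 200000       # prevalence, score variance, embryos
thr = NormalDist().inv_cdf(K)                # liability threshold Phi^{-1}(K)

X = rng.normal(0.0, np.sqrt(r2), size=(n_rep, C))   # embryo polygenic scores
eps = rng.normal(0.0, np.sqrt(1 - r2), size=n_rep)  # residual liability of the pick
z = X.max(axis=1) + eps                      # liability of the max-score embryo
risk = (z < thr).mean()                      # disease iff z < Phi^{-1}(K)
gain = K - risk                              # expected absolute gain
```

Selecting the maximal-score embryo shifts the liability distribution upward, away from the disease region z < Φ−1(K), so the selected embryo's risk is below the prevalence K.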
Multiple Diseases
We consider screening for T multiple diseases, with the polygenic risk scores given in eq. (26), and with a prevalence vector K = (K1, …, KT). We need to define a loss function representing the trade-offs of reducing risk for multiple diseases - for example, the probability of being disease-free. The next section formalizes the goal of selection for multiple quantitative traits or diseases, and in addition presents the problem of chromosomal selection.
We next define the associated disease status and disease probabilities for chromosomal selection.
For a vector of prevalences K ∈ [0, 1]T and a residual covariance matrix Σ(Y) + Σ(e), let Y·1M + ε ~ N(0, Σ(Y) + Σ(e)). Then, the disease status is a vector random variable defined as: where the indicator function is taken element-wise.
The associated disease probability for a given binary vector d ∈ {0, 1}T is
The marginal disease probability for disease i and status j = 0, 1 is given by:
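Under this model, the marginal risk of disease i given a selected score vector x is Φ((Φ−1(Ki) − xi)/√(Σ(Y)+Σ(e))ii), since the residual liability of trait i is Normal with variance (Σ(Y)+Σ(e))ii. The following Python sketch (toy prevalences, residual covariance, and score vector) checks this closed form against a Monte Carlo estimate:

```python
import numpy as np
from statistics import NormalDist

nd = NormalDist()
rng = np.random.default_rng(4)
K = np.array([0.05, 0.10])                     # toy prevalences
R = np.array([[0.6, 0.1], [0.1, 0.7]])         # toy residual covariance Sigma(Y)+Sigma(e)
x = np.array([0.4, 0.2])                       # scores of a selected genome
thr = np.array([nd.inv_cdf(k) for k in K])     # per-disease liability thresholds

# closed-form marginal risk: P(D_i = 1 | x) = Phi((thr_i - x_i) / sqrt(R_ii))
analytic = np.array([nd.cdf((thr[i] - x[i]) / np.sqrt(R[i, i])) for i in range(2)])

res = rng.multivariate_normal(np.zeros(2), R, size=400000)
mc = ((x + res) < thr).mean(axis=0)            # Monte Carlo marginal risks
```

With positive scores x the marginal risks fall below the prevalences, mirroring the single-disease case.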
A chromosomal selection loss function is called a disease-loss function if there are vectors and a positive semi-definite matrix such that can be written as follows:
If the loss above can be written as follows for a vector : then the loss function is called a linear disease-loss function.
C Modified Selection Problems
We describe here a few additional scenarios that yield the chromosomal selection Problem 1 or variants of it.
C.1 Gamete Selection
Consider an intermediate case of gamete selection, where it is possible to select a sperm and an oocyte separately, as discussed in [2]. We consider T continuous phenotypes as in Section B.1. Consider Cp haploid sperm cells, and Cm haploid oocyte cells. Denote their scores matrices by
The Gamete selection problem is to select a single sperm cell ip ∈ [Cp] and a single oocyte im ∈ [Cm] minimizing the loss . See an illustration in Figure 3(a).
We can compute the expected gain for gamete selection with a linear loss, similarly to the derivations for embryo selection. First, similarly to eq. (26), we have where Σ(Cp), Σ(Cm) are the covariance matrices for the sperm and oocyte cells, respectively, with (twice) kinship coefficient of , and assuming that the covariance between the scores of any sperm and oocyte cell is zero. We also assume that the traits' variance matrices Σ(X) are equal, due to the symmetry of the maternal and paternal contributions to traits (we ignore here the contribution of the sex chromosomes).
With these matrices, we get for all ip ∈ [Cp], im ∈ [Cm]: and
Following the derivation for a single trait, we get:
For example, suppose that we have an equal number of sperm and oocyte cells Cp = Cm = C. Then the gain is ≈ , a -factor improvement over the gain from embryo selection with the same C shown in eq. (32).
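The improvement from independent gamete pools can be illustrated with a short simulation. The Python sketch below compares selecting the best sperm and best oocyte separately against selecting the best of C pre-formed pairs, under the simplifying assumption of independent, unit-variance gamete scores (which does not reproduce the sibling structure behind eq. (32)); under this toy setup the ratio of mean gains approaches √2.

```python
import numpy as np

rng = np.random.default_rng(2)
C, n_rep = 10, 100000
sp = rng.normal(size=(n_rep, C))   # sperm-cell scores (unit variance)
so = rng.normal(size=(n_rep, C))   # oocyte scores (unit variance)

gamete = sp.max(axis=1) + so.max(axis=1)   # select best sperm and best oocyte separately
embryo = (sp + so).max(axis=1)             # select best of C pre-formed pairs
ratio = gamete.mean() / embryo.mean()      # ~ sqrt(2) under these assumptions
```

Intuitively, separate selection captures the extreme of each pool (2σ·E[max]), while pair selection captures only the extreme of the sums (√2·σ·E[max]).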
C.2 Chromosomal Selection for Multiple Sperm Cells and Oocytes
Suppose that we face the scenario in the previous sub-section, except that it is possible to select different chromosomes from different sperm cells, and similarly different chromosomes from different oocytes (assuming the scores can be computed from the cells in a non-destructive manner). Then, we face a chromosomal selection problem similar to Problem 1, except that the number of available copies may differ between chromosomes (either Cp or Cm), as shown in Figure 3(b). When Cp = Cm, the problem reduces back to Problem 1.
C.3 Chromosomal Selection from Multiple Diploid Cells
Consider C/2 diploid cells (for even C), and suppose that we select for each chromosome two copies in an arbitrary manner for the fertilized embryo (for example, it may be possible to select both copies from the same diploid cell). Then, we face a chromosomal selection problem with M = 23, except that two copies, rather than one, are selected from the scores tensor X. Algorithms 1 and 2 can be adapted in a rather straightforward manner to handle this case, and their implementation and study remain for future work.
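For a single trait and a linear loss, the two-copy variant is simple: it amounts to taking the two largest scores per chromosome rather than the single largest. A minimal Python sketch with random toy scores (this is not an adaptation of Algorithms 1–2, which handle general losses):

```python
import numpy as np

rng = np.random.default_rng(5)
M, C2 = 23, 6                        # chromosomes, haploid copies per chromosome
X = rng.normal(size=(M, C2))         # single-trait chromosome scores

# pick the two best copies per chromosome (vs. the single best in Problem 1)
top2 = np.sort(X, axis=1)[:, -2:]
diploid_score = top2.sum()           # total score of the assembled diploid genome
```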
D Implementation Details
All algorithms and the simulation study are implemented as part of an R package called "EmbryoSelectionCalculator", available at https://github.com/orzuk/EmbryoSelectionCalculator, with the functions related to chromosomal selection located in the chrom sub-directory. To speed up computations, the code for finding Pareto-optimal vectors was implemented in C++ and linked using Rcpp [8]. To avoid combinatorial explosion of the Branch-and-Bound algorithm, a heuristic of passing only the top B vectors at each step whenever the number of partial Pareto-optimal vectors exceeds B was implemented as optional, and used with the default value of B = 10,000. An additional optional improvement to the relaxation Algorithm 2 is also implemented: instead of rounding the solution of the relaxed problem C(t+1) to the nearest vertex of the polytope of stochastic matrices, it is possible to draw the selection variables ci at random from a categorical distribution with values {1, …, C} and with probabilities given by the i-th row of C(t+1). One can draw independently multiple such vectors (default value: R = 1,000) and output the solution minimizing the loss among the resulting Xc score vectors. Additional details about the software implementation and usage are available in the package documentation.
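The randomized-rounding step can be sketched as follows. This Python sketch mirrors the description above but is not the package's R code; the function name and arguments are hypothetical.

```python
import numpy as np

def randomized_rounding(C_rel, X, loss, R=1000, rng=None):
    """Draw R selection vectors from the categorical distributions given by the
    rows of the relaxed stochastic matrix C_rel (M x C), and keep the one
    minimizing the loss of the assembled score vector. X is the M x C x T
    scores tensor; loss maps a T-vector to a scalar."""
    rng = rng or np.random.default_rng()
    M, C = C_rel.shape
    best_c, best_val = None, np.inf
    for _ in range(R):
        # chromosome i's copy is drawn from the categorical distribution in row i
        c = np.array([rng.choice(C, p=C_rel[i]) for i in range(M)])
        xc = X[np.arange(M), c].sum(axis=0)   # sum the selected chromosome scores
        val = loss(xc)
        if val < best_val:
            best_c, best_val = c, val
    return best_c, best_val
```

With enough draws, the sampled vertices concentrate around the high-probability rows of C_rel, so the best sampled selection is typically at least as good as naive nearest-vertex rounding.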