Uncovering complementary sets of variants for the prediction of quantitative phenotypes

Recent genome-wide association studies (GWAS) show that variations at single genetic loci, frequently called single nucleotide polymorphisms (SNPs), are not sufficient on their own to explain the phenotypic heritability of complex, quantitative phenotypes. Instead, many methods attempt to address this issue by considering a set of loci that together characterize the phenotype. While state-of-the-art methods are successful in selecting subsets of SNPs that achieve high phenotype prediction rates, they are either slow in runtime or have hyper-parameters that require further fine-tuning through cross-validation or similar techniques, which makes such methods inconvenient to use. In this work, we propose a fast and simple algorithm named Macarons to select a small, complementary subset of SNPs by avoiding redundant pairs of SNPs that are likely to be in linkage disequilibrium (LD). Our method features two interpretable parameters that control the time/performance trade-off without requiring any hyper-parameter optimization procedure. In our experiments, we benchmark the performance of the SNP selection methods on the 17 flowering time phenotypes of Arabidopsis thaliana. Our results consistently show that Macarons has similar or better phenotype prediction performance while being faster and having a simpler premise than other SNP selection methods.


A brief overview of GWAS and the SNP selection problem
Genome-Wide Association Studies (GWAS) attempt to find a relation between genetic variations and a phenotype. In recent studies, many individual variants, also known as single nucleotide polymorphisms (SNPs), have been found to be directly related to various diseases or disorders including type II diabetes, obesity and schizophrenia, as well as other quantitative traits like height (Goldstein et al., 2009; Visscher et al., 2017). Yet, the main drawback of focusing on individual SNPs is that they typically fail to explain complex phenotypes (Manolio et al., 2009). Thus, as a more powerful approach, many studies suggest focusing on combinations or sets of SNPs instead of analyzing them individually (Wang et al., 2010; Cordell, 2009; Phillips, 2008; Wei et al., 2014). This study is mainly focused on the problem of finding a subset of SNPs that are together predictive of a phenotype. For the sake of brevity, we will simply refer to it as the SNP selection problem throughout this paper.
Approaches investigating combinations of SNPs (e.g., epistasis) require limiting the search space
Finding combinations of SNPs that are predictive of a phenotype is computationally challenging due to the sheer number of possible SNP combinations that need to be considered. For this reason, some methods focusing on high-order interactions use exhaustive searches or greedy algorithms on a small, limited pool of SNPs, usually no more than a few hundred (Ritchie et al., 2001; Fang et al., 2012). While such pools of "promising SNP candidates" are typically obtained using a priori information sources that limit the analysis to coding regions of the genome, it is also possible to perform other types of filtering based on automated searches (Ding et al., 2015). Indeed, for combinatorial studies investigating the pair-wise interactions (more commonly known as epistasis) between the variants in the full genome, recent studies show the importance of limiting the search space through prioritization of the tests (Piriyapongsa et al., 2012; Cowman and Koyutürk, 2017), both to alleviate the computational intensity of the task and to improve the overall statistical power. Specifically, a recent study (Caylak and Cicek, 2020) demonstrates the utility of an initial filtering of SNPs based on an automated SNP selection algorithm as a powerful approach that can improve the statistical power considerably.

There are limited options for the SNP selection problem in quantitative phenotypes
While there are a number of approaches to choose from for binary phenotypes (e.g., as in case/control studies) (Yu and Liu, 2004; Ding and Peng, 2005; Guestrin et al., 2005; Van Hulse et al., 2012; Ayati and Koyutürk, 2016; Urbanowicz et al., 2018), the options for the SNP selection problem in quantitative phenotypes are more limited. In the machine learning community, this problem essentially translates to a feature selection task for regression analysis, and there are many established methods for this purpose. However, due to the high-dimensional nature of the problem (hundreds of samples, millions of SNPs), established methods for feature selection such as linear regression with l1 (lasso) regularization (Tibshirani, 1996; Grave et al., 2011), spectral-relaxation based approaches (Zhao and Liu, 2007; Das et al., 2012), graph-constrained feature selection methods like GraphLasso and GroupLasso (Meier et al., 2008; Jacob et al., 2009), as well as various other methods with sparsity constraints known in the bioinformatics community in the context of other problems (e.g., for selecting gene sets) (Li and Li, 2008; Jia et al., 2011; Liu et al., 2017), are simply too computationally intensive for the SNP selection problem and do not extend to the full scale of variants in the human genome.
To achieve a scalable solution for all known variants in the genome, a new body of SNP selection algorithms has emerged that are designed with computational efficiency in mind (Azencott et al., 2013; Yilmaz et al., 2019). Such algorithms, instead of considering the combinatorial effect of the SNPs on a phenotype, simplify the problem by focusing on a linear combination of the individual phenotype associations of SNPs, together with other a priori information encoded in the form of a biological network, to improve the overall predictivity. In particular, SConES (Azencott et al., 2013) uses a minimum-cut solution under sparsity and connectivity constraints on a SNP-SNP network, and more recently, SPADIS (Yilmaz et al., 2019) uses a SNP-SNP network to enforce the diversity of the selected SNPs.

The drawbacks of existing methods
Linkage disequilibrium (LD), which refers to the non-random association of variants, is a common phenomenon for close variants on the same chromosome (Ardlie et al., 2002). While the connectivity constraint of SConES helps to improve the robustness of the selections, it also promotes the selection of SNPs that are in LD, thereby impairing the phenotype predictions. On the other hand, SPADIS seeks to increase the diversity of SNPs by penalizing the selection of close SNPs on the input network. While this diversity helps to avoid redundant SNPs in LD and improves the phenotype predictions, the main drawback of SPADIS is that it includes two hyper-parameters without interpretable meanings or default values, which need to be optimized through an external procedure such as cross-validation. The need for such external procedures not only makes the method hard to apply from a user's viewpoint, but also considerably exacerbates the run time and reduces the robustness of the selections, as we demonstrate in our results. To overcome these limitations, we identified three main objectives a SNP selection algorithm should satisfy: (i) be as predictive as other state-of-the-art SNP selection methods for the prediction of quantitative phenotypes, (ii) be fast enough to consider all variants in the genome as the search space, and (iii) be easy to use without requiring any hyper-parameter optimization procedure. Thus, we propose a new algorithm named Macarons that takes into account the correlations between SNPs to avoid the selection of redundant pairs of SNPs in linkage disequilibrium. Overall, Macarons features two simple, interpretable parameters to control the time/performance trade-off: the number of SNPs to be selected (k), and the maximum intra-chromosomal distance (D, in base pairs) used to reduce the search space for redundant SNPs. Note that, since the parameters have interpretable meanings, they can be determined in advance (without requiring any hyper-parameter optimization procedure) with the available computational resources and the goals of further studies in mind.

Background
Here, we first start by defining the problem as follows: We are given as input a ground set of SNPs V of size n, a genotype matrix X ∈ {0, 1, 2}^(m×n) encoding the number of alternate alleles for m samples and n SNPs, and a phenotype vector Y ∈ R^(m×1) containing quantitative values for m samples. Due to the nature of the data, the number of SNPs n is much larger than the number of samples m (n ≫ m). Thus, we would like to obtain a small subset of SNPs S = {s_1, s_2, ..., s_k} of size k that maximizes the prediction performance for the given phenotype vector Y based on a regression model M.
Since this is a computationally intensive problem, state-of-the-art methods model the effect of each SNP on the phenotype individually through an association score such as the sequence kernel association test (SKAT), and aim to combine these scores using linear models with additional constraints to reduce the overall computational complexity. Specifically, SPADIS uses a greedy algorithm (given in Algorithm 1) that iteratively selects SNPs based on a gain function of the form:

Ḡ(S_t, s_x) = c_x + β Σ_{s_i ∈ S_t} K(s_i, s_x)    (Equation 1)

where S_t is the subset of selected SNPs at the t-th iteration of the algorithm, s_x is a candidate SNP being considered, c_x is the phenotype association score of SNP s_x, K(s_i, s_x) is a function that applies a penalization for the pair (s_i, s_x) based on their distance on an input network, and β is an adjustable parameter that converts c_x and K(s_i, s_x) to comparable scales.

Macarons
Here, we propose a similar approach based on a simpler and more interpretable gain function G(S_t, s_x) that penalizes the selection of highly correlated SNPs:

G(S_t, s_x) = c_x (1 − R²(S_t, s_x))    (Equation 2)

where R²(S_t, s_x) determines the penalization to be applied to a candidate SNP s_x given the selected SNP set S_t = {s_1, s_2, ..., s_t}. Since we would like to penalize SNPs that would be redundant to add, we choose the squared multiple correlation as the penalization function. This gain function has three main differences from Equation 1: (i) instead of using the shortest-path distance on an input network as a proxy for diversification, it measures correlations between the selected SNPs directly from the data; (ii) instead of a linear formulation, it estimates the overall redundancy of a candidate SNP s_x using the multiple correlation coefficient R²(S_t, s_x); and (iii) it uses a multiplicative formula in order to avoid a hyper-parameter like β to bring the phenotype association scores and the penalization function to comparable scales.

Estimation of the penalization function
The main issue with the estimation of the multiple correlation is that it requires the computation of high-order interaction terms between the selected SNPs. This makes its estimation for a given data sample both computationally intensive (with O(mt² + t³) runtime complexity) and noisy due to over-fitting to the given sample instances. To help overcome these issues, we first express the calculation of the multiple correlation as a product of terms involving partial correlations:

1 − R²(S_t, s_x) = ∏_{i=1}^{t} (1 − R²(s_i, s_x | {s_1, ..., s_{i−1}}))    (Equation 3)

where R²(s_i, s_x | S') denotes the squared partial correlation between SNPs s_i and s_x given the SNPs within the set S'. Note that these partial correlations are also high-order terms and do not simplify the computation of the multiple correlation by themselves. For this purpose, we make the following simplifying assumption:

R²(s_i, s_x | {s_1, ..., s_{i−1}}) ≈ r²(s_i, s_x)    (Equation 4)

where r²(s_i, s_x) is the squared zero-order correlation coefficient (i.e., ordinary Pearson's correlation) between SNPs s_i and s_x. Thus, with this assumption, the estimation of the multiple correlation simplifies to:

R̃²(S_t, s_x) = 1 − ∏_{i=1}^{t} (1 − r²(s_i, s_x))    (Equation 5)

This assumption both alleviates the over-fitting problem in the estimation of the multiple correlation (since it reduces the number of terms that need to be estimated from the data) and reduces the required computation time drastically (from O(mt² + t³) to O(mt)).
In the remaining sections of this manuscript, we will refer to the estimate of the squared multiple correlation R̃²(S_t, s_x) simply as the penalization function, and to the zero-order correlation r²(s_i, s_x) as the redundancy function.
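Under the simplification in Equation 4, the penalization function needs only pairwise Pearson correlations. The following is a minimal Python sketch of this estimate (a hypothetical helper written for illustration, not the authors' implementation; it assumes a NumPy genotype matrix with samples in rows and SNPs in columns):

```python
import numpy as np

def penalization(X, selected, x):
    """Approximate the squared multiple correlation R2_tilde(S, s_x)
    as 1 - prod_{s_i in S} (1 - r^2(s_i, s_x)), using only zero-order
    Pearson correlations (the simplification of Equation 5).

    X        : (m, n) genotype matrix, samples by SNPs
    selected : list of column indices of the selected SNPs (the set S)
    x        : column index of the candidate SNP s_x
    """
    if not selected:
        return 0.0  # no selected SNPs -> no redundancy penalty
    complement = 1.0
    for i in selected:
        r = np.corrcoef(X[:, i], X[:, x])[0, 1]
        complement *= 1.0 - r * r
    return 1.0 - complement
```

Note that a perfectly correlated SNP in `selected` drives the estimate to 1 regardless of the other terms, matching the intended behavior of zeroing the gain of fully redundant candidates.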

Properties of the penalization function
Similar to the squared multiple correlation R²(S, s_x), the penalization function R̃²(S, s_x) is always between 0 and 1 and exhibits diminishing returns. For a given SNP set S and a SNP s_x ∉ S, the penalization function R̃²(S, s_x) obeys the following intuitive properties:
• If S does not include any SNPs correlated with s_x, then the penalization of s_x is zero: i.e., if r(s_i, s_x)² = 0 ∀ s_i ∈ S, then R̃²(S, s_x) = 0.
• If S includes a single SNP s_i, the penalization of s_x is equal to their squared correlation: i.e., if S = {s_i}, then R̃²(S, s_x) = r(s_i, s_x)².
• If S includes a SNP s_i perfectly correlated with s_x, the penalization of s_x is maximal: i.e., if s_i ∈ S and r(s_i, s_x)² = 1, then R̃²(S, s_x) = 1 (thus, the gain function G(S, s_x) = 0).
• The penalization of s_x is monotonically non-decreasing and can only increase as the set grows: i.e., R̃²(S, s_x) ≤ R̃²(S ∪ {s_i}, s_x) ∀ s_i ∉ S ∪ {s_x}.
• Selecting a SNP s_i uncorrelated with s_x does not affect the penalization of s_x: i.e., if r(s_i, s_x) = 0, then R̃²(S ∪ {s_i}, s_x) = R̃²(S, s_x).
• If S includes two SNPs s_i and s_j, the penalization of s_x is equal to the sum of the individual squared correlations with s_i and s_j minus their expected overlap: i.e., if S = {s_i, s_j}, then R̃²(S, s_x) = r(s_i, s_x)² + r(s_j, s_x)² − r(s_i, s_x)² r(s_j, s_x)².
• If a new SNP s_i is added to a SNP set S, the joint penalization of s_x increases by the squared correlation of s_i with s_x minus its expected overlap with S: i.e., if S' = S ∪ {s_i}, then R̃²(S', s_x) = R̃²(S, s_x) + r(s_i, s_x)² − R̃²(S, s_x) r(s_i, s_x)².
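The last two properties can be verified numerically from the product form of R̃². A short check with hypothetical squared-correlation values (0.3, 0.5, 0.4 are arbitrary illustration values, not data from the paper):

```python
def pen_from_corrs(r2_list):
    """R2_tilde built directly from squared zero-order correlations:
    1 - prod (1 - r^2), following the product form of the estimate."""
    prod = 1.0
    for r2 in r2_list:
        prod *= 1.0 - r2
    return 1.0 - prod

base = pen_from_corrs([0.3, 0.5])          # R2_tilde(S, s_x)
updated = pen_from_corrs([0.3, 0.5, 0.4])  # after adding s_i with r^2 = 0.4

# Update property: R2_tilde(S', s_x) = R2_tilde + r^2 - R2_tilde * r^2
assert abs(updated - (base + 0.4 - base * 0.4)) < 1e-12

# Two-SNP property: r_i^2 + r_j^2 - r_i^2 * r_j^2
assert abs(base - (0.3 + 0.5 - 0.3 * 0.5)) < 1e-12
```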

Limiting the search space through intra-chromosomal distance
One particular issue with directly using the penalization function given in Equation 5 together with the gain function in Equation 2 and Algorithm 1 is that the overall runtime can still be slow for large k (number of selected SNPs), with an algorithmic complexity of O(nmk) due to the requirement of computing O(nk) correlation coefficients. For this reason, we make an additional simplifying assumption to limit the search space to intra-chromosomal SNP pairs within a specified distance. Specifically, we assume the following:

r²(s_i, s_j) = 0, if s_i and s_j are not on the same chromosome, or if d(s_i, s_j) > D    (Equation 6)

where d(s_i, s_j) is the intra-chromosomal distance between SNPs s_i and s_j (i.e., the absolute difference of their positions on the sequence) and D is an adjustable parameter (in base pairs) that controls the time/performance trade-off of the algorithm by limiting the search space for the redundancy estimations.

Formulation of Macarons algorithm
Overall, with the assumptions given in Equation 4 and Equation 6, the penalization function becomes:

r̃²(s_i, s_j) = r²(s_i, s_j), if s_i and s_j are on the same chromosome with d(s_i, s_j) ≤ D; 0 otherwise

Thus, the gain function becomes:

G(S_t, s_x) = c_x ∏_{s_i ∈ S_t} (1 − r̃²(s_i, s_x))

The Macarons algorithm that encodes this gain function for step-wise SNP selection is given in Algorithm 2.
Overall, it has O(nk + λ_D mk) runtime complexity, where the first term is for maximizing the gain function and the second term is for computing it (which requires the measurement of correlations from the data). Here, λ_D is a variable in [1, n] that depends on the D parameter: it represents the average number of SNP pairs that require the computation of a correlation for a given D threshold. Thus, the overall complexity is O(nk) for small D (when the first term dominates) and O(mnk) for large D (as the computation of Pearson correlations becomes the bottleneck).
Algorithm 2 Macarons Algorithm
Input: Ground SNP set V, chromosome numbers and positions for all SNPs s_i ∈ V, phenotype association scores c_i for all s_i ∈ V, cardinality constraint k, distance threshold D (in base pairs).
Output: Set S ⊆ V such that |S| = k.
1: S ← ∅
2: while |S| < k do
3:   for each candidate SNP s_x ∈ V \ S do
4:     G(s_x) ← c_x
5:     for all intra-chromosomal pairs (s_i, s_x) with s_i ∈ S and d(s_i, s_x) ≤ D do
6:       r(s_i, s_x) ← compute Pearson's correlation between s_i and s_x
7:       G(s_x) ← G(s_x) (1 − r²(s_i, s_x))
8:     end for
9:   end for
10:  S ← S ∪ {arg max_{s_x} G(s_x)}
11: end while

Results
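The loop structure above can be sketched compactly in Python. This is a simplified, hypothetical implementation (not the authors' code): instead of recomputing every gain per iteration, it keeps a running gain vector and, after each selection, multiplies the gains of nearby intra-chromosomal candidates by (1 − r²), which is equivalent because the penalty factors accumulate multiplicatively.

```python
import numpy as np

def macarons(X, scores, chrom, pos, k, D):
    """Greedy Macarons sketch.

    X      : (m, n) genotype matrix, samples by SNPs
    scores : phenotype association scores c_i (e.g., SKAT-like)
    chrom  : array of chromosome ids per SNP
    pos    : array of base-pair positions per SNP
    k      : number of SNPs to select
    D      : maximum intra-chromosomal distance for redundancy checks
    """
    gain = np.asarray(scores, dtype=float).copy()
    selected = []
    for _ in range(k):
        s = int(np.argmax(gain))
        selected.append(s)
        gain[s] = -np.inf  # never re-select this SNP
        # Candidates on the same chromosome within distance D
        near = np.where((chrom == chrom[s]) &
                        (np.abs(pos - pos[s]) <= D))[0]
        for x in near:
            if np.isfinite(gain[x]):
                r = np.corrcoef(X[:, s], X[:, x])[0, 1]
                gain[x] *= 1.0 - r * r  # multiplicative penalty
    return selected
```

For example, with two perfectly correlated nearby SNPs and one SNP on another chromosome, the algorithm picks one of the correlated pair and then skips its redundant twin in favor of the independent SNP.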

Datasets
We use the Arabidopsis thaliana dataset from Atwell et al. (2010). This dataset contains 17 phenotypes, all related to flowering times, with between 119 and 180 samples depending on the phenotype. Before applying any filters, the dataset contained 214051 SNPs. SNPs with a minor allele frequency < 10% were removed, leaving 173219 SNPs. The SNP-SNP network used for SPADIS and SConES was constructed by connecting SNPs that are adjacent on the chromosomes. We use two principal components of the genotype data to correct for population stratification (Price et al., 2006) and score each SNP using the Sequence Kernel Association Test (SKAT) (Wu et al., 2011).

Background Methods
We compare Macarons with the following methods: • Baseline: A simple method that picks the k SNPs with the highest SKAT scores.
• SConES: A SNP selection method that poses the SNP selection problem as a minimum-cut problem (Azencott et al., 2013).
• SPADIS: A SNP selection algorithm that selects a distant and complementary set of SNPs (Yilmaz et al., 2019).

Experimental Setting
We note that SConES does not directly use the cardinality constraint k, whereas both Macarons and SPADIS do. To make the comparison fair, during the parameter selection process for SConES we apply a binary search to find the parameters that yield a number of selected SNPs closest to k. We repeat the tests for k = 20, 50, 100, 250, 500, 1000 to get an idea of how the parameter k affects the methods.
Our testing scheme consists of a nested cross-validation scheme for each of the 17 phenotypes. We first use 10 cross-validation folds to split the data into train and test sets. Then, each outer fold uses 10 inner cross-validation folds for the parameter selection of the methods. After the methods select their parameters, they are tested on the corresponding test set. For each hyper-parameter, 6 values were evaluated using grid search and the best-performing one was chosen for testing. Since Macarons does not require any hyper-parameter selection, we tested it for two D values: D = 20 kbp, as suggested by Azencott et al. (2013), and D = ∞, to cover the entire chromosome. The performance (R²) was measured using the squared Pearson correlation coefficient. Each method uses ridge regression to set up and train the linear regressor; to choose the ridge regularization parameter, we tested 6 values on 5 cross-validation folds.
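The evaluation step can be illustrated with a NumPy-only sketch: fit ridge regression on the training genotypes of the selected SNPs and report squared Pearson correlation on held-out data. For brevity this uses a single fixed regularization strength `alpha` (the actual experiments choose among 6 values by 5-fold cross-validation), and the function name is ours:

```python
import numpy as np

def ridge_r2(X_tr, y_tr, X_te, y_te, alpha=1.0):
    """Closed-form ridge regression (intercept handled by centering),
    scored as squared Pearson correlation between held-out
    predictions and the true phenotype."""
    mu_x, mu_y = X_tr.mean(axis=0), y_tr.mean()
    A = X_tr - mu_x
    # Solve (A^T A + alpha I) w = A^T (y - mean(y))
    w = np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]),
                        A.T @ (y_tr - mu_y))
    pred = (X_te - mu_x) @ w + mu_y
    r = np.corrcoef(pred, y_te)[0, 1]
    return r ** 2
```

In the experiments, `X_tr`/`X_te` would be restricted to the columns chosen by each SNP selection method before fitting.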

Effect of the D parameter
The premise behind Macarons is to select a complementary set of SNPs while avoiding redundant (correlated) SNPs that are in LD. As we discuss in the methods, accounting for all redundant SNPs requires k × n (number of selected SNPs × number of SNPs) correlation estimations from the data, which is both computationally intensive and superfluous, since most highly correlated variants tend to be closely located on the genome. To overcome this issue, we limit the search space for correlated SNPs to close intra-chromosomal pairs within a maximum distance of D, where D is an adjustable parameter (in base pairs).
Here, we investigate the effect of the parameter D on the SNPs selected by Macarons, particularly its effect on the selection of highly correlated SNPs. For this purpose, we select k = 1000 SNPs for each of the 17 flowering time phenotypes using Macarons with varying D values. Then, for each tested value of D, we investigate the distribution of the redundancy over all pairs of the 1000 selected SNPs for each of the 17 phenotypes, in addition to the overall runtime and predictivity characteristics of Macarons (Figure 1). Note that, since the distribution of the redundancy is greatly skewed (i.e., many pairs with low redundancy, a few with very high redundancy), we visualize it using percentile lines (similar to a boxplot) from the 50th percentile (median) to the 99.9th percentile. In addition, in the right panel of Figure 1, we investigate the effect of the D parameter on runtime and prediction performance. As can be seen, with the a priori selected D value of 20 kbp (the estimated LD range for A. thaliana according to Atwell et al. (2010)), Macarons can avoid the selection of highly redundant SNPs without compromising the runtime, which translates into improved phenotype prediction performance.

Figure 1 :
Figure 1: The predictivity, redundancy, and runtime characteristics of Macarons with respect to its D parameter for k = 1000 selected SNPs. (Left) Redundancy (measured by squared correlation) between all pairs of selected SNPs. Each line indicates a different percentile of the distribution of redundancy (e.g., if there were 1000 SNP pairs, the 99.5th percentile would indicate the pair with the 5th highest redundancy value). Note that the left-most point on the x-axis (D = 1) corresponds to the baseline method of selecting the highest-scoring SNPs (without applying any penalization or regularization). The dotted line indicates the a priori selected D value of 20 kbp. (Right) The average phenotype prediction performance (predictivity, measured by R²) and the runtime of Macarons with respect to its D parameter. The blue line (left y-axis) shows the average R² performance across all 17 phenotypes, and the shaded area indicates its 95% confidence interval. The red line (right y-axis) shows the total time required to run Macarons for all phenotypes.

Figure 2 :
Figure 2: The phenotype prediction performances of the SNP selection methods compared to the baseline method for k = 1000 selected SNPs. The y-axis corresponds to the 17 flowering time phenotypes and the x-axis corresponds to the SNP selection methods. The leftmost method is a simple baseline that selects the highest-scoring SNPs without any penalization or regularization. Each cell contains two values: (i) the phenotype prediction performance (R²) for the corresponding method and phenotype, and (ii) the value in parentheses, which shows the difference in R² compared to the baseline method as a percentage. Each cell is colored according to its relative improvement over the baseline: positive values (indicating improvement) are colored blue, and negative values (indicating a decline in R² performance) are shown in red.

Figure 3 :
Figure 3: The phenotype prediction performances of the SNP selection methods averaged across all flowering time phenotypes for different numbers of selected SNPs (indicated by k values). The methods are tested for k = 20, 50, 100, 250, and 1000 selected SNPs. Each colored bar represents a different SNP selection method. The y-axis shows the average predictivity (measured by the squared Pearson correlation coefficient, R²) across all 17 flowering time phenotypes. The black lines indicate the 95% confidence interval of the average R² performance for the corresponding method and k value.

Figure 4 :
Figure 4: Top-level overview of the characteristics of the SNP selection methods in terms of their predictivity, redundancy, runtime, and robustness. The predictivity axis indicates the average phenotype prediction performance (measured by the squared Pearson correlation coefficient, R²) of the corresponding method for k = 1000 selected SNPs. The redundancy axis indicates the presence of high correlation among the selected SNPs (measured by the 99.9th percentile of the squared correlation between all pairs of selected SNPs). The robustness axis indicates the consistency of the selected SNPs across different cross-validation folds (measured by the average Jaccard index between all pairs of folds). The time axis (in log-scale) shows the total time (in seconds) required to run the corresponding method for all 17 phenotypes in the AT dataset.
Algorithm 1 Forward Selection Algorithm for the SNP selection problem
Input: Gain function G, ground set V, cardinality constraint k ≤ |V|.
Output: Set S ⊆ V such that |S| = k.
1: S ← ∅
2: while |S| < k do
3:   S ← S ∪ {arg max_{s_x ∈ V \ S} G(S, s_x)}
4: end while
5: return S
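The forward selection scheme is generic in the gain function: SPADIS and Macarons differ only in the G they plug in. A minimal Python sketch (the function name and callable interface are ours, for illustration):

```python
def forward_selection(V, gain, k):
    """Generic forward selection: at each step, add the candidate
    that maximizes the gain function G(S, s_x).

    V    : iterable of candidate SNP ids (the ground set)
    gain : callable (selected_list, candidate) -> float
    k    : number of SNPs to select
    """
    S = []
    candidates = set(V)
    while len(S) < k:
        best = max(candidates, key=lambda x: gain(S, x))
        S.append(best)
        candidates.remove(best)
    return S
```

With a gain that ignores S (i.e., G(S, s_x) = c_x), this reduces to the baseline of picking the k highest-scoring SNPs; Macarons' multiplicative penalty changes only the `gain` callable.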