Detection of epistatic interactions with Random Forest

In order to elucidate the influence of genetic factors on phenotype variation, non-additive genetic interactions (i.e., epistasis) have to be taken into account. However, there is a lack of methods that can reliably detect such interactions, especially for quantitative traits. Random Forest was previously recognized as a powerful tool to identify the genetic variants that regulate trait variation, mainly due to its ability to take epistasis into account. However, although it can account for interactions, it does not specifically detect them. Therefore, we propose three approaches that extract interactions from a Random Forest by testing for specific signatures that arise from interactions, which we termed ’paired selection frequency’, ’split asymmetry’, and ’selection asymmetry’. Since they complement each other for different epistasis types, an ensemble method that combines the three approaches was also created. We evaluated our approaches on multiple simulated scenarios and two different real datasets from different Saccharomyces cerevisiae crosses. We compared them to the commonly used exhaustive pair-wise linear model approach, as well as several two-stage approaches, where loci are pre-selected prior to interaction testing. The Random Forest-based methods presented here generally outperformed the other methods at identifying meaningful genetic interactions both in simulated and real data. Further examination of the results for the simulated and real datasets established how interactions are extracted from the Random Forest, and explained the performance differences between the methods. Thus, the approaches presented here extend the applicability of Random Forest for the genetic mapping of biological traits. Author summary The genetic mechanisms underlying biological traits are often complex, involving the effects of multiple genetic variants. Interactions between these variants, also called epistasis, are also common. The machine learning algorithm Random Forest can be used to study genotype-phenotype relationships, by using genetic variants to predict the phenotype. One of Random Forest’s strengths is its ability to implicitly model interactions. However, Random Forest does not give any information about which predictors specifically interact, i.e. which variants are in epistasis. Here, we developed three approaches that identify interactions in a Random Forest. We demonstrated their ability to detect genetic interactions using simulations and real data from Saccharomyces cerevisiae. Our Random Forest-based methods generally outperformed several other commonly used approaches at detecting epistasis. This study contributes to the long-standing problem of extracting information about the underlying model from a Random Forest. Since Random Forest has many applications outside of genetic association, this work represents a valuable contribution to not only genotype-phenotype mapping research, but also other scientific applications where interactions between predictors in a Random Forest might be of interest.

Understanding the genetic components underlying biological traits has been a central 2 focus of genetic research in humans, as well as in model organisms and agriculturally 3 important species [1][2][3][4]. Many phenotypes, including quantitative traits (QTs) such as 4 body height in humans, or growth rate in yeast, follow complex inheritance patterns 5 and are influenced by a multitude of genetic variants [5,6]. Numerous genetic variants 6 have been associated with a wide range of phenotypes in various species (e.g., through 7 Genome-wide association studies, GWASs, and quantitative trait loci, QTL, studies). 8 However, these variants can cumulatively usually only explain a subset of the 9 heritability of each trait, a phenomenon that was termed 'missing heritability' [7]. 10 This inability to fully explain the heritable portion of phenotype variation can in 11 part be attributed to insufficient consideration of the genetic background and 12 interactions between genetic variants (i.e., epistasis) [8]. Epistasis describes the 13 phenomenon when the effect of a genetic variant on a phenotype depends on the 14 allele(s) of one or more 'modifier' variant(s) [9]. In the context of QTs, epistasis 15 represents non-additive interactions between loci, i.e. situations where the effect of two 16 or more genetic variants on a QT differs from the sum of their individual (i.e., marginal) 17 effects. Two types of epistasis can be distinguished, which are defined by their 18 equivalent logical operations (S1 Fig). Firstly, there is AND-epistasis, where an allele at 19 a locus enhances the effect of another locus. Secondly, XOR-epistasis describes cases 20 where the effects of alleles at two loci are diminished when they occur together [10][11][12]. 21 Epistasis is often interpreted as a functional relationship between genes, since the 22 interactions can be explained by the loci influencing the same pathway or biological 23 process. Identifying interacting genes can thus contribute to elucidate disease 24 mechanisms [13,14], to identify common biological functions of genes, and to ultimately 25 reconstruct genetic interaction networks [10,[15][16][17][18]. 26 A wide variety of methods to detect epistasis for discrete traits have been 27 developed [19][20][21][22][23]. However, there is a lack of accurate methods for the detection of 28 interactions between genomic loci that affect QTs. These are often still investigated 29 using parametric methods, for example by evaluating pair-wise linear models (LM) for 30 all possible locus pairs to test for non-additive interactions (for example in [24]). Such 31 exhaustive approaches suffer from computational and statistical burdens induced by the 32 combinatorial number of tests that have to be performed [25]. Therefore some authors 33 use a two-stage approach where loci are pre-selected or weighed prior to interaction 34 testing. The criteria used for the pre-selection can include the presence of marginal 35 effects, knowledge from external databases, or data-mining approaches such as machine 36 learning methods with variable selection [24,[26][27][28][29]. In addition, many existing methods 37 make assumptions about the distribution of the data and the type, scale, and order of 38 interactions, which may not apply to real data [25,30]. 39 Several studies have shown that Random Forest (RF) outperforms other QTL 40 mapping methods at identifying QTL and discrete trait associations [31][32][33][34][35]. This was 41 mainly attributed to its ability to account for interactions [12,36]. A RF consists of an 42 ensemble of classification and regression trees (CARTs). In each CART, the data is 43

PLOS
2/23 repeatedly split into two groups based on the predictors (in this case genetic loci) that 44 explain the most variance of a certain outcome variable (in this case the phenotype), see 45 also S2 Fig. Interactions are implicitly incorporated into the model through the 46 hierarchical succession of predictors. Notably, RF is not limited in the order of 47 interactions it can model. The predictive power and robustness of RFs stems from two 48 separate random sampling steps [37]: (i) each CART is grown on a bootstrap sample of 49 the data, and (ii) only a random subset of loci is considered at each split. There are 50 several importance measures that can be used to rank predictors based on their 51 influence on the outcome variable, which, in the context of genetic mapping, can be 52 used to determine which loci influence the phenotype [31]. RF is model free and 53 non-parametric in the sense that it does not assume any particular genetic model, or 54 estimate any parameters [37]. 55 A notorious problem of RF remains that, although it is able to take interactions into 56 account, it does not specifically detect them. That is however necessary for identifying 57 epistatic interactions between loci. The high number of studies that attempted to detect 58 interactions in RFs or other decision tree-based methods highlights the potential of RF 59 to capture interactions [38][39][40][41][42][43][44][45]. Some of these approaches employed pairwise 60 importance measures, which do not provide a statistical significance level of the 61 interactions without computationally expensive permutation-based statistics. In 62 addition, these methods were not sufficiently tested and compared to other approaches, 63 and/or are only applicable to discrete phenotypes. For example, we previously proposed 64 an approach for extracting interactions from RFs [41], which relied on computationally 65 expensive permutations and required very large RFs. This study partly builds on this 66 previous work (i.e., the splitA approach), replacing the empirical statistics with a more 67 powerful parametric testing approach. 68 We aimed at creating a method that can extract interactions between genetic loci 69 from a RF, evaluate their significance in a permutation-free way, and is applicable to 70 QTs. To that end, we conceived three different tests that exploit specific signatures in 71 the RF caused by interactions. We termed these RF-based methods 'paired selection 72 frequency' (pairedSF), 'split asymmetry' (splitA), and 'selection asymmetry' (selA).

73
Because they are expected to complement each other in different interaction scenarios, 74 an ensemble method was also conceived by combining the p-values of the respective 75 approaches. We compared these four RF-based methods to other common approaches 76 for the detection of interactions between QTL, including an exhaustive approach that 77 tests for linear interactions between all possible locus pairs, and several two-step 78 approaches, where loci are pre-selected prior to interaction testing. We chose three 79 criteria to pre-select loci: (i) an exhaustive t-test for marginal effects, (ii) LASSO-based 80 feature selection, and (iii) Random Forest importance measures. To evaluate the 81 strengths and weaknesses of the different methods, we applied them to various 82 simulations of QTs. We then compared their performances on two real datasets 83 comprising of expression data and growth trait data from high-throughput S. cerevisiae 84 crosses. For the real data, we measured the method's performance based on their ability 85 to recover previously published genetic interactions identified through double gene than would be expected by chance (based on their individual selection frequencies). The 93 second approach, splitA, examines the difference in the effect of a locus on trait value 94 distributions, depending on the allele of another, potentially interacting locus used 95 earlier in the same path of a CART. Finally, the selA approach exploits the fact that 96 loci in AND epistasis have different probabilities of being selected for a split in the 97 forest depending on a previous partitioning on the other interacting locus. Finally, we 98 combined the respective p-values of these three approaches to obtain an ensemble 99 method. 100 We chose several commonly used approaches to compare to our RF-based methods. 101 First, we applied an exhaustive linear model (LM) approach, where the significance of 102 the interaction term based on an ANOVA was used to test for interactions between all 103 possible locus pairs. However, because our four RF-based methods benefit from the 104 feature selection properties of the RF, a comparison to only an exhaustive approach 105 would be inadequate. Therefore, in order to evaluate to what extent our methods' 106 performances depended on this feature selection of RF, we included three two-step 107 approaches in the analyses. In these methods, loci were pre-selected prior to interaction 108 testing using the LM procedure. First of all, we implemented a 't-Test 2step' approach, 109 where loci were pre-selected based on whether they manifest marginal effects according 110 to an exhaustive t-test screen. In addition, we used the variable selection properties of 111 the LASSO to pre-select loci based on their coefficients ('LASSO 2step'), as suggested 112 previously [46,47]. And finally, we applied a 'RF 2step' approach, in which we exploited 113 the feature selection properties of the RF algorithm to pre-select loci based on their 114 importance values, as previously suggested [48][49][50][51][52]. The latter differs from our suggested 115 RF-based epistasis detection methods in that the RF is only used to pre-select the loci, 116 but the RF structure is not exploited to identify the interactions. Therefore, the genetic 117 background modeled by the RF is not taken into account for the actual interaction 118 testing with the RF 2-step approach. For each two-step approach, we both tested for 119 interactions within preselected loci ('both' loci have a marginal effect), and between the 120 pre-selected loci and all others (only 'one' locus with marginal effect), as previously 121 proposed [53]. So in total, there were the four RF-based methods (the three individual 122 approaches and their ensemble), an exhaustive approach, and a total of six two-step 123 approaches (three pre-selection criteria, with the 'both' and 'one' strategy each).  In general, all RF-based methods were able to detect pure AND epistasis ( Fig 1A   137 and S3 FigA). Among them, the ensemble method generally performed best, and it also 138 slightly outperformed the other methods at recovering simulated interactions (Fig 1E). 139 The RF-based approaches profited a lot from cases where at least one of the interacting 140 loci had a marginal effect (Fig 1B, S4 and S5 Figs). This represents one of the 141 limitations of RF: it relies on the presence of marginal effects for the selection of loci in 142 the CARTs. Correspondingly, the two-step approaches applied here also rely on the presence of marginal effects on at least one of the interacting loci. However, AND 144 epistasis will, even in the absence of underlying true marginal effects, lead to a 145 phenotype difference for the interacting alleles separately (i.e., a 'quasi-marginal' effect), 146 promoting their selection in the RF. For example, considering a case of pure AND 147 epistasis, where the phenotype is only affected if both loci have an altered allele (S1 148 FigB), each locus taken alone will appear to have a marginal effect on the phenotype.

149
This is the reason why the RF-based methods were still able to uncover the interactions 150 in scenarios without any marginal effects ( Fig 1A). In addition, interacting loci without 151 any marginal effects are unlikely in a real biological setting [55], which alleviates the 152 severity of this limitation of the RF-based approaches.

153
The RF-based approaches still identified the correct interactions in the presence of 154 additional marginal effects on non-interacting loci (i.e., Fig 1C). Surprisingly, the 155 RF-based methods were not as robust as the other methods against increasing numbers 156 of such 'interfering' marginal effects (S6 Fig). It should be noted, however, that the 157 two-step RF approach, together with the two-step LASSO seemed to be most suited to 158 identify and pre-select the truly interacting loci. This further emphasizes the advantage 159 of multivariate approaches for modeling complex genetic regulation.

160
The 'quasi-marginal' effects mentioned above induced by AND-epistasis do not 161 necessarily arise for XOR epistasis (S1 FigC, loci A and B will not appear to have a 162 marginal effect if examined individually), making them potentially hard to capture with 163 RF. For our simulations, we chose a scenario that does not lead to any 'quasi-marginal' 164 effects. As expected, the exhaustive LM approach outperformed all other methods at 165 the detection of this interaction type ( Fig 1D and S7 FigA). However, it should be 166 noted that aside from the exhaustive LM approach, all RF-based methods (i.e., our 167 novel approaches and the two-step RF approach) performed better than the others.
168 Surprisingly, adding marginal effects to both interacting loci did not facilitate the 169 detection of XOR epistasis for any of the methods (S7 FigB). Although there are 170 theoretically possible mechanisms that can lead to XOR epistasis for both discrete and 171 quantitative traits, the biological relevance of XOR epistasis has been a matter of 172 controversy [56][57][58]. Notably, a thorough search of the literature did not uncover any 173 real biological examples of XOR epistasis for a quantitative trait. This lack of 174 discoveries, however, might also be due to the fact that this type of interaction is 175 particularly difficult to detect [58,59].

176
RF is claimed to be able to model higher-order interactions. Therefore we also 177 created two simulation scenarios with three-way AND epistasis. For simplicity, we 178 considered three-way interactions as recovered if all pairwise combinations of the three 179 interacting loci were detected. The performance of all methods was relatively poor, 180 however employing RF (e.g., in the ensemble method, or the two-step RF method) 181 seemed to offer an advantage at capturing these higher-order interactions, as did the 182 employment of the relatively sensitive t-test-based pre-selection (S3 FigB-C).

183
As expected, the three RF-based approaches complemented each other for different 184 interaction scenarios. The pairedSF method, for example, which generally performed 185 weaker on scenarios involving AND-type epistasis, performed comparatively better for 186 XOR epistasis. Indeed, the ensemble generally outperformed each individual RF-based 187 approach. Having demonstrated that our methods are theoretically able to detect epistasis on 190 simulated data, we evaluated them on real data to test the applicability of the methods 191 in a real biological setting. Mapping expression QTL (eQTL) is a key aspect towards 192 understanding the relationship between genetics and phenotype. Therefore, an eQTL 193 dataset derived from a yeast cross with 112 segregants (RM×BY) was selected [54]. The 194

PLOS
6/23 methods were applied to detect interactions between loci influencing the expression of 195 each transcript. We evaluated the performances of the different methods by the ability 196 to recover epistatic interactions detected in double knock-out (DKO) experiments [18], 197 as previously proposed [41]. Since these DKO interactions were based on growth 198 phenotypes, we limited the epistasis detection to 1,050 transcripts corresponding to 199 essential genes (i.e., genes that are lethal when knocked-out), assuming that these were 200 more likely to affect growth than non-essential genes. An eQTL interaction was 201 assumed to be a true positive if genes close to the interacting loci were also found to 202 interact in the DKO dataset [18]. We used these true positive labels to compute should only be considered as the evaluation of relative performance differences between 211 the methods, as measured by the enrichment of true positive calls among the detected 212 interactions.

213
The PR curves of most methods and the AUROCs of all methods were better than 214 expected by chance (Fig 2), which confirms that the DKO data is-at least to some 215 extent-suitable as a reference for quantifying true epistatic interactions, despite the 216 differences outlined above [41]. The RF-based methods, especially the ensemble 217 approach, exhibited the best performance according to both the PR curves and the 218 AUROC (Fig 2). To better understand the higher performance of the RF-based 219 methods, we investigated the interactions for each transcript that were exclusively 220 detected by each method (but not detected by the other approaches) and which were 221 also considered to be true positives according to the DKO dataset (methods). For this 222 analysis we focused on a subset of methods, i.e. the three RF-based approaches, the 223 'exhaustive LM', the 't-Test 2step one', the 'LASSO 2step one', and the 'RF 2-step one'. 224 As a reference, we also extracted 'consensus' interactions, i.e. interactions that were 225 found by at least five of the seven considered methods, and which corresponded to 226 interactions also found in the DKO dataset. First of all, we examined the type of 227 interaction (i.e., AND or XOR epistasis) of the respective method-specific interactions. 228 The pairedSF approach had the highest AUROC among the RF-based methods and 229 primarily detected AND-type interactions (S8 Fig A). This was a somewhat surprising 230 result, since pairedSF had a high sensitivity for XOR epistasis in the simulations. In  Next, we investigated whether the loci of the method-specific interactions had 240 marginal effects (according to a t-test p-value below 0.05, without multiple testing 241 correction) (S8 FigB). The exhaustive LM was not dependent on marginal effects. The 242 RF 2-step and the LASSO 2-step detected 88% and 62% of the interactions, for which 243 none of the interacting loci had a detectable marginal effect, respectively. This stresses 244 the advantage of multivariate methods, which can unravel marginal effects that only 245 come to light if multiple genetic variants are considered together, but not when they are 246 tested individually, as for example with a t-test [25,32,33]. Notably, pure XOR epistasis 247 is one example of an interaction that would not lead to any detectable marginal effects. 248 The RF-based methods primarily detected interactions with additional marginal effects. 249 This is in line with the simulations, where the RF-based methods were better at  Finally, it should be noted that the number of locus pairs tested per transcript 261 varied a lot between the different approaches. As expected simply due to the nature of 262 the tests, the exhaustive approach always considered all locus combinations, while the 263 RF-based and two-step approaches tested fewer locus pairs due to their heuristic 264 properties (S8 Fig D). The RF-based methods were more precise than for instance the 265 two-step approaches, which investigated similar numbers of locus pairs. Therefore, the 266 limitation that the loci have to occur together in the tree represents a valuable addition 267 for the interaction testing as opposed just just looking at their individual importance in 268 a tree (i.e., RF 2-step approach). The pairedSF is less restrictive about which locus 269 pairs can be tested (i.e., it does not require the loci to occur in the same CARTs that 270 many times and is therefore applicable more often). The splitA and selA approaches 271 have a higher true positive rate at low false positive rates (beginning of the ROC curve) 272 compared to the pairedSF. Thus, the few interactions detected by the splitA and selA 273 are relatively precise, but the pairedSF has a higher recall, simply because it is 274 applicable to more locus pairs.

8/23
Benchmark on growth trait data 280 For a second real data benchmark, a dataset from a distinct yeast cross with 720 281 segregants was selected, which encompassed measurements for several growth traits [60]. 282 These growth phenotypes represent a more complex phenotype than gene expression, 283 offering more opportunities for complex interactions [61].  Conclusion and outlook 308 We have developed three approaches that are able to extract information about 309 interactions from a RF model. We used them to identify epistasis between genetic loci. 310 Traditionally, simulated QTL data have been used to evaluate the performance of 311 PLOS 9/23 epistasis detection methods. Obviously, simulated data have the advantage that the 312 true result is known, which greatly helps understanding the scenarios under which a 313 particular method performs better or worse than another method. However, simulations 314 suffer from the problem that they usually do not capture the full complexity of real 315 data. Hence, we aimed to also develop a scheme for evaluating QTL epistasis detection 316 methods on real QTL data even if a ground truth is unknown. Our study revealed that 317 indeed the comparison with DKO data is a reasonable way for quantifying the relative 318 performance of epistasis detection methods.

319
Benchmarks on simulated and real yeast cross data indicated that the RF-based 320 methods outperform other commonly used approaches at detecting epistasis.

321
Particularly the ensemble approach, which combines all three RF-based approaches, 322 performed relatively well at this task. It generally achieved the highest precision for the 323 simulated data. When applied to real data and compared to the other tested methods, 324 it seemed to produce the biologically most meaningful results, as indicated by the 325 recovery of high-confidence interactions that were identified in an independent study 326 from double gene knock-outs. Therefore, the approaches for the identification of 327 interactions within the forest introduced here represent a valuable enhancement of the 328 applicability of RF for genetic association analyses.

329
Higher-order interactions (three-way, four-way, etc.) are likely to play a role for 330 many complex phenotypes [23,62] predictions also for internal tree nodes. Unless stated otherwise, the RF was always 361 built in the same way: 30,000 trees were grown, the minimum final node size was set to 362 5, and all other parameters were left at the default.

363
RF cannot deal with missing data in the predictors (i.e., the genotypes). Therefore, 364 missing genotypes were imputed in 100 iterations by randomly selecting the missing 365 alleles according to the respective allele frequencies at each locus. 300 trees were grown 366 for each imputation, and the resulting collection of forests was then united into one big 367 forest of 30,000 trees. For the two real datasets, population structure representatives 368 were included as covariates in the model (i.e., as predictors used to build the RF), as 369 proposed previously [67]. For all methods, locus pairs in strong linkage disequilibrium 370 (LD), as indicated by an absolute Pearson correlation value above 0.9, were excluded 371 (i.e., the p-value was set to not available, NA).  The three RF-based epistasis detection methods described above complement each other 434 for different types of epistasis (i.e., selA theoretically only works for AND-epistasis, and 435 pairedSF is most likely to be able to detect XOR-epistasis). In addition, the methods 436 are applicable in different scenarios due to different statistical requirements: splitA and 437 selA can only be used when a locus was used at least five times on both sides, or on 438 either side of another locus, respectively. Thus, the p-values generated by the splitA, 439 selA, and pairedSF approaches were combined using Fisher's method to create an 440 ensemble score. Only the p-values of the methods that were available for each locus pair 441 were used (e.g., if only two out of three methods were applicable for a locus pair, only 442 the p-values of these two were combined).
where P represents the collection of population structure covariates p i . The significance 448 of the interaction termβ A×B according to an F-test was then used as the significance of 449 the locus-locus interaction. The same LM was also used for the two-step approaches to 450 test for interactions for the pre-selected loci. This approach employs a Student's t-test to preselect loci prior interaction testing.

453
First, an exhaustive Student's t-test was used to rank loci by their marginal effects.

454
Samples with an unknown genotype for a locus were not used. It should be noted that 455 in this first pre-selection step, population structure could not be taken into account. where their coefficients were non-zero, and the top 100 were used for subsequent 471 interaction testing using the LM procedure described above.

472
Random Forest-based two-step approach 473 To evaluate how much our methods profited from the feature selection properties of RF, 474 we applied a two-step approach based on RF importance values. A forest with 20,000 475 trees was computed. In the case of missing genotypes, they were imputed in 100 . The loci were ranked based 480 on this combined score and the top 100 were used for interaction testing.

481
Performance evaluation on simulated data 482 We simulated trait values based on the genotypes from the widely used data from a 483 cross between the two S. cerevisiae strains RM11-1a and BY4716 [54].
where y represents the phenotype values, ES m and ES ij represent the effect sizes for 489 the loci with marginal effect m, and interacting loci i and j, respectively. We selected 490 the baseline of 9 randomly. Normally distributed noise was added at eight noise levels 491 ranging from 2,5 to 20 % of the mean. The parameters for all scenarios are listed in 492 Table 1. In order to evaluate the methods' performance at identifying interactions for real data, 502 we used a DKO dataset as a reference [18]. This dataset consists of gene-gene 503 interactions that were identified via co-knockouts of yeast genes using the synthetic 504 genetic array method. In short, they tested whether double knock-outs showed a 505 significant deviation in fitness compared to the expected multiplicative effect of the 506 respective two single knock-outs. For this study, the results for essential (both damp-507 and temperature sensitive alleles) and non-essential genes were combined. Interactions 508 which passed the lenient threshold used in the original paper (p−value 0.05) in at 509 least one assay were considered interacting. We excluded pairs with genes that were 510 removed or marked as 'dubious' in the latest reference annotation (ENSEMBL, version 511 R64-1-1). Overall, about 13% of all possible gene pairs were found to interact. Because 512 QTL analysis identifies interacting loci and not interacting genes, genes had to be 513 assigned to nearby loci (see also below). A locus pair was considered a true positive if 514 any of the pairwise combinations of the genes assigned to the two loci were found to 515 precision-recall (PR) measures created using the 'precrec' R package [69].

517
Expression quantitative trait data 518 For the first real data benchmark, a dataset from an RM×BY yeast cross was used, 519 comprising of genotype information along with gene expression measurements for 112 520 segregants [54]. We used the same set of unique loci as for the simulations. Because the 521 methods would be evaluated based on the recovery of interactions for growth 522 phenotypes (DKO dataset), 1050 transcripts were selected that code for essential genes 523 according to the Saccharomyces Genome Deletion Project 524 (www-sequence.stanford.edu/group/yeast_deletion_project/deletions3.html, 525 [70]). This was done based on the assumption that genetic loci regulating the expression 526 of essential genes are more likely to also affect growth than loci regulating non-essential 527 genes. The expression values were scaled and centered to mean zero and equal variance. 528 The population structure covariates were created as follows: first, the relatedness, or 529 kinship, between strains was estimated using the R package 'emma' [71]. The first six 530 eigenvectors, explaining 60% of genotypic variance, were used as representatives for the 531 population structure. For the benchmark, the loci were mapped to nearby genes 532 according to the following procedure: for each locus, all variants that were in LD 533 (absolute correlation coefficient above 0.9) and that were less than 50 kilobasepairs 534 (kbp) apart, were grouped. The locus region was then defined as starting one basepair 535 (bp) after the previous locus and ending one bp before the next locus, or at the start or 536 end of each chromosome for the first and last locus, respectively. All genes that 537 overlapped this variant region according to the ENSEMBL annotation [72] were then 538 assigned to the locus. To evaluate the performance of the methods at identifying 539 interactions, the results for the different transcripts were grouped by taking the mean 540 p-value across the transcripts for each locus pair.

541
To evaluate the interactions that were detected by each method exclusively, we did 542 the following: for each transcript, and for each considered method (i.e., the three 543 RF-based approaches, the exhaustive LM, the t-Test 2step one, the LASSO 2step one, 544 and the RF 2-step one), we extracted the interactions with the lowest p-value that was 545 found by this method (p-value non-NA and in the lower quartile of p-values), but not by 546 any of the other considered approaches (p-value NA or bigger than its 25% quartile of 547 p-values), and which were also considered to be true positives according to the DKO 548 dataset. As a reference, we also extracted the interaction with the lowest mean p-value 549 (over all considered methods) that was found by at least 5 of the 7 methods (non-NA For the evaluation of the methods on more complex traits, another yeast cross dataset 553 was selected [60]. The dataset encompassed genotype information for 720 distinct 554 segregants of a cross between the strains BY4742 and SK1 as well as measurements for 555 several growth-related phenotypes. We selected the phenotypes 'growth rate on agar' 556 and 'resistance to cantharidin' (a monogenic trait) as positive and negative control 557 phenotypes, respectively. The phenotype values were scaled and centered to mean zero 558 and equal variances. The original dataset encompassed genotype information for 65,250 559 genetic variants. Variants in high LD were grouped in the following way: For each 560 chromosome, a hierarchical clustering of the variants was performed, using absolute 561 Pearson correlation values subtracted from one as distance measures. The clustering 562 tree was then cut at a height of 0.02. In the majority of cases, the clusters grouped 563 variants that were located consecutively on the chromosome. For the resulting 2,827 loci 564 genome-wide, a 'consensus locus' was created for each cluster by using the majority vote 565 (rounded mean of alleles encoded as 0 or 1) of the variants in each group. If the mean 566 was between 0.4 and 0.6 the genotype was treated as unknown. The resulting minor 567 allele frequencies ranged from 0.3 to 0.7, and there were no loci with more than 11% 568 missing genotypes.

569
The population structure was estimated in the same way as for the eQTL dataset.

570
Five principal components, explaining 55% of genotypic variance, were used as 571 representatives for the population structure. The original 65,250 variants were mapped 572 to genes in the following way: each (original, non-grouped) variant was defined as 573 starting one bp after the previous variant and ending one bp before the next variant, or 574 the start or end of the chromosome for its first and last variant, respectively. If there 575 was a gene overlapping this region according to the ENSEMBL annotation [72], it was 576 assigned to the variant. Finally, the 'consensus locus' of each variant group was assigned 577 all genes that were assigned to the variants in its locus.  The simulations encompass one AND-type interaction each, as well as marginal effects 614 for one to four separate loci, respectively.