Predictive modeling of microbial data with interaction effects

Microbial interactions are of fundamental importance for the functioning and the maintenance of microbial communities. Deciphering these interactions from observational data or controlled lab experiments remains a formidable challenge due to their context-dependent nature, i.e., their dependence on (a)biotic factors, host characteristics, and overall community composition. Here, we present a statistical regression framework for microbial data that allows the inclusion and parsimonious estimation of species interaction effects for an outcome of interest. We adapt the penalized quadratic interaction model to accommodate common microbial data types as predictors, including microbial presence-absence data, relative (or compositional) abundance data from microbiome surveys, and quantitative (absolute abundance) microbiome data. We study the effect of including hierarchical interaction constraints and stability-based model selection on model performance and propose novel interaction model formulations for compositional data. To illustrate our framework’s versatility, we consider prediction tasks across a wide range of microbial datasets and ecosystems, including metabolite production in model communities in designed experiments and environmental covariate prediction from marine microbiome data. While we generally observe superior predictive performance of our interaction models, we also assess limits of these models in presence of extreme data sparsity and with respect to data type. On a large-scale gut microbiome cohort data, we identify sparse family-level interaction models that accurately predict the abundance of antimicrobial resistance genes, enabling the formulation of novel biological hypotheses about microbial community interactions and antimicrobial resistance.


Introduction
A fundamental objective in microbial ecology is to elucidate how species compositions and species-species interactions are related to the maintenance and functioning of a microbial community [1].Interactions between microbial species come in many forms, including crossfeeding interactions through metabolite exchange, bacteriocin-induced growth-inhibitory interactions, and exchange of genetic material for genotype selection [2,3].Conceptually, microbial interactions can be described in terms of their net positive, negative, or neutral effect on their interaction partner, resulting in broad categories such as mutualistic, commensal, amensal, predatory/parasitic/exploitative, antagonistic or competitive interactions [4,5,6,2].Experimentally identifying and verifying such interactions within natural communities has remained a difficult task, owing to the sheer complexity of microbial ecosystems and limited technical capabilities to dissect such communities.With the emergence of large-scale microbial survey data, computational approaches have become popular that use statistical regression and correlation methods to estimate sparse species-species association and co-occurrence networks from microbial abundance measurements [5,7,8,9,10,11].While these networks do not necessarily reflect true ecological relationships [12], they can provide valuable insights into the global structure of microbial communities across ecosystems [13,14].However, none of these methods allow to relate species-species associations or "interactions" to a community functional outcome of interest or to concomitant environmental or host-related covariates.Furthermore, most network approaches deliver context-independent (or averaged) pairwise associations, thus potentially missing species-species interactions that are relevant for a specific function of the community.In this contribution, we provide a statistical regression framework for microbial data that allows the parsimonious inclusion of microbial interactions for predicting an outcome of interest, such as butyrate [15] or a concomitantly measured covariate [16].Using the generic quadratic interaction regression model as a starting point, we adapt the model to accommodate all common microbial abundance data modalities (see Fig. 1 for an illustration).Important examples include data from designed in-vitro experimental studies on model microbial communities where microbial abundance comes in form of presence-absence (binary) data or absolute abundances, i.e., non-negative count or continuous data [17].The majority of microbiome survey data, however, quantify taxon abundances by amplicon sequencing, thus providing primarily relative abundance (or compositional) data [18] in form of Operational Taxonomic Units (OTUs) or Amplicon Sequencing Variants (ASVs) [19].Moreover, recent quantitative microbiome profiling techniques [20,21,22] combine absolute cell count measurements and relative amplicon data, thus providing absolute microbial abundance information, albeit with potential biases [23].Illustration of three data modalities in microbiome analysis and their combinatorial behavior with respect to an outcome, e.g., a community function (created with BioRender.com).The sketch depicts three distinct data modalities in the columns: (i) quantitative microbiome data, representing absolute counts; (ii) presence-absence information of microbial species; and (iii) relative abundance data, also known as compositions.Each row illustrates a simplified scenario.In the first scenario, blue microbes are present while red are absent, resulting in a large (production of) outcome (e.g., butyrate).In the second scenario, red microbes are present while blue are absent, leading to another large (production of) outcome.In the third scenario, both blue and red groups of microbes are present, yet only minimal amounts of the outcome are produced, indicating an antagonistic combinatorial effect between the two groups.
Our framework unifies and generalizes several seemingly disjoint approaches in the literature of microbial ecology and microbiome data science.For example, several recent studies in microbial ecology use presence-absence and absolute abundance data from designed experiments on small microbial communities to predict community functions [24,15], such as, e.g., butyrate production, or overall host fitness [25] using the quadratic (and higher-order) interaction model.Our framework is readily available for such studies and gives statistical guidelines how to choose model complexity, how hierarchical constraints can increase model interpretability, and how to analyze higher dimensional datasets.On the other hand, for regression tasks based on high-dimensional large-scale amplicon sequencing data, many statistical approaches consider the linear log-contrast model [26], which is the standard linear model for compositional data, as the baseline model.To deal with the high dimensionality (where typically the number of features  is larger than number of samples ), penalized and structured regression models have been proposed [27,28,29,30,31,16].Here, we extend these linear (main effects) model to include species interaction effects.Specifically, starting with Aitchison's (low-dimensional) proposal [26], we introduce three models with quadratic interactions that work with relative input data: (a) the alr transformed quadratic model, (b) the quadratic log-contrast model, and (c) the quadratic log-ratio model.To achieve parsimonious models in the high-dimensional setting, we employ ℓ 1 penalization and illustrate via semi-synthetic data simulations when quadratic interactions are identifiable, given the excess sparsity of typical microbiome data.To achieve stable and interpretable interaction models [32], we follow [33] and incorporate hierarchical interaction modeling [34,35,36] and stability-based model selection [37,38] into our framework.The hierarchy assumption enforces constraints on interaction features, requiring that they can only be included in the model if both features (strong hierarchy) or at least one feature (weak hierarchy) are already present as main effects.Stability-based model selection ensures that interactions are only included if they can be consistently and reproducibly identified across different subsets of the data which will likely help reduce the number of testable biological hypotheses.We demonstrate the versatility of our framework by analyzing datasets that encompass all three data modalities across various ecosystems, including synthetic microbial communities, human gut microbiomes, and marine microbial ecosystems.Notably, our application of the quadratic interaction model on a quantitative microbiome data from the Metacardis study [39] reveals its effectiveness in accurately estimating the abundance of antimicrobial resistance genes (ARGs) from microbial taxa abundances.Furthermore, our analysis of a microbiome dataset containing presence-absence information rediscovers a stable interaction effect, specifically the inhibitory role of D. piger on the butyrate producer A. caccae [24].For the newly introduced sparse quadratic log-contrast model tailored for relative microbiome data, we provide both semi-synthetic data simulations to demonstrate the model's ability to accurately detect interaction effects and, following [16], re-analyze Tara ocean data [40], highlighting superior predictive performance of sparse interaction modeling compared to their linear counterparts.We conclude by providing a comparative analysis of the quadratic interaction models across the three data modalities using the Metacardis ARG prediction task, illustrating commonalities and differences across the resulting predictive models.The latter analysis gives further guidance for the practitioner regarding merits and pitfalls of quadratic interaction models.Our framework for quadratic interaction modeling is freely available as reproducible R code at https://github.com/marastadler/Microbial-Interactions.

Interaction modeling strategy
Given the abundance information of  microbial taxa  = ( 1 , ...,   ), the baseline model for uncovering (joint) additive effects of the microbial taxa on an outcome  ∈ R  (e.g., butyrate production), is the linear model 4/34 where  0 is the intercept term,   is the effect of taxon  on  , and  models the technical and biological noise term.In many prediction tasks, relying on a linear (main effect) model alone is insufficient to capture the complexity of dynamics within microbial communities.A common approach to introduce a more intricate yet interpretable model is the inclusion of quadratic terms.
Here, we extend the baseline model by introducing a generic quadratic interaction model, incorporating all pairwise interactions between microbial taxa, namely where Θ = Θ  ∈ R × is a symmetric matrix of interactions.We assume Θ   = 0 in this model formulation.However, the general principles still apply if this constraint is removed.
In the following section, we instantiate the interaction model to accommodate distinct data types and denote the microbial abundance information by  for count information (absolute or relative) and  for presence-absence information (see Fig. 1).

Interaction model for quantitative microbiome data
Whenever microbial abundance information is given as absolute counts, the model is equal to the generic model 2 and does not require further transformation of the input data or any constraints on the model coefficients.Throughout this work, we denote the absolute count input data by  ∈ R × + .Assuming that  depends on the actual amounts of taxon abundances, the quadratic interaction model is given by where the model parameters follow the description provided in model 2.

Interaction model for presence-absence microbiome data
If the microbial abundance information is represented as presence-absence data, given by a binary matrix, we denote the microbial abundance information as  ∈ {0, 1} × , where 1 indicates the presence of a microbial taxon, and 0 indicates its absence.One common alternative encoding is  ∈ {−1, 1} × , where the absence is encoded as -1.The choice of encoding does not affect the ability of the model to fit the data, it only changes the interpretation of the coefficient.Assuming that  depends on the presence-absence information 5/34 of microbial taxa, the quadratic interaction model is given by where the model parameters follow the description provided in model 2. For  ∈ {0, 1},  0 is the baseline effect when all features, here referring to microbial taxa, are absent, and   for  = 1, ...,  represents the effect of the presence of   when all other taxa are absent.The interaction term, Θ   , accounts for the additional effect when both features   and   are present.For  ∈ {−1, 1},  0 signifies the overall mean (assuming a completely balanced design).For more details on the interpretation and the linear transformations of model coefficients between these two encodings, see the Supplementary Material.When describing  as a fitness or phenotypic landscape, the different encodings in the interaction model are often associated with Fourier and Taylor expansions, allowing the parameters to describe landscape properties, like ruggedness [15,41,42].

Interaction modeling for relative microbiome data
When the microbial count information in  is provided as (sparse) compositions rather than as absolute counts, one way of modeling interaction effects between microbial taxa includes converting  to a binary matrix  = 1 {>0} that carries the presence-absence information of microbial taxa.However, the compositional information might hold valuable insights beyond that provided by presence-absence data.We introduce three methods for modeling quadratic interactions with relative input data: (a) the alr transformed quadratic model, (b) the quadratic logcontrast model, and (c) the quadratic log-ratio model.The three models differ in terms of interpretability, dimensionality, and optimization, and the choice of which model to use depends on the underlying data and the biological question.
Alr transformed quadratic model While comparing the relative count information in compositional data is not biologically meaningful, describing the response as a linear combination of log-ratios derived from the original compositions is a valid comparison.One popular way of building log-ratios is by choosing a common reference feature , such that the transformed count is given by   = log(   /  ),  = 1, ...,  − 1 [26].This transformation is known as the additive log-ratio transformation (alr)-transformation.The alr transformation allows modeling an outcome  based on the (  − 1)-dimensional compositional input data, assuming that  depends on the composition of , not on the actual amount, by a model linear in the features 6/34 and an extension to interaction effects, given by For  :=  − 1 this formulation is equal to the generic model 2 (with Θ not being symmetric).
While this very general model formulation allows for the interpretation of the effects with respect to a specific reference feature , extensions to expressions in the -dimensional space [26] and log-ratio models that allow pairwise comparisons between features [30] have been proposed.Both approaches can be translated back to the model formulation in 5 and 6, respectively, and will be discussed in the following two paragraphs.
Constrained quadratic log-contrast model As shown in [26], a more convenient symmetric expression of the linear alr transformed model 5, that does not require a reference feature, can be derived by reformulating the equation as a -dimensional problem including a zero-sum constraint, given by where the main (log) effect coefficients   ,  = 1, ...,  sum up to zero.As illustrated in [26], the extension to the quadratic log-contrast model can be represented as where the main (log) effect coefficients   ,  = 1, ...,  sum up to zero, with  ∈ R  , and the interaction effect coefficients Θ   correspond to the quadratic (log-ratio) interaction effect of   and   , with Θ = Θ  ∈ R × .In the Supplementary information we show how to formulate the alr transformed model as constrained quadratic log-contrast model.
Quadratic log-ratio model Another way of accounting for compositionality in regression models is to build log-ratios between all possible pairs of features in  ∈ R × + .This approach is referred to the (all-pairs) log-ratio model [30], which is given by where the main effect coefficient  , corresponds to the pairwise (log-ratio) effect of   and   .
In the same way as in 8 the log-ratio model can be extended to a quadratic version, the quadratic log-ratio interaction model (qlr), namely, , log(   /  ) + where the main effect coefficient  , corresponds to the pairwise (log-ratio) effect of   and   , with  ∈ R ( −1)/2 and the interaction effect coefficient Θ   corresponds to the quadratic (log-ratio) effect of   and   , with Θ = Θ  ∈ R × .There exists a linear transformation between the main effect coefficients   in model 7 and model 8 and the main effects coefficients  , in model 9 and model 10, , implying that the zero-sum constraint on  ∈ R  is inherently met in the linear and quadratic log-ratio model.While the models are mathematically equivalent, their interpretations are different and the choice might depend on the particular data application.

Penalized model estimation
Microbial datasets typically include a large number of features  and interactions between features (  − 1)/2 compared to the number of observations .Moreover, even in scenarios where  > ( 1)/2, we assume that a parsimonious model is most appropriate, focusing only on the selection of few features and interactions that are relevant for the outcome.To facilitate penalized model estimation, we employ regularized maximum-likelihood estimation incorporating ℓ 1 -norm (lasso) penalization for both linear and interaction coefficients, as proposed by [43].We introduce a generic optimization problem, consisting of an objective function (,  0 , , Θ) and a (potential) constraint set on the model parameters ( 0 , , Θ) that facilitates parameter estimation for all (linear and interaction) models introduced in Section 2.1.The objective function takes the general form Here,  > 0 serves as a tuning parameter, regulating the sparsity levels of the coefficients  and Θ, respectively.The loss function  ( 0 , , Θ) is specific to each model.Consequently, the generic optimization problem is given by minimize This optimization problem is subsequently instantiated by specific loss functions and constraints.
Sparse quadratic interaction model for quantitative and presence-absence microbiome data The loss function  ( 0 , , Θ) for the sparse quadratic interaction model, also all-pairs lasso, for the interaction models for absolute count data or presence-absence data, introduced in 3 and 4, is defined as for absolute count data and  :=  ∈ {0, 1} × (or  ∈ {−1, 1} × ) for presence-absence data.This model does not require further constraints on the model parameters, such that ( 0 , , Θ) = ∅.Consequently, the optimization problem is given by minimize In the linear model case the loss function in the optimization problem reduces to  ( 0 , ) = .
Sparse alr transformed quadratic model Given  ∈ R × is a matrix containing the relative abundance information of  microbial taxa, the loss function in 11 for the sparse alr transformed quadratic model, introduced in 6, is defined as with   = log(   /  ),  = 1, ...,  − 1.The model does not require further constraints on the model parameters, such that ( 0 , , Θ) = ∅.Consequently, the optimization problem is given by minimize In the linear model case the loss function in the optimization problem reduces to  alr ( 0 , ) = .

Sparse quadratic log-contrast model
The linear log-contrast model has been extended to the high-dimensional setting [31,27,28,16], and is also known as the sparse log-contrast model.While this model has been used in various microbiome data analysis applications, the concept of introducing interactions in the log-contrast model has been defined in [26], but has not been extended to the high-dimensional setting or used in practical applications.Here, we translate the interaction model proposed in [26] to the high-dimensional setting.The loss function for the sparse quadratic log-contrast model (qlc) corresponding to the interaction 9/34 model for compositional data, introduced in 8, is defined as As this model comes with a zero-sum constraint on the main effect coefficients, the constraint set in 12 is given by Thus, the optimization problem for the sparse quadratic log-contrast model is given by minimize In the linear sparse log-contrast model defined in 7, the loss function reduces to  lc ( 0 , ) = . The main effect covariates, denoted by   for  = 1, ...,  typically remain unscaled under the zero-sum constraint.However, the interaction features log(   /  ) 2 are not subject to the zero-sum constraint and we scale them.The ℓ 2 -norm of these interaction features tends to increase with the ℓ 2 -norm of their associated main effects (more specifically, the ℓ 2 -norm of the main effects after transforming them with the centered log-ratio (clr) transformation).The clr divides each compositional part by the geometric mean of all parts, namely clr( ) = log   (   ) =1,..., with (   ) = exp 1 Here, we introduce a way of scaling the interaction features that ensures equal penalization of the interaction features.Moreover, we adjust the scale of the interaction features to align with the norm of the average clr transformed ℓ 2 -norms of all main effects.Mathematically, this can be expressed as follows: We denote each column of the interaction feature matrix as , and its scaled version is given by , where  clr = clr( ) ∈ R × is the clr transformed main effects matrix  and  clr  2 is the ℓ 2 -norm of the -th column of  clr .

Sparse quadratic log-ratio model
The loss function of the sparse quadratic log-ratio (qlr) model corresponding to the interaction model for compositional data, introduced in 10, 10/34 is defined as This model does not require further constraints on the model parameters, such that ( 0 , , Θ) = ∅.The optimization problem for the sparse quadratic log-ratio model is therefore given as minimize  0 ,,Θ ( qlr ,  0 , , Θ).
In the sparse log-ratio model, that is linear in the features and corresponds to the model in 9, the loss function reduces to  lr ( 0 , ) . The (  − 1)/2-dimensional sparse log-ratio problem has been shown to be equivalent to the sparse log-contrast model problem for  qlr = 2 qlc [30].This equality can be directly translated to their quadratic extensions.As the dimensionality of the predictor space in the (  − 1)/2dimensional log-ratio model becomes computationally inefficient for large , the authors in [30] propose a two-stage procedure that involves a pre-selection step for covariates to reduce the predictor space before applying the log-ratio lasso.This two-step procedure can be directly applied to the 2 • (  − 1)/2-dimensional quadratic log-ratio lasso, introduced in 10, in scenarios where  is large.

Modeling hierarchical interactions
The quadratic interaction models, introduced before, can enhance predictive performance compared to models that are linear in the features, but they may not detect robust microbial interaction effects suitable for further functional analysis.To enhance model interpretability, we introduce the statistical concept of hierarchy in the context of quadratic models for microbiome data.The concept of hierarchy permits the inclusion of an interaction Θ   in the model only if both associated main effects are also present in the model (strong hierarchy), or if at least one of the associated main effects is included (weak hierarchy) [see [34], and references therein].This hierarchy can be implemented by imposing constraints on the interaction effects Θ  ∈ R  for  = 1, ...,  namely By eliminating the symmetry constraint on Θ, the resulting model relaxes to weak hierarchy on the interaction features.While 17 is non-convex, we follow [34] who proposed a convex relaxation of the problem and provided an efficient implementation in the corresponding R package hierNet [44] (v1.9).The hierarchical constraint can be imposed within the generic optimization problem described in 12 and allows a direct application under the 11/34 convex relaxation for (i) quantitative microbiome data (ii) or presence-absence information of microbial species with 13; and (iii) relative microbiome data, after performing the alr transformation with 14.

Model selection
An essential challenge in high-dimensional penalized regression is the selection of the regularization parameter .This parameter balances the sparsity of model coefficients with out-of-sample predictive performance [45,46].Standard methods for main effects and interaction models often involve techniques such as cross-validation [34] or Information Criteria like the Akaike (AIC) and the Bayesian Information Criterion (BIC) [47].However, these methods tend to select more predictors and interactions than necessary [47].If the main aim lies in detecting robust effects, one way of accounting for the potential limitations caused by cross-validation in penalized regression models is the concept of stability selection [37] for identifying a set of predictive features and interactions in microbial data.Stability selection has shown effectiveness across various scientific domains, ranging from network learning [48,49] to data-driven partial differential equation identification [50,51].In the context of regression, stability selection involves iteratively learning sparse regression models from subsamples of the data (e.g.,   = ⌊/2⌋), recording the frequency of selected predictors across models, and selecting the most frequent predictors for the final model.A variant of stability selection, complementary pairs stability selection (CPSS) [38], is particularly advantageous for handling unbalanced experimental designs, as it ensures that individual subsamples are independent of each other.CPSS draws  subsamples as complementary pairs {( 2−1 ,  2 ) :  = 1, ..., }, with  2−1 ∩  2 = ∅ from samples {1, ...} of size ⌊/2⌋.Applying a variable selection procedure  (for instance choosing the  first predictors entering the penalized model in the regularization path or cross-validation) to each subsample allows defining a feature specific selection probability π for  = 1, ...,  + (  − 1)/2 that is given by π The final selection set, denoted as ŜCPSS , consists of predictors  for which the estimated selection probability π exceeds a predefined threshold  thr , that represents the minimum selection frequency required for a predictor to be included in the final set.We employ the stabs R package [52] (v0.6-4), which offers an efficient implementation of the CPSS procedure.This approach involves defining several hyperparameters, including the set of regularization parameters Λ, the threshold  thr ∈ [0, 1], the number of initial predictors  entering the sparse model, and the number of complementary splits .The CPSS procedure in [52] can be directly applied to linear models.As CPSS does not make a distinction between main and interaction effects, it can be directly applied to quadratic models.An integration of the 12/34 CPSS procedure within the hierarchical interaction modeling framework has been introduced in [33].

Applications in quantitative and presence-absence microbial data
Antimicrobial resistance genes (ARGs) play a crucial role in the survival and evolution of individual microbial species.The extensive use of antimicrobials has increased the development of resistance in pathogens, leading to an increased presence of ARGs.The number of ARGs might be associated with the composition and abundance of certain microbes in the human gut [53].
To get a better understanding of how community composition and interactions might be related to ARGs, we use quantitative microbial count information derived as mOTUs (metagenomic operational taxonomic units) from quantitative microbiome profiling for a subset of  = 690 individuals.Specifically, we take the abundance data from the MetaCardis cohort [39] for which metadata information is available.We aggregate the mOTUs on genus level and illustrate the modeling strategy by considering the 30 most abundant genera in our model.We denote the underlying data by  ∈ R × and the number of ARGs by  ∈ R  .Given that the microbial counts in  are quantitative, we fit the sparse interaction model for absolute count data defined in 3 and 13, respectively, for 10 train test splits by using 5-fold cross-validation (CV).Our results suggest that some genera and interactions between them can explain the prevalence of ARGs (see Fig. 2a, right panel).In Fig. 2a (left panel), we visualize the estimated coefficients that exhibit a non-zero median over 10 train test splits.
Next to some minor main and interaction effects, Bacteroides and Escherichia show a substantial effect on the increase of the number of ARGs, while Prevotella shows a decrease.Moreover, we identify a positive interaction effect between Prevotella and Faecalibacterium, which is contrary to their individual negative effects, indicating an antagonistic association.These findings suggest that the presence and co-presence of certain bacterial species in the gut microbiota can influence the prevalence of ARGs.
In a second example we investigate the contribution of certain bacteria as well as their pairwise interplay on butyrate production, a short-chain fatty acid beneficial to human health, within an in-vitro community given the presence-absence information of bacteria within a synthetic community from [24].Certain bacteria, known as butyrate-producers, have the ability to ferment dietary fibers into butyrate, contributing to gut health, immune function, and energy metabolism [54].Understanding how bacteria interact in this context is essential for understanding the complexity of this process.Following [15], we use the presence-absence information, denoted by  ∈ {0, 1} × of  = 25 bacteria in  = 1561 experiments, to fit the 13/34 sparse quadratic interaction model defined in 4 and 13, respectively, for 10 train test splits by using 5-fold cross-validation, to the butyrate production, denoted by  ∈ R  .Only coefficients with a non-zero median estimated coefficient are shown; right: Scatterplot comparing the observed and predicted number of ARGs on a test data set for the sparse quadratic interaction model.b, Prediction of butyrate production from the abundance information of microbial species from [24].Left: comparison of median estimated coefficients (over 10 train test splits) between the sparse quadratic interaction model ( axis) and the sparse hierarchical interaction model 17 with weak hierarchy ( axis); right: Top 35 selection probabilities from complementary pairs stability selection (CPSS) in the sparse hierarchical interaction model.The model identifies a strong positive effect of A. caccae (AC) on butyrate production.AC is known to be a butyrate producer [55].However, this effect experiences notable inhibition when AC is combined with D. piger (DP), reflected by a strong negative interaction effect (AC:DP) in our model, while DP itself shows only a modest negative impact on the butyrate production.The inhibiting effect of DP on AC with respect to butyrate production has been shown in tri-cultures with E. hallii before [56] as well as in the context of hydrogen sulfide production by DP [24].The model identifies a multitude of further minor main and interaction effects, including R. intestinalis (RI), E. rectale (ER), AC and E. lenta (AC:EL), and AC and C. aerofaciens (AC:CA).Under the assumption that at least one bacterium from each pair contributing to a pairwise interaction influences butyrate production individually, we apply the sparse hierarchical interaction model with weak hierarchy on the same data.The model with hierarchy yields a substantially reduced set of selected effects while maintaining a similarly strong predictive performance (test set  2 = 0.72 with weak hierarchy versus  2 = 0.78 without hierarchy) (see Fig. 2b, left panel).While the effects of AC or the interaction between AC and DP stand out as clearly important predictors of butyrate production, regardless whether we fit the model with or without hierarchy or the choice of model selection procedure, the stability of smaller effects in the model, like ER or AC:ER, remains unclear.To further investigate, we combine the sparse hierarchical interaction model with stability selection, specifically complementary pairs stability selection (CPSS).The selection probabilities π , where  = 1, ...,  + (  − 1)/2, indicate that some of the small effects are robust such as ER, C. comes (CC), and D. longicatena (DL) ( π > 0.7).However, all other interaction effects, such as AC:EL or AC:CA, fail to demonstrate stability across multiple subsamples.

Application of interaction modeling on relative microbiome data 3.2.1 Feasibility study of accurate interaction detection on semi-synthetic relative microbiome data
In this section, we generate semi-synthetic data to demonstrate the ability of the sparse quadratic log-contrast model (sparse qlc), defined in 8 and 15, to accurately detect interaction effects in relative microbiome data.We elucidate the conditions under which accurate estimation is feasible by varying the degree of sparsity of the interaction features and the level of noise present in the data.In the semi-synthetic scenario, we leverage real-world relative microbial count data derived from 16S rRNA sequencing,  ∈ R × + , to generate  synthetic outcomes   ∈ R  for  = 1, . . ., .The generation of synthetic outcomes from real data ensures that all inherent distribution properties of the dataset are retained.In the first simulation, our goal is to understand how the sparsity level of an interaction feature affects the accuracy of the estimates.We use a subset of the data of the American Gut Project [57], processed in [16], comprising a selection of  = 50 OTUs, ranging from dense to sparse, and  = 300 subsamples (see Fig. 3b).We define  = 5 semi-synthetic scenarios, where   ∈ R  for  = 1, ...,  is given as the sum over a sparse linear combination 15/34  according to model formulation 8 (see Fig. 3a).To account for the zeros in the data when building log-ratios we add a pseudo count of one to each entry in , such that  :=  + 1.Each outcome is characterized by a fixed intercept term  * 0 = 10, three non-zero main effects ( 1 = 10,  2 = 20, and  3 = −30), and a noise term  * =  1 • N (0, 1), with  1 = 10.Moreover, each outcome   for  = 1, ...,  is characterized by a unique non-zero interaction effect Θ *   = 3 between two features   and   for  = 1, ...,  − 1 and  =  + 1, ...,  which varies across interaction features of different sparsity levels, namely, 36% (f7:f8,  = 1), 52% (f15:f16,  = 2), 67% (f29:f30,  = 3), 74% (f39:f40,  = 4), and 88% (f49:f50,  = 5) zero entries (see Fig. 3b).To avoid undesired correlation effects between the predictive features in the model, we ensure that the interaction features are uncorrelated (absolute Kendall's pairwise correlation || < 0.2) with the main effects (see Fig. 3c).By 16/34 fitting the sparse qlc model to the semi-synthetic data, we demonstrate that as interaction features become sparser, the accuracy of feature estimates with comparably small effect sizes (here Θ *   = 3) diminishes (see Tab. 1).Our simulations show that, while very sparse features with small true nonzero estimates are still selected by the sparse quadratic log-contrast model, they tend to be underestimated and are accompanied by many spuriously selected features with estimates of similar magnitude (see also Fig. S1).In a second simulation, we investigate the impact of noise in the data on the accurate estimation of main and interaction effects.Again, we utilize a subset of the data from the American Gut Project [57], preprocessed as described in [16].In contrast to the previous scenario, our objective is to evaluate the influence of noise while mitigating the effects of sparsity.To achieve this, we aggregate the data at the phylum level which is the highest taxonomic level with  = 10 phyla.We generate  = 4 synthetic outcomes   ,  = 1, ...,  according to model formulation 8 by fixing main and interaction effects, while allowing the noise levels to vary.We fix the intercept term at  * 0 = 10, assign six non-zero main effects that sum up to zero to the first six features as  * = (10, 20, 30, −10, −20, −30, 0, . . ., 0) ∈ R  , and introduce three non-zero interaction effects (between  1 and  3 ,  8 and  10 , and  9 and  10 ) as Θ * = 10 • 1 1:3, 8:10, 9:10 ∈ R × .The noise terms  *  follow a normal distribution with mean 0 and variance 1 with that is multiplied by a constant factor   ,  = 1, . . ., , given by  = (10, 100, 200, 500)  , that varies the noise level.Thus,  *  =   • N (0, 1).Consequently, the synthetic outcomes   ,  = 1, ...,  are given by To relate the noise terms to the signal part, we translate the noise levels to signal-to-noise ratios (SNR) denoted by snr  for  = 1, ...,  given by snr = (178.01,1.78, 0.45, 0.07)  .We fit both models, the sparse linear log-contrast model (sparse lc) and the sparse quadratic log-contrast (sparse qlc), to the newly generated data.Notably, features f2, f4, f5, and f6 show no contribution to interaction effects in our synthetic example and are accurately estimated by the (misspecified) sparse lc model (see Fig. 4a and b).However, for the features f1 and f3, which both have non-zero main effects as well as a common interaction effect, the sparse lc model accommodates the positive interaction effect between f1 and f3 within their main effect estimates, leading to an overestimation of the true positive main effect of f1 ( * 1 = 10) and an underestimation of the true positive main effect of f3 ( * 3 = 30).Moreover, the sparse lc model selects f8, f9 and f10 as relevant main effect features despite their lack of true nonzero main effects, in order to integrate their true underlying interaction effects.In contrast, the sparse qlc model captures the coefficients accurately.For both models, the overall performance deteriorates as the SNR decreases (see  2 values in Fig. 4c and estimation error summary statistics in Tab. 2).Our simulations indicate that the sparse qlc model outperforms the sparse lc model when true interaction effects are present, provided that the noise level is not excessively high or the SNR is not too low (see Fig. 4c).In scenarios where noise levels are exceptionally high, the interaction model tends to overfit the data (see Fig. 4c, right plot).Here, we perform sparse interaction modeling on environmental rather than experimental or host-associated microbiome data, highlighting salinity as a crucial factor in marine ecosystems.Variations in salinity play a critical role in shaping microbial community diversity and functionality in marine ecosystems, thereby influencing nutrient dynamics and ecosystem health [58,59].Using the marine data collection from Tara Oceans [60], which includes relative microbial abundance information derived as metagenomic OTUs (mOTUs) of ocean surface water and associated environmental covariates, we illustrate how sparse interaction modeling can substantially improve the predictive performance compared to models linear in the features.We aggregate the data on family level and learn a sparse quadratic log-contrast model (sparse qlc) as defined in 15 for ocean salinity from  = 136 samples and a subset of  = 30 most abundant families [61] and compare this to the corresponding sparse log contrast model that is linear in the features (sparse lc).Our comparison over 10 train test splits shows that modeling interaction effects rather than only main effects not only improves predictive accuracy (see Fig. 5a), but also allows predicting salinity concentrations beyond the interval [33.8, 36.6](see Fig. 5b).The model selects various features that are not consistently selected among all train test splits and no large main effects.However, it also identifies strong negative interaction effects between a family without annotation (f55) that belongs to the order SAR11 clade and the family Sphingomonadaceae as well as between two families without annotation (f13 and f96) that both belong to the order SAR11 clade.The negative interactions indicate that the quadratic effect of these groups (high abundances of both groups) comes with reduced salinity levels.For the families f8 and f60, that are also both part of SAR11 clade, the model identifies a positive interaction effect for all train test splits, indicating that the quadratic effect of these families comes with higher salinity levels.
Several studies indicate a potential link between salinity in the ocean and the abundance of the SAR11 clade [62,63,64,65].Our findings suggest that interactions among subtypes of the SAR11 clade and between SAR11 clade families and other families may play a role in relation to the salinity level in the ocean.).b, Scatterplot between the observed test set outcome  (salinity) and the prediction ŷ from the main effects sparse log-contrast model (SLC) and the all-pairs log-contrast interaction lasso (SLC + int.).c, Distribution of estimated main effect and interaction effect coefficients in the the all-pairs log-contrast interaction lasso (SLC + int.) over 10 train test splits.Only features (main or interaction features) with a non-zero mean coefficient are shown.Features with a non-zero median are bold.

Abundance information can hold valuable insights beyond presence-absence information
The interpretation of the relationship between an outcome variable  and species abundance information varies with the different data modalities discussed in this work.When using absolute counts, the effect of the actual abundance value can be derived, whereas presence-absence data provide insights into the effects of existence.Compositional data offer information on the impact of proportions between species abundances.While these modalities are typically analyzed based on availability, it remains unclear how effectively each data modality truly explains an outcome of interest.Furthermore, it is uncertain whether the same microbial taxa and interactions between taxa would be identified if another modality was used in the 20/34 prediction task.
To illustrate this, we revisit the example on quantitative microbial abundance information from the MetaCardis cohort as discussed in Section 3.1.This data can be transformed into compositions or presence-absence information using  = 1 >0 , allowing for a comparative analysis across the three data modalities for predicting the number of antimicrobial resistance genes (ARGs).We apply the sparse quadratic interaction model for absolute count data (defined in 3), the sparse quadratic interaction model for presence-absence data (defined in 4), the sparse quadratic log-contrast model (defined in 8) to the three versions of the abundance data.In this comparison, we employ the quadratic log-contrast model to analyze the relative information, as it offers the most straightforward interpretation for such comparisons.
We observe that transforming counts to relative abundances does not substantially affect the predictive performance, underscoring the importance of both absolute values and their proportions in describing the number of antimicrobial-resistant genes (ARGs), as shown in Fig. 6a.However, the presence-absence information alone does not adequately explain the number of ARGs.The genera, Prevotella, Bacteroides, and Escherichia, are identified as important predictors with consistent signs for the number of ARGs in both absolute counts and relative count information, as shown in Fig. 6b.This indicates that both the absolute abundance and the proportions relative to other taxa are significant.Escherichia, which is also part of the Proteobacteria phylum known to harbor a variety of ARGs [53,66], and Bacteroides fragilis, a species within the Bacteroides genus, exhibits high antimicrobial resistance rates and possesses numerous mechanisms related to antimicrobial resistance [67].These factors could explain the effects detected by our model.However, the presence-absence data do not select any of the main effects.Instead, this model frequently selects a multitude of interaction effects involving Prevotella.Overall, there is almost no agreement in the interaction effects identified across the different data modalities.In summary, our findings suggests that abundance information is overall more informative than presence-absence information and that whether we identify an interaction between two taxa is highly dependent on the underlying data modality.
21/34  a, Scatterplots (one train test split) and test set R squared  2 (average over 10 train test splits) for the comparison of the predicted number of anti microbial resistance genes (ARGs) and the observed number of ARGs on a test dataset based on the absolute count information of genera, the relative count information of genera and the presence-absence information of genera of the MetaCardis cohort.b, Distribution of coefficients over 10 train test splits for the superset of coefficients that are non-zero in one of the three data modalities.

22/34
Identifying predictive and interpretable main and interaction effects between microbial taxa that can be related to ecological, host-associated or environmental features is a cornerstone of statistical data analysis of microbiome data.To this end, we have introduced a generic modeling approach for quadratic interactions, that is further instantiated to accommodate three distinct data types, in which microbial abundance information typically appears: (i) quantitative microbiome, (ii) presence-absence information, or (iii) relative count information.For these three data modalities we introduce a generic optimization problem, that allows penalized model estimation to estimate parsimonious models, where the number of features  (here the microbial taxa) and pairwise interactions often exceeds the number of measurements .Our interaction modeling strategy combines existing approaches for sparse interaction modeling and introduces novel ways of modeling interaction effects in the compositional setting.In the realm of interpretability, we combine the quadratic interaction modeling framework with the concept of hierarchical interactions [34] and stability selection [38,37] as optional extensions.We introduce three mathematically equivalent ways of modeling quadratic interactions with relative input data: (a) the alr transformed quadratic model, (b) the quadratic log-contrast model, and (c) the quadratic log-ratio model.The three models differ in terms of interpretability, dimensionality, and optimization.In the p + p( p − 1)/2dimensional alr transformed quadratic model, where p = −1, the interpretation of coefficients is consistently tied to a reference feature, which is enforced to be part of the model in the regularized model case.This model can be optimized without constraints on the main effects and can be flexibly extended to hierarchical interactions.The  + (  − 1)/2-dimensional quadratic log-contrast model requires a zero-sum constraint on the main effects but offers a more convenient expression that does not require the assignment of a reference feature.In this model, each main effect is interpreted with respect to all other features.The 2 • (  − 1)/2dimensional quadratic log-ratio model does not require a constraint on the main effect coefficients in the optimization, as this property is automatically met, and allows for the interpretation of the relative effects between all pairs of features.Notably, the interpretation of the interaction coefficients in both the quadratic log-contrast model and the quadratic log-ratio model are identical.Although the log-contrast model in the linear case enjoys the clear advantage of lower dimensionality compared to the log-ratio model, this distinction becomes less relevant when introducing interactions.Hence, the primary criterion for selecting between the models should be based on the preferred interpretation.We demonstrate the broad applicability of our framework by analyzing microbial data covering all three data modalities across various ecosystems, including synthetic microbial communities, human gut microbiomes, and marine microbial ecosystems.We show how quadratic models can improve the predictive performance compared to linear models on real-world data and illustrate how main and interaction effects can be robustly estimated by integrating hierarchical interaction modeling and stability selection.Notably, using a synthetic community dataset from [24], we identified a strong inhibition of butyrate production by A. caccae when D. piger is present.We demonstrate that this is the only robust and consequently relevant combinatorial effect within this community.Additionally, we generate semi-synthetic data to demonstrate the ability of the sparse quadratic interaction model for relative microbiome data to accurately detect interaction effects.Based on this, we showcase how sparse an interaction feature can be in order to achieve an accurate estimation in our model.Further, we demonstrate, for varying noise levels, how a misspecified main effects model tends to inaccurately estimate effects when true interactions are present.Finally, we perform a comparative analysis of the quadratic interaction models for the three data modalities discussed in this study, demonstrating that absolute microbial counts and compositions achieve similarly strong predictive performance when predicting the number of antimicrobial resistance genes (ARGs).In contrast, transforming abundance data into presence-absence information results in a decline in predictive accuracy.Our analysis suggests that the identified (main and) interaction effects highly depend on the underlying data type.Moreover, the identified effects of individual genera and interactions between them are an interesting finding in themselves.Many of the relevant features correspond to the definition of enterotypes [68], as they were defined for this specific data in [69], suggesting a potential link between enterotypes and ARGs.As more and larger data sets become available, our models can be extended to higher-order interactions or more complex, arbitrary non-linear effects.Moreover, our results suggest that different data modalities carry different information, and it might be meaningful to include multiple layers of information in statistical prediction tasks with interaction effects.In summary, we believe that our framework and its implementation in R provide a valuable tool to study robust interaction effects in microbiome data of different modalities and distinct ecosystems, ranging from synthetic communities and community function studies in microbial ecology to observational microbiome data linking the microbiome to the host or environment.

Fig 1 .
Fig 1.Illustration of three data modalities in microbiome analysis and their combinatorial behavior with respect to an outcome, e.g., a community function (created with BioRender.com).The sketch depicts three distinct data modalities in the columns: (i) quantitative microbiome data, representing absolute counts; (ii) presence-absence information of microbial species; and (iii) relative abundance data, also known as compositions.Each row illustrates a simplified scenario.In the first scenario, blue microbes are present while red are absent, resulting in a large (production of) outcome (e.g., butyrate).In the second scenario, red microbes are present while blue are absent, leading to another large (production of) outcome.In the third scenario, both blue and red groups of microbes are present, yet only minimal amounts of the outcome are produced, indicating an antagonistic combinatorial effect between the two groups.
t e r o id e s : B il o p h il a P r e v o t e ll a : F a e c a li b a c t e r iu m P r e v o t e ll a

Fig 2 .
Fig 2. a, Prediction of the number of anti microbial resistance genes (ARGs) from absolute abundance information of microbial genera from the MetaCardis cohort.Left: distribution of estimated coefficients over 10 train test splits in the sparse quadratic interaction model 13.Only coefficients with a non-zero median estimated coefficient are shown; right: Scatterplot comparing the observed and predicted number of ARGs on a test data set for the sparse quadratic interaction model.b, Prediction of butyrate production from the abundance information of microbial species from[24].Left: comparison of median estimated coefficients (over 10 train test splits) between the sparse quadratic interaction model ( axis) and the sparse hierarchical interaction model 17 with weak hierarchy ( axis); right: Top 35 selection probabilities from complementary pairs stability selection (CPSS) in the sparse hierarchical interaction model.

Fig 4 .
Fig 4. Influence of model misspecification and noise in semi-synthetic scenarios.a, Solution path for the misspecified main effects model (sparse lc) and the interaction model (sparse qlc) for the synthetic scenario  = 1 with a signal-to-noise ratio (SNR) of 178.01 for one train test split.b, Estimated coefficients distributions over 10 train test splits corresponding to the solution paths in a.For the interaction model only three non-zero interaction features are shown for visualization purposes.c, Comparison of model performance via R squared ( 2 ) for the main effects model and the interaction effects model on train and test data for four different SNRs.

Fig 5 .
Fig 5.  Summary plots of the Tara ocean data on family level for salinity prediction.a, Train and test set R squared  2 distribution over 10 train test splits for the main effects sparse log contrast model (SLC) and the all-pairs log-contrast interaction lasso (SLC + int.).b, Scatterplot between the observed test set outcome  (salinity) and the prediction ŷ from the main effects sparse log-contrast model (SLC) and the all-pairs log-contrast interaction lasso (SLC + int.).c, Distribution of estimated main effect and interaction effect coefficients in the the all-pairs log-contrast interaction lasso (SLC + int.) over 10 train test splits.Only features (main or interaction features) with a non-zero mean coefficient are shown.Features with a non-zero median are bold.

Fig 6 .
Fig 6.  a, Scatterplots (one train test split) and test set R squared  2 (average over 10 train test splits) for the comparison of the predicted number of anti microbial resistance genes (ARGs) and the observed number of ARGs on a test dataset based on the absolute count information of genera, the relative count information of genera and the presence-absence information of genera of the MetaCardis cohort.b, Distribution of coefficients over 10 train test splits for the superset of coefficients that are non-zero in one of the three data modalities.