Through the Random Forest: Ontogeny as a study system to connect prediction to explanation

As modeling tools and approaches become more advanced, ecological models are becoming more complex and must be investigated with novel methods of analysis. Machine learning approaches are a powerful toolset for exploring such complexity. While these approaches are powerful, results may suffer from well-known trade-offs between predictive and explanatory power. We employ an empirically rooted ontogenetically stage-structured consumer-resource model to investigate how machine learning can be used as a tool to root model analysis in mechanistic ecological principles. Applying random forest models to model output using simulation parameters as feature inputs, we extended established feature analysis into a simple graphical analysis. We used this graphical analysis to reduce model behavior to a linear function of three ecologically based mechanisms. From this model, we find that stability depends on the interaction between internal plant demographics that control the distribution of plant density across ontogenetic stages and the distribution of consumer pressure across ontogenetic stages. Predicted outcomes from these linear models rival accuracy achieved by our random forests, while explaining results as a function of ecological interactions.

To whom correspondence may be addressed: As ecologists expand the range and depth of ecological theory they necessarily integrate 66 advanced modeling techniques and computational tools to produce increasingly complex models. 67 This increase in model complexity has been driven, in part, by the long-standing debate on the 68 relationship between diversity and stability in ecosystems 1 . Differing levels of diversity (in 69 terms of number of species and interactions) directly affects the number of interacting variables 70 and, therefore, model complexity 2,3 . Indeed, numerous studies continue to raise unique sources 71 of complexity and separate mechanisms tying diversity to increased stability. These mechanisms 72 include weak species interactions 4 , adaptive foraging 5 ; allometric scaling of interaction strength 73 6 ; and omnivorous 7 , mutualistic 8 , or high-order interactions 9 . These diverse sources of model 74 complexity inspired by ecological processes, their unique mechanistic effects on model 75 dynamics, and the nonlinearities they produce emphasize ecologists' need for an adaptable 76 methodology that can be deployed across model frameworks. Our work shows how machine 77 learning can be used to develop such an adaptable methodology while addressing its typically 78 associated limits in interpretability. 79 Modern machine learning algorithms are flexible, powerful tools used to study many 80 complex systems and are increasingly applied to ecological datasets 10 . Recently, ecologists have 81 applied machine learning algorithms to predicting species interactions from empirical trait data 82 11 , improving estimation of viral host ranges from incomplete datasets 12 , and examining food 83 web responses to variable functional diversity across trophic levels 13 . While these machine 84 learning approaches are powerful, they have been called "black boxes" because their high 85 predictive power does not guarantee highly interpretable results 14 . Recently, methods for the 86 productive interpretation of machine learning results have been reviewed in the hopes of 87 transforming this "black box" into a "translucent box" 10 . However, developing broadly 88 interpretable results is difficult. This is partially because machine learning algorithms necessarily 89 rely on direct model output, which can potentially limit ecological generalizability as model 90 complexities lead to incommensurable (i.e., lacking common measurement standards) results 91 across different formulations and parameterizations 13 . 92 One key source of ecological complexity is demographic heterogeneity, frequently 93 modeled as stage-structured organismal ontogeny 15  Eq. 1, Fig. 1, and Appendices S1.1 & S1.2). Plant ontogeny is divided into three stages: a seed 129 bank ( 1 ), non-reproductive seedlings ( 2 ), and fecund adults ( ). While there are clearly a 130 variety of potential ontogenetic structures, a three-stage ontogenetic structure provides initial 131 tractability in analysis and is well-represented across plant species (see Appendix S1.1). Public

139
For further details, please see SI File 1 and Appendix S1.

140
The stage-structured plant in our model is consumed by a single herbivore species ( ) 141 (Eq 1; Fig 1a). Real world herbivory can certainly involve multiple species interactions.

142
However, specialization of herbivore species on a single plant taxa (especially at the family 143 level) is common (especially amongst insect herbivores) and geographically widespread 22,23 .

144
The modeled herbivore is limited to eating the vegetative stages ( 2 & ) as the role of "seed 145 predator" is rarely filled by a species which also consumes vegetative tissue, again especially in  Table 1. Simulations and numerical analyses were done in Mathematica 11. We focus analysis on 162 three specific demographic parameters and two trophic parameters. The demographic parameters 163 are germination rates of seeds into seedlings ( 12 ), seedling maturation rates into fecund adults 164 ( 2 ), and seed production rate by fecund adults ( ). The trophic parameters are the attack rates 165 of herbivores ( ) on seedlings ( 2 ) and fecund adults ( ). These five rates represent both 166 demographic and trophic flow, a functional basis for studying how plant ontogeny interacts with 167 trophic dynamics (Fig 1a). 168 We informed the ranges of demographic parameter values simulated in our parameter demographic rates, indicating no appreciable covariation between parameters (Appendix S1.3; 176 Fig S2). In demographic terms, no covariation between these rates means there was no clear 177 evidence of tradeoffs such that, for example, a higher seed production rate necessitated lower 178 germination rates, or direct correlative relationships between the stage transition rates. The range 179 of herbivore attack rates was chosen heuristically (see Table 1). The lack of any apparent numerous biological fields, 5) and can readily be applied to both categorical and quantitative 195 data 30,31 . Using the randomForest package in R 32 , our five simulation model parameters (Fig 1a) 196 functioned as features/predictors with local-stability indicators serving as our predicted variables 197 (Appendix S2). To avoid terminology confusion, we refer to simulation model parameters in 198 general as "parameters" and refer to them specifically as "features" when used as random forest 199 inputs (i.e., independent variables). We then use the term "predictor" only to specifically refer to 200 independent predictors used in our liner models.

201
For categorical tasks we used a simple indicator, locally stable or unstable. For regression 202 tasks we used the model equilibria's eigenvalues. We trained random forests using hold out cross 203 validation methods. As a default, random forest parameter "mtry" was set at floor(sqrt(p)) for with little to no effect on performance. Note these random forest parameters specifically refer to 208 random forest formulation and are not related to the simulation model parameters. We measured 209 random forest performance on validation/test data using area under the receiver operating Our five simulation model parameters (3 demographic and 2 trophic rates; see Table 1) 216 served as random forest input features. We determined the degree to which they were plots; Fig S3) on simulation dynamics belied a high degree of variability in feature effects 240 throughout the simulation data (e.g., ICE plots; Fig S3).  (Fig 2), which then leads us to a better overall understanding of our system's trophic dynamics. categorical stability (Fig 2a). We can see that stability estimates are increased by lowering either  Our PD plot (Fig 2a) shows that as both 12 and 2 increase, predictions gradually shift 291 from stable to unstable. Based on this observation, we can make a "threshold plot" which depicts 292 thresholds between our categorical variables, stable and unstable behavior, as a function of a 293 third yet unexamined parameter, which in our case is seed production, (Fig 2b). Plotting the 294 thresholds between our stability categories shows a similar dynamic between 12 and 2 as 295 seen in the PD plot. It also reveals that the gradual changes seen in the PD plot were in fact a 296 function of the rate of seed production, , where higher seed production supports stability at 297 higher maturation rates. This is striking given that increased resource production is generally a 298 destabilizing influence in the traditional Lotka-Volterra formulation. Increases in seed 299 production are also related to increases in L:D ratio (Fig S5), so the stabilizing effect of must 300 be coming from a different mechanism. Using a similar analysis on pre-simulation equilibrium 301 values as was done with L:D ratio, we can integrate * (see Eq. 1 & description in Table 1) into 302 our regression given its clear connection to rF values. Doing so raises our explanatory power in 303 our regression (R 2 =0.95 in this subset of data) and our predictive power (AUC=0.98), giving us 304 comparable results to our random forest. However, this explains very little of the ecology given 305 that * is largely just the immediate realized effect of and doesn't describe any other changes 306 in system behavior.

307
Examining the effects of on the rest of our system shows that is also positively 308 correlated with functions 12 * and 2 * (see definition and description of these functions in Eq. 1 309 and Table 1, respectively; Fig S5). This implies a stabilizing effect of high densities of maturing 310 seedlings, which seems unintuitive given the stabilizing effect of lowering baseline maturation 311 rates, 12 and 2 . The key difference is the mechanistic difference between increasing 12 * and 312 2 * via baseline maturation rates ( 12 & 2 ) vs increasing overall seedling density via increases 313 in seed production ( ) (see Appendix S3.2 for more details). Increasing 12 * and 2 * via baseline 314 maturation rates ( 12 & 2 ) changes the density distribution ratio in favor of (Fig S6a),  (Fig S6b). In adult-only herbivory, these saturated stages face no consumer pressure and 320 therefore act as a more immediate reservoir to replace adults lost to consumption by . The 321 functions 12 * and 2 * increase because there are simply more seeds and seedlings maturing.

322
Therefore, even though increasing plant density may lead to higher overall consumption, the 323 reservoir of density in younger stages titrates into adult stages due to consumption, raises the 324 trough of oscillations as (and subsequently 12 * and 2 * ) increases ( Fig S7) and ultimately 325 leads to a stabilizing consumer dynamic. increasing resource availability for the herbivore can be both stabilizing (via ) or destabilizing 338 (via 12 , 2 ) depending on the specific parameters underlies much of the parameter context 339 dependence detected by our initial random forest (Fig 1d).

340
Integrating the 12 * and 2 * factors into our earlier linear regression analysis (with partial 341 least regression due to collinearity between our variables) shows that our three factors (L:D ratio, 342 12 * and 2 * ) explain 91% of the variance of the maximum eigenvalue (from Fig 2a&2b) and 343 matches our expected direction of effect (shown in Fig 3a). Predictive power increased as well

355
Simulation results reveal that the stability of the seedling-only consumer is vulnerable to 356 destabilization from even limited multi-stage consumption (Fig 2c&2d). PD plots offer some 357 ecological explanation in showing that oscillations can still be stabilized by restrictions in 358 maturation rates (Fig 2c). Extending this analysis with our threshold plots we see that the effect 359 of rF has effectively been reversed (Fig 2d). Overall then, the slight addition of adult 360 consumption institutes a relationship between 12 and 2 that mimics an adult-only herbivore 361 but with the caveat that higher values require substantially more restricted maturation to 362 stabilize dynamics. 363 We can explain this result using our earlier ecological factors (L:D ratio, 12 * , 2 * ).

364
Compared to the adult-only herbivore, we can see that in the seedling-dominant herbivore, 365 raising has much higher proportional effect on L:D ratio compared to 12 * and 2 * (Fig S8).

366
This is because L:D ratio now consists of consumption on both stages and the seedling stage can 367 no longer act as a saturating reservoir with increased seed production. Additionally, increasing 368 either 12 * or 2 * via (higher density) induces higher cumulative attacks ( values; see Eq 1 & 369 Table 1) on both stages. This becomes clear when we regress our ecological factors on our 370 eigenvalue data from this subset of herbivory data. We see that the effects of 12 * and 2 * have 371 switched from stabilizing to destabilizing (i.e., raising the max eigenvalue; Fig 3a). Despite these 372 changes, our linear model using our ecological factors still performs comparatively well to our  values is now more dependent on higher 2 (Fig 2e). Extending the analysis with our threshold 383 plots again indicates that higher seed production values ( ) limit stable { 12 , 2 } parameter 384 space (Fig 2f). strength of density dependence (see Table 1) with a notable exception when handling times for 408 seedling consumption are smaller than for adult consumption (see Fig S10). Expectation plots revealed the extent of this context-dependency (e.g., ICE plots Fig S3), but our 437 goal was to go beyond the well-known ecological axiom that parameter effects are context-438 dependent. We therefore subdivided our data into its component parts and focusing on the 439 consistently important model parameters and it became clear that different parameters drove 440 model behavior across different herbivory allocations (Fig 1g-1i). By investigating feature 441 interactions using two-dimensional PD plots and then extending categorical analysis with our 442 threshold plots (Fig 2), we were able to move beyond connecting model behavior to parameter 443 values to describing model output via tangible ecological processes.

444
For ease of communication, we called our ecological processes (L:D ratio, 12 * , and 2 * ) 445 "factors." Using these factors, we detailed how demographic and trophic rates interact to drive 446 ontogenetically mediated consumer-resource dynamics via a consistent set of ecological 447 mechanisms. These factors were also able to predict model output with comparable accuracy to 448 that of the random forest. Furthermore, despite being found from categorically tasked random 449 forests (stable or unstable), these same factors performed well on continuous predictions 450 (maximum eigenvalues) given the relatedness of each variable.

451
In expanding upon our work to higher dimensionalities found in food web/network  We do not claim our exact methodology will be entirely applicable for all models or