Including random effects in statistical models in ecology: fewer than five levels?

As generalized linear mixed-effects models (GLMMs) have become a widespread tool in ecology, the need to guide the use of such tools is increasingly important. One common guideline is that one needs at least five levels of a random effect. Having so few levels makes the estimation of the variance of random effects terms (such as ecological sites, individuals, or populations) difficult, but it need not muddy one’s ability to estimate fixed effects terms – which are often of primary interest in ecology. Here, I simulate ecological datasets and fit simple models and show that having too few levels of random effects does not influence the parameter estimates or uncertainty around those estimates for fixed effects terms. Thus, it should be acceptable to use fewer levels of random effects if one is not interested in making inference about the random effects terms (i.e. they are ‘nuisance’ parameters used to group non-independent data). I also use simulations to assess the potential for pseudoreplication in (generalized) linear models (LMs), when random effects are explicitly ignored, and find that LMs do not show increased type-I errors compared to their mixed-effects model counterparts. Instead, LM uncertainty (and p values) appears to be more conservative in an analysis of a real ecological dataset presented here. These results challenge the view that it is never appropriate to model random effects terms with fewer than five levels (specifically when inference is not being made for the random effects), and suggest that in simple cases LMs might be robust to ignored random effects terms. Given the widespread accessibility of GLMMs in ecology and evolution, future simulation studies and further assessments of these statistical methods are necessary to understand the consequences of both violating and blindly following simple guidelines.

Generalized linear mixed-effects models are a regression-type analysis that is flexible in that it can handle a variety of data-generating processes such as binomial (e.g. presence / absence of a species, alive / dead individuals) and Poisson (e.g. wildlife counts). When the sampling distribution is Gaussian (also known as normal; e.g. mean-centered continuous data such as tree diameter, vocalization frequency, or body condition residuals), this is a special case of a GLMM referred to simply as a linear mixed-effects model (LMM). GLMMs (and LMMs) differ from their simpler counterparts, (generalized) linear models (GLMs and LMs), in that they include random effects in addition to the fixed effects (hence mixed-effects).
Fixed effects are also often called predictors, covariates, explanatory or independent variables in ecology and include both variables of interest (e.g. average temperature in climate change studies or sound pressure levels in anthropogenic noise research) and other variables that are only included to control for unexplained variation but are often not directly useful to understanding the research question at hand (e.g. date or time of sampling in the above studies). Fixed effects are fixed in that the model parameters (β in equation 1 below) are assumed to be fixed, or nonrandom, and are not drawn from a hypothetical distribution.
Random effects, on the other hand, allow one to combine information (e.g. in a meta-analysis), deal with spatiotemporal autocorrelation, use partial pooling to borrow strength from other populations or groups, account for grouping or blocked designs (e.g. repeat-measures data from sites or individuals), and estimate population-level parameters, among other uses (Kéry & Royle, 2015). Thus, the random effects structure should be decided by the experimental design (Barr et al., 2013; Arnqvist, 2020). Random effects are random in that they are assumed to be drawn randomly from a distribution – often a Gaussian distribution – during the data-generating process. This is most often done by fitting a random intercept for each group (see equation 1 below), but one can, and should, also assign random slopes, where the slopes of variables (not just the intercepts) are allowed to vary by group (see Bolker, 2008; Schielzeth & Forstmeier, 2009; Kéry & Royle, 2015; Harrison et al., 2018). If we are interested in the variability of a population of individuals, groups, sites, or populations, it is difficult to estimate this variation with too few levels of those units (i.e. random effects terms).
"When the number of groups is small (less than five, say), there is typically not enough information to accurately estimate group-level variation" (Gelman & Hill, 2006). "...if interest lies in measuring the variation among random effects, a certain number is required... To obtain an adequate estimate of the among-population heterogeneity – that is, the variance parameter – at least 5–10 populations might be required" (Kéry & Royle, 2015).
"With <5 levels, the mixed model may not be able to estimate the among-population variance accurately" (Harrison et al., 2018).
"Strive to have a reasonable number of levels (at the very least, say, four to five subjects) of your random effects within each group" (Arnqvist, 2020).
This guideline that random effects terms should have at least five levels (i.e. groups) is backed by only limited empirical evidence (Harrison, 2015), but it is intuitive that too few draws from a distribution will hinder one's ability to estimate the variance of that distribution. Indeed, in each of the above segments of quoted text, the authors suggest that five levels are needed for estimation of group-level, or among-population, variance. However, this rule is often adhered to out of context, where authors or reviewers of ecological journals suggest that one cannot use random effects terms that do not contain at least five levels, in any circumstance.
Simulations by Harrison (2015) demonstrate that random effects variance can be more strongly biased when the number of random effects levels is low, yet in that work slope (beta) estimates for fixed effects terms appear generally no more biased with only three random effects levels than with five. There are many cases (some would argue most cases, see below) in which the variance of random effects is not directly of interest to the ecological research question at hand. "...in the vast majority of examples of random-effects (or mixed) models in ecology, the random effects do not have a clear ecological interpretation. Rather, they are merely abstract constructs invoked to explain the fact that some measurements are more similar to each other than others are – i.e., to model correlations in the observed data" (Kéry & Royle, 2015).
Thus, it is unclear whether it is appropriate to use random effects when there are fewer than five grouping levels in situations where one does not directly care about the 'nuisance' among-population variance, but is instead interested in estimates and uncertainty of predictor variables (i.e. fixed effects). The current state of practice in ecology is to drop the random effects terms and instead use generalized linear models that do not group observations (dropping the 'mixed-effects' from the GLMM leaves a GLM). I question whether we are choosing to accept pseudoreplication in ecology (Hurlbert, 1984; Kéry & Royle, 2015; Arnqvist, 2020), rather than inaccurate estimates of among-population variance. In cases where one does not care about among-population variance, this tradeoff may be non-existent, but little research exists to support this.
Here, I simulate ecological datasets to assess whether fixed effects estimates are more biased when the accompanying random effects consist of fewer than five levels; I also ask whether using an alternative model without random effects (LMs) leads to higher type I errors (demonstrating a 'significant' effect when in fact one does not exist). I also analyze a real dataset within a similar framework to understand how the number of random effects levels and model structure (random effects or not) might change ecological inference.

Methodology
All simulation of datasets and model fitting was done in R v4.0.4 (R Core Team, 2017); all visualizations were accomplished with the aid of the R package `ggplot2` (Wickham, 2011).

Data generation
I used a modified version of code from Harrison (2015) to explore the importance of varying two parameters in a linear mixed-effects model (LMM): the number of observations in a dataset (30, 60, or 120) and the number of levels of the random intercept term (3, 5, or 20). We can think of the latter as the number of individuals in an experiment or the number of field sites in a study. This was done by generating a response variable y from the following equation:

y_i = α_j(i) + β1·x1_i + β2·x2_i + ε_i (equation 1)

where α_j(i) is the intercept for site (or individual) j to which observation i belongs. Thus, observations from the same site shared a site-level intercept, and these intercepts were drawn from a normal distribution with mean (μ) = 0 and standard deviation (σ) = 0.5:

α_j ~ Normal(μ = 0, σ = 0.5) (equation 2)

The residual term ε_i was drawn from a normal distribution with mean (μ) = 0 (same as equation 2 above) and standard deviation (σ) = 0.25; this term simply adds noise to our system during the data-generating process, but is modelled implicitly as residual variance in many generic GLMM functions.
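The data-generating process above can be sketched in code. The paper's simulations were written in R (modified from Harrison, 2015); the following is an illustrative Python/NumPy translation, with the true slopes (β1 = 2, β2 = 0) taken from the results and all variable names my own:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_dataset(n_obs, n_sites, beta1=2.0, beta2=0.0,
                     sigma_site=0.5, sigma_resid=0.25):
    """Generate one dataset: y_i = alpha_j(i) + beta1*x1_i + beta2*x2_i + eps_i."""
    site = np.resize(np.arange(n_sites), n_obs)        # site j for each observation i
    alpha = rng.normal(0.0, sigma_site, size=n_sites)  # site-level random intercepts
    x1 = rng.normal(size=n_obs)                        # predictor with a true effect
    x2 = rng.normal(size=n_obs)                        # null predictor (true effect = 0)
    eps = rng.normal(0.0, sigma_resid, size=n_obs)     # residual noise
    y = alpha[site] + beta1 * x1 + beta2 * x2 + eps
    return y, x1, x2, site

y, x1, x2, site = simulate_dataset(n_obs=60, n_sites=5)
print(y.shape, np.unique(site).size)
```

Here sites are assigned cyclically so that every level is represented; the real simulation code may allocate observations to groups differently.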
For comparison, a linear model (LM) was fit to the same data, with site included as a fixed (categorical) effect: a β term is then estimated for each site (or population) level independently. Site effects no longer come from a normal distribution (as in equation 2), but are instead considered fixed – hence fixed effects. Thus, both a LMM and a LM were fit to each simulated dataset (n = 10,000) of each of the nine combinations of data generation (30, 60, or 120 observations by 3, 5, or 20 sites; 90,000 total simulated datasets and 180,000 models fit to data). This allowed for a comparison of the type-I error rates of LMMs and LMs, the latter of which does not model the blocked structure of the data (i.e. site-level grouping) as a random effect.
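The contrast between estimating an independent intercept per site and ignoring sites altogether can be illustrated with ordinary least squares. This is a NumPy sketch on toy data, not the `lme4`-based analysis of the paper (and neither version is the partial-pooling LMM):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy grouped data: 5 sites, 12 observations each, true slope = 2
n_sites, per_site = 5, 12
site = np.repeat(np.arange(n_sites), per_site)
alpha = rng.normal(0.0, 0.5, size=n_sites)              # site-level intercepts
x = rng.normal(size=site.size)
y = alpha[site] + 2.0 * x + rng.normal(0.0, 0.25, size=site.size)

# Site as a fixed (categorical) effect: one dummy column per site,
# so an intercept is estimated independently for each site.
X_fixed = np.column_stack([(site[:, None] == np.arange(n_sites)).astype(float), x])
slope_fixed = np.linalg.lstsq(X_fixed, y, rcond=None)[0][-1]

# Site ignored altogether: a single global intercept.
X_pooled = np.column_stack([np.ones(site.size), x])
slope_pooled = np.linalg.lstsq(X_pooled, y, rcond=None)[0][-1]

print(round(slope_fixed, 2), round(slope_pooled, 2))  # both near the true slope of 2
```

Because the predictor is independent of site, both fits recover the slope; the versions differ mainly in how the site-level variation is absorbed (dummy intercepts vs residual error), which is what drives the uncertainty comparisons in the results.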

Type-I error calculation and p values
Type-I error rate was calculated as the proportion of 10,000 models in which a 'significant' p value of ≤ 0.05 was obtained for the β2 parameter estimate (e.g. ocean salinity), for which the true value was set to 0. I sampled (with replacement) 10,000 p value 'observations' from each group of 10,000 models to produce a new proportion of type-I error; this process was repeated 1,000 times, and the bootstrapped 95% confidence intervals were calculated as the 0.025 and 0.975 quantiles of those 1,000 replications (see code; modified from Harrison, 2015).
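The bootstrap procedure can be sketched as follows. Since the fitted models are not reproduced here, uniformly distributed p values stand in for the 10,000 null p values (a well-calibrated test produces uniform p values when the true effect is zero); the resampling logic matches the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the 10,000 p values of the null (true effect = 0) parameter.
p_values = rng.uniform(size=10_000)

type1 = np.mean(p_values <= 0.05)  # observed type-I error rate

# Resample the p values with replacement 1,000 times; the 0.025 and 0.975
# quantiles of the resampled rejection rates give a bootstrapped 95% CI.
boot = np.array([
    np.mean(rng.choice(p_values, size=p_values.size, replace=True) <= 0.05)
    for _ in range(1_000)
])
ci_low, ci_high = np.quantile(boot, [0.025, 0.975])
print(f"type-I error: {type1:.3f}, 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```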

Case study: spider body condition and noise
I used orb-weaving spider body condition data from a previous experiment (Gomes, Hesselberg & Barber, 2021). In short, speakers were used to experimentally broadcast whitewater river noise (predictor variable: sound pressure level) over the course of multiple summers. Orb-weaving spiders (Tetragnatha versicolor) were then weighed and measured (femur length), and body condition (response variable) was calculated from a residual index, using these measurements (Jakob, Marshall & Uetz, 1996;Gomes, 2020).
Here, I randomly sampled this dataset to create 30,000 new datasets, such that each new dataset contained three random observations from each of 3, 5, or 10 sites (10,000 datasets for each level of random effects terms). Thus, each dataset contained 9, 15, or 30 total observations when there were 3, 5, or 10 sites, respectively. While it would have been ideal to separate sample size from the number of random effects (as in the simulated datasets above), this simply wasn't possible with this real dataset (i.e. obtaining 30 total observations from three sites was not possible, while easily done for ten sites). Each dataset was fit with a linear mixed-effects model (with sound pressure level as a fixed effect and a random intercept for site) and a linear model (sound pressure level only). I used the same methodology as above in 'Type-I error calculation' to calculate the proportion of 10,000 models in which p < 0.05, which corresponds to rejecting the null hypothesis that sound pressure level is not related to spider body condition. In this case, we are not assessing type-I error, because we do not truly know whether this variable should be significant or not; instead, it serves as a point of comparison across methods (LMM vs LM) with real rather than simulated data.
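The resampling scheme can be sketched as follows, with random placeholder numbers standing in for the actual spider measurements; the structure – pick k sites, then three random observations from each – follows the text:

```python
import numpy as np

rng = np.random.default_rng(7)

# Placeholder for the spider data: 10 sites, each with 8 rows of
# (sound pressure level, body condition). Values here are arbitrary.
n_total_sites = 10
data = {s: rng.normal(size=(8, 2)) for s in range(n_total_sites)}

def resample_dataset(n_sites, n_per_site=3):
    """Draw n_sites sites, then n_per_site random observations from each."""
    sites = rng.choice(n_total_sites, size=n_sites, replace=False)
    rows = [data[s][rng.choice(8, size=n_per_site, replace=False)] for s in sites]
    return np.vstack(rows)

for k in (3, 5, 10):
    print(k, resample_dataset(k).shape)  # 9, 15, or 30 total observations
```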

Estimating model parameters and uncertainty
Linear mixed models and linear models were able to recover simulated fixed effect relationships with no noticeable patterns in bias, regardless of number of levels of random effects or sample size. That is, both mean model parameter estimates (β1 and β2) were centered on their true values (2 and 0, respectively; Table 1; Figure 1, Figure S1). Uncertainty around these estimates decreased as sample size doubled from 30 to 60 observations, and another doubling to 120 observations led to a further decrease in uncertainty by 33.4% and 32.9%, respectively. The number of levels of random effects appears to be relatively unimportant in recovering model parameter estimates within these simulation scenarios (Table 1; Figure 1); instead, there were small, likely negligible, increases in uncertainty around fixed effect parameter estimates as the number of levels of random effects increased.
All LMM estimates of the distribution mean (μ) were unbiased, regardless of number of levels of random effects or sample size (Table 1; Figure 2A). The random effects variance (σ) estimates, however, were not centered at the true value, and estimates were more biased with fewer levels of random effects, whereas sample size did not affect this bias (Table 1; Figure 2B). That is, with only three levels of random effects the magnitude of the bias was 12.2% of the true value.
Increasing to five levels of random effects nearly halved this bias to 6.4%, and increasing to 10 levels halved the bias again to 3.2% of the true value. Averaged across numbers of random effects terms, estimates were biased by about 7% regardless of sample size (7.1%, 7.4%, and 7.2% for N = 30, 60, and 120 respectively).
The uncertainty around random effects estimates (μ and σ) generally decreased with an increased number of random effects levels, whereas sample size did little to alleviate this uncertainty (Table 1; Figure 2). Increasing the number of random effects levels from 3 to 5, and then from 5 to 10, decreased the uncertainty for μ by 22.4% and 29.1%, respectively, and for σ by 26.6% and 29.8%, respectively.

Type-I errors
For all simulated datasets, both LMM and LM produced type-I error rates around the typical α = 0.05, with 95% confidence intervals overlapping this value. Neither sample size nor the number of random effects levels seemed to influence the type-I error rate. Furthermore, dropping the random effects structure (using a LM instead of a LMM) did not increase the probability of type-I errors (Figure 3), nor did it affect the ability of the model to accurately estimate fixed effects parameters (fixed effects estimates appear the same as they do in Figure 1, see Figure S1).

Case study results
Linear mixed models (LMM) and linear models (LM) both consistently estimated the fixed effect parameter for sound pressure level to be very weakly positive, with large overlap with zero (i.e. no relationship). Linear model estimates were slightly shrunk towards zero compared to LMMs (Figure 4A), but it is impossible to assess bias, since we do not know the truth with these biological data. The uncertainty around these estimates was lower for LMMs than for LMs, and decreased as the sample size and number of sites increased (in tandem).
Output from 10,000 LMs suggests that the null hypothesis (no effect) would be rejected at a proportion of approximately 0.05 (Figure 4B), which is around the typical type-I error alpha level (i.e. suggesting that sound pressure level is not a significant predictor of spider body condition). LMMs, on the other hand, experienced null hypothesis rejection rates nearly double those of their LM counterparts, which is consistent with the reduced uncertainty around the LMM estimates in Figure 4A. The rate of rejecting the null hypothesis, however, did not appear to be related to the number of sites (and thus total observations).

Discussion
The work presented here demonstrates that i) fixed effects estimates are not more biased when an accompanying random effect has fewer than five (n < 5) levels, although population-level (random effects) variance estimates are, and ii) type-I error rates are not increased by using linear models (LM) instead of linear mixed-effects models (LMM).
Fixed effects parameter estimation does not appear to be strongly influenced by, nor biased by, the number of levels of random effects terms. Instead, uncertainty in those estimates is much more strongly influenced by sample size. While this pattern may appear to contradict the decreased uncertainty (with more random effects levels) around beta estimates in Figure 2 of Harrison (2015) and in Figure 4 of the current work, the difference is instead due to the way that sample size relates to the number of random effects levels. Harrison (2015) coded each random effect level to be associated with a fixed number of observations (N = 20), such that each additional random effect level yielded an increased sample size – as is also the case in the spider case study presented here. However, in the simulations here (Figure 1), sample size (i.e. number of observations) has been separated from the number of random effects terms (e.g. sites or individuals).
Despite these differences in coding, the estimation of random effects terms (μ and σ) in the simulations here shows patterns consistent with Harrison (2015), in that variance (σ) is more biased with fewer levels. This supports previous suggestions and simulations indicating that few levels of random effects terms can make estimation of population-level variance difficult (Gelman & Hill, 2006; Harrison, 2015; Kéry & Royle, 2015; Harrison et al., 2018), but the cutoff at five random effects levels appears quite arbitrary. The combination of these results suggests that using fewer than five levels of random effects is acceptable when one is only interested in estimating fixed effects parameters (i.e. predictors, independent variables); in other words, when inference about the variance of random effects terms (e.g. sites, individuals, populations) is not of direct interest, but the terms are instead used to group non-independent data. In these cases, however, caution should be taken in reporting the variance estimates for such population-level parameters – as this information can later be taken out of context of the question at hand and may result in the propagation of biased estimates.
Those following the "less than five levels" guideline typically drop random effects from analyses, turning a LMM into a LM. In both the simulations and the case study, LMMs and LMs did not appear to give drastically different parameter estimates for fixed effects. In the spider case study, LMMs gave more consistent results, leading to increased parameter certainty when compared to LMs, which was also reflected in a higher probability of 'significant' p values (around 10% of the models) when compared to LMs (around 5% of the models, which is to be expected due to chance). That is, results from LMs here suggest that there is no significant effect of the predictor (sound pressure level) on the response (body condition). Misspecified mixed-effects models can, however, be overconfident in their estimates (Schielzeth & Forstmeier, 2009).

The use, and abuse, of p values in general is highly debated and controversial (Yoccoz, 1991; Schervish, 1996; Wagenmakers, 2007; Murdoch, Tsai & Adcock, 2008; Gelman, 2013; Murtaugh, 2014; Leek & Peng, 2015; Greenland et al., 2016; Ho et al., 2019), but the issue is further complicated by mixed-effects models (Luke, 2017). Douglas Bates and the other authors of the R package `lme4` (Bates et al., 2007), which I use here, do not include a p value 'baked in' to the output for a reason (in short, how one should calculate it is not as straightforward as it sounds). While my personal approach is generally to rely on probabilistic (i.e. Bayesian) approaches or effect sizes and to ignore p values wherever possible, I use p values here because they are still widely used across ecology and evolutionary studies and by fish and wildlife managers. The R package `lmerTest` (Kuznetsova, Brockhoff & Christensen, 2017) was therefore used to calculate p values for the mixed-effects models. Interestingly, type-I errors were not more likely in any of the LMs of simulated data.
This possibly suggests that misspecified linear models (theoretically missing a random effect) are relatively robust to this omission – at least in some simple cases such as the scenarios presented here. While this perhaps alleviates some concern over inflated type-I errors due to pseudoreplication when the grouped nature of repeat-measures studies and non-independent data is ignored (Arnqvist, 2020), it should not be taken as evidence to purposefully omit random effects when such a structure is appropriate. Instead, it warrants future investigation and further simulation studies with more thorough scenarios (especially with varying degrees of random effect variance) and more complex data structures (e.g. including correlations and link functions).
Often researchers (sometimes nudged by peer-reviewers) cite this guideline of needing five levels before including random effects as a reason why they were unable to use a mixed-effects model (Bain, Johnson & Jones, 2019; Bussmann & Burkhardt-Holm, 2020; Evans & Gawlik, 2020; Gomes & Goerlitz, 2020; Zhao, Johnson-Bice & Roth, 2021). There is confusion over this recommendation, however, as some opt to use mixed-effects models despite it (Latta et al., 2018; Fugère, Lostchuck & Chapman, 2020; Gomes, Appel & Barber, 2020; Allen et al., 2021), likely because of the numerous advantages that mixed-effects models offer (Bolker, 2008; Kéry & Royle, 2015; Harrison et al., 2018), or out of fear of the consequences of pseudoreplication (although this can easily occur in mixed-effects models as well: Schielzeth & Forstmeier, 2009; Arnqvist, 2020). The tendency to automatically follow this rule is likely exacerbated by the fact that authors or peer-reviewers can easily point out that the rule exists (Gelman & Hill, 2006; Harrison, 2015; Kéry & Royle, 2015; Harrison et al., 2018; Arnqvist, 2020), but may find it more difficult or time-consuming to make a nuanced argument against following such a rapidly spreading rule. Hopefully the results presented here will challenge that view and allow the fitting of random effects when inference is not being made for the random effects terms. More importantly, I hope this work sparks further conversation and debate over the issue. Given the widespread accessibility of GLMMs, future simulation studies and further assessments of these statistical methods are necessary to understand the consequences of both violating and blindly following methodological rules.