Valid statistical approaches for clustered data: A Monte Carlo simulation study

The translation of preclinical studies to human applications is associated with a high failure rate, which may be exacerbated by limited training in experimental design and statistical analysis. Nested experimental designs, which occur when data have a multilevel structure (e.g., in vitro: cells within a culture dish; in vivo: rats within a litter), often violate the assumption of independent observations that underlies many traditional statistical techniques. Although previous studies have empirically evaluated the analytic challenges associated with multilevel data, existing work has not focused on key parameters and design components typically observed in preclinical research. To address this knowledge gap, a Monte Carlo simulation study was conducted to systematically assess the effects of inappropriately modeling multilevel data via a fixed effects ANOVA in studies with sparse observations, no between-group comparisons within a single cluster, and interactive effects. Simulation results revealed a dramatic increase in the probability of type 1 error and in the relative bias of the standard error as the number of level-1 units (e.g., cells; rats) per cell increased in the fixed effects ANOVA; these effects were largely attenuated when the nesting was appropriately accounted for via a random effects ANOVA. Thus, failure to account for a nested experimental design may lead to reproducibility challenges and inaccurate conclusions. Appropriately accounting for multilevel data, however, may enhance statistical reliability, thereby leading to improvements in translatability. Valid analytic strategies are provided for a variety of design scenarios.


Introduction
Preclinical studies, which range from molecular and in vitro studies to in vivo studies utilizing biological systems to model disease [1], are not immune [2-3] to the well-documented reproducibility issues observed in clinical fields [4]. Various factors, including rigorous standardization of preclinical experiments [e.g., 5-6], lack of scientific rigor [e.g., 7-8], and bias [e.g., Publication Bias: 9-10; Reporting Bias: 11], threaten reproducibility in preclinical science.

Nested data occur when multiple subjects and/or measurements are obtained from a single higher-order group. Examples of nested data range from in vitro experiments (i.e., cells within a culture dish (A)) to in vivo experiments utilizing polytocous species (e.g., rat pups within a litter (B)). Multilevel data can also occur with the use of longitudinal experimental designs (i.e., repeated measurements nested within a single subject).

Main effect (β1). For the main effect of β1 in the fixed effects model (Fig 2A), the probability of type 1 error increased dramatically as the number of level-1 units per cell increased.

When the nested experimental design was appropriately accounted for via a random effects model, elevated type 1 error rates were largely attenuated (Fig 2B).
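The mechanism behind this inflation and its attenuation can be illustrated with a minimal simulation. The sketch below (with hypothetical parameters, not the paper's simulation grid) generates clustered data under a null treatment effect and compares a pooled t-test that ignores clustering with a t-test on cluster means; the cluster-means test is a simple valid stand-in here, not the random effects ANOVA used in the study.

```python
import numpy as np

def pooled_t(a, b):
    """Two-sample pooled-variance t statistic."""
    sp2 = (a.var(ddof=1) + b.var(ddof=1)) / 2
    return (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / a.size + 1 / b.size))

# Illustrative Monte Carlo check: data simulated under a NULL treatment effect.
rng = np.random.default_rng(1)
n_reps, J, n = 500, 4, 10        # replications; clusters per group; units per cluster
tau2, sigma2 = 0.60, 0.40        # between-/within-cluster variances -> ICC = 0.60

naive_rej = means_rej = 0
for _ in range(n_reps):
    # Each cluster (e.g., litter) contributes one shared random effect.
    y_a = rng.normal(0, np.sqrt(tau2), (J, 1)) + rng.normal(0, np.sqrt(sigma2), (J, n))
    y_b = rng.normal(0, np.sqrt(tau2), (J, 1)) + rng.normal(0, np.sqrt(sigma2), (J, n))
    # 1) Treat all J * n observations as independent (fixed effects style).
    naive_rej += abs(pooled_t(y_a.ravel(), y_b.ravel())) > 1.99          # t crit, df = 78
    # 2) Collapse to one mean per cluster before testing (a valid strategy).
    means_rej += abs(pooled_t(y_a.mean(axis=1), y_b.mean(axis=1))) > 2.447  # df = 6

type1_naive = naive_rej / n_reps   # grossly inflated above the nominal 0.05
type1_means = means_rej / n_reps   # close to the nominal 0.05
```

Because the naive analysis treats correlated littermates as independent, its effective sample size is far smaller than J × n, and the nominal α is badly understated.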

Utilization of a random effects model, to appropriately account for the nested experimental design, largely attenuated the elevated type 1 error rates for the interaction effect of β3 (Fig 2D).

When the ICC was small, mean estimates for the probability of type 1 error in the random effects model approximated the established criterion of 0.05.

(Figure caption) The dashed blue line reflects the established criterion of 0.80.

Utilizing a random effects model to appropriately account for the nested experimental design did not significantly improve the statistical power to detect effects. In the random effects model, statistical power ranged from 7.8% to 74.6% for the small ICC and from 0.03% to 5.6% for the large ICC (Fig 3B); these estimates were dependent upon an interaction between the […]

[…] and from -6.5% to 6.9%, respectively. Overall, the pattern of relative bias was random, centered around zero, and did not consistently exceed the established criterion of |10%| for either the fixed effects or random effects ANOVA.

(Figure caption) The green area within the two dashed blue lines reflects the acceptable levels of relative bias. Points outside of the green area are greater than the established criterion of |10%|.
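Statistical power in this setting can be approximated by Monte Carlo simulation of cluster means, whose variance combines the cluster effect and the averaged within-cluster noise (τ00 + σ²/n). The sketch below uses hypothetical, illustrative parameters and a t-test on cluster means; it is not the paper's simulation code.

```python
import numpy as np

rng = np.random.default_rng(7)
n_reps, J, n = 400, 6, 10             # replications; clusters per group; units per cluster
tau2, sigma2, effect = 0.2, 0.8, 1.0  # hypothetical variances and treatment effect

# Simulating cluster means directly is equivalent to averaging simulated
# individuals: Var(cluster mean) = tau2 + sigma2 / n.
sd_mean = np.sqrt(tau2 + sigma2 / n)

hits = 0
for _ in range(n_reps):
    means_a = rng.normal(0.0, sd_mean, J)       # control cluster means
    means_b = rng.normal(effect, sd_mean, J)    # treated cluster means
    sp2 = (means_a.var(ddof=1) + means_b.var(ddof=1)) / 2
    t = (means_b.mean() - means_a.mean()) / np.sqrt(sp2 * 2 / J)
    hits += abs(t) > 2.228                      # critical t, alpha = .05, df = 2J - 2 = 10
power = hits / n_reps                           # proportion of replications rejecting H0
```

Because power here is driven by the number of clusters J rather than the total number of level-1 units, adding more units per cluster yields diminishing returns once τ00 dominates σ²/n.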

With regard to the main effect of β1 in the random effects model under the small ICC, mean estimates for relative bias ranged from -6.6% to 7.3% when β1 = 0.14, from -3.1% to 1.9% when β1 = 0.39, and from -1.4% to 2.2% when β1 = 0.59. Under parameter conditions where the ICC was large (Fig 4B, 4D, 4F), results demonstrated intolerable relative bias when the magnitude of β1 was small, with estimates ranging from -16.5% to 37.9%. This contrasted with relative bias estimates when the magnitude of β1 was either medium (0.39) or large (0.59) in these conditions, which remained within the established criterion of |10%|.

The comparability of relative bias results across the random effects and fixed effects models indicates that the parameter estimates themselves were largely unbiased under either analytic approach.

When β3 = 0.14, mean estimates for relative bias ranged from -17.8% to 11.1% and from -23.4% to 21.3% for the small ICC and large ICC, respectively. When β3 = 0.39, mean estimates for relative bias ranged from -2.0% to 1.9% for the small ICC and from -8.4% to 8.6% for the large ICC.

Finally, when β3 = 0.59, mean estimates for relative bias ranged from -1.8% to 1.3% and from -5.4% to 11.7% for the small ICC and large ICC, respectively. Overall, relative bias did not consistently exceed the established criterion of |10%|. Notably, however, there were excessive rates of relative bias when the magnitude of the interaction term was small.
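Relative bias as reported above is the deviation of the mean Monte Carlo estimate from the true parameter, expressed as a percentage of the true value. A minimal helper with made-up illustrative numbers (not the paper's estimates):

```python
def relative_bias(estimates, true_value):
    """Percent relative bias: 100 * (mean estimate - true value) / true value."""
    mean_est = sum(estimates) / len(estimates)
    return 100.0 * (mean_est - true_value) / true_value

# Made-up illustration: an estimator averaging 0.15 for a true beta of 0.14
# has roughly 7.1% relative bias, inside the |10%| criterion.
rb = relative_bias([0.15] * 50, 0.14)
flagged = abs(rb) > 10.0    # True only when the paper's |10%| cutoff is exceeded
```

Note that with a small true value in the denominator (e.g., β3 = 0.14), even modest absolute estimation error produces large percentage bias, consistent with the excessive rates observed for small interaction effects.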

(Figure caption) Mean estimates for relative bias for the interaction effect of β3 in the random effects model.

Main Effect (β1). For the main effect of β1 in the fixed effects model (Fig 5A), the relative bias of the standard error increased dramatically as the number of level-1 units per cell increased.

When the nested experimental design was appropriately accounted for via a random effects model, however, the relative bias of the standard error was largely attenuated. When the ICC was small, mean estimates for the relative bias of the standard error in the random effects model ranged from 4% to -0.2%; these estimates were below the established criterion of |5%| across all level-2 units per cell. For the large ICC, mean estimates for the relative bias of the standard error ranged from -6.6% to -1.4%; these observations revealed a greater likelihood of biased standard errors when fewer level-2 units were sampled (Fig 5B). The overall ANOVA confirmed these observations, revealing a practically significant interaction between the number of level-2 units per cell and the ICC.
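Relative bias of the standard error compares the average model-based standard error against the empirical standard deviation of the estimates across replications (the "true" sampling variability). A minimal sketch with made-up numbers:

```python
def se_relative_bias(mean_model_se, empirical_sd):
    """Percent relative bias of the standard error: the average model-based SE
    compared against the empirical SD of estimates across replications."""
    return 100.0 * (mean_model_se - empirical_sd) / empirical_sd

# Made-up illustration: model SEs averaging 0.095 against an empirical
# sampling SD of 0.100 understate uncertainty by 5%, at the |5%| criterion.
se_bias = se_relative_bias(0.095, 0.100)
```

Negative values indicate standard errors that are too small, which is the direction that inflates type 1 error when clustering is ignored.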

Utilization of a random effects model to appropriately account for the nested experimental design, however, largely attenuated the relative bias of the standard error. In the random effects model, mean estimates for the relative bias of the standard error ranged from 14.6% to 3% for the small ICC and from -4.8% to -1.1% for the large ICC (Fig 5D). Independent of ICC, the relative bias of the standard error decreased as the number of level-2 units per cell increased.

[…] decreasing within-group variance. When multilevel data are appropriately modeled via a random effects ANOVA, however, the type 1 error rate and relative bias of the standard error approximate the established criteria (i.e., α < 0.05 and |5%|, respectively).

Low statistical power has been recognized as a critical, albeit not universal, issue in preclinical research.

The level-1 (within-cluster) model was specified as

Y_ij = β0j + β1j·X1ij + β2j·X2ij + β3j·(X1ij × X2ij) + r_ij,

where β0j is the intercept; β1j is the regression coefficient relating X1ij, a level-1 predictor (e.g., Treatment), to Y_ij; β2j is the regression coefficient relating X2ij, a second level-1 predictor (e.g., Biological Sex), to Y_ij; β3j is the regression coefficient relating the interaction of the two level-1 predictors to Y_ij; and r_ij is the level-1 random effect.

All level-1 coefficients were permitted to randomly vary, yielding the following unconditional level-2 random-coefficient model equations:

β0j = γ00 + u0j
β1j = γ10 + u1j
β2j = γ20 + u2j
β3j = γ30 + u3j

where the γ terms are the level-2 fixed effects and the u terms are the level-2 random effects.
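Under a model of this form, data generation proceeds by first drawing cluster-specific coefficients (level 2) and then drawing observations within each cluster (level 1). The sketch below uses hypothetical fixed effects and variances (not the paper's exact generating values); treatment is assigned at the cluster level to match designs with no between-group comparison within a single cluster.

```python
import numpy as np

rng = np.random.default_rng(42)
J, n = 20, 5                                 # level-2 clusters; level-1 units per cluster
g00, g10, g20, g30 = 0.0, 0.39, 0.0, 0.14    # hypothetical fixed effects (gammas)
tau, sigma2 = 0.16, 0.84                     # random-coefficient variance; level-1 variance

rows = []
for j in range(J):
    # Level-2 model: each level-1 coefficient varies randomly across clusters.
    b0, b1, b2, b3 = (g + rng.normal(0, np.sqrt(tau)) for g in (g00, g10, g20, g30))
    x1 = j % 2   # Treatment assigned per cluster: no between-group comparison within it
    for _ in range(n):
        x2 = int(rng.integers(0, 2))          # e.g., Biological Sex varies within cluster
        r = rng.normal(0, np.sqrt(sigma2))    # level-1 random effect
        y = b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2 + r
        rows.append((j, x1, x2, y))           # (cluster id, Treatment, Sex, outcome)
```

Every observation in cluster j shares the same drawn coefficients (b0 through b3), which is precisely what induces the within-cluster correlation the simulations manipulate.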

Levels of ICC were manipulated by altering the variances of both the level-1 and level-2 error terms. Two levels of ICC were considered: a small (0.16) and a large (0.60) cluster effect.

The ICCs were based on the unconditional model. It is noted that the ICC for a given condition […]
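For the unconditional model, the ICC is the between-cluster variance (τ00) over the total variance (τ00 + σ²). The helper below uses illustrative variance pairs, chosen to sum to 1, that reproduce the two simulated cluster effects; the paper's actual variance settings may differ.

```python
def icc(tau00, sigma2):
    """Intraclass correlation from the unconditional model's variance
    components: between-cluster variance over total variance."""
    return tau00 / (tau00 + sigma2)

# Illustrative variance pairs yielding the two target cluster effects.
small = icc(0.16, 0.84)   # small cluster effect
large = icc(0.60, 0.40)   # large cluster effect
```

Because the ICC depends only on the ratio of the two variance components, any pair scaled by a common factor yields the same cluster effect.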