## Abstract

Sometimes a re-analysis of toxicological data is useful. However, this usually requires the availability of the original data and in many cases only summary data are available in the publications. Here the generation of pseudo-data under certain assumptions using extension packages in the open-source project R on statistical computing is shown. Several case studies are used to illustrate the applicability in regulatory toxicology.

## 1 Introduction

Results in publications on toxicological bioassays are usually presented as summary statistics; often in bar chart plots or table format, showing means, standard deviations and sample sizes. This practice has been repeatedly criticized (6; 2; 21) because potential sub-grouping in the data is hidden and the validity of the performed statistical analysis cannot be peer-reviewed. In internationally accepted test guidelines for regulatory bioassays, the reporting of individual data is mandatory, but often not publicly available, because they may support commercial interests. However, summary data usually are available, e.g. from the European Chemicals Agency (ECHA) or the European Food Safety Authority (EFSA), depending on the regulatory field.

When data are presented as means and standard deviations, the authors stating such values implicitly assume normal distribution of the reported data, irrespective of the actual underlying empirical distribution of the data. Therefore, it is possible to generate individual observations, here denoted as pseudo-data, from these summary statistics using a simulation algorithm.

Pseudo-data can be used for peer-review, re-analysis by different or novel methods, educational purposes and for incorporating multiple data sets in an analysis, when only summary statistics are available. While the original data are obviously more valuable, the pseudo-data approach can be a compromise; e.g. when individual data is hard to obtain or not electronically available - the latter is sometimes the case for historical control data.

We demonstrate the use of the simulation algorithm in the function `ermvnorm` within the *SimComp* package, available for the open-source statistical software R. This function is based on the function `rmvnorm` of the package *mvtnorm*. This method differs from other random data generation algorithms in that a multivariate normal distribution with an exact nominal mean vector, exact variance vector and approximate correlation matrix (for multiple endpoints or repeated measures) is calculated. (Note that random numbers with exact mean and standard deviation cannot be used for simulations concerning type I error or power of statistical tests.)

Multiple case studies are presented and an appropriate re-analysis is demonstrated: i) repeatability by a second assay, ii) fold change effect size, iii) taking variance heterogeneity into account, iv) considering dose as quantitative covariate and/or qualitative factor, v) user-defined contrasts, and vi) multiple endpoints. In a seventh case study the limitation of the approach is shown because the normal distribution assumption failed.

The presented method can be easily applied by statisticians and toxicologists by using the example R-code provided within this manuscript.

## 2 Generating random numbers with exact mean and standard deviation

First, summary statistics are selected, e.g. from a publication table for 7 antioxidant indices in epididymides of mice exposed to 0 (control), 25 (dose1), 50 (dose2) and 100 (dose3) mg/L sodium fluoride as shown by the following table from (19). Second, sample sizes (n), arithmetic means (mean) and standard errors (sd) for a particular endpoint, for example the low total antioxidant capacity (T-AOC), are simply transferred to the following R code:

The object `TAOC` contains the pseudo-data with the qualitative factor `dose` and can be used for re-analysis. These data can be easily visualized by modified box plots using the R-package `toxbox` (13)

An extension of this example considering multiple correlated endpoints and repeated measures for assumed or random correlations is available and is demonstrated in the sixth case study below.

The simple idea behind the function `ermvnorm` is to apply the function `rmvnorm` of the package *mvtnorm*. This function generates *n* 2 random vectors from a *k*-dimensional multivariate normal distribution whose mean vector is given by `mean` and corresponding vector of standard deviations (SD) by `sd`. If only one endpoint is of interest, dimension *k* equals 1. For more details, see (14). These *n* − 2 random vectors do not yet have the indented exact mean and SD. However, based on the first *n* − 2 vectors, a system of equations is formed by the usual formulas of sample mean and sample variance. The remaining 2 vectors are obtained as solution so that the specified values for the mean vector `mean` and for the vector of standard deviations sd are achieved. Depending on the first *n* − 2 random vectors, the underlying system of equations may have no solution. In this case, `ermvnorm` creates a new set of *n* − 2 random vectors until a valid pseudo-data set is obtained or until the maximum number of tries is reached.

## 3 Case studies

The following case studies show the applicability of the method for re-analysis for several common statistical scenarios encountered in toxicology.

### 3.1 Demonstrating repeatability

Demonstrating repeatability of a previously observed effect appears relatively simple, e.g. by a second assay under comparable conditions and seems necessary in the current reproducibility crisis. (1) argue that this might be due insufficient statistical expertise, i.e. sampling effects. Usually, data of a new assay will be compared with data from a previous assay in the literature. In this example, data for the first and the second assay were provided within the same project for the hematology endpoint mean corpuscular volume (MCV) for mice that were treated with five concentrations of sodium dichromate dihydrate (18). Summary statistics were extracted and pseudo-data generated. The boxplots in Figure 3 reveal a rather similar pattern for both studies (parallel vertical dotted lines: left study 1, right study 2) with a general tendency to lower values in the second study.

Table 1 shows similar effect sizes (Effect size 1, Effect size 2), i.e. the mean difference *µ*_{i} − *µ*_{0}, and similar tiny *p*-values (*p*-value 1, *p*-value 2) for almost all comparisons against control for both study 1 and study 2, which is a convincing empirical criterion for repeatability. Irrespective of the observed additive shift, the patterns of effect sizes and *p*-values are rather similar in both studies. Alternatively, one can use a more complex model, a mixed effects model with a random factor between both studies. Again, the effect size (Effect size mixed) and the *p*-value (*p*-value mixed) are quite similar.

A third and quite different approach is the claim of no interactions as a criterion for repeatability using an interaction contrast test (10). The dose-by-study interaction contrasts are estimated in Table 2 revealing quite different slopes for and , i.e. a substantial higher sensitivity of study 2 with respect to study 1. Hereby it is difficult to conclude on similarity of the two studies, as this is injured evenly with two dose comparisons substantially.

### 3.2 Fold change as effect size

The majority of statistical significance tests use (implicitly) the difference of means as effect size. However, the ratio of means, denoted as fold change, is used in toxicology for specific interpretations, such as the 2-fold rule in the Ames assay (4). Unfortunately, this multiplicative effect size is sometimes used together with *t*- or Wilcoxon test for an additive effect size, e.g. for urine excretion in rats treated with seven doses of hexachloro-butadiene (20)(in their Table 5, see the simulated data in Table 3). The common approach to switch into the multiplicative model is the log-transformation (which requires the validity of a log-normal distribution, see e.g. (16)), but for normal distributed endpoints a modified Dunnett or Williams test for ratios-to-control is available (9) as an alternative.

The actual data from (20) is reproduced in Table 3 and the generated pseudo-data depicted in Figure 4.

In this balanced design with extreme small sample sizes, an extreme variance heterogeneity can be seen for higher doses in Figure 4. A modified Dunnett-type test for ratios-to-control allowing heterogeneous variance is used therefore (7).

The fold changes are well reproduced from Tab. 5 of (20), but the Dunnett test for ratio-to-control (assuming normal distributed errors with any variance) reveals a weak significant increase for the 15 mg dose only (adjusted *p*-value) – primarily induced by this extreme small variance, i.e. the three similar values such as (18.7, 17.3, 18.6) by a value range from about 7.2 to 72.4. There is no biological sound increase at any dose – induced by *n*_{i} = 3 and the extreme variances at high doses. The conclusion by the authors *”Urinary KIM-1 excretion was increased at doses between 10 and 30mgkg, but owing to a high group variance (i.e. large SDs), the increases of 3.31- and 4.10-fold at, respectively, 45 and 90mgkg HCBD, compared with the control group, were not statistically significant”* can not be reproduced. This example shows very nicely what the *p*-values would have looked like if slightly higher sample sizes had been used, e.g. *n*_{i} = 6, just running the simulation code with *n* = 6, assuming that the higher sample size would have the same means and SD.

The well-known fact that the sample size has a substantial influence on the *p*-value becomes very clear, which is particularly important in toxicology.

### 3.3 Taking variance heterogeneity into account

The common assumption of homogeneous variances in dose-responses relationships may be sometimes unrealistic. Corresponding tests tend to be liberal when higher variances occur in higher doses, particularly when sample sizes are smaller in the dose groups (vice versa). In these cases, a modified Dunnett-type test allowing heterogeneous variance should be used (7). As an example we use summary data of pregnant mice which were exposed to five phthalates and their mixture. In their Table 2 (3), the body weight mean, SD and *n* for each treatment group is reported together with stars for Dunnett’s post hoc test. From the simulated data in Figure 5 an unbalanced design with heterogeneous variance can be seen.

Consequently, the re-analysis with the Dunnett test assuming homogeneous variance reproduced the significance signs. But taking variance heterogeneity into account, there is no significant body weight change in the mixture, but more significant changes in the phthalate groups. This is not supported by the conclusion by the authors *Data showed that, except for NP, all mice exposed to EDCs, single or mixed, presented higher BW than control mice*.

### 3.4 Considering dose as quantitative covariate and/or qualitative factor

Trend tests are classified into those that accept the dose as a factor, i.e. qualitative, and those that accept the dose as a covariate, i.e. quantitative. Both concepts have advantages and disadvantages, but they can even be combined (15). As an example the clinical chemistry endpoints of a feeding study with GMO-corn for the female rats, where the endpoint CHO was selected (Table 4) (5).

Assuming the concentration as a quantitative covariate, no significant increasing trend reveals. However, taking additionally the Williams contrast with dose as factor into account, a clear significant plateau effect (contrast C3) follows. This finding is more precise than the significance signs for the 12.5 and 50 group in Table 4 of (5).

### 3.5 User-defined comparisons

Sometimes in trend analysis, additional comparisons than against control are of interest, e.g. control versus high dose (11). As an example, the endpoint VAP (average path velocity) in the gestational exposure to Bisphenol-A affects the spermatozoa in adult mice (12).

Just to report a significant trend test is one story, a different one is to demonstrate a significant trend and a high dose vs. control test (11).

### 3.6 Multiple endpoints

Although multiple endpoints occur frequently in long-term bioassays, such as hematology, commonly they are analysed univariately and independently. A simultaneous test over correlated endpoints can be formulated, to avoid a too high false positive decision rate (8). As an example, the changes in the activities of GSH-R levels in the testicular tissues of male rabbits after treatment with a daily dose of four nitrosamines (MEN, DEN, DMN, DPN) (17) are used. In their Table 1, the bivariate data set of GSH content and GSH-R activity was selected, assuming a correlation of 0.9 (no such correlation information available).

Seriously decreasing levels for both GSH content and GSH-R activity in each of the nitrosamine compounds against control exists under the assumption of a correlated bivariate normal distribution.

### 3.7 A counterexample where normal distribution assumption is unrealistic

Many endpoints in toxicology are not continuous, they are discrete such as counts, e.g., number of micronuclei or proportions, such as tumor rates. For these discrete endpoints the limitation of the above approach based on the normal distribution assumption is obvious. As an example the increase of mutant frequency (MF) after in utero exposure to three doses of benzopyrene in bone marrow (12) (their Figure 3) is considered where the appendix contains the summary statistics. Clearly, MF is a count and was consequently analysed in the generalized linear model in that paper assuming overdispersed Poisson counts. Although counts can be generated as well by the above approach, the limitation of normal distribution assumption becomes obvious in Figure 6 A with jittered box plots, whereas for the 20 and 40 mg group the raw data can be estimated from this figure. This makes a difference in re-analysis, see the boxplots and the evaluation for either Tukey-type trend test or Dunnett-type comparisons against control. The *p*-values for simulated data from mean and SD are unrealistic tiny for any comparisons, whereas ones based on estimated data are more realistic. This highlights the limitation of the ubiquitous use of arithmetic mean and SD as summary statistics – irrespective of the underlying distribution and shows the advantage of raw data presentation within boxplots (13). Comparing the the odds ratios to control (and their simultaneous confidence intervals) for both data sets, reveals a false positive tendency in the simulated data set (see appendix).

## 4 Discussion

Based on published summary data (means, standard deviations, sample sizes) simulation-based pseudo-data can be used for statistical re-analysis. Examples are: i) Dunnett-type testing instead of independent pairwise t-tests, e.g. for NOAEL identification; ii) test statistics allowing variance heterogeneity, particularly in unbalanced designs; iii) combining multiple bioassay studies, e.g. by using mixed effect models; iv) Analysing trends by alternative models, e.g. using dose as quantitative covariable or a combination of quanti- and qualitative formulation; v) multiple endpoint analysis (instead of independent univariate analysis) assuming low to high correlation between the endpoints; vi) using historical control data in addition to the concurrent control; vii) using robustified trend tests for possible downturn effects at high doses; viii) providing confidence intervals for the effect sites *µ*_{i} − *µ*_{0} or *µ*_{i}/*µ*_{0}; ix) using the generated *x*_{0,i} as part of a historical control data base; x) simulating their influence of virtual larger or smaller *n*_{i} on false negative decision rate; xi) deriving the impact of considering less dose groups, e.g. only 0, 10, 100 instead of 0, 10, 50, 100. Moreover, this approach can be used for teaching purposes, to establish new statistical approaches in toxicology and generating compound/strain/endpoint/sex/dose-specific data base from historical publications on individual data. For the case studies here, the data and R-code is available in the Appendix.

## Acknowledgements

Prof. Dilba, SDSU, for an early version of the algorithm