## Abstract

In regulatory toxicology an outcome is claimed positive when both a trend is significant and any pairwise test against control. Two statistical approaches are proposed: a joint Dunnett and Williams test (assuming the dose as a qualitative factor) and a joint test of the Tukey regression test and Dunnett test (assuming the dose as a quantitative covariate). Related R software is available.

## 1 The problem

In regulatory toxicology, different bioassays are routinely evaluated (commonly on the basis of multiple endpoints) into positive or negative outcomes. In most cases, the criteria are not clearly formulated from a statistical point of view. However, there are guidelines with explicit criteria, such as OECD No. 487: … *a test chemical is considered to be positive if*…: *1)At least one of the treatment groups exhibits a statistically significant increase in the frequency of micronucleated immature erythrocytes compared with the concurrent negative control, 2) This increase is dose-related at least at one sampling time when evaluated with an appropriate trend test, and 3) c) Any of these results are outside the distribution of the historical negative control data.…* Although criterion 3 is also challenging from a statistical point of view, a statistical approach is derived below for the first two criteria: **pairwise tests and trend test**.

Without limitation of generalizability, only tests for normal distributed and homoscedastic errors are represented here, in order to keep the calculation simple.

## 2 Methods

Various criteria must be observed. First, a trend test should be sensitive to as many forms of dose-response dependency as possible. Here the Williams test is selected for dose modeling as a qualitative factor and the Tukey test for dose modeling as a quantitative covariate. Secondly, pairwise tests are only considered as comparisons against control (i.e. not between doses comparisons) in terms of the widespread Dunnett test [5] (which controls familywise error rate). Sometimes independent t-tests are simply used for this purpose. However, these only control the comparisonwise error rate, use only pairwise df and error estimates. Thirdly, only one-sided tests are used because a test is difficult to motivate which is either to formulate for an increase or a decrease, but this for monotonous dependencies. The combination of trend test and Dunnett-test takes place on the level of linear models [8]. On the one hand, by the simultaneous consideration of Williams and Dunnett contrasts [9], on the other hand by the simultaneous testing of Tukey [10] and Dunnett test [7]. Simultaneous means the use of the common joint distribution of all tests in the sense of a maximum test.

This shows a fundamental contradiction in this approach. On the one hand, claims are possible with regard to trend and any pair comparison in the sense of the guideline, on the other hand, the false negative decision rate increases. A simulation study shows that the balancing of interests: extended claim - despite of increased f-rate is acceptable. With the help of two examples the advantages and disadvantages of the approach are shown, the data and the R-code are made available, so that a recalculation of own data should be possible for toxicologists.

The colloquial “and” is statistically translated into “or” by a union-intersection hypothesis. A trend exists if at least one local trend alternative is significant, a pairwise comparison if at least one of the comparisons is significant. Of course, different patterns may emerge from significant local alternatives until all are significant. Conversely, if none of the local comparisons are in the alternative, there is no trend that is only a paired difference (see the examples below).

## 3 Data examples

### 3.1 Number of revertants in Ames assay

The raw data of an Ames assay using TA98 is available in the R-library(dispmod) taken from [4]. A clear downturn effect can be see above a dose of 333.

The Tukey-and-Dunnett approach is used by means of the library(tukeytrend) and the function *tukeytrendfit*, where a linear model for the log-transformed number of revertants (y) is used as an object (modTA). The object ex2 contains the test statistics and the multiplicity-adjusted p-value (see Table).

Both linear trend test and Tukey trend test alone argue for ‘no trend’, where the joint approach reveal a significant increase in dose 100.

### 3.2 Crude tumor rates in a long-term carcinogenicity study

The crude incidence of hepatoblastoma in male mice (1,1,16,5) were reported after treatment of 0 3 15 50 mg/kg pentabrominated diphenyl ether [6] (*n*_{i} = 50). Assuming dose as a qualitative factor, the proportions were modeled in the generalized linear model with the logit link function and small sample add-2 sample adjustment [2].

The simultaneous confidence limits (on the log-odds ratio scales) for the six individual comparisons: i) Dunnett-type vers. control, and ii) Williams-type C1 for 0 − 50, C2 for 0 − (50 + 15)*/*2, C3 for 0 − (50 + 15 + 3)*/*3 reveal no trend, but a significant increase of the tumor rate at dose 15.

## 4 Simulation study

In a simulation study for a common balanced design [*NC, D*_{1}, *D*_{2}, *D*_{3}] with homoscedastic, normal distributed errors with small sample sizes (main *n*_{i} = 10 and down to *n*_{i} = 4 the false positive rate (under the global null hypothesis of equal expected values (means) and the false negative rate under various monotonic and non-monotonic dose-response relationships. Seven tests were compared: i) the standard Dunnett-test (Dun), the Williams test formulated as multiple contrast [3] (Wil), the joint Dunnett and Williams test (DunWil), the Tukey trend tests (Tukey), the joint Tukey and Dunnett test (TukDun), the linear regression test (Lin) and the joint linear Regression and Dunnett test (LinDu), see the following Table:

As expected, the regression test has the lowest f-rate for a linear alternative. As expected, the regression test has the lowest f-rate for a linear alternative. For many other nearly linear alternatives, the Tukey test has the lowest f-rate. For plateau alternatives, the Williams test shows the lowest f+ rates (while the regression test there shows alarmingly high f-rates, which disqualifies it as a routine approach in toxicology). As expected, the f-rates of the both regression tests are high for non-monotonous alternatives, much lower for the Williams test. The three combination tests with the additional Dunnett test show a balanced f-behavior. Their f-rate is more or less increased compared to the standard test with exactly monotonous alternatives - however tolerable due to the considerable robustness gain. This difference is about the same over wide ranges of the f-rate, represented by designs with decreasing sample sizes *n*_{i} = 11…*n*_{i} = 4 (common in toxicology).

It can be assumed that this behavior also applies to other endpoint types in GLM. Therefore, the recommendations of the CA-trend tests for crude or poly-3-adjusted proportions should be reconsidered by the US-NTP [1].

## 5 Conclusion

Two statistically consistent approaches for trend and pairwise comparisons were derived and their properties characterized: Dunnett and Williams test and Tukey and Dunnett test. A relatively simple R-code is available for both. These approaches can be generalized in GLM to other endpoint types such as proportions, counts and survival functions. Thus, this can be recommended for routine evaluation in regulatory toxicology.

Extensions for selected real data conditions such as variance heterogeneity, over-dispersion or values below the detection limit are currently being processed.