Issues in the statistical detection of data fabrication and data errors in the scientific literature: simulation study and reanalysis of Carlisle, 2017

Background: The detection of fabrication or error within the scientific literature is an important and underappreciated problem. Retraction of scientific articles is rare, but retraction may also be conservative, leaving open the possibility that many fabricated or erroneous findings remain in the literature for lack of scrutiny. A recent statistical analysis of randomized controlled trials [1] suggested that the reported statistics from these trials deviate substantially from what would be expected under truly random assignment, raising the possibility of fraud or error. It has also been proposed that the method could be used to prospectively screen research, for example by applying it prior to publication.

Methods and Findings: To assess the properties of the method proposed in [1], I carry out both theoretical and empirical evaluations of the method. Simulations suggest that the method is sensitive to assumptions that could reasonably be violated in real randomized controlled trials. This implies that deviation from expectation under this method cannot be used to measure the extent of fraud or error within the literature, and raises questions about the utility of the method for prospective screening. Empirical analysis of the results of the method on a large set of randomized trials suggests that important assumptions may plausibly be violated within this sample. Using retraction as a proxy for fraud or serious error, I show that the method faces serious challenges in terms of precision and sensitivity for the purposes of screening, and that its performance as a screening tool may vary across journals and classes of retractions.

Conclusions: The results in [1] should not be interpreted as indicating a large amount of fraud or error within the literature. The use of this method for screening of the literature should be undertaken with great caution, and should recognize critical challenges in interpreting its results.


Meta-research, a scientific endeavor aimed at studying and improving the process of science itself, has gained increasing interest among scientists. This interest has partially been driven by theoretical [2,3] and empirical work [4,5] that raises concerns about the validity of the published scientific literature. One area of the scientific process that is amenable to meta-research is the detection of data validity and data integrity issues within the literature. Methods such as statcheck [6] and granularity testing [7] and its variants [8] have been developed to identify possible data validity issues by checking summary statistics reported in published research for consistency. In some cases, it has been proposed that these methods be applied in an automated manner at various stages of the scientific process, for instance, prior to publication [6,9].

One class of methods for the detection of data validity issues is based on detecting whether data or summary statistics are consistent with their expected statistical distribution [10-15]. Under this framework, large deviations from the expected distribution of reported data are interpreted as an indication of possible data integrity issues. In several cases within the literature, this approach has been used to flag publications that were later determined to be based on fabricated data [11,12]. One variation, developed by Carlisle [11,13], uses reported summary statistics on baseline variables from randomized clinical trials to score published trials in terms of their statistical deviation from what would be expected if subjects were truly assigned at random to the experimental groups. Large deviations potentially suggest issues with the validity of the reported summary statistics.

If methods for the detection of data validity issues are to play an increasing role in the scientific process, it is critical that scientists have a good understanding of the appropriate interpretation of these kinds of procedures. Of particular concern is that scientists may interpret the fact that a study or numerical result is flagged by these methods as substantial evidence of some type of flaw, even when the method can sometimes flag an analysis for other reasons [16,17]. Especially if such methods are used to systematically screen research, it will be essential for scientists to have a grasp of the limitations that these methods may face.

In order to understand these limitations, it is useful to distinguish between multiple types of numerical results that may be identified by these methods. In what follows, I make a distinction between two different threats to data validity: data fabrication and data errors. Data fabrication may be said to occur when authors of published research intentionally alter the data that they report in a way that is not consistent with how the data were collected, or report fictitious results about data that were never actually collected. Data errors may be said to occur when authors of published research unintentionally report data in a way that is not consistent with what was actually observed, for example through typographical errors or accidental errors in numerical calculations. Often, methods aimed at detecting data validity issues can be expected to flag both data fabrication and data errors, without distinguishing between the two.
In principle, this fact does not preclude the use of these methods for screening scientific research, because both fabrication and honest errors should be detected and corrected. However, the fact that these methods cannot distinguish between errors and fabrication presents important interpretational challenges, since parties involved in the process are likely to respond differently if they interpret a flag by one of these methods as evidence of fabrication versus evidence of error. As a result, it is critical to manage expectations about what these methods show and how they should be applied.

Potentially of more concern for the application of these methods is the possibility that some flagged results are false positives: results that are in fact valid but are nonetheless flagged by the method. Such false positives may divert attention from genuine data validity issues, as well as bringing unfair suspicion upon honest scientists.

Understanding the relative frequencies of these different categories (fabrication, honest errors, and false positives) among flagged results is essential for the proper interpretation of these methods.

Although these issues are generally applicable to methods aimed at identifying data validity issues, they are particularly timely in light of a recent analysis by Carlisle [1], which applied a data validity detection method to a large sample of randomized controlled trials. This analysis has already generated significant attention both within the scientific literature [9] as well as in the press [19,20]. The importance of [1] can be seen as relating to two related points. ... Second, the method utilized by [1] is already being used to screen papers submitted to ...

To facilitate understanding of the theoretical and empirical results I present, I briefly review the method utilized by Carlisle (which I refer to as the CM) [1]. For a single randomized controlled trial, the CM first involves manually extracting summary statistics on baseline (pre-treatment) variables from all groups that were randomized. For each variable, a p-value is calculated which tests the null hypothesis that the population means of the variable are equal across the groups. If the groups were truly assigned at random, then the null hypothesis is expected to be true for all of the variables. To combine the tests for all variables, the CM as applied in [1] utilized several methods for combining p-values that test a common null hypothesis, but [1] focuses on Stouffer's method [23], which transforms the p-values to z-scores and calculates their sum. Under the assumption that the p-values included are independent, this sum is then compared to its own null distribution to derive a global p-value. Below, I highlight several stages at which this process may go wrong, along with re-analyses of the data used in [1] showing that these issues plausibly affected the analysis.

... the distribution of the trial-level p-values in [1], aggregated by Stouffer's method. As Carlisle [1] notes, this distribution has an excess of p-values near 1 and 0 relative to the null (Fig 3A). However, this distribution is remarkably similar to the simulated distribution with correlated baseline variables (Fig 3B), suggesting that correlated variables could plausibly explain the deviations from uniformity.

... Carlisle showed that the CM p-values for retracted trials are more extreme compared to unretracted trials [1]. I extend this analysis by evaluating the distributions of the trial-level p-values from [1] across several retraction categories (Fig 4A and E). ... indicative of error, suggesting that this critique is not simply speculation.
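To make the aggregation step reviewed above concrete, the following is a minimal sketch, not Carlisle's actual code, of combining per-variable p-values by Stouffer's method, together with a small simulation of the correlated-baseline-variables issue. The use of a one-way ANOVA per variable, the multivariate-normal data, and the correlation value are assumptions chosen for illustration, not a description of the exact analysis in [1].

```python
# Illustrative sketch of the CM's Stouffer aggregation step plus a small simulation.
# The per-variable ANOVA, multivariate-normal baseline data, and rho = 0.7 are
# assumptions for illustration only, not the exact procedure or data of [1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def stouffer_p(p_values):
    """Combine p-values by Stouffer's method: transform each p-value to a
    z-score, sum, and compare the scaled sum to a standard normal."""
    z = stats.norm.ppf(p_values)
    return stats.norm.cdf(z.sum() / np.sqrt(len(p_values)))

def trial_p_value(n_per_group=50, n_groups=2, n_vars=6, rho=0.0):
    """Simulate one 'trial': draw (possibly correlated) baseline variables for
    each group under truly random assignment, test equality of group means for
    each variable (one-way ANOVA), and combine the p-values with Stouffer's method."""
    cov = np.full((n_vars, n_vars), rho) + (1 - rho) * np.eye(n_vars)
    groups = [rng.multivariate_normal(np.zeros(n_vars), cov, size=n_per_group)
              for _ in range(n_groups)]
    p_vals = [stats.f_oneway(*(g[:, j] for g in groups)).pvalue
              for j in range(n_vars)]
    return stouffer_p(p_vals)

# With independent baseline variables (rho = 0) the combined p-values are roughly
# uniform; with correlated variables (rho = 0.7) they pile up near 0 and 1 even
# though assignment is truly random and no fraud or error is present.
for rho in (0.0, 0.7):
    p = np.array([trial_p_value(rho=rho) for _ in range(2000)])
    print(f"rho={rho}: share of combined p-values < 0.05 or > 0.95: "
          f"{np.mean((p < 0.05) | (p > 0.95)):.3f}")
```

The key point of the sketch is that Stouffer's method assumes independent p-values; positive correlation among baseline variables inflates the variance of the summed z-scores, producing an excess of extreme combined p-values of the kind the CM would flag.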

Implications for use of the CM for screening

The theoretical arguments I give also have implications for the use of the CM in screening.
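One way to see the screening challenge quantitatively is a back-of-the-envelope positive predictive value calculation. The prevalence, sensitivity, and false-positive rate below are hypothetical numbers chosen only for illustration; they are not estimates from [1] or from my reanalysis.

```python
# Hypothetical illustration of screening precision under a low base rate.
# None of these numbers are estimates from [1]; they are assumptions for illustration.
def screening_precision(prevalence, sensitivity, false_positive_rate):
    """Probability that a flagged trial truly has a data validity problem
    (positive predictive value), computed via Bayes' rule."""
    flagged_true = sensitivity * prevalence
    flagged_false = false_positive_rate * (1 - prevalence)
    return flagged_true / (flagged_true + flagged_false)

# Suppose, hypothetically, that 2% of trials have a serious data validity problem,
# the screen flags 80% of them, and it also flags 5% of unproblematic trials.
print(screening_precision(prevalence=0.02, sensitivity=0.80, false_positive_rate=0.05))
# -> ~0.25: roughly three out of four flagged trials would be false alarms.
```

Even with seemingly good operating characteristics, a low base rate of problematic trials means that most flags raised by a screen of this kind would be false positives, which is why precision, and not just sensitivity, matters for interpretation.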