
The Reporting of Observational Clinical Functional Magnetic Resonance Imaging Studies: A Systematic Review

  • Qing Guo ,

    guoq@mcmaster.ca

    Affiliations Department of Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Biostatistics Unit, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada

  • Melissa Parlar,

    Affiliation Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, Ontario, Canada

  • Wanda Truong,

    Affiliation Department of Psychology, University of Calgary, Calgary, Alberta, Canada

  • Geoffrey Hall,

    Affiliations Department of Psychology, Neuroscience and Behaviour, McMaster University, Hamilton, Ontario, Canada, Mood Disorders Program, St. Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada

  • Lehana Thabane,

    Affiliations Department of Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Biostatistics Unit, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada, Centre for Evaluation of Medicine, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada

  • Margaret McKinnon,

    Affiliations Mood Disorders Program, St. Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada, Kunin-Lunenfeld Applied Research Unit, Baycrest, Toronto, Ontario, Canada, Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, Ontario, Canada

  • Ron Goeree,

    Affiliations Department of Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Centre for Evaluation of Medicine, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada, PATH Research Institute, St. Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada

  • Eleanor Pullenayegum

    Affiliations Department of Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada, Biostatistics Unit, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada, Centre for Evaluation of Medicine, St Joseph's Healthcare Hamilton, Hamilton, Ontario, Canada

Abstract

Introduction

Complete reporting assists readers in confirming the methodological rigor and validity of findings and allows replication. The reporting quality of observational functional magnetic resonance imaging (fMRI) studies involving clinical participants is unclear.

Objectives

We sought to determine the quality of reporting in observational fMRI studies involving clinical participants.

Methods

We searched OVID MEDLINE for fMRI studies in six leading journals between January 2010 and December 2011. Three independent reviewers abstracted data from articles using an 83-item checklist adapted from the guidelines proposed by Poldrack et al. (Neuroimage 2008; 40: 409–14). We calculated the percentage of articles reporting each item of the checklist and the percentage of reported items per article.

Results

A random sample of 100 eligible articles was included in the study. Thirty-one items were reported by fewer than 50% of the articles and 13 items were reported by fewer than 20% of the articles. The median percentage of reported items per article was 51% (ranging from 30% to 78%). Although most articles reported statistical methods for within-subject modeling (92%) and for between-subject group modeling (97%), none of the articles reported observed effect sizes for any negative finding (0%). Few articles reported justifications for fixed-effect inferences used for group modeling (3%) and temporal autocorrelations used to account for within-subject variances and correlations (18%). Other under-reported areas included whether and how the task design was optimized for efficiency (22%) and distributions of inter-trial intervals (23%).

Conclusions

This study indicates that substantial improvement in the reporting of observational clinical fMRI studies is required. Poldrack et al.'s guidelines provide a means of improving overall reporting quality. Nonetheless, these guidelines are lengthy and may be at odds with strict word limits for publication; creation of a shortened-version of Poldrack's checklist that contains the most relevant items may be useful in this regard.

Introduction

In the past decade, the use of functional MRI (fMRI) in cognitive neuroscience has increased substantially [1], [2]. Given that fMRI is increasingly applied to the study of clinical disorders (e.g., [3]–[8]), and considering the vulnerability of clinical participants, there is an ethical imperative for scientists to apply rigorous methodology and to provide adequate reporting. Rigorous methodology is required in order to uphold the promises typically made to participants during the consent process, namely that the study will help investigators to understand their conditions. Complete reporting with sufficient detail permits readers to confirm the methodological rigor of a study [9], consider the validity of findings [10]–[14], and extend and replicate the findings [9]–[13], [15]–[17]. In particular, recent evidence indicates that, overall, the fMRI literature lacks key details in its methods sections, such as sample size calculations, whether temporal autocorrelations were modeled, descriptions of slice-timing and motion correction, and slice order and coverage of functional brain images [18], as well as related parameter estimates (i.e., effect sizes and variance components) in the results sections [19].

Standard guidelines have been developed to aid authors in reporting their research, such as the Consolidated Standards of Reporting Trials (CONSORT) [10] and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative [9]. More recently, Poldrack and colleagues proposed guidelines specifically for reporting fMRI studies [14]. Although many authors have suggested endorsing the guidelines proposed by Poldrack et al. to improve the quality, transparency, and consistency of reported results [2], [18], [20], [21], few systematic reviews have appraised the quality of reporting against these guidelines. A study by Carp (2012) recently examined adherence to Poldrack et al.'s guidelines in randomly selected fMRI studies published since 2007, but it included few studies involving clinical populations. Thus, the reporting quality of clinical fMRI studies remains unclear. Given the unique technical, interpretive, and methodological challenges that confront clinical fMRI studies, complete reporting of design, subject characteristics, analyses, and interpretation is needed to enhance the reproducibility of results in this subset of the fMRI literature. We therefore expected that reporting in clinical fMRI studies would differ from that of the fMRI literature overall.

Moreover, in our experience and based on anecdotal evidence, the majority of fMRI studies are observational (i.e., they do not randomize participants to test the efficacy and safety of a therapeutic intervention), and such studies are subject to less scrutiny than randomized clinical trials of experimental interventions; for example, randomized trials must be registered with clinicaltrials.gov. We therefore aimed to systematically evaluate the quality of reporting in observational fMRI studies involving clinical human participants (i.e., individuals who either have a disease or are at risk of developing a disease) using a checklist adapted from the guidelines proposed by Poldrack et al. In this study, we set out to address two questions: (1) what percentage of articles reported each item of the fMRI-specific guidelines, and (2) what percentage of items was reported per article?

Methods

Search Strategy and Eligible Journals

We searched OVID MEDLINE in January 2012 using keyword search terms (e.g., functional magnetic resonance imaging) combined with the acronym fMRI, limited to articles published in 2010 and 2011, written in English, and involving human participants. Compared with journals in general, top journals are cited more frequently (e.g., have higher impact factors (IF)) and are more heavily scrutinized prior to publication (e.g., have lower manuscript acceptance rates). Furthermore, studies have indicated that a high IF and a low manuscript acceptance rate are associated with higher methodological rigor of the articles a journal publishes [22]–[26]. We therefore constrained our selection to six leading journals. From the Journal Citation Reports 2010, we selected four journals with high IFs in the category “Neurosciences”, namely Neuron (IF 14.9), Nature Neuroscience (IF 14.2), Brain (IF 9.2), and Journal of Neuroscience (IF 7.3); the journal with the highest IF in the category “Neuroimaging” (NeuroImage, IF 5.94); and one journal that contributes a large number of fMRI articles [18] and has a high IF (Proceedings of the National Academy of Sciences of the United States of America, IF 9.8). More details on the search strategy can be found in Table S1. Duplicate articles were removed.

Eligibility Criteria for Studies and Study Selection

We included peer-reviewed, full reports of observational fMRI studies that involved human clinical participants and used a block, event-related, or mixed fMRI paradigm design. We excluded articles published only in abstract form, as well as editorials, letters, comments, and reviews. Genetic and resting-state observational fMRI studies, non-observational fMRI studies (e.g., randomized clinical trials), and studies of connectivity were also excluded. As studies of connectivity aim to identify and quantify correlations between brain regions [27], they have a different reporting focus vis-à-vis fMRI data analyses. For example, they report psychophysiological interaction analyses used to estimate effective connectivity or functional coupling rather than the data preprocessing steps, which have been shown to have substantial impacts on data quality and on the reliability and interpretation of fMRI results [28], [29]. However, the reporting essentials for connectivity studies are not reflected in currently available guidelines, including the one proposed by Poldrack et al. Because our study aimed to evaluate reporting quality against Poldrack et al.'s guidelines, we excluded this type of study to ensure consistency.

We set a target sample size of 100 articles meeting the predefined inclusion and exclusion criteria. We therefore randomly selected and assessed the eligibility of articles from the unique citations identified by the initial search (after duplicates were removed) until 100 eligible articles were reached.

Data Extraction

We created an electronic data extraction form containing 83 items adapted from the guidelines proposed by Poldrack et al. [14] to assess the reporting of included articles; we piloted the form on a random selection of four studies reviewed by three independent reviewers (QG, MP, and WT). Based on the pilot testing, we deleted three items from Poldrack et al.'s original checklist (unwarping of B0 distortions; describe any data quality control measures; any additional operations, e.g., masking out parts of the image) because assessing them was too subjective and produced large discrepancies among reviewers' judgments. Excluding them allowed a common interpretation of the items we did evaluate and hence increased between-reviewer agreement. The observed percentage of agreement on judgments between any two reviewers was 0.78 or higher. The final abstraction form was fixed prior to use (see Table S2). Data were extracted from each article and any online supplements. Items were answered with “Reported”, “Not Reported”, or “Not Applicable”.

Three authors (QG, MP, and WT), blinded to each other's assessments, independently abstracted the reporting of each article. Instead of having all three raters review all articles, we decided to have two reviewers rate each article. To determine the number of articles that needed to be evaluated in duplicate to ensure a desired level of reliability, we performed a sample size calculation [30], [31]. A sample size of 50 was chosen so as to estimate the kappa for inter-rater agreement within a margin of error of 0.3 with 95% confidence, assuming that the true kappa would be 0.6 or more and that the proportion of agreement by chance was 0.7 or less (see File S2). The first reviewer (QG) evaluated all 100 articles; 50 of these were randomly selected for the second reviewer (MP), and the other 50 were given to the third reviewer (WT) for abstraction, so that each article was rated by two reviewers.
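
File S2 gives the authors' derivation, which is not reproduced here. A minimal reconstruction, assuming the standard large-sample standard error of Cohen's kappa with the observed agreement implied by kappa = 0.6 and chance agreement 0.7, lands close to the 50 articles rated in duplicate; the function name and defaults below are illustrative only.

    from math import sqrt, ceil

    def n_for_kappa(kappa=0.6, p_chance=0.7, margin=0.3, z=1.96):
        """Articles needed so a 95% CI for kappa has half-width <= margin (illustrative sketch)."""
        p_obs = kappa * (1 - p_chance) + p_chance          # implied observed agreement (0.88)
        # Large-sample SE of kappa: sqrt(p_obs*(1 - p_obs)) / ((1 - p_chance) * sqrt(n));
        # solve z * SE <= margin for n.
        return ceil((z * sqrt(p_obs * (1 - p_obs)) / ((1 - p_chance) * margin)) ** 2)

    print(n_for_kappa())  # 51, in the neighbourhood of the 50 dual-rated articles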

After completion of the independent assessments, any disagreements between a pair of reviewers (i.e., QG and MP; QG and WT) were resolved by discussion between the two reviewers and, if necessary, by involving a third reviewer or content expert (GH) until consensus was reached. The raw data collected from the 100 studies are available in the online Supporting Information (see File S4).

Statistical Analysis

We calculated the percentage of studies that reported each evaluation item and a 95% confidence interval (CI) using an exact binomial method [32]. We then estimated the median, minimum and maximum percentages of reported items for each article.
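
The exact binomial interval cited above is the Clopper-Pearson interval, which can be obtained directly from beta-distribution quantiles; the sketch below is a generic illustration in Python rather than the authors' SAS code.

    from scipy.stats import beta

    def clopper_pearson(k, n, alpha=0.05):
        """Exact (Clopper-Pearson) two-sided confidence interval for a proportion k/n."""
        lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
        upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
        return lower, upper

    # e.g., an item reported by 22 of 100 articles
    print(clopper_pearson(22, 100))  # roughly (0.14, 0.31)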

Inter-rater agreement was assessed using the prevalence-adjusted bias-adjusted kappa (PABAκ) coefficient [33]. When the prevalence of a rating is very high or very low, the value of kappa may indicate a low level of agreement even when the observed percentage of agreement is high, a phenomenon known as the kappa paradox [34]. Hence, we used PABAκ [33] to address this paradox and to better interpret the inter-rater agreement. Kappa coefficients were interpreted using the scale proposed by Byrt [35]: 0.00 or less (no agreement), 0.01–0.20 (poor agreement), 0.21–0.40 (slight agreement), 0.41–0.60 (fair agreement), 0.61–0.80 (good agreement), 0.81–0.92 (very good agreement), 0.93–1.00 (excellent agreement).
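
For two raters and a binary Reported/Not Reported judgment, PABAκ reduces to 2p_o - 1, where p_o is the observed proportion of agreement. A minimal sketch (the example ratings are hypothetical):

    def pabak(ratings_a, ratings_b):
        """Prevalence- and bias-adjusted kappa for two raters: PABAK = 2 * p_o - 1."""
        p_obs = sum(a == b for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)
        return 2 * p_obs - 1

    # e.g., two reviewers agreeing on 45 of 50 articles for a given item
    print(pabak([1] * 50, [1] * 45 + [0] * 5))  # 0.80, "good" on Byrt's scale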

We performed a sample size calculation to determine the number of articles to be included in the extraction and analysis. A sample size of 100 was chosen so that, with 95% confidence, we would be able to quantify the true percentage of articles reporting each item to within 10% (see File S1). All statistical analyses were conducted using SAS version 9.2 (SAS Institute, Cary, NC).
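
File S1 contains the authors' calculation; the usual normal-approximation formula for a proportion, n = z^2 p(1 - p)/d^2 evaluated at its most conservative value p = 0.5, yields a figure consistent with the 100 articles used. The sketch below illustrates the arithmetic and is not the File S1 derivation itself.

    from math import ceil

    def n_for_proportion(p=0.5, margin=0.10, z=1.96):
        """Articles needed to estimate a proportion to within +/- margin with 95% confidence."""
        return ceil(z ** 2 * p * (1 - p) / margin ** 2)

    print(n_for_proportion())  # 97, rounded up to 100 in this review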

Results

Study Selection

After removing the duplicates, the initial search strategy identified 1120 unique articles. We screened the articles in a random order for eligibility until the quota of 100 eligible articles was reached. To reach this target, we assessed 1100 articles (see Figure S1 for a flow diagram). The list of the 100 eligible articles is included in File S3.

Study Characteristics

Among the 100 eligible articles published in the six leading journals in 2010 and 2011, about 60% came from the journal NeuroImage. The majority of study designs were cross-sectional (94%). The funding source was reported in 78% of the citations and came primarily from two or more different sources (77%) rather than from industry alone (1%). Fifty-three percent of the included articles were published in 2010 and the remaining forty-seven percent in 2011. The median total number of subjects was 34 (first quartile (Q1) = 26, third quartile (Q3) = 48), ranging from 8 to 126, and most studies (79%) had a sample size of no more than 50 (see Table 1).

Table 1. Characteristics of Included fMRI Studies (Information Extracted from Each Article).

https://doi.org/10.1371/journal.pone.0094412.t001

Items Commonly Reported

Of the 83 items, 22 were reported by 85% or more of the 100 included articles. Specifically, all of the studies reported sample sizes. Most studies further described the manufacturer, field strength, and model name of the scanner and the pulse sequence type (98%), statistical methods used for group modeling (97%), subjects' characteristics such as age and gender (94%), statistical methods used for within-subject modeling (92%), eligibility criteria for selecting subjects (91%), and whether statistical inferences were corrected for multiple comparisons (90%). Similarly, 86% of the articles reported how regions of interest (ROIs) were defined. Of the 86 articles that reported analyses not conducted on the whole brain, 80 (93%) explained how regions were determined (see Tables 2–10).

Table 2. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Experimental Design”.

https://doi.org/10.1371/journal.pone.0094412.t002

Table 3. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Study Subjects”.

https://doi.org/10.1371/journal.pone.0094412.t003

Table 4. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Image Properties”.

https://doi.org/10.1371/journal.pone.0094412.t004

Table 5. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Data Preprocessing”.

https://doi.org/10.1371/journal.pone.0094412.t005

Table 6. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Inter-subject Registration and Smoothing”.

https://doi.org/10.1371/journal.pone.0094412.t006

Table 7. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Statistical Modeling”.

https://doi.org/10.1371/journal.pone.0094412.t007

Table 8. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Statistical Inference on Statistic Image (thresholding)”.

https://doi.org/10.1371/journal.pone.0094412.t008

Table 9. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Statistical Inference on ROI Analysis”.

https://doi.org/10.1371/journal.pone.0094412.t009

Table 10. Percentage of articles reporting each item, inter-rater agreement on the item, and whether the item should be included in a future shortened checklist, for items relating to “Figures and Tables”.

https://doi.org/10.1371/journal.pone.0094412.t010

Items Not Commonly Reported

Among the 83 items, a total of 31 were reported by no more than 50% of the included articles, and 13 were reported by fewer than 20% of the articles. Critically, and in sharp contrast to Poldrack et al.'s guidelines, none of the studies reported observed effect sizes when they failed to reject the null hypothesis. Only one article (3%, 1/31) provided a justification for using fixed-effect inferences for group modeling. Other insufficiently reported items included slice-timing and motion corrections (12/100), temporal autocorrelation modeling used to account for within-subject variances and correlations (18/100), whether and how the task design was optimized for efficiency in event-related designs (22%, 8/35), distributions of inter-stimulus intervals (ISIs) when the ISI was variable (23%, 9/39), statistical methods for repeated measurements (24/100), and the smoothness and resolution element (RESEL) count when family-wise error (FWE) correction was performed using random field theory (RFT) (25%, 1/4). Moreover, only six articles (28%, 6/21) described whether variances were assumed equal among groups when there were more than two groups. Of the 35 articles that reported percent signal changes, 12 (34%, 12/35) explained how scaling factors were determined. Similarly, 45% (45/100) of the articles stated how signal was extracted within ROIs.

Reported Items per Article

The median (minimum, maximum) percentage of reported items per article was 51% (30%, 78%).

The inter-rater agreement was very good (PABAκ > 0.8) for 31 items, good (0.6 < PABAκ ≤ 0.8) for 31 items, fair (0.4 < PABAκ ≤ 0.6) for 20 items, and slight (PABAκ = 0.34) for one item (Tables 2–10). We note that some items had lower inter-rater agreement than others, which may reflect varying interpretations of the items among reviewers. For example, item 6 (“State how behavioral performance was measured”) had the lowest kappa statistic because it involved considerable subjectivity (e.g., if a standard tool such as E-Prime was cited, was it safe to assume the item was reported? If a standard tool was not used, what minimum details should be reported? Was this item necessary to report in every study?). We used duplicate review and consensus among reviewers to help reduce these biases and hence increase the reliability of our findings.

Specifics on Reported Items

Manuscript quality hinges not only on whether an item was reported but also on the specifics of the methods used. Here we describe the included articles' methodological choices regarding software, spatial smoothing, temporal filtering, and thresholding for statistical significance.

Seventy-eight percent of the articles reported the version of the software package used in fMRI data analyses (see Table 5), and 98% reported using at least one software package. Of these 98 articles, 71.4% used SPM, 11.2% used FSL, and 10.2% used BrainVoyager (Table 11). Packages used by fewer than 10 articles included AFNI (7.1%), MATLAB (6.1%), and XBAM (1.0%). Many software packages were reported with a version: SPM5 was the most commonly used, by 43.9% (43/98) of the articles, followed by SPM2 (17.3%, 17/98), SPM8 (8.2%, 8/98), and FSL without a specified version (6.1%, 6/98). No version of XBAM was specified (see Table 11 for details).


Spatial smoothing reduces noise and hence increases the signal-to-noise ratio, while reducing the resolution of the data [36], [37]. It is therefore important to specify the extent of spatial smoothing applied. Specifically, the size of the smoothing kernel determines how much the data are smoothed, which affects the extent of within-subject variability of estimates [38]. Reporting smoothing parameters helps readers judge the balance between improving sensitivity and maintaining the resolution of the functional image. As can be seen in Table 12, the majority of studies reported using spatial smoothing (88/100), with 95.5% (84/88) specifying a type of kernel. The widths of the smoothing kernels ranged from 3 mm to 12 mm, with a median width of 8 mm. The most frequent kernel width was 8 mm (42%, 37/88). Other common widths included 6 mm (29.5%, 26/88), 9 mm (8%, 7/88), and 10 mm (5.7%, 5/88). Widths used by fewer than 5 studies were 5 mm, 12 mm, 4 mm, 4.2 mm, and 3 mm. None of the studies justified their choice of smoothing kernel.
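
To make the reported parameter concrete: a Gaussian kernel of a given full width at half maximum (FWHM) corresponds to a standard deviation of FWHM/(2*sqrt(2 ln 2)), roughly FWHM/2.355. The sketch below is a generic illustration; the voxel size and FWHM defaults are arbitrary examples, not values drawn from the reviewed studies.

    import numpy as np
    from math import sqrt, log
    from scipy.ndimage import gaussian_filter

    def smooth_volume(volume, fwhm_mm=8.0, voxel_size_mm=3.0):
        """Isotropic Gaussian smoothing with the kernel width expressed as FWHM in mm."""
        sigma_voxels = fwhm_mm / (2.0 * sqrt(2.0 * log(2.0)) * voxel_size_mm)  # FWHM -> sigma
        return gaussian_filter(volume, sigma=sigma_voxels)

    smoothed = smooth_volume(np.random.rand(64, 64, 30))  # toy 3-D volume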

Table 12. The use of spatial smoothing, temporal filtering, and between-subject inference.

https://doi.org/10.1371/journal.pone.0094412.t012

As with spatial smoothing, temporal filtering aims to increase the signal-to-noise ratio. Since most of the noise in fMRI is of low frequency, high-pass filtering improves the ratio more than low-pass filtering does and is almost as effective as band-pass filtering [36], [39]. Specifying the filter cut-off parameter helps readers understand the temporal filtering process. Most studies (61/100) reported whether temporal filtering was used. Of the 60 studies that reported actual use of temporal filtering, most (95%, 57/60) used high-pass filtering; only a few used low-pass (1.7%, 1/60) or band-pass (3.3%, 2/60) filtering. Forty-eight studies reported the filter cut-off: the high-pass cut-offs ranged from 2.8 s to 318 s, with a median and mode of 128 s, compared with a single low-pass cut-off value of 6.7 s.
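
To illustrate what the cut-off means, a 128 s high-pass cut-off removes frequencies below 1/128 Hz (about 0.008 Hz) from each voxel's time series. The sketch below uses a Butterworth filter for simplicity; the reviewed studies typically used package-specific filters (e.g., SPM's discrete cosine basis), and the repetition time of 2 s is an arbitrary example.

    import numpy as np
    from scipy.signal import butter, filtfilt

    def highpass(timeseries, cutoff_s=128.0, tr_s=2.0, order=2):
        """Remove drifts slower than 1/cutoff_s Hz from a series sampled every tr_s seconds."""
        nyquist_hz = 0.5 / tr_s
        b, a = butter(order, (1.0 / cutoff_s) / nyquist_hz, btype="highpass")
        return filtfilt(b, a, timeseries)

    filtered = highpass(np.random.rand(200))  # toy time series of 200 volumes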

The threshold for statistical significance in voxel- or cluster-level analysis controls the type I error rate [40], and many papers have suggested using formal correction methods [40]–[45]. Of the 100 included studies, 78% reported the use of a per-voxel (or height) threshold. The most common per-voxel threshold was p<0.001 (32.1%, 25/78), followed by p<0.05 (30.8%, 24/78), p<0.01 (16.7%, 13/78), and p<0.005 (15.4%, 12/78). More than half of the studies (63/100) reported using a cluster-extent threshold, which ranged from 3 mm3 to 5625 mm3 with a median of 184 mm3. The majority of studies (81%, 81/100) reported using corrections for multiple testing; of these, 16.1% (13/81) did not report which correction method was used. Among the studies that reported a method, the corrections included family-wise error (28.4%, 23/81), false discovery rate (27.2%, 22/81), Monte Carlo simulation (18.5%, 15/81), Gaussian random field theory (4.9%, 4/81), and several others (4.9%, 4/81).
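
Of the correction methods listed, the false discovery rate procedure applied to voxelwise p-values (as in [42]) is simple enough to sketch directly; the p-value map named p_map below is a hypothetical placeholder.

    import numpy as np

    def fdr_threshold(p_values, q=0.05):
        """Benjamini-Hochberg step-up: largest p(k) with p(k) <= (k/m)*q, or None if nothing survives."""
        p_sorted = np.sort(np.asarray(p_values, dtype=float))
        m = p_sorted.size
        passes = p_sorted <= (np.arange(1, m + 1) / m) * q
        return p_sorted[passes][-1] if passes.any() else None

    # voxels with p <= fdr_threshold(p_map.ravel()) would be declared significant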

Discussion

This study identified some reporting practices in observational clinical fMRI studies that met expectations and other areas where reporting was less than adequate. In particular, only one quarter of the items from the reporting guidelines recommended by Poldrack et al. (2008) were reported adequately, and the median article reported only about half of the recommended items. Moreover, one third of the items were reported by fewer than half of the articles. Inadequately reported items were distributed across the categories of experimental design, inter-subject registration and smoothing, data preprocessing, statistical modeling, and statistical inference on ROI analysis. These results indicate that substantial room for improvement exists in the reporting of observational clinical fMRI studies.

Specifically, improvement is recommended in the reporting of details such as observed effect sizes in the results section when study results are negative, justifications for fixed-effect inferences used for group modeling, and the temporal autocorrelation structure used to account for within-subject variances and correlations. Because effect sizes observed in statistically significant regions overestimate true effect sizes [46], [47], including values from non-significant regions (e.g., those identified in similar previous studies) would provide a more realistic range of effect size estimates and reduce the risk of bias arising from reporting on active regions only. Given the existence of temporal autocorrelation in fMRI time series, incorporating an autocorrelation structure increases the accuracy of variance estimates, and reporting temporal autocorrelation estimates enables proper power analyses based on the method proposed by Mumford and Nichols [48]. Whereas findings from fixed-effect inferences apply specifically to the cohort of subjects studied, random-effect inferences generalize findings to the population from which the study sample was drawn [49]. The current recommendation is to use random-effect inferences for between-subject group modeling and fixed-effect inferences for single-subject modeling; providing a justification for using fixed effects for group modeling would therefore enhance understanding and interpretation.
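
As a simple illustration of why reported effect sizes matter for planning follow-up studies, a reported standardized group difference can feed a conventional two-sample power calculation; this is a deliberately simplified stand-in for the fMRI-specific group-level method of Mumford and Nichols [48], and the effect size of 0.5 is an arbitrary placeholder rather than a value from this review.

    from statsmodels.stats.power import TTestIndPower

    # Subjects per group for 80% power at a two-sided alpha of 0.05, given a reported
    # standardized effect size (d = 0.5 is a placeholder).
    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.80)
    print(round(n_per_group))  # about 64 per group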

This study differed substantially from the one existing review of fMRI reporting [18] in the number of items, the definitions of items, the study population, and the study design. For example, whereas Carp's study used a single reviewer, we conducted a systematic review with duplicate abstraction, measured inter-rater agreement, and resolved disagreements through consensus. Moreover, our study focused on observational studies with clinical participants, whereas Carp evaluated fMRI studies in general, which may not capture many studies involving clinical participants. There are also some notable differences in results between the two studies. For example, in the current study around one third of articles reported the distribution of inter-trial intervals, compared with one twelfth in Carp's study. About one half of articles in our study reported the number of subjects rejected from analyses, together with the reasons for rejection, which is about one quarter greater than in Carp's study. Similarly, fewer than one third of the articles in our study reported the following four methodological items, yet still showed better reporting than those in Carp's study: how potentially confounding variables were matched across groups for group comparisons, whether autocorrelations were modeled, whether equal variance was assumed across groups for multiple-group designs, and the number of RESELs and image smoothness for studies using FWE correction. We are unable to identify the specific factors underlying these differences between the current study and Carp's study; possible factors include the involvement of clinical participants, journal impact factor, and the exclusion of studies of connectivity. Future research could compare reporting quality between studies with and without clinical participants, between high and low impact factor journals, and with and without studies of connectivity. Despite these differences, both studies identified important items that are frequently absent from published reports, indicating that incomplete reporting hinders the evaluation, understanding, and interpretation of study findings and limits the use of results for synthesis (e.g., meta-analyses).

Complete reporting is particularly important for studies involving clinical populations, where ensuring methodological rigor is necessary to uphold investigators' promises to their participants that their participation will help society to better understand the nature of their condition. Our findings point towards the need for substantial improvement in this regard. In several other fields of health research, journals adopting standard reporting guidelines (e.g., the CONSORT statement) have been shown to achieve better quality of reporting than those that do not [50]–[52]; thus, the use of guidelines in the fMRI literature may also help improve the quality of reporting.

Implementation of the guidelines for reporting fMRI studies proposed by Poldrack and colleagues (2008) does face some challenges. First, authors often have strict word limits and the current guidelines are lengthy, making it important to identify which items are most essential. Second, some items are relevant to the quality of reporting of observational clinical studies but are not covered in Poldrack et al.'s guidelines (for example, sample size calculations in the methods section, characteristics of clinical participants, and participation flow diagrams to better understand potential bias due to non-participation [53]). Since reporting guidelines are evolving documents [54], we suggest dividing the list of items that should be reported into those that are essential, which should be placed in the manuscript itself, and those that are helpful, which can be included as online supplements. Some methodological parameters have more impact than others [28], [55] and hence should be considered essential items. Some journals (e.g., Nature) have recently removed space limitations on methods sections; however, since this is not yet widespread practice, it would still be useful to distinguish between essential and helpful items. In addition to text-based reporting, some items can be reported as source code (e.g., for data collection and statistical analyses) [56] or as machine-readable information compatible with different imaging analysis packages [57]. Our recommendation to create a list of essential items is not intended to supplant the existing guidelines but is rather a suggestion to consider during the next update of the guidelines. We hope that our suggestions will lead to further discussion and eventual consensus regarding what is in fact essential to report in the manuscript itself for observational clinical fMRI studies. For example, consensus could be reached through a meeting involving a variety of experts in this area, similar to the way the CONSORT statement was created. Involving journal editors in the process and securing their endorsement of the guidelines would encourage researchers to comply with the new standards.

The present study has several limitations. First, the findings reflect the quality of reporting of observational clinical fMRI studies published in six top neuroscience journals between 2010 and 2011 and may not apply to journals in general; most likely, they overestimate true rates of reporting. Second, several items on the checklist used for evaluation in this systematic review involve subjectivity. However, using duplicate review and consensus for any disagreements helped to reduce differences in interpretation between reviewers.

Conclusion

This study has highlighted under-reported areas in observational fMRI studies involving clinical participants and points towards a need for improvement. Adherence to the guidelines for fMRI studies proposed by Poldrack and colleagues could help improve the quality of reporting. Considering that the guidelines are evolving and need continual updating, we suggest constructing a checklist that captures the essential items to report, to accommodate practical needs, and encouraging adherence through the approaches discussed above, such as journal endorsement of the guidelines.

Supporting Information

Figure S1.

Flow Diagram of Citation Selection Process.

https://doi.org/10.1371/journal.pone.0094412.s001

(DOC)

File S1.

Sample size calculation for estimating a single proportion with a level of confidence.

https://doi.org/10.1371/journal.pone.0094412.s003

(DOC)

File S2.

Sample size calculation for estimating a Cohen's kappa coefficient with a given precision.

https://doi.org/10.1371/journal.pone.0094412.s004

(DOC)

File S4.

Raw data collected from the 100 studies.

https://doi.org/10.1371/journal.pone.0094412.s006

(XLS)

Table S1.

Search strategy for Ovid Medline database.

https://doi.org/10.1371/journal.pone.0094412.s007

(DOC)

Table S2.

Data extraction form containing 83 items adapted from Poldrack et al.'s checklist.

https://doi.org/10.1371/journal.pone.0094412.s008

(DOC)

Acknowledgments

We thank Joshua Carp, Russell Poldrack and one anonymous reviewer for their constructive comments on the earlier version of the manuscript.

Author Contributions

Conceived and designed the experiments: QG LT EP GH. Performed the experiments: QG MP WT. Analyzed the data: QG. Wrote the paper: QG. Interpreted data: QG RG MM GH LT EP. Reviewed manuscript: EP GH LT MM WT MP.

References

  1. Huettel SA, Song AW, McCarthy G (2009) Functional magnetic resonance imaging. Sunderland, MA: Sinauer Associates, Inc.
  2. Carter CS, Heckers S, Nichols T, Pine DS, Strother S (2008) Optimizing the design and analysis of clinical functional magnetic resonance imaging research studies. Biol Psychiatry 64: 842–849.
  3. Sheline YI, Barch DM, Donnelly JM, Ollinger JM, Snyder AZ, et al. (2001) Increased amygdala response to masked emotional faces in depressed subjects resolves with antidepressant treatment: An fMRI study. Biol Psychiatry 50: 651–658.
  4. Siegle GJ, Steinhauer SR, Thase ME, Stenger VA, Carter CS (2002) Can't shake that feeling: Event-related fMRI assessment of sustained amygdala activity in response to emotional information in depressed individuals. Biol Psychiatry 51: 693–707.
  5. Glahn DC, Ragland JD, Abramoff A, Barrett J, Laird AR, et al. (2005) Beyond hypofrontality: A quantitative meta-analysis of functional neuroimaging studies of working memory in schizophrenia. Hum Brain Mapp 25: 60–69.
  6. Snitz BE, MacDonald A III, Cohen JD, Cho RY, Becker T, et al. (2005) Lateral and medial hypofrontality in first-episode schizophrenia: Functional activity in a medication-naive state and effects of short-term atypical antipsychotic treatment. Am J Psychiatry 162: 2322–2329.
  7. Monk CS, Klein RG, Telzer EH, Schroth EA, Mannuzza S, et al. (2008) Amygdala and nucleus accumbens activation to emotional facial expressions in children and adolescents at risk for major depression. Am J Psychiatry 165: 90–98.
  8. Yoon JH, Minzenberg MJ, Ursu S, Ryan Walter BS, Wendelken C, et al. (2008) Association of dorsolateral prefrontal cortex dysfunction with disrupted coordinated brain activity in schizophrenia: Relationship with impaired cognition, behavioral disorganization, and global function. Am J Psychiatry 165: 1006–1014.
  9. von Elm E, Altman DG, Egger M, Pocock SJ, Gotzsche PC, et al. (2007) The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. Ann Intern Med 147: 573–577.
  10. Begg C, Cho M, Eastwood S, Horton R, Moher D, et al. (1996) Improving the quality of reporting of randomized controlled trials. The CONSORT statement. JAMA 276: 637–639.
  11. Chan AW, Krleza-Jeric K, Schmid I, Altman DG (2004) Outcome reporting bias in randomized trials funded by the Canadian Institutes of Health Research. CMAJ 171: 735–740.
  12. Chan AW, Altman DG (2005) Identifying outcome reporting bias in randomised trials on PubMed: Review of publications and survey of authors. BMJ 330: 753.
  13. Dwan K, Altman DG, Arnaiz JA, Bloom J, Chan AW, et al. (2008) Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS One 3: e3081.
  14. Poldrack RA, Fletcher PC, Henson RN, Worsley KJ, Brett M, et al. (2008) Guidelines for reporting an fMRI study. Neuroimage 40: 409–414.
  15. Young NS, Ioannidis JP, Al-Ubaydli O (2008) Why current publication practices may distort science. PLoS Med 5: e201.
  16. Langan S, Schmitt J, Coenraads PJ, Svensson A, von Elm E, et al. (2010) The reporting of observational research studies in dermatology journals: A literature-based study. Arch Dermatol 146: 534–541.
  17. Papathanasiou AA, Zintzaras E (2010) Assessing the quality of reporting of observational studies in cancer. Ann Epidemiol 20: 67–73.
  18. Carp J (2012) The secret lives of experiments: Methods reporting in the fMRI literature. Neuroimage 63: 289–300.
  19. Guo Q, Thabane L, Hall G, McKinnon M, Goeree R, et al. (2014) A systematic review of the reporting of sample size calculations and corresponding data components in observational functional magnetic resonance imaging studies. Neuroimage 86: 172–181.
  20. MacDonald AW III, Thermenos HW, Barch DM, Seidman LJ (2009) Imaging genetic liability to schizophrenia: Systematic review of fMRI studies of patients' nonpsychotic relatives. Schizophr Bull 35: 1142–1162.
  21. Huang W, Pach D, Napadow V, Park K, Long X, et al. (2012) Characterizing acupuncture stimuli using brain imaging with fMRI—a systematic review and meta-analysis of the literature. PLoS One 7: e32960.
  22. Lee KP, Schotland M, Bacchetti P, Bero LA (2002) Association of journal quality indicators with methodological quality of clinical research articles. JAMA 287: 2805–2808.
  23. Birken CS, Parkin PC (1999) In which journals will pediatricians find the best evidence for clinical practice? Pediatrics 103: 941–947.
  24. Opthof T (1997) Sense and nonsense about the impact factor. Cardiovasc Res 33: 1–7.
  25. Schoonbaert D, Roelants G (1996) Citation analysis for measuring the value of scientific publications: Quality assessment tool or comedy of errors? Trop Med Int Health 1: 739–752.
  26. Bruer JT (1982) Methodological rigor and citation frequency in patient compliance literature. Am J Public Health 72: 1119–1123.
  27. Lazar NA (2008) The statistical analysis of functional MRI data. New York, NY: Springer-Verlag New York.
  28. Strother S, La Conte S, Kai Hansen L, Anderson J, Zhang J, et al. (2004) Optimizing the fMRI data-processing pipeline using prediction and reproducibility performance metrics: I. A preliminary group analysis. Neuroimage 23 (Suppl 1): S196–S207.
  29. Churchill NW, Oder A, Abdi H, Tam F, Lee W, et al. (2012) Optimizing preprocessing and analysis pipelines for single-subject fMRI. I. Standard temporal motion and physiological noise correction methods. Hum Brain Mapp 33: 609–627.
  30. Altman DG (1991) Practical statistics for medical research. London: Chapman and Hall/CRC.
  31. El Emam K, Jonker E, Arbuckle L, Malin B (2011) A systematic review of re-identification attacks on health data. PLoS One 6 (12): e28071.
  32. Clopper C, Pearson ES (1934) The use of confidence or fiducial limits illustrated in the case of the binomial. Biometrika 26: 404–413.
  33. Byrt T, Bishop J, Carlin JB (1993) Bias, prevalence and kappa. J Clin Epidemiol 46: 423–429.
  34. Feinstein AR, Cicchetti DV (1990) High agreement but low kappa: I. The problems of two paradoxes. J Clin Epidemiol 43: 543–549.
  35. Byrt T (1996) How good is that agreement? Epidemiology 7: 561.
  36. Skudlarski P, Constable RT, Gore JC (1999) ROC analysis of statistical methods used in functional MRI: Individual subjects. Neuroimage 9: 311–329.
  37. Hopfinger JB, Buchel C, Holmes AP, Friston KJ (2000) A study of analysis parameters that influence the sensitivity of event-related fMRI analyses. Neuroimage 11: 326–333.
  38. Desmond JE, Glover GH (2002) Estimating sample size in functional MRI (fMRI) neuroimaging studies: Statistical power analyses. J Neurosci Methods 118: 115–128.
  39. Della-Maggiore V, Chau W, Peres-Neto PR, McIntosh AR (2002) An empirical comparison of SPM preprocessing parameters to the analysis of fMRI data. Neuroimage 17: 19–28.
  40. Bennett CM, Wolford GL, Miller MB (2009) The principled control of false positives in neuroimaging. Soc Cogn Affect Neurosci 4: 417–422.
  41. Poldrack RA (2012) The future of fMRI in cognitive neuroscience. Neuroimage 62: 1216–1220.
  42. Genovese CR, Lazar NA, Nichols T (2002) Thresholding of statistical maps in functional neuroimaging using the false discovery rate. Neuroimage 15: 870–878.
  43. Nichols TE, Holmes AP (2002) Nonparametric permutation tests for functional neuroimaging: A primer with examples. Hum Brain Mapp 15: 1–25.
  44. Nichols T, Hayasaka S (2003) Controlling the familywise error rate in functional neuroimaging: A comparative review. Stat Methods Med Res 12: 419–446.
  45. Friston KJ, Holmes A, Poline JB, Price CJ, Frith CD (1996) Detecting activations in PET and fMRI: Levels of inference and power. Neuroimage 4: 223–235.
  46. Maxwell SE (2004) The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychol Methods 9: 147–163.
  47. Mumford JA (2012) A power calculation guide for fMRI studies. Soc Cogn Affect Neurosci 7: 738–742.
  48. Mumford JA, Nichols TE (2008) Power calculation for group fMRI studies accounting for arbitrary design and temporal autocorrelation. Neuroimage 39: 261–268.
  49. Frackowiak RSJ, Ashburner JT, Penny WD, Zeki S (2004) Random effects analysis (Chapter 12). In: Ashburner J, Friston K, Penny W, editors. Human Brain Function. London, UK: Academic Press.
  50. Moher D, Jones A, Lepage L, CONSORT Group (2001) Use of the CONSORT statement and quality of reports of randomized trials: A comparative before-and-after evaluation. JAMA 285: 1992–1995.
  51. Plint AC, Moher D, Morrison A, Schulz K, Altman DG, et al. (2006) Does the CONSORT checklist improve the quality of reports of randomised controlled trials? A systematic review. Med J Aust 185: 263–267.
  52. Alvarez F, Meyer N, Gourraud PA, Paul C (2009) CONSORT adoption and quality of reporting of randomized controlled trials: A systematic analysis in two dermatology journals. Br J Dermatol 161: 1159–1165.
  53. Young EA, Breslau N (2004) Cortisol and catecholamines in posttraumatic stress disorder: An epidemiologic community study. Arch Gen Psychiatry 61: 394–401.
  54. Moher D, Schulz KF, Simera I, Altman DG (2010) Guidance for developers of health research reporting guidelines. PLoS Med 7: e1000217.
  55. Carp J (2012) On the plurality of (methodological) worlds: Estimating the analytic flexibility of fMRI experiments. Front Neurosci 6: 149.
  56. Carp J (2013) Better living through transparency: Improving the reproducibility of fMRI results through comprehensive methods reporting. Cogn Affect Behav Neurosci 13: 660–666.
  57. Ince DC, Hatton L, Graham-Cumming J (2012) The case for open computer programs. Nature 482: 485–488.