Abstract
Despite their recognized limitations, bibliometric assessments of scientific productivity have been widely adopted. We describe here an improved method that makes novel use of the co-citation network of each article to field-normalize the number of citations it has received. The resulting Relative Citation Ratio is article-level and field-independent, and provides an alternative to the invalid practice of using Journal Impact Factors to identify influential papers. To illustrate one application of our method, we analyzed 88,835 articles published between 2003 and 2010, and found that the National Institutes of Health awardees who authored those papers occupy relatively stable positions of influence across all disciplines. We demonstrate that the values generated by this method strongly correlate with the opinions of subject matter experts in biomedical research, and suggest that the same approach should be generally applicable to articles published in all areas of science. A beta version of iCite, our web tool for calculating Relative Citation Ratios of articles listed in PubMed, is available at https://icite.od.nih.gov.
Introduction
In the current highly competitive pursuit of research positions and funding support (Couzin-Frankel, 2013), faculty hiring committees and grant review panels must make difficult predictions about the likelihood of future scientific success. Traditionally, these judgments have largely depended on recommendations by peers, informal interactions, and other subjective criteria. In recent years, decision-makers have increasingly turned to numerical approaches such as counting first- or corresponding-author publications, using the impact factor of the journals in which those publications appear, and computing the Hirsch index (H-index) (Hirsch, 2005). The widespread adoption of these metrics, and the recognition that they are inadequate (Seglen, 1997; Anon, 2005; Anon, 2013), highlight the ongoing need for alternative methods that can provide effectively normalized and reliable data-driven input to administrative decision-making, both as a means of sorting through large pools of qualified candidates and as a way to help combat implicit bias. A return to purely subjective evaluation, with its attendant risk of partiality, is neither desirable nor practical, and the use of metrics that are of limited value in decision-making is widespread and growing (Pulverer, 2013). The need for useful metrics is particularly pressing for funding agencies making policy decisions based upon the evaluation of large portfolios that often encompass diverse areas of science.
Though each of the above-mentioned methods of quantitation has strengths, accompanying weaknesses limit their utility. Counting first- or corresponding-author publications does on some level reflect the extent of a scientist’s contribution to their field, but it has the unavoidable effect of privileging quantity over quality, and may undervalue collaborative science (Stallings et al., 2013). Journal Impact Factor (JIF) was for a time seen as a valuable indicator of scientific quality because it serves as a convenient, and not wholly inaccurate, proxy for expert opinion (Garfield, 2006). However, its blanket use also camouflages large differences in the influence of individual papers. This is because impact factor is calculated as the average number of times articles published over a two-year period in a given journal are cited, whereas citations actually follow a heavy-tailed log-normal rather than a Gaussian distribution (Price, 1976; Wang et al., 2013), so a journal’s average is dominated by a small number of highly cited articles. Moreover, since practitioners in disparate fields have differential access to high-profile publication venues, impact factor is of limited use in multidisciplinary science-of-science analyses. Despite these serious flaws, JIF continues to have a large effect on funding and hiring decisions (Anon, 2005; Johnston, 2013; Misteli, 2013). H-index, which attempts to assess the cumulative impact of the work done by an individual scientist, disadvantages early-career investigators; it also undervalues some fields of research by failing to normalize raw citation counts (Pulverer, 2013).
Alternative models for quantifying scientific accomplishment have been proposed but have not been widely adopted, perhaps because they are overly complicated to calculate and/or are difficult to interpret (Bollen et al., 2009; Waltman et al., 2011a). Some have dramatically improved our theoretical understanding of citation dynamics (Walker et al., 2007; Radicchi et al., 2008; Stringer et al., 2010; Wang et al., 2013). However, to combine a further technical advance with a high likelihood of widespread adoption by varied stakeholders, including scientists, administrators and funding agencies, several practical challenges must be overcome. Citation metrics must be article-level, field-normalized in a way that is scalable from small to large portfolios without introducing significant bias at any level, benchmarked to peer performance in order to be interpretable, and correlated with expert opinion. In addition, metrics should be freely accessible and calculated in a transparent way. Many efforts have been made to fulfill one or more of these requirements, including citation normalization to journals or journal categories (Moed et al., 1985; Zitt and Small, 2008; Opthof and Leydesdorff, 2010; van Raan et al., 2010; Waltman et al., 2011a, 2011b; Bornmann and Leydesdorff, 2013), citation percentiles (Bornmann and Leydesdorff, 2013; Bornmann and Marx, 2013), eigenvector normalization (Bergstrom and West, 2008; Bergstrom et al., 2008) and source normalization (Zitt and Small, 2008; Moed, 2010), including the Mean Normalized Citation Score (Waltman et al., 2011a) and Source-Normalized Impact per Paper metrics (Moed, 2010). While all are improvements on the Impact Factor, none meet all of the criteria listed above. Furthermore, these existing approaches are often unhelpful to decision-makers because they aggregate works from researchers across disparate geographical regions and institutional types. For example, current methods do not provide a way for primarily undergraduate institutions to compare their portfolios against those of other teaching-focused institutions, nor do they allow developing nations to compare their research to that done in other developing nations (Crous, 2014). Incorporating a customizable benchmark as an integral part of an ideal citation metric would enable such apples-to-apples comparisons and facilitate downstream decision-making.
We report here the development and validation of the Relative Citation Ratio (RCR) metric, which meets all of the above criteria and is based upon the novel idea of using the co-citation network of each article to field- and time-normalize by calculating the expected citation rate from the aggregate citation behavior of a topically linked cohort. An average citation rate is computed for the network, benchmarked to peer performance, and used as the RCR denominator; as is true of other bibliometrics, article citation rate (ACR) is used as the numerator. We use the RCR metric here to determine the extent to which National Institutes of Health (NIH) awardees maintain high or low levels of influence on their respective fields of research.
Results
Co-citation networks represent an article’s area of influence
Choosing to cite is the long-standing way in which one scholar acknowledges the relevance of another’s work. Before now, however, the utility of citations as a metric for quantifying influence has been limited, primarily because it is difficult to compare the value of one citation to another; different fields have different citation behaviors and are composed of widely varying numbers of potential citers (Jeong et al., 2003; Radicchi and Castellano, 2012). An effective citation-based evaluative tool must also take into account the length of time a paper has been available to potential citers, since a recently published article has had less time to accumulate citations than an older one. Finally, fair comparison is complicated by the fact that an author’s choice of which work to cite is not random; a widely known paper is more likely to be referenced than an obscure one of equal relevance. This is because citations accrue through a process of preferential attachment, which gives rise to power-law or log-normal citation distributions (Jeong et al., 2003; Eom and Fortunato, 2011; Wang et al., 2013). Functionally this means that, each time a paper is cited, it becomes more likely to be cited again.
An accurate citation-based measure of influence must address all of these issues, but we reasoned that the key to developing such a metric would be the careful identification of a comparison group, i.e., a cluster of interrelated papers against which the citation performance of an article of interest, or reference article (RA), could be evaluated. A network of papers linked to that RA through citations struck us as a promising basis for such a comparison group (Figure 1). There are a priori three types of article-linked citation networks (Small, 1973). A citing network is the collection of papers citing the RA (Figure 1a, top row), a co-citation network is defined as the other papers appearing in the reference lists alongside the RA (Figure 1a, middle row), and a cited network is the collection of papers in the reference list of the RA (Figure 1a, bottom row).
Properties of co-citation networks. (a) Schematic of a co-citation network. The Reference Article (RA) (red, middle row) cites previous papers from the literature (orange, bottom row); subsequent papers cite the RA (blue, top row). The co-citation network is the set of papers that appear alongside the article in the subsequent citing papers (green, middle row). The Field Citation Rate is calculated as the mean of the latter articles’ journal citation rates. (b) Growth of co-citation networks over time. Three RAs published in 2006 (red dots) were cited 5 (top row), 9 (middle row), or 31 times (bottom row) by 2011. Three intervals were chosen to illustrate the growth of the corresponding co-citation networks: 2006-2007, 2006-2009, and 2006-2011 (the first, second, and third columns, respectively). Each article in one of the three co-citation networks is shown as a separate green dot; the edges (connections between dots) indicate their presence together in the same reference list. (c) Cluster algorithm-based content analysis of the 215 papers in the co-citation network of a sample reference article (RA; panel b, bottom network series) identified a changing pattern of relevance to different sub-disciplines over time. This RA described the identification of new peptides of possible clinical utility due to their similarity to known conotoxins. Papers in the co-citation network of this RA focused on: (1) α-conotoxin mechanisms of action; (2) structure and evolution of conotoxins; (3) cyclotide biochemistry; (4) conotoxin phylogenetics; and (5) identification and synthesis of lantibiotics. (d) Growth of an article’s co-citation network is proportional to the number of times it has been cited. Each point is the average network size of 1000 randomly chosen papers with between 1 and 100 citations (error bars represent the standard error of the mean). Each paper is only counted once, even if it is co-cited with the article of interest multiple times. An average of 17.8 new papers is added to the co-citation network for each additional citation. This suggests substantial duplication of articles within a co-citation network, since on average 32.4 papers (median of 30) are referenced in each citing article.
All three types of networks would be expected to accurately reflect the interdisciplinary nature of modern biomedical research and the expert opinion of publishing scientists, who are themselves the best judges of what constitutes a field. Unlike cited networks, citing and co-citation networks can grow over time, allowing for the dynamic evaluation of an article’s influence; they can also indicate whether or not an article gains relevance to additional disciplines (Figure 1b, c). An important difference between citing and co-citation networks, however, is size. In this dataset, papers in the biomedical sciences have a median of 30 articles in their reference lists, so each citation event can be expected to add multiple papers to an article’s co-citation network (Figure 1d), but only one to its citing network. Citing networks are therefore highly vulnerable to finite number effects; in other words, for an article of interest with few citations, small changes in the citing network would have a disproportionate effect on how that article’s field was defined. We therefore chose to pursue co-citation networks as a way to describe an individual paper’s field.
Calculating the Relative Citation Ratio
Having chosen our comparison group, the next step was to decide how to calculate the values that numerically represent the co-citation network of each RA. The most obvious choice, averaging the citation rates of articles in the co-citation network, would also be highly vulnerable to finite number effects. We therefore chose to average the citation rates of the journals represented by the collection of articles in each co-citation network. If a journal was represented twice, its journal citation rate (JCR) was added twice when calculating the average JCR. For reasons of algorithmic parsimony we used the JCRs for the year each article in the co-citation network was published; a different choice at this step would be expected to have little if any effect, since almost all JCRs are quite stable over time (Supplemental Figure 1; Supplemental Table 1). Since a co-citation network can be reasonably thought to correspond with an RA’s area of science, the average of all JCRs in a given network can be redefined as that RA’s field citation rate (FCR).
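As a minimal sketch of this calculation (Python; the in-memory co-citation network, the journal citation rate lookup, and the attribute names are illustrative assumptions, not the iCite implementation):

from statistics import mean

def field_citation_rate(co_citation_network, jcr_table):
    """Average the journal citation rates (JCRs) over an article's co-citation
    network; a journal contributes once for every network paper it published
    (hypothetical data structures)."""
    jcrs = [
        jcr_table[(paper.journal, paper.year)]  # JCR for the year the co-cited paper appeared
        for paper in co_citation_network
    ]
    return mean(jcrs) if jcrs else None         # undefined for uncited articles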
Using this method (Figure 2a-c; Supplemental Figure 2; Supplemental Equations 1 and 2), we calculated FCRs for 35,837 papers published in 2009 by NIH grant recipients, specifically those who received R01 awards, the standard mechanism used by NIH to fund investigator-initiated research. We also calculated what the FCR would be if it were instead based on citing or cited networks. It is generally accepted that, whereas practitioners in the same field exhibit at least some variation in citation behavior, much broader variation exists among authors in different fields. The more closely a method of field definition approaches maximal separation of between-field and within-field citation behaviors, the lower its expected variance in citations per year (CPY). FCRs based on co-citation networks exhibited lower variance than those based on cited or citing networks (Table 1). Interestingly, a larger analysis of the 88,835 papers published by investigators with continuous R01 funding between 2003 and 2010 shows that FCRs also display less variance than either ACRs (p < 10⁻⁴, F-test for unequal variance) or JIFs (p < 10⁻⁴, F-test for unequal variance; Figure 2d, Table 1), confirming that co-citation networks are better at defining an article’s field than its journal of publication.
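The variance comparisons above are standard variance-ratio (F) tests; the following is a minimal sketch of such a test (Python with SciPy, with illustrative variable names, not the exact script behind the reported p-values):

import numpy as np
from scipy import stats

def f_test_unequal_variance(x, y):
    # Two-sided F-test for equality of variances, larger sample variance on top.
    x, y = np.asarray(x, float), np.asarray(y, float)
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    if var_x >= var_y:
        f, dfn, dfd = var_x / var_y, len(x) - 1, len(y) - 1
    else:
        f, dfn, dfd = var_y / var_x, len(y) - 1, len(x) - 1
    p_one_sided = stats.f.sf(f, dfn, dfd)
    return f, min(1.0, 2 * p_one_sided)

# e.g. f_test_unequal_variance(field_citation_rates, article_citation_rates)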
Variance of Field Citation Rates and Expected Citation Rates using different levels of the citation network for calculations (based on 35,837 R01-funded papers published in 2009).
Algorithm for calculating the Relative Citation Ratio. (a) Article Citation Rate (ACR) is calculated as the total citations divided by the number of years excluding the calendar year of publication (Supplemental Equation 1), when few, if any, citations accrue (Supplemental Figure 2). (b) Generate an expectation for article citation rates based on a preselected benchmark group, by regressing the ACR of the benchmark papers onto their FCRs (Supplemental Equations 3, 4), with one regression for each publication year. The graphed examples were sampled from a random distribution for illustrative purposes. (c) The coefficients from each year’s regression equation transform the Field Citation Rates of papers published in the same year into Expected Citation Rates (Supplemental Equation 5). Each paper’s RCR is its ACR/ECR ratio. A portfolio’s RCR is simply the average of the individual articles’ RCRs (Supplemental Equation 6). (d) Box-and-whisker plots of 88,835 NIH-funded papers (published between 2003 and 2010), summarizing their Article Citation Rate, Journal Impact Factor (matched to the article’s year of publication), and Field Citation Rate. Boxes show the 25th-75th percentiles with a line at the median; whiskers extend to the 10th and 90th percentiles.
Having established the co-citation network as a means of determining an FCR for each RA, our next step was to calculate ACR/FCR ratios. Since both ACR and FCR are measured in CPY, this generates a rateless, timeless metric that can be used to assess the relative influence of any two RAs. However, it does not measure these values against any broader context. For example, if two RAs have ACR/FCR ratios of 0.7 and 2.1, this represents a three-fold difference in influence, but it is unclear which of those values would be closer to the overall mean or median for a large collection of papers. One additional step is therefore needed to adjust the raw ACR/FCR ratios so that, for any given FCR, the mean RCR equals 1.0. Any selected cohort of RAs can be used as a standard for anchoring expectations, i.e. as a customized benchmark (Supplemental Equations 3-6). We selected R01-funded papers as our benchmark set; for any given year, regression of the ACR and FCR values of R01-funded papers yields the equation describing, for the FCR of a given RA published in that year, the expected citation rate (Figure 2b and Supplemental Table 2). The final step in calculating an RA’s RCR is to divide its ACR (the numerator) by the expected citation rate obtained by entering its FCR into that regression equation (the denominator); the resulting value thus incorporates normalization both to the article’s field of research and to the citation performance of its peers (Figure 2b, c and Supplemental Information).
For analyses where it is important that article RCRs sum to the number of papers for accounting purposes, ordinary least squares (OLS) linear regression of ACR on FCR will benchmark articles such that the mean RCR is equal to 1.0. However, because the distribution of article RCRs is skewed, the median article RCR will fall below 1.0. For comparison to the “average” article, quantile regression will instead yield a median RCR equal to 1.0. OLS regression benchmarking may be more suitable for large-scale analyses conducted by universities or funding agencies, while quantile regression benchmarking might be more suitable for web tools enabling search and exploration at the article or investigator level. In the following analyses, we used OLS regression, so that the mean RCR for benchmark articles is equal to 1.0.
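A minimal sketch of this benchmarking step (Python with pandas and statsmodels; the single-year framing and the column names 'acr' and 'fcr' are illustrative assumptions, not the iCite implementation):

import pandas as pd
import statsmodels.formula.api as smf

def compute_rcr(articles, benchmark, use_quantile=False):
    # articles, benchmark: DataFrames with 'acr' and 'fcr' columns for papers
    # from a single publication year (illustrative schema).
    if use_quantile:
        # Median-anchored benchmark: median RCR of benchmark articles ~ 1.0.
        fit = smf.quantreg("acr ~ fcr", benchmark).fit(q=0.5)
    else:
        # Mean-anchored benchmark: mean RCR of benchmark articles = 1.0.
        fit = smf.ols("acr ~ fcr", benchmark).fit()
    expected = fit.predict(articles)   # Expected Citation Rate (ECR) for each article
    return articles["acr"] / expected  # RCR = ACR / ECR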
Expert validation of RCR as a measure of influence
For the work presented here, we chose as a benchmark the full set of 311,497 RAs published from 2002 through 2012 by NIH-R01 awardees. To measure the degree of correspondence between our method and expert opinion, we compared RCRs generated by benchmarking ACR/FCR values against this standard to three independent sets of post-publication evaluations by subject matter experts (details in Supplemental Information). We compared RCR with expert rankings for 2193 articles published in 2009 and evaluated by Faculty of 1000 members (Figure 3a), as well as rankings of 430 Howard Hughes Medical Institute- or NIH-funded articles published between 2005 and 2011 and evaluated in a study conducted by the Science and Technology Policy Institute (STPI, Figure 3b), and finally, 290 articles published in 2009 by extramurally funded NIH investigators and evaluated by NIH intramural investigators in a study of our own design (Figure 3c; Supplemental Figures 5-7). All three approaches demonstrate that RCR values are well correlated with reviewers’ judgments. We asked experts in the latter study to provide, in addition to an overall score, scores for several independent sub-criteria: likely impact of the research, importance of the question being addressed, robustness of the study, appropriateness of the methods, and human health relevance. Random forest analysis indicated that their scores for likely impact were weighted most heavily in determining their overall evaluation (Supplemental Figure 6).
Relative Citation Ratios correspond with expert reviewer scores. (a-c) Bubble plots of reviewer scores vs. RCR for three different datasets. Articles are binned by reviewer score; bubble area is proportionate to the number of articles in that bin. (a) F1000 scores for 2193 R01-funded papers published in 2009. Faculty reviewers rated the articles on a scale of one to three (“Good”, “Very Good”, and “Exceptional”, respectively); those scores were summed into a composite F1000 score for each article (Supplemental Figure 3). (b) Reviewer scores of 430 HHMI and NIH-funded papers collected by the Science and Technology Policy Institute. (c) Scores of 290 R01-funded articles reviewed by experts from the NIH Intramural Research Program. Black line, linear regression.
In addition to correlating with expert opinion, RCR is ranking invariant, which is considered to be a desirable property of bibliometric indicators (Rousseau and Leydesdorff, 2011; Glänzel and Moed, 2012). In short, an indicator is ranking invariant when it is used to place two groups of articles in hierarchical order, and the relative positions in that order do not change when uncited articles are added to each group. The RCR metric is ranking invariant when the same number of uncited articles is added to two groups of equal size (Supplemental Equations 7-9). RCR is also ranking invariant when the same proportion of uncited articles is added to two groups of unequal size (Supplemental Equations 10-11). This demonstrates that the RCR method can be used effectively and safely in evaluating the relative influence of large groups of publications.
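As a minimal illustration of the equal-size case (the unequal-size, equal-proportion case follows the same pattern; see Supplemental Equations 10-11): consider two portfolios A and B, each containing n articles, with summed RCRs S_A > S_B. Adding the same number k of uncited articles (RCR = 0) to each portfolio gives

\bar{R}'_A = \frac{S_A}{n + k} > \frac{S_B}{n + k} = \bar{R}'_B,

so the ordering of the two portfolios by mean RCR is unchanged.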
Quantifying how past influence predicts future performance
We next undertook a large case study of all 88,835 articles published by NIH investigators who maintained continuous R01 funding from fiscal year (FY) 2003 through FY2010, to ask how the RCR of publications from individual investigators changed over this eight-year interval. Each of these investigators had succeeded at least once in renewing one or more of their projects through the NIH competitive peer review process. In aggregate, the RCR values for these articles are well-matched to a log-normal distribution; in contrast, as noted previously by others, the distribution of impact factors of the journals in which they were published is non-normal (Mansilla et al., 2007; Egghe, 2009) (Figure 4a, b). Sorting into quintiles based on JIF demonstrates that, though journals with the highest impact factors have the highest median RCR, influential publications can be found in virtually all journals (Figure 4c, d). Focusing on a dozen representative journals with a wide range of JIFs further substantiates the finding that influential science appears in many venues, and reveals noteworthy departures from the correlation between JIF and median RCR (see Supplemental Information). For example, NIH-funded articles in both Organic Letters (JIF = 4.7) and the Journal of the Acoustical Society of America (JIF = 1.6) have a higher median RCR than those in Nucleic Acids Research (JIF = 7.1; Figure 4e).
Properties of Relative Citation Ratios at the article and investigator level. (a, b) Frequency distribution of article-level RCRs (a) and Journal Impact Factors (b), from 88,835 papers (authored by 3089 R01-funded PIs) for which co-citation networks were generated. Article RCRs are well-fit by a log-normal distribution (R2 = 0.99), and Journal Impact Factors less so (R2 = 0.79). (c) Box-and-whisker plots summarizing Journal Impact Factors for the same papers, binned by Impact Factor quintile (line, median; box, 25th–75th percentiles; whiskers, 10th to 90th percentiles). (d) RCR for the same papers using the same bins by Journal Impact Factor quintile (same scale as c). Although the median RCR for each bin generally corresponds to the Impact Factor quintile, there is a wide range of article RCRs in each category. (e) Box-and-whisker plots summarizing RCRs of these same papers published in selected journals. In each journal, there are papers with article RCRs surpassing the median RCR of the highest Impact Factor journals (left three). The Impact Factor of each journal is shown above. (f, g) Frequency distribution of investigator-level RCRs (f) and Journal Impact Factors (g), representing the mean values for papers authored by each of 3089 R01-funded PIs. Dashed line in (f), mode of RCR for PIs.
As part of this case study we also calculated the average RCR and average JIF for papers published by each of the 3089 NIH R01 principal investigators (PIs) represented in the dataset of 88,835 articles. In aggregate, the average RCR and JIF values for NIH R01 PIs exhibited log-normal distributions (Figure 4f, g) with substantially different hierarchical ordering (Supplemental Figure 8). This raised a further question concerning PIs with RCR values near the mode of the log-normal distribution (dashed line in Figure 4f): as measured by the ability to publish work that influences their respective fields, to what extent does their performance fluctuate? We addressed this question by dividing the eight-year window (FY2003 through FY2010) in half. Average RCRs in the first time period (FY2003 through FY2006) were sorted into quintiles, and the percentage of PIs in the second time period (FY2007 through FY2010) that remained in the same quintile, or moved to a higher or lower quintile, was calculated. The position of PIs in these quintiles proved to be relatively immobile; 53% of PIs in the top quintile remained at the top, and 53% of those in the bottom quintile remained at the bottom (Figure 5a). For each PI we also calculated a weighted RCR (the number of articles multiplied by their average RCR); comparing on this basis yielded almost identical results (Figure 5b). It is worth noting that average FCRs for investigators were extremely stable from one 4-year period to the next (Pearson r = 0.92, Table 2). Since FCRs are the quantitative representation of co-citation networks, this further suggests that each co-citation network is successfully capturing the corresponding investigator’s field of research.
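A minimal sketch of this quintile-mobility calculation (Python with pandas; the DataFrame and column names are illustrative assumptions, not the analysis script used here):

import pandas as pd

def quintile_transition_matrix(pi_rcr):
    # pi_rcr: DataFrame indexed by PI, with columns 'rcr_2003_2006' and
    # 'rcr_2007_2010' holding each investigator's average RCR in the two periods.
    q1 = pd.qcut(pi_rcr["rcr_2003_2006"], 5, labels=range(1, 6))  # first-period quintile
    q2 = pd.qcut(pi_rcr["rcr_2007_2010"], 5, labels=range(1, 6))  # second-period quintile
    counts = pd.crosstab(q1, q2)
    return counts.div(counts.sum(axis=1), axis=0) * 100           # row percentages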
Scientific mobility of investigators’ influence relative to their field. Color intensity is proportional to the percentage of PIs in each quintile. (a) 3089 investigators who were continuously funded by at least one R01 were ranked by their articles’ average RCR in each time window, and split into quintiles. From left to right, investigators starting in different quintiles were tracked to see their rank in the next 4-year period. (b) The same analysis, but the number of published articles was multiplied by their average RCR to calculate an influence-weighted article count. PIs were ranked by this aggregate score and split into quintiles.
Summary of investigator-level bibliometric measures and their stability from one 4-year period to the next (PIs with 5 or more articles in each period, except for article count).
Discussion
The relationship between scientists and JIFs has been likened to the prisoner’s dilemma from game theory: because grant reviewers use JIFs in their evaluations, investigators must continue to weigh this in their decision-making or risk being out-competed by their peers on this basis (Casadevall and Fang, 2014; Shaw, 2014). A groundswell of support for the San Francisco Declaration on Research Assessment (http://www.ascb.org/dora) has not yet been sufficient to break this cycle (Alberts, 2013; Bertuzzi and Drubin, 2013; Schekman and Patterson, 2013; Suhrbier and Poland, 2013; Casadevall and Fang, 2014; Shaw, 2014). Continued use of the Journal Impact Factor as an evaluation metric will fail to credit researchers for publishing highly influential work. Articles in high-profile journals have average RCRs of approximately 3. However, high-Impact-Factor journals (JIF ≥ 28) only account for 11% of papers that have an RCR of 3 or above. Using Impact Factors to credit influential work means overlooking 89% of similarly influential papers published in less prestigious venues.
Bibliometrics like JIF and H-index are attractive because citations are affirmations of the spread of knowledge amongst publishing scientists, and are important indicators of the influence of a particular set of ideas. These and other prior attempts to describe a normalized citation metric have resulted in imperfect systems for the comparison of diverse scholarly works (Supplemental Information and Supplemental Figure 17), either because they measure only the average performance of a group of papers (Vinkler, 2003), or because the article of interest is measured against a control group that includes widely varying areas of science (Waltman et al., 2011a; Radicchi and Castellano, 2012; Leydesdorff and Bornmann, 2015). An example of the latter is citation percentiling, which the Leiden manifesto (Hicks et al., 2015) recently recommended as best practice in bibliometrics. The RCR method is an improvement over the use of citation percentiling alone, since masking the skewed distribution of citations and article influence, while statistically convenient, can disadvantage portfolios of high-risk, high-reward research that would be expected to have a small proportion of highly influential articles (Rand and Pfeiffer, 2009).
Though tracking the productivity of individual scientists with bibliometrics has been controversial, it is difficult to contradict the assertion that uncited articles (RCR = 0) have little if any influence on their respective fields, or that the best-cited articles (RCR > 20) are impressively influential. We have not determined whether smaller differences, for example those with average or slightly above-average RCRs (e.g. 1.0 versus 1.2), reliably reflect differential levels of influence. Further, citation-based metrics can never fully capture all of the relevant information about an article, such as the underlying value of a study or the importance of making progress in solving the problem being addressed. The RCR metric is also not designed to be an indicator of long-term impact, and citation metrics are not appropriate for applied research that is intended to target a narrow audience of non-academic engineers or clinicians. However, as citation rates mark the breadth and speed of the diffusion of knowledge among publishing scholars, these quantitative metrics can effectively supplement subject matter expertise in the evaluation of research groups seeking to make new discoveries and widely disseminate their findings.
Bibliometric methods also have the potential to track patterns of scientific productivity over time, which may help answer important questions about how science progresses. In particular, co-citation networks can be used to characterize the relationship between scientific topics (including interdisciplinarity), emerging areas, and social interactions. For example, is the membership of an influential group of investigators in a given field or group of fields stable over time, or is it dynamic, and why? Our data demonstrate the existence of an established hierarchy of influence within the exclusive cohort of NIH R01 recipients who remained continuously funded over an eight-year time frame. This may mean that investigators tend to ask and answer questions of similar interest to their fields. Additionally or alternatively, stable differences in investigators’ status, such as scientific pedigree, institutional resources, and/or peer networks, may be significant drivers of persistently higher or lower RCR values. Future statistical analyses may therefore reveal parameters that contribute to scholarly influence. To the extent that scientific (im)mobility is a product of uneven opportunities afforded to investigators, there may be practical ways in which funding agencies can make policy changes that increase mobility and seed breakthroughs more widely.
There is increasing interest from the public in the outcomes of research. It is therefore becoming necessary to demonstrate outcomes at all levels of funding bodies’ research portfolios, beyond the reporting of success stories that can be quickly and succinctly communicated. For this reason, quantitative metrics are likely to become more prominent in research evaluation, especially in large-scale program and policy evaluations. Questions about how to advance science most effectively within the constraints of limited funding require that we apply scientific approaches to determine how science is funded (Danthi et al., 2014; Kaltman et al., 2014; Mervis, 2014). Since quantitative analysis will likely play an increasingly prominent role going forward, it is critical that the scientific community accept only approaches and metrics that are demonstrably valid, vetted, and transparent, and insist on their use only in a broader context that includes interpretation by subject matter experts.
Recent work has improved our theoretical understanding of citation dynamics (Radicchi et al., 2008; Stringer et al., 2010; Wang et al., 2013). However, citation counts are not the primary interest of funding bodies, but rather progress in solving scientific challenges. The NIH particularly values work that ultimately culminates in advances to human health, a process that has historically taken decades (Contopoulos-Ioannidis et al., 2008). Here, too, metrics have facilitated quantitation of the diffusion of knowledge from basic research toward human health studies, by examining the type rather than the count of citing articles (Weber, 2013). Insights into how to accelerate this process will probably come from quantitative analysis. To credit the impact of research that may currently be underappreciated, comprehensive evaluation of funding outputs will need to incorporate metrics that can capture many other outputs, outcomes, and impact, such as the value of innovation, clinical outcomes, new software, patents, and economic activity. As such, the metric described here should not be viewed as a tool to be used as a primary criterion in funding decisions, but as one of several metrics that can provide assistance to decision-makers at funding agencies or in other situations in which quantitation can be used judiciously to supplement, not substitute for, expert opinion.
Materials and Methods
Citation data
The Thomson Reuters Web of Science citation dataset from 2002-2012 was used for all citation analyses. Because of our primary interest in biomedical research, we limited our analysis to those journals in which NIH R01-funded researchers published during this time. For assigning a journal citation rate to a published article, we used the 2-year synchronous journal citation rate (Garfield, 1972; Rousseau and Leydesdorff, 2011) for its journal in the year of its publication. Publications from the final year of our dataset (2012) were not included in analyses because they did not have time to accrue enough citations from which to draw meaningful conclusions, but references from these papers to earlier ones were included in citation counts.
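For reference, the 2-year synchronous journal citation rate for a journal in year y is the number of citations received in year y by the articles that journal published in years y-1 and y-2, divided by the number of articles it published in those two years (the same construction as the two-year Journal Impact Factor). A minimal sketch (Python; the two lookup functions are hypothetical stand-ins for queries against the citation database):

def journal_citation_rate(citations_received, articles_published, journal, year):
    # citations_received(journal, published_year, citing_year) and
    # articles_published(journal, published_year) are assumed helper lookups.
    cites = sum(citations_received(journal, y, year) for y in (year - 1, year - 2))
    n_articles = sum(articles_published(journal, y) for y in (year - 1, year - 2))
    return cites / n_articles if n_articles else None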
Grant and Principal Investigator data
Grant data were downloaded from the NIH RePORTER database. Grant-to-publication linkages were first derived from the NIH SPIRES database, and the data were cleaned to address false positives and false negatives. Grant and publication linkages to Principal Investigators were established using Person Profile IDs from the NIH IMPAC-II database. To generate a list of continuously funded investigators, only those Person Profile IDs with active R01 support in each of Fiscal Years 2003-2010 were included.
Calculations and data visualization
Co-citation networks were generated in Python (Python Software Foundation, Beaverton, OR). This was accomplished on a paper-by-paper basis by assembling the list of articles citing the article of interest, and then assembling the list of papers appearing in the reference lists of those citing articles; this combined list of co-cited papers was then de-duplicated. Example code for generating co-citation networks and calculating Field Citation Rates is available on GitHub (http://github.com/NIHOPA). Further calculations were handled in R (R Foundation for Statistical Computing, Vienna, Austria). Visualizations were generated in Prism 6 (GraphPad, La Jolla, CA), SigmaPlot (Systat Software, San Jose, CA), or Excel 2010 (Microsoft, Redmond, WA).
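A minimal sketch of this paper-by-paper procedure (Python; the helper functions citing_papers and references are hypothetical stand-ins for queries against the citation database, and the example code on GitHub remains the authoritative version):

def co_citation_network(reference_article, citing_papers, references):
    # Collect every paper that appears in a reference list alongside the
    # reference article (RA); the set de-duplicates, and the RA itself is excluded.
    network = set()
    for citing in citing_papers(reference_article):  # papers that cite the RA
        network.update(references(citing))           # their full reference lists
    network.discard(reference_article)
    return network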
When comparing citation rates to other metrics (e.g. post-publication review scores), citation rates were log-transformed due to their highly skewed distribution, unless these other scores were similarly skewed (i.e. Faculty of 1000 review scores). For this process, article RCRs of zero were converted to the first power of 10 lower than the lowest positive number in the dataset (generally 10⁻²). In the analysis of Principal Investigator RCRs, no investigators had an average RCR of zero.
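A minimal sketch of this zero-handling step (Python with NumPy; illustrative, not the exact analysis script):

import numpy as np

def log_transform_rcr(rcr):
    # Replace zeros with the first power of 10 below the smallest positive
    # value in the dataset (typically 10**-2), then log10-transform.
    rcr = np.asarray(rcr, float)
    positive_min = rcr[rcr > 0].min()
    exponent = np.floor(np.log10(positive_min))
    if 10.0 ** exponent >= positive_min:  # exact power of 10: step one lower
        exponent -= 1
    return np.log10(np.where(rcr > 0, rcr, 10.0 ** exponent))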
Acknowledgements
We thank Francis Collins, Larry Tabak, Kristine Willis, Mike Lauer, Jon Lorsch, Stefano Bertuzzi, Steve Leicht, Stefan Maas, Riq Parra, Dashun Wang, Patricia Forcinito, Carole Christian, Adam Apostoli, Aviva Litovitz and Paula Fearon for their thoughtful comments on the manuscript, Michael Gottesman for help organizing post-publication peer review, and Jason Palmer, Fai Chan, Rob Harriman, Kirk Baker and Kevin Small for help with data processing and software development for iCite.