1 Abstract
Motivation The National Institutes of Health (NIH) is the major source of federal funding for biomedical research in the United States. Analysis of past and current NIH funding can illustrate funding trends and identify productive research topics, but these analyses are conducted ad hoc by the institutes themselves and only provide a small glimpse of the available data. The NIH provides free access to funding data via NIH EXPORTER, but no tools have been developed to enable analysis of this data.
Results We developed the nihexporter R package, which provides access to NIH EXPORTER data. We used the package to develop several analysis vignettes that show funding trends across NIH institutes over 15 years and highlight differences in how institutes change their funding profiles. Investigators and institutions can use the package to perform self-studies of their own NIH funding.
Availability The nihexporter R package can be installed via github.
Implementation The nihexporter package is implemented in the R Statistical Computing Environment.
Contact Jay Hesselberth jay.hesselberth{at}gmail.com, University of Colorado School of Medicine
2 Introduction
The National Institutes of Health (NIH) is the major source federal funds for biomedical research in the United States. The NIH budget is approved by Congress each year, and in fiscal year 2014, the total NIH budget was 30.3 billion dollars (REF). The NIH is divided into 25 institutes, each with its own focus and mission. For example, National Cancer Institue (NCI) focuses on malignant diseases; the National Institute for Allergy and Immune Disease focuses on the immune system and transmissible diseasei; and the National Institute for General Medical Sciences focuses on basic research without a specific disease focus. Each institute negotiates with the NIH director (currently Francis Collins, MD, PhD) for its yearly budget, with budget institutes ranging from millions to several billion dollars.
The NIH provides funds through competetive grants written by internal and external investigators, and the funds associated with these grants can be divided into ‘direct’ and ‘indirect’ costs. Direct costs are funds that are given to an investigator (or group of investigators) to conduct their proposed research. These funds buy supplies for the experiments and pay the salaries of people to do the work.
By contrast, indirect costs are funds that are paid to institutions associated with investigators, and are used to “keep the lights on”: they pay for infrastructure costs. However, the “indirect cost recovery” (ICR) rate of each institution, the fraction of each award the institute receives, is congressionally mandated, and there is a wide range in ICR rates. Some of the highest ICR rates are close to 100%, meaning that for every dollar an investigator receives, the institutions receive an equal amount.
NIH funding is an investment strategy: the institutes invest money in specific research areas, hoping for future returns in the form of publications, patents and skilled trainees. As with any investment strategy, a periodic review can help rebalance the portfolio in order to maximize returns. Analysis of NIH funding data has been performed internally by the NIH, or by contracted third-parties. Several of these analyses have highlighted funding trends and suggested metrics to gauge the ‘return’ on the NIH ‘investment’. For example, “productivity” can be examined as a function of the number of publications produced by grants per dollar of “direct costs”.
3 Methods
We downloaded NIH funding data from the NIH EXPORTER website in comma-separated value (CSV) format and parsed these data into R data files that each contain specific information:
projects has information about projects in each fiscal year, keyed by project.num
org_info has information about project organizations
project_orgs links applications to organizations
project_pis links PI information to a project
publinks table links PubMed IDs to project.num
publications links project IDs to PubMed IDs
patents links patent IDs to project IDs
See the documentation in the R package for more information about each table.
The package also has several precomputed variables and tables that enable quick and easy exploratory analysis:
nih.institutes: Two-letter format for 27 NIH institutes
project_io: This table contains pre-computed values for overall project cost (project.cost), as well as the number of publications (n.pubs) and patents (n.patents) associated with each project.
NIH EXPORTER provides access to the total costs of each grant in each fiscal year, comprising both direct and indirect costs.
We developed several analyses using the dplyr, ggplot2 and knitr packages in the RStudio environment.
4 Results and Discussion
4.1 Project costs
Let’s look at the costs of grants over time for a few institutes:
library(scales) library(stringr) library(dplyr) library(ggplot2) library(nihexporter) select_inst <- c('GM', 'AI', 'CA') cost.over.time <- projects %>% select(institute, fy.cost, fiscal.year) %>% filter(institute %in% select_inst) %>% group_by(fiscal.year, institute) %>% summarize(yearly.cost = sum(fy.cost, na.rm = TRUE)) cost.over.time %>% ggplot(aes(x = factor(fiscal.year), y = yearly.cost / 1e9, group = institute, color = institute)) + geom_line() + geom_point() + theme_bw() + theme(legend.position = 'bottom') + xlab('Fiscal year') + ylab('Project costs (dollars in billions)')4.2 Funding distributions
4.2.1 By Institution
Let’s look where the money is going. This example illustrates linking of the project, org_info and project_orgs tables.
institution_funding <- projects %>% filter(activity == 'R01') %>% left_join(project_orgs, by = 'application.id') %>% group_by(org.duns, fiscal.year) %>% summarise(total.dollars = sum(fy.cost, na.rm = TRUE)) %>% ungroup() %>% left_join(org_info, by = 'org.duns') %>% arrange(desc(total.dollars)) institution_funding %>% select(fiscal.year, org.name, org.duns, total.dollars) %>% mutate(total.award.millions = total.dollars / 1e6) %>% select(-total.dollars) %>% head(10) %>% knitr::kable(caption='R01 grant dollars awarded to specific institutions', col.names=c('Fiscal year', 'Org.', 'DUNS', 'Dollars (millions)'))We can also identify movers-and-shakers by institution.
4.2.2 By PI
One can also examine how dollars are accrued by specific PIs. It is not possible to assign dollars directly to a PI, because some grants have multiple investigators. Rather, these are total costs that a given PI has been associated with over all grants in NIH EXPORTER. Here we identify PIs with the largest dollar amounts accrued for R01 grants.
dollars.per.pi <- projects %>% filter(activity == 'R01') %>% left_join(project_io, by = 'project.num') %>% left_join(project_pis, by = 'project.num') %>% filter(!is.na(pi.id)) %>% select(project.num, pi.id, total.cost) %>% unique() %>% group_by(pi.id) %>% summarise(pi.dollars = sum(total.cost) / 1e6) %>% arrange(desc(pi.dollars)) dollars.per.pi %>% head(10) %>% knitr::kable(digits = 3, caption = 'R01 grant dollars associated with specific PIs', col.names=c('PI ID', 'Dollars (millions)'))4.3 Productivity
In order to measure the “return” on the money the NIH invests in the research enterprise, we can measure scholarly output (i.e., publications) per dollar invested.
Here we identify th highest performing grants outside of the R01 category. Much has been made of the wasteful spending outside of investigator-initiated research. Here we identify the cost of publications for grants other than R01’s that cost more than 1 million dollars.
high.perf.not.r01 <- projects %>% filter(activity != 'R01') %>% left_join(project_io, by = 'project.num') %>% select(project.num, total.cost, n.pubs) %>% filter(total.cost > 1e6 & n.pubs > 0) %>% mutate(cost.per.pub = total.cost / n.pubs) %>% arrange(cost.per.pub) high.perf.not.r01 %>% head(10) %>% mutate(total.cost = round(total.cost / 1e6, 3), cost.per.pub = comma(round(cost.per.pub, 0))) %>% knitr::kable(caption='Productivity (publications / dollar) of non-R01 grants', col.names=c('Project ID', 'Project cost (dollars, millions)', 'Number of publications', 'Cost per publications (dollars)'))We can also identify the specific publications associated with grants with the least expensive publications.
high.perf.not.r01 %>% head(1) %>% select(project.num) %>% left_join(publinks, by = 'project.num') %>% head(5) %>% knitr::kable(caption='Publications from the most productive grants', col.names=c('Project ID', 'Pubmed ID'))We can also identify productive PIs with current R01s…
productive.pis <- projects %>% filter(activity == 'RO1' & fiscal.year == 2O14) %>% select(project.num) %>% left_join(project_io, by = 'project.num') %>% left_join(project_pis, by = 'project.num') %>% group_by(pi.id) %>% summarize(total.pi.dollars = sum(total.cost, na.rm = TRUE), total.pubs = sum(n.pubs)) %>% mutate(pub.cost = total.pi.dollars I total.pubs) %>% # prevent PI Ids from being commafied mutate(pi.id = as.character(pi.id)) %>% arrange(pub.cost) productive.pis %>% head(1O) %>% knitr::kable(digits = O, format.args = list(big.mark = ','), col.names = c('PI ID', 'Cost per publication (dollars)', 'Total publications', 'Total project costs (dollars)'))Or we can identify the all-time most produtive PIs, independent of grant type or time frame…
productive.pis.all.time <- projects %>% select (project.num, activity) %>% left_join(project_io, by = 'project.num') %>% left_join(project_pis, by = 'project.num') %>% group_by(pi.id, activity) %>% summarize(total.pi.dollars = sum(total.cost, na.rm = TRUE), total.pubs = sum(n.pubs), total.patents = sum(n.patents)) %>% ungroup () %>% filter(total.pi.dollars >= 1e6) %>% mutate(pub.cost = total.pi.dollars / total.pubs, patent.cost = total.pi.dollars / total.patents) %>% select(pi.id, activity, pub.cost, patent.cost, total.pubs, total.patents, total.pi.dollars) %>% mutate(pi.id = as.character(pi.id)) productive.pis.all.time %>% select (-patent.cost, -total.patents) %>% arrange (pub.cost) %>% head(1O) 7.>7. knitr::kable(digits = O, format.args = list(big.mark = ','), caption = 'Most productive PIs all time for publications', col.names = c('PI ID', 'Grant type', 'Cost per publication (dollars)', 'Total publications', 'Total dollars'))When Jeremy Berg was head of the Institute of General Medical Sciences (NIGMS) from 2003–2011, he routinely provided analysis of funding trends at NIGMS in his “Feedback Loop” blog. One of these measured the productivity per grant dollar by measuring its “scholarly output” (i.e., publications) as a function of direct costs. In this plot there is an increase in productivity per dollar, until an inflection point at $700K, after which the number of publications drops, suggesting a negative influence of grant money on scholarly output. This was interesting and covered here.
Here we flesh out this analysis and look at how all institutes perform by this measure (Berg, and now Lorsch, only analyzed GM). One caveat is that we only have access to total.cost in NIH EXPORTER, so the numbers include indirect costs. But, this is real cost to the tax-payer.
First, we need to calculate the lifetime costs of all R01 grants.
# calculate costs of all grants, over the entire lifetime of the grant grant_costs <- projects %>% filter(institute %in% nih.institutes & activity == 'R01') %>% left_join(project_io, by = 'project.num') %>% select(institute, project.num, total.cost, n.pubs) %>% unique()Next, we need to identify grants in each of the bins that Berg previously alluded to. dplyr makes this easy with the ntile() function. Berg previously divided grants into ~15 bins, we’ll bin into ~5%.
bin_grant_costs <- grant_costs %>% group_by(institute) %>% group_by(n.tile = ntile(total.cost, 20))dollar_bin_grant_costs %>% group_by(dollar.tile) %>% summarize(count = n())
That looks better. Now we can make the summary plots…
dollar_bin_grant_costs %>% # need to remove higher tiles because there are too few grants filter(dollar.tile <= 13) %>% ggplot(aes(x = factor(dollar.tile), y = n.pubs)) + geom_boxplot(outlier.shape = NA) + scale_x_discrete(labels = breaks / 1e6) + theme(axis.text.x = element_text(angle=45, vjust=0.8)) + scale_y_log10() + theme_bw() + facet_wrap(~ institute, scales = "free_x") + ylab('Number of publications') + xlab('Total costs (minimum, in millions)')4.4 Comparison of grant programs
The NIH provides funds through differet grant programs:
research: investigator-intitiated, actvities begin with R (e.g., R01)
program: activities begin with P (e.g. P01)
contract: actvities begin with U (e.g. U54)
We can examine the total costs spent on specific grants and specific institutes over time.
Or we can see how institutes allocate money generally to different types of grants.
# from http://grants.nih.gov/grants/funding/funding_program.htm research_projects <- projects %>% filter (grepl(' ̑R', activity)) %>% select (project.num) program_projects <- projects %>% filter (grepl ('̑P ', activity)) %>% select (project.num) contract_projects <- projects %>% filter (grepl ('̑U', activity)) %>% select (project.num) select_inst <- c('AI','CA','GM','HG','AA','MH') grant_costs <- projects %>% filter (institute %>% nih.institutes) %>% select(project.num, institute, fiscal.year, fy.cost) research_costs <- grant_costs %>% semi_join(research_projects, by = 'project.num') %>% group_by(project.num, institute, fiscal.year) %>% summarize(project.cost = sum(fy.cost, na.rm = TRUE)) %>% mutate(type = 'research') program_costs <- grant_costs %>% semi_join(program_projects, by = 'project.num') %>% group_by(project.num, institute, fiscal.year) %>% summarize (project.cost = sum(fy.cost, na.rm = TRUE)) %>% mutate(type = 'program') contract_costs <- grant_costs %>% semi_join(contract_projects, by = 'project.num') %>% group_by(project.num, institute, fiscal.year) %>% summarize (project.cost = sum(fy.cost, na.rm = TRUE)) %>% mutate(type = 'contract') combined_tbl <- bind_rows(research_costs, program_costs, contract_costs) ggplot(combined_tbl, aes(x = factor(fiscal.year), y = project.cost, fill = type)) + geom_boxplot(outlier.shape = NA) + scale_y_log10(limits = c(1e4,1e7), labels = trans_format("log10", math_format(10̑.x))) + facet_wrap(-institute) + theme_bw() + scale_fill_brewer(palette = "Dark2") + theme(legend.position = 'bottom') + xlab('') + ylab('Total grant costs, log-scaled') + theme(axis.text.x = element_text(angle = 90))4.5 Duration
The nihexporter exposes project.start and project.end, which we can use to examine the duration of projects. For example, we can identify the longest running R01 grants.
long.grants <- projects %>% filter (activity == 'Ro1') %>% select (project.num, project.start, project.end) %>% group_by(project.num) %>% summarize (longest.run = max (project.end) - min(project.start)) %>% arrange(desc(longest.run)) %>% mutate(in.years = as.numeric(longest.run) I 365) %>% select(project.num, in.years) long.grants %>% head(10) %>% knitr::kable(digits = 2, caption = 'Longest running R01 grants (all-time)', col.names=c('Project ID', 'Duration (years)'))4.5.1 Geographical distribution
Geographical distribution of grant dollar is easily visualized using the package.
state.data <- data.frame(org.state = state.abb, state.name = tolower(state.name)) state_funding <- projects %>% select(application.id) %>% left_join(project_orgs) %>% left_join(org_info) %>% left_join(projects, by = 'application.id') %>% select(application.id, org.state, fy.cost) %>% group_by(org.state) %>% summarize(total.fy.cost = sum(fy.cost) / 1e9) cost.by.state <- state_funding %>% left_join(state.data) %>% select(state.name, total.fy.cost) %>% filter(state.name != "NA") %>% mutate(region=state.name, cost = total.fy.cost) %>% select(region, cost) %>% data.frame() state.map.data <- map_data("state") plot_data <- left_join(state.map.data, cost.by.state) ggplot() + geom_polygon(data=plot_data, aes(x=long, y=lat, group = group, fill=plot_data$cost), colour="black") + scale_fill_continuous(low = "lightgrey", high = "red", guide="colorbar") + theme_bw() + labs(fill = "Total Cost per year \n (Billions of dollars)",x="", y="") + #title = "Total cost per fiscal year by state" scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border = element_blank()) + coord_fixed(ratio = 1.4)5 Acknowledgments
The nihexporter package was developed over several weeks during a course on data analysis (http:// molb7621.github.io). Thanks to the NIH EXPORTER team for fixing issues with the EXPORTER tables.