A crosswalk to align the GSS and NIH grants databases based on university names

Notable reports over the past dozen years have recommended that the federal government and others improve data collection on the research workforce in the United States. The federal government already collects a wealth of data, but important datapoints, such as the number of biomedical postdocs working in the U.S., are still not well defined. Furthermore, among the data that are collected, differences in collection methods and naming conventions, such as inconsistent naming of universities, hinder our ability to merge information across datasets. Here I describe the creation of three macros meant to align the National Center for Science and Engineering Statistics’ Survey of Graduate Students and Postdoctorates in Science and Engineering (GSS) with NIH grants databases by consolidating university names under a single name common to both databases. Aligning these databases will allow for a deeper understanding of how various federal and university policies affect the number of trainees, grants, and funding at individual institutions.


Introduction
Collecting and analyzing data on the scientific workforce is necessary to develop policies that ensure an equitable, vibrant research enterprise. The federal government supports many data collection efforts and databases that are invaluable sources of information on research funding and the workforce. For example, the National Science Foundation's (NSF) National Center for Science and Engineering Statistics (NCSES) coordinates and conducts a series of surveys of universities, scientists and engineers, and trainees to provide snapshots of the science and engineering workforce (https://www.nsf.gov/statistics/). The National Institutes of Health (NIH) publishes a wealth of information on each grant the agency has awarded since 1985 (https://report.nih.gov/). In addition, non-federal databases also provide necessary information about the research workforce, including those from the Association of American Medical Colleges (https://www.aamc.org/data-reports/faculty-institutions/faculty-roster) and the Coalition for Next Generation Life Sciences (http://nglscoalition.org).
Despite this volume of information, there are still dimensions to the research enterprise that are poorly understood. For example, we do not have a firm grasp on the number of postdocs working in biomedical research (1)(2)(3)(4). For this reason and others, multiple reports on the research enterprise, and the biomedical research enterprise in particular, recommended the federal government improve data collection on the research workforce (5). Recently, universities began collecting and publishing the career outcomes of their Ph.D. and postdoc alumni (6)(7)(8)(9). These efforts satisfy some of the recommendations from prior reports calling for more extensive data collection on the research workforce, but they do not solve the entire problem.
In addition to collecting more data, it is important to fully interrogate the data that have already been collected. My intention was to relate trainee numbers at the institution level in the GSS dataset with grant awards and direct costs to institutions in the NIH dataset. The unique codes assigned to institutions differ between the two datasets, so the best way to connect them was through institution names. However, even within each dataset, universities were not consistently named.

GSS consolidation
In the GSS dataset between 1985 and 2016, I first isolated the "institution_id" and "Institution_Name" fields. The "institution_id" is a unique code assigned to institutions and is useful for identifying institutions that had multiple names over time. Standard Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) formulas and processes were used to identify institutions with the same ID but different names. Next, a visual scan of institution names was conducted to identify institutions that were the same but not identified in prior steps. This step was informed, in part, by previous work (10). A list of GSS institutions was arrayed vertically with groups of aliases for institutions (hereafter "associated names") appearing horizontally next to the main institution name (hereafter "common name") (Table S1).
Between 1985 and 2016, the GSS had 905 unique entries for institutions in the "Institution_Name" field, which were consolidated to 652 unique common names (Table S1).
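The first consolidation step, finding records that share an "institution_id" but carry different "Institution_Name" values, can be sketched in Python rather than Excel formulas. This is a minimal illustration with hypothetical IDs and names, not the actual GSS records:

```python
from collections import defaultdict

# Hypothetical (institution_id, Institution_Name) pairs; real GSS IDs
# and names will differ.
gss_rows = [
    ("1001", "U of Example"),
    ("1001", "University of Example"),
    ("1002", "Example State University"),
]

# Group all names observed under each institution_id.
names_by_id = defaultdict(set)
for inst_id, name in gss_rows:
    names_by_id[inst_id].add(name)

# IDs with more than one name are candidates for consolidation.
multi_named = {i: sorted(n) for i, n in names_by_id.items() if len(n) > 1}
```

Each flagged ID yields a group of associated names that can then be reviewed by eye, mirroring the visual-scan step described above.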

NIH consolidation
To consolidate data within the NIH dataset, I started from the merged list of institutions published previously (10). Then, similar to the GSS data, I used standard Microsoft Excel functions to identify institutions with different "ORG_NAME" values but the same "ORG_DUNS." The "ORG_DUNS" value functions analogously to the "institution_id" in the GSS. I then conducted a visual scan of institution names, combined with internet searches, to determine the university affiliations of hospitals, research centers, and institutions, as well as to identify name changes and other rearrangements. The remaining 1,943 institutions had only a single name in the "ORG_NAME" field in the datasets sampled here. This left 2,149 unique common names in the consolidated NIH database (Table S1).
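The output of this step is, in effect, a mapping from each associated name to a single common name per "ORG_DUNS". A minimal sketch, again with invented DUNS numbers and names standing in for the real NIH records:

```python
from collections import defaultdict

# Hypothetical (ORG_DUNS, ORG_NAME) pairs; real values differ.
nih_rows = [
    ("555555555", "EXAMPLE UNIVERSITY"),
    ("555555555", "EXAMPLE UNIV"),
    ("666666666", "SAMPLE COLLEGE"),
]

# Collect the distinct names seen under each DUNS, preserving order.
names_by_duns = defaultdict(list)
for duns, name in nih_rows:
    if name not in names_by_duns[duns]:
        names_by_duns[duns].append(name)

# Treat the first-seen name as the common name; the rest become
# associated names that map onto it. (In the paper, the common name
# was chosen by review, not by order of appearance.)
alias_to_common = {}
for names in names_by_duns.values():
    for alias in names[1:]:
        alias_to_common[alias] = names[0]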

Crosswalk
To make a crosswalk between the NIH and GSS data, I first attempted to match NIH and GSS common names using standard Microsoft Excel functions. This met with limited success, primarily because of simple differences in naming conventions. I next conducted a visual scan of the datasets to match institutions across them. The NIH and GSS institution arrays were then combined, with groups of associated names for institutions appearing horizontally next to the common name (Table S1). Of the 652 entries in the consolidated GSS database, 375 entries matched a common name in the consolidated NIH database.
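Many of the naming-convention mismatches (capitalization, punctuation) can be neutralized before matching. The sketch below shows this idea with hypothetical institution names; it is not the matching procedure actually used, which relied on Excel functions plus visual review:

```python
import re

def normalize(name):
    # Lowercase and strip punctuation so superficial differences in
    # naming conventions do not block a match.
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

# Hypothetical consolidated common names from each database.
gss_common = ["University of Example", "Example State University"]
nih_common = ["UNIVERSITY OF EXAMPLE", "Sample College"]

nih_by_norm = {normalize(n): n for n in nih_common}
crosswalk = {
    g: nih_by_norm[normalize(g)]
    for g in gss_common
    if normalize(g) in nih_by_norm
}
```

Names that survive normalization without a match still require the manual review described above.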

Macro creation
I chose to create macros in Microsoft Excel to accelerate the consolidation of university names in the GSS and NIH datasets. Because all of the data cited here are downloadable as Excel-readable spreadsheets, the macros could be applied to the spreadsheets prior to additional analyses or conversion to other programs.
Once I had identified all associated names for a university in the GSS dataset above, I created the NSF_University_update macro that searches for each instance of an associated name and replaces it with the common name (Table S2). When the macro is run on a GSS dataset with the "Institution_Name" column, all associated names for an institution are replaced with the institution's common name, as defined in Table S1.
Similarly, once I had identified all associated names for a university in the NIH dataset, I created the NIH_university_update macro that searches for each instance of an associated name and replaces it with the common name (Table S3). When the macro is run on an NIH dataset, all associated names in the "ORG_NAME" field are replaced with the institution's common name, as defined in Table S1.
I then created the NSF_NIH_university_crosswalk macro that searches for each instance of the common name used for the NIH dataset and replaces it with the common name used in the GSS dataset (Table S4). When the macro is run on an NIH dataset, all common names in the "ORG_NAME" field that have a match in the GSS dataset are replaced with the common name used in the GSS dataset. After running these three macros, all entries in NIH RePORTER and the GSS between 1985 and 2016 can be compared based on the institution's common name.
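The core operation of all three macros is the same find-and-replace over a name column. A minimal Python equivalent, using hypothetical alias pairs in place of the mappings in Tables S2-S4:

```python
# Hypothetical associated-name -> common-name pairs; the real pairs
# come from Table S1 and are encoded in the Excel macros.
associated_to_common = {
    "U of Example": "University of Example",
    "Example Univ.": "University of Example",
}

# A name column as it might appear in the GSS ("Institution_Name")
# or NIH ("ORG_NAME") spreadsheet.
institution_column = ["U of Example", "Example Univ.", "Other College"]

# Replace each associated name with its common name; unmatched
# names pass through unchanged.
updated = [associated_to_common.get(n, n) for n in institution_column]
```

The Excel macros perform the same substitution in place on the spreadsheet column rather than producing a new list.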

Isolating biological and biomedical trainees
The NIH primarily supports biological and biomedical scientists, while the GSS surveys the numbers and demographics of trainees from all science and engineering fields. To better understand the relationship between NIH funding and biological and biomedical trainees, I selected GSS data using the gss_code values that best correspond to biological and biomedical trainees: all 600 series codes (biological and biomedical sciences) and code 950 (neurobiology and neuroscience). I also included all 700 series codes (health sciences) because research in these departments is largely funded by NIH grants, and the research done in these departments often overlaps with that in the biological and biomedical departments.
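This selection rule is easy to state as a predicate on gss_code. A sketch with made-up trainee counts (the code ranges follow the selection described above; the data rows are hypothetical):

```python
def is_biomedical(gss_code):
    # 600 series: biological and biomedical sciences
    # 700 series: health sciences (largely NIH-funded)
    # 950: neurobiology and neuroscience
    return 600 <= gss_code <= 699 or 700 <= gss_code <= 799 or gss_code == 950

# Hypothetical (gss_code, trainee count) rows for one institution.
rows = [(601, 120), (950, 45), (410, 300), (712, 60)]

biomed_trainees = sum(count for code, count in rows if is_biomedical(code))
```

Code 410 (outside the selected ranges) is excluded, so only the 600/700 series and 950 rows contribute to the total.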

Results
The NIH Data Book (https://report.nih.gov/nihdatabook/) and other resources from the NIH give some indication of trainee support, but these data are not broken down to the level of individual institutions. The GSS has information on the numbers and demographics of graduate students and postdocs at individual institutions, but no information on how much training support each institution receives. Applying the macros designed here to data from the NIH and the GSS makes it possible to directly compare information from both datasets.
To test the macros, I applied them to the GSS and NIH databases and then compared trainee numbers with NIH training support at ten institutions (Table 1). The primary mechanisms of direct trainee support provided by the NIH are the F31, F32, and T32 grant awards. The ten institutions analyzed here received between 23 and 308 F31, F32, and T32 training awards from the NIH in 2016, totaling between $3 million and nearly $53 million (Table 1).
The consolidation of institutions described here allows for a more direct comparison of data between the GSS and the NIH datasets and adds a new dimension of analysis. For example, across these ten institutions, trainees per F31, F32, and T32 award ranged from 15.5 (Stanford University) to 82.2 (University of Florida) (Table 1). These data indicate a wide disparity among these ten institutions in how many training-specific awards they received on a per-trainee basis.
T32 awards can support multiple trainees. As such, the awards per trainee discussed above may not give a complete picture of trainee support at an institution. Instead, I analyzed the amount of F31, F32, and T32 funding each institution received per trainee. In 2016, the institutions analyzed here received from $1,620 per trainee (University of Florida) to $9,742 per trainee (Stanford University; Table 1). These data give a better sense of the training-specific resources individual institutions received in 2016.
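Both metrics are simple ratios over the merged data. The sketch below uses round, hypothetical tallies for a single institution; the real 2016 values appear in Table 1:

```python
# Hypothetical per-institution tallies after the crosswalk; Table 1
# reports the actual 2016 values for the ten institutions analyzed.
trainees = 1200          # graduate students + postdocs (GSS)
awards = 78              # F31 + F32 + T32 awards (NIH)
funding = 7_500_000.0    # total F31/F32/T32 dollars (NIH)

trainees_per_award = trainees / awards     # metric behind Fig. 1A
dollars_per_trainee = funding / trainees   # metric behind Fig. 1B
```

The two ratios can disagree in ranking institutions because a single T32 award may fund many trainees, which is why the dollars-per-trainee view is reported alongside awards per trainee.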
Additionally, all research institutions supporting graduate students or postdocs can be visualized with these metrics. Despite the seemingly large differences in trainees per training award, the variance across the research enterprise is much greater than that observed in the ten institutions analyzed above. Specifically, institutions with fewer training awards tend to have more trainees per award and show greater variation in trainees per award, even among institutions with the same number of awards (Fig. 1A).
Similarly, the ten institutions analyzed above received more training money per trainee than most of the research enterprise (Fig. 1B). These data indicate that, despite training more graduate students and postdocs, these institutions draw training dollars per trainee well above the average institution.

Discussion
Here I present three Microsoft Excel macros designed to make the NCSES's GSS data on trainee populations directly relatable to grant award data published by the NIH. With this crosswalk, researchers can mine these databases to better understand how various policies shaped the biomedical research workforce by relating changes in trainee numbers, funding, and awards made, among other measures. Because the crosswalk uses universities and institutions to relate the databases, it is also possible to determine how specific institutions responded to new policies.
Reports and analyses of the biomedical research enterprise often call for an expansion of publicly available data on the workforce. Some recent progress has been made as universities publish the career outcomes of their Ph.D. and postdoc alumni (6)(7)(8)(9). But much more remains to be done. The federal government collects and publishes an incredible amount of information about the research workforce and the money provided to it. By connecting the GSS and NIH databases, researchers can mine new insights from these data without expanding data collection efforts. Furthermore, beyond connecting the GSS and NIH databases, the work here provides a platform to connect other databases that include university names in their data collection efforts. This includes other NCSES databases, like the Survey of Doctoral Recipients, other federal databases, and possibly those of non-governmental organizations. Relating this wide variety of databases will improve our understanding of the research enterprise.
Fig. 1: For both graphs, N = 176 institutions.