Publication, Funding, and Experimental Data in Support of Human Reference Atlas Construction and Usage

Experts from 18 consortia are collaborating on the Human Reference Atlas (HRA) which aims to map the 37 trillion cells in the healthy human body. Information relevant for HRA construction and usage is held by experts (clinicians, pathologists, anatomists, single-cell experts), published in scholarly papers, and captured in experimental data. However, these data sources use different metadata schemes and cannot be cross-searched efficiently. This paper documents the compilation of a dataset, called HRAlit, that links the 136 HRA v1.4 digital objects (31 organs with 2,689 anatomical structures, 590 cell types, 1,770 biomarkers) to 583,117 experts; 7,103,180 publications; 896,680 funded projects, and 1,816 experimental datasets. The resulting HRAlit represents 23 tables with 21,704,001 records including 7 junction tables with 13,042,188 relationships. We demonstrate how HRAlit can be mined to identify leading experts, major papers, funding trends, or alignment with existing ontologies in support of systematic HRA construction and usage. Data and code are at https://github.com/cns-iu/hra-literature.


Introduction
Constructing an atlas of the healthy human body is a massive undertaking due to the multiscale, biological complexity of human physiology.Since March 2020, international experts funded by the National Institutes of Health and/or supported by the Human Cell Atlas have been collaborating on the construction of a Human Reference Atlas (HRA) 1 .The 5th release of the HRA was published in June 2023 and comprises 31 organs with 2,689 unique anatomical structures, 590 unique cell types, 1,770 unique biomarkers linked to 31 Anatomical Structures, Cell Types, plus Biomarkers (ASCT+B) tables, 21 two-dimensional functional tissue units (FTU), and 65 three-dimensional, anatomically correct reference organs 2 .A total of 101 experts created and 99 experts reviewed (158 unique experts) the HRA digital objects across all releases and compiled 420 papers with DOIs that provide scholarly evidence for the anatomical structures, cell types, and biomarkers in the 31 ASCT+B tables.As the HRA grows in the number of organs and data types it captures, it becomes important to use data-driven decision making to ensure systematic and efficient collaboration of scholars from different areas of research and development; 3 literature).It was constructed using these seven steps: (1) HRA data across five releases, including digital objects and experts, were downloaded, and the data from 5th release HRA, which includes publication references and ontology terms, was also acquired.(2) The names of all 31 organs covered in the 5th release HRA were used in a PubMed literature search, see Methods.(3) Dataset metadata and donor metadata (e.g., sex, age) for healthy human experimental data from HuBMAP, CZ CELLxGENE (CxG), and GTEx was downloaded.(4) Resulting publications were flagged as associated with HRA or experimental datasets, or listed in CellMarker for human cell makers and saved in the "hralit_publication" table together with their PMID, DOI, publication year, article title, and journal title.Citation and h-index metrics for publications were calculated from Web of Science.(5) All HRA experts and authors associated with selected publications, with ORCIDs listed in PubMed, were identified and saved in the "hralit_author" table that records ORCID, author first name, last name, earliest publication year, career age (see Methods), and total number of funded projects.Citation for publications and h-index metrics for authors were computed from Web of Science citation data.(6) Institution metadata was cleaned using OpenAlex.The cleaned institutions associated with author ORCIDs in the "hralit_author" table were stored in the "hralit_institution" table.(7) Funded projects listed in PubMed were stored in the "hralit_funding" table and the relationships among PMIDs, funding IDs, and funders sourced from PubMed were extracted and linked to the "hralit_pub_funding_funder" table.Funding organizations were cleaned using funder metadata and the relationships among PMID, funding IDs, and funders, as provided by OpenAlex.Funders associated with PMIDs and funded projects in the "hralit_pub_funding_funder" table were stored in the "hralit_funder_cleaned" table.   2 shows the number of experts, publications, funding (converted to U.S. dollars, see Methods), and experimental data for the 31 organs in the 5th release of the HRA (for details, see Table S2).Expert count refers to the number of experts (authors and HRA experts) (hralit_author).
Citation count and h-index count were retrieved from the Web of Science core collection.Note that average citation counts differ across organs-analogous to differences in citation counts that are typical in scholarly disciplines 16,17 .For example, research on main bronchus receives comparatively little funding and generates about 8 papers per year that have a comparatively low number of citations-about 3 citations on average, with the highest cited paper (entitled "Induced bronchus-associated lymphoid tissue serves as a general priming site for T cells and is maintained by dendritic cells") having 183 citations.In contrast, brain research enjoys a total of 20 billion U.S. dollar in funding as acknowledged in PubMed, generates 4,825 papers each year, and there are 32 citations per paper on average with the highest cited paper entitled "Analysis of relative gene expression data using real-time quantitative PCR and the 2(-Delta Delta C(T)) Method" having 106,059 citations.

Identify HRA Reviewers
The 5th release of the HRA comprises 136 digital objects (DO) including 31 ASCT+B Tables plus a crosswalk, 65 3D reference objects plus two united files (one for male and another for the female body; each comprising all organs) and a crosswalk, 21 2D reference objects plus a crosswalk, 13 Organ Mapping Antibody Panels (OMAP) objects, and one Vascular Common Coordinate Framework (VCCF) object.Across all five HRA releases, there are 295 digital objects.All of these DOs need to be reviewed by domain experts using a process analogous to reviewing scholarly papers before publication.The HRAlit database was used to identify experts with highly cited papers, funding, and expertise for a certain organ.Title and MeSH terms assigned to 31 organs associated with papers published by an expert author were used to predict the expertise subjects (e.g., anatomy and pathology expertise is needed to review anatomical structures while single cell and chemistry expertise is needed to review cell types and/or biomarkers in the ASCT+B Tables).A total of 583,117 unique experts with ORCIDs were retrieved, including both HRA experts and authors associated with all selected publications.Many of them have expertise in multiple organs, and those with high citation counts and many funded projects were invited to serve as reviewers.Exactly 34 experts agreed, served as reviewers, and are/will be acknowledged in future releases of the HRA.

Prioritize Atlas Construction
Single-cell data experiments are expensive-acquiring gene or protein expression data costs about 30 cents per cell.A typical tissue sample has about 10,000 cells and costs about 300,000 U.S. dollars on average.Many datasets and different assay types are required to construct an atlas.The HRAlit database can be used to identify the number of existing datasets per organ and the likely future number of datasets based on available funding.Plus, the number of available experts matters as human expertise is key for atlas construction as are scholarly publications that provide detailed documentation for high quality datasets.Table 2 lists the types of DOs and HRAlit data types per organ and it is actively used to prioritize HRA construction.We also list the availability of OMAP-aligned spatial proteomics data 18 that can be easily mapped to the HRA.Also listed is the availability of single cell bulk data annotation tools such as Azimuth 19 , Cell Typist 20 recently renamed to CellHint 21 , or popV used in Tabula Sapiens 22 .

Linking the HRA to Existing Ontologies
The 5th release of the HRA covers (1) 2,691 HRA anatomical structures that link to

Enhancing Diverse Perspectives
The HRA is meant to present healthy humans in an inclusive manner.Ideally, the demographics of experts engaged and human tissue data used matched the demographic of the world population.Fig. 4 and Table S3 report gender and (career) age via population pyramids, ethnic composition of experts and donor data, and geolocation for experts engaged in and datasets used for HRA construction.CellMarker, and experimental datasets for the 31 organs in the 5th release HRA. 5. Donor metadata for experimental datasets served via the HuBMAP, GTEx, and CZ CELLxGENE data portals.6. World population data for 2023 from PopulationPyramid portal downloaded from https://www.populationpyramid.net/api/pp/900/2023/?csv=true.The gender of authors was predicted using the gender_guesser package, which tags gender based on first names.The race of authors was determined using the ethnicolr package, based on both first and last names.Career age was estimated based on the publication year of an author's first paper, as detailed in the Methods section.

Discussion
The paper presented the compilation of a unique database that considerably extends the scope and value of the HRA.Specifically, we linked HRA data to publications, experts and their institutions, funded projects, and funders, experimental data and ontologies.The value of the HRAlit database was showcased in six applications that aim to improve the quality, coverage, utility, and inclusiveness of the HRA.All code used to compile the data was made available on GitHub in October 2023, and will be re-run for each HRA release every six months.There are several known data limitations.First, some organs have multiple names but only their official ontology term was used to retrieve PubMed publications (e.g., "large intestine" but not "colon" was used as a query term) and no synonyms were used.Second, the CADRE provided WoS database is no longer being updated; the current copy covers years 1898 to 2023.Third, we plan to add data from EBI's Single Cell Expression Atlas (https://www.ebi.ac.uk/gxa/sc) 24,25 .Cellpedia (https://cellpedia.org/en/cell-therapy)and CellKb (https://www.cellkb.com)do not currently offer APIs that support data download but would add value.Going forward, we will link the HRAlit database to the Scalable Precision Medicine Open Knowledge Engine (SPOKE) biomedical knowledge graph 26,27 that interlinks 27 million concepts (21 types, including genes, proteins, diseases) via 53 million semantically meaningful relationships (55 types, including TREATS_CtD and CONTRAINDICATES_CcD connecting compounds and diseases) downloaded from 41 databases that use 11 ontologies (including Uberon, CL).This expansion of HRAlit will support novel queries.For example, SPOKE has Gene-encodes-Protein assertions that can be mined to compare gene biomarkers from single cell bulk assays with protein biomarkers from spatial assays.SPOKE also links molecular to cellular entities to pharmacology and clinical practice making it possible to run knowledge graph queries in support of drug discovery and precision health.

Data
An overview of the HRAlit database is given in Fig. 1 and in the entity relationship diagram at https://github.com/cns-iu/hra-literature).

Human Reference Atlas
HRA digital objects for the HRA 5th release (v1.4) are listed at https://hubmapconsortium.github.io/ccf-releases/v1.4/docs and include 31 ASCT+B Tables plus 1 crosswalk table, 21 Functional Tissue Units (FTUs), 13 Organ Mapping Antibody Panels (OMAPs), 1 Vascular Geometries table, as well as 65 3D Reference Organs plus 3 digital objects for two united and 1 crosswalk from the 3D organs to the ASCT+B tables for a total of 136 digital objects.All 295 HRA digital objects across all HRA releases (starting with v1.0 in March 2021 to v1.4 in June 2023) can be accessed via the CCF-HRA releases 28 , and extracted as a HRA metadata table.The 5th release ASCT+B tables are available at https://ccfontology.hubmapconsortium.org/v2.2.1/ccf-asctb-all.json.The linkages among anatomical structures, cell types, and biomarkers are extracted from the 5th release ASCT+B tables.In total, there are 4,835 part_of edges between 4,384 unique anatomical structures, 7,879 located_in edges between 691 unique cell types and these anatomical structures, 620 is_a linkages between cell types, and 6,812 characterizes edges between 2,927 unique biomarkers and the unique cell types.HRA Experts.Information about the creators and reviewers of HRA across all versions is selected based on the HRA metadata table.Resulting table "hralit_creator" and "hralit_reviewer" includes ORCIDs, first names, and last names, associated digital objects.In total, there are 101 unique creators and 99 unique reviewers for a total of 158 unique experts.HRA References.Each publication in the 5th release ASCT+B tables is characterized by an ID, DOI, accompanying notes, categorization type, and an associated organ.Within the HRA reference framework, publication references are classified as "general_publication" that supports organ-specific ASCT-B tables and are listed in the table header and "reference" papers that support combinations of table row-specific Anatomical Structures (AS), Cell Types (CT), and Biomarkers (B).In total, the 5th release ASCT+B tables has 83 general references and 1,070 specific references for a total of 1,057 unique publication references.420 of these publications have DOIs and 391 are in PubMed.HRA Ontologies.Anatomical structures, cell types, and biomarkers in the HRA are linked to their counterparts in existing ontologies; for 621 terms there do not exist entries in ontologies and HRA TEMP IDs are assigned to facilitate ID-based mapping while these new terms are added to existing ontologies.Ontology IDs were extracted via 5th release ASCT+B tables (https://ccfontology.hubmapconsortium.org/v2.2.1/ccf-asctb-all.json).In total, 1,246 anatomical structures are linked to the Foundational Model of Anatomy (FMA) and 1,443 to the Uber-anatomy ontology (Uberon), 127 cell types are linked to entries in the Cell Ontology (CL) and 127 Provisional Cell Ontology (PCL), 1,669 biomarkers are linked to the Human Genome Nomenclature Committee (HGNC), and 101 are linked to another biomarker ontology (e.g., UniProt).Note that there are issue requests for the 621 ASCTB-TEMP that do not yet exist in CL; these will be added over the coming 1-2 years to properly represent healthy human cells in CL.HRA Diversity survey data is available at https://github.com/cns-iu/hra-literature.

CellMarker
CellMarker 2.0 23 is a manually curated dataset of biomarkers for distinguishing different cell types in different anatomical structures in human and mouse derived from and linked to over 100,000 publication references.For human, CellMarker covers 13,605 biomarkers, 467 cell types, and 158 anatomical structures linked to ontologies such as Uberon, CL, NCBI Gene, and UniProt.Data was downloaded manually as a file from the CellMarker portal at http://xteam.xbio.top/CellMarker,see counts in Table 3. Publications associated with CellMarker are tagged as "cellmarker" in HRAlit and are stored in the "hralit_other_refs" table, including 1,622 publications with DOIs.

Publications
PubMed (https://pubmed.ncbi.nlm.nih.gov)provides free access to publications encompassing various health and life sciences disciplines.Daily updates of PubMed can be downloaded via API (https://www.ncbi.nlm.nih.gov/books/NBK25497/).We queried a local, daily updated database on September 12, 2023 using 31 organ names as query keywords.7,103,180 unique publications covering years 1900 to 2023 (see Table S2 for the number of papers per publication year) resulted and the HRAlit database covers their identifiers (PMIDs and DOIs), article titles, journal titles, publication year, authors, among others.Web of Science (WoS) (https://www.webofscience.com/wos)provides citations for publications.WoS XML raw data was acquired by Indiana University Network Institution from Clarivate Analytics, and the Cyberinfrastructure for Network Science Center has access to this data under a Data Custodian user agreement the Collaborative Archive & Data Research Environment (CADRE) 30 .Based on the linkages between WoS IDs and PMIDs provided by WoS, citations are counted as the total number of papers that cited a publication.

Experts
Author details were extracted from PubMed using unique identifiers for authors, known as ORCID IDs.There are 583,117 authors with ORCID identifiers that authored 483,998 of the 7,103,180 publications.The HRAlit database covers for each author: ORCID IDs, first name, last name, publication year of author's first publication, career age, and the number of unique funded projects assigned to publications authored by an author, which is stored in the "hralit_author" table.Additionally, HRAlit database also includes linkages associated with authors, with author-institutions linkages stored in the "hralit_author_institution" table, authorpublication linkages in the "hralit_publication_author" table, author-organ linkages in the "hralit_author_expertise" table.Based on the publication citations, h-index is also calculated through the linkage between PMIDs and ORCIDs.

Institutions
Author institutions provided by PubMed were cleaned using data from the OpenAlex database 31 on August 31, 2023.OpenAlex sources institution data from Crossref, PubMed, ROR, MAG, and various publisher websites, focusing on indexing institutions associated with authors.Accessible via the OpenAlex API (https://docs.openalex.org),each institution is uniquely identified using both an internal ID (e.g., "https://openalex.org/I2802101240") and a canonical external ID, called ROR ID.OpenAlex provides the institution name, institution type, and country code for each registered entity.We obtained 102,494 institution entities from OpenAlex, of which 26,235 institutions linked to 463,617 unique ORCIDs covering 79.50% of 583,117 unique ORCIDs, see "hralit_institution" table in the HRAlit database.

Funding
From the 7,103,180 PubMed publications, we identified 899,796 publications with at least one funded project number and acronym.In total, there are 896,680 funded projects associated with these publications, see "hralit_funding" table in the HRAlit database.
To get an understanding of how much funding is available for which of the 31 organs and for use in Fig. 2, we retrieved additional funding information from six funders in four countries and the European Union: Australian Research Data Commons (ARDC) using API (https://researchdata.edu.au/api/v2.0/registry/activities), the Canadian Institutes of Health Research (CIHR) portal (https://open.canada.ca/data/en/dataset/),European Commission (EC) portal (https://ec.europa.eu/info/fundingtenders/opportunities/portal/screen/support/apis_old),Grants-in-Aid for Scientific Research (KAKEN) using API (https://support.nii.ac.jp/en/kaken/api/api_outline),National Institutes of Health (NIH) using API (https://api.reporter.nih.gov/),National Science Foundation (NSF) portal (https://catalog.data.gov/dataset/nsf-awardsearch-web-api-3f6f4),and UK Research and Innovation (UKRI).The final data includes funder, project number, project title, total amount, fiscal year.Specifically, the total amount is converted to United States dollars using the latest annual exchange rate sourced from World Bank (https://databank.worldbank.org/source/global-economic-monitor-(gem)#).Funders 51,522 funders, detailed by the name of the grant agency and its location, were gathered from PubMed databases.For further processing, funder data was cleaned utilizing the OpenAlex database.Funder data in OpenAlex originates from Crossref, linked with both funding IDs and PMIDs.OpenAlex offers details such as the internal ID (e.g., "https://openalex.org/F4320309241"),funder name, and country code.We obtained 14,327 funder entities from OpenAlex on August 31, 2023, of which 6,427 associated with the PMIDs and funded projected numbers in the "hralit_funding" table were subsequently integrated into the "hralit_pub_funding_funder" table in HRAlit database.

Experimental Data
Tissue dataset metadata was downloaded from three public data portals on Aug 7, 2023.HuBMAP Portal (https://portal.hubmapconsortium.org):The HuBMAP SmartAPI was used to download 1,610 tissue samples from 193 donors with 1,445 associated datasets using https://entity.api.hubmapconsortium.org/datasets/prov-info.Note that there are many more datasets that are currently in preprocessing and quality control stages and will soon become available; with member-login, there are 411,200 tissue samples from 397 donors with 3,628 associated datasets.CZ CELLxGENE Portal (https://cellxgene.cziscience.com):Using the cellxgenecensus API at https://chanzuckerberg.github.io/cellxgene-census/index.html,593 datasets from 5,328 donors, linked to 58 publications with DOIs, were downloaded.tables are established, ensuring data integrity and enforcing referential constraints.(4) Export database.HRAlit database is exported in both SQL and CSV formats.

Diversity -Predict Gender, Ethnicity
Gender is predicted using the "gender_guesser" Python module (https://pypi.org/project/gender-guesser), which classifies authors based on first names into categories including "male", "female", "andy", "mostly_male", "mostly_female", "unknown".For classifications other than "male" or "female" for 36 experts, gender was identified by examining photos found on Google using full names with affiliated institutions from ORCID, with exception of one expert named Li Yao, who lacked affiliation information.Ethnicity is determined using the "ethnicolr" Python module, which is trained on U.S. Census data and other large-scale nameethnicity datasets.Based on both first names and last names, the module calculates probabilities for five ethnic categories: Asian, Black, Hispanic, White, and Other.

Figure 1 .
Figure 1.Overview of the HRAlit Database.Seven data types are interlinked by HuBMAP IDs of digital objects, PMIDs, DOIs, ORCIDs, funding IDs, dataset IDs, donor IDs, funder IDs, institution IDs.Note that WoS data is access restricted to CADRE affiliates; it cannot be shared publicly via the HRAlit.Additional funding data for six agencies Australian Research Data Commons (ARDC) , the Canadian Institutes of Health Research (CIHR) , European Commission (EC) , Grants-in-Aid for Scientific Research (KAKEN), National Institutes of Health (NIH), and National Science Foundation (NSF) was downloaded for Fig 2c.but cannot be linked to HRAlit due to the differences in grant IDs used in PubMed and application IDs and funding project numbers in the funder datasets; it was not included in the HRAlit database.

Figure 2 .
Figure 2. HRAlit Database and additional funding data.a. Experimental data, experts, funding, and publications for 31 organs in the HRA.Publication count refers to the number of publications (in hralit_publication).Dataset count refers to the number of experimental datasets per organ (in hralit_dataset).Funding count refers to the number of funded projects per organ that are cited by PubMed publications (in hralit_funding).Funder count refers to the number of funding agencies per organ that support the funding (in hralit_funder_cleaned).

Figure 4 .
Figure 4. Comparison of HRA, publication, and experimental to world population data.a. Population pyramids by age group of survey respondents and tissue data donors in comparison to world population.Population pyramids by career age for HRA experts and publication authors.b.Ethnic composition of survey respondents, HRA tissue donors, HRA experts, paper authors, world population in percentages.c.Choropleth map showing the number of paper authors overlaid on a world map.

Table 1 . HRAlit data types, names, number of records, and source data.
,661,813 non-unique records in metadata tables and 13,042,188 non-unique records in junction tables for a total of 23 tables with 21,704,001 records.All datasets and code used are detailed in the Methods section.Intermediate datasets and code used to retrieve, clean, and interlink the data can be found on Reference Atlas (HRA), CZ CELLxGENE (CxG), the Genotype-Tissue Expression (GTEx), and Anatomical Structures, Cell Types, plus Biomarkers tables (ASCT+B tables).** Publications in "hralit_other_refs" and "hralit_asctb_publication" tables are included in the publication metadata table ("hralit_publication"). ***Authors in "hralit_creator" and "hralit_reviewer" tables are included in the author metadata table ("hralit_author").**** In the HRAlit database, there are 8

Providing Expert, Publication, and Experimental Data Evidence for the HRA HRA
authors and users need to understand what expertise, which publications, and what experimental data evidence exists for a specific organ.Fig.

Table 2 . Publication evidence, experts, institutions, funded projects, funders, and experimental data available for 31 organs
. Rows represent the organs from the 5th HRA release (bronchus is part of the lung).For each organ, we provide the number of HRA DO types and the number of publications, experts, institutions, funded projects and funders, together with the number of healthy human datasets available via the HuBMAP, GTEx, and CELLxGENE portals on Aug 1, 2023.
1,443 terms in Uberon and 1,246 in FMA; (2) the 590 cell types link to 127 terms in PCL, 461 in CL and 621 new ASCTB-TEMP that do not yet exist in ontologies, and 1 term in LMHA;(3) 1,770 biomarkers, such as HGNC, GNC, LOC, UNIPROT, LM, and PR, which associated with 1,608 genes, 460 proteins, and 1 lipids.CellMarker 2.0 23 is a manually curated database of experimentally supported gene and protein biomarkers for cell types in humans and mice.Here, we compare 31 organs covered in the HRA to those covered in the human part of CellMarker 2.0 and CZ CELLxGENE (see Table3).

Table 3 . Comparing HRA, CellMarker, CELLxGENE for 31 healthy human adult organs
. Note that HRA reports gene, protein, lipid, and metabolite biomarkers; CellMarker focuses on tissue-cell type-biomarker triples; CELLxGENE focuses on gene biomarkers.*Includeshealthy and diseased human.**Number of tissue-cell type pairs.***The total number of experts with ORCIDs.For HRA, this equals the DO creators and reviewers and publication reference authors.****The number of publications with DOIs or PMID.Compare Canonical

HRA Marker Genes with Computational CZI Genes in CellGuide CZ
CELLxGENE 9 recently added the 5th release of the HRA data to the CellGuide tool, see Fig. 3. CellGuide helps researchers understand what cells in the Cell Ontology (https://www.ebi.ac.uk/ols4/ontologies/cl) are in the experimental dataset corpus (blue coloured nodes) and which are not (grey nodes).Subtrees can be expanded to see descendant nodes.
cziscience.com/cellguide/CL_0000540. On Aug 3, 2023, there are 2,706 cell types in CL, interlinked by is_a child to parent relations; the root note of this cell typology is 'cell'.Plus, there are 791 cell types (613 for homo sapiens) in the CxG experimental data and 296 of these cell types are also in the 5th release ASCT+B Tables.Note that CellGuide only features cell types from the ASCT+B Tables that are connected to the CL typology-omitting the 621 ASCTB-TEMP and 1 term in the LungMAP Human Anatomy (LMHA) that do not yet exist in CL.A total of 588 cell types (127 in PCL and 461 in CL) are connected via CellGuide is valuable for improving the coverage and quality of ASCT+B Tables in future HRA iterations.ASCT+B Table authors are provided with a listing of cell types and gene biomarkers seen in healthy adult data and are asked to consider inclusion of these cell types and biomarkers in future table iterations.