Information silos distort biomedical research

Information silos have been an oft-maligned feature of scientific research for introducing a bias towards knowledge that is produced within a scientist’s own community. The vastness of the scientific literature has been commonly blamed for this phenomenon, despite recent improvements in information retrieval and text mining. Its actual negative impact on scientific progress, however, has never been quantified. This analysis attempts to do so by exploring its effects on biomedical discovery, particularly in the discovery of relations between diseases, genes and chemical compounds. Results indicate that the probability that two scientific facts will enable the discovery of a new fact depends on how far apart these two facts were published within the scientific landscape. In particular, the probability decreases exponentially with the citation distance. Thus, the direction of scientific progress is distorted based on the location in which each scientific fact is published, representing a path-dependent bias in which originally closely-located discoveries drive the sequence of future discoveries. To counter this bias, scientists should open the scope of their scientific work with modern computational approaches.


Introduction
The wide communication of scientific discoveries across the scientific community is an essential element of scientific research. Scientific silos have long been bemoaned for hindering this process (e.g., Leischow et al., 2008; Vodovotz & An, 2013; Törmä, 2019) by introducing a bias towards knowledge that is produced within a scientist's own community. By analogy with corporate knowledge silos, at least three aspects would define them: (1) enormous growth in the knowledge available to scientists, (2) organization of scientists into communities and (3) slowing of the propagation of scientific knowledge between those communities. Regarding the first aspect, the growth of information available for scientific research (Larsen & von Ins, 2010; Bornmann & Mutz, 2015) represents a challenge for individual scientists as information seekers. In a perfect world, scientists would possess complete knowledge of all existing scientific information and select their research goals accordingly. Abundance of information, however, can represent its own "resource curse" challenge. One could paraphrase the famous corporate knowledge-management adage (Sieloff, 1999) by saying: if only science knew what science knows. In this respect, the field of literature-based discovery (LBD) has propounded the existence of "undiscovered public knowledge" concerning facts that have never been put together because of the disparate venues in which they were published (Swanson, 1986; Bekhuis, 2006; Thilakaratne et al., 2019). Thus, there is a recognition that, given the sheer abundance of existing scientific knowledge, the milieu in which a discovery is published influences its later use by the scientific community.
With respect to the second aspect, it has been shown that scientific publications are anchored around communities of scientists (Bruggeman et al., 2012; Shia et al., 2015; Fortunato et al., 2018), which go beyond traditional scientific communities (e.g. university departments, scientific organizations) and represent a self-organizing process. This process might be encouraged by an institutional bias against interdisciplinary research (Bromhan & Dinnage, 2016; Baumwol et al., 2011), which would hamper collaboration across communities, despite recent trends towards fostering interdisciplinary research in systems and translational sciences (Luke et al., 2015; Auffray et al., 2009). It could also be a consequence of human cognitive limitations, namely scientists' bounded capacity to learn and produce new knowledge, and a response to an increasingly complex scientific landscape (Rodriguez-Esteban & Loging, 2013).
The third aspect, and the focus of this study, relates to the negative impact that scientific silos ultimately have on scientific progress. The existence of silos would entail that intra-silo information exchange is more frequent and faster than inter-silo exchange. This would increase the likelihood of certain discoveries based on facts published within a silo, to the detriment of discoveries based on facts coming from different silos. Because new discoveries feed on past discoveries in a path-dependent manner (Soler et al., 2015; Tambolo, 2017; Heimeriks & Boschma, 2014), this dynamic could affect the long-term outcome of scientific research.

Research Objective
While siloization, and solutions that try to address it, have been a recurrent topic of scientific debate, no effort has been made to date to quantify its negative impact on scientific progress, particularly its effect on the slowdown in the propagation of scientific facts, leading to the delay of certain discoveries and to the acceleration of others. This first attempt focuses on measuring the propagation of scientific facts about relations between compounds, genes and diseases, which are of broad interest in biomedical discovery, including clinical, pharmaceutical and translational research. Because defining silos is challenging, a surrogate distance measure, the citation distance, is used to represent the separation between publications within the scientific landscape. This measure would be related to the likelihood that publications belong to the same silo. Results of the analysis show that the citation distance between two published facts influences the probability that they will lead to a new discovery and thus signal the importance that knowledge silos (or, more broadly, the large-scale structure of relations between scientific publications) have in distorting scientific progress.

Methods
Scientific discovery can be modeled as a process in which facts are progressively connected to each other, thereby building growing networks in which the discovery of new facts is connected to already discovered facts (Cokol et al., 2005; Rzhetsky et al., 2015). The scientific discovery model employed in this study is inspired by the ABC model used in literature-based discovery (LBD) (Smalheiser, 2012; Thilakaratne et al., 2019) and is based on undirected networks of up to 3 nodes (A, B and C). The nodes are particular elements that are the focus of research and the edges are relations between those elements that have been reported in scientific publications. These networks are built sequentially over time: the edge AB is associated with the relation that is published first, the edge BC with the second one, and the edge AC with the third one. The nodes are labeled A, B or C according to this time sequence. At any given point in time, and based on the existing published literature, there are networks with 1, 2 and 3 edges. For networks with all 3 edges, we say that AB and BC enabled the discovery of AC by virtue of precedence, even if there is no direct evidence of that. AB and BC are considered "enabling facts" and AC, a "new discovery." Networks with 2 edges comprise potentially enabling pairs of facts (i.e. AB and BC), which could enable a new discovery AC in the future.
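The labeling rules above can be illustrated with a minimal sketch. It is not the code from the study's repository: the function name `build_triads`, the `(type, name)` node encoding and the toy dates are assumptions made for illustration, and ties in publication date are simply skipped.

```python
from collections import defaultdict
from datetime import date
from itertools import combinations


def build_triads(facts):
    """Split pairs of facts sharing a node into closed and open ABC networks.

    `facts` maps frozenset({node_x, node_y}) to the first publication date of
    that relation. A node is a (type, name) tuple such as ("gene", "g1").
    """
    neighbors = defaultdict(set)
    for pair in facts:
        x, y = tuple(pair)
        neighbors[x].add(y)
        neighbors[y].add(x)

    closed, open_ = [], []
    for b, nbrs in neighbors.items():
        for x, z in combinations(sorted(nbrs), 2):
            # Each network needs one gene, one disease and one compound.
            if len({x[0], z[0], b[0]}) != 3:
                continue
            # The earlier of the two facts is AB, the later one is BC.
            (d_ab, a), (d_bc, c) = sorted(
                [(facts[frozenset((x, b))], x), (facts[frozenset((z, b))], z)]
            )
            d_ac = facts.get(frozenset((a, c)))
            if d_ac is None:
                open_.append((a, b, c, d_bc))          # two-edge network
            elif d_ac > d_bc:
                closed.append((a, b, c, d_bc, d_ac))   # AB and BC precede AC
            # Otherwise AC was published earlier, so this pair is not enabling.
    return closed, open_


# Toy data with hypothetical identifiers and dates.
facts = {
    frozenset({("compound", "c1"), ("gene", "g1")}): date(2010, 1, 1),
    frozenset({("gene", "g1"), ("disease", "d1")}): date(2011, 8, 1),
    frozenset({("compound", "c1"), ("disease", "d1")}): date(2013, 1, 7),
    frozenset({("gene", "g1"), ("disease", "d2")}): date(2015, 1, 1),
}
closed, open_ = build_triads(facts)
```

In this toy input, the triangle c1, g1, d1 yields one closed network with g1 as node B, while the pair of facts sharing g1 that involves d2 remains an open, potentially enabling pair.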
In a full, three-edge network, the time lapse for a new discovery is the time between the publication of BC and the publication of AC. In a two-edge network (i.e. AC does not exist), the time lapse is measured between publication of BC and the cut-off time (January 1, 2020). This is done because potentially enabling facts can still enable a new discovery at a future date. This is handled analogously to a Kaplan-Meier curve to avoid biases due to right-censoring. One-edge networks are not considered for this calculation.
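A minimal sketch of how right-censored time lapses might be handled follows. The text only states that the treatment is analogous to a Kaplan-Meier curve, so the estimator below, the function names and the conversion to years (365.25 days) are assumptions rather than the study's actual procedure.

```python
from datetime import date

CUTOFF = date(2020, 1, 1)  # cut-off time stated in the Methods


def time_lapse_years(d_bc, d_ac, cutoff=CUTOFF):
    """Return (duration in years, observed flag) for one pair of facts.

    `d_ac` is None for two-edge networks, which are censored at the cut-off.
    """
    observed = d_ac is not None and d_ac <= cutoff
    end = d_ac if observed else cutoff
    return (end - d_bc).days / 365.25, observed


def kaplan_meier_fraction(durations, observed, horizon):
    """Estimated fraction of pairs followed by a new discovery within `horizon`."""
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    i = 0
    while i < len(events) and events[i][0] <= horizon:
        t = events[i][0]
        discoveries = removed = 0
        # Group all pairs sharing the same time lapse.
        while i < len(events) and events[i][0] == t:
            discoveries += events[i][1]
            removed += 1
            i += 1
        survival *= 1 - discoveries / at_risk
        at_risk -= removed
    return 1 - survival


# Toy example: two observed discoveries and two censored pairs.
fraction = kaplan_meier_fraction([1, 2, 3, 4], [True, False, True, False], horizon=5)
```

In the toy example the naive fraction would be 2 out of 4, whereas the censoring-aware estimate is higher because pairs that reached the cut-off early leave the at-risk set instead of counting as failures.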
In this study, the three nodes of each network (A, B and C) comprise one gene, one disease and one compound.
Each edge is a relation (e.g. a gene-disease relation) linked to a specific publication in the database MEDLINE. Data about relations came from The Comparative Toxicogenomics Database (CTD) (Davis et al., 2019), which was downloaded on May 4, 2020. From this database, 1,603,976 unique relations between chemicals and genes, 34,830 between genes and diseases and 218,868 between chemicals and diseases were extracted. Additionally, co-occurrence data came from the MeSH and gene2pubmed databases. Chemical/drug and disease annotations were MeSH term annotations designated as "Major Topic" from the "Chemicals and Drugs" (D) and "Diseases" (C) branches, respectively, in the 2020 MeSH tree.
Gene annotations came from the gene2pubmed database (Maglott et al., 2011) downloaded on August 20, 2020. These comprised 1,515,080 human gene annotations from 664,085 MEDLINE articles. MEDLINE data came from the 2020 MEDLINE/PubMed baseline. The reference date for each publication was the publication date (PubDate).
The citation distance was computed as the distance between nodes in an undirected citation network in which the nodes were scientific publications recorded in MEDLINE and the connections were citations between them (Rodriguez-Esteban, 2020; Rodriguez-Esteban, 2021). This citation distance differed from those described in previous work in that those typically involved directed connections (Botafogo et al., 1992). The citation distance between any pair of publications was computed using bidirectional breadth-first search (BFS) on the citations existing at the time of publication of the latest article of the pair. Pairs of publications for which a path in the citation network could not be found were discarded from the analysis. A randomized version of the citation network was created by randomly swapping the nodes of the citation network, thus maintaining the network structure.
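The distance computation described above can be sketched as follows. This is an illustrative implementation of bidirectional BFS on an undirected graph, not the code from the study's repository. The function names, the `(citing, cited)` input format and the use of plain years as dates are assumptions.

```python
from collections import defaultdict


def build_undirected(citations, pub_date, as_of):
    """Undirected adjacency built from citations whose citing article
    was published on or before `as_of`."""
    adj = defaultdict(set)
    for citing, cited in citations:
        if pub_date[citing] <= as_of:
            adj[citing].add(cited)
            adj[cited].add(citing)
    return adj


def citation_distance(adj, source, target):
    """Shortest path length between two publications, or None if unconnected."""
    if source == target:
        return 0
    dist_a, dist_b = {source: 0}, {target: 0}
    front_a, front_b = [source], [target]
    while front_a and front_b:
        # Always expand the smaller frontier, one whole level at a time.
        if len(front_a) > len(front_b):
            front_a, front_b = front_b, front_a
            dist_a, dist_b = dist_b, dist_a
        next_front = []
        for node in front_a:
            for nb in adj.get(node, ()):
                if nb in dist_b:
                    # The two searches meet: the path runs through node and nb.
                    return dist_a[node] + 1 + dist_b[nb]
                if nb not in dist_a:
                    dist_a[nb] = dist_a[node] + 1
                    next_front.append(nb)
        front_a = next_front
    return None


# Toy data with hypothetical publication identifiers and years.
citations = [("p2", "p1"), ("p3", "p2"), ("p4", "p3"), ("p5", "p1"), ("p5", "p4")]
pub_date = {"p1": 2000, "p2": 2001, "p3": 2002, "p4": 2003, "p5": 2010}

early = citation_distance(build_undirected(citations, pub_date, 2003), "p1", "p4")
late = citation_distance(build_undirected(citations, pub_date, 2010), "p1", "p4")
```

The toy example shows why the network is restricted to citations existing at the time of the later publication: p1 and p4 are three steps apart in 2003, and only become two steps apart once p5 cites both in 2010.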
Citations came from the Open Citation Index repository (Peroni et al., 2017) and in particular from the March 23, 2020 update, which contained 721,655,465 citations between pairs of articles identified by a digital object identifier (DOI). DOI to PMID mappings were extracted from EBI's PMID-PMCID-DOI dataset (Levchenko et al., 2018) downloaded on July 9, 2020, which contained 22,504,850 mappings between PMIDs and DOIs, thus covering 22,504,850 unique PMIDs in total. Using these mappings, 269,956,002 citations from the Open Citation Index were mapped from DOIs to PMIDs. As of July 2020, the fraction of publications covered by the Open Citation Index was 60% out of 51.1 million articles with references deposited with Crossref (https://i4oc.org/#about; checked on July 29, 2020).
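As an illustration of the mapping step, the sketch below converts DOI-to-DOI citations into PMID-to-PMID citations and keeps only those for which both ends can be mapped. The function name, the in-memory dictionaries and the example identifiers are hypothetical. Lowercasing relies on DOIs being case-insensitive, and dropping self-citations is an added assumption.

```python
def map_citations_to_pmids(doi_citations, doi_to_pmid):
    """Translate (citing DOI, cited DOI) pairs into (citing PMID, cited PMID) pairs."""
    lookup = {doi.lower(): pmid for doi, pmid in doi_to_pmid.items()}
    mapped = set()
    for citing_doi, cited_doi in doi_citations:
        citing = lookup.get(citing_doi.lower())
        cited = lookup.get(cited_doi.lower())
        # Keep a citation only if both articles have a PMID.
        if citing is not None and cited is not None and citing != cited:
            mapped.add((citing, cited))
    return mapped


# Hypothetical identifiers, for illustration only.
doi_to_pmid = {"10.1000/a": 1, "10.1000/b": 2}
doi_citations = [("10.1000/A", "10.1000/b"), ("10.1000/a", "10.1000/c")]
pmid_citations = map_citations_to_pmids(doi_citations, doi_to_pmid)
```

Citations with an unmapped end are lost at this step, which is one source of the missing citation data examined later in the Results.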
The code used for this analysis is available at: https://github.com/raroes/scientific-silos

Results
Research on biomedical properties of compounds and genetic bases of disease is modeled here as a series of sequentially built networks of up to three nodes, one each for a gene, a compound and a disease. The nodes are connected by facts, which are molecular and medical relations published in the scientific literature. Central to this analysis is that two existing facts, e.g. a gene-disease and a disease-compound relation, precede and, therefore, enable the later discovery of another fact, i.e. a gene-compound relation (Figure 1). For example, the compound isopropanol leads to increased expression of the gene NQO1's mRNA (Vandebriel et al., 2010). This, together with the fact that inhibition of NQO1 is linked to the amelioration of kidney diseases (Chen et al., 2011), enables a new discovery, namely the relation between isopropanol and kidney diseases (Brott et al., 2013). Using a comprehensive dataset containing thousands of such facts, this model can be employed to understand the dynamics of scientific discovery. The first step in the analysis is to find all combinatorially possible pairs of facts sharing an element in the dataset, such as all pairs of facts involving the gene NQO1. Together, these comprise all pairs of facts that can enable new discoveries. If a pair of these facts is followed by a new discovery, the time elapsed until that event is computed. For example, in Figure 1, the time lapse is between August 2011, when the second fact was published, and January 7, 2013, when the new discovery was published. This time lapse is then used to estimate the pace at which scientists produce new discoveries from existing facts and, in the case studied here, to test its dependence on the "distance" between the publications in which the facts were published. The distance metric used is the citation distance, which is a simple way to measure proximity in the scientific landscape (Rodriguez-Esteban, 2020).
This distance is computed based on the citations existing at the time that the second fact is published. E.g., in Figure 1
As can be seen in Figure 3A, the percentage of all combinatorially possible pairs of facts that were followed by a new discovery increased linearly over the years, as scientists had time to work with them. This percentage, however, decreased with increasing citation distance, following an exponential decay (Figure 3B). For a citation distance of 2, the percentage of combinatorially possible pairs of facts enabling a new discovery was, on average, 0.090% per year, while for a citation distance of 5 it was, on average, 0.036% per year. After 5 years, it was 2.6 times more likely that a new discovery would be made out of facts separated originally by a citation distance of 2 than out of facts separated by a citation distance of 5 (0.47% vs. 0.18%). This effect disappeared if all publications were randomly swapped within the citation network (Figure 4). In this case, the rate of discovery did not vary with citation distance, except for the case of distance equal to 1, due to data sparsity.
To seek additional validation for these results, a similar analysis was performed with a different dataset based on co-occurrence of manual annotations of genes, diseases and chemicals/drugs of MEDLINE records. Co-occurrences have been considered suggestive of relations (Pavlopoulos et al., 2014) and have been used to discover new relations between drugs, genes and diseases (Frijters et al., 2010). The combinatorial space of all potentially enabling pairs of facts was three times larger (n=17,040,304) in this case than for CTD but the overall outcome was similar (Figure 5): only a small percentage of those pairs of facts (0.26%) enabled new discoveries 5 years after publication. The percentage grew steadily with time, but at a different rate depending on the citation distance, following an exponential decay (Figure 3).
For facts separated by a citation distance of 2, the percentage enabling a new discovery increased, on average, 0.10% per year, while for a citation distance of 5, it increased 0.035% per year. After 5 years, it was 3 times more likely that a new discovery would be made out of facts published within a citation distance of 2 than out of facts within a citation distance of 5 (0.54% vs. 0.18%). This effect disappeared if publications were randomly swapped (Figure 4). As in the previous case, the percentage of pairs that enabled a new discovery did not vary with citation distance and was similar to the baseline, except for distance equal to 1 due to data sparsity.
One potential weakness of this analysis could be missing citation data. The effect of this shortcoming was examined by eliminating existing citations randomly. This reduction did not change the shape of the outcome except when it was large (75% reduction) (Figure 6). Thus, an increase in the availability of citation data would not be expected to change the overall picture either.
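Both control analyses, swapping publications within the network and randomly eliminating citations, can be sketched in a few lines. The function names and data layout below are assumptions for illustration, and the adjacency is assumed to be symmetric so that every publication appears as a key.

```python
import random


def shuffle_publications(adj, rng):
    """Randomly permute publication labels over the nodes of the network.

    The set of edges keeps its shape (same degrees, same paths), but each
    publication now sits at a random position in the network.
    """
    nodes = sorted(adj)
    shuffled = nodes[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(nodes, shuffled))
    return {relabel[n]: {relabel[m] for m in nbrs} for n, nbrs in adj.items()}


def drop_citations(citations, keep_fraction, rng):
    """Keep a random subset of the citations, e.g. keep_fraction=0.25."""
    k = round(len(citations) * keep_fraction)
    return rng.sample(citations, k)


# Toy data with hypothetical publication identifiers.
rng = random.Random(0)
adj = {"p1": {"p2"}, "p2": {"p1", "p3"}, "p3": {"p2"}}
randomized = shuffle_publications(adj, rng)

citations = [(f"p{i + 1}", f"p{i}") for i in range(8)]
reduced = drop_citations(citations, 0.25, rng)
```

Recomputing citation distances on `randomized` or on a network built from `reduced` and repeating the analysis would then correspond to the comparisons described in the text.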
Figure 6. Percentage of pairs of facts enabling new discoveries after 5 years based on citation distance in a citation network with progressively fewer citations (100% = all available citations used, 75% = 75% of all available citations used, etc.). Exponential regressions were fitted to each curve. Data source was CTD.
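The exponential dependence on citation distance, p(d) = a·exp(−b·d), can be illustrated with a log-linear least-squares fit. The regression procedure actually used for the figures is not described, so fitting in log space is an assumption. Only the two CTD values quoted in the text (0.47% at distance 2 and 0.18% at distance 5, after 5 years) are used as input; the values at other distances are not reproduced here.

```python
import math


def fit_exponential(distances, percentages):
    """Least-squares fit of log(p) = log(a) - b * d. Returns (a, b)."""
    logs = [math.log(p) for p in percentages]
    n = len(distances)
    mean_d = sum(distances) / n
    mean_l = sum(logs) / n
    slope = sum(
        (d - mean_d) * (l - mean_l) for d, l in zip(distances, logs)
    ) / sum((d - mean_d) ** 2 for d in distances)
    intercept = mean_l - slope * mean_d
    return math.exp(intercept), -slope


# Percentages after 5 years for CTD, as quoted in the Results.
a, b = fit_exponential([2, 5], [0.47, 0.18])
```

With only these two points the fit is exact and gives b = ln(0.47/0.18)/3, about 0.32 per citation step. A fit over all distances shown in the figures may differ.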

Discussion and conclusions
The fact that the analyses on both datasets led to similar outcomes lends some validation to the results. Both analyses show that, over time, scientists "connect" only a small percentage of existing facts about relations between compounds, genes and diseases. Thus, biomedical scientists appear to have a wide set of facts available, yet they end up publishing discoveries about only a small subset of them, whether because of lack of resources, lack of interest, or because many combinations lead to negative results. Moreover, scientists steadily "accumulate" discoveries over the years but the rate of collective accumulation is higher when those discoveries concern facts that were originally closer within the citation network. This points towards a path-dependency in scientific discovery in which originally closely located discoveries, rather than optimal unbiased choices, drive the sequence of future discoveries.
As more facts are discovered, one may expect their potential combinations to grow quadratically and siloization to be a consequence, at least partially, of this. However, there is a countervailing trend, which is that the scientific literature grows exponentially and is therefore able to produce an increasingly large number of discoveries. This analysis points to a somewhat stable relation between these two opposing forces. The overall percentage of facts that are being connected to form new discoveries has not changed much over the last decades and has even increased slightly despite enormous growth in combinatorial possibilities (Figure 2). If scientists were falling behind, we would expect to see a decrease. Additionally, the rate of accumulation of new discoveries (Figures 3 and 5) appears generally stable and does not show signs of acceleration or deceleration over time (apart from a slight acceleration for co-occurrence data). Therefore, Swanson's warning about "connection explosion" (Swanson, 2008) ("The literature of science cannot grow faster than the communities that produce it, but not so with connections. Implicit connections between subspecialties grow combinatorially. LBD is challenged more by a connection explosion than by an information explosion.") does not bear on this case, probably because scientists tend to focus on a reduced set of drugs, diseases and genes (Yao et al., 2015; Haynes et al., 2018; Stoeger et al., 2018; Rzhetsky et al., 2015), which would tend to limit combinatorial explosion.
The citation distance is only a rough estimate of scientific proximity between articles. One could expect that a more precise surrogate for scientific proximity could show an even stronger siloization effect. The citation distance was chosen for its simplicity. Measures of semantic similarity between articles, for example, could create cross-feedback between article annotations (i.e. gene annotations) and the distance metric itself.
Reaching more often for facts that are "closer" could be a simple heuristic or a type of availability bias. That scientists may use heuristic biases, even if unconscious, to select their research goals should not be surprising, given the extraordinary growth of the scientific literature in most fields. However, this bias leads to a distortion of scientific progress and an opportunity for those who may venture further away from their silos with the aid of modern tools (Krenn & Zeilinger, 2020; Whalen et al., 2016). Siloization is ultimately an emergent property of scientific organization and self-organization with cognitive, social and technological aspects.

Funding
Not applicable.

Conflicts of interest
None declared.

Availability of data and material
All data used in this study is publicly available.

Code availability
The code used for this analysis is available at: https://github.com/raroes/scientific-silos