Environmental biases in the study of ecological networks at the planetary scale

Ecological networks are increasingly studied at large spatial scales, expanding their focus from a conceptual tool for community ecology into one that also addresses questions in biogeography and macroecology. This effort is supported by increased access to standardized information on ecological networks, in the form of openly accessible databases. Yet, there has been no systematic evaluation of the fitness for purpose of these data to explore synthesis questions at very large spatial scales. In particular, because the sampling of ecological networks is a difficult task, the available data are likely to represent the diversity of Earth's bioclimatic conditions poorly, likely to be spatially aggregated, and therefore unlikely to achieve broad representativeness. In this paper, we analyze over 1300 ecological networks in the mangal.io database, and discuss their coverage of biomes, and the geographic areas in which there is a deficit of data on ecological networks. Taken together, our results suggest that while some information about the global structure of ecological networks is available, it remains fragmented over space, with further differences by types of ecological interactions. This causes great concern both for our ability to transfer knowledge from one region to the next and for our ability to forecast the structural change in networks under climate change.


Introduction
Ecological networks are a useful representation of ecological systems in which species or organisms interact (Heleno et al. 2014, Delmas et al. 2018), and there has been a recent explosion of interest in their dynamics across large temporal scales (Tylianakis & Morris 2017, Baiser et al. 2019), and along environmental gradients (Trøjelsgaard & Olesen 2016, Pellissier et al. 2017). As ecosystems are changing rapidly, networks are at risk of undergoing rapid and catastrophic changes to their structure: for example by invasion leading to a collapse (Strong & Leroux 2014, Magrach et al. 2017), or by a "rewiring" of interactions among existing species (Bartley et al. 2019, Guiden et al. 2019, Hui & Richardson 2019). Simulation studies suggest that knowing the structure of the extant network, i.e. being able to map all interactions between species, is not sufficient (Thompson & Gonzalez 2017) to predict the effects of external changes, and that data on the species, the local climate and its future projection, are also required.
This change in scope, from describing ecological networks as local, static objects, to dynamical ones that vary across space and time, has prompted several methodological efforts. First, tools to study the spatial, temporal, and spatio-temporal variation of ecological networks, and their relationship to environmental gradients, have been developed and continuously expanded (Poisot et al. 2012, 2015). Second, there has been an improvement in large-scale data collection, through increased adoption of molecular biology tools (Evans et al. 2016, Eitzinger et al. 2019, Makiola et al. 2019) and crowd-sourcing of data collection (Pocock et al. 2015, Bahlai & Landis 2016, Roy et al. 2016). Finally, there has been a surge in the development of tools that allow us to infer species interactions (Morales-Castilla et al. 2015, Dallas et al. 2017) based on limited but complementary data on existing network properties (Stock et al. 2017), species traits (Gravel et al. 2013, Bartomeus et al. 2016, Brousseau et al. 2017, Desjardins-Proulx et al. 2017), and environmental conditions (Gravel et al. 2018). These latter approaches tend to perform well in data-poor environments (Beauchesne et al. 2016), and can be combined through ensemble modelling or model averaging to generate more robust predictions (Pomeranz et al. 2018). The task of inferring interactions is particularly important because ecological networks are difficult to adequately sample in nature (Banašek-Richter et al. 2004, Gibson et al. 2011, Chacoff et al. 2012, Jordano 2016). The common goal of these efforts is to facilitate the prediction of network structure, particularly over space (Poisot & Gravel et al. 2016, Gravel et al. 2018, Albouy et al. 2019) and into the future (Albouy et al. 2014), in order to appraise the response of that structure to possible environmental changes.
These disparate methodological efforts share another important trait: their continued success depends on state-of-the-art data management, but also on the availability of data that are representative of the area we intend to model. Novel quantitative tools demand a higher volume of network data; novel collection techniques demand powerful data repositories; novel inference tools demand easier integration between different types of data, including but not limited to: interactions, species traits, taxonomy, occurrences, and local bioclimatic conditions. In short, advancing the science of ecological networks requires us not only to increase the volume of available data, but to pair these data with ecologically relevant metadata. Such data should also be made available in a way that facilitates programmatic interaction so that they can be used by reproducible data analysis pipelines. Poisot & Baiser et al. (2016) introduced mangal.io as a first step in this direction. In the years since the tool was originally published, we continued development of the data representation and of the amount and richness of metadata, and digitized and standardized as much ecological data as we could find. The second major release of this database contains over 1300 networks, 120000 interactions across close to 7000 taxa, and represents what is to our best knowledge the most complete collection of species interactions available.
Here we ask if the current mangal database is fit for the purpose of global-scale synthesis research into ecological networks. We conclude that interactions over most of the planet's surface are poorly described, despite an increasing amount of available data, due to temporal and spatial biases in data collection and digitization. In particular, Africa, South America, and most of Asia have very sparse coverage. This suggests that synthesis efforts on the worldwide structure or properties of ecological networks will be weaker within these areas.

To improve this situation, we should digitize available network information and prioritize sampling towards data-poor locations.

Global trends in ecological networks description

Network coverage is accelerating but spatially biased
The earliest recorded ecological networks date back to the late nineteenth century, with a strong increase in the rate of collection around the 1980s (fig. 1). Although the volume of available networks has increased over time, the sampling of these networks in space has been uneven. In fig. 2, we show that globally, network collection is biased towards the Northern hemisphere, and that different types of interactions have been sampled in different places. As such, it is very difficult to find a spatial area of sufficiently large size in which we have networks of predation, parasitism, and mutualism. The inter-tropical zone is particularly data-poor, either because data producers from the global South correctly perceive massive re-use of their data by Western world scientists as a form of scientific neo-colonialism (as advanced by Mauthner & Parry 2013), thereby providing a powerful incentive against their publication, or because ecological networks are subject to the same data deficit that is affecting all fields of ecology in the tropics (Collen et al. 2008). As Bruna (2010) identified almost ten years ago, improved data deposition requires an infrastructure to ensure they can be repurposed for future research, which we argue is provided by mangal.io for ecological interactions.

Figure 1: Cumulative number of ecological networks available in mangal.io as a function of the date of collection. About 1000 unique networks have been collected between 1987 and 2017, a rate of just over 30 networks a year. This temporal increase proceeds at different rates for different types of networks; while the description of food webs is more or less constant, the global acceleration in the dataset is due to increased interest in host-parasite interactions starting in the late 1970s, while mutualistic networks mostly started being recorded in the early 2000s.

Figure 2: Each point on the map corresponds to a network with parasitic, mutualistic, and predatory interactions. It is noteworthy that the spatial coverage of these types of interactions is uneven; the Americas have almost no parasitic network, for example. Some places have barely been studied at all, including Africa and Eastern Asia. This concentration of networks around rich countries speaks to an inadequate coverage of the diversity of landscapes on Earth.

Whittaker (1962) suggested that natural communities can be partitioned across biomes, largely defined as a function of their relative precipitation and temperature. For all networks for which the latitude and longitude [...] total annual) from the WorldClim 2 data (Fick & Hijmans 2017). Using these we can plot every network on the map of biomes drawn by Whittaker (1962) (note that because the frontiers between biomes are not based on any empirical or systematic process, they have been omitted from this analysis). In fig. 3, we show that even though networks capture the overall diversity of precipitation and temperature, types of networks have been [...]

Figure 3: Networks in the space of biomes as originally presented by Whittaker (1962). Predation networks, i.e. food webs, seem to have the most global coverage; parasitism networks are restricted to low temperature and low precipitation biomes, congruent with the majority of them being in Western Europe.

Figure 4: Distance to the centroid (in the scaled climatic space) for each network, as a function of the type of interaction. Larger values indicate that the network is far from its centroid, and therefore represents sampling in a more "unique" location.
Mutualistic interactions have been, on average, studied in more diverse locations than parasitism or predatory networks.
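The distance to the centroid in scaled climatic space can be illustrated with a minimal sketch. It is not the analysis code used for this manuscript: the records below are hypothetical, the climatic space is reduced to two variables (temperature and precipitation), scaling is assumed to be a z-score over all networks, and the centroid is assumed to be computed separately for each interaction type. The names `zscores` and `centroid_distances` are illustrative.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical records, not taken from mangal.io:
# (interaction type, mean annual temperature in °C, annual precipitation in mm)
networks = [
    ("mutualism", 25.0, 2200.0),
    ("mutualism", 5.0, 400.0),
    ("mutualism", 15.0, 1300.0),
    ("parasitism", 6.0, 500.0),
    ("parasitism", 7.0, 550.0),
    ("predation", 12.0, 900.0),
    ("predation", 18.0, 1500.0),
]

def zscores(values):
    """Center and scale one climatic variable (assumed scaling: z-score)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def centroid_distances(records):
    """Euclidean distance of each network to the centroid of its interaction type."""
    temperature = zscores([r[1] for r in records])
    precipitation = zscores([r[2] for r in records])
    by_type = defaultdict(list)
    for (kind, _, _), t, p in zip(records, temperature, precipitation):
        by_type[kind].append((t, p))
    distances = {}
    for kind, points in by_type.items():
        cx = mean(x for x, _ in points)
        cy = mean(y for _, y in points)
        distances[kind] = [math.hypot(x - cx, y - cy) for x, y in points]
    return distances

distances = centroid_distances(networks)
mean_distance = {kind: mean(d) for kind, d in distances.items()}
```

With these made-up values, the interaction type whose networks are spread over the widest range of climates has the largest mean distance, which is the pattern the figure is meant to reveal.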

Some locations on Earth have no climate analogue
In fig. 5, we represent the environmental distance between every pixel covered by BioClim data, and the three networks that were sampled in the closest environmental conditions (this amounts to a k-nearest neighbors search with k = 3). In short, higher distances correspond to pixels on Earth for which no climate analogue network exists, whereas the darker areas are well described. It should be noted that the three types of interactions studied here (mutualism, parasitism, predation) have regions with no analogues in different locations. In short, it is not that we are systematically excluding some areas, but rather that some types of interactions are more studied in specific environments. This shows how the lack of global coverage identified in fig. 3, for example, can cascade up to the global scale. These maps serve as an interesting measure of the extent to which spatial predictions can be trusted: any extrapolation of network structure in an area devoid of analogues should be taken with much greater caution than an extrapolation in an area with many similar networks.
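The k-nearest neighbors distance described above can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the figure: the pixel and network values are hypothetical, only two climatic variables are used instead of the full BioClim set, variables are z-scored using the pixels as reference, and the three distances are summarized by their mean (the text does not specify how they are combined). The function names are illustrative.

```python
import heapq
import math
from statistics import mean, pstdev

def standardize(points, reference):
    """Z-score each coordinate of `points` with the mean and sd of `reference`."""
    columns = list(zip(*reference))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]

def knn_environmental_distance(pixel, sampled, k=3):
    """Mean Euclidean distance from a pixel to its k closest sampled networks.

    Averaging the k distances is an assumption made for this sketch.
    """
    all_distances = [math.dist(pixel, site) for site in sampled]
    return mean(heapq.nsmallest(k, all_distances))

# Hypothetical climate values: (temperature in °C, annual precipitation in mm)
pixels = [(10.0, 800.0), (26.0, 2500.0), (-5.0, 200.0), (11.0, 850.0)]
sampled_networks = [(9.0, 750.0), (10.5, 820.0), (12.0, 900.0), (8.0, 700.0)]

scaled_pixels = standardize(pixels, pixels)
scaled_networks = standardize(sampled_networks, pixels)

# One value per pixel; larger values mean fewer climate analogues.
novelty = [knn_environmental_distance(p, scaled_networks) for p in scaled_pixels]
```

In this toy example all networks were sampled in temperate conditions, so the warm, wet pixel and the cold, dry pixel receive much larger distances than the two temperate pixels. Running the same computation once per interaction type would give one map per type.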

For what purpose are global ecological network data fit?
What can we achieve with our current knowledge of ecological networks? The overview presented here shows a large and detailed dataset, compiled from almost every major biome on Earth. It also displays our failure as a community to include some of the most threatened and valuable habitats in our work. Gaps in any dataset create uncertainty when making predictions or suggesting causal relationships. This uncertainty must be measured by users of these data, especially when predicting over the "gaps" in space or climate that we have identified. In this paper we are not making any explicit recommendations for synthesis workflows. Rather we argue that this needs to be a collective process, a collaboration between data collectors (who understand the deficiencies of these data) and data analysts (who understand the needs and assumptions of network methods).
One line of research that we feel can confidently be pursued lies in extrapolating the structure of ecological [...] the task of predicting the overarching structure greatly. Finally, this approach to prediction which neglects the composition of networks is justified by the fact that even in the presence of strong compositional turnover, network structure tends to be maintained at very large spatial scales (Dallas & Poisot 2017).

Figure 5: Environmental distance for every terrestrial pixel to its three closest networks. Areas of more yellow coloration are further away from any sampled network, and can therefore not be well predicted based on existing empirical data. Areas with a dark blue coloration have more analogues. The distance is expressed in arbitrary units and is relative.
Can we predict the future of ecological networks under climate change?
Perhaps unsurprisingly, most of our knowledge on ecological networks is derived from data that were collected after the 1990s (fig. 1). This means that we have worryingly little information on ecological networks before the acceleration of the climate crisis, and therefore lack a robust baseline. Dalsgaard et al. (2013) provide strong evidence that the extant shape of ecological networks emerged in part in response to historical trends in climate change. The lack of reference data before the acceleration of the effects of climate change is of particular concern, as we may be deriving intuitions on ecological network structure and assembly rules from

Active development and data contribution
This is an open-source project: all data and all code supporting this manuscript are available on the Mangal project GitHub organization, and the figures presented in this manuscript are themselves packaged as a self-contained analysis which can be run at any time. Our hope is that the success of this project will encourage similar efforts within other parts of the ecological community. In addition, we hope that this project will encourage the recognition of the contribution that software creators make to ecological research.

One possible avenue for synthesis work, including the contribution of new data to Mangal, is the use of these published data to supplement and extend existing ecological network data. This "semi-private" ecological synthesis could begin with new data collected by authors, for example a host-parasite network of lake fish in Africa, or a pollination network of hummingbirds in Brazil. Authors could then extend their analyses by including a comparison to analogous data made public in Mangal. After publication of the research paper, the original data could themselves be uploaded to Mangal. This enables the reproducibility of this particular published paper. Even more powerfully, it allows us to build a future of dynamic ecological analyses, wherein analyses are automatically re-done as more data get added. This would allow a sort of continuous assessment of proposed ecological relationships in network structure. This cycle of data discovery and reuse is an example of the Data Life Cycle (Michener 2015) and represents one way to practice ecological synthesis.
Finally, it must be noted that as the amount of empirical evidence grows, so too should our understanding of existing relationships between network properties, between network properties and space, and of the interpretation to be drawn from them. In this perspective, the idea of continuously updated analyses is very promising. Following the template laid out by White et al. (2019) and Yenni et al. (2019), it is feasible to update a series of canonical analyses any time the database grows, in order to produce living, automated synthesis of ecological networks knowledge. To this end, the mangal database has been integrated with EcologicalNetworks.jl