Abstract
Although studies have shown that urban environments and mass-transit systems have geospa-tially distinct metagenomes, no study has ever systematically studied these dense, human/microbial ecosystems around the world. To address this gap in knowledge, we created a global metagenomic and antimicrobial resistance (AMR) atlas of urban mass transit systems from 58 cities, spanning 3,741 samples and 4,424 taxonomically-defined microorganisms collected for three years. The map provides annotated, geospatial data about microbial strains, functional genetics, antimicrobial resistance, and novel genetic elements, including 10,928 novel predicted viral species. Urban microbiomes often resemble human commensal microbiomes from the skin and airways but contain a consistent “core” of 61 species which are predominantly not human commensal species. These data also show that AMR density across cities varies by several orders of magnitude with many AMRs present on plasmids with cosmopolitan distributions. Conversely, samples may be accurately (91.4%) classified to their city-of-origin using a linear support vector machine over taxa. Together, these results constitute a high-resolution global metagenomic atlas, which enables the discovery of new genetic components of the built human environment, forensic application, and an essential first draft of the global AMR burden of the world’s cities.
1 Introduction
The high-density urban environment has historically been home to only a fraction of all people with the majority living in rural areas or small villages. In the last century the situation has reversed; 55% of the world’s population now lives in urban areas (United Nations, 2018). Since the introduction of germ theory and John Snow’s work on cholera it has been clear that people in cities interact with microbes in ways that may be markedly different than in rural areas (Neiderud, 2015). Microbes in the built environment have been implicated as a possible source of contagion (Cooley et al., 1998) and certain syndromes, like allergies, are associated with increasing urbanization (Nicolaou et al., 2005). It is now apparent that cities in general have an impact on human health though the mechanisms of this impact are broadly variable and often little understood. In general our understanding of microbial dynamics in the urban environment outside of pandemics is limited (Gilbert and Stephens, 2018).
Technological advances in next-generation sequencing (NGS) and metagenomics have created an unprecedented opportunity for rapid, global studies of microorganisms and their hosts, providing researchers, clinicians, and policymakers with a more comprehensive view of the functional dynamics of microorganisms in the city. NGS facilitates unbiased sampling of the microorganisms in an area with the potential for both taxonomic and functional annotation. This is particularly important for surveillance of microorganims as they acquire antimicrobial resistance (AMR) (Fresia et al., 2019). Metagenomic methods enable nearly real-time monitoring of organisms, AMR genes, and pathogens as they emerge within a given geographical location, and have the potential to reveal hidden microbial reservoirs and detect microbial transmission routes as they spread around the world (Zhu et al., 2017). A molecular map of urban environments will enable significant new research on the impact of urban microbiomes on human health.
Urban transit systems such as subways and buses are a daily contact interface for a large proportion of the people who live in cities. Urban travelers bring their commensal microorganisms with them as they travel and come into contact with organisms and mobile elements present in the environment, including AMR markers. The urban microbiome, however, has only been profiled with comprehensive metagenomic methods or in longitudinal studies (Afshinnekoo et al., 2015; Hsu et al., 2016; Kang et al., 2018; MetaSUB International Consortium. Mason et al., 2016) in a select few cities on a limited number of occasions. This leaves a gap in scientific knowledge about a microbial ecosystem which a large proportion of the global population readily interacts with. Human commensal microbiomes have been found to vary widely based on culture and geography and geographically constrained studies, while beneficial, may prove to miss similar differences (Brito et al., 2016). Moreover, data on urban microbes and AMR genes are urgently needed in developing nations, where antimicrobial drug consumption is expected to rise by 67% by 2030 (United Nations, 2016; Van Boeckel et al., 2015), both from changes in consumer demand for livestock products and an expanding use of antimicrobials which can be expected to alter AMR profiles of these cities.
The International Metagenomics and Metadesign of Subways and Urban Biomes (MetaSUB) Consortium was launched in 2015 to address this gap in knowledge on the density, types and dynamics of urban metagenomes and AMR profiles. Since then, we have developed standardized collection and sequencing protocols to process 3,741 samples across 58 cities worldwide (Table S1). Sampling took place at three major time points: a pilot study in 2015-16 and two global city sampling days (gCSD) in mid 2016 and mid 2017, with most samples taken on June 21st. Each sample was sequenced with 5-7M 125bp paired-end reads using Illumina NGS sequencers (see Methods). To deal with the challenging analysis of our large dataset, we generated a custom made analysis pipeline (MetaSUB Core Analysis Pipeline, CAP), which is open-source and includes a comprehensive set of state-of-the-art, peer-reviewed, metagenomic tools for taxonomic identification, k-mer analysis, AMR gene prediction, functional profiling, annotation of particular microbial species and geospatial mapping.
This first global metagenomic study of urban microbiomes reveals a consistent “core” urban microbiome across all cities underlying distinct geographic variation that may reflect epidemiological variation and enable future forensic capabilities. More importantly, our data demonstrate that a significant fraction of the urban microbiome remains to be characterized. Though 1,000 samples are sufficient to discover roughly 80% of the observed taxa and AMR genes, we continued to observe novel taxa and genes at an ongoing discovery rate of approximately one new species and one new AMR marker for every 10 samples. Notably, this genetic variation is affected by various environmental factors (e.g., climate, surface type, latitude, etc.) and samples show greater diversity near the equator. Moreover, sequences associated with AMR markers are widespread, though not necessarily abundant, and show geographic specificity. Here, we present the results of our global analyses and a set of tools developed to access and analyze our unprecedented data collection, including: two interactive map-based visualizations for samples (metasub.org) and AMRs (resistanceopen.org), an indexed search tool over raw sequence data (dnaloc.ethz.ch/), a Git repository for all analytical pipelines and figures, and application programming interfaces (APIs) for computationally accessing results (github.com/metasub/metasub_utils).
2 Results
We present the main results of the ongoing MetaSUB study. MetaSUB is intended to develop our knowledge of urban microbiomes by studying mass transit systems within cities. We have collected and sequenced 3,741 samples from 58 cities worldwide (Tables 1 and S1). Our results constitute the first large scale metagenomic study of the urban microbiome. We conclude that there is a consistent core urban microbiome (Section 2.1), that the core microbiome is supplemented by geographic variation (Section 2.2), that significant diversity remains to be characterized (Section 2.3) and that novel taxa may be discovered in the data (Section 2.5), that environmental factors affect variation (though the source of much variation remains unknown, Section 2.2), and that sequences associated with AMR are widespread but not necessarily abundant (Section 2.4). In addition to these results we present several ways to access and analyze our data including interactive web based visualizations, search tools over raw sequence data, and high level interfaces to computationally access results.
2.1 A Core Urban Microbiome Centers Global Diversity
We first investigated the distribution of microbial species across the global urban environment. Specifically, we asked whether the urban environment represents a singular type of microbial ecosystem or a set of related, but distinct, communities, especially in terms of biodiversity. We observed a bimodal distribution of taxa prevalence across our dataset, which we used to define two separate sets of taxa based on the inflection points of the distribution: the putative “sub-core” set of urban microbial species that are consistently observed (>70% of samples) and the less common “peripheral” (<25% of samples) species. We also arbitrarily define a set of true “core” taxa which occur in essentially all samples (>95% of samples). Applying these thresholds, we identified 1,145 microbial species (Figure 1B) that make up the sub-core urban microbiome with 61 species in the true core microbiome (Figure 1A). Core and sub-core taxa classifications were further evaluated for sequence complexity and genome coverage on a subset of samples. Of the 1,206 taxa with prevalence greater than 70%, 69 were flagged as being low quality classifications. The sub-core microbiome was principally bacterial with just one eukaryotic taxa identified and not flagged: Saccharomyces cerevisiae. Notably, no archaea or viruses were identified in the group of sub-core microorganisms (note that this analysis did not include viruses newly discovered in this study). For viruses in particular, this may be affected by the sampling or DNA extraction methods used, by limitations in sequencing depth or by missing annotations in the reference databases used for taxonomic classification, which is principally problematic with phages. It is worth noting that potentially prevalent RNA viruses are omitted with our DNA-based sampling. The three most common bacterial phyla across the world’s cities ordered by the number of species observed were Proteobacteria, Actinobacteria, and Firmicutes. To test for possible geographic bias in our data we normalized the prevalence for each taxa by the median prevalence within each city. The two normalization methods broadly agreed (Figure 1B). This can also be observed in the global co-occurrence map of microbial species (Figure 1C) which shows a single strongly co-occurring group and a minor tail of co-occurrence but no secondary cluster.
Despite their global prevalence, the core taxa are not uniformly abundant across all cities. Many species exhibited a high standard deviation and kurtosis (calculated using Fisher’s definition and normal kurtosis of 0) than other species (Figure 1D). Furthermore, some species show distinctly high mean abundance, often higher than the core species, but more heterogeneous global prevalence. For example, Salmonella enterica is identified in less than half of all samples but is the 12th most abundant species based on the fraction of DNA that can be ascribed to it. The most relatively abundant microbial species was Cutibacterium acnes (Figure 1D) which had a comparatively stable distribution of abundance across all samples. Cutibacterium acnes is known as a prominent member of the human skin microbiome. To test for any biases arising from uneven geographic sampling, we measured the relative abundance of each taxon by calculating the fraction of reads classified to each particular taxon, and compared the raw distribution of abundance to the distribution of median abundance within each city (Figure 1D); the two measures closely aligned. Also, an examination of the positive and negative controls indicates that these results are not likely due to contamination or batch effect (Supp. Figure S6). In total, we observed 61 core taxa (>95%), 1,145 sub-core taxa (70-95%) 2,466 peripheral taxa (<25%), and 4,424 taxa across all samples. We term the set of all taxa observed pan-taxa.
The taxonomic classification used (KrakenUniq) revealed that much of the diversity in our data represents sequences of novel or unknown provenance. Indeed, the average fraction of DNA unclassified at the species level across all of our samples is approximately 50% (Figure 1E), mirroring results seen by previous urban microbiome works (Afshinnekoo et al., 2015; Hsu et al., 2016). As humans are a ma jor part of the urban environment, the DNA in our samples could be expected to resemble commensal human microbiomes. To investigate this, we compared non-human DNA fragments from our samples to a randomized set of 50 samples from 5 commensal microbiome sites in the Human Microbiome Project (HMP) (Consortium et al., 2012) (stool, skin, airway, gastrointestinal, urogenital). We used MASH to perform a k-mer based comparison of our samples vs. the selected HMP samples, which showed a roughly uniform dissimilarity between MetaSUB samples and those from different human body sites (Figure 1F, Supp. Figure S4C). A similar comparison based on taxonomic profiles rather than k-mers revealed greater similarity between MetaSUB samples and the skin and airway commensal microbiomes (Supp. Figure S4D) than other body sites though the overall similarity was small. Given that a large fraction of DNA in our samples could not be classified and that a k-mer based comparison did not find significant body-site specificity it is possible that the unclassified DNA in our samples is from novel taxa which are not human commensals.
Neither do our samples resemble the taxonomic composition of soil samples. We processed 28 metage-nomic soil samples (Bahram et al., 2018) using the same pipeline as the rest of the data and compared soil samples to our samples using MASH. Our samples were very dissimilar from the soil samples (Supp. Figure S4F) even in comparison to human skin microbiomes. This suggests that the unclassified DNA may represent heretofore uncharacterized taxa that are not known commensals being shed into the environment.
We further investigated the impact of geography on sample composition. In ecology, an increasing distance from the equator is associated with a decrease in taxonomic diversity (O’Hara et al., 2017). Indeed, the MetaSUB dataset recapitulates this result and identifies a significant decrease in taxonomic diversity (though with significant noise, p < 2e16, R2 = 0.06915) as a function of absolute latitude; samples are estimated to lose 6.9672 species for each degree of latitude away from the equator (Figure 1G). The effect of latitude on species diversity is not purely monotonic, since several cities have higher species diversity then their latitude would predict. This is expected as latitude is only a rough predictor of a city’s climate. While this is an observation consistent with ecological theory, the samples are heavily skewed by the location of the target cities, as well as the prevalence of those cities in specific latitude zones of the northern hemisphere.
To quantify the number of likely unobserved taxa, we performed a rarefaction analysis on the taxa that were identified. By estimating the number of taxa identified for different numbers of samples, we see a diminishing trend (Figure 1H), which indicates that at some point, the species in every new sample were likely already identified in a previous one. Our rarefaction curve did not reach a plateau and, even after including all samples, it still shows an expected marginal discovery rate of roughly 10 new species for every 100 samples added to the study. Despite the remaining unidentified taxa, we estimate that most (80%) of the classifiable taxa in the urban microbiome could be identified with roughly 1,000 samples. However, as noted below, this new diversity is likely not evenly distributed, with the bulk of new discoveries expected to be made from equatorial regions and cities in the Southern hemisphere.
2.2 Global Diversity Varies According to Covariates
Despite the core urban microbiome present in almost all samples, there was also geographic variation in taxonomy and localization. We calculated the Jaccard distance between samples measured by the presence and absence of species (which is robust to noise from relative abundance given) and performed a dimensionality reduction of the data using UMAP (Uniform Manifold Approximation and Projection, McInnes et al. (2018)) for visualization (Figure 2A). Jaccard distance was correlated with distance based on Jensen-Shannon Divergence (which accounts for relative abundance) and k-mer distance calculated by MASH (which is based on the k-mer distribution in a sample, so cannot be biased by a database) (Supp. Figure S3A, B, C). In principle, Jaccard distance can be influenced by read depth, however the total number of species identified stabilized at 100,000 reads (Supp. Figure S2B) compared to an average of 6.01M reads per sample. Samples collected from Europe and East Asia were distinct, but the separation between other continents was less clear. A similar trend was found in an analogous analysis based on functional pathways rather than a taxonomy (Supp. Fig S1B), which indicates geographic stratification of the metagenomes at both the functional and taxonomic levels. These findings confirm and extend earlier analyses performed on a fraction of the MetaSUB data which were run as a part of CAMDA Challenges in years 2017 and 2018 (camda.info).
We quantified the degree to which metadata covariates influence the taxonomic composition of our samples using MAVRIC, a statistical tool to estimate the sources of variation in a count-based dataset (Moskowitz and Greenleaf, 2018). We identified covariates which influenced the taxonomic composition of our samples: continent, elevation above sea-level, average temperature, surface material, and proximity to the coast. The most important factor, which could explain 11% of the variation in isolation, was the continent from which a sample was taken. The other four factors ranged from explaining 2% to 7% of the possible variation in taxonomy in isolation (Supp. Table S2). We note that many of the factors were confounded with one another, so they can explain less diversity than their sum. One metadata factor tested, the population density of the sampled city, appeared to have no discernible effect on taxonomic variation.
To quantify how the principle covariates, climate, continent, and surface material impacted the taxonomic composition of samples, we performed a Principal Component Analysis (PCA) on our taxonomic data normalized by proportion and identified principal components (PCs) which were strongly associated with a metadata covariate in a positive or negative direction (PCs were centered so an average direction indicates an association). We found that the first two PCs (representing 28.0% and 15.7% of the variance of the original data, respectively) associated strongly with the city climate while continent and surface material associate less strongly (Figure 2B).
Next, we tested whether geographic proximity (in km) of samples to one another had any effect on the variation, since samples taken from nearby locations could be expected to more closely resemble one another. Indeed, for samples taken in the same city, the average JSD (Jensen-Shannon distance) was weakly predictive of the taxonomic distance between samples, with every increase of 1km in distance between two samples representing an increase of 0.056% in divergence (p < 2e16, R2 = 0.01073, Supp. Figure S4E). This suggests a “neighborhood effect” for sample similarity analogous to the effect described by Meyer et al. (2018), albeit a very minor one. To reduce bias that could be introduced by samples taken from precisely the same object we excluded all pairs of samples within 1km of one another.
At a global level, we examined the prevalence and abundance of taxa and their functional profiles between cities and continents. These data showed a fairly stable phyla distribution across samples, but the relative abundance of these taxa is unstable (Figure 2C) with some continental trends. In contrast to taxonomic variation, functional pathways were much more stable across continents, showing relatively little variation in the abundance of high level categories (Figure 2D). This pattern may also be due to the more limited range of pathway classes and their essential role in cellular function, in contrast to the much more wide-ranging taxonomic distributions examined across metagenomes. Classes of antimicrobial resistance were observed to vary by continent as well. Many samples had no AMR genes identified but clusters of AMR classes were observed to occur in groups of taxonomically similar samples (Figure 2E).
We quantified the relative variation of taxonomic and functional profiles by comparing the distribution of pairwise distances in taxonomic and functional profiles. Both profiles were equivalently normalized to give the probability of encountering a particular taxon or pathway. Taxonomic profiles have a mean pairwise Jensen-Shannon Divergence (JSD) of 0.61 while pathways have a mean JSD of 0.099. The distributions of distances are significantly different (Welch’s t-test, unequal variances, p < 2e – 16). This is consistent with observations from the Human Microbiome Project, where metabolic function varied less than taxonomic composition (Consortium et al., 2012; Lloyd-Price et al., 2017) within samples from a given bodysite.
2.3 Diversity and Novel Sequences
To facilitate characterization of novel sequences we provide GeoDNA, a high-level web interface (Figure 3A) to search raw sequences against our dataset. Users can submit sequences to be processed against a k-mer graph-based representation of our data. Query sequences are mapped to samples and a set of likely sample hits is returned to the user. This interface will allow researchers to probe the diversity in this dataset and rapidly identify the range of various genetic sequences.
While the taxonomic separation between continents was not strictly linear, taxonomic groups could nonetheless be identified as differentially abundant between cities (Supp. Fig S5C). To quantify the strength of the separation, we trained a Linear Support Vector Machine (SVM) classifier to predict the city of origin from binary taxonomic data (present or absent, detection thresholds are described in detail in the methods) reduced to 1,000 dimensions by principle component analysis. We trained the model on 90% of the samples in our dataset and evaluated its performance on the remaining 10%. This classifier achieved 91.4% precision and 91.4% recall on the held out test data. Sample classes were not balanced for training as this would have drastically reduced the available samples but results on test data were not dominated by cities with many samples (Supp. Figure S5A). The classifaction accuracy of the SVM was robust to Gaussian noise. We added gaussian noise to each variable in a test dataset (mean of 0, standard deviation proportional to the mean of the variable) and performed classification using a classifier trained as above. The classifier was robust to noise until the standard deviation of that noise exceeded the mean of the variable at which point performance of the classifier rapidly deteriorated (Supp. Fig S5D). The successful geographic classification of samples demonstrates distinct city-specific trends in the detected taxa, that may enable future forensic biogeographical capacities.
Within cities taxonomic distribution appears to be roughly geographically defined (Supp. Figure S9) but the forensic application of predicting the geographic origin of samples from cities not represented in a training dataset appears challenging. As a preliminary investigation into the predictive forensic capacity of our dataset we also trained classifiers to predict various metadata features on held out cities. Models were trained as above except that all samples from a given city were held out and PCA was used to reduce the data to just 100 dimensions. After training the model was used to classify samples from the held out city for the appropriate metadata features. Mean accuracy by city for features varied (Supp. Table S3) from 38% for climate classification to 65% for city population. Different held-out cities were more predictable than others (Supp. Figure S5B, in particular features for a number of British cities were roughly similar and could be predicted effectively). Predicting different metadata features enables forensic classification of samples which could be assigned to several broadly limiting categories thus reducing a sample’s potential provenance.
However, unique, city-specific taxa are not uniformly distributed (Figure 3B). To quantify this, we developed a score to reflect how endemic a given taxon is within a city, which reflects upon the forensic usefulness of a taxon. The Endemicity score (ES) is defined as ES = P (City|Taxa)P (Taxa|City), with the probabilities both given by empirically observed fractions. This score is designed to simultaneously reflect the chance that a taxon could identify a given city and that that taxon could be found within the given city. A high endemicity score for a taxon in a given city could be evidence of the evolutionary advantage that the taxon has in a particular cities environment. However, neutral evolution of microbes within a particular niche is also possible and the ES alone does not distinguish between these two hypotheses.
Note that while the ES only considers taxa which are found in a city, a forensic classifier could also take advantage of the absence of taxa for a similar metric. ES shows a clear inflection of values at 0.000025 (Figure 3B, inset), below that level we filter scores. Some cities, like Offa (Nigeria), host many unique taxa while others, like Zurich (Switzerland), host fewer endemic species. Large numbers of endemic species in a city may reflect geographic bias in sampling. However, some cities from well sampled continents (e.g., Lisbon, Hong Kong) also host many endemic species which would suggest that ES may indicate interchangeability and local pockets of microbiome variation for some locations.
Predicted viral species (Section 2.5) from samples collected within 10, 100 and 1000 kilometers of one another were agglomerated to examine their planetary distribution at different scales (Figure 3C). At any scale, most viral clusters appear to be weakly cosmopolitan; the majority of their members are found at or near one location, with a few exceptions.
These cosmopolitan viral clusters also exhibit an observational effect; regions in which more samples were recovered share more viral clusters (notably, the northeast United States, Lagos and the southern Brazil). Normalizing the number of shared clusters by the number of raw reads or assembled contigs results in altered patterns of connectivity, suggesting that observations to date does not completely represent viral diversity. We report the connectivity among raw viral cluster but note potential bias.
2.4 Antimicrobial Resistance Genes Form Distinct Clusters
Quantification of antimicrobial diversity and AMRs are key components of global antibiotic stewardship. Yet, predicting antibiotic resistance from genetic sequences alone is challenging, and detection accuracy depends on the class of antibiotics (i.e., some AMR genes are associated to main metabolic pathways while others are uniquely used to metabolize antibiotics). As a first step towards a global survey of antibiotic resistance in urban environments, we mapped reads to known antibiotic resistance genes, using the MegaRES ontology and alignment software. We quantified their relative abundance using reads/kilobase/million mapped reads (RPKM) for 20 classes of antibiotic resistance genes detected in our samples (Figure 4D). 2,210 samples had some sequence which was identified as belonging to an AMR gene, but no consistent core set of genes was identified. The most common classes of antibiotic resistance genes were for macrolides, lincosamides, streptogamines (MLS), and betalactams, yet the most common class of antibiotic resistance genes, 56%, was found in only 56% of the samples where AMR sequence was identified.
Despite being relatively common, antibiotic resistance genes were universally in low abundance compared to functional genes, with RPKM values for resistance classes typically ranging from 0.1 – 1 compared to values of 10 – 100 for typical housekeeping genes (AMR classes contain many genes so RPKM values may be lower than they would be for individual genes). In spite of the low abundance of the genes themselves, some samples contained sequences from hundreds of distinct AMR genes. Clusters of high AMR diversity were not evenly distributed across cities (Figure 4E). Some cities had more resistance genes identified on average (15-20X) than others (e.g. Offa) while other cities had bimodal distributions (e.g. Seoul) where some samples had hundreds of genes while others very few.
We identified AMR genes that were found on assembled contigs. Contigs were identified as either representing bacterial plasmids or chromosomes and were given rough taxonomic assignments. We quantified the number of AMR genes identified on each assembled contig (Figure 5C). Contigs with at least one AMR gene from both chromosomes and plasmids had a mode of three AMR genes. Chromosomal contigs had more AMR genes than plasmid contigs on average but this does not account for the larger size of chromosomal contigs. Three taxa groups, Acinetobacteria, Bacteroidetes, and Tenericutes, had AMR genes predominantly on chromosol contigs. One taxa group, Deinococcus-Thermus, had more AMR genes identified on plasmid contigs than chromosomes. AMR genes from three taxa groups, Cyanobacteria, Deinococcus-Thermus, and Tenericutes, had geographically constrained ranges while AMRs from other taxa had cosmopolitan ranges.
Prominent AMR genes identified on plasmid contigs formed co-occurring clusters (Figure 5A). We identified thirteen clusters of at least two AMR genes that consistently occurred on the same contigs and were found on at least 1,000 contigs. All of these genes were found to have cosmopolitan ranges but were not necessarily found in equal abundances across regions. A<R genes in clusters were of many different types but there were notable clusters for vancomycin resistance, tetracycline resistance, and Multi-drug efflux complexes. Some clusters were dominated by a single type of resistance gene while others had genes with multiple sources.
We performed a rarefaction analysis on the set of all resistance genes in the dataset, which we call the “panresistome” (Figure 4A). Similar to the rate of detected species, the panresistome also shows an open slope with an expected rate of discovery of 10 previously unseen AMR gene per 100 samples. Given that AMR gene databases are rapidly expanding and that no AMR genes were found in many samples it is likely that future analyses will identify many more resistance genes in this data.
Additionally, AMR genes show a “neighbourhood” effect within samples that are geographically proximal analogous to the effect seen for taxonomic composition (Figure 4B). Excluding samples where no AMR genes were detected, the Jaccard distance between sets of AMR genes increases with distance for pairs of samples in the same city. As with taxonomic composition the overall effect is weak and noisy but significant.
2.5 Discovery of Novel Viral Genomes and Host Interactions
We investigated assembled contigs from our samples to identify 16,584 predicted uncultivated viral genomes (UViGs). Taxonomic analysis of predicted UViGS to identify viral species yielded 2,009 clusters containing a total of 6,979 UViGs and 9,605 singleton UViGs for a total of 11,614 predicted viral species.
We compared the predicted species to known viral sequences in the IMG/VR system which contains viral genomes from isolates, a curated set of prophages and 730k UViGs from other studies. Of the 11,614 species discovered in our data 94.1% did not match any viral sequence in IMG/VR (Paez-Espino et al., 2019) at the species level. This suggests that urban microbiomes contain significant diversity not observed in other environments.
Next, we attempted to identify possible hosts for our predicted UViGs. For the 686 species with similar sequences in IMG/VR we projected known host information onto 2,064 MetaSUB UViGs. Additionally, we used CRISPR-Cas spacer matches in the IMG/M system to assign possible hosts to a further 1,915 predicted viral species. Finally, we used a database of 20 million metagenome derived CRISPR spacers to provide further rough taxonomic assignments. We note that virus-host associations were found for 41% of species in the core microbiome (Figure 1A).
2.6 Quality control confirms other results
2.6.1 Unmapped DNA is not similar to any known sequence
A large proportion of the reads in our samples were not mapped to any references sequences. There are three major reasons why a fragment of DNA would not be classified in our analysis 1) The DNA originated from a non-human and non-microbial species which would not be present in the databases we used for classification 2) Our classifier (KrakenUniq) failed to classify a DNA fragment that was in the database due to slight mismatch 3) The DNA fragment is novel and not represented in any existing database. Explanations (1) and (2) are essentially drawbacks of the database and computational model used and we can quantify them by mapping reads using a more sensitive aligner to a larger database, such as BLASTn (Altschul et al., 1990), or ensemble methods for analysis (McIntyre et al., 2017). To estimate the proportion of reads which could be assigned, we took 10k read subsets from the non-human reads of 152 samples and mapped the subsets to the NCBI-NT database with BLASTn. On average, 46.3% of non-human reads could be mapped to some known reference sequence (>90% average BLAST nucleotide identity roughly equivalent to the rank of Genus or Family) compared to an average 41.3% of reads mapped using our approach with KrakenUniq. We note that our approach to estimate the fraction of reads that could be classified using BLASTn does not account for hits to low quality taxa which would ultimately be discarded in our pipeline, and so represents a worst-case comparison. Explanation (3) is altogether more interesting and we refer to this DNA as true unclassified DNA. In this analysis we do not seek to quantify the origins of true unclassified DNA except to postulate that it may derive from novel species as have been identified in other similar studies.
2.6.2 Sequencing quality scores show expected trends
We measured sequencing quality based on 5 metrics: number of reads obtained from a sample, GC content (taken after removing human reads), Shannon’s entropy of k-mers (from 10,000 reads sampled from each sample), post PCR Qubit score, and recorded DNA concentration before PCR. We observed good separation of negative and positive controls based on both PCR Qubit and k-mer entropy (Supp. Figure S7). Distributions of DNA concentration and the number of reads were as expected. GC content was broadly distributed for negative controls while positive controls were tightly clustered, expected since positive controls have a consistent taxonomic profile. Comparing the number of reads before and after quality control did not reveal any major outliers.
2.6.3 Batch effect appears minimal
A major concern for this study was that our results could be conflated with batch effect. For practical purposes most of the flowcells used to sequence our samples contained samples from a relatively confined region. The median flowcell used in our study contained samples from just 3 cities and 2 continents. However, two flowcells covered 18 cities from 5 or 6 continents respectively. When samples from these flowcells were plotted using UMAP (see Section 2.2 for details) the major global trends we described were recapitulated (Supp. Figure S8A). Further, when plotting samples by PCR qubit and k-mer entropy (the two metrics that most reliably separated our positive and negative controls) and overlaying the flowcell used to sequence each sample only one outlier flowcell was identified and this flowcell was used to sequence a large number of background control samples (Supp. Figure S8B). Plots of the number of reads against city of origin and surface material (Supp. Figure S8C & D) showed a stable distribution of reads across cities. Analogous plots of PCR Qubit scores were less stable than the number of reads but showed a clear drop for control samples (Supp. Figure S8E & F). These results leads us to conclude that batch effects are likely to be minimal. However, future sequencing will be performed with samples more evenly distributed across flowcells which will be more robust.
2.6.4 Strain contamination is rare or absent
Despite good separation of positive and negative controls (see Section 6.3.1) we identified several species in our negative controls which were also identified as prominent taxa in the data-set as a whole (See Section 2.1). Our dilemma was that a microbial species that is common in the urban environment might also reasonably be expected to be common in the lab environment. In general negative controls had lower k-mer complexity, fewer reads, and lower post PCR Qubit scores than case samples and no major flowcell specific species were observed. Similarly positive control samples were not heavily contaminated. These results suggest samples are high quality but do not systematically exclude the possibility of contamination.
Previous studies have reported that microbial species whose relative abundance is negatively correlated with DNA concentration may be contaminants. We observed a number of species that were negatively correlated with DNA concentration (Supp. Figure S6A) but this distribution followed the same shape (but had a greater magnitude) as a null distribution of uniformly randomly generated relative abundances (Supp. Figure S6B) leading us to conclude that negative correlation may simply be a statistical artifact. We also plotted correlation with DNA concentration against each species mean relative abundance across the entire data-set (Supp. Figure S6C). Species that were negatively correlated with DNA concentration were clearly more abundant than uncorrelated species, this suggests that there may be a jackpot effect for prominent species in samples with lower concentrations of DNA but is not generally consistent with contamination.
Finally, we compared assemblies from negative controls to assemblies from our case samples searching for regions of high similarity that could be from the same microbial strain. We reasoned that uncontaminated samples may reasonable contain the same species as negative controls but were less likely to contain identical strains. Only 137 case samples were observed to have any sequence with high similarity to an assembled sequence from a negative control (8,000 base pairs minimum of 99.99% identity). The identified sequences were principally from Bradyrhizobium and Cutibacterium. Since these genera are core taxa (See Section 2.1) observed in nearly every sample but high similarity was only identified in a few samples we elected not to remove species from these genera from case samples.
2.6.5 K-Mer based metrics correlate with taxonomic metrics
We found clear correlations between three pairwise distance metrics (Supp. Figure S3A, B, C): k-mer based Jaccard distance (MASH), taxonomic Jaccard distance, and taxonomic Jensen-Shannon divergence. This suggests that taxonomic variation reflects meaningful variation in the underlying sequence in a sample.
We also compared alpha diversity metrics (Supp. Figure S3D): Shannon entropy of k-mers, and Shannon entropy of taxonomic profiles. As with pairwise distances these metrics were correlated though noise was present. This noise may reflect sub-species taxonomic variation in our samples.
3 Discussion
Here, we have generated a comprehensive global metagenomic collection and identified a core urban microbiome present across 58 cities, consisting largely of bacterial and fungal species (Section 2.1). We also identified species that are geographically constrained and showed that these can be used to determine a samples city of origin (Section 2.2). Many of these species are associated with commensal microbiomes from human skin and airways, but we observed that urban microbiomes are nevertheless distinct from both human and soil microbiomes. Notably, no species from the Bacteroidetes, a prominent group of human commensal organisms (Eckburg et al., 2005; Qin et al., 2010), was identified in the core urban microbiome.
Unique taxonomic composition and association with covariates specific to the urban environment suggest that urban microbiomes should be treated as ecologically distinct from both surrounding soil microbiomes and human commensal microbiomes. Though these microbiomes undoubtedly interact with the urban environment, they nonetheless represent distinct ecological niches with different genetic profiles. While our metadata covariates were associated with the principal variation in our samples, they do not explain a large proportion of the observed variance. It remains to be determined whether variation is essentially a stochastic process or if a deeper analysis of our covariates proves more fruitful. We have observed that less important principal components (roughly PCs 10-100) are generally less associated with metadata covariates but that PCs 1-3 do not adequately describe the data alone. This is a pattern that was observed in the human microbiome project as well, where minor PCs (such as our Figure 2B) were required to separate samples from closely related body sites.
Much of the urban microbiome likely represents novel diversity as our samples contain a significant proportion of unclassified DNA. As noted in in Section 2.6 more sensitive methodology only marginally increases the proportion of DNA that can be classified. We consider the DNA which would not be classified by a sensitive technique to be true unclassified DNA and postulate that it may derive from novel genes or species. Given that our samples did not closely resemble human commensal microbiomes or soil samples it is possible this represents novel urban DNA sequences.
Additionally, our discovery of a large number of novel viral sequences in our data suggests that there are likely to be additional novel taxa from other domains. The fraction of predicted viral sequences which belonged to previously unobserved taxa was particularly high in our study (94.1%) however taxonomic associations of these viruses to observed microbial hosts suggests these results are not spurious. This rate of discovery may prove prescient for novel taxa in other domains and novel discovery of taxa may help to reduce the large fraction of DNA which cannot currently be classified.
Many of the identified taxa are frequently implicated as infectious agents in a clinical setting including specific Staphylococcus, Streptococcus, Corynebacterium, Klebsiella and Enterobacter species. There is no clear indication that the species identified in the urban environment are pathogenic and further indepth study is necessary to determine the clinical impact of urban microbiomes. This includes microbial culture studies, specifically searching for virulence factors and performing strain-level characterization. Seasonal variation also remains open to study as the majority of the samples collected here were from two global City Sampling Days (June 21, 2016 and 2017). Further studies, some generating novel data, will need to explore whether the core microbiome shifts over the course of the year, with particular interest in the role of the microbiome in flu transmission (Cáliz et al., 2018; Korownyk et al., 2018).
As metagenomics and next-generation sequencing becomes more available for clinical (Wilson et al., 2019) and municipal use (Hendriksen et al., 2019), it is essential to contextualize the AMR markers or presence of new species and strains within a global and longitudinal context. The most common AMR genes were found for two classes of antibiotic: MLS and beta-lactams. MLS represents macrolides, lincosamides and streptogramins, which are three groups of antibiotics with a mechanism of action of inhibiting bacterial protein synthesis. Macrolides, with strong Gram-positive and limited Gram-negative coverage, are prevalently used to treat upper respiratory, skin, soft tissue and sexually transmitted infections amongst others. Beta-lactam antibiotics are a major class of antibiotics including penicillins, cephalosporins, monobactams, carbapenems and carbacephems that are all used to treat a wide array of infections. Antimicrobial resistance has surged due to the selection pressure of widespread use of antibiotics and is now a global health issue plaguing communities and hospitals worldwide. Antimicrobial resistance genes are thought to spread from a variety of sources including hospitals, agriculture and water (Bougnom and Piddock, 2017; Klein et al., 2018). The antimicrobial classes particularly impacted by resistance include beta-lactamases, gylcopeptides and fluoroquinolones (www.ncbi.nlm.nih.gov/pmc/articles/PMC3498059/), all of which we found antimicrobial resistance genes for across our samples. We found that there was uneven distribution of AMR genes across cities. This could be the result of some of combination of different levels of antibiotic use, differences in the urban geography between cities (population density, presence of untreated wastewater etc), or reflect the background microbiome in different places in the world. Techniques to estimate antibiotic resistance from sequencing data remain an area of intense research as certain classes of AMR gene (ie. fluoroquinolones) are sensitive to small mutations and it is possible that our methods misestimates true resistance. Further research is needed to fully explore AMR genes in the urban environment including culture studies which directly measure the phenotype of resistance.
In summary, this study presents a first molecular atlas of urban and mass-transit metagenomics from across the world. By facilitating large scale epidemiological comparisons it is a first critical step towards quantifying the clinical role of environmental microbiomes and provides requisite data for tracking changes in ecology or virulence. Previous studies have already demonstrated a role for precision clinical metagenomics in managing infectious disease and global health (Afshinnekoo et al., 2017; Gardy and Loman, 2018; Ladner et al., 2019). As an atlas this data has the potential to aid physicians, public health officers, government officials, and others in diagnosis, clinical decision making and policy within their communities.
3.1 Open Science
The MetaSUB dataset is built and organized for full accessibility to other researchers. This is consistent with the concept of Open Science. Specifically, we built our study with the FAIR principles in mind: Findable, Accessible, Interoperable and Reusable.
To make our study reproducible, we released an open source version-controlled pipeline (documented here and in more detail in a separate publication) called the MetaSUB Core Analysis Pipeline (CAP). The CAP is intended to improve the reproducibility of our findings by making it easy to apply a number of analyses consistently to a large dataset. This pipeline includes all steps from extracting data from raw sequence data to producing refined results like taxonomic and functional profiles. The CAP itself is principally composed of other open peer-reviewed scientific tools, with only a few custom scripts for mundane tasks. Every tool in the CAP is open source with a permissive license. The CAP is available as a docker container for easier installation in some instances and all databases used in the CAP are available for public download. The CAP is versioned and includes all necessary databases allowing researchers to replicate results. The CAP is not designed to produce highly novel results but is meant to be a good practice agglomeration of open source tools.
However, the output of the CAP still consists of a number of different output formats with multiple files for each sample. To make our results more reproducible and accessible, we have developed a program to condense the outputs of the Core Analysis Pipeline into a condensed data-packet. This data packet contains results as a series of Tidy-style data tables with descriptions. The advantage of this set-up is that result tables for an entire dataset can be parsed with a single command in most high level analysis languages like Python and R. This package also contains Python utilities for parsing and analyzing data packets which streamlines most of the boilerplate tasks of data analysis. All development of the CAP and data packet builder (Capalyzer) package is open source and permissively licensed.
In addition to general purpose data analysis tools essentially all analysis in this paper is available as a series of Jupyter notebooks. Our hope is that these notebooks allow researchers to reproduce our results, build upon our results in different contexts, and better understand precisely how we arrived at our conclusions. By providing the exact source used to generate our analyses and figures, we also hope to be able to quickly incorporate new data or correct any mistakes that might be identified.
For less technical purposes, we also provide web-based interactive visualizations of our dataset (typically broken into city-specific groups). These visualizations are intended to provide a quick reference for major results as well as an exploratory platform for generating novel hypotheses and serendipitous discovery. The web platform used, MetaGenScope, is open source, permissively licensed, and can be run on a moderately powerful machine (though its output relies on results from the MetaSUB CAP).
Our hope is that by making our dataset open and easily accessible to other researchers the scientific community can more rapidly generate and test hypotheses. One of the core goals of the MetaSUB consortium is to build a dataset that benefits public health. As the project develops we want to make our data easy to use and access for clinicians and public health officials who may not have computational or microbiological expertise. We intend to continue to build tooling that supports these goals.
3.2 CAMDA
Since 2017 MetaSUB has partnered with the Critical Assessment of Massive Data Analysis program (CAMDA) camda.info, a full conference track at the Intelligent Systems for Molecular Biology (ISMB) Conference. At this venue a subset of the MetaSUB data were released to the CAMDA community in the form of two challenges: ‘The MetaSUB Inter-City Challenge’ in 2017 and ‘The MetaSUB Forensics Challenge’ in 2018. The challenges addressed the problem of geographically locating samples with Open Science tools. This Open Science approach has generated multiple interesting results and concepts relating to urban microbiomics, resulting in several publications biologydirect.biomedcentral.com/articles/collections/camdaproc. The partnership is continued in 2019 with ‘The Metagenomic Forensics Challenge’ where MetaSUB data has been complemented by data from EMP (Thompson et al., 2017) and other studies (Hsu et al., 2016).
4 Data Access
Raw sequencing reads from this study contain significant amounts of human DNA and cannot yet be made public. However, reads with the majority of human DNA filtered and low quality bases removed are available for download from Wasabi (an Amazon S3 clone) with individual URLs located here: https://github.com/MetaSUB/metasub_utils. In addition to raw reads higher level results (e.g. taxonomic profiles, functional pathways, etc.) are available in the MetaSUB data packet also available for download from Wasabi. For instructional purposes we also provide a simplified data packet for teaching which includes balanced numbers of samples from each city and completely filled metadata tables.
Interactive data visualizations are available on https://www.metagenscope.com and GeoDNA, an interface to search query DNA sequences against MetaSUB samples, is available at (dnaloc.ethz.ch/). MetaSUB metadata is available in the data-packet or may be downloaded from https://github.com/MetaSUB/MetaSUB-metadata. Programs used for analysis of data may be found at https://github.com/MetaSUB/MetaSUB_CAP and https://github.com/dcdanko/capalyzer. Jupyter notebooks used to generate the figures and statistics in this study can be found at https://www.github.com/MetaSUB/main_paper_figures. Additional tools and resources are described here https://github.com/MetaSUB/bioinformatics_management.
6 Methods
6.1 Metadata Collection and Cleaning
Metadata from individual cities was collected from a standardized form and set of fields. The principle fields collected were the location of sampling, the material being sampled, the type of object being sampled, the elevation above or below ground, and the station or line where the sample was collected. However, several cities were unable to use the provided apps for various reasons and submitted their metadata as separate spreadsheets. Additionally, certain metadata features, such as those related to sequencing and quality control, were added after initial sample collection.
To collate various metadata sources, we built a publicly available program which assembled a large master spreadsheet with consistent sample UUIDs. After assembling the originally collected data attributes we added normalized attributes based on the original metadata to account for surface material, control status, and features of individual cities. A full description of ontologies used is provided as part of the collating program.
6.2 Sample Collection and Preparation
To obtain a comprehensive picture of microbial communities within a sample it is essential to choose a sampling method which absorbs and preserves biological materials during sampling, transport and storage until DNA extraction. The effectiveness of a swab may be influenced by a number of factors, including most importantly the material of the swab tip affecting the rate at which bacteria are absorbed during the sampling process. Furthermore, the design of the transport tube and DNA preserving liquids affect the integrity of the material during transport. Finally, the amount of background contamination identified for different products should be taken into account. In this study we have benchmarked various types of swabs and DNA preservative tubes, including Copan Liquid Amies Elution Swab (ESwab, Copan Diagnostics, Cat.:480C) referred to as ‘copan swab’ and Isohelix Swabs (Mini-Swab, Isohelix Cat.:MS-02) referred to as ‘isohelix swabs’, which were combined with 2D Thermo ScientificTM MatrixTM storage tubes (3741-WP1D-BR/Matrix 1.0 ml/EA) referred to as ‘matrix tube’, which have been prefilled with the preservative liquid Zymo Shield Zymo DNA/RNA ShieldTM (R1100-250) referred to as ‘Zymo shield’. Copan swabs do not contain a preservative and need to be kept on ice after sampling until DNA extraction to prevent DNA degradation. Samples collected with Copan swabs have therefore been transported on dry ice and stored at −80C. Isohelix swabs have been stored in matrix tubes containing 400l Zymo shield preservative. Matrix tubes have been transported at RT and stored at −80C until DNA extraction.
We tested the absorption strength of Copan and Isohelix swabs for various biological and surface materials encountered when sampling subway stations. For a designated sampling area of an office desk, a Isohelix swabs were moistened by submerging the swab for a few seconds in preservative media. The desk area was then swabbed for 3 min. Results were compared to sampling with dry copan swabs, which were similarly used to swab the area for 3 min. Following the same procedure, surfaces in the subway have been swabbed for 3 min during the two global city sampling days in 2016 and 2017.
6.2.1 In-Lab controls CSD2016
As positive lab control we used 30l ZymoBiOMICS Microbial Community standard (Catalog #D6300), which we added to an empty sterile urine cap, followed by swabbing with Copan Liquid Amies Elution Swab (ESwab, Copan Diagnostics, Cat.:480C) for 1.5min / 3 minutes. As negative (background) lab control we used 50l of the final resuspension buffer (MoBio PowerSoil®DNA Isolation Kit, Cat.:12888-100), which we have added to an empty sterile urine cup followed by swabbing for 3 min (Fig. S1). Furthermore, the working space has been swabbed for 1.5 min / 3 min before and after treatment with 10% bleach (Fig. S2) to test for background contamination rates. To identify the background levels of biological material in the air at sample areas, a Copan swab has been held for 1.5 min – 3 min in the air. To estimate the source and amount of contamination in commercial swab and tube products used for MetaSUB, we tested all consumables in triplicates in the sterilized hood (UV light and 10% bleach wiped with ethanol).
6.2.2 DNA Extraction from Isohelix swabs using ZymoBiomics 96 MagBead
The Isohelix swab head and the entire 400 l of DNA/RNA Shield-solubilized sample were transferred into ZR BashingBead Lysis Tubes (0.1 & 0.5 mm) (Cat# S6012-50) to which an additional 600 l of DNA/RNA Shield was added. Mechanical lysis using bead beating was performed on a maximum of 18 samples simultaneously using the Scientific Industries Vortex-Genie 2 with Horizontal-(24) Microtube Adapter (Cat # SI-0236 and SI-H524) at maximum power for 40 minutes. The resulting lysate (400 l) was transferred to NuncTM 96-Well Polypropylene DeepWell Storage Plates (Cat # 278743), followed by DNA extraction using the ZymoBIOMICS 96 MagBead Kit (Lysis Tubes) (Catalog # D4308) on the Hamilton Star according to manufacturer instructions.
6.2.3 DNA extraction from Copan swabs using MoBio PowerSoil®DNA
Droplets in the Copan Liquid Amies Elution Swab tube (ESwab, Copan Diagnostics, Cat.:480C (http://goo.gl/8a9uCP)) were spun down at 300rpm/1min. Next, the swab pad was transferred to a Mo-Bio PowerSoil®DNA vial containing beads using sterile scissors, which we sterilized by flaming with 100% ethanol. The remaining 400-500l Copan Amines liquid has been transferred into an Eppendorf tube and centrifuged at full speed to collect bacteria and debris in a pellet. The pellet was finally transferred to the same MoBio PowerSoil®DNA vial also containing the corresponding swab pad. MoBio PowerSoil®DNA Isolation Kit, Cat.:12888-100 (https://goo.gl/65rcn2) was used according to manufacturer’s instructions except for the following modifications:
Both swab and pellet have been re-suspended with 135l C1 buffer (MoBio PowerSoil®DNA). Sample homogenization was performed using either TissueLyser II (Qiagen) with 2 cycles of 3 minutes at 30Hz (https://goo.gl/hBg8Lb), or using the Vortex-Genie 2 (Vortex Catalog #13000-V1-24) adaptor and vortex at maximum speed for 10 minutes. The sequencing centers in Stockholm and Shanghai used different procedures for homogenization. Stockholm used a method based on MPI FASTPREP, while Shanghai added 0.6 grams of 100-micron zirconium-silica beads to 2ml tubes containing the swab pad and the media, followed by bead beating for 1 min. The eluted samples have been additionally purified and concentrated by Beckmann Coulter Agencourt AMPure XP (Cat.:A63881) purification (1.8X) and eluted into 121 – 501 elution buffer. Subsequently, DNA was quantified using Qubit® dsDNA HS Assay (Catalog #Q32854).
6.2.4 DNA extraction using Promega Maxwell
We added 300l Promega Maxwell Lysis buffer and 30l Promega Maxwell Proteinase K to Copan swab heads or Isohelix swab heads and transferred the swabs back to their respective collection tube. For lysis the sample tubes containing the swabs and the lysis mixture were incubated in a water bath at 54C for 30min. Following lysis, Copan swab heads were cut off their stem using sterile scissors and transferred into a filter tube (Promega XXX). The filter containing the swab was placed into a 2ml Eppendorf tube and spun down at full speed for 2min. This step is necessary since the Copan swab material consists of a foam, which harbors the main liquid containing the extracted DNA. Next, the eluate has been combined with the corresponding sample tube media and added to the first well of the cartridge (Maxwell® RSC Buccal Swab kit AS1640). Cartridges were processed using the Maxwell® RSC Instrument (AS4500) following the manufacturer’s default instructions. Extracted DNA was eluted in 50l Promega Elution Buffer and stored at −80C.
The matrix tubes containing the Isohelix swabs and the lysis buffer have been vortexed at full speed for one minute. The Isohelix swab head material is a non-porous material, which allows for easy collection of the lysate. We transferred the lysate to the first cartridge of the Maxwell® RSC Blood DNA KitAS1400 using syringes (BD 3 mL Syringes with 18Gx 1.5” Luer Lok Tip Blunt Fill Needles) and ranthe Promega Maxwell using the Blood program according to manufacturer’s instructions. Samples were subsequently eluted in 50l elution buffer and stored at −80C.
Pilot samples collected in Barcelona and Stockholm were prepared for NGS analysis using QIAGEN QIAseq FX DNA Library Kit. Samples from CSD2017 and CSD2018 have been prepped at HudsonAlpha Genome Center described by Afshinnekoo et al. (2015).
6.3 Quality Control
6.3.1 Sequencing quality
We measured sequencing quality based on 5 metrics: number of reads obtained from a sample, GC content, Shannon’s entropy of k-mers, post PCR Qubit score, and recorded DNA concentration before PCR. The number of reads in each sample was counted both before and after quality control, we used the number of reads after quality control for our results though the difference was slight. GC content was estimated from 100,000 reads in each sample after low quality DNA and human reads had been removed. Shannon’s entropy of k-mers was estimated from 10,000 reads taken from each samples. PCR Qubit score and DNA concentration are described in the wet lab methods.
6.3.2 Strain Contamination
We used BLASTn to align nucelotide assemblies from case samples to control samples. We used a threshold of 8,000 base pairs and 99.99% identity as a minimum to consider two sequences homologous. This threshold was chosen to be sensitive without solely capturing conserved regions. We identified all connected groups of homologous sequences and found approximate taxonomic identifications by aligning contigs to NCBI-NT using BLASTn searching for 90% nucleotide identity over half the length of the longest contig in each group.
6.3.3 K-Mer Based Analyses
We generated 31-mer profiles for raw reads using Jellyfish. All k-mers that occurred at least twice in a given sample were retained. We also generated MASH sketches from the non-human reads of each sample with 10 million unique minimizers per sketch.
We calculated the Shannon’s entropy of k-mers by sampling 31-mers from a uniform 10,000 reads per sample. Shannon’s entropy of taxonomic profiles was calculated using the CAPalyzer package (Section 4).
6.3.4 Sequence Preprocessing
Sequence data were processed with AdapterRemoval (v2.17, Schubert et al. (2016)) to remove low quality reads and reads with ambiguous bases. Subsequently reads were aligned to the human genome (hg38, including alternate contigs) using Bowtie2 (v2.3.0, fast preset, Langmead and Steven L Salzberg (2013)). Read pairs where both ends mapped to the human genome were separated from read pairs where neither mate mapped. Read pairs where only one mate mapped were discarded. Hereafter, we refer to the read sets as human reads and non-human reads.
6.4 Computational Analysis
6.4.1 Taxonomic Analysis
We generated taxonomic profiles by processing non-human reads with KrakenUniq (v0.3.2 Breitwieser et al. (2018)) using a database based on all draft and reference genomes in RefSeq Microbial (bacteria, fungi, virus, and archaea) ca. March 2017. KrakenUniq was selected because it was highly performant, as it has been demonstrated to be comparable or having higher sensitivity than the best tools identified in a recent benchmarking study (McIntyre et al. (2017)) on the same comparative dataset. In addition, KrakenUniq allows for tunable specificity. KrakenUniq identifies k-mers that are unique to particular taxa in a database. Reads are broken into k-mers and searched against this database. Finally, the taxonomic makeup of a sample is given by identifying the taxa with the greatest leaf to ancestor weight.
KrakenUniq reports the number of unique marker k-mers assigned to each taxon, as well as the total number of reads, the fraction of available marker k-mers found, and the mean copy number of those k-mers. We found that requiring more k-mers to identify a species resulted in a roughly linear decrease in the total number of species identified without a plateau or any other clear point to set a threshold (Supp. Figure S2A). In an ongoing but unpublished clinical study we have used a threshold of 512 marker k-mers to accurately recapitulate the results of culturing while identifying few species which were not cultured. Since false positives are less problematic in the current study than in a clinical study and because we could use our large number of samples as a partially orthogonal confirmation we chose less strict thresholds for KrakenUniq in this study.
At a minimum we required three reads assigned to a taxa with 64 unique marker k-mers. This setting captures a group of taxa with low abundance but reasonable (∼ 10-20%) coverage of the k-mers in their marker set (Supp. Figure S2C). However, this also allows for a number of taxa with very high (105) duplication of the identified marker k-mers and very few k-mers per read which we believe is biologically implausible (Supp. Figure S2D). We filtered these taxa by applying a further filter which required that the number of reads not exceed times the number of unique k-mers, unless the set of unique k-mers was saturated (> 90% completeness). We include a full list of all taxonomic calls from all samples including diagnostic values for each call. We do not attempt to classify reads below the species level in this study.
We further evaluated prominent taxonomic classifications for sequence complexity and genome coverage. For each microbe evaluated we calculated two indices generated using a random subset of 152 samples: the average topological entropy of reads assigned to the microbe and the Gini-coefficient of read positions on the microbial genome. For brevity we refer to these as mean sequence entropy (MSE) and coverage equality (CE). The formula for topological entropy of a DNA sequence is described by ?. Values close to 0 correspond to low-complexity sequences and values near 1 are high complexity. In this work we use a word size of 3 with an overall sequence length of 64 since this readily fits into our reads. To find the MSE of a microbial classification we take the arithmetic mean of the topological entropy of all reads that map to a given microbial genome in a sample. The Gini-coefficient is a classic economic measure of income inequality. We repurpose it here to evaluate the evenness of read coverage over a microbial classification. Reads mapping to a microbial genome are assigned to a contiguous 10kbp bin and the Gini-coefficient of all bins is calculated. Like MSE, the Gini-coefficient is bounded in [0, 1]. Lower values indicate greater inequalit, very low values indicate that a taxon may be misidentified from conserved and near conserved regions. We downloaded one representative genome per species evaluated and mapped all reads from samples to using Bowtie2 (sensitive-local preset). Indices were processed from alignments using a custom script. Species classifications with an average MSE less than 0.75 or CE less than 0.1 were flagged.
To determine relative abundance of taxa where applicable we rarefied samples to 100,000 classified reads, computed the proportion of reads assigned to each taxon, and took the distribution of values from all samples. This was the minimum number of reads sufficient to maintain taxonomic richness (Supp. Figure S2B). We chose rarefaction based on the study by Weiss et al. (2017), showing that rarefaction effectively estimates relative abundance. Note that we use the term prevalence to describe the fraction of samples where a given taxon is found at any abundance and we use the term relative abundance to describe the fraction of DNA in a sample from a given taxon.
We compared our samples to metagenomic samples from the Human Microbiome Project and a metagenomic study of European soil samples using MASH (Ondov et al., 2016), a fast k-mer based comparison tool. We built MASH sketches from all samples with 10 million unique k-mers to ensure a sensitive and accurate comparison. We used MASH’s built-in Jaccard distance function to generate distances between our samples and HMP samples. We then took the distribution of distances to each particular human commensal community as a proxy for the similarity of our samples to a given human body site.
We also compared our samples to HMP and soil samples using taxonomic profiles generated by MetaPhlAn v2.0 (Segata et al., 2012). We generated taxonomic profiles from non-human reads using MetaPhlAn v2.0 and found the cosine similarity between all pairs of samples.
We used the Microbe Directory (Shaaban et al., 2018) to annotate taxonomic calls. The Microbe Directory is a hand curated, machine readable, database of functional annotations for 5,000 microbial species.
6.4.2 Functional Analysis
We analyzed the metabolic functions in each of our samples by processing non-human reads with HU-MAnN2 (Franzosa et al., 2018). We aligned all reads to UniRef90 using DIAMOND (Buchfink et al., 2014)and used HUMAnN2 to produce estimate of pathway abundance and completeness. We filtered all pathways that were less than 50% covered in a given sample but otherwise took the reported pathway abundance as is after relative abundance normalization (using HUMAnN2’s attached script).
High level categories of functional pathways were found by grouping postively correlated pathways and manually annotating resulting clusters.
6.5 Assembly and Plasmid Annotations
All samples were assembled using metaSPAdes (v3.8.1 Nurk et al. (2017)) with default settings. Assembled scaffolds of at least 1,500bp of length were annotated using PlasFlow (v1.1 Krawczyk et al. (2018)) using default settings. PlasFlow predicts whether a contigs is likely from a chromosome or plasmid and gives a rough taxonomic annotation. Predicting which sequences are from plasmids is a difficult problem and some annotations may be incorrect.
6.5.1 Analysis of Antimicrobial Resistance Genes
We generated profiles of antimicrobial resistance genes using MegaRes (v1.0.1, Lakin et al. (2017)). To generate profiles from MegaRes, we mapped non-human reads to the MegaRes database using Bowtie2 (v2.3.0, very-sensitive presets, Langmead and Steven L Salzberg (2013)). Subsequently, alignments were analyzed using ResistomeAnalyzer (commit 15a52dd github.com/cdeanj/resistomeanalyzer) and normalized by total reads per sample and gene length to give RPKMs. MegaRes includes an ontology grouping resistance genes into gene classes, AMR mechanisms, and gene groups. AMR detection remains a difficult problem and we note that detection of a homologous sequence to a known AMR gene does not necessarily imply an equivalent resistance in our samples. Currently, the gold standard for detecting AMR is via culturing.
Known AMR genes can come from gene families with homologous regions of sequence. To reduce spurious mapping from gene homology we used BLASTn to align all MegaRes AMR genes against themselves. We considered any connected group of genes with an average nucleotide identity of 80% across 50% of the gene length as a set of potentially confounded genes. We collapsed all such groups into a single pseudogene with the mean abundance of all constituent genes. Before clustering genes we removed all genes which were annotated as requiring SNP verification to predict resistance.
In addition to MegaRes we mapped non-human reads from all samples to the amino acid gene sequences in the Comprehensive Antibiotic Resistance Database (McArthur et al., 2013) using DIAMOND. While we do not use this analysis explicitly in this study we provide the results as a data table.
Assembled contigs were annotated for AMR genes using metaProdigal (Hyatt et al., 2010), HMMER3 (Eddy, 2011), and ResFam (Gibson et al., 2015) as described by Rahman et al. (2018). All predicted gene annotations with an e-value higher than 10-10 were discarded.
6.5.2 Beta Diversity
Intersample (beta) diversity was measured by using Jaccard distances. We note that Jaccard distances do not use relative abundance information. Matrices of Jaccard distances were produced using built in SciPy functions treating all elements greater than 0 as present. Hierarchical clustering (average linkage) was performed on the matrix of Jaccard distances using SciPy (https://www.scipy.org/).
Dimensionality reduction of taxonomic and functional profiles was performed using UMAP (McInnes et al., 2018) on the matrix of Jaccard distances with 100 neighbours (UMAP-learn package, random seed of 42). We did not use Principal Component Analysis as a preprocessing step before UMAP as is sometimes done for high dimensional data.
6.5.3 Alpha Diversity
Intrasample (alpha) diversity was measured by using Species Richness and Shannon’s Entropy. We took species richness as the total number of detected species in a sample after rarefaction to 1 million reads. Shannon’s entropy is robust to sample read depth and accounts for the relative size of each group in diversity estimation. Shannon’s entropy is typically defined as H = Σailog2ai where ai is the relative abundance of taxon i in the sample. For alpha diversity based on k-mers or pathways, we simply substitute the relative abundance of a species for the relative abundance of the relevant type of object.
6.5.4 GeoDNA Sequence Search
For building the sequence graph index, each sample was processed with KMC (version 3, [1]) to convert the reads in FASTA format into lists of k-mer counts, using different values of k ranging from 13 to 19 in increments of 2. All k-mers that contained the character “N” or occurred in a sample less than twice were removed. For each value of k, we built a separate index, consisting of a labeled de Bruijn graph, using an implicit representation of the complete graph and a compressed label representation based on Multiary Binary Relation Wavelet Trees (Multi-BRWT). For further details, we refer to the manuscript [2]. To build the index, for each sample the KMC k-mer count lists were transformed into de Bruijn graphs, from which path covers in the form of contig sets were extracted and stored as intermediate FASTA files. The contig sets of each sample were then transformed into annotation columns (one column per sample) by mapping them onto an implicit complete de Bruijn graph of order k. All annotation columns were then merged into a joint annotation matrix and transformed into Multi-BRWT format. Finally, the topology of the Multi-BRWT representation was optimized by relaxing its internal tree arity constraints to allow for a maximum arity of 40.
6.5.5 Viral Discovery
We followed the protocal described by Paez-Espino et al. (2017). Briefly, we used an expanded and curated set of viral protein families (VPFs) as bait in combination with recommended filtering steps to identify 16,584 UViGs directly from all MetaSUB metagenomic assemblies greater than 5kb. Then, the UViGs were clustered with the content of the IMG/VR system (a total of over 730k viral sequences including isolate viruses, prophages, and UViGs from all kind of habitats). The clustering step relied on a sequence-based classification framework (based on 95% sequence identity across 85% of the shortest sequence length) followed by the markov clustering (mcl). This approach yielded 2,009 viral clusters (ranging from 2-611 members) and 9,605 singletons (or viral clusters of 1 member), sequences that failed to cluster with any sequence from the dataset or the references from IMG/VR, resulting in a total of 11,614 vOTUs. We define viral species from vOTUs as sequences sharing at least 95% identity over 85% of their length. Out of this total MetaSUB viral diversity, only 686 vOTUs clustered with any known viral sequence in IMG/VR.
6.5.6 Identifying Host Virus Interactions
We used two computational methods to reveal putative host-virus connections (Paez-Espino et al., 2016). (1) For the 686 vOTUs that clustered with viral sequences from the IMG/VR system, we projected the known host information to all the members of the group (total of 2,064 MetaSUB UViGs). (2) We used bacterial/archaeal CRISPR-Cas spacer matches (from the IMG/M 1.1 million isolate spacer database) to the UViGs (allowing only for 1 SNP over the whole spacer length) to assigned a host to 1,915 MetaSUB vOTUs. Additionally, we also used a database of over 20 million CRISPR-Cas spacers identified from metagenomic contigs from the IMG/M system with taxonomy assigned. Since some of these spacers may derive from short contigs these results should be interpreted with caution.
7 Contributing Members of the MetaSUB Consortium
Natasha Abdullah, Marcos Abraao, Ait-hamlat Adel, Muhammad Afaq, Ebrahim Afshinnekoo, Affifah Saadah Ahmad Kassim, Sofia Ahsanuddin, Faisal S Al-Quaddoomi, Ireen Alam, Gabriela E Albuquerque, Kalyn Ali, Josue Alicea, Chettouh Allaeddine, Lucia E Alvarado-Arnez, Sarh Aly, Jennifer Amachee, Maria G Amorim, Majelia Ampadu, Muhammad Al-Fath Amran, Nala An, Watson Andrew, Michael Angelov, Benoiston Anne-Sophie, Thiébaut Antonin, Catharine Aquino, Álvaro Aranguren, Luiza F Araujo, Jenny Arevalo, Carme Arnan, Fernanda Arredondo, Matthew Arthur, Freddy Asenjo, Fleiss Aubin, Thomas Saw Aung, Nuria Aventin, Sadaf Ayaz, Ana Valeria B Castro, Silva Baburyan, Abd-Manaaf Bakere, Katrin Bakhl, Thais F Bartelli, Erdenetsetseg Batdelger, Kevin Becher, Carla Bello, Hannah Benisty, Joseph Benson, Diego Benítez, Juliana Bernardes, Denis Bertrand, Silvia Beurmann, Daniela Bezdan, Chandrima Bhattacharya, Malay Bhattacharyya, Christina Black, Guillaume Blanc, Ran Blekhman, Brittany Blyther, Kari O Boifot, Stephen Eduard Boja Ruiz, Bazartseren Boldgiv, Kevin Bolzli, Alexia Bordigoni, Sonia Bouchard, Gabriela P Branco, Alessandra Breschi, Christian Brion, Alan Briones, Paulina Buczansla, Catherine M Burke, Daniel J Butler, Alina Butova, Irvind Buttar, Jalia Bynoe, Hiram Caballero, Xiao Wen Cai, Dayana Calderon, Angela Cantillo, Alessandra Carbone, Anais Cardenas, Katerine Carrillo, Laurie Casalot, Ana V Castro, Astred Castro, Sofia Castro, Eduardo Castro-Nallar, Simone Cawthorne, Ana M Cañas, Jonathan Cedillo, Salama Chaker, Jasna Chalangal, Allison Chan, Greselle Charlotte, Anastasia I Chasapi, Aspassia D Chatziefthimiou, Akash K Chavan, Francisco Chavez, Gregory Chem, Jenn-Wei Chen, Michelle Chen, Xiaoqing Chen, Daisy Cheung, Diana Chicas, Dequeker Chloé, Kern Rei Chng, Hira Choudhry, Carl Chrispin, Kianna Ciaramella, Erika Cifuentes, Jake Cohen, David A Coil, Colleen Conger, Cecilia N Cossio, Ana F Costa, Denise Syndercombe Court, Robert W Crawford, Delisia Cuebas, David C Danko, Aaron E Darling, Pujita Das, Lucinda B Davenport, Laurent David, Natalie R Davidson, Gargi Dayama, Paola F De Sessions, Chris K Deng, Youping Deng, Alexandre Desert, Christelle Desnues, Monika Devi, Felipe S Dezem, Emmanuel Dias-Neto, Manoir Dimitri, Timothy Ryan Donahoe, Daisy Donnellan, Sonia Dorado, LaShonda Dorsey, Steven Du, Alexandra Dutan, Marius Dybwad, Naya Eady, Jonathan A Eisen, Miar Elaskandrany, De Lazzari Eleonora, Eran Elhaik, Laine Elodie, Faure Emile, Lennard Epping, Danilo Ercolini, Juan P Escalera-Antezana, Velter Fabienne, Iqra Faiz, Luice Fan, Nadine Farhat, Fazlina Fauzi, Skye Felice, Gabriel Figueroa, Francesca De Filippis, Corsi Flavia, Aubin Fleiss, Denisse Flores, Marcos AS Fonseca, Jonathan Foox, Juan Carlos Forero, Nadalin Francesca, Oteri Francesco, Aaishah Francis, Baudon François, Jacob Friedman, Alina Frolova, Jaime J Fuentes, Josephine Galipon, Laura Garcia, Catalina García, Annie Geiger, Samuel Gerner, Dao Phuong Giang, Donato Giovannelli, Dedan Githae, Liliana Godoy, Samantha Goldman, Gaston H Gonnet, Andrea Gonzalez, Juana Gonzalez, Camila Gonzalez-Poblete, Alexandra B Graf, David C Green, Tranette Gregory, Sophie Guasco, Nika Gurianova, Patrick K H Lee, Iman Hajirasouliha, Sebastien Halary, Arya Hawkins-Zafarnia, Nur Hazlin Hazrin-Chong, Jochen Hecht, Eric Helfrich, Tamera Henry, Mark Hernandez, Samuel Hernandez, David Hess-Homeier, Lauren E Hittle, Nghiem Xuan Hoan, Aliaksei Holik, Chiaki Homma, Irene Hoxie, Michael Huber, Elizabeth Humphries, Stephanie Hyland, Andrea Hässig, Roland Häusler, Nathalie Hüsser, Badamnyambuu Iderzorig, Mizuki Igarashi, Bilal Iqbal, Gregorio Iraola, Shino Ishikawa, Sakura Ishizuka, Riham Islam, Sharah Islam, Kohei Ito, Sota Ito, Takayuki Ito, Tomislav Ivankovic, Tomoki Iwashiro, Sarah Jackson, JoAnn Jacobs, Pathmanathan Jananan, Soojin Jang, Bouly Jean-Pierre, Esmeralda Jiminez, Ayantu Jinfessa, Hyun Woo Joo, Boeri Julia, Auvinet Juliette, Andre Kahles, Takema Kajita, Mikhail Karasikov, Nao Kato, Amrit Kaur, Inderjit Kaur, Frank J Kelly, Vedbar S Khadka, Shaira Khan, Michelle Ki, Gina Kim, Hyung Jun Kim, Sangwan Kim, Ryan J King, Kaymisha Knights, Ellen Koag, Nadezhda Kobko-Litskevitch, Maryna Korshevniuk, Michael Kozhar, Jonas Krebs, Nanami Kubota, Andrii Kuklin, Sheelta S Kumar, Lawrence Kwong, Rachel Kwong, Ingrid Lafontaine, Juliana Lago, Tsoi Ying Lai, Manolo Laiola, Olha Lakhneko, Isha Lamba, Gerardo de Lamotte, Hyunjung Lee, Lucy Lee, Yunmi Lee, Emily Leong, Dagmara Lewandowska, Chenhao Li, Weijun Liang, Moses Lin, Priscilla Lisboa, Anna Litskevitch, Eric Minwei Liu, Tracy Liu, Mayra Arauco Livia, Per Ljungdahl, Yui Him Lo, Sonia Losim, Jennifer Lu, Bittner Lucie, Olexandr Lykhenko, Abigail Lyons, Simona Lysakova, Salah Mahmoud, Sara Abdul Majid, Natalka Makogon, Gustavo Adolfo Malca Salas, Denisse Maldonado, Krizzy Mallari, Tathiane M Malta, Maliha Mamun, Loubens Manon, German Marchandon, Natalia Marciniak, Ruiz Maria, Jaubert Marianne, Jerier Marie-Laure, Sonia Marinovic, Christopher E Mason, Gabriella Mason-Buck, Nicole Mathews, Garcia Mathilde, Mignotte Mathilde, Yuri Matsuzaki, Muscat Maureen, Mendy Irène Mauricette, Madelyn May, Elias McComb, Ken McGrath, Arif Asyraf Md Supie, Adiell Melamed, Dmitrii Meleshko, Wayne Menary, Ambar Mendez, Katterinne N Mendez, Irene Meng, Ajay Menon, Mark Menor, Roy Meoded, Nancy Merino, Karishma Miah, Carbajo Miguel, Tanja Miketic, Wilson Miranda, Athena Mitsios, Ryusei Miura, Kunihiko Miyake, Maria D Moccia, Natasha Mohan, Mohammed Mohsin, Karobi Moitra, Mauricio Moldes, Laura Molina, Jennifer Molinet, Orgil-Erdene Molomjamts, Emmanuel F Mongodin, Eftar Moniruzzaman, Sookwon Moon, Mario Moreno, Maritza S Mosella, Josef W Moser, Christopher Mozsary, Amanda L Muehlbauer, Oasima Muner, Muntaha Munia, Naimah Munim, Tatjana Mustac, Harun Mustafa, Beth Mutai, Beth Kinya Mutai, Cristina Muñoz, Benchouaia Médine, Areeg Naeem, Niranjan Nagarajan, Dorottya Nagy-Szakal, Mayuko Nakagawa, Masaki Nasu, Irene González Navarrete, Hiba Naveed, Bryan Nazario, Narasimha Rao Nedunuri, Aida Nesimi, Amanda Ng, Wan Chiew Ng, Gloria Nguyen, Elsy Ngwa, Agier Nicolas, Marina Nieto-Caballero, Ab-dollahi Nika, Olga Nikolayeva, Tatyana Nikolayeva, Hosna Noorzi, Avigdor Nosrati, Houtan Noushmehr, Diana N Nunes, Kathryn O’Brien, Niamh B O’Hara, Gabriella Oken, Rantimi A Olawoyin, Javier Quilez Oliete, Manuela Oliveira, Kiara Olmeda, Itunu A Oluwadare, Tolulope Oluwadare, Da Silva Ophélie, Nils Ordioni, Jenessa Orpilla, Jacqueline Orrego, Melissa Ortega, Princess Osma, Stephan Ossowski, Israel O Osuolale, Olayinka O Osuolale, Oluwatosin M Osuolale, Mitsuki Ota, Yuya Oto, Rachid Ounit, Christos A Ouzounis, Subhamitra Pakrashi, Rachel Paras, Coral Pardo-Este, Young-Ja Park, Andrea Patrignani, Manuel Perez, Ante Peros, Sabrina Persaud, Anisia Peters, Robert A Petit III, Adam Phillips, Nicolas Pierre, Lisbeth Pineda, Melissa P Pizzi, Alketa Plaku, Alma Plaku, Eileen Png, María Gabriela Portilla, Leonardo Posada, Max Priestman, Bharath Prithiviraj, Sambhawa Priya, Phanthira Pugdeethosal, Catherine E. Pugh, Benjamin Pulatov, Kyrylo Pyrshev, Tao Qing, Saher Rahiel, Savlatjon Rahmatulloev, Kannan Rajendran, Aneisa Ramcharan, Adan Ramirez-Rojas, Shahryar Rana, Nicolas Rascovan, Prashanthi Ratnanandan, Timothy D Read, Hubert Rehrauer, Vicedomini Riccardo, Hugues Richard, Renee Richer, Alexis Rivera, Michelle Rivera, Alessandro Robertiello, Courtney Robinson, Paula Rodríguez, Nayra Aguilar Rojas, Paul Roldán, Conte Romain, Lannes Romain, Anyelic Rosario, Sandra Roth, Kaitlan Russell, Mariia Rybak, Krista Ryon, Gunnar Rätsch, Thais S Sabedot, Mahfuza Sabina, Ikuto Saito, Yoshitaka Saito, Kaung Myat San, Jorge Sanchez, Jorge L Sanchez, Khaliun Sanchir, Ryan Sankar, Shalaby Sarah, Zulena Saravi, Kai Sasaki, Kaisei Sato, Masaki Sato, Ryo Sato, Seisuke Sato, Yuma Sato, Nowshin Sayara, Oli Schacher, Anna-Lena M Schinke, Ralph Schlapbach, Christian Schori, Jason R Schriml, Lynn M Schriml, Felipe Segato, Torsten Semmler, Felipe Sepulveda, Marianna S Serpa, Juan C Severyn, Osman U Sezerman, Heba Shaaban, Maheen Shakil, Leming Shi, Hyenah Shim, Hikaru Shirahata, Yuh Shiwa, Maria A Sierra, Jordana M Silva, Gwenola Simon, Harilanto Simon, Shaleni K Singh, Kasia Sluzek, Rebecca Smith, Eunice So, Núria Andreu Somavilla, Le Huu Song, Yuya Sonohara, Vacant Sophie, Camila Souza, Jason Sperry, Nicolas Sprinsky, Ray Chaudhuri Sreya, Stefan G Stark, Delmas Stéphane, Kiyoshi Suganuma, Hamood Suliman, Jill Sullivan, Chisato Suzuki, Haruo Suzuki, Collin Sylvie, Sora Takagi, Fumie Takahara, Kou Takahashi, Naoya Takahashi, Tomoki Takeda, Isabella K Takenaka, Soma Tanaka, Anyi Tang, Yuk Man Tang, Emilio Tarcitano, Andrea Tassinari, Mahdi Taye, Alexis Terrero, Eunice Thambiraja, Antonin Thiébaut, Andrew M Thomas, Dominique Thomas, Neff Thomas, Sade Thomas, Scott W Tighe, Takumi Togashi, Yuto Togashi, Anna Tomaselli, Itsuki Tomita, Masaru Tomita, Xinzhao Tong, Nora Toussaint, Jennifer M Tran, Bitard-Feildel Tristan, Catalina Truong, Kazutoshi Tsuda, Takafumi Tsurumaki, Michelle Tuz, Yelyzaveta Tymoshenko, Klas I Udekwu, Juan A Ugalde, Carmen Urgiles, Mariko Usui, Brandon Valentine, Hitler Francois Vasquez Arevalo, Dimitar I Vassilev, Elena Vayndorf, Jhovana L Velasco Flores, Thirumalaisamy P Velavan, Valeria Ventorino, Patricia Vera-Wolf, Lemaire Vincent, Matthys Vincent, Sierra Vincent, Renee Vivancos-Koopman, Andrew Wan, Cindy Wang, Tomoro Warashina, Ayuki Watanabe, Samuel Weekes, David Westfall, Lothar H Wieler, Michelle Williams, Silver Wolf, Brian Wong, Tyler Wong, Yan Ling Wong, Rasheena Wright, Tina Wunderlin, Marcus H Y Leung, Ryota Yamanaka, Jingcheng Yang, Hirokazu Yano, George C Yeh, Olena Yemets, Shusei Yoshikawa, Ben Young, Laraib Zafar, María M Zambrano, Amy Zhang, Shu Zhang, Yang Zhang, Yuanting Zheng, Jifeng Zhu, Sibo Zhu, Stas Zubenko, Kelly french, Angelika pupiec, Orhan Özcan, Paweł P Łabaj, Giuseppe LoMonaco,
5 Acknowledgement
DCD was supported by the Tri-Institutional Training Program in Computational Biology and Medicine (CBM) funded by the NIH grant 1T32GM083937.
We thank GitHub for providing private repositories to the MetaSUB consortium at no cost.
We thank XSEDE and Philip Blood for their support of this project.
We would like to thank the Epigenomics and Genomics Core Facilities at Weill Cornell Medicine, funding from the Irma T. Hirschl and Monique Weill-Caulier Charitable Trusts, Bert L and N Kuggie Vallee Foundation, the WorldQuant Foundation, Igor Tulchinsky, The Pershing Square Sohn Cancer Research Alliance, NASA (NNX14AH50G, NNX17AB26G), the National Institutes of Health (R01ES021006, R25EB020393, 1R21AI129851, 1R01MH117406), TRISH (NNX16AO69A:0107, NNX16AO69A:0061), the NSF (1840275), the Bill and Melinda Gates Foundation (OPP1151054) and the Alfred P. Sloan Foundation (G-2015-13964), Swiss National Science Foundation grant #407540_167331 “Scalable Genome Graph Data Structures for Metagenomics and Genome Annotation” as part of Swiss National Research Programme (NRP) 75 “Big Data”
Discovery of novel viral sequences was work conducted by the US Department of Energy Joint Genome Institute, a DOE Office of Science User Facility, under contract number DE-AC02-05CH11231 and used resources of the National Energy Research Scientific Computing Center, supported by the Office of Science of the US Department of Energy.
MetaSUB Seoul was supported by the Institut Pasteur Korea (2015MetaSUB) and a National Research Foundation of Korea (NRF) grant (NRF-2014K1A4A7A01074645, 2017M3A9G6068246).
Metasub Chile was supported by funding from CONICYT Fondecyt Iniciación grant 11140666 and 11160905, as well as funding from the Millennium Science Initiative of the Ministry of Economy, Development and Tourism, Government of Chile.
MetaSUB Japan was supported by research funds from Keio University, the Yamagata prefectural government and the City of Tsuruoka.
MetaSUB Austria and Ukraine acknowledge the bilateral AT-UA collaboration fund (WTZ:UA 02/2019; Ministry of Education and Science of Ukraine, UA:M/84-2019).
MetaSUB Ukraine was supported by research funds from Kyiv Academic Univeristy, Ministry of Education and Science of Ukraine grant 0118U100290. MetaSUB Ukraine would like to express gratitude to Kyiv Metro for the support of sampling days.
Work in Colombia was partially funded by Colciencias (project No. 639677758300)
Sampling was carried out in compliance with regulations and permissions from local authorities (Azienda Napoletana Mobilità s.p.a. in Naples, Italy; Régie des Transports Métropolitains in Marseille, France; Transmilenio and ANLA permit 1484 in Bogotá, Colombia; Nigerian Railway Corporation (NRC) [Ilorin and Offa Branch] and Kwara Express Transport)
Footnotes
↵⋆ Listed Alphabetically, Affiliations listed separately