Abstract
High-throughput molecular analysis of sewage is a promising tool for precision public health. Here, we combine sewer network and demographic data to identify a residential catchment for sampling, and explore the potential of applying untargeted genomics and metabolomics to sewage to collect actionable public health data. We find that wastewater sampled upstream in a residential catchment is representative of the human microbiome and metabolome, and we are able to identify glucuronidated compounds indicative of direct human excretion, which are typically degraded too quickly to be detected at treatment plants. We show that diurnal variations during 24-hour sampling can be leveraged to discriminate between biomarkers in sewage that are associated with human activity from those related to the environmental background. Finally, we putatively annotate a suite of human-associated metabolites, including pharmaceuticals, food metabolites, and biomarkers of human health and activity, suggesting that mining untargeted data derived from residential sewage can expand currently-used biomarkers with direct public health or policy relevance.
Introduction
Nearly all human activity within our cities leaves a chemical or biological footprint in sewage, but very little has been done to mine this wealth of information for the public good (Daughton 2018). Wastewater-based epidemiology could have enormous benefit to public health practice, providing real-time assessments of population health and generating data to support public health and city policies, while current practice often relies on survey-based data and evaluation of indirect outcomes (Falbe 2016, Escobar 2013). Public health surveillance is primarily based on reports from hospitals and other clinical centers, and appropriate responses are initiated after a threshold of clinical cases have been reported. For example, wastewater-based public health surveillance platforms could provide data more effectively than hospital-based reporting, allowing public health officials to respond to emerging outbreaks before a critical mass of patients have gotten sick (Manor 2014, Hellmer 2014, Yan 2018). In addition to identifying infectious disease outbreaks, wastewater could provide a novel data source to evaluate the effectiveness of policies addressing long-term population health (Daughton 2018, Burgard 2014), providing a more immediate measure of the impact of health-related policies on targeted outcomes than existing approaches.
Recently, wastewater-based epidemiology was used as an early indicator of poliovirus outbreaks to target vaccination campaigns (Manor 2014, Kaliner 2015), to monitor asymptomatic virus circulation (Berchenko 2017), and as a novel source of data on illicit drug consumption in Australia (Australian Report) and over 60 European cities and towns (van Nuijs 2018, Thomas 2012, Baz-Lomba 2016). More recently, wastewater analysis has been proposed as a promising tool to tackle the severe opioid epidemic in the U.S. and Canada (Keshaviah 2016, Keshaviah 2017) as well as to monitor antimicrobial resistance (Pehrsson 2017, Initiatives 2018), obesity levels, and lifestyle biomarkers (Daughton 2018, Arnaud 2018, Newton 2015). Apart from polio surveillance and drug consumption estimation, many applications remain proof-of-concept academic projects and are not yet implemented into public health practices.
Current practices, such as sampling downstream sites like pump stations or wastewater treatment plants (WWTPs) (Manor 2014, Hellmer 2014, Australian Report, Thomas 2012, Baz-Lomba 2016) may not be effective for all applications because of degradation of chemical and biological molecules during transport. Moreover, grouping samples of different origins can complicate analysis by aggregating multiple municipalities with variable transit times to the collection site, and by mixing of residential and industrial waste. These factors create technical biases in the data and prevent accurate consumption or infection rates from being calculated from measured quantities (Thai 2014, Fahrenfeld 2016). Comparisons among WWTPs from different municipalities are further confounded by factors such as non-uniform population sizes (O’Brien 2013), variable wastewater transit times (Thai 2014), differences in sampling platforms (Ort 2010), and a lack of information on which biomarkers are stable and quantifiable across WWTPs (Arnaud 2018, Daughton 2018). Finally, composite samples from WWTPs also lack geographic specificity, reducing their usefulness to policy makers (Zuccato 2008, Castiglioni 2006, Thomas 2012).
Here, we demonstrate the utility of combining upstream residential sewage sampling with untargeted ‘omics analyses to obtain direct measures of human activity in residential populations, facilitating and expanding applications to public health. We combine demographic, sewage network, and geographic data sources to curate and select an upstream sampling site that reflects a residential population. We next generate microbiome and untargeted metabolomics data from a 24-hour sampling period and show that residential sewage represents the human microbiome and metabolome and reflects human diurnal activity. We putatively identify a variety of human-derived biomarkers, which can be extended to confirm biomarkers for use in a broad range of public health or policy evaluation applications.
Results
Upstream residential sewage contains more markers of human activity than sewage sampled at WWTPs
Sewage at WWTPs lacks geographic specificity, can represent populations of different orders of magnitude (e.g. 1,000 to 2 million people, Supp. Fig. 1), and contains sewage which has travelled on the order of minutes to hours, resulting in different rates of degradation and dilution of biological and chemical biomarkers (Newton 2015, Thai 2014). Residential catchments can be selected to have similar characteristics to each other, mitigating some confounding factors of WWTPs. We integrated GIS datasets with city-wide demographic and sewage network information to select a manhole serving an area with at least 80% residential land use, a population of over 5,000 people, and a maximum sewage hydraulic retention time of 60 minutes (Fig. 1A). We conducted two experiments: (1) a comparison of upstream grab samples with respective downstream samples and (2) a time series analysis of 24 hourly grab samples. We sampled sewage at the identified residential catchment (Fig. 1A) and a downstream pump station (N = 3 samples for each upstream and downstream sites, N = 1 upstream sample for each time point in the 24 hour experiment; Fig. 1B, Table 1) and performed untargeted metabolomics and 16S rDNA sequencing on the filtrate and cell fraction, respectively.
Metabolite relative abundances differed significantly between upstream and downstream sites. 34.5% of metabolites detected in both sewer and WWTP samples (n=245/710) had significantly different concentrations between these two sampling locations (q < 0.05; t-test with FDR correction). Of these differentially abundant metabolites, 73% (n=179/245) decreased or disappeared completely from wastewater collected at the WWTP, including many metabolites from human activity (see “Residential sewage microbiome and metabolome reflect human diurnal activity” section, Fig. 2A yellow dots, Fig. 3C yellow bars, Supp. Fig. 2).
Biological signals also differed between upstream and downstream sites. Residential sewage contained a higher proportion of human-associated bacteria than WWTP sewage (Fig 2B). Human-associated bacteria were defined as the top ten bacterial families in human stool (as in Newton et al. 2015), representing 90% of the average stool from the Human Microbiome Project (Huttenhower 2012). On average, 66% of the microbial community from residential sewage was of human fecal origin by this definition, compared to 51% in the WWTP pump station samples (Fig. 2B), with reduced representation by the Ruminococcaceae, Lachnospiraceae, Prevotellaceae, Veillonellaceae and Erysipelotrichaceae families (Supp. Fig. 3). These results indicate that the abundance of human gut bacteria decreases in the sewer environment. We expect the rate of decrease to depend on sewer environmental conditions and to result in dramatically different amounts of human fecal bacteria in WWTPs. Consistent with this expectation, a previous study observed large variability in the relative amounts of human fecal bacteria in 71 WWTPs across the U.S. (1.59% – 69.2%; average 15%, Newton et al. 2015).
Upstream sampling enables detection of human associated biomarkers via glucuronidation
Some metabolites are not typically detected in downstream analyses due to their reactivity in sewer systems. We hypothesized that such variants would be uniquely present in upstream sewage because they would be collected shortly after excretion. Glucuronides are chemical moieties added to exogenous molecules in the human liver to aid in their excretion and are thus specific indicators of human excretion (de Wildt 1999). Importantly, these charged moieties are often added to hydrophobic molecules that are not detectable by mass spectrometry. Thus, glucuronidation enables the quantification of molecules that would otherwise elude measurement. However, glucuronide moieties are labile and many bacteria express glucuronidase enzymes that hydrolyze them, reducing the signal at WWTPs (Pollet 2017, Jacox 2017). Thus, we hypothesized glucuronides present at upstream sites might be underrepresented or absent in WWTP samples.
We identified 21 glucuronidated compounds in our untargeted data, based on analytical standards as well as matches in mass/charge (m/z) and structural characteristics (MS2 spectra) (see Methods, Table 2). Glucuronides that were readily detectable in residential catchments were reduced or altogether absent in wastewater collected at the WWTP (nine examples shown in Fig. 2C, others shown in Supp. Fig. 4). By contrast, compounds that are resistant to biological and/or chemical degradation, such as the artificial sweetener sucralose, showed similar levels in samples from residential and WWTP catchments (six examples shown in Fig. 2C). Our data suggests that sampling upstream residential catchments can expand the diversity of molecules available for future public health surveillance applications (Daughton 2018, Jacox 2012).
Residential sewage microbiome and metabolome reflect human diurnal activity
Chemical and bacterial biomarkers detected in residential sewage reflect the diurnal cycles of human activity, as approximated by changes in average sewage flow rates collected over four weeks (Fig. 3A and B, Ort 2010). We hypothesized that human-associated metabolites and bacteria would increase during the day, when the contributing population would be more active, and decrease at night, whereas the environmental background would remain relatively constant (Fig. 3B). To test this hypothesis, we collected hourly samples over 24 hours from the residential catchment and produced 16S rDNA sequencing and untargeted metabolomics profiles. Metabolomics data from four time points did not pass quality control filters and these were excluded from our analysis (2AM, 2PM, 4PM, 6PM, see Methods).
Clustering metabolites and bacterial taxa enabled us to identify components associated with human activity. We calculated the Spearman correlations between metabolites (n=3,672) and bacterial taxa (n=254) and co-clustered metabolites and OTUs based on this correlation matrix. One co-cluster of metabolites (n=2,961/3,672) and bacteria (n=81/254) (dashed box, Supp. Fig. 5) had strong positive correlations to each other and increased during the day and became much less abundant at night. We identified these as putative “human-associated” features, and they included most of the metabolic features (mean abundance = 80%, yellow bars Fig. 3C) and bacteria in the sewer (mean relative abundance = 78%, orange bars Fig. 3D).
We next asked whether these metabolites and bacteria were indeed primarily human-associated by annotating the most abundant metabolites (see Methods). We found that the “human-associated” cluster was statistically enriched in human metabolites such as glucuronide compounds, hormones, bile acids and pharmaceuticals (Fisher’s exact test p-value = 1.4e-10, Table 3). Bacteria which clustered with the “human-associated” group of metabolites included 81 bacterial OTUs primarily from the Firmicutes and Bacteroidetes phyla, which are major components of the human gut flora (Fig. 3E, Huttenhower 2012). 65% of bacteria in the “human-associated” bacterial cluster belonged to the top ten most abundant bacterial families in human stool (Supp. Fig. 6).
The cluster of non-“human-associated” metabolites (n=711) and bacterial OTUs (n=173), on the other hand, likely reflect bacteria and metabolites sourced from the environmental background rather than from human activity. These metabolites had few matches to human compounds and remained constant throughout the 24-hr sampling period, likely representing chemicals derived from the environment or sewer biogeochemistry. Similarly, this bacterial cluster was primarily composed of Proteobacteria (Fig. 3E), which are not major members of the human gut flora. Indeed, human fecal bacterial families comprised only 1.4% of the abundance of this cluster (Supp. Fig. 6). The relative abundance of the bacteria in this cluster increased at night, but this likely reflects the fact that there were fewer bacteria of fecal origin rather than an increase in environmental bacteria, since a similar increase in environmental metabolites was not observed (Fig. 3C and D, gray bars).
Identification of urinary and fecal biomarkers
Urinary and fecal biomarkers supplement sewage flow rate as a more direct measure of human activity in sampling optimization, and can be used as a normalization factor for specific quantities of interest (Ort 2010). For example, drugs and pharmaceutical products are mainly excreted in urine, while certain infectious diseases like polio virus are shed in stool. To the extent that feces or urine is concentrated at certain times of day, sampling efforts could be matched to these hours to reduce the logistical burden of continuous sampling. In particular, we hypothesized that fecal matter might be concentrated in a reduced set of hours.
We identified and confirmed metabolites corresponding to human urinary and fecal markers through a combination of analytical standards and m/z plus MS2 matching, according to the protocols in the Metabolomics Standards Initiative (MSI, Sumner 2007). We prioritized mass spectral features with high abundance across all samples, and also in a few samples of interest (8 am and 1 pm samples, which contained more total metabolites and which we expected to correlate with human behavior). We screened the top m/z values with METLIN and MetFrag to characterize putative human-derived metabolites (Smith 2005, Wolf 2010). We compared all m/z values to the Human Metabolite Database (HMDB) database and used MetFrag to compare the observed MS2 fragmentation patterns of potential glucuronide compounds with their expected fragmentations (see Methods; Wishart 2012, Wolf 2010, Ashcroft 1995). When commercially available, we obtained analytical standards to confirm our identifications.
The most abundant feature in our dataset was identified (level 1 confidence, MSI, Sumner 2007) as alpha-N-phenylacetyl-L-glutamine (PAG), an end product of phenylalanine metabolism excreted in urine (Seakins 1971, Stein 1954). Previous studies on human physiology have found that PAG excretion remains constant in response to diet and does not exhibit diurnal cycling in individual patients (Seakins 1971). Thus, the abundance of PAG in our sample likely correlates with the total volume of urine in sewage at a given time. Urobilinogen, a hemoglobin breakdown product excreted primarily in stool (Balikov 1957), was also identified (level 1 confidence). We identified four bile acids (chenodeoxycholic acid, cholic acid, glycocholic acid, taurocholic acid; level 1 confidence). We considered these compounds as putative fecal markers, as they represent primary conjugated (produced in the liver) and unconjugated bile acids (acted on by bacterial bile salt hydrolases) that aid in digestion (Hofmann 1989).
Urinary and fecal metabolites tracked general diurnal human activity, but their dynamics differed from each other (Fig. 4). Both types of metabolites decreased at night, with the fecal markers almost completely disappearing (Fig 4C and D) while the urinary marker remained detectable (Fig 4B), albeit at reduced levels. Fecal markers also showed large deviations in the 8 am and 3 pm samples, and were slightly elevated throughout the evening (8 – 10 pm, Fig 4C and D). In contrast, the urinary markers increased in the morning and stayed relatively consistent throughout the day (Fig 4B). These patterns correspond with known patterns of human behavior: during the day, urination remains relatively constant across the entire population, while defecation spikes in the morning right after waking and in the early afternoon (Heaton 1992). Thus, sampling continuously throughout the day may be needed to capture the urinary fraction (see PAG, Fig. 4B), while fecal surveillance could be targeted to early mornings (see bile acids, Fig. 4C).
Sewage reveals dynamics of bile acid production in humans
Human production and excretion of bile acids is a complex physiological process, with conjugated and unconjugated bile acids exhibiting related but different responses to meals and to specific types of dietary intake (Hofmann 1989). The daily dynamics of fecal bile acid excretion is not well studied in humans, in part because it is difficult to access temporally-resolved data: humans defecate only 1-3 times per day on average or irregularly (Heaton 1992). Our 24-hour samples provide a unique opportunity to study the mix of bile acids on a population level.
We previously reported day-to-day variability in the types and quantities of bile acids produced in a single individual (Kearney 2018). Using our sewage aggregate data, we found additional variability in the types of bile acids excreted at different times of day. Levels of unconjugated cholic and chenodeoxycholic acids remained relatively constant during the day but conjugated cholic acids (glycocholic and taurocholic acid) showed more variation, with a notable spike in the morning (Fig 4C). These spikes in conjugated cholic acids may reflect an aggregate population-level physiological response to meals, or simply the sampling variability of residential sewage. Estimates of per sample population size are needed to discriminate these cases. In either case, these data suggest that analyzing physically aggregated biological data could provide an alternative avenue by which to study human physiology.
Residential sewage contains potential indicators of community health
Novel potential biomarkers of population health and behavior in our metabolomics data suggest that residential sewage is a promising source of community health indicators (Daughton 2018, Metzler 2008). 332 out of 3,672 untargeted metabolomic features were putatively identified in the HMDB database. Within the putatively-annotated glucuronides, we found human physiology biomarkers, compounds derived from drugs and pharmaceuticals, and dietary metabolites (Fig. 5 and Table 2). Physiological biomarkers included glucuronidated hormones and steroids such as adrenaline, aldosterone, and cortolone (Table 2), and other endogenous markers of key physiological processes could likely be identified (Daughton 2018). We found many examples of glucuronidated drug compounds, including over-the-counter painkillers (acetaminophen and ibuprofen glucuronide), immunosuppressants (mycophenolic acid glucuronide), and recreational drugs (3-hydroxycotinine glucuronide, a metabolite of nicotine). Such compounds could be measured to track human consumption, identify communities with high use of pharmaceuticals, or evaluate the effectiveness of drug take-back programs (Egan 2017). Finally, some metabolites mapped to dietary markers like sucralose, an artificial sweetener, and metabolites of ferulic acid, a phenolic acid abundant in fruits, vegetables, and beverages like juice and coffee (Piazzon 2012).
Discussion
Here, we present a framework to adapt the practices of wastewater-based epidemiology to the unique needs of city-level public health officials. We show that upstream sampling and leveraging human diurnal activities is key to unlocking much of the potential of sewage surveillance. Untargeted genomic and metabolomic data, rather than a targeted approach to measure a few known biomarkers, allowed us to uncover a wide range of biomarkers that we would not have anticipated a priori.
Such biomarkers may reflect a wide range of human activities and could also be used for a wide range of policy applications. For example, analyzing the amounts of un-glucuronidated prescription opioids or antibiotics could be used to evaluate the effectiveness of drug take-back programs. Bacterial or chemical biomarkers of nutrition intake and diet could be used to study food deserts and the impact of programs to increase the availability of fresh foods in certain neighborhoods. Marijuana metabolites and plant strains could be measured in sewage to measure the impact of cannabis legalization roll-outs across different counties or states. Residential sewage monitoring could also alert city officials of infectious disease outbreaks before a corresponding uptick in clinical cases, enabling earlier intervention in the affected communities. Finally, mining residential sewage can also uncover behaviors not effectively measured by other means. For example, identifying communities with high rates of unreported non-fatal opioid overdoses could be possible by finding catchments with high levels of naloxone-associated metabolites but low reported overdose rates (Walley 2013, Beeson 2018).
Despite the exciting results presented here, we recognize key limitations and areas for future work. First, limited grab samples show strong variability (Supp. Fig 2, Lai 2011), suggesting the need for development of continuous sampling methods, ideally proportional to flow and possibly urine/feces content (Ort 2010). In addition, estimates of per sample population size could allow more quantitative inferences, such as estimating not only presence of an emerging pathogen, but its incidence within a given population (Daughton 2012, O’Brien 2013, van Nuijs 2011). In addition, identifying molecules was a significant bottleneck in our analysis, thus larger collections of chemical standards and more sophisticated computational approaches are needed to expand the set of quantifiable metabolites. As new tools and methods are developed, we expect that the data generated in this and future studies could be re-analyzed and mined for additional novel insights. To enable such re-analyses, our data is available in public databases (see Methods).
Finally, data privacy and ethics should be considered in all stages of development and deployment. Sewage is a physical aggregate of multiple individuals, and so does not share many of the same re-identification concerns as human genomics and microbiome studies (Hall 2012, Prichard 2014, Lowrance 2007, Franzosa 2015). However, concerns about disparate impact and data misuse are real and should be addressed with continual input and feedback from all stakeholders, including not just the scientists and engineers developing these platforms for deployment but also community leaders, sociologists, and policy and privacy experts.
The last decade has seen exponential growth of a "data economy", which has transformed almost every aspect of our lives. Yet this revolution has focused almost entirely on digital data, while very little has been done to capture biological and chemical data. Upstream sewage analysis represents a unique opportunity to tap into this wealth of data for public health and policy evaluation impact, ensuring that public health practitioners and other public officials also benefit from the increased interest in and development of “big data”, ‘omics, and precision medicine tools and technologies.
Materials and Methods
Selection of Residential Catchment sampling site
The sampling site was identified by cross-referencing Cambridge’s wastewater network maps, which contained the layout of pipes and manholes, as well as flow direction, with demographic and land use data. We selected a location where the upstream catchment fulfilled three major requirements: 1) a land use of over 80% residential so that we avoid transient populations and can characterize the demographics of the catchment population accurately, 2) a total catchment population >5,000 people to provide an anonymous reading of the community, and 3) the wastewater travel time from the furthest point in the catchment is <60 minutes to preserve the integrity of the sample. Average sewage flow rate at the selected manhole was calculated by measuring sewage flow rate every 15 minutes with a commercial flow meter over four weeks, then averaging mid-week measurements only (Tuesday-Thursday). Average sewage flow rate at the WWTP pump station was provided by their engineering staff.
Wastewater collection and processing
For the time series analysis, 24 time-proportional samples were collected. Grab wastewater samples (500 mL each) were taken every hour from 10 AM on Wednesday April 8 2015 to 9 AM on Thursday April 9 2015. There was light rain during most of the day (on average <0.01 in) with peak precipitation between 11pm and 12am (0.09 in). For the comparison between the residential and the WWTP catchments, three grab samples (500 mL each) were collected from both the residential manhole and a pump station directly upstream of the WWTP on Thursday October 15 2015, and approximately at the same time (12pm – 2pm). There was no rain in the geographical area represented by the residential and WWTP catchments on this day. Sewage at the manhole was collected with a commercial peristaltic pump (Global Water) sampling at a rate of 100 mL/min. Sewage at the WWTP pump station was collected with a commercial sampling pole.
For all samples, 200 ml of collected sewage were filtered through 10-μm PTFE membrane filters. 150 ml of the outflow were filtered through 0.2-μm PTFE membrane filters. 100 ml of the final outflow were acidified with concentrated HCl (Optima grade, Fisher Scientific) to pH 3.0 and frozen at −80 degrees Celsius until metabolomics analysis. PTFE membrane filters were kept in RNAlater at −80 degrees Celsius until DNA extraction. Deionized water was processed through the same filtration system to generate blank filtrates and blank PTFE membranes. Four blanks were generated in total, one was collected at the start before any wastewater had passed through the system, another one at the end, and two more were spread over the processing time. The lab filtration system consisted of a Masterflex peristaltic pump (Pall), Masterflex PharMed BPT Tubing (Cole-Palmer), 47 mm PFA filter holders (Cole-Palmer) and 47mm PTFE Omnipore filter membranes (Millipore). Both the tubing and filter holders were previously cleaned with HCl (10% v/v) and ultrapure deionized water. The tubing and filter holders were flushed with 10% bleach and with distilled water, for 10 minutes each after every wastewater sample was processed.
DNA extraction and amplicon based Illumina sequencing of 16S rRNA genes
0.2-μm filter membranes were thawed in ice. RNAlater was removed and filters were washed with phosphate-buffered saline (PBS) buffer twice. DNA was extracted from each filter with Power Water extraction kit (MO BIO Laboratories Inc.) according to manufacturer’s instructions. The only modification was that DNA from up to three filters from the same sample were pooled into the same DNA-binding column. Paired-end Illumina sequencing libraries were constructed using a two-step PCR approach targeting 16S rRNA genes previously described (Preheim 2013). All paired-end libraries were multiplexed into one lane and sequenced with paired end 300 bases on each end on the Illumina MiSeq platform at the MIT Biomicro Center.
Processing of 16S rRNA gene sequencing data
Raw data was processed with an in-house 16S processing pipeline (https://github.com/thomasgurry/amplicon_sequencing_pipeline). The specific processing parameters can be found in Supplementary Table 1. To assign OTUs, we clustered OTUs at 99% similarity using USEARCH (Edgar 2010) and assigned taxonomy to the resulting OTUs with the RDP classifier (Wang 2007) and a confidence cutoff of 0.5. For each dataset, we removed samples with fewer than 100 reads and OTUs with fewer than 10 reads. We calculated the relative abundance of each OTU by dividing its value by the total reads per sample. We then collapsed OTUs by their RDP assignment up to genus level by summing their respective relative abundances. OTUs without annotation at the genus-level were combined into one “unknown” genus.
Untargeted metabolomics analytical methods
Acetonitrile and deuterated biotin (d2-biotin) were added to the 0.2 um-filtrate (acetonitrile final concentration was 5%, deuterated biotin was 0.05 µg ml−1). The resulting solution was analyzed using liquid chromatography (LC) coupled via electrospray ionization (negative ion mode) to a linear ion trap-7T Fourier-transform ion cyclotron resonance hybrid mass spectrometer (Thermo Scientific, FT-ICR MS; LTQ FT Ultra). Chromatographic separation was performed using a Synergi Fusion C18 reversed phase column (2.1 × 150mm, 4um, Phenomenex). Samples were eluted at 0.25 mL/min with the following gradient: an initial hold at 95% A (0.1% formic acid in water): 5% B (0.1% formic acid in acetonitrile) for 2 minutes, ramp to 65% B from 2 to 20 minutes, ramp to 100% B from 20 to 25 min, and hold at 100% B until 32.5 minutes. The column was re-equilibrated for 7 min between samples at 95% solvent A. The injected sample volume was 20 uL. Full MS data were collected in the FT-ICR cell from m/z 100-1000 at 100,000 resolving power (defined at 400 m/z). In parallel to the FT acquisition, MS/MS (MS2) scans of the four most intense ions were collected at nominal mass resolution in the linear ion trap (LTQ). Samples were analyzed in random order with a pooled sampled run every six samples in order to assess instrument variability.
Untargeted metabolomics raw data processing
Raw XCalibur files were converted to mzML files with MSConvert (threshold = 1000 for the negative ion mode files) (Chambers 2012). Peaks were picked using the function xcmsSet from the R package xcms, with the following parameters (negative mode): method = ‘centWave’, ppm = 2, snthresh = 10, prefilter = (5, 1000), mzCenterFun = ‘wMean’, integrate = 2, peakwidth = (20, 60), noise = 1000, and mzdiff = −0.005 (Smith 2006). To align retention times, we used the retcor.obiwarp function with the following parameters: plottype=‘deviation’, profStep=0.1, distFunc=‘cor’, gapInit=0.3, and gapExtend=0.4. To group peaks from different samples we used the group.density function with the following parameters: minfrac=0, minsamp=1, bw=30, and mzwid=0.001. To integrate areas of missing peaks, we used the fillPeaks function with method=‘chrom’.
Sample-wise QC was performed based on the spiked in d2-biotin (features identified as Cl adduct: mz = 281.070 and rt = 8.6 min; and H adduct: mz = 245.077 and rt = 8.5 min). We discarded samples with lower recovery of d2-biotin, as we considered this evidence of ion suppression and thus reduced confidence in the peak area values for the remaining features. Specifically, the 2 am, 2 pm, 4 pm, and 6 pm had levels of d2-biotin more than one standard deviation below the mean of all samples, and were discarded from further analyses.
We also performed feature-wise QC on the metabolomics data. For the 24-hour sampling data, we discarded peaks which were present in any of the 4 blank samples or which were absent in any 7 of the pooled samples. We determined each peak’s coefficient of variation (CV) within the pooled samples by calculating its standard deviation divided by the mean, and discarded peaks with a CV greater than 0.35. We also discarded any peaks which were present in fewer than 4 of the time samples. For the upstream/downstream samples, we discarded any features present in either of the 2 blanks.
Metabolite assignments
Mass matching (level 3 confidence)
We assigned putative metabolites according to the confidence levels proposed by the Metabolomics Standards Initiative (Sumner 2007). A level-3 confidence putative classification describes molecules whose observed masses match calculated exact mass values. Here we compared 397 of the most abundant mass features with METLIN (Smith 2005) and putatively characterized 126 (ppm error <= 2). We also programmatically mapped all metabolite m/z values to the Human Metabolome Database (HMDB) database (Wishart 2013, https://github.com/cduvallet/match_hmdb, database as of May 2016). Untargeted m/z values were first converted to their expected neutral mass (i.e. [M-H]- or [M+Cl]-ions), and then scanned against the neutral masses of all HMDB compounds, with an error tolerance of 1 ppm. All HMDB compounds within the error tolerance were returned. Results are shown in Supplementary Table 2. An m/z with a single hit in HMDB (“hit_type” column = hmdb_name) shows the chemical name of the HMDB entry and its HMDB ID. When an m/z had multiple HMDB hits, the feature was named with the common chemical taxonomic classification (“hit_type” column).
Matches by fragmentation spectra (level 2 confidence)
For a level-2 putative annotation, we compared observed fragmentation spectra (MS/MS) with in silico or database spectra. Here we focused on putative glucuronide derivatives from our level 3 comparisons (from m/z hits through METLIN and HMDB searching). MS/MS spectra were extracted from individual mzML files via MZMine (Pluskal 2010) and pymzML (Bald and Barth 2012) and matched to predicted in silico fragmentation patterns from MetFrag (Wolf 2010). We considered a glucuronide to be putatively annotated if the expected glucuronide was among the top MetFrag hits and the observed fragments included peaks corresponding to the un-glucuronidated compound, diagnostic glucuronide derivatives (m/z = 113, 157, 175), or other diagnostic peaks (parent compound minus a CO2 and/or OH) (Ashcroft 1995).
Identifications (level 1 confidence)
For a level 1 identification, putative annotations or characterizations must be confirmed through matches of m/z, retention time and fragmentation spectra of commercial standards analyzed on our platform (Sumner 2007). For these efforts, we purchased several metabolites (Table 2, ID level 1). Each standard was analyzed individually in pure solvent as well as spiked into the 1 pm and 2 am samples. We extracted the mz and retention times corresponding to our standards from the aligned feature table (level 3 m/z match as described above). We found the corresponding peaks in the standard-only runs via pymzML (m/z error between feature and peak in standard-only run = 0.005, RT error between feature and peak = 30 seconds) and also confirmed them by identifying their corresponding m/z and retention time in the spike-in samples. Next, we extracted the MS/MS spectra for each standard from its standard-only run and compared it to a corresponding MS/MS in an unspiked sample. For urobilinogen, the standard-only run did not trigger an MS/MS, so we compared its MS/MS from the spiked sample with one from a non-spiked sample. These analyses were performed through a combination of MZMine (Pluskal 2010) and pymzML (Bald and Barth 2012). Detailed results from the level 1 identifications are available in Supplementary File 1.
Metabolites in the upstream/downstream and 24-hour sampling experiment runs were matched based on their mz and retention time (ppm error < 5 for mz, ΔRT < 30 seconds). 867 metabolites were found to match between the two datasets. Of the 29 molecules identified via MS2 or standard matching (level 1 or 2) in the 24-hour dataset, 17 were also found in the upstream/downstream dataset.
Code and data availability
Code to reproduce these figures are available at https://github.com/mmatus/24hr_publication. Raw and processed microbiome and metabolomics datasets will be made available upon publication in the respective databases (SRA, MetaboLights, and Zenodo).
Conflicts of interest
M.M. and N.G. are co-founders of Biobot Analytics, a company measuring opioid consumption through city sewers. E.A. and E.K. are advisors to Biobot; C.D. and N.E. are employed by Biobot. All these authors hold shares in the company.
Acknowledgements
We thank The Cambridge Department of Public Works, in particular Jim Wilcox from the City of Cambridge, and Sam Lipson from the Cambridge Public Health Department for all their support in selecting, accessing and manning manholes. We are indebted to all researchers from the Alm Lab, Senseable City Lab, and Runstadler Lab who volunteered their time to help us collect and process sewage non-stop through one day, and for Herbie from Cambridge Public Works for providing his good humor and professional insights during our many sampling trips. We thank the generous sponsorship of the Kuwait Foundation for Advancement of Sciences, and the MIT Center for Microbiome Informatics and Therapeutics provided support for computational resources. We thank the WHOI FT-MS facility for sample analysis and Krista Longnecker for training in metabolomics data analysis and discussions throughout the study.
Footnotes
Finalize the standard confirmation process via MS2 and clarify the text (abstract and discussion).