Abstract
Necrotizing enterocolitis (NEC) is a devastating intestinal disease that occurs primarily in premature infants. We performed genome-resolved metagenomic analysis of 1,163 fecal samples from premature infants to identify microbial features predictive of NEC. Features considered include genes, bacterial strain types, eukaryotes, bacteriophages, plasmids and growth rates. A machine learning classifier found that samples collected prior to NEC diagnosis harbored significantly more Klebsiella, bacteria encoding fimbriae, and bacteria encoding secondary metabolite gene clusters related to quorum sensing and bacteriocin production. Notably, replication rates of all bacteria, especially Enterobacteriaceae, were significantly higher two days before NEC diagnosis. The findings uncover biomarkers that could lead to early detection of NEC and targets for microbiome-based therapeutics.
Main
Necrotizing enterocolitis (NEC) is widely studied yet poorly understood. First described in the early 1800s1, NEC is a disorder of intestinal inflammation that can progress to bowel necrosis, sepsis, and death2. NEC affects 7% of very low birthweight infants born in the United States each year, and mortality rates have remained around 20 – 30% for several decades2. The direct cause or causes of NEC remain unknown.
The primary risk factor for NEC is preterm birth2. Immature enterocytes exhibit hyperactive immune responses through the TLR4 pathway in response to bacterial lipopolysaccharide (LPS)3, which can lead to bowel damage4. Experimental NEC occurs in conventionally raised animals but not those reared in a germ-free environment5,6. These observations suggest that the intestinal microbiome plays a role in the disease and led to the prevailing hypothesis that an excessive immune response to abnormal gut microbes is the most likely basis for the pathogenesis of NEC. Although no single microbe has been consistently identified as a biomarker for NEC7, increased abundance of bacteria in the phylum Proteobacteria is a frequently reported microbial pattern in NEC infants8. Most fecal microbiome-based profiling studies of NEC utilize 16S rRNA amplicon sequencing, which provides a general overview of the bacteria present, but does not reveal metabolic features that could contribute to NEC pathogenesis.
Genome-resolved methods may provide new insights into NEC development. The approach has several advantages over 16S rRNA amplicon sequencing. All DNA in a sample is sequenced, allowing detection of bacteriophages, plasmids, eukaryotes, and viruses. Bioinformatics techniques can also infer in situ bacterial replication rates directly from metagenomic data9,10, an important metric, as some microbiome-related diseases have a signal related to bacterial replication but not relative abundance10. Importantly, genome assembly and annotation can provide functional information about organisms present and possibly reveal genes associated with NEC. Further, whole-genome comparisons provide strain discrimination, and thus detailed testing of Koch’s postulates. Finally, mapping to reference genomes is not required for genome detection, allowing for the discovery of novel bacterial clades11. While identification of a single causative strain, virus, or toxin would be the most actionable result for clinicians, any associations could potentially be used as biomarkers to identify early warning signs of NEC, and microbial communities associated with NEC could be targeted with microbiome-altering techniques such as probiotics, prebiotics, or other approaches12.
Results
Metagenomic characterization of premature infant fecal samples
We analyzed 1,163 fecal metagenomes from 34 preterm infants that developed NEC and 126 preterm infants without NEC (Figure 1). Premature infant subjects were matched for gestational age and calendar date, and recruited from the UPMC Magee-Womens Hospital (Pittsburgh, PA) over a 5 year period. Fecal samples were banked and specific samples were later chosen for DNA extraction and sequencing to preferentially study samples immediately prior to NEC. An average of 7.2 samples per infant, mostly from the first month of life, were sequenced and a total of 4.6 tera base pairs of shotgun metagenomic sequencing generated (Supplemental Table S1). Detailed sequencing information (Supplemental Table S1) and patient metadata (Supplemental Table S2) are provided.
Extensive computational analyses were performed on all samples to recover genomes de novo, and determine their phylogeny, metabolic potential, and replication rates (iRep9). We also searched samples for eukaryotic viruses, virulence factors, secondary metabolite gene clusters, and previously implicated pathogens13,14 (Figure 1a). This analysis resulted in 36 gigabase pairs of assembled sequence, 2,425 de-replicated bacterial genomes (average of 92% completeness and 1. 1% contamination), 5,218 bacteriophage genomes, 1,183 plasmid genomes, 7 eukaryotic genomes, and 804,185 de novo protein clusters (Figure 1b; Supplemental Table S6). As NEC can be a rapidly progressive disorder, for most statistical tests we defined NEC samples as those taken within 2 days prior to NEC diagnosis (“pre-NEC” samples). For infants that did not develop NEC, only one sample from the period associated with NEC onset was used (“control” samples). Pre-NEC and control samples were matched for day of life (DOL), gestational age, and recent antibiotic administration (Figure 1c; Supplemental Figure S1). For other analyses, when explicitly stated, all samples were used.
Klebsiella pneumoniae is enriched in samples from infants with NEC
The gut microbiomes of all infants were dominated by Proteobacteria, regardless of NEC development (Figure 2a,b). As compared to previous studies of full-term infants15,16, the premature infants in this study had increased Enterobacteriaceae (a family of Proteobacteria to which many nosocomial pathogens belong17) and notably low abundances of Actinobacteria and Bacteroidetes. Factors that could select for these organisms include prophylactic antibiotics given to all premature infants at birth, high rates of birth by cesarean-section, predominance of formula feeding and immaturity of the intestine and immune system. Only the NEC microbiomes contained Fusobacteria and Tenericutes (Figure 2). Compared to control infants, the NEC infant microbiomes exhibited less stability, lower levels of Firmicutes, and higher levels of Enterobacteriaceae than the microbiomes in control infants (p = 8.9E-7; Wilcoxon rank sums test; Supplemental Table S9; Figure 2a). The general association of Enterobacteriaceae and infants that go on to develop NEC has been described previously18, but this prior analysis was not restricted to the period immediately prior to NEC detection. In our study, the gut microbiomes of infants that developed NEC were not significantly enriched in Enterobacteriaceae in pre-NEC vs. control samples (p = 0.15; Wilcoxon rank sums test), so the association of Enterobacteriaceae and NEC infants overall may be due to the proliferation of these bacteria after administration of antibiotics to treat NEC (Supplemental Figure S9).
A principal component analysis based on weighted UniFrac distance was performed to compare the microbiomes of all samples from all time points (Figure 2b). The first two principal components explained 73% of the overall variance, but samples collected from NEC infants (red) did not cluster separately from control infants (black dots). Consideration of higher principal components (up to the 5th principal component) did not separate pre-NEC and control samples, and samples coded by clinical metadata also did not cluster together (Supplemental Figure S7).
To identify strains enriched in pre-NEC samples, the percentage of pre-NEC vs. control samples carrying each assembled bacterial, bacteriophage, and plasmid genome was calculated (Figure 2c,d). Klebsiella pneumoniae strain 242_2 was the most associated with NEC, and was present above the threshold of detection in 52% of pre-NEC samples vs. 23% of control samples (p = 0.008; Fisher’s exact test) (Supplemental Table S12). Interestingly, closely related bacteria (>99% average nucleotide identity (ANI)) colonized up to 35% of all infants (Figure 2c). This is likely the result of colonization of multiple infants by the same hospital-associated bacteria19. Importantly, no organisms in this study satisfied Koch’s postulate that a disease causing organism should be found in all NEC infants and no healthy patients.
Bacterial replication rates are higher prior to NEC development
Bacterial replication rates are measured from metagenomic data by determining the difference in DNA sequencing coverage at the origin vs. terminus of replication, yielding an index of replication (iRep) that correlates with traditional doubling time measurements9,10. Remarkably, iRep values of bacteria overall were significantly higher in pre-NEC vs. control samples (p = 0.0003; Wilcoxon rank sums test), in a cohort balanced for DOL, gestational age, and recent antibiotic administration (Figure 3) (Supplemental Table S9). Further, iRep values followed a striking pattern in relation to NEC diagnosis: bacterial replication was stable four or more days prior to NEC diagnosis, increased daily in the three days prior to diagnosis, and crashed following diagnosis (probably due to resulting antibiotic administration) (Figure 3a). Individual species did not have enough data-points to be plotted confidently (minimum of 5 measurements per DOL), but genomes of the family Enterobacteriaceae displayed similar but more dramatic patterns than other bacteria overall (Figure 3ab). Increased bacterial replication prior to NEC could promote disease onset or merely be a reaction to changing conditions in the gut that led to NEC.
Machine learning identifies additional differences between NEC and control cases
2,119 features (e.g., the abundance of organisms encoding secondary metabolite gene clusters and each of 600 KEGG modules (specific metabolic pathways)) were measured for each of the 1,163 metagenomic samples (Figure 1; Supplemental Table S3). In order to evaluate which features are most different between pre-NEC and control samples, a machine learning (ML) classifier was developed. Multiple ML algorithms were evaluated, and although all performed with similar accuracy (Supplemental Table S10), the boosted gradient classifier was ultimately chosen due to its known ability to handle class imbalance. The classifier was trained on all 2,119 features to predict if samples were pre-NEC or control, and accuracy was measured through cross-validation over 100 iterations. The classifier achieved a median accuracy of 64% on balanced sets; 14% better than random chance. While a classifier with this accuracy may have limited utility in a clinical setting, it allowed us to interrogate which features were most informative for differentiating pre-NEC and control samples.
The most important individual features used by the ML classifier were replication rates (iRep values), KEGG modules, secondary metabolite gene clusters, and overall plasmid abundance (Figure 4). iRep values of both specific bacterial taxa and median iRep values overall were some of the most important features (Figure 4b), while KEGG modules accounted for over 50% of the total feature importance (Figure 4a) (Supplemental Table S4). A similar number of KEGG modules were associated and anti-associated with NEC (Figure 4c), but descriptions of the pathways associated with NEC (e.g., erythritol and galactitol transport systems) and anti-associated with NEC (e.g., sodium and capsular polysaccharide transport systems) bear no obvious relationship to the disease (Supplemental Table S4). Secondary metabolite gene clusters were the second most important category overall (Figure 4a), but unlike KEGG modules, very few were anti-associated with NEC (Figure 4c). The most significant secondary metabolite gene cluster encodes an unusual operon of biosynthetic genes found in Klebsiella (cluster 416). In other species, similar operons are implicated in biosynthesis of quorum sensing butyrolactones20. The second most significant cluster of genes occurs in Enterococcus and is involved in biosynthesis of a sactipeptide resembling subtilosin A1, an antimicrobial agent with known hemolytic activity21 (cluster 438) (Supplemental Table S5). Interestingly, another cryptic secondary metabolite gene cluster with a high feature importance (cluster 432) is closely related to a previously characterized cluster on a plasmid of Enterotoxin-producing Clostridium perfringens adjacent to the enterotoxin gene (cpe) and beta2 toxin gene (cpb2)22. Overall, high plasmid abundance was correlated with pre-NEC samples (Figure 4b) and K. pneumoniae plasmids in particular were significantly more abundant in pre-NEC samples (p = 0.03) (Supplemental Figure S3).
Feature importances were also analyzed in combination. Each bacterial strain was assigned an importance value based on the sum of the importance scores for the KEGG modules encoded by its genome. 150 genomes have high KEGG importance values (hereinafter referred to as “organisms of interest”) (Supplemental Table S7; Supplemental Figure S4). Interestingly, the organisms of interest were significantly more abundant in pre-NEC samples as compared to control samples (p = 0.004) (Figure 4d), and they cluster phylogenetically (Figure 4f). 97% were in the family Enterobacteriaceae, and of those, 90% were in the genus Klebsiella. The prevalence of K. pneumoniae in pre-NEC samples (Figure 2c) may explain the high abundance of K. pneumoniae plasmids in these samples.
Secondary metabolite biosynthetic gene clusters identified to be important by the ML classifier occur in 218 organisms that are significantly associated with pre-NEC samples (Figure 4e). Several types of secondary metabolite gene clusters were enriched in this these genomes (p < 0.01; Fisher’s exact test), including sactipeptides, bacteriocins, and butyrolactones (encoded by 382, 286, and 11 genomes, respectively) (Supplemental Table S7). As opposed to organisms of interest, these bacteria were spread around the phylogenetic tree (Figure 4f). This may indicate that the clusters themselves are associated with pre-NEC samples. Overall, the results point to quorum sensing and anti-microbial peptide production as being associated with NEC onset.
Bacteria associated with NEC encode specific types of fimbriae
We leveraged the gene content information provided by genome-resolved metagenomics to search for proteins associated with 1) pre-NEC samples and 2) organisms of interest. Three clustering algorithms were evaluated for their ability to reconstruct known clusters of ribosomal proteins (Supplemental Table S11), and a hybrid Markov Cluster algorithm approach23 performed best. Application of the algorithm to the 36,701,491 proteins reconstructed in this study yielded 804,277 protein clusters, none of which was statistically associated with NEC (Fisher’s exact test with false discovery rate correction) (Figure 5a). However, 85 protein clusters were associated with organisms of interest with high precision and recall (>0.7) (Figure 5b). The most common pFam annotations for these clusters were fimbriae and ABC transport proteins (Supplemental Table S8). However, only genomes encoding fimbrial proteins also had a significant association with NEC (p = 0.02; Wilcoxon rank-sum with Benjamini-Hochberg FDR correction; Supplemental Table S8).
Comparison of fimbrial operons against public databases revealed that the majority encode chaperone-usher (CU) type fimbriae. A classification scheme exists for CU fimbriae based on usher protein pFam PF00577.1924,25. The 32,646 usher proteins identified in our sequencing data (Supplemental Table S13) were clustered into groups based on amino acid sequence identity, and the ten most prevalent groups were placed in a phylogenetic tree with reference sequences from each subtype of CU fimbriae (Figure 5d). All ten fimbriae clusters fit into the established CU fimbriae taxonomy, with 9/10 falling in the γ super-clade and one into the π clade (Figure 5d). Four fimbriae clusters identified in this study were significantly more abundant in pre-NEC samples, and genomes encoding cluster 49 (γ4 clade) also had significantly higher iRep values in pre-NEC samples (Figure 5c). 27 genomes that encode fimbrial cluster 49 were not identified as genomes of interest, yet they were at significantly higher abundance, and have significantly higher iRep values, when considering all samples from NEC vs. control infants (Supplemental Figure S6) (p < 0.01; Wilcoxon rank-sums test). This suggests fimbrial cluster 49 itself may be associated with NEC and not incidentally associated with metabolically important genomes.
Biomarkers of NEC are most informative closer to NEC diagnosis
Statistical tests uncovered four factors significantly associated with pre-NEC samples (samples taken within day days prior to NEC diagnosis): iRep values overall (Figure 3b), genomes encoding specific types of secondary metabolite gene clusters (sactipeptides, bacteriocins, and butyrolactones) (Supplemental Table S7), Klebsiella (Figure 2c), and fimbriae cluster 49 (Figure 5c). We performed a similar analysis each day up to eight days prior to NEC diagnosis (Figure 6). Genomes encoding specific types of secondary metabolite gene clusters and Klebsiella genomes were always significantly more abundant in NEC samples, though the effect size of the difference became slightly higher closer to NEC diagnosis. iRep values and the abundance of genomes encoding fimbriae cluster 49, on the other hand, were only significantly higher 3 days and 1 day prior to diagnosis, respectively.
Discussion
Given that we found no single predictor of NEC and identified several factors as important by machine learning, our results support prior indications that NEC is a complex and likely multifactorial disease2,26. Of the four aspects of the gut microbiome that differ in pre-NEC compared to control samples (Figure 6), the iRep values of all organisms in each sample had the highest effect size. Given that iRep is a measure of bacterial replication rather than relative abundance, the result highlights that reliance on relative abundance alone could be misleading. This is largely due to the fact that relative abundance metrics are themselves misleading because an organism can increase in relative abundance simply due to the decline in relative abundances of other organisms. The higher bacterial replication rate prior to NEC diagnosis could be sustained by nutrient release from the breakdown of gut tissue. Alternatively, increased bacterial replication may trigger onset of NEC, possibly because high activity of a specific organism leads to imbalance in concentrations of compounds in the gut environment.
Secondary metabolite gene clusters of specific types (bacteriocins, sactipeptides, and butyrolactones) were significantly enriched in pre-NEC compared to control samples (Supplemental Table S5). Bacteriocins are small peptides that kill closely related bacteria, and when produced, cell lysis could contribute to onset NEC via release of immunostimulatory compounds such as LPS. Sactipeptides are a class of posttranslationally modified peptides with diverse bioactivities27. The sactipeptide with the highest overall importance is related to a subtilosin (antimicrobial agent) with known hemolytic activity. Interestingly, all sactipeptides identified in this study were encoded by Firmicutes, including Clostridium perfringens and Clostridium difficile (Supplemental Figure S8; Supplemental Table S5). Production of sactipeptides by these species could trigger NEC through direct toxicity to human cells or via release of immunostimulatory bacterial compounds following bacterial cell lysis. This phenomenon could explain previous reports that implicate Clostridium in development of NEC28–30.
Butyrolactones are generally involved in quorum sensing in Actinobacteria20, but in this study were mostly found encoded in genomes of Proteobacteria, and over half were identified in Klebsiella genomes. Whereas known quorum sensing systems in Proteobacteria are responsible for the production of virulence factors, including fimbriae31,32, the functions of butyrolactones in Proteobacteria remain unstudied. Higher proportions of Klebsiella were found in infants that went on to develop NEC, and their capacity to produce secondary metabolites and fimbriae could explain this association.
Organisms with genomes encoding for fimbriae cluster 49 were at significantly higher abundances on both the day of and the day before NEC diagnosis. Fimbriae are known stimulants of TLR4 receptors33, immune receptors that are overexpressed in premature infants and previously linked to NEC in animal studies34,35. Fimbriae are the hallmark pathogenicity factors of uropathogenic E. coli36, a group of organisms that have been previously implicated as a causative agent of NEC13. Uropathogenic E. coli were specifically evaluated in this study and not found to be significantly enriched in pre-NEC compared to control samples (Supplemental Table S3; Figure 2c). The associations in prior work and the current study may instead reflect a general link between fimbriae and TLR4 receptor stimulation.
An advantage of genome-resolved metagenomics is that it provides whole community information, going far beyond what can be deduced from 16S rRNA gene surveys that are the hallmark of most prior and much current human microbiome research. Here we applied this approach to a sufficiently large dataset to achieve statistical power unprecedented in a genome-resolved metagenomic study, and find that there is there is likely no single bacteriophage, plasmid, eukaryote, virus, or even protein that is responsible for NEC. However, we identify several promising associations through machine learning, many of which have previously been proposed to explain NEC onset but none of which alone can explain all cases. Bacteria of the genus Klebsiella emerged from our analyses as organisms of potential importance, with secondary metabolite, LPS and fimbriae production all being possible contributors. The association of these bacteria, as well as bacteria of the Clostridium genera, with NEC and their presence in the NICU19 supports prior reports proposing that colonization of premature infants by nosocomial microbes may be clinically significant. Overall, we provide insight into how previously proposed but distinct explanations for development of NEC are interconnected, and identify bacterial growth rates as the strongest predictor of disease onset.
Data availability
Reads are available under bioproject under Bioproject PRJNA294605, SRA studies SRP052967, SRP114966, and SRP012558, and SRA accessions SRR5405607 to SRR5406014 (Brooks et al., 2017; Brown et al., 2018; Rahman et al., 2018; RavehSadka et al., 2015, 2016; Sharon et al., 2013)).
Author Contributions
M.R.O., M.M., and J.F.B. designed the study; M.R.O. performed metagenomic analyses; N.B. and Y.S.S. guided statistical and machine learning analyses, A.C.C. contributed to secondary metabolite analyses; R.B. recruited infants for the study, and B.F. performed all DNA extractions; M.R.O. and J.F.B. wrote the manuscript, and all authors contributed to manuscript revisions.
Declaration of Interests
The authors declare that there is no conflict of interest regarding the publication of this article.
METHODS
Subject recruitment, sample collection, and metagenomic sequencing
This study was reviewed and approved by the University of Pittsburgh Institutional Review Board (IRB PRO12100487 and PRO10090089). This study made use of many different previously analyzed infant datasets. These datasets have previously published descriptions of the study design, patient selection, and sample collection, and are referred to as NIH137,38, NIH219, NIH339, NIH440, NIH541, and Sloan219. Collated sequencing and health information for all infants and samples is provided in the supplemental materials of this manuscript (Supplemental Tables S1, S2).
Metagenomic profiling
Read processing and assembly
Reads from all samples were trimmed using Sickle42, and reads that mapped to the human genome with Bowtie 243 under default settings were discarded. Reads from all samples were assembled independently using IDBA-UD44 under default settings. Co-assemblies were performed for each infant as well, where reads from all samples from that infant were combined and assembled together. Scaffolds <1 kb in length were discarded, and remaining scaffolds were annotated using Prodigal45 to predict open reading frames using default metagenomic settings.
Recovery of de novo bacterial genomes
DasTool46 was used to select the best bacterial bins from the combination of 3 programs for automatic binning-abawaca (https://github.com/CK7/abawaca). concoct47, and maxbin248. Cross-mapping was performed between samples for each infant to generate differential abundance signals, and each sample was binned independently. For each infant, dRep v1.4.249 was then used on all bins created from all samples from that infant to generate an infant-specific genome set, using the command “dRep -comp 50 -con 15 --S_algorithm ANImf -sa .99 -nc .25 --checkM_method taxonomy_wf’.
To determine the taxonomy of bins, the amino-acid sequences of all predicted genes were searched against the uniprot database using the command “usearch64 -ublast $aa_file -db uniprot_cp.fasta.udb -maxhits 1 -evalue 0.0001 -threads 6 -blast6out $b6”, and tRep (https://github.com/MrOlm/tRep/tree/master/bin) was used in combination with ETE 350 to convert the list of identified taxIDs into taxonomic levels. Briefly, this assigns a call to each taxonomy level when at least 50% of protein hits reach that taxonomic level.
Bacterial growth rates
iRep values9 were calculated by first mapping reads from all samples in each infant to the de-replicated genome set from that infant using Bowtie 2. iRep was then run using the command “samtools view $bam | iRep -s - ; iRep_filter.py --long”, and values resulting from genomes with less than 0.9 breadth of coverage were discarded.
To visualize growth rates over time (Figure 3a), all iRep values from all bacteria were averaged together for each DOL relative to NEC and plotted using the seaborn command “sns.pointplot(showfliers=False, ci=68)” (https://seaborn.pydata.org/). Days of life in which less than 5 infants were profiled were manually removed. Boxplots in Figure 3b were also created using seaborn.
Bacteriophages, plasmids, and Eukaryotes
For all assemblies, circular contigs were identified using VICA51, and bacteriophages were identified using VirSorter52 and VirFinder53. Bacteriophages were defined as scaffolds that were considered “level 2” or “level 1” by VirSorter, or less than 0.01 p-value by VirFinder. Plasmids were defined as scaffolds which were circular, but not identified as bacteriophage according to the above definition. Bacteriophages and plasmids over 10kb in length were then each de-replicated separately on a per-infant basis using dRep version 2.0.5, with the command “dRep dereplicate -pa .9 --S_algorithm ANImf -nc .5 −l 3000 -N50W 0 -sizeW 1 --noQualityFiltering --clusterAlg singleOverlap”. All plasmid and bacteriophage genomes were then compared to each other using the dRep command “dRep dereplicate -pa .9 --S_algorithm ANImf -nc .5 −l 10000 -N50W 0 -sizeW 1 --noQualityFiltering --clusterAlg single -d”. Eukaryotes were assembled and binned from the gut samples of premature infants as previously reported41.
Eukaryotic viruses
Eukaryotic viruses were analyzed using the vFam collection55, a set of HMMs designed for the identification of eukaryotic viruses within metagenomic sequence data. Hmmsearch56 was used to search the HMM set “vFam-A_2014.hmm” against each assembly. All hits with e-values less than 1e-5 were considered significant and retained. Reads were also mapped to a previously curated list of human viruses57. This lead to the identification of no viruses when individual samples were used, and a very small number of viruses when combined sets of reads from each infant were used (Torque teno midi virus 2, Torque teno virus 14, and Macaca mulatta polyomavirus 1). This line of work was not followed up on due to lack of signal.
Diversity
Shannon diversity and overall bacteria richness were calculated for each sample. Shannon diversity was calculated using the command skbio.diversity.alpha.shannon (http://scikit-bio.org/). Richness was calculated as the number of bacteria with relative abundances over 0.1%.
KEGG Modules
KEGG modules were annotated by using HMMER against an in house HMM database built from the Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology groups (KOs)58. Briefly, all KEGG database proteins with KOs were compared with all-v-all global similarity search using USEARCH59. MCL was then used to sub-cluster KOs (inflation_value = 1.1). Each sub-cluster was aligned using MAFFT60, and HMMs were constructed from sub-cluster alignments. HMMs were then scored against all KEGG sequences with KOs and a score threshold was set for each HMM at the score of the highest scoring hit outside of that HMMs sub-cluster. KEGG modules were considered present in a genome if all necessary KOs were present in that genome.
Secondary metabolite gene clusters
In order to identify secondary metabolites, antismash-4.0.2 was run on each infant co-assembly61. The results were parsed using the custom script parse_antismash.py (https://github.com/MrOlm/Public-Scripts) and resulting key proteins were clustered using diamond62 (commands: diamond makedb --in $coassemblies.faa -p 6 -d $coassemblies.faa.db; diamond blastp -q $coassemblies.faa -d $coassemblies.faa.db.dmnd -o $coassemblies.faa.out --id 50). Alignments were filtered to only retain those with >75% amino acid identity and 50% alignment. Hierarchical clustering was then performed using average amino acid identity (AAI) and resolved using a distance threshold of .5 to assign each secondary metabolite gene cluster to a gene cluster family. Next, for each infant the nucleotide sequences of all genes in a representative for each gene cluster family were concatenated together. The reads from each sample from that infant were mapped to this concatenation of genes in order to determine the dynamics of these genes in all samples from that infant. The breadth of each cluster was calculated as the weighted breadth (considering length) for all genes in that cluster.
Virulence Factors
Virulence Factors Database (VFDB) was used to search for virulence factors63. The database used was from Mar 17, 2017, containing 2597 sequences. Abricate was used to search all predicted protein sequences against the VFDB (https://github.com/tseemann/abricate). A metadata file from VFDB website (http://www.mgc.ac.cn/VFs/) named “FVs.xls” was used to get additional information about the virulence factors. About 15% of virulence factors were not included in this metadata file and were excluded from additional analysis.
Botulinum toxin
A blast database of all subtypes of BoNTs toxin was downloaded from from https://bontbase.org/ (as accessed on February 15 2018). Blastp was used to search predicted amino acid sequences of all genes against the database. Hits with an e-value less than 1e-5 were considered valid.
Pathogenic E. coli
It was previously reported that pathogenic E. coli may be associated with NEC development; specifically the clades 73, 95, 127, 131, 144, 998, abd 6913. To identify E. coli genomes of these sequencing types in our dataset, all genomes were MLST profiled using PubMLST64 and the program “mlst” (https://github.com/tseemann/mlst). The MLST definition requires having 7 genes; in cases where only 6 genes could be identified, if only one sequence type (ST) existed with those 6 gene types, the ST was inferred. Each sample with an E. coli genome of the above STs at over 1% relative abundance were considered to have a “pathogenic” E. coli.
Proteins
Three protein clustering methods were evaluated for use in this study- MMseqs265 (run using default settings), cd-hit (command: “cd-hit -c 0.7 -M 200000 -T 10)66, and a previously described hybrid Markov Cluster approach23. Algorithms were evaluated based on their ability to reconstruct known protein clusters (specifically a previously described set of 16 universal ribosomal proteins67), and the hybrid Markov Cluster approach approach performed best (Supplemental Table S11). This method was used to cluster the amino acid sequences of all predicted genes from all assembled scaffolds.
The average microbiome of NEC and control infants
To calculate the relative abundance of all microbes in each infant, a full “genome inventory” was generated for each infant by resolving the overlap between the recovered bacteria, eukaryote, bacteriophage, and plasmid genomes. Bacteriophage and plasmid genomes were first aligned using nucmer68, and in all cases where scaffolds aligned with over 95% ANI on over 50% of the scaffold, the scaffold was removed from the plasmid list. The resulting scaffolds were next aligned to bacterial genomes, and all phage/plasmid scaffolds that aligned to bacterial genomes with the same thresholds were removed. Finally eukaryotic genomes were aligned to the remaining scaffolds, and in cases where similar scaffolds were detected, the scaffold was removed from the eukaryotic genome. Reads from all samples were then mapped to that infant’s genome inventory using Bowtie 2, and the relative abundance of each organism was calculated as the percentage of total samples reads that map to that genome (Supplemental Table S12).
In order to compare the microbiome between NEC and control infants, the microbiome of each cohort was averaged across all infants in that cohort (Figure 2a) using the relative abundance values described in the previous paragraph. For each day of life, the average relative abundance of each taxa was first calculated. A 5-day sliding window was next applied, and values from samples in each window were averaged. For example, DOL 10 represents the average abundances from DOL 8 to 12.
Strain-level differences between NEC and control infants
In order to calculate the relative abundance of each bacterium in each sample, each sample was mapped to the infant-specific bacterial genome set for that infant using Bowtie 2. Relative abundances of all bacteria were calculated as the percentage of total sample reads mapping to each genome (Supplemental Table S9). Bacteria assembled from all infants were then compared to each other using dRep, and bacterial genomes with at least 99% ANI were considered to be the same “strain”. A bacterium was considered present in a sample if it had over 0.1% relative abundance, and the fraction of pre-NEC and control samples in which each strain was present was calculated and plotted in Figure 2c.
A similar procedure were performed for the bacteriophage and plasmid genome sets of each infant. Mapping was done to each infant set separately, and genomes were considered to be the same “strain” if they had 99% ANI over at least 50% of their genomes. Organisms were considered present in a sample if they were preset at over 50% genome breadth.
Principal component analysis
Principal component analysis (PCA) was performed based on the relative abundance of bacteria in each sample (Supplemental Table S9) as assessed using weighted UniFrac distance69. A phylogenetic tree was created by comparing all assembled bacterial genomes to each other using dRep (command: “dRep --SkipSecondary -ms 100000”), the weighted UniFrac distance between all samples was calculated using scikit-bio (http://scikit-bio.org/), and PCA was performed using scikit-learn70.
Machine learning
Algorithm development
2,119 features were calculated for each sample and used as the input to a machine learning classifier to predict pre-NEC and control samples (Supplemental Table S3). See above methods for how individual features were calculated.
Three machine learning methods were evaluated for their ability to classify pre-NEC vs. control samples-a random forest classifier (sklearn.ensemble.RandomForestClassifier(n_estimators=460, max_features=10)) balanced using SMOTE (imblearn.combine.SMOTEENN()), a gradient boosting classifier (sklearn.ensemble.GradientBoostingClassifier (learning_rate = 0.1, max_depth = 10, max_features = ‘sqrt’, min_samples_split = 0.7, n_estimators = 200)) balanced using SMOTE, and the same gradient boosting classifier without balancing70,71. Hyperparameters were empirically determined using sklearn.model_selection.RandomizedSearchCV, and in general many different combinations of hyperparameters gave similar results. Models were trained and evaluated using cross-validation for 5 iterations each (sklearn.model_selection.StratifiedKFold (n_splits = 10); sklearn.model_selection.cross_val_predict; sklearn.metrics.accuracy_score), and all achieved similar prediction ability (Supplemental Table S10).
To determine the accuracy of the gradient boosting classifier, 100 iterations were performed where each iteration consisted of 1) randomly balancing the input to include 21 pre-NEC samples and 21 control samples, 2) classifying each sample in the input using 10 fold cross validation (same methods as above), and 3) calculating the percentage of samples that were correctly classified. The median accuracy value was reported.
Feature importance analysis
Feature importances were determined by 100 iterations of training the gradient boosted classifier on the full dataset of pre-NEC and control samples. Importance values were scaled for each iteration such that the overall sum equals 1. The median importance value for each feature is reported (Supplemental Table S4).
KEGG and secondary metabolite enriched genomes
Each bacterial genome was assigned a metabolic importance value by summing the median feature importances of each KEGG pathway encoded by that genome (see above methods for how KEGG pathways were determined). A distribution of KEGG genomes importances was generated (Supplemental Figure S4a), and based on this distribution, genomes with importance values over 15 were considered “organisms of interest”. Each bacterial genome was also assigned an importance value equivalent to the highest importance value of all secondary metabolite clusters encoded by that genome. A distribution was generated (Supplemental Figure S4b), and genomes with importances over 0.5 were considered enriched in important secondary metabolite clusters.
Phylogenetic tree
A phylogenetic tree was made to visualize the distributions of organisms of interest and organisms enriched in important secondary metabolite clusters (Figure 4f). Ribosomal protein S3 was identified in bacterial genomes using pFam PF00189.19 and HMMER with a score cutoff of 5025,56. An anchael outgroup was added, and all sequences were aligned using MAFFT60 under default parameters. All positions with gaps in over 50% of sequences were trimmed from the alignment, and FastTree was used with default parameters to generate a phylogenetic tree72. The tree was visualized and annotated using iTol73.
Protein clustering
Protein association with NEC
Each protein cluster was considered present in a sample if a protein from that cluster had been assembled from the sample. Fisher’s exact test was run on each protein cluster to determine if it was enriched in pre-NEC or control samples, and after Benjamini-Hochberg correction no p-values were statistically significant.
Protein association with organisms of interest
Each protein cluster was considered present in an organism of interest if a protein from that cluster was encoded in the organism’s genome. The recall and precision of each cluster with organisms of interest was calculated as follows: recall = the number of organisms of interest the cluster in / the total number of organisms of interest; precision = the number of organisms of interest the cluster is in / the total number of genomes the cluster is in. The recall and precision of each protein cluster was plotted (Figure 5b), and clusters with recall and precision over 0.7 were considered enriched in organisms of interest.
The 85 protein clusters enriched in organisms of interest were profiled using using the pFam database25 with provided noise cutoffs, and the two most common pFams were PF00419.19 (Fimbrial) and PF00005.26 (ABC transporter) with four proteins each. We next determined if organisms encoding these proteins were enriched in pre-NEC samples. For each pFam with at least three proteins enriched in organisms of interest, we compared the total relative abundance of all bacteria encoding that pFam in pre-NEC vs. control samples, as well as all iRep values of bacteria encoding that pFam in pre-NEC vs. control samples using the Wilcoxon rank-sum test with Benjamini-Hochberg p-value correction (Supplemental Table S8).
Fimbriae
Chaperone-usher fimbriae were identified in our dataset using pFam PF00577.19 (usher protein) and clustered using usearch59 with the command “usearch -cluster_fast -id 0.9”. The taxonomic profile of each fimbriae cluster was determined based on the taxonomy of organisms encoded by that cluster, and relative abundance and iRep associations with pre-NEC vs. control samples were calculated using the Wilcoxon rank-sum test applied to all bacterial genomes encoding each cluster. A similar procedure was performed using genomes which were not classified as organisms of interest but did encode fimbriae cluster 49, comparing between pre-NEC and control samples and between all samples from NEC infants and all samples from control infants (Supplemental Figure S6).
A phylogenetic tree was made in order to establish the type of usher proteins identified in our study. Three reference sequences from each previously established type24 were aligned with three representatives of each of our clusters using MAFFT. All columns with gaps in over 50% of sequences were trimmed from the alignment, IQtree was used with default parameters to generate a phylogenetic tree74, and tree annotation was performed using iTol73.
Effect size calculations
To determine when signals first become apparent relative to NEC diagnosis, control samples were compared to samples collected over different sliding 3-day windows (Figure 6). To compare the signal at 5 days prior to NEC diagnosis, for example, a rarefied set of samples was chosen from 4-6 days prior to diagnosis where one sample from each infant that has a sample in that window was randomly chosen. This procedure was repeated 10 times, and the average effect size and 95% confidence intervals were plotted. The effect size was calculated based on the Wilcoxon rank-sum test statistic (as calculated by SciPy (scipy.stats.ranksums)75) using the formula: effect size = (test statistic / square root ((observations in population 1) + (observations in population 2))). For iRep all iRep values were compared between the two sets, for secondary metabolite gene clusters the total relative abundance of genomes encoding secondary metabolite gene clusters classified as producing sactipeptides, bacteriocins, or butyrolactones was compared, for Klebsiella the total relative abundances of all genomes classified as the genus Klebsiella were compared, and for Fimbriae cluster 49 the total relative abundances of all genomes encoding fimbriae cluster 49 were compared.
ACKNOWLEDGEMENTS
This research was supported in part by the National Institutes of Health (NIH) under awards RAI092531A and R01-GM109454; the Alfred P. Sloan Foundation under grant APSF-2012-10-05; and National Science Foundation Graduate Research Fellowships to M.O. under Grant No. DGE 1106400. The study was approved by the University of Pittsburgh Institutional Review Board (Protocol PRO10090089). This work used the Vincent J. Coates Genomics Sequencing Laboratory at UC Berkeley, supported by NIH S10 OD018174 Instrumentation Grant.