Annotation-free prediction of microbial dioxygen utilization

Aerobes require dioxygen (O2) to grow; anaerobes do not. But nearly all microbes — aerobes, anaerobes, and facultative organisms alike — express enzymes whose substrates include O2, if only for detoxification. This presents a challenge when trying to assess which organisms are aerobic from genomic data alone. This challenge can be overcome by noting that O2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes, encode more O2-utilizing enzymes, and tend to use different amino acids in their proteins. Here we show that these effects permit high-quality prediction of O2 utilization from genome sequences, with several models displaying >70% balanced accuracy on a ternary classification task wherein blind guessing is only 33.3% accurate. Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well. We discovered that simple and efficient models based entirely on genome sequence content — e.g. triplets of amino acids — perform about as well as intensive annotation-based algorithms, enabling the rapid processing of global-scale sequence data to predict aerobic physiology. To demonstrate the utility of efficient physiological predictions we estimated the prevalence of aerobes and anaerobes along a well-studied O2 gradient in the Black Sea, finding strong quantitative correspondence between local chemistry (O2:sulfide concentration ratio) and the composition of microbial communities. We therefore suggest that statistical methods like ours can be used to estimate, or “sense,” pivotal features of the environment from DNA sequencing data. Importance We now have access to sequence data from a wide variety of natural environments. These data document a bewildering diversity of microbes, many known only from their genomes. Physiology — an organism’s capacity to engage metabolically with its environment — may provide a more useful lens than taxonomy for understanding microbial communities. As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without first annotating genes, e.g. by considering only the amino acids in protein sequences. Annotation-free algorithms enabled rapid characterization of natural samples, demonstrating a quantitative correspondence between sequences and local O2 levels. These results suggest that DNA sequencing can be repurposed as a multi-pronged chemical sensor, estimating concentrations of O2 and other key facets of complex natural settings.


Abstract (250 words)
Aerobes require dioxygen (O 2 ) to grow; anaerobes do not.But nearly all microbesaerobes, anaerobes, and facultative organisms alikeexpress enzymes whose substrates include O 2 , if only for detoxification.This presents a challenge when trying to assess which organisms are aerobic from genomic data alone.This challenge can be overcome by noting that O 2 utilization has wide-ranging effects on microbes: aerobes typically have larger genomes, encode more O 2utilizing enzymes, and tend to use different amino acids in their proteins.Here we show that these effects permit high-quality prediction of O 2 utilization from genome sequences, with several models displaying >70% balanced accuracy on a ternary classification task wherein blind guessing is only 33.3% accurate.Since genome annotation is compute-intensive and relies on many assumptions, we asked if annotation-free methods also perform well.We discovered that simple and efficient models based entirely on genome sequence contente.g.triplets of amino acidsperform about as well as intensive annotation-based algorithms, enabling the rapid processing of global-scale sequence data to predict aerobic physiology.To demonstrate the utility of efficient physiological predictions we estimated the prevalence of aerobes and anaerobes along a well-studied O 2 gradient in the Black Sea, finding strong quantitative correspondence between local chemistry (O 2 :sulfide concentration ratio) and the composition of microbial communities.We therefore suggest that statistical methods like ours can be used to estimate, or "sense," pivotal features of the environment from DNA sequencing data.

Importance (150 words)
We now have access to sequence data from a wide variety of natural environments.These data document a bewildering diversity of microbes, many known only from their genomes.
Physiology -an organism's capacity to engage metabolically with its environmentmay provide a more useful lens than taxonomy for understanding microbial communities.As an example of this broader principle, we developed algorithms that accurately predict microbial dioxygen utilization directly from genome sequences without first annotating genes, e.g. by considering only the amino acids in protein sequences.Annotation-free algorithms enabled rapid characterization of natural samples, demonstrating a quantitative correspondence between sequences and local O 2 levels.These results suggest that DNA sequencing can be repurposed as a multi-pronged chemical sensor, estimating concentrations of O 2 and other key facets of complex natural settings.

Introduction
Dioxygen (O 2 ) is a hugely consequential molecule for the biosphere.Aerobic respiration yields a tremendous amount of energy and is the most common bioenergetic mode in cells across Earth's surface environments.Yet O 2 is also highly reactive, presenting challenges to organisms that encounter it (1).Aerobes require O 2 to grow; anaerobes do not.But nearly all microbesaerobes, anaerobes and facultative organisms alikeexpress enzymes whose substrates include O 2 , if only for detoxification.This presents a challenge when trying to assess which organisms are aerobic from genomic data alone (2,3).
Aerobes, however, are different from anaerobes, and these differencesthough subtleare legible in their genomes.Aerobes tend to have larger genomes (4) with more O 2 utilizing enzymes (2,5) and usually belong to specific phylogenetic groups (4).Conversely, anaerobes make use of diverse fermentation pathways to conserve energy in low O 2 settings (3, 5).These differences have been used to predict O 2 utilization from the genome with reasonable accuracy (2,6,7).
Classification of microbial O 2 utilization typically relies on intensive preprocessing where, for example, O 2 utilizing enzymes are identified by sequence homology (2) or a full metabolic network is reconstructed (6).This processing is costly and limited by our very incomplete knowledge of the relationship between protein sequence and function.We therefore asked whether high-quality classification can be achieved without annotation, instead using DNA and protein sequences directly to build classifiers.
The input for classification is a genome and the output is an O 2 utilization category.Here we considered the ternary classification problemcategorizing a genome as an (i) obligate aerobe, (ii) obligate anaerobe, or (iii) a facultative aerobe (6).Due to limited data, accuracy is typically evaluated by cross-validationrepeatedly reserving small subsets of training data for evaluation (2,6,7).We developed a more stringent approach, training classifiers on a compendium of ≈3300 genomes with known O 2 utilization (8), and evaluating accuracy using an independent dataset (2).

Observation
Genomes can be processed in different ways to enable classification (Fig 1A ).A typical intensive pipeline begins with identification of protein coding sequences (ORF prediction) followed by homology-based annotation of gene functions (2,7).Further processing is sometimes performed, for example, by constructing a metabolic network based on annotated genes (6).Processed data are then used to train classifiers that predict attributes like carbon source preference (9) or O 2 utilization from genomic features.Features can include the counts of annotated gene functionse.g., 1 benzoate dioxygenase, 2 heme oxygenases, etc., (2)or the suite of molecules that could be produced by annotated enzymes (6,9).
Avoiding annotation and working instead with nucleotide (NT) and amino acid (AA) sequences elides most pre-processing steps.Reducing processing can simplify pipelines, remove assumptions, and substantially improve runtime (Fig. 1A, inset).One simple way of representing patterns in NT and AA sequences is by counting K-merssubstrings of length K (10,11).A more complex, and potentially valuable, annotation-free approach uses advances in machine learning to summarize ("embed") protein sequences in vectors of fixed dimension for training classifiers (12,13).
Comparison of model accuracies revealed that the best-performing classifier involved genome annotation (14), predicting microbes' O 2 utilizationobligate aerobe, obligate anaerobe or facultativefrom counts of annotated gene functions (KEGG orthogroups, Fig. 1C).This model displayed 74% validation accuracy as calculated with a balanced metric that accounts for unequal representation of O 2 utilization classes in the validation set.Yet we found that several annotation-free classifiers also exceeded 66% validation accuracy (2x the accuracy of blind guessing), including models based on counts of AA triplets (70%), NT 5-mers (67%), and protein sequence embeddings (70%).
Counting AA triplets is far more efficient than annotating genomes, which accelerated our evaluation of O 2 utilization in environmental samples.To demonstrate the utility of rapid characterization, we analyzed ≈50,000 metagenome assembled genomes (MAGs) assembled from Earth Microbiome Project samples (15).Consistent with expectations, samples of characteristically anaerobic habitats (e.g.rumen, 603/606 MAGs classified as anaerobes) contained a much larger proportion of anaerobic MAGs than settings with higher O 2 (e.g.freshwater lakes, 30/416, Fig. 2A).
To examine the quantitative relationship between local chemistry ([O 2 ]) and physiology (O 2 utilization), we next applied the AA 3-mer model along a well-studied natural O 2 gradient (Fig. 2B).Due to its unique hydrography and intense density gradient, the surface of the Black Sea mixes poorly with deeper waters, leading to a sharp transition from oxia near the surface to anoxic and sulfidic habitats at depth (16,17).As expected from these chemical transitions, the AA 3-mer classifier predicted sympathetic traces of O 2 utilization with depth, with aerobic MAGs dominating near the surface and anaerobes in deeper waters below the mixed layer (Fig. 2C).This correspondence was also quantitative, with the [O 2 ]/[H 2 S] ratio correlating strongly with inferred aerobe/anaerobe ratio, highlighting that chemical gradients might be "sensed" by statistical analysis of DNA sequences (Fig. 2D).

Discussion
As a step towards establishing DNA sequencing as a multiplexed chemical sensor for natural environments, we established a stringent pipeline for building classifiers of microbial O 2 utilization that works across the tree of life.Using this pipeline, we compared sophisticated annotation-based classifiers to simpler ones eliding genome annotation entirely.To our surprise, simple approaches based entirely on sequence content (e.g.amino acid triplet counts) performed nearly as well as intensive, annotation-based methods (Fig. 1C).
There are two mutually compatible explanations for the success of these naive approaches.First, O 2 utilization is correlated with phylogeny to a degree (4) and K-mer counts are a proxy for phylogenetic proximity (18,19).Indeed, NT K-mers are widely used to cluster DNA sequences during metagenome binning (20).Consistent with phylogeny predicting O 2 utilization, we achieved useful prediction accuracy with a classifier trained on machine-learned embeddings of ribosomal 16S sequences (Fig. S4).Further, it has been argued that the sequence content of the whole genome adapts to O 2 (21).If this is correct, we do not yet understand the biochemical logic underpinning this adaptation.For example, we attempted to summarize genomic sequence content with chemical metrics including elemental content (22) and the carbon redox state (21) of nucleotides and amino acids.A classifier based on these features displayed only 44% validation accuracy (Fig 1C).Nonetheless, a practical advantage of K-mer representations is that they contain both chemical and phylogenetic information, indicating that K-mers may be useful in simplifying model-based predictions of other complex physiologies (9).
Our exploration of diverse metagenomes indicated that the physical and chemical conditions of natural environments affects their sequence content in a legible way (Fig. 2).Indeed, we observed a quantitative correspondence between local chemistry ([O 2 ]/[H 2 S]) and the inferred aerobe/anaerobe ratio in the Black Sea (Fig. 2D), suggesting that the inverse problemestimating the concentrations of [O 2 ] and other key molecules from sequencesis tractable.
Perhaps sequencing data could serve as a "multi-sensor" of the local concentrations of several nutrients (e.g., O 2 , phosphate, nitrate, and sulfate) that play key roles in microbial growth and biogeochemical cycling.It is currently very challenging to characterize and monitor environmental chemistry at scale, whether for precision agriculture or for Earth system models.
Our results here suggest that sequencing-as-sensing is a scalable way forward.The best-performing models were ≈70% accurate at classifying out-of-sample genomes as aerobes, anaerobes or facultative.These included an annotation-free method trained on amino acid (AA) 3-mers (70%), and a compute-intensive approach relying on whole-genome annotation to identify known protein functional groups (as defined by the KEGG database, 74% accuracy).The dashed gray line marks the expected 33% performance of picking one of three categories at random.(A) We applied our AA 3-mer classifier to Earth Microbiome Project samples for which sufficient numbers of metagenome assembled genomes (MAGs) were available (n ≥ 10).Samples from environments with characteristically low O 2 levels (e.g.ungulate rumen, anaerobic digesters) displayed high anaerobe content, while, conversely, oxic surface environments predominantly hosted MAGs inferred to be aerobes (e.g.freshwater lakes).(B-D) The Black Sea is a well-studied stratified euxinic ecosystem with a long-lived systematic O 2 gradientoxygenated at the surface with a loss of O 2 and an increase in sulfide with depth.We drew depth-dependent O 2 and H 2 S concentrations as well as MAGs from (16).(C) We applied our annotation-free AA 3-mer model to 160 MAGs assembled in (16) to estimate the depth-dependent prevalence of aerobes and anaerobes.Consistent with O 2 and H 2 S profiles, aerobes were most prevalent near the surface and anaerobes most prevalent at depth.(D) The [O 2 ]/[H 2 S] ratio was strongly correlated with the inferred aerobe/anaerobe ratio on a log-log plot (Pearson  = 0.94, P < 10 -6 ) such that estimating the redox gradient from sequencing data resulted in less than threefold error over ≈6 orders (mean multiplicative error of 2.6-fold).

Black sea analysis
The relative abundances of metagenome-assembled genomes (MAGs) from metagenomic samples in the Black Sea (16) were generated in prior studies (25).Briefly, metagenomic samples were aligned to the previously assembled MAGs using bbmap.Alignments with a mapping quality above 10 were retained, converted to BAM format, sorted, and indexed using samtools.The relative abundance of each MAG was determined by the fraction of reads mapped to it, as summarized by samtools idxstats.This process was automated via a Python script, utilizing samtools v1.8 and bbmap.sh,and executed on the Resnick High-Performance Computing Center cluster at Caltech.

Earth Microbiome Project (EMP) analysis
O 2 utilization phenotypes of EMP metagenome assembled genomes (MAGs) from (15) were classified using the AA 3-mer model.Since EMP projects are categorized with an ad hoc nomenclature describing the environment sampled, we manually mapped tags to a simplified set of categories.For analysis, we removed MAGs with less than 50% estimated completeness, considered only samples from which at least 10 MAGs were assembled and only environmental labels (e.g."rhizosphere") for which at least 10 samples were available.This left 1598 samples and 31,279 MAGs under consideration.The data presented in Fig. 2C gives the fraction of MAGs that are inferred to be aerobes, anaerobes, and facultative in each habitat for samples meeting these criteria.

Fig. S2 :
Fig. S2: Training and test-balanced accuracies for all models trained.Models are grouped by whether or not they are annotation-full (purple, left) or annotation-free (blue, right).Within groups, they are ordered left-to-right by increasing balanced accuracy on the validation set (hatched bars).Notice that three of the top five performing approaches were annotation-free: genomic DNA 5-mers, amino acid 3mers, and genome embeddings.The dashed gray lines in the background mark the random guessing threshold (33% balanced accuracy) and twice that threshold (66% balanced accuracy) for ternary classification.See Methods for detailed description of the individual models.

Fig. S3 :
Fig. S3: Confusion matrices for three top-performing models.Confusion matrices depict the per-class accuracies.Notice that, in all cases, facultative organisms were the most difficult to accurately classify.(A) All gene families model; 74% total balanced validation accuracy, (B) AA 3-mer model; 70%, and (C) DNA 5-mer model (67%) as described in Methods.

Fig. S4 : 20 Fig. S5 :
Fig. S4: A predictor based on embedding 16S sequences performs similarly to other approaches, but with a distinct error profile.We used machine-learning driven DNA sequence embedding to develop a classifier based on 16S rRNA sequences.Due to constraints associated with our validation set, however, balanced accuracy is here calculated over a smaller, randomized testing set containing 99 examples (Methods).Accuracy values are therefore not directly comparable to Figs. 1, S2 or S3.