Predicting genomic traits in ammonia-oxidizing archaea using phylogenetic signals

Phylogenetic conservatism of microbial traits has paved the way for phylogeny-based predictions, allowing us to move from descriptive to predictive functional microbial ecology. Here, we applied phylogenetic eigenvector mapping, an approach not previously used for microorganisms, to predict key traits of ammonia-oxidizing archaea (AOA), which are important players in nitrogen cycling. Using 168 nearly complete AOA genomes and metagenome assembled genomes from public databases, we predicted the distribution of 18 ecologically relevant genes across an updated amoA gene phylogeny, including a novel variant of an ammonia transporter found in this study. Of the selected genes, 94% displayed a significant phylogenetic signal and gene presence was predicted with >88% accuracy, >88% sensitivity, and >80% specificity. The phylogenetic eigenvector approach performed equally well as ancestral state reconstruction of traits. We implemented the predictive models on an amoA sequencing dataset of AOA soil communities and show key ecological predictions, e.g., that AOA communities in nitrogen rich soils have capacity for ureolytic metabolism while those adapted to low pH soils have the high affinity ammonia transporter (amt2). Predicting genomic traits can shed light on the potential functions that microbes perform across earth biomes, further contributing to a better mechanistic understanding of their community assembly.


Introduction
Genomic information is essential for trait-based approaches in microbial ecology, which can provide mechanistic explanations to ecological dynamics and ecosystem functions [1].Since many microbial functions can be directly inferred from genomic content and microbial traits are phylogenetically conserved [2,3], there is a framework for phylogeny-based trait predictions [4,5] in which we can use massive amounts of environmental sequence data to predict the potential functions that microorganisms perform in the environment.In this way, we can impute functions to taxa whose genomes are not yet available and characterize ecologically relevant groups that represent a minor fraction in metagenomes.Phylogenybased trait imputation have generally used either ancestral state reconstruction (ASR) by means of phylogenetic generalized least squares [6][7][8] or phylogenetic eigenvector maps (PEM) [9][10][11].In contrast to ASR, PEMs offer the additional advantage of accommodating different modes of evolution, as well as the possibility of using phylogenetic signals together with abiotic factors when predicting traits or species distributions [12].While PEMs have been used in a wide range of studies for macro-organisms, they have yet to be applied to microbial communities.Among microorganisms, ammonia-oxidizing archaea (AOA) are an optimal group to evaluate the power of PEMs for predictions of genomic traits.First, they are key players in the nitrogen cycle and inhabit most ecosystems on earth [13,14].Given their ecological relevance, AOA genomes and metagenome-assembled genomes (MAGs) are increasingly available, thus expanding our knowledge of the functions of archaeal genes [15][16][17].Second, there is a coherence between the organismal phylogeny and that of the amoA gene encoding the ammonia monooxygenase, which has long been used as a marker gene for AOA in environmental studies [18][19][20].This has resulted in the availability of a global amoA phylogeny [21] that reflects environmental preferences of the organisms at different scales [22][23][24].Whether or not gene content can be predicted using the amoA phylogeny, thus providing a basis for a more mechanistic understanding of AOA community assembly, remains uncertain.
The aim of this work was to predict genomic traits in AOA using the amoA phylogenetic signal.To reach this goal, we updated a recent amoA gene reference phylogeny of AOA [21] by adding 168 highly complete AOA genomes and MAGs available in public databases.The updated phylogeny was then used together with phylogenetic eigenvector mapping [11] to predict the presence of a set of genes (Table 1) belonging to four functional categories selected from a comparative genomics study [16], and a gene encoding a novel ammonia transporter found in the present study.We validated the predictions using hold-out validations and compared them to estimations based on ancestral state reconstruction.Finally, we implemented the predictive approach on AOA communities obtained from a field study [25], and linked the predicted AOA genes to soil properties by means of simultaneous analysis of environmental characteristics, species distributions, and species traits [26].This study revealed that PEMs are useful for highly accurate predictions of genomic traits in AOA and can inform about their potential functions in the environment and the mechanisms underpinning community assembly.

Update of the archaeal amoA reference phylogeny
We updated the amoA phylogeny developed by Alves et al. [21] by including all AOA genomes from isolates and metagenome-assembled genomes (MAGs) downloaded from the National Centre for Biotechnology Information (NCBI; https://www.ncbi.nlm.nih.gov/) and the Joint Genome Institute (JGI; https://genome.jgi.doe.gov/portal/), up to October 2021.
Sequences of 1527 archaeal genomes and MAGs belonging to phylum Nitrososphaerota were initially screened for the presence of amoA genes using HMMER (http://hmmer.org) and the translated alignment from Alves et al. [21] as seed alignment.Significant hits (e-value < 10 -6 ) were then aligned by amino acid to the original seed alignment using HMMER.An initial phylogeny of the significant hits, together with the 1190 sequences from Alves et al. [21] and the amoA sequences of seven ammonia-oxidizing bacteria to be used as an outgroup, was generated from the nucleotide alignment using FastTree 2. 1 [27] to ensure the correct identification of AOA amoA genes.The alignment and tree were then inspected using the ARB software [28] to correct alignment errors and remove fragmented or poor quality (i.e.multiple 'N's) sequences, resulting in a total of 457 genomes and MAGs with amoA genes.
We further discarded genomes/MAGs with <80% completeness or >5% contamination as determined by BUSCO [29] as well as those with identical amoA sequences and the same presence/absence values of the selected genes (see below) to remove duplicate taxa.At the end, amoA sequences from 168 high-quality genomes and MAGs were added to the original alignment of Alves et al. [21], and the total alignment was used to build maximum likelihood phylogenies using the IQTree software [30].Tree search was carried out using ten independent tree searches, in which the first half used the default parameters while, for the second half, the 'nbest' and 'nstop' arguments were increased to 10 and 500, respectively.
Node support for all trees was determined using ultrafast bootstrap approximation (1000 replicates).The ten trees were checked individually, and the tree with both the highest likelihood and coherence with the Alves et al. [21] topology was selected as the final reference tree.The model of the best tree was GTR+F+R10.The display of the phylogenetic trees for the present article was produced using iTOL [31].

Selection of genes for predictions and screening of genomes
We selected 18 genes for phylogenetic modelling (Table 1) based on the comparative genomic study by Kerou et al. [16].These genes were i) distributed across different AOA lineages to avoid genes only present in an isolated clade within the phylogeny, and ii) involved in pathways relating to different ecological functions and especially relevant to soil AOA.The selected genes belonged to four functional categories: nitrogen metabolism (amt and ureC genes), carbon and amino acid metabolism (metE, proDH, rocA genes), chemotaxis and motility (cheA, cheY, flaK, flaI, tadC genes) and environmental adaptation (ipct, nhaP, trk and cspC genes).We downloaded the orthologous gene protein alignment from each of these genes from the EggNOG 5.0 database (http://eggnog5.embl.de)using the arCOG ID reported by Kerou et al. [16].When more than one arCOG was provided, we used each of them separately.With the alignments obtained from the EggNOG database, the 168 AOA genomes/MAGs were screened for the presence/absence of each gene using HMMER.To ensure that gene hits were accurate, we retrieved the protein sequences found on the genomes/MAGs, aligned them with the reference protein alignment from the EggNOG database and constructed phylogenies using the fastTree2 software [27].The location of the hits relative to the reference sequences within the phylogenies was examined to make sure that they were in fact hits of the searched genes and not artefacts or other homologues with different functions.In the specific case of the ammonium transporter gene (amt), and due to its ecological relevance, the gene phylogeny was used to further classify the hits as Amt2 (high affinity) or Amt1 (low affinity) type of transporter [32][33][34].The hits falling together with Nitrosocosmicus taxa in the phylogeny were assigned the low affinity transporter type (Amt1), as all sequenced Nitrosocosmicus taxa encode uniquely one low affinity Amt [35][36][37][38].It should be noted here that other studies have used the reverse nomenclature, in which Amt1 and Amt2 are defined as high and low affinity transporters, respectively [39][40][41].To clarify these inconsistencies in the nomenclature, we screened the Amt hits of our study for the primers used by Nakagawa and Stahl [33] for both types of transporters, and found that the amt2 primers from Nakagawa and Stahl [33] matched the genes that were defined as amt1 in Offre et al. [39] and vice-versa.Therefore, we refer here to the gene encoding the high affinity transporter as amt2 [32,33], which corresponds to the amt1 in Offre et al. [39].The procedure of checking the amt hits using phylogenies allowed us to find a third type of ammonium transporter uniquely present in the Nitrosocaldales lineage (Supplementary Text 1; Supplementary Fig. 1-2; Supplementary Table 1).The gene names and arCOG references are provided in Table 1.

Test of phylogenetic signal for binary traits
We tested the phylogenetic signal of each gene using the phylo.dfunction from the caper package [42], running 1000 permutations.This method determines the strength and the statistical significance of the phylogenetic signal for binary traits compared to random and Brownian motion distributions of the trait [43].The resultant parameter (D) equals 1 when the binary trait has a random distribution across the phylogeny, and 0 if under Brownian motion.D values can be < 0 if the trait distribution is more clustered than expected under Brownian motion, and > 1 if it is more overdispersed than expected by random.The phylo.d function tests D for significant departure from 1 when the trait is phylogenetically conserved, and a significant departure from 0 when the trait does not follow a Brownian evolutionary model.

Gene predictions using phylogenetic eigenvectors and comparison to ancestral state reconstruction
The input for the gene predictions was the final amoA phylogenetic tree and the matrix of presence/absence of the 18 genes across the 168 AOA genomes/MAGs.For gene predictions, we used phylogenetic eigenvectors obtained from the phylogenetic tree and compared the prediction results to those obtained by ancestral state reconstruction.In both cases, we predicted the presence of each gene in the taxa across the amoA phylogeny that had unknown genomes.
The phylogenetic eigenvector-based predictions were done in three steps: 1) decompose the phylogenetic tree into eigenvectors, 2) fit and regularize individual predictive models for each gene using the eigenvectors as the descriptors, 3) estimate the presence/absence of the genes from the models of step 2 on the taxa with unknown genomes given their locations in the phylogeny.To decompose the phylogenetic tree into eigenvectors (step 1), we used R package MPSEM [44].The input for the MPSEM package is the phylogenetic tree containing all tips, i.e., taxa with and without gene information.The MPSEM package calculates an influence matrix using only the tips of the tree for which there is information about presence/absence of genes by pruning from the tree the taxa without trait information.From the influence matrix, phylogenetic eigenvectors were obtained by singular value decomposition.When calculating the phylogenetic eigenvectors, we fixed argument a=0 to assume a Brownian motion evolution of all traits.In step 2, the phylogenetic eigenvectors were then used as fixed factors in a multiple logistic regression whose coefficients were regularized using elastic net regularization, which combines L2 (ridge regression) and L1 (LASSO) penalties on the model coefficients.For model regularization, we used the R package glmnet [45] with arguments family="binomial" and α = 0.5 to use same amounts of both L1 and L2 shrinkage.The penalization hyper-parameter (l) was tuned using leave-oneout cross validation within the training dataset and choosing the l value that provided the highest accuracy of the predictions.The outcome of step 2 is a regularized model that predicts the probability of gene presence given the eigenvectors' values.We classified the probabilities of the predictive model into presence or absence by choosing a threshold that maximizes both true positive (sensitivity) and true negative rate (specificity) of the predictions.For this, we created ROC curves using the R package pROC [46] with the function roc, and used the function coords to select a threshold for classification that would render the highest Youden's J statistic, where J = sensitivity + specificity -1.Probabilities that had values equal to or greater than the selected threshold were classified as presence, whereas probabilities lower than the threshold were classified as absence.The final output of step 2 is a model with tuned l and classification threshold parameters that predict the gene presence for a taxon given its phylogenetic eigenvector scores.We then obtained the eigenvector scores for taxa with unknown genomes and used them on the model of step 2 to predict the gene presence.Function getGraphLocations found in MPSEM places the taxa with unknown genomes in the initial influence matrix of step 1 and function Locations2PEMscores obtains the phylogenetic eigenvector scores.It is important to note that the influence matrix and phylogenetic eigenvectors are not obtained using all tips of the tree, but only the ones with known gene information.For taxa with unknown information, we calculated the eigenvector score values, which are projections of new influence matrix coordinates on the eigenvectors obtained from the initial influence matrix (see Guénard et al. [11], for details on this procedure).Thus, the number of phylogenetic eigenvectors of the training dataset, and therefore the number of coefficients on the predictive model, is independent of the number of taxa to be predicted.The code can be used for analysis of other datasets by providing the phylogenetic tree and its associated binary trait table with and without missing values (NAs) in the input data.
The ancestral state reconstruction was done with R package picante R [47].We used the phyEstimateDisc function to predict the genes of the AOA taxa with unknown genomes.In this procedure, for each taxon with unknown trait data, the phylogenetic tree is rerooted on the most recent ancestor common to the unobserved taxon and the rest of the phylogeny.The trait of the unobserved taxon is then estimated from the ancestral state reconstruction of the root of the rerooted phylogeny [6,48].The function phyEstimateDisc provides a trait state, i.e., presence or absence, for a given threshold (default = 0.5), as well as a value for the statistical support of the trait state.

Validation of the predictive models
We validated the predictions of each gene using a 20% hold-out validation.This procedure consists of randomly removing 20% of the initial dataset before creating the predictive model and validating it on the removed samples.We repeated this procedure 500 times, always randomizing the taxa included in the held-out dataset.For each gene, we obtained the mean accuracy, sensitivity, and specificity of the prediction.We also varied the proportion of taxa to be held out in the validations to 30% and 40%, and obtained accuracies of 87.1% and 86.4%, respectively (compared to >88% accuracy at 20% hold out).We could not increase the proportion of data to be held out because many genes were class unbalanced.Both sensitivity and specificity were obtained using R package pROC.
We validated the accuracy of the predictions at the genome/MAG level using leave-one-out cross validation, in which we deleted each genome/MAG from the dataset in turn and used the rest of the genomes to predict the gene content of the previously removed genome or MAG.

Implementation of predictive modelling on natural communities: a case study
To illustrate how predicting AOA traits could help scientists link community composition, predicted traits, and environmental properties, we used data from a previous study that characterized AOA communities by amoA amplicon sequencing on 50 sampling points across an agricultural area in which soil properties were measured (see Enwall et al. [49] and Jones & Hallin [25] for more information).The study site is a 44 hectare farm divided into 14 fields, with sampling points taken at 51 locations throughout the field based on environmental gradients identified in a previous study [50].We deleted one of the locations (S17) because it lacked data on soil properties.We predicted the presence/absence of the 17 genes showing significant phylogenetic signal on the AOA communities and linked the gene composition with the soil properties that were reported by Enwall et al. [49].
To predict the traits of the AOA members of the soil communities, we placed the amoA sequences of each OTU (162 in total) on the reference phylogeny using the EPA-ng [51] (accessed at https://github.com/pierrebarbera/epa-ng)and gappa software [52] (accessed at https://github.com/lczech/gappa),therefore obtaining a new phylogenetic tree containing all sequences of the updated reference phylogeny and the representative OTU sequences to be predicted as pendant branches.We used this phylogenetic tree to implement the phylogenetic eigenvector modelling described above.The output was a matrix with presence/absence of all 17 genes for each OTU in the study.
We studied the link between the predicted genes and the soil properties mediated by community composition by performing a RLQ analysis followed by a univariate fourth-corner analysis [26,53,54].RLQ is an ordination method that displays the covariation of traits, i.e., the predicted genes, and environmental properties providing site and species scores, and a global test for significance.The fourth-corner correlations, on the other hand, provide tests of single associations between genes and environmental properties.The two methods are complementary and can be performed sequentially.Following Dray et al. [26], we calculated three separate ordinations for the RLQ analysis: a correspondence analysis for the community data, a principal component analysis after standardization of the variables in the environmental data matrix, and a principal component analysis for the predicted genes presence/absence without standardization.The three ordinations were analyzed together using the rlq function of the ade4 package for R [55].The RLQ analysis was tested for significance using the randtest function of the ade4 package with argument modeltype = 6 for the permutation test.This model performs two sequential permutational tests, a first one testing the link between species distribution and environmental conditions (model 2), and a second testing the link between species distribution and traits (model 4).When both tests are significant, the highest of the two P-value provides the statistical significance for the global tests of association between traits and environment [56].
To test which specific gene was associated with each soil property, we performed a fourthcorner analysis using the fourthcorner function from the ade4 package, with modeltype = 6, 99 999 permutations, and "false discovery rate" as the multiple testing correction method for the P-value.

Congruence between amoA phylogeny and content of genes for predictions
To perform the phylogenetic eigenvectors-based predictions, we first updated the archaeal amoA gene reference phylogeny from Alves et al. [21], which now contains 1190 unique amoA sequences and 168 highly complete AOA genomes of isolates and MAGs distributed across most amoA lineages (Supplementary Fig. 3).We screened the genomes and MAGs for the presence/absence of a set of ecologically relevant genes selected for modelling [16] (Table 1).We found a negative association between Nitrososphaerales (NS) taxa and the presence of genes associated with motility (tadC, with the exception of NS-a), osmotic regulation (trk), and thermoadaptation (cpsC).The metE gene, responsible for methionine synthesis in energylimiting environments [16] was found in Nitrosopumilales (NP) clades associated to deep sea waters (NP-a and -q), as well as in Nitrosocaldales (NC) and Nitrosotaleales (NT).Genes encoding the high-and low-affinity ammonium transporters (Amt2 and Amt1, respectively [32,33], see details in Methods) were identified based on their positions within a phylogenetic tree of translated Amt sequences [39].When doing this, we identified a gene encoding a novel variant of the ammonia transporter protein specifically associated with the NC amoA lineage, hereafter referred to as Amt-NC.All taxa having the amt-NC gene, also had the amt2 (Fig. 1a).We found two subgroups of the ammonium transporter gene amt-NC, hereinafter amt-NC.1 and amt-NC.2(Supplementary Fig. 1), which corresponded to the groups into which Luo et al. [17] divided the NC lineage based on the concatenation of 122 archaeal genes (Supplementary Text 1).The amino acid composition of the Amt-NC was identical to the Amt types described by Offre et al. [39] at the ammonium-binding sites, i.e. they contained the same histidine lining the transporter pore [39] and had the same amino acids in several conserved loci (Supplementary Fig. 2).However, it differed in several loci across the regions described by Offre et al. [39] (Supplementary Fig. 2).
All selected genes except ipct were phylogenetically conserved and did not depart from a Brownian motion model (Table 1).The distance between the 168 isolate genomes and MAGs in a principal component ordination including all 18 genes reflected the lineage classification of the AOA taxa (Fig. 1b), supporting a coherence between the amoA phylogeny and the overall gene content of available genomes and MAGs.

Phylogeny-based predictions of selected genes
We used the gene content and phylogenetic relatedness of the 168 genomes and MAGs to build predictive models of gene presence.For each of the 18 genes, we used elastic net regularized regressions, with the PEMs as predictors and the presence of each gene as response.The models with optimized lambda penalization and classification threshold parameters (Supplementary Table 2) were then used to predict gene presence across the amoA phylogeny using the phylogenetic eigenvector scores of unobserved taxa as input.The predicted presence of most genes varied across and within amoA lineages.For example, genes encoding the high-affinity ammonia transporter (amt2) was predicted to be present in nearly all lineages except two NS groups, NS-e and NS-z.In contrast, the low affinity ammonia transporter (amt1) and urease (ureC) genes were predicted to be within most NS clades, yet were more unevenly distributed or absent across NP clades (Fig. 2).Motility genes showed the highest within-clade variation, where presence of genes related to archaellum formation (flaK and flaI) varied between taxa of the NP-g clade (Fig. 2), while nearly all lineages except NP-e and -a were predicted to have the cheY gene, encoding a response regulator associated with chemotaxis.The gene responsible for B-12 independent methionine synthesis (metE) was predicted to be present in all NC and in most NT taxa.Across NP clades, metE was predicted to be present in NP-a and -b, as well as in most NP-q and some NP-h, -e, and -g (Fig. 2).Genes associated with osmotic regulation (nhaP, trk) and thermoadaptation (cpsC) were also predicted to be present more often in NP clades and less often in NS (Fig. 2).
The PEM-based predictions of gene presence for all genes resulted in an average of 88.3% accuracy, 88.1% sensitivity, and 81.5% specificity based on a 20% hold-out-validation (Table 1).Ancestral state reconstruction of traits resulted in similar levels of accuracy, sensitivity and specificity of predictions (89.6%, 90.5%, and 82.0%respectively; Table 1).For both methods, the predictive accuracy increased linearly with the strength of the phylogenetic signal (R 2 = 0.64 for both the PEM and ancestral state reconstructions, Supplementary Fig. 4).

Link between predicted genes and soil properties; a case study
To exemplify how gene predictions can contribute to trait-based studies, we placed the amoA sequence of the members of 50 AOA communities from arable land in the updated amoA reference phylogeny, predicted the gene presence among these amoA-based OTUs (Fig. 3a), and tested the link between predicted genes and soil properties by performing RLQ [53] and fourth-corner [54] analyses.The first RLQ axis showed that OTUs predicted to have genes encoding the high-affinity ammonia transporter (amt2) and chemotaxis response (cheY2) and lacking the nhaP gene, were more associated with sites with lower pH (pH = 5.7 -6.0 in sites S19, 21, 23, and 28; Fig. 3 b, c).The second RLQ axis highlighted two specific genes, i.e., the low-affinity ammonia transporter (amt1) and the ureC, which were most closely associated to sites with higher levels of total nitrogen and carbon, higher ammonium content, and more moderate soil pH (pH = 6.1 -6.2 in sites S29, 31, 34, 37; Fig. 3b, c).When testing the univariate associations of each gene and soil property using the fourth-corner approach, we found a negative correlation between the potential ammonia oxidation activity in soils and the presence of the high-affinity ammonia transporter (P = 0.04, after false discovery rate correction; Supplementary Fig. 5), and a positive correlation between the presence of the ureC gene and total nitrogen (P = 0.06, after false discovery rate correction; Supplementary Fig. 5).

Discussion
Predicting genomic traits can inform about key functions that microorganisms perform in the environment without access to their full genomes.In this study, we predicted genes of ammonia-oxidizing archaea (AOA) with >88% accuracy using the amoA phylogenetic signals implemented in phylogenetic eigenvectors.The overall distribution of our isolate genomes and MAGs across the phylogeny makes us confident that these predictions are reliable, particularly for AOA belonging to clades with good genome representation (see Supplementary Table 3).While genome and MAG incompleteness may limit inferences of microbial functions [57], we did not find any association between predictive accuracy and genome completeness (Supplementary Fig. 6a).Only a weak and non-significant trend suggesting that presence of genes in highly complete genomes were predicted with lower sensitivity and higher specificity than less complete genomes was observed (Supplementary Fig. 6b,c).Thus, with more incomplete genomes in the training datasets, predictions of gene presence may be more reliable than predictions of gene absence.
The high prediction accuracy can be explained by the strength of the phylogenetic signal of the screened genes.More than 90% of the screened genomic traits were conserved phylogenetically, in line with previous studies [16,19], and the strength of the phylogenetic signal was positively correlated with the accuracy of the predictions [5,8].As an example, the amt-NC and ureC gene, which had the strongest and weakest phylogenetic signal, respectively, displayed the highest and lowest accuracy of predictions, respectively.The differences in phylogenetic signal between genes is likely to be the evolutionary result of niche specialization.The amt-NC gene, found here for the first time, encodes a variant of ammonia transporter present only in Nitrosocaldus AOA inhabiting thermal waters (Supplementary Text 1), and is therefore localized in a single clade within the amoA phylogeny.By contrast, the ureC gene tends to be phylogenetically dispersed.In agreement, soil AOA thrive in conditions in which ammonia is supplied slowly through mineralization of organic matter [58], and urease genes are abundant in Thaumarchaeota communities in oligotrophic marine environments [59].The high accuracy of the predictions may also be the result of the availability of AOA genomes and MAGs across the amoA phylogeny.
Accordingly, predictions on genomes and MAGs belonging to the NS-z and NC-a, NP-e, and NP-a clades had the highest accuracies, and these AOA lineages were well-represented by genomes with close phylogenetic relatedness to the within-clade available relatives [5,60] (Supplementary Table 3).These findings support previous claims that phylogenetic modelling of microbial traits should be restricted to groups of organisms with good genomic representation in the databases [4,61].
Genomic trait predictions can provide a mechanistic understanding of community assembly.
By implementing the gene predictions in multivariate models of AOA communities in soils with varying physical and chemical properties, we show that AOA communities adapted to low pH soils are more likely to have the high-affinity ammonia transporter (amt2) and chemotaxis gene cheY2.This makes sense given the low availability of ammonia at low pH, and is in line with previous studies showing down-regulation of bacterial chemotaxis genes at high pH [62].By contrast, high pH sites were associated with OTUs predicted to have the nhap gene encoding a protein related to Na + /H + antiporters, which are more important for homeostasis under neutral or alkaline conditions [63].Resource rich sites with high amounts of total nitrogen and ammonium as well as soil organic carbon, located in areas with higher organic N addition [22], were related to the dominance of AOA with predicted potential for ureolytic metabolism (ureC).Nitrogen addition may increase the abundance of ureC genes [64], which in turn is associated with high amounts of ammonium as a result of urea hydrolysis [65].Higher potential ammonia oxidation activity was linked to taxa with the low affinity ammonium transporter, likely because the high ammonium concentrations used in the activity assay favored the ammonia oxidizers with low ammonia affinity or those that prefer inorganic to organic N sources, including ammonia-oxidizing bacteria [66].These patterns of AOA community assembly can have key implication for ecosystem functions, such as CO2 emission rates, as a result of urea hydrolysis [67].Overall, our predictions on soil AOA communities show that the combination of genomic trait predictions with RLQ and fourthcorner analyses can shed light on mechanisms of community assembly and this approach can be expanded to other microbial groups than AOA to decipher ecological dynamics.
Phylogenetic eigenvector mapping can complement the broadly used ASR when performing phylogeny-based modelling.We show that both methods have similar values of accuracy, sensitivity and specificity of the predictions, and accuracy values between these two methods across all genes were highly correlated (R 2 = 0.97).Based on co-inertia testing, both predictions across the phylogeny were associated with a high RV coefficient value, i.e., a correlation metric between two multivariate sets of variables [10], of 0.81 (data non shown).
When setting the threshold for classifying probabilities at 0.5 in our PEM models, which is the default for ASR in the R package picante [47], the RV increased to 0.89.The use of either PEM or ASR depends on the role of phylogeny in the analyses [68,69].When estimating microbial traits using only phylogenetic signals, both methods give similar results, with ASR not requiring the extra step of selecting or regularizing a model [8].On the other hand, PEM can be used when modelling phylogenetic signal together with other factors, e.g.abiotic variables or gene co-occurrences, which is useful when estimating traits on partially incomplete databases [70,71].In addition, the procedure used by the MPSEM R [44]package to calculate phylogenetic eigenvectors can use phylogenetic networks as input.Thus, PEM can potentially be used for microorganisms for which horizontal gene transfer occurs, as well as for organisms that undergo reticulate speciation or hybridization [72].Finally, the sets of latent descriptors produced by PEM can be used with other modelling approaches, such as support vector machines, gradient boost machines or artificial neural networks.Although some studies report a better performance of ASR over PEMs [73], perhaps because of model selection issues, the present study shows that both can provide similar accuracies, while PEM constitutes a more versatile tool for phylogenetic modelling.
In this study, we show that we can move towards genomic trait prediction in microbial ecology.Whereas the absence of many microbial genomes can limit the implementation of trait-based studies in microbial ecology, many microbial genes are conserved phylogenetically, and their presence can be predicted using such tools as phylogenetic eigenvectors and ancestral state reconstruction.Predictive modelling of microbial functions can provide useful information to understand how evolution shapes the genetic content of microorganisms, how that determines their distribution in the environment, and how that ultimately impacts ecosystem functions.illustrative purposes (values in between brackets), the values were not used to calculate the overall accuracy, sensitivity, and specificity of the predictions because the ipct gene did not have a significant phylogenetic signal.‡ amt-NC refers to a type of ammonia transporter gene that was found in this study to be present only in the Nitrosocaldales lineage (NC).
27   162 Operational Taxonomical Units (OTUs) in circles and plus (+) signs, respectively.OTU symbols were jittered for visualization purposes.(c) Biplots of the RLQ analysis displaying association between predicted genes (red) and observed soil properties (blue).PAO refers to potential ammonia oxidation rates.The genes whose predicted values were all 0 or 1 were not included in the RLQ analysis.The global P-value associated with the RLQ analysis was 0.009.

Figure 1 .
Figure 1.Distribution of genomic traits in AOA genomes.(a) The presence/absence of 18 ecologically relevant genes (Table 1), across a phylogeny of 168 AOA isolate genomes and MAGs.Colors in outer rings denote presence (filled squares) and absence (empty squares) of specific genes.Clades within each AOA genus are denoted by Greek letters and the tree is rooted on a collapsed gray clade of bacterial amoA.(b) Principal component analysis of the 168 genomes and MAGs based on a presence-absence matrix for the 18 selected genes.

Figure 2 .
Figure 2. Predictions of gene presence/absence across the archaeal amoA reference phylogeny.Colors in outer rings denote predicted presence (filled squares) and absence(empty squares) of specific genes (see Table1for definition).Red circles are the isolate genomes and MAGs with > 80% completeness and < 5% contamination.Clades within each AOA genus are denoted by Greek letters and the tree is rooted on a collapsed gray clade of bacterial amoA.

Figure 3 .
Figure 3. Association of predicted genes of AOA communities with soil properties across 44 ha of arable land.(a) Placements of the 162 Operational Taxonomical Units (OTUs) on the amoA reference phylogeny.Each circle represents an OTU whose size is proportional (fifth root) to their relative abundance.Colors in outer rings denote predicted presence (filled squares) and absence (empty squares) of specific genes for each OTU (seeTable 1 for definition).None of the OTUs belonged to Nitrosopumilales nor Nitrosocaldales, and

Table 1 . Genes selected for phylogenetic modelling, statistics for phylogenetic signal test, and validation results using phylogenetic eigenvectors and ancestral state reconstruction.
D is the phylogenetic signal strength parameter, where values less than 0 indicates phylogenetic conservatism and values close to 1 indicate random distribution of gene value across the phylogeny.† D ¹ 1 means departure from a random distribution, whereas D ¹ 0 means departure from a Brownian motion.Although we validated the predictions for the ipct gene for *