PT - JOURNAL ARTICLE
AU - Niels A. Zondervan
AU - Vitor A. P. Martins dos Santos
AU - Maria Suarez-Diez
TI - Predicting Mycoplasma tissue and host specificity from genome sequences
AID - 10.1101/2022.08.08.503189
DP - 2022 Jan 01
TA - bioRxiv
PG - 2022.08.08.503189
4099 - http://biorxiv.org/content/early/2022/08/11/2022.08.08.503189.short
4100 - http://biorxiv.org/content/early/2022/08/11/2022.08.08.503189.full
AB - To gain insights into the genotype-phenotype relationships in Mycoplasmas, we set out to investigate which Mycoplasma proteins are most predictive of tissue and host tropism and to which functional groups of proteins they belong. We retrieved and annotated 430 Mycoplasma genomes and combined their genome information with data on the host and tissue from which these Mycoplasmas were isolated. We assessed clustering of Mycoplasma strains from a wide range of hosts and tissues based on different functional groups of proteins. Additionally, we assessed clustering of a subset containing only M. pneumoniae strains, again based on different functional groups of proteins. We found that proteins belonging to the Gene Ontology (GO) Biological Process group ‘Interspecies interaction between organisms’ are most important for predicting the pathogenesis of Mycoplasma strains, whereas for M. pneumoniae, proteins belonging to ‘Quorum sensing’ and ‘Biofilm formation’ are most important for predicting pathogenesis. Two Random Forest classifiers were trained to accurately predict host and tissue specificity based on only 12 proteins. For Mycoplasma host specificity, the CTP synthase complex, the magnesium transporter MgtE, and the glycine cleavage system are most important for correctly classifying Mycoplasma strains that infect humans, including opportunistic zoonotic strains.
For tissue specificity, we found that a) the known virulence and adhesion factor Methionine sulphate reductase MetA is predictive of urinary tract infecting Mycoplasmas; b) an extracytoplasmic thiamine-binding lipoprotein is most predictive of gastro-intestinal infecting Mycoplasmas; c) a type I restriction endonuclease is most predictive of respiratory infecting Mycoplasmas; and d) a branched-chain amino acid transport system is most predictive of blood infecting Mycoplasmas. These findings can aid in predicting host- and tissue-specific pathogenicity of Mycoplasmas and provide insight into which proteins are important for specific host and tissue adaptations. Furthermore, these results underscore the usefulness of deploying genome-wide methodologies for gaining insights into pathogenicity from genome sequences.
Competing Interest Statement: The authors have declared no competing interest.
Abbreviations: GO, Gene Ontology; t-SNE, t-distributed Stochastic Neighbour Embedding.