PT - JOURNAL ARTICLE
AU - Daniel Bojar
AU - Rani K. Powers
AU - Diogo M. Camacho
AU - James J. Collins
TI - SweetOrigins: Extracting Evolutionary Information from Glycans
AID - 10.1101/2020.04.08.031948
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.04.08.031948
4099 - http://biorxiv.org/content/early/2020/04/09/2020.04.08.031948.short
4100 - http://biorxiv.org/content/early/2020/04/09/2020.04.08.031948.full
AB - Glycans, the most diverse biopolymer and crucial for many biological processes, are shaped by evolutionary pressures stemming in particular from host-pathogen interactions. While this positions glycans as being essential for understanding and targeting host-pathogen interactions, their considerable diversity and a lack of methods have hitherto stymied progress in leveraging their predictive potential. Here, we utilize a curated dataset of 12,674 glycans from 1,726 species to develop and apply machine learning methods to extract evolutionary information from glycans. Our deep learning-based language model SweetOrigins provides evolution-informed glycan representations that we utilize to discover and investigate motifs used for molecular mimicry-mediated immune evasion by commensals and pathogens. Novel glycan alignment methods enable us to identify and contextualize virulence-determining motifs in the capsular polysaccharide of Staphylococcus aureus and Acinetobacter baumannii. Further, we show that glycan-based phylogenetic trees contain most of the information present in traditional 16S rRNA-based phylogenies and improve on the differentiation of genetically closely related but phenotypically divergent species, such as Bacillus cereus and Bacillus anthracis. Leveraging the evolutionary information inherent in glycans with machine learning methodology is poised to provide further, critically needed insights into host-pathogen interactions, sequence-to-function relationships, and the major influence of glycans on phenotypic plasticity.