PT - JOURNAL ARTICLE
AU - Adam L. Bazinet
TI - Pan-genome and phylogeny of <em>Bacillus cereus sensu lato</em>
AID - 10.1101/119420
DP - 2017 Jan 01
TA - bioRxiv
PG - 119420
4099 - http://biorxiv.org/content/early/2017/03/22/119420.short
4100 - http://biorxiv.org/content/early/2017/03/22/119420.full
AB - Background Bacillus cereus sensu lato (s. l.) is an ecologically diverse bacterial group of medical and agricultural significance. In this study, I used publicly available genomes to characterize the B. cereus s. l. pan-genome and performed the largest phylogenetic analyses of this group to date in terms of the number of genes and taxa included. With these fundamental data in hand, it became possible to identify genes associated with particular phenotypic traits (i.e., “pan-GWAS” analysis), and to quantify the degree to which taxa sharing common attributes were phylogenetically clustered. Methods A rapid k-mer based approach (Mash) was used to create reduced representations of selected Bacillus genomes, and a fast distance-based phylogenetic analysis of this data (FastME) was performed to determine which species should be included in B. cereus s. l. The complete genomes of eight B. cereus s. l. species were annotated de novo with Prokka, and these annotations were used by Roary to produce the B. cereus s. l. pan-genome. Scoary was used to associate gene presence and absence patterns with various phenotypes. The orthologous protein sequence clusters produced by Roary were filtered and used to build HaMStR databases of gene models that were used in turn to construct phylogenetic data matrices. Phylogenetic analyses used RAxML, DendroPy, ClonalFrameML, Gubbins, PAUP*, and SplitsTree. The genealogical sorting index was used to assess the tree-based clustering of taxa sharing common attributes. Results The B. cereus s. l. pan-genome currently consists of ≈60,000 genes, ≈600 of which are “core” (common to at least 99% of taxa sampled).
Pan-GWAS analysis revealed genes that were associated with phenotypes such as isolation source, oxygen requirement, and ability to cause diseases such as anthrax or food poisoning. Extensive phylogenetic analyses using an unprecedented amount of data produced phylogenies that were largely concordant with each other and with previous studies. Phylogenetic support as measured by bootstrap probabilities increased markedly when all suitable pan-genome data was included in phylogenetic analyses, as opposed to when only core genes were used. B. cereus s. l. taxa sharing common traits and species designations exhibited varying degrees of phylogenetic clustering.
Abbreviations: ACLAME, A CLAssification of Mobile genetic Elements; AFLP, amplified fragment length polymorphism; BCSL, Bacillus cereus sensu lato; BLAST, Basic Local Alignment Search Tool; BP, bootstrap probability; GB, gigabytes; GWAS, genome-wide association study; HMM, hidden Markov model; HaMStR, Hidden Markov Model based Search for Orthologs using Reciprocity; LCB, locally collinear block; MAFFT, Multiple Alignment using Fast Fourier Transform; MGE, mobile genetic element; ML, maximum likelihood; MLST, multilocus sequence typing; MP, maximum parsimony; NCBI, National Center for Biotechnology Information; PANTHER, Protein ANalysis THrough Evolutionary Relationships; PATRIC, Pathosystems Resource Integration Center; PAUP*, Phylogenetic Analysis Using Parsimony *and other methods; PHYLIP, Phylogeny Inference Package; PRANK, Probabilistic Alignment Kit; RAM, random access memory; RAxML, Randomized Axelerated Maximum Likelihood; RF, Robinson-Foulds; RefSeq, Reference Sequence database; SNP, single nucleotide polymorphism; dDDH, digital DNA-DNA hybridization; gsi, genealogical sorting index; nt, nucleotides