Machine learning classification by fitting amplicon sequences to existing OTUs

Courtney R. Armour; Kelly L. Sovacool; William L. Close; Begüm D. Topçuoğlu; Jenna Wiens; Patrick D. Schloss

doi:10.1101/2022.09.01.506299

Abstract

Machine learning classification using the gut microbiome relies on assigning 16S rRNA gene sequences to operational taxonomic units (OTUs) to quantify microbial composition. OTU abundances are then used to train a classification model that can be applied to classify new samples. The standard approaches to clustering sequences include reference-based and de novo clustering. Reference-based clustering requires a well-curated reference database that may not exist for all systems. De novo clustering tends to produce higher-quality OTU assignments than reference-based clustering, but the clusters depend on the sequences in the dataset, so OTU assignments change when new samples are sequenced. This lack of stability complicates machine learning classification because new sequences must be reclustered with the old data and the model retrained with the new OTU assignments. The OptiFit algorithm addresses these issues by fitting new sequences into existing OTUs. While OptiFit produces high-quality OTU clusters, it is unclear whether fitting new sequence data into existing OTUs will affect the performance of classification models trained with the older data. We used OptiFit to cluster sequences into existing OTUs and evaluated model performance in classifying a dataset containing samples from patients with and without colonic screen-relevant neoplasia (SRN). We compared the performance of this model to standard methods including de novo and database-reference-based clustering. We found that models using OptiFit performed as well as or better than these standard methods in classifying SRNs. OptiFit can streamline the process of classifying new samples by avoiding the need to retrain models using reclustered sequences.

Importance

There is great potential for using microbiome data to aid in diagnosis. A challenge with OTU-based classification models is that 16S rRNA gene sequences are often assigned to OTUs based on similarity to other sequences in the dataset. If data are generated from new patients, the old and new sequences must all be reassigned to OTUs and the classification model retrained. Yet there is a desire to have a single, validated model that can be widely deployed. To overcome this obstacle, we applied the OptiFit clustering algorithm to fit new sequence data to existing OTUs, allowing the model to be reused. A random forest model implemented using OptiFit performed as well as the traditional reassign-and-retrain approach. This result shows that it is possible to train and apply machine learning models based on OTU relative abundance data that do not require retraining or the use of a reference database.
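The practical consequence of fitting new sequences to existing OTUs is that a new sample can be described with the same feature columns the model was trained on. The following minimal Python sketch illustrates that idea only; it is not OptiFit and not the authors' pipeline. The OTU labels, the `unfitted` bucket for sequences that match no existing OTU, and the choice to keep those sequences in the denominator are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical OTU labels, fixed at the time the model was trained.
TRAINING_OTUS = ["Otu001", "Otu002", "Otu003"]


def relative_abundance(otu_counts, feature_otus):
    """Return relative abundances ordered by the training OTU list.

    OTUs absent from the sample get 0.0, so the feature vector always
    has the same length and column order as the training data.
    """
    total = sum(otu_counts.values())
    if total == 0:
        raise ValueError("sample contains no sequences")
    return [otu_counts.get(otu, 0) / total for otu in feature_otus]


# Hypothetical per-sequence assignments for one new sample, as they might
# look after its sequences were fit into the existing OTUs. Sequences that
# fit no existing OTU are grouped under "unfitted" (an assumption).
new_sample_assignments = ["Otu001", "Otu003", "Otu003", "unfitted"]

counts = Counter(new_sample_assignments)
features = relative_abundance(counts, TRAINING_OTUS)
print(features)
```

Because the columns never change, a vector like `features` could in principle be passed to an already trained classifier without reclustering the old data or retraining the model.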

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

* Bio-Rad Laboratories, Hercules, California, USA
# Bristol Myers Squibb, Summit, New Jersey, USA
Manuscript revised to add models generated using additional clustering methods.

The copyright holder for this preprint is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.