Abstract
Matrix-assisted laser desorption/ionization time-of-flight (MALDI-TOF) mass spectrometry has become a powerful tool for accurate species identification in routine diagnostic microbiology. Recently, the application of machine learning models to MALDI-TOF mass spectra has indicated that rapid prediction of antimicrobial resistance (AMR) patterns might enable timelier and improved antimicrobial treatment. Although MALDI-TOF mass spectra have proven valuable for clinical decision support, the issue of class imbalance in routine clinical data is often overlooked. This imbalance arises from factors such as local epidemiology, selective pressure from antibiotics, culture conditions, the methodology of phenotypic antimicrobial susceptibility testing, and sample preparation processes. Here, we provide a large mass spectra dataset, MS-UMG, for training AMR prediction models. We evaluate and validate this dataset for AMR prediction alongside previously available public datasets. We further explore the mass spectra and identify regions of the spectral profile that are informative for AMR prediction. Moreover, we investigate the composition of this clinical dataset and present the implications of data heterogeneity for machine learning model performance. In conclusion, our findings highlight that a thorough understanding of routine clinical data and consideration of diverse hospital protocols are critical for effective machine learning-based clinical decision support systems.
Key Points
Introduced a large-scale clinical mass spectrometry dataset to the scientific community for research on antimicrobial resistance.
Compared and evaluated this dataset against other existing large-scale MS datasets, highlighting its value for developing and validating predictive models in clinical settings.
Demonstrated the robustness of machine learning models for antimicrobial resistance prediction using large-scale clinical mass spectra profiles.
Analyzed the impact of data heterogeneity on the training and performance of machine learning models, emphasizing the need to account for variability in routine clinical data to enhance model reliability and generalizability.
Competing Interest Statement
The authors have declared no competing interest.