Development of an absolute assignment predictor for triple-negative breast cancer subtyping using machine learning approaches

Fadoua Ben Azzouz; Bertrand Michel; Hamza Lasla; Wilfried Gouraud; Anne-Flore François; Fabien Girka; Théo Lecointre; Catherine Guérin-Charbonnel; Philippe P. Juin; Mario Campone; Pascal Jézéquel

doi:10.1101/2020.06.02.129544

Abstract

Triple-negative breast cancer (TNBC) heterogeneity is one of the main impediments to precision medicine for this disease. Recent concordant transcriptomics studies have shown that TNBC can be split into at least three subtypes with potential therapeutic implications. Although a few studies have attempted to predict TNBC subtype from transcriptomics data, subtyping was only partially sensitive and was limited by batch effects and dependence on a given dataset, which may hinder the transition to routine diagnostic testing. Therefore, we sought to build an absolute predictor (i.e. intra-patient diagnosis) based on machine learning algorithms and a limited number of probes. To this end, we first introduced binary comparisons between probes within each patient (indicators) and based the predictive analysis on this transformed data. Probe selection was first performed by combining filter and wrapper methods for variable selection, using cross-validation. We then tested three prediction models (random forest, gradient boosting [GB] and extreme gradient boosting) using this optimal subset of indicators as inputs. Nested cross-validation allowed us to consistently choose the best model. Results showed that the 50 selected indicators highlighted biological characteristics associated with each TNBC subtype. The GB model based on this subset of indicators performed better than the other models.
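To illustrate the probe binary comparison described above, the following minimal sketch computes pairwise indicators for a single patient. It is an illustration under stated assumptions, not the authors' implementation: the function name `pairwise_indicators`, the probe names, the expression values, and the choice of a strict "greater than" comparison are all hypothetical, since the abstract does not specify them.

```python
from itertools import combinations
from typing import Dict, Tuple


def pairwise_indicators(expression: Dict[str, float]) -> Dict[Tuple[str, str], int]:
    """Build binary indicators from one patient's probe values.

    For each unordered pair of probes (a, b), the indicator is 1 when
    probe a is strictly higher than probe b in this patient, else 0.
    The strict comparison is an assumption made for this sketch.
    """
    probes = sorted(expression)  # fixed ordering so pairs are reproducible
    return {
        (a, b): int(expression[a] > expression[b])
        for a, b in combinations(probes, 2)
    }


# Hypothetical probe names and expression values for one patient.
patient = {"probe_A": 7.2, "probe_B": 5.1, "probe_C": 9.4}
indicators = pairwise_indicators(patient)

# An increasing rescaling of the patient's values (a crude stand-in for a
# batch-like shift) leaves every indicator unchanged.
shifted = {probe: 1.5 * value + 2.0 for probe, value in patient.items()}
shifted_indicators = pairwise_indicators(shifted)
```

Because each indicator depends only on the ordering of two probes within the same patient, it needs no reference cohort, which is the sense in which such a predictor can be called absolute. Indicators of this kind could then serve as binary input features for a classifier such as gradient boosting.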