PT - JOURNAL ARTICLE
AU - Francisco Avila Cobos
AU - José Alquicira-Hernandez
AU - Joseph Powell
AU - Pieter Mestdagh
AU - Katleen De Preter
TI - Comprehensive benchmarking of computational deconvolution of transcriptomics data
AID - 10.1101/2020.01.10.897116
DP - 2020 Jan 01
TA - bioRxiv
PG - 2020.01.10.897116
4099 - http://biorxiv.org/content/early/2020/01/10/2020.01.10.897116.short
4100 - http://biorxiv.org/content/early/2020/01/10/2020.01.10.897116.full
AB - Many computational methods to infer cell type proportions from bulk transcriptomics data have been developed. Attempts comparing these methods revealed that the choice of reference marker signatures is far more important than the method itself. However, a thorough evaluation of the combined impact of data transformation, pre-processing, marker selection, cell type composition and choice of methodology on the results is still lacking. Using different single-cell RNA-sequencing (scRNA-seq) datasets, we generated hundreds of pseudo-bulk mixtures to evaluate the combined impact of these factors on the deconvolution results. Along with methods to perform deconvolution of bulk RNA-seq data we also included five methods specifically designed to infer the cell type composition of bulk data using scRNA-seq data as reference. Both bulk and single-cell deconvolution methods perform best when applied to data in linear scale and the choice of normalization can have a dramatic impact on the performance of some, but not all methods. Overall, single-cell methods have comparable performance to the best performing bulk methods and bulk methods based on semi-supervised approaches showed higher error and lower correlation values between the computed and the expected proportions. Moreover, failure to include cell types in the reference that are present in a mixture always led to substantially worse results, regardless of any of the previous choices.
Taken together, we provide a thorough evaluation of the combined impact of the different factors affecting the computational deconvolution task across different datasets and propose general guidelines to maximize its performance.
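
The abstract describes building pseudo-bulk mixtures from scRNA-seq data and scoring deconvolution output by error and correlation against the expected proportions. The following is only a minimal sketch of that idea, not the authors' pipeline. It assumes that a pseudo-bulk profile is the gene-wise sum of raw counts over randomly drawn cells, that "error" is the root-mean-square error, and that "correlation" is Pearson's r. The function names (`make_pseudo_bulk`, `rmse`, `pearson`) and the toy counts are hypothetical.

```python
import math
import random


def make_pseudo_bulk(cells, n_cells, rng):
    """Sum raw counts of n_cells randomly drawn cells into one pseudo-bulk profile.

    cells: list of (cell_type, counts) pairs; counts is a list of raw counts per gene.
    Returns (bulk_counts, expected_proportions), where the proportions are the
    fraction of drawn cells belonging to each cell type (assumed ground truth).
    """
    picked = rng.sample(cells, n_cells)  # draw without replacement
    n_genes = len(picked[0][1])
    bulk = [sum(counts[g] for _, counts in picked) for g in range(n_genes)]
    cell_types = sorted({cell_type for cell_type, _ in cells})
    expected = {
        t: sum(1 for cell_type, _ in picked if cell_type == t) / n_cells
        for t in cell_types
    }
    return bulk, expected


def rmse(computed, expected):
    """Root-mean-square error between computed and expected proportions."""
    keys = sorted(expected)
    return math.sqrt(sum((computed[k] - expected[k]) ** 2 for k in keys) / len(keys))


def pearson(computed, expected):
    """Pearson correlation between computed and expected proportions.

    Undefined (division by zero) if either set of proportions is constant.
    """
    keys = sorted(expected)
    x = [computed[k] for k in keys]
    y = [expected[k] for k in keys]
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


if __name__ == "__main__":
    # Toy data (hypothetical): five cells, three cell types, three genes.
    toy_cells = [
        ("A", [5, 0, 1]),
        ("A", [4, 1, 0]),
        ("B", [0, 6, 2]),
        ("B", [1, 7, 1]),
        ("C", [0, 0, 9]),
    ]
    bulk, expected = make_pseudo_bulk(toy_cells, 4, random.Random(0))
    # Stand-in for the output of a deconvolution method (hypothetical values).
    computed = {"A": 0.5, "B": 0.3, "C": 0.2}
    print(bulk, expected, rmse(computed, expected), pearson(computed, expected))
```

The counts stay in linear scale here, in line with the abstract's finding that methods perform best on linear-scale data. Any transformation or normalization step would be applied to `bulk` before deconvolution.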