Benchmark of cellular deconvolution methods using a multi-assay reference dataset from postmortem human prefrontal cortex

Background: Cellular deconvolution of bulk RNA-sequencing (RNA-seq) data using single cell or nuclei RNA-seq (sc/snRNA-seq) reference data is an important strategy for estimating cell type composition in heterogeneous tissues, such as human brain. Several deconvolution methods have been developed and they have been previously benchmarked against simulated data, pseudobulked sc/snRNA-seq data, or cell type proportions derived from immunohistochemistry reference data. A major limitation preventing the improvement of deconvolution algorithms has been the lack of highly integrated datasets with orthogonal measurements of gene expression and estimates of cell type proportions on the same tissue block. The performance of existing deconvolution algorithms has not yet been explored across different RNA extraction methods (e.g. cytosolic, nuclear, or whole cell RNA), different library preparation types (e.g. mRNA enrichment vs. ribosomal RNA depletion), or with matched single cell reference datasets. Results: A rich multi-assay dataset was generated in postmortem human dorsolateral prefrontal cortex (DLPFC) from 22 tissue blocks. Assays included spatially-resolved transcriptomics, snRNA-seq, bulk RNA-seq across six RNA extraction and RNA-seq library combinations, and orthogonal cell type measurements via RNAScope/Immunofluorescence (RNAScope/IF). The Mean Ratio method was developed for selecting cell type marker genes for deconvolution and is implemented in the DeconvoBuddies R package. Five extensively benchmarked computational deconvolution algorithms were evaluated in DLPFC across six RNA-seq combinations and predicted cell type proportions were compared to those measured by RNAScope/IF. Conclusions: We show that Bisque and hspe are the top performing methods with performance dependent on the RNA-seq library preparation conditions. We provide a multi-assay resource for the development and evaluation of deconvolution algorithms.


Fig S2:
Bulk RNA-seq data Quality Control.Samples are evaluated for low concordMapRate, numMapped, numReads, overallMapRate, totalAssignedGene, and totalMapped or high mitoRate.See the SPEAQeasy [52] documentation at https://research.libd.org/SPEAQeasy/outputs.html#quality-metrics for the definition of these variables.Cutoffs (horizontal lines) were determined by a 3 median absolute deviations from the mean (3 MADs) from the distributions for the polyA or RiboZeroGold samples (line color) using isOutlier() from scran [39], as well as historic cutoffs from previous LIBD bulk RNA-seq projects (green lines).Based on the distribution of the values, and logic with the QC metrics some were "warning" cutoffs vs. "drop" cutoffs (line type).RNA-seq samples were classified as "drop", "warn", or "pass" based on their relationship to the cutoffs.

Fig S3 :
Fig S3: Bulk RNA-seq data Quality Control Principal Component Analysis (PCA).PC2 versus 5 colored by "library combo" (library type + RNA extraction).The sample AN00000906_Br8492_Mid_Nuc was identified as an outlier and removed from downstream analysis.

Fig S5 :
Fig S5: Boxplots of cell type proportions calculated from snRNA-seq and RNAScope/IF data.(Left Side) This side shows the snRNA-seq cell type proportions at the broad cell type resolution for all 19 tissue blocks for which snRNA-seq data was previously generated [38].A filled dot marks tissue blocks for which RNAScope/IF data was also generated and passed quality control checks.Those absent are labeled with an x. (Right Side) This side shows the RNAScope/IF broad cell type proportions derived from the RNAScope/IF experiments for the Circle and Star combination of cell type markers.Cell type identities on the RNAScope/IF images were assigned using HALO (Indica Labs).

Fig S6 :
Fig S6: Heatmap of the top 25 Mean Ratio marker genes for deconvolution.Normalized snRNA-seq counts (logcounts) were centered and scaled by gene to compute Z scores.Brain donor identifiers (BrNum) and cell types were used to annotate this heatmap made with ComplexHeatmap [61].Genes are shown in the columns and nuclei on the rows, with the total number of nuclei (ncells) visualized as side barplots.

Fig S7 :
Fig S7: Barplots of estimated cell type proportions from deconvolution methods.Each tissue block is a column, the x-axis categories are the six extraction and library type combinations for the 110 bulk RNA-seq samples.The rows are the predictions from each of the five deconvolution methods.Columns are labeled by the tissue block, which is a combination of the brain donor identifier (BrNum) and the anterior-posterior axis location of the tissue block (anterior: ant, middle: mid, or posterior: post).

Fig S8 :
Fig S8: Scatter plots of cell type proportions measured by RNAScope/IF experiment vs. estimated proportions from deconvolution methods.Points are colored by cell type and shaped by the combination of the bulk RNA-seq sample's library type and RNA extraction.Pearson correlation (cor) and root mean squared error (rsme) values are shown for each panel.Related to Figure 4.

Fig S9 :
Fig S9: Pairwise scatter plots of measured and estimated cell type proportions from the RNAScope/IF experiments, snRNA-seq data, and deconvolution methods.Cell type proportions are colored by cell type are shown in the lower triangle.Pearson correlation values (cor) calculated by ggpairs() from GGally [63] for each cell type are shown in the upper triangle.Density plots of the proportions are shown in the diagonal panels.