Abstract
Summary Immune cell infiltration of tumors can be an important factor in determining patient outcomes, and immune cell presence can be inferred by deconvolving gene expression data drawn from a heterogeneous mix of cell types. ADAPTS aids deconvolution by adding custom cell types to existing cell-type signature matrices or building new matrices de novo. This R package builds a custom signature matrix from purified cell type gene expression data by automatically determining genes that uniquely identify each cell type. The package includes functions that call deconvolution algorithms, which use the custom signature matrix to estimate the proportion of cell types present in heterogeneous samples.
Availability The R packages ADAPTS, ADAPTSdata, and ADAPTSdata2 are available on GitHub (https://github.com/sdanzige/ADAPTS, etc.) and will eventually be available on CRAN.
Contact sdanziger{at}celgene.com, aratushny{at}celgene.com
Preprint A preprint is available on bioRxiv (www.biorxiv.org/content/10.1101/633958v1)
1 Introduction
Determining cell type enrichment from gene expression data is a step towards determining tumor immune context (Thorsson et al., 2018; Newman et al., 2019). One family of techniques for doing this involves regression using a signature matrix (typically with several hundred genes), where each column represents a cell type and each row contains the average expression of one gene in each cell type (Erkkilä et al., 2010; Lähdesmäki et al., 2005). These signature matrices are constructed using gene expression from samples of a purified cell type. Generally, the publicly available versions of these gene expression signature matrices use immune cells purified from peripheral blood. Genes are included in these matrices based on how distinct they are for each cell type and how robust the resulting matrix is, as measured by matrix stability or prediction accuracy on a test set. Although examples exist of both general purpose immune signature matrices, e.g. LM22 (Newman et al., 2015) and Immunostates (Vallania et al., 2018), and more tissue specific ones, e.g. M17 (Ciavarella et al., 2018), these matrices are most likely not appropriate for all diseases and tissue types. One such example would be multiple myeloma whole bone marrow samples, in which both tumor and immune cells are present, immune cells may have different states than in peripheral blood, and non-immune stromal cells such as osteoblasts and adipocytes are expected to play an important role in patient outcomes (Bianchi and Munshi, 2015).
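The regression idea can be illustrated with a minimal Python sketch. It assumes a toy signature matrix with invented expression values and two hypothetical cell types, and it uses ordinary least squares followed by clipping and renormalization. This is a simplified stand-in chosen for brevity, not a reproduction of any of the published deconvolution algorithms cited here.

```python
# Toy signature matrix: rows are genes, columns are cell types.
# All values are invented for illustration.
SIGNATURE = [
    [10.0, 1.0],  # gene A: high in cell type 0
    [2.0, 8.0],   # gene B: high in cell type 1
    [5.0, 5.0],   # gene C: does not separate the two types
]


def deconvolve_two_types(signature, mixture):
    """Least-squares estimate of two cell-type fractions.

    Solves the 2x2 normal equations (S^T S) x = S^T m by Cramer's rule,
    clips negative coefficients to zero, and rescales them to sum to 1.
    """
    a11 = sum(row[0] * row[0] for row in signature)
    a12 = sum(row[0] * row[1] for row in signature)
    a22 = sum(row[1] * row[1] for row in signature)
    b1 = sum(row[0] * m for row, m in zip(signature, mixture))
    b2 = sum(row[1] * m for row, m in zip(signature, mixture))
    det = a11 * a22 - a12 * a12
    x1 = (b1 * a22 - a12 * b2) / det
    x2 = (a11 * b2 - a12 * b1) / det
    clipped = [max(x1, 0.0), max(x2, 0.0)]
    total = sum(clipped)
    return [c / total for c in clipped]


# A noiseless mixture of 30% cell type 0 and 70% cell type 1.
mixture = [0.3 * row[0] + 0.7 * row[1] for row in SIGNATURE]
fractions = deconvolve_two_types(SIGNATURE, mixture)
```

On this noiseless mixture the sketch recovers the mixing fractions; real expression data is noisy, which is one reason regularized or robust regression is used in practice.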
One straightforward solution to this problem would be to augment a signature matrix by adding cell types without adding any additional genes. For example, one might find purified adipocyte samples in a public gene expression repository and add the average expression for each gene in the matrix to create an adipocyte augmented signature matrix. While this might work, one might reasonably expect adipocytes to be best identified by genes that are different from those that best characterize leukocytes. We therefore developed a method and an R package, ADAPTS (Automated Deconvolution Augmentation of Profiles for Tissue Specific cells), that augments a signature matrix with both new cell types and the genes that best identify them.
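The simple form of augmentation described above, adding a cell type while keeping the gene list fixed, might look like the following Python sketch. The gene names, cell types, expression values, and the helper `add_cell_type` are all hypothetical and are not part of the ADAPTS package.

```python
from statistics import mean

# Hypothetical existing signature matrix: gene -> average expression per cell type.
signature = {
    "GENE1": {"B cells": 9.0, "T cells": 1.0},
    "GENE2": {"B cells": 2.0, "T cells": 7.0},
}

# Hypothetical purified adipocyte samples: gene -> expression in each sample.
# GENE3 is measured but is not in the signature matrix, so it is ignored here.
adipocyte_samples = {
    "GENE1": [1.0, 3.0],
    "GENE2": [4.0, 6.0],
    "GENE3": [50.0, 70.0],
}


def add_cell_type(signature, samples, name):
    """Return a copy of the matrix with one new column and the same genes."""
    augmented = {}
    for gene, row in signature.items():
        new_row = dict(row)
        new_row[name] = mean(samples[gene])  # average over purified samples
        augmented[gene] = new_row
    return augmented


augmented = add_cell_type(signature, adipocyte_samples, "Adipocytes")
```

In this toy data GENE3 is the gene most strongly expressed in adipocytes, yet it never enters the matrix, which illustrates the limitation of adding cell types without adding genes.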
2 Methods
ADAPTS provides functionality for augmenting an existing cell type signature matrix or even constructing a new signature matrix de novo. The user is responsible for finding gene expression data from purified cell types, such as is available in ArrayExpress (Athar et al., 2019) or the Gene Expression Omnibus (Barrett et al., 2013). From there, ADAPTS helps a user construct new signature matrices with modular R functions and default parameters to:
Identify and rank significantly different genes for each cell type.
Evaluate the stability (condition number) of many potential solutions.
Smooth and normalize to meet tolerances for a robust signature matrix.
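The stability evaluation in the second step can be sketched in Python as follows. This is an illustration under strong assumptions, not the ADAPTS implementation: it is restricted to two cell types so that the eigenvalues of the 2x2 Gram matrix have a closed form, the expression values are invented, the genes are assumed to be ranked already, and no smoothing is applied. A real analysis would use a linear algebra library to compute condition numbers for matrices with many cell types.

```python
import math


def condition_number_two_columns(matrix):
    """2-norm condition number of an n x 2 matrix with independent columns.

    Uses the closed-form eigenvalues of the 2x2 Gram matrix M^T M; the
    condition number is sqrt(largest eigenvalue / smallest eigenvalue).
    """
    a = sum(r[0] * r[0] for r in matrix)
    b = sum(r[0] * r[1] for r in matrix)
    d = sum(r[1] * r[1] for r in matrix)
    half_trace = (a + d) / 2.0
    root = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return math.sqrt((half_trace + root) / (half_trace - root))


# Hypothetical genes, assumed already ranked from most to least
# discriminating. Columns are two cell types; values are invented.
ranked_genes = [
    [10.0, 1.0],
    [1.0, 9.0],
    [6.0, 5.0],
    [7.0, 7.0],
]

# Condition number of the matrix built from the top k genes, for each k >= 2.
condition_by_size = {
    k: condition_number_two_columns(ranked_genes[:k])
    for k in range(2, len(ranked_genes) + 1)
}

# Candidate size with the lowest (most stable) condition number.
best_size = min(condition_by_size, key=condition_by_size.get)
```

A lower condition number means that small changes in the measured expression lead to smaller changes in the estimated proportions, which is why it serves as a stability criterion.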
Similarly, ADAPTS can be used to construct a de novo matrix from first principles rather than starting with a seed matrix. One technique is to build an initial seed matrix out of the n (e.g. 100) genes that vary the most between cell types and use ADAPTS to augment that seed matrix. The n initial genes can then be removed from the resulting signature matrix and that new signature matrix can be re-augmented by ADAPTS.
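The seed-selection step, picking the n genes that vary most between cell types, can be sketched in Python as below. The gene names, expression values, and the helper `top_variable_genes` are hypothetical, and variance across cell-type averages is assumed here as the measure of variability; the package may use a different criterion.

```python
from statistics import pvariance

# Hypothetical average expression per gene across three cell types
# (values are invented for illustration).
average_expression = {
    "GENE1": [1.0, 1.0, 1.0],
    "GENE2": [0.0, 5.0, 10.0],
    "GENE3": [2.0, 4.0, 3.0],
    "GENE4": [9.0, 0.0, 0.0],
}


def top_variable_genes(expression, n):
    """Return the n genes whose expression varies most across cell types."""
    ranked = sorted(
        expression,
        key=lambda gene: pvariance(expression[gene]),
        reverse=True,
    )
    return ranked[:n]


# Seed matrix built from the two most variable genes.
seed_genes = top_variable_genes(average_expression, 2)
seed_matrix = {gene: average_expression[gene] for gene in seed_genes}
```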
The ADAPTS package includes functionality to call several different deconvolution methods using a common interface, thereby allowing a user to test new signature matrices with multiple algorithms.
These algorithms include:
DCQ (Altboum et al., 2014): An elastic net based deconvolution algorithm that consistently best identifies cell proportions.
SVMDECON (Newman et al., 2015): A support vector machine based deconvolution algorithm.
DeconRNASeq (Gong et al., 2013): A non-negative decomposition based deconvolution algorithm.
Proportions in Admixture (Langfelder et al., 2008): A linear regression based deconvolution algorithm.
3 Example: Detecting Tumor Cells
To demonstrate the utility of the ADAPTS package, we show how it can be used to augment LM22 (Newman et al., 2015) to identify myelomatous plasma cells from gene expression profiles of 423 purified tumor (CD138+) samples and 440 whole bone marrow (WBM) samples taken from multiple myeloma patients (Danziger et al., 2019). The fraction of myeloma cells, which are tumorous plasma cells, was determined in both sample types via quantification of the cell surface marker CD138. Root mean squared error (RMSE) and Pearson’s correlation coefficient (ρ) were used to evaluate the accuracy of tumor cell fraction estimates. RMSE proved particularly relevant when deconvolving purified CD138+ sample profiles, because 356 of 423 samples are more than 90% pure tumor, resulting in clumping of samples with purity near 100%.
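The two accuracy measures can be computed as in the following Python sketch. The measured and estimated fractions are invented and are not drawn from the study data. The example illustrates one property that makes RMSE a useful complement to ρ: correlation is insensitive to a constant offset in the estimates, whereas RMSE is not.

```python
import math


def rmse(truth, estimate):
    """Root mean squared error between paired values."""
    squared = [(t - e) ** 2 for t, e in zip(truth, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(x, y):
    """Pearson's correlation coefficient between paired values."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented tumor fractions for four high-purity samples.
measured = [0.92, 0.95, 0.98, 0.99]
# Invented estimates that track the measurements but are 0.40 too low.
estimated = [0.52, 0.55, 0.58, 0.59]

error = rmse(measured, estimated)
correlation = pearson(measured, estimated)
```

Here the correlation is essentially perfect even though every estimate is far from the measured value, while the RMSE of about 0.40 exposes the discrepancy.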
The following matrices were used or generated during the evaluation:
LM22: As reported in (Newman et al., 2015). The sum of the deconvolved estimates for ‘memory B cells’ and ‘plasma cells’ represents tumor percentage.
LM22 + 5: Builds on LM22 by adding purified sample profiles for myeloma specific cell types: plasma memory cells (Mahevas et al., 2013), osteoblasts (Athar et al., 2019), osteoclasts, adipocytes, and myeloma plasma cells (Torrente et al., 2016). The sum of the estimates for ‘memory B cells’, ‘myeloma plasma cells’, ‘plasma cells’, and ‘plasma memory cells’ represents tumor percentage.
MGSM27: Builds on LM22 by adding 5 myeloma specific cell types using ADAPTS to determine inclusion of additional genes. Figure 1 shows ADAPTS evaluating matrix stability after adding different numbers of genes, smoothing the condition numbers, and selecting an optimal number of features.
de novo MGSM27: Builds a de novo MGSM27 from publicly available data similar to those mentioned in (Newman et al., 2015) and the 5 myeloma specific cell types.
Table 1 displays average RMSE and ρ for tumor fraction estimates obtained via application of DCQ deconvolution using the four aforementioned matrices across both myeloma profiling datasets.
While the exact genes chosen during each run vary slightly, Table 1 shows that the best accuracy is consistently achieved by augmenting LM22 using ADAPTS. The reduced performance of the de novo MGSM27 is likely due to genes that were present in LM22 but were missing in some of the source data and were therefore excluded from de novo construction. More details are available in the vignette distributed with the R package.
4 Conclusion
Table 1 shows an example where including additional genes and tissue specific cell types improves the ability of a deconvolution algorithm to identify tumor fractions in purified and mixed multiple myeloma samples. While this does not demonstrate that the techniques implemented in ADAPTS would be beneficial for all situations, the functions in ADAPTS enable researchers to build their own custom basis matrices to investigate biosamples consisting of multiple cell types.
Funding
This work has been supported by the Celgene Corporation.
Acknowledgements
Thanks to Gareth Morgan, Jake Gockley, Robert Hershberg, Mary H Young, Andrew Dervan, and all other contributors to the paper “Identifying a High-risk Cellular Signature in the Multiple Myeloma Bone Marrow Microenvironment”.
Footnotes
This version points to the GitHub version of the code to use while the CRAN submission is pending.