Reference-free deconvolution, visualization and interpretation of complex DNA methylation data using DecompPipeline, MeDeCom and FactorViz

DNA methylation profiling offers unique insights into human development and diseases. Often the analysis of complex tissues and cell mixtures is the only feasible option to study methylation changes across large patient cohorts. Since DNA methylomes are highly cell type specific, deconvolution methods can be used to recover cell type–specific information in the form of latent methylation components (LMCs) from such ‘bulk’ samples. Reference-free deconvolution methods retrieve these components without the need for DNA methylation profiles of purified cell types. Currently no integrated and guided procedure is available for data preparation and subsequent interpretation of deconvolution results. Here, we describe a three-stage protocol for reference-free deconvolution of DNA methylation data comprising: (i) data preprocessing, confounder adjustment using independent component analysis (ICA) and feature selection using DecompPipeline, (ii) deconvolution with multiple parameters using MeDeCom, RefFreeCellMix or EDec and (iii) guided biological inference and validation of deconvolution results with the R/Shiny graphical user interface FactorViz. Our protocol simplifies the analysis and guides the initial interpretation of DNA methylation data derived from complex samples. The harmonized approach is particularly useful to dissect and evaluate cell heterogeneity in complex systems such as tumors. We apply the protocol to lung cancer methylomes from The Cancer Genome Atlas (TCGA) and show that our approach identifies the proportions of stromal cells and tumor-infiltrating immune cells, as well as associations of the detected components with clinical parameters. The protocol takes slightly >3 d to complete and requires basic R skills. This protocol describes a comprehensive computational pipeline for reference-free deconvolution of bulk DNA methylation data, including data preprocessing, confounder adjustment, feature selection, and visualization and interpretation of the results.


Introduction
DNA methylation, preferentially occurring at CpG dinucleotides in mammalian genomes, correlates with cell type identity and differentiation stage 1 . Importantly, the methylation states of particular CpGs can be used as powerful biomarkers for various conditions, including cancer 2-4 , inflammatory diseases 5 and aging 6 . To facilitate such findings, large international consortia, including the International Human Epigenome Consortium (IHEC) 7 , the German Epigenome Project (DEEP; Deutsches Epigenom Programm) 1 and the European initiative BLUEPRINT 8 , are generating genome-wide DNA methylation maps (or DNA methylomes) of primary tissue samples and isolated cell populations. However, DNA methylomes obtained from bulk samples are intrinsically heterogeneous, and can be additionally affected by aging, sex and technical confounders. Computational methods for the integration of large-scale DNA methylation datasets, and capable of delineating such complex methylomes into biologically distinct components of variability, are of paramount importance 9 .
Deconvolution methods dissect methylomes of cell mixtures into their basic constituents 10 . In case reference DNA methylomes of purified cell types are available, they can be used to infer the proportions of different cell types across the samples. Multiple reference-based methods have been proposed [11][12][13][14][15][16] and are reviewed elsewhere 17 (summarized in Supplementary Table 1). Another class of deconvolution methods allows for the composition-adjusted identification of differentially methylated regions in epigenome-wide association studies (EWAS) without explicitly computing the cell type proportions 18,19 . When reference methylomes and other prior information are partially or completely absent, semi-reference-free 20 or fully reference-free [21][22][23][24][25] deconvolution methods can be applied (Supplementary Table 1). Reference-free methods are of particular benefit for poorly characterized, complex systems such as brain, and solid tumors, where adequate reference profiles are especially hard to obtain. Furthermore, the interpretation of deconvolution results is challenging in case prior information about the investigated biological system is limited. Here, we present a comprehensive pipeline that facilitates reference-free deconvolution, starting from raw DNA methylation data through to result interpretation. Although we focus on MeDeCom 23 as a representative method, the protocol is not limited to a single deconvolution strategy and can be used in combination with other available tools (RefFreeCellMix 21 and EDec 22 ).

Development of the protocol
Reference-free deconvolution is a computationally demanding task, for which several methods have been proposed, along with the tools implementing them (Supplementary Table 1). In a pilot benchmark of three published reference-free deconvolution tools (MeDeCom, RefFreeCellMix and EDec), we found that their performance differences were marginal, both on fully synthetic and in silico mixed experimental datasets 26 . In fact, the quality and information content of the input DNA methylation matrix had a higher impact upon the accuracy than the choice of the deconvolution tool. Thus, deconvolution algorithms a priori rely on thorough data preprocessing and feature selection, especially if the differences between the underlying components are small. Furthermore, the biological interpretation of deconvolution results is often challenging, in particular for beginners with limited bioinformatic experience. To facilitate the analysis and interpretation of complex DNA methylation data, we developed a comprehensive, three-stage protocol, which includes critical preprocessing and interpretation steps in addition to the actual deconvolution. The protocol, schematically outlined in Fig. 1, consists of three main stages: (i) meticulous preprocessing of DNA methylation data, including stringent, quality-adapted CpG filtering, elimination of potential confounding factors using ICA and feature selection; (ii) reference-free methylome deconvolution; (iii) interpretation of deconvolution results with a user-friendly R/Shiny-based interface, enabling the generation of novel biological insights. A glossary of commonly used abbreviations is provided in Box 1.
The development of the protocol focused on the following three aspects: (i) Data preprocessing is key to the overall success of the deconvolution. The first stage of our protocol (DecompPipeline) comprises removal of unreliable or otherwise problematic measurements using the popular RnBeads software package for handling DNA methylation data 27,28 . Age, sex or donor genotype can have a strong influence on the methylome, and investigators may want to adjust for these in their analyses 29,30 , that is, to consider these factors as confounders. Therefore, we argue that accounting for confounding factors, using methods such as ICA 31 , is crucial in obtaining biologically relevant results. As the final processing step, a CpG subset selection determines sites that are associated with, for instance, cell type identity or any other phenotypic trait of interest. (ii) The prepared, high-quality data matrix can be processed using one of the reference-free deconvolution tools (Supplementary Table 1). These methods decompose the input DNA methylation matrix into a matrix of LMCs (T) and a matrix of proportions (A). The LMCs correspond to major sources of variation in the methylome including, but not limited to, methylomes of underlying cell types, whereas the proportions quantify their relative contribution to each bulk methylome. In the second stage of the protocol, we use MeDeCom 23 , our own method based on regularized non-negative matrix factorization (NMF) for this purpose, but the protocol seamlessly integrates with other deconvolution tools. (iii) Although reference-free deconvolution is flexible with respect to the investigated system, the interpretation of deconvolution results, especially of the LMC matrix T, can be challenging. In addition to cell type profiles, LMCs reflect multiple drivers of biological and technical variability. Furthermore, validating the proportions and LMCs is not trivial, since the space of possible solutions is large and the cellular composition is typically unknown. To guide users, we implemented most of the interpretation functionality as a specialized R/Shiny-based graphical user interface (FactorViz).
case of missing reference profiles. This includes EWAS using material from difficult-to-access or insufficiently characterized organs and tissues, such as human brain, as well as solid tumors. Previously, we and others have used this approach to understand the cellular heterogeneity in placenta 32 , multiple sclerosis 33 , breast cancer 22 and cholangiocarcinoma 34 . Reference-free deconvolution is particularly useful to dissect tumor heterogeneity, for example, to study the effect of tumorinfiltrating immune cells on the tumor microenvironment 35 . Furthermore, identified LMCs can correlate with tumor size, location, metastasis state and mutational burden. Since, in general, tumors show a high degree of sample-to-sample variation, methylome deconvolution can detect similarities among different types of cancers to define pan-cancer and cancer type-specific markers. If a particular cancer induces changes in the DNA methylation pattern of the tumor stroma, these changes are likely to be missed by reference-based methods, but not by reference-free methods.
In the original publication, along with the validation on simulated data and in silico cell type mixtures, MeDeCom was applied to a brain frontal cortex dataset, successfully identifying neuronal and glia fractions, as well as detecting additional LMCs, which could be linked to features of Alzheimer's disease 23 . We anticipate successful application of reference-free deconvolution in similar scenarios. Furthermore, although for blood-based studies reference methylomes exist 36 and referencebased methods perform generally well, reference-free deconvolution can be useful in case of severely altered blood composition, for example, due to an overproduction of rare cell types. Finally, in the case of blood and other similarly well-characterized tissues, MeDeCom can be applied in a semisupervised fashion, that is, comparing the obtained LMCs with available reference profiles. This enables easy recovery of known signatures, and allows for detection of additional unknown LMCs.  1 | Overview of the proposed deconvolution protocol. DNA methylation data from any technology yielding single CpG methylation calls can be used. Methylation data are first processed using DecompPipeline, which includes data import, preprocessing, accounting for confounders and feature selection. MeDeCom, RefFreeCellMix or EDec can be used to perform deconvolution of the input methylation matrix (dimension m CpGs × n samples) into the LMCs and the proportions matrix (dimension K LMCs × n samples). A grid of values for the regularization parameter λ and the number of components K has to be specified. The resulting matrices are then validated and interpreted using the R/Shiny visualization tool FactorViz. The numbers in parentheses represent procedure steps in the presented protocol.
Single-cell technologies are steadily improving, and cell-level DNA methylomes will become increasingly available in the future 37,38 . Nevertheless, we expect that the deconvolution of large-scale bulk tissue datasets will remain a useful complement to single-cell DNA methylation profiling, which still suffers from high costs, low sample throughput and data sparsity. We envision that both approaches could be successfully used in combination, for example, single-cell profiling for several reference samples in controlled study settings followed by deconvolution of bulk methylomes from large patient cohorts, where the single-cell profiles can be used for interpretation of the LMCs. Finally, the deconvolution of more accessible bulk methylomes can be used to integrate them with single-cell profiles of other data modalities, such as single-cell transcriptomes and chromatin accessibility maps, which are generally easier to obtain than single-cell methylation profiles

Experimental design
Our protocol is divided into three main stages: (i) data preprocessing, (ii) deconvolution and (iii) interpretation, which are described in detail below (see also Fig. 1).

Data preprocessing
We implemented the data preprocessing stage of the protocol as a new R package (DecompPipeline) that integrates quality filtering, adjustment for confounding factors and feature selection into a single, easy-to-use workflow. For loading, formatting and storing the DNA methylation data, we recommend our recently updated and extended RnBeads package 28 .
Data import. Genome-wide DNA methylation can be profiled using different technologies such as whole-genome bisulfite sequencing (WGBS) 39 , reduced-representation bisulfite sequencing (RRBS) 40 or Illumina Infinium microarrays 41 . In this protocol, we focus on 450k microarray data due to their larger sample sizes that increase the efficiency of deconvolution, but the pipeline can also be applied in a slightly modified version to Illumina EPIC data. Additionally, our pipeline is applicable to any other data type that provides DNA methylation calls at single CpG resolution, including RRBS and WGBS, after adjusting specific steps within the pipeline. In addition to raw DNA methylation data, phenotypic information, such as age and sex, is required and converted into the internal RnBeads 27,28 data structure. The input is checked for data quality using RnBeads' reporting functionality. Preprocessed data can also directly be used as a DNA methylation data matrix, which enables integration with further DNA methylation processing tools such as minfi 42  EWAS Epigenome-wide association study. A study aiming to define phenotype-associated epigenomic alterations, for example, DNA methylation changes linked to a disease.
GDC Genomic Data Commons. The data transfer tool used within the TCGA network.
ICA Independent component analysis. A statistical framework that defines statistically independent components in a data matrix.
IDAT Intensity data files. A file format generated using the Illumina BeadArray series comprising signal intensity values.
IHEC International Human Epigenome Consortium. An umbrella organization for different epigenomic research projects all around the world.
LMC Latent methylation component. A component generated using any of the reference-free deconvolution tools. LMCs represent major sources of variation in the data, such as cell types, donor age or genotype.
NMF Non-negative matrix factorization. A statistical framework to factorize a data matrix into two matrices with exclusively non-negative entries.
RMSE Root mean squared error. The square root of the mean squared differences of the entries in two vectors.
RRBS Reduced-representation bisulfite sequencing. A next-generation sequencing strategy yielding CpG methylation calls in CpG-dense regions of the genome.
Quality filtering. DecompPipeline performs quality-based filtering of CpGs across the samples in several steps (Box 2). First, CpGs are filtered according to a (bead) coverage threshold across the samples, and overall signal intensity (microarrays) or read coverage (bisulfite sequencing) outliers are removed. Missing values can either be completely discarded from the dataset or imputed 45 . We further remove sites overlapping annotated or estimated single-nucleotide polymorphisms (SNPs) 46 , sites on the sex chromosomes and cross-reactive sites 47,48 . Infinium data can be normalized before downstream analysis, and further sample properties can be inferred, such as the overall immune cell content using the LUMP algorithm 49 or the epigenetic age 30 .
Covariate adjustment using ICA. DNA methylomes can be affected by various sources of variability, of both a biological and technical nature, that might mask the signals of interest. Deciding which of the detected components is of relevance for the investigated system and which components are associated with unwanted sources of variation (confounding factors) is a critical choice to be made by the user. For instance, components linked to age may be of interest in some investigated systems but be confounders for tumor heterogeneity in another setting. We use ICA 31 as a data-driven dimensionality reduction method that performs a matrix decomposition to adjust for a given set of confounders. ICA divides the experimentally observed data matrix D mn into k independent signals S mk mixed with the coefficients of M kn : where m and n are the number of features (methylation sites) and samples, respectively. In contrast to the NMF performed by MeDeCom, the entries of the matrices are not bounded to the interval [0, 1]. The weight matrix M can be linked to confounding factors, including sex, experimental batch effects or platform biases. As the method separates the original methylation profiles into statistically independent signals, the influence of these factors on the methylation profiles can be attributed to particular CpGs, which can be removed or the weights (rows of M) of the corresponding components can be set to zero 50 . Given that the influence of the investigated confounding factors is small, we recommend setting the corresponding components to zero and reconstructing an adjusted data matrix.
A pitfall of applying ICA to DNA methylation data is the smoothing of the beta value distribution during ICA-based data transformation. Thus, a post-processing step is required to bring beta values into the expected range. To achieve this, reconstructed values are linearly rescaled to set the 1st and 99th percentiles to beta values of 0 and 1. Finally, to reduce the stochasticity of the ICA decomposition, we apply the consensus ICA approach 51 . ICA is executed multiple times, and the resulting matrices S and M are averaged between the runs. The stability of the components is estimated as the coefficient of determination (R 2 ) between the columns of S observed in different runs. The optimal number of independent components can be selected in a single run during a pre-analysis phase, which does not require consensus runs. We exhaustively try all numbers of components in a defined range for each of the confounding factors selected. ANOVA analysis reveals dependencies between ICA weights and the factor. We return the number of components resulting in the lowest P-value in the ANOVA test.
Selection of informative CpG subsets. To obtain interpretable deconvolution results, further feature selection is required, since, for instance, low-variability CpGs do not contribute to signature recovery but do add to the computational runtime. The use of prior knowledge about the underlying cell types, Box 2 | Configurable quality filtering options available in DecompPipeline 450k/EPIC array and bisulfite sequencing Missing value filtering. All sites comprising missing measurement in any of the samples are removed. SNP filtering. Sites or probes overlapping with SNPs (minor allele frequency >1%, dbSNP147 (ref. 46 )) are removed. Sex chromosome filtering. Sites located on the sex chromosomes are filtered.
450k/EPIC array only Bead filtering. All sites covered by fewer than min.n.beads (default = 3) beads in any of the samples are filtered. Intensity filtering. Sites are filtered according to their overall intensity values with respect to the intensity quantiles of all beads specified using min.int.quant (default = 0.001) and max.int.quant (default = 0.999). Cross-reactive filtering. All sites reported to be cross-reactive 47,48 are removed.
Bisulfite sequencing only Absolute coverage filtering. All sites with read coverage less than min.coverage (default = 5) are filtered. Quantile coverage filtering. Sites in the low and high read coverage quantile (defined by min.covg.quant and max. covg.quant) are removed.
for instance known cell type-specific CpGs, is the best option, if such knowledge is available 11,22 . In the absence of prior knowledge about the biological system of interest, typical strategies for feature selection include selecting the most highly variable sites, the ones with the highest loadings on the first few principal components, or a random selection. DecompPipeline provides 14 options to select CpG subsets (Table 1), and multiple options can be included in a single execution of the pipeline.

Performing deconvolution using MeDeCom
Reference-free deconvolution methods, such as RefFreeCellMix 21 , EDec 22 or MeDeCom 23 , estimate the matrix of LMCs (T) and the matrix of proportions (A) based on the DNA methylation matrix (D) of sites selected in the previous step by means of NMF. We focus on MeDeCom, but the pipeline similarly supports RefFreeCellMix and EDec. MeDeCom optimizes the squared Frobenius norm of the difference between the true (measured) methylation matrix D and the matrix product of T and A ( Fig. 1, middle) as the objective function. Desired properties of the factor matrices, that is, restriction to the interval [0, 1] (T and A) with column sums equal to 1 (A), are enforced by respective constraints on their entries. Furthermore, MeDeCom penalizes the entries of T that are not equal to 0 or 1 using quadratic regularization (maximal at entries equal to 0.5) controlled by the regularization parameter λ: This optimization problem is solved using an alternating optimization scheme, which fits either A or T while keeping the other fixed during each of its iterations. Selecting suitable values for the regularization parameter (λ) and the number of latent components (K) is assisted by a crossvalidation scheme that leaves out columns of D. Typically, a grid search for different values of K and λ is performed to determine the most suitable number of components and regularization, respectively. 600 sites listed as cell type specific in Jaffe et al. 77 Applicable only to blood datasets Jaffe et al. 77 Yes According to Stage 0 of the 'EDec' approach Requires reference profiles, Onuchic et al. 22 Yes

Yes No
To reduce the running time substantially, we recommend activating the parallel processing options on standalone workstations, or to use a high-performance computing cluster. The resulting solutions of the deconvolution problem are stored on disk and can be used for downstream interpretation.

Interpretation of deconvolution results
In contrast to the reference-based case, the interpretation of reference-free deconvolution results is non-trivial. MeDeCom produces a matrix of LMCs and a matched proportion matrix, both of which need to be biologically validated and interpreted. To facilitate an interactive interpretation, we created the semi-automated visualization tool FactorViz. FactorViz is an R/Shiny-based user interface with guidelines and functions for comprehensive biological inference. Initially, one of the possible MeDeCom solutions has to be chosen by selecting the parameters K and λ based on the crossvalidation error. To investigate the potential influences of covariates upon the estimated proportions and corresponding LMCs, the resulting proportion matrix is linked to technical or phenotypic traits using association tests. Furthermore, proportions can be linked to the expression of marker genes, or specific properties of the analyzed dataset such as survival time 52 . To annotate LMCs functionally, we determine the sites that are specifically hypomethylated in an LMC in comparison with the median of the remaining LMCs, and treat the obtained sites as LMC specific. These sites are then used for Gene Ontology (GO) 53 and LOLA 54 enrichment analysis to associate the respective LMCs with functional categories, pathways and genomic features. Finally, the LMC matrix can be compared with available reference cell type profiles.

Level of expertise needed to implement the method
DecompPipeline, MeDeCom and FactorViz are R packages and thus require some minimal prior experience with the R programming language. Basic knowledge of the Unix command line interface is recommended for data handling. To follow the steps of this protocol, one only needs a few R function calls, but the function parameters need to be tailored to the target dataset. The graphical user interface FactorViz presents plots, which require a minimum knowledge of DNA methylation data and matrix factorization.

Limitations
Since MeDeCom tests the selected combinations of the regularization parameter λ, the number of LMCs K and several feature selection methods, the number of basic deconvolution jobs can reach 10,000 or more. Thus, reference-free deconvolution is a computationally demanding task that requires high-performance computing infrastructure. When applied to larger datasets, that is, several hundreds to thousands of samples and hundreds of thousands to millions of selected CpGs, the deconvolution can take several days to finish, even on larger machines. Furthermore, the obtained LMC matrix needs to be biologically interpreted, which requires user interaction and input. Fully automated biological inference of deconvolution results will be part of the next development steps of the protocol. Accounting for confounding factors, especially for those that might have a strong influence on the methylome, can lead to a substantially modified DNA methylation data matrix. The proposed pipeline provides diagnostic plots, but user interaction is still required to determine whether the effect of a particular covariate is to be removed, and how this should be performed.

Hardware
We recommend executing the proposed protocol on large systems with, for example, 128 GB of operative memory and 32 cores for a dataset of the size of the example dataset used to demonstrate this protocol (461 samples, 485,577 CpGs). For larger datasets, such as bisulfite sequencing data, we recommend the use of a high-performance computing cluster or a cloud environment, such that the pipeline can automatically distribute jobs across different machines. Linux is the preferred operating system, and the protocol was executed on a Debian Wheezy server with 32 cores and 128 GB of main memory.
lapply(idat.files,function(x){ is.idat<-list.files(x, pattern = ".idat", full.names = TRUE) file.copy(is.idat,"idat/") unlink(x, recursive = TRUE) }) Data import • Timing 2 h 6 RnBeads converts the files into a data object and performs basic quality control. Analysis options have to be specified for RnBeads, through either an XML file or the command line. Deactivate the preprocessing, exploratory, covariate inference, export and differential methylation modules by setting the respective parameters to 'FALSE' as shown below, such that RnBeads only performs data import and quality control.
sample.anno <-"annotation/sample_annotation.tsv" idat.folder <-"idat/" dir.report <-paste0("report",Sys.Date(),"/") temp.dir <-"/tmp" download.file("http://epigenomics.dkfz.de/downloads/DecompProtocol/ rnbSet_unnormalized.zip", destfile=paste0(getwd(),"/RnBSet")) rnb.set <-load.rnb.set(paste0(getwd(),"/RnBSet")) ? TROUBLESHOOTING 8 Examine the interactive HTML report created by RnBeads for interactive quality control, identification of sample mix-ups and the initial exploratory analysis. This report specifies the steps performed and the associated results. Data should meet the quality criteria in Box 3 to be used for downstream analysis. c CRITICAL STEP While quality control for Illumina BeadArrays is based on built-in control probes visualized in the RnBeads report, quality control for bisulfite sequencing datasets is performed during upstream processing steps (out of the scope of the protocol), as well as on the read coverage statistics, which are available in the RnBeads Quality Control report. j PAUSE POINT For reference, we provide a complete RnBeads report on the website http:// epigenomics.dkfz.de/downloads/DecompProtocol/RnBeads_Report_TCGA_LUAD/.
Preprocessing and filtering • Timing 22 h 9 For further data preparation and analysis steps, use the R package DecompPipeline. Processing options are provided through individual function parameters to the prepare.data function (see the next code section). Follow a stringent filtering strategy: • Filter CpGs covered by fewer than three beads, and probes that are in the 0.001 and 0.999 overall intensity quantiles (low and high intensity outliers). • Remove all probes containing missing values in any of the samples. • Discard sites outside of CpG context, overlapping annotated SNPs, located on the sex chromosomes and potentially cross-reactive probes.
Box 3 | Quality criteria for different quality control steps in RnBeads 450k/EPIC and bisulfite sequencing Sex prediction. Patient sex can be reliably predicted from the signal intensities/read coverage of the sites on the sex chromosomes. RnBeads trained a robust logistic regression classifier on a large training dataset. If predicted sex does not match the annotated sex, this indicates a sample mix-up. The plot is available at the end of the quality_control.html report generated by RnBeads.
450k/EPIC only Control probes. All background control probes should have background intensity values in the range of 1,000-2,000. The low, medium and high control probes should show a substantially higher signal intensity in the range >10,000. Samples with high background intensity or low signal intensity should be discarded from the analysis. The plots are available at the very beginning of the quality_control.html report. SNP methylation. The Illumina BeadArrays contain a few highly variable SNP probes, which should have methylation values close to 0, 0.5 and 1. In a genetically matched setup, samples from a similar genotype (for example, the same family) should cluster together using the methylation values of these SNP probes. By this approach, potential sample mix-ups can be detected. The SNP methylation heatmap is available directly below the control probe plots.

Bisulfite sequencing only
Sequencing coverage. Read coverage should be similar across all the samples analyzed. Samples with low sequencing coverage should be removed from the analysis, while reads from samples with high sequencing coverage can be subsampled toward the mean sequencing coverage across all the samples.
• (Optional) Perform normalization, for instance using the 'dasen' method from the wateRmelon 43 package to adjust signals within Infinium arrays, or BMIQ 57 to account for the bias introduced by the two Infinium probe designs 41 . Consider that this will increase the running time by several minutes to several hours. c CRITICAL STEP Removing too few or too many sites might have a strong influence on the final deconvolution results. Thus, we recommend to carefully check the available options (Box 1) and only change the default setting in case of low-quality data. DecompPipeline internally uses the filtering functions available in RnBeads, which also include filtering options for the EPIC array. Next, the normalization of the array data has to be carefully considered, trading off between removing spurious technical variation and potentially eliminating relevant signals. We recommend to normalize the data in case of apparent artifacts (large background changes, clear differences between technical batches). DecompPipeline supports most of the normalization methods available in RnBeads. The pipeline can be modified to bisulfite sequencing data by filtering sites with too low read coverage, or sites exceeding the coverage quantiles (epigenomics.dkfz.de/ DecompProtocol/protocol_RRBS.html). j PAUSE POINT We provide a list of selected CpGs as an intermediate result at http:// epigenomics.dkfz.de/downloads/DecompProtocol/sites_passing_complete_filtering.csv.
sel.sites <-read.csv( "http://epigenomics.dkfz.de/downloads/DecompProtocol/sites_passing_ complete_filtering.csv") rem.sites <-!(row.names(annotation(rnb.set))%in%as.character(sel. sites[,1])) rnb.set <-remove.sites(rnb.set,rem.sites) data.prep <-list(rnb.set.filtered=rnb.set) ? TROUBLESHOOTING 10 Adjustment for potential confounding factors is crucial in epigenomic studies, and the influences of, for instance, donor sex, age and genotype on the DNA methylation pattern are well studied 29,30 . Use ICA (see 'Materials' section) to account for DNA methylation differences that are due to sex, age, race and ethnicity. Confounding factor adjustment alters the overall data distribution. Go to the analysis directory generated by DecompPipeline (TCGA_LUAD) and investigate the associations between independent components and the specified confounding factors. In case the reported P value is lower than the parameter alpha.fact, ICA will automatically adjust for the corresponding confounding factor. Furthermore, inspect if the distribution of DNA methylation is preserved. Confounding factor adjustment is currently only supported for Illumina BeadArray datasets and has not yet been extended to bisulfite sequencing data.

Selection of CpG subsets • Timing 1 min
11 Select a subset of sites to be used for deconvolution. DecompPipeline provides different options (Table 1) through the prepare.CG.subsets function. Select the 5,000 most highly variable CpGs across the samples.
cg_subset <-prepare.CG.subsets( rnb.set = data.prep$rnb.set.filtered, marker.selection = "var", n.markers = 5000) c CRITICAL STEP Feature selection for bisulfite sequencing datasets is harder than for Illumina BeadArrays. Since DNA methylation largely depends on the genetic background of an individual, the most variable sites will often represent inter-individual differences, especially in non-cancerous samples. Therefore, we recommend to subset genome-wide data and use only functional parts of the genome, such as promoters or enhancers. The feature selection methods also applicable to quality-filtered bisulfite sequencing data are depicted in Table 1. The advantages of each tool are outlined in their respective publications. Further existing methods will be supported in the future. j PAUSE POINT The final MeDeCom result is stored in a format that can be directly imported with FactorViz. We provide the results for exploration at http://epigenomics.dkfz.de/downloads/ DecompProtocol/FactorViz_outputs.tar.gz.

Downstream analysis • Timing 2 h
14 Start the FactorViz application to visualize and interactively explore the deconvolution results ( Supplementary Fig. 1a,b).
library(FactorViz) startFactorViz(file.path(getwd(),"TCGA_LUAD","FactorViz_outputs")) ? TROUBLESHOOTING 15 (Optional) Load the MeDeComSet object in an additional R session started in the working directory, where deconvolution has been performed. The object is stored in the FactorViz_outputs directory, and can be used for additional analysis.
load("FactorViz_outputs/medecom_set.RData") medecom.set 16 Determine the number of LMCs (K) and the regularization parameter (λ) based on the crossvalidation error. (Supplementary Fig. 1c,d). First, go to the 'K selection' panel in FactorViz to plot the cross-validation error for the range of Ks specified earlier ( Supplementary Fig. 1c). We recommend selecting K values as a trade-off between too high complexity, that is, fitting the noise in the data (higher K values) and insufficient degrees of freedom (lower K values). This typically corresponds to the saddle point where cross-validation error starts to level out. 17 For a fixed K, select a value for the regularization parameter (λ) by proceeding to the 'Lambda selection' panel in FactorViz ( Supplementary Fig. 1d). An optimal value for λ will often correspond to a local minimum of the cross-validation error. On the other hand, noticeable changes in other statistics, such as objective value or root mean squared error (RMSE), can point at different λ values. 18 In the 'Proportions' panel of FactorViz, visualize LMC proportions using heatmaps ( Supplementary  Fig. 1e). Proportion heatmaps can be visually annotated with available qualitative and quantitative traits (see the 'Color samples by' dropdown). Further visualization options include stacked barplots and lineplots. 19 In the 'Meta-analysis' panel, associate LMC proportions with quantitative and qualitative traits using correlation and t tests (Supplementary Fig. 1f). Further sample annotations, such as mutational load (https://www.cbioportal.org/study/clinicalData?id=luad_tcga_pan_can_atlas_2018) 58 Fig. 1g). Several visualization options are available, such as multidimensional scaling and histograms. Reference profiles can be used for joint visualization in case those are available. 21 Determine DNA methylation sites that are specifically hypo-and hypermethylated in an LMC by comparing the methylation values in the LMC matrix for each LMC with the median of the remaining LMCs in the 'Meta-analysis' panel ( Supplementary Fig. 1h). LMC-specific sites can be used for either a gene-centric GO 53 analysis (select 'Enrichments' and 'GO Enrichments' in the 'Analysis' and 'Output type' dropdown) or for region-based enrichment analysis with the LOLA 54 package (select 'LOLA Enrichments' in the 'Output type' dropdown; Supplementary Fig. 1i,j). The differential methylation sites can be exported for further analysis, and the resulting plots can be stored by clicking on the 'PDF' button. c CRITICAL STEP Additional built-in options for visualization of LMCs and proportions are available via the MeDeCom functions plotLMCs and plotProportions in R. Finally, the LMC and proportions matrix can be extracted from the MeDeComSet object, and the genomic annotation of CpGs (ann.C) can be obtained from the FactorViz output for a custom downstream analysis:

Troubleshooting
Troubleshooting advice can be found in Table 3. For further support questions, feature requests or help, use GitHub's issue system or write an email to the tool developers (mscherer@mpi-inf.mpg.de, p.lutsik@dkfz-heidelberg.de).

Timing
Step 1, installing R:~5 min Step 2, installing software packages:~1 h 45 min Step 3, downloading DNA methylation data from TCGA:~5 h All sites are removed during filtering The provided quality criteria were too stringent Provide less stringent quality criteria; that is, check max. int.quant, min.int.quant and min.n.beads 10 DecompPipeline stops during confounder adjustment The system runs out of memory during ICA adjustment Reduce the number of components tested (nmax), the number of cores used (ncores) or the number of independent ICA runs (ntry) 12 MeDeCom does not finish properly or stops without an error message The system runs out of memory Check the log files in the project directory and potentially get a larger machine to perform deconvolution 14 FactorViz does not show any of the plots MeDeCom has not finished properly Check the log files in the project directory Steps 4 and 5, generating annotation sheet and copying data: <1 min Step 6, configuring RnBeads options:~10 min Step 7, running RnBeads:~2 h Step 8, checking data quality:~10 min Step 9, preprocessing:~2 h (no normalization),~2 h 40 min ('dasen' norm.),~8 h (BMIQ norm.) Step 10, confounder adjustment (DecompPipeline):~14-20 h Step 11, selecting CpG subsets:~1 min Steps 12 and 13, performing deconvolution:~36-54 h (MeDeCom),~20 min-1 h (RefFreeCellMix) Step 14, starting FactorViz: <1 min Steps 15-17, determining the parameters:~20 min Step 18, visualizing the proportions:~5 min Step 19, associating phenotypic traits:~45 min Step 20, visualizing the LMCs:~5 min Step 21, interpreting LMCs matrix:~45 min Timing estimates depend largely on the size of the dataset, the computational setup, the unrelated workload of the machine and the iteration limit of the alternating optimization procedure (parameter itermax, default value 1,000). A faster overview analysis can be performed with itermax = 100.

Quality control and feature selection
Deconvolution analysis requires input DNA methylation data of sufficient technical quality, which we verified using RnBeads' QC module. Quality control probes on the Infinium array did not reveal lowquality samples to be removed. Verifiable phenotypic information matched the inferred sample properties, such as the predicted sex of all subjects (Extended Data Fig. 1). We used several criteria to select a set of high-confidence CpGs as input to MeDeCom. Most of the discarded sites (39.5%) were removed, because they were covered by fewer than three beads in any of the samples, or showed unusually high or low signal intensity. Further CpG filtering steps, including sequence context (SNPs, sites on the sex chromosomes, 10.5%) and removal of cross-reactive probes (2.5%) eliminated further problematic sites. As a final outcome of the filtering procedure, 230,223 sites (47.4% of 485,577) passed our stringent quality criteria and were used for downstream analysis. We note that these are extremely stringent quality criteria, but deconvolution typically requires only a small subset of CpGs, which should be selected from a high-quality set of CpGs.

Confounding factor analysis
We evaluated ICA by applying the proposed workflow to the TCGA dataset twice: once without adjustment for age, sex, race and ethnicity, and once with such adjustment using ICA (Fig. 2a). ICA detected 22 components, of which 2 were significantly associated with sex and ethnicity, respectively (Fig. 2b). The overall distribution of the DNA methylation matrix was still bimodal after ICA adjustment, although there were notable peaks at methylation values 0 and 1, respectively. We argue that these peaks are a consequence of the scaling to the interval [0, 1] (Fig. 2c). After running MeDeCom independently on the unadjusted and the adjusted DNA methylation matrix, three of the detected components were significantly linked to sex in the unadjusted run (P values of 6 × 10 −4 for LMC1, 1.4 × 10 −5 for LMC4 and 3 × 10 −3 for LMC5), but only one component was mildly linked to sex in the adjusted run (P value of 7.8 × 10 −4 for LMC7, Fig. 2d). Although ICA component 21 was linked to ethnicity, we could not find a similar association with LMCs. Neither age nor race were significantly linked to any component produced by either ICA or MeDeCom.

Deconvolution results
The deconvolution results of the TCGA LUAD dataset are shown in Fig. 3. Since we did not have prior knowledge on the expected number of underlying cell types to select, we resorted to the crossvalidation procedure of MeDeCom. We chose seven LMCs as the value of K at which the crossvalidation error started to level out (Extended Data Fig. 2a). Similarly, we selected λ = 0.001 (regularization parameter) as the point where the cross-validation error is still low, while the objective value and RMSE substantially change (Extended Data Fig. 2b). Notably, LMC5 was particularly hypomethylated and LMC6 showed a high overall methylation level, while the remaining LMCs were rather intermediately methylated (Extended Data Fig. 2c,d). We found that the use of different normalization methods only marginally influenced the results (Supplementary Fig. 2) and that the results of applying RefFreeCellMix as the deconvolution tool were highly concordant with those of MeDeCom (Extended Data Fig. 3 and Supplementary Fig. 3). We further investigated the biological implications of the detected LMCs. To this end, we found that LMC5 had substantially higher proportions in the healthy tissue samples (Fig. 3a). This indicated that reference-free deconvolution was able to capture the inherent methylation signatures specific to cancerous and healthy tissues. When we conducted enrichment analysis for the sites that were specifically hypomethylated in LMC5, we found that binding sites for the core members of the Polycomb repressive complex 2 (PRC2), SUZ12 and EZH2, were overrepresented. The PRC2, which typically represses oncogenes, is often dysregulated in cancers, a process that is often associated with hypermethylation [61][62][63] . What is more, proportions of LMC5 in tumor tissues provided a generic estimate of tumor sample purity, that is, the degree of contamination by adjacent normal tissue. Thus, we were able to capture tumor-specific methylation signatures with both MeDeCom and RefFree-CellMix (Extended Data Fig. 3) without conducting differential analysis between two phenotypic groups.
LMC3 showed highly variable proportions across the samples and was the main driver of the overall cancer sample clustering. LMC3 proportions were strongly correlated with the LUMP estimate (Fig. 3b), which predicts the overall immune cell content of a sample 49 . Furthermore, we detected enrichments of LMC3-specific hypomethylated sites toward leukocyte (B lymphocyte)-specific transcription factor binding sites and immune response-related GO terms (Fig. 3c,d). The detected LMCs correlated with components generated using ICA on matched RNA-seq data, which similarly revealed enrichment toward immune-related GO terms (Supplementary Fig. 4a). We concluded that LMC3 most likely represented tumor-infiltrating immune cells. The extent of tumor infiltration might be relevant to associate cancer state to patient prognosis 64 .  To determine whether the detected LMCs reflected the expression of known marker genes of lung tissue cell types, we selected EPCAM as an epithelial, CLDN5 as an endothelial, COL1A2 as a stromal and PTPRC as an immune cell marker 60     samples (data processing using the edgeR 65 and TCGAbiolinks 66 R packages is described in Supplementary Note) and found LMC3 to be correlated with PTPRC expression. Furthermore, LMC1 was strongly associated with the epithelial marker expression and LMC5 with the endothelial marker CLDN5 (Fig. 3e). Similarly, the components detected by RefFreeCellMix correlated with the expression of these marker genes (Extended Data Fig. 3e). Many of the detected components were linked to cancer mutational status, such as overall mutation count or chromosomal gain or loss ( Supplementary Fig. 5), or cancer stemness 67 (Supplementary Fig. 6). We further investigated the sites that were particularly hypo-or hypermethylated in LMC4, since LMC4 showed a high proportion in a small subset of samples that clustered separately from the remaining cancer samples (Fig. 3a). One of these sites, cg26992600, is located 3 kb upstream of the TSS of NKX2-8, a gene that has potential roles in the progression of lung cancers 68 (Supplementary Fig. 7 and Supplementary Table 2). Finally, in a survival analysis of LMC proportions, we found that proportions of LMC5 and LMC6 were associated with survival time (P values of 0.18 for : LMC5 and 0.03 for LMC6, Extended Data Fig. 4), warranting further investigation of their biological nature.

Conclusions
Our methylome deconvolution protocol is able to provide novel biological insights into the cellular and intra-tumor heterogeneity of lung cancer. In addition to Illumina microarray datasets, including Illumina EPIC and 450k data, the protocol is readily extensible to bisulfite sequencing data, and we applied a modified version of the protocol to RRBS profiles from Ewing sarcoma patients 69 (epigenomics.dkfz.de/DecompProtocol/protocol_RRBS.html, Extended Data Fig. 5 and Supplementary Fig. 8). We expect the protocol to be of great benefit to all investigators performing DNA methylation analysis in complex and underexplored experimental systems, including bulk samples of highly heterogeneous tissues and tumors. We envision that, after further adaptations, the proposed protocol will also be applicable to datasets beyond DNA methylation, including transcriptomic data (Supplementary Table 3).

Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The results shown here are wholly or partially based upon data generated by the TCGA (TCGA-LUAD dataset) Research Network: https://www.cancer.gov/tcga. The Ewing sarcoma dataset is available from the Gene Expression Omnibus GEO, accession number GSE88826.

Code availability
All R packages are available from public code repositories: DecompPipeline: https://github.com/CompEpigen/DecompPipeline Fig. 3 | Interpreting MeDeCom results with FactorViz. a, Proportion heatmap of LMCs in the different samples. We selected K = 7 LMCs and set λ = 0.001. The samples were hierarchically clustered according to the Euclidean distance between the proportions using complete linkage. We annotated samples using disease status and with the sample-specific LUMP estimate. b, Associations between the phenotypic traits and LMC proportions. For quantitative traits, the Pearson correlations are shown as ellipses that are directed to the upper right for positive and to the lower right for negative correlations, respectively. For qualitative traits, the absolute difference of the proportions in the two groups (for example, female versus male) is shown. P values (two-sided correlation test for quantitative and two-sided t test for categorical variables) <0.01 are indicated by bold borders. LOLA 54 (c) and GO 53 (d) enrichment analysis of the LMC-specific hypomethylated sites for LMCs 1, 3 and 5. No significant GO enrichment was found for LMC 1 and 5. Sites were defined as LMC-specific hypomethylated if the difference between the LMC value and the median of all other LMCs was lower than 0.5. P values have been adjusted for multiple testing with the Benjamini-Hochberg method 70 . Encode TFBS, transcription factor binding sites ChIP-seq profiles from the Encode 71 project; Encode segmentation, chromatin state segmentation of Encode ChIP-seq profiles; Codex, ChIP-seq profiles from the Codex database 72 ; Cistrome, ChIP-seq profiles from the Cistrome project 73 . e, Scatterplots between LMC proportions per sample and known marker gene expression of different lung cell types. The gene expression was measured using counts per million (CPM), the blue line represents the least squares regression line, and the P values were computed using a two-sided correlation test. This figure is based on plots generated by custom plot scripts motivated from FactorViz.  c Multidimensional scaling of the LMC data matrix after fixing the number of components to 7 and the regularization parameter to 0.001. Shown are the first two multidimensional components. d Violin plots of the LMC methylation matrix for the selected parameters. Boxplot lines represent the median, the 25th-and 75th-percentiles, and 1.5 times the inter-quartile range. Fig. 3 | Interpreting RefFreeCellMix results with FactorViz. a Heatmap of LMC proportions in TCGA-LUAD cohort samples (K=7 components). The samples were hierarchically clustered according to the Euclidean distance between the proportions using complete linkage. We annotated samples using disease status and with the sample-specific LUMP estimate. b Associations between the phenotypic traits and proportions. For quantitative traits, the Pearson correlations are shown as ellipses that are directed to the upper right for positive and to the lower right for negative correlations, respectively. For qualitative traits, the absolute difference of the proportions in the two groups (for example, female vs. male) is shown. P values (two-sided correlation test for quantitative and two-sided t-test for categorical variables) less than 0.01 are indicated by bold borders. LOLA (c) and GO (d) enrichment analysis of the LMC-specific hypomethylated sites for components 1, 2 and 4. No significant GO enrichment was found for components 1 and 4. Sites were defined as LMC-specific hypomethylated if the difference between the value of the methylation component and the median of all other components was less than 0.5. P values have been adjusted for multiple testing with the Benjamini-Hochberg method. e Scatterplots between proportions per sample and known marker gene expression of different lung cell types. The gene expression was measured using counts per million (CPM). Fig. 4 | Survival analysis using the survival R-package 52 comparing different levels of LMC proportions. Shown are Kaplan-Meier curves, while samples were stratified according to the LMC proportions into two groups according to the median (high vs. low proportions). P values were computed using the Cox proportional hazards model with the LMC proportions as input, and age, sex, and tumor stage as covariates.

Extended Data
Extended Data Fig. 5 | Interpreting MeDeCom results on the Ewing sarcoma RRBS data set 69 with FactorViz. a Heatmap of LMC proportions in the Ewing sarcoma samples (K=6 components, λ=0.001). The samples were hierarchically clustered according to the Euclidean distance between the proportions using complete linkage. We annotated samples using the tumor location and with the sample-specific LUMP estimate. b Associations between the phenotypic traits and proportions. For quantitative traits, the Pearson correlations are shown as ellipses that are directed to the upper right for positive and to the lower right for negative correlations, respectively. For qualitative traits, the absolute difference of the proportions in the two groups (for example mutation vs. wildtype) is shown. P values (two-sided correlation test for quantitative and two-sided t-test for categorical variables) less than 0.01 are indicated by bold borders. GO (c) and LOLA (d) enrichment analysis of the LMC6-specific hypomethylated sites. No significant LOLA and GO enrichments were found for the remaining LMCs. Sites were defined as LMC-specific hypomethylated if the difference between the value of the LMC and the median of all other components was less than 0.5. P values have been adjusted for multiple testing with the Benjamini-Hochberg method. No matched gene-expression values were available for this data set.

October 2018
Corresponding author(s): Pavlo Lutsik Last updated by author(s): May 8, 2020 Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code Data collection TCGA data has been collected using the freely available GDC data transfer tools provided by the TCGA network: https://gdc.cancer.gov/ access-data/gdc-data-transfer-tool.

Data analysis
Data has been analyzed using freely available software packages, that we provide on GitHub and GitLab: DecompPipeline: https://github.com/CompEpigen/DecompPipeline MeDeCom: https://github.com/lutsik/MeDeCom FactorViz: https://github.com/CompEpigen/FactorViz consensusICA: https://gitlab.com/biomodlih/consica . The scripts to reproduce the figures in the manuscript are freely available from our supplementary website: http://epigenomics.dkfz.de/DecompProtocol/ For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability The results shown here are wholly or partially based upon data generated by the TCGA (TCGA-LUAD dataset) Research Network: https://www.cancer.gov/tcga. The Ewing sarcoma dataset is available from the Gene Expression Omnibus GEO, accession number GSE88826.