Abstract
Pancreatic cancers are typically diagnosed at late stage where disease prognosis is poor as exemplified by a 5-year survival rate of 8.2%. Earlier diagnosis would be beneficial by enabling surgical resection or earlier application of therapeutic regimens. We investigated the detection of pancreatic ductal adenocarcinoma (PDAC) in a non-invasive manner by interrogating changes in 5-hydroxymethylation cytosine status (5hmC) of circulating cell free DNA in the plasma of a PDAC cohort (n=51) in comparison with a non-cancer cohort (n=41). We found that 5hmC sites are enriched in a disease and stage specific manner in exons, 3’UTRs and transcription termination sites. Our data show that 5hmC density in H3K4me3 sites is reduced in progressive disease suggesting increased transcriptional activity. 5hmC density is differentially represented in thousands of genes, and a stringently filtered set of the most significant genes exhibited biology related to pancreas (GATA4, GATA6, PROX1, ONECUT1) and/or cancer development (YAP1, TEAD1, PROX1, ONECUT1, ONECUT2, IGF1 and IGF2). Regularized regression models were built using 5hmC densities in statistically filtered genes or a comprehensive set of highly variable gene counts and performed with an AUC = 0.94-0.96 on training data. We were able to test the ability to classify PDAC and non-cancer samples with the elastic net and lasso models on two external pancreatic cancer 5hmC data sets and found validation performance to be AUC = 0.74-0.97. The findings suggest that 5hmC changes enable classification of PDAC patients with high fidelity and are worthy of further investigation on larger cohorts of patient samples.
Introduction
Translational research using genomic and proteomic technologies has provided molecular insights into the pathogenesis and biology of pancreatic cancer but has yet to yield robust diagnostic biomarkers to impact early diagnosis of disease, as reflected by a low overall 5-year survival rate of 8.2%1,2. Pancreatic cancer often presents late and has few symptoms, at which point only 10-20% of patients are eligible for surgical resection 2. Pancreatic ductal adenocarcinoma (PDAC) and its variants account for more than 90% of all pancreatic malignancies 3 with the next most common sub-type being neuroendocrine tumors 2. Tobacco smoking confers a two- to three-fold higher risk of pancreatic cancer and also demonstrates a dose-risk relationship, while contributing to approximately 15 to 30% of cases 2, with smokers diagnosed 8 to 15 years younger than non-smokers4,5. Family history is contributory in approximately 10% of cases, and germline mutations in genes such as BRCA2, BRCA1, CDKN2A, ATM, STK11, PRSS1, MLH1 and PALB2 are associated with pancreatic cancer with variable penetrance 2.
The management of PDAC presents physicians with challenges along the entire clinical spectrum, including early detection in high risk individuals, early diagnosis of patients with symptoms or imaging findings, prognostication of outcomes and prediction of therapeutic responsiveness. Collectively these factors have engendered intensive efforts in translational research to identify and validate biomarkers with sufficient clinical performance metrics to improve decision algorithms and resultant clinical outcomes. Current guidelines in PDAC management are limited to two biomarker recommendations tested in an invasive fashion in cystic fluid. First, carbohydrate antigen 19-9 (CA 19-9) guides surgery decisions, use of adjuvant therapy, or the detection of post-operative tumor recurrence, however, the utility is limited because 10% of the population does not secrete the antigen6. Second, carcinoembryonic antigen (CEA) concentration determination from cyst fluid is used to distinguish higher risk mucinous from non-mucinous cysts7,8, thereby mitigating risk. Among the inherited risk factors are genomic mutations such as BRCA2, which confers a 3.5-fold risk in carriers, with the probability of a germline mutation between 6 to 12% in PDAC patients with a first-degree relative diagnosed with PDAC9.
Molecular analyses of pancreatic cancer genomes have revealed activating mutations in KRAS and inactivation of CDKN2A, TP53 and SMAD4, either through point mutation or copy number changes at >50% population frequency10–12, however much mutational heterogeneity exists rendering this subset of genes incomplete for the diagnosis of patients. Molecular subtyping of pancreatic tumors using mutational-based data11 or gene expression signatures13–15 have not yet seen clinical applicability. Other forms of epigenetic data have focused on chromatin-based post-translation modifications and the methylation status of cytosine based in DNA.
The control of DNA state and chromatin regulation have been observed to underpin the onset and progression of oncologic disease16,17. DNA methylation status of cytosine bases has been shown to associate with transcriptional regulation of gene expression. DNA methylation in promoters tends to associate with gene silencing whereas demethylation is associated with gene activation18. More recently, detailed understanding of demethylation has been enabled with precision around intermediate states during demethylation activation19,20. Specifically the oxidation of these methyl group via TET enzymes to 5-hydroxymethyl cytosine (5hmC) have yielded novel signatures that have enable the definition of cellular states 21, as well as the identification of cancer in the cell free state22–24.
Previously, molecular signatures have been shown in circulating cell free DNA (cfDNA) based on 5-hydroxymethylation that may define the tissue of tumor origin in a variety of disease types22. Therefore, we embarked on a case-control study aimed at investigating whether DNA 5hmC signatures were present in the blood of patients with PDAC compared to a cohort of non-disease individuals. We also investigated whether these signatures enabled the discrimination between cancer and non-cancer patients.
We found that in our study population, PDAC patients possess many thousands of genes with an altered hydroxymethylome compared to non-disease individuals. Furthermore, filtering to those genes with the most differentially hydroxymethylated states reveals genes that have been previously implicated in pancreas development or pancreatic cancer. This biologically significant gene set performs well in the construction of predictive models to discriminate PDAC from non-disease, suggesting that the measurement of 5hmC in cfDNA merits further investigation for the detection and classification of PDAC.
Results
Clinical cohort and Study Design
Plasma specimens from 92 subjects without or with pancreatic ductal adenocarcinoma (PDAC) were collected at multiple institutions in different geographic regions of the United States and Germany. These PDAC and non-cancer patient samples satisfied the study inclusion criteria, which included a minimum subject age of 18 years as well as confirmed pathologic diagnosis of adenocarcinoma of any subtype at the time of biopsy or surgical resection for subjects in the cancer cohort (Figure 1A). The non-cancer cohort was identified as satisfying the study inclusion criteria and patients were specifically negative for any form of cancer. Neither cohort were being treated with medication for disease at the time of blood collection, which was prior to any biopsy or surgical resection in the cancer cohort. There were no statistically significant differences in subject age or gender between the two cohorts, but there was a statistically significant greater tobacco exposure in the PDAC cohort, as expected given smoking is common risk factor pancreatic cancer.
Sequencing results and metrics
Filtering criteria to enable the determination of high quality 5hmC libraries were established yielding 51 PDAC and 41 non-cancer subjects. These criteria were established from previous studies22 and extensive analysis did not reveal batch processing effects occurring specifically in either study cohorts. PDAC samples that were exclusively either male or female were combined with non-cancer samples that were exclusively either female or male in a ratio of 2:1 respectively following a block randomization scheme. This generated two batch types for the study (Figure 1B) and enabled the identification of sample swaps, none of which were detected. A single pooled non-cancer sample (30 non-cancer donor were combined into one plasma pool from which cfDNA was isolated) served as a technical/process control for the each of the batches in the study. 5hmC enrichment libraries were sequenced to produce a median number of unique read pairs of 9.1 and 10.7 million in the PDAC and non-cancer cohorts respectively.
Distributions of 5hmC densities into functional regions in PDAC and non-cancer cohort
To gain an understanding of the functional genomic regions possibly regulated by hydroxymethylation, we first determined 5hmC enriched loci, as measured by increased read density and detection as peaks by MACS2. The vast majority of 5hmC loci occur on average in non-coding regions of the genome (intronic, transposon repeats – SINES and LINEs, and intergenic - Figure 2A) with no preferential distribution in the PDAC or non-cancer cohort. Despite the high frequency of 5hmC occurrence, these functional regions exhibit low enrichment (intron, Figure 2B) or even depletion of 5hmC sites (intergenic and LINE elements, Figure 2B) relative to the genome background. Instead enrichment occurs in promoters, UTRs, exons, transcription termination sites (TTS) and SINE elements, as measured by increased relative fold change compared to the genome background. Significant differences in enrichment of 5hmC peaks over functional regions were observed in a disease cohort specific manner. Increases in enrichment in PDAC were measured in exons, 3’UTR and TTS whereas decreases were found in promoter and LINEs, which themselves were either enriched or depleted respectively (Figure 2C). These global changes found to occur in a statistically significant manner in each cohort were also found to occur in a cancer stage specific manner, with gradual increases (exon, 3’UTR and TTS) or decreases (promoter and LINE) in later stage patients (Figure 2D).
Next, we investigated 5hmC occupancy, and its associated changes in PDAC, with respect to chromatin state. Post-translational modifications such as methylation and acetylation on histone proteins were inferred in relation to 5hmC occupancy, using the existing histone maps from the pancreatic cancer cell line, PANC-1, for which epigenetic data were made available by ENCODE25. Notably, reduced overlap with 5hmC was observed in PDAC coincident with loci associated with H3K4me3 and H3K27ac, both of which mark transcriptionally active states (Figure 2E). There was no significant difference in global 5hmC overlap observed between PDAC and non-cancer cohorts in H3K4me1-associated loci, which mark enhancer regions (Figure 2E). Furthermore, 5hmC in H3K4me3 associated loci are significantly reduced with progressive disease as observed when the PDAC cohort is subdivided into disease stages (p=0.0003 – Figure 2E bottom right panel). Conversely, there were no statistically significant changes detected in 5hmC globally over H3K4me1 and H3K27ac associated loci with progressive disease (Figure 2E – bottom panels). Amongst the three histone maps from PANC-1 cell line, H3K4me1 has the most abundant overlap with 5hmC occupancy in both the PDAC and non-cancer cohort (Supplementary Figure 1). The PDAC samples have an increased 5hmC intensity over H3K4me1- associated sites compared to the non-cancer cohort (Figure 2F – bottom panel). Conversely, H3K4me3 loci exhibit the lowest 5hmC intensity in both cohorts (Figure 2F – top panel) and the least abundant overlap with 5hmC occupancy (Supplementary Figure 1).
Identification of disease specific genes from plasma samples
Differential analysis of 5hmC densities in genes revealed 6,496 and 6,684 genes with an increased and decreased 5hmC density respectively in PDAC compared to non-cancer samples (Figure 3A – adjusted p-value < 0.05). Further filtering of this gene set (fold change ≥ 1.5 in PDAC versus non-cancer and average log 2 CPM ≥ 4 counts, 142 genes total) revealed annotated genes with increased 5hmC density and whose biology is related to pancreas development (GATA426, GATA626, PROX127, ONECUT128) and/or implicated in cancer (YAP129, TEAD129, PROX130, ONECUT2/ONECUT1, IGF1 and IGF2). Inspection of the MSigDB for relevant pathways comprising the 142 genes with enriched 5hmC densities revealed a preponderance of pathways down-regulated in liver cancer (5 of the top 10 most significant pathways – Table 2). The differential representation analysis coupled with filtering (fold change ≤ 1.5 in PDAC versus non-cancer and log CPM of 5hmC ≥ 4) also revealed 178 genes with a decreased 5hmC density in pancreas cancer cfDNA. Closer inspection of these pathways with decreased 5hmc representation revealed fundamental pathways in immune system regulation (3 of the top 10 most significant pathways – Table 3).
Expanding gene set enrichment analysis to include the full data set of all genes revealed that more than 30% of immune related pathways have a reduced 5-hydroxymethylation across early and late stage PDAC (Figure 3 B, Table 4). Multidimensional scaling analysis (MDS) using either the 13,180 genes with high variation in 5hmC counts (Figure 3 C) or the 320 genes filtered at the extremes of 5hmC representation in PDAC (Figure 3 D), reveal partitioning of the PDAC samples from the non-cancer equally well, using statistically filtered genes that form a biologically relevant set. Furthermore, the 320 genes were employed in a hierarchical clustering analysis, which enabled better partitioning of the pancreas and healthy 5hmC data from Song et al22 compared with similar data from Li et al24 (Figure 3E). In summary, we have been able to find a differentially represented gene set whose biological functions are congruent with both pancreatic development and cancer more broadly and the hydroxymethylation densities of these genes alone enable the partitioning of PDAC from non-cancer.
Predictive models for the detection of pancreatic cancer in cfDNA
We performed regularized logistic regression analysis in order to determine whether gene-based features are present in the PDAC and non-cancer cohorts that enable the classification of patient samples. The full set of 92 patient samples were partitioned into a training and test set comprising 75% and 25% of the patient data respectively and 65% of the genes with the most variable 5hmC count were employed for model selection. Two methods of regularization were employed, elastic net (glmnet) and lasso (glmnet2)31. Other modeling approaches were explored such as random forest, support vector machines and neural nets in a preliminary analysis and were found to have inferior performance on the training data.
Both regularization methods require specifying hyper-parameters which control the level of regularization used in the fit. These hyper-parameters were selected based on out-of-fold performance on 30 repetitions of 10-fold cross-validated analysis of the training data. Out-of-fold assessments are based on the samples in the left-out fold at each step of the cross-validated analysis. The training set yielded an out-of-fold performance metric, Area Under Curve (AUC), of 0.96 (elastic net and lasso) with an internal sample test AUC of 0.84 (elastic net) and 0.88 (lasso) (Fig 4B). The distribution of probability scores indicated that within the training data both models classify well (Figure 4B) but that improved robustness and stability of scoring was found with the elastic net model as evidenced by reduced variation in probability scores observed during repeated cross-validations. Next, the training model was tested on two external validation set of patient samples. These include pancreatic cancer and healthy samples from Li et al24 (pancreas subtype was not specified, 23 pancreas, 53 healthy) and Song et al22 (pancreas subtype specified as adenocarcinoma, 7 pancreas, 10 healthy – Supplemental Figure 2). The validation set exhibited a performance with AUC = 0.78 (elastic net and lasso) in the Li et al data and AUC = 0.99 (elastic net) and 0.97 (lasso) in the Song et al data (Figure 4C).
The effect of feature selection on prediction performance was evaluated by filtering the initial set of significant genes (Figure 3B) to satisfy a 1.5 fold differential 5hmC representation in the PDAC cohort with median representation of gene counts of log2 average 5hmC representation > 4. This filtering approach was applied on 75% sample data, reserving the remaining 25% for subsequent testing (see below). The same regularized regression models were built using this set of 287 genes with increased 5hmC and 343 genes with decreased 5hmC density, employing a similar setup for training and testing as defined previously and found training set AUC = 0.96 (elastic net) and 0.94 (lasso). Not surprisingly, internal testing yielded a high performance with AUC = 0.92 (elastic net) and 0.93 (lasso). Of greater interest, was the performance on external data sets with AUC = 0.74 (elastic net) and 0.67 (lasso) for Li et al data and AUC = 0.97 (elastic net) and 0.94 (lasso) for Song et al data. This suggests that statistically filtered genes that are biologically relevant to pancreatic cancer and/or pancreas development do not perform much better than an algorithmically driven selection of features during regression training, as has been shown elsewhere32.
The final models fitted to the 65% most variable 5hmC gene features in the training set, using hyper-parameter values determined from the training set data analysis, were fitted to the whole cohort of PDAC and non-cancer samples and this yielded models with 109 genes (elastic net) and 47 genes (lasso). The genes in the models were found to possess PDAC versus non-cancer t-scores that are concordant with both the Li et al and Song et al data sets (Figure 5).
Discussion
This study was focused on the discovery of cfDNA specific hydroxymethylation-based biomarkers that may facilitate the development of molecular diagnostic tests to detect pancreatic cancer at earlier stages. Our data highlight the ability to detect differentially hydroxymethylated genes whose underlying biology shows association with both pancreas and cancer development as well as established trends in chromatin mark maps and other functional regions of the genome. Furthermore, regularized regression methods were used to build models from (i) statistically filtered genes that form a biologically relevant set and (ii) a comprehensive gene set found to be highly variable, and this yielded models with AUC = 0.94- 0.96 with an external data set validation AUC = 0.74-0.97 (elastic net models).
The 5hmC signal was readily found to be enriched in gene-centric sequence types (promoter, exons, UTR and TTS), as well as transposable elements like SINEs (enriched) and LINEs (depleted) (Figure 2A, B). Such hydroxymethylcytosine changes in functional regions have been reported in cfDNA from colorectal24, esophageal23,33 and lung cancer23. In a similar manner, PDAC specific gains or losses in hydroxymethylation were observed in functional regions in our data. In addition to enrichment and depletion of 5hmC in functional regions, there was a novel PDAC specific 5hmC increase in exons, TTS and 3’UTR and a 5hmC decrease in promoters and LINE elements (Figure 2 C). In embryonic stem cells, 5-hydromethylation decreases in the promoter region have been shown to associate with gene transcription34. An increase in disease relevant transcription may be implicitly supported in our PDAC data by the 5hmC increase in gene-centric features mentioned earlier, as well as an apparent decreasing trend of 5hmC in promoter regions toward late stage PDAC (Figure 2 D).
Dynamic changes in chromatin have been shown to control cell development and transition of cells with oncogenic potential35. The PDAC specific decrease of 5hmC in H3K4me3 loci appear to be coincident with a non-statistically significant increase of 5hmC in H3K4me1 (Figure 2 E). These DNA hydroxymethylation patterns appear to complement each other in genomic location and also the histone marks they occupy (Figure 2 F) and also suggest disease specific increases in gene transcription via chromatin modifications, given the known permissive transcriptional function associated with H3K4me3/me136. Precise 5hmC patterning around known functional elements of the genome suggests a broader function for hydroxymethylation in the epigenetic control of transcriptional processes. Additional work will reveal the extent to which models predictive of PDAC can be built from a combination of gene-specific, chromatin mark and transposable elements detected in cfDNA.
In this study, we employed coarse resolution of hydroxymethylation at the gene-based level in PDAC and yet were able to find genes whose increased 5hmC signals in highlighted pathways implicated in liver cancer (Table 2). We note that MSigDB does not currently contain pathways annotated for pancreatic cancer37 and further that pancreas typically has groups of expressed genes shared with liver and salivary gland (https://www.proteinatlas.org/humanproteome/pancreas). We employed two approaches for gene set enrichment analysis, either using genes with differentially decreased 5hmC or via performing GSEA on all reporting genes, and found close to one third of immune system pathways were implicated. Assuming the strong association between 5hmC density and gene transcription, one interpretation of this result is that immune system function is decreased in PDAC patients. Inspection of individual genes that were either significantly increased or decreased in 5hmC density reveals genes implicated in normal pancreas development, for instance the transcription factors GATA4, GATA6, PROX1, ONECUT1/2, and also genes whose increased expression is implicated in cancer like YAP1, TEAD, PROX1, ONECUT2, ONECUT1, IGF1 and IGF2. The impact of increased relative 5hmC representation of transcription factor genes like GATA4, GATA6, PROX1, ONECUT1/2 in PDAC patient, whose involvement in early pancreatic development, suggest a reversion to a stem-like state. This is further supported by the fact that some of the significant genes with increased 5hmC representation identify a stem cell pathway (BOQUEST_STEM_CELL_UP, in top 20 mSigDB pathways from 142 genes with increased 5hmC in PDAC).
Identifying genes whose 5hmC densities are significantly changed in PDAC, leads to an enrichment of genes with annotated relevant biology which can be used to build regularized regression models, whose performance matched models built on the more comprehensive set of variable genes. This gives us good confidence that our models, whose performance is high (training AUC = 0.94-0.96 with an external data set validation AUC = 0.74-0.97), are measuring underlying biological signals relevant to PDAC. Our current external data set performance may be somewhat explained by the small sample size of external data sets (Li et al, 23 pancreas, 53 healthy and Song et al, 7 pancreas, 10 healthy). Also, whilst our external validation AUC on Li et al data was generally lower than for Song et al data, we note that Li et al pancreatic data were distributed with a mode around stage 3 disease versus a mode at stage 2 for this study (Song et al was broadly distributed across all stages – Supplementary Figure 2), thus our predictive model may be better suited for the detection of earlier versus later stage disease. Other patient characteristics (such as histological subtype and smoking, etc.) may have also differed in these independent study sets.
Despite the large number of differentially hydroxymethylated genes the regularized regression models included 100 genes or less. However the fact that 13,180 differentially hydroxymethylated genes were detected suggest that other biological signals may also reside in our data set. Smoking status is a known risk factor for PDAC up to 20 years post smoking cessation and DNA methylation changes have been associated with tobacco-based toxins38. In our retrospective case-control designed study, ever smokers constituted 59% and 49% of PDAC and non-cancer cohorts respectively, indicating that ever smokers are well represented in each cohort. Consequently, we do not suspect that smoking association in our PDAC cohort could account for the significantly hydroxymethylated genes found. However, a more extensive future study focused on sub-partitioning PDAC and non-cancer patient into never and ever smokers with pack-year characteristics will enable us to address the impact of smoking on the hydroxymethylome in PDAC patients. These pancreatic cancer risk parameters combined into a clinical relevant, intent to test population based study, will further our current set of findings beyond our current case-control cohort study, which numbers less than 100 participants. Further consideration of disease-related clinical parameters will enable us to explore hydroxymethylcytosine features with the aim of yielding refined signals capable of earlier diagnosis of PDAC.
Methods
Clinical cohorts and study design
A case-control study was performed using plasma obtained from subjects without (termed non-cancer) and with pancreatic cancer who provided informed consent and contributed biospecimens in studies approved by the Institutional Review Boards (IRBs) at participating sites in the United States and Germany. Plasma samples for the non- cancer cohort were obtained from subjects enrolled prospectively at five sites in the United States, following review and approval of the study protocol by each site’s participating investigator(s).
Cancer cohort
Plasma samples for the cancer cohort were obtained from subjects who had undergone management for pancreatic cancer in the United States or Germany, and also provided consent for use of blood specimens for archival storage and retrospective analyses. Criteria for subject eligibility for inclusion in the analysis included age greater than or equal to 21 years for all subjects, with additional requirements for the cancer cohort including: 1) no cancer treatment, e.g., surgical, chemotherapy, immunotherapy, targeted therapy, or radiation therapy, prior to study enrollment and blood specimen acquisition; and 2) a confirmed pathologic diagnosis of adenocarcinoma inclusive of all subtypes.
Non-cancer cohort
Subject exclusion criteria for the non-cancer cohort also included any of the following: prior cancer diagnosis within prior six months; surgery or invasive procedure requiring general anesthesia within prior month; non-cancer systemic therapy associated with molecularly targeted immune modulation; concurrent or prior pregnancy within previous 12 months; history of organ tissue transplantation; history of blood product transfusion within one month; and major trauma within six months. Clinical data required for all subjects included age, gender, smoking history, and both tissue pathology and grade, and were managed in accordance with the guidance established by the Health Insurance Portability and Accountability Act (HIPAA) of 1996 to ensure subject privacy.
Plasma collection
Plasma was isolated from whole blood specimens obtained by routine venous phlebotomy at the time of subject enrollment. For cancer subjects, whole blood was collected in K3EDTA tubes (Sarstedt, Nümbrecht, Germany) with isolation of plasma within 4 h of phlebotomy by centrifugation at 1,500g for 10 min at RT, followed by transfer of the plasma layer to a new tube for centrifugation at 3,000g for 10 min at RT, with plasma aliquots used for isolation of cell-free DNA (cfDNA) or stored at −80°C.
For non-cancer subjects, whole blood was collected in Cell-Free DNA BCT® tubes according to the manufacturer’s protocol (Streck, La Vista, NE) (https://www.streck.com/collection/cell-free-dna-bct/). Tubes were maintained at 15 °C to 25 °C with plasma separation performed within 24 h of phlebotomy by centrifugation of whole blood at 1600 x g for 10 min at RT, followed by transfer of the plasma layer to a new tube for centrifugation at 16,000 x g for 10 min. Plasma was aliquoted for subsequent cfDNA isolation or storage at −80°C.
cfDNA isolation
cfDNA was isolated using the QIAamp Circulating Nucleic Acid Kit (QIAGEN, Germantown, MD) following the manufacturer’s protocol excepting the omission of carrier RNA during cfDNA extraction. Two milliliter plasma volumes (cancer) or four milliliter plasma volumes (non-cancer) were lysed for 30 minutes prior to collection of nucleic acids; all cfDNA eluates were collected in a volume of 60 µl buffer. All cfDNA eluates were quantified by Bioanalyzer dsDNA High Sensitivity assay (Agilent Technologies Inc, Santa Clara, CA) and Qubit dsDNA High Sensitivity Assay (Thermo Fisher Scientific, Waltham, MA) was employed to ensure the absence of contaminating high molecular weight DNA emanating from white blood cell lysis.
5-hydroxymethyl Cytosine (5hmC) assay enrichment
Sequencing library preparation and 5hmC enrichment was performed as described previously (Song et al). cfDNA was normalized to 10 ng total input for each assay and ligated to sequencing adapters. 5hmC bases were biotinylated via a two-step chemistry and subsequently enriched by binding to Dynabeads M270 Streptavidin (Thermo Fisher Scientific, Waltham, MA). All libraries were quantified by Bioanalyzer dsDNA High Sensitivity assay (Agilent Technologies Inc, Santa Clara, CA) and Qubit dsDNA High Sensitivity Assay (Thermo Fisher Scientific, Waltham, MA) and normalized in preparation for sequencing.
DNA sequencing and alignment
DNA sequencing was performed according to manufacturer’s recommendations with 75 base-pair, paired-end sequencing using a NextSeq550 instrument with version 2 reagent chemistry (Illumina, San Diego, CA). Twenty four libraries were sequenced per flowcell and raw data processing and demultiplexing was performed using the Illumina BaseSpace Sequence Hub to generate sample-specific FASTQ output. Sequencing reads were aligned to the hg19 reference genome using BWA-MEM with default parameters39.
Peak Detection
BWA-MEM read alignments were employed to identified regions or peaks of dense read accumulation that mark the location of a hydroxymethylated cytosine residue in a CpG content. Prior to identified peaks BAM files containing the locations of aligned reads were filtered for poorly mapped (MAPQ < 30) and not properly paired reads. 5hmC peak calling was carried out using MACS2 (https://github.com/taoliu/MACS) with a p-value cut off = 1.00e-5. Identified 5hmC peaks residing in “blacklist regions” as defined elsewhere (https://sites.google.com/site/anshulkundaje/projects/blacklists) and read date on chromosomes X, Y and mitochondrial genome were also removed. Computation of genomic feature enrichment overlap 5hmC peaks were performed using the HOMER software (http://homer.ucsd.edu/homer/) with default parameters.
Chromatin modifications (H3K4me1, H3K4me3 and H3K27ac) were identified in histone maps of the pancreatic cancer cell line PANC-1 and were downloaded from ENCODE ChIP-Seq repository (https://genome.ucsc.edu/encode/dataMatrix/encodeChipMatrixHuman.html Determination of enrichment were calculated via Odds ratio using the Fisher Exact Test via the program bedtools fisher. For comparisons between PDAC and non-cancer, the Wilcoxon test was used, and for across stages comparison, the Kruskal-Wallis Test was employed.
Differential Representation Analysis
For the purpose of reliably identifying gene bodies with differential representation between the PDAC and the non-cancer groups, we closely followed the RNA-Seq workflow outlined in Law et al. 201640, including much of the preliminary QC steps. In brief, the analysis includes data pre-processing by adopting the following workflow: (i) transforming the data from raw counts to log2(counts per million), (ii) removing genes that are weakly represented, (iii) normalizing the gene representation distributions, and (iv) performing unsupervised clustering of samples. To accomplish differential representation analysis, we applied the following steps: (i) creating a design matrix and PDAC vs non-cancer cohorts, (ii) removing heteroscedascity from the data, fitting the linear models for the comparison of interest, PDAC vs non-cancer, (iv) examining the number of differentially represented genes.
In most of these analysis steps the default settings were used when appropriate. To remove weakly represented genes, we excluded genes that did not have greater than 3 counts per million reads in at least 20 samples. This filter excludes roughly 12% of the genes. For the identification of the significantly differentially represented regions we used the method of Benjamini and Hochberg41 to obtain p-values adjusted for multiple comparisons. In this report we use adjusted p-value and false discovery rate (FDR) interchangeably.
Predictive Modelling
For the purpose of assessing the feasibility of building classifiers that can discriminate between PDAC and non-cancer samples based on the 5hmC representation of gene bodies, we evaluated to performance of two forms of regularized logistic regression models commonly used in the classification context, where the number of examples are few and the number of features large; the lasso and the elastic net. See Friedman et al. (2010)31 for a description of the general elastic net precedure of which the lasso is a special case. Software implementation of these methods can be found at https://cran.r-project.org/web/packages/glmnet/index.html. Weakly represented genes were excluded from analysis as described in the section on Differential Representation Analysis.
All training and fitting was done on 75% of the samples selected at random in a balanced way to keep the ratio of the number of PDAC to non-cancer samples similar in both the training and testing subsets. Before any fitting, genes were filtered to include the 65% most variable genes for the model fitting task. The filter was designed using the training samples only and was done in a way to ensure that genes of all levels of representation were included.
Both regularization methods assessed, the lasso and the elastic net, require specifying hyper- parameters which control the level of regularization used in the fit. Hyper-parameters were selected based on out-of-fold performance on 30 repetitions of 10-fold cross-validated analysis of the training data. Out-of-fold assessments are based on the samples in the left-out fold at each step of the cross-validated analysis. The out-of-fold performance of the models fitted with hyper-parameter values set at the optimal values might yield a slightly optimistic assessment of performance. The performance of these models applied to the test set should provide less biased estimate of performance, although generalizability to external datasets is not always guaranteed.
The hyper-parameter values that lead to the best out-of-fold performance were then used to fit the final models which were fitted to the entire set of samples including both training and testing subsets. The performance of these final models can thus only be evaluated based on their performance on external data sets. These do provide a sense of the generalizability of the performance observed in the local training and testing data sets.
To evaluate the effect of feature selection on prediction performance, we repeated the training and evaluation task based on a filtered set of genes that included genes found to be significantly differentially represented, having a 1.5 fold differential 5hmC representation, and a level of representation exceeding the median level (log2 CPM ≥ 4). This filter was designed based on training data statistics only.