ABSTRACT
The genome is traditionally viewed as a time-independent source of information, a paradigm that drives researchers to seek correlations between the presence of certain genes and a patient’s risk of disease. This analysis neglects temporal changes in the genome, which we believe to be a crucial signal for predicting an individual’s susceptibility to cancer. We hypothesize that each individual’s genome passes through an evolution channel that is controlled by hereditary, environmental and stochastic factors. (The term channel is motivated by the notion of a communication channel, introduced by Shannon1 in 1948, which started the field of Information Theory.) This channel differs among individuals, giving rise to varying predispositions to developing cancer. We introduce the concept of mutation profiles, which are computed without any comparative analysis by analyzing the short tandem repeat regions in a single healthy genome, thereby capturing information about the individual’s evolution channel. Using machine learning on data from more than 5,000 TCGA cancer patients, we demonstrate that these mutation profiles can accurately distinguish between patients with various types of cancer. For example, the pairwise validation accuracy of the classifier between PAAD (pancreas) patients and GBM (brain) patients is 93%. Our results show that healthy unaffected cells still contain a cancer-specific signal, which opens the possibility of cancer prediction from a healthy genome.
Main
The human genome has evolved over time by an interplay of mutational events. An enhanced understanding of the genome’s evolution has numerous direct and practical applications in improving healthcare, discerning ancestry, materializing DNA storage and designing synthetic biology devices for computation. Traditionally, the genome has been viewed as a time-independent source of information, and hence much of the genomic research has been focused on discovering variants that cause a certain phenotype. Linkage studies have discovered genes for Mendelian diseases such as Cystic Fibrosis2, Huntington disease3, Fragile-X syndrome4 and many others5 by investigating genetic variants across families. For more complex diseases, Genome Wide Association Studies (GWAS)6 can be used to discover large numbers of risk factors working in conjunction. The broader scope of GWAS has led to the discovery of several new genes and pathways7, but many diseases still remain unexplained. Instead of searching for disease-causing variants, we view the genome as a time-dependent signal, searching for indicators of how the genome is mutating over time. This gives rise to the following question: what are the possible ways to measure the evolution of mutations? Put differently, how can we quantify the accumulation of mutations in the genome of an individual?
Our approach for extracting time-dependent information about a person’s mutation history is to focus on the tandem repeat regions of this person’s healthy genome. We have studied two types of mutations in the genome: tandem duplications and point mutations. Tandem duplications involve the consecutive repetition of a subsequence (e.g. TCATG → TCATCATG). Point mutations, which include substitutions, insertions, and deletions, are single changes in the DNA (e.g. ACTG → ACAG). When these two processes occur in the same location, point mutations can propagate through tandem duplications, leaving a change in the repeated sequence (see Figure 1a and Methods section). This allows us to construct a likely history of tandem duplications and point mutations. Slippage events can cause regions with many tandem duplications8, which are convenient locations to observe this interaction between mutation processes. In a sense, these tandem repeat regions are a nature-given repetition error-detecting code9, where the point mutation errors in the copies store information about the history of the evolution of these regions. These repeat regions effectively characterize an evolution channel, which can shed light on the accumulation of mutations in the genome.
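As a minimal illustration of the two mutation processes described above, the following Python sketch applies a tandem duplication and a substitution to short strings, and shows how a point mutation that occurs before a duplication is carried into the new copy. The function names and the zero-based index convention are assumptions introduced here for illustration; they are not taken from the pipeline.

```python
def tandem_duplicate(seq, start, length):
    """Insert a copy of seq[start:start+length] immediately after itself."""
    segment = seq[start:start + length]
    return seq[:start + length] + segment + seq[start + length:]


def substitute(seq, position, base):
    """Replace the nucleotide at `position` with `base` (a point mutation)."""
    return seq[:position] + base + seq[position + 1:]


# The two examples given in the text.
duplicated = tandem_duplicate("TCATG", 1, 3)   # TCATG -> TCATCATG
point_mutated = substitute("ACTG", 2, "A")     # ACTG  -> ACAG

# A substitution followed by a duplication: the mutated base
# now appears in both copies of the repeated segment.
mutated_first = substitute("TCATG", 2, "G")
propagated = tandem_duplicate(mutated_first, 1, 3)
print(duplicated, point_mutated, propagated)
```

Reversing the order of the two operations would leave the substitution in only one copy, which is the kind of asymmetry that makes a likely history recoverable from the repeat region.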
Cancer Genomics
Cancer is currently the second leading cause of death worldwide10. Cancer is caused by an intricate mixture of complex factors whose inter-relations are not well understood. While the roles of environmental and hereditary factors are well accepted, recent studies suggest that two-thirds of the mutations in human cancers are caused by replication errors11.
Most GWAS on cancer risk have focused on differences between healthy (i.e., normal) and tumor DNA samples, namely in Single Nucleotide Polymorphisms (SNPs) and Copy Number Variations (CNVs). These studies have discovered tumor suppressor genes like BRCA1, BRCA2 and TP53, and oncogenes like HER2 and the RAS family12. Previous work has also shown that tumor genomes have significantly more genes with repeat instabilities, linking microsatellite instability to colorectal13 and other cancers13–18. Another recent approach identified 21 signatures for mutational processes in human cancer using the healthy genome, based on 96 substitution classifications that were defined by 6 single base substitution classes and the sequence context left and right of the mutated base19.
Unlike previous works, we aim to study cancer risk factors while using a healthy genome, without using the genome of the tumor itself, which opens the door to cancer prediction and risk assessment. To do this, we analyzed tandem repeat region data in different cancer types from The Cancer Genome Atlas (TCGA)20. We estimated the number of point mutations (m) and tandem duplications (d) in each tandem repeat region by predicting the evolutionary history of those regions21,22 (see Figure 1a). We used the aggregate of this evolution information to form what we call the mutation profile of the genome. We then used a gradient boosting algorithm to learn the association between these mutation profiles and the probability of developing specific cancers23. The association between the mutation profile and the cancer-type signifies the presence of a cancer-type “signal” in the mutation profiles of the healthy genome, which could be useful for future cancer prediction and early cancer detection.
Results
We hypothesized that different genetic mutation processes accounted for varying risks of developing cancer, and that these processes would leave detectable signals in an individual’s profile. Rigorous verification of this hypothesis would require DNA samples from cancer patients before the onset of their cancer. Such a dataset is not currently available, but blood derived DNA is accessible on The Cancer Genome Atlas (TCGA)20, and closely resembles the DNA of cancer patients before they developed the disease. From TCGA, we gathered 3874 unamplified blood derived WXS samples which spanned 12 cancers: TCGA-GBM, LUAD, LUSC, PRAD, PAAD, STAD, HNSC, BLCA, KIRC, LGG, SKCM, THCA (Table 1A (Column 2), Supplementary Files 1-12). We used microsatellites (tandem repeats with pattern lengths ≤ 10 bp) with at most 100 repeats to obtain mutation profiles (see Methods, Figure 1).
Pairwise Cancer Classifiers - Using only Blood Derived Normal Samples
Here we use 3843 unamplified blood-derived normal samples spanning 11 cancers for our analysis (see Table 1A, Supplementary Files 1-12). We did not use the blood-derived samples from TCGA-KIRC in this analysis, as we only had 31 samples for KIRC, which were not enough to construct a reliable classifier. We verified the existence of cancer-type signals within the mutation profiles of blood-derived normal samples by training cancer classifiers using xgboost23 and testing their accuracy on separate validation-set data (see Methods, Code/Software, Figure 2).
As can be seen in Figure 2, the mutation profiles of blood-derived normal DNA of GBM patients show strongly distinctive signals from the rest of the tested cancers, with classification accuracies ranging from 75% for HNSC to as high as 93% for SKCM and PAAD. A similar observation holds for both SKCM and PAAD, as they are distinguishable from all of the other cancers with more than 71% accuracy. For the other cancers (STAD, BLCA, LGG, PRAD, LUAD, THCA, LUSC, HNSC), the distinguishing signal is much weaker in many pairings. For example, LGG when compared against PRAD, BLCA, LUAD, LUSC gives pairwise accuracies of 59%, 64%, 59% and 58% respectively. Cancers with risk factors that emit different mutation profiles are easier to distinguish, resulting in more accurate classifiers. Hence, accuracy gives a notion of distance on the scale of 50% (close, indistinguishable) to 100% (far, different). The order of the cancers in the display minimizes the distances between neighboring cancers using the travelling salesman problem (TSP)24, giving a likely low dimensional projection of the features being learned by the classifiers. The accuracies and specificity/sensitivity observed in Figure 2 confirm the presence of a cancer-type signal in the blood-derived normal DNA.
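The ordering step can be sketched as follows, treating pairwise accuracy as a distance and searching for the linear ordering with the smallest total distance between neighbors. This is only a sketch under stated assumptions: the accuracy values are illustrative placeholders rather than results from Figure 2, the mapping from accuracy to distance (accuracy minus 50) and the use of an open path rather than a closed tour are choices made here, and exhaustive search stands in for whatever TSP solver was actually used.

```python
from itertools import permutations

# Placeholder pairwise accuracies in percent (illustrative, not measured).
accuracy = {
    frozenset(("GBM", "LGG")): 80,
    frozenset(("GBM", "PRAD")): 85,
    frozenset(("GBM", "LUSC")): 82,
    frozenset(("LGG", "PRAD")): 59,
    frozenset(("LGG", "LUSC")): 58,
    frozenset(("PRAD", "LUSC")): 60,
}
cancers = ["GBM", "LGG", "PRAD", "LUSC"]


def distance(a, b):
    """Map accuracy to distance: 50% -> 0 (indistinguishable), 100% -> 50."""
    return accuracy[frozenset((a, b))] - 50


def path_cost(order):
    """Total distance between neighboring cancers in a linear ordering."""
    return sum(distance(a, b) for a, b in zip(order, order[1:]))


# Exhaustive search over all orderings; feasible only for a few cancers.
best_order = min(permutations(cancers), key=path_cost)
best_cost = path_cost(best_order)
print(best_order, best_cost)
```

With these placeholder values the poorly separated cancers end up adjacent and GBM is pushed to one end, mirroring the kind of clustering described for the seriation display.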
The clustering of cancers in Figure 2 led us to define four cancer classes: Class 1 = [GBM], Class 2 = [SKCM], Class 3 = [PAAD] and Class 4 = [LUAD, LUSC, PRAD, STAD, HNSC, BLCA, LGG, THCA].
The seriation matrices in Figure 3a represent the binary classifier accuracies and sensitivity/specificity for these different classes.
Cancer Classification Profiles
To assess a patient’s propensity for developing a class of cancers, we trained a multiclassifier for the four cancer classes using gradient boosting. This classifier uses a mutation profile to predict the relative probability of each class of cancer. Figure 3b shows the mean and standard deviation of these probabilities when tested on patients from each cancer class. Classes 1, 2, and 3 all give large probabilities for their respective classes. Classes 2 and 3 give weaker signals because they are closer to Class 4 than Class 1 is (see Figure 3a). Individuals in Class 4 in the test set have similar scores for Classes 2, 3 and 4, showing that Class 4’s signal is not very distinct from Classes 2 and 3. This can again be attributed to the closeness of Class 4 to both Class 2 and Class 3 in the seriation diagram in Figure 3a. Supplementary Figure 6 gives the classification profile for Class 4 individuals when training only on Classes 1, 2 and 3 individuals. Again, Class 4 seems to imitate Classes 2 and 3, but the high standard deviation in Class 1 probability suggests Class 4 cancer patients can also have a high probability for Class 1 cancers.
Effect of Adding NAT samples on classifiers
Recent studies have shown positive associations of Solid Tissue Normal (Normal Adjacent to Tumor (NAT)) samples on TCGA with the tumor DNA of cancer patients25, 26. We added the 687 unamplified NAT samples listed in Table 1A (Column 3) to our analysis to check whether their presence is useful in discovering a stronger cancer-type signal. More precisely, we combined the 3874 blood-derived and 687 NAT samples to construct the pairwise classifiers. Here, we also covered TCGA-KIRC, as we now had 210 samples (179 NAT and 31 blood-derived), which were enough to build reliable classifiers. We did not observe any significant improvement in cancer signal detection by adding NAT samples over only using blood-derived normal samples, and found the same cancer classes that we discovered previously (see Figure 4a, Supplementary Figure 7). Further, we found that TCGA-KIRC belonged to the same class as TCGA-GBM, showing a strongly distinctive signal from the other 10 cancers (see Figure 4a, Supplementary Figure 7).
Analysis of Amplified Samples
Amplification techniques have been shown to bias tandem repeat information27, especially in TCGA data28. To control for this, we separately analyzed amplified samples. Figure 4b shows the seriation diagrams for accuracy and sensitivity/specificity of pairwise classifiers built using 525 samples amplified by MDA technology. Because of the limited data, this test only covered TCGA-GBM (brain), TCGA-OV (ovary) and TCGA-LAML (leukemia) and both the normal DNA types, i.e. blood-derived and solid tissue normal (NAT) were used (Table 1B, Supplementary Files 13-15). The high accuracy and sensitivity/specificity values in these diagrams suggest a strong cancer-type signal in the mutation profiles of the healthy DNA. Further, we also generated the classification profiles using these amplified samples for individuals with brain, ovary and leukemia cancer. Figure 4c shows the mean and standard deviation of the predicted cancer probabilities for these three populations. The highest probability cancers correspond with the cancers that the patients were diagnosed with, affirming that healthy DNA contains a cancer-type signal.
Genome analysis for the results presented in Figures 2-5 and Supplementary Figures 6-14 was done using samtools29 (see Methods, Code/Software) and the pipeline presented in Figure 1b. We also verified these results for unamplified samples by using another genome analysis tool for short tandem repeats (STRs), hipSTR30, which only detects tandem repeats with pattern lengths of at most 6 (see Supplementary Figure 15).
Driver Genes
Studies in the past have identified driver genes like TP53, BRCA1, BRCA2, etc. We considered 723 such genes, listed in Supplementary File 16, obtained from the COSMIC Cancer Gene Census31, 32.
To test whether these regions provided special information, we filtered our mutation profiles to only use tandem repeats that overlapped with driver gene regions. We conducted this experiment for the 4561 unamplified samples and 525 amplified samples separately. Figure 5 shows a comparison of the classifiers trained on these filtered mutation profiles and the classifiers trained on mutation profiles that contain all the features except those in the filtered profiles. Darker cells in this figure correspond to large differences in the accuracy of the classifiers, indicating that these signals exist outside of driver gene regions. We notice these darker cells especially for TCGA-PAAD (pancreas) and TCGA-SKCM (skin). We also see noticeable differences when TCGA-OV (ovary) is compared against TCGA-GBM (brain) and TCGA-LAML (leukemia). The driver-gene classifiers always performed worse than the classifiers trained on the rest of the genome, indicating that the signal exists both inside and outside driver gene regions.
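The filtering step described above amounts to an interval-overlap test between repeat regions and gene regions. A minimal sketch follows, assuming both are represented as (chromosome, start, end) tuples with half-open coordinates; the representation, the function names and all coordinates are hypothetical and chosen only for illustration.

```python
def overlaps(region, gene):
    """True if two (chromosome, start, end) half-open intervals intersect."""
    return (region[0] == gene[0]
            and region[1] < gene[2]
            and gene[1] < region[2])


def split_by_driver_overlap(repeat_regions, driver_genes):
    """Partition repeat regions into those inside and outside driver genes."""
    inside, outside = [], []
    for region in repeat_regions:
        if any(overlaps(region, gene) for gene in driver_genes):
            inside.append(region)
        else:
            outside.append(region)
    return inside, outside


# Hypothetical coordinates, not real gene or repeat positions.
driver_genes = [("chr1", 1000, 2000), ("chr2", 500, 900)]
repeat_regions = [
    ("chr1", 1500, 1530),   # fully inside the first gene
    ("chr1", 2500, 2540),   # same chromosome, no overlap
    ("chr2", 880, 910),     # partially overlaps the second gene
    ("chr3", 100, 130),     # chromosome without any driver gene
]
inside, outside = split_by_driver_overlap(repeat_regions, driver_genes)
print(len(inside), len(outside))
```

The two resulting groups correspond to the filtered profiles and to the profiles containing all remaining features. For hundreds of genes and many repeat regions, a sorted sweep or an interval tree would replace the quadratic scan used here.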
Discussion
Early Cancer prediction
We have shown that the mutation profiles of the blood-derived normal genomes of cancer patients contain a cancer-type signal (Figures 2 & 3). It is reasonable to assume that the mutation profiles of cancer-free patients may also contain these signals, and we can use our classifier to quantify their presence. The cancer classification profiles given by this classifier could be used to screen individuals for those who may benefit from more comprehensive and expensive cancer detection tests.
Accumulation of Mutations
Searching for information-containing features within 3 billion nucleotides is a formidable task. This has traditionally been simplified by comparing individuals to extract variants, which compresses the genome into a smaller set of features to analyze. These differences, known as SNPs and CNVs, are central to both Mendelian studies5 and GWAS6, 7, 12.
This form of genome compression loses crucial information about how the genome is changing by only considering differences in the genome’s current state. Every individual’s genome passes through a distinct evolution channel that is controlled by hereditary, environmental and stochastic factors. These evolution channels differ among the population and can give rise to different risks of disease, but we cannot easily identify these differences from the single-generation SNP and CNV analysis used in GWAS. Mendelian studies may provide insight into inter-generational processes, but do so at the cost of requiring inter-generational data, which severely limits the scope of a feature search. Even with additional data, Mendelian studies still lack the ability to detect differences in mutation processes that occur throughout one’s lifetime.
Mutation profiles are generated without any comparative analysis, reducing the data-demand. Instead, the tandem repeat regions in a single genome provide a window into its history, capturing information about the individual’s evolution channel. This ability to reconstruct a genome’s history from repeat regions is lost when studies only view differences between individuals. The use of mutation profiles expands our access to time-dependent traits which may be essential to understanding developed diseases like cancer.
Sequencing Technology Limitations
TCGA samples are obtained from Illumina platforms with a coverage depth of 30-40X. The read lengths are short, ranging between 100 and 500 bp. This poses a problem for the detection of longer tandem repeats33–35. In our analysis, we only used repeats with pattern lengths ≤ 10 bp and at most 100 copies.
Methods
WXS data
We used exome data from “blood derived normal” and “solid tissue normal” samples in the TCGA20 database, details about which are provided in the Supplementary Files 1-15. The BAM file for each sample was aligned against hg38. All the autosomes from each sample were recovered using samtools29.
Algorithms
Our algorithms are partitioned into Part A and Part B (see Fig. 1b). Part A is performed only once, whereas Part B is performed whenever cancer prediction is required. In Part A, a dataset of healthy DNA is first processed by the Benson21 and Tang et al.22 algorithms to deduce the mutation profiles. Then, these vectors are aligned by a dynamic programming algorithm to resolve missing regions. Finally, the aligned vectors are fed into a training algorithm to produce a classifier. In Part B, this classifier is applied to any individual’s genome to assess the overall probability of contracting any of the cancers in question.
Tandem Repeat Detection and Duplication History Estimation
Tandem duplications are consecutively repeated patterns caused by replication slippage events36, 37, in which a pattern is duplicated next to the original. For example, the following shows two tandem duplications of length 4, where the duplicated part is highlighted in bold. The underlined segment is the microsatellite or repeat region.
The pattern of a region is the short strand which repeats itself. The copy number d of a repeat region indicates the number of times that the pattern is repeated. For example, the pattern of the underlined repeat region in the right hand side of (1) is GTGA, and its copy number is 3.
Microsatellites are usually accompanied by various types of errors: substitutions (replacement of one nucleotide by another), deletions (omission of a nucleotide), and insertions (addition of a nucleotide). The total number of substitutions, deletions, and insertions in a repeat region is called the error number m. For example, the following shows the contamination of (1) by 1 substitution, 1 deletion, and 1 insertion.
Clearly, the copy number of (2) is 3 and its error number is 3, and hence its mutation index is (m, d) = (3, 3). In the first step of Part A we use the Benson Tandem Repeat Finder to detect repeats with consensus pattern size at most 10 and copy number at most 100. These size limitations mean we only consider regions of at most about 1000 nucleotides. The single block version of the duplication history estimation algorithm given in Tang et al.22 was then applied to each tandem repeat region to obtain the respective mutation index (m, d). The aggregation of these (m, d) values gives a vector twice the size of the number of repeat regions, which we call an individual’s mutation profile. Since TCGA data is WXS, we only calculated a unique mutation profile of an individual’s exome.
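The aggregation into a mutation profile can be sketched as below. The sketch assumes that the error number m and copy number d of each region have already been estimated by the repeat finding and history estimation steps; the dictionary layout, the function name and the sample regions are hypothetical.

```python
MAX_PATTERN_LENGTH = 10   # consensus pattern size limit stated in the text
MAX_COPY_NUMBER = 100     # copy number limit stated in the text


def mutation_profile(regions):
    """Flatten per-region (m, d) indices into one vector.

    The result has length 2 * (number of regions that pass the limits).
    """
    profile = []
    for region in regions:
        too_long = len(region["pattern"]) > MAX_PATTERN_LENGTH
        too_many = region["d"] > MAX_COPY_NUMBER
        if too_long or too_many:
            continue
        profile.extend((region["m"], region["d"]))
    return profile


# Hypothetical repeat regions with precomputed (m, d) values.
regions = [
    {"pattern": "GTGA", "m": 3, "d": 3},      # the (m, d) = (3, 3) example
    {"pattern": "CA", "m": 0, "d": 12},
    {"pattern": "A" * 15, "m": 1, "d": 4},    # dropped: pattern longer than 10
]
profile = mutation_profile(regions)
print(profile)
```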
Alignment
Following the completion of the Benson and Tang et al. algorithms, it was sometimes the case that certain repeat regions appeared in some patients and did not appear in others. In addition, minor differences were observed in the patterns of identical repeat regions in different individuals. As a result, a technical difficulty arose in handling the input to the learning algorithm. Consider the following two patients, in which the repeat regions are underlined.
The success of machine learning depends on the detection of patterns in specific positions of the feature vector, so entries which correspond to the same repeat region must also be placed in the same position for all inputs. This is clearly not the case in the above example, in which the second entries of the vectors correspond to different repeat regions.
This issue is resolved by using a dynamic programming alignment algorithm. In this algorithm, a score is computed recursively for each possible alignment, and the alignment with the lowest score is chosen. The score of an alignment is defined as the sum of normalized edit-distances1 between the patterns of all respective pairs. Further, the distance between any pattern and a “missing pattern”, denoted by ‘–’ below, is defined as 0.4. Namely, two patterns whose normalized edit distance is less than 0.4 are considered to be equal for the sake of the alignment. For example, the vectors above are aligned in the following way.
The score for the alignment (3) is de(A, A) + de(–, CGTA) + de(CG, CG) = 0 + 0.4 + 0 = 0.4, where de denotes the normalized edit distance. For comparison, the alternative alignment (4) has a score of de(A, A) + de(CG, CGTA) + de(–, CG) = 0 + 2/3 + 0.4 ≈ 1.07, and hence (3) is preferred over (4).
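The scoring rule and the dynamic program can be sketched in Python as follows. This is a minimal sketch rather than the alignment code used in the pipeline: the function names are introduced here, the two pattern lists are chosen only to be consistent with the worked scores above, and lower scores are treated as better, as in the comparison of (3) and (4).

```python
GAP_COST = 0.4  # distance between any pattern and a missing pattern


def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the average length of the two patterns."""
    return edit_distance(a, b) / ((len(a) + len(b)) / 2)


def align(p, q):
    """Return (score, pairs) for the lowest-scoring alignment of two pattern lists."""
    n, m = len(p), len(q)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + GAP_COST
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + GAP_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + normalized_distance(p[i - 1], q[j - 1]),
                cost[i - 1][j] + GAP_COST,
                cost[i][j - 1] + GAP_COST,
            )
    # Trace back through the table; None marks a missing pattern.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + normalized_distance(p[i - 1], q[j - 1]):
            pairs.append((p[i - 1], q[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + GAP_COST:
            pairs.append((p[i - 1], None))
            i -= 1
        else:
            pairs.append((None, q[j - 1]))
            j -= 1
    return cost[n][m], pairs[::-1]


score, pairs = align(["A", "CG"], ["A", "CGTA", "CG"])
alternative = normalized_distance("A", "A") + normalized_distance("CG", "CGTA") + GAP_COST
print(score, pairs, alternative)
```

Here `score` corresponds to the preferred alignment (0 + 0.4 + 0) and `alternative` to the rejected one (0 + 2/3 + 0.4).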
The mutation profile of each individual was aligned against the mutation profile of the reference genome (hg38) by using the method that is mentioned above. The repeat regions that were missing in the reference genome were omitted from these aligned mutation profiles. Further, given the aligned mutation profiles, every ‘–’ is replaced by (0, 0). This gave aligned mutation profiles of the same size that can now be used as features for the learning part described next.
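The final step, turning an aligned profile into a fixed-length feature vector, can be sketched as below. The sketch assumes the alignment has already matched each of an individual's repeat regions to an identifier of a reference region; the identifiers, the dictionary layout and the function name are hypothetical.

```python
def to_feature_vector(reference_regions, individual_indices):
    """Build a fixed-length vector ordered by the reference regions.

    Regions missing in the individual become (0, 0); regions absent
    from the reference are omitted.
    """
    features = []
    for region_id in reference_regions:
        m, d = individual_indices.get(region_id, (0, 0))
        features.extend((m, d))
    return features


# Hypothetical region identifiers and (m, d) values.
reference_regions = ["r1", "r2", "r3"]
patient = {"r1": (1, 5), "r3": (0, 7), "r9": (2, 4)}  # r9 is not in the reference
vector = to_feature_vector(reference_regions, patient)
print(vector)
```

Because every vector follows the order of the reference regions, position k refers to the same repeat region in every individual, which is what the learning step requires.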
Machine Learning
The aligned mutation profiles were used as features for the learning algorithm. Machine learning classifiers for distinguishing cancers were obtained using two approaches:
Pairwise Classifiers
We trained a binary classifier for every pair of cancer types, generating one set of pairwise classifiers for unamplified samples and another for amplified samples. The accuracy of each of these classifiers is used as a measure of the “uniqueness” of the mutation profiles associated with a certain type of cancer, and can additionally be seen as a distance measure between different types of cancer. We used the xgboost23 algorithm with default parameters except for max-depth = 2, and performed 4-fold cross-validation to build each of these pairwise classifiers.
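The 4-fold cross-validation procedure can be sketched as below. To keep the example dependency-free, a simple nearest-centroid rule stands in for the gradient-boosted trees; it is explicitly not the xgboost model used in the study. The function names, the fold assignment and the toy data are all assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=4, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def nearest_centroid_fit_predict(train_X, train_y, test_X):
    """Stand-in classifier: assign each test point to the closest class mean."""
    sums, counts = {}, {}
    for x, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(x))
        for idx, value in enumerate(x):
            acc[idx] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: [v / counts[label] for v in s] for label, s in sums.items()}

    def closest(x):
        return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return [closest(x) for x in test_X]


def cross_validated_accuracy(X, y, fit_predict, k=4, seed=0):
    """Mean held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        predictions = fit_predict([X[i] for i in train], [y[i] for i in train],
                                  [X[i] for i in held_out])
        correct = sum(p == y[i] for p, i in zip(predictions, held_out))
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Toy, perfectly separable feature vectors for two hypothetical cancer types.
X = [[5, 1]] * 8 + [[0, 0]] * 8
y = ["cancer_A"] * 8 + ["cancer_B"] * 8
mean_accuracy = cross_validated_accuracy(X, y, nearest_centroid_fit_predict)
print(mean_accuracy)
```

Replacing `nearest_centroid_fit_predict` with a function that trains a boosted-tree model of depth 2 on the aligned mutation profiles would bring the sketch closer to the setup described above.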
Multiclassifier
This was built using the xgboost ‘multi:softprob’ objective with max-depth = 2, and predicted the probabilities of all the cancers simultaneously. Again, 4-fold cross-validation was performed to avoid over-fitting.
Code/Software
The code and necessary documentation for the pipeline used is available at http://paradise.caltech.edu/~sidjain/Codes.tar.gz.
Data Availability
The BAM files for WXS samples of cancer patients used in the study were obtained from The Cancer Genome Atlas (TCGA)20. These files have controlled access and are not publicly available. However, requests to access TCGA controlled data can be made via dbGaP38 (accession code: phs000178.v1.p1). The file names for the analyzed samples are given in Supplementary Files 1-15.
Author contributions statement
S.J. analyzed the TCGA genomic data, implemented repeat finding and history estimation steps in the pipeline, helped with the machine learning step in building pairwise and multiclassifiers, and wrote the manuscript; B.M. implemented the machine learning pipeline; N.R. implemented the alignment algorithm; J.B. originated and guided the study. S.J., B.M., N.R. and J.B. participated in brainstorming of the concepts and discussions and revisions of the manuscript.
Competing Interests
The authors declare no competing interests.
Ethics Statement
Ethics approval for the use of the TCGA data was granted by the Caltech Institutional Review Board.
Acknowledgements
This work was supported in part by The Caltech Mead New Adventure Fund and a Caltech CI2 Fund. The authors would like to thank Eytan Ruppin for his valuable advice and feedback.
Footnotes
1 That is, the minimal number of insertions, deletions, and substitutions that are required to transform one pattern to the other, divided by the average length of the sequences.