Forward-reverse mutation cycles between stages of cancer development

Earlier, prominent occurrences of interstitial loss-of-heterozygosities (LOHs) were found in different cancers as a type of single-nucleotide-variations (SNVs), at rates far exceeding those of the commonly investigated gain-of-heterozygosities (GOHs) type of SNVs. Herein, such co-occurrences of LOHs and GOHs were confirmed in 102 cases of four cancer types analyzed with three different next-generation sequencing platforms, comparing non-tumor, paratumor, and tumor tissues with white-blood-cell controls; and in 246 pan-cancer cases of whole-genome tumor-control pairs. Unexpectedly, large numbers of SNVs enriched with CG>TG GOHs and copy-number-variations (CNVs) proximal to these GOHs were detected in the non-tumor tissues, which were extensively reversed in paratumors showing prominent TG>CG LOHs with proximal CNVs, and less so in tumors to form forward-reverse mutation cycles. Lineage effects in the reversions, likely resulting from directional selection, supported a sequential rather than parallel mode of evolution as described in a ‘Stage Specific Populations’ model of cancer development.


Introduction
The progressive development of cancer has been investigated at the cytochemical and genetic levels [1][2][3][4][5][6] , and the existence of early premalignant stages in the process has been suggested by precancer manifestations in terms of DNA, gene expression, protein and microscopic changes [7][8][9][10][11][12][13][14] . Genomic analysis has played an increasingly important role toward delineation of mutational events 15,16 , and a recent study by us has led to the finding of massive single-base interstitial loss-of-heterozygosities (LOHs) in various types of cancers, pointing to widespread intersister chromosomal gene conversions arising from defective DNA double-strand break (DSB) repair in tumor development 17 . Thus the interplay between the gain-of-heterozygosity (GOH) and LOH types of single nucleotide variations (SNVs) requires examination along with copy number variations (CNVs) as mutational elements of cancer regarding the extent of possible reversion of forward mutations during cancer development.
In the present study, non-tumor, paratumor, primary tumor and metastatic tumor tissues in different types of solid tumors were compared with same-patient whiteblood-cell controls based on massive parallel sequencing. Somatic mutations in both directions, i.e. GOHs, LOHs, CNV gains and CNV losses were examined residue-byresidue and window-by-window in order to detect the presence of mutation reversals in the development of cancer cells and to assess their biological significance. The results obtained suggested that forward-reverse cycles of mutations along with directional selection are important determinants of stage-specific cell population profiles in cancer development.

Genotypic changes in non-tumor and paratumor tissues
White-blood-cells (B), tumor tissue (T), paratumor tissue (P) immediately adjacent to tumor, and more remote non-tumor tissue (N) were collected in twelve same-patient tetra-sample cases consisted of four breast carcinomas (BRCA), five stomach adenocarcinomas (STAD) and three hepatocellular carcinomas (LIHC) (Supplementary Table 1 Table 1). Notably, a forward-reverse mutation cycle (FR-cycle) was commonly observed in the patch diagrams in Fig. 1b: GOHs were strongly reversed by LOHs (e.g., G1 mutations reversed by L1 mutations), and LOHs strongly reversed by GOHs (e.g., L14 reversed by G12). Moreover, the LOH mutations of Mm residues 6 displayed pronounced lineage effects, favoring the formation of MM residues when the Mm residues themselves were derived from MM residues, but favored the formation of mm residues when the Mm residues were derived from mm residues (as in the LOH mutations highlighted by yellow triangles in the figure).
Notably, in Fig. 1b right panel, partition of germline Mm residues via LOH steps L13 and L14 yielded a greater MM/mm product ratio than partition via LOH steps L15 and L16, and greater still than partition via LOH steps L21 and L22, although in each instance MM products exceeded mm products (Fig. 2a). Since all these three successive partitions emanated from germline Mm residues, their diminishing MM/mm ratios could not be the consequence of lineage effects. Instead, because MM residues in the genome generally have been optimized for growth by prolonged evolution, they tend to be favored over mm residues. Accordingly, the finding of to P-stage cells, and likewise P-stage cells relative to T-stage cells, suggesting that MM was generally preferred over mm for optimized growth of cells.
When the trinucleotide-based mutational profile method 18 was employed to classify the GOHs and LOHs observed in the B-N-T-P tetra samples into the C>A, C>G, C>T, T>A, T>C and T>G groups, the results showed that C>T and T>C mutations were particularly prominent among both GOHs and LOHs, in keeping with the general expectation that transitions exceed transversions in SNVs (Fig. 1d). The C>T GOHs among the ∆NB changes displayed peak frequencies at the four NCG triplets, conforming to the 'Signature 1A' (marked by four solid arrowheads) common to 7 cancers, and likely ascribable to the contribution of spontaneous deaminations of 5methyl-cytosine at methylated CpG to form thymidine 18,19 . These deaminations would also explain the ~50% greater occurrence of C>T GOHs than T>C GOHs in the ∆NB changes. In support of this, Fig. 2b shows that although there were less CG dimers than other dimers among AluScan captured as well as whole-genome sequences ( Supplementary Fig. 1), more CG dimers underwent SNV mutations than any other dimers. In Fig. 1d . sum of Types II, III, IVa and IVb patterns) than in P-and T-stages combined (viz. Type-I). Reversals of Nstage SNVs and CNVs (viz. sum of Types IVa and IVb patterns) were common, amounting to ~70% of N-stage SNVs or ~40% of N-stage CNVs; and far more of such reversals took place in P-stage (Type-IVa) than in T-stage (Type-IVb).
When another seventeen B-N-T trio sample sets consisted of one BRCA, two LIHC and fourteen non-small cell lung cancers (NSCLC) were analyzed with respect to 8 GOH and LOH changes in the N-and T-stage cells relative to B-stage cells (Fig. 3), the results obtained showed the same regularities as the B-N-P-T tetra samples: the Bgenomes displayed much higher LOH (L5, L6) rates and GOH-m (G3, G4) rates than GOH-M (G1, G2) rates, strong lineage effects in LOH-partitions between MM and mm products (highlighted by yellow triangles), and prominent FR-cycles viz. L1 reversing G1, L4 reversing G3, G5 reversing L5, and G6 reversing L6. Mm residues between MM and mm products (highlighted by yellow triangles). 9 In Fig. 5a, the relative prominences of ∆TN GOHs, ∆TN LOHs, ∆MT GOHs and ∆MT LOHs varied among the five different cancer groups. This could arise in part from biological dissimilarities between the sequences analyzed on the different platforms on account of their varied sequence coverages of the genome. The SNV sites observed in the five groups displayed non-identical distributions among the Genic, Proximal and Distal sequence zones 22 , as well as non-identical replication timings during the cell cycle (Fig. 5b). The proportion of ∆TN GOHs that became reversed in the ∆MT changes, marked by solid segments of the GOH frequency bars in the ∆TN tiers, were highest in the AluScan group, also quite high in the WES-NSCLC-L group, modest in the WES-Non-Lung group and lowest in the WGS-Liver-M and WES-NSCLC-H groups, even though the WES-NSCLC-L, WES-Non-Lung and WES-NSCLC-H groups were all analyzed based on the WES platform 21 .

Genotypic changes in primary and metastatic tumors
The WES-NSCLC-H group was unique in its display of particularly eminent C>A GOHs. Previously, C>A transversions were linked to polycyclic aromatic hydrocarbons 23 and acrolein 24 in tobacco smoke. The 23 WES-NSCLC-H samples were derived entirely from smokers, in accord with smoking being a significant factor for their elevated C>A GOHs. However, the WES-NSCLC-L samples with much more subdued C>A GOHs included two non-smokers and four smokers, suggesting that smoking or high C>A GOHs could play a less important carcinogenic role in a minority of smokers.

Whole genome sequencing confirmed the abundance of interstitial LOHs
A total of 246 tumor-control pairs from International Cancer Genome Consortium (ICGC) collection of WGS data 25 were analyzed to yield LOH and GOH types of SNVs in each paired-samples. These included a panel of sixty-three pan-cancer cases (Pilot-63) (Fig. 6a), and twenty-two intrahepatic cholangiocarcinoma (ICC), eightysix LIHC and seventy-five NSCLC cases (Fig. 6b)

Evidence from mutational profiles for gene conversions in LOH production
For the ∆TB SNVs of Pilot-63 (Fig. 6a), the alteration-group plots of mutational profile in Fig. 6c revealed strikingly similar heights for the opposing C>T (blue) and T>C (pink) LOH bars in the right panel, but not for the opposing C>T (blue) and T>C (pink) GOH bars in the left panel, consistent with excess LOHs than GOHs. The context-group plots in Supplementary Fig. 2a show comparable rates for all pairs of opposing LOHs in contrast to the unequal rates for the opposing GOHs. The mutation-rate diagrams in Fig. 6d show the actual rates for all opposing GOH pairs and LOH pairs. These results are summarized in Fig. 6e, where over 61% of the rate ratios of opposing substitutions were greater than two for the GOHs (blue bars), but 100% were between one and two for the LOHs (striped red bars), supporting different underlining mechanisms for GOH and LOH. This divergence of the rate ratios between opposing GOHs and opposing LOHs was in accord with our proposal that the LOHs in cancer cells resulted mainly from DSB repairs through gene conversions 17 . In a DSB at a heterozygous C/T, gene conversion could yield either a C/C or T/T homozygous position at comparable rates, depending on which homologous chromatid bears the DSB. On the other hand, because GOHs depend on point mutations rather than gene conversions, this comparable-rate constraint would not apply to GOHs.

Distances between SNVs and recurrent CNVs
That the N-stage SNVs and CNVs in the B-N-P-T tetra samples both underwent active reversions, and more in P-stage than in T-stage ( Fig. 2c) suggest some form of possible correlation between these two types of mutations. This was supported by Fig.   2d which shows that the sites of C>T GOHs with NCG context occurring in the ∆NB changes, and T>C LOHs with NTG context occurring in the ∆PN changes, were located particularly close to recurrent CNVs compared to mutations with other contexts or in other stages of change, p < 10 -7 . Furthermore, these two groups of SNVs declined in old age (Fig. 2e), in resemblance to the decrease of global DNA methylation in old age 26 . The correlation between somatic CNVs with CpGe and MeMRE (Fig. 2f), the increased SNVs at CpG sites (Fig. 2b) and the high tendency of methylated CpG conversion to TpG 27 , also pointed to some SNV-CNV relationships in the CNV process merits investigation.

Frequency classes of serial CNV changes
In the B-N-P-T tetra-sample cases, the status of any CNV in the N-, P-and T-stages

AluScan sequencing
AluScans of genomic regions flanked by Alu repetitive sequences were obtained by means of inter-Alu PCR as described 17,37 , employing both Head-type and Tail-type and treated as non-tumor, or N-stage, samples in this study.

SNV calling
For the paired-end sequencing reads generated on the Illumina platform by the

Mutational profiles of SNVs
Mutational profiles of SNVs were analyzed following the procedure developed by

CNV calling and identification of recurrent CNVs
From AluScan data, the AluScanCNV software 43  from further analysis to reduce background noise introduced by less informative windows in the human genome. Only sequence windows where CNV was detected in six or more of the twelve patients were considered to harbor a recurrent CNV.

Co-localization of CNVT and CpGe/MeMRE
CpGe and MeMRE entries were downloaded from UCSC Genome Browser as

2000-bp windows, and the average densities of CNVT breakpoints and base pairs in
CpGe or MeMRE in each window were calculated. Thereupon the windows with zero CpGe or MeMRE density were removed to avoid error caused by missing data, and the remaining windows were separated into ten groups based on the percentile of CpGe or MeMRE density. Finally, the average CNVT breakpoint densities in the groups were plotted against the percentile CpGe or MeMRE density.

Mutation enrichment in genes and pathways
The results of variant analysis of AluScan data of the twelve tetra-sample cases were uploaded to BioMart of the Ensembl database to generate a list of their gene contents under R environment using the 'biomaRt' R package 44  Only those functionally annotated groups and pathways yielding Bonferroni corrected p-value, Benjamini corrected p-value as well as FDR q-value all less than 0.05 were considered statistically significant.

Statistical analysis and data visualization
Statistical analyses were performed using R software (http://www.r-project.org  Fig. 7b which was drawn using the Circos program 46 .

Disclosure of potential conflicts of interest
The authors declare no competing financial or non-financial interests.       Supplementary Table 3. Summary of SNV mutations in B-N-P-T tetra samples.

Author contributions
Supplementary Table 4. The exact residue-by-residue SNV mutations in each sample of the B-N-P-T tetra-sample cases.
Supplementary Table 5. Summary of CNV mutations in B-N-P-T tetra samples.
Supplementary Table 6. Summary of SNV mutations in B-N-T trio samples.