Introduction

Cancer genomics has contributed to medical oncology by providing the genomic landscape and catalog of somatic mutations of human cancers. This information holds clinically actionable targets that may be used for personalized oncology and the development of new therapeutics. In addition, because the catalog of somatic mutations is a cumulative archeological record of all the mutational processes a cancer cell has experienced throughout the lifetime of a patient, it provides a rich source of information for biologists to understand the DNA damage and repair mechanisms that function in human somatic cells1.

Genomic alterations in cancer cells consist of two major categories: (1) small variations that include single-nucleotide variants and short indels, and (2) large variations known as chromosomal rearrangements or structural variations (SVs). SVs are rearrangements of large DNA segments (for example, chromosomal translocations), occasionally accompanied by DNA copy number alterations (CNAs). Although there is no rule that clearly distinguishes the “small” and “large” variation categories, researchers currently regard 50 bp as the tentative cutoff criterion2. Before the era of whole-genome sequencing (WGS), tentatively regarded as prior to 2010, the comprehensive detection of SV “breakpoints” (qualitative changes) was not feasible in cancer genomes. CNAs (quantitative changes) were relatively easier to assess using classical technologies, such as comparative genomic hybridization (CGH) and genotyping microarrays3.

Because high-throughput DNA sequencing technologies produce unbiased sequences from whole genomes within a reasonable timeframe and at a reasonable cost (i.e., < 2000 USD and < 1 week for the production of 30 × WGS data from a tumor and paired normal tissue, as of Nov 2017), many research groups, in particular, two large international consortia (The International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)), have produced large-scale WGS data sets from a variety of common and rare tumor types during the last decade4,5. Various computational algorithms and tools have been developed for the sensitive and precise detection of SVs from the WGS data (reviewed in refs. 6,7). These efforts have enabled the identification of driver SV events with remarkable functional consequences8,9,10,11,12,13,14 and mechanistic patterns of SVs, which could not be identified by classical technologies. For example, the chromothripsis15 mechanism, exhibiting a massive number of localized SV breakpoints with extensive oscillation between two DNA copy number states, was observed in cancer genome sequences, and elucidation of its molecular mechanisms followed16,17,18,19,20. However, many features remain unexplored, such as the frequency, activating conditions, and molecular machineries that are associated with this complex event. Understanding the diverse patterns of SVs observed in genome sequences is the first step to answering these questions.

Historical overview of SVs in cancers: from cytogenetics to array CGH

The first insights into SVs in cancer cells were provided by Theodor Boveri in the early twentieth century21 (Fig. 1). By examining dividing cancer cells under a microscope, he observed the presence of scrambled chromosomes associated with uncontrolled cell division. Following the discovery of the double helix DNA structure (1953)22, abnormalities of the genome were proposed to cause many human diseases. For example, the trisomy of chromosome 21 in Down syndrome (1959)23 and the recurrent translocation between chromosomes 9 and 22 (known as the Philadelphia chromosome; 1960) in chronic myelogenous leukemia (CML) were found using cytogenetics technologies24. As the resolution of fluorescence in situ hybridization (FISH) technology improved, the CML-causing BCR-ABL1 fusion gene in the Philadelphia chromosome was identified25. In parallel, quantitative FISH analyses showed that some genetic loci are markedly amplified from the normal two copies in cancer cells26,27. Further technical improvements, such as CGH28, array CGH29 and genotyping microarray30,31, enabled genome-wide screening of CNAs in the 1990s and 2000s27,28,32,33. Many cancer genes have been found to be frequently amplified (e.g., MCL1, EGFR, MYC, and ERBB2) or deleted (e.g., CDKN2A/B, RB1, and PTEN) in cancer cells34,35. Indeed, genomic instability is one of the hallmarks of cancers36,37.

Fig. 1: The history of structural variation research.

Advances in hybridization technologies increased the resolution of CNA detection to ~ 1000 base pairs. However, regardless of the resolution, these methods only approximate the genomic locations of CNAs without giving an accurate determination of the breakpoint sequences. Moreover, detection of novel copy number-neutral SVs (for example, balanced inversions and translocations) is fundamentally impossible when using array technologies. In addition, hybridization technologies are not adequate for exploring repetitive genome sequences (e.g., transposable elements)3. In the 2010s, advances in sequencing technologies finally enabled comprehensive, fine-scale SV detection4,38,39.

Patterns and mechanisms of SVs

Conventionally, cytogenetic technologies categorized SVs into four simple types: (large) deletions, duplications (amplifications), translocations, and inversions (Fig. 2). By definition, deletions and duplications are accompanied by CNAs. By contrast, inversions and translocations can be copy number neutral (balanced inversion or translocation). However, whole-genome analysis has shown that many SVs do not arise independently; instead, multiple SVs are often acquired together in a “single-hit” event and therefore form complex genome rearrangements. In this section, we introduce typical patterns of complex rearrangements found in cancers.

Fig. 2: Types of basic genomic variations.

a Small mutations: base substitutions and indels. b Simple structural variations: deletion, amplification, inversion, and interchromosomal translocation.

Chromothripsis

Chromothripsis is a pattern of complex chromosomal rearrangement that is characterized by a massive number of SV breakpoints, sometimes > 100, which are densely clustered in mostly one or a few chromosomal arms40 (Fig. 3a). The term chromothripsis means “chromosome shattering into pieces”, and the phenomenon was first described in 2011 (ref. 15). In general, chromothripsis is found in ~ 3% of all tumors and is frequently found in bone tumors (osteosarcoma and chordoma; 25%) and brain tumors (10%)15. However, an accurate description of its prevalence and cancer type specificity remains largely elusive.

Fig. 3: Patterns and proposed mechanisms of structural variations.

a Chromothripsis, showing a shattering and subsequent repair process. Telomere crisis and/or micronuclei by chromosome mis-segregation may induce chromothripsis. b Chromoplexy, showing a “closed chain” (upper) in the Circos plot. This is a multi-chromosomal translocation (lower). c MMBIR by template switching of the replication machineries. d BFB cycle, showing subtelomeric copy number increases and fold-back inversions. Proposed mechanisms are shown below. e Different patterns of SVs in BRCA1- and BRCA2-mutant breast cancers. f Patterns and formation of DMs and neochromosomes. DNA fragments can self-ligate, forming a ring structure, and are amplified (DMs). Fragments capturing centromeres and telomeres become neochromosomes. g Patterns and processes of L1 retrotransposition in the cancer genome. h HPV integration and regional rolling-circle amplification

In the typical case of chromothripsis localized in a chromosome arm, a massive number of SV elements (breakpoints) consist of similar proportions of all intrachromosomal rearrangement types (i.e., deletion type, tandem duplication type, and head-to-head and tail-to-tail inversion types). The copy number of the involved chromosome arm usually oscillates between the normal and deleted copy number states. In addition, loss-of-heterozygosity (LOH) is frequently observed in the low-DNA copy number regions. The simplest model for explaining the chromothripsis pattern is that a single “catastrophic hit” shatters one or a few chromosome arms into hundreds of DNA segments simultaneously in an ancestral cancer cell, and DNA repair pathways (presumably non-homologous end-joining) reassemble the fragments in an incorrect order and orientation15. DNA segments that are not rejoined during the repair process result in deletions. Although such a scenario explains the features of chromothripsis, the nature of the catastrophic hit is not fully understood. At present, two non-mutually exclusive mechanisms have been experimentally shown: (1) telomere crisis with telomere shortening and end-to-end chromosomal fusions followed by the formation of a chromatin bridge41, and (2) micronuclei formation due to mis-segregated chromosomes during mitosis18.
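The two hallmarks described above, oscillation between two copy number states and a roughly even mix of the four intrachromosomal junction types, can be illustrated with a minimal Python sketch. The toy data, the function names, the `min_switches` threshold, and the mapping from strand orientation to junction type are assumptions introduced here for illustration; they do not reproduce any published chromothripsis caller.

```python
from collections import Counter

# Assumed labelling convention: strand orientation of the two joined ends
# mapped to an intrachromosomal junction type.
JUNCTION_TYPES = {
    ("+", "-"): "deletion-type",
    ("-", "+"): "tandem-duplication-type",
    ("+", "+"): "head-to-head inversion",
    ("-", "-"): "tail-to-tail inversion",
}


def count_state_switches(copy_numbers):
    """Count how often the copy number changes between adjacent segments."""
    return sum(1 for a, b in zip(copy_numbers, copy_numbers[1:]) if a != b)


def oscillates_between_two_states(copy_numbers, min_switches=10):
    """True if exactly two copy number states occur and they alternate often.

    The threshold of 10 switches is an arbitrary illustrative choice.
    """
    return (len(set(copy_numbers)) == 2
            and count_state_switches(copy_numbers) >= min_switches)


def junction_type_fractions(orientations):
    """Fraction of junctions that fall into each of the four types."""
    counts = Counter(JUNCTION_TYPES[o] for o in orientations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: counts.get(name, 0) / total
            for name in JUNCTION_TYPES.values()}


# Toy data: a chromosome arm alternating between 2 copies and 1 copy,
# with junction types in equal proportions.
arm_copy_numbers = [2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2]
junctions = [("+", "-"), ("-", "+"), ("+", "+"), ("-", "-")] * 3

print(count_state_switches(arm_copy_numbers))
print(oscillates_between_two_states(arm_copy_numbers))
print(junction_type_fractions(junctions))
```

In real data, segment copy numbers and junction orientations would come from a copy number caller and an SV caller, and statistical tests would replace the fixed threshold used in this sketch.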

A telomere is the DNA sequence region at the end of a chromosome that protects the chromosome. When telomeres are shortened, the ends of chromosomes (chromatids) can be fused, forming a dicentric chromosome that fails to segregate into daughter cells during mitosis. The fused sites are then stretched during the anaphase of mitosis41, forming a chromatin bridge. Under certain circumstances, the bridge induces a partial rupture of the nuclear membrane in anaphase, and the nuclease activity of the 3′ repair exonuclease 1 (TREX1) generates extensive single-strand DNA and bridge breakage42. The SV spectra frequently observed in the daughter cells are genomic rearrangements recapitulating known features of chromothripsis combined with localized hypermutation (kataegis)42. This mechanism explains why chromothripsis frequently occurs in the vicinity of telomeric regions.

Alternatively, a physical isolation of chromosomes in aberrant nuclear structures (micronuclei) was proposed as a possible mechanism of chromothripsis18,19. Micronuclei are frequently caused by errors in cell division, such as mis-segregation of intact chromosomes during mitosis43 and acentric genome fragments from abnormal DNA replication/repair processes19,20,44. Molecular processes in micronuclei are known to be error prone; thus, isolated genetic materials are massively broken into pieces and reassembled18,19,45. The rejoined DNA fragments, showing chromothripsis-like features, can be fixed in a daughter cell.

Chromoplexy

Chromoplexy is another pattern of complex rearrangements that has many interdependent SV breakpoints (mostly interchromosomal translocations) but usually fewer than chromothripsis. This phenomenon was identified in prostate cancer genomes46. Chromoplexy mechanisms frequently disrupt tumor suppressor genes (e.g., PTEN, TP53, and CHEK2) and activate oncogenes by the formation of fusion genes (e.g., TMPRSS2-ERG) in this cancer type. The prevalence in prostate cancer is ~ 90%, but chromoplexy has not yet been explored as thoroughly in other cancer types. Conceptually, chromoplexy is an extended version of balanced translocation that reshuffles multiple chromosomes (rather than two chromosomes, as in balanced translocations) in a new scrambled configuration (Fig. 3b). Therefore, SVs in a chromoplexy event usually involve multiple chromosomes (usually > 3), and its rearrangement pattern resembles a “closed chain”. Although small deletions can occasionally be combined in the vicinity of the breakpoints as a form of “deletion bridge”, a large fraction of SVs in a chromoplexy event is copy number neutral. Like chromothripsis, chromoplexy is readily explained by the presence of a catastrophic hit that produces multiple DNA double-strand breaks (DSBs). Unlike chromothripsis, multiple DSBs in chromoplexy are not confined to a chromosome arm but are rather distributed across many chromosomes46.

Although the phenomenon is found in many common cancers (including prostate cancers, non-small cell lung cancers, head and neck cancers, and melanomas46) and rare solid cancers47, the molecular basis of the catastrophic hit is unclear. The genome-wide distribution of DSBs in a chromoplexy event is not random but is enriched in actively transcribed and open chromatin regions48,49,50. This suggests that a nuclear transcription hub wherein many co-regulated genomic regions are spatially aggregated is fragmented by the catastrophic blow in chromoplexy46.
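The “closed chain” pattern of chromoplexy can be illustrated by treating each translocation as a link between two chromosomes and asking whether the links form a single cycle. The following Python sketch is a deliberate simplification at the chromosome level: the function name `is_closed_chain`, the requirement of at least three chromosomes, and the toy junction lists are assumptions made for illustration, and deletion bridges and exact breakpoint positions are ignored.

```python
from collections import defaultdict


def is_closed_chain(translocations):
    """Return True if the junctions form one closed chain.

    Each junction is a pair of chromosome names. In this simplified model a
    closed chain is a single cycle over three or more chromosomes: every
    chromosome takes part in exactly two junctions, and all chromosomes are
    connected to each other.
    """
    neighbours = defaultdict(list)
    for chrom_a, chrom_b in translocations:
        if chrom_a == chrom_b:
            return False  # intrachromosomal junctions are out of scope here
        neighbours[chrom_a].append(chrom_b)
        neighbours[chrom_b].append(chrom_a)

    if len(neighbours) < 3:
        return False
    if any(len(partners) != 2 for partners in neighbours.values()):
        return False

    # Walk the graph from one chromosome; a single cycle reaches all of them.
    start = next(iter(neighbours))
    seen = {start}
    stack = [start]
    while stack:
        current = stack.pop()
        for partner in neighbours[current]:
            if partner not in seen:
                seen.add(partner)
                stack.append(partner)
    return len(seen) == len(neighbours)


# Toy examples: a four-chromosome closed chain and an open chain.
closed = [("chr1", "chr5"), ("chr5", "chr12"),
          ("chr12", "chr17"), ("chr17", "chr1")]
open_chain = [("chr1", "chr5"), ("chr5", "chr12")]

print(is_closed_chain(closed))
print(is_closed_chain(open_chain))
```

A breakpoint-level analysis would additionally need to pair the two broken ends at each locus, which this chromosome-level sketch does not attempt.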

Microhomology-mediated break-induced replication

The basic mechanisms of chromothripsis and chromoplexy are massive “shatter-and-stitch” processes of the genome. In these mechanisms, copy number gains of DNA segments are rarely observed. Cancer genomes frequently harbor another pattern of complex rearrangements, demonstrating a massive number of interspersed copy number gains (amplifications) of one parental allele without evidence of LOH. These amplicons are directly interconnected with frequent templated insertions and common microhomologies (2–15 bp) at breakpoint junctions. These features suggest a replication-based mechanism for the acquisition of extra DNA copies, with frequent template switching of the DNA replication complex for the rearrangement (Fig. 3c). The replication-based model, termed microhomology-mediated break-induced replication (MMBIR), was initially suggested to explain the patterns of germline CNAs51,52. Presumably, translesion DNA polymerases, such as Polζ and Rev1, are responsible for MMBIR53.
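Microhomology at a breakpoint junction can be measured by comparing the reference sequences that flank the two breakpoints. The Python sketch below is a minimal illustration under stated assumptions: both breakpoints are taken on the same strand (a deletion-like junction), the four flanking sequences are supplied directly as strings, and the function names and toy sequences are hypothetical.

```python
def common_prefix_length(a, b):
    """Number of leading bases shared by two sequences."""
    length = 0
    for base_a, base_b in zip(a, b):
        if base_a != base_b:
            break
        length += 1
    return length


def common_suffix_length(a, b):
    """Number of trailing bases shared by two sequences."""
    return common_prefix_length(a[::-1], b[::-1])


def microhomology_length(left_kept, left_lost, right_lost, right_kept):
    """Length of the sequence shared by both breakpoints.

    left_kept / left_lost: reference bases before / after the first breakpoint.
    right_lost / right_kept: reference bases before / after the second one.
    The junction joins left_kept to right_kept. Bases are counted as
    microhomology when the junction could be shifted across them without
    changing the rearranged sequence.
    """
    return (common_suffix_length(left_kept, right_lost)
            + common_prefix_length(left_lost, right_kept))


def within_reported_range(length, low=2, high=15):
    """Check a length against the 2-15 bp range quoted in the text."""
    return low <= length <= high


# Toy junction: 4 shared bases before and 2 shared bases after the breakpoints.
length = microhomology_length(
    left_kept="AACCTGCA", left_lost="GGTTAA",
    right_lost="TTTTTGCA", right_kept="GGCCAT",
)
print(length, within_reported_range(length))
```

Junctions in other orientations (for example, inversions) would require reverse-complementing one side before the comparison, which is omitted here.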

The cellular conditions that induce MMBIR are not fully understood. Presumably, collapse of a replication fork due to a single-strand DNA break and/or a bulky DNA adduct in the template DNA (collectively referred to as replication stress) interferes with normal DNA replication and stimulates template switching54,55. Normally, the template switching contributes to the repair of broken replication forks using a sister chromatid. However, the process is a double-edged sword that may lead to chromosomal rearrangements when non-allelic chromosomal regions are selected as the template. A lack of Rec/RAD proteins (e.g., RAD51) due to persistent replication stress has been reported to trigger MMBIR51,56.

Breakage-fusion-bridge cycle

The breakage-fusion-bridge (BFB) cycle, first discovered by Barbara McClintock57 in 1939, is a recursive cycle in which a dicentric chromosome is generated by telomere fusion and then breaks when the two centromeres are pulled apart in anaphase (Fig. 3d). As multiple DSBs occur in random positions between the two centromeres over a few cell cycles, the BFB cycle leaves typical patterns of rearrangements, including (1) stair-like copy number increases in subtelomeric regions58 (reviewed in ref. 41) and (2) an enrichment of fold-back inversions at the breakpoints. BFB cycle-mediated SVs have been well demonstrated in a subtype of acute lymphoblastic leukemia, which exhibits intrachromosomal amplification of chromosome 21 involving RUNX1 gene alteration59,60.
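How repeated BFB cycles produce fold-back inversions and stair-like copy number gains can be shown with a toy model. In the Python sketch below, a chromatid is a list of oriented segments ordered from the centromere to the broken end; one cycle fuses it with its inverted sister and then breaks the dicentric product at a chosen position. The segment names, the explicit `break_index` argument (a real break position would be random), and all function names are assumptions made for illustration.

```python
from collections import Counter


def flip(segment):
    """Return the same segment in the opposite orientation."""
    name, strand = segment
    return (name, "-" if strand == "+" else "+")


def bfb_cycle(chromatid, break_index):
    """Run one breakage-fusion-bridge cycle in a toy model.

    chromatid: list of (name, strand), centromere side first, broken end last.
    The sister chromatids fuse at their broken ends, giving a dicentric
    chromosome; the bridge then breaks after `break_index` segments and the
    daughter keeps the part attached to the first centromere.
    """
    inverted_sister = [flip(segment) for segment in reversed(chromatid)]
    dicentric = chromatid + inverted_sister
    if not 0 < break_index < len(dicentric):
        raise ValueError("break must fall between the two centromeres")
    return dicentric[:break_index]


def copy_numbers(chromatid):
    """Copy number of each named segment."""
    return Counter(name for name, _ in chromatid)


def count_fold_backs(chromatid):
    """Count junctions joining a segment to an inverted copy of itself."""
    return sum(1 for left, right in zip(chromatid, chromatid[1:])
               if right == flip(left))


# Toy chromosome arm with segments a-d; the telomere after d has been lost.
start = [("a", "+"), ("b", "+"), ("c", "+"), ("d", "+")]
after_one = bfb_cycle(start, break_index=6)
after_two = bfb_cycle(after_one, break_index=9)

print(after_two)
print(copy_numbers(after_two))
print(count_fold_backs(after_two))
```

With these two break positions, the copy number rises in steps toward the broken (telomeric) end, and each fusion point appears as a fold-back junction.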

Homologous recombination repair defect

Homologous recombination (HR) is a basic cellular mechanism to repair DSBs using identical or similar DNA sequences61. The basic steps of HR are (1) resection of the 5′ extremes of DSBs, (2) invasion of overhanging 3′ ends to a similar or identical DNA segment, and (3) DNA repair using one of two pathways: double-Holliday junction (reviewed in ref. 62) or synthesis-dependent strand annealing (reviewed in refs. 62,63).

A defect in HR (for example, BRCA1 and BRCA2 inactivation) causes genomic instability and increases the incidence of breast and ovarian cancers64,65. Complete inactivation of the BRCA1 and/or BRCA2 genes is found in 7% of all breast cancers66, with an enrichment in the triple-negative breast cancer subtype67. BRCA gene-mutant breast cancers have a much higher burden of genome-wide SVs than other breast cancers68. Interestingly, specific patterns of SVs are found according to the inactivated genes (Fig. 3e). For example, BRCA1-inactive cancers dominantly harbor short (< 10 kb) tandem duplications, but BRCA2-mutant cancers primarily show deletions68. Generally, BRCA1 recognizes DNA double-strand breaks along with ATM, TP53, and CHEK2 in the HR pathway. BRCA2 has an important role in the loading of RAD5169,70, which is necessary for strand invasion after 5′-end resection71.

The HR defect has been of interest in clinical research fields because HR-defective cancers are susceptible to targeted therapies (PARP inhibitors) that inhibit the base excision repair pathway. This strategy aims to trigger additional genomic instability in HR-defective cancer cells (but not in normal cells), which leads to cancer cell death72. Breast cancer patients with germline BRCA1/BRCA2 mutations respond well to PARP inhibitor therapy73,74.

Double-minute chromosome and neochromosome

Double-minute chromosomes (DMs) are aberrant genomic segments in a small circular form that are self-replicable but lack a centromere (Fig. 3f). DMs are often massively amplified in various solid and hematologic cancer cells75. DMs are detected in ~ 40% of glioblastomas, and some oncogenes, such as CDK4, MDM2, and EGFR, are frequently co-amplified in DMs76,77. DMs are important in tumorigenesis and tumor clonal evolution78,79. DM segments can be derived from DNA fragments that fail to be reassembled during chromothripsis15.

Neochromosomes are aberrant genomic segments in either circular or linear forms. Unlike DMs, neochromosomes harbor a centromeric structure and (if linear) telomeric regions (Fig. 3f). Neochromosomes are observed in ~ 3% of all cancers and are especially frequent in a subset of mesenchymal tumors, including parosteal osteosarcomas (90%), atypical lipomatous tumors (85%), dedifferentiated liposarcomas (82%), and dermatofibrosarcoma protuberans (67%)80. The formation process of neochromosomes has been elucidated in detail from liposarcoma genomes81. Like DMs, neochromosomes begin as circular DNA structures. The intermediate structures subsequently capture centromeres and are finally linearized by the acquisition of telomeres at both ends due to concurrent rearrangements, including chromothripsis- and BFB cycle-like processes.

Transposition of mobile elements

Transposable elements (TEs) are repetitive DNA sequences that occupy 45% of the human genome82. In the human genome, these elements are successful parasitic units that have important roles in genome evolution by generating SVs via “cutting-and-pasting” (DNA transposons) or “copying-and-pasting” themselves (retrotransposons)83. Most of the TEs in human genomes are now truncated and inactive in both germline and somatic lineages. For example, of the 500,000 copies of the L1 retrotransposons84,85 in the human genome, only ~ 100 L1 copies have intact open reading frames and are potentially capable of retrotransposition. In cancer cells, retrotranspositions of L1 are frequently observed (Fig. 3g)86,87 in ~ 50% of pan-cancer tissues86,88, with a high enrichment in esophageal cancers (> 90%), colon cancers (> 90%) and squamous cell lung cancers (> 90%)86,88. L1 retrotransposition is carried out by transcription, processing, reverse transcription, and novel insertion89. In some cases, hundreds of somatic retrotranspositions are observed in a cancer cell. In addition, L1 retrotranspositions occasionally carry adjacent non-repetitive DNA sequences (termed transduction), which can widely scatter genes, exons and regulatory elements across the genome86. The functional impacts of retrotranspositions in the pathogenesis of cancers are emerging90. The retrotranspositional insertion sites are enriched in the heterochromatin and hypomethylated regions91, and cancer-related genes are sometimes affected87,90,92,93,94.

Insertion of external DNA sequences

In addition to reshuffling of the nuclear genomes mentioned above, cancer cells may acquire completely new extranuclear DNA sequences from viruses, mitochondria95,96 and bacteria97,98. For example, the vast majority of uterine cervical cancers (> 95%) and a substantial fraction of head and neck cancers (12%) contain human papillomavirus (HPV) DNA sequences in their genome99. HPV genome integration is involved in direct tumorigenesis (e.g., inhibition of the p53 pathway by the HPV oncoprotein E6; ref. 100) and in the induction of genomic instability101. For example, the insertional sites of HPV are frequently amplified102 by the “loop-mediated mechanism”101 (Fig. 3h). In brief, the insertional regions tend to form a loop structure, which is susceptible to amplification during DNA replication. As a result, genomic DNA segments flanked by viral insertions can be massively amplified, occasionally by > 50 copies, which leads to upregulation of the viral oncoprotein and co-amplified adjacent gene products101.

Intracellular nuclear transfers of full or partial mitochondrial DNA sequences are also observed in cancer genomes95,103,104,105. The prevalence of this event is ~ 2% of all cancers, with an enrichment in skin, lung, and breast cancers96. However, the molecular mechanism by which mitochondrial DNA is mobilized and inserted into nuclear genomes has not been fully elucidated. Most somatic nuclear integrations of mitochondrial DNA do not occur alone but are frequently combined with other complex rearrangements, suggesting that mitochondrial DNA fragments could be used as a “filler material” or a string for weaving broken nuclear DNA segments together during DNA repair processes in somatic cells106.

Comprehensive signatures of SVs

Beyond the rearrangement patterns mentioned above, additional mechanisms presumably remain undetermined. Many ongoing efforts are being carried out to reveal comprehensive SV mutational signatures in cancer genomes. For example, > 30 mutational signatures have been revealed for point mutations from the statistical analysis of large catalogs of mutations107. Similar concepts have been applied to SVs in breast cancer genomes by clustering genome-wide SVs according to their features, such as local proximities, rearrangement class (tandem duplication, deletion, inversion, and translocation), and rearrangement size68. The analysis yielded six rearrangement signatures: (1) large (> 100 kb) tandem duplication, (2) dispersed translocation, (3) small tandem duplication, (4) clustered translocation, (5) deletion, and (6) other clustered rearrangements. Among these signatures, tandem duplications (SV signatures 1 and 3) are thought to occur due to HR deficiency108. In a similar manner, Li et al.109 identified nine SV signatures from a cohort of > 2500 cancer genomes. Using this classification, they inferred that a considerable proportion of rearrangements are caused by replication-based mechanisms.
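Before signatures can be extracted, each SV has to be assigned to a feature category so that a per-sample catalog can be counted. The Python sketch below illustrates such a catalog using the three features named above (local proximity, rearrangement class, and size). The size bin edges, the labels, and the function names are assumptions chosen for illustration and are not taken from the cited studies; the statistical step that decomposes catalogs into signatures is not shown.

```python
from bisect import bisect_right
from collections import Counter

# Assumed size bins in base pairs; the cited analyses may use different edges.
SIZE_EDGES = [10_000, 100_000, 1_000_000, 10_000_000]
SIZE_LABELS = ["<10kb", "10-100kb", "100kb-1Mb", "1-10Mb", ">10Mb"]


def sv_category(sv_class, size_bp, clustered):
    """Assign one SV to a (proximity, class, size) feature category.

    Translocations join different chromosomes, so they have no size bin.
    """
    proximity = "clustered" if clustered else "dispersed"
    if sv_class == "translocation":
        return (proximity, sv_class, None)
    size_label = SIZE_LABELS[bisect_right(SIZE_EDGES, size_bp)]
    return (proximity, sv_class, size_label)


def sv_catalog(svs):
    """Count the SVs of one sample per feature category."""
    return Counter(sv_category(*sv) for sv in svs)


# Toy sample: (rearrangement class, size in bp, part of a local cluster?)
sample = [
    ("tandem_duplication", 5_000, False),
    ("tandem_duplication", 7_000, False),
    ("tandem_duplication", 250_000, False),
    ("deletion", 3_000, False),
    ("translocation", None, True),
]

for category, count in sorted(sv_catalog(sample).items(), key=str):
    print(category, count)
```

Stacking such catalogs across many tumors gives a samples-by-categories count matrix, which is the kind of input that signature analyses operate on.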

Large-scale genome studies have revealed that SVs are not evenly distributed across the genome. The density of SVs is affected by local genome and epigenome features as well as by 3D genome conformation110,111,112. For example, local rearrangement rates are affected by replication time, transcription rate, GC content, methylation status113,114, and chromosomal fragile sites115,116, including chromosome loop anchor sites117. More systematic analyses combining genome and epigenome features from a larger cohort will likely yield a better definition of the structural variation signatures and additional mutational processes in human cancers.

Functional consequences of SVs

SVs have functional consequences in tumorigenesis and clonal evolution via at least four direct mechanisms (Table 1): (1) truncation of genes (for example, deletion or gene disruption)8,118, (2) amplification of whole genes, which raises their expression levels by the “dosage effect”, (3) fusion gene formation (for example, BCR-ABL in CML and EML4-ALK in lung cancers) and (4) relocation of gene-regulatory elements (‘enhancer hijacking’)2. The first three mechanisms are conventional, and evidence for the fourth mechanism is actively emerging. Examples of enhancer hijacking, which alters the expression of cancer genes, including IRS4, SMARCA1, and TERT, have been reported119. In breast cancers, breast tissue-specific regulatory regions are recurrently duplicated120, suggesting that positive selection pressures are strongly present. Similarly, many non-coding SVs may affect the gene expression of adjacent or distant genes by mobilizing regulatory regions or expression quantitative trait loci121,122. More specifically, an experiment has shown that rearrangement involving a topologically associating domain boundary can alter gene expression by altering the 3D genome structures that are involved in regulating gene expression123.

Table 1 Selected examples of genes altered by structural variation in cancers

Future direction and conclusion

The revolution of WGS provides an unbiased and comprehensive catalog of SVs in human cancer cells. Via a systematic, in-depth analysis of SV breakpoints, unique patterns and their underlying mutational processes are now emerging. However, current predominant WGS platforms producing short reads (< 500 bp) provide disintegrated data that are limited in the direct phasing of SV breakpoints. Despite many bioinformatic and statistical algorithms, the seamless reconstruction of final reassembled chromosomes is sometimes impossible with short read sequences, especially when the SVs are highly complex. In addition, SVs involved in highly repetitive regions (for example, telomeres, centromeres, and simple repeats) cannot be fully explored using these technologies. To this end, the combination of long read sequences (for example, from the PacBio platform) and high-resolution cytogenetics data will be helpful. Alternatively, Hi-C can be used to detect SVs in a high-throughput manner124, although the cost efficiency could be an issue. If culturing somatic cells in vitro is possible, the Strand-seq125 technique can provide fully phased data even if the subjects are not diploid. Single-cell genome sequencing is also a promising technology. For example, single-cell whole-genome sequencing could determine the exact timing of an SV per cell cycle126.

Apart from the technical limitations of DNA sequencing, the accurate molecular mechanisms of SVs are difficult to elucidate because tissue sequencing primarily reflects only the terminal results of SVs. Although we can observe DSBs under a microscope127 or with special sequencing technology (e.g., END-seq128), observing the DSBs and final rearrangements (outcome) at the sequence level in the same cell is currently impossible. Well-designed experiments and analyses are needed to bridge this gap.

Understanding the functional consequences of SVs and their association with drug efficacy is important for precision medicine. For accurate functional analyses, the “genome sequencing-only” approach is limited, and the integration of multiomics data, such as genome, transcriptome, and epigenome data, is needed. Data representing the association between gene expression and the variation in the genome are being collected in the GTEx project121. Information on the regulatory regions of the genome and the genomic regions interacting with them is actively accumulating in the ENCODE129 and FANTOM projects130,131. By integrating these data sets, we will be able to comprehensively interpret the functional consequences of genome SVs and further advance precision oncology in the near future.