Multiplex generation and single cell analysis of structural variants in a mammalian genome

The functional consequences of structural variants (SVs) in mammalian genomes are challenging to study. This is due to several factors, including: 1) their numerical paucity relative to other forms of standing genetic variation such as single nucleotide variants (SNVs) and short insertions or deletions (indels); 2) the fact that a single SV can involve and potentially impact the function of more than one gene and/or cis regulatory element; and 3) the relative immaturity of methods to generate and map SVs, either randomly or in targeted fashion, in in vitro or in vivo model systems. Towards addressing these challenges, we developed Genome-Shuffle-seq, a straightforward method that enables the multiplex generation and mapping of several major forms of SVs (deletions, inversions, translocations) throughout a mammalian genome. Genome-Shuffle-seq is based on the integration of “shuffle cassettes” into the genome, wherein each shuffle cassette contains components that facilitate its site-specific recombination (SSR) with other integrated shuffle cassettes (via Cre-loxP), its mapping to a specific genomic location (via T7-mediated in vitro transcription or IVT), and its identification in single-cell RNA-seq (scRNA-seq) data (via T7-mediated in situ transcription or IST). In this proof-of-concept, we apply Genome-Shuffle-seq to induce and map thousands of genomic SVs in mouse embryonic stem cells (mESCs) in a single experiment. Induced SVs are rapidly depleted from the cellular population over time, possibly due to Cre-mediated toxicity and/or negative selection on the rearrangements themselves. Leveraging T7 IST of barcodes whose positions are already mapped, we further demonstrate that we can efficiently determine which SVs are present in each of many single-cell transcriptomes in scRNA-seq data. Finally, preliminary evidence suggests our method may be a powerful means of generating extrachromosomal circular DNAs (ecDNAs).
Looking forward, we anticipate that Genome-Shuffle-seq may be broadly useful for the systematic exploration of the functional consequences of SVs on gene expression, the chromatin landscape, and 3D nuclear architecture. We further anticipate potential uses for in vitro modeling of ecDNAs, as well as in paving the path to a minimal mammalian genome.

Fig S2. Integration of the shuffle cassette library into mESCs and its characterization. A) Schematic of the experiment to integrate shuffle cassettes into the genomes of mESCs at a high multiplicity of infection (MOI) by co-transfection with a small percentage of a helper plasmid containing the puromycin resistance gene (40). B) Copy numbers of shuffle cassettes and of the puromycin resistance gene were estimated in the bottlenecked population via quantitative PCR (qPCR) relative to two genomic targets (Tfrc, Tert). The height of each bar represents the mean, and the error bars indicate the standard deviation, of the copy number measured relative to the two genomic targets. C) Histogram of read count for each barcode pair detected in amplicon-seq data, normalized to sequencing depth, across 4 technical replicates. D) Frequency of the first 4 bp of the genomic sequence detected in IVT-seq reads in technical replicates 1 and 2 from parental cells. TTAA is the expected sequence given our use of the PiggyBac transposon.
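
The check summarized in panel D can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the genomic portion of each IVT-seq read has already been extracted as a plain string, and the function name `prefix_frequencies` and the example reads are hypothetical.

```python
from collections import Counter


def prefix_frequencies(genomic_seqs, k=4):
    """Return the fraction of sequences that start with each k-mer.

    Sequences shorter than k are skipped. Case is normalized so that
    soft-masked (lowercase) bases are counted with their uppercase form.
    """
    prefixes = [seq[:k].upper() for seq in genomic_seqs if len(seq) >= k]
    counts = Counter(prefixes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical genomic sequences adjacent to the cassette (not real data).
reads = ["TTAAGGCT", "TTAACCGA", "ttaatgca", "GATCTTAA"]
freqs = prefix_frequencies(reads)
```

A high value of `freqs["TTAA"]` would be consistent with PiggyBac-mediated insertion, as stated in the caption.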

Fig S3. Allele-specific insertion sites across all chromosomes. Insertion sites across all chromosomes for shuffle cassettes whose genomic coordinates were mapped with high confidence, colored by allele. Inconclusive indicates that there is conflicting evidence for the insertion allele, while noVariant denotes insertions that were left unassigned due to a lack of reads overlapping a known variant between the BL6 and CAST genomes.
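
The allele categories in this caption can be illustrated with a small decision rule. The sketch below is an assumption about how such labels could be derived, not the rule used in the study: it treats any read support for both alleles as conflicting, and the function name `assign_allele` and its inputs (counts of reads overlapping a known BL6/CAST variant and matching each allele) are hypothetical.

```python
def assign_allele(bl6_reads, cast_reads):
    """Label an insertion site from counts of variant-overlapping reads.

    Assumption (not from the source): a single read supporting the
    other allele is enough to call the site "Inconclusive".
    """
    if bl6_reads < 0 or cast_reads < 0:
        raise ValueError("read counts must be non-negative")
    if bl6_reads == 0 and cast_reads == 0:
        # No read overlaps a known variant, so no allele can be assigned.
        return "noVariant"
    if bl6_reads > 0 and cast_reads > 0:
        # Evidence for both alleles conflicts.
        return "Inconclusive"
    return "BL6" if bl6_reads > 0 else "CAST"


# Hypothetical counts for four insertion sites (not real data).
labels = [assign_allele(b, c) for b, c in [(5, 0), (0, 3), (2, 2), (0, 0)]]
```

A stricter rule, for example requiring a minimum read count or a minimum allelic fraction before assigning an allele, would fit the same interface.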

Fig S4. Characteristics of the complete set of rearrangements detected in bulk by amplicon-seq at 72h post-Cre transfection. A) Log2 ratio of the total number of reads that contain rearranged barcode (BC) pairs to the total number of reads that contain parental BC pairs in technical replicates of each Cre transfection sample. B) Venn diagram depicting the overlap between Cre transfection samples for the subset of SVs that are detected in both technical replicates of each sample. C) Pie chart depicting the distribution of SV types for all rearrangements detected at 72h. D) Scatter plot of rearrangement size (y-axis) vs. normalized read count (x-axis) for deletions and inversions detected at day 3. Pearson correlation is calculated between the log10 values of the two metrics. E) Median size of inversions and deletions, weighted by their read count, for both the complete set of rearrangements (left) and those shared between technical replicates for a condition (right). F) Similar to the lower part of Fig. 3D, the bar plot shows the proportion of each SV type (from the complete set of rearrangements at 72h) that is supported by at least one read in the IVT-seq data. G) Violin plots depicting the distribution of read counts for deletions, inversions and translocations for the complete set of rearrangements detected at day 3. Inset within each violin plot is a box plot of the distribution, with the median value depicted as a white line, the length of the box depicting the interquartile range, and the whiskers depicting the extent of the distribution. P-values are calculated using the non-parametric Mann-Whitney U test.
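
Two of the summary statistics named in panels A and E can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, and the weighted median is taken to be the lower weighted median (the smallest size at which the cumulative read count reaches half of the total), which is one of several common conventions.

```python
import math


def log2_ratio(rearranged_reads, parental_reads):
    """Log2 ratio of rearranged-BC-pair reads to parental-BC-pair reads."""
    if rearranged_reads <= 0 or parental_reads <= 0:
        raise ValueError("read totals must be positive")
    return math.log2(rearranged_reads / parental_reads)


def weighted_median(sizes, read_counts):
    """Lower weighted median of SV sizes, weighted by read count.

    Assumption: returns the smallest size whose cumulative weight
    reaches at least half of the total weight.
    """
    if len(sizes) != len(read_counts) or not sizes:
        raise ValueError("sizes and read_counts must be non-empty and equal in length")
    pairs = sorted(zip(sizes, read_counts))
    half = sum(read_counts) / 2
    cumulative = 0
    for size, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return size
    raise ValueError("weights must sum to a positive value")


# Hypothetical values (not real data).
ratio = log2_ratio(1, 1024)
median_size = weighted_median([100, 1000, 10000], [1, 1, 10])
```

Weighting by read count means that a few abundant rearrangements can dominate the median, which is why the weighted and unweighted medians of the same set of SVs can differ substantially.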

Fig S7. Rearrangements are not stably maintained in the post-Cre induction cell population and are not rescued by inducible Cre variants or by p53 inhibition. A) Schematic of the long-term culture experiments with Cre variants or p53 inhibition. The possible products from the 4-primer amplicon-seq strategy (also see Fig. S1) are depicted to the right. B) Total number of reads containing rearranged barcode pairs in 4-primer amplicon-seq data generated from Cre, Bxb1 or No DNA transfected cells. Bars are split based on whether the reads contain the same or different (diff) capture sequences on the same molecule. C) Number of rearranged barcode (BC) combinations detected at day 3, 5 or 7 post-transfection with Cre, CreERT2 or ERT2CreERT2. Cells were either untreated or treated with tamoxifen (0.5 μM) for 24 hours. D) Log2 ratio of total reads with rearranged BC combinations to total reads with parental BC combinations in each sample. E) Similar to panel C, the number of rearranged BC combinations detected at day 3 or 5 for Cre-transfected cells with or without p53 inhibitor (Pifithrin-α, 20 μM). F) Similar to panel D, log2 ratio of total reads with rearranged BC combinations to total reads with parental BC combinations for samples with or without p53 inhibitor (Pifithrin-α, 20 μM). Data presented in this figure are from one replicate.