A complete digital karyotype of the B-cell leukemia REH cell line resolved by long-read sequencing

The B-cell acute lymphoblastic leukemia (ALL) cell line REH, with the t(12;21) ETV6-RUNX1 translocation, is known to have a complex karyotype defined by a series of large-scale chromosomal rearrangements. Taken from a 15-year-old at relapse, the cell line offers a practical model for the study of high-risk pediatric B-ALL patients. In recent years, short-read DNA and RNA sequencing have emerged as a complement to analog karyotyping techniques in the resolution of structural variants in an oncological context. However, it is challenging to create a comprehensive digital karyotype of a genome with these techniques alone. Here, we explore the integration of long-read PacBio and Oxford Nanopore whole genome sequencing (WGS), IsoSeq RNA-sequencing, and short-read sequencing to create a detailed digital karyotype of the REH cell line. WGS refined the breakpoints of known aberrations and clarified the molecular traits of disrupted ALL-associated genes BTG1 and TBL1XR1, as well as the glucocorticoid receptor NR3C1. Several previously underreported structural variants were also uncovered, including deletions affecting the ALL-associated genes VPREB1 and NFATC1. Meanwhile, transcriptome sequencing identified seven fusion genes within the genomic breakpoints. Together, our extensive whole-genome investigation makes high-quality open-source data available to the leukemia genomics community. KEY POINTS A complete digital karyotype of the REH cell line was produced with short- and long-read DNA and RNA sequencing technologies. The study enabled precise identification of structural variants, and the fusion genes expressed as the result of these variants.


INTRODUCTION
REH, established in 1974 as the first B cell precursor acute lymphoblastic leukemia (ALL) cell line 1 , was derived from the peripheral blood of a 15 year-old female patient at relapse 2 . REH has a near-diploid karyotype characterized by complex structural rearrangements, including several interchromosomal translocations.
Among the features described in this cell line is a complex four-way translocation t(4;12;21;16) that results in the canonical subtype-defining ETV6-RUNX1 fusion gene 3 and glucocorticoid (GC) resistance attributed to lack of a GC receptor 4 . While this cell line is used extensively in leukemia research [5][6][7][8][9][10] , its complex karyotype has never been comprehensively characterized.
Traditional karyotyping and banding techniques can resolve structural variants to a resolution of 5Mbp, while FISH and microarray analyses are limited to approximately 150kbp 11,12 . Whole genome and transcriptome sequencing (WGTS) have become increasingly important as a comprehensive clinical test for accurately resolving genomic breakpoints of structural variants (SVs) and fusion genes in cancer genomes, providing significant improvements in diagnostic yield over analog methods [13][14][15] . Sequencing-based diagnostic approaches are predominantly based on short-read next generation sequencing (NGS) technologies, which have benefits in low cost, high accuracy, and extensively validated computational pipelines 16-18 . However, inherent challenges exist with short-read sequencing, which include difficulty in mapping short reads that arise from repetitive sequences, sequencing through regions of high GC content, and detecting complex SVs 19,20 . The "dark" regions of the human genome have been difficult to assemble and are thus poorly covered by standard short-read sequencing, preventing identification of disease-relevant SVs within these regions 21 . Long-read sequencing technologies, including those developed by Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have significantly advanced detection and analysis of the dark regions of the genome with their unique capability to read through repetitive regions with variable GC-content, and ability to resolve single-nucleotide variants (SNVs) and SVs even in the most challenging genomic regions 16,22,23 . Long reads have shown superiority compared to short reads for SV detection, especially when used in combination with multiple SV-calling softwares 24

Data Set
We generated three WGS and three RNA-seq datasets from REH cells ( Table 1). An Illumina PCR-free WGS library was generated with 1000 ng input DNA and sequenced PE-150 on a HiSeq X instrument. An Illumina short-read RNA-seq library was generated with 300 ng input total RNA and the TruSeq stranded total RNA kit, and sequenced PE-100 on a NovaSeq 6000 instrument. One Oxford Nanopore WGS library was generated from 40 µg input DNA prepared with the Ultra-Long  We also generated one RNA-seq dataset from GM12878 (B lymphoblastoid cell line) by merging four Illumina short-read RNA-seq libraries, each generated using 500 ng of input total RNA with the TruSeq Stranded Total RNA with RiboZero depletion, and sequenced on a NovaSeq 6000 instrument.

Data Analysis
SV and fusion gene detection were performed by first aligning reads to the reference genome version GRCh38, followed by analysis with SV calling software and fusion gene detection tools, respectively ( Table 1). For gene annotation, GENCODE V37 (Ensembl 103) was used.

SV detection using WGS data
For the short-read WGS data, we used Nextflow v21.10.6 26 to run the nf-core/sarek pipeline v2.7 27,28 , which handles mapping, indexing, sorting, duplicate marking and recalibration; the pipeline's TIDDIT module 29 was used for SV calling. We mapped and sorted the PacBio CCS reads using pbmm2 v1.9.0 (https://github.com/PacificBiosciences/pbmm2), which uses minimap2 30 34 . For consensus callsets, SVs were required to agree on type and strand, with breakpoints a maximum allowed distance of 1kb. For the evaluation of large-scale variants and translocations, the callsets were programmatically pre-filtered (see Supplemental Methods) and the remaining candidates were inspected manually. SV candidates were confirmed visually in IGV v2.13.0 35 , requiring the presence of split long reads and/or discordant mates in at least two of the three SV callsets for confirmation.

Fusion gene detection
We ran the nf-core/rnafusion pipeline v2.0.0 27,36 on short-read REH RNA-seq data to call and visualize putative fusion genes. This pipeline uses five different fusion tools: Arriba 37 , FusionCatcher 38 , pizzly 39 , squid 40 and STAR-fusion 41 . We also ran nf-core/rnafusion on the GM12878 short-read RNA-seq dataset in order to generate a set of non-leukemic fusion gene candidates for filtering.
Candidate fusion genes were programmatically pre-filtered (see Supplemental Methods) and the remaining candidates were inspected manually in IGV.

Karyotyping
We used G-banding to confirm the integrity of the REH cells used for sequencing, and to verify the presence of the chromosomal features described previously in literature 3

Refinement of known REH aberrations at base-pair resolution
We detected the established and expected chromosomal aberrations in each of the three SV callsets ( Table 2). These features include a deleted chromosome X, gain of chromosome 16; a 26mbp del(3)(p22.3;p14.2); a balanced translocation between chromosomes 5 and 12; and finally, a complex four-way rearrangement between chromosomes 4, 12, 21, and 16, resulting in the ETV6-RUNX1 fusion gene. We were also able to determine that the t(5;12) and the four-way translocation involve two different alleles of chromosome 12, as the two different p-arms of chromosome 12 (from p13.2 to the telomere) were found to be fused to either chromosome 5 or 21.
Combining the different SV callsets facilitated the resolution of these breakpoints down to a base-pair accuracy, thereby resolving inconsistencies in the previously documented REH karyotypes (see Supplemental Discussion and Supplemental   Table S2).
We reconstructed the molecular events of the four-way translocation as follows. The p-arm of chromosome 12 (12Mbp), with breakpoint at p13.2 in intron 5 of ETV6, was translocated to chr21q22.12, resulting in the canonical, subtype-defining fusion gene We were able to detect known aneuploidies: a deletion of chromosome X was detected in all data sets by a marked decrease in the average depth of coverage (DC), while an additional chromosome 16 was detected by a corresponding increase (Supplemental Table S3). All three analog karyotypes indicate two copies of the  Table 2). Additional details can be found in the Supplemental Discussion and Table S4.

Seven expressed fusion genes, including two in-frame fusions
We used short-and long-read RNA-sequencing data together with seven fusion gene calling softwares to detect potential fusion genes in REH. In total, the unfiltered fusion callset consisted of 11099 fusion gene candidates. After stringent filtering (see Supplemental Methods), 30 candidates remained. Additional manual examination of split and spanning reads, as well as discordant mate pairs, retained seven high-confidence fusion candidates whose breakpoints could be confirmed in IGV with supporting genomic breakpoints in the WGS data (Figure 4).
Out of the seven confirmed fusion genes, two were in-frame. First, the expected ETV6-RUNX1 fusion, resulting from the canonical subtype-defining translocation t(12;21), was detected as two splicing variants. Transcripts of both variants retained the Sterile alpha motif (SAM)/Pointed domain from ETV6, and the Runt and Runx inhibition domains from RUNX1 (Figure 4a). Five splicing variants of a second in-frame fusion gene, RUNX1-PRDM7, resulted from the t(16;21) occurring in two copies of the der (16). In the splicing variant involving exon 5 of PRDM7, the RUNX1-PRDM7 fusion retained the promoter from RUNX1 and the SSXRD motif from PRDM7 (Figure 4b). Two out-of-frame fusion genes were found to arise from the balanced t(5;12): PHAX-AC007450.2 and LRP6-SLC27A6 (Figure 4c, 4d). A fusion transcript arose from the 260kbp deletion del(12)(q21.33), which resulted in a truncated BTG1 gene, with a breakpoint in exon 2. In this fusion, the truncated BTG1 transcript was fused with an expressed non-coding region originating between two long non-coding RNA genes LINC02404 and AC090049.1. (Figure 4e) (Figure 5).

DISCUSSION
Cell lines continue to be a vital resource for functional studies. Several well-established cancer cell lines are highly rearranged, with underreported genomic complexity [49][50][51] . Despite its frequent use for functional and mechanistic studies in leukemia research [8][9][10]52 , a comprehensive analysis of the complex REH genome has never been undertaken. Herein, we used a combination of short-and long-read DNA and RNA sequencing to provide a complete digital karyotype for the REH cell line. This clarifies the ambiguity left by traditional karyotyping and resolves a number of known and novel aberrations in the REH genome, which are of potential biological importance.
GCs are a key component of ALL treatment protocols and the roles of NR3C1 and BTG1 expression and perturbation have been widely analyzed in the REH cell line 46,53,54 . Despite the many functional studies that use REH, few have addressed the genomic aberrations underlying its phenotype. The BTG1 deletion identified in the present study has been previously noted in analyses of REH using arrays 55 and PCR 46 , while NR3C1 has an established nonsense (p.Gln528Ter) mutation on one allele 56 and a deletion on the other 57 . Recurrent deletions of TBL1XR1, which encodes for nuclear hormone receptor co-repressors, have been shown to contribute to dysregulated gene expression and GC resistance in ETV6-RUNX1 positive ALL, including REH 58,59 .
In the present study we identified a previously unknown REH variant (214 kbp deletion on 22q11.22) that deletes the VPREB1 gene, which codes for part of the surrogate light chain that forms pre-BCRs. VPREB1 deletions have previously been shown to occur in up to 40% of ETV6-RUNX1 positive ALL patients at diagnosis and >70% at relapse 60 , contributing to the disordered light chain rearrangement that acts as a leukemogenic trigger 48 . The VPREB1 deletion in REH, which also encompasses part of the pseudogene PRAMENP, is adjacent to another 192 kbp deletion, affecting the chronic lymphocytic leukemia-associated gene IGLL5 61 ; the gene that lies between these two deletions, PRAME, has been identified as a potential therapeutic target for pediatric ALL 62 . Recurrent deletions at the 22q11.22 locus have been previously identified as a prognostic marker of poor outcome for certain B-ALL patients 63 , making this a genomic region warranting further research.
Likewise, the long arm of chromosome 18 was the site of recurring deletions. Of the three large-scale deletions found between 18q21-q23, one was found to affect the NFATC1 gene (132kb del on 18q23), a member of the NFAT family involved in the calcineurin signaling pathway, which has been identified as a potential therapeutic target 47 . In addition, the 26 Mbp deletion on chromosome 3 encompasses a large number of genes, including SETD2, which is associated with chemotherapy resistance 64 ; ARPP21, a frequent deletion in ETV6-RUNX1-like ALL cases 65 , and FHIT, a tumor suppressor gene that has been shown to be abnormally expressed in ALL 66 .
This highly expressed fusion occurs on the same RUNX1 allele as the canonical ETV6-RUNX1 fusion. In contrast to previous reports 67,68 , our data indicate that this fusion is in-frame, retaining the SSXRD motif of PRDM7. Because of the close sequence similarity between PRDM7 (chromosome 16) and its paralogous copy PRDM9 (chromosome 5) 69 , this fusion was misidentified as RUNX1-PRDM9 by two short-read fusion callers and a mismapping of up to 127 short RNA-seq reads.
PRDM9, which shares the SSXRD, SET and KRAB protein domains with PRDM7, has been implicated in genetic predisposition to pediatric B-ALL 70,71 , while our findings suggest that this association might extend to other members of the PRDM gene family.
In assessing the performance of the different long-and short-read technologies, we found that it was beneficial to implement an integrative process using a variety of tools and datasets due to their distinct strengths in detecting variants of different types and sizes 72 . Overall, we found that the ONT dataset was the most useful for SV discovery, with the highest sensitivity and lowest FPR; however, when adjusting for the slightly lower DC, PacBio had a comparable performance. Importantly, both long-read datasets were able to resolve variants in highly repetitive regions that were missed in the short-read data.
We found that SV detection using TIDDIT on Illumina short reads resulted in an FPR of over 98%, limiting its standalone utility in SV discovery in the absence of a matched normal control; in particular, a known problem with this method is a high rate of falsely identified translocations 32 . However, short, paired reads were useful in confirming and resolving breakpoints suggested by the long-reads; thus, given the high throughput and low cost, short reads remain a sensible choice for clinical screenings targeting specific variants. Of note, the high FPR and low sensitivity observed in this study are the statistical side effects of using a consensus method combining SV calls from several different methods; the upside to this approach is high precision 24 .
In assessing fusion detection capabilities, we found short-read RNA-seq data to be more informative than the IsoSeq dataset, largely due to the superior performance of the short-read fusion caller Arriba, which has been shown to outperform both other short-read fusion callers 73 , and long-read callers in low-coverage contexts 74 .
In summary, our findings highlight the need for comprehensive genomic analysis of commonly used and perturbed cell lines. Further, we underscored the importance of the increased informational context of long reads and how they can identify critical biologically important SVs. Long reads hold promise to more accurately capture the complexity of SVs in cancer genomes and these results add to the growing evidence for the potential value of long-read sequencing in clinical cancer genomics. Finally, we have cataloged multiple aberrations in REH that are frequently observed in high-risk ALL patients, making this cell line a highly relevant model for researchers investigating the biological underpinnings of relapse and GC resistance. 18 Figure 1. G-banding of the REH cell line, used to verify the karyotype provided by the cell line vendor. Arrows mark the chromosomes with visible aberrations, reflecting the major features of the stemline described in the DSMZ karyotype: 46 (44)(45)(46)(47)<2n>X, -X, +16, del(3)(p22), t(4;12;21;16)(q32;p13;q22;q24.3)-inv(12)(p13q22), t(5;12)(q31-q32;p12), der(16)t(16;21)(q24.3;q22)sideline with inv(5)der(5)(p15q31),+18. G-banding showed that the cells in the present study did not contain the sideline.