An anciently diverged family of RNA binding proteins maintain correct splicing of a class of ultra-long exons through cryptic splice site repression

We previously showed that the germ cell specific nuclear protein RBMXL2 represses cryptic splicing patterns during meiosis and is required for male fertility. RBMXL2 evolved from the X-linked RBMX gene, which is silenced during meiosis due to sex chromosome inactivation. It has been unknown whether RBMXL2 provides a direct replacement for RBMX in meiosis, or whether RBMXL2 evolved to deal with the transcriptionally permissive environment of meiosis. Here we find that RBMX primarily operates as a splicing repressor in somatic cells, and specifically regulates a distinct class of exons that exceed the median human exon size. RBMX protein-RNA interactions are enriched within ultra-long exons, particularly within genes involved in genome stability, and repress the selection of cryptic splice sites that would compromise gene function. These similarities in overall function suggested that RBMXL2 might replace the function of RBMX during meiosis. To test this prediction we carried out inducible expression of RBMXL2 and the more distantly related RBMY protein in somatic cells, finding each could rescue aberrant patterns of RNA processing caused by RBMX depletion. The C-terminal disordered domain of RBMXL2 is sufficient to rescue proper splicing control after RBMX depletion. Our data indicate that RBMX and RBMXL2 have parallel roles in somatic tissues and the germline that must have been conserved for at least 200 million years of mammalian evolution. We propose RBMX family proteins are particularly important for the splicing inclusion of some ultra-long exons with increased intrinsic susceptibility to cryptic splice site selection.


Introduction.
Efficient gene expression in eukaryotes requires introns and exons to be correctly recognised by the spliceosome, the macromolecular machine that joins exons together. The spliceosome recognises short sequences called splice sites that are present at exon-intron junctions within precursor mRNAs. In higher organisms there is some flexibility in splice site recognition, as most genes produce multiple mRNAs by alternative splicing. However, aberrant "cryptic" splice sites that are weakly selected or totally ignored by the spliceosome occur frequently in the human genome and can function as decoys to interfere with gene expression (Aldalaqan et al., 2022; Sibley et al., 2016). Many cryptic splice sites are located amongst repetitive sequences within introns, where they are repressed by RNA binding proteins belonging to the hnRNP family (Attig et al., 2018).
However, cryptic splice sites can also be present within exons, and particularly can shorten long exons (by providing competing alternative splice sites) or cause formation of exitrons (internal exon sequences that are removed as if they were introns) (Marquez et al., 2015).
The testis-specific nuclear RNA binding protein RBMXL2 was recently shown to repress cryptic splice site selection during meiosis, including within some ultra-long exons of genes involved in genome stability (Ehrmann et al., 2019). RBMXL2 is only expressed within the testis (Aldalaqan et al., 2022; Ehrmann et al., 2019), raising the question of how these same cryptic splice sites controlled by RBMXL2 are repressed in other parts of the body. Suggesting a possible answer to this question, RBMXL2 is part of an anciently diverged family of RNA binding proteins. The RBMXL2 gene evolved 65 million years ago following retro-transposition of the RBMX gene from the X chromosome to an autosome (Ehrmann et al., 2019). RBMX and RBMXL2 proteins (also known as hnRNP-G and hnRNP-GT) share 73% identity at the protein level and have the same modular structure comprising an N-terminal RNA Recognition Motif (RRM) and a C-terminal disordered region containing RGG repeats (Figure 1A). RBMX and RBMXL2 are also more distantly related to a gene called RBMY on the long arm of the Y chromosome that is deleted in some infertile men (with only ~37% identity between human RBMXL2 and RBMY) (Elliott et al., 1997; Ma et al., 1993). The role of RBMY in the germline is almost totally unknown, but RBMY protein has been implicated in splicing regulation (Elliott et al., 2000; Venables et al., 2000).
The location of RBMX and RBMY on the X and Y chromosomes has important implications for their expression patterns during meiosis. The X and Y chromosomes are inactivated during meiosis within a heterochromatic structure called the XY body (Turner, 2015; Wang, 2004). Meiosis is quite a long process, and to maintain cell viability during this extended period a number of autosomal retrogenes have evolved from essential X chromosome genes. These autosomal retrogenes are actively expressed during meiosis when the X chromosome is inactive. However, it is unknown whether RBMXL2 is functionally similar enough to RBMX to provide a direct replacement during meiosis, or whether RBMXL2 has evolved differently to control meiosis-specific patterns of expression.
Suggesting somewhat different activities, RBMX was recently shown to activate exon splicing inclusion, via a mechanism involving binding to RNA through its C-terminal disordered domain facilitated by recognition of m6A residues and RNA polymerase II pausing (Liu et al., 2017;Zhou et al., 2019).
Here, we have used iCLIP and RNA-seq to analyse the binding characteristics and RNA processing targets of human RBMX. We identify a novel class of RBMX-dependent ultra-long exons connected to genome stability and transcriptional control, and find that RBMX, RBMXL2 and RBMY paralogs have closely related functional activity in repressing cryptic splice site selection. Our data reveal an ancient mechanism of gene expression control by RBMX family proteins that predates the radiation of mammals, and provide a new understanding of how ultra-long exons are properly incorporated into mRNAs.

RBMX primarily operates as a splicing repressor in somatic cells
We first set out to identify the spectrum of splicing events that are strongly controlled by RBMX across different human cell lines. We used RNA-seq from biological triplicate MDA-MB-231 cells treated with siRNA against RBMX (achieving >90% depletion, Figure 1B), followed by bioinformatics analysis using the SUPPA2 (Trincado et al., 2018) and MAJIQ (Vaquero-Garcia et al., 2023, 2016) splicing prediction tools. We identified 315 changes in RNA processing patterns in response to RBMX-depletion that were high enough amplitude to be visually confirmed on the IGV genome browser (Robinson et al., 2011) (Figure 1 -Figure supplement 1A) (Figure 1 -Source Data 1). Analysis of these splicing events within existing RNA-seq data from HEK293 cells depleted for RBMX (GSE74085) (Liu et al., 2017) revealed 148 high amplitude events that are controlled by RBMX in both HEK293 and MDA-MB-231 cells (Figure 1C). We concentrated our downstream analysis on these splicing events (Figure 1 -Source Data 1). 92% of the splicing events regulated by RBMX in human somatic cells were already annotated on Ensembl, Gencode or Refseq (Figure 1D). Strikingly, two thirds of these events are repressed by RBMX, meaning they were increasingly used in RBMX depleted cells compared to control, and include exon inclusion, alternative 5ʹ and 3ʹ splice sites, exitrons, and intron retention (Figure 1E). Furthermore, analysis of splice site strength revealed that, unlike splice sites activated by RBMX (Figure 1 -Figure supplement 1B), alternative splice sites repressed by RBMX have comparable strength to more commonly used splice sites (Figure 1F). This means that RBMX operates as a splicing repressor in human somatic cells to prevent use of 'decoy' splice sites that could disrupt normal patterns of gene expression.

Splicing control and sites of RBMX protein-RNA interaction are enriched within long internal exons
The above data indicated that RBMX has a major role in repressing cryptic splicing patterns in human somatic cells. To further correlate splicing regulation to patterns of RBMX protein-RNA interactions, we next mapped the distribution of RBMX-RNA binding sites in human somatic cells.
We engineered a stable human HEK293 cell line to express RBMX-FLAG fusion protein in response to tetracycline addition. Western blotting showed that expression of RBMX-FLAG was efficiently induced after tetracycline treatment. Importantly, levels of the induced RBMX-FLAG protein were similar to those of endogenous RBMX (Figure 2A). We next used this inducible cell line to carry out individual nucleotide resolution crosslinking and immunoprecipitation (iCLIP), a technique that produces a global picture of protein-RNA binding sites (Konig et al., 2011). After crosslinking, RBMX-FLAG protein was immunoprecipitated, then infra-red labelled RNA-protein adducts were isolated (Figure 2B) and subjected to library preparation. Following deep sequencing of biological triplicate experiments, 5 to 10 million unique reads (referred to here as iCLIP tags, representing sites of RBMX protein-RNA cross-linking) were aligned to the human genome. Each individual iCLIP replicate showed at least 70% correlation with each of the others (Figure 2 […]). […] In line with previous work on other RNA binding proteins (Van Nostrand et al., 2020), only 31% of the RNA splicing events that are controlled by RBMX in both HEK293 cells and MDA-MB-231 cells were identified by iCLIP as direct targets for RBMX binding (Figure 2 […]). Furthermore, when we plotted the fraction of RBMX iCLIP tags present near exons that contain splicing defects in the absence of RBMX, and compared it to iCLIP tags present near a set of exons unaffected by RBMX depletion, we did not detect significant enrichment of RBMX binding within exons that contain splice sites repressed by RBMX (Figure 2 -Figure supplement 1D, E). However, RBMX-responsive internal exons that did contain RBMX iCLIP tags were significantly longer than the ones that are not bound by RBMX (Figure 2D and Figure 2 -Source Data 1). We therefore compared the length of the internal exons regulated (identified by RNA-seq) and bound by RBMX (identified by iCLIP) within
protein-coding genes to all internal mRNA exons expressed in HEK293 (Liu et al., 2017). We reasoned that larger exons might have a higher chance to be bound by RBMX merely because of their large size. To minimise this effect, we did not take into account the density of RBMX binding and instead considered all exons that contained at least one iCLIP tag. Strikingly, we found that exons regulated and bound by RBMX were significantly longer than the median size of HEK293 mRNA exons, which is ~130 bp (Figure 2E, and Figure 2 -Source Data 2). This led us to test whether RBMX protein is preferentially associated with long exons. For this we plotted the distribution of internal exons bound and regulated by RBMX together with all internal exons expressed from HEK293 mRNA genes (Liu et al., 2017). We found that RBMX controls and binds two different classes of exons: the first have comparable length to the average HEK293 exon, while the second were extremely long, exceeding 1000 bp in length (Figure 2F). We defined this second class as 'ultra-long exons', which represented 18.9% of internal exons regulated by RBMX and 17.6% of the ones that contained RBMX iCLIP tags. These proportions were significantly enriched compared to the general abundance of internal ultra-long exons expressed from HEK293 cells, which was only 0.4% (Figure 2G). K-mer analyses also showed that while ultra-long exons within mRNAs are enriched in AT-rich sequences compared to shorter exons (Figure 2H), the ultra-long exons that are either regulated or bound by RBMX displayed enrichment of AG-rich sequences (Figure 2I), consistent with our identified RBMX-recognised sequences (Figure 2C). Overall, these data revealed a function for RBMX in the regulation of splicing of a particular group of ultra-long exons.

[Figure legend fragment] […] anti-FLAG pull-down from crosslinked HEK293 control cells not expressing RBMX-FLAG proteins, treated with 0.8 U/ml RNaseI. Lane 3, RBMX-FLAG pull-down crosslinked to RNA, treated with 2.5 U/ml RNaseI. Lanes 5-7, RBMX-FLAG pull-down crosslinked to RNA, treated with 0.8 U/ml RNaseI. Samples in lanes 5-7 were used for iCLIP library preparation. Lanes 2 and 4 are empty.

RBMX is important for proper splicing inclusion of full-length ultra-long exons within genes involved in DNA repair and RNA polymerase II transcription
We next wondered whether ultra-long exons regulated by RBMX (which represented 11.6% of all ultra-long internal exons from genes expressed in HEK293) had any particular feature compared to ultra-long exons that were RBMX-independent. To determine whether RBMX regulates particular classes of genes we performed Gene Ontology analysis. Both the genes bound by RBMX (detected using iCLIP, Figure 3 […]) […] Ontology analyses for genes that contained ultra-long exons bound by and dependent on RBMX for correct splicing were enriched in pathways involving cell cycle, DNA repair, and chromosome regulation, compared to all expressed genes with ultra-long exons (Figure 3A and Figure 3 -Source Data 1). These data are consistent with published observations (Adamson et al., 2012; Munschauer et al., 2018; Zheng et al., 2020) that depletion of RBMX reduces genome stability. In addition, comet assays also detected increased levels of genome instability after RBMX depletion […].
The above data indicated that RBMX-RNA binding interactions and splicing control by RBMX are particularly associated with long internal exons and enriched within classes of genes involved in genome stability. These exons included the 2.1 Kb exon 5 of the ETAA1 (Ewings Tumour Associated Antigen 1) gene, where RBMX potently represses a cryptic 3ʹ splice site that reduces the size of this exon from 2.1 Kb to 100 bp (Figure 3B and Figure 3 -Figure supplement 2A). RT-PCR analysis confirmed that RBMX depletion causes a much shorter version of ETAA1 exon 5 to prevail, particularly in MDA-MB-231 and NCI-H520 cells, but less in MCF7 cells (Figure 3C).
ETAA1 encodes a replication stress protein that accumulates at sites of DNA damage and is a component of the ATR signalling response (Bass et al., 2016). Selection of RBMX-repressed cryptic 3ʹ splice sites within ETAA1 exon 5 removes a long portion of the open reading frame (Figure 3 -Figure supplement 2B). Consistent with the penetrance of this ETAA1 splicing defect being sufficiently high to affect protein production, no ETAA1 protein was detectable 72 hours after RBMX depletion from MDA-MB-231 cells (Figure 3D).
Another ultra-long exon is found within the REV3L gene that encodes the catalytic subunit of DNA polymerase ζ that functions in translesion DNA synthesis (Martin and Wood, 2019). RBMX similarly represses a cryptic 3ʹ splice site within the ultra-long exon 13 of the REV3L gene (~4.2 Kb), which has an extremely high density of RBMX binding (Figure 3E). RT-PCR analysis confirmed a strong splicing switch to a cryptic splice site within REV3L exon 13 after RBMX was depleted from MDA-MB-231, MCF7 and NCI-H520 cells (Figure 3F).

RBMX protein-RNA interactions may insulate important splicing signals from the spliceosome
The iCLIP data suggested a model where RBMX protein binding may insulate ultra-long exons so that cryptic splice sites cannot be accessed by the spliceosome. This model predicted that RBMX binding sites would be close to important sequences used for selection of cryptic splice sites.
RBMX iCLIP tags mapped just upstream of the cryptic 3ʹ splice sites within ETAA1 exon 5 in HEK293 cells and MDA-MB-231 cells after RBMX depletion (Figure 3B), suggesting that RBMX may bind close to the branchpoints used to generate these cryptic splicing patterns. However, although usually located close to their associated 3ʹ splice sites, in some cases branchpoints can be located far upstream (Gooding et al., 2006). We tested the prediction that RBMX may sterically interfere with components of the spliceosome by directly mapping the branchpoints associated with use of these cryptic ETAA1 splice sites. To facilitate mapping of the branchpoint sequences used by the cryptic 3ʹ splice site within ETAA1 exon 5, we made a minigene by cloning the ultra-long ETAA1 exon 5 and flanking intron sequences between constitutively spliced β-globin exons (Figure 3 -Figure supplement 2D). Confirming that this minigene recapitulated cryptic splicing patterns, after transfection into HEK293 cells we could detect splicing inclusion of both the full-length and shorter (cryptic) versions of ETAA1 exon 5 mRNA isoforms using multiplex RT-PCR (Figure 3 -Figure supplement 2E). We then used an RT-PCR assay (Figure 3 -Figure Supplement 2F) to monitor the position of branchpoints just upstream of the cryptic 3ʹ splice sites of ETAA1 exon 5 (Královičová et al., 2021). Sanger sequencing of the amplification product made in this assay confirmed that the branchpoint sequences used by these cryptic 3ʹ splice sites are adjacent to RBMX binding sites (Figure 3G […]).

[Figure legend fragments] […] Adjusted p-values were produced using the Benjamini-Hochberg method. Count, number of genes in the GOBP group. GeneRatio, proportion of genes in the GOBP group relative to the full list of RBMX-regulated genes. (B) Analysis as in (A) but relative to genes regulated by RBMX in both MDA-MB-231 (this study) and HEK293 (Liu et al., 2017) […] cryptic splicing (see Figure 3G). The distance from the relative cryptic splice site and from RBMX binding site as defined by iCLIP is indicated.

RBMXL2 and RBMY can replace the activity of RBMX in somatic cells
The above data showed that although RBMX can activate splicing of some exons, it predominantly operates as a splicing repressor in human somatic cells, and moreover has a key role in repressing cryptic splicing within ultra-long exons. This pattern of RBMX activity is thus very similar to that previously reported for RBMXL2 in the germline, where RBMXL2 represses cryptic splice sites during meiosis. RBMXL2 is expressed during male meiosis when the X chromosome is silenced.
To directly mimic this switch in protein expression we constructed a HEK293 RBMXL2-FLAG tetracycline-inducible cell line, from which we depleted RBMX using siRNA (Figure 4A). Western blots showed that RBMX was successfully depleted after siRNA treatment, and the RBMXL2-FLAG protein was strongly expressed after tetracycline induction, thus simulating their relative expression patterns in meiotic cells (Figure 4B). We globally investigated patterns of splicing in these rescue experiments by performing RNA-seq analysis of each of the experimental groups. Strikingly, almost 80% of splicing defects that we could detect after RBMX-depletion were rescued by tetracycline-induced RBMXL2 (Figure 4C, and Figure 4 Source Data 1). Notably, longer exons were much more likely to be rescued by RBMXL2 than shorter exons (Figure 4D), and most of the splice events that were restored by RBMXL2-expression had nearby RBMX binding sites evidenced by iCLIP (Figure 4E). We then validated three cryptic splicing patterns by RT-PCR.
Confirming our previous finding, in the absence of tetracycline treatment depletion of RBMX led to increased selection of cryptic splice sites within ETAA1 exon 5 and REV3L exon 13, and to formation of an exitron within ATRX exon 9 (Figure 4C-E, compare lanes 7-9 with lanes 10-12).
Consistent with our RNA-seq analysis (Figure 4 -Figure supplement 1A-C), tetracycline-induction of RBMXL2 was sufficient to repress production of each of these aberrant splice isoforms (Figure 4C-E, compare lanes 1-3 with lanes 4-6). These experiments indicate that RBMXL2 is able to replace RBMX activity in regulating ultra-long exons within somatic cells.
RBMX and RBMXL2 are both more distantly related to the Y chromosome-encoded RBMY protein, with RBMX and RBMY diverging when the mammalian Y chromosome evolved (Figure 1A). RBMY has also been implicated in splicing control (Nasim et al., 2003; Venables et al., 2000), but its functions are very poorly understood. We thus tested whether RBMY might also be performing a similar function to RBMX. Employing a HEK293 cell line containing tetracycline-inducible, FLAG-tagged RBMY protein, we detected successful recovery of normal splicing patterns of the ultra-long exons within the ETAA1, REV3L and ATRX genes within RBMX-depleted cells 24 hours after tetracycline induction of RBMY (Figure 4

The disordered domain of RBMXL2 is required for efficient splicing control of ultra-long exons
The above data showed that RBMX predominantly operates as a splicing repressor in somatic cells, thus performing a functionally parallel role to RBMXL2 in the germline. Although RBMX contains an RRM domain that is the most highly conserved region compared with RBMXL2 and RBMY, splicing activation by RBMX depends on its C-terminal disordered domain that also binds to RNA (Liu et al., 2017; Moursy et al., 2014). We thus reasoned that if RBMX and RBMXL2 were performing equivalent molecular functions, rescue of splicing by RBMXL2 should be mediated by the disordered region of RBMXL2 alone, independent of the RRM (Liu et al., 2017; Moursy et al., 2014). To test this prediction, we created a new tetracycline-inducible HEK293 cell line expressing the disordered region of RBMXL2 protein and not the RRM domain (RBMXL2ΔRRM, Figure 5B).
Tetracycline induction of this RBMXL2ΔRRM protein was able to rescue siRNA mediated depletion of RBMX (Figures 5C-E), directly confirming that the C-terminal disordered domain of RBMXL2 protein is responsible for mediating cryptic splicing repression.

[…] exemplified by the RPL10 and RPL10L proteins that are 95% identical: RPL10 mutation causes meiotic arrest, and RPL10L has been shown to directly replace its X-linked ortholog RPL10 during meiosis (Jiang et al., 2017; Wang, 2004). RBMXL2 is the only other X-linked retrogene that has been shown to be essential for meiotic prophase (Ehrmann et al., 2019). Here we show that ectopic expression of RBMXL2 can compensate for lack of RBMX in somatic cells. This is consistent with a recent model suggesting that RBMXL2 directly replaces RBMX function during meiosis because of transcriptional inactivation of the X chromosome (Aldalaqan et al., 2022). This general requirement for functionally similar RBMX family proteins across somatic and germ cells further suggests that RBMX-family functions in splicing control have been required for ~200 million years, since before the divergence of separate RBMX and RBMY genes early in mammalian evolution.
The iCLIP data reported here show a high density of RBMX binding within ultra-long exons, consistent with a model in which RBMX protein binding to RNA masks sequences required for cryptic splice site selection. Such RBMX binding would block access to spliceosome components or splicing activator proteins (Figure 6). Our data show that the C-terminal disordered domain of RBMXL2 protein is sufficient to control splicing inclusion of ultra-long exons. This is exactly analogous to the mechanism of control of splicing activation by RBMX, which occurs via recognition of m6A modified RNA targets via the C-terminal disordered domain (Liu et al., 2017).
Intriguingly, global studies have shown that m6A residues are enriched within some long internal exons (Dominissini et al., 2012), where they might help facilitate RBMX protein-RNA interactions.
The C-terminal disordered region of RBMX is also reported to mediate protein-protein interactions; shorter exons that show defective splicing in RBMX-depleted cells but are not directly bound by RBMX could therefore rely on different regulatory mechanisms. RBMY, RBMX and RBMXL2 directly interact with the SR protein Tra2β (Elliott et al., 2000; Venables et al., 2000) and have opposing functions during RNA binding and splicing regulation (Nasim et al., 2003; Venables et al., 2000). Hence it is still possible that RBMX family proteins counteract recognition by SR proteins of ESEs near cryptic splice sites via a protein-protein interaction mechanism.
Extensive literature shows that RBMX is important for genome stability, including being involved in replication fork activity (Munschauer et al., 2018; Zheng et al., 2020), sensitivity to genotoxic drugs (Adamson et al., 2012) and cell proliferation (https://orcs.thebiogrid.org/Gene/27316). Interestingly, many of the ultra-long exons controlled by RBMX are within genes important for genome stability, including REV3L, ATRX and ETAA1. This makes it likely that RBMX contributes to maintaining genome stability through ensuring full-length protein expression of genes important in this process.
As an example, we show here that depletion of RBMX protein causes aberrant selection of a high amplitude cryptic splice site within ETAA1 exon 5 which prevents detectable expression of ETAA1 protein, and contributes to genome instability (Bass et al., 2016). Cancer and neurological disorders are amongst the most common human diseases associated with defective DNA damage response (Jackson and Bartek, 2009). The double role of RBMX in genome maintenance via both direct participation in the DNA damage response and splicing regulation of genome stability genes could explain why mutations of RBMX are associated with an intellectual disability syndrome (Cai et al., 2021; Shashi et al., 2015), and why RBMX has been identified as a potential tumour suppressor (Adamson et al., 2012; Elliott et al., 2019). The data reported in this paper thus have implications for understanding the links between RNA processing of unusual exons, genome stability and intellectual disability.

Ultra-long exons are intrinsically fragile as they may contain cryptic splice sites within an environment rich in Exonic Splicing Enhancers (ESEs). RBMX protein binding within ultra-long exons may directly block access of spliceosome components to cryptic splice sites, and depletion of RBMX from somatic cells activates selection of cryptic splice sites. This means a shorter version of the originally ultra-long exon is included, which fits more easily with exon definition rules normally followed for median size exons. During meiosis, lack of RBMX caused by X chromosome inactivation is compensated by expression of RBMXL2 protein.

siRNA knockdown and tetracycline induction
RBMX transient knockdown was established using two different pre-designed siRNAs targeting RBMX mRNA transcripts (hs.Ri.RBMX.13.1 and hs.Ri.RBMX.13.2, from Integrated DNA Technologies). Negative control cells were transfected with control siRNA (Integrated DNA Technologies, # 51-01-14-04). Cells were seeded onto 6-well plates and forward transfected with Lipofectamine™ RNAiMAX transfection reagent (Invitrogen, # 13778150) according to the manufacturer's instructions, using 30 pmol of siRNA for 72 h at 37°C before harvesting. For tetracycline-inducible cell lines, Flp-In HEK293 cells expressing either RBMXL2-FLAG, or RBMY-FLAG, or RBMXL2ΔRRM-FLAG genes were similarly seeded onto 6-well plates and treated with RBMX and control siRNAs for 72 h at 37°C. 24 h before harvesting, 1 µg/ml of tetracycline (Sigma-Aldrich) was added to half of the siRNA-treated samples to promote the expression of RBMXL2-FLAG and RBMY-FLAG.

RNA-seq
RNA was extracted from cells using RNeasy Plus Mini Kit (Qiagen #74134) following manufacturer's instructions and re-suspended in nuclease-free water. RNA samples were DNase treated (Invitrogen, AM1906). For siRNA treated MDA-MB-231 cells, paired-end sequencing was done initially for two samples, one of negative control and one of RBMX knock-down, using an Illumina NextSeq 500 instrument. Adapters were trimmed using trimmomatic v0.32. Three additional biological repeats of negative control and RBMX siRNA treated MDA-MB-231 cells were then sequenced using an Illumina HiSeq 2000 instrument. The base quality of raw sequencing reads was checked with FastQC (Andrews, 2010). RNA-seq reads were mapped to the human genome assembly GRCh38/hg38 using STAR v.2.4.2 (Dobin et al., 2013) and subsequently quantified with Salmon v. 0.9.1 (Patro et al., 2017) and DESeq2 v.1.16.1 (Love et al., 2014) on R v.3.5.1. All snapshots indicate merged tracks produced using samtools (Li et al., 2009) and visualised with IGV (Robinson et al., 2011). For HEK293 cells treated with either RBMX or control siRNA, either in the absence or in the presence of tetracycline, RNAs were sequenced using an Illumina NextSeq 500 instrument. Quality of the reads was checked with FastQC (Andrews, 2010).

Identification of splicing changes
Initial comparison of single individual RNA-seq samples from RBMX-depleted and control cells was carried out using MAJIQ (Vaquero-Garcia et al., 2016), which identified 596 unique local splicing variations (LSV) at a 20% dPSI minimum cut off from 505 different genes potentially regulated by RBMX. These LSVs were then manually inspected using the RNA-seq data from the second RNA sequencing of biological replicates for both RBMX-depleted and control cells, by visual analysis on the UCSC browser (Karolchik et al., 2014) to identify consistent splicing changes that depend on RBMX expression. The triplicate RNA-seq samples were further analysed for splicing variations using SUPPA2 (Trincado et al., 2018), which identified 6702 differential splicing isoforms with p-value < 0.05. Predicted splicing changes were confirmed by visual inspection of RNA-seq reads using the UCSC (Karolchik et al., 2014) and IGV (Robinson et al., 2011) genome browsers.
Identification of common splicing changes between RBMX-depleted MDA-MB-231 and HEK293 cells was done by comparing data from this study with data from GSE74085 (Liu et al., 2017). For comparative analysis, a negative set of cassette exons non-responsive to RBMX depletion was defined as those where every splice junction had an absolute dPSI of 2% or less in two of the knockdown experiments analysed.
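The two dPSI thresholds described above can be illustrated with a minimal Python sketch. It is not the MAJIQ or SUPPA2 implementation; the function names, the representation of an event as a list of per-junction dPSI values, and the reading of "two of the knockdown experiments" as "at least two" are assumptions made for illustration only.

```python
from typing import Dict, List

REGULATED_MIN_DPSI = 0.20      # 20% dPSI minimum cut-off quoted in the text
NONRESPONSIVE_MAX_DPSI = 0.02  # 2% absolute dPSI ceiling for the negative set


def dpsi(psi_knockdown: float, psi_control: float) -> float:
    """Change in percent-spliced-in (knockdown minus control), as a fraction."""
    return psi_knockdown - psi_control


def is_candidate_regulated(junction_dpsi: List[float]) -> bool:
    """True if any junction of the event changes by at least the 20% cut-off."""
    return any(abs(d) >= REGULATED_MIN_DPSI for d in junction_dpsi)


def rbmx_effect(d: float) -> str:
    """Usage that rises after RBMX depletion is interpreted as RBMX-repressed."""
    return "repressed by RBMX" if d > 0 else "activated by RBMX"


def is_non_responsive(dpsi_by_experiment: Dict[str, List[float]],
                      min_experiments: int = 2) -> bool:
    """True if every junction stays within 2% in at least `min_experiments`
    knockdown experiments (hypothetical reading of the negative-set rule)."""
    quiet = sum(
        1 for values in dpsi_by_experiment.values()
        if values and all(abs(d) <= NONRESPONSIVE_MAX_DPSI for d in values)
    )
    return quiet >= min_experiments
```

In this sketch an event is kept as a candidate when any one junction passes the cut-off, whereas the negative set requires every junction to stay quiet, which mirrors the asymmetry in the definitions above.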

iCLIP
iCLIP experiments were performed on triplicate samples in RBMX-FLAG expressing Flp-In HEK293 cells using the protocol described in (Huppertz et al., 2014). Briefly, cells were grown in 10 cm tissue culture dishes and irradiated with 400 mJ cm−2 ultraviolet-C light on ice, then lysed and sonicated using a Diagenode Bioruptor® Pico sonicator for 10 cycles of alternating 30 s on/off at low intensity. 1 mg of protein was digested with 4 U of Turbo DNase (Ambion, AM2238) and 0.28 U/ml (low) or 2.5 U/ml (high) of RNAse I (Thermo Scientific, EN0602). The digested lysates were immunoprecipitated with Protein G Dynabeads™ (Invitrogen, #10003D) and either 5 μg anti-FLAG antibody (Sigma-Aldrich, F1804) or 5 μg IgG (Santa Cruz biotechnology, sc-2025).
Subsequently a pre-adenylated adaptor L3-IR-App (Zarnegar et al., 2016) was ligated to the 3ʹ end of the RNA fragments. The captured protein-RNA complexes were visualised using an Odyssey LI-COR CLx imager scanning in both the 700 nm and 800 nm channels. The RNA bound to the proteins was purified and reverse transcribed with barcoded RT oligos complementary to the L3 adaptor. The cDNAs were purified using Agencourt AMPure XP beads (Beckman Coulter™, A63880), circularised and linearised by PCR amplification. The libraries were gel purified and sequenced on Illumina NextSeq 500. All iCLIP sequencing read analysis was performed on the iMaps webserver (imaps.goodwright.com) using the standardised iCount demultiplex and analyse workflow. Briefly, reads were demultiplexed using the experimental barcodes, UMIs (unique molecular identifiers) were used to remove PCR duplicates, and reads were mapped to the human genome sequence (version hg38/GRCh38) using STAR (Dobin et al., 2013). Crosslinked sites were identified on the iMaps platform and the iCount group analysis workflow was used to merge the replicate samples.
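The idea of using UMIs to remove PCR duplicates can be sketched in a few lines of Python. This is only an illustration of the principle, not the iCount implementation (which may, for example, tolerate UMI sequencing errors); the `Read` record and its field names are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Read(NamedTuple):
    """Hypothetical minimal record for one mapped iCLIP read."""
    chrom: str
    strand: str
    start: int  # mapped 5' position of the read (cDNA start)
    umi: str    # unique molecular identifier taken from the read barcode


def count_unique_cdnas(reads: Iterable[Read]) -> Counter:
    """Collapse reads that share position and UMI, then count cDNAs per site.

    Reads with identical (chrom, strand, start, umi) are treated as PCR
    duplicates of a single cDNA molecule and counted once.
    """
    unique = {(r.chrom, r.strand, r.start, r.umi) for r in reads}
    site_counts: Counter = Counter()
    for chrom, strand, start, _umi in unique:
        site: Tuple[str, str, int] = (chrom, strand, start)
        site_counts[site] += 1
    return site_counts
```

For example, three reads at the same position carrying UMIs AAC, AAC and GGT would contribute two unique cDNAs to that site.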
For enrichment analysis of RBMX iCLIP around cassette exons we compared the number of exons that contained iCLIP binding events that were regulated by RBMX (either repressed or activated) versus non-responsive RBMX cassette exons sets (defined above) in each of the following regions: the proximal intronic region within 300 nt upstream of the 3ʹ splice site, the proximal intronic region within 300 nt downstream of the 5ʹ splice site, and the splice site proximal exonic regions within 50 nt of the 3ʹ splice site or the 5ʹ splice site.
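The four regions compared in this enrichment analysis can be written out explicitly. The sketch below is a minimal illustration, assuming 0-based half-open genomic coordinates for the cassette exon and a hypothetical function name; it is not the code used in the study.

```python
from typing import Dict, Iterable, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def exon_flank_regions(start: int, end: int, strand: str,
                       intron_nt: int = 300, exon_nt: int = 50) -> Dict[str, Interval]:
    """Return the four splice-site-proximal regions for one cassette exon.

    The 3' splice site lies at the upstream exon boundary and the 5' splice
    site at the downstream boundary, relative to the direction of transcription.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    left_intron = (max(0, start - intron_nt), start)
    right_intron = (end, end + intron_nt)
    left_exon = (start, min(start + exon_nt, end))   # clipped for short exons
    right_exon = (max(end - exon_nt, start), end)
    if strand == "+":
        return {"intron_upstream_3ss": left_intron, "exon_at_3ss": left_exon,
                "exon_at_5ss": right_exon, "intron_downstream_5ss": right_intron}
    # On the minus strand the genomic left/right sides swap roles.
    return {"intron_upstream_3ss": right_intron, "exon_at_3ss": right_exon,
            "exon_at_5ss": left_exon, "intron_downstream_5ss": left_intron}


def has_tag(region: Interval, tag_positions: Iterable[int]) -> bool:
    """True if any crosslink position falls inside the region."""
    return any(region[0] <= p < region[1] for p in tag_positions)
```

Counting, per region, how many regulated versus non-responsive exons return `True` from `has_tag` gives the two proportions that are compared.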

K-mer enrichment analysis
K-mer motif enrichment was performed with the z-score approach using the kmer_enrichment.py script from the iCLIPlib suite of tools (https://github.com/sudlab/iCLIPlib). All transcripts for each non-overlapping protein coding gene from the Ensembl v.105 annotation were merged into a single transcript, which was used for this analysis, using cgat gtf2gtf --method=merge-transcripts (Sims et al., 2014).
Each crosslinked base from the merged replicate bam file was extended 15 nucleotides in each direction. For every hexamer, the number of times a crosslink site overlapped a hexamer start position was counted within the gene and then summed across all genes. This occurrence was also calculated across 100 randomizations of the crosslink positions within genes. The z-score was thus calculated for each hexamer as (occurrence − occurrence in randomized sequences) / standard deviation of occurrence in randomized sequences. For motif enrichment analysis within ultra-long internal exons, we compared hexamer occurrence within internal exons of 1000 nt or more from Ensembl v.105 mRNA canonical transcripts with that in internal exons of less than 1000 nt, and calculated a z-score for each hexamer. A similar analysis was done by stratifying the set of ultra-long internal exons into those with RBMX binding or splicing regulation and those with no evidence of RBMX activity.
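The z-score calculation can be sketched as follows. This is a simplified illustration on a single sequence, not the kmer_enrichment.py code: the function names are hypothetical, the randomization draws positions uniformly along the sequence, the mean of the randomized occurrences is used as the expectation, and the sample standard deviation is assumed.

```python
import random
from statistics import mean, stdev


def count_hexamers(sequence, crosslinks, flank=15, k=6):
    """Count hexamers whose start lies within `flank` nt of each crosslink."""
    counts = {}
    for xl in crosslinks:
        start = max(0, xl - flank)
        end = min(len(sequence) - k, xl + flank)
        for i in range(start, end + 1):
            kmer = sequence[i:i + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def z_score(observed, randomized):
    """(observed - mean of randomized) / sample SD of randomized."""
    sd = stdev(randomized)
    if sd == 0:
        return None  # undefined when the randomized counts never vary
    return (observed - mean(randomized)) / sd


def hexamer_z_scores(sequence, crosslinks, n_random=100, seed=0):
    """Z-score for every hexamer observed around the crosslink sites."""
    rng = random.Random(seed)
    observed = count_hexamers(sequence, crosslinks)
    randomized = []
    for _ in range(n_random):
        # Assumption: crosslinks are repositioned uniformly along the sequence.
        shuffled = [rng.randrange(len(sequence)) for _ in crosslinks]
        randomized.append(count_hexamers(sequence, shuffled))
    scores = {}
    for kmer, obs in observed.items():
        z = z_score(obs, [r.get(kmer, 0) for r in randomized])
        if z is not None:
            scores[kmer] = z
    return scores
```

In the full analysis the counts would be summed across all merged gene models before the z-score is taken, so that each hexamer receives a single genome-wide score.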

Exon size analysis
Analyses of exon sizes from RNA-seq data (Figures 2D and 4D) were performed using GraphPad Prism 9.5.0. Annotations of all human exons related to position and size were downloaded from Ensembl Genes v.105 (http://www.ensembl.org/biomart/). Selection of exons expressed in HEK293 was performed using data from control RNA-seq samples of the dataset GSE74085 (Liu et al., 2017), subsequently filtered to focus on mRNA exons using biomaRt v.2.52.0 (Durinck et al., 2005). Size of the internal mRNA exons containing RBMX-regulated splicing patterns was annotated using IGV (Robinson et al., 2011). iCLIP tags were extended to 80 nt sequences centered at the crosslinked site, and annotated within human exons using ChIPseeker v.1.32.0 (Yu et al., 2015) and Ensembl Genes v.105. iCLIP tags present in mRNA exons were filtered using biomaRt v.2.52.0 (Durinck et al., 2005). iCLIP-containing exons were listed once, independently of the number of tags or tag score, and filtered to isolate internal exons only using the annotations from Ensembl Genes v.105.
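The logic of listing each iCLIP-containing exon once and then summarising exon sizes can be sketched as below. This does not reproduce the ChIPseeker or biomaRt steps: the function names, the half-open coordinate convention, and the example exons are hypothetical, and the 1000 nt cutoff is the ultra-long threshold used elsewhere in this study.

```python
from statistics import median


def extend_tag(crosslink, width=80):
    """Return a half-open interval of `width` nt centred on the crosslink."""
    half = width // 2
    return (crosslink - half, crosslink + half)


def exons_with_tags(exons, crosslinks, width=80):
    """Return the IDs of exons overlapped by at least one extended tag.

    exons: list of (exon_id, start, end), half-open, on one chromosome/strand.
    Each exon is reported once, however many tags overlap it.
    """
    hits = set()
    for xl in crosslinks:
        tag_start, tag_end = extend_tag(xl, width)
        for exon_id, start, end in exons:
            if tag_start < end and start < tag_end:
                hits.add(exon_id)
    return hits


def summarise_sizes(exons, ultra_long_cutoff=1000):
    """Median exon size and fraction of exons longer than the cutoff."""
    sizes = [end - start for _, start, end in exons]
    n_ultra_long = sum(size > ultra_long_cutoff for size in sizes)
    return {
        "median": median(sizes),
        "fraction_ultra_long": n_ultra_long / len(sizes),
    }


exons = [("e1", 0, 100), ("e2", 500, 2000), ("e3", 5000, 5100)]
bound = exons_with_tags(exons, [50, 60, 1000])
summary = summarise_sizes(exons)
```

Using a set for the bound exons is what makes the listing independent of tag number or tag score.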

Gene Ontology analyses
Gene Ontology analyses were performed in R v.4.2.1 using GOstats v.2.62.0 (Falcon and Gentleman, 2007), except for Figure 3 - Figure supplement 1A, for which clusterProfiler::enrichGO v.4.4.4 (Yu et al., 2012) was used. Entrez annotations were obtained with biomaRt v.2.52.0 (Durinck et al., 2005). Read counts from control treated HEK293 cells (Liu et al., 2017) were used to isolate genes expressed in HEK293. Gene Ontology analyses for Figure 3A were performed for ultra-long (>1000 bp) exons bound or regulated by RBMX against all genes expressed in HEK293 that contain ultra-long exons. The Bioconductor annotation data package org.Hs.eg.db v.3.15.0 was used as background for GOBP terms. P-values were adjusted by false discovery rate using the base R stats package v.4.2.1, except for Figure 3 - Figure supplement 1A, for which the default Benjamini-Hochberg method was used while running enrichGO. Significantly enriched GOBP pathways were filtered with a p-value cut-off of 0.05. Redundant terms identified with GOstats were removed using Revigo (Supek et al., 2011) with the SimRel similarity measure against human genes, eliminating terms with a dispensability score above 0.5. The dot-plots were produced using ggplot2 v.3.3.6 (Wickham, 2016), focussing on representative terms associated with at least 5% of the initial gene list. Full GOBP lists can be found in Figure 3 - Source Data 1.
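For readers unfamiliar with false discovery rate adjustment, the sketch below implements the Benjamini-Hochberg procedure in plain Python. It is an illustration only: the analyses above were run in R, and the function name and example p-values here are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four GO terms.
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
```

Each raw p-value is scaled by the number of tests divided by its rank, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.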

Comet assay
The comet assay was performed using the Abcam Comet Assay kit (ab238544) according to the manufacturer's instructions. Briefly, U2OS cells transfected with RBMX siRNA or control siRNA were harvested after 72 hours, and 1 × 10⁵ cells were mixed with cold PBS. Cells in PBS were mixed with low melting comet agarose (1/10) and layered on glass slides pre-coated with low melting comet agarose. The slides were lysed in 1x lysis buffer (pH 10.0, Abcam Comet Assay kit) for 48 hours at 4°C, immersed in Alkaline solution (300 mM NaOH, pH>13, 1 mM EDTA) for 30 min at 4°C in the dark, and then electrophoresed in Alkaline Electrophoresis Solution (300 mM NaOH, pH>13, 1 mM EDTA) at 300 mA, 1 V/cm for 20 min. The slides were then washed in pre-chilled DI H2O for 2 min, fixed in 70% ethanol for 5 min, stained with 1x Vista Green DNA Dye (1/10000 in TE Buffer (10 mM Tris, pH 7.5, 1 mM EDTA), Abcam Comet Assay kit) for 15 min, and visualized by fluorescence microscopy using a Zeiss AxioImager (System 3). Comet quantification was performed using OpenComet (Gyori et al., 2014).
RNA extraction and cDNA synthesis for transcript isoform analysis.

Figure 1. RBMX primarily operates as a splicing repressor in human somatic cells. (A) Schematic structure of RBMX family proteins (left side, cladogram) and amino acid similarity of each domain between RBMX protein and two other members of this family, RBMXL2 and RBMY. RRM, RNA recognition motif; CD, central domain important for recognition of nascent transcripts and nuclear localisation; CTD, C-terminal domain, involved in RNA binding (Elliott et al., 2019). (B) Western blot analysis shows efficient siRNA-mediated depletion of RBMX from MDA-MB-231 cells. (C) Pie chart showing the percentages of events controlled by RBMX in both MDA-MB-231 (this study) and HEK293 (Liu et al., 2017) cells. (D) Pie chart showing the percentages of events controlled by RBMX in both MDA-MB-231 and HEK293 cells that have been previously annotated (Refseq, Ensembl, Gencode), and those that are novel to this study. (E) Bar chart showing the different types of alternative splicing events controlled by RBMX protein in both HEK293 and MDA-MB-231 cells, summarising the proportion of splicing events that are activated by RBMX versus those that are repressed. (F) Splice site score analyses for 5ʹ (left panel) and 3ʹ (right panel) splice

Figure 1 - Figure supplement 1. (A) Bar chart showing the different types of alternative splicing events controlled by RBMX protein in MDA-MB-231 cells, summarising the proportion of splicing events that are activated by RBMX versus those that are repressed. (B) Splice site score analyses for 5ʹ (left panel) and 3ʹ (right panel) splice sites activated by RBMX compared to their RBMX non-responsive alternative splice sites. eCDF, empirical Cumulative Distribution Function. Two-sample KS test, two-sided P-value = 0.23 and 0.0007 (**) respectively.
- Figure supplement 1C, and Figure 2 - Source Data 1

Figure 2. Splicing control and sites of RBMX protein-RNA interaction are enriched within long internal exons. (A) Western blot showing levels of RBMX-FLAG protein, expressed after 24h treatment with tetracycline, compared to endogenous RBMX within HEK293 cells, both detected using α-RBMX antibody. α-GAPDH antibody was used as loading control. (B) RNAs cross-linked to RBMX-FLAG during iCLIP detected through the infrared adaptor (RBMX-RNA complexes). Lane 1,

(C) K-mer analysis shows the top 10 enriched motifs within sequences surrounding RBMX iCLIP tags. (D) Boxplot analysis shows sizes of exons containing splicing events regulated by RBMX, grouped by whether they contain CLIP tags or not. ****, p-value<0.0001 (Mann-Whitney test). (E) Boxplot analysis shows distribution of exon sizes relative to: all or internal exons contained in mRNA genes expressed in HEK293 cells (Liu et al., 2017); exons regulated by RBMX as identified by RNA-seq; exons containing RBMX binding sites as identified by iCLIP, listed independently of iCLIP tag density. Median sizes for each group are shown. ****, p-value<0.0001 (Wilcoxon rank test and Kruskal-Wallis test). (F) Distribution plot of exon sizes for the groups shown in (E). Note the increased accumulation of exons larger than 1000 bp (ultra-long exons) in RBMX bound and regulated exons compared to all exons expressed in HEK293 (Liu et al., 2017). (G) Bar plot indicating the proportion of ultra-long exons in the groups shown in (E, F). ****, p-value<0.0001 (Chi-squared test). (H) Histogram of hexamer Z-scores for ultra-long exons (exceeding 1000 nt) versus non-ultra-long exons from Ensembl canonical mRNA transcripts. The top five enriched hexamers are shown with corresponding Z-scores. Grey dots indicate histogram bins containing one of the top 25 RBMX iCLIP hexamer motifs. (I) Similar analysis as in (H), but for ultra-long exons with evidence of RBMX binding or regulation versus RBMX-independent ultra-long exons.

Figure 2 - Figure supplement 1. (A) Correlation analysis between three replicates for RBMX-FLAG iCLIP. (B) K-mer analysis shows the top 50 enriched motifs within sequences surrounding RBMX iCLIP binding. (C) Barplot showing percentage of exons regulated by RBMX that contain iCLIP tags. (D) Fraction of exons that were repressed by RBMX (blue) or independent (grey) that contained RBMX iCLIP tags in surrounding regions (below diagram). *, p-value<0.05 (two-tailed Fisher's exact test). (E) Same analyses as in (D) but for exons that were activated by RBMX (red). **, p-value<0.01 (two-tailed Fisher's exact test).
- Figure supplement 1A and Figure 3 - Source Data 1) and regulated by RBMX in both MDA-MB-231 and HEK293 cell lines (detected using RNA-seq, Figure 3 - Figure supplement 1B and Figure 3 - Source Data 1) each showed individual global enrichment in functions connected to genome stability and gene expression. Similarly, Gene
We also detected extremely high density RBMX protein binding within exon 9 of the ATRX gene (3 Kb in length) that encodes a chromatin remodelling protein involved in mitosis. Depletion of RBMX results in expression of a shortened version of ATRX exon 9, caused by formation of an exitron through selection of cryptic 5ʹ and 3ʹ splice sites within exon 9 (Figure 3 - Figure supplement 2C).

Figure 3. RBMX protein is important for full-length splicing inclusion of ultra-long exons involved in DNA repair and RNA polymerase II transcription. (A) Gene Ontology analysis of genes with ultra-long exons regulated and bound by RBMX displaying significant Gene Ontology Biological Process (GOBP) terms containing at least 5% of the total gene list. FDR, False Discovery Rate. Count, number of genes in the GOBP group. GeneRatio, proportion of genes in the GOBP group relative to the full list of RBMX-regulated genes. (B) Snapshot of RNA-seq merged tracks from MDA-MB-231 cells and RBMX iCLIP tags from HEK293 cells from the IGV genome browser shows cryptic 3ʹ splice sites repressed by RBMX in ETAA1 exon 5. At the bottom, schematic of PCR products identified by RT-PCR in (C). (C) RT-PCR analysis shows splicing

Figure 3 - Figure supplement 1. (A) Gene Ontology analysis of genes bound by RBMX as identified by iCLIP, displaying top 20 significant Gene Ontology Biological Process (GOBP).
as identified by RNA-seq. GOBP terms containing at least 5% of the total gene list are shown. FDR, False Discovery Rate. (C) Comet assay shows increased formation of DNA breaks in U2OS cells treated with RBMX siRNA. Direction of comets is shown. Scale bars 200 µm. (D) Quantification of percentage of DNA in the tail of comets. n=58 cells in both conditions. ****, p<0.0001 (Mann-Whitney test).

Figure 3 - Figure supplement 2. (A) Snapshot of RNA-seq merged tracks from HEK293 cells (Liu et al., 2017) from the IGV genome browser shows cryptic 3ʹ splice sites repressed by RBMX in ETAA1 exon 5. (B) Schematic of ETAA1 protein in normal conditions and expected ETAA1 protein
- Figure supplement 2). These results indicate that, despite its more extensive divergence, RBMY can also functionally replace RBMX in cryptic splice site control within long exons. Thus, splicing control mechanisms by RBMX family proteins predate the evolution of the mammalian X and Y chromosomes

Figure 4. RBMXL2 can replace the activity of RBMX in ensuring proper splicing inclusion of ultra-long exons. (A) Schematic of the time-course experiment used to analyse RBMXL2 function in RBMX-depleted HEK293 cells. All conditions were repeated in biological triplicates. (B) Western blot analysis shows that RBMXL2-FLAG protein is stably expressed in HEK293 cells after 24 hours of tetracycline induction, and RBMX protein is successfully depleted after 72 hours siRNA

Figure 4 - Figure supplement 1. RBMXL2 can replace the activity of RBMX in ensuring proper splicing inclusion of ultra-long exons. (A-C) Snapshot of RNA-seq merged tracks from HEK293 cells from the IGV genome browser shows tetracycline-induced expression of RBMXL2 restores correct splicing patterns within ETAA1 exon 5 (A), REV3L exon 13 (B) and ATRX exon 9 (C). Location of the splicing defects in RBMX-depleted cells is shown with red dotted lines.

Figure 4 - Figure supplement 2. RBMY can replace the activity of RBMX in ensuring proper splicing inclusion of ultra-long exons. (A) Western blot analysis shows that RBMY-FLAG protein is stably expressed in HEK293 cells after 24 hours of tetracycline induction, and RBMX protein is successfully depleted after 72 hours siRNA treatment. All conditions were repeated in biological triplicates. (B, D, F) Capillary gel electrophoretograms show RNA processing patterns of endogenous ultra-long exons within ETAA1, REV3L and ATRX controlled by RBMX and RBMY analysed using isoform-specific RT-PCR. (C, E, G) Bar charts showing percentage splicing inclusion (PSI) of cryptic isoforms from the endogenous ETAA1, REV3L and ATRX genes under the different experimental conditions, relative to experiments in (B), (D) and (F) respectively. P-values were calculated using unpaired t-test. ***, p-value<0.001. ****, p-value<0.0001.

Figure 5. The disordered domain of RBMXL2 is required to mediate splicing control of ultra-long exons in HEK293 cells. (A) Western blot analysis shows that RBMXL2ΔRRM-FLAG protein is stably expressed in HEK293 cells after 24 hours of tetracycline induction, and RBMX protein is successfully depleted after 72 hours siRNA treatment. (B, D, F) Capillary gel electrophoretograms show RNA processing patterns of endogenous ultra-long exons within ETAA1, REV3L and ATRX analysed using isoform-specific RT-PCR. (C, E, G) Bar charts showing percentage splicing inclusion (PSI) of cryptic isoforms from the endogenous ETAA1, REV3L and ATRX genes under the different experimental conditions, relative to experiments in (B), (D) and (F) respectively. P-values were calculated using unpaired t-test. ****, p-value<0.0001.

Figure 6. Model of cryptic splice site repression within ultra-long exons by RBMX family proteins.

Figure 1 - Source Data 1. List of splicing defects in MDA-MB-231 and HEK293, related to Figure 1C and Figure 1 - Figure supplement 1A.

Figure 3 - Source Data 1. Gene Ontology analyses, related to Figure 3A and Figure 3 - Figure supplement 1A, B.