Acorde: unraveling functionally-interpretable networks of isoform co-usage from single cell data

Alternative splicing (AS) is a highly-regulated post-transcriptional mechanism known to modulate isoform expression within genes and contribute to cell-type identity. However, the extent to which alternative isoforms establish co-expression networks that may relevant in cellular function has not been explored yet. Here, we present acorde, a pipeline that successfully leverages bulk long reads and single-cell data to confidently detect alternative isoform co-expression relationships. To achieve this, we developed and validated percentile correlations, a novel approach that overcomes data sparsity and yields accurate co-expression estimates from single-cell data. Next, acorde uses correlations to cluster co-expressed isoforms into a network, unraveling cell type-specific alternative isoform usage patterns. By selecting same-gene isoforms between these clusters, we subsequently detect and characterize genes with co-differential isoform usage (coDIU) across neural cell types. Finally, we predict functional elements from long read-defined isoforms and provide insight into biological processes, motifs and domains potentially controlled by the coordination of post-transcriptional regulation.


Introduction
Single-cell RNA-seq (scRNA-seq) has revolutionized transcriptomics analysis, especially as the development of technologies with increasingly higher throughputs has enabled the processing of thousands of single cells simultaneously, boosting the amount of biological diversity that can be captured in a sequencing experiment 1 . The technology has been extensively applied to the discovery of new cell types and the characterization of their transcriptional profiles, resulting in the definition of cell type marker genes [2][3][4][5][6][7] . Even though scRNA-seq datasets can encode several levels of granularity in the form of cell subtypes, these studies rely on the low number of features required to recapitulate the cell type structure of the data 8 , which situates cell type characterization efforts at the baseline of understanding the intricacy of single-cell biology. scRNA-seq studies have also tackled transcriptional dynamics and how they relate to cell type properties. These include methods for the study of dynamic processes, namely pseudotime 9,10 and RNA-velocity 11 analyses, which have provided insight on cell differentiation and transition mechanisms between cell states 7,[12][13][14][15][16] . Moreover, novel methods have been developed to convey the inference of gene regulatory networks (GRNs) to the single-cell level 17 in an attempt to combine single-cell information with extant knowledge to infer relationships between genes and transcriptional regulators at a higher resolution.
Single-cell research is nevertheless far from realizing its full potential. On the contrary, the timing is now optimal for the field to undertake the investigation of deeper layers of cellular complexity. In particular, the investigation of Alternative Splicing (AS) and isoform expression dynamics has remained a challenge to the field. Reasons for this include the uncertainty of short read-based isoform quantification, which is exacerbated by the lower number of available reads per transcript in comparison to bulk libraries 18 , and the fact that the most popular scRNA-seq methods are heavily 3' end biased, which precludes the unambiguous identification of alternative transcript variants 19 . Current methods for the study of isoforms in single cells therefore rely on alternative metrics that either avoid isoform-level expression estimation completely 20,21 or exclusively consider individual splicing events [22][23][24][25][26] , generally leaving isoform characterization aside, with some recent exceptions 27 . Meanwhile, long read RNA sequencing (lrRNA-seq) of single cells is beginning to emerge an alternative approach to mitigate this ambiguity, given that it successfully grasps how individual events are combined into alternative isoforms 19 . Long read studies have therefore expanded the field's notions of cell type-specific splicing from event inclusion towards isoform selection patterns and showed that cell type-specific isoform expression can be detected in both broad types as well as cell subtypes [28][29][30][31][32] . Unfortunately, the sequencing depth constraints intrinsic to long read protocols 19 have limited the amount of isoform diversity that can be captured by single-cell long read transcriptomics [28][29][30] , and datasets generally show low levels of redundancy between cells of the same cell type.
Notwithstanding this limited scenario, there are a number of relevant questions regarding the importance of splicing for cell identity and function that can only be resolved by evaluating isoform expression at the single-cell level. In point of fact, splicing differences have been shown to discriminate cell types with an accuracy comparable to that obtained using gene expression 33 , while integrating AS and gene expression changes has led to the discovery of cell subtypes and states that were otherwise not detected 27,[34][35][36] . Especially relevant among these inquiries is the much-debated issue of whether individual cells express one or several isoforms, that is, whether the isoform diversity observed in bulk studies is recapitulated by each single cell or, alternatively, arises as a result of the combination of multiple cells, each of which uniquely express one of the gene's isoforms. Ever since the publication of the first scRNA-seq studies, short reads have been used to answer this question, usually via the characterization of splicing event -rather than isoform expression-modalities. Successive studies have provided non-conclusive results, with evidence of bimodal splicing patterns 22,37,38 as well as concerns regarding the relationship between bimodal isoform detection and technical noise 18,39 , a controversy that suggests that new analytical and computational approaches are needed to understand the isoform landscape of single cells.
Another pending question for the field is whether isoform expression programs involve coexpression relationships between transcript variants from different genes. So far, the application of long read technologies to single-cell data has served to unravel coordinated event choice patterns within isoforms of the same gene 40,41 , however, cross-gene isoform expression networks have not been investigated. In other words, there have been no studies addressing potential codependency between genes regarding the selection of transcript variants from their isoform repertoire, or the implications of this coordination for cell-state and cell-type properties. This is not only related to the general constraints of single-cell isoform studies, but also to the lack of computational methods and mathematical models to extract this complex signal from the data. In spite of the present research gap, isoform coexpression networks are an anticipated consequence of the regulation of splicing by RNA binding proteins (RBPs) and other splicing factors, and their investigation constitutes an opportunity to gain insight on the functional role of AS. Moreover, given that a multiple cell type and high cell throughput context such as that of scRNA-seq data constitutes a far more suitable data scenario to that of bulk RNA-seq, this is undoubtedly a timely inquiry to make.
In the present study, we hypothesize that isoform expression coordination exists as a result of AS regulation, and that this can be computationally detected in the form of isoform groups showing co-variation across cell types. To demonstrate this hypothesis, we have designed an end-to-end, data-intensive pipeline for the study of isoform networks (Figure 1). First and foremost, we employed a hybrid strategy where bulk long-reads and single-cell Illumina sequencing were integrated to estimate isoform expression at the single cell level. To unlock the limitations of extant correlation metrics in the single-cell context 42 , we developed a novel strategy to obtain noise-robust correlation estimates in scRNA-seq data, and a semiautomated clustering approach to detect modules of co-expressed isoforms across cell types.
We additionally re-defined and implemented Differential Isoform Usage (DIU) and co-Differential Isoform Usage (coDIU) analyses in order to leverage the multiple cell types contained in single-cell datasets. Finally, to couple these analyses with a biologically interpretable readout, we incorporated a functional annotation step in which several databases and prediction tools were integrated to add isoform-specific functional information ( Figure 1).
We have hereby applied this pipeline ( Figure 1) to the analysis of a publicly available mouse neural dataset, including published scRNA-seq Smart-Seq2 (primary visual cortex, generated by Tasic et al. 43 ) and bulk ENCODE PacBio long-read data (mouse cortex and hippocampus, generated by Wyman et al. 44 ). As a result, we successfully detected cell type-specific coexpression of isoforms in a manner that was independent of gene-level expression.
Furthermore, we demonstrated that these isoforms encode shared functional properties, highlighting the role of post-transcriptional processing as an additional regulatory layer that fine-tunes cellular functions and contributes to encode cell type identity. This novel pipeline has been implemented in the R package acorde (https://github.com/ConesaLab/acorde).

Results
Enabling multi-group differential expression of isoforms in single-cell data Traditionally, RNA-seq studies use publicly available reference transcriptomes such as RefSeq and ENSEMBL for short read isoform quantification. However, most tissues and cell types will express only a subset of the genes and isoforms contained in the reference, with previous studies showing that isoform detection accuracy increases when adopting tissue-specific isoform sets as a reference for mapping 45 . Long read technologies have the potential to achieve this, while also expanding the reference with novel isoforms that have not yet been annotated 46 . Given the low depth of single-cell long read datasets 19 , we employed a bulk PacBio dataset obtained from ENCODE 44 to define a mouse, neural-specific transcriptome, that, after extensive curation using the SQANTI3 toolkit (https://github.com/ConesaLab/SQANTI3) contained 36,986 isoforms belonging to 12,692 genes (see Supplementary Note 1).
To quantify the expression of the long read-defined isoforms at the single-cell level, we made use of a publicly available, deeply sequenced, full-length, short-read single-cell RNA-seq dataset by Tasic et al. 43 . We retained 1,591 cells after quality control (see Methods). Using the labels from the original characterization of the dataset, we assigned cells to 7 broad cell types, 5 glial (microglia, endothelial cells, oligodendrocytes, oligodendrocyte precursor cells (OPCs) and astrocytes) and 2 neural (GABA-ergic and glutamatergic neurons), each of which can be divided into several, distinct subtypes (Supplementary Figure 1A). To target potential differential isoform selection between cell types, we selected genes that retained more than one isoform after quality filtering, ultimately keeping 13,832 isoforms from 4,591 genes.
Of note, we observed a drastic cell number imbalance between neural (~720 cells/cell type) and glial cell types (~30 cells/cell type) in the Tasic dataset, which resulted in the underestimation of transcriptional differences between non-neural types (Supplementary Figure 1B). In order to facilitate downstream analysis, we used a downsampling approach (Methods) to balance the cell type abundances in the dataset while effectively preserving the data structure (Supplementary Figure 1C). Next, to select isoforms with robust co-variation and non-constitutive expression, we applied a multi-group strategy to detect isoforms showing Differential Expression (DE) in at least one cell type, combining the ZinBWaVE zeroexpression weighting strategy 47 with bulk-designed DE methods DESeq2 48 and edgeR 49 (see   Methods). Upon testing, we detected 5,711 DE isoforms using edgeR (FDR<0.05) and 7,714 using DESeq2 (FDR<0.05). To maximize sensitivity and ensure a broad iso-transcriptome analysis, we considered an isoform to be DE if it was detected by at least one of these methods, resulting in 10,100 isoforms from 4,305 genes (Supplementary Figure 2). Since two or more isoforms with differential cell type expression are required to form co-splicing relationships (see Methods), we finally retained 8,967 isoforms from 3,172 genes with multiple DE isoforms for downstream analysis.
To ensure that downsampling had no effect on DE results, a validation experiment for multigroup DE analysis was conducted, in which we performed 50 runs of random neural cell sampling (see Methods). Although DESeq2 proved to be slightly more robust across independent runs than edgeR, Jaccard Index values indicated that the majority of isoforms were consistently detected as DE (DESeq2: mean no. of DE isoforms = 6,908±101, mean Jaccard Index = 0.84±0.02; edgeR: mean no. of DE isoforms = 6,016±410, mean Jaccard Index = 0.74±0.02). In addition, this validation run showed that considering the union of edgeR and DESeq2 results contributed to improve robustness (mean no. of DE isoforms in union between methods = 9,399±248, Jaccard Index = 0.82±0.01).

Detecting isoforms showing cell type-level co-expression
Co-expression signals in single-cell data are weak and often result in poor performance of traditional correlation metrics and network inference methods 42,50 . Although data transformation approaches 51 and alternative metrics 42 have been proposed, these are more complex to apply and considerably less interpretable, respectively. Furthermore, most of these studies have only investigated gene level co-expression 17 , often ignoring the AS regulatory landscape. To address these limitations, we implemented a percentile correlation strategy: a simple, scalable approach to overcome single-cell noise in isoform co-expression studies ( Figure 2A).
Our approach considers cell-type identity to be defined by context-specific regulation of gene expression programs, and within-cell type stochasticity to arise from a combination of technical noise 52,53 and biological mechanisms such as transcriptional bursting 54 , which translate into the sparsity and heterogeneous expression patterns typically observed in scRNA-seq (Supplementary Figure 3A) and result in high variance across all levels of expression (Supplementary Figure 3B). Together, these effects mask the co-expression signal in the data and tend to yield low correlation values when using traditional metrics ( Figure 2B).
To overcome this problem, we treat single cells of the same cell-type as biological replicates that represent the state of a delimited cell population but are differently affected by the aforementioned combination of technical and biological forces. In this context, the expression distribution of any given transcript across the population can be considered as the signature of the transcript in that cell-type. To translate these assumptions into a metric, every isoform's expression within a cell type is first summarized into an expression profile, where single-cell count values are replaced by 10 percentile values (deciles) (Figure 2A, Methods). Intuitively, this reduced number of values synthesizes the approximate behavior of that transcript in the cell type, as inferred given cell-level observations. Next, to grasp expression distribution similarities across cell types, Pearson correlation between transcript pairs is computed using the percentile-summarized expression, resulting in a meaningful distribution of correlation values ( Figure 2B). In this manner, we extend the notion of cell-type markers to rely not only on mean or frequency of expression, but on their actual distributional pattern. Our coexpression metric therefore by-passes cell-level matching of individual observations, providing a correlation estimate that is both robust to the uncertainty of single cell expression and interpretable as a measure of expression similarity. Of note, changing percentile number did not have a noticeable effect on the resulting correlation values (Supplementary Figure 3C), which were considerably higher than those generated using traditional metrics, successfully unlocking co-expression analysis in single-cell data.
To detect modules of co-expressed isoforms, we used the percentile correlation matrix as a distance matrix for hierarchical clustering, and designed a semi-automated cluster refinement approach to ensure maximal profile similarity within clustered modules ( Figure 3A, see Methods for a detailed description). Briefly, we first used the dynamicTreeCut R package 55 to set adaptive thresholds and find clusters dynamically within the dendrogram. Among the resulting 152 clusters, some showed highly similar (i.e. redundant) expression profiles (Supplementary Figure 4A) and were therefore merged by re-clustering using the mean scaled expression of the isoforms in each cluster (i.e. clustering of metatranscripts, see Methods), obtaining 60 isoform clusters. Among them, 18 groups presented noisy expression profiles (Supplementary Figure 4B) and their isoforms were therefore re-assigned to the remaining 42 clusters. However, fully automated re-clustering was not effective to eliminate redundancy in some cases. As a result, conflicting cases were inspected and merged, generating a final number of 19 distinct clusters containing 8,688 isoforms (96,8% of total) and representing diverse expression modalities across the 7 broad cell types ( Figure 3B).

Validation of percentile correlations on simulated data
Building on studies reporting the poor performance of popular correlation metrics in singlecell data, authors have attempted the implementation of sparsity-aware measurements 51,56 and reported the potential of other alternatives to compute similarity, such as proportionality metrics 42 . Here, we present an interpretable, scalable and biology-aware alternative to singlecell co-expression studies based on Pearson or Spearman correlation. However, to better understand how percentile correlation performs in comparison to extant correlation metrics, we compared it to Pearson, Spearman and zero-inflated Kendall correlations 56 and one proportionality metric, rho (ρ) 57 using simulated data.
Given that there are -to the best of our knowledge-no scRNA-seq data simulators that include transcript co-expression patterns, we designed a simulation strategy (Supplementary Figure   5A, Methods) to generate an appropriate validation framework for our metric. Briefly, we applied SymSim 58 Figure 5D).
In order to evaluate how well the 5 co-expression methods recapitulated the simulated patterns, we computed these metrics for all synthetic transcript pairs in each simulated cluster (Supplementary Figure 5E). Among them, percentile correlation consistently yielded the best proportion of high within-cluster correlations followed by ρ, however, rather counterintuitively, ρ had only an average performance when low-complexity patterns were provided, with less than 20% output proportionality values >0.8 within clusters 1-5. Shockingly, zeroinflated Kendall correlation, a single cell-tailored metric, failed to recapitulate the simulated co-expression profiles and showed a considerably lower proportion of high correlations within the simulated clusters than Pearson and Spearman correlations. As a result of this evaluation, we can confidently assume that percentile correlations are useful to detect covariation patterns, yielding overall higher correlation values than all other considered metrics.
Of note, and similarly to real data, the simulated dataset showed no detectable effect when varying the number of percentiles used to compute percentile correlations (Supplementary Figure 6).
Next, we compared the ability of each co-expression metric to inform clustering and group synthetic transcripts with similar expression patterns. To achieve this, we run our clustering pipeline on the simulated isoforms using the 5 metrics as distance ( Supplementary Figures 7-11). To enable benchmarking, clustering was automated to always generate a total number of 15 calculated clusters (see Methods). In order to evaluate which metric worked best to detect co-expressed transcript groups, we considered internal correlations between the transcripts in generated clusters. We observed that ρ and percentile-generated clusters, unlike the remaining co-expression metrics, presented consistently high levels of internal correlation ( Figure 4A). Notably, the distribution of co-expression values obtained using percentiles was the most robust among the five metrics (Supplementary Figure 12). We next assessed how well the clusters generated using each co-expression metric (i.e. calculated clusters) recapitulated the simulated clusters. Calculated and simulated clusters were paired based on the similarities between their mean cluster profiles (Supplementary Figure 13, Methods), and the Jaccard index (JI) for each simulated-calculated pair was computed to measure the agreement in synthetic transcript assignment ( Figure 4B). Interestingly, results were highly heterogeneous for most methods: even though a number of simulated co-expression groups were easily detected by most metrics, no method was able to fully recapitulate the simulated clusters, with ρ proportionality, Pearson and percentile correlations being the most accurate ( Figure 4B). Zero-inflated Kendall and Spearman correlations, on the other hand, showed consistently low agreement with the simulated transcript groups. Finally, we considered the number of transcripts that remained unclustered ( Figure 4C) before and after re-clustering unassigned transcripts (i.e. cluster expansion step, see Methods). Pearson correlation provided successful cluster assignment for practically all transcripts in the simulated dataset, especially when incorporating percentiles (Pearson: ~10% unclustered before expansion, ~1% after; percentile: ~4% unclustered before expansion, 0% after), whilst the rest of metrics performed significantly worse, leaving 20-30% of transcripts unassigned even after cluster expansion, with proportionality (~30% unclustered before expansion, ~25% after) being the less optimal. Altogether, though ρ demonstrated good performance in many aspects of clustering, including intra-cluster correlation and agreement with the simulated clustering, it was outperformed by percentile correlation when globally considering all evaluated parameters ( Figure 4D). In addition to the fact that ρ failed to control for unassigned transcripts, computing means and standard deviations of Jaccard indices across simulatedcalculated pairs showed percentile and Pearson correlations as the most consistently accurate methods. All in all, our synthetic data evaluations showed that the percentile correlation approach performed well -and more consistently than ρ proportionality-in all the evaluated features, and visibly captured co-expression better than both traditional and zero inflation-aware correlation metrics.

Co-Differential Isoform Usage analysis of single-cell isoform expression
Isoform clusters represent groups of alternative transcripts that are co-expressed at the cell type level. However, our clustering results did not provide information on iso-transcriptome properties associated with splicing regulation. To facilitate interpretation of the isoform clustering results, we first defined genes with Differential Isoform Usage (DIU) as those whose isoforms were assigned to different clusters ( Figure 5A) and found that 80% of genes with clustered isoforms (2,577 out of 3,172) were DIU and involved a total of 7,575 isoforms. In this context, DIU genes will necessarily have two or more isoforms with significant changes in expression across cell types and simultaneously undergo cell type-dependent posttranscriptional regulation, leading to changes in isoform expression in each cell type.
In order to study isoform co-expression patterns, we defined co-Differential Isoform Usage (coDIU) genes as those showing coordinated cell-type-specific isoform usage. Specifically, we considered two or more genes to be coDIU if their isoforms had been assigned to the same clusters ( Figure 5B, Methods). This resulted in the definition of an isoform co-splicing network where nodes are clusters of correlated isoforms and edges represent coDIU genes, i.e. the number of genes for which two or more isoforms are co-expressed across cell types ( Figure   5C, Methods). To ensure the selection of significantly coDIU genes, we fitted a generalized linear model for every pair of coDIU genes, and selected pairs with isoform-level coexpression and no significant cell type expression variation when only accounting for genelevel expression (see Methods). CoDIU genes therefore present cell type-dependent coexpression of at least two isoforms, represented by cluster assignment matches, but are not co-expressed when only gene expression is considered. Using this strategy, we detected 2,049 genes with at least one significant coDIU partner (cluster*cell-type FDR < 0.05, gene*cell-type FDR > 0.05), involving 6,370 co-expressed isoforms. The number of coDIU genes sharing isoforms across each cluster pair was variable, although it rose up to ~120 for highly connected expression profiles ( Figure 5C).
We then interrogated the coDIU network to find patterns underlying the splicing coordination signal detected by the acorde pipeline. First, in order to measure whether coDIU generated strong or subtle variations in isoform selection across cell types, we investigated the association of coDIU to single or multiple cell type isoform switching events. Note that single isoform switching events involve clusters with patterns that are similar across all cell types except one, leading to high between-cluster correlations. Interestingly, we found that the number of coDIU genes linking isoform co-expression clusters was dependent on cluster sizes, but showed no direct relationship with the similarities between expression profiles ( Figure 5D). The detection of coDIU genes involving isoforms with highly different expression patterns suggested that coordinated isoform usage mechanisms are able produce strong cell type-level shifts in isoform selection, and thus modulate the expression of highly cell typespecific splice variants.
We next evaluated the cell type-level relationships present in the isoform co-expression network, namely the occurrence of coDIU across all possible pairs of cell types in our data.
Although co-splicing could potentially occur between any combination of cell types, our results showed that a high proportion of coDIU interactions were detected when the isoforms involved had high expression in one of the two neural cell types, i.e. GABA-ergic and glutamatergic neurons ( Figure 5E). This can be partially explained by the fact that some of the clusters with neuron expression are among the largest generated by the acorde pipeline ( Figure 5C). However, another plausible explanation is that the central role of neurons in the tissue under study (i.e. primary visual cortex) might situate co-splicing at the core of neural function regulation, as well as the modulation of its interaction with glial cell types.

Functional analysis of the coDIU network
We next set out to investigate the functional implications of our isoform co-expression network. Since, with a few exceptions 59,60 , splicing analysis tools rarely integrate functional information and function annotation databases do not usually include data at isoform resolution, we annotated the long read-defined transcripts using IsoAnnotLite compared the proportion of coDIU and DIU genes annotated for each functional feature in the transcriptome using a partially-overlapping samples z test 61 (Figure 6B, see Methods). Not surprisingly, coDIU genes were significantly enriched (FDR<0.05) in some of the same functionalities obtained in the previous DIU analysis, such as RNA biosynthesis and processing (transcription, gene expression, RNA metabolic process). However -and in contrast to DIU genes, which were markedly involved in biosynthesis-, genes regulated by coDIU were erniched in mitochondria components (i.e. mitochondrial matrix), suggesting that coordinated isoform usage may affect oxidatoin and energy metabolism. Remarkably, coDIU genes also showed additional enrichment for splicing-related terms such as RNA splicing, mRNA splicing via spliceosome and for 3' UTR motif K-box ( Figure 6B). This result links genes involved in splicing and elements affecting RNA stability with the coordination of AS and suggests that the coexpression of alternative isoforms is a post-transcriptionally regulated process, in line with previous evidence of AS self-regulation [62][63][64][65] .
To obtain further insight into the functional elements controlled by AS co-regulation across neural cell types, we performed a Functional Diversity Analysis (FDA, Figure 6C). FDA is part of the tappAS framework 60 , and identifies functionally varying genes, i.e. genes expressing transcript variants with differences in the inclusion of functional features (see Methods). FDA can be evaluated from a presence/absence standpoint (i.e. AS completely removes a feature), or by detecting variation in the transcript positions defining the feature ( Figure 6C). We therefore compared the diversity in transcript-level functional features between DIU, coDIU genes and genes with more than one DE isoform, for all functional categories provided by IsoAnnotLite ( Figure 6D). Interestingly, we detected a larger percentage of varying genes as the level of considered regulatory complexity increased, with coDIU resulting in the largest 17  To measure this effect, we compared varying percentages between all combinations of the three gene sets (paired samples t-test, see Methods) and confirmed the observed trend, regardless of whether the variability criteria employed was position or presence. In particular, even though all comparisons were significant, coDIU resulted in the most significant increase in feature variation (coDIU vs DE isoform genes p-value: presence = 7.61e-07, position = 3.45e-05; coDIU vs DIU genes p-value: presence = 4.92e-04, position = 9.62e-03). We verified that this increase in functional diversity was not associated to expression level or isoform length biases in coDIU genes (Supplementary Figures 14A-B). This result suggests that alternative isoforms that engage in co-expression relationships tend to alter their functional properties significantly more often than other transcripts, namely by controlling the inclusion of motifs, sites and /or domains, thereby coupling alternative splicing and isoform co-expression with functional potential.

Functional analysis of neuron-oligodendrocyte isoform co-expression
To further understand the relationship between cell-type identity, isoform co-expression and the functional properties of coDIU genes, we searched the coDIU network for cluster groups representing biologically-related isoform switches between defined neural types. Namely, we focused on a set of genes 160 coDIU genes ( Figure 5C) representing oligodendrocyte-specific, neuron-specific, and shared isoform expression patterns (clusters 8, 13 and 19, respectively, Figure 6E) and analyzed isoform-associated functional variability using FDA. For this set of alternative isoforms, polyA site and 3'UTR length showed the highest variation rates among 18 Figure   15A). Moreover, we noticed that these changes followed a clear cell type-specific pattern, with the majority of coDIU genes shared among the three clusters expressing their longest 3'UTR isoforms in neurons (Supplementary Figure 15B) and neural-specific isoforms generally expressing longer 3'UTRs than their oligodendrocyte-specific counterparts (Supplementary Figure 15C).
Next, we inspected 3'UTR-related functional categories, i.e. RBP binding, miRNA binding, and 3'UTR motifs, using ID-level FDA (see Methods) to identify specific functional features associated to isoform usage differences between neurons and oligodendrocytes. Regarding the presence of RBP binding sites, we found that Mbnl and CELF4 binding sites showed high variation levels, i.e. 50% and 75% of genes, respectively (Supplementary Figure 15D); however, the amount of genes including Mbnl and CELF4 binding motifs was low (~3% genes), possibly due to the relatively poor annotation density of RBP sites transferred by IsoAnnotLite (Supplementary Note 1). We also detected high variation frequencies for several miRNA sites (Supplementary Figure 15D), although no specific miRNA motif was shared by more than ~5% of genes (up to 9 out of 160 genes). Nevertheless, regarding 3'UTR motifs, we found that the Kbox motif presented inclusion changes in ~60% of annotated coDIU genes ( Figure 6A).
Interestingly, K-box motifs have been described to have a role in negative post-transcriptional regulation by promoting miRNA binding and transcript degradation 66,67 . In line with this, the coDIU network included several genes in which 3'UTR elongation led to neuron-specific coinclusion of K-box motifs, included Atxn10 (ataxin 10, Supplementary Figure 16A) and Btrc  Figure 15D). However, synergies between PTM and domain inclusion changes could still result in differential functional activities at the protein level. As an example, we found two tubulin isotypes, i.e. γ-tubulin 2 (Tubg2) and β-4-tubulin B-chain (Tubb4b), that co-expressed isoforms in clusters 8 and 19, that is, protein variants with neuron-specific and neuron-oligodendrocyte expression, respectively. Interestingly, both of these genes also presented inclusion changes in an N-terminal GTP-ase protein domain and several PTMs ( Figure 6F and Figure 6G), although with different functional outcome. Tubg2 Specifically, α/β tubulins are the building-blocks of the microtubule cytoskeleton, while γtubulin has a major role in microtubule nucleation and on the regulation of microtubule dynamics. The γ-tubulins only present two isotypes, among which Tubg2 is exclusive to neurons and has a well-defined role in neuron survival and growth 71 , whereas Tubb4b is part of a broader catalogue of β-tubulin isotypes, all of which are jointly regulated and share a structural function 72 . Intriguingly, however, mutations in a closely-related isotype (Tubb4a) have already been shown to have a differential phenotypic effect in neurons and oligodendrocytes 73 . In line with this, we observed modifications in the number and density of N-terminal phosphorylation sites in Tubb4b as a result of cell type-level isoform co-expression, with an increase in protein diversity in neurons (3 vs 2 alternative isoforms, Figure 16G). Even more interestingly, PTM changes on different α/β-tubulin isotypes have long been known to modify tubulin stability and its interactions with other proteins 74 , creating a "tubulin code" 75 .
As a result, tubulin PTMs have the ability to modulate microtubule stability and function, intervening in processes such as neural differentiation, survival and polarity 76,77 . This cell typespecific inclusion of different PTMs via isoform co-expression could therefore operate as a fine-tuning mechanism of the tubulin code, modifying the number and position of available modification sites in β-4-tubulin. Alternative Splicing (AS) is known to be a tightly-regulated process in which splicing factors interact to create cell type-specific isoform expression patterns 78 . The transcriptome-level consequences of AS regulation have been studied in different ways, including, but not limited to, the detection of within-isoform coordination of alternative sites 40,41 , the generation of gene-isoform networks to uncover novel regulatory relationships [79][80][81][82] and the application of single-cell data to unravel cell type-specific expression patterns for same-gene isoforms 27,83 .
However, the extent to which AS regulation creates co-expression patterns among alternative isoforms from different genes has not yet been fully addressed. Specifically, previous studies tackling this type of isoform co-expression have either focused on specific event types, such as alternative 3' exons 84 , or solely on the identification of functionally-relevant alternative isoforms in different biological contexts 85,86 .
In this study we present acorde, an end-to-end pipeline to generate isoform co-expression networks and detect genes with co-Differential Isoform Usage (coDIU), and apply it to the study of isoform co-expression among seven neural broad cell types 43 . To this end, we successfully leveraged single-cell data by implementing percentile correlations, a metric designed to overcome single-cell noise and sparsity and provide high-confidence estimates of isoform-to-isoform correlation. Here, we show that percentile-summarized Pearson correlations outperform both classic and single-cell specific correlation strategies 56 , including proportionality methods that were recently proposed as one of the best alternatives to measure co-expression in single-cell data 42 . In addition, using a long read-defined, functionally annotated transcriptome enabled us to obtain a biological readout from the isoform network. isoform co-expression thus appears to impact a subset of DIU genes sharing a specific set functions, which suggests that coDIU may contribute with an additional layer of complexity to some cellular processes, operating as a fine-tuning mechanism. Our analyses also revealed that isoforms from coDIU genes encompass higher functional diversity than those belonging to DIU genes, an effect that was not associated to expression or length differences.
Importantly, these changes impact both transcript and protein isoform functional features, which pinpoints their ability to globally increase the functional repertoire of coDIU genes.
Mechanisms generating isoform co-expression can therefore be thought of as a potential source of functional synergies between alternatively spliced genes, giving rise to simultaneous changes in functional properties among co-expressed isoforms.
To demonstrate the power of the acorde pipeline, we include examples where these kinds of coordinated changes were detected in the mouse neural dataset. First, we report a neuralspecific pattern of 3'UTR co-elongation that is consistent with the available literature 87,88 and results in a simultaneous increase in the number of K-box motifs, miRNA and RBP binding sites in these UTRs. In addition, we describe how cell type-specific co-expression of γ and βtubulin isoforms creates divergent functional properties among them, which is consistent with their different biological functions 71,72 and points to a splicing-mediated coordination of celltype specific structural components. On a broader note, an interesting finding stemming from our functional analyses concerns the type of biological properties that make the functional signal of the coDIU network. Namely, while we do find enrichment of biological processes (i.e. RNA metabolism or mRNA splicing via spliceosome) and differential inclusion of many features from a wide variety of functional categories, we failed to recover specific functional elements consistently present in a large number of co-expressed isoforms, such as binding sites for specific RBPs or microRNAs. This suggests that AS and post-transcriptional regulation mechanisms operate by controlling the presence of a broad array of functional features, which jointly contribute to modulate key cell-level processes encoding cell-type identity. While these insights need to be subject to further experimental validation, they serve to illustrate the hypothesis-generating power of our pipeline.
All in all, we have hereby showed that acorde can effectively leverage single-cell RNA-seq data to build isoform co-expression networks, revealing a new dimension of post-transcriptional regulation of gene expression while also disclosing the cellular processes and functional elements impacted by these mechanisms. 24

Single-cell data pre-processing and quality control
Mouse neural single-cell RNA-Seq data from mouse primary visual cortex was obtained from Tasic et al. 43 and consists in paired-end Illumina reads generated with the Smart-seq2 protocol 89 , which enables isoform-level quantification. Reads were downloaded from Sequence Read Archive accession SRP061902 and mapped to the mouse genome (GRCm38.p6) using STAR 90 . We performed isoform expression quantification of the long read- Isoform length effect on expression was evaluated using the NOISeq R package 92 , where mean expression showed to be highly correlated with transcript length (adjusted R 2 = 0.81; p-value = 2.2e-16). Using isoform i effective length (l i ) and cell-level j estimated counts (c ij ), both output by RSEM and, after testing several alternatives, we devised a custom formula to minimize the impact of length on isoform expression for each isoform i: The transformed expression value for isoform i in cell j (y ij ) was again tested for length bias and a low correlation was found (adjusted R 2 = 0.25; p-value = 6.84e-8). Next, we inspected the library size distribution and filtered both high and low-count outliers due to potential premature cell death or library preparation duplets, with a total of 1591 cells passing quality control. Feature-level quality control was performed in a cell type-aware manner, keeping isoforms that showed non-zero expression in at least 20% of one cell type. Out of the 36,986 isoforms and 12,692 genes in the PacBio-defined transcriptome, we retained 18,867 isoforms and 9,596 genes for downstream analysis. Finally, we retained isoforms from multiple-isoform genes, hence removing cases where no AS-directed, isoform-level co-expression relationships can be established. This sets a general requirement for the entire study, which is that all isoforms retained must have, at all times, at least one same-gene counterpart to establish regulatory relationships that can be based on differential splicing of that gene, given that no AS regulation can be detected if a gene's total expression is represented by a single isoform.
As a result, 13,832 isoforms from 4,591 genes were preserved.

Single-cell DE analysis
Differential Expression (DE) analysis among the 7 cell types was performed by combining ZinBWaVE weights 47 and bulk-designed DE methods edgeR 49 and DESeq2 48 (i.e. using the corresponding R packages), which enable multiple group testing and were among the bestperforming methods when combined with the ZinBWaVE method. Briefly, ZinBWaVE calculates cell-level weights for each isoform, effectively downweighting zeros during modelling for differential expression in single cell data (see van den Berge et al. 47

Validation of downsampling and single-cell DE analysis
To check that the random selection of cells in the downsampling process was not affecting the DE results, we performed 50 independent runs of random neural cell sampling (without replacement) followed by zero-expression filtering (expression above zero for at least 25% of

Percentile correlation
In order to assess the similarity of isoform expression profiles across cells, a correlation measurement can be used by taking cells as observations. We propose here instead to first summarize the expression within a given cell type with percentiles and then compute the correlation using all cell types and their percentiles as observations, a method that we refer to as "percentile correlation" (see main text Figure 2A).
Percentile correlations rely on the assumption that cell-to-cell differences can be mostly attributed to transcriptional stochasticity or technical noise, and that these within-cell type differences have a smaller effect than between-cell type expression differences. However,

Semi-automated isoform clustering
In order to obtain modules of tightly co-expressed isoforms, we combined the hierarchical clustering algorithm with several rounds of cluster profile refinement (see main text, Figure   3A), in order to automate the most intensive steps of clustering while also granting control over the level of aggregation and within-cluster similarity. Clustering and refinement steps can be combined and re-arranged to best capture co-expression patterns within the data, and their parameters can be defined by potential future users to provide maximum flexibility.
Functions for clustering and refinement are implemented in the acorde R package.

Dynamic hierarchical clustering
The previously obtained correlation matrix (R), where each element rij represents the Pearson's correlation coefficient between the percentiles of isoforms (i,j), was transformed into a distance metric to be used in the hierarchical clustering. As we aimed to cluster .
We next performed hierarchical cluster analysis using the hclust() function in the R base This initial set of clusters is to be used as "hooks" to gather as much expression profile diversity from the data as possible. Importantly, even though our parametrization allows isoforms to remain unassigned to clusters (see above), some isoforms may still show low similarity to their cluster's profile. To be able to obtain profiles as consistent as possible for downstream refinement, we implemented a cluster quality control step in which we removed isoforms based on a minimum correlation threshold with the rest of the members. In particular, isoforms were moved to the unclustered group if they showed a correlation lower than 0.85 with 3 or more isoforms from their cluster. In this manner, only tightly-correlated groups of isoforms will remain clustered.

Expanding clusters with unassigned isoforms
To re-assign unclustered isoforms to clusters with which they show high correlation, we performed two rounds of correlation-based, cluster expansion. In this process, we first summarized each cluster's profile into a synthetic representative transcript that we named "metatranscript". Metatranscripts were calculated as the mean of the percentile-based expression of all isoforms in the cluster. As a result, we obtained 11·K (K being the number of cell types) mean-summarized percentile expression values, which can be understood as an approximation to the expression range shown by the isoforms from that cluster in each of the cell types.
We next computed correlations between metatranscripts and unclustered isoforms, and performed two rounds of assignment. First, we assigned unclustered isoforms showing correlation values above 0.9 with at least one cluster, where the maximally correlated cluster was selected as the best match if there were ties. Next, metatranscripts were re-calculated for the newly expanded clusters and a similar assignment round was performed, this time lowering the correlation threshold to 0.8. In doing this, unclustered isoform groups are assigned in order, and highly correlated elements can therefore contribute to strengthen within-cluster similarities before assigning more lowly correlated elements.

Merging clusters by profile similarity
Prioritizing the reduction of within-cluster variability led to obtaining a large number of small, redundant clusters. To mitigate this effect while also preserving high correlations between cluster members, we merged clusters by profile similarity using the correlations between their metatranscripts. To achieve this, we performed dynamic hierarchical clustering using the cutreeHybrid() function on metatranscripts, with the following non-default parameters: deepSplit = 4, pamStage = FALSE (see section on dynamic hierarchical clustering above for more information), minClusterSize = 1. Of note, minClusterSize is set to 1 to avoid forcing merge of clusters that do not show enough similarity. As a result, we joined highly similar clusters, decreasing the number of clusters and hence minimizing redundancy.
While this strategy effectively eliminated between-cluster redundancy, we observed some exceptions where clearly similar profiles were not merged by clustering metatranscripts (Supplementary Figure 3A). In addition, a few cases arose where automated merging resulted in joining clusters with highly uncorrelated profiles (Supplementary Figure 3B). Given that it is very hard to provide a completely automated solution to the clustering problem, we solved these final discrepancies using a series of manually supervised refinement steps.
For that, we first inspected cluster profiles to flag incorrect merge decisions generating noisy clusters. Isoforms in these clusters were treated as unclustered, and used to expand the remaining, well-defined clusters. To do this, we computed the correlation between the isoforms to be assigned and cluster metatranscripts, similarly to the cluster expansion process described above. In this case, we joined isoforms to clusters if they presented Pearson correlation > 0.8 with its metatranscript. Isoforms showing high correlation with several clusters were assigned to the cluster with maximum correlation. Once isoforms from noisy clusters had been re-assigned, clusters were manually inspected again to detect profile redundancies missed by automated metatranscript clustering. Clusters that presented a similar expression pattern across cell types were then merged.

Co-expression pattern simulation
To validate percentile correlations and our clustering strategy, we evaluated their performance on synthetic data, where co-expression relationships between simulated features need to be pre-defined as part of the data simulation process. However, there is, to the best of our knowledge, no currently available strategy to simulate single-cell data including modules of co-expressed features. We therefore designed our own simulation strategy by combining the SymSim R package 58 to adequately model single-cell RNA-Seq data, and a dedicated strategy to generate co-expression between SymSim simulated features.
First, we drafted 15 different co-expression profiles reflecting three levels of expression complexity, that is, showing high expression or expression "peaks" in one, two, or three cell groups, respectively. To generate a count matrix reflecting these expression patterns, we shuffled simulated counts to create new, synthetic features. To achieve this, we first re-ranked features in each cell group by mean expression across cells in the group, breaking feature connectivity between the simulated cell types. Then, the top 1,400 features from each cell type were selected, together with the bottom 1,400 features. In this manner, we obtained high-expression and low-expression count vectors for each group, which we then combined to create synthetic features following the pre-designed cluster's co-expression pattern. To make our benchmarking comparable, we adapted our clustering pipeline to remove all non-automated steps and always generate a fixed number of clusters. First, hierarchical clustering was performed on each correlation matrix using dynamicTreeCut() 55  features, to evaluate the clustering. We did this by aggregately considering all pairwise metric values for features within a cluster and measuring the percentage of metrics that are above a threshold value of 0.8. To assess how well empirical clusters recapitulated the co-expression simulation, we paired empirical with simulated clusters using the correlations between their mean cluster profiles. Simulated clusters were therefore paired with the empirical clustering showing maximum profile correlation. We next compared synthetic feature IDs assigned to the obtained and empirical clusters in each pair using the Jaccard Index (JI).

Defining Differential Isoform Usage across multiple groups
Grouping isoforms into different clusters allows detection of a number of expression patterns across the multiple cell types included in single-cell data.  Figure 5A). Ultimately, this can be interpreted as an indicator that isoform expression regulation is cell type-dependent in that gene.
Detecting co-splicing patterns across isoform clusters: co-Differential Isoform Usage We define coordinated splicing patterns as a situation where post-transcriptional regulation, defined by isoform expression, can be detected independently of transcriptional regulation, i.e. gene-level expression. To detect splicing coordination, we defined co-Differential Isoform Usage (coDIU) as a pattern where a group of genes shows co-expression of their isoforms, but no co-expression can be detected when only gene expression is considered (see main text, Figure 5B). In the context of our pipeline, a set of potentially coDIU genes will have at least two of their isoforms assigned to the same clusters, therefore showing detectable isoform-level co-expression, and suggesting coordinated splicing regulation in that group of genes.
However, clustering allows expression pattern variability among members, and therefore some isoforms might be assigned to clusters that do not faithfully represent their expression profile, leading to detection of false-positive coDIU genes.
To identify groups of genes candidates for coDIU, we applied negative-binomial generalized linear regression models. Let G be a group of genes, each of them with I g isoforms, where g = 1,…,|G|. At least one of the isoforms of each gene g in G must belong to the same cluster c, where c∈ {1,…,C} and C is the total number of clusters. Let z be the expression vector obtained after concatenating the expression vectors y i of each isoform i of every gene g = 1,…,|G|. For the sake of simplicity, let us assume that |G|=2, I g =2 ∀   g, and consequently C=2. In this case, vector z will contain 4N elements, where N is the total number of cells in the data (N=241 in our data) and will be the response variable in our regression model. We need to assess if z values follow the trend depicted in Figure 5B, that is, the average profile across cell types of the two isoforms in cluster 1 must be significantly different to the average profile of the two isoforms in cluster 2. In addition, the average profile of the two isoforms of gene 1 must not be different to the average profile of the two isoforms of gene 2. To identify groups of genes with these characteristics we proposed to fit the regression model in Equation 4 and select the group of tested genes as coDIU candidates when having a significant interaction between cluster and cell type effects, and a non-significant interaction between gene and cell type effects.
Where G 2 and C 2 are dummy variables indicating whether the expression value corresponds to gene or cluster 2 ( value 1) or 1 (value 0), respectively, T k is a dummy variable which takes value 1 when the corresponding cell is assigned to cell type k (k=1,…,K) and 0 otherwise, βk, γk, δk and τ k are the regression coefficients, ε represents the error term, and h() is the link function of the GLM (natural logarithm in this case).
We fitted the GLM model with the glm() function in the R-base package, and the negative.binomial() function in the MASS R package 94   isoforms will form pairs with all potentially coDIU counterparts and be repeatedly tested, we considered genes to be positive for coDIU if they met the significance criteria in at least one of these pairwise tests.

Functional analyses
The analyses in this manuscript are based on a long read-defined transcriptome which, after careful quality control and curation of the isoform models, was further annotated using

Functional Enrichment Analysis
In order to understand the functional properties of AS-regulated and co-regulated genes, we set out to characterize DIU and coDIU genes using different functional enrichment analysis approaches. In this manner, we intended to gain insight on functional features and categories showing significant overrepresentation in each of these two gene lists, in comparison to different backgrounds, i.e. lists of genes to compare to in order to detect enrichment. In the case of DIU genes, we calculated enrichment relative to genes with multiple DE isoforms (total: 10,100) in order to discriminate the functional properties of genes regulated by alternative splicing, as opposed to those lacking differential usage of their isoforms. We considered all annotated functional categories and features, and applied tappAS Functional Enrichment Analysis (FEA), which relies on the GOSeq R package 96  Specifically, we compared the proportion of coDIU genes containing each of the functional features (relative to DIU genes) with the proportion of DIU genes containing that same annotation (with respect to genes with DE isoforms). In other words, we tested whether the proportion of coDIU vs DIU genes including a given functional feature was significantly higher than that shown in the comparison between DIU and DE genes. We performed the analysis for features with more than 20 annotated genes, and subsequently corrected for multiple testing within functional categories using the Benjamini-Hochberg method. Functional features were considered to be present in a significantly higher proportion in coDIU genes when FDR<0.05.

Functional Diversity Analysis
To obtain insight into the functional changes generated as a consequence of DIU and coDIU, we again used the tappAS tool for the functional analysis of alternative splicing 60 . In particular, we first applied tappAS' Functional Diversity Analysis (FDA) module (see main text Figure 6A).
Briefly, FDA performs a within-gene comparison of all the isoforms included in the analysis, aiming to detect whether they present variation in the inclusion of a functional feature. In FDA, variation can be positional, i.e. one or more of the gene's isoforms present a change in the genomic coordinates defining the feature, or be defined by presence/absence, i.e. at least one of the isoforms lacks a feature that is present in the rest. As a result, FDA provides analyzed genes with a label for each of the feature categories included in the transcriptome's functional annotation file, flagging them as varying if at least one of the isoforms presents variation in a feature from that category, or not varying if no changes are detected. For more details on FDA, see the Methods section in de la Fuente et al. 60 .
We run both positional and presence/absence FDA for three gene sets: 1) genes with multiple DE isoforms (total: 3,172), 2) DIU genes (total: 2,577) and 3) coDIU genes (total: 2,049). Next, for each of these gene sets, we computed the proportion of varying genes detected for each functional category. Varying proportions were calculated relative to the total number of genes including annotations from the category, instead of considering all genes in the set. In this manner, we avoided underestimating variation rates for categories that were less represented in the functional annotation file. In order to check whether any of these gene sets presented a significantly higher mean proportion of varying genes across categories, we performed a paired t-test for each combination of gene set pairs: DIU vs multiple DE, coDIU vs multiple DE, and coDIU vs DIU. In this analysis, we considered functional categories to be the individuals under evaluation, while the proportion of varying genes calculated for each category in the two tested sets constituted the paired observations. As a result, we obtained three p-values per FDA analysis type, i.e. presence/absence and positional variation.
To better understand the functional readout that can be obtained using the acorde pipeline, we analyzed a subset of the coDIU gene network, namely three clusters showing related isoform co-expression patterns: neuron-specific expression (cluster 8), oligodendrocytespecific expression (cluster 13) and expression in both neural and oligodendrocyte cell types (cluster 19). In total, 160 coDIU genes showed co-expressed isoforms between at least two of these clusters. To characterize functional variation among the clusters, we used positional/presence FDA (see above) and ID-level FDA. ID-level FDA is also included in tappAS 60 and provides a within-feature summary of FDA results. In other words, ID-level FDA ultimately reports the number of varying and not varying genes detected for each feature ID included in a given functional category. In this case, varying status obeys a similar criterion to the one described above, i.e. genes in which at least one isoform shows differential transcriptome. Isoform are then tested for multi-group differential expression and those that are significantly DE in at least one of the cell types are selected. Next, percentile correlations are computed to cluster isoforms with similar expression patterns across cell types. Finally, gene pairs are tested for co-differential isoform usage, detecting genes that form co-expression relationships for subsequent functional analysis.    tial Isoform Usage (DIU) across multiple cell types. DIU genes that have at least two isoforms assigned to different clusters, indicating a differential isoform selection pattern across the different cell types. B) Definition of co-Differential Isoform Usage (coDIU) using clusters. CoDIU genes have multiple isoforms assigned to the same clusters, establishinig cross-cell type co-expression relationships for at least two of their isoforms. C) coDIU network. Nodes represent clusters and depict their mean expression profile across cell types. Node color represents cluster size