Per-sample standardization and asymmetric winsorization lead to accurate clustering of RNA-seq expression profiles

Motivation Data transformations are an important step in the analysis of RNA-seq data. Nonetheless, the impact of transformations on the outcome of unsupervised clustering procedures is still unclear. Results Here, we present an Asymmetric Winsorization per Sample Transformation (AWST), which is robust to data perturbations and removes the need for selecting the most informative genes prior to sample clustering. Our procedure leads to robust and biologically meaningful clusters both in bulk and in single-cell applications. Availability The AWST method is available at https://github.com/drisso/awst. The code to reproduce the analyses is available at https://github.com/drisso/awst_analysis.

For each sample, we first estimate µ with the right mode of the log2-counts empirical distribution, denoted asμ. Analogously, we estimate the locationν of the counts as eμ. Then, we mirror the values on the right of the mode (the dotted blue curve in Figure 1a) and we estimate the varianceσ 2 with the maximum likelihood estimator on the right tail and its mirrored values. To get the varianceτ 2 of the counts, we exploit the log-normal distribution, i.e.,τ 2 = e 2μ eσ 2 − 1 eσ 2 . The standardized counts, the z-counts zj, are thus defined as zj = xj −ν τ .
A similar approach was adopted by Hart et al. (2013) to standardize the log-FPKM values with the purpose of identifying a threshold that separates active from background genes. Both FPKM and the log transformation may lead to biases due to the influence of highly expressed genes (Bullard et al., 2010) and zero counts (Lun, 2018), respectively. Hence, we do not act on the log-FPKM transformed data. Rather, our standardization proposal is applied in the original (read count) scale. We shall note that our aim is different from that of Hart et al. (2013): while Hart et al. (2013) uses the transformation to classify between active and non-active genes, we use the transformation as a preliminary step for clustering.

Smoothing step
The second step of our proposal is a smoothing transformation applied to the z-counts. This transformation is crucial to regularize the expression of the genes affected by systematic artifacts that may bias the characterization of molecular subtypes. In fact, even when similar samples share the same subset of up-regulated genes, the level of up-regulation can change actively from sample to sample so that a compact cluster does not come to light. On the other hand, genes with a weak expression level contribute to the bias by acting as a source of noise for distance functions in high dimensions.
Briefly, we propose a smooth and skewed version of winsorization (Tukey, 1962) applied to the z-counts.
To this end, we employ a smoothing function, T (·), based on a modified Gaussian distribution function, made asymmetric through the variance parameter. Let Φ(z) be the distribution function of a standard Gaussian random variable, and consider the function where σ0 is a constant and I(·) is the indicator function. σ(z; σ0, λ) = σ0 for all the z ≤ 0, and it is an increasing function for z > 0; the parameter λ ≥ 0 controls the growth-rate of the right tail. The asymmetrization mechanism follows the one adopted by Azzalini (1985) to define the skew-normal distribution.
Differently from his proposal, we applied the asymmetrization device to the variance rather than to the body of the distribution.
The smoothing function is is a standardizing factor and φ(·) the probability density function of the standard Gaussian.
In Figure 1b, we plot the T (·) transformation for different values of λ. The black curve is the Gaussian probability distribution function with µ = 0 and σ = 0.075. The value of σ is empirically computed so that zeroes and counts close to zero are pushed to the minimum. As λ increases, from 7 (blue curve) to 13 (red curve), we observe a delay (as x increases) in transforming the values to the maximum. When λ = 0, the smoother is very close to acting as a device that dichotomizes the original values, where the mode is the changing point. Figure 1c, in which we draw the density estimates of the smoothed values of one TCGA sample, shows that the majority of the values have been shrunk around −2, while the others values gradually increase up to around 4. The effect of reducing the contribution of lowly expressed genes, and of the winsorization for the highly expressed ones, is made explicit by comparing the original log-counts versus the smoothed z-counts in Figure 1d. This latter graph shows that non-expressed and lowly expressed genes are uniformly transformed to the minimum, while the highly expressed genes are pushed to the maximum.
The genes with an intensity around the mode (mapped to zero) can contribute with different levels to a distance function.

Gene filtering
The effect of the smoothing function is to push outlying genes to a maximum value and genes with a low expression level to a minimum. Genes reaching the maximum across all samples are irrelevant, as well as those always assigned to the minimum value. Moreover, the features with about the same expression level can only add noise to the distance between samples without contributing meaningfully to the similarity measure.
To remove the irrelevant features, and those only contributing noise, we propose a filtering procedure based on a heterogeneity index, which is possible only because AWST forces the expression level of any sample in the same range.
Since the expression levels for every sample are bound by the same maximum and minimum values, we define a partition of K equally spaced intervals. The partition can be used to categorize the gene expression levels in each sample, and then to compute the heterogeneity index hj for each of the genes. Given a cutoff value c, we remove all those features fulfilling the constraint hj < c. As a heterogeneity index we use the normalized Shannon's entropy where p jk is the empirical probability associated with the k th interval of the partition and the j th gene.
Our filtering procedure maps the samples in a less parsimonious subspace than that induced by the rule of thumb of retaining the top 1,000 features (or few more), but it often yields a finer partition of the samples thanks to the AWST, as we show in section 3. Figure 2: Performance of different data pre-processing paired with the hierarchical clustering (Euclidean distance, Ward's linkage). a) AWST data pre-processing; b) Hart's data pre-processing; c) Radovich's data preprocessing; d) TCGA data pre-processing. The "sample" bar indicates the true partition, and the "clust" bar indicates the inferred partition obtained by cutting the tree to get 5 clusters. The Calinski-Harabasz curve is superimposed in each panel.

Parameters choice
Our transformation has two tuning parameters: λ and σ0, which control the amount of smoothing. Larger λ values push fewer points to the maximum value ( Figure 1). Similarly, larger σ0 values lead to less shrinkage.
Empirically, we found that values of λ ranging from 5 to 13 are good choices in practice. The default value of σ0 is chosen to shrink to the minimal value the low counts, more likely to be noise, while simultaneously preserving a dynamic range of moderate and high values.
The two parameters both contribute to the amount of shrinkage. Decreasing λ increases the shrinkage for highly expressed genes, while decreasing σ0 increases the overall rate of shrinkage, eventually leading, for very small values of σ0, to dichotomization of the data. In all the analyses of this article, we set the two parameters to their default values (σ0 = 0.075, λ = 13).
Similarly, our filtering step has two tuning parameters: the threshold c and the number of bins K. The threshold c controls the amount of genes that get filtered out according to the normalized entropy. An entropy of 0 indicates the genes that will not contribute any information in the classification, falling in the same bin across all samples. In practice, a small positive value (we adopt 0.1 by default) may be preferable to 0, to additionally remove those genes that contribute very little to the classification procedure.
Also the number of bins K contributes to the number of features retained for downstream analyses. We empirically found 20 as a good value, being a compromise between the retention of informative features and the removal of the non-informative ones. As K decreases, the intervals become larger and a higher number of genes is removed; on the contrary, when K increases the filtering procedure retains more genes with a possible increase of noise in downstream analyses.

Synthetic data
First, we evaluate our procedure on a synthetic dataset created from the SEQC (Su et al., 2014) reference data. Briefly, starting from 80 real RNA-seq technical replicates of Agilent Universal Human Reference, we simulate differential expression to create five groups of samples and we compare each procedure based on its ability to retrieve the simulated true partition (see Supplementary Information for details on how differential expression was simulated). Our synthetic dataset consists of 150 samples and 19,701 genes. The AWST procedure yields the same results both with and without our gene-filtering step. We present the results omitting the gene filtering, i.e., without removing any of the 19,701 genes.
The first set of experiments compares AWST versus a pre-processing method inspired by the transformation of Hart et al. (2013) and state-of-the-art procedures proposed for the clustering of RNA-seq data in Radovich et al. (2018) and TCGA Research Network (2015). Radovich and TCGA pre-processings (detailed in Supplementary information) adopt almost the same robust pipeline; they differ for the choices of the robust estimates of location and scale, and for the order in which standardization and log2-transformation are performed (see Supplementary Information). We also include in the comparison a simpler procedure that retains the top variable genes after FPKM normalization. The comparisons are two-fold: a) we apply hierarchical clustering (with Euclidean distance and Ward's linkage) to each pre-processing and evaluate the agreement between the inferred partition and the true clusters using the Adjustud Rand Index (ARI) (Hubert and Arabie, 1985); b) with the help of silhouette plots (Rousseeuw, 1987), we measure the compactness of the true partition in the distance matrices obtained after each pre-processing.  : Performance of different data pre-processing paired with ConsensusClusterPlus (inner and outer average linkage with Pearson's correlation as distance matrix). a) AWST data pre-processing; b) Hart's data pre-processing; c) Radovich's protocol; d) TCGA data protocol. The "clust" bar indicates the true partition, and the "consensus" bar indicates the inferred partition obtained by requiring 5 clusters.
Briefly, the consensus clustering (referred to as outer clustering) is a hierarchical clustering procedure with average linkage based on 1,000 inner hierarchical clusterings, also with average linkage. Each time 80% of the original samples are randomly drawn. Both papers adopt Pearson's correlation between samples as a distance matrix. We applied the same method to each pre-processing.
When AWST is applied prior to the consensus clustering procedure, the approach leads to a perfect partition (Figure 3a). Hart transformation leads to a slightly less accurate partition (Figure 3b, ARI = 0.70) and, more importantly, to more uncertainty in the cluster allocations (as shown by the consensus matrix), suggesting that AWST improves the robustness of the clustering results. The Radovich procedure also yields a perfect partition (Figure 3c). However, that is not the case when, after Radovich pre-processing, we apply a simple hierarchical clustering (Figure 2c Taken together, these results show that AWST is the only pre-processing that leads to an optimal sample partition on all the tested clustering algorithms (Figure 2a, Figure 3a, Supplementary Figure 4a).

A collection of 516 primary tumor samples of the Lower Grade Glioma (LGG) study from The Cancer
Genome Atlas (TCGA) allows to demonstrate the advantages of AWST on a complex RNA-seq experiment.
We compare AWST results with those previously published, in which ad-hoc and sometimes multi-omic approaches were used. While there is no ground truth in this dataset, comparing our partition to those identified by several independent, multi-omic approaches, allows us to check if the derived clusters represent biologically meaningful subtypes.
We apply AWST to the RSEM estimated counts after GC-content correction and full quantile normalization. The gene-filtering step retains 10,212 genes out of the original 19,138. We compute the Euclidean distance followed by hierarchical clustering with Ward's linkage (Figure 4). The lack of meaningful clusters in the dendrogram of the 8,926 filtered out genes (Supplementary Figure 11) indicates that the removed features only contribute noise.
The dendrogram in Figure 4 is annotated with different partitions. The expression clusters denoted by the "exprClust/CELL" bar come from a pan-glioma study, in which stringent filtering was applied to select 2,275 features in a set of 667 samples (154 GBM and 513 LGG) (Ceccarelli et al., 2016), leading to four distinct groups. A second expression clustering ("exprClust/NEJM") is from TCGA Research Network (2015) and provides a different partition in four groups. Kamoun et al. (2016) studied 156 oligodendroglial tumors of the IDHmut-CODEL subtype. With a multi-omic approach, they found three molecular subtypes (indicated by the "CODEL/subtype" bar in Figure 4).
The CH-curve suggests a primary partition ("clust5") of 5 groups that mostly mirrors the Original subtype of brain cancer: c51 is the cluster of the IDHwt samples; c53 includes the IDHmut-codel samples.
The IDHmut-non-codels splits into two groups: c54 and c55. Cluster c52 mostly corresponds to the LGr2 (Ceccarelli et al., 2016) and R4 groups (TCGA Research Network, 2015); the genomic determinants of this group are still unexplored. Clust5 is supported by the study of the overall survival time associated with each sample (Supplementary Figure 6; log-rank test p-value of ≤ 2.2 · 10 −16 ).
C53 is a large collection of IDHmu-codel samples. Kamoun et al. (2016) specifically studied this kind of tumors and obtained three sub-groups, supported by different genomic analyses. Our hierarchical clustering also splits c53 into three sub-groups that match the corresponding ones of Kamoun. To support the identification of our sub-groups, we estimated the survival curves associated with them (Supplementary Figure   8; log-rank test p-value of 0.02). Our finding is comparable to Supplementary Figure 8a of Kamoun et al. (2016).
C54 is an almost homogeneous cluster of G-Cimp-high samples. Our clustering identifies c541 and c542. The latter matches the G-Cimp-low sub-type discovered in Ceccarelli et al. (2016), obtained using a supervised analysis of methylation data. Our experiment identifies the G-Cimp-low solely based on the expression profiles, for the first time to the best of our knowledge (Supplementary Figure 9; log-rank test p-value = 2 · 10 −4 ).
To compare the performance of AWST to its competitors, we apply the other methods to the LGG dataset. None of the estimated partitions of the 516 LGG samples consistently match the subtypes known from the literature. (Supplementary Figures 12, 14, 16). Reassuringly, we note that our implementation of the TCGA protocol reproduces the original results of TCGA Research Network (2015), denoted in the figures by "exprClust/NEJM". The antibody-derived tags data (ADTs) measure the levels of those membrane proteins able to classify each cell to the sub-populations of the immune system, essentially providing the same informative content of flow cytometry. As an example, the combination of CD3+ and CD19-is a marker of T-cells, while CD3-and CD19+ characterizes B-cells. Here, we use ADT as an independent annotation of our estimated partition, which is solely based on scRNA-seq expression.

Cord Blood Mononuclear Cells
Before applying AWST, we retain the cells having at least 500 detected genes. This pre-processing is a typical step in single-cell analysis, in which low-quality samples may be disrupted or dying cells, or empty droplets (Lun et al., 2019). The total number of cells is reduced to 7,613 from the original size of 7,985 (about 5% removed). AWST is applied to a data matrix of 7,613 cells measured on 17,014 features after full quantile normalization. The gene-filtering step (with default parameters) returns 230 genes. None of the 230 genes retained by the filtering step is included in the set of antibodies in the ADT data.
Applying hierarchical clustering with Ward's linkage and Euclidean distance, we obtained a partition of six clusters (Figure 5a), each of which is, for the most part, interpretable according to the levels of expression of the independent ADT data. For instance, cluster CBMC1 identifies the T-cells sub-population given that CD3 is highly expressed, while CD19 is low. The opposite expression of these two markers (CD3 low and CD19 high) identifies the B-cells, clustered in CBMC2. The monocytes are the cells in CBMC5 where the markers CD14 and CD11c are both expressed. Each of the six clusters shows compact sub-groups that are not explained by the ADT, suggesting that the scRNA-seq data may lead to more granularity in the clustering compared to the relatively few ADT markers.
The inferred clusters are well separated in the space of the first two principal components computed from the 230 features after AWST (Figure 5b). A 3-dimensional view of the cells is at https://tinyurl.com/ybp3wydb.
Overall, there is good agreement between ADT data and scRNA-seq-based unsupervised clustering (Supplementary Figures 17 and 18).
Generally, t-distributed Stochastic Neighbor Embedding (t-SNE; figure 5c) mapping (van der Maaten and Hinton, 2008), and Uniform Manifold Approximation and Projection (UMAP; Figure 5d) (McInnes et al., 2018) are the preferred visualization methods for single-cell data. The hierarchical clustering of Figure 5a shows its superiority to unveil the complex similarity between cells. From tSNE or UMAP, the landscape of this collection of cells appears quite flat, while the hierarchical clustering is able to unveil the relations between cell types. Overall, this analysis demonstrates that AWST leads to biologically interpretable results with a simple hierarchical clustering procedure in single-cell data.

Discussion
Data transformations are an important step prior to unsupervised clustering. However, not much attention has been paid to the effects of standardization on downstream results. We have presented a novel data transformation, AWST, with the aim of reducing the noise and regularizing the influence of outlying genes in the clustering of RNA-seq samples.
AWST is a per-sample transformation that comprises two main steps: the standardization step exploits the log-normal distribution to bring the distribution of each sample on a common location and scale; the smoothing step employs an asymmetric winsorization to push the lowly expressed genes to a minimum value and to bound the highly expressed genes to a maximum.
Our optional filtering procedure involves a heterogeneity index, Shannon's entropy, that does not need a preliminary location estimate, unlike the variance and its robust alternatives. This strategy is possible because of the smoothing step, which maps the values of all samples into the same interval.
Applying our transformation to synthetic and real datasets, including bulk and single-cell RNA-seq experiments, shows that AWST leads to biologically meaningful clusters. We have used synthetic data to compare AWST with two recent protocols, and two more pre-processings, based on the Hart transformation and on FPKM, respectively. AWST yields the best results with any clustering method, while the competitors' performances depend on the downstream clustering algorithm.
In a LGG real dataset, AWST followed by hierarchical clustering with Ward's linkage and Euclidean distance uncovers cancer subtypes previously discovered in independent, sometimes multi-omic, experiments.
Applied to single-cell data, AWST leads to compact, biologically meaningful clusters, consistent with independent external data.
Overall, AWST leads to better and more robust sample segregation allowing to move from standard statistical discovery to a more in-depth investigation of sub-groups within clusters on the way to precision medicine. Furthermore, AWST allows to use statistical coherent methods, such as the Wards linkage method applied to Euclidean distance, (both related to the ANOVA setup), which is the only guarantee for the trustworthiness of the results in unsupervised settings.