Deep Learning Applications in Single-Cell Omics Data Analysis

Traditional bulk sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity, and rare populations. The single-cell resolution, however, enhances our understanding of complex biological systems and diseases, such as cancer, the immune system, and chronic diseases. However, the single-cell technologies generate massive amounts of data that are often high-dimensional, sparse, and complex, thus making analysis with traditional computational approaches difficult and unfeasible. To tackle these challenges, many are turning to deep learning (DL) methods as potential alternatives to the conventional machine learning (ML) algorithms for single-cell studies. DL is a branch of ML capable of extracting high-level features from raw inputs in multiple stages. Compared to traditional ML, DL models have provided significant improvements across many domains and applications. In this work, we examine DL applications in genomics, transcriptomics, spatial transcriptomics, and multi-omics integration, and address whether DL techniques will prove to be advantageous or if the single-cell omics domain poses unique challenges. Through a systematic literature review, we find that DL has not yet revolutionized or addressed the most pressing challenges of the single-cell omics field. However, using DL models for single-cell omics has shown promising results (in many cases outperforming the previous state-of-the-art models) in data preprocessing and downstream analysis, but many DL models still lack the needed biological interpretability. Although developments of DL algorithms for single-cell omics have generally been gradual, recent advances reveal that DL can offer valuable resources in fast-tracking and advancing research in single-cell.


Introduction
Since single-cell sequencing (sc-seq) was highlighted as the "Method of the Year" in 2013 (Fritzsch, Dusny et al. 2012), sequencing individual cells at the single-cell resolution has become the norm for studying cell-to-cell heterogeneity. RNA and DNA single-cell measurements, and recently, epigenetic and protein levels, stratify cells at the highest possible resolution. That is, single-cell RNA sequencing (scRNA-seq) makes it possible to measure transcriptome-wide gene expression at the single-cell level. Such resolution enables researchers to distinguish different cell types based on their characteristics [see (Anchang, Hart et al. 2016, Haber, Biton et al. 2017, Tabula Muris, Overall et al. 2018, Han, Zhou et al. 2020], organize cell populations, and identify cells transitioning between states. Such analyses provide a much better picture of tissues and the underlying dynamics in the development of an organism, which in turn allow for delineating intrapopulation heterogeneities that had previously been perceived as homogeneous by bulk RNA sequencing. Similarly, singlecell DNA sequencing (scDNA-seq) studies can reveal somatic clonal structures [e.g., in cancer, see (Roth, Khattra et al. 2014, Zafar, Navin et al. 2019], thereby helping to monitor cell lineage development and providing insight into evolutionary mechanisms that function on somatic mutations. The prospects resulting from sc-seq are tremendous: it is now possible to re-evaluate hypotheses on differences between predefined sample groups at a single-cell level regardless of samples being disease subtypes, treatment groups, or morphologically distinct cell types. As a result, in recent years, enthusiasm about the possibility of screening the basic units of life's genetic material has continued to expand. Human Cell Atlas ) is a prominent example: an effort to sequence the various cell types and cellular states that make up a human being. Encouraged by the great potential of single-cell investigation of DNA and RNA, there has been substantial growth in the development of related experimental technologies. In particular, the advent of microfluidics techniques and combinatorial indexing strategies (Hosokawa, Nishikawa et al. 2017, Zilionis, Nainys et al. 2017) has resulted in routinely sequencing hundreds of thousands of cells in a single experiment. This growth has also allowed a recent publication to analyze millions of cells at once (Cao, Spielmann et al. 2019). More large-scale sc-seq datasets are becoming accessible worldwide (with tens of thousands of cells), constituting a data explosion in single-cell analysis platforms. The continuous growth of the scale and quantity of available sc-seq data has raised significant questions: 1) How do we correctly interpret and analyze the increasing complexity of sc-seq datasets? 2) How can different types of data sets (mentioned above) provide a deeper understanding of the underlying biological dynamics for a specific condition? and 3) how can the acquired information be changed into practical applications in medicine, ranging from rapid and precise diagnoses to accurate medicine and targeted preventions. As we face the challenges of a rise in chronic diseases, aging populations, and limited resources, a transformation towards intelligent analysis, interpretation, and understanding of complex data is essential. In this paradigm shift, the rapidly emerging field of ML is central .
ML is the study of models which can learn from data without the need for an explicit set of instructions. Simultaneously with biomedical advancements of the past decade, there has been a surge in the development and application of ML algorithms, prominently led by advances in DL. Some of the earliest DL algorithms developed were intended to computationally model our brains' learning process, therefore being called "Artificial Neural Networks" (ANN). DL models often consist of many processing layers (with many nodes in each layer), which enable them to learn a representation of data with several levels of abstraction. The recent improvements in computational hardware have made training DL models feasible, resulting in successful and revolutionary applications of such models across many domains.
This review aims to discuss DL applications in sc-seq analysis and elaborate on their instrumental role in improving sc-seq data processing and analysis. We first introduce some key concepts in DL, which enable us to discuss their applications in the sc-seq field. Because of the high technical noise and complexity of sc-seq data, we review appropriate techniques for analyzing such data, which ensure robustness and reproducibility of results. Finally, we conclude this work by discussing possible developments, highlighting both emerging obstacles and opportunities to advance the rapidly evolving field of single-cell omics.

Feed Forward Neural Network
Due to the nature of sc-seq data, Feed Forward Neural Networks (FFNNs) are the most common architecture used as the core of many existing models for single-cell omics. FFNNs, the quintessential example of Artificial Neural Networks (ANNs), are composed of interconnected nodes ("artificial neurons") that resemble and mimic the brain's neuronal functions. The connections between the nodes ("edges") strengthen or weaken as the learning progresses. Figure 1(a) illustrates an FFNN with an input layer (which senses and detects signals within the environment), three hidden layers (that process the signal sent by the input layer), and an output layer (the response to a signal or stimulus). FNNN can be viewed as a function that maps inputs from sc-seq space to an output space, often with fewer dimensions than the input space. This mapping is composed of simpler functions, each of which provides a different mathematical representation of the input. The true success of FFNN was realized when the capacity of such models was increased through the use of non-linear activation layers (such as Rectified Linear Units (Nair and Hinton 2010) and Hyperbolic Tangent) and adding more layers (depth).
These architectural choices allow neural networks to approximate highly non-linear and complex functions (theoretically being "universal approximators").
ANNs learn in the same manner as humans: by interacting with, and responding to, various stimuli within a local environment. FFNNs consist of numerous internal adjustable parameters (often on the order of hundreds of millions) that determine the input-output function of the machine. The objective of FFNNs (and all ANNs) is to minimize a loss function that teaches the network to perform specific tasks. This optimization is done via learning an optimal set of parameters (weights) that minimize the objective function. In FFNNs, all neurons are fully connected; thus, the model must learn the In this network, the raw input is shown as X, the weight tensors are denoted Wi, activation layers are αi and the bias nodes are shown in blue nodes marked by b. The positive and negative weights are shown as blue and red edges, respectively, where a smaller numerical value corresponds to more transparent edges and vice versa. (b) An example of CNN architecture for classification. In this hypothetical architecture, the model feeds the inputs through the three stages of a CNN (with non-linear activation not depicted) to extract features. As is often the case, the output of the CNN is inputted to a fully connected neural network for classification. (c) An RNN depicted with its training flow, and an unrolled version showing the timestepdependent inputs, hidden state, and outputs are marked. (d) An illustration of a denoising autoencoder. Autoencoders (AE) have an encoder network, which tries to extract the most critical features and map them to a latent space (often of a much smaller dimension than the input data). The decoder network takes in the latent vector produced by the encoder (denoted as z) and maps this representation back to the original input dimension. (e) The architecture of a Variational AE (VAE) that aims at generating synthetic scRNA-seq data using the gene expression values. (f) A depiction of generative adversarial network for generating in-silico scRNA-seq data. In this illustration, the scRNA-seq count matrix is converted to images that are used as real samples.
optimal contribution of each node, which determine the final output of the model. That is, the direct connection between any two nodes can have positive or negative weights (refer to Fig. 1(a) for a visualization of this concept). The output values are sequentially computed within the network's hidden layers, with each input vector taking in the previous layer's output until reaching the final output layer (this process is known as the forward propagation). This multi-stage propagation of features through the layers results in an abstract data representation attentive to small details while ignoring irrelevant information (Lin, Yang et al. 2019).
Initially assigned random values, the model weights are optimized throughout the training based on the loss function. Conventionally, the model optimization is done through gradient descent performed backwards in the network, which consists of two components: a numerical algorithm for efficient computation of the chain rule for derivatives (backpropagation) and an optimizer (e.g., Stochastic Gradient Descent or Adam (Kingma and Ba 2015). The optimizer is an algorithm that performs gradient descent, while backpropagation is an algorithm for computing the expression for gradients during the backward pass of the model. Deep neural networks (DNNs) are a subset of ANNs with more hidden layers between the input and output layers (usually greater than or equal to three hidden layers). This depth enables the model to learn complex nonlinear functions and allows the machine to learn a multi-step computer program (Goodfellow, Bengio et al. 2016). Several studies have shown that an increase in the network's capacity (e.g., through depth or nonlinearity) has resulted in better performance in recognition and prediction tasks (Nair and Hinton 2010, Sun, Liang et al. 2015, Bahdanau, Chorowski et al. 2016).

Common Algorithms in Deep Learning
In the following sections, we provide an overview of several frequently-used DL models for single-cell applications.

Recurrent Neural Network (RNNs)
RNNs (Rumelhart, Hinton et al. 1986) are used for processing sequential data, including natural language and time series. RNNs process sequential inputs one at a time and implicitly maintain a history of previous elements in the input sequence. A typical architecture for RNNs is presented in Fig 1(c). Similar to FFNNs, RNNs learn by propagating the gradients of each hidden state's inputs at discrete times. This process becomes more intuitive if we consider the outputs of the hidden units at various time iterations as if they were the outputs of different neurons in a deep multilayer network.
Due to the sequential nature of RNNs, the backpropagation of gradients would shrink or grow at each time step, causing the gradients to vanish or blow up, which makes RNNs notoriously hard to train. However, when these issues are averted (via gradient clipping or other techniques), RNNs are powerful models and gain state-of-the-art capabilities in many domains, such as the process of natural language. The training challenges combined with the nature of sc-seq data have resulted in fewer developments of RNNs for single-cell analysis. However, recently some studies have used RNNs and Long Short-Term Memory (a variant of RNNs) used for predicting cell types and cell motility (Hubel andWiesel 1962, Kimmel, Brack et al. 2019).

Convolutional Neural Network (CNNs)
CNNs (LeCun and Bengio 1995) are specialized types of networks that use convolution (the mathematical operation) instead of tensor multiplication (which is done in FFNNs) in at least one of their layers. This convolution operation makes CNNs ideal for processing data with a grid-like topology (images are the quintessential example of such datasets). Compared to other ANNs, CNNs have three key benefits: (i) sparse interactions, (ii) shared weights, and (iii) equivariant representations (LeCun and Bengio 1995). CNNs have been effectively used in many applications in computer vision and time-series analysis but are not utilized as frequently for sc-seq applications (since sc-seq datasets do not have a grid-like structure). However, some studies, such as Xu et al. (Xu, Zhang et al. 2020) have used CNNs after converting sc-seq data to images, which have shown promising results.
The architecture of CNNs was inspired by the organization of the visual cortex (LGN-V1-V2-V4-IT) hierarchy in the visual cortex ventral pathway (Hubel and Wiesel 1962), with the connectivity pattern trying to resemble the neural connections of our brains. A typical CNN architectural block is composed of a sequence of layers (usually three) which include a convolutional layer (affine transform), a detector stage (non-linear transformation), and a pooling layer. The learning unit of a convolutional layer is called a filter or kernel. Each convolutional filter is a matrix, typically of small dimensions (e.g. 3x3), composed of a set of weights that acts as an object detector and is continuously calibrated during the learning process. The goal of CNNs is to learn an optimal set of filters (weights) which can detected the needed features for a specific task (e.g. image classification). The result of the convolution between the input data and the filter's weights is named a feature map. Once a feature map is available, each value of this map is passed through a non-linearity (e.g. ReLU). The output of a convolutional layer is made of as many stacked feature maps as the number of filters present within the layer. There are two key ideas behind this design: first, in data with grid-like topology (such as images), local neighbors have highly correlated information. Second, equivariance to translation can be obtained if units at different locations share weights. In other words, parameter sharing in CNNs allows for the detection of features regardless of the locations that they appear in. An example of this would be detecting a car. In a dataset, a car could appear at any position in a 2D image, but the network should be able to detect it regardless of the specific coordinates (Zhao, Zheng et al. 2019).
One way of achieving equivariance to translation is to utilize pooling layers. In the pooling layers, we use the outputs of the detector stage (at certain locations) to calculate a summary statistic for a rectangular window of values (e.g., calculating the mean of a 3x3 patch). There are many pooling operations, with the common ones being max-pooling (taking the maximum value of a rectangular neighborhood), mean-pooling (taking the average), and L2 norm (taking the norm). In all cases, rectangular patches from one or several feature maps are inputted to the pooling layer, where semantically similar features are merged into one. Pooling decreases the dimension of learned representations, and makes the model insensitive to small shifts and distortions (LeCun, Bengio et al. 2015). CNNs typically have an ensemble of stacked convolution layers, nonlinearity, and pooling layers, followed by fully connected layers that produce the final output of the network. Fig. 1(b) illustrates an example of a typical CNN architecture. The backpropagation of gradients through CNNs is analogous to DNNs, enabling the model to learn an optimal set of filters.

Autoencoders (AEs)
AEs are neural networks that aim to reconstruct (or copy) the original input via a non-trivial mapping. Conventional AEs have an "hour-glass" architecture with two mirroring networks: an encoder and a decoder. The encoder's task is to map the input data to a latent space, often of a much smaller dimension than the original input space. The encoder is responsible for data compression and feature extraction, forming the narrowing part of the hourglass architecture (see Fig. 1(d)). The output of the encoder network (latent vector) will contain the most important features present in the data in a compressed form. Conversely, the decoder is tasked with mapping the latent vector back to the original input dimension and reconstructing the original data. In the ideal case, the decoder's output will be an exact copy of the training sample.
AEs were traditionally used for dimensionality reduction and denoising, trained by minimizing a mean squared error (MSE) objective between the input data and the reconstructed sample (output of the decoder). Fig. 1(d) depicts an example of a denoising AE. Over time, the AE framework has been generalized to stochastic mappings of an encoder distribution and a decoder distribution. A well-known example of such generalization is Variational Autoencoders (VAEs) (Kingma and Ba 2015), where using the same hour-glass architecture, one can generate new samples drawn from an approximated posterior for an assumed prior distribution. Both traditional AEs and VAEs have practical applications in many biological fields. flexibility to learn any distribution, requiring no assumptions on the prior distribution, and no limitations on the size of the latent space.
Despite these advantages, GANs are notoriously difficult to train since achieving Nash equilibrium for G and D is extremely difficult (Wang, She et al. 2021). Another drawback of GANs is vanishing gradients, which occurs if D learns the distinction between real and generated data well too quickly, prohibiting G from training propery (Arjovsky, Chintala et al. 2017). Another problem with GANs is "mode collapse," which occurs when G produces only a small number of outputs that potentially trick D. This happens when G has trained to map many noise vectors to the same output that D recognizes as real data. Quantifying how much GANs have learned the distribution of real data is often difficult, hence one of the most common methods of assessing GANs is to evaluate the output directly (Larsen, Sønderby et al. 2016), which could be laborious. Even though certain GAN variations have been proposed to reduce vanishing gradients and mode collapse, (e.g., Wasserstein-GANs (WGANs) (Arjovsky, Chintala et al. 2017) and Unrolled-GANs (Metz, Poole et al. 2016), the convergence of GANs is still a big issue.

Deep Learning in Single-Cell Transcriptomics
Single-cell RNA sequencing (scRNA-seq) has improved our understanding of biological processes substantially in recent years. Researchers have shown the potentials of single-cell (SC) transcriptomics through studying the cellular heterogeneity of many organisms, such as humans, mice, zebrafish, frogs, and planaria (Tabula Muris, Overall et al. 2018, Han, Zhou et al. 2020, and uncovering previously unknown cell populations (Montoro, Haber et al. 2018, Plass, Solana et al. 2018. Moreover, these studies have also highlighted the need for better computational methods which can facilitate the analysis of large and complex scRNA-seq datasets. In the following sections, we provide an overview of existing computational approaches for the various stages of scRNA-seq analysis (summarized in Fig. 2).

Pre-processing and quality control
Cell barcodes are designed to delineate various cell populations present within a sequenced sample. However, barcodes can mistakenly tag several cells (doublet) or not tag any cells at all (empty droplet/well), which prompts the quality control (QC) step in scRNA-seq analysis. Many raw data pre-processing pipelines, including Cell Ranger (Zheng, Terry et al. 2017), indrops (Klein, Mazutis et al. 2015), SEQC (Azizi, Carr et al. 2018), and z-unique molecular identifiers (zUMIs) (Parekh, Ziegenhain et al. 2018), can perform QC. The dimension of the count matrix generated by sequencing technologies depends on the the number of barcodes and transcripts. Though the noise rate in measurements varies across reads and count data, standard research pipelines very often emply the same processing techniques (Lafzi, Moutinho et al. 2018).
Despite scRNA-seq data's richness which can offer significant and deeper insights, the data's complexity and noise are far higher than traditional bulk RNA-seq, making it challenging to process the raw data for downstream analysis. Unwanted variations such as biases, artifacts, etc., require extensive QC and normalization efforts (Jiang, Thomson et al. 2016). The number of counts per barcode (count depth), the number of genes per barcode, and the fraction of counts from mitochondrial genes per barcode are three QC covariates widely used in the QC step (Griffiths, Scialdone et al. 2018). On the other hand, other experimental factors (such as damaging the sample during dissociation) could result in low-quality scRNA-seq libraries, which can yield erroneous findings in downstream analyses. Currently, there is an unmet need for developing more efficient and accurate methods for filtering low-quality cells during the library preparation.
Given the limited number of studies regarding pre-processing and quality control, we will focus on the DL applications that are most related to normalization, data correction, and downstream analysis.

Normalization
Normalization is a crucial first step in pre-processing scRNA-seq expression data to address the constraints caused by low input content, or the different types of systematic measuring biases (Bacher, Chu et al. 2017). Normalization aims to detect and remove changes in measurements between samples and features (e.g., genes) caused by technical artifacts or unintended biological effects (e.g., batch effects) (Hogan, Courtier et al. 2019). Methods designed for normalizing bulk RNAseq and microarray data are often used to normalize scRNA-seq data. However, these techniques often ignore essential aspects of scRNA-seq results (Hogan, Courtier et al. 2019). For scRNA-seq data, a few families of normalization techniques have been developed, such as scaling techniques (Lun, Bach et al. 2016), regression-based techniques for identified nuisance factors (Buettner, Natarajan et al. 2015, Bacher, Chu et al. 2017, and techniques based on spike-in sequences from the External RNA Controls Consortium (ERCC) (Ding, Zheng et al. 2015, Vallejos, Marioni et al. 2015. However, these methods are specific to certain experiments and can not be applied to all research designs and experimental protocols. Although some DL-based methods have been proposed to generalize the normalization stage, accounting for technical noise of scRNAseq data still remains a challenge and an active area of research within the field (Zheng and Wang 2019).

Data correction
Although normalization aims to address the noise and bias in the data, normalized data can still contain unexpected variability. These additional technical and biological variables, such as batch, dropout and cell cycle effects are accounted for during the "data correction" stage, which depends on the downstream analysis (Luecken and Theis 2019). In addition, it is a recommended practice to address biological and technical covariates separately (Luecken and Theis 2019), since they serve different purposes. Given the mentioned subtleties, designing DL models that can address most of these challenges is difficult. Therefore, there are currently no DL models that are widely used for data correction within the field.

Dropout
Compared to bulk RNA-seq, scRNA-seq datasets are noisy and sparse and pose unique nuances such as "dropout" [which is one of the most significant issues in this field (Kharchenko, Silberstein et al. 2014, Gong, Kwak et al. 2018]. Dropout occurs when a gene is observed at a moderate or high expression level in one cell but is not detected in another cell (Qiu 2020). Dropout events can occur during library preparation (e.g. extremely low levels of mRNA in single cells) or due to biological properties (the stochastic aspect of gene expression in multiple cells) (Ran, Zhang et al. 2019). Additionally, shorter genes have lower counts and a greater dropout rate (Zappia, Phipson et al. 2017). Overall, a low RNA capture rate results in the inability of detecting an expressed gene, leading to a "false" zero, known as a dropout event. Furthermore, it has been suggested that sometimes near-zero expression measurements can also be dropouts (Lin, Troup et al. 2017). Dropout events will introduce technical variability and noise, adding an extra layer of difficulty in analyzing scRNA-seq (Sengupta, Rayan et al. 2016), and downstream analyses such as clustering and pseudo-time reconstruction (Arisdakessian, Poirion et al. 2019).
It is essential to understand the difference between "false" and "true" zero counts. True zero counts mean that a gene is not expressed in a particular cell type, indicating true cell-type-specific expression (Eraslan, Simon et al. 2019). Hence, it is important to note that zeros in scRNA-seq data do not necessarily translate to missing values and must remain in the data. However, the false zeros (missing values) must be imputed to further improve the analysis. The missing values are replaced with either random values or by an imputation method (Eraslan, Simon et al. 2019). Imputation approaches designed for bulk RNA-seq data may not be suitable for scRNA-seq data for multiple reasons, mainly due to scRNA-seq's heterogeneity and dropouts. ScRNA-seq has much higher cell-level heterogeneity than bulk RNA-seq data; scRNA-seq has cell-level gene expression data while bulk RNA-seq data represents the averaged gene expression of the cell population. Additionally, the number of missing values in bulk RNA-seq data is much lower compared to scRNA-seq (Gong, Kwak et al. 2018). Given these factors and the non-trivial difference between true and false zero counts, classic imputation approaches with specified missing values are often not appropriate for scRNA-seq data, and scRNA-seqspecific dropout imputation methods are required.
Current scRNA-seq imputation methods can be divided into two groups: (i) those that change all gene expression levels, such as Markov Affinity-based Graph Imputation of Cells (MAGIC) (van Dijk, Nainys et al. 2017) and single-cell analysis via expression recovery (SAVER) (Huang, Wang et al. 2018), and (ii) methods that impute drop-out events (zero or nearzero counts) alone, such as scImpute , DrImpute (Gong, Kwak et al. 2018), and LSImpute (Moussa and Măndoiu 2019). These techniques can fail to account for the non-linearity of the data's count structure. Moreover, as larger scRNA-seq datasets become available and common, imputation methods should scale to millions of cells. However, many of the earlier models are either incapable of or very slow at processing datasets of larger size (tens of thousands or more) (Eraslan, Simon et al. 2019). As a result, many have resorted to designing DL-based approaches to combat these challenges, both on the technical and efficiency fronts.
Most of DL algorithms for imputing drop-out events are based on AEs. For example, in 2018, Talwar et al. proposed AutoImpute, a technique for retrieving the whole gene expression matrix using overcomplete AEs to impute the dropouts. AutoImpute learns the underlying distribution of the input scRNA-seq data and imputes missing values based on the learned distribution, with minor modifications to the biologically silent gene expression values. Through expanding the expression profiles into a high-dimensional latent space, AutoImpute learns the underlying distribution and patterns of gene expression in single cells and reconstructs an imputed model of the expression matrix. At the time, Talwar et al. claimed that their system was the only model that could perform imputation on the largest of the nine datasets they studied (68K PBMC, which contains ~68,000 cells), without running out of memory (Talwar, Mongia et al. 2018).
In another study, Eraslan et al. proposed the Deep Count AE network (DCA). DCA uses a negative binomial noise model both with and without zero-inflation to account for the count distribution and overdispersion while capturing nonlinear gene-gene dependencies. Since their approach scales linearly with the number of cells, DCA can be used on datasets with millions of cells. DCA also depends on gene similarities; using simulated and true datasets, DCA denoising enhances several traditional scRNA-seq data analyses. One of the key benefits of DCA is that it only requires the user to define the noise model. Current scRNA-seq approaches depend on a variety of hypotheses and often use standard count distributions, such as zero-inflated negative binomial. DCA increases biological exploration by outperforming current data imputation approaches in terms of quality and time. Overall, DCA calculates the "dropout probability" of a zeroexpression value due to scRNA-seq dropout and imputes the zeros only when the probability is high. Consequently, while DCA effectively detects true zeros, it can be biased when dealing with nonzero values (Eraslan, Simon et al. 2019). Badsha et al. propose TRANSfer learning with LATE (TRANSLATE) (Badsha, Li et al. 2020), a DL model for computing zeros in scRNA-seq datasets which are extremely sparse. Their nonparametric approach is based on AEs and builds on their previous method, Learning with AuToEncoder (LATE). The key presumption in LATE and TRANSLATE is that all zeros in the scRNA-seq data are missing values. In most cases, their approach achieves lower mean squared error, restores nonlinear gene-gene interactions, and allows for improved cell type separation. Both LATE and TRANSLATE are also very scalable, and when using a GPU, they can train on over a million cells in a few hours. TRANSLATE has shown better performance on inferring technical zeros than other techniques, while DCA is better at inferring biological zeros than TRANSLATE.
Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding (SAUCIE) (Amodio, Van Dijk et al. 2019) is a regularized AE that denoises and imputes data using the reconstructed signal from the AE. Despite the noise in the input data, SAUCIE can restore the significant relationships across genes, leading to better expression profiles which can improve downstream analyses such as differential gene expression (Amodio, Van Dijk et al. 2019). ScScope (Deng, Bao et al. 2019) is a recurrent AE network that iteratively handles imputation by employing a recurrent network layer; taking the time recurrence of ScScope to one (i.e. T=1) will reduce the model to a traditional AE. Given that ScScope is a modification of traditional AEs, its runtime is similar to other AE-based models (Deng, Bao et al. 2019).
A few non-AE-based models have also been developed for imputation and denoising of scRNA-seq data. DeepImpute (Arisdakessian, Poirion et al. 2019) uses several sub-neural networks to impute groups of target genes using signals (genes) that are strongly associated with the target genes. Arisdakessian et al. demonstrate that DeepImpute has a better performance than DCA, contributing the advantages to their divide-and-conquer approach (Arisdakessian, Poirion et al. 2019).
Mongia et al. (Mongia, Sengupta et al. 2020) introduced Deep Matrix Completion (deepMC), an imputation method based on deep matrix factorization for missing values in scRNA-seq data that utilizes a feed backward neural network. In most of their experiments, deepMC outperformed other existing imputation methods while not requiring any assumption on the prior distribution for the gene expression. We predict that deepMC will be the preferred initial approach for imputing scRNA-seq data, given the superior performance and simplicity of the model. (Lopez, Regier et al. 2018). ScVI is based on a hierarchical Bayesian model and uses a DNN to define the conditional probabilities, assuming either a negative binomial or a zero-inflated negative binomial distribution (Lopez, Regier et al. 2018). Lopez et al. show that scVI can accurately recover gene expression signals and impute the zero-valued entries, potentially enhancing the downstream analyses without adding any artifacts or false signals.

Single-cell variational inference (scVI) is another DNN algorithm introduced by Lopez et al
Recently, Patruno et al. (Patruno, Maspero et al. 2020) compared 19 denoising and imputation methods, based on numerous experimental scenarios such as recovery of true expression profiles, characterization of cell similarity, identification of differentially expressed genes, and computation time. Their results showed that ENHANCE (Expression deNoising Heuristic using Aggregation of Neighbors and principal Component Extraction), MAGIC, SAVER, and SAVER-X offer the best overall results when considering efficiency, accuracy and robustness for the investigated tasks (Patruno, Maspero et al. 2020).
It is important to note that traditional methods, despite their current success, are not well-suited for large-scale scRNAseq studies. As larger scRNA-seq datasets become the norm, we anticipate that DL-based models will prove to be advantageous. Therefore, more work is required to build upon the existing DL methods for imputing dropout effects and better managing technical zeros while retaining biological zeros.

Batch effects correction
When samples are conducted in separate batches, the term 'batch effect" is used to describe the variation caused by technical effects. Different types of sequencing machines or experimental platforms, laboratory environments, different sample sources, and even technicians who perform the experiments can cause batch effects (Fei and Yu 2020). Removing and accounting for batch effects is often helpful and recommended, however, the success varies significantly across different studies. For example, batch effect removal on bulk RNA-seq data from Encyclopedia of DNA Elements (ENCODE) human and mouse tissues (Lin, Lin et al. 2014) is a recommended standard data preparation step. Batch effect correction has been an active area of research since the microarray time. Johnson et al. suggested parametric and nonparametric empirical Bayes frameworks for adjusting the data for batch effects removal (Johnson, Li et al. 2007). In recent years and with an increased level of complexity in sequencing datasets, more involved batch effect correction methods have been proposed and used (Fei and Yu 2020). However, a majority of the existing approaches require biological group expertise for each observation and were originally designed for bulk or microarray RNA-seq data. Given the heterogeneity present within scRNA-seq date, these earlier techniques are not well suited for single-cell analysis in certain cases (Luo and Wei 2019). Batch effects in scRNA-seq data may have a substantial impact on downstream data analysis, impacting the accuracy of biological measurements and ultimately contributing to erroneous conclusions (Büttner, Miao et al. 2019). Therefore, alternative batch effect correction techniques for scRNA-seq data have been developed to address the specific needs of single-cell datasets.
Several statistical methods, including linear regression models like ComBat (Johnson, Li et al. 2007), and nonlinear models like Seurat's canonical correlation analysis (CCA) (Zhang, Wu et al. 2019) or scBatch (Fei and Yu 2020), have been designed to eliminate or minimize scRNA-seq batch effects while aiming to maintain biological heterogeneity of scRNA-seq data. Additionally, some differential testing frameworks such as Linear Models for Microarray (limma) (Ritchie, Phipson et al. 2015), Model-based Analysis of Single-cell Transcriptomics (MAST) (Finak, McDavid et al. 2015), and DESeq2 (Love, Huber et al. 2014) already integrate the batch effect as a covariate in model design.
Haghverdi et al. (Haghverdi, Lun et al. 2018) developed a novel and efficient batch correction method for single-cell data which detects cell mappings between datasets, and subsequently reconstructs the data in a shared space. To create relations between two datasets, the algorithm first identifies mutual nearest neighbors (MNNs). The translation vector is then computed from the resulting list of paired cells (or MNNs) to align the data sets into a shared space. The benefit of this method is that it produces a normalized gene expression matrix, which can be used in downstream analysis and offer an effective correction in the face of compositional variations between batches (Haghverdi, Lun et al. 2018). Scanorama (Hie, Bryson et al. 2019) and batch balanced k-nearest neighbors (BBKNN) (Polański, Young et al. 2020) are two other approaches that look for MNNs in reduced-dimension spaces, and use them in a similarity-weighted way to direct batch integration.
Hie et al. (Hie, Bryson et al. 2019) proposed Scanorama which can combine and remove batch effects from heterogeneous scRNA-seq studies by identifying and merging common cell types across all pairs in a dataset. Using a variety of existing tools, Scanorama batch-corrected output can be used for downstream tasks, such as classify cluster-specific marker genes in differential expression analysis. Scanorama outperforms current methods for integrating heterogeneous datasets and it scales to millions of cells, allowing the identification of rare or new cell states through a variety of diseases and biological processes (Hie, Bryson et al. 2019).
Polańsky et al. (Polański, Young et al. 2020) developed BBKNN, a fast graph-based algorithm that removes batch effects through linking analogous cells in different batches. BBKNN is a simple, rapid, and lightweight batch alignment tool, and its output can be directly use for dimensionality reduction. BBKNN's default approximate neighbor mode scales linearly with the size of datasets and remains consistently faster (by one or two orders of magnitude) when compared to other existing techniques (Polański, Young et al. 2020).
Recently, there has been considerable progress in using DL for batch effect corrections. Residual Neural Networks (ResNets) (He, Zhang et al. 2016) and AEs are two of the most commonly used DL-based batch correction approach in scRNA-seq analysis. ResNets are a form of deep neural network that make a direct connection between the input of a layer (or network) and the outputs, often through an addition operation. Shaham et al.  suggested a non-linear batch effect correction approach based on a distribution-matching ResNet. Their approach focuses on reducing the Maximum Mean Discrepancy (MMD) between two multivariate replication distributions that were measured in separate batches. Shaham et al. applied their methodology to batch correction of scRNA-seq and mass cytometry datasets, finding that their model can overcome batch effects without altering the biological properties of each sample ).
Li et al. (Li, Wang et al. 2020) presented deep embedding algorithm for single-cell clustering (DESC), an unsupervised DL algorithm for "soft" single-cell clustering which can also remove batch effects. DESC learns a non-linear mapping function from the initial scRNA-seq data space to a low-dimensional feature space using a DNN, iteratively optimizing a clustering objective function. This sequential process transfers each cell to the closest cluster and attempts to account for biological and technical variability across different clusters. Li et al. demonstrated that DESC can eliminate the technical batch effect more accurately than MNN-based methods while better preserving true biological differences within closely related immune cells (Li, Wang et al. 2020).
In a prior study, Shaham (Shaham 2018) proposed batch effect correction through batch-free encoding using an adversarial VAE. Shaham utilizesd the adversarial training to achieve data encoding that corresponded exclusively to a subject's intrinsic biological state, as well as to enforce accurate reconstruction of the input data. This approach results in maintaining the true biological patterns expressed in the data and minimizing the significant biological information loss (Shaham 2018).

Wang et al. introduced Batch Effect ReMoval Using Deep
Autoencoders (BERMUDA) (Wang, Johnson et al. 2019), an unsupervised framework for correcting batch effect in scRNA-seq data across different batches. BERMUDA combines separate batches of scRNA-seq data with completely different cell population compositions and amplifies biological signals by passing information between batches. Most nearest neighbor-based models can manage variations in cell populations between batches when such differences are significant. However, BERMUDA was developed with an emphasis on scRNAseq data with distinct cell populations in mind, focusing on the similarities between cell clusters rather than significat variations. Altogether, a rapidly expanding number of general DL methods for batch effects correction in biological datasets represent new ways for eliminating batch effects in biological datasets.

Dimensionality reduction
Dimensionality reduction is a crucial step in visualizing scRNA-seq data, since typical datasets contain thousands of genes as features (dimensions) (Wang and Gu 2018). The most common dimensionality reduction techniques used for scRNA-seq are principal component analysis The integrated data is used as a reference, and the process is repeated with each new batch of data. This figure has been reused with permission from authors (Haghverdi, Lun et al. 2018).
In low-dimensional spaces, linear projection methods like PCA traditionally cannot depict the complex structures of single-cell data. On the other hand, nonlinear dimension reduction techniques like t-SNE and UMAP, have been shown to be effective in a variety of applications and are commonly used in single-cell data processing (Ding, Condon et al. 2018). These methods also have some drawbacks, such as lacking robustness to random sampling, inability to capture global structures while concentrating on local data structures, parameter sensitivity, and high computational cost . Several DL techniques for reducing the dimensionality of scRNA-seq data have recently been developed. Here we focus on the ones that are based on VAEs or AEs, which are more commonly used in the field.
Ding et al. (Ding, Condon et al. 2018) proposed a VAE-based model (called scvis) to learn a parametric transformation from a high-dimensional space to a low-dimensional embedding, ultimately learning the estimated posterior distributions of low-dimensional latent variables. Compared to common techniques (e.g. t-SNE), scvis can (i) better obtain the global structure of the data, (ii) provide greater interpretability, and (iii) be more robust to noise or unclear measurements. Ding et al. demonstrated that scvis is a promising tool for studying large-scale and high-resolution single cell populations (Ding, Condon et al. 2018). However, according to Becht et al. (Becht, McInnes et al. 2019), the runtime of scvis is long, particularly for dimensionality reduction, and it appears to be less effective at separating cell populations.
In another work, Wang et al. (Wang and Gu 2018) proposed a method for unsupervised dimensionality reduction and visualization of scRNA-seq using deep VAEs, called VAE for scRNA-seq data (VASC). VASC's architecture consists of the traditional encoder and decoder network of a VAE, with an addition of a zero-inflated layer that simulates dropout events. In comparison to current methods such as PCA, t-SNE, and Zero Inflated Factor Analysis (ZIFA) (Pierson and Yau 2015), VASC can identify nonlinear patterns present within the data, and has broader compatibility as well as better accuracy, particularly when sample sizes are larger (Wang and Gu 2018).
In 2020, Märtens et al. proposed BasisVAE (Märtens and Yau 2020) as a general-purpose approach for joint dimensionality reduction and clustering of features using a VAE. BasisVAE modified the traditional VAE decoder to incorporate a hierarchical Bayesian clustering prior, demonstrating how collapsed variational inference can identify sparse solutions when over-specifying K. (Märtens and Yau 2020).
Peng et al. (Amodio, Van Dijk et al. 2019) proposed an AE-based model that combines gene ontology (GO) and DNNs to achieve a low-dimensional representation of scRNA-seq data. Based on this idea, they proposed two innovative approaches for dimensionality reduction and clustering: an unsupervised technique called "Gene Ontology AutoEncoder" (GOAE) and a supervised technique called "Gene Ontology Neural Network" (GONN) for training their AE model and extracting the latent layer as low dimensional representation. Their findings show that by integrating prior information from GO, neural network clustering and interpretability can be enhanced and that they outperform the state-of-the-art dimensionality reduction approaches for scRNA-seq (Peng, Wang et al. 2019).
In a study by Armaki (Armacki 2018), the dimensionality reduction capabilities of VAE-and AE-based models were evaluated and benchmarked against principal component analysis. They found that the best approach for reducing the dimensionality of single-cell data was using AE-based models, while the more efficient VAEs performed worse in some respects than the linear PCA. One possible hypothesis could be that the prior used for modeling the latent space, which was Gaussian distribution, is not a good fit for single-cell data. A prior more befitting single-cell data (such as negative binomial distribution) could improve the performance of the VAE based model (Armacki 2018). Finally, there is always the endeavor of optimizing algorithms. As mentioned earlier, developing an efficient DL method for dimensional reduction of data is a necessary next step, since it can potentially improve the quality of lowerdimensional representations.

In-Silico Generation and Augmentation
Given limitations on scRNA-seq data availability and the importance of adequate sample sizes, in-silico data generation and augmentation offer a fast, reliable, and cheap solution. Synthetic data augmentation is a standard practice in various ML areas, such as text and image classification (Shorten and Khoshgoftaar 2019). With the advent of DL, traditional data augmentation techniques (such as geometric transformations or noise injection) are now being replaced with deep-learned generative models, primarily VAEs (Kingma and Welling 2013) and GANs (Goodfellow, Pouget-Abadie et al. 2014). In computational genomics, both GAN-and VAE-based models have shown promising results in generating omics data. Here, we focus on the recent methods introduced for generating realistic in-silico scRNA-seq.
Marouf et al. (Marouf, Machart et al. 2020) introduced two GAN-based models for scRNA-seq generation and augmentation called single-cell GAN (scGAN) and conditional scGAN (cscGAN); we collectively refer to these models as scGAN. At the time, scGAN outperformed all other state-of-the-art methods for generating and augmenting scRNA-seq data (Marouf, Machart et al. 2020). The success of scGAN was attributed to the Wasserstein-GAN (Arjovsky, Chintala et al. 2017) that learns the underlying manifold of scRNA-seq data, subsequently producing realistic never-seen-before samples. Marouf et al. showcase the power of scGAN by generating specific cell types that are almost indistinguishable from the real data and augmenting the dataset with the synthetic samples improved the classification of rare cell populations (Marouf, Machart et al. 2020).
In a related work, Heydari et al. (Heydari, Davalos et al. 2021) proposed a VAE-based in-silico scRNA-seq model that aimed at improving Marouf et al.'s training time, stability, and generation quality using only one framework (as opposed to two separate models). Heydari et al. proposed ACTIVA (Automated Cell-Type-informed Introspective Variational Autoencoder), which employs a single-stream adversarial VAE conditioned with cell-type information. The cell-type conditioning encourages ACTIVA to learn the distribution of all cell types in the dataset (including rare populations), which allows the model to generate specific cell types on demand. Heydari et al. showed that ACTIVA performs better or comparable to scGAN while training up to 17 times faster due to the design choices. Data generation and augmentation with both ACTIVA and scGAN can enhance scRNA-seq pipelines and analysis, such as benchmarking new algorithms, studying the accuracy of classifiers, and detecting marker genes. Both generative models will facilitate the analysis of smaller datasets, potentially reducing the number of patients and animals necessary in initial studies (Heydari, Davalos et al. 2021).

Downstream Analysis
Following pre-processing, downstream analysis methods are used to derive biological understandings and identify the underlying biological mechanism. For example, cell-type clusters are made up of cells with similar gene expression profiles; minor differences in gene expression between similar cells indicate continuous (differentiation) trajectories; or genes which expression profiles are correlated, signaling co-regulation (Zhang, Cui et al. 2021).

Clustering and cell annotation
A significant phase in the scRNA-seq study is to classify cell subpopulations and cluster them into biologically relevant entities (Yang, Liu et al. 2017). Li et al. presented DESC, an AE-based method for clustering scRNA-seq data with a self-training target distribution that can also denoise and potentially remove batch effects. In experiments conducted by Li et al. (Li, Wang et al. 2020), DESC attained a high clustering accuracy across the tested datasets compared to several existing methods. DESC also showed consistent performance in a variety of scenarios, and did not directly need the batch definition for batch effect correction. Given that Li et al. use a deep AE to reconstruct the input data, the latent space is not regularized with additional properties that could additionally help with clustering . As with most DL models, DESC can be trained on CPUs or GPUs.
Chen et al. proposed scAnCluster (Single-Cell Annotation and Clustering), an end-to-end supervised clustering and cell annotation framework that is built upon their previous unsupervised clustering work, namely scDMFK and scziDesk. ScDMFK algorithm (Single-Cell Data Clustering through Multinomial Modeling and Fuzzy K-Means Algorithm) combined deep AEs with statistical modeling. It proposed an adaptive fuzzy k-means algorithm to handle soft clustering while using multinomial distribution to describe the data structure and relying on a neural network to facilitate model parameter estimation ). More specifically, the AE in scAnCluster learns a low-dimensional representation of the data, and an adaptive fuzzy k-means algorithm with entropy regularization for soft clustering. On the other hand, ScziDesk aimed to learn a "cluster-friendly" lower-dimensional representation of the data. Many existing DL-based clustering algorithms for scRNA-seq do not consider distance and affinity constraints between similar cells. This prohibits such models to learn cluster friendly lower-dimensional representations of the data  ScDeepCluster does not pre-select informative genes as input data, making the model more computationally efficient, but decreasing the clustering accuracy ).
Peng et al. (Peng, Wang et al. 2019) developed a strategy for optimizing cell clustering based on global transcriptome profiles of genes. They used a combination of DNN and GO to reduce the dimensions of scRNA-seq data and improve clustering. Their supervised approach was based on a conventional neural network, while the unsupervised model utilized an AE. Their model consisted primarily of two main components: the choosing of important GO factors and the combination of GO terms with the DNN-based model (Peng, Wang et al. 2019). Single-Cell Variational Auto-Encoders (scVAE) (Grønbech, Vording et al. 2020) is another VAE-based model for downstream analysis scRNA-seq data. However, compared to other models, scVAE has the advantage of not requiring many of the traditional pre-processing steps since it uses the raw count data as input. ScVAE can accurately predict expected gene expression levels and a latent representation for each cell and is flexible to use the known scRNA-seq count distributions (such as Poisson or Negative Binomial) as its model assumption (Grønbech, Vording et al. 2020).

Cell-Cell communication analysis
In recent years, scRNA-seq has become a powerful tool for analyzing cell-cell communication in tissues. Intercellular communication controlled by ligand-receptor complexes is essential for coordinating a wide range of biological processes, including development, differentiation, and inflammation. Several algorithms have been proposed to carry out these analyses. These algorithms begin with a database of interacting molecular partners (such as ligand and receptor pairs) and predict a list of possible signaling pathways between types of cells based on their expression patterns. Although the findings of these studies may provide useful insights into the mechanisms of tissues, they can be difficult to visualize and analyze using current algorithms (Tian, Wan et al. 2019, Almet, Cang et al. 2021. DL-based algorithms in this space have not yet been fully developed. However, we expect new DL methods to emerge in the near future, given the importance of cell-cell communication analysis.

RNA velocity
RNA velocity has created new ways to research cellular differentiation in scRNA-seq studies. RNA velocity represents the rate of change in gene expression for a single gene at a given time point depending on the spliced ratio to unspliced mRNA. In other words, RNA velocity is a vector that forecasts the possible future of individual cells on a timescale of hours, which prove new insights into cellular differentiation that have challenged conventional and long-standing views. Existing experimental models for tracking cell fate and reconstructing cell lineages, such as genetic methods or time-lapse imaging have limited power as they do not show the trajectory of differentiation or reveal the molecular identity of intermediate states. But recently, with the advent of scRNA-seq, a variety of algorithms such as VeloViz (Atta and Fan 2021) or scVelo (Bergen, Lange et al. 2020) can visualize the velocity estimates in low dimensions. Although such models show promising results (particularly in well-characterized systems), we are far from a complete understanding of cellular differentiation and cell fate decisions. By this means, we foresee the development of more integrative models, such as DL-based approaches, to resolve long-standing concerns regarding cell fate choices and lineage specification using RNA velocity.

Deep Learning in Single-Cell Genomics
Traditional sequencing methods are limited to measuring the average signal in a group of cells, potentially masking heterogeneity and rare populations (Xiong, Xu et al. 2019). On the other hand, scRNA-seq technologies provide a tremendous advantage for investigating cellular heterogeneity and recognizing new molecular features correlated with clinical outcomes, resulting in the transformation of many biological research domains. Similar to scRNA-seq, single-cell (SC) genomics is being used in many areas, such as predicting the sequence specificity of DNA-and RNA-binding proteins, enhancer and cis-regulatory regions, methylation status, gene expression, control splicing, and searching for associations between genotype and phenotype. However, SC genomics data are often too large and complex to be analyzed only through visual investigation of pairwise correlations. As a result, a growing number of studies have leveraged DL techniques to process and analyze these large datasets. In addition to the scalability of DL algorithms, another advantage of DL techniques is the learned representations from raw input data, which are beneficial in specific SC genomics and epigenomics applications. These application include cell-type identification, DNA methylation, chromatin accessibility, TF-gene relationship prediction, and histone modifications. In the following sections, we review some applications of DL models for analyzing scDNA-seq data (summarized in Fig. 4).

Cell type identification in CyTOF
An essential and challenging task in genomics research is to accurately identify and cluster individual cells into distinct groups of cell types. Li et al. ) describe AE methods (stacked AE and multi-AE) as a gating strategy for mass cytometry (CyTOF). CyTOF is a recent technology for high-dimensional multiparameter SC analysis. They introduced DeepCyTOF as a standardization procedure focused on a multi-AE neural network. DeepCyTOF focuses on domain adaptation principles and is a generalization of previous work that helps users in calibrating between a source  (Angermueller, Lee et al. 2017). An overview of the DeepCpG model is shown in (Fig. 5).
Interestingly, DeepCpG can be used for differentiating human induced pluripotent stem cells in parallel with transcriptome sequencing to specify splicing variation (exon skipping) and its determinants. Linker et al. (Linker, Urban et al. 2019) presented that variation in SC splicing can be precisely predicted based on local sequence composition and genomic features. DeepCpG, which is used for DNA methylation profiles, imputes unobserved methylation regions of individual CpG sites. The cell-type-specific models were made using CpG and genomic information according to DeepCpG's setup of a joint model. Finally, during cell differentiation, Linker et al. identified and characterized associations between DNA methylation and splicing changes, which led to indicating novel insights into alternative splicing at the SC level.
Another method for studying chromatin accessibility at single-cell resolution, called Assay of Transposase Accessible Chromatin sequencing (scATAC-seq), has recently gained considerable popularity. In scATAC-seq, mutation-induced hyperactive Tn5 transposase tags and fragments regions in open chromatin sites in the DNA sequence, which is later sequenced using paired-end Next Generation Sequencing (NGS) technologies (Yan, Powell et al. 2020). The pre-processing steps of scATAC-seq data analysis are often analogous to scRNA-seq pipelines. That is, the same tools are often used in both data modalities, although scRNA-seq tools are often not optimized for the particular properties of scATAC-seq data. ScATACseq has low coverage, and the data analysis is highly sensitive to non-biological confounding factors. The data is preprocessed and assembled into a feature-per-cell matrix, where common choices for "feature" are fixed-size genomic bins and signal peaks at biological events. This matrix displays particular numerical properties which entail computational challenges: it is extremely high-dimensional, sparse, and near-binary in nature (presence/absence of signal). Several packages have been recently developed specifically for scATAC-seq data, with all of them having some major limitations (Cao, Fu et al. 2021). The tools for processing scATAC-seq are diverse and nescient, and thus, there is no consensus on the best practices scATAC-seq data analysis. Further development of scATAC-centric computational tools and benchmark studies are much needed. ScATAC-seq analyses can help to elucidate cell types and differentially accessible regions across cells. Moreover, it can decipher regulatory networks of cis-acting elements like promoters and enhancers, and trans-acting elements like transcription factors (TFs), and infer gene activity (Baek and Lee 2020). ScATAC-seq data could also be integrated with RNA-seq and other omics data. However, most current software only integrates the derived gene activity matrix with expression data, and important information from whole-genome chromatin accessibility is lost. Currently, there are a variety of DL models for bulk ATAC-seq data, such as LanceOtron's CNN for peak calling (Hentges, Sergeant et al. 2021) and CoRE-ATAC for functional classification of cis-regulatory elements (Thibodeau, Khetan et al. 2020). Thibodeau et al. demonstrated CoRE-ATAC's transferable capability on cell clusters inferred from single nuclei ATAC-seq data with a small decrease in model prediction accuracy (mean micro-average precision of 0.80 from bulk vs. 0.69 in singlecell clusters).
One common way to reduce the dimensionality of scRNA-data is to identify the most variable genes (e.g. with PCA), since they carry the most biologically relevant information. However, scATAC-seq data is binary, and therefore prohibiting the identification of variable peaks for dimensionality reduction. Instead, dimensionality reduction of scATAC data is done through Latent Semantic Indexing (LSI), a technique used for natural language processing. Although this approach is scalable to large number of cells and features, it may fail to capture the complex reliance of peaks, since LSI is a linear method.
Single-Cell ATAC-seq analysis via Latent feature Extraction (SCALE) (Xiong, Xu et al. 2019) combines a deep generative framework with a probabilistic Gaussian Mixture Model (GMM) as a prior over the latent variables, in order to learn a nonlinear latent space of scATAC-seq features. Given the nature of scATAC-seq, GMM is a suitable distribution for modeling the high-dimensional, sparse multimodal scATAC-seq data. SCALE can also be used for denoising and imputing missing data, recovering signals from missing peaks. In the benchmarking study by Xiong et al., they demonstrated that SCALE outperformed conventional non-DL scATAC-seq tools for dimensionality reduction, such as PCA and LSI. Furthermore, they show that SCALE is scalable to large datasets (on the order of 80,000 single cells). While SCALE succeeds at learning nonlinear cell representations with higher accuracy, it assumes that the read depth is constant across cells and ignores potential batch effects. These pitfalls motivated the development of scalable and accurate invariant representation learning scheme (SAILER) (Cao, Fu et al. 2021).
SAILER is a deep generative model inspired by VAEs which also learns a low-dimensional latent representation of each cell. For SAILER, the authors aimed to design an invariant representation learning scheme, where they discard the learned component associated with confounding factors from various technical sources. SAILER captures nonlinear dependencies among peaks, faithfully separating biologically relevant information from technical noise, in a manner that is easily scalable to millions of cells (when using GPUs). Similar to SAILER, SCALE also offers a unified strategy for scATAC-seq denoising, clustering, and imputation. However, in multi-sample scATAC-seq integration, SAILER can eliminate batch effects and properly recreate a chromatin accessibility landscape free of confusing variables, regardless of sequencing depths or batch effects.

Transcription Factor (TF)-gene relationship prediction
To unravel gene regulatory mechanisms and differentiate heterogeneous cells, understanding the genome-wide binding TF profile is crucial. Researchers have developed several methods that use expression data for infer gene-gene interactions such inferring coexpression, understanding functional assignments and reconstructing pathways (Yuan and Bar-Joseph 2019). However, each task in infering gene-gene relationships is typically done using different techniques. DL-based methods, on the other hand, are capable of learning multiple tasks jointly, which is very advantageous in inferring relationships between genes. Yuan et al. (Yuan and Bar-Joseph 2019) introduced Convolutional Neural Network for Co-Expression (CNNC), a new encoding method for gene expression data based on CNNs. CNNC is a general computational technique for supervised gene relationship inference that builds on previous approaches in various tasks, including predicting TF targets and recognizing genes related to disease in order to infer cause and effect. The key idea behind CNNC is to turn data into co-occurrence histograms (as images), and then analyzing the histograms using CNNs. More specifically, Yuan et al. generated a histogram for each pair of genes and utilized CNNs to infer relationships among the various levels of expression encoded in the image. CNNC is adaptable and can easily be expanded to integrate other types of genomics data as well, resulting in additional performance gains. CNNC goes beyond previous approaches for predicting TF-gene and protein-protein interactions and predicting the pathway of a regulator-target gene pair. CNNC may also be used to draw causality inferences, functional assignments (such as biological processes and diseases), and as part of algorithms that recreate known pathways (Yuan and Bar-Joseph 2019).
In another work, Fu et al. (Fu, Zhang et al. 2020) presented Single Cell Factor Analysis Network (scFAN), a DL model for determining genome-wide TF binding profiles in individual cells. The scFAN pipeline consists of a "pre-trained model" trained on bulk data and then used to predict TF binding at the cellular level using DNA sequence, aggregated associated scATAC-seq data, and mappability data. ScFAN can help overcome the basic sparsity and noise constraints of scATAC-seq data. This model provides a valuable method for predicting TF profiles through individual cells and could be applied to analyze SC epigenomics and determine cell types. Fu et al. presented scFAN's ability to identify cell types by analyzing sequence motifs enriched within predicted binding peaks and studying the effectiveness of predicted TF peaks. They suggested a novel metric called "TF activity score" to classify each cell and demonstrated that the activity scores could accurately capture cell identity. Generally, scFAN is capable of connecting open chromatin states with transcript factor binding activity in individual cells, which is beneficial for a deeper understanding of regulatory and cellular dynamics (Fu, Zhang et al. 2020).

Histone modification
Given the effects of protein-DNA interactions between histone marks and TF on the regulation of crucial cellular processes (including the organization of chromatin structures and gene expression), the identification of such interactions is highly significant in biomedical science. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a widely used technique for mapping TFs, histone changes, and other protein-DNA interactions for genome-wide mapping (Furey 2012). ChIP-seq data is very sparse, therefore generally requiring imputation for more accurate analysis. Albretch et al. introduced Single-cell ChIP-seq iMPutAtion (SIMPA) (Albrecht, Andreani et al. 2021), an imputation algorithm that was tested on a single-cell ChIP-seq (scChIP-seq) dataset of the H3K4me3 and H3K27me3 histone marks in B-cells and T-cells. Unlike most SC imputation approaches, SIMPA integrates the sparse input of one single cell with a series of 2,251 ENCODE ChIP-seq experiments to extract predictive information from bulk ChIP-seq data. SIMPA's goal is to identify statistical patterns that bind protein-DNA interacting sites through specific SC regions of target-specific ENCODE data for different cell types, as well as the presence or absence of potential sites for a single cell. Once the patterns are identified, SIMPA's DL models uses these patterns to make precise predictions. As a new approach in sc-seq, SIMPA's imputation strategy augmented sparse scChIP-seq data, leading to improved cell-type clustering and the detection of pathways that are specific to each cell type.
SC genomics analysis through DL is a promising and emerging field with an incredible potential to advance our knowledge of fundamental biological matters. In this respect, DL can provide us with a better understanding of nature, and the intricacies of DNA structure and epigenomics effects on human diseases for both therapeutic and diagnostic purposes. Due to intrinsic challenges in SC genomics, such as sparsity, systematic noise, and higher-dimensionality of biological systems, developing new DL models are paramount to further advancing the SC genomics field.

Deep Learning in Spatial Transcriptomics
Since being named the method of the year (Marx 2021), spatial transcriptomics (ST) is becoming the natural extension of scRNA-seq, unbiasedly profiling transcriptome-wide gene expression. By not requiring tissue dissociation, spatial transcriptomic retain spatial information (adding a spatial component to conventional RNA-seq technologies). ST have the potential of revolutionizing the field by bridging the gap between the deep characterization of cellular states and the cellular diversity that constitutes tissue organization. Spatially resolved transcriptomics can provide the genetic profiles of cells while containing information about the positional distribution of the sequenced cells, enhancing our understanding of cell interactions, organ function and pathology. However, the high-throughput spatial characterization of complex tissues remains a challenge. Broadly, spatially-resolved transcriptomics techniques can be divided into two categories: (i) the cyclic RNA imaging techniques that achieve single-cell resolution and (ii) array-based spatially resolved RNA-seq techniques, such as Visium Spatial Transcriptomics (Ståhl, Salmén et al. 2016), Slide-sequencing (Rodriques, Stickels et al. 2019), or highdefinition spatial transcriptomics (HDST) (Vickovic, Eraslan et al. 2019). Though both subgroups can provide spatial information on a single-cell level, the cyclic RNA methods are limited on the number of genes that they can multiplex. On the other hand, the array-based spatially resolved RNA-seq techniques achieve high-throughput data by capturing mRNA across thin tissue sections using a grid of microarrays or bead-arrays, relying on simple molecular biology and histology protocols. However, since array-based mRNA capture does not match cellular boundaries, spatial RNA-seq measurements are a combination of multiple cell-type gene expressions that can correspond either to multiple cells (Visium) or fractions of multiple cells (depending on the spatial resolution of each method).
To obtain a comprehensive characterization of underlying tissue, there is the need for computational methods that can produce coupled single-cell and spatially resolved transcriptomics strategy, mapping cellular profiles into a spatial context. Nonetheless, there are some techniques for retrieving relevant biological information from ST data. As remarked by Lähnemann et al. (Lähnemann, Köster et al. 2020), detecting spatial gene expression patterns is one of the most pressing challenges in single-cell omics data science. Identifying such patterns can provide valuable insight on the spatial distribution of cell populations, pointing out gene marker candidates and potentially leading to identification of new rare cell subpopulations. Moreover, ST not only puts gene expression into a spatial context but also facilitates the integration of tissue-image information with gene expression information. Such data integration will enable researchers to utilize image processing techniques to investigate the morphological information, gaining more intuition in order to obtain more refined inferences, predictions, or cellular profiles. After addressing the mapping issues from gene expression to the spatial coordinates, further computational tools will be required to study cell-cell interactions within tissues and to model transcriptional relations between cell types (Pham, Tan et al. 2020). Given the recency of ST, ML-and DL-based models for studying this type of data are rare and not fully developed. Given the complexity of the ST space, however, we predict that DL models will be the predominant method of choice for ST data integration and analysis, with perhaps many new models borrowing ideas and designs from the existing body of work in computer vision.

Integrating scRNA-seq and Spatial Transcriptomics (Spot Deconvolution)
Given that ST approaches normally detect mRNA expression from a mixture of cells, and that they do not distribute the sequenced samples to match cellular boundaries, it is imperative to integrate ST data with scRNA-seq data to obtain a comprehensive map. Using scRNAs-seq as a reference, this integration aims to infer which cells belong to which gene expression counts detected at the various location across the tissue. Thus far, traditional ML and statistical models have shown promising results in this space. For example, SPOTlight (Elosua-Bayes, Nieto et al. 2021) is a model for deconvoluting spatial transcriptomics spots with single-cell transcriptomes using seeded NMF regression. SPOTlight starts by obtaining cell-type-specific profiles that are representatives of the gene expressions associated with the different cells. This process is further refined by a proper initialization of the method that uses unique marker genes of specific cell types. Next, the method uses Non-Negative Least Squares to deconvolute the captured expression of each spot (location). Despite SPOTlight's accuracy and computational efficiency, SPOTlight lacks the flexibility of integrating datasets from different batches or sequencing technologies. In SPOTlight, several aspects of the underlying biological and technological variance in the data are not addressed nor accounted for, which limits its application. Furthermore, technical procedures to obtain RNAseq data are notoriously delicate and subjected to notable sources of variation.
Methods like Seurat 3 (Stuart, Butler et al. 2019) attempt to account for the intrinsic technical variability of RNA-seq procedures thorough developing an "anchor"-based method for integrating datasets. However, it is important to remember that there are biological variabilities in the number of cells across positions or the amount of mRNA expressed by each cell or cell type. Therefore, there is an unmet need for techniques that can integrate datasets while properly accounting for the various sources of variability. Several studies have proposed statistical frameworks as viable candidates to account for the variability when integrating different datasets. For example, Stereoscope  build their model on a statistical framework. They model gene expressions counts as occurrences under a negative binomial distribution. Stereoscope follows previous approaches of obtaining a gene expression profile for each cell type. That is, Andersson et al. follow a two-step approach: First, they estimate the parameter of the negative binomial distribution for all genes within each cell type. Similar parameters for a distribution of the RNA-seq and spatial expression mixture are then formed by a linear combination of the single-cell parameters. The next step is to search a set of weights that can best fit the spatial data ). These computed weights reflect the contribution of each cell type to the gene expression counts found in each location, thus explaining the abundance of each cell type across the spots.
Following the previous methodology, cell2location (Kleshchevnikov, Shmatko et al. 2020) builds on a Bayesian framework and models gene expression counts as a negative binomial distribution. This approach enables controlling the sources of variability, which is crucial when working with data from different technologies. Cell2location integrates scRNA-seq information into the spatial model using a similar statistical framework as Stereoscope: In addition to modeling the genespecific unobserved rate (mean) as a weighted sum of the cell signature gene expression, cell2location also adds various parameters to provide the model with prior information regarding technology sensitivity. That is, to further improve integration between different technologies, Kleshchevnikov et al. allow for four hyperparameters that will scale the weighted sum of cell contributions. Kleshchevnikov et al. showed that cell2location results in a more comprehensive integration of omics data while being more computationally efficient than competing models (such as Stereoscope).

Deep Learning in the Integration of Single-Cell Multimodal Omics Data
Single-cell sequencing (scSeq) was chosen as Method of the Year in 2013 due to its ability to sequence DNA and RNA in individual cells (Teichmann and Efremova 2020). ScSeq allows for gene expression measurements at an unprecedented single-cell resolution, which can provide a comprehensive view of the genome, transcriptome, or epigenome. In addition, recent technological advances now allow for multimodal omics measurements from the same experiment. In order to gain an accurate and comprehensive view of the cellular composition of the control (normal) and disease groups, it is essential to integrate all omics data for one sample simultaneously (Wani and Raza 2019). The various omics technologies can assess various modalities in one experiment (i.e. conduct multimodal studies) or integrate diverse omics datasets from multiple experiments.
Given the tremendous potentials of these approaches, single-cell multimodal omics was named the 2019 Method of the Year (Teichmann and Efremova 2020). Omics integration holds the promise of linking even small datasets across orthogonal biochemical domains, amplifying biologically significant signals in the process (Grapov, Fahrmann et al. 2018). By analyzing multi-omics data, researchers can produce novel hypotheses or design mathematical algorithms for prediction tasks, such as drug sensitivity and efficacy, gene dependence prediction, and patient stratification. Hao et al (Hao, Hao et al. 2021) proposed a non-DL based method, called the weighted-nearest neighbor method which has shown promising results. WNN is an unsupervised framework for defining cellular identity by leveraging multiple data types for creating a multimodal reference atlas. Hao et al. apply their technique to a dataset of human PBMCs that includes paired transcriptomes and measurements of 228 surface proteins, forming a multimodal immune system atlas. They evaluate multimodal datasets from single cells, including paired measurements of RNA and chromatin state, and extend beyond the transcriptome to define cellular identity in a coherent and multimodal manner (Hao, Hao et al. 2021). Although traditional ML models such as WNN have shown promising results, multi-omics data pose unique challenges for holistic integration, in addition to the other traditional difficulties such as batch effects from multiple sources. Multi-omics data reflect molecular phenotypes at different molecular systems, and thus each omics dataset could follow a different and specific distribution. To overcome these hurdles, sophisticated statistical and computational strategies are required. Among the various proposed algorithms thus far, only DL-based algorithms provide the computational versatility necessary to effectively model and incorporate virtually any form of omics data in an unsupervised or supervised manner (Grapov, Fahrmann et al. 2018).
Most DL-based algorithms in this area aim to simultaneously calculate several modalities in a single experiment. For example, Zuo et al. (Zuo and Chen 2020) introduced a single-cell multimodal VAE model (scMVAE) for profiling both transcriptomic and chromatin accessibility information in the same individual cells. Given the scRNA-seq and scATAC-seq of the same individual cells, scMVAE's uses three joint-learning strategies to learn a non-linear joint embedding that can be used for various downstream tasks (e.g. clustering). This joint learning distinguishes scMVAE from other VAE-based models (such as scVI) which process individual omics data separately. Zuo et al. note that scMVAE's feature embeddings are more distinct than the scVI's for each omics data, indicating that the joint learning representation of multi-omics data will produce a more robust and more valuable representation (Zuo and Chen 2020).
Currently, there are only a few studies that have used DL for data integration, but their success thus far calls for additional investigation of DL models in this domain. Despite the significant advances made with single-cell multimodal omics technologies, several obstacles remain: First, these techniques are prohibitively expensive when used on a large scale to analyze complex heterogeneous samples and distinguish rare cell types within a tissue. On the other hand, data sparsity is a significant limitation of high-throughput single-cell multimodal omics assays. Furthermore, existing methods cover only a small portion of the epigenome and transcriptome of individual cells, making it challenging to separate technical noise from cell-to-cell variability. While future modification of these approaches will eventually close the gap, fundamentally new algorithms or strategies may be needed to resolve this constraint completely (Zhu, Preissl et al. 2020). Amodio et al. (Amodio and Krishnaswamy 2018) propose Manifold-Aligning GAN (MAGAN), which is a GAN-based model which aligns two manifolds coming from different domains with the assumption that different measurements for the same underlying system contain complementary information. They show MAGAN's potential in the problem of single-cell data integration (CyTOF and scRNA-seq data) and demonstrate the generalization potential of this method in the integration of other data types. However, the performance of MAGAN decreases in the absence of correspondence complementary information among samples (Liu, Huang et al. 2019). Cao et al. introduced UnionCom for unsupervised topological alignment of singlecell omics integration without a need for correspondence information among cells or among features, which can be very useful in data integration. However, UnionCom is not scalable to large datasets that are on the order of millions of cells (Cao, Bai et al. 2020). Other models such as SMILE (Single-cell Mutual Information Learning) (Xu, Das et al. 2021) also allow for unmatched feature types. SMILE is a deep clustering algorithm for different tissues and modalities, even when the feature types are unmatched. SMILE can remove batch effects and learn a discriminative representation for data integration using a cell-pairing maximization algorithm (Xu, Das et al. 2021).
Single-Cell Data Integration via Matching (SCIM) (Stark, Ficek et al. 2020) is another deep generative approach that constructs a technology-invariant latent space to recover cell correspondences among datasets, even with unpaired feature sets. The architecture is a modified auto-encoder with an integrated discriminator network, similar to the one in GANs, allowing the network to be trained in an adversarial manner. Multi-modal datasets are integrated by pairing cells across technologies using a bipartite matching scheme that operates on the low-dimensional latent representations (Stark, Ficek et al. 2020). Another data integration model is inteGrative anaLysis of mUlti-omics at single-cEll Resolution (GLUER) (Peng, Chen et al. 2021), which employs three computational approaches: a nonnegative matrix factorization (NMF), mutual nearest neighbor, and a DNN to integrate multi-omics data. The NMF stage helps to identify shared signals across data sets of different modalities, resulting in "factor loading matrices" (FLM) for each modality. The FLM from one data modality is defined as the reference matrix while the other FLM is used as query matrices, which are used to calculate putative cell pairs. These cell pairs are then used in a DNN which tries to learn a mapping between the query FLMs and the reference FLMs, resulting in co-embedded datasets (Peng, Chen et al. 2021).
Some studies have formulated the integration problem as a transfer-learning task. For example, Lin et al. (Lin, Wu et al. 2021) pose the integration problem as a transfer learning question, where the model is co-trained on labeled RNA and unlabeled ATAC data. Their model, scJoint, integrates atlas-scale collections of scRNA-seq and scATAC-seq data, using a neural network framework (Lin, Wu et al. 2021). Their approach uses scATAC-seq to gain a complementary layer of information at the single-cell resolution which is then added to the gene expression data from scRNA-seq. However, scJoint requires that both input matrices share the same dimensions, and scATAC-seq is first converted to gene activity scores, where a single encoder can share the weights for both RNA and ATAC data (Lin, Wu et al. 2021).

Conclusion
Single-cell (SC) omics technologies generate large datasets that describe the genomic, transcriptomic, or epigenomic profiles of many individual cells in parallel. Integrative methods have recently opened a new way for delineating heterogeneous mechanistic landscapes and cell-cell interactions in single-cell multi-omics. Given the challenges in inferring biological information and constructing predictive models for such datasets, there has been a surge in the use of DL-based models for SC omics. So far, deep learned models have shown promising results, demonstrating the ability to process and learn from massive quantities of high-dimensional representations of single-cell and multi-omics data. However, a thorough understanding of the underlying models is needed to evaluate the optimal pipeline for analyzing single-cell datasets to understand cellular identity and function better. We believe that the sc-seq space will pose unique challenges in representation and DL, particularly in multi-omics applications.
The value and potentials of DL in SC omics have been further highlighted during the current Coronavirus (COVID-19) pandemic. Many scientists have utilized deep-learned models in SC analysis for diagnosing and predicting the severity of COVID-19 (Jeong, Jia et al. 2021), predicting future mutations (Saha, Ghosh et al. 2021) and testing drug combinations for treating COVID-19 (Jin, Stokes et al. 2021). Researchers have also used DL in studying the immune dysfunction in COVID-19 patients, particularly PBMCs and lung tissues, which has allowed scientists to better characterize COVID-19 (Han and Zhang 2021). Overall, the advances of the DL techniques in SC have provided researchers with a unique ability to develop new therapeutic interventions, contributing to a global coordinated effort to win the fight against diseases, such as COVID-19.
We anticipate an increase in the use of DL techniques for addressing the current challenges mentioned in the SC field. Furthermore, we believe that the biological generalizability and interpretability of deep-learned models in understanding complex pathological phenotypes such as cancer, drug resistance, and neurobiology would be of great interest and importance to the omics field.