A General Single-Cell Analysis Framework via Conditional Diffusion Generative Models

The fast-growing single-cell analysis community extends the horizon of quantitative analysis to numerous computational tasks. While the tasks hold vastly different targets from each other, existing works typically design specific model frameworks according to the downstream objectives. In this work, we propose a general single-cell analysis framework by unifying common computational tasks as posterior estimation problems. In light of conditional diffusion generative models, we introduce scDiff through the proposed framework and study different conditioning strategies. With data-specific conditions, scDiff achieves competitive performance against state-of-the-art methods in various benchmarking tasks. In addition, we illustrate the flexibility of scDiff by incorporating prior information through large language models and graph neural networks. Additional few-shot and zero-shot experiments demonstrate the effectiveness of prior conditioning in scDiff. Our implementation is publicly available at https://github.com/OmicsML/scDiff.


INTRODUCTION
Recent advances in single-cell technology enable extensive computational tasks for quantitative understanding of the underlying biological principles (Wen et al., 2022; Elmentaite et al., 2022; Heath et al., 2016). Some typical examples of these tasks include cell-level classification (Ma & Pellegrini, 2020; Xu et al., 2021), missing value imputation (Eraslan et al., 2019; Huang et al., 2018), and generalization to novel conditions (Hetzel et al., 2022; Roohani et al., 2023; Lotfollahi et al., 2019). Existing works often design distinct frameworks for different tasks according to their objectives. For example, cell type annotation algorithms (e.g., ACTINN (Ma & Pellegrini, 2020)) typically model the class attributes with a multi-class cross-entropy loss; single-cell imputation methods (e.g., DCA (Eraslan et al., 2019)) aim to recover the true counts from dropouts by specifying some prior distribution; and perturbation prediction frameworks (e.g., GEARS (Roohani et al., 2023)) explicitly model the change in expression between perturbed cells and control cells.
In this work, we provide a new perspective by formulating common single-cell tasks through the lens of distribution estimation. While the tasks are derived from diverse biological perspectives, we articulate that their objectives can be described as posterior distribution estimation problems under task-specific conditions, as detailed in Section 2. For example, cell type annotation can be considered as classifying each cell to the cell type that maximizes the conditional likelihood of its expression (Li et al., 2023), and imputation can be treated as drawing samples from the learned posterior given the partially observed expression data. This perspective brings new opportunities to single-cell analysis. It allows a general posterior estimation framework that enables us to handle multiple single-cell analysis tasks with a single unified objective. In the meantime, many conditional generative models can be plugged into the framework. However, it also introduces new challenges. Within the framework mentioned above, the choice of the generative model plays a crucial role in how well it can conduct downstream tasks (Dhariwal & Nichol, 2021). An ideal model should produce both accurate distribution estimates and high-quality samples. Meanwhile, better conditioning strategies are desired to trade off the influence across different conditions.
To address the challenges mentioned above, we delve into diffusion generative models (DGMs). DGMs have shown great success in generation tasks since the introduction of denoising diffusion probabilistic models (Ho et al., 2020), which were further extended to conditional scenarios through classifier guidance (Dhariwal & Nichol, 2021) and classifier-free guidance (Ho & Salimans, 2021). Various types of conditions have been applied to guide DGMs, such as images (Poole et al., 2022), text (Rombach et al., 2022; Kim et al., 2022b), and audio (Ruan et al., 2023; Leng et al., 2022). Compared to class guidance, these approaches paved the way for guiding DGMs with prior knowledge. This flexibility allows us to construct the posterior estimation process through conditional DGMs in single-cell analysis with both internal and external information. We demonstrate that internally guided DGMs can match state-of-the-art performance in standard settings, while prior knowledge enables better transferability under zero-shot and few-shot settings. We summarize our main contributions as follows:
• We present a general single-cell analysis framework by formulating various tasks as posterior estimation problems. Through this framework, we introduce scDiff with a conditional DGM.
• We study the conditioning strategies of scDiff. With cell-label conditioning, scDiff achieves competitive performance with state-of-the-art models in various benchmarking tasks.
• We incorporate prior knowledge with large language models (LLMs) and graph neural networks. Experimentally, scDiff shows outstanding few-shot and zero-shot results.

A GENERAL POSTERIOR ESTIMATION FRAMEWORK
In this section, we detail our proposed framework. Section 2.1 formulates the various single-cell tasks as posterior estimation problems. Section 2.2 introduces the background of conditional DGMs and the structure of scDiff.

SINGLE-CELL TASKS AS POSTERIOR ESTIMATION
Single-cell analysis is a vast topic that involves a large number of computational and biological tasks. While existing works provide novel and effective solutions for individual tasks, few can address multiple challenges. A natural reason is that the tasks quantify distinct perspectives of the underlying mechanisms. Hereafter, we will argue that many common tasks in single-cell analysis amount to quantifying the cell identities given the biological context. In other words, we are actually estimating the posterior distribution of the cells' expression given specific conditions.
Formally, we denote the expression of all cells as X ∈ R^{n×m}, where n is the number of cells and m is the number of genes. We categorize the common tasks into three classes: cell labeling, expression completion, and knowledge transfer. The notation for task-specific conditions is detailed subsequently.
Cell labeling. One of the most critical single-cell tasks is to label cells by their expression. Specifically, let us denote the labels as C_label ∈ R^n, which can be either discrete (e.g., cell type) or continuous (e.g., spatial cell type ratio). For the cell labeling task, we estimate the posterior of the labels given the expression X. Inspired by Li et al. (2023), we formulate the posterior with Bayes' theorem, assuming a uniform prior on the support of C_label:

p(C_label | X) = p(X | C_label) p(C_label) / ∫ p(X | C) p(C) dC ∝ p(X | C_label).   (1)

Consequently, the problem shifts from estimating the label posterior to approximating the expression posterior.
One representative task of cell labeling is cell type annotation, where the labels become the functional types of the cells. The integration in the denominator of equation 1 then reduces to a summation over the support of the classes. From the data perspective, the ground-truth annotation is typically given by experts' curations. Most existing works directly model the posterior of the labels with a multi-class cross-entropy loss. Another typical task is cell trajectory inference, where the labels become the developmental trajectories of biological progression through processes such as cell differentiation (Qiu et al., 2022). Similar ideas can be extended to spatial transcriptomics. For example, in cell type deconvolution, the labels become the ratios of cell types within each spot (Biancalani et al., 2021; Ma & Zhou, 2022). More details can be found in Appendix D.

Expression completion. A crucial category of single-cell analysis tasks is expression completion. It includes both filling in missing values and predicting the whole expression. In some scenarios, the task may require external information from reference datasets. To account for the majority of the settings, we denote the observed expression as M ⊙ X, where ⊙ denotes element-wise matrix multiplication and M ∈ {0, 1}^{n×m} is the element-wise indicator, with ones marking observed entries and zeros marking missing entries. The task is defined as estimating the posterior p((J_{n,m} − M) ⊙ X | M ⊙ X), where J_{n,m} is an all-ones matrix of dimensions n × m. Equivalently, we write the objective as

p(X | M ⊙ X).   (2)
In missing value imputation, the partially observed expression is produced by manually dropping out non-zero counts (Eraslan et al., 2019; Van Dijk et al., 2018). The dropout mechanism can be extended to molecular cross-validation (Batson et al., 2019), where a continuous M ∈ [0, 1]^{n×m} becomes the element-wise dropout rate specified by the partition of molecules. When some genes are completely unobserved, the objective shifts to missing gene imputation (Arisdakessian et al., 2019; Biancalani et al., 2021). Specifically, in equation 2, the indicator matrix M = [J_{n,s}, O_{n,m−s}] describes s seen genes and m − s unobserved genes, where O_{n,m−s} is an all-zeros matrix of dimensions n × (m − s). In the multiomics setting, the previous formulation can be extended to modality prediction tasks, where the objective is to predict the expression of a target modality X from a source modality (Yang et al., 2021; Wu et al., 2021). We detail the formulation in Appendix D.
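To make the masking formulation concrete, the following sketch constructs the indicator M and the observed matrix M ⊙ X for both the random-dropout and missing-gene settings. The array names and toy sizes are illustrative assumptions, not part of the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s = 100, 2000, 1500                        # cells, genes, observed genes (toy sizes)
X = rng.poisson(0.5, size=(n, m)).astype(float)  # toy count matrix standing in for real data

# Random-dropout setting: M is a binary indicator over individual entries.
M_dropout = rng.binomial(1, 0.9, size=(n, m))    # keep ~90% of entries
X_observed = M_dropout * X                       # model input, M ⊙ X

# Missing-gene setting: M = [J_{n,s}, O_{n,m-s}] -- the first s gene columns
# are fully observed and the remaining m - s columns are fully missing.
M_gene = np.concatenate([np.ones((n, s)), np.zeros((n, m - s))], axis=1)
X_gene_observed = M_gene * X
```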
Knowledge transfer. While exact condition control in single-cell experiments is both time- and resource-consuming, it is desirable to have computational methods that transfer known results to unseen conditions (Roohani et al., 2023; Hetzel et al., 2022). Let C_s be the source condition set and C_t be the target condition set with C_s ∩ C_t = ∅; we aim to estimate the expression under the target conditions. External reference conditions may also be included to enhance the estimation with prior information. The knowledge transfer task can be formulated as:

Estimate p(X | C_t) given p(X | C_s).   (3)
Common single-cell knowledge transfer tasks mainly focus on perturbation prediction, which includes predicting novel gene perturbation responses (Roohani et al., 2023), predicting novel drug perturbation responses (Hetzel et al., 2022), and cell type transfer of perturbations (Lotfollahi et al., 2019). For the first two tasks, prior knowledge is often needed to generalize to unseen perturbation types. For the remaining task, we focus on only one type of perturbation and aim to generalize the perturbation effect across cell types. Specifically, the training set contains both perturbed and control cells for the source cell types but only control cells for the target cell types, and the model is expected to approximate the perturbed state of the target cell types.
A general objective. The objectives of the aforementioned tasks are all to estimate the posterior distribution of the cells' expression given task-specific conditions. Thus, a general objective can be formally stated as:

Estimate p(X | C) for task-specific conditions C.   (4)

Without loss of generality, we write x as x_0 at the sample level. In the following section, we focus on the objective p(x_0 | c) for one cell.

CONDITIONAL DIFFUSION MODEL FOR POSTERIOR ESTIMATION
To estimate the posterior distribution, we delve into diffusion generative models (DGMs). Following DDPM (Ho et al., 2020), we model the posterior through the reverse process:

p_θ(x_0 | c) = ∫ p(x_T) ∏_{t=1}^{T} p_θ(x_{t-1} | x_t, c) dx_{1:T}.   (5)

The above process (i.e., the reverse process in DDPM) learns to recover the original data from Gaussian white noise. Conversely, the forward process gradually corrupts the data by adding Gaussian noise:

q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) x_{t-1}, β_t I),   (6)

where {β_t}_{t=1}^{T} is the variance schedule.

Next, we detail the parameterization of the reverse process. We empirically find that the widely used predict-ϵ objective fails to recover the expression. Since single-cell data often show extreme sparsity, with more than 95% of the entries being zeros, the corrupted input at time step t, i.e., x_t, will mostly be pure noise. Under the predict-ϵ parameterization, the model would likely learn to reverse the noise schedule instead of estimating the data posterior. Therefore, we turn to the predict-x_0 parameterization. Denoting α_t = 1 − β_t and ᾱ_t = ∏_{s=1}^{t} α_s, we write the reverse process as:

p_θ(x_{t-1} | x_t, c) = N(x_{t-1}; μ̃_t(x_t, x̂_θ(x_t, t, c)), β̃_t I),   (7)

where x̂_θ(x_t, t, c) is the model's estimate of the clean expression x_0. Due to the integral in equation 5, the data posterior is intractable. Alternatively, the parameters are optimized by minimizing the variational lower bound (ELBO):

L = E_q[ D_KL(q(x_T | x_0) ‖ p(x_T)) + Σ_{t>1} D_KL(q(x_{t-1} | x_t, x_0) ‖ p_θ(x_{t-1} | x_t, c)) − log p_θ(x_0 | x_1, c) ].   (8)

As detailed in Appendix A, we start from the ELBO and arrive at the simplified training objective:

L_simple = E_{t, x_0, ϵ}[ ‖x̂_θ(x_t, t, c) − x_0‖² ].   (9)

Next, we introduce the scDiff model architecture, which is depicted in Fig. 1. At a high level, scDiff aims to recover the clean single-cell gene expression x_0 given the corrupted signal x_t with Gaussian noise added up to time step t. The associated conditions of the cell are also fed into the model to provide conditional information. Specifically, scDiff follows a general encoder-decoder design and consists of four main components: (1) an input expression embedder ϕ; (2) various conditioners ψ_*, each of which converts a specific condition of the input cell into a sequence of dense numerical vectors; (3) a cross-attention encoder E, which combines the input embeddings with the corresponding conditioners and transforms them into the hidden representation of the input cell; and (4) a linear decoder D, which projects the hidden representation back to the gene expression space to recover the input cell's noise-free expression. We detail each component in the following.
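To make the simplified objective in equation 9 concrete, the following PyTorch sketch implements one training step under the predict-x_0 parameterization. The `model(x_t, t, cond)` interface and all names are assumptions for illustration, not the released scDiff implementation.

```python
import torch

def training_step(model, x0, cond, alphas_bar, T=1000):
    """One predict-x0 diffusion training step (simplified objective, eq. 9).

    model:      any network taking (x_t, t, cond) and returning an x0 estimate.
    x0:         (batch, genes) clean expression.
    alphas_bar: (T,) tensor of cumulative products of alpha_t.
    """
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)      # random time steps
    eps = torch.randn_like(x0)                           # Gaussian noise
    a_bar = alphas_bar[t].view(b, 1)                     # broadcast over genes
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # forward corruption (eq. 6)
    x0_hat = model(x_t, t, cond)                         # predict the clean x0
    return ((x0_hat - x0) ** 2).mean()                   # MSE against clean data
```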
Embedder. We use a linear mapping W ∈ R^{m×d} to project the noised gene expression x_t ∈ R^m into R^d and mix it with the sinusoidal time embedding τ(t) (Appendix B.1) to inject diffusion time step information, following previous work (Ho et al., 2020):

ϕ(x_t, t) = W^⊤ x_t + τ(t).   (10)
Conditioner. The goal of each conditioner is to extract a set of L numerical representations of an input condition c. Each of these representations is used as the basis for the key and value embeddings in the cross-attention encoding step, as described below. Formally, each conditioner ψ_f is a multilayer perceptron (MLP) that converts the raw representation extracted by f into the final condition representation set:

ψ_f(c) = { MLP_l(f(c)) }_{l=1}^{L},   (11)

where f ∈ F is a function that maps an input condition into a d-dimensional vector and F is the set of such mappings. The mapping can be designed to suit the specific needs of different input types.
Context. A cell context is a randomly masked expression x̃ = m ⊙ x, where m ∈ {0, 1}^m is the element-wise mask indicator. We process the context condition similarly to the input embedding, using a linear projection but without the time embedding: f_ctxt(c) = W̃^⊤ x̃, where W̃ ∈ R^{m×d}.

Class. We use learnable d-dimensional embeddings to represent each class, f_cls(c) = h_c. The class attribute is an important piece of information about the input sample and can be used to guide the diffusion generation process (Ho & Salimans, 2021). In our case, class attributes can describe the cell type or the perturbation state of a given cell.
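A minimal sketch of a class conditioner in PyTorch follows, assuming the parallel-MLP design mentioned for non-context conditioners in Appendix B.2; all layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class ClassConditioner(nn.Module):
    """Maps a discrete class label to L condition tokens of width d."""

    def __init__(self, num_classes, d=512, L=4):
        super().__init__()
        self.embed = nn.Embedding(num_classes, d)   # f_cls: learnable class vector h_c
        # One MLP head per output token (parallel-MLP design, an assumption).
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))
             for _ in range(L)]
        )

    def forward(self, c):                            # c: (batch,) integer labels
        h = self.embed(c)                            # (batch, d)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (batch, L, d)
```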
The two conditioners above utilize internal information obtained from the given training data. However, effectively incorporating prior information into the model is key to more generalizable and transferable knowledge. We next describe two distinct approaches to incorporating prior knowledge, as examples illustrating the extendability of the conditioners.
LLM. Besides encoding the cell type attribute using class embeddings, we can alternatively leverage an LLM to extract rich representations of different cell types from their textual descriptions. Specifically, the cell type definitions are first obtained from the cell ontology (Bard et al., 2005). We then feed these descriptions into the pre-trained BioLinkBERT (Yasunaga et al., 2022) model and use the resulting class token embeddings as f_LLM(c).
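One possible way to extract f_LLM(c) with HuggingFace Transformers is sketched below; the checkpoint id and the use of the [CLS] position are assumptions consistent with standard BioLinkBERT usage, not necessarily the paper's exact pipeline.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint id is an assumption; any BioLinkBERT checkpoint works the same way.
name = "michiyasunaga/BioLinkBERT-base"
tokenizer = AutoTokenizer.from_pretrained(name)
encoder = AutoModel.from_pretrained(name)

def embed_cell_type(description: str) -> torch.Tensor:
    """Encode a cell ontology definition; the [CLS] embedding serves as f_LLM(c)."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**inputs)
    return out.last_hidden_state[:, 0]   # (1, hidden) -- the [CLS] position

f_llm = embed_cell_type("A lymphocyte of B lineage that secretes antibodies.")
```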
GEARS. Roohani et al. (2023) proposed a novel approach to encoding gene perturbation information using a graph neural network (GNN) on a gene similarity graph G. This graph is constructed such that each edge represents the number of shared gene ontology terms (Ashburner et al., 2000) between a pair of genes, reflecting their functional similarity. The gene perturbation embeddings are then computed using simple graph convolution (SGC) (Wu et al., 2019): f_GEARS(c) = SGC(c, G).
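The propagation step of SGC is simple enough to sketch directly; the dense adjacency and embedding shapes below are illustrative assumptions.

```python
import torch

def sgc_embed(H, A_hat, k=2):
    """Simple graph convolution (Wu et al., 2019): k-step feature propagation.

    H:     (num_genes, d) initial gene embeddings.
    A_hat: (num_genes, num_genes) symmetrically normalized adjacency of the
           gene similarity graph G (with self-loops); dense here for brevity.
    """
    for _ in range(k):
        H = A_hat @ H          # no nonlinearity between steps -- that is SGC
    return H                   # row c serves as the perturbation embedding f_GEARS(c)
```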
Encoder. Once the input embedding ϕ(x_t, t) and the condition representations {ψ_f(c)}_{f∈F} are computed, we combine them through multiple layers of cross-attention (Appendix B.3):

h = E(ϕ(x_t, t), {ψ_f(c)}_{f∈F}).   (12)
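A single cross-attention encoder layer could look like the following PyTorch sketch, where the cell embedding provides the query and the condition tokens provide the keys and values. The head count and dimensions follow the 8-head, 64-dimensional setup in Appendix E.1, but the exact block layout is an assumption.

```python
import torch
import torch.nn as nn

class CrossAttentionLayer(nn.Module):
    """One encoder layer: the cell embedding queries the condition tokens."""

    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, h, cond_tokens):
        # h: (batch, 1, d) input embedding; cond_tokens: (batch, L, d).
        a, _ = self.attn(self.norm1(h), cond_tokens, cond_tokens)  # Q=h, K=V=cond
        h = h + a                                                  # residual connection
        return h + self.ffn(self.norm2(h))                         # FFN + residual
```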
Decoder. Finally, the cell latent embedding is linearly projected back to the gene expression space to recover the clean expression signal x_0. We follow Lopez et al. (2018) and mix an additional learnable batch embedding with the latent embedding according to the batch label of the input cell. This approach better disentangles the non-biological variations in the data.
Combining the above components, we summarize the full scDiff model in equation 14:

x̂_θ(x_t, t, c) = D(E(ϕ(x_t, t), {ψ_f(c)}_{f∈F})).   (14)

RELATED WORK
We introduce existing works that study generative models in single-cell analysis. A large proportion of single-cell generative models are variational autoencoders (VAEs). scVI (Lopez et al., 2018) led the trend of VAEs with a negative binomial prior on the raw expression. scVI achieved satisfactory integration results by explicitly incorporating library size and batch information in the model; including these conditions helps regress out the technological variance within the latent space. Many other existing works extended the design space of VAEs in single-cell analyses. scVAE (Grønbech et al., 2020) incorporated a Gaussian-mixture latent space to model the underlying clustering structure. scDHA (Tran et al., 2021) formed a hierarchical framework consisting of two VAEs. In a broader application scenario, scGen (Lotfollahi et al., 2019) utilized the VAE structure for out-of-distribution prediction, while scMM (Minoura et al., 2021) aimed at multiomics analysis with a mixture-of-experts VAE model. Meanwhile, several existing works applied generative adversarial networks (GANs) (Goodfellow et al., 2014) to single-cell analysis. cscGAN (Marouf et al., 2020) incorporated a conditional GAN for data augmentation. scIGANs (Xu et al., 2020) adapted GANs for single-cell imputation via generation. Despite these applications of generative models in single-cell analysis, they are typically designed for one or a few tasks, which fundamentally limits their extendability to broader classes of problems.

EXPERIMENT
In this section, we conduct experiments to validate the effectiveness of scDiff. Datasets used in the experiments are summarized in Appendix E.2. Through the experiments, we aim to answer the following research questions:
• RQ1: How does scDiff perform against the state-of-the-art with internal data-specific conditions?
• RQ2: Can scDiff extend to other application scenarios with external prior knowledge?

PERFORMANCE OF INTERNAL-CONDITIONED scDiff
To answer the first question, we choose one representative task from each of the three categories in Section 2.1, i.e., cell type annotation, missing value imputation, and perturbation prediction for novel cell type. It is worth noting that in this section, we implement scDiff with the same structure across the three representative tasks. More implementation details can be found in Appendix E.1.

CELL TYPE ANNOTATION
Experimental settings. Cell type annotation is one of the fundamental tasks in single-cell analysis. We collect 6 benchmark datasets: PBMC12K (Zheng et al., 2017; Lopez et al., 2018), Pancreas (Luecken et al., 2022), HLCA (Sikkema et al., 2023), Immune (Domínguez Conde et al., 2022), Brain (Seeker et al., 2023), and Liver (MacParland et al., 2018). We randomly hold out 10% of the cells of each dataset as the test set and train all models on the remaining cells. For scDiff, we annotate the cells by evaluating the mean squared error between the input expression and the model posterior in a classifier-free approach (Li et al., 2023). We elaborate on the details in Appendix C. The classification results are quantified by the macro multi-class accuracy score and F1 score.

Experimental results. Table 1 illustrates the cell type annotation results, where we report the macro accuracy scores with mean and standard deviation across five runs. Note that the colors in the result tables in this section refer to the performance rank within one dataset, depicted as first place, second place, and third place. We highlight that scDiff achieves top performance in four out of six datasets without explicitly training a classifier, which extends the results of Li et al. (2023). This observation offers solid support for the proposed posterior estimation framework, suggesting that well-established generative models can even outperform discriminative models in the single-cell context. We include the macro F1 score results in Appendix E.3.

MISSING VALUE IMPUTATION

Experimental settings. Missing value imputation aims to recover the true expression levels from the dropout events in sequencing (Hou et al., 2020). We choose three datasets, i.e., Jurkat, 293T, and PBMC1K, from Hou et al. (2020). For evaluation purposes, existing works typically create the corrupted matrix by dropping out some non-zero entries. Specifically, we follow the setting of DCA (Eraslan et al., 2019) and MAGIC (Van Dijk et al., 2018) and mimic the dropout events by masking 10% of the non-zero counts to zeros, where the masking probability is given by an exponential distribution. We evaluate the imputation performance based on the Pearson correlation between the prediction and the ground truth on the masked entries.
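A sketch of this corruption protocol is given below; the exact exponential parameterization used by DCA and MAGIC may differ, so the `scale` parameter and the rescaling step are assumptions.

```python
import numpy as np

def exponential_mask(X, target_frac=0.1, scale=20.0, seed=0):
    """Mask non-zero counts with probability decaying in expression level.

    Low counts are more likely to drop out, mimicking sequencing dropouts.
    Returns the corrupted matrix and the boolean mask of dropped entries,
    on which the Pearson correlation is later evaluated.
    """
    rng = np.random.default_rng(seed)
    nz = X > 0
    p = np.exp(-X / scale)                   # exponential dropout probability
    p = p / p[nz].mean() * target_frac       # rescale to ~10% of non-zero entries
    drop = nz & (rng.random(X.shape) < np.clip(p, 0.0, 1.0))
    X_corrupted = X.copy()
    X_corrupted[drop] = 0.0
    return X_corrupted, drop
```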
Baselines. We choose the following top-performing baselines according to Hou et al. (2020): DCA (Eraslan et al., 2019), MAGIC (Van Dijk et al., 2018), and scVI (Lopez et al., 2018).

Experimental results. The results are illustrated in Table 2a with standard deviations across five runs, where the best result in each dataset is highlighted in bold. We observe that scDiff and MAGIC deliver similar performance on Jurkat and 293T, while scDiff outperforms the others on PBMC1K. Meanwhile, scDiff delivers a significant performance gain over the other generative model, scVI. This highlights the capability of scDiff to recover the true expression from dropout events.

PERTURBATION PREDICTION FOR NOVEL CELL TYPE

Experimental settings. We adopt the datasets of et al. (2018). Each dataset contains eight different cell types and two perturbation states (perturbed and control). During training, the model is given the full dataset except for the perturbed cells of one cell type, which are held out for testing. The model then generates the unseen perturbed cells' expressions for the held-out cell type at testing time. The evaluation metric is based on changes in expression between perturbed and control cells. We aggregate the expression of a given condition by calculating the mean expression across cells. The change in expression is given by the difference in mean expression between the perturbed and control cells of the held-out cell type. The squared Pearson correlation is calculated on the top 100 differentially expressed genes.

Baselines. For evaluation purposes, we implement existing baselines and benchmark their performance. scGen (Lotfollahi et al., 2019) is a variational autoencoder combined with latent-space vector arithmetic. Along with scGen, we also include baselines mentioned in Lotfollahi et al. (2019), i.e., a conditional variational autoencoder (CVAE), vector arithmetic in expression space (Vec), and vector arithmetic in the latent space of principal component analysis (PCA-Vec). In addition, we include CPA (Lotfollahi et al., 2023), which is an autoencoder with a disentangled latent space.
Experimental results. We summarize the results in Table 2b, where the bold numbers represent the top performance on each dataset. Remarkably, scDiff outperforms all baselines on all datasets. These results highlight that scDiff shows superior generalizability to novel conditions compared to the baseline counterparts. Consequently, scDiff shows great potential in single-cell applications where condition transfer is needed.

PERFORMANCE OF EXTERNAL-CONDITIONED scDiff
In Section 4.1, we evaluated scDiff on three representative tasks where all individual conditions are observed. In practice, we may encounter extreme cases where only a few, or even no, labeled samples are available for the query conditions. Given only internal information, the few-shot and zero-shot settings are challenging, if not intractable. Here, we showcase ways to extend scDiff by incorporating prior information as external conditions to enable the handling of unseen conditions.
To test the performance of scDiff under these settings, we conduct experiments with one-shot cell type annotation and zero-shot gene perturbation prediction.

ONE-SHOT CELL TYPE ANNOTATION
While some rare cell types play a crucial role in particular research areas (Khalilia et al., 2011), accurately annotating them is incredibly challenging because of the limited availability of labeled samples (Jindal et al., 2018). Under the few-shot setting, prior information on cell types can significantly enhance the model. The cell ontology provides a comprehensive vocabulary and definitions of different cell types written in natural language, which can be readily encoded by LLMs into embeddings. We use BioLinkBERT (Yasunaga et al., 2022) as the backbone LLM since it is specifically trained on biomedical corpora. We extract textual descriptions for all cell types appearing in our datasets from their cell ontology terms, except for the mucus-secreting cell (CL:0000319) and the pulmonary artery endothelial cell (CL:1001568). For these two terms, we use GPT-4 (OpenAI, 2023) to generate descriptions, given the available definitions as query context.
Experimental settings. Of the six datasets in Section 4.1.1, we use the four from CELLxGENE (Megill et al., 2021), since they come with manually annotated cell ontology terms. To mimic the rare-sample setting, we simulate an extreme one-shot scenario. Particularly, for a specific dataset, we count the number of cells within every cell type and set a threshold to filter out about half of the cell types. We pick the cell types with fewer cells than the threshold as target cell types and randomly sample one cell per type as the one-shot set. We first pre-train scDiff on the remaining cell types and then fine-tune the whole model on the one-shot set for 50 epochs. Similar to the evaluation in Section 4.1.1, we calculate the macro average of the multi-class accuracy score and the F1 score for performance comparison.
Experimental results. We summarize the results in Fig. 2, where we increase the number of cell types by gradually adding target cell types from top to bottom in descending order of their original cell counts. We directly train CellTypist (Domínguez Conde et al., 2022) on the one-shot set to provide a reference for the macro accuracy scores. Notably, the LLM variant of scDiff shows performance gains over the class-conditioned scDiff on three out of four datasets. We conclude from these results that utilizing BioLinkBERT in scDiff-LLM enhances the model in the one-shot setting. More results and analysis are in Appendix E.4.

NOVEL GENE PERTURBATION PREDICTION
Understanding the transcriptional responses to genetic perturbations is a crucial step towards delineating the regulatory circuits in biological systems (Jaitin et al., 2016; Sachs et al., 2005). Its realization has many vital applications in translational medicine and health science (Réda et al., 2020). However, exhaustively screening all possible genetic perturbations is impractical, given the high cost of such experiments. Here, we follow GEARS (Roohani et al., 2023) and leverage scDiff to predict the effects of novel gene perturbations. Under such a zero-shot setting, we incorporate biological priors for the unseen genes by adapting the gene ontology-based graph neural network from GEARS, as detailed in Section 2.2.

Experimental settings. To assess the zero-shot gene perturbation prediction performance, we follow the setting of GEARS by holding out part of the perturbations as the test set. We include two one-gene perturbation datasets (Adamson et al., 2016; Dixit et al., 2016) and a two-gene perturbation dataset (Norman et al., 2019). To be consistent with GEARS, we use the same metrics, i.e., the Pearson correlation of the change in expression (Delta Pearson Correlation) and the mean squared error of the change in expression on the top 20 differentially expressed genes (MSE Top 20 DE).
Experimental results. The results are illustrated in Fig. 3. We observe that scDiff outperforms GEARS on all metrics and datasets except the MSE on Norman. In addition, scDiff presents more stable results with smaller variance across five runs. This observation indicates that the core components of GEARS can be readily adapted to scDiff as a conditioner without any modification. The numeric results are summarized in Appendix E.5.

CONCLUSION
In this work, we have unified common single-cell tasks within a posterior estimation framework. Subsequently, we developed scDiff, which uses a conditional diffusion generative model to approximate the posterior. scDiff showed strong performance on diverse single-cell benchmarking tasks using a single training objective. More importantly, the proposed scDiff is versatile and accommodates various conditioning strategies. As two showcases, we incorporated prior information with large language models and graph neural networks. Our results demonstrated that scDiff successfully leveraged this prior information through conditioning. Together, our work paves the way for diffusion generative models in single-cell analysis, ultimately accelerating the development from health science to therapeutic discovery.
Future work and limitations. The flexibility of scDiff enables extensive conditioning strategies. Besides LLMs and GNNs, we can enhance scDiff with other guidance methods, such as CLIP (Radford et al., 2021; Kim et al., 2022b). In addition, the proposed posterior framework can be promptly extended to multiomics or multi-modality tasks. A natural future direction is to explore the possibilities of scDiff in spatial transcriptomics, where histological images can be used as an additional condition, further opening up possibilities in single-cell analyses such as predicting gene expression from histology (Shmatko et al., 2022). On the other hand, there are intrinsic limitations to the current framework. Particularly, representation learning plays a crucial role in several single-cell tasks, such as single-cell integration (Luecken et al., 2022), yet scDiff in its current form still needs further adaptation to accommodate those tasks. Recent works in vision (Preechakul et al., 2022; Kim et al., 2022a) that learn semantic representations with diffusion models suggest a promising route for such an extension.

A PARAMETERIZATION DETAILS
As described in DDPM (Ho et al., 2020), the ELBO in equation 8 can be written as:

L = E_q[ D_KL(q(x_T | x_0) ‖ p(x_T)) + Σ_{t>1} D_KL(q(x_{t-1} | x_t, x_0) ‖ p_θ(x_{t-1} | x_t, c)) − log p_θ(x_0 | x_1, c) ] =: L_T + Σ_{t>1} L_{t-1} + L_0.   (15)

Following Rombach et al. (2022), the posterior mean in equation 7 has the form:

μ̃_t(x_t, x_0) = (√ᾱ_{t-1} β_t / (1 − ᾱ_t)) x_0 + (√α_t (1 − ᾱ_{t-1}) / (1 − ᾱ_t)) x_t.

Replacing x_t with √ᾱ_t x_0 + √(1 − ᾱ_t) ϵ, the KL divergence term L_{t-1} simplifies to:

L_{t-1} = E_q[ (ᾱ_{t-1} β_t² / (2σ_t² (1 − ᾱ_t)²)) ‖x̂_θ(x_t, t, c) − x_0‖² ].

Since the term L_T has no trainable parameters, the complete training objective becomes a weighted sum of per-step reconstruction errors:

L = Σ_{t≥1} w_t E_q[ ‖x̂_θ(x_t, t, c) − x_0‖² ],  with w_t = ᾱ_{t-1} β_t² / (2σ_t² (1 − ᾱ_t)²) for t > 1.

Applying the simplification process described in DDPM, i.e., ignoring the weight of every time step, the training objective reduces to equation 9.
B FURTHER DETAILS OF scDiff

B.1 SINUSOIDAL TIME EMBEDDING
In equation 10, we implement the sinusoidal time embedding τ(t) ∈ R^d following the standard transformer formulation:

τ(t)_{2i} = sin(t / 10000^{2i/d}),  τ(t)_{2i+1} = cos(t / 10000^{2i/d}),  i = 0, …, d/2 − 1.
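The following is a standard implementation of this embedding; the dimension d and the maximum period are the usual transformer defaults.

```python
import math
import torch

def sinusoidal_time_embedding(t, d=512, max_period=10000.0):
    """Transformer-style embedding of the diffusion time step t."""
    half = d // 2
    freqs = torch.exp(
        -math.log(max_period) * torch.arange(half, dtype=torch.float32) / half
    )
    args = t.float()[:, None] * freqs[None, :]                    # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, d)
```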

B.2 CONTEXT EMBEDDING PROCESSING
As illustrated in Fig. 4, unlike the parallel processing done with parallel MLPs for the other conditioners mentioned in Section 2.2, the masked-gene-expression context conditioner ψ_{f_ctxt} extracts the hidden representations within a single MLP. Formally,

ψ_{f_ctxt}(c)_l = MLP^{(l)}(f_ctxt(c)),  l = 1, …, L,

where MLP^{(l)} denotes the output of the MLP at the l-th last layer. Subsequently, the processed condition embeddings are fed into the cross-attention blocks in reverse order, so that the most highly processed context embeddings are mixed with the raw input embeddings first. This reversed mixing approach is inspired by the similar design of the DiffMAE model (Wei et al., 2023), where the masked and visible patches are mixed in reversed order. Empirically, we found that the reversed mixing strategy leads to better reconstruction of the gene expression (x_0 predictions) during training than the multi-head MLP processing.
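A minimal sketch of this single-MLP context conditioner with reversed output ordering follows; layer widths and depth are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ContextConditioner(nn.Module):
    """Single MLP whose last L intermediate outputs serve as context tokens,
    returned deepest-first so the cross-attention blocks consume them in
    reverse (most processed embeddings mix with the raw input first)."""

    def __init__(self, m, d=512, L=4):
        super().__init__()
        self.proj = nn.Linear(m, d)                                   # f_ctxt projection
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, d), nn.GELU()) for _ in range(L)]
        )

    def forward(self, x_masked):                 # x_masked: (batch, m) = m ⊙ x
        h, outs = self.proj(x_masked), []
        for layer in self.layers:
            h = layer(h)
            outs.append(h)                       # shallow -> deep
        return outs[::-1]                        # deep first, per reversed mixing
```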

B.3 CROSS ATTENTION
We specify the cross-attention in equation 12 as:

Attention(Q, K, V) = softmax(QK^⊤ / √d_k) V,

where the query Q is derived from the input cell embedding and the keys K and values V are derived from the condition representations. The feed-forward network (FFN) can be formulated as:

FFN(h) = W_2 σ(W_1 h + b_1) + b_2,

where σ is a nonlinear activation.

C DIFFUSION MODEL AS A CLASSIFIER

Li et al. (2023) introduced a density estimation approach to probe the class prediction of an input sample from a diffusion model. The core idea lies in evaluating the prediction error of the noise ϵ_t at various time steps t given different conditions c. The condition that leads to the smallest noise prediction error is then taken as the prediction for the input sample. Here, we adapt this approach to the predict-x_0 scenario. Recall that the posterior for a discrete class variable c can be described as:

p(c | x_0) = p(x_0 | c) p(c) / Σ_{c'} p(x_0 | c') p(c').

Under the predict-x_0 parameterization, the conditional likelihood p(x_0 | c) is scored through the reconstruction error ‖x̂_θ(x_t, t, c) − x_0‖² averaged over sampled time steps, and the class with the smallest error is returned.
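A hedged sketch of this predict-x_0 diffusion classifier follows; the `model(x_t, t, c)` interface, the number of sampled time steps, and the trick of sharing time steps and noise across classes are assumptions for illustration.

```python
import torch

@torch.no_grad()
def classify(model, x0, classes, alphas_bar, n_steps=20, T=1000):
    """Score each candidate class by its average reconstruction error and
    return the class with the smallest error (predict-x0 diffusion classifier)."""
    ts = torch.randint(0, T, (n_steps,))
    noises = [torch.randn_like(x0) for _ in range(n_steps)]   # shared across classes
    errors = []
    for c in classes:
        err = 0.0
        for ti, eps in zip(ts, noises):
            a_bar = alphas_bar[ti]
            x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
            x0_hat = model(x_t, ti.expand(x0.shape[0]), c)    # condition on class c
            err += ((x0_hat - x0) ** 2).mean().item()
        errors.append(err / n_steps)
    return classes[int(torch.tensor(errors).argmin())]
```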
E.1 IMPLEMENTATION DETAILS

Attention. Each cross-attention block uses 8 stacked attention heads. More specifically, each attention head is 64-dimensional, and we combine the outputs from all attention heads by concatenating them into a 512-dimensional vector.

Optimizer. We use decoupled Adam (AdamW) (Loshchilov & Hutter, 2018) to optimize the model's parameters. The learning rate is computed by scaling the base learning rate with the batch size: learning rate = base learning rate × batch size. We use a batch size of 2048 throughout the experiments.
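A minimal sketch of this setup is shown below; the placeholder model and the base learning rate value are illustrative, not values reported in the paper.

```python
import torch

model = torch.nn.Linear(2000, 512)   # placeholder standing in for scDiff parameters
base_lr, batch_size = 1e-6, 2048     # base_lr is an illustrative value
optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr * batch_size)
```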

E.2 DATASETS
We summarize the information of all processed datasets in Table 4.

E.3 CELL TYPE ANNOTATION
We present the macro F1 scores of the cell type annotation task in Table 5. They support the same conclusion: scDiff outperforms the baselines in four out of six datasets.

E.4 FEW-SHOT CELL TYPE ANNOTATION

We summarize the statistics of the datasets used in few-shot cell type annotation in Table 6. Under the few-shot setting, we pre-train scDiff for 1000 epochs with a base learning rate of 1e-8. Then we fine-tune scDiff on the few-shot set for 50 epochs, as mentioned in Section 4.2.1.
We illustrate the macro F1 score of one-shot cell type annotation in Fig. 5. Note that there is a significant drop in performance when the number of cell types reaches 5 in the Liver dataset. The fifth cell type is the B cell, and we observe that the plasma cell has already been included as the third cell type. Since B cells and plasma cells both belong to the B lymphocyte lineage (Hoffman et al., 2016), their gene expression levels are similar to each other. In the one-shot setting, two cells from different cell types that share similar expression measurements will confuse the model. This observation reveals one drawback of the diffusion classifier: classification becomes increasingly challenging when the distributions of different classes are indistinguishable. We summarize the macro accuracy score of few-shot cell type annotation in Fig. 6 and the macro F1 score in Fig. 7. We fix the number of target cell types to 5. On Brain and Immune, LLM-guided scDiff shows significant performance gains against class-guided scDiff. On the other two datasets, the two variants of scDiff achieve comparable performance.

E.5 NOVEL GENE PERTURBATION PREDICTION

We used the official GEARS code base to reproduce the performance on the three benchmarking datasets. We needed to rerun the experiments because the data splits used in the original paper were not published; thus, directly comparing our results with the reported metrics from the GEARS paper is infeasible. We obtained results comparable with the reported scores from the paper for the Adamson and Norman datasets. However, we could not produce reasonable results for the Dixit dataset and obtained negative Pearson correlations. Thus, we decided to directly copy the reported metrics for Dixit rather than using the doubtful results.

Baselines. We evaluate the performance of scDiff against representative cell type annotation methods. The baselines are listed as follows. CellTypist (Domínguez Conde et al., 2022) is an automated tool for cell annotation based on logistic regression. SingleCellNet (Tan & Cahan, 2019) utilizes random forests along with top-pair transformation. ACTINN (Ma & Pellegrini, 2020) is a three-layer MLP classifier. scANVI (Xu et al., 2021) is a variational autoencoder with auxiliary classifiers. All baselines are evaluated with the default settings provided by their authors.

Figure 2: Macro accuracy score of one-shot cell type annotation.

Figure 5: Macro F1 score of one-shot cell type annotation.

Figure 6: Macro accuracy score of few-shot cell type annotation on top 5 cell types.
Figure 7: Macro F1 score of few-shot cell type annotation on top 5 cell types.

Table 1: Macro ACC of cell type annotation.

Table 2: Results of imputation and perturbation prediction. (a) Pearson correlation of imputation. (b) Squared Pearson correlation of perturbation prediction.

Table 4: Summary of datasets.

Table 5: Macro F1 score of cell type annotation.

Table 6: Dataset statistics of few-shot annotation.