scELMo: Embeddings from Language Models are Good Learners for Single-cell Data Analysis

Various Foundation Models (FMs) have been built on the pre-training and fine-tuning framework to analyze single-cell data with different degrees of success. In this manuscript, we propose scELMo (Single-cell Embedding from Language Models), a method for analyzing single-cell data that utilizes Large Language Models (LLMs) both to generate descriptions of metadata information and to embed those descriptions. We combine the embeddings from LLMs with the raw data under the zero-shot learning framework, and further extend scELMo's function through a fine-tuning framework that handles different tasks. We demonstrate that scELMo is capable of cell clustering, batch effect correction, and cell-type annotation without training a new model. Moreover, the fine-tuning framework of scELMo can help with more challenging tasks including in-silico treatment analysis and modeling perturbation. scELMo has a lighter structure and lower resource requirements than recent large-scale FMs (i.e., scGPT [1], Geneformer [2]), while remaining comparable to them in our evaluations, suggesting a promising path for developing domain-specific FMs.


Introduction
Developing large language models (LLMs) or foundation models (FMs) has become critical for broad areas including, but not restricted to, engineering and the sciences [3][4][5]. In biology, FMs have been developed to analyze DNA sequences [6,7], represent cells and genes [1,2,8,9], and perform other tasks. It has been shown that FMs can facilitate various downstream biological analyses. Here we focus on a specific type of biomedical tabular data, known as single-cell sequencing data [10,11], for its intersection with FMs. Single-cell sequencing data describe biological activity and information at the cell level. The units of observation are individual cells, and the features can be gene expression levels [11], protein expression levels [12,13], methylation levels [14], and others.
Several pre-training-based FMs have been developed to analyze single-cell data based on large-scale sequencing datasets collected from different studies. However, the developers of GenePT [15] question the use of only sequence information for biological analysis, and conjecture that we can either use the information from the National Center for Biotechnology Information (NCBI) [16] as prompts and obtain the embeddings of such prompts from LLMs to describe the genes (GenePT-w), or transform each cell into a sentence by ranking its genes and use the ranking result as a prompt to extract cell embeddings from LLMs (GenePT-s). The results from GenePT can be further explored by incorporating pre-existing knowledge or literature on the features in single-cell datasets. However, GenePT has limited generation ability because it relies on external knowledge from NCBI. Moreover, the NCBI database may not be informative enough to summarize gene functions in a well-structured manner. On the other hand, using GenePT-s to generate cell embeddings is not very practical because of the high sparsity of single-cell data, which makes genes with zero expression hard to rank. GenePT-s is also not capable of handling large-scale datasets because of the usage and time limitations of the OpenAI API [17]. GenePT also discussed the zero-shot learning ability of such embeddings. Recently, access to LLMs including GPT 3.5 [18], GPT 4 [19], and LLaMa [20] offers us opportunities to explore the information of these features with the help of LLMs. These LLMs have been widely used to summarize knowledge [21], design network searching/models [22], and perform other tasks that enhance scientific research [23,24].
In this manuscript, motivated by GenePT, we explore the ability to use LLMs in a different manner. We generate meaningful text descriptions of cell-level or feature-level metadata, as well as embeddings of such descriptions, based on LLMs. We then assume that the embeddings from LLMs carry biological properties and can be utilized in various downstream applications. Here we introduce scELMo, a pipeline for analyzing single-cell multi-omic data based on the text descriptions and embeddings obtained directly from LLMs [25]. Using genes as one example, we use an LLM to summarize the functional information of a given gene with a suitable prompt and use the same LLM to extract the embeddings of that description. We then either incorporate the embeddings directly into the sequencing data by matrix manipulation or combine the embeddings with other models and fine-tuning targets for various challenging tasks. We demonstrate that scELMo is a simple but effective tool for single-cell data analysis under both the zero-shot learning framework and the fine-tuning framework.

Results
Overview of scELMo. The high-level idea of scELMo is to transfer the information of each cell from the sequencing data space to the embedding space of an LLM. We complete this transformation by incorporating information from the feature space (for example, genes and proteins) or the cell space (for example, cell types or cell states).
To transfer the feature information into the LLM embedding space, we can either use prior knowledge from a known database such as NCBI or the summary given by LLMs as a prompt, and then generate the embeddings for the prompt based on the LLM's embedding layers. Here we choose GPT 3.5 as the tool to summarize the functions of features and to generate embeddings, based on our evaluation in the Results section.
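The prompt-construction step can be sketched as follows. The exact wording of the prompt used by scELMo is not specified in this section, so the template below is a hypothetical stand-in; `build_gene_prompt` is an illustrative helper, not a function from the scELMo codebase.

```python
# Hypothetical sketch of the prompt-construction step. The actual prompt text
# used by scELMo may differ; this only illustrates the shape of the workflow.
def build_gene_prompt(gene_symbol: str) -> str:
    """Build a natural-language prompt asking an LLM to summarize a gene's function."""
    return (
        f"Please provide a concise, factual summary of the function of the "
        f"human gene {gene_symbol}, covering its molecular role and the "
        f"pathways it participates in."
    )

prompt = build_gene_prompt("GSN")
```

The resulting text would then be sent to the LLM, and the returned summary passed to the same model's embedding endpoint.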
After obtaining embeddings for features or cells, we can use them under the zero-shot learning framework for clustering or batch effect correction. Moreover, we can also combine the embeddings with known models to improve their performance on various downstream tasks. Figure 1 shows the flowchart of scELMo under the two frameworks. The details of scELMo under different frameworks are summarized in the Methods section.
Evaluation of Hallucinations in LLMs. The first step of our work is to choose a suitable LLM for generating text descriptions and embeddings. A suitable LLM should not suffer from hallucinations [26], that is, generating fake or incorrect information about the given feature or cell information. Here we consider GPT 2 [27], GPT 3.5, GPT 4, LLaMa-2 (70B), Mistral [28], bioGPT [29], Claude 2 [30], and Bard (PaLM 2) [31] as candidates to generate text descriptions. Considering the diversity of tokens as well as access to embedding layers, we chose GPT 3.5 to generate embeddings. We randomly sampled 20 proteins and 20 genes from the known proteins (∼200 proteins from [32]) and genes (∼30000 genes from NCBI), and used the LLMs mentioned above to generate text descriptions of these features. We evaluated the correctness of the outputs by comparing them with known databases (GeneCards [33] and NCBI). The proportion of correct outputs is shown in Figure 2. Based on our results, GPT 3.5 and GPT 4 gave us the most accurate text descriptions. However, GPT 4 required longer query time, and some of the outputs from GPT 4 did not follow the format required in the prompt. Considering the trade-off between query time and the proportion of correct outputs, we selected GPT 3.5 as our tool to generate text descriptions of features. Moreover, we also evaluated the stability of the embeddings of the LLM's outputs based on the same prompt for the same gene, as well as the similarity of embeddings from different genes. Our results are summarized in Extended Data Figures 1 (a) and (b). According to Extended Data Figure 1 (a), the correlation between the embeddings from different LLM outputs of the same gene was higher than 0.9, suggesting that the embeddings from GPT 3.5 have high stability and may capture the intrinsic functions of each gene. Based on Extended Data Figure 1 (b), the embeddings from different genes had lower correlation ([0.74, 0.84]) compared with the embeddings from the same gene ([0.88, 0.99]).
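The stability check described here can be sketched numerically. The vectors below are synthetic stand-ins for the real GPT 3.5 embeddings; the point is only to show the Pearson-correlation comparison between repeated outputs for the same gene and outputs for different genes.

```python
import numpy as np

# Toy sketch of the stability check: correlate the embedding of one LLM output
# with (i) the embedding of a near-identical repeated output for the same gene
# and (ii) the embedding of an unrelated gene's description. Synthetic data.
rng = np.random.default_rng(0)
base = rng.normal(size=256)                              # embedding of one output
same_gene_run = base + rng.normal(scale=0.05, size=256)  # repeated output, same gene
other_gene = rng.normal(size=256)                        # unrelated description

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

r_same = pearson(base, same_gene_run)
r_other = pearson(base, other_gene)
```

Here `r_same` is close to one while `r_other` is close to zero, mirroring the pattern reported in Extended Data Figure 1.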
Therefore, the embeddings from GPT 3.5 may also capture the functional heterogeneity of different genes. Furthermore, we loaded the gene functional annotation provided by Geneformer [2] to predict the functional information for the other genes.

scELMo for clustering and batch effect correction. In this section, we investigated the contribution of feature embeddings from scELMo to cell-level tasks including clustering and batch effect correction.
Clustering is effective for checking whether feature embeddings carry biological information. We incorporated gene embeddings or protein embeddings into the single-cell sequencing data and evaluated the performance of clustering based on cell embeddings. The metrics we chose include NMI, ARI, and ASW [35], which are widely used in the evaluation of clustering for single-cell data. In the following analysis, if not otherwise specified, GenePT means GenePT-w. Since we generated gene embeddings for gene names from both NCBI and Ensembl [36], we had more genes than GenePT. Moreover, when incorporating the feature embeddings, GenePT utilized a naive arithmetic average (aa) method: it computed the embeddings for one cell by directly multiplying the gene expression vector of this cell with the matrix of gene embeddings from GPT 3.5. The aa mode ignores the scales of log-normalized gene expression levels across different cells. However, the level of gene expression is a key factor affecting cellular function [37,38]. Different from GenePT, we treated the gene expression levels as weights and computed the weighted average (wa) value for each cell as an alternative averaging approach. We implemented these approaches for generating cell embeddings; details are included in the Methods section.
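The two averaging modes can be sketched in a few lines of NumPy. `X` is a (cells × genes) log-normalized expression matrix and `E` is a (genes × d) matrix of per-gene LLM embeddings; the values here are toy data, not real expression or embedding values.

```python
import numpy as np

# Toy illustration of the aa and wa modes for building cell embeddings.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])               # 2 cells, 3 genes
E = np.arange(12, dtype=float).reshape(3, 4)  # 3 genes, embedding dim d = 4

def cell_embeddings_aa(X, E):
    """Arithmetic average (GenePT-style): divide by the number of genes."""
    return (X @ E) / X.shape[1]

def cell_embeddings_wa(X, E):
    """Weighted average: expression levels act as per-cell weights."""
    weights = X / X.sum(axis=1, keepdims=True)
    return weights @ E

emb_aa = cell_embeddings_aa(X, E)  # shape (2, 4)
emb_wa = cell_embeddings_wa(X, E)  # shape (2, 4)
```

The wa mode normalizes each cell by its own total expression, so cells with very different library sizes still land on a comparable scale, which the aa mode does not guarantee.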
Figure 2 (c) shows that using the wa mode was better than using the aa mode for clustering. Moreover, if we combined the cell embeddings with the embeddings of cell-type information from GPT 3.5, we could get scores close to one, suggesting that the cell-type embeddings contained meaningful cell-type information. Using embeddings from scELMo also improved the clustering performance compared with the embeddings from GenePT, advocating LLMs as a tool for summarizing scientific concepts. However, various approaches for combining the embeddings from GenePT and GPT 3.5 (e.g., GPT 3.5 + GenePT wa means summation by genes, and GPT 3.5 || GenePT wa means concatenation by genes) did not improve the scores compared with the individual settings. The average ranks of all methods across different datasets are summarized in Extended Data Figure 3 (a), with the GPT 3.5 wa mode having the lowest rank. Appendix C summarizes factors that could affect clustering performance.
For the batch effect correction task, because GenePT had already evaluated the performance of such embeddings for single-cell RNA sequencing (scRNA-seq) data, we focused on single-cell proteomic data, such as data collected from CITE-seq [13] and CyTOF [12]. Here we considered two scenarios: we used the embeddings from scELMo and GenePT to reduce the batch effect for datasets from the same or different protocols. To evaluate the performance of batch effect correction, we used metrics from scIB [35] and considered the contribution of our embeddings to reducing the batch effect (denoted S_batch) and to preserving the biological variation (denoted S_bio) separately. Figures 3 (a) and (b) display the results for batch effect correction of two CITE-seq datasets, and Figures 3 (c) and (d) display the results for batch effect correction of one CITE-seq dataset and one CyTOF dataset. These four figures show that the aa approach did not improve the performance of scELMo on this task.
Moreover, the aa approach could reduce S_batch and S_bio simultaneously compared with the scores of the raw dataset under the CyTOF+CITE-seq case, thus leading to a larger batch effect. On the other hand, the wa approach led to a reduction of the batch effect, as shown by the improvement of these two scores. Incorporating the cell-type embeddings under the aa mode also did not improve S_batch and S_bio jointly. However, combining the cell-type embeddings with the cell embeddings under the wa mode improved the S_bio score in both cases. Therefore, finding a good base embedding space is important for utilizing cell-state embeddings. As an extension, we summarize our analysis of multi-omic data integration [39] in Appendix D.
scELMo for cell-type annotation. Cell-type annotation is a critical component of single-cell analysis [40]. Here we investigated the ability of scELMo to annotate cells under both the zero-shot learning framework and the fine-tuning framework. For the zero-shot learning framework, we incorporated the gene embeddings into both the training and testing datasets to obtain cell embeddings and used a kNN classifier implemented by GenePT to annotate cells in the testing dataset. However, this approach failed when the training datasets came from multiple sources or had large batch effects (shown in the PBMC section of Table 1). Therefore, we constructed a simple classifier based on neural networks [41] and contrastive learning [42]. We then trained the model based on gene embeddings and expression profiles. Such a model is also known as an adaptor in Natural Language Processing (NLP) [25]. We tested the ability of the cell embeddings from this model to annotate cell types.
Table 1 shows that the zero-shot learning ability of embeddings from GPT 3.5 is impressive for annotating cells in the hPancreas dataset and the Aorta dataset. In contrast, representing cells by gene ranks and the representations from GPT 2 or GPT 4 did not perform well. Therefore, zero-shot learning based on gene embeddings is capable of cell-type annotation for datasets with minimal batch effect. However, for the PBMC dataset, which consisted of datasets from different sources, there was a clear gap between zero-shot learning results and fine-tuning results. Moreover, the fine-tuning results obtained by combining the gene embeddings from GPT 3.5 or GenePT with our adaptor are comparable with FMs like scGPT [1] and Geneformer [2]. However, scGPT and Geneformer need more resources, including an NVIDIA A100 GPU and longer running time for fine-tuning [43], as further shown in Extended Data Figures 4 (a) and (b). For all the datasets we tested, scELMo under the fine-tuning framework performed better than the same settings under the zero-shot learning framework.
Moreover, scELMo could improve the annotation score for embeddings from different sources. Our results show that cell-type annotation based on fine-tuning the adaptor with gene embeddings from LLMs is accurate and easy to implement.
scELMo for in-silico treatment analysis. Using computational methods to discover novel target therapies and drugs has attracted much attention [44][45][46]. In this section, we study the ability of scELMo to use the adaptor for cell-type annotation and the gene embeddings we extracted from GPT 3.5 to model human diseases and reveal potential therapeutic targets. Inspired by Geneformer [2], we fine-tuned our adaptor on the classification task of cell conditions and used the cell embeddings from the adaptor for therapeutic target discovery. We utilized the training-validation framework to choose the adaptor with the best performance and tested the change of embeddings after removing the differentially expressed genes (DEGs) in the expression levels across different conditions. We discuss the importance of fine-tuning a model in Appendix E. If we remove a suitable candidate for targeted therapies, the cell embeddings under the disease condition will become more similar to those under control. The metric we use to evaluate similarity is the cosine similarity between the embeddings under the diseased case and the average cell embedding under the control case. We evaluated the discovered targets through a literature review, as summarized in Supplementary File 2.

(Figure 4 (b) caption: The change of CS by removing different genes in the expression space for ascending aortic aneurysm (Ascending only, Ascending to descending, and Ascending w/ root). Genes detected by both GenePT and scELMo are highlighted with stars (*); genes identified by previous research as therapeutic targets are marked in bold.)
We first considered hypertrophic or dilated cardiomyopathy states [47] using the scRNA-seq Heart dataset [48]. We identified genes whose in-silico deletion in the disease conditions could significantly shift the cell embeddings from the disease conditions toward the non-failing (NF) or control condition. The change of cosine similarity before and after deletion for the cardiomyopathy states is summarized in Figure 4 (a). Based on this figure, our adaptor identified two novel potential therapeutic targets for cells under the DCM case and four novel targets for cells under the HCM case. Silencing the genes identified by scELMo may help shift cells from the disease conditions toward the control condition. There is literature supporting our discovery of ANKRD1 [49], EXT1 [50], NPPB [51], and TTTY10 [52]. Moreover, there is supporting evidence for GSN [2] as a therapeutic target from CRISPR-based technology [53].
These findings suggest the potential usefulness of genes selected by scELMo.
Then we considered ascending aortic aneurysm [54], which has three different disease states. The dataset we used is the scRNA-seq Aorta dataset [55]. We used a similar approach to the one discussed above and summarize the change of cosine similarity before and after deletion in Figure 4 (b). scELMo identified six novel genes as potential therapeutic targets for cells under the Ascending only state, two for cells under the Ascending to descending state, and three for cells under the Ascending w/ root state.
Interestingly, the MT-ATP6 gene was selected by scELMo as a potential target, which implies that cells under different states might have different rates of cell death [56] or neurodegeneration [57]. These results suggest the potential benefit of exploring the correlation between cellular function and the clinical manifestations of this disease.
scELMo for perturbation analysis. Analyzing the perturbation effect on cell state using single-cell data is also an important task. Here we focus on three tasks across different perturbation types, for example, cell-level perturbation [61] and gene-level perturbation [62]. Our idea is to use the cell embeddings or gene embeddings generated under the zero-shot learning framework of scELMo to replace, or combine with, the original input of different models to enhance their performance on three tasks: causal factor analysis, gene expression prediction under cell-level perturbation, and gene expression prediction under gene-level perturbation. We evaluated the contribution of scELMo to the causal factor analysis task based on CINEMA-OT [60], to the gene expression prediction task for cell-level perturbation data based on CPA [63], and to the gene expression prediction task for gene-level perturbation data based on GEARS [64]. All of the metrics we considered in these three tasks are from these models.
CINEMA-OT is a tool based on a causal learning framework and optimal transport to distinguish between perturbation effects and intrinsic cell-state effects for scRNA-seq datasets. We replaced the input of CINEMA-OT with cell embeddings from scELMo and investigated the contribution of our cell embeddings by evaluating the reduction of batch effect in the space of intrinsic cell-state effects (also known as the confounder space). After removing the perturbation effect, the perturbation label becomes a batch label in the intrinsic cell-state space, so we should not observe separation by perturbation in this space. Incorporating the cell embeddings into CINEMA-OT did not affect the analysis of the gene synergy effect, and the difference between the synergy of monocytes and that of other cell types remained obvious. For the causal factor analysis task, using the wa mode did not improve the score. One possible reason is that CINEMA-OT can learn the best representation of cell embeddings given a good start for optimization, so the aa mode is sufficient. Therefore, scELMo can improve the performance of CINEMA-OT on the causal factor analysis task by offering a new candidate input.
CPA is a tool based on the Conditional Variational Auto-encoder (CVAE) to predict gene expression levels for out-of-distribution (OOD) samples of scRNA-seq data under certain perturbations. Here we combined the gene embeddings from scELMo with the original input dataset and learned a new latent space for gene expression prediction. We investigated the contribution of gene embeddings by comparing the R2 score between these two settings. We computed the R2 score based on the predicted gene expression levels and the observed gene expression levels. Figure 5 (c) shows the performance of CPA under different methods for two datasets. For the CPA example dataset shown in the left panel, using the cell embeddings from the GPT 3.5 wa mode in the training process slightly improved the average R2 score, while its median value was still lower than that of the default mode; the results were therefore comparable across methods. However, for the Openproblems dataset [65] shown in the right panel, using cell embeddings from both GenePT and GPT 3.5 improved the performance of CPA: the R2 score obtained by combining cell embeddings with CPA had a higher average value and lower variance compared with the default mode. Therefore, scELMo could improve the performance of CPA on the prediction task by introducing the cell embeddings into the training process.
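The evaluation step described here, comparing predicted and observed expression with the R2 score, can be sketched with Scikit-learn. The arrays below are toy values, not real CPA outputs.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy sketch of the R2 evaluation used for the CPA comparison: score predicted
# expression values against observed ones for a held-out perturbation.
observed = np.array([0.0, 1.0, 2.0, 3.0])
predicted = np.array([0.1, 0.9, 2.1, 2.9])

score = r2_score(observed, predicted)  # 1 - SS_res / SS_tot, close to 1 here
```

An R2 score of 1 would mean the predictions reproduce the observed values exactly; a score near 0 means the model does no better than predicting the mean.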
GEARS is a tool based on Graph Neural Networks (GNNs) [66] to predict gene expression levels for Perturb-seq-based datasets. Here we combined the gene embeddings from scELMo with the original gene embeddings of GEARS to learn the predicted values of target genes. We studied the contribution of gene embeddings by comparing the Pearson Correlation Coefficient (PCC) and Mean Squared Error (MSE) between the default and updated settings, both when all genes were considered and when only DEGs were considered. The results are summarized in Figure 5 (d). For the Dixit dataset, using gene embeddings from either GPT 3.5 or GenePT improved the gene expression prediction made by GEARS for both the all-genes case and the DEGs case. However, for the Adamson and Norman datasets, the contribution of the gene embeddings was not significant. Moreover, the gene embeddings from GenePT were always better than those from GPT 3.5 across these three datasets, which implies that information from NCBI or curated databases might be more important than LLM-summarized functional information for the expression prediction task on Perturb-seq-based datasets.

Discussion
Developing methods to model genetic and cellular functionality is an important task in computational biology. Although a common practice is to propose a biological question and design relevant experiments to answer it, recent years have seen the development of foundation models that may help address many known and unknown biological questions. In the single-cell data analysis area, many researchers have proposed to pre-train a large-scale model and declare it a foundation model by showing its performance is comparable to different task-specific models on specific downstream tasks. However, it is hard to find tasks that can only be resolved through such a pre-training and fine-tuning framework [43], which calls into question the value of consuming large-scale resources to develop such models. Therefore, inspired by GenePT, we explored another approach to generalizing a foundation model in the single-cell research area: utilizing LLMs to generate meaningful feature embeddings and cell embeddings. We can either directly utilize these embeddings for clustering or reducing batch effects, or combine them with task-specific models to improve their performance. These two approaches are integrated in scELMo. Accessing the outputs of known LLMs like GPT 3.5 does not require many resources, and our results show the strength of scELMo in answering multiple biological questions.
For scELMo under the zero-shot learning framework, we could utilize cell embeddings to perform clustering and batch effect correction. These contributions are based on the fact that embeddings of the text descriptions of features in single-cell datasets are good representations of biological concepts or functions. We also discussed the factors that could affect the performance of generating meaningful cell embeddings, including the approach to computing the average embeddings, the number of cells in one dataset, the number of features we need, and other factors. We also showed that such embeddings could be used for multi-omic data analysis, which illustrates the power of using LLMs as tools for incorporating prior information to enhance task-specific analysis.
Considering the limitations of the zero-shot learning framework, we also proposed a fine-tuning framework for scELMo. By combining the feature embeddings from GPT 3.5 with a light-structured neural network, we could use the embeddings to annotate cell types with performance similar to that of FMs that require much more resources for pre-training and fine-tuning. Moreover, scELMo can also be used for detecting novel therapeutic targets by examining the change of embeddings corresponding to the removal of certain genes, supported by related biological experiments. We could also directly incorporate the cell embeddings or feature embeddings into task-specific models for better performance in modeling data with perturbations. In-silico treatment analysis and perturbation analysis are two challenging and important tasks requiring cell-level knowledge, which further supports the potential of scELMo and related work.
However, both GenePT and scELMo have the following limitations. Firstly, the rapid development of LLMs will likely lead to embeddings better than those from GPT 3.5; with a more powerful LLM, we will have better representations of features and cells. Secondly, LLMs cannot generate meaningful information for genes that were only recently discovered or analyzed. Although GPT 3.5 does not make up concepts for genes, the lack of knowledge still raises a question for the applications of scELMo. Thirdly, without enough resources, fine-tuning LLMs to generate better domain-specific embeddings is hard to deploy; moreover, the training datasets for fine-tuning domain-specific LLMs are also hard to collect. Finally, extracting features from other biomedical data such as GWAS [67] or scATAC-seq [68] data will be difficult, since the number of features in these data is quite large.
Therefore, in the future, we plan to generate a database containing the text descriptions of features from different LLMs as well as the embeddings of such descriptions. Models like CPA can utilize these embeddings to perform prediction in the OOD cases. Moreover, we will continue to search for a more practical approach to offering a gene-specific LLM. In addition, since the idea behind scELMo is capable of modeling arbitrary biomedical data in tabular format, we believe that extending the usage of scELMo to other tasks or areas is also very promising.

Methods
Problem definition. For a typical single-cell dataset X_{n×m} after normalization [69], with n cells and m features, our target is to utilize the text descriptions from a mapping function M(·) for the feature-level metadata information f_{m×1} and the cell-level metadata information c_{n×1} to learn the embeddings of cells. If we define the embedding generation layer of M(·) as M_e(·), our cell embeddings can be represented as

e_c = M_e(M(Prompt(c))),

where Prompt(·) is a mapping function that transfers the name of the input data to the prompt space; the prompts can be used as the input of language models. The function AVG(·) represents the method we use to average the embeddings of all genes for each cell. If the mode is aa, we divide X by m; if the mode is wa, we divide each row of X by the sum of that row. Considering the cell with index i and the embedding e_{f_j} of feature j, we can define these two processes as

e_{cell_i}^{aa} = (1/m) · Σ_{j=1}^{m} X_{ij} e_{f_j},    e_{cell_i}^{wa} = (Σ_{j=1}^{m} X_{ij} e_{f_j}) / (Σ_{j=1}^{m} X_{ij}).
Then we use matrix multiplication to combine the feature embeddings and the expression profile. Our default setting of the mapping function is an LLM. GenePT can be treated as a special case of scELMo, that is, replacing the LLM with a known database and using the aa mode. Incorporating the embeddings of cell-level metadata is optional. We intend to investigate whether the cell embeddings can offer a better representation than the raw data.
Moreover, with the embeddings of feature-level and cell-level metadata information, we also consider whether incorporating our embeddings into a task-specific model T can improve the performance of T, that is, whether

Score(T(X, e_f, e_c)) > Score(T(X)),

where Score(·) is a metric evaluating the output of the given model, with a higher value representing better output. We may not need both e_f and e_c for every model.
scELMo under the zero-shot learning framework. To evaluate the performance of e_cells under the zero-shot learning framework, we consider three tasks: clustering, batch effect correction, and cell-type annotation based on a kNN classifier.
In this section, T is a kNN classifier.
For clustering and batch effect correction, we directly use e_cells as a new representation of X and evaluate e_cells on these two tasks. For cell-type annotation based on a kNN classifier, we consider a training dataset X_train and a testing dataset X_test, and their corresponding cell embeddings e_cells^train and e_cells^test. We use e_cells^train with its cell types to train a kNN classifier and perform cell-type annotation based on e_cells^test. Since kNN is based on similarity search, we treat this method as a form of zero-shot learning. We follow the settings from GenePT for this classifier and set k = 10.
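The kNN transfer step above can be sketched with Scikit-learn. The embeddings below are synthetic two-cluster toy data standing in for real cell embeddings; k = 10 follows the GenePT setting mentioned in the text.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy sketch of zero-shot annotation: fit kNN (k = 10) on training-cell
# embeddings with known labels, then transfer labels to test-cell embeddings.
rng = np.random.default_rng(1)
train = np.vstack([rng.normal(0, 0.1, (50, 8)),   # cluster for one cell type
                   rng.normal(3, 0.1, (50, 8))])  # cluster for another
labels = np.array(["T cell"] * 50 + ["B cell"] * 50)
test = np.vstack([rng.normal(0, 0.1, (5, 8)),
                  rng.normal(3, 0.1, (5, 8))])

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(train, labels)
pred = knn.predict(test)
```

Because the classifier only searches for similar training cells, no model parameters are fit to the test data, which is why this counts as zero-shot transfer of the embedding space.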
scELMo under the fine-tuning framework. To evaluate the performance of e_f and e_c under the fine-tuning framework, we considered three tasks: cell-type annotation with an adaptor, in-silico treatment analysis with an adaptor, and perturbation analysis with task-specific models as adaptors.
For the cell-type annotation and in-silico treatment analysis tasks, we propose a light-structured neural network with a contrastive learning [42,70] design. Here T is a neural network with ReLU [71] as the activation function. Our intuition comes from the requirement for a good representation of cells with different labels and conditions. Therefore, we formalize the loss function of our model as

L = L_classifier + λ · L_contrastive,

where L_classifier represents the classification loss of the model output, as we use cell-type labels for model training, and L_contrastive represents the contrastive learning loss we use to distinguish the representations of cells under different conditions in the latent space. λ is a hyper-parameter; we set λ = 100 in this manuscript to assign a larger weight to label-aware clustering. We utilize the embeddings after fine-tuning as the training and testing datasets and use a kNN classifier to annotate the cell types, in order to evaluate the representation we learn based on scELMo.
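The combined objective can be sketched as follows. The exact contrastive formulation of the adaptor is not spelled out in this section, so a simple pairwise margin loss is used here as an illustrative stand-in; the cross-entropy term and all numeric values are toy data.

```python
import numpy as np

# Minimal numpy sketch of L = L_classifier + lambda * L_contrastive.
# The pairwise margin loss below is an assumed stand-in, not scELMo's exact loss.
def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true class."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def pairwise_contrastive(z, labels, margin=1.0):
    """Pull same-label embeddings together, push different-label pairs apart."""
    loss, count = 0.0, 0
    for i in range(len(z)):
        for j in range(i + 1, len(z)):
            d = np.linalg.norm(z[i] - z[j])
            loss += d**2 if labels[i] == labels[j] else max(0.0, margin - d) ** 2
            count += 1
    return loss / count

probs = np.array([[0.9, 0.1], [0.2, 0.8]])  # toy classifier outputs
labels = np.array([0, 1])
z = np.array([[0.0, 0.0], [1.0, 1.0]])      # toy latent embeddings
lam = 100.0                                  # the manuscript sets lambda = 100

total = cross_entropy(probs, labels) + lam * pairwise_contrastive(z, labels)
```

In this toy case the two differently labeled embeddings are already farther apart than the margin, so the contrastive term vanishes and the total reduces to the classification loss.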
To analyze targets for in-silico treatment, we first compute the cosine similarity (CS_old) between the average cell embeddings of the diseased case and the control case. Then we delete the target gene by setting its expression profile to zero and compute the new embeddings and cosine similarity (CS_new). We define the score of our targeted gene g as

Score(g) = CS_new − CS_old.

If this score is larger than 1e-4, we treat the gene we analyze as a candidate therapeutic target. This threshold is based on the upper bound of the tiny quantities determined by NumPy [72] for scientific notation representation and the smallest nonzero scale of the y-axis in Figures 4 (a) and (b).
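The in-silico deletion test can be sketched end to end with NumPy. The gene embedding matrix, control mean, and diseased expression vector below are toy data; wa-mode averaging is used for the cell embedding, following the rest of the manuscript.

```python
import numpy as np

# Toy sketch of the in-silico deletion score: zero out one gene, recompute the
# (wa-mode) cell embedding, and score the gene by the change in cosine
# similarity to the mean control embedding.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def wa_embedding(x, E):
    return (x / x.sum()) @ E

E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # 3 genes, d = 2
control_mean = np.array([0.0, 1.0])                  # mean control embedding
disease = np.array([5.0, 1.0, 1.0])                  # gene 0 dominates this cell

cs_old = cosine(wa_embedding(disease, E), control_mean)
deleted = disease.copy()
deleted[0] = 0.0                                     # in-silico deletion of gene 0
cs_new = cosine(wa_embedding(deleted, E), control_mean)

score = cs_new - cs_old  # > 1e-4 would flag gene 0 as a candidate target
```

Here deleting the dominant gene moves the diseased cell's embedding toward the control mean, so the score is positive and the gene would be flagged as a candidate.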
For the perturbation analysis task, we consider three different models for the three tasks. Here T represents the different models corresponding to different perturbation analysis tasks. For the causal factor analysis task and CINEMA-OT, we replace the original input of CINEMA-OT with e_cells. For the gene expression prediction task and CPA, we add a new neural network component to make e_cells learnable and combine the output of this component with the latent space of the original CPA. We do not modify the training process of CPA. For the gene expression prediction task based on perturb-seq-based datasets and GEARS, we add e_f to the original gene embeddings of GEARS. We do not modify the training process of GEARS.
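The GEARS-style injection of LLM feature embeddings amounts to adding e_f to the model's own gene embeddings. The sketch below illustrates this with a fixed random projection standing in for the dimension-matching step (all shapes and the projection are assumptions; GEARS' actual embedding dimension and any learnable projection are implementation details not given here).

```python
import numpy as np

rng = np.random.default_rng(0)
model_gene_emb = rng.normal(size=(2000, 64))   # hypothetical task-model gene embeddings
e_f = rng.normal(size=(2000, 1536))            # LLM feature embeddings (e.g. GPT 3.5 output dim)

# Project e_f to the model's embedding dimension, then add.
# In practice this projection would typically be a learnable linear layer.
W = rng.normal(size=(1536, 64)) / np.sqrt(1536)
augmented = model_gene_emb + e_f @ W
```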
Data pre-processing. We follow the data pre-processing steps from Scanpy [69] for scRNA-seq datasets. For single-cell proteomic datasets, we follow the pre-processing steps from TotalVI [73] and MARIO [74] and do not change the distribution of the original data because such data are already dense.
Metrics. For the evaluations of clustering and batch effect correction, we utilize the metrics described and implemented by scIB [35]. We compute all the metrics available in the evaluation process. All of the scores of the scIB metrics are in [0,1], and a higher value means better performance.
For clustering, we use NMI, ARI, and ASW_label for evaluation.
For batch effect correction, we compute ASW_batch, PCR, Graph Connectivity, kBET, and iLISI and average the scores from these metrics to generate S_batch. We compute ASW_label, NMI, ARI, and cLISI and average the scores from these metrics to generate S_bio.
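The aggregation into S_batch and S_bio is a plain average of the per-metric scores. A minimal sketch (the metric values below are illustrative placeholders, not real results; the actual values come from scIB):

```python
import numpy as np

# Illustrative scIB metric values in [0, 1] (placeholders, not real results).
batch_metrics = {"ASW_batch": 0.8, "PCR": 0.7, "GraphConnectivity": 0.9,
                 "kBET": 0.6, "iLISI": 0.5}
bio_metrics = {"ASW_label": 0.7, "NMI": 0.8, "ARI": 0.75, "cLISI": 0.9}

# S_batch and S_bio are the averages of the two metric groups.
S_batch = float(np.mean(list(batch_metrics.values())))
S_bio = float(np.mean(list(bio_metrics.values())))
```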
For the evaluations of cell-type annotation, we use Scikit-learn [75] to calculate Accuracy, Precision, Recall, and F1 score by comparing the predicted cell-type labels with the ground-truth cell-type labels. All of the metrics are in [0,1], and a higher value means better performance.
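For reference, these four scores can be computed as follows. This pure-NumPy version with macro averaging is a stand-in for the Scikit-learn calls (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`); whether the paper used macro or weighted averaging is not stated, so macro averaging is an assumption here.

```python
import numpy as np

def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    acc = float((y_true == y_pred).mean())
    ps, rs, fs = [], [], []
    for c in np.unique(y_true):
        tp = float(((y_pred == c) & (y_true == c)).sum())
        fp = float(((y_pred == c) & (y_true != c)).sum())
        fn = float(((y_pred != c) & (y_true == c)).sum())
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        ps.append(p); rs.append(r)
        fs.append(2 * p * r / (p + r) if p + r else 0.0)
    return acc, float(np.mean(ps)), float(np.mean(rs)), float(np.mean(fs))
```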
For the evaluations of in-silico treatment analysis, we use SciPy [76] to compute the cosine similarity between the mean cell embeddings from the control case and the mean cell embeddings from the diseased case. The definition of the score is described in the Methods section.
For the evaluations of perturbation analysis, we have three different tasks with different metrics. For the causal factor analysis task, the metrics are the same as those we use in the batch effect correction task. For the gene expression prediction task based on CPA, we use the R2 score as a metric to evaluate the regression performance. The R2 score is defined as

R2 = 1 − Σ_i (y_i − f_i)² / Σ_i (y_i − ȳ)²,

where y_i represents the ground-truth gene expression level, f_i represents the predicted gene expression level, and ȳ represents the average expression level of the given gene.
A higher average R2 score and a lower variance mean better performance.
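The R2 definition above translates directly into code (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def r2_score(y, f):
    """R2 = 1 - sum((y - f)^2) / sum((y - ybar)^2)."""
    y, f = np.asarray(y, float), np.asarray(f, float)
    ss_res = ((y - f) ** 2).sum()
    ss_tot = ((y - y.mean()) ** 2).sum()
    return 1.0 - ss_res / ss_tot

perfect = r2_score([1, 2, 3, 4], [1, 2, 3, 4])  # perfect prediction gives 1.0
```

Note that R2 can be negative when the prediction is worse than simply predicting the mean expression level.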
For the evaluation of gene expression prediction tasks based on GEARS, we use PCC and MSE as metrics. We define PCC as

ρ = cov(y, f) / (σ_y · σ_f),

where y and f are the ground-truth and predicted expression levels over the m genes we used, cov represents the covariance, and σ represents the standard deviation. ρ is in [−1, 1], and a higher value means better performance. We define MSE as

MSE = (1 / (m · n)) Σ_i Σ_j (y_ij − f_ij)²,

where y_ij represents the ground-truth gene expression level of gene i in cell j, f_ij represents the predicted gene expression level of gene i in cell j, and n is the number of cells. A lower MSE means better performance.
We compute these two metrics both for all genes and for the top 20 DEGs.
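Both metrics and the DEG-restricted variant can be sketched as below (a minimal NumPy version; restricting to the top 20 DEGs is assumed to be simple column selection over precomputed DEG indices):

```python
import numpy as np

def pcc(y, f):
    """Pearson correlation: cov(y, f) / (sigma_y * sigma_f)."""
    y, f = np.asarray(y, float).ravel(), np.asarray(f, float).ravel()
    return float(np.cov(y, f)[0, 1] / (y.std(ddof=1) * f.std(ddof=1)))

def mse(Y, F):
    """Mean squared error over all genes i and cells j."""
    Y, F = np.asarray(Y, float), np.asarray(F, float)
    return float(((Y - F) ** 2).mean())

# For the top-20-DEG variant, with deg_idx the DEG column indices:
#   pcc(Y[:, deg_idx], F[:, deg_idx]); mse(Y[:, deg_idx], F[:, deg_idx])
```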
For the text description from NCBI, we need to clean the format of the original data and transfer it into text representations. As a result, the sentences and words in this text are incoherent, and their format does not strictly follow grammatical conventions. Such a difference poses a challenge for aligning the model with human-written text. Moreover, the text descriptions from NCBI focus specifically on gene names, symbols, RefSeq status, and other properties. They also contain detailed functional information for the given gene, some of which may be redundant. One advantage of the text representation from NCBI is its authority: for researchers, the reliability of text descriptions from NCBI is generally greater than that of the outputs from LLMs.
For the text description from GPT 3.5, we can see that the sentences and words are coherent and grammatically well-formed. This text focuses more on an overview of the functions of the given gene, including its major functional tissues and cell types (for cell-type marker genes like CD79). Moreover, this text also includes the relation between COL1A1 and certain diseases, highlighting the potential of this gene as a therapeutic target. Such prior information will be helpful in in-silico treatment analysis research. Moreover, different prompts can generate different types of descriptions for the same gene, so exploring the diversity of the outputs from LLMs is also an interesting research track.
Therefore, these two types of text descriptions have their own advantages and disadvantages, which might explain the scenarios each is best suited to. One interesting research topic is how to combine the advantages of these two kinds of text descriptions to enhance downstream applications.
Here is an example of the text description of B cells from GPT 3.5. We find that this text description is also coherent and follows standard grammar. It summarizes the major functions of B cells as well as the cell-cell communication that B cells are involved in.
We also found that the wa mode was not suitable for this analysis because filtering some genes might leave cells with zero total expression. There is no obvious correlation between the number of recorded genes and the clustering performance. Moreover, since the sources of GenePT or scELMo do not match all of the genes for every scRNA-seq dataset, we sometimes need to fill the gene embeddings of missing genes with zeros. Therefore, we also investigated the relation between the number of matched genes and the clustering performance, shown in Extended Data Figure 9 (c). From this figure, we still did not observe a strong correlation between the number of matched genes and the clustering performance under the aa mode. However, for the wa mode, we found an obvious correlation between these two values. Therefore, having more matched genes can contribute to cell clustering under the wa mode of scELMo. This conclusion also demonstrates the importance of extending our databases of feature embeddings.
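The zero-filling of unmatched genes and the wa/aa aggregation modes can be sketched as follows. The exact definitions of the two modes are not restated in this section, so the implementation below (wa = expression-weighted average of gene embeddings, aa = unweighted average over genes expressed in the cell) is an illustrative assumption; note how the wa mode's division by total expression fails for cells with zero expression, matching the caveat above.

```python
import numpy as np

def cell_embeddings(X, genes, emb_table, mode="wa"):
    """Build cell embeddings from gene embeddings (assumed wa/aa definitions).
    Genes missing from emb_table are zero-filled, as described in the text."""
    dim = len(next(iter(emb_table.values())))
    G = np.array([emb_table.get(g, np.zeros(dim)) for g in genes])
    if mode == "wa":
        # Expression-weighted average; undefined for cells with zero
        # total expression, hence the caveat about the wa mode.
        return (X @ G) / X.sum(axis=1, keepdims=True)
    # aa: plain average over genes expressed in each cell.
    mask = (X > 0).astype(float)
    return (mask @ G) / np.maximum(mask.sum(axis=1, keepdims=True), 1.0)

# Toy example: "g3" has no embedding and is zero-filled.
emb_table = {"g1": np.array([1., 0.]), "g2": np.array([0., 1.])}
genes = ["g1", "g2", "g3"]
X = np.array([[1., 3., 2.]])
e_wa = cell_embeddings(X, genes, emb_table, mode="wa")
e_aa = cell_embeddings(X, genes, emb_table, mode="aa")
```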
Third, we analyzed the relation between the clustering performance and the number of cells. We subsampled different proportions of cells from the large-scale Onek1k PBMC dataset and computed the clustering results under different numbers of cells.
Based on Extended Data Figure 9 (d), we found no obvious correlation between the number of cells and the clustering performance for the Onek1k PBMC dataset. Therefore, cell number may not be a factor affecting the performance of gene embeddings in this task. Moreover, scELMo is also capable of analyzing large-scale scRNA-seq datasets.

D Analysis of multi-omic data integration.
In this section, we analyzed the possibility of utilizing gene embeddings from GPT 3.5 to resolve the multi-omic data integration task. Here we consider datasets from scRNA-seq and scATAC-seq without paired information. To reduce the dimensions of the scATAC-seq dataset, we transfer its feature information from the space of peaks to the space of gene activity scores. The visualization results are summarized in Extended Data Figures 7 (a) and (b). According to these figures, we can still observe a significant batch effect, that is, differences between the cell embeddings of cells with the same cell types. Therefore, the performance of scELMo for multi-omic data integration under the zero-shot learning framework is limited. Moreover, based on Extended Data Figure 7 (c), neither the wa mode nor the aa mode can improve the S_batch score or the S_bio score significantly. Incorporating the cell-type information into the cell embedding space can significantly improve the averaged scores, but for metrics like iLISI that evaluate the mixing of batch information, such embeddings still had a zero score. Therefore, scELMo is not capable of multi-omic data integration under the zero-shot learning framework.
E The contribution of the finetuned model in the in-silico treatment analysis.
In this section, we demonstrated the necessity of using a finetuned model rather than zero-shot learning for in-silico treatment analysis.

Fig. 1 Overview of scELMo. (a) Zero-shot learning framework of scELMo. We extract the text description of metadata by using either databases or LLMs. Then we use GPT 3.5 to generate the embeddings of the text descriptions as the embeddings of features or cell states. We then aggregate these embeddings with single-cell profiles to generate cell embeddings. (b) Fine-tuning framework of scELMo. We combine embeddings of metadata and single-cell profiles with task-specific adaptors and train the adaptors to address downstream applications.

Fig. 2 Evaluations of the outputs of LLMs and the clustering performance. (a) Proportion of meaningful outputs of biological features across different LLMs. The left panel represents the proportion for genes, while the right panel represents the proportion for proteins. (b) Average query time for each LLM. The left panel represents the query time for genes, and the right panel represents the query time for proteins. (c) Evaluations of the clustering performance based on different methods. Different panels represent the results of different datasets.

Fig. 3 Results of batch effect correction for single-cell proteomic data. (a) UMAPs [34] for the cell-type information (left) and batch information (right) of CITE-seq-based datasets. The upper panel represents the raw data. The bottom panel represents the cell embeddings from the GPT 3.5 wa mode. (b) Evaluations of the batch effect correction for CITE-seq-based datasets across different methods. (c) UMAPs for the cell-type information (left) and batch information (right) of the CITE-seq-based dataset and the CyTOF-based dataset. The upper panel represents the raw data. The bottom panel represents the cell embeddings from the GPT 3.5 wa mode. (d) Evaluations of the batch effect correction for the CITE-seq-based dataset and the CyTOF-based dataset across different methods.

Fig. 4 Results of in-silico treatment analysis. A gene is treated as a potential therapeutic target if the change of cosine similarity (CS) caused by removing the gene is larger than 1e-4. The reason for setting this threshold is discussed in the Methods section. We chose the top 10 DEGs as candidates. (a) The change of CS by removing different genes in the expression space for hypertrophic (HCM) or dilated cardiomyopathy (DCM) states. (b) The change of CS by removing different genes in the expression space for ascending aortic aneurysm (Ascending only, Ascending to descending, and Ascending w/ root). We highlighted the genes detected by both GenePT and scELMo using stars (*) and marked the genes that were discovered by previous research as therapeutic targets in bold type.

Fig. 5 Results of perturbation analysis. (a) Scores of causal factor analysis for the ChangYe2021 dataset [58,59] based on CINEMA-OT across different input data. We considered four different types of extension. We still use S_batch to represent the level of perturbation effect removal because we used the metrics for benchmarking batch integration. (b) Scores of causal factor analysis for the perturbed PBMC dataset [60] based on CINEMA-OT across different input data. (c) Scores of gene expression prediction under different perturbation cases based on CPA. We considered four different methods and two datasets. (d) Scores of gene expression prediction using perturb-seq datasets based on GEARS. We considered three different methods and three datasets.
The description was generated using the prompt: "Please summarize the major function of cell type: B. Use academic language in one paragraph." The output: B cells, a crucial component of the adaptive immune system, play a pivotal role in humoral immunity by generating antibodies (immunoglobulins) in response to the recognition of specific antigens. Following activation, B cells undergo clonal expansion and differentiation into plasma cells, which are specialized antibody-producing cells. The antibodies secreted by plasma cells serve as effectors in immune responses, neutralizing pathogens, facilitating their removal by phagocytic cells, and promoting various immune processes such as opsonization and complement activation. Additionally, B cells can differentiate into memory B cells, which confer long-term immunity by enabling a rapid and robust antibody response upon re-exposure to the same pathogen. Overall, B cells are central to the adaptive immune system's ability to recognize and respond to a wide array of infectious agents and contribute significantly to immune memory and protection.

In Extended Data Figure 10 (a), we display the change of CS for the same group of DEGs under the ascending aortic aneurysm disease; all genes were not significant for the Ascending only state.

Extended Data Fig. 1 Evaluations of the similarity of embeddings under different conditions. (a) Heatmaps of the correlation of embeddings from the same gene across 10 different LLM outputs. We randomly selected 10 genes and generated the LLMs' outputs for these 10 genes based on the same set of prompts. We then computed the embeddings of these outputs and calculated the Pearson correlation for these embeddings; hence we have 10 different heatmaps to represent the results for the 10 different genes. (b) Heatmap of the correlation of embeddings from 10 different genes. We computed the Pearson correlation for the embeddings of different genes. The number 0 represents the index of the embeddings we computed based on 10 different LLM outputs.

Extended Data Fig. 2 UMAPs for the visualization of gene functional information. (a) UMAPs for the genes with known functional information and unknown functional information. (b) UMAPs for the genes with annotated functional information based on a kNN classifier. For genes with multiple functional annotations, we combined the functions as a new label.

Extended Data Fig. 3 Average ranks for clustering and cell-type annotation. (a) Average-rank information for different methods across datasets. (b) The left panel represents the average-rank information of methods based on fine-tuning for cell-type annotation. The right panel represents the average-rank information of methods based on zero-shot learning for cell-type annotation.

Comparisons of resources. (a) The plot of minimal GPU memory requirements across different FMs. (b) The plot of running time for the cell-type annotation task across different FMs.

Extended Data Fig. 10 Change of CS under the zero-shot (ZS) learning framework and UMAPs for visualization. (a) The change of CS based on cell embeddings from GenePT or PCA for the Aorta dataset. We considered all three different disease states. (b) UMAP visualization of the original gene expression space (left panel) and cell embeddings from finetuned scELMo (right panel) based on the Heart dataset. Figures are colored by cell conditions. (c) UMAP visualization of the original gene expression space (left panel) and cell embeddings from finetuned scELMo (right panel) based on the Aorta dataset. Figures are colored by cell conditions.

Change of CS for silencing DEGs in the control case. (a) The change of CS based on cell embeddings from scELMo for the Aorta dataset. (b) The change of CS based on the cell embeddings from scELMo for the Heart dataset. We highlighted the genes detected by both GenePT and scELMo using stars (*) and marked the genes that were discovered by previous research as genes related to the disease pathway in bold type.

Table 1
Scores of the cell-type annotation task under different settings. Parts of the results are directly extracted from GenePT. Here PCA represents principal component analysis, and scELMo+random emb represents fine-tuning scELMo with random numbers as meaningless gene embeddings. Average ranks of all methods across datasets are summarized in Extended Data Figure 3 (b).