Abstract
Large language models like GPT have shown impressive performance on natural language tasks. Here, we present a novel method to directly adapt these pretrained models to a biological context, specifically single-cell transcriptomics, by representing gene expression data as text. Our Cell2Sentence approach converts each cell’s gene expression profile into a sequence of gene names ordered by expression level. We show that these gene sequences, which we term “cell sentences”, can be used to fine-tune causal language models like GPT-2. Critically, we find that natural language pretraining boosts model performance on cell sentence tasks. When fine-tuned on cell sentences, GPT-2 generates biologically valid cells from a cell type prompt. Conversely, it also accurately predicts cell type labels when prompted with cell sentences. This demonstrates that language models fine-tuned with Cell2Sentence gain a biological understanding of single-cell data while retaining their ability to generate text. Our approach provides a simple, adaptable framework for combining natural language and transcriptomics using existing models and libraries. Our code is available at: https://github.com/vandijklab/cell2sentence-ft.
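To illustrate the core transformation described above, the following is a minimal sketch of turning an expression vector into a rank-ordered “cell sentence”. The function and variable names (make_cell_sentence, genes, counts, top_k) are illustrative assumptions and are not taken from the authors’ cell2sentence-ft repository; the actual pipeline may differ in filtering and normalization details.

```python
# Minimal sketch: order gene names by decreasing expression to form a "cell sentence".
# Names here are illustrative, not the authors' implementation.
import numpy as np

def make_cell_sentence(counts: np.ndarray, genes: list, top_k: int = 100) -> str:
    """Return a space-separated string of gene names, ordered by decreasing
    expression, keeping only genes with nonzero counts."""
    order = np.argsort(-counts)                        # highest expression first
    ranked = [genes[i] for i in order if counts[i] > 0]
    return " ".join(ranked[:top_k])

# Toy example: one "cell" measured over four genes.
genes = ["CD3D", "MS4A1", "NKG7", "LYZ"]
counts = np.array([5.0, 0.0, 2.0, 9.0])
print(make_cell_sentence(counts, genes))               # -> "LYZ CD3D NKG7"
```

Such sentences can then be paired with cell type labels and fed to a standard causal language model fine-tuning loop, treating gene names as ordinary tokens.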
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
* Co-first authors.
daniel.levine@yale.edu
syed.rizvi@yale.edu
sacha.levy@yale.edu
nazreen.pm@yale.edu
wuru@seas.upenn.edu
zihe.zheng@yale.edu
antonio.fonseca@yale.edu
xingyuchen@student.ethz.ch
sina.ghadermarzi@yale.edu