Foundation models for single-cell RNA-seq
AI + Medicine Newsletter 2024-02-27
Single-cell RNA sequencing (scRNA-seq) enables analysis of the transcriptome at the level of individual cells. The technique has had a profound impact on our understanding of biological complexity and cellular diversity, leading to key advances in oncology, immunology, developmental biology, and other biological fields.
Over the past decade, scRNA-seq data from millions of cells have accumulated. This volume of data is an excellent opportunity to build foundation models for scRNA-seq that can support many downstream tasks, e.g., cell-type identification, perturbation effect prediction, and clustering. This GitHub repository lists many foundation models for single-cell omics data, mostly RNA-seq.
Most of these models use the encoder part of the transformer architecture and are pre-trained with a masked language modeling objective, so fine-tuning is required for many downstream tasks. One recently published model, scMulan, focuses instead on the generative capability of the foundation model by using the ‘GPT-style’ decoder of the transformer. It can perform zero-shot tasks without fine-tuning.
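To make the difference between the two pre-training objectives concrete, here is a minimal sketch in plain Python. It is not scMulan's code or tokenization: the gene tokens, the `<mask>` symbol, and the function names `masked_lm_example` and `causal_lm_examples` are all hypothetical, chosen only to show how the training targets differ.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, for illustration only


def masked_lm_example(tokens, mask_rate=0.15, seed=0):
    """Encoder-style objective: hide some tokens and predict them
    using context from both sides of each hidden position."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return inputs, targets


def causal_lm_examples(tokens):
    """Decoder-style ('GPT-style') objective: predict each token
    from the tokens that precede it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy "cell sentence" made of gene tokens (an assumption, not real data).
cell = ["CD3E", "CD8A", "GZMB", "NKG7"]

masked_inputs, masked_targets = masked_lm_example(cell)
print(masked_inputs, masked_targets)

for context, next_token in causal_lm_examples(cell):
    print(context, "->", next_token)
```

In the masked setup the model only learns to fill in hidden positions, so a task-specific head is usually fine-tuned afterwards. In the causal setup every prefix is a prompt and the next token is the answer, which is what makes prompting without fine-tuning possible in principle.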
As illustrated in the workflow, the input sentence includes both gene expression data and metadata (e.g., cell type, tissue). Given a specific task prompt, the pre-trained model can predict the “unknown” token values. For example, in the cell-type annotation task, the…