AI + Medicine Newsletter
Foundation models for single-cell RNA-seq

Encode Box
February 27, 2024

Single-cell RNA sequencing (scRNA-seq) enables analysis of the transcriptome at the level of individual cells. The technique has had a profound impact on our understanding of biological complexity and cellular diversity, leading to key advances in oncology, immunology, developmental biology, and many other biological fields.

Over the past decade, scRNA-seq data from millions of cells have accumulated. This is an excellent opportunity to build foundation models for scRNA-seq that can support many downstream tasks, e.g., cell-type identification, perturbation effect prediction, and clustering. This GitHub repository lists many foundation models for single-cell omics data, mostly RNA-seq.

Most of these models use the encoder part of the transformer architecture and are pre-trained with a masked language modeling objective, so fine-tuning is required for many downstream tasks. One recently published model, scMulan, focuses instead on the generative capability of the foundation model by using a 'GPT-style' transformer decoder. It can perform zero-shot tasks without fine-tuning.
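
To make the difference between the two pre-training objectives concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not code from any of the published models: the function names, the `[MASK]` symbol, the masking rate, and the short gene list are all assumptions made for this example.

```python
import random


def masked_lm_example(tokens, mask_rate=0.15, seed=0):
    """Encoder-style objective (illustrative): hide random tokens and
    predict them using context from both sides."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append("[MASK]")   # the model sees a placeholder
            targets.append(tok)       # and must recover the original token
        else:
            inputs.append(tok)
            targets.append(None)      # no loss on unmasked positions
    return inputs, targets


def causal_lm_example(tokens):
    """Decoder-style ('GPT-style') objective (illustrative): predict each
    token from only the tokens that come before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    # Toy "sentence" of gene tokens; the choice and order are arbitrary.
    cell = ["CD3E", "CD8A", "GZMB", "NKG7", "CCL5"]
    print(masked_lm_example(cell, mask_rate=0.4))
    for context, next_token in causal_lm_example(cell):
        print(context, "->", next_token)
```

With the masked objective the model only learns to fill in blanks, which is why a task-specific head and fine-tuning are usually needed. With the causal objective the model learns to continue a sequence, so a task can be phrased as a prompt whose continuation is the answer.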

As illustrated in the workflow, the input sentence includes both gene expression data and metadata (e.g., cell type, tissue). Given a specific task prompt, the model learns during pre-training to predict the "unknown" token values. For example, in the cell-type annotation task, the cell-type tokens are masked and predicted from the gene expression values. However, the generative mode is limited to the three pre-defined tasks. For a new task or a dataset with new metadata, the model still needs to be fine-tuned.

The workflow of scMulan
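
The idea of a cell "sentence" that mixes metadata, gene expression, and a task prompt can be sketched roughly as follows. This is a hypothetical token layout for illustration, not scMulan's actual vocabulary or format: the token strings, the binning scheme, and the function names `bin_expression` and `build_cell_sentence` are all assumptions.

```python
def bin_expression(value, n_bins=5, max_value=10.0):
    """Map a non-negative expression value to an integer bin.
    Bin 0 is reserved for zero expression (assumed scheme)."""
    if value <= 0:
        return 0
    clipped = min(value, max_value)
    return min(n_bins - 1, 1 + int((clipped / max_value) * (n_bins - 1)))


def build_cell_sentence(expression, metadata, predict_field):
    """Build a prompt for one cell and return (tokens, target).

    The metadata field named by `predict_field` is left out of the prompt,
    so the model has to generate it from the remaining tokens.
    """
    tokens = []
    # Known metadata goes first, as plain tokens.
    for field, value in metadata.items():
        if field != predict_field:
            tokens.append(f"<{field}={value}>")
    # Expressed genes follow, each paired with its expression bin.
    for gene, value in expression.items():
        level = bin_expression(value)
        if level > 0:  # genes with zero expression are skipped
            tokens.append(f"{gene}:{level}")
    # The task prompt tells the model which "unknown" token to produce.
    tokens.append(f"<task=predict_{predict_field}>")
    return tokens, metadata[predict_field]


if __name__ == "__main__":
    expression = {"CD3E": 4.2, "CD8A": 0.0, "GZMB": 7.5}
    metadata = {"tissue": "blood", "cell_type": "T cell"}
    tokens, target = build_cell_sentence(expression, metadata, "cell_type")
    print(tokens, "->", target)
```

Because the task is encoded in the prompt, a dataset whose metadata fields never appeared during pre-training has no matching tokens, which is one way to see why new tasks still call for fine-tuning.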

scRNA-seq data processing and integration are challenging because of significant batch effects and the large fraction of genes with 'zero' expression. Most published foundation models do not address these challenges well when combining data from different sources, so their practical usefulness is questionable. In fact, multiple evaluation studies have shown that these large, complex foundation models do not outperform a simple linear model and are sometimes worse (Liu et al. 2023, Boiarsky et al. 2023, Kedzierska et al. 2023). It is still too early to tell how much AI can truly benefit scRNA-seq analysis and related discovery.
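
As a small illustration of the 'zero' expression issue mentioned above, the sketch below measures how sparse a count matrix is. The matrix is made-up toy data (rows are cells, columns are genes), and the helper names are assumptions for this example. Real datasets are far larger and are normally stored in sparse formats.

```python
def zero_fraction(counts):
    """Fraction of all entries in a cells-by-genes matrix that are zero."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total if total else 0.0


def per_gene_zero_fraction(counts):
    """For each gene (column), the fraction of cells with zero counts."""
    n_cells = len(counts)
    if n_cells == 0:
        return []
    n_genes = len(counts[0])
    return [
        sum(1 for row in counts if row[g] == 0) / n_cells
        for g in range(n_genes)
    ]


if __name__ == "__main__":
    # Toy data: 4 cells x 5 genes, invented for illustration.
    toy_counts = [
        [0, 3, 0, 0, 1],
        [0, 0, 0, 2, 0],
        [5, 0, 0, 0, 0],
        [0, 1, 0, 0, 0],
    ]
    print("overall zero fraction:", zero_fraction(toy_counts))
    print("per-gene zero fraction:", per_gene_zero_fraction(toy_counts))
```

A check like this is a cheap first step before comparing a foundation model against a simple baseline, since highly sparse genes carry little signal for either approach.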

If you find the newsletter helpful, please consider:

🔊 Sharing the newsletter with other people
👍 Upvoting on Product Hunt
📧 Sending any feedback, suggestions, and questions by directly replying to this email or writing reviews on Product Hunt
🙏 Supporting us with a cup of coffee.

Thanks, and see you next time!
