Is DNA all you need?

Most genetic information is stored in DNA sequences, and the accumulated genomic data is also far larger than other omics data, such as transcriptomic (RNA) and proteomic (protein) data. When building large foundation models for biology, should we consider only DNA sequences?

Researchers from the Arc Institute, Stanford, and TogetherAI recently developed a large DNA-only foundation model called Evo. It has 7 billion parameters and was trained on 85k prokaryotic genomes totaling 300 billion nucleotides. The model performed well across multiple tasks, including those related to RNA and proteins (a toy scoring sketch follows the list):

  • Zero-shot protein fitness prediction

  • Zero-shot non-coding RNA fitness prediction

  • Zero-shot mRNA expression prediction

  • Zero-shot protein expression prediction

  • Zero-shot gene essentiality prediction

  • Generative design of CRISPR-Cas systems

  • Generative design of transposable elements

  • Generating genome sequences containing plausible high-level genomic organization
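Conceptually, zero-shot fitness prediction with a sequence model amounts to scoring each variant by its likelihood under the model and ranking variants by that score, with no task-specific training. The sketch below illustrates only that score-and-rank logic: a trigram nucleotide model stands in for Evo (the real 7B model and its API are not shown), and the wild-type sequence and variant names are made up, so the numbers themselves are meaningless.

```python
# Toy illustration of zero-shot fitness scoring with a sequence model.
# A trigram nucleotide model stands in for the real language model; only
# the "score variants by log-likelihood and rank them" recipe is the point.
import math
from collections import defaultdict

def train_trigram(sequences):
    """Count trigram frequencies as a stand-in 'language model'."""
    counts, context_counts = defaultdict(int), defaultdict(int)
    for seq in sequences:
        for i in range(len(seq) - 2):
            counts[seq[i:i+3]] += 1
            context_counts[seq[i:i+2]] += 1
    return counts, context_counts

def log_likelihood(seq, counts, context_counts, alpha=1.0):
    """Sum of per-position log-probabilities (add-alpha smoothed)."""
    ll = 0.0
    for i in range(len(seq) - 2):
        c = counts[seq[i:i+3]] + alpha
        z = context_counts[seq[i:i+2]] + 4 * alpha  # 4 possible bases
        ll += math.log(c / z)
    return ll

# Hypothetical wild-type gene and single-nucleotide variants.
wild_type = "ATGGCTAGCTAGGATCGATCGGCTAAGCTTGCAT"
variants = {"wt": wild_type,
            "mut_A10G": wild_type[:10] + "G" + wild_type[11:],
            "mut_T20C": wild_type[:20] + "C" + wild_type[21:]}

counts, ctx = train_trigram([wild_type])
# Zero-shot "fitness" proxy: rank variants by model log-likelihood.
for name, seq in sorted(variants.items(),
                        key=lambda kv: -log_likelihood(kv[1], counts, ctx)):
    print(f"{name}: {log_likelihood(seq, counts, ctx):.2f}")
```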

Notably, the model uses Hyena, a convolution-based architecture, instead of a transformer. Previous studies (1, 2) showed that Hyena can match the quality of attention (the core of transformers) while allowing longer context lengths at lower, sub-quadratic time complexity, which is especially valuable for genomic sequences. The authors of Evo also confirmed that Hyena is more efficient than transformer-based models.
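To make the complexity claim concrete, here is a minimal sketch of the long-convolution trick that Hyena-style operators rely on: a filter as long as the sequence, applied via FFT in roughly O(L log L) time rather than the O(L^2) pairwise interactions of full attention. This is not the actual (Striped)Hyena layer; the learned implicit filters, gating, and projections are all omitted.

```python
# Global (sequence-length) convolution via FFT, the core trick behind
# Hyena-style operators. Cost grows ~L log L, not L^2 as in attention.
import numpy as np

def long_conv_fft(x, h):
    """Causal global convolution of signal x with filter h via FFT."""
    L = len(x)
    n = 2 * L  # zero-pad to avoid circular wrap-around
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:L]  # keep the causal part

L = 8192                        # context length in tokens/nucleotides
x = np.random.randn(L)          # one channel of token embeddings
h = np.exp(-np.arange(L) / 500) # a decaying filter (learned in the real model)

y = long_conv_fft(x, h)
print(y.shape)  # (8192,)
```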

Does it work for Eukaryotes?

The Evo model was built for prokaryotic genomes. Does the same concept apply to eukaryotic genomes? This would be more challenging, given that eukaryotic genomes are much longer and more complex, and a large proportion usually consists of non-coding regions (the "dark matter" of the genome). A couple of recent studies may provide some hints.

  • The Nucleotide Transformer, trained on 3,202 diverse human genomes, helps predict the functional effects of non-coding regions of the human genome.

  • Large language models trained on codons, instead of amino acid sequences, provide high-quality representations that outperform comparable state-of-the-art models across a variety of protein-related tasks. The authors of this Nature Machine Intelligence paper argued that "this performance is due to the codon language model's ability to capture patterns of codon usage across DNA sequences and that this advantage disappears when codon usage information is corrupted." A small sketch of codon versus amino-acid tokenization follows below.
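To make the codon-versus-protein distinction concrete, the sketch below tokenizes the same hypothetical coding sequence both ways. Synonymous codons (CTG and TTA both encode leucine) collapse to a single amino-acid token, so codon-usage patterns survive only in the codon view. The sequence and the tiny codon table are illustrative and are not the paper's actual tokenizer or model.

```python
# Same coding sequence (CDS) as codon tokens vs. translated amino acids.
CODON_TABLE = {  # only the codons appearing in the example below
    "ATG": "M", "CTG": "L", "TTA": "L", "GCT": "A", "GCC": "A", "TAA": "*",
}

def codon_tokens(cds):
    """Split an in-frame CDS into codon tokens (codon-level model input)."""
    return [cds[i:i+3] for i in range(0, len(cds), 3)]

def translate(cds):
    """Translate the same CDS into amino acids (protein-level model input)."""
    return "".join(CODON_TABLE[c] for c in codon_tokens(cds))

cds = "ATGCTGTTAGCTGCCTAA"  # hypothetical in-frame coding sequence
print(codon_tokens(cds))    # ['ATG', 'CTG', 'TTA', 'GCT', 'GCC', 'TAA']
print(translate(cds))       # 'MLLAA*' (codon usage information is lost here)
```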

If you find the newsletter helpful, please consider:

  • 🔊 Sharing the newsletter with other people

  • 👍 Upvoting on Product Hunt

  • 📧 Sending any feedback, suggestions, and questions by directly replying to this email or writing reviews on Product Hunt

  • 🙏 Supporting us with a cup of coffee.

Thanks, and see you next time!
