ChatGPT can predict molecular properties and design new molecules

ChatGPT has already demonstrated mind-blowing capabilities in natural language understanding and processing. Its power can be further enhanced to perform more complex tasks through prompt engineering, domain-specific fine-tuning, retrieval-augmented generation (RAG), and collaborative agents.

Can ChatGPT understand chemistry? A paper in Nature Machine Intelligence demonstrates that a GPT-3 model fine-tuned on domain-specific data can perform classification and regression tasks for predicting molecular properties, such as solid-solution formation and Henry coefficients. It can even design new molecules from instructions and specified properties. Traditionally, these tasks have been handled by purpose-built machine learning or QSAR models. If this new approach holds up, it could fundamentally change how predictive modeling is done in chemical and materials science, and perhaps in other branches of science.

The fine-tuned model performs comparably to, or even outperforms, conventional machine learning techniques on several of the datasets tested in the paper, particularly when the training set is small. It would be interesting to see how broadly this approach generalizes across predictive chemistry, and how accurate it remains when it does.
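To make the framing concrete, here is a minimal sketch of how a regression target such as a Henry coefficient might be cast as prompt–completion text pairs for fine-tuning a GPT-style model. This is not the paper's code; the records, file name, and prompt template below are hypothetical placeholders.

```python
import json

# Hypothetical records: (material identifier, property value).
# In practice the inputs are text descriptions, compositions, or SMILES strings,
# and the targets are measured or simulated properties.
records = [
    {"material": "MOF-5", "henry_coefficient": -3.2},
    {"material": "ZIF-8", "henry_coefficient": -1.7},
]

# Cast each record as a prompt/completion pair, the usual format for
# fine-tuning GPT-style models on classification or regression posed as text.
with open("henry_finetune.jsonl", "w") as f:
    for r in records:
        example = {
            "prompt": f"What is the Henry coefficient of {r['material']}?###",
            "completion": f" {r['henry_coefficient']}@@@",
        }
        f.write(json.dumps(example) + "\n")
```

The model is then fine-tuned on such pairs and queried with the same prompt template; a classification task works the same way, with a class label as the completion instead of a number.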

Structure prediction of protein–ligand complexes

Iambic Therapeutics published its deep generative model, NeuralPLexer, in Nature Machine Intelligence; it predicts protein–ligand complex structures directly from a protein sequence and a ligand SMILES string. The code is available on GitHub.
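As a rough illustration of what those two inputs look like (this is not the NeuralPLexer API; the sequence and SMILES below are arbitrary examples), one might sanity-check them with RDKit before handing them to any sequence-plus-SMILES model:

```python
# Placeholder inputs: a protein sequence and a ligand SMILES string.
from rdkit import Chem

protein_sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # made-up fragment
ligand_smiles = "CC(=O)Oc1ccccc1C(=O)O"                  # aspirin, as an example

mol = Chem.MolFromSmiles(ligand_smiles)
assert mol is not None, "invalid SMILES"
assert set(protein_sequence) <= set("ACDEFGHIKLMNPQRSTVWY"), "invalid residues"

print(f"Ligand: {mol.GetNumAtoms()} heavy atoms; "
      f"protein: {len(protein_sequence)} residues.")
```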

At the same time, they released a white paper on the next generation of the model, NeuralPLexer2, which appears to be substantially more accurate. Its built-in confidence estimation and higher prediction speed are also valuable for large-scale ligand screening.

Google DeepMind is also working on protein–ligand complex structure prediction with the next generation of AlphaFold. Rather than comparing the two models directly, the NeuralPLexer2 white paper simply quotes the 73.6% accuracy figure from Google's blog post. NeuralPLexer2's accuracy is lower without pLDDT filtering but higher when an estimated binding site is provided, a comparison that could be considered cherry-picked and questionable.

Featured papers

  • Umol: structure prediction of protein-ligand complexes from sequence information (code)

  • CombFold: predicting structures of large protein assemblies using a combinatorial assembly algorithm and AlphaFold2 (code)

  • Multiple structural biology papers in Cell

    • Structure is beauty, but not always truth

    • De novo protein design—From new structures to programmable functions

    • Enabling structure-based drug discovery utilizing predicted models

    • Understanding the cell: Future views of structural biology

  • Digital health

If you find the newsletter helpful, please consider:

  • 🔊 Sharing the newsletter with other people

  • 👍 Upvoting on Product Hunt

  • 📧 Sending feedback, suggestions, or questions by replying directly to this email or writing a review on Product Hunt

  • 🙏 Supporting us with a cup of coffee.

Thanks, and see you next time!
