Small-molecule drug discovery in the age of AI

Small Molecule

A review nicely summarized the recent advances of deep learning (DL) in small-molecule drug discovery. From the methodology perspective, small molecules can be represented as SMILES strings or molecular graphs (the review omitted molecular fingerprints, which, like SMILES, are 1D strings), so string-based and graph-based DL algorithms are the most commonly used for predicting small-molecule properties.

Figure 1. Deep QSAR model
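To make the three representations concrete, here is a minimal Python sketch that writes ethanol as a SMILES string, as a graph, and as a bit-string fingerprint. The atom and bond lists are typed in by hand, and `toy_fingerprint` is a hypothetical helper invented for illustration; it is not ECFP/Morgan or any published fingerprint, and a real pipeline would likely rely on a cheminformatics toolkit instead.

```python
import zlib

# Ethanol written three ways. The atom and bond lists are entered by hand;
# a real pipeline would parse the SMILES with a cheminformatics toolkit.
smiles = "CCO"
atoms = ["C", "C", "O"]      # node labels (hydrogens left implicit)
bonds = [(0, 1), (1, 2)]     # undirected edges between atom indices


def adjacency(n_atoms, bonds):
    """Build an adjacency list for the molecular graph."""
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    return neighbors


def toy_fingerprint(atoms, bonds, n_bits=16):
    """Hash each atom plus its sorted neighbor elements into a bit string.

    Illustrative assumption only: NOT ECFP/Morgan or any published scheme.
    """
    neighbors = adjacency(len(atoms), bonds)
    bits = ["0"] * n_bits
    for i, element in enumerate(atoms):
        env = element + "|" + "".join(sorted(atoms[j] for j in neighbors[i]))
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        bits[zlib.crc32(env.encode()) % n_bits] = "1"
    return "".join(bits)


graph = adjacency(len(atoms), bonds)
fingerprint = toy_fingerprint(atoms, bonds)
print(smiles, graph, fingerprint)
```

A string-based model would consume `smiles` or `fingerprint`, while a graph-based model would consume `atoms` together with `graph`.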

Graph-based methods, such as graph convolutional networks, inherit their pooling layers from conventional convolutional networks. A recent study found that attention-based pooling is more effective than max, sum, mean, and physics-aware pooling methods.
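For readers less familiar with pooling, the sketch below shows how per-atom embeddings can be collapsed into a single molecule-level vector. It is a plain-Python illustration under simple assumptions: the node embeddings and the `query` vector are made-up numbers, and the dot-product softmax attention here is one common formulation, not necessarily the one used in the study.

```python
import math


def sum_pool(nodes):
    """Add node vectors element-wise."""
    return [sum(col) for col in zip(*nodes)]


def mean_pool(nodes):
    """Average node vectors element-wise."""
    return [s / len(nodes) for s in sum_pool(nodes)]


def max_pool(nodes):
    """Take the element-wise maximum over nodes."""
    return [max(col) for col in zip(*nodes)]


def attention_pool(nodes, query):
    """Softmax-weighted sum of node vectors.

    Each node is scored by its dot product with `query`, which would be a
    learned parameter in a real model (assumed fixed here).
    """
    scores = [sum(x * q for x, q in zip(node, query)) for node in nodes]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * node[d] for w, node in zip(weights, nodes))
        for d in range(len(query))
    ]
    return pooled, weights


# Three atoms with 2-dimensional embeddings (made-up numbers).
node_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
query = [0.5, -0.5]  # hypothetical; learned during training in practice

pooled, weights = attention_pool(node_embeddings, query)
print(sum_pool(node_embeddings), mean_pool(node_embeddings), max_pool(node_embeddings))
print(pooled, weights)
```

Sum, mean, and max treat every atom the same way, whereas attention lets the model weight atoms differently; with an all-zero query, attention pooling reduces to mean pooling.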

To design new small molecules, typical generative approaches such as reinforcement learning (RL), diffusion models, GANs, and autoencoders can be used (Figure 2). Like large language models, all these models learn their “knowledge” by pretraining on a large, unlabeled molecular dataset. Instead of learning from small molecules, a new study explored chemical reactions and built a foundation model to learn the reaction rules of molecules. Given some seed molecules, the model can generate synthesizable, high-quality drug-like structures based on the learned reaction rules.

Figure 2. Generative molecular design

Protein Structure

An experimental validation study confirms that AlphaFold predictions often closely match experimental data, although discrepancies remain, particularly in cases involving ligands, covalent modifications, or other environmental factors.

One way to assess prediction confidence is the pLDDT (predicted local distance difference test) score from AlphaFold. The study confirms that the pLDDT score generally tracks accuracy (“10% of Cα atoms with pLDDT over 90 are found to be in error by over 2 Å, along with 22% of those with pLDDT between 80 and 90, 33% of those between 70 and 80”).
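As a rough illustration of how those figures could be used, the sketch below maps a per-residue pLDDT score to the error rate quoted above and estimates how many Cα atoms in a model might be off by more than 2 Å. The function names, the example scores, and the handling of values exactly on a band boundary are assumptions made here; the study only reports the three bands.

```python
def reported_error_rate(plddt):
    """Return the quoted fraction of C-alpha atoms in error by over 2 Å.

    Returns None below 70, where no figure was quoted. Treating exactly 80
    and exactly 70 as part of the upper band is an assumption.
    """
    if not 0.0 <= plddt <= 100.0:
        raise ValueError("pLDDT is expected to lie between 0 and 100")
    if plddt > 90.0:
        return 0.10   # "pLDDT over 90"
    if plddt >= 80.0:
        return 0.22   # "between 80 and 90"
    if plddt >= 70.0:
        return 0.33   # "between 70 and 80"
    return None


def expected_errors(scores):
    """Sum the quoted error rates over residues that fall in a quoted band.

    Returns (expected number of residues in error, residues without a band).
    """
    expected, uncovered = 0.0, 0
    for score in scores:
        rate = reported_error_rate(score)
        if rate is None:
            uncovered += 1
        else:
            expected += rate
    return expected, uncovered


# Made-up per-residue pLDDT values for a five-residue stretch.
example_scores = [95.2, 91.0, 85.5, 72.3, 55.0]
expected, uncovered = expected_errors(example_scores)
print(expected, uncovered)
```

The takeaway is that even high-confidence regions carry a non-trivial chance of error, so a high pLDDT is a reason for more trust, not a substitute for experimental validation.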

Moreover, combining AlphaFold2 predictions with cryo-EM experiments can achieve higher accuracy than either approach alone, as shown by DeepMainmast, an integrated protocol for protein structure modeling.

Protein structure prediction and protein design are among the hot topics in AI. Many related works are being presented at the NeurIPS 2023 conference this week (Dec 10 - 16, 2023, New Orleans, Louisiana). Here are the top 3 posters based on the number of GitHub stars.

  • OpenProteinSet: Training data for structural biology at scale (paperwithcode)

  • ProteinInvBench: Benchmarking Protein Inverse Folding on Diverse Tasks, Models, and Metrics (by Westlake University, paperwithcode)

  • Protein Design with Guided Discrete Diffusion (paperwithcode)

Featured News

  • NEJM AI was officially launched on Monday (Dec 11, 2023) with the goal of “augmenting the capabilities of clinicians, patients, and their larger community using the latest entrant to our ecosystem — AI — to deliver safe and effective health care to the highest of our collective standards.” NEJM AI also allows the use of LLMs like ChatGPT for manuscript writing, as long as authors take complete responsibility for the content and properly acknowledge the use of LLMs. We previously featured one early-release manuscript about the clinical adoption of FDA-approved AI devices, and will definitely share more in the future.

  • UK Biobank has released the whole genomes of 500,000 people, the world’s biggest set of human genome sequences open to scientists. The data will help researchers decipher the genetic code and its links to diseases, and will be a valuable resource for training language models of genomics.

If you find the newsletter helpful, please consider:

  • 🔊 Sharing the newsletter with other people

  • 👍 Upvoting on Product Hunt

  • 📧 Sending any feedback, suggestions, and questions by directly replying to this email or writing reviews on Product Hunt

  • 🙏 Supporting us with a cup of coffee.

Thanks, and see you next time!
