Understanding the World Through Code

Funded through the NSF Expeditions in Computing Program

Organic Chemistry

Many functions of small molecules and proteins depend on their 3D structure. Deep learning techniques that predict these structures, such as AlphaFold, are revolutionizing the field of structural biology. The prediction of binding structures, however, has seen less progress than protein folding because data are limited and the combinatorial space of possible binding structures is enormous. To solve these problems effectively, it is therefore crucial to use scientific insight to build the right degrees of freedom and symmetries into the models. With this in mind, we have developed a novel SE(3)-equivariant approach that uses deep learning to detect binding pockets and the Kabsch algorithm to superimpose them, so that binding structures can be predicted very quickly. Applying this technique to protein-protein and protein-small molecule binding led to the development of EquiDock [ganea2021independent-ours] and EquiBind [stark2022equibind-ours]. Both methods perform competitively with established physics-based methods while being orders of magnitude faster, which shows great promise for large-scale screening applications.
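
The superposition step can be illustrated with a minimal NumPy sketch of the Kabsch algorithm. It assumes two sets of corresponding 3D points are already available (in practice these would come from a model's predictions); the function name `kabsch_align` and the demo data are hypothetical, and this is not the EquiDock or EquiBind code.

```python
import numpy as np

def kabsch_align(P, Q):
    """Find rotation R and translation t so that R @ p_i + t best matches q_i.

    P, Q: (N, 3) arrays of corresponding points (hypothetical keypoints).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t

# Demo on synthetic data: rotate and shift random points, then recover the motion.
rng = np.random.default_rng(0)
ligand_pts = rng.normal(size=(8, 3))
theta = 0.7
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
pocket_pts = ligand_pts @ R_true.T + t_true

R, t = kabsch_align(ligand_pts, pocket_pts)
aligned = ligand_pts @ R.T + t
print("max deviation after alignment:", np.abs(aligned - pocket_pts).max())
```

Because the alignment is a closed-form SVD rather than an iterative search, it costs almost nothing once the corresponding points are known, which is consistent with the speed advantage described above.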

However, most biological structures are not rigid but exhibit some degree of flexibility. Modeling the whole distribution of possible molecular poses, and therefore molecular flexibility, is a very challenging task that the scientific community has tried to solve for decades because of its critical applications in understanding many biological mechanisms and developing new drugs. With this goal in mind, we started by modeling the structural flexibility of small molecules and developed torsional diffusion [jing2022torsional-ours], a deep generative model that generates conformations by iteratively refining a molecule's torsion angles, that is, by diffusing over the manifold of those angles. Intuitively, our model incorporates chemical and physical knowledge by reasoning about the structure in 3D space, where forces operate, while restricting the diffusion to torsion angles, where most of the molecule’s flexibility lies. Empirically, torsional diffusion generates superior conformer ensembles compared to previous machine learning methods and is the first ML framework to outperform state-of-the-art cheminformatics methods. We are now extending this work to model larger and more complex molecular systems.
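
To make the idea of moving only along torsion angles concrete, here is a small sketch of a single torsion update: the atoms on one side of a rotatable bond are rotated about the bond axis, which changes the conformation while leaving bond lengths untouched. The function `rotate_about_bond`, the four-atom toy geometry, and the choice of which atoms move are all illustrative assumptions, not the torsional diffusion implementation.

```python
import numpy as np

def rotate_about_bond(coords, a, b, moving, angle):
    """Rotate the atoms listed in `moving` by `angle` (radians) about the bond a -> b.

    coords: (N, 3) atom positions; a, b: indices of the bonded atoms;
    moving: indices of the atoms on the b side of the bond (assumed given).
    """
    coords = np.asarray(coords, dtype=float).copy()
    axis = coords[b] - coords[a]
    axis = axis / np.linalg.norm(axis)
    # Rodrigues' rotation formula: R = I + sin(angle) K + (1 - cos(angle)) K^2
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    coords[moving] = (coords[moving] - coords[b]) @ R.T + coords[b]
    return coords

# Toy four-atom chain 0-1-2-3 with a rotatable central bond 1-2.
chain = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [1.0, 1.0, 0.0]])
flipped = rotate_about_bond(chain, a=1, b=2, moving=[3], angle=np.pi)
print(flipped[3])

# A noisy torsion update, in the spirit of perturbing only the torsion angle.
rng = np.random.default_rng(0)
noisy = rotate_about_bond(chain, a=1, b=2, moving=[3], angle=rng.normal(scale=0.5))
print(np.linalg.norm(noisy[3] - noisy[2]))
```

A molecule with k rotatable bonds is then described by k angles rather than 3N coordinates, which is the reduction in degrees of freedom that the paragraph above refers to.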

Finally, we have also developed novel deep generative models that directly design new materials [xie2021crystal-ours] and proteins [trippe2022diffusion-ours]. Protein design relies on building techniques that learn the mechanisms of protein function through the chemistry and physics of binding and folding, and then on building new proteins for novel functions. We will evaluate our ability to design proteins with structures and functions never before seen in nature, with experimental validation through our collaborations.

A new paper from our group introduces DiffDock [corso2022diffdock-ours], a method for predicting how drug molecules will bind to proteins in the body. This is an important step in designing new drugs. Traditional methods try to directly guess the exact position and orientation of the drug molecule when bound to the protein. But it's very challenging to get this exactly right, because there are so many possible ways the molecule could bind. DiffDock instead takes an approach based on generative modeling: it learns a model that can generate many different possible binding poses, then ranks these poses to pick out the most likely one.

The key ideas are:
1. Model the binding process as a diffusion, or random walk, over poses. Start with a random pose and gradually refine it over many small steps, like gradually zooming in on the right pose.
2. Make the refinements based on how drug molecules can actually move: they can slide around, rotate, and twist around certain bonds. DiffDock models these specific types of motion.
3. Use deep learning to train a neural network that learns to generate plausible poses through this diffusion process.
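
The generate-then-rank idea and the step-by-step refinement can be sketched with a deliberately simplified toy. It refines only a ligand's position (no rotations or torsions), and both the "score" that drives each step and the "confidence" used for ranking are hand-written stand-ins for trained neural networks. All names (`toy_score`, `sample_position`, `toy_confidence`), the noise schedule, and the pocket location are assumptions for illustration; this is not the DiffDock implementation.

```python
import numpy as np

def toy_score(x, target, sigma):
    """Stand-in for a trained score network (hypothetical).

    This is the exact score of a Gaussian N(target, sigma^2 I), so it simply
    points from the current position toward the target.
    """
    return (target - x) / sigma**2

def sample_position(target, rng, n_steps=50, sigma_max=10.0, sigma_min=0.1):
    """Start from a random position and refine it over many small noisy steps."""
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)  # decreasing noise levels
    x = rng.normal(scale=sigma_max, size=3)               # random initial position
    for s_now, s_next in zip(sigmas[:-1], sigmas[1:]):
        step = s_now**2 - s_next**2
        # Move along the score, then add a smaller amount of fresh noise.
        x = x + step * toy_score(x, target, s_now) + np.sqrt(step) * rng.normal(size=3)
    return x

def toy_confidence(x, target):
    """Stand-in for a learned confidence model (hypothetical): closer is better."""
    return -np.linalg.norm(x - target)

rng = np.random.default_rng(0)
pocket = np.array([2.0, -1.0, 3.0])  # hypothetical binding-site position

# Generate several candidate poses, then rank them and keep the best one.
poses = [sample_position(pocket, rng) for _ in range(10)]
ranked = sorted(poses, key=lambda p: toy_confidence(p, pocket), reverse=True)
best = ranked[0]
print("distance of top-ranked pose to pocket:", np.linalg.norm(best - pocket))
```

In this toy the stand-in functions already know where the pocket is, so the example only shows the control flow: many independent samples, each produced by gradual refinement, followed by a separate ranking step. In a real system both functions would be learned from data.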

Experiments show that DiffDock can predict the right binding pose much more accurately than previous methods. It achieved a success rate of almost 40% on a standard benchmark dataset, nearly doubling the success rate of the previous best methods. DiffDock also retained its accuracy when using computer-predicted protein structures, unlike other methods, which rely on knowing the exact experimental protein structure. This could enable applications like virtually screening drug candidates against predicted proteins.
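
As a rough illustration of how such an accuracy figure can be computed, the sketch below counts a prediction as correct when the root-mean-square deviation (RMSD) between predicted and reference ligand atom positions falls below a threshold. The 2 Å threshold, the function names, and the synthetic coordinates are assumptions made for this example; the text above does not specify the exact evaluation protocol.

```python
import numpy as np

def rmsd(pred, ref):
    """Root-mean-square deviation between two (N, 3) arrays of atom positions."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    return float(np.sqrt(np.mean(np.sum((pred - ref) ** 2, axis=1))))

def success_rate(predicted, reference, threshold=2.0):
    """Fraction of complexes whose predicted pose is within `threshold` of the reference.

    The default threshold of 2.0 (in angstroms) is an assumed convention.
    """
    hits = [rmsd(p, r) < threshold for p, r in zip(predicted, reference)]
    return sum(hits) / len(hits)

# Two synthetic "complexes": one near-correct prediction, one far off.
ref_pose = np.zeros((4, 3))
good_pred = ref_pose + np.array([0.5, 0.0, 0.0])  # shifted by 0.5 along x
bad_pred = ref_pose + np.array([3.0, 0.0, 0.0])   # shifted by 3.0 along x

rate = success_rate([good_pred, bad_pred], [ref_pose, ref_pose])
print("success rate:", rate)
```

This simple version compares atoms in a fixed order and does not account for molecular symmetry, which a careful evaluation would need to handle.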

Overall, by framing binding prediction as a generative modeling problem over feasible molecular motions, DiffDock provides a new state-of-the-art approach to this important challenge in drug design. Its superior performance and robustness could enable more accurate and efficient drug discovery.