THETA TEAM - Publications

2025

SurfDock is a surface-informed diffusion generative model for reliable and accurate protein–ligand complex prediction

Cao, D., Chen, M. et al.

Nature Methods 22, 310–322 (2025)

DOI: https://doi.org/10.1038/s41592-024-02516-y

Abstract

Accurately predicting protein–ligand interactions is crucial for understanding cellular processes. We introduce SurfDock, a deep-learning method that addresses this challenge by integrating protein sequence, three-dimensional structural graphs and surface-level features into an equivariant architecture. SurfDock employs a generative diffusion model on a non-Euclidean manifold, optimizing molecular translations, rotations and torsions to generate reliable binding poses. Our extensive evaluations across various benchmarks demonstrate SurfDock’s superiority over existing methods in docking success rates and adherence to physical constraints. It also exhibits remarkable generalizability to unseen proteins and predicted apo structures, while achieving state-of-the-art performance in virtual screening tasks. In a real-world application, SurfDock identified seven novel hit molecules in a virtual screening project targeting aldehyde dehydrogenase 1B1, a key enzyme in cellular metabolism. This showcases SurfDock’s ability to elucidate molecular mechanisms underlying cellular processes. These results highlight SurfDock’s potential as a transformative tool in structural biology, offering enhanced accuracy, physical plausibility and practical applicability in understanding protein–ligand interactions.

2025

AI-Driven Protein Design

Koh, H.Y.*, Zheng, Y. et al.*, Pan, S., Church, G.

Nature Reviews Bioengineering (2025)

DOI: https://doi.org/10.1038/s44222-025-00349-8

Abstract

Protein design is undergoing a revolution driven by artificial intelligence (AI), transforming how we engineer proteins for applications in drug discovery, biotechnology and synthetic biology. By navigating the immense complexity of protein sequence space and overcoming the limitations of structural and functional data, AI enables unprecedented precision and speed in designing novel proteins with tailored functions. Central to this Review is a comprehensive and actionable roadmap for designers, providing step-by-step guidance on how to integrate state-of-the-art AI tools into protein design workflows, including tools for structural and functional prediction as well as generative models for de novo design. To illustrate this roadmap in practice, we present case studies showcasing AI-driven protein design, from engineering therapeutic proteins to designing novel proteins that unlock enzyme functions and reprogramme biomolecular systems. Looking ahead, we outline future directions highlighting the vast potential of AI to revolutionize synthetic biology, expedite drug development and drive sustainable biotechnology, positioning it as a transformative force at the forefront of protein design.

2025

Large language models for scientific discovery in molecular property prediction

Zheng, Y.*, Koh, H.Y.*, Ju, J. et al.*

Nature Machine Intelligence 7, 437–447 (2025)

DOI: https://doi.org/10.1038/s42256-025-00994-z

Abstract

Recent advances in large language models have demonstrated remarkable capabilities in various domains, including natural language processing and code generation. In this work, we extend these capabilities to molecular property prediction, a critical task in drug discovery. We propose a novel approach that leverages the knowledge embedded within large language models to enhance the prediction of molecular properties, leading to more accurate and interpretable results compared to traditional methods.

2024

Generic protein–ligand interaction scoring by integrating physical prior knowledge and data augmentation modelling

Cao, D. et al.

Nature Machine Intelligence 6, 261–270 (2024)

DOI: https://doi.org/10.1038/s42256-024-00849-z

Abstract

Developing robust methods for evaluating protein–ligand interactions has been a long-standing problem. Data-driven methods may memorize ligand and protein training data rather than learning protein–ligand interactions. Here we show a scoring approach called EquiScore, which utilizes a heterogeneous graph neural network to integrate physical prior knowledge and characterize protein–ligand interactions in equivariant geometric space. EquiScore is trained based on a new dataset constructed with multiple data augmentation strategies and a stringent redundancy-removal scheme. On two large external test sets, EquiScore consistently achieved top-ranking performance compared to 21 other methods. When EquiScore is used alongside different docking methods, it can effectively enhance the screening ability of these docking methods. EquiScore also showed good performance on the activity-ranking task of a series of structural analogues, indicating its potential to guide lead compound optimization. Finally, we investigated different levels of interpretability of EquiScore, which may provide more insights into structure-based drug design.

2024

Physicochemical graph neural network for learning protein–ligand interaction fingerprints from sequence data

Koh, H.Y., Nguyen, A.T.N., Pan, S. et al.

Nature Machine Intelligence 6, 673–687 (2024)

DOI: https://doi.org/10.1038/s42256-024-00847-1

Abstract

Understanding protein–ligand interactions is fundamental to drug discovery and development. We present a novel physicochemical graph neural network architecture designed to learn protein–ligand interaction fingerprints directly from sequence data. Our approach integrates physicochemical properties of amino acids and small molecules to create more informative representations, leading to improved prediction accuracy and interpretability of binding interactions.
