Tuan Le

📍 Berlin, Germany

Hi! My name is Tuan and I am a Senior Machine Learning Research Scientist at Pfizer. I obtained my Ph.D. from the Freie Universität Berlin under Frank Noé. During that time I worked at Bayer and Pfizer, supervised throughout by Djork-Arné Clevert, focusing on the development of models for supervised and unsupervised learning on small molecules.

My research interests center around representation learning on molecular structures, including drug compounds and proteins, using methodologies from deep learning such as recurrent and graph neural networks in combination with generative learning algorithms to sample novel molecules. I have been working with molecular modeling and am particularly interested in developing methods that respect physical symmetries in both supervised and unsupervised learning settings.

I am also interested in transport-based generative models such as Energy-Based Models, Diffusion and Stochastic Interpolants, and their applications in 3D molecule generation. Through collaboration with project teams across different disciplines, I have been developing software engineering skills to help translate research concepts into practical implementations.

selected publications

Diffusion Generative Modeling on Lie Group Representations

Marco Bertolini*, Tuan Le*, and Djork-Arné Clevert

In The Thirty-ninth Annual Conference on Neural Information Processing Systems Dec 2025

We introduce a novel class of score-based diffusion processes that operate directly in the representation space of Lie groups. Leveraging the framework of Generalized Score Matching, we derive a class of Langevin dynamics that decomposes as a direct sum of Lie algebra representations, enabling the modeling of any target distribution on any (non-Abelian) Lie group. Standard score-matching emerges as a special case of our framework when the Lie group is the translation group T(N). We prove that our generalized generative processes arise as solutions to a new class of paired stochastic differential equations (SDEs), introduced here for the first time. We validate our approach through experiments on diverse data types, demonstrating its effectiveness in real-world applications such as SO(3)-guided molecular conformer generation and modeling ligand-specific global SE(3) transformations for molecular docking, showing improvement in comparison to Riemannian diffusion on the group itself. We show that an appropriate choice of Lie group enhances learning efficiency by reducing the effective dimensionality of the trajectory space and enables the modeling of transitions between complex data distributions.

Equivariant diffusion for structure-based de novo ligand generation with latent-conditioning

Tuan Le*, Julian Cremer*, Djork-Arné Clevert, and Kristof T. Schütt

Journal of Cheminformatics May 2025

We introduce PoLiGenX, a novel generative model for de novo ligand design that employs latent-conditioned, target-aware equivariant diffusion. Our approach leverages the conditioning of the ligand generation process on reference molecules located within a specific protein pocket. By doing so, PoLiGenX generates shape-similar ligands that are adapted to the target pocket, enabling effective applications in target-aware hit expansion and hit optimization. Our experimental results underscore the efficacy of PoLiGenX in advancing ligand design. Notably, docking analyses reveal that the ligands generated by PoLiGenX show enhanced binding affinities relative to their reference molecules while retaining a similar molecular shape, and also adopt better poses with lower strain energies and fewer steric clashes. Furthermore, the model promotes substantial chemical diversity, facilitating the exploration of broader and more varied chemical spaces. Importantly, the generated ligands were assessed for drug-likeness using Lipinski’s rule of five, demonstrating superior adherence to drug-likeness criteria compared to the reference dataset. This work represents a step forward in the controlled and precise generation of therapeutically relevant de novo ligands tailored for specific protein targets, contributing to progress in computational drug discovery and ligand design.
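
As a rough illustration of the drug-likeness check mentioned in this abstract, the sketch below counts Lipinski rule-of-five violations from precomputed descriptors. It is not the evaluation code from the paper. The `Descriptors` container, the function names, the example values, and the convention of tolerating at most one violation are assumptions made for this example; in practice the descriptors would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # molecular weight in daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int      # number of hydrogen-bond donors
    h_bond_acceptors: int   # number of hydrogen-bond acceptors


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four rule-of-five thresholds are exceeded."""
    checks = [
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def passes_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Assumed convention: a molecule passes with at most one violation."""
    return lipinski_violations(d) <= max_violations


# Illustrative descriptor values, not taken from any dataset.
small_polar = Descriptors(mol_weight=180.0, logp=1.2, h_bond_donors=1, h_bond_acceptors=4)
large_lipophilic = Descriptors(mol_weight=720.0, logp=6.3, h_bond_donors=2, h_bond_acceptors=9)

for name, mol in [("small_polar", small_polar), ("large_lipophilic", large_lipophilic)]:
    print(name, lipinski_violations(mol), passes_rule_of_five(mol))
```

Comparing the fraction of passing molecules in a generated set with the fraction in a reference set is one simple way to express "adherence to drug-likeness criteria" as a number.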

PILOT: equivariant diffusion for pocket-conditioned de novo ligand generation with multi-objective guidance via importance sampling

Julian Cremer*, Tuan Le*, Frank Noé, Djork-Arné Clevert, and 1 more author

Chem. Sci. May 2024

The generation of ligands that both are tailored to a given protein pocket and exhibit a range of desired chemical properties is a major challenge in structure-based drug design. Here, we propose an in silico approach for the de novo generation of 3D ligand structures using the equivariant diffusion model PILOT, combining pocket conditioning with large-scale pre-training and property guidance. Its multi-objective trajectory-based importance sampling strategy is designed to direct the model towards molecules that not only exhibit desired characteristics such as increased binding affinity for a given protein pocket but also maintain high synthetic accessibility. This ensures the practicality of sampled molecules, thus maximizing their potential for the drug discovery pipeline. PILOT significantly outperforms existing methods across various metrics on the common benchmark dataset CrossDocked2020. Moreover, we employ PILOT to generate novel ligands for unseen protein pockets from the Kinodata-3D dataset, which encompasses a substantial portion of the human kinome. The generated structures exhibit predicted IC50 values indicative of potent biological activity, which highlights the potential of PILOT as a powerful tool for structure-based drug design.

Navigating the Design Space of Equivariant Diffusion-Based Generative Models for De Novo 3D Molecule Generation

Tuan Le*, Julian Cremer*, Frank Noé, Djork-Arné Clevert, and 1 more author

In The Twelfth International Conference on Learning Representations May 2024

Deep generative diffusion models are a promising avenue for 3D de novo molecular design in materials science and drug discovery. However, their utility is still limited by suboptimal performance on large molecular structures and limited training data. To address this gap, we explore the design space of E(3)-equivariant diffusion models, focusing on previously unexplored areas. Our extensive comparative analysis evaluates the interplay between continuous and discrete state spaces. From this investigation, we present the EQGAT-diff model, which consistently outperforms established models for the QM9 and GEOM-Drugs datasets. Significantly, EQGAT-diff treats atom positions as continuous and chemical elements and bond types as categorical, and uses time-dependent loss weighting, substantially improving training convergence, the quality of generated samples, and inference time. We also showcase that including chemically motivated additional features like hybridization states in the diffusion process enhances the validity of generated molecules. To further strengthen the applicability of diffusion models to limited training data, we investigate the transferability of EQGAT-diff trained on the large PubChem3D dataset with implicit hydrogen atoms to target different data distributions. Fine-tuning EQGAT-diff for just a few iterations shows an efficient distribution shift, further improving performance throughout data sets. Finally, we test our model on the Crossdocked data set for structure-based de novo ligand generation, underlining the importance of our findings by showing state-of-the-art performance on Vina docking scores.

Representation Learning on Biomolecular Structures using Equivariant Graph Attention

Tuan Le*, Frank Noé, and Djork-Arné Clevert

In Learning on Graphs Conference May 2022

Learning and reasoning about 3D molecular structures with varying size is an emerging and important challenge in machine learning and especially in the development of biotherapeutics. Equivariant Graph Neural Networks (GNNs) can simultaneously leverage the geometric and relational detail of the problem domain and are known to learn expressive representations through the propagation of information between nodes leveraging higher-order representations to faithfully express the geometry of the data, such as directionality in their intermediate layers. In this work, we propose an equivariant GNN that operates with Cartesian coordinates to incorporate directionality and we implement a novel attention mechanism, acting as a content and spatial dependent filter when propagating information between nodes. Our proposed message function processes vector features in a geometrically meaningful way by mixing existing vectors and creating new ones based on cross products. We demonstrate the efficacy of our architecture on accurately predicting properties of large biomolecules and show its computational advantage over recent methods which rely on irreducible representations by means of the spherical harmonics expansion.
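
To make the remark about cross products concrete, the sketch below numerically checks that the cross product of two vector features commutes with a proper rotation, which is the property that lets new vector features be created without breaking rotation equivariance. This is a standalone toy check, not the message function from the paper; the helper names, the test vectors, and the rotation angles are arbitrary choices for the example.

```python
import math


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3D vector."""
    return tuple(sum(m[i][j] * v[j] for j in range(3)) for i in range(3))


def matmul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def rot_z(t):
    """Rotation by angle t (radians) about the z axis."""
    c, s = math.cos(t), math.sin(t)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_x(t):
    """Rotation by angle t (radians) about the x axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]


# A generic proper rotation built from two axis rotations.
R = matmul(rot_z(0.7), rot_x(-1.1))

# Two arbitrary vector features.
u = (1.0, 2.0, 3.0)
v = (-0.5, 0.4, 2.0)

rotate_then_cross = cross(matvec(R, u), matvec(R, v))
cross_then_rotate = matvec(R, cross(u, v))

max_err = max(abs(x - y) for x, y in zip(rotate_then_cross, cross_then_rotate))
print(f"max deviation: {max_err:.2e}")
```

The identity holds for proper rotations; under a reflection the cross product picks up a sign flip, so it behaves as a pseudovector.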

Parameterized Hypercomplex Graph Neural Networks for Graph Classification

Tuan Le*, Marco Bertolini, Frank Noé, and Djork-Arné Clevert

In Artificial Neural Networks and Machine Learning – ICANN 2021 May 2021

Despite recent advances in representation learning in hypercomplex (HC) space, this subject is still vastly unexplored in the context of graphs. Motivated by the complex and quaternion algebras, which have been found in several contexts to enable effective representation learning that inherently incorporates a weight-sharing mechanism, we develop graph neural networks that leverage the properties of hypercomplex feature transformation. In particular, in our proposed class of models, the multiplication rule specifying the algebra itself is inferred from the data during training. Given a fixed model architecture, we present empirical evidence that our proposed model incorporates a regularization effect, alleviating the risk of overfitting. We also show that for fixed model capacity, our proposed method outperforms its corresponding real-formulated GNN, providing additional confirmation for the enhanced expressivity of HC embeddings. Finally, we test our proposed hypercomplex GNN on several open graph benchmark datasets and show that our models reach state-of-the-art performance while consuming a much lower memory footprint with 70% fewer parameters. Our implementations are available at https://github.com/bayer-science-for-a-better-life/phc-gnn.
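
The weight-sharing idea behind quaternion layers, which this abstract cites as motivation, can be sketched with the fixed Hamilton product: four real weights fill a 4x4 real matrix that a dense layer would parameterize with 16 independent entries. The code below shows only that fixed quaternion case. It is not the paper's model, in which the multiplication rule is learned, and the function names are hypothetical.

```python
def hamilton_matrix(a, b, c, d):
    """4x4 real matrix for left-multiplication by the quaternion a + bi + cj + dk."""
    return [
        [a, -b, -c, -d],
        [b,  a, -d,  c],
        [c,  d,  a, -b],
        [d, -c,  b,  a],
    ]


def apply(matrix, x):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in matrix]


# Basis quaternions written as (real, i, j, k) coefficient vectors.
unit_i = [0.0, 1.0, 0.0, 0.0]
unit_j = [0.0, 0.0, 1.0, 0.0]

# Sanity check of the algebra: i * j = k and j * i = -k.
i_times_j = apply(hamilton_matrix(*unit_i), unit_j)
j_times_i = apply(hamilton_matrix(*unit_j), unit_i)
print(i_times_j, j_times_i)

# Weight sharing: 4 free parameters versus 16 for an unconstrained 4x4 map.
free_params = 4
dense_params = 4 * 4
print(f"parameter ratio: {free_params / dense_params}")
```

Each of the four weights appears once in every row and every column, so the matrix is fully determined by one quaternion.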

Neuraldecipher – reverse-engineering extended-connectivity fingerprints (ECFPs) to their molecular structures

Tuan Le*, Robin Winter, Frank Noé, and Djork-Arné Clevert

Chem. Sci. May 2020

Protecting molecular structures from disclosure against external parties is of great relevance for industrial and private associations, such as pharmaceutical companies. Within the framework of external collaborations, it is common to exchange datasets by encoding the molecular structures into descriptors. Molecular fingerprints such as the extended-connectivity fingerprints (ECFPs) are frequently used for such an exchange, because they typically perform well on quantitative structure–activity relationship tasks. ECFPs are often considered to be non-invertible due to the way they are computed. In this paper, we present a fast reverse-engineering method to deduce the molecular structure given revealed ECFPs. Our method includes the Neuraldecipher, a neural network model that predicts a compact vector representation of compounds, given ECFPs. We then utilize another pre-trained model to retrieve the molecular structure as a SMILES representation. We demonstrate that our method is able to reconstruct molecular structures to some extent, and that it improves when ECFPs with larger fingerprint sizes are revealed. For example, given ECFP count vectors of length 4096, we are able to correctly deduce up to 69% of molecular structures on a validation set (112 K unique samples) with our method.