Advancing Molecular Machine Learning - Overcoming Limitations [ML4Molecules] | ELLIS workshop, VIRTUAL, December 8, 2023, unofficial NeurIPS2023 side-event


[!IMPORTANT]
2023-12-07: The workshop is taking place online via Zoom. If you registered for the event on Eventbrite, you should have received the link and login credentials via email. If you have not received this information, contact us via ml4molecules@ml.jku.at.

About

Recent breakthroughs in machine learning (ML) for molecules have demonstrated impressive successes, ranging from highly accurate protein structure prediction to assisting the discovery of novel drug candidates and chemical synthesis planning. These achievements position molecular machine learning as a key tool for addressing pressing challenges related to drug and materials discovery. However, current machine learning methods, especially deep learning (DL) methods, still face significant challenges and limitations. DL models (I) are data hungry, while data in the molecular sciences are often rather sparse, (II) struggle to adapt to changing tasks or distributions, (III) fail to (properly) incorporate domain knowledge, and (IV) lack explainability and an inherent ability to differentiate causation from correlation. Additionally, given the flood of newly published methods, thorough benchmarks as well as regulations for deployment and sustainability are needed.

This workshop aims to address the current limitations and capabilities of machine learning methods for molecules by (i) critically assessing them both theoretically and in applied and industrial settings, and (ii) showcasing novel and promising approaches to accelerate molecule discovery. Moreover, we will explore how recent advancements in Large Language Models (LLMs) may have the potential to revolutionize the field.

We encourage contributions that focus on robust architectures capable of handling domain shifts, novel chemical series, and diverse types of molecules. We also welcome methods that enable quick adaptation to newly acquired data, leveraging few- and zero-shot learning approaches. Furthermore, we aim to explore novel strategies for abstracting molecule representations, empowering broader generalization capabilities. Promising directions involve developing machine learning methods for creating relevant physical abstractions to enhance molecular dynamics simulations or force fields, as well as strategies to tackle automated design-make-test-analyze cycles.

Join us at this workshop, where experts from diverse fields, including ML, molecular sciences, and LLMs, will collaborate to overcome current limitations, explore new possibilities, and chart the future of molecular machine learning. Together, we aim to accelerate the discovery of functional molecules, revolutionize drug development, secure our food supply, and drive sustainable energy conversion and storage, ultimately shaping a better future for humanity.

Registration

The workshop will be open to everyone without a registration fee. You can register here!

Schedule

Fri, Dec. 8th 2023, 09:00 am - 06:00 pm CET, online via Zoom.

CET	Event	Speakers	Title
	1. Session	Session Chair: Francesca Grisoni
09:00	Opening remarks	ML4Mol Chair
09:00	Invited talk	Jan H. Jensen	Using machine learning and quantum chemistry in drug discovery
09:30	Invited talk	Renana Gershoni-Poranne	Exploring the Chemical Space of Polycyclic Aromatic Systems
10:00	Contributed talk	Rıza Özçelik	Structured State-Space Sequence Models for De Novo Drug Design
10:15	Contributed talk	Elizaveta Kozlova	Protein Inpainting Co-Design with ProtFill
10:30	Contributed talk	Junwu Chen	Molecular Hypergraph Neural Network
10:45	Contributed talk	Florian Sestak	VN-EGNN: Equivariant Graph Neural Networks with Virtual Nodes Enhance Protein Binding Site Identification
11:00	Poster Session 1 (PS 1)	Poster discussion at Gathertown
12:00	Break
	2. Session	Session Chair: Andrea Volkamer
13:00	Invited talk	Rianne van den Berg	Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics
13:30	Invited talk	Bruno Correia	Leveraging learned surface fingerprints and generative AI for small-molecule design
14:00	Invited talk	Eva Nittinger	Generative Drug Design with REINVENT - Possibilities and Open Challenges
14:30	Contributed talk	Ilia Igashov	RetroBridge: Modeling Retrosynthesis with Markov Bridges
14:45	Contributed talk	Roman Joeres	DataSAIL: Data Splitting Against Information Leakage
15:00	Break
	3. Session	Session Chair: Philippe Schwaller
15:30	Invited talk	Raquel Rodríguez-Pérez	Advancing Drug Design with Machine Learning: Predicting Compound Properties in the Pharmaceutical Industry
16:00	Invited talk	Pat Walters	Benchmarking Machine Learning Models in Drug Discovery - You’re Probably Doing It Wrong
16:30	Invited talk	Andrew D. White	Agents for Scientific Research Over Scientific Domains
17:00	Closing remarks	ML4Mol Chair
17:05	Poster Session 2 (PS 2)	Poster discussion at Gathertown
18:00	End

Keynote Speakers

	Bruno Correia, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland.
	Raquel Rodríguez-Pérez, Novartis Institutes for Biomedical Research, Switzerland.
	Renana Gershoni-Poranne, Technion, Israel.
	Jan H. Jensen, University of Copenhagen, Denmark.
	Rianne van den Berg, Microsoft Research, Netherlands.
	Andrew D. White, Future House, San Francisco, United States.
	Eva Nittinger, AstraZeneca, Sweden.
	Pat Walters, Relay Therapeutics, Cambridge, United States.

Keynote Abstracts

Renana Gershoni-Poranne - Exploring the Chemical Space of Polycyclic Aromatic Systems

Polycyclic aromatic systems (PASs) are among the most prevalent and impactful classes of compounds in the natural and man-made world. Though aromatic systems have captured the fascination of chemists for almost two centuries, a general conceptual framework for understanding and predicting the structure-property relationships of polycyclic systems remains elusive. Yet the structure-property relationships of polybenzenoid hydrocarbons (PBHs) have both conceptual and practical implications, and understanding them can enable the design of new functional compounds. We address this gap using a combination of computational chemistry and data science tools. We first interrogated PBHs using a combination of traditional computational techniques, including characterization of their aromatic character in the S0 and T1 states (described with the NICS metric), their spin density in the T1 state, and their S0-T1 energy gaps. Regularities were revealed that allowed simple and intuitive design guidelines to be defined. To verify these guidelines in a data-driven manner, we generated a new database, the COMPAS Project, and developed two types of molecular representation to enable machine- and deep-learning models to train on the new data: a) a text-based representation and b) a graph-based representation. In addition to their predictive ability, we demonstrate the interpretability of the models that is achieved when using these representations. The extracted insight in some cases confirms well-known “rules of thumb” and in other cases disproves common wisdom and sheds new light on this classical family of compounds. Finally, we implemented a generative model that designs novel PASs with targeted properties in an effective and efficient manner, demonstrating the first inverse design of PASs.

Raquel Rodríguez-Pérez - Advancing Drug Design with Machine Learning: Predicting Compound Properties in the Pharmaceutical Industry

Machine learning (ML) and deep learning models have become indispensable tools for predicting compound properties, including activity but also pharmacokinetics and toxicity endpoints. These predictions play a vital role in decision-making and assist in drug design. Recently, our investigations have focused on benchmarking different training set compositions for model generation (global vs local models). We have also explored approaches to address the challenge of changing distributions in specific projects or new modalities (domain adaptation), as well as the interpretation of predictions from ML models, with a particular emphasis on explainability and uncertainty. This talk will highlight relevant applications of ML-based molecular property predictions in the pharmaceutical industry, shedding light on their significance and addressing the challenges that require further research.

Rianne van den Berg - Diffusion Models and Force Fields for Coarse-Grained Molecular Dynamics

In this talk I will first briefly discuss some of the research areas that we are currently exploring in AI4Science at Microsoft Research, covering topics such as drug discovery, material generation and neural PDE solvers. Then I will dive a little deeper into recent work on the use of score-based generative modeling for coarse-graining (CG) molecular dynamics simulations. By training a diffusion model on protein structures from molecular dynamics simulations we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.

Eva Nittinger - Generative Drug Design with REINVENT - Possibilities and Open Challenges

De novo drug design has gained increasing interest in the computer-aided drug design community over the last few years. REINVENT, the molecular generative method developed in-house at AstraZeneca, has been continuously developed and open sourced. Merely generating thousands of novel molecules is no longer a difficult task, as shown by recent discussions around relevant benchmarks for molecular generative models. The scoring and selection process, however, still is. This talk will show the range of capabilities of REINVENT and discuss the open challenges in the field that still need to be tackled.

Pat Walters - Benchmarking Machine Learning Models in Drug Discovery - You’re Probably Doing It Wrong

While machine learning (ML) models have been applied to quantitative structure-activity relationships (QSAR) for more than 20 years, the field has yet to arrive at standards for benchmark evaluations. Published benchmark studies have employed a wide range of datasets, cross-validation methodologies, and evaluation metrics. While variety is important, it is essential that benchmarks provide an accurate reflection of model performance. Unfortunately, many papers that compare ML methods and/or molecular representations use highly flawed datasets and fail to employ appropriate statistical methods. Datasets considered “standards” in the field contain numerous errors which may not be apparent to non-experts. These errors compromise and may invalidate method comparisons. In addition, many papers either ignore or inappropriately apply statistical tests for comparing distributions. Reported differences between methods often evaporate when exposed to statistical scrutiny. For the field to progress, we must establish standards and develop an evaluation framework that authors, reviewers, and journal editors can use. This will require a concerted, collaborative effort between domain experts, machine learning practitioners, and statisticians. This presentation will highlight prevalent issues with published benchmarking studies and suggest a path forward.

Accepted contributions (poster)

1	Assessing the Extrapolation Capability of Template-Free Retrosynthesis Models	Shuan Chen, Yousung Jung	PS 1
2	TS-DiffuGen: An equivariant diffusion model for reaction transition state conformation generation	Sacha Raffaud, Jeff Guo, Philippe Schwaller	PS 2
3	Activity Cliffs Go Smooth: Graph Siamese Neural Networks for Molecular Activity Prediction	Ghaith Mqawass, Steffen Hirte, Johannes Kirchmair, Nils Morten Kriege	PS 1
4	Bayesian Optimization of Catalysts With In-context Learning	Mayk Caldas Ramos, Shane Michtavy, Marc Porosoff, Andrew White	PS 2
5	Inverse-design of organometallic catalysts with guided equivariant diffusion	François R J Cornet, Bardi Benediktsson, Bjarke Hastrup, Arghya Bhowmik, Mikkel N. Schmidt	PS 1
6	Molecule-Edit Templates for Efficient and Accurate Retrosynthesis Prediction	Mikołaj Sacha, Michał Sadowski, Piotr Kozakowski, Ruad van Workum, Stanislaw Kamil Jastrzebski	PS 2
7	Machine learning-guided high throughput nanoparticle design	Derek van Tilborg, Ana Ortiz-Perez, Roy van der Meel, Francesca Grisoni, Lorenzo Albertazzi	PS 1
8	Retro-fallback: retrosynthetic planning in an uncertain world	Austin Tripp, Krzysztof Maziarz, Sarah Lewis, Marwin Segler, José Miguel Hernández-Lobato	PS 2
9	Exploring Organic Syntheses through Natural Language	Andres M Bran, Philippe Schwaller	PS 2
10	Harmonic Prior Self-conditioned Flow Matching for Multi-Ligand Docking and Binding Site Design	Hannes Stark, Bowen Jing, Regina Barzilay, Tommi S. Jaakkola	PS 2
11	Coherent Energy and Force Uncertainty in Deep Learning Force Fields	Peter Bjørn Jørgensen, Jonas Busk, Ole Winther, Mikkel N. Schmidt	PS 1
12	Transition Path Sampling with Boltzmann Generator-based MCMC Moves	Michael Plainer, Hannes Stark, Charlotte Bunne, Stephan Günnemann	PS 2
13	Guided docking as a data generation approach facilitates structure-based machine learning on kinases	Joschka Groß, Michael Backenköhler, Verena Wolf, Andrea Volkamer	PS 1
14	Automatic Generation of Mechanistic Pathways of Organic Reactions with Dual Templates	Shuan Chen, Ramil Babazade, Yousung Jung	PS 1
15	Unveiling the Secrets of $^1$H-NMR Spectroscopy: A Novel Approach Utilizing Attention Mechanisms	Oliver Schilter, Marvin Alberts, Alain C. Vaucher, Philippe Schwaller, Teodoro Laino	PS 2
16	Improved Chirality Encodings Boost Transformer-Based Stereochemical Reaction Prediction	Rémi Schlama, Philippe Schwaller	PS 2
17	Autoregressive Reinforcing Framework for Fragment-based Generative Model	Gunwook Nam, Yousung Jung	PS 1
18	Discriminator-Driven Diffusion Mechanisms for Molecular Graph Generation	Gerrit Großmann	PS 2
19	Genetic algorithms are strong baselines for molecule generation	Austin Tripp, José Miguel Hernández-Lobato	PS 1
20	Retrieval of synthesis parameters of polymer nanocomposites using LLMs	Defne Circi, Ghazal Khalighinejad, Shruti Badhwar, Bhuwan Dhingra, L. Brinson	PS 2
21	Graph-to-String Variational Autoencoder for Synthetic Polymer Design	Gabriel Vogel, Paolo Sortino, Jana Marie Weber	PS 1
22	MolSiam: Simple Siamese Self-supervised Representation Learning for Small Molecules	Joshua Yao-Yu Lin, Michael Maser, Nathan C. Frey, Gabriele Scalia, Omar Mahmood, Pedro O. Pinheiro, Ji Won Park, Stephen Ra, Andrew Martin Watkins, Kyunghyun Cho	PS 2
23	BoChemian: Large Language Model Embeddings for Bayesian Optimization of Chemical Reactions	Bojana Ranković, Philippe Schwaller	PS 1

Important dates

October 23, 2023: Deadline for submission
Mid to late November 2023: Author notification
December 8, 2023: Workshop

Call for papers

We are calling for papers advancing or critically assessing molecular machine learning. Topics include (but are not limited to):

Benchmarking molecular machine learning methods
Data-efficient learning
Large language models in chemistry
Model interpretability and explainability
Interatomic potentials for molecules and materials
Generative modeling
Machine learning for protein engineering
Automation of the DMTA cycle
Chemical reactions

Please submit your contributions on OpenReview by October 23, 2023, 11:59 PM UTC. Submissions should be in PDF format and follow the NeurIPS template, with a maximum of 4 pages (not including references and appendices). Please anonymize your paper, since the review process is dual-anonymous.

Organizing Committee and Contact

Chairs: Michele Ceriotti, Francesca Grisoni, Philippe Schwaller, Andrea Volkamer

Organizing committee: Michael Backenköhler, Helena Brinkmann, Michele Ceriotti, Francesca Grisoni, Rıza Özçelik, Philippe Schwaller, Andrea Volkamer, Geemi Wellawatte

Contact: ml4molecules@ml.jku.at