BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning

1Insilico Medicine Canada Inc., 2Insilico Medicine AI Limited,
3Mila - Quebec AI Institute, 4Polytechnique Montreal, 5CIFAR AI Chair
BindGPT

BindGPT is a new framework for building drug discovery models that leverages compute-efficient pretraining, supervised finetuning, prompting, reinforcement learning, and tool use of language models. With it, we build a single pretrained model that achieves state-of-the-art performance on 3D molecule generation, 3D conformer generation, and pocket-conditioned 3D molecule generation, treating them all as downstream tasks of one pretrained model, whereas previous methods build task-specialized models with no ability to transfer across tasks. At the same time, thanks to fast transformer inference, BindGPT is two orders of magnitude (100x) faster at generation than previous methods.

BindGPT is

  • if you are an LLM researcher: a new framework enabling full-stack LLM research at the cost of just one node (~8 A100 GPUs). We demonstrate a proof of concept of the InstructGPT pipeline (pretraining -> SFT -> reinforcement learning), prompting, and tool use applied to this new domain.
  • if you are an AI4Science researcher: a powerful new state-of-the-art baseline model that can solve many generative molecular tasks zero-shot or after transfer. It is fully data-driven, in contrast to recent diffusion models and graph neural networks for this domain that rely on strong inductive biases.
  • if you are a biotechnology researcher: a new generative model for small molecules in 3D, augmented with reinforcement learning. BindGPT solves three generative tasks: 3D molecule generation, conformer generation, and pocket-conditioned molecule generation. Notably, the model is finetuned with RL only once, across many protein targets, so there is no need to finetune on each new test protein target.

Abstract

Generating novel active molecules for a given protein is an extremely challenging task for generative models that requires an understanding of the complex physical interactions between the molecule and its environment. In this paper, we present a novel generative model, BindGPT, which uses a conceptually simple but powerful approach to create 3D molecules within the protein's binding site. Our model produces molecular graphs and conformations jointly, eliminating the need for an extra graph reconstruction step. We pretrain BindGPT on a large-scale dataset and fine-tune it with reinforcement learning using scores from external simulation software. We demonstrate how a single pretrained language model can serve at the same time as a 3D molecular generative model, a conformer generator conditioned on the molecular graph, and a pocket-conditioned 3D molecule generator. Notably, the model does not make any representational equivariance assumptions about the domain of generation. We show how such a conceptually simple approach, combined with pretraining and scaling, can perform on par with or better than the current best specialized diffusion models, language models, and graph neural networks, while being two orders of magnitude cheaper to sample from.

Training Paradigm

Pretraining

The key idea of our method is to use an autoregressive token generation model, in the spirit of GPT-style models, to solve several 3D small-molecule generation tasks within one simple yet flexible paradigm. The main principle of our approach is to formulate each 3D molecular design task as prompted generation of text. To achieve that, we lay out the tokens of the condition before the tokens of the object to generate. For instance, the prompt can be the protein pocket for the pocket-conditioned generation task, or the 2D molecular structure for the conformation generation task. During pretraining, every sequence in the training batch is either a ligand token sequence or a pocket token sequence, following the scheme in Figure 1. We pretrain a 100M-parameter GPT model on 42B tokens for the version of the data without hydrogens and 90B tokens for the data with hydrogens.


Figure 1: Data layout during pretraining. Arrows show the token sequence order. Nodes such as <POCKET> denote special tokens. Training is done on a mixture of pocket and ligand datasets.
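To make the layout concrete, here is a minimal sketch of how a ligand or pocket could be serialized into a single token string. The special tokens beyond <LIGAND> and <POCKET>, the <XYZ> marker, the coordinate precision, and the pocket encoding are illustrative assumptions rather than the exact BindGPT tokenization.

```python
# Illustrative serialization of pretraining sequences; <XYZ> and the coordinate
# formatting are assumptions made for readability, not the exact BindGPT scheme.

def format_ligand(smiles: str, coords) -> str:
    """Lay out a ligand as: <LIGAND> SMILES <XYZ> per-atom coordinates."""
    xyz = " ".join(f"{x:.2f} {y:.2f} {z:.2f}" for x, y, z in coords)
    return f"<LIGAND> {smiles} <XYZ> {xyz}"

def format_pocket(atom_types, coords) -> str:
    """Lay out a pocket as: <POCKET> atom/residue types <XYZ> per-atom coordinates."""
    xyz = " ".join(f"{x:.2f} {y:.2f} {z:.2f}" for x, y, z in coords)
    return f"<POCKET> {' '.join(atom_types)} <XYZ> {xyz}"

# During pretraining, each training sequence is either a ligand string or a pocket
# string, and batches are sampled from a mixture of the two datasets.
example = format_ligand("c1ccccc1O", [(0.00, 1.40, 0.00), (1.21, 0.70, 0.00)])  # coords truncated
```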

Supervised Finetuning and Reinforcement Learning


Figure 2: Data layout during finetuning. Arrows show the token sequence order. Nodes such as <LIGAND> denote special tokens.

As a result of pretraining, BindGPT gains an understanding of a broad chemical space of molecules and proteins. In the supervised finetuning stage, we simply concatenate the text encodings of the protein pocket and the molecule, as shown in Figure 2. We treat the resulting sequences as prompt-response strings and finetune the language model in a supervised fashion (BindGPT-FT). Therefore, having learned about proteins and molecules separately during pretraining, the finetuning stage exploits this independent knowledge of pockets and ligands to learn the conditional dependency between them.
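A minimal sketch of how such a pocket-ligand training pair could be assembled is given below; the prompt-loss masking and the tokenizer interface are assumptions made for illustration, not the paper's exact recipe.

```python
import torch

def sft_example(tokenizer, pocket_str: str, ligand_str: str):
    """Concatenate the pocket encoding (prompt) and the ligand encoding (response)
    into one sequence and build labels for next-token prediction. Masking the prompt
    positions is one reasonable choice; training on the full sequence also works."""
    prompt_ids = tokenizer.encode(pocket_str)
    response_ids = tokenizer.encode(" " + ligand_str)
    input_ids = torch.tensor(prompt_ids + response_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = -100  # ignore prompt tokens in the LM loss
    return input_ids, labels

# At test time, the model is prompted with "<POCKET> ... <LIGAND>" and completes
# the ligand tokens (SMILES and coordinates).
```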

We employ SMILES randomization, which yields on the order of 100-1000 distinct SMILES strings for a single molecule. We also randomly rotate the 3D coordinates of the protein pocket and of the ligand (with the same rotation matrix). This way our model learns structural and spatial properties of molecular binding beyond raw token sequences; that is, it learns protein-ligand binding in a rotation-equivariant way.
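The sketch below shows what these two augmentations could look like using RDKit and SciPy; whether BindGPT uses exactly these utilities is an assumption.

```python
import numpy as np
from rdkit import Chem
from scipy.spatial.transform import Rotation

def randomize_smiles(mol: Chem.Mol) -> str:
    """Return a random, non-canonical SMILES string for the same molecule."""
    return Chem.MolToSmiles(mol, canonical=False, doRandom=True)

def rotate_pair(pocket_xyz: np.ndarray, ligand_xyz: np.ndarray):
    """Apply the same random rotation to pocket and ligand coordinates so that
    their relative pose (and hence the binding geometry) is preserved."""
    R = Rotation.random().as_matrix()
    return pocket_xyz @ R.T, ligand_xyz @ R.T
```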

To enable generation of molecules with high binding affinity, we employ two strategies: reward-conditioned finetuning (BindGPT-RFT) and finetuning with RL (BindGPT-RL). The first simply adds a scalar score to the prompt of the language model during the supervised finetuning stage.
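As a rough illustration of score-in-the-prompt conditioning, the snippet below maps a docking score to a discrete bin token prepended to the pocket prompt; the <SCORE_k> tokens, the score range, and the binning scheme are assumptions, not the paper's exact encoding of the scalar reward.

```python
def reward_prompt(pocket_str: str, vina_score: float, n_bins: int = 10) -> str:
    """Illustrative reward conditioning: map the docking score to a bin token
    and prepend it to the pocket prompt. Tokens and binning are hypothetical."""
    lo, hi = -12.0, 0.0                                   # rough Vina score range (kcal/mol)
    frac = (min(max(vina_score, lo), hi) - lo) / (hi - lo)
    k = min(int(frac * n_bins), n_bins - 1)
    return f"<SCORE_{k}> {pocket_str}"
```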

The second uses reinforcement learning to optimize the binding affinity of generated molecules. We take the model from the SFT stage (i.e., BindGPT-FT) and train it with REINFORCE over a fixed dataset of pocket prompts.
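The sketch below illustrates a single REINFORCE update under two assumptions: a Hugging Face-style causal LM interface (generate / labels) and an external reward_fn wrapping the docking software. The actual BindGPT training loop (baselines, KL control, batching, reward shaping) is more involved.

```python
import torch

def reinforce_step(model, tokenizer, optimizer, pocket_prompts, reward_fn,
                   baseline: float = 0.0, max_new_tokens: int = 256):
    """One REINFORCE update: sample a ligand per pocket prompt, score it with the
    external docking tool (reward_fn), and reweight the sample's log-likelihood
    by its advantage. Minimal sketch, not the exact BindGPT implementation."""
    model.train()
    loss = 0.0
    for prompt in pocket_prompts:
        prompt_ids = torch.tensor([tokenizer.encode(prompt)])
        # Sample a ligand completion from the current policy.
        sample = model.generate(prompt_ids, do_sample=True,
                                max_new_tokens=max_new_tokens)
        ligand_text = tokenizer.decode(sample[0, prompt_ids.shape[1]:])
        reward = reward_fn(ligand_text)            # e.g. negative Vina score
        # Total log-probability of the sampled ligand tokens under the policy.
        labels = sample.clone()
        labels[:, :prompt_ids.shape[1]] = -100     # prompt tokens carry no gradient
        out = model(sample, labels=labels)         # out.loss = mean NLL over ligand tokens
        logprob = -out.loss * (labels != -100).sum()
        loss = loss - (reward - baseline) * logprob
    loss = loss / len(pocket_prompts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```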

Results

We evaluate BindGPT on three downstream tasks of drug discovery: 3D molecule generation, 3D conformer generation, and pocket-conditioned generation of 3D molecules. We pretrain BindGPT on a large dataset of 208M 3D molecular conformers. After pretraining, the model can already solve two of the downstream tasks zero-shot: 3D molecule generation and 3D conformer generation.

Table 1: Generative metrics for the molecule generation task after large-scale pretraining. (H) indicates that explicit hydrogens are generated with the molecules. For XYZ-TF, the RMSD calculation algorithm failed to converge. BindGPT and XYZ-TF are the only models capable of pretraining at such a large scale.

Figure 3: Unconditional samples from the model. No cherry picking, filtering, or tool-use.

After the pretraining phase, we pass the token <LIGAND> to the model to generate a 3D molecule. Examples of generated molecules are shown in Figure 3. In Table 1, we summarize the generative metrics that our model obtains after pretraining. Note that no other 3D generative method is capable of pretraining at such a large scale (see the paper for details). We report the validity (↑) of molecules and three druglikeness metrics (QED, SA, Lipinski (↑)) to represent the overall quality of the generated molecular graph structures. Finally, we report the RMS distance to the closest RDKit conformer.

Although RDKit is not the most accurate conformer generator, it is what we used during pretraining to generate the roughly 200M conformers. We use a more accurate RMSD computation in the next section.
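For illustration, the sketch below shows one way these metrics could be computed with RDKit; the exact evaluation protocol of the paper (number of embedded reference conformers, alignment details, the SA score implementation) is not reproduced here.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, QED, rdMolAlign

def eval_generated(smiles: str, gen_mol_3d: Chem.Mol, n_ref: int = 20):
    """Validity/QED of the generated graph plus the best-alignment RMSD between the
    generated conformer and a set of RDKit (ETKDG) reference conformers.
    gen_mol_3d is assumed to be the same molecule (without explicit hydrogens)
    carrying the generated 3D coordinates."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                                   # invalid molecule
    qed = QED.qed(mol)
    ref = Chem.AddHs(Chem.Mol(mol))
    AllChem.EmbedMultipleConfs(ref, numConfs=n_ref, randomSeed=0)  # RDKit reference set
    ref = Chem.RemoveHs(ref)
    if ref.GetNumConformers() == 0:
        return {"QED": qed, "RMSD_to_RDKit": float("nan")}
    rmsd = min(rdMolAlign.GetBestRMS(gen_mol_3d, ref, refId=cid)
               for cid in range(ref.GetNumConformers()))
    return {"QED": qed, "RMSD_to_RDKit": rmsd}
```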

To compare BindGPT with other 3D generative models, we finetune the model on GEOM-DRUGS, currently the main dataset used for training 3D generative models. The results are shown in Table 2. The knowledge attained during pretraining enables efficient transfer to the high-quality conformers of GEOM-DRUGS, in contrast to MolDiff and EDM, both of which are small-scale diffusion models.

Table 2: Generative metrics of the generated 3D molecules on GEOM-DRUGS. BindGPT is finetuned on this dataset after pretraining at a much larger scale, while the other methods train only on GEOM-DRUGS. Most of the metrics measure the distance (↓) between the generated distribution and the validation set of GEOM-DRUGS.

For the task of generating 3D conformers given a molecular graph, we use the Platinum dataset, which was designed to assess the quality of conformer generators. Figures 4 and 5 show zero-shot results for generating conformations on this dataset. We report the RMSD coverage metric (↑), which measures how closely the generated 3D conformers match the reference ones.

Figure 4: Generated conformers for reference molecules from the Platinum dataset.

Figure 5: RMSD Coverage metric (↑) calculated on the Platinum dataset for the 3D conformation generation task.
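A minimal sketch of the coverage-style metric is shown below: the fraction of reference conformers matched by at least one generated conformer within an RMSD threshold. The 1.25 Å threshold is an illustrative choice, not necessarily the paper's setting.

```python
from rdkit.Chem import rdMolAlign

def rmsd_coverage(gen_mols, ref_mols, threshold: float = 1.25) -> float:
    """Fraction of reference conformers that have at least one generated conformer
    within `threshold` Angstrom best-alignment RMSD (illustrative threshold)."""
    covered = 0
    for ref in ref_mols:
        best = min(rdMolAlign.GetBestRMS(gen, ref) for gen in gen_mols)
        covered += int(best <= threshold)
    return covered / len(ref_mols)
```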

Finally, we evaluate our framework on pocket-conditioned generation of 3D molecules. As explained in the previous sections, we finetune the model on the aligned dataset of pockets and molecules. In particular, we explore three variants of such finetuning. First, BindGPT-FT is finetuned on all pocket-ligand pairs, including the poorly scored ones, to provide an initialization for the RL model. Second, BindGPT-RFT performs reward-conditioned finetuning on the same data and is conditioned on high rewards (low Vina scores) at test time. Third, BindGPT-RL is finetuned from BindGPT-FT with reinforcement learning to minimize the Vina score. We report the binding affinity score (Vina score ↓) as the main metric of interest, along with druglikeness metrics (QED, SA, Lipinski ↑).

Table 3: Generative metrics for the pocket-conditioned generation task. Due to its explicit reward maximization, BindGPT-RL significantly outperforms all previous baselines. Note that RL only minimizes the Vina score, i.e., SA and QED are not included in the reward.

Figure 6: Examples of molecules generated by BindGPT and baselines.

BibTeX

@article{zholus2024bindgpt,
      title={BindGPT: A Scalable Framework for 3D Molecular Design via Language Modeling and Reinforcement Learning}, 
      author={Artem Zholus and Maksim Kuznetsov and Roman Schutski and Rim Shayakhmetov and Daniil Polykovskiy and Sarath Chandar and Alex Zhavoronkov},
      year={2024},
      eprint={2406.03686},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2406.03686}, 
}