[论文翻译]基于检索的可控分子生成


原文地址:https://openreview.net/pdf?id=vDFA1tpuLvk


RETRIEVAL-BASED CONTROLLABLE MOLECULE GENERATION

基于检索的可控分子生成

Zichao Wang†∗ Rice University jzwang@rice.edu

王梓超†∗ 莱斯大学 jzwang@rice.edu

Zhuoran Qiao Caltech zqiao@caltech.edu

Zhuoran Qiao 加州理工学院 zqiao@caltech.edu

Richard G. Baraniuk Rice University richb@rice.edu

Richard G. Baraniuk 莱斯大学 richb@rice.edu

Anima Anandkumar NVIDIA, Caltech aanandkumar@nvidia.edu

Anima Anandkumar NVIDIA, 加州理工学院 aanandkumar@nvidia.edu

ABSTRACT

摘要

Generating new molecules with specified chemical and biological properties via generative models has emerged as a promising direction for drug discovery. However, existing methods require extensive training/fine-tuning with a large dataset, often unavailable in real-world generation tasks. In this work, we propose a new retrieval-based framework for controllable molecule generation. We use a small set of exemplar molecules, i.e., those that (partially) satisfy the design criteria, to steer the pre-trained generative model towards synthesizing molecules that satisfy the given design criteria. We design a retrieval mechanism that retrieves and fuses the exemplar molecules with the input molecule, which is trained by a new self-supervised objective that predicts the nearest neighbor of the input molecule. We also propose an iterative refinement process to dynamically update the generated molecules and retrieval database for better generalization. Our approach is agnostic to the choice of generative models and requires no task-specific fine-tuning. On various tasks ranging from simple design criteria to a challenging real-world scenario for designing lead compounds that bind to the SARS-CoV-2 main protease, we demonstrate our approach extrapolates well beyond the retrieval database, and achieves better performance and wider applicability than previous methods.

通过生成式模型创造具有特定化学和生物特性的新分子,已成为药物发现领域的一个前景广阔的方向。然而,现有方法需要利用大规模数据集进行大量训练/微调,而这在实际生成任务中往往难以获取。本研究提出了一种基于检索的可控分子生成新框架。我们利用少量示例分子(即那些(部分)满足设计标准的分子)来引导预训练生成模型合成符合给定设计要求的分子。我们设计了一种检索机制,能够检索示例分子并将其与输入分子融合,该机制通过一种新的自监督目标进行训练,即预测输入分子的最近邻。此外,我们还提出了一种迭代优化流程,动态更新生成分子和检索数据库以实现更好的泛化能力。本方法不依赖于特定生成模型的选择,且无需针对任务进行微调。在从简单设计标准到设计结合SARS-CoV-2主要蛋白酶的先导化合物这一具有挑战性的现实场景等多种任务中,我们的方法展现出远超检索数据库的外推能力,并取得了优于以往方法的性能和更广泛的适用性。

1 INTRODUCTION

1 引言

Drug discovery is a complex, multi-objective problem (Vamathevan et al., 2019). For a drug to be safe and effective, the molecular entity must interact favorably with the desired target (Parenti and Rastelli, 2012), possess favorable physicochemical properties such as solubility (Meanwell, 2011), and be readily synthesizable (Jiménez-Luna et al., 2021). Compounding the challenge is the massive search space (up to $10^{60}$ molecules; Polishchuk et al., 2013). Previous efforts address this challenge via high-throughput virtual screening (HTVS) techniques (Walters et al., 1998) by searching against existing molecular databases. Combinatorial approaches have also been proposed to enumerate molecules beyond the space of established drug-like molecule datasets. For example, genetic-algorithm (GA) based methods (Sliwoski et al., 2013; Jensen, 2019; Yoshikawa et al., 2018) explore potential new drug candidates via heuristics such as hand-crafted rules and random mutations. Although widely adopted in practice, these methods tend to be inefficient and computationally expensive due to the vast chemical search space (Hoffman et al., 2021). The performance of these combinatorial approaches also heavily depends on the quality of generation rules, which often require task-specific engineering expertise and may limit the diversity of the generated molecules.

药物发现是一个复杂的多目标问题 (Vamathevan et al., 2019)。为确保药物安全有效,分子实体必须与靶标产生理想相互作用 (Parenti and Rastelli, 2012),具备溶解性等良好理化特性 (Meanwell, 2011),并易于合成 (Jiménez-Luna et al., 2021)。该领域面临的巨大挑战还在于其高达$10^{60}$分子的庞大搜索空间 (Polishchuk et al., 2013)。先前研究主要通过高通量虚拟筛选技术 (Walters et al., 1998) 在现有分子数据库中进行搜索。也有研究提出组合方法,用于在已知类药分子数据集之外枚举分子。例如基于遗传算法的方法 (Sliwoski et al., 2013; Jensen, 2019; Yoshikawa et al., 2018) 通过人工规则和随机突变等启发式方法探索潜在新药候选物。尽管这些方法在实践中被广泛采用,但由于化学搜索空间巨大 (Hoffman et al., 2021),其效率往往较低且计算成本高昂。这些组合方法的性能还高度依赖于生成规则的质量,这通常需要特定任务的工程专业知识,并可能限制生成分子的多样性。

To this end, recent research focuses on learning to controllably synthesize molecules with generative models (Tang et al., 2021; Chen, 2021; Walters and Murcko, 2020). It usually involves first training an unconditional generative model from millions of existing molecules (Winter et al., 2019a; Irwin et al., 2022) and then controlling the generative models to synthesize new desired molecules that satisfy one or more property constraints such as high drug-likeness (Bickerton et al., 2012) and high synthesizability (Ertl and Schuffenhauer, 2009). There are three main classes of learning-based molecule generation approaches: (i) reinforcement-learning (RL)-based methods (Olivecrona et al., 2017; Jin et al., 2020a), (ii) supervised-learning (SL)-based methods (Lim et al., 2018; Shin et al., 2021), and (iii) latent optimization-based methods (Winter et al., 2019b; Das et al., 2021). RL- and SL-based methods train or fine-tune a pre-trained generative model using the desired properties as reward functions (RL) or using molecules with the desired properties as training data (SL). Such methods require heavy task-specific fine-tuning, making them not easily applicable to a wide range of drug discovery tasks. Latent optimization-based methods, in contrast, learn to find latent codes that correspond to the desired molecules, based on property predictors trained on the generative model’s latent space. However, training such latent-space property predictors can be challenging, especially in real-world scenarios where we only have a limited number of active molecules for training (Jin et al., 2020b; Huynh et al., 2020). Moreover, such methods usually necessitate a fixed-dimensional latent-variable generative model with a compact and structured latent space (Xue et al., 2018; Xu et al., 2019), making them incompatible with other generative models with a varying-dimensional latent space, such as transformer-based architectures (Irwin et al., 2022).

为此,近期研究聚焦于通过生成模型学习可控合成分子 (Tang et al., 2021; Chen, 2021; Walters and Murcko, 2020)。通常流程包括:首先基于数百万现有分子训练无条件生成模型 (Winter et al., 2019a; Irwin et al., 2022),然后控制生成模型合成满足一个或多个属性约束的新分子,例如高类药性 (Bickerton et al., 2012) 和高可合成性 (Ertl and Schuffenhauer, 2009)。基于学习的分子生成方法主要分为三类:(i) 强化学习 (RL) 方法 (Olivecrona et al., 2017; Jin et al., 2020a),(ii) 监督学习 (SL) 方法 (Lim et al., 2018; Shin et al., 2021),以及 (iii) 潜在空间优化方法 (Winter et al., 2019b; Das et al., 2021)。RL 和 SL 方法分别以目标属性作为奖励函数 (RL) 或含目标属性的分子作为训练数据 (SL) 来训练/微调预训练生成模型。这类方法需要大量任务特定微调,难以广泛适用于药物发现任务。相比之下,潜在空间优化方法基于生成模型潜在空间训练的属性预测器,学习寻找对应目标分子的潜在编码。然而训练此类潜在空间属性预测器具有挑战性,特别是在现实场景中仅有少量活性分子可用于训练时 (Jin et al., 2020b; Huynh et al., 2020)。此外,这类方法通常需要固定维度的潜变量生成模型 (Xue et al., 2018; Xu et al., 2019),无法兼容 Transformer 架构等具有可变维度潜在空间的生成模型 (Irwin et al., 2022)。


Figure 1: An illustration of RetMol, a retrieval-based framework for controllable molecule generation. The framework incorporates a retrieval module (the molecule retriever and the information fusion) with a pre-trained generative model (the encoder and decoder). The illustration shows an example of optimizing the binding affinity (unit in $\mathrm{kcal/mol}$; the lower the better) for an existing potential drug, Favipiravir, for better treating the COVID-19 virus (SARS-CoV-2 main protease, PDB ID: 7L11) under various other design criteria.

图 1: RetMol框架示意图,这是一个基于检索的可控分子生成框架。该框架将检索模块(分子检索器和信息融合)与预训练生成模型(编码器和解码器)相结合。图示展示了在满足其他设计标准的同时,优化现有潜在药物法匹拉韦(Favipiravir)与COVID-19病毒(SARS-CoV-2主要蛋白酶,PDB ID: 7L11)结合亲和力(单位为$\mathrm{kcal/mol}$,数值越低越好)的示例。

Our approach. In this work, we aim to overcome the aforementioned challenges of existing works and design a controllable molecule generation method that (i) easily generalizes to various generation tasks; (ii) requires minimal training or fine-tuning; and (iii) operates favorably in data-sparse regimes where active molecules are limited. We summarize our contributions as follows:

我们的方法。在这项工作中,我们旨在克服现有工作的上述挑战,设计一种可控分子生成方法,该方法具有以下特点:(i) 能够轻松泛化到各种生成任务;(ii) 需要最少的训练或微调;(iii) 在活性分子有限的数据稀疏场景下表现优异。我们的贡献总结如下:

[1] We propose a first-of-its-kind retrieval-based framework, termed RetMol, for controllable molecule generation. It uses a small set of exemplar molecules, which may partially satisfy the desired properties, from a retrieval database to guide generation towards satisfying all the desired properties.
[2] We design a retrieval mechanism that retrieves and fuses the exemplar molecules with the input molecule, a new self-supervised training scheme with molecule similarity as a proxy objective, and an iterative refinement process to dynamically update the generated molecules and retrieval database.

[1] 我们提出了一种首创的基于检索的框架RetMol,用于可控分子生成。该框架从检索数据库中选取少量可能部分满足目标属性的示例分子,引导生成过程实现所有期望属性。
[2] 我们设计了以下机制:1) 检索并融合输入分子与示例分子的检索机制;2) 以分子相似度作为代理目标的新型自监督训练方法;3) 动态更新生成分子与检索数据库的迭代优化流程。

[3] We perform extensive evaluation of RetMol on a number of controllable molecule generation tasks ranging from simple molecule property control to challenging real-world drug design for treating the COVID-19 virus, and demonstrate RetMol’s superior performance compared to previous methods.

[3] 我们在从简单分子属性控制到针对COVID-19病毒治疗这一具有挑战性的真实世界药物设计等多种可控分子生成任务上,对RetMol进行了广泛评估,并证明了RetMol相较于先前方法的卓越性能。

Specifically, as shown in Figure 1, the RetMol framework plugs a lightweight retrieval mechanism into a pre-trained, encoder-decoder generative model. For each task, we first construct a retrieval database consisting of exemplar molecules that (partially) satisfy the design criteria. Given an input molecule to be optimized, a retriever module uses it to retrieve a small number of exemplar molecules from the database, which are then converted into numerical embeddings, along with the input molecule, by the encoder of the pre-trained generative model. Next, an information fusion module fuses the embeddings of exemplar molecules with the input embedding to guide the generation (via the decoder in the pre-trained generative model) towards satisfying the desired properties. The fusion module is the only part in RetMol that requires training. For training, we propose a new self-supervised objective (i.e., predicting the nearest neighbor of the input molecule) to update the fusion module, which enables RetMol to generalize to various generation tasks without task-specific training/fine-tuning. For inference, we propose a new inference process via iterative refinement that dynamically updates the generated molecules and retrieval database, which leads to improved generation quality and enables RetMol to extrapolate well beyond the retrieval database.

具体如图1所示,RetMol框架在预训练的编码器-解码器生成模型中嵌入了轻量级检索机制。针对每项任务,我们首先构建由(部分)满足设计标准的示例分子组成的检索数据库。给定待优化的输入分子时,检索模块会从中获取少量示例分子,这些分子与输入分子一起通过预训练生成模型的编码器转换为数值嵌入。随后,信息融合模块将示例分子的嵌入与输入嵌入融合,以引导生成过程(通过预训练生成模型的解码器)满足目标属性。融合模块是RetMol中唯一需要训练的部分。训练阶段,我们提出新的自监督目标(即预测输入分子的最近邻)来更新融合模块,使RetMol无需任务特定训练/微调即可泛化至各类生成任务。推理阶段,我们通过动态更新生成分子与检索数据库的迭代优化过程,既提升了生成质量,又使RetMol能够显著超越检索数据库的覆盖范围。

RetMol enjoys several advantages. First, it requires only a handful of exemplar molecules to achieve strong controllable generation performance without task-specific fine-tuning. This makes it particularly appealing in real-world use cases where training or fine-tuning a task-specific model is challenging. Second, RetMol is flexible and easily adaptable. It is compatible with a range of pretrained generative models, including both fixed-dimensional and varying-dimensional latent-variable models, and requires training only a lightweight, plug-and-play, task-agnostic retrieval module while freezing the base generative model. Once trained, it can be applied to many drug discovery tasks by simply replacing the retrieval database while keeping all other components unchanged.

RetMol具有多项优势。首先,它仅需少量示例分子即可实现强大的可控生成性能,无需针对特定任务进行微调。这使得它在实际应用中尤为吸引人,因为在这些场景下训练或微调特定任务模型往往具有挑战性。其次,RetMol灵活且易于适配。它与多种预训练生成模型兼容,包括固定维度和可变维度的潜变量模型,仅需训练一个轻量级、即插即用且与任务无关的检索模块,同时冻结基础生成模型。一旦训练完成,只需替换检索数据库而保持其他所有组件不变,即可应用于众多药物发现任务。

In a range of molecule controllable generation scenarios and benchmarks with diverse design criteria, our framework achieves state-of-the-art performance compared to latest methods. For example, on a challenging four-property controllable generation task, our framework improves the success rate over the best method by $4.6%$ ($96.9%$ vs. $92.3%$) with better synthesis novelty and diversity. Using a retrieval database of only 23 molecules, we also demonstrate our real-world applicability on a frontier drug discovery task of optimizing the binding affinity of eight existing, weakly-binding drugs for COVID-19 treatment under multiple design criteria. Compared to the best performing approach, our framework succeeds more often at generating new molecules with the given design criteria ($62.5%$ vs. $37.5%$ success rate) and generates on average more potent optimized drug molecules (2.84 vs. $1.67\mathrm{kcal/mol}$ average binding affinity improvement over the original drug molecules).

在一系列分子可控生成场景和具有多样化设计标准的基准测试中,我们的框架相比最新方法实现了最先进的性能。例如,在一项具有挑战性的四属性可控生成任务中,我们的框架将成功率较最佳方法提升了4.6%(96.9% vs. 92.3%),同时具有更好的合成新颖性和多样性。仅使用23个分子的检索数据库,我们还在COVID-19治疗药物发现的前沿任务中验证了框架的实际应用价值——在多重设计标准下优化八种现有弱结合药物的结合亲和力。与性能最佳的方法相比,我们的框架在按给定设计标准生成新分子方面成功率更高(62.5% vs. 37.5%),且平均生成具有更强效力的优化药物分子(相比原始药物分子平均结合亲和力提升2.84 vs. 1.67kcal/mol)。

2 METHODOLOGY: THE RETMOL FRAMEWORK

2 方法论: RETMOL框架

We now detail the various components in RetMol and how we perform training and inference.

我们现在详细介绍RetMol中的各个组件以及如何进行训练和推理。

Problem setup. We focus on the multi-property controllable generation setting in this paper. Concretely, let $x\in\mathcal{X}$ be a molecule where $\mathcal{X}$ denotes the set of all molecules, $a_{\ell}(x):\mathcal{X}\to\mathbb{R}$ a property predictor indexed by $\ell\in[1,\ldots,L]$, and $\delta_{\ell}\in\mathbb{R}$ a desired threshold. Then, we formulate multi-property controllable generation as one of three problems below, differing in the control design criteria: (i) a constraint satisfaction problem, where we identify a set of new molecules such that $\{x\in\mathcal{X}\mid a_{\ell}(x)\geq\delta_{\ell},\forall\ell\}$, (ii) an unconstrained optimization problem, where we find a set of new molecules such that $x^{\prime}=\mathrm{argmax}_ {x}s(x)$ where $s(x)=\sum_{\ell=1}^{L}w_{\ell}a_{\ell}(x)$ with the weighting coefficient $w_{\ell}$, and (iii) a constrained optimization problem that combines the objectives in (i) and (ii).

问题设定。本文主要研究多属性可控生成场景。具体而言,令$x\in\mathcal X$表示分子(其中$\mathcal{X}$为所有分子的集合),$a_{\ell}(x):\mathcal{X}\to\mathbb{R}$为第$\ell\in[1,\ldots,L]$个属性预测器,$\delta_{\ell}\in\mathbb{R}$为期望阈值。我们将多属性可控生成问题划分为以下三种控制设计准则:(i) 约束满足问题,即识别满足$\{x\in\mathcal{X}\mid a_{\ell}(x)\geq\delta_{\ell},\forall\ell\}$的新分子集合;(ii) 无约束优化问题,即寻找使$x^{\prime}=\mathrm{argmax}_ {x}s(x)$成立的分子集合,其中评分函数$s(x)=\sum_{\ell=1}^{L}w_{\ell}a_{\ell}(x)$含权重系数$w_{\ell}$;(iii) 融合(i)(ii)目标的约束优化问题。
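
下面给出一段极简的 Python 示意,对应上述三种设定中 (i) 的约束检查与 (ii) 的加权评分 $s(x)$;其中属性预测器接口、阈值与权重均为示意性假设,并非论文的官方实现。

```python
from typing import Callable, Dict

# 属性预测器的示意接口:输入 SMILES 字符串,输出属性分数(假设)
PropertyFn = Callable[[str], float]

def satisfies_constraints(x: str,
                          predictors: Dict[str, PropertyFn],
                          thresholds: Dict[str, float]) -> bool:
    """约束满足问题 (i):所有属性均需满足 a_l(x) >= delta_l。"""
    return all(predictors[name](x) >= delta for name, delta in thresholds.items())

def weighted_score(x: str,
                   predictors: Dict[str, PropertyFn],
                   weights: Dict[str, float]) -> float:
    """无约束优化问题 (ii) 中的加权评分 s(x) = sum_l w_l * a_l(x)。"""
    return sum(w * predictors[name](x) for name, w in weights.items())
```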

2.1 RETMOL COMPONENTS

2.1 RETMOL组件

Encoder-decoder generative model backbone. The pre-trained molecule generative model forms the backbone of RetMol that interfaces between the continuous embedding and raw molecule representations. Specifically, the encoder encodes the incoming molecules into numerical embeddings and the decoder generates new molecules from an embedding, respectively. RetMol is agnostic to the choice of the underlying encoder and decoder architectures, enabling it to work with a variety of generative models and molecule representations. In this work, we consider the SMILES string (Weininger, 1988) representation of molecules and the ChemFormer model (Irwin et al., 2022), which is a variant of BART (Lewis et al., 2020) trained on the billion-scale ZINC dataset (Irwin and Shoichet, 2004) and achieves state-of-the-art generation performance.

编码器-解码器生成模型主干。预训练的分子生成模型构成了RetMol的核心框架,在连续嵌入与原始分子表征之间建立接口。具体而言,编码器将输入分子编码为数值嵌入,解码器则从嵌入生成新分子。RetMol对底层编码器和解码器架构的选择具有通用性,可适配多种生成模型与分子表征方式。本研究采用SMILES字符串(Weininger, 1988)分子表示法,并选用基于ZINC十亿级数据集(Irwin and Shoichet, 2004)训练的BART (Lewis et al., 2020) 变体模型ChemFormer (Irwin et al., 2022),该模型在生成性能上达到当前最优水平。
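
作为补充,下面用 RDKit 演示 SMILES 表示的基本处理(合法性检查与规范化),这是使用 SMILES 表征时常见的预处理步骤;代码仅为示意,与 ChemFormer 的内部分词无关。

```python
from rdkit import Chem

def canonical_smiles(smiles: str):
    """解析 SMILES 并返回 RDKit 规范化形式;解析失败时返回 None(常用作有效性判据)。"""
    mol = Chem.MolFromSmiles(smiles)
    return None if mol is None else Chem.MolToSmiles(mol)

print(canonical_smiles("c1ccccc1O"))     # 苯酚的规范化 SMILES
print(canonical_smiles("not-a-smiles"))  # 非法输入 -> None
```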

Retrieval database. The retrieval database $\mathcal{X}_{R}$ contains molecules that can potentially serve as exemplar molecules to steer the generation towards the design criteria and is thus vital for controllable generation. The construction of the retrieval database is task-specific: it usually contains molecules that at least partially satisfy the design criteria in a given task. The domain knowledge of what molecules meet the design criteria and how to select partially satisfied molecules can play an important role in our approach. Thus, our approach is essentially a hybrid system that combines the advantages of both the heuristic-based methods and learning-based methods. Also, we find that a database of only a handful of molecules (e.g., as few as 23) can already provide a strong control signal. This makes our approach easily adapted to various tasks by quickly replacing the retrieval database. Furthermore, the retrieval database can be dynamically updated during inference, i.e., newly generated molecules can enrich the retrieval database for better generalization (see Section 2.3).

检索数据库。检索数据库 $\mathcal{X}_{R}$ 包含可能作为范例分子引导生成过程满足设计标准的候选分子,因此对可控生成至关重要。该数据库的构建具有任务特异性:通常包含至少部分满足特定任务设计要求的分子。关于何种分子符合设计标准以及如何筛选部分达标分子的领域知识,在本方法中起着关键作用。因此,本方法本质上是结合启发式方法与学习型方法优势的混合系统。实验表明,仅需少量分子(如少至23个)构成的数据库即可提供强效控制信号,这使得通过快速更换检索数据库就能轻松适配各类任务。此外,检索数据库可在推理过程中动态更新(如2.3节所述),新生成分子可不断扩充数据库以提升泛化能力。
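
下面给出构建任务特定检索数据库的一个简化示意:从候选分子中筛选出至少部分满足设计标准的分子。`predictors`、`thresholds` 与 `min_satisfied` 均为示意性接口/参数,实际构建方式依赖具体任务的领域知识。

```python
from typing import Callable, Dict, List

def build_retrieval_database(candidates: List[str],
                             predictors: Dict[str, Callable[[str], float]],
                             thresholds: Dict[str, float],
                             min_satisfied: int = 1) -> List[str]:
    """保留至少满足 min_satisfied 条属性约束的分子,作为示例分子库 X_R。"""
    database = []
    for smiles in candidates:
        n_ok = sum(predictors[name](smiles) >= delta
                   for name, delta in thresholds.items())
        if n_ok >= min_satisfied:
            database.append(smiles)
    return database
```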

Molecule retriever. While the entire retrieval database can be used during generation, for computational reasons (e.g., memory and efficiency) it is more feasible to select a small portion of the most relevant exemplar molecules to provide a more accurate guidance. We design a simple heuristic-based retriever that retrieves the exemplar molecules most suitable for the given control design criteria. Specifically, we first construct a “feasible” set containing molecules that satisfy all the given constraints, i.e., $\mathcal{X}^{\prime}=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\mid a_{\ell}(x)\geq\delta_{\ell}\}$. If this set is larger than $K$, i.e., the number of exemplar molecules that we wish to retrieve, then we select $K$ molecules with the best property scores, i.e., $\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime},s)$, where $s(x)$ is the task-specific weighted average property score, defined in Section 2. Otherwise, we construct a relaxed feasible set by removing the constraints one at a time until the relaxed feasible set is larger than $K$, at which point we select $K$ molecules with the best scores of the most recently removed property constraints. In either case, the retriever retrieves exemplar molecules with more desirable properties than the input and guides the generation towards the given design criteria. We summarize this procedure in Algorithm 1 in Appendix A. We find that our simple retriever with $K=10$ works well across a range of tasks. In general, more sophisticated retriever designs are possible and we leave them as the future work.

分子检索器。虽然生成过程中可以使用整个检索数据库,但出于计算原因(如内存和效率),选择一小部分最相关的示例分子来提供更精确的指导更为可行。我们设计了一个基于启发式的简单检索器,用于检索最适合给定控制设计标准的示例分子。具体而言,我们首先构建一个包含满足所有给定约束条件分子的"可行"集合,即$\mathcal{X}^{\prime}=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\mid a_{\ell}(x)\geq\delta_{\ell}\}$。如果该集合大于$K$(即我们希望检索的示例分子数量),则选择属性得分最高的$K$个分子,即$\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime},s)$,其中$s(x)$是任务特定的加权平均属性得分(定义见第2节)。否则,我们通过逐个移除约束条件来构建一个宽松的可行集合,直到该集合大于$K$,此时我们选择在最近移除的属性约束条件下得分最高的$K$个分子。无论哪种情况,检索器都会检索出比输入分子具有更理想属性的示例分子,并引导生成过程朝向给定的设计标准。我们在附录A的算法1中总结了这一流程。实验发现,当$K=10$时,我们的简单检索器在多项任务中表现良好。更复杂的检索器设计是可能的,我们将此留作未来工作。
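
下面是对上述检索流程("可行集 + 逐步放松约束 + top-$K$")的一个简化 Python 示意;约束放松的顺序在此按名称排序,实际顺序及其他细节以论文附录 A 的算法 1 为准,属示意性假设。

```python
from typing import Callable, Dict, List

def retrieve_exemplars(database: List[str],
                       predictors: Dict[str, Callable[[str], float]],
                       thresholds: Dict[str, float],
                       score_fn: Callable[[str], float],
                       K: int = 10) -> List[str]:
    """简化版检索器:先取满足全部约束的可行集,不足 K 个则逐个放松约束。"""
    active = dict(thresholds)      # 当前仍保留的约束
    last_removed = None            # 最近被放松的属性名
    while True:
        feasible = [x for x in database
                    if all(predictors[n](x) >= d for n, d in active.items())]
        if len(feasible) >= K or not active:
            break
        last_removed = sorted(active)[0]   # 示意:按名称顺序放松一个约束
        active.pop(last_removed)
    # 可行集足够大时按任务加权评分 s(x) 取 top-K;否则按最近放松的属性得分排序
    key = score_fn if last_removed is None else predictors[last_removed]
    return sorted(feasible, key=key, reverse=True)[:K]
```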

Information fusion. This module enables the retrieved exemplar molecules to modify the input molecule towards the targeted design criteria. It achieves this by merging the embeddings of the input and the retrieved exemplar molecules with a lightweight, trainable, standard cross attention mechanism similar to that in (Borgeaud et al., 2021). Concretely, the fused embedding $e$ is given by

信息融合。该模块通过检索到的示例分子来修改输入分子,使其符合目标设计标准。具体实现方式是采用类似 (Borgeaud et al., 2021) 中的轻量级可训练标准交叉注意力机制,将输入分子与检索到的示例分子的嵌入表示进行融合。融合后的嵌入表示 $e$ 由以下公式给出:

$$
e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)=\mathrm{Attn}(\mathrm{Query}(e_{\mathrm{in}}),\mathrm{Key}(E_{r}))\cdot\mathrm{Value}(E_{r})
$$

$$
e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)=\mathrm{Attn}(\mathrm{Query}(e_{\mathrm{in}}),\mathrm{Key}(E_{r}))\cdot\mathrm{Value}(E_{r})
$$

where $f_{\mathrm{CA}}$ represents the cross attention function with parameters $\theta$ , and $e_{\mathrm{in}}$ and $E_{r}$ are the input embedding and retrieved exemplar embeddings, respectively. The functions Attn, Query, Key, and Value compute the cross attention weights and the query, key, and value matrices, respectively. For our choice of the transformer-based generative model (Irwin et al., 2022), we have that $e_{\mathrm{in}}=$ $\mathrm{Enc}(\boldsymbol{x}_ {\mathrm{in}})\in\mathbb{R}^{L\times D}$ , $E_{r}=[e_{r}^{1},\dots,e_{r}^{K}]\in\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}$ , and $\pmb{e}_ {r}^{k}=\mathrm{Enc}({x}_ {r}^{k})\in\mathbb{R}^{L_{k}\times D}$ where $L$ and $L_{k}$ are the lengths of the tokenized input and the $k^{\mathrm{th}}$ retrieved exemplar molecules, respectively, and $D$ is the dimension of each token representation. Intuitively, the Attn function learns to weigh the retrieved exemplar molecules differently such that the more “important” retrieved molecules correspond to the higher weights. The fused embedding $\pmb{e}\in\mathbb{R}^{L\times D}$ thus contains the information of desired properties extracted from the retrieved exemplar molecules, which serves as the input of the decoder to control its generation. More details of this module are available in Appendix A.

其中 $f_{\mathrm{CA}}$ 表示带有参数 $\theta$ 的交叉注意力函数,$e_{\mathrm{in}}$ 和 $E_{r}$ 分别是输入嵌入和检索到的示例嵌入。函数 Attn、Query、Key 和 Value 分别计算交叉注意力权重以及查询、键和值矩阵。对于我们选择的基于 Transformer 的生成模型 (Irwin et al., 2022),有 $e_{\mathrm{in}}=\mathrm{Enc}(\boldsymbol{x}_ {\mathrm{in}})\in\mathbb{R}^{L\times D}$,$E_{r}=[e_{r}^{1},\dots,e_{r}^{K}]\in\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}$,且 $\pmb{e}_ {r}^{k}=\mathrm{Enc}({x}_ {r}^{k})\in\mathbb{R}^{L_{k}\times D}$,其中 $L$ 和 $L_{k}$ 分别是 Token 化输入和第 $k^{\mathrm{th}}$ 个检索到的示例分子的长度,$D$ 是每个 Token 表示的维度。直观上,Attn 函数学习对检索到的示例分子进行不同的加权,使得更“重要”的检索分子对应更高的权重。融合嵌入 $\pmb{e}\in\mathbb{R}^{L\times D}$ 因此包含从检索到的示例分子中提取的所需属性信息,作为解码器的输入以控制其生成。该模块的更多细节见附录 A。
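
下面给出该信息融合模块的一个单头交叉注意力 PyTorch 示意(省略多头、残差、层归一化等细节,具体实现以论文附录 A 为准):

```python
import math
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """单头交叉注意力:Query 来自输入分子嵌入,Key/Value 来自检索到的示例分子嵌入。"""
    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, e_in: torch.Tensor, e_r: torch.Tensor) -> torch.Tensor:
        # e_in: [L, D] 输入分子嵌入;e_r: [sum_k L_k, D] K 个示例分子嵌入沿长度维拼接
        q, k, v = self.query(e_in), self.key(e_r), self.value(e_r)
        attn = torch.softmax(q @ k.transpose(-1, -2) / math.sqrt(q.size(-1)), dim=-1)
        return attn @ v   # 融合嵌入 e: [L, D],作为解码器的输入

# 用法示意(随机张量;D=512,输入长度 48,示例分子拼接后长度 200)
fusion = CrossAttentionFusion(d_model=512)
e = fusion(torch.randn(48, 512), torch.randn(200, 512))
```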

2.2 TRAINING VIA PREDICTING THE INPUT MOLECULE’S NEAREST NEIGHBOR

2.2 通过预测输入分子的最近邻进行训练

The conventional training objective that reconstructs the input molecule (e.g., in ChemFormer (Irwin et al., 2022)) is not appropriate in our case, since perfectly reconstructing the input molecule does not rely on the retrieved exemplar molecules (more details in Appendix A). To enable RetMol to learn to use the exemplar molecules for controllable generation, we propose a new self-supervised training scheme, where the objective is to predict the nearest neighbor of the input molecule:

传统训练目标(如ChemFormer (Irwin et al., 2022)中重建输入分子的方法)在我们的场景中并不适用,因为完美重建输入分子并不依赖于检索到的范例模块(详见附录A)。为了让RetMol学会利用范例分子进行可控生成,我们提出了一种新的自监督训练方案,其目标是预测输入分子的最近邻:

$$
\mathcal{L}(\boldsymbol{\theta})=\sum_{i=1}^{B}\mathrm{CE}\Big(\mathrm{Dec}\big(f_{\mathrm{CA}}(e_{\mathrm{in}}^{(i)},\boldsymbol{E}_ {r}^{(i)};\boldsymbol{\theta})\big),x_{\mathrm{1NN}}^{(i)}\Big).
$$

$$
\mathcal{L}(\boldsymbol{\theta})=\sum_{i=1}^{B}\mathrm{CE}\Big(\mathrm{Dec}\big(f_{\mathrm{CA}}(e_{\mathrm{in}}^{(i)},\boldsymbol{E}_ {r}^{(i)};\boldsymbol{\theta})\big),x_{\mathrm{1NN}}^{(i)}\Big).
$$

where CE is the cross entropy loss function since we use the BART model as our encoder and decoder (see more details in Appendix A), $x_{1\mathrm{NN}}$ represents the nearest neighbor of the input $x_{\mathrm{in}}$ , $B$ is the batch size, and $i$ indexes the input molecules. The set of retrieved exemplar molecules consists of the remaining $K-1$ nearest neighbors of the input molecule. During training, we freeze the parameters of the pre-trained encoder and decoder. Instead, we only update the parameters in the information fusion module $f_{\mathrm{CA}}$ , which makes our training lightweight and efficient. Furthermore, we use the full training dataset as the retrieval database and retrieve exemplar molecules by best similarity with the input molecule. Thus, our training is not task-specific yet forces the fusion module to be involved in the controllable generation with the similarity as a proxy criterion. We will show that during the inference time, the model trained with the above training objective and proxy criterion using similarity is able to generalize to different generation tasks with different design criteria, and performs better compared to training with the conventional reconstruction objective. For the efficient training, we pre-compute all the molecules’ embeddings and their pairwise similarities with efficient approximate $k\mathbf{NN}$ algorithms (Guo et al., 2020; Johnson et al., 2019).

其中 CE 是交叉熵损失函数(因为我们使用 BART 模型作为编码器和解码器,详见附录 A),$x_{1\mathrm{NN}}$ 表示输入 $x_{\mathrm{in}}$ 的最近邻,$B$ 是批次大小,$i$ 为输入分子的索引。检索到的范例分子集合由输入分子剩余的 $K-1$ 个最近邻组成。训练时,我们冻结预训练编码器和解码器的参数,仅更新信息融合模块 $f_{\mathrm{CA}}$ 的参数,这使得训练过程轻量高效。此外,我们将完整训练数据集作为检索数据库,通过输入分子的最佳相似度检索范例分子。因此,我们的训练虽非任务特定,但强制融合模块以相似度作为代理标准参与可控生成。后续将证明:在推理阶段,使用上述训练目标和相似度代理标准训练的模型,能够泛化到具有不同设计标准的生成任务,且性能优于传统重建目标训练。为提升训练效率,我们采用高效近似 $k\mathbf{NN}$ 算法(Guo et al., 2020; Johnson et al., 2019)预先计算所有分子嵌入及其两两相似度。
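
下面的 Python 片段示意上述训练流程:仅优化融合模块参数,以最近邻分子为解码目标计算交叉熵。其中 `encoder`、`decoder_logits` 为示意性接口(并非 ChemFormer 的官方 API),优化器只注册融合模块的参数,从而保证编码器/解码器不被更新。

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder_logits, fusion, batch, optimizer):
    """单步训练示意。batch 中每个元素为 (x_in, retrieved, y_1nn):
    retrieved 是 x_in 的 K-1 个近邻分子,y_1nn 是最近邻分子的目标 token 序列。"""
    total_loss = 0.0
    for x_in, retrieved, y_1nn in batch:
        with torch.no_grad():                       # 编码器冻结,不需要其梯度
            e_in = encoder(x_in)                    # [L, D]
            e_r = torch.cat([encoder(x) for x in retrieved], dim=0)
        e = fusion(e_in, e_r)                       # 仅融合模块的参数参与更新
        logits = decoder_logits(e, y_1nn)           # [T, V];解码器参数不在 optimizer 中
        total_loss = total_loss + F.cross_entropy(logits, y_1nn)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()                                # optimizer 仅包含 fusion.parameters()
    return float(total_loss)
```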

Remarks. The above proposed training strategy that predicts the most similar molecule based on the input molecule and other $K-1$ similar molecules shares some similarity with masked language

备注:上述提出的训练策略(基于输入分子和其他 $K-1$ 个相似分子预测最相似的分子)与 NLP 中的掩码语言模型预训练 (Devlin et al., 2019) 存在一定相似性。

Table 1: Under the similarity constraints, RetMol achieves higher success rate in the constrained QED optimization task and better score improvements in the constrained penalized logP optimization task. Baseline results are reported from (Hoffman et al., 2021).

表 1: 在相似性约束下,RetMol 在受限 QED 优化任务中实现了更高的成功率,并在受限惩罚性 logP 优化任务中取得了更好的分数提升。基线结果来自 (Hoffman et al., 2021)。

(a) Success rate of generated molecules that satisfy $\mathrm{QED}\in[0.9,1.0]$ under similarity constraint $\delta=0.4$ .

(a) 在相似性约束 $\delta=0.4$ 下,生成分子满足 $\mathrm{QED}\in[0.9,1.0]$ 的成功率。

Method  Success (%)
MMPA (Dalke et al., 2018)  32.9
JT-VAE (Jin et al., 2018)  8.8
GCPN (You et al., 2018)  9.4
VSeq2Seq (Bahdanau et al., 2015)  58.5
VJTNN+GAN (Jin et al., 2019)  60.6
AtomG2G (Jin et al., 2019)  73.6
HierG2G (Jin et al., 2019)  76.9
DESMILES (Maragakis et al., 2020)  77.8
QMO (Hoffman et al., 2021)  92.8
RetMol  94.5
方法 成功率 (%)
MMPA (Dalke et al., 2018) 32.9
JT-VAE (Jin et al., 2018) 8.8
GCPN (You et al., 2018) 9.4
VSeq2Seq (Bahdanau et al., 2015) 58.5
VJTNN+GAN (Jin et al., 2019) 60.6
AtomG2G (Jin et al., 2019) 73.6
HierG2G (Jin et al., 2019) 76.9
DESMILES (Maragakis et al., 2020) 77.8
QMO (Hoffman et al., 2021) 92.8
RetMol 94.5

(b) The average penalized logP improvements of generated molecules over inputs under similarity constraint $\delta\in\{0.6,0.4\}$.

Method  Improvement (δ=0.6)  Improvement (δ=0.4)
JT-VAE (Jin et al., 2018)  0.28 ± 0.79  1.03 ± 1.39
GCPN (You et al., 2018)  0.79 ± 0.63  2.49 ± 1.30
MolDQN (Zhou et al., 2019)  1.86 ± 1.21  3.37 ± 1.62
VSeq2Seq (Bahdanau et al., 2015)  2.33 ± 1.17  3.37 ± 1.75
VJTNN (Jin et al., 2019)  2.33 ± 1.24  3.55 ± 1.67
HierG2G (Jin et al., 2019)  2.49 ± 1.09  3.98 ± 1.46
GA (Nigam et al., 2019)  3.44 ± 1.09  5.93 ± 1.41
QMO (Hoffman et al., 2021)  3.73 ± 2.85  7.71 ± 5.65
RetMol  3.78 ± 3.29  11.55 ± 11.27

(b) 在相似性约束 $\delta\in\{0.6,0.4\}$ 下生成分子相对于输入的平均惩罚性logP改进。

方法 改进 (δ=0.6) 改进 (δ=0.4)
JT-VAE (Jin et al., 2018) 0.28 ± 0.79 1.03 ± 1.39
GCPN (You et al., 2018) 0.79 ± 0.63 2.49 ± 1.30
MolDQN (Zhou et al., 2019) 1.86 ± 1.21 3.37 ± 1.62
VSeq2Seq (Bahdanau et al., 2015) 2.33 ± 1.17 3.37 ± 1.75
VJTNN (Jin et al., 2019) 2.33 ± 1.24 3.55 ± 1.67
HierG2G (Jin et al., 2019) 2.49 ± 1.09 3.98 ± 1.46
GA (Nigam et al., 2019) 3.44 ± 1.09 5.93 ± 1.41
QMO (Hoffman et al., 2021) 3.73 ± 2.85 7.71 ± 5.65
RetMol 3.78 ± 3.29 11.55 ± 11.27

model pre-training in NLP (Devlin et al., 2019). However, there are several key differences: 1) we perform “masking” on a sequence level, i.e., a whole molecule, instead of on a token/word level; 2) our “masking” strategy is to predict a particular signal only (i.e., the most similar retrieved molecule to the input) instead of randomly masked tokens; and 3) our training objective is used to only update the lightweight retrieval module instead of updating the whole backbone generative model.

但两者存在几个关键差异: 1) 我们在序列级别(即整个分子)进行"掩码", 而非在token/单词级别; 2) 我们的"掩码"策略是仅预测特定信号(即与输入最相似的检索分子), 而非随机掩码token; 3) 我们的训练目标仅用于更新轻量级检索模块, 而非更新整个主干生成模型。

2.3 INFERENCE VIA ITERATIVE REFINEMENT

2.3 基于迭代优化的推理

We propose an iterative refinement strategy to obtain an improved outcome when controlling generation with implicit guidance derived from the exemplar molecules. The strategy works by replacing the input $x_{\mathrm{in}}$ and updating the retrieval database with the newly generated molecules. Such an iterative update is common in many other controllable generation approaches such as GA-based methods (Sliwoski et al., 2013; Jensen, 2019) and latent optimization-based methods (Chenthamarakshan et al., 2020; Das et al., 2021). Furthermore, if the retrieval database $\mathcal{X}_{R}$ is fixed during inference, our method is constrained by the best property values of molecules in the retrieval database, which greatly limits the generalization ability of our approach and also limits our performance for certain generation scenarios such as unconstrained optimization. Thus, to extrapolate beyond the database, we propose to dynamically update the retrieval database over iterations.

我们提出了一种迭代优化策略,通过从示例分子中提取隐式指导来控制生成过程,从而获得更好的结果。该策略通过替换输入$x_{\mathrm{in}}$并用新生成的分子更新检索数据库来实现。这种迭代更新在多种可控生成方法中都很常见,例如基于遗传算法(GA)的方法 (Sliwoski et al., 2013; Jensen, 2019) 和基于潜在优化的方法 (Chenthamarakshan et al., 2020; Das et al., 2021)。此外,如果在推理过程中保持检索数据库$\mathcal{X}_{R}$不变,我们的方法会受到检索数据库中分子最佳属性值的限制,这会极大限制我们方法的泛化能力,同时也会制约我们在某些生成场景(如无约束优化)中的表现。因此,为了实现超越数据库的外推能力,我们提出了动态更新检索数据库的迭代方案。

We consider the following iterative refinement process. We first randomly perturb the fused embedding $M$ times by adding Gaussian noises and greedily decode one molecule from each perturbed embedding to obtain $M$ generated molecules. We score these molecules according to task-specific design criteria. Then, we replace the input molecule with the best molecule in this set if its score is better than that of the input molecule. At the same time, we add the remaining ones to the retrieval database if they have better scores than the lowest score in the retrieval database. If none of the generated molecules has a score better than the input molecule or the lowest in the retrieval database, then the input molecule and the retrieval database stay the same for the next iteration. We also add a stop condition if the maximum allowable number of iterations is achieved or a successful molecule that satisfies the desired criteria is generated. Algorithm 2 in Appendix A summarizes the procedure.

我们考虑以下迭代优化过程。首先通过添加高斯噪声对融合嵌入进行$M$次随机扰动,并从每个扰动后的嵌入中贪婪解码出一个分子,从而获得$M$个生成分子。根据任务特定的设计标准对这些分子进行评分。若该集合中最佳分子的评分优于输入分子,则用该分子替换输入分子。同时,将剩余分子中评分高于检索数据库最低分的分子加入检索数据库。若所有生成分子的评分均未超过输入分子或检索数据库最低分,则下一轮迭代保持输入分子与检索数据库不变。当达到最大允许迭代次数或生成满足目标标准的成功分子时,流程终止。附录A中的算法2总结了该流程。
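
下面用一段简化的 Python 示意该迭代优化流程(对应附录 A 算法 2 的大致骨架)。`generate_variants` 抽象了"检索-融合-加高斯噪声扰动-贪心解码 $M$ 个分子"这一步,评分与成功判据由具体任务给定;均为示意性接口。

```python
from typing import Callable, List

def iterative_refinement(x_in: str,
                         database: List[str],
                         generate_variants: Callable[[str, List[str]], List[str]],
                         score_fn: Callable[[str], float],
                         is_success: Callable[[str], bool],
                         max_iters: int = 80) -> str:
    """迭代优化:用更优的生成分子替换输入,并动态扩充检索数据库。"""
    database = list(database)
    for _ in range(max_iters):
        candidates = generate_variants(x_in, database)   # 一次迭代生成 M 个候选分子
        if not candidates:
            continue
        best = max(candidates, key=score_fn)
        if score_fn(best) > score_fn(x_in):
            x_in = best                                  # 替换输入分子
        floor = min(score_fn(x) for x in database)
        database += [x for x in candidates
                     if x is not best and score_fn(x) > floor]   # 动态扩充检索库
        if is_success(x_in):                             # 满足设计标准则提前停止
            break
    return x_in
```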

3 EXPERIMENTS

3 实验

We conduct four sets of experiments, covering all controllable generation formulations (see “Problem setup” in Section 2) with increasing difficulty. For fair comparisons, in each experiment we faithfully follow the same setup as the baselines, including using the same number of optimization iterations and evaluating on the same set of input molecules to be optimized. Below, we summarize the main results and defer the detailed experiment setup and additional results to Appendix B and C.

我们进行了四组实验,涵盖所有难度递增的可控生成任务(参见第2节"问题设定")。为确保公平比较,每组实验都严格遵循基线方法的相同设置,包括使用相同的优化迭代次数,并在同一组待优化输入分子上进行评估。下文总结主要实验结果,详细实验设置和补充结果见附录B和C。

3.1 IMPROVING QED AND PENALIZED LOGP UNDER SIMILARITY CONSTRAINT

3.1 在相似性约束下改进 QED 和惩罚性 logP

These experiments aim to generate new molecules that improve upon the input molecule’s intrinsic properties including QED (Bickerton et al., 2012) and penalized logP (defined as $\log\mathrm{P}(x)-\mathrm{SA}(x)$ (Jin et al., 2018), where SA represents synthesizability (Ertl and Schuffenhauer, 2009)) under similarity constraints. For both experiments, we use the top 1k molecules with the best property values from the ZINC250k dataset (Jin et al., 2018) as the retrieval database. In each iteration, we retrieve $K=20$ exemplar molecules with the best property score that also satisfy the similarity constraint. The remaining configurations exactly follow (Hoffman et al., 2021).

这些实验旨在生成改进输入分子内在特性的新分子,包括 QED (Bickerton et al., 2012) 和惩罚性 logP (定义为 $\log\mathrm{P}(x)-\mathrm{SA}(x)$ (Jin et al., 2018),其中 SA 表示合成可行性 (Ertl and Schuffenhauer, 2009)),并在相似性约束下进行。对于这两个实验,我们使用 ZINC250k 数据集 (Jin et al., 2018) 中具有最佳属性值的前 1k 个分子作为检索数据库。在每次迭代中,我们检索 $K=20$ 个既满足相似性约束又具有最佳属性分数的示例分子。其余配置完全遵循 (Hoffman et al., 2021)。

QED experiment setup and results. This is a constraint satisfaction problem with the goal of generating new molecules $x^{\prime}$ such that $a_{\mathrm{sim}}(x^{\prime},x)\geq\delta=0.4$ and $a_{\mathrm{QED}}(x^{\prime})\geq0.9$ where $x$ is the input molecule, $a_{\mathrm{sim}}$ is the Tanimoto similarity function (Bajusz et al., 2015), and $a_{\mathrm{QED}}$ is the QED predictor. We select 800 molecules with QED in the range [0.7, 0.8] as inputs to be optimized. We measure performance by success rate, i.e., the percentage of input molecules that result in a generated molecule that satisfy both constraints. Table 1a shows the success rate of QED optimization by comparing RetMol with various baselines. We can see that RetMol achieves the best success rate, e.g., $94.5%$ versus $92.8%$ compared to the best existing approach.

QED实验设置与结果。这是一个约束满足问题,其目标是生成新分子$x^{\prime}$,需满足$a_{\mathrm{sim}}(x^{\prime},x)\geq\delta=0.4$和$a_{\mathrm{QED}}(x^{\prime})\geq0.9$,其中$x$为输入分子,$a_{\mathrm{sim}}$为Tanimoto相似度函数(Bajusz et al., 2015),$a_{\mathrm{QED}}$为QED预测器。我们选取800个QED值在[0.7, 0.8]范围内的分子作为待优化输入,通过成功率(即满足双约束的生成分子所占输入分子百分比)衡量性能。表1a对比了RetMol与各基线的QED优化成功率,可见RetMol以94.5%的成功率超越现有最佳方法(92.8%)。

Penalized logP setup and results. This is a constrained optimization problem with the goal to generate new molecules $x^{\prime}$ to maximize the penalized logP value $a_{\mathrm{plogP}}(x^{\prime})$ with similarity constraint $a_{\mathrm{sim}}(x^{\prime},x)\geq\delta$ where $\delta\in\{0.4,0.6\}$. We select 800 molecules that have the lowest penalized logP values in the ZINC250k dataset as inputs to be optimized. We measure performance by the relative improvement in penalized logP between the generated and input molecules, averaged over all 800 input molecules. Table 1b shows the results comparing RetMol to various baselines. RetMol outperforms the best existing method for both similarity constraint thresholds and, for the $\delta=0.4$ case, improves upon the best existing method by almost $50%$ . The large variance in RetMol is due to large penalized logP values of a few generated molecules, and is thus not indicative of poor statistical significance. We provide a detailed analysis of this phenomenon in Appendix C.

惩罚性logP实验设置与结果。这是一个约束优化问题,目标是在相似性约束条件 $a_{\mathrm{sim}}(x^{\prime},x)\geq\delta$ (其中 $\delta\in\{0.4,0.6\}$)下生成新分子 $x^{\prime}$ 以最大化惩罚性logP值 $a_{\mathrm{plogP}}(x^{\prime})$ 。我们从ZINC250k数据集中选取800个惩罚性logP值最低的分子作为待优化输入,通过计算生成分子与输入分子间惩罚性logP相对提升的平均值来衡量性能。表1b显示RetMol与各基线的对比结果:在两种相似性约束阈值下,RetMol均超越现有最佳方法,其中当 $\delta=0.4$ 时性能提升近 $50%$ 。RetMol的高方差源于少数生成分子具有极高惩罚性logP值,因此并不代表统计显著性不足。附录C提供了该现象的详细分析。
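
下面给出本节所涉及属性的 RDKit 计算示意:QED、惩罚性 logP 与 Tanimoto 相似度约束。注意 SA 评分依赖 RDKit Contrib 中的 `sascorer.py`(需自行加入 Python 路径);不同工作对惩罚性 logP 的具体定义略有差异(例如是否加入大环惩罚或归一化),此处按正文 $\log\mathrm{P}(x)-\mathrm{SA}(x)$ 的简化形式示意。

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import QED, Descriptors, AllChem
import sascorer  # RDKit Contrib/SA_Score 提供的合成可行性评分脚本(假设已加入路径)

def qed_score(smiles: str) -> float:
    return QED.qed(Chem.MolFromSmiles(smiles))

def penalized_logp(smiles: str) -> float:
    """正文定义 logP(x) - SA(x) 的简化实现。"""
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolLogP(mol) - sascorer.calculateScore(mol)

def tanimoto_similarity(a: str, b: str) -> float:
    """基于 Morgan 指纹的 Tanimoto 相似度,用于检查 a_sim(x', x) >= delta。"""
    fp_a = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(a), 2, 2048)
    fp_b = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(b), 2, 2048)
    return DataStructs.TanimotoSimilarity(fp_a, fp_b)
```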

3.2 OPTIMIZING $\mathrm{GSK3}\beta$ AND JNK3 INHIBITION UNDER QED AND SA CONSTRAINTS

3.2 在QED和SA约束条件下优化 $\mathrm{GSK3}\beta$ 和JNK3抑制

This experiment aims to generate novel, strong inhibitors for jointly inhibiting both $\mathrm{GSK3}\beta$ (Glycogen synthase kinase 3 beta) and JNK3 (c-Jun N-terminal kinase-3) enzymes, which are relevant for potential treatment of Alzheimer’s disease (Jin et al., 2020a; Li et al., 2018). Following (Jin et al., 2020a), we formulate this task as a constraint satisfaction problem with four property constraints: two positivity constraints $a_{\mathrm{GSK3}\beta}(x^{\prime}) \ge 0.5$ and $a_{\mathrm{JNK3}}(x^{\prime})\geq 0.5$, and two molecule property constraints of QED and SA that $a_{\mathrm{QED}}(x^{\prime})\geq$

本实验旨在生成新型强效抑制剂,用于联合抑制与阿尔茨海默病潜在治疗相关的两种酶:$\mathrm{GSK3}\beta$(糖原合成酶激酶3β)和JNK3(c-Jun氨基末端激酶-3)(Jin et al., 2020a; Li et al., 2018)。参照(Jin et al., 2020a)的方法,我们将该任务构建为包含四个属性约束的满足问题:两个活性约束$a_{\mathrm{GSK3}\beta}(x^{\prime}) \ge 0.5$和$a_{\mathrm{JNK3}}(x^{\prime})\geq 0.5$,以及两个分子属性约束QED和SA要求$a_{\mathrm{QED}}(x^{\prime})\geq$

Table 2: Success rate, novelty and diversity of generated molecules in the task of optimizing four properties: QED, SA, and two binding affinities to $\mathrm{GSK}3\beta$ and JNK3 estimated by pre-trained models from (Jin et al., 2020a). Baseline results are reported from Jin et al. (2020a); Xie et al. (2021).

Method  Success (%)  Novelty  Diversity
JT-VAE (Jin et al., 2018)  1.3  –  –
GVAE-RL (Jin et al., 2020a)  2.1  –  –
GCPN (You et al., 2018)  4.0  –  –
REINVENT (Olivecrona et al., 2017)  47.9  0.561  0.621
RationaleRL (Jin et al., 2020a)  74.8  0.568  0.701
MARS (Xie et al., 2021)  92.3  0.824  0.719
MolEvol (Chen* et al., 2021)  93.0  0.757  0.681
RetMol  96.9  0.862  0.732

表 2: 在优化四种性质(QED、SA以及与$\mathrm{GSK}3\beta$和JNK3的两种结合亲和力)任务中生成分子的成功率、新颖性和多样性,这些性质由(Jin et al., 2020a)的预训练模型估计。基线结果来自Jin et al. (2020a)和Xie et al. (2021)。

方法 成功率% 新颖性 多样性
JT-VAE (Jin et al.,2018) 1.3
GVAE-RL(Jin et al.,2020a) 2.1
GCPN(You et al.,2018) 4.0
REINVENT(Olivecrona et al.,2017) 47.9 0.561 0.621
RationaleRL(Jin et al.,2020a) 74.8 0.568 0.701
MARS(Xie et al.,2021) 92.3 0.824 0.719
MolEvol(Chen* et al.,2021) 93.0 0.757 0.681
RetMol 96.9 0.862 0.732

0.6 and $a_{\mathrm{SA}}(x^{\prime})\leq4$ . Note that $a_{\mathrm{GSK3}\beta}$ and $a_{\mathrm{JNK3}}$ are property predictors (Jin et al., 2020a; Li et al., 2018) for $\mathrm{GSK3}\beta$ and JNK3, respectively, and higher values indicate better inhibition against the respective proteins. The retrieval database consists of all the molecules from the ChEMBL (Gaulton et al., 2016) dataset (approx. 700) that satisfy the above four constraints and we retrieve $K=10$ exemplar molecules most similar to the input each time. For evaluation, we compute three metrics including success rate, novelty, and diversity according to (Jin et al., 2020a).

0.6 且 $a_{\mathrm{SA}}(x^{\prime})\leq4$。请注意,$a_{\mathrm{GSK3}\beta}$ 和 $a_{\mathrm{JNK3}}$ 分别是针对 $\mathrm{GSK3}\beta$ 和 JNK3 的属性预测器 (Jin et al., 2020a; Li et al., 2018),数值越高表示对相应蛋白的抑制效果越好。检索数据库包含 ChEMBL (Gaulton et al., 2016) 数据集中满足上述四个约束的所有分子(约 700 个),每次检索与输入最相似的 $K=10$ 个示例分子。评估时,我们根据 (Jin et al., 2020a) 计算成功率、新颖性和多样性三个指标。
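
下面给出成功率之外两个指标(新颖性与多样性)的一种常见实现示意,依据我们对 (Jin et al., 2020a) 中定义的理解:新颖性以与参考分子集最近邻相似度低于 0.4 的比例衡量,多样性为 1 减去两两 Tanimoto 相似度均值;若与原文定义有出入,请以原文为准。

```python
from itertools import combinations
from typing import List
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def _fp(smiles: str):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, 2048)

def _sim(a: str, b: str) -> float:
    return DataStructs.TanimotoSimilarity(_fp(a), _fp(b))

def novelty(generated: List[str], references: List[str], thr: float = 0.4) -> float:
    """与参考分子集最近邻相似度低于阈值的生成分子比例(示意性定义)。"""
    return sum(max(_sim(g, r) for r in references) < thr for g in generated) / len(generated)

def diversity(generated: List[str]) -> float:
    """1 减去生成分子两两 Tanimoto 相似度的平均值。"""
    pairs = list(combinations(generated, 2))
    return 1.0 - sum(_sim(a, b) for a, b in pairs) / len(pairs)
```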

Table 2 shows that RetMol outperforms all previous methods on the four-property molecule optimization task. In particular, our framework achieves these results without the task-specific fine-tuning required for Rationale RL and REINVENT. Besides, RetMol is computationally much more efficient than MARS, which requires 550 iterations of model training whereas RetMol requires only 80 iterations (see more details in Appendix B.5). MARS also requires test-time, adaptive model training per sampling iteration which further increases the computational overhead during generation.

表 2 显示,RetMol 在四属性分子优化任务上超越了所有先前的方法。值得注意的是,我们的框架在无需 Rationale RL 和 REINVENT 所需的任务特定微调情况下就取得了这些成果。此外,RetMol 的计算效率远高于 MARS,后者需要 550 次迭代模型训练,而 RetMol 仅需 80 次迭代 (详见附录 B.5)。MARS 还要求每次采样迭代时进行测试阶段的自适应模型训练,这进一步增加了生成过程中的计算开销。

3.3 GUACAMOL BENCHMARK MULTIPLE PROPERTY OPTIMIZATION

3.3 GUACAMOL基准多属性优化

This experiment evaluates RetMol on the seven multiple property optimization (MPO) tasks, a subset of tasks in the Guacamol benchmark (Brown et al., 2019). They are the unconstrained optimization problems to maximize a weighted sum of multiple diverse molecule properties. We choose these MPO tasks because 1) they represent a more challenging subset in the benchmark, as existing methods can achieve almost perfect performance on most of the other tasks (Brown et al., 2019), and 2) they are the most relevant tasks to our work, e.g., multi-property optimization. The retrieval database consists of 1k molecules with best scores for each task and we retrieve $K=10$ exemplar molecules with the highest scores each time.

本实验在Guacamol基准测试(Brown et al., 2019)的七项多属性优化(MPO)任务子集上评估RetMol。这些任务是无约束优化问题,旨在最大化多种分子属性的加权总和。我们选择这些MPO任务是因为:1) 它们代表了基准测试中更具挑战性的子集,因为现有方法在大多数其他任务上已能实现近乎完美的性能(Brown et al., 2019);2) 它们与我们的工作(如多属性优化)最相关。检索数据库包含每个任务中得分最高的1k个分子,每次检索得分最高的 $K=10$ 个示例分子。


Figure 2: Comparison with the state-of-the-art methods in the multiple property optimization (MPO) tasks on the Guacamol benchmark. Left: QED (↑) versus the averaged benchmark performance (↑). Right: SA (↓) versus the average benchmark performance (↑). RetMol achieves the best balance between improving the benchmark performance while maintaining the synthesizability (SA) and drug-likeness (QED) of generated molecules.

图 2: 在Guacamol基准测试的多属性优化(MPO)任务中与最先进方法的对比。左图: 药物相似性(QED)(↑)与平均基准性能(↑)的关系。右图: 合成可行性(SA)(↓)与平均基准性能(↑)的关系。RetMol在提升基准性能的同时,对生成分子的合成可行性(SA)和药物相似性(QED)保持了最佳平衡。

Table 3: Quantitative results in the COVID-19 drug optimization task, where we aim to improve selected molecules’ binding affinity (estimated via docking (Shidi et al., 2022)) to the SARS-CoV-2 main protease under the QED, SA, and similarity constraints. Under stricter similarity condition, RetMol succeeds in more cases (5/8 versus 3/8). Under milder similarity condition, RetMol achieves higher improvements (2.84 versus 1.67 average binding affinity improvements). Unit of numbers in the table is $\mathrm{kcal/mol}$ and lower is better.

Input molecule  Input score  RetMol (δ=0.6)  Graph GA (Jensen, 2019) (δ=0.6)  RetMol (δ=0.4)  Graph GA (Jensen, 2019) (δ=0.4)
Favipiravir  -4.93  -6.48  -7.10  -8.70  -7.10
Bromhexine  -9.64  -11.48  -11.20  -12.65  -11.83
PX-12  -6.13  -8.45  -8.07  -10.90  -8.31
Ebselen  -7.31  -10.82  -10.41
Disulfiram  -8.58  -9.09  -10.44  -10.00
Entecavir  -9.00  -12.34
Quercetin  -9.25  -9.84  9.81
Kaempferol  -8.45  -8.54  -10.35  10.19
Avg. Improvement  0.78  0.71  2.84  1.67

表 3: COVID-19药物优化任务的定量结果,我们的目标是在QED、SA和相似性约束下改善选定分子与SARS-CoV-2主要蛋白酶的结合亲和力(通过对接估计 (Shidi et al., 2022))。在更严格的相似性条件下,RetMol在更多案例中取得成功(5/8对比3/8)。在较宽松的相似性条件下,RetMol实现了更高的改进(平均结合亲和力改进2.84对比1.67)。表中数字单位为$\mathrm{kcal/mol}$,数值越低越好。

输入分子 输入分数 RetMol (δ=0.6) Graph GA (Jensen,2019) (δ=0.6) RetMol (δ=0.4) Graph GA (Jensen,2019) (δ=0.4)
Favipiravir -4.93 -6.48 -7.10 -8.70 -7.10
Bromhexine -9.64 -11.48 -11.20 -12.65 -11.83
PX-12 -6.13 -8.45 -8.07 -10.90 -8.31
Ebselen -7.31 -10.82 -10.41
Disulfiram -8.58 -9.09 -10.44 -10.00
Entecavir -9.00 -12.34
Quercetin -9.25 -9.84 9.81
Kaempferol -8.45 -8.54 -10.35 10.19
平均改进 0.78 0.71 2.84 1.67

We demonstrate that RetMol achieves the best results along the Pareto frontier of the molecular design space. Figure 2 visualizes the benchmark performance averaged over the seven MPO tasks against QED and SA, two metrics that evaluate the drug-likeness and synthesizability of the optimized molecules and that are not part of the Guacamol benchmark’s optimization objective. Our framework achieves a nice balance between optimizing benchmark performance and maintaining good QED and SA scores. In contrast, Graph GA (Jensen, 2019) achieves the best benchmark performance but suffers from low QED and high SA scores. These results demonstrate the advantage of retrieval-based controllable generation: because the retrieval database usually contains drug-like molecules with desirable properties such as high QED and low SA, the generation is guided by these molecules in a good way to not deviate too much from these desirable properties. Moreover, because the retrieval database is updated with newly generated molecules with better benchmark performance, the generation can produce molecules with benchmark score beyond the best in the initial retrieval database. Additional results in Figure 2 in Appendix C.3 corroborate those presented above.

我们证明RetMol在分子设计空间的帕累托前沿取得了最佳结果。图2展示了在七个MPO任务上平均基准测试性能与QED、SA的对比情况。这两个指标用于评估优化分子的类药性和合成可行性,但并非Guacamol基准测试优化目标的一部分。我们的框架在优化基准性能与保持良好QED和SA分数之间实现了理想平衡。相比之下,Graph GA (Jensen, 2019) 虽然获得了最佳基准性能,但其QED分数较低且SA分数较高。这些结果证明了基于检索的可控生成优势:由于检索数据库通常包含具有高QED和低SA等理想特性的类药分子,生成过程会以良性方式受这些分子引导,从而不会过度偏离这些理想特性。此外,由于检索数据库会持续更新具有更优基准性能的新生成分子,因此生成的分子可以突破初始检索数据库中的最佳基准分数。附录C.3中图2的补充结果与上述结论相互印证。

3.4 OPTIMIZING EXISTING INHIBITORS FOR SARS-COV-2 MAIN PROTEASE

3.4 针对SARS-CoV-2主蛋白酶的现有抑制剂优化

To demonstrate our framework’s applicability at the frontiers of drug discovery, we apply it to the real-world task of improving the inhibition of existing weak inhibitors against the SARS-CoV-2 main protease ($\mathbf{M}^{\mathrm{pro}}$, PDB ID: 7L11), which is a promising target for treating COVID-19 by neutralizing the SARS-CoV-2 virus (Gao et al., 2022; Zhang et al., 2021). Because of the novel nature of this virus, there exist few high-potency inhibitors, making it challenging for learning-based methods that require a sizable training set not yet attainable in this case. However, the few existing inhibitors make excellent candidates for the retrieval database in our framework. We use a set of 23 known inhibitors (Jin et al., 2020b; Huynh et al., 2020; Hoffman et al., 2021) as the retrieval database and select the 8 weakest inhibitors to $\mathbf{M}^{\mathrm{pro}}$ as input. We design an optimization task to maximize the binding affinity (estimated via docking (Shidi et al., 2022); see Appendix B for the detailed procedure) between the generated molecule and $\mathbf{M}^{\mathrm{pro}}$ while satisfying the following three constraints: $a_{\mathrm{QED}}(x^{\prime})\geq0.6$ , $a_{\mathrm{SA}}(x^{\prime})\leq4$ , and $a_{\mathrm{sim}}(x^{\prime},x)\geq\delta$ where $\delta\in\{0.4,0.6\}$ .

为验证本框架在药物发现前沿领域的适用性,我们将其应用于提升现有SARS-CoV-2主蛋白酶(${\bf M}^{\mathrm{pro}}$,PDB ID: 7L11)弱抑制剂活性的实际任务。该蛋白酶通过中和SARS-CoV-2病毒成为治疗COVID-19的重要靶点(Gao et al., 2022; Zhang et al., 2021)。由于该病毒的新颖性,目前高效抑制剂稀缺,这使得需要大量训练数据的学习方法面临挑战。然而,现有少量抑制剂恰好可作为本框架检索数据库的理想候选。我们采用23种已知抑制剂(Jin et al., 2020b; Huynh et al., 2020; Hoffman et al., 2021)构建检索数据库,并选取其中对$\mathbf{M}^{\mathrm{pro}}$抑制活性最弱的8种作为输入。设计优化任务时,我们在满足以下三个约束条件下最大化生成分子与$\mathbf{M}^{\mathrm{pro}}$的结合亲和力(通过对接估算(Shidi et al., 2022),具体流程见附录B):$a_{\mathrm{QED}}(x^{\prime})\geq0.6$、$a_{\mathrm{SA}}(x^{\prime})\leq4$以及$a_{\mathrm{sim}}(x^{\prime},x)\geq\delta$,其中$\delta\in\{0.4,0.6\}$。
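
论文中的结合亲和力由 (Shidi et al., 2022) 的对接流程估算;作为替代性示意,下面展示如何通过 AutoDock Vina 命令行进行对接,并从输出 PDBQT 中读取最优结合能(单位 kcal/mol)。受体/配体文件路径与对接盒参数均为示意值,且需事先将分子转换为 PDBQT 格式。

```python
import subprocess

def vina_best_affinity(receptor_pdbqt: str, ligand_pdbqt: str, out_pdbqt: str,
                       center=(0.0, 0.0, 0.0), size=(20.0, 20.0, 20.0)) -> float:
    """调用 AutoDock Vina 对接,并解析输出中 'REMARK VINA RESULT' 的最优结合能。"""
    cmd = ["vina",
           "--receptor", receptor_pdbqt, "--ligand", ligand_pdbqt, "--out", out_pdbqt,
           "--center_x", str(center[0]), "--center_y", str(center[1]), "--center_z", str(center[2]),
           "--size_x", str(size[0]), "--size_y", str(size[1]), "--size_z", str(size[2])]
    subprocess.run(cmd, check=True)
    energies = []
    with open(out_pdbqt) as f:
        for line in f:
            if line.startswith("REMARK VINA RESULT"):
                energies.append(float(line.split()[3]))   # 第 4 列为结合能估计值
    return min(energies)   # 结合能越低(越负)越好
```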


Figure 3: 3D visualizations that compare RetMol with Graph GA in optimizing the original inhibitor, Bromhexine, that binds to the SARS-CoV-2 main protease in the $\delta=0.6$ case. We can see the optimized inhibitor in RetMol has more polar contacts (red dotted lines) and also more disparate binding modes with the original compound than the Graph GA optimized inhibitor, which aligns with the quantitative results.

图 3: 在 $\delta=0.6$ 情况下优化原始抑制剂溴己新(与SARS-CoV-2主要蛋白酶结合)时,RetMol与Graph GA的3D可视化对比。可见RetMol优化的抑制剂比Graph GA优化的抑制剂具有更多极性接触(红色虚线),且与原化合物的结合模式差异更大,这与定量结果一致。

Table 4: Left: Comparing different training schemes in the unconditional generation setting. Right: Generation performance with different retrieval database constructions based on the experiment in Section 3.2.

Training objective  Validity  Novelty  Uniqueness
RetMol (predict NN)  0.902  0.998  0.922
Conventional (recon. input)  0.834  0.998  0.665

表 4: 左: 无条件生成设置下不同训练方案的比较。右: 基于第3.2节实验的不同检索数据库构建的生成性能。

训练目标 有效性 新颖性 唯一性
RetMol (预测NN) 0.902 0.998 0.922
常规 (重构输入) 0.834 0.998 0.665
Retrieval database construction  Success (%)  Novelty  Diversity
GSK3β + JNK3 + QED + SA  96.9  0.862  0.732
GSK3β + JNK3  84.7  0.736  0.700
GSK3β or JNK3  44.1  0.571  0.708

Table 3 shows the optimization results of comparing RetMol with graph GA, a competitive baseline. Under the stricter ($\delta=0.6$) similarity constraint, RetMol successfully optimizes the most molecules (5 out of 8) while graph GA fails to optimize input molecules most of the time because it cannot satisfy the constraints or improve upon the input drugs’ binding affinity. Under the milder ($\delta=0.4$) similarity constraint, our framework achieves on average higher improvement (2.84 versus 1.67 binding affinity improvements) compared to the baseline. We have also tested QMO (Hoffman et al., 2021), and find it unable to generate molecules that satisfy all the given constraints.

表 3 展示了 RetMol 与竞争基线 graph GA 的优化结果对比。在更严格的相似性约束 ( $\delta=0.6$ ) 下,RetMol 在最多的分子上优化成功 (8 个中有 5 个),而 graph GA 大多无法优化输入分子,因为它无法满足约束条件或提升输入药物的结合亲和力。在较宽松的相似性约束 ( $\delta=0.4$ ) 下,我们的框架相比基线实现了更高的平均改进幅度 (结合亲和力提升 2.84 对 1.67)。我们还测试了 QMO (Hoffman et al., 2021),发现其无法生成满足所有给定约束的分子。

We produce 3D visualizations in Figure 3 (along with a video demo in https://shorturl.at/fmtyS) to show the comparison of the RetMol-optimized inhibitors that bind to the SARS-CoV-2 main protease with the original and GA-optimized inhibitors. We can see that 1) there are more polar contacts (shown as red dotted lines around the molecule) in the RetMol-optimized compound, and 2) the binding mode of the GA-optimized molecule is much more similar to the original compound than the RetMol-optimized binding mode, implying RetMol optimizes the compound beyond local edits to the scaffold. These qualitative and quantitative results together demonstrate that RetMol has the potential to effectively optimize inhibitors in a real-world scenario. Besides, we provide extensive 2D graph visualizations of optimized molecules in Tables A3 and A4 in the Appendix. Finally, we also perform another set of experiments on antibacterial drug design for the MurD protein (Sangshetti et al., 2017) and observe similar results; see Appendix C.5 for more details.

我们在图3中制作了3D可视化效果(附带视频演示见https://shorturl.at/fmtyS),展示RetMol优化的SARS-CoV-2主蛋白酶抑制剂与原始及GA优化抑制剂的对比。可以看出:1) RetMol优化化合物中存在更多极性接触(分子周围红色虚线所示);2) GA优化分子的结合模式与原化合物相似度远高于RetMol优化模式,表明RetMol突破了骨架局部修饰的优化范畴。这些定性与定量结果共同证明RetMol具备实际场景中高效优化抑制剂的潜力。此外,我们在附录表A3和表A4中提供了优化分子的详尽2D图可视化。最后,我们在MurD蛋白抗菌药物设计(Sangshetti等人,2017)上进行了另一组实验并观察到相似结果,详见附录C.5。

3.5 ANALYSES

3.5 分析

We analyze how the design choices in our framework impact generation performance. We summarize the main results and defer the detailed settings and additional results to Appendix B and C. Unless otherwise stated, we perform all analyses on the experiment setting in Section 3.2.

我们分析了框架中的设计选择对生成性能的影响。主要结果总结如下,详细设置和额外实验结果见附录B和C。除非另有说明,所有分析均基于第3.2节的实验设置进行。

Training objectives. We evaluate how the proposed self-training objective in Sec. 2.2 affects the quality of generated molecules in the unconditional generation setting. Table 4 (left) shows that, training with our proposed nearest neighbor objective (i.e., predict NN) indeed achieves better generation performance than the conventional input reconstruction objective (i.e., recon. input).

训练目标。我们在无条件生成设置下评估第2.2节提出的自训练目标对生成分子质量的影响。表4(左)显示,采用我们提出的最近邻目标(即预测NN)训练确实比传统输入重建目标(即重建输入)取得了更好的生成性能。

Types of retrieval database. We evaluate how the retrieval database construction impacts controllable generation performance, by comparing four different constructions: with molecules that satisfy all four constraints, i.e., $\mathrm{GSK}3\beta+\mathrm{JNK}3+\mathrm{QED}+\mathrm{SA}$ ; or satisfy only $\mathrm{GSK}3\beta+\mathrm{JNK}3$ ; or satisfy only $\mathrm{GSK}3\beta$ or JNK3 but not both. Table 4 (right) shows that a retrieval database that better satisfies the design criteria and better aligns with the controllable generation task generally leads to better performance. Nevertheless, we note that RetMol performs reasonably well even with exemplar molecules that only partially satisfy the properties (e.g., $\mathrm{GSK}3\beta+\mathrm{JNK}3)$ , achieving comparable performance to Rationale RL (Jin et al., 2020a).

检索数据库类型。我们通过比较四种不同构建方式评估检索数据库构建对可控生成性能的影响:包含满足全部四个约束条件的分子,即 $\mathrm{GSK}3\beta+\mathrm{JNK}3+\mathrm{QED}+\mathrm{SA}$;或仅满足 $\mathrm{GSK}3\beta+\mathrm{JNK}3$;或仅满足 $\mathrm{GSK}3\beta$ 或 JNK3 但不同时满足两者。表 4(右)显示,更符合设计标准且与可控生成任务更匹配的检索数据库通常能带来更好的性能。不过我们注意到,即使范例分子仅部分满足属性(如 $\mathrm{GSK}3\beta+\mathrm{JNK}3)$,RetMol 仍表现良好,其性能与 Rationale RL (Jin et al., 2020a) 相当。

Size of retrieval database. Figure 4 (left) shows a larger retrieval database generally improves all metrics and reduces the variance. It is particularly interesting that our framework achieves strong performance with a small retrieval database, which already outperforms the best baseline on the success rate metric with only a 100-molecule retrieval database.

检索数据库规模。图4(左)显示更大的检索数据库通常能提升所有指标并降低方差。特别值得注意的是,我们的框架在小规模检索数据库下就表现出强劲性能——仅使用100个分子的检索数据库时,成功率指标就已超越最佳基线水平。


Figure 4: Generation performance with varying retrieval database size (left), varying number of iterations (middle), and with or without dynamically updating the retrieval database (right). The left two plots are based on the experiment in Section 3.2 while the right plot is based on the penalized logP experiment in Section 3.1.

图 4: 不同检索数据库规模下的生成性能(左)、不同迭代次数下的生成性能(中)以及动态更新检索数据库与否的生成性能对比(右)。左侧两图基于第3.2节的实验,右侧图表基于第3.1节的惩罚性logP实验。

Number of optimization iterations. Figure 4 (middle) shows that RetMol outperforms the best existing methods with a small number of iterations (outperforms Rationale RL at 30 iterations and MARS at 60 iterations); its performance continues to improve with increasing iterations.

优化迭代次数。图4 (中) 显示RetMol在较少迭代次数下即超越现有最佳方法 (30次迭代超越Rationale RL,60次迭代超越MARS) ,且性能随迭代次数增加持续提升。

Dynamic retrieval database update. Figure 4 (right) shows that, for the penalized logP optimization experiment (Section 3.1), with dynamical update our framework generates more molecules with property values beyond the best in the data compared to the case without dynamic update. This comparison shows that dynamic update is crucial to generalizing beyond the retrieval database.

动态检索数据库更新。图4 (右) 显示,在惩罚性logP优化实验 (第3.1节) 中,采用动态更新的框架相比静态检索,能生成更多属性值超越数据集最佳记录的分子。该对比表明动态更新对突破检索数据库的泛化能力至关重要。

4 RELATED WORK

4 相关工作

RetMol is most related to controllable molecule generation methods, which we have briefly reviewed and compared with in Section 1. Another line of work in this direction leverages constraints based on explicit, pre-defined molecule scaffold structures to guide the generative process (Maziarz et al., 2022; Langevin et al., 2020). Our approach is fundamentally different in that the retrieved exemplar molecules implicitly guide the generative process through the information fusion module.

RetMol 与可控分子生成方法最为相关,我们已在第1节简要回顾并对比了这些方法。该方向的另一类工作利用基于明确预定义分子骨架结构的约束来引导生成过程 (Maziarz et al., 2022; Langevin et al., 2020)。我们的方法根本区别在于:检索到的示例分子通过信息融合模块隐式地引导生成过程。

RetMol is also inspired by a recent line of work that integrates a retrieval module in various NLP and vision tasks, such as language modeling (Borgeaud et al., 2021; Liu et al., 2022; Wu et al., 2022), code generation (Hayati et al., 2018), question answering (Guu et al., 2020; Zhang et al., 2022), and image generation (Tseng et al., 2020; Casanova et al., 2021; Blattmann et al., 2022; Chen et al., 2022). Among them, the retrieval mechanism is mainly used for improving the generation quality either with a smaller model (Borgeaud et al., 2021; Blattmann et al., 2022) or with very few data points (Casanova et al., 2021; Chen et al., 2022). None of them has explicitly explored the controllable generation with retrieval. The retrieval module also appears in bio-related applications such as multiple sequence alignment (MSA), which can be seen as a way to search and retrieve relevant protein sequences, and has been an essential building block in MSA transformer (Rao et al., 2021) and AlphaFold (Jumper et al., 2021). However, the MSA methods focus on the pairwise interactions among a set of evolutionarily related (protein) sequences while RetMol considers the cross attention between the input and a set of retrieved examples.

RetMol的灵感还来自近期一系列将检索模块整合到各类NLP与视觉任务中的研究,例如语言建模 (Borgeaud et al., 2021; Liu et al., 2022; Wu et al., 2022)、代码生成 (Hayati et al., 2018)、问答系统 (Guu et al., 2020; Zhang et al., 2022) 以及图像生成 (Tseng et al., 2020; Casanova et al., 2021; Blattmann et al., 2022; Chen et al., 2022)。这些工作中,检索机制主要用于通过更小的模型 (Borgeaud et al., 2021; Blattmann et al., 2022) 或极少量数据样本 (Casanova et al., 2021; Chen et al., 2022) 提升生成质量,但均未明确探索基于检索的可控生成。检索模块也出现在生物相关应用中,例如多序列比对(MSA)——可视为搜索并获取相关蛋白质序列的方法,并已成为MSA transformer (Rao et al., 2021) 和AlphaFold (Jumper et al., 2021) 的核心组件。不过MSA方法聚焦于一组进化相关(蛋白质)序列间的两两相互作用,而RetMol关注的是输入与检索样本集之间的交叉注意力机制。

There also exists prior work that studies retrieval-based approaches for controllable text generation (Kim et al., 2020; Xu et al., 2020). These methods require task-specific training/fine-tuning of the generative model for each controllable generation task whereas our framework can be applied to many different tasks without it. While we focus on molecules, our framework is general and has the potential to achieve controllable generation for multiple other modalities beyond molecules.

已有先前工作研究了基于检索的可控文本生成方法 (Kim et al., 2020; Xu et al., 2020)。这些方法需要针对每个可控生成任务对生成模型进行特定任务的训练/微调,而我们的框架无需此过程即可应用于多种不同任务。虽然我们专注于分子领域,但该框架具有通用性,有望在分子以外的多种模态上实现可控生成。

5 CONCLUSIONS

5 结论

We proposed RetMol, a new retrieval-based framework for controllable molecule generation. By incorporating a retrieval module with a pre-trained generative model, RetMol leverages exemplar molecules retrieved from a task-specific retrieval database to steer the generative model towards generating new molecules that satisfy the desired design criteria. RetMol is versatile, requires no task-specific fine-tuning, and is agnostic to the choice of generative model (see Appendix C.7). We demonstrated the effectiveness of RetMol on a variety of benchmark tasks and a real-world inhibitor design task for the SARS-CoV-2 virus, achieving state-of-the-art performance in each case compared to existing methods. Since RetMol still requires exemplar molecules that at least partially satisfy the design criteria, it becomes challenging to apply when no such molecules are available. A valuable direction for future work is to improve the retrieval mechanism so that even weaker molecules, i.e., those that do not satisfy but are close to satisfying the design criteria, can be leveraged to guide the generation process.

我们提出了RetMol,这是一种基于检索的新型可控分子生成框架。通过将检索模块与预训练生成模型相结合,RetMol利用从特定任务检索数据库中获取的范例分子,引导生成模型产生符合设计需求的新分子。该框架具有通用性,无需针对特定任务进行微调,且与生成模型无关(详见附录C.7)。我们在多个基准测试和针对SARS-CoV-2病毒的实际抑制剂设计任务中验证了RetMol的有效性,相较于现有方法均实现了最先进的性能表现。由于RetMol仍需依赖至少部分满足设计要求的范例分子,当这类分子完全缺失时仍存在挑战。未来的改进方向是增强检索机制,使得即使是不完全满足但接近设计要求的分子也能用于指导生成过程。

ACKNOWLEDGEMENTS

致谢

ZW and RGB are supported by NSF grants CCF-1911094, IIS-1838177, and IIS-1730574; ONR grants N00014-18-12571, N00014-20-1-2534, and MURI N00014-20-1-2787; AFOSR grant FA9550-22-1-0060; and a Vannevar Bush Faculty Fellowship (ONR grant N00014-18-1-2047).

ZW和RGB的研究工作获得了以下资助:美国国家科学基金会(NSF)项目CCF-1911094、IIS-1838177和IIS-1730574;美国海军研究办公室(ONR)项目N00014-18-12571、N00014-20-1-2534以及MURI项目N00014-20-1-2787;美国空军科研办公室(AFOSR)项目FA9550-22-1-0060;以及范内瓦·布什教师奖学金(ONR项目N00014-18-1-2047)。

ETHICS STATEMENT

伦理声明

Applications that involve molecule generation, such as drug discovery, are high-stakes in nature. These applications are highly regulated to prevent potential misuse (Hill and Richards, 2022). RetMol, as a technology for improving controllable molecule generation, could be subject to malicious use. For example, one could change the retrieval database and the design criteria into harmful ones, such as increased drug toxicity. However, we note that RetMol is a computational tool intended for in silico experiments. As a result, although RetMol can suggest new molecules according to arbitrary design criteria, the properties of the generated molecules are estimates of the real chemical and biological properties and need to be further validated in lab experiments. Thus, while RetMol's real-world impact is limited to in silico experiments, it is also prevented from directly generating real drugs that can be readily used. In addition, controllable molecule generation is an active area of research; we hope that our work contributes to this ongoing line of research and helps make ML methods safe and reliable for real-world molecule generation applications.

涉及分子生成的应用(如药物发现)本质上具有高风险性。这些应用受到严格监管以防止潜在滥用 (Hill and Richards, 2022)。RetMol作为提升可控分子生成的技术,可能被恶意利用。例如,有人可能将检索数据库和设计标准更改为有害指标(如增加药物毒性)。但需注意,RetMol仅是用于计算机模拟实验的计算工具。虽然它能根据任意设计标准推荐新分子,其生成分子的属性仅为真实化学/生物特性的预测值,仍需通过实验室实验进一步验证。因此,RetMol的实际影响仅限于计算机实验,无法直接生成可立即使用的真实药物。此外,可控分子生成是当前活跃的研究领域;我们希望本工作能推动该领域发展,使机器学习方法在现实分子生成应用中更安全可靠。

REPRODUCIBILITY STATEMENT

可复现性声明

To ensure the reproducibility of the empirical results, we provide the implementation details of each task (i.e., experimental setups, hyper parameters, dataset specifications, etc.) in Appendix B. The source code is available at https://github.com/NVlabs/RetMol.

为确保实证结果的可复现性,我们在附录B中提供了每项任务的实现细节(包括实验设置、超参数、数据集规格等)。源代码详见 https://github.com/NVlabs/RetMol

REFERENCES

参考文献

Samuel C. Hoffman, Vijil Chenthamarakshan, Kahini Wadhawan, Pin-Yu Chen, and Payel Das. Optimizing molecules using efficient queries from property evaluations. Nature Machine Intelligence, 4(1):21–31, December 2021. doi: 10.1038/s42256-021-00422-y. URL https://doi.org/10.1038/s42256-021-00422-y.

Samuel C. Hoffman、Vijil Chenthamarakshan、Kahini Wadhawan、Pin-Yu Chen 和 Payel Das。基于属性评估的高效查询优化分子。Nature Machine Intelligence,4(1): 21–31,2021 年 12 月。doi: 10.1038/s42256-021-00422-y。URL https://doi.org/10.1038/s42256-021-00422-y

Paul Maragakis, Hunter Nisonoff, Brian Cole, and David E. Shaw. A deep-learning view of chemical space designed to facilitate drug discovery. Journal of Chemical Information and Modeling, 60(10):4487–4496, July 2020. doi: 10.1021/acs.jcim.0c00321. URL https://doi.org/10.1021/acs.jcim.0c00321.

Paul Maragakis、Hunter Nisonoff、Brian Cole 和 David E. Shaw. 促进药物发现的化学空间深度学习视角. Journal of Chemical Information and Modeling, 60(10):4487–4496, 2020年7月. doi: 10.1021/acs.jcim.0c00321. URL https://doi.org/10.1021/acs.jcim.0c00321.

art. arXiv:2201.09680, January 2022.

arXiv:2201.09680,2022年1月。

Appendix

附录

A FRAMEWORK DETAILS

A 框架详情

More details of the information fusion module (Eq. (1)) Recall that Eq. (1) is

信息融合模块 (Eq. (1)) 的更多细节
回顾式 (1):

$$
e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)=\mathrm{Attn}(\mathrm{Query}(e_{\mathrm{in}}),\mathrm{Key}(E_{r}))\cdot\mathrm{Value}(E_{r})
$$

$$
e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)=\mathrm{Attn}(\mathrm{Query}(e_{\mathrm{in}}),\mathrm{Key}(E_{r}))\cdot\mathrm{Value}(E_{r})
$$

where $f_{\mathrm{CA}}$ represents the cross-attention function with parameters $\theta$, and $e_{\mathrm{in}}\in\mathbb{R}^{L\times D}$ and $E_{r}\in\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}$ are the input embedding and the retrieved exemplar embeddings, respectively.

其中 $f_{\mathrm{CA}}$ 表示带有参数 $\theta$ 的交叉注意力函数,$e_{\mathrm{in}}\in\mathbb{R}^{L\times D}$ 和 $E_{r}\in\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}$ 分别为输入嵌入和检索示例嵌入。

Concretely, Query, Key, and Value are affine functions that do not change the shape of the input. The function input and output dimensions are as follows:

具体来说,Query、Key和Value都是不改变输入形状的仿射函数。函数输入和输出维度如下:

$$
\begin{array}{rl}
\mathrm{Query}\colon & \mathbb{R}^{L\times D}\rightarrow\mathbb{R}^{L\times D},\\
\mathrm{Key}\colon & \mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}\rightarrow\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D},\\
\mathrm{Value}\colon & \mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}\rightarrow\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}
\end{array}
$$

$$
\begin{array}{rl}
\text{查询 (Query)}\colon & \mathbb{R}^{L\times D}\rightarrow\mathbb{R}^{L\times D},\\
\text{键 (Key)}\colon & \mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}\rightarrow\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D},\\
\text{值 (Value)}\colon & \mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}\rightarrow\mathbb{R}^{(\sum_{k=1}^{K}L_{k})\times D}
\end{array}
$$

In particular, the Key and Value functions are first applied to each $k$-th retrieved molecule embedding $e_{r}^{k}\in\mathbb{R}^{L_{k}\times D}$ (a block of the retrieved molecule embedding matrix $E_{r}$), and these $K$ output matrices are then stacked horizontally to obtain output matrices of shape $(\sum_{k=1}^{K}L_{k})\times D$.

具体来说,Key 和 Value 函数首先应用于每个第 $k$ 个检索到的分子嵌入 $e_{r}^{k}\in\mathbb{R}^{L_{k}\times D}$(即检索分子嵌入矩阵 $E_{r}$ 中的一个块),然后将这 $K$ 个输出矩阵水平堆叠,得到形状为 $(\sum_{k=1}^{K}L_{k})\times D$ 的输出矩阵。

The Attn function first computes the inner product between each slice of the query output matrix (of shape $D$) and the key output matrix (of shape $(\sum_{k=1}^{K}L_{k})\times D$), which yields $(\sum_{k=1}^{K}L_{k})$ un-normalized weights that are then normalized by a softmax function. These weights are applied to the value output matrix (of shape $(\sum_{k=1}^{K}L_{k})\times D$) to obtain one slice of the output fused matrix $e$. The above procedure is performed in parallel over all slices to obtain the full fused matrix $e$, which has the same dimensionality as the input embedding $e_{\mathrm{in}}$, i.e., $e\in\mathbb{R}^{L\times D}$.

Attn 函数首先计算查询输出矩阵的每个切片(形状为 $D$)与键输出矩阵(形状为 $(\sum_{k=1}^{K}L_{k})\times D$)之间的内积,得到 $(\sum_{k=1}^{K}L_{k})$ 个未归一化的权重,再经 softmax 函数归一化。这些权重被应用于值输出矩阵(形状为 $(\sum_{k=1}^{K}L_{k})\times D$),以获得输出融合矩阵 $e$ 中的一个切片。上述过程对所有切片并行执行,从而得到完整的融合矩阵 $e$,其维度与输入嵌入 $e_{\mathrm{in}}$ 保持一致,即 $e\in\mathbb{R}^{L\times D}$。
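To make the computation above concrete, below is a minimal, single-head PyTorch sketch of the cross-attention fusion in Eq. (1). The class and variable names are illustrative, and the scaled dot product is a common implementation choice rather than something specified by Eq. (1); the actual RetMol implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InformationFusion(nn.Module):
    """Minimal single-head sketch of the cross-attention fusion in Eq. (1)."""

    def __init__(self, dim: int):
        super().__init__()
        # Query, Key, and Value are affine maps that keep the embedding width D.
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, e_in: torch.Tensor, e_r: torch.Tensor) -> torch.Tensor:
        # e_in: (L, D) input embedding; e_r: (sum_k L_k, D) stacked exemplar embeddings.
        q = self.query(e_in)                                     # (L, D)
        k = self.key(e_r)                                        # (sum_k L_k, D)
        v = self.value(e_r)                                      # (sum_k L_k, D)
        # Inner products between input slices and exemplar slices, then softmax
        # over the exemplar tokens (the sqrt(D) scaling is an assumption here).
        attn = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # (L, sum_k L_k)
        return attn @ v                                          # fused embedding e: (L, D)

# Usage: K retrieved exemplars of lengths L_k are encoded and concatenated along
# the token dimension before being fused with the input embedding.
fusion = InformationFusion(dim=256)
e_in = torch.randn(40, 256)
E_r = torch.cat([torch.randn(l, 256) for l in (35, 42, 38)], dim=0)
e = fusion(e_in, E_r)  # shape (40, 256), same as e_in
```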

More details of the CE loss The cross entropy in Eq. (2) is the maximum likelihood objective commonly used in sequence modeling and sequence-to-sequence generation tasks such as machine translation. It takes two molecules as its inputs: one is the ground-truth molecule sequence, and the other one is the generated molecule sequence (i.e., the softmax output of decoder). Specifically, if we define $y:=x_{1\mathrm{NN}}$ and $\hat{y}:=\mathrm{Dec}(f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r});\theta)$ , then the “CE” in Eq. (2) is given as follows:

交叉熵损失 (CE loss) 的更多细节
式 (2) 中的交叉熵是序列建模和序列到序列生成任务 (如机器翻译) 常用的最大似然目标函数。它接收两个分子作为输入:一个是真实分子序列,另一个是生成的分子序列 (即解码器的 softmax 输出) 。具体来说,若定义 $y:=x_{1\mathrm{NN}}$ 和 $\hat{y}:=\mathrm{Dec}(f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r});\theta)$ ,则式 (2) 中的 "CE" 可表示为:

$$
\mathrm{CE}(\hat{y},y)=-\sum_{l=0}^{L-1}\sum_{v=0}^{V-1}y_{l,v}\log\hat{y}_{l,v}
$$

$$
\mathrm{CE}(\hat{y},y)=-\sum_{l=0}^{L-1}\sum_{v=0}^{V-1}y_{l,v}\log\hat{y}_{l,v}
$$

where $L$ is the molecule sequence length, $V$ is the vocabulary size, $y_{l,v}$ is the ground-truth (one-hot) probability of vocabulary entry $v$ on the $l$ -th token, and $\hat{y}_{l,v}$ is the predicted probability (i.e., softmax output of decoder) of the vocabulary entry $v$ on the $l$ -th token.

其中 $L$ 是分子序列长度,$V$ 是词汇表大小,$y_{l,v}$ 是第 $l$ 个 token 上词汇项 $v$ 的真实 (one-hot) 概率,$\hat{y}_{l,v}$ 是第 $l$ 个 token 上词汇项 $v$ 的预测概率 (即解码器的 softmax 输出)。
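As a sanity check of the formula above, the following minimal sketch computes the same token-level cross entropy in two equivalent ways; the shapes and random values are illustrative only.

```python
import torch
import torch.nn.functional as F

L, V = 12, 100                              # sequence length and vocabulary size
logits = torch.randn(L, V)                  # decoder outputs before softmax
targets = torch.randint(0, V, (L,))         # token indices of the 1-NN target molecule

# Direct implementation of CE(y_hat, y) with one-hot targets y and softmax outputs y_hat.
y_hat = F.softmax(logits, dim=-1)
y = F.one_hot(targets, V).float()
ce_manual = -(y * torch.log(y_hat)).sum()

# Equivalent library call operating on logits and class indices.
ce_builtin = F.cross_entropy(logits, targets, reduction="sum")
assert torch.allclose(ce_manual, ce_builtin, atol=1e-5)
```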

More details of the RetMol training objective (Sec. 2.2) We first provide more detailed descriptions of our proposed objective: Given an input molecule, we use its nearest neighbor molecule from the retrieval database, based on the cosine similarity in the embedding space of the CDDD encoder, as the prediction target. The decoder takes in the fused embeddings from the information fusion module to generate new molecules. As shown in Eq. (2), we calculate the cross-entropy distance between the decoded output and the nearest neighbor molecule as the training objective.

RetMol训练目标详解(第2.2节)
我们首先对提出的目标进行更详细的描述:给定输入分子,基于CDDD编码器嵌入空间中的余弦相似度,从检索数据库中获取其最近邻分子作为预测目标。解码器接收来自信息融合模块的融合嵌入,生成新分子。如公式(2)所示,我们计算解码输出与最近邻分子之间的交叉熵距离作为训练目标。

The motivation is that: Other similar molecules (i.e., the remaining $K-1$ nearest neighbors) from the retrieval database, through the fusion module, can provide good guidance for transforming the input to its nearest neighbor (e.g., how to perturb the molecule and how large the perturbation would be). Accordingly, the fusion module can be effectively updated through this auxiliary task.

动机在于:检索数据库中其他类似分子(即剩余的 $K-1$ 个最近邻)通过融合模块,能为输入分子向其最近邻的转化提供有效指导(例如如何扰动分子以及扰动幅度)。因此,该辅助任务能有效更新融合模块。

In contrast, if we used the conventional encoder-decoder training objective (i.e., reconstructing the input molecule), the model would not need anything from the retrieval database to reconstruct the input well, since the input molecule itself already contains all the required information. As a result, the information fusion module would not be effectively updated during training.

相反,如果我们使用传统的编码器-解码器训练目标(即重构输入分子),该方法实际上无需从检索数据库中获取任何信息就能很好地完成输入重构(因为输入分子本身已包含所有必需信息)。因此,信息融合模块在训练过程中无法得到有效更新。
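The following sketch puts these pieces together into one self-supervised training step. Everything here is illustrative: `encoder`, `decoder`, `fusion`, `cddd_embed`, and `tokenize` are hypothetical stand-ins for the pre-trained encoder/decoder, the information fusion module, the CDDD embedding model, and the tokenizer; in this scheme only the fusion module's parameters would be updated.

```python
import torch
import torch.nn.functional as F

def training_step(x_in, retrieval_db, encoder, decoder, fusion, cddd_embed, tokenize, K=10):
    """One self-supervised step: predict the input's nearest neighbor (hypothetical API)."""
    # 1) Find the K nearest neighbors of x_in in the CDDD embedding space (cosine similarity).
    q = F.normalize(cddd_embed(x_in), dim=-1)
    db = F.normalize(torch.stack([cddd_embed(x) for x in retrieval_db]), dim=-1)
    nn_idx = (db @ q).topk(K).indices.tolist()
    x_1nn = retrieval_db[nn_idx[0]]                    # prediction target
    x_rest = [retrieval_db[i] for i in nn_idx[1:]]     # K-1 exemplars fed to the fusion module

    # 2) Fuse the input embedding with the exemplar embeddings and decode.
    e_in = encoder(x_in)                               # (L, D)
    E_r = torch.cat([encoder(x) for x in x_rest], dim=0)
    logits = decoder(fusion(e_in, E_r))                # (L, V)

    # 3) Cross entropy against the tokenized nearest neighbor, as in Eq. (2).
    return F.cross_entropy(logits, tokenize(x_1nn), reduction="sum")
```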

Molecule retriever Algorithm 1 describes how the retriever retrieves exemplar molecules from a retrieval database given the design criteria.

分子检索器
算法1描述了检索器如何根据设计标准从检索数据库中检索示例分子。

Inference Algorithm 2 describes the inference procedure.

推理算法 2 描述了推理过程。

Parameters RetMol and the base generative model (Irwin et al., 2022) have 10,471,179 and 10,010,635 parameters in total, respectively. The additional 460,544 parameters come exclusively from the information fusion module, which adds less than 5% parameter overhead and is thus very lightweight compared to the base generative model.

参数
RetMol 和基础生成模型 (Irwin et al., 2022) 的总参数量分别为 10471179 和 10010635。新增的 460544 个参数完全来自信息融合模块,这意味着它仅增加了不到 5% 的参数开销,与基础生成模型相比非常轻量。
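A quick arithmetic check of the reported overhead, using the parameter counts above:

```python
retmol_params = 10_471_179     # RetMol (base model + information fusion module)
base_params = 10_010_635       # base generative model (Irwin et al., 2022)

fusion_params = retmol_params - base_params   # 460,544 parameters
overhead = fusion_params / base_params        # ~4.6%, i.e., under 5%
print(f"fusion module: {fusion_params} parameters ({overhead:.1%} overhead)")
```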

Algorithm 1: Exemplar molecule retriever

算法 1: 示例分子检索器

Require: Property predictors $a_{\ell}$ and desired property thresholds $\delta_{\ell}$ for $\ell\in[1,\ldots,L]$ for property constraints; scoring function $s$ for properties to be optimized; retrieval database $\mathcal{X}_{R}$; number of retrieved exemplar molecules $K$
Input: Input molecule $x_{\mathrm{in}}$
Output: Retrieved exemplar molecules $\mathcal{X}_{r}$
1 $\ell^{\prime}=L$;
2 $\mathcal{X}^{\prime}=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\,|\,a_{\ell}(x)\geq\delta_{\ell}\}$;
3 while $|\mathcal{X}^{\prime}|\le K$ do
4 $L:=L-1$;
5 $\ell^{\prime}:=L$;
6 $\mathcal{X}^{\prime}:=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\,|\,a_{\ell}(x)\geq\delta_{\ell}\}$;
7 end
8 $\mathcal{X}^{\prime\prime}=\{x\in\mathcal{X}^{\prime}\,|\,a_{\ell^{\prime}}(x)\geq a_{\ell^{\prime}}(x_{\mathrm{in}})\}$;
9 if $|\mathcal{X}^{\prime\prime}|\ge K$ then
10 return $\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime\prime},s)$;
11 else
12 return $\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime},s)$;
13 end

要求:属性预测器 $a_{\ell}$ 和期望属性阈值 $\delta_{\ell}$(其中 $\ell\in[1,\ldots,L]$)用于属性约束;待优化属性的评分函数 $s$;检索数据库 $\mathcal{X}_{R}$;检索示例分子数量 $K$
输入:输入分子 $x_{\mathrm{in}}$
输出:检索到的示例分子 $\mathcal{X}_{r}$
1 $\ell^{\prime}=L$;
2 $\mathcal{X}^{\prime}=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\,|\,a_{\ell}(x)\geq\delta_{\ell}\}$;
3 while $|\mathcal{X}^{\prime}|\le K$ do
4 $L:=L-1$;
5 $\ell^{\prime}:=L$;
6 $\mathcal{X}^{\prime}:=\cap_{\ell=1}^{L}\{x\in\mathcal{X}_{R}\,|\,a_{\ell}(x)\geq\delta_{\ell}\}$;
7 end
8 $\mathcal{X}^{\prime\prime}=\{x\in\mathcal{X}^{\prime}\,|\,a_{\ell^{\prime}}(x)\geq a_{\ell^{\prime}}(x_{\mathrm{in}})\}$;
9 if $|\mathcal{X}^{\prime\prime}|\ge K$ then
10 return $\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime\prime},s)$;
11 else
12 return $\mathcal{X}_{r}=\mathrm{top}_{K}(\mathcal{X}^{\prime},s)$;
13 end
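Below is a plain Python sketch of Algorithm 1 under illustrative names; `predictors`, `thresholds`, and `score` stand in for the property predictors $a_{\ell}$, thresholds $\delta_{\ell}$, and scoring function $s$, and the actual implementation may differ.

```python
def retrieve_exemplars(x_in, retrieval_db, predictors, thresholds, score, K):
    """Sketch of Algorithm 1: retrieve K exemplar molecules from retrieval_db."""
    L = len(predictors)
    # Lines 1-7: relax the constraints (drop the last one) until more than K
    # molecules in the database satisfy all remaining constraints.
    while True:
        satisfied = [x for x in retrieval_db
                     if all(a(x) >= d for a, d in zip(predictors[:L], thresholds[:L]))]
        if len(satisfied) > K or L == 0:
            break
        L -= 1

    # Lines 8-13: prefer molecules that beat the input on the last active
    # constraint; otherwise fall back to all molecules satisfying the constraints.
    better = ([x for x in satisfied if predictors[L - 1](x) >= predictors[L - 1](x_in)]
              if L > 0 else [])
    pool = better if len(better) >= K else satisfied
    return sorted(pool, key=score, reverse=True)[:K]
```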

B DETAILED EXPERIMENT SETUP

B 详细实验设置

B.1 RETMOL TRAINING

B.1 RETMOL 训练

We only train the information fusion module in our RetMol framework. The training dataset is either ZINC250k (Irwin and Shoichet, 2004) (for the experiments in Section 3.1) or ChEMBL (Gaulton et al., 2016). The choice of training data aligns with the setup of the baseline methods in each experiment: the experiments in Sections 3.1 and 3.3 use the ZINC250k dataset, while the experiments in Sections 3.2 and 3.4 use the ChEMBL dataset. For the ZINC250k dataset, we follow the train/validation/test splits in (Jin et al., 2019) and train on the train split. For the ChEMBL dataset, we train on the entire dataset without splits. Training is distributed over four NVIDIA V100 GPUs, each with 16GB memory, with a batch size of 256 samples per GPU, for 50k iterations. The total training time is approximately 2 hours for either the ZINC250k or the ChEMBL dataset. Our training infrastructure largely follows the Megatron version of the molecule generative model in (Irwin et al., 2022), which uses DeepSpeed, Apex, and half precision for improved training efficiency. Note that, once RetMol is trained, we do not perform further task-specific fine-tuning.

我们仅在RetMol框架中训练信息融合模块。训练数据集采用ZINC250k (Irwin and Shoichet, 2004)(用于第3.1节实验)或ChEMBL (Gaulton et al., 2016)。训练数据的选择与各实验基线方法的设置保持一致:第3.1节和第3.3节实验使用ZINC250k数据集,第3.2节和第3.4节实验使用ChEMBL数据集。对于ZINC250k数据集,我们沿用(Jin et al., 2019)中的训练/验证/测试划分,并在训练集上进行训练。对于ChEMBL数据集,我们直接使用完整数据集进行训练。训练分布在4块16GB显存的NVIDIA V100 GPU上完成,每个GPU的批次大小为256个样本,共进行5万次迭代。ZINC250k和ChEMBL数据集的训练时间均约为2小时。我们的训练基础设施主要基于(Irwin et al., 2022)中分子生成模型的Megatron版本,采用DeepSpeed、Apex和半精度训练以提升效率。需要注意的是,RetMol训练完成后不再进行任务特定的微调。

Algorithm 2: Generation with adaptive input and retrieval database update

算法 2: 自适应输入生成与检索数据库更新

Require: Encoder Enc, decoder Dec, information fusion model $f_{\mathrm{CA}}$, exemplar molecule retriever Ret, property predictors $a_{\ell}(x)$ and desired property thresholds $\delta_{\ell}$ for $\ell\in[1,\ldots,L]$ for property constraints, retrieval database $\mathcal{X}_{R}$, scoring function $s(x)$ for properties to be optimized
Input: Input molecule $x_{\mathrm{in}}$, number of retrieved exemplar molecules $K$, number of optimization iterations $T$, number of molecules $M$ to sample at each iteration
Output: Optimized molecule $x^{\prime}$
1 for $t\in[1,\ldots,T]$ do
2 $e_{\mathrm{in}}=\mathrm{Enc}(x_{\mathrm{in}})$;
3 $\mathcal{X}_{r}=\mathrm{Ret}\left(\mathcal{X}_{R},\{a_{\ell},\delta_{\ell}\}_{\ell=1}^{L},s,x_{\mathrm{in}},K\right)$;
4 $E_{r}=\mathrm{Enc}(\mathcal{X}_{r})$;
5 $e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)$;
6 $\mathcal{X}_{\mathrm{gen}}=\emptyset$;
7 for $i=1,\ldots,M$ do
8 $\epsilon\sim\mathcal{N}(0,\mathbf{1})$;
9 $e_{i}=e+\epsilon$;
10 $x^{\prime}=\mathrm{Dec}(e_{i})$;
11 $\mathcal{X}_{\mathrm{gen}}:=\mathcal{X}_{\mathrm{gen}}\cup\{x^{\prime}\}$;
12 end
13 $\mathcal{X}_{\mathrm{gen}}:=\mathrm{Ret}\left(\mathcal{X}_{\mathrm{gen}},\{a_{\ell},\delta_{\ell}\}_{\ell=1}^{L},s,x_{\mathrm{in}},K\right)$;
14 $x_{\mathrm{in}}:=\mathrm{top}_{1}(\mathcal{X}_{\mathrm{gen}},s)$;
15 $\mathcal{X}_{R}:=\mathcal{X}_{R}\cup\mathcal{X}_{\mathrm{gen}}\setminus x_{\mathrm{in}}$;
16 end

要求:编码器 Enc、解码器 Dec、信息融合模型 $f_{\mathrm{CA}}$、示例分子检索器 Ret、属性预测器 $a_{\ell}(x)$ 和属性约束的期望阈值 $\delta_{\ell}$(其中 $\ell\in[1,\ldots,L]$)、检索数据库 $\mathcal{X}_ {R}$、待优化属性的评分函数 $s(x)$
输入:输入分子 $x_{\mathrm{in}}$、检索的示例分子数量 $K$、优化迭代次数 $T$、每次迭代采样的分子数量 $M$
输出:优化后的分子 $x^{\prime}$
1 for $t\in[1,\ldots,T]$ do
2 $e_{\mathrm{in}}=\mathrm{Enc}(x_{\mathrm{in}})$;
3 $\mathcal{X}_{r}=\mathrm{Ret}\left(\mathcal{X}_{R},\{a_{\ell},\delta_{\ell}\}_{\ell=1}^{L},s,x_{\mathrm{in}},K\right)$;
4 $E_{r}=\mathrm{Enc}(\mathcal{X}_{r})$;
5 $e=f_{\mathrm{CA}}(e_{\mathrm{in}},E_{r};\theta)$;
6 $\mathcal{X}_{\mathrm{gen}}=\emptyset$;
7 for $i=1,\ldots,M$ do
8 $\epsilon\sim\mathcal{N}(0,\mathbf{1})$;
9 $e_{i}=e+\epsilon$;
10 $x^{\prime}=\mathrm{Dec}(e_{i})$;
11 $\mathcal{X}_{\mathrm{gen}}:=\mathcal{X}_{\mathrm{gen}}\cup\{x^{\prime}\}$;
12 end
13 $\mathcal{X}_{\mathrm{gen}}:=\mathrm{Ret}\left(\mathcal{X}_{\mathrm{gen}},\{a_{\ell},\delta_{\ell}\}_{\ell=1}^{L},s,x_{\mathrm{in}},K\right)$;
14 $x_{\mathrm{in}}:=\mathrm{top}_{1}(\mathcal{X}_{\mathrm{gen}},s)$;
15 $\mathcal{X}_{R}:=\mathcal{X}_{R}\cup\mathcal{X}_{\mathrm{gen}}\setminus x_{\mathrm{in}}$;
16 end
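A corresponding Python sketch of the generation loop in Algorithm 2 is shown below. The `encoder`, `decoder`, `fusion`, and `score` callables are illustrative stand-ins, and `retrieve_exemplars` refers to the Algorithm 1 sketch above; this is a simplified rendering rather than the exact implementation.

```python
import torch

def generate(x_in, retrieval_db, encoder, decoder, fusion,
             predictors, thresholds, score, T=80, K=10, M=20):
    """Sketch of Algorithm 2: iterative generation with adaptive input and
    retrieval database updates (component names are illustrative)."""
    for _ in range(T):
        # Lines 2-5: retrieve exemplars for the current input and fuse embeddings.
        exemplars = retrieve_exemplars(x_in, retrieval_db, predictors, thresholds, score, K)
        E_r = torch.cat([encoder(x) for x in exemplars], dim=0)
        e = fusion(encoder(x_in), E_r)

        # Lines 6-12: sample M candidates by perturbing the fused embedding with
        # isotropic Gaussian noise and decoding greedily.
        candidates = [decoder(e + torch.randn_like(e)) for _ in range(M)]

        # Lines 13-15: filter the candidates with the same retriever, promote the
        # best one to be the next input, and merge the rest into the database.
        kept = sorted(retrieve_exemplars(x_in, candidates, predictors, thresholds, score, K),
                      key=score, reverse=True)
        x_in = kept[0]
        retrieval_db = retrieval_db + kept[1:]
    return x_in
```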

B.2 RETMOL INFERENCE

B.2 RETMOL 推理

We use greedy sampling in the decoder throughout all experiments in this paper. When we need to sample more than one molecule from the decoder given the (fused) embedding, we first perturb the (fused) embedding with independent isotropic Gaussian noise with a standard deviation of 1 and then generate a sample from each perturbed (fused) embedding using the decoder. Inference uses a single NVIDIA V100 GPU with 16 GB memory.

本文所有实验在解码器中均采用贪心采样策略。当需要基于(融合)嵌入向量从解码器采样多个分子时,我们首先使用标准差为1的独立随机各向同性高斯噪声对(融合)嵌入向量进行扰动,然后通过解码器从每个扰动后的(融合)嵌入向量生成样本。推理过程使用单块16GB显存的NVIDIA V100 GPU完成。

Note that during inference, the multi-property design objective is given in the general form $s(x)=\sum_{l=1}^{L}w_{l}a_{l}(x)$. In the experiments, we simply set all the weight coefficients $w_{l}$ to 1, i.e., the aggregated design criterion is the sum of the individual property constraints.

需要注意的是,在推理过程中,多属性设计目标以通用形式 $s(x)=\sum_{l=1}^{L}w_{l}a_{l}(x)$ 给出。实验中,我们简单地将所有权重系数 $w_{l}$ 设为1,即聚合设计标准为各属性约束之和。
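For example, a minimal implementation of this aggregated criterion could look like the sketch below; the property predictors named in the usage comment are hypothetical placeholders.

```python
def multi_property_score(x, predictors, weights=None):
    """s(x) = sum_l w_l * a_l(x); the experiments set every w_l to 1."""
    weights = weights if weights is not None else [1.0] * len(predictors)
    return sum(w * a(x) for w, a in zip(weights, predictors))

# Usage (hypothetical predictors): score = lambda x: multi_property_score(x, [qed_predictor, sa_predictor])
```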

B.3 BASELINES

B.3 基线

We briefly review the existing methods and baselines used in our experiments. Some baselines apply to more than one experiment, while others specialize in a particular experiment.

我们简要概述了用于所有实验的现有方法和基线。部分基线适用于多个实验,而有些则专用于特定实验。

JT-VAE (Jin et al., 2018) The junction tree variational autoencoder (JT-VAE) reconstructs a molecule in its graph (2-dimensional) representation using its junction tree and molecular graph as inputs. JT-VAE learns a structured, fixed-dimensional latent space. During inference, controllable generation is achieved by first performing optimization on the latent space via property predictors trained on the latent space and then generating a molecule via the decoder in the VAE.

JT-VAE (Jin et al., 2018) 连接树变分自编码器 (JT-VAE) 通过输入分子的连接树和分子图,在二维图表示中重构分子。JT-VAE 学习到一个结构化的固定维度潜空间。在推理阶段,首先通过在潜空间上训练的性质预测器对潜空间进行优化,然后通过 VAE 的解码器生成分子,从而实现可控生成。

MMPA (Dalke et al., 2018) The Matched Molecular Pair Analysis (MMPA) platform uses rules and heuristics, such as matched molecular pairs, to perform various molecule operations such as search, transformation, and synthesis.

MMPA (Dalke et al., 2018) 匹配分子对分析(Matched Molecular Pair Analysis, MMPA)平台利用匹配分子对等规则和启发式方法,执行搜索、转化及合成等多种分子操作。

GCPN (You et al., 2018) Graph Convolutional Policy Network (GCPN) trains a graph convolutional neural network to generate molecules in their 2D graph representations with policy gradient reinforcement learning. The reward function consists of task-specific property predictors and adversarial loss.

GCPN (You et al., 2018) 图卷积策略网络 (Graph Convolutional Policy Network, GCPN) 通过策略梯度强化学习训练图卷积神经网络,以二维图表示形式生成分子。其奖励函数包含任务特定的属性预测器和对抗损失。

Vseq2seq (Bahdanau et al., 2015) This is a basic sequence-to-sequence model borrowed from the machine translation literature that translates the input molecule to an output molecule with more desired properties.

Vseq2seq (Bahdanau et al., 2015) 这是一个源自机器翻译领域的基础序列到序列模型,可将输入分子转化为具有更理想特性的输出分子。

VJTNN (Jin et al., 2019) VJTNN is a graph-based method for generating molecule graphs based on junction tree variational autoencoders (Jin et al., 2018). The method formalizes the controllable molecule generation problem as a graph translation problem and is trained using an adversarial loss to match the generated and data distributions.

VJTNN (Jin等人,2019) VJTNN是一种基于图的方法,用于基于连接树变分自编码器 (Jin等人,2018) 生成分子图。该方法将可控分子生成问题形式化为图翻译问题,并使用对抗性损失进行训练以匹配生成和数据分布。

HierG2G (Jin et al., 2019) The Hierarchical Graph-to-Graph generation method takes into account the molecule’s substructures, which are interleaved with the molecule’s atoms for more structured generation.

HierG2G (Jin et al., 2019) 该分层图到图生成方法考虑了分子的子结构,这些子结构与分子原子交错排列以实现更具结构化的生成。

AtomG2G (Jin et al., 2019) The Atom Graph-to-Graph generation method is similar to HierG2G but only takes into account the molecule's atom information, without the structure and substructure information.

AtomG2G (Jin et al., 2019) 该原子图到图生成方法类似于HierG2G,但仅考虑分子的原子信息,不包含结构和子结构信息。

DESMILES (Maragakis et al., 2020) The DESMILES method aims to translate the molecule’s fingerprint into its SMILES representation. Controllable generation is achieved by fine-tuning the model for task-specific properties and datasets.

DESMILES (Maragakis等,2020) DESMILES方法旨在将分子指纹转换为对应的SMILES表征。通过针对特定任务属性和数据集对模型进行微调,实现了可控生成。

MolDQN (Zhou et al., 2019) The Molecule Deep Q-Learning Network formalizes molecule generation and optimization as a sequence of Markov decision processes, which the authors optimize with deep Q-learning and a randomized reward function.

MolDQN (Zhou等人,2019) 将分子生成与优化形式化为马尔可夫决策过程序列,作者采用深度Q学习与随机奖励函数进行优化。

GA (Nigam et al., 2019) This method augments the classic genetic algorithm with a neural network which acts as a discriminator to improve the diversity of the generation during inference.

GA (Nigam et al., 2019) 该方法通过引入神经网络作为判别器来增强经典遗传算法 (genetic algorithm),以提升推理过程中生成结果的多样性。

QMO (Hoffman et al., 2021) The Query-based Molecule Optimization framework is a latent-optimization-based controllable generation method that operates on the SMILES representation of molecules. Instead of finding a latent code in the latent space via a property predictor trained on the latent space, QMO uses property predictors on the molecule space and performs latent optimization via zeroth-order gradient estimation. Although this latent-optimization-based method removes the need to train latent-space property predictors, we find that it is sometimes challenging to tune the hyper-parameters to adapt QMO to different controllable generation tasks. For example, in the SARS-CoV-2 main protease inhibitor design task, we could not generate an optimized molecule using QMO even with extensive hyper-parameter tuning.

QMO (Hoffman等人, 2021) 基于查询的分子优化框架是一种作用于分子SMILES表征的、基于隐空间优化的可控生成方法。与通过在隐空间训练属性预测器来寻找隐编码不同,QMO直接在分子空间使用属性预测器,并通过零阶梯度估计进行隐优化。尽管这种基于隐优化的方法无需训练隐空间属性预测器,但我们发现调整超参数使QMO适应不同可控生成任务时存在挑战。例如在SARS-CoV-2主蛋白酶抑制剂设计任务中,即使经过大量超参数调优,我们仍未能成功用QMO生成优化分子。

GVAE-RL (Jin et al., 2020a) This method builds on the graph-based grammar VAE (Kusner et al., 2017), a variational autoencoder that learns to expand a molecule with a set of expansion rules. GVAE-RL further fine-tunes GVAE with an RL objective for controllable generation, using the property values as the reward.

GVAE-RL (Jin et al., 2020a) 该方法基于图结构的语法变分自编码器 (Kusner et al., 2017),通过学习一组扩展规则来生成分子。GVAE-RL进一步通过强化学习目标对GVAE进行微调,以实现可控生成,并将属性值作为奖励信号。

REINVENT (Olivecrona et al., 2017) REINVENT trains a recurrent neural network using RL techniques to generate new molecules.

REINVENT (Olivecrona et al., 2017) REINVENT 采用强化学习 (RL) 技术训练循环神经网络来生成新分子。

Rationale RL (Jin et al., 2020a) The RationaleRL method first assembles a "rationale", a molecule fragment composed from different molecules with the desired properties. Then it trains a decoder using the assembled collection of rationales as input: it first randomly samples from the decoder, scores each sample with the property predictors, and uses the positive and negative samples as training data to fine-tune the decoder.

RationaleRL (Jin et al., 2020a)
RationalRL方法首先组装一个"原理片段(rationale)",即由具有目标特性的不同分子构成的分子片段。随后以组装的原理片段集合作为输入训练解码器:先随机从解码器采样,用属性预测器对每个样本评分,并将正负样本作为训练数据微调解码器。

MARS (Xie et al., 2021) The MARS method models the molecule generation process as a Markov chain, where the transitions are the “edit