MOLER: Incorporate Molecule-Level Reward to Enhance Deep Generative Model for Molecule Optimization

The goal of molecular optimization is to generate molecules similar to a target molecule but with better chemical properties. Deep generative models have shown great success in molecule optimization. However, because deep generative models build molecules through an iterative local generation process, the resulting molecules can deviate significantly from the input in molecular similarity and size, leading to poor chemical properties. The key issue is that existing deep generative models restrict their attention to substructure-level generation without considering the entire molecule as a whole. To address this challenge, we propose Molecule-Level Reward functions (MOLER) that encourage (1) the generated molecule to be similar to the input, and (2) the generated molecule to have a size similar to the input. The proposed method can be combined with various deep generative models, and a policy gradient technique is introduced to optimize the reward-based objectives with small computational overhead. Empirical studies show that MOLER achieves up to 20.2% relative improvement in success rate over the best baseline method on several properties, including QED, DRD2, and LogP.


Introduction
Designing molecules with desirable properties is a fundamental task in drug discovery. Traditional methods such as high-throughput screening (HTS) test large compound libraries to identify molecules with desirable properties, but they are inefficient and costly [1], [2]. Unlike HTS, molecule optimization takes an input molecule X and aims to find a molecule Y with improved drug properties such as drug-likeness and biological activity. However, standard molecule optimization is still costly and time-consuming due to the long cycles of labor-intensive synthesis and analysis [3].
The challenges mentioned above motivate a series of works that apply machine learning techniques to molecule optimization, with the ultimate goal of generating new molecules that maximize desirable drug properties while maintaining similarity to the original molecule. Existing works can mainly be categorized into generative models [4], [5], [6] and reinforcement learning (RL) methods [7], [8]. Generative models for molecule optimization map an input molecule to a latent space and perform optimization in the latent space to search for new molecules. For example, Gómez-Bombarelli et al. [6] and Blaschke et al. [9] utilized SMILES strings as molecule representations to generate molecules; however, such methods are known to create many invalid molecules. To improve the validity of the generated molecules, various approaches were proposed: Kusner et al. [4] and Dai et al. [5] designed grammar constraints, while MolGAN [10], CGVAE (Constrained Graph VAE) [11], and JTVAE (Junction Tree VAE)-based approaches [12], [13], [14] focus on graph representations of molecules. Reinforcement learning methods for molecule optimization are often developed on top of molecule generators to optimize desirable properties. For example, Olivecrona et al. [15], Putin et al. [16], and Popova et al. [17] applied RL techniques on top of a string generator to produce SMILES strings. You et al. [7] and Zhou et al. [8] proposed to generate molecular graphs, which achieved perfect validity.
Despite this initial success, existing methods often rely on iterative local expansion to construct the target molecule, potentially creating molecules of arbitrary sizes. Without a global fitness metric, the generated molecules can deviate significantly from the target molecule in molecular similarity and size. Moreover, these methods focus on patterns defined over substructures, whereas at test time the evaluation metrics for molecule quality are based on the entire molecule. These discrepancies cause two challenges, which we demonstrate empirically using the Junction Tree Variational Auto-Encoder (JTVAE) [12] and the Variational Junction Tree encoder-decoder (VJTNN) [13] models, as shown in Table 1 and Fig. 1.

• Difficulty in Maintaining Similarity While Optimizing Drug Properties. Similarity scores between input and generated molecules (defined in Equation (11)) are relatively low compared with the property improvement. For example, even when property scores exceed 0.7 in the range [0, 1], the similarity score remains low (around 0.3 in the range [0, 1]), as shown in Table 1.
• Difficulty in Maintaining Molecule Sizes, Which Affects Property Improvement. When generated molecules are of very different sizes compared to the target molecules, they perform poorly in both similarity and property. Fig. 1 shows the property improvement as a function of the size difference between the generated and input molecules. A large size difference in either the positive or negative direction usually corresponds to a smaller property improvement.
To address these challenges, we introduce Molecule-Level Rewards (MOLER) to enhance molecule-level properties. MOLER is a general and flexible approach that can be incorporated into various deep generative models for improved performance. It is enabled by the following technical contributions.
• Similarity Reward (Section 3.2): We formulate the similarity between input and generated molecules as a reward function to alleviate the similarity gap between them.
• Size Deviation Penalty (Section 3.3): To explicitly control molecule size, we design a size deviation penalty that penalizes large size deviations, reducing the size difference between target and generated molecules.
We evaluated MOLER by incorporating it into JTVAE [12], VJTNN [13], and CORE [18], and compared the MOLER-enhanced models with several strong baselines on three real-world molecule optimization tasks. Results show that MOLER-enhanced models achieve up to 16%, 14%, and 20% relative improvement over the best baselines in similarity score, property improvement, and success rate, respectively.

Related Works
We review related works in molecule generation. From the methodology perspective, there are two research lines in automatic molecule generation: maximum likelihood estimation (MLE) via deep generative models (DGM), and reinforcement learning (RL)-based approaches.
Maximum Likelihood Estimation (MLE) methods include Sequence-to-Sequence (Seq2Seq) models and graph-to-graph ones. Seq2Seq models focus on maximizing the likelihood of the target sequence and are proven effective in various natural language processing tasks [19]. Since molecules can also be represented as SMILES sequences, it is straightforward to apply Seq2Seq models. However, sequence-based approaches are brittle, since small changes to the sequence may completely violate chemical validity; special constraints were therefore developed to ensure the validity of generated SMILES strings [4], [5], [6]. Even so, due to the limited representation power of SMILES and its complex syntactic constraints, these sequence-based approaches still generate invalid SMILES strings. To address this issue, graph-based methods were explored, where the graph representation captures the structural information of molecules. Specifically, [12] represented a molecule as a graph whose nodes are substructures such as rings and atoms, and applied a two-phase generation procedure: first create a scaffolding tree, then assemble the tree into a new molecular graph. More recently, a graph-to-graph translation model [13] was proposed as an enhancement of [12]; it extends the junction tree variational autoencoder via an attention mechanism and generative adversarial networks. One potential problem is that MLE methods often require paired training data (i.e., input and target molecule pairs), which are not naturally available for molecule optimization.
Reinforcement Learning (RL) methods were also adopted in sequence-based molecule generation [15], [20] and graph-based generation [7], [8]. The learning objective is based on a molecule-level reward, e.g., directly using the chemical property of the generated molecule as the reward. However, RL-based methods usually ignore substructure-level supervision, suffer from poor exploration efficiency [21], and are usually hard to train [13], [22].
Between the two, the MLE/DGM methods focus on substructure-level translation and ignore the molecule-level reward [12], [13], [18], [23]; they also train on paired molecules, which are not naturally available for molecule optimization and thus limit performance. In contrast, RL methods base the learning objective on a molecule-level reward, e.g., directly using the chemical property of the generated molecule or the molecular similarity as the reward, but they usually ignore substructure-level supervision, suffer from poor exploration efficiency [21], and are generally hard to train [13], [22]. To alleviate these issues, our method bridges the two approaches by incorporating two molecule-level rewards into MLE/DGM models.

MOLER Method
In this section, we first present deep generative models as background for molecule optimization in Section 3.1, particularly the scaffolding tree-based graph generation approaches [12], [13], [14], [18].
After introducing the background on scaffolding tree-based generative models and their limitations, we present the two molecule-level reward functions individually in Sections 3.2 and 3.3, followed by the optimization procedure (Section 3.4). We list important mathematical notations with short explanations in Table 2 for clarity, and show a flow chart in Fig. 2 to illustrate deep generative models and MOLER intuitively.

Deep Generative Models for Molecule Generation
In this subsection, we summarize the scaffolding tree-based generative model for molecule generation, which achieves state-of-the-art performance [12], [13], [14], [18]. For ease of exposition, we first define the molecular graph, substructure, and scaffolding tree. Note that this subsection provides the background to understand these types of generative models, which were originally proposed in [12] and subsequently enhanced by [13], [14], [18]. The main contribution of this paper will be discussed in the next subsection.

Definition 1 (Molecular Graph G).-The molecular graph G represents the graph structure of a molecule. In the molecular graph G, each node corresponds to an atom of the molecule, and each edge corresponds to a bond connecting atoms.

Definition 2 (Substructure).-The set of substructures includes all possible rings, bonds, and atoms. A substructure is a basic unit in the scaffolding tree; each substructure corresponds to a node in the scaffolding tree.
Definition 3 (Scaffolding Tree T_G).-Given a molecule and its corresponding molecular graph G, the scaffolding tree T_G is generated by partitioning the graph G into substructures and connecting the substructures into a tree. The main purpose of scaffolding trees is to simplify the generation procedure; the scaffolding tree T_G is the skeleton of the molecular graph.
These models for generating valid molecules comprise two parts: (1) an encoder-decoder for the scaffolding tree T_G, and (2) an encoder-decoder for the molecular graph G.
For scaffolding tree-based graph generation methods, the training set consists of paired data (X, Y). The input is molecule X, while the target is another molecule Y similar to X but with a better chemical property, e.g., biological activity (measured by DRD2) [24] or drug-likeness (measured by QED) [25]. The neural architecture can be divided into two parts: (i) encoder and (ii) decoder. Both encoder and decoder have a molecular graph phase and a scaffolding tree phase.

Encoder-
The goal of the encoder is to yield an embedding vector for each node in the scaffolding tree T_G or the input molecular graph G. Specifically, both input molecular graphs and scaffolding trees are encoded via graph Message Passing Networks (MPN) [12], [26], which encode both the nodes and the edges of a graph. On the node level, each node v has a feature vector denoted e_v. For example, a node v in the molecular graph G is an atom, and e_v includes the atom type, valence, and other atomic properties; in the corresponding scaffolding tree T_G, node v refers to a substructure, and e_v is a one-hot vector indicating the substructure type. The messages are updated iteratively as

$m_{uv}^{(t)} = f_1\big(e_u, \sum_{w \in N(u) \setminus \{v\}} m_{wu}^{(t-1)}\big)$, (1)

where f_1(·) is a fully-connected neural network, and $m_{uv}^{(t)}$ is the message vector from node u to node v at the t-th iteration, initialized to the all-0 vector, i.e., $m_{uv}^{(0)} = 0$, following the rule of thumb [26]. After T steps of iteration (T is a hyperparameter, similar to that in Graph Convolutional Networks [27]), another fully-connected neural network f_2(·) is used to aggregate these messages, so that each vertex has a latent vector

$z_v = f_2\big(e_v, \sum_{u \in N(v)} m_{uv}^{(T)}\big)$. (2)

A. Tree Decoder.: To generate a new scaffolding tree, the idea is to generate one substructure at a time, starting from an empty tree. In each step, the decoder first decides whether to expand the current node or backtrack to its parent (topological prediction). If the decision is to expand, substructure prediction is used to choose which substructure to add to the new node, as shown in Fig. 2. When the decoder backtracks to the starting node and decides not to expand any more, the generation process terminates.
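To make the encoder's message-passing scheme concrete, the following is a minimal toy sketch in Python. It is illustrative only: it assumes scalar node features, sum aggregation, and simple arithmetic (a fixed 0.5 scaling) in place of the learned networks f_1(·) and f_2(·), so it shows the update pattern of Equations (1)-(2) rather than the actual model.

```python
def mpn_encode(nodes, edges, T=3):
    """Toy MPN. nodes: {v: scalar feature e_v}; edges: undirected (u, v) pairs."""
    # one directed message per edge direction, initialized to 0 as in Equation (1)
    directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
    m = {e: 0.0 for e in directed}
    neighbors = {v: set() for v in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    for _ in range(T):
        new_m = {}
        for u, v in directed:
            # stand-in for f_1: combine e_u with incoming messages, excluding v
            incoming = sum(m[(w, u)] for w in neighbors[u] if w != v)
            new_m[(u, v)] = 0.5 * (nodes[u] + incoming)
        m = new_m
    # stand-in for f_2: latent z_v aggregates the final incoming messages
    return {v: nodes[v] + sum(m[(u, v)] for u in neighbors[v]) for v in nodes}

# a 3-node path graph 0 - 1 - 2
z = mpn_encode({0: 1.0, 1: 2.0, 2: 3.0}, [(0, 1), (1, 2)])
```

The degree-2 center node aggregates messages from both neighbors, so its latent value reflects more context than the leaves, mirroring how the real MPN pools neighborhood information.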

Topological Prediction.:
The core idea is to first leverage a tree-based Recurrent Neural Network (tree-RNN) [12] to enhance the embedding for node i_t, and then predict whether to expand or backtrack. Given the scaffolding tree T_G = (V, ℰ), the tree decoder uses the tree-based RNN with an attention mechanism to further improve the embeddings learned from the original message-passing embeddings Z_T. The tree-RNN converts the graph into a sequence of nodes and edges via depth-first search. Specifically, let $\tilde{\mathcal{E}} = \{(i_1, j_1), (i_2, j_2), \ldots, (i_m, j_m)\}$ be the edges traversed in the depth-first search; each edge is visited twice, once in each direction, so $m = |\tilde{\mathcal{E}}| = 2|\mathcal{E}|$. Let $\tilde{\mathcal{E}}_t$ be the first t edges in $\tilde{\mathcal{E}}$. The message vector $h_{i_t, j_t}$ is updated as

$h_{i_t, j_t} = \mathrm{GRU}\big(e_{i_t}, \{h_{k, i_t}\}_{(k, i_t) \in \tilde{\mathcal{E}}_t,\, k \neq j_t}\big)$. (3)

The expanding probability at node i_t is computed by combining the embeddings Z_T, Z_G and the current state $e_{i_t}, \sum_{(k, i_t) \in \tilde{\mathcal{E}}_t} h_{k, i_t}$ using a neural network f_3(·):

$p_t = \sigma\big(f_3(e_{i_t}, \textstyle\sum_{(k, i_t) \in \tilde{\mathcal{E}}_t} h_{k, i_t}, c_t^{topo})\big)$. (4)

Specifically, we first compute the context vector $c_t^{topo}$ using the attention mechanism (similar to the process in Equations (5) and (6)), then concatenate $c_t^{topo}$ and $e_{i_t}$ and feed them into a fully-connected neural network with a sigmoid function to produce the expanding probability.
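The depth-first traversal underlying the tree decoder, in which every tree edge is visited once downward (expand) and once upward (backtrack), can be sketched as follows. This is a hypothetical helper for intuition, not part of the model; it only demonstrates why the number of topological predictions is $2|\mathcal{E}|$.

```python
def dfs_edge_sequence(tree, root):
    """Return the edges (i_t, j_t) in DFS order. Each tree edge appears twice,
    once per direction, so the sequence has length 2 * (number of tree edges).
    tree: adjacency dict {node: [neighbors...]} of an undirected tree."""
    seq = []
    def visit(node, parent):
        for child in tree[node]:
            if child == parent:
                continue
            seq.append((node, child))   # downward step: expand to child
            visit(child, node)
            seq.append((child, node))   # upward step: backtrack to parent
    visit(root, None)
    return seq

# a 4-node tree with 3 edges: 0-1, 0-2, 2-3
edges = dfs_edge_sequence({0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}, root=0)
```

For this 3-edge tree the sequence has 6 entries, matching $m = 2|\mathcal{E}|$; generation terminates after the final backtrack to the root.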

Substructure Prediction.:
After the decision of node expansion, we focus on finding what substructures to add by either selecting from the global set of substructures [12], [13] or copying from original input [18]. Every time after expanding a new node, the model predicts the substructure from the vocabulary.
We will summarize the key steps next. First we use attention mechanism to compute context vector based on current message vector h i t , j t and node embedding Z T , Z G c t sub = Attention h i t , j t , Z T , Z G , (5) Specifically, the attention weight is computed as where f 4 (·) is the dot-product function [28]. {α G ) are generated in the same way. Then we concatenate tree-level and graph-level context vector to generate the context vector Then based on attention vector c t sub and message vector h i t , j t , f 5 (·), a fully-connected neural network with softmax activation, is added to predict the substructure where q t sub is a probability distribution over all substructures [12], [13]. In [18], q t sub is enhanced by a COPY strategy that copies substructures from the input molecule using attention weight in Equation (6).

B. Graph Decoder.:
Based on the scaffolding tree generated by the tree decoder, the graph decoder assembles the nodes of the scaffolding tree into the correct molecular graph [12]. During learning, we enumerate all possible molecular structures {G_i} and formulate assembly as a multi-class classification problem, where the learning objective is the log-likelihood of the correct structure G_o, based on the scoring function

$s^a(G_i) = h_{G_i}^{\top} d_{G_o}$, (9)

which measures the likelihood of the candidate structure G_i; $d_{G_o}$ is the embedding of the original graph G_o.

Limitation and Extensions.:
The problem with these generative models is that they focus only on the local (substructure-level) mapping while ignoring the global (molecule-level) metric, which may lead to a significant deviation from the input molecule in similarity. Next, we present RL-based reward functions that steer the overall loss function towards more desirable molecules.

Similarity Reward
Both the similarity to the input molecule X (denoted sim(X, Y)) and the property of Y are essential metrics for evaluating the quality of the generated Y. The similarity between two molecules ranges from 0 to 1. As mentioned, the similarity value is relatively low (around 0.3 on both the QED and DRD2 datasets) compared with the property value (approximately 0.7-0.8). We therefore add a "similarity reward" to explicitly enforce the similarity constraint, i.e., we maximize

$w_{sim}\, \mathcal{L}_{sim}(\theta), \quad \mathcal{L}_{sim}(\theta) = \mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}\big[\mathrm{sim}(X, Y)\big]$, (10)

where θ represents all parameters of the generative model, ℒ_sim(θ) is the objective function for the similarity reward, and the hyperparameter $w_{sim} \in \mathbb{R}_+$ is the weight of the reward. sim(X, Y) is the Tanimoto similarity between molecules X and Y over Morgan fingerprints [29].
A Morgan fingerprint is a binary vector in which each bit records the presence ("1") or absence ("0") of a fragment descriptor in the molecule. For example, a fragment descriptor can be "a benzene ring exists in the molecule". The assumption is that two fingerprints with many bits in common represent similar molecules. Let S_X and S_Y denote the sets of fragment descriptors present in molecules X and Y, respectively. The Tanimoto similarity is defined as

$\mathrm{sim}(X, Y) = \dfrac{|S_X \cap S_Y|}{|S_X \cup S_Y|}$, (11)

where ∩ and ∪ denote set intersection and union, respectively, and |·| denotes the cardinality of a set. The value ranges from 0 to 1, and a higher value means greater similarity. We use the RDKit package [30] both to generate Morgan fingerprints and to compute Tanimoto similarity. When optimizing Equation (10), the computation of its gradient estimator requires the probability density π_θ(Y|X) explicitly; we now discuss how to evaluate π_θ(Y|X).
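As an aside, the set formulation of Equation (11) can be sketched in a few lines of Python; this is a toy stand-in for RDKit's fingerprint utilities, with fingerprints represented simply as sets of "on" bit indices (the empty-fingerprint convention below is our assumption).

```python
def tanimoto(fp_x, fp_y):
    """Tanimoto similarity between two fingerprints, each given as the set
    of fragment-descriptor indices that are 'on' (S_X and S_Y above)."""
    if not fp_x and not fp_y:
        return 1.0  # convention for two empty fingerprints (assumption)
    inter = len(fp_x & fp_y)
    union = len(fp_x | fp_y)
    return inter / union

# two molecules sharing 2 of 4 total 'on' bits -> similarity 0.5
print(tanimoto({1, 5, 9}, {1, 5, 7}))  # -> 0.5
```

In practice one would compute real Morgan fingerprints and Tanimoto similarity with RDKit, as the paper does; the sketch only makes the intersection-over-union structure of the metric explicit.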
As described in Section 3.1, molecule generation consists of three prediction tasks: (i) topological prediction, a binary classification task; (ii) substructure prediction; and (iii) assembling prediction (in the graph decoder). Since graph generation has two phases (scaffolding tree and molecular graph), π_θ(Y|X) can be written as a joint distribution that covers the generation of both the scaffolding tree and the molecular graph, incorporating the three prediction tasks:

$\pi_\theta(Y \mid X) = \Big(\prod_{t=1}^{2|\mathcal{E}|} p_t^{topo}\Big) \cdot \Big(\prod_{t \in S} q_t^{sub}\Big) \cdot \dfrac{\exp\big(s^a(G_Y)\big)}{\sum_i \exp\big(s^a(G_i)\big)}$. (12)

First, $p_t^{topo}$ is the probability of the t-th topological prediction, defined in Equation (4), and ℰ is the edge set of the generated scaffolding tree. Since the generation procedure terminates only after backtracking to the root node, each edge is visited twice, giving 2|ℰ| topological predictions in total. Second, regarding substructure prediction, $q_t^{sub}$ is the distribution over all substructures defined in Equation (8), and $S \subseteq \{1, 2, \ldots, 2|\mathcal{E}|\}$ is the set of indices of the steps that expand to a new node, because a substructure prediction is needed only when the topological prediction decides to expand.
Third, regarding assembling prediction, G_Y is the assembled structure for Y, selected from all possible molecular structures {G_i} based on the scaffolding tree, and $s^a(G_i) = h_{G_i}^{\top} d_{G_o}$ is the scoring function defined in Equation (9).

Size Deviation Penalty
Size deviation between the input and generated molecules is another reason the generated molecule can be dissimilar to the input. Therefore, we design a penalty score to constrain this deviation.
As mentioned in Section 3.1, graph generation involves three key prediction tasks: (i) topological prediction; (ii) substructure prediction; (iii) assembling prediction. Since scaffolding tree-based approaches use substructures as the basic components of molecule generation, we use the number of substructures as a surrogate for molecule size, which is much more efficient to compute since it depends only on topological prediction.
To design a reasonable size deviation penalty, we empirically investigate the correlation between the size of X and Y in training data pairs, and find they are positively correlated.
We want to minimize the following size deviation penalty:

$w_{size}\, \mathcal{L}_{size}(\theta), \quad \mathcal{L}_{size}(\theta) = \mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}\big[g\big(\mathrm{size}(X), \mathrm{size}(Y)\big)\big]$, (13)

where ℒ_size(θ) is the objective function of the size deviation penalty, the hyperparameter $w_{size} \in \mathbb{R}_+$ is its weight, and size(X) denotes the number of substructures in the scaffolding tree of X. g(·, ·) is the size deviation function, defined as

$g(x, y) = \max\big(|x - y| - \epsilon,\ 0\big)$, (14)

where $\epsilon \in \mathbb{N}_+$ is a positive integer. We set ϵ = 3, which is empirically validated. Note that g(·, ·) is commonly known as the ϵ-insensitive loss in the context of SVM regression [31]. The intuition behind the design of g(x, y) is to impose a high penalty when the sizes of the input and generated molecules differ significantly. When minimizing ℒ_size(θ), we need to evaluate the joint distribution over all topological predictions explicitly:

$\pi_\theta^{T}(Y \mid X) = \prod_{t=1}^{2|\mathcal{E}|} p_t^{topo}$. (15)
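The ϵ-insensitive penalty of Equation (14) is simple enough to state directly in code; the sketch below uses the paper's setting ϵ = 3, with substructure counts as integer sizes.

```python
def size_penalty(size_x, size_y, eps=3):
    """Epsilon-insensitive penalty g(x, y): zero when the substructure counts
    differ by at most eps, linear in the deviation beyond that band."""
    return max(abs(size_x - size_y) - eps, 0)

print(size_penalty(10, 12))  # within the eps=3 band -> 0
print(size_penalty(10, 18))  # |10 - 18| - 3 -> 5
```

The flat region around zero deviation is the design point: small size differences are tolerated without any gradient signal, while large deviations are penalized linearly.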

Learning and Optimization
Now we discuss the optimization procedure. Without loss of generality, we consider maximizing a general objective of the form

$\mathcal{L}(\theta) = \mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}\big[R(X, Y)\big]$, (16)

where ℒ(θ) can be either ℒ_sim or ℒ_size in this paper, and the reward function $R(X, Y) \in \mathbb{R}$ can be either sim(X, Y) or g(size(X), size(Y)). R(X, Y) usually does not involve θ, so it is treated as a constant when differentiating with respect to θ; $\pi_\theta(Y \mid X) \in \mathbb{R}_+$ is the probability density function of the generative model. The gradient $\nabla_\theta \mathcal{L}(\theta)$ can be expanded as

$\nabla_\theta \mathcal{L}(\theta) = \nabla_\theta \int R(X, Y)\, \pi_\theta(Y \mid X)\, dY = \int R(X, Y)\, \pi_\theta(Y \mid X)\, \nabla_\theta \log \pi_\theta(Y \mid X)\, dY = \mathbb{E}_{Y \sim \pi_\theta(\cdot \mid X)}\big[R(X, Y)\, \nabla_\theta \log \pi_\theta(Y \mid X)\big]$, (17)

where in the second equality the "log-derivative trick" $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$ is applied; it is a well-known technique in policy gradient-based reinforcement learning [32], [33] and stochastic variational inference [34]. An unbiased estimator of the gradient of ℒ(θ) with respect to θ can then be obtained from Monte Carlo samples:

$\nabla_\theta \mathcal{L}(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(X_i, Y_i)\, \nabla_\theta \log \pi_\theta(Y_i \mid X_i)$, (18)

where $Y_i \sim \pi_\theta(\cdot \mid X_i)$; that is, Y_i is a molecule generated by the generative model given the input molecule X_i. Gradient ascent is then used to maximize the objective function with respect to θ. It is worth mentioning that the proposed method requires no extra model parameters compared with the underlying deep generative models.
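The Monte Carlo score-function estimator above can be illustrated on a toy problem. The sketch below is our own minimal example, not the paper's model: the "policy" is a categorical softmax over two actions instead of a molecule generator, and the reward is a fixed table instead of sim(X, Y); the estimator R(y) ∇ log π_θ(y) is nevertheless the same.

```python
import math, random

def reinforce_grad(logits, rewards, n_samples=5000, seed=0):
    """Monte Carlo estimate of the gradient of E_{y ~ pi_theta}[R(y)] for a
    categorical pi_theta = softmax(logits), via the log-derivative trick:
    grad ~= (1/m) * sum_i R(y_i) * grad log pi_theta(y_i)."""
    rng = random.Random(seed)
    exps = [math.exp(l) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        # sample y ~ pi_theta by inverse CDF
        u, y, acc = rng.random(), 0, probs[0]
        while u > acc:
            y += 1
            acc += probs[y]
        # grad of log pi(y) w.r.t. logit k is 1[k == y] - probs[k]
        for k in range(len(logits)):
            grad[k] += rewards[y] * ((1.0 if k == y else 0.0) - probs[k])
    return [g / n_samples for g in grad]

g = reinforce_grad([0.0, 0.0], rewards=[0.0, 1.0])
# pushes probability mass toward the rewarded action: g[1] > 0 > g[0]
```

Note that the reward enters only as a multiplicative constant on ∇ log π_θ, which is why no extra parameters are needed: the gradient flows through the generator's existing log-probabilities.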
In summary, the deep generative model and the two reward-based objectives are optimized alternately and individually. The stopping criterion is checked every epoch on the validation set: when the success rate (a metric discussed in Section 4.5) of the current epoch is lower than that of the previous epoch, we terminate the learning procedure. The complete procedure is given in Algorithm 1; in each iteration, ℒ_gen(θ), the loss of the generative model, is optimized w.r.t. θ using the sampled pairs $\{(X_i, Y_i)\}_{i=1}^{m}$ via SGD.

Choosing Weight for MOLER
During the learning procedure, the two reward values (related to similarity and size) are integrated into the learning objective; we minimize

$\mathcal{L} = \mathcal{L}_{gen} - w_{sim}\, \mathcal{L}_{sim} + w_{size}\, \mathcal{L}_{size}$, (19)

where $w_{sim}, w_{size} \in \mathbb{R}_+$ are hyperparameters that control the strength of MOLER, and ℒ_gen is the loss of the generative model. It is time-consuming to use grid search or random search to find a near-optimal weight combination (w_sim, w_size). To address this issue, we resort to Gaussian processes (GP) [35], [36]. A Gaussian process is used to approximate the validation accuracy (here, the success rate, described in Section 4.5) as a function of the weight combination; by searching for the optimum of the approximated function, we obtain an appropriate weight combination. Concretely, denote the function to approximate by h(w_1, w_2), where w_1 = w_sim and w_2 = w_size. The search range for w_i, i = 1, 2, is bounded by LB_i ≤ w_i ≤ UB_i. We draw m weight combinations, denoted $w^{(1)} = (w_1^{(1)}, w_2^{(1)}),\ w^{(2)} = (w_1^{(2)}, w_2^{(2)}),\ \ldots,\ w^{(m)} = (w_1^{(m)}, w_2^{(m)})$. For the i-th weight combination $w^{(i)}$, we evaluate the task metric (e.g., success rate) on the validation set to obtain r_i, which can be seen as a noisy evaluation of the true function $h(w_1^{(i)}, w_2^{(i)})$ we want to approximate; the noise comes from the sampling bias of the validation set. Since the true objective function h(w_1, w_2) is unknown, a zero-mean Gaussian prior is specified:

$h(\cdot) \sim GP\big(0, k(\cdot, \cdot)\big)$, (20)

where k(·, ·) is the covariance function. Given the m points $\{w^{(i)}\}_{i=1}^{m}$ (weight combinations) and their evaluations $\{r_i\}_{i=1}^{m}$, we assume r is a noisy evaluation of the true unknown function of interest, i.e., r(w) = h(w) + ϵ, with w = (w_1, w_2). By Bayes' rule, the posterior distribution of the true objective function is

$h \mid \{w^{(i)}, r_i\}_{i=1}^{m} \sim \mathcal{N}\big(\mu(w), \sigma^2(w)\big),\quad \mu(w) = k^{\top}\big(K + \sigma_n^2 I\big)^{-1} r,\quad \sigma^2(w) = k(w, w) - k^{\top}\big(K + \sigma_n^2 I\big)^{-1} k$, (21)

where

$K = \begin{pmatrix} k(w^{(1)}, w^{(1)}) & \cdots & k(w^{(1)}, w^{(m)}) \\ \vdots & \ddots & \vdots \\ k(w^{(m)}, w^{(1)}) & \cdots & k(w^{(m)}, w^{(m)}) \end{pmatrix}, \quad k = \big(k(w, w^{(1)}), \ldots, k(w, w^{(m)})\big)^{\top}, \quad r = (r_1, \ldots, r_m)^{\top}$. (22)

For the covariance function we use a squared-exponential kernel, $k(w, v) = \exp\big(-\|w - v\|^2 / (2\alpha^2)\big)$, with α empirically set to 0.2 [36].
To summarize, the Gaussian process provides a surrogate function that approximates the true objective h(w_1, w_2), i.e., the validation accuracy as a function of the MOLER weights (w_sim and w_size). The surrogate can be searched efficiently for the optimum of the objective function, i.e., the optimal MOLER weights; we use grid search over the surrogate to obtain them.
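The GP posterior mean of Equation (21) can be sketched compactly. The code below is a toy, pure-Python illustration under the paper's kernel (squared-exponential with α = 0.2); the weight combinations, validation scores, noise level, and the tiny Gaussian-elimination solver are all our own illustrative choices.

```python
import math

def rbf(w, v, alpha=0.2):
    """Squared-exponential covariance k(w, v) = exp(-||w - v||^2 / (2 alpha^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(w, v))
    return math.exp(-d2 / (2 * alpha ** 2))

def gp_posterior_mean(W, r, w_star, noise=1e-6):
    """Posterior mean mu(w*) = k^T (K + sigma_n^2 I)^{-1} r for observed weight
    combinations W with validation scores r (tiny m; Gauss-Jordan solve)."""
    m = len(W)
    K = [[rbf(W[i], W[j]) + (noise if i == j else 0.0) for j in range(m)]
         for i in range(m)]
    # solve K a = r by Gauss-Jordan elimination (no pivoting; fine for tiny m)
    A = [row[:] + [r[i]] for i, row in enumerate(K)]
    for col in range(m):
        A[col] = [x / A[col][col] for x in A[col]]
        for row in range(m):
            if row != col:
                f = A[row][col]
                A[row] = [x - f * y for x, y in zip(A[row], A[col])]
    a = [A[i][m] for i in range(m)]
    k_star = [rbf(w_star, W[j]) for j in range(m)]
    return sum(ks * ai for ks, ai in zip(k_star, a))

# two hypothetical weight combinations from the search range with their
# (made-up) validation success rates
W = [(0.001, 0.001), (0.01, 0.005)]
mu = gp_posterior_mean(W, [0.55, 0.62], W[1])
# with near-zero noise, the mean at an observed point recovers its observation
```

In practice one would query the surrogate mean over a grid of candidate (w_sim, w_size) pairs and pick the maximizer, which is exactly the grid-search-over-surrogate step described above.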

Experiment
The experiment section answers the following questions.

Q1: Can MOLER improve deep generative models for molecular optimization?
Q2: What is the effect of similarity reward alone?
Q3: What is the effect of size deviation penalty alone?

Molecular Properties
In drug discovery we are usually interested in certain desired chemical properties of a molecule, which are natural metrics for evaluation. Following [7], [8], [13], [14], [23], [37], we focus on the following molecular properties:
• Quantitative Estimate of Drug-likeness (QED). QED is an indicator of drug-likeness [25]; the QED score of a compound ranges from 0 to 1.
• Dopamine Receptor (DRD2). DRD2 score measures a molecule's biological activity against a biological target named the dopamine type 2 receptor (DRD2) [24]. DRD2 score ranges from 0 to 1, and a high score means better property.
• Penalized LogP. Penalized LogP is a logP score that also accounts for ring size and synthetic accessibility [38]. LogP score ranges from −∞ to +∞. For most molecules, the score ranges from −20 to 20.
For any chemically valid molecule, the QED, DRD2, and LogP scores can be evaluated via the RDKit package [30]. For all three scores, higher is better. Thus, in each training data pair (X, Y), X is the molecule with the lower score, while Y is a paraphrase of X with a higher score.

Setup
First, we describe the experimental setup, including the chemical properties of molecules and the generation of training data. The molecular properties include Drug likeness (QED), Dopamine Receptor (DRD2), and Penalized LogP.
Dataset.-Data pairs of input and target molecules are extracted from the ZINC database, which contains 250K drug-like molecules. We use three public datasets available from [13]: (1) the QED dataset for improving QED; (2) the DRD2 dataset for improving DRD2; (3) the LogP dataset for improving Penalized LogP. In this paper, QED, DRD2, and LogP thus refer both to the target chemical properties to improve and to the names of the datasets; e.g., the QED dataset is used to learn a model that improves the QED score of a molecule. Basic statistics of all datasets are listed in Table 3.

Dataset Construction
The training datasets contain data pairs extracted from the ZINC dataset based on the following two criteria:
• Similarity Constraint. X and Y are similar enough, i.e., sim(X, Y) ≥ η_1.
• Property Constraint. Y has a higher score than X on the target property. For the QED dataset, the QED score of X is in the range [0.7, 0.8] while the QED score of Y is in the range [0.9, 1]. For the DRD2 dataset, a well-trained model from [40] is applied to assess the probability of biological activity, where the probability for X is lower than 0.05 and the probability for Y is greater than 0.5. The property constraint for the LogP dataset is described in [13], [39]. These settings are the same as in [13].

Baseline Methods
Now we briefly introduce the baseline methods.
• ReLeaSE. Reinforcement Learning for Structural Evolution (ReLeaSE) [20] exhibits state-of-the-art performance on SMILES representation.
MOLER can be applied to enhance deep generative models such as JTVAE, VJTNN, and CORE.

Evaluation Metrics
Next, we describe the key metrics used to evaluate the effectiveness of the generated molecules. The task is to generate Y, a paraphrase molecule of X, with a better desired property. The evaluation metrics used in our experiments are:
• Similarity. We evaluate the molecular similarity between the input molecule and the generated molecule, measured by Tanimoto similarity over Morgan fingerprints [29], as defined in Equation (11). The similarity between X and Y (denoted sim(X, Y)) ranges from 0 to 1. Morgan fingerprint generation and Tanimoto similarity calculation are done with the RDKit package [30].
• Success Rate. Following [13], [14], [37], [41], for the QED and DRD2 datasets, a generated molecule is regarded as a success when its similarity to the input is greater than 0.3 and its chemical property (QED or DRD2) is greater than 0.6. For the LogP dataset, a generated molecule is regarded as a success when its similarity to the input is greater than 0.4 and its LogP improvement is greater than 0.8.
• Novelty. Novelty is defined as the percentage of generated molecules that do not appear in the training set. Generated molecules are first canonicalized into canonical SMILES for this comparison.
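The success criteria above can be sketched as a small helper; the function name and structure are illustrative, while the thresholds follow the text (for LogP, the property argument is the *improvement*, not the raw score).

```python
def is_success(similarity, prop, dataset):
    """Success criteria used in evaluation (sketch). For QED/DRD2, `prop` is
    the property value of the generated molecule; for LogP, the improvement."""
    if dataset in ("QED", "DRD2"):
        return similarity > 0.3 and prop > 0.6
    if dataset == "LogP":
        return similarity > 0.4 and prop > 0.8
    raise ValueError("unknown dataset")

print(is_success(0.35, 0.65, "QED"))   # -> True
print(is_success(0.35, 0.50, "DRD2"))  # -> False
```

The success rate reported in the tables is then simply the fraction of test pairs for which this predicate holds.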

Experimental Details for Reproducibility
In this section, we provide the implementation details required for reproduction of experimental results.
Features.-We first describe the features used as input to the encoders, including both node and edge features. Following [12], [13], [18], in the (molecular) graph message passing network, the initial atom features include the atom type, degree, formal charge, and chiral configuration. Each bond feature is a concatenation of the bond type, whether the bond is in a ring, and its cis-trans configuration. In the scaffolding tree-based encoder, we represent each substructure with a learned embedding vector, similar to a word embedding. Both the tree and graph decoders use the same feature settings as the encoders.
Hyperparameters.-We follow most of the hyperparameter settings of JTVAE [12], VJTNN [13], and CORE [18] when running JTVAE+MOLER, VJTNN+MOLER, and CORE+MOLER, respectively. For all baseline methods and datasets, the maximal number of epochs is set to 10 and the batch size to 32. The initial learning rate is 1e−3 with the Adam optimizer, annealed by a factor of 0.8 every epoch. The convergence criterion is checked every epoch via the success rate on the validation set: when the success rate of the current epoch is lower than that of the previous epoch, we declare convergence and terminate training. Following VJTNN and CORE, the discriminator in adversarial training is a three-layer fully-connected network with latent dimension 300 and LeakyReLU activation. For JTVAE, VJTNN, CORE, and their MOLER variants, the depths of the tree encoder and graph encoder are set to 6 and 3, respectively. The hidden-state dimension of both the (molecular) graph message passing network (GMPN) and the junction tree message passing network (JTMPN) is set to 300. All MPNs use tanh activation and a mean readout function. In the decoder networks, all hidden layers use ReLU activation. MOLER introduces no extra learnable parameters. According to our empirical study, the weights of the individual MOLER rewards play a particularly important role in performance; since choosing them is a gradient-free optimization problem, we resort to the Gaussian process [35] described in Section 3.5. For both the similarity reward and the size deviation penalty, the weight search space ranges from 1e−4 to 1e−2.
Hardware and Software Configuration.-All experiments are run on an Intel Xeon E5-2690 machine with 256 GB RAM and 8 NVIDIA Pascal Titan X GPUs. Our method is implemented in Python 2.7 with PyTorch 0.4.0. The RDKit version must be 2017.09 or later.

Q1: MOLER can Improve Deep Generative Models
First, we demonstrate that MOLER can improve state-of-the-art generative models on all three datasets. The results (in terms of similarity, property improvement, and success rate) are reported in Table 4. We find that the MOLER variants (JTVAE+MOLER, VJTNN+MOLER, CORE+MOLER) provide significant and consistent improvement, up to 19%, across different generative models. For the other metrics, in Fig. 3, we use bar charts to show the results of the various methods on the different datasets. MOLER, like the other methods, maintains high novelty in the generated molecules (Fig. 3d), and it incurs little extra cost in training time and model size, as shown in Figs. 3e and 3f. Regarding running time, we observe that JTVAE-based methods (JTVAE and JTVAE+MOLER) take longer to converge when optimizing the LogP score. There are three main reasons. (i) When we use JTVAE+MOLER to optimize LogP, we observe that during the early iterations the generated molecules differ greatly from the target ones, so the MOLER objective is not stable and costs more time. (ii) LogP is more sensitive to local structure [38] than QED and DRD2 and is harder to optimize; we observe that all methods take longer when optimizing LogP. (iii) Compared with VJTNN, the neural architecture of JTVAE is less fine-grained: it lacks the adversarial learning module and the attention mechanism in the decoder, which may lead to a longer time to converge. Finally, as shown in Fig. 3g, MOLER is able to reduce the size deviation, measured by the variance of size(X) − size(Y) on the test set.
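The size-deviation metric of Fig. 3g is simply the variance of the per-pair size differences over the test set; a minimal sketch, assuming size(·) counts substructures:

```python
# Sketch of the size-deviation metric: the variance of size(X) - size(Y)
# over the test set, where size(.) counts substructures.

def size_deviation_variance(input_sizes, output_sizes):
    """Population variance of the element-wise size differences."""
    diffs = [x - y for x, y in zip(input_sizes, output_sizes)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)
```

A value near zero indicates the generated molecules track the input sizes closely.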
Case Study.-Moreover, we provide an example in Fig. 4, where VJTNN generates a small molecule with poor performance. In contrast, for the same input molecule, JTVAE+MOLER and VJTNN+MOLER generate molecules that achieve high similarity and a much improved QED score.

Q2: The Positive Effect of Similarity Reward
Next, we present the effect of similarity reward defined in Equation (10) (Section 3.2), and show how we set the similarity reward.
In practice, to accelerate the optimization procedure [32], [33], we want to make the average reward R(X, Y) close to 0, so we subtract its average from the original value. For the similarity reward, this gives R(X, Y) = sim(X, Y) − C, (24) where C ∈ ℝ is a hyperparameter that we want to be close to the average of all the similarity values.
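A minimal sketch of this centering trick, assuming the similarity score lies in [0, 1]; subtracting a constant C close to the mean similarity keeps the average reward near zero, which reduces the variance of policy-gradient updates:

```python
# Sketch of the reward-centering trick: subtracting a constant C (chosen
# close to the mean similarity) from the raw similarity keeps the average
# reward near zero, lowering the variance of policy-gradient updates.

def similarity_reward(sim_xy, C=0.3):
    """Centered similarity reward: R(X, Y) = sim(X, Y) - C."""
    return sim_xy - C

# When C tracks the batch mean, the centered rewards average near zero:
sims = [0.25, 0.30, 0.35]
rewards = [similarity_reward(s, C=0.3) for s in sims]
```

The same centering is applied to the size deviation penalty.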
Two hyperparameters play a crucial role in the empirical performance of the similarity reward: (1) the weight of the similarity reward in the overall objective, w_sim ∈ ℝ+, in Equation (10); and (2) the hyperparameter C ∈ ℝ in the similarity reward, Equation (24). We search the weight w_sim over {1e−4, 3e−4, 1e−3, 3e−3, 1e−2}. For C, we want it to be close to the average similarity value; during the learning procedure, the similarity increases from 0 to about 0.3-0.4, so we search C over {0, 0.1, 0.2, 0.3, 0.4}. We use grid search to find the optimal combination of hyperparameters. For all possible combinations, the similarity and property of the generated molecules are shown in Table 5. w_sim = 0 corresponds to the baseline method (VJTNN), which does not use the similarity reward. We find that most (w_sim, C) combinations outperform the baseline, validating the effectiveness of adding the similarity reward.
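The grid search over (w_sim, C) can be sketched as follows; `evaluate` is a hypothetical placeholder standing in for training a model with the given hyperparameters and returning a validation score:

```python
# Sketch of the grid search over (w_sim, C). `evaluate` is a placeholder
# for training with the given hyperparameters and returning a validation
# score (e.g. success rate); higher is better.
import itertools

W_SIM_GRID = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
C_GRID = [0.0, 0.1, 0.2, 0.3, 0.4]

def grid_search(evaluate):
    best = None
    for w_sim, C in itertools.product(W_SIM_GRID, C_GRID):
        score = evaluate(w_sim, C)
        if best is None or score > best[0]:
            best = (score, w_sim, C)
    return best  # (best_score, best_w_sim, best_C)
```

Each grid point requires a full training run, which is why the grids are kept small.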
In addition, we show more visualization results in Fig. 5. For each weight w_sim ∈ {1e−4, 3e−4, 1e−3, 3e−3, 1e−2}, we show the change in performance (both similarity and property) with different values of C (Equation (24)) in Figs. 5a and 5b. Likewise, for each C ∈ {0, 0.1, 0.2, 0.3, 0.4}, we show the change in performance with various w_sim in Figs. 5c and 5d. We find that (i) w_sim = 1e−3 performs best among all settings, especially for the improvement in similarity; (ii) C = 0.3 yields the best performance; (iii) within a reasonable range, the similarity increases as we increase the weight w_sim; and (iv) too large a w_sim degrades the performance, even below the baseline method (VJTNN).
For the QED dataset, the optimal combination of hyperparameters is w_sim = 1e−3 and C = 0.3, as mentioned above. This combination is also validated to be optimal on the DRD2 dataset. For the LogP dataset, the optimum is w_sim = 1e−3 and C = 0.4.

Q3: The Positive Impact of Size Deviation Penalty
Next, we demonstrate the effect of the size deviation penalty on the quality of the generated molecules. The selection of the reward weight w_size and the reward function g (Equation (14)) are both important to performance. We compare the performance of different size deviation penalty functions, where x and y denote the sizes of the input X and the generated Y, respectively.
• g1, a reward function that discourages generating small molecules. It sets a global threshold S for all molecules, so it does not depend on the input molecule X: g1(x, y) = S − y if y < S, and 0 otherwise. (25) It assigns a large penalty when the size of Y is small. We set S = 10 because it achieves the best performance; the average number of substructures in the training set is around 14, as shown in Table 3.
• g2, a reward function that discourages generating large molecules. Similar to g1, it sets a global threshold T for all molecules: g2(x, y) = y − T if y > T, and 0 otherwise, (26) which returns a large penalty for large molecules. T = 17 achieves the best performance.
• g3, a reward function that discourages generating a molecule that is small relative to the input molecule X: g3(x, y) = x − y − ϵ if y < x − ϵ, and 0 otherwise. (27) It returns a large penalty when the size of Y is much smaller than the size of X.
• g4, a reward function that discourages generating a molecule that is large relative to the input molecule X. Similar to g3, it depends on the size of X: g4(x, y) = y − x − ϵ if y > x + ϵ, and 0 otherwise. (28)
• g5 (g5 = g in Equation (14) in Section 3.3) can be seen as a combination of g3 and g4. It produces a penalty when the size of the generated Y deviates significantly from that of the input molecule X: g5(x, y) = |x − y| − ϵ if |x − y| > ϵ, and 0 otherwise. (29)
For g3, g4, and g5, ϵ = 3 performs best among ϵ ∈ {1, 2, 3, 4, 5} on QED, so it is fixed to 3.
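The five penalty functions can be sketched in a few lines of Python. This sketch assumes g4 mirrors g3 for oversized outputs and g5 penalizes |x − y| beyond ϵ, with S, T, and ϵ defaulting to the best-performing values reported:

```python
# Sketch of the five size-deviation penalty functions, with x = size(X)
# and y = size(Y). S, T and eps default to the best-performing values
# reported in the text. g4 is assumed to mirror g3 for oversized outputs,
# and g5 penalizes |x - y| beyond eps.

def g1(x, y, S=10):   # penalize small molecules (global threshold S)
    return S - y if y < S else 0

def g2(x, y, T=17):   # penalize large molecules (global threshold T)
    return y - T if y > T else 0

def g3(x, y, eps=3):  # penalize Y much smaller than the input X
    return x - y - eps if y < x - eps else 0

def g4(x, y, eps=3):  # penalize Y much larger than the input X
    return y - x - eps if y > x + eps else 0

def g5(x, y, eps=3):  # combination of g3 and g4: penalize |x - y| beyond eps
    return abs(x - y) - eps if abs(x - y) > eps else 0
```

Note that g1 and g2 ignore x entirely, while g3-g5 penalize relative to the input size, which is why g5 transfers best across inputs of different sizes.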
For each size deviation penalty g, we plot the change in performance (both similarity and property) with various weights w_size on the QED dataset in Fig. 6. We find that (i) g5 (presented in Section 3.3) with w_size = 1e−3 achieves the best performance; and (ii) most hyperparameter settings outperform VJTNN (the baseline). Therefore, the effect of the size deviation penalty is positive. Moreover, this selection is also validated to be optimal on both the DRD2 and LogP datasets.

Conclusion
In this paper, we have proposed to incorporate molecule-level reward functions (MOLER) into deep generative models for molecule optimization. Specifically, we have designed two molecule-level reward functions motivated by empirical observations: a similarity reward that encourages the generated molecule to be similar to the input, and a size deviation penalty that explicitly controls the size of the generated molecule. MOLER provides a general and flexible framework that can incorporate various reward functions to specify different aspects of the generated molecules on top of any deep generative model for molecule optimization. Policy gradient is applied to optimize the reward objective, incurring little extra computational cost compared with the underlying generative models. Thorough empirical studies have been conducted on several real molecule optimization tasks to validate the effectiveness of MOLER.
In the future, we plan to explore the following directions: (1) designing more reward functions that can improve the molecule optimization procedure; and (2) decomposing molecule-level rewards to the substructure level to make the learning procedure easier and more efficient.
Figure and Table Captions
• Property improvement as a function of the size difference between the source molecule X and the target molecule Y.
• MOLER framework. (1) First, the generative model: a molecular graph is transformed into a scaffolding tree, and both are encoded. The tree decoder and graph decoder are then applied in sequence. In the tree decoder, topological prediction generates the structure of the scaffolding tree, while substructure prediction predicts the substructure labeling each node of the scaffolding tree. Afterwards, in the graph decoder, the scaffolding tree is assembled into a molecule. SGD is applied to optimize the generative model. (2) Then, in MOLER, we use the generative model to generate molecules and evaluate their rewards, where the two reward functions (similarity reward and size deviation penalty) assign a high reward to a molecule that is (a) similar to the input molecule X and (b) close to X in molecular size. Policy gradient is used to optimize the reward.
• A case study on the QED dataset: VJTNN generates a small molecule with poor performance, while VJTNN+MOLER generates a molecule that achieves much better similarity and QED score.
• Selection of (w_sim, C) on the QED dataset for the similarity reward (Section 3.2). We find that (i) w_sim = 1e−3 and C = 0.3 perform best among all hyperparameters, especially for the improvement in similarity; (ii) within a reasonable range, the similarity increases as the weight w_sim increases; however, too large a w_sim degrades the performance, even below the baseline method (VJTNN, dashed line).
• Empirical effect of the different size deviation penalty functions (g1 to g5, Section 4.9) on QED using VJTNN+MOLER. The dashed lines correspond to the baseline VJTNN (weight of the reward set to 0). g5 with w_size = 1e−3 obtains the best performance among all selections.
• QED and DRD2 are important chemical properties for drug molecules. "Small molecule" covers generated molecules with fewer than eight substructures, while "large molecule" covers those with more than 18 substructures. "Similarity" is the similarity between the input and generated molecule. "Property" (either QED or DRD2) is the average property score of the generated molecules. Both scores are basic metrics for molecule optimization, ranging from 0 to 1 with higher being better. We report average results on the test set. We find that (1) the similarity score has more room for improvement than the property scores; and (2) performance degrades on both small and large molecules.
• Dataset statistics. The vocabulary is the set of all substructures; "Avg # Substructures" is the average number of substructures per molecule; η1 is the threshold for the similarity constraint defined in Equation (23). QED, DRD2, and LogP are not only the chemical properties studied in this paper; they also name the datasets, e.g., the QED dataset is used to learn a model for improving the QED score of a molecule.
• Main results. MOLER-based methods achieve the best performance in all settings, with CORE+MOLER being the most competitive; MOLER provides significant and consistent improvement of up to 19% across different generative models. In each column, the best performance is highlighted in bold.
• Average Sim(X, Y) and QED(Y) are reported. We use grid search to find the best hyperparameter combination for the similarity reward. The optimal selection is w_sim = 1e−3 and C = 0.3, reaching the best similarity and property improvement at the same time.