# Learning Vehicle Routing Problems using Policy Optimisation

## Abstract

Deep reinforcement learning (DRL) has been used to learn effective heuristics for solving complex combinatorial optimisation problems via policy networks and has demonstrated promising performance. Existing work has focused on (vehicle) routing problems because they strike a good balance between non-triviality and difficulty. State-of-the-art approaches learn a policy using reinforcement learning, and the learnt policy acts as a pseudo solver. These approaches have demonstrated good performance in some cases, but given the large search space of typical combinatorial/routing problems, they can converge too quickly to a poor policy. To prevent this, we propose an approach named entropy regularised reinforcement learning (ERRL) that supports exploration by encouraging more stochastic policies, which tends to improve optimisation. Empirically, the low variance of ERRL makes RL training fast and stable. We also apply a combination of local search operators at test time, which significantly improves the solutions and complements ERRL. We qualitatively demonstrate that, for vehicle routing problems, a policy with higher entropy can smooth the optimisation landscape, making it easier to optimise. Quantitative evaluation shows that the performance of the model is comparable with state-of-the-art variants, and our experiments illustrate that the model produces state-of-the-art performance on variants of vehicle routing problems: the Capacitated Vehicle Routing Problem (CVRP), Multiple Routing with Fixed Fleet Problems (MRPFF), and the Travelling Salesman Problem (TSP).

## Introduction

Combinatorial optimisation (CO) problems are hard, as they involve finding the optimal solution under various constraints. A conventional approach to solving these problems involves modelling the problem as a mathematical objective, selecting an appropriate solver, and then optimising its parameters for the problem instance at hand. While this approach has been successful, it requires high levels of optimisation expertise and domain knowledge, limiting its widespread use. Moreover, the choice of solver and its optimal parameters varies across problem instances: when the problem instance changes, the process of searching for an appropriate solver and parameters is often restarted. This has raised interest in the level of generality at which optimisers operate. In recent years there has been increasing interest in treating combinatorial optimisation as a learning problem, in which optimisation instances and their solutions are used as training instances; the resulting learnt model is then considered a general solver. The models typically involve deep neural networks (DNNs), and recent state-of-the-art approaches adopt a reinforcement learning strategy for cases where supervised solutions are not plausible; the learnt policy then acts like a solver. In particular, deep reinforcement learning (DRL) with policy gradients can find close-to-optimal solutions to problems such as the TSP, CVRP, and 0-1 knapsack (all problems we consider are described in the Appendix). The previously mentioned state-of-the-art methods use standard policy gradient methods (PGM) built upon the REINFORCE algorithm. For standard reinforcement learning (RL) problems, the search space (landscape) is smaller and smoother, and optimisation is not difficult. For combinatorial optimisation problems such as the VRP and TSP, however, the search space, while not infinite, is difficult to traverse.
Therefore, for combinatorial optimisation problems, we need an effective method for moving through the search space.

The issue with applying standard policy gradient methods to CO problems is the fluctuating landscape of the problem, which can prevent proper gradient estimation (two nearby samples can have very different gradients) and makes the problem difficult to optimise: averaging over such samples produces high-variance gradient estimates. Additionally, in RL a situation often occurs in which the agent discovers a strategy that achieves a better reward than when it started, but that strategy makes it difficult to find the optimal solution, and the agent tends to take a single move over and over. As learning progresses, the agent's move distribution concentrates around that prediction, so it becomes unlikely to explore different actions. Consequently, instead of using standard PGM, our line of research is devoted to the question: how can we explore the search space effectively and efficiently to improve the solution of a combinatorial optimisation problem? A perfectly logical answer is to use an exploration strategy that covers the search space better; we need a method able to explore the search space effectively, which is not the case in previous learning methods. To address this, we add the entropy of the policy to the RL objective. The goal is to find the probability distribution with the highest entropy, which is one of the best representations of the current state. Such an exploration strategy can be used in RL wherever a neural network serves as function approximator. In this line of research, we contribute to the deep learning community for COP by introducing an entropy regularisation term into recent policy gradient-based methods (ERRL), which improves policy optimisation.
Entropy regularisation is believed to assist exploration by encouraging the selection of more stochastic policies.

Consequently, in this work we analyse the claim that a policy with higher entropy can change the optimisation landscape and maintain exploration, discouraging early convergence. To the best of our knowledge, the entropy regularisation term has not been studied or used in the existing learning-to-optimise literature for combinatorial problems. ERRL can be integrated with any existing policy gradient approach that uses parameterised functions to approximate policies; hence we apply the entropy technique to the state-of-the-art methods of Kool et al. and Nazari et al. We demonstrate the effectiveness of ERRL on three categories of routing problems. The goal of this work is not to outperform every existing state-of-the-art VRP learning algorithm in every respect but to provide direction in the study of RL approaches that encourage exploration on fundamental routing problems, given the aforementioned difficulties. The main contributions are as follows:

- We propose an approach using an entropy regularisation term that can solve route optimisation problems. We devise a new exploration-based, low-variance policy gradient method whose baseline helps select a more stochastic policy. The proposed method is verified on multiple types of routing problems: the capacitated vehicle routing problem (CVRP), multiple routing with fixed fleet problems (MRPFF), and travelling salesman problems (TSP).
- The generality of the proposed scheme is validated with different approaches and by evaluating the resulting method on various problem sizes (even at the high problem dimensionality of 100), achieving performance better than the state of the art in accuracy and time efficiency.
- As a further contribution, we use the 2-opt local search algorithm. This hybrid approach is an example of combining learned and traditional heuristics to improve solutions.

We also analyse existing inference techniques to show the impact of post-processing techniques on solution quality.

In recent years, many ways to solve COPs within the deep learning paradigm have been developed. Traditional heuristics for routing problems can be categorised as construction or improvement heuristics. A recent advance in neural networks is the design of a new model architecture called the Pointer Network (PN), which learns to solve combinatorial optimisation problems: an encoder (RNN) converts the input sequence, which is fed to a decoder (RNN). The model attends over the input and is trained in a supervised setting to solve Euclidean TSP instances; the goal is to use the Pointer Network architecture to find close-to-optimal tours from ground-truth optimal (or heuristic) solutions to the Travelling Salesman Problem (TSP). Another model takes a graph as input and extracts features from its nodes and edges; it can be viewed as a stack of several graph convolutional layers. The network outputs an edge adjacency matrix representing the probabilities of edges occurring in the TSP tour; these edge predictions, forming a heat map, are then transformed into a valid tour. That model is trained in a supervised manner on pairs of problem instances and optimal solutions. Despite these promising early applications, reinforcement learning is a compelling choice for learning to optimise, as it does not require a set of pre-solved solutions for training. Bello et al. first proposed a reinforcement learning approach in which a pointer network is trained with an actor-critic strategy to generate solutions for artificial planar TSP instances, designing a neural combinatorial optimisation framework that uses reinforcement learning to optimise the policy. S2V-DQN solves optimisation problems using a graph embedding structure and a deep Q-learning algorithm.
Recently, many deep learning-based approaches have appeared, but only a few propose a solution to the VRP. These approaches apply a deep RL model that generates solutions sequentially, one node at a time. Among constructive heuristics, Nazari et al. proposed a model that uses a recurrent neural network (RNN) decoder with an attention mechanism to build solutions for the CVRP and the SDVRP, trained using policy gradient (actor-critic) methods similar to Bello et al. For solution search they used beam search with a beam width of up to 10.

## Related Work

A graph attention network similar to that of Kool et al. generates solutions for different routing problems, including the TSP and CVRP, trained via RL; the model is trained with policy gradient RL using a baseline based on a deterministic greedy rollout. Our work can be classified as a constructive method for solving CO problems, but it differs from previous work. First, we perform entropy maximisation by adding an entropy regularisation term to the RL objective, preventing premature convergence of the policy and changing the gradient. Second, we combine classical heuristics with the learned model to improve solution quality further. To promote exploration of the search space, we need an effective exploration-based algorithm; the regularisation term is the fundamental property distinguishing our approach from the rest. Entropy Regularised RL (ERRL) can be summarised as follows: instead of learning deterministically and committing to decisions early, we show that stochastic state-space models can be learned effectively with a well-designed network that encourages exploration. Other work focuses on iteratively improving solutions: for example, one RL-based improvement approach iteratively chooses a region of a graph representation of the problem and then selects and applies established local heuristics; a perturbation operator further improved this approach. After training the neural network, existing techniques can be applied to improve the quality of the solutions it generates: in Bello et al., active search optimises the policy on a single test instance; Bello et al. also use a sampling method that selects the best among multiple candidate solutions; beam search is another widely used technique for improving the efficiency of sampling; and local search operators are popular for further enhancing solution quality.
Following this line, in ERRL we apply the classical 2-opt local search operator to post-process and improve the solution. In addition, we combine several inference techniques with our ERRL method to show the importance of search in ML-based approaches to combinatorial optimisation.

## Motivation

In recent work, policy gradient methods (PGM) have been used to solve CO problems. The key idea in policy optimisation is to learn the parameters $\theta$ of a policy $\pi_{\theta}(a|s)$, where $s \in S$ is a state and $a \in A$ an action. The policy gradient method considers the objective:
$$J_{ER}(\theta) = \sum_{s\in S} d^{\pi_{\theta}}(s) \sum_{a\in A}\pi_{\theta}(a|s) Q^{\pi_{\theta}}(s,a)$$
where $d^{\pi_{\theta}}$ is the stationary distribution of states and $Q^{\pi_{\theta}}(s_t,a_t)$ is the expected discounted sum of rewards obtained by starting in state $s$, taking action $a$, and thereafter sampling actions according to the policy, $a\sim\pi_{\theta}(\cdot|s_t)$; it is the value function of the policy $\pi$ and can be estimated by Monte Carlo rollouts. We seek the parameters $\theta$ that maximise the objective $J_{ER}$, i.e. the expected discounted cumulative reward; the equation characterises a policy with the highest expected reward from the agent's actions. However, many issues are encountered when applying current PGM approaches to combinatorial optimisation, which is not easy. One way to characterise the difficulty of an optimisation problem is through its search space, also known as its landscape; candidate solutions are points on this landscape. Exact transition and reward dynamics are not accessible, so the gradient of $J_{ER}$ given by the policy gradient theorem cannot be evaluated directly; instead, $\nabla J_{ER}$ is estimated from Monte Carlo samples, where each trajectory is defined as a sequence of states and actions. Previous learning-based models never give the agent prior knowledge of previously visited states, yet it is perfectly logical that an agent needs experience of such states to achieve maximum reward (this encourages exploration). Consequently, the RL objective must be changed so that the model can achieve the highest expected reward with low variance when solving COPs.
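
The Monte Carlo estimation described above can be illustrated with a toy sketch (not the paper's implementation): a one-step softmax policy whose REINFORCE-style gradient $\nabla J = \mathbb{E}_{a\sim\pi}[R(a)\,\nabla\log\pi(a)]$ is estimated from sampled actions. The policy, reward table, and sample count are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, reward_per_action, n_samples=10000, seed=0):
    """Monte-Carlo estimate of grad J = E_{a~pi}[ R(a) * grad log pi(a) ]
    for a one-step softmax policy (a toy stand-in for a routing policy)."""
    rng = np.random.default_rng(seed)
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(pi), p=pi)
        glogp = -pi.copy()          # grad log pi(a) for softmax: one_hot(a) - pi
        glogp[a] += 1.0
        grad += reward_per_action[a] * glogp
    return grad / n_samples
```

With two actions, rewards (1, 0) and uniform initial logits, the estimate pushes probability mass toward the rewarded action, mirroring how the routing policy is nudged toward shorter tours.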

## Entropy Regularised Reinforcement Learning (ERRL)

### Encourage Exploration

In our model, we are given a problem instance $s$ (a set of nodes $(n_1,\cdots,n_m)$) and a policy $\pi$ parameterised by $\theta$. The policy is trainable and produces a valid solution to the problem. A solution is written as $L = (a_1,\cdots, a_m)$, where the $i$-th action $a_i$ selects a node of the tour (solution). The neural network generates the solution one node at a time, in an autoregressive manner, stochastically following the policy $\pi_{t} = P_{\theta}(a_t | s)$, where $s$ is the problem instance (defined as the state).
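
The autoregressive construction just described can be sketched minimally as follows. This is not the paper's network: `policy_logits` is a hypothetical callable that scores all $m$ nodes given the partial tour; visited nodes are masked before sampling, so the result is always a valid permutation.

```python
import numpy as np

def construct_tour(policy_logits, rng=None):
    """Autoregressively sample a tour: at each step, mask visited nodes,
    softmax the remaining logits, and sample the next node.
    policy_logits(partial_tour) -> scores over all m nodes (assumed interface)."""
    rng = rng or np.random.default_rng(0)
    m = np.asarray(policy_logits([]), dtype=float).size   # infer node count
    tour, visited = [], np.zeros(m, dtype=bool)
    for _ in range(m):
        logits = np.asarray(policy_logits(tour), dtype=float).copy()
        logits[visited] = -np.inf                 # forbid revisiting nodes
        z = logits - logits.max()
        p = np.exp(z)
        p /= p.sum()
        a = int(rng.choice(m, p=p))
        tour.append(a)
        visited[a] = True
    return tour
```

Even with a constant (uninformative) scorer, the masking guarantees that each node is selected exactly once.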

The ERRL method brings an exploration strategy to CO problems. Our ERRL model starts from a node and explores, as shown in the figure. The network samples N trajectories ($l^1,l^2,\cdots,l^N$) using the Monte Carlo method, where the $i$-th trajectory is denoted $l^i$. The figure illustrates how the ERRL model learns from previously visited states (in our case, this previously stated data is the experience of the agent). Consider two distributions over the Q-values of four actions: distribution 1 = $(a_0{:}~0.26,\ a_1{:}~0.28,\ a_2{:}~0.24,\ a_3{:}~0.22)$ and distribution 2 = $(a_0{:}~0.1,\ a_1{:}~0.08,\ a_2{:}~0.8,\ a_3{:}~0.12)$. In the first distribution all probabilities are similar and low; in the second, $a_2$ has a high probability while the other actions have low ones. In some cases the agent keeps using an action in the future merely because it already achieved some positive reward when it started. The agent then does not tend to explore, even though another action could have a much higher reward; it exploits what it has already learned. As a result the agent can get stuck in a local optimum, never examining the behaviour of other actions and never finding the global optimum. By adding entropy, we encourage exploration and avoid getting stuck in local optima. The entropy-augmented conventional RL objective is formalised in the next subsection; in this work, entropy maximisation is carried out by adding an entropy regularisation term to the RL objective. Hence, when all actions are nearly equally good, the entropy regularisation in ERRL improves policy optimisation, maximising reward while improving exploration.
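
The two example distributions above can be compared directly by their Shannon entropy. A small sketch (the quoted figures for distribution 2 do not sum exactly to one, so they are normalised first; natural log is assumed):

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum p log p (natural log)."""
    total = sum(probs)
    probs = [p / total for p in probs]   # normalise the approximate figures
    return -sum(p * math.log(p) for p in probs if p > 0)

near_uniform = [0.26, 0.28, 0.24, 0.22]   # distribution 1: all actions similar
peaked       = [0.10, 0.08, 0.80, 0.12]   # distribution 2: a_2 dominates
```

The near-uniform distribution has entropy close to the maximum $\ln 4 \approx 1.386$, while the peaked one is far below it, which is exactly the signal the entropy bonus rewards.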

We augment the reward with an entropy regularisation term, so that the agent receives extra reward in proportion to the entropy of $\pi$:

$$J(\theta) = \mathbb{E}_{P_{\pi_{\theta}}}\Big[\sum_{t=1}^{T} r(s_t,a_t) + \alpha\, H\big(\pi_{\theta}(\cdot|s_t)\big)\Big]$$

This supports exploration by encouraging the selection of more stochastic policies. Here $\pi$ is a policy, $\pi^*$ is the optimal policy, $T$ is the number of time steps, $s_t \in S$ is the state at time step $t$, $a_t \in A$ is the action at time step $t$, and $P_{\pi}$ is the distribution of trajectories induced by policy $\pi$. $H(\pi_{\theta}(\cdot|s_t))$ is the entropy of the policy $\pi$ at state $s_t$, calculated as $H(\pi_{\theta}(\cdot|s_t)) = -\mathbb{E}_{a\sim\pi_{\theta}(\cdot|s_t)}[\log \pi_{\theta}(a|s_t)]$, and $\alpha$ is a temperature parameter that controls the strength of the entropy regularisation term, i.e. the trade-off between optimising for the reward and for the entropy of the policy; the entropy bonus plays an important role in the reward (the value of $\alpha$ is discussed in the experiments section). Maximising this objective helps the agent explore more and acquire knowledge of the best representation of the state, which is likely to be a probability distribution with high entropy; in other words, the agent benefits from the experience of previously visited states. The entropy term, defined over the outputs of the policy network, is incorporated into the loss function of the policy network, so that exploration supports maximising the reward: adding an entropy term to the loss encourages the policy to take diverse actions. The entropy $H(\pi(a_t|s))$ is used in the loss function to encourage the policy $\pi_t = \pi_{\theta}(a_t|s)$.

We use the entropy regularisation term with the current approaches of Kool et al. and Nazari et al., augmenting the rewards with the entropy term $H(\pi(\cdot|s_{t})) = \mathbb{E}_{a\sim\pi(\cdot|s_{t})}[-\log \pi(a|s_t)]$, weighted by $\alpha$. This helps exploration by encouraging the selection of more stochastic policies (over deterministic ones); it also prevents premature convergence and makes learning stable and sample efficient, as we demonstrate in the experiments section. In the objective above, $Q^{\pi_\theta}(s,a)$ becomes $Q^{\alpha,\pi_\theta}(s,a)$, the expected discounted sum of entropy-augmented rewards, which can be estimated by executing $\pi_{\theta}$ in the environment. This yields a slightly different value function, since entropy is added at every time step. The term encourages the policy to assign equal probability to actions with the same (or nearly the same) total expected reward, exploring the search space, and prevents the agent from repeatedly selecting one particular action that might exploit inconsistencies in the approximation of $Q^{\pi_\theta}(s,a)$, which can suffer from high variance; instead, exploration is explicitly encouraged. The expected discounted sum of rewards $Q^{\pi}(s_t,a_t)$ depends on the policy $\pi_{\theta}$, so any change in the policy is reflected in both the objective and the gradient.
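
A minimal sketch of the entropy-augmented loss described above, assuming per-tour log-probabilities and per-step action distributions are available from the policy network (the names and shapes are illustrative assumptions; $\alpha=0.3$ follows the value reported in the experiments section):

```python
import numpy as np

def errl_step_loss(logp_tours, returns, baseline, probs, alpha=0.3):
    """Entropy-regularised REINFORCE loss for a batch of N sampled tours.

    logp_tours: log P_theta(l^i | s) for each sampled tour, shape (N,)
    returns:    R(l^i) for each tour, shape (N,)
    baseline:   b(s), a scalar or shape-(N,) baseline
    probs:      per-step action distributions, shape (N, T, A)
    alpha:      entropy temperature (assumed hyperparameter)
    """
    advantage = np.asarray(returns) - baseline               # R(l^i) - b(s)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)      # H per step, (N, T)
    # minimise the negative of the entropy-augmented objective
    return -np.mean(advantage * np.asarray(logp_tours) + alpha * ent.mean(axis=-1))
```

When the advantage is zero and the per-step distributions are uniform over $A$ actions, the loss reduces to $-\alpha \ln A$, the maximum entropy bonus.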

We train the attention model using our Entropy Regularised Reinforcement Learning, naming the result ERRL1. The problem instance $s$ is a graph, and the model is a Graph Attention Network. The attention-based model defines a stochastic policy $p(\pi|s)$ for selecting a solution $\pi$ given the problem instance $s$, parameterised by $\theta$. The network samples N solution trajectories, and we calculate the total reward $R(l^i)$ of each solution $l^i$. We use gradient ascent with the following approximation to maximise the expected return J:
$$\nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \big(R(l^i) - b^i(s)\big)\,\nabla_{\theta} \log P_{\theta}(l^i|s)$$
To train our ERRL1 model, we optimise the objective using the REINFORCE gradient estimator, augmented with an entropy term, along with a baseline. To encourage exploration and avoid premature convergence to a sub-optimal policy, we add an entropy bonus:
$$\nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \Big[\big(R(l^i) - b^i(s)\big)\,\nabla_{\theta} \log P_{\theta}(l^i|s) + \alpha\, \nabla_{\theta} H\big(\pi_\theta(\cdot|s_t)\big)\Big]$$
Here, reinforcement learning with a baseline learns both $R(l^i)$ and $b^i(s)$, where the average return serves as the baseline, with entropy bonuses included. The entropy term prevents premature convergence and results in a slightly different gradient with a changed value function; larger entropy means a more stochastic policy.
A good baseline $b^i(s)$ reduces gradient variance and therefore increases the speed of learning. After generating the solution trajectories ($l^1,l^2,\cdots,l^N$), we use a greedy-rollout baseline scheme, assessing each sample rollout independently. With this baseline, each trajectory competes with the N-1 others, and the network will not select two similar trajectories. With an increased number of heterogeneous trajectories all contributing to setting the baseline at the right level, premature convergence to a suboptimal policy is discouraged in favour of a well-explored policy. Afterwards, similarly to Kool et al., we update the parameters via Adam (adaptive moment estimation), a variant of stochastic gradient descent, on the combined objective. Details are given in the ERRL algorithm listing.
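
One plausible reading of the shared baseline described above, in which each of the N sampled trajectories competes with the other N-1, is a leave-one-out mean of the sampled returns. This is a sketch of that interpretation under stated assumptions, not necessarily the exact scheme used:

```python
import numpy as np

def shared_baseline_advantage(returns):
    """Leave-one-out baseline: each of the N sampled tours competes with the
    mean return of the remaining N-1 tours, so no single trajectory dominates."""
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    loo_mean = (returns.sum() - returns) / (n - 1)   # mean of the other N-1
    return returns - loo_mean                        # advantage per trajectory
```

A useful property of this construction is that the advantages sum to zero across the batch, which keeps the gradient estimate centred and low variance.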

## Experiments

All of our ERRL experiments use policy gradient methods. We emphasise that ERRL can be applied to any PGM; to support this claim, we apply it to the two existing algorithms introduced by Kool et al. and Nazari et al. We use the datasets described by Kool et al. for the TSP and CVRP for ERRL1 and ERRL2. For MRPFF we generate a dataset in which the shortest path connecting all N nodes $(n_1,\cdots,n_N)$ must be found, where the distance between two nodes is the 2D Euclidean distance. The location of each node is sampled uniformly at random from the unit square; in this problem the vehicle does not need to fulfil any customer demand but optimises multiple routes. We use the same architecture settings throughout all experiments and initialise parameters uniformly, as in Kool et al.; policy gradients are averaged over a batch of 128 instances. The Adam optimiser is used with a learning rate of 0.0001 and weight decay (L2 regularisation). To keep the training conditions simple and identical across experiments we do not apply a decaying learning rate, although we recommend a fine-tuned decaying learning rate in practice for faster convergence. In every epoch we process 2500 batches of 512 instances generated randomly on the fly. Training time varies with problem size; the learning curves show that most of the learning is already completed by 200 epochs. We set the entropy weight manually for all problems; results for other values are presented in the Appendix, and the best performance was achieved with $\alpha=0.3$ and learning rate 0.0001.

| Method | CVRP20 TourL | Gap(%) | Time | CVRP50 TourL | Gap(%) | Time | CVRP100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 (solver) | 6.14 | 0.00 | 28(m) | 10.39 | 0.00 | 112(m) | 15.67 | 0.00 | 211(m) |
| Random CW* | 6.81 | 10.91 | - | 12.25 | 17.90 | - | 18.96 | 20.99 | - |
| Random Sweep* | 7.01 | 14.16 | - | 12.96 | 24.73 | - | 20.33 | 29.73 | - |
| Or-tools* | 6.43 | 4.73 | - | 11.43 | 10.00 | - | 17.16 | 9.50 | - |
| L2I (improvement) | 6.12 | - | 12(m) | 10.35 | - | 17(m) | 15.57 | - | 24(m) |
| Kool et al. | 6.67 | 8.63 | 0.01 | 11.00 | 5.87 | 0.02 | 16.99 | 8.42 | 0.07 |
| Nazari et al. | 7.07 | 15.14 | 6.41 | 11.95 | 15.01 | 19 | 17.89 | 14.16 | 42 |
| ERRL1 (ours, constructive) | 6.34 | 3.25 | 0.01 | 10.77 | 3.65 | 0.02 | 16.38 | 4.53 | 0.06 |
| ERRL2 (ours, constructive) | 6.67 | 8.63 | 5.4 | 11.01 | 2.63 | 14 | 17.23 | 9.95 | 33 |
| ERRL2(2opt) (ours) | 6.18 | 0.65 | 5.49(m) | 10.56 | 1.63 | 24.28(m) | 16.16 | 3.12 | 65(m) |

### Capacitated Vehicle Routing Problem (CVRP)

In the CVRP table we group the baselines into the solver (LKH3), non-learning baselines, constructive approaches, and improvement approaches (two further algorithms, Chen and Tian and L2I, fuse the strengths of Operations Research heuristics with the learning capabilities of reinforcement learning; we report the results from their papers). Our method is directly comparable to the constructive approaches, evaluated on 1000 random CVRP instances of CVRP20, CVRP50, and CVRP100. ERRL2 slightly improves the solutions, while ERRL1 finds near-optimal solutions for all problem sizes. ERRL1 and ERRL2 combined with 2-opt significantly outperform all other learning approaches, both in solution quality and solving time, as the table shows. For all results, the learning algorithms and baselines were run using their publicly available code, except Chen and Tian and L2I. The learning curves for CVRP50 and TSP50 show that ERRL training is more stable and converges faster than the model of Kool et al.; we also observe that most of the learning is completed within 200 epochs for both problems. After each training epoch, we generate 1000 random instances to use as a validation set.

### Multiple Routing with Fixed Fleet Problems (MRPFF)

To analyse the generalisation of the ERRL method, we created a new routing problem to test how our model performs compared to existing methods. Results for the Multiple Routing with Fixed Fleet Problems (MRPFF) with 20, 50, and 100 customer nodes are reported in the MRPFF table; the ERRL models outperform all existing methods.

| Method | MRPFF20 TourL | Gap(%) | Time | MRPFF50 TourL | Gap(%) | Time | MRPFF100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 (solver) | 5.34 | 0.00 | 17(m) | 9.12 | 0.00 | 90(m) | 13.16 | 0.00 | 145(m) |
| Nazari et al. (constructive) | 6.79 | 27.15 | 6 | 10.85 | 18.96 | 18.98 | 15.97 | 21.35 | 33 |
| Kool et al. (constructive) | 5.99 | 12.17 | 0.01 | 10.30 | 12.93 | 0.03 | 14.67 | 11.47 | 0.07 |
| ERRL1 (ours) | 5.51 | 3.18 | 0.01 | 10.10 | 10.74 | 0.03 | 13.97 | 6.15 | 0.07 |
| ERRL2 (ours) | 5.70 | 6.74 | 5 | 10.40 | 14.03 | 14 | 14.40 | 9.42 | 30 |
| ERRL1(2Opt) (ours) | 5.45 | 2.05 | 2(m) | 9.39 | 2.96 | 15(m) | 13.80 | 4.86 | 23(m) |
| ERRL2(2opt) (ours) | 5.60 | 4.86 | 4.41(m) | 9.79 | 7.34 | 17(m) | 14.12 | 7.29 | 45(m) |

### Travelling Salesman Problem (TSP)

In this section we evaluate whether our performance is comparable with existing work, as previous state-of-the-art approaches typically focus on random TSP data. For the TSP, we report optimal results from Concorde and LKH3. We also compare against Nearest, Random, and Farthest Insertion, as well as Nearest Neighbour. The TSP table shows the performance of our techniques compared to the solvers, heuristics, and state-of-the-art learning techniques for various TSP instance sizes. The table is separated into four sections: solvers; heuristics; learning models using reinforcement learning (RL); and learning models using supervised learning (SL). All results were obtained using the publicly available code, except for the Graph Convolutional Network (GCN), whose results are taken from the original paper; we ran PN, Bello et al., EAN, Kool et al., and Nazari et al. ourselves, and report the results of our implementation. We achieve satisfactory results and report the average tour lengths of our approaches on TSP20, TSP50, and TSP100 in the table. Our models perform better than Nazari et al. for all TSP sizes, and our greedy ERRL decoder outperforms not only all the traditional baselines but also Kool et al. Furthermore, we report execution times for all instances. Run times are important but can vary with the implementation language (Python or C++) and the hardware (GPU or CPU); we ran all approaches on the same hardware platform, using Python implementations, since experimental results can vary across platforms. In the table, we report running times from our runs of the publicly available code; times reported by others are not directly comparable. 
We only report execution times for the directly comparable baselines. We report the average solution time (in seconds) over a test set of 1000 instances.

### ERRL combine with 2-Opt

As a further enhancement, we use the 2-opt local search algorithm to improve our results at test time. We show that the model can produce improved results through a 'hybrid' approach that combines a learned algorithm with local search; this is an example of combining learned and traditional heuristics. Many recent works have shown that the design of the search procedure has an immense impact on the performance of an ML approach; François et al. showed in their results that improvement can come from the search procedure rather than from the learning itself. With local search added, ERRL1 and ERRL2 with the 2-opt heuristic perform much better than without 2-opt, as shown in the CVRP, MRPFF, and TSP tables.
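
The 2-opt operator used at test time can be sketched as follows: repeatedly reverse a tour segment whenever the reversal shortens the tour, until no improving move remains. This is an $O(n^2)$-per-pass illustration on Euclidean coordinates, not the exact implementation used in our experiments.

```python
import math

def tour_length(tour, coords):
    """Total length of a closed tour over 2D coordinates."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, coords):
    """Reverse the segment between positions i and j whenever that shortens
    the tour; stops at a 2-opt local optimum."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(cand, coords) < tour_length(best, coords) - 1e-12:
                    best, improved = cand, True
    return best
```

For example, on the unit square a crossing tour such as 0-2-1-3 is uncrossed into the optimal perimeter tour of length 4.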

| Method | TSP20 TourL | Gap(%) | Time | TSP50 TourL | Gap(%) | Time | TSP100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| Concorde (solver) | 3.83 | 0.00 | 4(m) | 5.70 | 0.00 | 10(m) | 7.77 | 0.00 | 55(m) |
| LKH3 (solver) | 3.83 | 0.00 | 42(s) | 5.70 | 0.00 | 59(m) | 7.77 | 0.00 | 25(m) |
| Nearest Insertion (heuristic) | 4.33 | 13.05 | 1(s) | 6.78 | 18.94 | 2(s) | 9.45 | 21.62 | 6(s) |
| Random Insertion (heuristic) | 4.00 | 4.43 | 0(s) | 6.13 | 7.54 | 1(s) | 8.51 | 9.52 | 3(s) |
| Farthest Insertion (heuristic) | 3.92 | 2.34 | 1(s) | 6.01 | 5.43 | 2(s) | 8.35 | 7.46 | 7(s) |
| Or-tools (heuristic) | 3.85 | 0.52 | - | 5.80 | 1.75 | - | 8.30 | 6.82 | - |
| PN (SL) | 3.88 | 1.30 | - | 6.62 | 16.14 | - | 10.88 | 40.20 | - |
| GCN* (SL) | 3.86 | 0.78 | 6(s) | 5.87 | 2.98 | 55(s) | 8.41 | 8.23 | 6(m) |
| Bello et al. (RL) | 3.89 | 1.56 | - | 5.99 | 5.08 | - | 9.68 | 24.73 | - |
| EAN(2Opt) (RL) | 3.93 | 2.61 | 4(m) | 6.63 | 16.31 | 26(m) | 9.97 | 28.31 | 178(m) |
| Kool et al. (RL) | 3.85 | 0.52 | 0.001(s) | 5.80 | 1.75 | 2(s) | 8.15 | 4.89 | 6(s) |
| Nazari et al. (RL) | 4.00 | 4.43 | - | 7.01 | 22.76 | - | 9.46 | 21.75 | - |
| ERRL1 (ours) | 3.83 | 0.00 | 0.01(s) | 5.74 | 0.70 | 2(s) | 7.86 | 1.15 | 6(s) |
| ERRL2 (ours) | 3.86 | 0.78 | 7(s) | 5.76 | 1.05 | 18(s) | 7.92 | 1.93 | 31(s) |
| ERRL1(2opt) (ours) | 3.81 | - | 1(m) | 5.71 | 1.57 | 11.87(m) | 7.77 | 0.00 | 25(m) |
| ERRL2(2opt) (ours) | 3.83 | 0.00 | 4(m) | 5.73 | 0.52 | 13(m) | 7.80 | 0.38 | 37(m) |

### Impact of the solution search strategy

The recently developed L2I outperforms LKH3; our method differs from L2I in terms of speed, and ERRL is a purely data-driven way to solve COPs. The ERRL model uses the 2-opt local search operator, but only at test time, whereas L2I is a specialised routing-problem solver based on a handcrafted pool of improvement and perturbation operators. ERRL is a construction-type neural network model for CO problems. Previous methods, including ours, generally support two modes of inference. In greedy mode, a single deterministic trajectory is drawn using the argmax of the policy; in "sampling mode", multiple trajectories are sampled from the network following the probabilistic policy.

In the experiments, we used two different decoders: greedy search, described in Figure search and referred to as ERRL1Gr, and beam search (BS), referred to as ERRL1BS in Figure search. Our results show that beam search improves solution quality at a slight increase in computation time. We evaluated these two inference techniques, plus one inference technique combined with the 2-opt operator at inference time, and experimentally showed that ML approaches benefit from a search procedure, as presented in Figure search. With the ERRL1 model, greedy search and beam search obtain optimality gaps of 1.15% and 0.38% respectively, and combining ERRL1 with the 2-opt operator reduces the gap from 1.15% to 0 on TSP instances with 100 nodes, although the execution time is higher. Figure search shows the impact of fusing a search procedure with the ML model: we use greedy search, beam search, and greedy search combined with 2-opt on ERRL1 to demonstrate the importance of search in ML-based approaches to combinatorial optimisation.
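The two inference modes can be illustrated with a minimal NumPy sketch. Here `policy_logits_fn`, which maps a partial tour to per-node logits, is a hypothetical stand-in for the trained policy network; the masking and argmax/sampling logic is the part the text describes:

```python
import numpy as np

def greedy_decode(policy_logits_fn, n_nodes, start=0):
    """Greedy mode: a single deterministic trajectory via argmax on the policy."""
    tour, visited = [start], {start}
    while len(tour) < n_nodes:
        logits = policy_logits_fn(tour)  # shape (n_nodes,)
        # Mask already-visited nodes so they cannot be selected again.
        logits = np.where([i in visited for i in range(n_nodes)], -np.inf, logits)
        nxt = int(np.argmax(logits))
        tour.append(nxt)
        visited.add(nxt)
    return tour

def sample_decode(policy_logits_fn, n_nodes, start=0, rng=None):
    """Sampling mode: each step is drawn from the softmax of the masked logits."""
    rng = rng or np.random.default_rng(0)
    tour, visited = [start], {start}
    while len(tour) < n_nodes:
        logits = policy_logits_fn(tour)
        logits = np.where([i in visited for i in range(n_nodes)], -np.inf, logits)
        probs = np.exp(logits - logits[np.isfinite(logits)].max())
        probs /= probs.sum()
        tour.append(int(rng.choice(n_nodes, p=probs)))
        visited.add(tour[-1])
    return tour
```

In sampling mode, the best of many sampled trajectories is typically reported; beam search generalises greedy mode by keeping the top-k partial tours at each step.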


## Conclusion and Future Direction

In this work, we presented the ERRL model, which encourages exploration and improves performance on many optimisation problems when incorporated into a policy gradient method. We demonstrated that the entropy-augmented reward helps the model avoid local optima and prevents premature convergence, and that the model significantly outperforms existing machine-learning-based approaches. We expect that the proposed architecture is not limited to route optimisation problems; applying it to other combinatorial optimisation problems is an important direction for future research. Recent learning-based algorithms for COPs are model-free deep RL methods, which are notoriously expensive in terms of sample complexity; this challenge severely limits their applicability to real-world tasks. Our current ERRL model has lower sample complexity. Nevertheless, it can be improved for real-world tasks so that it is not brittle with respect to its hyper-parameters and can balance exploration and exploitation for the task at hand. Our future goal is to develop a model that, instead of requiring the user to set the temperature manually (as we currently do for the alpha value), automates the process by reformulating the maximum-entropy reinforcement learning objective with the entropy treated as a constraint. For future work, we need effective methods that are competitive with the model-free state of the art in combinatorial domains, and more sample-efficient RL algorithms may be a key ingredient for learning on larger problems. In addition, finding the search procedure best adapted to an ML model is still an open question.

## Problems: TSP, CVRP, MRPFF

The goal of our model is to generate the minimum total route length of the vehicle, where the formulation should cover any routing problem, such as the Capacitated Vehicle Routing Problem (CVRP), the Multiple Routing Problem with Fixed Fleet size (MRPFF) and the TSP. Let G = (V, E) denote a weighted graph, where V is the set of nodes and E the set of edges. In a VRP instance, the input V is a set of nodes in which node $a_0$ represents the depot and all other nodes represent the customers that need to be visited; multiple vehicles may serve the customers. The Vehicle Routing Problem is to find a set of routes, all starting and ending at the depot, with minimal cost, such that each customer is visited by exactly one vehicle. Given a set of customers to serve, we must design routes for each available vehicle, all starting from the single depot $a_0$; each route must start at the depot, visit a subset of customers and then return to the depot. The objective is to determine a set of minimal-cost routes that satisfy all the requirements above. With these parameters, the CVRP is formulated as

$$\min \sum_{a_i \in V}\sum_{a_j \in V} c_{a_ia_j}\, x_{a_ia_j}$$

subject to

$$\sum_{a_i \in V} x_{a_ia_j} = 1 \quad \forall a_j \in V \setminus \{a_0\},$$

$$\sum_{a_j \in V} x_{a_ia_j} = 1 \quad \forall a_i \in V \setminus \{a_0\},$$

$$\sum_{a_i \in V} x_{a_ia_0} = K, \qquad \sum_{a_j \in V} x_{a_0a_j} = K,$$

$$\sum_{a_i \notin S}\,\sum_{a_j \in S} x_{a_ia_j} \geq r(S) \quad \forall S \subseteq V \setminus \{a_0\},\ S \neq \emptyset.$$

Here $c_{a_ia_j} \in \mathcal{R}^{+}$ is the cost of travelling from node $a_i$ to node $a_j$, and $x_{a_ia_j} \in \{0,1\}$, with $a_i, a_j \in V$, is a binary variable that takes the value 1 if the edge from $a_i$ to $a_j$ is part of the solution and 0 otherwise. K is the number of available vehicles, and r(S) is the minimum number of vehicles needed to serve the set S. The last family of constraints are the capacity-cut constraints, which impose that the routes must be connected and that the demand on each route must not exceed the vehicle capacity.
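To make the objective and constraints concrete, the following sketch evaluates a candidate CVRP solution against them (the helper names `route_cost` and `cvrp_cost` are our own illustration, assuming the depot is node 0 and each inner list is one vehicle's route):

```python
import math

def route_cost(route, coords):
    """Total Euclidean length of one depot-to-depot route (depot is node 0)."""
    path = [0] + route + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))

def cvrp_cost(routes, coords, demands, capacity):
    """Sum of route lengths; raises if a coverage or capacity constraint fails."""
    served = [c for r in routes for c in r]
    # Degree constraints: each customer appears in exactly one route, once.
    assert sorted(served) == list(range(1, len(coords))), "each customer exactly once"
    for r in routes:
        # Capacity-cut analogue: demand on each route must fit the vehicle.
        assert sum(demands[c] for c in r) <= capacity, "capacity exceeded"
    return sum(route_cost(r, coords) for r in routes)
```

The number of routes plays the role of K in the formulation, and the per-route capacity check corresponds to the capacity-cut constraints.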

We also assume that node $a_0$ is the depot(). In this work, we additionally use another variant of the VRP, the Multiple Routing Problem with Fixed Fleet size (MRPFF), where customer locations lie in a 2D Euclidean space and we must design routes for one vehicle available at a single depot, $a_0$. Each route must start at the depot, visit a subset of customers and then return to the depot. The vehicle does not need to satisfy any demands but must optimise the set of routes it creates, each starting and ending at the depot node. The MRPFF thus has no demands, and we consider only one fixed vehicle visiting two sets of customers (two sets of routes). Another routing problem is the TSP: given a set of points (cities), a tour is a sequence in which each city is visited exactly once, and the TSP is to find a tour that minimises the total travel distance between consecutive pairs of cities.
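The TSP objective just described is the closed-tour length; a minimal sketch (the function name is our own):

```python
import math

def tsp_tour_length(tour, coords):
    """Length of a closed tour: sum of distances between consecutive cities,
    including the edge from the last city back to the first."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))
```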

As an inference technique, we use 2-opt to further improve the solutions. In a 2-opt move, when two edges are removed there is only one alternative feasible reconnection; the procedure searches for edge swaps, replacing the removed edges with new ones, that yield a shorter tour. Moreover, sequential pairwise operators such as k-opt moves can be decomposed into simpler l-opt ones, where l < k; in our work we use 2-opt, the simplest such move.

## Policy Network (ERRL1)

We apply our approach on top of the neural architecture of() to improve solutions of routing problems; here we briefly describe that attention model architecture. The model consists of an attention-based encoder and a decoder network. The encoder produces embeddings of all input nodes, and the decoder produces the sequence $\pi$ over the inputs one node at a time, conditioned on the encoder embeddings and a problem-specific mask and context. Once a partial tour has been constructed it cannot be changed, and the remaining task is to find a path from the last node through the unvisited nodes back to the first node; the decoder context therefore includes the embeddings of the first and last nodes. The attention-based encoder embeds the input nodes and processes them through N sequential layers, each consisting of a multi-head attention sub-layer and a feed-forward sub-layer, and the graph embedding is computed from the node embeddings. The attention layers follow the Transformer architecture(): each has two sub-layers, a multi-head attention layer that performs message passing between the nodes, and a node-wise fully connected feed-forward layer. The decoder constructs the solution sequentially, outputting the node $\pi_t$ at each time step based on the embeddings from the encoder. As in(), the graph is augmented with a special context node representing the decoding context, and output probabilities are computed by a final decoder layer with a single attention head. We apply our approach to the() model and report improved results on a number of combinatorial optimisation (routing) problems.
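The final single-head output step of such a decoder can be sketched roughly as follows (a simplified NumPy illustration with dot-product compatibilities and tanh-clipped logits; `decoder_step` and its signature are our own, not the original implementation):

```python
import numpy as np

def decoder_step(query, node_emb, visited, clip=10.0):
    """One decoding step: single-head attention turns the context query into
    output probabilities over the unvisited nodes."""
    d = node_emb.shape[1]
    scores = node_emb @ query / np.sqrt(d)  # compatibility with the context
    logits = clip * np.tanh(scores)         # clip logits to [-clip, clip]
    logits[list(visited)] = -np.inf         # mask nodes already in the tour
    m = logits[np.isfinite(logits)].max()   # stable softmax over the rest
    probs = np.exp(logits - m)
    return probs / probs.sum()
```

In the real model, `query` would be built from the graph embedding plus the first- and last-node embeddings, and `node_emb` would come from the N encoder layers.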

## Neural Network Architecture (ERRL2)

### Policy Network

The ERRL2 experiments use the same model as Nazari et al. for both the TSP and the CVRP (we refer to this as the original Nazari et al. model). For the MRPFF we use the same problem setting as for the CVRP, except that there is no customer demand. The policy network is a sequence-to-sequence (S2S) model with an attention mechanism: inputs are fed sequentially, a decision is made at each time step, and the solution is generated as a sequence of customer locations on a 2D Euclidean plane. Given an input sequence, the model estimates the conditional probability of the output sequence, similar to(). Commonly, recurrent neural networks are used in sequence-to-sequence models to estimate this conditional probability, and the standard sequence-to-sequence setting assumes the set of output elements is fixed. Unlike that setting, however, the VRP solution (output) is a permutation of the problem nodes (input). To produce the solution, an attention mechanism is used (see, for example,()). The attention mechanism queries information from all elements of the input node set: an affinity function between each node and the current model output produces a set of scalars (aggregating many signals into one), and a softmax over these scalars yields the attention weight given to each element of the input set at each time step. We define the combinatorial optimisation problem over a given set of inputs.

Between decoding steps, some elements of the input may change. For instance, in the VRP, the remaining customer demands change over time as the vehicle visits customer nodes; we might also consider a variant in which new customers arrive or adjust their demand values over time, independently of the vehicle's decisions(). Formally, we represent each input by a sequence of tuples. We start from an arbitrary input, and at every decoding step a pointer selects one available input, which determines the input to the next decoder step, until a terminating condition is satisfied, as in Nazari et al. The inputs are given to an encoder that embeds them into latent-space vectors; these embedded vectors are combined with the output of a decoder, which points to one of the elements of the input. This process generates a sequence and ends when a terminating condition is satisfied, e.g., when a specific number of steps has been completed. In dynamic route optimisation, for example in the CVRP, the input includes all customer locations and demands as well as the depot location; the remaining demands are then updated according to the vehicle's destination and load, and the terminating condition is that no demand remains. The generated sequence may differ in length from the input, because, for example, the vehicle may have to return to the depot several times to refill. We are interested in finding a stochastic policy $\pi$ that generates the sequence so as to minimise a loss objective while satisfying the problem constraints.
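The construction loop with dynamic demand updates can be sketched as follows (illustrative only; `att_probs_fn`, which returns selection probabilities given the current state, is a hypothetical stand-in for the learned attention decoder):

```python
import numpy as np

def vrp_decode(att_probs_fn, demands, capacity, max_steps=100):
    """Sequential CVRP construction: the pointer repeatedly selects a node;
    remaining demands and the vehicle load are updated after each visit
    (depot = index 0). Terminates when no demand remains and we are back
    at the depot."""
    remaining = np.array(demands, dtype=float)  # remaining[0] = 0 (depot)
    load, pos, route = capacity, 0, [0]
    for _ in range(max_steps):
        if remaining.sum() == 0 and pos == 0:
            break  # terminating condition: all demand served, back at depot
        # Feasibility mask: serveable customers, plus the depot for refills.
        mask = (remaining > 0) & (remaining <= load)
        mask[0] = True
        probs = att_probs_fn(pos, remaining, load) * mask
        nxt = int(np.argmax(probs))  # greedy selection for this sketch
        if nxt == 0:
            load = capacity          # refill at the depot
        else:
            remaining[nxt] = 0.0     # static demand, served fully
            load -= demands[nxt]
        route.append(nxt)
        pos = nxt
    return route
```

The output sequence can be longer than the input because depot visits are interleaved with customer visits, exactly as described above.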

## Parameter-Set Testing

In this section, we trained the model using different sets of parameters and report the results in Tables net, net1 and net3 for problem sizes 20, 50 and 100, respectively, on the CVRP dataset. We evaluated the model while varying its parameters: we first trained the model on a VRP50 dataset and then tested it on VRP20, VRP50 and VRP100 instances.