[Paper Translation] Learning Vehicle Routing Problems using Policy Optimisation


Download PDF: https://arxiv.org/pdf/2012.13269v1.pdf


Learning Vehicle Routing Problems using Policy Optimisation

Abstract

Deep reinforcement learning (DRL) has been used to learn effective heuristics for solving complex combinatorial optimisation problems via policy networks and has demonstrated promising performance. Existing works have focused on solving (vehicle) routing problems, as they offer a good balance between non-triviality and difficulty. State-of-the-art approaches learn a policy using reinforcement learning, and the learnt policy acts as a pseudo-solver. These approaches have demonstrated good performance in some cases, but given the large search space of a typical combinatorial/routing problem, they can converge too quickly to a poor policy. To prevent this, we propose an approach named entropy regularised reinforcement learning (ERRL) that supports exploration by providing more stochastic policies, which tends to improve optimisation. Empirically, the low-variance ERRL makes RL training fast and stable. We also introduce a combination of local search operators at test time, which significantly improves the solutions and complements ERRL. We qualitatively demonstrate that, for vehicle routing problems, a policy with higher entropy can smooth the optimisation landscape, which makes it easier to optimise. The quantitative evaluation shows that the performance of the model is comparable with the state-of-the-art variants. In our evaluation, we experimentally illustrate that the model produces state-of-the-art performance on variants of vehicle routing problems such as the Capacitated Vehicle Routing Problem (CVRP), Multiple Routing with Fixed Fleet Problems (MRPFF) and the Travelling Salesman Problem (TSP).

Introduction

Combinatorial optimisation (CO) problems are hard because they involve finding the optimal solution under various constraints. A conventional approach to solving these problems involves modelling the problem as a mathematical objective, selecting an appropriate solver and then optimising its parameters for the problem instance at hand. While this approach has been successful, it requires high levels of optimisation expertise and domain knowledge, which limits its widespread usage. Also, the choice of solver and its optimal parameters varies across problem instances: when the problem instance changes, the search for an appropriate solver and parameters is often restarted. This has raised interest in the level of generalisation at which optimisation operates. In recent years, there has been increasing interest in treating combinatorial optimisation as a learning problem (;), where optimisation instances and their solutions are used as training instances (). The resulting learnt model is then considered a general solver. The models typically involve deep neural networks (DNNs) (), and recent state-of-the-art approaches take a reinforcement learning strategy for cases where supervised approaches/solutions are not plausible (). The learnt policy is similar to a solver (). In particular, deep reinforcement learning (DRL) and policy gradients can successfully find close-to-optimal solutions to problems such as TSP (), CVRP () and 0-1 Knapsack. All of these problems are discussed in the appendix. The previously mentioned state-of-the-art methods ( and ) use standard policy gradient methods (PGM) built upon the REINFORCE algorithm (). For standard reinforcement learning (RL) problems, the search space (landscape) is smaller and smooth, and optimisation is not difficult. However, for combinatorial optimisation problems such as VRP and TSP, the search space may not be infinite, but it is difficult to navigate. Therefore, for combinatorial optimisation problems, we need an effective method for moving through the search space.

The issue with applying standard policy gradient methods to CO problems is the fluctuating landscape of these problems, in which the gradient may not be estimated properly (two nearby samples can have very different gradients). This makes the problem difficult to optimise, because the average taken over samples may have high variance. Additionally, in many RL settings the agent may discover a strategy that achieves a better reward than when it first started, yet following that strategy may not lead to the optimal solution, and the agent tends to take the same move over and over. As learning progresses, the agent's average move distribution moves closer to predicting a single move or a few moves, so it is unlikely to explore different actions. Consequently, instead of using standard PGM, our line of research is devoted to the question: how can we explore the search space effectively and efficiently to improve the solution of a combinatorial optimisation problem? A logical answer is to use an exploration strategy for better exploration of the search space. We therefore need a method that is able to explore the search space effectively, which is not the case in previous learning methods. To achieve this, we add the entropy of the policy to the RL objective as in (). The goal is to find the probability distribution that has the highest entropy, which is taken to be one of the best representations of the current state. An exploration strategy of this kind can be used in RL where a neural network serves as the function approximator. In this line of research, we contribute to the deep learning community for COP by introducing an entropy regularisation term into recent policy gradient-based methods (ERRL). ERRL aims to improve policy optimisation (). It is believed that entropy regularisation assists exploration by encouraging the selection of more stochastic policies ().

In this work, we analyse the claim that a policy with higher entropy can change the optimisation landscape and maintain exploration, discouraging early convergence. To the best of our knowledge, the entropy regularisation term has not been studied or used in the existing learning-to-optimise literature for combinatorial problems. ERRL can be integrated with any existing policy gradient approach that uses parameterised functions to approximate policies; hence we apply the entropy technique to the state-of-the-art methods (); (). We demonstrate the effectiveness of ERRL on three categories of routing problems. The goal of this work is not to outperform every existing state-of-the-art VRP learning algorithm in every aspect, but to provide direction in the study of RL approaches that encourage exploration for fundamental routing problems, considering the aforementioned difficulties. The main contributions are as follows:
- We propose an approach using an entropy regularisation term that can solve route optimisation problems. We devise a new exploration-based, low-variance method for policy gradients, because this baseline helps to select a more stochastic policy. The proposed method is verified on multiple types of routing problems, i.e., the Capacitated Vehicle Routing Problem (CVRP), Multiple Routing with Fixed Fleet Problems (MRPFF) and the Travelling Salesman Problem (TSP).
- The generality of the proposed scheme is validated by applying it to different approaches and evaluating the resulting methods on various problem sizes (up to a problem dimensionality of 100), where it achieves outstanding performance, better than the state of the art in terms of accuracy and time-efficiency.
- As a further contribution, we use the local search algorithm 2-opt (). This hybrid approach is an example of combining learned and traditional heuristics to improve the solution. We also analyse existing inference techniques to show the impact of post-processing techniques on solution quality.

Related Work

In recent years, many deep learning methods have been developed to tackle combinatorial optimisation problems. Traditional heuristics for routing problems can be categorised as construction and improvement heuristics (). Recent advances in neural networks include the design of a new model architecture called the Pointer Network (PN) (). The Pointer Network () learns to solve combinatorial optimisation problems using an encoder (RNN) that converts the input sequence into a representation that is fed to the decoder (RNN). It uses attention over the input and is trained in a supervised setting to solve Euclidean TSP instances. The goal is to use the Pointer Network architecture to find close-to-optimal tours by learning from ground-truth optimal (or heuristic) solutions for the Travelling Salesman Problem (TSP). The model of () takes a graph as input and extracts features from its nodes and edges. It can be considered a stack of several graph convolutional layers. The output of the neural network is an edge adjacency matrix representing the probabilities of edges occurring on the TSP tour. The edge predictions, forming a heat-map, are transformed into a valid tour. The model is trained in a supervised manner using pairs of problem instances and optimal solutions. Despite these promising early applications, reinforcement learning is a compelling choice for learning to optimise, as it does not require a set of pre-solved instances for training. () first proposed a reinforcement learning approach, in which a pointer network is trained using an actor-critic reinforcement learning strategy to generate solutions for artificial planar TSP instances. They address this issue by designing a neural combinatorial optimisation framework that uses reinforcement learning to optimise the policy. S2VDQN () solves optimisation problems using a graph embedding structure and a deep Q-learning algorithm. Although many deep learning-based approaches now exist, only a few propose a solution to the VRP (,). Recent approaches apply a deep RL model that generates solutions sequentially, one node at a time. Among the constructive heuristics, () proposed a model that uses a recurrent neural network (RNN) decoder and an attention mechanism to build solutions for the CVRP and the SDVRP, and trained it using policy gradient methods (an actor-critic approach) similar to (). Solution search used beam search with a beam width of up to 10.

A graph attention network similar to () is used in () to generate solutions for different routing problems, including TSP and CVRP, trained via RL. They train their model using policy gradient RL with a baseline based on a deterministic greedy rollout. Our work can be classified as a constructive method for solving CO problems, but it differs from previous work in two ways. First, we apply entropy maximisation by adding an entropy regularisation term to the RL objective function, which prevents premature convergence of the policy and changes the gradient. Second, we combine classical heuristics with the learned policy to improve solution quality further. To promote exploration of the search space, we need an effective exploration-based algorithm; the regularisation term is the fundamental property that distinguishes our approach from the rest. Our approach, Entropy Regularised RL (ERRL), can be summarised as follows: instead of learning deterministically and committing to a decision at an early stage, we demonstrate that stochastic state-space models can be learned effectively with a well-designed network that encourages exploration. Other work focuses on iteratively improving heuristics; for example, () propose an RL-based improvement approach that iteratively chooses a region of a graph representation of the problem and then selects and applies established local heuristics. A perturbation operator further improved this approach (). After the neural network has been trained, existing techniques can be applied to improve the quality of the solutions it generates. For example, in Bello et al., active search optimises the policy on a single test instance, and a sampling method selects the best solution among multiple candidates. Beam search is another widely used technique for improving the efficiency of sampling. A popular local search operator has also been used in () to further enhance solution quality. Following (), in ERRL we apply the classical 2-opt local search operator to post-process and improve the solution. In addition, we combine several inference techniques with our ERRL method to show the importance of search in ML-based approaches to combinatorial optimisation ().
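
To make the 2-opt post-processing step concrete, here is a minimal sketch of the classical operator on a Euclidean tour. It is a textbook version, not the authors' implementation; the helper names `tour_length` and `two_opt` and the 4-node instance are hypothetical.

```python
import math

def tour_length(points, tour):
    """Total Euclidean length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[tour[k]], points[tour[(k + 1) % len(tour)]])
        for k in range(len(tour))
    )

def two_opt(points, tour):
    """Reverse tour segments for as long as a reversal shortens the tour."""
    best = list(tour)
    best_len = tour_length(points, best)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # Reversing best[i..j] swaps two edges of the tour.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cand_len = tour_length(points, candidate)
                if cand_len < best_len - 1e-12:
                    best, best_len = candidate, cand_len
                    improved = True
    return best, best_len

# Hypothetical 4-node instance: the initial tour crosses itself.
points = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
initial = [0, 1, 2, 3]
improved_tour, improved_len = two_opt(points, initial)
```

In a hybrid pipeline of the kind described above, `initial` would be the tour produced by the learned policy, and `two_opt` would be applied once at test time.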

Motivation

In recent works, PGM have been used to solve CO problems. The key idea in policy optimisation (PGM) is to learn the parameters $ \theta $ of a policy $ \pi_{\theta}(a|s) $ , $ s \in S, a \in A $ , where $ a $ is the action and $ s $ is the state. The policy gradient method () states the objective as:
$$ J_{ER}(\theta) = \sum_{s\in S} d^{\pi_{\theta}}(s) \sum_{a\in A}\pi_{\theta}(a|s) Q^{\pi_{\theta}}(s,a) $$
where $ d^{\pi_{\theta}} $ is the stationary distribution of states and $ Q^{\pi_{\theta}}(s_t,a_t) $ is the expected discounted sum of rewards obtained by starting in state $ s $ , taking action $ a $ and then sampling actions according to the policy, $ a\sim\pi(\cdot|s_t) $ . $ Q^{\pi_{\theta}}(s,a) $ (estimated by Monte Carlo) is the state-action value function under the policy $ \pi $ . We are interested in finding the parameters $ \theta $ that maximise the objective function $ J_{ER} $ , i.e., the discounted cumulative reward. The equation helps to find the policy with the highest expected reward from the agent's actions. However, current PGM approaches run into many issues on combinatorial optimisation problems, because combinatorial optimisation is not easy. One way to characterise the difficulty of an optimisation problem is through its search space, also known as its landscape; solutions to the optimisation problem are points on this landscape. It is difficult to have access to the exact transition and reward dynamics, so the gradient of $ J_{ER} $ given by the policy gradient theorem for the equation above cannot be evaluated directly. Instead, $ \nabla J_{ER} $ is estimated from Monte Carlo samples, where each trajectory is defined as a sequence. Previous learning-based models never consider an agent that has prior knowledge of previous states, yet it is logical that an agent needs experience of previous states to achieve maximum reward (that is, to be encouraged to explore). Consequently, the RL objective needs to change for COP so that the model can achieve the highest expected reward with low variance.
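
As a small worked illustration of the objective $ J_{ER}(\theta) $, the sketch below evaluates it for a tabular toy problem. The two-state, two-action numbers and the function name `policy_objective` are hypothetical; in the routing setting none of these tables is available in closed form, which is why Monte Carlo estimation is used instead.

```python
def policy_objective(d, pi, q):
    """J = sum_s d(s) * sum_a pi(a|s) * Q(s, a) for tabular inputs."""
    return sum(
        d[s] * sum(pi[s][a] * q[s][a] for a in range(len(pi[s])))
        for s in range(len(d))
    )

# Hypothetical numbers: two states, two actions.
d = [0.5, 0.5]                        # stationary state distribution d(s)
q = [[1.0, 0.0], [0.0, 2.0]]          # action values Q(s, a)
uniform = [[0.5, 0.5], [0.5, 0.5]]    # fully stochastic policy
greedy = [[1.0, 0.0], [0.0, 1.0]]     # deterministic policy picking the best action
j_uniform = policy_objective(d, uniform, q)
j_greedy = policy_objective(d, greedy, q)
```

With exact action values the deterministic policy scores higher; the argument in the text is that with noisy, high-variance estimates, committing to such a policy too early is risky.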

Entropy Regularised Reinforcement Learning (ERRL)

Encourage Exploration

In our model we are given a problem instance $ s $ (a set of nodes $ (n_1,\cdots,n_m) $ ) and a policy $ \pi $ parameterised by $ \theta $ . The policy is trainable and can produce a valid solution to the problem. A solution is a sequence $ L = \{a_1,\cdots,a_i\} $ , where the $ i $ -th action $ a_i $ chooses a node of the tour (solution). The neural network generates the solution one node at a time in an auto-regressive manner, following the stochastic policy $ \pi_{t} = P_{\theta}(a_t | s) $ , where $ t $ is the decoding step, starting from $ t=1 $ , and $ s $ is the problem instance (defined as the state).
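
The auto-regressive construction can be sketched as follows. This is only an illustration of sampling one node at a time with already-visited nodes masked out: the distance-based scores stand in for the output of the policy network, and `sample_tour` and `temperature` are hypothetical names.

```python
import math
import random

def sample_tour(points, rng, temperature=1.0):
    """Sample a tour one node at a time; visited nodes are never offered again."""
    n = len(points)
    tour = [0]                       # start node chosen arbitrarily
    log_prob = 0.0                   # log-probability of the whole sampled tour
    while len(tour) < n:
        current = points[tour[-1]]
        candidates = [k for k in range(n) if k not in tour]
        # Stand-in scores (assumption): closer nodes receive higher logits.
        logits = [-math.dist(current, points[k]) / temperature for k in candidates]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        choice = rng.choices(range(len(candidates)), weights=probs, k=1)[0]
        log_prob += math.log(probs[choice])
        tour.append(candidates[choice])
    return tour, log_prob

rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(6)]
tour, log_prob = sample_tour(points, rng)
```

The accumulated `log_prob` is the quantity whose gradient appears in a REINFORCE-style update.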

ERRL offers an exploration strategy for CO problems. Our ERRL model starts exploration from a node, as shown in the ERRL figure. The network samples $ N $ trajectories $ \{l^1,l^2,\cdots,l^N\} $ using the Monte Carlo method, where each trajectory is denoted $ l^i $ . The figure presents how the ERRL model learns from previously visited states (in our case, this prior data is the experience of the agent). It shows two distributions over the Q-values: distribution 1 with $ a_0 = 0.26, a_1 = 0.28, a_2 = 0.24, a_3 = 0.22 $ and distribution 2 with $ a_0 = 0.1, a_1 = 0.08, a_2 = 0.8, a_3 = 0.12 $ . In the first distribution all probabilities are low, whereas in the second $ a_2 $ has a high probability and the other actions have low probabilities. In such cases the agent tends to keep using that action in the future, because it already yielded some positive reward at the start. The agent therefore does not tend to explore, even though another action could yield a much higher reward; it exploits what it has already learned instead. This means the agent can get stuck in a local optimum, never exploring the behaviour of other actions and never finding the global optimum. By adding entropy, we encourage exploration and avoid getting stuck in local optima. Entropy in RL works as follows: as the agent progresses in learning the policy, the model returns a more positive reward for the state according to the agent's action. The entropy-augmented policy objective, which extends the conventional RL objective, is formalised in the entropy-regularised objective. In this work, entropy maximisation is carried out by adding an entropy regularisation term to the RL objective function. Therefore, when all actions are equally good, entropy regularisation in ERRL improves policy optimisation by maximising reward while improving exploration.
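
The difference between the two example distributions can be quantified by their Shannon entropy. The sketch below is illustrative only: the function name `entropy` is hypothetical, and because the second distribution sums to 1.1 as printed, the values are renormalised before the entropy is taken (an assumption).

```python
import math

def entropy(probs):
    """Shannon entropy in nats; the inputs are renormalised to sum to 1 first."""
    total = sum(probs)
    normalised = [p / total for p in probs]
    return -sum(p * math.log(p) for p in normalised if p > 0)

distribution1 = [0.26, 0.28, 0.24, 0.22]
distribution2 = [0.1, 0.08, 0.8, 0.12]   # sums to 1.1 as printed, hence the renormalisation
h1 = entropy(distribution1)
h2 = entropy(distribution2)
```

The near-uniform first distribution has the higher entropy, so an entropy bonus rewards it more than the peaked second one.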

Augmenting the reward with an entropy regularisation term gives the agent additional reward in proportion to the entropy of $ \pi $ , which helps exploration by encouraging the selection of more stochastic policies. In the resulting objective, $ \pi $ is a policy, $ {\pi^*} $ is the optimal policy, $ T $ is the number of time-steps, $ s\in S $ is the state at time-step $ t $ , $ a\in A $ is the action at time-step $ t $ , and $ P_{\pi} $ is the distribution of trajectories induced by policy $ \pi $ . $ H(\pi_{\theta}(\cdot|s_{t})) $ is the entropy of the policy $ \pi $ at state $ s_t $ and is calculated as $ H(\pi_{\theta}(\cdot|s_{t})) = E_{a\sim\pi_{\theta}(\cdot|s_{t})}[-\log \pi_{\theta}(a|s_t)] $ . $ \alpha $ is a temperature parameter that controls the strength of the entropy regularisation term, i.e., the trade-off between optimising for the reward and for the entropy of the policy; the entropy bonus plays an important role in the reward (the value of $ \alpha $ is discussed in the Experiments section). Adding entropy to the objective thus helps the agent to explore and to learn the best representation of the state, which is likely to be the probability distribution with the highest entropy; in this sense the agent gains experience of the prior data. The entropy term, defined over the outputs of the policy network, is incorporated into the loss function of the policy network, so that policy exploration supports reward maximisation. In other words, adding an entropy term to the loss function encourages the policy to take diverse actions. The entropy $ H(\pi(a_t|s)) $ is used in the loss function to encourage exploration by the policy $ \pi_t = P_{\theta}(a_t|s) $ .
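
The symbols defined above match the standard maximum-entropy RL objective. As a hedged reconstruction (the reward symbol $ r(s_t,a_t) $ and the lower summation limit $ t=0 $ are assumptions, since the text does not state them), that objective can be written as:
$$ \pi^{*} = \arg\max_{\pi} \; E_{\tau \sim P_{\pi}} \left[ \sum_{t=0}^{T} \Big( r(s_t,a_t) + \alpha H(\pi(\cdot|s_t)) \Big) \right] $$
Setting $ \alpha = 0 $ recovers the conventional expected-return objective, and larger values of $ \alpha $ put more weight on keeping the policy stochastic.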

We use the entropy regularisation term with the current approaches () and (), augmenting the rewards with an entropy term, $ H(\pi(\cdot|s_{t})) = E_{a\sim\pi(\cdot|s_{t})}[-\log \pi(a|s_t)] $ , weighted by $ \alpha $ . It helps exploration by encouraging the selection of more stochastic policies (over deterministic ones); it also prevents premature convergence and makes learning stable and sample-efficient, as we demonstrate in the experiment section. In the policy gradient objective, $ Q^{\pi_\theta}(s,a) $ is replaced by $ Q^{\alpha,\pi_\theta}(s,a) $ , the expected discounted sum of entropy-augmented rewards, which can be estimated by executing $ \pi_{\theta} $ in the environment. Because entropy is included at every time step, the value function in this setting differs slightly from the standard one. The term encourages the policy to assign equal probability to actions that have the same or nearly the same total expected reward, by exploring the search space. It also keeps the agent from repeatedly selecting a particular action that could exploit inconsistencies in the approximation of $ Q^{\pi_\theta}(s,a) $ , which can suffer from high variance, and instead explicitly encourages exploration. The expected discounted sum of rewards $ Q^{\pi}(s_t,a_t) $ depends on the policy $ \pi_{\theta} $ , so any change in the policy is reflected in both the objective and the gradient.
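
A single-trajectory Monte Carlo estimate of the entropy-augmented return behind $ Q^{\alpha,\pi_\theta}(s,a) $ could look like the sketch below. The function names, the discount factor `gamma` and the two-step example are hypothetical, and the exact form used by the authors may differ.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a normalised action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_augmented_return(rewards, action_probs, alpha, gamma=1.0):
    """Discounted sum of (reward + alpha * policy entropy) along one trajectory."""
    total = 0.0
    for t, (r, probs) in enumerate(zip(rewards, action_probs)):
        total += (gamma ** t) * (r + alpha * entropy(probs))
    return total

# Hypothetical two-step trajectory: rewards are negative step costs.
rewards = [-1.0, -2.0]
action_probs = [[0.5, 0.5], [1.0, 0.0]]   # policy distribution at each visited state
plain = entropy_augmented_return(rewards, action_probs, alpha=0.0)
augmented = entropy_augmented_return(rewards, action_probs, alpha=0.3)
```

The augmented return exceeds the plain one only at steps where the policy was still stochastic, which is how the bonus favours exploratory behaviour.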

We trained the attention-based model of () with our Entropy Regularised Reinforcement Learning and name the result ERRL1. The problem instance $ s $ is a graph, and the model is a Graph Attention Network (). The attention-based model defines a stochastic policy $ p(\pi|s) $ for selecting a solution $ \pi $ given the problem instance $ s $ , parameterised by $ \theta $ . The network samples $ N $ solution trajectories, and we calculate the total reward $ R(l^i) $ of each solution $ l^i $ . We use gradient ascent with the following approximation to maximise the expected return $ J $ :
$$ \nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \left(R(l^i) - b^i(s)\right)\nabla_{\theta} \log P_{\theta}(l^i|s) $$
To train our ERRL1 model, we optimise the objective using the REINFORCE () gradient estimator with a baseline and an added entropy term. To encourage exploration and avoid premature convergence to a sub-optimal policy, we add an entropy bonus:
$$ \nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \left(R(l^i) - b^i(s)\right)\nabla_{\theta} \log P_{\theta}(l^i|s) + \alpha H(\pi_\theta(\cdot|s_t)) $$
Here, reinforcement learning with a baseline uses both $ R(l^i) $ and $ b^i(s) $ , the latter serving as the baseline. The average return serves as the baseline, with the entropy bonus included. We use the entropy term to prevent premature convergence, which results in a slightly different gradient and a changed value function. This leads to larger entropy, which means the policy is more stochastic.
A good baseline $ b^i(s) $ reduces gradient variance and therefore increases the speed of learning. After generating the solution trajectories $ \{l^1,l^2,\cdots,l^N\} $ , we use the greedy-rollout baseline scheme, with each sampled rollout assessed independently. With the changed baseline, each trajectory now competes with the $ N-1 $ other trajectories, and the network will not select two similar trajectories. With the increased number of heterogeneous trajectories all contributing to setting the baseline at the right level, premature convergence to a suboptimal policy is discouraged, and training converges to a well-explored policy instead. Afterwards, similar to (), we update the parameters via ADAM (adaptive moment estimation) (), combining the previous objectives in a stochastic gradient descent (SGD) step. Details are given in the ERRL algorithm listing.
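
The update can be sketched for a toy one-step softmax policy. This is a simplified illustration under stated assumptions, not the authors' training loop: the baseline is the batch-mean return (a stand-in for the greedy-rollout baseline), the entropy bonus enters through its gradient with respect to the logits, plain gradient ascent replaces ADAM, and `errl_gradient` and the reward vector are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def errl_gradient(logits, rewards, alpha, n_samples, rng):
    """Monte Carlo REINFORCE gradient with a batch-mean baseline plus an entropy bonus."""
    probs = softmax(logits)
    n_actions = len(logits)
    actions = rng.choices(range(n_actions), weights=probs, k=n_samples)
    returns = [rewards[a] for a in actions]
    baseline = sum(returns) / n_samples        # assumption: batch mean as baseline
    grad = [0.0] * n_actions
    for a, ret in zip(actions, returns):
        advantage = ret - baseline
        for j in range(n_actions):
            # Gradient of log pi(a) w.r.t. logit j for a softmax policy.
            grad[j] += advantage * ((1.0 if j == a else 0.0) - probs[j]) / n_samples
    h = -sum(p * math.log(p) for p in probs if p > 0)
    for j, p in enumerate(probs):
        if p > 0.0:
            # Gradient of the entropy H(pi) w.r.t. logit j for a softmax policy.
            grad[j] += alpha * (-p * (math.log(p) + h))
    return grad

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]
rewards = [1.0, 0.0, 0.0, 0.0]                 # hypothetical per-action returns
for _ in range(200):
    g = errl_gradient(logits, rewards, alpha=0.3, n_samples=64, rng=rng)
    logits = [x + 0.5 * gx for x, gx in zip(logits, g)]   # plain gradient ascent
final_probs = softmax(logits)
```

In this toy setting the policy moves most of its probability to the rewarding action while the entropy term keeps the remaining actions at non-zero probability.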

Experiments

All of our ERRL experiments use PGM. We emphasise that ERRL can be applied to any PGM; to support this claim, we applied ERRL to two existing algorithms introduced by Kool et al. and Nazari et al. In this work, we used the datasets described by () for TSP and CVRP for ERRL1 and ERRL2. For MRPFF we implemented a dataset in which we need to find the shortest path connecting all $ N $ nodes $ (n_1,\cdots,n_i) $ , where the distance between two nodes is the 2D Euclidean distance. The location of each node is sampled randomly from the unit square; in this problem the vehicle does not need to fulfil any customer demand but must optimise multiple routes. We used the same architecture settings as () throughout all experiments and initialised parameters uniformly as in (); policy gradients are averaged over a batch of 128 instances. The Adam optimiser is used with a learning rate of 0.0001 and weight decay (L2 regularisation). To keep the training conditions simple and identical for all experiments we did not apply a decaying learning rate, although we recommend a fine-tuned decaying learning rate in practice for faster convergence. In every epoch we process 2500 batches of 512 instances generated randomly on the fly. Training time varies with the size of the problem. The learning curves show that most of the learning is already completed by 200 epochs. In the experiments, we set the entropy coefficient manually for all problems. Results for other values are presented in the appendix; the best performance was achieved with $ \alpha=0.3 $ and a learning rate of 0.0001.
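
Instance generation of the kind described can be sketched as below. The sketch assumes that "sampled randomly from the unit square" means independent uniform coordinates; `random_instance`, `route_length` and the fixed seed are hypothetical and used only for illustration.

```python
import math
import random

def random_instance(n_nodes, rng):
    """Sample n_nodes coordinates from the unit square (uniform sampling assumed)."""
    return [(rng.random(), rng.random()) for _ in range(n_nodes)]

def route_length(points, order):
    """2D Euclidean length of the closed route that visits `order` and returns to its start."""
    return sum(
        math.dist(points[order[k]], points[order[(k + 1) % len(order)]])
        for k in range(len(order))
    )

rng = random.Random(1234)             # fixed seed, for reproducibility only
instance = random_instance(20, rng)
identity_route = list(range(20))
length = route_length(instance, identity_route)
```

Generating instances on the fly, as in the training setup above, amounts to calling `random_instance` once per training example instead of loading a stored dataset.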

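| Method (CVRP) | CVRP20 TourL | CVRP20 Gap(%) | CVRP20 Time(s) | CVRP50 TourL | CVRP50 Gap(%) | CVRP50 Time(s) | CVRP100 TourL | CVRP100 Gap(%) | CVRP100 Time(s) |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 | 6.14 | 0.00 | 28(m) | 10.39 | 0.00 | 112(m) | 15.67 | 0.00 | 211(m) |
| Random CW* | 6.81 | 10.91 | | 12.25 | 17.90 | | 18.96 | 20.99 | |
| Random Sweep* | 7.01 | 14.16 | | 12.96 | 24.73 | | 20.33 | 29.73 | |
| Or-tools* | 6.43 | 4.73 | | 11.43 | 10.00 | | 17.16 | 9.50 | |
| L2I | 6.12 | - | 12(m) | 10.35 | - | 17(m) | 15.57 | - | 24(m) |
| Kool | 6.67 | 8.63 | 0.01 | 11.00 | 5.87 | 0.02 | 16.99 | 8.42 | 0.07 |
| Nazari | 7.07 | 15.14 | 6.41 | 11.95 | 15.01 | 19 | 17.89 | 14.16 | 42 |
| Ours (Constructive) | | | | | | | | | |
| ERRL1 | 6.34 | 3.25 | 0.01 | 10.77 | 3.65 | 0.02 | 16.38 | 4.53 | 0.06 |
| ERRL2 | 6.67 | 8.63 | 5.4 | 11.01 | 2.63 | 14 | 17.23 | 9.95 | 33 |
| ERRL2(2opt) | 6.18 | 0.65 | 5.49(m) | 10.56 | 1.63 | 24.28(m) | 16.16 | 3.12 | 65(m) |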

实验

我们所有的错误实验都使用 pgm。我们强调,ERRL 可以应用于任何 PGM,以支持我们将 ERRL 应用于现有的两种算法的索利申请。在这项工作中,纳齐拉等人,我们实现了()为 erst1 和 errl2 的 tsp 和 cvrp 描述的数据集。对于 MRPFF,我们实现了数据集,我们需要找到连接所有 N $(n_1,\cdots,n_i) $节点的最短路径,其中两个节点之间的距离是 2D 欧几里德距离。每个节点的位置随机从单位广场上采样,在此问题中不需要完全填充任何客户需求,但优化多个路线。我们使用与整个实验相同的架构设置,并均匀地初始化参数,策略梯度从一批 128 个实例取平均。 ADAM 优化器与 0.0001 和重量衰减(L2 正则化)一起使用。为了保持培训条件,对于所有实验来说,我们没有应用衰减的学习率,尽管我们建议在实践中建议进行微调的衰变学习率,以便更快地收敛。每个时代我们处理 2500 批批次的 512 个实例随机生成。培训时间随问题的大小而变化。在图中,Malltsp 说明了大多数学习已经在 200 个时期完成。在实验中,我们手动为所有问题设置熵值。我们评估了附录参数中提出的其他值结果的结果,但是使用$\alpha=0.3 $实现了最佳性能,具有 0.0001 的学习率。

Capacitated Vehicle Routing Problem (CVRP)

In this table we group the baselines into the solver LKH3, non-learning baselines, constructive approaches and improvement approaches. Two further algorithms, Chan and Tian () and L2I (), fuse the strength of Operations Research (OR) heuristics with the learning capabilities of reinforcement learning; in Table randomdata we report the results from their papers. Our method is directly comparable to the constructive approaches, given 1000 random CVRP instances each of CVRP20, CVRP50 and CVRP100. ERRL2 (using the policy network) slightly improves the solutions, whereas ERRL1 (using the policy network) finds near-optimal solutions for all the problem sizes. ERRL1 and ERRL2 combined with 2-opt significantly outperform all other learning approaches in terms of both solution quality and solving time; Table randomdata reports all the results. For all results, the learning algorithms and baselines were run using their publicly available code, except for Chan and Tian () and L2I (). The learning curves for CVRP50 and TSP50 in Figure smalltsp show that ERRL training is more stable and that most of the learning converges faster than in the Kool et al. model. We also observed that most of the learning is already completed within 200 epochs for both problems in Figure smalltsp. After each training epoch, we generate 1000 random instances and use them as a validation set.
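
The Gap(%) columns in the results tables appear consistent with the usual relative optimality gap with respect to the reference solver. A minimal sketch, assuming the gap is defined as the relative excess over the LKH3 tour length (this definition is an assumption, and `optimality_gap` is a hypothetical helper name):

```python
def optimality_gap(tour_len, reference_len):
    """Relative excess of tour_len over the reference tour length, in percent."""
    return 100.0 * (tour_len - reference_len) / reference_len

# Kool et al. on CVRP20 against LKH3, using the tour lengths from the table.
gap = optimality_gap(6.67, 6.14)
print(round(gap, 2))
```

With the table values for Kool et al. on CVRP20 this gives approximately 8.63, which matches the reported entry.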

电容车辆路由问题(CVRP)

在本表中,我们将基线作为 LKH3,非学习基准,建设性方法和改进方法(另外两种算法和 Tian()和 L2I()引入的那样作业研究强度研究(或)具有钢筋学习的学习能力的启发式,在 Tablerandomdata 中,我们报告了他们的论文的结果)。我们的方法与 CVRP20,CVRO50 和 CVRP100 的 1000 随机 CVRP 实例直接相当。 errl2(使用策略网络)略微提高解决方案。然而,Errl1(使用策略网络)查找所有问题大小的最佳解决方案附近。 Errl1 和 Errl2 与 2opt 连续表达所有其他学习方法,在 TablerAndomdata 中的解决方案质量和解决时间均有显着性。对于所有结果,使用除 Chan 和 Tian()和 L2I()之外的公共可用代码实现了学习算法和基准。图中 CVRP50 和 TSP50 的学习曲线在图 MallTSP 中表明,ERRL 培训更稳定,大部分学习都比 Kool 等人更快。模型。我们观察到大部分学习都已在 200 个时代内完成,以便在图中的问题中的问题。在每次训练时,我们会生成 1000 个随机实例以将它们用作验证集。

Multiple Routing with Fixed Fleet Problems (MRPFF)

To analyse the generalisation of the ERRL method, we created a new routing problem to test how our model performs compared to other existing methods. The results for Multiple Routing with Fixed Fleet Problems (MRPFF) with 20, 50, and 100 customer nodes are reported in Table mrpff; the ERRL models outperform the existing constructive models in nearly all settings.

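| Method (MRPFF) | MRPFF=20 TourL | MRPFF=20 Gap(%) | MRPFF=20 Time(s) | MRPFF=50 TourL | MRPFF=50 Gap(%) | MRPFF=50 Time(s) | MRPFF=100 TourL | MRPFF=100 Gap(%) | MRPFF=100 Time(s) |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 | 5.34 | 0.00 | 17(m) | 9.12 | 0.00 | 90(m) | 13.16 | 0.00 | 145(m) |
| Constructive Models | | | | | | | | | |
| Nazari et al. | 6.79 | 27.15 | 6 | 10.85 | 18.96 | 18.98 | 15.97 | 21.35 | 33 |
| Kool et al. | 5.99 | 12.17 | 0.01 | 10.30 | 12.93 | 0.03 | 14.67 | 11.47 | 0.07 |
| Ours (Constructive) | | | | | | | | | |
| ERRL1 | 5.51 | 3.18 | 0.01 | 10.10 | 10.74 | 0.03 | 13.97 | 6.15 | 0.07 |
| ERRL2 | 5.70 | 6.74 | 5 | 10.40 | 14.03 | 14 | 14.40 | 9.42 | 30 |
| ERRL1(2Opt) | 5.45 | 2.05 | 2(m) | 9.39 | 2.96 | 15(m) | 13.80 | 4.86 | 23(m) |
| ERRL2(2opt) | 5.60 | 4.86 | 4.41(m) | 9.79 | 7.34 | 17(m) | 14.12 | 7.29 | 45(m) |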

使用固定船队问题的多路由(MRPFF)

分析 ERRL 方法​​的泛化我们创建了一个新的路由问题,以测试我们的模型如何与其他现有方法进行比较。与固定舰队问题的多次路由(MRPFF)报告的 TableMRPFF 的结果。 MRPFF 在 TableMRPFF 中报告使用 20,50 和 100 个客户节点应用错误结果的实验​​,并且显示所有 ERRL 模型以优于所有现有方法。

Travelling Salesman Problem (TSP)

In this section, we evaluate our approach on the TSP and show that its performance is comparable with existing work, as previous state-of-the-art approaches typically focus on random TSP data. For the TSP, we report optimal results by Concorde () and (). Besides, we compare against Nearest, Random and Farthest Insertion, as well as Nearest Neighbour. Table tsp illustrates the performance of our techniques compared to the solvers, heuristics, and state-of-the-art learning techniques for various TSP instance sizes. Table tsp is separated into four sections: solvers, heuristics, learning models using supervised learning (SL), and learning models using reinforcement learning (RL). All results were obtained using the publicly available code of each method, except for the Graph Convolutional Network (GCN), whose results are taken from (). We implemented PN (), Bello et al. (), EAN (), Kool () and Nazari, and accordingly report the results we found from our implementation. We achieve satisfactory results and report the average tour lengths of our approaches on TSP20, TSP50, and TSP100 in Table tsp. The data in Table tsp show that our approaches perform better than Nazari et al. () for all sizes of TSP instances. The data in Table tsp also show that our greedy approach, Entropy Regularised Reinforcement Learning, not only outperforms all the traditional baselines but also performs better than (). Furthermore, we considered execution times in seconds for all instances. Run times are important but can vary with the implementation language, for example Python or C++. We show the run times for our approach and compare them with all the approaches that used Python for implementation. Another important factor is the hardware used, such as GPUs or CPUs (). We ran all approaches on the same hardware platform, as experimental results can vary based on the hardware platform. In Table tsp, we report the running times of the results from our implementation using the publicly available code, because running times reported by others are not directly comparable. We only reported execution times for directly comparable baselines () and (). We report the average solution time (in seconds) over a test set of 1000 instances.

旅行推销员问题(TSP)

在本节中,我们希望评估并显示我们的性能与现有的工作相当,因为之前的最先进的方法通常专注于 TSP 随机数据。对于 TSP,我们通过 concorde()和()报告最佳结果。此外,我们与最近,随机和进一步的插入以及最近的邻居进行比较。在表格中,与各种 TSP 实例大小的求解器,启发式技术和艺术学习技术的状态相比,我们的技术的性能说明了我们的技术。表 TSP 分为四个部分:求解器,启发式,使用加强学习(RL)和学习方法;使用监督技术学习模型。对于所有结果,使用其公开可用的代码来实施,除了从()的图表卷积网络(GCN);我们实现了 pn。,(),bello 等,,ean。,(),[kool()和[nazari](),请参阅我们从我们的实施中找到的结果。我们能够实现令人满意的结果,并在 TSP20,TSP50 和 TSP100 中报告我们在 TSP20 中的方法的平均旅游长度。表 TSP 中的数据显示,对所有大小的 TSP 实例进行更好地比较 Nazari 等人。平板电脑应用中的数据使用我们的贪婪方法名称熵正常化的强化学习,不仅优于所有传统的基线,而且表现出更好的比较()。此外,我们在几秒钟内考虑了执行时间的所有情况。运行时间很重要,但由于使用 Python 或 C ++ 的实现可能会有所不同。我们为我们的方法显示了运行时期,并与所有方法相比,使用 Python 实现。另一个重要因素是使用 GPU 或 CPU()等硬件。我们在与实验结果相同的硬件平台上实现了所有方法,可以根据硬件平台而变化。在平板电脑中,我们通过其他人报告,我们向我们的实现的运行时间报告了我们的实施,如其他人报告的,不可比较。我们只报告了直接可比较的基线()和()的执行时间。我们报告了在大小 1000 测试的测试集上解决平均解决时间(以秒为单位)所需的时间。

ERRL combined with 2-Opt

In this study, as a further enhancement, we use the local search algorithm 2-opt to improve our results during test time. We show that the model can improve its results by using a ‘hybrid’ approach that combines a learned algorithm with local search. This hybrid approach is an example of combining learned and traditional heuristics. Many recent works have shown that the design of the search procedure has an immense impact on the performance of an ML approach. Francois et al. showed in their results that the improvement can come from the search procedure rather than from the learning itself. With local search added, ERRL1 and ERRL2 with the 2-opt heuristic perform much better than without 2-opt, as shown in Tables randomdata, mrpff and tsp.

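| Method | TSP20 TourL | TSP20 Gap(%) | TSP20 Time | TSP50 TourL | TSP50 Gap(%) | TSP50 Time | TSP100 TourL | TSP100 Gap(%) | TSP100 Time |
|---|---|---|---|---|---|---|---|---|---|
| Solver | | | | | | | | | |
| Concorde | 3.83 | 0.00 | 4(m) | 5.70 | 0.00 | 10(m) | 7.77 | 0.00 | 55(m) |
| LKH3 | 3.83 | 0.00 | 42(s) | 5.70 | 0.00 | 59(m) | 7.77 | 0.00 | 25(m) |
| Heuristics | | | | | | | | | |
| Nearest Insertion | 4.33 | 13.05 | 1(s) | 6.78 | 18.94 | 2(s) | 9.45 | 21.62 | 6(s) |
| Random Insertion | 4.00 | 4.43 | 0(s) | 6.13 | 7.54 | 1(s) | 8.51 | 9.52 | 3(s) |
| Farthest Insertion | 3.92 | 2.34 | 1(s) | 6.01 | 5.43 | 2(s) | 8.35 | 7.46 | 7(s) |
| Or-tools | 3.85 | 0.52 | - | 5.80 | 1.75 | - | 8.30 | 6.82 | - |
| Learning Models (SL) | | | | | | | | | |
| PN | 3.88 | 1.30 | | 6.62 | 16.14 | | 10.88 | 40.20 | |
| GCN* | 3.86 | 0.78 | 6(s) | 5.87 | 2.98 | 55(s) | 8.41 | 8.23 | 6(m) |
| Learning Models (RL) | | | | | | | | | |
| Bello et al. | 3.89 | 1.56 | - | 5.99 | 5.08 | - | 9.68 | 24.73 | - |
| EAN (2Opt) | 3.93 | 2.61 | 4(m) | 6.63 | 16.31 | 26(m) | 9.97 | 28.31 | 178(m) |
| Kool et al. | 3.85 | 0.52 | 0.001(s) | 5.80 | 1.75 | 2(s) | 8.15 | 4.89 | 6(s) |
| Nazari et al. | 4.00 | 4.43 | | 7.01 | 22.76 | | 9.46 | 21.75 | |
| Ours | | | | | | | | | |
| ERRL1 | 3.83 | 0 | 0.01(s) | 5.74 | 0.70 | 2(s) | 7.86 | 1.15 | 6(s) |
| ERRL2 | 3.86 | 0.78 | 7(s) | 5.76 | 1.05 | 18(s) | 7.92 | 1.93 | 31(s) |
| ERRL1(2opt) | 3.81 | - | 1(m) | 5.71 | 1.57 | 11.87(m) | 7.77 | 0 | 25(m) |
| ERRL2(2opt) | 3.83 | 0 | 4(m) | 5.73 | 0.52 | 13(m) | 7.80 | 0.38 | 37(m) |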

errl 与本研究中的 2-opt

相结合,另一种进一步的增强是,我们使用本地搜索算法 2-opt 来改善我们在测试时间期间的结果。我们表明该模型可以通过使用本地搜索的学习算法的“混合”方法来产生改进的结果。这种混合方法是结合学习和传统启发式的一个例子。最近的许多作品表明,搜索程序的设计对 ML 方法的性能产生了巨大的影响。 Francois 等人。结果显示,搜索程序可以促进改进,而不是来自学习内在的。添加了本地搜索,使用 2opt 启发式的 Errl1 和 Errl2 具有巨大提高的性能,而不是 TablesRandomData,MRPFF AndTsp 中显示的 2-opt。

Solution Search Strategy Impact

The recent L2I approach outperforms LKH3. Our method differs from L2I in terms of speed, and ERRL is a purely data-driven way to solve combinatorial optimisation problems (COP). The ERRL model is combined with the local search operator 2-opt, but only during test time. L2I is a specialised routing problem solver based on a handcrafted pool of improvement operators and perturbation operators. ERRL-net is a construction-type neural network model for CO problems. Previous methods, including ours, generally have two modes for inference. In greedy mode, a single deterministic trajectory is drawn using argmax on the policy. In “sampling mode,” multiple trajectories are sampled from the network following the probabilistic policy.
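
The difference between the two inference modes can be illustrated for a single decoding step. This is a minimal sketch, assuming the policy outputs a categorical distribution over the candidate nodes; the probabilities and the helper names `greedy_choice` and `sampled_choice` are hypothetical.

```python
import random

def greedy_choice(probs):
    """Greedy mode: pick the index with the highest probability (argmax)."""
    return max(range(len(probs)), key=lambda i: probs[i])

def sampled_choice(probs, rng):
    """Sampling mode: draw an index according to the policy probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.1, 0.6, 0.3]            # hypothetical policy output for one step
rng = random.Random(0)
greedy = greedy_choice(probs)      # always the same node for the same distribution
samples = [sampled_choice(probs, rng) for _ in range(1000)]
```

Greedy decoding yields one deterministic trajectory, whereas repeating the sampled choice at every step yields many different trajectories from which the best can be kept.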


In the experiment, we used two different decoders: the greedy decoder, referred to as ERRL1Gr in Figure search, and beam search (BS), referred to as ERRL1BS in Figure search. Our results show that beam search improves the quality of the solutions, at a slightly increased computation time. In this work, the two inference techniques, and one of them combined with the 2-opt operator during inference time, showed experimentally that ML approaches benefit from a search procedure, as presented in Figure search. In Figure search, greedy search and beam search combined with the ERRL1 model obtain optimality gaps of 1.15% and 0.38% respectively; when we combine ERRL1 with the 2-opt operator, the gap is reduced from 1.15% to 0 for TSP instances with 100 nodes. However, the execution time is higher. To better understand the impact of fusing a search procedure with an ML model, we use greedy search, beam search and 2-opt combined with greedy search on ERRL1 to show the importance of search in ML-based approaches to combinatorial optimisation.
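
A beam search decoder of the kind compared above can be sketched as follows. This is an illustration under stated assumptions, not the authors' decoder: a learned policy would rank partial tours by accumulated log-probability, whereas this sketch uses the partial path length as a stand-in score, and the function name `beam_search_tour` is hypothetical.

```python
import math

def beam_search_tour(coords, width):
    """Keep the `width` best partial tours (by path length so far) at each step.

    Returns (closed tour length, tour). Path length stands in for the
    negative log-probability that a learned policy would provide.
    """
    n = len(coords)
    beams = [(0.0, [0])]                       # (path length, partial tour from node 0)
    for _ in range(n - 1):
        candidates = []
        for length, tour in beams:
            for nxt in range(n):
                if nxt not in tour:            # mask nodes that were already visited
                    step = math.dist(coords[tour[-1]], coords[nxt])
                    candidates.append((length + step, tour + [nxt]))
        candidates.sort(key=lambda c: c[0])
        beams = candidates[:width]             # prune to the beam width
    # Close each surviving tour and return the shortest one.
    closed = [(l + math.dist(coords[t[-1]], coords[t[0]]), t) for l, t in beams]
    return min(closed, key=lambda c: c[0])

points = [(0, 0), (2, 0), (1, 1), (0, 2), (3, 3)]
greedy_len, greedy_tour = beam_search_tour(points, width=1)
wide_len, wide_tour = beam_search_tour(points, width=1000)
```

With `width=1` the procedure reduces to greedy decoding; a wider beam explores more partial tours at a higher computational cost, which mirrors the trade-off between solution quality and execution time discussed above.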

解决方案搜索策略影响

最近的工作 L2I Outformfls LKH3 开发,我们的方法在速度方面与 L2I 不同也是 Errl 是一种纯粹的数据驱动方式来解决警察。 ERRL 模型将本地搜索操作员 2-Opt 但在测试时间内结合。 L2I 是一种基于手工制作的改进操作员和扰动运营商的专业路由问题求解器。 ERRL NET 是 CO 问题的建筑型神经网络模型,以前的方法,包括我们的方法有两种推断的推断模式。使用贪婪搜索的推理时间(如贪婪搜索)使用更多推理技术,使用贪婪搜索单个确定性轨迹在策略上绘制。在“采样模式下,”概率政策之后的网络采样多个轨迹。

在实验中,我们使用了两个不同的解码器:在图中称为名称称为[errl1gr]的名称中的贪婪,以及图中的波束搜索(bs)作为名称[errl1bs]( )。我们的结果表明,使用光束搜索算法,解决方案的质量改善;但是,在计算时略微增加。在这项工作中,两种推理技术和一项推断技术在推理时间期间与 2-OPT 运算符进行实验显示,从图中展示的搜索程序中受益的方法。在图中,贪婪搜索,光束搜索与 Errl1 模型相结合,当我们与 2 个 Opt 操作员组合时,可以分别获得 1.15%和 0.38%,使用 Errl1 可以从 1.15%降至 100 个节点问题的 TSP。但是,执行时间更高。我们在图中显示了搜索程序保险丝的影响。为了更好地了解,我们使用贪婪的搜索,光束搜索和组合的 2-opt 与 Errl1 进行贪婪搜索,以表明在基于 ML 的方法中搜索组合优化的重要性。

Conclusion and Future Direction

In this work, we presented the ERRL model, which encourages exploration, and found that it can improve performance on many optimisation problems when incorporated with a policy gradient method. We demonstrated that an entropy-augmented reward helps the model to avoid local optima and prevents premature convergence. The model significantly outperformed other machine-learning-based approaches. We expect that the proposed architecture is not limited to route optimisation problems; applying it to other combinatorial optimisation problems is an essential topic of future research. Recent learning-based algorithms for COP are model-free deep RL methods, which are notoriously expensive in terms of their sample complexity. This challenge severely limits the applicability of model-free deep RL to real-world tasks. Our current ERRL-net model has lower sample complexity. Nevertheless, it can be improved for use in real-world tasks, so that the model is less brittle with respect to its hyper-parameters and is able to balance exploration and exploitation according to the task. Our future goal is to develop a model in which, instead of requiring the user to set the temperature manually (as we have done for the alpha value), the process is automated by formulating a different maximum-entropy reinforcement learning objective in which the entropy is treated as a constraint. Therefore, for future work, we need an effective method that can become competitive with the model-free state of the art for combinatorial domains; more sample-efficient RL algorithms may be a key ingredient for learning from larger problems. In addition, finding the best-adapted search procedure for an ML model is still an open question.

结论和未来方向

在这项工作中,我们介绍了鼓励探索的错误模型,并找到模型可以提高具有策略梯度方法的许多优化问题的性能。在该研究中,证明了熵增强奖励有助于模型避免本地最佳,并防止过早收敛。该模型显着表现优于基于机器学习的方法。我们预期,所提出的架构不仅限于路由优化问题;将其应用于其他组合优化问题是未来研究的重要主题。 COP 的最近基于学习的算法是无模型的深 rl 方法,其在样本复杂性方面是众所余的。这一挑战严重限制了无模型的深度 RL 对现实世界任务的适用性。我们当前的错误净模型具有更少的样本复杂性。然而,可以改进它将这种模型与现实世界任务一起使用,因此模型可防止脆性有关其超参数。此外,能够在任务方面平衡探索和剥削。我们未来的目标是开发一个模型,而不是要求用户手动设置温度(现在我们已经为 alpha 值进行了完成),我们可以通过重新介绍不同的最大熵增强目标来自动化该过程,其中熵被视为一个约束。因此,对于未来的工作,我们需要一种有效的方法,可以对组合域的无模型最新的最先进的方法具有竞争力,并且更多的样本有效的 RL 算法可以是用于学习更大问题的关键成分。此外,找到 ML 模型的最适应的搜索过程仍然是一个打开的问题。

Problems: TSP, CVRP, MRPFF

The goal of our model is to generate routes of minimum total length for the vehicle, for any of the routing problems considered: the Capacitated Vehicle Routing Problem (CVRP), the Multiple Routing Problem with Fixed Fleet size (MRPFF) and the TSP. Let $ G(V, E) $ denote a weighted graph, where $ V $ is the set of nodes and $ E $ the set of edges. In an instance of the VRP, the input $ V $ is a set of nodes: node $ a_0 $ represents the depot, and all other nodes represent the customers that need to be visited. There can be multiple vehicles serving customers. The Vehicle Routing Problem is to find a set of routes (all starting and ending at the depot) with minimal cost such that each customer is visited by exactly one vehicle. Each route must start at the single depot $ a_0 $, visit a subset of customers and then return to that depot. The objective is to determine a set of minimal-cost routes that satisfies all the requirements defined above. With these parameters, the formulation of CVRP is given by
$$ \min \sum_{a_i \in V} \sum_{a_j \in V} c_{a_ia_j} x_{a_ia_j} $$
subject to
$$ \begin{matrix} \displaystyle\sum_{a_i \in V} x_{a_ia_j} = 1 & \forall a_j \in V \setminus \{a_0\}, \\ \displaystyle\sum_{a_j \in V} x_{a_ia_j} = 1 & \forall a_i \in V \setminus \{a_0\}, \\ \displaystyle\sum_{a_i \in V} x_{a_ia_0} = K, & \\ \displaystyle\sum_{a_j \in V} x_{a_0a_j} = K, & \\ \displaystyle\sum_{a_i \notin S}\sum_{a_j \in S} x_{a_ia_j} \geq r(S) & \forall S \subseteq V \setminus \{a_0\},\ S \neq \emptyset. \end{matrix} $$
In this formulation, $ c_{a_ia_j}\in\mathbb{R}^{+} $ represents the cost of travelling from node $ a_i $ to node $ a_j $. $ x_{a_ia_j}\in\{0,1\} $, with $ a_i, a_j \in V $, is a binary variable that has value 1 if the edge going from $ a_i $ to $ a_j $ is part of the solution and 0 otherwise. $ K $ is the number of available vehicles, and $ r(S) $ is the minimum number of vehicles needed to serve the set $ S $. The last constraints are the capacity cut constraints, which impose that the routes must be connected and that the demand on each route must not exceed the vehicle capacity.
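
The constraints above can be checked directly on a candidate solution. The following is a minimal sketch, assuming a solution is given as a list of routes over customer indices with index 0 as the depot; the function name `cvrp_cost` and the data layout are assumptions made for illustration.

```python
import math

def cvrp_cost(coords, demands, capacity, routes):
    """Return the total route length, or raise ValueError if a constraint fails.

    coords[0] is the depot; each route lists customer indices only.
    """
    visited = [c for route in routes for c in route]
    if sorted(visited) != list(range(1, len(coords))):
        raise ValueError("each customer must be visited exactly once")
    total = 0.0
    for route in routes:
        if sum(demands[c] for c in route) > capacity:
            raise ValueError("route demand exceeds vehicle capacity")
        path = [0] + list(route) + [0]          # every route starts and ends at the depot
        total += sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))
    return total

coords = [(0, 0), (1, 0), (0, 1), (-1, 0)]
demands = [0, 1, 1, 1]
cost = cvrp_cost(coords, demands, capacity=2, routes=[[1, 2], [3]])
```

Here the number of routes plays the role of K, and the capacity check corresponds to the capacity cut constraints for the customer set of each route.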

问题:TSP,CVRP,MRPFF

我们模型的目标是生成车辆的最小总路径长度,其中路线长度需要从任何路由问题得到应答,例如电容车辆路由问题(CVRP),固定舰队尺寸(MRPFF)和 TSP 的多辆路线问题。设 g(v; e)表示加权图,其中 V 是节点集,e 的一组边缘。在 VRP 问题的一个实例中,我们给出了一组客户节点,具体地,对于 VRP 实例,输入(v =)是一组节点,节点$ a_0 $表示仓库,并且所有其他节点代表客户需要访问。可以有多车辆服务客户。车辆路由问题是在最小的成本上找到一组路由(所有在仓库中启动和结束),并且必须恰好访问每个客户。考虑一套客户,向这些客户提供客户;我们必须为可用的每辆车设计路线,所有这些都是从单个仓库,$ a_0 $开始的。每个路线必须在仓库开始,访问客户的子集,然后返回该仓库。问题的目的是确定一组满足上面定义的所有要求的最小成本路由。用这些参数,CVRP 的制剂通过,受试者给予,$\displaystyle\sum_{a_i \in V} x_{a_ia_j} = 1 $ $\forall_{a_j}\in V \setminus{a_0} $,$\displaystyle\sum_{a_j \in V} x_{a_ia_j} = 1 $ $\forall_{a_i}\in V \setminus{a_0} $,$\displaystyle\sum_{v_i \in V} x_{a_ia_o} = K $,$\displaystyle\sum_{v_j \in V} x_{a_oa_j} = K $,$\displaystyle\sum_{i \not\in S}\displaystyle\sum_{j \in S} x_{a_ia_j}\sum0\sum1 $,其中$\forall S \subseteq V \setminus{a_{0}}, S S 0 S 1 $,在该配方中${\displaystyle C_{a_ia_j}} $表示从节点${\displaystyle a_i} $到节点${\displaystyle a_j} $去,并且在成本,其中从节点$ a_i $到$ a_j $就是$ c_{a_ia_j}\in 行驶的成本 T406_3^{+} $。 ${\displaystyle x_{a_ia_j}} $是二进制变量,$ x_{a_ia_j}\in{0,1 } $和$ a_i, a_j \in V $,如果从${\displaystyle a_i} $到${\displaystyle a_j} $被视为解决方案和${\displaystyle 0} $的一部分,则具有值 1。否则,k 是可用车辆的数量。 R(s)是用于服务的车辆的最小数量,该容量截止约束,这施加了必须连接路线,并且每个路线的需求不得超过车辆容量。

We also assume that $ a_0 $ is the depot node (). In this work, we also use another variant of the VRP, the Multiple Routing Problem with Fixed Fleet size (MRPFF), in which customer locations lie in a 2D Euclidean space and we must design routes for one vehicle available at a single depot, $ a_0 $. Each route must start at the depot, visit a subset of customers and then return to that depot. It is further assumed that the vehicle does not need to satisfy any demands but only needs to optimise the set of routes. The MRPFF has no demand, and we consider only one fixed vehicle, which visits two sets of customers (two routes). Another routing problem is the TSP, in which we are given a set of points (cities). A tour of these cities is a sequence in which each city is visited exactly once. The TSP is to find a tour such that the total travel distance between consecutive pairs of cities in the tour is minimised.

还假设${\displaystyle 0} $是贮库节点()。在这项工作中,我们还使用了 VRP 的另一个方案,多个路由问题,固定的舰队尺寸(MRPFF),客户位置被考虑在 2D 欧几里德空间,为客户提供服务,我们必须设计一辆可用的一辆车的路线单仓库,$ a_o $。每个路线必须在仓库开始,访问客户的子集,然后返回该仓库。进一步假设车辆不需要达到任何要求,但需要优化一组路线。车辆必须在仓库节点开始和结束的路线以优化路由。 MRPFF 没有需求,我们只考虑一个固定车辆,访问两套客户(两组路线)。另一个路线问题是 TSP 问题;我们给了一组积分/城市。这些城市的一场巡回赛是每个城市都被访问的序列,只访问一次。然后,TSP 问题是找到如此之旅,使得游览中连续一对城市之间的总旅行距离最小化。

2-opt Local Search

In this work we use 2-opt as an inference technique to further improve the solutions. In the 2-opt algorithm, when two edges are removed there is only one alternative feasible way to reconnect the tour. More generally, a k-opt procedure searches for k edges to be replaced by new edges such that the swap results in a shorter tour. Moreover, sequential pairwise operators such as k-opt moves can be decomposed into simpler l-opt moves, where l < k. For instance, in our work, 2-opt sequential operations are decomposed into one.
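
A basic 2-opt improvement loop can be sketched as follows. This is a textbook variant given for illustration, assuming a closed tour over 2D Euclidean points; it is not the authors' implementation, and the helper names are hypothetical.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(coords, tour):
    """Repeatedly reverse a segment of the tour while doing so shortens it.

    Reversing tour[i..j] removes two edges and reconnects the tour in the
    only other feasible way.
    """
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(coords, candidate) < tour_length(coords, best) - 1e-12:
                    best = candidate
                    improved = True
    return best

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
crossing = [0, 2, 1, 3]                 # visits the corners with two crossing diagonals
untangled = two_opt(square, crossing)
```

On the unit square, the crossing tour has length 2 + 2√2 and the 2-opt move removes the crossing, giving the perimeter tour of length 4. Recomputing the full tour length for every candidate keeps the sketch short; an efficient version would evaluate only the change in the two affected edges.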

2-opt 本地搜索

在这项工作中作为推理技术,我们使用 2-opt 进一步改进解决方案。在 2-opt 算法中,在删除两个边的时,只有一个替代的可行解决方案。该过程搜索 k 边缘掉后将被新边缘替换,交换技术导致较短的巡视。此外,诸如 k-opt 动作的序贯成对操作者可以在更简单的 L-opt opt opl 中分解,其中 L <k。例如,在我们的工作中,将 2 个 OPT 顺序操作分解成一个。

Policy Network (ERRL1)

We used the neural architecture of () to apply our approach to improve solutions of routing problems. Here, we briefly describe the attention model architecture. The model consists of an attention-based encoder and decoder network. The encoder produces embeddings of all input nodes, and the decoder produces the sequence $ \pi $ of input nodes one at a time; the decoder takes as input the encoder embeddings and a problem-specific mask and context. Once a partial tour has been constructed it cannot be changed, and the remaining problem is to find a path from the last node, through the unvisited nodes, to the first node; the decoder context therefore consists of the embeddings of the first and last node. The attention-based encoder embeds the input nodes and processes them with N sequential layers, each consisting of a multi-head attention sub-layer and a feed-forward sub-layer. The graph embedding is computed from the node embeddings. The attention layers in this model follow the Transformer architecture (): each attention layer has two sub-layers, a multi-head attention layer that performs message passing between the nodes and a node-wise fully connected feed-forward layer. The decoder decodes the solution sequentially; at each time step it outputs the node $ \pi_t $ based on the embeddings from the encoder. The graph is also augmented with a special context node to represent the decoding context, similar to (). Similar to (), the output probabilities are computed by adding one decoder layer with a single attention head. We applied our approach to the () model and report improved results on a number of combinatorial optimisation (routing) problems.

策略网络(errl1)

我们使用了()的神经架构来应用我们的方法来改进路由问题的解决方案。在这里,我们简要介绍了在此方面的注意模型架构。该模型由基于注意的编码器和解码器网络组成。编码器生成所有输入和解码器的嵌入式,并在一个时间产生给定输入的序列$\pi $:编码器输入编码器嵌入物和问题特定掩码和上下文。当构造的部分巡回赛时,它无法改变,并且节点的其余部分找到从上一个节点到第一节点的路径,并且解码器网络由第一节和最后一个节点的嵌入式组成。关注的编码器嵌入输入节点并处理了 N 个顺序层,每个都是组成的多针注意和前馈子层。嵌入嵌入的节点嵌入式。在变压器架构()之后的该模型中的注意层,每个注意层都有两个子层,一个多针注意过程在节点和另一层之间的消息是一个节点明智的完全连接的前馈层。解码器在每个时间步骤顺序地解码解决方案。解码器根据来自编码器的嵌入式输出节点$\pi_t $。它们还使用特殊的上下文节点增强了图表,以表示类似于()的解码上下文。类似于()计算的输出概率,将一个解码器层添加单个注意头。我们使用()模型的方法,并报告了许多组合优化(路由)问题的改进结果。

Neural Network Architecture(ERRL2)

神经网络架构(errl2)

Policy Network

The model used in the ERRL2 experiments is the same as that of Nazari et al. (which we refer to as “the original Nazari et al. paper”) for both the TSP and CVRP. For MRPFF we use the same problem setting as for CVRP, except that there is no customer demand. The policy network is a sequence-to-sequence (S2S) model employed with an attention mechanism. The S2S technique takes the input sequentially, makes a decision at each time step and generates the solution as a sequence of customer locations. The customer locations lie in a 2D Euclidean space. Given an input sequence, the model finds the conditional probability of the output sequence, similar to (). Commonly, recurrent neural networks are used in sequence-to-sequence models to estimate this conditional probability. The standard sequence-to-sequence model assumes that the set of possible output elements is fixed. In contrast, the VRP solution (output) is a permutation of the problem nodes (input). To obtain the solution, we use an attention mechanism (see, for example, ()). The attention mechanism queries information from all elements of the input node set. An affinity function is evaluated between each input node and the current output of the decoder, generating a set of scalars. A softmax function is then applied to these scalars to obtain the attention weights given to each element of the input set at each time step. We define the combinatorial optimisation problem with a given set of inputs.
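
The step from affinity scalars to attention weights can be illustrated in a few lines. This is a minimal sketch, assuming the affinity scores have already been computed by the network; the functions `attention_weights` and `context_vector`, and the optional mask argument, are hypothetical names introduced for illustration.

```python
import math

def attention_weights(scores, mask=None):
    """Softmax over affinity scores; masked entries receive weight 0."""
    if mask is None:
        mask = [False] * len(scores)
    kept = [s for s, m in zip(scores, mask) if not m]
    top = max(kept)                                   # subtract the max for numerical stability
    exps = [0.0 if m else math.exp(s - top) for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(weights, embeddings):
    """Attention-weighted sum of the embedded inputs."""
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) for d in range(dim)]

uniform = attention_weights([1.0, 1.0, 1.0])
masked = attention_weights([0.0, 5.0, 2.0], mask=[False, True, False])
context = context_vector([0.5, 0.5], [[0.0, 2.0], [2.0, 0.0]])
```

The weights always sum to one over the unmasked inputs, so the context vector is a convex combination of their embeddings.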

策略网络

nazari 等。在 ERRL2 实验中使用的模型与 Nazari 等人一样。对于 TSP 和 CVRP 这两个问题。对于 MRPFF,我们将这些问题设置为 CVRP,除非没有客户需求(我们将其称为“原始 Nazari 等人”)。策略网络作为对注意机制采用的序列(S2S)学习的序列。 S2S 学习技术,顺序给出输入以在每次步骤中做出决定,并将解决方案作为一系列客户位置生成。客户位置位于 2D 欧几里德空间。给定输入序列,模型找到类似于()的输出序列的条件概率。通常,经常性的神经网络用于序列到序列模型以估计这种条件概率。序列到序列模型假定输出序列的元素是固定的。与序列到序列模型不同,VRP 解决方案(输出)是问题节点(输入)的置换。为了实现解决方案,请使用注意机制(例如,()。注意机制来自输入节点中的所有元素的注意力查询信息。评估亲和函数以组装输出序列(使用每个节点和最终输出和最终输出。模型)生​​成一组标量(将许多信号聚合到一个)。稍后,将 SoftMax 函数应用于这些标量,以获得对每个时间步骤中输入集的每个元素的注意力。我们定义了组合优化给定集合的问题。

Between decoding steps, some of the elements of each input may change. For instance, in the VRP, the remaining customer demands change over time as the vehicle visits the customer nodes; or we might consider a variant in which new customers arrive or adjust their demand values over time, independently of the vehicle's decisions (). Formally, we represent each input by a sequence of tuples. We start from an arbitrary input, using a pointer to refer to it; every decoding step points to one of the available inputs, which determines the input of the next decoder step, until a terminating condition is satisfied, as in Nazari et al. The inputs are given to an encoder, which embeds them into latent-space vectors. These embedded vectors are combined with the output of a decoder, which points to one of the elements of the input. This process generates a sequence and ends when a terminating condition is satisfied, e.g., when a specific number of steps has been completed. In dynamic route optimisation, for example in the CVRP, the input includes all customer locations and their demands as well as the depot location; the remaining demands are then updated with respect to the vehicle's destination and its load, and the terminating condition is that there is no more demand to satisfy. This process generates a sequence whose length may differ from the input length. For example, the vehicle may have to go back to the depot several times to refill. We are interested in finding a stochastic policy $ \pi $ that generates the sequence in a way that minimises a loss objective while satisfying the problem constraints.
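
The dynamic part of the CVRP state described above can be sketched as a small update rule. This is an illustration under assumptions: node 0 is taken to be the depot, visiting it refills the vehicle to full capacity, and the function names `visit` and `finished` are hypothetical.

```python
def visit(node, demands, load, capacity):
    """Apply one decoding step and return the updated (demands, load).

    Node 0 is the depot; visiting it refills the vehicle.
    """
    demands = list(demands)              # copy so the caller's state is not modified
    if node == 0:
        return demands, capacity
    served = min(demands[node], load)    # cannot deliver more than the vehicle carries
    demands[node] -= served
    return demands, load - served

def finished(demands):
    """Terminating condition: no customer demand is left to satisfy."""
    return all(d == 0 for d in demands[1:])

demands, load = [0, 3, 4], 5
demands, load = visit(1, demands, load, capacity=5)   # serve customer 1
demands, load = visit(0, demands, load, capacity=5)   # return to the depot to refill
demands, load = visit(2, demands, load, capacity=5)   # serve customer 2
```

The decoded sequence here is 1, 0, 2, which is longer than the number of customers because of the intermediate return to the depot.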

在每个解码步骤之间,每个输入的一些元素要改变。例如,在 VRP 的情况下,随着车辆访问客户节点的时间,客户的其余需求将随时间变化;或者我们可能会考虑一个变体,其中新客户随着时间的推移而抵达或调整需求值,与车辆决策无关。我们正式地,表示通过一系列元组输入。我们从一个任意输入开始,在那里,我们使用指针指的是输入,每个解码步骤都将指向一个可用的输入,该输入调节下一个解码器步骤的输入,直到终止条件满足 Nazari 等。这些输入给出了嵌入到潜伏空间向量的编码器。这些嵌入式矢量与解码器的输出组合。这指向输入的一个元素。当满足终止条件时,该过程产生序列并结束,例如,当完成特定数量的步骤时。在动态路由优化中,例如,在 CVRP 的情况下,包括所有客户位置以及它们的需求以及仓库位置;然后,对车辆目的地及其负载更新剩余需求,并且终止条件是不再需要满足。与输入长度相比,该过程将产生一系列长度,可能具有不同的序列长度。例如,车辆可能必须多次回到仓库以重新填充。我们有兴趣查找随机策略$\pi $,该$以一种方式生成序列,这在满足问题约束的同时最小化损失目标。

Training [ERRL2]

In Algorithm ERRLLNet, we have two networks with weight vectors $ \theta $ and $ \phi $, associated with the actor and critic networks, respectively. In summary, we parameterise the stochastic policy $ \pi $ with parameters $ \theta $. Policy gradient methods improve the policy iteratively using an estimate of the gradient of the expected return with respect to the policy parameters. Let us consider a family of problems, denoted by $ I $ , and a probability distribution over them, denoted by $ \phi_I $ . We draw $ N $ sample problems from $ I $ and use Monte Carlo simulation to produce feasible sequences under the current policy $ \pi_{\theta} $ . In Algorithm ERRLLNet, the superscript $ n $ refers to the variables of the $ n $th instance. The RL objective is to learn the policy parameters such that the expected sum of rewards is maximised under the induced trajectory distribution. After termination of the decoding in all $ N $ problems, we calculate the corresponding rewards in step 13. During training, the policy gradient is computed in step 14 as
$$ \begin{matrix} \nabla\theta_{\pi} = \dfrac{1}{N} \sum_{n=1}^{N} (R^{n} - V(s^n_0; \phi))\nabla_{\theta} \log P(a^n|s^n_0) \end{matrix} $$
We add an entropy bonus in step 14 to avoid premature convergence and to reduce the variance of the gradient. In step 14, $ V(s^n_0;\phi) $ is the reward approximation for problem instance $ n $ , calculated from the critic network. In step 15 we also update the critic network, $ \nabla\theta_{\phi} $ , in the direction that reduces the difference between the expected reward and its Monte Carlo estimate. The complete training procedure is given in Algorithm ERRLLNet.
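
The quantities that enter the actor and critic updates can be written out for a small batch. This is a sketch under stated assumptions, not the authors' training code: the reward is treated as a cost (tour length) to be minimised, the entropy bonus is subtracted from the actor loss with weight alpha, and all function names are hypothetical. In an automatic-differentiation framework the advantage term would be treated as a constant when differentiating the actor loss.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def actor_loss(rewards, values, log_probs, entropies, alpha):
    """Surrogate loss whose gradient is the baseline-corrected policy gradient,
    minus an entropy bonus weighted by alpha (assumed sign convention)."""
    n = len(rewards)
    pg = sum((r - v) * lp for r, v, lp in zip(rewards, values, log_probs)) / n
    return pg - alpha * sum(entropies) / n

def critic_loss(rewards, values):
    """Mean squared error between critic estimates and observed rewards."""
    return sum((r - v) ** 2 for r, v in zip(rewards, values)) / len(rewards)

h = entropy([0.5, 0.5])
a_loss = actor_loss([2.0], [1.0], [-1.0], [0.5], alpha=0.3)
c_loss = critic_loss([2.0, 0.0], [1.0, 1.0])
```

A larger alpha rewards higher-entropy (more stochastic) policies more strongly, which is the mechanism the text relies on to avoid premature convergence.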

训练[errl2]

算法错误,我们有两个网络,其中两个网络分别与演员和批评网络相关联的重量向量$\theta $和$\phi $。总结,我们将随机策略$\theta $参数参数$\phi $参数。策略梯度方法改善政策迭代地使用预期返回的梯度与策略参数的估计。让我们考虑一个由$ I $表示的问题,以及由此表示的概率分布,由$\phi_I $表示。我们从$ I $绘制$ N $样本问题,并使用 Monte Carlo 仿真为当前策略$\pi_{\theta} $产生可行的序列。在 algorimerrllllnet 中,我们拍摄了第 n 个实例的变量被称为$ n $的上标。 RL 目标是学习某些策略的参数,使得在诱导的轨迹分布下最大化的奖励总和最大化。在终止在所有$ N $问题中解码后,我们在步骤 13 中计算相应的奖励。在步骤 14 中的训练期间将策略梯度计算为
$$ \begin{matrix} \nabla\theta_{\pi} = \dfrac{1}{N} \sum_{n=1}^{N} (R^{n} - V(s^n_0; \phi))\nabla_{\theta} \log P(a^n|s^n_0) \end{matrix} $$
我们在步骤 14 中添加熵奖励以避免过早收敛并降低梯度中的差异。在步骤 14 中,$ V(s^n_0,\phi) $是问题实例$ n $的奖励近似。我们还在步骤 15 中将批评奖励的方向更新评估网络$\nabla{\phi} $在将预期的奖励与 Monte Carlo 估算中估算(从批评网络计算)中的差异进行了差异。algorimerrllnet 的完整培训过程。.

Datasets and Settings

In this work, we implemented the datasets described by (). The locations and demands are randomly generated from a fixed distribution. Specifically, the customer and depot locations are randomly generated in the unit square, and the demand of each node is a discrete number in $ \{1,\cdots,9\} $ , chosen uniformly at random. We note that the demand values can be generated from any distribution. For faster training and for generating feasible solutions, we use a masking scheme which sets the log-probabilities of infeasible solutions to $ -\infty $ , or forces a solution if a particular condition is satisfied. In the CVRP, we use the following masking procedure: nodes with zero demand are not allowed to be visited; all customer nodes are masked if the vehicle’s remaining load is exactly 0; and customers whose demands are greater than the current vehicle load are masked. Under this masking scheme, the vehicle must satisfy all of a customer’s demand when visiting it. We used the same architecture settings throughout all the experiments and datasets. Across all experiments, we use a one-dimensional convolutional operation and LSTM cells with 128 hidden units. For training both networks, we use the REINFORCE algorithm and the Adam optimiser () with a learning rate of $ 0.0001 $ . Similarly to (), the decay rate is a factor of 0.96 every 5000 steps. In the critic network, we first use the output probabilities of the actor network to compute a weighted sum of the embedded inputs; the critic then has two hidden layers: one dense layer with ReLU activation and one linear layer with a single output. The variables in both the actor and critic networks are initialised with Xavier initialisation (). The batch size N is 128, we use dropout with probability 0.1 in the decoder LSTM, and we clip the gradients when their norm is greater than 2. In the experiments, we show that an entropy-augmented reward and, in general, a more stochastic policy change this objective and perform better in terms of speed of learning, as shown in Section exp. We manually set the entropy values for all the problems. As we achieved the best performance for ERRL1 with $ \alpha=0.3 $ , we used the same hyper-parameter value in the ERRL2 model.

  • Concorde's TSP solver has been used to obtain the optimal solutions to all random instances and TSPLIB instances. We used Concorde, which iteratively solves linear programming relaxations of the TSP together with a branch-and-cut approach that reduces the solution search space.
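
The CVRP masking rules stated above can be sketched as follows. This is a minimal illustration, assuming node 0 is the depot and is never masked by these rules; the function names `cvrp_mask` and `masked_log_probs` are hypothetical.

```python
import math

def cvrp_mask(demands, load):
    """Return a list where True marks a node that may not be chosen at this step.

    Customers with zero demand, or with demand above the remaining load, are masked.
    When the load is 0 this masks every customer.
    """
    mask = [False] * len(demands)
    for node in range(1, len(demands)):
        if demands[node] == 0 or demands[node] > load:
            mask[node] = True
    return mask

def masked_log_probs(log_probs, mask):
    """Set the log-probabilities of infeasible nodes to minus infinity."""
    return [-math.inf if m else lp for lp, m in zip(log_probs, mask)]

mask = cvrp_mask([0, 3, 0, 6], load=5)
empty_vehicle = cvrp_mask([0, 3, 1], load=0)
logits = masked_log_probs([-1.0, -2.0, -0.5, -0.7], mask)
```

Because a masked node has probability zero after normalisation, neither greedy decoding nor sampling can select it.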

数据集和设置

在此工作中,我们实现了()描述的数据集。从固定分布中随机生成位置和需求。具体地,客户和仓库位置在单位方形中随机生成,并且每个节点的需求选择随机均匀,并且是${1\cdots9} $中的离散数,选择随机均匀。我们注意,可以从任何分发生成需求值。为了更快的培训和生成可行的解决方案,我们使用了一种掩蔽方案,该方案将无法使用解决方案的日志概率设置为( - $\infty $),或者如果满足条件,则强制解决方案。在 CVRP 中,我们使用不允许访问零需求的节点的掩蔽程序;如果车辆的剩余负载恰好为 0,则屏蔽所有客户节点都将被屏蔽,并且要求其要求大于当前车辆负载的客户。在这种掩盖方案下,车辆必须在访问它时满足所有客户的需求。我们在所有实验和数据集中使用了相同的架构设置。在所有实验中,我们使用具有 128 个隐藏单元的一维卷积操作,LSTM 单元。对于培训两个网络,我们使用具有$ 0.0001 $的学习速率的增强算法和 ADAM 优化器()。同样()每 5000 个步骤的衰减率为 0.96。在批评网络中,首先,我们使用演员 - 网络的输出概率来计算嵌入式输入的加权和,然后,它有两个隐藏层:一个密集的层,具有 relu 激活和具有单个输出的另一个线性。 。 Actor 和批评网络中的变量初始化 Xavier InitialIsation()。批量尺寸 N 是 128,在解码器 LSTM 中具有概率 0.1 的丢失,并且当他们的规范大于 2 时,我们剪辑梯度。在实验中,我们展示了熵增强奖励,一般来说,更加随机的政策变化这一目标并在 expect 中显示的学习速度方面更好。我们手动设置所有问题的熵值。当我们为值$\alpha=0.3 $达到 ERRL1 的最佳性能时,我们在 ERRL2 模型中使用了相同的 Hyper 参数值。 - :Concorde 的 TSP 解算器已用于获得所有随机实例和 TSPLIB 实例的最佳解决方案;我们实施了使用算法来迭代解决 TSP 的线性编程放松,除了减少解决方案搜索空间的分支和切割方法之外。

  • Nearest Insertion selects the non-tour city with the minimum distance to its nearest neighbour among the tour cities, and inserts it between the two successive tour cities for which the insertion causes the minimum increase in the overall tour length.
  • Furthest Insertion works in the same manner, but selects the non-tour city with the maximum distance to its nearest neighbour among the tour cities.
  • Random Insertion inserts a random node. As with the nearest neighbour heuristic, we consider the input order to be random, so we insert the nodes in this order.
  • The Nearest Neighbour heuristic represents the partial solution as a path with a start and an end node. It starts in some city and then repeatedly visits the nearest unvisited city. Once all cities have been visited, the end city is connected with the start city. We follow the implementation of ().
  • Croes first introduced the 2-optimisation (2-opt) method, which is a simple and very common operator. The idea of 2-opt is to exchange the links between two pairs of subsequent nodes.
  • A Minimum Spanning Tree (MST) aims to minimise the total weight (tour length) of the edges of the tree.
  • The Cheapest-Link Algorithm selects the edge with the smallest weight, marks it, and continues under the following rules: do not pick an edge that would close a circuit, and do not pick an edge that would create three edges coming out of a single vertex. Finally, the last two vertices are connected to close the circuit.
  • Google Optimisation Tools (OR-Tools) is an open-source solver for combinatorial optimisation problems. OR-Tools contains one of the best available solvers for the vehicle routing problem (VRP), which is a generalisation of the TSP, and implements many heuristics for finding an initial solution as well as metaheuristics. We use it as our baseline, with Guided Local Search as the local-search metaheuristic.
  • We implemented the pointer network with supervised learning.
  • Across all experiments, we used mini-batches of 128 sequences and LSTM cells with 128 hidden units, and trained the models with the Adam optimiser using an initial learning rate of $ 10^{-3} $ for TSP20 and TSP50 and $ 10^{-4} $ for TSP100, decayed every 5000 steps by a factor of 0.96. After implementing the code, we report the results.
  • For the policy gradient experiments across all tasks, we follow the configuration provided in () and consider a benchmark test set of 1,000 Euclidean TSP20, TSP50, and TSP100 graphs.
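
Two of the baselines listed above, the Nearest Neighbour heuristic and the 2-opt operator, can be sketched in a few lines of plain Python. This is an illustrative version for Euclidean instances, not the implementation used in the paper; the function names and the first-improvement acceptance rule in `two_opt` are assumptions.

```python
import math

def tour_length(coords, tour):
    """Length of the closed tour that visits coords in the order given by tour."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))

def nearest_neighbour_tour(coords, start=0):
    """Start at `start` and repeatedly move to the nearest unvisited city."""
    unvisited = set(range(len(coords))) - {start}
    tour = [start]
    while unvisited:
        last = coords[tour[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, coords[j]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour  # the closing edge back to `start` is implicit

def two_opt(coords, tour):
    """Replace pairs of edges while doing so shortens the tour."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a node
                a, b = coords[tour[i]], coords[tour[i + 1]]
                c, d = coords[tour[j]], coords[tour[(j + 1) % n]]
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    # reversing the segment swaps edges (a,b),(c,d) for (a,c),(b,d)
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour
```

A typical use is to build a tour with `nearest_neighbour_tour` and then pass it to `two_opt` as a local-search refinement.
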

We applied the same approach provided in (), and we report the results.

  • Across all experiments, () used mini-batches of 128 sequences and LSTM cells with 128 hidden units, and trained the models with the Adam optimiser () using an initial learning rate of $ 10^{-3} $ for TSP20 and TSP50 and $ 10^{-4} $ for TSP100, decayed every 5000 steps by a factor of 0.96.
  • For the experiments, () initialise the parameters uniformly in $ (-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}) $ , with $ d $ the input dimension. Every epoch, 2500 batches of 512 instances are processed. A constant learning rate of 0.0001 is used. Training with a higher learning rate of 0.001 is possible and speeds up initial learning, but requires decay (0.96 per epoch) to converge.
  • For LKH3 by (), we build and run their code with the SPECIAL parameter as specified in their CVRP run script http://akira.ruc.dk/keld/research/LKH-3/BENCHMARKS/CVRP.tgz .
  • Google Optimisation Tools (OR-Tools) is an open-source solver for combinatorial optimisation problems. OR-Tools is one of the best available VRP solvers. It implements many heuristics (e.g., the Clarke-Wright savings heuristic (), the Sweep heuristic () and a few others) for finding an initial solution, and meta-heuristics (e.g., and a few others) for escaping from local minima in the search for the best solution. In this paper, we use the OR-Tools () VRP solver, as in ().
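
Two of the training details quoted above are easy to state as code: a learning rate decayed by a factor of 0.96 every 5000 steps, and uniform initialisation in $ (-\frac{1}{\sqrt{d}}, \frac{1}{\sqrt{d}}) $ . The sketch below assumes a staircase schedule (the rate changes only at multiples of 5000 steps), which is one possible reading of the text; the function names are hypothetical.

```python
import numpy as np

def decayed_learning_rate(step, initial_lr=1e-3, decay_rate=0.96, decay_steps=5000):
    """Staircase schedule: multiply the rate by decay_rate once every decay_steps steps."""
    return initial_lr * decay_rate ** (step // decay_steps)

def uniform_init(shape, input_dim, rng):
    """Sample parameters uniformly between -1/sqrt(d) and 1/sqrt(d), with d = input_dim."""
    bound = 1.0 / np.sqrt(input_dim)
    return rng.uniform(-bound, bound, size=shape)
```

For example, `decayed_learning_rate(12000)` applies the decay factor twice, and `uniform_init((4, 16), 16, np.random.default_rng(0))` returns a 4 by 16 array whose entries lie within 0.25 of zero.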


Parameter Set Testing

In this section, we trained the model using different sets of parameters and report the results in Tables net, net1 and net3 for problem sizes 20, 50 and 100, respectively, on the CVRP dataset. We evaluated our model while varying its parameters. In this experiment, we first trained our model on the VRP50 dataset and then tested it on VRP20, VRP50 and VRP100 instances.
