# Learning Vehicle Routing Problems using Policy Optimisation

## Abstract

Deep reinforcement learning (DRL) has been used to learn effective heuristics for solving complex combinatorial optimisation problems via policy networks and has demonstrated promising performance. Existing work has focused on (vehicle) routing problems because they strike a good balance between non-triviality and difficulty. State-of-the-art approaches learn a policy using reinforcement learning, and the learnt policy acts as a pseudo solver. These approaches have demonstrated good performance in some cases, but given the large search space of typical combinatorial/routing problems, they can converge too quickly to a poor policy. To prevent this, we propose an approach named entropy regularised reinforcement learning (ERRL) that supports exploration by encouraging more stochastic policies, which tends to improve optimisation. Empirically, the low variance of ERRL makes RL training fast and stable. We also apply a combination of local search operators at test time, which significantly improves the solutions and complements ERRL. We qualitatively demonstrate that, for vehicle routing problems, a policy with higher entropy can smooth the optimisation landscape, making it easier to optimise. Quantitative evaluation shows that the performance of the model is comparable with state-of-the-art variants, and our experiments illustrate that the model produces state-of-the-art performance on variants of vehicle routing problems: the Capacitated Vehicle Routing Problem (CVRP), Multiple Routing with Fixed Fleet Problems (MRPFF), and the Travelling Salesman Problem (TSP).

## Introduction

Combinatorial optimisation (CO) problems are hard, as they involve finding the optimal solution under various constraints. A conventional approach to solving these problems involves modelling the problem as a mathematical objective, selecting an appropriate solver, and then optimising its parameters for the problem instance at hand. While this approach has been successful, it requires high levels of optimisation expertise and domain knowledge, limiting its widespread use. Moreover, the choice of solver and its optimal parameters varies across problem instances: when the problem instance changes, the process of searching for an appropriate solver and parameters is often restarted. This has raised interest in the level of generality at which optimisers operate. In recent years there has been increasing interest in treating combinatorial optimisation as a learning problem, in which optimisation instances and their solutions are used as training instances; the resulting learnt model is then considered a general solver. The models typically involve deep neural networks (DNNs), and recent state-of-the-art approaches adopt a reinforcement learning strategy for cases where supervised solutions are not plausible; the learnt policy then acts like a solver. In particular, deep reinforcement learning (DRL) with policy gradients can find close-to-optimal solutions to problems such as the TSP, CVRP, and 0-1 knapsack (all problems we consider are described in the Appendix). The previously mentioned state-of-the-art methods use standard policy gradient methods (PGM) built upon the REINFORCE algorithm. For standard reinforcement learning (RL) problems, the search space (landscape) is smaller and smoother, and optimisation is not difficult. For combinatorial optimisation problems such as the VRP and TSP, however, the search space, while not infinite, is difficult to traverse.
Therefore, for combinatorial optimisation problems, we need an effective method for moving through the search space.

The issue with applying standard policy gradient methods to CO problems is the fluctuating landscape of the problem, which can prevent proper gradient estimation (two nearby samples can have very different gradients) and makes the problem difficult to optimise: averaging over such samples produces high-variance gradient estimates. Additionally, in RL a situation often occurs in which the agent discovers a strategy that achieves a better reward than when it started, but that strategy makes it difficult to find the optimal solution, and the agent tends to take a single move over and over. As learning progresses, the agent's move distribution concentrates around that prediction, so it becomes unlikely to explore different actions. Consequently, instead of using standard PGM, our line of research is devoted to the question: how can we explore the search space effectively and efficiently to improve the solution of a combinatorial optimisation problem? A perfectly logical answer is to use an exploration strategy that covers the search space better; we need a method able to explore the search space effectively, which is not the case in previous learning methods. To address this, we add the entropy of the policy to the RL objective. The goal is to find the probability distribution with the highest entropy, which is one of the best representations of the current state. Such an exploration strategy can be used in RL wherever a neural network serves as function approximator. In this line of research, we contribute to the deep learning community for COP by introducing an entropy regularisation term into recent policy gradient-based methods (ERRL), which improves policy optimisation.
Entropy regularisation is believed to assist exploration by encouraging the selection of more stochastic policies.

Consequently, in this work we analyse the claim that a policy with higher entropy can change the optimisation landscape and maintain exploration, discouraging early convergence. To the best of our knowledge, the entropy regularisation term has not been studied or used in the existing learning-to-optimise literature for combinatorial problems. ERRL can be integrated with any existing policy gradient approach that uses parameterised functions to approximate policies; hence we apply the entropy technique to the state-of-the-art methods of Kool et al. and Nazari et al. We demonstrate the effectiveness of ERRL on three categories of routing problems. The goal of this work is not to outperform every existing state-of-the-art VRP learning algorithm in every respect but to provide direction in the study of RL approaches that encourage exploration on fundamental routing problems, given the aforementioned difficulties. The main contributions are as follows:

- We propose an approach using an entropy regularisation term that can solve route optimisation problems. We devise a new exploration-based, low-variance policy gradient method whose baseline helps select a more stochastic policy. The proposed method is verified on multiple types of routing problems: the capacitated vehicle routing problem (CVRP), multiple routing with fixed fleet problems (MRPFF), and travelling salesman problems (TSP).
- The generality of the proposed scheme is validated with different approaches and by evaluating the resulting method on various problem sizes (even at the high problem dimensionality of 100), achieving performance better than the state of the art in accuracy and time efficiency.
- As a further contribution, we use the 2-opt local search algorithm. This hybrid approach is an example of combining learned and traditional heuristics to improve solutions.

We also analyse existing inference techniques to show the impact of post-processing techniques on solution quality.

In recent years, many ways to solve COPs within the deep learning paradigm have been developed. Traditional heuristics for routing problems can be categorised as construction or improvement heuristics. A recent advance in neural networks is the design of a new model architecture called the Pointer Network (PN), which learns to solve combinatorial optimisation problems: an encoder (RNN) converts the input sequence, which is fed to a decoder (RNN). The model attends over the input and is trained in a supervised setting to solve Euclidean TSP instances; the goal is to use the Pointer Network architecture to find close-to-optimal tours from ground-truth optimal (or heuristic) solutions to the Travelling Salesman Problem (TSP). Another model takes a graph as input and extracts features from its nodes and edges; it can be viewed as a stack of several graph convolutional layers. The network outputs an edge adjacency matrix representing the probabilities of edges occurring in the TSP tour; these edge predictions, forming a heat map, are then transformed into a valid tour. That model is trained in a supervised manner on pairs of problem instances and optimal solutions. Despite these promising early applications, reinforcement learning is a compelling choice for learning to optimise, as it does not require a set of pre-solved solutions for training. Bello et al. first proposed a reinforcement learning approach in which a pointer network is trained with an actor-critic strategy to generate solutions for artificial planar TSP instances, designing a neural combinatorial optimisation framework that uses reinforcement learning to optimise the policy. S2V-DQN solves optimisation problems using a graph embedding structure and a deep Q-learning algorithm.
Recently, many deep learning-based approaches have appeared, but only a few propose a solution to the VRP. These approaches apply a deep RL model that generates solutions sequentially, one node at a time. Among constructive heuristics, Nazari et al. proposed a model that uses a recurrent neural network (RNN) decoder with an attention mechanism to build solutions for the CVRP and the SDVRP, trained using policy gradient (actor-critic) methods similar to Bello et al. For solution search they used beam search with a beam width of up to 10.

## Related Work

A graph attention network similar to that of Kool et al. generates solutions for different routing problems, including the TSP and CVRP, trained via RL; the model is trained with policy gradient RL using a baseline based on a deterministic greedy rollout. Our work can be classified as a constructive method for solving CO problems, but it differs from previous work. First, we perform entropy maximisation by adding an entropy regularisation term to the RL objective, preventing premature convergence of the policy and changing the gradient. Second, we combine classical heuristics with the learned model to improve solution quality further. To promote exploration of the search space, we need an effective exploration-based algorithm; the regularisation term is the fundamental property distinguishing our approach from the rest. Entropy Regularised RL (ERRL) can be summarised as follows: instead of learning deterministically and committing to decisions early, we show that stochastic state-space models can be learned effectively with a well-designed network that encourages exploration. Other work focuses on iteratively improving solutions: for example, one RL-based improvement approach iteratively chooses a region of a graph representation of the problem and then selects and applies established local heuristics; a perturbation operator further improved this approach. After training the neural network, existing techniques can be applied to improve the quality of the solutions it generates: in Bello et al., active search optimises the policy on a single test instance; Bello et al. also use a sampling method that selects the best among multiple candidate solutions; beam search is another widely used technique for improving the efficiency of sampling; and local search operators are popular for further enhancing solution quality.
Following this line, in ERRL we apply the classical 2-opt local search operator to post-process and improve the solution. In addition, we combine several inference techniques with our ERRL method to show the importance of search in ML-based approaches to combinatorial optimisation.

## Motivation

In recent work, policy gradient methods (PGM) have been used to solve CO problems. The key idea in policy optimisation is to learn the parameters $\theta$ of a policy $\pi_{\theta}(a|s)$, where $s \in S$ is a state and $a \in A$ an action. The policy gradient method considers the objective:
$$J_{ER}(\theta) = \sum_{s\in S} d^{\pi_{\theta}}(s) \sum_{a\in A}\pi_{\theta}(a|s) Q^{\pi_{\theta}}(s,a)$$
where $d^{\pi_{\theta}}$ is the stationary distribution of states and $Q^{\pi_{\theta}}(s_t,a_t)$ is the expected discounted sum of rewards obtained by starting in state $s$, taking action $a$, and thereafter sampling actions according to the policy, $a\sim\pi_{\theta}(\cdot|s_t)$; it is the value function of the policy $\pi$ and can be estimated by Monte Carlo rollouts. We seek the parameters $\theta$ that maximise the objective $J_{ER}$, i.e. the expected discounted cumulative reward; the equation characterises a policy with the highest expected reward from the agent's actions. However, many issues are encountered when applying current PGM approaches to combinatorial optimisation, which is not easy. One way to characterise the difficulty of an optimisation problem is through its search space, also known as its landscape; candidate solutions are points on this landscape. Exact transition and reward dynamics are not accessible, so the gradient of $J_{ER}$ given by the policy gradient theorem cannot be evaluated directly; instead, $\nabla J_{ER}$ is estimated from Monte Carlo samples, where each trajectory is defined as a sequence of states and actions. Previous learning-based models never give the agent prior knowledge of previously visited states, yet it is perfectly logical that an agent needs experience of such states to achieve maximum reward (this encourages exploration). Consequently, the RL objective must be changed so that the model can achieve the highest expected reward with low variance when solving COPs.
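
The Monte Carlo estimation described above can be illustrated with a toy sketch (not the paper's implementation): a one-step softmax policy whose REINFORCE-style gradient $\nabla J = \mathbb{E}_{a\sim\pi}[R(a)\,\nabla\log\pi(a)]$ is estimated from sampled actions. The policy, reward table, and sample count are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, reward_per_action, n_samples=10000, seed=0):
    """Monte-Carlo estimate of grad J = E_{a~pi}[ R(a) * grad log pi(a) ]
    for a one-step softmax policy (a toy stand-in for a routing policy)."""
    rng = np.random.default_rng(seed)
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_samples):
        a = rng.choice(len(pi), p=pi)
        glogp = -pi.copy()          # grad log pi(a) for softmax: one_hot(a) - pi
        glogp[a] += 1.0
        grad += reward_per_action[a] * glogp
    return grad / n_samples
```

With two actions, rewards (1, 0) and uniform initial logits, the estimate pushes probability mass toward the rewarded action, mirroring how the routing policy is nudged toward shorter tours.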

## Entropy Regularised Reinforcement Learning (ERRL)

### Encourage Exploration

In our model, we are given a problem instance $s$ (a set of nodes $(n_1,\cdots,n_m)$) and a policy $\pi$ parameterised by $\theta$. The policy is trainable and produces a valid solution to the problem. A solution is written as $L = (a_1,\cdots, a_m)$, where the $i$-th action $a_i$ selects a node of the tour (solution). The neural network generates the solution one node at a time, in an autoregressive manner, stochastically following the policy $\pi_{t} = P_{\theta}(a_t | s)$, where $s$ is the problem instance (defined as the state).
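
The autoregressive construction just described can be sketched minimally as follows. This is not the paper's network: `policy_logits` is a hypothetical callable that scores all $m$ nodes given the partial tour; visited nodes are masked before sampling, so the result is always a valid permutation.

```python
import numpy as np

def construct_tour(policy_logits, rng=None):
    """Autoregressively sample a tour: at each step, mask visited nodes,
    softmax the remaining logits, and sample the next node.
    policy_logits(partial_tour) -> scores over all m nodes (assumed interface)."""
    rng = rng or np.random.default_rng(0)
    m = np.asarray(policy_logits([]), dtype=float).size   # infer node count
    tour, visited = [], np.zeros(m, dtype=bool)
    for _ in range(m):
        logits = np.asarray(policy_logits(tour), dtype=float).copy()
        logits[visited] = -np.inf                 # forbid revisiting nodes
        z = logits - logits.max()
        p = np.exp(z)
        p /= p.sum()
        a = int(rng.choice(m, p=p))
        tour.append(a)
        visited[a] = True
    return tour
```

Even with a constant (uninformative) scorer, the masking guarantees that each node is selected exactly once.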

The ERRL method brings an exploration strategy to CO problems. Our ERRL model starts from a node and explores, as shown in the figure. The network samples N trajectories ($l^1,l^2,\cdots,l^N$) using the Monte Carlo method, where the $i$-th trajectory is denoted $l^i$. The figure illustrates how the ERRL model learns from previously visited states (in our case, this previously stated data is the experience of the agent). Consider two distributions over the Q-values of four actions: distribution 1 = $(a_0{:}~0.26,\ a_1{:}~0.28,\ a_2{:}~0.24,\ a_3{:}~0.22)$ and distribution 2 = $(a_0{:}~0.1,\ a_1{:}~0.08,\ a_2{:}~0.8,\ a_3{:}~0.12)$. In the first distribution all probabilities are similar and low; in the second, $a_2$ has a high probability while the other actions have low ones. In some cases the agent keeps using an action in the future merely because it already achieved some positive reward when it started. The agent then does not tend to explore, even though another action could have a much higher reward; it exploits what it has already learned. As a result the agent can get stuck in a local optimum, never examining the behaviour of other actions and never finding the global optimum. By adding entropy, we encourage exploration and avoid getting stuck in local optima. The entropy-augmented conventional RL objective is formalised in the next subsection; in this work, entropy maximisation is carried out by adding an entropy regularisation term to the RL objective. Hence, when all actions are nearly equally good, the entropy regularisation in ERRL improves policy optimisation, maximising reward while improving exploration.
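
The two example distributions above can be compared directly by their Shannon entropy. A small sketch (the quoted figures for distribution 2 do not sum exactly to one, so they are normalised first; natural log is assumed):

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum p log p (natural log)."""
    total = sum(probs)
    probs = [p / total for p in probs]   # normalise the approximate figures
    return -sum(p * math.log(p) for p in probs if p > 0)

near_uniform = [0.26, 0.28, 0.24, 0.22]   # distribution 1: all actions similar
peaked       = [0.10, 0.08, 0.80, 0.12]   # distribution 2: a_2 dominates
```

The near-uniform distribution has entropy close to the maximum $\ln 4 \approx 1.386$, while the peaked one is far below it, which is exactly the signal the entropy bonus rewards.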

We augment the reward with an entropy regularisation term, so that the agent receives extra reward in proportion to the entropy of $\pi$:

$$J(\theta) = \mathbb{E}_{P_{\pi_{\theta}}}\Big[\sum_{t=1}^{T} r(s_t,a_t) + \alpha\, H\big(\pi_{\theta}(\cdot|s_t)\big)\Big]$$

This supports exploration by encouraging the selection of more stochastic policies. Here $\pi$ is a policy, $\pi^*$ is the optimal policy, $T$ is the number of time steps, $s_t \in S$ is the state at time step $t$, $a_t \in A$ is the action at time step $t$, and $P_{\pi}$ is the distribution of trajectories induced by policy $\pi$. $H(\pi_{\theta}(\cdot|s_t))$ is the entropy of the policy $\pi$ at state $s_t$, calculated as $H(\pi_{\theta}(\cdot|s_t)) = -\mathbb{E}_{a\sim\pi_{\theta}(\cdot|s_t)}[\log \pi_{\theta}(a|s_t)]$, and $\alpha$ is a temperature parameter that controls the strength of the entropy regularisation term, i.e. the trade-off between optimising for the reward and for the entropy of the policy; the entropy bonus plays an important role in the reward (the value of $\alpha$ is discussed in the experiments section). Maximising this objective helps the agent explore more and acquire knowledge of the best representation of the state, which is likely to be a probability distribution with high entropy; in other words, the agent benefits from the experience of previously visited states. The entropy term, defined over the outputs of the policy network, is incorporated into the loss function of the policy network, so that exploration supports maximising the reward: adding an entropy term to the loss encourages the policy to take diverse actions. The entropy $H(\pi(a_t|s))$ is used in the loss function to encourage the policy $\pi_t = \pi_{\theta}(a_t|s)$.

We use the entropy regularisation term with the current approaches of Kool et al. and Nazari et al., augmenting the rewards with the entropy term $H(\pi(\cdot|s_{t})) = \mathbb{E}_{a\sim\pi(\cdot|s_{t})}[-\log \pi(a|s_t)]$, weighted by $\alpha$. This helps exploration by encouraging the selection of more stochastic policies (over deterministic ones); it also prevents premature convergence and makes learning stable and sample efficient, as we demonstrate in the experiments section. In the objective above, $Q^{\pi_\theta}(s,a)$ becomes $Q^{\alpha,\pi_\theta}(s,a)$, the expected discounted sum of entropy-augmented rewards, which can be estimated by executing $\pi_{\theta}$ in the environment. This yields a slightly different value function, since entropy is added at every time step. The term encourages the policy to assign equal probability to actions with the same (or nearly the same) total expected reward, exploring the search space, and prevents the agent from repeatedly selecting one particular action that might exploit inconsistencies in the approximation of $Q^{\pi_\theta}(s,a)$, which can suffer from high variance; instead, exploration is explicitly encouraged. The expected discounted sum of rewards $Q^{\pi}(s_t,a_t)$ depends on the policy $\pi_{\theta}$, so any change in the policy is reflected in both the objective and the gradient.
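
A minimal sketch of the entropy-augmented loss described above, assuming per-tour log-probabilities and per-step action distributions are available from the policy network (the names and shapes are illustrative assumptions; $\alpha=0.3$ follows the value reported in the experiments section):

```python
import numpy as np

def errl_step_loss(logp_tours, returns, baseline, probs, alpha=0.3):
    """Entropy-regularised REINFORCE loss for a batch of N sampled tours.

    logp_tours: log P_theta(l^i | s) for each sampled tour, shape (N,)
    returns:    R(l^i) for each tour, shape (N,)
    baseline:   b(s), a scalar or shape-(N,) baseline
    probs:      per-step action distributions, shape (N, T, A)
    alpha:      entropy temperature (assumed hyperparameter)
    """
    advantage = np.asarray(returns) - baseline               # R(l^i) - b(s)
    ent = -(probs * np.log(probs + 1e-12)).sum(axis=-1)      # H per step, (N, T)
    # minimise the negative of the entropy-augmented objective
    return -np.mean(advantage * np.asarray(logp_tours) + alpha * ent.mean(axis=-1))
```

When the advantage is zero and the per-step distributions are uniform over $A$ actions, the loss reduces to $-\alpha \ln A$, the maximum entropy bonus.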

We train the attention model using our Entropy Regularised Reinforcement Learning, naming the result ERRL1. The problem instance $s$ is a graph, and the model is a Graph Attention Network. The attention-based model defines a stochastic policy $p(\pi|s)$ for selecting a solution $\pi$ given the problem instance $s$, parameterised by $\theta$. The network samples N solution trajectories, and we calculate the total reward $R(l^i)$ of each solution $l^i$. We use gradient ascent with the following approximation to maximise the expected return J:
$$\nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \big(R(l^i) - b^i(s)\big)\,\nabla_{\theta} \log P_{\theta}(l^i|s)$$
To train our ERRL1 model, we optimise the objective using the REINFORCE gradient estimator, augmented with an entropy term, along with a baseline. To encourage exploration and avoid premature convergence to a sub-optimal policy, we add an entropy bonus:
$$\nabla_{\theta}J(\theta) \approx \dfrac{1}{N} \sum_{i=1}^{N} \Big[\big(R(l^i) - b^i(s)\big)\,\nabla_{\theta} \log P_{\theta}(l^i|s) + \alpha\, \nabla_{\theta} H\big(\pi_\theta(\cdot|s_t)\big)\Big]$$
Here, reinforcement learning with a baseline learns both $R(l^i)$ and $b^i(s)$, where the average return serves as the baseline, with entropy bonuses included. The entropy term prevents premature convergence and results in a slightly different gradient with a changed value function; larger entropy means a more stochastic policy.
A good baseline $b^i(s)$ reduces gradient variance and therefore increases the speed of learning. After generating the solution trajectories ($l^1,l^2,\cdots,l^N$), we use a greedy-rollout baseline scheme, assessing each sample rollout independently. With this baseline, each trajectory competes with the N-1 others, and the network will not select two similar trajectories. With an increased number of heterogeneous trajectories all contributing to setting the baseline at the right level, premature convergence to a suboptimal policy is discouraged in favour of a well-explored policy. Afterwards, similarly to Kool et al., we update the parameters via Adam (adaptive moment estimation), a variant of stochastic gradient descent, on the combined objective. Details are given in the ERRL algorithm listing.
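
One plausible reading of the shared baseline described above, in which each of the N sampled trajectories competes with the other N-1, is a leave-one-out mean of the sampled returns. This is a sketch of that interpretation under stated assumptions, not necessarily the exact scheme used:

```python
import numpy as np

def shared_baseline_advantage(returns):
    """Leave-one-out baseline: each of the N sampled tours competes with the
    mean return of the remaining N-1 tours, so no single trajectory dominates."""
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    loo_mean = (returns.sum() - returns) / (n - 1)   # mean of the other N-1
    return returns - loo_mean                        # advantage per trajectory
```

A useful property of this construction is that the advantages sum to zero across the batch, which keeps the gradient estimate centred and low variance.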

## Experiments

All of our ERRL experiments use policy gradient methods. We emphasise that ERRL can be applied to any PGM; to support this claim, we apply it to the two existing algorithms introduced by Kool et al. and Nazari et al. We use the datasets described by Kool et al. for the TSP and CVRP for ERRL1 and ERRL2. For MRPFF we generate a dataset in which the shortest path connecting all N nodes $(n_1,\cdots,n_N)$ must be found, where the distance between two nodes is the 2D Euclidean distance. The location of each node is sampled uniformly at random from the unit square; in this problem the vehicle does not need to fulfil any customer demand but optimises multiple routes. We use the same architecture settings throughout all experiments and initialise parameters uniformly, as in Kool et al.; policy gradients are averaged over a batch of 128 instances. The Adam optimiser is used with a learning rate of 0.0001 and weight decay (L2 regularisation). To keep the training conditions simple and identical across experiments we do not apply a decaying learning rate, although we recommend a fine-tuned decaying learning rate in practice for faster convergence. In every epoch we process 2500 batches of 512 instances generated randomly on the fly. Training time varies with problem size; the learning curves show that most of the learning is already completed by 200 epochs. We set the entropy weight manually for all problems; results for other values are presented in the Appendix, and the best performance was achieved with $\alpha=0.3$ and learning rate 0.0001.

| Method | CVRP20 TourL | Gap(%) | Time | CVRP50 TourL | Gap(%) | Time | CVRP100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 (solver) | 6.14 | 0.00 | 28(m) | 10.39 | 0.00 | 112(m) | 15.67 | 0.00 | 211(m) |
| Random CW* | 6.81 | 10.91 | - | 12.25 | 17.90 | - | 18.96 | 20.99 | - |
| Random Sweep* | 7.01 | 14.16 | - | 12.96 | 24.73 | - | 20.33 | 29.73 | - |
| Or-tools* | 6.43 | 4.73 | - | 11.43 | 10.00 | - | 17.16 | 9.50 | - |
| L2I (improvement) | 6.12 | - | 12(m) | 10.35 | - | 17(m) | 15.57 | - | 24(m) |
| Kool et al. | 6.67 | 8.63 | 0.01 | 11.00 | 5.87 | 0.02 | 16.99 | 8.42 | 0.07 |
| Nazari et al. | 7.07 | 15.14 | 6.41 | 11.95 | 15.01 | 19 | 17.89 | 14.16 | 42 |
| ERRL1 (ours, constructive) | 6.34 | 3.25 | 0.01 | 10.77 | 3.65 | 0.02 | 16.38 | 4.53 | 0.06 |
| ERRL2 (ours, constructive) | 6.67 | 8.63 | 5.4 | 11.01 | 2.63 | 14 | 17.23 | 9.95 | 33 |
| ERRL2(2opt) (ours) | 6.18 | 0.65 | 5.49(m) | 10.56 | 1.63 | 24.28(m) | 16.16 | 3.12 | 65(m) |

### Capacitated Vehicle Routing Problem (CVRP)

In the CVRP table we group the baselines into the solver (LKH3), non-learning baselines, constructive approaches, and improvement approaches (two further algorithms, Chen and Tian and L2I, fuse the strengths of Operations Research heuristics with the learning capabilities of reinforcement learning; we report the results from their papers). Our method is directly comparable to the constructive approaches, evaluated on 1000 random CVRP instances of CVRP20, CVRP50, and CVRP100. ERRL2 slightly improves the solutions, while ERRL1 finds near-optimal solutions for all problem sizes. ERRL1 and ERRL2 combined with 2-opt significantly outperform all other learning approaches, both in solution quality and solving time, as the table shows. For all results, the learning algorithms and baselines were run using their publicly available code, except Chen and Tian and L2I. The learning curves for CVRP50 and TSP50 show that ERRL training is more stable and converges faster than the model of Kool et al.; we also observe that most of the learning is completed within 200 epochs for both problems. After each training epoch, we generate 1000 random instances to use as a validation set.

### Multiple Routing with Fixed Fleet Problems (MRPFF)

To analyse the generalisation of the ERRL method, we created a new routing problem to test how our model performs compared to existing methods. Results for the Multiple Routing with Fixed Fleet Problems (MRPFF) with 20, 50, and 100 customer nodes are reported in the MRPFF table; the ERRL models outperform all existing methods.

| Method | MRPFF20 TourL | Gap(%) | Time | MRPFF50 TourL | Gap(%) | Time | MRPFF100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| LKH3 (solver) | 5.34 | 0.00 | 17(m) | 9.12 | 0.00 | 90(m) | 13.16 | 0.00 | 145(m) |
| Nazari et al. (constructive) | 6.79 | 27.15 | 6 | 10.85 | 18.96 | 18.98 | 15.97 | 21.35 | 33 |
| Kool et al. (constructive) | 5.99 | 12.17 | 0.01 | 10.30 | 12.93 | 0.03 | 14.67 | 11.47 | 0.07 |
| ERRL1 (ours) | 5.51 | 3.18 | 0.01 | 10.10 | 10.74 | 0.03 | 13.97 | 6.15 | 0.07 |
| ERRL2 (ours) | 5.70 | 6.74 | 5 | 10.40 | 14.03 | 14 | 14.40 | 9.42 | 30 |
| ERRL1(2Opt) (ours) | 5.45 | 2.05 | 2(m) | 9.39 | 2.96 | 15(m) | 13.80 | 4.86 | 23(m) |
| ERRL2(2opt) (ours) | 5.60 | 4.86 | 4.41(m) | 9.79 | 7.34 | 17(m) | 14.12 | 7.29 | 45(m) |

### Travelling Salesman Problem (TSP)

In this section we evaluate whether our performance is comparable with existing work, as previous state-of-the-art approaches typically focus on random TSP data. For the TSP, we report optimal results from Concorde and LKH3. We also compare against Nearest, Random, and Farthest Insertion, as well as Nearest Neighbour. The TSP table shows the performance of our techniques compared to the solvers, heuristics, and state-of-the-art learning techniques for various TSP instance sizes. The table is separated into four sections: solvers; heuristics; learning models using reinforcement learning (RL); and learning models using supervised learning (SL). All results were obtained using the publicly available code, except for the Graph Convolutional Network (GCN), whose results are taken from the original paper; we ran PN, Bello et al., EAN, Kool et al., and Nazari et al. ourselves, and report the results of our implementation. We achieve satisfactory results and report the average tour lengths of our approaches on TSP20, TSP50, and TSP100 in the table. Our models perform better than Nazari et al. for all TSP sizes, and our greedy ERRL decoder outperforms not only all the traditional baselines but also Kool et al. Furthermore, we report execution times for all instances. Run times are important but can vary with the implementation language (Python or C++) and the hardware (GPU or CPU); we ran all approaches on the same hardware platform, using Python implementations, since experimental results can vary across platforms. In the table, we report running times from our runs of the publicly available code; times reported by others are not directly comparable. 
We only report execution times for the directly comparable baselines. We report the average solution time (in seconds) over a test set of 1000 instances.

### ERRL combine with 2-Opt

As a further enhancement, we use the 2-opt local search algorithm to improve our results at test time. We show that the model can produce improved results through a 'hybrid' approach that combines a learned algorithm with local search; this is an example of combining learned and traditional heuristics. Many recent works have shown that the design of the search procedure has an immense impact on the performance of an ML approach; François et al. showed in their results that improvement can come from the search procedure rather than from the learning itself. With local search added, ERRL1 and ERRL2 with the 2-opt heuristic perform much better than without 2-opt, as shown in the CVRP, MRPFF, and TSP tables.
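
The 2-opt operator used at test time can be sketched as follows: repeatedly reverse a tour segment whenever the reversal shortens the tour, until no improving move remains. This is an $O(n^2)$-per-pass illustration on Euclidean coordinates, not the exact implementation used in our experiments.

```python
import math

def tour_length(tour, coords):
    """Total length of a closed tour over 2D coordinates."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, coords):
    """Reverse the segment between positions i and j whenever that shortens
    the tour; stops at a 2-opt local optimum."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                cand = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(cand, coords) < tour_length(best, coords) - 1e-12:
                    best, improved = cand, True
    return best
```

For example, on the unit square a crossing tour such as 0-2-1-3 is uncrossed into the optimal perimeter tour of length 4.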

| Method | TSP20 TourL | Gap(%) | Time | TSP50 TourL | Gap(%) | Time | TSP100 TourL | Gap(%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| Concorde (solver) | 3.83 | 0.00 | 4(m) | 5.70 | 0.00 | 10(m) | 7.77 | 0.00 | 55(m) |
| LKH3 (solver) | 3.83 | 0.00 | 42(s) | 5.70 | 0.00 | 59(m) | 7.77 | 0.00 | 25(m) |
| Nearest Insertion (heuristic) | 4.33 | 13.05 | 1(s) | 6.78 | 18.94 | 2(s) | 9.45 | 21.62 | 6(s) |
| Random Insertion (heuristic) | 4.00 | 4.43 | 0(s) | 6.13 | 7.54 | 1(s) | 8.51 | 9.52 | 3(s) |
| Farthest Insertion (heuristic) | 3.92 | 2.34 | 1(s) | 6.01 | 5.43 | 2(s) | 8.35 | 7.46 | 7(s) |
| Or-tools (heuristic) | 3.85 | 0.52 | - | 5.80 | 1.75 | - | 8.30 | 6.82 | - |
| PN (SL) | 3.88 | 1.30 | - | 6.62 | 16.14 | - | 10.88 | 40.20 | - |
| GCN* (SL) | 3.86 | 0.78 | 6(s) | 5.87 | 2.98 | 55(s) | 8.41 | 8.23 | 6(m) |
| Bello et al. (RL) | 3.89 | 1.56 | - | 5.99 | 5.08 | - | 9.68 | 24.73 | - |
| EAN(2Opt) (RL) | 3.93 | 2.61 | 4(m) | 6.63 | 16.31 | 26(m) | 9.97 | 28.31 | 178(m) |
| Kool et al. (RL) | 3.85 | 0.52 | 0.001(s) | 5.80 | 1.75 | 2(s) | 8.15 | 4.89 | 6(s) |
| Nazari et al. (RL) | 4.00 | 4.43 | - | 7.01 | 22.76 | - | 9.46 | 21.75 | - |
| ERRL1 (ours) | 3.83 | 0.00 | 0.01(s) | 5.74 | 0.70 | 2(s) | 7.86 | 1.15 | 6(s) |
| ERRL2 (ours) | 3.86 | 0.78 | 7(s) | 5.76 | 1.05 | 18(s) | 7.92 | 1.93 | 31(s) |
| ERRL1(2opt) (ours) | 3.81 | - | 1(m) | 5.71 | 1.57 | 11.87(m) | 7.77 | 0.00 | 25(m) |
| ERRL2(2opt) (ours) | 3.83 | 0.00 | 4(m) | 5.73 | 0.52 | 13(m) | 7.80 | 0.38 | 37(m) |

### Impact of the solution search strategy

The recently developed L2I outperforms LKH3; our method differs from L2I in terms of speed, and ERRL is a purely data-driven way to solve COPs. The ERRL model uses the 2-opt local search operator, but only at test time, whereas L2I is a specialised routing-problem solver based on a handcrafted pool of improvement and perturbation operators. ERRL is a construction-type neural network model for CO problems. Previous methods, including ours, generally support two modes of inference. In greedy mode, a single deterministic trajectory is drawn using the argmax of the policy; in "sampling mode", multiple trajectories are sampled from the network following the probabilistic policy.

In the experiments, we used two different decoders: greedy search, described in Figure search and referred to as ERRL1Gr, and beam search (BS), referred to as ERRL1BS in Figure search. Our results show that beam search improves solution quality at a slight increase in computation time. We evaluated these two inference techniques, plus one inference technique combined with the 2-opt operator at inference time, and experimentally showed that ML approaches benefit from a search procedure, as presented in Figure search. With the ERRL1 model, greedy search and beam search obtain optimality gaps of 1.15% and 0.38% respectively, and combining ERRL1 with the 2-opt operator reduces the gap from 1.15% to 0 on TSP instances with 100 nodes, although the execution time is higher. Figure search shows the impact of fusing a search procedure with the ML model: we use greedy search, beam search, and greedy search combined with 2-opt on ERRL1 to demonstrate the importance of search in ML-based approaches to combinatorial optimisation.
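The two inference modes can be illustrated with a minimal NumPy sketch. Here `policy_logits_fn`, which maps a partial tour to per-node logits, is a hypothetical stand-in for the trained policy network; the masking and argmax/sampling logic is the part the text describes:

```python
import numpy as np

def greedy_decode(policy_logits_fn, n_nodes, start=0):
    """Greedy mode: a single deterministic trajectory via argmax on the policy."""
    tour, visited = [start], {start}
    while len(tour) < n_nodes:
        logits = policy_logits_fn(tour)  # shape (n_nodes,)
        # Mask already-visited nodes so they cannot be selected again.
        logits = np.where([i in visited for i in range(n_nodes)], -np.inf, logits)
        nxt = int(np.argmax(logits))
        tour.append(nxt)
        visited.add(nxt)
    return tour

def sample_decode(policy_logits_fn, n_nodes, start=0, rng=None):
    """Sampling mode: each step is drawn from the softmax of the masked logits."""
    rng = rng or np.random.default_rng(0)
    tour, visited = [start], {start}
    while len(tour) < n_nodes:
        logits = policy_logits_fn(tour)
        logits = np.where([i in visited for i in range(n_nodes)], -np.inf, logits)
        probs = np.exp(logits - logits[np.isfinite(logits)].max())
        probs /= probs.sum()
        tour.append(int(rng.choice(n_nodes, p=probs)))
        visited.add(tour[-1])
    return tour
```

In sampling mode, the best of many sampled trajectories is typically reported; beam search generalises greedy mode by keeping the top-k partial tours at each step.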


## Conclusion and Future Direction

In this work, we presented the ERRL model, which encourages exploration and improves performance on many optimisation problems when incorporated into a policy gradient method. We demonstrated that the entropy-augmented reward helps the model avoid local optima and prevents premature convergence, and that the model significantly outperforms existing machine-learning-based approaches. We expect that the proposed architecture is not limited to route optimisation problems; applying it to other combinatorial optimisation problems is an important direction for future research. Recent learning-based algorithms for COPs are model-free deep RL methods, which are notoriously expensive in terms of sample complexity; this challenge severely limits their applicability to real-world tasks. Our current ERRL model has lower sample complexity. Nevertheless, it can be improved for real-world tasks so that it is not brittle with respect to its hyper-parameters and can balance exploration and exploitation for the task at hand. Our future goal is to develop a model that, instead of requiring the user to set the temperature manually (as we currently do for the alpha value), automates the process by reformulating the maximum-entropy reinforcement learning objective with the entropy treated as a constraint. For future work, we need effective methods that are competitive with the model-free state of the art in combinatorial domains, and more sample-efficient RL algorithms may be a key ingredient for learning on larger problems. In addition, finding the search procedure best adapted to an ML model is still an open question.

## Problems: TSP, CVRP, MRPFF

The goal of our model is to generate the minimum total route length of the vehicle, where the formulation should cover any routing problem, such as the Capacitated Vehicle Routing Problem (CVRP), the Multiple Routing Problem with Fixed Fleet size (MRPFF) and the TSP. Let G = (V, E) denote a weighted graph, where V is the set of nodes and E the set of edges. In a VRP instance, the input V is a set of nodes in which node $a_0$ represents the depot and all other nodes represent the customers that need to be visited; multiple vehicles may serve the customers. The Vehicle Routing Problem is to find a set of routes, all starting and ending at the depot, with minimal cost, such that each customer is visited by exactly one vehicle. Given a set of customers to serve, we must design routes for each available vehicle, all starting from the single depot $a_0$; each route must start at the depot, visit a subset of customers and then return to the depot. The objective is to determine a set of minimal-cost routes that satisfy all the requirements above. With these parameters, the CVRP is formulated as

$$\min \sum_{a_i \in V}\sum_{a_j \in V} c_{a_ia_j}\, x_{a_ia_j}$$

subject to

$$\sum_{a_i \in V} x_{a_ia_j} = 1 \quad \forall a_j \in V \setminus \{a_0\},$$

$$\sum_{a_j \in V} x_{a_ia_j} = 1 \quad \forall a_i \in V \setminus \{a_0\},$$

$$\sum_{a_i \in V} x_{a_ia_0} = K, \qquad \sum_{a_j \in V} x_{a_0a_j} = K,$$

$$\sum_{a_i \notin S}\,\sum_{a_j \in S} x_{a_ia_j} \geq r(S) \quad \forall S \subseteq V \setminus \{a_0\},\ S \neq \emptyset.$$

Here $c_{a_ia_j} \in \mathcal{R}^{+}$ is the cost of travelling from node $a_i$ to node $a_j$, and $x_{a_ia_j} \in \{0,1\}$, with $a_i, a_j \in V$, is a binary variable that takes the value 1 if the edge from $a_i$ to $a_j$ is part of the solution and 0 otherwise. K is the number of available vehicles, and r(S) is the minimum number of vehicles needed to serve the set S. The last family of constraints are the capacity-cut constraints, which impose that the routes must be connected and that the demand on each route must not exceed the vehicle capacity.
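To make the objective and constraints concrete, the following sketch evaluates a candidate CVRP solution against them (the helper names `route_cost` and `cvrp_cost` are our own illustration, assuming the depot is node 0 and each inner list is one vehicle's route):

```python
import math

def route_cost(route, coords):
    """Total Euclidean length of one depot-to-depot route (depot is node 0)."""
    path = [0] + route + [0]
    return sum(math.dist(coords[a], coords[b]) for a, b in zip(path, path[1:]))

def cvrp_cost(routes, coords, demands, capacity):
    """Sum of route lengths; raises if a coverage or capacity constraint fails."""
    served = [c for r in routes for c in r]
    # Degree constraints: each customer appears in exactly one route, once.
    assert sorted(served) == list(range(1, len(coords))), "each customer exactly once"
    for r in routes:
        # Capacity-cut analogue: demand on each route must fit the vehicle.
        assert sum(demands[c] for c in r) <= capacity, "capacity exceeded"
    return sum(route_cost(r, coords) for r in routes)
```

The number of routes plays the role of K in the formulation, and the per-route capacity check corresponds to the capacity-cut constraints.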

We also assume that node $a_0$ is the depot(). In this work, we additionally use another variant of the VRP, the Multiple Routing Problem with Fixed Fleet size (MRPFF), where customer locations lie in a 2D Euclidean space and we must design routes for one vehicle available at a single depot, $a_0$. Each route must start at the depot, visit a subset of customers and then return to the depot. The vehicle does not need to satisfy any demands but must optimise the set of routes it creates, each starting and ending at the depot node. The MRPFF thus has no demands, and we consider only one fixed vehicle visiting two sets of customers (two sets of routes). Another routing problem is the TSP: given a set of points (cities), a tour is a sequence in which each city is visited exactly once, and the TSP is to find a tour that minimises the total travel distance between consecutive pairs of cities.
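The TSP objective just described is the closed-tour length; a minimal sketch (the function name is our own):

```python
import math

def tsp_tour_length(tour, coords):
    """Length of a closed tour: sum of distances between consecutive cities,
    including the edge from the last city back to the first."""
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))
```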

As an inference technique, we use 2-opt to further improve the solutions. In a 2-opt move, when two edges are removed there is only one alternative feasible reconnection; the procedure searches for edge swaps, replacing the removed edges with new ones, that yield a shorter tour. Moreover, sequential pairwise operators such as k-opt moves can be decomposed into simpler l-opt ones, where l < k; in our work we use 2-opt, the simplest such move.

## Policy Network (ERRL1)

We apply our approach on top of the neural architecture of() to improve solutions of routing problems; here we briefly describe that attention model architecture. The model consists of an attention-based encoder and a decoder network. The encoder produces embeddings of all input nodes, and the decoder produces the sequence $\pi$ over the inputs one node at a time, conditioned on the encoder embeddings and a problem-specific mask and context. Once a partial tour has been constructed it cannot be changed, and the remaining task is to find a path from the last node through the unvisited nodes back to the first node; the decoder context therefore includes the embeddings of the first and last nodes. The attention-based encoder embeds the input nodes and processes them through N sequential layers, each consisting of a multi-head attention sub-layer and a feed-forward sub-layer, and the graph embedding is computed from the node embeddings. The attention layers follow the Transformer architecture(): each has two sub-layers, a multi-head attention layer that performs message passing between the nodes, and a node-wise fully connected feed-forward layer. The decoder constructs the solution sequentially, outputting the node $\pi_t$ at each time step based on the embeddings from the encoder. As in(), the graph is augmented with a special context node representing the decoding context, and output probabilities are computed by a final decoder layer with a single attention head. We apply our approach to the() model and report improved results on a number of combinatorial optimisation (routing) problems.
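The final single-head output step of such a decoder can be sketched roughly as follows (a simplified NumPy illustration with dot-product compatibilities and tanh-clipped logits; `decoder_step` and its signature are our own, not the original implementation):

```python
import numpy as np

def decoder_step(query, node_emb, visited, clip=10.0):
    """One decoding step: single-head attention turns the context query into
    output probabilities over the unvisited nodes."""
    d = node_emb.shape[1]
    scores = node_emb @ query / np.sqrt(d)  # compatibility with the context
    logits = clip * np.tanh(scores)         # clip logits to [-clip, clip]
    logits[list(visited)] = -np.inf         # mask nodes already in the tour
    m = logits[np.isfinite(logits)].max()   # stable softmax over the rest
    probs = np.exp(logits - m)
    return probs / probs.sum()
```

In the real model, `query` would be built from the graph embedding plus the first- and last-node embeddings, and `node_emb` would come from the N encoder layers.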

## Neural Network Architecture (ERRL2)

### Policy Network

The ERRL2 experiments use the same model as Nazari et al. for both the TSP and the CVRP (we refer to this as the original Nazari et al. model). For the MRPFF we use the same problem setting as for the CVRP, except that there is no customer demand. The policy network is a sequence-to-sequence (S2S) model with an attention mechanism: inputs are fed sequentially, a decision is made at each time step, and the solution is generated as a sequence of customer locations on a 2D Euclidean plane. Given an input sequence, the model estimates the conditional probability of the output sequence, similar to(). Commonly, recurrent neural networks are used in sequence-to-sequence models to estimate this conditional probability, and the standard sequence-to-sequence setting assumes the set of output elements is fixed. Unlike that setting, however, the VRP solution (output) is a permutation of the problem nodes (input). To produce the solution, an attention mechanism is used (see, for example,()). The attention mechanism queries information from all elements of the input node set: an affinity function between each node and the current model output produces a set of scalars (aggregating many signals into one), and a softmax over these scalars yields the attention weight given to each element of the input set at each time step. We define the combinatorial optimisation problem over a given set of inputs.

Between decoding steps, some elements of the input may change. For instance, in the VRP, the remaining customer demands change over time as the vehicle visits customer nodes; we might also consider a variant in which new customers arrive or adjust their demand values over time, independently of the vehicle's decisions(). Formally, we represent each input by a sequence of tuples. We start from an arbitrary input, and at every decoding step a pointer selects one available input, which determines the input to the next decoder step, until a terminating condition is satisfied, as in Nazari et al. The inputs are given to an encoder that embeds them into latent-space vectors; these embedded vectors are combined with the output of a decoder, which points to one of the elements of the input. This process generates a sequence and ends when a terminating condition is satisfied, e.g., when a specific number of steps has been completed. In dynamic route optimisation, for example in the CVRP, the input includes all customer locations and demands as well as the depot location; the remaining demands are then updated according to the vehicle's destination and load, and the terminating condition is that no demand remains. The generated sequence may differ in length from the input, because, for example, the vehicle may have to return to the depot several times to refill. We are interested in finding a stochastic policy $\pi$ that generates the sequence so as to minimise a loss objective while satisfying the problem constraints.
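The construction loop with dynamic demand updates can be sketched as follows (illustrative only; `att_probs_fn`, which returns selection probabilities given the current state, is a hypothetical stand-in for the learned attention decoder):

```python
import numpy as np

def vrp_decode(att_probs_fn, demands, capacity, max_steps=100):
    """Sequential CVRP construction: the pointer repeatedly selects a node;
    remaining demands and the vehicle load are updated after each visit
    (depot = index 0). Terminates when no demand remains and we are back
    at the depot."""
    remaining = np.array(demands, dtype=float)  # remaining[0] = 0 (depot)
    load, pos, route = capacity, 0, [0]
    for _ in range(max_steps):
        if remaining.sum() == 0 and pos == 0:
            break  # terminating condition: all demand served, back at depot
        # Feasibility mask: serveable customers, plus the depot for refills.
        mask = (remaining > 0) & (remaining <= load)
        mask[0] = True
        probs = att_probs_fn(pos, remaining, load) * mask
        nxt = int(np.argmax(probs))  # greedy selection for this sketch
        if nxt == 0:
            load = capacity          # refill at the depot
        else:
            remaining[nxt] = 0.0     # static demand, served fully
            load -= demands[nxt]
        route.append(nxt)
        pos = nxt
    return route
```

The output sequence can be longer than the input because depot visits are interleaved with customer visits, exactly as described above.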

## Parameter-Set Testing

In this section, we trained the model using different sets of parameters and report the results in Tables net, net1 and net3 for problem sizes 20, 50 and 100, respectively, on the CVRP dataset. We evaluated the model while varying its parameters: we first trained the model on a VRP50 dataset and then tested it on VRP20, VRP50 and VRP100 instances.