Journal of Nanjing University (Natural Science Edition), 2022, Vol. 58, Issue 3: 420-429. doi: 10.13232/j.cnki.jnju.2022.03.006


Deep reinforcement learning combined with graph attention model to solve TSP

Yang Wang, Zhibin Chen, Xiaoxiao Yang, Zhaorui Wu

  1. Faculty of Science, Kunming University of Science and Technology, Kunming 650000, China
  • Received: 2022-01-19  Online: 2022-05-30  Published: 2022-06-07
  • Contact: Zhibin Chen, E-mail: chenzhibin311@126.com
  • Supported by: National Natural Science Foundation of China (11761042)


Abstract:

The Traveling Salesman Problem (TSP) is a classic problem in Combinatorial Optimization (CO) that has been studied intensively for many years. In recent years, Deep Reinforcement Learning (DRL) has been widely applied to autonomous driving, industrial automation, games, and other fields, demonstrating strong decision-making and learning ability. In this paper, DRL and a graph attention model are combined to solve the TSP by minimizing the tour length. Specifically, the behavior (policy) network parameters are trained with an improved REINFORCE algorithm, which effectively reduces gradient variance and helps avoid local optima; Positional Encoding (PE) is applied in the encoding structure so that the multiple initial nodes satisfy translation invariance during embedding, which enhances the stability of the model. Further, we combine a Graph Neural Network (GNN) with the Transformer architecture and, for the first time, apply GNN aggregation in the Transformer decoding stage, which effectively captures the topological structure of the graph and the latent relationships between nodes. Experimental results show that the optimization performance of the model on TSP100 surpasses current DRL-based methods and some traditional algorithms.

Key words: Deep Reinforcement Learning (DRL), Traveling Salesman Problem (TSP), graph attention model, Graph Neural Network (GNN), Combinatorial Optimization (CO)

CLC number:

  • O22
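The abstract's first contribution is an improved REINFORCE [25] rule that trains the network with reduced gradient variance. The update itself is not reproduced on this page, so the following PyTorch sketch shows only the generic shape of such a step, using a shared mean baseline over multiple rollouts in the spirit of POMO [5]; the `model` interface, tensor shapes, and function name are assumptions, not the authors' code.

```python
import torch

def reinforce_step(model, optimizer, coords):
    """One variance-reduced REINFORCE update (sketch, not the paper's code).

    coords: (batch, n_nodes, 2) city coordinates of a batch of instances.
    model(coords) is assumed to return, for n_nodes rollouts per instance,
    each started from a different initial node (cf. Figure 2):
      tour_len: (batch, n_nodes) length of each sampled tour
      log_prob: (batch, n_nodes) summed log-probability of each tour
    """
    tour_len, log_prob = model(coords)

    # Shared baseline: mean tour length over the rollouts of the same
    # instance. Subtracting it keeps the gradient estimator unbiased
    # while cutting its variance, without training a separate critic.
    advantage = tour_len - tour_len.mean(dim=1, keepdim=True)

    # Minimizing this surrogate performs gradient descent on the
    # expected tour length (REINFORCE [25]).
    loss = (advantage.detach() * log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Subtracting a baseline leaves the policy gradient unbiased while shrinking its variance, which is what helps training avoid the noisy updates that can trap a policy in poor local optima.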

Figure 1

Schematic diagram of the graph attention model proposed in this paper

Figure 2

Schematic diagram of multiple initial nodes
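According to the abstract, the multiple initial nodes of Figure 2 are embedded together with Positional Encoding (PE) so that the embedding behaves stably under shifts of the start position. The PE construction is the standard sinusoidal table of the Transformer [6]; how the paper injects it is not shown on this page, so this sketch covers only that standard table (the function name is illustrative).

```python
import torch

def positional_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding of Vaswani et al. [6].

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    assert d_model % 2 == 0, "d_model is assumed even here"
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d_model)  # (n_positions, d_model/2)

    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe  # added to the node embeddings in the encoder
```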

Figure 3

Decoding example for a four-node TSP instance

Figure 4

Schematic diagram of the encoder-decoder structure for the TSP
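The decoding stage of Figure 4 is where the paper's stated novelty lives: GNN [24] aggregation applied inside the Transformer decoder. The exact operator is not reproduced on this abstract page, so the sketch below is one plausible reading rather than the authors' implementation: a single mean-aggregation GNN layer refines the encoder embeddings, then a masked attention pointer scores the unvisited cities. All module and argument names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GNNAggregateDecoderStep(nn.Module):
    """One decoding step: GNN aggregation + masked attention pointer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.self_w = nn.Linear(d_model, d_model)   # node's own embedding
        self.neigh_w = nn.Linear(d_model, d_model)  # aggregated neighbors
        self.q_proj = nn.Linear(d_model, d_model)   # query from context

    def forward(self, h, adj, context, visited_mask):
        # h: (batch, n, d) node embeddings from the encoder
        # adj: (batch, n, n) row-normalized graph adjacency
        # context: (batch, d) decoding context (e.g. current city embedding)
        # visited_mask: (batch, n) True for already visited cities

        # GNN aggregation: mix each node with the mean of its neighbors,
        # injecting graph topology into the decoder's view of the nodes.
        h = torch.relu(self.self_w(h) + self.neigh_w(adj @ h))

        # Attention pointer: compatibility of the context with every node;
        # visited cities are masked out, softmax gives next-city probabilities.
        q = self.q_proj(context).unsqueeze(1)        # (batch, 1, d)
        scores = (q @ h.transpose(1, 2)).squeeze(1)  # (batch, n)
        scores = scores / h.size(-1) ** 0.5
        scores = scores.masked_fill(visited_mask, float('-inf'))
        return F.softmax(scores, dim=-1)             # (batch, n)
```

Feeding adjacency-aggregated embeddings into the pointer is what would let the decoder see graph structure rather than treating the cities as an unordered set.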

Table 1

Comparison of optimization results of different models on the TSP

Model | TSP20 (Cost / Gap / Time) | TSP50 (Cost / Gap / Time) | TSP100 (Cost / Gap / Time)
Concorde [9] | 3.83 / 0.00% / 5 min | 5.69 / 0.00% / 13 min | 7.76 / 0.00% / 1 h
Gurobi [8] | 3.83 / 0.00% / 7 s | 5.69 / 0.00% / 2 min | 7.76 / 0.00% / 17 min
OR-Tools | 3.86 / 0.94% / 1 min | 5.85 / 2.87% / 5 min | 8.06 / 3.86% / 23 min
LKH3 [12] | 3.83 / 0.00% / 42 s | 5.69 / 0.00% / 6 min | 7.76 / 0.00% / 25 min
2-opt [1] | 3.95 / 3.13% / 1 s | 6.11 / 7.38% / 7 s | 8.50 / 9.53% / 33 s
Farthest Insertion [11] | 3.89 / 1.56% / 1 s | 5.97 / 4.92% / 2 s | 8.34 / 7.47% / 10 s
Nearest Neighbor [1] | 4.48 / 16.9% / 1 s | 6.94 / 21.9% / 3 s | 9.68 / 24.7% / 7 s
Kool et al. (greedy) [19] | 3.85 / 0.34% / ≪1 s | 5.80 / 1.76% / 2 s | 8.12 / 4.53% / 6 s
Kool et al. (sampling) [19] | 3.84 / 0.08% / 5 min | 5.73 / 0.52% / 24 min | 7.94 / 2.26% / 1 h
Costa et al. [17] | 3.83 / 0.00% / 15 min | 5.71 / 0.12% / 29 min | 7.83 / 0.87% / 41 min
Wu et al. [22] | 3.83 / 0.00% / 1 h | 5.70 / 0.20% / 1.5 h | 7.87 / 1.42% / 2 h
Kwon et al. (single trajec.) [5] | 3.83 / 0.12% / ≪1 s | 5.74 / 1.03% / 3 s | 7.84 / 1.12% / 8 s
Kwon et al. (no augment) [5] | 3.83 / 0.04% / ≪1 s | 5.71 / 0.35% / 10 s | 7.79 / 0.50% / 54 s
Kwon et al. (8× augment) [5] | 3.83 / 0.00% / 16 s | 5.69 / 0.05% / 1 min | 7.77 / 0.14% / 7 min
Ours (single trajec.) | 3.83 / 0.13% / ≪1 s | 5.73 / 0.70% / 5 s | 7.84 / 1.08% / 8 s
Ours (no augment) | 3.83 / 0.05% / ≪1 s | 5.70 / 0.28% / 10 s | 7.79 / 0.47% / 55 s
Ours (8× augment) | 3.83 / 0.00% / 17 s | 5.69 / 0.03% / 1 min | 7.77 / 0.12% / 7 min
Ours (4× augment) | 3.83 / 0.00% / 9 s | 5.69 / 0.01% / 42 s | 7.76 / 0.09% / 3 min
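A note on reading Tables 1-3: "Gap" is, by the usual convention in this literature, the relative excess of a tour over the optimal length returned by the exact solvers, which is why Concorde, Gurobi, and LKH3 sit at 0.00%. A minimal check under that assumption:

```python
def optimality_gap(cost: float, optimal: float) -> float:
    """Relative excess over the optimal tour length, in percent."""
    return 100.0 * (cost - optimal) / optimal

# OR-Tools on TSP100 versus Concorde (Table 1). The printed costs are
# rounded to two decimals, so this reproduces the table's 3.86% only
# up to rounding.
print(f"{optimality_gap(8.06, 7.76):.2f}%")  # -> 3.87%
```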

Figure 5

Comparison of the optimality gaps of classical models

Table 2

Ablation experiment of the POMO model on the TSP

Model | TSP20 (Cost / Gap / Time) | TSP50 (Cost / Gap / Time) | TSP100 (Cost / Gap / Time)
POMO (8× augment) [5] | 3.83 / 0.00% / 16 s | 5.69 / 0.05% / 1 min | 7.77 / 0.14% / 7 min
POMO (4× augment) [5] | 3.83 / 0.00% / 7 s | 5.69 / 0.03% / 40 s | 7.77 / 0.13% / 3 min

Figure 6

Comparison of training loss between the proposed model and POMO on TSP50 (a) and TSP100 (b)

Table 3

Generalization ability of the proposed model on the TSP

Model | TSP20 (Cost / Gap / Time) | TSP50 (Cost / Gap / Time) | TSP100 (Cost / Gap / Time)
Ours (trained on TSP20) | — | 5.73 / 0.68% / 1 min | 8.05 / 3.73% / 5 min
Ours (trained on TSP50) | 3.83 / 0.13% / 3 s | — | 7.84 / 1.03% / 4 min
Ours (trained on TSP100) | 3.83 / 0.05% / 1 s | 5.71 / 0.33% / 30 s | —

Table 4

Time cost of the proposed model in the training and inference stages

Stage | TSP20 | TSP50 | TSP100
Training | 3 h | 24 h | 136 h
Inference (single trajec.) | ≪1 s | 5 s | 8 s
Inference (no augment) | ≪1 s | 10 s | 55 s
Inference (8× augment) | 17 s | 1 min | 7 min
Inference (4× augment) | 9 s | 42 s | 3 min
1 Cook W J, Cunningham W H, Pulleyblank W R, et al. Combinatorial optimization. New York, NY, USA: Wiley-Interscience, 2010: 11-22.
2 Papadimitriou C H. The Euclidean travelling salesman problem is NP-complete. Theoretical Computer Science, 1977, 4(3): 237-244.
3 Lin M, Liu B X, Lin X Y. Hybrid discrete cuckoo search algorithm with Metropolis criterion for traveling salesman problem. Journal of Nanjing University (Natural Science), 2017, 53(5): 972-983.
4 Bengio Y, Lodi A, Prouvost A. Machine learning for combinatorial optimization: A methodological tour d'horizon. European Journal of Operational Research, 2021, 290(2): 405-421.
5 Kwon Y D, Choo J, Kim B, et al. POMO: Policy optimization with multiple optima for reinforcement learning. 2020, arXiv preprint.
6 Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017, 30: 6000-6010.
7 Li Y X. Deep reinforcement learning: An overview. 2017, arXiv preprint.
8 Gurobi Optimization. Gurobi optimizer reference manual. https://www.gurobi.com, 2015.
9 Applegate D L, Bixby R E, Chvátal V, et al. The traveling salesman problem: A computational study. Interfaces, 2008, 38(4): 344-345.
10 Christofides N. Worst-case analysis of a new heuristic for the travelling salesman problem. Pittsburgh, PA, USA: Carnegie-Mellon University, 1976.
11 Johnson D S. Local optimization and the traveling salesman problem//The 17th International Colloquium on Automata, Languages and Programming. Springer Berlin Heidelberg, 1990: 446-461.
12 Helsgaun K. An effective implementation of the Lin-Kernighan traveling salesman heuristic. European Journal of Operational Research, 2000, 126(1): 106-130.
13 Vinyals O, Fortunato M, Jaitly N. Pointer networks//Proceedings of the 29th International Conference on Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2015, 28: 2692-2700.
14 Bello I, Pham H, Le Q V, et al. Neural combinatorial optimization with reinforcement learning. 2017, arXiv preprint.
15 Deudon M, Cournut P, Lacoste A, et al. Learning heuristics for the TSP by policy gradient//Proceedings of the 15th International Conference on the Integration of Constraint Programming, Artificial Intelligence and Operations Research. Springer Berlin Heidelberg, 2018: 170-181.
16 Nazari M, Oroojlooy A, Takáč M, et al. Reinforcement learning for solving the vehicle routing problem//Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2018, 31: 9861-9871.
17 Costa P R O D, Rhuggenaath J, Zhang Y Q, et al. Learning 2-opt heuristics for the traveling salesman problem via deep reinforcement learning//Proceedings of the 12th Asian Conference on Machine Learning. Bangkok, Thailand: JMLR, 2020: 465-480.
18 Dai H J, Khalil E B, Zhang Y Y, et al. Learning combinatorial optimization algorithms over graphs//Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2017, 30: 6351-6361.
19 Kool W, Van Hoof H, Welling M. Attention, learn to solve routing problems! 2019, arXiv preprint.
20 Chen X Y, Tian Y D. Learning to perform local rewriting for combinatorial optimization//Proceedings of the 33rd Conference on Neural Information Processing Systems. Vancouver, Canada: NIPS, 2019: 6278-6289.
21 Li K W, Zhang T, Wang R. Deep reinforcement learning for multiobjective optimization. IEEE Transactions on Cybernetics, 2021, 51(6): 3103-3114.
22 Wu Y X, Song W, Cao Z G, et al. Learning improvement heuristics for solving routing problems. IEEE Transactions on Neural Networks and Learning Systems, 2021: 1-13.
23 Xin L, Song W, Cao Z G, et al. Multi-decoder attention model with embedding glimpse for solving vehicle routing problems//Proceedings of the 35th AAAI Conference on Artificial Intelligence. Palo Alto, CA, USA: AAAI Press, 2021: 12042-12049.
24 Scarselli F, Gori M, Tsoi A C, et al. The graph neural network model. IEEE Transactions on Neural Networks, 2008, 20(1): 61-80.
25 Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992, 8(3): 229-256.
26 Ho Y, Wookey S. The real-world-weight cross-entropy loss function: Modeling the costs of mislabeling. IEEE Access, 2019, 8: 4806-4813.
27 Ma Q, Ge S W, He D Y, et al. Combinatorial optimization by graph pointer networks and hierarchical reinforcement learning. 2019, arXiv preprint.