Journal of Nanjing University (Natural Science), 2017, Vol. 53, Issue 6: 1052.
Wu Yushuang1,Chen Xiaoyu1,Ma Jingwen2,Chen Xingguo2,3*
Abstract: In the problem of linear value-function approximation in reinforcement learning, both the temporal-difference (TD) fixed-point solution and the Bellman-residual solution are oblique projections of the true value function, yet neither has been proven to be optimal. By taking a weighted average of these two projections, a generalized oblique projection operator is proposed. Based on this operator, two residual temporal-difference learning algorithms are derived, and convergence proofs for both algorithms under off-policy learning are given. On Baird's well-known off-policy counterexample, the proposed algorithms are compared with related methods, and the experimental results verify their correctness and effectiveness.
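The blending idea in the abstract can be illustrated with a minimal sketch: a single mixing weight interpolates between the TD(0) update direction (the fixed-point solution) and the Bellman-residual-gradient direction. The 2-state chain MDP, the weight `beta`, the step size, and all variable names below are illustrative assumptions for exposition, not the paper's actual algorithm.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): a residual-style TD update
# whose direction phi(s) - beta * gamma * phi(s') interpolates between
# TD(0) (beta = 0) and the Bellman residual gradient (beta = 1).

gamma = 0.9
beta = 0.5                    # mixing weight between the two projections
alpha = 0.1                   # step size (assumed)
phi = np.eye(2)               # tabular (identity) features for states 0 and 1
# Deterministic cycle: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
transitions = [(0, 1.0, 1), (1, 0.0, 0)]   # (s, reward, s')

theta = np.zeros(2)
for _ in range(2000):
    # Synchronous sweep: compute all TD errors first, then apply the updates.
    update = np.zeros(2)
    for s, r, s_next in transitions:
        delta = r + gamma * theta @ phi[s_next] - theta @ phi[s]
        update += alpha * delta * (phi[s] - beta * gamma * phi[s_next])
    theta += update

# With complete (tabular) features the Bellman residual vanishes at the true
# values v0 = 1/(1 - gamma^2), v1 = gamma * v0, so every beta in [0, 1]
# converges to the same point here.
print(theta)   # close to [5.263, 4.737]
```

With incomplete features the two endpoints generally converge to different solutions, which is exactly the gap the weighted oblique projection in the abstract is meant to span.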