Journal of Nanjing University (Natural Science), 2017, Vol. 53, Issue 6: 1052.
Wu Yushuang1,Chen Xiaoyu1,Ma Jingwen2,Chen Xingguo2,3*
Abstract: In the problem of linear value-function approximation in reinforcement learning, both the temporal-difference (TD) fixed-point solution and the Bellman-residual solution are oblique projections of the true value function, yet neither has been proven to be optimal. By taking a weighted average of these two projections, a generalized oblique projection operator is proposed. Based on this operator, two residual temporal-difference learning algorithms are derived, and convergence proofs for both algorithms under off-policy learning are given. On Baird's well-known off-policy counterexample, the proposed algorithms are compared with related methods, and the experimental results verify their correctness and effectiveness.
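The blending idea in the abstract can be illustrated with a minimal sketch: a single mixing weight interpolates between the TD(0) update direction (the fixed-point solution) and the Bellman-residual-gradient direction. The 2-state chain MDP, the weight `beta`, the step size, and all variable names below are illustrative assumptions for exposition, not the paper's actual algorithm.

```python
import numpy as np

# Illustrative sketch (not the paper's algorithm): a residual-style TD update
# whose direction phi(s) - beta * gamma * phi(s') interpolates between
# TD(0) (beta = 0) and the Bellman residual gradient (beta = 1).

gamma = 0.9
beta = 0.5                    # mixing weight between the two projections
alpha = 0.1                   # step size (assumed)
phi = np.eye(2)               # tabular (identity) features for states 0 and 1
# Deterministic cycle: 0 -> 1 with reward 1, then 1 -> 0 with reward 0.
transitions = [(0, 1.0, 1), (1, 0.0, 0)]   # (s, reward, s')

theta = np.zeros(2)
for _ in range(2000):
    # Synchronous sweep: compute all TD errors first, then apply the updates.
    update = np.zeros(2)
    for s, r, s_next in transitions:
        delta = r + gamma * theta @ phi[s_next] - theta @ phi[s]
        update += alpha * delta * (phi[s] - beta * gamma * phi[s_next])
    theta += update

# With complete (tabular) features the Bellman residual vanishes at the true
# values v0 = 1/(1 - gamma^2), v1 = gamma * v0, so every beta in [0, 1]
# converges to the same point here.
print(theta)   # close to [5.263, 4.737]
```

With incomplete features the two endpoints generally converge to different solutions, which is exactly the gap the weighted oblique projection in the abstract is meant to span.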