Latest revision as of 09:56, 23 May 2019 (Thursday)

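Reinforcement Learning

Definition of Reinforcement Learning

Reinforcement learning (RL) is a general-purpose decision-making framework.

An agent has the capacity to take actions. Each action influences the agent's future state, and the environment returns a scalar reward signal that quantifies how successful the action was.

The goal of a reinforcement learning algorithm is to choose actions that maximize future reward.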
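The interaction described above can be sketched as a simple loop. This is a minimal illustration rather than anything taken from the original article: the `Corridor` environment, the `run_episode` helper, and the discount factor `gamma` used to add up future rewards are all assumptions introduced for the example.

```python
class Corridor:
    """Hypothetical toy environment: positions 0..length-1, goal at the right end."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action is -1 (left) or +1 (right); it determines the next state
        self.pos = max(0, min(self.length - 1, self.pos + action))
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0  # scalar reward signal
        return self.pos, reward, done


def run_episode(env, policy, max_steps=100):
    """Let the agent act until the episode ends; return the rewards it received."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state)
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return rewards


def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, with later rewards weighted by powers of gamma (assumed)."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = run_episode(Corridor(), policy=lambda state: +1)
print(rewards, discounted_return(rewards))
```

An agent that always moves right reaches the goal in four steps and collects a single reward at the end; a better policy is simply one that yields a larger return.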

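Elements of Reinforcement Learning

From the agent's point of view, reinforcement learning involves a set of components:

(1) Policy: the agent's behavior function;

(2) Value function: how good is each state and action?

(3) Model: the agent's representation of the environment.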
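To make the three components concrete, here is a small sketch on a hypothetical five-cell corridor whose goal is the rightmost cell. The environment, the constant `GAMMA`, and the function names are assumptions chosen for illustration. The value function is computed by value iteration on the model, which is one standard choice among several.

```python
GAMMA = 0.9          # assumed discount factor
LENGTH = 5           # assumed corridor length
GOAL = LENGTH - 1
ACTIONS = (-1, +1)   # move left or right


def model(state, action):
    """Model: the agent's representation of the environment.
    Predicts (next_state, reward) for a state-action pair."""
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def action_values(n_sweeps=50):
    """Value function: q[(s, a)] estimates how good action a is in state s."""
    v = [0.0] * LENGTH   # v[GOAL] stays 0 because the episode ends there
    q = {}
    for _ in range(n_sweeps):
        for s in range(GOAL):
            for a in ACTIONS:
                s2, r = model(s, a)
                q[(s, a)] = r + GAMMA * v[s2]
            v[s] = max(q[(s, a)] for a in ACTIONS)
    return q


def greedy_policy(q):
    """Policy: the agent's behavior function, mapping each state to an action."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}


q = action_values()
print(greedy_policy(q))
```

Here the policy is derived from the value function, and the value function is derived from the model. An agent does not need all three: model-free methods, for instance, learn values or policies directly from experience.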

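Artificial General Intelligence (AGI)

Deep reinforcement learning (Deep RL) combines reinforcement learning (RL) with deep learning (DL).

Reinforcement learning defines the objective and deep learning supplies the mechanism, using techniques such as Q-learning, with the aim of achieving artificial general intelligence (AGI).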
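As a rough illustration of the Q-learning technique mentioned above, the sketch below runs the tabular version on a hypothetical five-cell corridor. The environment, the hyperparameters (`alpha`, `gamma`, `epsilon`), and the random tie-breaking are assumptions made for this example. In a deep RL setting the table `q` would be replaced by a neural network that approximates the same values.

```python
import random


def q_learning(n_episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor (positions 0..4, goal at 4)."""
    rng = random.Random(seed)
    length, goal, actions = 5, 4, (-1, +1)
    q = {(s, a): 0.0 for s in range(length) for a in actions}

    for _ in range(n_episodes):
        s = 0
        for _ in range(100):  # cap on episode length
            if rng.random() < epsilon:
                a = rng.choice(actions)  # explore
            else:
                best = max(q[(s, b)] for b in actions)
                # exploit, breaking ties at random
                a = rng.choice([b for b in actions if q[(s, b)] == best])

            s2 = max(0, min(goal, s + a))
            r = 1.0 if s2 == goal else 0.0

            # Q-learning target: reward plus discounted best value of next state
            if s2 == goal:
                target = r
            else:
                target = r + gamma * max(q[(s2, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])

            s = s2
            if s == goal:
                break
    return q


q = q_learning()
policy = {s: max((-1, +1), key=lambda a: q[(s, a)]) for s in range(4)}
print(policy)
```

The update uses only observed transitions and rewards, so no model of the environment is required.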

Applications of Reinforcement Learning

Computer Go

Mastering the game of Go with deep neural networks and tree search, Nature 2016.
Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
Pachi: State of the art open source Go program, Advances in Computer Games, Springer Berlin Heidelberg, 2011.

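Multi-Armed Bandits

The multi-armed bandit problem

A gambler faces a row of slot machines (bandits). Each machine pays out differently after a coin is inserted, and the gambler's problem is how to maximize total winnings.

After trying a number of machines, the gambler has some empirical statistics on their payouts but still does not know the true payout distribution behind each machine. The strategic question is then whether to keep playing the machines that have paid well so far or to try more machines.

A gambler who sticks with the machines that have already paid out can at least maintain a certain level of winnings. Trying previously untested machines may fail, but it may also reveal a machine with a higher payout.

UCB is one approach to the multi-armed bandit problem. It seeks a balance between exploration (trying lesser-known machines) and exploitation (acting on existing experience). UCB stands for "Upper Confidence Bounds".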
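The balance between exploration and exploitation can be sketched with the UCB1 rule from Auer et al. (2002): play the arm with the largest empirical mean plus the bonus sqrt(2 ln t / n), where n is the number of times that arm has been played. The Bernoulli arms, the payout probabilities, and the function name `ucb1` below are assumptions made for the example.

```python
import math
import random


def ucb1(arm_means, n_rounds=2000, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return how often each arm was played."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k     # number of plays per arm
    sums = [0.0] * k     # total reward per arm

    for t in range(1, n_rounds + 1):
        if t <= k:
            arm = t - 1  # play every arm once to initialize the statistics
        else:
            # empirical mean (exploitation) + confidence bonus (exploration)
            arm = max(
                range(k),
                key=lambda i: sums[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts


print(ucb1([0.1, 0.5, 0.9]))
```

The bonus term shrinks as an arm is played more often, so rarely tried machines keep getting occasional plays while most plays go to the machine that currently looks best.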

The UCB1 algorithm was introduced in the following paper.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47, no. 2-3: 235-256, 2002.

The PUCT algorithm was proposed in the following paper.

Christopher D. Rosin, Multi-armed bandits with episode context, Annals of Mathematics and Artificial Intelligence, Volume 61, Issue 3, pp. 203–230, March 2011.

Other papers

Wang, Yizao, Jean-Yves Audibert, and Rémi Munos. "Algorithms for infinitely many-armed bandits." Advances in Neural Information Processing Systems. 2009.

Monte Carlo Tree Search

Monte Carlo tree search (MCTS)

Bandit based Monte-Carlo planning, ECML 2006.
Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search, CG 2006.
Combining Online and Offline Knowledge in UCT, ICML 2007.
Monte-Carlo tree search and rapid action value estimation in computer Go, Artificial Intelligence, Elsevier 2011.

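Playing Go with Convolutional Networks

Convolutional networks

Mimicking Go Experts with Convolutional Neural Networks, ICANN 2008.
Training Deep Convolutional Neural Networks to Play Go, ICML 2015.

When a convolutional network is trained on 30 million game records from players ranked 5 dan or above, it also learns the blunders and poor moves that those human players made. Training on games generated by self-play can dilute these bad moves.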

Historic Progress

Achieving Master Level Play in 9 × 9 Computer Go, AAAI 2008.
The grand challenge of computer Go: Monte Carlo tree search and extensions, CACM 2012.

AlphaGo

Mastering the game of Go with deep neural networks and tree search, Nature 2016.
AlphaGo Zero
AlphaZero

Computer Games

Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.

References

Textbooks

Richard S. Sutton, Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. Intro_RL
Csaba Szepesvári, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 4, no. 1, pp. 1-103, 2010. RLAlgsInMDPs

Courses

UC Berkeley CS 294: Deep Reinforcement Learning, Deep RL

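@@ Line 1: / Line 1: @@
 = Reinforcement Learning =
-== Definition ==
+== Definition of Reinforcement Learning ==
-Reinforcement learning (RL) is a general-purpose decision-making framework. An agent has the capacity to take actions. Each action influences the agent's future state, and the environment returns a scalar reward signal that quantifies how successful the action was. The goal of a reinforcement learning algorithm is to choose actions that maximize future reward.
+Reinforcement learning (RL) is a general-purpose decision-making framework.
-== General AI ==
+An agent has the capacity to take actions. Each action influences the agent's future state, and the environment returns a scalar reward signal that quantifies how successful the action was.
-Deep reinforcement learning (Deep RL) combines reinforcement learning (RL) with deep learning (DL). Reinforcement learning defines the objective and deep learning supplies the mechanism, using techniques such as Q-learning, with the aim of achieving general artificial intelligence (General Artificial Intelligence).
-= Research =
+The goal of a reinforcement learning algorithm is to choose actions that maximize future reward.
-== Computer Go and AlphaGo ==
+== Elements of Reinforcement Learning ==
+From the agent's point of view, reinforcement learning involves a set of components:
+(1) Policy: the agent's behavior function;
+(2) Value function: how good is each state and action?
+(3) Model: the agent's representation of the environment.
+== Artificial General Intelligence (AGI) ==
+Deep reinforcement learning (Deep RL) combines reinforcement learning (RL) with deep learning (DL).
+Reinforcement learning defines the objective and deep learning supplies the mechanism, using techniques such as Q-learning, with the aim of achieving artificial general intelligence (AGI).
+= Applications of Reinforcement Learning =
+== Computer Go ==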
+# Mastering the game of Go with deep neural networks and tree search, nature 2015.
+# Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
+# Pachi: State of the art open source Go program, Advances in computer games, Springer Berlin Heidelberg, 2011.
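 ===Multi-Armed Bandits===
 * The multi-armed bandit problem
-#Multi-armed bandits with episode context, AMAI 2011.
+A gambler faces a row of slot machines (bandits). Each machine pays out differently after a coin is inserted, and the gambler's problem is how to maximize total winnings.
-#Algorithms for Infinitely Many-Armed Bandits, nips 2009.
+After trying a number of machines, the gambler has some empirical statistics on their payouts but still does not know the true payout distribution behind each machine. The strategic question is then whether to keep playing the machines that have paid well so far or to try more machines.
+A gambler who sticks with the machines that have already paid out can at least maintain a certain level of winnings. Trying previously untested machines may fail, but it may also reveal a machine with a higher payout.
+UCB is one approach to the multi-armed bandit problem. It seeks a balance between exploration (trying lesser-known machines) and exploitation (acting on existing experience). UCB stands for "Upper Confidence Bounds".
+'''The UCB1 algorithm was introduced in the following paper.'''
+#Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine learning 47, no. 2-3: 235-256, 2002.
+'''The PUCT algorithm was proposed in the following paper.'''
+#Christopher D. Rosin, Multi-armed bandits with episode context, Annals of Mathematics and Artificial Intelligence, March 2011, Volume 61, Issue 3, pp 203–230 2011.
+'''Other papers'''
+#Wang, Yizao, Jean-Yves Audibert, and Rémi Munos. "Algorithms for infinitely many-armed bandits." Advances in Neural Information Processing Systems. 2009.
 ===Monte Carlo Tree Search===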
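@@ Line 31: / Line 68: @@
 # '''Training Deep Convolutional Neural Networks to Play Go, ICML 2015.'''
-== Historic Progress ===
+When a convolutional network is trained on 30 million game records from players ranked 5 dan or above, it also learns the blunders and poor moves that those human players made. Training on games generated by self-play can dilute these bad moves.
+== Historic Progress ==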
 # Achieving Master Level Play in 9 × 9 Computer Go, AAAI 2008.
 # The grand challenge of computer Go Monte Carlo tree search and extensions, CACM 2012.
-# '''Mastering the game of Go with deep neural networks and tree search, Nature 2016.'''
-==Computer Games==
+=AlphaGo=
-#Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
-== Neuroscience ==
+# '''Mastering the game of Go with deep neural networks and tree search, Nature 2016.'''
+#AlphaGo Zero
+#AlphaZero
-# Gadagkar, V., Puzerey, P., Chen, R., Baird-daniel, E., Farhang, A., & Goldberg, J. (2016). Dopamine Neurons Encode Performance Error in Singing Birds. Science, 354(6317), 1278–1282.
+=Computer Games=
+#Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
 = References =
@@ Line 53: / Line 93: @@
 == Courses ==
-UC Berkeley CS 294: Deep Reinforcement Learning,  [http://rll.berkeley.edu/deeprlcourse/ Deep RL]
+#UC Berkeley CS 294: Deep Reinforcement Learning,  [http://rll.berkeley.edu/deeprlcourse/ Deep RL]

Difference between revisions of "增强学习-入门导读" (Reinforcement Learning: An Introductory Guide)