
Reinforcement Learning: An Introductory Guide

= Reinforcement Learning =

== Definition of Reinforcement Learning ==

Reinforcement Learning (RL) is a general decision-making framework.
An agent has the capacity to take actions. Each action influences the agent's future state, and a scalar reward signal is returned to quantify how successful the action was.
The goal of a reinforcement learning algorithm is to select actions so as to maximize future reward.

== Components of Reinforcement Learning ==

From the perspective of an RL agent, reinforcement learning consists of a set of components:
# Policy: the agent's behavior function;
# Value function: how good each state and action is;
# Model: the agent's representation of the environment.

== Artificial General Intelligence (AGI) ==

Deep Reinforcement Learning (Deep RL) combines reinforcement learning (RL) with deep learning (DL): RL defines the objective, and DL provides the corresponding mechanism, such as Q-learning, with the aim of achieving Artificial General Intelligence (AGI).
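To make these pieces concrete, the following is a minimal tabular Q-learning sketch (not the Deep RL or AlphaGo implementation): the Q table plays the role of the value function, an epsilon-greedy rule derived from it is the policy, and a toy 5-state chain environment returns the scalar reward. The environment, hyperparameters, and episode count are illustrative assumptions.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

# Toy environment (an assumption for illustration): a 5-state chain.
# Action 0 moves left, action 1 moves right; reaching the last state pays reward 1.
N_STATES, ACTIONS = 5, (0, 1)

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward, next_state == N_STATES - 1

Q = defaultdict(float)                  # value function: Q[(state, action)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate

def policy(state):
    """Epsilon-greedy policy derived from the current Q values."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return random.choice([a for a in ACTIONS if Q[(state, a)] == best])

for episode in range(200):
    state, done = 0, False
    while not done:
        action = policy(state)
        next_state, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(Q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: move Q(state, action) toward the bootstrapped target.
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = next_state

print(sorted(Q.items()))  # the learned values favor action 1 (move right) in every state
</syntaxhighlight>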
= Applications of Reinforcement Learning =

== Computer Go: AlphaGo ==
# Mastering the game of Go with deep neural networks and tree search, Nature 2016.
# Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
# Pachi: State of the art open source Go program, Advances in Computer Games, Springer Berlin Heidelberg, 2011.

=== Multi-Armed Bandits ===
* The multi-armed bandit problem
A gambler faces a row of slot machines (bandits). After a coin is inserted, each machine returns a different payoff. The gambler's problem is how to maximize the total payoff.
After trying a number of machines, the gambler accumulates some statistics about their payoffs, but does not know the true payoff distribution behind each machine. Given the payoffs observed so far, the gambler faces a strategic question: keep concentrating on the machines that have already paid off, or try more machines?
 
Sticking with the machines that have already paid off preserves at least a certain payoff. Trying previously untested machines may fail, but it may also uncover machines with larger payoffs.
 
UCB, short for "Upper Confidence Bounds", is one method for the multi-armed bandit problem. It strives to strike a balance between exploration (of unknown machines) and exploitation (of existing experience).
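As a concrete illustration of this balance, the sketch below implements the UCB1 rule analyzed in the Auer et al. paper listed next: each arm is scored by its empirical mean payoff plus an exploration bonus sqrt(2·ln t / n_i). The three-armed Bernoulli bandit and the horizon of 1000 pulls are illustrative assumptions.

<syntaxhighlight lang="python">
import math
import random

def ucb1_choose(counts, values, t):
    """Pick an arm: empirical mean + sqrt(2*ln(t)/n_i) exploration bonus (UCB1)."""
    for arm, n in enumerate(counts):
        if n == 0:                      # play every arm once before using the bound
            return arm
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

# Illustrative simulation (assumed setup): three Bernoulli arms with hidden win rates.
true_rates = [0.3, 0.5, 0.7]
counts = [0, 0, 0]          # times each arm was played
values = [0.0, 0.0, 0.0]    # empirical mean payoff per arm

for t in range(1, 1001):
    arm = ucb1_choose(counts, values, t)
    reward = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update

print(counts)   # most plays should concentrate on the best arm (index 2)
</syntaxhighlight>

AlphaGo-style tree search scores moves with a PUCT-style variant of this rule (see the paper below).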
 
'''The UCB algorithm was first proposed in the following paper.'''
 
#Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine learning 47, no. 2-3: 235-256, 2002.
 
'''The PUCT algorithm was proposed in the following paper.'''
#Christopher D. Rosin, Multi-armed bandits with episode context, Annals of Mathematics and Artificial Intelligence, March 2011, Volume 61, Issue 3, pp. 203–230, 2011.
 
'''Other papers'''
#Wang, Yizao, Jean-Yves Audibert, and Rémi Munos. "Algorithms for infinitely many-armed bandits." Advances in Neural Information Processing Systems. 2009.
 
=== Monte-Carlo Tree Search ===
* Monte-Carlo Tree Search (MCTS)
# '''Monte-Carlo tree search and rapid action value estimation in computer Go, Artificial Intelligence, Elsevier 2011.'''
=== Playing Go with Convolutional Networks ===
* Convolutional neural networks
# Mimicking Go Experts with Convolutional Neural Networks, ICANN 2008.
# '''Training Deep Convolutional Neural Networks to Play Go, ICML 2015.'''
* Progress: convolutional networks have been trained on roughly 30 million positions from games of players of 5 dan and above; in doing so the machine also learns the blunders that human players make. Training on game data generated by self-play can dilute these blunders (see the sketch below).
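A minimal sketch (assuming PyTorch) of this kind of move-prediction setup is shown below: encoded board planes go in, a distribution over the 19×19 move grid comes out, and the network is trained by cross-entropy against the move actually played in the training games (expert or self-play). The input encoding, layer sizes, and random stand-in data are illustrative assumptions, not the architecture of the cited papers.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD, PLANES = 19, 8   # assumed: 19x19 board encoded as 8 feature planes

class GoPolicyNet(nn.Module):
    """Small convolutional policy network: board planes -> move logits."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(PLANES, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.head = nn.Conv2d(64, 1, kernel_size=1)   # one logit per board point

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.head(x).flatten(1)                # logits over 19*19 moves

net = GoPolicyNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# One illustrative training step on random stand-in data
# (real training would use positions from expert games or self-play).
boards = torch.randn(16, PLANES, BOARD, BOARD)        # batch of encoded positions
moves = torch.randint(0, BOARD * BOARD, (16,))        # index of the move actually played

logits = net(boards)
loss = F.cross_entropy(logits, moves)                 # imitate the recorded move
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
</syntaxhighlight>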
== Historic Advances ==
# Achieving Master Level Play in 9 × 9 Computer Go, AAAI 2008.
# The grand challenge of computer Go: Monte Carlo tree search and extensions, CACM 2012.
 
= AlphaGo =
 
# '''Mastering the game of Go with deep neural networks and tree search, Nature 2016.'''
# AlphaGo Zero: Mastering the game of Go without human knowledge, Nature 2017.
# AlphaZero: A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play, Science 2018.
=== Computer Games ===
#Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
= Neuroscience =
# Gadagkar, V., Puzerey, P., Chen, R., Baird-Daniel, E., Farhang, A., & Goldberg, J. (2016). Dopamine Neurons Encode Performance Error in Singing Birds. Science, 354(6317), 1278–1282.

= References =

== Reference Textbooks ==
# Richard S. Sutton, Andrew Barto, An Introduction to Reinforcement Learning, MIT Press, 1998. [http://webdocs.cs.ualberta.ca/~sutton/book/the-book.html Intro_RL]
# Csaba Szepesvari, Algorithms for Reinforcement Learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 4, no. 1, pp. 1-103, 2010. [http://www.ualberta.ca/~szepesva/papers/RLAlgsInMDPs.pdf RLAlgsInMDPs]

== Reference Courses ==
# UC Berkeley CS 294: Deep Reinforcement Learning, [http://rll.berkeley.edu/deeprlcourse/ Deep RL]