Big Data and Machine Intelligence

==Copyright Notice==

CC BY-NC-SA

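==Teaching Team==

Internet Plus Lab [http://net.icenter.tsinghua.edu.cn iNetLab]

[https://www.researchgate.net/profile/Zhen_Chen16/ Chen Zhen (陈震)]
Ma Xiaodong (马晓东), Zhang Yisong (章屹松), Wang Beibei (王蓓蓓), Gao Ying (高英)

Teaching assistants: Zheng Wenxun (郑文勋), Li Chenxing (李辰星)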

==Collaborative Development==

[http://cloud.icenter.tsinghua.edu.cn  iCenter-cloud]

[http://gitlab.icenter.tsinghua.edu.cn  GitLab]

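=Course Objectives=

The goal is to build a prototype of an intelligent system founded on big data, applying big data intelligence theory and techniques in practice. Team members learn the theory and professional skills of big data systems and machine intelligence, then carry out the hands-on stages of project team structure design and prototype development, improving their practical technical ability across the board.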

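=Course Content=

==A Brief Discussion of Technology==

===The Nature of Technology===

* Multiple dimensions and perspectives: instrumentalism / humanistic concern / social aspiration / technological society
* Objectively (the general public)
** "Harm and benefit" (two-sidedness)
**: Convenient financial management
**: Financial and telecom fraud
** Beneficiaries
**: Leaders of technological change ("quick money")
**: People who benefit from it (convenience)
** Victims
**: People who passively accept technological change
**: People displaced by technological change
* Subjectively (the individual)
** Depends on personal standpoint, values, experience, and so on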

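===Technology Bubbles===

* Marketing, hyped expectations, and indoctrinating media coverage lead to conceptual confusion.
* One needs to find the essence of a technology (a matter of epistemology).
* Scientific thinking means guarding against being brainwashed and against thinking without reasoning.
* Research approaches: normative and empirical
** When stating facts, identify the claim and the evidence, then judge whether the claim governs the evidence and whether the evidence supports the claim.
** Gain experience through practice and hands-on work, not through feelings and wishes.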

==Big Data Indexing==

[[大数据索引|Big Data Indexing]]

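==Big Data Algorithms==

===Data Analytics===

Data analytics refers to running operations such as SUM, TopN, and Rank over the attribute values of a dataset. Real-time response is generally required.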
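As a minimal sketch of these three operations, the following Python snippet computes SUM, TopN, and Rank over a small in-memory list of records. The record layout (source IP, byte count) and the helper names are illustrative assumptions, not part of any platform listed in this section; a real analytics engine would compute the same results over distributed, pre-aggregated data.

```python
import heapq
from collections import defaultdict

# Hypothetical flow records: (source IP, bytes transferred).
records = [
    ("10.0.0.1", 500),
    ("10.0.0.2", 1200),
    ("10.0.0.1", 300),
    ("10.0.0.3", 50),
    ("10.0.0.2", 100),
]

def sum_by_key(rows):
    """SUM: total the value column for each distinct key."""
    totals = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)

def top_n(totals, n):
    """TopN: the n (key, total) pairs with the largest totals, largest first."""
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def rank_of(totals, key):
    """Rank: 1-based position of key when totals are sorted in descending order."""
    return 1 + sum(1 for v in totals.values() if v > totals[key])

totals = sum_by_key(records)
print(totals)
print(top_n(totals, 2))
print(rank_of(totals, "10.0.0.3"))
```

Using a heap for TopN avoids sorting the full result set, which matters when only a handful of top entries are needed from many keys.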

* [https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword Implementation of Rank]
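On a bitmap, rank counts the set bits before a given position, and its companion operation select finds the position of the k-th set bit. The sketch below is a deliberately simple Python version that precomputes per-block counts and scans inside one block. It illustrates the idea only and is not the broadword technique used by the linked Lucene class; the class name `RankBitmap` and the 64-bit block width are assumptions.

```python
class RankBitmap:
    """Bitmap with rank via per-block prefix counts plus a short in-block scan."""

    BLOCK = 64  # assumed block width in bits

    def __init__(self, bits):
        # bits: iterable of 0/1 values, position 0 first.
        self.bits = list(bits)
        self.prefix = [0]  # prefix[i] = number of 1 bits before block i
        for start in range(0, len(self.bits), self.BLOCK):
            block = self.bits[start:start + self.BLOCK]
            self.prefix.append(self.prefix[-1] + sum(block))

    def rank1(self, pos):
        """Number of 1 bits in positions [0, pos)."""
        if not 0 <= pos <= len(self.bits):
            raise IndexError(pos)
        block, offset = divmod(pos, self.BLOCK)
        start = block * self.BLOCK
        return self.prefix[block] + sum(self.bits[start:start + offset])

    def select1(self, k):
        """Position of the k-th 1 bit (k starts at 1), by binary search on rank1."""
        if not 1 <= k <= self.prefix[-1]:
            raise ValueError(k)
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo


bm = RankBitmap([1, 0, 1, 1, 0, 0, 1, 0])
print(bm.rank1(4))    # number of 1 bits in positions 0..3
print(bm.select1(4))  # position of the 4th 1 bit
```

The two operations are inverses in the sense that `rank1(select1(k) + 1) == k` for every valid k.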

A big data analytics platform is a distributed software system that implements data analytics.

* [http://kylin.io Apache Kylin]
* [http://druid.io/ Druid]

# Navarro, Gonzalo, and Eliana Providel. "Fast, small, simple rank/select on bitmaps." In International Symposium on Experimental Algorithms, pp. 295-306. Springer Berlin Heidelberg, 2012.
# Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, pp. 154-168. Springer Berlin Heidelberg, 2008.

===Cardinality Estimation===

Cardinality estimation estimates the number of distinct elements in a set, for example the number of unique IP addresses that visit a website.

* [https://github.com/Microsoft/CardinalityEstimation Cardinality Estimation Algorithm]
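As a rough illustration of the HyperLogLog idea described in the papers cited in this section, here is a simplified Python sketch. It hashes each item, uses the first p bits to choose a register, and records the longest run of leading zeros seen in the remaining bits. The class name, the choice of SHA-1 as the hash, and the default p = 10 are assumptions made for this example; it omits the bias tables and sparse representation of production implementations and is unrelated to the code in the linked repository.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers (no bias tables, no sparse mode)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from Flajolet et al., valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash taken from SHA-1 so results are reproducible across runs.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")
        idx = x >> (64 - self.p)                # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1 bit in `rest`, counting from 1.
        rho = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"10.0.{i // 256}.{i % 256}")  # 10,000 distinct hypothetical IPs
print(round(hll.count()))
```

With 1,024 registers the expected relative error is about 1.04 / sqrt(1024), roughly 3%, while the sketch stores only one small integer per register regardless of how many items are added.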

# Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
# Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.

==Big Data Systems==

===Hadoop===

[http://hadoop.apache.org Hadoop]

# Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
# Jeffrey Dean and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.

===Spark===

[http://spark.apache.org Spark]

# Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010.

==Machine Intelligence==

===International Research===
[https://www.openai.com/blog/ OpenAI]

[http://research.google.com/teams/brain/ Google_Brain]

[https://research.facebook.com/ai Facebook_AI-Research]

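===Three Levels===

====Classification by Goal and Function====

Speech recognition, machine vision, intelligent question answering

====Classification by Core Technology====

Task-specific algorithms, machine learning algorithms, deep neural networks

====Underlying Implementation Options====

Field-programmable gate arrays (FPGA) / general-purpose graphics processing units (GPGPU) / general-purpose processor (CPU) clusters

===Machine Learning===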

Machine Learning [http://scikit-learn.org scikit-learn]

# Jordan, M. I., and T. M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260. [http://science.sciencemag.org/content/349/6245/255 Machine_learning_science_2015]

====Speech Recognition====

Automatic Speech Recognition, abbreviated ASR

====Computer Vision====

Computer Vision, abbreviated CV

===Deep Neural Networks===

[[卷积神经网络|Convolutional Neural Networks]]

Deep Neural Networks, abbreviated DNN

Stanford Deep Learning tutorials [http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial DL_tutorials]

Introductory reading

# LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521(7553), pp:436-444, 2015. [http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html Deep_Learning_Nature]
# Jeff Dean, Large-Scale Deep Learning for Intelligent Computer Systems, WSDM 2016. [http://research.google.com/pubs/jeff.html WSDM_keynote]
# TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016. [https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf TensorFlow_OSDI2016_paper] [http://research.google.com/pubs/pub45381.html TensorFlow_paper]

[http://research.google.com/pubs/jeff.html Jeff Dean]

[http://cs.stanford.edu/~quocle/ Quoc V. Le]

==Writing Papers and Reports==

[http://www.madoko.net Madoko]

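=Project Groups=

[[大数据智能-学生分组|Big Data Intelligence: Student Groups]]

=Paper Reading=

==First Session==

Requirement: submit a PPT on the paper you studied (no more than 10 slides of main content).

Deadline: before 12:00 noon on October 14.

On the afternoon of October 19, each group gives a short presentation of no more than 10 minutes.

[[大数据智能-课程阅读1|Big Data Intelligence: Course Reading 1]]

==Second Session==

Requirement: submit a PPT on the paper you studied (no more than 10 slides of main content).

Deadline: before 12:00 noon on December XX.

On the afternoon of December XX, each group gives a short presentation of no more than 10 minutes.

[[大数据智能-课程阅读2|Big Data Intelligence: Course Reading 2]]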

=Course Practice=

===Student Preparation===

Bring your own laptop computer and a camera-equipped smartphone.

===Using the Azure Cloud Platform===

[http://portal.azure.com Azure]

===Setting Up a Flask Web Server===

[[Flask-Web服务器搭建|Setting Up a Flask Web Server]]

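=Course Projects=

==Project 1: Big Data==

===Description===

Task: using the concepts and principles of bitmap indexing, implement a bitmap index database in C++.

Verification: build an index over a sample of network flow data and run queries against it. The program must run successfully on a virtual machine and return correct results.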
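To make the bitmap index concept concrete, the sketch below keeps one bitmap per (column, value) pair and answers equality queries with bitwise AND and OR. The project itself calls for C++; this Python version only illustrates the idea, and the class name `BitmapIndex` and the flow field names (`src`, `proto`, `dport`) are assumptions rather than the format of the course data set.

```python
from collections import defaultdict

class BitmapIndex:
    """Equality-encoded bitmap index: one bitmap per (column, value) pair."""

    def __init__(self):
        self.num_rows = 0
        # (column, value) -> Python int used as an arbitrary-length bitmap;
        # bit i is set when row i has that value in that column.
        self.bitmaps = defaultdict(int)

    def add_row(self, row):
        """Index one record given as a dict of column -> value; returns its row id."""
        row_id = self.num_rows
        for column, value in row.items():
            self.bitmaps[(column, value)] |= 1 << row_id
        self.num_rows += 1
        return row_id

    def lookup(self, column, value):
        """Bitmap of rows where column == value (0 if the value never occurs)."""
        return self.bitmaps.get((column, value), 0)

    @staticmethod
    def row_ids(bitmap):
        """Expand a bitmap into a sorted list of row ids."""
        ids, i = [], 0
        while bitmap:
            if bitmap & 1:
                ids.append(i)
            bitmap >>= 1
            i += 1
        return ids


# Hypothetical network flow records; the field names are assumptions.
flows = [
    {"src": "10.0.0.1", "proto": "TCP", "dport": 80},
    {"src": "10.0.0.2", "proto": "UDP", "dport": 53},
    {"src": "10.0.0.1", "proto": "TCP", "dport": 443},
    {"src": "10.0.0.3", "proto": "TCP", "dport": 80},
]
index = BitmapIndex()
for flow in flows:
    index.add_row(flow)

# WHERE proto = 'TCP' AND dport = 80 becomes a bitwise AND of two bitmaps.
hits = index.lookup("proto", "TCP") & index.lookup("dport", 80)
print(BitmapIndex.row_ids(hits))
```

Because each predicate reduces to a bitmap, multi-column conditions never touch the raw rows until the final result is expanded, which is the property that makes bitmap indexes attractive for low-cardinality columns such as protocol or port.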

网流数据：\\166.111.134.110\team-saturn\网流数据

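Code hosting: http://gitlab.icenter.tsinghua.edu.cn

Deadline: before 12:00 noon on October 7 (extended by one week in special circumstances) (week 4 of the academic calendar)

Organization: work in groups; every student's contribution must be visible.

===Submissions===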

{|border=1
|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/xavieryao/bitmap-db Group1]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group2/bitmap  Group2]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/3rd_group/bitmap_indexing  Group3]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/taanng/Bitmap  Group4]]]
|-
|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn/ddeerreekk/Experiment_1_Bitmap_Index  Group5]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group6/project1  Group6]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/FQX/bitmap  Group8]]]
|-
|}

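==Project 2: Using Lucida==

===Installing Lucida===

Each group installs the Lucida software on the Tsinghua industrial cloud platform.

* Tsinghua industrial cloud [https://cloud.icenter.tsinghua.edu.cn icenter-cloud]
* Download [https://github.com/claritylab/lucida Lucida-AI]

Deadline: before 12:00 noon next Wednesday, October 26 (week 7 of the academic calendar).

===Work for Each Group===

Each group studies the implementation principles of Lucida's seven AI services.

[[大数据智能-Lucida使用|Big Data Intelligence: Using Lucida]]

==Project 3: Installing TensorFlow==

Read the deep learning tutorial.

Install Google TensorFlow.

Complete Get Started on the TensorFlow website.

[https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html MNIST For ML Beginners]

Deadline: 12:00 noon on November 2, 2016 (week 8 of the academic calendar)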

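==Project 4: Cloud and Device Integration==

Deadline: 12:00 noon on November 9, 2016 (week 9 of the academic calendar) (extended by one week in special circumstances)

===Thrift Protocol===

[https://thrift.apache.org Thrift]

===Client Side===

Take a photo with the camera

Call the Thrift interface

===Server Side===

Receive the image file

Invoke the server-side program

===References===

[https://developer.android.com/index.html Getting started with Android development]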

[http://cordova.apache.org Cordova]

[https://github.com/claritylab/clarity-mobile clarity-mobile]

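==Project 5: Machine Intelligence==

===Description===

Build a demonstrable artificial intelligence system.

Steps:

# Set up an Azure virtual machine
# Build a Flask web service
# Set up the AI service (Google TensorFlow)
# lucida.ai
# Develop the smart client (mobile platform, embedded hardware) and integrate it with the server over the Thrift protocol

Reference:
[http://lucida.ai Lucida-AI]

===Submissions===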

{|border=1
|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn Group1]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group2]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group3]]]
|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group4]]]
|-
|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn  Group5]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group6]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group8]]]
|-
|}

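=Acknowledgements=

This course is supported by a Microsoft donation of Azure cloud computing and machine learning resources.

Thanks to Yang Tao (杨滔) and Zhang Yan (章艳), managers at Microsoft, and to Liu Shijun (刘士君) and Yan Wei (闫伟), engineers at Microsoft.

=References=

===Fundamentals===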

# John L. Hennessy, and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
# Neil Matthew, and Richard Stones. Beginning linux programming. John Wiley & Sons, 2011.
# Bjarne Stroustrup, The C++ programming language. Pearson Education, 2013.
# Weiss, Mark Allen, Data structures and algorithm analysis in Java, Addison-Wesley Longman Publishing Co., Inc., 1998.
# David Flanagan, JavaScript: The definitive guide: Activate your web pages. O'Reilly Media, Inc., 2011.
# Miguel Grinberg, Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2014.

===Deep Learning===

# Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016. [http://www.deeplearningbook.org/ DeepLearningBook]
# Google brain team, TensorFlow: Large-scale machine learning on heterogeneous systems, whitepaper, 2015.
# Vijay Agneeswaran, Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, 2014.

===Computer Go===

# Mastering the game of Go with deep neural networks and tree search, Nature 2016.
# Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
# Pachi: State of the art open source Go program, Advances in computer games, Springer Berlin Heidelberg, 2011.
# Training Deep Convolutional Neural Networks to Play Go, JMLR 2015.