版权申明

CC BY-NC-SA

教学团队

互联网+实验室 iNetLab

陈震马晓东章屹松王蓓蓓高英

助教：郑文勋李辰星

协同开发

iCenter-cloud

GitLab

教学目标

以完成一种以大数据为基础的智能系统的原型开发为目标，在实践中运用大数据智能理论与技术。团队成员学习大数据系统与机器智能的理论知识和专业技能，完成项目团队结构设计和原型开发的实践环节，全面提高学生的技术实践能力。

课程内容

技术浅论

技术本质

多维多角度：工具论/人文关怀/社会抱负/技术社会
客观上（大众）
- “受害与受益”（两面性）
  金融理财便捷
  
  金融电信诈骗
- 受益方
  技术改变的领导者（“quick money”）
  
  受惠的人群（便捷性）
- 受害方
  技术改变被动接受的人
  
  被技术改变淘汰的人
主观上（单个人）
- 取决于个人立场、价值观、经历等等

技术泡沫

市场宣传和预期炒作，以及一些传媒的洗脑性的报道等的原因，导致概念混乱。
需要找到技术的本质（涉及认识论）
科学思维就是防止被洗脑，无脑思考。
研究思路：规范模式和实证模式
- 事实陈述的时候，一定要找到论点以及论据，以及判断论点是否统帅论据，论据是否支持论点。
- 实践与操作获得体验，而非感觉与愿望。

大数据索引

索引是加快查找的数据结构（Data Structure），主要有哈希表（Hash Table），树（Tree）和倒排索引（Inverted Index）。

Thomas H. Cormen, introduction to algorithms, third edition, MIT press, 2009.
Tardos, Eva, and Jon Kleinberg. "Algorithm Design." (2006).

Hash Table

布隆过滤器（BloomFilter）

BloomFilter

Java-Bloomfilter

Bloofi

B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.

Tree Index

Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.

Inverted Index

倒排索引（Inverted Index）是搜索引擎使用的数据结构。

倒排索引将关键字（keyword）映射到文档（document），在信息检索（Information Retrieval）中发挥重要作用。

在倒排索引中，每个关键词对应一个倒排链表（Inverted List），记录了该关键词出现的所有文档的编号。

倒排索引上的最重要的运算是集合交（Conjunction），并（Disjunction）和非（Negation）。
倒排索引在实际实现中，可以采用位图（Bitmap）与整数链表（Integer List）两种结构形式。
倒排索引上的交，并和非运算，对应的整数链表操作是Intersection/Unions操作，对应位图是比特AND, OR, NOT操作。

倒排索引实现： Lucene

Bitmap Index

Roaring Bitmap Java-Roaring CRoaring

BAH

Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.

Integer List

Data Structures for Inverted Indexes (ds2i)

Partitioned Elias-Fano Index

FastPFor

Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.

混合结构

融合几种独立结构的混合结构

Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.

其它结构

Succinct Data Structure Library

List of Implemented Data Structures

Bitvectors supporting Rank and Select
Integer Vectors
Wavelet Trees
Compressed Suffix Arrays (CSA)
Balanced Parentheses Representations
Longest Common Prefix (LCP) Arrays
Compressed Suffix Trees (CST)
Range Minimum/Maximum Query (RMQ) Structures

Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.

大数据算法

数据解析

数据解析（Data Analytic），是指对数据集的属性值进行SUM，TopN，Rank操作。一般要求实时响应。

Broadword Implementation of Rank

大数据解析平台，是实现数据解析的分布式软件系统。

Navarro, Gonzalo, and Eliana Providel. "Fast, small, simple rank/select on bitmaps." In International Symposium on Experimental Algorithms, pp. 295-306. Springer Berlin Heidelberg, 2012.
Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, pp. 154-168. Springer Berlin Heidelberg, 2008.

基数估计

基数估计（Cardinality Estimation），评估一下一个集合中不同数目的个数。比如，访问一个网站的独立IP个数。

Cardinality Estimation Algorithm

Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.

大数据系统

Hadoop

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
Jeffrey Dean and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.

Spark

Zaharia, Matei, et al. "Spark: cluster computing with working sets.“ Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010.

机器智能

三个层面

实现的目标与功能分类

语音识别机器视觉智能问答

核心技术分类

特定算法机器学习算法深度神经网络

底层实现方案

可编程逻辑阵列 FPGA / 通用图形处理器 GPGPU / 通用处理器 CPU 群集

机器学习

Machine Learning scikit-learn

Jordan, M. I., and T. M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260. Machine_learning_science_2015

语音识别

Automatic Speech Recognition，简称ASR

计算机视觉

Computer Vision，简称 CV

深度神经网络

卷积神经网络

Deep Neural Networks，简称DNN

Stanford Deep Learning tutorials DL_tutorials

入门导读

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521(7553), pp:436-444, 2015. Deep_Learning_Nature
Jeff Dean, Large-Scale Deep Learning for Intelligent Computer Systems, WSDM 2016. WSDM_keynote
TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016.TensorFlow_OSDI2016_paper TensorFlow_paper

Jeff Dean

Quoc V. Le

论文报告撰写

Madoko

项目分组

大数据智能-学生分组

论文研读

第一次

要求：提交研读论文的PPT（正文部分不超过10页）。

时间：10月14日中午12点之前。

10月19日下午安排每组做一个小报告，每组时间不超过10分钟。

大数据智能-课程阅读1

第二次

要求：提交研读论文的PPT（正文部分不超过10页）。

时间：12月XX日中午12点之前。

12月XX日下午安排每组做一个小报告，每组时间不超过10分钟。

大数据智能-课程阅读2

课程实践

学生准备

携带笔记本，智能手机

(Bring your own laptop computers and camera-ready smart phones)

Azure云平台使用

Azure

Flask-Web服务器搭建

课程项目

项目1-大数据

描述

任务：基于位图索引的概念和原理，用C++实现一个位图索引数据库。

检验：完成对一段网流数据的索引建立，查询。在虚拟机上运行成功，得到正确结果。

网流数据：\\166.111.134.110\team-saturn\网流数据

代码托管：http://gitlab.icenter.tsinghua.edu.cn

时间：10月7日中午12点之前（特殊情况，推迟一周）(校历第四周)

组织：以组为单位，要求要看到所有同学的贡献。

作业提交

[[Group1]]	[[Group2]]	[[Group3]]	[[Group4]]
[[Group5]]	[[Group6]]	[[Group7]]	[[Group8]]

项目2-Lucida使用

Lucida安装

每个组在清华工业云平台上安装Lucida软件

清华工业云 icenter-cloud
下载地址 Lucida-AI

时间：10月26日下周三中午12点之前。（校历第七周）

每组工作

每组熟悉了解Lucida的7种AI服务的实现原理

大数据智能-Lucida使用

项目3-TensorFlow安装

阅读深度学习DeepLearning教程

安装Google TensorFlow

完成TensorFlow网站上Get Started.

MNIST For ML Beginners

时间：2016年11月2日中午12点（校历第八周）

项目4-云+端整合

时间：2016年11月9日中午12点（校历第九周）（特殊情况，延长一周）

Thrift协议

Thrift

Client端

调用摄像头拍照

调用Thrift接口

Server端

接收图片文件

调用服务端程序

参考

Android开发入门

Cordova

clarity-mobile

项目5-机器智能

描述

完成一个可展示的人工智能系统

步骤：

设置Azure虚拟机
架构Flask-Web服务
建立AI服务（Google Tensorflow）
lucida.ai
智能端开发（移动平台、嵌入式硬件） + Thrift协议联调

参考： Lucida-AI

作业提交

[[Group1]]	[[Group2]]	[[Group3]]	[[Group4]]
[[Group5]]	[[Group6]]	[[Group7]]	[[Group8]]

致谢

本课程获得微软Azure云计算与机器学习捐赠支持。

感谢微软公司杨滔经理，章艳经理，刘士君工程师，闫伟工程师。

参考文献

基础

John L. Hennessy, and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
Neil Matthew, and Richard Stones. Beginning linux programming. John Wiley & Sons, 2011.
Bjarne Stroustrup, The C++ programming language. Pearson Education, 2013.
Weiss, Mark Allen, Data structures and algorithm analysis in Java, Addison-Wesley Longman Publishing Co., Inc., 1998.
David Flanagan, JavaScript: The definitive guide: Activate your web pages. " O'Reilly Media, Inc.", 2011.
Miguel Grinberg, Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2014.

深度学习

Yoshua Bengio, Ian Goodfellow, Aaron Courville, Deep Learning, MIT Press, 2016. DeepLearningBook
Google brain team, TensorFlow: Large-scale machine learning on heterogeneous systems, whitepaper, 2015.
Vijay Agneeswaran, Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, 2014.

计算机围棋

Mastering the game of Go with deep neural networks and tree search, nature 2015.
Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
Pachi: State of the art open source Go program, Advances in computer games, Springer Berlin Heidelberg, 2011.
Training Deep Convolutional Neural Networks to Play Go, JMLR 2015.

最后修改于2016年11月4日 (星期五) 08:49

大数据与机器智能

版权申明

教学团队

协同开发

教学目标

课程内容

技术浅论

技术本质

技术泡沫

大数据索引

Hash Table

Tree Index

Inverted Index

Bitmap Index

Integer List

混合结构

其它结构

大数据算法

数据解析

基数估计

大数据系统

Hadoop

Spark

机器智能

三个层面

实现的目标与功能分类

核心技术分类

底层实现方案

机器学习

语音识别

计算机视觉

深度神经网络

论文报告撰写

项目分组

论文研读

第一次

第二次

课程实践

学生准备

Azure云平台使用

Flask-Web服务器搭建

课程项目

项目1-大数据

描述

作业提交

项目2-Lucida使用

Lucida安装

每组工作

项目3-TensorFlow安装

项目4-云+端整合

Thrift协议

Client端

Server端

参考

项目5-机器智能

描述

作业提交

致谢

参考文献

基础

深度学习

计算机围棋

iCenter Wiki