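Latest revision as of 03:24, 9 December 2019 (Monday)

Copyright Notice

Course Information

Course number: 01510243

English name: Big Data and Machine Intelligence

Abbreviation: BDMI

Teaching Team

Intelligent Systems Lab: 陈震, 陆昕

Teaching assistants: 谢睿, 黄世宇, 高宸, 许书畅

Big Data and Machine Intelligence: List of Past Teaching Assistants

Teaching Resources

Supported by the Tsinghua iCenter artificial intelligence platform.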

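Classroom Teaching

Course Content

Big Data and Machine Intelligence: Course Content

Teaching Plan

- Teaching plan

Fall semester 2019: Teaching plan

- Practical teaching

 Practical teaching, Field study

Course Management

- Course groups, course research, course projects

 Student groups, Course research, Course projects, Paper reading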

Reference Materials

 Reference textbooks

 Reference courses

 Reference literature

Acknowledgments

Acknowledgments: Microsoft

Differences between revisions of “Big Data and Machine Intelligence”


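@@ Line 1: / Line 1: @@
-==Copyright Notice==
+= Copyright Notice =
-CC BY-NC-SA
+CC [https://creativecommons.org/licenses/by-nc-nd/3.0/cn/ BY-NC-ND]
-==Teaching Team==
+=Course Information=
-Internet Plus Lab [http://net.icenter.tsinghua.edu.cn iNetLab]
+Course number: 01510243
-[https://www.researchgate.net/profile/Zhen_Chen16/ 陈震]
+English name: Big Data and Machine Intelligence
-马晓东 章屹松 王蓓蓓 高英
-Teaching assistants: 郑文勋, 李辰星
+Abbreviation: BDMI
-==Collaborative Development==
+Teaching objectives: [[大数据智能-教学目标|Big Data Intelligence: Teaching Objectives]]
-[http://cloud.icenter.tsinghua.edu.cn  iCenter-cloud]
+== Teaching Team ==
-[http://gitlab.icenter.tsinghua.edu.cn  GitLab]
+  [[互联网+实验室|Intelligent Systems Lab]]: [[陈震]] [[陆昕]]
-=Teaching Objectives=
+ Teaching assistants: 谢睿, 黄世宇, 高宸, 许书畅
-The goal is to develop a prototype of an intelligent system built on big data, applying the theory and techniques of big data intelligence in practice. Team members learn the theoretical knowledge and professional skills of big data systems and machine intelligence, carry out the hands-on work of designing the project team structure and developing the prototype, and thereby improve their practical technical skills across the board.
+[[BDMI-TA| Big Data and Machine Intelligence: List of Past Teaching Assistants]]
-=Course Content=
+== Teaching Resources ==
-==A Brief Discussion of Technology==
+ <big>'''Supported by the Tsinghua iCenter [[AI云|artificial intelligence platform]].''' </big>
-===The Nature of Technology===
+= Classroom Teaching =
+==Course Content==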
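-* Multiple dimensions and perspectives: instrumentalism / humanistic concern / social aspiration / technological society
+ [[大数据与机器智能-课程内容|Big Data and Machine Intelligence: Course Content]]
-* Objectively (the general public)
-** "Harm and benefit" (two-sidedness)
-**: Convenient financial management
-**: Financial and telecom fraud
-** Beneficiaries
-**: Leaders of technological change ("quick money")
-**: People who benefit (convenience)
-** Those harmed
-**: People who passively accept technological change
-**: People displaced by technological change
-* Subjectively (the individual)
-** Depends on personal standpoint, values, experience, and so on
-===Technology Bubbles===
+== Teaching Plan ==
-* Marketing, hype about expectations, and brainwashing-style media coverage lead to conceptual confusion.
+** Teaching plan
-* One needs to find the essence of a technology (a matter of epistemology).
-* Scientific thinking guards against being brainwashed and against thinking uncritically.
-* Research approaches: the normative mode and the empirical mode
-** When stating facts, identify the claim and the evidence, then check whether the claim governs the evidence and whether the evidence supports the claim.
-** Gain experience through practice and hands-on work, not through feelings and wishes.
-==Big Data Indexing==
+Fall semester 2019: [[大数据智能-教学计划 | Teaching plan]]
-An index is a data structure that speeds up lookups. The main kinds are the hash table, the tree, and the inverted index.
+** Practical teaching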
-# Thomas H. Cormen, Introduction to Algorithms, third edition, MIT Press, 2009.
+ [[大数据智能-实践教学 | Practical teaching]], [[大数据智能-调研考察| Field study]]
-# Tardos, Eva, and Jon Kleinberg. "Algorithm Design." (2006).
-===Hash Table===
+== Course Management ==
-Bloom filter (BloomFilter)
+** Course groups  ** Course research  ** Course projects
-[http://billmill.org/bloomfilter-tutorial BloomFilter]
+ [[大数据智能-学生分组 | Student groups]], [[大数据智能-课程研究 | Course research]], [[大数据智能-课程项目 | Course projects]], [[大数据智能-论文阅读 | Paper reading]]
-[https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]
+= Reference Materials =
-[https://github.com/lemire/bloofi Bloofi]
+ [[大数据智能-参考教材 | Reference textbooks]]
-# B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
+ [[大数据智能-参考课程 | Reference courses]]
-# Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.
-===Tree Index===
+ [[大数据智能-参考文献 | Reference literature]]
-[https://en.wikipedia.org/wiki/B-tree B-tree]
+= Acknowledgments =
-[https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]
+[[致谢-微软公司|Acknowledgments: Microsoft]]
-[https://en.wikipedia.org/wiki/R-tree R-tree]
-[https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
-# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
-# Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.
-===Inverted Index===
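-An inverted index is the data structure used by search engines.
-An inverted index maps keywords to documents and plays an important role in information retrieval.
-In an inverted index, each keyword has an inverted list that records the IDs of all documents in which the keyword appears.
-* The most important operations on an inverted index are set intersection (conjunction), union (disjunction), and complement (negation).
-* In practice, an inverted index can be implemented with either of two structures: a bitmap or an integer list.
-* Intersection, union, and complement on an inverted index correspond to intersection/union operations on integer lists and to bitwise AND, OR, and NOT on bitmaps.
-Inverted index implementation: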
-[https://lucene.apache.org/core/ Lucene]
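The inverted-index operations described above can be illustrated with a minimal Python sketch. It is not how Lucene implements them. The sample documents, the whitespace tokenizer, and every function name here are assumptions made for illustration. The sketch builds integer posting lists, intersects two of them with a linear merge, and shows the same queries as bitwise AND and NOT on bitmaps, using Python integers as bitmaps.

```python
from collections import defaultdict

# Hypothetical sample collection: document ID -> text.
DOCS = {0: "big data index", 1: "inverted index search", 2: "big search engine"}


def build_inverted_index(docs):
    """Map each keyword to the sorted list of IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}


def intersect(a, b):
    """Conjunction (AND) of two sorted integer lists by a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_bitmap(doc_ids):
    """Encode a set of document IDs as an integer with bit d set for document d."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << d
    return bits


def from_bitmap(bits):
    """Decode a bitmap back into a sorted list of document IDs."""
    ids, pos = [], 0
    while bits:
        if bits & 1:
            ids.append(pos)
        bits >>= 1
        pos += 1
    return ids


INDEX = build_inverted_index(DOCS)
UNIVERSE = to_bitmap(DOCS)  # bitmap with one bit set per known document

# "big AND search" on integer lists and on bitmaps.
and_list = intersect(INDEX["big"], INDEX["search"])
and_bits = from_bitmap(to_bitmap(INDEX["big"]) & to_bitmap(INDEX["search"]))

# "NOT big": complement within the universe of known documents.
not_bits = from_bitmap(~to_bitmap(INDEX["big"]) & UNIVERSE)
```

Both representations return document 2 for "big AND search". The complement has to be masked with the set of known documents, because a bare bitwise NOT would also set bits for document IDs that do not exist.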
-=====Bitmap Index=====
-Roaring Bitmap
-[https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring]
-[https://github.com/RoaringBitmap/CRoaring CRoaring]
-[http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH]
-# Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
-# Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
-# Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
-# Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.
-=====Integer List=====
-Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])
-[http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano Index]
-[https://github.com/lemire/FastPFor FastPFor]
-# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
-# Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
-# Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
-# Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
-# Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
-# Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
-===Hybrid Structures===
-Hybrid structures that combine several independent structures
-# Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.
-===Other Structures===
-[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
-[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]
-* Bitvectors supporting Rank and Select
-* Integer Vectors
-* Wavelet Trees
-* Compressed Suffix Arrays (CSA)
-* Balanced Parentheses Representations
-* Longest Common Prefix (LCP) Arrays
-* Compressed Suffix Trees (CST)
-* Range Minimum/Maximum Query (RMQ) Structures
-# Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2012), 295–306.
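-==Big Data Algorithms==
-===Data Analytics===
-Data analytics means applying operations such as SUM, TopN, and Rank to the attribute values of a data set. Real-time response is usually required.
-* [https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword Implementation of Rank]
-A big data analytics platform is a distributed software system that implements data analytics.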
-* [http://kylin.io Apache Kylin]
-* [http://druid.io/ Druid]
-# Navarro, Gonzalo, and Eliana Providel. "Fast, small, simple rank/select on bitmaps." In International Symposium on Experimental Algorithms, pp. 295-306. Springer Berlin Heidelberg, 2012.
-# Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, pp. 154-168. Springer Berlin Heidelberg, 2008.
-===Cardinality Estimation===
-Cardinality estimation estimates the number of distinct elements in a set, for example the number of unique IP addresses that visit a website.
-* [https://github.com/Microsoft/CardinalityEstimation Cardinality Estimation Algorithm]
-# Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
-# Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.
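To make the idea of cardinality estimation concrete, here is a minimal Python sketch. It uses linear counting, a simpler technique than the HyperLogLog algorithm cited above, and it is not the API of the linked CardinalityEstimation library. The function name, the bitmap size `m`, and the synthetic IP addresses are all assumptions made for illustration. Each item is hashed to one of `m` slots, and the number of distinct items is estimated from the fraction of slots that stay empty.

```python
import hashlib
import math


def linear_count(items, m=8192):
    """Estimate the number of distinct items using an m-slot bitmap (linear counting)."""
    # One byte per slot for readability; a real bitmap would pack eight slots per byte.
    bitmap = bytearray(m)
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % m
        bitmap[slot] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        raise ValueError("bitmap is saturated; choose a larger m")
    # With n distinct items, the expected fraction of empty slots is about exp(-n / m).
    return -m * math.log(zeros / m)


# Hypothetical access log: 1000 distinct IP addresses, each seen three times.
ips = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
estimate = linear_count(ips * 3)
```

Repeated visits from the same address set the same slot, so duplicates do not change the estimate. Memory use is fixed by `m` rather than by the number of distinct items, and accuracy degrades as the bitmap fills up.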
-==Big Data Systems==
-===Hadoop===
-[http://hadoop.apache.org Hadoop]
-# Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
-# Jeffrey Dean and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
-===Spark===
-[http://spark.apache.org Spark]
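-# Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010.
-==Machine Intelligence==
-[https://www.openai.com/blog/ OpenAI]
-[http://research.google.com/teams/brain/ Google_Brain]
-[https://research.facebook.com/ai Facebook_AI-Research]
-===Three Levels===
-====Classification by Goal and Function====
-Speech recognition, machine vision, intelligent question answering
-====Classification by Core Technology====
-Task-specific algorithms, machine learning algorithms, deep neural networks
-====Underlying Implementation Options====
-Field-programmable gate arrays (FPGA) / general-purpose graphics processing units (GPGPU) / clusters of general-purpose processors (CPU)
-===Machine Learning===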
-Machine Learning [http://scikit-learn.org scikit-learn]
-# Jordan, M. I., and T. M. Mitchell. "Machine learning: Trends, perspectives, and prospects." Science 349, no. 6245 (2015): 255-260. [http://science.sciencemag.org/content/349/6245/255 Machine_learning_science_2015]
-====Speech Recognition====
-Automatic Speech Recognition, abbreviated ASR
-====Computer Vision====
-Computer Vision, abbreviated CV
-===Deep Neural Networks===
-[[卷积神经网络|Convolutional Neural Networks]]
-Deep Neural Networks, abbreviated DNN
-Stanford Deep Learning tutorials [http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial DL_tutorials]
-Introductory reading
-# LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521(7553), pp:436-444, 2015. [http://www.nature.com/nature/journal/v521/n7553/full/nature14539.html Deep_Learning_Nature]
-# Jeff Dean, Large-Scale Deep Learning for Intelligent Computer Systems, WSDM 2016. [http://research.google.com/pubs/jeff.html WSDM_keynote]
-# TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016.[https://www.usenix.org/system/files/conference/osdi16/osdi16-abadi.pdf TensorFlow_OSDI2016_paper] [http://research.google.com/pubs/pub45381.html TensorFlow_paper]
-[http://research.google.com/pubs/jeff.html Jeff Dean]
-[http://cs.stanford.edu/~quocle/ Quoc V. Le]
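-==Writing Papers and Reports==
-[http://www.madoko.net Madoko]
-=Project Groups=
-[[大数据智能-学生分组|Big Data Intelligence: Student Groups]]
-=Paper Reading=
-==First Round==
-Requirement: submit slides (PPT) on the paper you studied, with no more than 10 pages in the main body.
-Deadline: before 12:00 noon on October 14.
-On the afternoon of the 19th, each group gives a short presentation of no more than 10 minutes.
-[[大数据智能-课程阅读1|Big Data Intelligence: Course Reading 1]]
-==Second Round==
-Requirement: submit slides (PPT) on the paper you studied, with no more than 10 pages in the main body.
-Deadline: before 12:00 noon on December XX.
-On the afternoon of day XX, each group gives a short presentation of no more than 10 minutes.
-[[大数据智能-课程阅读2|Big Data Intelligence: Course Reading 2]]
-=Course Practice=
-===Student Preparation===
-Bring a laptop and a smartphone
-(Bring your own laptop computers and camera-ready smart phones)
-===Using the Azure Cloud Platform===
-[http://portal.azure.com Azure]
-===Setting Up a Flask Web Server===
-[[Flask-Web服务器搭建|Setting Up a Flask Web Server]]
-=Course Projects=
-==Project 1: Big Data==
-===Description===
-Task: based on the concepts and principles of bitmap indexes, implement a bitmap index database in C++.
-Acceptance check: build an index over a sample of network flow data and query it. The program must run successfully on a virtual machine and return correct results.
-Network flow data: \\166.111.134.110\team-saturn\网流数据
-Code hosting: http://gitlab.icenter.tsinghua.edu.cn
-Deadline: before 12:00 noon on October 7 (one-week extension in special circumstances) (week 4 of the academic calendar)
-Organization: work in groups; every student's contribution must be visible.
-===Submissions===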
-{|border=1
-|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/xavieryao/bitmap-db Group1]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group2/bitmap  Group2]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/3rd_group/bitmap_indexing  Group3]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/taanng/Bitmap  Group4]]]
-|-
-|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn/ddeerreekk/Experiment_1_Bitmap_Index  Group5]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group6/project1  Group6]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/FQX/bitmap  Group8]]]
-|-
-|}
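-==Project 2: Using Lucida==
-===Installing Lucida===
-Each group installs the Lucida software on the Tsinghua industrial cloud platform
-* Tsinghua industrial cloud [https://cloud.icenter.tsinghua.edu.cn icenter-cloud]
-* Download [https://github.com/claritylab/lucida Lucida-AI]
-Deadline: before 12:00 noon next Wednesday, October 26 (week 7 of the academic calendar)
-===Work for Each Group===
-Each group studies how Lucida's 7 AI services are implemented
-[[大数据智能-Lucida使用|Big Data Intelligence: Using Lucida]]
-==Project 3: Installing TensorFlow==
-Read the deep learning tutorial
-Install Google TensorFlow
-Complete "Get Started" on the TensorFlow website.
-[https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html MNIST For ML Beginners]
-Deadline: 12:00 noon on November 2, 2016 (week 8 of the academic calendar)
-==Project 4: Cloud + Device Integration==
-Deadline: 12:00 noon on November 9, 2016 (week 9 of the academic calendar) (one-week extension in special circumstances)
-===Thrift Protocol===
-[https://thrift.apache.org Thrift]
-===Client Side===
-Take a photo with the camera
-Call the Thrift interface
-===Server Side===
-Receive the image file
-Invoke the server-side program
-===References===
-[https://developer.android.com/index.html Getting started with Android development]
-[http://cordova.apache.org Cordova]
-[https://github.com/claritylab/clarity-mobile clarity-mobile]
-==Project 5: Machine Intelligence==
-===Description===
-Build a demonstrable artificial intelligence system
-Steps:
-# Set up an Azure virtual machine
-# Set up a Flask web service
-# Build the AI service (Google TensorFlow)
-# lucida.ai
-# Develop the smart client (mobile platform or embedded hardware) and run joint debugging over the Thrift protocol
-Reference:
-[http://lucida.ai Lucida-AI]
-===Submissions===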
-{|border=1
-|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn Group1]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group2]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group3]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group4]]]
-|-
-|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn  Group5]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group6]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group8]]]
-|-
-|}
-=Acknowledgments=
-This course is supported by a donation of Microsoft Azure cloud computing and machine learning resources.
-Thanks to 杨滔 (manager), 章艳 (manager), 刘士君 (engineer), and 闫伟 (engineer) of Microsoft.
-=References=
-===Fundamentals===
-# John L. Hennessy, and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
-# Neil Matthew, and Richard Stones. Beginning Linux programming. John Wiley & Sons, 2011.
-# Bjarne Stroustrup, The C++ programming language. Pearson Education, 2013.
-# Weiss, Mark Allen, Data structures and algorithm analysis in Java, Addison-Wesley Longman Publishing Co., Inc., 1998.
-# David Flanagan, JavaScript: The definitive guide: Activate your web pages. O'Reilly Media, Inc., 2011.
-# Miguel Grinberg, Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2014.
-===Deep Learning===
-# Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016. [http://www.deeplearningbook.org/ DeepLearningBook]
-# Google Brain team, TensorFlow: Large-scale machine learning on heterogeneous systems, whitepaper, 2015.
-# Vijay Agneeswaran, Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, 2014.
-===Computer Go===
-# Mastering the game of Go with deep neural networks and tree search, Nature, 2016.
-# Better Computer Go Player with Neural Network and Long-term Prediction, ICLR 2016.
-# Pachi: State of the art open source Go program, Advances in computer games, Springer Berlin Heidelberg, 2011.
-# Training Deep Convolutional Neural Networks to Play Go, JMLR 2015.