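Latest revision as of 03:24, 9 December 2019 (Mon)

Copyright notice

Course information

Course number: 01510243

English name: Big Data and Machine Intelligence

Abbreviation: BDMI

Teaching team

Intelligent Systems Lab: 陈震, 陆昕

Teaching assistants: 谢睿, 黄世宇, 高宸, 许书畅

Big Data and Machine Intelligence: list of past teaching assistants

Teaching resources

Supported by the Tsinghua iCenter artificial intelligence platform.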

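Classroom teaching

Course content

Big Data and Machine Intelligence: course content

Teaching plan

- Teaching plan

Fall semester 2019: Teaching plan

- Practical teaching

 Practical teaching, field study

Teaching management

- Course groups, course research, course projects

 Student groups, course research, course projects, paper reading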

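Reference materials

 Reference textbooks

 Reference courses

 Reference literature

Acknowledgements

Acknowledgements: Microsoft

Differences between revisions of "Big Data and Machine Intelligence"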

@@ Line 1: / Line 1: @@
-==Copyright notice==
+= Copyright notice =
-CC BY-NC-SA
-==Teaching team==
+CC [https://creativecommons.org/licenses/by-nc-nd/3.0/cn/ BY-NC-ND]
-Internet+ Lab [http://net.icenter.tsinghua.edu.cn iNetLab]
-[https://www.researchgate.net/profile/Zhen_Chen16/ 陈震]
+=Course information=
-马晓东 章屹松 王蓓蓓 高英
-Teaching assistants: 郑文勋 李辰星
+Course number: 01510243
-==Collaborative development==
+English name: Big Data and Machine Intelligence
-iCenter-cloud [http://cloud.icenter.tsinghua.edu.cn  iCenter-cloud]
-Gitlab [http://gitlab.icenter.tsinghua.edu.cn  GitLab]
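+Abbreviation: BDMI
-=Teaching objectives=
+Teaching objectives: [[大数据智能-教学目标|Teaching objectives]]
-The goal is to develop a prototype of an intelligent system built on big data, applying big data intelligence theory and technology in practice. Team members learn the theory and professional skills of big data systems and machine intelligence, and complete the hands-on stages of project team structure design and prototype development, so that students' practical technical ability improves across the board.
+== Teaching team ==
-=Course content=
+ [[互联网+实验室|Intelligent Systems Lab]]: [[陈震]] [[陆昕]]
-==A brief discussion of technology==
+ Teaching assistants: 谢睿, 黄世宇, 高宸, 许书畅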
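-===The nature of technology===
-*Multiple dimensions and perspectives: instrumentalism / humanistic concern / social ambition / technological society
-*Objectively (the general public)
+[[BDMI-TA|Big Data and Machine Intelligence: list of past teaching assistants]]
-**"Harm and benefit" (technology is two-sided)
+== Teaching resources ==
-::Convenient financial services
-::Financial and telecom fraud
-**Those who benefit
+ <big>'''Supported by the Tsinghua iCenter [[AI云|artificial intelligence platform]].''' </big>
-::Leaders of technological change ("quick money")
-::People who gain from it (convenience)
-**Those who are harmed
+= Classroom teaching =
-::People who passively accept technological change
+==Course content==
-::People made obsolete by technological change
-*Subjectively (the individual)
+ [[大数据与机器智能-课程内容|Big Data and Machine Intelligence: course content]]
-::Depends on personal standpoint, values, experience, and so on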
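-===Technology bubbles===
+== Teaching plan ==
-*Marketing, hyped expectations, and sensationalist media coverage lead to confused concepts.
-*We need to find the essence of a technology (this involves epistemology).
-*Scientific thinking means guarding against being brainwashed and against thinking mindlessly.
-*Research approaches: the normative mode and the empirical mode
-**When stating facts, always identify the claim and the evidence, and judge whether the claim governs the evidence and whether the evidence supports the claim.
-**Gain experience through practice and hands-on work, not from feelings and wishes.
-==Big data indexing==
+** Teaching plan
-An index is a data structure that speeds up lookups. The main kinds are trees, hash tables, and inverted indexes.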
-Readings:
+Fall semester 2019: [[大数据智能-教学计划 | Teaching plan]]
-#Thomas H. Cormen et al., Introduction to Algorithms, third edition, MIT Press, 2009.
-#Tardos, Eva, and Jon Kleinberg. "Algorithm Design." (2006).
-----
-===Hash Table===
-Bloom filter (BloomFilter)
+** Practical teaching
-[http://billmill.org/bloomfilter-tutorial BloomFilter]
+ [[大数据智能-实践教学 | Practical teaching]], [[大数据智能-调研考察| Field study]]
-[https://github.com/magnuss/java-bloomfilter java-bloomfilter]
+== Teaching management ==
-[https://github.com/lemire/bloofi Bloofi]
+** Course groups  ** Course research  ** Course projects
+ [[大数据智能-学生分组 | Student groups]], [[大数据智能-课程研究 | Course research]], [[大数据智能-课程项目 | Course projects]], [[大数据智能-论文阅读 | Paper reading]]
-Readings:
+= Reference materials =
-#B. H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
+ [[大数据智能-参考教材 | Reference textbooks]]
-#Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.
-----
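A Bloom filter answers "might this item be in the set?" using a fixed number of bits: it may report false positives but never false negatives. The following is a minimal sketch, not the implementation used by any of the linked libraries. The class name, the default sizes, and the use of salted SHA-256 as the hash family are assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter; a Python integer serves as the bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Both sizes are illustrative defaults, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen = BloomFilter()
for ip in ["10.0.0.1", "10.0.0.2"]:
    seen.add(ip)
print("10.0.0.1" in seen)  # an added item is always reported as present
```

Increasing `num_bits` lowers the false-positive rate at the cost of more memory.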
-===Tree Index===
+ [[大数据智能-参考课程 | Reference courses]]
-B-tree [https://en.wikipedia.org/wiki/B-tree B-tree]
+ [[大数据智能-参考文献 | Reference literature]]
-K-D tree [https://en.wikipedia.org/wiki/K-d_tree kd-tree]
+= Acknowledgements =
-R-tree [https://en.wikipedia.org/wiki/R-tree R-tree]
+[[致谢-微软公司|Acknowledgements: Microsoft]]
-R+ tree [https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
-Readings:
-#Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
-#Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.
-----
-===Inverted Index===
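-An inverted index is the data structure used by search engines.
-It maps keywords to documents and plays an important role in information retrieval.
-In an inverted index, each keyword has an inverted list that records the IDs of all documents in which the keyword appears.
-#The most important operations on an inverted index are set intersection (conjunction), union (disjunction), and complement (negation).
-#In practice, an inverted index can be implemented with either of two structures: a bitmap or an integer list.
-#Intersection, union, and complement on an inverted index correspond to intersection/union operations on integer lists, and to bitwise AND, OR, and NOT on bitmaps.
-Inverted index implementation: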
-[https://lucene.apache.org/core/ Lucene]
-----
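The correspondence between the two representations can be sketched as follows. This is a toy illustration, not how Lucene stores postings: the function names and the three sample documents are made up, inverted lists are plain sorted Python lists, and bitmaps are Python integers in which bit i stands for document i.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def intersect(a, b):
    """Conjunction of two sorted integer lists using a linear merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def to_bitmap(ids):
    """Encode a list of document IDs as an integer bitmap."""
    bitmap = 0
    for doc_id in ids:
        bitmap |= 1 << doc_id
    return bitmap


def from_bitmap(bitmap):
    """Decode an integer bitmap back into a sorted list of document IDs."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


docs = ["big data index", "bitmap index", "big bitmap"]
index = build_inverted_index(docs)

# Integer-list form: documents containing both "big" and "index".
print(intersect(index["big"], index["index"]))  # [0]

# Bitmap form: the same query is a bitwise AND; OR and NOT work the same way.
big, idx = to_bitmap(index["big"]), to_bitmap(index["index"])
all_docs = (1 << len(docs)) - 1  # mask so that NOT stays within the collection
print(from_bitmap(big & idx))        # [0]
print(from_bitmap(big | idx))        # [0, 1, 2]
print(from_bitmap(all_docs & ~big))  # [1]
```

Integer lists tend to suit sparse keywords and bitmaps dense ones, which is why the readings in the following subsections treat their compression separately.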
-=====Bitmap Index=====
-Roaring Bitmap [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring]
-[https://github.com/RoaringBitmap/CRoaring CRoaring]
-[http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH]
-Readings:
-#Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
-#Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
-#Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
-#Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.
-----
-=====Integer List=====
-Data Structures for Inverted Indexes (ds2i) [https://github.com/ot/ds2i ds2i]
-[http://github.com/ot/partitioned_elias_fano partitioned_elias_fano]
-[https://github.com/lemire/FastPFor FastPFor]
-Readings:
-#Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
-#Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
-#Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
-#Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
-#Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
-#Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
-----
-===Hybrid structures===
-Hybrid structures that combine several independent structures
-#Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.
-----
-===Other structures===
-Succinct Data Structure
-[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
-[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures Implemented_SDS]
-#Bitvectors supporting Rank and Select
-#Integer Vectors
-#Wavelet Trees
-#Compressed Suffix Arrays (CSA)
-#Balanced Parentheses Representations
-#Longest Common Prefix (LCP) Arrays
-#Compressed Suffix Trees (CST)
-#Range Minimum/Maximum Query (RMQ) Structures
-Readings:
-#Navarro, Gonzalo, and Eliana Providel. 2012. "Fast, Small, Simple Rank/Select on Bitmaps." In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2012), 295–306.
-==Big data algorithms==
-===Data analytics===
-Data analytics refers to operations such as SUM, TopN, and Rank over the attribute values of a dataset. Real-time response is generally required.
-Readings:
-#Navarro, Gonzalo, and Eliana Providel. "Fast, small, simple rank/select on bitmaps." In International Symposium on Experimental Algorithms, pp. 295-306. Springer Berlin Heidelberg, 2012.
-Broadword Implementation of Rank
-[https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword]
-#Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, pp. 154-168. Springer Berlin Heidelberg, 2008.
-----
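The readings above concern rank and select queries on bitmaps: rank counts the 1 bits before a position, and select finds the position of the k-th 1 bit. The sketch below only shows what the two queries return. It is a naive version, not the broadword technique from the Vigna paper, and the function names and the sample bitmap are illustrative assumptions.

```python
def rank1(bitmap, i):
    """Number of 1 bits in positions [0, i) of the bitmap (bit 0 is the lowest)."""
    return bin(bitmap & ((1 << i) - 1)).count("1")


def select1(bitmap, k):
    """Position of the k-th 1 bit (k starts at 1), or -1 if there are fewer than k."""
    pos = 0
    while bitmap:
        if bitmap & 1:
            k -= 1
            if k == 0:
                return pos
        bitmap >>= 1
        pos += 1
    return -1


bitmap = 0b101101  # 1 bits at positions 0, 2, 3, and 5
print(rank1(bitmap, 4))    # 3
print(select1(bitmap, 3))  # 3
print(select1(bitmap, 4))  # 5
```

Practical structures precompute block counts so that both queries run in constant or near-constant time instead of scanning the bits.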
-====Big data analytics platforms====
-Apache Kylin [http://kylin.io Kylin]
-Druid [http://druid.io/ Druid]
-----
-===Cardinality estimation===
-Cardinality estimation estimates the number of distinct elements in a set, for example the number of unique IP addresses that visit a website.
-Cardinality Estimation Algorithm
-#Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
-#Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.
-----
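The unique-visitor example can be sketched with a simplified HyperLogLog in the spirit of the Flajolet et al. paper. This is a teaching sketch under stated assumptions, not a production implementation: SHA-1 truncated to 64 bits stands in for the hash function, the precision `p=12` is an arbitrary choice, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the original paper, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Position of the leftmost 1 bit in the remaining bits, counting from 1.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog()
for i in range(50000):
    hll.add(f"192.168.{i // 256}.{i % 256}")  # 50000 distinct made-up addresses
print(round(hll.count()))  # close to 50000; the exact value depends on the hash
```

With 4096 registers the expected relative error is about 1.6%, while the registers occupy a few kilobytes regardless of how many addresses are added.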
-==Big data systems==
-===Hadoop===
-[http://hadoop.apache.org Hadoop]
-#Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS operating systems review. Vol. 37. No. 5. ACM, 2003.
-#Jeffrey Dean and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
-===Spark===
-[http://spark.apache.org Spark]
-#Zaharia, Matei, et al. "Spark: cluster computing with working sets." Proceedings of the 2nd USENIX conference on Hot topics in cloud computing. Vol. 10. 2010.
-==Machine intelligence==
-===Machine learning===
-Machine Learning [http://scikit-learn.org scikit-learn]
-===Deep neural networks===
-Google TensorFlow [https://www.tensorflow.org Tensorflow]
-TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016.[https://www.usenix.org/conference/osdi16/program OSDI-2016]
-===Computer Go===
-Better Computer Go Player with Neural Network and Long-term Prediction
-Pachi: State of the art open source Go program
-Training Deep Convolutional Neural Networks to Play Go
-==Writing papers and reports==
-[http://www.madoko.net Madoko]
-=Project groups=
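-===Group 1===
-Group leader:
-姚沛然
-Members:
-王逸伦 张正彦
-===Group 2===
-Group leader: 王亦凡
-Members: 刘梦旸, 邱昱田
-===Group 3===
-Group leader: 李子豪
-Members: 娄晨耀 张若天 邹逍遥
-===Group 4===
-Group leader: 石冠亚
-Members: 宣程 汤鹏 段了了
-===Group 5===
-Group leader: 杨文聪
-Members: 梅杰 计昊哲 杨应人
-===Group 6===
-Group leader: 赵宇璋
-Members: 孙炜岳 吴一凡
-===Group 8===
-Group leader: 熊铮
-Members: 范承泽, 秦梓鑫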
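-=Paper reading=
-Requirement: submit slides (PPT) on the assigned paper; the main part must not exceed 10 pages.
-Deadline: before 12:00 noon on October 14.
-On the afternoon of the 19th, each group gives a short presentation of no more than 10 minutes.
-===Group 1===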
-RUBIK: Efficient Threshold Queries on Massive Time Series, SSDBM 2015.
-===Group 2===
-SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.
-===Group 3===
-ALACRITY: Analytics-Driven Lossless Data Compression for Rapid In-Situ Indexing, Storing, and Querying, TLDKS X, 2013.
-===Group 4===
-VSEncoding: Efficient Coding and Fast Decoding of Integer Lists via Dynamic Programming, CIKM 2010.
-===Group 5===
-Super-Scalar RAM-CPU Cache Compression, ICDE 2006.
-===Group 6===
-Partitioned Elias-Fano Indexes, SIGIR 2014.
-===Group 8===
-Optimal Space-time Tradeoffs for Inverted Indexes, WSDM 2015.
-=Course practice=
-===Student preparation===
-Bring a laptop and a smartphone
-(Bring your own laptop computers and camera-ready smart phones)
-===Using the Azure cloud platform===
-[http://portal.azure.com Azure]
-===Setting up a Flask web server===
-:First install virtualenv. One of the following two commands will probably work on Mac and Linux:
- $ sudo easy_install virtualenv
-:or, better:
- $ sudo pip install virtualenv
-If you use Ubuntu, try:
- $ sudo apt-get install python-virtualenv
-On CentOS, try:
- $ sudo yum install python-virtualenv
-:Once virtualenv is installed, create a project folder and use the virtualenv command to create a venv folder inside it:
- $ mkdir myproject
- $ cd myproject
- $ virtualenv venv
- New python executable in venv/bin/python
- Installing distribute............done.
-From now on, whenever you want to work on a project, just activate the corresponding environment. On OS X and Linux, do the following:
- $ . venv/bin/activate
-:Now enter the following command to install Flask into your virtualenv:
- $ pip install Flask
-:A few seconds later, everything is ready. (To be continued)
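Once Flask is installed in the activated environment, a first web service can be as small as the sketch below. The file name `app.py`, the route, and the response text are assumptions for illustration; the course page does not specify them.

```python
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    # Hypothetical landing page; the course page does not define any routes.
    return "Hello, BDMI!"


if __name__ == "__main__":
    # Flask's built-in development server, for local testing only.
    app.run()
```

Saved as `app.py`, this starts with `python app.py` inside the activated virtualenv and by default serves on http://127.0.0.1:5000/.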
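-=Course projects=
-==Project 1 - Big data==
-===Description===
-Task: based on the concepts and principles of bitmap indexes, implement a bitmap index database in C++.
-Verification: build an index over a sample of network flow data and query it. The program must run successfully on a virtual machine and return correct results.
-Network flow data: \\166.111.134.110\team-saturn\网流数据
-Code hosting: http://gitlab.icenter.tsinghua.edu.cn
-Deadline: before 12:00 noon on October 7 (extended by one week in special circumstances)
-Organization: work in groups; every student's contribution must be visible.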
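The idea behind the task can be sketched in a few lines. The project itself asks for C++; Python is used here only for brevity, and the class name, the flow fields (`src`, `proto`, `dport`), and the sample records are invented for illustration and are not the format of the course dataset. The index keeps one uncompressed bitmap per (column, value) pair, so equality predicates combine with bitwise operators.

```python
from collections import defaultdict


class BitmapIndex:
    """One bitmap per (column, value); bit i is set when row i holds that value."""

    def __init__(self, rows):
        self.num_rows = len(rows)
        self.bitmaps = defaultdict(int)
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                self.bitmaps[(column, value)] |= 1 << row_id

    def eq(self, column, value):
        """Bitmap of the rows where column == value (0 if the value never occurs)."""
        return self.bitmaps.get((column, value), 0)

    def negate(self, bitmap):
        # Mask to num_rows bits so that NOT stays within the table.
        return ~bitmap & ((1 << self.num_rows) - 1)

    def rows(self, bitmap):
        """Row IDs whose bit is set in the bitmap."""
        return [i for i in range(self.num_rows) if (bitmap >> i) & 1]


flows = [
    {"src": "10.0.0.1", "proto": "tcp", "dport": 80},
    {"src": "10.0.0.2", "proto": "udp", "dport": 53},
    {"src": "10.0.0.1", "proto": "tcp", "dport": 443},
    {"src": "10.0.0.3", "proto": "tcp", "dport": 80},
]
idx = BitmapIndex(flows)

# proto == tcp AND dport == 80
print(idx.rows(idx.eq("proto", "tcp") & idx.eq("dport", 80)))  # [0, 3]
# NOT proto == tcp
print(idx.rows(idx.negate(idx.eq("proto", "tcp"))))            # [1]
```

A real implementation would compress the bitmaps, for example with the Roaring or BAH schemes listed under Bitmap Index, because high-cardinality columns such as IP addresses produce many sparse bitmaps.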
-===Submission===
-{|border=1
-|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/xavieryao/bitmap-db Group1]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group2/bitmap  Group2]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/3rd_group/bitmap_indexing  Group3]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/taanng/Bitmap  Group4]]]
-|-
-|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn/ddeerreekk/Experiment_1_Bitmap_Index  Group5]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/bdmi_group6/project1  Group6]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/FQX/bitmap  Group8]]]
-|-
-|}
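-==Project 2 - Using Lucida==
-===Installing Lucida===
-Each group installs the Lucida software on the Tsinghua industrial cloud platform
-Tsinghua industrial cloud
-[https://cloud.icenter.tsinghua.edu.cn icenter-cloud]
-Download link
-[https://github.com/claritylab/lucida Lucida-AI]
-===Work for each group===
-Each group studies how Lucida's seven AI services are implemented:
-====Calendar service CA====
-====Image matching IMM====
-====Image classification IMC====
-====Question answering QA====
-====Handwritten digit recognition DIG====
-====Face recognition FACE====
-====Speech recognition ASR====
-Deadline: before 12:00 noon next Wednesday, October 26.
-==Project 2 - Cloud and device integration==
-===Thrift protocol===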
-[https://thrift.apache.org Thrift]
-----
-===Client side===
-*Take a photo with the camera
-*Call the Thrift interface
-===References===
-[https://developer.android.com/index.html Getting started with Android development]
-[http://cordova.apache.org cordova]
-[https://github.com/claritylab/clarity-mobile clarity-mobile]
-----
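-===Server side===
-*Receive the image file
-*Call the server-side program
-==Project 3 - Machine intelligence==
-===Description===
-Build a demonstrable artificial intelligence system
-*Step 1: set up an Azure virtual machine;
-*Step 2: set up a Flask web service;
-*Step 3: build the AI service (Google TensorFlow);
-*Step 4: lucida.ai;
-*Step 5: develop the smart client (mobile platform, embedded hardware) and integrate it with the server over the Thrift protocol;
-Reference:
-[http://lucida.ai Lucida-AI]
-===Submission===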
-{|border=1
-|style="height:20px;width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn Group1]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group2]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group3]]]
-|style="width:200px"|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group4]]]
-|-
-|style="height:20px"|[[[http://gitlab.icenter.tsinghua.edu.cn  Group5]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group6]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group7]]]
-|[[[http://gitlab.icenter.tsinghua.edu.cn/  Group8]]]
-|-
-|}
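-=Acknowledgements=
-This course is supported by a donation of Microsoft Azure cloud computing and machine learning resources.
-Thanks to Microsoft: manager 杨滔, manager 章艳, engineer 刘士君, and engineer 闫伟.
-=References=
-Fundamentals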
-#Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
-#Matthew, Neil, and Richard Stones. Beginning linux programming. John Wiley & Sons, 2011.
-#Stroustrup, Bjarne. The C++ programming language. Pearson Education, 2013.
-#Weiss, Mark Allen. Data structures and algorithm analysis in Java. Addison-Wesley Longman Publishing Co., Inc., 1998.
-#Flanagan, David. JavaScript: The definitive guide: Activate your web pages. O'Reilly Media, Inc., 2011.
-#Grinberg, Miguel. Flask Web Development: Developing Web Applications with Python. O'Reilly Media, Inc., 2014.
-Deep learning
-#Ian Goodfellow, Yoshua Bengio, Aaron Courville, Deep Learning, MIT Press, 2016.
-#Google brain team, TensorFlow: Large-scale machine learning on heterogeneous systems, whitepaper, 2015.
-#Vijay Agneeswaran, Real-Time Applications with Storm, Spark, and More Hadoop Alternatives, 2014.