查看“大数据算法”的源代码

= 关系代数运算 =

关系（relation）是由列表头（称为属性）组成的表。

关系中的行称为元组（Tuple）。

关系中属性集合称为关系的模式（Schema）。

关系代数运算：

（1）选择（Selection）
（2）投影（Projection）
（3）并（union）、交（intersection）、差（differences）
（4）自然连接（Natural Join）：给定两个关系，比较两个关系中的每对元组。如果两个元组的所有公共属性的属性值一致，则生成一个新的元组。
（5）分组和聚合（Grouping and Aggregation）：聚合运算包括SUM、COUNT、AVG、MIN和MAX。


= 数据分析 =

数据分析（Data Analytic），是指对数据集（DataSet）的属性值进行统计，以发现有用的信息。

常用的操作有集合SUM，TopN，Rank，Select操作。交互式分析一般要求实时响应。

=集合运算=

==集合（Set）==

For a set of integers: S = {1, 2, 3, 1000}. Set operations include:

membership test: x ∈ S?

Set intersection: S1 ∩ S2

Set union: S1 ∪ S2

Set differences: S1 ∖ S2

Jaccard Index (Tanimoto similarity) ∣S1 ∩ S2 ∣/∣S1 ∪ S2 ∣

==有序集合（Ordered Set） ==

* Iterate: 

in sorted order,

in reverse order,

skippable iterators (jump to first value ≥ x)

* Rank:  how many elements of the set are smaller than k? (counting the number of ones up to a given position)

* Select: find the kth smallest value (finding the position of the k-th bit set)

* Min/max: find the maximal and minimal value

[https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword Implementation of Rank/Select Queries]

研究论文：
# Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, WEA 2008.


= 频繁项目集发现  =

查找频繁项目集ItemSets。其中最有名的算法是A-Priori算法。

从数据集中抽取频繁集，抽取的结果往往采用if-then 形式的规则集合来表示，这些规则被称为关联规则（association rule）。频繁项目集发现常常被看成关联规则挖掘（association rule mining）或关联规则发现。

基于位图的PCY算法。

#Park, Jong Soo, Ming-Syan Chen, and Philip S. Yu. An effective hash-based algorithm for mining association rules. ACM Sigmod 1995.

基于位图的A-Priori算法加速。

#Sung-Tan Kim et al., "BAR: bitmap-based association rule: an implementation and its optimizations." ACM MoMM 2009.

==对比度设置学习==

对比度设置学习（对比集分析）是一种关联规则的学习 ，旨在找出有意义的不同的群体之间的差异，通过逆向工程的关键预测指标，确定每一个特定的组。

基于位图的对比集挖掘算法加速。
#Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.

= 独立元素计数问题 =

FM（Flajolet-Martin）算法

#Flajolet, Philippe, and G. Nigel Martin. "Probabilistic counting algorithms for data base applications." FoCS 1983.


= 窗口内计数问题 =

DGIM（Datar-Gionis-Indyk-Motwani）算法

#Datar, M., Gionis, A., Indyk, P., & Motwani, R. (2002). Maintaining stream statistics over sliding windows. SIAM journal on computing, 31(6), 1794-1813.

= 基数估计 =

基数估计（Cardinality Estimation），评估一个集合中不同数据项的个数的近似算法。比如，访问一个网站的独立IP个数。

* [https://github.com/Microsoft/CardinalityEstimation Cardinality Estimation Algorithm]

# Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
# Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.

= 聚类 =

聚类是对点集按照某种距离测度将它们聚成多个簇的过程。聚类目标是使得同一簇内的点之间距离较短，而不同的簇中的点之间距离较大。

=相关性挖掘=

==相关性测度（Correlation Metrics ）==
地球移动距离（Earth Mover's Distance, EMD）

香农熵（Shannon's Entropy）

互信息（Mutual Information）

条件熵（Conditional Entropy）


==相关性挖掘（Correlation Mining）==

相关性分析位图索引加速。

#Yu Su et al., In-Situ Bitmaps Generation and Efficient Data Analysis based on Bitmaps, HPDC 2015.


= 子群发现（subgroup mining） =

基于位图索引的子群发现方法加速。

# Yi Wang et al., SciSD: Novel Subgroup Discovery Over Scientific Datasets Using Bitmap Indices, #OSU-CISRC-3/15-TR03, March 2015.


==协同过滤 ==

[[推荐系统]]


= 参考教材= 

# Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014. [http://www.mmds.org/ MMDS_book]