2017年5月14日 (日) 03:10的版本

数据分析

数据分析（Data Analytic），是指对数据集DataSet的属性值进行统计，以发现有用的信息。

常用的操作有集合SUM，TopN，Rank，Select操作。交互式分析一般要求实时响应。

集合（Set）

For a set of integers: S = {1, 2, 3, 1000}. Set operations:

tests: x ∈ S?

intersections: S1 ∩ S2

unions: S1 ∪ S2

differences: S1 ∖ S2

Jaccard Index (Tanimoto similarity) ∣S1 ∩ S2 ∣/∣S1 ∪ S2 ∣

有序集合（Ordered Set）

Iterate:

in sorted order,

in reverse order,

skippable iterators (jump to first value ≥ x)

Rank: how many elements of the set are smaller than k? (counting the number of ones up to a given position)

Select: find the kth smallest value (finding the position of the k-th bit set)

Min/max: find the maximal and minimal value

Broadword Implementation of Rank/Select Queries

研究论文：

Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, WEA 2008.

关联规则挖掘

关联规则挖掘（association rule mining），查找频繁项目集ItemSets。其中最有名的算法是Apriori算法。

基于位图的Apriori算法加速。

Kim, Sung-Tan, Jae-Myung Kim, and Sang-Won Lee. "BAR: bitmap-based association rule: an implementation and its optimizations." In Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, pp. 627-631. ACM, 2009.

基数估计

基数估计（Cardinality Estimation），评估一个集合中不同数据项的个数的近似算法。比如，访问一个网站的独立IP个数。

Cardinality Estimation Algorithm

Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.

@@ 第39行： / 第39行： @@
 研究论文：
 # Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, WEA 2008.
-= 基数估计 =
-基数估计（Cardinality Estimation），评估一个集合中不同数据项的个数的近似算法。比如，访问一个网站的独立IP个数。
-* [https://github.com/Microsoft/CardinalityEstimation Cardinality Estimation Algorithm]
-# Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
-# Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.
@@ 第59行： / 第49行： @@
-=大数据解析平台=
+= 基数估计 =
-大数据解析平台，是实现数据解析的分布式软件系统。
-* [http://kylin.io Apache Kylin]
+基数估计（Cardinality Estimation），评估一个集合中不同数据项的个数的近似算法。比如，访问一个网站的独立IP个数。
-* [http://druid.io/ Druid]
+* [https://github.com/Microsoft/CardinalityEstimation Cardinality Estimation Algorithm]
+# Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
+# Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.

“大数据算法”版本间的差异

2017年5月14日 (日) 03:10的版本

目录

数据分析

集合（Set）

有序集合（Ordered Set）

关联规则挖掘

基数估计

导航菜单

个人工具

名字空间

变种

查看

操作

搜索

导航

工具