“大数据算法”版本间的差异
来自iCenter Wiki
(→基数估计) |
(→数据解析) |
||
第2行: | 第2行: | ||
数据解析(Data Analytic),是指对数据集的属性值进行SUM,TopN,Rank,Select操作。一般要求实时响应。 | 数据解析(Data Analytic),是指对数据集的属性值进行SUM,TopN,Rank,Select操作。一般要求实时响应。 | ||
+ | |||
+ | ==集合(Set)== | ||
+ | |||
+ | For a set of integers: S = {1, 2, 3, 1000}. Set operations: | ||
+ | |||
+ | tests: x ∈ S? | ||
+ | |||
+ | intersections: S1 ∩ S2 | ||
+ | |||
+ | unions: S1 ∪ S2 | ||
+ | |||
+ | differences: S1 ∖ S2 | ||
+ | |||
+ | Jaccard Index (Tanimoto similarity) ∣S1 ∩ S2 ∣/∣S1 ∪ S2 ∣ | ||
+ | |||
+ | ==有序集合(Ordered Set) == | ||
+ | |||
+ | Iterate: | ||
+ | |||
+ | in sorted order, | ||
+ | |||
+ | in reverse order, | ||
+ | |||
+ | skippable iterators (jump to first value ≥ x) | ||
+ | |||
+ | Rank: how many elements of the set are smaller than k? | ||
+ | |||
+ | Select: find the kth smallest value | ||
+ | |||
+ | Min/max: find the maximal and minimal value | ||
+ | |||
+ | |||
* [https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword Implementation of Rank] | * [https://lucene.apache.org/core/4_5_0/core/org/apache/lucene/util/BroadWord.html Broadword Implementation of Rank] |
2017年5月11日 (四) 01:58的版本
数据解析
数据解析(Data Analytic),是指对数据集的属性值进行SUM,TopN,Rank,Select操作。一般要求实时响应。
集合(Set)
For a set of integers: S = {1, 2, 3, 1000}. Set operations:
tests: x ∈ S?
intersections: S1 ∩ S2
unions: S1 ∪ S2
differences: S1 ∖ S2
Jaccard Index (Tanimoto similarity) ∣S1 ∩ S2 ∣/∣S1 ∪ S2 ∣
有序集合(Ordered Set)
Iterate:
in sorted order,
in reverse order,
skippable iterators (jump to first value ≥ x)
Rank: how many elements of the set are smaller than k?
Select: find the kth smallest value
Min/max: find the maximal and minimal value
大数据解析平台,是实现数据解析的分布式软件系统。
- Navarro, Gonzalo, and Eliana Providel. "Fast, small, simple rank/select on bitmaps." In International Symposium on Experimental Algorithms, pp. 295-306. Springer Berlin Heidelberg, 2012.
- Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, pp. 154-168. Springer Berlin Heidelberg, 2008.
基数估计
基数估计(Cardinality Estimation),评估一个集合中不同数据项的个数的近似算法。比如,访问一个网站的独立IP个数。
- Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
- Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.
关联规则挖掘
关联规则挖掘(association rule mining),查找频繁项目集ItemSets。其中最有名的算法是Apriori算法。
基于位图的Apriori算法加速。
- Kim, Sung-Tan, Jae-Myung Kim, and Sang-Won Lee. "BAR: bitmap-based association rule: an implementation and its optimizations." In Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, pp. 627-631. ACM, 2009.