“大数据算法”版本间的差异
第44行: | 第44行: | ||
关联规则挖掘(association rule mining),查找频繁项目集ItemSets。其中最有名的算法是Apriori算法。 | 关联规则挖掘(association rule mining),查找频繁项目集ItemSets。其中最有名的算法是Apriori算法。 | ||
+ | |||
+ | ==频繁项目集ItemSets== | ||
基于位图的Apriori算法加速。 | 基于位图的Apriori算法加速。 | ||
#Kim, Sung-Tan, Jae-Myung Kim, and Sang-Won Lee. "BAR: bitmap-based association rule: an implementation and its optimizations." In Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, pp. 627-631. ACM, 2009. | #Kim, Sung-Tan, Jae-Myung Kim, and Sang-Won Lee. "BAR: bitmap-based association rule: an implementation and its optimizations." In Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, pp. 627-631. ACM, 2009. | ||
+ | ==对比度设置学习== | ||
+ | |||
+ | 对比度设置学习(对比集分析)是一种关联规则的学习 ,旨在找出有意义的不同的群体之间的差异,通过逆向工程的关键预测指标,确定每一个特定的组。 | ||
+ | |||
+ | ++Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015. | ||
= 基数估计 = | = 基数估计 = |
2017年5月15日 (一) 04:35的版本
数据分析
数据分析(Data Analytic),是指对数据集DataSet的属性值进行统计,以发现有用的信息。
常用的操作有集合SUM,TopN,Rank,Select操作。交互式分析一般要求实时响应。
集合(Set)
For a set of integers: S = {1, 2, 3, 1000}. Set operations:
tests: x ∈ S?
intersections: S1 ∩ S2
unions: S1 ∪ S2
differences: S1 ∖ S2
Jaccard Index (Tanimoto similarity) ∣S1 ∩ S2 ∣/∣S1 ∪ S2 ∣
有序集合(Ordered Set)
- Iterate:
in sorted order,
in reverse order,
skippable iterators (jump to first value ≥ x)
- Rank: how many elements of the set are smaller than k? (counting the number of ones up to a given position)
- Select: find the kth smallest value (finding the position of the k-th bit set)
- Min/max: find the maximal and minimal value
Broadword Implementation of Rank/Select Queries
研究论文:
- Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, WEA 2008.
关联规则挖掘
关联规则挖掘(association rule mining),查找频繁项目集ItemSets。其中最有名的算法是Apriori算法。
频繁项目集ItemSets
基于位图的Apriori算法加速。
- Kim, Sung-Tan, Jae-Myung Kim, and Sang-Won Lee. "BAR: bitmap-based association rule: an implementation and its optimizations." In Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia, pp. 627-631. ACM, 2009.
对比度设置学习
对比度设置学习(对比集分析)是一种关联规则的学习 ,旨在找出有意义的不同的群体之间的差异,通过逆向工程的关键预测指标,确定每一个特定的组。
++Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.
基数估计
基数估计(Cardinality Estimation),评估一个集合中不同数据项的个数的近似算法。比如,访问一个网站的独立IP个数。
- Flajolet, Philippe, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. "Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS Proceedings 1 (2008).
- Heule, Stefan, Marc Nunkesser, and Alexander Hall. "HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm." In Proceedings of the 16th International Conference on Extending Database Technology, pp. 683-692. ACM, 2013.