Revision as of 11:46, 10 June 2017 (Saturday)

1 Data Analysis
2 Set Operations
- 2.1 Sets
- 2.2 Ordered Sets
3 Frequent Itemset Discovery
- 3.1 Frequent Itemset Discovery (Frequent Item Sets Discovery)
- 3.2 Contrast Set Learning
4 Correlation Mining
- 4.1 Correlation Metrics
- 4.2 Correlation Mining
5 Subgroup Discovery (subgroup mining)
6 Cardinality Estimation
7 Reference Textbooks

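Data Analysis

Data analysis (data analytics) means computing statistics over the attribute values of a dataset in order to discover useful information.

Common operations include the aggregate SUM, TopN, Rank, and Select. Interactive analysis generally requires real-time responses.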
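As a small illustration of the SUM and TopN operations mentioned above, the following sketch uses only the Python standard library. The `records` list and its field layout are hypothetical and stand in for the attribute values of a real dataset.

```python
import heapq

# Hypothetical records: (user, amount) pairs standing in for a dataset's attribute values.
records = [("a", 30), ("b", 75), ("c", 10), ("d", 55), ("e", 42)]

# SUM: aggregate one attribute over the whole dataset.
total = sum(amount for _, amount in records)

# TopN: the N records with the largest attribute value, without sorting everything.
top3 = heapq.nlargest(3, records, key=lambda r: r[1])

print(total)  # 212
print(top3)   # [('b', 75), ('d', 55), ('e', 42)]
```

`heapq.nlargest` keeps only N candidates in memory at a time, which is one reason a heap is a common choice for TopN over large inputs.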

Set Operations

Sets

For a set of integers: S = {1, 2, 3, 1000}. Set operations include:

Membership test: x ∈ S?

Set intersection: S1 ∩ S2

Set union: S1 ∪ S2

Set difference: S1 ∖ S2

Jaccard index (Tanimoto similarity): ∣S1 ∩ S2∣ / ∣S1 ∪ S2∣
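The operations listed above map directly onto Python's built-in `set` type. The sketch below is only an illustration: the second set `S2` and the helper name `jaccard` are made up for the example, and returning 1.0 for two empty sets is a convention chosen here, not something the text prescribes.

```python
def jaccard(s1: set, s2: set) -> float:
    """Jaccard index |S1 ∩ S2| / |S1 ∪ S2| (taken to be 1.0 when both sets are empty)."""
    union = s1 | s2
    if not union:
        return 1.0
    return len(s1 & s2) / len(union)


S1 = {1, 2, 3, 1000}
S2 = {2, 3, 4}  # hypothetical second set

print(3 in S1)          # membership test: True
print(sorted(S1 & S2))  # intersection: [2, 3]
print(sorted(S1 | S2))  # union: [1, 2, 3, 4, 1000]
print(sorted(S1 - S2))  # difference: [1, 1000]
print(jaccard(S1, S2))  # 0.4
```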

Ordered Sets

Iterate:

in sorted order,

in reverse order,

skippable iterators (jump to first value ≥ x)

Rank: how many elements of the set are smaller than k? (counting the number of ones up to a given position)

Select: find the kth smallest value (finding the position of the k-th bit set)

Min/max: find the minimal and maximal values
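One simple way to support the ordered-set operations above is a sorted list plus binary search. The class below is a minimal sketch under that assumption; the class and method names are hypothetical, and `select` is taken to be 1-based (k = 1 returns the smallest value).

```python
from bisect import bisect_left


class OrderedIntSet:
    """A sorted-list sketch of an ordered set of integers."""

    def __init__(self, values):
        self._v = sorted(set(values))

    def __iter__(self):
        # Iterate in sorted order.
        return iter(self._v)

    def __reversed__(self):
        # Iterate in reverse order.
        return reversed(self._v)

    def skip_to(self, x):
        # Skippable iteration: start at the first value >= x.
        return iter(self._v[bisect_left(self._v, x):])

    def rank(self, k):
        # Number of elements strictly smaller than k.
        return bisect_left(self._v, k)

    def select(self, k):
        # k-th smallest value, counting from 1.
        if not 1 <= k <= len(self._v):
            raise IndexError("k out of range")
        return self._v[k - 1]

    def min(self):
        return self._v[0]

    def max(self):
        return self._v[-1]


s = OrderedIntSet([1000, 3, 1, 2])
print(list(s))             # [1, 2, 3, 1000]
print(list(reversed(s)))   # [1000, 3, 2, 1]
print(list(s.skip_to(3)))  # [3, 1000]
print(s.rank(1000))        # 3
print(s.select(2))         # 2
print(s.min(), s.max())    # 1 1000
```

A sorted list gives logarithmic-time rank and constant-time select, but insertions cost linear time; bitmap representations trade this differently.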

Broadword Implementation of Rank/Select Queries

Research paper:

Vigna, Sebastiano. "Broadword implementation of rank/select queries." In International Workshop on Experimental and Efficient Algorithms, WEA 2008.
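To give a flavour of word-level bit tricks on a single 64-bit word, the sketch below counts set bits with the classic sideways-addition technique and builds rank on top of it. It is not the implementation from the cited paper: the function names are hypothetical, `select` uses a plain loop rather than a broadword method, and positions and k are 0-based by assumption.

```python
MASK64 = (1 << 64) - 1


def popcount64(x):
    """Count set bits in a 64-bit word using sideways addition."""
    x &= MASK64
    x = x - ((x >> 1) & 0x5555555555555555)                          # 2-bit sums
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)   # 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F                          # 8-bit sums
    return ((x * 0x0101010101010101) & MASK64) >> 56                 # add all bytes


def rank(word, pos):
    """Number of ones in bit positions [0, pos)."""
    return popcount64(word & ((1 << pos) - 1))


def select(word, k):
    """Position of the k-th set bit (k = 0 is the lowest), or -1 if there is none."""
    word &= MASK64
    for _ in range(k):
        word &= word - 1  # clear the lowest set bit
    if word == 0:
        return -1
    return (word & -word).bit_length() - 1


bits = 0b10110100  # set bits at positions 2, 4, 5, 7
print(popcount64(bits))  # 4
print(rank(bits, 5))     # 2
print(select(bits, 2))   # 5
```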

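Frequent Itemset Discovery

The task is to find frequent itemsets. The best-known algorithm is the A-Priori algorithm.

Frequent itemsets are extracted from a dataset, and the result is usually expressed as a set of if-then rules, which are called association rules. Frequent itemset discovery is therefore often regarded as association rule mining or association rule discovery.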
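The level-wise idea behind A-Priori (a k-itemset can only be frequent if all of its (k-1)-subsets are frequent) can be sketched as follows. This is a minimal, unoptimized illustration; the function name, the toy transactions, and the absolute support threshold are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidates: size-k unions of frequent (k-1)-itemsets whose
        # (k-1)-subsets are all frequent (the A-Priori pruning step).
        candidates = set()
        for a, b in combinations(prev, 2):
            u = a | b
            if len(u) == k and all(frozenset(sub) in prev for sub in combinations(u, k - 1)):
                candidates.add(u)
        # Count each surviving candidate against the transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
itemsets = apriori(transactions, min_support=3)
print(itemsets[frozenset({"bread", "milk"})])  # 3

# Confidence of the if-then rule "bread -> milk": support(bread, milk) / support(bread).
confidence = itemsets[frozenset({"bread", "milk"})] / itemsets[frozenset({"bread"})]
print(confidence)  # 0.75
```

The last two lines show how an association rule is read off the frequent itemsets: the rule's confidence is the fraction of transactions containing the antecedent that also contain the consequent.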

Frequent Itemset Discovery (Frequent Item Sets Discovery)

The bitmap-based PCY algorithm.

Park, Jong Soo, Ming-Syan Chen, and Philip S. Yu. An effective hash-based algorithm for mining association rules. ACM Sigmod 1995.
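The PCY idea can be sketched in two passes: the first pass counts items and hashes every pair into a bucket, the bucket counts are then compressed into a bitmap of frequent buckets, and the second pass counts only pairs whose items are both frequent and whose bucket bit is set. The code below is a simplified illustration, not the paper's implementation; the hash function `bucket_of`, the integer item ids, and the bucket count are all assumptions.

```python
from itertools import combinations


def bucket_of(i, j, num_buckets):
    # Hypothetical hash for a pair of integer item ids (i < j).
    return (i * 31 + j) % num_buckets


def pcy_pass1(baskets, num_buckets, min_support):
    """Count items, hash pairs to buckets, and summarize frequent buckets as a bitmap."""
    item_counts = {}
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_counts[item] = item_counts.get(item, 0) + 1
        for i, j in combinations(items, 2):
            bucket_counts[bucket_of(i, j, num_buckets)] += 1
    bitmap = 0
    for b, c in enumerate(bucket_counts):
        if c >= min_support:
            bitmap |= 1 << b  # one bit per bucket replaces the full counter
    return item_counts, bitmap


def pcy_pass2(baskets, item_counts, bitmap, num_buckets, min_support):
    """Count only candidate pairs: both items frequent and the pair's bucket frequent."""
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    pair_counts = {}
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)
        for i, j in combinations(items, 2):
            if (bitmap >> bucket_of(i, j, num_buckets)) & 1:
                pair_counts[(i, j)] = pair_counts.get((i, j), 0) + 1
    return {p: c for p, c in pair_counts.items() if c >= min_support}


# Hypothetical baskets over integer item ids.
baskets = [[1, 2], [1, 3, 4], [2, 3, 4], [1, 2, 3], [1, 2, 4]]
item_counts, bitmap = pcy_pass1(baskets, num_buckets=7, min_support=3)
frequent_pairs = pcy_pass2(baskets, item_counts, bitmap, num_buckets=7, min_support=3)
print(frequent_pairs)  # {(1, 2): 3}
```

Because a bucket's count is at least the count of any pair hashed into it, a frequent pair always lands in a frequent bucket, so the bitmap filter never discards a true frequent pair.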

Bitmap-based acceleration of the Apriori algorithm.

Sung-Tan Kim et al., "BAR: bitmap-based association rule: an implementation and its optimizations." ACM MoMM 2009.
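A common way to use bitmaps for support counting is to keep one bitmap per item, with bit t set when transaction t contains the item; the support of an itemset is then the number of set bits in the AND of its items' bitmaps. The sketch below illustrates that general idea only and is not taken from the cited paper; Python integers serve as bitmaps and all names and data are hypothetical.

```python
def build_item_bitmaps(transactions):
    """Map each item to an integer bitmap whose bit t is set if transaction t contains it."""
    bitmaps = {}
    for tid, t in enumerate(transactions):
        for item in t:
            bitmaps[item] = bitmaps.get(item, 0) | (1 << tid)
    return bitmaps


def support(itemset, bitmaps, num_transactions):
    """Support count of an itemset: popcount of the AND of its items' bitmaps."""
    acc = (1 << num_transactions) - 1  # start with every transaction selected
    for item in itemset:
        acc &= bitmaps.get(item, 0)
    return bin(acc).count("1")


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
bitmaps = build_item_bitmaps(transactions)
print(bin(bitmaps["bread"]))                                       # 0b11011
print(support({"bread", "milk"}, bitmaps, len(transactions)))      # 3
print(support({"diapers", "beer"}, bitmaps, len(transactions)))    # 2
```

Each candidate is counted with a few bitwise ANDs instead of a scan over the transactions, which is where the speed-up in the counting step comes from.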

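Contrast Set Learning

Contrast set learning (contrast set analysis) is a form of association rule learning. It aims to find meaningful differences between distinct groups by reverse-engineering the key predictors that identify each particular group.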
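The core quantity in this kind of analysis can be illustrated with a small sketch: for each attribute-value pair, compute its support in every group and keep the pairs whose support differs strongly between groups. This is a deliberately simplified illustration; it considers only single attribute-value pairs, applies no statistical significance test, assumes every group is non-empty, and uses hypothetical names, data, and threshold.

```python
def contrast_sets(groups, min_diff):
    """Return {(attribute, value): {group: support}} for pairs whose support
    differs by at least min_diff between some two groups."""
    counts = {}
    for name, records in groups.items():
        for rec in records:
            for pair in rec.items():
                counts.setdefault(pair, dict.fromkeys(groups, 0))[name] += 1

    result = {}
    for pair, per_group in counts.items():
        # Support = fraction of the group's records containing the pair.
        supp = {g: per_group[g] / len(groups[g]) for g in groups}
        if max(supp.values()) - min(supp.values()) >= min_diff:
            result[pair] = supp
    return result


# Hypothetical records split into two groups.
groups = {
    "A": [
        {"color": "red", "size": "S"},
        {"color": "red", "size": "L"},
        {"color": "red", "size": "S"},
        {"color": "blue", "size": "L"},
    ],
    "B": [
        {"color": "blue", "size": "S"},
        {"color": "blue", "size": "L"},
        {"color": "red", "size": "S"},
        {"color": "blue", "size": "L"},
    ],
}
found = contrast_sets(groups, min_diff=0.4)
print(found[("color", "red")])  # {'A': 0.75, 'B': 0.25}
```

Here `color` separates the two groups while `size` does not, so only the `color` pairs are reported.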

Bitmap-based acceleration of contrast set mining.

Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.

Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014. MMDS_book

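@@ Line 45: / Line 45: @@
 = Frequent Itemset Discovery  =
-The task is to find frequent itemsets. The best-known algorithm is the Apriori algorithm.
+The task is to find frequent itemsets. The best-known algorithm is the A-Priori algorithm.
-This problem is often regarded as association rule mining or association rule discovery.
+Frequent itemsets are extracted from a dataset, and the result is usually expressed as a set of if-then rules, which are called association rules. Frequent itemset discovery is therefore often regarded as association rule mining or association rule discovery.
 ==Frequent Itemset Discovery (Frequent Item Sets Discovery)==
@@ Line 65: / Line 65: @@
 Bitmap-based acceleration of contrast set mining.
 #Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.
 =Correlation Mining=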

Difference between revisions of "Big Data Algorithms"
