“大数据索引”版本间的差异

来自iCenter Wiki
跳转至: 导航搜索
第8行: 第8行:
 
=== Hash Table ===
 
=== Hash Table ===
  
布隆过滤器(BloomFilter)
+
*[http://billmill.org/bloomfilter-tutorial BloomFilter]
 
+
*[https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]
[http://billmill.org/bloomfilter-tutorial BloomFilter]
+
*[https://github.com/lemire/bloofi Bloofi]
 
+
[https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]
+
 
+
[https://github.com/lemire/bloofi Bloofi]
+
  
 
# B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
 
# B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
第21行: 第17行:
 
=== Tree Index ===
 
=== Tree Index ===
  
[https://en.wikipedia.org/wiki/B-tree B-tree]
+
* [https://en.wikipedia.org/wiki/B-tree B-tree]
 
+
* [https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]
[https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]
+
* [https://en.wikipedia.org/wiki/R-tree R-tree]
 
+
* [https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
[https://en.wikipedia.org/wiki/R-tree R-tree]
+
 
+
[https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
+
  
 
# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
 
# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
第49行: 第42行:
 
==== Bitmap Index ====
 
==== Bitmap Index ====
  
Roaring Bitmap
+
* Roaring Bitmap [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring] [https://github.com/RoaringBitmap/CRoaring CRoaring]
[https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring]
+
* [http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH]
[https://github.com/RoaringBitmap/CRoaring CRoaring]
+
 
+
[http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH]
+
  
 
# Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
 
# Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
第62行: 第52行:
 
==== Integer List ====
 
==== Integer List ====
  
Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])
+
* Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])
 
+
* [http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano Index]
[http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano Index]
+
* [https://github.com/lemire/FastPFor FastPFor]
 
+
[https://github.com/lemire/FastPFor FastPFor]
+
  
 
# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
 
# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
第83行: 第71行:
 
=== 其它结构 ===
 
=== 其它结构 ===
  
[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
+
*[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
 
+
[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]
+
  
* Bitvectors supporting Rank and Select
+
*[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]
* Integer Vectors
+
** Bitvectors supporting Rank and Select
* Wavelet Trees
+
** Integer Vectors
* Compressed Suffix Arrays (CSA)
+
** Wavelet Trees
* Balanced Parentheses Representations
+
** Compressed Suffix Arrays (CSA)
* Longest Common Prefix (LCP) Arrays
+
** Balanced Parentheses Representations
* Compressed Suffix Trees (CST)
+
** Longest Common Prefix (LCP) Arrays
* Range Minimum/Maximum Query (RMQ) Structures
+
** Compressed Suffix Trees (CST)
 +
** Range Minimum/Maximum Query (RMQ) Structures
  
 
# Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.
 
# Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.

2017年1月25日 (三) 18:26的版本

大数据索引

索引是加快查找的数据结构(Data Structure),主要有哈希表(Hash Table),树(Tree)和倒排索引(Inverted Index)。

  1. Thomas H. Cormen, introduction to algorithms, third edition, MIT press, 2009.
  2. Tardos, Eva, and Jon Kleinberg. "Algorithm Design." (2006).

Hash Table

  1. B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
  2. Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.

Tree Index

  1. Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
  2. Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.

Inverted Index

倒排索引(Inverted Index)是搜索引擎使用的数据结构。

倒排索引将关键字(keyword)映射到文档(document),在信息检索(Information Retrieval)中发挥重要作用。

在倒排索引中,每个关键词对应一个倒排链表(Inverted List),记录了该关键词出现的所有文档的编号。

  • 倒排索引上的最重要的运算是集合交(Conjunction),并(Disjunction)和非(Negation)。
  • 倒排索引在实际实现中,可以采用位图(Bitmap)与整数链表(Integer List)两种结构形式。
  • 倒排索引上的交,并和非运算,对应的整数链表操作是Intersection/Unions操作,对应位图是比特AND,OR,NOT操作。

倒排索引实现: Lucene

Bitmap Index

  1. Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
  2. Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
  3. Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
  4. Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.

Integer List

  1. Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
  2. Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
  3. Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
  4. Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
  5. Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
  6. Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.

混合结构

融合几种独立结构的混合结构

  1. Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.

其它结构

  • List of Implemented Data Structures
    • Bitvectors supporting Rank and Select
    • Integer Vectors
    • Wavelet Trees
    • Compressed Suffix Arrays (CSA)
    • Balanced Parentheses Representations
    • Longest Common Prefix (LCP) Arrays
    • Compressed Suffix Trees (CST)
    • Range Minimum/Maximum Query (RMQ) Structures
  1. Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.