“大数据索引”版本间的差异

来自iCenter Wiki
跳转至: 导航搜索
索引子系统
 
(相同用户的47个中间修订版本未显示)
第1行: 第1行:
= 索引 =
+
=索引子系统=
  
*索引(Index)
+
索引子系统(Index-subsystem)是[[大数据系统]]的核心组件(Component)。
  
索引是加快查找的数据结构(Data Structure)。用于索引功能的数据结构有哈希表(Hash Table)、树(Tree)、整数数组/链表(Integer Array/Linked List)和位图(Bitmap)。
+
NoSQL引擎主要组件是索引,或者是面向特定应用的索引结构。
  
可参考《数据结构》教科书中的搜索(Search)的章节。
+
"NoSQL engines are (basically) just indexes!"
  
* 教科书:
 
  
# Thomas H. Cormen, Introduction to algorithms, third edition, MIT press, 2009.
+
参见[[索引]]
# Eva Tardos  and Jon Kleinberg, Algorithm Design, Pearson, 2006.
+
 
+
=索引的形式=
+
索引的数据结构包括哈希表(Hash Table)、树(Tree)、整数数组/链表(Integer Array/ Linked List)和位图(Bitmap)。
+
 
+
== 哈希表(Hash Table) ==
+
 
+
*[http://billmill.org/bloomfilter-tutorial BloomFilter]
+
*[https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]
+
*[https://github.com/lemire/bloofi Bloofi]
+
 
+
 
+
**研究论文:
+
 
+
# B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
+
# Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.
+
 
+
== 树(Tree) ==
+
 
+
* [https://en.wikipedia.org/wiki/B-tree B-tree]
+
* [https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]
+
* [https://en.wikipedia.org/wiki/R-tree R-tree]
+
* [https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
+
 
+
 
+
**研究论文:
+
 
+
# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
+
# Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.
+
 
+
== 整数数组/链表(Integer Array/List) ==
+
Verbatim Integer List 带来的空间开销,人们发明了Integer List Compression机制,尤其是compressing 32-bit unsigned integers。
+
 
+
* BP128 (Binary packing ) / SIMDBP128 /
+
 
+
* [https://github.com/lemire/FastPFor FastPFor]
+
 
+
* FOR (Frame of reference) / PFOR (Patched Frame of reference) / PFOR Delta (PFOR on deltas) / NewPforDelta /SIMDPforDelta
+
 
+
* Simple16 / Simple8b
+
 
+
* VByte(or VB ) / VarIntGB(or GroupVB ) / StreamVByte /MaskedVByte [http://maskedvbyte.org/ SIMD-accelerated VByte] / [https://github.com/lemire/MaskedVByte MaskedVByte]
+
 
+
* PEF (Partitioned Elias-Fano) [http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano]
+
 
+
* Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])
+
 
+
 
+
**研究论文:
+
 
+
# Daniel Lemire and Christoph Rupp. Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions. Information Systems, 2017.
+
# D. Lemire, L. Boytsov, N. Kurz, SIMD compression and the intersection of sorted integers, Softw. Pract. Exp. 46 (6) (2016) 723–749.
+
# Jeff Plaisance, Nathan Kurz, and Daniel Lemire. "Vectorized vbyte decoding." CoRR, abs/1503.07387, 2015.
+
# Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.
+
# D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Softw. Pract. Exp. 45 (1) (2015) 1–29.
+
# Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
+
# Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
+
# Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
+
# Zukowski, Marcin, Sandor Heman, Niels Nes, and Peter Boncz. "Super-scalar RAM-CPU cache compression." ICDE 2006.
+
 
+
== 位图(Bitmap) ==
+
 
+
===Bit array ===
+
 
+
A bit array (also known as bit map , bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure.
+
 
+
https://en.wikipedia.org/wiki/Bit_array
+
 
+
* Population / Hamming weight
+
 
+
* Find first set (ffs) or Find first one
+
 
+
* Inversion
+
 
+
**研究论文:
+
 
+
# Matthias Vallentin, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." NSDI 2016.
+
# Gonzalo Navarro and Eliana Providel. “Fast, Small, Simple Rank/Select on Bitmaps.” SEA 2013.
+
 
+
===位图压缩===
+
 
+
Verbatim Bitmap 带来的空间开销,尤其是bitmap很稀疏的情况下。为此,人们发明了Bitmap Compression机制。不仅可以降低空间开销,而且可以显著加快运算速度。
+
 
+
 
+
**可运算的位图压缩方法
+
 
+
* BBC (Byte-aligned Bitmap Compression)
+
 
+
* WAH (Word Aligned Hybrid)
+
 
+
* PLWAH (Position List Word-Aligned Hybrid)
+
 
+
* EWAH (Enhanced Word-Aligned Hybrid )
+
 
+
* CONCISE (COmpressed N Composable Integer SEt) 
+
 
+
* PWAH (Partitioned Word-Aligned Hybrid)
+
 
+
* COMPAX (COMPressed Adaptive indeX )
+
 
+
* SBH (Super byte-aligned hybrid)
+
 
+
 
+
** 研究论文
+
 
+
[1] Antoshekov G. Byte-aligned bitmap compression. Proc of the Conf on Data Compression. Piscataway, NJ: IEEE, 1994: 363- 098, 1994.
+
 
+
[2] Wu, Kesheng, Ekow J. Otoo, and Arie Shoshani. "Compressing bitmap indexes for faster search operations." In Scientific and Statistical Database Management, 2002. Proceedings. 14th International Conference on, pp. 99-108. IEEE, 2002.
+
 
+
[3] Wu, K., Otoo, E. J., and Shoshani, A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1), 1-38, 2006.
+
 
+
[4] Stabno, Michał, and Robert Wrembel. "RLH: Bitmap compression technique based on run-length and Huffman encoding." Information Systems 34, no. 4, pp.400-414, 2009.
+
 
+
[5] F. Fusco, M. P. Stoecklin, and M. Vlachos, “Net-fli: on-the-fly compression, archiving and indexing of streaming network traffic,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1382–1393, 2010.
+
 
+
[6] F. Deli`ege, T. B. Pedersen. Position list word aligned hybrid: optimizing space and performance for compressed bitmaps. In Proceeding of the 13th International Conference on Extending Database Technology, 2010.
+
 
+
[7] F. Corrales, D. Chiu, and J. Sawin, “Variable Length Compression for Bitmap Indexes,” DEXA’11, Springer-Verlag, pp.381-395, 2011.
+
 
+
[8] D. Lemire et al., “Sorting improves word-aligned bitmap indexes,” Data & Knowledge Engineering, 69(1), pp.3-28, 2010.
+
 
+
[9] S. J. van Schaik et al., “A memory efficient reachability data structure through bit vector compression,” SIGMOD, pp.913-924, 2011.
+
 
+
[10] S. Chambi, D. Lemire, O. Kaser, and R. Godin, “Better bitmap performance with roaring bitmaps,” Software: Practice and Experience, 2015.
+
 
+
[11] A. Schmidt, D. Kimmig, and M. Beine, “DFWAH: A Proposal of a New Compression Scheme of Medium-Sparse Bitmaps,” in the Third International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2011), pp. 192-195.
+
 
+
[12] R. Slechta, J. Sawin, B. McCamish, D. Chiu and G. Canahuate, A tunable compression framework for bitmap indices, In Data Engineering (ICDE’ 2014), pp. 484-495. IEEE.
+
 
+
 
+
**研究专利
+
 
+
[1] BBC, Byte aligned data compression, DEC/Oracle, http://www.google.com/patents/US5363098.
+
 
+
[2] WAH, Word aligned bitmap compression method, data structure, and apparatus, UC Berkeley, http://www.google.com/patents/US6831575.
+
 
+
[3] COMPAX, Network analysis, IBM, http://www.google.com/patents/US20120054160.
+
 
+
[4] PLWAH, Algorhyme, http://www.google.com/patents/US9236881.
+
 
+
== 混合结构(Hybrid) ==
+
 
+
融合几种独立结构的混合结构。
+
 
+
=== Roaring ===
+
 
+
* Roaring Bitmap
+
 
+
Roaring Bitmap 是一种为索引(Indexes)的运算(与,或,非)而优化的混合数据结构。
+
 
+
Roaring 数据结构内部采用位图Bitmap,整数数组Arrays和游程编码Runs的表示形式,三种表示形式是平等的。当数据以其中某种表示结构的表示时,整体空间为最优,Roaring则采用该表示结构。也就是说,Roaring采用空间最优化的数据表示结构。
+
 
+
Roaring表示的索引之间可以进行高效的运算。运算时,无非也是位图Bitmap,整数数组Arrays和游程编码Runs这三种结构之间的运算。
+
 
+
所有,Roaring是兼顾空间与时间效率的一种有效的索引结构。
+
 
+
 
+
Roaring网址:http://roaringbitmap.org/
+
 
+
Roaring的Java实现 [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring]
+
 
+
Roaring的C/C++实现 [https://github.com/RoaringBitmap/CRoaring CRoaring]
+
 
+
 
+
* Roaring 研究论文:
+
# Lemire, Daniel, Gregory Ssi‐Yan‐Kai, and Owen Kaser. "Consistently faster and smaller compressed bitmaps with Roaring." Software: Practice and Experience 46.11 (2016): 1547-1569.
+
# Samy Chambi,  Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
+
 
+
===其它混合结构===
+
 
+
# Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.
+
 
+
# Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." HPDC 2013.
+
 
+
 
+
**研究论文:
+
 
+
#Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi. Hybrid Compression of Bitvectors for the FM-Index.  In Proc. 2014 Data Compression Conference (DCC 2014), pp. 302-311, 2014.
+
#Gagie, Travis, Gonzalo Navarro, and Simon J. Puglisi. 2012. “New algorithms on wavelet trees and applications to information retrieval.” Theoretical Computer Science 426: 25–41.
+
 
+
== 其它结构(Others) ==
+
 
+
=== SDSL - Succinct Data Structure Library ===
+
 
+
*[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
+
 
+
*[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]
+
** Bitvectors supporting Rank and Select
+
** Integer Vectors
+
** Wavelet Trees
+
** Compressed Suffix Arrays (CSA)
+
** Balanced Parentheses Representations
+
** Longest Common Prefix (LCP) Arrays
+
** Compressed Suffix Trees (CST)
+
** Range Minimum/Maximum Query (RMQ) Structures
+
 
+
 
+
=索引(大数据)=
+
 
+
大数据系统的数据管理,采用位图索引、树索引等。
+
 
+
 
+
===本组研究 ===
+
 
+
*研究定位
+
 
+
数据分析层面:关联规则分析(功能)(系统级)
+
 
+
算法实现层面:  Apriori 算法 (进程级)
+
 
+
数据结构层面:Bitmap (C++/Go/函数库)(编程语言)
+
 
+
处理器执行层面:SSE/AVX/AVX2/FMA  (机器指令 Instructions)
+
 
+
 
+
*本组工作:
+
 
+
**CODIS (Compressing Dirty Snippet)  (C++ )
+
 
+
**[http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH](Byte Aligned Hybrid)  (C++ Language)
+
 
+
**[https://github.com/thuwuyinjun/CAMP CAMP](Common Affix Merging with Partition) (Java Language)
+
 
+
**SPLWAH (CUDA )
+
 
+
**MASC(MAximized Stride with Carrier)
+
 
+
 
+
*本组论文
+
 
+
# Wenxun Zheng  et al., CODIS: A New Compression Scheme for Bitmap Indexes, ANCS 2017.
+
# Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.
+
# Yinjun Wu et al., CAMP: A new bitmap index for data retrieval in traffic archival, Communication Letters, Vol. 20, No. 6, pp.1128 - 1131, June 2016.
+
# Jiahui Chang et al.,  SPLWAH: A bitmap index compression scheme for searching in archival Internet traffic.  ICC 2015.
+
 
+
 
+
*研究方向
+
 
+
#*MPSoC增强的索引运算
+
 
+
++Sebastian Haas et al., An MPSoC for Energy-Efficient Database Query Processing, DAC 2016.
+
 
+
++Sebastian Haas et al., HW/SW-Database-CoDesign for Compressed Bitmap Index Processing, ASAP 2016.
+
 
+
#*非循环图的可达性查询
+
 
+
++Sebastiaan J. van Schaik et al., A Memory Efficient Reachability Data Structure Through Bit Vector Compression, SOGMOD 2011.
+
 
+
#*非循环图的可达性查询
+
 
+
++Gangyi Zhu et al., SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices, SSDBM 2015.
+
 
+
#* ISA增强的集运算
+
 
+
++O. Arnold et al., An application-specific instruction set for accelerating set-oriented database primitives. SIGMOD 2014.
+
 
+
#* 有限状态机运行并行加速
+
 
+
++Peng Jiang et al., Combining SIMD and Many Multi-core Parallelism for Finite State Machines with Enumerative Speculation,PPoPP 2017.
+
 
+
(Using Intel Xeon Phi)
+
 
+
= 索引(搜索引擎) =
+
 
+
搜索引擎(Search Engine)是一种典型的大数据系统,是实现信息检索(Information Retrieval)的软件。
+
 
+
经典的搜索引擎主要是针对爬取的网页文本的检索。 一个网页文本或文档,是由许多的单词组成的。其中每个单词可以在同一个文档中重复出现很多次,同一个单词也可以出现在不同的文档中。
+
 
+
==反向索引(Inverted Index)的原理==
+
 
+
为了实现快速的搜索响应,搜索引擎采用反向索引(Inverted Index)的数据结构。反向索引在信息检索中发挥重要作用。这里介绍一下前向索引(forward index)和反向索引(Inverted Index)的概念。
+
 
+
前向索引(forward index):从文档角度看其中的单词,表示每个文档(用文档ID标识)都含有哪些单词,以及每个单词出现了多少次(词频)及其出现位置(相对于文档首部的偏移量)。
+
 
+
反向索引(inverted index 或inverted files):从单词角度看文档,标识每个单词分别在那些文档中出现(文档ID),以及在各自的文档中每个单词分别出现了多少次(词频)及其出现位置(相对于该文档首部的偏移量)。
+
 
+
<big>直观表示</big>:
+
 
+
前向索引(一对多映射):文档 ---> 单词
+
+
反向索引(一对多映射):单词 ---> 文档
+
 
+
前向索引,又称为正排索引;反向索引,又称为倒排索引。但是正排和倒排索引的称谓,会造成误解。
+
 
+
==反向索引(Inverted Index)的实现==
+
 
+
反向索引的功能是将关键字(keyword)映射到文档(document)。
+
 
+
在反向索引中,每个关键词对应一个反向链表(Inverted List),记录了该关键词出现的所有文档的编号。
+
 
+
反向索引在实际实现中,可以采用位图(Bitmap)与整数数组/链表(Integer Array/List)两种结构形式。
+
 
+
 
+
**研究论文:
+
 
+
# Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
+
 
+
==反向索引(Inverted Index)的运算==
+
 
+
反向索引上的最重要的运算是集合交(Conjunction),并(Disjunction)和非(Negation)。
+
 
+
反向索引上的交,并和非运算,对应的整数链表的实现,其操作是Intersection/Unions操作,对应位图的实现,其操作运算则是比特AND,OR,NOT操作。
+
 
+
 
+
**研究论文:
+
 
+
# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
+
 
+
==反向索引的参考实现==
+
 
+
[https://lucene.apache.org/core/ Lucene]
+
 
+
[https://pypi.python.org/pypi/Whoosh Whoosh]
+
 
+
[http://www.opensearchserver.com OpenSearchServer]
+

2019年6月10日 (一) 11:35的最后版本

索引子系统

索引子系统(Index-subsystem)是大数据系统的核心组件(Component)。

NoSQL引擎主要组件是索引,或者是面向特定应用的索引结构。

"NoSQL engines are (basically) just indexes!"


参见索引