更改 - iCenter Wiki

大数据索引

删除15,619字节、2019年6月10日 (一) 09:35

/* 索引子系统 */

= 索引索引子系统=

*索引（Index）索引子系统（Index-subsystem）是[[大数据系统]]的核心组件（Component）。

索引是加快查找的数据结构（Data Structure）。用于索引功能的数据结构有哈希表（Hash Table）、树（Tree）、整数数组/链表（Integer Array/Linked List）和位图（Bitmap）。NoSQL引擎主要组件是索引，或者是面向特定应用的索引结构。

~~可参考《数据结构》教科书中的搜索（Search）的章节。~~"NoSQL engines are (basically) just indexes!"

* 教科书：

~~# Thomas H. Cormen, Introduction to algorithms, third edition, MIT press, 2009.# Eva Tardos and Jon Kleinberg, Algorithm Design, Pearson, 2006.~~ ~~=索引的形式=索引的数据结构包括哈希表（Hash Table）、树（Tree）、整数数组/链表（Integer Array/ Linked List）和位图（Bitmap）。~~ ~~== 哈希表（Hash Table） ==~~ *参见[~~http://billmill.org/bloomfilter-tutorial BloomFilter]~~*[~~https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]~~*[https://github.com/lemire/bloofi Bloofi] **研究论文： ~~# B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.# Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.~~ ~~== 树（Tree） ==~~ * [https://en.wikipedia.org/wiki/B-tree B-tree]* [https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]* [https://en.wikipedia.org/wiki/R-tree R-tree]* [https://en.wikipedia.org/wiki/R%2B_tree R+ tree] **研究论文： ~~# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.# Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.~~ ~~== 整数数组/链表（Integer Array/List） ==Verbatim Integer List 带来的空间开销，人们发明了Integer List Compression机制，尤其是compressing 32-bit unsigned integers。~~ * BP128 (Binary packing ) / SIMDBP128 / * [https://github.com/lemire/FastPFor FastPFor] * FOR (Frame of reference) / PFOR (Patched Frame of reference) / PFOR Delta (PFOR on deltas) / NewPforDelta /SIMDPforDelta * Simple16 / Simple8b * VByte（or VB ） / VarIntGB（or GroupVB ） / StreamVByte /MaskedVByte [http://maskedvbyte.org/ SIMD-accelerated VByte] / [https://github.com/lemire/MaskedVByte MaskedVByte] * PEF (Partitioned Elias-Fano) [http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano] * Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i]) **研究论文： ~~# Daniel Lemire and Christoph Rupp. Upscaledb: Efficient Integer-Key Compression in a Key-Value Store using SIMD Instructions. Information Systems, 2017.# D. Lemire, L. Boytsov, N. Kurz, SIMD compression and the intersection of sorted integers, Softw. Pract. Exp. 46 (6) (2016) 723–749.# Jeff Plaisance, Nathan Kurz, and Daniel Lemire. "Vectorized vbyte decoding." CoRR, abs/1503.07387, 2015.# Jeff Plaisance, Nathan Kurz, Daniel Lemire, Vectorized VByte Decoding, International Symposium on Web Algorithms 2015, 2015.# D. Lemire and L. Boytsov. Decoding billions of integers per second through vectorization. Softw. Pract. Exp. 45 (1) (2015) 1–29.# Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.# Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.# Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.# Zukowski, Marcin, Sandor Heman, Niels Nes, and Peter Boncz. "Super-scalar RAM-CPU cache compression." ICDE 2006.~~ ~~== 位图（Bitmap） ==~~ ~~应用环境：特别适合作为数据管理中的列式存储（Column-oriented or Columnar ）的索引和信息检索中的反向索引的数据结构。~~ ~~===Bit array ===~~ ~~A bit array (also known as bit map , bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure.~~ ~~https://en.wikipedia.org/wiki/Bit_array~~ * Population / Hamming weight * Find first set (ffs) or Find first one * Inversion **研究论文： ~~# Matthias Vallentin, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." NSDI 2016.# Gonzalo Navarro and Eliana Providel. “Fast, Small, Simple Rank/Select on Bitmaps.” SEA 2013.~~ ~~===位图压缩===~~ Verbatim Bitmap 带来的空间开销，尤其是bitmap很稀疏的情况下。为此，人们发明了Bitmap Compression机制。不仅可以降低空间开销，而且可以显著加快运算速度。 ~~在1995年，甲骨文公司（oracle）的G. Antoshekov展示了一种新的针对位图索引的压缩算法Byte-Aligned Bitmap Compression（BBC）。~~ 在此之后，WAH (Word Aligned Hybrid compression), PLWAH (Position list word aligned hybrid), COMPAX (COMPressed Adaptive indeX) 等多种索引压缩方法被提出，在国际上被认可并在数据库索引编码中被广泛应用。 ~~目前，已有相关运营软件产生，如Spark, Fastbit , Druid 等平台。~~ **可运算的位图压缩方法 * BBC (Byte-aligned Bitmap Compression) * WAH (Word Aligned Hybrid) * PLWAH (Position List Word-Aligned Hybrid) * EWAH (Enhanced Word-Aligned Hybrid ) * CONCISE (COmpressed N Composable Integer SEt) * PWAH (Partitioned Word-Aligned Hybrid) * COMPAX (COMPressed Adaptive indeX ) * SBH （Super byte-aligned hybrid） ** 研究论文 ~~[1] Antoshekov G. Byte-aligned bitmap compression. Proc of the Conf on Data Compression. Piscataway, NJ: IEEE, 1994: 363- 098, 1994.~~ [2] Wu, Kesheng, Ekow J. Otoo, and Arie Shoshani. "Compressing bitmap indexes for faster search operations." In Scientific and Statistical Database Management, 2002. Proceedings. 14th International Conference on, pp. 99-108. IEEE, 2002. ~~[3] Wu, K., Otoo, E. J., and Shoshani, A. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems (TODS), 31(1), 1-38, 2006.~~ ~~[4] Stabno, Michał, and Robert Wrembel. "RLH: Bitmap compression technique based on run-length and Huffman encoding." Information Systems 34, no. 4, pp.400-414, 2009.~~ [5] F. Fusco, M. P. Stoecklin, and M. Vlachos, “Net-fli: on-the-fly compression, archiving and indexing of streaming network traffic,” Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1382–1393, 2010. [6] F. Deli`ege, T. B. Pedersen. Position list word aligned hybrid: optimizing space and performance for compressed bitmaps. In Proceeding of the 13th International Conference on Extending Database Technology, 2010. ~~[7] F. Corrales, D. Chiu, and J. Sawin, “Variable Length Compression for Bitmap Indexes,” DEXA’11, Springer-Verlag, pp.381-395, 2011.~~ ~~[8] D. Lemire et al., “Sorting improves word-aligned bitmap indexes,” Data & Knowledge Engineering, 69(1), pp.3-28, 2010.~~ ~~[9] S. J. van Schaik et al., “A memory efficient reachability data structure through bit vector compression,” SIGMOD, pp.913-924, 2011.~~ ~~[10] S. Chambi, D. Lemire, O. Kaser, and R. Godin, “Better bitmap performance with roaring bitmaps,” Software: Practice and Experience, 2015.~~ [11] A. Schmidt, D. Kimmig, and M. Beine, “DFWAH: A Proposal of a New Compression Scheme of Medium-Sparse Bitmaps,” in the Third International Conference on Advances in Databases, Knowledge, and Data Applications (DBKDA 2011), pp. 192-195. ~~[12] R. Slechta, J. Sawin, B. McCamish, D. Chiu and G. Canahuate, A tunable compression framework for bitmap indices, In Data Engineering (ICDE’ 2014), pp. 484-495. IEEE.~~ ~~[13] Kim, Sangchul, Junhee Lee, Srinivasa Rao Satti, and Bongki Moon. "SBH: Super byte-aligned hybrid bitmap compression." Information Systems 62 (2016): 155-168.~~ **研究专利 ~~[1] BBC, Byte aligned data compression, DEC/Oracle, http://www.google.com/patents/US5363098.~~ ~~[2] WAH, Word aligned bitmap compression method, data structure, and apparatus, UC Berkeley, http://www.google.com/patents/US6831575.~~ ~~[3] COMPAX, Network analysis, IBM, http://www.google.com/patents/US20120054160.~~ ~~[4] PLWAH, Algorhyme, http://www.google.com/patents/US9236881.~~ ~~== 混合结构（Hybrid） ==~~ ~~融合几种独立结构的混合结构。~~ ~~=== Roaring ===~~ * Roaring Bitmap ~~Roaring Bitmap 是一种为索引（Indexes）的运算(与，或，非)而优化的混合数据结构。~~ Roaring 数据结构内部采用位图Bitmap，整数数组Arrays和游程编码Runs的表示形式，三种表示形式是平等的。当数据以其中某种表示结构的表示时，整体空间为最优，Roaring则采用该表示结构。也就是说，Roaring采用空间最优化的数据表示结构。 ~~Roaring表示的索引之间可以进行高效的运算。运算时，无非也是位图Bitmap，整数数组Arrays和游程编码Runs这三种结构之间的运算。~~ ~~所有，Roaring是兼顾空间与时间效率的一种有效的索引结构。~~ ~~Roaring网址：http://roaringbitmap.org/~~ ~~Roaring的Java实现 [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring]~~ ~~Roaring的C/C++实现 [https://github.com/RoaringBitmap/CRoaring CRoaring]~~ * Roaring 研究论文：~~# Lemire, Daniel, Gregory Ssi‐Yan‐Kai, and Owen Kaser. "Consistently faster and smaller compressed bitmaps with Roaring." Software: Practice and Experience 46.11 (2016): 1547-1569.# Samy Chambi, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.~~ ~~===其它混合结构===~~ ~~# Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.~~ ~~# Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." HPDC 2013.~~ **研究论文： ~~#Juha Kärkkäinen, Dominik Kempa, Simon J. Puglisi. Hybrid Compression of Bitvectors for the FM-Index. In Proc. 2014 Data Compression Conference (DCC 2014), pp. 302-311, 2014.#Gagie, Travis, Gonzalo Navarro, and Simon J. Puglisi. 2012. “New algorithms on wavelet trees and applications to information retrieval.” Theoretical Computer Science 426: 25–41.~~ ~~== 其它结构（Others） ==~~ ~~=== SDSL - Succinct Data Structure Library ===~~ *[https://github.com/simongog/sdsl-lite Succinct Data Structure Library] *[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]** Bitvectors supporting Rank and Select** Integer Vectors** Wavelet Trees** Compressed Suffix Arrays (CSA)** Balanced Parentheses Representations** Longest Common Prefix (LCP) Arrays** Compressed Suffix Trees (CST)** Range Minimum/Maximum Query (RMQ) Structures ~~=索引（大数据）=~~ ~~大数据系统的数据管理，采用位图索引、树索引等。位图索引在数据库、版本控制系统都有应用。~~ ~~==Git软件的对象计数问题==~~ ~~对象计数问题（counting objects）是git软件进行git clone时，需要实时计算出需要克隆的对象总数。~~ ~~Git的对象（object）就是文件，包括快照对象（Commit），目录对象（Directory）和文件对象（File）。~~ ~~"对象计数"就是清点各种commit、目录、文件等。git clone和git fetch操作都需要清点对象，因为需要知道，到底下载哪些对象文件。~~ ~~对象计数的原始算法如下。#列出本地所有分支最新的一个commit#列出远程所有分支最新的一个commit#两者进行比较，只要有不同，就意味着分支发生变动#每一个发生变动的commit，都清点其中具体变动的子目录和文件#追溯到当前commit的父节点，重复第四步，直至本地与远程的历史一致为止#加总所有需要变动的对象~~ ~~==对象计数的位图索引方法==Git软件为为每一个commit生成一个二进制值，即Bitmap索引。~~ 打开本地目录：.git/objects/pack/，其中是一个索引文件和一个数据文件，它们就是Bitmap。这两个文件索引了当前代码库的所有对象，然后使用一个二进制值代表这些对象。有多少个对象，这个二进制值就有多少位。它的第n位，就代表数据文件里面的第n个对象。 ~~每个commit都会有一个对应的二进制值，表示当前快照包含的所有对象。这些对象对应的二进制位都为1，其他二进制位都为0。~~ ~~采用位图索引的优点是：#不用读取commit对象，只要读取这个二进制值，就会知道当前commit包含了哪些节点。#两个二进制值只要做一次XOR运算，就会知道哪些位（即哪些对象）发生了变动。#新的对象总是添加到现有二进制位的后面，所以只要读取多出来的那些位，就知道当前commit比上一次commit多出了哪些对象。~~ ~~因为采用位图索引，"对象计数"就变成了位图的二进制值的比较运算和计数运算，因此速度极快。~~ ~~==参考链接==http://www.ruanyifeng.com/blog/2015/09/git-bitmap.html~~ ~~https://github.com/gitster/git/commit/fff4275~~ ~~https://github.com/gitster/git/blob/master/Documentation/technical/bitmap-format.txt~~ ~~= 索引（搜索引擎） =~~ ~~[[搜索引擎]]（Search Engine）是一种典型的大数据系统，是实现信息检索（Information Retrieval）的软件。~~ 经典的搜索引擎主要是针对爬取的网页文本的检索。一个网页文本或文档，是由许多的单词组成的。其中每个单词可以在同一个文档中重复出现很多次，同一个单词也可以出现在不同的文档中。 ~~==反向索引（Inverted Index）的原理==~~ 为了实现快速的搜索响应，搜索引擎采用反向索引（Inverted Index）的数据结构。反向索引在信息检索中发挥重要作用。这里介绍一下前向索引（forward index）和反向索引（Inverted Index）的概念。前向索引(forward index)：从文档角度看其中的单词，表示每个文档（用文档ID标识）都含有哪些单词，以及每个单词出现了多少次（词频）及其出现位置（相对于文档首部的偏移量）。反向索引(inverted index 或inverted files)：从单词角度看文档，标识每个单词分别在那些文档中出现(文档ID)，以及在各自的文档中每个单词分别出现了多少次（词频）及其出现位置（相对于该文档首部的偏移量）。 ~~<big>直观表示</big>：~~ ~~前向索引（一对多映射）：文档 ---> 单词~~ ~~反向索引（一对多映射）：单词 ---> 文档~~ ~~前向索引，又称为正排索引；反向索引，又称为倒排索引。但是正排和倒排索引的称谓，会造成误解。~~ ~~==反向索引（Inverted Index）的实现==~~ ~~反向索引的功能是将关键字（keyword）映射到文档（document）。~~ ~~在反向索引中，每个关键词对应一个反向链表（Inverted List），记录了该关键词出现的所有文档的编号。~~ ~~反向索引在实际实现中，可以采用位图（Bitmap）与整数数组/链表（Integer Array/List）两种结构形式。~~ **研究论文： ~~# Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.~~ ~~==反向索引（Inverted Index）的运算==~~ ~~反向索引上的最重要的运算是集合交（Conjunction），并（Disjunction）和非（Negation）。~~ ~~反向索引上的交，并和非运算，对应的整数链表的实现，其操作是Intersection/Unions操作，对应位图的实现，其操作运算则是比特AND，OR，NOT操作。~~ **研究论文： ~~# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.~~ ~~==反向索引的参考实现==~~ ~~[https://lucene.apache.org/core/ Lucene]~~ ~~[https://pypi.python.org/pypi/Whoosh Whoosh]~~ ~~[http://www.opensearchserver.com OpenSearchServer]~~ ~~=大数据系统-科研=~~ ~~[[大数据系统-科研~~索引]]

Zhenchen

行政员、管理员

6,105

个编辑