更改 - iCenter Wiki

大数据索引

删除6,448字节、2019年6月10日 (一) 09:35

/* 索引子系统 */

= ~~大数据索引~~ 索引子系统=

~~索引（Index）是加快查找的数据结构（Data Structure），主要有哈希表（Hash Table）、树（Tree）、整数链表（Integer List）和位图（Bitmap）。~~索引子系统（Index-subsystem）是[[大数据系统]]的核心组件（Component）。

~~参考数据结构书中的搜索章节。~~NoSQL引擎主要组件是索引，或者是面向特定应用的索引结构。

~~# Thomas H. Cormen, introduction to algorithms, third edition, MIT press, 2009.# Eva Tardos and Jon Kleinberg, Algorithm Design, Pearson, 2006.~~"NoSQL engines are (basically) just indexes!"

~~=索引的形式=~~

~~索引的形式包括哈希表，树、整数链表和位图。~~

~~== 哈希表（Hash Table） ==~~ *参见[~~http://billmill.org/bloomfilter-tutorial BloomFilter]~~*[~~https://github.com/magnuss/java-bloomfilter Java-Bloomfilter~~索引]*[https://github.com/lemire/bloofi Bloofi] ~~# B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.# Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.~~ ~~== 树（Tree） ==~~ * [https://en.wikipedia.org/wiki/B-tree B-tree]* [https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]* [https://en.wikipedia.org/wiki/R-tree R-tree]* [https://en.wikipedia.org/wiki/R%2B_tree R+ tree] ~~# Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.# Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.~~ ~~== 整数链表（Integer List） ==Verbatim Integer List 带来的空间开销，人们发明了Integer List Compression机制。~~ * Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])* [http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano Index]* [https://github.com/lemire/FastPFor FastPFor] ~~# Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.# Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.# Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.# Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.# Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.~~# Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013. ~~== 位图（Bitmap） ==Verbatim Bitmap 带来的空间开销，人们发明了Bitmap Compression机制。~~ * Roaring Bitmap [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring] [https://github.com/RoaringBitmap/CRoaring CRoaring]* [http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH] ~~# Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.~~# Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.~~# Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.# Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.~~ ~~== 混合结构（Hybrid） ==~~ ~~融合几种独立结构的混合结构~~ ~~# Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.~~ ~~== 其它结构 ==~~ *[https://github.com/simongog/sdsl-lite Succinct Data Structure Library] *[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]** Bitvectors supporting Rank and Select** Integer Vectors** Wavelet Trees** Compressed Suffix Arrays (CSA)** Balanced Parentheses Representations** Longest Common Prefix (LCP) Arrays** Compressed Suffix Trees (CST)** Range Minimum/Maximum Query (RMQ) Structures ~~# Navarro, Gonzalo, and Eliana Providel. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306, 2012.~~ ~~= 搜索引擎的索引结构 =~~ ~~搜索引擎（Search Engine）是一种典型的大数据系统，是实现信息检索（Information Retrieval）的软件。~~ 经典的搜索引擎主要是针对爬取的网页文本的检索。一个网页文本或文档，是由许多的单词组成的。其中每个单词可以在同一个文档中重复出现很多次，同一个单词也可以出现在不同的文档中。 ~~==反向索引（Inverted Index）的原理==~~ 为了实现快速的搜索响应，搜索引擎采用反向索引（Inverted Index）的数据结构。反向索引在信息检索中发挥重要作用。这里介绍一下前向索引（forward index）和反向索引（Inverted Index）的概念。前向索引(forward index)：从文档角度看其中的单词，表示每个文档（用文档ID标识）都含有哪些单词，以及每个单词出现了多少次（词频）及其出现位置（相对于文档首部的偏移量）。反向索引(inverted index 或inverted files)：从单词角度看文档，标识每个单词分别在那些文档中出现(文档ID)，以及在各自的文档中每个单词分别出现了多少次（词频）及其出现位置（相对于该文档首部的偏移量）。 ~~前向索引（一对多映射）：文档 ---> 单词~~ ~~方向索引（一对多映射）：单词 ---> 文档~~ ~~前向索引，又称为正排索引，反向索引，又称为倒排索引。但是正排和倒排索引的称谓，会造成误解。~~ ~~==反向索引（Inverted Index）的实现==~~ ~~反向索引的功能是将关键字（keyword）映射到文档（document）。~~ ~~在反向索引中，每个关键词对应一个反向链表（Inverted List），记录了该关键词出现的所有文档的编号。~~ ~~反向索引在实际实现中，可以采用位图（Bitmap）与整数链表（Integer List）两种结构形式。~~ ~~==反向索引（Inverted Index）的运算==~~ ~~反向索引上的最重要的运算是集合交（Conjunction），并（Disjunction）和非（Negation）。~~ ~~反向索引上的交，并和非运算，对应的整数链表的实现，其操作是Intersection/Unions操作，对应位图的实现，其操作运算则是比特AND，OR，NOT操作。~~ ~~==反向索引的参考实现==~~ ~~[https://lucene.apache.org/core/ Lucene]~~ ~~[https://pypi.python.org/pypi/Whoosh Whoosh~~]

Zhenchen

行政员、管理员

6,105

个编辑