|
|
(相同用户的172个中间修订版本未显示) |
第1行: |
第1行: |
− | == 大数据索引 == | + | =索引子系统= |
| | | |
− | 索引(Index)是加快查找的数据结构(Data Structure),主要有哈希表(Hash Table),树(Tree)、整数链表(Integer List)和位图(Bitmap)。
| + | 索引子系统(Index-subsystem)是[[大数据系统]]的核心组件(Component)。 |
| | | |
− | # Thomas H. Cormen, introduction to algorithms, third edition, MIT press, 2009.
| + | NoSQL引擎主要组件是索引,或者是面向特定应用的索引结构。 |
− | # Eva Tardos and Jon Kleinberg, Algorithm Design, Pearson, 2006.
| + | |
| | | |
− | ==索引的形式==
| + | "NoSQL engines are (basically) just indexes!" |
− | 索引的形式包括哈希表,树、整数链表和位图。
| + | |
| | | |
− | === 哈希表(Hash Table) ===
| |
| | | |
− | *[http://billmill.org/bloomfilter-tutorial BloomFilter]
| + | 参见[[索引]] |
− | *[https://github.com/magnuss/java-bloomfilter Java-Bloomfilter]
| + | |
− | *[https://github.com/lemire/bloofi Bloofi]
| + | |
− | | + | |
− | # B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
| + | |
− | # Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.
| + | |
− | | + | |
− | === 树(Tree) ===
| + | |
− | | + | |
− | * [https://en.wikipedia.org/wiki/B-tree B-tree]
| + | |
− | * [https://en.wikipedia.org/wiki/K-d_tree ''k''-d tree]
| + | |
− | * [https://en.wikipedia.org/wiki/R-tree R-tree]
| + | |
− | * [https://en.wikipedia.org/wiki/R%2B_tree R+ tree]
| + | |
− | | + | |
− | # Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
| + | |
− | # Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.
| + | |
− | | + | |
− | | + | |
− | ==整数链表(Integer List)==
| + | |
− | | + | |
− | ==== Bitmap Index ====
| + | |
− | | + | |
− | * Roaring Bitmap [https://github.com/RoaringBitmap/RoaringBitmap Java-Roaring] [https://github.com/RoaringBitmap/CRoaring CRoaring]
| + | |
− | * [http://gitlab.icenter.tsinghua.edu.cn/zhenchen/BAH BAH]
| + | |
− | | + | |
− | # Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
| + | |
− | # Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
| + | |
− | # Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
| + | |
− | # Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.
| + | |
− | | + | |
− | ==== Integer List ====
| + | |
− | | + | |
− | * Data Structures for Inverted Indexes ([https://github.com/ot/ds2i ds2i])
| + | |
− | * [http://github.com/ot/partitioned_elias_fano Partitioned Elias-Fano Index]
| + | |
− | * [https://github.com/lemire/FastPFor FastPFor]
| + | |
− | | + | |
− | # Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
| + | |
− | # Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
| + | |
− | # Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
| + | |
− | # Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
| + | |
− | # Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
| + | |
− | # Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
| + | |
− | | + | |
− | === 混合结构 ===
| + | |
− | | + | |
− | 融合几种独立结构的混合结构
| + | |
− | | + | |
− | # Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.
| + | |
− | | + | |
− | === 其它结构 ===
| + | |
− | | + | |
− | *[https://github.com/simongog/sdsl-lite Succinct Data Structure Library]
| + | |
− | | + | |
− | *[https://github.com/simongog/sdsl-lite/wiki/List-of-Implemented-Data-Structures List of Implemented Data Structures]
| + | |
− | ** Bitvectors supporting Rank and Select
| + | |
− | ** Integer Vectors
| + | |
− | ** Wavelet Trees
| + | |
− | ** Compressed Suffix Arrays (CSA)
| + | |
− | ** Balanced Parentheses Representations
| + | |
− | ** Longest Common Prefix (LCP) Arrays
| + | |
− | ** Compressed Suffix Trees (CST)
| + | |
− | ** Range Minimum/Maximum Query (RMQ) Structures
| + | |
− | | + | |
− | # Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.
| + | |
− | | + | |
− | === 搜索引擎的索引结构 ===
| + | |
− | | + | |
− | 搜索引擎(Search Engine)是一种典型的大数据系统,是实现信息检索(Information Retrieval)的软件。
| + | |
− | | + | |
− | 经典的搜索引擎主要是针对爬取的网页文本的检索。 一个网页文本或文档,是由许多的单词组成的。其中每个单词可以在同一个文档中重复出现很多次,同一个单词也可以出现在不同的文档中。
| + | |
− | | + | |
− | 为了实现快速的搜索响应,搜索引擎采用反向索引(Inverted Index)的数据结构。反向索引在信息检索中发挥重要作用。这里介绍一下前向索引(forward index)和反向索引(Inverted Index)的概念。
| + | |
− | | + | |
− | 前向索引(forward index):从文档角度看其中的单词,表示每个文档(用文档ID标识)都含有哪些单词,以及每个单词出现了多少次(词频)及其出现位置(相对于文档首部的偏移量)。
| + | |
− | | + | |
− | 反向索引(inverted index 或inverted files):从单词角度看文档,标识每个单词分别在那些文档中出现(文档ID),以及在各自的文档中每个单词分别出现了多少次(词频)及其出现位置(相对于该文档首部的偏移量)。
| + | |
− | | + | |
− | 前向索引(一对多映射):文档 ---> 单词
| + | |
− |
| + | |
− | 方向索引(一对多映射):单词 ---> 文档
| + | |
− | | + | |
− | 前向索引,又称为正排索引,反向索引,又称为倒排索引。但是正排和倒排索引的称谓,会造成误解。
| + | |
− | | + | |
− | 反向索引的功能是将关键字(keyword)映射到文档(document)。
| + | |
− | | + | |
− | 在倒排索引中,每个关键词对应一个倒排链表(Inverted List),记录了该关键词出现的所有文档的编号。
| + | |
− | | + | |
− | 倒排索引上的最重要的运算是集合交(Conjunction),并(Disjunction)和非(Negation)。
| + | |
− | | + | |
− | * 倒排索引在实际实现中,可以采用位图(Bitmap)与整数链表(Integer List)两种结构形式。
| + | |
− | * 倒排索引上的交,并和非运算,对应的整数链表的操作是Intersection/Unions操作,对应位图的运算则是比特AND,OR,NOT操作。
| + | |
− | | + | |
− | 倒排索引实现:
| + | |
− | | + | |
− | [https://lucene.apache.org/core/ Lucene]
| + | |
− | | + | |
− | [https://pypi.python.org/pypi/Whoosh Whoosh]
| + | |