大数据索引
来自iCenter Wiki
目录
大数据索引
索引是加快查找的数据结构(Data Structure),主要有哈希表(Hash Table),树(Tree)和倒排索引(Inverted Index)。
- Thomas H. Cormen, introduction to algorithms, third edition, MIT press, 2009.
- Tardos, Eva, and Jon Kleinberg. "Algorithm Design." (2006).
Hash Table
布隆过滤器(BloomFilter)
- B. H. Bloom,Space/time trade-offs in hash coding with allowable errors, Commun. ACM, Volume 13 Issue 7, Pages 422-426, July 1970.
- Crainiceanu, Adina, and Daniel Lemire. "Bloofi: Multidimensional Bloom Filters." Information Systems 54 (2015): 311-324.
Tree Index
- Guttman, Antonin. R-trees: a dynamic index structure for spatial searching. Vol. 14, no. 2. ACM, 1984.
- Sellis, Timos, Nick Roussopoulos, and Christos Faloutsos. "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects." VLDB, 1987.
Inverted Index
倒排索引(Inverted Index)是搜索引擎使用的数据结构。
倒排索引将关键字(keyword)映射到文档(document),在信息检索(Information Retrieval)中发挥重要作用。
在倒排索引中,每个关键词对应一个倒排链表(Inverted List),记录了该关键词出现的所有文档的编号。
- 倒排索引上的最重要的运算是集合交(Conjunction),并(Disjunction)和非(Negation)。
- 倒排索引在实际实现中,可以采用位图(Bitmap)与整数链表(Integer List)两种结构形式。
- 倒排索引上的交,并和非运算,对应的整数链表操作是Intersection/Unions操作,对应位图是比特AND,OR,NOT操作。
倒排索引实现: Lucene
Bitmap Index
Roaring Bitmap Java-Roaring CRoaring
- Chambi, Samy, Daniel Lemire, Owen Kaser, and Robert Godin. "Better bitmap performance with Roaring bitmaps." Software: practice and experience, 2015.
- Vallentin, Matthias, Vern Paxson, and Robin Sommer. "VAST: a unified platform for interactive network forensics." 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), 2016.
- Vallentin, Matthias. Scalable Network Forensics. Diss. University of California, Berkeley, 2016.
- Chenxing Li et al., BAH: A Bitmap Index Compression Algorithm for Fast Data Retrieval, LCN 2016.
Integer List
Data Structures for Inverted Indexes (ds2i)
- Culpepper, J. Shane, and Alistair Moffat. "Efficient set intersection for inverted indexing." ACM Transactions on Information Systems (TOIS), 2010.
- Schlegel, Benjamin, Thomas Willhalm, and Wolfgang Lehner. Fast Sorted-Set Intersection using SIMD Instructions, ADMS 2011.
- Inoue, Hiroshi, Moriyoshi Ohara, and Kenjiro Taura, Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions, VLDB 2014.
- Kane, Andrew, and Frank Wm Tompa, Skewed Partial Bitvectors for List Intersection, SIGIR 2014.
- Giuseppe Ottaviano, Nicola Tonellotto, Rossano Venturini, Optimal Space-Time Tradeoffs for Inverted Indexes, ACM WSDM 2015.
- Lakshminarasimhan, Sriram, et al. "Scalable in situ scientific data encoding for analytical query processing." Proceedings of the 22nd international symposium on High-performance parallel and distributed computing. ACM, 2013.
混合结构
融合几种独立结构的混合结构
- Athanassoulis, Manos, and Anastasia Ailamaki. "BF-tree: approximate tree indexing." Proceedings of the VLDB Endowment 7, no. 14 (2014): 1881-1892.
其它结构
Succinct Data Structure Library
List of Implemented Data Structures
- Bitvectors supporting Rank and Select
- Integer Vectors
- Wavelet Trees
- Compressed Suffix Arrays (CSA)
- Balanced Parentheses Representations
- Longest Common Prefix (LCP) Arrays
- Compressed Suffix Trees (CST)
- Range Minimum/Maximum Query (RMQ) Structures
- Navarro, Gonzalo, and Eliana Providel. 2012. “Fast, Small, Simple Rank/Select on Bitmaps.” In Proceedings of the 11th International Symposium on Experimental Algorithms (SEA 2013), 295–306.