The last thirty years have witnessed an upsurging interest towards the design of algorithms and data structures that could scale with the deluge of data produced from an ever-growing number of sources: from the Web and DNA sequencing of the 90s, to multimedia platforms and Social Networks of the 00s, eventually getting to Internet of Things and personal datastores of the last decade.
To cope with these massive datasets, the algorithmic community introduced the concept of Compressed Data Structures. Compressed data structures lie at the intersection of Data Structures and Information Theory. They compress data within a space matching the theoretical lower bound and still efficiently support operations without decompressing the whole data.
Among the results of this research group, the FM-index deserves a special mention. The FM-index combines compression and full-text indexing: it encapsulates a compressed version of the original file, and it allows to search for arbitrary patterns by looking only at a small portion of the compressed file. It was successfully applied to sequence analysis in bioinformatics, search and retrieval on oriental and agglutinating languages, multimedia streams, and even structured and traditional database scenarios. The FM-index also plays a key algorithmic role in the two most famous genome-alignment tools, namely Bowtie and BWA, which are used by bio-researchers all over the world as witnessed by their 32k+ citations on Google Scholar.
Software & Datasets
Results of our research had also been implemented in:
- SDSL, the Succinct Data Structure Library.
- Bowtie, an ultrafast, memory-efficient short DNA sequences aligner.
- FM-index, an FM-index implementation using RRR Wavelet trees and fast suffix sorting by Matthias Petri.
Paolo Ferragina, Rossano Venturini: Compressed Cache-Oblivious String B-Tree. ACM Trans. Algorithms 12(4): 52:1-52:17 (2016)
Paolo Ferragina, Travis Gagie, Giovanni Manzini: Lightweight Data Indexing and Compression in External Memory. Algorithmica 63(3): 707-730 (2012)
Paolo Ferragina, Jouni Sirén, Rossano Venturini: Distribution-Aware Compressed Full-Text Indexes. ESA 2011: 760-771
Paolo Ferragina, Rossano Venturini: The compressed permuterm index. ACM Trans. Algorithms 7(1): 10:1-10:21 (2010)
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, S. Muthukrishnan: Compressing and indexing labeled trees, with applications. J. ACM 57(1): 4:1-4:33 (2009)
Paolo Ferragina, Rodrigo González, Gonzalo Navarro, Rossano Venturini: Compressed text indexes: From theory to practice. ACM Journal of Experimental Algorithmics 13 (2008)
Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey Scott Vitter: On searching compressed string collections cache-obliviously. PODS 2008: 181-190
Paolo Ferragina, Rossano Venturini: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1): 115-121 (2007)
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, Gonzalo Navarro: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2): 20 (2007)
Paolo Ferragina, Giovanni Manzini: Indexing compressed text. J. ACM 52(4): 552-581 (2005)
Paolo Ferragina, Giovanni Manzini: Opportunistic Data Structures with Applications. FOCS 2000: 390-398