A³ Lab

Scientific coordinator: Paolo Ferragina

The last thirty years have witnessed an upsurging interest towards the design of algorithms and data structures that could scale with the deluge of data produced from an ever-growing number of sources: from the Web and DNA sequencing of the 90s, to multimedia platforms and Social Networks of the 00s, eventually getting to Internet of Things and personal datastores of the last decade.

To cope with these massive datasets, the algorithmic community introduced the concept of Compressed Data Structures. Compressed data structures lie at the intersection of Data Structures and Information Theory. They compress data within a space matching the theoretical lower bound and still efficiently support operations without decompressing the whole data.

Among the results of this research group, the FM-index deserves a special mention. The FM-index combines compression and full-text indexing: it encapsulates a compressed version of the original file, and it allows to search for arbitrary patterns by looking only at a small portion of the compressed file. It was successfully applied to sequence analysis in bioinformatics, search and retrieval on oriental and agglutinating languages, multimedia streams, and even structured and traditional database scenarios. The FM-index also plays a key algorithmic role in the two most famous genome-alignment tools, namely Bowtie and BWA, which are used by bio-researchers all over the world as witnessed by their 32k+ citations on Google Scholar.

Software & Datasets

PGM-index

An optimal learned data structure that enables fast point and range searches in arrays of billions of items using orders of magnitude less space than traditional indexes.

GitHub • Website

Block-ε tree

Compressed rank/select dictionary exploiting approximate linearity and repetitiveness.

GitHub

LA-vector

Compressed learned bitvector supporting efficient rank and select queries.

GitHub

CoCo-trie

A data-aware, robust, and flexible compressed string dictionary.

GitHub

Pizza & Chili

Datasets for compressed indexes and test collections bechmarking.

GitHub • Website

FM-index v2

A full-text index data structure that combines compression and indexing by encapsulating in a single compressed file both the original file plus some indexing information.

Website

Green lossless SpMV

Lossless-compressed matrix formats for space-, time-, and energy-efficient sparse matrix-vector multiplications (SpMV).

GitHub

Results of our research had also been implemented in:

SDSL, the Succinct Data Structure Library.
Bowtie, an ultrafast, memory-efficient short DNA sequences aligner.
FM-index, an FM-index implementation using RRR Wavelet trees and fast suffix sorting by Matthias Petri.

Selected Publications

Antonio Boffa, Paolo Ferragina, Francesco Tosoni, Giorgio Vinciguerra. Compressed string dictionaries via data-aware subtrie compaction. SPIRE 2022
Paolo Ferragina, Giovanni Manzini, Giorgio Vinciguerra. Repetition- and linearity-aware rank/select dictionaries. ISAAC 2021
Paolo Ferragina, Fabrizio Lillo, Giorgio Vinciguerra. On the performance of learned data structures. Theor. Comput. Sci. 871: 107-120 (2021)
Antonio Boffa, Paolo Ferragina, Giorgio Vinciguerra: A “learned” approach to quicken and compress rank/select dictionaries. ALENEX 2021
Paolo Ferragina, Fabrizio Lillo, Giorgio Vinciguerra: Why are learned indexes so effective?. ICML 2020
Paolo Ferragina, Giorgio Vinciguerra: The PGM-index: a fully-dynamic compressed learned index with provable worst-case bounds. PVLDB 13(8): 1162-1175 (2020)
Paolo Ferragina, Rossano Venturini: Compressed Cache-Oblivious String B-Tree. ACM Trans. Algorithms 12(4): 52:1-52:17 (2016)
Paolo Ferragina, Travis Gagie, Giovanni Manzini: Lightweight Data Indexing and Compression in External Memory. Algorithmica 63(3): 707-730 (2012)
Paolo Ferragina, Jouni Sirén, Rossano Venturini: Distribution-Aware Compressed Full-Text Indexes. ESA 2011: 760-771
Paolo Ferragina, Rossano Venturini: The compressed permuterm index. ACM Trans. Algorithms 7(1): 10:1-10:21 (2010)
Paolo Ferragina, Fabrizio Luccio, Giovanni Manzini, S. Muthukrishnan: Compressing and indexing labeled trees, with applications. J. ACM 57(1): 4:1-4:33 (2009)
Paolo Ferragina, Rodrigo González, Gonzalo Navarro, Rossano Venturini: Compressed text indexes: from theory to practice. ACM Journal of Experimental Algorithmics 13 (2008)
Paolo Ferragina, Roberto Grossi, Ankur Gupta, Rahul Shah, Jeffrey Scott Vitter: On searching compressed string collections cache-obliviously. PODS 2008: 181-190
Paolo Ferragina, Rossano Venturini: A simple storage scheme for strings achieving entropy bounds. Theor. Comput. Sci. 372(1): 115-121 (2007)
Paolo Ferragina, Giovanni Manzini, Veli Mäkinen, Gonzalo Navarro: Compressed representations of sequences and full-text indexes. ACM Trans. Algorithms 3(2): 20 (2007)
Paolo Ferragina, Giovanni Manzini: Indexing compressed text. J. ACM 52(4): 552-581 (2005)
Paolo Ferragina, Giovanni Manzini: Opportunistic data structures with applications. FOCS 2000: 390-398

Compressed and Learned Data Structures

Software & Datasets

PGM-index

Block-ε tree

LA-vector

CoCo-trie

Pizza & Chili

FM-index v2

Green lossless SpMV

Selected Publications