Figurit Homepage

Compressed and Learned Data Structures

Scientific coordinator: Paolo Ferragina

The last thirty years have witnessed an upsurging interest towards the design of algorithms and data structures that could scale with the deluge of data produced from an ever-growing number of sources: from the Web and DNA sequencing of the 90s, to multimedia platforms and Social Networks of the 00s, eventually getting to Internet of Things and personal datastores of the last decade.

To cope with these massive datasets, the algorithmic community introduced the concept of Compressed Data Structures. Compressed data structures lie at the intersection of Data Structures and Information Theory. They compress data within a space matching the theoretical lower bound and still efficiently support operations without decompressing the whole data.

Among the results of this research group, the FM-index deserves a special mention. The FM-index combines compression and full-text indexing: it encapsulates a compressed version of the original file, and it allows to search for arbitrary patterns by looking only at a small portion of the compressed file. It was successfully applied to sequence analysis in bioinformatics, search and retrieval on oriental and agglutinating languages, multimedia streams, and even structured and traditional database scenarios. The FM-index also plays a key algorithmic role in the two most famous genome-alignment tools, namely Bowtie and BWA, which are used by bio-researchers all over the world as witnessed by their 32k+ citations on Google Scholar.

Software & Datasets

PGM-index

An optimal learned data structure that enables fast point and range searches in arrays of billions of items using orders of magnitude less space than traditional indexes.

GitHubWebsite

Block-ε tree

Compressed rank/select dictionary exploiting approximate linearity and repetitiveness.

GitHub

LA-vector

Compressed learned bitvector supporting efficient rank and select queries.

GitHub

CoCo-trie

A data-aware, robust, and flexible compressed string dictionary.

GitHub

Pizza & Chili

Datasets for compressed indexes and test collections bechmarking.

GitHubWebsite

FM-index v2

A full-text index data structure that combines compression and indexing by encapsulating in a single compressed file both the original file plus some indexing information.

Website

Results of our research had also been implemented in:

  • SDSL, the Succinct Data Structure Library.
  • Bowtie, an ultrafast, memory-efficient short DNA sequences aligner.
  • FM-index, an FM-index implementation using RRR Wavelet trees and fast suffix sorting by Matthias Petri.

Selected Publications