Figurit Homepage

Compressed Data Structures

The last thirty years have witnessed an upsurging interest towards the design of algorithms and data structures that could scale with the deluge of data produced from an ever-growing number of sources: from the Web and DNA sequencing of the 90s, to multimedia platforms and Social Networks of the 00s, eventually getting to Internet of Things and personal datastores of the last decade.

To cope with these massive datasets, the algorithmic community introduced the concept of Compressed Data Structures. Compressed data structures lie at the intersection of Data Structures and Information Theory. They compress data within a space matching the theoretical lower bound and still efficiently support operations without decompressing the whole data.

Among the results of this research group, the FM-index deserves a special mention. The FM-index combines compression and full-text indexing: it encapsulates a compressed version of the original file, and it allows to search for arbitrary patterns by looking only at a small portion of the compressed file. It was successfully applied to sequence analysis in bioinformatics, search and retrieval on oriental and agglutinating languages, multimedia streams, and even structured and traditional database scenarios. The FM-index also plays a key algorithmic role in the two most famous genome-alignment tools, namely Bowtie and BWA, which are used by bio-researchers all over the world as witnessed by their 32k+ citations on Google Scholar.

Software & Datasets

Photo of Pizza & Chili

Pizza & Chili

Datasets for compressed indexes and test collections bechmarking

GitHubWebsite
Photo of Pizza & Chili

FM-index v2

Website

Results of our research had also been implemented in:

  • SDSL, the Succinct Data Structure Library.
  • Bowtie, an ultrafast, memory-efficient short DNA sequences aligner.
  • FM-index, an FM-index implementation using RRR Wavelet trees and fast suffix sorting by Matthias Petri.

Selected Publications