Bicriteria Data Compression

Bicriteria Data Compression ([ACM-SIAM SODA14], [ESA2014 (to appear)]) is a novel compression paradigm which allows the user to trade decompression time and compressed size in a principled way. In short, the tool lets you specify a bound on the decompression time (say, 800 msecs) and compresses the file in such a way that its decompression time stays below that bound while its compressed size is minimized (or vice versa).
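
In symbols (notation introduced here only for illustration, not taken from the papers): if Π denotes the set of admissible LZ77 parsings of the input file, s(π) the compressed size produced by a parsing π, and t(π) its decompression time, the tool solves

```latex
\min_{\pi \in \Pi} \; s(\pi) \qquad \text{subject to} \qquad t(\pi) \le T
```

for a user-specified time bound T, and swaps the roles of s and t in the symmetric variant.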

This work has been motivated by the advent of massive datasets and the consequent design of high-performing distributed storage systems (such as BigTable by Google), which have reignited interest in lossless data compressors that achieve an effective compression ratio together with very efficient decompression speed. This motivated software engineers to devise variants of LZ77 (such as Snappy or LZ4) injected with several software tricks that have beneficial effects on memory-access locality. However, such strategies are only heuristic: they are far from optimal and are not effective on every input. Our approach, on the other hand, plugs classical optimization techniques into the design of data compressors, obtaining a novel and powerful technology which allows the user to, say, obtain the most succinct compressed file whose decompression speed is at least X MB/s, or vice versa. For more details about the main algorithmic ideas behind our approach, have a look at our SODA paper.
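
To give a flavor of the "classical optimization techniques" at play: in the SODA paper the set of LZ77 parsings of a file is modeled as a DAG whose nodes are text positions and whose edges are admissible phrases, each weighted with both its size in bits and an estimate of its decompression time, so that the bicriteria problem above becomes a weight-constrained shortest path problem. The C++11 sketch below illustrates this view with a deliberately naive pseudo-polynomial dynamic program over a toy graph. It is not bc-zip's code (the paper's algorithm is far more sophisticated and efficient), and all names and weights here are made up for the example.

```cpp
// Illustrative sketch only: the bicriteria parsing problem viewed as a
// weight-constrained shortest path on a DAG. Nodes are text positions
// 0..n; every admissible phrase starting at position i and ending at
// position j is an edge (i, j) carrying two weights: its size in bits
// and its (estimated) decompression time.
#include <algorithm>
#include <iostream>
#include <limits>
#include <vector>

struct Edge {
    int to;    // end position of the phrase
    int bits;  // compressed size of the phrase, in bits
    int time;  // estimated decompression cost (arbitrary units)
};

// best[i][t] = minimum number of bits needed to encode the suffix starting
// at position i using at most t units of decompression time. This DP is
// pseudo-polynomial in the time bound; it is NOT the efficient algorithm
// of the SODA paper, just the simplest correct solver of the same problem.
int min_bits_under_time_bound(const std::vector<std::vector<Edge>>& graph,
                              int time_bound) {
    const int n = static_cast<int>(graph.size()) - 1;  // node n = end of text
    const int INF = std::numeric_limits<int>::max() / 2;
    std::vector<std::vector<int>> best(
        n + 1, std::vector<int>(time_bound + 1, INF));
    for (int t = 0; t <= time_bound; ++t) best[n][t] = 0;

    for (int i = n - 1; i >= 0; --i)
        for (int t = 0; t <= time_bound; ++t)
            for (const Edge& e : graph[i])
                if (e.time <= t && best[e.to][t - e.time] < INF)
                    best[i][t] = std::min(best[i][t],
                                          e.bits + best[e.to][t - e.time]);
    return best[0][time_bound];  // INF means the time bound is infeasible
}

int main() {
    // Toy graph over a 3-character text: either three literals (fast to
    // decompress, expensive in bits) or one long copy (the opposite).
    std::vector<std::vector<Edge>> g(4);
    g[0] = {{1, 9, 1}, {3, 14, 5}};
    g[1] = {{2, 9, 1}};
    g[2] = {{3, 9, 1}};
    std::cout << min_bits_under_time_bound(g, 3) << "\n";  // 27 (literals)
    std::cout << min_bits_under_time_bound(g, 5) << "\n";  // 14 (long copy)
    return 0;
}
```

On this toy instance, tightening the time bound from 5 to 3 forces the parser to abandon the single long phrase and pay more bits for three fast literals, which is exactly the trade-off the tool exposes to the user.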

Code and Dataset

To showcase the potentialities of this technique, we developed bc-zip, a C++11 implementation of the bicriteria strategy. The source code has been published on GitHub.

We performed an extensive set of experiments ([ESA2014 (to appear)]), over a dataset composed of four 1GiB-long files:

  1. Census: U.S. demographic information in tabular format (type: database);
  2. Dna: collection of families of genomes (type: highly repetitive biological data);
  3. Mingw: archive containing the whole mingw software distribution (type: mix of source code and binaries);
  4. Wikipedia: dump of English Wikipedia (type: natural language).

Click here to download the dataset (zip file, 934MB).

Performance

The following tables report a selection of our tests on a 1GB chunk of Wikipedia (natural text) and on Census (database).

[Tables: decompression time and compressed size of Wikipedia and Census under each compressor]

Both files have been compressed multiple times: lines starting with LzOpt report the performance obtained when the file is compressed bit-optimally (that is, for maximum compression), while lines starting with Bc-Zip report the performance obtained when varying the bound on the decompression time (reported in parentheses). For each compressed file we measured its decompression time as well as its compressed size.

Results show that bc-zip is very competitive when compared with leading high-performance data compressors. On the one hand, it generates parsings whose decompression time is on par with that of LZ4, but with significantly improved compression ratios. On the other hand, its compression ratios at the highest compression levels are close to those of the best compressor in our tests, namely LZMA2, with an order of magnitude faster decompression speed.

The following figures show the space/time trade-off curve obtained with bc-zip by varying the decompression time bound (blue curve), together with the single operating point of LZ4 (red dot).

[Figure: space/time trade-off curve on Census]

[Figure: space/time trade-off curve on Wikipedia]

The results clearly showcase the tool's ability to smoothly trade (decompression) time for (compressed) space, and vice versa. For instance, on Wikipedia the decompression time decreases roughly in steps of 200 msecs from the most succinct parsing (~1800 msecs) to the fastest one (~300 msecs), while the compressed size grows from approximately 191MB up to 1GiB. The behavior on Census is analogous. We underline that this ability to trade compressed space for decompression time effectively (and efficiently), in a consistent and principled way, is unmatched by current state-of-the-art data compressors.

More details about the actual implementation and an extensive experimental evaluation will be available in our upcoming ESA paper.

Got any questions?

Contact us.
