May 09, 2016

The dataset contains four collections of files: three collections of genomes, each belonging to a distinct species, and a set of three 32-bit integer arrays. In particular:

  • Cere: collection of 39 strains of the yeast Saccharomyces cerevisiae;
  • E. Coli: collection of 33 strains of the bacterium Escherichia coli;
  • Para: collection of 36 strains of the yeast Saccharomyces paradoxus;
  • DLCP: Differential Longest Common Prefix arrays computed by the Relative-FM data structure from a set of three human genomes.

These files are formatted as follows:

  • Cere, E. Coli, Para: plain-text (ASCII) files, each a sequence of characters drawn from the alphabet ACTGN.
  • DLCP: binary files, each a sequence of signed 32-bit integers in little-endian byte order (as obtained by dumping an array of int32_t into a file with a single fwrite on a little-endian machine, such as any x86 system).
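The two formats above can be read with a few lines of Python. The following is only a minimal sketch, not part of the dataset's tooling: the function names (iter_dlcp, symbol_counts, unexpected_symbols) and the chunk sizes are illustrative choices made here, and the sketch assumes nothing about the files beyond the description above. In particular, whether the genome files contain line breaks is not stated, so the code reports any character outside ACTGN instead of guessing.

```python
import struct
from collections import Counter

# Alphabet stated in the format description.
ALPHABET = set("ACTGN")


def iter_dlcp(path, chunk_values=1 << 16):
    """Yield the signed 32-bit integers stored in a DLCP file, one by one.

    The file is read in chunks so that large arrays need not fit in memory.
    The "<i" format forces little-endian decoding on any host.
    """
    chunk_bytes = 4 * chunk_values
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            if len(chunk) % 4 != 0:
                raise ValueError("file size is not a multiple of 4 bytes")
            for (value,) in struct.iter_unpack("<i", chunk):
                yield value


def symbol_counts(path, chunk_size=1 << 20):
    """Count how often each character occurs in an ASCII genome file."""
    counts = Counter()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            counts.update(chunk.decode("ascii"))
    return counts


def unexpected_symbols(counts):
    """Return the characters in `counts` that fall outside ACTGN."""
    return set(counts) - ALPHABET
```

For example, sum(1 for _ in iter_dlcp(path)) gives the number of entries in a DLCP file, and unexpected_symbols(symbol_counts(path)) shows whether a genome file contains anything besides the five expected characters (such as line breaks). Here path stands for whatever file name the extracted archive actually contains.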

The dataset (gzipped tar file, ~7.5GB) can be downloaded here.