The dataset contains four collections of files: three collections of genomes, each belonging to a distinct species, and a set of three 32-bit integer arrays. In particular:
- Cere: collection of 39 strains of the yeast Saccharomyces cerevisiae;
- E. Coli: collection of 33 strains of the bacterium Escherichia coli;
- Para: collection of 36 strains of the yeast Saccharomyces paradoxus;
- DLCP: Differential Longest Common Prefix arrays computed by the Relative-FM data structure from a set of three human genomes.
These files are formatted as follows:
- Cere, E. Coli, Para: plain-text (ASCII) files, each a sequence of characters drawn from the alphabet {A, C, G, T, N}.
- DLCP: binary files, each a sequence of signed 32-bit integers in little-endian byte order (as obtained by dumping an array of int32_t to a file with a single fwrite on a little-endian machine such as x86-64).
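
As a minimal sketch of how the two formats above could be read, the following Python snippet decodes a DLCP file into integers and tallies the symbols of a genome file. The function names `read_dlcp` and `count_symbols` are hypothetical and not part of the dataset. The sketch assumes a DLCP file has no header and fits in memory, and it makes no assumption about whether the text files contain line breaks: any byte outside the alphabet simply shows up in the tally.

```python
import struct
from collections import Counter


def read_dlcp(path):
    """Decode a file of little-endian signed 32-bit integers into a list.

    Assumption: the file is a raw dump with no header or trailer.
    """
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 4 != 0:
        raise ValueError(f"{path}: size {len(data)} is not a multiple of 4 bytes")
    # "<i" means little-endian int32, regardless of the host byte order.
    return [value for (value,) in struct.iter_unpack("<i", data)]


def count_symbols(path, chunk_size=1 << 20):
    """Count how often each byte occurs in a text file, reading in chunks."""
    counts = Counter()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            counts.update(chunk)  # iterating over bytes yields integers
    # Map byte values back to one-character strings for readability.
    return {chr(byte): n for byte, n in counts.items()}
```

Calling `count_symbols` on a genome file is a quick way to check that only A, C, G, T and N (plus possibly line breaks) occur. The list returned by `read_dlcp` favours clarity over memory use; for large arrays, NumPy's `np.fromfile(path, dtype="<i4")` is a more compact alternative.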
The dataset (gzipped tar file, ~7.5 GB) can be downloaded here.