Here we publish the dataset used in our “Bicriteria Data Compression” paper (arXiv version).
Each file is a chunk of 1GB (2^30 bytes) extracted from the following sources:
- Wikipedia: Natural data. Extracted from a dump of the English Wikipedia, in XML format.
- U.S. Census: Database of statistical metrics on the U.S. population. Extracted from the U.S. Census database.
- DBLP: Bibliographic database in XML format. Extracted from The DBLP Computer Science Bibliography project.
- PFAM: Biological data. Extracted from the PFAM database of protein families.
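Since every chunk is stated to be exactly 2^30 bytes, a quick size check can catch a truncated download. The following is only a minimal sketch: the function name `has_expected_size` and the constant `CHUNK_SIZE` are illustrative names chosen here and are not part of the dataset or the paper.

```python
import os

# Chunk size stated above: 2^30 bytes.
CHUNK_SIZE = 2 ** 30


def has_expected_size(path: str, expected: int = CHUNK_SIZE) -> bool:
    """Return True if the file at `path` is exactly `expected` bytes long."""
    return os.path.getsize(path) == expected
```

Calling `has_expected_size(path)` on a downloaded chunk should return `True` when the file is complete; `path` stands for wherever you saved the file.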