For our experimental tests we have collected and made available several datasets, which we believe can be useful for other purposes as well. You are free to download them; please cite our site in your paper or software:
- GERDAQ dataset v1.0: a collection of 1000 web search engine queries, annotated with Wikipedia entities through a crowdsourcing process. The dataset is shipped with the BAT-Framework. It can also be accessed in raw XML.
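Entity-annotated queries in raw XML can be processed with any standard XML parser. The sketch below is only an illustration: the element and attribute names (`query`, `text`, `annotation`, `span`, `entity`) are hypothetical placeholders, not the actual GERDAQ schema, so consult the dataset's documentation for the real format.

```python
# Sketch: parsing entity-annotated queries from XML with the stdlib parser.
# NOTE: the tag/attribute names here are hypothetical, not the GERDAQ schema.
import xml.etree.ElementTree as ET

sample = """
<dataset>
  <query id="1">
    <text>armstrong moon landing</text>
    <annotation span="armstrong" entity="Neil_Armstrong"/>
  </query>
</dataset>
"""

def parse_queries(xml_string):
    """Return a list of (query_text, [(span, entity), ...]) pairs."""
    root = ET.fromstring(xml_string)
    results = []
    for q in root.findall("query"):
        text = q.findtext("text")
        anns = [(a.get("span"), a.get("entity")) for a in q.findall("annotation")]
        results.append((text, anns))
    return results

queries = parse_queries(sample)
print(queries)
```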
- Pizza&Chili’s test collection: it contains several types of text: source code, pitches, proteins, DNA, English text, and XML. The files come in various sizes and can be freely used for your experiments.
- TAGME Datasets: a collection of datasets composed of short text fragments drawn from Wikipedia and Twitter. It was used in our work to evaluate the annotation process over short text fragments and the coverage of Wikipedia anchors as annotation spots in the web context.
- TagMyNews Datasets: a collection of datasets composed of news articles, snippets, and tweets. It was used in our work to evaluate the classification process over short texts.
- Hashtag Datasets: a collection of datasets composed of hashtags and Wikipedia entities. The datasets were used to devise new algorithms for hashtag classification and hashtag relatedness on Twitter.