TagMyNews Datasets is a collection of datasets of short text fragments that we used to evaluate our topic-based text classifier. We created two of these datasets (News and Twitter), while the third (Snippets) comes from previous work by Phan et al. [WWW2008].
- News: a dataset of ~32K English news items extracted from the RSS feeds of popular newspaper websites (nyt.com, usatoday.com, reuters.com). The categories are: Sport, Business, U.S., Health, Sci&Tech, World and Entertainment.
Each news item in the file has the following structure:
- link (may no longer be active)
- source (nyt|us|reuters)
and is separated from the next news item by an empty line. Download! (about 3.9MB)
- Snippets: this dataset was created by Phan et al. for their work "Learning to classify short and sparse text & web with hidden topics from large-scale data collections" and is composed of ~12K snippets drawn from Google. The tar.gz file also contains a README that describes the structure of the dataset. Download! (about 500KB)
- Twitter: a dataset composed of ~7K tweets. To create it, we retrieved a large number of tweets from Twitter and kept only those containing a link to a news item. We then gave each tweet the same category that the provider assigned to the linked news item. As in the News dataset, the categories are: Sport, Business, U.S., Health, Sci&Tech, World and Entertainment. Due to Twitter's privacy policies, we can't offer a public download of this dataset. For more information, contact us at: firstname.lastname@example.org
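
The News file described above stores each news item as a group of lines, separated from the next item by an empty line. A minimal Python sketch of reading such a layout could look like the following. The function name `iter_records` and the sample lines are hypothetical and used only for illustration. The sketch makes no assumption about the exact order or number of fields, so each item is returned as a plain list of its lines.

```python
from typing import Iterable, Iterator, List


def iter_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each news item as a list of its non-empty lines.

    Items are assumed to be separated by one or more empty lines,
    as described for the News dataset.
    """
    record: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip():
            # Still inside the current item: collect the line.
            record.append(line)
        elif record:
            # An empty line closes the current item.
            yield record
            record = []
    if record:
        # Last item when the input does not end with an empty line.
        yield record


# Hypothetical sample input, not taken from the real dataset.
sample = [
    "http://example.com/first-story\n",
    "nyt\n",
    "\n",
    "http://example.com/second-story\n",
    "reuters\n",
]

for item in iter_records(sample):
    print(item)
```

Because an open text file is itself an iterable of lines, the same function can be passed a file object (for example `open(path, encoding="utf-8")`, where `path` points to the downloaded file) without loading the whole file into memory.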