TagMyNews Datasets

TagMyNews Datasets is a collection of datasets of short text fragments that we used to evaluate our topic-based text classifier. Two of these datasets (News and Tweets) were created by us, while the third (Snippets) comes from previous work by Phan et al. [WWW2008].

  • News: this is a dataset of ~32K English news items extracted from the RSS feeds of popular newspaper websites (nyt.com, usatoday.com, reuters.com). The categories are: Sport, Business, U.S., Health, Sci&Tech, World and Entertainment.
    Each news item in the file has the following structure:

    1. title
    2. description
    3. link (may no longer be active)
    4. id
    5. date
    6. source (nyt|us|reuters)
    7. category

    and is separated from the next news item by an empty line. Download! (about 3.9 MB)

  • Snippets: this dataset was created by Phan et al. for their work "Learning to classify short and sparse text & web with hidden topics from large-scale data collections" and is composed of ~12K snippets drawn from Google. The tar.gz file also contains a README that describes the structure of the dataset. Download! (about 500 KB)

  • Twitter: this is a dataset composed of ~7K tweets. To create it, we retrieved a large number of tweets from Twitter and kept only those containing a link to a news item. We then assigned each tweet the same category that the provider assigned to the linked news item. As in the News dataset, the categories are: Sport, Business, U.S., Health, Sci&Tech, World and Entertainment. Due to Twitter's privacy policies, we cannot offer a public download of this dataset. For more information, contact us at: d.vitale@di.unipi.it
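
The record layout described for the News dataset (seven fields, one per line, with an empty line between items) can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: it assumes each field occupies exactly one non-empty line, in the order listed above. The names parse_news, _to_record and FIELDS are hypothetical helpers introduced for illustration and are not part of the dataset.

```python
from typing import Dict, Iterable, Iterator, List

# Field order as listed in the News dataset description.
FIELDS = ["title", "description", "link", "id", "date", "source", "category"]


def _to_record(block: List[str]) -> Dict[str, str]:
    """Map the lines of one record onto the field names."""
    # Assumption: every field is present and fits on a single line.
    if len(block) != len(FIELDS):
        raise ValueError(
            f"expected {len(FIELDS)} lines per record, got {len(block)}"
        )
    return dict(zip(FIELDS, block))


def parse_news(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per news item from blank-line-separated records."""
    block: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip():
            # Non-empty line: still inside the current record.
            block.append(line)
            continue
        if block:
            # Empty line: the current record is complete.
            yield _to_record(block)
            block = []
    if block:
        # The last record may not be followed by an empty line.
        yield _to_record(block)
```

Because parse_news accepts any iterable of lines, an open file object can be passed to it directly. The file name and its encoding depend on the download and are not specified here, so they would have to be checked against the actual file.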