Hashtag Datasets

This is a freely downloadable collection of datasets that can be used to conduct several hashtag-related experiments, including hashtag relatedness and hashtag classification, as we did in our paper “On Analyzing Hashtags in Twitter”.

We offer three distinct datasets:

  • HE Graph Download (about 100MB)
  • The file contains the complete Hashtag-Entity (HE) Graph in TSV format, with 5 columns: hashtag ID, entity ID, annotation scores, hashtag, and entity title. The annotation scores column holds a colon-separated list of confidence scores returned by TagME.

  • Hashtag relatedness dataset Download
  • 3-column TSV file: the first column contains the group identifier (details in the paper), while the other two columns contain the two hashtags forming a hashtag pair.

  • Hashtag classification dataset Download
  • 2-column TSV file: the first column contains the hashtag, while the second contains the category to which the hashtag belongs.
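
The three layouts described above can be read with a few lines of Python. What follows is only a minimal sketch, not an official loader: it assumes the files are UTF-8 encoded, have no header row, and use no quoting. The function names, the HEEdge type, and the sample rows in the demo are all hypothetical and made up for illustration. IDs are kept as strings because the description does not say whether they are numeric.

```python
import csv
import io
from typing import Dict, Iterator, List, NamedTuple, Tuple


class HEEdge(NamedTuple):
    """One row of the HE Graph file (hypothetical type name)."""
    hashtag_id: str
    entity_id: str
    scores: List[float]
    hashtag: str
    entity_title: str


def _rows(handle, expected_columns: int) -> Iterator[List[str]]:
    # Assumption: plain tab-separated text with no quoting and no header row.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != expected_columns:
            raise ValueError(
                f"line {line_number}: expected {expected_columns} columns, "
                f"got {len(row)}"
            )
        yield row


def read_he_graph(handle) -> Iterator[HEEdge]:
    """Yield one HEEdge per row of the 5-column HE Graph file."""
    for hashtag_id, entity_id, scores, hashtag, title in _rows(handle, 5):
        # The annotation scores column is a colon-separated list.
        parsed = [float(s) for s in scores.split(":") if s]
        yield HEEdge(hashtag_id, entity_id, parsed, hashtag, title)


def read_relatedness(handle) -> Iterator[Tuple[str, str, str]]:
    """Yield (group identifier, first hashtag, second hashtag) triples."""
    for group, first, second in _rows(handle, 3):
        yield group, first, second


def read_classification(handle) -> Dict[str, str]:
    """Return a mapping from hashtag to category."""
    return {hashtag: category for hashtag, category in _rows(handle, 2)}


if __name__ == "__main__":
    # Made-up sample row, only to show the call pattern.
    sample = io.StringIO("17\t42\t0.31:0.27\tnlp\tNatural language processing\n")
    for edge in read_he_graph(sample):
        print(edge.hashtag, edge.entity_title, edge.scores)
```

With a real file, the same functions accept an open handle, for example open(path, encoding="utf-8", newline=""), where path points to wherever the downloaded file was saved. Reading the HE Graph lazily, one row at a time, avoids holding the whole file of about 100MB in memory.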

Our paper, accepted at ICWSM 2015, contains further details about how we created these datasets. If you use the hashtag datasets and need a reference to cite, please refer to it.

These datasets are available under the Creative Commons Attribution-ShareAlike License. We hope they will be useful!