TAGME Datasets

TAGME Datasets is a collection of datasets containing short text fragments drawn from the Wikipedia snapshot of November 6, 2009. Each fragment is about 30 words long and contains about 20 non-stopwords on average.

We gathered fragments of three types:

  • Wiki-Disamb30, a list of 2M fragments, each containing one ambiguous anchor. The format is very simple: each fragment takes two lines. The first line contains the text (not lower-cased; Wikipedia syntax was cleaned using some heuristics). The second line contains the anchor (in lower case) followed by the numeric ID of the Wikipedia page the anchor points to. Anchor and ID are separated by a TAB character. Download (about 124 MB)
  • Wiki-Annot30, a list of 186K fragments. The format is almost the same: the first line contains the text, and the second line contains a list of the annotated anchors found in the text, each followed by the numeric ID of the page it points to. A TAB character separates anchors and IDs in the list. Text and anchors are cleaned as in the previous dataset. Download (about 18 MB)
  • Tweets, a list of about 5K short messages drawn from Twitter in ….. They were harvested using “The 1000 most frequent web search queries issued to Yahoo! Search”. Coming soon.
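
The two-line record layout of the Wikipedia datasets can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names parse_disamb and parse_annot are hypothetical, the sample records and page IDs are invented for illustration, and for Wiki-Annot30 it assumes the second line alternates anchor, ID, anchor, ID, all separated by TABs.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_disamb(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (text, anchor, page_id) from Wiki-Disamb30-style line pairs."""
    it = iter(lines)
    for text_line in it:
        anchor_line = next(it, None)
        if anchor_line is None:
            break  # dangling text line without its annotation line
        fields = anchor_line.rstrip("\r\n").split("\t")
        # Second line: anchor <TAB> numeric Wikipedia page ID
        yield text_line.rstrip("\r\n"), fields[0], int(fields[1])


def parse_annot(lines: Iterable[str]) -> Iterator[Tuple[str, List[Tuple[str, int]]]]:
    """Yield (text, [(anchor, page_id), ...]) from Wiki-Annot30-style line pairs.

    Assumption: the second line is anchor<TAB>id<TAB>anchor<TAB>id...
    """
    it = iter(lines)
    for text_line in it:
        annot_line = next(it, None)
        if annot_line is None:
            break
        # Drop empty fields so a trailing TAB does not break the pairing.
        fields = [f for f in annot_line.rstrip("\r\n").split("\t") if f]
        pairs = [(fields[i], int(fields[i + 1]))
                 for i in range(0, len(fields) - 1, 2)]
        yield text_line.rstrip("\r\n"), pairs


# Invented sample records (the page IDs are placeholders, not real ones).
disamb_sample = [
    "The Jaguar is a large cat native to the Americas.\n",
    "jaguar\t12345\n",
]
annot_sample = [
    "Pisa is a city in Tuscany on the Arno river.\n",
    "pisa\t111\ttuscany\t222\tarno\t333\n",
]

for text, anchor, page_id in parse_disamb(disamb_sample):
    print(anchor, page_id, text)

for text, pairs in parse_annot(annot_sample):
    print(pairs, text)
```

To process a downloaded file, pass an open file object instead of the sample list, for example `with open(path, encoding="utf-8") as fh: ...`, where path is wherever the file was saved. The UTF-8 encoding is also an assumption and may need adjusting.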

These datasets are available under the Creative Commons Attribution-ShareAlike License.