We are happy to announce that Rossano Venturini has been awarded a Yahoo Faculty Research and Engagement Program (FREP) grant. The project involves the design and experimentation of compressed data structures for indexing users’ personal collections of documents, which grow and change frequently over time.
We are happy to announce that our system, SMAPH, co-developed by Marco Cornolti and Paolo Ferragina (University of Pisa), Massimiliano Ciaramita (Google), and Hinrich Schütze and Stefan Rüd (University of Munich), achieved the best result in the ERD Challenge hosted at SIGIR 2014. The roughly 20 teams participating in the challenge had to build a working system for Entity Recognition and Disambiguation on search-engine queries, i.e. given a query, find the entities associated with it.
The problem of NER in queries is somewhat harder than in long texts. Queries are often malformed, ambiguous and, most of all, lacking in context. A searcher who issues a query like glasses may be interested either in drinkware or in eyeglasses, while a searcher who issues a query like google glasses has yet another need. The query armstrong moon landing should point to Neil Armstrong, while armstrong trumpet should point to Louis Armstrong.
SMAPH disambiguates queries. It piggybacks on a search engine to normalize the keywords of the query, then disambiguates them and prunes away bad entities. In the ERD Challenge (short track) it scored the best result (68.5% F1). The system will shortly be available for querying through a web service. Details of the system implementation are given in a paper.
We also participated, as the Acube Lab, in the long-track competition (i.e. disambiguation of long texts) with WAT, a new version of TagMe, achieving a good result (though unbalanced towards precision).
Darth Vader is closely related to the Death Star, but totally unrelated to Homeopathy. A relatedness function is a function that, given two Wikipedia pages, returns their relatedness as a score in the [0, 1] range. Many disambiguation techniques are based on a relatedness function.
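As a concrete example (a sketch for illustration, not necessarily the exact function any particular system uses), the classic Milne & Witten link-based measure scores two pages by the overlap of the sets of pages that link to them:

```python
import math

def mw_relatedness(in_a, in_b, total_pages):
    """Milne & Witten link-based relatedness between two Wikipedia pages,
    given the sets of pages linking to each (their in-link sets).
    Returns a score in [0, 1]; higher means more related."""
    common = in_a & in_b
    if not common:
        return 0.0
    # Normalized link distance, then flipped into a similarity.
    distance = (math.log(max(len(in_a), len(in_b))) - math.log(len(common))) / \
               (math.log(total_pages) - math.log(min(len(in_a), len(in_b))))
    return max(0.0, 1.0 - distance)

# Toy in-link sets with made-up page IDs (not real Wikipedia data):
darth_vader = {1, 2, 3, 4, 5}
death_star = {2, 3, 4, 5, 6}
homeopathy = {7, 8, 9}
print(mw_relatedness(darth_vader, death_star, 1000))  # large overlap: close to 1
print(mw_relatedness(darth_vader, homeopathy, 1000))  # no overlap: 0.0
```

The in-link sets and the total page count would come from a Wikipedia dump in a real setting; here they are hypothetical.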
But how would you evaluate a relatedness function? A possible way is, given a set of Wikipedia-page pairs, to ask humans to rate the relatedness of those pairs, and then to check the output of the relatedness function against the human judgments.
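A minimal sketch of this evaluation: compute a rank correlation (e.g. Spearman's) between the human ratings and the function's scores over the same pairs. The scores below are made up for illustration.

```python
def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the
    ranks of the values. Assumes no ties, for simplicity."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical ratings over five page pairs:
human_scores = [0.9, 0.1, 0.7, 0.3, 0.5]   # human judgments
system_scores = [0.8, 0.2, 0.9, 0.1, 0.4]  # relatedness function output
print(spearman(human_scores, system_scores))  # prints 0.8
```

A correlation near 1 means the function ranks pairs much like humans do; in practice a library routine with tie handling (e.g. SciPy's) would be preferable to this bare-bones version.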
One such dataset is Milne & Witten’s WikipediaSimilarity353 dataset. Unfortunately, since Wikipedia is ever-growing, the dataset has become obsolete, pointing to pages that no longer exist. To address this issue, we cleaned it up, updated a few references, and added a few pairs of pages.
The result is the brand-new WikipediaSimilarity411 dataset, which anybody can use; it provides 411 Wikipedia-page pairs. Please also note that the BAT-Framework has been updated and now supports testing a relatedness function against this dataset.
We are pleased to announce that our paper “Bicriteria Data Compression” has been accepted at the ACM/SIAM Symposium on Discrete Algorithms (SODA) 2014! The work will be presented on January 7 at 4PM, Galleria North, Ballroom Level, Hilton Portland & Executive Tower.
We are pleased to announce that our TAGME API service has received more than 100 million hits in about 2 years. We currently provide access to more than 100 users, and at times the service has handled more than 1 million queries per day.
Thank you very much to all the users who have provided their valuable feedback.
Here we publish the dataset used in our “Bicriteria Data Compression” paper (arXiv version).
Each file is a 1GB chunk (2^30 bytes) extracted from one of the following sources:
- Wikipedia: Natural-language data, extracted from a dump of the English Wikipedia in XML format.
- U.S. Census: Statistical data on the U.S. population, extracted from the U.S. Census database.
- DBLP: Bibliographic data in XML format, extracted from The DBLP Computer Science Bibliography project.
- PFAM: Biological data, extracted from the PFAM database of protein families.