Get the data

First off, get the benchmark data by clicking on the link below.

Download

Set up

Read the instructions and install the Python ETE Toolkit.

Evaluate

You can use the datasets to tune your approach and evaluate it by comparing your results to the expert ones.

About


The benchmark project for evaluating semantic clustering was initiated at the LGI2P research center during Nicolas Fiorini's PhD. The main motivation behind this work is to provide several datasets of semantically annotated documents.

This benchmark contains eight datasets of about 70 bookmarks each, annotated with WordNet 3.0. One dataset is designed for tuning your method, while the others are meant for evaluating it. For each dataset, a set of manually created expert trees is provided. To evaluate your results, compare your tree for each dataset with the expert ones using the provided Python script. The average distance to the expert trees is the score of your output tree on that dataset.

This benchmark is the result of curating linked open data (LOD) containing users and bookmarks associated with synsets. The synset descriptions provide a correspondence with WordNet. We mapped the bookmarks to the corresponding WordNet synsets and pruned the LOD graph, yielding a clean dataset of bookmarks associated with WordNet URIs.

Instructions


1. Download the benchmark data.

2. Have a look at the data. It consists of one tuning dataset and seven evaluation datasets. For each of them, you will find a CSV file that maps bookmark URIs to WordNet synset URIs. A WordNet 3.0 dump is provided in the knowledge folder, and a README file in that folder gives details on how the benchmark was created.
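Loading such an annotation file takes only a few lines of Python. Note that the column layout assumed below (bookmark URI first, synset URI second, one pair per row) is a guess — check the actual files and the README before relying on it:

```python
import csv

def load_annotations(path):
    """Map each bookmark URI to the set of WordNet synset URIs
    annotating it. Assumes one bookmark/synset pair per CSV row,
    with the bookmark URI in the first column -- verify against
    the real benchmark files."""
    mapping = {}
    with open(path, newline="") as f:
        for row in csv.reader(f):
            bookmark_uri, synset_uri = row[0], row[1]
            mapping.setdefault(bookmark_uri, set()).add(synset_uri)
    return mapping
```

The resulting dictionary groups all synsets of a bookmark together, which is a convenient input shape for a semantic clustering method.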

3. Tune your method on the tuning dataset, then run it on the evaluation datasets using the provided WordNet dump. So that it can be evaluated in the following steps, your algorithm must output trees in Newick format.
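Newick simply encodes a tree as nested parentheses ending with a semicolon. If your clustering produces nested groupings, a minimal conversion could look like the sketch below — the nested-tuple representation is an assumption about your own intermediate data structure, not something the benchmark prescribes:

```python
def to_newick(node):
    """Serialize a cluster tree, given as nested tuples of leaf
    labels, to a Newick string (without branch lengths)."""
    if isinstance(node, tuple):
        return "(" + ",".join(to_newick(child) for child in node) + ")"
    return str(node)

# Hypothetical clustering of five bookmarks:
tree = (("docA", "docB"), ("docC", ("docD", "docE")))
print(to_newick(tree) + ";")  # ((docA,docB),(docC,(docD,docE)));
```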

4. Install the Python ETE Toolkit by following these instructions. Note that the benchmark has been tested with Python 2.7.

5. Use the provided Python script to compare two trees; it uses the Robinson-Foulds distance. For each dataset, compute the distance between your tree and each expert tree, then average those distances. The result is the score of your output on that dataset.
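The provided script (built on the ETE Toolkit) handles the comparison for you, but the idea behind the score is easy to sketch. For rooted trees, the Robinson-Foulds distance counts the clades found in one tree but not in the other; averaging it over the expert trees gives the dataset score. The nested-tuple representation below is a hypothetical stand-in for parsed Newick trees, and the sketch is written in Python 3:

```python
def clades(tree):
    """Collect the leaf sets of all internal nodes of a rooted
    tree given as nested tuples of leaf labels."""
    found = set()
    def walk(node):
        if isinstance(node, tuple):
            leaves = frozenset().union(*(walk(child) for child in node))
            found.add(leaves)
            return leaves
        return frozenset([node])
    walk(tree)
    return found

def rf_distance(t1, t2):
    """Robinson-Foulds distance: number of clades present in
    exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))

def score(my_tree, expert_trees):
    """Average RF distance to the expert trees -- the dataset score."""
    return sum(rf_distance(my_tree, e) for e in expert_trees) / len(expert_trees)
```

Two identical trees score 0, and the score grows as your groupings diverge from the experts' ones, so lower is better.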

Contacts


Feel free to contact the team that initiated the project with any request or suggestion concerning this benchmark, or if you would like to collaborate with us. This project results from a collaboration between the École des Mines d'Alès and Montpellier SupAgro.