me@tktl 09/06

CS Department's web magazine

Organising an evaluation task for an evaluation contest

Evaluation is an important component of information retrieval and other areas of computer science. However, organised evaluation campaigns require a massive amount of work: planning suitable evaluation tasks, gathering evaluation data such as test sets and possible ground truths for each task, running the participating algorithms, and analysing the results.

Regardless, I found myself saying "yes" when I was asked to co-organise an evaluation task, namely SMS (Symbolic Melodic Similarity), for an evaluation contest called MIREX (Music Information Retrieval Evaluation eXchange).

The history of MIREX as an evaluation contest is very short. In the early 2000s the music information retrieval (MIR) community started discussing the need for TREC-like (Text REtrieval Conference) evaluation in MIR. The MIR tasks were considered too different from text retrieval tasks, so the community decided to design its own evaluation contest. The first contest, concentrating solely on audio retrieval tasks, was held in 2004. In 2005 a new group of tasks for notation-based data was added to the contest, which was given the name MIREX.

In 2006 MIREX was held for the second time with nine evaluation tasks, SMS being one of them. In practice, our work as task leaders consisted of planning the task, finding the test sets, and creating the queries. The head organisers of the contest ran the algorithms and collected the ground truth: human judgments of the similarity of each query/candidate pair, where the candidates were pooled from the preliminary result sets returned by the participating algorithms.
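
To give a rough idea of what such pooling means in practice, here is a minimal Python sketch (not the actual MIREX implementation; the function name and the cut-off of ten candidates are illustrative) that builds the pool of query/candidate pairs to be judged by taking the top candidates from every system's ranked list for each query.

def build_judgment_pool(results_by_system, top_n=10):
    """results_by_system: {system_name: {query_id: [candidate_id, ...]}}.
    Returns {query_id: set of candidate_ids} -- the pairs sent to the human judges."""
    pool = {}
    for ranked_lists in results_by_system.values():
        for query_id, candidates in ranked_lists.items():
            # Every system contributes its top-ranked candidates to the pool.
            pool.setdefault(query_id, set()).update(candidates[:top_n])
    return pool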

The SMS task was divided into three subtasks, namely a monophonic task (measuring similarity between two monophonic melodies) and two polyphonic tasks (measuring similarity between a monophonic and a polyphonic musical sequence). We were able to use a collection of monophonic melodic fragments called RISM UK as the test set for the monophonic subtask. The MIDI files for the polyphonic subtasks were harvested from the Internet.
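
As an illustration of what measuring similarity between two monophonic melodies can involve, here is a small Python sketch of one common approach (the participating algorithms each used their own, typically far more sophisticated methods): the melodies are reduced to pitch-interval sequences, which makes the comparison transposition-invariant, and then compared with plain edit distance.

def intervals(pitches):
    """MIDI pitch numbers -> successive pitch intervals (transposition-invariant)."""
    return [b - a for a, b in zip(pitches, pitches[1:])]

def edit_distance(s, t):
    """Classic Levenshtein distance between two interval sequences."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]

# The same tune in two different keys gives distance 0 on the interval level.
print(edit_distance(intervals([60, 62, 64, 65]), intervals([62, 64, 66, 67])))  # 0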

When the task was finalised and the collections assembled, we needed to create queries for the subtasks. The queries needed to have at least some matching candidates, preferably variations of the same melody, in the data. Thus we needed to listen through thousands of MIDI files (which is extremely bad for your mental health) to guarantee matches for each query. With the queries and the test sets we were ready to evaluate the algorithms by comparing their results to the ground truth.
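
To sketch what that comparison could look like in code, the snippet below scores one system's ranked result lists against relevance judgments distilled from the human similarity ratings. The metric (mean average precision) and all names are illustrative assumptions; MIREX used its own, graded evaluation measures.

def average_precision(ranked, relevant):
    """ranked: candidate ids in the system's order; relevant: ids judged similar."""
    hits, score = 0, 0.0
    for rank, cand in enumerate(ranked, 1):
        if cand in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def evaluate(system_results, ground_truth):
    """Mean average precision over all queries for one system."""
    scores = [average_precision(ranked, ground_truth.get(q, set()))
              for q, ranked in system_results.items()]
    return sum(scores) / len(scores)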

Even though there were some problems with the procedure (most notably, we did not have enough human evaluators for the query/candidate pairs to obtain statistically significant results for SMS and some other tasks), MIREX 2006 was considered successful. For the head organisers and task leaders it was a lot of work, with no reward other than the completed tasks and the general acceptance of the research community. However, I would recommend the experience to anyone with an interest in evaluation.

Anna Pienimäki