Welcome to Cross Language Evaluation Forum

CLEF 2007 Ad-Hoc Track

Collection

The ad-hoc track will test system performance on a multilingual collection of newspaper and news agency documents. The data download page, accessible from the Workspace for Registered Participants, indicates precisely which collections you need for each task.

Tasks

The ad-hoc track tests mono- and cross-language textual document retrieval. As in 2006, the 2007 track offers basic mono- and bilingual tasks plus an experimental task aimed at (but not restricted to) experienced participants. This is the “Robust” task.

1. Monolingual

The goal is to retrieve relevant documents in Bulgarian, Czech, and/or Hungarian collections using topics in the same language, and to submit results in a ranked list.

2. Bilingual

The 2007 bilingual task focuses on the "new" CLEF languages: Bulgarian and Hungarian (added in 2005) and Czech (added this year). The aim is to strengthen the text collections and to see whether system performance on these languages can match that obtained with more "consolidated" languages in previous years. We also include a new English target collection (LA Times 2002), which this year can be used with any topic language. However, we particularly encourage experiments with non-European languages against the English target collection.

The 2007 ad-hoc bilingual track will accept runs for the following source -> target language pairs:

Any topic language -> Bulgarian target collection
Any topic language -> Czech target collection
Any topic language -> Hungarian target collection
Any topic language -> English target collection

As always, the aim is to retrieve relevant documents from the chosen target collection and submit the results in a ranked list.
We strongly encourage groups that have participated in a cross-language ad-hoc task in previous years to submit at least one run for each target language.

Topics for Tasks 1 and 2

Topics will be supplied in a variety of European (Bulgarian, Czech, English, French, Hungarian, Spanish, Portuguese) and non-European languages including Amharic, Chinese, Afaan Oromo, Hindi, Telugu and Indonesian. Other languages can be added on demand.

Sets of 50 topics will be used for both mono- and bilingual tasks and will be found in the Workspace for Registered Participants from 11 April. Please contact carol.peters at isti.cnr.it if you are interested in other topic languages.

3. Robust

In 2007, another "robust" task will be offered. This task emphasizes reaching a minimal level of performance on every topic rather than a high average performance.

Robustness is a key issue for the transfer of CLEF research into applications. The robust task will use three languages often used in previous CLEF campaigns (English, French, Portuguese). Additional evaluation measures will be introduced.

The 2007 robust task focuses on target collections for "consolidated" languages for which many experiments have already been made within CLEF (English, French and Portuguese). One bilingual run (English -> French) will be offered.

The robust task is intended to evaluate stable performance over all topics, rather than high average performance, in mono- and cross-language IR (“ensure that all topics obtain minimum effectiveness levels”, Voorhees 2005, SIGIR Forum).

The evaluation methodology will use the geometric average as well as the mean of the average precision over all topics. The geometric average exhibited a high correlation with MAP at CLEF 2006. Other measures have been suggested. Candidate measures are:

MAP
GMAP
P@10
Number of Failed Topics
Number of topics below 0.1 MAP or topics with no hits among top 10 documents
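
As an illustration only, the candidate measures listed above could be computed from per-topic results along the following lines. This is a minimal sketch, not the official evaluation procedure: the function names, the reading of "failed topic" as a topic whose average precision is below 0.1, and the small floor `eps` used in the geometric mean (so that a topic with average precision 0 does not make the logarithm undefined) are all assumptions made for the example.

```python
import math


def average_precision(ranked_docs, relevant):
    """Average precision of one ranked list, given the set of relevant doc ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant)


def precision_at_k(ranked_docs, relevant, k=10):
    """Fraction of the top k documents that are relevant (P@10 for k=10)."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


def mean_average_precision(ap_values):
    """MAP: arithmetic mean of the per-topic average precision values."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values, eps=1e-5):
    """GMAP: geometric mean of the per-topic average precision values.

    eps is an assumed floor applied to each value before taking the log.
    """
    logs = [math.log(max(ap, eps)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


def failed_topics(ap_values, threshold=0.1):
    """Number of topics whose average precision is below the threshold."""
    return sum(1 for ap in ap_values if ap < threshold)


def no_hit_in_top_k(ranked_docs, relevant, k=10):
    """True if none of the top k documents is relevant."""
    return not any(doc in relevant for doc in ranked_docs[:k])


# Hypothetical per-topic average precision values for two systems.
system_a = [0.5, 0.5, 0.5, 0.5]
system_b = [0.9, 0.9, 0.15, 0.05]

for name, aps in (("A", system_a), ("B", system_b)):
    print(name, mean_average_precision(aps), geometric_map(aps), failed_topics(aps))
```

In this made-up example both systems have the same MAP, but system B has a clearly lower GMAP (roughly 0.28 against 0.5) and one topic below 0.1, because the geometric mean is pulled down by the poorly performing topics. This is the behaviour the robust task is meant to reward or penalize.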

Data for Robust Task

Problems with inconsistencies between collections and topics should be avoided this year.

Ad-hoc collections that were available at CLEF 2001 to CLEF 2006
Three languages: English, French, Portuguese (EN, FR, PT)
Topics 2001-2003: for training
Topics 2004-2006: for testing
For English and French, training topics will be available
For Portuguese, no training topics will be available
The topics can be found in the Workspace for Registered Participants
Each group may submit five runs for each sub-task and each topic language

Contact: Thomas Mandl, University of Hildesheim, Germany, mandl@uni-hildesheim.de

The track is coordinated jointly by ISTI-CNR and U. Padua (Italy) and U. Hildesheim (Germany).