Information Retrieval System Evaluation for European Languages moves to Europe

Motivation

It has been demonstrated extensively by the Text REtrieval Conferences (TREC) that the availability of evaluation procedures can contribute significantly to the improvement of system performance. For this reason, since 1997, cross-language system evaluation has been one of the tracks at TREC. The aim has been to provide developers with an infrastructure enabling them to test and tune their systems and compare the results achieved using different cross-language strategies. From the year 2000, a Cross-Language Evaluation Forum (CLEF) for European languages will be coordinated in Europe while TREC will focus on Asian languages. This move and the inclusion of a monolingual track for the evaluation of IR systems designed for languages other than English will make it possible to focus on a wider range of issues.

CLEF AGENDA for 2000

Task Description

The ultimate goal of multilingual information retrieval systems is to let users query in a given language and retrieve a merged and ranked set of documents that match the query in any language. However, information access in multiple languages also implies an understanding of the issues involved in monolingual IR for different language types and sub-types, and many of today’s applications concern cross-language retrieval between selected pairs of languages. There will thus be three main evaluation tracks in CLEF 2000. Interested groups can participate in any one or in all three tracks. Newcomers to the activity may well choose to begin with the monolingual track in the first year and work up to the others in later years. There will also be a special sub-task for domain-specific cross-language evaluation.

  1. Multilingual Information Retrieval

    The main task of CLEF 2000 requires searching a multilingual document collection, containing English, German, French, and Italian texts, for relevant documents. Given topics in a selected language, the goal is to retrieve relevant documents in all languages of the collection and to present the results as a single merged, ranked list.

    The official topic languages for CLEF 2000 will be Dutch, English, French, German, Italian, Spanish, Swedish, plus possibly Finnish. However, it will also be possible to submit runs in which the document collection is queried in other languages. In this case, participants will be responsible for the translation of the query into their selected language. The results for such runs will be given separately.

  2. Bilingual Information Retrieval

    A cross-language task will be provided in which the query language can be Dutch, French, German, Italian, Spanish or Swedish (plus possibly Finnish) and the target document collection is English. Many IR groups are now beginning to work on retrieval over pairs of languages and this will give them a chance to participate officially in the CLEF activity. Unofficial bilingual runs in which the English document collection is queried in other languages can also be submitted and will be evaluated.

  3. Monolingual (non-English) Information Retrieval

    It is often asserted that procedures for monolingual information retrieval are (almost) completely language independent. This is not true, however: different languages present different problems. Methods that are highly effective for certain language typologies may not be so effective for others. Issues that have to be catered for include word order, morphology, diacritic characters, and language variants. So far, most IR system evaluation has focussed on English. We will provide the opportunity for monolingual system testing and tuning and build up test suites in other European languages (beginning with French, German and Italian in CLEF 2000).

  4. Special task GIRT

    In addition to the three main tasks, there is a special task for CLEF 2000, based on a data collection from a vertical domain (social sciences); this collection now contains nearly 80,000 documents in a structured database. This task will have 25 questions, created in German but translated into all languages.
    Groups should run these questions either (1) as a monolingual task against the 80,000 documents of this database (GIRT), or against both GIRT and the German newspaper articles; or (2) as a multilingual task using the translated elements of GIRT. The rationale of this sub-task is to study CLIR in a vertical domain (i.e. social sciences) where a German/English/Russian thesaurus and English translations of the document titles are available.
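
As an illustration of the merged, ranked list that the multilingual task asks for, the following is a minimal sketch of one possible merging strategy. CLEF does not prescribe any particular method. The sketch assumes that a system has already produced one scored result list per document language, and it rescales each list to [0, 1] (min-max normalization) before sorting everything together. The function names, the document identifiers and the cut-off of 1000 documents are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple

# A run is a list of (doc_id, score) pairs for one document language.
Run = List[Tuple[str, float]]


def min_max_normalize(run: Run) -> Run:
    """Rescale the scores of one run to the range [0, 1]."""
    if not run:
        return []
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in run]


def merge_runs(runs_by_language: Dict[str, Run], limit: int = 1000) -> Run:
    """Merge per-language runs into one list ranked by normalized score.

    The limit of 1000 is an assumption made for this sketch only.
    """
    merged: Run = []
    for run in runs_by_language.values():
        merged.extend(min_max_normalize(run))
    # Sort by descending score; break ties by document id so the
    # output order is deterministic.
    merged.sort(key=lambda pair: (-pair[1], pair[0]))
    return merged[:limit]


# Hypothetical scores from two monolingual searches on different scales.
runs = {
    "en": [("EN-001", 12.0), ("EN-002", 6.0), ("EN-003", 3.0)],
    "de": [("DE-001", 0.9), ("DE-002", 0.5)],
}
merged = merge_runs(runs)
for rank, (doc_id, score) in enumerate(merged, start=1):
    print(rank, doc_id, round(score, 3))
```

Normalizing before merging matters because raw scores from separate per-language searches are generally not comparable. Other strategies, such as interleaving the lists by rank, are equally possible and may work better for a given system.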

Resources

The CLEF test collection for 2000 consists of SGML-formatted newspaper documents in English, French, German and Italian, all from the same time period. Stemmed documents in French, German and Italian can be supplied on demand. It is hoped that Spanish documents can be included in the collection in 2001. CLEF participants will have free access to the multilingual test suite (documents, topics, and relevance assessments) for research purposes.

Partners

English: NIST, Gaithersburg MD, USA (Ellen Voorhees)
French: University of Zurich, Switzerland (Michael Hess)
German: IZ Sozialwissenschaften (Social Science Information Centre), Bonn, in co-operation with University of Hildesheim, Germany (Michael Kluck, Jürgen Krause, Christa Womser-Hacker)
Italian: IEI-CNR, Pisa, Italy (Carol Peters)
Spanish: IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo)
Repositories (Test Collections and Resources): Eurospider, Switzerland (Peter Schäuble, Martin Braschler)

Participation

Those wishing to take part in CLEF 2000 are requested to send an e-mail to Carol Peters, indicating in which task(s) they intend to participate, as soon as possible. Participants will be requested to sign an agreement restricting the use of the data and regulating publication and dissemination of results.

For further information on the procedure for participation and copies of the data release forms, see the contact information below.

Important Dates

Data Release - 1 April 2000
Topic Release - 8 May 2000
Receipt of results from participants - 1 July 2000
Release of relevance assessments and individual results - 15 August 2000
Submission of paper for Working Notes - 5 September 2000
Workshop - 21-22 September 2000
Proceedings - November 2000

Workshop

A two-day Workshop will be held on 21-22 September in Lisbon, Portugal, immediately after the fourth European Conference on Digital Libraries (ECDL 2000).

The first day will be open to all interested participants and will focus on research-related issues in Multilingual Information Access.

The second day will present and discuss the results of the CLEF activity and will be restricted to active CLEF participants.

Contact Information

For further information and to be included in the mailing list, contact:

Carol Peters - IEI-CNR
Via Alfieri, 1 PISA (Italy)
Tel: +39 050 315 2897 - Fax: +39 050 315 2810
E-mail: carol@iei.pi.cnr.it