CLEF AGENDA for 2000

Task Description

The ultimate goal of multilingual information retrieval systems is to let users query in a given language and retrieve a merged, ranked set of documents that match the query in any language. However, information access in multiple languages also requires an understanding of the issues involved in monolingual IR for different language types and sub-types, and many of today’s applications involve cross-language retrieval between selected pairs of languages. There will thus be three main evaluation tracks in CLEF 2000. Interested groups can participate in any one or in all three tracks. Newcomers to the activity may well choose to begin with the monolingual track in the first year and work up to the others in later years. There will also be a special sub-task for domain-specific cross-language evaluation.

  1. Multilingual Information Retrieval

    The main task of CLEF 2000 requires searching a multilingual document collection, containing English, German, French, and Italian texts, for relevant documents. Using a selected topic language, the goal is to retrieve documents in all languages of the collection and to list the results in a single merged, ranked list.

    The official topic languages for CLEF 2000 will be Dutch, English, French, German, Italian, Spanish, Swedish, plus possibly Finnish. However, it will also be possible to submit runs in which the document collection is queried in other languages. In this case, participants will be responsible for the translation of the query into their selected language. The results for such runs will be given separately.
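
    As an illustration of the merging step only, the sketch below combines per-language result lists into one ranked list by min-max normalising the scores of each list. This is one simple strategy among several and is not prescribed by CLEF; the function names, the document identifiers, and the cutoff k are hypothetical and do not reflect any official run format.

```python
from typing import Dict, List, Tuple

# One monolingual result list: (document id, retrieval score), best first.
Run = List[Tuple[str, float]]


def min_max_normalize(run: Run) -> Run:
    """Rescale the scores of one run to [0, 1] so runs become comparable."""
    if not run:
        return []
    scores = [score for _, score in run]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal (or a single document): treat them as top-scored.
        return [(doc, 1.0) for doc, _ in run]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in run]


def merge_runs(runs_by_language: Dict[str, Run],
               k: int = 1000) -> List[Tuple[str, str, float]]:
    """Merge per-language runs into one list of (language, doc id, score).

    k is a hypothetical cutoff on the length of the merged list.
    """
    merged = []
    for lang, run in runs_by_language.items():
        for doc, score in min_max_normalize(run):
            merged.append((lang, doc, score))
    # Highest normalised score first; ties broken by language, then doc id,
    # so that the output is deterministic.
    merged.sort(key=lambda item: (-item[2], item[0], item[1]))
    return merged[:k]


# Hypothetical document ids and raw scores on different scales.
runs = {
    "en": [("EN-001", 12.0), ("EN-002", 6.0), ("EN-003", 3.0)],
    "de": [("DE-001", 0.9), ("DE-002", 0.5)],
    "it": [("IT-001", 40.0)],
}
merged = merge_runs(runs)
for lang, doc, score in merged:
    print(f"{doc}\t{lang}\t{score:.3f}")
```

    Raw scores from different collections are rarely comparable, which is why the sketch normalises before sorting. Note that min-max scaling forces the lowest-ranked document of every list to zero, so other strategies such as round-robin interleaving may be preferable in practice.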

  2. Bilingual Information Retrieval

    A cross-language task will be provided in which the query language can be Dutch, French, German, Italian, Spanish or Swedish (plus possibly Finnish) and the target document collection is English. Many IR groups are now beginning to work on retrieval over pairs of languages, and this will give them a chance to participate officially in the CLEF activity. Unofficial bilingual runs that query the English document collection in other languages can also be submitted and will be evaluated.

  3. Monolingual (non-English) Information Retrieval

    It is often asserted that procedures for monolingual information retrieval are (almost) completely language independent. This is not, however, true; different languages present different problems. Methods that may be highly efficient for certain language typologies may not be so effective for others. Issues that have to be catered for include word order, morphology, diacritic characters, and language variants. So far, most IR system evaluation has focussed on English. We will provide the opportunity for monolingual system testing and tuning, and build up test suites in other European languages (beginning with French, German and Italian in CLEF 2000).
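
    To make the point about diacritic characters concrete, the following minimal sketch folds accented letters to their base letters at indexing time, using Unicode canonical decomposition from the Python standard library. The function name is hypothetical, and the approach is only one possible choice, not a CLEF recommendation.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return text with combining marks (accents, umlauts) removed."""
    # NFD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", text)
    # Keep every character that is not a combining mark.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["élève", "città", "Jürgen", "Straße"]:
    print(word, "->", fold_diacritics(word))
```

    Whether such folding helps is itself language dependent. In German, for example, "ü" is conventionally transliterated as "ue" rather than "u", and characters without a canonical decomposition, such as "ß", pass through this function unchanged.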

  4. Special task GIRT

    In addition to the three main tasks, there is a special task for CLEF 2000 based on a data collection from a vertical domain (social sciences); this collection now contains nearly 80,000 documents in a structured database. This task will have 25 questions, created in German but translated into all languages.
    Groups should run these questions either (1) as a monolingual task, against the 80,000 documents of this database (GIRT) alone or against both GIRT and the German newspaper articles, or (2) as a multilingual task using the translated elements of GIRT. The rationale of this sub-task is to study CLIR in a vertical domain (i.e. social science) where a German/English/Russian thesaurus and English translations of the document titles are available.

Resources

The CLEF test collection for 2000 consists of SGML-formatted newspaper documents in English, French, German and Italian, all from the same time period. Stemmed documents in French, German and Italian can be supplied on demand. It is hoped that Spanish documents can be included in the collection in 2001. CLEF participants will have free access to the multilingual test suite (documents, topics, and relevance assessments) for research purposes.

Partners

English: NIST, Gaithersburg MD, USA (Ellen Voorhees)
French: University of Zurich, Switzerland (Michael Hess)
German: IZ Sozialwissenschaften (Social Science Information Centre), Bonn, in co-operation with University of Hildesheim, Germany (Michael Kluck, Jürgen Krause, Christa Womser-Hacker)
Italian: IEI-CNR, Pisa, Italy (Carol Peters)
Spanish: IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo)
Repositories (Test Collections and Resources): Eurospider, Switzerland (Peter Schäuble, Martin Braschler)

Important Dates

Data Release - 1 April 2000
Topic Release - 8 May 2000
Receipt of results from participants - 1 July 2000
Release of relevance assessments and individual results - 15 August 2000
Submission of paper for Working Notes - 5 September 2000
Workshop - 21-22 September 2000