CLEF AGENDA for 2000
Task Description
The ultimate goal for systems for
multilingual information retrieval is to offer users the opportunity to
query in a given language and retrieve a merged and ranked set of
documents that match the query in any language. However, information
access in multiple languages also implies an understanding of the issues
involved in monolingual IR for different language types and sub-types, and
many of today’s applications concern cross-language retrieval between
selected pairs of languages. There will thus be three main evaluation
tracks in CLEF 2000. Interested groups can participate in any one or in
all three tracks. Newcomers to the activity may well choose to begin with
the monolingual track in the first year and work up to the others in later
years. There will also be a special sub-task for domain-specific
cross-language evaluation.
- Multilingual Information Retrieval
The main task of CLEF 2000
requires searching a multilingual document collection, containing
English, German, French, and Italian texts, for relevant documents.
Using a selected topic language, the goal is to retrieve documents for
all languages in the collection, listing the results in a merged, ranked
list.
The official topic languages
for CLEF 2000 will be Dutch, English, French, German, Italian, Spanish,
Swedish, plus possibly Finnish. However, it will also be possible to
submit runs in which the document collection is queried in other
languages. In this case, participants will be responsible for the
translation of the query into their selected language. The results for
such runs will be given separately.
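
The merged, ranked list described above can be sketched as follows. This is only an illustration: CLEF does not prescribe a merging strategy, and the min-max score normalisation, the function names and the document identifiers used here are all assumptions made for the example. Each language-specific run is rescaled to a common range before the runs are interleaved by score.

```python
from typing import Dict, List, Tuple

# One run is a list of (document id, retrieval score) pairs.
Ranking = List[Tuple[str, float]]


def min_max_normalise(run: Ranking) -> Ranking:
    """Rescale the scores of one run to [0, 1] so that runs are comparable."""
    if not run:
        return []
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in run]


def merge_runs(runs: Dict[str, Ranking], limit: int = 1000) -> Ranking:
    """Merge per-language runs into one list ordered by normalised score."""
    merged: Ranking = []
    for run in runs.values():
        merged.extend(min_max_normalise(run))
    # Sort by descending score; break ties by document id for determinism.
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged[:limit]


# Hypothetical runs: scores from different systems are on different scales.
runs = {
    "english": [("EN-001", 12.0), ("EN-002", 6.0)],
    "german": [("DE-001", 9.0), ("DE-002", 6.0), ("DE-003", 3.0)],
}
merged = merge_runs(runs)
```

Min-max normalisation is only one possible choice; it assumes that the top document of each language is equally likely to be relevant, which may not hold when some languages contain few relevant documents for a topic.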
- Bilingual Information Retrieval
A cross-language task will be
provided in which the query language can be Dutch, French, German,
Italian, Spanish or Swedish (plus possibly Finnish) and the target
document collection is English. Many IR groups are now beginning to work
on retrieval over pairs of languages and this will give them a chance to
participate officially in the CLEF activity. Unofficial bilingual runs
that query the English document collection in other languages
can also be submitted and will be evaluated.
- Monolingual (non-English) Information Retrieval
It is often asserted that
procedures for monolingual information retrieval are (almost) completely
language independent. This is not, however, true; different languages
present different problems. Methods that may be highly efficient for
certain language typologies may not be so effective for others. Issues
that have to be catered for include word order, morphology, diacritic
characters, and language variants. So far, most IR system evaluation has
focussed on English. We will provide the opportunity for monolingual
system testing and tuning and build up test suites in other European
languages (beginning with French, German and Italian in CLEF
2000).
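
As a small illustration of one of the language-specific issues named above (diacritic characters), the sketch below folds accented and unaccented spellings of a term to the same index form. It is not part of the CLEF specification; the function names are hypothetical, and the approach (Unicode decomposition followed by removal of combining marks) is just one common option.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise_term(term: str) -> str:
    """Case-fold a term, then strip its diacritics."""
    return strip_diacritics(term.casefold())


# French, Italian and German examples.
print(normalise_term("Élève"))   # eleve
print(normalise_term("città"))   # citta
print(normalise_term("Straße"))  # strasse
```

Such folding is lossy and language dependent: in German, for example, "ü" is conventionally transliterated as "ue" rather than "u", so a single rule may not be equally effective for every language in the collection.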
- Special task: GIRT
In addition to the three main
tasks, there is a special task for CLEF 2000 based on a data
collection from a vertical domain (social sciences); this collection
now contains nearly 80,000 documents in a structured database. This task
will have 25 questions, created in German but translated into all
languages. Groups should run these questions either:
1. as a monolingual task against the 80,000 documents of this database (GIRT), or
against both GIRT and the German newspaper articles; or
2. as a multilingual task using the translated elements of GIRT.
The rationale of this sub-task is to study CLIR in a vertical domain (i.e. social
science) where a German/English/Russian thesaurus and English
translations of the document titles are available.
Resources
The CLEF test collection for 2000
consists of SGML-formatted newspaper documents in English, French, German
and Italian, from the same time period. Stemmed documents in French,
German and Italian can be supplied on demand. In 2001, it is hoped to
include Spanish documents in the collection. CLEF participants will have
free access to the multilingual test suite (documents, topics, and
relevance assessments) for research purposes.
Partners
English: NIST, Gaithersburg MD, USA (Ellen Voorhees)
French: University of Zurich, Switzerland (Michael Hess)
German: IZ Sozialwissenschaften (Social Science Information Centre), Bonn, in co-operation with University of Hildesheim, Germany (Michael Kluck, Jürgen Krause, Christa Womser-Hacker)
Italian: IEI-CNR, Pisa, Italy (Carol Peters)
Spanish: IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo)
Repositories (Test Collections and Resources): Eurospider, Switzerland (Peter Schäuble, Martin Braschler)
Important Dates
Data Release - 1 April 2000
Topic Release - 8 May 2000
Receipt of results from participants - 1 July 2000
Release of relevance assessments and individual results - 15 August 2000
Submission of paper for Working Notes - 5 September 2000
Workshop - 21-22 September 2000