Information Retrieval System Evaluation for European Languages Moves to Europe
Motivation
It has been demonstrated
extensively by the Text REtrieval Conferences (TREC) that the availability
of evaluation procedures can contribute significantly to the improvement
of system performance. For this reason, since 1997, cross-language system
evaluation has been one of the tracks at TREC. The aim has been to provide
developers with an infrastructure enabling them to test and tune their
systems and compare the results achieved using different cross-language
strategies. From the year 2000, a Cross-Language Evaluation Forum (CLEF) for European
languages will be coordinated in Europe while TREC will focus on Asian
languages. This move and the inclusion of a monolingual track for the
evaluation of IR systems designed for languages other than English will
make it possible to focus on a wider range of issues.
CLEF AGENDA for 2000
Task Description
The ultimate goal
for systems for multilingual information retrieval is to
offer users the opportunity to query in a given language and
retrieve a merged and ranked set of documents that match
the query in any language. However, information access in
multiple languages also implies an understanding of the
issues involved in monolingual IR for different language
types and sub-types, and many of today's
applications concern cross-language retrieval between
selected pairs of languages.
There will thus be three main evaluation tracks in CLEF 2000.
Interested groups can participate in any one or in all three tracks.
Newcomers to the activity may well choose to begin with the
monolingual track in the first year and work up to the others
in later years. There will also be a special sub-task for domain-specific
cross-language evaluation.
- Multilingual Information Retrieval
The main task of CLEF 2000
requires searching a multilingual document
collection, containing
English, German, French, and Italian texts,
for relevant documents.
Using a selected topic language, the
goal is to retrieve documents in all languages of the collection,
listing the
results in a single merged, ranked list.
The
official topic languages for CLEF 2000 will be Dutch,
English, French, German, Italian, Spanish, Swedish, plus possibly Finnish.
However, it will also
be possible to submit runs in which the document
collection is queried in other languages. In this
case, participants will be responsible for the
translation of the query into their selected
language. The results for such runs will be given
separately.
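The merging step described above can be sketched as follows. This is only an illustration under stated assumptions: CLEF does not prescribe a merging strategy, and min-max score normalization is just one simple option. The function names, document identifiers, scores, and the `limit` default are all hypothetical.

```python
from typing import Dict, List, Tuple

# One run: (document id, retrieval score) pairs for a single language.
Run = List[Tuple[str, float]]


def min_max_normalize(run: Run) -> Run:
    """Rescale the scores of one run to the interval [0, 1]."""
    if not run:
        return []
    scores = [score for _, score in run]
    low, high = min(scores), max(scores)
    if high == low:
        # All scores are equal: treat every document as equally good.
        return [(doc_id, 1.0) for doc_id, _ in run]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in run]


def merge_runs(runs_by_language: Dict[str, Run], limit: int = 1000) -> Run:
    """Merge per-language runs into one list ranked by normalized score."""
    merged: Run = []
    for run in runs_by_language.values():
        merged.extend(min_max_normalize(run))
    # Sort by descending score; break ties by document id for determinism.
    merged.sort(key=lambda item: (-item[1], item[0]))
    return merged[:limit]


# Hypothetical per-language results for one topic (scores are made up).
runs = {
    "en": [("EN-001", 12.0), ("EN-002", 9.0), ("EN-003", 6.0)],
    "de": [("DE-001", 0.75), ("DE-002", 0.5), ("DE-003", 0.25)],
}
merged = merge_runs(runs)
for doc_id, score in merged:
    print(f"{doc_id}\t{score:.2f}")
```

Normalizing each run first matters because scores produced for different languages (or by different systems) are usually not on a comparable scale; sorting the raw scores would let one language dominate the merged list.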
- Bilingual Information Retrieval
A cross-language task will
be provided in which the query language can be
Dutch, French, German, Italian, Spanish or Swedish (plus possibly Finnish)
and the target document
collection is English. Many IR groups are now
beginning to work on retrieval over pairs of
languages and this will give them a chance to
participate officially in the CLEF activity.
Unofficial bilingual runs in which the query to
the English document collection can be in other
languages can also be submitted and will be
evaluated.
- Monolingual (non-English) Information Retrieval
It is often
asserted that procedures for monolingual information
retrieval are (almost) completely language independent.
This is not true, however; different languages present
different problems. Methods that may be highly efficient
for certain language typologies may not be so effective
for others. Issues that have to be catered for include
word order, morphology, diacritic characters, and language
variants. So far, most IR system evaluation has focussed
on English. We will provide the opportunity for
monolingual system testing and tuning and build up test
suites in other European languages (beginning with
French, German and Italian in CLEF 2000).
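As a small illustration of one of the issues listed above, diacritic characters, the following sketch folds accented letters to their base forms before indexing or matching. It is an assumption-laden example rather than a method prescribed by CLEF; the function names are hypothetical, and only Python's standard `unicodedata` module is used.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_term(term: str) -> str:
    """Case-fold a term and fold its diacritics for indexing or matching."""
    return strip_diacritics(term.casefold())


# Example terms in French, Italian and German.
for term in ["Élève", "città", "Jürgen"]:
    print(term, "->", normalize_term(term))
```

Whether such folding helps is itself language dependent: in German, for instance, "ü" is conventionally transliterated as "ue" rather than "u", so a single rule applied to every language can conflate distinct words.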
- Special task GIRT
In addition to the three main tasks, there is a special task for CLEF 2000,
based on a data collection from a vertical domain (social
sciences); this collection now contains nearly 80,000 documents in a
structured database. This task will
have 25 questions, created in German but translated into all languages.
Groups should run these questions either (1) as a monolingual task
against the 80,000 documents of this database (GIRT), or against both GIRT and the German
newspaper articles, or (2) as a multilingual task using the translated elements
of GIRT. The rationale of this subtask is to study CLIR in a vertical domain
(i.e. social science) where a German/English/Russian thesaurus and English
translations of the document titles are available.
Resources
The CLEF test collection for 2000
consists of SGML-formatted newspaper documents for English, French, German
and Italian, from the same time period. Stemmed documents in French, German
and Italian can be supplied on demand. In 2001, it is hoped to include Spanish
documents in the collection. CLEF participants will have free access to the
multilingual test suite (documents, topics, and relevance assessments)
for research purposes.
Partners
English: NIST, Gaithersburg MD, USA (Ellen Voorhees)
French: University of Zurich, Switzerland (Michael Hess)
German: IZ Sozialwissenschaften (Social Science Information Centre), Bonn, in co-operation with University of Hildesheim, Germany (Michael Kluck, Jürgen Krause, Christa Womser-Hacker)
Italian: IEI-CNR, Pisa, Italy (Carol Peters)
Spanish: IEEC-UNED, Madrid, Spain (Felisa Verdejo, Julio Gonzalo)
Repositories (Test Collections and Resources): Eurospider, Switzerland (Peter Schäuble, Martin Braschler)
Participation
Those wishing to take part in CLEF 2000
are requested to send an e-mail to Carol Peters,
indicating in which task(s) they intend to participate, as soon as possible.
Participants will be requested to sign an agreement restricting the use of the data
and regulating publication and dissemination of results.
For further information on the procedure
for participation and copies of the data release forms, please click
here.
Important Dates
Data Release - 1 April 2000
Topic Release - 8 May 2000
Receipt of results from participants - 1 July 2000
Release of relevance assessments and individual results - 15 August 2000
Submission of paper for Working Notes - 5 September 2000
Workshop - 21-22 September 2000
Proceedings - November 2000
Workshop
A two-day Workshop
will be held on 21-22 September in Lisbon, Portugal,
immediately after the fourth European Conference on
Digital Libraries (ECDL 2000).
The first day will
be open to all interested participants and will focus on
research-related issues in Multilingual Information
Access.
The second day
will present and discuss the results of the CLEF activity
and will be restricted to active CLEF participants.
Contact Information
For further information and to be included in the mailing list, contact:
Carol Peters - IEI-CNR
Via Alfieri, 1 PISA (Italy)
Tel: +39 050 315 2897 - Fax: +39 050 315 2810
E-mail: carol@iei.pi.cnr.it