CLEF 2007 offered a series of evaluation tracks to test different aspects of cross-language information retrieval system development. The aim was to promote research into the design of user-friendly, multilingual, multimodal retrieval systems.
Multilingual Textual Document Retrieval (Ad Hoc)
This track tests mono- and cross-language textual document retrieval. In 2007, we offered monolingual and bilingual tasks on target collections in Bulgarian, Czech (new this year), and Hungarian. Query topics were prepared in a number of European languages. As in previous years, a bilingual task encouraging system testing with non-European topic languages against English documents was also offered; topics were provided in languages such as Amharic, Arabic, Oromo and Indonesian. A special sub-task addressed Indian languages, including Hindi, Telugu and Marathi. Groups participating with these languages had to submit one English monolingual run. A "robust" task was also offered, emphasizing the importance of achieving a minimal performance on every topic rather than a high average performance; robustness is a key issue for the transfer of CLEF research into applications. The 2007 robust task used three languages frequently employed in previous CLEF campaigns (English, French, Portuguese), and additional evaluation measures were introduced. The track was coordinated by ISTI-CNR (IT), U. Padua (IT) and U. Hildesheim (DE). See http://clef.isti.cnr.it/2007/2007Ad-hoc.htm.
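The robust task's emphasis on minimal per-topic performance is commonly captured by measures such as the geometric mean of average precision (GMAP), which penalizes near-zero topics far more severely than the arithmetic mean (MAP). A minimal sketch of the contrast between the two measures; the function names and the epsilon floor below are illustrative assumptions, not part of the track definition:

```python
import math

def gmap(average_precisions, eps=1e-5):
    """Geometric mean of per-topic average precision.

    A small epsilon floor keeps zero-AP topics from collapsing the
    product to zero while still penalizing them heavily.
    """
    logs = [math.log(max(ap, eps)) for ap in average_precisions]
    return math.exp(sum(logs) / len(logs))

def mean_ap(average_precisions):
    """Arithmetic mean of per-topic average precision (MAP)."""
    return sum(average_precisions) / len(average_precisions)

# Two hypothetical systems with identical MAP; one fails badly on
# some topics. GMAP separates them, MAP cannot.
steady = [0.30, 0.30, 0.30, 0.30]
spiky = [0.55, 0.55, 0.09, 0.01]
```

Here `mean_ap(steady)` equals `mean_ap(spiky)`, but `gmap(spiky)` is far lower than `gmap(steady)`, which is exactly the behavior a robustness-oriented measure is after.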
Cross-Language Scientific Data Retrieval (Domain-Specific)
Mono- and cross-language domain-specific retrieval was studied in the domain of social sciences using structured data (e.g. bibliographic data, keywords, and abstracts) from scientific reference databases. The following corpora were provided: GIRT-4 for German/English, INION for Russian, and Sociological Abstracts for English (at the time under negotiation with Cambridge Scientific Abstracts). A multilingual controlled vocabulary (German, English, Russian) suitable for use with GIRT-4 and INION was provided, together with a bidirectional mapping between this vocabulary and the one used for indexing Sociological Abstracts (English). Topics were offered in at least English, German and Russian. This track was coordinated by IZ Bonn (DE). See http://www.gesis.org/en/research/information_technology/clef_ds_2007.htm for more information.
Multiple Language Question Answering (QA@CLEF)
QA@CLEF 2007 proposed both main and pilot tasks. The main task scenario was topic-related QA, in which the questions were grouped by topic and could contain anaphoric references to one another. The answers were retrieved from heterogeneous document collections, i.e. news articles and Wikipedia. Numerous sub-tasks were set up: monolingual, where the questions and the target collections searched for answers were in the same language, and bilingual, where source and target languages differed. Bulgarian, Dutch, English, French, German, Italian, Portuguese, Romanian and Spanish were considered both as query and as target languages. Following the positive response at QA@CLEF 2006, the Answer Validation Exercise (AVE) and the Real Time QA exercise were offered again. A new pilot task was also added, Question Answering on speech transcripts (QAst), in which the answers to factual questions had to be extracted from spontaneous speech transcriptions (manual and automatic) coming from different human interaction scenarios. The track was organized by several institutions (one for each source language) and jointly coordinated by CELCT, Trento (IT) and UNED, Madrid (ES). See http://clef-qa.itc.it/.
Cross-Language Retrieval in Image Collections (ImageCLEF)
This track evaluated retrieval of images described by text captions in several languages; both text and image retrieval techniques were exploitable. Four challenging tasks were offered: (i) multilingual ad-hoc retrieval (a collection with mixed English/German/Spanish annotations, queries in several languages); (ii) medical image retrieval (case notes in English/French/German; visual, mixed and semantic queries in the same languages); (iii) hierarchical automatic image annotation for medical images (fully categorized in English and German, a purely visual task); (iv) photographic annotation through detection of objects in images (using the same collection as (i) with a restricted number of objects, a purely visual task). Image retrieval was not required for all tasks, and default visual and textual retrieval systems were made available to participants. The track coordinators were U. Sheffield (UK) and the U. and U. Hospitals of Geneva (CH). Oregon Health and Science U. (US), Victoria U. Melbourne (AU), RWTH Aachen (DE) and Vienna U. of Technology (AT) collaborated in the task organization. See http://ir.shef.ac.uk/imageclef/.
Cross-Language Speech Retrieval (CL-SR)
In 2006 the CL-SR track included search collections of conversational English and Czech speech, using six topic languages (Czech, Dutch, English, French, German and Spanish). In CLEF 2007, additional topics were added for the Czech speech collection, and additional speech recognition results were available for the English speech collection. Speech content was described by automatic speech transcriptions, manually and automatically assigned controlled vocabulary descriptors for concepts, dates and locations, manually assigned person names, and hand-written segment summaries. Additional resources (word lattices and audio files) could be made available. The track was coordinated by U. Maryland (US), Dublin City U. (IE) and Charles U. (CZ). See http://clef-clsr.umiacs.umd.edu/.
CLEF Web Track (WebCLEF)
WebCLEF 2007 was centered around a "dossier task": given a topic, typically a person, organization or event, systems compiled a dossier consisting of two types of textual objects: focused snippets from a given text collection that contribute important information about the topic, plus documents that contain additional information worth including in the dossier. The domain was cultural heritage (broadly construed), and the document collection was heterogeneous, consisting of a multilingual Wikipedia dump, a focused crawl of cultural heritage sites, and the top-ranked pages returned by a generic web search engine. The track was coordinated by U. Amsterdam (NL). Join the discussion on WebCLEF 2007 at http://ilps.science.uva.nl/WebCLEF/.
Cross-Language Geographical Information Retrieval (GeoCLEF)
The 2007 GeoCLEF track consisted of two parts: (1) a modification of the existing GeoCLEF search task and (2) a new query parsing task. As in previous years, part (1) examined geographic search of a text corpus; this year the geographic area to be searched was defined in both textual and machine-readable form. Documents were in English, German and Portuguese, with topics in a wider range of languages. How best to transform the imprecise description of a geographic area found in many user queries into a machine-readable form is an open research problem; this year, part (2), the “Query Parsing Task”, tackled it. This sub-task was run by Microsoft Research Asia, which supplied a substantial set of 800,000 Web queries to geo-parse, made available to all GeoCLEF participants. For further information (including details on track coordinators), see http://ir.shef.ac.uk/geoclef/.
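The query parsing problem — deciding which part of a free-text query names a geographic scope and which part names the actual information need — can be illustrated with a toy gazetteer lookup. The gazetteer entries, relation words, and function below are hypothetical illustrations, not the task definition used by Microsoft Research Asia:

```python
# Toy illustration of geo-parsing a Web query into a "what" (topic)
# part, a spatial relation, and a "where" (location) part.
GAZETTEER = {"beijing", "hamburg", "lisbon"}   # hypothetical place names
RELATIONS = {"in", "near", "around", "at"}     # spatial relation words

def geo_parse(query):
    """Return (what, relation, where); fields are None if absent."""
    tokens = query.lower().split()
    for i, tok in enumerate(tokens):
        if tok in GAZETTEER:
            # A relation word directly before the place name, if any.
            rel = tokens[i - 1] if i > 0 and tokens[i - 1] in RELATIONS else None
            what_end = i - (1 if rel else 0)
            what = " ".join(tokens[:what_end]) or None
            return what, rel, tok
    # No recognized place name: the whole query is the topic.
    return " ".join(tokens), None, None
```

For example, `geo_parse("hotels near Hamburg")` yields `("hotels", "near", "hamburg")`, while a non-geographic query like `"football scores"` is returned whole with no location. Real systems must also handle ambiguous place names, multi-word toponyms, and implicit locations, which is what makes the task an open research problem.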