GIRT - Mono- and Cross-language Domain-specific Information Retrieval (GIRT4)

Additional Information on the Data Structure and the Thesaurus Structure

GIRT Data Structure

GIRT is now divided into two distinct language parts: German (GIRT4-DE) and English (GIRT4-EN). The total number of documents in each collection is 151319. They are described as follows:

GIRT4-DE

The following tagged fields occur in the German GIRT data (GIRT4-DE):

GIRT4-EN

The following tagged fields occur in the English GIRT data (GIRT4-EN):

There is also a new DTD for GIRT4-DE and GIRT4-EN.

General information on the domain-specific task and the GIRT3 data (especially on the scope of GIRT2 and GIRT3 and on the thesaurus) is given in an article by Gey and Kluck.

GIRT Thesaurus Structure

German-English Thesaurus

The structure of the German-English Thesaurus for the Social Sciences, which is used for the GIRT data, is described in the ENGTHES DTD. The following elements are given:

German-Russian Translation Table

The German-Russian translation table gives Russian equivalents for German descriptors, but no structural thesaurus information. To preserve the Cyrillic characters, this table is formatted in XML and encoded in UTF-8.
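As a rough illustration of how a UTF-8 encoded XML translation table of this kind could be read, here is a minimal Python sketch. The element names (table, entry, german, russian) and the two sample entries are assumptions made up for this example; they are not taken from the actual GIRT table, whose real element names should be looked up in the file itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the element names <table>, <entry>, <german>
# and <russian> are assumptions for illustration only, not the tags used
# in the real GIRT German-Russian translation table.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<table>
  <entry><german>Arbeit</german><russian>труд</russian></entry>
  <entry><german>Familie</german><russian>семья</russian></entry>
</table>
""".encode("utf-8")


def load_translations(xml_bytes):
    """Return a dict mapping German descriptors to Russian equivalents."""
    # Parsing from bytes lets the parser honour the UTF-8 declaration,
    # so the Cyrillic characters are decoded correctly.
    root = ET.fromstring(xml_bytes)
    table = {}
    for entry in root.iter("entry"):
        german = entry.findtext("german")
        russian = entry.findtext("russian")
        if german and russian:
            table[german.strip()] = russian.strip()
    return table


translations = load_translations(SAMPLE)
```

With the real file, the bytes would be read from disk in binary mode and the tag names adjusted to match the table's actual structure.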

Fred Gey from Berkeley has made two valuable resources available to GIRT participants:

    1. the German-English translation table as an XML file at http://otlet.sims.berkeley.edu/thesaurus/english_thes.xml

    2. the German-Russian translation table, with the Cyrillic characters transliterated into Latin characters, as an XML file at http://otlet.sims.berkeley.edu/thesaurus/russian_thes_tr.xml
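To give an idea of what transliterating Cyrillic into Latin characters involves, the following Python sketch applies a simple character-by-character mapping. The mapping shown is an assumption chosen for illustration; the scheme actually used to produce the transliterated table is not stated here and may differ.

```python
# Illustrative Cyrillic-to-Latin map for lowercase Russian letters.
# This scheme is an assumption for the example; it is not necessarily
# the one used for the transliterated GIRT table.
TRANSLIT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ё": "jo", "ж": "zh", "з": "z", "и": "i", "й": "j", "к": "k",
    "л": "l", "м": "m", "н": "n", "о": "o", "п": "p", "р": "r",
    "с": "s", "т": "t", "у": "u", "ф": "f", "х": "kh", "ц": "c",
    "ч": "ch", "ш": "sh", "щ": "shh", "ъ": "\"", "ы": "y", "ь": "'",
    "э": "eh", "ю": "ju", "я": "ja",
}


def transliterate(text):
    """Replace each lowercase Cyrillic letter by its Latin equivalent.

    Characters without an entry in the map (Latin letters, digits,
    punctuation, uppercase Cyrillic) are passed through unchanged.
    """
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


example = transliterate("труд")
```

A table lookup like this is only reversible if the chosen scheme is unambiguous, which is one reason the UTF-8 version of the table remains the authoritative one.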


For any questions on the GIRT task, contact Michael Kluck (kluck@bonn.iz-soz.de).

Michael Kluck
Informationszentrum Sozialwissenschaften (IZ)
Bonn, Germany

last revision: 05 December 2002