GIRT - Mono- and Cross-language Domain-specific Information Retrieval (GIRT4)
Additional Information on the Data Structure and the Thesaurus Structure
GIRT Data Structure
GIRT is now divided into two distinct language parts: German (GIRT4-DE)
and English (GIRT4-EN). Each collection contains 151319 documents.
The two parts are described below.
GIRT4-DE
The following tagged fields occur in the German GIRT
data (GIRT4-DE):
- DOCNO (= original document number from the underlying
databases)
- DOCID (= unique identification number, equals
DOCNO)
- AUTHOR (= personal or institutional author of
the document)
- TITLE-DE (= German title, normally the original
title, if the document title was originally in German; otherwise the title
was translated into German by a human translator)
- PUBLICATION-YEAR
- LANGUAGE-CODE (= DE)
- CONTROLLED-TERM-DE (= controlled German term
from the Social Science Thesaurus; each document has at least one controlled
term assigned to it, and on average 10 controlled terms are assigned per document)
- METHOD-TERM-DE (= controlled German term for the methodology used,
only if applicable)
- METHOD-TEXT-DE (= German text on the methodology used
and the research design, only if applicable)
- CLASSIFICATION-TEXT-DE (= German text of the classification
assigned to the document; at least one entry is mandatory, and the field may
occur more than once)
- FREE-TERM-DE (= additional free term or keyword
in German, which is not a controlled term; available for less than 10 % of
the documents)
- TEXT-DE (= description or abstract of the
content of the document; available for 96.4 % of the documents)
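As a rough illustration of how the tagged fields listed above might be read, the following Python sketch collects the fields of one record into a dictionary. It assumes that a record is well-formed XML wrapped in a `DOC` element; the wrapper name, the sample values and the helper `parse_record` are assumptions made for this example and are not taken from the GIRT4 DTD. Repeatable fields such as CONTROLLED-TERM-DE are kept as lists.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical record: the <DOC> wrapper and every value are invented for illustration.
SAMPLE = """<DOC>
<DOCNO>EXAMPLE-0001</DOCNO>
<DOCID>EXAMPLE-0001</DOCID>
<TITLE-DE>Beispieltitel</TITLE-DE>
<PUBLICATION-YEAR>1999</PUBLICATION-YEAR>
<LANGUAGE-CODE>DE</LANGUAGE-CODE>
<CONTROLLED-TERM-DE>Jugend</CONTROLLED-TERM-DE>
<CONTROLLED-TERM-DE>Arbeitsmarkt</CONTROLLED-TERM-DE>
<CLASSIFICATION-TEXT-DE>Beispielklassifikation</CLASSIFICATION-TEXT-DE>
</DOC>"""


def parse_record(xml_text):
    """Collect every tagged field of one record into {tag: [values]}."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        # Fields may repeat, so each tag maps to a list of values.
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


record = parse_record(SAMPLE)
print(record["CONTROLLED-TERM-DE"])
```

Optional fields (for example METHOD-TERM-DE or FREE-TERM-DE) are simply absent from the dictionary when a record does not contain them, so `record.get(...)` is the safer way to access them.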
GIRT4-EN
The following tagged fields occur in the English GIRT
data (GIRT4-EN):
- DOCNO (= artificially generated document number,
which differs from the document number of the corresponding document in GIRT4-DE)
- DOCID (= unique identification number, equals
DOCNO)
- AUTHOR (= personal or institutional author of
the document)
- TITLE-EN (= original English title, or a human translation
of the title into English if the original title was in another language;
available for all documents)
- PUBLICATION-YEAR
- LANGUAGE-CODE (= EN)
- CONTROLLED-TERM-EN (= controlled English term
from the Social Science Thesaurus; each document has at least one controlled
term assigned to it, and on average 10 controlled terms are assigned per
document)
- METHOD-TERM-EN (= controlled English term for the methodology
used, only if applicable)
- CLASSIFICATION-TEXT-EN (= English text of the
classification assigned to the document; at least one entry is mandatory for
each document, and the field may occur more than once; the text is a human
translation of the German entry)
- TEXT-EN-HT (= human translation of the description
or abstract of the content of the document into English; available for about
9.1 % of all documents)
- TEXT-EN-MT (= machine translation of the description
or abstract of the content of the document into English; available for about
5.5 % of all documents. This machine translation by SYSTRAN is sometimes
vague and may contain untranslated German words or phrases where the MT system
was not able to identify an appropriate translation, but it is reliable
enough for searching.)
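Because only a small share of the English documents carry an abstract, an indexer has to decide which fields to use as searchable text. The sketch below shows one possible policy, not a prescribed one: it takes the title, then the human translation (TEXT-EN-HT) if present and otherwise the machine translation (TEXT-EN-MT), and finally the controlled terms. The function name `searchable_text` and the dictionary layout (field name mapped to a list of values) are assumptions made for this example.

```python
def searchable_text(record):
    """Build one searchable string from a GIRT4-EN record.

    `record` is assumed to map field names to lists of string values.
    """
    parts = list(record.get("TITLE-EN", []))
    # Prefer the human translation; fall back to the machine translation.
    abstract = record.get("TEXT-EN-HT") or record.get("TEXT-EN-MT") or []
    parts.extend(abstract)
    # Controlled terms are present in every document, so they always contribute.
    parts.extend(record.get("CONTROLLED-TERM-EN", []))
    return " ".join(parts)


# Hypothetical record with invented values.
example = {
    "TITLE-EN": ["Youth and the labour market"],
    "TEXT-EN-MT": ["Machine translated abstract"],
    "CONTROLLED-TERM-EN": ["youth", "labor market"],
}
print(searchable_text(example))
```

For documents without either abstract field, the result consists of the title and the controlled terms only.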
There is also a new DTD for GIRT4-DE
and GIRT4-EN.
General information on the domain-specific task and
the GIRT3 data (especially on the scope of GIRT2 and GIRT3 and on the
thesaurus) is given in an article
by Gey and Kluck.
GIRT Thesaurus Structure
German-English Thesaurus
The structure of the German-English
Thesaurus for the Social Sciences that is used for the GIRT data is
described in the ENGTHES
DTD. The following elements are defined:
- german (= the German descriptor used in the thesaurus)
- german-caps (= the German descriptor written
in capital letters; umlauts and eszett are expanded: Ä=AE, Ö=OE,
Ü=UE, ä=AE, ö=OE, ü=UE, ß=SS)
- scope-note-de (= text of a scope note in German)
- use-instead (= descriptor to be used instead
of the (non-descriptor) keyword; reference to the preferred synonym)
- use-combination (= combination of descriptors
to be used together instead of the keyword)
- broader-term (= German broader term)
- narrower-term (= German narrower term)
- related-term (= German related term)
- english-translation (= English translation of
the German descriptor; every descriptor has an English translation, but some
non-descriptors do not)
- scope-note-en (= text of a scope note in English)
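The expansion rule given for german-caps can be sketched in a few lines of Python. This is only an illustration of the mapping listed above (Ä=AE, Ö=OE, Ü=UE, ä=AE, ö=OE, ü=UE, ß=SS); the function name `german_caps` is an assumption made for this example, and the code is not the procedure that was used to build the thesaurus.

```python
# Expansion table exactly as listed for the german-caps element.
_EXPANSIONS = str.maketrans({
    "Ä": "AE", "Ö": "OE", "Ü": "UE",
    "ä": "AE", "ö": "OE", "ü": "UE",
    "ß": "SS",
})


def german_caps(descriptor):
    """Return the descriptor in capital letters with umlauts and eszett expanded."""
    # Expand the special characters first, then upper-case the remaining letters.
    return descriptor.translate(_EXPANSIONS).upper()


print(german_caps("Größe"))
```

Applying the same function to a query term before comparing it with the german-caps element gives a simple way to match descriptors regardless of case and umlaut spelling.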
German-Russian Translation Table
The German-Russian translation
table gives Russian equivalents for German descriptors, but no structural
thesaurus information. To preserve the Cyrillic characters, this table is formatted
in XML and encoded in UTF-8.
Two valuable resources have been made available to GIRT
participants by Fred Gey from Berkeley:
1. the German-English translation
table as an XML file at http://otlet.sims.berkeley.edu/thesaurus/english_thes.xml
2. the German-Russian translation
table with transliteration of the Cyrillic characters into Latin characters
as an XML file at http://otlet.sims.berkeley.edu/thesaurus/russian_thes_tr.xml
For any questions on the GIRT task, contact Michael
Kluck (kluck@bonn.iz-soz.de).
Michael Kluck
Informationszentrum Sozialwissenschaften (IZ)
Bonn, Germany
last revision: 05 December 2002