
Czech hierarchical lexicon

Hana Skoumalová
Institute of Theoretical and Computational Linguistics
Charles University, Prague

September 25, 1996

This work was supported by grant No. 72\94 of Research Support Scheme


The main aim of this work was to create an electronic lexicon that could serve various applications in Natural Language Processing. Such a lexicon should contain all the information these applications may need, and it should not depend on a particular theory.

In my lexicon I concentrate on morphological and syntactic information, so the lexicon can be used in many types of applications, from spell-checking and lemmatizing to syntactic analysis of text. Enriching the lexicon with another type of information (e.g. semantics) should not be very difficult because of the designed structure: the information in the lexicon is split into morphological and syntactic data, and these two parts do not overlap. In each of these parts the information is structured into a hierarchical network, which

In my description of the lexicon I will concentrate on verbs, as they are the most important class for the syntax. In all linguistic theories verbs play the key role in the structure of the sentence and the information about their frames (subcat lists) is the crucial part of the lexicon.

Theoretical background

Though I want to formulate the lexicon as theory independent, some theory must be present in the background. At the very least, the lexicon must include basic linguistic categories. The only requirement is that the notation can be converted to another notation (e.g., HPSG or Dependency Syntax).

As the ``background'' theory I used the theory developed by Sgall, Hajičová and Panevová [SHP86], especially the part dealing with verb frames [Pan74, Pan75, Pan80]. Two levels of syntactic description are distinguished: the underlying (i.e. deep) structure and the surface structure. In the underlying structure we work with inner participants (actants) and free modifications. Each verb can have up to five inner participants: Actor, Patient, Addressee, Origin and Effect. These inner participants are members of the verb frame and they are realized as the subject and objects in the surface structure. Some of them can be optional (facultative), which means that they need not be present in the sentence, in either the deep or the surface structure. Other participants are always obligatory in the deep structure but need not be obligatory in the surface structure. Those participants that can be omitted on the surface because they are known from the context are called obligatory deletable participants. Whether a participant is optional, obligatory or obligatory deletable can be tested by a question test. Let us imagine the following dialogue:

	 -Petr čte. 		 -Co? 		 -Nevím.

-Petr is reading. -What? -I don't know.

The answer `I don't know' is acceptable, as the speaker does not need to know what Petr is reading; the omitted participant is therefore optional. On the other hand, in the following dialogue, the sentence with a deleted Actor is acceptable but the answer `I don't know' is nonsensical. This shows us that the Actor is an obligatory deletable participant.

	 -Už přišel. 		 -Kdo? 		 -*Nevím.

-(He) has already come. -Who? -*I don't know.

In the next example the sentence is actually ungrammatical if the participant is omitted--this is clear evidence that the participant is obligatory:

	 *Petr daroval.

*Petr donated.

Besides the inner participants, the verb frame may also have other members--adverbials (adjuncts, free modifications)--but only if they are obligatory in the deep structure. They can, however, be deletable on the surface:

	 -Petr přišel. 		 -Kam? 		 -*Nevím.

-Petr came. -Where? -*I don't know.
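
The question test can be read as a small decision procedure. The following Python sketch is only an illustration and is not part of the QDATR implementation; the function name and the two boolean judgements are assumptions introduced here, while the three labels are the values used later in the lexicon entries.

```python
from enum import Enum


class Obligatoriness(Enum):
    OPTIONAL = "optional"
    OBLIG_DELETABLE = "oblig_deletable"
    OBLIGATORY = "obligatory"


def classify(omissible_on_surface: bool, dont_know_acceptable: bool) -> Obligatoriness:
    """Classify a frame member from two native-speaker judgements.

    omissible_on_surface: the sentence stays grammatical without the member.
    dont_know_acceptable: `Nevím' is a sensible answer to the question
    asking for the omitted member.
    """
    if not omissible_on_surface:
        return Obligatoriness.OBLIGATORY
    if dont_know_acceptable:
        return Obligatoriness.OPTIONAL
    return Obligatoriness.OBLIG_DELETABLE


# Judgements read off the examples in the text.
JUDGEMENTS = {
    "Petr čte. -Co? -Nevím.": (True, True),
    "Už přišel. -Kdo? -*Nevím.": (True, False),
    "Petr přišel. -Kam? -*Nevím.": (True, False),
    "*Petr daroval.": (False, False),
}

for example, (omissible, dont_know) in JUDGEMENTS.items():
    print(example, "->", classify(omissible, dont_know).value)
```

The judgements themselves still have to come from a speaker; the sketch only fixes how the two answers combine into one of the three values.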

Nouns and adjectives can also have frames. Their repertoire of actants, however, is larger than that of verbs. Besides the five verbal inner participants, nouns and adjectives can have the Partitive, Appurtenance and Identity participants. And, of course, free modifications can occur in their frames as well.

The hierarchical structure

The lexicon is designed as a hierarchical structure with defaults and multiple inheritance. Every word (lemma) inherits attributes and values from the morphological part of the network and from the syntactic part. These two parts do not overlap, so that we do not need to cope with the problem of contradiction in inherited information (the problem known as ``Nixon's diamond'').
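
To make the idea of defaults with multiple inheritance concrete, here is a minimal Python sketch. It does not reproduce DATR semantics; the class `Node`, the node names VERB_MOR, VERB_SYN and REFL_SYN, and the default values placed on them are hypothetical and serve only to show why two disjoint sets of paths cannot yield contradictory inherited values.

```python
class Node:
    """A lexicon node: local values override defaults inherited from parents."""

    def __init__(self, name, parents=(), values=None):
        self.name = name
        self.parents = tuple(parents)
        self.values = dict(values or {})

    def get(self, path):
        """Return the local value for a path, else the first inherited one."""
        if path in self.values:
            return self.values[path]
        for parent in self.parents:
            try:
                return parent.get(path)
            except KeyError:
                continue
        raise KeyError(path)

    def paths(self):
        """All paths defined on this node or on any of its ancestors."""
        found = set(self.values)
        for parent in self.parents:
            found |= parent.paths()
        return found


# Hypothetical supernodes; only the path names follow the lexicon.
VERB_MOR = Node("VERB_MOR", values={"mor aspect": "imperf"})
VERB_SYN = Node("VERB_SYN", values={"syn cat": "V", "syn refl": "no"})
REFL_SYN = Node("REFL_SYN", parents=[VERB_SYN], values={"syn refl": "se"})

BAT_SE = Node("bát se", parents=[VERB_MOR, REFL_SYN],
              values={"gloss": "bojí se strašidel"})

print(BAT_SE.get("syn refl"))  # the override on REFL_SYN wins over the default
print(VERB_MOR.paths().isdisjoint(REFL_SYN.paths()))
```

Because the morphological parent defines only `mor` paths and the syntactic parent only `syn` paths, the order in which the parents are searched never changes the result.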

For the implementation of the lexicon I used QDATR--a Düsseldorf version of the Sussex system DATR. The implementation in QDATR serves mainly for defining the structure of the lexicon and for stock-taking of possible frames of Czech verbs.

Content of the lexicon

Every lemma contains the following four main parts:

base form
--the base form of the word, possibly with an index
gloss
--a short description of the meaning or an example of usage of the word
morphological information
--all possible forms of the word
syntactic information
--the frame, information about passivization, and other information necessary for the analysis/generation

I will demonstrate the structure on an example. The word bát se (to be afraid) with the frame bát se čeho/že/aby (to be afraid of/that) is encoded in the following way:

     <gloss> == bojí se strašidel;
                       ... že nepřijdou;
                       ... aby nepřišli
     <mor> == BÁT
     <syn> == RSE_F[2|clz|cla]@.

The base form serves as a node in the DATR hierarchy. The gloss, as it is unique for every word, must be inserted for every word separately. The <mor> and <syn> information, however, can be inherited from some supernodes. As an output of the DATR theory we get this result:

     <gloss> = bojí se strašidel;
                       ... že nepřijdou;
                       ... aby nepřišli
     <mor aspect> = imperf
     <mor infl pres 1p sg> = bojím
     <mor infl pres 2p sg> = bojíš
     <mor infl pres 3p sg> = bojí
     <mor infl pres 1p pl> = bojíme
     <mor infl pres 2p pl> = bojíte
     <mor infl pres 3p pl> = bojí
     <mor infl past ma sg> = bál
     <mor infl past mi sg> = bál
     <mor infl past f sg> = bála
     <mor infl past n sg> = bálo
     <mor infl past ma pl> = báli
     <mor infl past mi pl> = bály
     <mor infl past f pl> = bály
     <mor infl past n pl> = bála
     <mor infl fut 1p sg> = ! not evaluable !
     <mor infl fut 2p sg> = ! not evaluable !
     <mor infl fut 3p sg> = ! not evaluable !
     <mor infl fut 1p pl> = ! not evaluable !
     <mor infl fut 2p pl> = ! not evaluable !
     <mor infl fut 3p pl> = ! not evaluable !
     <mor infl inf> = bát, báti
     <mor infl imp 2p sg> = boj
     <mor infl imp 1p pl> = bojme
     <mor infl imp 2p pl> = bojte
     <mor infl pass ma sg> = ! not evaluable !
     <mor infl pass mi sg> = ! not evaluable !
     <mor infl pass f sg> = ! not evaluable !
     <mor infl pass n sg> = ! not evaluable !
     <mor infl pass ma pl> = ! not evaluable !
     <mor infl pass mi pl> = ! not evaluable !
     <mor infl pass f pl> = ! not evaluable !
     <mor infl pass n pl> = ! not evaluable !
     <mor infl transgr pres m> = boje
     <mor infl transgr pres fn> = bojíc
     <mor infl transgr pres pl> = bojíce
     <mor infl transgr past m> = ! not evaluable !
     <mor infl transgr past fn> = ! not evaluable !
     <mor infl transgr past pl> = ! not evaluable !
     <mor infl neg pres 1p sg> = nebojím
     <syn cat> = V
     <syn type> = main
     <syn refl> = se
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> = NPgen , CLže , CLaby
     <syn 1_obj deep> = Patient
     <syn 1_obj oblig> = optional
     <syn 2_obj surf> = []
     <syn 2_obj deep> = []
     <syn 2_obj oblig> = []
     <syn 3_obj surf> = []
     <syn 3_obj deep> = []
     <syn 3_obj oblig> = []
     <syn 4_obj surf> = []
     <syn 4_obj deep> = []
     <syn 4_obj oblig> = []
     <syn oadv surf> = []
     <syn oadv oblig> = []
     <syn pass> = no.

The morphological part is very large, as Czech has very rich inflection. The forms under <mor infl pres> describe the present tense of the verb, <mor infl past> describes past participle forms (for masculine animate, masculine inanimate, feminine and neuter), <mor infl fut> is for the future tense, <mor infl inf> is infinitive, <mor infl imp> is imperative, <mor infl pass> is passive participle, <mor infl transgr> is transgressive, and the structure <mor infl neg> describes negative forms of the verb. If certain forms do not exist, their value is ! not evaluable !.
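
The relation between a path and a form can be illustrated with a short Python sketch. The endings below are simply read off the output for bát se shown above; they are an assumption for illustration and do not claim to be the paradigm definitions stored in the lexicon. The function names are likewise hypothetical.

```python
NOT_EVALUABLE = "! not evaluable !"

# Endings read off the listed output of bát se (illustrative only).
PRES_ENDINGS = {
    ("1p", "sg"): "ím", ("2p", "sg"): "íš", ("3p", "sg"): "í",
    ("1p", "pl"): "íme", ("2p", "pl"): "íte", ("3p", "pl"): "í",
}
PAST_ENDINGS = {
    ("ma", "sg"): "", ("mi", "sg"): "", ("f", "sg"): "a", ("n", "sg"): "o",
    ("ma", "pl"): "i", ("mi", "pl"): "y", ("f", "pl"): "y", ("n", "pl"): "a",
}


def paradigm(pres_stem, past_stem):
    """Build a path -> form table for the present and past forms."""
    forms = {}
    for (person, number), ending in PRES_ENDINGS.items():
        forms[f"<mor infl pres {person} {number}>"] = pres_stem + ending
    for (gender, number), ending in PAST_ENDINGS.items():
        forms[f"<mor infl past {gender} {number}>"] = past_stem + ending
    return forms


def lookup(forms, path):
    """Forms that do not exist are reported the same way as in the output."""
    return forms.get(path, NOT_EVALUABLE)


BAT = paradigm(pres_stem="boj", past_stem="bál")
print(lookup(BAT, "<mor infl pres 1p sg>"))
print(lookup(BAT, "<mor infl fut 1p sg>"))
```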

The syntactic part informs that the category of the word is V (verb), its type is `main verb', the verb is intrinsic reflexive (i.e. it requires the reflexive particle se), and cannot be passivized. Further, it describes the valency frame of the verb: The frame has two members--Actor and Patient in the deep structure, which play roles of the subject and an object in the surface structure. Their surface forms are a noun phrase in Nominative for the subject, and a noun phrase in Genitive or a clause connected by the conjunction že (that) or a clause connected by the conjunction aby (so that) for the object. Actor is obligatory in the deep structure but deletable on the surface, while Patient is optional. The verb frames and the problem of passivization will be described more thoroughly in the next section.
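
As noted in the conclusion, an application has to interpret the stored information. One possible interpretation of the frame is sketched below in Python: the frame of bát se is copied from the output above, while the dictionary layout and the function `licenses` are assumptions made for this illustration.

```python
# The frame of bát se, copied from the <syn ...> output.
BAT_SE_FRAME = {
    "subj": {"surf": ["NPnom"], "deep": "Actor", "oblig": "oblig_deletable"},
    "1_obj": {"surf": ["NPgen", "CLže", "CLaby"], "deep": "Patient",
              "oblig": "optional"},
}


def licenses(frame, realized):
    """Check the surface members found in a sentence against a frame.

    realized maps a slot name to the surface form observed for it.
    Optional and obligatory deletable members may be absent;
    members marked obligatory must be present.
    """
    for slot, form in realized.items():
        if slot not in frame or form not in frame[slot]["surf"]:
            return False
    for slot, spec in frame.items():
        if spec["oblig"] == "obligatory" and slot not in realized:
            return False
    return True


print(licenses(BAT_SE_FRAME, {"subj": "NPnom", "1_obj": "NPgen"}))
print(licenses(BAT_SE_FRAME, {"subj": "NPnom", "1_obj": "NPacc"}))
```

The first call corresponds to a sentence with a Genitive object, the second to an Accusative object, which the frame does not admit.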

Verb frames and the passive voice

In Czech two kinds of passive exist:

periphrastic passive
uses the auxiliary verb být (to be) and a passive participle
kniha je čtena--the book is read
reflexive passive
(mediopassive or impersonal passive) uses a finite form of the verb and the reflexive particle se
kniha se čte--the book reads SELF
jde se--it goes SELF

The information about the kind of passive is sufficient for us--there is no need to store the passive constructions in the lexicon, as the passive can be created according to the following rules:

The only verbs which could be exceptional are the ditransitive verbs (verbs with two Accusatives in the frame). There are only two such verbs in Czech:

Verbs with the infinitive in their frames

In this group of verbs we have to describe not only the frame of the verb, but also the interaction between the higher verb and the lower verb (the infinitive)--which members of the frames they share, whether the infinitive can be passivized and other constraints that hold for the verbs.

These verbs are usually divided into two subclasses: equi and raising verbs. In both cases the subject of the infinitive is one of the actants of the higher verb, but the difference between these two sorts of verbs is in the deep structure:

equi verb
the frame of the infinitive overlaps with the frame of the control verb; the shared actants occur twice in the deep structure
raising verb
the frame of the infinitive does not overlap with the frame of the higher verb, but on the surface, the verb `raises' the subject of the infinitive as its own subject or object; this actant occurs only once in the deep structure

Raising verbs

First, we will deal with subject raising verbs. This term means that the subject of the infinitive becomes the subject of the higher verb. This group of verbs contains the modal and aspectual verbs. Examples:

Petr(i) smí [__(i) odejít].
Petr may to-leave.

Začne [pršet].
Will-start to-rain.

Petr(i) musí [__(i) být pochválen].
Petr must be praised.

Musí [__(i) se zabít] dvě mouchy(i) jednou ranou.
Must SELF to-kill two flies by one hit.
`Two flies must be killed by one hit.'

We see in the examples that the two subjects are shared, no matter which voice is used in the infinitive construction. The raising verb, however, cannot be passivized.

In my lexicon I encode subject raising verbs this way:

     <syn refl> = no
     <syn subj surf> = 1_obj:<syn subj surf>
     <syn subj deep> = []
     <syn subj oblig> = 1_obj:<syn subj oblig>
     <syn 1_obj surf> =
                 VPinf [pass = perif , refl]
     <syn 1_obj deep> = []
     <syn 1_obj oblig> = obligatory
     <syn pass> = no.
The description of <syn subj> contains only pointers to the subject of infinitive (on the surface) and the value of <syn subj deep> is empty.
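
How such a pointer might be followed by an application is sketched below in Python. The pointer syntax is taken from the entry above; the resolver function and the frame values given for odejít are assumptions made for the illustration, not entries from the lexicon.

```python
POINTER_PREFIX = "1_obj:"

# The subject part of a subject raising frame, copied from the entry above.
SUBJECT_RAISING = {
    "syn subj surf": "1_obj:<syn subj surf>",
    "syn subj deep": [],
    "syn subj oblig": "1_obj:<syn subj oblig>",
}

# Assumed subject description of the infinitive odejít (hypothetical values).
ODEJIT = {
    "syn subj surf": "NPnom",
    "syn subj deep": "Actor",
    "syn subj oblig": "oblig_deletable",
}


def resolve(higher, path, infinitive):
    """Return the value of a path, following a pointer into the infinitive."""
    value = higher[path]
    if isinstance(value, str) and value.startswith(POINTER_PREFIX):
        target = value[len(POINTER_PREFIX):].strip("<>")
        return infinitive[target]
    return value


for path in SUBJECT_RAISING:
    print(path, "=", resolve(SUBJECT_RAISING, path, ODEJIT))
```

The surface form and obligatoriness of the subject come from the infinitive, while the deep value stays empty, so the shared actant is counted only once in the deep structure.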

The infinitive can occur in both passives; whether a given passive is possible depends only on the verb that occurs as the infinitive. The higher verb occurs only in the active voice.

Subject-to-object raising verbs have an infinitive in the frame, and the subject of this infinitive becomes an object of the higher verb. This group contains the verbs of perception:

Vidím ho(i) [__(i) přicházet].
I-see him to-be-coming.
`I see him come.'

In the lexicon the frame is encoded this way:

     <syn refl> = no
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> = 1_obj:<syn subj surf> [NPnom = ^NPacc]
     <syn 1_obj deep> = []
     <syn 1_obj oblig> = obligatory
     <syn 2_obj surf> = VPinf [pass = no]
     <syn 2_obj deep> = Patient
     <syn 2_obj oblig> = obligatory
     <syn pass> = no.
The description of <syn 1_obj> contains a pointer to the subject of the infinitive, with the constraint on the case: the Nominative must be changed to Accusative. It further overwrites the value of <syn 1_obj oblig> to obligatory.
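
The combination of pointer, case constraint and overwritten obligatoriness can be spelled out in a few lines of Python. This is a sketch under assumptions: the function name and the dictionary layout are hypothetical, and the subject description of the infinitive is assumed rather than taken from the lexicon.

```python
def raise_to_object(infinitive_subject, case_rewrite):
    """Build the raised object of the higher verb from the infinitive's subject.

    case_rewrite maps the surface form of the subject to the form required
    for the object, as in the constraint [NPnom = ^NPacc].
    """
    surf = infinitive_subject["surf"]
    return {
        "surf": case_rewrite.get(surf, surf),  # Nominative becomes Accusative
        "deep": [],                            # no deep role in the higher frame
        "oblig": "obligatory",                 # overwritten, not inherited
    }


# Assumed subject of the infinitive přicházet (hypothetical values).
PRICHAZET_SUBJ = {"surf": "NPnom", "deep": "Actor", "oblig": "oblig_deletable"}

print(raise_to_object(PRICHAZET_SUBJ, {"NPnom": "NPacc"}))
```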

Equi verbs

The subject and possibly some objects of the infinitive are coindexed with members of the frame of the control verb, but in the deep structure, these actants are present twice.

(Nom(Act))Oni(i) mu slíbili [(Nom(Act))__(i) přijít].
They to-him promised to-come.

Oni (Dat(Addr))mu(i) poručili [(Nom(Act))__(i) přijít].
They to-him ordered to-come.

Subject-control with Addressee-Benefactor coindexation:
(Nom(Act))Oni(i) (Dat(Addr))mu(j) slíbili [(Nom(Act))__(i) donést knihu (Dat(Ben))__(j)].
They to-him promised to-bring book.

An object of the infinitive is the subject of the control verb:
(Nom(Act))Plot(i) chce [(Nom(Act))__ natřít (Acc(Pat))__(i)].
Fence wants to-paint.

The structure can be ambiguous--either subject-control or subject-object coindexation:
(Nom(Act))Anežka(i) chce [(Nom(Act))__(i) číst pohádky].
`Anežka wants to read tales.'
(Nom(Act))Anežka(i) chce [(Nom(Act))__ číst pohádky (Dat(Ben))__(i)].
`Anežka wants someone to read tales to her.'

The description of the frame of the verb bát se (to be afraid to do sth) looks like this:

     <syn refl> = se
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> = 
          VPinf [subj = ^Actor; pass = perif]
     <syn 1_obj deep> = Patient
     <syn 1_obj oblig> = obligatory
     <syn pass> = no.
The description of <syn 1_obj surf> contains the information that the subject of the infinitive is coindexed with the Actor of the control verb. I use this cross-referencing between two strata of the linguistic description because the relationship between the Actor and the subject of the infinitive is preserved even in the passive voice of the infinitive and/or the control verb. I will demonstrate this behaviour on the verb chtít (to want):

Two active voices:
(Nom(Act))(i) chci [(Act)__(i) pochválit (Pat)Petra] .
I want to praise Petr.

Active--periphrastic passive:
(Nom(Act))Petr(i) chce [(Pat)__(i) být pochválen] .
Petr wants to-be praised.

(Nom(Act))Anežka(i) chce [(Addr)__(i) být poučena o hudbě] .
Anežka wants to-be instructed in music.

Mediopassive--periphrastic passive:
Nechce se (Dat(Act))mi(i) [(Pat)__(i) být bit] .
`I don't want to be beaten'.

The frame of the verb chtít looks like this:

     <syn refl> = no
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> = 
              VPinf [subj = ^Actor;
                     pass = perif , refl]
     <syn 1_obj deep> = Patient
     <syn 1_obj oblig> = obligatory
     <syn pass> = refl.
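
The effect of the constraint subj = ^Actor across voices can be sketched in Python. The sketch assumes, on the basis of the examples above only, that the periphrastic passive promotes the member realized as an Accusative in the active voice; the function name and the simplified infinitive frames are hypothetical.

```python
def subject_role(infinitive_frame, voice):
    """Deep role of the infinitive's surface subject in the given voice.

    infinitive_frame maps deep roles to their surface forms in the active.
    """
    if voice == "active":
        return "Actor"
    if voice == "perif":
        for role, form in infinitive_frame.items():
            if form == "NPacc":
                return role
        raise ValueError("no Accusative member to promote")
    raise ValueError(f"unknown voice: {voice}")


# Simplified frames; other members of the frames are left out.
POCHVALIT = {"Actor": "NPnom", "Patient": "NPacc"}
POUCIT = {"Actor": "NPnom", "Addressee": "NPacc"}

# With subj = ^Actor, the role returned here is the one coindexed
# with the Actor of chtít.
print(subject_role(POCHVALIT, "active"))  # chci pochválit Petra
print(subject_role(POCHVALIT, "perif"))   # Petr chce být pochválen
print(subject_role(POUCIT, "perif"))      # Anežka chce být poučena o hudbě
```

The constraint itself never mentions Patient or Addressee; the deep role that ends up coindexed with the higher Actor follows from the voice of the infinitive.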

The verb chtít, however, has two more frames containing the infinitive. One of them has the relation between higher Actor and lower Patient:

(Nom(Act))Plot(i) chce [(Act)__ natřít (Acc(Pat))__(i)] .
Fence wants to-paint.
`The fence needs painting.'

(Nom(Act))Pepík(i) chce [(Act)__ nařezat (Dat(Pat))__(i)] .
Pepík wants to-spank.
`Pepík needs spanking.'

     <syn refl> = no
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> =
              VPinf [Patient = ^Actor;
                     refl = no; pass = no]
     <syn 1_obj deep> = Patient
     <syn 1_obj oblig> = obligatory
     <syn pass> = no.

The other has the relation between higher Actor and lower Addressee/Benefactor:

(Nom(Act))Anežka(i) chce [(Act)__ podat knihu (Dat(Addr))__(i)] .
Anežka wants to-pass book.
`Anežka wants someone to pass her the book.'

(Nom(Act))Anežka(i) chce [(Act)__ přečíst pohádku (Dat(Ben))__(i)] .
Anežka wants to-read tale.
`Anežka wants someone to read her a tale.'

(Nom(Act))Anežka(i) chce [(Act)__ poučit o hudbě (Acc(Addr))__(i)] .
Anežka wants to-instruct in music.
`Anežka wants someone to instruct her in music.'

     <syn refl> = no
     <syn subj surf> = NPnom
     <syn subj deep> = Actor
     <syn subj oblig> = oblig_deletable
     <syn 1_obj surf> = 
              VPinf [Addr/Benef = ^Actor;
                     refl = no; pass = no]
     <syn 1_obj deep> = Patient
     <syn 1_obj oblig> = obligatory
     <syn pass> = no.
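
The three frames of chtít with an infinitive explain the ambiguity shown earlier for Anežka chce číst pohádky. The Python sketch below enumerates the candidate coindexations for a given infinitive; it is a simplification under assumptions (the function name and the notion of `open roles', meaning members of the infinitive's frame not realized in the sentence, are introduced here), and it only lists syntactic candidates without choosing among them.

```python
# Constraint of each chtít frame and the infinitive roles it can bind.
CHTIT_FRAMES = [
    ("subj = ^Actor", None),  # subject control, always a candidate
    ("Patient = ^Actor", {"Patient"}),
    ("Addr/Benef = ^Actor", {"Addressee", "Benefactor"}),
]


def candidate_readings(open_roles):
    """List the chtít frames whose constraint can apply to the infinitive.

    open_roles: deep roles of the infinitive that are not realized overtly.
    """
    readings = []
    for constraint, targets in CHTIT_FRAMES:
        if targets is None or targets & set(open_roles):
            readings.append(constraint)
    return readings


# číst pohádky: the Patient is realized, a Benefactor is left open.
print(candidate_readings({"Benefactor"}))
```

For číst pohádky the sketch yields both the subject-control frame and the Addressee/Benefactor frame, matching the two translations given above.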


The main goal of the lexicon is to map all morphological and syntactic phenomena in Czech that are important for NLP. The information stored in the lexicon cannot be used in an application `as is' but must be interpreted. This interpretation, however, should not be very difficult--I tried to make the notation as natural and intuitive as possible.

There are, of course, still some open questions to be answered. Some of them are listed here:

These questions are very interesting from the linguistic point of view, but on the whole they concern only a very limited group of words. The majority of Czech verbs can be described within my framework without difficulty.


Ted Briscoe, Valeria de Paiva, and Ann Copestake, editors. Inheritance, Defaults, and the Lexicon. Cambridge University Press, 1993.

Roger Evans and Gerald Gazdar, editors. The DATR Papers, Volume 1. Number 139 in CSRP. University of Sussex, Brighton, 1990.

Jarmila Panevová. On verbal frames in functional generative description, part I. Prague Bulletin of Mathematical Linguistics, 22:3-40, 1974.

Jarmila Panevová. On verbal frames in functional generative description, part II. Prague Bulletin of Mathematical Linguistics, 23:17-52, 1975.

Jarmila Panevová. Formy a funkce ve stavbě české věty (Forms and Functions in Syntax of Czech Sentence). Number 13 in Studie a práce lingvistické. Academia, Prague, 1980.

Carl Pollard and Ivan A. Sag. Information-Based Syntax and Semantics, Volume 1, Fundamentals. Number 13 in CSLI Lecture Notes. CSLI, Stanford, 1987.

Petr Sgall, Eva Hajičová, and Jarmila Panevová. The Meaning of the Sentence in Its Semantic and Pragmatic Aspects. D. Reidel Publishing Company, Dordrecht, 1986.

Karel Svoboda. Infinitiv v současné spisovné češtině (Infinitive in Contemporary Standard Czech). Rozpravy ČSAV. Academia, Prague, 1962.

Vladimír Šmilauer. Novočeská skladba (Syntax of Modern Czech). Academia, Prague, 1967.

Hana Skoumalová
Wed Nov 20 10:00:12 MET 1996