CzeSL – a Learner Corpus of Czech
Available versions
CzeSL-plain
- Transcribed texts, 2 million words, without annotation or metadata
- Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz)
- Academic texts written by foreign students in Czech (kval)
- Texts written by Czech students with Romani background (rom)
- Searchable from the Czech National Corpus site
- Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
- AKCES 3 – includes texts produced by non-native students of Czech
- AKCES 4 – includes texts produced by students growing up in socially excluded communities
- Description in English and Czech
CzeSL-SGT
- Czech as a Second Language with Spelling, Grammar and Tags
- Transcribed texts, 1 million words
- Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
- Original forms and automatic corrections are tagged, lemmatized and assigned error labels
- Most texts have metadata attributes (30 items) about the author and the text
- Searchable from the Czech National Corpus site; the metadata available there are in Czech
- Downloadable as AKCES 5 (CzeSL-SGT) Release 2 with the metadata in English.
The original release is still available from AKCES 5 (CzeSL-SGT).
We recommend Release 2, which fixes a number of bugs, including a metadata issue:
native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language.
Release 2 is now also a validated XML document with all annotation represented as XML attributes.
- Description in English and Czech
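Since Release 2 represents all annotation as XML attributes, it can be read with a standard XML parser. The following is only a minimal sketch of that idea: the element name `w`, the attribute names `form`, `corrected` and `err`, and the sample document are hypothetical placeholders, not the actual Release 2 schema, which should be checked in the corpus description.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample: the element and attribute names are illustrative
# assumptions, not the real CzeSL-SGT Release 2 markup.
SAMPLE = """<corpus>
  <doc id="d1">
    <s>
      <w form="sem" corrected="jsem" err="form">sem</w>
      <w form="student" corrected="student" err="">student</w>
    </s>
  </doc>
</corpus>"""


def count_error_labels(xml_text, token_tag="w", label_attr="err"):
    """Count non-empty error labels stored as attributes on token elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for token in root.iter(token_tag):
        label = token.get(label_attr, "")
        if label:  # skip tokens without an error label
            counts[label] += 1
    return counts


if __name__ == "__main__":
    print(count_error_labels(SAMPLE))
```

For a file the size of the full corpus, `ET.iterparse` would likely be a better fit than loading the whole document at once.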
CzeSL-man
- Includes texts collected, transcribed and manually annotated within the ESF project; see the description in English.
- Texts searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
- Downloadable from here in the feat format, with metadata
- Coming soon: CzeSL-man searchable from the Czech National Corpus site and downloadable from the LINDAT/CLARIN repository.
Tools
Additional documentation
- Annotation manual in Czech
- Transcription manual in Czech
- Appendix to Transcription manual in Czech
- Transcription manual – summary in English
Please note that formatting specifics concerning the markup of manuscripts in
the transcription manual, its appendix and the summary below are outdated.
Bibliography
Alexandr Rosen
Last modified: Tue Oct 9 20:06:25 CEST 2018