CzeSL – a Learner Corpus of Czech

The Corpus of Czech as a Second Language
A part of the AKCES/CLAC project (the Czech Language Acquisition Corpora)
For the official site see AKCES – Akviziční korpusy českého jazyka (in Czech)
An outdated site: Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk (Investments into Teaching Czech as a Foreign Language)
Funded since 2009 by several projects:
- 2009–2012: European Social Funds (ESF) – Innovative approach to teaching Czech as a second language, no. CZ.1.07/2.2.00/07.0119
- 2012–2016: Ministry of Education, Youth and Sports – Czech National Corpus, no. LM2011023
- 2016–2018: Grant Agency of the Czech Republic – Non-native Czech from the Theoretical and Computational Perspective, no. 16-10185S

Available versions

CzeSL-plain
Transcribed texts, 2 million words, without annotation or metadata
Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz)
- Academic texts written by foreign students in Czech (kval)
- Texts written by Czech students with Romani background (rom)
Searchable from the Czech National Corpus site
Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
- AKCES 3 – includes texts produced by non-native students of Czech
- AKCES 4 – includes texts produced by students growing up in socially excluded communities
Description in English and Czech

CzeSL-SGT: Czech as a Second Language with Spelling, Grammar and Tags
Transcribed texts, 1 million words
Extends the “foreign” (ciz) part of CzeSL-plain with texts collected in 2013
Original forms and automatic corrections are tagged, lemmatized and assigned error labels
Most texts have metadata attributes (30 items) about the author and the text
Searchable from the Czech National Corpus site; the metadata available there are in Czech
Downloadable as AKCES 5 (CzeSL-SGT) Release 2, with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata error that labelled native speakers of Ukrainian as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document, with all annotation represented as XML attributes.
Description in English and Czech
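
Since Release 2 stores all annotation as XML attributes, a standard XML parser is enough to read it. The following is a minimal sketch only: the element name `word` and the attribute names `lemma`, `corrected` and `err` are hypothetical placeholders, as is the error label `spell`; the actual names are given in the corpus description and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element and attribute names are assumptions made
# for illustration, not the documented CzeSL-SGT schema.
SAMPLE_XML = """<doc id="sample-1">
  <s>
    <word lemma="být" corrected="je" err="">je</word>
    <word lemma="hezký" corrected="hezký" err="">hezký</word>
    <word lemma="den" corrected="den" err="spell">dnn</word>
  </s>
</doc>"""


def corrected_tokens(xml_text):
    """Return (original, corrected, error label) for every token whose
    automatically corrected form differs from the original form."""
    root = ET.fromstring(xml_text)
    rows = []
    for w in root.iter("word"):
        original = (w.text or "").strip()
        # Fall back to the original form if no correction is recorded.
        corrected = w.get("corrected", original)
        if corrected != original:
            rows.append((original, corrected, w.get("err", "")))
    return rows


for original, corrected, err in corrected_tokens(SAMPLE_XML):
    print(f"{original} -> {corrected} [{err}]")
```

For the full corpus file, `ET.iterparse` may be preferable to `ET.fromstring`, because it does not need to hold the whole document in memory.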

CzeSL-man
Includes texts collected, transcribed and manually annotated within the ESF project; see the description in English.
Annotation manual in Czech
Transcription manual in Czech
Appendix to Transcription manual in Czech
Transcription manual – summary in English. Please note that the formatting details for the markup of manuscripts given in the transcription manual, its appendix and the summary are outdated.
Texts searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
Downloadable from here in the feat format, with metadata
Coming soon: CzeSL-man searchable from the Czech National Corpus site and downloadable from the LINDAT/CLARIN repository.

CzeSL-MD
Includes a subset of texts from CzeSL-man, semi-automatically annotated with a multi-domain tagset focused on morphology; see the annotation manual (in Czech)
Available from https://bitbucket.org/czesl/czesl-md in the brat format
To be extended to all CzeSL-man texts
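
For readers working with the downloaded files: brat keeps annotations in standoff `.ann` files next to the plain text, where each text-bound annotation is a tab-separated line holding an ID, a label with character offsets, and the covered text. The sketch below reads only those text-bound lines. The labels `FormError` and `Agreement` and the sample content are invented for illustration; the real CzeSL-MD tagset is the one defined in the annotation manual.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_text_bound(ann_text: str) -> List[TextBound]:
    """Parse the text-bound ("T") lines of a brat standoff file.

    Other line types (relations, attributes, notes) are skipped.
    """
    result = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, middle, text = line.split("\t", 2)
        label, offsets = middle.split(" ", 1)
        # Discontinuous annotations separate fragments with ";".
        spans = [
            (int(start), int(end))
            for start, end in (frag.split() for frag in offsets.split(";"))
        ]
        result.append(TextBound(ann_id, label, spans, text))
    return result


# Invented sample content, not taken from CzeSL-MD.
SAMPLE_ANN = (
    "T1\tFormError 0 3\tdnn\n"
    "T2\tAgreement 4 9;15 20\thezky dobry\n"
    "#1\tAnnotatorNotes T1\tden\n"
)

for ann in parse_text_bound(SAMPLE_ANN):
    print(ann.ann_id, ann.label, ann.spans, ann.text)
```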

CzeSL-UD
Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
Available from the LINDAT/CLARIN repository
CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation

Tools
Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
Czech tagger and lemmatizer MorphoDiTa, used for morphological annotation of CzeSL-SGT
Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
General corpus tool TEITOK, currently used for manuscript transcription (http://utkl.ff.cuni.cz/teitok/czesl/) and planned for building, editing and viewing the CzeSL corpora
