CzeSL – a Learner Corpus of Czech
- The Corpus of Czech as a Second Language
- A part of the AKCES/CLAC project (the Czech Language Acquisition Corpora)
- For the official site see AKCES – Akviziční korpusy českého jazyka (in Czech)
- An outdated site: Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk (Investments into Teaching Czech as a Foreign Language)
- Funded since 2009 from several projects:
- 2009–2012: European Social Funds (ESF) – Innovative approach to teaching Czech as a second language, no. CZ.1.07/2.2.00/07.0119
- 2012–2016: Ministry of Education, Youth and Sports – Czech National Corpus, no. LM2011023
- 2016–2018: Grant Agency of the Czech Republic – Non-native Czech from the Theoretical and Computational Perspective, no. 16-10185S
Available versions
CzeSL-plain
- Transcribed texts, 2 million words, without annotation or metadata
- Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz)
- Academic texts written by foreign students in Czech (kval)
- Texts written by Czech students with Romani background (rom)
- Searchable from the Czech National Corpus site
- Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
CzeSL-SGT
- Czech as a Second Language with Spelling, Grammar and Tags
- Transcribed texts, 1 million words
- Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
- Original forms and automatic corrections are tagged, lemmatized and assigned error labels
- Most texts have metadata attributes (30 items) about the author and the text
- Searchable from the Czech National Corpus site; the metadata available there are in Czech
- Downloadable as AKCES 5 (CzeSL-SGT) Release 2 with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
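Because Release 2 represents all annotation as XML attributes, it can be read with any standard XML parser. The following is a minimal sketch in Python, not a description of the actual file layout: the element names (`doc`, `s`, `word`), the attribute names (`lemma`, `tag`, `corrected`) and the sample content are hypothetical placeholders, and the real schema should be taken from the documentation that accompanies the download.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: element names, attribute names and values are
# illustrative only and do not reproduce the CzeSL-SGT Release 2 schema.
SAMPLE = """<doc id="example">
  <s>
    <word lemma="být" tag="TAG1" corrected="jsem">sem</word>
    <word lemma="student" tag="TAG2" corrected="student">student</word>
  </s>
</doc>"""


def read_tokens(xml_text):
    """Return one dict per token: the original form plus all its attributes."""
    root = ET.fromstring(xml_text)
    tokens = []
    for word in root.iter("word"):
        token = {"form": (word.text or "").strip()}
        token.update(word.attrib)  # the annotation is stored in XML attributes
        tokens.append(token)
    return tokens


def corrected_tokens(tokens):
    """Keep only tokens whose original form differs from the corrected form."""
    return [t for t in tokens if t["form"] != t.get("corrected", t["form"])]


tokens = read_tokens(SAMPLE)
changed = corrected_tokens(tokens)
```

Keeping every attribute in a plain dictionary means the same function keeps working if the real files carry more attributes than the three assumed here.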
CzeSL-man
- Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
- Annotation manual in Czech
- Transcription manual in Czech
- Appendix to Transcription manual in Czech
- Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
- Texts searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- Includes all manually annotated texts, both the non-native Czech and the Romani ethnolect Czech parts.
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
- Coming soon: CzeSL-man searchable from the Czech National Corpus site and downloadable from the LINDAT/CLARIN repository.
CzeSL-MD
- Includes a subset of texts from CzeSL-man, semi-automatically annotated with a multi-domain tagset focused on morphology; see the annotation manual (in Czech)
- Available from https://bitbucket.org/czesl/czesl-md in the brat format
- To be extended to all CzeSL-man texts
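The brat format stores annotation as standoff `.ann` files next to the plain text, with text-bound annotations on lines whose ID starts with `T`. As a rough sketch of how such a file could be read in Python, the example below parses only those lines; the label `ErrorLabel`, the attribute `Severity` and the sample text are invented for illustration and are not taken from the CzeSL-MD tagset.

```python
# Hypothetical .ann content: labels and text are illustrative only.
SAMPLE_ANN = (
    "T1\tErrorLabel 0 3\tsem\n"
    "T2\tErrorLabel 4 11\tstudnet\n"
    "A1\tSeverity T2 high\n"
)


def read_text_bound(ann_text):
    """Parse brat text-bound annotations (IDs starting with 'T').

    Each line has three tab-separated parts: the ID, the label followed by
    character offsets, and the covered text. Discontinuous annotations list
    several 'start end' fragments separated by ';'.
    """
    spans = {}
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # attributes, relations, notes etc. are skipped here
        ann_id, label_and_offsets, covered = line.split("\t", 2)
        label, offsets = label_and_offsets.split(" ", 1)
        fragments = [
            tuple(int(n) for n in fragment.split())
            for fragment in offsets.split(";")
        ]
        spans[ann_id] = {"label": label, "fragments": fragments, "text": covered}
    return spans


spans = read_text_bound(SAMPLE_ANN)
```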
CzeSL-UD
- Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
- Available from the LINDAT/CLARIN repository
- CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
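UD treebanks are conventionally distributed in the CoNLL-U format: one token per line, ten tab-separated fields, sentences separated by blank lines. Assuming CzeSL-UD follows this convention (the file format is not stated here), dependency relations could be extracted roughly as follows. The sample sentence is invented for illustration and does not come from the corpus.

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Invented example sentence in CoNLL-U layout, not taken from CzeSL-UD.
SAMPLE = (
    "# text = Jsem student .\n"
    "1\tJsem\tbýt\tAUX\t_\t_\t2\tcop\t_\t_\n"
    "2\tstudent\tstudent\tNOUN\t_\t_\t0\troot\t_\t_\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        fields = line.split("\t")
        if len(fields) != len(CONLLU_FIELDS):
            continue                  # skip malformed lines
        current.append(dict(zip(CONLLU_FIELDS, fields)))
    if current:
        sentences.append(current)
    return sentences


def dependency_edges(sentence):
    """Return (dependent form, relation, head form) triples for basic tokens."""
    by_id = {tok["id"]: tok for tok in sentence}
    edges = []
    for tok in sentence:
        if not tok["id"].isdigit():   # skip multiword ranges and empty nodes
            continue
        if tok["head"] == "0":
            head = "ROOT"
        else:
            head_tok = by_id.get(tok["head"])
            head = head_tok["form"] if head_tok else None
        edges.append((tok["form"], tok["deprel"], head))
    return edges


sentences = read_conllu(SAMPLE)
edges = dependency_edges(sentences[0])
```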
Tools
- Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
- Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
- Czech tagger and lemmatizer MorphoDiTa, used for morphological annotation of CzeSL-SGT
- Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
- Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
- Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
- Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
- General corpus tool TEITOK, currently used for manuscript transcription (http://utkl.ff.cuni.cz/teitok/czesl/), to be used for building, editing and viewing the CzeSL corpora