This is an older version of the document!
CzeSL – a Learner Corpus of Czech
- The Corpus of Czech as a Second Language
- A part of the AKCES/CLAC project (the Czech Language Acquisition Corpora)
- For the official site see AKCES – Akviziční korpusy českého jazyka (in Czech)
- An outdated site: Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk (Investments into Teaching Czech as a Foreign Language)
- Funded since 2009 from several projects:
- 2009–2012: European Social Funds (ESF) – Innovative approach to teaching Czech as a second language, no. CZ.1.07/2.2.00/07.0119
- 2012–2016: Ministry of Education, Youth and Sports – Czech National Corpus, no. LM2011023
- 2016–2018 (extended to mid-2020): Grant Agency of the Czech Republic – Non-native Czech from the Theoretical and Computational Perspective, no. 16-10185S
- 2018–2022: KREAS, Faculty of Arts, Charles University; Structural and Investment Funds of the European Union
- Alternative address of this site: http://utkl.ff.cuni.cz/learncorp/
Available versions
Corpus | Non-native essays (thousand tokens) | Non-native theses (thousand tokens) | Ethnolect (thousand tokens) | 𝚺 (thousand tokens) | Error annotation: Tags | Error annotation: TH | Linguistic annotation: T0 | Linguistic annotation: T1 | Linguistic annotation: T2 | Metadata | Access | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CzeSL-plain | 1,315 | 732 | 428 | 2,475 | – | – | – | – | – | – | SD | 2012 |
CzeSL-SGT | 1,147 | – | – | 1,147 | F | K | M | – | M | yes | SD | 2014 |
CzeSL-man v0, a1 | 134 | – | 192 | 326 | F+G | 2T | – | M | M | – | SD | 2012 |
CzeSL-man v0, a2 | 59 | – | 149 | 208 | F+G | 2T | – | M | M | – | S | 2012 |
CzeSL-man v1 | 134 | – | – | 134 | F+G | T2 | M | – | M+S | yes | SD | 2016 |
CzeSL-man v2 | 134 | – | – | 134 | F+G | 2T | M | M | M | yes | SD | 2020 |
CzeSL-TH | 180 | – | – | 180 | – | 2T | – | – | – | yes | D | 2018 |
CzeSL-MD | 12 | – | – | 12 | MD | T2 | – | – | – | – | D | 2018 |
CzeSL-UD | 10 | – | – | 10 | – | – | M+S | – | – | – | D | 2018 |
CzeSL-GEC | ? | ? | – | 20 | – | 2T | – | – | – | – | D | 2017 |
AKCES-GEC | 336 | – | 168 | 504 | G | 2T | – | – | – | – | D | 2019 |
CzeSL in TEITOK | 299 | – | – | 299 | F+I | 2T+ | M | M | M+S | yes | S | 2020 |
- Tags: F – formal, G – grammar-based, MD – multi-dimensional, I – implicit
- TH (target hypothesis): K – correction suggested by the proofing tool, 2T – successive corrections in the 2T scheme, T2 – correction at Tier 2, 2T+ – more than 2 successive corrections
- Linguistic annotation: M – morphology (lemmas and morphosyntactic tags), S – syntax (structure and functions)
- Access: S – searchable on-line, D – downloadable in full as a dataset
- Year: year of the first release
CzeSL-plain
- 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
- Plain = without annotation, without metadata
- Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
- Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
- Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
- Searchable from the Czech National Corpus site
- Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
CzeSL-SGT
- Czech as a Second Language with Spelling, Grammar and Tags
- 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
- Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
- Original forms and automatic corrections are tagged, lemmatized and assigned error labels
- Most texts have metadata attributes (30 items) about the author and the text
- Searchable from the Czech National Corpus site; the metadata available from that site are in Czech
- Downloadable as AKCES 5 (CzeSL-SGT) Release 2 with the metadata in English. The original release is still available from AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
CzeSL-man
- Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
- Annotation manual in Czech
- Transcription manual in Czech
- Appendix to Transcription manual in Czech
- Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
CzeSL-man v0
- Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. manually annotated Czech texts written by non-native learners and by speakers of the Romani ethnolect of Czech, a total of about 330 thousand tokens.
- Texts of about 208 thousand tokens are annotated independently by two annotators.
- Texts are searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
CzeSL-man v1
- CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
- Most texts are equipped with metadata about the author, the text and the annotation process.
- CzeSL-man can be searched or downloaded. Although the searchable and the downloadable versions contain the same set of texts, they differ in how they represent the annotation.
- CzeSL-man v1 downloadable:
- Downloadable from https://bitbucket.org/czesl/czesl-man/
- This release is in the PML format, generated by the feat tool.
- Each text with its annotation consists of several related files.
- Some of the texts are independently annotated twice.
- Also includes a flat version (files named *.vert), see CzeSL-man v2 below.
- CzeSL-man v1 searchable:
- Searchable by KonText: https://kontext.korpus.cz/first_form?corpname=czesl-man
- Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two aspects:
- There are no texts with alternative error annotation: each text is annotated by a single annotator
- The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.
- Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
- The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
- This annotation discards all Tier 1 corrections and error tags, and simplifies any links between tokens at Tier 0 and Tier 2 that are not 1:1.
- Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.
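The reversed annotation described above can be pictured as a sequence of Tier 2 tokens, each carrying its Tier 0 form and error label as attributes. The following is only a minimal sketch of that idea: the class, the field names and the example sentence are invented for illustration, and "agr" is used as a placeholder label. The actual attribute names in KonText may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tier2Token:
    """One corpus position: a corrected (Tier 2) word plus its source annotation."""
    word: str                # Tier 2 form, the token that is actually searched
    t0_form: Optional[str]   # corresponding Tier 0 (original) form, if any
    err: Optional[str]       # Tier 2 error label, None when nothing was corrected


def original_text(tokens: List[Tier2Token]) -> str:
    """Rebuild the learner's wording from the Tier 0 attribute."""
    return " ".join(t.t0_form for t in tokens if t.t0_form)


def with_error(tokens: List[Tier2Token], label: Optional[str] = None) -> List[Tier2Token]:
    """Select tokens carrying any error label, or one specific label."""
    return [t for t in tokens if t.err and (label is None or t.err == label)]


# Invented example: the learner wrote "velká dům", corrected to "velký dům".
SENTENCE = [
    Tier2Token("Viděl", "Viděl", None),
    Tier2Token("jsem", "jsem", None),
    Tier2Token("velký", "velká", "agr"),  # "agr" is an illustrative label
    Tier2Token("dům", "dům", None),
]

print(original_text(SENTENCE))
print([t.word for t in with_error(SENTENCE, "agr")])
```

Because the corrected text is the token stream, a query matches Tier 2 words directly, while the learner's original wording has to be recovered from the attribute.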
CzeSL-man v2
- In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.
- Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
- Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
- Downloadable from https://bitbucket.org/czesl/czesl-man/ (files named *.vert).
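As a rough illustration of the err/corr pairing, the sketch below collects (ill-formed, corrected) token sequences from a vertical text. It assumes one token per line with the word form in the first tab-separated column, structure tags on lines of their own, and no nesting of err or corr. These layout details, the `type` attribute and the sample are assumptions made for the example, so check them against the actual *.vert files.

```python
from typing import List, Tuple

# Invented sample in an assumed vertical layout.
SAMPLE = """\
<s>
Viděl
jsem
<err type="agr">
velká
</err>
<corr>
velký
</corr>
dům
</s>
"""


def err_corr_pairs(vertical: str) -> List[Tuple[List[str], List[str]]]:
    """Return (err tokens, corr tokens) for each err element and the corr that follows it."""
    pairs = []
    err, corr = [], []
    mode = None  # "err", "corr" or None (outside both structures)
    for raw in vertical.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("<err"):
            mode, err = "err", []
        elif line.startswith("<corr"):
            mode, corr = "corr", []
        elif line == "</err>":
            mode = None
        elif line == "</corr>":
            pairs.append((err, corr))
            mode, err, corr = None, [], []
        elif line.startswith("<"):
            continue  # any other structure tag, e.g. <s> or <doc ...>
        elif mode == "err":
            err.append(line.split("\t")[0])
        elif mode == "corr":
            corr.append(line.split("\t")[0])
    return pairs


print(err_corr_pairs(SAMPLE))
```

A token whose form itself begins with "<" would be mistaken for a tag here, so a robust reader needs a stricter tag test.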
CzeSL-TH
- Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017–2018, according to the 2T scheme.
- The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before, i.e. texts that are not part of CzeSL-man.
- The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
CzeSL-MD
- Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the annotation manual (in Czech)
- The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
- To be extended to all CzeSL-man texts
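For readers who want to process the brat files directly, here is a minimal sketch of reading text-bound annotations (the lines whose ID starts with "T") from a brat standoff .ann file. It follows the general brat standoff layout, not anything specific to CzeSL-MD: the label `LABEL_A` and the sample text are invented, and the real tagset is the one described in the annotation manual.

```python
from typing import List, NamedTuple, Tuple


class TextBound(NamedTuple):
    ann_id: str
    label: str
    spans: List[Tuple[int, int]]  # character offsets into the .txt file
    text: str


def parse_ann(ann: str) -> List[TextBound]:
    """Parse text-bound annotations; other line types (notes, relations) are skipped."""
    result = []
    for line in ann.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, middle, text = line.split("\t", 2)
        label, offsets = middle.split(" ", 1)
        # Discontinuous annotations list several "start end" fragments separated by ";".
        spans = [tuple(int(n) for n in frag.split()) for frag in offsets.split(";")]
        result.append(TextBound(ann_id, label, spans, text))
    return result


# Invented sample: the source text and one annotation over "velká".
TEXT = "Viděl jsem velká dům"
ANN = "T1\tLABEL_A 11 16\tvelká\n#1\tAnnotatorNotes T1\tan invented note\n"

for item in parse_ann(ANN):
    start, end = item.spans[0]
    print(item.ann_id, item.label, TEXT[start:end])
```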
CzeSL-UD
- Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
- Available from the LINDAT/CLARIN repository
- CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
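UD treebanks are normally distributed in the ten-column CoNLL-U format. Assuming CzeSL-UD follows that convention (this page does not state the file format), a dependency list can be read with the standard library alone. The sample sentence and its analysis below are invented for illustration.

```python
from typing import Dict, List, Tuple

FIELDS = ["id", "form", "lemma", "upos", "xpos", "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-U text into sentences, each a list of token dictionaries."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                  # skip multiword ranges (1-2) and empty nodes (1.1)
        current.append(dict(zip(FIELDS, cols)))
    if current:
        sentences.append(current)
    return sentences


def dependencies(sentence: List[Dict[str, str]]) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) triples."""
    forms = {tok["id"]: tok["form"] for tok in sentence}
    forms["0"] = "ROOT"
    return [(tok["form"], tok["deprel"], forms[tok["head"]]) for tok in sentence]


# Invented sample; rows are joined with tabs to keep the columns explicit.
ROWS = [
    ["1", "Viděl", "vidět", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", "jsem", "být", "AUX", "_", "_", "1", "aux", "_", "_"],
    ["3", "velký", "velký", "ADJ", "_", "_", "4", "amod", "_", "_"],
    ["4", "dům", "dům", "NOUN", "_", "_", "1", "obj", "_", "_"],
]
SAMPLE = "# text = Viděl jsem velký dům\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

for triple in dependencies(parse_conllu(SAMPLE)[0]):
    print(triple)
```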
CzeSL-GEC
- CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
- Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
- Downloadable from http://hdl.handle.net/11234/1-2143
- See also AKCES-GEC
AKCES-GEC
- AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Compared with the CzeSL-GEC dataset, this dataset keeps the individual edits separate, annotates each edit with its type, and contains twice as many sentences.
- Downloadable from http://hdl.handle.net/11234/1-3057
- See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
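The M2 format stores each source sentence on an "S" line followed by "A" lines, one per edit, with fields separated by "|||" (token span, edit type, correction, two further fields, annotator ID). The sketch below reads such a file and applies one annotator's edits to obtain the corrected sentence. The sample sentences and the type label "agr" are invented, and the actual type inventory of AKCES-GEC should be taken from the dataset itself.

```python
from typing import Dict, List, Tuple

Edit = Dict[str, object]

# Invented sample: one corrected sentence and one sentence without errors ("noop").
SAMPLE = (
    "S Viděl jsem velká dům\n"
    "A 2 3|||agr|||velký|||REQUIRED|||-NONE-|||0\n"
    "\n"
    "S To je dobře\n"
    "A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||0\n"
)


def parse_m2(text: str) -> List[Tuple[List[str], List[Edit]]]:
    """Return (source tokens, edits) for every sentence block."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()          # drop the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            fields = line[2:].split("|||")
            start, end = (int(n) for n in fields[0].split())
            edits.append({
                "start": start,
                "end": end,
                "type": fields[1],
                "correction": fields[2],
                "annotator": int(fields[-1]),
            })
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens: List[str], edits: List[Edit], annotator: int = 0) -> List[str]:
    """Apply one annotator's edits, right to left so earlier offsets stay valid."""
    out = list(tokens)
    chosen = [e for e in edits if e["annotator"] == annotator and e["type"] != "noop"]
    for e in sorted(chosen, key=lambda e: e["start"], reverse=True):
        out[e["start"]:e["end"]] = e["correction"].split()
    return out


for tokens, edits in parse_m2(SAMPLE):
    print(" ".join(tokens), "->", " ".join(apply_edits(tokens, edits)))
```

Applying edits from right to left avoids recomputing offsets after an edit changes the sentence length.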
CzeSL in TEITOK
- Work in progress: will eventually include all available Czech texts written by non-native learners
- See CzeSL in TEITOK at the ICTL site
Tools
- Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
- Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
- Tagger and lemmatizer of Czech Morphodita, used for morphological annotation of CzeSL-SGT
- Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
- Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
- Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
- Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
- General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)
Bibliography
NEW:
Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha. Print copy, e-book CU Digital Repository
Acknowledgement
This work was supported by the European Regional Development Fund project “Creativity and Adaptability as Conditions of the Success of Europe in an Interrelated World” (reg. no.: CZ.02.1.01/0.0/0.0/16_019/0000734).