CzeSL – a Learner Corpus of Czech
- The Corpus of Czech as a Second Language
- A part of the AKCES/CLAC project (the Czech Language Acquisition Corpora)
- For the official site see AKCES – Akviziční korpusy českého jazyka (in Czech)
- An outdated site: Investice do rozvoje vzdělávání v oboru čeština jako cizí jazyk (Investments into Teaching Czech as a Foreign Language)
- Funded since 2009 from several projects:
- 2009–2012: European Social Funds (ESF) – Innovative approach to teaching Czech as a second language, no. CZ.1.07/2.2.00/07.0119
- 2012–2016: Ministry of Education, Youth and Sports – Czech National Corpus, no. LM2011023
- 2016–2018 (extended to mid-2020): Grant Agency of the Czech Republic – Non-native Czech from the Theoretical and Computational Perspective, no. 16-10185S
- Alternative address of this site: http://utkl.ff.cuni.cz/learncorp/
Available versions
Corpus | Non-native essays (k tokens) | Non-native theses (k tokens) | Ethnolect (k tokens) | 𝚺 (k tokens) | Error annotation: Tags | Error annotation: TH | Linguistic annotation: T0 | Linguistic annotation: T1 | Linguistic annotation: T2 | Metadata | Access | Year |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CzeSL-plain | 1,315 | 732 | 428 | 2,475 | – | – | – | – | – | – | SD | 2012 |
CzeSL-SGT | 1,147 | – | – | 1,147 | F | K | M | – | M | yes | SD | 2014 |
CzeSL-man v0, a1 | 134 | – | 192 | 326 | F+G | 2T | – | M | M | – | SD | 2012 |
CzeSL-man v0, a2 | 59 | – | 149 | 208 | F+G | 2T | – | M | M | – | S | 2012 |
CzeSL-man v1 | 134 | – | – | 134 | F+G | T2 | M | – | M+S | yes | SD | 2016 |
CzeSL-man v2 | 134 | – | – | 134 | F+G | 2T | M | M | M | yes | SD | 2020 |
CzeSL-TH | 180 | – | – | 180 | – | 2T | – | – | – | yes | D | 2018 |
CzeSL-MD | 12 | – | – | 12 | MD | T2 | – | – | – | – | D | 2018 |
CzeSL-UD | 10 | – | – | 10 | – | – | M+S | – | – | – | D | 2018 |
CzeSL-GEC | ? | ? | – | 20 | – | 2T | – | – | – | – | D | 2017 |
AKCES-GEC | 336 | – | 168 | 504 | G | 2T | – | – | – | – | D | 2019 |
CzeSL in TEITOK | 299 | – | – | 299 | F+I | 2T+ | M | M | M+S | yes | S | 2020 |
- Tags: F – formal, G – grammar-based, MD – multi-dimensional, I – implicit
- TH (target hypothesis): K – correction suggested by the proofing tool, 2T – successive corrections in the 2T scheme, T2 – correction at Tier 2, 2T+ – more than 2 successive corrections
- Linguistic annotation: M – morphology (lemmas and morphosyntactic tags), S – syntax (structure and functions)
- Access: S – searchable on-line, D – downloadable in full as a dataset
- Year: year of the first release
CzeSL-plain
- 12.4 thousand transcribed texts, 2 mil. words, 2.5 mil. tokens
- Plain = without annotation, without metadata
- Consists of 3 parts:
- Texts written by foreign learners of Czech (ciz): 8,109 texts, 1,161 thousand tokens
- Academic texts written by foreign students in Czech (kval): 174 texts, 732 thousand tokens
- Texts written by Czech students with Romani background (rom), i.e. an ethnolect of Czech: 4,105 texts, 428 thousand tokens
- Searchable from the Czech National Corpus site
- Downloadable from the LINDAT/CLARIN data repository as two subcorpora:
CzeSL-SGT
- Czech as a Second Language with Spelling, Grammar and Tags
- 8,617 transcribed texts, 111 thousand sentences, 1 mil. words, 1.1 mil. tokens
- Extends the “foreign” (ciz) part of CzeSL-plain by texts collected in 2013
- Original forms and automatic corrections are tagged, lemmatized and assigned error labels
- Most texts have metadata attributes (30 items) about the author and the text
- Searchable from the Czech National Corpus site; the metadata available there are in Czech
- Downloadable as AKCES 5 (CzeSL-SGT) Release 2, with the metadata in English. The original release is still available as AKCES 5 (CzeSL-SGT). We recommend Release 2, which fixes a number of bugs, including a metadata issue: native speakers of Ukrainian were labelled as speakers of “another Indo-European language” rather than as speakers of a Slavic language. Release 2 is also a validated XML document with all annotation represented as XML attributes.
CzeSL-man
- Includes texts collected, transcribed and manually annotated within the ESF project, see a description in English.
- Annotation manual in Czech
- Transcription manual in Czech
- Appendix to Transcription manual in Czech
- Transcription manual – summary in English; please note that formatting specifics concerning the markup of manuscripts in the transcription manual, its appendix and the summary are outdated.
CzeSL-man v0
- Includes subsets of the ciz and rom parts of CzeSL-plain, i.e. the manually annotated Czech texts written by non-native learners and by speakers of the Roma ethnolect of Czech, a total of about 330 thousand tokens.
- Texts of about 208 thousand tokens are annotated independently by two annotators.
- Texts are searchable on-line via the SeLaQ tool, using the CzeSL-native format:
- SeLaQ is a purpose-built corpus manager. See its menu for instructions.
- Note that metadata and graphical display of links between annotated word tokens are not available in SeLaQ.
CzeSL-man v1
- CzeSL-man v1 contains 645 texts (128 thousand tokens) from CzeSL-SGT, including 298 doubly annotated texts (59 thousand doubly annotated tokens).
- Most texts are equipped with metadata about the author, the text and the annotation process.
- CzeSL-man v1 can be searched or downloaded. The searchable and the downloadable versions contain the same set of texts, but they differ in how they represent the annotation.
- CzeSL-man v1 downloadable:
- Downloadable from https://bitbucket.org/czesl/czesl-man/
- This release is in the PML format, generated by the feat tool.
- Each text with its annotation consists of several related files.
- Some of the texts are independently annotated twice.
- CzeSL-man v1 searchable:
- Searchable by KonText: https://kontext.korpus.cz/first_form?corpname=czesl-man
- Differs from both CzeSL-man v0 and CzeSL-man v1 downloadable in two aspects:
- There are no texts with alternative error annotation: each text is annotated by a single annotator
- The two-tier annotation scheme is radically modified to fit the token-based setup of the search tool.
- Apart from that, the content and metadata are identical to CzeSL-man v1 downloadable and the search options to those of CzeSL-SGT.
- The main feature in the annotation is the reversal of the source text and its annotation. The correction at Tier 2 is assumed to be the basis for the annotation. The tokens of this corpus represent the words at Tier 2. The original text is added as annotation of the Tier 2 tokens. Each token of the corrected text receives its corresponding Tier 0 form and a Tier 2 error label as attributes.
- This annotation discards all Tier 1 corrections and error tags, and simplifies any links between tokens at Tier 0 and Tier 2 that are not 1:1.
- Tier 2 is parsed in a way similar to some other Czech corpora searchable in KonText, such as SYN2015.
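The reversal described above can be pictured with a short Python sketch. It only illustrates the idea that the corrected (Tier 2) word is the token and the original (Tier 0) form and the error label are its attributes. The attribute names `word`, `t0_form` and `err`, the label `ERR` and the sample sentence are hypothetical and do not reproduce the attribute names or tagset actually used in KonText.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    word: str            # Tier 2 (corrected) form: the token proper
    t0_form: str         # corresponding Tier 0 (original) form, stored as an attribute
    err: Optional[str]   # Tier 2 error label, or None when nothing was corrected


# Hypothetical learner sentence "Bydlím v Praha ." corrected to "Bydlím v Praze ."
SAMPLE = [
    Token("Bydlím", "Bydlím", None),
    Token("v", "v", None),
    Token("Praze", "Praha", "ERR"),  # "ERR" is a placeholder label, not a real CzeSL tag
    Token(".", ".", None),
]


def erroneous(tokens: List[Token]) -> List[Token]:
    """Return the tokens that carry an error label."""
    return [t for t in tokens if t.err is not None]


def original_text(tokens: List[Token]) -> str:
    """Rebuild the learner's original wording from the Tier 0 attributes."""
    return " ".join(t.t0_form for t in tokens)


def corrected_text(tokens: List[Token]) -> str:
    """Rebuild the corrected wording from the tokens themselves."""
    return " ".join(t.word for t in tokens)
```

With a token-based layout of this kind, a query for an error label returns corrected words, and the learner's original form is read off the attribute.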
CzeSL-man v2
- In this release the two-tier error annotation is represented as pairs of XML elements err and corr. An ill-formed portion of the source text is enclosed within the err structure, immediately followed by its correction, enclosed within the corr structure.
- Apart from the error annotation, the content and metadata are the same as in CzeSL-man v1.
- Linguistic annotation (tags and lemmas) is provided for all tokens at Tier 0 and Tier 2.
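A minimal Python sketch of reading such err/corr pairs follows. Only the element names `err` and `corr` come from the description above; the enclosing `<s>` element, the absence of attributes and the sample sentence are simplifying assumptions, so the real release will need a more careful reader.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sentence; the real files carry more markup.
SAMPLE = "<s>Bydlím v <err>Praha</err><corr>Praze</corr> .</s>"


def err_corr_pairs(sentence):
    """Pair each <err> element with the <corr> element that immediately follows it."""
    children = list(sentence)
    pairs = []
    for cur, nxt in zip(children, children[1:]):
        if cur.tag == "err" and nxt.tag == "corr":
            pairs.append(("".join(cur.itertext()).strip(),
                          "".join(nxt.itertext()).strip()))
    return pairs


def render(sentence, drop):
    """Rebuild the sentence text while skipping all elements named `drop`."""
    parts = [sentence.text or ""]
    for child in sentence:
        if child.tag != drop:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text after the element belongs to the sentence
    return " ".join("".join(parts).split())


root = ET.fromstring(SAMPLE)
pairs = err_corr_pairs(root)
source = render(root, drop="corr")     # keep <err>, skip <corr>
corrected = render(root, drop="err")   # keep <corr>, skip <err>
```

Dropping `corr` yields the learner's text and dropping `err` yields the corrected text, so both versions can be recovered from a single file.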
CzeSL-TH
- Includes a subset of CzeSL-SGT, hand-corrected, but not error-tagged, in 2017–2018, according to the 2T scheme.
- The corpus includes about 1,300 texts (180 thousand tokens), selected from those that had not been manually error-annotated before (i.e. texts that are not part of CzeSL-man).
- The selection was meant to make the manually annotated part of CzeSL more balanced in terms of L1 and CEFR level.
CzeSL-MD
- Includes a subset of texts from CzeSL-man, semi-automatically annotated by a multi-domain tagset with a focus on morphology, see the annotation manual (in Czech)
- The dataset is available from https://bitbucket.org/czesl/czesl-md in the brat format
- To be extended to all CzeSL-man texts
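Since the dataset is distributed in the brat format, a minimal Python sketch of reading text-bound annotations from a brat standoff (`.ann`) file may be useful. It assumes the usual brat layout of tab-separated ID, "label plus character offsets", and covered text. The label `ERR_LABEL` and the sample line are hypothetical and are not taken from the CzeSL-MD tagset.

```python
def parse_text_bound(ann_text):
    """Parse text-bound ("T") annotations from brat standoff text.

    Other annotation kinds (relations, attributes, notes) are skipped.
    """
    annotations = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ann_id, label_and_span, covered = line.split("\t", 2)
        label, span = label_and_span.split(" ", 1)
        # Discontinuous annotations list several "start end" fragments separated by ";"
        fragments = [tuple(int(x) for x in frag.split()) for frag in span.split(";")]
        annotations.append(
            {"id": ann_id, "label": label, "spans": fragments, "text": covered}
        )
    return annotations


# Hypothetical annotation over the text "Bydlím v Praha ."
TEXT = "Bydlím v Praha ."
ANN = "T1\tERR_LABEL 9 14\tPraha\n#1\tAnnotatorNotes T1\tplaceholder note\n"

parsed = parse_text_bound(ANN)
start, end = parsed[0]["spans"][0]
```

The offsets index into the accompanying plain-text file, so `TEXT[start:end]` should equal the covered text stored in the annotation.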
CzeSL-UD
- Texts from CzeSL-man with syntactic annotation according to the Universal Dependencies (UD) standard
- Available from the LINDAT/CLARIN repository
- CzeSL-man, CzeSL-MD and CzeSL-UD will eventually be merged into a single corpus with multiple types of annotation
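UD treebanks are normally distributed in the ten-column CoNLL-U format. Assuming CzeSL-UD follows that convention (the page does not say so explicitly), a minimal reader could look like the Python sketch below. The sample sentence and its analysis are hand-made for illustration and are not taken from the corpus.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # a blank line closes a sentence
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):    # sentence-level comments are skipped
            continue
        else:
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


# Hand-made example; columns are joined with tabs as CoNLL-U requires.
ROWS = [
    ("1", "Bydlím", "bydlet", "VERB", "_", "_", "0", "root", "_", "_"),
    ("2", "v", "v", "ADP", "_", "_", "3", "case", "_", "_"),
    ("3", "Praze", "Praha", "PROPN", "_", "_", "1", "obl", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"),
]
SAMPLE = "# text = Bydlím v Praze .\n" + "\n".join("\t".join(r) for r in ROWS) + "\n\n"

sentences = read_conllu(SAMPLE)
heads = [(tok["form"], tok["head"], tok["deprel"]) for tok in sentences[0]]
```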
CzeSL-GEC
- CzeSL Grammatical Error Correction Dataset – manually annotated texts, converted to a format intended for NLP applications
- Sentence pairs of original and corrected versions of Czech sentences collected from essays written by both non-native learners of Czech (CzeSL-man) and Czech pupils with Romani background
- Downloadable from http://hdl.handle.net/11234/1-2143
- See also AKCES-GEC
AKCES-GEC
- AKCES Grammatical Error Correction Dataset for Czech – a grammatical error correction corpus for Czech generated from a subset of AKCES. It contains train, dev and test files annotated in the M2 format. Unlike the CzeSL-GEC dataset, this dataset lists the individual edits separately, together with their type annotations, in the M2 format, and it has twice as many sentences.
- Downloadable from http://hdl.handle.net/11234/1-3057
- See https://arxiv.org/pdf/1910.00353.pdf for a detailed description
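For readers unfamiliar with the M2 format, here is a minimal Python sketch of reading it and applying the edits of one annotator. It assumes the common M2 layout: an `S` line with the tokenized source sentence, followed by `A` lines with six `|||`-separated fields. The sample sentence and the error type `TYPE` are hypothetical and are not taken from AKCES-GEC.

```python
def parse_m2(text):
    """Parse M2 text into a list of (tokens, edits) pairs."""
    entries = []
    for block in text.strip().split("\n\n"):
        lines = block.splitlines()
        tokens = lines[0][2:].split()            # drop the leading "S "
        edits = []
        for line in lines[1:]:
            if not line.startswith("A "):
                continue
            span, etype, correction, _req, _none, annotator = line[2:].split("|||")
            start, end = (int(x) for x in span.split())
            edits.append((start, end, etype, correction, int(annotator)))
        entries.append((tokens, edits))
    return entries


def apply_edits(tokens, edits, annotator=0):
    """Apply one annotator's edits right to left so earlier offsets stay valid."""
    out = list(tokens)
    for start, end, etype, correction, ann in sorted(edits, reverse=True):
        if ann != annotator or etype == "noop":  # "noop" marks an unchanged sentence
            continue
        out[start:end] = correction.split()      # an empty correction deletes the span
    return out


# Hypothetical example; "TYPE" stands in for a real error type.
SAMPLE = "S Bydlím v Praha .\nA 2 3|||TYPE|||Praze|||REQUIRED|||-NONE-|||0\n"

tokens, edits = parse_m2(SAMPLE)[0]
corrected = " ".join(apply_edits(tokens, edits))
```

Because each edit carries its own span and type, the same file serves both for evaluating a correction system and for analysing errors by type.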
CzeSL in TEITOK
- Work in progress: will eventually include all available Czech texts written by non-native learners
- See CzeSL in TEITOK at the ICTL site
Tools
- Annotation editor feat, used for multi-level manual error annotation of CzeSL-man
- Annotation editor brat, used for multi-domain error annotation of CzeSL-MD
- Tagger and lemmatizer of Czech MorphoDiTa, used for morphological annotation of CzeSL-SGT
- Spelling/grammar checker Korektor, used for automatic correction of CzeSL-SGT
- Error identifier (see Jelínek, 2017), used for automatic identification of some types of errors in CzeSL-SGT
- Multi-level concordancer SeLaQ, used for basic searching in CzeSL-man
- Standard concordancer Manatee/KonText, used for searching in CzeSL-plain and CzeSL-SGT
- General corpus tool TEITOK, currently used for building, editing and viewing learner corpora hosted by the Institute of Theoretical and Computational Linguistics (see Learner corpora at ICTL)
Bibliography
NEW:
Rosen, A., Hana, J., Hladká, B., Jelínek, T., Škodová, S., and Štindlová, B. (2020). Compiling and annotating a learner corpus for a morphologically rich language – CzeSL, a corpus of non-native Czech. Karolinum, Charles University Press, Praha. print copy/e-book http