Tagging texts for Ortofon
- We receive the texts with mark-up:
<chunk id="12A005N_1" year="2012" month="8" location="Praha" situation="hovor doma" speakers="2" genders="smíšené" generations="1" relationship="partnerský" length="10:20" tokens="1567" doc_id="12A005N" position_in_text="end"> <sp id="79f" nickname="Lenka_24" gender="Z: žena" confederate="ano" age_binary="I: do 35 let" age="33" edu_binary="A: vysokoškolské" edu_level="VŠ" edu_field="Pedagogika, učitelství a sociální péče" occupation="lektor AJ" occupation_category="23 pedagog" reg_childhood="středočeská" loc_childhood="Praha" locsize_childhood="město nad 100 tisíc" reg_longest="středočeská" locsize_longest="město nad 100 tisíc" reg_current="středočeská" locsize_current="město nad 100 tisíc" proportion="55 %" soundfile="4/4/18692a8d.mp3"> tak taky můžeš jít na ryby </sp>
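To make the mark-up concrete, here is a minimal sketch that pulls one attribute value out of the `<chunk>` and `<sp>` tags with standard tools. The helper name `get_attr` is hypothetical and not part of the Ortofon tooling. It assumes that attribute values are double-quoted and contain neither `>` nor escaped quotes, as in the sample above (shortened here).

```shell
#!/bin/sh
# get_attr NAME: read mark-up on stdin and print the value of attribute NAME
# for every tag that carries it (hypothetical helper, not an Ortofon script).
get_attr() {
    # Put every tag at the start of its own line, then capture NAME="...".
    # The space required before NAME keeps "id" from matching "doc_id".
    tr '\n' ' ' | sed 's/</\
</g' | sed -n 's/^<[^>]* '"$1"'="\([^"]*\)".*/\1/p'
}

# Shortened copy of the sample chunk shown above.
sample='<chunk id="12A005N_1" year="2012" doc_id="12A005N"> <sp id="79f" nickname="Lenka_24" age="33"> tak taky můžeš jít na ryby </sp>'

printf '%s\n' "$sample" | get_attr doc_id     # chunk-level attribute
printf '%s\n' "$sample" | get_attr nickname   # speaker-level attribute
```

For anything beyond a quick check, a real XML parser would be the safer choice, since the sample does not show how special characters inside attribute values are escaped.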
Workflow
- Each text is annotated by one human annotator and by the merged results of the hybrid tagger vs. MorphoDiTa.
- We have ten directories with texts from David:
    /store/corp/Ortofon/ortofon-data/01–05 (first batch)
    /store/corp/Ortofon/ortofon-data/06–10 (second batch)
- Manual annotation is done in the directories
    /store/corp/Ortofon/ortofon-manual/davka-?
- Automatic annotation is done in the directories
    /store/corp/Ortofon/ortofon-hybrid/davka-?
    /store/corp/Ortofon/ortofon-morphodita/davka-?
- The final merge of manual and automatic annotation is done in the directory
    /store/corp/Ortofon/ortofon-etalon/davka-?
Manual annotation
Preparing texts for annotators
- Put all files into the directory chunks.
- Produce csts:
    parallel-filter.sh -C "cut -f1 | perl -pe 's/\.\.\./&thellip;/g' \
      | perl -pe 's/\.\./&dhellip;/g' | replace_spaces.pl \
      | perl -pe 's/<sp /<s /' | perl -pe 's:</sp>:</s>:' \
      | vert_csts.pl | perl -pe 'undef $/; s/<s>\n<chunk/<chunk/'" -p45 -s chunks -t csts -v
- We tag with the hybrid tagger!
- Run the morphological analysis:
    make-corp.sh -s csts -t csts-morf -Eucs2 -A1 -B1 -M -p45 -v
- Resolve vole and von:
    cd csts-morf
    for ff in *; do echo $ff; \
      perl -i -pe 's/<f>vole<.*/<f src="M">vole<MMl>vůl<MMt>NNMS5-----A----/' $ff; done
    for ff in *; do echo $ff; \
      perl -i -pe 's/(<f[^>]*>von)<.*/$1<MMl>on<MMt>PPYS1--3------6/' $ff; done
- Apply the rules and phrasemes:
    make-whole-corp-csts.sh -Eucs2 -M -v -p45 -trules -Tfrazrl
- Adjust the tags:
    parallel-filter.sh -C "normalize-anot-csts.pl \
      | simplify-tags-csts-utf.pl| remove-dupl-csts-mark.pl X" -p45 \
      -s csts-rules-frazrl -t csts-import -v
  Still to do: simplify the tags and handle background sounds.
- Create links for the individual annotators.
- The following steps are carried out on jakobson.
- Import a file:
    /usr/local/annotate/bin/csts-import-utkl.pl --force 05-16X001N_2-HS
- Edit /usr/local/annotate/users.
Annotators and assigned files (davka-1)
Michal Havrda - MH
- xaf (4541): done
-rw-rw-r-- 1 skoumal users 169887 Nov 7 17:40 04-13B019N_0-MH * -rw-rw-r-- 1 skoumal users 147849 Nov 7 17:40 04-13B025N_0-MH * -rw-rw-r-- 1 skoumal users 136246 Nov 7 17:40 04-13O007N_1-MH * -rw-rw-r-- 1 skoumal users 133560 Nov 7 17:40 04-13O010N_0-MH * -rw-rw-r-- 1 skoumal users 165228 Nov 7 17:40 04-13O013N_0-MH * -rw-rw-r-- 1 skoumal users 119827 Nov 7 17:40 04-13P004N_2-MH ** -rw-rw-r-- 1 skoumal users 129590 Nov 7 17:40 04-14P007N_0-MH * -rw-rw-r-- 1 skoumal users 166905 Nov 7 17:40 04-14T003N_0-MH * -rw-rw-r-- 1 skoumal users 152270 Nov 7 17:40 04-14T010N_1-MH * -rw-rw-r-- 1 skoumal users 141300 Nov 7 17:40 04-14T013N_1-MH * -rw-rw-r-- 1 skoumal users 148744 Nov 7 17:40 04-14X016N_3-MH * -rw-rw-r-- 1 skoumal users 131885 Nov 7 17:40 04-15O010N_0-MH * -rw-rw-r-- 1 skoumal users 153940 Nov 7 17:40 04-15P001N_2-MH * -rw-rw-r-- 1 skoumal users 119987 Nov 7 17:40 04-15P006N_0-MH ** -rw-rw-r-- 1 skoumal users 163310 Nov 7 17:40 05-16X001N_2-MH **
Václav Horký - VH
- xaa (4238):
-rw-rw-r-- 1 skoumal users 97843 Nov 7 17:40 01-12A005N_1-VH -rw-rw-r-- 1 skoumal users 124148 Nov 7 17:40 01-13A014N_1-VH -rw-rw-r-- 1 skoumal users 139676 Nov 7 17:40 01-13A028N_1-VH -rw-rw-r-- 1 skoumal users 162599 Nov 7 17:40 01-13B031N_1-VH -rw-rw-r-- 1 skoumal users 128254 Nov 7 17:40 01-13H005N_1-VH -rw-rw-r-- 1 skoumal users 176088 Nov 7 17:40 01-13O009N_3-VH -rw-rw-r-- 1 skoumal users 128760 Nov 7 17:40 01-13P004N_1-VH -rw-rw-r-- 1 skoumal users 139368 Nov 7 17:40 01-13P009N_2-VH -rw-rw-r-- 1 skoumal users 135085 Nov 7 17:40 01-14T003N_6-VH -rw-rw-r-- 1 skoumal users 152015 Nov 7 17:40 01-14T007N_1-VH -rw-rw-r-- 1 skoumal users 153372 Nov 7 17:40 01-14T010N_0-VH -rw-rw-r-- 1 skoumal users 166194 Nov 7 17:40 01-14X016N_6-VH -rw-rw-r-- 1 skoumal users 168697 Nov 7 17:40 01-14X019N_0-VH -rw-rw-r-- 1 skoumal users 160336 Nov 7 17:40 01-15O001N_0-VH
Šárka Kadavá - SK
- xad (4521): done
-rw-rw-r-- 1 skoumal users 173330 Nov 7 17:40 02-16A005N_5-SK -rw-rw-r-- 1 skoumal users 149878 Nov 7 17:40 02-16P002N_0-SK -rw-rw-r-- 1 skoumal users 118822 Nov 7 17:40 02-16X003N_5-SK -rw-rw-r-- 1 skoumal users 138030 Nov 7 17:40 03-12A035N_3-SK -rw-rw-r-- 1 skoumal users 130944 Nov 7 17:40 03-13A014N_4-SK -rw-rw-r-- 1 skoumal users 138479 Nov 7 17:40 03-13O009N_2-SK -rw-rw-r-- 1 skoumal users 132373 Nov 7 17:40 03-13P010N_1-SK -rw-rw-r-- 1 skoumal users 159964 Nov 7 17:40 03-14A011N_3-SK -rw-rw-r-- 1 skoumal users 122871 Nov 7 17:40 03-14A016N_4-SK -rw-rw-r-- 1 skoumal users 137852 Nov 7 17:40 03-14P007N_2-SK -rw-rw-r-- 1 skoumal users 186207 Nov 7 17:40 03-14T010N_2-SK -rw-rw-r-- 1 skoumal users 151313 Nov 7 17:40 03-14T013N_0-SK -rw-rw-r-- 1 skoumal users 162548 Nov 7 17:40 03-14T020N_0-SK -rw-rw-r-- 1 skoumal users 172522 Nov 7 17:40 03-14X019N_4-SK
Pavel Kopřiva - PK
- xag (4601): done
-rw-rw-r-- 1 skoumal users 128575 Nov 7 17:40 04-15X030N_3-PK -rw-rw-r-- 1 skoumal users 152738 Nov 7 17:40 04-15X043N_2-PK -rw-rw-r-- 1 skoumal users 162024 Nov 7 17:40 04-16A009N_2-PK -rw-rw-r-- 1 skoumal users 114272 Nov 7 17:40 04-16E007N_2-PK -rw-rw-r-- 1 skoumal users 161129 Nov 7 17:40 04-16P004N_1-PK -rw-rw-r-- 1 skoumal users 145303 Nov 7 17:40 04-16X003N_2-PK -rw-rw-r-- 1 skoumal users 153704 Nov 7 17:40 05-12P004N_2-PK -rw-rw-r-- 1 skoumal users 141893 Nov 7 17:40 05-13A011N_0-PK -rw-rw-r-- 1 skoumal users 136846 Nov 7 17:40 05-13A014N_3-PK -rw-rw-r-- 1 skoumal users 117562 Nov 7 17:40 05-13A023N_3-PK -rw-rw-r-- 1 skoumal users 145722 Nov 7 17:40 05-13B005N_1-PK -rw-rw-r-- 1 skoumal users 169153 Nov 7 17:40 05-13D015N_0-PK -rw-rw-r-- 1 skoumal users 141199 Nov 7 17:40 05-13O007N_0-PK -rw-rw-r-- 1 skoumal users 172005 Nov 7 17:40 05-14A011N_2-PK
Lucie Onari Kreslová - LK
- xae (4451): done
-rw-rw-r-- 1 skoumal users 136241 Nov 7 17:40 03-14X021N_2-LK -rw-rw-r-- 1 skoumal users 180763 Nov 7 17:40 03-15E003N_0-LK -rw-rw-r-- 1 skoumal users 144355 Nov 7 17:40 03-15E015N_1-LK -rw-rw-r-- 1 skoumal users 135982 Nov 7 17:40 03-15O002N_1-LK -rw-rw-r-- 1 skoumal users 171807 Nov 7 17:40 03-15O007N_0-LK -rw-rw-r-- 1 skoumal users 175211 Nov 7 17:40 03-15X020N_1-LK -rw-rw-r-- 1 skoumal users 158941 Nov 7 17:40 03-15X041N_2-LK -rw-rw-r-- 1 skoumal users 170958 Nov 7 17:40 03-16A005N_4-LK -rw-rw-r-- 1 skoumal users 140487 Nov 7 17:40 03-16P004N_0-LK -rw-rw-r-- 1 skoumal users 171624 Nov 7 17:40 03-16X001N_1-LK -rw-rw-r-- 1 skoumal users 148677 Nov 7 17:40 03-16X031N_4-LK -rw-rw-r-- 1 skoumal users 115839 Nov 7 17:40 04-12A025N_0-LK -rw-rw-r-- 1 skoumal users 172208 Nov 7 17:40 04-12P004N_4-LK -rw-rw-r-- 1 skoumal users 200543 Nov 7 17:40 04-13B009N_0-LK
Tereza Marková - TM
- xab (4269): done
-rw-rw-r-- 1 skoumal users 144719 Nov 7 17:40 01-15O004N_0-TM * -rw-rw-r-- 1 skoumal users 147429 Nov 7 17:40 01-15O007N_1-TM * -rw-rw-r-- 1 skoumal users 156462 Nov 7 17:40 01-15P001N_1-TM ** -rw-rw-r-- 1 skoumal users 188348 Nov 7 17:40 01-15T005N_0-TM ** -rw-rw-r-- 1 skoumal users 149779 Nov 7 17:40 01-15X045N_2-TM * -rw-rw-r-- 1 skoumal users 165698 Nov 7 17:40 01-16A005N_2-TM * -rw-rw-r-- 1 skoumal users 132634 Nov 7 17:40 01-16P002N_2-TM * -rw-rw-r-- 1 skoumal users 121098 Nov 7 17:40 01-16X033N_5-TM * -rw-rw-r-- 1 skoumal users 122287 Nov 7 17:40 02-12A011N_1-TM * -rw-rw-r-- 1 skoumal users 146221 Nov 7 17:40 02-13A011N_5-TM * -rw-rw-r-- 1 skoumal users 136882 Nov 7 17:40 02-13A036N_3-TM ** -rw-rw-r-- 1 skoumal users 198356 Nov 7 17:40 02-13B031N_0-TM * -rw-rw-r-- 1 skoumal users 112397 Nov 7 17:40 02-13O009N_1-TM ** -rw-rw-r-- 1 skoumal users 126312 Nov 7 17:40 02-13O010N_2-TM **
Anna Nováková - AN
- xah (4552):
-rw-rw-r-- 1 skoumal users 190480 Nov 7 17:40 05-14T010N_3-AN -rw-rw-r-- 1 skoumal users 122224 Nov 7 17:40 05-14T019N_0-AN -rw-rw-r-- 1 skoumal users 154255 Nov 7 17:40 05-14X012N_2-AN -rw-rw-r-- 1 skoumal users 189324 Nov 7 17:40 05-14X019N_2-AN -rw-rw-r-- 1 skoumal users 183678 Nov 7 17:40 05-14X019N_3-AN -rw-rw-r-- 1 skoumal users 154400 Nov 7 17:40 05-15O001N_1-AN -rw-rw-r-- 1 skoumal users 150413 Nov 7 17:40 05-15P001N_0-AN -rw-rw-r-- 1 skoumal users 206423 Nov 7 17:40 05-15X009N_1-AN -rw-rw-r-- 1 skoumal users 181125 Nov 7 17:40 05-15X020N_3-AN -rw-rw-r-- 1 skoumal users 99237 Nov 7 17:40 05-15X043N_5-AN -rw-rw-r-- 1 skoumal users 182076 Nov 7 17:40 05-16A005N_1-AN -rw-rw-r-- 1 skoumal users 122299 Nov 7 17:40 05-16E007N_0-AN -rw-rw-r-- 1 skoumal users 120653 Nov 7 17:40 05-16P002N_1-AN -rw-rw-r-- 1 skoumal users 131582 Nov 7 17:40 05-16P007N_1-AN
Michal Zlatkovský - MZ
- xac (4393):
-rw-rw-r-- 1 skoumal users 160624 Nov 7 17:40 02-13O013N_1-MZ -rw-rw-r-- 1 skoumal users 119673 Nov 7 17:40 02-13P004N_3-MZ -rw-rw-r-- 1 skoumal users 120363 Nov 7 17:40 02-13T029N_6-MZ -rw-rw-r-- 1 skoumal users 117594 Nov 7 17:40 02-14A016N_2-MZ -rw-rw-r-- 1 skoumal users 156603 Nov 7 17:40 02-14E003N_0-MZ -rw-rw-r-- 1 skoumal users 153488 Nov 7 17:40 02-14T003N_5-MZ -rw-rw-r-- 1 skoumal users 147516 Nov 7 17:40 02-14T013N_2-MZ -rw-rw-r-- 1 skoumal users 178036 Nov 7 17:40 02-14T020N_3-MZ -rw-rw-r-- 1 skoumal users 172908 Nov 7 17:40 02-15O002N_0-MZ -rw-rw-r-- 1 skoumal users 122828 Nov 7 17:40 02-15O004N_1-MZ -rw-rw-r-- 1 skoumal users 127983 Nov 7 17:40 02-15O009N_1-MZ -rw-rw-r-- 1 skoumal users 129864 Nov 7 17:40 02-15P001N_3-MZ -rw-rw-r-- 1 skoumal users 129648 Nov 7 17:40 02-15X041N_1-MZ -rw-rw-r-- 1 skoumal users 150095 Nov 7 17:40 02-16A001N_0-MZ
Annotators and assigned files (davka-2)
Václav Horký - VH
- (7862) – done:
-rw-rw-r-- 1 skoumal users 142827 May 27 17:40 06-12A011N_0-VH -rw-rw-r-- 1 skoumal users 166457 May 27 17:40 06-12P004N_3-VH -rw-rw-r-- 1 skoumal users 131514 May 27 17:40 06-13A003N_1-VH -rw-rw-r-- 1 skoumal users 128413 May 27 17:40 06-13A014N_2-VH -rw-rw-r-- 1 skoumal users 144623 May 27 17:40 06-13A028N_2-VH -rw-rw-r-- 1 skoumal users 150403 May 27 17:40 06-13A074N_1-VH -rw-rw-r-- 1 skoumal users 139190 May 27 17:40 06-13B005N_0-VH -rw-rw-r-- 1 skoumal users 189188 May 27 17:40 06-13B028N_1-VH -rw-rw-r-- 1 skoumal users 155333 May 27 17:40 06-13O007N_2-VH -rw-rw-r-- 1 skoumal users 189392 May 27 17:40 06-14A006N_0-VH -rw-rw-r-- 1 skoumal users 143619 May 27 17:40 06-14A008N_3-VH -rw-rw-r-- 1 skoumal users 184286 May 27 17:40 06-14E001N_0-VH -rw-rw-r-- 1 skoumal users 149792 May 27 17:40 06-14P007N_3-VH -rw-rw-r-- 1 skoumal users 147049 May 27 17:40 06-14T007N_0-VH -rw-rw-r-- 1 skoumal users 164936 May 27 17:40 06-14T020N_1-VH -rw-rw-r-- 1 skoumal users 170219 May 27 17:40 06-14X016N_1-VH -rw-rw-r-- 1 skoumal users 128950 May 27 17:40 06-15E017N_5-VH -rw-rw-r-- 1 skoumal users 166806 May 27 17:40 06-15O010N_2-VH -rw-rw-r-- 1 skoumal users 137835 May 27 17:40 06-16A001N_3-VH -rw-rw-r-- 1 skoumal users 131549 May 27 17:40 06-16P002N_3-VH -rw-rw-r-- 1 skoumal users 106379 May 27 17:40 06-16P007N_5-VH -rw-rw-r-- 1 skoumal users 141727 May 27 17:40 06-16X003N_1-VH -rw-rw-r-- 1 skoumal users 169034 May 27 17:40 06-16X030N_1-VH
- (7028) – done:
-rw-r--r-- 1 skoumal users 91649 Jun 24 18:06 07-12A037N_4-VH -rw-r--r-- 1 skoumal users 161669 Jun 24 18:06 07-12O002N_0-VH -rw-r--r-- 1 skoumal users 144382 Jun 24 18:06 07-13A003N_2-VH -rw-r--r-- 1 skoumal users 148109 Jun 24 18:06 07-13A014N_0-VH -rw-r--r-- 1 skoumal users 106736 Jun 24 18:06 07-13A028N_4-VH -rw-r--r-- 1 skoumal users 142377 Jun 24 18:06 07-13A036N_5-VH -rw-r--r-- 1 skoumal users 149087 Jun 24 18:06 07-13A050N_0-VH -rw-r--r-- 1 skoumal users 166318 Jun 24 18:06 07-13E004N_6-VH -rw-r--r-- 1 skoumal users 154101 Jun 24 18:06 07-13O004N_0-VH -rw-r--r-- 1 skoumal users 144107 Jun 24 18:06 07-14T003N_4-VH -rw-r--r-- 1 skoumal users 121502 Jun 24 18:06 07-14T007N_2-VH -rw-r--r-- 1 skoumal users 149152 Jun 24 18:06 07-14T013N_3-VH -rw-r--r-- 1 skoumal users 152439 Jun 24 18:06 07-14X016N_4-VH -rw-r--r-- 1 skoumal users 131964 Jun 24 18:06 07-15C004N_0-VH -rw-r--r-- 1 skoumal users 134751 Jun 24 18:06 07-15O009N_0-VH -rw-r--r-- 1 skoumal users 129301 Jun 24 18:06 07-15P002N_0-VH -rw-r--r-- 1 skoumal users 180038 Jun 24 18:06 07-16A005N_0-VH -rw-r--r-- 1 skoumal users 133426 Jun 24 18:06 07-16A009N_3-VH -rw-r--r-- 1 skoumal users 125528 Jun 24 18:06 07-16P007N_2-VH -rw-r--r-- 1 skoumal users 128922 Jun 24 18:06 07-16X003N_3-VH -rw-r--r-- 1 skoumal users 140285 Jun 24 18:06 07-16X031N_0-VH -rw-r--r-- 1 skoumal users 129726 Jun 24 18:06 07-16X033N_3-VH
Anna Nováková - AN
- (7634) – done:
-rw-r--r-- 1 skoumal users 136812 Jun 24 18:06 08-12A009N_0-AN -rw-r--r-- 1 skoumal users 162292 Jun 24 18:06 08-12A031N_0-AN -rw-r--r-- 1 skoumal users 174507 Jun 24 18:06 08-13A018N_0-AN -rw-r--r-- 1 skoumal users 150144 Jun 24 18:06 08-13A036N_0-AN -rw-r--r-- 1 skoumal users 147524 Jun 24 18:06 08-13A090N_4-AN -rw-r--r-- 1 skoumal users 159163 Jun 24 18:06 08-13B019N_1-AN -rw-r--r-- 1 skoumal users 121300 Jun 24 18:06 08-13B028N_0-AN -rw-r--r-- 1 skoumal users 117897 Jun 24 18:06 08-13O009N_0-AN -rw-r--r-- 1 skoumal users 122809 Jun 24 18:06 08-13P004N_0-AN -rw-r--r-- 1 skoumal users 159431 Jun 24 18:06 08-14C006N_0-AN -rw-r--r-- 1 skoumal users 153240 Jun 24 18:06 08-14T003N_1-AN -rw-r--r-- 1 skoumal users 132491 Jun 24 18:06 08-14T014N_4-AN -rw-r--r-- 1 skoumal users 157464 Jun 24 18:06 08-14X016N_5-AN -rw-r--r-- 1 skoumal users 164278 Jun 24 18:06 08-15E010N_5-AN -rw-r--r-- 1 skoumal users 132281 Jun 24 18:06 08-15O010N_1-AN -rw-r--r-- 1 skoumal users 162485 Jun 24 18:06 08-15X020N_2-AN -rw-r--r-- 1 skoumal users 177612 Jun 24 18:06 08-15X041N_3-AN -rw-r--r-- 1 skoumal users 181697 Jun 24 18:06 08-16A005N_3-AN -rw-r--r-- 1 skoumal users 139212 Jun 24 18:06 08-16E005N_4-AN -rw-r--r-- 1 skoumal users 121915 Jun 24 18:06 08-16E007N_4-AN -rw-r--r-- 1 skoumal users 137518 Jun 24 18:06 08-16X003N_4-AN -rw-r--r-- 1 skoumal users 176342 Jun 24 18:06 08-16X026N_1-AN -rw-r--r-- 1 skoumal users 141328 Jun 24 18:06 08-16X031N_2-AN
Michal Havrda - MH
- (7117) – done:
-rw-r--r-- 1 skoumal users 160033 Jun 24 18:06 10-13A005N_4-MH -rw-r--r-- 1 skoumal users 137114 Jun 24 18:06 10-13A011N_3-MH -rw-r--r-- 1 skoumal users 112974 Jun 24 18:06 10-13A018N_2-MH -rw-r--r-- 1 skoumal users 157987 Jun 24 18:06 10-13A074N_5-MH -rw-r--r-- 1 skoumal users 129620 Jun 24 18:06 10-13B016N_1-MH -rw-r--r-- 1 skoumal users 152048 Jun 24 18:06 10-13O003N_0-MH -rw-r--r-- 1 skoumal users 126483 Jun 24 18:06 10-13P009N_0-MH -rw-r--r-- 1 skoumal users 157568 Jun 24 18:06 10-14A011N_1-MH -rw-r--r-- 1 skoumal users 154505 Jun 24 18:06 10-14C009N_2-MH -rw-r--r-- 1 skoumal users 181761 Jun 24 18:06 10-14O007N_0-MH -rw-r--r-- 1 skoumal users 135186 Jun 24 18:06 10-14P006N_1-MH -rw-r--r-- 1 skoumal users 143875 Jun 24 18:06 10-15O011N_0-MH -rw-r--r-- 1 skoumal users 122224 Jun 24 18:06 10-15O012N_0-MH -rw-r--r-- 1 skoumal users 166988 Jun 24 18:06 10-15P004N_0-MH -rw-r--r-- 1 skoumal users 122777 Jun 24 18:06 10-15T002N_2-MH -rw-r--r-- 1 skoumal users 202198 Jun 24 18:06 10-15T003N_1-MH -rw-r--r-- 1 skoumal users 190825 Jun 24 18:06 10-15T011N_4-MH -rw-r--r-- 1 skoumal users 181748 Jun 24 18:06 10-15X020N_0-MH -rw-r--r-- 1 skoumal users 161548 Jun 24 18:06 10-16A002N_0-MH -rw-r--r-- 1 skoumal users 114890 Jun 24 18:06 10-16E007N_5-MH -rw-r--r-- 1 skoumal users 128041 Jun 24 18:06 10-16P004N_3-MH -rw-r--r-- 1 skoumal users 131846 Jun 24 18:06 10-16X003N_0-MH
Pavel Kopřiva - PK
- (7328) – done:
-rw-r--r-- 1 skoumal users 123564 Jun 24 18:06 09-12A004N_1-PK -rw-r--r-- 1 skoumal users 140637 Jun 24 18:06 09-12A034N_3-PK -rw-r--r-- 1 skoumal users 137184 Jun 24 18:06 09-12H004N_1-PK -rw-r--r-- 1 skoumal users 118561 Jun 24 18:06 09-13A003N_0-PK -rw-r--r-- 1 skoumal users 151930 Jun 24 18:06 09-13A074N_4-PK -rw-r--r-- 1 skoumal users 137469 Jun 24 18:06 09-13A090N_2-PK -rw-r--r-- 1 skoumal users 115312 Jun 24 18:06 09-13B011N_0-PK -rw-r--r-- 1 skoumal users 161409 Jun 24 18:06 09-13B027N_0-PK -rw-r--r-- 1 skoumal users 156513 Jun 24 18:06 09-13O007N_3-PK -rw-r--r-- 1 skoumal users 143758 Jun 24 18:06 09-13P008N_1-PK -rw-r--r-- 1 skoumal users 147253 Jun 24 18:06 09-13T029N_3-PK -rw-r--r-- 1 skoumal users 218778 Jun 24 18:06 09-13X003N_0-PK -rw-r--r-- 1 skoumal users 121754 Jun 24 18:06 09-14A016N_0-PK -rw-r--r-- 1 skoumal users 132630 Jun 24 18:06 09-14C006N_3-PK -rw-r--r-- 1 skoumal users 98129 Jun 24 18:06 09-14T024N_4-PK -rw-r--r-- 1 skoumal users 171839 Jun 24 18:06 09-14X016N_2-PK -rw-r--r-- 1 skoumal users 143302 Jun 24 18:06 09-15O004N_2-PK -rw-r--r-- 1 skoumal users 170311 Jun 24 18:06 09-15X041N_0-PK -rw-r--r-- 1 skoumal users 148747 Jun 24 18:06 09-15X044N_1-PK -rw-r--r-- 1 skoumal users 169442 Jun 24 18:06 09-16A002N_1-PK -rw-r--r-- 1 skoumal users 104915 Jun 24 18:06 09-16E007N_1-PK -rw-r--r-- 1 skoumal users 145808 Jun 24 18:06 09-16X030N_0-PK
Checking and accepting the texts
- Done the same way as for annotation.
- We work on jakobson.
- First check the texts:
    cd /net/grimm/store/corp/ortofon-etalon/csts-import
    for ff in 04-15X030N_3-PK 04-15X043N_2-PK 04-16A009N_2-PK; \
      do echo $ff; \
      /usr/local/annotate/bin/csts-export.pl --verbose $ff > /dev/null; done
- If everything is fine, save the files.
Converting back to the vertical format, corrections
- Create the directory vert-export and convert the files into it:
    cd csts-export
    for ff in *; do echo $ff; oral-csts-vert.pl < $ff > ../vert-export/$ff; done
  Entities are fixed here as well, so the first column should match the original:
    cd ../chunks
    for ff in *; do echo $ff; sdiff -s <(cut -f1 $ff) <(cut -f1 ../vert-export/${ff%.vrt}-??); done
- Fix invalid and X@.
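The first-column check above can also be wrapped in a small yes/no test. The sketch below is an illustration under assumptions, not the procedure used on the project: the helper name `same_first_column` is hypothetical, it uses `cmp` on temporary files instead of `sdiff` with process substitution, and the two demo files are made up.

```shell
#!/bin/sh
# same_first_column A B: succeed when column 1 of two tab-separated
# vertical files is identical (hypothetical helper name).
same_first_column() {
    tmp=$(mktemp -d) || return 2
    cut -f1 "$1" > "$tmp/a"
    cut -f1 "$2" > "$tmp/b"
    cmp -s "$tmp/a" "$tmp/b"
    status=$?
    rm -rf "$tmp"
    return $status
}

# Made-up demo files: same words, but the export has extra columns.
work=$(mktemp -d)
printf 'tak\ntaky\n' > "$work/orig.vrt"
printf 'tak\tLEMMA1\tTAG1\ntaky\tLEMMA2\tTAG2\n' > "$work/export"
same_first_column "$work/orig.vrt" "$work/export" && echo "first column matches"
```

`cmp -s` only reports whether the columns differ; to see where they differ, the `sdiff -s` call from the step above is still the tool to use.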
Problematic corrections by Horký: questions for MK and DL
- Tokenization:
  - dvěstě → dvě stě
  - v o → vo
  - od tamaď → odtamaď
  - třinácet → třináct set
  - takovýty → takový ty
  - tyjo → ty jo
  - napohodu → na pohodu
  - ježíš maria → ježíšmarja
  - v spára → V - spára (shouldn't the transcription be vé?)
  - ježíši maria → ježíšimarja
  - devatenácet → devatenáct set
  - osmnáctset → osmnáct set
Preparing data for automatic annotation (hybrid vs. MorphoDiTa)
- The data are on grimm in the directory /store/corp/Ortofon.
- We work with the vertical taken from the files in /store/corp/Ortofon/ortofon-etalon/Verze/2/1.
- Preparing the shared data from ortofon-etalon/davka-?:
    cd ortofon-hybrid/davka-1
    mkdir vert
    cd ../../ortofon-etalon/davka-1/Verze/2/1
    for ff in *; do echo $ff; cut -f1 $ff | perl -pe 's/^<.*>$//' \
      | cat -s > ../../../../../ortofon-hybrid/davka-1/vert/$ff; done
    cd ../../../../../ortofon-hybrid/davka-1
    make-corp.sh -s vert -t csts -v -p45
    make-corp.sh -A1 -B1 -Eucs2 -M -p45 -s csts -t csts-morf -v
ortofon-hybrid
- The data are run through our entire hybrid tagger and finally adjusted to the needs of Ortofon.
- Preparing the data:
    cd .../ortofon-hybrid/davka-?
    rsync -avz ../../ortofon-manual/davka-?/csts-morf .
- Honza's script processing_hybrid.pl (runs on the vertical):
    make-corp.sh -s csts-morf -t vert-morf -p45 -v
    cd /usr/local/corp/Perl/Ortofon
    ./processing_hybrid.pl /store/corp/Ortofon/ortofon-hybrid/davka-2/vert-morf
    cd -
    cp -pr vert-morf vert-morf.ori
    cd vert-morf
    for ff in *.out; do mv $ff ${ff%.out}; done
    cd -
    mv csts-morf csts-morf.ori
    make-corp.sh -s vert-morf -t csts-morf -p45 -v
- Rules through to the end:
    make-whole-corp-csts.sh -C1 -Eucs2 -f -M -p45 -trules -v
- A few more minor corrections (já → my etc.) with Tomáš's script EtalonizaceVertikaly.pl:
    make-corp.sh -s csts-rules-frazrl-rulh1-tag-vid-corr -t vert-rules-frazrl-rulh1-tag-vid-corr -p45 -v
    parallel-filter.sh -C "/usr/local/corp/Perl/EtalonizaceVertikaly.pl" \
      -s vert-rules-frazrl-rulh1-tag-vid-corr -t vert-hybrid -p45 -v
- Honza's script postprocessing16.pl (runs on the vertical):
    cd /usr/local/corp/Perl/Ortofon
    ./postprocessing16.pl /store/corp/Ortofon/ortofon-hybrid/davka-2/vert-hybrid
    cd /store/corp/Ortofon/ortofon-hybrid/davka-2/vert-hybrid
    mkdir ../vert-hybrid-out
    for ff in *.post; do mv $ff ../vert-hybrid-out/${ff%.post}; done
    cd ../vert-hybrid-out
    for ff in *; do echo $ff; sed '1{/^$/d}' $ff > ../../../ortofon-automat/davka-2/vert-hybrid/$ff; done
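Two shell idioms recur in the steps above: stripping a suffix with `${ff%.post}` and dropping an empty first line with `sed`. The sketch below shows both on made-up files in a scratch directory; it does not call the project scripts. The sed expression is written as `1{/^$/d;}` (with a semicolon before the closing brace), which is an assumption made for portability to non-GNU sed.

```shell
#!/bin/sh
# Made-up files in a scratch directory; only the two shell idioms
# from the steps above are shown, not the project scripts themselves.
work=$(mktemp -d)
mkdir "$work/in" "$work/out"
printf '\ntak\ttak\n' > "$work/in/06-12A011N_0.post"   # starts with an empty line
printf 'jo\tjo\n'     > "$work/in/06-12P004N_3.post"   # no empty first line

for ff in "$work"/in/*.post; do
    name=$(basename "$ff")
    # ${name%.post} strips the suffix; sed deletes line 1 only if it is empty.
    sed '1{/^$/d;}' "$ff" > "$work/out/${name%.post}"
done

ls "$work/out"
```

Files that do not start with an empty line pass through unchanged, so the loop is safe to run on a mixed set.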
ortofon-morphodita
- For MorphoDiTa we prepare the morphology, which must however be consistent with the Etalon and with David's scripts.
- Preparing the data:
    cd .../ortofon-morphodita
    mkdir davka-?
    cd davka-?
    rsync -avz ../../ortofon-manual/davka-?/csts-morf .
- Adding verbal aspect (vid):
    make-corp.sh -s csts-morf -t csts-morf-vid -v -p45
- Aspect corrections, expansion of variables, tag simplification, and removal of duplicates:
    parallel-filter.sh \
      -C "corr-asp.pl | JH-wide-csts.sh | simplify-tags-csts-utf.pl | remove-dupl-csts-mark.pl X" \
      -p45 -s csts-morf-vid -t csts-morf-vid-corr -v
- Tomáš's script:
    make-corp.sh -s csts-morf-vid-corr -t vert-morf-vid-corr -p45 -v
    parallel-filter.sh -C EtalonizaceVertikaly.pl -s vert-morf-vid-corr -t vert-morf-vid-corr-etln -p45 -v
- Honza's script processing_mdita.pl (runs on the vertical). It produces .out files:
    cd /usr/local/corp/Perl/Ortofon
    ./processing_mdita.pl /store/corp/Ortofon/ortofon-morphodita/davka-?/vert-morf-vid-corr-etln
- Convert Honza's output into input for MorphoDiTa:
    cd /store/corp/Ortofon/ortofon-morphodita/davka-?
    mkdir vert-morphodita-in
    cd vert-morf-vid-corr-etln
    for ff in *.out; do echo $ff; sed '1{/^$/d}' $ff > ../vert-morphodita-in/${ff%.out}; done
    rm *.out
- Run MorphoDiTa and store the result in /store/corp/Ortofon/ortofon-morphodita/vert-morphodita-out.
- Honza's script postprocessing16.pl (runs on the vertical):
    parallel-filter.sh -C "cut -f1-3 | perl -pe 's/(\t.*)\t/\$1 /'" -s vert-morphodita-out \
      -t vert-morphodita-result -p45 -v
    cd /usr/local/corp/Perl/Ortofon
    ./postprocessing16.pl /store/corp/Ortofon/ortofon-morphodita/davka-2/vert-morphodita-result
    cd -
- Move the result into the directory where MorphoDiTa is merged with the hybrid for manual annotation:
    cd ../../ortofon-automat
    mkdir -p davka-2/vert-morphodita
    cd ../ortofon-morphodita/davka-2/vert-morphodita-result
    for ff in *.post; do mv $ff ../../../ortofon-automat/davka-2/vert-morphodita/${ff%.post}; done
Merging the results and preparing the import (ortofon-automat)
- Everything is in the directories ortofon-automat/davka-?.
- The directory vert-hybrid holds the results of the hybrid (see above).
- The directory vert-morphodita holds the results of MorphoDiTa (see above).
- Merge MorphoDiTa and the hybrid into the directory vert-paste:
    cd .../ortofon-automat/davka-?
    mkdir vert-paste
    cd vert-morphodita
    for ff in *; do paste $ff <(cut -f2- ../vert-hybrid/$ff) | perl -pe 's/^[\t\ ]+$//' > ../vert-paste/$ff; done
- Convert to csts:
    make-corp.sh -s vert-paste -t csts-paste -p45 -v
- Remove duplicates:
    parallel-filter.sh -C remove-dupl-csts.pl -p45 -s csts-paste -t csts-import -v
Annotators and assigned files (davka-1)
Pavel Kopřiva
- prepaid 9,800, i.e. 14,000 words; batch 1: 14,197 words, done
-rw-rw-r-- 1 skoumal users 34407 May 23 14:41 01-12A005N_1-PK -rw-rw-r-- 1 skoumal users 46102 May 23 14:41 01-13A014N_1-PK -rw-rw-r-- 1 skoumal users 40307 May 23 14:41 01-13A028N_1-PK -rw-rw-r-- 1 skoumal users 41092 May 23 14:41 01-13B031N_1-PK -rw-rw-r-- 1 skoumal users 41144 May 23 14:41 01-13H005N_1-PK -rw-rw-r-- 1 skoumal users 44019 May 23 14:41 01-13O009N_3-PK -rw-rw-r-- 1 skoumal users 46809 May 23 14:41 01-13P004N_1-PK -rw-rw-r-- 1 skoumal users 43321 May 23 14:41 01-13P009N_2-PK -rw-rw-r-- 1 skoumal users 39320 May 23 14:41 01-14T003N_6-PK -rw-rw-r-- 1 skoumal users 44702 May 23 14:41 01-14T007N_1-PK -rw-rw-r-- 1 skoumal users 41621 May 23 14:41 01-14T010N_0-PK -rw-rw-r-- 1 skoumal users 44150 May 23 14:41 01-14X016N_6-PK -rw-rw-r-- 1 skoumal users 41916 May 23 14:41 01-14X019N_0-PK -rw-rw-r-- 1 skoumal users 44807 May 23 14:41 01-15O001N_0-PK -rw-rw-r-- 1 skoumal users 41851 May 23 14:41 01-15O004N_0-PK -rw-rw-r-- 1 skoumal users 44557 May 23 14:41 01-15O007N_1-PK -rw-rw-r-- 1 skoumal users 45038 May 23 14:41 01-15P001N_1-PK -rw-rw-r-- 1 skoumal users 50573 May 23 14:41 01-15T005N_0-PK -rw-rw-r-- 1 skoumal users 42191 May 23 14:41 01-15X045N_2-PK -rw-rw-r-- 1 skoumal users 42876 May 23 14:41 01-16A005N_2-PK -rw-rw-r-- 1 skoumal users 41182 May 23 14:41 01-16P002N_2-PK -rw-rw-r-- 1 skoumal users 37522 May 23 14:41 01-16X033N_5-PK -rw-rw-r-- 1 skoumal users 37307 May 23 14:41 02-12A011N_1-PK -rw-rw-r-- 1 skoumal users 43296 May 23 14:41 02-13A011N_5-PK -rw-rw-r-- 1 skoumal users 43113 May 23 14:41 02-13A036N_3-PK -rw-rw-r-- 1 skoumal users 42816 May 23 14:41 02-13B031N_0-PK -rw-rw-r-- 1 skoumal users 39465 May 23 14:41 02-13O009N_1-PK -rw-rw-r-- 1 skoumal users 45071 May 23 14:41 02-13O010N_2-PK -rw-rw-r-- 1 skoumal users 45410 May 23 14:41 02-13O013N_1-PK -rw-rw-r-- 1 skoumal users 41771 May 23 14:41 02-13P004N_3-PK -rw-rw-r-- 1 skoumal users 36768 May 23 14:41 02-13T029N_6-PK -rw-rw-r-- 1 skoumal users 39927 May 23 14:41 02-14A016N_2-PK -rw-rw-r-- 1 
skoumal users 42561 May 23 14:41 02-14E003N_0-PK -rw-rw-r-- 1 skoumal users 44630 May 23 14:41 02-14T003N_5-PK -rw-rw-r-- 1 skoumal users 41947 May 23 14:41 02-14T013N_2-PK -rw-rw-r-- 1 skoumal users 51442 May 23 14:41 02-14T020N_3-PK -rw-rw-r-- 1 skoumal users 49831 May 23 14:41 02-15O002N_0-PK -rw-rw-r-- 1 skoumal users 40571 May 23 14:41 02-15O004N_1-PK -rw-rw-r-- 1 skoumal users 43099 May 23 14:41 02-15O009N_1-PK -rw-rw-r-- 1 skoumal users 41565 May 23 14:41 02-15P001N_3-PK -rw-rw-r-- 1 skoumal users 41993 May 23 14:41 02-15X041N_1-PK -rw-rw-r-- 1 skoumal users 41233 May 23 14:41 02-16A001N_0-PK -rw-rw-r-- 1 skoumal users 42439 May 23 14:41 02-16A005N_5-PK -rw-rw-r-- 1 skoumal users 46223 May 23 14:41 02-16P002N_0-PK -rw-rw-r-- 1 skoumal users 38555 May 23 14:41 02-16X003N_5-PK -rw-rw-r-- 1 skoumal users 42220 May 23 14:41 03-12A035N_3-PK -rw-rw-r-- 1 skoumal users 44055 May 23 14:41 03-13A014N_4-PK -rw-rw-r-- 1 skoumal users 40827 May 23 14:41 03-13O009N_2-PK -rw-rw-r-- 1 skoumal users 40803 May 23 14:41 03-13P010N_1-PK -rw-rw-r-- 1 skoumal users 46769 May 23 14:41 03-14A011N_3-PK -rw-rw-r-- 1 skoumal users 41318 May 23 14:41 03-14A016N_4-PK -rw-rw-r-- 1 skoumal users 44503 May 23 14:41 03-14P007N_2-PK -rw-rw-r-- 1 skoumal users 47542 May 23 14:41 03-14T010N_2-PK -rw-rw-r-- 1 skoumal users 47136 May 23 14:41 03-14T013N_0-PK -rw-rw-r-- 1 skoumal users 47617 May 23 14:41 03-14T020N_0-PK -rw-rw-r-- 1 skoumal users 43448 May 23 14:41 03-14X019N_4-PK -rw-rw-r-- 1 skoumal users 45235 May 23 14:41 03-14X021N_2-PK -rw-rw-r-- 1 skoumal users 43916 May 23 14:41 03-15E003N_0-PK -rw-rw-r-- 1 skoumal users 43217 May 23 14:41 03-15E015N_1-PK -rw-rw-r-- 1 skoumal users 42814 May 23 14:41 03-15O002N_1-PK -rw-rw-r-- 1 skoumal users 47330 May 23 14:41 03-15O007N_0-PK -rw-rw-r-- 1 skoumal users 47496 May 23 14:41 03-15X020N_1-PK -rw-rw-r-- 1 skoumal users 44998 May 23 14:41 03-15X041N_2-PK -rw-rw-r-- 1 skoumal users 47227 May 23 14:41 03-16A005N_4-PK -rw-rw-r-- 1 skoumal users 
43259 May 23 14:41 03-16P004N_0-PK -rw-rw-r-- 1 skoumal users 47252 May 23 14:41 03-16X001N_1-PK -rw-rw-r-- 1 skoumal users 40593 May 23 14:41 03-16X031N_4-PK -rw-rw-r-- 1 skoumal users 40675 May 23 14:41 04-12A025N_0-PK -rw-rw-r-- 1 skoumal users 46531 May 23 14:41 04-12P004N_4-PK -rw-rw-r-- 1 skoumal users 45177 May 23 14:41 04-13B009N_0-PK -rw-rw-r-- 1 skoumal users 43618 May 23 14:41 04-13B019N_0-PK -rw-rw-r-- 1 skoumal users 44753 May 23 14:41 04-13B025N_0-PK -rw-rw-r-- 1 skoumal users 43003 May 23 14:41 04-13O007N_1-PK -rw-rw-r-- 1 skoumal users 44914 May 23 14:41 04-13O010N_0-PK -rw-rw-r-- 1 skoumal users 43164 May 23 14:41 04-13O013N_0-PK -rw-rw-r-- 1 skoumal users 44184 May 23 14:41 04-13P004N_2-PK -rw-rw-r-- 1 skoumal users 42275 May 23 14:41 04-14P007N_0-PK -rw-rw-r-- 1 skoumal users 49243 May 23 14:41 04-14T003N_0-PK -rw-rw-r-- 1 skoumal users 40458 May 23 14:41 04-14T010N_1-PK -rw-rw-r-- 1 skoumal users 42344 May 23 14:41 04-14T013N_1-PK -rw-rw-r-- 1 skoumal users 40829 May 23 14:41 04-14X016N_3-PK -rw-rw-r-- 1 skoumal users 40664 May 23 14:41 04-15O010N_0-PK -rw-rw-r-- 1 skoumal users 43180 May 23 14:41 04-15P001N_2-PK -rw-rw-r-- 1 skoumal users 40232 May 23 14:41 04-15P006N_0-PK -rw-rw-r-- 1 skoumal users 48556 May 23 14:41 05-14T010N_3-PK -rw-rw-r-- 1 skoumal users 44430 May 23 14:41 05-14T019N_0-PK -rw-rw-r-- 1 skoumal users 39720 May 23 14:41 05-14X012N_2-PK -rw-rw-r-- 1 skoumal users 49071 May 23 14:41 05-14X019N_2-PK -rw-rw-r-- 1 skoumal users 49252 May 23 14:41 05-14X019N_3-PK -rw-rw-r-- 1 skoumal users 44280 May 23 14:41 05-15O001N_1-PK -rw-rw-r-- 1 skoumal users 48997 May 23 14:41 05-15P001N_0-PK -rw-rw-r-- 1 skoumal users 49363 May 23 14:41 05-15X009N_1-PK -rw-rw-r-- 1 skoumal users 48507 May 23 14:41 05-15X020N_3-PK -rw-rw-r-- 1 skoumal users 38867 May 23 14:41 05-15X043N_5-PK -rw-rw-r-- 1 skoumal users 43559 May 23 14:41 05-16A005N_1-PK -rw-rw-r-- 1 skoumal users 44935 May 23 14:41 05-16E007N_0-PK -rw-rw-r-- 1 skoumal users 41154 May 23 
14:41 05-16P002N_1-PK -rw-rw-r-- 1 skoumal users 44495 May 23 14:41 05-16P007N_1-PK -rw-rw-r-- 1 skoumal users 46831 May 23 14:41 05-16X001N_2-PK
Michal Havrda
- available the whole time; batch 1: 1,938 words, done, not yet reported
-rw-rw-r-- 1 skoumal users 42289 May 23 14:41 04-15X030N_3-MH -rw-rw-r-- 1 skoumal users 46559 May 23 14:41 04-15X043N_2-MH -rw-rw-r-- 1 skoumal users 45223 May 23 14:41 04-16A009N_2-MH -rw-rw-r-- 1 skoumal users 43922 May 23 14:41 04-16E007N_2-MH -rw-rw-r-- 1 skoumal users 48672 May 23 14:41 04-16P004N_1-MH -rw-rw-r-- 1 skoumal users 42605 May 23 14:41 04-16X003N_2-MH -rw-rw-r-- 1 skoumal users 41810 May 23 14:41 05-12P004N_2-MH -rw-rw-r-- 1 skoumal users 42430 May 23 14:41 05-13A011N_0-MH -rw-rw-r-- 1 skoumal users 46764 May 23 14:41 05-13A014N_3-MH -rw-rw-r-- 1 skoumal users 40185 May 23 14:41 05-13A023N_3-MH -rw-rw-r-- 1 skoumal users 41116 May 23 14:41 05-13B005N_1-MH -rw-rw-r-- 1 skoumal users 48129 May 23 14:41 05-13D015N_0-MH -rw-rw-r-- 1 skoumal users 44079 May 23 14:41 05-13O007N_0-MH -rw-rw-r-- 1 skoumal users 45164 May 23 14:41 05-14A011N_2-MH
Anna Nováková
- available only from July
Šárka Kadavá
- if worst comes to worst
Václav Horký
- as a student research assistant (pomvěd)
Annotators and assigned files (davka-2)
Pavel Kopřiva
- batch 2: 15,313 words
    -rw-rw-r-- 1 skoumal users 42720 May 30 17:32 06-12A011N_0-PK
    -rw-rw-r-- 1 skoumal users 45138 May 30 17:32 06-12P004N_3-PK
    -rw-rw-r-- 1 skoumal users 45138 May 30 17:32 06-13A003N_1-PK
    -rw-rw-r-- 1 skoumal users 44326 May 30 17:32 06-13A014N_2-PK
    -rw-rw-r-- 1 skoumal users 45282 May 30 17:32 06-13A028N_2-PK
    -rw-rw-r-- 1 skoumal users 46555 May 30 17:32 06-13A074N_1-PK
    -rw-rw-r-- 1 skoumal users 43331 May 30 17:32 06-13B005N_0-PK
    -rw-rw-r-- 1 skoumal users 48089 May 30 17:32 06-13B028N_1-PK
    -rw-rw-r-- 1 skoumal users 40901 May 30 17:32 06-13O007N_2-PK
    -rw-rw-r-- 1 skoumal users 48638 May 30 17:32 06-14A006N_0-PK
    -rw-rw-r-- 1 skoumal users 40347 May 30 17:32 06-14A008N_3-PK
    -rw-rw-r-- 1 skoumal users 47879 May 30 17:32 06-14E001N_0-PK
    -rw-rw-r-- 1 skoumal users 44009 May 30 17:32 06-14P007N_3-PK
    -rw-rw-r-- 1 skoumal users 44757 May 30 17:32 06-14T007N_0-PK
    -rw-rw-r-- 1 skoumal users 45276 May 30 17:32 06-14T020N_1-PK
    -rw-rw-r-- 1 skoumal users 46545 May 30 17:32 06-14X016N_1-PK
    -rw-rw-r-- 1 skoumal users 42302 May 30 17:32 06-15E017N_5-PK
    -rw-rw-r-- 1 skoumal users 51075 May 30 17:32 06-15O010N_2-PK
    -rw-rw-r-- 1 skoumal users 41040 May 30 17:32 06-16A001N_3-PK
    -rw-rw-r-- 1 skoumal users 43143 May 30 17:32 06-16P002N_3-PK
    -rw-rw-r-- 1 skoumal users 35558 May 30 17:32 06-16P007N_5-PK
    -rw-rw-r-- 1 skoumal users 46698 May 30 17:32 06-16X003N_1-PK
    -rw-rw-r-- 1 skoumal users 45687 May 30 17:32 06-16X030N_1-PK
    -rw-rw-r-- 1 skoumal users 39012 May 30 17:32 07-12A037N_4-PK
    -rw-rw-r-- 1 skoumal users 44139 May 30 17:32 07-12O002N_0-PK
    -rw-rw-r-- 1 skoumal users 46794 May 30 17:32 07-13A003N_2-PK
    -rw-rw-r-- 1 skoumal users 46770 May 30 17:32 07-13A014N_0-PK
    -rw-rw-r-- 1 skoumal users 39791 May 30 17:32 07-13A028N_4-PK
    -rw-rw-r-- 1 skoumal users 43325 May 30 17:32 07-13A036N_5-PK
    -rw-rw-r-- 1 skoumal users 48700 May 30 17:32 07-13A050N_0-PK
    -rw-rw-r-- 1 skoumal users 42543 May 30 17:32 07-13E004N_6-PK
    -rw-rw-r-- 1 skoumal users 43506 May 30 17:32 07-13O004N_0-PK
    -rw-rw-r-- 1 skoumal users 44126 May 30 17:32 07-14T003N_4-PK
    -rw-rw-r-- 1 skoumal users 41497 May 30 17:32 07-14T007N_2-PK
    -rw-rw-r-- 1 skoumal users 45303 May 30 17:32 07-14T013N_3-PK
    -rw-rw-r-- 1 skoumal users 46002 May 30 17:32 07-14X016N_4-PK
    -rw-rw-r-- 1 skoumal users 42059 May 30 17:32 07-15C004N_0-PK
    -rw-rw-r-- 1 skoumal users 46510 May 30 17:32 07-15O009N_0-PK
    -rw-rw-r-- 1 skoumal users 41922 May 30 17:32 07-15P002N_0-PK
    -rw-rw-r-- 1 skoumal users 47748 May 30 17:32 07-16A005N_0-PK
    -rw-rw-r-- 1 skoumal users 40438 May 30 17:32 07-16A009N_3-PK
    -rw-rw-r-- 1 skoumal users 40615 May 30 17:32 07-16P007N_2-PK
    -rw-rw-r-- 1 skoumal users 45292 May 30 17:32 07-16X003N_3-PK
    -rw-rw-r-- 1 skoumal users 41981 May 30 17:32 07-16X031N_0-PK
    -rw-rw-r-- 1 skoumal users 44339 May 30 17:32 07-16X033N_3-PK
    -rw-rw-r-- 1 skoumal users 40608 May 30 17:32 08-12A009N_0-PK
    -rw-rw-r-- 1 skoumal users 45404 May 30 17:32 08-12A031N_0-PK
    -rw-rw-r-- 1 skoumal users 50613 May 30 17:32 08-13A018N_0-PK
    -rw-rw-r-- 1 skoumal users 43737 May 30 17:32 08-13A036N_0-PK
    -rw-rw-r-- 1 skoumal users 41190 May 30 17:32 08-13A090N_4-PK
    -rw-rw-r-- 1 skoumal users 43500 May 30 17:32 08-13B019N_1-PK
    -rw-rw-r-- 1 skoumal users 42045 May 30 17:32 08-13B028N_0-PK
    -rw-rw-r-- 1 skoumal users 39830 May 30 17:32 08-13O009N_0-PK
    -rw-rw-r-- 1 skoumal users 45677 May 30 17:32 08-13P004N_0-PK
    -rw-rw-r-- 1 skoumal users 43908 May 30 17:32 08-14C006N_0-PK
    -rw-rw-r-- 1 skoumal users 46106 May 30 17:32 08-14T003N_1-PK
    -rw-rw-r-- 1 skoumal users 42032 May 30 17:32 08-14T014N_4-PK
    -rw-rw-r-- 1 skoumal users 48395 May 30 17:32 08-14X016N_5-PK
    -rw-rw-r-- 1 skoumal users 41217 May 30 17:32 08-15E010N_5-PK
    -rw-rw-r-- 1 skoumal users 42667 May 30 17:32 08-15O010N_1-PK
    -rw-rw-r-- 1 skoumal users 46423 May 30 17:32 08-15X020N_2-PK
    -rw-rw-r-- 1 skoumal users 49698 May 30 17:32 08-15X041N_3-PK
    -rw-rw-r-- 1 skoumal users 50216 May 30 17:32 08-16A005N_3-PK
    -rw-rw-r-- 1 skoumal users 47850 May 30 17:32 08-16E005N_4-PK
    -rw-rw-r-- 1 skoumal users 43570 May 30 17:32 08-16E007N_4-PK
    -rw-rw-r-- 1 skoumal users 45400 May 30 17:32 08-16X003N_4-PK
    -rw-rw-r-- 1 skoumal users 44771 May 30 17:32 08-16X026N_1-PK
    -rw-rw-r-- 1 skoumal users 44471 May 30 17:32 08-16X031N_2-PK
    -rw-rw-r-- 1 skoumal users 40769 May 30 17:32 09-12A004N_1-PK
    -rw-rw-r-- 1 skoumal users 44435 May 30 17:32 09-12A034N_3-PK
    -rw-rw-r-- 1 skoumal users 45093 May 30 17:32 09-12H004N_1-PK
    -rw-rw-r-- 1 skoumal users 40828 May 30 17:32 09-13A003N_0-PK
    -rw-rw-r-- 1 skoumal users 48448 May 30 17:32 09-13A074N_4-PK
    -rw-rw-r-- 1 skoumal users 42901 May 30 17:32 09-13A090N_2-PK
    -rw-rw-r-- 1 skoumal users 43918 May 30 17:32 09-13B011N_0-PK
    -rw-rw-r-- 1 skoumal users 46136 May 30 17:32 09-13B027N_0-PK
    -rw-rw-r-- 1 skoumal users 43410 May 30 17:32 09-13O007N_3-PK
    -rw-rw-r-- 1 skoumal users 44953 May 30 17:32 09-13P008N_1-PK
    -rw-rw-r-- 1 skoumal users 44550 May 30 17:32 09-13T029N_3-PK
    -rw-rw-r-- 1 skoumal users 50559 May 30 17:32 09-13X003N_0-PK
    -rw-rw-r-- 1 skoumal users 42449 May 30 17:32 09-14A016N_0-PK
    -rw-rw-r-- 1 skoumal users 39140 May 30 17:32 09-14C006N_3-PK
    -rw-rw-r-- 1 skoumal users 36023 May 30 17:32 09-14T024N_4-PK
    -rw-rw-r-- 1 skoumal users 47500 May 30 17:32 09-14X016N_2-PK
    -rw-rw-r-- 1 skoumal users 46232 May 30 17:32 09-15O004N_2-PK
    -rw-rw-r-- 1 skoumal users 49245 May 30 17:32 09-15X041N_0-PK
    -rw-rw-r-- 1 skoumal users 41205 May 30 17:32 09-15X044N_1-PK
    -rw-rw-r-- 1 skoumal users 45831 May 30 17:32 09-16A002N_1-PK
    -rw-rw-r-- 1 skoumal users 42869 May 30 17:32 09-16E007N_1-PK
    -rw-rw-r-- 1 skoumal users 43627 May 30 17:32 09-16X030N_0-PK
    -rw-rw-r-- 1 skoumal users 44994 May 30 17:32 10-13A005N_4-PK
    -rw-rw-r-- 1 skoumal users 41142 May 30 17:32 10-13A011N_3-PK
    -rw-rw-r-- 1 skoumal users 41662 May 30 17:32 10-13A018N_2-PK
    -rw-rw-r-- 1 skoumal users 47470 May 30 17:32 10-13A074N_5-PK
    -rw-rw-r-- 1 skoumal users 39450 May 30 17:32 10-13B016N_1-PK
    -rw-rw-r-- 1 skoumal users 44065 May 30 17:32 10-13O003N_0-PK
    -rw-rw-r-- 1 skoumal users 43813 May 30 17:32 10-13P009N_0-PK
    -rw-rw-r-- 1 skoumal users 44046 May 30 17:32 10-14A011N_1-PK
    -rw-rw-r-- 1 skoumal users 46636 May 30 17:32 10-14C009N_2-PK
    -rw-rw-r-- 1 skoumal users 48973 May 30 17:32 10-14O007N_0-PK
    -rw-rw-r-- 1 skoumal users 40730 May 30 17:32 10-14P006N_1-PK
    -rw-rw-r-- 1 skoumal users 49089 May 30 17:32 10-15O011N_0-PK
    -rw-rw-r-- 1 skoumal users 43590 May 30 17:32 10-15O012N_0-PK
    -rw-rw-r-- 1 skoumal users 41094 May 30 17:32 10-15P004N_0-PK
    -rw-rw-r-- 1 skoumal users 39004 May 30 17:32 10-15T002N_2-PK
    -rw-rw-r-- 1 skoumal users 45379 May 30 17:32 10-15T003N_1-PK
    -rw-rw-r-- 1 skoumal users 46787 May 30 17:32 10-15T011N_4-PK
    -rw-rw-r-- 1 skoumal users 49274 May 30 17:32 10-15X020N_0-PK
    -rw-rw-r-- 1 skoumal users 43482 May 30 17:32 10-16A002N_0-PK
    -rw-rw-r-- 1 skoumal users 43649 May 30 17:32 10-16E007N_5-PK
    -rw-rw-r-- 1 skoumal users 46847 May 30 17:32 10-16P004N_3-PK
    -rw-rw-r-- 1 skoumal users 44270 May 30 17:32 10-16X003N_0-PK
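The word count quoted for the batch is given without the command that produced it. A minimal sketch of one way to recount it, assuming every token sits on its own line beginning with `<f` (as in the `<f>vole<…` lines patched in the morphology step); `count_tokens` is a hypothetical helper name, and whether the original figure counted tokens in exactly this way is not stated:

```shell
#!/bin/sh
# Hypothetical helper: print the number of token lines in the given csts files.
# Assumption: one token per line, each starting with "<f" (e.g. <f>vole<MMl>...).
count_tokens() {
    # cat joins all files; grep -c counts the matching lines in the stream.
    cat -- "$@" | grep -c '^<f'
}

# Example (not run here): count_tokens 0[6-9]-*-PK 10-*-PK
```

Note that `grep -c` prints `0` but returns a non-zero exit status when nothing matches, which matters if the helper is used under `set -e`.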
Merging manual and automatic annotation
Data preparation
- Under the directory [/net/grimm]/store/corp/Ortofon we create the subdirectory ortofon-merge and, inside it, davka-?/csts-import and davka-?/csts-merge.
- In each csts-merge directory we prepare the files for merging.
- From the directory …/ortofon-automat/davka-?/csts-export we copy the files and convert the suffix to lowercase. While copying, we select unique tags right away:
    parallel-filter.sh -C /net/grimm/usr/local/corp/bin/unique-tag.pl -p6 \
        -s ../../ortofon-automat/davka-?/csts-export -t csts-merge -v
    cd csts-merge
    for ff in *-PK; do gg=${ff%-PK}-pk; echo "$ff $gg"; mv $ff $gg; done
  We do this for every suffix.
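Repeating the rename loop by hand for every suffix is easy to get wrong. A minimal sketch that handles all suffixes in one pass, assuming every file name ends in a hyphen plus two uppercase letters (as in `06-12A011N_0-PK`); `lowercase_suffixes` is a hypothetical helper, not one of the existing corpus scripts:

```shell
#!/bin/sh
# Sketch: lowercase the two-letter suffix of every NAME-XX file in a directory,
# so the rename does not have to be repeated by hand for each suffix.
# Assumption: file names end in "-" plus two uppercase letters (e.g. 06-12A011N_0-PK).
lowercase_suffixes() {
    dir=$1
    for ff in "$dir"/*-[[:upper:]][[:upper:]]; do
        [ -e "$ff" ] || continue              # the glob matched nothing
        suffix=${ff##*-}                      # e.g. PK
        lower=$(printf '%s' "$suffix" | tr '[:upper:]' '[:lower:]')
        gg=${ff%-??}-$lower                   # e.g. .../06-12A011N_0-pk
        echo "$ff $gg"
        mv -- "$ff" "$gg"
    done
}

# Example (not run here): lowercase_suffixes csts-merge
```

The `[[:upper:]]` class is used instead of `[A-Z]` because range expressions can also match lowercase letters in some locales.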
- From the directory ../../ortofon-manual/davka-?/csts-export we copy the corresponding manually processed files:
    parallel-filter.sh -C /net/grimm/usr/local/corp/bin/unique-tag.pl -p6 \
        -s ../../ortofon-manual/davka-?/csts-export -t csts-merge -v
- Files with uppercase suffixes have <chunk> and an attribute-rich <s> in their markup; both markups must contain the same tags:
    for ff in *-[a-z][a-z]; do echo $ff; perl -i.bak -pe 's/<p>/<chunk>/' $ff; done
    for ff in *.bak; do echo ${ff%.bak}; sdiff -s ${ff%.bak} $ff; done
    for ff in *-[a-z][a-z]; do echo $ff; perl -i.bak -pe 's:</c>:</chunk>\n</c>:' $ff; done
    for ff in *-[a-z][a-z]; do echo $ff; perl -i.bak -pe 's:<s>:</s>\n<s>:' $ff; done
    for ff in *-[a-z][a-z]; do echo $ff; perl -i.bak -pe 'undef $/; s:(<chunk>)\n</s>:$1:' $ff; done
    for ff in *-[a-z][a-z]; do echo $ff; perl -i.bak -pe 'undef $/; s:<s>\n(</chunk>):$1:' $ff; done
    for ff in *-[A-Z][A-Z]; do echo $ff; perl -i.bak -pe 'undef $/; s:<p>\n<s>\n::' $ff; done
    for ff in *-[A-Z][A-Z]; do echo $ff; perl -i.bak -pe 'undef $/; s:<p>\n::' $ff; done
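The requirement that both markups contain the same tags can be checked mechanically after the rewrites. A rough sketch, assuming structural tags start at the beginning of a line and token lines start with `<f` or `<d`; both helper names are hypothetical:

```shell
#!/bin/sh
# Print "count tag" for each structural tag that opens a line,
# ignoring token lines (<f...>, <d...>). Attributes are dropped,
# so <chunk id="..."> and <chunk> count as the same tag.
markup_counts() {
    grep -o '^</\{0,1\}[A-Za-z]*' "$1" | grep -v -x -e '<f' -e '<d' | sort | uniq -c
}

# Succeed only if both files have the same inventory of structural tags.
same_markup() {
    [ "$(markup_counts "$1")" = "$(markup_counts "$2")" ]
}

# Example (not run here):
#   same_markup 01-12A005N_1-VH 01-12A005N_1-pk || echo "markup differs"
```

This only compares how often each tag occurs, not where; positional agreement is what the alignment check in the next step covers.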
- We must check that they are aligned:
    for ff in *-[A-Z][A-Z]; do echo $ff; paste $ff ${ff%-??}-[a-z][a-z] | grep "</s>"; done |\
        grep -vP "</s>\t</s>" | l
  and fix any misalignment using the original data in …/Ortofon/ortofon-data/0?
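When the alignment check reports a mismatch, it helps to know on which line the two files drift apart. A minimal sketch of the same paste-based comparison that also prints line numbers, assuming the csts lines themselves contain no tab characters and that sentence ends appear as a bare `</s>` line; `misaligned_sentence_ends` is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: paste two files side by side and report every line number where
# exactly one of them has a sentence end ("</s>").
# Output columns: line number, line from file 1, line from file 2.
misaligned_sentence_ends() {
    paste "$1" "$2" | awk -F'\t' '
        {
            a = ($1 == "</s>")
            b = ($2 == "</s>")
            if (a != b) printf "%d\t%s\t%s\n", NR, $1, $2
        }'
}

# Example (not run here):
#   misaligned_sentence_ends 01-12A005N_1-VH 01-12A005N_1-pk | head
```

The first reported line is usually the place to start, since a single extra or missing token shifts everything after it.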
- The automatic files have no aspect. After alignment we create the directory csts-tag and copy the *-[A-Z][A-Z] files into it:
    mkdir ../csts-tag
    cp -p *-[A-Z][A-Z] ../csts-tag
  On these we run aspect annotation and aspect corrections:
    [frozen] make-asp.sh -Eucs2 -fcsts -p6 -s csts-tag -t csts-tag-vid -v
    cd /usr/local/corp/frozen-states/201910/corp/DisambiguacniSkripty/PostDisambVid-utf-csts/povinne
    parallel-filter.sh -C "./11_OpravitVid-1 | ./11_OpravitVid-2 | ./20_asp_stat.pl" \
        -s /net/grimm/store/corp/Ortofon/ortofon-merge/davka-?/csts-tag-vid \
        -t /net/grimm/store/corp/Ortofon/ortofon-merge/davka-?/csts-tag-vid-corr -v
    cd -
    for ff in *; do echo $ff; perl -i -pe 's/invalid-/invalid/' $ff; done
- We copy the files from csts-tag-vid-corr back into csts-merge.
- We unify the markup:
    for ff in *; do echo $ff; perl -i.bak -pe 's/<f[^>]+>/<f>/' $ff; done
    for ff in *; do echo $ff; perl -i.bak -pe 's/(<MM[lt])[^>]+>/$1>/g' $ff; done
- We check and correct biaspectual verbs:
    grep "B$" * | cut -f3 -d'<' | sort -u > ../vidy.txt
  For later batches we compare with the previous ones:
    for ff in $(cat vidy.txt); do echo $ff; grep -h "<l>$ff<" ../davka-1/csts-import/* | sort -u; done | l
  and correct them:
    for ff in $(grep -l "<MMl>bydlet<MMt>...............B" *); do echo $ff; \
        perl -i.bak -pe 's/(<MMl>bydlet<MMt>...............)B/$1I/' $ff; done
- We prepare the data for import:
    for ff in *-[A-Z][A-Z]; do gg=${ff%-??}-[a-z][a-z]; suff=$(echo $gg| cut -f3 -d'-'); \
        echo $ff-$suff; paste $ff ${ff%-??}-[a-z][a-z] | perl -pe 's/<MM/</g' | merge-csts \
        | perl -pe 's/<l>[^<]+<t>X@--------------(<l>[^<]+<t>[FM])/$1/' > ../csts-import/$ff-$suff; done
  and adjust the lines with the tags F, H and M:
    for ff in *-??-??; do echo $ff; perl -i.bak -pe 's/<d>@<l>@<t>Z:--------------/<f>@/' $ff; done
    for ff in *-??-??; do echo $ff; perl -i.bak -pe 's/<l>@@<t>Z:--------------//' $ff; done
    for ff in *-??-??; do echo $ff; perl -i.bak -pe 's/<l>emm<t>X@--------------//' $ff; done
    for ff in *-??-??; do echo $ff; perl -i.bak -pe 's/<l>hmm<t>II--------------//' $ff; done
- We run the import (on jakobson):
    cd ../csts-import
    for ff in *; do echo $ff; /usr/local/annotate/bin/csts-import-utkl.pl --force $ff; done
- We edit /usr/local/annotate/users
- Determining aspect (P = perfective, I = imperfective, B = biaspectual):
  - dát P – dát se I
  - (dokázat P – dokázat (be able to) B)
  - (dovést P – dovést (be able to) B)
  - hodit P – hodit se I
  - jmenovat B – jmenovat se I
  - (napovídat P – napovídat I)
  - (orientovat B – orientovat se I)
  - stát I – stát se P
  - věnovat P – věnovat se I
Annotators and assigned files (davka-1)
Jan Henyš
- (6696):
    -rw-r--r-- 1 skoumal users 74626 Nov 4 14:57 01-12A005N_1-VH-pk
    -rw-r--r-- 1 skoumal users 81870 Nov 4 14:57 01-13A014N_1-VH-pk
    -rw-r--r-- 1 skoumal users 93895 Nov 4 14:57 01-13A028N_1-VH-pk
    -rw-r--r-- 1 skoumal users 127391 Nov 4 14:57 01-13B031N_1-VH-pk
    -rw-r--r-- 1 skoumal users 108674 Nov 4 14:57 01-13H005N_1-VH-pk
    -rw-r--r-- 1 skoumal users 139978 Nov 4 14:57 01-13O009N_3-VH-pk
    -rw-r--r-- 1 skoumal users 81235 Nov 4 14:57 01-13P004N_1-VH-pk
    -rw-r--r-- 1 skoumal users 106283 Nov 4 14:57 01-13P009N_2-VH-pk
    -rw-r--r-- 1 skoumal users 104777 Nov 4 14:57 01-14T003N_6-VH-pk
    -rw-r--r-- 1 skoumal users 114342 Nov 4 14:57 01-14T007N_1-VH-pk
    -rw-r--r-- 1 skoumal users 121376 Nov 4 14:57 01-14T010N_0-VH-pk
    -rw-r--r-- 1 skoumal users 122450 Nov 4 14:57 01-14X016N_6-VH-pk
    -rw-r--r-- 1 skoumal users 126982 Nov 4 14:57 01-14X019N_0-VH-pk
    -rw-r--r-- 1 skoumal users 130419 Nov 4 14:57 01-15O001N_0-VH-pk
    -rw-r--r-- 1 skoumal users 105776 Nov 4 14:57 01-15O004N_0-TM-pk
    -rw-r--r-- 1 skoumal users 111398 Nov 4 14:57 01-15O007N_1-TM-pk
    -rw-r--r-- 1 skoumal users 115392 Nov 4 14:57 01-15P001N_1-TM-pk
    -rw-r--r-- 1 skoumal users 154103 Nov 4 14:57 01-15T005N_0-TM-pk
    -rw-r--r-- 1 skoumal users 107637 Nov 4 14:57 01-15X045N_2-TM-pk
    -rw-r--r-- 1 skoumal users 125936 Nov 4 14:57 01-16A005N_2-TM-pk
    -rw-r--r-- 1 skoumal users 98450 Nov 4 14:57 01-16P002N_2-TM-pk
    -rw-r--r-- 1 skoumal users 104393 Nov 4 14:57 01-16X033N_5-TM-pk
    -rw-r--r-- 1 skoumal users 101802 Nov 4 14:57 02-12A011N_1-TM-pk
    -rw-r--r-- 1 skoumal users 122296 Nov 4 14:57 02-13A011N_5-TM-pk
    -rw-r--r-- 1 skoumal users 102029 Nov 4 14:57 02-13A036N_3-TM-pk
    -rw-r--r-- 1 skoumal users 149647 Nov 4 14:57 02-13B031N_0-TM-pk
    -rw-r--r-- 1 skoumal users 88170 Nov 4 14:57 02-13O009N_1-TM-pk
    -rw-r--r-- 1 skoumal users 97377 Nov 4 14:57 02-13O010N_2-TM-pk
    -rw-r--r-- 1 skoumal users 126451 Nov 4 14:57 02-13O013N_1-MZ-pk
    -rw-r--r-- 1 skoumal users 79652 Nov 4 14:57 02-13P004N_3-MZ-pk
    -rw-r--r-- 1 skoumal users 98536 Nov 4 14:57 02-13T029N_6-MZ-pk
    -rw-r--r-- 1 skoumal users 88034 Nov 4 14:57 02-14A016N_2-MZ-pk
    -rw-r--r-- 1 skoumal users 118994 Nov 4 14:57 02-14E003N_0-MZ-pk
    -rw-r--r-- 1 skoumal users 123841 Nov 4 14:57 02-14T003N_5-MZ-pk
    -rw-r--r-- 1 skoumal users 115630 Nov 4 14:57 02-14T013N_2-MZ-pk
    -rw-r--r-- 1 skoumal users 134818 Nov 4 14:57 02-14T020N_3-MZ-pk
    -rw-r--r-- 1 skoumal users 141686 Nov 4 14:57 02-15O002N_0-MZ-pk
    -rw-r--r-- 1 skoumal users 95958 Nov 4 14:57 02-15O004N_1-MZ-pk
    -rw-r--r-- 1 skoumal users 84948 Nov 4 14:57 02-15O009N_1-MZ-pk
    -rw-r--r-- 1 skoumal users 101862 Nov 4 14:57 02-15P001N_3-MZ-pk
    -rw-r--r-- 1 skoumal users 100946 Nov 4 14:57 02-15X041N_1-MZ-pk
    -rw-r--r-- 1 skoumal users 115018 Nov 4 14:57 02-16A001N_0-MZ-pk
    -rw-r--r-- 1 skoumal users 132001 Nov 4 14:57 02-16A005N_5-SK-pk
    -rw-r--r-- 1 skoumal users 99541 Nov 4 14:57 02-16P002N_0-SK-pk
    -rw-r--r-- 1 skoumal users 79580 Nov 4 14:57 02-16X003N_5-SK-pk
    -rw-r--r-- 1 skoumal users 109874 Nov 4 14:57 03-12A035N_3-SK-pk
    -rw-r--r-- 1 skoumal users 92330 Nov 4 14:57 03-13A014N_4-SK-pk
    -rw-r--r-- 1 skoumal users 101868 Nov 4 14:57 03-13O009N_2-SK-pk
    -rw-r--r-- 1 skoumal users 102143 Nov 4 14:57 03-13P010N_1-SK-pk
    -rw-r--r-- 1 skoumal users 119923 Nov 4 14:57 03-14A011N_3-SK-pk
    -rw-r--r-- 1 skoumal users 89549 Nov 4 14:57 03-14A016N_4-SK-pk
    -rw-r--r-- 1 skoumal users 102824 Nov 4 14:57 03-14P007N_2-SK-pk
    -rw-r--r-- 1 skoumal users 152089 Nov 4 14:57 03-14T010N_2-SK-pk
    -rw-r--r-- 1 skoumal users 127088 Nov 4 14:57 03-14T013N_0-SK-pk
    -rw-r--r-- 1 skoumal users 131223 Nov 4 14:57 03-14T020N_0-SK-pk
    -rw-r--r-- 1 skoumal users 125088 Nov 4 14:57 03-14X019N_4-SK-pk
    -rw-r--r-- 1 skoumal users 106008 Nov 4 14:57 03-14X021N_2-LK-pk
Václav Horký
- (7020) done:
    -rw-r--r-- 1 skoumal users 156567 Nov 4 14:57 03-15E003N_0-LK-pk
    -rw-r--r-- 1 skoumal users 112165 Nov 4 14:57 03-15E015N_1-LK-pk
    -rw-r--r-- 1 skoumal users 100938 Nov 4 14:57 03-15O002N_1-LK-pk
    -rw-r--r-- 1 skoumal users 122762 Nov 4 14:57 03-15O007N_0-LK-pk
    -rw-r--r-- 1 skoumal users 143197 Nov 4 14:57 03-15X020N_1-LK-pk
    -rw-r--r-- 1 skoumal users 115593 Nov 4 14:57 03-15X041N_2-LK-pk
    -rw-r--r-- 1 skoumal users 135237 Nov 4 14:57 03-16A005N_4-LK-pk
    -rw-r--r-- 1 skoumal users 97000 Nov 4 14:57 03-16P004N_0-LK-pk
    -rw-r--r-- 1 skoumal users 145176 Nov 4 14:57 03-16X001N_1-LK-pk
    -rw-r--r-- 1 skoumal users 118990 Nov 4 14:57 03-16X031N_4-LK-pk
    -rw-r--r-- 1 skoumal users 83711 Nov 4 14:57 04-12A025N_0-LK-pk
    -rw-r--r-- 1 skoumal users 140071 Nov 4 14:57 04-12P004N_4-LK-pk
    -rw-r--r-- 1 skoumal users 152345 Nov 4 14:57 04-13B009N_0-LK-pk
    -rw-r--r-- 1 skoumal users 126964 Nov 4 14:57 04-13B019N_0-MH-pk
    -rw-r--r-- 1 skoumal users 116116 Nov 4 14:57 04-13B025N_0-MH-pk
    -rw-r--r-- 1 skoumal users 102177 Nov 4 14:57 04-13O007N_1-MH-pk
    -rw-r--r-- 1 skoumal users 101580 Nov 4 14:57 04-13O010N_0-MH-pk
    -rw-r--r-- 1 skoumal users 139029 Nov 4 14:57 04-13O013N_0-MH-pk
    -rw-r--r-- 1 skoumal users 83185 Nov 4 14:57 04-13P004N_2-MH-pk
    -rw-r--r-- 1 skoumal users 101357 Nov 4 14:57 04-14P007N_0-MH-pk
    -rw-r--r-- 1 skoumal users 136504 Nov 4 14:57 04-14T003N_0-MH-pk
    -rw-r--r-- 1 skoumal users 125338 Nov 4 14:57 04-14T010N_1-MH-pk
    -rw-r--r-- 1 skoumal users 105675 Nov 4 14:57 04-14T013N_1-MH-pk
    -rw-r--r-- 1 skoumal users 106963 Nov 4 14:57 04-14X016N_3-MH-pk
    -rw-r--r-- 1 skoumal users 92255 Nov 4 14:57 04-15O010N_0-MH-pk
    -rw-r--r-- 1 skoumal users 111061 Nov 4 14:57 04-15P001N_2-MH-pk
    -rw-r--r-- 1 skoumal users 90023 Nov 4 14:57 04-15P006N_0-MH-pk
    -rw-r--r-- 1 skoumal users 102256 Nov 4 14:57 04-15X030N_3-PK-mh
    -rw-r--r-- 1 skoumal users 96979 Nov 4 14:57 04-15X043N_2-PK-mh
    -rw-r--r-- 1 skoumal users 125828 Nov 4 14:57 04-16A009N_2-PK-mh
    -rw-r--r-- 1 skoumal users 85615 Nov 4 14:57 04-16E007N_2-PK-mh
    -rw-r--r-- 1 skoumal users 113460 Nov 4 14:57 04-16P004N_1-PK-mh
    -rw-r--r-- 1 skoumal users 103271 Nov 4 14:57 04-16X003N_2-PK-mh
    -rw-r--r-- 1 skoumal users 114602 Nov 4 14:57 05-12P004N_2-PK-mh
    -rw-r--r-- 1 skoumal users 122736 Nov 4 14:57 05-13A011N_0-PK-mh
    -rw-r--r-- 1 skoumal users 96925 Nov 4 14:57 05-13A014N_3-PK-mh
    -rw-r--r-- 1 skoumal users 80363 Nov 4 14:57 05-13A023N_3-PK-mh
    -rw-r--r-- 1 skoumal users 108453 Nov 4 14:57 05-13B005N_1-PK-mh
    -rw-r--r-- 1 skoumal users 139140 Nov 4 14:57 05-13D015N_0-PK-mh
    -rw-r--r-- 1 skoumal users 104929 Nov 4 14:57 05-13O007N_0-PK-mh
    -rw-r--r-- 1 skoumal users 125205 Nov 4 14:57 05-14A011N_2-PK-mh
    -rw-r--r-- 1 skoumal users 147679 Nov 4 14:57 05-14T010N_3-AN-pk
    -rw-r--r-- 1 skoumal users 90556 Nov 4 14:57 05-14T019N_0-AN-pk
    -rw-r--r-- 1 skoumal users 117725 Nov 4 14:57 05-14X012N_2-AN-pk
    -rw-r--r-- 1 skoumal users 146139 Nov 4 14:57 05-14X019N_2-AN-pk
    -rw-r--r-- 1 skoumal users 148811 Nov 4 14:57 05-14X019N_3-AN-pk
    -rw-r--r-- 1 skoumal users 121542 Nov 4 14:57 05-15O001N_1-AN-pk
    -rw-r--r-- 1 skoumal users 117876 Nov 4 14:57 05-15P001N_0-AN-pk
    -rw-r--r-- 1 skoumal users 149222 Nov 4 14:57 05-15X009N_1-AN-pk
    -rw-r--r-- 1 skoumal users 141576 Nov 4 14:57 05-15X020N_3-AN-pk
    -rw-r--r-- 1 skoumal users 69482 Nov 4 14:57 05-15X043N_5-AN-pk
    -rw-r--r-- 1 skoumal users 149294 Nov 4 14:57 05-16A005N_1-AN-pk
    -rw-r--r-- 1 skoumal users 85369 Nov 4 14:57 05-16E007N_0-AN-pk
    -rw-r--r-- 1 skoumal users 95205 Nov 4 14:57 05-16P002N_1-AN-pk
    -rw-r--r-- 1 skoumal users 95081 Nov 4 14:57 05-16P007N_1-AN-pk
    -rw-r--r-- 1 skoumal users 133102 Nov 4 14:57 05-16X001N_2-MH-pk
Annotators and assigned files (davka-2)
Jan Henyš
- ():
Václav Horký
- (4675) done:
    -rw-r--r-- 1 skoumal staff 151492 Dec 5 15:12 08-16A005N_3-AN-pk
    -rw-r--r-- 1 skoumal staff 100204 Dec 5 15:12 08-16E005N_4-AN-pk
    -rw-r--r-- 1 skoumal staff 93187 Dec 5 15:12 08-16E007N_4-AN-pk
    -rw-r--r-- 1 skoumal staff 109275 Dec 5 15:12 08-16X003N_4-AN-pk
    -rw-r--r-- 1 skoumal staff 131994 Dec 5 15:12 08-16X026N_1-AN-pk
    -rw-r--r-- 1 skoumal staff 111495 Dec 5 15:12 08-16X031N_2-AN-pk
    -rw-r--r-- 1 skoumal staff 92992 Dec 5 15:12 09-12A004N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 115075 Dec 5 15:12 09-12A034N_3-PK-pk
    -rw-r--r-- 1 skoumal staff 109999 Dec 5 15:12 09-12H004N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 91469 Dec 5 15:12 09-13A003N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 112768 Dec 5 15:12 09-13A074N_4-PK-pk
    -rw-r--r-- 1 skoumal staff 112259 Dec 5 15:12 09-13A090N_2-PK-pk
    -rw-r--r-- 1 skoumal staff 87195 Dec 5 15:12 09-13B011N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 129299 Dec 5 15:12 09-13B027N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 122598 Dec 5 15:12 09-13O007N_3-PK-pk
    -rw-r--r-- 1 skoumal staff 110178 Dec 5 15:12 09-13P008N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 120714 Dec 5 15:12 09-13T029N_3-PK-pk
    -rw-r--r-- 1 skoumal staff 185624 Dec 5 15:12 09-13X003N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 90067 Dec 5 15:12 09-14A016N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 103615 Dec 5 15:12 09-14C006N_3-PK-pk
    -rw-r--r-- 1 skoumal staff 73062 Dec 5 15:12 09-14T024N_4-PK-pk
    -rw-r--r-- 1 skoumal staff 128943 Dec 5 15:12 09-14X016N_2-PK-pk
    -rw-r--r-- 1 skoumal staff 108095 Dec 5 15:12 09-15O004N_2-PK-pk
    -rw-r--r-- 1 skoumal staff 126265 Dec 5 15:12 09-15X041N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 119361 Dec 5 15:12 09-15X044N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 142155 Dec 5 15:12 09-16A002N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 75463 Dec 5 15:12 09-16E007N_1-PK-pk
    -rw-r--r-- 1 skoumal staff 120311 Dec 5 15:12 09-16X030N_0-PK-pk
    -rw-r--r-- 1 skoumal staff 135473 Dec 5 15:12 10-13A005N_4-MH-pk
    -rw-r--r-- 1 skoumal staff 113915 Dec 5 15:12 10-13A011N_3-MH-pk
    -rw-r--r-- 1 skoumal staff 87897 Dec 5 15:12 10-13A018N_2-MH-pk
    -rw-r--r-- 1 skoumal staff 121284 Dec 5 15:12 10-13A074N_5-MH-pk
    -rw-r--r-- 1 skoumal staff 103486 Dec 5 15:12 10-13B016N_1-MH-pk
    -rw-r--r-- 1 skoumal staff 127402 Dec 5 15:12 10-13O003N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 99716 Dec 5 15:12 10-13P009N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 126373 Dec 5 15:12 10-14A011N_1-MH-pk
    -rw-r--r-- 1 skoumal staff 132420 Dec 5 15:12 10-14C009N_2-MH-pk
    -rw-r--r-- 1 skoumal staff 154287 Dec 5 15:12 10-14O007N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 108656 Dec 5 15:12 10-14P006N_1-MH-pk
    -rw-r--r-- 1 skoumal staff 111987 Dec 5 15:12 10-15O011N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 98207 Dec 5 15:12 10-15O012N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 139761 Dec 5 15:12 10-15P004N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 94261 Dec 5 15:12 10-15T002N_2-MH-pk
    -rw-r--r-- 1 skoumal staff 179656 Dec 5 15:12 10-15T003N_1-MH-pk
    -rw-r--r-- 1 skoumal staff 158985 Dec 5 15:12 10-15T011N_4-MH-pk
    -rw-r--r-- 1 skoumal staff 147464 Dec 5 15:12 10-15X020N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 133827 Dec 5 15:12 10-16A002N_0-MH-pk
    -rw-r--r-- 1 skoumal staff 87534 Dec 5 15:12 10-16E007N_5-MH-pk
    -rw-r--r-- 1 skoumal staff 96142 Dec 5 15:12 10-16P004N_3-MH-pk
    -rw-r--r-- 1 skoumal staff 103483 Dec 5 15:12 10-16X003N_0-MH-pk
Producing the vertical with markup
- The manually annotated files are in the directory csts-export.
- We convert them to the vertical format with the script ortofon-csts-vert.pl:
    parallel-filter.sh -C "ortofon-csts-vert.pl" -p45 -s csts-export -t vert-export -v
- To be safe, we copy everything into vert-opravy and make the corrections there.
Checking and manually correcting the vertical
Automatic corrections
- Form von.* vs. lemma on.*:
    grep -P "von.*\ton" *
- Variant 6 for the lemmas von.*:
    grep -P "von[^\t]*\tPP.*6" *
- Aspect for the enclitic s (form #s)
- Unify každý
- Abbreviations
- Hesitation sounds (hmm, @, @@)
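The two grep checks for von.* only show which lines exist; they do not single out the forms that still lack the variant. A minimal sketch of such a filter, assuming the vertical has tab-separated columns form, lemma, tag and that the variant is the 15th tag position (as in the tag `PPYS1--3------6` assigned to von earlier); `von_without_variant6` is a hypothetical helper, and unlike the greps it anchors both patterns at the start of the column:

```shell
#!/bin/sh
# Sketch: list token lines whose form starts with "von" and whose lemma
# starts with "on" but whose tag does not carry variant 6 in position 15.
# Output format: file:line:original line
von_without_variant6() {
    awk -F'\t' '
        !/^</ && $1 ~ /^von/ && $2 ~ /^on/ && substr($3, 15, 1) != "6" {
            print FILENAME ":" FNR ":" $0
        }' "$@"
}

# Example (not run here): von_without_variant6 vert-opravy/*
```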
Manual corrections
- invalid
- X@
- Visual check of the tags:
    grep -h -v "^<" * | cut -f3 | sort -u | l
- Check that the tags are valid:
    grep -h -v "^<" * | cut -f3 | sort -u | check-tag.pl -l16 > /dev/null
- Check asterisks and the like.
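What check-tag.pl verifies is not described here. As a complementary sanity check, the sketch below lists every distinct tag whose length is not 16, assuming that the `-l16` option reflects 16-position tags; `bad_length_tags` is a hypothetical helper built on the same grep/cut/sort pipeline as the checks above:

```shell
#!/bin/sh
# Sketch: print each distinct value of the third column (the tag) whose
# length differs from 16. Markup lines (starting with "<") are skipped.
# Lines without any tab pass through cut unchanged and are reported too.
bad_length_tags() {
    grep -h -v '^<' "$@" | cut -f3 | sort -u | awk 'length($0) != 16'
}

# Example (not run here): bad_length_tags vert-opravy/*
```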
Comparing our lemmas and POS with MorphoDiTa (for the student Dominika)
- We create the directory merge-csts, where we will prepare the texts for annotation.
Converting chunks to csts
- The vertical has to be turned into <csts> with <s> tags:
    cd chunks
    for ff in *; do echo $ff; oral-vert-csts.pl < $ff > ../merge-csts/${ff%.vrt}.chunk.csts; done
Comparing our rules with the chunks
- We do this with a diff:
    sdiff 05-16X001N_2.chunk.csts <(grep -v '<D>' ../csts-import/05-16X001N_2.vrt | perl -pe 's/(<MMt>.)[^<\n]+/$1/g' \
        | remove-dupl-csts-mark.pl | perl -pe 's/<f[^>]*>/<f>/') | l
Converting csts-rules-frazrl to the common format
- We convert as follows:
    cd csts-import
    for ff in *.vrt; do echo $ff; grep -v '<D>' $ff | perl -pe 's/(<MMt>.)[^<\n]+/$1/g' \
        | perl -pe 's/&dhellip;/../g' | perl -pe 's/&thellip;/.../g' | perl -pe 's/(\*<MMt>)X/$1F/' \
        | perl -pe 's/\@+(<MMt>)Z/\@$1H/' | perl -pe 's/([eh]mm<MMt>)[IX]/$1H/' | perl -pe 's/<f[^>]+>([eh]mm<|\@+)/<d>$1/' \
        | perl -pe 's/(\)<MMt>)X/$1M/' | perl -pe 's/(\&<MMt>)Z/$1H/' | remove-dupl-csts-mark.pl | perl -pe 's/<f[^>]*>/<f>/' \
        > ../merge-csts/${ff%.vrt}.import.csts; done
Merging chunk and import into merge-import
- We produce the data for annotation:
    mkdir -p merge-import
    cd merge-csts
    for ff in *.chunk.csts; do echo $ff; sdiff -w 2500 ${ff%.chunk.csts}.import.csts $ff \
        | perl -pe 's/[\ \t]+\|[\ \t]+<f>[^<]+//' | perl -pe 's/[\ \t]+<.*//' | remove-dupl-csts-mark.pl Q \
        > ../merge-import/${ff%.chunk.csts}.csts; done
- We check the tabs,
- and then import into the annotation program (on jakobson):
    cd ../merge-import
    for ff in *-Dom; do echo $ff; /usr/local/annotate/bin/csts-import-utkl.pl --force $ff; done
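The tab check that precedes the import is not spelled out. One possible reading, offered as an assumption: the merged csts files should be free of tab characters, so any tab is a leftover of the sdiff column layout that the perl filters did not remove. A sketch that lists such lines (`lines_with_tabs` is a hypothetical helper):

```shell
#!/bin/sh
# Sketch: list every line that contains a tab character in the given files.
# Assumption: merged csts files are expected to be tab-free, so any hit
# is a leftover from sdiff that needs a manual look.
# Output format: file:line number:line
lines_with_tabs() {
    tab=$(printf '\t')
    # /dev/null as an extra operand makes grep print file names even for one file.
    grep -n "$tab" "$@" /dev/null
}

# Example (not run here): lines_with_tabs ../merge-import/*
```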