Work on SYNv7
- We use hybrid tooling: the old morphology, with everything else new where possible
- Fixes to individual tools: remove-dupl-csts-mark.pl, negr_kolokace_csts_ucs2.pl
- PreMorfo
- PostMorfo
- Old morphology: CZ170531ax
- Hybrid: 201802-hybrid
- Lexicons:
  -rw-r--r-- 1 skoumal staff 21020512 Oct 15 23:23 LEX_1byte
  -rw-r--r-- 1 skoumal staff 22168215 Oct 15 23:23 LEX_ucs2
Machines and corpora
grimm
- directory: /store/corp/SYNv7
SYNv4: approx. 2.25 billion words, 49,901 files, approx. 71:56 h on 45 CPUs, finished on 27 Oct at 18:52:
  Job "Whole corpus" started at Wed Oct 24 18:56:36 CEST 2018
  Target "morf" started at Wed Oct 24 18:56:36 CEST 2018
  Target "rules" started at Wed Oct 24 22:09:14 CEST 2018
  Target "frazrl" started at Thu Oct 25 12:38:52 CEST 2018
  Target "rulh1" started at Fri Oct 26 01:47:02 CEST 2018
  Target "tag" started at Fri Oct 26 20:33:20 CEST 2018
  Target "vid" started at Sat Oct 27 16:28:35 CEST 2018
  Target "corr" started at Sat Oct 27 16:52:06 CEST 2018
  Job "Whole corpus" finished at Sat Oct 27 18:52:34 CEST 2018
- 31 Oct: 139 files in vert-CNK
- 1 Nov: vert-CNK-vrt
- 4 Nov: vid (aspect) fixed, vert-kolok-vrt, tarballs:
  -rw-r--r-- 1 skoumal users 797957302 Nov 5 03:36 SYNv4-kolok-json.tgz
  -rw-r--r-- 1 skoumal users 6868640103 Nov 5 03:21 SYNv4-kolok-txt.tgz
  -rw-r--r-- 1 skoumal users 46081155387 Nov 5 01:17 SYNv4-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal users 16882827015 Nov 4 22:24 SYNv4-kolok-vrt.tgz
NEWTON: approx. 2.22 billion words, 164,975 files, 75:15 h on 45 CPUs, finished on 22 Oct at 19:22:
  Job "Whole corpus" started at Sat Oct 13 18:51:03 CEST 2018
  Target "morf" started at Sat Oct 13 18:51:03 CEST 2018
  Target "rules" started at Sat Oct 13 22:08:25 CEST 2018
  Job "Whole corpus" started at Fri Oct 19 21:28:00 CEST 2018
  Target "rules" started at Fri Oct 19 21:28:00 CEST 2018
  Target "frazrl" started at Sat Oct 20 11:28:52 CEST 2018
  Target "rulh1" started at Sun Oct 21 03:26:12 CEST 2018
  Target "tag" started at Sun Oct 21 20:20:21 CEST 2018
  Target "vid" started at Mon Oct 22 16:25:56 CEST 2018
  Target "corr" started at Mon Oct 22 17:10:21 CEST 2018
  Job "Whole corpus" finished at Mon Oct 22 19:22:40 CEST 2018
- 31 Oct: 18 files in vert-CNK
- 1 Nov: vert-CNK-vrt
- 4 Nov: vid (aspect) fixed, vert-kolok-vrt, tarballs:
  -rw-r--r-- 1 skoumal users 568716630 Nov 5 03:37 NEWTON-kolok-json.tgz
  -rw-r--r-- 1 skoumal users 4493576498 Nov 5 03:16 NEWTON-kolok-txt.tgz
  -rw-r--r-- 1 skoumal users 30058342160 Nov 5 00:42 NEWTON-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal users 10953163552 Nov 4 22:34 NEWTON-kolok-vrt.tgz
chomsky
- directory: /store/corp/SYNv7
SYN2015: approx. 101.4 million words, 3376 files, approx. 11:20 h on 10 CPUs, finished on 16 Oct at 10:58:
  Job "Whole corpus" started at Monday 15 October 23:39:23 CEST 2018
  Target "morf" started at Monday 15 October 23:39:23 CEST 2018
  Target "rules" started at Tuesday 16 October 00:04:44 CEST 2018
  Target "frazrl" started at Tuesday 16 October 02:19:16 CEST 2018
  Target "rulh1" started at Tuesday 16 October 04:53:08 CEST 2018
  Target "tag" started at Tuesday 16 October 07:58:26 CEST 2018
  Target "vid" started at Tuesday 16 October 10:40:30 CEST 2018
  Target "corr" started at Tuesday 16 October 10:42:52 CEST 2018
  Job "Whole corpus" finished at Tuesday 16 October 10:58:19 CEST 2018
- 31 Oct: vert-CNK-vrt
- 4 Nov: vid (aspect) fixed, vert-kolok-vrt, tarballs:
  -rw-r--r-- 1 skoumal staff 38881623 Nov 3 12:35 SYN2015-kolok-json.tgz
  -rw-r--r-- 1 skoumal staff 291842804 Nov 3 12:34 SYN2015-kolok-txt.tgz
  -rw-r--r-- 1 skoumal staff 2084953398 Nov 4 21:54 SYN2015-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal staff 730051497 Nov 4 21:45 SYN2015-kolok-vrt.tgz
NEWTON2017: approx. 172 million words, 6191 files, approx. 18 h on 10 CPUs, finished on 17 Oct at 08:42:
  Job "Whole corpus" started at Tuesday 16 October 14:41:53 CEST 2018
  Target "morf" started at Tuesday 16 October 14:41:53 CEST 2018
  Target "rules" started at Tuesday 16 October 15:33:12 CEST 2018
  Target "frazrl" started at Tuesday 16 October 19:33:01 CEST 2018
  Target "rulh1" started at Tuesday 16 October 23:17:48 CEST 2018
  Target "tag" started at Wednesday 17 October 03:42:41 CEST 2018
  Target "vid" started at Wednesday 17 October 08:13:20 CEST 2018
  Target "corr" started at Wednesday 17 October 08:19:16 CEST 2018
  Job "Whole corpus" finished at Wednesday 17 October 08:42:52 CEST 2018
- kolok: approx. 51:52 h on 10 CPUs, finished on 19 Oct at 12:53
- 1 Nov: vert-kolok-CNK-vrt
- 4 Nov: vid (aspect) OK, tarballs:
  -rw-r--r-- 1 skoumal staff 72100181 Nov 3 11:37 NEWTON2017-kolok-json.tgz
  -rw-r--r-- 1 skoumal staff 518622828 Nov 3 11:39 NEWTON2017-kolok-txt.tgz
  -rw-r--r-- 1 skoumal staff 3468708754 Nov 3 11:35 NEWTON2017-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal staff 1267489893 Nov 3 11:17 NEWTON2017-kolok-vrt.tgz
jakobson
- directory: /mnt/sdd1/corp/SYNv7
NEWTON2015: approx. 214.25 million words, 6302 files, approx. 30:38 h on 7 CPUs, finished on 25 Oct at 17:51:
  Job "Whole corpus" started at Wednesday 24 October 09:53:27 CEST 2018
  Target "morf" started at Wednesday 24 October 09:53:27 CEST 2018
  Target "rules" started at Wednesday 24 October 11:20:20 CEST 2018
  Target "frazrl" started at Wednesday 24 October 17:58:05 CEST 2018
  Target "rulh1" started at Wednesday 24 October 23:58:21 CEST 2018
  Target "tag" started at Thursday 25 October 07:49:27 CEST 2018
  Job "Whole corpus" started at Thursday 25 October 09:09:23 CEST 2018
  Target "tag" started at Thursday 25 October 09:09:23 CEST 2018
  Target "vid" started at Thursday 25 October 16:59:34 CEST 2018
  Target "corr" started at Thursday 25 October 17:11:27 CEST 2018
  Job "Whole corpus" finished at Thursday 25 October 17:51:43 CEST 2018
- 1 Nov: vert-CNK-vrt
- 4 Nov: vid (aspect) fixed, vert-kolok-vrt, tarballs:
  -rw-r--r-- 1 skoumal staff 86316120 Nov 3 19:38 NEWTON2015-kolok-json.tgz
  -rw-r--r-- 1 skoumal staff 645338532 Nov 3 19:37 NEWTON2015-kolok-txt.tgz
  -rw-r--r-- 1 skoumal staff 4323163846 Nov 4 22:20 NEWTON2015-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal staff 1573363630 Nov 4 21:51 NEWTON2015-kolok-vrt.tgz
NEWTON2016: approx. 201 million words, 6207 files, approx. 28:44 h on 7 CPUs, finished on 28 Oct at 01:07:
  Job "Whole corpus" started at Friday 26 October 13:16:38 CEST 2018
  Target "morf" started at Friday 26 October 13:16:38 CEST 2018
  Target "rules" started at Friday 26 October 14:38:48 CEST 2018
  Target "frazrl" started at Friday 26 October 20:52:29 CEST 2018
  Target "rulh1" started at Saturday 27 October 02:34:42 CEST 2018
  Target "tag" started at Saturday 27 October 09:56:15 CEST 2018
  Job "Whole corpus" started at Saturday 27 October 16:57:13 CEST 2018
  Target "tag" started at Saturday 27 October 16:57:13 CEST 2018
  Target "vid" started at Sunday 28 October 00:18:33 CEST 2018
  Target "corr" started at Sunday 28 October 00:29:58 CEST 2018
  Job "Whole corpus" finished at Sunday 28 October 01:07:44 CEST 2018
- 1 Nov: vert-CNK-vrt
- 4 Nov: vid (aspect) fixed, vert-kolok-vrt, tarballs:
  -rw-r--r-- 1 skoumal staff 84482662 Nov 3 20:47 NEWTON2016-kolok-json.tgz
  -rw-r--r-- 1 skoumal staff 604942840 Nov 3 20:46 NEWTON2016-kolok-txt.tgz
  -rw-r--r-- 1 skoumal staff 4036251853 Nov 4 21:53 NEWTON2016-kolok-vrt.json.tgz
  -rw-r--r-- 1 skoumal staff 1476278377 Nov 4 21:39 NEWTON2016-kolok-vrt.tgz
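The elapsed times quoted for each corpus in this section can be cross-checked against the job logs. The following is only a minimal sketch and not part of the original pipeline: it assumes the log file has one event per line, that all timestamps fall within one calendar month, and it measures wall-clock time from the first to the last event. The function name log_elapsed is made up for illustration.

```shell
#!/bin/sh
# log_elapsed (hypothetical helper): print H:MM:SS between the first and the
# last "started at"/"finished at" event in a job log.
# Assumptions: one event per line; all events fall within one calendar month.
log_elapsed() {
    awk '
        /(started|finished) at/ {
            day = 0; t = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9][0-9]?$/) day = $i                     # day of month
                if ($i ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/) t = $i  # HH:MM:SS
            }
            split(t, hms, ":")
            secs = day * 86400 + hms[1] * 3600 + hms[2] * 60 + hms[3]
            if (!seen) { first = secs; seen = 1 }
            last = secs
        }
        END {
            d = last - first
            printf "%d:%02d:%02d\n", int(d / 3600), int((d % 3600) / 60), d % 60
        }
    ' "$@"
}

# Demo on the first and last events of the SYNv4 log.
log=$(mktemp)
cat > "$log" <<'EOF'
Job "Whole corpus" started at Wed Oct 24 18:56:36 CEST 2018
Target "morf" started at Wed Oct 24 18:56:36 CEST 2018
Job "Whole corpus" finished at Sat Oct 27 18:52:34 CEST 2018
EOF
log_elapsed "$log"    # prints 71:55:58
rm -f "$log"
```

For SYNv4 this agrees with the quoted "approx. 71:56". For logs that contain a restart (a second "Job ... started" line), the result spans the whole period including any gap, so it will generally differ from a figure that counts running time only.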
Problems, general and machine-specific
- General: -li corrected by hand in the following files:
  [skoumal@grimm NEWTON-vadne]$ ll in-utf8/
  total 1096
  -rw-rw-r-- 1 skoumal users 178668 Oct 13 18:16 fz100111.txt
  -rw-rw-r-- 1 skoumal users 162456 Oct 13 18:16 fz100208.txt
  -rw-rw-r-- 1 skoumal users 165740 Oct 13 18:16 fz100308.txt
  -rw-rw-r-- 1 skoumal users 160244 Oct 13 18:16 fz100408.txt
  -rw-rw-r-- 1 skoumal users 34046 Oct 13 18:18 ro140417.txt
  -rw-rw-r-- 1 skoumal users 398466 Oct 13 18:15 tydn0207.txt
  [skoumal@chomsky NEWTON2017-vadne]$ ll in-utf8/
  total 1024
  -rw-rw-r-- 1 skoumal staff 462469 Oct 16 11:58 mf170906.txt
  -rw-rw-r-- 1 skoumal staff 534866 Oct 16 11:57 mf170907.txt
  -rw-rw-r-- 1 skoumal staff 47732 Oct 16 11:59 moro1744.txt
- jakobson: featurama cannot set TEMPDIR
Fixes for PostDisambVid
- Brně, Kladně, Krásně, Plavně, Mýtě, Stříbře, Jasně, Běsně, Ústí, Plzeň, Mže, Třeště, Mělníce, Nevadě, Liberce
- pan Kuna, Stehno
- paní Černo
- aspect (vid): vodnýst, donýst, nanýst, povznýst, vodpovědět:
  cd vert-kolok-vrt
  # files containing verbs whose aspect slot (16th tag character) is still "-"
  grep -Pl "^[^\t]+\t[^\t]+\tV..............-" *.vrt > ../opravit-asp.txt
  mkdir ../vert-aspect
  cd ../vert-aspect
  for ff in $(cat ../opravit-asp.txt); do cp -p "../vert-kolok-vrt/$ff" .; done
  # verify that everything was copied, then list the affected lemmas
  diff <(ls) <(cat ../opravit-asp.txt)
  grep -P "^[^\t]+\t[^\t]+\tV..............-" *.vrt | cut -f2 | sort -u
  and then, for example:
  for ff in *.vrt; do echo "$ff"; perl -i.bak -pe 's/(nýst\tV[^\t]{14})-/$1P/' "$ff"; done
- verbs: utváří-utvářit:
  for ff in *.vrt; do echo "$ff"; perl -i.bak -pe 's/utvářit(\tV[^\t]{14})-/utvářet$1I/' "$ff"; done
- check:
  for ff in *.bak; do echo "$ff"; diffys "$ff" "${ff%.bak}"; done
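Before running the in-place substitutions, it can help to see which lemmas are affected and how often. The following is a minimal sketch that uses awk rather than the grep/perl commands from these notes, so that no file is modified. The function name and the toy tag strings are invented for illustration; only the first and the 16th character of the tag matter here.

```shell
#!/bin/sh
# unset_aspect_lemmas (hypothetical helper): for verb tokens whose 16th tag
# character is "-", print "lemma<TAB>count", sorted by lemma.
# Assumes tab-separated columns: word, lemma, tag, ...
unset_aspect_lemmas() {
    awk -F '\t' '
        NF >= 3 && substr($3, 1, 1) == "V" && substr($3, 16, 1) == "-" { n[$2]++ }
        END { for (l in n) printf "%s\t%d\n", l, n[l] }
    ' "$@" | LC_ALL=C sort
}

# Toy input; the tag strings are made up, only their length and 16th character matter.
demo=$(mktemp)
printf '%s\t%s\t%s\n' \
    'donesl'  'donýst'  'VpYS---XR-AA----' \
    'donesla' 'donýst'  'VpQW---XR-AA----' \
    'vodnes'  'vodnýst' 'VpYS---XR-AA----' \
    'nesl'    'nést'    'VpYS---XR-AA---I' \
    'dům'     'dům'     'NNIS1-----A-----' > "$demo"
unset_aspect_lemmas "$demo"
rm -f "$demo"
```

On the toy input this lists donýst with 2 tokens and vodnýst with 1; the verb with aspect already set and the noun are skipped.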
Adding collocations from an older version
- We work with vert-CNK-vrt from the current version and vert-kolok-CNK-vrt from the older version, which contains only .vrt files.
- Unpack the old version into the merge-kolok directory:
  mkdir merge-kolok
  cd merge-kolok
  tar xzvf ../<corpus>-kolok-vrt.tgz
- Check that both versions contain the same files:
  cd vert-CNK-vrt
  comm -3 <(ls | grep "vrt$") <(ls ../merge-kolok/)
  Surplus new files are either removed or run through collocation annotation. Surplus old files are removed.
- Rename the old files to .kolok:
  cd merge-kolok
  for ff in *; do mv "$ff" "${ff%.vrt}.kolok"; done
- Add the new version (.vrt only). Spaces and | are replaced with placeholders first, because sdiff uses both in its side-by-side output; they are restored after the merge:
  cd ..
  mkdir tmp
  cd tmp
  for ff in ../vert-CNK-vrt/*.vrt; do ln -s $ff; done
  cd ..
  parallel-filter.sh -C "perl -pe 's/ /&space;/g' | perl -pe 's/\|/&#124;/g'" \
    -s tmp -t merge-kolok -v -p10
- In a new directory, vert-kolok, build the new corpus with the old collocations:
  cd merge-kolok
  mkdir ../vert-kolok
  for ff in *.vrt; do echo $ff;
    sdiff -w 5000 <(cut -f1 $ff) <(cut -f1,5-6 ${ff%.vrt}.kolok) \
    | grep -Pv "^[\t][\t ]*\>" | tr -s "\t" | tr -d ' ' | cut -f4,5 > ${ff%.vrt}.56; \
    paste $ff ${ff%.vrt}.56 | perl -pe 's/&space;/ /g' | perl -pe 's/&#124;/|/g' > ../vert-kolok/$ff; done
- Check the result:
  cd vert-kolok
  for ff in *.vrt; do check-vert-tab.pl < $ff; done
  or in parallel:
  parallel-filter.sh -C "check-vert-tab.pl" -p45 -s vert-kolok
  and also check that no empty lines have disappeared:
  cd ../vert-kolok
  for ff in *.vrt; do echo $ff; diffys <(cut -f1 $ff) <(cut -f1 ../vert-CNK-vrt/$ff); done
  or, to print only the files that differ:
  for ff in *.vrt; do echo $ff; diffys <(cut -f1 $ff) <(cut -f1 ../vert-CNK-vrt/$ff); done | grep -B1 -v "\.vrt$"
- From here, continue according to the instructions for building the SYN corpora.
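diffys in the checks above looks like a local helper. Assuming it is not available, a rough equivalent of the last check (column 1 must be identical in the merged file and in the new file, empty lines included) can be sketched with standard tools only. The function name and the toy directories below are illustrative, not part of the original workflow.

```shell
#!/bin/sh
# same_first_column (hypothetical helper): succeed if column 1 of two
# tab-separated files is identical line by line (empty lines included).
same_first_column() {
    tmp=$(mktemp) || return 2
    cut -f1 "$1" > "$tmp"
    cut -f1 "$2" | cmp -s - "$tmp"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Toy stand-ins for vert-CNK-vrt (new files) and vert-kolok (merged files).
work=$(mktemp -d)
mkdir "$work/vert-CNK-vrt" "$work/vert-kolok"
printf 'a\tx\n\nb\ty\n'         > "$work/vert-CNK-vrt/ok.vrt"
printf 'a\tx\tk1\n\nb\ty\tk2\n' > "$work/vert-kolok/ok.vrt"
printf 'a\tx\n\nb\ty\n'         > "$work/vert-CNK-vrt/bad.vrt"
printf 'a\tx\tk1\nb\ty\tk2\n'   > "$work/vert-kolok/bad.vrt"   # empty line lost

for f in "$work"/vert-kolok/*.vrt; do
    name=${f##*/}
    same_first_column "$f" "$work/vert-CNK-vrt/$name" || echo "column 1 differs: $name"
done
rm -rf "$work"
```

On the toy data only bad.vrt is reported, because its empty line went missing during the merge. Unlike a side-by-side diff, this only says which files differ, not where.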