====== Work on SYNv7 ======

  * We use a hybrid toolchain: the old morphology, everything else new where possible
  * Fixes to individual tools:
    * ''remove-dupl-csts-mark.pl''
    * ''negr_kolokace_csts_ucs2.pl''
    * PreMorfo
    * PostMorfo
  * Old morphology: ''CZ170531ax''
  * Hybrid: ''201802-hybrid''
  * Lexicons:<code>
-rw-r--r-- 1 skoumal staff 21020512 Oct 15 23:23 LEX_1byte
-rw-r--r-- 1 skoumal staff 22168215 Oct 15 23:23 LEX_ucs2
</code>

===== Machines and corpora =====

==== grimm ====

  * directory ''/store/corp/SYNv7''
  * ''SYNv4'' -- approx. 2.25 billion words, 49,901 files, approx. 71:56 h, 45 CPUs, finished on 27 Oct at 18:52:<code>
Job "Whole corpus" started at Wed Oct 24 18:56:36 CEST 2018
Target "morf" started at Wed Oct 24 18:56:36 CEST 2018
Target "rules" started at Wed Oct 24 22:09:14 CEST 2018
Target "frazrl" started at Thu Oct 25 12:38:52 CEST 2018
Target "rulh1" started at Fri Oct 26 01:47:02 CEST 2018
Target "tag" started at Fri Oct 26 20:33:20 CEST 2018
Target "vid" started at Sat Oct 27 16:28:35 CEST 2018
Target "corr" started at Sat Oct 27 16:52:06 CEST 2018
Job "Whole corpus" finished at Sat Oct 27 18:52:34 CEST 2018
</code>
    * 31 Oct: 139 files in ''vert-CNK''
    * 1 Nov: ''vert-CNK-vrt''
    * 4 Nov: aspect (vid) fixed, ''vert-kolok-vrt'', tarballs:<code>
-rw-r--r-- 1 skoumal users 797957302 Nov 5 03:36 SYNv4-kolok-json.tgz
-rw-r--r-- 1 skoumal users 6868640103 Nov 5 03:21 SYNv4-kolok-txt.tgz
-rw-r--r-- 1 skoumal users 46081155387 Nov 5 01:17 SYNv4-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal users 16882827015 Nov 4 22:24 SYNv4-kolok-vrt.tgz
</code>
  * ''NEWTON'' -- approx. 2.22 billion words, 164,975 files, 75:15 h, 45 CPUs, finished on 22 Oct at 19:22:<code>
Job "Whole corpus" started at Sat Oct 13 18:51:03 CEST 2018
Target "morf" started at Sat Oct 13 18:51:03 CEST 2018
Target "rules" started at Sat Oct 13 22:08:25 CEST 2018
Job "Whole corpus" started at Fri Oct 19 21:28:00 CEST 2018
Target "rules" started at Fri Oct 19 21:28:00 CEST 2018
Target "frazrl" started at Sat Oct 20 11:28:52 CEST 2018
Target "rulh1" started at Sun Oct 21 03:26:12 CEST 2018
Target "tag" started at Sun Oct 21 20:20:21 CEST 2018
Target "vid" started at Mon Oct 22 16:25:56 CEST 2018
Target "corr" started at Mon Oct 22 17:10:21 CEST 2018
Job "Whole corpus" finished at Mon Oct 22 19:22:40 CEST 2018
</code>
    * 31 Oct: 18 files in ''vert-CNK''
    * 1 Nov: ''vert-CNK-vrt''
    * 4 Nov: aspect (vid) fixed, ''vert-kolok-vrt'', tarballs:<code>
-rw-r--r-- 1 skoumal users 568716630 Nov 5 03:37 NEWTON-kolok-json.tgz
-rw-r--r-- 1 skoumal users 4493576498 Nov 5 03:16 NEWTON-kolok-txt.tgz
-rw-r--r-- 1 skoumal users 30058342160 Nov 5 00:42 NEWTON-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal users 10953163552 Nov 4 22:34 NEWTON-kolok-vrt.tgz
</code>

==== chomsky ====

  * directory ''/store/corp/SYNv7''
  * ''SYN2015'' -- approx. 101.4 million words, 3376 files, approx. 11:20 h, 10 CPUs, finished on 16 Oct at 10:58:<code>
Job "Whole corpus" started at Monday 15 October 23:39:23 CEST 2018
Target "morf" started at Monday 15 October 23:39:23 CEST 2018
Target "rules" started at Tuesday 16 October 00:04:44 CEST 2018
Target "frazrl" started at Tuesday 16 October 02:19:16 CEST 2018
Target "rulh1" started at Tuesday 16 October 04:53:08 CEST 2018
Target "tag" started at Tuesday 16 October 07:58:26 CEST 2018
Target "vid" started at Tuesday 16 October 10:40:30 CEST 2018
Target "corr" started at Tuesday 16 October 10:42:52 CEST 2018
Job "Whole corpus" finished at Tuesday 16 October 10:58:19 CEST 2018
</code>
    * 31 Oct: ''vert-CNK-vrt''
    * 4 Nov: aspect (vid) fixed, ''vert-kolok-vrt'', tarballs:<code>
-rw-r--r-- 1 skoumal staff 38881623 Nov 3 12:35 SYN2015-kolok-json.tgz
-rw-r--r-- 1 skoumal staff 291842804 Nov 3 12:34 SYN2015-kolok-txt.tgz
-rw-r--r-- 1 skoumal staff 2084953398 Nov 4 21:54 SYN2015-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal staff 730051497 Nov 4 21:45 SYN2015-kolok-vrt.tgz
</code>
  * ''NEWTON2017'' -- approx. 172 million words, 6191 files, approx. 18 h, 10 CPUs, finished on 17 Oct at 08:42:<code>
Job "Whole corpus" started at Tuesday 16 October 14:41:53 CEST 2018
Target "morf" started at Tuesday 16 October 14:41:53 CEST 2018
Target "rules" started at Tuesday 16 October 15:33:12 CEST 2018
Target "frazrl" started at Tuesday 16 October 19:33:01 CEST 2018
Target "rulh1" started at Tuesday 16 October 23:17:48 CEST 2018
Target "tag" started at Wednesday 17 October 03:42:41 CEST 2018
Target "vid" started at Wednesday 17 October 08:13:20 CEST 2018
Target "corr" started at Wednesday 17 October 08:19:16 CEST 2018
Job "Whole corpus" finished at Wednesday 17 October 08:42:52 CEST 2018
</code>
    * collocations (kolok): approx. 51:52 h, 10 CPUs, finished on 19 Oct at 12:53
    * 1 Nov: ''vert-kolok-CNK-vrt''
    * 4 Nov: aspect (vid) OK, tarballs:<code>
-rw-r--r-- 1 skoumal staff 72100181 Nov 3 11:37 NEWTON2017-kolok-json.tgz
-rw-r--r-- 1 skoumal staff 518622828 Nov 3 11:39 NEWTON2017-kolok-txt.tgz
-rw-r--r-- 1 skoumal staff 3468708754 Nov 3 11:35 NEWTON2017-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal staff 1267489893 Nov 3 11:17 NEWTON2017-kolok-vrt.tgz
</code>

==== jakobson ====

  * directory ''/mnt/sdd1/corp/SYNv7''
  * ''NEWTON2015'' -- approx. 214.25 million words, 6302 files, approx. 30:38 h, 7 CPUs, finished on 25 Oct at 17:51:<code>
Job "Whole corpus" started at Wednesday 24 October 09:53:27 CEST 2018
Target "morf" started at Wednesday 24 October 09:53:27 CEST 2018
Target "rules" started at Wednesday 24 October 11:20:20 CEST 2018
Target "frazrl" started at Wednesday 24 October 17:58:05 CEST 2018
Target "rulh1" started at Wednesday 24 October 23:58:21 CEST 2018
Target "tag" started at Thursday 25 October 07:49:27 CEST 2018
Job "Whole corpus" started at Thursday 25 October 09:09:23 CEST 2018
Target "tag" started at Thursday 25 October 09:09:23 CEST 2018
Target "vid" started at Thursday 25 October 16:59:34 CEST 2018
Target "corr" started at Thursday 25 October 17:11:27 CEST 2018
Job "Whole corpus" finished at Thursday 25 October 17:51:43 CEST 2018
</code>
    * 1 Nov: ''vert-CNK-vrt''
    * 4 Nov: aspect (vid) fixed, ''vert-kolok-vrt'', tarballs:<code>
-rw-r--r-- 1 skoumal staff 86316120 Nov 3 19:38 NEWTON2015-kolok-json.tgz
-rw-r--r-- 1 skoumal staff 645338532 Nov 3 19:37 NEWTON2015-kolok-txt.tgz
-rw-r--r-- 1 skoumal staff 4323163846 Nov 4 22:20 NEWTON2015-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal staff 1573363630 Nov 4 21:51 NEWTON2015-kolok-vrt.tgz
</code>
  * ''NEWTON2016'' -- approx. 201 million words, 6207 files, approx. 28:44 h, 7 CPUs, finished on 28 Oct at 01:07:<code>
Job "Whole corpus" started at Friday 26 October 13:16:38 CEST 2018
Target "morf" started at Friday 26 October 13:16:38 CEST 2018
Target "rules" started at Friday 26 October 14:38:48 CEST 2018
Target "frazrl" started at Friday 26 October 20:52:29 CEST 2018
Target "rulh1" started at Saturday 27 October 02:34:42 CEST 2018
Target "tag" started at Saturday 27 October 09:56:15 CEST 2018
Job "Whole corpus" started at Saturday 27 October 16:57:13 CEST 2018
Target "tag" started at Saturday 27 October 16:57:13 CEST 2018
Target "vid" started at Sunday 28 October 00:18:33 CEST 2018
Target "corr" started at Sunday 28 October 00:29:58 CEST 2018
Job "Whole corpus" finished at Sunday 28 October 01:07:44 CEST 2018
</code>
    * 1 Nov: ''vert-CNK-vrt''
    * 4 Nov: aspect (vid) fixed, ''vert-kolok-vrt'', tarballs:<code>
-rw-r--r-- 1 skoumal staff 84482662 Nov 3 20:47 NEWTON2016-kolok-json.tgz
-rw-r--r-- 1 skoumal staff 604942840 Nov 3 20:46 NEWTON2016-kolok-txt.tgz
-rw-r--r-- 1 skoumal staff 4036251853 Nov 4 21:53 NEWTON2016-kolok-vrt.json.tgz
-rw-r--r-- 1 skoumal staff 1476278377 Nov 4 21:39 NEWTON2016-kolok-vrt.tgz
</code>

===== General and machine-specific problems =====

  * General:
    * ''-li'' fixed manually:<code>
[skoumal@grimm NEWTON-vadne]$ ll in-utf8/
total 1096
-rw-rw-r-- 1 skoumal users 178668 Oct 13 18:16 fz100111.txt
-rw-rw-r-- 1 skoumal users 162456 Oct 13 18:16 fz100208.txt
-rw-rw-r-- 1 skoumal users 165740 Oct 13 18:16 fz100308.txt
-rw-rw-r-- 1 skoumal users 160244 Oct 13 18:16 fz100408.txt
-rw-rw-r-- 1 skoumal users 34046 Oct 13 18:18 ro140417.txt
-rw-rw-r-- 1 skoumal users 398466 Oct 13 18:15 tydn0207.txt

[skoumal@chomsky NEWTON2017-vadne]$ ll in-utf8/
total 1024
-rw-rw-r-- 1 skoumal staff 462469 Oct 16 11:58 mf170906.txt
-rw-rw-r-- 1 skoumal staff 534866 Oct 16 11:57 mf170907.txt
-rw-rw-r-- 1 skoumal staff 47732 Oct 16 11:59 moro1744.txt
</code>
  * jakobson:
    * ''featurama'' cannot set ''TEMPDIR''

===== Fixes for PostDisambVid =====

  * Brně, Kladně, Krásně, Plavně, Mýtě, Stříbře, Jasně, Běsně, Ústí, Plzeň, Mže, Třeště, Mělníce, Nevadě, Liberce
  * pan Kuna, Stehno
  * paní Černo
  * aspects: vodnýst, donýst, nanýst, povznýst, vodpovědět:<code>
cd vert-kolok-vrt
# list the files containing a verb tag whose 16th position (aspect) is still '-'
grep -Pl "^[^\t]+\t[^\t]+\tV..............-" *.vrt > ../opravit-asp.txt
# work on copies of the affected files only
mkdir ../vert-aspect
cd ../vert-aspect
for ff in $(cat ../opravit-asp.txt); do cp -p "../vert-kolok-vrt/$ff" .; done
# no output means that all listed files were copied
diff <(ls) <(cat ../opravit-asp.txt)
# overview of the affected lemmas
grep -P "^[^\t]+\t[^\t]+\tV..............-" *.vrt | cut -f2 | sort -u
</code>and then, for example:<code>
# lemmas ending in "nýst": set the aspect to P
for ff in *.vrt; do echo "$ff"; perl -i.bak -pe 's/(nýst\tV[^\t]{14})-/$1P/' "$ff"; done
</code>
  * verbs: utváří-utvářit:<code>
# correct the lemma to "utvářet" and set the aspect to I
for ff in *.vrt; do echo "$ff"; perl -i.bak -pe 's/utvářit(\tV[^\t]{14})-/utvářet$1I/' "$ff"; done
</code>
  * check:<code>
# compare every backup with its modified file
for ff in *.bak; do echo "$ff"; diffys "$ff" "${ff%.bak}"; done
</code>

===== Adding collocations from an older version =====

  * We work with ''vert-CNK-vrt'' from the current version and ''vert-kolok-CNK-vrt'' from the older version, which contains only ''.vrt'' files.
  * We put the old version into the directory ''merge-kolok'' and unpack it:<code>
mkdir merge-kolok
cd merge-kolok
tar xzvf ../-kolok-vrt.tgz
</code>
  * We check that both versions contain the same files:<code>
cd vert-CNK-vrt
comm -3 <(ls | grep "vrt$") <(ls ../merge-kolok/)
</code>Surplus new files are either removed or run through collocation annotation. Surplus old files are removed.
  * We rename the files to ''.kolok'':<code>
cd merge-kolok
for ff in *; do mv "$ff" "${ff%.vrt}.kolok"; done
</code>
  * We add the new version (''.vrt'' only):<code>
cd ..
mkdir tmp
cd tmp
for ff in ../vert-CNK-vrt/*.vrt; do ln -s "$ff"; done
cd ..
# protect spaces and '|' inside tokens, because sdiff uses both in its output;
# the placeholder for '|' is garbled in the source, '&vert;' is an assumption
parallel-filter.sh -C "perl -pe 's/ /&space;/g' | perl -pe 's/\|/&vert;/g'" \
  -s tmp -t merge-kolok -v -p10
</code>
  * In a new directory ''vert-kolok'' we build the new corpus with the old collocations:<code>
cd merge-kolok
mkdir ../vert-kolok
# align the word columns of both versions, keep the old columns 5-6,
# append them to the new file and restore the protected characters
for ff in *.vrt; do echo "$ff"; sdiff -w 5000 <(cut -f1 "$ff") <(cut -f1,5-6 "${ff%.vrt}.kolok") \
  | grep -Pv "^[\t][\t ]*\>" | tr -s "\t" | tr -d ' ' | cut -f4,5 > "${ff%.vrt}.56"; \
  paste "$ff" "${ff%.vrt}.56" | perl -pe 's/&space;/ /g' | perl -pe 's/&vert;/|/g' > "../vert-kolok/$ff"; done
</code>
  * We check the result:<code>
cd vert-kolok
for ff in *.vrt; do check-vert-tab.pl < "$ff"; done
</code>or in parallel:<code>
parallel-filter.sh -C "check-vert-tab.pl" -p45 -s vert-kolok
</code>and then whether any empty lines have disappeared:<code>
cd ../vert-kolok
for ff in *.vrt; do echo "$ff"; diffys <(cut -f1 "$ff") <(cut -f1 "../vert-CNK-vrt/$ff"); done
</code>or, to show only the files with differences:<code>
for ff in *.vrt; do echo "$ff"; diffys <(cut -f1 "$ff") <(cut -f1 "../vert-CNK-vrt/$ff"); done | grep -B1 -v "\.vrt$"
</code>

----

  * Next we continue according to the guide to building [[wiki:user:skoumal:infra:syn|SYN corpora]]
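
The file-set comparison from the collocation-merging steps above can be tried out on throwaway data before touching a real corpus. The following is only a minimal sketch, not the original workflow: it uses temporary list files instead of process substitution so that it also runs under plain ''sh'', and the function name ''only_in_one'' as well as the demo file names are made up for illustration.

```shell
#!/bin/sh
# Sketch: report files that exist in only one of two corpus directories.
# Function name, directory layout and file names are hypothetical.
set -eu

only_in_one() {
    # $1 = directory with the new version (only names ending in "vrt" count)
    # $2 = directory with the old version (all files count)
    new_list=$(mktemp)
    old_list=$(mktemp)
    # comm needs both inputs sorted in the same collation order
    ( cd "$1" && ls ) | grep 'vrt$' | LC_ALL=C sort > "$new_list"
    ( cd "$2" && ls ) | LC_ALL=C sort > "$old_list"
    # -3 hides the common names: unindented = only in $1, tab-indented = only in $2
    LC_ALL=C comm -3 "$new_list" "$old_list"
    rm -f "$new_list" "$old_list"
}

# Throwaway demo data: c.vrt exists only in the new version, d.vrt only in the old one
demo=$(mktemp -d)
mkdir "$demo/vert-CNK-vrt" "$demo/merge-kolok"
touch "$demo/vert-CNK-vrt/a.vrt" "$demo/vert-CNK-vrt/b.vrt" "$demo/vert-CNK-vrt/c.vrt"
touch "$demo/merge-kolok/a.vrt" "$demo/merge-kolok/b.vrt" "$demo/merge-kolok/d.vrt"

result=$(only_in_one "$demo/vert-CNK-vrt" "$demo/merge-kolok")
printf '%s\n' "$result"
rm -rf "$demo"
```

For the demo data this prints ''c.vrt'' (present only in the new version) followed by a tab-indented ''d.vrt'' (present only in the old version). An empty output means that both directories hold the same set of files and the merge can proceed.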