Wiki launched on 24 July 2025

Work on SYNv7

  • We will use hybrid tools: the old morphology, everything else new where possible
  • Fixes to individual tools:
    • remove-dupl-csts-mark.pl
    • negr_kolokace_csts_ucs2.pl
    • PreMorfo
    • PostMorfo
  • Old morphology: CZ170531ax
  • Hybrid: 201802-hybrid
  • Lexicons:
    -rw-r--r--  1 skoumal  staff 21020512 Oct 15 23:23 LEX_1byte
    -rw-r--r--  1 skoumal  staff 22168215 Oct 15 23:23 LEX_ucs2

Machines and corpora

grimm

  • directory /store/corp/SYNv7
    • SYNv4 – approx. 2.25 billion words, 49,901 files, approx. 71:56 h, 45 CPUs, finished on 27 Oct at 18:52:
      Job "Whole corpus" started at Wed Oct 24 18:56:36 CEST 2018
      Target "morf" started at Wed Oct 24 18:56:36 CEST 2018
      Target "rules" started at Wed Oct 24 22:09:14 CEST 2018
      Target "frazrl" started at Thu Oct 25 12:38:52 CEST 2018
      Target "rulh1" started at Fri Oct 26 01:47:02 CEST 2018
      Target "tag" started at Fri Oct 26 20:33:20 CEST 2018
      Target "vid" started at Sat Oct 27 16:28:35 CEST 2018
      Target "corr" started at Sat Oct 27 16:52:06 CEST 2018
      Job "Whole corpus" finished at Sat Oct 27 18:52:34 CEST 2018
      • 31 Oct: 139 files in vert-CNK
      • 1 Nov: vert-CNK-vrt
      • 4 Nov: vid fixed, vert-kolok-vrt, tarballs:
        -rw-r--r-- 1 skoumal users   797957302 Nov  5 03:36 SYNv4-kolok-json.tgz
        -rw-r--r-- 1 skoumal users  6868640103 Nov  5 03:21 SYNv4-kolok-txt.tgz
        -rw-r--r-- 1 skoumal users 46081155387 Nov  5 01:17 SYNv4-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal users 16882827015 Nov  4 22:24 SYNv4-kolok-vrt.tgz
    • NEWTON – approx. 2.22 billion words, 164,975 files, 75:15 h, 45 CPUs, finished on 22 Oct at 19:22:
      Job "Whole corpus" started at Sat Oct 13 18:51:03 CEST 2018
      Target "morf" started at Sat Oct 13 18:51:03 CEST 2018
      Target "rules" started at Sat Oct 13 22:08:25 CEST 2018
      Job "Whole corpus" started at Fri Oct 19 21:28:00 CEST 2018
      Target "rules" started at Fri Oct 19 21:28:00 CEST 2018
      Target "frazrl" started at Sat Oct 20 11:28:52 CEST 2018
      Target "rulh1" started at Sun Oct 21 03:26:12 CEST 2018
      Target "tag" started at Sun Oct 21 20:20:21 CEST 2018
      Target "vid" started at Mon Oct 22 16:25:56 CEST 2018
      Target "corr" started at Mon Oct 22 17:10:21 CEST 2018
      Job "Whole corpus" finished at Mon Oct 22 19:22:40 CEST 2018
      • 31 Oct: 18 files in vert-CNK
      • 1 Nov: vert-CNK-vrt
      • 4 Nov: vid fixed, vert-kolok-vrt, tarballs:
        -rw-r--r-- 1 skoumal users   568716630 Nov  5 03:37 NEWTON-kolok-json.tgz
        -rw-r--r-- 1 skoumal users  4493576498 Nov  5 03:16 NEWTON-kolok-txt.tgz
        -rw-r--r-- 1 skoumal users 30058342160 Nov  5 00:42 NEWTON-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal users 10953163552 Nov  4 22:34 NEWTON-kolok-vrt.tgz
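
The elapsed times quoted above can be cross-checked against the "started"/"finished" stamps in the logs. The following is only a sketch: the function names to_epoch and elapsed_hhmm are hypothetical, GNU date is assumed, and the time-zone token is simply dropped, which is valid only when both stamps carry the same zone.

```shell
#!/bin/sh
# Sketch: elapsed wall-clock time between two log stamps of the form
# "Wed Oct 24 18:56:36 CEST 2018". Assumes GNU date (-d option).
# The zone abbreviation is dropped and both stamps are read as UTC,
# so the result is only correct if both stamps are in the same zone.

to_epoch() {
  # Keep weekday, month, day, time and year; skip field 5 (the zone).
  stamp=$(printf '%s\n' "$1" | awk '{ print $1, $2, $3, $4, $6 }')
  date -u -d "$stamp" +%s
}

elapsed_hhmm() {
  start=$(to_epoch "$1")
  end=$(to_epoch "$2")
  secs=$(( end - start ))
  # Whole hours and remaining minutes (seconds are truncated).
  printf '%d:%02d\n' $(( secs / 3600 )) $(( (secs % 3600) / 60 ))
}

# Start and finish of the SYNv4 "Whole corpus" job from the log above.
elapsed_hhmm "Wed Oct 24 18:56:36 CEST 2018" "Sat Oct 27 18:52:34 CEST 2018"
```

For the SYNv4 job this prints 71:55, in line with the "approx. 71:56 h" given above. For jobs that were restarted (such as NEWTON), each run would have to be measured separately.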

chomsky

  • directory /store/corp/SYNv7
    • SYN2015 – approx. 101.4 million words, 3376 files, approx. 11:20 h, 10 CPUs, finished on 16 Oct at 10:58:
      Job "Whole corpus" started at Monday 15 October  23:39:23 CEST 2018
      Target "morf" started at Monday 15 October  23:39:23 CEST 2018
      Target "rules" started at Tuesday 16 October  00:04:44 CEST 2018
      Target "frazrl" started at Tuesday 16 October  02:19:16 CEST 2018
      Target "rulh1" started at Tuesday 16 October  04:53:08 CEST 2018
      Target "tag" started at Tuesday 16 October  07:58:26 CEST 2018
      Target "vid" started at Tuesday 16 October  10:40:30 CEST 2018
      Target "corr" started at Tuesday 16 October  10:42:52 CEST 2018
      Job "Whole corpus" finished at Tuesday 16 October  10:58:19 CEST 2018
      • 31 Oct: vert-CNK-vrt
      • 4 Nov: vid fixed, vert-kolok-vrt, tarballs:
        -rw-r--r-- 1 skoumal staff   38881623 Nov  3 12:35 SYN2015-kolok-json.tgz
        -rw-r--r-- 1 skoumal staff  291842804 Nov  3 12:34 SYN2015-kolok-txt.tgz
        -rw-r--r-- 1 skoumal staff 2084953398 Nov  4 21:54 SYN2015-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal staff  730051497 Nov  4 21:45 SYN2015-kolok-vrt.tgz
    • NEWTON2017 – approx. 172 million words, 6191 files, approx. 18 h, 10 CPUs, finished on 17 Oct at 08:42:
      Job "Whole corpus" started at Tuesday 16 October  14:41:53 CEST 2018
      Target "morf" started at Tuesday 16 October  14:41:53 CEST 2018
      Target "rules" started at Tuesday 16 October  15:33:12 CEST 2018
      Target "frazrl" started at Tuesday 16 October  19:33:01 CEST 2018
      Target "rulh1" started at Tuesday 16 October  23:17:48 CEST 2018
      Target "tag" started at Wednesday 17 October  03:42:41 CEST 2018
      Target "vid" started at Wednesday 17 October  08:13:20 CEST 2018
      Target "corr" started at Wednesday 17 October  08:19:16 CEST 2018
      Job "Whole corpus" finished at Wednesday 17 October  08:42:52 CEST 2018
      • kolok, approx. 51:52 h, 10 CPUs, finished on 19 Oct at 12:53
      • 1 Nov: vert-kolok-CNK-vrt
      • 4 Nov: vid OK, tarballs:
        -rw-r--r-- 1 skoumal staff   72100181 Nov  3 11:37 NEWTON2017-kolok-json.tgz
        -rw-r--r-- 1 skoumal staff  518622828 Nov  3 11:39 NEWTON2017-kolok-txt.tgz
        -rw-r--r-- 1 skoumal staff 3468708754 Nov  3 11:35 NEWTON2017-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal staff 1267489893 Nov  3 11:17 NEWTON2017-kolok-vrt.tgz

jakobson

  • directory /mnt/sdd1/corp/SYNv7
    • NEWTON2015 – approx. 214.25 million words, 6302 files, approx. 30:38 h, 7 CPUs, finished on 25 Oct at 17:51:
      Job "Whole corpus" started at Wednesday 24 October  09:53:27 CEST 2018
      Target "morf" started at Wednesday 24 October  09:53:27 CEST 2018
      Target "rules" started at Wednesday 24 October  11:20:20 CEST 2018
      Target "frazrl" started at Wednesday 24 October  17:58:05 CEST 2018
      Target "rulh1" started at Wednesday 24 October  23:58:21 CEST 2018
      Target "tag" started at Thursday 25 October  07:49:27 CEST 2018
      Job "Whole corpus" started at Thursday 25 October  09:09:23 CEST 2018
      Target "tag" started at Thursday 25 October  09:09:23 CEST 2018
      Target "vid" started at Thursday 25 October  16:59:34 CEST 2018
      Target "corr" started at Thursday 25 October  17:11:27 CEST 2018
      Job "Whole corpus" finished at Thursday 25 October  17:51:43 CEST 2018
      • 1 Nov: vert-CNK-vrt
      • 4 Nov: vid fixed, vert-kolok-vrt, tarballs:
        -rw-r--r-- 1 skoumal staff   86316120 Nov  3 19:38 NEWTON2015-kolok-json.tgz
        -rw-r--r-- 1 skoumal staff  645338532 Nov  3 19:37 NEWTON2015-kolok-txt.tgz
        -rw-r--r-- 1 skoumal staff 4323163846 Nov  4 22:20 NEWTON2015-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal staff 1573363630 Nov  4 21:51 NEWTON2015-kolok-vrt.tgz
    • NEWTON2016 – approx. 201 million words, 6207 files, approx. 28:44 h, 7 CPUs, finished on 28 Oct at 01:07:
      Job "Whole corpus" started at Friday 26 October  13:16:38 CEST 2018
      Target "morf" started at Friday 26 October  13:16:38 CEST 2018
      Target "rules" started at Friday 26 October  14:38:48 CEST 2018
      Target "frazrl" started at Friday 26 October  20:52:29 CEST 2018
      Target "rulh1" started at Saturday 27 October  02:34:42 CEST 2018
      Target "tag" started at Saturday 27 October  09:56:15 CEST 2018
      Job "Whole corpus" started at Saturday 27 October  16:57:13 CEST 2018
      Target "tag" started at Saturday 27 October  16:57:13 CEST 2018
      Target "vid" started at Sunday 28 October  00:18:33 CEST 2018
      Target "corr" started at Sunday 28 October  00:29:58 CEST 2018
      Job "Whole corpus" finished at Sunday 28 October  01:07:44 CEST 2018
      • 1 Nov: vert-CNK-vrt
      • 4 Nov: vid fixed, vert-kolok-vrt, tarballs:
        -rw-r--r-- 1 skoumal staff   84482662 Nov  3 20:47 NEWTON2016-kolok-json.tgz
        -rw-r--r-- 1 skoumal staff  604942840 Nov  3 20:46 NEWTON2016-kolok-txt.tgz
        -rw-r--r-- 1 skoumal staff 4036251853 Nov  4 21:53 NEWTON2016-kolok-vrt.json.tgz
        -rw-r--r-- 1 skoumal staff 1476278377 Nov  4 21:39 NEWTON2016-kolok-vrt.tgz

Problems, both general and machine-specific

  • General:
    • -li fixed manually:
      [skoumal@grimm NEWTON-vadne]$ ll in-utf8/
      total 1096
      -rw-rw-r--  1 skoumal users 178668 Oct 13 18:16 fz100111.txt
      -rw-rw-r--  1 skoumal users 162456 Oct 13 18:16 fz100208.txt
      -rw-rw-r--  1 skoumal users 165740 Oct 13 18:16 fz100308.txt
      -rw-rw-r--  1 skoumal users 160244 Oct 13 18:16 fz100408.txt
      -rw-rw-r--  1 skoumal users  34046 Oct 13 18:18 ro140417.txt
      -rw-rw-r--  1 skoumal users 398466 Oct 13 18:15 tydn0207.txt
      [skoumal@chomsky NEWTON2017-vadne]$ ll in-utf8/
      total 1024
      -rw-rw-r-- 1 skoumal staff 462469 Oct 16 11:58 mf170906.txt
      -rw-rw-r-- 1 skoumal staff 534866 Oct 16 11:57 mf170907.txt
      -rw-rw-r-- 1 skoumal staff  47732 Oct 16 11:59 moro1744.txt
  • jakobson:
    • featurama cannot set TEMPDIR

Fixes for PostDisambVid

  • Brně, Kladně, Krásně, Plavně, Mýtě, Stříbře, Jasně, Běsně, Ústí, Plzeň, Mže, Třeště, Mělníce, Nevadě, Liberce
  • pan Kuna, Stehno
  • paní Černo
  • aspects: vodnýst, donýst, nanýst, povznýst, vodpovědět:
    cd vert-kolok-vrt
    grep -Pl "^[^\t]+\t[^\t]+\tV..............-" *.vrt > ../opravit-asp.txt
    mkdir ../vert-aspect
    cd ../vert-aspect
    for ff in $(cat ../opravit-asp.txt); do cp -p ../vert-kolok-vrt/$ff .; done
    diff <(ls) <(cat ../opravit-asp.txt)
    grep -P "^[^\t]+\t[^\t]+\tV..............-" *.vrt | cut -f2 | sort -u

    and then, for example:

    for ff in *.vrt; do echo $ff; perl -i.bak -pe 's/(nýst\tV[^\t]{14})-/$1P/' $ff; done
  • verbs: utváří-utvářit:
    for ff in *.vrt; do echo $ff; perl -i.bak -pe 's/utvářit(\tV[^\t]{14})-/utvářet$1I/' $ff; done
  • check:
    for ff in *.bak; do echo $ff; diffys $ff ${ff%.bak}; done

Adding collocations from an older version

  • We work with vert-CNK-vrt from the current version and vert-kolok-CNK-vrt from the older version, which contains only .vrt files.
  • We unpack the old version into the merge-kolok directory:
    mkdir merge-kolok
    cd merge-kolok
    tar xzvf ../<korpus>-kolok-vrt.tgz
  • We check that both versions contain the same files:
    cd vert-CNK-vrt
    comm -3 <(ls | grep "vrt$") <(ls ../merge-kolok/)

    Surplus new files are either removed or run through collocation annotation. Surplus old files are removed.

  • We rename the files to .kolok:
    cd merge-kolok
    for ff in *; do mv $ff ${ff%.vrt}.kolok; done
  • We add the new version (.vrt only):
    cd ..
    mkdir tmp
    cd tmp
    for ff in ../vert-CNK-vrt/*.vrt; do ln -s $ff; done
    cd ..
    parallel-filter.sh -C "perl -pe 's/ /&space;/g' | perl -pe 's/\|/&verbar;/g'" \
    -s tmp -t merge-kolok -v -p10
  • In the new directory vert-kolok we build a new corpus with the old collocations:
    cd merge-kolok
    mkdir ../vert-kolok
    for ff in *.vrt; do echo $ff; sdiff -w 5000 <(cut -f1 $ff) <(cut -f1,5-6 ${ff%.vrt}.kolok) \
    | grep -Pv "^[\t][\t ]*\>" | tr -s "\t" | tr -d ' ' | cut -f4,5 > ${ff%.vrt}.56; \
    paste $ff ${ff%.vrt}.56 | perl -pe 's/&space;/ /g' | perl -pe 's/&verbar;/|/g' > ../vert-kolok/$ff; done
  • We check correctness:
    cd vert-kolok
    for ff in *.vrt; do check-vert-tab.pl < $ff; done

    or in parallel:

    parallel-filter.sh -C "check-vert-tab.pl" -p45 -s vert-kolok

    and also whether any empty lines have disappeared:

    cd ../vert-kolok
    for ff in *.vrt; do echo $ff; diffys <(cut -f1 $ff) <(cut -f1 ../vert-CNK-vrt/$ff); done

    or alternatively:

    for ff in *.vrt; do echo $ff; diffys <(cut -f1 $ff) <(cut -f1 ../vert-CNK-vrt/$ff); done | grep -B1 -v "\.vrt$"

  • We then continue according to the guide for building the SYN corpora
