A flowchart of the parsing process

        Action taken..              ..by means of the command:
        **************              **************************

    |--------------------|
    | take incoming text |        sme$ cat corp/filename.txt |
    |--------------------|
             \/
 |--------------------------|
 | preprocessing it:        |
 | moving one word per line,|     preprocess --abbr=bin/abbr.txt |
 | finding sentence bound.  |
 |--------------------------|
             \/
|-----------------------------|
| morphological analysis:     |
| give each word all possible |   lookup -flags mbTT -utf8 bin/sme.fst |
| analyses                    |
|-----------------------------|
             \/
|-----------------------------|
| processing the output into  |
| a format that fits the dis- |   lookup2cg |
| ambiguator, w/a perlscript  |
|-----------------------------|
             \/
|------------------------------|
| adding syntactic tags        |
| disambiguating the m-analysis|
| picking only the relevant    |   vislcg --grammar src/sme-dis.rle
| morphological analyses.      |
| disambiguating the s-analysis|
|------------------------------|
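
Read from top to bottom, the commands in the right-hand column form a single pipeline. The sketch below simply joins them in a shell function. The function name parse_sme is made up for illustration, and the sketch assumes that preprocess, lookup, lookup2cg and vislcg can all be found on the PATH; the flags and file paths are the ones shown in the flowchart.

```shell
# Sketch only: joins the five flowchart steps into one pipeline.
# parse_sme is an illustrative name, not part of the Divvun tools.
# Assumes the four tools are on the PATH and that the function is
# called from the sme directory, so the relative paths resolve.
parse_sme() {
    input="$1"    # a text file, e.g. corp/filename.txt

    cat "$input" |
        preprocess --abbr=bin/abbr.txt |          # one word per line, sentence boundaries
        lookup -flags mbTT -utf8 bin/sme.fst |    # all possible analyses for each word
        lookup2cg |                               # reformat for the disambiguator
        vislcg --grammar src/sme-dis.rle          # syntactic tags and disambiguation
}
```

With that definition loaded, `parse_sme corp/filename.txt` would run the whole chain on one file.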

Since May 2005, the grammar file used by vislcg also assigns syntactic tags, and the rules at the end of the file disambiguate them.

For the command to work, it must be run from the sme directory (or the corresponding directory for another language), because all the paths are relative to it. The files are in different directories, for the following reasons:

  • The text file is in the corp directory (any text can be used).
  • Starting in June 2005, our new XML corpus is in a separate corp directory, which is not in CVS.
  • The binary files (the files that are compiled) are in the bin directory.
  • The lookup2cg script is a Perl script that is common to all languages, and is hence in the ../script directory.
  • The .rle file is a source file. It is not compiled, and it is hence in the src directory.
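
Since the paths only resolve from the right directory, a quick check before running the pipeline can save some guessing. The helper below is a hypothetical sketch, not an existing Divvun script: it looks for the three files named in the flowchart, relative to the current directory, and reports any that are missing.

```shell
# Hypothetical helper (check_layout is an illustrative name).
# Checks for the files the pipeline reads, relative to the current
# directory, and prints the missing ones to standard error.
check_layout() {
    missing=0
    for f in bin/abbr.txt bin/sme.fst src/sme-dis.rle; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    # Exit status 0 if everything was found, 1 otherwise.
    return "$missing"
}
```

Run from the sme directory, `check_layout && echo ok` would print ok only when all three files are in place.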

Hmm, one could perhaps claim that this is somewhat confusing.

Last modified: $Date: 2008-11-05 18:52:54 +0100 (ons, 05 nov 2008) $, by $Author: boerre $

by Trond Trosterud