How to use the morphological parsers
Preamble: setting up the environment (You probably did this already)
Analysing and generating words
- For analysis, type u followed by the three-letter language code; for North Sámi, for example, type usme.
- Then type the words to be analysed, one at a time, each followed by <ENTER>.
- To leave analysis mode, press Ctrl-C.
- For generation, type d followed by the three-letter language code; for North Sámi, for example, type dsme.
- Then type the lemma and grammatical tags (in the same form as the analyser gives as output), followed by <ENTER>.
For testing, you may also write a file with one word form on each line, and then feed that file to the analyser (the example here is for Inari Sámi, with a file called testfile.txt):
cat testfile.txt | usmn | less
(again, to leave analysis mode, press Ctrl-C, and to leave less, press q)
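If the test material is running text rather than a word list, it first has to be split into one word form per line. The following is a minimal sketch using only standard tools; the file name text.txt is an assumption made for illustration, and the punctuation handling is deliberately crude (it only deletes a few common ASCII punctuation marks).

```shell
# Sample running text, stored in a hypothetical file text.txt.
printf 'Mun lean boahtán.\n' > text.txt

# Delete basic punctuation, then turn every run of whitespace into a
# single newline, so that each word form ends up on its own line.
tr -d '.,;:!?' < text.txt | tr -s '[:space:]' '\n' > testfile.txt

# Show the resulting word list.
cat testfile.txt
```

The resulting testfile.txt can then be fed to the analyser as shown above.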
We have set up aliases for text analysis. They can be run from any directory.
- Gives a sentence analysis of North Sámi
- Gives a dependency analysis of North Sámi
- Gives a sentence analysis of North Sámi, in trace mode (showing which disambiguation rules apply)
- Gives a dependency analysis of North Sámi, in trace mode (showing which dependency rules apply)
To do the same for other languages, replace sme with your own language code. If the alias does not work for a given language, it means that no such analysis (say, a dependency analysis) has been written for that language.
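Before relying on an alias for another language, it may be convenient to check whether it is defined at all. The sketch below is one possible way to do that; the helper name have_cmd is hypothetical, and smndis is simply smedis with sme replaced by the Inari Sámi code smn, following the rule above. Whether that alias exists depends on your setup.

```shell
# have_cmd NAME: succeed if NAME is a defined alias, a shell function,
# or an executable found on PATH. (Hypothetical helper, not part of the
# infrastructure described on this page.)
have_cmd() {
  alias "$1" >/dev/null 2>&1 || command -v "$1" >/dev/null 2>&1
}

# Example: check for a sentence analysis alias for Inari Sámi.
if have_cmd smndis; then
  echo "smndis is available"
else
  echo "smndis is not defined; no sentence analysis seems to be set up for smn"
fi
```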
These aliases may be used in two ways. Either write the alias followed by a sentence in quotes:
smedis "Mun lean boahtán."
Or, alternatively, pipe a file through it:
cat testfile.txt | smedis
If you are using HFST, the command to tokenise, analyse and print the output in a CG-compatible format is:
cat testfile.txt | hfst-tokenise --gtd tools/preprocess/tokeniser-disamb-gt-desc.pmhfst
Please note that the file tools/preprocess/tokeniser-disamb-gt-desc.pmhfst is not built by default. To enable building it, configure as follows:
./configure --with-hfst --enable-tokenisers
Instead of just showing the result on the screen as running text (as above), you can process it further. Here are some examples; each of these strings should be added after the smedis etc. commands above.
| grep '+N+Pl' > plnouns
(to get all plural nouns and save them to the file plnouns)
| grep -v '\?' | cut -f2 | sort | uniq -c | sort -nr | less
(to get a frequency list of the lexemes that the parser recognizes)
| grep '\?' | sort | uniq -c | sort -nr | less
(to get a frequency list of the words that the parser does not recognize)
| grep '\+\?' | sort | uniq -c | sort -nr | less
(to get a frequency list of the word forms that the parser does not recognize)
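To see what the frequency pipelines above do without running an analyser, they can be tried on a small hand-written stand-in for analyser output. This is only a sketch under stated assumptions: the sample lines use placeholder words and tags (not real analyser output), the format is assumed to be word form, TAB, analysis, and the file names sample.out, known.freq and unknown.freq are made up for illustration. The sketch uses grep -F, which matches the question mark as a literal character, instead of the escaped patterns shown above.

```shell
# Hand-written stand-in for analyser output: word form, TAB, analysis.
# Placeholder data only; "xyz+?" imitates an unrecognised word.
printf 'aaa\taaa+N+Pl+Nom\nbbb\tbbb+V+Inf\naaa\taaa+N+Pl+Nom\nxyz\txyz+?\n' > sample.out

# Recognised material: drop lines containing "?", keep column 2
# (the analysis), then count identical lines, most frequent first.
grep -v -F '?' sample.out | cut -f2 | sort | uniq -c | sort -nr > known.freq

# Unrecognised material: keep only lines containing a literal "?",
# then count identical lines, most frequent first.
grep -F '?' sample.out | sort | uniq -c | sort -nr > unknown.freq

cat known.freq unknown.freq
```

In real use, the output would be piped into less as in the examples above rather than written to files.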
To analyse more files at the same time, write their names one after another after the cat command:
cat file1 file2 file3 | ...
Last modified 2016-09-28 11:04:38 +0200 (Wed, 28 Sep 2016), by sjur
by Trond Trosterud