How to use the Sámi morphological parsers
Setting up the environment
- If you work on victorio, then log in with your own user name and password. If you work on your own machine, make sure the Xerox tools are available and in your path.
- If you have been away from victorio for a long time, or if this is your first time, write "svn co gt" and press the return key (from now on indicated by "RETURN"). (This also works on your own machine, if you have SVN access to victorio.) By doing that, you check out whatever new catalogues or files have been added since last time. To update already existing files, "svn up" is enough. For more info on svn and the messages it may give you, see Introduction to svn. If you work on files checked out from anonymous svn, you only need the Xerox tools.
- Change to the directory gt ("cd gt RETURN"). To compile North Sámi, write "make TARGET=sme". The target is sma for South Sámi, smj for Lule, sms for Skolt and smn for Inari Sámi.
- For the next 3 to 30 minutes (depending upon how many parts of the parser it must rebuild, which language it is, and how quick your computer is), the machine will write cryptic messages on the screen, and finish with an optimistic "bye.". Most parts of the parser are compiled in a couple of minutes, but compiling the preprocessor is a really slow process. While waiting, open a new window and do something else (you may e.g. read this documentation).
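The setup steps above can be sketched as a small shell helper. This is only an illustration: the function name target_for and the language keywords it accepts are made up for this example, while the target codes are the ones listed above. The script checks whether lookup is in the path and prints the build command instead of running it.

```shell
# Map a language keyword to the TARGET code used by "make TARGET=...".
# target_for and its keywords are hypothetical; the codes come from the text above.
target_for() {
  case "$1" in
    north) echo sme ;;
    south) echo sma ;;
    lule)  echo smj ;;
    skolt) echo sms ;;
    inari) echo smn ;;
    *) echo "unknown language: $1" >&2; return 1 ;;
  esac
}

# Report whether the Xerox lookup tool is in the path (this only checks, it installs nothing).
if command -v lookup >/dev/null 2>&1; then
  echo "lookup found"
else
  echo "lookup not found in PATH"
fi

# Print the build command rather than running it.
echo "make TARGET=$(target_for north)"
```

To actually compile, run the printed command from the gt/ catalogue.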
Analysing and generating words
Letters: we have changed default encoding of all files to UTF-8, and all Sámi characters are thus represented as themselves. Just make sure you have set up your environment to use UTF-8 in all places. Documentation for that can be found elsewhere (under Installation and Setup).
Analysing one word at a time:
Note that the source files are in src/, the binary files are in bin/. The exact commands depend upon where you are. In order to compile new versions of the analysers, you must be in gt/, and then write make TARGET=sme (for North Sámi, and smj, sma, smn for Lule, South and Inari Sámi). We assume that you have a separate window for analysis, and that you are in the gt/ catalogue when you analyse.
- For North Sámi, write "lookup -flags mbTT sme/bin/sme.fst RETURN".
- For Lule Sámi, write "lookup -flags TT smj/bin/smj.fst RETURN".
- For South Sámi, write "lookup -flags TT sma/bin/sma.fst RETURN".
- For Inari Sámi, write "lookup -flags TT smn/bin/smn.fst RETURN".
- Then write the words to be analysed, one word at a time, each followed by RETURN.
- To leave lookup mode, press "ctrl C".
For testing, you may also write a file with one wordform on each line, and then feed that to lookup (example here is for Inari Sámi):
- cat testfile.txt | lookup -flags mbTT smn/bin/smn.fst | less
(again, to leave lookup mode, press "ctrl C".)
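A test file of the kind described above can be made directly from the command line. The sketch below is only an example: the three words and the temporary file are arbitrary choices, and lookup itself is not run, the command is just printed.

```shell
# Write one word form per line to a temporary test file.
# The words are arbitrary North Sámi examples; use whatever forms you want to test.
testfile=$(mktemp)
printf '%s\n' giella boahtit ođas > "$testfile"

# Show the file contents, then the command that would feed the file to lookup.
cat "$testfile"
echo "cat $testfile | lookup -flags mbTT sme/bin/sme.fst | less"
```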
Generating words
- Write exactly the same commands as you do when you analyse words, except that you change sme.fst to isme.fst, sma.fst to isma.fst, etc.
- Then write Sámi words in their dictionary forms, followed by grammatical information. The format is given in the table in the file The grammatical tags. Note that the South Sámi sma.fst handles capital letters and ï-i variation, but that the generator only accepts the correct "ï" when you write the base forms.
- Again, to leave lookup mode, press "ctrl C".
A good way of working is to have two windows open, one for analysing and one for generating (and probably also additional windows, for documentation, for the source files, etc.).
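The naming rule for the generators (sme.fst becomes isme.fst, and so on) can be written as a small shell function. This is a sketch only; the function name generator_for is invented for this example, and it does nothing but put an "i" in front of the file name.

```shell
# Derive the generator path from an analyser path by prefixing the file name with "i".
# generator_for is a hypothetical helper, not part of the toolset.
generator_for() {
  dir=$(dirname "$1")
  base=$(basename "$1")
  echo "$dir/i$base"
}

# Print the lookup command for generation instead of running it.
echo "lookup -flags mbTT $(generator_for sme/bin/sme.fst)"
```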
Analysing more than one word at a time
Write the following command (the string 'sentence here' should be replaced with the actual sentence, and the part following the command lookup varies according to language, of course). This time the command assumes that you are in the sme/ (sma/ etc.) catalogue.
echo "sentence here" | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst
We also have some shortcuts for analysis. These may be written anywhere. Contact us if you want more aliases, or make them yourself, on the basis of the existing ones.
- cealkka: Gives a sentence analysis of North Sámi
- sme-dis.sh, smj-dis.sh: Gives a sentence analysis of North (Lule) Sámi, with rule numbers
- sme-multi.sh, smj-multi.sh: Gives a non-disambiguated morphological analysis of North (Lule) Sámi
- sme-multisyn.sh: Gives a non-disambiguated morphological analysis of North Sámi, and adds possible syntactic tags
Generating one paradigm at a time
Each language catalogue contains a catalogue called testing/. Go there, and write one of the following commands (exchange the example words for whatever you want):
make n-paradigm WORD=giella
make v-paradigm WORD=boahtit
make a-paradigm WORD=ođas
Analysing files
For each of the languages, write the following line:
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/smj.fst | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/sma.fst | less
cat filename | preprocess | lookup -flags TT bin/smn.fst | less
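The four commands differ only in the lookup flags, in the name of the transducer, and in whether preprocess gets an abbreviation file. The sketch below collects those differences in one place. The function name analysis_cmd is made up for this example, and it only prints the middle part of the pipeline; it does not run preprocess or lookup.

```shell
# Print the preprocess-and-lookup part of the pipeline for a language code.
# analysis_cmd is a hypothetical helper; flags and paths follow the commands above.
analysis_cmd() {
  lang=$1
  case "$lang" in
    sme)     flags=mbTT; abbr=" --abbr=bin/abbr.txt" ;;
    smj|sma) flags=TT;   abbr=" --abbr=bin/abbr.txt" ;;
    smn)     flags=TT;   abbr="" ;;
    *) echo "unsupported language code: $lang" >&2; return 1 ;;
  esac
  echo "preprocess$abbr | lookup -flags $flags bin/$lang.fst"
}

# Show the full command line for each language.
for code in sme smj sma smn; do
  echo "cat filename | $(analysis_cmd "$code") | less"
done
```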
You probably want disambiguation as well (there is no disambiguation for Inari Sámi):
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | lookup2cg | vislcg3 -g src/sme-3dis.rle | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/smj.fst | lookup2cg | vislcg3 -g src/smj-dis.rle | less
cat filename | preprocess --abbr=bin/abbr.txt | lookup -flags TT bin/sma.fst | lookup2cg | vislcg3 -g src/sma-dis.rle | less
To use content from our corpus repository as input to the analyser, one should use the tool ccat (type ccat -h to get usage details):
ccat -l sme -r zcorp/bound | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT bin/sme.fst | less
Instead of just showing the result on the screen as running text (as above), much can be done to manipulate it. Here are some examples; each of these text strings should replace the word less in the command above.
- grep '+N+Pl' > plnouns
  (to get all plural nouns and save them to the file plnouns)
- grep -v '\?' | cut -f2 | sort | uniq -c | sort -nr | less
  (to get a frequency list of the lexemes that the parser recognizes; note that this requires that the flag TT is turned off, i.e. not mentioned)
- grep '\?' | sort | uniq -c | sort -nr | less
  (to get a frequency list of the words that the parser does not recognize)
- grep '\+\?' | sort | uniq -c | sort -nr | less
  (to get a frequency list of the word forms that the parser does not recognize)
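To see what the frequency pipelines do without running the analyser, they can be tried on a few invented lines. The sketch below assumes that lookup prints tab-separated lines, with the word form and its analysis for recognized words, and the word form twice followed by +? for unrecognized ones. The sample lines, their tags and the function names are made up for this example. It also uses grep -F with the fixed string '+?' instead of the patterns above, so that no character is read as a regular expression operator.

```shell
# Invented sample in the tab-separated shape assumed here:
# "wordform<TAB>analysis" if recognized, "wordform<TAB>wordform<TAB>+?" if not.
sample_lookup_output() {
  printf 'giella\tgiella+N+Sg+Nom\n'
  printf 'giella\tgiella+N+Sg+Nom\n'
  printf 'boahtit\tboahtit+V+Inf\n'
  printf 'xyzzy\txyzzy\t+?\n'
}

# Frequency list of the analyses (second field) of recognized words.
freq_recognized() {
  grep -v -F '+?' | cut -f2 | sort | uniq -c | sort -nr
}

# Frequency list of the word forms (first field) that were not recognized.
freq_unrecognized() {
  grep -F '+?' | cut -f1 | sort | uniq -c | sort -nr
}

sample_lookup_output | freq_recognized
sample_lookup_output | freq_unrecognized
```

With real data, replace sample_lookup_output with the cat, preprocess and lookup part of the command above.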
To analyse more files at the same time, write their names one after another after the cat command:
cat file1 file2 file3 | preprocess ...
Last modified $Date: 2011-08-05 01:09:26 +0200 (fre, 05 aug 2011) $, by $Author: lene $
by Trond Trosterud

