
Language recognition using text_cat

To identify sections of a document that are not in the main language, we need automatic language recognition. We have installed an open-source package that performs this task, and this page documents its origin and usage.

Source

The home page of the TextCat package is hosted at the University of Groningen, and the source code is also available there. The package is developed by Gertjan van Noord and is licensed under a GPL license (see the home page for details). The home page also links to a background article, a list of the languages supported by the distributed tools, and a list of competing tools. There is also a demo page, which gives the author's e-mail address.

We have a local copy of the original source found at /opt/sami/tools/src/text_cat.tgz, in case the original becomes unavailable.

Usage

The tool text_cat itself is installed in gt/scripts/, and basic usage is explained by:

text_cat -h

Typical usage will be something like:

text_cat -l "What language is this"

Or:

text_cat <input-file>

In both cases text_cat prints one or more language names: its best guess at the language(s) the text is written in.
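When several files need checking, the call can be wrapped in a small loop. The following is only a sketch: the function name guess_files is made up for this page, text_cat is assumed to be on the PATH, and the input is fed on standard input in the same way as in the training commands further down. Whatever text_cat prints is passed through unchanged.

```shell
#!/bin/sh
# Sketch only: guess_files is a hypothetical helper, not part of text_cat.
# Assumes text_cat is on the PATH and reads its input from standard input.

# guess_files FILE...
# Prints "FILE: GUESS" for every readable file given as an argument.
guess_files() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            # Report unreadable files on stderr and move on to the next one.
            echo "$f: cannot read file" >&2
            continue
        fi
        # text_cat's own output is used as-is, with no parsing.
        guess=$(text_cat < "$f")
        printf '%s: %s\n' "$f" "$guess"
    done
}
```

It would be called as, for example, guess_files doc1.txt doc2.txt, where the file names are placeholders.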

Adding a new recognizable language

The text_cat reference files are stored in $GTHOME/tools/lang-guesser.

Adding a new language to be recognized requires a suitable training corpus to be built. This is most easily done with the accompanying tool random_lines:

random_lines < some-text-file > ShortTexts/language-name.txt

This command extracts random lines of text from the input file and stores them in the output file, cleaning up the text a bit along the way. The resulting file is then used to build a language model like this:

text_cat -n < ShortTexts/language-name.txt > LM/language-name.lm

After this, text_cat can recognize the new language, and it is used as shown in the Usage section.
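The two training steps can be combined into one helper. This is a minimal sketch, not an existing script: the function name add_language is hypothetical, and it assumes that it is run from the directory holding the ShortTexts and LM directories (presumably $GTHOME/tools/lang-guesser) and that both random_lines and text_cat are on the PATH.

```shell
#!/bin/sh
# Sketch only: add_language is a hypothetical helper, not shipped with text_cat.
# Assumes the current directory contains (or may contain) ShortTexts/ and LM/.

# add_language LANGUAGE-NAME SOME-TEXT-FILE
add_language() {
    name=$1
    corpus=$2
    if [ -z "$name" ] || [ ! -r "$corpus" ]; then
        echo "usage: add_language language-name some-text-file" >&2
        return 1
    fi
    # Create the directories if they are missing; harmless if they exist.
    mkdir -p ShortTexts LM || return 1
    # Step 1: build the training corpus from random lines of the input.
    random_lines < "$corpus" > "ShortTexts/$name.txt" || return 1
    # Step 2: build the language model from the training corpus.
    text_cat -n < "ShortTexts/$name.txt" > "LM/$name.lm"
}
```

It would be called as, for example, add_language language-name some-text-file, matching the placeholder names used in the commands above.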

by Sjur N. Moshagen