Language recognition using text_cat
Source
The home page of the TextCat package is hosted at the University of Groningen, where the source code is also available. The package is licensed under the GPL (see the home page for details) and is developed by Gertjan van Noord. The home page also links to a background article, a list of the languages supported out of the box, and a list of competing tools. There is also a demo page, which gives the author's e-mail address.
A local copy of the original source is kept at /opt/sami/tools/src/text_cat.tgz, in case the original becomes unavailable.
Usage
The tool text_cat itself is installed in gt/scripts/, and basic usage is explained by:
text_cat -h
Typical usage will be something like:
text_cat -l "What language is this"
Or:
text_cat <input-file>
In both cases text_cat will return one or more strings with the name of the language(s) the script believes the text to be in.
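To check several files in one go, the file form of the command can be wrapped in a small loop. The following is only a sketch: the function name guess_languages is made up for illustration, and it assumes that text_cat is on the PATH and reads the text to classify from standard input.

```shell
# Hypothetical helper, not part of text_cat: print each file name
# followed by whatever text_cat reports for that file.
guess_languages() {
    for file in "$@"; do
        # Assumption: text_cat reads the text to classify from standard input.
        printf '%s\t%s\n' "$file" "$(text_cat < "$file")"
    done
}
```

Called as guess_languages file1.txt file2.txt, this prints one tab-separated line per file, which makes it easy to spot files for which text_cat returns more than one candidate language.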
Adding a new recognizable language
The text_cat reference files are stored in $GTHOME/tools/lang-guesser.
Adding a new language to be recognized requires a suitable training corpus to be built. This is most easily done with the accompanying tool random_lines:
>$ random_lines < some-text-file > ShortTexts/language-name.txt
This command extracts random lines of text from the input file, cleans them up a bit, and stores them in the output file. The resulting file is then used to build a language model like this:
>$ text_cat -n < ShortTexts/language-name.txt > LM/language-name.lm
After this, text_cat can recognize the new language and is used as shown in the Usage section above.
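The two steps above can be combined into one small shell function. This is a minimal sketch, not an existing script: the name add_language and its arguments are hypothetical, and it assumes that random_lines and text_cat are on the PATH and that it is run from the directory holding ShortTexts/ and LM/ (presumably $GTHOME/tools/lang-guesser).

```shell
# Hypothetical helper, not part of the tools: build a language model
# from a text file, following the two steps described above.
# Assumption: run from the directory that holds ShortTexts/ and LM/,
# with random_lines and text_cat available on the PATH.
add_language() {
    name=$1      # language name, used for the output file names
    corpus=$2    # text file written in the new language
    # Create the directories if they are missing (harmless otherwise).
    mkdir -p ShortTexts LM || return 1
    # Step 1: extract and clean random lines to use as training text.
    random_lines < "$corpus" > "ShortTexts/$name.txt" || return 1
    # Step 2: build the language model from the training text.
    text_cat -n < "ShortTexts/$name.txt" > "LM/$name.lm"
}
```

A call such as add_language language-name some-text-file then leaves the training text in ShortTexts/language-name.txt and the model in LM/language-name.lm. The function stops early if the first step fails, so a broken corpus does not produce an empty model.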
by Sjur N. Moshagen

