The cgi-bin environment

CGI-setup

There are webdemos available for analyzing and generating wordforms and paradigms for different languages. The tools can be called from pages on http://giellatekno.uit.no/cgi.

Pages are available for calling the tools in different languages. For the time being, the supported user interface languages are English, North Sámi and Norwegian. The tools are available to a varying extent for North, Lule and South Sámi, Greenlandic and Faroese.

The cgi-bin scripts include scripts for analysing and generating words and sentences. The setup is divided between two systems: cgi-bin and the Forrest documentation. For example, adding a new language to the system requires changes in both cgi-bin and the Forrest documentation.

CGI-scripts

The relevant cgi-bin scripts are:

  • smi.cgi
    Language-independent cgi-script for calling different language technology applications: analysis, disambiguation, hyphenation and paradigm generation.
  • conf.pl
    Configuration file that contains most of the variable definitions and their initial values.
  • sme-num.cgi, smj-num.cgi, sma-num.cgi
    Generates North, Lule and South Sámi numerals, respectively.

All the scripts are developed in the svn-directory, under the module gt/scripts/cgi-scripts. The official location of the cgi-scripts is on victorio, in the cgi-bin directory. The latest versions are moved to the official directory by the script cgi-export, which exports the latest release-tagged version of the cgi-scripts to that directory. The user must have root privileges to run the script.

The cgi-bin script smi.cgi is located on victorio, in the cgi-bin directory. The script processes the input sentences that the user enters in the HTML form and sends them on to the transducers, for example sme.fst for analysis and isme.fst for generation of North Sámi. The language, the charset and the action taken are configured by the parameters given to the script.

The transducers are located at /opt/smi/lang/bin/ and /opt/smi/common/bin/

The minimum requirement for an analyzer to work is the file lang.fst.
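This requirement can be checked from the shell. The following is a minimal sketch, not part of the existing scripts: the function name has_analyzer and its optional root argument are hypothetical, and the layout root/lang/bin/lang.fst is inferred from the transducer paths given above.

```shell
# Hypothetical helper: succeed if the analyzer transducer for a language is installed.
# Assumes the layout <root>/<lang>/bin/<lang>.fst, with root defaulting to /opt/smi.
has_analyzer() {
    lang=$1
    root=${2:-/opt/smi}
    [ -f "$root/$lang/bin/$lang.fst" ]
}
```

For example, has_analyzer sme && echo "sme analyzer found" would report whether the North Sámi analyzer is in place.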

Forrest documentation and cgi-bin

The cgi-interface is integrated with the Forrest documentation, and the pages are generated when needed. The pages are named after the language technology application, the language of the application and the user interface language. The prefixes d-, g- and p- stand for disambiguation (including analysis and hyphenation), generation and paradigm generation, respectively. For example, the file d-sme.sme.html calls the analysis tools for North Sámi with a North Sámi user interface. The file p-sme.eng.html calls paradigm generation for North Sámi, the language of the user interface being English. Finally, the file g-sme.nor.html calls the generator for North Sámi with Norwegian as the user interface language.
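The naming convention can be expressed as a small shell function. This is only an illustrative sketch: the function name cgi_page_name is hypothetical, and the pages themselves are generated by Forrest, not by this code.

```shell
# Hypothetical helper: build a page name from the naming convention.
# $1 = application prefix (d, g or p)
# $2 = language of the application, $3 = user interface language
cgi_page_name() {
    case $1 in
        d|g|p) ;;
        *) echo "unknown prefix: $1" >&2; return 1 ;;
    esac
    printf '%s-%s.%s.html\n' "$1" "$2" "$3"
}
```

Calling cgi_page_name p sme eng yields the page name for North Sámi paradigm generation with an English interface.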

The texts for the different user interface languages are stored in XML files in the documentation, named xtdoc/gtuit/src/documentation/content/xdocs/cgi/cgi-lang.xml. An XSL script creates the HTML pages for the different language technology applications and transfers the texts from the XML files to the pages. The script in question is xtdoc/gtuit/src/documentation/resources/stylesheets/cgi-index.xsl. The Forrest documentation places no limit on the languages that can be used for the documentation.

After the cgi-script is called, smi.cgi regenerates the user interface to the cgi together with the results of the query. The same XML file that contains the texts for the user interface is used for generating the new HTML page.

An example: how to add a new language to the documentation

To add a new language, changes have to be made both on the server side and in the Forrest documentation.

Changes to cgi-bin

Compile the relevant transducers and abbr.txt, and copy them to the transducer directory:

mkdir -p /opt/smi/lang/bin
cp lang.fst abbr.txt /opt/smi/lang/bin
chgrp cvs /opt/smi/lang/bin/*
chmod 775 /opt/smi/lang/bin/*

If you want these files to be updated automatically each day, then add the language code to the script fst2opt.

Changes to forrest documentation

First create a page for the language, e.g. lang.xml, and store it in xtdoc/gtuit/src/documentation/content/xdocs/. Then add links to the analysis pages:

<a href="cgi/d-ipk.eng.html">Analyzing</a>
<a href="cgi/g-ipk.eng.html">Generating</a>

The pages are generated automatically by Forrest. However, the first versions are quite sparse, since no language-specific texts are automatically available. To add text to the page, edit the different interface texts, e.g. for English the file:

xtdoc/gtuit/src/documentation/content/xdocs/cgi/cgi-eng.xml

Updating the transducers

The transducers and other relevant files are updated daily using the cron facility. The script gt/script/fst2opt is responsible for retrieving the latest version from cvs, compiling the binaries and copying them to the relevant directories. The crontab is set up by the person who is responsible for the cgi-bin setup.

The cgi-bin scripts

The cgi-bin files are written in Perl and use the Perl module CGI.pm. The file smi.cgi is used for analysing and disambiguating as well as generating for different Sámi languages. It can be used for other languages without any additional configuration, provided that the tools used by the script (lang.fst etc.) are in place.

The script contains a conversion from the digraphs c1, s1, etc. to the corresponding utf-8 characters č, š, etc., for those who don't have Sámi characters on the keyboard. Latin-1 can be chosen as the input encoding as well, but the page that is generated is utf-8 encoded. This option is currently being tested, and it may turn out not to be useful.
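The idea behind the digraph conversion can be sketched with sed. This is an illustration only: the actual conversion is done inside the Perl script, the function name convert_digraphs is hypothetical, and only the two pairs named above are included because the remaining pairs are not listed here.

```shell
# Sketch of the digraph conversion; reads text on stdin, writes it to stdout.
# Only the two pairs named in the text (c1 -> č, s1 -> š) are covered.
convert_digraphs() {
    sed -e 's/c1/č/g' -e 's/s1/š/g'
}
```

For example, echo "c1a s1a" | convert_digraphs replaces both digraphs in the input line.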

The files themselves are very well documented (thanks to Ken Beesley, their original author). For information on how to maintain them, see the available books on Perl and cgi-bin.

To release a new version of the scripts:

  1. Edit the files in cvs-module gt/script/cgi-scripts
  2. Commit the changes to cvs
  3. If the version is ready for release, tag it with the release tag using the command cvs tag -F release filename1 filename2 ...
  4. Execute the script cgi-export. The tagged files are exported from cvs to the official directory.
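Steps 2 to 4 can be summarised as a dry-run helper that only prints the commands instead of executing them. This is a sketch under assumptions: the function name print_release_steps and the commit message are hypothetical, and the printed commands are meant to be run from the checked-out cgi-scripts module.

```shell
# Dry-run sketch: print the commands for steps 2-4 without executing them.
# Arguments are the names of the files to release.
print_release_steps() {
    echo "cvs commit -m 'Update cgi-scripts' $*"
    echo "cvs tag -F release $*"
    # cgi-export must be run with root privileges
    echo "cgi-export"
}
```

For example, print_release_steps smi.cgi conf.pl lists the three commands for releasing those two files.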

Linking due to security

For security reasons, the webserver on victorio.uit.no is run via symbolic links. The files themselves are not where they seem to be, but in a more secure environment.

The url to the cgi-bin scripts is: http://sami-cgi-bin.uit.no/cgi-bin/smi/

Last modified $Date: 2011-08-05 01:09:26 +0200 (fre, 05 aug 2011) $, by $Author: lene $

by Trond Trosterud, Saara Huhmarniemi