Extracting text from XML elements in the Sámi XML corpus

Introduction

This document describes how the XML corpus is converted to plain text for testing the language tools.

ccat

ccat replaces the catxml script. By default, ccat prints paragraphs of text type, which in our XML files means paragraphs with an empty type. It can also print titles, lists and tables, and it can traverse directories recursively. Usage:

    Usage: ccat <options> [FileName]
    Print the contents of a corpus file in XML format.
    The default is to print paragraphs with no type (=text type).
    The possible options include:
        -a              Print all text elements.
        -p              Print plain paragraphs. (default)
        -T              Print paragraphs with title type.
        -L              Print paragraphs with list type.
        -t              Print paragraphs with table type.
        -r <dir>        Recursively process directory dir and subdirs encountered.
        -h              Print this help message.
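
The selection that these options describe can be illustrated with a short Python sketch. This is not how ccat itself is implemented. It assumes that paragraphs are `<p>` elements, that a missing or empty `type` attribute marks plain text, and that `title` and `list` are valid attribute values; the sample document is made up, and the real corpus files may use different names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus markup may differ.
SAMPLE = """<document>
  <body>
    <p type="title">Heading</p>
    <p>First plain paragraph.</p>
    <p type="list">A list item.</p>
    <p type="">Second plain paragraph.</p>
  </body>
</document>"""


def paragraphs(xml_text, wanted=("",)):
    """Return the text of <p> elements whose type is in `wanted`.

    A missing or empty type attribute counts as "" (plain text).
    """
    root = ET.fromstring(xml_text)
    result = []
    for p in root.iter("p"):
        ptype = p.get("type", "")
        if ptype in wanted:
            # itertext() also collects text inside nested inline elements.
            result.append("".join(p.itertext()).strip())
    return result


print(paragraphs(SAMPLE))                     # plain paragraphs, like the default
print(paragraphs(SAMPLE, wanted=("title",)))  # titles only, roughly like -T
```

Passing several values in `wanted` selects several paragraph types at once, which is roughly what -a does for all text elements.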

A typical use is to print the contents of the text paragraphs in all files under /home/apache_corpus/boundcorpus/converted/sme:

$ ccat -r /home/apache_corpus/boundcorpus/converted/sme
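
The recursive traversal that -r performs can be approximated in Python as follows. This is only a sketch: the helper name `xml_files` is hypothetical, and it assumes that corpus files carry the `.xml` extension, which ccat itself may not require.

```python
import os


def xml_files(top):
    """Yield paths of all .xml files under `top`, including subdirectories."""
    for dirpath, dirnames, filenames in os.walk(top):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            if name.endswith(".xml"):
                yield os.path.join(dirpath, name)
```

Iterating over `xml_files("/home/apache_corpus/boundcorpus/converted/sme")` would visit each converted file once, so that each can be handed to a paragraph extractor in turn.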

To use ccat on your own computer, do the following:

  1. If you haven't done so already, check out our repository.
  2. Open a terminal and run the following commands:

    cd $GTHOME/gt/script/ccat
    make
    cp ccat ~/bin/

This assumes that you have a bin directory under your home directory and that it is on your PATH. The program is under development, so you will be asked to update your copy of ccat later.
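
One way to check the PATH assumption after installing is a small Python lookup. This is a convenience sketch, not part of ccat; the helper name `find_on_path` is hypothetical.

```python
import shutil


def find_on_path(program):
    """Return the full path of `program` if it is on PATH, else None."""
    return shutil.which(program)


location = find_on_path("ccat")
print(location or "ccat not found on PATH")
```

If the lookup reports that ccat was not found, either ~/bin is missing from PATH or the copy step did not succeed.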

by Tomi Pieski, Saara Huhmarniemi