<?xml version="1.0" encoding="ISO-8859-1"?><!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.3//EN" "document-v13.dtd">
<document xmlns:xi="http://www.w3.org/2001/XInclude" xml:lang="en">
    <header>
        <title>Corpus infrastructure</title>
        <authors>
            <person email="tomi.pieski@hum.uit.no" name="Tomi Pieski"/>
        </authors>
    </header>
    <body>
        <section>
            <title>Introduction</title>
        </section>
        <section>
            <title>Gathering</title>
            <p>This section discusses which corpus format we should use and how to convert the incoming texts into that format.</p>
            <p>First, we have to decide on the XML format: what information we want to store and how to represent it. The simplest way to
                define the XML format is with a DTD. Other options include W3C XML Schema and RELAX NG. We use a DTD here and plan to
                transform it into one of the schema languages in the future.</p>
            <p><link href="corpus_dtd.html">DTD discussion</link></p>
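            <p>As a rough illustration of what defining the format with a DTD involves, the sketch below declares a minimal corpus
                document with a header and a body of paragraphs. All element and attribute names in it are hypothetical placeholders
                chosen for the example; they are not the format decided on in the DTD discussion.</p>
            <source><![CDATA[
<!-- Hypothetical sketch only: every name below is a placeholder. -->

<!-- A corpus document has one header followed by one body. -->
<!ELEMENT document (header, body)>
<!-- The language of the text, for example "en". -->
<!ATTLIST document xml:lang CDATA #REQUIRED>

<!-- Metadata about the original text. -->
<!ELEMENT header (title, author*, origin?)>
<!ELEMENT title  (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!-- Where the raw file came from, for example a file name or URL. -->
<!ELEMENT origin (#PCDATA)>

<!-- The text itself, as a sequence of plain paragraphs. -->
<!ELEMENT body (p+)>
<!ELEMENT p    (#PCDATA)>
]]></source>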
            <p>The second task is to work out how to convert the incoming raw original files into our XML format. The formats to be
                considered in the conversion process are listed below.</p>
            <ul>
                <li>Word document (.doc)</li>
                <li>Adobe document (.pdf)</li>
                <li>Web documents (.html)</li>
                <li>Plain text (.txt)</li>
                <li>Other formats? (...)</li>
            </ul>
            <p><link href="corpus_conversion.html">Conversion</link></p>
        </section>
        <section>
            <title>Storing</title>
            <p>Filesystem</p>
            <p>OpenOffice format</p>
        </section>
        <section>
            <title>Using</title>
            <p class="last_modified">Last modified: $Date: 2009-05-11 11:06:46 +0200 (man, 11 mai 2009) $, by $Author: boerre $</p>
        </section>
    </body>
</document>
