UiT > Divvun
 
Font size:      

Meeting_2005-09-19

Meeting setup

  • Date: 19.09.2005
  • Time: 10.00 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. Speller infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 10: 03.

Present: Børre, Maaren, Saara, Sjur, Thomas, Tomi

Absent: Trond

Main secretary: Sjur

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • Finish crontab specification for the cvs update/export script Tomi made
    • still working
  • reopen the jspwiki + UTF-8 issue
    • Done
  • Add issue to forrest issue tracker about utf-8 ihtml documents.
    • Done
  • Contact Svenska bibelsällskapet
    • Done, but haven't managed to get contact with Olavi Korhonen who has the latest translation
  • discuss with Anders Kintel about possible cooperation
    • Not done
  • Follow up on CVS mailing:
    • set up Maaren
      • Will do it this week, as I am in Kautokeino
  • Meet up with Trond about directory structure
    • Not done
  • Contact oahpahusossodat and the rest of the SD about texts
    • Will do it this week
  • Fixing the machine for the new coworker
    • Done
  • Document the corpus infrastructure
    • Not done
  • Read through the Helsinki contracts (new translations)
    • Not done
  • Reorganise the directory structure
    • Not done
  • Continue converting text from input format to our xml
    • Not done

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology.
    • shall do it next week
  • shall get mainly through the missing list from risten.no this week
    • working hard this week also
  • Start working on grammatical issues with Thomas and Trond
    • not done
  • Work on the name project with Trond and Maaren
    • not done
  • Start looking at normativity issues
    • not done it yet
  • Work on the numerals project with Trond
    • not done it yet

Saara

  • Look at the corpus infrastructure issue
    • Working on this
  • Look at the corpus interface issue with Lars
    • Not done
  • Convert texts from .doc to .xml, to get a grasp of our corpus format
    • Not done
  • Have a look at the pdf-to-xml issue (known problem: Keep the Sámi letters sound and safe)
    • Testing with some tools, e.g. pdftohtml, pdftotext. The Sámi characters pose no problem. Instead, it seems to be difficult to preserve the layout of the document, headings, paragraphs, etc.

Sjur

  • risten.no bugs and fixes
    • nothing done this week
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working across the Sámediggi firewall
      • Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
  • To the board:
    • write draft specification for the outsourced tasks
      • looking for software to integrate the specifiations with our regular development and testing, or at least support development and testing
    • write half-yearly project report with progress and bugdet status
      • continued, not finished
    • Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is October 4th, deadline for the documents is September 13th)
  • project planning with Trond
    • not done, but have been looking for software, see below
  • Work on the name project with Trond and Maaren
    • nothing done
  • Prepare for a Lule Sámi meeting with Árran
    • nothing done
  • Follow up on place names from Norge Digitalt
    • waiting for further response from Øystein Johannessen
  • Read through the Helsinki contracts (new translations)
    • browsed through them
    • added them to the menus, and rewrote the legal section text
  • Talk to Bitte about the Lule Sámi lexicon
    • Bitte has been travelling
  • Evaluate SFST as speller (and analyzer) lexicon
    • After project board meeting
  • Other tasks:
    • looked into project management software - there's a new page for it
    • also looked for software to help with writing and following requirements specification

Thomas

  • work on Lule Sami compounding and derivation
    • worked hard and still working
  • Look at Linguistic bugs with Trond.
    • solved some, some others are left
  • Prepare for a Lule Sámi meeting with Árran

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
      • Not done
  • three-part compounding
    • Not done
  • corpus infrastructure: dtd location (both public and internal)
    • Not done
  • corpus infrastructure: file and dir organisation
    • Not done
  • Document aspell and corpus infrastructure
    • Done, partly
  • Add html-to-xml conversion to corpus infra
    • Not done

Trond

  • ( Trond will be absent at next week's meeting, or perhaps accessible from Kastrup Airport)
  • Work on the bug list.
  • Work on compounds (three-part, with Tomi)
  • Work on the corpus interface (with Lars and Saara)
  • Work on the name agreement with "Norge digitalt" with Thomas
  • Look at the linguistic aspects of the speller clitics, with
  • Get the new version of the New Testament
  • Introduce the new coworker to the work routines
  • project planning with Sjur
  • Work on the name project with Maaren and Sjur
  • Prepare for a Lule Sámi meeting with Árran
  • Work on the numerals project with Maaren

3. Documentation

Documentation tasks:

  1. Add documentation on our corpus infrastructure and our corpus work in general ("To be done by the ones making the corpora": Børre, Tomi, Trond, Saara).
  2. add/update Aspell documentation (Tomi)
  3. finish divvun2web script (Børre)
  4. as always: document what you're doing: -) (all)
  5. divvun.no is turning white from time to time. Needs to be checked (Børre)
    1. This was probably too little memory allocated to java (the default is 64mb). Børre has now changed forrest.properties so that 256mb is allocated, and restarted the server.

4. Corpus gathering

See notes from the 12.9. meeting for details about the steps forward.

Tasks:

  • read through Trond's translations (Børre, Sjur)
  • e-mail Kimmo Koskenniemi about the missing fourth contract, and about unclear paragraphs in the versions received (Sjur).
  • send the license text to lawyers
  • add a background document explaining the model

5. Corpus infrastructure

Naming conventions and directory structure

See notes from the 12.9. meeting for details about the decision and implementation, as well as a list of tasks.

Corpus conversion

Pdf to XML

Extraction priority list

  1. retain correct Sámi characters
  2. retain word and sentence order
  3. retain paragraph order
  4. retain structure
    1. paragraphs
    2. titles, headers
    3. metadata (author, year, etc.)
    4. lists
    5. tables

Problems found so far using open-source tools:

  • text gets correctly out, also regarding encoding
  • paragraphs are correctly ordered, but not separated (i.e. one long paragraph)
  • no structure

HTML to XML

  • we already have some tools according to Saara
  • this is anyway easy, as HTML provides us with the structure we need
  • what is needed is a transformation to our XML, + adding the metadata as usual
  • it can wait at least a week or two (after pdf conversion is mostly done)

6. Linguistics

Test

Name lexicon

See notes from the 12.9. meeting

North Sámi

  • three-part compounds issue still open
    • Trond, Maaren, Sjur will look into this in Guovdageaidnu
  • Johnny Andersen has written a letter to us on the treatment of Sámi place names, we need a contract with "Norge digitalt", via UFD. We will have to get such an agreement. This is a task for Thomas and Trond.
    • Sjur has written an e-mail to the UFD contact person, Øystein Johannessen, who will look into it soon.
  • normativity issues:
    • the Giellalávdegoddi meeting is in October sometime

Lule Sámi

  • we need a lexicon
  • compounding and derivation
    • Thomas has finished with deverbals, now working with denominals
    • most likely the same three-part compound problem in Lule Sámi as well (the middle stem shortens to a stem only found in such compounds)
    • it is possible that even the first stem shortens the same way
  • Suffix boundary symbol has not been added, we are not sure whether we should do that.

Numerals

  1. An empirical overview
    1. Numeral generation
    2. Numeral inflection
    3. Numerals as parts of compounds
  2. A clear concept of how we want to treat them
    1. Tagging
  3. A treatment

7. Speller infrastructure

Aspell

Write documentation here as well.

The munch-list is working, and the affix file is improving. See 15.8. meeting memo for more.

See 12.9. meeting memo for a list of open issues.

8. Other

Technical issues

  • The mac os / perl bug (at least Trond and Sjur has it):
    • utf8 "\xC4" does not map to Unicode at /Users/trond/gt/script/preprocess line 82. This msg did not show up in 10.3 (perl 5.8.1), but does so in 10.4 (perl 5.8.6). It is probably a perl - OS mismatch. (Trond, Thor Øivind, Tomi)
      • Another example:
      • : "\x{00c3}" does not map to utf8 at ../script/preprocess line 113, <> chunk 33.
  • Sjur has a non-solved Backspace + UTF-8 issue
    • This issue is now fixed! (I believe I had to specify a locale that was supported by the OS, such as no_NO.UTF-8, not the more "obscure" one I tried (nn_NO.UTF-8). Note that locale is now finally supported/available on MacOS X, since Tiger, cf man locale, at the end.
  • CVS mailing to Maaren:
    • it is working, but she receives two copies of every message. I don't know why. It is a minor annoyance, but needs to be looked into. Thomas has the same problem. Børre will look into it (get rid of the double copies).

Bug fixing

  • 13 open bugs (and 24 risten.no bugs) - it seems Sjur can need some help with bug fixing in risten.no: -) (note: some of them are feature requests more than real bugs)
  • some bugs are not real, but just not closed yet even though they are fixed. Please have a look at all the bugs you are assigned to. Do we need to set up an automatic reminder system? Something like sending an e-mail if the bug has been open and untouched for more than X weeks?
    • Børre to ask Thor Øyvind to configure Bugzilla to send e-mail reminders for open bugs not touched in a month.

Memo and meeting practice update

From now on, next week's memo frame with task lists etc. will be made available at the same day as the previous meeting's (finished) memo. This will make it easy to use the task list in that memo as a reminder, and facilitates updating it as you go - as soon as a task has been started, it can be commented and problems described ready to be included in the next meeting. This can also be done for the final status of the task, well ahead of the meeting.

9. Summary, task list

Børre

  • Finish crontab specification for the cvs update/export script Tomi made
  • Contact Svenska bibelsällskapet / Olavi Korhonen
  • discuss with Anders Kintel about possible cooperation
  • Follow up on CVS mailing:
    • Have a look at why Maaren and Thomas get two copies of every samicvs mail.
  • Contact oahpahusossodat and the rest of the SD about texts
  • Document the corpus infrastructure
  • Read through the Helsinki contracts (new translations)
  • Reorganise the directory structure
  • Continue converting text from input format to our xml

Maaren

  • The missing list, both the overall missing list from our xml corpus, and a file-for-file review, in order to get different terminology.
    • shall do it next week
  • shall get mainly through the missing list from risten.no this week
    • working with risten.no this week also
  • Start working on grammatical issues with Thomas and Trond
    • shall do it this week or next week?
  • Work on the name project with Trond and Sjur
    • okei okei
  • Start looking at normativity issues
    • shall do it this week
  • Work on the numerals project with Trond
    • shall contact Trond

Saara

  • Look at the corpus infrastructure issue
  • Look at the corpus interface issue with Lars
  • Convert texts from .doc to .xml, to get a grasp of our corpus format
  • Have a look at the pdf-to-xml issue
    • use the priority list earlier in the memo for a guidance

Sjur

  • risten.no bugs and fixes
  • complete the action summary after our half-year evaluation
  • follow up on:
    • voice group-chat not working to Sámediggi
      • Now awaiting cost evaluation from the IT guys (Geir Kaaby et al)
  • To the board:
    • write draft specification for the outsourced tasks
    • write half-yearly project report with progress and bugdet status
    • Deadline for the board tasks: 3 weeks ahead of the meeting (the meeting is October 4th, deadline for the documents is September 13th)
  • project planning with Trond
  • Work on the name project with Trond and Maaren
  • Prepare for a Lule Sámi meeting with Árran
  • Follow up on place names from Norge Digitalt
  • Read through the Helsinki contracts (new translations)
  • Talk to Bitte about the Lule Sámi lexicon
  • Evaluate SFST as speller (and analyzer) lexicon
  • prepare for the Guovdageaidnu meeting:
    • name lexicon
    • three-part compounds
  • e-mail Kimmo Koskenniemi about contract issues

Thomas

  • work on Lule Sami compounding and derivation
  • Look at Linguistic bugs with Trond.
  • Prepare for a Lule Sámi meeting with Árran

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
  • three-part compounding
  • corpus infrastructure: dtd location (both public and internal)
  • corpus infrastructure: file and dir organisation
  • Document aspell and corpus infrastructure
  • Add html-to-xml conversion to corpus infra

Trond

  • ( Trond will be absent at next week's meeting, or perhaps accessible from Kastrup Airport)
  • Work on the bug list.
  • Work on compounds (three-part, with Tomi)
  • Work on the corpus interface (with Lars and Saara)
  • Work on the name agreement with "Norge digitalt" with Thomas
  • Look at the linguistic aspects of the speller clitics, with
  • Get the new version of the New Testament
  • Introduce the new coworker to the work routines
  • project planning with Sjur
  • Work on the name project with Maaren and Sjur
  • Prepare for a Lule Sámi meeting with Árran
  • Work on the numerals project with Maaren
  • Prepare for three-part compounds meeting in Guovdageaidnu

10. Next meeting, closing

26.09.2005 10: 00

Closed at 11: 10