Opened at 10:58.
Present: Børre, Ciprian, David, Sjur, Thomas, Tomi, Trond
Absent: Jovsset
Agenda accepted as is.
Hunspell
TODO:
- QA README and installation docs - report ( Trond )
- test the latest hunspell speller files ( Sjur )
- check the hunspell test bench and tokenization errors ( Børre, Sjur )
- make extension packages of the spellers ( Børre )
- test installations on all platforms ( Børre )
Testing
Spelling Error Markup
TODO:
- Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) ( Saara )
- test new and nested error markup ( Sjur )
Speller testing
We need a test file for testing the speller behaviour on a defined set of word constructions: compounds of all types, derivations, inflections, and combinations of those.
TODO:
- make speller test file for word types ( Thomas )
Speller updates
TODO:
- buy MS Office 2008 for all members of the Divvun team ( Børre )
- ordered and received
- installed for Thomas
- test the spellers ( Divvun team )
- Sjur tested very briefly; it seems to work as it should
- new make target for Office 2008 ( Sjur )
- waiting for hyphenation updates for 2008 ( Polderland )
Speller bugs
List of bugs returned from Polderland: 621, 630, 652, 656, 676.
Open issues based on test results:
sme
Version: Davvisámi, version 1.0.1, 2008-09-25
- 380 - REGRESSION - citation compounds
- 397 - REGRESSION - name compounds, their derivations, other compounds
- 426 - comp words from Divvun.no - guoktedássásaš accepted, some compounds not accepted - still OPEN
- 431 - REGRESSION - name compounds / lowering in front of hyphen
- 435 - Roman numerals - inflection of single letter numerals rejected, as well as some complex numerals (but is ok in smj ) - still OPEN
- we should pregenerate all numbers once and for all, and store them in a separate lexicon file
- 449 - REGRESSION - suopmasápmelaš comps wrongly accepted
- 489 - REGRESSION - stem vowel shortening / compound tags
- 581 - REGRESSION - consonant doubling
- 525 - REGRESSION - citation compounds not accepted (the bug relates to suggs)
- 536 - REGRESSION - speller accepts "impossible" compound forms
- 539 - REGRESSION - speller does not follow compound tagging
- 542 - REGRESSION - speller does not accept pronouns + clitic
- 575 - REGRESSION - name+name suggestions with double hyphens (legal compounds not accepted)
- 582 - REGRESSION - noun+Prop without hyphen
- 593 - REGRESSION - Missing words in beta 2
- 595 - prefix+name without hyphen ( ovdaLot instead of ovda-Lot ) - still OPEN
- 597 - REGRESSION - does not recognize nubbelohki
- 599 - REGRESSION - numeral attr:s on lot
- 600 - gen+hyph compound sámi-dáru - still OPEN
- 603 - suomabealdi accepted - still OPEN
- 606 - speller accepts VUOHTA compound - FIXED
- 609 - REGRESSION - Anár-julggaštusa not recognized
- 610 - REGRESSION - duhát words missing
- 611 - double hyphen sugg still accepted - still OPEN
- 612 - REGRESSION - case-forms of makkár
- 613 - short gen. as second compound part - still FIXED
- 619 - numerals and pronouns to NAMÁK and SASJ fails - still OPEN
- 627 - prefix + hyphen does not get accepted - FIXED
- 629 - a taking part in compounding without hyphen - still OPEN
- 631 - REGRESSION: number compound, numbers starting with 0
- 633 - double hyphens accepted
- 634 - PropGen+hyph+PropGen - still OPEN
- 641 - numeral+noun compounds - FIXED
- 642 - noun/adj/proper + hyphen + ain - FIXED
- 644 - cased numeral+numeral compound - still OPEN
- 646 - adverb + hyphen + noun - still OPEN
- 647 - numerals+NOUN - still OPEN
- 648 - unmotivated suggestions with numeral+noun - FIXED
- 649 - name + adj compound without hyphen - still OPEN
- 654 - speller does not recognize ordinals on -nuppelogát - FIXED
- 655 - pron + nai - still OPEN
- 658 - Suggestion saame - FIXED
- 666 - REGRESSION: guovtte- and njealje-
- 676 - triple-hyphen - now completely FIXED
- 696 - numerals + NAMAT-adjs + vuohta
- 697 - +N+Der1+Der/laš+A+Der3+Der/vuohta
- 709 - sámedikkeválga accepted, or with bad suggestion
- 717 - noun-acro compounds (juovla-CD)
- 728 - vowel shortening GenCmp+Left-tagged
- other regressions:
- eaktudáhtolaččat now accepted
- skuvlajagin now accepted - FIXED
- skierranis now accepted - FIXED
smj
Version: Julevsáme, version 1.0.1, 2008-09-15
- 435 - Roman numerals - single letter numerals now recognised
- we should pregenerate all numbers once and for all, and store them in a separate lexicon file
- please note that inflection of single letter numerals is fine in smj , as opposed to sme
- 482 - REGRESSION: double hyphen, polar>-dutkamin suggested
- 496 - REGRESSION: unrecognised clitics
- 595 - prefix+name without hyphen ( tsåhkeLot instead of tsåhke-Lot ) - FIXED
- 599 - numeral attr:s on lot - still OPEN
- 600 - gen+hyph compound sáme-dáro - still OPEN
- 616 - Bispadime-me-ráden - still OPEN, try to find an acronym or abbreviation me
- 619 - numerals and pronouns to NAMÁK and SASJ fails - still OPEN
- 627 - REGRESSION: prefix + hyphen does not get accepted
- 629 - a taking part in compound - still OPEN
- 631 - REGRESSION: number compound, numbers starting with 0
- 634 - Prop gen + hyphen + Prop gen - still OPEN
- 641 - numeral+noun compounds - still OPEN
- 644 - cased numeral+numeral compound - still OPEN
- 647 - numerals+NOUN - still OPEN
- 648 - unmotivated suggestions with numeral+noun - still OPEN
- 649 - name + adj compound without hyphen - still OPEN
- 650 - noun prefix+name compound without hyphen - still OPEN
- 652 - REGRESSION: UPPERCASE-typos only get acronym-suggestions
- 658 - Suggestion saame - FIXED
- 692 - numeral-variants
- 717 - NEW: noun-acro compounds
- 721 - NEW: Nom- and gen-numerals make compounds with nouns
- other regressions:
- gus NOT accepted anymore
- Svierigadárogielan NOT accepted anymore
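The proposal noted under bug 435 (pregenerate all numbers once and store them in a separate lexicon file) could be sketched roughly as below. This is only an illustration in Python, not the Divvun build code: the function names, the upper bound of 3999 and the one-entry-per-line output format are assumptions. It produces base forms only; inflected forms would still have to come from the real morphology.

```python
# Hypothetical sketch: pregenerate Roman numerals for a separate lexicon file.
# Names, the 3999 limit and the file format are assumptions, not Divvun code.

ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(number):
    """Convert an integer in 1..3999 to a Roman numeral string."""
    if not 1 <= number <= 3999:
        raise ValueError("number must be in 1..3999")
    parts = []
    for value, symbol in ROMAN_PAIRS:
        # Take as many copies of this symbol as fit, then continue with the rest.
        count, number = divmod(number, value)
        parts.append(symbol * count)
    return "".join(parts)


def generate_roman_lexicon(limit=3999):
    """Return the Roman numerals for 1..limit as a list of base forms."""
    return [to_roman(n) for n in range(1, limit + 1)]


def write_lexicon(path, entries):
    """Write one entry per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(entry + "\n")
```

Calling `write_lexicon("roman-numerals.txt", generate_roman_lexicon())` would produce the separate file in one go; the file name here is a placeholder.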
TODO:
- document how compounding is controlled in the PLX conversion ( Tomi )
Hyphenator bugs
Open issues based on test results:
sme
Lexicon version: Davvisámi, version 1.0.1, 2008-04-01
- 468 - REGRESSION: Márkomeanu
- 547 - REGRESSION: hyphen in front of vowel: Lotnolasealáhusas
- 548 - REGRESSION: mid syllable hyphenation: Háliidivččen
- 549 - REGRESSION: division without hyph: Váccedettiin
- 673 - adj-derivations: guovttenuppelotčoarvvagiin (the word is not rec.)
- 677 - NEW: Wrongly hyphenated ending -danidja - invalid
smj
Lexicon version: Julevsáme, version 1.0.1, 2008-04-01
- 545 - REGRESSION: bad hyphenation in compounds: åhpadusorganisásjåvnån (not recognised)
- 546 - REGRESSION: obligatory hyph rules seem to work in facultative manner: organisásjåvnån (not recognised)
- 547 - REGRESSION: hyphen in front of vowel: Jienastimnjuolgadusá and Orgánajs
TODO:
- fix PL hyphenator errors ( Tomi )
- test hyphenator with new speller lexicons ( Tomi )
InDesign tools
Releases
TODO:
- update the Changes document ( Sjur )
- InDesign documentation ( Sjur )
- Norwegian translation received from Davvi Girji
- prepare 1.1 release soon
Forthcoming Sámi allaskuvla conference
Presentation at North Sámi SGL, Rica Hotell Tromsø
Thomas and Tomi are going to present our tools there.
TODO:
- ask Leif-Åge to burn some Divvun CDs, and send them with Maaren ( Thomas )
Text to speech
We are planning a North Sámi text-to-speech system. Work on the text-to-transcription (ttt) component has begun, and for the transcription-to-sound component we cooperate with the University of Helsinki (Hki). For a forthcoming October demo we have two alternative tracks, plus the option of running both:
- Do the ttt with quasi-Finnish orthography as output, and plug it into the Finnish ttt (the Sámi-as-a-Finnish-priest solution) (pro: the web demo is there, so the demo is within reach; con: several phonemes are missing)
- Do the ttt with SAMPA as output, read in 200 sentences, and have Hki generate a voice based upon that (pro: probably better result, head start on the project; con: we risk that it will not be done by ...)
- Do both in parallel. (pro: safety first, not much double work; con: well, the double work in the phon-sme.xfst file)
The best thing would probably be to do both tracks, to be on the safe side.
echo "23847 ja de mun ipmirdán ja 12° ja §12 ja 23,2 ja 23,- ja 23-23" | jietna.sh
kuokː.hte.lo.gi kolə.bmɑ tuː.hɑːh kɑːvht.tsiː tʃuo.ðiː ɲeæl.lje lo.giː tʃie.dʃɑ
jɑ te mun ip.mir.dɑːn jɑ kuokː.hte nup.pe.lohː.kɑːj grɑː.dɑ jɑ pɑ.rɑː.grɑː.fɑ...
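The test input above mixes plain words with a cardinal, a degree expression, a section reference, a decimal, a number followed by ",-" and a number range. One possible first step in a ttt component is to classify such tokens before expanding them to words. The sketch below is an illustration only: the category names and regular expressions are assumptions, and nothing here is taken from jietna.sh or the xfst sources.

```python
import re

# Hypothetical token categories for number-like input; the labels and the
# patterns are assumptions made for illustration.
# More specific patterns are listed first; "cardinal" is the fallback.
TOKEN_PATTERNS = [
    ("range", re.compile(r"\d+-\d+")),      # e.g. 23-23
    ("price", re.compile(r"\d+,-")),        # e.g. 23,-
    ("decimal", re.compile(r"\d+,\d+")),    # e.g. 23,2
    ("degree", re.compile(r"\d+°")),        # e.g. 12°
    ("section", re.compile(r"§\d+")),       # e.g. §12
    ("cardinal", re.compile(r"\d+")),       # e.g. 23847
]


def classify_token(token):
    """Return the first category whose pattern matches the whole token."""
    for label, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "word"


def classify_text(text):
    """Split on whitespace and pair every token with its category."""
    return [(token, classify_token(token)) for token in text.split()]
```

Running `classify_text` on the sentence from the echo command above yields one (token, category) pair per whitespace-separated token, which a later step could map to the appropriate number reading.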
TODO:
- Technical issues on the ttt automata ( Trond, ... )
- Moving input to required xml format ( Helsinki )
- Read in the 200 sentences ( Biret Ánne )
- Make sounds out of them ( Helsinki )
Corpus contracts + open source
Postponed until the svn repository is fully functional (it is too open now).
The next meeting is 6.10.2008, 9.30 Norwegian time.
The meeting was closed at 12:38.