UiT > Divvun
 
Font size:      

meeting-2009-10-28

Formalities

Participants:

  • Martti
  • Antti
  • Biret Ánne
  • Trond
  • Sjur

Tentative agenda:

  • software / tools we need to build and test our tts
  • how much time you can spend on this, and the payment for the time
  • overall project schedule, and how it fits with your plans

Project overview

Financed by unused money from the first Divvun project. Financed by the Sámi Parliament. The work is done in cooperation with UiT and HU. Targeted release date is sometime next year.

Recording

Voices: Do both a male and female voice - at least the recordings. Using the same text will simplify some of the perparations / postrprocessing, but not that much

The text should be something between rich and balanced. 2 hours of recordings, rather more than less. 120 - 150 000 letters is a rough esitmate, at least for Finnish. This would approximate perhaps 2000 sentences?

Whole paragraphs with intersentence dependencies. Even though it isn't possible to model that yet, it will come in the future, and it is good to be preprared.

The text needs to as much as possible reflect the text types we are expecting in final use. Things to consider:

  • from single words and numbers to whole paragraphs
  • also single letters
  • dates
  • with statistical synthesis you need several/many examples of each type

The microphone should be a high quality, phase linear condenser microphone. Sampling frequency = basic CD quality is good enough. At least 16 bit resolution. As noise free as possible.

Software / tools needed

easier to generate

start to plan a larger project with more independence for the UiT group, including a new software engineer position.

Output products

Three areas come up as most relevant:

  • The society will be interested in web solutions of the kind known for nor, eng: Have the machine read the webpage aloud
  • Disabled people will need the usual stuff (screen readers etc on your local computer)
  • Pedagogical programs and dictionaries would like integrated tts.

There are other future usages that is hard to envisage now, the synthetic voice has to be as general as possible, to be able to cope with as many different text types and usage scenarios as possible.

Tools for testing ling input

Tools for testing prosody prediction

  • Data tagged for prosody
  • A sharp ear

Testing procedure:

Until now:

  1. Ling analysis
  2. Feed to B-Á
  3. Look at output
  4. Rewrite ling analysis
  5. Etc.

Soon:

  1. Ling analysis
  2. Feed to syntesizer
  3. Listen to output <==== (use the prototype for this? No)
  4. Rewrite ling analysis
  5. Etc.

Work capacity at the HU

synthesizer made in Hki : -)

Now: Antti in a (too?) key position. Knowledge transfer is thus high on the priority list. It is not about money as long as we do not extend to a new person.

Intermediate solution:

A person working both for Hki and for Tromsø. This person would be in Hki for training, but could later move to Tromsø.

Work plan

UiT - speech corpus:

  1. Collect text
  2. make recordings
  3. cut recordings into sentenes (like for the prototype)
  4. transcribe recordings

UiT - Building the textual input for training the synthesizer:

  • letter to sound (automatic)
  • annotation of:
    • phrase breaks (3-4 levels, manual work)
    • word prominences (3-4 levels, manual work)
ortography 
Mon lean okta sápmelaš, guhte lean bargan visot sámi bargguid  ->
 -> lts rules (automatic)
  mun leæn okː.htɑ sɑːp.me.lɑʃ ,  kuh.te leæn pɑrə.kɑn vi.soh sɑː.miː pɑrk.kujht
-> manual labeling of prominence and breaks (for training the model)
"mun leæn 'okː.htɑ "sɑːp.me.lɑʃ , || kuh.te leæn 'pɑrə.kɑn "vi.soh 'sɑː.miː 'pɑrk.kujht

UiT: enhance the lts system to automatically add full prosody labeling as input to the synthesis. Match as close as possible the trained data, but a perfect match is not required to get a good synthesis.

Needed: A theory of Sámi prosody

  • a. a perfect theory of Sámi prosody (from ToBi and onwards)
  • b. a theory of tts-relevant Sámi prosody

b. 3-level word prominence, 3-4 levels of phrase breaks

The Hki functional approach: Primitives of the tts f-approach: Prominence, the 0123 theory (annotated by numbers). The algorithm for predicting 0123 + breaks (≈ syntactic analysis (?!!)) for Finnish would be relevant

Prominence:

Antti has a proof of concept ruleset for Finnish

  • + of linguistic interest
  • - difficult to implement

Machine learning:

  • + easy to implement
  • - needs larger corpus than 1 speaker training material

Build a spec based upon what has been done for Finnish

  • Look at actual Finnish input
  • Get a copy of the Finnish orthography-to-syntinput machine

Todolist

  • work on the recordings:
    • building a suitable text corpus
    • prepare recordings (contact voice owners, studio, etc)
    • record, split, annotate
  • Antti wants ot be informed about the annotation and the other work.
  • Tromsø wants to get the proof of concept ruleset and some Finnish annotated text data (input text to synthesizer)
  • Speech sample forthcoming from Tromsø

Then work proceeds as agreed.