How does text-to-speech work?

  • To reproduce the natural sound of a language, a narrator records a series of texts (poetry, political news, sports results, stock exchange updates, etc.) that together contain every possible sound in that language.
  • These recordings are then sliced and organized into an acoustic database.
  • During database creation, all recorded speech is segmented into some or all of the following: diphones, syllables, morphemes, words, phrases, and sentences.
  • To speak a text, the TTS system first performs a linguistic analysis that converts the written text into a phonetic transcription.
  • A grammatical and syntactic analysis then determines how each word should be pronounced so that the sentence conveys its intended meaning. This is the prosody: the rhythm and intonation of the sentence.
  • Finally, the system produces information associating the phonetic transcription with the pitch and duration required for each sound.
    The chain of analysis ends here, and the sound is generated by selecting the best-matching units stored in the acoustic database.
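
The last step above, picking units from the acoustic database, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular TTS product: the names (Unit, to_diphones, target_cost, join_cost, select_units), the cost weights, and the "_" silence marker are all hypothetical. It uses one common formulation of unit selection, where each candidate unit is scored by how well it matches the requested pitch and duration (target cost) and by how smoothly it joins its neighbour (join cost), and a dynamic-programming search picks the cheapest sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One recorded slice in the (hypothetical) acoustic database."""
    diphone: str     # e.g. "h-e"
    pitch: float     # average pitch of the recording, in Hz
    duration: float  # length of the recording, in seconds


def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding both ends with silence '_'."""
    padded = ["_"] + list(phonemes) + ["_"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]


def target_cost(unit, pitch, duration):
    """How far a recorded unit is from the prosody we asked for (arbitrary weights)."""
    return abs(unit.pitch - pitch) / 100.0 + abs(unit.duration - duration) * 10.0


def join_cost(prev, unit):
    """How audible the seam between two consecutive units would be (pitch jump only)."""
    return abs(prev.pitch - unit.pitch) / 100.0


def select_units(targets, database):
    """Choose one unit per target so that total target + join cost is minimal.

    targets:  list of (diphone, pitch, duration) tuples from the analysis steps
    database: dict mapping a diphone name to its list of recorded Unit candidates
    """
    if not targets:
        return []

    # One layer of (candidate, target cost) pairs per requested diphone.
    layers = []
    for diphone, pitch, duration in targets:
        candidates = database.get(diphone)
        if not candidates:
            raise KeyError(f"no recorded unit for diphone {diphone!r}")
        layers.append([(u, target_cost(u, pitch, duration)) for u in candidates])

    # costs[i][k]: cheapest total cost of any path ending at candidate k of layer i.
    costs = [[tc for _, tc in layers[0]]]
    back = [[None] * len(layers[0])]
    for i in range(1, len(layers)):
        row_cost, row_back = [], []
        for unit, tc in layers[i]:
            def path_cost(j):
                return costs[i - 1][j] + join_cost(layers[i - 1][j][0], unit)
            best_j = min(range(len(layers[i - 1])), key=path_cost)
            row_cost.append(path_cost(best_j) + tc)
            row_back.append(best_j)
        costs.append(row_cost)
        back.append(row_back)

    # Walk the back-pointers from the cheapest final candidate to the start.
    k = min(range(len(costs[-1])), key=costs[-1].__getitem__)
    path = []
    for i in range(len(layers) - 1, -1, -1):
        path.append(layers[i][k][0])
        k = back[i][k]
    return path[::-1]
```

In a real system the selected units would then be concatenated and smoothed into a waveform; here the function only returns which recordings it would use. The costs consider pitch and duration alone, which is a deliberate simplification for the sketch.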
