Brainy voices: innovative voice creation based on deep learning by Acapela Group Research Lab.

29 June 2017

Discover how Acapela Group can create a synthetic version of any voice based on a few minutes of speech recordings.

Neural networks have revolutionized computer vision and automatic speech recognition. This machine learning revolution is now delivering on its promise as it enters the Text to Speech arena.

Acapela Group is actively working on Deep Neural Networks (DNN) and we are very enthusiastic and proud to present the first achievements of our research in this fascinating field, creating new opportunities for voice interfaces.

Our R&D lab has developed Acapela DNN, an engine capable of creating a voice using a limited amount of existing or new speech recordings.

“Acapela DNN represents ‘Acapela’s ultimate talking machine’, benefiting from our speech expertise and learning from our vast voice and language databases to model voice identities and reproduce speech, in many languages. This is much more than concatenating speech recordings from the studio like we used to do with unit selection. We are talking about creating a voice signal and persona from scratch and in many languages and it is happening now. We need only one week to release a new voice based on a few minutes of speech recordings”, says Vincent Pagel, R&D and Linguistic Group manager, Acapela Group.

Until now, synthetic voice creation has usually relied on rich audio material recorded by a professional voice actor, in a professional studio and under the supervision of a linguistic expert. Acapela can now create a voice from an average of 10 to 15 minutes of clean audio recordings and the associated text transcription of the audio samples.

Voices can be created from minutes or hours of speech recordings, depending on the targeted usage. In specific cases such as voice replacement for patients, Acapela DNN can work with a few minutes of speech. For professional usage, such as creating a voice for a video game or for a passenger information system, Acapela DNN will need more recordings. The more data there is, the more the DNN can learn about a speaker's specific habits and the more closely the created voice matches the original.

The first results of voices created using this approach are impressive.

We have worked on voice recordings of well-known people. We have also created voices for individuals who can no longer speak clearly because of surgery or disease. They will be the first to speak with voices created with Acapela DNN. Here are some voice samples.

Listen to voice samples

The voice samples on this page were produced from only a few minutes of speech. From the speech recordings each user provided, Acapela DNN defined a voice ID and, after training, produced a voice that is very close to the original.

John, US English: original voice and Acapela DNN TTS samples

Stephen, US English: original voice and Acapela DNN TTS samples

Anonymous user, French: original voice and Acapela DNN TTS samples

Other ongoing experiments include voices for video games and robots. The possibilities for DNN-based voice creation are limitless. With this new approach, Acapela will push the boundaries of technology, allowing everyone to have a voice.

Material needed: average of 10-15 min of clean recordings + text transcription

Acapela DNN is trained offline on the many different voices in our catalogue. We feed it all the text and acoustic databases we have for all of our voices. This means Acapela DNN knows a lot about human speech in general but doesn’t yet know anything about a specific person’s voice, and it will need to hear this voice for a while before reproducing it.

1st pass algorithm: ‘Voice ID’ parameters to define the digital signature (or sonority) of the vocal tract of the speaker.
2nd pass algorithm: Acapela DNN additional training to match the imprint of the voice with its fine-grained details (accents, speaking habits, etc.)
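
To make the two passes more concrete, here is a toy sketch in Python. It is not Acapela's implementation, which is not described in detail here. It assumes a tiny linear model in which a shared weight matrix W stands in for the network pre-trained on the whole catalogue, and a small vector s stands in for the Voice ID. All names, sizes and numbers (W, V, s, the data points, the learning rate) are hypothetical. The first pass keeps the shared model frozen and fits only s; the second pass keeps s fixed and fine-tunes W on the same few recordings.

```python
# Toy two-pass speaker adaptation. Everything here is an illustrative assumption.

def predict(x, W, V, s):
    """Output = shared weights applied to text features + projection of the voice ID."""
    return [sum(w * xi for w, xi in zip(W[k], x)) +
            sum(v * sj for v, sj in zip(V[k], s))
            for k in range(len(W))]

def loss(data, W, V, s):
    """Mean squared error between predictions and the speaker's recordings."""
    total = 0.0
    for x, y in data:
        total += sum((p - t) ** 2 for p, t in zip(predict(x, W, V, s), y))
    return total / len(data)

def fit_voice_id(data, W, V, s, lr=0.05, steps=200):
    """Pass 1: the shared model is frozen; only the voice ID vector s is updated."""
    s = list(s)
    for _ in range(steps):
        grad = [0.0] * len(s)
        for x, y in data:
            for k, (p, t) in enumerate(zip(predict(x, W, V, s), y)):
                for j in range(len(s)):
                    grad[j] += 2 * (p - t) * V[k][j] / len(data)
        s = [sj - lr * g for sj, g in zip(s, grad)]
    return s

def fine_tune(data, W, V, s, lr=0.05, steps=200):
    """Pass 2: the voice ID is fixed; the shared weights W are nudged towards the speaker."""
    W = [list(row) for row in W]
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in W]
        for x, y in data:
            for k, (p, t) in enumerate(zip(predict(x, W, V, s), y)):
                for i, xi in enumerate(x):
                    grad[k][i] += 2 * (p - t) * xi / len(data)
        W = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W, grad)]
    return W

# Hypothetical "pre-trained" parameters and a handful of (text features, acoustic target) pairs.
W_pretrained = [[0.5, -0.2], [0.1, 0.4]]
V = [[1.0, 0.5], [0.0, 1.0]]
speaker_data = [([1.0, 0.0], [0.9, 0.6]), ([0.0, 1.0], [0.2, 0.7]),
                ([1.0, 1.0], [0.8, 1.0]), ([0.5, 0.5], [0.5, 0.8])]

loss_before = loss(speaker_data, W_pretrained, V, [0.0, 0.0])
voice_id = fit_voice_id(speaker_data, W_pretrained, V, [0.0, 0.0])
loss_pass1 = loss(speaker_data, W_pretrained, V, voice_id)
W_adapted = fine_tune(speaker_data, W_pretrained, V, voice_id)
loss_pass2 = loss(speaker_data, W_adapted, V, voice_id)
print(loss_before, loss_pass1, loss_pass2)
```

In this sketch the error on the speaker's data drops after each pass: the voice ID captures the overall character of the voice, and the fine-tuning step picks up what the voice ID alone cannot express.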

Creation of a new voice based on limited amount of audio data

About DNN

A deep neural network (DNN) is an artificial neural network (ANN) with multiple hidden layers between the input and output layers. DNNs can model complex non-linear relationships. We use them in Text-to-Speech to learn the relationship between a set of input texts and their acoustic realizations by different speakers.
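
As an illustration of "multiple hidden layers", the following minimal Python sketch runs a forward pass through a small fully connected network. It is only a sketch under stated assumptions: the layer sizes, the tanh activation, the random weights and the meaning given to the input and output vectors are all hypothetical and are not taken from Acapela DNN.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights and zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def dense(x, layer):
    """Apply one layer: weighted sum of the inputs plus a bias, per output unit."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def forward(x, layers):
    """Hidden layers apply a non-linearity (tanh); the output layer is linear."""
    for layer in layers[:-1]:
        x = [math.tanh(v) for v in dense(x, layer)]
    return dense(x, layers[-1])

rng = random.Random(0)
sizes = [6, 16, 16, 16, 4]  # hypothetical: 6 text features in, 3 hidden layers, 4 acoustic values out
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

text_features = [1.0, 0.0, 0.0, 1.0, 0.5, 0.2]  # made-up numeric encoding of a piece of text
acoustic_frame = forward(text_features, layers)
print(acoustic_frame)
```

The non-linearity between layers is what lets a stack of layers model relationships that a single linear mapping could not. In a real system the weights would be learned from data rather than drawn at random.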

Neural networks are a set of algorithms, modelled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labelling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated.
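
To show what "translated into vectors" can mean in the simplest case, here is a short Python sketch that turns characters into one-hot vectors. This is a generic textbook encoding chosen for illustration and is an assumption on our part; the function name and the sample text are hypothetical, and real Text-to-Speech systems use much richer linguistic features.

```python
def one_hot(symbol, alphabet):
    """Return a vector with a 1.0 at the symbol's position and 0.0 elsewhere."""
    vector = [0.0] * len(alphabet)
    vector[alphabet.index(symbol)] = 1.0
    return vector

# Build the alphabet from a sample text, then encode a word character by character.
alphabet = sorted(set("hello world"))
vectors = [one_hot(ch, alphabet) for ch in "hello"]

for ch, vec in zip("hello", vectors):
    print(ch, vec)
```

Each character becomes a list of numbers that a network can take as input, and identical characters always map to identical vectors.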