Google DeepMind Artificial Intelligence Learns to Talk


Using an “artificial brain,” Google DeepMind researchers have developed a new speech synthesis technique they claim closes the gap with real human speech by at least 50% compared with current text-to-speech (TTS) systems, in both US English and Mandarin Chinese.

The system, known as WaveNet, generates human speech by constructing the raw sound wave itself, one sample at a time. Because it is a neural network loosely modeled on the brain, WaveNet can learn from extremely detailed audio (at least 16,000 samples per second). The program statistically predicts each new sample from all of the preceding ones and strings those predictions together, producing raw audio.
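The sample-by-sample idea can be sketched in a few lines. This is a toy illustration only, not the real model: the actual WaveNet is a deep convolutional network, while the `predict_next` stand-in below is a made-up function that returns a probability distribution over 256 quantized amplitude levels (WaveNet quantizes raw audio to 8 bits).

```python
import random

QUANT_LEVELS = 256  # 8-bit quantized amplitude values, as in WaveNet

def predict_next(history):
    """Stand-in for the network: a distribution over the next sample value."""
    # Purely illustrative rule: bias toward staying near the previous sample.
    prev = history[-1] if history else QUANT_LEVELS // 2
    weights = [1.0 / (1 + abs(v - prev)) for v in range(QUANT_LEVELS)]
    total = sum(weights)
    return [w / total for w in weights]

def generate(num_samples, seed=0):
    """Autoregressive generation: sample, append, repeat."""
    random.seed(seed)
    audio = []
    for _ in range(num_samples):
        probs = predict_next(audio)
        # Draw the next amplitude from the predicted distribution,
        # then feed it back in -- one model evaluation per output sample.
        nxt = random.choices(range(QUANT_LEVELS), weights=probs)[0]
        audio.append(nxt)
    return audio

samples = generate(100)
```

The key point is the feedback loop: every generated value becomes part of the history used to predict the next one, which is what makes the output raw audio rather than stitched-together recordings.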

While most existing TTS systems also build speech piece by piece, they largely rely on concatenative TTS. Despite drawing from a large database, these systems are restricted to combinations of short recorded speech fragments from a single speaker, which makes modifying the voice or its inflection difficult. As an alternative, some TTS systems use parametric TTS, which generates speech entirely through a voice synthesizer driven by a statistical model, but the resulting output doesn’t sound very natural.
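The limitation of the concatenative approach can be made concrete with a minimal sketch. The fragment database and its contents here are entirely made up; the point is that output is restricted to stitching together what was recorded.

```python
# Toy concatenative TTS: a fixed database of recorded fragments
# from a single speaker. Names and contents are hypothetical.
database = {
    "hello": "audio<hello>",
    "world": "audio<world>",
}

def synthesize(words):
    """Stitch prerecorded fragments together, in order."""
    pieces = []
    for w in words:
        if w not in database:
            # The core limitation: the system cannot say anything
            # that was never recorded, and cannot change the voice.
            raise KeyError(f"no recording for {w!r}")
        pieces.append(database[w])
    return "".join(pieces)
```

A request for any word outside the database simply fails, and changing the speaker or inflection would require re-recording the entire database, which is exactly the rigidity the article describes.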

Because WaveNet produces raw audio output, the program is capable of seamless transitions between multiple voices, and the researchers say that adding emotions and accents will make its speech sound even more realistic. “To make sure [WaveNet] knew which voice to use for any given utterance, we conditioned the network on the identity of the speaker,” the researchers said.
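Conditioning on speaker identity typically means encoding the speaker as an extra input vector. The sketch below is a simplified assumption of how that might look (in the actual network the conditioning signal influences every layer, not just the input):

```python
NUM_SPEAKERS = 4  # hypothetical number of voices in the training set

def one_hot(speaker_id, n=NUM_SPEAKERS):
    """Encode a speaker ID as a one-hot vector."""
    return [1.0 if i == speaker_id else 0.0 for i in range(n)]

def conditioned_input(audio_history, speaker_id):
    # Illustrative only: concatenate the speaker encoding with the
    # audio history so a single model can produce multiple voices.
    return list(audio_history) + one_hot(speaker_id)
```

Because the speaker code is just another input, one trained network can switch voices by swapping the code, which is what enables the seamless transitions the researchers describe.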

Training a program capable of learning the intricacies of multiple voices, both male and female — including breathing and mouth movements — was revealing for the researchers: “Interestingly, we found that training on many speakers made it better at modelling a single speaker than training on that speaker alone, suggesting a form of transfer learning.”

Additionally, WaveNet isn’t just limited to speech. “As well as yielding more natural-sounding speech, using raw waveforms means that WaveNet can model any kind of audio, including music,” said researchers.

It will be a while before WaveNet goes mainstream, however. The whole process requires an exorbitant amount of computational power. Beyond training on audio sampled at least 16,000 times per second, the program must predict what each sound-wave sample should look like based on all of the previous samples, a “clearly challenging task,” according to the research team.
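A quick back-of-envelope calculation shows why this is expensive: one full network evaluation per sample, run sequentially. The 10-second duration below is an arbitrary example.

```python
SAMPLE_RATE = 16_000  # samples per second, as cited in the article
seconds = 10          # hypothetical clip length

# Each output sample requires one sequential forward pass through
# the network, so cost scales linearly with audio duration.
forward_passes = SAMPLE_RATE * seconds
```

That is 160,000 network evaluations for just 10 seconds of audio, and they cannot be parallelized at generation time, since each sample depends on the one before it.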

Prior to WaveNet, DeepMind created AlphaGo, an artificial intelligence system that bested the human world champion at the strategy game Go earlier this year. And before that, DeepMind killed at the arcade, mastering video games without any human instruction on how to play. Additionally, the company’s research and technology allowed Google to cut the energy used to cool its data centers by 40%, resulting in financial savings great enough to justify the cool half billion Google shelled out for the British artificial intelligence company in 2014.