“Speech is the most natural way to communicate, so using voice to interact with machines has always been one of the top scenarios associated with AI.”
Getting to know Li Jiang
Li Jiang, now a distinguished engineer at Microsoft Speech, was introduced to speech recognition while doing research as an undergraduate student in the mid-1980s. He fell in love with the problem space while building a speech recognition system on a rudimentary Apple computer.
“It was simply magic to see [the] computer respond to voice and recognize what is said.”
Halfway through his PhD program, Li started to focus on speech recognition during an internship with Microsoft Research in 1994. Li loved the company and the experience so much that he stayed on and never actually returned to school! Over the past 27 years, Li has worked in different roles across both the research and engineering teams, eventually returning to the field of speech. Li currently leads the audio and speech technology department under Azure Cognitive Services.
“I was fortunate to witness not only the dramatic advancement in technology in the past few decades, but how the technologies are enabling people to do more and to improve their productivity. It has been a great ride and I loved every moment of it.”
Progression of Speech Technologies
Early speech technologies were pattern recognition and rule-based expert systems. When Li got started, simple pattern matching tools were among the early leaders in the space. One of the better-known techniques at the time was DTW, or Dynamic Time Warping. These systems essentially tried to match an incoming speech sequence against stored sequence templates, and would recognize the speech if it matched a template closely enough. However, there was always some kind of restriction: these systems worked best with isolated words, small vocabularies, or single-speaker audio.
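To make that concrete, here is a minimal sketch of the dynamic programming at the heart of DTW, written in Python purely for illustration (the frame features and the template set are placeholders, not any particular historical system):

```python
import numpy as np

def dtw_distance(seq, template):
    """Classic dynamic-time-warping distance between two feature sequences.

    seq, template: 2-D arrays of shape (num_frames, num_features),
    e.g. per-frame spectral features. Smaller distance = better match.
    """
    n, m = len(seq), len(template)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq[i - 1] - template[j - 1])  # frame-level distance
            # allow match, insertion, or deletion of frames (the "warping")
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(utterance, templates):
    """Pick the word whose stored template warps onto the utterance most cheaply."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
```

In a template-matching recognizer, `templates` would hold one recorded example per word in a small, speaker-dependent vocabulary, which is exactly the kind of restriction Li describes.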
The introduction of Hidden Markov Models (HMMs) served as the foundation for modern speech recognition systems, enabling accurate recognition of large-vocabulary, speaker-independent, continuous speech.
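A rough sketch of the decoding idea behind HMM-based recognition, assuming toy log-probability tables rather than a real trained model: the Viterbi algorithm finds the most likely hidden-state (e.g., phone) sequence given per-frame acoustic scores.

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely hidden-state path through an HMM.

    log_init:  (S,)    log prior over states
    log_trans: (S, S)  log transition probabilities
    log_obs:   (T, S)  per-frame log likelihood of the audio under each state
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # score of arriving in each state from each predecessor
        back[t] = cand.argmax(axis=0)          # remember the best predecessor
        score = cand.max(axis=0) + log_obs[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # trace the best path backwards
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```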
Around 2010, deep learning studies started showing promising results for speech recognition. The LSTM model was found to be very well suited to speech, and it has since served as the foundation for the current generation of improved models. More recently, transformer models, which originated in natural language processing research, have shown promising improvements across a range of speech tasks.
In 2016, Microsoft actually reached human parity on the challenging Switchboard task, thanks to deep learning technologies.
The Switchboard Task & Human Parity
The Switchboard task is something like the Turing Test for the speech community, in that it’s the benchmark of human equivalence in system performance. In it, two people select a common topic and carry on a free-form conversation that the systems then transcribe.
Initially, systems had a very high error rate on this task, in the range of 20%. In October 2016, professional transcribers reported an error rate of 5.9% on the Switchboard task, while the Microsoft system using deep learning was able to achieve 5.8%. This was the first time human parity had been achieved in the speech recognition space.
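Those percentages are word error rates (WER): the number of substituted, inserted, and deleted words divided by the number of words in the reference transcript. A minimal way to compute it:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with standard edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("we reached human parity", "we reached a human parody"))  # 0.5
```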
Pros and Cons of Different Architectures
The traditional speech recognition architecture is a hybrid model made up of three parts: an acoustic model, a language model, and a pronunciation model.
The acoustic model, trained on many hours of transcribed speech, models the acoustic events, essentially telling which sounds are being produced in a sequence. The language model is trained on large amounts of text and captures which word sequences are likely. The pronunciation model connects the acoustic sounds and the words together.
The biggest benefit of this hybrid model is that it’s easily customizable: because the sound, word, and pronunciation knowledge live in separate components, it’s easy to feed it a new word or a new sound and have it integrated into the system. The downside of this hybrid model is that the memory footprint is huge. Even in its highly compressed binary form, it still takes multiple gigabytes.
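Under the hood, a hybrid decoder searches for the word sequence W that maximizes P(audio | W) · P(W): the acoustic and pronunciation models supply the first term, and the language model supplies the second. Here is a toy n-best rescoring sketch of that combination (the function names and the simple candidate search are illustrative, not the actual decoder):

```python
def hybrid_score(words, acoustic_loglik, lexicon, language_model, lm_weight=0.8):
    """Score one candidate transcription the way a hybrid system does:
    acoustic + pronunciation evidence plus a weighted language-model prior.

    acoustic_loglik(phones) -> log P(audio | phone sequence)    # acoustic model
    lexicon[word]           -> phone sequence for that word     # pronunciation model
    language_model(words)   -> log P(word sequence)             # language model
    """
    phones = [p for w in words for p in lexicon[w]]
    return acoustic_loglik(phones) + lm_weight * language_model(words)

def decode(candidates, **models):
    """Pick the best word sequence from an n-best list of candidates."""
    return max(candidates, key=lambda words: hybrid_score(words, **models))
```

The modularity is what makes customization cheap: adding a new word mostly means adding an entry to the lexicon and some text to the language model, without retraining the acoustic model.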
End-to-end models have received a lot of attention recently, and rightly so: they have progressed immensely over the last few years. An end-to-end model is a single model that takes in speech and outputs text, essentially modeling the acoustic and language aspects jointly. End-to-end models can more or less match hybrid models in accuracy while being much more compact: an end-to-end model is small enough to fit on a smartphone or an IoT device.
The downside to the end-to-end model’s smaller size is that the model is much more dependent on the speech labels and data it’s trained on, and not as adaptable to new language. It’s much harder for an end-to-end model to incorporate new vocabulary, and the community is working on ways to make end-to-end models more flexible.
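As a rough illustration of what “a single model from speech to text” means, here is a tiny sketch in PyTorch (the interview doesn’t name a framework, and production models are far larger, often attention- or transducer-based): an encoder over acoustic features with a character-level CTC output layer.

```python
import torch
import torch.nn as nn

class TinySpeechToText(nn.Module):
    """Toy end-to-end recognizer: acoustic features in, character probabilities out.
    A single network jointly learns acoustic and language regularities from labeled speech."""

    def __init__(self, num_features=80, hidden=256, num_chars=29):  # e.g. 26 letters + space + apostrophe + CTC blank
        super().__init__()
        self.encoder = nn.LSTM(num_features, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.to_chars = nn.Linear(2 * hidden, num_chars)

    def forward(self, features):                     # features: (batch, frames, num_features)
        encoded, _ = self.encoder(features)
        return self.to_chars(encoded).log_softmax(dim=-1)

# Training pairs this with CTC loss, which handles the frame-to-character alignment:
model = TinySpeechToText()
ctc_loss = nn.CTCLoss(blank=0)
```

Because the output vocabulary is whatever the training transcripts contained, teaching such a model a brand-new term means retraining or biasing it, which is the flexibility gap described above.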
Integrating Specialized Vocabulary
Even though a generic system performs pretty well, there are many domain-specific terms that systems can struggle with if not trained on them specifically. For this reason, Microsoft allows customers to bring their own specialized data and customize their language models. This is especially necessary for specialized domains like medicine, which have a lot of specific terminology and require a much deeper data investment. Microsoft recently acquired Nuance, a leader in the medical speech recognition space, to help make this process even smoother.
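One simple way to surface custom terms, shown here as a generic n-best rescoring sketch rather than Microsoft’s actual customization pipeline, is to boost hypotheses that contain the phrases a customer cares about:

```python
def boost_custom_terms(nbest, custom_phrases, boost=2.0):
    """Re-rank recognizer hypotheses so that ones containing the customer's
    domain terms (drug names, product names, ...) get a score bonus.

    nbest: list of (transcript, score) pairs, higher score = better.
    """
    def boosted(item):
        text, score = item
        bonus = sum(boost for phrase in custom_phrases if phrase.lower() in text.lower())
        return score + bonus
    return sorted(nbest, key=boosted, reverse=True)

nbest = [("prescribe a spin", -3.1), ("prescribe aspirin", -3.4)]
print(boost_custom_terms(nbest, ["aspirin"])[0][0])  # "prescribe aspirin"
```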
Li believes that it’s important to continue to improve both generic model capability and domain-specific models. The more data an algorithm is given, the more effective its training becomes. Li hopes that eventually a model will be good enough to handle almost all domain-specific scenarios, but until then, we have to take a pragmatic approach and ask how we can make this technology work for different domains.
Specific Use Cases at Microsoft
A major challenge at Microsoft is figuring out how to stay current with technical innovations while still maintaining a short research-to-market cycle and keeping costs economical for customers. Li mentioned that there’s a lot of work being done to make inference faster, make models smaller, and reduce latency.
For most customers, it only takes a few hours of speech data to build a really high-quality custom voice. Microsoft uses this technology internally, too: Li mentioned he spent about 30 minutes building a personal voice font for himself.
“It’s really interesting to hear your own voice and read your own email, that’s a very interesting experience.”
For large and widely-spoken languages, like English and Chinese, there’s a ton of data to train models on. It’s more challenging when it comes to smaller languages that have less data to train on. To accommodate this, Li’s team is using transfer learning on a pre-trained base model, then adding on language-specific data. This approach has been working really well!
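Here is a schematic of that recipe, sketched in PyTorch purely for illustration (`base_encoder` stands in for a hypothetical pretrained multilingual acoustic model): keep the pretrained encoder, attach a fresh output layer for the new language, and fine-tune on the smaller dataset, often unfreezing the encoder later at a lower learning rate.

```python
import torch
import torch.nn as nn

def adapt_to_new_language(base_encoder: nn.Module, encoder_width: int, num_chars: int,
                          freeze_encoder: bool = True):
    """Reuse a pretrained multilingual acoustic encoder for a low-resource language:
    attach a fresh output layer for the new language's characters and fine-tune."""
    if freeze_encoder:
        for param in base_encoder.parameters():
            param.requires_grad = False        # keep the cross-lingual acoustic knowledge fixed at first
    head = nn.Linear(encoder_width, num_chars) # new, randomly initialized output layer
    model = nn.Sequential(base_encoder, head)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)
    return model, optimizer
```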
Emotional Encoding & Deepfakes
The Microsoft Speech team is also working on encoding emotional styles into TTS software, with styles differentiated across vertical domains. For a news anchor, for example, the voice style is calm and reputable, whereas for a personal assistant it’s warmer and more cheerful.
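In Azure neural text-to-speech, styles like these are typically requested through SSML markup; the snippet below is only illustrative, since which voices support which style names varies.

```python
# Illustrative SSML for neural TTS speaking styles; the voice and style names
# here (e.g. "newscast", "cheerful") are examples, not a complete or guaranteed list.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-AriaNeural">
    <mstts:express-as style="newscast">
      Tonight's top story: speech systems reached human parity on Switchboard.
    </mstts:express-as>
    <mstts:express-as style="cheerful">
      And your calendar is clear for the rest of the evening!
    </mstts:express-as>
  </voice>
</speak>
"""
```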
To prevent the technology from being used for malicious deepfakes, a top priority at Microsoft is making sure text-to-speech software is used responsibly. The company has a dedicated responsible AI team and a thorough review process that ensures the voice a customer wants to build is their own and not someone else’s. They are also working on a feature that could embed a unique watermark, so that audio can be detected as generated by text-to-speech software.
The Future of Speech
Going forward, Li hopes to continue improving the technology itself. He looks forward to having speech recognition systems learn abbreviations and be able to “code switch”, recognizing the same voice even as it moves between languages. Li hopes to make the system more robust and more portable, easier to apply to different applications with fewer recognition errors. He said he’s always learning about areas where the system struggles and ways to keep improving its capability to help Microsoft better serve its customers.