Rohit Prasad is vice president and head scientist, Amazon Alexa, where he leads research and development in speech recognition, natural language understanding, and machine learning technologies aimed at improving customers’ experiences with Echo devices, powered by Alexa. He answers five questions about technology and Alexa's future.
Q. The U.S. Defense Advanced Research Projects Agency first began working on speech technology in the early 1970s. Why are we now suddenly experiencing this emergence in conversational AI technologies like Alexa?
A. Conversational AI as a technology has been actively researched for close to 50 years, with the goal of making interactions with machines as seamless as communication among humans. This is one of the most challenging areas of AI, because machines must be supremely intelligent to understand and communicate in human language, be it speech or text, or in combination with touch or visuals.
While speech as a human-machine interface has always been considered advantageous, the biggest barrier to its adoption has been the limited ability of machines to recognize and understand speech input in a hands- and eyes-free manner. We call this the challenge of far-field or distant speech recognition, where an ambient device like an Echo must recognize words spoken from a distance with high accuracy.
By launching Echo in November 2014, we showed that far-field speech recognition in noisy environments can be done with high accuracy through a combination of machine learning algorithms, data, and immense computing power.
Another important reason for adoption of Alexa is the wide array of intents she can understand and respond to, revolutionizing daily convenience such as accessing and playing music, books, and videos, controlling smart devices in the home, communicating with friends and family, shopping, setting reminders, or getting information you need.
Q. What are the key conversational AI and machine learning technologies behind Alexa?
A. Alexa is designed to take the best action on behalf of the user based on her interpretation of the user’s goal. Unlike search engines, she doesn't simply respond with a set of 10 blue links the user must choose from; instead, Alexa acts on behalf of the user, asking clarifying questions as needed. There are several key technologies that make Alexa work.
It starts with detecting the “wake word” that triggers Alexa to start listening to the subsequent spoken words from the user. Wake word detection employs deep learning technology running on-device to recognize the wake word the user has chosen. Far-field automatic speech recognition (ASR) in the Amazon Web Services (AWS) cloud then converts into text the audio following the wake word, and determines when the user has stopped talking to Alexa.
Once the spoken audio has been converted to text, Alexa uses natural language understanding (NLU) to convert the words into a structured interpretation of intent that can be used to respond to the user from the more than 30,000 Alexa skills built by first- and third-party developers.
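To make "structured interpretation of intent" concrete, here is a minimal sketch of turning recognized text into an intent name plus named slots. It is only an illustration: the intent names, the patterns, and the `interpret` function are hypothetical, and it uses hand-written regular expressions where a production NLU system, as described in this interview, relies on data-driven machine learning models.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Interpretation:
    """A structured reading of an utterance: one intent plus named slots."""
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical hand-written patterns, standing in for a trained NLU model.
PATTERNS = [
    ("PlayMusic", re.compile(r"^play (?P<song>.+?)(?: by (?P<artist>.+))?$")),
    ("SetTimer", re.compile(r"^set a timer for (?P<duration>.+)$")),
]


def interpret(text: str) -> Interpretation:
    """Map recognized text to the first matching intent, or to 'Unknown'."""
    normalized = text.lower().strip()
    for intent, pattern in PATTERNS:
        match = pattern.match(normalized)
        if match:
            # Keep only the slots that were actually filled.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return Interpretation(intent, slots)
    return Interpretation("Unknown")


result = interpret("Play Yesterday by the Beatles")
print(result.intent, result.slots)
```

The point of the structure is that downstream components can act on `intent` and `slots` without re-reading the raw words.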
This structured interpretation is used in combination with different forms of context, such as which type of device the user is interacting with, what the most likely skills are that can provide a response, or who is speaking. This context helps determine the best next action Alexa should take. The possible outcomes are to either respond with the best action from a skill or to ask for more information from the user.
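The two outcomes described above, respond from a skill or ask for more information, can be sketched as a small decision rule. Everything here is an assumption made for illustration: the `Candidate` and `Context` types, the screen penalty, and the thresholds are hypothetical and do not describe how Alexa actually weighs context.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    skill: str
    score: float  # hypothetical confidence in [0, 1] from the NLU step


@dataclass
class Context:
    has_screen: bool  # e.g. whether the device can show video


def adjusted_score(candidate: Candidate, context: Context) -> float:
    """Lower the score of skills that do not fit the device (assumed rule)."""
    if candidate.skill == "video" and not context.has_screen:
        return candidate.score - 0.5
    return candidate.score


def next_action(candidates, context, min_score=0.6, min_margin=0.1):
    """Return ('respond', skill) when one skill clearly wins, else ('ask', question)."""
    if not candidates:
        return ("ask", "Sorry, what would you like to do?")
    ranked = sorted(candidates, key=lambda c: adjusted_score(c, context), reverse=True)
    best = ranked[0]
    best_score = adjusted_score(best, context)
    runner_up = adjusted_score(ranked[1], context) if len(ranked) > 1 else 0.0
    if best_score >= min_score and best_score - runner_up >= min_margin:
        return ("respond", best.skill)
    return ("ask", f"Did you mean {best.skill}?")


candidates = [Candidate("video", 0.9), Candidate("music", 0.7)]
print(next_action(candidates, Context(has_screen=True)))
print(next_action(candidates, Context(has_screen=False)))
```

With the same candidates, the device type alone changes which skill is chosen, which is the role context plays in the description above.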
How Alexa responds or sounds is also critical for natural dialog. This is accomplished via text-to-speech synthesis (TTS) which converts arbitrary sequences of words into natural sounding, intelligible audio.
What’s common across all the technologies mentioned above is the emphasis on data-driven machine learning and fast inference at runtime to deliver an accurate response in as short a time as possible. As scientists and engineers we’re always battling this healthy tension between accuracy and latency from when the user stops speaking to Alexa to when she responds.
Q. Like other artificial intelligence technologies, Alexa gets smarter the more you use her and the more she learns about you. What are Amazon scientists and engineers doing to make Alexa smarter?
A. Since Alexa’s brain is mostly in the cloud, she gets smarter with every interaction. Alexa uses a suite of learning techniques: supervised, semi-supervised, and unsupervised learning. While supervised learning is still the most powerful, it does not scale, since we cannot generate manual labels at the pace required to continually improve Alexa for our customers. Therefore, our scientists and engineers are continually applying and inventing new learning techniques to reduce reliance on manual labels for training our statistical models. For instance, active learning, a class of semi-supervised learning techniques in which the system itself identifies the parts of the interactions for which it needs input from a human expert, is pervasive across our technologies. Unsupervised learning without any labeled responses is also being applied to make Alexa smarter, especially for speech recognition. Finally, we also use transfer learning, allowing Alexa to transfer what she learns from one skill to another or even from one language to another.
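One common form of active learning is uncertainty sampling: the model's least confident predictions are the ones sent to a human for labeling. The sketch below assumes that technique and uses made-up utterances and probabilities; the interview does not say which selection criterion Amazon uses, and `select_for_labeling` is a hypothetical helper.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_labeling(predictions, budget):
    """Pick the `budget` utterances whose predicted intents are most uncertain.

    `predictions` maps each unlabeled utterance to the model's probability
    for each candidate intent (illustrative values, not real model output).
    """
    ranked = sorted(predictions, key=lambda u: entropy(predictions[u]), reverse=True)
    return ranked[:budget]


predictions = {
    "play jazz": [0.98, 0.01, 0.01],   # model is confident
    "turn it up": [0.40, 0.35, 0.25],  # model is unsure
    "order more": [0.70, 0.20, 0.10],
}
print(select_for_labeling(predictions, budget=1))
```

Labeling only the selected utterances concentrates scarce human effort where the model stands to learn the most.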
Q. What’s unique about doing conversational AI research at Amazon?
A. What makes us unique is how we approach research in general. Every research problem starts with a working backwards methodology borrowed from how we approach product development within Amazon. The basic idea is simple. We begin by writing what the research, if successful, would eventually accomplish or revolutionize, and then we work backwards from that goal in how we design our experiments and milestones to check progress. We believe in fast experimentation and proving or disproving our hypothesis as early as possible.
Another unique aspect of conversational AI research within Amazon is that we have a ground-breaking product in the form of Alexa, where we can prove new algorithms and technology at scale. This makes the technical advances we publish in conferences and journals even more credible.
The combination of large quantities of data, near-infinite computing power, deep experience of our team in AI problems to learn from, and the appetite for risk taking makes Amazon arguably the best place in the world to pursue your conversational AI research dreams.
Q. What do you see as the future of conversational AI?
A. I am excited about the future of AI as a whole. AI will have deep societal impact and will help humans learn new skills that we can’t even imagine today. For conversational AI, I still think it is Day 1. The success and adoption of Alexa is extremely satisfying but we have just scratched the surface of what’s possible.
In the next five years, we will see conversational AI get smarter on multiple dimensions as we make further advances with machine learning and reasoning. With these advances, we will see Alexa become more contextually aware in how she recognizes, understands, and responds to requests from users. Alexa will become smarter more quickly as unsupervised learning comes to dominate how she learns.
Alexa will engage in more natural conversation on everyday topics and news events, just as humans can. This is our focus with the Alexa Prize, a university competition for building “socialbots” that can coherently and engagingly conduct a 20-minute conversation with a human. Our customers have logged more than 100,000 hours of conversation with the 2017 Alexa Prize socialbots; our 2018 Alexa Prize socialbots will come online in May. It’s fun to try. Just say, “Alexa, let’s chat.”