Recently, Amazon announced the availability of India’s first celebrity voice experience on Alexa with the legendary icon of Indian cinema - Mr. Amitabh Bachchan. Creating the world’s first bi-lingual celebrity voice on Alexa required the team to invent and re-invent across many elements of speech science – wake word, speech recognition, neural text-to-speech and more.
“Voice technology is both evolutionary and revolutionary and it is safe to say we had an exciting journey to get to this launch,” said Manoj Sindhwani, Vice President of Alexa Speech at Amazon. The teams brought together different specialisations, from customer research, audio engineering, and data modelling to content, speech recognition, neural TTS, and wake word models. “This was a truly global project with teams across seven time zones – including USA, Canada, Poland, UK and India coming together to make this feature possible,” said Shridhar Pathak, Senior Technical Program Manager, Alexa TTS. “The pandemic forced teams to work remotely, but they still came together and collaborated on the many moving parts.”
Co-existence of simultaneous wake words: Alexa and Amit ji
The wake word is an important part of the voice experience. It needs to be phonetically simple, yet fairly distinct. In the case of Mr. Bachchan, the wake word needed to consider the deep love and respect that people have for him and help customers across age cohorts interact with his voice with comfort. The wake word Amit ji seemed most appropriate and one that everyone related to. Over the years, fans of the actor have referred to him as Amit ji.
The new wake word came with new challenges: it is short, and it could be confused with words commonly used in Indian households, such as Ammaji or Ammachi. The team had to train the algorithm to recognise the new wake word (“Amit ji”) while concurrently detecting the other primary wake words (“Alexa”, “Echo”, “Amazon”, and “Computer”), and had to work within the constraints of older-generation devices as well.
“False rejections, where the device fails to wake up despite multiple attempts, can be extremely frustrating. At the same time, we need to prevent false wakes, where the device might interpret another word or sound as the wake word,” said Shiv Naga Prasad Vitaladevuni, Director, Alexa WW, Alexa Edge ML. “While these challenges exist anywhere, it is even more complicated in a country like India with a rich diversity of languages, several dialects, and even more accents.”
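The trade-off between false rejections and false wakes can be illustrated with a small Python sketch. It is not the Alexa team's method: the scores, labels, and the `error_rates` helper are hypothetical, and it assumes the detector emits a single confidence score per audio clip that is compared against a threshold.

```python
from typing import List, Tuple


def error_rates(
    scores: List[float], is_wake_word: List[bool], threshold: float
) -> Tuple[float, float]:
    """Return (false_reject_rate, false_accept_rate) at a given threshold.

    A clip is accepted as a wake word when its score is >= threshold.
    """
    positives = [s for s, y in zip(scores, is_wake_word) if y]
    negatives = [s for s, y in zip(scores, is_wake_word) if not y]
    # True wake words that scored too low: the device fails to wake up.
    false_rejects = sum(1 for s in positives if s < threshold)
    # Other words or sounds that scored high enough: a false wake.
    false_accepts = sum(1 for s in negatives if s >= threshold)
    frr = false_rejects / len(positives) if positives else 0.0
    far = false_accepts / len(negatives) if negatives else 0.0
    return frr, far


# Hypothetical detector scores: three genuine "Amit ji" clips followed by
# three confusable clips (for example "Ammaji", "Ammachi", background noise).
scores = [0.95, 0.80, 0.55, 0.70, 0.40, 0.10]
labels = [True, True, True, False, False, False]

for threshold in (0.3, 0.6, 0.9):
    frr, far = error_rates(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  false rejects={frr:.2f}  false wakes={far:.2f}")
```

In this toy data, raising the threshold lowers the false wake rate and raises the false rejection rate, which is why neither error can be tuned in isolation.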
The team relied on transfer learning techniques to train the new multi-wake word model to accept a wide spectrum of nuances in pronunciations, thereby reducing false rejects. For instance, you can walk up to the device, say “Alexa, what's the weather?” and Alexa will respond. You can say “Amit ji, what is your favourite movie?” and Amit ji will respond with his favourite movie.
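As a rough illustration of how several wake words can co-exist, the Python sketch below picks the wake word whose score clears its own threshold by the widest margin and routes the request to the matching voice. Everything in it is an assumption made for illustration: the threshold values, the persona names, and the `detect_wake_word` and `route` helpers are hypothetical and do not describe the production model.

```python
from typing import Dict, Optional

# Hypothetical per-wake-word thresholds. A stricter value for "amit ji" is an
# assumption, motivated by its similarity to words like "Ammaji".
THRESHOLDS: Dict[str, float] = {
    "alexa": 0.60,
    "echo": 0.60,
    "amazon": 0.60,
    "computer": 0.60,
    "amit ji": 0.75,
}

# Hypothetical mapping from wake word to the voice that should respond.
PERSONA: Dict[str, str] = {"amit ji": "amit_ji_voice"}
DEFAULT_PERSONA = "alexa_voice"


def detect_wake_word(scores: Dict[str, float]) -> Optional[str]:
    """Return the wake word that clears its threshold by the widest margin."""
    margins = {
        word: score - THRESHOLDS[word]
        for word, score in scores.items()
        if word in THRESHOLDS and score >= THRESHOLDS[word]
    }
    if not margins:
        return None  # Nothing detected: the device stays asleep.
    return max(margins, key=margins.get)


def route(scores: Dict[str, float]) -> Optional[str]:
    """Map the detected wake word to the voice that should answer."""
    word = detect_wake_word(scores)
    if word is None:
        return None
    return PERSONA.get(word, DEFAULT_PERSONA)


print(route({"alexa": 0.92, "amit ji": 0.20}))  # "Alexa, what's the weather?"
print(route({"alexa": 0.10, "amit ji": 0.88}))  # "Amit ji, what is your favourite movie?"
print(route({"alexa": 0.30, "amit ji": 0.50}))  # a confusable word such as "Ammaji"
```

Keeping a separate threshold per wake word lets a short, easily confused wake word be tuned without changing the behaviour of the existing ones.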
Reproducing Mr. Bachchan’s voice, style, pronunciation, and intonation with neural TTS
At a very basic level, text-to-speech (TTS) technology takes a phrase as input and converts it into speech that sounds like the person whose recordings were used to train the neural models. Neural text-to-speech relies on deep neural network models. In this case, it involved taking Mr. Bachchan’s voice and building models that reproduce not only his voice but also his style of speaking: the right pronunciation, the pauses, and the stress on the right words.
Given that the team was dealing with one of the most recognised voices in India, the challenge was several notches higher than in a regular text-to-speech project. Using neural TTS led to a higher-quality, more human-sounding voice, which made interacting with Amit ji natural and delightful. The team was able to create enjoyable experiences like Amit ji reciting an interesting story from one of his movies, with various intonations reflecting the emotions in his voice.
Amit ji is the first bi-lingual celebrity voice on Alexa
Given that Mr. Bachchan is fluent and articulate in both Hindi and English, a bi-lingual experience was a key part of making interactions feel truly natural for customers. While the human mind switches seamlessly between languages, getting a speech recognition system to do the same is a highly complex task that required a lot of innovation. For instance, if you say, “Amit ji, ek chutkula sunaiye” you get a response in Hindi, while asking “Amit ji, tell me a joke” will elicit a response in English.
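The routing behaviour in the two examples above can be mimicked with a deliberately simple Python sketch. It is a toy and not how the production system works: it assumes the recognised text arrives as romanised words, and the hint vocabulary and the `guess_language` helper are hypothetical. A real system would rely on statistical language identification rather than a word list.

```python
# Hypothetical list of romanised Hindi words, used only for this illustration.
HINDI_HINTS = {
    "ek", "chutkula", "sunaiye", "kya", "hai", "kaise", "mujhe", "batao", "gaana",
}
# Wake word tokens carry no language signal, so they are ignored.
WAKE_WORD_TOKENS = {"amit", "ji"}


def guess_language(utterance: str) -> str:
    """Return "hi" if at least half of the content words look Hindi, else "en"."""
    tokens = [t.strip(".,?!").lower() for t in utterance.split()]
    content = [t for t in tokens if t and t not in WAKE_WORD_TOKENS]
    hindi_votes = sum(1 for t in content if t in HINDI_HINTS)
    if content and hindi_votes * 2 >= len(content):
        return "hi"
    return "en"


for request in ("Amit ji, ek chutkula sunaiye", "Amit ji, tell me a joke"):
    print(f"{request!r} -> respond in {guess_language(request)}")
```

The response language follows the language of the request, not the wake word, since “Amit ji” is the same in both cases.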