奶糖直播

"Alexa, How Do You Work?"

奶糖直播 experts shed light on the inner workings of voice assistant technology and how it has evolved

Voice assistants go by many names: Alexa, Siri, Bixby and Google Assistant, to name just a few. And they take on many forms, as a built-in feature of an ever-increasing list of devices, including smartphones, computers, tablets, smart speakers, gaming devices, TV remotes and even vehicles.

As the availability and capability of conversational artificial intelligence (AI) have grown over the past decade, more and more users have come to rely on voice assistants to search for information, execute tasks on their behalf and quickly answer any number of questions. In 2022, an estimated 142 million Americans, nearly half the country's population, used voice assistants at least once a month. So how exactly do they work?

奶糖直播 Magazine turned to four experts on the 奶糖直播 faculty to explain the technology behind voice assistants, how far these tools have come, where they are headed and the implications for users' privacy.

MEET THE EXPERTS


Stephen Andriole, PhD

Thomas G. Labrecque Endowed Chair in Business, 奶糖直播 School of Business


Dr. Andriole teaches artificial intelligence, machine learning and generative AI, and his areas of research include automation, digital transformation and business technology strategy. An industry and government consultant on all aspects of digital technology, Dr. Andriole is also a go-to media source for all things related to the future of technology and business. He is the author of the recent The Digital Playbook: How to Win the Strategic Technology Game.


Grant Berry, PhD

Assistant Professor of Spanish Linguistics, College of Liberal Arts and Sciences


Before joining the faculty at 奶糖直播 in 2020, Dr. Berry worked with Amazon Alexa as a language engineer, technical program manager and applied scientist, developing new features for Alexa and launching Alexa in new languages. As director of the Language Use and Variation Lab in the Cognitive Science Program, he continues to research and investigate the relationship between linguistics, language technology optimization and machine learning.


Xue Qin, PhD

Assistant Professor of Computing Sciences, College of Liberal Arts and Sciences


Earlier this year, Dr. Qin received a grant from the National Science Foundation to investigate how test code can be adapted to develop voice assistant features in mobile applications. She teaches courses in applied machine learning and in algorithms and data structures. Her research interests focus on software engineering, privacy and security, and natural language processing.


Brett Frischmann, JD

Charles Widger Endowed University Professor in Law, Business and Economics, Charles Widger School of Law


A renowned scholar in intellectual property and internet law, Professor Frischmann is a leading source on issues related to surveillance, technology policy and intellectual property. He teaches interdisciplinary courses at the intersection of law, economics, business, ethics and technology and is also an affiliate scholar at Stanford Law School's Center for Internet and Society.

The Rise of the Voice Assistant

A voice assistant is a digital application or device that interprets and responds to spoken commands or questions from users.

"The technology has evolved rapidly and become increasingly sophisticated," says Xue Qin, PhD, assistant professor of Computing Sciences. "The popularity of voice assistants has skyrocketed in recent years, with users able to perform tasks faster and more efficiently through simple voice commands."

These improvements in conversational AI have led to expanded applications for voice assistants, including:

  • Accessibility (providing an alternative method of accessing information and performing tasks for those with visual impairments or limitations in mobility)
  • Information retrieval (playing music, making a phone call, checking news and weather)
  • Home automation (turning on lights, adjusting the temperature)
  • Shopping (adding items to a shopping list, purchasing services and goods online)
  • Health and wellness (guided meditations, exercise routines and reminders to take medication)

Voice assistants can now lend hands-free support in nearly every aspect of daily life, but it didn't happen overnight.

"The concept of a disembodied voice that can interact with and complete tasks for the device owner isn't new," says Grant Berry, PhD, director of the Language Use and Variation Lab and an assistant professor of Spanish Linguistics. "Whether you're talking about Rosie the Robot on The Jetsons or the computer on Star Trek: The Next Generation, voice assistants have been part of popular culture for decades. It's only in the last 15 years that they've moved from science fiction to reality."

The technology behind voice assistants has been in development much longer. Decades before Alexa, Siri and Google Assistant became household names, there was IBM Shoebox, the very first digital speech recognition tool. Released in 1961, it was able to recognize 16 words and digits. It's a far cry from the capabilities of today's voice assistants, but it laid a solid foundation for the technology now available.

The real turning point for voice assistants came in 2011 when Apple added Siri to the iPhone 4s. For the first time, millions of people had access to a voice assistant right in the palm of their hand. "Since then, functionality has improved, integrations with third-party apps have increased and applications in various industries have broadened," Dr. Berry says.

Just Say the Word

Voice assistants don't typically take very long to fulfill a spoken command, but several steps and software components work together to make that happen, chiefly automated speech recognition and natural language processing technologies.

Natural language processing is an umbrella term for two key areas: natural language understanding (hearing) and natural language generation (speaking).

Dr. Qin explains the key points of the process like this: The voice assistant listens via a microphone for its wake word (a phrase that lets the device know a request is coming); "translates" the user's spoken command or question to text through automated speech recognition and natural language understanding; performs tasks by executing predesigned code; and talks back by using AI technology called neural text-to-speech (a form of natural language generation).
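The four stages Dr. Qin describes can be sketched as a small pipeline. This is a toy illustration, not how any commercial assistant is built: the transcript strings stand in for the output of automated speech recognition, the keyword matching stands in for natural language understanding, and every name here (WAKE_WORD, Intent, handle_utterance and the rest) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Optional

WAKE_WORD = "alexa"  # assumption: a single, fixed wake word


@dataclass
class Intent:
    """What the assistant thinks the user wants, plus any details (slots)."""
    name: str
    slots: dict = field(default_factory=dict)


def heard_wake_word(transcript: str) -> bool:
    """Stage 1: react only when the utterance starts with the wake word."""
    return transcript.lower().startswith(WAKE_WORD)


def understand(transcript: str) -> Intent:
    """Stage 2: toy language understanding based on keyword matching."""
    command = transcript.lower()[len(WAKE_WORD):].strip(" ,.?!")
    words = command.split()
    if "time" in words:
        return Intent("get_time")
    if "light" in words:
        return Intent("set_light", {"state": "on" if "on" in words else "off"})
    return Intent("unknown")


def execute(intent: Intent) -> str:
    """Stage 3: run predesigned code for the recognized intent."""
    if intent.name == "get_time":
        return "Here is the current time."  # placeholder, no real clock lookup
    if intent.name == "set_light":
        return f"Okay, the light is {intent.slots['state']}."
    return "Sorry, I didn't catch that."


def speak(text: str) -> str:
    """Stage 4: stand-in for text-to-speech; returns what would be spoken."""
    return f"[spoken] {text}"


def handle_utterance(transcript: str) -> Optional[str]:
    """Chain the stages; stay silent if the wake word was not heard."""
    if not heard_wake_word(transcript):
        return None
    return speak(execute(understand(transcript)))


print(handle_utterance("Alexa, turn off the light"))
```

Without the wake word, handle_utterance returns None, which mirrors the device ignoring speech that is not addressed to it.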

Dr. Berry has an insider's knowledge of the inner workings of voice assistants. Before joining the faculty at 奶糖直播 in 2020, he employed his skills in linguistics and understanding language variation as a language engineer, technical program manager and applied scientist for Amazon Alexa. In addition to supporting the launch of Alexa in Hindi, Portuguese and Arabic, Dr. Berry worked on household-related applications of the technology and interfaces with smart home devices.

"Even a simple command like 'turn off the light' requires a lot of different levels of understanding that we may often take for granted," says Dr. Berry. The voice assistant has to understand:

  • What a light is
  • What "off" is (and alternately, what "on" is)
  • That "turn" is a verb that can mean "to rotate," but also means "to initialize" when it's paired with "on"
  • Which light the user is referring to
  • Regional variations of "turn off the light" (e.g., "turn out the light," "shut the light," and "close the light")

"When it comes to developing these programs, the language engineers have to think about what users are intending and all of the different possible ways they could get to that intention, and that's not a trivial task," Dr. Berry says. The voice assistant's natural language program has to be robust enough to filter out background noise and support quite a bit of variation in voices, including languages, dialects, accents, ages, genders, regional phrasing, pitch and volume.
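To make the idea of many phrasings leading to one intention concrete, here is a minimal sketch that maps the regional variants listed above onto a single canonical command. The phrase table and the function name parse_light_command are assumptions made for illustration; production systems rely on trained statistical models rather than a hand-written list like this.

```python
import re
from typing import Optional

# Hypothetical phrase table: each canonical state lists phrasings that mean it.
# The "off" variants are the regional examples mentioned in the article.
ACTION_PHRASES = {
    "off": ["turn off", "turn out", "shut", "close"],
    "on": ["turn on"],
}
DEVICE_WORDS = {"light", "lights"}


def parse_light_command(utterance: str) -> Optional[dict]:
    """Return a canonical intent for a light command, or None if unrecognized."""
    # Lowercase and drop punctuation so "Shut the light!" matches "shut the light".
    words = re.sub(r"[^a-z\s]", "", utterance.lower()).split()
    if not DEVICE_WORDS.intersection(words):
        return None  # the request is not about a light at all
    padded = " " + " ".join(words) + " "  # pad so phrases match whole words only
    for state, phrases in ACTION_PHRASES.items():
        for phrase in phrases:
            if f" {phrase} " in padded:
                return {"intent": "set_light", "state": state}
    return None


for text in ["Turn off the light", "Shut the light!", "close the light"]:
    print(text, "->", parse_light_command(text))
```

A table like this breaks down quickly (it misses "turn the light off," for example), which illustrates why Dr. Berry calls the task non-trivial.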

Once the voice assistant understands what the user is asking for, it uses speech-to-text conversion software to enter that request into the system.

If it doesn't understand the question or it needs more information to fulfill the task, the voice assistant will formulate follow-up questions and use text-to-speech software to ask for clarifications or more specifics. "Neural text-to-speech technology gives each of these assistants a voice, synthetic speech created from millions of curated training examples," Dr. Berry explains.

With the necessary information gathered, the voice assistant then answers the question by searching the internet or executes the task by connecting to built-in applications, like a calendar or clock, or authorized third-party applications, like subscription-based streaming services or even bank accounts.
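The follow-up-then-fulfill behavior described above can be sketched as a check for missing details before dispatching to a handler. Everything in this example is hypothetical: the intents, the required details, the wording of the questions and the handler functions are invented for illustration, and the handlers only return a sentence instead of controlling a real device or service.

```python
# Hypothetical table of the details each intent needs before it can run.
REQUIRED_SLOTS = {
    "set_timer": ["minutes"],
    "set_light": ["room", "state"],
}

# Hypothetical follow-up question to ask when a given detail is missing.
FOLLOW_UPS = {
    "minutes": "For how many minutes?",
    "room": "Which light do you mean?",
    "state": "Should it be on or off?",
}


def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes."


def set_light(room: str, state: str) -> str:
    return f"The {room} light is now {state}."


# Stand-ins for built-in or authorized third-party applications.
HANDLERS = {"set_timer": set_timer, "set_light": set_light}


def respond(intent: str, slots: dict) -> str:
    """Ask for the first missing detail, or run the task once all are known."""
    if intent not in HANDLERS:
        return "Sorry, I can't help with that yet."
    for slot in REQUIRED_SLOTS[intent]:
        if slot not in slots:
            return FOLLOW_UPS[slot]
    # Pass only the details the handler expects.
    kwargs = {slot: slots[slot] for slot in REQUIRED_SLOTS[intent]}
    return HANDLERS[intent](**kwargs)


print(respond("set_light", {"state": "off"}))
print(respond("set_light", {"state": "off", "room": "kitchen"}))
```

In the first call the assistant does not know which light is meant, so it asks; in the second it has what it needs and carries out the task.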

Smarter Every Day

Through each interaction, the voice assistant becomes better at understanding requests and providing more accurate responses. Sometimes a voice assistant will ask, "Did I answer your question?" or "Was that what you were looking for?"

"When I say, 'yes,' that's machine learning in action: I'm training the voice assistant," explains Stephen Andriole, PhD, Thomas G. Labrecque Endowed Chair in Business. "The voice assistant continues to expand its knowledge base with data it's getting from me and millions of other users."

And it's not just learning more about speech recognition; it's also learning more about the user. "It starts to get to know me personally, how I like to be communicated with, how I like to be addressed," Dr. Andriole says. "What's going to happen with these interfaces is that they will begin to interpret my intentions based on the questions I ask or the requests I make, and they will become anticipatory."

For instance, let's say a user asks every weekend about the weather at a local fishing spot, as well as surf reports and tide tables. After several interactions, the voice assistant may correlate these requests with the user's intention: to go fishing on the weekends at a particular location. And it may offer to package this information into a report that it delivers every Saturday morning.
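One simple way to picture this kind of correlation is to count how many different weeks each topic was requested on the same day of the week. The sketch below assumes a request log of (date, topic) pairs and a threshold of three weeks; both the log format and the function name recurring_topics are hypothetical, and real assistants would use far richer signals.

```python
from collections import defaultdict
from datetime import date


def recurring_topics(history, min_weeks=3):
    """Group topics the user asked about on the same weekday in several weeks.

    history is a list of (date, topic) pairs. The result maps a weekday number
    (Monday is 0, Saturday is 5) to the sorted topics requested on that weekday
    in at least min_weeks distinct weeks.
    """
    weeks_seen = defaultdict(set)  # (weekday, topic) -> set of (ISO year, ISO week)
    for day, topic in history:
        iso = day.isocalendar()
        weeks_seen[(day.weekday(), topic)].add((iso[0], iso[1]))

    bundles = defaultdict(list)
    for (weekday, topic), weeks in weeks_seen.items():
        if len(weeks) >= min_weeks:
            bundles[weekday].append(topic)
    return {weekday: sorted(topics) for weekday, topics in bundles.items()}


# Three consecutive Saturdays of fishing-related questions, plus one stray request.
history = [
    (date(2024, 1, day), topic)
    for day in (6, 13, 20)
    for topic in ("weather", "tides", "surf")
]
history.append((date(2024, 1, 9), "news"))

print(recurring_topics(history))
```

Here the three fishing-related topics recur on Saturdays, so they are candidates for the bundled Saturday-morning report, while the one-off news request is ignored.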

"That's the kind of voice interaction that's prompted by data that's collected over a period of time," Dr. Andriole says. Many users already experience this type of interaction if their voice assistant is integrated with a shopping application. It's watching what the user buys and when, and is able to infer when they're likely running low on something. Then it prompts the user: "I noticed that you may be running low on oatmeal. Would you like me to order that for you?"
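A rough sketch of how such a reorder prompt could be timed: estimate the usual gap between purchases of an item and prompt shortly before the next purchase would be expected. The averaging rule, the two-day lead time and the function names are assumptions for illustration only, not a description of any retailer's actual method.

```python
from datetime import date, timedelta


def next_expected_purchase(purchase_dates):
    """Estimate the next purchase date from the average gap between purchases."""
    if len(purchase_dates) < 2:
        return None  # not enough history to estimate a gap
    ordered = sorted(purchase_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    average_gap = sum(gaps) / len(gaps)
    return ordered[-1] + timedelta(days=round(average_gap))


def should_prompt(purchase_dates, today, lead_days=2):
    """Prompt once today is within lead_days of the expected purchase date."""
    expected = next_expected_purchase(purchase_dates)
    return expected is not None and today >= expected - timedelta(days=lead_days)


# Hypothetical oatmeal purchases, two weeks apart.
oatmeal = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 1, 29)]
print(next_expected_purchase(oatmeal))
print(should_prompt(oatmeal, today=date(2024, 2, 10)))
```

With purchases every 14 days, the next one is expected on February 12, so a prompt on February 10 falls inside the two-day lead window.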

"Phase one is the voice assistant providing simple answers to my questions, and phase two is when it starts to use that data to interpret and understand my intention and purpose. That's when it becomes more valuable to me," Dr. Andriole says. "It's akin to Netflix profiling me in terms of the films and TV shows I like; it's watching what I'm watching to make more helpful, useful recommendations for me."

There is, however, a trade-off involved: convenience at the expense of privacy. "How much of their privacy will users be willing to sacrifice for this kind of convenience?" Dr. Andriole asks. "That depends almost entirely on how useful and helpful it is for them."
