Google Assistant’s Future Is Looking Us Right in the Face

Google says its voice assistant is getting more conversational, and that face unlocking is ready to replace wake words. But don't hold your breath.
Photograph: Nicole Morrison; Illustration: Elena Lacey

For years we've been promised a computing future where our commands aren't tapped, typed, or swiped, but spoken. Embedded in this promise is, of course, convenience; voice computing will be not only hands-free but genuinely helpful and rarely frustrating.

That hasn’t quite panned out. Voice assistant usage has climbed in recent years as more smartphone and smart home customers opt into (or, in some cases, accidentally “wake up”) the AI living in their devices. But ask most people what they use these assistants for, and the voice-controlled future sounds almost primitive, filled with weather reports and dinner timers. We were promised boundless intelligence; we got “Baby Shark” on repeat.

Google now says we’re on the cusp of a new era in voice computing, due to a combination of advancements in natural language processing and in chips designed to handle AI tasks. During its annual I/O developer conference today in Mountain View, California, Google’s head of Google Assistant, Sissie Hsiao, highlighted new features that are a part of the company’s long-term plan for the virtual assistant. All of that promised convenience is closer to reality now, Hsiao says. In an interview before I/O began, she gave the example of quickly ordering a pizza using your voice during your commute home from work by saying something like, “Hey, order the pizza from last Friday night.” The Assistant is getting more conversational. And those clunky wake words, i.e., “Hey, Google,” are slowly going away—provided you’re willing to use your face to unlock voice control.

Sissie Hsiao leads the Google Assistant team.

Photograph: Nicole Morrison

It’s an ambitious vision for voice, one that prompts questions about privacy, utility, and Google’s endgame for monetization. And not all of these features are available today, or across all languages. They’re “part of a long journey,” Hsiao says.

“This is not the first era of voice technology that people are excited about. We found a market fit for a class of voice queries that people repeat over and over,” Hsiao says. On the horizon are much more complicated use cases. “Three, four, five years ago, could a computer talk back to a human in a way that the human thought it was a human? We didn’t have the ability to show how it could do that. Now it can.”

Um, Interrupted

Whether or not two people speaking the same language always understand each other is probably a question best posed to marriage counselors, not technologists. Linguistically speaking, even with “ums,” awkward pauses, and frequent interruptions, two humans can understand each other. We’re active listeners and interpreters. Computers, not so much.

Google’s aim, Hsiao says, is to make the Assistant better understand these imperfections in human speech and respond more fluidly. “Play the new song from…Florence…and the something?” Hsiao demonstrated on stage at I/O. The Assistant knew that she meant Florence and the Machine. This was a quick demo, but one that’s preceded by years of research into speech and language models. Google had already made speech improvements by doing some of the speech processing on device; now it's deploying large language model algorithms as well.
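To get a feel for the idea (and only the idea: Google hasn't published the Assistant's pipeline, so this toy Python sketch, with an invented stop-word list and a made-up catalog, is mine, not theirs), here is how a disfluent query might be matched to an artist once the filler words are stripped away:

# Toy illustration of disfluency-tolerant matching; not Google's method.
STOPWORDS = {"play", "the", "a", "new", "song", "from", "and", "um", "uh", "something"}

def best_match(query, catalog):
    # Keep only the distinctive words the speaker actually got out.
    words = {w.strip(".,?!") for w in query.lower().split()} - STOPWORDS
    # Score each catalog entry by how many of those words it contains.
    return max(catalog, key=lambda name: len(words & set(name.lower().split())))

artists = ["Florence and the Machine", "Machine Gun Kelly", "Florida Georgia Line"]
print(best_match("Play the new song from... Florence... and the something?", artists))
# -> Florence and the Machine

Google's models do vastly more than count overlapping words, of course; the point is simply that a request doesn't have to be verbatim to be resolvable.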

Large language models, or LLMs, are machine-learning models built on giant text-based data sets that enable technology to recognize, process, and engage in more humanlike interactions. Google is hardly the only entity working on this. Perhaps the best-known LLM is OpenAI’s GPT-3, which has a sibling image generator, DALL-E. And Google recently shared, in an extremely technical blog post, its plans for PaLM, or Pathways Language Model, which the company claims has achieved breakthroughs in computing tasks “that require multi-step arithmetic or common-sense reasoning.” Your Google Assistant on your Pixel or smart home display doesn’t have these smarts yet, but it’s a glimpse of a future that passes the Turing test with flying colors.
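None of those production systems can be poked at directly, but the basic shape of an LLM is easy to demo with open source tools. A minimal sketch, assuming the Hugging Face transformers library is installed and using the small, freely downloadable GPT-2 model as a stand-in (PaLM and the Assistant's models aren't publicly callable this way):

from transformers import pipeline

# GPT-2 is tiny by modern standards and will happily get this wrong,
# which is rather the point about imperfection.
generator = pipeline("text-generation", model="gpt2")
prompt = "Q: If I have three apples and eat one, how many are left?\nA:"
print(generator(prompt, max_new_tokens=20, do_sample=False)[0]["generated_text"])

The model continues whatever text it's given, one predicted token at a time; everything that feels like "reasoning" emerges, or fails to, from that single trick.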

Hsiao also demoed a feature called Look and Talk, which eliminates the need to say “Hey Google” to the Nest Hub Max smart display—assuming you’re OK with Google using the device’s built-in camera to scan your face instead. If you walk into your kitchen and notice a leaky faucet, you could theoretically just look at your Nest Hub Max and then ask it to show a list of nearby plumbers.

This is part of a broader effort by Google to let you skip saying “Hey Google” altogether. Last fall, when the company introduced its Pixel 6 smartphone, it added support for “quick phrases,” so you could accept or decline a phone call or stop timers and alarms without having to say “Hey Google” first. Now, on the Nest Hub Max, you can program a short command like “Turn on the bedroom lights” as a quick phrase. The phrase essentially becomes both the wake word and the command.
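Conceptually, a quick phrase collapses the two-step "wake, then command" flow into a single lookup. A toy dispatcher in Python (the phrases and actions here are invented for illustration; this is not Google's API):

# Each registered phrase is its own trigger; no separate wake word needed.
QUICK_PHRASES = {
    "turn on the bedroom lights": lambda: print("bedroom lights: on"),
    "stop the timer": lambda: print("timer: stopped"),
}

def handle(utterance):
    action = QUICK_PHRASES.get(utterance.strip().lower())
    if action:          # Recognized phrase: act immediately.
        action()
        return True
    return False        # Anything else is ignored, as if unheard.

handle("Turn on the bedroom lights")  # -> bedroom lights: on

The hard part, which this sketch skips entirely, is deciding from raw audio whether a registered phrase was spoken at all.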

The face-scanning feature on Nest Hub Max is very likely to raise eyebrows (which I’m told will not affect the face scans). Hsiao said, more than once, that the feature is entirely opt-in; that it will only work at first on Google’s Nest Hub Max home display, which has a physical switch for disabling the camera; and that the software won’t work with someone else’s face, and thus won’t allow that person to make queries on the primary user’s behalf. For added privacy, the face scans are being processed on the device itself and not in Google’s cloud.

Still, all virtual assistants carry a privacy risk, both real and perceived. They rely on microphones that capture our voices, built-in radar sensors (like the one in the second-generation Nest Hub) that track our movements, or full-fledged camera sensors that capture our faces. Inherent to their usability is the promise that they get to know you. We give so much of ourselves in exchange for convenience. In this case, the convenience is not having to say “Hey, Google” out loud.

Hey Google, Are We There Yet?

Privacy questions aside, some of the technologies Hsiao is referring to have yet to make their way out of research land, as she puts it, and into mass-market consumer products. Totally conversational AI is here—but “here” might not be right in your hand just yet.

One example: Right now, when you ask Google Assistant to tell you a joke, those jokes are all scripted and vetted by real humans. Large language models are impressive, and also highly imperfect. They can write poetry; they can also be downright racist. So Google still uses human content moderators for some elements of its virtual assistant product. But humans, skin-and-bone beings with ideas and proclivities and the need to eat and sleep and stuff, aren’t “scalable” the way software is. Voice assistant technology may be passing more human-level intellect benchmarks than ever before, but applying it to products that could end up in millions or billions of hands, and having it work reliably for everyone who uses it, is a massive undertaking.

Bern Elliott, a vice president at Gartner Research who studies the use of virtual assistants in business environments, says that voice assistants are by no means static. “We’re seeing movement towards improved flows, more usability, and more advanced and sophisticated use cases,” Elliott says. Interactive voice assistants in business environments used to be overly simplistic; press one for service, press two for sales, and so on. Now they’re capable of much more complex conversations.

The consumer market is headed that way, Elliott believes, but it’s still very “one-shot—you know, ‘Alexa, what time is it,’ or ‘Siri, what’s my calendar for today?’”

Ads and Subtraction

And if Google Assistant exists as a voice means to a search end—the way, say, Google Lens uses augmented reality to identify products in the real world, thus leading you back to search—then the next inevitability for voice interaction seems to be monetization. When will Google Assistant serve up ads? It’s not a stretch when you consider that Hsiao, a nearly-16-year Google veteran, worked in the company’s display, video, and mobile app advertising units for several years before taking the lead on Assistant. She now oversees thousands of people, more than 2,000 of whom work on some facet of Google’s virtual assistant tech.

Hsiao says she doesn’t think it’s “inevitable” that Google Assistant will eventually serve up ads. Voice is not an obvious ad channel, she adds, and is “not how we envision the Assistant evolving.”

Plus, there’s the matter of scale: Google says that Assistant has over 700 million monthly users, up from 500 million two years ago. That’s small potatoes (Would you like to add “small potatoes” to your grocery list?) compared to the billions of searches that people type into the Google search box every single day. Hsiao didn’t say this explicitly, but her remarks on Google Assistant’s scale suggest that it’s just not big enough, at least not yet, to justify serving up potentially intrusive ads.

I continued to press Hsiao on her pizza delivery example: If someone uses voice search to order a pizza while driving home, couldn’t a merchant pay for prioritization in those voice search results? And wouldn’t that be, well, an ad? Hypothetically, yes, Hsiao says. But while ads are one potential model for monetization, they’re not necessarily the model. She insists her focus is “really on getting this product to be helpful and conversational and useful for people.”

Like a lot of evolutions in computing, the most significant changes in voice assistants might come gradually. They’re already happening. The building blocks are there. One day soon, Google Assistant users might wake up, peer into their Nest Hub Max, and have the Google Assistant at the ready, waiting for their command. The question—one that even Google’s artificial intelligence can’t answer—is whether they’ll trust Google with complicated conversations, or if they’ll just ask for the weather forecast that day. And again a day later. And the day after that.