Q&A: On Voice

By Bakken & Bæck and Philipp Gross

With the development and integration of new AI functionalities, the world of voice-driven products is rapidly opening up. As a technology-driven design studio, we’ve gained first-hand expertise on how voice can elevate and expand user experiences.

In a series of three articles, we’ll share some of our thoughts on voice technology from different perspectives. We’ll take a closer look at how it has evolved, how to design for voice-first products, and the potential and limitations of this mode of interaction, interviewing our colleagues working in development, product and branding.

We’re kicking off the series with BB’s Head of Data Science, Philipp Gross, who shares why voice technology appeals to machine learning engineers, what it takes to build a voice assistant, and how voice is gaining new momentum now that large language models have found theirs—and are being heard loud and clear around the world.

B.B.

What was the beginning of the relationship between machine learning and voice technology?

P.G.

Voice technology has a long history, drawing on multiple complex research fields, including machine learning, linguistics, and cognitive science. This combination makes it a fertile testing ground for scientists and technologists who aim to understand and mimic human communication.

As far back as the early 1960s, IBM introduced the Shoebox, a remarkable system for its time that could recognize 16 spoken words, including the digits zero through nine and a handful of arithmetic commands, all based on manually designed audio filter circuits. While this doesn’t compare to the voice assistants released over the last couple of years, it is an early example of a practical application of voice technology, and it marks one of the first attempts at making our interactions with technology more natural and inclusive.

You could say this was the start of what we now call Natural Language systems—systems that allow users to speak (and be spoken to) as they would in person or on the phone.

B.B.

What makes machine learning and voice technology an interesting combination?

P.G.

Voice communication involves a lot of ambiguity and context dependence. It is precisely these attributes that resonate with different generations of machine learning engineers. Understanding and generating human speech requires not only the recognition of sound patterns, but also the interpretation of meaning, context, emotion and intent.

For example, one of the oldest voice technologies is speech transcription—the conversion of spoken language into written text. To design such a system, you need a dataset of audio and text pairs, where the text serves as the ground truth: the expected output for a given input. It doesn’t matter how the system is built, as long as it shows the desired behaviour.

Now, every problem that can be framed as such a mapping is a natural fit for machine learning. You don’t have to figure out the exact rules that map specific sound patterns to particular letter sequences. Instead, you let the algorithm figure them out by training it on a sufficiently large dataset.
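
In code, that framing is quite compact. As a minimal sketch, assuming the Hugging Face transformers library and an off-the-shelf Whisper checkpoint (an arbitrary example, not any specific project setup), a trained transcription model is simply a learned function from audio to text:

```python
# Sketch only: a pretrained speech-recognition model is a learned mapping from
# audio to text, fitted on a large dataset of (audio, transcript) pairs.
# The model choice below is an arbitrary example.
from transformers import pipeline

# Load a model that has already learned the sound-to-text mapping.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Apply the learned mapping to a new recording it has never seen.
result = asr("recording.wav")  # path to a local audio file
print(result["text"])          # the predicted transcript
```

No one wrote down rules for which sound patterns become which letters; the training data did that work.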

B.B.

And how do you then apply this ‘ground truth’ to a product context? How does it fit within a user experience?

P.G.

To use an example from a couple of years ago: we worked on a voice bot for a smart home control system and developed a prototype, building the whole software stack from the ground up. We began with voice activity detection and transcription, so the system could detect what was said, and we also added components to identify the speaker’s tone of voice.

The task wasn’t focused on having long, existential conversations with the smart home device, but on using voice as an alternative way for the user to control the system. To ensure we were addressing those specific user needs, we developed additional outputs to enrich the user experience, such as notification sounds and subtle visual cues. This project was interesting because we realized that voice is just a different channel of communication, or a sensory device, that needs to be aligned with other modes of interaction to function correctly.
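
A simplified version of that flow looks roughly like the following; every function below is a hypothetical stub standing in for a real component, not the actual project code:

```python
# Hypothetical orchestration sketch; each helper is a placeholder stub, not the
# actual smart home stack described above.

def detect_voice_activity(frame: bytes) -> bool:
    return any(frame)                      # stub: real VAD inspects the audio signal

def transcribe(frame: bytes) -> str:
    return "dim the living room lights"    # stub: real speech-to-text model

def classify_tone(frame: bytes) -> str:
    return "neutral"                       # stub: real tone-of-voice classifier

def handle_audio_frame(frame: bytes) -> None:
    """Route one chunk of microphone audio through the voice pipeline."""
    if not detect_voice_activity(frame):
        return                             # ignore silence and background noise
    command = transcribe(frame)
    tone = classify_tone(frame)
    print(f"command: {command!r} (tone: {tone})")
    # Here the real system triggered the smart home action and confirmed it
    # with a notification sound and a subtle visual cue.

handle_audio_frame(b"\x01\x00" * 1600)     # e.g. 100 ms of 16 kHz, 16-bit audio
```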

B.B.

The underlying technology of voice applications has shifted from command-based systems to large multimodal speech models—what are the consequences of this shift?

P.G.

This shift has mostly opened up a much wider range of applications. To make voice technologies useful in our everyday lives, it’s not enough to perfectly transcribe voice to text; the system needs to be able to act on the information it gathers. Multimodal speech models have now entered the game, integrating audio, visual, and textual inputs to enhance speech recognition, understanding, and generation capabilities. They are facilitating this transition from somewhat robotic interactions to genuinely authentic conversations. The challenge now lies in integrating more complex actions.

Take in-car voice use as an example. When I use Siri while driving, it generally understands what I say, but often fails to act on it, usually due to missing integrations or limited access to real-time information. For instance, if I ask, “Is there any live jazz near me tonight?”, Siri struggles because answering requires browsing current event listings online. This isn’t just a matter of voice recognition—it demands real-time web access, contextual reasoning, and the ability to synthesize information from multiple sources. Without those capabilities, basic command-response systems fall short of delivering meaningful help.

B.B.

So what’s the next step to increase interactivity between systems?

P.G.

I don’t think the exact change that needs to happen is completely understood yet; it’s a work in progress. The traditional approach converts speech to text so that an LLM can process it. The LLM then generates a text response, which is converted back into speech and played to the user.
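
Spelled out as code, that cascade is roughly the following; this sketch assumes the openai Python SDK and its hosted transcription, chat, and speech endpoints, with model names as illustrative examples rather than recommendations:

```python
# Sketch of the 'traditional' cascade: speech -> text -> LLM -> text -> speech.
# Assumes the openai Python SDK and an OPENAI_API_KEY in the environment;
# the model names are illustrative examples only.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's audio.
with open("question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Text-to-text: let the LLM formulate a reply.
chat = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Text-to-speech: synthesize the reply for playback.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.read())
```

Every hop in that chain adds latency and throws away tone and rhythm, which is part of why the field is moving on.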

However, the world of AI moves quickly, and this approach is already considered ‘traditional’ because we’re witnessing the emergence of voice-native architectures known as speech-to-speech (STS) models. They bypass transcription and feed raw audio directly into a system that considers conversational history, tone, and rhythm to produce much more natural voice output. I believe these models will replace the traditional pipeline, as they can provide the low-latency responses required for real-time interactions.

A diagram of the evolution of voice technology.

B.B.

You’ve talked a lot about the integration of more complex actions—does this imply that software stacks need to change to make these integrations successful?

P.G.

Yes, the whole ecosystem has to change. If you want to operate a website efficiently by voice, it is not enough to have a system that reads out the visual content of each page and recites all the possible mouse actions you could choose from. Instead, the system must understand your intent and make decisions on your behalf. For that to work, it needs the right context, and until now this has required processing rendered website frames with large vision-capable language models.

That approach is computationally intensive and, in many cases, a waste of resources. However, with the creation of the Model Context Protocol (MCP) by Anthropic, there is now a clear path to providing application context to AI agents without the need for rendering. If websites implement this protocol, AI agents will act more effectively and, in turn, communicate by voice more efficiently.
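
As a rough sketch of what that looks like on the website’s side, assuming the official MCP Python SDK (whose exact API surface may differ between versions), a service could expose an action like this:

```python
# Sketch of exposing an application action via the Model Context Protocol.
# Assumes the official MCP Python SDK; the server name, tool, and return value
# are hypothetical placeholders.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("event-listings")  # hypothetical server for a concert-listings site

@mcp.tool()
def find_live_music(city: str, genre: str, date: str) -> list[str]:
    """Return matching concerts; a stub standing in for a real search backend."""
    return [f"{genre} night at a club in {city} on {date}"]

if __name__ == "__main__":
    mcp.run()  # an AI agent can now call find_live_music without rendering the site
```

An assistant handling the live-jazz question from earlier could then call such a tool directly instead of parsing rendered pages.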

B.B.

So, in your experience building voice products, how do you navigate the push and pull between the current reality and what people aspire to?

P.G.

When people start thinking about voice products, it’s usually from the perspective of how we want to interact with them. Therefore, it’s important to narrow down use cases—to take the perspective of the end user. What type of product are you after? Does this product require a command voice system that will understand the user in any setting, or does it need to allow for human-like interaction and conversation? First, you have to figure that out, then you can choose the technology.

Once you pick the technology, it brings its own restrictions that you have to seamlessly integrate into a well-designed UX and UI. Take voice activity detection as an example: when does the system start listening, and when does it start interpreting what you say? Most users don’t want the system to be listening all the time, so we now have specialized hardware chips in our phones that are triggered by specific wake phrases, such as “Hey Siri”. In short, you have to think about what you really want to achieve, and then choose your technologies around that. This is the tricky part, because you have to navigate between working within technological limitations and finding something that’s interesting and relevant to the user.
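
As a small illustration of that first question, assuming the open-source webrtcvad package (a software approximation of what a phone’s dedicated chip does on-device), frame-level voice activity detection looks roughly like this:

```python
# Minimal voice activity detection sketch using the webrtcvad package; this
# approximates in software what dedicated low-power wake chips do on-device.
import webrtcvad

vad = webrtcvad.Vad(2)            # aggressiveness 0-3; 2 is a middle ground
sample_rate = 16000               # webrtcvad supports 8, 16, 32, or 48 kHz
frame = b"\x00\x00" * 480         # 30 ms of silent 16-bit mono PCM at 16 kHz

if vad.is_speech(frame, sample_rate):
    print("speech detected: wake the heavier transcription pipeline")
else:
    print("silence: keep the expensive models asleep")
```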

B.B.

Do you think that voice lends itself to many different types of products, or is it quite limited in its application?

P.G.

I think it will apply to an ever-wider range of products in the future. But voice technology isn’t happening in a silo. Generative AI is advancing on all fronts—or modalities. Nowadays, models can understand and generate images, text, audio, and video simultaneously, and their performance is improving rapidly. Voice will be just one way to connect with agents equipped with those multimodal models, which will take on more and more everyday tasks in the near future.

B.B.

Lastly, which applications are you most excited about right now?

P.G.

Real-time conversational AI built on LLMs, delivered through voice-first interfaces, is a real game changer. It enables natural, conversational interactions with AI that sound human, support complex queries, provide near real-time feedback, and incorporate emotion, tone, and memory. This has the potential to make voice an equal alternative to screen-based interfaces. OpenAI’s GPT-4o Realtime model is just one example that is easy to integrate, and it will kick off interesting applications.
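
As a rough sketch of how low the barrier to entry is, assuming the websocket-client package and the Realtime API endpoint and event names as documented at the time of writing (both may change), a first text-only exchange looks like this:

```python
# Rough sketch of talking to OpenAI's Realtime API over a raw WebSocket.
# Assumes the websocket-client package; the endpoint, headers, and event names
# follow the public documentation at the time of writing and may have changed.
import json, os
from websocket import create_connection

ws = create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    header=[
        f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta: realtime=v1",
    ],
)

# Ask the model for a response; a real voice app would also stream microphone audio.
ws.send(json.dumps({
    "type": "response.create",
    "response": {"modalities": ["text"], "instructions": "Say hello in one sentence."},
}))

# Read server events until the response is complete.
while True:
    event = json.loads(ws.recv())
    if event["type"] == "response.text.delta":
        print(event["delta"], end="", flush=True)
    elif event["type"] == "response.done":
        break

ws.close()
```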