The tricky thing with writing anything about the future of voice intelligence is that the future is already here. Technology that felt like science fiction five years ago is in pilot programs and labs today.
With that in mind, we’re less focused on what exactly the products coming into the mainstream will look like, and more on what they mean for voice intelligence broadly: what we’re worried about, what challenges we’re solving for, and what we think the next few years might feel like from the front lines.
First, what is not going to change:
Voice is the first-class communication method
People want to be understood and trusted
People will talk to people and machines
Now, what is inevitable:
Voice and voiceprint will be used to bolster trust
Synthetic voices will be used for new purposes, some good and some not so good
Commercial benefits from voice will grow beyond the call center
The most important thing we’re focused on, woven into our tech and our development, is serving the human. Whether we’re talking about protecting against fraud, providing better and more personalized interactions, or even making smarter IVR systems, we’re prioritizing the experience on the human side of every interaction.
That’s shaped our approach to voiceprint, how we go beyond it to safeguard against fraud, and what we think comes next.
What’s Next for Voiceprint
Over the last decade, voiceprint has become a fixture in fraud prevention and identity verification. It’s come a long way, finding its way into more systems, more workflows, more teams. Of course, voiceprint has its issues in practice, which we’ve explored in a previous article.
The major hurdle to voiceprint tools getting better is that they’re limited by their medium. Call centers are bound by the quality of the telephone system, where the audio is compressed and lossy, and customer-side background noise is sometimes unavoidable.
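To make that constraint concrete, here’s a minimal sketch, in Python with scipy, of what a telephone channel does to a voice signal before any voiceprint system hears a word. The function and parameters are our own illustration, not a product API:

```python
import numpy as np
from scipy import signal

def simulate_phone_channel(audio: np.ndarray, source_rate: int = 16000) -> np.ndarray:
    """Roughly approximate landline degradation: band-limit to the
    ~300-3400 Hz telephone passband, then downsample to telephony's
    8 kHz rate. Real channels add lossy codec compression (e.g. G.711)
    and line noise on top of this."""
    # Band-pass filter to the classic narrowband telephone range.
    sos = signal.butter(4, [300, 3400], btype="bandpass",
                        fs=source_rate, output="sos")
    narrowband = signal.sosfilt(sos, audio)
    # Resample from the source rate down to 8 kHz.
    return signal.resample_poly(narrowband, up=8000, down=source_rate)
```

Everything above roughly 3.4 kHz, including much of the consonant detail that helps distinguish one voice from another, is gone before analysis even begins.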
If we were to use fingerprints as an analogy, we’d say most voiceprints are more like latent prints than exemplar prints.
A latent print is what gets left behind on a glass or doorknob. Partial, smudged, sometimes overlapping with others. If you’ve ever watched a crime drama, these are the prints that tie the suspect to the murder weapon. Much like with voiceprint, if a latent is all you have, you’re going to try and make it work.
An exemplar, on the other hand, is the cleanest version of a print: the fingerprints you give during a background check, for example. They’re deliberate, high-quality, and taken in controlled conditions.
Most voiceprint systems are trying to compare latent to latent using noisy, compressed phone audio from real-world calls. And when fraudulent callers deliberately add interference, they’re adding even more ‘smudges’.
Detecting fraudulent callers becomes faster and easier with an exemplar voiceprint. But customers aren’t coming into the studio and recording an exemplar for use on all their accounts. The original voiceprint on the account is also limited by the phone system.
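To illustrate the comparison itself, here’s a hedged sketch of the standard embedding-plus-threshold approach most speaker-verification systems use. The function names and threshold are hypothetical placeholders, not our production pipeline:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_same_speaker(enrolled: np.ndarray, live_call: np.ndarray,
                    threshold: float = 0.7) -> bool:
    """Compare the enrollment voiceprint (latent #1) against the live
    call's voiceprint (latent #2). Noise and compression on either side
    drag genuine scores down, so the threshold has to sit low enough to
    tolerate degraded audio, and that slack is exactly where fraudsters
    operate."""
    return cosine_similarity(enrolled, live_call) >= threshold
```

With a clean exemplar on the enrollment side, genuine scores cluster higher, the threshold can rise, and that slack shrinks.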
Finding ways to capture or approximate an exemplar within a call center, or within the constraints of the phone system, is the next big jump for improving voiceprint. We’re exploring a few approaches that we’re not quite ready to share yet, but that’s the direction we think voiceprint is heading.
Synthetic Voices and Their Uses
At the same time, a significant shift in the voice landscape is happening right now. Synthetic voices are becoming more common across industries, used by both people and systems on either end of the call. While this raises obvious concerns around trust and fraud, we’re also seeing more legitimate uses for synthetic speech, which increases the need for detection to keep up with the latest advancements.
Most of the discourse today on synthetic voice is focused on the specific threat of deepfakes. Deepfakes are only one type of synthetic voice, and certainly not the most prevalent.
Synthetic voice is a broad category that covers any machine-generated voice, whether it’s your smart assistant reading the weather or a bot answering customer support questions. It’s not pretending to be a specific person, just a voice that’s generated. Think Siri, Alexa, or your bank’s IVR system.
Let’s define some of the specific terms and uses involved, with a few examples:
Voice Clone
A voice clone is a synthetic voice designed to sound like a specific person with their consent. There are plenty of legitimate reasons to do this: someone with a degenerative disease preserving their voice, a voicemail greeting that sounds more personal, or an actor licensing their voice for dubs.
These are often watermarked with a kind of digital signature that helps identify them as approved and traceable.
Deepfake (or Deceptive Clone)
A deepfake uses a synthetic voice to impersonate someone without their consent, often for the purpose of fraud or manipulation. Deceptive clones are the threat most people worry about.
At least currently, real-world deepfake attacks against human agents in call centers are rare. The effort required to mount one is high, the tech still isn’t good enough, and the delay is too long to produce natural-sounding responses when agents ask unexpected questions (which is the oldest trick in the book).
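A back-of-the-envelope sketch shows why the delay matters. The stage latencies below are purely illustrative, not measurements, but the shape of the sum is the point:

```python
# Hypothetical stage latencies, in seconds, for a real-time voice
# deepfake answering an unexpected question. Illustrative numbers only.
pipeline_seconds = {
    "transcribe the agent's question": 0.4,
    "decide on a response (operator or model)": 1.0,
    "synthesize the cloned reply": 0.6,
}

total = sum(pipeline_seconds.values())
print(f"Response delay: ~{total:.1f}s")
# Roughly 2 seconds, several times the sub-second pause people expect
# in live conversation, and long enough to make an agent suspicious.
```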
Legitimate Uses
We’re seeing big increases in legitimate uses of synthetic voices as a whole, which is why, as we continue to improve detection for deceptive clones and modulators, we’re also thinking about how to identify and allow those uses.
Legitimate uses enable better access and smoother interactions, or just buy us a little time. Especially in cases where a use is authorized (and carries an appropriate watermark), we want to ensure that fraud detection doesn’t block the caller from accomplishing their goal.
For instance, if a caller has ALS or throat cancer and is using an authorized reproduction of their voice to interface with the agent, we don’t want to prevent them from accessing their account. We also don’t want to flag a caller simply because their smart assistant called to schedule an appointment with a financial advisor (or the advisor’s smart assistant).
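In code, that policy might look something like the sketch below. The detector inputs, watermark check, and decision names are hypothetical, meant to show the flow rather than our production logic:

```python
from enum import Enum, auto

class CallDecision(Enum):
    ALLOW = auto()             # ordinary human caller
    ALLOW_SYNTHETIC = auto()   # authorized synthetic voice, let it proceed
    FLAG_FOR_REVIEW = auto()   # synthetic audio with no valid watermark

def triage_caller(is_synthetic: bool, watermark_is_valid: bool) -> CallDecision:
    """Route a call using synthetic-voice detection plus watermark
    verification, so an authorized clone (say, a caller with ALS using
    their preserved voice) is never blocked, while an unwatermarked
    synthetic voice still gets flagged."""
    if not is_synthetic:
        return CallDecision.ALLOW
    if watermark_is_valid:
        return CallDecision.ALLOW_SYNTHETIC
    return CallDecision.FLAG_FOR_REVIEW
```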
As we build for the future, we want to allow for the real human uses of synthetic voice while also improving detection of bad actors.
Going Beyond the Call Center
Right now, a lot of our products are designed for call centers. But the real value of voice intelligence doesn’t stop there.
If voice is the common interface to the world, there’s no reason to limit it to one environment. The same bio-signals we use to verify identity in a phone call could just as easily be used in:
Your car’s infotainment system to recognize who’s speaking, verify that the driver is authorized, or detect signs of intoxication
Personal voicebots that step in when someone’s unavailable (without pretending to be them)
Live call screening where transcription or synthetic voice helps manage incoming calls in real time
Personal check-ins with elderly patients, listening for changes in vocal health or stress that can surface medical issues early and remotely
Intelligent voicemail systems that have context to greet callers or help schedule follow-ups
Assistive tech or smart assistants that can adapt to who’s speaking, how they’re doing, and what kind of response is appropriate in the moment
In call centers and beyond, we’re seeing so many opportunities to take what we can read from bio-signals and improve the experience of the human speaking.
The future of voice intelligence is here, and we’re already building it. If you're thinking about the future too, drop us a line. We'd love to chat.