End-to-End Voice Agents: Combining Real-Time STT and Human-Level TTS for Natural Conversations

Thursday, December 18, 2025 at 5:00 pm (CET)

About 1 hour

About this event

Voice agents are finally good enough to deploy, but only if they can listen and respond the way humans do. In this webinar, Rime and Gladia show how pairing real-time speech-to-text (STT) with human-level text-to-speech (TTS) enables conversational, low-friction voice experiences.

We’ll break down the full voice pipeline (streaming audio in, streaming speech out), where latency and prosody make or break UX, and how to design agents that handle interruptions, emotion, and multilingual users without falling apart.

Expect practical guidance on architecture, tuning, and evaluation, plus a live walk-through of an end-to-end agent loop using Gladia for perception and Rime for generation.

You’ll learn how to:

Design a real-time “listen → reason → speak” architecture that feels natural, with round-trip latency under 1 second.
Use streaming STT partial transcripts together with diarization and custom vocabulary to improve intent capture and reduce agent hallucinations.
Choose and tune TTS for the moment: expressive voices for rapport vs. high-speed voices for scale.
Handle interruptions, turn-taking, and repairs (the agent saying “sorry, let me rephrase”) with tighter speech loops.
Evaluate voice agents with the right metrics: latency budget, word error rate (WER) vs. task success, prosody/CSAT, and multilingual robustness.
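
To make two of the metrics above concrete, here is a minimal Python sketch of word error rate and a latency-budget check. It is an illustration only, not Gladia's or Rime's tooling: the function names, the stage names (`stt`, `llm`, `tts`), and the sample numbers are hypothetical, and the 1000 ms default simply mirrors the under-1-second target mentioned above. WER is computed here as the word-level edit distance divided by the number of reference words, after lowercasing and splitting on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def within_latency_budget(stage_ms: dict[str, float],
                          budget_ms: float = 1000.0) -> bool:
    """Return True if the summed per-stage latencies fit the round-trip budget."""
    return sum(stage_ms.values()) <= budget_ms


if __name__ == "__main__":
    # Hypothetical transcript pair: one dropped word, one substituted word.
    print(word_error_rate("book a table for two", "book table for you"))

    # Hypothetical per-stage timings in milliseconds, not measurements.
    print(within_latency_budget({"stt": 200, "llm": 450, "tts": 250}))
```

A low WER does not guarantee that the agent completed the user's task, which is why the list above pairs it with task success; the two are best tracked side by side.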

Hosted by

External speaker

Lily Clifford CEO & Co-Founder @ Rime
Team member

Jean-Louis Queguiner CEO @ Gladia

Gladia

From async to live streaming, Gladia's API empowers your platform with accurate, multilingual speech-to-text and actionable insights.

Our users trust us to deliver fast and accurate transcriptions that can be easily scaled and integrated into existing tech stacks.
