Latency Metrics

The Agent Debugger surfaces five per‑turn latency metrics that capture responsiveness from the moment a user stops speaking to when audio plays back. View them in the "Turn Latencies" tooltip. Of the five, End-to-End Turn Taking ★ is the most user‑perceptible.

Agent Debugger: Turn-by-turn view with the "Turn Latencies" tooltip.

Time to First Transcription

Time from when the user stops speaking until the first transcription arrives from STT. It measures pure STT processing latency after speech completes. This metric is independent of End‑to‑End Turn Taking and does not sum into it.

How it's measured: The timer starts when the user stops speaking (UserStoppedSpeakingFrame), and the value is recorded when the first TranscriptionFrame is received. The timer resets at turn end and session stop to prevent cross‑turn contamination. Negative latency values are discarded; if no transcription arrives before the turn ends, the metric is omitted for that turn.

Why it matters: Measures the STT service's actual processing speed, excluding user speaking time. Lower values indicate faster STT response, enabling quicker responses to users. This metric helps identify STT performance bottlenecks independent of speech duration.
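The measurement rules above can be sketched as a small timer. This is an illustration only, not the debugger's implementation: the class name `FirstTranscriptionTimer` and its methods are hypothetical, timestamps are passed in explicitly as seconds, and treating a negative sample as consumed (rather than waiting for a later transcription) is an assumption.

```python
from typing import Optional


class FirstTranscriptionTimer:
    """Hypothetical tracker for stop-speaking -> first-transcription latency."""

    def __init__(self) -> None:
        self._stopped_at: Optional[float] = None
        self._recorded = False

    def on_user_stopped_speaking(self, now: float) -> None:
        # Start (or restart) the timer when the user stops speaking.
        self._stopped_at = now
        self._recorded = False

    def on_transcription(self, now: float) -> Optional[float]:
        # Only the first transcription after the timer started is considered.
        if self._stopped_at is None or self._recorded:
            return None
        self._recorded = True
        latency = now - self._stopped_at
        # Safeguard: a negative value is skipped instead of being reported.
        return latency if latency >= 0 else None

    def reset(self) -> None:
        # Call at turn end and session stop so state never leaks across turns.
        self._stopped_at = None
        self._recorded = False


timer = FirstTranscriptionTimer()
timer.on_user_stopped_speaking(now=10.0)
print(timer.on_transcription(now=10.25))  # 0.25
print(timer.on_transcription(now=10.5))   # None (only the first one counts)
timer.reset()
print(timer.on_transcription(now=11.0))   # None (metric omitted for this turn)
```

Returning `None` models the "omitted for that turn" case, so a caller can leave the metric out of the tooltip instead of showing a misleading zero.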

Time to First Speech Event

Latency from the handler receiving the user input (TextEvent) to the first speech-producing event from the handler.

How it's measured: Captured on the first Text-to-Speech event produced by your handler each turn.

Why it matters: A good proxy for the time the LLM/handler spends on prompt processing and thinking before speaking starts. It is a component of End‑to‑End Turn Taking.

Time to First Audio

Time from TTS start to the first audio frame streamed to the listener.

How it's measured: Starts at TTS start; recorded on the first TTS audio frame for the turn.

Why it matters: Indicates TTS startup/streaming latency that affects perceived snappiness. Component of End‑to‑End Turn Taking.

End-to-End Turn Taking ★

The overall time from when the user stops speaking to the first audio frame streamed to the listener.

How it's measured: Timer starts when the user stops speaking (UserStoppedSpeakingFrame); ends at the first TTS audio frame emitted to the listener.

Why it matters: Represents perceived responsiveness after a user finishes talking. This is the most user‑perceptible metric. Approximate relation: End‑to‑End Turn Taking ≈ Time to First Transcription + Time to First Speech Event + Time to First Audio (+ small pipeline/transport overhead).
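To make the start and end points of these metrics concrete, the sketch below derives three of them from one turn's event timestamps. The `TurnTimestamps` record, its field names, and the `turn_latencies` helper are hypothetical and not part of any SDK. Omitting a metric when an endpoint is missing or the span is negative is an assumption carried over from the Time to First Transcription safeguards.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TurnTimestamps:
    """Hypothetical per-turn event times, in seconds on one monotonic clock."""
    user_stopped_speaking: float
    handler_received_input: float        # TextEvent reaches the handler
    first_speech_event: Optional[float]  # handler's first speech-producing event
    tts_started: Optional[float]
    first_audio_frame: Optional[float]   # first TTS audio frame to the listener


def _elapsed(start: Optional[float], end: Optional[float]) -> Optional[float]:
    # Omit the metric if an endpoint is missing or the span would be negative.
    if start is None or end is None or end < start:
        return None
    return end - start


def turn_latencies(t: TurnTimestamps) -> Dict[str, Optional[float]]:
    return {
        "time_to_first_speech_event": _elapsed(t.handler_received_input, t.first_speech_event),
        "time_to_first_audio": _elapsed(t.tts_started, t.first_audio_frame),
        "end_to_end_turn_taking": _elapsed(t.user_stopped_speaking, t.first_audio_frame),
    }


turn = TurnTimestamps(
    user_stopped_speaking=0.0,
    handler_received_input=0.25,
    first_speech_event=0.75,
    tts_started=0.75,
    first_audio_frame=1.0,
)
print(turn_latencies(turn))
# {'time_to_first_speech_event': 0.5, 'time_to_first_audio': 0.25, 'end_to_end_turn_taking': 1.0}
```

In this made-up turn, the 0.25 s between the user stopping and the handler receiving input is counted by End-to-End Turn Taking but by neither of the other two metrics shown.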

Function Runtime

Duration of the handler's work for the turn (end-to-end function time).

How it's measured: Recorded when the turn ends and shown as the duration chip in the debugger.

Why it matters: Helps identify slow logic, blocking I/O, or long-running tool calls.
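When Function Runtime is high, timing individual steps inside the handler can show which one dominates. The sketch below uses only the Python standard library. The `timed` helper and the step names are hypothetical, and this is separate from how the debugger itself records the metric.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    # Record wall-clock duration of the enclosed block, even if it raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


durations: Dict[str, float] = {}

with timed("lookup", durations):
    time.sleep(0.01)  # stand-in for a database or tool call

with timed("format_reply", durations):
    reply = "Your order ships tomorrow."

# Print the slowest step first.
for label, seconds in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{label}: {seconds * 1000:.1f} ms")
```

`time.perf_counter()` is used because it is monotonic and is intended for measuring short durations.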

Tips to Improve Latency

Use streaming TTS and stream partial LLM responses (speak as you think).
Trim prompts and tool output; cache static opening lines with the TTS cache.
Choose STT models optimized for first-token speed if backchanneling is important.
Avoid long blocking I/O in your handler; make external calls concurrent when possible.
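As an illustration of the last tip, independent external calls can be awaited together with `asyncio.gather` so the handler waits roughly as long as the slowest call rather than the sum of all calls. The functions `fetch_profile`, `fetch_orders`, and `gather_context` are placeholders invented for this sketch, with `asyncio.sleep` standing in for network latency.

```python
import asyncio
from typing import List, Tuple


async def fetch_profile(user_id: str) -> dict:
    # Placeholder external call; the sleep simulates network latency.
    await asyncio.sleep(0.2)
    return {"user_id": user_id, "name": "Ada"}


async def fetch_orders(user_id: str) -> List[str]:
    # Second placeholder call that does not depend on the first.
    await asyncio.sleep(0.2)
    return ["order-1", "order-2"]


async def gather_context(user_id: str) -> Tuple[dict, List[str]]:
    # Both coroutines run concurrently; results come back in argument order.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
    )
    return profile, orders


if __name__ == "__main__":
    print(asyncio.run(gather_context("u-123")))
```

This only helps when the calls are independent. If one call needs the result of another, they still have to run in sequence.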