Overview

Loquent supports multiple AI voice providers for realtime phone conversations. Each provider connects over a WebSocket, streams audio in both directions, and handles tool (function) calls, all behind a shared trait interface.

Provider Architecture

The agent module defines two provider-agnostic traits:

RealtimeSessionSender — sends audio deltas and tool responses to the provider
RealtimeSessionReceiver — receives events (audio, speech detection, tool calls) as canonical RealtimeInEvent variants

Each provider module implements both traits. The call pipeline works only through the trait interface, so it never needs to know which provider is active.

Twilio (µ-law audio) ↔ Call Pipeline ↔ RealtimeSession{Sender,Receiver} ↔ Provider WebSocket
                                              ↑
                                     OpenAI or Gemini
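
As a rough illustration of this layering, the sketch below shows what the two traits and a provider-independent pipeline helper might look like. It is not Loquent's actual code: the method names (`send_audio_delta`, `send_tool_response`, `next_event`), the `MockProvider` type, and the helper functions are hypothetical, the event enum is trimmed to two variants, and the traits are synchronous for brevity where the real ones are presumably async.

```rust
use std::collections::VecDeque;

/// Canonical inbound events, trimmed to two variants for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum RealtimeInEvent {
    AudioDelta { delta: String },
    SpeechStarted,
}

/// Sends audio deltas and tool responses to the provider.
/// Method names are hypothetical.
pub trait RealtimeSessionSender {
    fn send_audio_delta(&mut self, base64_audio: &str) -> Result<(), String>;
    fn send_tool_response(&mut self, call_id: &str, output_json: &str) -> Result<(), String>;
}

/// Yields canonical events, or `None` once the session has ended.
pub trait RealtimeSessionReceiver {
    fn next_event(&mut self) -> Option<RealtimeInEvent>;
}

/// In-memory stand-in for a provider (hypothetical, for illustration only).
#[derive(Default)]
pub struct MockProvider {
    pub sent: Vec<String>,
    pub inbound: VecDeque<RealtimeInEvent>,
}

impl RealtimeSessionSender for MockProvider {
    fn send_audio_delta(&mut self, base64_audio: &str) -> Result<(), String> {
        self.sent.push(format!("audio:{base64_audio}"));
        Ok(())
    }

    fn send_tool_response(&mut self, call_id: &str, output_json: &str) -> Result<(), String> {
        self.sent.push(format!("tool:{call_id}:{output_json}"));
        Ok(())
    }
}

impl RealtimeSessionReceiver for MockProvider {
    fn next_event(&mut self) -> Option<RealtimeInEvent> {
        self.inbound.pop_front()
    }
}

/// Pipeline-side helper: depends only on the sender trait, not on a provider.
pub fn forward_caller_audio(
    sender: &mut dyn RealtimeSessionSender,
    chunks: &[&str],
) -> Result<usize, String> {
    for chunk in chunks {
        sender.send_audio_delta(chunk)?;
    }
    Ok(chunks.len())
}

/// Pipeline-side helper: collects events until the receiver is exhausted.
pub fn drain_events(receiver: &mut dyn RealtimeSessionReceiver) -> Vec<RealtimeInEvent> {
    let mut events = Vec::new();
    while let Some(event) = receiver.next_event() {
        events.push(event);
    }
    events
}
```

Because both helpers take trait objects, swapping OpenAI for Gemini (or for a test double such as `MockProvider`) would require no change on the pipeline side.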

Supported Providers

Provider	Module	Connection	Audio Format
OpenAI	`mods/openai`	Direct WebSocket to `api.openai.com`	µ-law 8kHz (native)
Gemini	`mods/gemini`	Vertex AI WebSocket (OAuth2 JWT auth)	PCM16 24kHz (transcoded from µ-law)
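
To make the Gemini transcoding step concrete, here is a minimal sketch of G.711 µ-law ↔ PCM16 conversion with a 3× sample-rate change between 8 kHz and 24 kHz. The codec functions follow the standard G.711 µ-law algorithm. The resampling strategy (linear interpolation up, plain decimation down, no anti-alias filter) and all function names are assumptions for illustration; the actual `mods/gemini` implementation may differ.

```rust
/// Decode one G.711 µ-law byte into a 16-bit linear PCM sample.
pub fn mulaw_to_pcm16(byte: u8) -> i16 {
    let u = !byte; // µ-law bytes are transmitted bit-inverted
    let exponent = (u >> 4) & 0x07;
    let mantissa = (u & 0x0F) as i32;
    let magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    if u & 0x80 != 0 {
        (-magnitude) as i16
    } else {
        magnitude as i16
    }
}

/// Encode one 16-bit linear PCM sample as a G.711 µ-law byte.
pub fn pcm16_to_mulaw(sample: i16) -> u8 {
    const BIAS: i32 = 0x84;
    const CLIP: i32 = 32635;
    let sign: u8 = if sample < 0 { 0x80 } else { 0x00 };
    let s = (sample as i32).abs().min(CLIP) + BIAS;
    // The exponent is the position of the highest set bit, relative to bit 7.
    let mut exponent: u8 = 7;
    let mut mask: i32 = 0x4000;
    while exponent > 0 && (s & mask) == 0 {
        exponent -= 1;
        mask >>= 1;
    }
    let mantissa = ((s >> (exponent + 3)) & 0x0F) as u8;
    !(sign | (exponent << 4) | mantissa)
}

/// Upsample 8 kHz PCM to 24 kHz by linear interpolation (3 outputs per input).
pub fn upsample_8k_to_24k(input: &[i16]) -> Vec<i16> {
    let mut out = Vec::with_capacity(input.len() * 3);
    for (i, &sample) in input.iter().enumerate() {
        let cur = sample as i32;
        // The last sample has no successor, so it is simply held.
        let next = input.get(i + 1).copied().unwrap_or(sample) as i32;
        out.push(cur as i16);
        out.push((cur + (next - cur) / 3) as i16);
        out.push((cur + 2 * (next - cur) / 3) as i16);
    }
    out
}

/// Twilio to Gemini direction: µ-law 8 kHz bytes to PCM16 24 kHz samples.
pub fn mulaw_8k_to_pcm16_24k(mulaw: &[u8]) -> Vec<i16> {
    let pcm: Vec<i16> = mulaw.iter().map(|&b| mulaw_to_pcm16(b)).collect();
    upsample_8k_to_24k(&pcm)
}

/// Gemini to Twilio direction: keep every third sample, then µ-law encode.
/// Assumption: no low-pass filter is applied before decimation.
pub fn pcm16_24k_to_mulaw_8k(pcm: &[i16]) -> Vec<u8> {
    pcm.iter().step_by(3).map(|&s| pcm16_to_mulaw(s)).collect()
}
```

A production path would normally low-pass filter before decimating to avoid aliasing; the sketch omits that to keep the codec logic visible.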

Event Mapping

Both providers map their native events to the same RealtimeInEvent enum:

RealtimeInEvent	Description
`AudioDelta { delta }`	Base64 µ-law audio chunk for Twilio
`SpeechStarted`	User started speaking (barge-in)
`ResponseStarted`	Model began generating a response
`ToolCall { call_id, function_name, arguments_json }`	Function call request
`Unknown`	Unhandled provider-specific events
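
The table above could translate into an enum and a per-provider mapping function along these lines. This is a sketch, not the project's code: `OpenAiEvent` is a hypothetical pre-parsed view of a server message (a real implementation would deserialize JSON), and the event type strings are assumptions based on the OpenAI Realtime API that may differ between API versions.

```rust
/// Canonical event enum, with the variants listed in the table above.
#[derive(Debug, Clone, PartialEq)]
pub enum RealtimeInEvent {
    AudioDelta { delta: String },
    SpeechStarted,
    ResponseStarted,
    ToolCall {
        call_id: String,
        function_name: String,
        arguments_json: String,
    },
    Unknown,
}

/// Hypothetical, already-parsed view of one OpenAI server event.
#[derive(Debug, Default)]
pub struct OpenAiEvent<'a> {
    pub kind: &'a str,
    pub delta: Option<&'a str>,
    pub call_id: Option<&'a str>,
    pub name: Option<&'a str>,
    pub arguments: Option<&'a str>,
}

/// Map a provider-native event to the canonical enum.
/// Anything unrecognized or incomplete becomes `Unknown`.
pub fn map_openai_event(ev: &OpenAiEvent) -> RealtimeInEvent {
    match ev.kind {
        // Assumed event names; check them against the API version in use.
        "response.audio.delta" => match ev.delta {
            Some(d) => RealtimeInEvent::AudioDelta { delta: d.to_string() },
            None => RealtimeInEvent::Unknown,
        },
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => {
            match (ev.call_id, ev.name, ev.arguments) {
                (Some(id), Some(name), Some(args)) => RealtimeInEvent::ToolCall {
                    call_id: id.to_string(),
                    function_name: name.to_string(),
                    arguments_json: args.to_string(),
                },
                _ => RealtimeInEvent::Unknown,
            }
        }
        _ => RealtimeInEvent::Unknown,
    }
}
```

A Gemini mapper would have the same return type but read different native messages, which is what lets the call pipeline match on `RealtimeInEvent` alone.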

Key Differences

Aspect	OpenAI	Gemini
Auth	API key header	Service account JWT → OAuth2 token
Audio codec	µ-law native (no transcoding)	PCM16 24kHz (requires µ-law ↔ PCM16 transcoding)
Tool response correlation	`call_id` only	`function_name` + optional `id`
Session priming	`response.create` event	Text turn with `"."` + `turn_complete: true`
WebSocket frames	Text only	Text and Binary (JSON as binary)
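
The tool response correlation row is the difference most likely to leak into shared code, so a small sketch may help. The `Provider` and `Correlation` types and the `correlation_for` function are hypothetical, and treating an empty `call_id` as "Gemini supplied no id" is an assumption made only for this example.

```rust
/// Which provider the session is connected to.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    OpenAi,
    Gemini,
}

/// The key a provider needs in order to match a tool response to its call.
#[derive(Debug, Clone, PartialEq)]
pub enum Correlation {
    /// OpenAI: the `call_id` alone identifies the call.
    CallId(String),
    /// Gemini: the function name, plus an id when one was provided.
    NameAndId {
        function_name: String,
        id: Option<String>,
    },
}

/// Build the correlation key from the fields of a canonical `ToolCall` event.
pub fn correlation_for(provider: Provider, call_id: &str, function_name: &str) -> Correlation {
    match provider {
        Provider::OpenAi => Correlation::CallId(call_id.to_string()),
        Provider::Gemini => Correlation::NameAndId {
            function_name: function_name.to_string(),
            // Assumption: an empty call_id means the provider sent no id.
            id: if call_id.is_empty() {
                None
            } else {
                Some(call_id.to_string())
            },
        },
    }
}
```

Keeping both `call_id` and `function_name` on the canonical `ToolCall` event is what would allow either provider to build its key without the pipeline knowing which one is in use.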