Skip to content

Overview

Loquent supports multiple AI voice providers for realtime phone conversations. Each provider connects via WebSocket, streams bidirectional audio, and handles tool/function calling — all behind a shared trait interface.

The agent module defines two provider-agnostic traits:

  • RealtimeSessionSender — sends audio deltas and tool responses to the provider
  • RealtimeSessionReceiver — receives events (audio, speech detection, tool calls) as canonical RealtimeInEvent variants

Each provider module implements these traits. The call pipeline doesn’t know which provider is active — it works with the trait interface only.

Twilio (µ-law audio) ↔ Call Pipeline ↔ RealtimeSession{Sender,Receiver} ↔ Provider WebSocket
OpenAI or Gemini
ProviderModuleConnectionAudio Format
OpenAImods/openaiDirect WebSocket to api.openai.comµ-law 8kHz (native)
Geminimods/geminiVertex AI WebSocket (OAuth2 JWT auth)PCM16 24kHz (transcoded from µ-law)

Both providers map their native events to the same RealtimeInEvent enum:

RealtimeInEventDescription
AudioDelta { delta }Base64 µ-law audio chunk for Twilio
SpeechStartedUser started speaking (barge-in)
ResponseStartedModel began generating a response
ToolCall { call_id, function_name, arguments_json }Function call request
UnknownUnhandled provider-specific events
AspectOpenAIGemini
AuthAPI key headerService account JWT → OAuth2 token
Audio codecµ-law native (no transcoding)PCM16 24kHz (requires µ-law ↔ PCM16 transcoding)
Tool response correlationcall_id onlyfunction_name + optional id
Session primingresponse.create eventText turn with "." + turn_complete: true
WebSocket framesText onlyText and Binary (JSON as binary)