## Overview
Loquent supports multiple AI voice providers for realtime phone conversations. Each provider connects via WebSocket, streams bidirectional audio, and handles tool/function calling — all behind a shared trait interface.
## Provider Architecture

The agent module defines two provider-agnostic traits:
- `RealtimeSessionSender` — sends audio deltas and tool responses to the provider
- `RealtimeSessionReceiver` — receives events (audio, speech detection, tool calls) as canonical `RealtimeInEvent` variants
Each provider module implements these traits. The call pipeline doesn’t know which provider is active — it works with the trait interface only.
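As a sketch, the trait interface might look like the following. The trait names and the `RealtimeInEvent` variants come from this page; the method names and signatures are illustrative assumptions, not Loquent's actual API:

```rust
// Illustrative sketch only: method names and signatures beyond the
// trait/enum names documented on this page are assumptions.

/// Canonical events a provider's receiver emits (see the event-mapping table).
pub enum RealtimeInEvent {
    AudioDelta { delta: String }, // base64 µ-law chunk for Twilio
    SpeechStarted,                // user barge-in
    ResponseStarted,
    ToolCall { call_id: String, function_name: String, arguments_json: String },
    Unknown,
}

/// Sends audio deltas and tool responses to the provider.
pub trait RealtimeSessionSender {
    fn send_audio_delta(&mut self, mulaw_b64: &str) -> Result<(), String>;
    fn send_tool_response(&mut self, call_id: &str, output_json: &str) -> Result<(), String>;
}

/// Receives provider events as canonical `RealtimeInEvent` variants.
pub trait RealtimeSessionReceiver {
    fn next_event(&mut self) -> Option<RealtimeInEvent>;
}
```

Because the pipeline holds only trait objects (or generics bounded by these traits), adding a new provider means implementing two traits, with no changes to the call pipeline itself.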
Twilio (µ-law audio) ↔ Call Pipeline ↔ RealtimeSession{Sender,Receiver} ↔ Provider WebSocket (OpenAI or Gemini)

## Supported Providers

| Provider | Module | Connection | Audio Format |
|---|---|---|---|
| OpenAI | mods/openai | Direct WebSocket to api.openai.com | µ-law 8kHz (native) |
| Gemini | mods/gemini | Vertex AI WebSocket (OAuth2 JWT auth) | PCM16 24kHz (transcoded from µ-law) |
## Event Mapping

Both providers map their native events to the same `RealtimeInEvent` enum:
| `RealtimeInEvent` | Description |
|---|---|
| `AudioDelta { delta }` | Base64 µ-law audio chunk for Twilio |
| `SpeechStarted` | User started speaking (barge-in) |
| `ResponseStarted` | Model began generating a response |
| `ToolCall { call_id, function_name, arguments_json }` | Function call request |
| `Unknown` | Unhandled provider-specific events |
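To make the mapping concrete, here is a hedged sketch of how a provider module might translate native event-type strings into canonical variants. The OpenAI Realtime event names shown (`response.audio.delta`, `input_audio_buffer.speech_started`, `response.created`) are from its public API, but the function itself and the payload handling are illustrative; the enum is abbreviated and repeated so the sketch is self-contained:

```rust
// Sketch: translating OpenAI Realtime event-type strings into canonical
// variants. JSON payload parsing is elided; the audio delta would be
// extracted from the event body in a real implementation. ToolCall is
// omitted here for brevity.

#[derive(Debug, PartialEq)]
pub enum RealtimeInEvent {
    AudioDelta { delta: String },
    SpeechStarted,
    ResponseStarted,
    Unknown,
}

pub fn map_openai_event(event_type: &str, audio_b64: &str) -> RealtimeInEvent {
    match event_type {
        // Model audio chunk, forwarded to Twilio as base64 µ-law.
        "response.audio.delta" => RealtimeInEvent::AudioDelta { delta: audio_b64.to_string() },
        // Server-side VAD detected the caller speaking: triggers barge-in.
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        // Everything else surfaces as Unknown rather than an error.
        _ => RealtimeInEvent::Unknown,
    }
}
```

Mapping unrecognized events to `Unknown` instead of failing keeps the pipeline resilient when a provider adds new event types.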
## Key Differences

| Aspect | OpenAI | Gemini |
|---|---|---|
| Auth | API key header | Service account JWT → OAuth2 token |
| Audio codec | µ-law native (no transcoding) | PCM16 24kHz (requires µ-law ↔ PCM16 transcoding) |
| Tool response correlation | `call_id` only | `function_name` + optional `id` |
| Session priming | `response.create` event | Text turn with `"."` + `turn_complete: true` |
| WebSocket frames | Text only | Text and Binary (JSON as binary) |
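The transcoding step in the Gemini column is standard G.711. As a minimal sketch of the decode direction (the function names and the naive repeat-sample upsampling are illustrative, not Loquent's actual code):

```rust
// G.711 µ-law -> linear PCM16 decode (standard algorithm), followed by a
// naive 8 kHz -> 24 kHz upsample via sample repetition. Production code
// would use a proper resampler; this only sketches the data flow.

fn mulaw_decode(byte: u8) -> i16 {
    let u = !byte; // µ-law bytes are stored bit-complemented
    let exponent = ((u >> 4) & 0x07) as i16;
    let mantissa = (u & 0x0F) as i16;
    // Rebuild the magnitude with the implicit leading bit and 0x84 bias.
    let magnitude = ((mantissa << 3) + 0x84) << exponent;
    let sample = magnitude - 0x84;
    if u & 0x80 != 0 { -sample } else { sample }
}

fn mulaw_to_pcm16_24k(mulaw_8k: &[u8]) -> Vec<i16> {
    mulaw_8k
        .iter()
        .flat_map(|&b| {
            let s = mulaw_decode(b);
            [s, s, s] // repeat each 8 kHz sample 3x to reach 24 kHz
        })
        .collect()
}
```

The reverse direction (PCM16 → µ-law, for audio coming back from Gemini toward Twilio) is the same algorithm inverted, with a decimation step in place of the repetition.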