# OpenAI
The `openai` module connects to OpenAI's Realtime API for live voice sessions and provides two post-call transcription paths: standard transcription, and speaker-diarized transcription with AI-powered speaker identification.
## Realtime Sessions

When an agent uses the OpenAI provider, `create_openai_realtime_session` establishes a WebSocket connection:

- Build the URL: `wss://api.openai.com/v1/realtime?model=<model_id>`
- Load the API key from `core_conf` via `OpenAICoreConf`
- Connect with an `Authorization: Bearer <api_key>` header
- Send `session.update` with the full config (instructions, voice, VAD, tools)
- Send `response.create` to prime the model
- Return `(OpenAIRealtimeSessionSender, OpenAIRealtimeSessionReceiver)`
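The steps above can be sketched as follows. Only the URL and header formats come from this module; the helper function names here are illustrative, and the actual WebSocket handshake and event sends are elided:

```rust
// Sketch of the documented connection setup. The URL and header shapes are
// from the docs; `realtime_url` and `auth_header` are hypothetical helpers.
fn realtime_url(model_id: &str) -> String {
    format!("wss://api.openai.com/v1/realtime?model={model_id}")
}

fn auth_header(api_key: &str) -> String {
    format!("Authorization: Bearer {api_key}")
}

fn main() {
    let url = realtime_url("gpt-realtime-1.5");
    let header = auth_header("sk-...");
    // After the socket opens, the client sends `session.update` with the
    // full session config, then `response.create` to prime the model.
    assert_eq!(url, "wss://api.openai.com/v1/realtime?model=gpt-realtime-1.5");
    assert!(header.starts_with("Authorization: Bearer "));
}
```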
### Sender (`OpenAIRealtimeSessionSender`)

Implements `RealtimeSessionSender`:
| Method | What It Sends |
|---|---|
| `send_audio_delta(audio)` | `input_audio_buffer.append` with base64 audio |
| `send_tool_response(call_id, _, output)` | `conversation.item.create` (`function_call_output`) + `response.create` |
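As a sketch, the two documented payload shapes might look like this on the wire. The real sender serializes typed structs; here they are hand-formatted for illustration:

```rust
// Illustrative wire shapes only; field layout beyond the documented event
// types is an assumption.
fn append_audio_event(base64_audio: &str) -> String {
    format!(r#"{{"type":"input_audio_buffer.append","audio":"{base64_audio}"}}"#)
}

fn tool_response_events(call_id: &str, output: &str) -> [String; 2] {
    [
        // First: attach the tool result to the conversation...
        format!(
            r#"{{"type":"conversation.item.create","item":{{"type":"function_call_output","call_id":"{call_id}","output":"{output}"}}}}"#
        ),
        // ...then ask the model to respond to it.
        r#"{"type":"response.create"}"#.to_string(),
    ]
}

fn main() {
    let events = tool_response_events("call_123", "{}");
    assert!(events[0].contains("function_call_output"));
    assert!(append_audio_event("QUJD").contains("input_audio_buffer.append"));
}
```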
### Receiver (`OpenAIRealtimeSessionReceiver`)

Implements `RealtimeSessionReceiver`. Maps OpenAI events to `RealtimeInEvent`:
| OpenAI Event | → `RealtimeInEvent` |
|---|---|
| `response.output_audio.delta` | `AudioDelta` |
| `input_audio_buffer.speech_started` | `SpeechStarted` |
| `response.created` | `ResponseStarted` |
| `response.function_call_arguments.done` | `ToolCall` |
| Everything else (20+ event types) | `Unknown` |
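The mapping can be sketched as a simple `match` on the event's `type` discriminator (payload fields elided; the enum variants are the documented ones, carried without data here):

```rust
// Dispatch sketch; real events are full JSON payloads, only the "type"
// string is modeled.
#[derive(Debug, PartialEq)]
enum RealtimeInEvent {
    AudioDelta,
    SpeechStarted,
    ResponseStarted,
    ToolCall,
    Unknown,
}

fn map_event(event_type: &str) -> RealtimeInEvent {
    match event_type {
        "response.output_audio.delta" => RealtimeInEvent::AudioDelta,
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => RealtimeInEvent::ToolCall,
        _ => RealtimeInEvent::Unknown, // 20+ other event types fall through
    }
}

fn main() {
    assert_eq!(map_event("response.created"), RealtimeInEvent::ResponseStarted);
    assert_eq!(map_event("session.updated"), RealtimeInEvent::Unknown);
}
```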
## Session Configuration

```rust
pub struct OpenaiRealtimeSessionConfig {
    pub r#type: String,                 // "realtime"
    pub model: String,                  // "gpt-realtime-1.5", "gpt-realtime", or "gpt-realtime-mini"
    pub output_modalities: Vec<String>, // ["audio"]
    pub audio: OpenaiRealtimeSessionConfigAudio,
    pub instructions: String,
    pub tools: Option<Vec<OpenaiRealtimeTool>>,
}
```

### Models
| Model ID | Display Name | Description |
|---|---|---|
| `gpt-realtime-1.5` | GPT Realtime 1.5 | Latest model with improved reasoning and transcription accuracy (default) |
| `gpt-realtime` | GPT Realtime | Stable realtime voice model |
| `gpt-realtime-mini` | GPT Realtime Mini | Cost-efficient alternative |
New agents default to `gpt-realtime-1.5`. Existing agents keep their configured model.
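The default-model rule can be expressed as a small helper. This is a minimal sketch; the constant and function names are assumptions, not identifiers from the module:

```rust
// Hypothetical names illustrating the documented default-model rule.
const DEFAULT_REALTIME_MODEL: &str = "gpt-realtime-1.5";

fn effective_model(configured: Option<&str>) -> &str {
    // Existing agents keep their configured model; new agents get the default.
    configured.unwrap_or(DEFAULT_REALTIME_MODEL)
}

fn main() {
    assert_eq!(effective_model(None), "gpt-realtime-1.5");
    assert_eq!(effective_model(Some("gpt-realtime-mini")), "gpt-realtime-mini");
}
```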
### Voices

All 10 OpenAI Realtime voices are available. Marin and Cedar are recommended and shown first in the UI.
| Voice ID | Display Name | Note |
|---|---|---|
| `marin` | Marin (Recommended) | Default voice |
| `cedar` | Cedar (Recommended) | |
| `alloy` | Alloy | |
| `ash` | Ash | |
| `ballad` | Ballad | |
| `coral` | Coral | |
| `echo` | Echo | |
| `sage` | Sage | |
| `shimmer` | Shimmer | |
| `verse` | Verse | |
Audio input and output both use the `g711_ulaw` format (matching Twilio's µ-law stream). Output includes `voice` and `speed` settings; input includes optional `noise_reduction` and `turn_detection`.
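The audio section's shape can be sketched like this. Beyond the documented settings (`g711_ulaw`, `voice`, `speed`, `noise_reduction`, `turn_detection`), the struct names and field layout are assumptions:

```rust
// Hypothetical shape of the session's audio config; for illustration only.
struct AudioOutput {
    format: &'static str, // "g711_ulaw", matching Twilio's µ-law stream
    voice: &'static str,
    speed: f32,
}

struct AudioInput {
    format: &'static str, // also "g711_ulaw"
    noise_reduction: Option<&'static str>,
    turn_detection: Option<&'static str>,
}

fn main() {
    let output = AudioOutput { format: "g711_ulaw", voice: "marin", speed: 1.0 };
    let input = AudioInput { format: "g711_ulaw", noise_reduction: None, turn_detection: None };
    // Both directions use the same codec, so no transcoding is needed.
    assert_eq!(output.format, input.format);
    assert!(output.speed > 0.0 && !output.voice.is_empty());
}
```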
## Tool Definitions

```rust
pub struct OpenaiRealtimeTool {
    pub r#type: String,       // "function"
    pub name: String,         // e.g. "query_knowledge"
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}
```
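A hypothetical tool definition following this shape. A raw JSON Schema string stands in for `serde_json::Value` to keep the sketch dependency-free, and the schema body is illustrative:

```rust
// Sketch of a tool definition; the real struct stores `parameters` as a
// `serde_json::Value`.
struct ToolSketch {
    r#type: &'static str,
    name: &'static str,
    description: &'static str,
    parameters: &'static str, // JSON Schema, as a raw string here
}

fn query_knowledge_tool() -> ToolSketch {
    ToolSketch {
        r#type: "function",
        name: "query_knowledge",
        description: "Search the agent's knowledge base",
        parameters: r#"{"type":"object","properties":{"query":{"type":"string"}},"required":["query"]}"#,
    }
}

fn main() {
    let tool = query_knowledge_tool();
    assert_eq!(tool.r#type, "function");
    assert!(tool.parameters.contains("\"query\""));
}
```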
## Transcription

Post-call transcription supports two modes, controlled by the `USE_DIARIZED_TRANSCRIPTION` constant in `process_twilio_recording_util.rs`. It currently defaults to diarized (`true`).
### Standard Transcription

Uses `gpt-4o-transcribe` with a bilingual style-matching prompt. The prompt hints at English/Spanish domain vocabulary (appointments, scheduling, voicemail) so the model handles code-switching naturally. Language detection is automatic; no `language` parameter is set.
```rust
let response = transcribe_audio(recording_bytes, None, Some(prompt)).await?;
```

### Diarized Transcription
Uses `gpt-4o-transcribe-diarize` to produce speaker-labeled segments, then applies AI post-processing to identify which speaker is the agent and which is the caller.
The recording audio flows through three steps:

1. `transcribe_audio_diarized(bytes)`: runs `gpt-4o-transcribe-diarize` and returns segments with generic labels ("A", "B")
2. `identify_speakers(diarized, agent_name, agent_prompt)`: GPT-5 Mini (structured output) analyzes the transcript against the agent's system prompt, returns a mapping such as `{"A" → "Dr. Smith", "B" → "Caller"}`, and extracts the caller's first name if mentioned
3. `format_diarized_transcript(diarized, mapping)`: merges consecutive same-speaker segments and produces labeled lines

Output example:

```
Dr. Smith: Thanks for calling, how can I help?
John: Hi, I need to reschedule my appointment for next week.
Dr. Smith: Of course, let me check availability.
```
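The merge-and-label formatting step can be reconstructed as a sketch. The `Segment` type and the function body here are assumptions, not the module's actual code:

```rust
use std::collections::HashMap;

// Hypothetical reconstruction: apply the speaker mapping and merge
// consecutive same-speaker segments into one labeled line.
struct Segment {
    speaker: String, // generic label: "A", "B", ...
    text: String,
}

fn format_transcript(segments: &[Segment], mapping: &HashMap<&str, &str>) -> String {
    let mut lines: Vec<(String, String)> = Vec::new(); // (speaker, merged text)
    for seg in segments {
        let name = mapping
            .get(seg.speaker.as_str())
            .copied()
            .unwrap_or(seg.speaker.as_str()) // unmapped labels pass through
            .to_string();
        match lines.last_mut() {
            // Same speaker as the previous line: merge the segments.
            Some((last_name, text)) if *last_name == name => {
                text.push(' ');
                text.push_str(&seg.text);
            }
            _ => lines.push((name, seg.text.clone())),
        }
    }
    lines
        .into_iter()
        .map(|(name, text)| format!("{name}: {text}"))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let mapping = HashMap::from([("A", "Dr. Smith"), ("B", "John")]);
    let segments = [
        Segment { speaker: "A".into(), text: "Thanks for calling,".into() },
        Segment { speaker: "A".into(), text: "how can I help?".into() },
        Segment { speaker: "B".into(), text: "Hi, I need to reschedule.".into() },
    ];
    let out = format_transcript(&segments, &mapping);
    assert_eq!(out, "Dr. Smith: Thanks for calling, how can I help?\nJohn: Hi, I need to reschedule.");
}
```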
### Speaker Identification

`identify_speakers` sends the raw transcript (with generic labels) and the agent's system prompt to GPT-5 Mini with structured output. The model returns:
```rust
struct SpeakerIdentification {
    agent_speaker: String,       // e.g. "A"
    caller_name: Option<String>, // e.g. "John" (if mentioned)
}
```

The agent is identified by matching greeting patterns and system-prompt behavior. If the caller introduces themselves, their first name replaces the generic "Caller" label.
If speaker identification fails, the system falls back to the raw diarized text and logs a warning.
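A minimal sketch of that fallback, with hypothetical names:

```rust
// Hypothetical shape of the fallback: if identification fails, keep the
// raw diarized text rather than dropping the transcript.
fn final_transcript(raw_diarized: &str, identified: Option<String>) -> String {
    match identified {
        Some(labeled) => labeled,
        None => {
            eprintln!("warning: speaker identification failed; using raw diarized text");
            raw_diarized.to_string()
        }
    }
}

fn main() {
    assert_eq!(final_transcript("A: hello", None), "A: hello");
    assert_eq!(final_transcript("A: hello", Some("Agent: hello".into())), "Agent: hello");
}
```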
### Diarized Response Types

```rust
pub struct DiarizedTranscriptionResponse {
    pub text: String,
    pub duration: Option<f64>,
    pub segments: Option<Vec<DiarizedSegment>>,
}
```
```rust
pub struct DiarizedSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub speaker: String, // Generic label: "A", "B", etc.
}
```

## Configuration
Loaded from the `core_conf` table:
```rust
pub struct OpenAICoreConf {
    pub openai_api_key: String,
}
```

Used by both realtime session creation and transcription.
## Constants

| Constant | Value |
|---|---|
| `OPENAI_REALTIME_BASE_URL` | `wss://api.openai.com/v1/realtime` |
| `OPENAI_TRANSCRIPTION_URL` | `https://api.openai.com/v1/audio/transcriptions` |
## Module Structure

```
src/mods/openai/
├── constants/   # API URLs (realtime WebSocket, transcription)
├── services/    # create_openai_realtime_session
├── types/       # Realtime events, session config, transcription + diarized types
└── utils/       # transcribe_audio, transcribe_audio_diarized,
                 # identify_speakers, format_diarized_transcript
```