OpenAI

The `openai` module connects to OpenAI's Realtime API for live voice sessions and provides two post-call transcription paths: standard transcription and speaker-diarized transcription with AI-powered speaker identification.

When an agent uses the OpenAI provider, `create_openai_realtime_session` establishes a WebSocket connection:

  1. Build the URL: `wss://api.openai.com/v1/realtime?model=<model_id>`
  2. Load the API key from `core_conf` via `OpenAICoreConf`
  3. Connect with an `Authorization: Bearer <api_key>` header
  4. Send `session.update` with the full config (instructions, voice, VAD, tools)
  5. Send `response.create` to prime the model
  6. Return `(OpenAIRealtimeSessionSender, OpenAIRealtimeSessionReceiver)`
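Steps 1–3 above can be sketched as follows. The constant mirrors `OPENAI_REALTIME_BASE_URL` from the constants table below; the helper function names are illustrative, not the module's actual API.

```rust
// Mirrors the OPENAI_REALTIME_BASE_URL constant documented below.
const OPENAI_REALTIME_BASE_URL: &str = "wss://api.openai.com/v1/realtime";

/// Step 1: append the model id as a query parameter.
fn realtime_url(model_id: &str) -> String {
    format!("{OPENAI_REALTIME_BASE_URL}?model={model_id}")
}

/// Step 3: the Authorization header sent with the WebSocket upgrade.
fn authorization_header(api_key: &str) -> (&'static str, String) {
    ("Authorization", format!("Bearer {api_key}"))
}

fn main() {
    let url = realtime_url("gpt-realtime-1.5");
    let (name, value) = authorization_header("sk-example");
    println!("{url}");
    println!("{name}: {value}");
}
```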

Implements `RealtimeSessionSender`:

| Method | What It Sends |
| --- | --- |
| `send_audio_delta(audio)` | `input_audio_buffer.append` with base64 audio |
| `send_tool_response(call_id, _, output)` | `conversation.item.create` (`function_call_output`) + `response.create` |
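The wire shapes behind the two sender methods can be sketched like this. JSON is assembled with `format!` to keep the example dependency-free (the real module presumably uses a JSON library, which also handles escaping); the exact item layout is an assumption based on the event names above.

```rust
/// input_audio_buffer.append: streams a base64-encoded audio chunk.
fn audio_append_event(base64_audio: &str) -> String {
    format!(r#"{{"type":"input_audio_buffer.append","audio":"{base64_audio}"}}"#)
}

/// conversation.item.create with a function_call_output item, returning the
/// tool result for a given call id. (A response.create message follows it.)
fn tool_output_event(call_id: &str, output: &str) -> String {
    format!(
        r#"{{"type":"conversation.item.create","item":{{"type":"function_call_output","call_id":"{call_id}","output":"{output}"}}}}"#
    )
}

fn main() {
    println!("{}", audio_append_event("AAAA"));
    println!("{}", tool_output_event("call_1", "ok"));
}
```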

Implements `RealtimeSessionReceiver`. Maps OpenAI events to `RealtimeInEvent`:

| OpenAI Event | RealtimeInEvent |
| --- | --- |
| `response.output_audio.delta` | `AudioDelta` |
| `input_audio_buffer.speech_started` | `SpeechStarted` |
| `response.created` | `ResponseStarted` |
| `response.function_call_arguments.done` | `ToolCall` |
| Everything else (20+ event types) | `Unknown` |
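The dispatch above amounts to a match on the event's JSON `type` field. A minimal sketch (variant payloads omitted for brevity):

```rust
#[derive(Debug, PartialEq)]
enum RealtimeInEvent {
    AudioDelta,
    SpeechStarted,
    ResponseStarted,
    ToolCall,
    Unknown,
}

/// Map an OpenAI event type string to the module's internal event enum.
fn map_event(event_type: &str) -> RealtimeInEvent {
    match event_type {
        "response.output_audio.delta" => RealtimeInEvent::AudioDelta,
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => RealtimeInEvent::ToolCall,
        // The 20+ remaining event types are intentionally ignored.
        _ => RealtimeInEvent::Unknown,
    }
}

fn main() {
    println!("{:?}", map_event("response.created"));
}
```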

```rust
pub struct OpenaiRealtimeSessionConfig {
    pub r#type: String,                 // "realtime"
    pub model: String,                  // "gpt-realtime-1.5", "gpt-realtime", or "gpt-realtime-mini"
    pub output_modalities: Vec<String>, // ["audio"]
    pub audio: OpenaiRealtimeSessionConfigAudio,
    pub instructions: String,
    pub tools: Option<Vec<OpenaiRealtimeTool>>,
}
```

| Model ID | Display Name | Description |
| --- | --- | --- |
| `gpt-realtime-1.5` | GPT Realtime 1.5 | Latest model with improved reasoning and transcription accuracy (default) |
| `gpt-realtime` | GPT Realtime | Stable realtime voice model |
| `gpt-realtime-mini` | GPT Realtime Mini | Cost-efficient alternative |

New agents default to `gpt-realtime-1.5`. Existing agents keep their configured model.

All 10 OpenAI Realtime voices are available. Marin and Cedar are recommended and shown first in the UI.

| Voice ID | Display Name | Note |
| --- | --- | --- |
| `marin` | Marin (Recommended) | Default voice |
| `cedar` | Cedar (Recommended) | |
| `alloy` | Alloy | |
| `ash` | Ash | |
| `ballad` | Ballad | |
| `coral` | Coral | |
| `echo` | Echo | |
| `sage` | Sage | |
| `shimmer` | Shimmer | |
| `verse` | Verse | |

Audio input and output both use the `g711_ulaw` format (matching Twilio's µ-law stream). Output includes `voice` and `speed` settings; input includes optional `noise_reduction` and `turn_detection`.
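Putting those settings together, the audio section of `session.update` might look roughly like this. The field names come from the description above; the exact nesting and the option values (`near_field`, `server_vad`) are illustrative, not a verified wire format.

```json
{
  "input": {
    "format": "g711_ulaw",
    "noise_reduction": { "type": "near_field" },
    "turn_detection": { "type": "server_vad" }
  },
  "output": {
    "format": "g711_ulaw",
    "voice": "marin",
    "speed": 1.0
  }
}
```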

```rust
pub struct OpenaiRealtimeTool {
    pub r#type: String,                // "function"
    pub name: String,                  // e.g. "query_knowledge"
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}
```
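A serialized tool entry for the `query_knowledge` example named in the struct's comment might look like the following; the description and parameter schema are hypothetical, shown only to illustrate the JSON Schema shape of `parameters`.

```json
{
  "type": "function",
  "name": "query_knowledge",
  "description": "Search the agent's knowledge base",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string" }
    },
    "required": ["query"]
  }
}
```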

Post-call transcription supports two modes, controlled by the `USE_DIARIZED_TRANSCRIPTION` constant in `process_twilio_recording_util.rs`; it currently defaults to diarized (`true`).

Uses `gpt-4o-transcribe` with a bilingual style-matching prompt. The prompt hints at English/Spanish domain vocabulary (appointments, scheduling, voicemail) so the model handles code-switching naturally. Language detection is automatic — no language parameter is set.

```rust
let response = transcribe_audio(recording_bytes, None, Some(prompt)).await?;
```

Uses `gpt-4o-transcribe-diarize` to produce speaker-labeled segments, then applies AI post-processing to identify who is the agent and who is the caller.

```text
Recording audio
  → transcribe_audio_diarized(bytes)
        Model: gpt-4o-transcribe-diarize
        Returns segments with generic labels ("A", "B")
  → identify_speakers(diarized, agent_name, agent_prompt)
        Model: GPT-5 Mini (structured output)
        Analyzes transcript against the agent's system prompt
        Returns mapping: {"A" → "Dr. Smith", "B" → "Caller"}
        Extracts the caller's first name if mentioned
  → format_diarized_transcript(diarized, mapping)
        Merges consecutive same-speaker segments
        Produces labeled lines
```

Output example:

```text
Dr. Smith: Thanks for calling, how can I help?
John: Hi, I need to reschedule my appointment for next week.
Dr. Smith: Of course, let me check availability.
```
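The merge-and-label step can be sketched as below. The `Segment` shape mirrors `DiarizedSegment` (trimmed to the fields this step needs); the function name and signature are illustrative, not the module's actual API.

```rust
use std::collections::HashMap;

struct Segment {
    speaker: String, // generic label: "A", "B", …
    text: String,
}

/// Merge consecutive same-speaker segments, then apply the name mapping.
fn format_diarized(segments: &[Segment], names: &HashMap<&str, &str>) -> String {
    let mut merged: Vec<(String, String)> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            // Same speaker as the previous segment: extend the current line.
            Some((speaker, text)) if *speaker == seg.speaker => {
                text.push(' ');
                text.push_str(seg.text.trim());
            }
            // Speaker changed (or first segment): start a new line.
            _ => merged.push((seg.speaker.clone(), seg.text.trim().to_string())),
        }
    }
    merged
        .into_iter()
        .map(|(speaker, text)| {
            // Fall back to the generic label if no mapping exists.
            let name = names.get(speaker.as_str()).copied().unwrap_or(speaker.as_str());
            format!("{name}: {text}")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let segs = vec![
        Segment { speaker: "A".into(), text: "Thanks for calling,".into() },
        Segment { speaker: "A".into(), text: "how can I help?".into() },
        Segment { speaker: "B".into(), text: "Hi, I need to reschedule.".into() },
    ];
    let names = HashMap::from([("A", "Dr. Smith"), ("B", "John")]);
    println!("{}", format_diarized(&segs, &names));
}
```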

`identify_speakers` sends the raw transcript (with generic labels) and the agent's system prompt to GPT-5 Mini with structured output. The model returns:

```rust
struct SpeakerIdentification {
    agent_speaker: String,       // e.g. "A"
    caller_name: Option<String>, // e.g. "John" (if mentioned)
}
```

The agent is identified by matching greeting patterns and system prompt behavior. If the caller introduces themselves, their first name replaces the generic “Caller” label.
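Turning that structured result into the label mapping used by the formatting step might look like this; `speaker_labels` is an illustrative helper, not the module's actual API.

```rust
use std::collections::HashMap;

struct SpeakerIdentification {
    agent_speaker: String,       // e.g. "A"
    caller_name: Option<String>, // e.g. "John" (if mentioned)
}

/// Build a mapping from generic labels to display names.
fn speaker_labels(
    id: &SpeakerIdentification,
    agent_name: &str,
    speakers: &[&str],
) -> HashMap<String, String> {
    speakers
        .iter()
        .map(|s| {
            let label = if *s == id.agent_speaker {
                agent_name.to_string()
            } else {
                // Use the caller's first name if they introduced themselves,
                // otherwise keep the generic "Caller" label.
                id.caller_name.clone().unwrap_or_else(|| "Caller".to_string())
            };
            (s.to_string(), label)
        })
        .collect()
}

fn main() {
    let id = SpeakerIdentification { agent_speaker: "A".into(), caller_name: Some("John".into()) };
    let labels = speaker_labels(&id, "Dr. Smith", &["A", "B"]);
    println!("{labels:?}");
}
```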

If speaker identification fails, the system falls back to the raw diarized text and logs a warning.

```rust
pub struct DiarizedTranscriptionResponse {
    pub text: String,
    pub duration: Option<f64>,
    pub segments: Option<Vec<DiarizedSegment>>,
}

pub struct DiarizedSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub speaker: String, // Generic label: "A", "B", etc.
}
```

Loaded from the `core_conf` table:

```rust
pub struct OpenAICoreConf {
    pub openai_api_key: String,
}
```

Used by both realtime session creation and transcription.

| Constant | Value |
| --- | --- |
| `OPENAI_REALTIME_BASE_URL` | `wss://api.openai.com/v1/realtime` |
| `OPENAI_TRANSCRIPTION_URL` | `https://api.openai.com/v1/audio/transcriptions` |

```text
src/mods/openai/
├── constants/ # API URLs (realtime WebSocket, transcription)
├── services/ # create_openai_realtime_session
├── types/    # Realtime events, session config, transcription + diarized types
└── utils/    # transcribe_audio, transcribe_audio_diarized,
              # identify_speakers, format_diarized_transcript
```
  • Agent — defines the `RealtimeSessionSender`/`Receiver` traits this module implements
  • Twilio — forwards audio to/from the session and triggers transcription after recording