OpenAI

The `openai` module connects to OpenAI's Realtime API for live voice sessions and provides two post-call transcription paths: standard transcription and speaker-diarized transcription with AI-powered speaker identification.

When an agent uses the OpenAI provider, `create_openai_realtime_session` establishes a WebSocket connection:

  1. Build the URL: `wss://api.openai.com/v1/realtime?model=<model_id>`
  2. Load the API key from `core_conf` via `OpenAICoreConf`
  3. Connect with an `Authorization: Bearer <api_key>` header
  4. Send `session.update` with the full config (instructions, voice, VAD, tools)
  5. Send `response.create` to prime the model
  6. Return `(OpenAIRealtimeSessionSender, OpenAIRealtimeSessionReceiver)`
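Steps 1–3 above can be sketched as follows. The constant mirrors `OPENAI_REALTIME_BASE_URL` from the constants table below; the helper function names are illustrative, not the module's actual API.

```rust
// Mirrors the OPENAI_REALTIME_BASE_URL constant documented below.
const OPENAI_REALTIME_BASE_URL: &str = "wss://api.openai.com/v1/realtime";

/// Step 1: append the model id as a query parameter.
fn realtime_url(model_id: &str) -> String {
    format!("{OPENAI_REALTIME_BASE_URL}?model={model_id}")
}

/// Step 3: the Authorization header sent with the WebSocket upgrade.
fn authorization_header(api_key: &str) -> (&'static str, String) {
    ("Authorization", format!("Bearer {api_key}"))
}

fn main() {
    let url = realtime_url("gpt-realtime-1.5");
    let (name, value) = authorization_header("sk-example");
    println!("{url}");
    println!("{name}: {value}");
}
```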

Implements `RealtimeSessionSender`:

| Method | What It Sends |
| --- | --- |
| `send_audio_delta(audio)` | `input_audio_buffer.append` with base64 audio |
| `send_tool_response(call_id, _, output)` | `conversation.item.create` (`function_call_output`) + `response.create` |
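The wire shapes behind the two sender methods can be sketched like this. JSON is assembled with `format!` to keep the example dependency-free (the real module presumably uses a JSON library, which also handles escaping); the exact item layout is an assumption based on the event names above.

```rust
/// input_audio_buffer.append: streams a base64-encoded audio chunk.
fn audio_append_event(base64_audio: &str) -> String {
    format!(r#"{{"type":"input_audio_buffer.append","audio":"{base64_audio}"}}"#)
}

/// conversation.item.create with a function_call_output item, returning the
/// tool result for a given call id. (A response.create message follows it.)
fn tool_output_event(call_id: &str, output: &str) -> String {
    format!(
        r#"{{"type":"conversation.item.create","item":{{"type":"function_call_output","call_id":"{call_id}","output":"{output}"}}}}"#
    )
}

fn main() {
    println!("{}", audio_append_event("AAAA"));
    println!("{}", tool_output_event("call_1", "ok"));
}
```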

Implements `RealtimeSessionReceiver`. Maps OpenAI events to `RealtimeInEvent`:

| OpenAI Event | RealtimeInEvent |
| --- | --- |
| `response.output_audio.delta` | `AudioDelta` |
| `input_audio_buffer.speech_started` | `SpeechStarted` |
| `response.created` | `ResponseStarted` |
| `response.function_call_arguments.done` | `ToolCall` |
| Everything else (20+ event types) | `Unknown` |
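The dispatch above amounts to a match on the event's JSON `type` field. A minimal sketch (variant payloads omitted for brevity):

```rust
#[derive(Debug, PartialEq)]
enum RealtimeInEvent {
    AudioDelta,
    SpeechStarted,
    ResponseStarted,
    ToolCall,
    Unknown,
}

/// Map an OpenAI event type string to the module's internal event enum.
fn map_event(event_type: &str) -> RealtimeInEvent {
    match event_type {
        "response.output_audio.delta" => RealtimeInEvent::AudioDelta,
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => RealtimeInEvent::ToolCall,
        // The 20+ remaining event types are intentionally ignored.
        _ => RealtimeInEvent::Unknown,
    }
}

fn main() {
    println!("{:?}", map_event("response.created"));
}
```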

```rust
pub struct OpenaiRealtimeSessionConfig {
    pub r#type: String,                 // "realtime"
    pub model: String,                  // "gpt-realtime-1.5", "gpt-realtime", or "gpt-realtime-mini"
    pub output_modalities: Vec<String>, // ["audio"]
    pub audio: OpenaiRealtimeSessionConfigAudio,
    pub instructions: String,
    pub tools: Option<Vec<OpenaiRealtimeTool>>,
}
```

| Model ID | Display Name | Description |
| --- | --- | --- |
| `gpt-realtime-1.5` | GPT Realtime 1.5 | Latest model with improved reasoning and transcription accuracy (default) |
| `gpt-realtime` | GPT Realtime | Stable realtime voice model |
| `gpt-realtime-mini` | GPT Realtime Mini | Cost-efficient alternative |

New agents default to `gpt-realtime-1.5`. Existing agents keep their configured model.

All 10 OpenAI Realtime voices are available. Marin and Cedar are recommended and shown first in the UI.

| Voice ID | Display Name | Note |
| --- | --- | --- |
| `marin` | Marin (Recommended) | Default voice |
| `cedar` | Cedar (Recommended) | |
| `alloy` | Alloy | |
| `ash` | Ash | |
| `ballad` | Ballad | |
| `coral` | Coral | |
| `echo` | Echo | |
| `sage` | Sage | |
| `shimmer` | Shimmer | |
| `verse` | Verse | |

Audio input and output both use the `g711_ulaw` format (matching Twilio's µ-law stream). Output includes `voice` and `speed` settings; input includes optional `noise_reduction` and `turn_detection`.
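Putting those settings together, the audio section of `session.update` might look roughly like this. The field names come from the description above; the exact nesting and the option values (`near_field`, `server_vad`) are illustrative, not a verified wire format.

```json
{
  "input": {
    "format": "g711_ulaw",
    "noise_reduction": { "type": "near_field" },
    "turn_detection": { "type": "server_vad" }
  },
  "output": {
    "format": "g711_ulaw",
    "voice": "marin",
    "speed": 1.0
  }
}
```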

```rust
pub struct OpenaiRealtimeTool {
    pub r#type: String,                // "function"
    pub name: String,                  // e.g. "query_knowledge"
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}
```
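A serialized tool entry for the `query_knowledge` example named in the struct's comment might look like the following; the description and parameter schema are hypothetical, shown only to illustrate the JSON Schema shape of `parameters`.

```json
{
  "type": "function",
  "name": "query_knowledge",
  "description": "Search the agent's knowledge base",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string" }
    },
    "required": ["query"]
  }
}
```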

Post-call transcription supports two modes, controlled by the `USE_DIARIZED_TRANSCRIPTION` constant in `process_twilio_recording_util.rs`; it currently defaults to diarized (`true`).

Uses `gpt-4o-transcribe` with a bilingual style-matching prompt. The prompt hints at English/Spanish domain vocabulary (appointments, scheduling, voicemail) so the model handles code-switching naturally. Language detection is automatic — no language parameter is set.

```rust
let response = transcribe_audio(recording_bytes, None, Some(prompt)).await?;
```

Uses `gpt-4o-transcribe-diarize` to produce speaker-labeled segments, then applies AI post-processing to identify who is the agent and who is the caller.

```text
Recording audio
  → transcribe_audio_diarized(bytes)
        Model: gpt-4o-transcribe-diarize
        Returns segments with generic labels ("A", "B")
  → identify_speakers(diarized, agent_name, agent_prompt)
        Model: GPT-5 Mini (structured output)
        Analyzes transcript against the agent's system prompt
        Returns mapping: {"A" → "Dr. Smith", "B" → "Caller"}
        Extracts the caller's first name if mentioned
  → format_diarized_transcript(diarized, mapping)
        Merges consecutive same-speaker segments
        Produces labeled lines
```

Output example:

```text
Dr. Smith: Thanks for calling, how can I help?
John: Hi, I need to reschedule my appointment for next week.
Dr. Smith: Of course, let me check availability.
```
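The merge-and-label step can be sketched as below. The `Segment` shape mirrors `DiarizedSegment` (trimmed to the fields this step needs); the function name and signature are illustrative, not the module's actual API.

```rust
use std::collections::HashMap;

struct Segment {
    speaker: String, // generic label: "A", "B", …
    text: String,
}

/// Merge consecutive same-speaker segments, then apply the name mapping.
fn format_diarized(segments: &[Segment], names: &HashMap<&str, &str>) -> String {
    let mut merged: Vec<(String, String)> = Vec::new();
    for seg in segments {
        match merged.last_mut() {
            // Same speaker as the previous segment: extend the current line.
            Some((speaker, text)) if *speaker == seg.speaker => {
                text.push(' ');
                text.push_str(seg.text.trim());
            }
            // Speaker changed (or first segment): start a new line.
            _ => merged.push((seg.speaker.clone(), seg.text.trim().to_string())),
        }
    }
    merged
        .into_iter()
        .map(|(speaker, text)| {
            // Fall back to the generic label if no mapping exists.
            let name = names.get(speaker.as_str()).copied().unwrap_or(speaker.as_str());
            format!("{name}: {text}")
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let segs = vec![
        Segment { speaker: "A".into(), text: "Thanks for calling,".into() },
        Segment { speaker: "A".into(), text: "how can I help?".into() },
        Segment { speaker: "B".into(), text: "Hi, I need to reschedule.".into() },
    ];
    let names = HashMap::from([("A", "Dr. Smith"), ("B", "John")]);
    println!("{}", format_diarized(&segs, &names));
}
```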

`identify_speakers` sends the raw transcript (with generic labels) and the agent's system prompt to GPT-5 Mini with structured output. The model returns:

```rust
struct SpeakerIdentification {
    agent_speaker: String,       // e.g. "A"
    caller_name: Option<String>, // e.g. "John" (if mentioned)
}
```

The agent is identified by matching greeting patterns and system prompt behavior. If the caller introduces themselves, their first name replaces the generic “Caller” label.
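Turning that structured result into the label mapping used by the formatting step might look like this; `speaker_labels` is an illustrative helper, not the module's actual API.

```rust
use std::collections::HashMap;

struct SpeakerIdentification {
    agent_speaker: String,       // e.g. "A"
    caller_name: Option<String>, // e.g. "John" (if mentioned)
}

/// Build a mapping from generic labels to display names.
fn speaker_labels(
    id: &SpeakerIdentification,
    agent_name: &str,
    speakers: &[&str],
) -> HashMap<String, String> {
    speakers
        .iter()
        .map(|s| {
            let label = if *s == id.agent_speaker {
                agent_name.to_string()
            } else {
                // Use the caller's first name if they introduced themselves,
                // otherwise keep the generic "Caller" label.
                id.caller_name.clone().unwrap_or_else(|| "Caller".to_string())
            };
            (s.to_string(), label)
        })
        .collect()
}

fn main() {
    let id = SpeakerIdentification { agent_speaker: "A".into(), caller_name: Some("John".into()) };
    let labels = speaker_labels(&id, "Dr. Smith", &["A", "B"]);
    println!("{labels:?}");
}
```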

If speaker identification fails, the system falls back to the raw diarized text and logs a warning.

```rust
pub struct DiarizedTranscriptionResponse {
    pub text: String,
    pub duration: Option<f64>,
    pub segments: Option<Vec<DiarizedSegment>>,
}

pub struct DiarizedSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub speaker: String, // Generic label: "A", "B", etc.
}
```

Loaded from the `core_conf` table:

```rust
pub struct OpenAICoreConf {
    pub openai_api_key: String,
}
```

Used by both realtime session creation and transcription.

| Constant | Value |
| --- | --- |
| `OPENAI_REALTIME_BASE_URL` | `wss://api.openai.com/v1/realtime` |
| `OPENAI_TRANSCRIPTION_URL` | `https://api.openai.com/v1/audio/transcriptions` |

```text
src/mods/openai/
├── constants/ # API URLs (realtime WebSocket, transcription)
├── services/ # create_openai_realtime_session
├── types/    # Realtime events, session config, transcription + diarized types
└── utils/    # transcribe_audio, transcribe_audio_diarized,
              # identify_speakers, format_diarized_transcript
```
  • Agent — defines the `RealtimeSessionSender`/`Receiver` traits this module implements
  • Twilio — forwards audio to/from the session and triggers transcription after recording