OpenAI

The openai module connects to OpenAI’s Realtime API for live voice sessions and uses GPT-4o Transcribe for post-call transcription.

When an agent uses the OpenAI provider, create_openai_realtime_session establishes a WebSocket connection (sketched after the list below):

  1. Build URL: wss://api.openai.com/v1/realtime?model=<model_id>
  2. Load API key from core_conf via OpenAICoreConf
  3. Connect with Authorization: Bearer <api_key> header
  4. Send session.update with full config (instructions, voice, VAD, tools)
  5. Send response.create to prime the model
  6. Return (OpenAIRealtimeSessionSender, OpenAIRealtimeSessionReceiver)
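
A minimal sketch of that handshake, assuming tokio-tungstenite, serde_json, and anyhow; the function name connect_realtime and the hard-coded session fields are illustrative, not the module's actual internals:

use futures_util::SinkExt;
use tokio_tungstenite::{connect_async, tungstenite::client::IntoClientRequest, tungstenite::Message};

async fn connect_realtime(model_id: &str, api_key: &str) -> anyhow::Result<()> {
    // 1. Build the URL with the model as a query parameter.
    let url = format!("wss://api.openai.com/v1/realtime?model={model_id}");

    // 2.–3. Attach the bearer token and open the WebSocket.
    let mut request = url.into_client_request()?;
    request
        .headers_mut()
        .insert("Authorization", format!("Bearer {api_key}").parse()?);
    let (mut ws, _response) = connect_async(request).await?;

    // 4. session.update carries the full session config.
    let session_update = serde_json::json!({
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            "output_modalities": ["audio"],
            "instructions": "You are a helpful voice agent."
        }
    });
    ws.send(Message::Text(session_update.to_string().into())).await?;

    // 5. response.create primes the model to produce its first response.
    let prime = serde_json::json!({ "type": "response.create" });
    ws.send(Message::Text(prime.to_string().into())).await?;

    // 6. The real function splits the socket into sender/receiver halves here.
    Ok(())
}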

OpenAIRealtimeSessionSender implements RealtimeSessionSender:

Method                                  What It Sends
send_audio_delta(audio)                 input_audio_buffer.append with base64 audio
send_tool_response(call_id, _, output)  conversation.item.create (function_call_output) + response.create
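
A hedged sketch of the JSON those two methods put on the wire; the helper function and variable names are illustrative, and the event shapes follow the Realtime API as described in the table:

use base64::Engine as _;

fn sender_payloads(audio: &[u8], call_id: &str, output: &str) -> Vec<serde_json::Value> {
    vec![
        // send_audio_delta → input_audio_buffer.append with base64 audio
        serde_json::json!({
            "type": "input_audio_buffer.append",
            "audio": base64::engine::general_purpose::STANDARD.encode(audio)
        }),
        // send_tool_response → function_call_output item, then response.create
        // so the model continues the turn with the tool result in context.
        serde_json::json!({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": output
            }
        }),
        serde_json::json!({ "type": "response.create" }),
    ]
}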

OpenAIRealtimeSessionReceiver implements RealtimeSessionReceiver and maps OpenAI events to RealtimeInEvent:

OpenAI Event                            → RealtimeInEvent
response.output_audio.delta             AudioDelta
input_audio_buffer.speech_started       SpeechStarted
response.created                        ResponseStarted
response.function_call_arguments.done   ToolCall
Everything else (20+ event types)       Unknown
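
A simplified sketch of that dispatch. The real RealtimeInEvent enum comes from the Agent module and likely carries richer payloads than the ones assumed here:

enum RealtimeInEvent {
    AudioDelta(String), // base64-encoded audio chunk
    SpeechStarted,
    ResponseStarted,
    ToolCall { call_id: String, name: String, arguments: String },
    Unknown,
}

fn map_event(raw: &serde_json::Value) -> RealtimeInEvent {
    match raw["type"].as_str().unwrap_or_default() {
        "response.output_audio.delta" => RealtimeInEvent::AudioDelta(
            raw["delta"].as_str().unwrap_or_default().to_string(),
        ),
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => RealtimeInEvent::ToolCall {
            call_id: raw["call_id"].as_str().unwrap_or_default().to_string(),
            name: raw["name"].as_str().unwrap_or_default().to_string(),
            arguments: raw["arguments"].as_str().unwrap_or_default().to_string(),
        },
        _ => RealtimeInEvent::Unknown,
    }
}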

The session.update payload is built from OpenaiRealtimeSessionConfig:

pub struct OpenaiRealtimeSessionConfig {
    pub r#type: String,                 // "realtime"
    pub model: String,                  // "gpt-realtime"
    pub output_modalities: Vec<String>, // ["audio"]
    pub audio: OpenaiRealtimeSessionConfigAudio,
    pub instructions: String,
    pub tools: Option<Vec<OpenaiRealtimeTool>>,
}

Audio input and output both use the g711_ulaw format (matching Twilio’s µ-law stream). Output additionally carries voice and speed settings; input carries optional noise_reduction and turn_detection.
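
A hedged sketch of what the audio section might serialize to; the exact key names and the noise-reduction/turn-detection values are assumptions:

// Illustrative only: key names and values are not taken from the module.
let audio = serde_json::json!({
    "input": {
        "format": "g711_ulaw",
        "noise_reduction": { "type": "near_field" },
        "turn_detection": { "type": "server_vad" }
    },
    "output": {
        "format": "g711_ulaw",
        "voice": "alloy",
        "speed": 1.0
    }
});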

Tools are declared as function definitions with JSON-Schema parameters:

pub struct OpenaiRealtimeTool {
    pub r#type: String,                // "function"
    pub name: String,                  // e.g. "query_knowledge"
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}
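
For example, a query_knowledge tool might be declared like this (the schema itself is illustrative):

let tool = OpenaiRealtimeTool {
    r#type: "function".to_string(),
    name: "query_knowledge".to_string(),
    description: "Search the agent's knowledge base for an answer.".to_string(),
    parameters: serde_json::json!({
        "type": "object",
        "properties": {
            "query": { "type": "string", "description": "The user's question" }
        },
        "required": ["query"]
    }),
};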

Post-call transcription uses the OpenAI Audio API.

// Simple path: transcribe a recording with default settings
let response = transcribe_audio(recording_bytes).await?;
println!("{}", response.text);

// Full control over model, language, prompt, and response format
let request = TranscriptionRequest {
    file_bytes: recording_bytes,
    model: "gpt-4o-transcribe".to_string(),
    language: Some("en".to_string()),
    prompt: None,
    response_format: None,
};
let response = transcribe_audio_with_options(request).await?;

Both helpers return a TranscriptionResponse:

pub struct TranscriptionResponse {
    pub text: String,
    pub language: Option<String>,
    pub duration: Option<f64>,
    pub segments: Option<Vec<TranscriptionSegment>>,
    pub words: Option<Vec<TranscriptionWord>>,
}

Segments include timing, token, and confidence data. Words include start/end timestamps and optional speaker diarization.
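
Assumed shapes for those two types, inferred only from the description above; the actual field names may differ:

pub struct TranscriptionSegment {
    pub start: f64,       // seconds
    pub end: f64,
    pub text: String,
    pub tokens: Vec<i64>,
    pub avg_logprob: f64, // confidence signal
}

pub struct TranscriptionWord {
    pub word: String,
    pub start: f64,
    pub end: f64,
    pub speaker: Option<String>, // diarization, when available
}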

Configuration is loaded from the core_conf table:

pub struct OpenAICoreConf {
    pub openai_api_key: String,
}

The same key is used by both realtime session creation and transcription.

Constant                    Value
OPENAI_REALTIME_BASE_URL    wss://api.openai.com/v1/realtime
OPENAI_TRANSCRIPTION_URL    https://api.openai.com/v1/audio/transcriptions
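
Expressed as Rust constants, these are:

pub const OPENAI_REALTIME_BASE_URL: &str = "wss://api.openai.com/v1/realtime";
pub const OPENAI_TRANSCRIPTION_URL: &str = "https://api.openai.com/v1/audio/transcriptions";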

Module layout:

src/mods/openai/
├── constants/ # API URLs (realtime WebSocket, transcription)
├── services/ # create_openai_realtime_session
├── types/ # Realtime events (20+ types), session config, transcription types
└── utils/ # transcribe_audio, transcribe_audio_with_options

Related modules:

  • Agent — defines the RealtimeSessionSender/RealtimeSessionReceiver traits this module implements
  • Twilio — forwards audio to/from the session and triggers transcription after recording