OpenAI

The openai module connects to OpenAI’s Realtime API for live voice sessions and uses GPT-4o Transcribe for post-call transcription.

When an agent uses the OpenAI provider, create_openai_realtime_session establishes a WebSocket connection (sketched after the list below):

  1. Build URL: wss://api.openai.com/v1/realtime?model=<model_id>
  2. Load API key from core_conf via OpenAICoreConf
  3. Connect with Authorization: Bearer <api_key> header
  4. Send session.update with full config (instructions, voice, VAD, tools)
  5. Send response.create to prime the model
  6. Return (OpenAIRealtimeSessionSender, OpenAIRealtimeSessionReceiver)
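
A minimal sketch of that handshake, assuming tokio-tungstenite, serde_json, and anyhow; the function name connect_realtime and the hard-coded session fields are illustrative, not the module's actual internals:

use futures_util::SinkExt;
use tokio_tungstenite::{connect_async, tungstenite::client::IntoClientRequest, tungstenite::Message};

async fn connect_realtime(model_id: &str, api_key: &str) -> anyhow::Result<()> {
    // 1. Build the URL with the model as a query parameter.
    let url = format!("wss://api.openai.com/v1/realtime?model={model_id}");

    // 2.–3. Attach the bearer token and open the WebSocket.
    let mut request = url.into_client_request()?;
    request
        .headers_mut()
        .insert("Authorization", format!("Bearer {api_key}").parse()?);
    let (mut ws, _response) = connect_async(request).await?;

    // 4. session.update carries the full session config.
    let session_update = serde_json::json!({
        "type": "session.update",
        "session": {
            "type": "realtime",
            "model": "gpt-realtime",
            "output_modalities": ["audio"],
            "instructions": "You are a helpful voice agent."
        }
    });
    ws.send(Message::Text(session_update.to_string().into())).await?;

    // 5. response.create primes the model to produce its first response.
    let prime = serde_json::json!({ "type": "response.create" });
    ws.send(Message::Text(prime.to_string().into())).await?;

    // 6. The real function splits the socket into sender/receiver halves here.
    Ok(())
}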

OpenAIRealtimeSessionSender implements RealtimeSessionSender:

Method                                  What It Sends
send_audio_delta(audio)                 input_audio_buffer.append with base64 audio
send_tool_response(call_id, _, output)  conversation.item.create (function_call_output) + response.create
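
A hedged sketch of the JSON those two methods put on the wire; the helper function and variable names are illustrative, and the event shapes follow the Realtime API as described in the table:

use base64::Engine as _;

fn sender_payloads(audio: &[u8], call_id: &str, output: &str) -> Vec<serde_json::Value> {
    vec![
        // send_audio_delta → input_audio_buffer.append with base64 audio
        serde_json::json!({
            "type": "input_audio_buffer.append",
            "audio": base64::engine::general_purpose::STANDARD.encode(audio)
        }),
        // send_tool_response → function_call_output item, then response.create
        // so the model continues the turn with the tool result in context.
        serde_json::json!({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": call_id,
                "output": output
            }
        }),
        serde_json::json!({ "type": "response.create" }),
    ]
}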

OpenAIRealtimeSessionReceiver implements RealtimeSessionReceiver and maps OpenAI events to RealtimeInEvent:

OpenAI Event                            → RealtimeInEvent
response.output_audio.delta             AudioDelta
input_audio_buffer.speech_started       SpeechStarted
response.created                        ResponseStarted
response.function_call_arguments.done   ToolCall
Everything else (20+ event types)       Unknown
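
A simplified sketch of that dispatch. The real RealtimeInEvent enum comes from the Agent module and likely carries richer payloads than the ones assumed here:

enum RealtimeInEvent {
    AudioDelta(String), // base64-encoded audio chunk
    SpeechStarted,
    ResponseStarted,
    ToolCall { call_id: String, name: String, arguments: String },
    Unknown,
}

fn map_event(raw: &serde_json::Value) -> RealtimeInEvent {
    match raw["type"].as_str().unwrap_or_default() {
        "response.output_audio.delta" => RealtimeInEvent::AudioDelta(
            raw["delta"].as_str().unwrap_or_default().to_string(),
        ),
        "input_audio_buffer.speech_started" => RealtimeInEvent::SpeechStarted,
        "response.created" => RealtimeInEvent::ResponseStarted,
        "response.function_call_arguments.done" => RealtimeInEvent::ToolCall {
            call_id: raw["call_id"].as_str().unwrap_or_default().to_string(),
            name: raw["name"].as_str().unwrap_or_default().to_string(),
            arguments: raw["arguments"].as_str().unwrap_or_default().to_string(),
        },
        _ => RealtimeInEvent::Unknown,
    }
}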

The session.update payload is built from OpenaiRealtimeSessionConfig:

pub struct OpenaiRealtimeSessionConfig {
    pub r#type: String,                 // "realtime"
    pub model: String,                  // "gpt-realtime"
    pub output_modalities: Vec<String>, // ["audio"]
    pub audio: OpenaiRealtimeSessionConfigAudio,
    pub instructions: String,
    pub tools: Option<Vec<OpenaiRealtimeTool>>,
}

Audio input and output both use the g711_ulaw format (matching Twilio’s µ-law stream). Output additionally carries voice and speed settings; input carries optional noise_reduction and turn_detection.
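
A hedged sketch of what the audio section might serialize to; the exact key names and the noise-reduction/turn-detection values are assumptions:

// Illustrative only: key names and values are not taken from the module.
let audio = serde_json::json!({
    "input": {
        "format": "g711_ulaw",
        "noise_reduction": { "type": "near_field" },
        "turn_detection": { "type": "server_vad" }
    },
    "output": {
        "format": "g711_ulaw",
        "voice": "alloy",
        "speed": 1.0
    }
});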

Tools are declared as function definitions with JSON-Schema parameters:

pub struct OpenaiRealtimeTool {
    pub r#type: String,                // "function"
    pub name: String,                  // e.g. "query_knowledge"
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}
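
For example, a query_knowledge tool might be declared like this (the schema itself is illustrative):

let tool = OpenaiRealtimeTool {
    r#type: "function".to_string(),
    name: "query_knowledge".to_string(),
    description: "Search the agent's knowledge base for an answer.".to_string(),
    parameters: serde_json::json!({
        "type": "object",
        "properties": {
            "query": { "type": "string", "description": "The user's question" }
        },
        "required": ["query"]
    }),
};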

Post-call transcription uses the OpenAI Audio API.

// Simple path: transcribe a recording with default settings
let response = transcribe_audio(recording_bytes).await?;
println!("{}", response.text);

// Full control over model, language, prompt, and response format
let request = TranscriptionRequest {
    file_bytes: recording_bytes,
    model: "gpt-4o-transcribe".to_string(),
    language: Some("en".to_string()),
    prompt: None,
    response_format: None,
};
let response = transcribe_audio_with_options(request).await?;

Both helpers return a TranscriptionResponse:

pub struct TranscriptionResponse {
    pub text: String,
    pub language: Option<String>,
    pub duration: Option<f64>,
    pub segments: Option<Vec<TranscriptionSegment>>,
    pub words: Option<Vec<TranscriptionWord>>,
}

Segments include timing, token, and confidence data. Words include start/end timestamps and optional speaker diarization.
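
Assumed shapes for those two types, inferred only from the description above; the actual field names may differ:

pub struct TranscriptionSegment {
    pub start: f64,       // seconds
    pub end: f64,
    pub text: String,
    pub tokens: Vec<i64>,
    pub avg_logprob: f64, // confidence signal
}

pub struct TranscriptionWord {
    pub word: String,
    pub start: f64,
    pub end: f64,
    pub speaker: Option<String>, // diarization, when available
}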

Configuration is loaded from the core_conf table:

pub struct OpenAICoreConf {
    pub openai_api_key: String,
}

The same key is used by both realtime session creation and transcription.

Constant                    Value
OPENAI_REALTIME_BASE_URL    wss://api.openai.com/v1/realtime
OPENAI_TRANSCRIPTION_URL    https://api.openai.com/v1/audio/transcriptions
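
Expressed as Rust constants, these are:

pub const OPENAI_REALTIME_BASE_URL: &str = "wss://api.openai.com/v1/realtime";
pub const OPENAI_TRANSCRIPTION_URL: &str = "https://api.openai.com/v1/audio/transcriptions";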

Module layout:

src/mods/openai/
├── constants/ # API URLs (realtime WebSocket, transcription)
├── services/ # create_openai_realtime_session
├── types/ # Realtime events (20+ types), session config, transcription types
└── utils/ # transcribe_audio, transcribe_audio_with_options

Related modules:

  • Agent — defines the RealtimeSessionSender/RealtimeSessionReceiver traits this module implements
  • Twilio — forwards audio to/from the session and triggers transcription after recording