Gemini

The gemini module connects to Google’s Gemini Live API through Vertex AI, providing an alternative realtime voice provider alongside OpenAI. It handles WebSocket session lifecycle, bidirectional audio streaming with codec transcoding, and OAuth2 authentication via service account JWT.

Architecture

Gemini implements the same RealtimeSessionSender / RealtimeSessionReceiver traits as the OpenAI module. The agent module doesn’t know which provider it’s using — both expose identical interfaces.

src/mods/gemini/
├── constants/  # Vertex AI WebSocket URL template
├── services/   # create_gemini_realtime_session
├── types/      # Setup config, event types, tool types, session types
└── utils/      # Vertex AI auth, audio codec transcoding
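
The provider-agnostic design described above can be illustrated with a small, synchronous sketch. The trait below is a simplified, hypothetical stand-in: the real RealtimeSessionSender is presumably async, and its method names are not shown in this document, so `send_audio`, `GeminiSender`, `OpenAiSender`, and `forward_caller_audio` are all assumptions made for illustration.

```rust
/// Hypothetical, simplified stand-in for the shared sender trait.
/// The real trait's methods are not documented here.
trait RealtimeSessionSender {
    /// Forward one chunk of caller audio to the provider.
    fn send_audio(&mut self, chunk: &str) -> Result<(), String>;
}

/// Stand-in for the Gemini sender: records what it would put on the wire.
struct GeminiSender {
    sent: Vec<String>,
}

/// Stand-in for the OpenAI sender.
struct OpenAiSender {
    sent: Vec<String>,
}

impl RealtimeSessionSender for GeminiSender {
    fn send_audio(&mut self, chunk: &str) -> Result<(), String> {
        // A real implementation would transcode and write to the WebSocket.
        self.sent.push(format!("gemini:{}", chunk));
        Ok(())
    }
}

impl RealtimeSessionSender for OpenAiSender {
    fn send_audio(&mut self, chunk: &str) -> Result<(), String> {
        self.sent.push(format!("openai:{}", chunk));
        Ok(())
    }
}

/// Agent-side code depends only on the trait, never on a concrete provider.
fn forward_caller_audio(
    sender: &mut dyn RealtimeSessionSender,
    chunks: &[&str],
) -> Result<usize, String> {
    for chunk in chunks.iter().copied() {
        sender.send_audio(chunk)?;
    }
    Ok(chunks.len())
}
```

Because `forward_caller_audio` takes a trait object, swapping providers only changes which sender is constructed, not the agent logic.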

Session Creation

create_gemini_realtime_session establishes a live voice session:

1. Load GeminiCoreConf from the core_conf table (project ID, location, service account credentials).
2. Obtain a Vertex AI access token via JWT signing and an OAuth2 exchange.
3. Build the WebSocket URL from VERTEX_AI_LIVE_WS_URL_TEMPLATE, replacing {location} (e.g. us-central1).
4. Connect with an Authorization: Bearer {token} header.
5. Send GeminiSetupMessage as the first frame (model, voice, VAD, tools, system instructions).
6. Wait for the setupComplete response.
7. Send a greeting trigger (GeminiClientContent with text: ".") to prime the model.
8. Return (GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver).

pub async fn create_gemini_realtime_session(
    gemini_session_config: GeminiRealtimeSessionConfig,
) -> Result<(GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver), AppError>
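
The URL and header steps of session creation might look roughly like the sketch below. The template constant here is a placeholder on a reserved `.invalid` host, not the real VERTEX_AI_LIVE_WS_URL_TEMPLATE, and `build_ws_url` and `bearer_header` are hypothetical helper names. The location check is an added safeguard, not something the module is documented to do.

```rust
/// Placeholder template for illustration only. The real value lives in
/// VERTEX_AI_LIVE_WS_URL_TEMPLATE and is not reproduced here.
const EXAMPLE_WS_URL_TEMPLATE: &str = "wss://{location}-example.invalid/ws/live";

/// Substitute the region into the template, rejecting obviously bad input
/// so a malformed location cannot alter the host name.
fn build_ws_url(template: &str, location: &str) -> Result<String, String> {
    if !template.contains("{location}") {
        return Err("template has no {location} placeholder".to_string());
    }
    let valid = !location.is_empty()
        && location
            .chars()
            .all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '-');
    if !valid {
        return Err(format!("invalid location: {:?}", location));
    }
    Ok(template.replace("{location}", location))
}

/// Build the (name, value) pair for the bearer-token header.
fn bearer_header(token: &str) -> (String, String) {
    ("Authorization".to_string(), format!("Bearer {}", token))
}
```

Validating the region before substitution keeps a bad core_conf value from producing a URL that points at an unintended host.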

Vertex AI Authentication

get_vertex_access_token performs a service account JWT flow:

1. Load credentials from GeminiCoreConf.
2. Build JWT claims: iss = service account email, scope = cloud-platform, exp = issue time + 1 hour.
3. Sign with RS256 using the PEM private key.
4. POST to google_token_uri with grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer.
5. Return the access_token string.
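
The claim-building and request-body steps can be sketched with the standard library alone. The struct and function names below are hypothetical. The `aud` claim (set to the token URI) and the full scope URL follow Google's documented service account flow rather than anything stated in this page, and RS256 signing is deliberately left out because it needs a crypto crate.

```rust
const CLOUD_PLATFORM_SCOPE: &str = "https://www.googleapis.com/auth/cloud-platform";
const JWT_BEARER_GRANT: &str = "urn:ietf:params:oauth:grant-type:jwt-bearer";
const TOKEN_LIFETIME_SECS: u64 = 3600; // one hour, as described above

/// Hypothetical claims struct; the module's real type is not shown here.
#[derive(Debug, PartialEq)]
struct ServiceAccountClaims {
    iss: String,
    scope: String,
    aud: String,
    iat: u64,
    exp: u64,
}

/// Build the claims for a token request issued at `now_unix_secs`.
fn build_claims(client_email: &str, token_uri: &str, now_unix_secs: u64) -> ServiceAccountClaims {
    ServiceAccountClaims {
        iss: client_email.to_string(),
        scope: CLOUD_PLATFORM_SCOPE.to_string(),
        aud: token_uri.to_string(), // assumption: audience is the token endpoint
        iat: now_unix_secs,
        exp: now_unix_secs + TOKEN_LIFETIME_SECS,
    }
}

/// Percent-encode a value for an application/x-www-form-urlencoded body.
fn form_encode(value: &str) -> String {
    let mut out = String::new();
    for b in value.bytes() {
        match b {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(b as char)
            }
            _ => out.push_str(&format!("%{:02X}", b)),
        }
    }
    out
}

/// Body of the POST to the token endpoint, given an already-signed JWT.
fn token_request_body(signed_jwt: &str) -> String {
    format!(
        "grant_type={}&assertion={}",
        form_encode(JWT_BEARER_GRANT),
        form_encode(signed_jwt)
    )
}
```

In a real build, the signing step would typically be delegated to a JWT crate, and the current time would come from `std::time::SystemTime`.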

Configuration

Loaded from the core_conf database table:

pub struct GeminiCoreConf {
    pub google_project_id: String,     // GCP project ID
    pub google_location: String,       // Region (e.g. "us-central1")
    pub google_client_email: String,   // Service account email
    pub google_private_key: String,    // PEM RSA private key
    pub google_token_uri: String,      // OAuth2 token endpoint
}

Session Setup

The setup message configures the model, voice, VAD, and tools:

pub struct GeminiRealtimeSessionConfig {
    pub model: String,                                // e.g. "models/gemini-2.5-flash"
    pub generation_config: GeminiGenerationConfig,
    pub system_instruction: GeminiSystemInstruction,
    pub realtime_input_config: GeminiRealtimeInputConfig,
    pub tools: Option<Vec<GeminiToolDeclaration>>,
}

Voice is set through a nested config path: generation_config.speech_config.voice_config.prebuilt_voice_config.voice_name (e.g. "Puck", "Charon", "Kore").

VAD uses GeminiAutomaticActivityDetection with start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, and activity_handling (e.g. "INTERRUPT_AND_RESPOND").
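
To make the nested voice path concrete, here is a minimal sketch of how the structs could nest. Only GeminiGenerationConfig is named in this document. The intermediate struct names, the field types, and the two helper functions are assumptions, and the real types likely carry more fields and serde attributes.

```rust
/// Innermost level: the prebuilt voice selected by name.
#[derive(Debug, Clone, PartialEq)]
struct GeminiPrebuiltVoiceConfig {
    voice_name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct GeminiVoiceConfig {
    prebuilt_voice_config: GeminiPrebuiltVoiceConfig,
}

#[derive(Debug, Clone, PartialEq)]
struct GeminiSpeechConfig {
    voice_config: GeminiVoiceConfig,
}

/// Simplified: the real struct presumably has additional fields.
#[derive(Debug, Clone, PartialEq)]
struct GeminiGenerationConfig {
    speech_config: GeminiSpeechConfig,
}

/// Hypothetical helper that hides the four levels of nesting.
fn generation_config_with_voice(voice_name: &str) -> GeminiGenerationConfig {
    GeminiGenerationConfig {
        speech_config: GeminiSpeechConfig {
            voice_config: GeminiVoiceConfig {
                prebuilt_voice_config: GeminiPrebuiltVoiceConfig {
                    voice_name: voice_name.to_string(),
                },
            },
        },
    }
}

/// Read the voice back through the full path described above.
fn voice_name(config: &GeminiGenerationConfig) -> &str {
    &config
        .speech_config
        .voice_config
        .prebuilt_voice_config
        .voice_name
}
```

A constructor like this keeps call sites short, for example `generation_config_with_voice("Kore")`.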

Audio Codec Transcoding

Twilio sends µ-law audio at 8 kHz. Gemini expects PCM16 input and emits PCM16 output at 24 kHz. The codec utilities handle the conversions in both directions:

| Function | Direction | Description |
| --- | --- | --- |
| `transcode_mulaw_b64_to_pcm16_b64` | Twilio → Gemini | Base64 µ-law → base64 PCM16 |
| `transcode_pcm16_b64_to_mulaw_b64` | Gemini → Twilio | Base64 PCM16 → downsample → base64 µ-law |
| `mulaw_to_pcm16` | n/a | Raw µ-law bytes → PCM16 LE bytes |
| `pcm16_to_mulaw` | n/a | PCM16 LE bytes → µ-law bytes |
| `downsample_pcm16` | n/a | Integer decimation (e.g. 24000 → 8000) |

Runtime constants: input MIME type audio/pcm;rate=8000, Gemini output rate 24000 Hz, Twilio rate 8000 Hz.
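
The three raw-byte helpers can be sketched with the standard G.711 µ-law formulas. The function names match the table above, but the signatures and error handling are guesses, and the per-sample helpers are hypothetical. Base64 handling is omitted since it needs an external crate, and the decimation here simply keeps every Nth sample without a low-pass filter, which may or may not match the module.

```rust
const MULAW_BIAS: i32 = 0x84;
const MULAW_CLIP: i32 = 32_635;

/// Decode one G.711 µ-law byte to a signed 16-bit sample.
fn decode_mulaw_sample(byte: u8) -> i16 {
    let u = !byte; // µ-law bytes are stored bit-inverted
    let exponent = ((u >> 4) & 0x07) as u32;
    let mantissa = (u & 0x0F) as i32;
    let magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
    if (u & 0x80) != 0 {
        (-magnitude) as i16
    } else {
        magnitude as i16
    }
}

/// Encode one signed 16-bit sample as a G.711 µ-law byte.
fn encode_mulaw_sample(sample: i16) -> u8 {
    let value = sample as i32;
    let sign: i32 = if value < 0 { 0x80 } else { 0x00 };
    let magnitude = value.abs().min(MULAW_CLIP) + MULAW_BIAS;
    // Find the segment: position of the highest set bit between bit 14 and bit 7.
    let mut exponent: i32 = 7;
    let mut mask: i32 = 0x4000;
    while exponent > 0 && (magnitude & mask) == 0 {
        exponent -= 1;
        mask >>= 1;
    }
    let mantissa = (magnitude >> (exponent + 3)) & 0x0F;
    !((sign | (exponent << 4) | mantissa) as u8)
}

/// Raw µ-law bytes to PCM16 little-endian bytes (two output bytes per input byte).
fn mulaw_to_pcm16(mulaw: &[u8]) -> Vec<u8> {
    mulaw
        .iter()
        .flat_map(|&b| decode_mulaw_sample(b).to_le_bytes())
        .collect()
}

/// PCM16 little-endian bytes to raw µ-law bytes; a trailing odd byte is ignored.
fn pcm16_to_mulaw(pcm: &[u8]) -> Vec<u8> {
    pcm.chunks_exact(2)
        .map(|c| encode_mulaw_sample(i16::from_le_bytes([c[0], c[1]])))
        .collect()
}

/// Integer decimation: keep every (from_rate / to_rate)-th sample.
fn downsample_pcm16(pcm: &[u8], from_rate: u32, to_rate: u32) -> Result<Vec<u8>, String> {
    if from_rate == 0 || to_rate == 0 || from_rate % to_rate != 0 {
        return Err(format!("unsupported rate pair {} -> {}", from_rate, to_rate));
    }
    let factor = (from_rate / to_rate) as usize;
    Ok(pcm
        .chunks_exact(2)
        .step_by(factor)
        .flat_map(|sample| sample.iter().copied())
        .collect())
}
```

With these pieces, the Gemini → Twilio path is a downsample from 24000 to 8000 followed by `pcm16_to_mulaw`, and the Twilio → Gemini path is a single `mulaw_to_pcm16` call.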

Event Mapping

The receiver maps Gemini server messages to the canonical RealtimeInEvent enum:

| Gemini Event | RealtimeInEvent |
| --- | --- |
| `tool_call` with `function_calls` | `ToolCall { call_id, function_name, arguments_json }` |
| `interrupted = true` | `SpeechStarted` |
| First `inline_data` after interruption | `ResponseStarted` |
| `inline_data` (audio) | `AudioDelta { delta }` (transcoded to µ-law) |
| `setupComplete`, `tool_call_cancellation` | `Unknown` |

Both Text and Binary WebSocket frames are accepted, because Vertex AI sends its JSON messages in Binary frames.
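
The mapping table implies a small amount of state: the receiver has to remember that an interruption occurred so it can flag the next audio chunk. The sketch below shows one way that could work. `GeminiServerMessage` and `EventMapper` are hypothetical, the input is assumed to be already parsed and transcoded, and two behaviors are guesses: emitting ResponseStarted together with the first AudioDelta, and using an empty call_id when Gemini omits the id.

```rust
/// Canonical events, with the variants listed in the table above.
#[derive(Debug, PartialEq)]
enum RealtimeInEvent {
    ToolCall { call_id: String, function_name: String, arguments_json: String },
    SpeechStarted,
    ResponseStarted,
    AudioDelta { delta: String },
    Unknown,
}

/// Hypothetical, already-parsed view of one Gemini server message.
enum GeminiServerMessage {
    ToolCall { id: Option<String>, name: String, args_json: String },
    Interrupted,
    /// Audio assumed to be already transcoded to base64 µ-law.
    InlineAudio { mulaw_b64: String },
    SetupComplete,
    ToolCallCancellation,
}

/// Tracks whether the next audio chunk is the first one after an interruption.
struct EventMapper {
    after_interruption: bool,
}

impl EventMapper {
    fn new() -> Self {
        Self { after_interruption: false }
    }

    /// Map one server message to zero or more canonical events.
    fn map(&mut self, message: GeminiServerMessage) -> Vec<RealtimeInEvent> {
        match message {
            GeminiServerMessage::ToolCall { id, name, args_json } => {
                vec![RealtimeInEvent::ToolCall {
                    call_id: id.unwrap_or_default(), // assumption: empty when omitted
                    function_name: name,
                    arguments_json: args_json,
                }]
            }
            GeminiServerMessage::Interrupted => {
                self.after_interruption = true;
                vec![RealtimeInEvent::SpeechStarted]
            }
            GeminiServerMessage::InlineAudio { mulaw_b64 } => {
                let mut events = Vec::new();
                if self.after_interruption {
                    self.after_interruption = false;
                    events.push(RealtimeInEvent::ResponseStarted);
                }
                events.push(RealtimeInEvent::AudioDelta { delta: mulaw_b64 });
                events
            }
            GeminiServerMessage::SetupComplete
            | GeminiServerMessage::ToolCallCancellation => vec![RealtimeInEvent::Unknown],
        }
    }
}
```

Returning a `Vec` lets a single server message fan out into two events without the caller needing to know about the interruption flag.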

Tool / Function Calling

Tools are declared in the setup message:

pub struct GeminiToolDeclaration {
    pub function_declarations: Vec<GeminiFunctionDeclaration>,
}

pub struct GeminiFunctionDeclaration {
    pub name: String,
    pub description: String,
    pub parameters: serde_json::Value,  // JSON Schema
}

Tool call responses use GeminiToolResponseMessage. The id field is optional: Gemini’s native-audio models omit it, and correlation still works because each response carries the function name.

The sender’s send_tool_response accepts call_id, function_name, and output. Unlike OpenAI, Gemini requires function_name for response correlation.
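
As an illustration of the optional id, the sketch below renders a tool response frame by hand, leaving the id key out entirely when no call id is available. The `toolResponse` / `functionResponses` envelope follows the public Gemini Live API message shape. Wrapping the result as `{"output": ...}` and the helper names `json_string` and `tool_response_frame` are assumptions. The real module presumably serializes GeminiToolResponseMessage with serde instead.

```rust
/// Minimal JSON string escaper, sufficient for this sketch.
fn json_string(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len() + 2);
    out.push('"');
    for ch in raw.chars() {
        match ch {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Render one tool response frame. The id key is omitted when `call_id` is None.
fn tool_response_frame(call_id: Option<&str>, function_name: &str, output: &str) -> String {
    let id_part = match call_id {
        Some(id) => format!("\"id\":{},", json_string(id)),
        None => String::new(),
    };
    format!(
        "{{\"toolResponse\":{{\"functionResponses\":[{{{}\"name\":{},\"response\":{{\"output\":{}}}}}]}}}}",
        id_part,
        json_string(function_name),
        json_string(output)
    )
}
```

The function name is always present in the frame, which is consistent with the note above that Gemini needs it for correlation.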

Constants

| Constant | Value |
| --- | --- |
| `VERTEX_AI_LIVE_WS_URL_TEMPLATE` | `wss://{location}-aiplatform.googleapis.com/ws/...BidiGenerateContent` |

Related Modules

Agent: defines the RealtimeSessionSender / RealtimeSessionReceiver traits this module implements
OpenAI: the other realtime provider implementation
Twilio: forwards µ-law audio to and from the session