Gemini
The gemini module connects to Google’s Gemini Live API through Vertex AI, providing an alternative realtime voice provider alongside OpenAI. It handles WebSocket session lifecycle, bidirectional audio streaming with codec transcoding, and OAuth2 authentication via service account JWT.
Architecture
Gemini implements the same RealtimeSessionSender / RealtimeSessionReceiver traits as the OpenAI module. The agent module doesn’t know which provider it’s using, because both expose identical interfaces.

    src/mods/gemini/
    ├── constants/   # Vertex AI WebSocket URL template
    ├── services/    # create_gemini_realtime_session
    ├── types/       # Setup config, event types, tool types, session types
    └── utils/       # Vertex AI auth, audio codec transcoding

Session Creation
create_gemini_realtime_session establishes a live voice session:
- Load GeminiCoreConf from the core_conf table (project ID, location, service account credentials)
- Obtain a Vertex AI access token via JWT signing + OAuth2 exchange
- Build the WebSocket URL from VERTEX_AI_LIVE_WS_URL_TEMPLATE, replacing {location} (e.g. us-central1)
- Connect with an Authorization: Bearer {token} header
- Send GeminiSetupMessage as the first frame (model, voice, VAD, tools, system instructions)
- Wait for the setupComplete response
- Send a greeting trigger (GeminiClientContent with text: ".") to prime the model
- Return (GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver)
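
The URL and header steps in the list above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the helper names build_live_ws_url and bearer_header_value are hypothetical, and the template is passed in as a parameter rather than reproduced here.

```rust
/// Hypothetical helper: fills the `{location}` placeholder of a URL template
/// such as VERTEX_AI_LIVE_WS_URL_TEMPLATE with the configured region.
fn build_live_ws_url(template: &str, location: &str) -> String {
    template.replace("{location}", location)
}

/// Hypothetical helper: formats the value of the `Authorization` header
/// sent with the WebSocket handshake.
fn bearer_header_value(access_token: &str) -> String {
    format!("Bearer {access_token}")
}
```

With a placeholder template like "wss://{location}-example.invalid/ws" and the location "us-central1", the first helper would yield "wss://us-central1-example.invalid/ws".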

    pub async fn create_gemini_realtime_session(
        gemini_session_config: GeminiRealtimeSessionConfig,
    ) -> Result<(GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver), AppError>

Vertex AI Authentication
get_vertex_access_token performs a service account JWT flow:
- Load credentials from GeminiCoreConf
- Build JWT claims: iss = service account email, scope = cloud-platform, exp = one hour after issuance
- Sign with RS256 using the PEM private key
- POST to google_token_uri with grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer
- Return the access_token string
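
The claim-building and request-body steps above might look roughly like the following sketch. It deliberately leaves out RS256 signing and the HTTP call, which need external crates. The struct and function names are hypothetical. The aud and iat claims are not listed in the text above and are added here as an assumption, since Google's service account flow expects them alongside iss, scope, and exp.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Full URL form of the cloud-platform scope.
const CLOUD_PLATFORM_SCOPE: &str = "https://www.googleapis.com/auth/cloud-platform";
/// Token lifetime: one hour, matching the `exp` rule described above.
const TOKEN_LIFETIME: Duration = Duration::from_secs(3600);

/// Hypothetical claims struct. `aud` and `iat` are assumptions that go
/// beyond the claims named in the text.
#[derive(Debug, PartialEq)]
struct VertexJwtClaims {
    iss: String,
    scope: String,
    aud: String,
    iat: u64,
    exp: u64,
}

/// Builds the claims for a service account assertion issued at `now`.
fn build_claims(client_email: &str, token_uri: &str, now: SystemTime) -> VertexJwtClaims {
    // Seconds since the Unix epoch; falls back to 0 if the clock is before it.
    let iat = now
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0);
    VertexJwtClaims {
        iss: client_email.to_string(),
        scope: CLOUD_PLATFORM_SCOPE.to_string(),
        aud: token_uri.to_string(),
        iat,
        exp: iat + TOKEN_LIFETIME.as_secs(),
    }
}

/// Form-encoded body for the token exchange. The colons in the grant type
/// are percent-encoded; a signed JWT only contains URL-safe characters.
fn token_request_body(signed_jwt: &str) -> String {
    format!(
        "grant_type=urn%3Aietf%3Aparams%3Aoauth%3Agrant-type%3Ajwt-bearer&assertion={signed_jwt}"
    )
}
```

Passing the clock in as a parameter keeps the claim logic deterministic, which makes the one-hour expiry easy to check in isolation.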
Configuration
Loaded from the core_conf database table:

    pub struct GeminiCoreConf {
        pub google_project_id: String,   // GCP project ID
        pub google_location: String,     // Region (e.g. "us-central1")
        pub google_client_email: String, // Service account email
        pub google_private_key: String,  // PEM RSA private key
        pub google_token_uri: String,    // OAuth2 token endpoint
    }

Session Setup
The setup message configures the model, voice, VAD, and tools:

    pub struct GeminiRealtimeSessionConfig {
        pub model: String, // e.g. "models/gemini-2.5-flash"
        pub generation_config: GeminiGenerationConfig,
        pub system_instruction: GeminiSystemInstruction,
        pub realtime_input_config: GeminiRealtimeInputConfig,
        pub tools: Option<Vec<GeminiToolDeclaration>>,
    }

Voice is set through a nested config path: generation_config.speech_config.voice_config.prebuilt_voice_config.voice_name (e.g. "Puck", "Charon", "Kore").
VAD uses GeminiAutomaticActivityDetection with start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, and activity_handling (e.g. "INTERRUPT_AND_RESPOND").
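
To make the nested voice path concrete, here is a small sketch of how such a config might be assembled. Every struct shape below is an assumption inferred only from the field path in the text; the module's real types (for example GeminiGenerationConfig) may carry more fields or different names, and with_voice is a hypothetical helper.

```rust
// Hypothetical mirror of the nested path
// generation_config.speech_config.voice_config.prebuilt_voice_config.voice_name.
#[derive(Debug, Clone, Default)]
struct PrebuiltVoiceConfig {
    voice_name: String,
}

#[derive(Debug, Clone, Default)]
struct VoiceConfig {
    prebuilt_voice_config: PrebuiltVoiceConfig,
}

#[derive(Debug, Clone, Default)]
struct SpeechConfig {
    voice_config: VoiceConfig,
}

#[derive(Debug, Clone, Default)]
struct GenerationConfig {
    speech_config: SpeechConfig,
}

/// Builds a generation config whose only non-default setting is the voice.
fn with_voice(voice_name: &str) -> GenerationConfig {
    GenerationConfig {
        speech_config: SpeechConfig {
            voice_config: VoiceConfig {
                prebuilt_voice_config: PrebuiltVoiceConfig {
                    voice_name: voice_name.to_string(),
                },
            },
        },
    }
}
```

A helper of this kind keeps call sites short, since the voice name is the only value most callers change along that path.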
Audio Codec Transcoding
Twilio sends µ-law audio at 8 kHz. Gemini accepts PCM16 input (sent here at 8 kHz) and outputs PCM16 at 24 kHz. The codec utilities handle all conversions:

| Function | Direction | Description |
|---|---|---|
| transcode_mulaw_b64_to_pcm16_b64 | Twilio → Gemini | Base64 µ-law → base64 PCM16 |
| transcode_pcm16_b64_to_mulaw_b64 | Gemini → Twilio | Base64 PCM16 → downsample → base64 µ-law |
| mulaw_to_pcm16 | — | Raw µ-law bytes → PCM16 LE bytes |
| pcm16_to_mulaw | — | PCM16 LE bytes → µ-law bytes |
| downsample_pcm16 | — | Integer decimation (e.g. 24000 → 8000) |

Runtime constants: input MIME type audio/pcm;rate=8000, Gemini output rate 24000 Hz, Twilio rate 8000 Hz.
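
The conversions in the table rest on standard G.711 µ-law companding plus integer decimation. The sketch below shows one common way to write them. It operates on decoded i16 samples rather than byte buffers or base64, and its function names are hypothetical rather than the module's own. Plain decimation applies no low-pass filter, so some aliasing is possible; whether the module filters first is not stated in the text.

```rust
/// Decodes one G.711 mu-law byte into a 16-bit linear PCM sample.
fn mulaw_decode_sample(byte: u8) -> i16 {
    let u = !byte; // mu-law bytes are stored bit-inverted
    let exponent = (u >> 4) & 0x07;
    let mantissa = (u & 0x0F) as i32;
    let magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    if u & 0x80 != 0 {
        (-magnitude) as i16
    } else {
        magnitude as i16
    }
}

/// Encodes one 16-bit linear PCM sample as a G.711 mu-law byte.
fn mulaw_encode_sample(sample: i16) -> u8 {
    const BIAS: i32 = 0x84;
    const CLIP: i32 = 32635;
    let mut s = sample as i32;
    let sign: u8 = if s < 0 {
        s = -s;
        0x80
    } else {
        0
    };
    if s > CLIP {
        s = CLIP;
    }
    s += BIAS;
    // Find the segment: exponent 7 corresponds to bit 14, exponent 0 to bit 7.
    let mut exponent: u8 = 7;
    let mut mask = 0x4000;
    while exponent > 0 && (s & mask) == 0 {
        exponent -= 1;
        mask >>= 1;
    }
    let mantissa = ((s >> (exponent + 3)) & 0x0F) as u8;
    !(sign | (exponent << 4) | mantissa)
}

/// Downsamples by keeping every (from_hz / to_hz)-th sample.
/// Panics unless `from_hz` is a positive integer multiple of `to_hz`.
fn decimate(samples: &[i16], from_hz: u32, to_hz: u32) -> Vec<i16> {
    assert!(
        to_hz > 0 && from_hz >= to_hz && from_hz % to_hz == 0,
        "from_hz must be a positive integer multiple of to_hz"
    );
    let factor = (from_hz / to_hz) as usize;
    samples.iter().step_by(factor).copied().collect()
}
```

For the Gemini → Twilio direction, the 24000 → 8000 case keeps every third sample before µ-law encoding.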
Event Mapping
The receiver maps Gemini server messages to the canonical RealtimeInEvent enum:

| Gemini Event | RealtimeInEvent |
|---|---|
| tool_call with function_calls | ToolCall { call_id, function_name, arguments_json } |
| interrupted = true | SpeechStarted |
| First inline_data after interruption | ResponseStarted |
| inline_data (audio) | AudioDelta { delta } (transcoded to µ-law) |
| setupComplete, tool_call_cancellation | Unknown |

Both Text and Binary WebSocket frames are accepted, because Vertex AI sends its JSON messages in Binary frames.
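
The mapping table implies a small amount of state: the receiver has to remember that an interruption occurred so it can flag the next audio chunk. A minimal sketch of that idea follows. ServerMessage, FunctionCall, and EventMapper are hypothetical stand-ins for already-parsed messages, and the RealtimeInEvent variants are reduced to the fields named in the table. Two behaviors are assumptions: the first audio chunk after an interruption yields ResponseStarted followed by AudioDelta, and a missing call id becomes an empty string.

```rust
#[derive(Debug, PartialEq)]
enum RealtimeInEvent {
    ToolCall { call_id: String, function_name: String, arguments_json: String },
    SpeechStarted,
    ResponseStarted,
    AudioDelta { delta: String },
    Unknown,
}

/// Hypothetical parsed function call; `id` may be absent.
struct FunctionCall {
    id: Option<String>,
    name: String,
    args_json: String,
}

/// Hypothetical, simplified view of a parsed Gemini server message.
/// `mulaw_b64` is assumed to be already transcoded to base64 mu-law.
enum ServerMessage {
    ToolCall { function_calls: Vec<FunctionCall> },
    Interrupted,
    InlineAudio { mulaw_b64: String },
    SetupComplete,
    ToolCallCancellation,
}

/// Tracks whether the next audio chunk is the first one after an interruption.
struct EventMapper {
    awaiting_response_start: bool,
}

impl EventMapper {
    fn new() -> Self {
        Self { awaiting_response_start: false }
    }

    fn map(&mut self, msg: ServerMessage) -> Vec<RealtimeInEvent> {
        match msg {
            ServerMessage::ToolCall { function_calls } => function_calls
                .into_iter()
                .map(|c| RealtimeInEvent::ToolCall {
                    call_id: c.id.unwrap_or_default(),
                    function_name: c.name,
                    arguments_json: c.args_json,
                })
                .collect(),
            ServerMessage::Interrupted => {
                self.awaiting_response_start = true;
                vec![RealtimeInEvent::SpeechStarted]
            }
            ServerMessage::InlineAudio { mulaw_b64 } => {
                let mut out = Vec::new();
                if self.awaiting_response_start {
                    self.awaiting_response_start = false;
                    out.push(RealtimeInEvent::ResponseStarted);
                }
                out.push(RealtimeInEvent::AudioDelta { delta: mulaw_b64 });
                out
            }
            ServerMessage::SetupComplete | ServerMessage::ToolCallCancellation => {
                vec![RealtimeInEvent::Unknown]
            }
        }
    }
}
```

Returning a Vec lets one server message produce zero, one, or several canonical events, which covers both the multi-call tool_call case and the ResponseStarted plus AudioDelta pair.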
Tool / Function Calling
Tools are declared in the setup message:

    pub struct GeminiToolDeclaration {
        pub function_declarations: Vec<GeminiFunctionDeclaration>,
    }

    pub struct GeminiFunctionDeclaration {
        pub name: String,
        pub description: String,
        pub parameters: serde_json::Value, // JSON Schema
    }

Tool call responses use GeminiToolResponseMessage. The id field is optional: Gemini’s native-audio models omit it, but correlation still works.

The sender’s send_tool_response accepts call_id, function_name, and output. Unlike OpenAI, Gemini requires function_name for response correlation.
Constants
| Constant | Value |
|---|---|
| VERTEX_AI_LIVE_WS_URL_TEMPLATE | wss://{location}-aiplatform.googleapis.com/ws/...BidiGenerateContent |