Gemini

The gemini module connects to Google’s Gemini Live API through Vertex AI, providing an alternative realtime voice provider alongside OpenAI. It handles WebSocket session lifecycle, bidirectional audio streaming with codec transcoding, and OAuth2 authentication via service account JWT.

Gemini implements the same RealtimeSessionSender / RealtimeSessionReceiver traits as the OpenAI module. The agent module doesn’t know which provider it’s using — both expose identical interfaces.

src/mods/gemini/
├── constants/ # Vertex AI WebSocket URL template
├── services/ # create_gemini_realtime_session
├── types/ # Setup config, event types, tool types, session types
└── utils/ # Vertex AI auth, audio codec transcoding

create_gemini_realtime_session establishes a live voice session:

  1. Load GeminiCoreConf from core_conf table (project ID, location, service account credentials)
  2. Obtain a Vertex AI access token via JWT signing + OAuth2 exchange
  3. Build WebSocket URL from VERTEX_AI_LIVE_WS_URL_TEMPLATE, replacing {location} (e.g. us-central1)
  4. Connect with Authorization: Bearer {token} header
  5. Send GeminiSetupMessage as the first frame (model, voice, VAD, tools, system instructions)
  6. Wait for setupComplete response
  7. Send a greeting trigger (GeminiClientContent with text: ".") to prime the model
  8. Return (GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver)

pub async fn create_gemini_realtime_session(
    gemini_session_config: GeminiRealtimeSessionConfig,
) -> Result<(GeminiRealtimeSessionSender, GeminiRealtimeSessionReceiver), AppError>
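
Steps 3 and 4 amount to string substitution and header formatting. The sketch below illustrates them with the standard library only. The helper names build_live_ws_url and bearer_header are hypothetical, and the template constant is a stand-in: the real path of VERTEX_AI_LIVE_WS_URL_TEMPLATE is not reproduced here, so "/ws/EXAMPLE_PATH" is a placeholder.

```rust
/// Hypothetical stand-in for VERTEX_AI_LIVE_WS_URL_TEMPLATE.
/// "/ws/EXAMPLE_PATH" is a placeholder, not the real endpoint path.
const EXAMPLE_WS_URL_TEMPLATE: &str =
    "wss://{location}-aiplatform.googleapis.com/ws/EXAMPLE_PATH";

/// Step 3: substitute the configured region for the {location} placeholder.
fn build_live_ws_url(template: &str, location: &str) -> String {
    template.replace("{location}", location)
}

/// Step 4: format the value of the Authorization header.
fn bearer_header(access_token: &str) -> String {
    format!("Bearer {access_token}")
}
```

Calling build_live_ws_url(EXAMPLE_WS_URL_TEMPLATE, "us-central1") yields a URL whose host is us-central1-aiplatform.googleapis.com.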

get_vertex_access_token performs a service account JWT flow:

  1. Load credentials from GeminiCoreConf
  2. Build JWT claims: iss = service account email, scope = cloud-platform, exp = 1 hour after issuance
  3. Sign with RS256 using the PEM private key
  4. POST to google_token_uri with grant_type=urn:ietf:params:oauth:grant-type:jwt-bearer
  5. Return the access_token string
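
As a rough illustration of step 2, the claims can be modeled as a plain struct whose expiry is derived from the issue time. This is a sketch under assumptions, not the module's code: the struct and function names are hypothetical, the aud and iat claims are not listed in the steps above (Google's service account flow expects aud to be the token endpoint), and the scope is written as the full cloud-platform scope URL. Signing and the HTTP exchange are left out.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Token lifetime from step 2: one hour.
const TOKEN_LIFETIME_SECS: u64 = 3600;

/// Hypothetical claim set for the service account assertion.
/// `aud` and `iat` are assumptions; the steps above only name iss, scope, exp.
#[derive(Debug, Clone, PartialEq)]
struct ServiceAccountClaims {
    iss: String,
    scope: String,
    aud: String,
    iat: u64,
    exp: u64,
}

/// Build claims from the GeminiCoreConf fields google_client_email and
/// google_token_uri, with exp set one hour after `now_secs`.
fn build_claims(client_email: &str, token_uri: &str, now_secs: u64) -> ServiceAccountClaims {
    ServiceAccountClaims {
        iss: client_email.to_string(),
        scope: "https://www.googleapis.com/auth/cloud-platform".to_string(),
        aud: token_uri.to_string(),
        iat: now_secs,
        exp: now_secs + TOKEN_LIFETIME_SECS,
    }
}

/// Current Unix time in seconds (0 if the clock is before the epoch).
fn unix_now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}
```

Passing the timestamp in as a parameter keeps build_claims deterministic, which makes the one-hour expiry easy to check in a unit test.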

GeminiCoreConf is loaded from the core_conf database table:

pub struct GeminiCoreConf {
    pub google_project_id: String,   // GCP project ID
    pub google_location: String,     // Region (e.g. "us-central1")
    pub google_client_email: String, // Service account email
    pub google_private_key: String,  // PEM RSA private key
    pub google_token_uri: String,    // OAuth2 token endpoint
}

The setup message configures the model, voice, VAD, and tools:

pub struct GeminiRealtimeSessionConfig {
    pub model: String, // e.g. "models/gemini-2.5-flash"
    pub generation_config: GeminiGenerationConfig,
    pub system_instruction: GeminiSystemInstruction,
    pub realtime_input_config: GeminiRealtimeInputConfig,
    pub tools: Option<Vec<GeminiToolDeclaration>>,
}

Voice is set through a nested config path: generation_config.speech_config.voice_config.prebuilt_voice_config.voice_name (e.g. "Puck", "Charon", "Kore").
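
The nesting is easier to see as types. The following is a reduced sketch: only the field path comes from the text above, while every type name (including ExampleGenerationConfig, which stands in for GeminiGenerationConfig and omits its other fields) and the helper generation_config_with_voice are hypothetical.

```rust
// Hypothetical types mirroring the documented field path only.
#[derive(Debug, Clone)]
struct PrebuiltVoiceConfig {
    voice_name: String, // e.g. "Puck", "Charon", "Kore"
}

#[derive(Debug, Clone)]
struct VoiceConfig {
    prebuilt_voice_config: PrebuiltVoiceConfig,
}

#[derive(Debug, Clone)]
struct SpeechConfig {
    voice_config: VoiceConfig,
}

/// Reduced stand-in for GeminiGenerationConfig (other fields omitted).
#[derive(Debug, Clone)]
struct ExampleGenerationConfig {
    speech_config: SpeechConfig,
}

/// Build the nested config for a given prebuilt voice name.
fn generation_config_with_voice(voice_name: &str) -> ExampleGenerationConfig {
    ExampleGenerationConfig {
        speech_config: SpeechConfig {
            voice_config: VoiceConfig {
                prebuilt_voice_config: PrebuiltVoiceConfig {
                    voice_name: voice_name.to_string(),
                },
            },
        },
    }
}
```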

VAD uses GeminiAutomaticActivityDetection with start_of_speech_sensitivity, end_of_speech_sensitivity, silence_duration_ms, and activity_handling (e.g. "INTERRUPT_AND_RESPOND").

Twilio sends µ-law 8 kHz audio. Gemini expects PCM16 and outputs at 24 kHz. The codec utilities handle all conversions:

| Function | Direction | Description |
|---|---|---|
| transcode_mulaw_b64_to_pcm16_b64 | Twilio → Gemini | Base64 µ-law → base64 PCM16 |
| transcode_pcm16_b64_to_mulaw_b64 | Gemini → Twilio | Base64 PCM16 → downsample → base64 µ-law |
| mulaw_to_pcm16 | | Raw µ-law bytes → PCM16 LE bytes |
| pcm16_to_mulaw | | PCM16 LE bytes → µ-law bytes |
| downsample_pcm16 | | Integer decimation (e.g. 24000 → 8000) |

Runtime constants: input MIME type audio/pcm;rate=8000, Gemini output rate 24000 Hz, Twilio rate 8000 Hz.
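
To make the two low-level conversions concrete, here is a minimal standard-library sketch of G.711 µ-law decoding and of integer decimation on PCM16 little-endian bytes. The function names are hypothetical and chosen not to clash with the module's own utilities; the module's actual implementations may differ. Note that plain decimation applies no low-pass filter.

```rust
/// Decode one G.711 µ-law byte into a 16-bit linear PCM sample.
fn mulaw_byte_to_pcm16(byte: u8) -> i16 {
    let u = !byte; // µ-law bytes are transmitted bit-inverted
    let exponent = (u >> 4) & 0x07;
    let mantissa = (u & 0x0F) as i32;
    // Rebuild the magnitude, then remove the encoder bias (0x84).
    let magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
    if u & 0x80 != 0 {
        -magnitude as i16
    } else {
        magnitude as i16
    }
}

/// Decode µ-law bytes into PCM16 little-endian bytes (two bytes per sample).
fn mulaw_to_pcm16_le(mulaw: &[u8]) -> Vec<u8> {
    mulaw
        .iter()
        .flat_map(|&b| mulaw_byte_to_pcm16(b).to_le_bytes())
        .collect()
}

/// Keep every `factor`-th sample of a PCM16 LE byte stream.
/// For 24000 → 8000 the factor is 3.
fn decimate_pcm16_le(pcm: &[u8], factor: usize) -> Vec<u8> {
    assert!(factor > 0, "decimation factor must be non-zero");
    pcm.chunks_exact(2) // one chunk per 16-bit sample
        .step_by(factor)
        .flatten()
        .copied()
        .collect()
}
```

Because the declared input format is audio/pcm;rate=8000, the Twilio → Gemini direction needs only the decode step; decimation applies to Gemini's 24 kHz output on the way back.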

The receiver maps Gemini server messages to the canonical RealtimeInEvent enum:

| Gemini Event | RealtimeInEvent |
|---|---|
| tool_call with function_calls | ToolCall { call_id, function_name, arguments_json } |
| interrupted = true | SpeechStarted |
| First inline_data after interruption | ResponseStarted |
| inline_data (audio) | AudioDelta { delta } (transcoded to µ-law) |
| setupComplete, tool_call_cancellation | Unknown |

Both Text and Binary WebSocket frames are accepted — Vertex AI sends JSON as binary.
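
The mapping table can be read as a small state machine: an interruption sets a flag, and the next audio chunk clears it. The sketch below models that with simplified, hypothetical types. ServerMessage and EventMapper are not part of the module, the RealtimeInEvent variants carry only the fields named in the table, a single function call stands in for function_calls, and emitting ResponseStarted followed by AudioDelta for the same chunk is an assumption.

```rust
#[derive(Debug, Clone, PartialEq)]
enum RealtimeInEvent {
    ToolCall { call_id: String, function_name: String, arguments_json: String },
    SpeechStarted,
    ResponseStarted,
    AudioDelta { delta: String },
    Unknown,
}

/// Hypothetical, simplified view of an already-decoded Gemini server message.
enum ServerMessage {
    ToolCall { id: Option<String>, name: String, args_json: String },
    Interrupted,
    InlineAudio { mulaw_b64: String }, // audio already transcoded to µ-law
    SetupComplete,
    ToolCallCancellation,
}

/// Tracks whether the next audio chunk is the first one after an interruption.
struct EventMapper {
    awaiting_response_start: bool,
}

impl EventMapper {
    fn new() -> Self {
        Self { awaiting_response_start: false }
    }

    fn map(&mut self, msg: ServerMessage) -> Vec<RealtimeInEvent> {
        match msg {
            ServerMessage::ToolCall { id, name, args_json } => vec![RealtimeInEvent::ToolCall {
                call_id: id.unwrap_or_default(), // id may be absent
                function_name: name,
                arguments_json: args_json,
            }],
            ServerMessage::Interrupted => {
                self.awaiting_response_start = true;
                vec![RealtimeInEvent::SpeechStarted]
            }
            ServerMessage::InlineAudio { mulaw_b64 } => {
                let mut events = Vec::new();
                if self.awaiting_response_start {
                    self.awaiting_response_start = false;
                    events.push(RealtimeInEvent::ResponseStarted);
                }
                events.push(RealtimeInEvent::AudioDelta { delta: mulaw_b64 });
                events
            }
            ServerMessage::SetupComplete | ServerMessage::ToolCallCancellation => {
                vec![RealtimeInEvent::Unknown]
            }
        }
    }
}
```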

Tools are declared in the setup message:

pub struct GeminiToolDeclaration {
    pub function_declarations: Vec<GeminiFunctionDeclaration>,
}

pub struct GeminiFunctionDeclaration {
    pub name: String,
    pub description: String,
    pub parameters: serde_json::Value, // JSON Schema
}

Tool call responses use GeminiToolResponseMessage. Its id field is optional: Gemini’s native-audio models omit it, and responses are then correlated by function name.

The sender’s send_tool_response accepts call_id, function_name, and output — unlike OpenAI, Gemini requires function_name for response correlation.

| Constant | Value |
|---|---|
| VERTEX_AI_LIVE_WS_URL_TEMPLATE | wss://{location}-aiplatform.googleapis.com/ws/...BidiGenerateContent |

  • Agent — defines RealtimeSessionSender/Receiver traits this module implements
  • OpenAI — the other realtime provider implementation
  • Twilio — forwards µ-law audio to/from the session