Architecture Overview

This guide explains how the gemini-genai-rs workspace is structured, how data flows through the system, and how to decide which layer to build on.

The Three-Crate Stack

The workspace is organized into three crates, each adding a layer of abstraction on top of the one below:

+--------------------------------------------------+
|  gemini-adk-fluent-rs  (L2)  - Fluent DX         |
|  Live::builder(), operator algebra, composition  |
+--------------------------------------------------+
|  gemini-adk-rs  (L1)  - Agent Runtime            |
|  LiveSessionBuilder, callbacks, tool dispatch,   |
|  state, phases, watchers, extractors, telemetry  |
+--------------------------------------------------+
|  gemini-genai-rs  (L0)  - Wire Protocol          |
|  SessionHandle, SessionConfig, Transport, Codec, |
|  AuthProvider, events, commands, VAD, buffers    |
+--------------------------------------------------+
|            Gemini Multimodal Live API            |
|       (WebSocket, full-duplex audio/text)        |
+--------------------------------------------------+

L0: gemini-genai-rs

The wire protocol crate. Maps 1:1 to the Gemini API surface. No application logic, no opinions about how you structure your app.

What it provides:

  • SessionConfig for building the setup message (model, voice, tools, VAD)
  • SessionHandle for sending commands and subscribing to events
  • ConnectBuilder for establishing the WebSocket connection
  • Transport / Codec / AuthProvider traits for pluggable I/O
  • SessionEvent enum (17 variants) for everything the server can send
  • SessionCommand enum (9 variants) for everything you can send
  • SessionPhase FSM with validated transitions
  • Audio buffers (lock-free SPSC ring buffer, adaptive jitter buffer)
  • Client-side VAD (Voice Activity Detection)
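To illustrate what "FSM with validated transitions" means in practice, here is a minimal std-only sketch. The `Phase` variants, the transition table, and the `PhaseFsm` type are all hypothetical stand-ins chosen for illustration; the real SessionPhase in gemini-genai-rs may define different states and rules.

```rust
/// Hypothetical connection phases; the real SessionPhase variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Disconnected,
    Connecting,
    Active,
    Closed,
}

impl Phase {
    /// Returns true if moving from `self` to `next` is in the allowed table.
    pub fn can_transition_to(self, next: Phase) -> bool {
        use Phase::*;
        matches!(
            (self, next),
            (Disconnected, Connecting)
                | (Connecting, Active)
                | (Connecting, Disconnected)
                | (Active, Disconnected)
                | (Active, Closed)
                | (Disconnected, Closed)
        )
    }
}

/// Holds the current phase and rejects transitions outside the table.
pub struct PhaseFsm {
    current: Phase,
}

impl PhaseFsm {
    pub fn new() -> Self {
        Self { current: Phase::Disconnected }
    }

    pub fn current(&self) -> Phase {
        self.current
    }

    /// Moves to `next` on success; on failure the phase is left unchanged
    /// and the rejected (from, to) pair is returned.
    pub fn transition(&mut self, next: Phase) -> Result<Phase, (Phase, Phase)> {
        if self.current.can_transition_to(next) {
            self.current = next;
            Ok(next)
        } else {
            Err((self.current, next))
        }
    }
}
```

Keeping the table in one `matches!` expression means every legal edge is visible in a single place, and an illegal request such as Active to Connecting is reported instead of silently applied.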

L1: gemini-adk-rs

The agent runtime. Adds the event processing loop, typed callbacks, automatic tool dispatch, state management, and the phase machine.

What it provides:

  • LiveSessionBuilder to wire up config + callbacks + tools in one place
  • EventCallbacks with typed fast-lane (sync) and control-lane (async) hooks
  • ToolDispatcher with ToolFunction, StreamingTool, InputStreamingTool
  • State (concurrent key-value store with prefix scoping and delta tracking)
  • PhaseMachine for multi-step conversation flows
  • WatcherRegistry for state-reactive triggers
  • TranscriptBuffer for accumulating conversation history
  • TurnExtractor / LlmExtractor for structured data extraction
  • LiveHandle as the runtime API surface
  • SessionSignals + SessionTelemetry for observability
  • TextAgent trait and combinators for text-based LLM pipelines
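The State item above combines two ideas: keys grouped under a prefix, and a record of what changed since the last check. A rough sketch of both, assuming string values and a single mutex (the `ScopedState` type and its methods are hypothetical, not the gemini-adk-rs API):

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::sync::Mutex;

#[derive(Default)]
struct Inner {
    values: BTreeMap<String, String>,
    dirty: BTreeSet<String>, // full keys written since the last take_delta()
}

/// Hypothetical scoped store; keys are stored as "<scope>:<key>".
pub struct ScopedState {
    inner: Mutex<Inner>,
}

impl ScopedState {
    pub fn new() -> Self {
        Self { inner: Mutex::new(Inner::default()) }
    }

    /// Writes a value under the given scope and marks the key as changed.
    pub fn set(&self, scope: &str, key: &str, value: &str) {
        let full_key = format!("{scope}:{key}");
        let mut inner = self.inner.lock().unwrap();
        inner.values.insert(full_key.clone(), value.to_string());
        inner.dirty.insert(full_key);
    }

    pub fn get(&self, scope: &str, key: &str) -> Option<String> {
        let inner = self.inner.lock().unwrap();
        inner.values.get(&format!("{scope}:{key}")).cloned()
    }

    /// Returns the changed keys in sorted order and resets the change set.
    pub fn take_delta(&self) -> Vec<String> {
        let mut inner = self.inner.lock().unwrap();
        std::mem::take(&mut inner.dirty).into_iter().collect()
    }

    /// Drops every key in one scope, e.g. per-turn values at turn end.
    /// This sketch does not record removals in the delta.
    pub fn clear_scope(&self, scope: &str) {
        let prefix = format!("{scope}:");
        let mut inner = self.inner.lock().unwrap();
        inner.values.retain(|k, _| !k.starts_with(&prefix));
    }
}
```

With this shape, a watcher-style consumer only has to look at the keys returned by `take_delta()` rather than rescanning the whole store after every event.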

L2: gemini-adk-fluent-rs

The fluent developer experience layer. Wraps L1 in a chainable builder API and adds operator-algebra composition (S, C, T, P, M modules).

What it provides:

  • Live::builder() with method chaining for the entire configuration surface
  • .phase() / .watch() / .computed() sub-builders that flow back naturally
  • .connect_google_ai() / .connect_vertex() for one-line connection
  • T::simple(), T::google_search() for composable tool registration
  • S, C, T, P, M operator modules with | composition
  • let_clone! macro for reducing Arc::clone boilerplate in closures
  • Test utilities and mock helpers
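The `|` composition mentioned above can be built on Rust's `BitOr` trait. The sketch below shows the general mechanism only; `ToolSet` and `single` are hypothetical names, and the real T module may carry full tool definitions rather than names.

```rust
use std::ops::BitOr;

/// Hypothetical composable set of tool names.
#[derive(Debug, Clone, PartialEq)]
pub struct ToolSet {
    names: Vec<String>,
}

impl ToolSet {
    /// A set containing exactly one tool.
    pub fn single(name: &str) -> Self {
        Self { names: vec![name.to_string()] }
    }

    pub fn names(&self) -> &[String] {
        &self.names
    }
}

impl BitOr for ToolSet {
    type Output = ToolSet;

    /// `a | b` keeps the tools of `a` in order, then appends any tool of `b`
    /// that is not already present.
    fn bitor(mut self, rhs: ToolSet) -> ToolSet {
        for name in rhs.names {
            if !self.names.contains(&name) {
                self.names.push(name);
            }
        }
        self
    }
}
```

Because `bitor` returns the same type it consumes, expressions chain naturally: `ToolSet::single("a") | ToolSet::single("b") | ToolSet::single("c")`.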

Data Flow

Here is how data moves through the system during a live session:

  Client App                    gemini-genai-rs                   Gemini API
  ----------                    --------------                   ----------

  Microphone
      |
      v
  [PCM16 16kHz] --send_audio()--> SessionHandle --WebSocket--> Gemini Live
                                       |                          |
                                  SessionCommand                  |
                                  (mpsc channel)                  |
                                       |                          |
                                  Transport::send()               |
                                       |                          v
                                       |                    Model processes
                                       |                    audio/text/tools
                                       |                          |
                                  Transport::recv()               |
                                       |                          |
                                  Codec::decode()                 |
                                       |                          |
                                  SessionEvent          <--- WebSocket frames
                                  (broadcast channel)
                                       |
                              +--------+--------+
                              |        |        |
                          Fast Lane  Ctrl Lane  Telemetry Lane
                              |        |        |
                          on_audio  on_tool  SessionSignals
                          on_text   phases   SessionTelemetry
                          on_vad    extract
                              |        |
                              v        v
                          Speaker   State
                          Display   Updates

Outbound path: Your app calls send_audio() / send_text() on the LiveHandle (L1/L2) or SessionHandle (L0). These become SessionCommand variants sent through an mpsc channel to the transport loop, which encodes them via the Codec and sends them over the WebSocket.

Inbound path: The transport loop receives WebSocket frames, decodes them via the Codec into SessionEvent variants, and broadcasts them. The three-lane processor (L1) routes each event to the appropriate lane.
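The outbound path described above (commands pushed into an mpsc channel, drained by a transport loop, encoded, then written to the socket) can be sketched with std threads and channels. Everything here is an illustrative assumption: the real crate is async, the `Command` enum is a reduced stand-in for SessionCommand, and the tagged-string `encode` replaces the JSON codec.

```rust
use std::sync::mpsc;
use std::thread;

/// Reduced stand-in for SessionCommand.
#[derive(Debug)]
pub enum Command {
    SendText(String),
    SendAudio(Vec<u8>),
    Close,
}

/// Stand-in codec: produces a readable tag instead of the real wire format.
/// Returns None for Close, which ends the transport loop.
pub fn encode(cmd: &Command) -> Option<String> {
    match cmd {
        Command::SendText(text) => Some(format!("text:{text}")),
        Command::SendAudio(pcm) => Some(format!("audio:{}", pcm.len())),
        Command::Close => None,
    }
}

/// Sends `commands` through a channel to a transport thread and returns the
/// frames that thread would have written to the socket, in order.
pub fn run_outbound(commands: Vec<Command>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Command>();

    // Transport loop: owns the receiving end and the (simulated) socket.
    let transport = thread::spawn(move || {
        let mut frames = Vec::new();
        for cmd in rx {
            match encode(&cmd) {
                Some(frame) => frames.push(frame),
                None => break,
            }
        }
        frames
    });

    // Application side: a send fails only if the transport loop has exited.
    for cmd in commands {
        if tx.send(cmd).is_err() {
            break;
        }
    }
    drop(tx); // closing the channel lets the loop finish without a Close

    transport.join().expect("transport thread panicked")
}
```

The application never touches the socket directly; it only enqueues commands, which is what lets several callers share one connection.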

Three-Lane Processor

Audio arrives at 40-100 events per second, while a single tool dispatch can take 1-30 seconds. If both shared one processing loop, a slow tool call would stall audio delivery and cause audible stutter. The solution: split the event stream into three priority lanes.

Fast Lane (sync, <1ms)

Handles latency-sensitive events with sync callbacks that must never block:

Event                 Callback
-----                 --------
AudioData             on_audio(&Bytes)
TextDelta             on_text(&str)
TextComplete          on_text_complete(&str)
InputTranscription    on_input_transcript(&str, bool)
OutputTranscription   on_output_transcript(&str, bool)
VoiceActivityStart    on_vad_start()
VoiceActivityEnd      on_vad_end()
Interrupted           Sets interrupted flag, stops forwarding audio

Fast lane callbacks are Fn (not FnMut, not async). If your callback takes longer than 1ms, audio playback will stutter.
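Because fast lane callbacks are `Fn`, a closure cannot mutate captured variables directly; shared counters or flags need interior mutability such as atomics. A minimal sketch of that pattern, using `&[u8]` as a std-only stand-in for `&Bytes` (the `AudioCallback` alias and helper functions are hypothetical):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical callback type: sync, shareable, and callable through `&self`.
pub type AudioCallback = Box<dyn Fn(&[u8]) + Send + Sync>;

/// Builds a callback that adds each chunk's length to a shared counter.
/// An atomic add is cheap and never blocks, so it suits a sub-millisecond budget.
pub fn counting_callback(total_bytes: Arc<AtomicUsize>) -> AudioCallback {
    Box::new(move |chunk: &[u8]| {
        total_bytes.fetch_add(chunk.len(), Ordering::Relaxed);
    })
}

/// Simulates the fast lane invoking the callback once per audio chunk.
pub fn deliver(callback: &AudioCallback, chunks: &[Vec<u8>]) {
    for chunk in chunks {
        callback(chunk.as_slice());
    }
}
```

Anything heavier than this (file I/O, locks that may contend, network calls) is better handed off to another thread or to the control lane.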

Control Lane (async, can block)

Handles events that require I/O, state mutation, or multi-step processing:

Event               Callback
-----               --------
ToolCall            on_tool_call (auto-dispatch or manual)
ToolCallCancelled   Cancels pending tool tasks
Interrupted         on_interrupted()
TurnComplete        Extractors, phase transitions, on_turn_complete()
GoAway              on_go_away(Duration)
Connected           on_connected()
Disconnected        on_disconnected(Option<String>)
Error               on_error(String)

The control lane also owns the TranscriptBuffer directly, so no Arc<Mutex<>> is needed, and it runs extractors concurrently via join_all.
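Single ownership is what removes the need for a lock: if only one task ever touches the buffer, it can be a plain local variable. A sketch of that idea with a std thread standing in for the async control lane (the `ControlEvent` enum and `run_control_lane` are hypothetical, and the buffer is reduced to a `Vec<String>`):

```rust
use std::sync::mpsc;
use std::thread;

/// Reduced stand-in for the events the control lane sees.
pub enum ControlEvent {
    Transcript(String),
    TurnComplete,
}

/// Feeds `events` to a lane thread and returns one joined transcript per turn.
pub fn run_control_lane(events: Vec<ControlEvent>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<ControlEvent>();

    let lane = thread::spawn(move || {
        // Owned by this thread alone: no Arc, no Mutex.
        let mut buffer: Vec<String> = Vec::new();
        let mut turns: Vec<String> = Vec::new();
        for event in rx {
            match event {
                ControlEvent::Transcript(text) => buffer.push(text),
                ControlEvent::TurnComplete => {
                    // A real lane would run extractors over the buffer here.
                    turns.push(buffer.join(" "));
                    buffer.clear();
                }
            }
        }
        turns
    });

    for event in events {
        tx.send(event).expect("control lane stopped early");
    }
    drop(tx); // closing the channel ends the lane's loop

    lane.join().expect("control lane panicked")
}
```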

Telemetry Lane (async, debounced)

Runs on its own broadcast receiver. Collects SessionSignals (activity timestamps, timing, token usage) and SessionTelemetry (atomic counters for audio chunks, tool calls, interruptions, latency tracking, token counts). Flushes periodically (100ms debounce) with zero overhead on the hot path.

The telemetry lane also handles UsageMetadata events from the Gemini API, recording prompt/response/cached/thoughts token counts in both SessionSignals (as session: state keys) and SessionTelemetry (as atomic counters). The .on_usage() callback fires here for real-time token observation.
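The two ingredients named above, atomic counters and a periodic flush, can be sketched as follows. This is an assumption-laden illustration: `Counters` and `Debouncer` are hypothetical types, only two counters are shown, and the flush gate simply allows at most one flush per interval. The current time is passed in explicitly so the behaviour is easy to check.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical lock-free counters, updated from any lane.
#[derive(Default)]
pub struct Counters {
    audio_chunks: AtomicU64,
    tool_calls: AtomicU64,
}

impl Counters {
    pub fn record_audio_chunk(&self) {
        self.audio_chunks.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_tool_call(&self) {
        self.tool_calls.fetch_add(1, Ordering::Relaxed);
    }

    /// Returns (audio_chunks, tool_calls) as currently observed.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.audio_chunks.load(Ordering::Relaxed),
            self.tool_calls.load(Ordering::Relaxed),
        )
    }
}

/// Allows at most one flush per `interval`.
pub struct Debouncer {
    interval: Duration,
    last_flush: Option<Instant>,
}

impl Debouncer {
    pub fn new(interval: Duration) -> Self {
        Self { interval, last_flush: None }
    }

    /// Returns true (and records `now`) if no flush has happened yet or the
    /// interval has elapsed since the last one; otherwise returns false.
    pub fn should_flush(&mut self, now: Instant) -> bool {
        match self.last_flush {
            Some(last) if now.duration_since(last) < self.interval => false,
            _ => {
                self.last_flush = Some(now);
                true
            }
        }
    }
}
```

The recording side pays only for an atomic increment; the cost of building and publishing a snapshot is confined to the flush, which runs off the audio path.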

The Router

The router is the zero-work dispatcher that sits between the broadcast channel and the fast and control lanes (the telemetry lane reads from its own broadcast receiver). It pattern-matches each SessionEvent and sends it to the correct lane(s) via mpsc channels. No session signals, no telemetry, no allocations on the hot path.
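A router of this kind is essentially one `match` statement in front of two channels. The sketch below uses std channels and a reduced, hypothetical `Event` enum; the lane assignment follows the tables above, including Interrupted going to both lanes. Cloning in that one arm is a simplification of this sketch, not a claim about how the real router avoids allocations.

```rust
use std::sync::mpsc::{self, Receiver, Sender};

/// Reduced stand-in for SessionEvent.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    AudioData(Vec<u8>),
    TextDelta(String),
    ToolCall(String),
    Interrupted,
    TurnComplete,
}

/// Hypothetical router holding the sending ends of the two lanes.
pub struct Router {
    fast: Sender<Event>,
    control: Sender<Event>,
}

impl Router {
    /// Returns the router plus the fast and control lane receivers, in that order.
    pub fn new() -> (Self, Receiver<Event>, Receiver<Event>) {
        let (fast, fast_rx) = mpsc::channel();
        let (control, control_rx) = mpsc::channel();
        (Self { fast, control }, fast_rx, control_rx)
    }

    /// Forwards one event to its lane(s). Send errors are ignored here: they
    /// only occur when the receiving lane has already shut down.
    pub fn route(&self, event: Event) {
        match event {
            Event::AudioData(_) | Event::TextDelta(_) => {
                let _ = self.fast.send(event);
            }
            Event::ToolCall(_) | Event::TurnComplete => {
                let _ = self.control.send(event);
            }
            Event::Interrupted => {
                let _ = self.fast.send(event.clone());
                let _ = self.control.send(event);
            }
        }
    }
}
```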

Key Traits

Trait                 Crate                                     Purpose
-----                 -----                                     -------
Transport             L0 (gemini_genai_rs::transport::ws)       Bidirectional byte transport (WebSocket or mock)
Codec                 L0 (gemini_genai_rs::transport::codec)    Encode commands / decode server messages (JSON default)
AuthProvider          L0 (gemini_genai_rs::transport::auth)     URL construction + auth headers (Google AI / Vertex AI)
SessionWriter         L0 (gemini_genai_rs::session)             Send audio/text/video/tools/instructions (trait object safe)
SessionReader         L0 (gemini_genai_rs::session)             Subscribe to events, observe phase
ToolFunction          L1 (gemini_adk_rs::tool)                  One-shot tool: call(args) -> Result<Value>
StreamingTool         L1 (gemini_adk_rs::tool)                  Background tool yielding multiple results
InputStreamingTool    L1 (gemini_adk_rs::tool)                  Tool receiving live input while running
TurnExtractor         L1 (gemini_adk_rs::live::extractor)       Extract structured data from transcript window
TextAgent             L1 (gemini_adk_rs::text)                  Text-based LLM agent (generate(), not Live)
BaseLlm               L1 (gemini_adk_rs::llm)                   LLM abstraction for generate() calls
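To make the one-shot tool row concrete, here is a simplified sketch of a tool trait plus a name-keyed dispatcher. It is not the gemini-adk-rs definition: the `Tool` trait, `Upper`, and `Dispatcher` are hypothetical, the methods are synchronous, and plain strings stand in for the JSON `Value` arguments and results.

```rust
use std::collections::HashMap;

/// Hypothetical one-shot tool: takes arguments, returns a result or an error.
pub trait Tool: Send + Sync {
    fn name(&self) -> &str;
    fn call(&self, args: &str) -> Result<String, String>;
}

/// Example tool that upper-cases its argument.
pub struct Upper;

impl Tool for Upper {
    fn name(&self) -> &str {
        "upper"
    }

    fn call(&self, args: &str) -> Result<String, String> {
        Ok(args.to_uppercase())
    }
}

/// Looks tools up by name, which is the core of automatic dispatch.
pub struct Dispatcher {
    tools: HashMap<String, Box<dyn Tool>>,
}

impl Dispatcher {
    pub fn new() -> Self {
        Self { tools: HashMap::new() }
    }

    /// Registers a tool under the name it reports.
    pub fn register(&mut self, tool: Box<dyn Tool>) {
        self.tools.insert(tool.name().to_string(), tool);
    }

    /// Runs the named tool, or reports that no such tool is registered.
    pub fn dispatch(&self, name: &str, args: &str) -> Result<String, String> {
        match self.tools.get(name) {
            Some(tool) => tool.call(args),
            None => Err(format!("unknown tool: {name}")),
        }
    }
}
```

Storing tools as trait objects is what allows unrelated tool types to live in one registry and be selected at runtime from the name in a tool call.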

Choosing Your Layer

Use L0 (gemini-genai-rs) if you need:

  • Raw WebSocket access with no abstraction overhead
  • Custom event loop logic that does not fit the callback model
  • A custom transport (e.g., Unix domain socket, QUIC)
  • To build your own agent runtime
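  • Maximum control over every message sent and received

// Runs inside an async fn that returns a Result (note the .await and ?).
use gemini_genai_rs::prelude::*;

// Build the setup message: endpoint/credentials plus the model to use.
let config = SessionConfig::from_endpoint(ApiEndpoint::google_ai("YOUR_KEY"))
    .model(GeminiModel::Gemini2_0FlashLive);

// Open the WebSocket and subscribe to server events.
let handle = ConnectBuilder::new(config).build().await?;
let mut events = handle.subscribe();

// Send one text turn, then print streamed text until the turn completes.
handle.send_text("Hello").await?;
while let Some(event) = recv_event(&mut events).await {
    match event {
        SessionEvent::TextDelta(text) => print!("{text}"),
        SessionEvent::TurnComplete => break,
        _ => {} // ignore audio, VAD, and other events in this example
    }
}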

Use L1 (gemini-adk-rs) if you need:

  • Automatic tool dispatch without manual message matching
  • State management with prefix scoping (session:, turn:, app:)
  • Phase machine for multi-step conversation flows
  • Turn extraction (LLM-based or custom) between turns
  • Telemetry and session signals
  • Full control over callback registration without the fluent syntax

Use L2 (gemini-adk-fluent-rs) if you want:

  • The fastest path from zero to working voice agent
  • Chainable builder API with sub-builders for phases and watchers
  • Operator algebra for composing tools (T::simple() | T::google_search())
  • One-line connection (connect_vertex(project, location, token))
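  • Sensible defaults (auto-enables transcription when extractors are used)

// Runs inside an async fn that returns a Result (note the .await and ?).
use gemini_adk_fluent_rs::prelude::*;

// Configure model, voice, instruction, and callbacks in one chain,
// then connect with a Google AI API key.
let handle = Live::builder()
    .model(GeminiModel::Gemini2_0FlashLive)
    .voice(Voice::Kore)
    .instruction("You are a helpful assistant")
    .on_audio(|data| { /* play audio */ })   // fast lane: must not block
    .on_text(|t| print!("{t}"))              // fast lane: streamed text
    .connect_google_ai("YOUR_KEY")
    .await?;

// Send a text turn and wait for the session to finish.
handle.send_text("Hello!").await?;
handle.done().await?;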

Most developers should start at L2 and drop to L1/L0 only when they hit a specific limitation. The layers are designed to compose: you can use L0 types (like SessionConfig) with L2 builders via Live::connect(config).