Architecture Overview
This guide explains how the gemini-genai-rs workspace is structured, how data flows through the system, and how to decide which layer to build on.
The Three-Crate Stack
The workspace is organized into three crates, each adding a layer of abstraction on top of the one below:
+--------------------------------------------------+
| gemini-adk-fluent-rs (L2): Fluent DX             |
| Live::builder(), operator algebra, composition   |
+--------------------------------------------------+
| gemini-adk-rs (L1): Agent Runtime                |
| LiveSessionBuilder, callbacks, tool dispatch,    |
| state, phases, watchers, extractors, telemetry   |
+--------------------------------------------------+
| gemini-genai-rs (L0): Wire Protocol              |
| SessionHandle, SessionConfig, Transport, Codec,  |
| AuthProvider, events, commands, VAD, buffers     |
+--------------------------------------------------+
| Gemini Multimodal Live API                       |
| (WebSocket, full-duplex audio/text)              |
+--------------------------------------------------+
L0: gemini-genai-rs
The wire protocol crate. Maps 1:1 to the Gemini API surface. No application logic, no opinions about how you structure your app.
What it provides:
- SessionConfig for building the setup message (model, voice, tools, VAD)
- SessionHandle for sending commands and subscribing to events
- ConnectBuilder for establishing the WebSocket connection
- Transport / Codec / AuthProvider traits for pluggable I/O
- SessionEvent enum (17 variants) for everything the server can send
- SessionCommand enum (9 variants) for everything you can send
- SessionPhase FSM with validated transitions
- Audio buffers (lock-free SPSC ring buffer, adaptive jitter buffer)
- Client-side VAD (Voice Activity Detection)
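To make "FSM with validated transitions" concrete, here is a minimal sketch of the idea. The phase names and the allowed moves are hypothetical placeholders chosen for illustration; the actual SessionPhase variants and transition table may differ.

```rust
/// Hypothetical phases; the real SessionPhase variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Phase {
    Disconnected,
    Connecting,
    Active,
    Closed,
}

impl Phase {
    /// Returns true if moving from `self` to `next` is an allowed transition.
    pub fn can_transition_to(self, next: Phase) -> bool {
        matches!(
            (self, next),
            (Phase::Disconnected, Phase::Connecting)
                | (Phase::Connecting, Phase::Active)
                | (Phase::Connecting, Phase::Disconnected)
                | (Phase::Active, Phase::Closed)
                | (Phase::Active, Phase::Disconnected)
        )
    }

    /// Attempts the transition; a rejected move is reported instead of applied.
    pub fn transition(self, next: Phase) -> Result<Phase, String> {
        if self.can_transition_to(next) {
            Ok(next)
        } else {
            Err(format!("invalid transition: {:?} -> {:?}", self, next))
        }
    }
}
```

Keeping the whole transition table in one matches! expression means an illegal move surfaces as an Err at the call site and never leaves the session in an inconsistent phase.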
L1: gemini-adk-rs
The agent runtime. Adds the event processing loop, typed callbacks, automatic tool dispatch, state management, and the phase machine.
What it provides:
- LiveSessionBuilder to wire up config + callbacks + tools in one place
- EventCallbacks with typed fast-lane (sync) and control-lane (async) hooks
- ToolDispatcher with ToolFunction, StreamingTool, InputStreamingTool
- State (concurrent key-value store with prefix scoping and delta tracking)
- PhaseMachine for multi-step conversation flows
- WatcherRegistry for state-reactive triggers
- TranscriptBuffer for accumulating conversation history
- TurnExtractor / LlmExtractor for structured data extraction
- LiveHandle as the runtime API surface
- SessionSignals + SessionTelemetry for observability
- TextAgent trait and combinators for text-based LLM pipelines
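Prefix scoping and delta tracking are easier to picture with a small example. The sketch below is a single-threaded stand-in, not the crate's State type (which is described as concurrent); ScopedState and all of its methods are hypothetical names.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a prefix-scoped store; not the crate's State type.
#[derive(Default)]
pub struct ScopedState {
    values: BTreeMap<String, String>,
    /// Keys written since the last `take_delta` call.
    dirty: BTreeSet<String>,
}

impl ScopedState {
    /// Writes `value` under `prefix` + `key` (e.g. "session:" + "user") and marks it dirty.
    pub fn set(&mut self, prefix: &str, key: &str, value: &str) {
        let full = format!("{prefix}{key}");
        self.values.insert(full.clone(), value.to_string());
        self.dirty.insert(full);
    }

    /// Reads the value stored under `prefix` + `key`, if any.
    pub fn get(&self, prefix: &str, key: &str) -> Option<&str> {
        self.values.get(&format!("{prefix}{key}")).map(String::as_str)
    }

    /// Removes every key in one scope, e.g. dropping "turn:" keys when a turn ends.
    pub fn clear_scope(&mut self, prefix: &str) {
        self.values.retain(|k, _| !k.starts_with(prefix));
        self.dirty.retain(|k| !k.starts_with(prefix));
    }

    /// Returns the keys changed since the previous call (sorted), then resets the tracker.
    pub fn take_delta(&mut self) -> Vec<String> {
        std::mem::take(&mut self.dirty).into_iter().collect()
    }
}
```

A delta like this is what lets a state-reactive component, such as a watcher, look only at the keys that changed since its last pass.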
L2: gemini-adk-fluent-rs
The fluent developer experience layer. Wraps L1 in a chainable builder API and adds operator-algebra composition (S, C, T, P, M modules).
What it provides:
- Live::builder() with method chaining for the entire configuration surface
- .phase() / .watch() / .computed() sub-builders that flow back naturally
- .connect_google_ai() / .connect_vertex() for one-line connection
- T::simple(), T::google_search() for composable tool registration
- S, C, T, P, M operator modules with | composition
- let_clone! macro for reducing Arc::clone boilerplate in closures
- Test utilities and mock helpers
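The | composition can be understood as an ordinary BitOr implementation. The sketch below shows the general technique only; ToolSet and ToolSet::single are hypothetical names, and the real T module's types are not shown in this guide.

```rust
use std::ops::BitOr;

/// Hypothetical tool set used to illustrate `|` composition.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ToolSet {
    pub names: Vec<String>,
}

impl ToolSet {
    /// Creates a set holding one named tool.
    pub fn single(name: &str) -> Self {
        ToolSet { names: vec![name.to_string()] }
    }
}

/// `a | b` concatenates the two sets, keeping left-to-right order.
impl BitOr for ToolSet {
    type Output = ToolSet;

    fn bitor(mut self, rhs: ToolSet) -> ToolSet {
        self.names.extend(rhs.names);
        self
    }
}
```

Because bitor takes both operands by value and returns the same type, chains such as a | b | c compose without intermediate builders.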
Data Flow
Here is how data moves through the system during a live session:
Client App                     gemini-genai-rs               Gemini API
----------                     --------------                ----------
Microphone
      |
      v
[PCM16 16kHz] --send_audio()--> SessionHandle --WebSocket--> Gemini Live
                                      |                           |
                               SessionCommand                     |
                               (mpsc channel)                     |
                                      |                           |
                              Transport::send()                   |
                                      |                           v
                                      |                    Model processes
                                      |                   audio/text/tools
                                      |                           |
                              Transport::recv()                   |
                                      |                           |
                               Codec::decode()                    |
                                      |                           |
                                SessionEvent         <--- WebSocket frames
                             (broadcast channel)
                                      |
                             +--------+--------+
                             |        |        |
                      Fast Lane   Ctrl Lane   Telemetry Lane
                             |        |        |
                         on_audio  on_tool    SessionSignals
                         on_text   phases     SessionTelemetry
                         on_vad    extract
                             |        |
                             v        v
                          Speaker   State
                          Display  Updates
Outbound path: Your app calls send_audio() / send_text() on the
LiveHandle (L1/L2) or SessionHandle (L0). These become SessionCommand
variants sent through an mpsc channel to the transport loop, which encodes
them via the Codec and sends them over the WebSocket.
Inbound path: The transport loop receives WebSocket frames, decodes them
via the Codec into SessionEvent variants, and broadcasts them. The
three-lane processor (L1) routes each event to the appropriate lane.
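The outbound path can be sketched with a plain channel and a worker thread. This is an illustration under stated assumptions: std::sync::mpsc and a thread stand in for whatever runtime the crate uses, Command is a hypothetical subset of SessionCommand, and encode is a made-up string format, not the real Codec.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical subset of commands; the real SessionCommand has more variants.
#[derive(Debug)]
pub enum Command {
    SendText(String),
    SendAudio(Vec<u8>),
    Close,
}

/// Hypothetical codec step: turns a command into a wire frame (a plain string here).
pub fn encode(cmd: &Command) -> String {
    match cmd {
        Command::SendText(t) => format!("text:{t}"),
        Command::SendAudio(pcm) => format!("audio:{}", pcm.len()),
        Command::Close => "close".to_string(),
    }
}

/// Spawns a transport loop that drains the command channel and records each frame.
/// A real loop would write the frames to a WebSocket instead of collecting them.
pub fn run_outbound(commands: Vec<Command>) -> Vec<String> {
    let (tx, rx) = mpsc::channel::<Command>();
    let transport = thread::spawn(move || {
        let mut sent = Vec::new();
        for cmd in rx {
            sent.push(encode(&cmd));
            if matches!(cmd, Command::Close) {
                break; // stop the loop once the close frame is recorded
            }
        }
        sent
    });
    for cmd in commands {
        // A send error means the loop already exited (after Close); stop sending.
        if tx.send(cmd).is_err() {
            break;
        }
    }
    drop(tx); // closing the channel lets the loop finish if no Close was sent
    transport.join().expect("transport loop panicked")
}
```

The caller never touches the socket: it only enqueues commands, which is why send_audio() and send_text() can stay cheap on the application side.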
Three-Lane Processor
Audio arrives at 40-100 events per second, which leaves roughly 10-25 ms per event. Tool dispatch can take 1-30 seconds. If both share one processing loop, a single tool call stalls audio delivery and playback stutters. The solution: split the event stream into three priority lanes.
Fast Lane (sync, <1ms)
Handles latency-sensitive events with sync callbacks that must never block:
| Event | Callback |
|---|---|
| AudioData | on_audio(&Bytes) |
| TextDelta | on_text(&str) |
| TextComplete | on_text_complete(&str) |
| InputTranscription | on_input_transcript(&str, bool) |
| OutputTranscription | on_output_transcript(&str, bool) |
| VoiceActivityStart | on_vad_start() |
| VoiceActivityEnd | on_vad_end() |
| Interrupted | Sets interrupted flag, stops forwarding audio |
Fast lane callbacks are Fn (not FnMut, not async). If your callback
takes longer than 1ms, audio playback will stutter.
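Because the callbacks are Fn, they cannot mutate captured variables directly; shared state has to go through interior mutability. The sketch below shows one common way to do that with an atomic counter. FastLane and counting_lane are hypothetical names, and &[u8] stands in for the &Bytes argument in the table above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical callback slot mirroring the Fn (not FnMut) bound described above.
pub struct FastLane {
    on_audio: Box<dyn Fn(&[u8]) + Send + Sync>,
}

impl FastLane {
    pub fn new(on_audio: impl Fn(&[u8]) + Send + Sync + 'static) -> Self {
        FastLane { on_audio: Box::new(on_audio) }
    }

    /// Invokes the callback synchronously, as a fast lane would for each audio chunk.
    pub fn dispatch(&self, chunk: &[u8]) {
        (self.on_audio)(chunk);
    }
}

/// Builds a lane whose callback counts bytes through an atomic,
/// so the closure needs no `&mut` access to its captured state.
pub fn counting_lane() -> (FastLane, Arc<AtomicUsize>) {
    let total = Arc::new(AtomicUsize::new(0));
    let counter = Arc::clone(&total);
    let lane = FastLane::new(move |chunk| {
        counter.fetch_add(chunk.len(), Ordering::Relaxed);
    });
    (lane, total)
}
```

An atomic add is a good fit for the sub-millisecond budget; anything that might block (locks under contention, I/O) is better handed off to a channel and processed elsewhere.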
Control Lane (async, can block)
Handles events that require I/O, state mutation, or multi-step processing:
| Event | Callback |
|---|---|
| ToolCall | on_tool_call (auto-dispatch or manual) |
| ToolCallCancelled | Cancels pending tool tasks |
| Interrupted | on_interrupted() |
| TurnComplete | Extractors, phase transitions, on_turn_complete() |
| GoAway | on_go_away(Duration) |
| Connected | on_connected() |
| Disconnected | on_disconnected(Option<String>) |
| Error | on_error(String) |
The control lane is also the sole owner of the TranscriptBuffer, so it needs
no Arc<Mutex<>>, and it runs extractors concurrently via join_all.
Telemetry Lane (async, debounced)
Runs on its own broadcast receiver. Collects SessionSignals (activity
timestamps, timing, token usage) and SessionTelemetry (atomic counters for
audio chunks, tool calls, interruptions, latency tracking, token counts).
Flushes periodically (100ms debounce) with zero overhead on the hot path.
The telemetry lane also handles UsageMetadata events from the Gemini API,
recording prompt/response/cached/thoughts token counts in both SessionSignals
(as session: state keys) and SessionTelemetry (as atomic counters). The
.on_usage() callback fires here for real-time token observation.
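The combination of atomic counters and a debounced flush can be sketched as follows. Telemetry and Debounce are hypothetical types written for this guide; the real SessionTelemetry fields and flush mechanism are not shown here.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{Duration, Instant};

/// Hypothetical counters; the real SessionTelemetry fields are not listed in this guide.
#[derive(Default)]
pub struct Telemetry {
    audio_chunks: AtomicU64,
    tool_calls: AtomicU64,
}

impl Telemetry {
    /// Recording is a single relaxed atomic add: no locks and no allocation.
    pub fn record_audio_chunk(&self) {
        self.audio_chunks.fetch_add(1, Ordering::Relaxed);
    }

    pub fn record_tool_call(&self) {
        self.tool_calls.fetch_add(1, Ordering::Relaxed);
    }

    /// Reads the counters as (audio_chunks, tool_calls) for a periodic flush.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.audio_chunks.load(Ordering::Relaxed),
            self.tool_calls.load(Ordering::Relaxed),
        )
    }
}

/// Allows at most one flush per `interval` (100 ms in the text above).
pub struct Debounce {
    interval: Duration,
    last_flush: Option<Instant>,
}

impl Debounce {
    pub fn new(interval: Duration) -> Self {
        Debounce { interval, last_flush: None }
    }

    /// Returns true when a flush should happen at `now`, and records that flush.
    pub fn ready(&mut self, now: Instant) -> bool {
        match self.last_flush {
            Some(prev) if now.duration_since(prev) < self.interval => false,
            _ => {
                self.last_flush = Some(now);
                true
            }
        }
    }
}
```

Passing the current time into ready() keeps the debounce logic deterministic and easy to test without sleeping.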
The Router
The router is the zero-work dispatcher that sits between the broadcast
channel and the fast and control lanes (the telemetry lane reads from its own
broadcast receiver). It pattern-matches each SessionEvent and sends it to the
correct lane(s) via mpsc channels. The router itself records no session
signals or telemetry and performs no allocations on the hot path.
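A router of this shape is little more than one match statement. The sketch below uses a hypothetical, trimmed-down Event enum and std::sync::mpsc channels; it illustrates the dispatch pattern only and does not reproduce the allocation-free property, since std channels allocate internally.

```rust
use std::sync::mpsc::{self, Receiver, Sender};

/// Hypothetical, trimmed-down event set; the real SessionEvent has 17 variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    AudioData(Vec<u8>),
    TextDelta(String),
    ToolCall(String),
    Interrupted,
    TurnComplete,
}

pub struct Router {
    fast: Sender<Event>,
    control: Sender<Event>,
}

impl Router {
    /// Returns the router plus the receiving ends for the fast and control lanes.
    pub fn new() -> (Router, Receiver<Event>, Receiver<Event>) {
        let (fast, fast_rx) = mpsc::channel();
        let (control, control_rx) = mpsc::channel();
        (Router { fast, control }, fast_rx, control_rx)
    }

    /// Pattern-matches the event and forwards it. Send errors (lane shut down) are ignored.
    pub fn route(&self, event: Event) {
        match event {
            Event::AudioData(_) | Event::TextDelta(_) => {
                let _ = self.fast.send(event);
            }
            // Interrupted is listed in both lane tables, so it goes to both lanes.
            Event::Interrupted => {
                let _ = self.fast.send(Event::Interrupted);
                let _ = self.control.send(Event::Interrupted);
            }
            Event::ToolCall(_) | Event::TurnComplete => {
                let _ = self.control.send(event);
            }
        }
    }
}
```

Each lane then drains its own receiver at its own pace, so a slow tool call on the control side never delays an audio chunk on the fast side.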
Key Traits
| Trait | Crate | Purpose |
|---|---|---|
| Transport | L0 (gemini_genai_rs::transport::ws) | Bidirectional byte transport (WebSocket or mock) |
| Codec | L0 (gemini_genai_rs::transport::codec) | Encode commands / decode server messages (JSON default) |
| AuthProvider | L0 (gemini_genai_rs::transport::auth) | URL construction + auth headers (Google AI / Vertex AI) |
| SessionWriter | L0 (gemini_genai_rs::session) | Send audio/text/video/tools/instructions (trait object safe) |
| SessionReader | L0 (gemini_genai_rs::session) | Subscribe to events, observe phase |
| ToolFunction | L1 (gemini_adk_rs::tool) | One-shot tool: call(args) -> Result<Value> |
| StreamingTool | L1 (gemini_adk_rs::tool) | Background tool yielding multiple results |
| InputStreamingTool | L1 (gemini_adk_rs::tool) | Tool receiving live input while running |
| TurnExtractor | L1 (gemini_adk_rs::live::extractor) | Extract structured data from transcript window |
| TextAgent | L1 (gemini_adk_rs::text) | Text-based LLM agent (generate(), not Live) |
| BaseLlm | L1 (gemini_adk_rs::llm) | LLM abstraction for generate() calls |
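The "WebSocket or mock" note on Transport points at the usual benefit of trait-based I/O: code written against the trait can be tested without a network. The sketch below uses a hypothetical, synchronous ByteTransport trait to show the pattern; the crate's actual Transport trait may be async and its signature is not reproduced here.

```rust
use std::collections::VecDeque;

/// Hypothetical, synchronous simplification of a pluggable transport.
pub trait ByteTransport {
    fn send(&mut self, frame: Vec<u8>) -> Result<(), String>;
    fn recv(&mut self) -> Option<Vec<u8>>;
}

/// In-memory mock: records outbound frames and replays scripted inbound ones.
#[derive(Default)]
pub struct MockTransport {
    pub sent: Vec<Vec<u8>>,
    pub inbound: VecDeque<Vec<u8>>,
}

impl ByteTransport for MockTransport {
    fn send(&mut self, frame: Vec<u8>) -> Result<(), String> {
        self.sent.push(frame);
        Ok(())
    }

    fn recv(&mut self) -> Option<Vec<u8>> {
        self.inbound.pop_front()
    }
}

/// Code written against the trait object works with any implementation.
pub fn ping(transport: &mut dyn ByteTransport) -> Option<Vec<u8>> {
    transport.send(b"ping".to_vec()).ok()?;
    transport.recv()
}
```

Swapping MockTransport for a socket-backed implementation would leave ping unchanged, which is the property the pluggable traits in the table are designed to provide.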
Choosing Your Layer
Use L0 (gemini-genai-rs) if you need:
- Raw WebSocket access with no abstraction overhead
- Custom event loop logic that does not fit the callback model
- A custom transport (e.g., Unix domain socket, QUIC)
- To build your own agent runtime
- Maximum control over every message sent and received
use gemini_genai_rs::prelude::*;
// Build the setup message: endpoint (with API key) and model.
let config = SessionConfig::from_endpoint(ApiEndpoint::google_ai("YOUR_KEY"))
    .model(GeminiModel::Gemini2_0FlashLive);
// Open the WebSocket connection and subscribe to server events.
let handle = ConnectBuilder::new(config).build().await?;
let mut events = handle.subscribe();
handle.send_text("Hello").await?;
// Print streamed text until the model finishes its turn.
while let Some(event) = recv_event(&mut events).await {
    match event {
        SessionEvent::TextDelta(text) => print!("{text}"),
        SessionEvent::TurnComplete => break,
        _ => {}
    }
}
Use L1 (gemini-adk-rs) if you need:
- Automatic tool dispatch without manual message matching
- State management with prefix scoping (session:, turn:, app:)
- Phase machine for multi-step conversation flows
- Turn extraction (LLM-based or custom) between turns
- Telemetry and session signals
- Full control over callback registration without the fluent syntax
Use L2 (gemini-adk-fluent-rs) if you want:
- The fastest path from zero to working voice agent
- Chainable builder API with sub-builders for phases and watchers
- Operator algebra for composing tools (T::simple() | T::google_search())
- One-line connection (connect_vertex(project, location, token))
- Sensible defaults (auto-enables transcription when extractors are used)
use gemini_adk_fluent_rs::prelude::*;
// Configure model, voice, instruction, and fast-lane callbacks, then connect.
let handle = Live::builder()
    .model(GeminiModel::Gemini2_0FlashLive)
    .voice(Voice::Kore)
    .instruction("You are a helpful assistant")
    .on_audio(|data| { /* play audio */ })
    .on_text(|t| print!("{t}"))
    .connect_google_ai("YOUR_KEY")
    .await?;
handle.send_text("Hello!").await?;
handle.done().await?;
Most developers should start at L2 and drop to L1/L0 only when they hit a
specific limitation. The layers are designed to compose: you can use L0
types (like SessionConfig) with L2 builders via Live::connect(config).