Evaluation

adk-fluent wraps Google ADK’s evaluation framework with a fluent E module, consistent with the other namespace modules (P, C, S, M, T, A). It provides three tiers of evaluation (inline, suite, and comparison), and the evaluation criteria used in every tier compose with the | operator.

Quick Start

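import asyncio

from adk_fluent import Agent, E

agent = Agent("math", "gemini-2.5-flash").instruct("You are a math tutor.")

async def main():
    # Inline: one-line smoke test
    report = await agent.eval("What is 2+2?", expect="4").run()
    assert report.ok

# .run() is async, so a plain script needs an event loop.
# In a notebook or an async function, await it directly as the other snippets do.
asyncio.run(main())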

Three Tiers

Tier 1: Inline Evaluation

The fastest way to test a single prompt. Call .eval() directly on any agent builder:

# With expected response (auto-selects response_match criterion)
report = await agent.eval("What is 2+2?", expect="4").run()

# With explicit criteria
report = await agent.eval(
    "Explain gravity",
    criteria=E.semantic_match(0.7)
).run()

print(report.summary())

Tier 2: Evaluation Suites

For structured evaluation with multiple test cases and criteria:

report = await (
    E.suite(agent)
    .case("What is 2+2?", expect="4")
    .case("What is 3*5?", expect="15")
    .case("Factor x^2-1", expect="(x+1)(x-1)")
    .criteria(E.response_match() | E.safety())
    .num_runs(3)
    .run()
)

print(report.summary())
# Eval Report
# ========================================
#   response_match_score: 0.850 (threshold: 0.8) [PASS]
#   safety_v1: 1.000 (threshold: 1.0) [PASS]
# ========================================
# Overall: PASSED

You can also use .eval_suite() on the agent directly:

report = await (
    agent.eval_suite()
    .case("What is 2+2?", expect="4")
    .criteria(E.trajectory() | E.response_match())
    .run()
)

Tier 3: Model Comparison

Compare multiple agents (or the same agent with different models) side-by-side:

fast = Agent("fast", "gemini-2.5-flash").instruct("Answer concisely.")
strong = Agent("strong", "gemini-2.5-pro").instruct("Answer thoroughly.")

report = await (
    E.compare(fast, strong)
    .case("Explain quantum entanglement", expect="correlated particles")
    .case("What causes tides?", expect="gravitational pull of the moon")
    .criteria(E.semantic_match() | E.safety())
    .run()
)

print(report.summary())
print(f"Winner: {report.winner}")
print(report.ranked())  # [("strong", 0.92), ("fast", 0.78)]

Criteria

Criteria are composable evaluation metrics. Combine them with |:

criteria = E.trajectory() | E.response_match() | E.safety()
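To make the | composition concrete, here is a minimal sketch of how such an operator can be built in plain Python. The Criterion and Composite classes and the single() helper are hypothetical stand-ins, not adk-fluent's own types, and the threshold values are illustrative. The sketch only shows the general pattern of overloading __or__ to collect metrics.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Criterion:
    """Hypothetical stand-in for one metric with a pass threshold."""
    metric: str
    threshold: float


@dataclass(frozen=True)
class Composite:
    """Hypothetical stand-in for a composed set of criteria."""
    items: tuple = field(default_factory=tuple)

    def __or__(self, other: "Composite") -> "Composite":
        # Composition concatenates the criteria of both operands.
        return Composite(self.items + other.items)

    def thresholds(self) -> dict:
        # Map each metric name to its threshold, in composition order.
        return {c.metric: c.threshold for c in self.items}


def single(metric: str, threshold: float) -> Composite:
    """Wrap one criterion so that it can be combined with |."""
    return Composite((Criterion(metric, threshold),))


combined = (
    single("tool_trajectory_avg_score", 1.0)
    | single("response_match_score", 0.8)
    | single("safety_v1", 1.0)
)
print(combined.thresholds())
```

Because every operand and every result has the same type, chains of any length stay flat and order-preserving.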

Built-in Criteria

| Criterion | Metric | Description |
| --- | --- | --- |
| E.trajectory() | tool_trajectory_avg_score | Tool call sequence matches expected trajectory |
| E.response_match() | response_match_score | ROUGE-1 text similarity against expected response |
| E.semantic_match() | final_response_match_v2 | LLM-as-a-judge semantic evaluation |
| E.hallucination() | hallucinations_v1 | Groundedness and factual accuracy |
| E.safety() | safety_v1 | Response safety standards |
| E.rubric() | rubric_based_final_response_quality_v1 | Custom rubric-based quality |
| E.tool_rubric() | rubric_based_tool_use_quality_v1 | Custom rubric for tool usage |
| E.custom() | user-defined | User-provided scoring function |

Configuring Criteria

Each criterion accepts a threshold parameter (0.0 to 1.0):

E.response_match(0.9)          # Require 90% ROUGE-1 match
E.trajectory(0.8, match="in_order")  # 80%, in-order matching
E.semantic_match(0.7, judge_model="gemini-2.5-pro")
E.hallucination(0.9, check_intermediate=True)
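To get a feel for what a ROUGE-1 threshold means, the sketch below computes a unigram-overlap F1 score in plain Python. It assumes the F-measure variant with lowercase whitespace tokenization; the metric used by the framework may tokenize or stem differently, so treat the numbers as indicative only. The function name rouge1_f1 is hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


exact = rouge1_f1("the answer is 4", "the answer is 4")
partial = rouge1_f1("the answer is 4", "4")
print(exact, partial)
```

Under these assumptions a verbose answer scored against a very short expected string has full recall but low precision, so a high threshold such as 0.9 suits cases where the expected text is close to the full response.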

Custom Criteria

Define your own scoring function:

def keyword_check(invocation, expected):
    text = invocation.final_response.parts[0].text
    return 1.0 if "quantum" in text.lower() else 0.0

criteria = E.custom("keyword_present", keyword_check, threshold=1.0)
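A scoring function like this can be unit-tested without calling a model. The sketch below exercises keyword_check against a stub object; fake_invocation is a hypothetical helper that mimics only the final_response.parts[0].text attribute path used above, not the real invocation type.

```python
from types import SimpleNamespace


def keyword_check(invocation, expected):
    text = invocation.final_response.parts[0].text
    return 1.0 if "quantum" in text.lower() else 0.0


def fake_invocation(text: str) -> SimpleNamespace:
    """Hypothetical stub exposing only final_response.parts[0].text."""
    part = SimpleNamespace(text=text)
    return SimpleNamespace(final_response=SimpleNamespace(parts=[part]))


hit = keyword_check(fake_invocation("Quantum states can be entangled."), None)
miss = keyword_check(fake_invocation("Tides follow the moon."), None)
print(hit, miss)
```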

Rubric-Based Evaluation

Use natural-language rubrics for qualitative assessment:

criteria = E.rubric(
    "Response must cite at least one source",
    "Response must be under 200 words",
    threshold=0.7,
)

Eval Cases

Cases define what to test. Each case specifies a prompt and expected outcomes:

suite = (
    E.suite(agent)
    # Simple response check
    .case("What is 2+2?", expect="4")

    # Tool trajectory check
    .case(
        "Search for recent news about AI",
        tools=[("google_search", {"query": "AI news"})],
    )

    # Rubric-based quality check
    .case(
        "Write a haiku about coding",
        rubrics=["Must follow 5-7-5 syllable structure"],
    )

    # Expected final state
    .case(
        "Classify this as positive or negative: Great product!",
        state={"sentiment": "positive"},
    )
)
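The tool trajectory case above lists expected calls as (name, args) pairs. As a rough illustration of what an in-order match could mean, the sketch below checks that the expected calls appear in the actual calls as an ordered subsequence. It assumes that extra calls in between are tolerated; the library's exact matching semantics may differ. The function in_order_match and the lookup_user tool name are hypothetical.

```python
def in_order_match(expected, actual):
    """Return True if expected calls appear in actual as an ordered subsequence."""
    remaining = iter(actual)
    # `call in remaining` consumes the iterator up to the first match,
    # so later expected calls can only match later actual calls.
    return all(call in remaining for call in expected)


expected = [("google_search", {"query": "AI news"})]
actual = [
    ("lookup_user", {}),  # hypothetical extra call, tolerated here
    ("google_search", {"query": "AI news"}),
]

print(in_order_match(expected, actual))
print(in_order_match(expected, [("google_search", {"query": "cats"})]))
```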

You can also create standalone cases:

case = E.case("What is 2+2?", expect="4")

Reports

EvalReport

The result of running an evaluation suite:

report = await suite.run()

report.ok          # True if all metrics passed
report.scores      # {"response_match_score": 0.85, "safety_v1": 1.0}
report.thresholds  # {"response_match_score": 0.8, "safety_v1": 1.0}
report.passed      # {"response_match_score": True, "safety_v1": True}
report.summary()   # Formatted text table
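The relationship between these fields can be sketched in a few lines. The sketch assumes a metric passes when its score is greater than or equal to its threshold, which is consistent with the values shown above but is not confirmed as the library's exact rule. The evaluate function is hypothetical.

```python
def evaluate(scores: dict, thresholds: dict) -> tuple:
    """Return (passed, ok) given per-metric scores and thresholds."""
    # Assumption: a metric passes when its score is >= its threshold.
    passed = {name: scores[name] >= thresholds[name] for name in thresholds}
    return passed, all(passed.values())


scores = {"response_match_score": 0.85, "safety_v1": 1.0}
thresholds = {"response_match_score": 0.8, "safety_v1": 1.0}

passed, ok = evaluate(scores, thresholds)
print(passed, ok)

# Raising one threshold above the score flips that metric and the overall result.
strict_passed, strict_ok = evaluate(
    scores, {**thresholds, "response_match_score": 0.9}
)
print(strict_passed, strict_ok)
```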

ComparisonReport

The result of comparing multiple agents:

report = await comparison.run()

report.winner              # "strong" (highest avg score)
report.ranked()            # [("strong", 0.92), ("fast", 0.78)]
report.agent_reports       # {"fast": EvalReport, "strong": EvalReport}
report.summary()           # Formatted comparison table
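A ranking of this shape can be derived from per-agent scores as sketched below. The sketch assumes the average is taken over each agent's metric scores; the source only states that the winner has the highest average score. The rank_agents function is hypothetical and the numbers are illustrative, not real results.

```python
from statistics import mean


def rank_agents(agent_scores: dict) -> list:
    """Sort agents by the mean of their metric scores, best first."""
    averages = {
        name: mean(scores.values()) for name, scores in agent_scores.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Illustrative numbers only, not real results.
agent_scores = {
    "fast": {"final_response_match_v2": 0.6, "safety_v1": 1.0},
    "strong": {"final_response_match_v2": 0.9, "safety_v1": 1.0},
}

ranking = rank_agents(agent_scores)
winner = ranking[0][0]
print(ranking, winner)
```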

File-Based Eval Sets

Serialize eval suites to JSON for CI pipelines or sharing:

# Save to file (ADK-compatible format)
(
    E.suite(agent)
    .case("What is 2+2?", expect="4")
    .case("What is 3*5?", expect="15")
    .criteria(E.response_match())
    .to_file("tests/math_eval.json")
)

# Load and run from file
suite = E.from_file("tests/math_eval.json", agent=agent)
report = await suite.run()

User Simulation (Personas)

For multi-turn conversation testing, create scenarios with simulated user personas:

scenario = E.scenario(
    start="Book a flight from SFO to JFK",
    plan="User wants economy class, next Friday, window seat",
    persona=E.persona.expert(),
)

Built-in Personas

| Persona | Description |
| --- | --- |
| E.persona.expert() | Knows exactly what they want, professional tone |
| E.persona.novice() | Relies on the agent, conversational tone |
| E.persona.evaluator() | Assessing agent capabilities |

Custom Personas

persona = E.persona.custom(
    persona_id="impatient_user",
    description="A user who wants quick answers and gets frustrated with long responses",
    behaviors=["Sends short messages", "Asks for summaries"],
)

Note: Personas require google-adk >= 1.26.0.

Integration with Other Modules

The E module composes naturally with other adk-fluent namespaces:

from adk_fluent import Agent, E, C

# Eval an agent with context constraints
agent = Agent("helper", "gemini-2.5-flash").context(C.window(5))
report = await agent.eval("What was my first question?", expect="...").run()

# Assert state after agent runs
report = await (
    E.suite(agent)
    .case("Classify: great product!", state={"sentiment": "positive"})
    .criteria(E.response_match())
    .run()
)

pytest Integration

Use evaluations in your test suite:

import pytest
from adk_fluent import Agent, E

agent = Agent("math", "gemini-2.5-flash").instruct("You are a math tutor.")

@pytest.mark.asyncio
async def test_math_accuracy():
    report = await (
        agent.eval_suite()
        .case("What is 2+2?", expect="4")
        .case("What is 10/2?", expect="5")
        .criteria(E.response_match(0.8))
        .run()
    )
    assert report.ok, report.summary()

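# Stub tool for illustration; replace with your real search function.
# The function is named search so that it matches the expected tool name below.
def search(query: str) -> str:
    return f"Results for {query}"

@pytest.mark.asyncio
async def test_tool_trajectory():
    search_agent = Agent("searcher", "gemini-2.5-flash").tool(search)
    report = await (
        search_agent.eval_suite()
        .case("Find news about AI", tools=[("search", {"query": "AI news"})])
        .criteria(E.trajectory())
        .run()
    )
    assert report.ok, report.summary()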

@pytest.mark.asyncio
async def test_model_comparison():
    fast = Agent("fast", "gemini-2.5-flash").instruct("Be concise.")
    strong = Agent("strong", "gemini-2.5-pro").instruct("Be thorough.")

    report = await (
        E.compare(fast, strong)
        .case("Explain gravity", expect="gravitational force")
        .criteria(E.semantic_match())
        .run()
    )
    assert report.winner is not None

Suite Configuration

Fine-tune evaluation execution:

suite = (
    E.suite(agent)
    .name("math_eval_v2")
    .description("Math tutor accuracy evaluation")
    .case("What is 2+2?", expect="4")
    .criteria(E.response_match())
    .threshold("response_match_score", 0.9)  # Override per-metric
    .num_runs(5)                               # Repeat the eval to reduce run-to-run noise
)

report = await suite.run()
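Repeating an evaluation matters because model output varies between runs. The sketch below assumes that repeated runs are aggregated by averaging, which the text above does not state, and uses illustrative per-run scores rather than real results.

```python
from statistics import mean, stdev

# Illustrative per-run scores for one metric, not real results.
run_scores = [0.82, 0.91, 0.88, 0.79, 0.90]

average = mean(run_scores)
spread = stdev(run_scores)  # sample standard deviation across runs
threshold = 0.8

print(f"mean={average:.3f} stdev={spread:.3f} pass={average >= threshold}")
```

In this example one individual run falls below the threshold while the average clears it, which is the kind of flakiness a single run would hide.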

API Summary

E (static namespace)

| Method | Returns | Description |
| --- | --- | --- |
| E.trajectory() | EComposite | Tool trajectory matching criterion |
| E.response_match() | EComposite | ROUGE-1 response match criterion |
| E.semantic_match() | EComposite | LLM-as-a-judge semantic matching |
| E.hallucination() | EComposite | Hallucination detection |
| E.safety() | EComposite | Safety evaluation |
| E.rubric() | EComposite | Custom rubric-based quality |
| E.tool_rubric() | EComposite | Rubric for tool usage quality |
| E.custom() | EComposite | User-defined custom metric |
| E.case() | ECase | Create a standalone eval case |
| E.scenario() | ConversationScenario | Create a user simulation scenario |
| E.suite() | EvalSuite | Create an evaluation suite |
| E.compare() | ComparisonSuite | Compare multiple agents |
| E.from_file() | EvalSuite | Load eval set from JSON file |
| E.persona | EPersona | Namespace for prebuilt user personas |

Agent eval methods

| Method | Returns | Description |
| --- | --- | --- |
| agent.eval(prompt, ...) | EvalSuite | Inline evaluation with a single case |
| agent.eval_suite() | EvalSuite | Create an empty suite bound to this agent |

EvalSuite (builder)

| Method | Returns | Description |
| --- | --- | --- |
| .case(prompt, ...) | EvalSuite | Add an evaluation case |
| .criteria(composite) | EvalSuite | Set evaluation criteria |
| .rubric(text) | EvalSuite | Add a rubric to all cases |
| .threshold(metric, value) | EvalSuite | Override a metric threshold |
| .num_runs(n) | EvalSuite | Set number of eval runs |
| .name(text) | EvalSuite | Set suite name |
| .description(text) | EvalSuite | Set suite description |
| .to_file(path) | EvalSuite | Serialize to JSON file |
| .run() | EvalReport | Run the evaluation (async) |