Module: eval
from adk_fluent import E
Fluent composition of evaluation criteria, cases, and suites. Follows the same conventions as the P, C, S, M, T, and A modules.
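Quick Reference
| Method | Returns | Description |
|---|---|---|
| `E.trajectory()` | `EComposite` | Tool trajectory matching criterion |
| `E.response_match()` | `EComposite` | ROUGE-1 response match criterion |
| `E.semantic_match()` | `EComposite` | LLM-as-a-judge semantic matching criterion |
| `E.hallucination()` | `EComposite` | Hallucination detection criterion |
| `E.safety()` | `EComposite` | Safety evaluation criterion |
| `E.rubric()` | `EComposite` | Rubric-based response quality criterion |
| `E.tool_rubric()` | `EComposite` | Rubric-based tool use quality criterion |
| `E.custom()` | `EComposite` | User-defined custom metric |
| `E.case()` | `ECase` | Create a standalone eval case |
| `E.scenario()` | `Any` | Create a conversation scenario for user simulation |
| `E.suite()` | `EvalSuite` | Create an evaluation suite for an agent builder |
| `E.compare()` | `ComparisonSuite` | Compare multiple agents on the same eval set |
| `E.from_file()` | `Any` | Load an eval set from a JSON file |
| `E.from_dir()` | `list[Any]` | Load all eval sets from a directory |
| `E.gate()` | `Any` | Create a quality gate for use in pipelines |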
Criteria factories
E.trajectory(threshold: float = 1.0, *, match: str = 'exact') -> EComposite
Tool trajectory matching criterion.
Checks that the agent’s tool calls match the expected trajectory.
Args:
- threshold: Score threshold (0.0 to 1.0). Default 1.0 (exact match).
- match: Match type: "exact", "in_order", or "any_order".
Usage:
    E.trajectory()                       # exact match, threshold 1.0
    E.trajectory(0.8, match="in_order")  # in-order, 80% threshold
Parameters:
- threshold (float), default: 1.0
- match (str), default: 'exact'
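To make the three `match` values concrete, here is a minimal standalone sketch of how such comparisons could behave. It does not use adk_fluent and is not the library's implementation; the helper `trajectory_matches` and the precise semantics of each mode (for example, whether extra tool calls are tolerated) are assumptions made for illustration.

```python
from typing import Any

ToolCall = tuple[str, dict[str, Any]]  # (tool name, arguments)


def trajectory_matches(actual: list[ToolCall], expected: list[ToolCall], match: str = "exact") -> bool:
    """Hypothetical helper: compare tool-call sequences under one match mode."""
    if match == "exact":
        # Same calls, same order, nothing extra.
        return actual == expected
    if match == "in_order":
        # Expected calls appear in order; extra calls in between are tolerated.
        remaining = iter(actual)
        return all(any(call == step for step in remaining) for call in expected)
    if match == "any_order":
        # Every expected call appears somewhere; order is ignored.
        pool = list(actual)
        for call in expected:
            if call not in pool:
                return False
            pool.remove(call)
        return True
    raise ValueError(f"unknown match mode: {match!r}")


actual = [("search", {"q": "flights"}), ("log", {}), ("book", {"id": 7})]
expected = [("search", {"q": "flights"}), ("book", {"id": 7})]
```

In this sketch the extra `log` call makes the "exact" comparison fail, while "in_order" and "any_order" both accept it.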
E.response_match(threshold: float = 0.8) -> EComposite
ROUGE-1 response match criterion.
Compares the agent’s response against the expected text using ROUGE-1 scoring.
Args:
- threshold: Minimum ROUGE-1 score to pass. Default 0.8.
Usage:
    E.response_match()     # 80% threshold
    E.response_match(0.9)  # 90% threshold
Parameters:
- threshold (float), default: 0.8
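For intuition about what a ROUGE-1 threshold means, the following is a minimal sketch of a unigram-overlap F1 score. It is an illustration only: the function `rouge1_f1` is hypothetical, and the tokenization (lowercased whitespace split) and the choice of F1 rather than recall or precision are assumptions that may differ from what the underlying evaluator does.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized, lowercased strings."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the answer is 4", "the answer is four")
passed = score >= 0.8  # compare against the criterion threshold
```

Here three of four unigrams overlap on each side, so the score is 0.75 and the response would fall just short of the default 0.8 threshold.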
E.semantic_match(threshold: float = 0.5, *, judge_model: str = 'gemini-2.5-flash') -> EComposite
LLM-as-a-judge semantic matching criterion.
Uses a judge LLM to evaluate whether the response semantically matches the expected output.
Args:
- threshold: Minimum score to pass. Default 0.5.
- judge_model: Model to use as judge. Default "gemini-2.5-flash".
Usage:
    E.semantic_match()  # defaults
    E.semantic_match(0.7, judge_model="gemini-2.5-pro")
Parameters:
- threshold (float), default: 0.5
- judge_model (str), default: 'gemini-2.5-flash'
E.hallucination(threshold: float = 0.8, *, judge_model: str = 'gemini-2.5-flash', check_intermediate: bool = False) -> EComposite
Hallucination detection criterion.
Evaluates whether the agent’s response is grounded and factual.
Args:
- threshold: Minimum groundedness score. Default 0.8.
- judge_model: Model to use as judge.
- check_intermediate: Also check intermediate NL responses.
Usage:
    E.hallucination()  # defaults
    E.hallucination(0.9, check_intermediate=True)
Parameters:
- threshold (float), default: 0.8
- judge_model (str), default: 'gemini-2.5-flash'
- check_intermediate (bool), default: False
E.safety(threshold: float = 1.0) -> EComposite
Safety evaluation criterion.
Checks that the agent’s response meets safety standards.
Args:
- threshold: Minimum safety score. Default 1.0 (must be fully safe).
Parameters:
- threshold (float), default: 1.0
E.rubric(*texts: str, threshold: float = 0.5, judge_model: str = 'gemini-2.5-flash') -> EComposite
Rubric-based response quality criterion.
Uses custom rubrics to evaluate the quality of agent responses.
Args:
- texts: One or more rubric text strings.
- threshold: Minimum quality score. Default 0.5.
- judge_model: Model to use as judge.
Usage:
    E.rubric("Response must be concise")
    E.rubric("Must cite sources", "Must be factual", threshold=0.7)
Parameters:
- *texts (str)
- threshold (float), default: 0.5
- judge_model (str), default: 'gemini-2.5-flash'
E.tool_rubric(*texts: str, threshold: float = 0.5, judge_model: str = 'gemini-2.5-flash') -> EComposite
Rubric-based tool use quality criterion.
Evaluates the quality of tool usage via custom rubrics.
Args:
- texts: One or more rubric text strings.
- threshold: Minimum quality score. Default 0.5.
- judge_model: Model to use as judge.
Usage:
    E.tool_rubric("Must use search before answering")
Parameters:
- *texts (str)
- threshold (float), default: 0.5
- judge_model (str), default: 'gemini-2.5-flash'
E.custom(name: str, fn: Callable[..., float], *, threshold: float = 0.5) -> EComposite
User-defined custom metric.
Args:
- name: Metric name (must be unique in the criteria set).
- fn: Callable that receives evaluation data and returns a float score.
- threshold: Minimum score to pass.
Usage:
    def my_metric(invocation, expected):
        return 1.0 if "keyword" in invocation.final_response else 0.0

    E.custom("keyword_check", my_metric, threshold=1.0)
Parameters:
- name (str)
- fn (Callable[..., float])
- threshold (float), default: 0.5
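A custom metric is an ordinary function, so it can be exercised on its own before it is registered. The sketch below does this with stand-in objects. It assumes, as the usage example above does, that the first argument exposes a string-like `final_response`; the real evaluation data passed by the library may be shaped differently, and `keyword_metric` is a hypothetical name.

```python
from types import SimpleNamespace


def keyword_metric(invocation, expected) -> float:
    """Score 1.0 when the final response mentions "refund", else 0.0."""
    return 1.0 if "refund" in invocation.final_response.lower() else 0.0


# Stand-in objects; the real evaluation data passed by the library may differ.
hit = SimpleNamespace(final_response="Your refund was issued today.")
miss = SimpleNamespace(final_response="Please contact support.")

scores = [keyword_metric(hit, None), keyword_metric(miss, None)]
# With the library installed, this would be registered as:
#   E.custom("refund_keyword", keyword_metric, threshold=1.0)
```

With `threshold=1.0`, only responses that score exactly 1.0 would pass this criterion.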
Case factory
E.case(prompt: str, *, expect: str | None = None, tools: list[tuple[str, dict[str, Any]]] | None = None, rubrics: list[str] | None = None, state: dict[str, Any] | None = None) -> ECase
Create a standalone eval case.
Usage:
    case = E.case("What is 2+2?", expect="4")
Parameters:
- prompt (str)
- expect (str | None), default: None
- tools (list[tuple[str, dict[str, Any]]] | None), default: None
- rubrics (list[str] | None), default: None
- state (dict[str, Any] | None), default: None
Scenario factory (user simulation)
E.scenario(start: str, plan: str, *, persona: Any | None = None) -> Any
Create a conversation scenario for user simulation.
Args:
- start: The initial user prompt.
- plan: Description of the conversation plan/goal.
- persona: Optional `UserPersona` (from `E.persona.expert()` etc.).
Usage:
    scenario = E.scenario(
        start="Book a flight",
        plan="User wants SFO to JFK next Friday, economy class",
        persona=E.persona.expert(),
    )
Parameters:
- start (str)
- plan (str)
- persona (Any | None), default: None
Suite factory
E.suite(agent: Any) -> EvalSuite
Create an evaluation suite for an agent builder.
Args:
- agent: An agent builder (or built ADK agent).
Usage:
    suite = (
        E.suite(my_agent)
        .case("prompt", expect="response")
        .criteria(E.response_match())
    )
    report = await suite.run()
Parameters:
- agent (Any)
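Because `run()` is awaited, a plain script needs an event loop to drive it. The sketch below shows one way to do that with `asyncio.run`. `StubSuite` and the dictionary it returns are hypothetical stand-ins so the example runs without the library; they say nothing about the real report type.

```python
import asyncio


class StubSuite:
    """Hypothetical stand-in for an object with an async run() method."""

    async def run(self) -> dict[str, float]:
        await asyncio.sleep(0)  # placeholder for real evaluation work
        return {"response_match": 0.9}


async def main() -> dict[str, float]:
    suite = StubSuite()  # in real code: E.suite(my_agent).case(...).criteria(...)
    return await suite.run()


report = asyncio.run(main())
```

Inside an environment that already runs an event loop, such as a notebook, `await suite.run()` can typically be used directly instead.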
Comparison factory
E.compare(*agents: Any) -> ComparisonSuite
Compare multiple agents on the same eval set.
Args:
- agents: Two or more agent builders to compare.
Usage:
    report = await (
        E.compare(fast_agent, smart_agent)
        .case("query", expect="answer")
        .criteria(E.semantic_match())
        .run()
    )
Parameters:
- *agents (Any)
File-based eval
E.from_file(path: str) -> Any
Load an eval set from a JSON file.
Args:
- path: Path to the `.test.json` eval set file (ADK format).
Returns:
An ADK `EvalSet` instance.
Parameters:
- path (str)
E.from_dir(path: str) -> list[Any]
Load all eval sets from a directory.
Args:
- path: Directory containing `.test.json` files.
Returns:
List of ADK `EvalSet` instances.
Parameters:
- path (str)
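To check which files a directory load is likely to pick up, the naming convention can be reproduced with `pathlib`. This is a sketch under assumptions: `find_eval_files` is a hypothetical helper, it only looks at the top level of the directory, and whether `E.from_dir` recurses into subdirectories is not stated here.

```python
import json
import tempfile
from pathlib import Path


def find_eval_files(directory: str) -> list[Path]:
    """Hypothetical helper: list *.test.json files directly inside a directory, sorted by name."""
    return sorted(Path(directory).glob("*.test.json"))


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder payloads; the real ADK eval set schema is not reproduced here.
    (Path(tmp) / "booking.test.json").write_text(json.dumps({}))
    (Path(tmp) / "search.test.json").write_text(json.dumps({}))
    (Path(tmp) / "notes.txt").write_text("ignored")
    found = [p.name for p in find_eval_files(tmp)]
```

Only the two files ending in `.test.json` are listed; `notes.txt` is skipped.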
Gate (quality threshold for pipelines)
E.gate(criteria: EComposite, *, threshold: float | None = None) -> Any
Create a quality gate for use in pipelines.
The gate evaluates the preceding agent’s output and blocks propagation if the quality score falls below the threshold.
Args:
- criteria: Evaluation criteria to check.
- threshold: Override threshold (uses the criterion default if None).
Returns:
A callable suitable for use with the `>>` operator.
Usage:
    pipeline = agent >> E.gate(E.hallucination()) >> next_agent
Parameters:
- criteria (EComposite)
- threshold (float | None), default: None
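The override rule for `threshold` can be illustrated with a small standalone sketch. Everything here is hypothetical (`make_gate`, `length_score`, and the use of `None` to represent a blocked output); it only mirrors the described behaviour of falling back to the criterion's default when no override is given, and is not how the library implements gates.

```python
from typing import Callable, Optional


def make_gate(score_fn: Callable[[str], float], default_threshold: float,
              threshold: Optional[float] = None) -> Callable[[str], Optional[str]]:
    """Hypothetical gate: pass the output through only if its score meets the threshold."""
    effective = default_threshold if threshold is None else threshold

    def gate(output: str) -> Optional[str]:
        # Returning None stands in for "block propagation".
        return output if score_fn(output) >= effective else None

    return gate


def length_score(text: str) -> float:
    """Toy scorer: longer answers score higher, capped at 1.0."""
    return min(len(text) / 20, 1.0)


default_gate = make_gate(length_score, default_threshold=0.8)
strict_gate = make_gate(length_score, default_threshold=0.8, threshold=1.0)
```

An 18-character output scores 0.9 in this toy setup, so it passes the default gate but is blocked by the stricter override.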
Composition Operators
| (compose (EComposite))
Combine evaluation criteria.
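As a rough picture of how a `|` operator can combine criteria, the sketch below overloads `__or__` on a small immutable container. The classes `Criterion` and `Composite` are hypothetical and are not the library's `EComposite`; the sketch assumes composition simply concatenates the criteria from both sides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """Hypothetical descriptor: a metric name plus its pass threshold."""
    name: str
    threshold: float


@dataclass(frozen=True)
class Composite:
    """Hypothetical composable container of criteria."""
    criteria: tuple[Criterion, ...] = ()

    def __or__(self, other: "Composite") -> "Composite":
        # Composition concatenates both sides and returns a new object.
        return Composite(self.criteria + other.criteria)


trajectory = Composite((Criterion("trajectory", 1.0),))
response = Composite((Criterion("response_match", 0.9),))
combined = trajectory | response
names = [c.name for c in combined.criteria]
```

With the library itself, the corresponding expression would presumably look like `E.trajectory() | E.response_match(0.9)`, passed to `.criteria(...)` on a suite.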
Types
| Type | Description |
|---|---|
| `EComposite` | Composable evaluation criteria. |
| | A single evaluation criterion descriptor |
| `ECase` | A single evaluation case. |
| `EvalSuite` | Fluent builder for structured evaluation suites |
| | Result of running an evaluation suite |
| | Result of comparing multiple agents on the same eval cases |
| | Namespace for prebuilt user simulation personas |