Evaluation framework for measuring agent quality with a range of metrics.
Mirrors ADK-Python’s evaluation module. Provides traits and types for
evaluating agent invocations against expected results, including
LLM-as-judge, trajectory, response, rubric, hallucination, safety,
and user simulator evaluators.
Structs
- EvalCase - A single evaluation case that pairs actual invocations with optional expected results.
- EvalCaseFile - A single evaluation case within an eval set file.
- EvalMetric - A named metric with its aggregated and per-invocation results.
- EvalResult - The result of evaluating an evaluation set.
- EvalSet - A collection of evaluation cases.
- EvalSetFile - Top-level structure of a .evalset.json file.
- ExpectedToolUse - An expected tool call within an invocation.
- HallucinationEvaluator - Evaluates whether agent responses are grounded (not hallucinated).
- IntermediateData - Intermediate data captured during agent execution.
- Invocation - A single invocation (conversation) for evaluation.
- InvocationFile - A single invocation (turn pair) in the eval case conversation.
- InvocationTurn - A single turn in a conversation for evaluation.
- LlmAsJudge - Evaluator that uses an LLM to judge agent responses.
- LlmAsJudgeConfig - Configuration for the LLM-as-judge evaluator.
- PerInvocationResult - A single metric evaluation score for one invocation.
- ResponseEvaluator - Evaluates the agent’s final response against expected output.
- RubricEvaluator - Evaluator that scores agent outputs against free-text rubric criteria using an LLM as judge.
- SafetyEvaluator - Evaluates agent responses for safety violations.
- SafetySignal - Safety signal detected during evaluation.
- TestConfig - Top-level test configuration loaded from test_config.json.
- ToolUseRecord - Record of a tool use that occurred during execution.
- TrajectoryEvaluator - Evaluates the tool-call trajectory of agent invocations.
- UserSimulatorEvaluator - Evaluates the fidelity of a user simulator in multi-turn conversations.
Enums
- CriterionConfig - Configuration for a single evaluation criterion.
- EvalError - Errors from evaluation operations.
- MatchStrategy - Strategy for comparing actual vs. expected responses.
- RubricMode - Evaluation mode for rubric evaluation.
- SafetyCategory - Categories of safety concerns.
- TrajectoryMatchType - How to compare actual vs. expected tool call trajectories.
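
To make the idea of comparing actual and expected tool-call trajectories concrete, here is a minimal, std-only sketch. It does not use this crate's API: `SketchMatchType`, its three variants, and `trajectory_matches` are hypothetical names chosen for illustration, and the real `TrajectoryMatchType` may define different modes or semantics. Trajectories are reduced to slices of tool names, ignoring arguments.

```rust
/// Hypothetical comparison modes (assumption: not the crate's actual variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SketchMatchType {
    /// Same tool names, same order, same length.
    Exact,
    /// Expected names appear in order; extra actual calls are allowed.
    InOrder,
    /// Every expected name appears, in any order; extra actual calls are allowed.
    AnyOrder,
}

/// Returns true when `actual` satisfies `expected` under the given mode.
pub fn trajectory_matches(actual: &[&str], expected: &[&str], mode: SketchMatchType) -> bool {
    match mode {
        SketchMatchType::Exact => actual == expected,
        SketchMatchType::InOrder => {
            // Subsequence check: each expected name must be found after
            // the position where the previous one was found.
            let mut it = actual.iter();
            expected.iter().all(|want| it.any(|got| got == want))
        }
        SketchMatchType::AnyOrder => {
            // Multiset containment: consume one actual call per expected call.
            let mut remaining: Vec<&str> = actual.to_vec();
            expected.iter().all(|want| {
                let found = remaining.iter().position(|got| got == want);
                if let Some(i) = found {
                    remaining.swap_remove(i);
                    true
                } else {
                    false
                }
            })
        }
    }
}
```

Under these assumed semantics, an actual trajectory of `["search", "log", "fetch"]` fails an `Exact` comparison against `["search", "fetch"]` but passes `InOrder`, which is the kind of trade-off a match-type setting lets a test author choose.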
Traits
- Evaluator - Trait for evaluating agent invocations against expected results.
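
The general shape of "an evaluator scores each invocation, and a metric aggregates those scores" can be sketched as follows. This is an illustration under stated assumptions, not the crate's `Evaluator` trait: `SketchEvaluator`, `ExactResponse`, `SketchMetric`, and `evaluate` are hypothetical names, invocations are reduced to `(actual, expected)` string pairs, and the mean is used as the aggregate only for simplicity.

```rust
/// Hypothetical stand-in for an evaluator trait (assumption: the real
/// `Evaluator` trait has a different, richer signature).
pub trait SketchEvaluator {
    /// Metric name reported alongside the scores.
    fn name(&self) -> &str;
    /// Scores one actual response against its expected response, in [0.0, 1.0].
    fn score(&self, actual: &str, expected: &str) -> f64;
}

/// Scores 1.0 when the whitespace-trimmed responses are identical, else 0.0.
pub struct ExactResponse;

impl SketchEvaluator for ExactResponse {
    fn name(&self) -> &str {
        "exact_response"
    }

    fn score(&self, actual: &str, expected: &str) -> f64 {
        if actual.trim() == expected.trim() { 1.0 } else { 0.0 }
    }
}

/// Per-invocation scores plus their mean (hypothetical result shape).
pub struct SketchMetric {
    pub name: String,
    pub per_invocation: Vec<f64>,
    pub mean: f64,
}

/// Runs one evaluator over (actual, expected) pairs and aggregates the scores.
pub fn evaluate(evaluator: &dyn SketchEvaluator, pairs: &[(&str, &str)]) -> SketchMetric {
    let per_invocation: Vec<f64> = pairs
        .iter()
        .map(|(actual, expected)| evaluator.score(actual, expected))
        .collect();
    // An empty input yields a mean of 0.0 rather than NaN.
    let mean = if per_invocation.is_empty() {
        0.0
    } else {
        per_invocation.iter().sum::<f64>() / per_invocation.len() as f64
    };
    SketchMetric {
        name: evaluator.name().to_string(),
        per_invocation,
        mean,
    }
}
```

Keeping the scoring behind a trait object means an LLM-based or rubric-based scorer could be swapped in without changing the aggregation step, which is presumably why the module groups its evaluators under a single trait.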
Functions
- parse_evalset - Parse an .evalset.json file from disk.
- parse_evalset_str - Parse an .evalset.json from a raw JSON string.
- parse_test_config - Parse a test_config.json file from disk.
- parse_test_config_str - Parse a test_config.json from a raw JSON string.