Module evaluation

Evaluation framework — evaluate agent quality using various metrics.

Mirrors ADK-Python’s evaluation module. Provides traits and types for evaluating agent invocations against expected results, including LLM-as-judge, trajectory, response, rubric, hallucination, safety, and user simulator evaluators.

Structs

EvalCase
A single evaluation case — pairs actual invocations with optional expected results.
EvalCaseFile
A single evaluation case within an eval set file.
EvalMetric
A named metric with its aggregated and per-invocation results.
EvalResult
The result of evaluating an evaluation set.
EvalSet
A collection of evaluation cases.
EvalSetFile
Top-level structure of a .evalset.json file.
ExpectedToolUse
An expected tool call within an invocation.
HallucinationEvaluator
Evaluates whether agent responses are grounded (not hallucinated).
IntermediateData
Intermediate data captured during agent execution.
Invocation
A single invocation (conversation) for evaluation.
InvocationFile
A single invocation (turn pair) in the eval case conversation.
InvocationTurn
A single turn in a conversation for evaluation.
LlmAsJudge
Evaluator that uses an LLM to judge agent responses.
LlmAsJudgeConfig
Configuration for the LLM-as-judge evaluator.
PerInvocationResult
A single metric evaluation score for one invocation.
ResponseEvaluator
Evaluates the agent’s final response against expected output.
RubricEvaluator
Evaluator that scores agent outputs against free-text rubric criteria using an LLM as judge.
SafetyEvaluator
Evaluates agent responses for safety violations.
SafetySignal
Safety signal detected during evaluation.
TestConfig
Top-level test configuration loaded from test_config.json.
ToolUseRecord
Record of a tool use that occurred during execution.
TrajectoryEvaluator
Evaluates the tool-call trajectory of agent invocations.
UserSimulatorEvaluator
Evaluates the fidelity of a user simulator in multi-turn conversations.

Enums

CriterionConfig
Configuration for a single evaluation criterion.
EvalError
Errors from evaluation operations.
MatchStrategy
Strategy for comparing actual vs. expected responses.
RubricMode
Evaluation mode for rubric evaluation.
SafetyCategory
Categories of safety concerns.
TrajectoryMatchType
How to compare actual vs. expected tool call trajectories.
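
As a rough illustration of what comparing tool-call trajectories can involve, the sketch below implements three common comparison modes over lists of tool names: exact equality, in-order (expected calls appear as a subsequence of the actual calls), and any-order (expected calls all appear, ignoring order). The enum `SketchMatch` and the function `trajectory_matches` are hypothetical names invented for this example; the variants of `TrajectoryMatchType` are not listed on this page and may differ.

```rust
/// Hypothetical comparison modes for this sketch only; the variants of the
/// crate's `TrajectoryMatchType` are not shown on this page and may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SketchMatch {
    Exact,
    InOrder,
    AnyOrder,
}

/// Compare tool names actually called against the expected tool names.
fn trajectory_matches(actual: &[&str], expected: &[&str], mode: SketchMatch) -> bool {
    match mode {
        // Same tools, same order, nothing extra.
        SketchMatch::Exact => actual == expected,
        // Every expected tool appears in `actual` in the same relative order;
        // extra calls in between are allowed (subsequence check).
        SketchMatch::InOrder => {
            let mut it = actual.iter();
            expected.iter().all(|e| it.any(|a| a == e))
        }
        // Every expected tool appears in `actual`, order ignored; each actual
        // call can satisfy at most one expected call.
        SketchMatch::AnyOrder => {
            let mut remaining: Vec<&str> = actual.to_vec();
            expected.iter().all(|e| {
                let pos = remaining.iter().position(|a| a == e);
                match pos {
                    Some(i) => {
                        remaining.swap_remove(i);
                        true
                    }
                    None => false,
                }
            })
        }
    }
}

fn main() {
    let actual = ["search", "log", "fetch"];
    let expected = ["search", "fetch"];
    for mode in [SketchMatch::Exact, SketchMatch::InOrder, SketchMatch::AnyOrder] {
        println!("{:?}: {}", mode, trajectory_matches(&actual, &expected, mode));
    }
}
```

Which mode is appropriate depends on how strictly the agent's tool usage should be pinned down: exact matching penalizes any extra call, while the looser modes tolerate incidental calls such as logging.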

Traits

Evaluator
Trait for evaluating agent invocations against expected results.
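
To make the idea of an evaluator concrete, here is a minimal, self-contained sketch of the general pattern: score each invocation, then aggregate the per-invocation scores into one metric. Everything in it (`SketchInvocation`, `SketchMetric`, `SketchEvaluator`, `ContainsEvaluator`) is a hypothetical stand-in written for this example. The real `Evaluator` trait's methods are not shown on this page and may differ, for instance by being async or returning a `Result`.

```rust
/// Hypothetical stand-in for one invocation: the agent's final response and
/// the expected response. Not the crate's `Invocation` type.
struct SketchInvocation {
    actual: String,
    expected: String,
}

/// Hypothetical result shape: a named metric with an overall score and one
/// score per invocation. Not the crate's `EvalMetric` type.
#[derive(Debug, PartialEq)]
struct SketchMetric {
    name: String,
    overall: f64,
    per_invocation: Vec<f64>,
}

/// Hypothetical trait for this sketch; the real `Evaluator` may look different.
trait SketchEvaluator {
    fn evaluate(&self, invocations: &[SketchInvocation]) -> SketchMetric;
}

/// Scores 1.0 when the actual response contains the expected text
/// (case-insensitive), otherwise 0.0.
struct ContainsEvaluator;

impl SketchEvaluator for ContainsEvaluator {
    fn evaluate(&self, invocations: &[SketchInvocation]) -> SketchMetric {
        let per_invocation: Vec<f64> = invocations
            .iter()
            .map(|inv| {
                let hit = inv
                    .actual
                    .to_lowercase()
                    .contains(&inv.expected.to_lowercase());
                if hit { 1.0 } else { 0.0 }
            })
            .collect();
        // Aggregate as the mean of per-invocation scores; empty input scores 0.0.
        let overall = if per_invocation.is_empty() {
            0.0
        } else {
            per_invocation.iter().sum::<f64>() / per_invocation.len() as f64
        };
        SketchMetric {
            name: "response_contains".to_string(),
            overall,
            per_invocation,
        }
    }
}

fn main() {
    let invocations = vec![
        SketchInvocation {
            actual: "The capital of France is Paris.".to_string(),
            expected: "paris".to_string(),
        },
        SketchInvocation {
            actual: "I am not sure.".to_string(),
            expected: "Berlin".to_string(),
        },
    ];
    let metric = ContainsEvaluator.evaluate(&invocations);
    println!("{:?}", metric);
}
```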

Functions

parse_evalset
Parse an .evalset.json file from disk.
parse_evalset_str
Parse an .evalset.json from a raw JSON string.
parse_test_config
Parse a test_config.json file from disk.
parse_test_config_str
Parse a test_config.json from a raw JSON string.