Module evaluation

Evaluation framework — evaluate agent quality using various metrics.

Mirrors ADK-Python’s evaluation module. Provides traits and types for evaluating agent invocations against expected results, including LLM-as-judge, trajectory, response, rubric, hallucination, safety, and user simulator evaluators.

Structs§

EvalCase: A single evaluation case — pairs actual invocations with optional expected results.
EvalCaseFile: A single evaluation case within an eval set file.
EvalMetric: A named metric with its aggregated and per-invocation results.
EvalResult: The result of evaluating an evaluation set.
EvalSet: A collection of evaluation cases.
EvalSetFile: Top-level structure of a .evalset.json file.
ExpectedToolUse: An expected tool call within an invocation.
HallucinationEvaluator: Evaluates whether agent responses are grounded (not hallucinated).
IntermediateData: Intermediate data captured during agent execution.
Invocation: A single invocation (conversation) for evaluation.
InvocationFile: A single invocation (turn pair) in the eval case conversation.
InvocationTurn: A single turn in a conversation for evaluation.
LlmAsJudge: Evaluator that uses an LLM to judge agent responses.
LlmAsJudgeConfig: Configuration for the LLM-as-judge evaluator.
PerInvocationResult: A single metric evaluation score for one invocation.
ResponseEvaluator: Evaluates the agent’s final response against expected output.
RubricEvaluator: Evaluator that scores agent outputs against free-text rubric criteria using an LLM as judge.
SafetyEvaluator: Evaluates agent responses for safety violations.
SafetySignal: Safety signal detected during evaluation.
TestConfig: Top-level test configuration loaded from test_config.json.
ToolUseRecord: Record of a tool use that occurred during execution.
TrajectoryEvaluator: Evaluates the tool-call trajectory of agent invocations.
UserSimulatorEvaluator: Evaluates the fidelity of a user simulator in multi-turn conversations.

Enums§

CriterionConfig: Configuration for a single evaluation criterion.
EvalError: Errors from evaluation operations.
MatchStrategy: Strategy for comparing actual vs. expected responses.
RubricMode: Evaluation mode for rubric evaluation.
SafetyCategory: Categories of safety concerns.
TrajectoryMatchType: How to compare actual vs. expected tool call trajectories.
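
To illustrate what comparing actual and expected tool-call trajectories can involve, here is a minimal sketch using only the standard library. The type `ToolCall`, the enum `MatchKind` with its three variants, and the helpers `trajectory_matches` and `calls` are all hypothetical names made up for this example. They are not the crate's `ToolUseRecord`, `ExpectedToolUse`, or `TrajectoryMatchType`, whose fields and variants are not shown on this page.

```rust
/// Hypothetical stand-in for a recorded tool call. The crate's own
/// `ToolUseRecord` / `ExpectedToolUse` types may carry more data
/// (for example, call arguments).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ToolCall {
    pub name: String,
}

/// Hypothetical comparison modes (assumed for illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchKind {
    /// Same calls, same order, nothing extra.
    Exact,
    /// Expected calls appear in order; extra calls in between are allowed.
    InOrder,
    /// Expected calls all appear, in any order; extra calls are allowed.
    AnyOrder,
}

/// Returns true when `actual` satisfies `expected` under `kind`.
pub fn trajectory_matches(actual: &[ToolCall], expected: &[ToolCall], kind: MatchKind) -> bool {
    match kind {
        MatchKind::Exact => actual == expected,
        MatchKind::InOrder => {
            // Subsequence check: each expected call must be found after
            // the position where the previous one was found.
            let mut remaining = actual.iter();
            expected.iter().all(|e| remaining.any(|a| a == e))
        }
        MatchKind::AnyOrder => {
            // Multiset containment: each actual call can satisfy at most
            // one expected call.
            let mut pool: Vec<&ToolCall> = actual.iter().collect();
            expected.iter().all(|e| {
                let found = pool.iter().position(|a| *a == e);
                if let Some(i) = found {
                    pool.swap_remove(i);
                    true
                } else {
                    false
                }
            })
        }
    }
}

/// Convenience constructor for building a trajectory from tool names.
pub fn calls(names: &[&str]) -> Vec<ToolCall> {
    names.iter().map(|n| ToolCall { name: n.to_string() }).collect()
}
```

Under these assumptions, a trajectory `a, x, b` satisfies the expectation `a, b` in the in-order and any-order modes but not in the exact mode. Which of these behaviors the crate actually offers should be checked against the `TrajectoryMatchType` documentation.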

Traits§

Evaluator: Trait for evaluating agent invocations against expected results.
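
As a rough picture of how a trait-based evaluator and an aggregated metric might fit together, the following sketch defines a small trait and two response matchers. Everything in it is an assumption made for illustration: the trait `Evaluate`, the struct `Turn`, the matchers `ExactMatch` and `ContainsMatch`, and the helper `mean_score` are hypothetical, and the crate's real `Evaluator` trait may have a different signature (for instance, it may be async or return `EvalError` on failure).

```rust
/// Hypothetical, simplified turn: one agent response plus an optional
/// reference answer. Not the crate's `Invocation` type.
pub struct Turn {
    pub actual: String,
    pub expected: Option<String>,
}

/// Hypothetical trait shape, assumed for this sketch only.
pub trait Evaluate {
    /// Name under which the metric would be reported.
    fn name(&self) -> &str;
    /// Score in [0.0, 1.0], or `None` when the turn has no reference.
    fn score(&self, turn: &Turn) -> Option<f64>;
}

/// Scores 1.0 when the trimmed response equals the trimmed reference.
pub struct ExactMatch;

impl Evaluate for ExactMatch {
    fn name(&self) -> &str {
        "exact_match"
    }
    fn score(&self, turn: &Turn) -> Option<f64> {
        let expected = turn.expected.as_ref()?;
        Some(if turn.actual.trim() == expected.trim() { 1.0 } else { 0.0 })
    }
}

/// Scores 1.0 when the response contains the reference as a substring.
pub struct ContainsMatch;

impl Evaluate for ContainsMatch {
    fn name(&self) -> &str {
        "contains_match"
    }
    fn score(&self, turn: &Turn) -> Option<f64> {
        let expected = turn.expected.as_ref()?;
        Some(if turn.actual.contains(expected.as_str()) { 1.0 } else { 0.0 })
    }
}

/// Averages the per-turn scores, skipping turns without a reference.
/// Returns `None` when no turn could be scored.
pub fn mean_score(evaluator: &dyn Evaluate, turns: &[Turn]) -> Option<f64> {
    let scores: Vec<f64> = turns.iter().filter_map(|t| evaluator.score(t)).collect();
    if scores.is_empty() {
        None
    } else {
        Some(scores.iter().sum::<f64>() / scores.len() as f64)
    }
}
```

Returning `Option<f64>` rather than defaulting to zero keeps turns without an expected result from dragging the average down, which matches the description of expected results as optional in `EvalCase`. Whether the crate aggregates this way is not stated here.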

Functions§

parse_evalset: Parse an .evalset.json file from disk.
parse_evalset_str: Parse an .evalset.json from a raw JSON string.
parse_test_config: Parse a test_config.json file from disk.
parse_test_config_str: Parse a test_config.json from a raw JSON string.
