Evaluations

Evaluations run a task function over a dataset, score the outputs, and report results so you can compare agent performance over time.

An evaluation answers: How well does this agent perform on a defined workload? You provide:

  • A dataset of inputs
  • A task function that produces outputs
  • One or more scorers to grade those outputs

The SDK’s Evaluation class orchestrates the run and streams progress events while the agent executes inside its sandbox.
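The three ingredients can be sketched in plain Python. All names here (`run_eval`, `exact_match`, the dataset item shape) are illustrative assumptions, not the SDK's actual API:

```python
# Illustrative sketch of an evaluation's three ingredients (hypothetical
# names, not the SDK's API): a dataset, a task function, and a scorer.

def task(item):
    """Produce an output for one dataset item (stands in for an agent call)."""
    return item["question"].strip().lower()

def exact_match(output, expected):
    """A simple scorer: 1.0 if the output equals the expected answer."""
    return 1.0 if output == expected else 0.0

def run_eval(dataset, task, scorers):
    """Run the task over every item and score each output."""
    results = []
    for item in dataset:
        output = task(item)
        scores = {s.__name__: s(output, item["expected"]) for s in scorers}
        results.append({"input": item, "output": output, "scores": scores})
    return results

dataset = [
    {"question": " Paris ", "expected": "paris"},
    {"question": "Berlin", "expected": "berlin"},
]
results = run_eval(dataset, task, scorers=[exact_match])
```

The real class adds batching, streaming, and error handling on top of this core loop.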

| Setting | What it controls | Typical use |
| --- | --- | --- |
| dataset | Items the task runs on | Benchmarks, test cases, prompts |
| task | Function that produces outputs | Agent call, tool workflow |
| scorers | How outputs are scored | Accuracy, safety, or assertions |
| scenarios | Parameter variations | Prompt versions or tool configs |
| iterations | Repeats of the full dataset | Variance and stability checks |
| concurrency | Parallel samples per batch | Faster runs with safe limits |
| maxErrors | Total errors before stop | Circuit breaker |
| maxConsecutiveErrors | Back-to-back errors before stop | Circuit breaker |

Prefer built-in scorers when possible to keep results consistent. Custom scorers are supported for domain-specific grading.
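A custom scorer is just a function that grades an output. As a hedged sketch (the function name, argument shape, and return keys are assumptions for illustration), a domain-specific scorer might check that an answer covers a set of required keywords:

```python
# Hypothetical custom scorer: grades an output by keyword coverage,
# returning a fractional score plus a hard pass/fail assertion.

def keyword_coverage(output, required):
    """Fraction of required keywords present in the output (case-insensitive)."""
    hits = [kw for kw in required if kw.lower() in output.lower()]
    return {
        "score": len(hits) / len(required),
        "passed": len(hits) == len(required),
    }

result = keyword_coverage(
    "Retries use exponential backoff with jitter.",
    required=["backoff", "jitter"],
)
```

Returning both a continuous score and a boolean keeps the scorer usable for metrics and for hard assertions alike.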

Evaluations follow a predictable loop:

  1. Configure the evaluation (dataset, task, scorers).
  2. Execute samples across scenarios and iterations in batches.
  3. Score each sample and aggregate metrics.
  4. Finish with a summary report and stop reason.
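The loop above, together with the two circuit-breaker settings, can be sketched as follows. Function and parameter names are hypothetical; the real SDK also batches and parallelizes samples, which this sequential sketch omits:

```python
# Sequential sketch of the evaluation loop with circuit breakers
# (hypothetical names; the real runner batches samples concurrently).

def run(dataset, task, scenarios, iterations,
        max_errors=10, max_consecutive_errors=3):
    total_errors = 0
    consecutive = 0
    samples = []
    for params in scenarios:            # one pass per parameter variation
        for _ in range(iterations):     # repeated over the full dataset
            for item in dataset:
                try:
                    samples.append(task(item, **params))
                    consecutive = 0     # a success resets the streak
                except Exception:
                    total_errors += 1
                    consecutive += 1
                    if total_errors >= max_errors:
                        return samples, "max_errors"
                    if consecutive >= max_consecutive_errors:
                        return samples, "max_consecutive_errors"
    return samples, "completed"

samples, reason = run(
    dataset=[{"q": "a"}, {"q": "b"}],
    task=lambda item, **params: item["q"].upper(),
    scenarios=[{"model": "small"}, {"model": "large"}],
    iterations=2,
)
```

Note that a single success resets the consecutive-error counter, so `maxConsecutiveErrors` trips only on an unbroken failure streak.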

Results are grouped at four levels:

| Level | What it contains |
| --- | --- |
| Evaluation | Overall stop reason and timing |
| Scenario | Parameter set for the run |
| Iteration | One full pass over the dataset |
| Sample | Input, output, scores, and assertions |
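The four levels nest inside one another. A minimal sketch with dataclasses (field names are illustrative, not the SDK's actual result schema):

```python
# Hypothetical dataclasses mirroring the four result levels.
from dataclasses import dataclass

@dataclass
class Sample:
    input: dict        # the dataset item
    output: str        # what the task produced
    scores: dict       # scorer name -> value

@dataclass
class Iteration:
    samples: list      # one full pass over the dataset

@dataclass
class Scenario:
    params: dict       # the parameter set for this run
    iterations: list

@dataclass
class Evaluation:
    stop_reason: str   # e.g. "completed" or a circuit-breaker trip
    duration_s: float
    scenarios: list

ev = Evaluation(
    stop_reason="completed",
    duration_s=1.2,
    scenarios=[Scenario(
        params={"model": "small"},
        iterations=[Iteration(samples=[
            Sample(input={"q": "a"}, output="A", scores={"acc": 1.0}),
        ])],
    )],
)
```

Aggregate metrics (pass rates, means) roll up from samples through iterations and scenarios to the evaluation summary.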

Evaluation results stream in real time and are stored for later analysis. Use the platform UI to review summaries, pass rates, and metrics across runs.