# Evaluations
Evaluations run a task function over a dataset, score the outputs, and report results so you can compare agent performance over time.
## What an evaluation is

An evaluation answers one question: how well does this agent perform on a defined workload? You provide:
- A dataset of inputs
- A task function that produces outputs
- One or more scorers to grade those outputs
The SDK’s Evaluation class orchestrates the run and streams progress events while the agent executes inside its sandbox.
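Concretely, the three pieces fit together like this. The sketch below is a minimal, self-contained illustration of the concepts only, not the SDK's actual `Evaluation` API; `runEval`, `EvalConfig`, and the scorer signature are invented names, and real tasks are typically async.

```typescript
// Minimal sketch: run a task over a dataset and grade outputs with scorers.
// All names here are illustrative, not the SDK's real API.
type Scorer<I, O> = (input: I, output: O) => number; // score in [0, 1]

interface EvalConfig<I, O> {
  dataset: I[];            // inputs to run the task on
  task: (input: I) => O;   // produces outputs (real tasks are usually async)
  scorers: Scorer<I, O>[]; // grade each output
}

function runEval<I, O>(cfg: EvalConfig<I, O>) {
  return cfg.dataset.map((input) => {
    const output = cfg.task(input);
    const scores = cfg.scorers.map((score) => score(input, output));
    return { input, output, scores };
  });
}

// Toy example: the task upper-cases strings; the scorer checks exact match.
const samples = runEval({
  dataset: ["hello", "world"],
  task: (s: string) => s.toUpperCase(),
  scorers: [(input, output) => (output === input.toUpperCase() ? 1 : 0)],
});
console.log(samples.length); // 2
```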
## Core building blocks

### Configuration essentials
Section titled “Configuration essentials”| Setting | What it controls | Typical use |
|---|---|---|
| `dataset` | Items the task runs on | Benchmarks, test cases, prompts |
| `task` | Function that produces outputs | Agent call, tool workflow |
| `scorers` | How outputs are scored | Accuracy, safety, or assertions |
| `scenarios` | Parameter variations | Prompt versions or tool configs |
| `iterations` | Repeats of the full dataset | Variance and stability checks |
| `concurrency` | Parallel samples per batch | Faster runs with safe limits |
| `maxErrors` | Total errors before stopping | Circuit breaker |
| `maxConsecutiveErrors` | Back-to-back errors before stopping | Circuit breaker |
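Put together, a configuration covering these settings might look like the following. The field names mirror the table above, but the interface itself is illustrative; the SDK's real types may differ.

```typescript
// Illustrative configuration shape mirroring the settings table.
// Field names follow the table; the SDK's exported types may differ.
interface EvaluationConfig {
  dataset: { input: string; expected?: string }[];
  task: (input: string) => string;
  scorers: ((output: string, expected?: string) => number)[];
  scenarios?: Record<string, unknown>[]; // parameter variations
  iterations?: number;                   // repeats of the full dataset
  concurrency?: number;                  // parallel samples per batch
  maxErrors?: number;                    // total-error circuit breaker
  maxConsecutiveErrors?: number;         // consecutive-error circuit breaker
}

const config: EvaluationConfig = {
  dataset: [{ input: "abc", expected: "cba" }],
  task: (input) => input.split("").reverse().join(""), // toy task
  scorers: [(output, expected) => (output === expected ? 1 : 0)],
  iterations: 3,
  concurrency: 4,
  maxErrors: 10,
  maxConsecutiveErrors: 3,
};
console.log(config.concurrency); // 4
```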
### Scoring guidance

Prefer built-in scorers where possible to keep results comparable across runs. Custom scorers are supported for domain-specific grading.
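A custom scorer is just a function from an output to a score. For instance, a domain-specific keyword scorer could grade what fraction of required terms appear in a response. This is a standalone sketch, not an SDK built-in; `keywordScorer` is a hypothetical name.

```typescript
// Custom scorer sketch: fraction of required keywords present in the output.
// Hypothetical helper, not an SDK built-in.
function keywordScorer(required: string[]) {
  return (output: string): number => {
    const hits = required.filter((k) =>
      output.toLowerCase().includes(k.toLowerCase())
    );
    return required.length === 0 ? 1 : hits.length / required.length;
  };
}

// "refund" is present, "apology" is not, so the score is 1/2.
const score = keywordScorer(["refund", "apology"])(
  "We are sorry; a refund has been issued."
);
console.log(score); // 0.5
```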
## Lifecycle and execution flow

Evaluations follow a predictable loop:
1. Configure the evaluation (dataset, task, scorers).
2. Execute samples across scenarios and iterations in batches.
3. Score each sample and aggregate metrics.
4. Finish with a summary report and stop reason.
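The execution step, including concurrency batching and the error circuit breakers, can be sketched as follows. This is an illustration of the flow, not the SDK's internals; `executeAll` and `StopReason` are invented names.

```typescript
// Sketch of the execution loop: process samples in concurrency-sized
// batches and stop early when error thresholds are hit.
// Illustrative only, not the SDK's internals.
type StopReason = "completed" | "max_errors" | "max_consecutive_errors";

function executeAll(
  inputs: number[],
  task: (n: number) => number, // throws to simulate a failed sample
  concurrency: number,
  maxErrors: number,
  maxConsecutiveErrors: number
): { outputs: number[]; stopReason: StopReason } {
  const outputs: number[] = [];
  let errors = 0;
  let consecutive = 0;
  for (let i = 0; i < inputs.length; i += concurrency) {
    const batch = inputs.slice(i, i + concurrency);
    for (const input of batch) {
      try {
        outputs.push(task(input));
        consecutive = 0; // a success resets the consecutive counter
      } catch {
        errors++;
        consecutive++;
        if (errors >= maxErrors) return { outputs, stopReason: "max_errors" };
        if (consecutive >= maxConsecutiveErrors)
          return { outputs, stopReason: "max_consecutive_errors" };
      }
    }
  }
  return { outputs, stopReason: "completed" };
}

// One failed sample (input 3) stays under both thresholds.
const run = executeAll(
  [1, 2, 3, 4],
  (n) => {
    if (n === 3) throw new Error("sample failed");
    return n * 2;
  },
  2, // concurrency
  5, // maxErrors
  2  // maxConsecutiveErrors
);
console.log(run.stopReason); // "completed"
```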
## Results and reporting

### Result hierarchy
Section titled “Result hierarchy”| Level | What it contains |
|---|---|
| Evaluation | Overall stop reason and timing |
| Scenario | Parameter set for the run |
| Iteration | One full pass over the dataset |
| Sample | Input, output, scores, and assertions |
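The hierarchy above maps naturally onto nested records. The types below are illustrative, not the SDK's exported result types.

```typescript
// Illustrative nesting of the result hierarchy from the table above.
// These are not the SDK's exported types.
interface SampleResult {
  input: unknown;
  output: unknown;
  scores: Record<string, number>;
}
interface IterationResult {
  samples: SampleResult[]; // one full pass over the dataset
}
interface ScenarioResult {
  params: Record<string, unknown>; // parameter set for the run
  iterations: IterationResult[];
}
interface EvaluationResult {
  stopReason: string; // e.g. "completed"
  durationMs: number;
  scenarios: ScenarioResult[];
}

const result: EvaluationResult = {
  stopReason: "completed",
  durationMs: 1250,
  scenarios: [
    {
      params: { promptVersion: "v2" },
      iterations: [
        { samples: [{ input: "2+2", output: "4", scores: { accuracy: 1 } }] },
      ],
    },
  ],
};
console.log(result.scenarios[0].iterations[0].samples.length); // 1
```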
### Where results show up

Evaluation results stream in real time and are stored for later analysis. Use the platform UI to review summaries, pass rates, and metrics across runs.