Evaluations

Evaluations run a task function over a dataset, score the outputs, and report results so you can compare agent performance over time.

An evaluation answers: How well does this agent perform on a defined workload? You provide:

  • A dataset of inputs
  • A task function that produces outputs
  • One or more scorers to grade those outputs

The SDK’s Evaluation class orchestrates the run and streams progress events while the agent executes inside its sandbox.
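The three ingredients can be sketched in plain Python. All names here (`run_eval`, `exact_match`, the dataset item shape) are illustrative assumptions, not the SDK's actual API:

```python
# Illustrative sketch of an evaluation's three ingredients (hypothetical
# names, not the SDK's API): a dataset, a task function, and a scorer.

def task(item):
    """Produce an output for one dataset item (stands in for an agent call)."""
    return item["question"].strip().lower()

def exact_match(output, expected):
    """A simple scorer: 1.0 if the output equals the expected answer."""
    return 1.0 if output == expected else 0.0

def run_eval(dataset, task, scorers):
    """Run the task over every item and score each output."""
    results = []
    for item in dataset:
        output = task(item)
        scores = {s.__name__: s(output, item["expected"]) for s in scorers}
        results.append({"input": item, "output": output, "scores": scores})
    return results

dataset = [
    {"question": " Paris ", "expected": "paris"},
    {"question": "Berlin", "expected": "berlin"},
]
results = run_eval(dataset, task, scorers=[exact_match])
```

The real class adds batching, streaming, and error handling on top of this core loop.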

| Setting | What it controls | Typical use |
| --- | --- | --- |
| dataset | Items the task runs on | Benchmarks, test cases, prompts |
| task | Function that produces outputs | Agent call, tool workflow |
| scorers | How outputs are scored | Accuracy, safety, or assertions |
| scenarios | Parameter variations | Prompt versions or tool configs |
| iterations | Repeats of the full dataset | Variance and stability checks |
| concurrency | Parallel samples per batch | Faster runs with safe limits |
| maxErrors | Total errors before stop | Circuit breaker |
| maxConsecutiveErrors | Back-to-back errors before stop | Circuit breaker |

Prefer built-in scorers when possible to keep results consistent. Custom scorers are supported for domain-specific grading.
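A custom scorer is just a function that grades an output. As a hedged sketch (the function name, argument shape, and return keys are assumptions for illustration), a domain-specific scorer might check that an answer covers a set of required keywords:

```python
# Hypothetical custom scorer: grades an output by keyword coverage,
# returning a fractional score plus a hard pass/fail assertion.

def keyword_coverage(output, required):
    """Fraction of required keywords present in the output (case-insensitive)."""
    hits = [kw for kw in required if kw.lower() in output.lower()]
    return {
        "score": len(hits) / len(required),
        "passed": len(hits) == len(required),
    }

result = keyword_coverage(
    "Retries use exponential backoff with jitter.",
    required=["backoff", "jitter"],
)
```

Returning both a continuous score and a boolean keeps the scorer usable for metrics and for hard assertions alike.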

Evaluations follow a predictable loop:

  1. Configure the evaluation (dataset, task, scorers).
  2. Execute samples across scenarios and iterations in batches.
  3. Score each sample and aggregate metrics.
  4. Finish with a summary report and stop reason.
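The loop above, together with the two circuit-breaker settings, can be sketched as follows. Function and parameter names are hypothetical; the real SDK also batches and parallelizes samples, which this sequential sketch omits:

```python
# Sequential sketch of the evaluation loop with circuit breakers
# (hypothetical names; the real runner batches samples concurrently).

def run(dataset, task, scenarios, iterations,
        max_errors=10, max_consecutive_errors=3):
    total_errors = 0
    consecutive = 0
    samples = []
    for params in scenarios:            # one pass per parameter variation
        for _ in range(iterations):     # repeated over the full dataset
            for item in dataset:
                try:
                    samples.append(task(item, **params))
                    consecutive = 0     # a success resets the streak
                except Exception:
                    total_errors += 1
                    consecutive += 1
                    if total_errors >= max_errors:
                        return samples, "max_errors"
                    if consecutive >= max_consecutive_errors:
                        return samples, "max_consecutive_errors"
    return samples, "completed"

samples, reason = run(
    dataset=[{"q": "a"}, {"q": "b"}],
    task=lambda item, **params: item["q"].upper(),
    scenarios=[{"model": "small"}, {"model": "large"}],
    iterations=2,
)
```

Note that a single success resets the consecutive-error counter, so `maxConsecutiveErrors` trips only on an unbroken failure streak.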

Results are grouped at four levels:

| Level | What it contains |
| --- | --- |
| Evaluation | Overall stop reason and timing |
| Scenario | Parameter set for the run |
| Iteration | One full pass over the dataset |
| Sample | Input, output, scores, and assertions |
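The four levels nest inside one another. A minimal sketch with dataclasses (field names are illustrative, not the SDK's actual result schema):

```python
# Hypothetical dataclasses mirroring the four result levels.
from dataclasses import dataclass

@dataclass
class Sample:
    input: dict        # the dataset item
    output: str        # what the task produced
    scores: dict       # scorer name -> value

@dataclass
class Iteration:
    samples: list      # one full pass over the dataset

@dataclass
class Scenario:
    params: dict       # the parameter set for this run
    iterations: list

@dataclass
class Evaluation:
    stop_reason: str   # e.g. "completed" or a circuit-breaker trip
    duration_s: float
    scenarios: list

ev = Evaluation(
    stop_reason="completed",
    duration_s=1.2,
    scenarios=[Scenario(
        params={"model": "small"},
        iterations=[Iteration(samples=[
            Sample(input={"q": "a"}, output="A", scores={"acc": 1.0}),
        ])],
    )],
)
```

Aggregate metrics (pass rates, means) roll up from samples through iterations and scenarios to the evaluation summary.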

Evaluation results stream in real time and are stored for later analysis. Use the platform UI to review summaries, pass rates, and metrics across runs.