dreadnode.evaluations

API reference for the dreadnode.evaluations module.

Signals the end of an evaluation.

Base class for all evaluation events.

type: str

Event type discriminator for serialization.

as_dict() -> dict[str, t.Any]

Serialize event to a dictionary.

emit(span: TaskSpan) -> None

Emit telemetry to the span.

EvalResult(
samples: list[Sample[In, Out]] = list(),
stop_reason: EvalStopReason | None = None,
)

Result of an evaluation run.

assertions_summary: dict[str, dict[str, float | int]]

Calculates and returns a summary for each assertion across all samples.

error_count: int

The number of samples that encountered an error during processing.

error_samples: list[Sample[In, Out]]

A list of all samples that encountered an error during processing.

failed_count: int

The number of samples that failed any assertions.

failed_samples: list[Sample[In, Out]]

A list of all samples that failed at least one assertion.

metrics: dict[str, list[float]]

Returns a breakdown of all metric values across all samples.

metrics_aggregated: dict[str, float]

Aggregates metrics by calculating the mean for each metric.

metrics_summary: dict[str, dict[str, float]]

Calculates and returns a summary of statistics for each metric.

pass_rate: float

The overall pass rate of the evaluation, from 0.0 to 1.0.

passed_count: int

The number of samples that passed all assertions.

passed_samples: list[Sample[In, Out]]

A list of all samples that passed all assertions.

samples: list[Sample[In, Out]] = field(default_factory=list)

All samples from this evaluation.

stop_reason: EvalStopReason | None = None

The reason the evaluation stopped.
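
For example, a completed run can be inspected through these properties. A minimal sketch, assuming result is an EvalResult returned from a run and that a metric named "accuracy" was produced by the scorers (the metric name is hypothetical):

    # Overall outcome across all samples.
    print(f"pass rate: {result.pass_rate:.1%}")
    print(f"passed={result.passed_count} failed={result.failed_count} errors={result.error_count}")

    # Mean value per metric, e.g. {"accuracy": 0.66}.
    print(result.metrics_aggregated)

    # Drill into the samples that failed at least one assertion.
    for sample in result.failed_samples:
        print(sample.index, sample.assertions)

    # None when the evaluation ran to completion without hitting an error limit.
    if result.stop_reason is not None:
        print(f"stopped early: {result.stop_reason}")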

to_dataframe() -> pd.DataFrame

Converts the results into a pandas DataFrame for analysis.

to_dicts() -> list[dict[str, t.Any]]

Flattens the results into a list of dictionaries.

to_jsonl(path: str | Path) -> None

Saves the results to a JSON Lines (JSONL) file.
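
These export helpers hand the results off for analysis. A short sketch, assuming result is an EvalResult, pandas is installed for to_dataframe(), and that each sample becomes one flattened row (the output file name is arbitrary):

    df = result.to_dataframe()              # pandas DataFrame of flattened samples
    rows = result.to_dicts()                # the same data as plain dictionaries
    result.to_jsonl("eval-results.jsonl")   # persist to a JSON Lines file for later comparison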

A single sample in the evaluation.

Signals the beginning of an evaluation.

Evaluation of a task against a dataset.

Attributes:

  • task (Task[..., Out] | str) – The task to evaluate.
  • dataset (Any | None) – The dataset to use for the evaluation.
  • dataset_file (FilePath | str | None) – File path of a JSONL, CSV, JSON, or YAML dataset.
  • name (str) – The name of the evaluation.
  • dataset_input_mapping (list[str] | dict[str, str] | None) – Mapping from dataset keys to task parameter names.
  • preprocessor (InputDatasetProcessor | None) – Optional preprocessor for the dataset.
  • scorers (ScorersLike[Out]) – Scorers used to evaluate task output.
  • assert_scores (list[str] | Literal[True]) – Scores to assert as truthy.
  • trace (bool) – Whether to produce trace contexts.
max_consecutive_errors: int | None = Config(default=10)

Maximum consecutive errors before stopping the evaluation.

max_errors: int | None = Config(default=None)

Maximum total errors before stopping the evaluation.

console() -> EvalResult[In, Out]

Run the evaluation with a live display in the console.
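
Putting the fields above together, a typical flow is to construct an evaluation and run it with a live display. This is only a sketch: the class name Eval, the import path, the task and dataset contents, and passing the Config fields at construction are all assumptions; only keywords listed in the attributes above are used:

    from dreadnode.evaluations import Eval  # assumed name and import path

    evaluation = Eval(
        task="summarize",                    # a registered task name or a Task object (hypothetical name)
        dataset=[
            {"text": "First document"},      # hypothetical rows; keys map onto task parameters
            {"text": "Second document"},
        ],
        name="summarize-smoke-test",
        dataset_input_mapping=["text"],      # forward the "text" column as a task argument
        max_errors=5,                        # stop after five total errors
        max_consecutive_errors=3,            # or after three errors in a row
    )

    # Run with a live console display; returns an EvalResult[In, Out].
    result = evaluation.console()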

with_(
*,
name: str | None = None,
description: str | None = None,
tags: list[str] | None = None,
label: str | None = None,
task: Task[..., Out] | str | None = None,
dataset: Any | None = None,
concurrency: int | None = None,
iterations: int | None = None,
max_errors: int | None = None,
max_consecutive_errors: int | None = None,
parameters: dict[str, list[Any]] | None = None,
scorers: ScorersLike[Out] | None = None,
assert_scores: list[str] | Literal[True] | None = None,
append: bool = False,
) -> te.Self

Create a modified clone of the evaluation.
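
Continuing the sketch above, with_() varies settings without mutating the original evaluation; every keyword shown here comes from the signature above:

    # Clone the evaluation with different settings; the original is left untouched.
    nightly = evaluation.with_(
        name="summarize-nightly",
        tags=["nightly"],
        concurrency=8,
        iterations=3,
        max_errors=10,
    )

    result = nightly.console()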

Represents a single input-output sample processed by a task.

Attributes:

  • id (UUID) – Unique identifier for the sample.
  • input (In) – The sample input value.
  • output (Out | None) – The sample output value.
  • index (int) – The index of the sample in the dataset.
  • metrics (dict[str, MetricSeries]) – Metrics from scorers and execution.
  • assertions (dict[str, bool]) – Pass/fail status for asserted scorers.
  • context (dict[str, Any] | None) – Contextual information about the sample.
  • error (ErrorField | None) – Any error that occurred.
  • task (TaskSpan[Out] | None) – Associated task span.
  • created_at (datetime) – The creation timestamp of the sample.
failed: bool

Whether the underlying task failed for reasons other than score assertions.

passed: bool

Whether all assertions have passed.

get_average_metric_value(key: str) -> float

Compute the average value of the specified metric.

to_dict() -> dict[str, t.Any]

Flatten the sample’s data for DataFrame conversion.
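
To tie these together, a short sketch iterating over the samples of an EvalResult; the metric name "accuracy" is hypothetical:

    for sample in result.samples:
        if sample.failed:
            # The task itself errored, independent of score assertions.
            print(sample.index, sample.error)
        elif not sample.passed:
            # At least one asserted score was falsy.
            print(sample.index, sample.assertions)
        else:
            print(sample.index, sample.get_average_metric_value("accuracy"))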