dreadnode.evaluations

API reference for the dreadnode.evaluations module.

Signals the end of an evaluation.

Base class for all evaluation events.

type: str

Event type discriminator for serialization.

as_dict() -> dict[str, t.Any]

Serialize event to a dictionary.

emit(span: TaskSpan) -> None

Emit telemetry to the span.

EvalResult(
samples: list[Sample[In, Out]] = list(),
stop_reason: EvalStopReason | None = None,
)

Result of an evaluation run.

assertions_summary: dict[str, dict[str, float | int]]

Calculates and returns a summary for each assertion across all samples.

error_count: int

The number of samples that encountered an error during processing.

error_samples: list[Sample[In, Out]]

A list of all samples that encountered an error during processing.

failed_count: int

The number of samples that failed any assertions.

failed_samples: list[Sample[In, Out]]

A list of all samples that failed at least one assertion.

metrics: dict[str, list[float]]

Returns a breakdown of all metric values across all samples.

metrics_aggregated: dict[str, float]

Aggregates metrics by calculating the mean for each metric.

metrics_summary: dict[str, dict[str, float]]

Calculates and returns a summary of statistics for each metric.

pass_rate: float

The overall pass rate of the evaluation, from 0.0 to 1.0.

passed_count: int

The number of samples that passed all assertions.

passed_samples: list[Sample[In, Out]]

A list of all samples that passed all assertions.

samples: list[Sample[In, Out]] = field(default_factory=list)

All samples from this evaluation.

stop_reason: EvalStopReason | None = None

The reason the evaluation stopped.
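
For example, a completed run can be inspected through these properties. A minimal sketch, assuming result is an EvalResult returned from a run and that a metric named "accuracy" was produced by the scorers (the metric name is hypothetical):

    # Overall outcome across all samples.
    print(f"pass rate: {result.pass_rate:.1%}")
    print(f"passed={result.passed_count} failed={result.failed_count} errors={result.error_count}")

    # Mean value per metric, e.g. {"accuracy": 0.66}.
    print(result.metrics_aggregated)

    # Drill into the samples that failed at least one assertion.
    for sample in result.failed_samples:
        print(sample.index, sample.assertions)

    # None when the evaluation ran to completion without hitting an error limit.
    if result.stop_reason is not None:
        print(f"stopped early: {result.stop_reason}")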

to_dataframe() -> pd.DataFrame

Converts the results into a pandas DataFrame for analysis.

to_dicts() -> list[dict[str, t.Any]]

Flattens the results into a list of dictionaries.

to_jsonl(path: str | Path) -> None

Saves the results to a JSON Lines (JSONL) file.
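
These export helpers hand the results off for analysis. A short sketch, assuming result is an EvalResult, pandas is installed for to_dataframe(), and that each sample becomes one flattened row (the output file name is arbitrary):

    df = result.to_dataframe()              # pandas DataFrame of flattened samples
    rows = result.to_dicts()                # the same data as plain dictionaries
    result.to_jsonl("eval-results.jsonl")   # persist to a JSON Lines file for later comparison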

A single sample in the evaluation.

Signals the beginning of an evaluation.

Evaluation of a task against a dataset.

Attributes:

  • task (Task[..., Out] | str) – The task to evaluate.
  • dataset (Any | None) – The dataset to use for the evaluation.
  • dataset_file (FilePath | str | None) – File path of a JSONL, CSV, JSON, or YAML dataset.
  • name (str) – The name of the evaluation.
  • dataset_input_mapping (list[str] | dict[str, str] | None) – Mapping from dataset keys to task parameter names.
  • preprocessor (InputDatasetProcessor | None) – Optional preprocessor for the dataset.
  • scorers (ScorersLike[Out]) – Scorers used to evaluate task output.
  • assert_scores (list[str] | Literal[True]) – Scores to assert as truthy.
  • trace (bool) – Whether to produce trace contexts.
max_consecutive_errors: int | None = Config(default=10)

Maximum consecutive errors before stopping the evaluation.

max_errors: int | None = Config(default=None)

Maximum total errors before stopping the evaluation.

console() -> EvalResult[In, Out]

Run the evaluation with a live display in the console.
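
Putting the fields above together, a typical flow is to construct an evaluation and run it with a live display. This is only a sketch: the class name Eval, the import path, the task and dataset contents, and passing the Config fields at construction are all assumptions; only keywords listed in the attributes above are used:

    from dreadnode.evaluations import Eval  # assumed name and import path

    evaluation = Eval(
        task="summarize",                    # a registered task name or a Task object (hypothetical name)
        dataset=[
            {"text": "First document"},      # hypothetical rows; keys map onto task parameters
            {"text": "Second document"},
        ],
        name="summarize-smoke-test",
        dataset_input_mapping=["text"],      # forward the "text" column as a task argument
        max_errors=5,                        # stop after five total errors
        max_consecutive_errors=3,            # or after three errors in a row
    )

    # Run with a live console display; returns an EvalResult[In, Out].
    result = evaluation.console()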

with_(
*,
name: str | None = None,
description: str | None = None,
tags: list[str] | None = None,
label: str | None = None,
task: Task[..., Out] | str | None = None,
dataset: Any | None = None,
concurrency: int | None = None,
iterations: int | None = None,
max_errors: int | None = None,
max_consecutive_errors: int | None = None,
parameters: dict[str, list[Any]] | None = None,
scorers: ScorersLike[Out] | None = None,
assert_scores: list[str] | Literal[True] | None = None,
append: bool = False,
) -> te.Self

Create a modified clone of the evaluation.
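
Continuing the sketch above, with_() varies settings without mutating the original evaluation; every keyword shown here comes from the signature above:

    # Clone the evaluation with different settings; the original is left untouched.
    nightly = evaluation.with_(
        name="summarize-nightly",
        tags=["nightly"],
        concurrency=8,
        iterations=3,
        max_errors=10,
    )

    result = nightly.console()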

Represents a single input-output sample processed by a task.

Attributes:

  • id (UUID) – Unique identifier for the sample.
  • input (In) – The sample input value.
  • output (Out | None) – The sample output value.
  • index (int) – The index of the sample in the dataset.
  • metrics (dict[str, MetricSeries]) – Metrics from scorers and execution.
  • assertions (dict[str, bool]) – Pass/fail status for asserted scorers.
  • context (dict[str, Any] | None) – Contextual information about the sample.
  • error (ErrorField | None) – Any error that occurred.
  • task (TaskSpan[Out] | None) – Associated task span.
  • created_at (datetime) – The creation timestamp of the sample.
failed: bool

Whether the underlying task failed for reasons other than score assertions.

passed: bool

Whether all assertions have passed.

get_average_metric_value(key: str) -> float

Compute the average value of the specified metric.

to_dict() -> dict[str, t.Any]

Flatten the sample’s data for DataFrame conversion.
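
To tie these together, a short sketch iterating over the samples of an EvalResult; the metric name "accuracy" is hypothetical:

    for sample in result.samples:
        if sample.failed:
            # The task itself errored, independent of score assertions.
            print(sample.index, sample.error)
        elif not sample.passed:
            # At least one asserted score was falsy.
            print(sample.index, sample.assertions)
        else:
            print(sample.index, sample.get_average_metric_value("accuracy"))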