dreadnode.evaluations
API reference for the dreadnode.evaluations module.
EvalEnd
Signals the end of an evaluation.
EvalEvent
Base class for all evaluation events.
type: str
Event type discriminator for serialization.
as_dict
as_dict() -> dict[str, t.Any]
Serialize event to a dictionary.
emit
emit(span: TaskSpan) -> None
Emit telemetry to the span.
EvalResult
EvalResult(
    samples: list[Sample[In, Out]] = list(),
    stop_reason: EvalStopReason | None = None,
)
Result of an evaluation run.
assertions_summary
assertions_summary: dict[str, dict[str, float | int]]
Calculates and returns a summary for each assertion across all samples.
error_count
error_count: int
The number of samples that encountered an error during processing.
error_samples
error_samples: list[Sample[In, Out]]
A list of all samples that encountered an error during processing.
failed_count
failed_count: int
The number of samples that failed any assertions.
failed_samples
failed_samples: list[Sample[In, Out]]
A list of all samples that failed at least one assertion.
metrics
metrics: dict[str, list[float]]
Returns a breakdown of all metric values across all samples.
metrics_aggregated
metrics_aggregated: dict[str, float]
Aggregates metrics by calculating the mean for each metric.
metrics_summary
metrics_summary: dict[str, dict[str, float]]
Calculates and returns a summary of statistics for each metric.
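A minimal sketch of how these three views of the same metric data relate, assuming `result` is an EvalResult returned by a completed evaluation run (the exact statistics reported by metrics_summary may vary):

```python
# Hypothetical usage; `result` is assumed to be an EvalResult from a finished run.
raw: dict[str, list[float]] = result.metrics         # every recorded value, per metric
means: dict[str, float] = result.metrics_aggregated  # mean value, per metric

for name, stats in result.metrics_summary.items():
    # `stats` is a dict of summary statistics for this metric.
    print(name, means.get(name), stats)
```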
pass_rate
pass_rate: float
The overall pass rate of the evaluation, from 0.0 to 1.0.
passed_count
passed_count: int
The number of samples that passed all assertions.
passed_samples
passed_samples: list[Sample[In, Out]]
A list of all samples that passed all assertions.
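A minimal sketch of how the pass/fail properties above might be used together, again assuming `result` is a completed EvalResult:

```python
# Hypothetical usage; `result` is assumed to be an EvalResult from a finished run.
print(f"pass rate: {result.pass_rate:.1%}")
print(f"passed={result.passed_count} failed={result.failed_count} errors={result.error_count}")

# Inspect the inputs of samples that failed at least one assertion.
for sample in result.failed_samples:
    print(sample.index, sample.input)
```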
samples
samples: list[Sample[In, Out]] = field(default_factory=list)
All samples from this evaluation.
stop_reason
stop_reason: EvalStopReason | None = None
The reason the evaluation stopped.
to_dataframe
to_dataframe() -> pd.DataFrame
Converts the results into a pandas DataFrame for analysis.
to_dicts
to_dicts() -> list[dict[str, t.Any]]
Flattens the results into a list of dictionaries.
to_jsonl
to_jsonl(path: str | Path) -> None
Saves the results to a JSON Lines (JSONL) file.
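A minimal export sketch, assuming `result` is a completed EvalResult and pandas is installed; the output file name is arbitrary:

```python
# Hypothetical usage; `result` is assumed to be an EvalResult from a finished run.
df = result.to_dataframe()             # one flattened row per sample
records = result.to_dicts()            # the same rows as plain dictionaries
result.to_jsonl("eval_results.jsonl")  # persist as JSON Lines for later analysis

print(df.head())
```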
EvalSample
A single sample in the evaluation.
EvalStart
Signals the beginning of an evaluation.
Evaluation
Evaluation of a task against a dataset.
Attributes:
task (Task[..., Out] | str) – The task to evaluate.
dataset (Any | None) – The dataset to use for the evaluation.
dataset_file (FilePath | str | None) – File path of a JSONL, CSV, JSON, or YAML dataset.
name (str) – The name of the evaluation.
dataset_input_mapping (list[str] | dict[str, str] | None) – Mapping from dataset keys to task parameter names.
preprocessor (InputDatasetProcessor | None) – Optional preprocessor for the dataset.
scorers (ScorersLike[Out]) – Scorers to evaluate task output.
assert_scores (list[str] | Literal[True]) – Scores to assert are truthy.
trace (bool) – Whether to produce trace contexts.
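A minimal construction sketch based on the attributes above. It assumes Evaluation accepts these attributes as keyword arguments, that a plain callable over the task output is accepted wherever ScorersLike[Out] is expected, and that the task `answer_question` and the file `questions.jsonl` are defined elsewhere; all of these are assumptions for illustration:

```python
from dreadnode.evaluations import Evaluation

def exact_match(output: str) -> float:
    # Hypothetical scorer: 1.0 when the output matches the expected answer.
    return float(output.strip().lower() == "paris")

evaluation = Evaluation(
    task=answer_question,                          # a Task defined elsewhere (assumed)
    dataset_file="questions.jsonl",                # JSONL/CSV/JSON/YAML dataset on disk
    dataset_input_mapping={"question": "prompt"},  # dataset key -> task parameter
    scorers=[exact_match],
    assert_scores=["exact_match"],                 # these scores must be truthy to pass
    name="qa-exact-match",
)
```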
max_consecutive_errors
max_consecutive_errors: int | None = Config(default=10)
Maximum consecutive errors before stopping the evaluation.
max_errors
max_errors: int | None = Config(default=None)
Maximum total errors before stopping the evaluation.
console
console() -> EvalResult[In, Out]
Run the evaluation with a live display in the console.
with_
with_(
    *,
    name: str | None = None,
    description: str | None = None,
    tags: list[str] | None = None,
    label: str | None = None,
    task: Task[..., Out] | str | None = None,
    dataset: Any | None = None,
    concurrency: int | None = None,
    iterations: int | None = None,
    max_errors: int | None = None,
    max_consecutive_errors: int | None = None,
    parameters: dict[str, list[Any]] | None = None,
    scorers: ScorersLike[Out] | None = None,
    assert_scores: list[str] | Literal[True] | None = None,
    append: bool = False,
) -> te.Self
Create a modified clone of the evaluation.
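A minimal sketch combining console() and with_(), assuming `evaluation` is an Evaluation instance like the one sketched earlier:

```python
# Run with a live console display and collect results.
result = evaluation.console()

# Clone the evaluation with tighter error limits; the original is left unchanged.
stricter = evaluation.with_(
    name="qa-exact-match-strict",
    max_errors=5,
    max_consecutive_errors=2,
)
strict_result = stricter.console()
```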
Sample
Represents a single input-output sample processed by a task.
Attributes:
id (UUID) – Unique identifier for the sample.
input (In) – The sample input value.
output (Out | None) – The sample output value.
index (int) – The index of the sample in the dataset.
metrics (dict[str, MetricSeries]) – Metrics from scorers and execution.
assertions (dict[str, bool]) – Pass/fail status for asserted scorers.
context (dict[str, Any] | None) – Contextual information about the sample.
error (ErrorField | None) – Any error that occurred.
task (TaskSpan[Out] | None) – Associated task span.
created_at (datetime) – The creation timestamp of the sample.
failed
failed: bool
Whether the underlying task failed for reasons other than score assertions.
passed
passed: bool
Whether all assertions have passed.
get_average_metric_value
get_average_metric_value(key: str) -> float
Compute the average value of the specified metric.
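A minimal per-sample inspection sketch, assuming `result` is a completed EvalResult and that a metric named "exact_match" was recorded (the metric name is an assumption):

```python
for sample in result.samples:
    if sample.failed:        # the task itself errored
        print(f"[{sample.index}] error: {sample.error}")
    elif not sample.passed:  # one or more asserted scores were falsy
        print(f"[{sample.index}] failed assertions: {sample.assertions}")
    else:
        score = sample.get_average_metric_value("exact_match")  # hypothetical metric name
        print(f"[{sample.index}] exact_match={score:.2f}")
```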
to_dict
to_dict() -> dict[str, t.Any]
Flatten the sample’s data for DataFrame conversion.