
dreadnode.training

API reference for the dreadnode.training module.

Training module with lazy imports for heavy dependencies.

This module uses lazy loading to avoid importing torch/ray unless needed. Heavy dependencies (torch, ray, transformers, vllm) are only loaded when the user actually accesses training-related classes.
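
A rough illustration of the pattern: a module-level __getattr__ (PEP 562) resolves heavy attributes on first access. The attribute-to-submodule mapping below is hypothetical, not the module's actual table:

import importlib
import typing as t

# Hypothetical mapping; the real module resolves its own trainer classes.
_LAZY_ATTRS = {
    "DPOTrainer": ".dpo",
    "PPOTrainer": ".ppo",
    "SFTTrainer": ".sft",
}

def __getattr__(name: str) -> t.Any:
    if name in _LAZY_ATTRS:
        submodule = importlib.import_module(_LAZY_ATTRS[name], __package__)
        return getattr(submodule, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")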

AsyncRayGRPOTrainer(config: RayGRPOConfig)

Async Ray-based GRPO trainer.

Uses separate GPUs for inference and training to overlap computation:

  • GPU 0: vLLM inference (generates batches continuously)
  • GPU 1: Training (processes batches as they arrive)

This achieves much higher throughput than the colocated version.

Requires at least 2 GPUs.

shutdown() -> None

Shutdown workers.

train(
prompts: Sequence[str],
reward_fn: RewardFn,
num_steps: int | None = None,
) -> TrainingState

Run async GRPO training.

Overlaps inference and training for maximum throughput.
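
A minimal usage sketch, assuming at least two GPUs are available (the reward logic and step count are illustrative):

config = RayGRPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct")
trainer = AsyncRayGRPOTrainer(config)

def reward_fn(prompts, completions):
    # Illustrative reward: prefer concise completions.
    return [1.0 / (1.0 + len(c)) for c in completions]

try:
    state = trainer.train(prompts=["Explain GRPO in one sentence."], reward_fn=reward_fn, num_steps=10)
finally:
    trainer.shutdown()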

DPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
beta: float = 0.1,
label_smoothing: float = 0.0,
loss_type: str = "sigmoid",
max_seq_length: int = 2048,
max_prompt_length: int = 512,
learning_rate: float = 5e-07,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 1,
batch_size: int = 4,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
ref_model_offload: bool = True,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for DPO training.

batch_size: int = 4

Batch size per device.

beta: float = 0.1

Temperature parameter for DPO loss. Higher = more conservative updates.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 4

Gradient accumulation steps.

label_smoothing: float = 0.0

Label smoothing for DPO loss (0 = no smoothing).

learning_rate: float = 5e-07

Learning rate (DPO typically uses lower LR than SFT).

log_interval: int = 10

Steps between logging.

loss_type: str = 'sigmoid'

Loss type: ‘sigmoid’ (standard DPO), ‘hinge’, ‘ipo’.
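
A rough sketch of how the variants differ, written over a single pair's policy/reference log-ratios (a pure-Python illustration under the standard DPO/IPO formulations, not the trainer's implementation):

import math

def _log_sigmoid(z: float) -> float:
    return -math.log1p(math.exp(-z))

def dpo_pair_loss(logratio_chosen, logratio_rejected, beta=0.1,
                  loss_type="sigmoid", label_smoothing=0.0):
    # logratio_* = log pi_policy(y|x) - log pi_ref(y|x) for one preference pair.
    diff = logratio_chosen - logratio_rejected
    if loss_type == "sigmoid":   # standard DPO, optionally label-smoothed
        return (-(1 - label_smoothing) * _log_sigmoid(beta * diff)
                - label_smoothing * _log_sigmoid(-beta * diff))
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * diff)
    if loss_type == "ipo":       # IPO: drive diff toward 1 / (2 * beta)
        return (diff - 1.0 / (2 * beta)) ** 2
    raise ValueError(f"unknown loss_type: {loss_type}")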

max_epochs: int = 1

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_prompt_length: int = 512

Maximum prompt length.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

ref_model_offload: bool = True

Keep reference model on CPU to save GPU memory.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

DPOTrainer(
config: DPOConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

DPO (Direct Preference Optimization) trainer.

DPO directly optimizes the policy on preference pairs without needing a separate reward model or a PPO loop, which makes it much simpler than classic PPO-based RLHF.

The training process:

  1. Load policy model and frozen reference model
  2. For each preference pair (chosen, rejected):
     • Compute log probabilities for both under policy and reference
     • Compute DPO loss to prefer chosen over rejected
  3. Update policy via gradient descent

Attributes:

  • config –DPO configuration
  • model –Training policy model
  • ref_model –Frozen reference model
  • tokenizer –Tokenizer

Initialize DPO trainer.

Parameters:

  • config (DPOConfig) –DPO configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
get_model() -> nn.Module

Get the trained model.

save_checkpoint() -> None

Save training checkpoint.

train(
dataset: Dataset | list[PreferencePair] | list[dict],
) -> dict[str, float]

Run DPO training.

Parameters:

  • dataset (Dataset | list[PreferencePair] | list[dict]) –Training dataset with preference pairs. Each item should have ‘prompt’, ‘chosen’, ‘rejected’ keys.

Returns:

  • dict[str, float] –Final training metrics
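
A minimal usage sketch with an in-memory list of preference dicts (the pair contents and step count are illustrative):

config = DPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=200)
trainer = DPOTrainer(config)

dataset = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 = 4.",
        "rejected": "2 + 2 = 5.",
    },
]

metrics = trainer.train(dataset)
model = trainer.get_model()
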
PPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
reward_model_name: str | None = None,
clip_ratio: float = 0.2,
value_clip_ratio: float = 0.2,
kl_coef: float = 0.1,
kl_target: float | None = 0.01,
entropy_coef: float = 0.01,
gamma: float = 1.0,
gae_lambda: float = 0.95,
max_seq_length: int = 2048,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
learning_rate: float = 1e-06,
critic_lr: float = 1e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
batch_size: int = 8,
mini_batch_size: int = 4,
ppo_epochs: int = 4,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
ref_model_offload: bool = True,
share_critic: bool = False,
critic_warmup_steps: int = 0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for PPO training.

batch_size: int = 8

Prompts per batch.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

clip_ratio: float = 0.2

PPO clipping ratio (epsilon).

critic_lr: float = 1e-05

Learning rate for value function (typically higher than policy).

critic_warmup_steps: int = 0

Pretrain critic for N steps before PPO (0 = no warmup).

entropy_coef: float = 0.01

Entropy bonus coefficient.

gae_lambda: float = 0.95

GAE lambda for advantage estimation.

gamma: float = 1.0

Discount factor (1.0 for episodic tasks like text generation).

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

kl_coef: float = 0.1

KL penalty coefficient.

kl_target: float | None = 0.01

Target KL divergence. If exceeded, KL coef is increased.

learning_rate: float = 1e-06

Learning rate for policy.

log_interval: int = 10

Steps between logging.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_new_tokens: int = 512

Maximum new tokens to generate.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

mini_batch_size: int = 4

Mini-batch size for PPO updates.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Policy model name or path.

ppo_epochs: int = 4

Number of PPO epochs per batch of experience.

ref_model_offload: bool = True

Keep reference model on CPU to save GPU memory.

reward_model_name: str | None = None

Reward model name or path. If None, must provide reward_fn to train().

seed: int = 42

Random seed.

share_critic: bool = False

Share weights between policy and critic (adds value head to policy).

temperature: float = 0.7

Sampling temperature.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

top_p: float = 0.9

Top-p sampling.

trust_remote_code: bool = True

Trust remote code in model repository.

value_clip_ratio: float = 0.2

Value function clipping ratio.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

PPOTrainer(
config: PPOConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

PPO (Proximal Policy Optimization) trainer for RLHF.

Implements the full PPO algorithm with:

  • Policy network (actor)
  • Value network (critic)
  • GAE advantage estimation
  • Clipped surrogate objective
  • KL penalty and adaptive KL coefficient

The training loop:

  1. Generate responses from current policy
  2. Compute rewards using reward model/function
  3. Estimate advantages with GAE
  4. Update policy and value networks with PPO
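
A compressed sketch of steps 3 and 4 above: GAE over per-token rewards and values, then the clipped surrogate objective (a pure-Python illustration of the math, not the trainer's internals):

def gae(rewards, values, gamma=1.0, lam=0.95):
    # values carries one extra bootstrap entry: len(values) == len(rewards) + 1
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate(ratio, advantage, clip_ratio=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); returns the negated PPO objective
    clipped = max(min(ratio, 1 + clip_ratio), 1 - clip_ratio)
    return -min(ratio * advantage, clipped * advantage)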

Attributes:

  • config –PPO configuration
  • policy –Policy (actor) model
  • critic –Value (critic) model
  • ref_model –Frozen reference model for KL penalty
  • tokenizer –Tokenizer

Initialize PPO trainer.

Parameters:

  • config (PPOConfig) –PPO configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
get_policy() -> nn.Module

Get the trained policy model.

save_checkpoint() -> None

Save training checkpoint.

train(
prompts: list[str],
reward_fn: Callable[[list[str], list[str]], list[float]]
| None = None,
) -> dict[str, float]

Run PPO training.

Parameters:

  • prompts (list[str]) –List of training prompts
  • reward_fn (Callable[[list[str], list[str]], list[float]] | None, default: None ) –Optional reward function (prompts, completions) -> rewards. Required if reward_model_name not set in config.

Returns:

  • dict[str, float] –Final training metrics
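
A minimal usage sketch with a callable reward function (the prompts and reward logic are illustrative):

config = PPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=500)
trainer = PPOTrainer(config)

train_prompts = [
    "Explain KL divergence in one sentence.",
    "Summarize PPO in one sentence.",
]

def reward_fn(prompts, completions):
    # Illustrative reward: prefer completions that end with a period.
    return [1.0 if c.strip().endswith(".") else 0.0 for c in completions]

metrics = trainer.train(prompts=train_prompts, reward_fn=reward_fn)
policy = trainer.get_policy()
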
RMConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
value_head_hidden_size: int | None = None,
value_head_dropout: float = 0.1,
pooling: str = "last",
max_seq_length: int = 2048,
max_prompt_length: int = 512,
learning_rate: float = 1e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 3,
batch_size: int = 4,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
margin: float = 0.0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for Reward Model training.

batch_size: int = 4

Batch size per device.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 4

Gradient accumulation steps.

learning_rate: float = 1e-05

Learning rate.

log_interval: int = 10

Steps between logging.

margin: float = 0.0

Margin for Bradley-Terry loss (0 = no margin).
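
A sketch of where the margin enters a Bradley-Terry pairwise loss (an illustration of the standard formulation, not the trainer's code):

import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float, margin: float = 0.0) -> float:
    # -log sigmoid(r_chosen - r_rejected - margin); a positive margin requires
    # the chosen reward to beat the rejected reward by at least `margin`.
    return math.log1p(math.exp(-(reward_chosen - reward_rejected - margin)))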

max_epochs: int = 3

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_prompt_length: int = 512

Maximum prompt length.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Base model name or path.

pooling: str = 'last'

Pooling method: ‘last’ (last non-pad token), ‘mean’, ‘max’.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

value_head_dropout: float = 0.1

Dropout for value head.

value_head_hidden_size: int | None = None

Hidden size for value head. None = match model hidden size.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

RayGRPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
num_prompts_per_step: int = 8,
num_generations_per_prompt: int = 4,
max_steps: int = 1000,
max_epochs: int = 10,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
learning_rate: float = 1e-06,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
log_interval: int = 10,
eval_interval: int = 100,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
vllm: VLLMConfig = VLLMConfig(),
training: TrainingConfig = TrainingConfig(),
loss: GRPOLossConfig = GRPOLossConfig(),
)

Complete configuration for Ray-based GRPO training.

This configuration controls all aspects of GRPO training:

  • Model and tokenizer
  • Generation (vLLM)
  • Training (DeepSpeed/FSDP)
  • GRPO algorithm parameters
checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

eval_interval: int = 100

Steps between evaluation.

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

learning_rate: float = 1e-06

Learning rate.

log_interval: int = 10

Steps between logging.

loss: GRPOLossConfig = field(default_factory=GRPOLossConfig)

GRPO loss configuration.

max_epochs: int = 10

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm for clipping.

max_new_tokens: int = 512

Maximum tokens to generate per completion.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

num_generations_per_prompt: int = 4

Number of completions to generate per prompt (G in GRPO).

num_prompts_per_step: int = 8

Number of unique prompts per training step.

seed: int = 42

Random seed for reproducibility.

temperature: float = 0.7

Sampling temperature.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

top_p: float = 0.9

Top-p (nucleus) sampling.

train_batch_size: int

Total batch size for training.

training: TrainingConfig = field(
default_factory=TrainingConfig
)

Distributed training configuration.

vllm: VLLMConfig = field(default_factory=VLLMConfig)

vLLM inference configuration.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

RayGRPOTrainer(
config: RayGRPOConfig,
colocate: bool = False,
storage: Storage | None = None,
checkpoint_name: str | None = None,
callbacks: list[TrainerCallback] | None = None,
)

Native Ray-based GRPO trainer with colocated inference/training.

Supports two modes:

  1. Memory-efficient mode (default): Time-shares GPU between vLLM and training
     • Lower memory, but slower due to model loading/unloading
  2. Fast mode (colocate=True): Keeps both models loaded
     • Higher memory usage, but much faster (no reload overhead)
     • Uses in-place vLLM weight updates

Example

config = RayGRPOConfig(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    num_generations_per_prompt=4,
)
trainer = RayGRPOTrainer(config, colocate=True)  # Fast mode

def reward_fn(prompts, completions):
    return [1.0 if is_correct(c) else 0.0 for c in completions]

trainer.train(prompts, reward_fn)

Initialize GRPO trainer.

Parameters:

  • config (RayGRPOConfig) –GRPO configuration.
  • colocate (bool, default: False ) –If True, keep both vLLM and training model loaded (faster but more memory).
  • storage (Storage | None, default: None ) –Optional Storage for CAS-based checkpointing.
  • checkpoint_name (str | None, default: None ) –Name for checkpoints (defaults to sanitized model name).
  • callbacks (list[TrainerCallback] | None, default: None ) –List of TrainerCallback instances for customizing training behavior.
add_callback(callback: TrainerCallback) -> None

Add a callback to the trainer.

remove_callback(callback_type: type) -> None

Remove all callbacks of a given type.

save_checkpoint_to_storage(
version: str | None = None,
) -> LocalModel | None

Public method to save checkpoint to CAS.

Parameters:

  • version (str | None, default: None ) –Version string. If None, auto-increments.

Returns:

  • LocalModel | None –LocalModel instance if storage is configured, None otherwise.
shutdown() -> None

Shutdown trainer.

train(
prompts: Sequence[str],
reward_fn: RewardFn,
eval_prompts: Sequence[str] | None = None,
num_steps: int | None = None,
) -> TrainingState

Run GRPO training.

Parameters:

  • prompts (Sequence[str]) –Training prompts.
  • reward_fn (RewardFn) –Function to score completions.
  • eval_prompts (Sequence[str] | None, default: None ) –Optional evaluation prompts.
  • num_steps (int | None, default: None ) –Optional number of steps (overrides config).

Returns:

  • TrainingState –Final training state.
RewardModelTrainer(
config: RMConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

Reward Model trainer using Bradley-Terry loss.

Trains a model to predict scalar rewards from preference pairs. The trained model can then be used in RLHF pipelines (PPO, GRPO, etc.).

Attributes:

  • config –Reward model configuration
  • model –The reward model (base LLM + value head)
  • tokenizer –Tokenizer

Initialize Reward Model trainer.

Parameters:

  • config (RMConfig) –Reward model configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
compute_rewards(
texts: list[str], batch_size: int = 8
) -> list[float]

Compute rewards for a list of texts.

Parameters:

  • texts (list[str]) –List of text sequences
  • batch_size (int, default: 8 ) –Batch size for inference

Returns:

  • list[float] –List of scalar rewards
get_model() -> RewardModel

Get the trained reward model.

get_reward_fn() -> callable

Get a reward function for use with GRPO/PPO.

Returns:

  • callable –A callable that takes texts and returns rewards
save_checkpoint() -> None

Save training checkpoint.

train(dataset: Dataset | list[dict]) -> dict[str, float]

Run reward model training.

Parameters:

  • dataset (Dataset | list[dict]) –Training dataset with preference pairs. Each item should have ‘prompt’, ‘chosen’, ‘rejected’ keys.

Returns:

  • dict[str, float] –Final training metrics
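
A minimal end-to-end sketch: train on preference dicts, then hand the resulting reward function to an RL trainer (dataset contents are illustrative):

rm_config = RMConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=500)
rm_trainer = RewardModelTrainer(rm_config)

pairs = [
    {
        "prompt": "Summarize GAE.",
        "chosen": "GAE blends multi-step TD errors to trade off bias and variance.",
        "rejected": "GAE is a type of optimizer.",
    },
]

rm_trainer.train(pairs)
reward_fn = rm_trainer.get_reward_fn()            # hand this to a GRPO/PPO trainer
scores = rm_trainer.compute_rewards(["Some candidate response."])
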
SFTConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
max_seq_length: int = 2048,
use_packing: bool = True,
packing_efficiency_threshold: float = 0.9,
learning_rate: float = 2e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 3,
batch_size: int = 4,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for SFT training.

batch_size: int = 4

Batch size per device.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

learning_rate: float = 2e-05

Learning rate.

log_interval: int = 10

Steps between logging.

max_epochs: int = 3

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

packing_efficiency_threshold: float = 0.9

Minimum packing efficiency before padding.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

use_packing: bool = True

Enable sequence packing for efficiency.
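
A rough sketch of the idea behind sequence packing: concatenate tokenized examples into fixed-length rows instead of padding each example individually (illustration only; it ignores attention-mask and boundary handling):

def pack_sequences(tokenized, max_seq_length=2048):
    # tokenized: list of token-id lists, each already <= max_seq_length
    packs, current = [], []
    for ids in tokenized:
        if len(current) + len(ids) > max_seq_length:
            packs.append(current)
            current = []
        current.extend(ids)
    if current:
        packs.append(current)
    return packs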

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

SFTTrainer(
config: SFTConfig,
fsdp_config: FSDP2Config | None = None,
)

SFT trainer with sequence packing and FSDP2 support.

Features:

  • Sequence packing for efficient training
  • FSDP2 distributed training
  • Gradient accumulation
  • Mixed precision (bf16)
  • Checkpointing

Initialize SFT trainer.

Parameters:

  • config (SFTConfig) –SFT configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
load_checkpoint(path: str) -> None

Load training checkpoint.

save_checkpoint() -> None

Save training checkpoint.

train(
dataset: Dataset | Sequence[dict],
eval_dataset: Dataset | Sequence[dict] | None = None,
) -> dict[str, float]

Run SFT training.

Parameters:

  • dataset (Dataset | Sequence[dict]) –Training dataset
  • eval_dataset (Dataset | Sequence[dict] | None, default: None ) –Optional evaluation dataset

Returns:

  • dict[str, float] –Final training metrics
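
A minimal usage sketch with an in-memory dataset (the "text" row key is an assumption for illustration, not a documented schema):

config = SFTConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=200)
trainer = SFTTrainer(config)

# Row schema assumed for illustration; the trainer may expect different keys.
dataset = [
    {"text": "### Question\nWhat is sequence packing?\n### Answer\nConcatenating short examples ..."},
]

metrics = trainer.train(dataset)
trainer.save_checkpoint()
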
TinkerSFTConfig(
base_model: str = "meta-llama/Llama-3.1-8B-Instruct",
base_url: str | None = None,
lora_rank: int = 16,
data_dir: str = "data",
train_split: str = "train",
eval_split: str | None = "test",
max_train_examples: int | None = None,
max_eval_examples: int | None = None,
max_sequence_length: int = 2048,
batch_size: int = 16,
gradient_accumulation_steps: int = 1,
learning_rate: float = 0.0001,
steps: int = 100,
checkpoint_interval: int = 10,
adam_beta1: float = 0.9,
adam_beta2: float = 0.95,
adam_eps: float = 1e-08,
sample_prompt: str = "",
max_new_tokens: int = 64,
temperature: float = 0.0,
num_samples: int = 4,
skip_sample: bool = False,
project: str | None = None,
run_name: str | None = None,
tags: list[str] = (
lambda: ["training", "sft", "tinker"]
)(),
seed: int = 0,
)

Configuration for Tinker-based supervised fine-tuning.

This configuration is used to set up LoRA-based SFT training with the Tinker framework.

Example

config = TinkerSFTConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    learning_rate=1e-4,
    steps=100,
    lora_rank=16,
)

adam_beta1: float = 0.9

Adam beta1 parameter.

adam_beta2: float = 0.95

Adam beta2 parameter.

adam_eps: float = 1e-08

Adam epsilon parameter.

base_model: str = 'meta-llama/Llama-3.1-8B-Instruct'

Model name or path for the base model to fine-tune.

base_url: str | None = None

Tinker service URL. If None, uses default from environment.

batch_size: int = 16

Number of sequences per training step.

checkpoint_interval: int = 10

Save checkpoint every N training steps.

data_dir: str = 'data'

Directory containing parquet dataset files.

eval_split: str | None = 'test'

Prefix for evaluation data files. Set to None to skip eval.

gradient_accumulation_steps: int = 1

Number of micro-batches to accumulate before each optimizer step.

learning_rate: float = 0.0001

Adam optimizer learning rate.

lora_rank: int = 16

LoRA rank parameter for adapter training.

max_eval_examples: int | None = None

Maximum number of evaluation examples. None for all.

max_new_tokens: int = 64

Maximum new tokens when sampling.

max_sequence_length: int = 2048

Maximum sequence length for tokenization (truncates from left).

max_train_examples: int | None = None

Maximum number of training examples. None for all.

num_samples: int = 4

Number of samples to generate after training.

project: str | None = None

Dreadnode project name for logging.

run_name: str | None = None

Dreadnode run name.

sample_prompt: str = ''

Prompt used for sampling after training.

seed: int = 0

Random seed for batch selection.

skip_sample: bool = False

Skip sampling after training checkpoints.

steps: int = 100

Total number of training steps.

tags: list[str] = field(
default_factory=lambda: ["training", "sft", "tinker"]
)

Tags for the Dreadnode run.

temperature: float = 0.0

Sampling temperature (0.0 for greedy).

train_split: str = 'train'

Prefix for training data files (e.g., ‘train_*.parquet’).

__post_init__() -> None

Validate configuration after initialization.

TinkerSFTTrainer(
config: TinkerSFTConfig,
training_client: TrainingClient | None = None,
service_client: ServiceClient | None = None,
callbacks: Sequence[TrainingCallback] | None = None,
)

Trainer for supervised fine-tuning using Tinker with LoRA.

This trainer provides:

  • LoRA-based fine-tuning via Tinker service
  • Checkpoint saving and artifact logging
  • Optional sampling after training
  • Integration with Dreadnode for experiment tracking

Example

config = TinkerSFTConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    steps=100,
    lora_rank=16,
)

trainer = TinkerSFTTrainer(config)

state = trainer.train(train_data)
print(f"Final loss: {state.losses[-1]:.4f}")

Initialize the Tinker SFT trainer.

Parameters:

  • config (TinkerSFTConfig) –Training configuration.
  • training_client (TrainingClient | None, default: None ) –Optional pre-initialized Tinker training client.
  • service_client (ServiceClient | None, default: None ) –Optional pre-initialized Tinker service client.
  • callbacks (Sequence[TrainingCallback] | None, default: None ) –Optional list of training callbacks.
renderer: Any

Get the model-specific renderer (initializes clients if needed).

service_client: ServiceClient

Get the service client (initializes clients if needed).

tokenizer: Any

Get the tokenizer (initializes clients if needed).

training_client: TrainingClient

Get the training client (initializes clients if needed).

add_callback(callback: TrainingCallback) -> None

Add a training callback.

evaluate(
eval_data: list[Datum],
step: int = 0,
log_to_dreadnode: bool = True,
) -> float

Run evaluation on the provided data.

Parameters:

  • eval_data (list[Datum]) –Evaluation data as Tinker Datum objects.
  • step (int, default: 0 ) –Current training step (for logging).
  • log_to_dreadnode (bool, default: True ) –Whether to log metrics to Dreadnode.

Returns:

  • float –Evaluation loss.
sample() -> list[dict[str, str]]

Generate samples from the fine-tuned model.

Returns:

  • list[dict[str, str]] –List of sample dictionaries with ‘prompt’ and ‘completion’ keys.
save_checkpoint(name: str | None = None) -> str

Save the current model weights as a checkpoint.

Parameters:

  • name (str | None, default: None ) –Optional checkpoint name.

Returns:

  • str –Path to the saved checkpoint.
train(
train_data: list[Datum],
eval_data: list[Datum] | None = None,
log_to_dreadnode: bool = True,
) -> TrainingState

Run supervised fine-tuning.

Parameters:

  • train_data (list[Datum]) –Training data as Tinker Datum objects.
  • eval_data (list[Datum] | None, default: None ) –Optional evaluation data.
  • log_to_dreadnode (bool, default: True ) –Whether to log metrics to Dreadnode.

Returns:

  • TrainingState –Final training state.

Raises:

  • ValueError –If training data is empty.

One base model available for hosted training jobs.

Optional upstream pricing metadata.

All values are USD per million tokens. None means “not published” — callers should fall back to the live Tinker console for authoritative numbers (pricing changes faster than we can update the SDK).

VerificationResult(
passed: bool,
score: float,
metrics: dict[str, Any] = dict(),
)

Outcome of grading a rollout against a task’s verification config.

Attributes:

  • passed (bool) –Whether the task was considered solved.
  • score (float) –Scalar in [0, 1]. For binary env_flag / env_script this is 1.0 on pass and 0.0 on fail. For llm_judge this is the judge’s rubric score.
  • metrics (dict[str, Any]) –Free-form metadata attached to traces and training metrics (method, exit_code, judge reason and attributes, …).
__getattr__(name: str) -> t.Any

Lazy load training components to avoid importing torch/ray at module load.

batched_environments(
envs: list[TaskEnvironment],
*,
max_concurrent_setup: int = 32,
) -> AsyncIterator[list[TaskEnvironment]]

Provision a batch of envs in parallel; tear them all down on exit.

Caps concurrent setup via a semaphore so a 64-rollout RL step doesn’t pummel the sandbox provider at batch boundaries. Envs that fail setup() are logged and excluded from the yielded list; their teardown() is not called (nothing to tear down). Envs that succeeded setup are always torn down on exit — even if the caller raises inside the async with block.

Parameters:

  • envs (list[TaskEnvironment]) –Pre-constructed TaskEnvironment instances. They must not already be set up (setup() is called by this context manager).
  • max_concurrent_setup (int, default: 32 ) –Maximum concurrent setup() calls. Defaults to 32; tune down under tight provider quota.

Yields:

  • AsyncIterator[list[TaskEnvironment]] –The live envs (those that succeeded setup()), in the input order, with failed envs skipped.

Example::

envs = [
TaskEnvironment(api_client=api, org=ORG, workspace=WS,
task_ref="pwn/flag", inputs=row.get("inputs"))
for row in batch_rows
]
async with batched_environments(envs, max_concurrent_setup=8) as live:
rewards = await asyncio.gather(*[score(env) for env in live])
run_in_sandbox(
code: str,
timeout_seconds: int = 300,
memory_mb: int = 2048,
) -> dict

Run code in a Prime Intellect sandbox.

Sandboxes are lightweight execution environments for running AI-generated code or quick experiments.

Parameters:

  • code (str) –Python code to execute.
  • timeout_seconds (int, default: 300 ) –Execution timeout.
  • memory_mb (int, default: 2048 ) –Memory limit in MB.

Returns:

  • dict –Dict with stdout, stderr, and return_code.

Example

result = await run_in_sandbox('''
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
''')
print(result["stdout"])

train_dpo(
config_dict: dict[str, Any], prompts: list[str]
) -> t.Any

Train with DPO.

train_grpo(
config_dict: dict[str, Any],
prompts: list[str],
reward_fn: Callable[..., Any],
) -> t.Any

Train with GRPO.

train_on_prime(
config: dict[str, Any] | None = None,
name: str | None = None,
gpu_type: str = "H100_80GB",
gpu_count: int = 1,
training_type: str = "sft",
requirements: list[str] | None = None,
env_vars: dict[str, str] | None = None,
auto_terminate: bool = True,
region: str | None = None,
interruptible: bool = False,
) -> TrainingResult

Run training on Prime Intellect infrastructure.

This function provides a high-level interface for running training jobs on Prime’s decentralized GPU compute.

Parameters:

  • config (dict[str, Any] | None, default: None ) –Training configuration dict. Common options:
    • model_name: Model name or path
    • max_steps: Maximum training steps
    • batch_size: Batch size per device
    • learning_rate: Learning rate
    • checkpoint_dir: Checkpoint directory
  • name (str | None, default: None ) –Job name.
  • gpu_type (str, default: 'H100_80GB' ) –GPU type (H100_80GB, A100_80GB, etc.).
  • gpu_count (int, default: 1 ) –Number of GPUs.
  • training_type (str, default: 'sft' ) –Type of training (sft, grpo, dpo, ppo).
  • requirements (list[str] | None, default: None ) –Additional Python requirements.
  • env_vars (dict[str, str] | None, default: None ) –Environment variables.
  • auto_terminate (bool, default: True ) –Terminate pods after training.
  • region (str | None, default: None ) –Preferred region.
  • interruptible (bool, default: False ) –Use spot/interruptible instances.

Returns:

  • TrainingResult –TrainingResult with final state and checkpoint info.

Example

result = await train_on_prime(
    config={
        "model_name": "meta-llama/Llama-3.1-8B-Instruct",
        "max_steps": 1000,
        "batch_size": 32,
    },
    gpu_type="H100_80GB",
    gpu_count=8,
)

if result.succeeded:
    print(f"Checkpoint: {result.checkpoint_path}")

train_ppo(
config_dict: dict[str, Any],
prompts: list[str],
reward_fn: Callable[..., Any],
) -> t.Any

Train with PPO.

train_sft(
config_dict: dict[str, Any], prompts: list[str]
) -> t.Any

Train with SFT.

train_tinker_sft(
config: dict[str, Any] | None = None,
messages: Sequence[list[dict[str, str]]] | None = None,
examples: Sequence[tuple[str, str]] | None = None,
data_dir: str | None = None,
project: str | None = None,
run_name: str | None = None,
tags: list[str] | None = None,
log_to_dreadnode: bool = True,
) -> TrainingState

Train a model using Tinker SFT.

This function provides a high-level interface for supervised fine-tuning using the Tinker framework. Data can be provided in multiple formats:

  • Conversation messages (list of message dicts)
  • Simple examples (input/output pairs)
  • Parquet files in a data directory

Parameters:

  • config (dict[str, Any] | None, default: None ) –Training configuration dict. See TinkerSFTConfig for options.
  • messages (Sequence[list[dict[str, str]]] | None, default: None ) –List of conversations, each a list of message dicts with ‘role’ and ‘content’ keys.
  • examples (Sequence[tuple[str, str]] | None, default: None ) –List of (input, output) tuples for simple supervised learning.
  • data_dir (str | None, default: None ) –Directory containing parquet files with training data.
  • project (str | None, default: None ) –Dreadnode project name.
  • run_name (str | None, default: None ) –Dreadnode run name.
  • tags (list[str] | None, default: None ) –Tags for the Dreadnode run.
  • log_to_dreadnode (bool, default: True ) –Whether to log to Dreadnode (default: True).

Returns:

  • TrainingState –TrainingState with training metrics and checkpoint paths.

Raises:

  • ValueError –If no data source is provided.
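
A minimal usage sketch with conversation-style data (the conversation content and project name are illustrative):

config = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "steps": 100,
    "lora_rank": 16,
}

messages = [
    [
        {"role": "user", "content": "What does LoRA rank control?"},
        {"role": "assistant", "content": "The dimensionality of the low-rank adapter matrices."},
    ],
]

# Shown as a direct call; await it instead if your build exposes it as a coroutine.
state = train_tinker_sft(config=config, messages=messages, project="my-project")
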
verify_env_state(
env: TaskEnvironment,
trajectory: Trajectory | None,
verification: dict[str, Any] | None,
*,
judge_context: dict[str, Any] | None = None,
) -> VerificationResult

Grade the rollout against the task’s verification config.

Supports three dispatch keys on the verification dict:

  • env_flag — read a file from the env sandbox; compare against a sha256 hash (hash) or plaintext expected value.
  • env_script — execute a script inside the env; pass iff the exit code matches expected_exit_code (default 0).
  • llm_judge — score the trajectory with dreadnode.agents.AgentJudge against a rubric; pass iff the score clears passing_threshold.

Parameters:

  • env (TaskEnvironment) –A provisioned TaskEnvironment with execute() available.
  • trajectory (Trajectory | None) –The agent’s rollout. Required for llm_judge; ignored by env_flag / env_script. Pass None for single-shot recipes that don’t produce a trajectory.
  • verification (dict[str, Any] | None) –The task’s verification config (typically from env.task_verification). None or missing method raises ValueError.
  • judge_context (dict[str, Any] | None, default: None ) –Optional context passed through to AgentJudge.evaluate when method=llm_judge. Good for task instruction / env state.

Returns:

  • VerificationResult –The VerificationResult for the rollout.

Raises:

  • ValueError –if verification is missing, method is unknown, or the chosen method’s required fields are absent.
  • RuntimeError –if env_flag / env_script invocation is attempted against an un-provisioned env (caller must setup() first).
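
Illustrative shapes for the three verification configs. The method, hash, expected_exit_code, and passing_threshold keys come from the description above; the remaining key names (path, script, rubric) are assumptions for this sketch:

# env_flag: read a file from the sandbox and compare against a sha256 hash (or plaintext value).
flag_check = {"method": "env_flag", "path": "/flag.txt", "hash": "e3b0c44298fc1c14..."}

# env_script: run a script inside the env; pass iff the exit code matches (default 0).
script_check = {"method": "env_script", "script": "test -f /tmp/solved", "expected_exit_code": 0}

# llm_judge: score the trajectory against a rubric; pass iff the score clears the threshold.
judge_check = {"method": "llm_judge", "rubric": "Did the agent capture the flag?", "passing_threshold": 0.8}

# Awaiting is assumed here to match the async environment helpers above.
result = await verify_env_state(env, trajectory=None, verification=flag_check)
assert isinstance(result, VerificationResult)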