
dreadnode.training

API reference for the dreadnode.training module.

Training module with lazy imports for heavy dependencies.

This module uses lazy loading to avoid importing torch/ray unless needed. Heavy dependencies (torch, ray, transformers, vllm) are only loaded when the user actually accesses training-related classes.
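
A rough illustration of the pattern: a module-level __getattr__ (PEP 562) resolves heavy attributes on first access. The attribute-to-submodule mapping below is hypothetical, not the module's actual table:

import importlib
import typing as t

# Hypothetical mapping; the real module resolves its own trainer classes.
_LAZY_ATTRS = {
    "DPOTrainer": ".dpo",
    "PPOTrainer": ".ppo",
    "SFTTrainer": ".sft",
}

def __getattr__(name: str) -> t.Any:
    if name in _LAZY_ATTRS:
        submodule = importlib.import_module(_LAZY_ATTRS[name], __package__)
        return getattr(submodule, name)
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")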

AsyncRayGRPOTrainer(config: RayGRPOConfig)

Async Ray-based GRPO trainer.

Uses separate GPUs for inference and training to overlap computation:

  • GPU 0: vLLM inference (generates batches continuously)
  • GPU 1: Training (processes batches as they arrive)

This achieves much higher throughput than the colocated version.

Requires at least 2 GPUs.

shutdown() -> None

Shutdown workers.

train(
prompts: Sequence[str],
reward_fn: RewardFn,
num_steps: int | None = None,
) -> TrainingState

Run async GRPO training.

Overlaps inference and training for maximum throughput.
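
A minimal usage sketch, assuming at least two GPUs are available (the reward logic and step count are illustrative):

config = RayGRPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct")
trainer = AsyncRayGRPOTrainer(config)

def reward_fn(prompts, completions):
    # Illustrative reward: prefer concise completions.
    return [1.0 / (1.0 + len(c)) for c in completions]

try:
    state = trainer.train(prompts=["Explain GRPO in one sentence."], reward_fn=reward_fn, num_steps=10)
finally:
    trainer.shutdown()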

DPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
beta: float = 0.1,
label_smoothing: float = 0.0,
loss_type: str = "sigmoid",
max_seq_length: int = 2048,
max_prompt_length: int = 512,
learning_rate: float = 5e-07,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 1,
batch_size: int = 4,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
ref_model_offload: bool = True,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for DPO training.

batch_size: int = 4

Batch size per device.

beta: float = 0.1

Temperature parameter for DPO loss. Higher = more conservative updates.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 4

Gradient accumulation steps.

label_smoothing: float = 0.0

Label smoothing for DPO loss (0 = no smoothing).

learning_rate: float = 5e-07

Learning rate (DPO typically uses lower LR than SFT).

log_interval: int = 10

Steps between logging.

loss_type: str = 'sigmoid'

Loss type: ‘sigmoid’ (standard DPO), ‘hinge’, ‘ipo’.
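
A rough sketch of how the variants differ, written over a single pair's policy/reference log-ratios (a pure-Python illustration under the standard DPO/IPO formulations, not the trainer's implementation):

import math

def _log_sigmoid(z: float) -> float:
    return -math.log1p(math.exp(-z))

def dpo_pair_loss(logratio_chosen, logratio_rejected, beta=0.1,
                  loss_type="sigmoid", label_smoothing=0.0):
    # logratio_* = log pi_policy(y|x) - log pi_ref(y|x) for one preference pair.
    diff = logratio_chosen - logratio_rejected
    if loss_type == "sigmoid":   # standard DPO, optionally label-smoothed
        return (-(1 - label_smoothing) * _log_sigmoid(beta * diff)
                - label_smoothing * _log_sigmoid(-beta * diff))
    if loss_type == "hinge":
        return max(0.0, 1.0 - beta * diff)
    if loss_type == "ipo":       # IPO: drive diff toward 1 / (2 * beta)
        return (diff - 1.0 / (2 * beta)) ** 2
    raise ValueError(f"unknown loss_type: {loss_type}")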

max_epochs: int = 1

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_prompt_length: int = 512

Maximum prompt length.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

ref_model_offload: bool = True

Keep reference model on CPU to save GPU memory.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

DPOTrainer(
config: DPOConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

DPO (Direct Preference Optimization) trainer.

DPO directly optimizes the policy on preference pairs without needing a separate reward model or a PPO loop, which makes it much simpler than classic PPO-based RLHF.

The training process:

  1. Load policy model and frozen reference model
  2. For each preference pair (chosen, rejected):
     • Compute log probabilities for both under policy and reference
     • Compute DPO loss to prefer chosen over rejected
  3. Update policy via gradient descent

Attributes:

  • config –DPO configuration
  • model –Training policy model
  • ref_model –Frozen reference model
  • tokenizer –Tokenizer

Initialize DPO trainer.

Parameters:

  • config (DPOConfig) –DPO configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
get_model() -> nn.Module

Get the trained model.

save_checkpoint() -> None

Save training checkpoint.

train(
dataset: Dataset | list[PreferencePair] | list[dict],
) -> dict[str, float]

Run DPO training.

Parameters:

  • dataset (Dataset | list[PreferencePair] | list[dict]) –Training dataset with preference pairs. Each item should have ‘prompt’, ‘chosen’, ‘rejected’ keys.

Returns:

  • dict[str, float] –Final training metrics
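
A minimal usage sketch with an in-memory list of preference dicts (the pair contents and step count are illustrative):

config = DPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=200)
trainer = DPOTrainer(config)

dataset = [
    {
        "prompt": "What is 2 + 2?",
        "chosen": "2 + 2 = 4.",
        "rejected": "2 + 2 = 5.",
    },
]

metrics = trainer.train(dataset)
model = trainer.get_model()
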
PPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
reward_model_name: str | None = None,
clip_ratio: float = 0.2,
value_clip_ratio: float = 0.2,
kl_coef: float = 0.1,
kl_target: float | None = 0.01,
entropy_coef: float = 0.01,
gamma: float = 1.0,
gae_lambda: float = 0.95,
max_seq_length: int = 2048,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
learning_rate: float = 1e-06,
critic_lr: float = 1e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
batch_size: int = 8,
mini_batch_size: int = 4,
ppo_epochs: int = 4,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
ref_model_offload: bool = True,
share_critic: bool = False,
critic_warmup_steps: int = 0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for PPO training.

batch_size: int = 8

Prompts per batch.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

clip_ratio: float = 0.2

PPO clipping ratio (epsilon).

critic_lr: float = 1e-05

Learning rate for value function (typically higher than policy).

critic_warmup_steps: int = 0

Pretrain critic for N steps before PPO (0 = no warmup).

entropy_coef: float = 0.01

Entropy bonus coefficient.

gae_lambda: float = 0.95

GAE lambda for advantage estimation.

gamma: float = 1.0

Discount factor (1.0 for episodic tasks like text generation).

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

kl_coef: float = 0.1

KL penalty coefficient.

kl_target: float | None = 0.01

Target KL divergence. If exceeded, KL coef is increased.

learning_rate: float = 1e-06

Learning rate for policy.

log_interval: int = 10

Steps between logging.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_new_tokens: int = 512

Maximum new tokens to generate.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

mini_batch_size: int = 4

Mini-batch size for PPO updates.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Policy model name or path.

ppo_epochs: int = 4

Number of PPO epochs per batch of experience.

ref_model_offload: bool = True

Keep reference model on CPU to save GPU memory.

reward_model_name: str | None = None

Reward model name or path. If None, must provide reward_fn to train().

seed: int = 42

Random seed.

share_critic: bool = False

Share weights between policy and critic (adds value head to policy).

temperature: float = 0.7

Sampling temperature.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

top_p: float = 0.9

Top-p sampling.

trust_remote_code: bool = True

Trust remote code in model repository.

value_clip_ratio: float = 0.2

Value function clipping ratio.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

PPOTrainer(
config: PPOConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

PPO (Proximal Policy Optimization) trainer for RLHF.

Implements the full PPO algorithm with:

  • Policy network (actor)
  • Value network (critic)
  • GAE advantage estimation
  • Clipped surrogate objective
  • KL penalty and adaptive KL coefficient

The training loop:

  1. Generate responses from current policy
  2. Compute rewards using reward model/function
  3. Estimate advantages with GAE
  4. Update policy and value networks with PPO
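
A compressed sketch of steps 3 and 4 above: GAE over per-token rewards and values, then the clipped surrogate objective (a pure-Python illustration of the math, not the trainer's internals):

def gae(rewards, values, gamma=1.0, lam=0.95):
    # values carries one extra bootstrap entry: len(values) == len(rewards) + 1
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def clipped_surrogate(ratio, advantage, clip_ratio=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); returns the negated PPO objective
    clipped = max(min(ratio, 1 + clip_ratio), 1 - clip_ratio)
    return -min(ratio * advantage, clipped * advantage)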

Attributes:

  • config –PPO configuration
  • policy –Policy (actor) model
  • critic –Value (critic) model
  • ref_model –Frozen reference model for KL penalty
  • tokenizer –Tokenizer

Initialize PPO trainer.

Parameters:

  • config (PPOConfig) –PPO configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
get_policy() -> nn.Module

Get the trained policy model.

save_checkpoint() -> None

Save training checkpoint.

train(
prompts: list[str],
reward_fn: Callable[[list[str], list[str]], list[float]]
| None = None,
) -> dict[str, float]

Run PPO training.

Parameters:

  • prompts (list[str]) –List of training prompts
  • reward_fn (Callable[[list[str], list[str]], list[float]] | None, default: None ) –Optional reward function (prompts, completions) -> rewards. Required if reward_model_name not set in config.

Returns:

  • dict[str, float] –Final training metrics
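
A minimal usage sketch with a callable reward function (the prompts and reward logic are illustrative):

config = PPOConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=500)
trainer = PPOTrainer(config)

train_prompts = [
    "Explain KL divergence in one sentence.",
    "Summarize PPO in one sentence.",
]

def reward_fn(prompts, completions):
    # Illustrative reward: prefer completions that end with a period.
    return [1.0 if c.strip().endswith(".") else 0.0 for c in completions]

metrics = trainer.train(prompts=train_prompts, reward_fn=reward_fn)
policy = trainer.get_policy()
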
RMConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
value_head_hidden_size: int | None = None,
value_head_dropout: float = 0.1,
pooling: str = "last",
max_seq_length: int = 2048,
max_prompt_length: int = 512,
learning_rate: float = 1e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 3,
batch_size: int = 4,
gradient_accumulation_steps: int = 4,
max_grad_norm: float = 1.0,
margin: float = 0.0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for Reward Model training.

batch_size: int = 4

Batch size per device.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 4

Gradient accumulation steps.

learning_rate: float = 1e-05

Learning rate.

log_interval: int = 10

Steps between logging.

margin: float = 0.0

Margin for Bradley-Terry loss (0 = no margin).
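
A sketch of where the margin enters a Bradley-Terry pairwise loss (an illustration of the standard formulation, not the trainer's code):

import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float, margin: float = 0.0) -> float:
    # -log sigmoid(r_chosen - r_rejected - margin); a positive margin requires
    # the chosen reward to beat the rejected reward by at least `margin`.
    return math.log1p(math.exp(-(reward_chosen - reward_rejected - margin)))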

max_epochs: int = 3

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_prompt_length: int = 512

Maximum prompt length.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Base model name or path.

pooling: str = 'last'

Pooling method: ‘last’ (last non-pad token), ‘mean’, ‘max’.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

value_head_dropout: float = 0.1

Dropout for value head.

value_head_hidden_size: int | None = None

Hidden size for value head. None = match model hidden size.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

RayGRPOConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
num_prompts_per_step: int = 8,
num_generations_per_prompt: int = 4,
max_steps: int = 1000,
max_epochs: int = 10,
max_new_tokens: int = 512,
temperature: float = 0.7,
top_p: float = 0.9,
learning_rate: float = 1e-06,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
log_interval: int = 10,
eval_interval: int = 100,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
vllm: VLLMConfig = VLLMConfig(),
training: TrainingConfig = TrainingConfig(),
loss: GRPOLossConfig = GRPOLossConfig(),
)

Complete configuration for Ray-based GRPO training.

This configuration controls all aspects of GRPO training:

  • Model and tokenizer
  • Generation (vLLM)
  • Training (DeepSpeed/FSDP)
  • GRPO algorithm parameters
checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

eval_interval: int = 100

Steps between evaluation.

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

learning_rate: float = 1e-06

Learning rate.

log_interval: int = 10

Steps between logging.

loss: GRPOLossConfig = field(default_factory=GRPOLossConfig)

GRPO loss configuration.

max_epochs: int = 10

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm for clipping.

max_new_tokens: int = 512

Maximum tokens to generate per completion.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

num_generations_per_prompt: int = 4

Number of completions to generate per prompt (G in GRPO).

num_prompts_per_step: int = 8

Number of unique prompts per training step.

seed: int = 42

Random seed for reproducibility.

temperature: float = 0.7

Sampling temperature.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

top_p: float = 0.9

Top-p (nucleus) sampling.

train_batch_size: int

Total batch size for training.

training: TrainingConfig = field(
default_factory=TrainingConfig
)

Distributed training configuration.

vllm: VLLMConfig = field(default_factory=VLLMConfig)

vLLM inference configuration.

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

to_dict() -> dict[str, Any]

Convert to dictionary for serialization.

RayGRPOTrainer(
config: RayGRPOConfig,
colocate: bool = False,
storage: Storage | None = None,
checkpoint_name: str | None = None,
callbacks: list[TrainerCallback] | None = None,
)

Native Ray-based GRPO trainer with colocated inference/training.

Supports two modes:

  1. Memory-efficient mode (default): Time-shares GPU between vLLM and training
     • Lower memory, but slower due to model loading/unloading
  2. Fast mode (colocate=True): Keeps both models loaded
     • Higher memory usage, but much faster (no reload overhead)
     • Uses in-place vLLM weight updates

Example

config = RayGRPOConfig(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    num_generations_per_prompt=4,
)
trainer = RayGRPOTrainer(config, colocate=True)  # Fast mode

def reward_fn(prompts, completions):
    return [1.0 if is_correct(c) else 0.0 for c in completions]

trainer.train(prompts, reward_fn)

Initialize GRPO trainer.

Parameters:

  • config (RayGRPOConfig) –GRPO configuration.
  • colocate (bool, default: False ) –If True, keep both vLLM and training model loaded (faster but more memory).
  • storage (Storage | None, default: None ) –Optional Storage for CAS-based checkpointing.
  • checkpoint_name (str | None, default: None ) –Name for checkpoints (defaults to sanitized model name).
  • callbacks (list[TrainerCallback] | None, default: None ) –List of TrainerCallback instances for customizing training behavior.
add_callback(callback: TrainerCallback) -> None

Add a callback to the trainer.

remove_callback(callback_type: type) -> None

Remove all callbacks of a given type.

save_checkpoint_to_storage(
version: str | None = None,
) -> LocalModel | None

Public method to save checkpoint to CAS.

Parameters:

  • version (str | None, default: None ) –Version string. If None, auto-increments.

Returns:

  • LocalModel | None –LocalModel instance if storage is configured, None otherwise.
shutdown() -> None

Shutdown trainer.

train(
prompts: Sequence[str],
reward_fn: RewardFn,
eval_prompts: Sequence[str] | None = None,
num_steps: int | None = None,
) -> TrainingState

Run GRPO training.

Parameters:

  • prompts (Sequence[str]) –Training prompts.
  • reward_fn (RewardFn) –Function to score completions.
  • eval_prompts (Sequence[str] | None, default: None ) –Optional evaluation prompts.
  • num_steps (int | None, default: None ) –Optional number of steps (overrides config).

Returns:

  • TrainingState –Final training state.
RewardModelTrainer(
config: RMConfig,
fsdp_config: FSDP2Config | None = None,
storage: Storage | None = None,
checkpoint_name: str | None = None,
)

Reward Model trainer using Bradley-Terry loss.

Trains a model to predict scalar rewards from preference pairs. The trained model can then be used in RLHF pipelines (PPO, GRPO, etc.).

Attributes:

  • config –Reward model configuration
  • model –The reward model (base LLM + value head)
  • tokenizer –Tokenizer

Initialize Reward Model trainer.

Parameters:

  • config (RMConfig) –Reward model configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
  • storage (Storage | None, default: None ) –Optional storage for CAS checkpointing
  • checkpoint_name (str | None, default: None ) –Name for checkpoints
compute_rewards(
texts: list[str], batch_size: int = 8
) -> list[float]

Compute rewards for a list of texts.

Parameters:

  • texts (list[str]) –List of text sequences
  • batch_size (int, default: 8 ) –Batch size for inference

Returns:

  • list[float] –List of scalar rewards
get_model() -> RewardModel

Get the trained reward model.

get_reward_fn() -> callable

Get a reward function for use with GRPO/PPO.

Returns:

  • callable –A callable that takes texts and returns rewards
save_checkpoint() -> None

Save training checkpoint.

train(dataset: Dataset | list[dict]) -> dict[str, float]

Run reward model training.

Parameters:

  • dataset (Dataset | list[dict]) –Training dataset with preference pairs. Each item should have ‘prompt’, ‘chosen’, ‘rejected’ keys.

Returns:

  • dict[str, float] –Final training metrics
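
A minimal end-to-end sketch: train on preference dicts, then hand the resulting reward function to an RL trainer (dataset contents are illustrative):

rm_config = RMConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=500)
rm_trainer = RewardModelTrainer(rm_config)

pairs = [
    {
        "prompt": "Summarize GAE.",
        "chosen": "GAE blends multi-step TD errors to trade off bias and variance.",
        "rejected": "GAE is a type of optimizer.",
    },
]

rm_trainer.train(pairs)
reward_fn = rm_trainer.get_reward_fn()            # hand this to a GRPO/PPO trainer
scores = rm_trainer.compute_rewards(["Some candidate response."])
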
SFTConfig(
model_name: str = "Qwen/Qwen2.5-1.5B-Instruct",
tokenizer_name: str | None = None,
max_seq_length: int = 2048,
use_packing: bool = True,
packing_efficiency_threshold: float = 0.9,
learning_rate: float = 2e-05,
weight_decay: float = 0.01,
warmup_ratio: float = 0.1,
max_steps: int = 1000,
max_epochs: int = 3,
batch_size: int = 4,
gradient_accumulation_steps: int = 1,
max_grad_norm: float = 1.0,
log_interval: int = 10,
checkpoint_interval: int = 100,
checkpoint_dir: str = "./checkpoints",
seed: int = 42,
trust_remote_code: bool = True,
)

Configuration for SFT training.

batch_size: int = 4

Batch size per device.

checkpoint_dir: str = './checkpoints'

Directory for checkpoints.

checkpoint_interval: int = 100

Steps between checkpoints.

gradient_accumulation_steps: int = 1

Gradient accumulation steps.

learning_rate: float = 2e-05

Learning rate.

log_interval: int = 10

Steps between logging.

max_epochs: int = 3

Maximum training epochs.

max_grad_norm: float = 1.0

Maximum gradient norm.

max_seq_length: int = 2048

Maximum sequence length.

max_steps: int = 1000

Maximum training steps.

model_name: str = 'Qwen/Qwen2.5-1.5B-Instruct'

Model name or path.

packing_efficiency_threshold: float = 0.9

Minimum packing efficiency before padding.

seed: int = 42

Random seed.

tokenizer_name: str | None = None

Tokenizer name (defaults to model_name).

trust_remote_code: bool = True

Trust remote code in model repository.

use_packing: bool = True

Enable sequence packing for efficiency.
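
A rough sketch of the idea behind sequence packing: concatenate tokenized examples into fixed-length rows instead of padding each example individually (illustration only; it ignores attention-mask and boundary handling):

def pack_sequences(tokenized, max_seq_length=2048):
    # tokenized: list of token-id lists, each already <= max_seq_length
    packs, current = [], []
    for ids in tokenized:
        if len(current) + len(ids) > max_seq_length:
            packs.append(current)
            current = []
        current.extend(ids)
    if current:
        packs.append(current)
    return packs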

warmup_ratio: float = 0.1

Warmup steps as fraction of total.

weight_decay: float = 0.01

Weight decay.

SFTTrainer(
config: SFTConfig,
fsdp_config: FSDP2Config | None = None,
)

SFT trainer with sequence packing and FSDP2 support.

Features:

  • Sequence packing for efficient training
  • FSDP2 distributed training
  • Gradient accumulation
  • Mixed precision (bf16)
  • Checkpointing

Initialize SFT trainer.

Parameters:

  • config (SFTConfig) –SFT configuration
  • fsdp_config (FSDP2Config | None, default: None ) –Optional FSDP2 configuration
load_checkpoint(path: str) -> None

Load training checkpoint.

save_checkpoint() -> None

Save training checkpoint.

train(
dataset: Dataset | Sequence[dict],
eval_dataset: Dataset | Sequence[dict] | None = None,
) -> dict[str, float]

Run SFT training.

Parameters:

  • dataset (Dataset | Sequence[dict]) –Training dataset
  • eval_dataset (Dataset | Sequence[dict] | None, default: None ) –Optional evaluation dataset

Returns:

  • dict[str, float] –Final training metrics
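
A minimal usage sketch with an in-memory dataset (the "text" row key is an assumption for illustration, not a documented schema):

config = SFTConfig(model_name="Qwen/Qwen2.5-1.5B-Instruct", max_steps=200)
trainer = SFTTrainer(config)

# Row schema assumed for illustration; the trainer may expect different keys.
dataset = [
    {"text": "### Question\nWhat is sequence packing?\n### Answer\nConcatenating short examples ..."},
]

metrics = trainer.train(dataset)
trainer.save_checkpoint()
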
TinkerSFTConfig(
base_model: str = "meta-llama/Llama-3.1-8B-Instruct",
base_url: str | None = None,
lora_rank: int = 16,
data_dir: str = "data",
train_split: str = "train",
eval_split: str | None = "test",
max_train_examples: int | None = None,
max_eval_examples: int | None = None,
max_sequence_length: int = 2048,
batch_size: int = 16,
gradient_accumulation_steps: int = 1,
learning_rate: float = 0.0001,
steps: int = 100,
checkpoint_interval: int = 10,
adam_beta1: float = 0.9,
adam_beta2: float = 0.95,
adam_eps: float = 1e-08,
sample_prompt: str = "",
max_new_tokens: int = 64,
temperature: float = 0.0,
num_samples: int = 4,
skip_sample: bool = False,
project: str | None = None,
run_name: str | None = None,
tags: list[str] = (
lambda: ["training", "sft", "tinker"]
)(),
seed: int = 0,
)

Configuration for Tinker-based supervised fine-tuning.

This configuration is used to set up LoRA-based SFT training with the Tinker framework.

Example

config = TinkerSFTConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    learning_rate=1e-4,
    steps=100,
    lora_rank=16,
)

adam_beta1: float = 0.9

Adam beta1 parameter.

adam_beta2: float = 0.95

Adam beta2 parameter.

adam_eps: float = 1e-08

Adam epsilon parameter.

base_model: str = 'meta-llama/Llama-3.1-8B-Instruct'

Model name or path for the base model to fine-tune.

base_url: str | None = None

Tinker service URL. If None, uses default from environment.

batch_size: int = 16

Number of sequences per training step.

checkpoint_interval: int = 10

Save checkpoint every N training steps.

data_dir: str = 'data'

Directory containing parquet dataset files.

eval_split: str | None = 'test'

Prefix for evaluation data files. Set to None to skip eval.

gradient_accumulation_steps: int = 1

Number of micro-batches to accumulate before each optimizer step.

learning_rate: float = 0.0001

Adam optimizer learning rate.

lora_rank: int = 16

LoRA rank parameter for adapter training.

max_eval_examples: int | None = None

Maximum number of evaluation examples. None for all.

max_new_tokens: int = 64

Maximum new tokens when sampling.

max_sequence_length: int = 2048

Maximum sequence length for tokenization (truncates from left).

max_train_examples: int | None = None

Maximum number of training examples. None for all.

num_samples: int = 4

Number of samples to generate after training.

project: str | None = None

Dreadnode project name for logging.

run_name: str | None = None

Dreadnode run name.

sample_prompt: str = ''

Prompt used for sampling after training.

seed: int = 0

Random seed for batch selection.

skip_sample: bool = False

Skip sampling after training checkpoints.

steps: int = 100

Total number of training steps.

tags: list[str] = field(
default_factory=lambda: ["training", "sft", "tinker"]
)

Tags for the Dreadnode run.

temperature: float = 0.0

Sampling temperature (0.0 for greedy).

train_split: str = 'train'

Prefix for training data files (e.g., ‘train_*.parquet’).

__post_init__() -> None

Validate configuration after initialization.

TinkerSFTTrainer(
config: TinkerSFTConfig,
training_client: TrainingClient | None = None,
service_client: ServiceClient | None = None,
callbacks: Sequence[TrainingCallback] | None = None,
)

Trainer for supervised fine-tuning using Tinker with LoRA.

This trainer provides:

  • LoRA-based fine-tuning via Tinker service
  • Checkpoint saving and artifact logging
  • Optional sampling after training
  • Integration with Dreadnode for experiment tracking

Example

config = TinkerSFTConfig(
    base_model="meta-llama/Llama-3.1-8B-Instruct",
    steps=100,
    lora_rank=16,
)

trainer = TinkerSFTTrainer(config)

state = trainer.train(train_data)
print(f"Final loss: {state.losses[-1]:.4f}")

Initialize the Tinker SFT trainer.

Parameters:

  • config (TinkerSFTConfig) –Training configuration.
  • training_client (TrainingClient | None, default: None ) –Optional pre-initialized Tinker training client.
  • service_client (ServiceClient | None, default: None ) –Optional pre-initialized Tinker service client.
  • callbacks (Sequence[TrainingCallback] | None, default: None ) –Optional list of training callbacks.
renderer: Any

Get the model-specific renderer (initializes clients if needed).

service_client: ServiceClient

Get the service client (initializes clients if needed).

tokenizer: Any

Get the tokenizer (initializes clients if needed).

training_client: TrainingClient

Get the training client (initializes clients if needed).

add_callback(callback: TrainingCallback) -> None

Add a training callback.

evaluate(
eval_data: list[Datum],
step: int = 0,
log_to_dreadnode: bool = True,
) -> float

Run evaluation on the provided data.

Parameters:

  • eval_data (list[Datum]) –Evaluation data as Tinker Datum objects.
  • step (int, default: 0 ) –Current training step (for logging).
  • log_to_dreadnode (bool, default: True ) –Whether to log metrics to Dreadnode.

Returns:

  • float –Evaluation loss.
sample() -> list[dict[str, str]]

Generate samples from the fine-tuned model.

Returns:

  • list[dict[str, str]] –List of sample dictionaries with ‘prompt’ and ‘completion’ keys.
save_checkpoint(name: str | None = None) -> str

Save the current model weights as a checkpoint.

Parameters:

  • name (str | None, default: None ) –Optional checkpoint name.

Returns:

  • str –Path to the saved checkpoint.
train(
train_data: list[Datum],
eval_data: list[Datum] | None = None,
log_to_dreadnode: bool = True,
) -> TrainingState

Run supervised fine-tuning.

Parameters:

  • train_data (list[Datum]) –Training data as Tinker Datum objects.
  • eval_data (list[Datum] | None, default: None ) –Optional evaluation data.
  • log_to_dreadnode (bool, default: True ) –Whether to log metrics to Dreadnode.

Returns:

  • TrainingState –Final training state.

Raises:

  • ValueError –If training data is empty.

One base model available for hosted training jobs.

Optional upstream pricing metadata.

All values are USD per million tokens. None means “not published” — callers should fall back to the live Tinker console for authoritative numbers (pricing changes faster than we can update the SDK).

VerificationResult(
passed: bool,
score: float,
metrics: dict[str, Any] = dict(),
)

Outcome of grading a rollout against a task’s verification config.

Attributes:

  • passed (bool) –Whether the task was considered solved.
  • score (float) –Scalar in [0, 1]. For binary env_flag / env_script this is 1.0 on pass and 0.0 on fail. For llm_judge this is the judge’s rubric score.
  • metrics (dict[str, Any]) –Free-form metadata attached to traces and training metrics (method, exit_code, judge reason and attributes, …).
__getattr__(name: str) -> t.Any

Lazy load training components to avoid importing torch/ray at module load.

batched_environments(
envs: list[TaskEnvironment],
*,
max_concurrent_setup: int = 32,
) -> AsyncIterator[list[TaskEnvironment]]

Provision a batch of envs in parallel; tear them all down on exit.

Caps concurrent setup via a semaphore so a 64-rollout RL step doesn’t pummel the sandbox provider at batch boundaries. Envs that fail setup() are logged and excluded from the yielded list; their teardown() is not called (nothing to tear down). Envs that succeeded setup are always torn down on exit — even if the caller raises inside the async with block.

Parameters:

  • envs (list[TaskEnvironment]) –Pre-constructed TaskEnvironment instances. They must not already be set up (setup() is called by this context manager).
  • max_concurrent_setup (int, default: 32 ) –Maximum concurrent setup() calls. Defaults to 32; tune down under tight provider quota.

Yields:

  • AsyncIterator[list[TaskEnvironment]] –The live envs (those that succeeded setup()), in the input order, with failed envs skipped.

Example::

envs = [
TaskEnvironment(api_client=api, org=ORG, workspace=WS,
task_ref="pwn/flag", inputs=row.get("inputs"))
for row in batch_rows
]
async with batched_environments(envs, max_concurrent_setup=8) as live:
rewards = await asyncio.gather(*[score(env) for env in live])
run_in_sandbox(
code: str,
timeout_seconds: int = 300,
memory_mb: int = 2048,
) -> dict

Run code in a Prime Intellect sandbox.

Sandboxes are lightweight execution environments for running AI-generated code or quick experiments.

Parameters:

  • code (str) –Python code to execute.
  • timeout_seconds (int, default: 300 ) –Execution timeout.
  • memory_mb (int, default: 2048 ) –Memory limit in MB.

Returns:

  • dict –Dict with stdout, stderr, and return_code.

Example

result = await run_in_sandbox('''
import torch
print(f"CUDA available: {torch.cuda.is_available()}")
''')
print(result["stdout"])

train_dpo(
config_dict: dict[str, Any], prompts: list[str]
) -> t.Any

Train with DPO.

train_grpo(
config_dict: dict[str, Any],
prompts: list[str],
reward_fn: Callable[..., Any],
) -> t.Any

Train with GRPO.

train_on_prime(
config: dict[str, Any] | None = None,
name: str | None = None,
gpu_type: str = "H100_80GB",
gpu_count: int = 1,
training_type: str = "sft",
requirements: list[str] | None = None,
env_vars: dict[str, str] | None = None,
auto_terminate: bool = True,
region: str | None = None,
interruptible: bool = False,
) -> TrainingResult

Run training on Prime Intellect infrastructure.

This function provides a high-level interface for running training jobs on Prime’s decentralized GPU compute.

Parameters:

  • config (dict[str, Any] | None, default: None ) –Training configuration dict. Common options:
    • model_name: Model name or path
    • max_steps: Maximum training steps
    • batch_size: Batch size per device
    • learning_rate: Learning rate
    • checkpoint_dir: Checkpoint directory
  • name (str | None, default: None ) –Job name.
  • gpu_type (str, default: 'H100_80GB' ) –GPU type (H100_80GB, A100_80GB, etc.).
  • gpu_count (int, default: 1 ) –Number of GPUs.
  • training_type (str, default: 'sft' ) –Type of training (sft, grpo, dpo, ppo).
  • requirements (list[str] | None, default: None ) –Additional Python requirements.
  • env_vars (dict[str, str] | None, default: None ) –Environment variables.
  • auto_terminate (bool, default: True ) –Terminate pods after training.
  • region (str | None, default: None ) –Preferred region.
  • interruptible (bool, default: False ) –Use spot/interruptible instances.

Returns:

  • TrainingResult –TrainingResult with final state and checkpoint info.

Example

result = await train_on_prime(
    config={
        "model_name": "meta-llama/Llama-3.1-8B-Instruct",
        "max_steps": 1000,
        "batch_size": 32,
    },
    gpu_type="H100_80GB",
    gpu_count=8,
)

if result.succeeded:
    print(f"Checkpoint: {result.checkpoint_path}")

train_ppo(
config_dict: dict[str, Any],
prompts: list[str],
reward_fn: Callable[..., Any],
) -> t.Any

Train with PPO.

train_sft(
config_dict: dict[str, Any], prompts: list[str]
) -> t.Any

Train with SFT.

train_tinker_sft(
config: dict[str, Any] | None = None,
messages: Sequence[list[dict[str, str]]] | None = None,
examples: Sequence[tuple[str, str]] | None = None,
data_dir: str | None = None,
project: str | None = None,
run_name: str | None = None,
tags: list[str] | None = None,
log_to_dreadnode: bool = True,
) -> TrainingState

Train a model using Tinker SFT.

This function provides a high-level interface for supervised fine-tuning using the Tinker framework. Data can be provided in multiple formats:

  • Conversation messages (list of message dicts)
  • Simple examples (input/output pairs)
  • Parquet files in a data directory

Parameters:

  • config (dict[str, Any] | None, default: None ) –Training configuration dict. See TinkerSFTConfig for options.
  • messages (Sequence[list[dict[str, str]]] | None, default: None ) –List of conversations, each a list of message dicts with ‘role’ and ‘content’ keys.
  • examples (Sequence[tuple[str, str]] | None, default: None ) –List of (input, output) tuples for simple supervised learning.
  • data_dir (str | None, default: None ) –Directory containing parquet files with training data.
  • project (str | None, default: None ) –Dreadnode project name.
  • run_name (str | None, default: None ) –Dreadnode run name.
  • tags (list[str] | None, default: None ) –Tags for the Dreadnode run.
  • log_to_dreadnode (bool, default: True ) –Whether to log to Dreadnode (default: True).

Returns:

  • TrainingState –TrainingState with training metrics and checkpoint paths.

Raises:

  • ValueError –If no data source is provided.
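
A minimal usage sketch with conversation-style data (the conversation content and project name are illustrative):

config = {
    "base_model": "meta-llama/Llama-3.1-8B-Instruct",
    "steps": 100,
    "lora_rank": 16,
}

messages = [
    [
        {"role": "user", "content": "What does LoRA rank control?"},
        {"role": "assistant", "content": "The dimensionality of the low-rank adapter matrices."},
    ],
]

# Shown as a direct call; await it instead if your build exposes it as a coroutine.
state = train_tinker_sft(config=config, messages=messages, project="my-project")
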
verify_env_state(
env: TaskEnvironment,
trajectory: Trajectory | None,
verification: dict[str, Any] | None,
*,
judge_context: dict[str, Any] | None = None,
) -> VerificationResult

Grade the rollout against the task’s verification config.

Supports three dispatch keys on the verification dict:

  • env_flag — read a file from the env sandbox; compare against a sha256 hash (hash) or plaintext expected value.
  • env_script — execute a script inside the env; pass iff the exit code matches expected_exit_code (default 0).
  • llm_judge — score the trajectory with dreadnode.agents.AgentJudge against a rubric; pass iff the score clears passing_threshold.

Parameters:

  • env (TaskEnvironment) –A provisioned TaskEnvironment with execute() available.
  • trajectory (Trajectory | None) –The agent’s rollout. Required for llm_judge; ignored by env_flag / env_script. Pass None for single-shot recipes that don’t produce a trajectory.
  • verification (dict[str, Any] | None) –The task’s verification config (typically from env.task_verification). None or missing method raises ValueError.
  • judge_context (dict[str, Any] | None, default: None ) –Optional context passed through to AgentJudge.evaluate when method=llm_judge. Good for task instruction / env state.

Returns:

  • VerificationResult –The VerificationResult for the rollout.

Raises:

  • ValueError –if verification is missing, method is unknown, or the chosen method’s required fields are absent.
  • RuntimeError –if env_flag / env_script invocation is attempted against an un-provisioned env (caller must setup() first).
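
Illustrative shapes for the three verification configs. The method, hash, expected_exit_code, and passing_threshold keys come from the description above; the remaining key names (path, script, rubric) are assumptions for this sketch:

# env_flag: read a file from the sandbox and compare against a sha256 hash (or plaintext value).
flag_check = {"method": "env_flag", "path": "/flag.txt", "hash": "e3b0c44298fc1c14..."}

# env_script: run a script inside the env; pass iff the exit code matches (default 0).
script_check = {"method": "env_script", "script": "test -f /tmp/solved", "expected_exit_code": 0}

# llm_judge: score the trajectory against a rubric; pass iff the score clears the threshold.
judge_check = {"method": "llm_judge", "rubric": "Did the agent capture the flag?", "passing_threshold": 0.8}

# Awaiting is assumed here to match the async environment helpers above.
result = await verify_env_state(env, trajectory=None, verification=flag_check)
assert isinstance(result, VerificationResult)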