# Reward recipes

The seven server-side reward recipes that turn a rollout into a score, plus Worlds reward policies for live RL.

RL jobs use a reward recipe to turn each rollout completion into a float reward. Pick one by name when you submit:

```sh
dn train rl ... --reward-recipe task_verifier_v1
```

Pass parameters as a JSON object when the recipe needs configuration:

```sh
dn train rl ... --reward-recipe contains_v1 \
  --reward-params '{"needle": "flag", "reward_if_true": 1.0, "reward_if_false": 0.0}'
```

Every recipe receives the completion text plus the dataset row (for prompt-dataset RL) or the task definition (for verifier-driven RL). Recipes return a single float the optimizer maximizes.

Training and optimization share the first four recipes; the remaining three (task_verifier_v1, task_env_verifier_v1, task_env_agent_v1) are training-specific.

## exact_match_v1

Scores 1.0 when the completion exactly matches the expected answer after whitespace strip, 0.0 otherwise.

| Field | Type | Source |
| --- | --- | --- |
| params.expected | string | Optional global expected value. Falls back to the row’s expected_output. |
| Dataset column | | expected_output — required when params.expected is not set. |

Use this when every prompt has one ground-truth answer and partial matches don’t count.
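
For example, rely on each row’s expected_output column, or pin one global answer for every prompt (the expected value below is illustrative):

```sh
# Per-row ground truth via the dataset's expected_output column:
dn train rl ... --reward-recipe exact_match_v1

# Or one global expected answer (illustrative value):
dn train rl ... --reward-recipe exact_match_v1 \
  --reward-params '{"expected": "42"}'
```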

## contains_v1

Scores based on whether a fixed substring appears anywhere in the completion.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.needle | string | | Required. Substring to look for. |
| params.reward_if_true | float | 1.0 | Returned when the substring is present. |
| params.reward_if_false | float | 0.0 | Returned when the substring is absent. |

The needle is global to the run — it does not read per-row fields. Use this when “did the agent mention this term?” is the entire metric.

## row_reward_v1

Passes a per-row reward value from the dataset straight through to the optimizer.

| Field | Type | Source |
| --- | --- | --- |
| params.default | float | Fallback when a row has no reward. Defaults to 0.0. |
| Dataset column | | reward — the per-row numeric value returned unchanged. |

Use this when the metric is already in the dataset — human labels, reward-model scores, anything you computed offline. The recipe adds nothing on top.
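
A minimal invocation; a row whose reward column holds 0.85 yields exactly 0.85:

```sh
# Rows with no value in the reward column fall back to params.default:
dn train rl ... --reward-recipe row_reward_v1 \
  --reward-params '{"default": 0.0}'
```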

## trajectory_imitation_v1

Returns the row’s reward when the completion matches the expected output; otherwise returns a fallback.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.expected | string | | Optional global expected. Falls back to expected_output. |
| params.reward_if_true | float | 1.0 | Used when match succeeds and the row has no reward. |
| params.reward_if_false | float | 0.0 | Used when the completion doesn’t match. |

Use this when you want the model to imitate known-good outputs but weight rows differently — harder examples carry more reward via the row’s reward column.
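
The precedence, stated plainly: on a match, return the row’s reward if present, else reward_if_true; on a miss, return reward_if_false. Param values below are illustrative:

```sh
dn train rl ... --reward-recipe trajectory_imitation_v1 \
  --reward-params '{"reward_if_true": 1.0, "reward_if_false": 0.0}'
```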

## task_verifier_v1

Verifies a completion against a task’s embedded flag. The recipe strips whitespace, SHA-256 hashes the result, and compares it byte-for-byte against the expected hash pinned in the task.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.reward_if_true | float | 1.0 | Returned when the hash matches. |
| params.reward_if_false | float | 0.0 | Returned when it doesn’t. |

Use this for security tasks that embed a flag or secret solution. The recipe never sees the plaintext — only the hash — so tasks stay checkable without leaking the answer.
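
As a back-of-the-envelope illustration (not the recipe’s actual code, and with a made-up flag), the check is equivalent to hashing the stripped completion and comparing digests:

```sh
# Hypothetical completion; the real expected hash lives in the task definition.
completion='  flag{example}  '
stripped=$(printf '%s' "$completion" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
printf '%s' "$stripped" | sha256sum   # compare against the pinned hash
```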

## task_env_verifier_v1

Provisions a live task environment per rollout, lets the policy sample one completion, then grades the env’s final state using the task’s verification config. Use this when the reward comes from world state (flag files, database rows, service state) rather than completion text.

```sh
dn train rl ... \
  --task-ref security-mutillidae-sqli@1.0.0 \
  --reward-recipe task_env_verifier_v1 \
  --reward-params '{"max_concurrent_rollouts": 8, "reward_if_true": 1.0}'
```

The recipe reads the task’s verification dict (snapshotted onto the env at provision time) and dispatches to env_flag, env_script, or llm_judge — see the Verification page for the methods.

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.reward_if_true | float | 1.0 | Returned when verification passes. |
| params.reward_if_false | float | 0.0 | Returned when verification fails. |
| params.max_concurrent_rollouts | int | 8 | Parallel env provisions per step; cap under tight E2B quota. |
| params.env_timeout_sec | int | 300 | Env lifetime per rollout. |

Single-shot only — the policy sees the rendered task instruction once, replies once, and the reward comes from the env. For multi-turn agents that use tools, reach for task_env_agent_v1.

## task_env_agent_v1

Provisions a task environment, builds an in-process agent from the job’s capability, runs a full tool-use rollout against the env, then grades the env state (same verification methods as above). This is the primary recipe for cyber RL — the policy is an agent that iterates against the target.

```sh
dn train rl ... \
  --capability cyber-agent@3.1.0 \
  --task-ref security-mutillidae-sqli@1.0.0 \
  --reward-recipe task_env_agent_v1 \
  --reward-params '{"max_turns": 20, "max_concurrent_rollouts": 8}'
```

Per-turn credit assignment uses reward-to-go — the terminal reward (from verification) is distributed across the rollout’s assistant turns so the optimizer can credit earlier steps. Works with any capability that runs under optimization today; no capability changes required.
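
In symbols, as a simplified reading: with per-turn rewards r_t and a single terminal reward R from verification, the reward-to-go for assistant turn t is G_t = Σ_{t′ ≥ t} r_{t′}, which collapses to G_t = R for every turn when the only nonzero reward is the terminal one.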

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| params.max_turns | int | 20 | Cap on agent steps per rollout. |
| params.max_concurrent_rollouts | int | 8 | Parallel env provisions per step. |
| params.env_timeout_sec | int | 600 | Env lifetime per rollout (longer than single-shot — tools need time). |
| params.reward_if_true | float | 1.0 | Returned when verification passes. |
| params.reward_if_false | float | 0.0 | Returned when verification fails. |

## Choosing a recipe

| You have… | Reach for |
| --- | --- |
| Ground-truth answers per row. | exact_match_v1 |
| A single target phrase the agent should produce. | contains_v1 |
| Pre-computed rewards already in the dataset. | row_reward_v1 |
| Ground-truth outputs plus per-row weights. | trajectory_imitation_v1 |
| A task with an embedded flag-style solution. | task_verifier_v1 |
| A task whose reward lives in world state (single-shot). | task_env_verifier_v1 |
| A task that needs a tool-using agent to solve it. | task_env_agent_v1 |

For multi-metric composition or custom scorers not covered above, publish pre-scored datasets and use row_reward_v1, or reach for optimization when the knob you want to turn is prompt or instruction text rather than weights.

## Worlds reward policies

When you train RL with --world-manifest-id, a separate --world-reward policy shapes intermediate signals during the live trajectory — distinct from the per-completion recipes above.

```sh
dn train rl ... \
  --world-manifest-id <id> \
  --world-reward discovery_v1 \
  --world-reward-params '{"success_reward": 1.5, "error_penalty": -0.5}'
```

Three presets are available:

| Preset | Shapes |
| --- | --- |
| heuristic_v1 | General-purpose: reasoning traces, tool observations, host / credential / privilege discovery, stop-tool bonus, plus terminal state rewards. |
| goal_only_v1 | Sparse goal-driven reward: success bonus and penalties for stalls, step limits, and errors. |
| discovery_v1 | Red-team shaping: bonuses for host discovery, credential acquisition, and privilege escalation on top of terminal outcomes. |

Each preset accepts params that override its default weights (reasoning_trace_bonus, host_discovery_reward, success_reward, etc.).
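
For example, overriding two of heuristic_v1’s defaults (the weights below are illustrative):

```sh
dn train rl ... \
  --world-manifest-id <id> \
  --world-reward heuristic_v1 \
  --world-reward-params '{"reasoning_trace_bonus": 0.05, "host_discovery_reward": 0.2}'
```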

For fully custom shaping, pass a components list instead of a preset name:

```sh
dn train rl ... \
  --world-reward-params '{
    "components": [
      {"name": "reasoning_trace", "params": {"value": 0.02}},
      {"name": "host_discovery", "params": {"value": 0.15}},
      {"name": "terminal_state", "params": {"success_reward": 1.5, "error_penalty": -0.5}}
    ]
  }'
```

Available components: reasoning_trace, tool_observation, host_discovery, credential_discovery, privilege_escalation, tool_stop, tool_error_penalty, terminal_state.

Both can be set on the same RL job; they are orthogonal.

| | --reward-recipe | --world-reward |
| --- | --- | --- |
| Scores | The completion text. | The trajectory — tool calls, observations, state. |
| When evaluated | Once per rollout, after generation. | Throughout a live rollout, per event. |
| Required for | Any RL job that uses a recipe. | Only --world-manifest-id rollouts. |

Use the recipe when you have a metric for the final output. Use the world reward when the journey matters and you want to shape exploration.
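
An illustrative pairing (the specific combination is an example, not a prescription): score the final completion with a flag check while goal_only_v1 shapes the live trajectory:

```sh
dn train rl ... \
  --reward-recipe task_verifier_v1 \
  --world-manifest-id <id> \
  --world-reward goal_only_v1
```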