# Reward recipes

The seven server-side reward recipes that turn a rollout into a score, plus Worlds reward policies for live RL.
RL jobs use a reward recipe to turn each rollout completion into a float reward. Pick one by name when you submit:
```sh
dn train rl ... --reward-recipe task_verifier_v1
```

Pass parameters as a JSON object when the recipe needs configuration:

```sh
dn train rl ... --reward-recipe contains_v1 \
  --reward-params '{"needle": "flag", "reward_if_true": 1.0, "reward_if_false": 0.0}'
```

Every recipe receives the completion text plus the dataset row (for prompt-dataset RL) or the task definition (for verifier-driven RL). Recipes return a single float the optimizer maximizes.
Training and optimization share the first four of these recipes; `task_verifier_v1` and the two task-environment recipes are training-specific.
## exact_match_v1

Scores 1.0 when the completion exactly matches the expected answer after stripping whitespace, 0.0 otherwise.
| Field | Type | Source |
|---|---|---|
| `params.expected` | string | Optional global expected value. Falls back to the row's `expected_output`. |
| Dataset column | — | `expected_output` — required when `params.expected` is not set. |
Use this when every prompt has one ground-truth answer and partial matches don’t count.
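A minimal invocation with a single global answer (the expected string here is illustrative):

```sh
dn train rl ... --reward-recipe exact_match_v1 \
  --reward-params '{"expected": "42"}'
```

Omit `--reward-params` to compare against each row's `expected_output` instead.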
## contains_v1

Scores based on whether a fixed substring appears anywhere in the completion.
| Field | Type | Default | Notes |
|---|---|---|---|
| `params.needle` | string | — | Required. Substring to look for. |
| `params.reward_if_true` | float | 1.0 | Returned when the substring is present. |
| `params.reward_if_false` | float | 0.0 | Returned when the substring is absent. |
The needle is global to the run — it does not read per-row fields. Use this when “did the agent mention this term?” is the entire metric.
## row_reward_v1

Passes a per-row reward value from the dataset straight through to the optimizer.
| Field | Type | Source |
|---|---|---|
| `params.default` | float | Fallback when a row has no reward. Defaults to 0.0. |
| Dataset column | — | `reward` — the per-row numeric value returned unchanged. |
Use this when the metric is already in the dataset — human labels, reward-model scores, anything you computed offline. The recipe adds nothing on top.
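A sketch of what rows might look like; only the `reward` column is read by the recipe, and the other field names are illustrative:

```jsonl
{"prompt": "Summarize the advisory in one sentence.", "reward": 0.9}
{"prompt": "Triage this log excerpt.", "reward": 0.2}
```

A row without a `reward` value falls back to `params.default`.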
## trajectory_imitation_v1

Returns the row's reward when the completion matches the expected output; otherwise returns a fallback.
| Field | Type | Default | Source |
|---|---|---|---|
| `params.expected` | string | — | Optional global expected value. Falls back to `expected_output`. |
| `params.reward_if_true` | float | 1.0 | Used when the match succeeds and the row has no reward. |
| `params.reward_if_false` | float | 0.0 | Used when the completion doesn't match. |
Use this when you want the model to imitate known-good outputs but weight rows differently —
harder examples carry more reward via the row’s reward column.
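A sketch that penalizes mismatches instead of zeroing them (the -0.1 value is illustrative; rows are assumed to carry both `expected_output` and `reward`):

```sh
dn train rl ... --reward-recipe trajectory_imitation_v1 \
  --reward-params '{"reward_if_false": -0.1}'
```

A matching completion returns the row's reward; a mismatch returns -0.1 rather than the 0.0 default.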
## task_verifier_v1

Verifies a completion against a task's embedded flag. The recipe strips whitespace, SHA-256 hashes the result, and compares it byte-for-byte against the expected hash pinned in the task.
| Field | Type | Default | Notes |
|---|---|---|---|
| `params.reward_if_true` | float | 1.0 | Returned when the hash matches. |
| `params.reward_if_false` | float | 0.0 | Returned when it doesn't. |
Use this for security tasks that embed a flag or secret solution. The recipe never sees the plaintext — only the hash — so tasks stay checkable without leaking the answer.
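The check described above reduces to a few lines. A minimal sketch of the logic (the function name, UTF-8 encoding, and hex digest format are our assumptions, not the recipe's internals):

```python
import hashlib

def verify_flag(completion: str, expected_hash_hex: str) -> float:
    """Strip whitespace, SHA-256 the result, compare against the task's pinned hash."""
    digest = hashlib.sha256(completion.strip().encode("utf-8")).hexdigest()
    # Only the hash is ever compared; the plaintext flag never appears here.
    return 1.0 if digest == expected_hash_hex else 0.0
```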
## task_env_verifier_v1

Provisions a live task environment per rollout, lets the policy sample one completion, then grades the env's final state using the task's verification config. Use this when the reward comes from world state (flag files, database rows, service state) rather than completion text.
```sh
dn train rl ... \
  --task-ref security-mutillidae-sqli@1.0.0 \
  --reward-recipe task_env_verifier_v1 \
  --reward-params '{"max_concurrent_rollouts": 8, "reward_if_true": 1.0}'
```

The recipe reads the task's verification dict (snapshotted onto the env at provision time) and dispatches to `env_flag`, `env_script`, or `llm_judge` — see the Verification page for the methods.
| Field | Type | Default | Notes |
|---|---|---|---|
| `params.reward_if_true` | float | 1.0 | Returned when verification passes. |
| `params.reward_if_false` | float | 0.0 | Returned when verification fails. |
| `params.max_concurrent_rollouts` | int | 8 | Parallel env provisions per step; cap under tight E2B quota. |
| `params.env_timeout_sec` | int | 300 | Env lifetime per rollout. |
Single-shot only — the policy sees the rendered task instruction once, replies once, and the
reward comes from the env. For multi-turn agents that use tools, reach for task_env_agent_v1.
## task_env_agent_v1

Provisions a task environment, builds an in-process agent from the job's capability, runs a full tool-use rollout against the env, then grades the env state (same verification methods as above). This is the primary recipe for cyber RL — the policy is an agent that iterates against the target.
```sh
dn train rl ... \
  --capability cyber-agent@3.1.0 \
  --task-ref security-mutillidae-sqli@1.0.0 \
  --reward-recipe task_env_agent_v1 \
  --reward-params '{"max_turns": 20, "max_concurrent_rollouts": 8}'
```

Per-turn credit assignment uses reward-to-go — the terminal reward (from verification) is distributed across the rollout's assistant turns so the optimizer can credit earlier steps (see the sketch after the table below). Works with any capability that runs under optimization today; no capability changes required.
| Field | Type | Default | Notes |
|---|---|---|---|
| `params.max_turns` | int | 20 | Cap on agent steps per rollout. |
| `params.max_concurrent_rollouts` | int | 8 | Parallel env provisions per step. |
| `params.env_timeout_sec` | int | 600 | Env lifetime per rollout (longer than single-shot — tools need time). |
| `params.reward_if_true` | float | 1.0 | Returned when verification passes. |
| `params.reward_if_false` | float | 0.0 | Returned when verification fails. |
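A minimal sketch of reward-to-go with a single terminal reward. The discount factor is our illustration; the exact discounting the trainer applies is not specified here:

```python
def reward_to_go(terminal_reward: float, num_turns: int, gamma: float = 1.0) -> list[float]:
    """Assign each assistant turn the discounted sum of the rewards that follow it.

    With only a terminal reward, turn t receives gamma**(num_turns - 1 - t) * terminal_reward,
    so turns closer to the verified outcome are credited at least as strongly as earlier ones.
    """
    return [terminal_reward * gamma ** (num_turns - 1 - t) for t in range(num_turns)]

# With no discounting, every turn is credited with the full terminal reward:
assert reward_to_go(1.0, 3) == [1.0, 1.0, 1.0]
```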
## Picking a recipe

| You have… | Reach for |
|---|---|
| Ground-truth answers per row. | `exact_match_v1` |
| A single target phrase the agent should produce. | `contains_v1` |
| Pre-computed rewards already in the dataset. | `row_reward_v1` |
| Ground-truth outputs plus per-row weights. | `trajectory_imitation_v1` |
| A task with an embedded flag-style solution. | `task_verifier_v1` |
| A task whose reward lives in world state (single-shot). | `task_env_verifier_v1` |
| A task that needs a tool-using agent to solve it. | `task_env_agent_v1` |
For multi-metric composition or custom scorers not covered above, publish pre-scored datasets
and use row_reward_v1, or reach for optimization when the knob you
want to turn is prompt or instruction text rather than weights.
## World reward policies

When you train RL with `--world-manifest-id`, a separate `--world-reward` policy shapes intermediate signals during the live trajectory — distinct from the per-completion recipe above.
```sh
dn train rl ... \
  --world-manifest-id <id> \
  --world-reward discovery_v1 \
  --world-reward-params '{"success_reward": 1.5, "error_penalty": -0.5}'
```

Three presets are available:
| Preset | Shapes |
|---|---|
| `heuristic_v1` | General-purpose: reasoning traces, tool observations, host / credential / privilege discovery, stop-tool bonus, plus terminal state rewards. |
| `goal_only_v1` | Sparse goal-driven reward: success bonus and penalties for stalls, step limits, and errors. |
| `discovery_v1` | Red-team shaping: bonuses for host discovery, credential acquisition, and privilege escalation on top of terminal outcomes. |
Each preset accepts params that override its default weights (`reasoning_trace_bonus`, `host_discovery_reward`, `success_reward`, etc.).
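For instance, to keep `heuristic_v1` but reweight two of its signals (the values are illustrative):

```sh
dn train rl ... \
  --world-manifest-id <id> \
  --world-reward heuristic_v1 \
  --world-reward-params '{"reasoning_trace_bonus": 0.05, "host_discovery_reward": 0.2}'
```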
For fully custom shaping, pass a components list instead of a preset name:
```sh
dn train rl ... \
  --world-reward-params '{
    "components": [
      {"name": "reasoning_trace", "params": {"value": 0.02}},
      {"name": "host_discovery", "params": {"value": 0.15}},
      {"name": "terminal_state", "params": {"success_reward": 1.5, "error_penalty": -0.5}}
    ]
  }'
```

Available components: `reasoning_trace`, `tool_observation`, `host_discovery`, `credential_discovery`, `privilege_escalation`, `tool_stop`, `tool_error_penalty`, `terminal_state`.
## `--reward-recipe` vs. `--world-reward`

Both can be set on the same RL job; they are orthogonal.
| | `--reward-recipe` | `--world-reward` |
|---|---|---|
| Scores | The completion text. | The trajectory — tool calls, observations, state. |
| When evaluated | Once per rollout, after generation. | Throughout a live rollout, per event. |
| Required for | Any RL job that uses a recipe. | Only `--world-manifest-id` rollouts. |
Use the recipe when you have a metric for the final output. Use the world reward when the journey matters and you want to shape exploration.
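One possible combination, assuming your job pairs a world manifest with a recipe-scored completion (flags mirror the examples above):

```sh
dn train rl ... \
  --reward-recipe contains_v1 \
  --reward-params '{"needle": "flag"}' \
  --world-manifest-id <id> \
  --world-reward heuristic_v1
```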
## Where to go next

- Reinforcement learning for the full RL submission flow.
- Manifest reference for every RL config field.