Running Benchmarks
This page is the runbook for the Sentry benchmark. The overview page is the readout. This page is for creating a new result.
Run one Warden shard per corpus commit. Each shard checks out the public
getsentry/sentry repository at that commit and scans only the files referenced
by corpus entries for that commit.
Do not run this benchmark against all of Sentry. The point is a repeatable comparison against known vulnerable files.
Clone Sentry and pick a runtime/model pair:
git clone git@github.com:getsentry/sentry.git /tmp/sentry-benchmark
export BENCH_MODEL="openai/gpt-5.5"
export BENCH_MODEL_SLUG="gpt-5-5"
export BENCH_RUNTIME="pi"
export BENCH_RUNTIME_SLUG="pi"
export BENCH_EFFORT="high"
export BENCH_RUN_SLUG="pi-gpt-5-5-high"
export BENCH_ROOT="/tmp/warden-sentry-benchmark-${BENCH_RUN_SLUG}"
export BENCH_PARALLEL="4"
mkdir -p "$BENCH_ROOT"
Set the provider API key required by the model. For GPT 5.5 through Pi, set
WARDEN_OPENAI_API_KEY. For Anthropic models, set WARDEN_ANTHROPIC_API_KEY.
For OpenRouter models through Pi, set WARDEN_OPENROUTER_API_KEY and use the
full OpenRouter selector, such as openrouter/deepseek/deepseek-v4-pro.
For the Claude SDK runtime, use Claude Code model IDs such as
claude-sonnet-4-6 instead of Pi provider/model selectors.
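A preflight check can catch a missing key before any shard starts. The following is a minimal sketch, not part of Warden: the helper names requiredKeyFor and missingKey are hypothetical, and the prefix rules (in particular treating an anthropic/ prefix or a bare claude- ID as Anthropic) are assumptions inferred from the selectors named above.

```javascript
// Hypothetical preflight helpers; not part of the Warden CLI.
// Maps a model selector to the credential variable this page names for it.
function requiredKeyFor(model) {
  // OpenRouter selectors embed other provider names, so check them first.
  if (model.startsWith("openrouter/")) return "WARDEN_OPENROUTER_API_KEY";
  if (model.startsWith("openai/")) return "WARDEN_OPENAI_API_KEY";
  // Assumption: Anthropic selectors start with "anthropic/" or are bare Claude IDs.
  if (model.startsWith("anthropic/") || model.startsWith("claude-")) {
    return "WARDEN_ANTHROPIC_API_KEY";
  }
  return null; // Unknown provider: nothing to check.
}

// Returns the name of the missing variable, or null when nothing is missing.
function missingKey(model, env = process.env) {
  const key = requiredKeyFor(model);
  return key && !env[key] ? key : null;
}
```

Calling missingKey(process.env.BENCH_MODEL) before the run loop and aborting on a non-null result would avoid spending a shard on an auth failure.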
If you load credentials from a local env file, export them before running
pnpm. A plain source .env.local sets shell variables, but child processes
will not see them unless they are exported:
set -a
source .env.local
set +a
The older GPT 5.5 run used openai/gpt-5.5. The installed Pi registry did not
expose openai/gpt-5.5-codex. That run did not pass an explicit effort flag, so the
run used the runtime/provider default. New benchmark runs should pass
--effort high or --effort low when testing reasoning behavior, and omit
--effort only when intentionally measuring the runtime default.
Target Lists
From the Warden repository, write one target list per corpus commit:
node <<'NODE'
const {execFileSync} = require("node:child_process");
const {mkdirSync, readFileSync, writeFileSync} = require("node:fs");

const corpus = JSON.parse(
  readFileSync("packages/docs/src/data/benchmarking/sentry-vulnerability-corpus.json", "utf8"),
);
const repo = "/tmp/sentry-benchmark";
const outDir = process.env.BENCH_ROOT;
mkdirSync(outDir, {recursive: true});

const bySha = new Map();
for (const finding of corpus.findings) {
  const entry = bySha.get(finding.sha) ?? {findings: 0, paths: new Set()};
  entry.findings += 1;
  entry.paths.add(finding.code.path);
  bySha.set(finding.sha, entry);
}

for (const [sha, entry] of [...bySha.entries()].sort()) {
  const paths = [...entry.paths].sort();
  const missing = [];
  for (const path of paths) {
    try {
      execFileSync("git", ["-C", repo, "cat-file", "-e", `${sha}:${path}`], {stdio: "ignore"});
    } catch {
      missing.push(path);
    }
  }
  if (missing.length > 0) {
    throw new Error(`${sha} missing corpus paths:\n${missing.join("\n")}`);
  }
  writeFileSync(`${outDir}/targets-${sha}.txt`, `${paths.join("\n")}\n`);
  console.error(`${sha}: ${paths.length} target files for ${entry.findings} corpus findings`);
}
NODE
The current corpus produces 79 target files across 6 commits.
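The generated lists can be counted before spending on a run, to confirm they match the expected totals. This is a rough sketch that relies only on the targets-<sha>.txt naming used by the script above; countTargets is a hypothetical helper, not a Warden command.

```javascript
const {readdirSync, readFileSync} = require("node:fs");
const {join} = require("node:path");

// Hypothetical helper: counts target lists and target files in a benchmark root.
function countTargets(root) {
  // One targets-<sha>.txt file is written per corpus commit.
  const lists = readdirSync(root).filter((name) => /^targets-.+\.txt$/.test(name));
  let files = 0;
  for (const name of lists) {
    // Each non-empty line is one target path.
    files += readFileSync(join(root, name), "utf8").split("\n").filter(Boolean).length;
  }
  return {commits: lists.length, files};
}
```

Running countTargets(process.env.BENCH_ROOT) after the target-list step should report the commit and file totals stated for the current corpus; a mismatch suggests stale lists from an earlier corpus version in the same directory.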
Warden Config
Write the config outside the Sentry checkout. Keep runtime, model, and effort explicit on the CLI so the invocation captures the run shape. The config keeps the benchmark skill, thresholds, concurrency, and verifier policy stable.
cat > "$BENCH_ROOT/warden.toml" <<EOF
version = 1

[defaults]
reportOn = "low"
maxTurns = 100

[defaults.verification]
enabled = true

[runner]
concurrency = 4

[[skills]]
name = "security-review"
EOF
Set maxTurns explicitly for corpus runs. The default is lower and can make
heavy security chunks fail with turn_limit, which turns a model comparison
into a runner-limit comparison.
Run from the Warden repository:
set -euo pipefail

BENCH_PARALLEL="${BENCH_PARALLEL:-4}"

effort_args=()
if [ -n "${BENCH_EFFORT:-}" ]; then
  effort_args=(--effort "$BENCH_EFFORT")
fi

for target in "$BENCH_ROOT"/targets-*.txt; do
  sha=${target##*/targets-}
  sha=${sha%.txt}
  short=${sha:0:8}
  output="$BENCH_ROOT/sentry-security-review-${BENCH_RUN_SLUG}-corpus-${short}.jsonl"
  validated="$output.validated"

  if [ -s "$output" ] && [ -f "$validated" ]; then
    echo "Skipping ${short}; output and validated marker exist"
    continue
  fi

  rm -f "$validated"
  git -C /tmp/sentry-benchmark checkout "$sha"

  if ! pnpm cli -- run \
    -C /tmp/sentry-benchmark \
    @"$target" \
    --skill security-review \
    --config-path "$BENCH_ROOT/warden.toml" \
    --runtime "$BENCH_RUNTIME" \
    --model "$BENCH_MODEL" \
    "${effort_args[@]}" \
    --traces \
    --report-on low \
    --min-confidence low \
    --parallel "$BENCH_PARALLEL" \
    -o "$output" \
    -v \
    --log
  then
    echo "Run failed for ${short}; leaving validated marker absent"
    continue
  fi

  if node - "$output" "$target" <<'NODE'
const {readFileSync} = require("node:fs");

const output = process.argv[2];
const target = process.argv[3];
const lines = readFileSync(output, "utf8").trim().split("\n").filter(Boolean);
const expectedFiles = readFileSync(target, "utf8")
  .trim()
  .split("\n")
  .filter(Boolean);
if (lines.length === 0) {
  throw new Error(`${output} is empty`);
}

const records = lines.map((line) => JSON.parse(line));
const failed = records.filter((record) => record.status && record.status !== "ok");
const summary = records.find((record) => record.type === "summary");
const sourceChunks = records.filter(
  (record) => record.status === "ok" && record.chunk?.file,
);
const missingTraces = sourceChunks.filter((record) => !record.trace);
const chunksByFile = new Map();
for (const record of sourceChunks) {
  const chunks = chunksByFile.get(record.chunk.file) ?? [];
  chunks.push(record.chunk);
  chunksByFile.set(record.chunk.file, chunks);
}

if (failed.length > 0) {
  throw new Error(`${output} has ${failed.length} failed records`);
}
if (!summary) {
  throw new Error(`${output} is missing a summary record`);
}
const expectedFileSet = new Set(expectedFiles);
const missingFiles = expectedFiles.filter((file) => !chunksByFile.has(file));
const unexpectedFiles = [...chunksByFile.keys()].filter(
  (file) => !expectedFileSet.has(file),
);
if (missingFiles.length > 0) {
  throw new Error(`${output} is missing ${missingFiles.length} target files`);
}
if (unexpectedFiles.length > 0) {
  throw new Error(`${output} has ${unexpectedFiles.length} unexpected target files`);
}
for (const [file, chunks] of chunksByFile) {
  const totals = new Set(chunks.map((chunk) => chunk.total));
  if (totals.size !== 1) {
    throw new Error(`${output} has inconsistent chunk totals for ${file}`);
  }
  const total = chunks[0].total;
  const indices = new Set(chunks.map((chunk) => chunk.index));
  if (indices.size !== chunks.length) {
    throw new Error(`${output} has duplicate chunk records for ${file}`);
  }
  for (let index = 1; index <= total; index += 1) {
    if (!indices.has(index)) {
      throw new Error(`${output} is missing chunk ${index}/${total} for ${file}`);
    }
  }
}
if (missingTraces.length > 0) {
  throw new Error(`${output} has ${missingTraces.length} untraced source chunks`);
}
NODE
  then
    touch "$validated"
  else
    echo "Validation failed for ${short}; leaving validated marker absent"
    continue
  fi
done
Use a separate .validated marker for resumable benchmark scripts. Do not use
Warden’s .jsonl.done marker as proof that a shard is usable; it can exist for
failed JSONL artifacts.
BENCH_PARALLEL=4 is the standard corpus setting. If a new, slow, or
high-effort provider repeatedly fails with 503s, timeouts, or provider
disconnects, retry the failed shard with a lower BENCH_PARALLEL and record
that in the result notes. Do not mix failed and clean attempts in one stitched
summary.
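Before choosing between a full retry and a lower BENCH_PARALLEL, it can help to see which failure statuses a shard actually contains. The sketch below is a hypothetical helper, not a Warden command; it assumes only the optional status field that the validation script reads, and makes no claim about which status values Warden emits.

```javascript
// Hypothetical helper: tallies non-ok record statuses in one shard's JSONL text.
// Assumes the record shape the validation script reads: an optional `status` field.
function failureCounts(jsonlText) {
  const counts = {};
  for (const line of jsonlText.split("\n").filter(Boolean)) {
    const record = JSON.parse(line);
    // Records without a status (such as the summary) are not failures.
    if (record.status && record.status !== "ok") {
      counts[record.status] = (counts[record.status] ?? 0) + 1;
    }
  }
  return counts;
}
```

An empty object means the shard has no failed records. Recording the non-empty tallies in the result notes keeps the reason for a lower-parallelism rerun visible.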
If only a few chunks fail in an otherwise expensive shard, move the dirty JSONL
aside, rerun the affected target files with BENCH_PARALLEL=1, and rebuild the
shard by replacing only the failed chunk records with traced, successful repair
records. Keep the repair usage in the aggregate cost and note the repair in the
result. Do not fabricate missing chunk records.
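The repair step could look roughly like the sketch below. It is not part of Warden: applyRepairs and chunkKey are hypothetical names, and the sketch assumes that failed records still carry the chunk.file and chunk.index fields that the validation script reads on successful records. It does not recompute the summary record or the aggregate cost.

```javascript
// Hypothetical repair helpers; not part of the Warden CLI.
const chunkKey = (record) => `${record.chunk.file}#${record.chunk.index}`;

function applyRepairs(dirtyRecords, repairRecords) {
  // Only traced, successful chunk records qualify as repairs.
  const repairs = new Map();
  for (const record of repairRecords) {
    if (record.status === "ok" && record.trace && record.chunk?.file) {
      repairs.set(chunkKey(record), record);
    }
  }
  return dirtyRecords.map((record) => {
    const failed = record.status && record.status !== "ok" && record.chunk?.file;
    if (!failed) return record; // Keep clean and non-chunk records untouched.
    const repair = repairs.get(chunkKey(record));
    if (!repair) {
      // Never invent a record: a missing repair means the rerun is incomplete.
      throw new Error(`no repair for ${chunkKey(record)}`);
    }
    return repair;
  });
}
```

Throwing on a missing repair keeps the rebuilt shard honest: either every failed chunk is replaced by a real rerun record, or the shard stays dirty. The rebuilt file should still pass the same validation as any other shard before it gets a .validated marker.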
Inspect the stitched output:
pnpm cli -- runs show "$BENCH_ROOT"/sentry-security-review-"$BENCH_RUN_SLUG"-corpus-*.jsonl \
  -C /tmp/sentry-benchmark \
  --report-on low \
  --min-confidence low
Only stitch clean shard JSONL files. If a shard fails because of auth,
provider 503s, timeouts, turn_limit, or a manual abort, move that artifact out
of the *.jsonl glob before running runs show, remove any stale validation
marker, and rerun the shard. Keep the failed artifact with a .withheld suffix
if it is useful for diagnosis.
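Moving a failed artifact aside and clearing its marker can be done in one step. This is a minimal sketch with a hypothetical helper name, withholdShard; it assumes the .validated and .withheld suffix conventions described on this page and uses only Node's built-in fs module.

```javascript
const {renameSync, rmSync} = require("node:fs");

// Hypothetical helper: takes a failed shard out of the *.jsonl glob.
function withholdShard(jsonlPath) {
  // Drop any stale marker first so a resumable script cannot skip this shard.
  rmSync(`${jsonlPath}.validated`, {force: true});
  // The .withheld suffix means the file no longer ends in .jsonl.
  const withheld = `${jsonlPath}.withheld`;
  renameSync(jsonlPath, withheld);
  return withheld;
}
```

After this call the run loop treats the shard as missing and reruns it, while the failed artifact stays on disk for diagnosis.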
Record Results
Keep every raw JSONL shard with the result summary, but do not commit raw JSONL until it has been reviewed for sensitive data. The raw artifacts are the source of truth for cost, duration, token counts, and future rescoring.
Store results in packages/docs/src/data/benchmarking/results/.
Record:
- stable runId
- corpus ID
- repository
- Warden version
- skill
- model
- runtime
- effort level, or provider-default
- configured maxTurns, plus any shard reruns done after a turn-limit failure
- whether Warden’s post-analysis finding verifier was enabled
- whether --traces was enabled, plus any run-level trace IDs preserved in the JSONL metadata
- report and confidence thresholds
- one shard per corpus commit, including SHA, target list, raw JSONL artifact name, and raw artifact review status
- total files, chunks, failed chunks, findings, cost, duration, and tokens
- timing.analysisChunkMs from top-level per-record durationMs values in the raw JSONL artifacts, when all raw shard artifacts are available
- total wall duration from the stitched run summary
- scoring summary once a reviewer matches findings back to the corpus
Warden’s finding verifier is enabled by default. Benchmark runs should leave it
enabled unless they are deliberately testing verifier-off behavior. It is
disabled only when defaults.verification.enabled = false is set in
warden.toml. Record this as findingVerification.enabled in the result JSON.
Verifier calls are part of Warden’s analysis pipeline, not benchmark scoring.
They can add provider cost, and runs with more candidate findings generally do
more verifier work. Keep this separate from the benchmark scores field, which
is the later semantic match against the corpus.
The timing breakdown has the same separation. Per-chunk P50 and P90 timing is recorded before Warden’s post-analysis verifier runs. Total timing includes post-analysis work and upstream provider latency, so treat it as flaky operational context rather than a stable comparison metric.
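The per-chunk figures could be derived along the lines of the sketch below. It is an illustration under stated assumptions, not Warden's implementation: the p50 and p90 key names are hypothetical, nearest-rank is one of several percentile definitions, and the sketch assumes that chunk records are the ones carrying both chunk.file and a numeric top-level durationMs.

```javascript
// Nearest-rank percentile over an ascending-sorted array of numbers.
function percentile(sorted, p) {
  if (sorted.length === 0) return null;
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical helper: the p50/p90 key names are assumptions, not Warden's schema.
function analysisChunkMs(records) {
  const durations = records
    // Assumption: only chunk records count toward per-chunk timing.
    .filter((record) => record.chunk?.file && typeof record.durationMs === "number")
    .map((record) => record.durationMs)
    .sort((a, b) => a - b);
  return {p50: percentile(durations, 50), p90: percentile(durations, 90)};
}
```

Pass the parsed records from every raw shard of the run in one call, so the percentiles describe the whole run rather than a single commit.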
Score by agent-verified semantic match, not exact wording or line number. A result counts as found when it identifies the same bug in roughly the same location as an existing corpus finding.
Scoring is not deterministic. An agent reviews every emitted finding against the existing corpus findings for that commit. Same-file findings about different bugs do not count. One emitted finding may count for multiple corpus entries when it clearly covers the same bug represented by multiple existing entries. Duplicate emitted findings do not double-count the same corpus entry.
Use this scoring checklist:
- Read every emitted finding from the raw JSONL shards for the run.
- For each finding, compare it against corpus entries from the same commit.
- Use same path and nearby line range as the first candidate filter, but make the final decision semantically.
- Count known-found only when the finding would lead a reviewer to the same bug in roughly the same code location.
- Mark same-file findings about different bugs as not-known.
- Record one scores entry for every emitted finding, including non-matches.
- Set scoring.knownFound to the number of unique matched corpus entries, not the number of emitted findings.
- Leave the run unscored when the raw findings are missing or cannot be semantically verified.
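The counting rule in the checklist can be sketched as follows. The score entry shape used here (findingId, verdict, corpusIds) is hypothetical; only the known-found and not-known labels and the knownFound total come from the checklist above.

```javascript
// Hypothetical score entry shape: {findingId, verdict, corpusIds}.
// Only the "known-found" and "not-known" labels come from the checklist.
function knownFound(scores) {
  const matched = new Set();
  for (const score of scores) {
    if (score.verdict === "known-found") {
      // One finding may match several corpus entries; duplicates collapse in the set.
      for (const id of score.corpusIds ?? []) matched.add(id);
    }
  }
  return matched.size;
}

// Two findings matching corpus entry c1 count once; c2 adds a second match.
const example = [
  {findingId: "f1", verdict: "known-found", corpusIds: ["c1", "c2"]},
  {findingId: "f2", verdict: "known-found", corpusIds: ["c1"]},
  {findingId: "f3", verdict: "not-known", corpusIds: []},
];
console.log(knownFound(example)); // prints 2
```

Counting unique corpus entries rather than findings keeps a model that reports the same bug several times from scoring higher than one that reports it once.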
Keep the distinction clear:
- known found: corpus vulnerabilities Warden found
- total findings: all findings Warden emitted before scoring
- unexpected valid: real vulnerabilities not already in the corpus
- false positives: findings rejected by review
Do not treat the score as a universal pass rate. It is a relative comparison for this corpus and this run shape.