
Overview

The Sentry benchmark is a small, qualitative readout for Warden’s security review behavior. It compares runs against known vulnerabilities from the public getsentry/sentry repository.

This is not an exhaustive eval, and it is not proof that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.

The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.

That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.

The score table is the headline and sorts by known-corpus recall. The cost and timing tables below it are operational context for understanding why two runs with similar scores may look very different to operate. They sort separately: cost by recorded cost, timing by P50 analysis-chunk duration. This matrix only shows stable comparison runs with per-chunk timing metadata and no failed chunks; older incomplete or partial runs remain in the result data but are hidden here.
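The selection and sort rules described above can be sketched as follows. This is a minimal illustration, not Warden's result schema: the `Run` fields and the "hypothetical partial run" row are assumptions made up for the example, while the other sample values are taken from the tables on this page.

```python
from dataclasses import dataclass


@dataclass
class Run:
    # All field names are illustrative assumptions, not Warden's schema.
    name: str
    known_found: int
    recorded_cost_usd: float
    p50_seconds: float
    failed_chunks: int = 0
    has_chunk_timing: bool = True


def stable_runs(runs):
    """Keep only runs with per-chunk timing metadata and no failed chunks."""
    return [r for r in runs if r.has_chunk_timing and r.failed_chunks == 0]


runs = [
    Run("GPT 5.5 (Pi), high", 41, 148.63, 180.0),
    Run("Claude Sonnet 4.6 (Pi)", 25, 19.84, 41.9),
    Run("Claude Opus 4.7 (Pi), medium", 6, 4.39, 1.2),
    # Made-up row: hidden from the matrix because some chunks failed.
    Run("hypothetical partial run", 30, 12.00, 20.0, failed_chunks=3),
]

stable = stable_runs(runs)

# Each table sorts independently of the others.
by_score = sorted(stable, key=lambda r: r.known_found, reverse=True)
by_cost = sorted(stable, key=lambda r: r.recorded_cost_usd)
by_p50 = sorted(stable, key=lambda r: r.p50_seconds)
```

Keeping the three sorts separate is what lets a run sit at the top of the score table and at the bottom of the cost table at the same time.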

| Run                                | Known corpus  | Total findings | Recorded cost |
|------------------------------------|---------------|----------------|---------------|
| GPT 5.5 (Pi), high                 | 41/86 (47.7%) | 72             | $148.63       |
| GPT 5.5 (Pi), low                  | 28/86 (32.6%) | 38             | $39.36        |
| Claude Sonnet 4.6 (Pi)             | 25/86 (29.1%) | 32             | $19.84        |
| Claude Sonnet 4.6 (Claude SDK)     | 24/86 (27.9%) | 32             | $103.59       |
| Claude Opus 4.6 (Pi), high         | 23/86 (26.7%) | 24             | $36.86        |
| DeepSeek V4 Pro (Pi), xhigh        | 23/86 (26.7%) | 30             | $18.70        |
| Claude Opus 4.8 (Pi), high         | 21/86 (24.4%) | 24             | $21.31        |
| Claude Opus 4.8 (Pi), medium       | 18/86 (20.9%) | 19             | $14.50        |
| DeepSeek V4 Flash (Pi), xhigh      | 18/86 (20.9%) | 27             | $10.11        |
| Claude Opus 4.8 (Claude SDK), high | 17/86 (19.8%) | 17             | $79.56        |
| Claude Opus 4.7 (Pi), medium       | 6/86 (7.0%)   | 7              | $4.39         |

Cost and Tokens

Sorted by recorded cost, lowest first.

| Run                                | Recorded cost | Input tokens | Output tokens |
|------------------------------------|---------------|--------------|---------------|
| Claude Opus 4.7 (Pi), medium       | $4.39         | 1.53M        | 20.77k        |
| DeepSeek V4 Flash (Pi), xhigh      | $10.11        | 74.35M       | 2.09M         |
| Claude Opus 4.8 (Pi), medium       | $14.50        | 4.62M        | 225.33k       |
| DeepSeek V4 Pro (Pi), xhigh        | $18.70        | 65.51M       | 1.85M         |
| Claude Sonnet 4.6 (Pi)             | $19.84        | 9.67M        | 508.84k       |
| Claude Opus 4.8 (Pi), high         | $21.31        | 6.52M        | 376.36k       |
| Claude Opus 4.6 (Pi), high         | $36.86        | 16.14M       | 585.84k       |
| GPT 5.5 (Pi), low                  | $39.36        | 18.71M       | 390.01k       |
| Claude Opus 4.8 (Claude SDK), high | $79.56        | 31.84M       | 386.17k       |
| Claude Sonnet 4.6 (Claude SDK)     | $103.59       | 65.67M       | 1.09M         |
| GPT 5.5 (Pi), high                 | $148.63       | 127.9M       | 986.84k       |

Timing

Sorted by P50 analysis-chunk duration, lowest first.

| Run                                | P50     | P90      | Total      |
|------------------------------------|---------|----------|------------|
| Claude Opus 4.7 (Pi), medium       | 1.2 s   | 9.6 s    | 6.6 min    |
| Claude Opus 4.8 (Pi), medium       | 11.9 s  | 51.7 s   | 42.4 min   |
| Claude Opus 4.8 (Pi), high         | 20.9 s  | 1.1 min  | 31.4 min   |
| Claude Opus 4.8 (Claude SDK), high | 21.6 s  | 1.1 min  | 34.3 min   |
| GPT 5.5 (Pi), low                  | 34.2 s  | 56.4 s   | 55.2 min   |
| Claude Sonnet 4.6 (Pi)             | 41.9 s  | 1.9 min  | 53.6 min   |
| Claude Opus 4.6 (Pi), high         | 49.6 s  | 2.5 min  | 67.6 min   |
| Claude Sonnet 4.6 (Claude SDK)     | 1.9 min | 26.6 min | 448.4 min  |
| DeepSeek V4 Flash (Pi), xhigh      | 2.9 min | 18.5 min | 494.5 min  |
| GPT 5.5 (Pi), high                 | 3.0 min | 5.6 min  | 163.9 min  |
| DeepSeek V4 Pro (Pi), xhigh        | 3.8 min | 19.3 min | 1056.9 min |

  • Known corpus (the known-found count) is the headline score. It counts corpus entries where scoring verified that Warden found the same bug in roughly the same location.
  • Total findings is review volume before scoring. More findings can mean better recall, but it also means more human review.
  • Scoring is semantic. Same-file findings about different bugs do not count, duplicate findings do not double-count, and one finding can cover multiple corpus entries when it catches the same root bug.
  • Benchmark runs use Warden’s post-analysis finding verifier unless a row opts out. The verifier filters candidate findings during the run; benchmark scoring happens later. Verifier calls add provider cost.
  • Treat cost and duration as operational measurements. Recorded cost is not normalized model pricing or cost per finding. P50 and P90 are per-analysis-chunk durations. Total includes verifier work, provider latency, queueing, retries, and runtime overhead.
  • The stable matrix only shows clean comparison runs: 156 analysis chunks, zero failed chunks, and per-chunk timing metadata. Partial and superseded rows stay in the result data for audit history.
  • Trace and auxiliary usage fields depend on what the raw artifacts preserved. When verifier usage is available, it appears under auxiliaryUsage.verification.
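
The scoring rules in the notes above can be sketched as a small counting function. This is an illustration only: the shape of `matches` (finding ID mapped to the corpus entry IDs it was judged to cover) and every ID below are assumptions, and the semantic judgment itself is taken as already made.

```python
CORPUS_SIZE = 86  # validated vulnerabilities in the corpus


def known_found(matches):
    """Count distinct corpus entries covered by at least one finding.

    `matches` maps a finding ID to the set of corpus entry IDs that
    scoring judged it to cover (an assumed shape, for illustration).
    """
    covered = set()
    for corpus_ids in matches.values():
        covered |= corpus_ids
    return len(covered)


# Hypothetical scored findings for one run.
matches = {
    "finding-1": {"corpus-07"},               # one finding, one entry
    "finding-2": {"corpus-07"},               # duplicate: no double count
    "finding-3": {"corpus-12", "corpus-13"},  # one root bug, two entries
    "finding-4": set(),                       # same file, different bug
}

score = known_found(matches)      # headline score
total_findings = len(matches)     # review volume before scoring
recall = score / CORPUS_SIZE
```

Counting distinct covered entries rather than findings is what makes duplicates harmless and lets a single finding earn credit for several entries.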

The two Sonnet 4.6 rows are clean enough to compare directly. Both rows scan the same 156 analysis chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. Pi found 25 of 86 known corpus entries. The Claude SDK found 24 of 86. Both emitted 32 total findings.

The difference is cost and runtime behavior, not benchmark quality. The Claude SDK row records $103.59 total cost, including $61.61 for scan work. The Pi row records $19.84 total cost, including $11.20 for scan work. On scan work alone, Claude SDK cost is 5.5x Pi, input tokens are 6.34x Pi, output tokens are 2.44x Pi, cache reads are 5.72x Pi, and cache creation is 9.56x Pi.
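
As a quick check of the cost multiplier quoted above, the ratio can be recomputed from the recorded dollar figures. The variable names are illustrative; the values are the ones stated in this section.

```python
# Sonnet 4.6 recorded costs in USD, as stated above.
sdk_scan_cost = 61.61
pi_scan_cost = 11.20
sdk_total_cost = 103.59
pi_total_cost = 19.84

scan_ratio = sdk_scan_cost / pi_scan_cost
total_ratio = sdk_total_cost / pi_total_cost

print(f"scan cost ratio:  {scan_ratio:.1f}x")   # 5.5x
print(f"total cost ratio: {total_ratio:.1f}x")  # 5.2x
```

The scan-only and total ratios land close together, which suggests the gap is not concentrated in either scan work or auxiliary work alone.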

Total cost does include Warden’s auxiliary post-processing work. That matters: Sonnet 4.6 verification cost $41.97 through the Claude SDK and $8.54 through Pi. But it is not the whole explanation. Removing verifier and merge work still leaves $61.61 of Claude SDK scan cost against $11.20 of Pi scan cost. The auxiliary gap has the same shape because verifier calls use the configured runtime unless a separate auxiliary model is set.

Turns do not explain the whole gap. The stored trace summaries show 939 Claude SDK turns versus 628 Pi turns, a 1.5x increase. The larger multiplier is the amount of context the Claude SDK runtime carries through those turns. It reads and searches more, then repeats a larger conversation and tool-result context through later model calls.

Targeted child-span reruns of representative Sonnet 4.6 files show the same shape. Those reruns are diagnostic, not the scoring source of truth, and their sanitized summary is checked into the benchmark data. On src/sentry/replays/usecases/replay_counts.py, Claude SDK used 9 turns, 7 tool executions, 346.7k scan input tokens, and $0.55 scan cost. Pi used 3 turns, 2 tool executions, 19.8k scan input tokens, and $0.10 scan cost. On src/sentry/api/endpoints/project_rules.py, Claude SDK used 47 turns, 41 tool executions, 2.23M scan input tokens, and $1.87 scan cost. Pi used 18 turns, 15 tool executions, 176k scan input tokens, and $0.27 scan cost.

In the targeted rerun, the clearest chunk was project_rules.py:607-808. Claude SDK spent 28 turns and 27 tool executions there: 10 Read, 16 Grep, and 1 Glob. That single chunk cost $0.89 and consumed 1.39M scan input tokens. Pi handled the same chunk in one turn with no tools, 6.7k scan input tokens, and $0.01 scan cost.

The practical read is that Claude SDK explores more aggressively and carries more context through each step. Pi exits many clean chunks earlier. On this corpus, the extra Claude SDK exploration did not improve the Sonnet 4.6 score, but it did make the run materially more expensive.

Pi runs without an explicit Warden --effort use Pi’s default thinking level, which is currently medium.

The Opus 4.8 high-effort comparison now has a fresh traced pair. Both rows scan the same 156 analysis chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. Pi found 21 of 86 known corpus entries and emitted 24 total findings. The Claude SDK found 17 of 86 and emitted 17 total findings.

The cost gap is still large, but the trace shape is different from Sonnet 4.6. Claude SDK records $79.56 total cost, including $61.08 for scan work. Pi records $21.31 total cost, including $17.39 for scan work. On scan work alone, Claude SDK cost is 3.5x Pi, input tokens are 4.35x Pi, cache reads are 3.82x Pi, and cache creation is 6.10x Pi. Output tokens do not explain the gap: Pi actually emitted slightly more scan output tokens than Claude SDK.

The traces do not show Claude SDK doing more tool work. Claude SDK used 375 turns and 219 tool executions. Pi used 426 turns and 371 tool executions. Pi also produced more final findings. The difference is that each Claude SDK turn carried much more input context: about 60.0k scan input tokens per turn versus 12.1k for Pi.
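
The per-turn figures can be multiplied back out to see how fewer turns still produce a much larger input total. This is a rough reconstruction from the rounded numbers in the paragraph above, so the totals are approximate.

```python
# Opus 4.8 high, scan work only (figures quoted above, rounded).
sdk_turns, sdk_tokens_per_turn = 375, 60_000
pi_turns, pi_tokens_per_turn = 426, 12_100

sdk_total = sdk_turns * sdk_tokens_per_turn   # approximate scan input tokens
pi_total = pi_turns * pi_tokens_per_turn

turn_ratio = sdk_turns / pi_turns                           # below 1: fewer turns
per_turn_ratio = sdk_tokens_per_turn / pi_tokens_per_turn   # close to 5
total_ratio = sdk_total / pi_total                          # a bit over 4.3

print(f"turns: {turn_ratio:.2f}x, per-turn input: {per_turn_ratio:.2f}x, "
      f"total input: {total_ratio:.2f}x")
```

The reconstructed total ratio comes out near the 4.35x input-token multiplier reported earlier; the small difference presumably reflects rounding in the per-turn averages.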

No-finding chunks show the same pattern. Claude SDK no-finding chunks averaged 2.0 turns, 1.0 tool executions, 118.4k scan input tokens, and $0.35 scan cost. Pi no-finding chunks averaged 2.3 turns, 1.8 tool executions, 25.8k scan input tokens, and $0.09 scan cost. Finding chunks were similar on turns but not on context size: Claude SDK averaged 5.6 turns and $0.76 scan cost; Pi averaged 5.5 turns and $0.22.

Representative chunks make the point. On project_rules.py:607-808, both runtimes used one turn and no tools. Claude SDK used 48.4k scan input tokens and cost $0.18. Pi used 8.7k scan input tokens and cost $0.03. On replay_counts.py:1-202, both again used one turn and no tools. Claude SDK used 48.3k scan input tokens and cost $0.19. Pi used 8.7k scan input tokens and cost $0.04.

The heavier files do not reverse the conclusion. Across integrations/perforce/integration.py, Claude SDK used 18 turns, 14 tool executions, 1.43M scan input tokens, and $2.66 scan cost, producing one final finding. Pi used 28 turns, 30 tool executions, 460k scan input tokens, and $1.11 scan cost, producing two final findings. Across integrations/msteams/webhook.py, Claude SDK used 15 turns, 11 tool executions, 1.40M scan input tokens, and $3.13 scan cost, producing no final finding. Pi used 17 turns, 13 tool executions, 260k scan input tokens, and $0.85 scan cost, producing one final finding.

The practical read is that Opus 4.8 on Pi is not cheaper because it skips more work. In this high-effort pair, Pi does more turns and more tool executions, but each turn carries a much smaller input/cache footprint. Claude SDK’s extra cost is mostly repeated context volume and verifier context volume, not additional tool fanout.

The two traced Pi rows give the direct Opus 4.6 versus Opus 4.8 comparison. Both rows scan the same 156 chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. Opus 4.6 high found 23 of 86 known corpus entries. Opus 4.8 high found 21 of 86. Both emitted 24 total findings.

So the lower Opus 4.8 high score is not explained by a smaller or noisier report. It emitted the same number of final findings and had slightly fewer findings without a known-corpus match: 4 versus 5. The issue is recall against this specific corpus.

The traces explain the difference. Opus 4.6 high did much more investigation: 981 turns, 1,101 tool executions, 13.6M scan input tokens, and $30.11 scan cost. Opus 4.8 high used 426 turns, 371 tool executions, 5.2M scan input tokens, and $17.39 scan cost. Average turns per chunk dropped from 6.29 to 2.73, and the maximum chunk dropped from 51 turns to 11.
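
The per-chunk averages follow directly from the totals and the 156-chunk count. A small sketch, using the trace totals quoted above (the dictionary layout is just for illustration):

```python
CHUNKS = 156  # analysis chunks scanned by every stable run

traces = {
    "Opus 4.6 high": {"turns": 981, "tool_executions": 1101},
    "Opus 4.8 high": {"turns": 426, "tool_executions": 371},
}

averages = {
    name: {
        "turns_per_chunk": round(t["turns"] / CHUNKS, 2),
        "tools_per_chunk": round(t["tool_executions"] / CHUNKS, 2),
    }
    for name, t in traces.items()
}

for name, avg in averages.items():
    print(name, avg)
```

The turns-per-chunk values reproduce the 6.29 and 2.73 figures in the text; the tools-per-chunk values are derived the same way and are not separately reported.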

No-finding chunks show the same shape. Opus 4.6 high averaged 5.4 turns, 6.0 tool executions, 73.4k scan input tokens, and $0.17 scan cost on chunks that ended without a finding. Opus 4.8 high averaged 2.3 turns, 1.8 tool executions, 25.8k scan input tokens, and $0.09. Finding chunks were also shorter: 11.0 turns on Opus 4.6 high versus 5.5 on Opus 4.8 high.

The matched corpus IDs shifted, not just shrank. The two rows overlap on 12 known corpus entries. Opus 4.6 high has 11 unique matches that Opus 4.8 high missed, and Opus 4.8 high has 9 unique matches that Opus 4.6 high missed. Opus 4.6 high is better on aggregate recall here, but it is not a strict superset of Opus 4.8 high.
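
The overlap arithmetic can be checked with plain set operations. The corpus IDs below are placeholders chosen only to reproduce the counts in the paragraph above; they are not real corpus identifiers.

```python
def overlap_summary(a, b):
    """Summarize how two sets of matched corpus IDs relate."""
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "union": len(a | b),
    }


# Placeholder IDs: 12 shared, 11 unique to Opus 4.6, 9 unique to Opus 4.8.
shared = {f"shared-{i}" for i in range(12)}
opus_46_high = shared | {f"only46-{i}" for i in range(11)}
opus_48_high = shared | {f"only48-{i}" for i in range(9)}

summary = overlap_summary(opus_46_high, opus_48_high)
print(summary)
```

With these counts the two rows together cover 32 distinct corpus entries, more than either row alone, which is another way of seeing that neither is a superset of the other.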

The best supported conclusion is that Opus 4.8 high is more selective under the current Warden prompt and corpus. It scans every chunk and does not fail more often. It exits more investigations earlier, which lowers cost and tool fanout, but misses enough known vulnerabilities to trail Opus 4.6 high on this corpus.

The Opus 4.6 high traced row also shows why benchmark runs now set maxTurns = 100. One heavy MS Teams chunk hit the default turn cap and was rerun cleanly with the higher cap. Without that, the row would measure a runner limit instead of model behavior.

The DeepSeek V4 rows use Pi 0.78.0 through OpenRouter with explicit --effort xhigh. That setting was applied: Warden passes the effort to Pi as thinkingLevel, and Pi’s OpenRouter model entry for deepseek/deepseek-v4-flash exposes off, high, and xhigh thinking levels with reasoning: true. Pi keeps xhigh after model-capability clamping and sends it as reasoning: {effort: "xhigh"}. Both rows scan the same 156 chunks and use Warden’s finding verifier. V4 Pro found 23 of 86 known corpus entries and emitted 30 total findings. V4 Flash found 18 of 86 known corpus entries and emitted 27 total findings.

Flash is cheaper because the model price is lower, not because it does less work. V4 Pro used 3,019 turns and 3,502 tool executions across the corpus. V4 Flash used 3,138 turns and 4,191 tool executions. V4 Flash also consumed more scan input tokens: 72.2M, compared to V4 Pro’s 62.2M. Recorded scan cost was still lower for Flash at $3.44, versus $9.76 for V4 Pro. Total recorded cost was $10.11 for Flash and $18.70 for V4 Pro.
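
One rough way to see the price effect is to divide recorded scan cost by scan input tokens. This is only a blended illustration: it folds output and cache tokens into a single number, so it is not the provider's list price and not a normalized cost metric.

```python
def blended_usd_per_million_input(scan_cost_usd, scan_input_millions):
    """Recorded scan cost divided by scan input tokens (in millions)."""
    return scan_cost_usd / scan_input_millions


# Figures quoted in the paragraph above.
flash = {"turns": 3138, "tools": 4191, "input_m": 72.2, "scan_cost": 3.44}
pro = {"turns": 3019, "tools": 3502, "input_m": 62.2, "scan_cost": 9.76}

flash_rate = blended_usd_per_million_input(flash["scan_cost"], flash["input_m"])
pro_rate = blended_usd_per_million_input(pro["scan_cost"], pro["input_m"])

# Flash does more work on every axis, yet its blended rate is far lower.
flash_does_more_work = (
    flash["turns"] > pro["turns"]
    and flash["tools"] > pro["tools"]
    and flash["input_m"] > pro["input_m"]
)

print(f"Flash: ${flash_rate:.3f}/M input, Pro: ${pro_rate:.3f}/M input")
```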

The recall tradeoff is real. V4 Pro ties Opus 4.6 high on known matches and beats Opus 4.8 high on Pi by two. V4 Flash lands below Opus 4.8 high on Pi by three known matches, but still beats the Claude SDK Opus 4.8 high row by one. The DeepSeek rows overlap on 11 known corpus entries. V4 Pro has 12 unique matches that Flash missed; Flash has 7 unique matches that Pro missed.

The closest Claude-family row to V4 Flash is Opus 4.8 on Pi at Pi’s default medium effort: both found 18 of 86 known corpus entries. They got there in very different ways. Opus used 330 turns, 3.4M scan input tokens, and 172k scan output tokens. Flash used 3,138 turns, 72.2M scan input tokens, and 2.0M scan output tokens. Opus had an 11.9-second P50 chunk duration and a 51.7-second P90; Flash had a 2.9-minute P50 and an 18.5-minute P90.

The span-complete Opus 4.8 high rows make the tool-call difference explicit. Opus 4.8 high on Pi found 21 of 86 with 426 turns and 371 tool executions. Opus 4.8 high through the Claude SDK found 17 of 86 with 375 turns and 219 tool executions. Flash found 18 of 86 with 3,138 turns and 4,191 tool executions, mostly read and grep calls. The result is not just a cheaper Opus-shaped run. Flash explores far more context, loops through many more tool calls, and lands on a different set of known findings.

The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.
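
A corpus entry with those four fields, and the per-commit scan plan a benchmark run derives from it, might look like the sketch below. The class, field names, SHAs, and descriptions are all hypothetical placeholders; only the two file paths are taken from this page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    # Field names are assumptions mirroring the four fields listed above.
    sha: str
    file: str
    description: str
    snippet: str


def files_by_commit(entries):
    """Group affected files by commit so each checkout scans only those files."""
    plan = defaultdict(set)
    for entry in entries:
        plan[entry.sha].add(entry.file)
    return {sha: sorted(files) for sha, files in plan.items()}


# Placeholder entries; SHAs and descriptions are not real corpus data.
entries = [
    CorpusEntry("sha-a", "src/sentry/api/endpoints/project_rules.py",
                "placeholder description 1", "..."),
    CorpusEntry("sha-a", "src/sentry/api/endpoints/project_rules.py",
                "placeholder description 2", "..."),
    CorpusEntry("sha-b", "src/sentry/replays/usecases/replay_counts.py",
                "placeholder description 3", "..."),
]

plan = files_by_commit(entries)
```

Two entries can point at the same file, which is why the corpus has more vulnerabilities than files; the set-based grouping scans such a file once per commit.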

Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.