Claude Opus 4.7: +11 SWE-Bench, Cyber Nerfed, and It Pushes Back
Better at coding. Deliberately worse at hacking. And it argues when you're wrong. Anthropic's Opus 4.7 is the first frontier model to ship a regression on purpose — and the reasoning matters.
The video version · same thesis, looser edits
What Anthropic actually shipped
Claude Opus 4.7 dropped on April 16, 2026. Same price as 4.6 — $5 per million input tokens, 90% savings on prompt caching. But the changelog has three stories in it, and only one of them is getting the attention it deserves.
The headline number: SWE-Bench Pro jumped 11 points, from 53.4% to 64.3%. That’s the largest single-version jump in the Opus line. For agentic coding workflows — the kind where the model is autonomously resolving GitHub issues across multi-file codebases — this is a material improvement.
But that’s the easy story. The harder ones are below.
The coding and vision upgrades
The coding gains come from deeper adaptive thinking. Opus 4.7 burns more thinking tokens per response, which is why Anthropic raised rate limits across the board. Boris Cherny, Head of Claude Code, published a specific recommendation: give Claude a way to verify its own work. With 4.7’s adaptive thinking, self-verification acts as a 2–3× multiplier on code quality.
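If you want to wire that recommendation up outside Claude Code, here is a minimal sketch of the pattern using the standard tool-use API: the only tool exposed is a test runner, so the model has to verify before it can declare the job done. File-editing tools are omitted for brevity, and the model ID is an assumption.

```python
# Sketch: giving Claude a way to verify its own work via a "run_tests" tool.
# File-editing tools are omitted; the model ID is an assumption, not a confirmed identifier.
import subprocess
import anthropic

client = anthropic.Anthropic()

run_tests_tool = {
    "name": "run_tests",
    "description": "Run the project's test suite and return the full output.",
    "input_schema": {"type": "object", "properties": {}, "required": []},
}

def run_tests() -> str:
    # Runs pytest in the working directory and returns combined stdout/stderr.
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "Fix the failing date parsing in utils.py, then verify it."}]
while True:
    response = client.messages.create(
        model="claude-opus-4-7",  # assumed model ID
        max_tokens=4096,
        tools=[run_tests_tool],
        messages=messages,
    )
    if response.stop_reason != "tool_use":
        break  # the model stopped asking to verify; it considers the task done
    messages.append({"role": "assistant", "content": response.content})
    tool_use = next(block for block in response.content if block.type == "tool_use")
    messages.append({
        "role": "user",
        "content": [{"type": "tool_result", "tool_use_id": tool_use.id, "content": run_tests()}],
    })
```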
Four new controls ship for agentic developers:
- `x-high` effort level — sits between `high` and `max`, trading latency for deeper reasoning on hard problems (see the sketch after this list)
- Task budgets (beta) — set a token allowance for an entire agentic loop, not just per-request
- `/ultra-review` command — runs a dedicated review session in Claude Code
- Auto mode — now available on the Max plan
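Here is what opting a single call into the x-high level might look like. The effort field name and model ID are assumptions, not confirmed API surface; the SDK's extra_body escape hatch simply forwards unrecognized fields to the request body, so check the 4.7 docs for the real parameter.

```python
# Sketch: requesting the new x-high effort level on one Messages call.
# "effort" and the model ID are assumptions based on the release notes above.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",  # assumed model ID
    max_tokens=8192,
    messages=[{"role": "user", "content": "Prove or refute: this retry loop can livelock."}],
    extra_body={"effort": "x-high"},  # assumed field name; sits between "high" and "max"
)
print(response.content[0].text)
```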
The vision upgrades matter for computer-use agents. Previous Opus models downscaled every uploaded image to roughly 1.15 megapixels. Opus 4.7 roughly triples that pixel budget, raising the long-edge limit to 2,576 pixels, and maps pixel coordinates one-to-one. When the model says “click here,” it means exactly that pixel on your actual screen. No more scale-factor math in your automation pipeline.
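To make the scale-factor point concrete, here is a back-of-the-envelope sketch using the numbers above: the old megapixel cap forced a uniform downscale (and a coordinate correction on the way back), while the new long-edge limit passes a typical 1440p screenshot through untouched. The constants come from this section, not from official docs.

```python
# Sketch of the scale-factor math the 4.7 vision change removes.
# Constants are taken from the text above and are illustrative only.
OLD_MAX_MEGAPIXELS = 1.15
NEW_MAX_LONG_EDGE = 2576

def old_scale_factor(width: int, height: int) -> float:
    """Downscale factor the old pipeline applied before sending a screenshot."""
    megapixels = (width * height) / 1_000_000
    if megapixels <= OLD_MAX_MEGAPIXELS:
        return 1.0
    return (OLD_MAX_MEGAPIXELS / megapixels) ** 0.5  # uniform scale on both axes

def fits_new_limit(width: int, height: int) -> bool:
    """True if the screenshot is passed through untouched under Opus 4.7."""
    return max(width, height) <= NEW_MAX_LONG_EDGE

# A 2560x1440 display: downscaled to ~0.56x before, sent as-is now.
print(old_scale_factor(2560, 1440), fits_new_limit(2560, 1440))
```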
The deliberate cyber regression
This is the story most coverage is missing.
Look at the benchmark chart. Opus 4.7’s CyberGym score — Anthropic’s offensive cybersecurity benchmark — actually dropped, from 73.8% to 73.1%. In a release where every other benchmark went up, one went down. That’s not a training failure. That’s a policy decision.
Anthropic deliberately reduced offensive cyber capabilities during training. They published the number. A frontier model company shipped a model that is measurably worse at something, on purpose, and put the receipt in the blog post.
Here’s the context that explains why.
Project Glasswing
Anthropic’s unreleased model, Mythos Preview, scores 83.1% on CyberGym — significantly higher than anything in the public Opus line. In internal testing, Mythos has already identified thousands of high-severity vulnerabilities across every major operating system and browser.
Rather than release these capabilities publicly, Anthropic launched Project Glasswing — a coalition of twelve partners including Apple, Google, Microsoft, NVIDIA, and the Linux Foundation, backed by $100 million in usage credits. The mandate: use Mythos-class capabilities for defense, not offense.
Glasswing partners get access to the unrestricted model. Everyone else gets Opus 4.7, which was deliberately trained with reduced offensive surface.
This is a meaningful shift in safety strategy. Frontier labs have historically treated every benchmark as a number to maximize. Anthropic is treating offensive cyber as a capability to restrict in the public model while channeling it into controlled defensive programs. Whether you agree with the approach or not, the transparency is notable — most labs wouldn’t publish the regression.
It pushes back
Beyond benchmarks, the behavioral shift is noticeable in the first few hours of use.
Opus 4.7 disagrees with you. Not performatively — it won’t argue for fun. But when you’re wrong, it says so. Ask it to implement something that violates its understanding of the problem, and it’ll push back with its reasoning before complying.
Compare that to Gemini 3.1 Pro, which will agree with your framing, implement what you asked, and then quietly pivot when you start listing negatives. Opus 4.7 front-loads the disagreement. For agentic workflows where the model operates autonomously for extended chains, a model that challenges bad instructions early is safer than one that silently accumulates errors.
It also asks sharper clarifying questions. The intent-matching feels tighter — it’s parsing what you mean rather than pattern-matching to what you said. This tracks with the adaptive thinking changes: more thinking tokens mean more room to work out your intent before committing to a response.
The breaking changes
Two things will bite you if you’re migrating from 4.6:
The tokenizer changed. The same text produces up to 35% more tokens under 4.7’s tokenizer. Per-token price is identical, but your effective cost just went up. If you’re on tight API budgets, run your existing prompts through the new tokenizer before committing.
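The cheapest way to find out how hard this hits you is to count tokens before you switch. A sketch using the token-counting endpoint follows, with placeholder model IDs you would swap for the real 4.6 and 4.7 identifiers.

```python
# Sketch: measuring the tokenizer change on your own prompts before migrating.
# The model IDs are assumptions; replace them with the real identifiers from the docs.
import anthropic

client = anthropic.Anthropic()

PROMPT = "Paste a representative production prompt here."
messages = [{"role": "user", "content": PROMPT}]

old = client.messages.count_tokens(model="claude-opus-4-6", messages=messages)  # assumed ID
new = client.messages.count_tokens(model="claude-opus-4-7", messages=messages)  # assumed ID

print(f"4.6: {old.input_tokens} tokens | 4.7: {new.input_tokens} tokens "
      f"({new.input_tokens / old.input_tokens - 1:+.0%})")
```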
It follows instructions more literally. Prompts that relied on 4.6’s slightly looser interpretation may produce different results. This is a side effect of the improved instruction-following — the model is doing exactly what you said, which turns out to be different from what you meant. Audit your system prompts.
Anthropic’s scale context
Some numbers worth knowing:
- $14 billion annual run-rate revenue, growing 10× annually for three consecutive years
- Claude Code alone accounts for $2.5 billion of that
- 4% of all public commits on GitHub are now authored by Claude Code
- 8 of the Fortune 10 are Claude customers
Claude Code is no longer a developer tool — it’s infrastructure. The 4% GitHub commit share means that one in twenty-five public commits is machine-authored through a single product. That’s a distribution footprint, not a feature.
Should you upgrade
Yes if you’re doing agentic coding, long-horizon autonomous workflows, or computer-use pipelines with screenshot-based navigation. The SWE-Bench jump and pixel-accurate vision are real improvements for these use cases.
Wait if your workloads are heavy on terminal-based DevOps (GPT 5.4 still leads here), high-volume agentic search (BrowseComp actually regressed), or high-volume API calls where the tokenizer change quietly bumps your bill.
The model is better. The tokenizer makes it more expensive. The cyber regression is a policy statement, not a capability problem. And the pushback behavior is — for production agentic systems — arguably the most important change on the list.