Claude Sonnet 4.6: near-flagship performance at mid-tier pricing

Anthropic has released Claude Sonnet 4.6, a significant upgrade to its mid-tier AI model. The company says it outperforms its predecessor across coding, computer use, long-context reasoning, agent planning, knowledge work, and design. Sonnet 4.6 is now the default model in claude.ai and Claude Cowork and carries the same price as its predecessor, Sonnet 4.5, at $3 per million input tokens and $15 per million output tokens.

The pricing is notable because Anthropic’s flagship Opus models cost five times as much. According to Anthropic, Sonnet 4.6 now matches or approaches Opus-level performance across several key benchmarks. On SWE-bench Verified, a standard test for real-world software coding, Sonnet 4.6 scored 79.6%, compared to Opus 4.6’s 80.8%. On agentic financial analysis, Sonnet 4.6 scored 63.3%, beating Opus 4.6’s 60.1%. For companies running AI agents that process millions of requests per day, this cost difference has a large practical impact.

In early user testing of Claude Code, Anthropic’s developer tool, users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. They also preferred it over Opus 4.5 — the previous flagship model — 59% of the time, citing fewer hallucinations, less over-engineering, and better instruction following.

One of the headline capabilities is computer use, which allows an AI to interact with software visually, the way a human does, by clicking and typing. Anthropic introduced this feature in October 2024 and described it then as experimental and error-prone. On the OSWorld benchmark, which tests AI models on real software tasks, Sonnet 4.6 scored 72.5%, up from 14.9% when the feature first launched. Jamie Cuffe, CEO of insurance technology company Pace, said Sonnet 4.6 scored 94% on their internal computer use benchmark, calling it the highest result from any Claude model they had tested.

Anthropic also highlighted improvements to the model’s resistance to prompt injection attacks, a security risk where malicious instructions hidden on websites attempt to hijack an AI agent’s behavior.

Sonnet 4.6 includes a 1 million token context window in beta. This allows it to process entire codebases or large document collections in a single request. Anthropic tested the model’s long-horizon reasoning using Vending-Bench Arena, a simulation in which AI models compete to run a virtual business over a full year. Sonnet 4.6 ended the simulation with approximately $5,700 in balance, compared to Sonnet 4.5’s roughly $2,100, by investing in capacity early and shifting to profitability later.

Anthropic stated that safety evaluations show Sonnet 4.6 is as safe as, or safer than, its recent models. The company’s safety researchers described the model as having “strong safety behaviors and no signs of major concerns around high-stakes forms of misalignment.”

Sources: Anthropic, VentureBeat

Claude Sonnet 4.6: near-flagship performance at mid-tier pricing

Related posts:

Stay up-to-date: