Anthropic has released its newest AI models, Claude Opus 4 and Claude Sonnet 4, which the company says set new standards for coding ability and sustained performance on complex tasks. The Claude 4 family represents a significant advance in AI capabilities, with Opus 4 able to work continuously for up to seven hours on complex coding tasks without losing focus.
According to Anthropic, Claude Opus 4 is “the world’s best coding model,” achieving industry-leading scores on benchmarks like SWE-bench (72.5%) and Terminal-bench (43.2%). The company claims these scores surpass competitors including OpenAI’s GPT-4.1 and Google’s Gemini 2.5 Pro.
Enhanced capabilities for extended tasks
Both new models introduce significant improvements over their predecessors:
- Extended thinking with tool use: Claude can now alternate between reasoning and using tools like web search, integrating information gathering directly into its thought process.
- Improved memory: When given access to local files, the models can extract and save key information to maintain continuity across long sessions.
- Reduced reward hacking: Anthropic reports a 65% reduction in behaviors where the model takes shortcuts or finds loopholes to complete tasks.
- Parallel tool execution: The models can now use multiple tools simultaneously for more efficient problem-solving.
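The parallel tool execution described above can be sketched generically: instead of issuing tool calls one after another, an agent fires them concurrently and gathers the results. This is a minimal illustration of the pattern, not Anthropic's API; the tool functions, their names, and their results are hypothetical stand-ins.

```python
import asyncio

# Hypothetical tool functions standing in for real tool calls
# (e.g. web search, running a test suite). Names and results are illustrative.
async def search_docs(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate I/O latency
    return f"docs for {query}"

async def run_tests(path: str) -> str:
    await asyncio.sleep(0.1)  # simulate I/O latency
    return f"tests passed in {path}"

async def main() -> list[str]:
    # Issue both tool calls concurrently; total wall time is roughly
    # the slowest single call rather than the sum of all calls.
    return list(await asyncio.gather(
        search_docs("refactor plan"),
        run_tests("src/"),
    ))

results = asyncio.run(main())
```

`asyncio.gather` preserves the order of its arguments, so results can be matched back to the tool calls that produced them.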
The most striking demonstration of Claude’s improved capabilities comes from Rakuten, which reported using Opus 4 for a “demanding open-source refactor running independently for 7 hours with sustained performance.” This represents a dramatic improvement over previous models, which typically maintained coherence for only 1-2 hours before losing focus.
Strategic positioning in a competitive landscape
Anthropic’s release comes amid intensifying competition in the AI space. In recent months, OpenAI launched its o3 and o4-mini reasoning models, Google updated its Gemini lineup, and Meta released Llama 4 with enhanced multimodal capabilities.
Each major AI lab has developed distinctive strengths. Anthropic is positioning Claude 4 as the leader in sustained coding performance and complex reasoning tasks, particularly for professional developers and enterprises requiring reliable AI assistants for extended work sessions.
“Across all the companies out there that are building things, there’s a really large wave of these agentic applications springing up, and a very high demand and premium being placed on intelligence,” said Alex Albert, Anthropic’s head of Claude Relations, in an interview with Ars Technica. “I think Opus is going to fit that groove perfectly.”
Pricing and availability
Both Claude 4 models maintain the same pricing structure as their predecessors:
- Opus 4: $15 per million tokens for input and $75 per million for output
- Sonnet 4: $3 per million tokens for input and $15 per million for output
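The per-token rates above translate directly into request costs. A minimal sketch of the arithmetic, using the listed prices; the model keys and token counts here are made-up examples for illustration:

```python
# Prices in USD per million tokens, taken from the published list above.
PRICING = {
    "opus-4":   {"input": 15.0, "output": 75.0},
    "sonnet-4": {"input": 3.0,  "output": 15.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the per-million-token rates."""
    p = PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a request with 10k input tokens and 2k output tokens on each model.
opus_cost = request_cost("opus-4", 10_000, 2_000)      # 0.15 + 0.15 = 0.30
sonnet_cost = request_cost("sonnet-4", 10_000, 2_000)  # 0.03 + 0.03 = 0.06
```

At these rates the same request costs five times as much on Opus 4 as on Sonnet 4, which is why the pricing split matters for high-volume workloads.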
Anthropic has made both models available through its API, Amazon Bedrock, and Google Cloud Vertex AI. Sonnet 4 is accessible to free users, while Opus 4 requires a paid subscription.
Challenges of AI transparency and reliability
Despite the impressive capabilities, questions remain about AI transparency and reliability. Anthropic’s own research has highlighted how reasoning models often fail to reveal their complete thought processes. A study found that Claude 3.7 Sonnet mentioned crucial hints it used to solve problems only 25% of the time.
The company has introduced “thinking summaries” for Claude 4 models that use a smaller model to condense lengthy thought processes. This summarization is needed about 5% of the time, according to Anthropic.
Alex Albert acknowledged the inherent unpredictability of these systems presents challenges: “In the realm and the world of software for the past 40, 50 years, we’ve been running on deterministic systems, and now all of a sudden, it’s non-deterministic, and that changes how we build.”
Human oversight remains essential, particularly for production code. “The human review will become more important, and more of your job as developer will be in this review than it will be in the generation part,” Albert told Ars Technica.
Sources: Anthropic, Wired, CNBC, TechCrunch, The Verge, Ars Technica, VentureBeat