Claude Sonnet 4.5 can operate autonomously longer than ever

Anthropic announced Claude Sonnet 4.5, an AI model that can operate autonomously for up to 30 hours on complex tasks. The company demonstrated this capability by having the model create a chat application similar to Slack, producing 11,000 lines of code before stopping upon task completion.

The new model represents a significant improvement over Anthropic’s previous Opus 4 model, which could operate for seven hours autonomously when released in May. According to Anthropic, Claude Sonnet 4.5 achieves state-of-the-art performance on SWE-bench Verified, a benchmark that measures real-world software coding abilities.

The model also shows stronger computer-use capabilities, scoring 61.4% on OSWorld, a benchmark that tests AI models on real-world computer tasks. That is up from the 42.2% that Sonnet 4 scored four months earlier. Dianne Penn, head of product management at Anthropic, said the model is more than three times as skilled at navigating browsers and using computers as the company’s technology from last October.

Enterprise applications and customer feedback

Early customers report significant improvements in various domains. Cursor CEO Michael Truell highlighted the model’s coding performance on longer tasks. GitHub’s Chief Product Officer Mario Rodriguez noted improvements in multi-step reasoning and code comprehension for their Copilot service.

Other companies reported specific performance gains. HackerOne’s Chief Product Officer Nidhi Aggarwal stated that the model reduced vulnerability intake time for the company’s Hai security agents by 44% while improving accuracy by 25%. Cognition CEO Scott Wu reported that the model increased planning performance by 18% and end-to-end evaluation scores by 12% for the company’s Devin coding assistant.

Scott White, product lead for Claude.ai, described the model as operating at “chief-of-staff level,” capable of coordinating calendars, analyzing data dashboards, and generating status updates based on meeting notes.

Technical improvements and safety measures

Anthropic claims Claude Sonnet 4.5 is its most aligned frontier model, showing fewer concerning behaviors such as deception, power-seeking, and compliance with harmful prompts. The company has implemented AI Safety Level 3 protections, including classifiers designed to detect potentially dangerous inputs related to chemical, biological, radiological, and nuclear weapons.

The model is available through the Claude API using the identifier “claude-sonnet-4-5” at the same pricing as Claude Sonnet 4: $3 per million input tokens and $15 per million output tokens. Anthropic has also released the Claude Agent SDK, providing developers with infrastructure tools used in their Claude Code product.
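To make the pricing concrete, here is a minimal sketch of the arithmetic in Python. The two per-million-token prices come from the figures above; the function name `estimate_cost_usd` and the example token counts are hypothetical, chosen only for illustration, and the sketch ignores any discounts or surcharges that may apply in practice.

```python
from decimal import Decimal

# Prices quoted in the article, in US dollars per million tokens.
INPUT_USD_PER_MTOK = Decimal("3")
OUTPUT_USD_PER_MTOK = Decimal("15")
TOKENS_PER_MILLION = Decimal(1_000_000)


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the list-price cost of one request (hypothetical helper)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        Decimal(input_tokens) * INPUT_USD_PER_MTOK
        + Decimal(output_tokens) * OUTPUT_USD_PER_MTOK
    )
    return total / TOKENS_PER_MILLION


# Illustrative request: 200,000 input tokens and 50,000 output tokens.
# Input costs $0.60 and output costs $0.75, for $1.35 in total.
print(estimate_cost_usd(200_000, 50_000))
```

Because output tokens cost five times as much as input tokens, a long-running agentic task that writes a lot of code will likely be dominated by the output side of the bill.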

The company also introduced checkpoints in Claude Code, a native VS Code extension, and code execution directly within conversations. A temporary research preview called “Imagine with Claude,” in which the model generates software in real time, is available to Max subscribers for five days.

Additional source: The Verge
