Anthropic has unveiled “constitutional classifiers,” a new safeguard system designed to keep AI models from generating harmful content. According to research published by Anthropic and reported by Taryn Plumb in VentureBeat, the system blocked 95.6% of jailbreak attempts against the company’s Claude 3.5 Sonnet model; without the classifiers, the same model blocked only 14% of those attacks. The evaluation used 10,000 synthetic jailbreaking prompts spanning a range of languages and writing styles. To further validate the system’s effectiveness, Anthropic invited security researchers to try to break it in a bug-bounty program: 183 participants spent more than 3,000 hours attempting to circumvent the safeguards, but none achieved a universal jailbreak. The system does increase inference compute costs by 23.7%, yet Anthropic reports that it handles legitimate queries normally, with the refusal rate rising by only 0.38 percentage points.
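The underlying mechanism is simple to sketch: classifiers trained on synthetic data generated from a natural-language “constitution” of allowed and disallowed content screen both the incoming prompt and the model’s output before anything is returned. The snippet below is a minimal illustrative sketch of that gating pattern, not Anthropic’s implementation; the `KeywordClassifier`, `harm_score`, and `guarded_generate` names are hypothetical stand-ins, and a real deployment would use trained classifier models rather than a keyword stub.

```python
# Minimal sketch of classifier-gated generation. All interfaces here
# are hypothetical stand-ins, not Anthropic's actual system.

THRESHOLD = 0.5  # assumed decision threshold for the harm score
REFUSAL = "I can't help with that request."


class KeywordClassifier:
    """Stand-in for a classifier trained on constitution-derived synthetic data."""

    def harm_score(self, text: str) -> float:
        # A trained classifier would return a calibrated probability here;
        # this stub just flags one obviously harmful phrase for demonstration.
        return 1.0 if "build a weapon" in text.lower() else 0.0


def guarded_generate(model, prompt: str, input_clf, output_clf) -> str:
    # 1. Screen the prompt before it ever reaches the model.
    if input_clf.harm_score(prompt) >= THRESHOLD:
        return REFUSAL
    # 2. Generate, then screen the completion. A production system would
    #    score the output stream incrementally and halt mid-generation.
    completion = model.generate(prompt)
    if output_clf.harm_score(completion) >= THRESHOLD:
        return REFUSAL
    return completion


# Usage with a dummy model, to show both paths:
class EchoModel:
    def generate(self, prompt: str) -> str:
        return f"Here is a friendly answer to: {prompt}"


clf = KeywordClassifier()
print(guarded_generate(EchoModel(), "How do I bake bread?", clf, clf))
# -> "Here is a friendly answer to: How do I bake bread?"
print(guarded_generate(EchoModel(), "How do I build a weapon?", clf, clf))
# -> "I can't help with that request."
```

Screening both sides is the point of the design: the output classifier catches jailbreaks that slip past input screening, and running an extra classifier over every completion is plausibly where much of the reported 23.7% compute overhead accrues.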