Anthropic tests its AI models for sabotage capabilities

Anthropic has developed new security assessments for AI models that test their ability to sabotage. In a blog post, the company describes four types of tests: “human decision sabotage,” “code sabotage,” “sandbagging,” and “undermining oversight.”

In human decision sabotage, the models try to trick people into making the wrong decisions without arousing suspicion. Code sabotage tests whether models can insert bugs into code without being detected. Sandbagging describes the hiding of dangerous capabilities by the AI, which are only revealed later. Finally, undermining oversight examines whether models can manipulate monitoring systems.

Initial tests with the company’s Claude 3 Opus and Claude 3.5 Sonnet models have already shown signs of these sabotage capabilities, particularly when it comes to influencing human decisions and inserting code bugs. However, Anthropic emphasizes that the sabotage attempts were often detected and the overall capabilities of the models are still limited. Therefore, only minimal countermeasures are needed at this time.

The tests are intended to help identify potentially dangerous capabilities early and develop countermeasures before the models are used publicly.

Stay up to date

Related posts: