Anthropic tests its AI models for sabotage capabilities
Anthropic has developed new security assessments for AI models that test their capacity for sabotage. In a blog post, the company describes four types of tests: “human decision sabotage,” “code sabotage,” “sandbagging,” and “undermining oversight.” In human decision sabotage, the models try to steer people toward wrong decisions without arousing suspicion. Code sabotage …