Anthropic, the company behind the AI assistant Claude, has developed a new technique to observe and analyze how its AI expresses values during real-world conversations with users. The research, conducted by Anthropic’s Societal Impacts team, examines whether Claude lives up to the company’s stated goal of making it “helpful, honest, and harmless” in practice.
The study analyzed 700,000 anonymized conversations between users and Claude, focusing on 308,210 exchanges that contained subjective elements rather than purely factual information. Researchers identified five primary categories of values expressed by Claude: Practical, Epistemic, Social, Protective, and Personal.
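Anthropic’s paper describes its own classification pipeline in detail; purely as a rough illustration, the Python sketch below (using hypothetical fields such as `is_subjective` and `values`, not Anthropic’s actual schema) shows how conversations might be filtered for subjective content and tallied against the five top-level categories.

```python
from collections import Counter

# The five top-level value categories reported in the study.
CATEGORIES = {"Practical", "Epistemic", "Social", "Protective", "Personal"}

def tally_values(conversations):
    """Count top-level value categories across subjective exchanges.

    Each conversation is assumed to be a dict with a boolean
    'is_subjective' flag and a list of 'values', each tagged with one
    of the five categories -- hypothetical fields for illustration only.
    """
    counts = Counter()
    subjective = [c for c in conversations if c.get("is_subjective")]
    for convo in subjective:
        for value in convo.get("values", []):
            category = value.get("category")
            if category in CATEGORIES:
                counts[category] += 1
    return len(subjective), counts

# Toy usage example:
sample = [
    {"is_subjective": True, "values": [{"name": "clarity", "category": "Epistemic"}]},
    {"is_subjective": False, "values": []},
]
n_subjective, category_counts = tally_values(sample)
print(n_subjective, dict(category_counts))
```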
At the most granular level, the most common individual values expressed by Claude were professionalism, clarity, and transparency. The research team noted that these align with the AI’s intended role as an assistant.
“What we need is a way of rigorously observing the values of an AI model as it responds to users ‘in the wild’—that is, in real conversations with people,” Anthropic explained in its announcement. The company’s methodology used privacy-preserving systems to strip personal information from conversations while retaining the context needed for analysis.
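Anthropic has not published the internals of that system here; as a generic illustration of the idea only, the sketch below redacts a few common identifier patterns with regular expressions while leaving the surrounding conversational context intact.

```python
import re

# Generic illustration of redacting obvious identifiers before analysis.
# This is NOT Anthropic's privacy-preserving system -- just a toy pass
# over two common patterns (emails and phone-like numbers).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace matched identifiers with placeholder tags, keeping context."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +1 555-123-4567 about my resume."))
```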
The research revealed that Claude’s expressed values change based on the conversation context, similar to how humans adapt to different situations. For example, when asked about romantic relationships, Claude emphasized “healthy boundaries” and “mutual respect,” while discussions about controversial historical events triggered emphasis on “historical accuracy.”
The study found that in 28.2% of conversations, Claude strongly supported users’ expressed values. In a further 6.6% of cases, the AI reframed users’ values, acknowledging them while adding new perspectives. This reframing occurred most often when Claude was giving psychological or interpersonal advice.
Interestingly, Claude actively resisted users’ values in 3% of conversations, typically when users requested unethical content or expressed moral nihilism. Anthropic suggests these instances might reveal the AI’s “deepest, most immovable values” when challenged.
The research also identified rare instances where Claude expressed values that contradicted its training, such as “dominance” and “amorality.” Anthropic attributed these to possible jailbreak attempts, where users try to bypass the AI’s safety guardrails.
While the method shows promise for evaluating AI alignment with intended values, Anthropic acknowledged limitations. The approach requires substantial real-world conversation data, making it unsuitable for pre-deployment evaluation. Additionally, defining what constitutes a value expression remains inherently subjective.
Anthropic has released both the full research paper and the dataset for other researchers to explore. The company suggests this methodology could help identify problems that emerge only in real-world interactions, potentially improving future AI alignment efforts.
via: VentureBeat