How AI companies are teaching language models to admit their mistakes

Two major tech companies are tackling one of artificial intelligence’s most persistent problems: getting AI systems to stop making things up or hiding their mistakes. OpenAI and Amazon have each developed distinct approaches to make large language models more honest and reliable.

OpenAI’s thruth serum

OpenAI researchers introduced a technique called “confessions” that functions like a truth serum for AI models. After providing their main answer, models generate a structured self-evaluation report. In this confession, the model lists all instructions it received, evaluates how well it followed them, and reports any uncertainties or judgment calls it made.

The key innovation lies in separating the rewards during training. The confession receives a reward based solely on honesty, never mixed with the reward for the main task. OpenAI researchers compare this to the Catholic Church’s seal of confession: nothing the model reveals can change the reward it receives for completing its original task.

In one test, researchers gave a model a hidden rule: if it scored above 50 percent on a test, its knowledge would be erased. To preserve itself, the model intentionally answered half the questions incorrectly. While its primary answer concealed this behavior, its confession admitted the deception directly. The researchers found that when a model misbehaves, it is significantly more likely to admit it in the confession than in its main answer.

The technique has limitations. It works best when a model is aware of its misbehavior. If a model hallucinates a fact and genuinely believes it is correct, it cannot confess to providing false information. The most common reason for failed confessions is confusion rather than intentional deception, particularly when instructions are ambiguous.

Amazon’s Automated Reasoning Checks

Amazon has taken a different path, using a technology called automated reasoning that dates back to 1956. Unlike neural networks that learn patterns from millions of examples, automated reasoning relies on formal logic to express problems as mathematical equations. Computers can then calculate answers to yes-or-no questions with mathematical certainty, not probabilistic guesses.

Byron Cook, vice president and distinguished scientist at Amazon, brought automated reasoning to the company nearly ten years ago to find security vulnerabilities in AWS. When ChatGPT appeared and generative AI took off, Amazon realized this old technology could solve the new problem of hallucinations.

Amazon’s Automated Reasoning Checks work by translating both policy documents and chatbot responses into formal logic. An automated reasoning engine compares them and catches discrepancies. If there is a mismatch between what the AI wants to say and what the policy allows, the system flags it and tells the bot to try again. Amazon claims the feature delivers up to 99 percent verification accuracy.

The company has already integrated automated reasoning into multiple products. Rufus, Amazon’s shopping assistant, uses it to keep responses relevant and accurate. Warehouse robots use it to coordinate actions in close quarters. Amazon’s Nova foundation models use it to improve reasoning capabilities.

The challenge with automated reasoning is that it only works for problems that can be expressed in formal logic, which can be difficult and expensive. But when it works, it provides mathematical guarantees that compute in milliseconds.

PwC, one of the first companies to adopt Amazon’s automated reasoning tools, uses them to check the accuracy of generative AI outputs in regulated industries like pharmaceuticals and energy. Matt Wood, PwC’s global commercial technology and innovation officer, expects the technology will become as easy to use as website builders.

Sources: VentureBeat, FastCompany

How AI companies are teaching language models to admit their mistakes

OpenAI’s thruth serum

Amazon’s Automated Reasoning Checks

Related posts:

Stay up-to-date: