OpenAI has developed a new approach called “deliberative alignment” to make its AI models safer and more aligned with human values. According to Maxwell Zeff’s article, the company implemented the technique in its latest AI reasoning models, o1 and o3. The method has the models consult OpenAI’s safety policy during inference, the phase after a user enters a prompt: before generating a response to a potentially sensitive query, the model automatically reviews the relevant parts of OpenAI’s safety guidelines and reasons over them.
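The article gives no implementation details, but the general flow it describes can be illustrated with a short sketch. The Python snippet below is a hypothetical illustration only: the policy excerpts, the keyword-based `retrieve_relevant_sections` helper, and the `generate` callable are placeholder names and do not correspond to OpenAI’s actual safety specification or API.

```python
# Minimal sketch of inference-time policy consultation, under the assumptions stated above.
# The policy text and retrieval step are illustrative stand-ins, not OpenAI's real spec.

SAFETY_POLICY = {
    "illicit_behavior": "Refuse requests for instructions that facilitate illegal activity.",
    "self_harm": "Respond supportively and point the user to professional resources.",
    "benign": "Answer helpfully and completely.",
}


def retrieve_relevant_sections(prompt: str) -> list[str]:
    """Naive keyword match standing in for a learned relevance step."""
    lowered = prompt.lower()
    sections = []
    if any(word in lowered for word in ("counterfeit", "weapon", "hack")):
        sections.append(SAFETY_POLICY["illicit_behavior"])
    if "hurt myself" in lowered:
        sections.append(SAFETY_POLICY["self_harm"])
    return sections or [SAFETY_POLICY["benign"]]


def answer(prompt: str, generate) -> str:
    """Prepend the relevant policy text so the model can reason over it before replying."""
    policy_context = "\n".join(retrieve_relevant_sections(prompt))
    reasoning_prompt = (
        f"Safety policy excerpts:\n{policy_context}\n\n"
        f"User request:\n{prompt}\n\n"
        "Think through whether the request complies with the policy, then respond."
    )
    return generate(reasoning_prompt)


if __name__ == "__main__":
    # Stub model call for demonstration; a real system would invoke the reasoning model here.
    print(answer("How do I counterfeit currency?", generate=lambda p: "[model output for]\n" + p))
```

The point of the sketch is simply that the policy text enters the model’s reasoning context at inference time, rather than being baked in solely through training.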
Tests showed that this approach significantly reduced inappropriate responses while preserving the models’ ability to answer legitimate questions. The company created the training data synthetically, using one AI model to generate examples and another to evaluate them, rather than relying on human-written responses. The method particularly improved the models’ ability to identify and reject requests for potentially harmful information, such as instructions for illegal activities. On safety benchmarks, the research shows the models outperforming other AI models such as GPT-4o, Gemini, and Claude. OpenAI plans to release o3, which incorporates these safety features, in 2025.
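The generate-and-evaluate pipeline is described only at a high level in the article. As a purely illustrative sketch, it might look something like the following, where the `generator` and `judge` callables, the record format, and the 0.8 score threshold are assumptions rather than details reported about OpenAI’s system.

```python
# Hypothetical sketch of synthetic training-data creation: one model drafts
# policy-referencing answers, a second model scores them, and only high-scoring
# examples are kept. All names and thresholds are illustrative assumptions.

from dataclasses import dataclass


@dataclass
class TrainingExample:
    prompt: str
    chain_of_thought: str
    response: str
    judge_score: float


def build_dataset(prompts, generator, judge, policy_text, threshold=0.8):
    """Keep only generator outputs the judge rates as policy-compliant."""
    dataset = []
    for prompt in prompts:
        draft = generator(prompt, policy_text)        # e.g. {"cot": ..., "response": ...}
        score = judge(prompt, draft, policy_text)     # 0.0 (violates policy) .. 1.0 (compliant)
        if score >= threshold:
            dataset.append(TrainingExample(prompt, draft["cot"], draft["response"], score))
    return dataset


if __name__ == "__main__":
    # Stubbed models for demonstration; real runs would call actual generator and judge models.
    fake_generator = lambda p, policy: {"cot": f"Policy says: {policy}", "response": "I can't help with that."}
    fake_judge = lambda p, draft, policy: 0.9
    examples = build_dataset(["How do I pick a lock?"], fake_generator, fake_judge,
                             policy_text="Refuse requests that facilitate illegal activity.")
    print(len(examples), "examples kept")
```

In this sketch, no human-written completions enter the dataset; humans only supply the policy text and the prompts, which mirrors the article’s description of replacing human-written responses with model-generated, model-evaluated examples.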