Why OpenAI’s new models kept obsessing over goblins and how the company finally stopped it

OpenAI has identified the root cause of an unusual behavior in its latest AI models: a tendency to spontaneously mention goblins, gremlins, raccoons, trolls, ogres, and pigeons in otherwise unrelated conversations. The company traced the problem to a flaw in how it trained models to adopt different personality styles.

The issue first appeared after the launch of GPT-5.1, when OpenAI noticed that the word “goblin” appeared 175% more often in conversations compared to the previous model. The word “gremlin” rose by 52%. At the time, the increase seemed minor. By GPT-5.4, however, the trend had accelerated significantly.

OpenAI found that the behavior was concentrated among users who had selected the “Nerdy” personality, one of several customizable styles available in ChatGPT. Although the Nerdy personality accounted for only 2.5% of all ChatGPT responses, it was responsible for 66.7% of all goblin mentions. Between GPT-5.2 and GPT-5.4, the rate of goblin mentions under the Nerdy personality increased by over 3,800%.

The company traced the problem to a reward signal used during training. Reward signals are numerical scores that guide how an AI model learns to behave. OpenAI had designed a specific reward to encourage the Nerdy personality’s playful, quirky style. That reward, it turned out, consistently scored responses containing creature words higher than responses without them. In 76.2% of the training datasets examined, outputs containing “goblin” or “gremlin” received better scores.

Reinforcement learning does not keep learned behaviors neatly confined to the context in which they were trained. Once the model learned that creature language produced higher rewards in the Nerdy context, the style began to appear in other contexts as well. Model-generated responses containing these words were then reused in later training data, creating a feedback loop that spread the behavior further across model versions.

By GPT-5.5, the problem had become noticeable enough that OpenAI added explicit instructions to Codex CLI, a coding tool, telling the model: “Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.” Users of OpenClaw, a computer automation tool OpenAI acquired, reported that the model described software bugs as “gremlins” and “goblins.”

To address the root cause, OpenAI retired the Nerdy personality, removed the problematic reward signal from training, and filtered creature words from training data. The company states that GPT-5.5 had already begun training before the root cause was identified, which is why the instructions in Codex were added as a temporary workaround.

OpenAI describes the episode as a demonstration of how small training incentives can produce unexpected model behavior and spread beyond their intended scope. The company says the investigation led to new internal tools for auditing and correcting model behavior.

Sources: Wired, OpenAI

Stay up to date

AI for content creation: the latest tools, tips and trends. Every two weeks in your inbox:

More info …

About the author

Related posts:

Advertisement

×