The research paper “Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse” investigates the effectiveness of chain-of-thought (CoT) prompting in large language and multimodal models. While CoT generally improves model performance across a wide range of tasks, the authors explore scenarios where it can actually hinder performance, drawing on findings from cognitive psychology about tasks in which deliberate verbal thinking makes humans worse.
They identify three task categories (implicit statistical learning, visual recognition, and classification with exceptions) where CoT leads to significant performance drops in models, with accuracy reductions of up to 36.3%. They also identify a second set of tasks where verbal reasoning hurts human performance but where CoT leaves model performance largely unaffected.
The study emphasizes that although models and humans share some cognitive tendencies, they operate under different constraints, which shapes whether CoT prompting helps or hurts. Through extensive experiments, the authors find that CoT can substantially impair model performance precisely when the constraints that make deliberate verbal thinking costly for humans also apply to models.
They suggest that understanding these dynamics can help refine the application of CoT prompting in AI. The paper concludes with recommendations for leveraging psychological insights to better evaluate and improve model performance in various tasks. Overall, it highlights the complexity of inference-time reasoning in AI and the need for careful consideration of when to apply CoT techniques.
An example of a system that applies CoT-style reasoning at inference time is OpenAI's o1 family of models.
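To make the two prompting conditions concrete, below is a minimal Python sketch of how a direct prompt and a CoT prompt for the same item might be constructed. The exact prompt wording and the query_model placeholder are illustrative assumptions, not the paper's actual prompts; the only intended difference between the two conditions is whether the model is asked to verbalize its reasoning before answering.

def build_direct_prompt(question: str) -> str:
    # Baseline condition: ask for the final answer only, with no reasoning.
    return question + "\nAnswer with only the final label."

def build_cot_prompt(question: str) -> str:
    # Chain-of-thought condition: nudge the model to verbalize intermediate
    # reasoning steps before committing to an answer.
    return question + "\nLet's think step by step, then state the final answer."

def query_model(prompt: str) -> str:
    # Hypothetical placeholder for a call to any LLM API; not part of the paper.
    raise NotImplementedError("Replace with a real model call.")

if __name__ == "__main__":
    # Illustrative item in the spirit of an implicit-statistical-learning task,
    # one of the categories where the paper reports that CoT hurts accuracy.
    question = "Does this letter string follow the hidden grammar from the training examples?"
    print(build_direct_prompt(question))
    print(build_cot_prompt(question))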