What a difference a weekend makes: Just a few days ago, the new Reflection 70B was considered a potential champion among open-source models. According to official information, the model is based on Meta’s Llama 3.1 70B, but with a special feature: it is trained not to give an immediate answer, but to analyze the problem first and check its own solution in a “reflection” phase. In this way, Reflection 70B is supposed to recognize its own mistakes. Its creators also promised that it would outperform other free models in key benchmarks and even compete with commercial offerings such as ChatGPT or Claude.
But as Carl Franzen summarizes in his article on VentureBeat, considerable doubts about those promises emerged just a few days later. Some even suspect a scam.
For one thing, there is evidence that the model is not based on the current Llama 3.1, but on the older 3.0. More importantly, independent testers were initially unable to reproduce the claimed benchmark results. Matt Shumer then explained that something had gone wrong when the model was uploaded to Hugging Face. He later stated that the model had to be retrained, which left experts confused.
In addition, some testers believe that the official API does not provide access to Reflection 70B at all, but to Anthropic’s Claude 3.5 Sonnet. As the evidence mounted, it later appeared that OpenAI’s GPT-4o was being used instead. What exactly was and is going on behind the scenes remains unclear.
Whatever comes out in the end, the episode shows that you can’t blindly trust the promises of developers and vendors – whether small teams like the one behind Reflection 70B or large companies like OpenAI and Google.