Meta has released its newest generation of artificial intelligence models, Llama 4, introducing three variants with improved capabilities. The weekend release included two immediate offerings – Llama 4 Scout and Llama 4 Maverick – with a third model, Llama 4 Behemoth, still in development.
According to Meta, Llama 4 models mark “the beginning of a new era” for its AI ecosystem. These are Meta’s first models to use a mixture-of-experts (MoE) architecture, which makes them more computationally efficient by activating only a subset of parameters for each token processed.
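As a rough illustration of the idea, the sketch below shows top-k expert routing as it is commonly implemented in MoE layers. The dimensions, expert count, and gating scheme here are illustrative assumptions, not details of Meta’s actual implementation.

```python
# Minimal sketch of top-k mixture-of-experts routing (illustrative; not Meta's code).
# Only the experts the router selects run for a given token, so per-token compute
# scales with active parameters rather than total parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=1):
        super().__init__()
        # n_experts=16 loosely mirrors a Scout-style config; sizes are arbitrary.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.top_k = top_k

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # per-token expert choices
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e          # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

tokens = torch.randn(8, 64)   # 8 tokens, d_model=64
layer = MoELayer()
print(layer(tokens).shape)    # torch.Size([8, 64])
```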
Key model specifications
- Llama 4 Scout: A 17 billion active parameter model with 16 experts (109 billion total parameters), featuring a 10 million token context window that Meta describes as industry-leading. Meta positions Scout as suitable for multi-document summarization and reasoning over large codebases.
- Llama 4 Maverick: A 17 billion active parameter model with 128 experts (400 billion total parameters), described as Meta’s “workhorse” for general assistant and chat use cases. Meta claims Maverick exceeds GPT-4o and Gemini 2.0 Flash on a range of benchmarks.
- Llama 4 Behemoth: Still in training, this model has 288 billion active parameters with 16 experts (nearly 2 trillion total parameters). Meta says it outperforms GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Pro on STEM-focused benchmarks.
Both Scout and Maverick are available for download from llama.com and Hugging Face, and have been integrated into Meta AI, the company’s assistant across WhatsApp, Messenger, and Instagram Direct. Meta CEO Mark Zuckerberg also said that a fourth model, Llama 4 Reasoning, would be announced “in the next month.”
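For developers who want to try the open weights, a common route is the Hugging Face transformers library. The snippet below is a minimal sketch: the repository ID follows Meta’s usual naming pattern but is an assumption here, and the actual repositories are license-gated, so you must accept the license and authenticate before downloading.

```python
# Minimal sketch of loading a Llama 4 checkpoint via Hugging Face transformers.
# The model ID below is an assumed name based on Meta's naming convention;
# check the Hugging Face hub for the exact repository and license gating.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the key differences between Llama 4 Scout and Maverick."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```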
Reception and controversy
Despite Meta’s claims about its models’ capabilities, the AI community has raised significant concerns following the release. Initial testing by researchers and community members showed inconsistent performance, particularly on coding tasks.
The most notable controversy erupted when Meta was found to have submitted a version of Llama 4 Maverick to the AI benchmark site LMArena that differed from the publicly available release. This “experimental chat version” was specifically “optimized for conversationality” and secured the number-two spot on the leaderboard.
LMArena later posted on X (formerly Twitter) that “Meta’s interpretation of our policy did not match what we expect from model providers” and announced they would update their leaderboard policies to ensure fair evaluations.
Meta’s VP and Head of GenAI Ahmad Al-Dahle responded to criticisms by stating: “We’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in.”
Al-Dahle also denied allegations that Meta had “trained on test sets” to game benchmarks, calling such claims “simply not true.”
Political stance adjustments
In its release, Meta highlighted that Llama 4 models have been tuned to refuse to answer “contentious” questions less often than previous versions. The company states that Llama 4 responds to “debated” political and social topics that earlier Llama models wouldn’t address.
According to Meta, Llama 4 refuses fewer prompts on debated political and social topics overall (down from 7% of such prompts in Llama 3.3 to below 2%) and is “dramatically more balanced” in which prompts it declines to answer.
These adjustments come amid accusations from some political figures that AI chatbots have political biases, though Meta maintains its goal is to make Llama “more responsive” without favoring particular viewpoints.
As Meta prepares for its developer conference LlamaCon on April 29, the mixed initial reception of Llama 4 suggests the company will face pointed questions about its benchmark practices and real-world model performance.
Sources: Meta, TechCrunch, Engadget, VentureBeat