GPT‑5.4 aims to handle real professional work as OpenAI expands agent-style AI

OpenAI has released GPT‑5.4, a new AI model designed for professional tasks such as coding, document creation, spreadsheet analysis, and multi‑step workflows. The company positions the model as its most capable system for knowledge work and software development so far.

The model is available across ChatGPT, the OpenAI API, and the company’s coding tool Codex. Two main variants are offered: GPT‑5.4 Thinking, a reasoning model integrated into ChatGPT, and GPT‑5.4 Pro, a higher‑performance version aimed at complex workloads in enterprise and developer environments.

According to OpenAI, GPT‑5.4 combines improvements in reasoning, coding, and tool use. The model also introduces native computer‑use capabilities, allowing it to operate software through screenshots, keyboard and mouse commands, or automation libraries such as Playwright.
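The screenshot-in, action-out loop this describes can be sketched abstractly. OpenAI has not published this interface, so every name below (`decide_action`, `run_agent`, the `Action` type) is a placeholder for illustration; the stubbed policy stands in for a real model call.

```python
# Abstract sketch of a screenshot-driven computer-use loop.
# All names here are illustrative placeholders, not OpenAI's API.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # e.g. "click", "type", "done"
    payload: str = ""

def decide_action(screenshot: bytes, goal: str) -> Action:
    """Stand-in for the model call: map one observation to one action.

    A real agent would send the screenshot to the model; a tiny
    hard-coded policy keeps this sketch runnable.
    """
    if b"login" in screenshot:
        return Action("type", "user@example.com")
    return Action("done")

def run_agent(goal: str, screen_states: list[bytes], max_steps: int = 10) -> list[Action]:
    """Observe-act loop: screenshot in, keyboard/mouse action out."""
    trace = []
    for screenshot in screen_states[:max_steps]:
        action = decide_action(screenshot, goal)
        trace.append(action)
        if action.kind == "done":
            break
    return trace

trace = run_agent("fill in the login form",
                  [b"<screen: login form>", b"<screen: dashboard>"])
print([a.kind for a in trace])  # → ['type', 'done']
```

In practice the action step would be executed by an automation layer such as Playwright (clicking, typing, taking the next screenshot), which is the kind of library OpenAI names.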

Focus on professional work

The company emphasizes that GPT‑5.4 is optimized for common office tasks. In internal tests simulating work by junior investment banking analysts, the model scored about 87 percent on spreadsheet modeling tasks, compared with about 68 percent for GPT‑5.2.

Human evaluators also preferred presentations created by GPT‑5.4 in 68 percent of comparisons with the earlier model. Reviewers cited stronger visual design and more varied use of generated images.

OpenAI says the model also reduces factual errors. In a dataset of prompts where users had previously flagged mistakes, GPT‑5.4 produced 33 percent fewer incorrect claims and 18 percent fewer responses containing any errors compared with GPT‑5.2.

Benchmarks and knowledge‑work tests

OpenAI highlights results from its GDPval benchmark, which measures the ability of AI systems to complete knowledge‑work tasks across 44 occupations in nine industries. According to the company, GPT‑5.4 matched or exceeded human professionals in 83 percent of comparisons.

The benchmark includes tasks such as financial analysis, drafting legal documents, project planning, and media production. The models' outputs are reviewed and graded by human professionals.

Some external testers report similar results. Recruiting platform Mercor said GPT‑5.4 leads its APEX‑Agents benchmark, which evaluates AI systems on professional service tasks like financial modeling and legal analysis.

Native computer use and tool orchestration

One of the major technical changes is built‑in computer interaction. In the API and Codex, GPT‑5.4 can control software directly and carry out workflows across multiple applications.

Benchmarks cited by OpenAI show improvements in several environments:

  • 75 percent success on OSWorld‑Verified, which tests desktop navigation through screenshots and input commands
  • 67 percent success on WebArena‑Verified for browser tasks
  • 92.8 percent success on Online‑Mind2Web using screenshot observations

The model also introduces “tool search,” which lets the system retrieve tool definitions only when needed instead of loading them all into the prompt. OpenAI reports this reduced token usage by 47 percent in a benchmark with large tool ecosystems.
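The idea behind tool search can be illustrated with a small sketch. The registry, scoring, and function names here are illustrative assumptions, not OpenAI's implementation: instead of serializing every tool definition into the prompt, the agent runs a query and injects only the matching definitions.

```python
# Illustrative sketch of "tool search": keep tool definitions in a
# registry and inject only those matching the model's query, instead
# of loading all of them into the prompt. Hypothetical names throughout.
import json

TOOL_REGISTRY = {
    "create_invoice": {
        "description": "Create a PDF invoice from line items",
        "parameters": {"items": "list", "customer": "str"},
    },
    "query_database": {
        "description": "Run a read-only SQL query against the warehouse",
        "parameters": {"sql": "str"},
    },
    "send_email": {
        "description": "Send an email to a recipient",
        "parameters": {"to": "str", "subject": "str", "body": "str"},
    },
}

def search_tools(query: str, limit: int = 2) -> list[dict]:
    """Return only the tool definitions whose name or description matches."""
    terms = query.lower().split()
    scored = []
    for name, spec in TOOL_REGISTRY.items():
        text = (name + " " + spec["description"]).lower()
        score = sum(term in text for term in terms)
        if score:
            scored.append((score, name, spec))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [{"name": n, **spec} for _, n, spec in scored[:limit]]

# Only the matching definition reaches the prompt, not the whole registry.
matches = search_tools("sql query")
prompt_fragment = json.dumps(matches)
print([m["name"] for m in matches])  # → ['query_database']
```

With a large tool ecosystem, the prompt carries one or two definitions per turn instead of hundreds, which is where the reported token savings would come from.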

Larger context and developer features

GPT‑5.4 supports context windows of up to one million tokens in the API and Codex, enabling long workflows with large documents or datasets. The model is also designed to use fewer reasoning tokens than earlier systems, which OpenAI says can reduce latency and total costs for some tasks.

Within ChatGPT, GPT‑5.4 Thinking outlines its approach before generating long answers. Users can adjust instructions while the model is still working.

The company is also introducing integrations with spreadsheet software such as Microsoft Excel and Google Sheets to support financial analysis and data modeling.

GPT‑5.4 is gradually rolling out across ChatGPT, Codex, and the API.

Sources: OpenAI, The Verge, Engadget, TechCrunch, ZDNet, VentureBeat
