Evaluating Anthropic’s Claude Opus 4.5: Real Improvements and When It’s Worth Your Time
Anthropic just dropped Claude Opus 4.5, its latest top-tier AI model, and it's drawing plenty of attention in a crowded field of frontier models. The release lands close on the heels of Google's Gemini 3 Pro and OpenAI's GPT-5.1, putting Opus 4.5 squarely in the mix. But does it change enough to make you switch? We look at the main updates based on reports from Ars Technica, Decrypt, Livemint, and CNBC, plus The New Stack and CNET, focusing on performance, cost, and daily use.
Big Gains in Coding and Agent Tasks
One of Opus 4.5's headline results is its coding benchmark performance. According to Anthropic's own evaluations, reported in Ars Technica, it scored 80.9% on SWE-Bench Verified, a test built from real-world software engineering tasks, beating OpenAI's GPT-5.1-Codex-Max at 77.9% and Google's Gemini 3 Pro at 76.2%. That makes it the first model to clear 80% on the benchmark, a meaningful signal for developers tackling tough code changes or agent-driven workflows. The New Stack calls it a return to the top for coding AIs.
Livemint echoes this, noting Opus 4.5 also led τ2-bench, which measures multi-turn agent tasks, for example by handling a tricky airline booking scenario with a clever upgrade workaround. And in Anthropic's internal two-hour engineering exam, it beat every human applicant, raising questions about AI's place in coding jobs.
Decrypt points out that early developer feedback backs this up: Simon Willison refactored a project with 20 commits across 39 files using Opus 4.5, calling it an excellent model, while Theo Browne called it the best coding AI yet in a video review. That said, Willison noted the jump from the mid-tier Sonnet 4.5 wasn't dramatic for his tasks, so if you're already on Sonnet, the upgrade may feel small for routine work.
Smarter Conversations and Better Safety
Beyond code, Opus 4.5 improves how it handles long chats. Ars Technica reports that instead of cutting conversations off at the 200,000-token limit, it now summarizes earlier turns to keep the thread coherent, a feature rolling out to all Claude models in the apps. That means fewer frustrating mid-conversation stops, which helps in long research or brainstorming sessions.
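To make the idea concrete, here's a minimal Python sketch of that kind of rolling-summary compaction. Everything in it is illustrative: the thresholds, the token estimate, and the summarize() helper are assumptions, not Anthropic's actual implementation.

```python
# Illustrative sketch of compacting chat history near a token limit.
# NOT Anthropic's implementation; summarize() is a hypothetical helper
# (e.g., a separate model call that condenses old turns into a paragraph).

TOKEN_LIMIT = 200_000   # the advertised context window
COMPACT_AT = 180_000    # start compacting well before the hard limit

def estimate_tokens(messages):
    # Rough stand-in for a real tokenizer: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact_history(messages, summarize):
    """If the conversation nears the limit, replace the oldest half
    of the turns with a single summary message and keep the rest."""
    if estimate_tokens(messages) < COMPACT_AT or len(messages) < 4:
        return messages
    half = len(messages) // 2
    summary = summarize(messages[:half])  # condense the oldest turns
    return [
        {"role": "user", "content": f"[Summary of earlier conversation: {summary}]"}
    ] + messages[half:]
```

The payoff of this pattern is that the model keeps seeing recent turns verbatim while older context survives only as a compressed summary, so the session never has to hard-stop.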
On safety, Livemint highlights Opus 4.5 as Anthropic’s most aligned model yet, resisting prompt injections better than rivals. It’s harder to trick into harmful actions, which matters for enterprise use where reliability counts.
Pricing Drop Makes It More Appealing
The other big change is price. Anthropic cut Opus-class API rates to roughly a third of Opus 4.1's (a quick cost sketch follows the list):
- Input tokens: $5 per million (down 67% from Opus 4.1’s $15)
- Output tokens: $25 per million (from $75)
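Here's what those rates mean in practice, as a quick back-of-the-envelope calculation in Python; the workload numbers are invented for illustration.

```python
# Back-of-the-envelope cost at Opus 4.5's published API rates.
INPUT_PER_M = 5.00    # USD per million input tokens
OUTPUT_PER_M = 25.00  # USD per million output tokens

def cost_usd(input_tokens, output_tokens):
    return (input_tokens / 1e6) * INPUT_PER_M + (output_tokens / 1e6) * OUTPUT_PER_M

# Hypothetical agentic coding session: 2M tokens in, 300K tokens out.
print(f"${cost_usd(2_000_000, 300_000):.2f}")  # $17.50
# The same workload at Opus 4.1's old rates ($15/$75) would have cost:
print(f"${2.0 * 15 + 0.3 * 75:.2f}")           # $52.50
```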
Decrypt details how this positions Opus 4.5 between cheaper options like Gemini 3 Pro ($2/$12) and the pricier older Opus models, though it still costs more than some competitors. The cut responds to market pressure and makes large-scale use realistic without runaway bills. CNBC adds that it's now the default model for Pro, Max, and Enterprise plans, available via the API, AWS Bedrock, Google Vertex AI, and the Claude apps. CNET notes it handles office tasks like spreadsheets and slides well, targeting professionals who rely on it for day-to-day work.
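If you want to try it over the API, a minimal call with Anthropic's Python SDK looks roughly like this; the model ID string is my assumption based on Anthropic's naming convention, so check the current docs before relying on it.

```python
# Minimal Messages API call via the official anthropic SDK (pip install anthropic).
# Requires ANTHROPIC_API_KEY to be set in your environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-5",  # assumed alias; confirm the exact ID in Anthropic's docs
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Refactor this function to remove the duplicate loop: ..."}
    ],
)
print(response.content[0].text)
```

The same model should be addressable through Bedrock and Vertex AI with their respective SDKs, though each platform uses its own model identifier.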
Is It Worth Using?
If you're deep into coding, agents, or long enterprise tasks, yes: Opus 4.5 leads the benchmarks and adds real benefits like better context handling and safety. The price cut makes it attractive for teams on a budget. But for casual or speed-focused needs, sticking with Sonnet 4.5 or even Haiku may be enough, since some testers found the differences small. Compared with GPT-5.1 or Gemini 3 Pro, it stands out in coding but trails in visual reasoning, per Ars Technica. Test it yourself; Anthropic's rapid release pace keeps things moving, and this one is a solid advance rather than an overhaul.