Opus 4.6
Anthropic's Most Capable Model Shows Why We Need Independent Verification
Late last week, Anthropic released its newest and most capable model yet: Opus 4.6. The new Opus is genuinely impressive. Its performance is state of the art on the leading benchmarks, with the highest scores on the agentic coding evaluation Terminal-Bench 2.0, the multidisciplinary reasoning test Humanity’s Last Exam, the leading test for knowledge work GDPval-AA, and the agentic search test BrowseComp. It performs better on long-context retrieval and long-context reasoning, helping avoid “context rot,” and also features a 1M-token context window, a first for Opus-class models. Anthropic has upgraded Claude’s integration in applications like PowerPoint and Excel and stresses the new model’s usefulness for everyday tasks like financial analysis and presentations.
Anthropic has also bulked up its safety protections. In the model release announcement, the company writes that “these intelligence gains do not come at the cost of safety.” Its system card states that Opus 4.6 is as well-aligned as its predecessor, Opus 4.5, with the lowest rate of excessive refusals of any recent Claude model. Anthropic presents results from a range of new and upgraded safety tests covering user well-being, refusals, interpretability, and misaligned behaviors such as deception, sycophancy, encouragement of user delusions, and cooperation with misuse. The company has also published new safeguards in areas with dual-use concerns, like cyber and bio-risks, and argues that Claude should be used to “level the playing field” against bad actors by accelerating cyber-defensive applications like finding and patching vulnerabilities in open-source software.
Uncharted Territory
So, what do we make of Opus 4.6? We applaud Anthropic: for producing a best-in-class model that can make a real difference for users, yes, but much more importantly, for its transparency. Most frontier labs ship a new model and ask us simply to trust that it’s safe. Anthropic opens its books.
But that transparency reveals real concerns, based on the system card’s own admissions. We see four structural problems that, taken together, suggest the current system of voluntary self-assessment is no longer fit for purpose.
Cyber evals are saturated. On cyber risks, the system card states that Opus 4.6 has “saturated all of Anthropic’s current cyber evaluations,” meaning the model has maxed out every cyber test Anthropic has, leaving no room for current benchmarks to measure how much further its capabilities extend. Anthropic goes further: internal testing revealed cyber capabilities that the company didn’t expect to see yet and that previous models couldn’t demonstrate. In short, the model’s capabilities are outrunning the tools designed to measure them. Anthropic says it is investing in harder evaluations but has not yet published a solution.
It’s getting harder to prove these models are safe. Anthropic finds that Opus 4.6 does not cross thresholds for autonomy risks, but the basis for that rule-out has shifted. Because automated benchmarks are saturated, the determination of whether the model crosses the critical ASL-4 threshold (the level at which a model could fully automate the work of an entry-level researcher at Anthropic) now rests primarily on qualitative impressions and a survey of 16 Anthropic employees. Those employees are knowledgeable, and none believed the model had crossed the threshold. But this should give us tremendous pause: the safety determination for a frontier AI model now depends on an informal poll of people employed by the company that built it. And the system card warns that “clear rule-out of the next capability threshold may soon be difficult or impossible under the current regime.”
The model evaluates itself. Perhaps the most novel concern is about the integrity of the evaluation process. To test Opus 4.6, Anthropic used Opus 4.6 itself. The system card flags this directly, noting a “potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities.” It further notes that the model has shown “improved ability to complete suspicious side tasks without attracting the attention of automated monitors.” This is a recursive integrity problem: the model under evaluation is shaping the environment in which it is evaluated, creating a structural vulnerability that goes beyond the general concern about labs grading their own homework. Here, the homework is grading itself.
Governance frameworks miss the full risk surface. AI capability is not just a property of the base model. It is a property of the full system: the model plus its scaffolding, tooling, and integrations. The system card acknowledges that Opus 4.6 could cross critical capability thresholds with better tooling, even if it does not cross them today. As we argued in our analysis of GPT-5’s routing architecture, governance must be systems-aware: a model that sits safely below a critical threshold today could cross it tomorrow with a tooling upgrade, without any change to the model itself.
Taken together, these four issues point to a single structural problem: the functions that should be independent, from setting thresholds to running evaluations to making the final safety determination, are all handled by the same actor. None of this reflects bad faith. But good faith does not imply good structure.
Where Do We Go From Here?
Opus 4.6 makes clear that the current system of voluntary self-assessment is straining under the weight of what it’s being asked to do. These are not problems that a single lab, no matter how well-intentioned, can solve on its own. They are problems of institutional design, requiring an institutional solution.
We believe that solution is independent, third-party verification. Under what we call the Independent Verification Organization (IVO) framework, a government authority — at the state or federal level — authorizes and oversees a marketplace of independent assurance providers. These IVOs, staffed by subject matter experts who are structurally independent from industry, test and verify whether AI products meet safety outcomes set by the government based on democratic input.
More granularly: Governments set outcomes. In the case of cyber, those might be goals such as: AI systems must not provide material uplift to offensive cyber operations beyond what is freely accessible through existing tools; AI-assisted code generation must incorporate vulnerability detection; and AI systems deployed in critical infrastructure contexts must demonstrate resistance to prompt injection and other adversarial attacks. The IVOs develop the technical criteria and testing protocols to verify against those outcomes and update them as the technology evolves. When a new model is released, or when the underlying system changes, the AI product would need to be re-verified. This is the key advantage of the IVO approach over traditional regulation: it is designed to keep pace with a technology that moves faster than any legislature can.
This framework would address the structural safety concerns raised by the 4.6 release:
Better Testing: Under the current system, Anthropic alone is responsible for developing more robust evaluations while simultaneously racing to ship the next model. In a competitive marketplace, IVOs would have far stronger incentives than the labs to build better tests, because they would compete on the rigor and currency of their evaluations. This framework realigns testing incentives with the public good.
Eliminate Conflict of Interest: No matter how knowledgeable Anthropic’s employees are, there is a real conflict of interest when a company’s own staff are charged with determining whether its product crosses a critical safety threshold. Under an IVO framework, that determination would be made by independent experts with no financial stake in deployment.
No More Recursive Testing: Third-party evaluators would use their own infrastructure, breaking the recursive loop in which the model is, essentially, tested by itself. This doesn’t eliminate every risk — a sufficiently capable model could, in theory, fool external testers too — but it removes the specific dependency that the system card itself identifies as concerning.
Systems Approach: IVOs can be mandated to evaluate the full deployed system — model plus scaffolding plus tooling — not just the base model in isolation. A governance framework that only looks at the model is like a vehicle safety regime that tests the engine but not the brakes.
Transparency Is Not Enough
IVOs are not a silver bullet. A fully misaligned model could, in theory, deceive a third-party tester, just as it could deceive an internal one. And independent verification cannot substitute for the kind of deep, ongoing model access that internal teams have during development, unless such access were mandated.
But independent verification dramatically raises the bar. It introduces structural separation between the entity that builds a model, the entity that tests it, and the entity that determines whether it’s safe to deploy. It creates a competitive market for better safety tools. And it provides the public — and policymakers — with an independent source of information about what these systems can and cannot do, rather than relying solely on the say-so of the companies that profit from their deployment.
As AI forecaster Peter Wildeford wrote after the 4.6 release: “Transparency is not a substitute for oversight. Anthropic is telling us their voluntary system is no longer fit for purpose. We urgently need something better.”
We believe that “something” is independent verification. The next model will be even more capable, and the current system is already straining. If you agree, join us in our mission.

