The Missing Layer
Trust Doesn't Scale Without Infrastructure
Every major sector of the American economy is now deploying AI into decisions that carry real financial, legal, clinical, and reputational consequences. Enterprise spending on generative AI reached an estimated $37 billion in 2025, more than tripling year-over-year, and worldwide AI spending is projected to reach $2.5 trillion by 2027. But the institutions and processes that have traditionally helped industries scale with public confidence, such as shared standards, independent verification, and clear lines of accountability, haven’t kept pace. That gap is already shaping adoption decisions, liability exposure, and regulatory outcomes across the economy.
Over the past several months, we’ve been talking with leaders across industries, including healthcare, financial services, real estate, and the emerging field of AI assurance itself, to understand how organizations are actually navigating this environment. What we saw, consistently, was impressive work being done under genuinely difficult structural conditions, as well as a shared recognition that no single organization can build the earned trust infrastructure the economy needs.
I. The State of Play: The Accountability Asymmetry
“You can outsource a task to AI, but you cannot outsource the accountability.” – Dave Garland, Managing Partner at Second Century Ventures
Today, there is a major mismatch between who builds AI systems and who bears the consequences when they fail.
The models powering enterprise AI are overwhelmingly built by a handful of frontier labs. The organizations deploying those models are integrating capabilities they didn’t develop into workflows they are fully accountable for. When an AI system produces a bad outcome, the consumer doesn’t sue the model provider. They sue the company whose name is on the product.
In real estate, the world’s largest asset class at over $300 trillion, practitioners are integrating AI into underwriting, lease analysis, and transaction execution. The foundational models underneath those tools are built by companies with legal teams designed to shield them from downstream liability. But when an AI hallucination leads to a compliance failure or a discriminatory lending decision, it’s the deployer who is accountable. Garland, whose firm is the strategic investment arm of the National Association of Realtors, sees this across the industry: “Legacy industries are being asked to absorb 100% of the liability for black box systems they neither control nor fully audit. That’s a massive asymmetry of risk.”
That asymmetry runs across sectors. Steve Abbott, Head of Government Affairs at Gusto, an HR platform for small businesses that serves over 500,000 customers, described a layered approach: policy experts, legal review, and AI-powered monitoring tools work in concert to identify and evaluate changes in state and federal requirements. Human experts then validate and maintain the compliance guidance that customer-facing tools rely on.
In healthcare, the stakes are perhaps the most visceral. Dr. Girish Nadkarni, Chief AI Officer at Mount Sinai Health System, described a system that uses LLM-as-judge evaluations, human sign-off, and quantified error rates to monitor AI-assisted clinical tools. The approach is sophisticated, but the challenge is structural: “You don’t know when is the first time or the one time that you’re wrong.” Human-in-the-loop review helps, but it also introduces its own failure modes, including automation bias and alert fatigue, which compound as the volume of AI-assisted decisions grows.
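One way to see why a quantified error rate never settles the question of "the one time that you're wrong" is a small statistical sketch. It assumes, purely for illustration, that reviewers check a random sample of independent AI outputs and find no errors; the function name and the sample sizes are hypothetical and do not describe Mount Sinai's method.

```python
def zero_error_upper_bound(reviewed, confidence=0.95):
    """Upper bound on the true error rate after a clean review.

    Assumes `reviewed` outputs were sampled at random, judged independently,
    and that none of them was found to be wrong. If the true error rate is p,
    the chance of a clean sample is (1 - p) ** reviewed; solving
    (1 - p) ** reviewed = 1 - confidence for p gives the bound below.
    """
    if reviewed <= 0:
        raise ValueError("reviewed must be a positive count")
    alpha = 1.0 - confidence
    return 1.0 - alpha ** (1.0 / reviewed)


# Hypothetical review volumes: a clean sample narrows the bound
# but never brings it to zero.
for n in (300, 3000, 30000):
    print(n, round(zero_error_upper_bound(n), 6))
```

Under these assumptions, even a spotless review of a few hundred outputs is consistent with an error rate of roughly one in a hundred, which is why review volume and ongoing monitoring matter as much as the initial check.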
The most sophisticated deployers like these are investing heavily, building governance frameworks, and thinking carefully about risk. They share a recognition that they are each building those frameworks largely from scratch, without shared infrastructure that would let them demonstrate trustworthiness in ways that regulators, insurers, and the public can recognize and understand.
To be clear: this is not a story of negligent model developers shifting risk onto unsuspecting deployers. The frontier labs that build foundational models largely do not have line of sight into how their models are being used in practice: what workflows they’re embedded in, what decisions they’re informing, or what populations they’re affecting. There is a genuine gap between the system as built and the system as deployed, and no one in the current ecosystem has full visibility across it. And that is precisely the kind of gap that shared infrastructure is designed to close.
II. The Problem: Why This Time Is Different
“I don’t think anyone has the complete assurance framework for this as of now, just because these things are changing so rapidly.” – Dr. Girish Nadkarni, Mount Sinai
It would be tempting to treat AI assurance as the next iteration of software quality assurance: the same discipline, applied to a new product category. But the technology has properties that break that analogy.
In the physical world, industries manage risk through standards of care. An electrician who wires a building to code, an auditor who follows GAAS, a physician who adheres to clinical guidelines: each operates within a framework that defines what “reasonable” looks like. That framework doesn’t eliminate risk. It makes risk manageable, insurable, and adjudicable.
For AI systems, particularly generative and agentic models, no equivalent framework exists. The reasons are technical.
AI systems are non-deterministic: the same input can produce different outputs depending on model version, temperature settings, and internal activations. They are opaque: even the developers who build them cannot always reconstruct how a model arrived at a specific output. And increasingly, they are autonomous: they take actions, initiate transactions, and make sequential decisions with limited human oversight at each step.
Dr. Nadkarni at Mount Sinai was precise about what this means for existing validation approaches. “Traditional assurance works for deterministic products: same inputs, same outputs, 100 times out of 100,” he said. Healthcare’s regulatory frameworks (FDA clearance processes, clinical trial protocols, change management controls) were built for that world. “They don’t really work for non-deterministic or probabilistic things.” And when agents enter the picture, the challenge compounds: you need “a completely new set of continuous controls,” meaning real-time monitoring of outputs, not pre-deployment checkpoints alone.
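The "same inputs, same outputs, 100 times out of 100" test can be illustrated with a toy sketch. It assumes a made-up table of three candidate outputs with raw scores, standing in for a model's preferences; no real model or vendor API is involved, and all names are hypothetical.

```python
import math
import random


def sample_output(scores, temperature, rng):
    """Choose one option from a hypothetical {option: raw score} table.

    temperature == 0 is treated as greedy selection (always the top score).
    temperature > 0 samples from a softmax over score / temperature.
    """
    if temperature == 0:
        return max(scores, key=scores.get)
    options = list(scores)
    top = max(scores.values())  # subtract the max for numerical stability
    weights = [math.exp((scores[o] - top) / temperature) for o in options]
    return rng.choices(options, weights=weights, k=1)[0]


def distinct_outputs(scores, temperature, runs=100, seed=0):
    """Feed the identical input `runs` times and collect the distinct answers."""
    rng = random.Random(seed)
    return {sample_output(scores, temperature, rng) for _ in range(runs)}


# Illustrative scores only; not taken from any deployed system.
TOY_SCORES = {"approve": 2.0, "refer to human": 1.5, "deny": 1.0}

greedy = distinct_outputs(TOY_SCORES, temperature=0)
sampled = distinct_outputs(TOY_SCORES, temperature=1.0)
print("temperature 0:", greedy)
print("temperature 1:", sampled)
```

In this sketch the greedy setting passes a repeat-the-input test while the sampled setting returns more than one answer for the same input, which is the property a one-time, pre-deployment checkpoint is not designed to catch.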
The evaluation challenge runs deeper than most observers appreciate. Dr. Rumman Chowdhury, who leads Humane Intelligence, a nonprofit focused on AI evaluation infrastructure, drew a sharp distinction between what the industry commonly calls an “eval” and what evaluation actually requires. “A benchmark is a test. It tells you a score,” she said. A genuine evaluation is a designed process: it defines what “good” means in context, selects appropriate testing tools, analyzes the results for actionable patterns, and communicates findings in ways that engineering teams, legal counsel, and leadership can actually use. By that standard, most of what the industry calls evaluation is really just basic benchmarking and testing.
A growing field of serious practitioners is doing sophisticated assurance work. But that work does not yet function as infrastructure. It is not shared, not interoperable, and not recognized across the organizations and oversight bodies that need to rely on it.
III. The Fragmentation Tax
“When a vendor claims independent validation, my first question is: validated by whom, and under whose financial leverage?” – Dave Garland, Second Century Ventures
Sophisticated deployers are doing serious assurance work. A growing field of providers is building the tools and methods for independent evaluation. But right now, the assurance ecosystem is fragmented in ways that impose real costs on everyone in it.
Take the supply side. Dr. Chowdhury described a market that is growing but lacks basic connective tissue. There is no standardized way for an enterprise to vet an AI evaluator. There is no shared definition of what “independence” means in the context of AI assurance. There are no legal protections for independent evaluators. “Being a purely independent evaluator is nearly impossible right now,” she said, “because the ecosystem isn’t built out yet.” A real and growing field of practitioners is doing serious technical work, but the market is fragmented, lacks widely recognized standards to build from, and is difficult for deployers to navigate. And beyond the market structure, the science itself is still maturing: the methodologies for conducting rigorous, context-specific AI evaluations, as distinct from running standardized benchmarks, are advancing rapidly but remain early.
On the demand side, Garland described what this looks like from an investor’s chair. A paid-for badge from a boutique compliance firm that rubber-stamped a constrained demo is not validation, he said. It’s marketing. He also flagged what he called the “engine vs. driver” problem: vendors routinely borrow the credibility of the foundational models they’re built on and present it as their own validation. “That’s like saying a car is perfectly safe because it has a reliable engine, completely ignoring the fact that you’ve built a faulty steering wheel.” Without a shared standard for what independent evaluation actually entails, deployers and investors have no reliable way to distinguish credible assurance from performative compliance. And in most cases, they cannot look to insurance, as insurers eschew systems and products they cannot see into, understand, or price risk for.
Dr. Nadkarni offered a practical lens on what this means for a health system trying to make decisions. The question of whether to bring in external auditors, he said, “depends on the use case. If you have to work with external auditors for every single use case, it becomes unworkable. But on a risk-adjusted basis, for the high-risk decisions, it becomes almost a standard.” The market needs to be structured around risk proportionality, and that structure does not yet exist. Steve Abbott similarly noted the need for industry-level benchmarks “tailored to the individual use case and the relative risk... the riskier the activity to the individual or to the system, the higher level of oversight and supervision you need.”
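Risk proportionality of the kind described here can be sketched as a simple lookup from use case to required oversight. Everything in the sketch is an illustrative assumption: the two risk questions, the three tiers, and the control names are hypothetical and are not drawn from any standard or from the organizations quoted.

```python
from dataclasses import dataclass
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class UseCase:
    name: str
    affects_individuals: bool  # bears on a person's health, money, or rights
    acts_autonomously: bool    # acts without a human approving each step


def classify(use_case):
    """Each 'yes' answer raises the tier by one step (a deliberately crude rule)."""
    yes_answers = int(use_case.affects_individuals) + int(use_case.acts_autonomously)
    return RiskTier(1 + yes_answers)


# Hypothetical controls: oversight accumulates as the tier rises.
REQUIRED_CONTROLS = {
    RiskTier.LOW: ["internal testing"],
    RiskTier.MEDIUM: ["internal testing", "human sign-off"],
    RiskTier.HIGH: ["internal testing", "human sign-off", "external audit"],
}


def required_controls(use_case):
    return REQUIRED_CONTROLS[classify(use_case)]


summary = UseCase("meeting summary draft", False, False)
triage = UseCase("autonomous clinical triage", True, True)
print(summary.name, "->", required_controls(summary))
print(triage.name, "->", required_controls(triage))
```

The design choice worth noting is that external audit appears only at the top tier, mirroring the point that auditing every use case is unworkable while auditing high-risk ones can become routine.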
The cumulative effect is a three-sided market failure. Deployers can’t demonstrate trust in ways that regulators, insurers, or consumers recognize. Assurance providers can’t scale because demand is episodic, unstandardized, and difficult to build a sustainable business around. Insurers can’t price AI risk because they lack the third-party data and actuarial infrastructure they rely on in every other line of business. Fragmentation across the market, the methods, and the definitions is itself a cost: a tax on responsible deployment that makes scaling AI harder and governing it worse.
IV. What We Need: Four Pillars of Trust Infrastructure
“Traditional risk management is not about zero risk. It’s about having knowledge of your risk space and having knowledge of your risk appetite, and then mapping one to the other.” – Dr. Rumman Chowdhury, Humane Intelligence
Across our conversations, four requirements emerged consistently.
1. Common standards defined by the demand side.
“If you ask a major tech vendor what a safe AI integration looks like,” Garland observed, “the answer almost always involves buying more of their proprietary platform.” This perpetuates the same problem of opacity and the same lack of interoperable standards that the ecosystem can rely on. What the market needs is a definition of “good” built by the industries relying on the technology, in collaboration with the technical experts who know how to test for it, and independent of the labs and vendors selling it.
This is not a new problem. Capital markets faced a version of it in the early twentieth century: companies made claims about their financial health, and investors had no standardized way to verify them. The creation of 10-K filings, generally accepted accounting principles, and eventually Sarbanes-Oxley gave markets the structural norms they needed to function with confidence. AI is at an equivalent inflection point, and the economy needs a similar foundation: shared, trusted, and designed to enable participation rather than constrain it.
2. Independent, verifiable assurance.
“In the age of agentic AI, we cannot grade our own homework,” Garland said. When an enterprise integrates technology that can autonomously execute contracts or move capital, a board of directors cannot simply take a vendor’s word that the system is safe. “Trust isn’t a feeling,” he said. “It has to be a verifiable metric. Independent assurance is the currency of trust.”
Dr. Chowdhury framed the goal from the provider’s perspective. The purpose of assurance is to produce a product “that has been thoughtfully produced and put out into the world such that if something goes wrong, you know exactly what to do.” You understand the trade-offs and decisions that were made. You can explain why. Traditional risk management has never been about zero risk. It’s about mapping the risk space to the risk appetite, and that mapping, she said, is what’s missing in AI today.
Abbott at Gusto described why independent verification matters from the deployer’s side. “Clear standards and independent verification can help responsible companies innovate while maintaining strong protections for customers.” Independent verification, in his view, could make continued responsible deployment possible: it gives companies a credible way to demonstrate that they’ve been thoughtful about what they’ve built, so they can keep building.
3. A deployer voice in policy.
The rules governing AI are being written now, in state legislatures across the country, and the people who understand what those rules will cost to implement, and which requirements are achievable versus performative, are largely absent from the conversation.
Steve Abbott at Gusto described payroll industry trade associations working to develop policy frameworks while the technology changes underneath them. What’s needed, he said, are strong public-private working groups where deployers can share what actually works in practice. The tension between standardization and competitive differentiation is exactly what good infrastructure resolves. It sets a floor, not a ceiling. It defines what responsible deployment looks like without prescribing how to build on top of it.
4. Alignment between claims of safety and assumption of risk.
Garland offered the investor’s acid test: “If your system is as validated and foolproof as you claim, are you willing to underwrite the risk?” If the answer is that liability falls on the end user, the validation is meaningless. “You’re asking my industry to take 100% of the risk on your beta test,” he said. A credible assurance standard forces alignment between the claim of safety and the assumption of risk, which is exactly what indemnification clauses, insurance markets, and procurement decisions need in order to function.
This is where the insurance dimension becomes critical. Insurers are being asked to underwrite AI-driven risk without the actuarial history, shared standards, or credible third-party evaluation they rely on in every other line of business. Without that infrastructure, coverage becomes either unavailable or prohibitively expensive, creating a structural barrier to scaling responsible AI deployment. Independent assurance, built on shared standards, creates the actuarial foundation that makes AI insurance markets possible.
V. The Window
The infrastructure described above does not yet exist at scale. But pieces of it are being built.
Legislation to establish Independent Verification Organizations, expert-led bodies that are structurally independent from industry and test AI systems against government-set outcomes, has advanced in Virginia and Connecticut, creating a framework for market-based, third-party AI verification. Organizations across the assurance ecosystem are beginning to convene around shared standards and methods. The question is no longer whether this infrastructure is needed. It’s whether it will be built deliberately, by the people who understand what’s at stake, or whether it will emerge by default, shaped by litigation, ad hoc regulation, and incumbent market power.
Garland framed the bottleneck: the biggest barrier to AI adoption in legacy industries is the absence of assurance. Enterprises “can see the agentic revolution coming,” he said, “but they’re completely paralyzed by the level of uncertainty that comes with it. They’re being bombarded by vendors and consulting firms selling closed ecosystems. They don’t have a compass, and they don’t have a trusted neutral framework to evaluate their readiness.”
The field of AI assurance is real, growing, and doing serious technical work. It is also fragmented, under-resourced relative to the pace of the technology, and without a unified voice in markets or policy. The organizations that build this trust infrastructure now, that define what good looks like and create the standards and methods and institutional credibility the economy needs, will shape how AI operates for decades. The window is open. It won’t stay open indefinitely.
If you are interested in learning more about this work, please contact our team.

