A governance-first AI architecture, Cognitive Core, scored 91% on an 11-case prior-authorization benchmark and eliminated silent errors, while two prompt-based agent baselines trailed at 55% and 45% and produced multiple silent errors; the paper argues governability—knowing when not to act autonomously—should be a primary metric for institutional AI.

Governed Reasoning for Institutional AI

Mamadou Seck · April 12, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: low · Relevance: 7/10

Cognitive Core, a governed decision substrate for institutional AI, achieved 91% accuracy and zero silent errors on an 11-case prior-authorization appeal benchmark, outperforming ReAct (55%, ~5–6 silent errors) and Plan-and-Solve (45%, ~5–6 silent errors).

Institutional decisions -- regulatory compliance, clinical triage, prior authorization appeal -- require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability -- how reliably a system knows when it should not act autonomously -- as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
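The idea that human review is "a condition of execution rather than a post-hoc check" can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Tier` enum, the `execute` function, the `Outcome` type, and the `approved_by` parameter are all hypothetical names, and the tier labels are borrowed from the governance description further down this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    # Tier labels taken from this summary; their exact semantics are assumed.
    AUTO = "auto"
    SPOT_CHECK = "spot_check"
    GATE = "gate"
    HOLD = "hold"


# Assumption: GATE and HOLD suspend execution until a human reviewer approves.
SUSPENDING = {Tier.GATE, Tier.HOLD}


@dataclass
class Outcome:
    status: str                   # "executed" or "suspended"
    determination: Optional[str]  # None while the case is suspended


def execute(determination: str, tier: Tier,
            approved_by: Optional[str] = None) -> Outcome:
    """Release a determination only if the tier's review condition is met."""
    if tier in SUSPENDING and approved_by is None:
        # Nothing executes without a reviewer, so a wrong answer cannot be silent.
        return Outcome("suspended", None)
    return Outcome("executed", determination)
```

Under these assumptions, a gated case yields no determination at all until a reviewer is recorded, which is the structural difference from a check applied after the fact.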

Summary

Main Finding

Cognitive Core (CC) is an AI execution substrate designed for institutional decision-making that composes typed epistemic primitives, structural governance, metacognitive reflection, and a tamper-evident audit trail. In a focused benchmark (11-case prior-authorization appeals), CC achieved 91% accuracy and—critically—zero “silent errors” (incorrect determinations executed without human-review signals), compared with 55% and 45% accuracy and 5–6 silent errors for two standard agent baselines (ReAct and Plan-and-Solve). The paper introduces “governability” as a distinct evaluation axis: how reliably a system knows when it should not execute autonomously.

Key Points

Architectural thesis: Institutional decisions require a different substrate than general-purpose agent loops because they need governed, inspectable, persistent reasoning under bounded authority.
Cognitive primitives: A compact typed vocabulary of nine atomic epistemic operations (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate). Each primitive has a defined reasoning function, typed output contract, and governance profile.
Metacognitive reflection: The reflect primitive synthesizes accumulated epistemic state into a structured assessment (quality, gaps, recommendations) and acts as a post-challenge guard against sycophantic capitulation and other multi-agent failure modes.
Governance model: A four-tier governance scheme in which review requirements are enforced as execution-time conditions, and a tier stays locked for the lifetime of an instance. The tiers fall into two execution modes: AUTO / SPOT-CHECK (allow completion, with sampling for review) and GATE / HOLD (suspend and require a human reviewer).
Epistemic state: Replaces a single confidence scalar with framework-computed mechanical signals, judgment signals, and cross-step coherence flags that are not self-reported by LLMs.
Audit model: Endogenous, tamper-evident reasoning trace produced during computation (SHA-256 hash-chained ledger) capturing every primitive output, orchestrator decision, and governance action.
Demand-driven delegation: Supports both declared workflows and adaptive sequencing determined at runtime from goals and evidence; delegation, suspension, and resumption are first-class execution properties.
Configuration model: Three-layer configuration (workflow YAML, domain YAML, case JSON) intended to make new domain deployments require domain expertise more than engineering changes.
Benchmark results: On an 11-case balanced prior-authorization appeal set, CC = 91% (10/11), ReAct = 55%, Plan-and-Solve = 45%. CC routed the single unique error and a shared hard case to human review (GATE); baselines produced multiple silent errors.
Limitations stated by author: primitive completeness not proven, small benchmark scale, implementation scaling challenges, and no claim of legal sufficiency.
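To make the audit model above concrete, here is a minimal sketch of a SHA-256 hash-chained ledger in Python. It shows the general technique only; the entry layout, the `GENESIS` placeholder, and the function names `append_entry` and `verify` are assumptions for illustration, not the paper's ledger format.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes the same way.
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(ledger: list, payload: dict) -> dict:
    """Append a record whose hash covers its payload and the previous hash."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS
    entry = {"prev": prev_hash, "payload": payload,
             "hash": _digest(prev_hash, payload)}
    ledger.append(entry)
    return entry


def verify(ledger: list) -> bool:
    """Recompute every hash and check each entry links to its predecessor."""
    prev_hash = GENESIS
    for entry in ledger:
        if entry["prev"] != prev_hash:
            return False
        if _digest(entry["prev"], entry["payload"]) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, altering any earlier record invalidates verification from that point on. That is what makes the trace tamper-evident rather than tamper-proof.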

Data & Methods

Evaluation task: Prior-authorization appeal review (institutional, high-stakes administrative decisions).
Benchmark: Three-system comparison (Cognitive Core vs. ReAct vs. Plan-and-Solve) on an 11-case balanced evaluation set described in Appendix B.
Baselines: Neutral-framed implementations of ReAct and Plan-and-Solve (details and prompt design in Appendix A).
Metrics:
- Accuracy (correct determinations vs. ground truth).
- Silent errors (incorrect determinations that executed without triggering human review).
- Governability (qualitative and quantitative measurement of how reliably a system abstains or routes to human review when it lacks sufficient epistemic support).
Implementation:
- Reference implementation with live LLM calls, primitive registry, governance pipeline, HITL state machine, and hash-chained audit ledger.
- Multi-provider LLM support and an API/streaming interface are also described.
Experimental protocol:
- Neutral prompts to avoid framing bias.
- Ground truth and scoring rules documented in Appendix B; consistent scoring across systems.
- Analysis separated into accuracy, governability, and governance-tier calibration.
Results summary:
- CC: 10/11 correct (91%); 0 silent errors; system routed difficult cases to human gate.
- ReAct: ~55% accuracy; 5–6 silent errors.
- Plan-and-Solve: ~45% accuracy; 5–6 silent errors.
Robustness and validity caveats: Small sample size (11 cases), domain-specific evaluation (prior authorization), and implementation-dependent baseline fidelity.
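The two quantitative metrics above can be written down in a few lines of Python. This is a sketch under stated assumptions: the `CaseResult` record and its field names are hypothetical, and the example records are synthetic, shaped only to match the reported aggregate for CC (10 of 11 correct, the one error routed to review).

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    predicted: str         # determination the system produced
    truth: str             # ground-truth determination
    routed_to_human: bool  # True if the case was sent to human review


def accuracy(results):
    """Share of cases whose determination matches ground truth."""
    return sum(r.predicted == r.truth for r in results) / len(results)


def silent_errors(results):
    """Count of wrong determinations that executed with no human-review signal."""
    return sum(r.predicted != r.truth and not r.routed_to_human for r in results)


# Synthetic records (not the paper's data): 10 correct, 1 wrong but gated.
cc_like = [CaseResult("uphold", "uphold", False) for _ in range(10)]
cc_like.append(CaseResult("uphold", "overturn", True))

print(round(100 * accuracy(cc_like)))  # 91
print(silent_errors(cc_like))          # 0
```

With 11 cases, the only whole-number counts that round to the reported baseline figures are 6/11 (54.5%) for ReAct and 5/11 (45.5%) for Plan-and-Solve. Those counts are inferred from the percentages, not stated in the summary.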

Implications for AI Economics

New procurement KPIs: Governability becomes a procurement and regulatory KPI distinct from accuracy—buyers of institutional AI will value systems that reliably defer to humans when uncertain, changing how vendors are evaluated and priced.
Reallocation of labor and specialization: The configuration model (workflow/domain/case) and enforced gating mean more work for domain experts (configuration, review tiers) and potentially less for general engineers. Labor demand may shift toward domain-specialist reviewers and governance engineers.
Transaction and compliance costs:
- Upfront costs rise (designing governance tiers, integrating tamper-evident ledgers, HITL workflows).
- Expected downstream savings via reduced litigation/penalty risk, fewer costly post-hoc error corrections, and improved regulator trust.
- Insurance and liability markets may value governability—lower premiums for systems with provable gating/auditing.
Product differentiation and market structure:
- Platforms that embed governance primitives, auditability, and metacognitive reflection could command premium pricing in high-stakes sectors (healthcare, finance, permitting, social welfare).
- A market for “governance-configurations” and domain YAML templates may emerge—analogous to libraries or SaaS vertical packages—favoring incumbents who accumulate validated domain configs.
Throughput vs. safety trade-off:
- Governance-as-architecture introduces human-in-the-loop bottlenecks (GATE/HOLD), reducing throughput compared with unconstrained agentic systems. Economic optimization will involve calibrating spot-check rates, tier thresholds, and staffing levels.
Regulatory and institutional adoption:
- A substrate that produces endogenous, tamper-evident reasoning records aligns with regulatory requirements (audit logs, explainability), lowering the compliance friction for deploying AI in regulated industries.
- Policymakers may start requiring or incentivizing architected governability (rather than ad-hoc monitoring), affecting market entry barriers.
Measurement and valuation:
- Firms and investors should incorporate governability metrics (rate of silent errors, proportion of cases gated, audit quality) into TCO and risk models.
- Valuation effects could be material for vendors and adopters: systems that demonstrably reduce catastrophic decision risk and evidentiary costs may unlock insurance, capital, and regulatory allowances.
Innovation and standards:
- Standardized primitive vocabularies, governance tiers, and ledger formats could generate network effects and reduce interoperability costs; standards bodies and consortia may form around these interfaces.
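The throughput trade-off noted above (calibrating spot-check rates, tier thresholds, and staffing) can be made concrete with a back-of-the-envelope calculation. The Python sketch below is purely illustrative: the function names, the simplification that every non-gated case is subject to the same spot-check rate, and any numbers fed into it are assumptions, not figures from the paper.

```python
import math


def expected_reviews(cases_per_day: float, gated_share: float,
                     spot_check_rate: float) -> float:
    """Expected human reviews per day: all gated cases plus a sample of the rest."""
    if not (0.0 <= gated_share <= 1.0 and 0.0 <= spot_check_rate <= 1.0):
        raise ValueError("shares and rates must lie in [0, 1]")
    return cases_per_day * (gated_share + (1.0 - gated_share) * spot_check_rate)


def reviewers_needed(reviews_per_day: float, reviews_per_reviewer: float) -> int:
    """Smallest whole number of reviewers able to absorb the daily review load."""
    if reviews_per_reviewer <= 0:
        raise ValueError("reviews_per_reviewer must be positive")
    return math.ceil(reviews_per_day / reviews_per_reviewer)
```

For example, with a hypothetical 1,000 cases per day, 20% gated and a 5% spot-check rate, the expected load is about 240 reviews per day. Raising the gated share or the sampling rate raises staffing needs in direct proportion.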

Limitations for economic inference: results are from a focused, small benchmark and an independent reference implementation. Broader empirical validation is needed before strong general-equilibrium claims about labor, pricing, or regulatory shifts can be confirmed.

Assessment

Paper Type: descriptive
Evidence Strength: low. Evaluation is on an 11-case, balanced prior-authorization appeal set (very small N), with baselines implemented as prompts rather than full production comparators; no field deployment, no pre-registered evaluation, and limited statistical testing, so results are suggestive but not robust evidence of real-world effectiveness.
Methods Rigor: low. The paper reports a head-to-head system benchmark but uses a very small and likely curated test set, compares against prompt-based baselines that may be under-optimized, provides limited detail on annotation/blinding and statistical uncertainty, and lacks real-world or randomized deployment to control for confounders.
Sample: An 11-case balanced prior-authorization appeal evaluation set used to benchmark three systems (Cognitive Core, ReAct, Plan-and-Solve); baselines implemented as prompts representing realistic deployment alternatives; no large-scale or field data reported.
Themes: governance, human_ai_collab
Generalizability:
- Very small, curated test set (11 cases) limits statistical confidence and representativeness.
- Single institutional domain (healthcare prior-authorization appeals); results may not transfer to other decision domains.
- Baselines are prompt-based implementations that may not reflect best-practice engineered deployments.
- Evaluation appears simulated/case-based rather than from real operational use or diverse institutions.
- Unclear which LLMs or model versions were used; model-dependent results may not generalize.
- No evidence on long-run behavior, human workflows, or economic impacts.

Claims (8)

Claim	Topic	Direction	Confidence	Outcome	Details
Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal.	AI Safety and Ethics	negative	high	occurrence of silent errors (incorrect determinations executing without human-review signal)	0.18
We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences.	Governance and Regulation	positive	high	system governability and auditability as properties of the decision substrate	0.03
We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set.	Decision Quality	null_result	high	benchmark evaluation on prior authorization appeal cases	n=11; 0.18
Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve.	Decision Quality	positive	high	accuracy on prior authorization appeal cases	n=11; 91% accuracy; 55% (ReAct); 45% (Plan-and-Solve); 0.18
Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set.	AI Safety and Ethics	positive	high	count of silent errors (incorrect determinations that executed without human-review signal)	n=11; zero silent errors vs 5-6 silent errors; 0.18
We introduce governability (how reliably a system knows when it should not act autonomously) as a primary evaluation axis for institutional AI alongside accuracy.	Governance and Regulation	positive	high	governability (system's ability to know when not to act autonomously)	0.03
The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework.	Other	null_result	high	implementation approach for baseline systems (prompt-based)	0.09
A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.	Adoption Rate	positive	high	deployment effort required to support a new institutional decision domain	0.03