A governance-first AI architecture, Cognitive Core, scored 91% on an 11-case prior-authorization benchmark and eliminated silent errors, while two prompt-based agent baselines trailed at 55% and 45% and produced multiple silent errors; the paper argues governability—knowing when not to act autonomously—should be a primary metric for institutional AI.
Institutional decisions -- regulatory compliance, clinical triage, prior authorization appeal -- require a different AI architecture than general-purpose agents provide. Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. Cognitive Core achieves 91% accuracy against 55% (ReAct) and 45% (Plan-and-Solve). The governance result is more significant: CC produced zero silent errors while both baselines produced 5-6. We introduce governability -- how reliably a system knows when it should not act autonomously -- as a primary evaluation axis for institutional AI alongside accuracy. The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity.
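To make the "tamper-evident SHA-256 hash-chain audit ledger" concrete, here is a minimal Python sketch of the general hash-chaining technique. It is not the paper's implementation: the entry layout, the `GENESIS_HASH` placeholder, and the function names `entry_hash`, `append_entry`, and `verify_ledger` are all assumptions made for illustration.

```python
import hashlib
import json

# Assumed placeholder standing in for the "previous hash" of the first entry.
GENESIS_HASH = "0" * 64


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, payload: dict) -> None:
    """Append a payload, linking it to the hash of the last entry."""
    prev_hash = ledger[-1]["hash"] if ledger else GENESIS_HASH
    ledger.append(
        {
            "prev_hash": prev_hash,
            "payload": payload,
            "hash": entry_hash(prev_hash, payload),
        }
    )


def verify_ledger(ledger: list) -> bool:
    """Recompute every hash in order; any edited entry breaks the chain."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True
```

Because each hash covers the previous one, altering any recorded step invalidates that entry and every later link. That is the sense in which such a ledger is tamper-evident rather than tamper-proof.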
Summary
Main Finding
Cognitive Core (CC) is an AI execution substrate designed for institutional decision-making that composes typed epistemic primitives, structural governance, metacognitive reflection, and a tamper-evident audit trail. In a focused benchmark (11-case prior-authorization appeals), CC achieved 91% accuracy and—critically—zero “silent errors” (incorrect determinations executed without human-review signals), compared with 55% and 45% accuracy and 5–6 silent errors for two standard agent baselines (ReAct and Plan-and-Solve). The paper introduces “governability” as a distinct evaluation axis: how reliably a system knows when it should not execute autonomously.
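The two headline metrics can be written down directly from their definitions. The sketch below is an illustration only: the `CaseResult` record and its field names are hypothetical, and it assumes accuracy is scored on the system's determination regardless of whether the case was also routed to a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One benchmark case (hypothetical record layout)."""
    predicted: str          # the system's determination
    truth: str              # the ground-truth determination
    routed_to_human: bool   # True if the system signaled for human review


def accuracy(results: list) -> float:
    """Fraction of cases whose determination matches ground truth."""
    return sum(r.predicted == r.truth for r in results) / len(results)


def silent_errors(results: list) -> int:
    """Count incorrect determinations that executed with no review signal."""
    return sum(
        r.predicted != r.truth and not r.routed_to_human for r in results
    )
```

Under these definitions a system can be wrong without committing a silent error, as long as the wrong determination was routed to a reviewer. This is the distinction the governability axis is meant to capture.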
Key Points
- Architectural thesis: Institutional decisions require a different substrate than general-purpose agent loops because they need governed, inspectable, persistent reasoning under bounded authority.
- Cognitive primitives: A compact typed vocabulary of nine atomic epistemic operations (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate). Each primitive has a defined reasoning function, typed output contract, and governance profile.
- Metacognitive reflection: The reflect primitive synthesizes accumulated epistemic state into a structured assessment (quality, gaps, recommendations) and acts as a post-challenge guard against sycophantic capitulation and other multi-agent failure modes.
- Governance model: A four-tier governance scheme in which review requirements are enforced as execution-time conditions (a tier locks for the lifetime of an instance). The tiers group into two execution modes: AUTO / SPOT-CHECK (allow completion with sampling) and GATE / HOLD (suspend and require a human reviewer).
- Epistemic state: Replaces a single confidence scalar with framework-computed mechanical signals, judgment signals, and cross-step coherence flags that are not self-reported by LLMs.
- Audit model: Endogenous, tamper-evident reasoning trace produced during computation (SHA-256 hash-chained ledger) capturing every primitive output, orchestrator decision, and governance action.
- Demand-driven delegation: Supports both declared workflows and adaptive sequencing determined at runtime from goals and evidence; delegation, suspension, and resumption are first-class execution properties.
- Configuration model: Three-layer configuration (workflow YAML, domain YAML, case JSON) intended to make new domain deployments require domain expertise more than engineering changes.
- Benchmark results: On an 11-case balanced prior-authorization appeal set, CC = 91% (10/11), ReAct = 55%, Plan-and-Solve = 45%. CC routed the single unique error and a shared hard case to human review (GATE); baselines produced multiple silent errors.
- Limitations stated by author: primitive completeness not proven, small benchmark scale, implementation scaling challenges, and no claim of legal sufficiency.
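The governance bullet above describes review as a condition of execution rather than a later check. One way to picture that is the following sketch, which is not the paper's code. The `Tier`, `Status`, and `Instance` names, the `spot_check_rate` parameter, and the reading that only SPOT-CHECK instances are sampled are all assumptions.

```python
import random
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO = "auto"
    SPOT_CHECK = "spot_check"
    GATE = "gate"
    HOLD = "hold"


class Status(Enum):
    COMPLETED = "completed"
    COMPLETED_SAMPLED = "completed_and_sampled_for_review"
    SUSPENDED = "suspended_pending_human_review"


@dataclass(frozen=True)  # frozen: the tier cannot be reassigned after creation
class Instance:
    case_id: str
    tier: Tier


def execute(instance: Instance, spot_check_rate: float = 0.1, rng=None) -> Status:
    """Decide whether an instance may complete, based only on its locked tier."""
    rng = rng or random.Random()
    # GATE and HOLD never complete on their own: execution suspends.
    if instance.tier in (Tier.GATE, Tier.HOLD):
        return Status.SUSPENDED
    # SPOT-CHECK completes, but a sample is flagged for later human review.
    if instance.tier is Tier.SPOT_CHECK and rng.random() < spot_check_rate:
        return Status.COMPLETED_SAMPLED
    return Status.COMPLETED
```

The frozen dataclass is a simple stand-in for "tiers lock for an instance's lifetime": any attempt to change `tier` after construction raises an error instead of silently downgrading the review requirement.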
Data & Methods
- Evaluation task: Prior-authorization appeal review (institutional, high-stakes administrative decisions).
- Benchmark: Three-system comparison (Cognitive Core vs. ReAct vs. Plan-and-Solve) on an 11-case balanced evaluation set described in Appendix B.
- Baselines: Neutral-framed implementations of ReAct and Plan-and-Solve (details and prompt design in Appendix A).
- Metrics:
  - Accuracy (correct determinations vs. ground truth).
  - Silent errors (incorrect determinations that executed without triggering human review).
  - Governability (qualitative and quantitative measurement of how reliably a system abstains or routes to human review when it lacks sufficient epistemic support).
- Implementation:
  - Reference implementation with live LLM calls, primitive registry, governance pipeline, HITL state machine, and hash-chained audit ledger.
  - Multi-provider LLM support and an API/streaming interface described.
- Experimental protocol:
  - Neutral prompts to avoid framing bias.
  - Ground truth and scoring rules documented in Appendix B; consistent scoring across systems.
  - Analysis separated into accuracy, governability, and governance-tier calibration.
- Results summary:
  - CC: 10/11 correct (91%); 0 silent errors; system routed difficult cases to human gate.
  - ReAct: ~55% accuracy; 5–6 silent errors.
  - Plan-and-Solve: ~45% accuracy; 5–6 silent errors.
- Robustness and validity caveats: Small sample size (11 cases), domain-specific evaluation (prior authorization), and implementation-dependent baseline fidelity.
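To put the small-sample caveat in numbers, the sketch below computes a 95% Wilson score interval for each accuracy figure. The 10/11 count for CC comes from the summary. The counts of 6/11 for ReAct and 5/11 for Plan-and-Solve are assumptions inferred from the rounded percentages, and the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # CC count is from the summary; the baseline counts are assumed (see above).
    for name, correct in [("CC", 10), ("ReAct", 6), ("Plan-and-Solve", 5)]:
        low, high = wilson_interval(correct, 11)
        print(f"{name}: {correct}/11, 95% interval {low:.2f} to {high:.2f}")
```

With only 11 cases the intervals are wide: roughly 0.62 to 0.98 for CC, and the lower end of that range sits below the upper end of the assumed baseline intervals. The accuracy gap is therefore suggestive rather than conclusive, in line with the caveat above.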
Implications for AI Economics
- New procurement KPIs: Governability becomes a procurement and regulatory KPI distinct from accuracy—buyers of institutional AI will value systems that reliably defer to humans when uncertain, changing how vendors are evaluated and priced.
- Reallocation of labor and specialization: The configuration model (workflow/domain/case) and enforced gating mean more work for domain experts (configuration, review tiers) and potentially less for general engineers. Labor demand may shift toward domain-specialist reviewers and governance engineers.
- Transaction and compliance costs:
  - Upfront costs rise (designing governance tiers, integrating tamper-evident ledgers, HITL workflows).
  - Expected downstream savings via reduced litigation/penalty risk, fewer costly post-hoc error corrections, and improved regulator trust.
  - Insurance and liability markets may value governability, with lower premiums for systems with provable gating/auditing.
- Product differentiation and market structure:
  - Platforms that embed governance primitives, auditability, and metacognitive reflection could command premium pricing in high-stakes sectors (healthcare, finance, permitting, social welfare).
  - A market for “governance-configurations” and domain YAML templates may emerge (analogous to libraries or SaaS vertical packages), favoring incumbents who accumulate validated domain configs.
- Throughput vs. safety trade-off:
  - Governance-as-architecture introduces human-in-the-loop bottlenecks (GATE/HOLD), reducing throughput compared with unconstrained agentic systems. Economic optimization will involve calibrating spot-check rates, tier thresholds, and staffing levels.
- Regulatory and institutional adoption:
  - A substrate that produces endogenous, tamper-evident reasoning records aligns with regulatory requirements (audit logs, explainability), lowering the compliance friction for deploying AI in regulated industries.
  - Policymakers may start requiring or incentivizing architected governability (rather than ad-hoc monitoring), affecting market entry barriers.
- Measurement and valuation:
  - Firms and investors should incorporate governability metrics (rate of silent errors, proportion of cases gated, audit quality) into TCO and risk models.
  - Valuation effects could be material for vendors and adopters: systems that demonstrably reduce catastrophic decision risk and evidentiary costs may unlock insurance, capital, and regulatory allowances.
- Innovation and standards:
  - Standardized primitive vocabularies, governance tiers, and ledger formats could generate network effects and reduce interoperability costs; standards bodies and consortia may form around these interfaces.
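The throughput-versus-safety point above can be turned into a back-of-envelope staffing estimate. The sketch below is purely illustrative: every parameter name and every number in the usage note is hypothetical, and it assumes that all gated cases plus a sampled share of spot-check cases need one human review each.

```python
def expected_review_hours(
    cases_per_day: float,
    gated_fraction: float,
    spot_check_fraction: float,
    spot_check_rate: float,
    minutes_per_review: float,
) -> float:
    """Estimate daily reviewer hours implied by a governance configuration."""
    for name, value in [
        ("gated_fraction", gated_fraction),
        ("spot_check_fraction", spot_check_fraction),
        ("spot_check_rate", spot_check_rate),
    ]:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if gated_fraction + spot_check_fraction > 1.0:
        raise ValueError("tier fractions cannot sum to more than 1")
    # Every gated case is reviewed; spot-check cases are reviewed at the sample rate.
    reviewed_cases = cases_per_day * (
        gated_fraction + spot_check_fraction * spot_check_rate
    )
    return reviewed_cases * minutes_per_review / 60.0
```

For example, with made-up inputs of 1,000 cases per day, 20% gated, 50% in the spot-check tier sampled at 10%, and 12 minutes per review, the estimate is 250 reviews, or 50 reviewer-hours per day. Raising the spot-check rate or widening the gated tier moves that figure linearly, which is the calibration trade-off described above.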
Limitations for economic inference: results are from a focused, small benchmark and an independent reference implementation. Broader empirical validation is needed before strong general-equilibrium claims about labor, pricing, or regulatory shifts can be confirmed.
Assessment
Claims (8)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Agent frameworks infer authority conversationally, reconstruct accountability from logs, and produce silent errors: incorrect determinations that execute without any human review signal. | AI Safety and Ethics | negative | high | occurrence of silent errors (incorrect determinations executing without human-review signal) | 0.18 |
| We propose Cognitive Core: a governed decision substrate built from nine typed cognitive primitives (retrieve, classify, investigate, verify, challenge, reflect, deliberate, govern, generate), a four-tier governance model where human review is a condition of execution rather than a post-hoc check, a tamper-evident SHA-256 hash-chain audit ledger endogenous to computation, and a demand-driven delegation architecture supporting both declared and autonomously reasoned epistemic sequences. | Governance and Regulation | positive | high | system governability and auditability as properties of the decision substrate | 0.03 |
| We benchmark three systems on an 11-case balanced prior authorization appeal evaluation set. | Decision Quality | null_result | high | benchmark evaluation on prior authorization appeal cases | n=11; 0.18 |
| Cognitive Core achieves 91% accuracy on the 11-case prior authorization appeal set, versus 55% for ReAct and 45% for Plan-and-Solve. | Decision Quality | positive | high | accuracy on prior authorization appeal cases | n=11; 91% accuracy; 55% (ReAct); 45% (Plan-and-Solve); 0.18 |
| Cognitive Core produced zero silent errors while both baselines produced 5-6 silent errors on the evaluation set. | AI Safety and Ethics | positive | high | count of silent errors (incorrect determinations that executed without human-review signal) | n=11; zero silent errors vs 5-6 silent errors; 0.18 |
| We introduce governability -- how reliably a system knows when it should not act autonomously -- as a primary evaluation axis for institutional AI alongside accuracy. | Governance and Regulation | positive | high | governability (system's ability to know when not to act autonomously) | 0.03 |
| The baselines are implemented as prompts, representing the realistic deployment alternative to a governed framework. | Other | null_result | high | implementation approach for baseline systems (prompt-based) | 0.09 |
| A configuration-driven domain model means deploying a new institutional decision domain requires YAML configuration, not engineering capacity. | Adoption Rate | positive | high | deployment effort required to support a new institutional decision domain | 0.03 |