Commercial LLMs speak like junior engineers but often mislead: over half of cited sources are unverifiable and model confidence is inversely related to reasoning quality, producing alignment with experts for operational triage but near-complete divergence on strategic capital decisions.

Governance risks of AI reasoning in urban infrastructure through Delphi audit of human and large language model judgment

Alence Poudel, Carla Barrios, Paola De La Torre, Huy Ton, Trevor Surface, Varenya Mehta, Samanata Silwal · May 14, 2026 · Discover Cities

OpenAlex · descriptive · medium evidence · 7/10 relevance · Full text usable (extracted full text)

Structured author observations

Linked only from stored provider relations; the raw author line above is never matched by name.

OpenAlex

Latest observation: July 23, 2026

Alence Poudel exact ORCID
Carla Barrios provider ID
Paola De La Torre provider ID
Huy Ton provider ID
Trevor Surface provider ID
Varenya Mehta provider ID
Samanata Silwal provider ID

Semantic Scholar

Latest observation: July 23, 2026

Alence Poudel provider ID
Carla Barrios provider ID
Paola Garcia de la Torre provider ID
Huy Ton provider ID
Trevor Surface provider ID
V. Mehta provider ID
Samanata Silwal provider ID

In simulated urban-infrastructure scenarios, LLMs produce well-structured but often factually ungrounded recommendations—fabricating over half of cited sources—and diverge from expert judgement increasingly as scenario complexity rises.

Citation observations

Cumulative provider counts captured on specific dates; providers are never combined.

1 cumulative citation

OpenAlex · Observed July 22, 2026

1 cumulative citation

Semantic Scholar · Observed July 22, 2026

Cities are increasingly considering large language models (LLMs) to support smart city operations and infrastructure decision-making. While these tools promise efficiency, their use in public institutions raises concerns about accountability, reliability, and institutional risk. This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals. The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability. Results show that LLMs often generate responses with the structural clarity associated with early-career engineers, yet they display persistent weaknesses in factual grounding and contextual interpretation. Across all models, 51.3% of cited sources were unverifiable or fabricated, and LLM self-reported confidence was negatively correlated with actual reasoning quality (r = -0.23), meaning the lowest-performing models projected the greatest certainty. Decision alignment with expert judgment degraded as scenario complexity increased, with strong agreement on operational triage but near-complete divergence on strategic capital allocation. Many responses misinterpreted regulatory requirements or relied on shallow justification. These failures extend beyond technical accuracy and introduce risks for governance, fiscal responsibility, and regulatory compliance. Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools. 
The findings carry direct implications for accountability, institutional integrity, and public trust in urban governance, and contribute to ongoing discourse on responsible AI adoption in cities aligned with global sustainability priorities.

Summary

Main Finding

LLMs produce fluent, well-structured explanations that resemble early-career engineering reasoning but exhibit systemic weaknesses in factual grounding and contextual interpretation when applied to municipal infrastructure decisions. In a sociotechnical audit against a Delphi-derived expert rubric (20 infrastructure professionals), six commercial LLMs: (1) cited sources that were unverifiable or fabricated in 51.3% of cases, (2) showed a negative correlation between self-reported confidence and actual reasoning quality (r = −0.23), and (3) aligned with experts on simple operational triage but diverged sharply on complex strategic choices (e.g., capital allocation). These failures create governance, fiscal, and regulatory risks for cities considering LLM deployment.

Key Points

Delphi rubric: created from 20 infrastructure professionals; emphasized public safety, regulatory compliance, contextual judgement, financial stewardship, and system reliability.
Models tested: six commercial LLMs (names not specified in the abstract).
Hallucination prevalence: 51.3% of model-cited sources were unverifiable or fabricated.
Confidence-reasoning mismatch: model self-confidence negatively correlated with reasoning quality (r = −0.23); poorer models tended to state higher certainty.
Decision-alignment by complexity: strong agreement with experts on operational triage; near-complete divergence on strategic, long-term decisions (e.g., capital investment prioritization).
Typical failure modes: misinterpretation of regulations, shallow justifications, poor contextualization of local constraints, and fabricated evidence.
Contribution: operationalized an expert-derived benchmark for auditing LLM reasoning in urban infrastructure contexts and introduced a readiness framework linking model performance to deployment risk.
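
The decision-alignment finding above can be illustrated with a small sketch. It assumes alignment is measured as a plain agreement rate between each model decision and the expert consensus decision, grouped by scenario complexity tier; the summary does not state the exact measure the authors used, and the tier names, decision labels, and records below are hypothetical, not data from the paper.

```python
from collections import defaultdict


def alignment_by_tier(records):
    """Return the model-expert agreement rate for each complexity tier.

    `records` is an iterable of (tier, model_decision, expert_decision)
    tuples. A record counts as aligned when the two decisions are equal.
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for tier, model_decision, expert_decision in records:
        totals[tier] += 1
        if model_decision == expert_decision:
            matches[tier] += 1
    return {tier: matches[tier] / totals[tier] for tier in totals}


# Hypothetical records for illustration only (not the study's data).
records = [
    ("operational_triage", "repair_now", "repair_now"),
    ("operational_triage", "monitor", "monitor"),
    ("operational_triage", "repair_now", "repair_now"),
    ("operational_triage", "defer", "defer"),
    ("strategic_capital", "fund_project_a", "fund_project_b"),
    ("strategic_capital", "fund_project_a", "fund_project_c"),
    ("strategic_capital", "fund_project_b", "fund_project_b"),
    ("strategic_capital", "fund_project_c", "fund_project_a"),
]

rates = alignment_by_tier(records)
for tier, rate in rates.items():
    print(f"{tier}: {rate:.2f}")
```

A chance-corrected statistic such as Cohen's kappa would be a reasonable alternative when the number of decision options differs between tiers; the plain rate is used here only because it is the simplest reading of "agreement".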

Data & Methods

Design: mixed-methods sociotechnical audit with five phases — Delphi rubric development, scenario design, LLM evaluation protocol, scoring framework construction, and integrated quantitative/qualitative analysis.
Expert panel: 20 infrastructure professionals from municipal utilities, engineering consultancies, and academia (roles included utility directors, senior/project engineers, consultants, researchers).
Delphi process: iterative elicitation to stabilize criteria of defensible infrastructure reasoning (focus on safety, compliance, contextual judgment, fiscal stewardship, reliability).
Scenarios: realistic municipal infrastructure decision contexts spanning operations, maintenance triage, regulatory interpretation, and strategic capital planning across sectors (transportation, water, energy, waste).
LLM evaluation: six commercial models prompted on the scenarios; outputs scored against the Delphi-derived rubric (quantitative scoring plus qualitative coding of error types).
Key quantitative metrics reported: fraction of unverifiable/fabricated citations (51.3%); Pearson correlation between model-reported confidence and rubric-scored reasoning quality (r = −0.23). Decision alignment was measured across scenario complexity levels, on a spectrum from strong to weak agreement.
Ethical oversight: informed consent, institutional ethics exemption due to minimal risk.
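
The two headline metrics listed above can be sketched in a few lines. This is a minimal illustration, assuming each citation has already been adjudicated as verifiable or not and that each response has one self-reported confidence value and one rubric score; the function names, scales, and sample values are hypothetical and do not reproduce the study's data or its 51.3% and r = −0.23 results.

```python
from math import sqrt


def unverifiable_fraction(citation_flags):
    """Share of citations flagged as unverifiable or fabricated (True = flagged)."""
    if not citation_flags:
        raise ValueError("no citations to evaluate")
    return sum(citation_flags) / len(citation_flags)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical adjudication results: True means the source could not be verified.
flags = [True, False, True, True, False, False, True, False]

# Hypothetical per-response values: self-reported confidence (0 to 1)
# and rubric-scored reasoning quality (assumed 1 to 5 scale).
confidence = [0.95, 0.90, 0.70, 0.85, 0.60, 0.75]
quality = [2.0, 2.5, 3.5, 2.8, 3.9, 3.0]

print(f"unverifiable fraction: {unverifiable_fraction(flags):.3f}")
print(f"confidence vs. quality r: {pearson_r(confidence, quality):.2f}")
```

A negative coefficient means higher stated confidence tends to accompany lower rubric scores. On Python 3.10 or later, `statistics.correlation` from the standard library computes the same coefficient.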

Implications for AI Economics

Risk-adjusted value: Apparent productivity gains from LLMs in public-sector infrastructure may be overstated once governance, verification, regulatory, and liability costs are internalized. Economic evaluations must discount LLM output value for factual unreliability and required oversight.
Cost of verification and auditing: Municipalities will incur recurring costs (staff time, third-party auditing, automated verifiers) to validate LLM outputs; these transaction costs materially affect the ROI of deploying LLM-based decision aids.
Procurement and contracting: Procurement design must price in vendor accountability (e.g., warranties, audit logs, model provenance). Contracts may shift toward outcome- and compliance-based terms, increasing complexity and potentially vendor risk premia.
Liability and insurance markets: Fabricated evidence and regulatory misinterpretation create novel liability exposures for public agencies and vendors, driving demand for new insurance products and higher premiums or indemnity clauses.
Labor and task allocation: LLMs may substitute for low-complexity tasks (operational triage, first drafts) but cannot reliably replace expert judgment for high-stakes strategic decisions. This implies demand shifts (more verification, governance, and oversight roles) rather than straightforward labor savings.
Capital budgeting and investment decisions: Because models diverge most on strategic capital allocation, reliance on LLMs for long-term budgeting could lead to misallocated investments or higher discounting of AI-assisted recommendations by decision-makers.
Market for verification services: High prevalence of unverifiable citations signals an emergent market opportunity — firms offering domain-specific grounding, citation verification, regulatory validation, and certification services for LLM outputs.
Regulatory and public-good externalities: Failures in LLM reasoning can generate negative externalities (public-safety risks, regulatory noncompliance, loss of public trust), which suggests a role for regulation, mandated audits, and standards to internalize those externalities.
Vendor valuation and competition: Vendors who can demonstrably reduce hallucination and provide verifiable provenance may command premium prices; conversely, models that overstate confidence may face reputational risk and market discounting when audited results are publicized.
Policy recommendations with economic implications:
- Require human-in-the-loop sign-off for strategic/high-stakes decisions; quantify oversight costs in deployment budgets.
- Mandate provenance and citation verifiability as procurement criteria to reduce downstream verification costs.
- Subsidize or standardize audit rubrics (like the Delphi-derived one here) to lower transaction costs across municipalities.
- Encourage insurance innovation to cover AI-assisted governance risks while incentivizing safer vendor behavior.

Overall, from an AI economics perspective, adopting LLMs in city infrastructure brings potential efficiency gains in well-bounded operational tasks but introduces measurable governance and transaction costs that can erode net economic benefits and create new markets for verification, insurance, and regulatory compliance services.
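
The net-benefit argument in this section can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a simple additive model (gross savings minus verification cost minus expected cost of undetected errors); this model, the parameter names, and every number are illustrative assumptions and are not taken from the paper.

```python
def net_value(gross_savings, verification_cost, error_rate, cost_per_error, n_outputs):
    """Risk-adjusted net value of an LLM decision aid over one budget period.

    gross_savings     -- staff time or other savings attributed to the tool
    verification_cost -- recurring cost of checking and auditing outputs
    error_rate        -- assumed share of outputs with an undetected error
    cost_per_error    -- assumed average cost of acting on one such error
    n_outputs         -- number of outputs relied upon in the period
    """
    if not 0 <= error_rate <= 1:
        raise ValueError("error_rate must be between 0 and 1")
    expected_error_cost = error_rate * cost_per_error * n_outputs
    return gross_savings - verification_cost - expected_error_cost


# Hypothetical municipal deployment, for illustration only.
value = net_value(
    gross_savings=100_000,
    verification_cost=30_000,
    error_rate=0.05,
    cost_per_error=2_000,
    n_outputs=500,
)
print(f"risk-adjusted net value: {value:,.0f}")
```

Under these assumed inputs, verification and expected error costs absorb most of the gross savings, which is the pattern the section describes. With a higher error rate or a higher cost per error, the net value turns negative.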

Assessment

Paper Type: descriptive
Evidence Strength: medium. The paper provides direct, quantitative benchmarking of six commercial LLMs against a Delphi-derived expert rubric, producing concrete failure modes (e.g., 51.3% unverifiable citations, negative correlation between confidence and quality). However, it is not a causal study of real-world outcomes and its conclusions about institutional risk rely on simulation-style scenarios and expert judgment rather than observed deployments, limiting inferential reach.
Methods Rigor: medium. The study uses a recognized Delphi process to elicit and refine expert criteria and applies that rubric systematically to multiple models, which is methodologically sound; but the sample size of experts (20) is moderate, the rubric and scoring remain partly subjective, model versions and prompt details may be under-specified, and external validity across jurisdictions, model updates, and real operational settings is limited.
Sample: Responses from six commercial large language models were evaluated across a set of urban infrastructure decision scenarios of varying complexity; a Delphi panel of 20 infrastructure professionals produced the benchmark rubric emphasizing public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability; analyses included source verifiability checks, self-reported model confidence, and agreement with expert judgment.
Themes: governance, human_ai_collab
Generalizability:
- Findings are limited to the six commercial LLMs and specific model versions tested (may not hold for other or updated models).
- Delphi experts (n=20) may not represent the geographic, disciplinary, or regulatory diversity of all cities.
- Scenarios are simulated rather than observations of deployed city decision-making, so operational behavior and organizational factors are not fully captured.
- The rubric reflects aggregated expert judgments and thus carries subjectivity that may not generalize across jurisdictions or policy contexts.
- Measures like citation verifiability depend on the adjudication procedure used and could vary with stricter or looser checks.

Claims (10)

Claim	Category	Direction	Outcome	Confidence & Evidence	Details
This study presents a sociotechnical audit of six commercial LLMs by comparing their reasoning with a Delphi-derived rubric constructed from the responses of twenty infrastructure professionals.	Research Productivity	neutral	comparison of LLM reasoning to expert-derived rubric	Reading fidelity: high; Study strength: medium	n=20 0.18
The Delphi process elicited and refined expert reasoning criteria, producing a rubric that emphasized public safety, regulatory compliance, contextual judgment, financial stewardship, and system reliability.	Governance And Regulation	positive	content/themes of the expert-derived rubric	Reading fidelity: high; Study strength: medium	n=20 0.18
LLMs often generate responses with the structural clarity associated with early-career engineers, yet they display persistent weaknesses in factual grounding and contextual interpretation.	Output Quality	mixed	response structure and factual/contextual quality	Reading fidelity: high; Study strength: medium	not reported 0.18
Across all models, 51.3% of cited sources were unverifiable or fabricated.	Output Quality	negative	verifiability of cited sources	Reading fidelity: high; Study strength: medium	51.3% 0.18
LLM self-reported confidence was negatively correlated with actual reasoning quality (r = -0.23), meaning the lowest-performing models projected the greatest certainty.	Decision Quality	negative	relationship between self-reported confidence and measured reasoning quality	Reading fidelity: high; Study strength: medium	r = -0.23 0.18
Decision alignment with expert judgment degraded as scenario complexity increased, with strong agreement on operational triage but near-complete divergence on strategic capital allocation.	Decision Quality	negative	alignment between LLM decisions and expert judgment across scenario complexity	Reading fidelity: high; Study strength: medium	not reported 0.18
Many responses misinterpreted regulatory requirements or relied on shallow justification.	Regulatory Compliance	negative	accuracy of regulatory interpretation and depth of justification	Reading fidelity: high; Study strength: medium	not reported 0.18
These failures extend beyond technical accuracy and introduce risks for governance, fiscal responsibility, and regulatory compliance.	Governance And Regulation	negative	risks to governance, fiscal responsibility, regulatory compliance	Reading fidelity: high; Study strength: speculative	not reported 0.03
Methodologically, the study demonstrates how expert reasoning can be operationalized as a benchmark for evaluating AI systems in urban infrastructure contexts, addressing gaps in empirical assessment and governance tools.	Research Productivity	positive	feasibility of operationalizing expert reasoning as evaluation benchmark	Reading fidelity: high; Study strength: medium	n=20 0.18
The findings carry direct implications for accountability, institutional integrity, and public trust in urban governance, and contribute to ongoing discourse on responsible AI adoption in cities aligned with global sustainability priorities.	Governance And Regulation	negative	implications for accountability, institutional integrity, public trust	Reading fidelity: high; Study strength: speculative	not reported 0.03