A World Bank–sourced AI evidence engine helped policy and development experts reclaim roughly 2.4–3.9 hours per week by producing verifiable syntheses and declining unsupported queries; the result comes from an observational DiD analysis of in-the-wild use and mixed-methods triangulation, so selection and measurement caveats remain.

Learning from AVA: Early Lessons from a Curated and Trustworthy Generative AI for Policy and Development Research

Nimisha Karnatak, Mohamad Chatila, Daniel Alejandro Pinzón Hernández, Reza Yazdanfar, Michelle Dugas, Renos Vakis · April 20, 2026

Source: arXiv · Paper type: quasi-experimental · Evidence strength: medium · Relevance: 7/10

A specialized LLM platform (AVA) grounded in World Bank reports and designed for citation verifiability and reasoned abstention is associated with 2.4–3.9 fewer hours spent weekly by engaged policy and development professionals, with qualitative evidence that verifiable citations and abstention improved trust and scope clarity.

General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs. We present AVA (AI + Verified Analysis), a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities. AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses. It operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection). We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews. Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly. Qualitatively, participants used AVA as a specialized "evidence engine"; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations. We contribute design guidelines for specialized AI and articulate a vision for "ecosystem-aware" Humble AI.

Summary

Main Finding

AVA, a domain-bounded generative AI built on a curated library of 4,000+ World Bank reports, suggests that operationalizing epistemic humility (page-level, verifiable citations plus reasoned abstention) in a multi-agent RAG pipeline can support user trust and measurable productivity gains in real-world policy and development work. In a five-month, in-the-wild deployment with 2,200+ professionals across 116 countries, sustained AVA engagement is associated with estimated weekly time savings of 2.4–3.9 hours. Qualitatively, users adopted AVA as a specialized “evidence engine,” valuing in-place verification and scope-aware refusals.

Key Points

System design
- Domain-bounded RAG: corpus = 4,000+ curated World Bank reports, hierarchically indexed.
- Multi-agent pipeline with staged workflow: (1) corpus curation & indexing, (2) agentic retrieval (decomposer, planner, tree-walker, drafting), (3) synthesis & verification (evidence classifier, mapping claims to source spans), (4) user personalization & memory.
- Outputs: synthesized answers with page-anchored, clickable citations; reasoned abstention when evidence is insufficient; multilingual support (>60 languages); in-place drafting workspace.
Epistemic humility operationalized
- Citation verifiability: every factual claim traces to source/page-level evidence.
- Reasoned abstention: explicit "I don't know" with rationale and redirection options, improving boundary legibility.
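The two mechanisms above can be illustrated with a minimal Python sketch. It is not AVA's implementation: the `Evidence` and `Claim` types, the `score` field, the `MIN_SUPPORT` threshold, and the wording of the abstention message are all assumptions made for illustration. The sketch answers only when every claim has at least one sufficiently supported, page-anchored source, and otherwise abstains with a rationale and a redirection.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    report_id: str  # hypothetical identifier of a source report
    page: int       # page anchor used in the citation
    score: float    # hypothetical support score in [0, 1]


@dataclass
class Claim:
    text: str
    evidence: list[Evidence]


MIN_SUPPORT = 0.7  # assumed threshold, not taken from the paper


def respond(claims: list[Claim]) -> dict:
    """Return a cited answer only if every claim is supported; otherwise abstain."""
    unsupported = [
        c.text for c in claims
        if not any(e.score >= MIN_SUPPORT for e in c.evidence)
    ]
    if not claims or unsupported:
        # Reasoned abstention: say why, list what failed, suggest a next step.
        return {
            "status": "abstain",
            "rationale": "No sufficiently supported evidence in the curated corpus.",
            "unsupported": unsupported,
            "redirect": "Try a narrower query or consult sources outside the corpus.",
        }
    cited = []
    for c in claims:
        # Citation verifiability: attach the best-supported page anchor.
        best = max(c.evidence, key=lambda e: e.score)
        cited.append(f"{c.text} [{best.report_id}, p. {best.page}]")
    return {"status": "answer", "text": " ".join(cited)}


supported = respond([Claim("X rose.", [Evidence("WB-1", 12, 0.9)])])
refused = respond([Claim("Y fell.", [Evidence("WB-2", 3, 0.2)])])
```

Here `supported` carries a page-anchored citation, while `refused` is an abstention that names the unsupported claim. A real system would derive the support score from a verification step rather than take it as given.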
Evaluation
- Mixed methods, five-month deployment: 2,259 registrants; 3,797 queries logged; matched sample n ≈ 1,029 for pre/post analyses; 20 semi-structured interviews; global sample across 116 countries and heterogeneous organizations (NGOs, governments, academia, private sector).
- Difference-in-Differences analysis links sustained use to 2.4–3.9 hours/week saved.
- Qualitative findings: users used AVA for verification-intensive tasks, trusted outputs more when provenance and page anchors were present, interpreted abstentions as useful scope signals, and employed AVA outputs with informed disclosure practices.
Design trade-offs and lessons
- Reliability vs coverage: curated corpus reduces hallucination risk but limits topical coverage; abstention and redirection are crucial to manage unmet information needs.
- End-to-end trust pipeline needed: corpus curation → retrieval quality → explicit abstention → verifiable output.
- Interface guidelines: prioritize verification, make provenance inspectable, surface abstention rationale and next steps.
- Vision for ecosystem-aware Humble AI: domain-specialized systems should interoperate with general models for coverage while preserving verification guarantees.

Data & Methods

Corpus & System
- Curated source set: 4,000+ World Bank Reports, hierarchically indexed into a RAG database.
- System architecture: embeddings + retrieval, multi-agent planning and tree-walking, evidence extraction, verification ensemble, page-level citation linking, multilingual generation.
Deployment & samples
- Five-month, multi-institutional in-the-wild deployment with discretionary adoption.
- Users: 2,200+ registrants across 116 countries; 95.4% from external organizations (NGOs, governments, academia, private sector).
- Usage logs: 3,797 user queries; matched pre/post survey sample n ≈ 1,029.
- Qualitative: 20 semi-structured interviews exploring verification practices, refusal experiences, workflow integration.
Analyses
- Log analysis to characterize session flows, retrieval/verification interactions, reuse patterns.
- Pre–post surveys measuring perceived efficiency, trust, and adoption; Difference-in-Differences (DiD) estimates for sustained use → time-saved effects (2.4–3.9 hours/week).
- Thematic coding of interview transcripts to surface appropriation patterns, trust calibration, and disclosure norms.
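For readers unfamiliar with Difference-in-Differences, the following is a minimal two-group, two-period sketch in Python. It is not the paper's specification: the outcome (weekly hours spent on evidence-gathering tasks), the group labels, and every number are made up for illustration, and a real analysis would add covariates, standard errors, and pre-trend checks.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """2x2 DiD: change in the treated group minus change in the comparison group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Toy values only (hypothetical weekly hours on evidence-gathering tasks).
sustained_pre = [10.0, 12.0, 11.0]   # sustained users, before deployment
sustained_post = [7.0, 8.0, 9.0]     # sustained users, after deployment
other_pre = [10.0, 11.0, 12.0]       # other users, before deployment
other_post = [10.0, 10.0, 10.0]      # other users, after deployment

effect = did_estimate(sustained_pre, sustained_post, other_pre, other_post)
print(f"DiD estimate: {effect:+.1f} hours/week")
```

A negative estimate in this toy setup means sustained users' hours fell by more than the comparison group's did. The estimate has a causal reading only if both groups would have followed parallel trends without the tool, which self-selection into sustained use can violate.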
Limitations of methods reported in the paper
- Corpus restricted to World Bank reports (coverage limitations; domain-specific bias).
- Observational deployment with self-selected users, not a randomized controlled trial; the DiD-based causal claims are suggestive rather than definitive.
- Language and topic coverage beyond the corpus constrained; external validity to other document types/domains requires further testing.

Implications for AI Economics

Productivity and labor impacts
- Time savings: estimates of 2.4–3.9 hours/week per engaged user suggest substantive productivity gains for analysts and researchers who perform literature review and evidence synthesis. Aggregated across teams, this could represent large labor-time reallocations.
- Task substitution vs augmentation: AVA is used as an evidence-acceleration tool (not full replacement). Economists should model effects as augmentation that increases throughput and may shift skilled labor toward higher-level analysis and decision-making.
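The aggregation mentioned above is simple arithmetic, sketched below in Python. Only the 2.4–3.9 hours/week range comes from the summary; the team size and number of working weeks are hypothetical inputs, and the calculation assumes the per-user estimate applies uniformly to every engaged user.

```python
LOW_HOURS, HIGH_HOURS = 2.4, 3.9  # hours saved per engaged user per week (from the summary)


def aggregate_hours(engaged_users: int, weeks: int) -> tuple[float, float]:
    """Scale the per-user weekly range to a team over a number of weeks."""
    low = LOW_HOURS * engaged_users * weeks
    high = HIGH_HOURS * engaged_users * weeks
    return low, high


# Hypothetical team: 50 engaged users over 48 working weeks.
low, high = aggregate_hours(engaged_users=50, weeks=48)
print(f"Estimated hours freed per year: {low:,.0f} to {high:,.0f}")
```

Because the underlying estimate is observational and applies to engaged users only, totals like this are better read as an upper-level planning figure than as a forecast.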
Value of curated, domain-bounded systems
- Cost–benefit trade-off: building and maintaining curated corpora and verification infrastructure incurs upfront/ongoing costs but materially reduces hallucination risk and raises trust, which is valuable in high-stakes domains. Economic evaluation should compare (i) costs of curation/maintenance + system ops versus (ii) time saved, error reductions, and faster policy cycles.
- Market segmentation: there is economic space for specialized, institution-backed generative agents (trusted knowledge products) alongside general-purpose LLMs; procurement and pricing models should reflect trust premiums.
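The cost–benefit comparison can be framed as a break-even question: how many engaged users are needed before the value of time saved covers the cost of curation and operations? The Python sketch below is one way to set that up. The hourly labor cost and annual system cost are hypothetical placeholders, and the sketch values only time savings, ignoring error reductions and faster policy cycles.

```python
import math


def break_even_users(hours_saved_per_week: float, weeks: int,
                     hourly_cost: float, annual_system_cost: float) -> int:
    """Smallest number of engaged users whose time savings cover the annual cost."""
    value_per_user = hours_saved_per_week * weeks * hourly_cost
    return math.ceil(annual_system_cost / value_per_user)


# Hypothetical inputs: 48 working weeks, 50 currency units per hour of staff time,
# 500,000 currency units per year for curation, maintenance, and operations.
users_at_low = break_even_users(2.4, 48, 50.0, 500_000)   # pessimistic end of the range
users_at_high = break_even_users(3.9, 48, 50.0, 500_000)  # optimistic end of the range
print(users_at_low, users_at_high)
```

Running the calculation at both ends of the 2.4–3.9 range gives a band of break-even user counts, which is more informative than a single point given the uncertainty in the estimate.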
Risk, trust, and decision-making
- Trust calibration: explicit provenance and abstention change how users interpret outputs. Economists modelling information flows should include credibility signals (institutional provenance, page-anchors) as modifiers of impact on decisions.
- Disclosure and accountability norms: the platform’s affordances shape professional disclosure practices. Policy outcomes and liability frameworks need to consider how evidence provenance and system abstention affect responsibility assignment.
Research and evaluation directions for AI economics
- Quantify downstream economic benefits: beyond self-reported hours saved, measure effects on decision quality, policy outcomes, error rates, and time-to-decision in controlled trials or natural experiments.
- ROI analysis: estimate lifecycle costs of corpus curation, annotation, multilingual support, and verification versus realized productivity and risk mitigation benefits.
- Hybrid architectures: explore economic models for hybrid systems that combine curated domain corpora (high trust, limited coverage) with gated web-scale augmentation (coverage) while maintaining verifiability—study marginal benefits of coverage expansion.
- Labor market effects: model reallocation of skilled research labor, demand for verification/curation roles, and potential changes in wage structure for policy research jobs.
Practical recommendations for economists advising institutions
- Invest in curated, verifiable knowledge bases where policy errors are costly; design procurement to include long-term curation budgets.
- Require citation and abstention mechanisms when deploying generative tools for policy advice; factor verification quality into vendor comparisons.
- Pilot and measure: deploy domain-bounded assistants in staged pilots with pre/post measurement of decision-relevant outcomes, not just time savings.

Summary takeaway: AVA shows that specialized, curated RAG systems that operationalize epistemic humility can deliver measurable productivity and trust benefits for policy research. For AI economics, the paper highlights the importance of valuing verification and abstention mechanisms in economic assessments of generative AI deployments, and motivates rigorous ROI and downstream-effect studies comparing curated vs open-web models.

Assessment

Paper Type: quasi_experimental
Evidence Strength: medium. The paper uses a plausible quasi-experimental DiD approach and triangulates with log data, surveys, and interviews, providing consistent evidence of time savings; however, the key outcome (hours saved) and treatment (sustained engagement) are subject to selection and measurement concerns and the setting is domain-specific, limiting causal confidence.
Methods Rigor: medium. The mixed-methods design (log analysis, surveys, interviews) and DiD identification are appropriate and strengthen inference, but absent random assignment there remain potential confounders (self-selection into sustained use), possible lack of demonstrated pre-trends or balance tests in the description, and some reliance on self-reported outcomes.
Sample: In-the-wild deployment with over 2,200 individuals from heterogeneous organisations and roles across 116 countries; data sources include platform interaction logs, user surveys, and 20 semi-structured interviews; AVA is built on a curated, multilingual library of >4,000 World Bank reports.
Themes: productivity, human_ai_collab, adoption, governance
Identification: Difference-in-Differences comparing users with sustained engagement with AVA to other users over time (using logs and surveys) to estimate changes in weekly hours saved, supplemented by robustness checks and qualitative triangulation (interviews, surveys).
Generalizability:
- Users self-selected into AVA and into sustained engagement, raising selection bias concerns.
- Sample consists of development and policy professionals interacting with World Bank–sourced content, not a representative population of workers or firms.
- Findings are tied to a specialized evidence-focused interface and corpus (World Bank reports), limiting transferability to other domains or general-purpose LLMs.
- Primary outcome (hours saved) may be partly self-reported and short-term.
- Languages, institutional contexts, and internet access variations across countries may affect external validity.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
General-purpose LLMs pose misinformation risks for development and policy experts, lacking epistemic humility for verifiable outputs.	AI Safety and Ethics	negative	high	misinformation risk / epistemic humility	0.08
AVA is a GenAI platform built on a curated library of over 4,000 World Bank Reports with multilingual capabilities.	Other	positive	high	system corpus size / multilingual capability	0.8
AVA's multi-agent pipeline enables users to query and receive evidence-based syntheses.	Output Quality	positive	high	output: evidence-based syntheses	0.48
AVA operationalizes epistemic humility through two mechanisms: citation verifiability (tracing claims to sources) and reasoned abstention (declining unsupported queries with justification and redirection).	AI Safety and Ethics	positive	high	epistemic humility operationalization (citation verifiability and reasoned abstention)	0.48
We conducted an in-the-wild evaluation with over 2,200 individuals from heterogeneous organisations and roles in 116 countries, via log analysis, surveys, and 20 interviews.	Other	null_result	high	evaluation sample and methods	n=2200 0.48
Difference-in-Differences estimates associate sustained engagement with 2.4-3.9 hours saved weekly.	Task Completion Time	positive	high	time saved per week	n=2200 2.4-3.9 hours saved weekly 0.48
Qualitatively, participants used AVA as a specialized 'evidence engine'; reasoned abstention clarified scope boundaries, and trust was calibrated through institutional provenance and page-anchored citations.	AI Safety and Ethics	positive	high	user behavior and trust calibration (use as evidence engine; role of abstention and provenance in trust)	n=20 0.48
We contribute design guidelines for specialized AI and articulate a vision for 'ecosystem-aware' Humble AI.	Governance and Regulation	positive	high	design guidance / conceptual framework	0.08