A student-built benchmark exposes large gaps in deep research AIs: across 256 humanities and social-science questions the average pass rate is under 17%, with the leading system (GPT-5.5) clearing 57.6%; building such tests in class helps students learn to judge machine-produced knowledge.

Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

Haiyang Shen, Jiuzheng Wang, Taian Guo, Mugeng Liu, Wenchun Jing, Chongyang Pan, Siqi Zhong, Zhiyang Chen, Weichen Bi, Yudong Han, Xiaoying Bai, Yun Ma · May 20, 2026

Source: arXiv · Paper type: descriptive · Evidence strength: medium · Relevance: 7/10

Students built QuestBench, a 256-question benchmark in humanities and social sciences, and evaluation across 13 deep research systems finds a mean pass rate of 16.85% (best system GPT-5.5 at 57.58%), revealing frequent, pedagogically useful failures where fluent, source-backed answers still miss required queries, sources, terms, or evidence standards.

As AI becomes part of everyday learning, many courses teach students to use it mainly as a productivity tool: how to prompt, search, summarize, write, code, and use tools more efficiently. We argue that AI education also needs a setting in which students learn to test AI and understand their own role in judging machine-produced knowledge. To this end, we introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work. Students turn disciplinary knowledge into verifiable expert-level questions, review one another's designs for ambiguity and shortcuts, and evaluate AI systems on the resulting tasks. This activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require. The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains. Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%, and the best-performing system, GPT-5.5, reaches a 57.58% pass rate. The failures are educationally useful because they show how fluent, source-backed answers can still miss the right query, source, term, or evidence standard. Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs. We present QuestBench as a benchmark artifact and as a reusable classroom setting for a larger educational question: how students can remain responsible knowledge actors as AI enters learning and professional work. The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.

Summary

Main Finding

A course-based benchmark construction exercise (QuestBench), in which undergraduate students design, peer-review, and validate expert-level questions from the humanities and social sciences, serves as an effective pedagogical tool for teaching accountability in AI-mediated knowledge work and also reveals substantial limits of current "deep research" systems. On 256 student-designed, verifiable questions spanning 14 humanities and social-science domains, state-of-the-art systems show low pass rates (mean question-level pass rate of 16.85% across 13 systems; best system, GPT‑5.5, at 57.58%), with common failures that fluent, citation-backed answers can conceal.

Key Points

Educational innovation: Treats benchmark construction as a classroom practice that trains students to (a) design verifiable, domain-aware tasks, (b) define grading/evidence standards, (c) audit shortcuts, and (d) analyze AI failures.
Dataset artifact: QUESTBENCH contains 256 questions authored by 85 students across 37 self-reported fields (normalized into 14 domains such as law, history, literature, international relations).
Construction protocol: Five design requirements (domain expertise, long-tail sources, answer uniqueness/verifiability, complexity-evolution documentation, anti-shortcut validation), combined with peer review and an iterative three-stage filter (preliminary screening, three-round expert validation, domain normalization).
Technical evaluation: Models must search the open web and answer in Chinese; evaluated systems include GPT‑5.5, Claude Opus, Gemini, GLM, several research search agents and industry deep-search systems.
Performance and failure modes: Scores ranged ~14.6–67.1/100 across models; common failure patterns are retrieval failure, unsupported inference, entity confusion, and answer-extraction errors — often due to discipline-specific standards (wrong legal version, provenance confusion, edition/translation issues).
Educational payoff: Student reflections indicate the activity shifts their view of disciplinary knowledge from passive content to operational standards for judging AI outputs.
Resource: QUESTBENCH dataset available publicly (Hugging Face).

Data & Methods

Data
- 256 curated questions in Chinese.
- 85 student authors from humanities & social sciences; initial 37 field labels normalized to 14 domains.
- Each question packaged with a reference answer, explicit grading criteria, and documentation of how the question was hardened (anti-shortcut checks).
Construction pipeline (high level)
- Stage 1: Preliminary screening to remove trivial or ambiguous items and to filter items by their difficulty for a baseline search-enabled model.
- Stage 2: Iterative expert validation with three independent review rounds per question: (1) answer correctness verification, (2) grading-criteria clarity audit, (3) anti-shortcut validation (no pre-existing answer, no trivial identifiers, no bypassable reasoning).
- Stage 3: Domain normalization and final assembly.
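The three stages can be read as successive filters over candidate questions. The Python sketch below is an illustration under stated assumptions, not the authors' tooling: the `Candidate` fields, the review-round names, the `"unmapped"` fallback, and the rule that items already solved by the baseline model are dropped are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical round names mirroring the three Stage 2 reviews described above.
REVIEW_ROUNDS = ("answer_correct", "criteria_clear", "shortcut_free")


@dataclass
class Candidate:
    question_id: str
    field_label: str            # self-reported field, before normalization
    trivial_or_ambiguous: bool  # flagged during preliminary screening
    baseline_solved: bool       # assumption: the baseline search model already answers it
    reviews: dict = field(default_factory=dict)  # round name -> passed?


def passes_stage1(c: Candidate) -> bool:
    """Stage 1: drop trivial/ambiguous items and (assumed) baseline-solvable ones."""
    return not c.trivial_or_ambiguous and not c.baseline_solved


def passes_stage2(c: Candidate) -> bool:
    """Stage 2: every review round must be recorded and passed."""
    return all(c.reviews.get(r) is True for r in REVIEW_ROUNDS)


def build_benchmark(candidates, domain_map):
    """Stage 3: map field labels to normalized domains for surviving items."""
    return [
        (c.question_id, domain_map.get(c.field_label, "unmapped"))
        for c in candidates
        if passes_stage1(c) and passes_stage2(c)
    ]


# Toy data, invented for illustration only.
ok = dict.fromkeys(REVIEW_ROUNDS, True)
candidates = [
    Candidate("q1", "legal history", False, False, dict(ok)),
    Candidate("q2", "legal history", False, True, dict(ok)),  # too easy for the baseline
    Candidate("q3", "poetics", False, False, {**ok, "shortcut_free": False}),
]
domain_map = {"legal history": "law", "poetics": "literature"}
benchmark = build_benchmark(candidates, domain_map)
print(benchmark)
```

Requiring all three rounds to pass, rather than a majority, reflects the summary's description of each round checking a different property of the question.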
Evaluation task
- Each model M (with web search and document visits) answers each question q; its answer â is scored against the reference answer a under grading criteria G to produce a score s ∈ [0, 100]. Models must retrieve evidence from the open web, and all materials are in Chinese.
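To make the scoring setup concrete, the sketch below turns per-question scores s into a per-system mean score and a question-level pass rate. The pass threshold is not given in this summary, so `PASS_THRESHOLD = 60.0` is a placeholder assumption, and the system names and scores are toy data, not results from the paper.

```python
from statistics import mean

PASS_THRESHOLD = 60.0  # assumption: the summary does not state the actual cutoff


def pass_rate(scores, threshold=PASS_THRESHOLD):
    """Fraction of questions whose score s meets the threshold."""
    if not scores:
        raise ValueError("no scores given")
    if any(not 0 <= s <= 100 for s in scores):
        raise ValueError("scores must lie in [0, 100]")
    return sum(s >= threshold for s in scores) / len(scores)


def summarize(scores_by_system, threshold=PASS_THRESHOLD):
    """Return per-system statistics and the mean pass rate across systems."""
    per_system = {
        name: {"mean_score": mean(scores), "pass_rate": pass_rate(scores, threshold)}
        for name, scores in scores_by_system.items()
    }
    overall = mean(row["pass_rate"] for row in per_system.values())
    return per_system, overall


# Toy scores for two hypothetical systems on the same four questions.
toy = {"system_a": [100, 0, 50, 80], "system_b": [0, 0, 70, 10]}
per_system, overall = summarize(toy)
print(per_system, overall)
```

When every system answers the same set of questions, averaging the per-system pass rates gives the same number as pooling all question-system pairs.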
Models evaluated
- Thirteen systems spanning research and frontier commercial agents (examples: GPT‑5.5, Claude Opus 4.7, Gemini 3.1 Pro, GLM 5.1, DeepSeek variants, Kimi variants, Qwen, MiMo, MiniMax, Seed family).
Results (highlights)
- Mean pass rate across evaluated systems: ≈16.85% (question-level).
- Best pass rate: GPT‑5.5 ≈57.58%.
- Score range across systems: ≈14.58 to 67.12 out of 100.
- Identified failure categories emphasize interactional errors (query/source/term/extraction) under domain standards, not merely lack of raw information.

Implications for AI Economics

Complementarity and skilled verification labor
- Results imply strong complementarities: domain experts (or verifiers) remain essential where discipline-specific standards matter. Automation reduces some search/summary costs but raises verification and auditing demand.
- Economic value of AI in knowledge work must net out costs of oversight, rework, and error mitigation.
Productivity measurement and mismeasurement risk
- Fluent AI outputs can inflate apparent productivity if assessment relies on surface fluency or citations. True productivity gains need adjustment for verification time and the quality of final decisions.
- Benchmarks like QUESTBENCH help quantify effective utility (correctness under domain standards) rather than superficial fluency.
Labor demand shifts and skill premiums
- Demand is likely to rise for skills in (a) defining tasks and standards, (b) source-provenance validation, (c) anti-shortcut question design, and (d) interpreting model failures. These are less automatable and may carry higher wages.
- Education that trains verification and benchmark-construction skills can increase worker employability where deep-research tools are adopted.
Market for auditing, benchmarking, and certification
- Need for third-party evaluation services and benchmark-driven procurement: buyers (universities, law firms, publishers, media organizations) will value vendors that demonstrate robustness on domain-specific, anti-shortcut benchmarks.
- Benchmarks built via low-cost course activity suggest scalable ways to create domain-focused evaluation sets, lowering entry barriers for vertical auditing markets.
Pricing AI products & procurement decisions
- Economic valuation of AI tools should incorporate domain-specific pass rates and expected downstream correction costs. A tool with higher citation/fluency but lower pass rate may be cheaper but riskier.
- Procurement contracts and SLAs should reference domain benchmarks or verification metrics rather than aggregate language-model scores.
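One way to fold pass rates into a purchase decision is an expected cost per question: the tool's price, plus the cost of checking the answer, plus the correction cost weighted by the failure rate. The sketch below is a simplification under assumed inputs. The two pass rates are the best-system and cross-system mean figures reported above; the tool names, prices, and labor costs are invented for illustration, and the model assumes every answer is verified and every failure is caught.

```python
def expected_cost(price, pass_rate, verify_cost, correction_cost):
    """Expected cost per question: price + verification + failure-weighted correction."""
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must lie in [0, 1]")
    return price + verify_cost + (1.0 - pass_rate) * correction_cost


# Hypothetical tools: prices are invented; pass rates reuse figures from the summary.
tools = {
    "pricier_tool": {"price": 0.50, "pass_rate": 0.5758},
    "cheaper_tool": {"price": 0.10, "pass_rate": 0.1685},
}
VERIFY_COST = 5.0       # assumed cost of expert time to check one answer
CORRECTION_COST = 40.0  # assumed cost of redoing one failed answer by hand

costs = {
    name: expected_cost(t["price"], t["pass_rate"], VERIFY_COST, CORRECTION_COST)
    for name, t in tools.items()
}
cheapest_overall = min(costs, key=costs.get)
print(costs, cheapest_overall)
```

Under these made-up costs, the tool with the lower sticker price ends up more expensive per question once corrections are counted, which is the trade-off the bullets above describe.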
Regulation, liability, and quality externalities
- Errors in applied knowledge (legal, historical, medical-adjacent policy analysis) can have outsized costs. Regulators and firms should require demonstrable domain accuracy and human-in-the-loop standards for certain tasks.
- Benchmark-based certification could play a role in liability allocation and compliance frameworks.
Research opportunities for economists
- Use QUESTBENCH to estimate task-level automation potential and heterogeneity across fields — input for models of occupational exposure and wage dynamics.
- Combine model performance with time-to-verify data to estimate net productivity changes and the return to verification skills.
- Run field experiments comparing human-only, AI-assisted, and AI+verified workflows to measure impacts on output quality, speed, and downstream economic outcomes.
- Study market formation for benchmarking/auditing services and how credentialing based on benchmark expertise affects labor markets.
Education and human-capital policy
- Integrating benchmark construction into curricula is low-cost and yields measurable human capital in verification and evaluative judgment — skills likely to be rewarded as AI tools diffuse.
- Policymakers should support pedagogies that teach students to define standards and audit AI outputs, not only to prompt or use tools.

Suggested immediate uses for researchers/policymakers
- Economists can incorporate QUESTBENCH into experiments to price the marginal value of domain expertise in AI-assisted tasks.
- Procurement teams can evaluate candidate systems on QUESTBENCH to inform purchase and integration decisions in H&SS-heavy workflows.
- Educators can replicate the course-based benchmark-construction model to build domain-specific evaluation sets and teach accountable AI use.

Overall, QUESTBENCH highlights that measurable AI progress must be judged against domain-aware, anti-shortcut standards — an insight with concrete implications for labor demand, productivity accounting, market design, procurement, and educational policy in the AI era.

Assessment

Paper Type: descriptive

Evidence Strength: medium. Provides empirical evaluation of 13 deep research systems on a purposely designed 256-question benchmark, giving concrete, reproducible measures of system failure modes; however, it does not claim causal effects on economic outcomes, the sample of tasks is small and student-designed (potentially idiosyncratic), and performance depends on snapshot model versions and subjective pass criteria.

Methods Rigor: medium. The paper documents a systematic classroom protocol for question construction, peer review, and evaluation and reports question-level pass rates across multiple systems, but methods rely on student-created items that may vary in difficulty and style, evaluative judgments that may be partially subjective, and the study lacks broader external validation or inter-rater reliability reporting.

Sample: QuestBench, 256 expert-level questions created by students in a course, spanning 14 humanities and social-science domains; evaluated across thirteen deep research systems (including GPT-5.5 as the top performer) with question-level pass/fail scoring; dataset publicly available on Hugging Face.

Themes: human_ai_collab, skills_training

Generalizability
- Tasks are student-authored and may not represent professional or industry-standard benchmarks.
- Coverage is limited to humanities and social-science domains (not STEM, coding, or other work tasks).
- Pass/fail criteria can be subjective and may not reflect workplace standards or varied user expectations.
- Evaluations are model-version and time-specific; system performance may change rapidly.
- All materials are in Chinese and come from a single course context; cultural and disciplinary scope may be narrow.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
We introduce a course-based practice that teaches AI through benchmark construction, using deep research systems as a concrete example of AI-era knowledge work.	Skill Acquisition	positive	high	students' ability to test and judge AI (educational practice introduced)	0.18
The produced benchmark, QuestBench, consists of 256 questions across 14 humanities and social-science domains.	Research Productivity	positive	high	creation of benchmark dataset (question count and domain coverage)	n=256 0.3
The dataset is available at https://huggingface.co/datasets/PKUAIWeb/QuestBench/tree/main.	Research Productivity	positive	high	public availability of dataset	0.3
Evaluation on QuestBench shows that student-designed tasks reveal hidden failures in current deep research systems: across thirteen evaluated systems, the mean question-level pass rate is only 16.85%.	Output Quality	negative	high	question-level pass rate (model performance on benchmark)	n=13 16.85% 0.18
Across thirteen evaluated systems, the best-performing system, GPT-5.5, reaches a 57.58% pass rate.	Output Quality	positive	high	pass rate of top-performing model	n=13 57.58% 0.18
Student-designed tasks reveal hidden failures in current deep research systems: fluent, source-backed answers can still miss the right query, source, term, or evidence standard.	Output Quality	negative	high	types of model failure (mismatch on query, source selection, terminology, evidence standards)	0.18
Reflections from five student contributors suggest that benchmark construction can help students see professional knowledge not only as content AI may retrieve, but as the basis for judging AI outputs.	Skill Acquisition	positive	high	students' conceptualization of professional knowledge and ability to judge AI outputs	n=5 0.18
The activity gives students direct exposure to a powerful tool while asking them to specify what a trustworthy answer would require.	Training Effectiveness	positive	high	student exposure to AI tools combined with critical evaluation practices	0.18