The Commonplace

Researchers use LLM research assistants more like collaborators than search engines, submitting longer, more complex queries and delegating tasks such as drafting and gap identification; experienced users become more targeted and engage citations more deeply, though simple keyword queries persist.

Understanding Usage and Engagement in AI-Powered Scientific Research Tools: The Asta Interaction Dataset
Dany Haddad, Daniel Bareket, Joseph Chee Chang, Jay DeYoung, Jena D. Hwang, Uri Katz, M. Polak, Sangho Suh, Harshit Surana, Aryeh Tiktinsky, Shriya Atmakuri, Jonathan Bragg, Mike D'Arcy, Sergey Feldman, Amal Hassan-Ali, R. Lozano, Bodhisattwa Prasad Majumder, C. Mcgrady, Amanpreet Singh, Brooke Vlahos, Yoav Goldberg, Doug Downey · Fetched March 31, 2026 · arXiv.org
Source: semantic_scholar · Paper type: descriptive · Evidence: n/a · Relevance: 7/10 · DOI · Source
Analysis of 200k+ interaction logs from two LLM-based research tools shows researchers submit longer, more complex queries, treat the system as a collaborative research partner and persistent artifact, and — with experience — move toward more targeted queries and deeper engagement with citations while still using keyword-style queries.

AI-powered scientific research tools are rapidly being integrated into research workflows, yet the field lacks a clear lens into how researchers use these systems in real-world settings. We present and analyze the Asta Interaction Dataset, a large-scale resource comprising over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform. Using this dataset, we characterize query patterns, engagement behaviors, and how usage evolves with experience. We find that users submit longer and more complex queries than in traditional search, and treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps. Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways. With experience, users issue more targeted queries and engage more deeply with supporting citations, although keyword-style queries persist even among experienced users. We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.

Summary

Main Finding

Researchers use deployed LLM-powered research tools not as simple search engines but as collaborative partners: they submit longer, more complex queries, delegate substantive research tasks (e.g., drafting, gap-finding), and treat generated outputs and cited evidence as persistent artifacts. Usage patterns evolve with experience—becoming more targeted and citation-focused—yet simpler, keyword-style queries persist. The authors release the Asta Interaction Dataset (≈200K anonymized queries and logs) and a new query-intent taxonomy to support realistic evaluation and design.

Key Points

  • Dataset: ≈200,000 user queries and interaction logs from two deployed tools (literature discovery interface; scientific question-answering interface) inside a retrieval-augmented generation (RAG) platform.
  • Query characteristics: users issue longer and more complex queries than typical web-search queries.
  • Role of the system: treated as a collaborative research partner; users delegate tasks such as drafting content and identifying research gaps.
  • Artifact persistence: generated responses and cited evidence are revisited and navigated in non-linear ways; users treat outputs as persistent artifacts in their workflows.
  • Engagement and learning: with experience, users tend to (a) issue more targeted queries and (b) engage more deeply with supporting citations; however, simple keyword-style queries remain common even among experienced users.
  • Contribution: anonymized dataset release + a new query-intent taxonomy to guide future designs and to enable more realistic evaluation of research assistants.

Data & Methods

  • Data source: interaction logs and textual queries collected from two production research-assistant tools within an LLM-based RAG platform.
  • Scale: roughly 200,000 queries across many users and sessions.
  • Analyses: descriptive and exploratory characterization of query patterns (length, complexity), temporal/experience-based changes in behavior, and interaction behaviors (navigation among outputs and citations, revisitation patterns).
  • Taxonomy: qualitative or mixed-method coding produced a new query-intent taxonomy to classify real-world research queries and inform evaluation design.
  • Release: the dataset and accompanying analysis (anonymized) are published to enable replication and downstream work.
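The experience-based analysis described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's code: the real Asta log schema is not specified in this digest, so the record fields (`user_id`, `timestamp`, `query_text`) and the word-count complexity proxy are assumptions.

```python
# Hypothetical sketch of an experience-vs-query-length analysis.
# Field names (user_id, timestamp, query_text) are assumed, not from the paper.
from collections import defaultdict
from statistics import mean

def query_length_by_experience(logs):
    """Mean query length (in words) grouped by how many queries the
    user has issued so far (a simple proxy for experience)."""
    per_user = defaultdict(list)
    for rec in sorted(logs, key=lambda r: (r["user_id"], r["timestamp"])):
        per_user[rec["user_id"]].append(rec["query_text"])
    by_rank = defaultdict(list)
    for queries in per_user.values():
        for rank, q in enumerate(queries, start=1):
            by_rank[rank].append(len(q.split()))
    return {rank: mean(lengths) for rank, lengths in sorted(by_rank.items())}

# Tiny synthetic example (the real dataset has ~200k queries):
logs = [
    {"user_id": "u1", "timestamp": 1, "query_text": "protein folding"},
    {"user_id": "u1", "timestamp": 2,
     "query_text": "what datasets benchmark protein structure prediction"},
    {"user_id": "u2", "timestamp": 1, "query_text": "llm evaluation"},
]
print(query_length_by_experience(logs))
```

A real replication on the released dataset would also need the paper's complexity measures and intent labels, which go beyond raw word counts.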

Implications for AI Economics

  • Productivity and task-shifting

    • Tools enable delegation of substantive tasks (drafting, gap identification), implying potential productivity gains and a shift in researchers’ time allocation from routine search to higher-level oversight and synthesis.
    • Economic impact depends on how quality scales: if outputs reduce time-to-result without harming quality, research throughput and returns to research investment may rise.
  • Complementarity vs. substitution

    • The collaborative treatment suggests complementarity between tools and skilled researchers (tools augment cognitive tasks). However, automation of routine tasks could substitute for some junior or administrative research work.
    • Heterogeneous effects likely across roles and seniority: experienced users appear to extract more value, signaling skill-biased complementarities.
  • Learning, human capital, and adoption dynamics

    • Usage evolves with experience (more targeted queries, deeper citation engagement), indicating on-the-job learning and increasing returns to experience. Adoption models should incorporate learning curves and heterogeneity in use intensity.
    • Persistent prevalence of keyword-style queries suggests limits to learning or variation in worker incentives/time constraints.
  • Market design and pricing

    • Value derives not just from raw model output but from features that support persistence, citation engagement, and non-linear navigation. Product and pricing strategies should reflect these workflow integrations (e.g., subscription tiers for collaboration/history features).
    • Metrics for product success should go beyond query-response latency and BLEU-like metrics to include measures of citation engagement, revision cycles, and downstream research outputs.
  • Evaluation and policy

    • Standard benchmarking (isolated question answering) may misrepresent real-world value; the released taxonomy and dataset enable evaluations that reflect actual intents and workflows, informing procurement and regulatory assessment.
    • Persistent artifacts and citation behaviors raise questions about provenance, reproducibility, and attribution—implications for academic norms, incentives, and IP policy.
  • Research and public-good considerations

    • Release of an anonymized interaction dataset lowers barriers for independent study of economic impacts and tool design, enabling better calibration of models of adoption, productivity, and labor-market effects.
    • Future empirical work should quantify effects on publication output, time-to-discovery, research quality, and labor demand within research organizations.

Suggested next empirical steps for economists

  • Link interaction logs to researcher outputs (papers, grants) to estimate causal effects on productivity and quality.
  • Quantify heterogeneous returns by experience, field, and role to assess distributional impacts.
  • Measure substitution vs. complementarity by tracking task allocation and staffing needs over time.
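The first step above, regressing researcher output on tool usage, can be sketched with ordinary least squares. This is illustrative only: the data below are synthetic (not from the Asta dataset), and a credible causal estimate would require linked outcomes, panel variation, and an identification strategy.

```python
# Illustrative OLS sketch on synthetic data; coefficients are fabricated
# for the example and carry no empirical content about Asta.
import numpy as np

rng = np.random.default_rng(0)
n = 500
usage = rng.poisson(20, n)            # synthetic: queries issued per researcher
seniority = rng.integers(1, 30, n)    # synthetic confounder: years of experience
papers = 0.05 * usage + 0.1 * seniority + rng.normal(0, 1, n)  # synthetic outcome

# OLS of output on usage, controlling for seniority.
X = np.column_stack([np.ones(n), usage, seniority])
beta, *_ = np.linalg.lstsq(X, papers, rcond=None)
print(f"usage coefficient: {beta[1]:.3f}")
```

Even with controls, an estimate like this is only associational; the dataset release makes such designs testable but does not by itself supply the exogenous variation needed for causal claims.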

Assessment

  • Paper Type: descriptive
  • Evidence Strength: n/a — The paper is a descriptive/observational analysis of interaction logs and does not attempt causal identification or estimate causal effects; its findings document usage patterns rather than causal impacts on productivity or labor outcomes.
  • Methods Rigor: medium — Uses a large-scale (200k+ query) anonymized interaction log from two deployed research tools and develops a taxonomy and behavioral characterizations, which is appropriate for descriptive aims; however, rigor is limited by potential selection bias (single platform, self-selected users), sparse information on sample composition (user counts, disciplines, time window), possible measurement/annotation choices for the intent taxonomy, and lack of external validation or mixed-method triangulation.
  • Sample: An anonymized dataset of over 200,000 user queries and interaction logs collected from two deployed tools (a literature-discovery interface and a scientific question-answering interface) within a single LLM-powered retrieval-augmented generation (RAG) platform (Asta); metadata about users (counts, disciplines, geographic distribution) and the precise time window are not reported in the abstract.
  • Themes: human_ai_collab, productivity
  • Generalizability:

    • Platform-specific (data from a single RAG platform, Asta), so findings may not generalize to other LLM implementations or UI designs.
    • Self-selected user base (early adopters or those seeking AI-assisted research), likely not representative of all researchers.
    • Specific to scientific research workflows; may not apply to other domains (e.g., business, consumer search).
    • Potential language and discipline biases (likely English-dominant and particular research fields).
    • Anonymization/logging may remove context (offline work, external tools), so observed interactions are partial.

Claims (7)

  • Claim: The Asta Interaction Dataset comprises over 200,000 user queries and interaction logs from two deployed tools (a literature discovery interface and a scientific question-answering interface) within an LLM-powered retrieval-augmented generation platform.
    Outcome: Other · Direction: null_result · Confidence: high · Details: size and composition of dataset (number of queries, tools included) · n = 200,000 · 0.3
  • Claim: Users submit longer and more complex queries than in traditional search.
    Outcome: Research Productivity · Direction: positive · Confidence: medium · Details: query length and complexity · n = 200,000 · 0.11
  • Claim: Users treat the system as a collaborative research partner, delegating tasks such as drafting content and identifying research gaps.
    Outcome: Research Productivity · Direction: positive · Confidence: medium · Details: frequency of delegation behaviors (drafting content, gap identification) in user interactions · n = 200,000 · 0.11
  • Claim: Users treat generated responses as persistent artifacts, revisiting and navigating among outputs and cited evidence in non-linear ways.
    Outcome: Research Productivity · Direction: positive · Confidence: medium · Details: revisit and navigation behavior (frequency of revisits, non-linear navigation patterns) · n = 200,000 · 0.11
  • Claim: With experience, users issue more targeted queries and engage more deeply with supporting citations.
    Outcome: Research Productivity · Direction: positive · Confidence: medium · Details: targeted query frequency and citation engagement over user experience/time · n = 200,000 · 0.11
  • Claim: Keyword-style queries persist even among experienced users.
    Outcome: Research Productivity · Direction: mixed · Confidence: medium · Details: prevalence of keyword-style queries by user experience level · n = 200,000 · 0.11
  • Claim: We release the anonymized dataset and analysis with a new query intent taxonomy to inform future designs of real-world AI research assistants and to support realistic evaluation.
    Outcome: Other · Direction: null_result · Confidence: high · Details: data and taxonomy release · 0.3
