Auditing and Controlling AI Agent Actions in Spreadsheets

Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already been made without their involvement. This lack of transparency leaves users unable to examine the agent's assumptions, identify errors before they propagate, or redirect execution when it deviates from their intent. The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable. Each decision the agent makes is recorded directly in cells that belong to and reflect on the user. We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step. A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow. Users identified their own intent reflected in the agent's actions, detected errors that post-hoc review would have failed to surface, and reported a sense of co-ownership over the resulting output. These findings indicate that meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.

Summary

Main Finding

Pista is a spreadsheet AI agent that makes multi-step agent execution auditable and controllable by decomposing work into traceable steps. By surfacing stepwise explanations together with the underlying formulas and ranges, providing in-step question answering and localized edits, and supporting branching when edits change downstream logic, Pista lets users participate in execution rather than only review final outputs. In user studies, this interaction model improved users’ ability to detect errors, understand agent reasoning, and feel co-ownership of results, suggesting that effective oversight of AI agents in knowledge work requires participation during execution, not just post-hoc review.

Key Points

Problem framed: spreadsheet agents typically act autonomously and hide intermediate decisions; because spreadsheets conflate artifact and computation, invisible agent choices are recorded directly in users’ artifacts and are hard to verify.
Formative study (N=8) identified five recurring challenges:
- C1: inability to probe/clarify agent decisions,
- C2: lack of visibility into underlying computation and data propagation,
- C3: low confidence in editing/exploring alternatives,
- C4: difficulty specifying tasks and constraints upfront,
- C5: not knowing what to question or alternative approaches to try.
Design goals (DG1–DG3): make execution traceable/digestible; surface computation logic; enable steering and safe exploration.
Core Pista features:
- F1 Decomposed execution into bundled, stop-and-review steps (halt when data/logic changes or high-impact decisions occur).
- F2 In-situ plain-language explanations for each step.
- F3 Surfacing formulas, data ranges, and dependencies for affected cells without requiring users to inspect every cell.
- F4 Ask/probe affordance to query the agent’s rationale on-demand.
- F5 Scaffolded task formulation (help users specify constraints/requirements).
- F6 Localized editing to issue step-scoped corrections; edits propagate only downstream.
- F7 Branching exploration: when edits change downstream logic, Pista branches and regenerates subsequent steps from the edited state, preserving the original sequence for comparison.
Interaction tradeoff: step granularity balanced to avoid verbosity while keeping steps interpretable.
User outcomes: participants caught errors that would have been missed in post-hoc review, reported better comprehension of the task and agent, and experienced a sense of co-authorship; active participation changed both outputs and users’ mental models/attitudes toward the agent.
Implementation: Excel add-in (desktop + web) with a right-side panel showing plan, step nodes, affected ranges, formulas, and controls for Ask/Edit/branch navigation.
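
The step decomposition and branching behavior described in F1, F6, and F7 can be illustrated with a minimal Python sketch. This is not Pista's implementation: the `Step` and `Trace` classes, their field names, and the example formulas are all hypothetical, chosen only to show how an edited step can start a new branch while the original sequence stays available for comparison.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One reviewable unit of agent work (hypothetical structure)."""
    step_id: str
    explanation: str               # plain-language description, as in F2
    formulas: Dict[str, str]       # cell address -> formula, as in F3
    parent: Optional[str] = None   # preceding step on the same branch


@dataclass
class Trace:
    """Tree of steps; sibling steps under one parent are alternative branches."""
    steps: Dict[str, Step] = field(default_factory=dict)
    children: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, step: Step) -> None:
        self.steps[step.step_id] = step
        if step.parent is not None:
            self.children.setdefault(step.parent, []).append(step.step_id)

    def branch(self, edited_id: str, new_id: str,
               explanation: str, formulas: Dict[str, str]) -> Step:
        """Add an edited sibling of `edited_id`; the original is left untouched."""
        original = self.steps[edited_id]
        edited = Step(new_id, explanation, formulas, parent=original.parent)
        self.add(edited)
        return edited

    def path_to(self, step_id: str) -> List[str]:
        """Step ids from the root of the trace down to `step_id`."""
        path: List[str] = []
        current: Optional[str] = step_id
        while current is not None:
            path.append(current)
            current = self.steps[current].parent
        return list(reversed(path))


# Hypothetical three-step run, then a user edit to the second step.
trace = Trace()
trace.add(Step("s1", "Total current stock", {"B12": "=SUM(B2:B11)"}))
trace.add(Step("s2", "Set reorder level at 20% of stock", {"C2": "=B2*0.2"}, parent="s1"))
trace.add(Step("s3", "Flag items to reorder", {"D2": '=IF(B2<C2,"reorder","ok")'}, parent="s2"))

trace.branch("s2", "s2b", "Set reorder level at 30% of stock", {"C2": "=B2*0.3"})
# Downstream steps are regenerated on the new branch; the old s3 still exists.
trace.add(Step("s3b", "Flag items to reorder (regenerated)",
               {"D2": '=IF(B2<C2,"reorder","ok")'}, parent="s2b"))
```

In this sketch an edit never mutates an existing step; it adds a sibling, so `path_to("s3")` and `path_to("s3b")` give the two sequences a user could compare side by side.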

Data & Methods

Technology probe: initial Excel add-in version with two features:
- S1 Decomposed agent execution into incremental, pausable steps with plain-language descriptions and cell highlights.
- S2 Localized editing to scope natural-language corrections to individual steps and propagate changes.
Formative study:
- N = 8 participants (mixed disciplinary and spreadsheet experience).
- Task: 15-minute food warehouse restocking spreadsheet task (plus bonus tasks).
- Protocol: remote Zoom sessions, probe installation, task completion, semi-structured interview; coding via inductive analysis (Cohen’s κ = 0.91).
- Output: identification of the five key challenges (C1–C5) that motivated Pista’s design.
Pista evaluation:
- Within-subjects summative study (N = 16) comparing Pista to a baseline spreadsheet agent with equivalent capability (details in paper).
- Measures: qualitative and task outcome comparisons (participants’ ability to detect errors, comprehension, perception of agent, sense of role/co-ownership); users interacted stepwise, probed, edited, and explored branches.
- Findings: active participation (via Pista) influenced task outcomes and user cognition: improved error detection, greater comprehension, and increased perceived control and co-authorship. (Paper reports qualitative results and behavioral observations; consult full paper for quantitative metrics and statistical analysis details.)
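
For readers unfamiliar with the agreement statistic reported for the formative study's coding, Cohen's κ compares the observed agreement between two coders with the agreement expected by chance. The sketch below shows the standard calculation; the function name and the toy labels are illustrative assumptions and are not the study's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)


# Toy example with made-up codes (not taken from the paper).
coder_1 = ["C1", "C1", "C2", "C2"]
coder_2 = ["C1", "C1", "C2", "C1"]
kappa = cohens_kappa(coder_1, coder_2)
```

Values near 1 indicate agreement well above chance, which is how the reported κ = 0.91 is conventionally read.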

Implications for AI Economics

Labor allocation and division of work:
- Tools like Pista shift the optimal division of labor from “agent-alone then human-review” to “human-in-execution” oversight. This can change demand for labor: less time spent on lengthy post-hoc audits but more on real-time supervision and in-the-moment decision-making skills.
- Skill premium may increase for workers who can efficiently guide and audit stepwise agent execution (spreadsheet fluency + oversight expertise).
Productivity and error economics:
- Embedding verification during execution can reduce downstream error propagation and rework costs in spreadsheet-heavy domains (finance, accounting, operations), potentially lowering loss rates from unnoticed agent errors and improving quality-adjusted productivity.
- Branching and localized edits allow safe exploration of alternatives; firms can experimentally evaluate policies/assumptions faster, reducing decision-making friction and search costs.
Trust, adoption, and reliance:
- Making reasoning legible and enabling interventions may increase calibrated trust and adoption of agents, reducing overreliance on, or blind acceptance of, outputs (both of which have measurable economic consequences).
- Conversely, additional human-in-the-loop effort has costs; the net benefit depends on the tradeoff between verification time and error-cost reduction, suggesting firms will settle on a threshold level of human participation that varies with task complexity.
Accountability, compliance, and liability:
- Stepwise audit trails created by Pista-style systems produce richer provenance and could reduce compliance costs, simplify regulatory audits, and affect liability allocation (who “signed off” on each decision: the agent or the human supervisor).
- This could lower the insured risk of automated workflows, or change contractual arrangements (e.g., service-level agreements specifying human supervision requirements).
Market for auditing & tool ecosystems:
- Demand for interfaces that make model decisions auditable and steerable will likely grow; markets may emerge for audit analytics, human-oversight services, and specialized tooling integrated into domain workflows.
Directions for economic research and evaluation:
- Quantify the value of in-execution oversight: measure error rates, rework costs, and time-to-completion under “post-hoc review” vs “in-execution participation” regimes across task types.
- Model optimal human-in-the-loop intensity as a function of task complexity, agent reliability, and error cost to derive staffing and process design recommendations.
- Study distributional impacts: which worker groups benefit or are displaced by adoption of steerable agents; how incentives for careful supervision should be structured.
- Consider firm-level investment decisions: cost-benefit of adopting stepwise-agent interfaces (licensing, training) versus continuing with purely automated agents plus periodic audits.
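
The tradeoff between verification time and error-cost reduction raised in this section can be made concrete with a toy expected-cost model. The sketch below is an illustration under stated assumptions, not a model from the paper: each step has a hypothetical error probability, error cost, and review cost; steps are independent; and a human review catches an existing error with probability `detect_prob`. Under those assumptions, reviewing a step pays off exactly when its review cost is below `error_prob * detect_prob * error_cost`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class StepRisk:
    """Hypothetical per-step parameters; all values are assumptions."""
    name: str
    error_prob: float    # chance the agent's step is wrong
    error_cost: float    # cost if that error goes uncaught
    review_cost: float   # cost of the human time needed to review the step


def expected_cost(steps: Iterable[StepRisk], reviewed: Set[str], detect_prob: float) -> float:
    """Expected total cost when only the steps named in `reviewed` are checked."""
    total = 0.0
    for s in steps:
        if s.name in reviewed:
            # Pay for the review; an error survives only if the reviewer misses it.
            total += s.review_cost + s.error_prob * (1.0 - detect_prob) * s.error_cost
        else:
            total += s.error_prob * s.error_cost
    return total


def steps_worth_reviewing(steps: Iterable[StepRisk], detect_prob: float) -> Set[str]:
    """Steps whose expected error-cost reduction exceeds their review cost."""
    return {
        s.name
        for s in steps
        if s.review_cost < s.error_prob * detect_prob * s.error_cost
    }


# Made-up numbers for illustration only.
plan: List[StepRisk] = [
    StepRisk("import data", error_prob=0.01, error_cost=100.0, review_cost=5.0),
    StepRisk("reorder formula", error_prob=0.20, error_cost=500.0, review_cost=5.0),
]
chosen = steps_worth_reviewing(plan, detect_prob=0.9)
cost_selective = expected_cost(plan, chosen, detect_prob=0.9)
cost_no_review = expected_cost(plan, set(), detect_prob=0.9)
cost_full_review = expected_cost(plan, {s.name for s in plan}, detect_prob=0.9)
```

With these made-up numbers only the high-impact formula step clears the threshold, so selective review costs less in expectation than either reviewing nothing or reviewing everything. An empirical study along the lines suggested above would need to estimate these parameters rather than assume them.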


Assessment

Paper Type: quasi_experimental

Evidence Strength: medium. The within-subjects experimental comparison provides suggestive causal evidence that exposing and allowing intervention on agent actions changes user behavior and perceptions, but the sample is small (N=16 for the summative study), tasks are lab-based, many outcomes are subjective, and the baseline and task scope are narrow, limiting external validity and statistical power.

Methods Rigor: medium. The study uses a sensible mixed-methods approach (formative + summative, within-subjects design) appropriate for early-stage HCI evaluation and reduces between-subject confounds, but it is limited by small sample sizes, likely convenience sampling, unclear counterbalancing/blinding details, and few objective productivity or long-run measures.

Sample: A formative qualitative study with 8 participants and a within-subjects summative evaluation with 16 participants who completed equivalent spreadsheet tasks using Pista and a baseline agent; measured task outcomes, participants' ability to detect errors, comprehension, perceptions of the agent, and sense of role/co-ownership.

Themes: human_ai_collab, productivity, adoption

Identification: Within-subjects user study comparing Pista to a baseline spreadsheet agent: each participant used both agents on equivalent tasks, allowing within-participant comparisons of task outcomes, error detection, comprehension, and subjective perceptions (supplemented by a small formative study). No strong claims of randomization/blinding or large-scale field variation are reported.

Generalizability:
- Small, convenience sample limits statistical representativeness.
- Laboratory/task-based setting may not reflect real-world, complex spreadsheet workflows.
- Findings specific to spreadsheet environments and this UI paradigm may not generalize to other tools or agent types.
- Short-term interactions; no evidence on long-term use, learning, or productivity impacts.
- Baseline agent implementation and task difficulty determine effect size and may not mirror production agents.

Claims (9)

Claim	Category	Direction	Confidence	Outcome	Details
Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution.	Other	negative	high	user oversight ability	0.08
AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution: by the time users receive the output, all underlying decisions have already been made without their involvement.	Automation Exposure	negative	high	process transparency / accessibility during execution	0.08
The stakes are particularly high in spreadsheet environments, where process and artifact are inseparable: each decision the agent makes is recorded directly in cells that belong to and reflect on the user.	Task Allocation	negative	high	risk associated with automated changes to user-owned artifacts	0.08
We introduce Pista, a spreadsheet AI agent that decomposes execution into auditable, controllable actions, providing users with visibility into the agent's decision-making process and the capacity to intervene at each step.	Task Allocation	positive	high	availability of auditable, controllable actions and ability to intervene	0.08
A formative study (N = 8) and a within-subjects summative evaluation (N = 16) comparing Pista to a baseline agent demonstrated that active participation in execution influenced not only task outcomes but also users' comprehension of the task, their perception of the agent, and their sense of role within the workflow.	Output Quality	positive	high	task outcomes (primary claim), plus user comprehension, perception, and role sense	n=24 0.48
Users identified their own intent reflected in the agent's actions.	Worker Satisfaction	positive	high	alignment between user intent and agent actions	n=16 0.24
Users detected errors that post-hoc review would have failed to surface.	Error Rate	positive	high	error detection (compared to post-hoc review)	n=16 0.48
Users reported a sense of co-ownership over the resulting output.	Worker Satisfaction	positive	high	sense of ownership / co-ownership	n=16 0.24
Meaningful human oversight of AI agents in knowledge work requires not improved post-hoc review mechanisms, but active participation in decisions as they are made.	Governance And Regulation	positive	high	oversight effectiveness (design implication favoring in-line/active participation over post-hoc review)	n=24 0.48

A spreadsheet assistant that reveals its stepwise decisions lets users catch mistakes and feel ownership: exposing and controlling each action led participants to detect errors, understand the task better, and view themselves as collaborators rather than passive reviewers.