The Commonplace

Tool-augmented LLMs can reliably handle complex quantitative finance queries: in a 100-question benchmark, top agents achieved near-perfect tool selection and minimal hallucination, underscoring external tools as the path to dependable numerical reasoning.

Time Series Augmented Generation for Financial Applications
Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena · April 21, 2026
arXiv · descriptive · medium evidence · 7/10 relevance
A 100-question benchmark for financial time-series reasoning shows that tool-augmented LLM agents (via TSAG) can achieve near-perfect tool selection and low hallucination, validating the utility of delegating computations to verifiable external tools.

Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.

Summary

Main Finding

TSAG (Time Series Augmented Generation) is a tool‑augmented evaluation framework and benchmark that isolates an LLM agent’s core quantitative reasoning for financial time‑series Q&A by forcing the agent to select and call verifiable grounding functions. In a 100‑question benchmark (using tool stubs to remove live‑data noise), GPT‑4o and Qwen2 (7B) achieved perfect tool selection (Match Accuracy = 1.00) with low hallucination rates (HR = 0.02 and 0.08, respectively), which the authors take as validation of the tool‑augmented paradigm for combining LLMs with quantitative finance code. Weaker models performed substantially worse (Llama 3.1 8B: MA = 0.90, HR = 0.13; Llama 3.2: MA = 0.76, HR = 0.26), consistent with the view that robust tool orchestration emerges only in sufficiently capable models.

Key Points

  • Contribution: TSAG framework + 100‑item benchmark + evaluation metrics (Return Rate, Match Accuracy, LLM‑assessed Accuracy, Hallucination Rate, Seconds per Query) and an open‑source release to standardize evaluation of financial agents.
  • Methodological novelty: use of hard‑coded tool “stubs” (grounding functions with expected outputs) to decouple agentic reasoning (parsing, tool selection, parameter extraction) from noise in live market data or tool execution.
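  • Architecture: four layers (user front end, LLM kernel built on LangChain, tools layer of grounding functions, and time‑series database). The agent parses the natural‑language query (NLQ), selects one or more tools, extracts their parameters (falling back to provided defaults), and receives structured results from the tools; the LLM then synthesizes the final natural‑language response (NLR).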
  • Toolset (POC): seasonality/pattern analysis (peak_traded_volume, lowest_traded_volume, round_the_clock_pattern, abnormal_deviations), price/volatility (price, volatility, predict_*), correlation_between_tokens, and metadata retrieval.
  • Key experimental results (temperature = 0.0 unless noted):
    • GPT‑4o (API): RR 1.00, MA 1.00, LA 0.65, HR 0.02, SPQ ≈ 2s
    • Qwen2 (7B): RR 1.00, MA 1.00, LA 0.66, HR 0.08, SPQ ≈ 2s
    • GPT‑4o‑mini: RR 1.00, MA 0.97, HR 0.04
    • Llama 3.1 (8B): MA 0.90, HR 0.13
    • Llama 3.2 (3.2B): MA 0.76, HR 0.26
  • Practical findings:
    • Context window matters: 8192 tokens were necessary to hold the tools’ parameter descriptions; smaller contexts (4096 or 2048) degraded parameter extraction.
    • Temperature: temperature = 0.0 reduced hallucinations and increased determinism; temperature = 1.0 increased variability.
    • Latency/cost tradeoffs: local open models (Qwen2 7B) gave fast responses; some API models (DeepSeek‑V3) had higher latency.
  • Limitations noted by authors:
    • Proof‑of‑concept uses hard‑coded tool outputs and focuses on cryptocurrency data; real‑world integration, noisy data, and richer tools remain to be tested.
    • Some candidate models were excluded due to LangChain/tooling incompatibilities.
    • LLM‑assessed Accuracy and Hallucination Rate are scored automatically with DeepEval, which is useful but not a substitute for a full human judgment sweep.
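
The flow described in the Architecture and Methodological novelty points (select a tool, fill in default parameters, call a deterministic stub) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the TSAG implementation: the tool names come from the paper's toolset, but the parameter names (`token`, `days`), the default value, the canned outputs, and the `ToolCall` and `run_tool_call` helpers are all hypothetical. The LLM kernel is replaced by a hand-written `ToolCall` so the sketch runs without any model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


# Hypothetical tool stubs: hard-coded, deterministic outputs stand in for
# real analytics so that only the agent's choice of tool and parameters varies.
def peak_traded_volume(token: str, days: int) -> dict[str, Any]:
    return {"tool": "peak_traded_volume", "token": token, "days": days, "peak_hour_utc": 14}


def volatility(token: str, days: int) -> dict[str, Any]:
    return {"tool": "volatility", "token": token, "days": days, "value": 0.05}


# Tools layer: a registry mapping tool names to grounding functions.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "peak_traded_volume": peak_traded_volume,
    "volatility": volatility,
}

# Assumed default, used when the query does not mention a parameter.
DEFAULTS: dict[str, Any] = {"days": 30}


@dataclass
class ToolCall:
    """What the LLM kernel is expected to emit after parsing a query."""
    name: str
    params: dict[str, Any] = field(default_factory=dict)


def run_tool_call(call: ToolCall) -> dict[str, Any]:
    """Merge defaults with extracted parameters and invoke the selected stub."""
    if call.name not in TOOLS:
        raise KeyError(f"unknown tool: {call.name}")
    params = {**DEFAULTS, **call.params}  # extracted values override defaults
    return TOOLS[call.name](**params)


# Stand-in for the agent's output on a query such as "How volatile is BTC?"
call = ToolCall("volatility", {"token": "BTC"})
result = run_tool_call(call)
print(result)
```

Because the stubs are deterministic, any difference between two runs on the same query can only come from the tool call the agent produced, which is the property the benchmark relies on.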

Data & Methods

  • Benchmark: 100 NLQ items. Each item is specified as a triplet: (natural language query, expected keywords/numbers to find in the generated NLR, expected full NLR text).
  • Tool stubs: grounding functions implemented in Python with deterministic outputs driven by the benchmark (tsag/tools.py and tests/benchmark.tsv). This forces the agent to demonstrate correct invocation rather than relying on tool correctness.
  • Metrics:
    • Return Rate (RR): proportion of queries producing any nonempty response.
    • Match Accuracy (MA): strict programmatic check that generated tool name and all extracted parameters exactly match ground truth.
    • LLM‑assessed Accuracy (LA): DeepEval score (0–1) comparing generated NLR to expected NLR.
    • Hallucination Rate (HR): DeepEval measure of deviation (lower is better).
    • Seconds per Query (SPQ): wallclock time per query.
  • Models evaluated: Llama 3.x, Qwen2/2.5 (0.5B–7B), GPT‑4o and GPT‑4o‑mini (API), DeepSeek‑V3 (API). Some models were run locally via Ollama; others via vendor APIs.
  • Hyperparameters and setup:
    • Context window set to 8192 tokens.
    • Temperatures tested: 0.0 and 1.0 (multiple seeds for high temp).
    • num_predict = 512; retries and seeds were varied.
    • Implementation: Python 3.11, LangChain, DeepEval for scoring. Hardware: consumer GPU (RTX 3070 Ti laptop), experiments took ~2 hours per full benchmark run.
  • Comparative design: multiple runs for temperature=1.0 to evaluate variability; direct comparisons reported in Table 2 of the paper.
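
The three programmatic metrics above (RR, MA, SPQ) can be computed from per-query run records along these lines. This is a sketch under assumptions: the `RunRecord` layout, the sample records, and the choice of a simple mean for SPQ are illustrative rather than taken from the TSAG code. LA and HR are omitted because they depend on DeepEval scoring.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class RunRecord:
    """One benchmark query: ground truth, what the agent did, and timing."""
    expected_tool: str
    expected_params: dict[str, Any]
    called_tool: Optional[str]
    called_params: Optional[dict[str, Any]]
    response: str
    seconds: float


def return_rate(records: list[RunRecord]) -> float:
    """Share of queries that produced any non-empty response."""
    return sum(1 for r in records if r.response.strip()) / len(records)


def match_accuracy(records: list[RunRecord]) -> float:
    """Share of queries where tool name and all parameters match exactly."""
    hits = sum(
        1
        for r in records
        if r.called_tool == r.expected_tool and r.called_params == r.expected_params
    )
    return hits / len(records)


def seconds_per_query(records: list[RunRecord]) -> float:
    """Mean wall-clock time per query (mean is an assumption of this sketch)."""
    return sum(r.seconds for r in records) / len(records)


# Illustrative records; tool parameters are hypothetical.
records = [
    RunRecord("volatility", {"token": "BTC", "days": 30},
              "volatility", {"token": "BTC", "days": 30}, "BTC volatility is 0.05.", 2.0),
    RunRecord("price", {"token": "ETH"},
              "price", {"token": "BTC"}, "The price is 100.", 1.0),  # wrong parameter
    RunRecord("price", {"token": "ETH"}, None, None, "", 3.0),       # no response
    RunRecord("volatility", {"token": "ETH", "days": 7},
              "volatility", {"token": "ETH", "days": 7}, "ETH volatility is 0.05.", 2.0),
]

rr = return_rate(records)
ma = match_accuracy(records)
spq = seconds_per_query(records)
print(f"RR={rr:.2f} MA={ma:.2f} SPQ={spq:.1f}s")
```

The exact-match comparison is deliberately strict: a correct tool with one wrong parameter counts as a miss, mirroring the definition of MA given above.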

Implications for AI Economics

  • For applied financial systems:
    • Tool augmentation is a pragmatic path to reliable, auditable quantitative finance assistants: LLMs can handle natural language parsing and orchestration while verified code executes numeric work. This separation improves verifiability and reduces hallucination.
    • Model choice matters: larger, capable models (GPT‑4o, Qwen2 7B) show emergent, near‑perfect tool orchestration, supporting deployment in higher‑stakes analytics and decision support.
    • Operational settings: deterministic generation (temperature = 0) and sufficient context windows (≥8192 tokens) are advisable for production financial Q&A to reduce hallucinations and parameter extraction errors.
  • For market design and productivity:
    • Lowered friction for complex time‑series analytics could democratize quantitative workflows (traders, analysts, regulators), reducing costs and time to insight.
    • Faster local open models offer latency/cost advantages, while proprietary models may still edge on factuality and hallucination suppression—tradeoffs important for deployment decisions.
  • For regulation, auditability, and model risk:
    • The strict MA metric demonstrates a pathway to auditable tool calls (logs of tool invoked + parameters) that regulators and risk teams could inspect — facilitating compliance and model‑risk management.
    • Because TSAG isolates agent reasoning from execution, it can help separate model‑level failures (wrong tool/params) from execution/data failures (noisy feeds or buggy functions), which is vital for accountability frameworks.
  • Caveats and future research priorities for AI economics:
    • Real‑world robustness: the study used deterministic tool stubs and crypto data; performance in noisy live markets, under adversarial queries, or across traditional asset classes needs evaluation.
    • Generalization: expand tools (order execution, portfolio metrics, multi‑asset correlation), test transfer to unseen instruments, and stress test parameter defaults and ambiguous NLQs.
    • Economic and labor impacts: automation of routine quantitative tasks may reshape analyst roles; research should quantify productivity gains and distributional effects.
    • Governance risks: ease of building such agents raises misuse and systemic‑risk concerns; standardized benchmarks (like TSAG) should be paired with safety/auditing standards.
  • Practical takeaways for researchers and practitioners:
    • Use tool‑augmented evaluation to certify LLM agent reasoning before hooking to live data/tools.
    • Prefer larger, well‑contextualized models for mission‑critical financial tooling; test latency and cost tradeoffs against operational requirements.
    • Extend TSAG-style benchmarks to real tools and more diverse financial tasks to measure end‑to‑end reliability in realistic settings.
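
The auditability points above (logging the tool invoked plus its parameters, and separating model-level failures from execution failures) might be sketched as a JSON-lines audit record. Everything here is an assumption made for illustration: the field names, the `classify_failure` and `audit_record` helpers, and the three failure labels do not come from the paper. The classification also presumes an evaluation setting in which the expected tool call is known.

```python
import json
from typing import Any, Optional

ToolSpec = tuple[str, dict[str, Any]]  # (tool name, parameters)


def classify_failure(called: ToolSpec, expected: ToolSpec, tool_error: Optional[str]) -> str:
    """Attribute a failure to the model, to tool execution, or to neither."""
    if called != expected:
        return "model"      # wrong tool or wrong parameters: agent reasoning failed
    if tool_error is not None:
        return "execution"  # correct call, but the tool or its data failed
    return "none"


def audit_record(query: str, called: ToolSpec, expected: ToolSpec,
                 tool_error: Optional[str] = None) -> str:
    """Serialize one inspectable log line describing a tool call."""
    record = {
        "query": query,
        "tool": called[0],
        "params": called[1],
        "tool_error": tool_error,
        "failure": classify_failure(called, expected, tool_error),
    }
    return json.dumps(record, sort_keys=True)


expected = ("volatility", {"token": "BTC", "days": 30})

ok_line = audit_record("How volatile is BTC?", expected, expected)
model_line = audit_record("How volatile is BTC?",
                          ("price", {"token": "BTC"}), expected)
exec_line = audit_record("How volatile is BTC?", expected, expected,
                         tool_error="data feed timeout")

for line in (ok_line, model_line, exec_line):
    print(line)
```

Checking the call before checking the tool error means a wrong call is always attributed to the model, even if the tool also failed; other orderings are possible and would be a design decision for the risk team.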


Assessment

Paper Type: descriptive

Evidence Strength: medium. The paper provides a large-scale, comparative empirical evaluation with clear metrics (tool selection accuracy, faithfulness, hallucination) and multiple SOTA models, giving credible internal evidence about agent behavior on the benchmark; however, conclusions are limited by the curated 100-question sample, potential selection and implementation biases (choice of tools, prompt engineering, model versions), and lack of real-world deployment/impact measurement.

Methods Rigor: high. The authors introduce a principled evaluation framework (TSAG) that isolates parsing vs computation by delegating numeric work to verifiable external tools, use objective metrics (tool-use accuracy, hallucination rates), and compare multiple state-of-the-art agents; these design choices strengthen measurement validity, though some risks remain (benchmark curation, tool-interface specifics, model-version drift).

Sample: A curated benchmark of 100 financial time-series questions designed to require quantitative reasoning; evaluation of multiple state-of-the-art LLM agents (e.g., GPT-4o, Llama 3, Qwen2) running under the TSAG tool-augmented framework that delegates computations to verifiable external tools; metrics reported include tool selection accuracy, faithfulness, and hallucination; benchmark and framework are released publicly.

Themes: human_ai_collab, productivity, adoption, innovation

Generalizability:
  • Limited sample size and curated question set (100 items) may not cover the full breadth of real-world financial tasks or market regimes.
  • Performance depends on the particular external tools, tool interfaces, and prompt engineering used; results may not generalize to other tool stacks or integration choices.
  • Results are tied to specific model versions; newer or different LLMs may behave differently.
  • Benchmark focuses on quantitative time-series tasks in finance and may not generalize to other domains, languages, or interdisciplinary financial judgment tasks.
  • Controlled benchmark setting does not capture deployment issues (latency, data access limits, regulatory constraints, interactive human oversight).

Claims (8)

  • Claim: We introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis.
    • Tag: Other · Direction: positive · Confidence: high
    • Outcome: existence of a new evaluation methodology / benchmark
    • Details: 0.3
  • Claim: We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools.
    • Tag: AI Safety and Ethics · Direction: positive · Confidence: high
    • Outcome: use of external/verifiable tools by LLM agents
    • Details: n=100; 0.18
  • Claim: Our benchmark consists of 100 financial questions.
    • Tag: Other · Direction: null_result · Confidence: high
    • Outcome: benchmark size (number of questions)
    • Details: n=100; 0.3
  • Claim: We compare multiple state-of-the-art agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination.
    • Tag: AI Safety and Ethics · Direction: null_result · Confidence: high
    • Outcome: tool selection accuracy, faithfulness, hallucination
    • Details: n=100; 0.3
  • Claim: The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm.
    • Tag: AI Safety and Ethics · Direction: positive · Confidence: high
    • Outcome: tool-use accuracy; hallucination rate
    • Details: n=100; 0.18
  • Claim: Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations.
    • Tag: Other · Direction: negative · Confidence: high
    • Outcome: benchmark adequacy for isolating parsing/computation orchestration
    • Details: 0.03
  • Claim: We publicly release the evaluation framework and empirical insights to foster standardized research on reliable financial AI.
    • Tag: Other · Direction: positive · Confidence: high
    • Outcome: public release of resources
    • Details: 0.3
  • Claim: Time Series Augmented Generation (TSAG) enables LLM agents to delegate quantitative tasks to verifiable external tools.
    • Tag: AI Safety and Ethics · Direction: positive · Confidence: high
    • Outcome: delegation capability to external tools
    • Details: 0.18

Notes