Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.

Summary

Main Finding

The same observable behavioral signals (e.g., error rate, test-after-modify, trajectory length) can predict success in one software-engineering agent configuration and failure in another. Across 126 ⟨framework, LLM⟩ configurations, the semantics of these signals are often framework-dependent: for many features, swapping the framework (holding LLM fixed) changes not only effect magnitude but direction. For several trajectory-shape features, framework identity explains far more cross-configuration variation than LLM family, so behavioral rules derived from a single framework do not universally transfer.

Key Points

Scale and scope: 64,380 trajectories from 126 configurations (43 frameworks, many LLMs) on SWE-bench Verified under a 100% oracle setting.
Directional disagreement is common:
- Error rate: among configurations with measurable effects, 47 associate lower error rate with resolution, 48 associate higher error rate with resolution (i.e., opposite semantics).
- Overall, six continuous features and three of seven binary patterns from prior SE literature show such direction-divided effects across configurations.
Attribution of heterogeneity:
- For trajectory-shape features (e.g., mean turns), framework identity is the dominant source of variation. Example: for mean turns, framework explains ~64% of between-configuration variance vs ~10% for LLM family.
- For action-composition features and raw error counts, neither layer (framework nor LLM) uniformly dominates.
- One exception by trajectory type: Type 1 (long-exploratory) trajectories show LLM dominance in some diagnostics.
Transferability taxonomy:
- Direction-stable signals (high cross-configuration agreement on sign): shorter trajectories, fewer revisits, lower motif/transition entropy, lower backtrack rate. These carry qualitative, framework-agnostic principles but need per-framework calibration for numerical targets.
- Direction-unstable signals (low agreement / sign reversals): error rate, test-after-modify, fast cascade recovery. These carry no universal rule and can mislead if applied without framework fit checks.
Practical recommendation: behavioral rules from single-framework studies require cross-configuration validation before generalization; otherwise they can mislead framework design, model procurement, or performance diagnostics.
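The variance attribution in the key points (framework ~64% vs LLM family ~10% for mean turns) can be illustrated with a simple one-way decomposition. This is a minimal sketch, not the authors' meta-regression: it computes the share of between-configuration variance explained by a grouping label as between-group sum of squares over total sum of squares. The function name `variance_share` and the toy configurations are invented for illustration.

```python
from collections import defaultdict


def variance_share(values, labels):
    """Share of total variance in `values` explained by grouping on `labels`
    (between-group sum of squares divided by total sum of squares)."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in groups.values()
    )
    return between_ss / total_ss


# Toy data (hypothetical): one row per configuration,
# as (framework, LLM family, mean turns).
configs = [
    ("fw_a", "llm_x", 20.0),
    ("fw_a", "llm_y", 22.0),
    ("fw_b", "llm_x", 40.0),
    ("fw_b", "llm_y", 44.0),
]
mean_turns = [c[2] for c in configs]
framework_share = variance_share(mean_turns, [c[0] for c in configs])
llm_share = variance_share(mean_turns, [c[1] for c in configs])
print(f"framework: {framework_share:.3f}, LLM family: {llm_share:.3f}")
```

In the toy data the framework label separates the values far more cleanly than the LLM label, which is the qualitative situation the summary describes for trajectory-shape features.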

Data & Methods

Datasets:
- Full: 64,380 runs from 126 ⟨framework, LLM⟩ configurations across 43 frameworks.
- Two analytic slices to separate layers:
  - Slice A (LLM held fixed): three tracer LLMs (Claude 4 Sonnet, Claude 3.5 Sonnet, GPT-4o) each appearing in 6–8 frameworks — isolates framework effects.
  - Slice B (framework held fixed): 33 LLMs (15 families) running on a single framework (mini-swe-agent) — isolates LLM effects.
- Subsets: bash-only subset (16,522 trajectories, 33 LLMs on mini-swe-agent); verified subset (47,858 trajectories, 42 frameworks, 93 configurations).
Preprocessing pipeline:
- Parsing: 45 agent-specific parsers (15 format families) map raw logs to turns ⟨θ, a, o⟩.
- Action classification: 184 action patterns → 6 semantic categories (e.g., Exploration, Patch, Test).
- Error detection: regex-based detection of 15 error types; contiguous error-producing turns form error cascades.
- Metadata validation and resolution status cross-checks with SWE-bench leaderboard.
- Quality: classifier Cohen’s κ > 0.85 on 500 annotated turns; 122/132 configurations have ≤5% unknown-action rate.
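The error-detection step above (regex matching on observations, with contiguous error-producing turns grouped into cascades) can be sketched as follows. The three patterns are hypothetical stand-ins, since the summary reports 15 error types without listing them, and the function names are assumptions for illustration only.

```python
import re

# Hypothetical patterns: the source reports 15 error types but does not list them.
ERROR_PATTERNS = {
    "syntax_error": re.compile(r"SyntaxError"),
    "file_not_found": re.compile(r"No such file or directory"),
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
}


def detect_error(observation):
    """Return the name of the first matching error type, or None."""
    for name, pattern in ERROR_PATTERNS.items():
        if pattern.search(observation):
            return name
    return None


def error_cascades(observations):
    """Return (start_index, length) for each maximal run of
    contiguous error-producing turns."""
    cascades = []
    start = None
    for i, obs in enumerate(observations):
        if detect_error(obs) is not None:
            if start is None:
                start = i
        elif start is not None:
            cascades.append((start, i - start))
            start = None
    if start is not None:  # trajectory ended inside a cascade
        cascades.append((start, len(observations) - start))
    return cascades


# Toy trajectory: one observation string per turn.
observations = [
    "README.md  setup.py  src",
    "cat: foo.py: No such file or directory",
    "SyntaxError: invalid syntax",
    "1 passed in 0.12s",
    "Traceback (most recent call last):",
]
cascades = error_cascades(observations)
error_rate = sum(detect_error(o) is not None for o in observations) / len(observations)
```

Here the per-trajectory error rate is the fraction of turns whose observation matches any pattern; cascade length is the number of consecutive such turns.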
Feature extraction and trajectory taxonomy:
- Per-configuration aggregation of 16 behavioral features (action composition, temporal, errors, efficiency), 6 control-flow graph features (revisits, backtrack, branching, motif/transition entropy), and 7 binary patterns from prior literature.
- Standardization, PCA, and k-means clustering produce five trajectory types (trace-level taxonomy).
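One of the binary patterns, test-after-modify, can be sketched over a sequence of classified actions. The summary does not give the exact operationalization, so this is an assumption: the pattern holds when a "Test" action occurs after a "Patch" action, optionally within a fixed number of turns. The category labels follow the examples named above; `has_test_after_modify`, `pattern_rate`, and the `window` parameter are hypothetical.

```python
def has_test_after_modify(actions, window=None):
    """True if some "Test" action follows a "Patch" action.

    If `window` is given, the Test must occur within that many turns
    after the Patch; otherwise any later turn counts.
    """
    for i, action in enumerate(actions):
        if action != "Patch":
            continue
        end = len(actions) if window is None else min(len(actions), i + 1 + window)
        if "Test" in actions[i + 1:end]:
            return True
    return False


def pattern_rate(trajectories, window=None):
    """Fraction of trajectories that exhibit the pattern."""
    flags = [has_test_after_modify(t, window) for t in trajectories]
    return sum(flags) / len(flags)


# Toy trajectories: each is a list of semantic action categories.
trajectories = [
    ["Exploration", "Patch", "Test"],
    ["Exploration", "Patch", "Exploration", "Test"],
    ["Test", "Exploration", "Patch"],
]
rate_any = pattern_rate(trajectories)               # Test anywhere after a Patch
rate_immediate = pattern_rate(trajectories, window=1)  # Test on the very next turn
```

The two rates differ on the second trajectory, which shows how sensitive a binary pattern can be to its definition before any cross-configuration comparison is made.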
Per-configuration meta-analysis:
- Each configuration contributes one effect size for each behavior–outcome relationship.
- Diagnostics:
  - Higgins’ I^2: the fraction of observed variation in per-configuration effects that reflects genuine between-configuration heterogeneity rather than sampling noise (high I^2 → configuration dependence).
  - Direction split ⟨n+, n−⟩: counts configurations with strictly positive vs strictly negative effects.
  - Meta-regressions: framework identity and LLM family as moderators; R^2-type share of cross-configuration variance attributed to each layer.
- Inclusion criteria: only SWE-bench Verified 100% oracle runs with parseable logs and verifiable metadata; manual 1% spot check on parser outputs.
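Two of the diagnostics above, the direction split and Higgins' I^2, can be computed from one effect size and one sampling variance per configuration. The sketch below uses the standard definition I^2 = max(0, (Q - df) / Q), with Cochran's Q computed from inverse-variance weights; the effect sizes and variances are toy values invented for illustration, and the summary does not specify which effect-size measure the authors use.

```python
def direction_split(effects):
    """Count configurations with strictly positive and strictly negative effects."""
    n_pos = sum(1 for e in effects if e > 0)
    n_neg = sum(1 for e in effects if e < 0)
    return n_pos, n_neg


def higgins_i2(effects, variances):
    """Higgins' I^2 from Cochran's Q with inverse-variance weights."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


# Toy per-configuration effects (hypothetical) with equal sampling variances.
effects = [0.30, -0.25, 0.20, -0.35]
variances = [0.01, 0.01, 0.01, 0.01]

split = direction_split(effects)
i2 = higgins_i2(effects, variances)
print(f"direction split: {split}, I^2 = {i2:.3f}")
```

A high I^2 together with a near-even split is the combination the summary describes for error rate: the effects are not just noisy around a common value but genuinely differ in sign.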

Implications for AI Economics

Investment strategy (upgrade model vs redesign framework):
- For many trajectory-shape features, framework redesign yields larger changes than upgrading the LLM. Economically, this implies potentially higher marginal returns from investing in framework engineering (workflows, tool interfaces, loop design) when the target performance metric is trajectory shape or control-flow structure.
- For some regimes (e.g., Type 1 long-exploratory trajectories), LLM upgrades can dominate, so cost-effectiveness analysis needs to be context-specific.
Benchmarking, procurement, and incentives:
- Metrics commonly used to compare agent products (error rate, test-after-modify compliance, cascade recovery) may be misleading if used without adjusting for framework interactions. Contracts or procurement decisions tied to such metrics risk paying for the wrong lever (model vs framework).
- Market valuations of LLM improvements should account for ecosystem heterogeneity: the realized benefit of a more capable LLM depends on the frameworks buyers deploy.
Productivity measurement and accountability:
- Firms measuring agent productivity via behavioral proxies must calibrate those proxies per framework. Using direction-unstable signals as productivity indicators can lead to incorrect conclusions about agent effectiveness and misallocation of engineering resources.
Policy and standardization:
- Standard benchmarks and interpretability guidelines should require cross-framework validation of behavioral claims. Regulators or standards bodies promoting agent evaluation protocols should include checks for framework sensitivity to avoid overgeneralizing findings from single-framework studies.
Research and development prioritization:
- Funding and R&D that aim to improve end-to-end agent performance should consider two-pronged strategies: (a) general improvements in LLM capabilities, and (b) explicit framework design research to channel identical behaviors into desirable semantics (e.g., turning extended post-error exploration into disciplined recovery rather than collapse).
Actionable rule of thumb:
- Treat direction-stable signals as generally informative but numerically calibrate for the framework. Treat direction-unstable signals as requiring a framework-fit analysis before use in decision-making, incentive design, or performance contracts.
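The rule of thumb can be made concrete with a sign-agreement check on the direction split. This is a sketch under stated assumptions: the 0.8 threshold is hypothetical (the summary gives no cutoff), and only the error-rate counts (47 vs 48) come from the source; the second example uses invented counts.

```python
def sign_agreement(n_pos, n_neg):
    """Share of configurations that agree with the majority direction."""
    total = n_pos + n_neg
    if total == 0:
        raise ValueError("no configurations with a nonzero effect")
    return max(n_pos, n_neg) / total


def classify_signal(n_pos, n_neg, threshold=0.8):
    """Label a signal by sign agreement. The threshold is a hypothetical
    choice for illustration, not a value from the source."""
    if sign_agreement(n_pos, n_neg) >= threshold:
        return "direction-stable"
    return "direction-unstable"


# Error rate, from the source: 48 configurations resolve more with a higher
# error rate (positive effect), 47 with a lower one (negative effect).
error_rate_label = classify_signal(n_pos=48, n_neg=47)

# Invented counts for a signal on which most configurations agree.
toy_label = classify_signal(n_pos=5, n_neg=90)
```

A direction-stable label only says the sign is shared; numerical targets would still need per-framework calibration, as the rule of thumb states.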

Assessment

Paper Type: correlational
Evidence Strength: medium. Large-scale, systematic cross-configuration evidence (64,380 runs across 126 configurations) provides strong descriptive support that behavioral signals disagree across frameworks, but the study is observational and associations may reflect unobserved confounding or interaction effects rather than causal mechanisms.
Methods Rigor: high. The dataset is broad and the authors explicitly separate framework vs. LLM effects by holding layers fixed and quantifying variance explained, they examine many pre-specified features and binary patterns, and they report directional disagreement rates; these design and robustness checks indicate careful, rigorous methodology for a non-experimental study.
Sample: 64,380 SWE-bench runs drawn from 126 agent configurations that pair LLMs with 43 distinct agent frameworks (examples: SWE-Agent, OpenHands); for each configuration the authors measure behavioral features (e.g., whether a test follows code modification, error-cascade length, trajectory compactness, error rate, mean turns) and compute one behavior–outcome effect (association with issue resolution).
Themes: productivity, adoption
Identification: Comparative observational analysis across 126 agent configurations: the authors hold either the LLM or the framework fixed in turn to decompose variance and compute one behavior–outcome association per configuration (correlations/associations between action features and resolution rate); no randomized or instrumental causal identification is used.
Generalizability:
- Restricted to software-engineering (SWE) benchmark tasks; may not generalize to other task domains (e.g., writing, customer support, scientific discovery).
- Limited to the set of 43 frameworks and the LLM families included; newer LLMs, unseen frameworks, or different tool integrations may behave differently.
- Benchmarked agent runs may differ from real-world production workflows (human-in-the-loop, varying tooling, codebase heterogeneity).
- Outcome is task-resolution rate and related behavioral proxies; may not map directly to economic outcomes like productivity, time-to-market, or wages.
- Potentially sensitive to evaluation metrics and dataset composition (type of bugs/problems, prompt engineering).

Claims (12)

Claim	Category	Direction	Confidence	Outcome	Details
Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates (e.g., that a test step follows a code modification).	Developer Productivity	positive	medium	issue resolution rate (correlation with trajectory pattern: test step following code modification)	0.09
Behavioral studies report that short error cascades correlate with higher resolution rates.	Developer Productivity	positive	medium	issue resolution rate (correlation with short error cascades)	0.09
Behavioral studies report that compact trajectories correlate with higher resolution rates.	Developer Productivity	positive	medium	issue resolution rate (correlation with trajectory compactness)	0.09
This study analyzes 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework supplying tools and workflow.	Other	null_result	high	number of benchmark runs / experimental scale	n=64380 0.5
The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations.	Other	null_result	high	behavior–outcome effects per configuration (methodological approach)	n=126 0.3
Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature.	Other	mixed	high	action features (behavioral signals/actions taken by agents)	n=126 large behavioral differences in every action feature 0.5
On most signals, configurations disagree not merely in magnitude but in direction (i.e., the same signal correlates positively with resolution in some configurations and negatively in others).	Developer Productivity	mixed	high	direction of correlation between behavioral signals and issue resolution	n=126 0.5
Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher.	Developer Productivity	mixed	high	issue resolution count/rate as a function of error rate	n=95 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher 0.5
Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement across configurations.	Developer Productivity	mixed	high	directional agreement/disagreement of feature–outcome relations for five continuous features and seven binary patterns	n=126 five other continuous features and three of seven binary patterns show directional disagreement 0.3
Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%.	Task Completion Time	null_result	high	mean turns (average number of turns per task)	n=126 64% of the between-configuration variance explained by framework vs 10% explained by LLM 0.5
The same observable behavioral signal can carry opposite meaning for different agent configurations.	Other	mixed	high	interpretation of behavioral signals (sign of correlation with outcomes)	n=126 0.3
Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.	Other	null_result	high	validity/generalizability of behavioral findings across agent configurations	n=126 0.3

Behavioral 'rules' for LLM coding agents do not generalize across frameworks: in a 64,000+ run study of 126 configurations, swapping the framework often reverses whether a behavioral signal predicts success, implying single-framework findings are unreliable for broader deployment.