Behavioral 'rules' for LLM coding agents do not generalize across frameworks: in a 64,000+ run study of 126 configurations, swapping the framework often reverses whether a behavioral signal predicts success, implying single-framework findings are unreliable for broader deployment.
Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
Summary
Main Finding
The same observable behavioral signals (e.g., error rate, test-after-modify, trajectory length) can predict success in one software-engineering agent configuration and failure in another. Across 126 ⟨framework, LLM⟩ configurations, the meaning of these signals is often framework-dependent: for many features, swapping the framework while holding the LLM fixed changes not only the magnitude of the effect but also its direction. For several trajectory-shape features, framework identity explains far more cross-configuration variation than LLM family, so behavioral rules derived from a single framework do not universally transfer.
Key Points
- Scale and scope: 64,380 trajectories from 126 configurations (43 frameworks, many LLMs) on SWE-bench Verified under a 100% oracle setting.
- Directional disagreement is common:
  - Error rate: among configurations with measurable effects, 47 associate lower error rate with resolution, 48 associate higher error rate with resolution (i.e., opposite semantics).
  - Overall, six continuous features and three of seven binary patterns from prior SE literature show such direction-divided effects across configurations.
- Attribution of heterogeneity:
  - For trajectory-shape features (e.g., mean turns), framework identity is the dominant source of variation. Example: for mean turns, framework explains ~64% of between-configuration variance vs ~10% for LLM family.
  - For action-composition features and raw error counts, neither layer (framework nor LLM) uniformly dominates.
  - One exception by trajectory type: Type 1 (long-exploratory) trajectories show LLM dominance in some diagnostics.
- Transferability taxonomy:
  - Direction-stable signals (high cross-configuration agreement on sign): shorter trajectories, fewer revisits, lower motif/transition entropy, lower backtrack rate. These carry qualitative, framework-agnostic principles but need per-framework calibration for numerical targets.
  - Direction-unstable signals (low agreement / sign reversals): error rate, test-after-modify, fast cascade recovery. These carry no universal rule and can mislead if applied without framework fit checks.
- Practical recommendation: behavioral rules from single-framework studies require cross-configuration validation before generalization; otherwise they can mislead framework design, model procurement, or performance diagnostics.
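The direction-stable versus direction-unstable distinction can be made concrete by counting how many configurations show a positive versus a negative effect for a signal. The sketch below is illustrative only: the function names, the `sign_agreement` ratio, and the `threshold` argument are assumptions introduced here, not quantities defined by the study. The only figures taken from the text are the 47/48 split for error rate.

```python
def direction_split(effects):
    """Count configurations with strictly positive and strictly negative effects."""
    n_pos = sum(1 for e in effects if e > 0)
    n_neg = sum(1 for e in effects if e < 0)
    return n_pos, n_neg


def sign_agreement(n_pos, n_neg):
    """Fraction of signed effects that share the majority sign (0.5 = evenly split)."""
    total = n_pos + n_neg
    if total == 0:
        raise ValueError("no configurations with a nonzero effect")
    return max(n_pos, n_neg) / total


def label_signal(n_pos, n_neg, threshold):
    """Label a signal by sign agreement; `threshold` is a hypothetical cutoff."""
    if sign_agreement(n_pos, n_neg) >= threshold:
        return "direction-stable"
    return "direction-unstable"


# Error rate as reported above: 47 configurations resolve more issues with a
# lower error rate (negative effect), 48 with a higher one (positive effect).
agreement = sign_agreement(n_pos=48, n_neg=47)
print(round(agreement, 3))

# The 0.8 cutoff is arbitrary and chosen only for illustration.
print(label_signal(48, 47, threshold=0.8))
```

With the reported error-rate counts the agreement is 48/95, about 0.505, which is close to an even split. Any reasonable cutoff would place that signal on the direction-unstable side.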
Data & Methods
- Datasets:
  - Full: 64,380 runs from 126 ⟨framework, LLM⟩ configurations across 43 frameworks.
  - Two analytic slices to separate layers:
    - Slice A (LLM held fixed): three tracer LLMs (Claude 4 Sonnet, Claude 3.5 Sonnet, GPT-4o) each appearing in 6–8 frameworks; isolates framework effects.
    - Slice B (framework held fixed): 33 LLMs (15 families) running on a single framework (mini-swe-agent); isolates LLM effects.
  - Subsets: bash-only subset (16,522 trajectories, 33 LLMs on mini-swe-agent); verified subset (47,858 trajectories, 42 frameworks, 93 configurations).
- Preprocessing pipeline:
  - Parsing: 45 agent-specific parsers (15 format families) map raw logs to turns ⟨θ, a, o⟩.
  - Action classification: 184 action patterns → 6 semantic categories (e.g., Exploration, Patch, Test).
  - Error detection: regex-based detection of 15 error types; contiguous error-producing turns form error cascades.
  - Metadata validation and resolution status cross-checks with SWE-bench leaderboard.
  - Quality: classifier Cohen’s κ > 0.85 on 500 annotated turns; 122/132 configurations have ≤5% unknown-action rate.
- Feature extraction and trajectory taxonomy:
  - Per-configuration aggregation of 16 behavioral features (action composition, temporal, errors, efficiency), 6 control-flow graph features (revisits, backtrack, branching, motif/transition entropy), and 7 binary patterns from prior literature.
  - Standardization, PCA, and k-means clustering produce five trajectory types (trace-level taxonomy).
- Per-configuration meta-analysis:
  - Each configuration contributes one effect size for each behavior–outcome relationship.
  - Diagnostics:
    - Higgins’ I^2: measures fraction of between-configuration variance beyond sampling noise (high I^2 → configuration dependence).
    - Direction split ⟨n+, n−⟩: counts configurations with strictly positive vs strictly negative effects.
    - Meta-regressions: framework identity and LLM family as moderators; R^2-type share of cross-configuration variance attributed to each layer.
- Inclusion criteria: only SWE-bench Verified 100% oracle runs with parseable logs and verifiable metadata; manual 1% spot check on parser outputs.
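Two of the diagnostics listed above can be sketched in a few lines. The code below computes Higgins' I^2 from Cochran's Q with inverse-variance weights, which is the standard textbook definition, and a simple one-way R^2 (between-group sum of squares over total sum of squares) as a stand-in for the variance share attributed to a layer. The study itself reports meta-regressions, so the one-way R^2 is a simplification assumed here. The function names and all numbers in the example are made up for illustration.

```python
from collections import defaultdict


def higgins_i2(effects, variances):
    """Higgins' I^2 as a fraction in [0, 1], from per-configuration effects."""
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effects with matching variances")
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the pooled effect.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q)


def variance_share(values, groups):
    """Share of total variance explained by group membership (one-way R^2)."""
    grand = sum(values) / len(values)
    total_ss = sum((v - grand) ** 2 for v in values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand) ** 2 for vs in by_group.values()
    )
    return between_ss / total_ss if total_ss > 0 else 0.0


# Hypothetical data: two equally precise effects with opposite signs.
print(higgins_i2([0.4, -0.4], [0.01, 0.01]))

# Hypothetical mean-turn values for four configurations (2 frameworks x 2 LLMs).
mean_turns = [10, 12, 30, 32]
frameworks = ["A", "A", "B", "B"]
llms = ["x", "y", "x", "y"]
print(variance_share(mean_turns, frameworks))
print(variance_share(mean_turns, llms))
```

In the toy data the two opposite-signed effects give I^2 = 31/32, so nearly all observed variation is attributed to real between-configuration differences, and grouping the mean-turn values by framework explains far more variance than grouping by LLM. This mirrors the qualitative pattern reported for mean turns, although the numbers themselves are invented.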
Implications for AI Economics
- Investment strategy (upgrade model vs redesign framework):
  - For many trajectory-shape features, framework redesign yields larger changes than upgrading the LLM. Economically, this implies potentially higher marginal returns from investing in framework engineering (workflows, tool interfaces, loop design) when the target performance metric is trajectory shape or control-flow structure.
  - For some regimes (e.g., Type 1 long-exploratory trajectories), LLM upgrades can dominate, suggesting that context-specific cost-effectiveness analysis is required.
- Benchmarking, procurement, and incentives:
  - Metrics commonly used to compare agent products (error rate, test-after-modify compliance, cascade recovery) may be misleading if used without adjusting for framework interactions. Contracts or procurement decisions tied to such metrics risk paying for the wrong lever (model vs framework).
  - Market valuations of LLM improvements should account for ecosystem heterogeneity: the realized benefit of a more capable LLM depends on the frameworks buyers deploy.
- Productivity measurement and accountability:
  - Firms measuring agent productivity via behavioral proxies must calibrate those proxies per framework. Using direction-unstable signals as productivity indicators can lead to incorrect conclusions about agent effectiveness and misallocation of engineering resources.
- Policy and standardization:
  - Standard benchmarks and interpretability guidelines should require cross-framework validation of behavioral claims. Regulators or standards bodies promoting agent evaluation protocols should include checks for framework sensitivity to avoid overgeneralizing findings from single-framework studies.
- Research and development prioritization:
  - Funding and R&D that aim to improve end-to-end agent performance should consider two-pronged strategies: (a) general improvements in LLM capabilities, and (b) explicit framework design research to channel identical behaviors into desirable semantics (e.g., turning extended post-error exploration into disciplined recovery rather than collapse).
- Actionable rule of thumb:
  - Treat direction-stable signals as generally informative but numerically calibrate for the framework. Treat direction-unstable signals as requiring a framework-fit analysis before use in decision-making, incentive design, or performance contracts.
Assessment
Claims (12)
| Claim | Category | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|---|
| Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates (e.g., that a test step follows a code modification). | Developer Productivity | positive | medium | issue resolution rate (correlation with trajectory pattern: test step following code modification) | 0.09 |
| Behavioral studies report that short error cascades correlate with higher resolution rates. | Developer Productivity | positive | medium | issue resolution rate (correlation with short error cascades) | 0.09 |
| Behavioral studies report that compact trajectories correlate with higher resolution rates. | Developer Productivity | positive | medium | issue resolution rate (correlation with trajectory compactness) | 0.09 |
| This study analyzes 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework supplying tools and workflow. | Other | null_result | high | number of benchmark runs / experimental scale | n=64380; 0.5 |
| The analysis separates framework effects from LLM effects by holding each layer fixed in turn and measures one behavior–outcome effect per configuration to examine agreement across configurations. | Other | null_result | high | behavior–outcome effects per configuration (methodological approach) | n=126; 0.3 |
| Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. | Other | mixed | high | action features (behavioral signals/actions taken by agents) | n=126; large behavioral differences in every action feature; 0.5 |
| On most signals, configurations disagree not merely in magnitude but in direction (i.e., the same signal correlates positively with resolution in some configurations and negatively in others). | Developer Productivity | mixed | high | direction of correlation between behavioral signals and issue resolution | n=126; 0.5 |
| Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. | Developer Productivity | mixed | high | issue resolution count/rate as a function of error rate | n=95; 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher; 0.5 |
| Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement across configurations. | Developer Productivity | mixed | high | directional agreement/disagreement of feature–outcome relations for five continuous features and seven binary patterns | n=126; five other continuous features and three of seven binary patterns show directional disagreement; 0.3 |
| Framework identity accounts for more of the between-configuration variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. | Task Completion Time | null_result | high | mean turns (average number of turns per task) | n=126; 64% of the between-configuration variance explained by framework vs 10% explained by LLM; 0.5 |
| The same observable behavioral signal can carry opposite meaning for different agent configurations. | Other | mixed | high | interpretation of behavioral signals (sign of correlation with outcomes) | n=126; 0.3 |
| Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general. | Other | null_result | high | validity/generalizability of behavioral findings across agent configurations | n=126; 0.3 |