Sentiment judgments of short, context-free team messages depend more on the message than on the rater: individuals' labels are only moderately stable and ambiguous wording drives most label changes, while mood traits produce only small positive biases. The result warns that automated sentiment signals, which are often trained on pooled human labels, can be noisy and systematically biased unless annotator and item heterogeneity are explicitly modeled.

Exploring Indicators of Developers' Sentiment Perceptions in Student Software Projects

Martin Obaidi, Marc Herrmann, Jendrik Martensen, Jil Klünder, Kurt Schneider · March 11, 2026

Source: arXiv · Paper type: correlational · Evidence strength: medium · Relevance: 7/10

Sentiment labels for short, context-free team messages are only moderately stable within individuals and are driven largely by statement ambiguity; trait mood weakly biases some raters toward more positive labels, but effects are small and fragile overall.

Communication is a crucial social factor in the success of software projects, as positively or negatively perceived statements can influence how recipients feel and affect team collaboration through emotional contagion. Whether a developer perceives a written message as positive, negative, or neutral is likely shaped by multiple factors. In this paper, we investigate how mood traits and states, life circumstances, project phases, and group dynamics relate to the perception of text-based messages in software development. We conducted a four-round survey study with 81 students in team-based software projects. Across rounds, participants reported these factors and labeled 30 decontextualized statements for sentiment, including meta-data on labeling rationale and uncertainty. Our results show: (1) Sentiment perception is only moderately stable within individuals, and label changes concentrate on ambiguity-prone statements; (2) Correlation-level signals are small and do not survive global multiple-testing correction; (3) In statement-level repeated-measures models (GEE), higher mood trait and reactivity are associated with more positive (and less neutral) labeling, while predictors of negative labeling are weaker and at most trend-level (e.g., task conflict); (4) We find no clear evidence of systematic project-phase effects. Overall, sentiment perception varies within persons and is strongly statement-dependent. Although our study was conducted in an academic setting, the observed variability and ambiguity effects suggest caution when interpreting sentiment analysis outputs and motivate future work with contextualized, in-project communication.

Summary

Main Finding

Sentiment perception of short, decontextualized messages in team-based software projects is only moderately stable within individuals and is strongly statement-dependent. Individual mood traits (especially trait mood and reactivity) weakly predict a tendency to label statements as more positive (and less neutral), but overall effect sizes are small, negative-label predictors are weaker, and many apparent correlations do not survive correction for multiple testing. Ambiguity in statements drives most label changes; no clear project-phase effects were found.

Key Points

Study design: four survey rounds with 81 student participants working in teams; in each round, participants reported mood, life circumstances, project-phase, and group-dynamics measures, labeled 30 decontextualized statements for sentiment (positive/neutral/negative), and recorded their rationale and uncertainty.
Stability: Sentiment labels are only moderately stable within individuals across rounds; label changes concentrate on statements judged as ambiguous.
Predictors:
- Trait-level mood and emotional reactivity correlate with more positive (and fewer neutral) labels in statement-level repeated-measures models (GEE).
- Predictors of negative labeling are weak and at best trend-level (e.g., task conflict).
- Correlation-level signals are generally small and fail to remain significant after global multiple-testing correction.
Context effects: No clear evidence that project phase systematically shifts sentiment perception.
Statement dependence: The particular statement’s ambiguity/wording is a dominant source of labeling variability.
External validity caution: Study was conducted with students and decontextualized messages; findings motivate replication in real project communication.
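The stability and statement-dependence points above can be made concrete with a small sketch. It assumes a hypothetical data layout (`labels[participant][statement]` holding one label per round) and uses consecutive-round agreement as the stability measure; the paper's actual stability metric is not stated in this summary, so treat both the layout and the metric as illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy data (not from the study):
# labels[participant][statement] = one label per survey round.
labels = {
    "p1": {"s1": ["pos", "pos", "pos", "pos"], "s2": ["neu", "neg", "neu", "pos"]},
    "p2": {"s1": ["pos", "pos", "neu", "pos"], "s2": ["neg", "neg", "neu", "neu"]},
}

def agreement_rate(seq):
    """Share of consecutive round pairs that received the same label."""
    pairs = list(zip(seq, seq[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

def stability_by_participant(labels):
    """Mean consecutive-round agreement per participant, averaged over statements."""
    return {
        p: sum(agreement_rate(seq) for seq in stmts.values()) / len(stmts)
        for p, stmts in labels.items()
    }

def changes_by_statement(labels):
    """Number of label changes per statement, summed over participants."""
    changes = defaultdict(int)
    for stmts in labels.values():
        for s, seq in stmts.items():
            changes[s] += sum(a != b for a, b in zip(seq, seq[1:]))
    return dict(changes)

stability = stability_by_participant(labels)
changes = changes_by_statement(labels)
print(stability)
print(changes)
```

In this toy data, label changes pile up on one statement (`s2`), which is the pattern the summary describes for ambiguity-prone statements.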

Data & Methods

Sample: 81 students in team-based software projects, surveyed across four rounds (longitudinal repeated-measures design).
Tasks: In each round, participants labeled the sentiment of 30 decontextualized text statements and provided metadata on their labeling rationale and self-reported labeling uncertainty.
Measures collected: mood traits, momentary mood states, life circumstances, team-level dynamics (e.g., conflict), and project-phase indicators.
Analytical approach:
- Descriptive stability analyses across rounds.
- Correlation analyses between sentiment labels and predictors, with multiple-testing correction.
- Statement-level repeated-measures generalized estimating equations (GEE) to model predictors of positive/neutral/negative labeling while accounting for repeated observations per participant and per statement.
Key statistical outcomes: Predictor effects are small overall; trait mood and reactivity are significant in the GEE models for positive versus neutral labeling; many correlations are not robust to multiple-testing adjustment.
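To illustrate how individually significant correlations can lose significance under a global multiple-testing correction, here is a minimal sketch using the Holm step-down procedure. The summary does not say which correction the authors applied, so Holm is an assumption chosen for illustration, and the p-values are made up.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order.

    The k-th smallest p-value (k = 0, 1, ...) is multiplied by (m - k),
    capped at 1, and a running maximum enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for four predictor/label correlations.
raw = [0.01, 0.04, 0.03, 0.20]
adj = holm_adjust(raw)

alpha = 0.05
print("significant before:", sum(p < alpha for p in raw))
print("significant after: ", sum(p < alpha for p in adj))
```

With these invented inputs, three tests fall below 0.05 before adjustment and only one does afterwards. With many predictors and small effects, as in the study, the shrinkage can be stronger still.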

Implications for AI Economics

Measurement error and attenuation: High within-person variability and statement-dependent ambiguity imply noisy sentiment labels. In econometric analyses that use automated sentiment measurements (e.g., linking communication tone to productivity, turnover, or team performance), this noise can attenuate estimated effects and bias inference toward zero unless explicitly modeled.
Heterogeneous annotator effects: Mood traits and reactivity shift labeling propensity (toward positivity). Classifiers trained on labels pooled from annotators with heterogeneous affective traits may inherit systematic biases. Economists evaluating AI tools should account for annotator heterogeneity (e.g., by modeling rater effects or using hierarchical labeling schemes).
Classifier evaluation and deployment: Because ambiguity is a main source of disagreement, evaluation should emphasize performance on ambiguous items and measure uncertainty calibration. For high-stakes economic decisions (HR monitoring, productivity signals, automated moderation), relying on context-free sentiment scores is risky—invest in contextualized, in-project training data to improve external validity.
Policy and firm decisions: Automated sentiment analytics used for managerial decisions, performance-based pay, or market signaling can produce erroneous or unfair outcomes if they ignore label variability and ambiguity. Cost–benefit assessments of deploying such tools should incorporate error rates, heterogeneity across raters, and potential adverse reactions from affected workers.
Research design recommendations for AI economists:
- Model measurement error explicitly (errors-in-variables models, latent-variable models, or instrumental variables where possible).
- Use multi-annotator labels with recorded uncertainty and model annotator and item (statement) random effects.
- Prefer contextualized communication data over decontextualized snippets when estimating economic effects of sentiment.
- Prioritize collecting additional labels for ambiguous items (active learning) and calibrate classifier uncertainty to downstream economic decisions.
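One simple way to act on the recommendation to prioritize ambiguous items is to rank statements by the entropy of their pooled label distribution and send the highest-entropy ones for further annotation. This is a minimal sketch, not a procedure from the paper: the item names and labels are hypothetical, and entropy is only one of several possible disagreement measures.

```python
import math
from collections import Counter

def label_entropy(item_labels):
    """Shannon entropy (in bits) of the empirical label distribution of one item."""
    counts = Counter(item_labels)
    n = len(item_labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

def rank_by_ambiguity(labels_by_item):
    """Item ids sorted from most to least disagreement among annotators."""
    return sorted(
        labels_by_item,
        key=lambda item: label_entropy(labels_by_item[item]),
        reverse=True,
    )

# Hypothetical pooled labels from four annotators per statement.
pooled = {
    "s1": ["pos", "pos", "pos", "pos"],
    "s2": ["pos", "neu", "neg", "neu"],
    "s3": ["neg", "neg", "neu", "neg"],
}

ranking = rank_by_ambiguity(pooled)
print(ranking)
```

Items at the top of the ranking are candidates for extra labels or for being reported separately in classifier evaluation.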
Generalizability caveat: Findings are from a student sample and decontextualized statements; quantify transfer risk before applying these conclusions to industry settings or macroeconomic inference.
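The attenuation concern raised under "Measurement error and attenuation" follows from the classical errors-in-variables result: if the observed sentiment equals the true sentiment plus independent noise with variance σ_u², the OLS slope converges to β · σ_x² / (σ_x² + σ_u²) rather than β. The simulation below is a sketch under exactly that classical assumption, with synthetic data and arbitrary parameter values. The rater-trait effects described above would additionally introduce systematic, non-classical error, which this sketch does not cover.

```python
import random

def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 20000
beta = 2.0               # assumed true effect of sentiment on the outcome

# Synthetic "true" sentiment, an outcome that depends on it, and a noisy label.
true_sent = [rng.gauss(0, 1) for _ in range(n)]
outcome = [beta * s + rng.gauss(0, 1) for s in true_sent]
noisy_sent = [s + rng.gauss(0, 1) for s in true_sent]  # label noise with variance 1

slope_true = ols_slope(true_sent, outcome)
slope_noisy = ols_slope(noisy_sent, outcome)
reliability = 1.0 / (1.0 + 1.0)  # sigma_x^2 / (sigma_x^2 + sigma_u^2)

print(f"slope with true sentiment:  {slope_true:.2f}")
print(f"slope with noisy sentiment: {slope_noisy:.2f}")
print(f"predicted attenuated slope: {beta * reliability:.2f}")
```

Under these settings the slope estimated from noisy labels should land near half the true effect, which is why the recommendations call for modeling measurement error rather than treating sentiment scores as exact.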

Assessment

Paper Type: correlational
Evidence Strength: medium. Longitudinal repeated-measures design (four rounds) and statement-level GEE models strengthen internal validity for describing label stability and statement effects, and the study records uncertainty and rater rationales; however, the sample is small (N=81 students), effects are generally small and often fail multiple-testing correction, and the stimuli are decontextualized, limiting external validity.
Methods Rigor: medium. Analytical approach is appropriate (descriptive stability analyses, multiple-testing correction, and GEE to account for repeated observations by participant and item) and the study collects useful covariates (trait and state mood, team dynamics, uncertainty), but statistical power is limited, many correlations are fragile, and contextual/ecological validity is constrained by the survey and decontextualized-statement design.
Sample: 81 students engaged in team-based software projects surveyed in four rounds; in each round participants labeled sentiment (positive/neutral/negative) for 30 decontextualized short statements and reported trait and momentary mood, life circumstances, team-level dynamics (e.g., conflict), project-phase indicators, labeling rationale, and self-reported uncertainty.
Themes: human_ai_collab, productivity
Generalizability: Student sample (limited external validity to professional workplaces); decontextualized short statements (may not reflect in-situ, threaded communication); small sample size and likely limited geographic/cultural diversity; specific to team-based software projects (may not generalize across industries or communication types); survey-based labeling differs from live or platform-based annotation processes.

Claims (10)

Claim	Category	Direction	Confidence	Outcome	Details
Sentiment perception of short, decontextualized messages in team-based software projects is only moderately stable within individuals and is strongly statement-dependent.	Other	mixed	medium	within-individual stability of sentiment labels (positive/neutral/negative) across rounds	n=81 moderate 0.18
Trait-level mood and emotional reactivity weakly predict a higher tendency to label statements as positive (and fewer as neutral).	Other	positive	medium	probability of labeling a statement as positive (vs neutral)	n=81 small 0.18
Predictors of negative labeling are weak and at best trend-level (e.g., task conflict shows only weak/trend-level association with negative labels).	Output Quality	null_result	medium	probability of labeling a statement as negative	n=81 weak / trend-level associations; small and non-robust effects 0.18
Many apparent correlations between predictors and sentiment labels do not remain significant after global multiple-testing correction.	Output Quality	null_result	high	statistical significance of correlations between predictors (e.g., mood, team measures) and sentiment labels	n=81 many correlations fail to survive global multiple-testing correction 0.3
Label changes across rounds concentrate on statements judged as ambiguous; statement ambiguity drives most label changes.	Error Rate	positive	high	frequency of label changes per statement and its association with self-reported labeling uncertainty/ambiguity	n=30 higher label-change frequency for statements with higher self-reported ambiguity/uncertainty 0.3
No clear evidence that project phase systematically shifts sentiment perception.	Output Quality	null_result	medium	sentiment label distribution across project phases	n=81 no systematic shift in sentiment label distribution across project phases (null/weak effect) 0.18
The particular statement’s wording/ambiguity is a dominant source of labeling variability (statement dependence outweighs annotator-level effects).	Error Rate	mixed	medium	variance in sentiment labels attributable to statement/item identity vs annotator characteristics	n=81 statement/item identity explains more label variance than annotator-level characteristics 0.18
High within-person variability and statement-dependent ambiguity imply noisy sentiment labels that can attenuate estimated effects in econometric analyses (measurement error / attenuation bias).	Output Quality	negative	medium	expected bias (attenuation) in estimated associations when using noisy sentiment measures as predictors or outcomes	n=81 high within-person variability and statement-dependent ambiguity imply measurement noise that attenuates estimated associations (measurement-error/attenuation bias) 0.18
Annotator affective traits shift labeling propensity (toward positivity); classifiers trained on pooled annotator labels may inherit systematic biases from annotator heterogeneity.	AI Safety and Ethics	positive	medium	systematic shift in aggregate labels (and therefore potential classifier outputs) associated with annotator affective traits	n=81 annotator affective traits shift labeling propensity toward positivity; pooled labels may induce systematic classifier bias 0.18
Findings are based on a student sample rating decontextualized messages, so external validity to industry communication or real project logs is uncertain and requires replication.	Other	null_result	high	generalizability/external validity of the study findings to non-student, contextualized settings	n=81 external validity uncertain; findings from student sample rating decontextualized messages may not generalize 0.3