
Large language models can map who-knows-what from workplace Slack logs, but accuracy is limited: Gemini 2.5 Flash estimated individual expertise with a mean absolute error of 21.1 percentage points while GPT models performed considerably worse, and more messages did not substantially improve inference—highlighting feasibility, limits, and privacy concerns for automated expertise mapping.

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

Ko Watanabe, Shoya Ishimaru · May 21, 2026

Source: arXiv · Paper type: correlational · Evidence strength: medium · Relevance: 7/10

LLMs can partially infer individuals' domain expertise from long-term Slack logs—Gemini 2.5 Flash achieved the lowest MAE (21.13%), while GPT-family models showed larger errors, and accuracy improved only weakly with message volume.

Employees often struggle to identify "who knows what," leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.

Summary

Main Finding

Large language models can infer individual domain knowledge from long-term organizational Slack logs, but accuracy varies substantially by model and is limited in absolute terms. Evaluated against self-reported ratings from 27 users (drawn from an archive of 27,188 messages by 43 users), Gemini 2.5 Flash produced the best zero-shot estimates (MAE ≈ 21.13%), while GPT-family models showed substantially larger estimation errors. Message volume had only a weak relationship with accuracy, so more text alone did not guarantee better inference.
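
One way to check a "weak relationship" between message volume and accuracy is a rank correlation between each user's message count and that user's absolute estimation error. The sketch below uses Spearman's coefficient, which is an assumption: this summary does not say which statistic the authors used. The function names and the toy numbers are hypothetical and are not data from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) if either input is constant.
    return cov / (var_x * var_y) ** 0.5


# Hypothetical toy values, NOT data from the paper.
message_counts = [3, 40, 208, 900, 10819]
abs_errors = [30.0, 18.0, 25.0, 22.0, 20.0]
rho = spearman(message_counts, abs_errors)
print(f"Spearman rho between message count and absolute error: {rho:.2f}")
```

A coefficient near zero would match the reported weak dependence, while a strongly negative one would mean that heavier posters are estimated more accurately.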

Key Points

Research questions: (1) How precisely can LLMs estimate human domain knowledge from chat logs? (2) Which LLMs perform best? (3) How does message volume affect accuracy?
Dataset: Slack archive spanning Apr 30, 2017–Nov 4, 2024 (2,744 days), 27,188 messages, 43 users in 94 channels; 27 users participated in self-annotation (ground truth).
User message counts varied widely: min 3, max 10,819, mean ≈ 792, median 208.
Models evaluated (zero-shot): Claude Haiku 4.5, Claude Sonnet 4.5, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o, GPT o3, GPT-5.
Best performer: Gemini 2.5 Flash (MAE ≈ 21.13%). GPT models performed worse (errors notably larger).
Signal strength: Estimation accuracy depended only weakly on number of messages; passive cues (e.g., channel join events) were also used.
Practical pipeline: CLI for ingesting Slack JSON, per-user/per-channel chunking, model-specific token budgeting and context-window handling, averaging per-user estimates across channels.
Limitations acknowledged by authors: small and single-organization sample, reliance on self-reported skills as ground truth, zero-shot evaluation (no fine-tuning), privacy risks, and need for richer/structured representations of knowledge.

Data & Methods

Data source and selection:
- Slack logs (JSON) across 2,744 days; filtered to 27 target participants (availability/contactability).
- Messages include user text, replies, reactions, attachments, and system events (e.g., channel_join).
Preprocessing & features:
- Used both active utterances and passive signals (channel membership) to infer expertise.
- Token counting via cl100k_base tokenizer as a conservative approximation across vendors.
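
The split between active utterances and passive signals can be sketched as follows, assuming records shaped like a Slack export (a list of message objects with `user`, `text`, and an optional `subtype`). `PASSIVE_SUBTYPES`, `split_signals`, and the sample records are hypothetical names and values for illustration, not the authors' code.

```python
import json
from collections import defaultdict

# "channel_join" is the passive event named in the text; treating it as the
# only passive subtype is an assumption of this sketch.
PASSIVE_SUBTYPES = {"channel_join"}


def split_signals(messages, channel):
    """Group one channel's records per user into active text and passive events."""
    per_user = defaultdict(lambda: {"active": [], "passive": []})
    for msg in messages:
        user = msg.get("user")
        if user is None:
            continue  # skip records that carry no author
        subtype = msg.get("subtype")
        if subtype in PASSIVE_SUBTYPES:
            per_user[user]["passive"].append((channel, subtype))
        elif subtype is None and msg.get("text"):
            per_user[user]["active"].append(msg["text"])
        # Records with any other subtype are ignored in this sketch.
    return dict(per_user)


# Hypothetical sample records, not taken from the study's data.
sample = json.loads("""[
  {"type": "message", "user": "U1", "text": "The CUDA kernel needs a sync here.", "ts": "1.0"},
  {"type": "message", "subtype": "channel_join", "user": "U2",
   "text": "<@U2> has joined the channel", "ts": "2.0"}
]""")
signals = split_signals(sample, channel="gpu-dev")
print(signals)
```

Keeping the two signal types apart lets a prompt state that a user merely joined a channel, rather than mixing the system-generated join text in with what the user actually wrote.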
LLM orchestration:
- Selected contemporary API-friendly LLMs (Anthropic, Google Gemini, OpenAI).
- Handled model differences: token parameters (max_tokens vs max_completion_tokens), model-specific context windows (ranged from 4k to 200k+ tokens), sampling controls, and output length reservations.
- Chunking strategy: safety factor on context window to compute per-chunk budget, reserved tokens for system prompt and model output.
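
The budgeting and chunking steps above can be sketched as below. The safety factor of 0.8, the reserved-token values, the 4-characters-per-token counter, and all function names are assumptions for illustration. The summary gives no concrete values, and a real pipeline would use an actual tokenizer in place of the stand-in.

```python
def chunk_budget(context_window, system_prompt_tokens, reserved_output_tokens,
                 safety_factor=0.8):
    """Tokens left for log text in one request after applying the safety margin."""
    usable = int(context_window * safety_factor)
    budget = usable - system_prompt_tokens - reserved_output_tokens
    if budget <= 0:
        raise ValueError("context window too small for the reserved tokens")
    return budget


def approx_tokens(text):
    """Crude stand-in tokenizer: roughly 4 characters per token (assumption)."""
    return max(1, len(text) // 4)


def chunk_messages(messages, budget, count_tokens=approx_tokens):
    """Greedily pack messages, in order, into chunks that fit the budget."""
    chunks, current, used = [], [], 0
    for text in messages:
        cost = count_tokens(text)
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        # A single message larger than the budget still gets its own chunk.
        current.append(text)
        used += cost
    if current:
        chunks.append(current)
    return chunks


# Hypothetical numbers for a model with a 200k-token context window.
budget = chunk_budget(context_window=200_000,
                      system_prompt_tokens=500,
                      reserved_output_tokens=2_000)
chunks = chunk_messages(["a" * 40] * 3, budget=25)
print(budget, [len(c) for c in chunks])
```

Greedy in-order packing keeps each chunk chronologically contiguous, which is one reasonable choice when chunks are built per user and per channel.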
Inference & aggregation:
- Sent chunked prompts to each model, obtained per-user, per-channel estimates of domain-knowledge levels.
- Averaged estimates across channels to form each user’s predicted profile.
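
A minimal sketch of the aggregation step, assuming each model response has already been parsed into a per-domain score dictionary. The unweighted mean, the tuple layout, and the domain names are assumptions: the summary says only that estimates were averaged across channels.

```python
from collections import defaultdict
from statistics import mean


def average_across_channels(estimates):
    """Average per-channel scores into one profile per user.

    estimates: iterable of (user, channel, {domain: score}) tuples.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for user, _channel, scores in estimates:
        for domain, score in scores.items():
            collected[user][domain].append(score)
    return {
        user: {domain: mean(values) for domain, values in domains.items()}
        for user, domains in collected.items()
    }


# Hypothetical per-channel estimates on a 0-100 scale.
profile = average_across_channels([
    ("U1", "ml", {"python": 80, "statistics": 60}),
    ("U1", "infra", {"python": 60}),
])
print(profile)
```

In this sketch a domain that a channel's estimate does not mention is left out of the average rather than counted as zero. A different convention would change the resulting profile.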
Evaluation:
- Ground truth: self-reported skill ratings from 27 participants (annotation after seeing model estimates).
- Metric: Mean Absolute Error (MAE) between LLM estimates and self-ratings.
- Comparison: MAE reported per model; Gemini 2.5 Flash lowest (≈21.13%).
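
For reference, MAE is the mean of the absolute differences between predicted and self-reported scores. The sketch below assumes both are on the same 0-100 scale, which would explain MAE being reported in percent. The numbers are hypothetical, not the paper's.

```python
def mean_absolute_error(predicted, actual):
    """MAE = (1/n) * sum(|predicted_i - actual_i|)."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical scores for three skills of one participant.
llm_estimates = [70, 40, 90]
self_ratings = [50, 60, 80]
mae = mean_absolute_error(llm_estimates, self_ratings)
print(f"MAE: {mae:.2f}")
```

Because MAE discards the sign of each error, it does not show whether a model systematically over- or under-estimates skills. Checking that would need the signed mean error as well.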

Implications for AI Economics

Productivity & search-cost reduction:
- Automated expertise mapping can reduce time spent “finding who knows what” (McKinsey/IDC context cited), potentially lowering search costs and onboarding delays—i.e., measurable organizational productivity gains.
Valuation of tacit knowledge:
- Firms may better quantify internal knowledge capital, enabling targeted training investments, smarter staffing, and faster knowledge transfer—improving human capital allocation efficiency.
Model selection & procurement:
- Model choice materially affects performance; organizations should benchmark multiple LLMs for knowledge-mapping tasks rather than assume parity across vendors.
Limits to scale-by-data:
- Weak dependence on message volume suggests diminishing returns from simply collecting more chat logs; quality, structure, and diversity of signals (e.g., explicit artifacts, code, documents, task metadata) matter more than raw quantity.
Privacy, governance, and legal risks:
- Deploying this capability raises privacy and consent issues (sensitive info in chats, surveillance concerns). Economic benefits must be weighed against regulatory compliance costs, potential morale impacts, and opt-in/compensation policies.
Implementation trade-offs:
- Zero-shot performance is promising but imperfect (MAE ~21%). To reach production-grade accuracy, firms likely need structured integrations (HR metadata, project assignments), fine-tuning, active human feedback loops, and privacy-preserving architectures (on-prem or federated models).
Market implications:
- Demand for tools that map organizational expertise could spur new enterprise offerings (expertise discovery platforms, matchmaking services), altering spend on knowledge management and internal search systems.
Research & investment priorities:
- From an economics viewpoint, investing in richer data collection (structured signals, verification labels), privacy-safe deployment, and evaluation frameworks will likely yield higher ROI than simply increasing log volume or relying on off-the-shelf zero-shot LLMs.

Suggested next steps for researchers/practitioners:
- Replicate in multiple organizations and domains to test generalizability and economic impact.
- Evaluate supervised/fine-tuned models and richer feature sets (documents, code, calendar events).
- Develop privacy-preserving protocols (differential privacy, federated learning) and human-centered consent mechanisms before deployment.

Assessment

Paper Type: correlational
Evidence Strength: medium. The paper provides direct empirical evidence that LLMs can infer domain expertise from workplace chat logs and benchmarks multiple contemporary models, supporting the feasibility claim. However, the sample is small (27 self-reports), ground truth is self-reported rather than externally validated, the setting appears to be a single communication platform/organization, and only zero-shot inference (no fine-tuning or broader validation) was evaluated, limiting the strength and generalizability of conclusions.
Methods Rigor: medium. Strengths include evaluation across seven models, use of an interpretable metric (MAE), and an explicit check of dependence on message volume. Weaknesses include the small N of validated participants, reliance on subjective self-ratings as the only ground truth, limited description of prompt engineering or statistical uncertainty reporting in the summary, and no external validation or counterfactual tests to rule out confounds (e.g., role signaling or demographic cues).
Sample: 27,188 Slack messages written by 43 users from (apparently) one organizational Slack workspace; self-reported domain-skill ratings were collected from 27 participants and used as ground truth; seven LLMs (including Gemini 2.5 Flash, Claude, and multiple GPT-family models) were evaluated in a zero-shot setting.
Themes: human_ai_collab, org_design, productivity
Identification: No causal identification; the study assesses validity by directly comparing zero-shot LLM-generated skill estimates from long-term Slack logs to participants' self-reported skill ratings and reporting prediction error (MAE) and correlations; robustness checks include analysis of how estimation error varies with message volume.
Generalizability:
- Small sample of validated users (27) limits statistical power.
- Likely single-organization Slack data; may not generalize across industries, communication norms, or languages.
- Ground truth is self-reported skill ratings, which are subjective and may be biased.
- Only Slack text was used; other signals (code repos, documents, meetings) were not evaluated.
- Only zero-shot inference was tested; fine-tuned or prompt-engineered deployments may perform differently.
- Model versions and training data change rapidly, so results may not transfer to future models.
- Privacy, role, and demographic confounds in chat data could bias inferences.

Claims (8)

Claim	Category	Direction	Confidence	Outcome	Details
Employees often struggle to identify "who knows what," leading to organizational productivity losses.	Organizational Efficiency	negative	high	organizational productivity (general claim about productivity losses due to difficulty locating expertise)	0.05
We analyze 27,188 messages from 43 users to investigate whether LLMs can infer individual domain knowledge from long-term Slack logs.	Other	positive	high	dataset size and coverage (messages and users analyzed)	n=431 27,188 messages; 43 users 0.5
We evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants.	Other	neutral	high	comparison between model zero-shot skill estimates and self-reported skill ratings	n=27 0.5
Gemini 2.5 Flash achieved the lowest error (MAE 21.13%).	Output Quality	positive	high	mean absolute error (MAE) of skill estimates	n=27 MAE 21.13% 0.5
GPT models showed significantly larger discrepancies compared to other evaluated models.	Output Quality	negative	high	discrepancy/error between model estimates and self-reported skills	n=27 0.3
Estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference.	Task Completion Time	null_result	high	relationship between message volume (amount of text) and model estimation accuracy	n=43 0.3
These findings demonstrate the feasibility and current limits of automated expertise mapping.	Output Quality	mixed	high	feasibility (ability to infer expertise) and limits (accuracy constraints) of automated expertise mapping	n=27 0.3
There is a need for privacy-preserving deployments and richer, structure-aware representations of human knowledge for practical use.	AI Safety and Ethics	positive	high	requirement for privacy-preserving deployment practices and improved representations	0.05