Large language models can map who knows what from workplace Slack logs, but accuracy is limited. Gemini 2.5 Flash estimated individual expertise with a mean absolute error of 21.1 percentage points, GPT models performed considerably worse, and more messages did not substantially improve inference. The results show both the feasibility and the limits of automated expertise mapping, and they raise privacy concerns.
Employees often struggle to identify "who knows what," leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
Summary
Main Finding
Large language models can infer individual domain knowledge from long-term organizational Slack logs, but accuracy varies substantially by model and is limited in absolute terms. In an evaluation against self-ratings from 27 participants (drawn from a 27,188-message archive covering 43 users), Gemini 2.5 Flash produced the best zero-shot estimates (MAE ≈ 21.13%), while GPT-family models showed substantially larger estimation errors. Message volume had only a weak relationship with accuracy, so more text alone did not guarantee better inference.
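One way to check a volume-versus-accuracy relationship like this is a rank correlation between each user's message count and that user's absolute error. The summary does not say which statistic the authors used, so the sketch below is an assumption: it implements Spearman's rank correlation with the standard library only, and the sample numbers are invented for illustration.

```python
from math import sqrt
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(ranks(xs), ranks(ys))


if __name__ == "__main__":
    # Invented per-user values, NOT data from the study.
    message_counts = [3, 208, 792, 10819, 55]
    absolute_errors = [25.0, 18.0, 22.0, 20.0, 24.0]
    print(f"rho = {spearman(message_counts, absolute_errors):.3f}")
```

A coefficient near zero would be consistent with the weak dependence described above, while a strongly negative value would mean that users with more messages are estimated more accurately.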
Key Points
- Research questions: (1) How precisely can LLMs estimate human domain knowledge from chat logs? (2) Which LLMs perform best? (3) How does message volume affect accuracy?
- Dataset: Slack archive spanning Apr 30, 2017–Nov 4, 2024 (2,744 days), 27,188 messages, 43 users in 94 channels; 27 users participated in self-annotation (ground truth).
- User message counts varied widely: min 3, max 10,819, mean ≈ 792, median 208.
- Models evaluated (zero-shot): Claude Haiku 4.5, Claude Sonnet 4.5, Gemini 2.5 Flash, Gemini 2.5 Pro, GPT-4o, GPT o3, GPT-5.
- Best performer: Gemini 2.5 Flash (MAE ≈ 21.13%). GPT models showed notably larger errors.
- Signal strength: Estimation accuracy depended only weakly on number of messages; passive cues (e.g., channel join events) were also used.
- Practical pipeline: CLI for ingesting Slack JSON, per-user/per-channel chunking, model-specific token budgeting and context-window handling, averaging per-user estimates across channels.
- Limitations acknowledged by authors: small and single-organization sample, reliance on self-reported skills as ground truth, zero-shot evaluation (no fine-tuning), privacy risks, and need for richer/structured representations of knowledge.
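The token budgeting step in the pipeline bullet can be sketched as follows. This is a minimal illustration rather than the authors' CLI: `ModelBudget`, `approx_tokens`, `chunk_messages`, the 0.8 safety factor, and all numbers are hypothetical. The whitespace counter is a stand-in so the example has no dependencies. To follow the described setup more closely, one could pass a counter built on the `cl100k_base` encoding from the `tiktoken` package instead.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class ModelBudget:
    """Hypothetical per-model settings; values are placeholders, not the paper's."""
    context_window: int          # total tokens the model accepts
    system_prompt_tokens: int    # reserved for the system prompt
    output_tokens: int           # reserved for the model's reply
    safety_factor: float = 0.8   # assumed margin for tokenizer mismatch

    def chunk_budget(self) -> int:
        """Tokens left for message text in one chunk."""
        usable = int(self.context_window * self.safety_factor)
        budget = usable - self.system_prompt_tokens - self.output_tokens
        if budget <= 0:
            raise ValueError("reservations exceed the usable context window")
        return budget


def approx_tokens(text: str) -> int:
    """Stand-in token counter (whitespace split); swap in a real tokenizer."""
    return len(text.split())


def chunk_messages(
    messages: Iterable[str],
    budget: int,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[List[str]]:
    """Greedily pack messages, in order, into chunks that fit the budget.

    A single message larger than the budget is placed in a chunk by itself;
    truncating or splitting it is left to the caller.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for message in messages:
        cost = count_tokens(message)
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(message)
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    budget = ModelBudget(context_window=8000, system_prompt_tokens=500,
                         output_tokens=1000).chunk_budget()
    demo = ["deploy failed on staging again", "try pinning the base image"]
    print(budget, chunk_messages(demo, budget))
```

Chunking per user and per channel, as the pipeline does, would amount to calling `chunk_messages` once for each (user, channel) group of messages.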
Data & Methods
- Data source and selection:
- Slack logs (JSON) across 2,744 days; filtered to 27 target participants (availability/contactability).
- Messages include user text, replies, reactions, attachments, and system events (e.g., channel_join).
- Preprocessing & features:
- Used both active utterances and passive signals (channel membership) to infer expertise.
- Token counting via cl100k_base tokenizer as a conservative approximation across vendors.
- LLM orchestration:
- Selected contemporary API-friendly LLMs (Anthropic, Google Gemini, OpenAI).
- Handled model differences: token parameters (max_tokens vs max_completion_tokens), model-specific context windows (ranged from 4k to 200k+ tokens), sampling controls, and output length reservations.
- Chunking strategy: safety factor on context window to compute per-chunk budget, reserved tokens for system prompt and model output.
- Inference & aggregation:
- Sent chunked prompts to each model, obtained per-user, per-channel estimates of domain-knowledge levels.
- Averaged estimates across channels to form each user’s predicted profile.
- Evaluation:
- Ground truth: self-reported skill ratings from 27 participants (annotated after seeing the model estimates).
- Metric: Mean Absolute Error (MAE) between LLM estimates and self-ratings.
- Comparison: MAE reported per model; Gemini 2.5 Flash lowest (≈21.13%).
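The aggregation and evaluation steps above can be sketched in a few lines. The sketch rests on assumptions the summary does not confirm: scores are treated as percentages on a 0 to 100 scale, MAE is averaged over every (user, domain) pair that has a self-rating, and the function names and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

# One model estimate: (user, channel, domain, score in percent).
Estimate = Tuple[str, str, str, float]
Profile = Dict[str, Dict[str, float]]  # user -> domain -> score


def average_across_channels(estimates: Iterable[Estimate]) -> Profile:
    """Average each user's per-channel scores into one score per domain."""
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for user, _channel, domain, score in estimates:
        buckets[(user, domain)].append(score)
    profile: Profile = {}
    for (user, domain), scores in buckets.items():
        profile.setdefault(user, {})[domain] = mean(scores)
    return profile


def mean_absolute_error(predicted: Profile, self_rated: Profile) -> float:
    """MAE over every (user, domain) pair present in both profiles."""
    errors = [
        abs(predicted[user][domain] - truth)
        for user, ratings in self_rated.items()
        for domain, truth in ratings.items()
        if domain in predicted.get(user, {})
    ]
    if not errors:
        raise ValueError("no overlapping (user, domain) pairs to compare")
    return mean(errors)


if __name__ == "__main__":
    # Invented values, NOT data from the study.
    estimates = [
        ("u1", "#dev", "python", 80.0),
        ("u1", "#ops", "python", 60.0),
        ("u1", "#dev", "sql", 40.0),
        ("u2", "#dev", "python", 20.0),
    ]
    self_rated = {"u1": {"python": 90.0, "sql": 50.0}, "u2": {"python": 30.0}}
    predicted = average_across_channels(estimates)
    print(f"MAE = {mean_absolute_error(predicted, self_rated):.2f}")
```

Averaging per user first and then across users is an equally plausible reading of the metric, and it would give different numbers whenever users are rated on different numbers of domains.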
Implications for AI Economics
- Productivity & search-cost reduction:
- Automated expertise mapping can reduce time spent "finding who knows what" (the authors cite McKinsey and IDC for context), potentially lowering search costs and onboarding delays, both of which are measurable organizational productivity gains.
- Valuation of tacit knowledge:
- Firms may better quantify internal knowledge capital, enabling targeted training investments, smarter staffing, and faster knowledge transfer—improving human capital allocation efficiency.
- Model selection & procurement:
- Model choice materially affects performance; organizations should benchmark multiple LLMs for knowledge-mapping tasks rather than assume parity across vendors.
- Limits to scale-by-data:
- Weak dependence on message volume suggests diminishing returns from simply collecting more chat logs; quality, structure, and diversity of signals (e.g., explicit artifacts, code, documents, task metadata) matter more than raw quantity.
- Privacy, governance, and legal risks:
- Deploying this capability raises privacy and consent issues (sensitive info in chats, surveillance concerns). Economic benefits must be weighed against regulatory compliance costs, potential morale impacts, and opt-in/compensation policies.
- Implementation trade-offs:
- Zero-shot performance is promising but imperfect (MAE ~21%). To reach production-grade accuracy, firms likely need structured integrations (HR metadata, project assignments), fine-tuning, active human feedback loops, and privacy-preserving architectures (on-prem or federated models).
- Market implications:
- Demand for tools that map organizational expertise could spur new enterprise offerings (expertise discovery platforms, matchmaking services), altering spend on knowledge management and internal search systems.
- Research & investment priorities:
- From an economics viewpoint, investing in richer data collection (structured signals, verification labels), privacy-safe deployment, and evaluation frameworks will likely yield higher ROI than simply increasing log volume or relying on off-the-shelf zero-shot LLMs.
Suggested next steps for researchers/practitioners:
- Replicate in multiple organizations and domains to test generalizability and economic impact.
- Evaluate supervised/fine-tuned models and richer feature sets (documents, code, calendar events).
- Develop privacy-preserving protocols (differential privacy, federated learning) and human-centered consent mechanisms before deployment.
Assessment
Claims (8)
| Claim | Direction | Confidence | Outcome | Details |
|---|---|---|---|---|
| Employees often struggle to identify "who knows what," leading to organizational productivity losses. [Organizational Efficiency] | negative | high | organizational productivity (general claim about productivity losses due to difficulty locating expertise) | 0.05 |
| We analyze 27,188 messages from 43 users to investigate whether LLMs can infer individual domain knowledge from long-term Slack logs. [Other] | positive | high | dataset size and coverage (messages and users analyzed) | n=431; 27,188 messages; 43 users; 0.5 |
| We evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. [Other] | neutral | high | comparison between model zero-shot skill estimates and self-reported skill ratings | n=27; 0.5 |
| Gemini 2.5 Flash achieved the lowest error (MAE 21.13%). [Output Quality] | positive | high | mean absolute error (MAE) of skill estimates | n=27; MAE 21.13%; 0.5 |
| GPT models showed significantly larger discrepancies compared to other evaluated models. [Output Quality] | negative | high | discrepancy/error between model estimates and self-reported skills | n=27; 0.3 |
| Estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. [Task Completion Time] | null_result | high | relationship between message volume (amount of text) and model estimation accuracy | n=43; 0.3 |
| These findings demonstrate the feasibility and current limits of automated expertise mapping. [Output Quality] | mixed | high | feasibility (ability to infer expertise) and limits (accuracy constraints) of automated expertise mapping | n=27; 0.3 |
| There is a need for privacy-preserving deployments and richer, structure-aware representations of human knowledge for practical use. [AI Safety and Ethics] | positive | high | requirement for privacy-preserving deployment practices and improved representations | 0.05 |