Human–AI co-learning with a metacognitive 'Cognitive Shadow' shows promise for speeding decisions and reducing missed detections in simulated Arctic surveillance, with confidence estimates enabling safe adjustable autonomy. The approach promises productivity gains and easier procurement adoption but remains to be validated in real-world operations.
Human-autonomy teaming (HAT) is implemented across numerous industries as a means of increasing workload capacity without increasing worker cognitive load. However, autonomous systems face a major sociotechnical integration challenge when they must collaborate with human operators, which hinders their effectiveness. Specifically, human-AI teamwork introduces new cognitive costs and skill requirements for both humans and artificial agents. These gaps can be overcome by improving shared understanding and mutual adaptation, specifically through human-AI co-learning (HACL) of teamwork and taskwork. We hypothesize that to be effective, HAT systems must do more than have human and AI-based counterparts learn how to perform the required taskwork. They must implement HACL to learn how to engage together in teamwork processes, developing the mutual understanding and trust needed for effective mission management and adaptation. Implementing an adaptive command-and-control process with adjustable HAT, augmented by AI metacognition, has significant potential to instigate HACL. Cognitive Shadow (CS) is an expert policy-capturing toolkit that can automatically learn human decision patterns using a combination of supervised machine learning algorithms (classification or regression). Its main goal is to learn from experts and then provide real-time automation support, enhancing HAT effectiveness through judgmental bootstrapping. Moreover, CS provides real-time, dynamic model adjustments based on immediate user feedback, facilitating continuous improvement in decision-making recommendations. New AI metacognition capabilities have expanded CS, using a recursive approach to model its own reliability based on situation attributes. The meta-model supervises the decision support model, learning to predict, on a 0-1 scale, when it is likely to be correct and when it has a greater risk of being wrong.
This AI metacognition capability provides an empirically grounded reliability metric that helps the human collaborator decide whether to rely on the AI. For HAT systems, it allows a self-confidence threshold to be set: the system makes autonomous decisions for high-certainty model predictions and cedes autonomy to the human for low-certainty cases. HAT systems have been successfully integrated into various industries, including aspects of national defence. In the Canadian Arctic waterways, climate change continues to open new routes and therefore increase maritime traffic, necessitating enhanced, more efficient surveillance strategies such as HACL. Our framework was tested in simulated maritime surveillance scenarios in the Canadian Arctic waterways, where human operators assessed entities and assigned threat levels. Concurrently, CS was implemented to capture decision-making patterns, aligning AI threat assessments with those of human operators. Using a workload perception and situational awareness questionnaire, together with trust and self-confidence scales, we quantify the human factors associated with implementing HACL. Additionally, performance outcomes in surveillance scenarios can be quantitatively assessed through key metrics, including classification accuracy, critical change detection, time to classify, and omission rates. This ongoing work contributes knowledge for the design of effective HACL systems, offers new applied cognitive science perspectives on human and AI-agent collaboration, and provides a new testbed with benchmark data for iteratively testing successive versions of this new HACL capability.
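The recursive metacognition scheme described above can be sketched in a few lines. This is a minimal illustration, not the toolkit's actual implementation: the primary model, the bucketing of situation attributes, and all data values are assumptions. A stand-in 1-nearest-neighbour model imitates expert labels, and a meta-model estimates, from held-out decisions, the empirical probability that the primary model is correct in a given class of situations.

```python
# Sketch (illustrative names and values, not from the paper) of the recursive
# metacognition idea: a primary model imitates expert decisions, and a
# meta-model learns to predict, from situation attributes, the probability
# (0-1) that the primary model's prediction is correct.

from collections import defaultdict

def nn_predict(train, x):
    """1-nearest-neighbour primary model over numeric attribute vectors."""
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

def fit_meta(train, holdout, bucket):
    """Meta-model: per-bucket empirical accuracy of the primary model."""
    hits, counts = defaultdict(int), defaultdict(int)
    for x, y in holdout:
        b = bucket(x)
        counts[b] += 1
        hits[b] += int(nn_predict(train, x) == y)
    return {b: hits[b] / counts[b] for b in counts}

# Toy surveillance-like data: (speed, distance) -> threat label.
train = [((2, 9), "low"), ((8, 1), "high"), ((3, 8), "low"), ((9, 2), "high")]
holdout = [((2, 8), "low"), ((9, 1), "high"), ((5, 5), "low"), ((6, 4), "low")]
bucket = lambda x: "close" if x[1] < 5 else "far"

reliability = fit_meta(train, holdout, bucket)
x_new = (7, 3)
print(nn_predict(train, x_new), reliability[bucket(x_new)])  # -> high 0.5
```

The meta-model here is just a per-bucket accuracy table; the real system would use a learned model over situation attributes, but the supervisory relationship (a model predicting the decision model's correctness) is the same.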
Summary
Main Finding
Human-AI co-learning (HACL) — where humans and autonomous agents learn both taskwork and how to collaborate — improves human-autonomy teaming (HAT) effectiveness. Implementing an adaptive command-and-control process augmented by AI metacognition (the Cognitive Shadow toolkit) yields dynamic, real-time decision support that (a) aligns AI judgments with expert human patterns, (b) quantifies AI reliability, and (c) enables adjustable autonomy via self-confidence thresholds. In simulated Canadian Arctic maritime surveillance, this approach shows promise for improving classification accuracy, reducing time-to-decision and omissions, and supporting human trust and situational awareness.
Key Points
- Problem: HAT increases capabilities but introduces new cognitive costs and skill requirements; lack of shared understanding and mutual adaptation limits effectiveness.
- Proposal: HACL — joint learning of teamwork processes (not just taskwork) to create mutual understanding and trust that supports mission adaptation.
- Tool: Cognitive Shadow (CS) — a toolkit that learns expert decision patterns using supervised ML (classification/regression) to provide real-time automation and judgmental bootstrapping.
- Metacognition: A recursive meta-model predicts the primary model’s reliability (0–1), producing an empirical confidence metric that humans can use for reliance decisions.
- Adjustable autonomy: Self-confidence thresholds allow the system to act autonomously on high-certainty predictions and defer to humans on low-certainty cases.
- Empirical testbed: Simulated maritime surveillance in Canadian Arctic waterways—reflecting rising traffic due to climate change—used to test HACL and CS.
- Measured outcomes: Human factors (workload perception, situational awareness, trust, self-confidence) and performance metrics (classification accuracy, critical change detection, time to classify, omission rates).
- System dynamics: CS supports real-time model updates based on immediate user feedback, enabling iterative improvement and continuous alignment with human decision patterns.
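The adjustable-autonomy mechanism in the key points above reduces to a simple routing rule. The function and threshold below are illustrative assumptions, not the paper's API: predictions whose meta-model confidence clears the threshold are acted on autonomously; the rest are deferred to the human operator.

```python
# Minimal sketch of adjustable autonomy via a self-confidence threshold
# (names and the 0.8 default are assumptions for illustration).

def route(prediction: str, confidence: float, threshold: float = 0.8) -> tuple:
    """Return (actor, decision) for one entity assessment."""
    if confidence >= threshold:
        return ("AI", prediction)   # high certainty: act autonomously
    return ("human", None)          # low certainty: defer to the operator

print(route("high-threat", 0.92))  # -> ('AI', 'high-threat')
print(route("low-threat", 0.41))   # -> ('human', None)
```

Raising the threshold trades autonomy for safety: fewer autonomous actions, more human review of ambiguous cases.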
Data & Methods
- Domain and setting: Simulated maritime surveillance scenarios reflecting Canadian Arctic waterways (growing commercial and strategic traffic).
- Human participants: Operators assessed entities and assigned threat levels (exact sample size and participant demographics not specified in the summary).
- AI system: Cognitive Shadow implemented as supervised ML models to mimic human expert decisions (classification/regression); a separate recursive meta-model estimates model reliability per situation.
- Interaction loop: CS captures human decisions, provides real-time recommendations, receives immediate feedback, and adjusts models dynamically (human-in-the-loop learning).
- Evaluation instruments:
  - Human factors: workload perception and situational awareness questionnaires; trust and self-confidence scales.
  - Performance metrics: classification accuracy, detection of critical changes, time to classify, omission/error rates.
- Analysis: Comparison of human-alone vs. HACL-assisted performance and measures of trust/situational awareness; tracking model calibration and meta-model reliability across scenarios.
- Limitations noted or implied: reported work is ongoing and based on simulation; details such as sample sizes, statistical significance, generalizability across domains, and exact ML architectures/hyperparameters are not provided in the summary.
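The human-in-the-loop interaction loop described above can be sketched as follows. Class and method names are assumptions, and the majority-label "model" is a deliberate stand-in for the supervised classifier the toolkit would actually train; what the sketch shows is the loop structure: recommend, compare with the human's decision, record agreement, and fold the feedback into the training set immediately.

```python
# Hedged sketch of the CS capture/recommend/feedback loop (illustrative names).

class CognitiveShadowLoop:
    def __init__(self):
        self.examples = []    # (situation attributes, expert label)
        self.agreements = []  # rolling model-human agreement record

    def recommend(self, attributes):
        # Placeholder model: majority label seen so far (stand-in for a
        # trained classifier over the situation attributes).
        if not self.examples:
            return None
        labels = [y for _, y in self.examples]
        return max(set(labels), key=labels.count)

    def feedback(self, attributes, human_label):
        rec = self.recommend(attributes)
        self.agreements.append(rec == human_label)
        self.examples.append((attributes, human_label))  # immediate update

loop = CognitiveShadowLoop()
for attrs, label in [((1, 2), "low"), ((1, 3), "low"), ((9, 0), "high")]:
    loop.feedback(attrs, label)
print(sum(loop.agreements), "/", len(loop.agreements), "agreements")
```

Tracking `agreements` over time gives exactly the "change in model-human agreement over iterative interactions" metric listed later in the claims.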
Implications for AI Economics
- Productivity and operational efficiency:
  - Faster, more accurate classification and reduced omission rates can lower operating costs (fewer false alarms, quicker responses), increasing throughput per operator.
  - Adjustable autonomy allows reallocation of human oversight to higher-value tasks, potentially increasing per-worker productivity.
- Labor demand and skill premium:
  - HACL shifts required human skills from routine monitoring to supervisory, interpretive, and teaming skills (training and reskilling costs).
  - Demand for operators with higher trust calibration and decision oversight skills may increase wages for such roles; routine monitoring roles may contract.
- Adoption and deployment economics:
  - The metacognitive reliability metric can reduce adoption risk for purchasers (procurement decision-makers) by providing transparent error-risk assessments and enabling performance-based autonomy thresholds.
  - Systems that demonstrate measurable ROI via metrics (time saved, error reduction) are more likely to secure procurement in defense, maritime, and industrial markets.
- Liability, risk management, and insurance:
  - Empirical confidence estimates can inform liability allocation and contractual terms (e.g., when AI acts autonomously vs. human-in-the-loop), potentially lowering insurer premiums if risk is demonstrably managed.
- Market for HAT/HACL tools:
  - Growing demand in domains with rising operational tempo (e.g., Arctic shipping surveillance, critical infrastructure monitoring) creates a niche market for metacognitive HAT systems.
  - Benchmark datasets and testbeds (as this work provides) reduce market entry friction by enabling third-party validation and comparative procurement.
- Cost-benefit considerations:
  - Upfront R&D and integration costs versus recurring benefits: need for careful quantification of time savings, error reduction value, reduced staffing or redeployment benefits, and training/reskilling costs.
  - Continuous learning capabilities imply ongoing maintenance/data costs but can also lower long-run performance degradation and retraining expenses.
- Policy and institutional impacts:
  - Regulators and procurement agencies may need standards for metacognitive reliability reporting, human-autonomy thresholds, and auditing to facilitate adoption.
  - Public-good considerations (e.g., Arctic safety, environmental monitoring) might justify public investment or subsidies to accelerate deployment.
- Recommended next steps for economic analysis:
  - Conduct a quantified cost-benefit analysis using measured performance improvements (time-to-classify, omission reduction) to estimate operational savings.
  - Model labor reallocation scenarios and wage impacts under partial automation with adjustable autonomy.
  - Evaluate procurement models (capability-based contracting, milestone payments tied to performance metrics) and insurance implications tied to metacognitive reliability outputs.
  - Expand validation beyond simulation to operational pilots to better estimate externalities, maintenance costs, and long-term value.
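The quantified cost-benefit analysis recommended above reduces to simple arithmetic once performance improvements are measured. The sketch below shows the shape of that calculation only: every number is a hypothetical placeholder, not a measured result from this work.

```python
# Back-of-envelope cost-benefit sketch. All inputs are hypothetical
# placeholders; measured values would come from the testbed metrics
# (time-to-classify reduction, omission reduction).

def annual_savings(classifications_per_year, seconds_saved_per_classification,
                   operator_cost_per_hour, omissions_avoided, cost_per_omission):
    # Value of operator time saved by faster classification.
    time_value = (classifications_per_year * seconds_saved_per_classification
                  / 3600) * operator_cost_per_hour
    # Value of missed detections avoided.
    risk_value = omissions_avoided * cost_per_omission
    return time_value + risk_value

savings = annual_savings(
    classifications_per_year=50_000,
    seconds_saved_per_classification=30,   # hypothetical speed-up
    operator_cost_per_hour=80.0,
    omissions_avoided=12,                  # hypothetical
    cost_per_omission=10_000.0,
)
print(f"${savings:,.0f} per year")  # to weigh against R&D/integration costs
```

A full analysis would net these recurring benefits against upfront integration, training/reskilling, and ongoing maintenance/data costs noted above.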
Assessment
Claims (12)
| Claim | Category | Direction | Confidence (score) | Outcome | Details |
|---|---|---|---|---|---|
| Human-AI co-learning (HACL) improves human-autonomy teaming (HAT) effectiveness. | Team Performance | positive | medium (0.29) | overall HAT effectiveness (operational performance and human factors composite) | HACL improves human-autonomy teaming effectiveness in simulated maritime testbed |
| Implementing an adaptive command-and-control process augmented by AI metacognition (the Cognitive Shadow toolkit) aligns AI judgments with expert human decision patterns. | Decision Quality | positive | medium (0.29) | degree of alignment between AI model judgments and expert human decision patterns (model-human agreement) | Cognitive Shadow aligns AI judgments with expert human decision patterns (model-human agreement) |
| The Cognitive Shadow toolkit quantifies AI reliability with an empirical (0–1) confidence metric produced by a recursive meta-model. | AI Safety and Ethics | positive | medium (0.29) | meta-model predicted reliability (empirical confidence score, 0–1) | recursive meta-model outputs empirical reliability/confidence score (0–1) |
| Adjustable autonomy via self-confidence thresholds enables the system to act autonomously on high-certainty predictions and defer to humans on low-certainty cases. | Task Allocation | positive | medium (0.29) | frequency of autonomous actions vs. human deferrals as a function of meta-model confidence thresholds | adjustable autonomy uses confidence thresholds to decide autonomous action vs. human deferral |
| In the simulated Canadian Arctic maritime surveillance domain, HACL/CS shows promise for improving classification accuracy. | Output Quality | positive | medium (0.29) | classification accuracy (correctly classifying entities/threat levels) | HACL/CS improves classification accuracy in simulation |
| HACL/CS reduces time-to-decision in the simulated maritime surveillance tasks. | Task Completion Time | positive | medium (0.29) | time to classify / time-to-decision | HACL/CS reduces time-to-decision in simulation |
| HACL/CS reduces omission rates (missed detections) in the simulated scenarios. | Error Rate | positive | medium (0.29) | omission rate / missed detections | HACL/CS reduces omission rates (missed detections) in simulation |
| HACL/CS supports human trust and situational awareness. | Worker Satisfaction | positive | low (0.14) | self-reported trust and situational awareness scores | HACL/CS supports human trust and situational awareness (self-reported) |
| Cognitive Shadow supports real-time model updates based on immediate user feedback, enabling iterative improvement and continuous alignment with human decision patterns. | AI Safety and Ethics | positive | medium (0.29) | model update frequency / change in model-human agreement over iterative interactions | Cognitive Shadow enables real-time updates and iterative alignment with human decisions |
| HACL shifts required human skills from routine monitoring to supervisory, interpretive, and teaming skills, implying training and reskilling costs. | Skill Acquisition | mixed | low (0.14) | nature of operator tasks/skills required (qualitative change) and implied training/reskilling costs | HACL shifts human skill requirements toward supervisory/interpretive/teaming skills, implying retraining costs |
| The metacognitive reliability metric can reduce adoption risk for purchasers by providing transparent error-risk assessments and enabling performance-based autonomy thresholds. | Adoption Rate | positive | low (0.14) | adoption risk (qualitative or procurement decision proxies) | metacognitive reliability metric can reduce adoption risk by providing transparent error-risk assessments |
| Continuous learning capabilities imply ongoing maintenance/data costs but can lower long-run performance degradation and retraining expenses. | Firm Productivity | mixed | speculative (0.05) | maintenance/data costs versus long-run performance degradation and retraining costs (economic estimates) | continuous learning implies ongoing maintenance/data costs but may lower long-run degradation and retraining expenses |