From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety

Ganen Sethupathy, Lalit Dumka, Jan Schagen · March 31, 2026


Public spaces such as transport hubs, city centres, and event venues require timely and reliable detection of potentially violent behaviour to support public safety. While automated video analysis has made significant progress, practical deployment remains constrained by latency, privacy, and resource limitations, particularly under edge-computing conditions. This paper presents the design and demonstrator-based deployment of a hybrid edge-based action detection system that combines skeleton-based motion analysis with vision-language models for semantic scene interpretation. Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead, while vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations. Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints. The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup. The results highlight the complementary strengths and limitations of motion-centric and semantic approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning. The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.

Summary

Main Finding

A practical, open-source demonstrator shows that a hybrid edge deployment combining fast, privacy-preserving skeleton-based action detection with selectively invoked vision‑language models (VLMs) is a viable approach for real‑time public‑safety video analysis on GPU‑enabled edge hardware. Skeleton pipelines provide low‑latency, low‑resource continuous monitoring, while VLMs add contextual, zero‑shot semantic reasoning at greater computational and latency cost; a hybrid architecture that augments skeleton detections with targeted semantic checks balances performance, privacy, and operational cost.

Key Points

System goal: compare and integrate two paradigms under realistic edge constraints (latency, resource usage, privacy) rather than propose new models.
Implementation: dual-backend demonstrator running on an NVIDIA Jetson AGX Thor with a 5 MP USB camera; open-source agent-based architecture (agent layer not yet fully automated).
Skeleton pipeline (real‑time, motion-centric):
- Components: YOLOv6‑L‑Pose for 2D keypoints, ByteTrack for multi‑person tracking, per-track skeleton buffers (clip_len=100, stride=30), MotionBERT for 2D→3D lifting, remapping to NTU skeleton layout, ProtoGCN (or CTR‑GCN) for action classification.
- Throughput: YOLO‑Pose runs at ~21 ms/frame as a native .pt checkpoint; TensorRT export reduced this to ~13.2 ms/frame on the Jetson AGX Thor. New classifications are emitted roughly every 1 s per track given the buffering and stride settings.
- Risk thresholds (empirically tuned): DANGER when cumulative danger class probability > 0.3; WARNING when warning probability > 0.5.
- Strengths: lightweight data (keypoints), privacy-aware (no raw frames transmitted), scalable per person because YOLO‑Pose is bottom‑up, and suitable for continuous monitoring.
- Limitations: “context blindness” (can't reliably disambiguate contextually different but kinematically similar actions), sensitivity to occlusion/shadows, closed‑set label limitations.
VLM pipeline (semantic, higher cost):
- Uses a large VLM (example: Qwen‑style 35B) with chunked inference (e.g., 4 s chunks sampled at ~6 fps); produces richer textual summaries and supports zero‑shot reasoning.
- Strengths: contextual understanding, can handle previously unseen scenarios and produce human‑readable summaries.
- Limitations: substantially higher compute/memory requirements, harder to run on constrained edge devices, higher latency, and higher energy consumption.
Software & UI:
- Dual FastAPI backends (ports 8080 VLM, 8090 skeleton) + React dashboard with live monitor, alert feed, metrics, and control bar.
- Producer‑consumer capture architecture to decouple frame acquisition from slow inference and avoid frame drops.
- Persistent storage via SQLite; configurable environment via Pydantic settings.
Demonstrator focus: system design, deployment trade‑offs, latency/resource profiling, and human‑in‑the‑loop workflows; not large‑scale benchmark evaluation.
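
The risk thresholds listed for the skeleton pipeline can be illustrated with a minimal sketch. The thresholds (0.3 and 0.5) come from the summary; the class groupings, the function name `risk_level`, the choice to sum probabilities within the warning group, and the rule that DANGER takes precedence over WARNING are assumptions made for illustration, since the summary does not say which NTU classes belong to which group.

```python
from typing import Mapping, Set

DANGER_THRESHOLD = 0.3   # from the summary: cumulative danger probability > 0.3
WARNING_THRESHOLD = 0.5  # from the summary: warning probability > 0.5

# Hypothetical groupings; the real class-to-risk mapping is not given in the summary.
DANGER_CLASSES: Set[str] = {"punch/slap", "kicking", "pushing"}
WARNING_CLASSES: Set[str] = {"pointing finger", "pat on back"}


def risk_level(
    probs: Mapping[str, float],
    danger: Set[str] = DANGER_CLASSES,
    warning: Set[str] = WARNING_CLASSES,
) -> str:
    """Aggregate per-class probabilities into DANGER / WARNING / SAFE."""
    danger_p = sum(p for label, p in probs.items() if label in danger)
    warning_p = sum(p for label, p in probs.items() if label in warning)
    # Assumption: DANGER is checked first so it wins when both thresholds are exceeded.
    if danger_p > DANGER_THRESHOLD:
        return "DANGER"
    if warning_p > WARNING_THRESHOLD:
        return "WARNING"
    return "SAFE"
```

Because the danger threshold applies to a cumulative probability, several moderately likely violent classes can trigger an alert together even when none of them is the top prediction.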

Data & Methods

Hardware: NVIDIA Jetson AGX Thor Developer Kit (an earlier Jetson Nano was found insufficient for concurrent VLM workloads).
Camera: 5 MP RGB USB, all processing local (no raw video transmitted off‑device).
Skeleton pipeline stages:
Pose estimation: YOLOv6‑L‑Pose (17 COCO keypoints).
Tracking: ByteTrack for ID persistence and robust association.
Buffering: sliding windows per track, clip_len=100 frames, stride=30 frames.
2D→3D lifting: MotionBERT (DSTformer pretrained on Human3.6M) or fallback pseudo‑3D (z=0).
Joint remapping: COCO→H36M→NTU25 with synthetic/approximate joints where needed.
Pairing: proximity‑based pairing for two‑person NTU mutual actions; optional distance gating.
Classification: ProtoGCN (preprocessing: centering, uniform sampling to 100 frames, formatting to model input). Output probabilities over NTU classes aggregated into DANGER/WARNING/SAFE.
VLM pipeline settings (representative): chunk_duration_sec=4.0 (seconds) and recent_fps=6; example generation parameters include max_tokens ~10024, with temperature settings as in the prototype; large models are reported as examples (e.g., Qwen‑series).
Software stack: Python (FastAPI, asyncio queues), TensorRT exports for faster inference, React + Tailwind CSS frontend, SQLite + SQLAlchemy for persistence.
Evaluation: demonstrator‑based tests on device measuring latency, resource usage, and qualitative false‑alarm behavior; thresholds tuned empirically. No large-scale real‑world deployment due to regulatory/ethical constraints.
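
The per-track buffering stage (clip_len=100, stride=30) can be sketched as follows. This is an illustrative reconstruction rather than the demonstrator's code: the class name `TrackBuffers` and its `push` method are hypothetical, and keypoints are treated as opaque objects.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, List, Optional

CLIP_LEN = 100  # frames per clip, as stated in the summary
STRIDE = 30     # frames between consecutive clips, as stated in the summary


class TrackBuffers:
    """Hypothetical per-track sliding-window buffer for skeleton frames."""

    def __init__(self, clip_len: int = CLIP_LEN, stride: int = STRIDE) -> None:
        self.clip_len = clip_len
        self.stride = stride
        # One bounded deque per track ID; old frames fall off automatically.
        self._frames: Dict[int, Deque[Any]] = defaultdict(
            lambda: deque(maxlen=clip_len)
        )
        self._since_emit: Dict[int, int] = defaultdict(int)

    def push(self, track_id: int, keypoints: Any) -> Optional[List[Any]]:
        """Add one frame; return a full clip when a new window is due."""
        buf = self._frames[track_id]
        buf.append(keypoints)
        self._since_emit[track_id] += 1
        window_full = len(buf) == self.clip_len
        if window_full and self._since_emit[track_id] >= self.stride:
            self._since_emit[track_id] = 0
            return list(buf)
        return None
```

With these settings the first clip for a track becomes available after 100 frames, and a new clip follows every 30 frames. Assuming a capture rate of about 30 fps (a figure the summary does not state), that is consistent with the reported cadence of roughly one classification per second per track, after an initial warm-up of about 3.3 s.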

Implications for AI Economics

Capital vs operating costs
- Edge hardware investment (e.g., Jetson AGX Thor) entails higher upfront capital cost than minimal embedded boards (Jetson Nano) but lowers recurring bandwidth and cloud compute charges by keeping raw video local. For privacy‑sensitive public safety, edge deployment can reduce compliance and data transfer costs.
- VLMs dramatically increase compute and energy requirements; maintaining many VLM‑capable edge nodes is costly. Hybrid strategies that run lightweight skeleton models continuously and invoke VLMs selectively (e.g., on alerts or low‑confidence/ambiguous cases) optimize marginal costs.
Scalability and deployment economics
- Skeleton-based pipelines scale more cheaply per camera: compact keypoint data and lower GPU load allow many cameras per device or lower‑cost hardware. This reduces per‑unit deployment cost and operational energy spend.
- VLM-based semantic checks add per‑incident marginal cost; operational design should minimize unnecessary invocations (agent orchestration, confidence thresholds, batching) to contain costs.
False positives, human workload, and downstream costs
- Skeleton context blindness can increase false alarms (e.g., by misclassifying benign interactions as violent), raising human review workload and potential response costs (dispatch, interruptions). These operational costs can outweigh savings from automation if not mitigated.
- Hybrid architectures that integrate semantic checks and human‑in‑the‑loop review reduce false positives and thus lower costly false responses; design of thresholds and orchestration has direct economic impact.
Privacy, regulation, and procurement value
- Privacy‑preserving edge processing (no raw video export) has economic advantages: reduced compliance burden, lower legal/data‑handling costs, and potentially higher public acceptance—factors that facilitate procurement and long‑term operation.
- Open‑source architecture lowers vendor lock‑in risk and licensing costs, enabling flexible procurement and local adaptation; however, it shifts some maintenance and support cost to operators.
Market and labor effects
- Demand for hybrid edge analytics creates market opportunities for bundled hardware+software offerings, managed edge services, and specialized maintenance/support providers.
- Operational staff requirements shift from continuous human monitoring to alert review, model tuning, and edge maintenance—requiring different skills (data engineering, model ops) with implications for training and labor allocation.
Recommendations for economically efficient deployments
- Use skeleton‑based monitoring as the continuous, low‑cost first line and reserve VLM semantics for disambiguation or higher‑risk events.
- Invest in agent orchestration to manage when to run expensive semantic models (confidence gating, batching, human‑in‑loop escalation).
- Monitor and tune thresholds empirically to control false alarm rates and human review costs.
- Favor hardware that supports TensorRT/quantization to reduce per‑inference latency and energy, improving total cost of ownership.
- Adopt open APIs and modular design to avoid vendor lock‑in and enable incremental upgrades as models/hardware evolve.
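
The confidence-gating recommendation can be made concrete with a small sketch of a gate that decides when a skeleton result should trigger the expensive VLM check. Everything here is hypothetical: the names `SkeletonResult` and `VlmGate`, the low-confidence cutoff of 0.6, and the 10 s cooldown are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SkeletonResult:
    risk: str          # "DANGER", "WARNING", or "SAFE"
    confidence: float  # top-class probability from the skeleton classifier


class VlmGate:
    """Hypothetical gate limiting how often the VLM is invoked."""

    def __init__(self, low_confidence: float = 0.6, cooldown_sec: float = 10.0) -> None:
        self.low_confidence = low_confidence  # assumed cutoff for "ambiguous"
        self.cooldown_sec = cooldown_sec      # assumed minimum gap between VLM calls
        self._last_call: Optional[float] = None

    def should_invoke(self, result: SkeletonResult, now: float) -> bool:
        # Escalate on any alert, or when the skeleton model is unsure.
        needs_check = result.risk != "SAFE" or result.confidence < self.low_confidence
        if not needs_check:
            return False
        # Rate-limit so a burst of alerts does not saturate the edge device.
        if self._last_call is not None and now - self._last_call < self.cooldown_sec:
            return False
        self._last_call = now
        return True
```

A caller would pass a monotonic timestamp (for example from `time.monotonic()`) as `now`. The cooldown trades some responsiveness for a bounded VLM load; whether that trade-off is acceptable depends on the deployment.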

Overall, this work illustrates that careful system design, matching continuous, cheap sensing (skeletons) with occasional expensive semantic inference (VLMs), can yield real-time, privacy‑aware surveillance solutions whose economic viability depends on hardware choice, orchestration policies, and operational workflows.

Claims (6)

Claim	Category	Direction	Confidence	Outcome
Skeleton-based processing enables continuous, privacy-aware monitoring with low computational overhead.	Organizational Efficiency	positive	high	computational overhead (latency and resource usage)
Vision-language models provide contextual understanding and zero-shot reasoning capabilities for complex and previously unseen situations.	Decision Quality	positive	high	semantic/contextual understanding and zero-shot reasoning capability
The system is implemented on a GPU-enabled edge device and evaluated with respect to latency, resource usage, and operational trade-offs using a demonstrator-based setup.	Organizational Efficiency	null_result	high	latency and resource usage
Results highlight the complementary strengths and limitations of motion-centric (skeleton-based) and semantic (vision-language) approaches and motivate a hybrid architecture that selectively augments fast skeleton-based detection with higher-level semantic reasoning.	Task Allocation	positive	high	trade-offs between detection speed (motion-centric) and semantic reasoning capability
The presented system provides a practical foundation for privacy-aware, real-time video analysis in public safety applications.	AI Safety and Ethics	positive	medium	privacy-aware real-time video analysis capability
Rather than proposing new recognition models, the contribution focuses on a system-level comparison of both paradigms under realistic edge constraints.	Other	null_result	high	scope of contribution (system-level comparison vs model development)
