
60-second health loop with auto-recovery, per-slice and server telemetry, and 7-day retention

Monitoring & Telemetry

Purpose

The monitor runs every 60 seconds inside the gateway process. On each run it collects server-level and per-slice telemetry, performs health checks, auto-restarts containers that are not running, and manages the slice pool (assigns waiting users and provisions new slices). A separate hourly job prunes telemetry older than 7 days.

Architecture

Monitor Loop (every 60 seconds)

Server telemetry: Collects RAM, swap, CPU, and disk usage from system commands. Stores the results in gateway.server_telemetry.
Slice pool management: Assigns available slices to users who do not have one. If the available count is < 1, auto-provisions a new slice.
Per-slice health checks: For each assigned or available slice:
- Checks that the Docker container is running (auto-restarts it if not)
- Calls /api/health and /api/status/state on the slice
- Collects RAM and CPU from a single bulk docker stats --no-stream call
- Collects disk usage from du -sb /data/slices/{N}/
- Stores the results in gateway.telemetry
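
The per-slice check described above could be sketched roughly as follows. This is an illustration, not the gateway's actual code: the names `check_slice`, `SliceCheck`, `docker_is_running`, and `docker_start` are hypothetical, and skipping the health probe on the tick in which a restart was issued is an assumption (the page does not say whether the probe still runs). The Docker commands are injected so the logic can be exercised without a Docker daemon.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable


def container_name(slice_number: int) -> str:
    # Naming pattern taken from the Auto-Recovery table: wank-slice-{N}
    return f"wank-slice-{slice_number}"


def docker_is_running(name: str) -> bool:
    # `docker inspect --format '{{.State.Running}}'` prints "true" or "false".
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and result.stdout.strip() == "true"


def docker_start(name: str) -> None:
    subprocess.run(["docker", "start", name], check=False)


@dataclass
class SliceCheck:
    slice_number: int
    restarted: bool
    healthy: bool


def check_slice(
    slice_number: int,
    probe_health: Callable[[int], bool],
    is_running: Callable[[str], bool] = docker_is_running,
    start: Callable[[str], None] = docker_start,
) -> SliceCheck:
    name = container_name(slice_number)
    restarted = False
    if not is_running(name):
        start(name)
        restarted = True
    # Assumption: a freshly restarted container is not probed until the next tick.
    healthy = False if restarted else probe_health(slice_number)
    return SliceCheck(slice_number, restarted, healthy)
```

Passing fakes for `is_running` and `start` keeps the decision logic testable; `probe_health` would wrap the calls to /api/health and /api/status/state.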

Data Collection Methods

Metric	Source	Frequency
Server RAM	`free -b`	60s
Server swap	`free -b`	60s
Server CPU	`top -bn1`	60s
Server disk	`df -B1 /`	60s
Container RAM	`docker stats --no-stream` (bulk)	60s
Container CPU	`docker stats --no-stream` (bulk)	60s
Slice disk	`du -sb /data/slices/{N}/`	60s
WhatsApp state	`curl /api/status/state`	60s
Slice health	`curl /api/health`	60s
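
As an illustration of the bulk container collection, the sketch below parses the output of a single `docker stats --no-stream --format '{{json .}}'` call into per-container RAM and CPU figures. The `--format` flag is an assumption (the table only names `docker stats --no-stream`), and `parse_docker_stats`, `parse_size`, and `ContainerStats` are hypothetical names. It assumes memory values use Docker's binary units (B, KiB, MiB, GiB, TiB).

```python
import json
import re
from dataclasses import dataclass

_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}
_SIZE = re.compile(r"^([0-9.]+)\s*([KMGT]iB|B)$")


@dataclass
class ContainerStats:
    ram_bytes: int
    cpu_percent: float


def parse_size(text: str) -> int:
    """Convert a string such as '12.5MiB' to a byte count."""
    match = _SIZE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def parse_docker_stats(output: str) -> dict:
    """Map container name -> ContainerStats from JSON-lines docker stats output."""
    stats = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        row = json.loads(line)
        # MemUsage looks like "12.5MiB / 1.944GiB"; keep the used part only.
        used = row["MemUsage"].split("/")[0]
        cpu = float(row["CPUPerc"].strip().rstrip("%"))
        stats[row["Name"]] = ContainerStats(parse_size(used), cpu)
    return stats
```

One bulk call per tick avoids spawning a `docker stats` process per slice; the parsed map can then be looked up by container name during the per-slice checks.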

Auto-Recovery

Condition	Action
Container not running	`docker start wank-slice-{N}`
User has no slice + one available	Automatic assignment
Available slices < 1	Auto-provision new slice
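
The three rules in the table can be expressed as a small planning function. This is a sketch under assumptions: `Slice`, `plan_recovery`, and the action tuples are hypothetical, and the order in which the rules are evaluated here (restart, then assign, then provision) follows the table rather than any confirmed implementation detail.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    number: int
    status: str  # "assigned" or "available"
    running: bool


def plan_recovery(slices, waiting_users):
    """Return the list of actions one monitor tick would take."""
    actions = []

    # Rule 1: container not running -> docker start wank-slice-{N}
    for s in slices:
        if not s.running:
            actions.append(("restart", f"wank-slice-{s.number}"))

    # Rule 2: user has no slice and one is available -> assign it
    available = [s.number for s in slices if s.status == "available"]
    for user in waiting_users:
        if not available:
            break
        actions.append(("assign", user, available.pop(0)))

    # Rule 3: fewer than one slice left available -> provision a new one
    if len(available) < 1:
        actions.append(("provision",))

    return actions
```

For example, with slice 1 assigned but stopped, slice 2 available, and one waiting user, the plan is to restart `wank-slice-1`, assign slice 2 to the user, and provision a replacement because the pool is now empty.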

Telemetry Retention

Pruning runs once per hour.
Rows in both gateway.telemetry and gateway.server_telemetry are kept for 7 days; older rows are deleted.
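
A minimal sketch of how the hourly prune might compute its cutoff and build its delete statements. The timestamp column name `recorded_at`, the `%s` placeholder style, and the function name `prune_statements` are assumptions; only the two table names and the 7-day window come from this page.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)
TABLES = ("gateway.telemetry", "gateway.server_telemetry")


def prune_statements(now: datetime):
    """Return (sql, params) pairs that delete rows older than the retention window."""
    cutoff = now - RETENTION
    # `recorded_at` is a hypothetical column name; substitute the real one.
    return [
        (f"DELETE FROM {table} WHERE recorded_at < %s", (cutoff,))
        for table in TABLES
    ]
```

Passing the cutoff as a bound parameter rather than formatting it into the SQL keeps the statement reusable and avoids quoting issues.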

Startup Timing

The monitor starts 30 seconds after gateway boot, to let services settle
It then runs every 60 seconds
The telemetry prune runs every hour
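
The timing above could be wired up along these lines. This is an asyncio sketch only: the gateway's real runtime and scheduler are not stated on this page, `run_every` and `start_gateway_timers` are hypothetical helpers, and whether the first prune fires at boot or after the first hour is an assumption (here it fires at boot).

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def run_every(
    interval: float,
    task: Callable[[], Awaitable[None]],
    initial_delay: float = 0.0,
    max_runs: Optional[int] = None,
) -> int:
    """Run `task` repeatedly; `max_runs` exists only to make the loop testable."""
    await asyncio.sleep(initial_delay)
    runs = 0
    while max_runs is None or runs < max_runs:
        try:
            await task()
        except Exception as exc:
            # One failed tick should not stop the loop.
            print(f"task failed: {exc!r}")
        runs += 1
        await asyncio.sleep(interval)
    return runs


async def start_gateway_timers(
    monitor_tick: Callable[[], Awaitable[None]],
    prune: Callable[[], Awaitable[None]],
) -> None:
    await asyncio.gather(
        run_every(60, monitor_tick, initial_delay=30),  # 30s settle, then every 60s
        run_every(3600, prune),                         # hourly prune
    )
```

Note that this loop sleeps for the full interval after each run, so a slow tick pushes the next one later; a fixed-rate schedule would need to subtract the tick's duration from the sleep.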

Status

Active and collecting data continuously since deployment. Telemetry feeds the admin dashboard sparklines and capacity planning panel.