WhatsApp CRM SaaS — Production Deployment Plan
Master plan for deploying the WhatsApp CRM as a multi-tenant SaaS product on OVH VPS.
Table of Contents
- Infrastructure Overview
- Server Specification
- Architecture
- Pricing & Business Model
- Build Order
- Phase 1: Auth Gateway & Routing
- Phase 2: Shared Frontend
- Phase 3: Orchestrator Service
- Phase 4: Health Monitoring & Alerts
- Phase 5: Stripe Billing
- Phase 6: Marketing Site & Signup Flow
- Phase 7: Admin Dashboard
- Resource Budgets & Limits
- Security & Isolation
- Backup & Data Policy
- Deployment & Updates
- Domain & SSL
- Open Questions
Infrastructure Overview
Previous State (Development — now decommissioned)
- 4 slices running on a DigitalOcean droplet (178.128.183.166) — removed 2026-02-24
- Each slice = Docker container with its own backend, frontend, Chromium, and PostgreSQL
- Manual port assignment, no auth gateway, no billing
- Local development on localhost:3101
Current State (Production SaaS on OVH)
- Multi-tenant platform on dedicated OVH VPS
- 60-80 slices capacity (optimized), scalable to additional servers
- Automated provisioning, billing, monitoring
- Users log in to a web portal, get routed to their slice
- Self-service signup, cancellation, and data management
- $9/month per slice
Server Specification
| Item | Details |
|---|---|
| Provider | OVH Cloud |
| Plan | VPS-5 |
| Cost | $40.40/month |
| CPU | 16 vCores |
| RAM | 64 GB |
| Storage | 350 GB SSD NVMe |
| Bandwidth | 2.5 Gbps unlimited |
| OS | Ubuntu 25.04 |
| Location | Canada East (Beauharnois) |
| Backup | Automated daily (OVH option) |
| IP | TBD (awaiting provisioning) |
Architecture
High-Level Diagram
Internet
│
┌─────┴─────┐
│ nginx │
│ (SSL + │
│ routing) │
└─────┬─────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Gateway │ │ Marketing │ │ Static │
│ Service │ │ Site │ │ Frontend │
│ │ │ │ │ (shared) │
│ - Auth │ │ - Landing │ │ │
│ - Session │ │ - Pricing │ │ One build │
│ - Routing │ │ - Signup │ │ serves all │
│ - Billing │ │ │ │ users │
│ - Monitor │ │ │ │ │
└──────┬──────┘ └─────────────┘ └──────┬──────┘
│ │
│ API calls routed │
│ by session cookie │
│◄─────────────────────────────────┘
│
┌──────┴──────────────────────────────────┐
│ Slice Pool │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Slice 1 │ │ Slice 2 │ │ Slice N │ │
│ │ Backend │ │ Backend │ │ Backend │ │
│ │ Chrome │ │ Chrome │ │ Chrome │ │
│ │ │ │ │ │ │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ┌────┴───────────┴───────────┴────┐ │
│ │ Shared PostgreSQL Instance │ │
│ │ (schema per user) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Pre-provisioned Empty Slice │ │
│ │ (ready for next signup) │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────┘
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Shared PostgreSQL | One Postgres instance with schema-per-user instead of DB-per-container. Saves ~2-3GB RAM overhead. |
| Single shared frontend | The React frontend is identical for all users. One build served by nginx. API routing determined by auth session. Eliminates N frontend containers. |
| Pre-provisioned “next” slice | One empty container always ready. On signup: assign it, spin up the next empty one in background. No user waiting. |
| Gateway service | Central Node.js service handling auth, session management, Stripe webhooks, health checks, and API proxying to the correct slice backend. |
| Chromium per slice | Each slice must maintain its own WhatsApp Web session. Cannot be shared. This is the main memory consumer. |
Per-Slice Composition (Optimized)
| Component | Memory Budget |
|---|---|
| Node.js backend | ~100-150 MB |
| Chromium (whatsapp-web.js) | ~250-350 MB (with memory flags) |
| Media storage | Varies (1-2 GB disk quota) |
| Total per slice | ~400-500 MB RAM |
Shared Services (Run Once)
| Service | Memory Budget |
|---|---|
| PostgreSQL (shared) | ~500 MB - 2 GB |
| nginx | ~50 MB |
| Gateway service | ~100 MB |
| Marketing site | ~50 MB |
| Total shared | ~1-2 GB |
Capacity Estimate
- 64 GB total RAM
- ~2 GB for OS + shared services
- ~62 GB for slices
- At 450 MB per slice: ~130 slices theoretical max
- With headroom for spikes: 60-80 slices comfortable, 100 aggressive
- Max capacity enforced by the gateway service
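The capacity arithmetic above can be sketched as a small helper the gateway might use to enforce its cap (the function name and headroom factor are illustrative, not from the codebase; exact results also depend on whether GB is taken as 1000 or 1024 MB):

```typescript
// Capacity estimate: (total RAM - OS/shared overhead) / per-slice budget,
// scaled by a headroom factor to leave room for spikes.
function sliceCapacity(
  totalGb: number,     // physical RAM, e.g. 64
  sharedGb: number,    // OS + shared services, e.g. 2
  perSliceMb: number,  // per-slice budget, e.g. 450
  headroom = 1.0,      // 1.0 = theoretical max; ~0.5-0.6 = comfortable
): number {
  const availableMb = (totalGb - sharedGb) * 1024;
  return Math.floor((availableMb * headroom) / perSliceMb);
}
```

With the plan's numbers, `sliceCapacity(64, 2, 450)` lands near the ~130 theoretical figure, and a headroom factor around 0.55 lands inside the 60-80 comfortable band.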
Pricing & Business Model
| Item | Details |
|---|---|
| Monthly price | $9/slice |
| Launch discount | 50% ($4.50/slice) |
| Server cost | $40.40/month |
| Break-even | 5 slices at full price, 9 at discount |
| Target capacity | 60-80 slices per server |
| Revenue at 60 slices | $540/month (full) or $270/month (discount) |
| Trial period | TBD |
Build Order
| Phase | What | Priority | Depends On |
|---|---|---|---|
| 1 | Auth Gateway & Routing | FIRST | Server provisioned |
| 2 | Shared Frontend | HIGH | Phase 1 |
| 3 | Orchestrator (provision/destroy) | HIGH | Phase 1 |
| 4 | Health Monitoring & Alerts | HIGH | Phase 3 |
| 5 | Stripe Billing | MEDIUM | Phase 3 |
| 6 | Marketing Site & Signup | MEDIUM | Phase 5 |
| 7 | Admin Dashboard | MEDIUM | Phase 4 |
Phase 1: Auth Gateway & Routing
The foundation. A central service that authenticates users and routes their API calls to the correct slice backend.
Components
1.1 Gateway Service (Node.js + Express)
A new lightweight service that runs once on the server:
/gateway/
  src/
    index.ts
    auth.ts        # Login, register, session management
    proxy.ts       # Route API calls to correct slice
    middleware.ts  # Session validation, rate limiting
  package.json
  Dockerfile
1.2 Authentication Flow
User visits app.domain.com
│
▼
┌─────────┐ No session
│ nginx │ ──────────────► /login page
└────┬────┘
│ Has valid session cookie
▼
┌─────────┐
│ Gateway │ ── Looks up user → finds their slice port
└────┬────┘
│
▼
┌─────────┐
│ Slice N │ ── Backend on port 50XX
│ Backend │
└─────────┘
1.3 Session Management
| Item | Details |
|---|---|
| Method | Signed HTTP-only cookie |
| Duration | 30 days (configurable) |
| Storage | Gateway’s PostgreSQL schema |
| Contains | User ID, slice ID, expiry |
| Refresh | Rolling — extends on each request |
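The rolling refresh can be reduced to two pure helpers (names are illustrative; the gateway would persist the new expiry to `gateway.sessions` on each request):

```typescript
// Rolling 30-day session: every authenticated request pushes expiry forward.
const SESSION_TTL_MS = 30 * 24 * 60 * 60 * 1000;

function isExpired(expiresAt: Date, now: Date): boolean {
  return expiresAt.getTime() <= now.getTime();
}

// New expiry after a validated request (the "rolling" part).
function rolledExpiry(now: Date): Date {
  return new Date(now.getTime() + SESSION_TTL_MS);
}
```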
1.4 Gateway Database Schema
The gateway has its own schema in the shared PostgreSQL:
CREATE SCHEMA gateway;

-- users and slices reference each other, so the users → slices FK
-- is added after both tables exist.
CREATE TABLE gateway.users (
  id SERIAL PRIMARY KEY,
  email TEXT UNIQUE NOT NULL,
  password_hash TEXT NOT NULL,
  slice_id INTEGER, -- FK to gateway.slices added below
  stripe_customer TEXT, -- Stripe customer ID
  subscription_status TEXT DEFAULT 'trial', -- trial, active, past_due, cancelled
  created_at TIMESTAMPTZ DEFAULT NOW(),
  last_login_at TIMESTAMPTZ
);

CREATE TABLE gateway.slices (
  id SERIAL PRIMARY KEY,
  port INTEGER UNIQUE NOT NULL, -- Backend port (5001, 5003, etc.)
  status TEXT DEFAULT 'available', -- available, assigned, suspended, destroying
  user_id INTEGER REFERENCES gateway.users(id),
  wa_connected BOOLEAN DEFAULT FALSE,
  wa_phone TEXT, -- Connected WhatsApp number
  last_health_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  storage_bytes BIGINT DEFAULT 0
);

ALTER TABLE gateway.users
  ADD CONSTRAINT users_slice_id_fkey FOREIGN KEY (slice_id) REFERENCES gateway.slices(id);

CREATE TABLE gateway.sessions (
  id TEXT PRIMARY KEY, -- Random token
  user_id INTEGER REFERENCES gateway.users(id),
  expires_at TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
1.5 nginx Configuration
# Marketing site
server {
  listen 443 ssl;
  server_name domain.com www.domain.com;
  # ... SSL config ...
  root /var/www/marketing;
  index index.html;
}

# App (gateway + slices)
server {
  listen 443 ssl;
  server_name app.domain.com;
  # ... SSL config ...

  # Static frontend (shared)
  location / {
    root /var/www/app;
    try_files $uri $uri/ /index.html;
  }

  # API calls — proxied through gateway
  location /api/ {
    proxy_pass http://127.0.0.1:3000; # Gateway service
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
  }

  # Auth endpoints (login, register, logout)
  location /auth/ {
    proxy_pass http://127.0.0.1:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
  }
}
1.6 Gateway API Proxy Logic
The gateway intercepts all /api/* requests:
- Read session cookie → look up user → find their slice port
- Rewrite the request to http://127.0.0.1:{slice_port}/api/*
- Stream the response back
- For real-time events, the gateway connects to the slice’s Socket.io as a client and re-emits via SSE (GET /api/events)
// Simplified proxy logic (Express + cookie-parser + http-proxy;
// validateSession and getSliceForUser are gateway DB helpers)
import express from 'express';
import cookieParser from 'cookie-parser';
import httpProxy from 'http-proxy';

const app = express();
app.use(cookieParser());
const proxy = httpProxy.createProxyServer();

app.use('/api', async (req, res) => {
  const session = await validateSession(req.cookies.session);
  if (!session) return res.status(401).json({ error: 'Not authenticated' });
  const slice = await getSliceForUser(session.userId);
  if (!slice) return res.status(503).json({ error: 'No slice assigned' });
  // Proxy to the user's slice backend on localhost
  proxy.web(req, res, { target: `http://127.0.0.1:${slice.port}` });
});
1.7 Gateway Auth Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /auth/register | POST | Create account (email + password) |
| /auth/login | POST | Authenticate, set session cookie |
| /auth/logout | POST | Clear session |
| /auth/me | GET | Return current user + slice status |
| /auth/forgot-password | POST | Send reset email |
| /auth/reset-password | POST | Set new password with token |
Phase 2: Shared Frontend
Current Problem
Each slice currently runs its own frontend container. With 80 slices that’s 80 identical React apps consuming memory.
Solution
- Build the frontend once
- Serve it as static files from nginx
- The frontend talks to /api/*, which nginx routes through the gateway
- The gateway proxies to the correct slice backend based on the session
- The frontend code doesn’t need to know which slice it’s talking to
Changes Required
- Remove frontend container from slice Docker composition — slices only run backend + Chromium
- Frontend API base URL — already uses relative /api paths (confirmed in vite.config.ts), so no changes needed
- Add login page — new routes in the React app for /login, /register
- Add auth context — React context that checks /auth/me on load, redirects to login if not authenticated
- Build pipeline — single npm run build, output deployed to /var/www/app/
Frontend Auth Flow
App loads
│
▼
Check /auth/me
│
├─── 200 + user data ──► Load CRM (existing app)
│
└─── 401 ──► Show login page
│
▼
POST /auth/login
│
├─── 200 ──► Set cookie, redirect to /
│
└─── 401 ──► Show error
Phase 3: Orchestrator Service
Manages the lifecycle of slices. Can be part of the gateway service or a separate process.
Responsibilities
- Provision new slices — create Docker container, assign port, create DB schema
- Destroy slices — stop container, remove data, free port
- Maintain “next up” slice — always one empty container ready for instant assignment
- Assign slices to users — on signup/payment confirmation
- Suspend slices — on payment failure (stop container, retain data for grace period)
- Track resource usage — disk, memory per slice
Slice Lifecycle
[Pre-provisioned] ──signup──► [Assigned] ──cancel──► [Destroying]
│ │ │
│ │ payment fail │
│ ▼ ▼
│ [Suspended] [Destroyed]
│ │ (data deleted)
│ │ payment restored
│ ▼
│ [Assigned]
│
└── Always one available. When assigned, provision next.
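The lifecycle diagram can be enforced with a transition table, so the orchestrator rejects illegal moves (status names match `gateway.slices.status`; the function name is illustrative):

```typescript
// Legal status transitions from the lifecycle diagram. Anything not listed
// (e.g. destroying → assigned) is rejected.
type SliceStatus = 'available' | 'assigned' | 'suspended' | 'destroying';

const TRANSITIONS: Record<SliceStatus, SliceStatus[]> = {
  available: ['assigned'],                 // signup
  assigned: ['suspended', 'destroying'],   // payment fail / cancel
  suspended: ['assigned', 'destroying'],   // payment restored / grace expired
  destroying: [],                          // terminal: data deleted
};

function canTransition(from: SliceStatus, to: SliceStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```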
Provisioning a New Slice
# Orchestrator runs these steps:
1. Pick next available port (5001, 5003, 5005, ...)
2. Create DB schema: CREATE SCHEMA slice_{id};
3. Run migrations on the new schema
4. Start Docker container (bound to 127.0.0.1 only, per the network isolation policy):
   docker run -d \
     --name slice-{id} \
     --memory=512m \
     --cpus=0.5 \
     -p 127.0.0.1:{port}:4001 \
     -e DB_SCHEMA=slice_{id} \
     -e SLICE_ID={id} \
     -v /data/slices/{id}/media:/app/media \
     whatsapp-slice:latest
5. Update gateway.slices table
6. Health check until container responds
Port Allocation
- Backend ports: 5001, 5003, 5005, … (odd numbers, room for 200+)
- WebSocket on same port as backend (upgrade)
- No frontend ports needed (shared frontend)
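Port allocation is simple enough to sketch directly (function name and the 5401 upper bound are illustrative; the bound gives room for 200 odd ports):

```typescript
// Next free odd backend port in the 5001+ range.
function nextSlicePort(used: Set<number>, base = 5001, max = 5401): number | null {
  for (let port = base; port <= max; port += 2) {
    if (!used.has(port)) return port;
  }
  return null; // pool exhausted: capacity cap reached
}
```

In practice the orchestrator would build `used` from `gateway.slices.port` so freed ports are reused.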
Phase 4: Health Monitoring & Alerts
What To Monitor
| Check | Frequency | Alert If |
|---|---|---|
| Container running | 30 seconds | Container stopped/crashed |
| WhatsApp connected | 60 seconds | state !== 'ready' for >2 min |
| Backend responding | 60 seconds | HTTP health check fails |
| Memory usage | 60 seconds | Container >90% of limit |
| Disk usage per slice | 5 minutes | >90% of quota |
| Server disk | 5 minutes | >85% of 350GB |
| Server RAM | 60 seconds | >90% of 64GB |
| Server CPU | 60 seconds | >90% sustained for 5 min |
Health Check Endpoint (Per Slice)
Each slice backend already has or will have:
GET /api/status → { state: 'ready', phoneNumber: '+1234...', uptime: 3600 }
The gateway polls this for every assigned slice.
Alert Channels
- Email — via Resend (already used for WIT project)
- WhatsApp — send alert to the user’s connected WhatsApp number (use the slice’s own connection)
- Admin email — alert the admin (you) for server-level issues
Alert Flow for WhatsApp Disconnection
Health check detects state !== 'ready'
│
▼
Wait 2 minutes (transient disconnections resolve themselves)
│
▼
Still disconnected?
│
├── YES ──► Send email: "Your WhatsApp is disconnected.
│ Click here to reconnect: app.domain.com/reconnect"
│ Update gateway.slices.wa_connected = false
│
└── NO ──► Clear alert, no action
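The grace-window logic above can be expressed as a pure function the monitor calls on every poll (names and state shape are illustrative):

```typescript
// Debounced disconnection alert: fire once, only after the 2-minute grace
// window, and reset as soon as the slice reports 'ready' again.
const GRACE_MS = 2 * 60 * 1000;

interface AlertState { downSince: number | null; alerted: boolean; }

function evaluateHealth(
  prev: AlertState,
  waState: string,
  nowMs: number,
): { next: AlertState; sendAlert: boolean } {
  if (waState === 'ready') {
    // Recovered: clear any pending or fired alert.
    return { next: { downSince: null, alerted: false }, sendAlert: false };
  }
  const downSince = prev.downSince ?? nowMs; // first poll that saw it down
  const sendAlert = !prev.alerted && nowMs - downSince >= GRACE_MS;
  return { next: { downSince, alerted: prev.alerted || sendAlert }, sendAlert };
}
```

Keeping this pure (state in, state out) makes the debounce behaviour trivially testable without a real clock.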
Reconnection Page
When a user’s WhatsApp disconnects:
- They receive an email with a link
- Link goes to app.domain.com/reconnect
- Page shows the live QR code from their slice (via WebSocket)
- User scans with phone
- Connection restores, alert clears automatically
This uses the existing QR code WebSocket flow that whatsapp-web.js already provides.
Prototype Telemetry — Swap & Capacity Tracking
CRITICAL: This is a prototype deployment. Understanding real-world memory behaviour under aggressive swapping is essential before scaling. Every capacity decision depends on data collected during this phase.
Why This Matters
The capacity estimates (150-200 slices on 64GB + 32GB swap) are based on modelling, not production data. Real-world WhatsApp Web sessions may behave differently over days/weeks — memory leaks, Chromium bloat, message volume spikes. We must instrument aggressively from day one.
Validated: Swap Does Not Break Connections
Tested on the DigitalOcean droplet (2026-02-23): forced idle slices down to 10MB resident using cgroups v2 memory.reclaim. Slice 1 (live WhatsApp connection, phone +50499311987) remained state: "ready" throughout. Pages back into RAM in milliseconds when activity resumes. NVMe swap latency is negligible for WebSocket heartbeats.
What To Track (Logged Every 60 Seconds)
| Metric | Source | Purpose |
|---|---|---|
| Per-slice RSS (resident memory) | cgroups memory.current | Track actual RAM usage per slice over time |
| Per-slice swap usage | cgroups memory.swap.current | How much each slice has paged out |
| Total server RAM used/free | /proc/meminfo | Overall headroom |
| Total swap used/free | /proc/meminfo | How close to swap ceiling |
| Per-slice WhatsApp state | /api/status (state field) | Connection health correlated with swap pressure |
| Per-slice last activity time | Network bytes delta | Idle duration tracking |
| Swap-in rate | vmstat or /proc/vmstat | Detect thrashing (high swap-in = too many slices) |
| OOM kill events | dmesg / kernel log | Hard failures |
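A minimal sketch of the /proc/meminfo half of this collector (assumes the standard `Key: value kB` line format; field coverage beyond the memory/swap totals is up to the daemon):

```typescript
// Parse /proc/meminfo (values reported in kB) into a bytes map,
// feeding the server_telemetry row.
function parseMeminfo(text: string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const line of text.split('\n')) {
    const m = line.match(/^(\w+):\s+(\d+)\s*kB/);
    if (m) out[m[1]] = parseInt(m[2], 10) * 1024;
  }
  return out;
}
```

On the server this would read `/proc/meminfo` every 60 seconds and insert `MemTotal`, `MemAvailable`, `SwapTotal`, and `SwapFree` into `gateway.server_telemetry`.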
Stored As Time-Series Data
CREATE TABLE gateway.telemetry (
  id BIGSERIAL PRIMARY KEY,
  slice_id INTEGER REFERENCES gateway.slices(id),
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  rss_bytes BIGINT,
  swap_bytes BIGINT,
  wa_state TEXT,
  idle_seconds INTEGER,
  cpu_percent REAL
);
CREATE TABLE gateway.server_telemetry (
  id BIGSERIAL PRIMARY KEY,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  ram_used_bytes BIGINT,
  ram_total_bytes BIGINT,
  swap_used_bytes BIGINT,
  swap_total_bytes BIGINT,
  swap_in_rate BIGINT, -- pages/sec swapped in
  cpu_percent REAL,
  disk_used_bytes BIGINT,
  active_slices INTEGER,
  total_slices INTEGER
);
Key Questions This Data Must Answer
- Stable idle footprint: Do swapped slices stay at ~10MB resident, or does Chromium gradually pull pages back in?
- Swap ceiling: At what total swap usage do we see increased swap-in rates (thrashing)?
- Connection stability: Over days/weeks of aggressive swapping, do WhatsApp disconnections increase?
- Memory leaks: Do slices grow over time even when idle? How fast?
- Peak behaviour: When a user actively browses the CRM, how quickly does the slice page back in? Is there noticeable latency?
- Safe slice count: At what number of slices does the server start showing stress signals?
Admin Dashboard Must Surface
- Real-time graph: total RAM vs swap usage over last 24h/7d/30d
- Per-slice memory timeline (RSS + swap)
- Available capacity: “N more slices can be provisioned” based on current usage patterns
- Swap-in rate trend — the single best indicator of overcommitment
- Alert if swap-in rate exceeds threshold (e.g., >1000 pages/sec sustained)
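The threshold alert can be stated precisely as a sustained-rate check over the 60-second telemetry samples (function name and sample count are illustrative):

```typescript
// Thrashing detector: swap-in rate above threshold for N consecutive samples.
// Samples arrive every 60s, so 5 samples ≈ 5 minutes sustained.
function isThrashing(
  swapInRates: number[],   // most recent sample last
  threshold = 1000,        // pages/sec
  sustainedSamples = 5,
): boolean {
  if (swapInRates.length < sustainedSamples) return false;
  return swapInRates.slice(-sustainedSamples).every((r) => r > threshold);
}
```

Requiring consecutive samples avoids alerting on a single page-in burst when one user's slice wakes up.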
Aggressive Swap Script (Already Tested)
The swap-test.sh script (formerly at /root/swap-test.sh on the now-decommissioned droplet) used memory.reclaim via cgroups v2 to force idle slice memory into swap after 10 seconds of inactivity. It will be refined into a production-grade daemon on the OVH server:
- Configurable idle threshold
- Graduated reclaim (swap more aggressively the longer a slice is idle)
- Logging to the telemetry database instead of flat files
- Excluded from reclaim during active user sessions (detected via WebSocket connections or HTTP activity)
Phase 5: Stripe Billing
Integration Points
| Event | Action |
|---|---|
| User signs up | Create Stripe customer, start subscription |
| Payment succeeds | Assign slice (from pre-provisioned pool) |
| Payment fails | Send dunning email, 7-day grace period |
| Grace period expires | Suspend slice (stop container, retain data) |
| Payment restored | Resume slice |
| User cancels | Offer data export, then destroy slice at end of billing period |
| Stripe webhook | Gateway receives and processes all events |
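The event table above maps naturally onto a dispatch table in the webhook handler. The event type strings are standard Stripe webhook types; which ones the gateway subscribes to, and the internal action names, are assumptions for illustration:

```typescript
// Map incoming Stripe webhook event types to orchestrator actions.
const EVENT_ACTIONS: Record<string, string> = {
  'checkout.session.completed': 'assign_slice',        // payment succeeded at signup
  'invoice.paid': 'resume_slice',                      // payment restored
  'invoice.payment_failed': 'start_dunning',           // begin 7-day grace period
  'customer.subscription.deleted': 'schedule_destroy', // destroy at period end
};

function actionFor(eventType: string): string {
  // Unknown events are acknowledged with 200 but not processed,
  // so Stripe doesn't retry them.
  return EVENT_ACTIONS[eventType] ?? 'ignore';
}
```

In the real handler the webhook signature must be verified (Stripe's signing secret) before any event is dispatched.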
Stripe Configuration
| Item | Details |
|---|---|
| Product | WhatsApp CRM Slice |
| Price | $9.00/month |
| Launch price | $4.50/month (50% coupon) |
| Trial | TBD |
| Payment methods | Card (via Stripe Checkout) |
Gateway Endpoints for Billing
| Endpoint | Purpose |
|---|---|
| POST /billing/create-checkout | Generate Stripe Checkout session |
| POST /billing/webhook | Receive Stripe events |
| GET /billing/portal | Redirect to Stripe Customer Portal (manage subscription) |
| GET /billing/status | Return current subscription status |
Phase 6: Marketing Site & Signup Flow
Structure
Separate static site at domain.com (the gateway/app lives at app.domain.com):
/var/www/marketing/
index.html # Landing page
pricing.html # Pricing details
features.html # Feature showcase
signup.html # Redirects to Stripe Checkout or app.domain.com/register
assets/ # CSS, images, JS
Signup Flow
User visits domain.com
│
▼
Clicks "Get Started" → app.domain.com/register
│
▼
Creates account (email + password)
│
▼
Redirected to Stripe Checkout ($9/month or $4.50 launch)
│
▼
Payment succeeds → Stripe webhook fires
│
▼
Gateway assigns pre-provisioned slice
│
▼
User redirected to app.domain.com → QR code page
│
▼
User scans QR with phone → WhatsApp connected → CRM ready
Phase 7: Admin Dashboard
Accessible at app.domain.com/admin (admin-only route)
Panels
- Server Overview — CPU, RAM, disk, uptime
- Slice Grid — all slices with status indicators (green/yellow/red)
- Alert Log — history of disconnections, restarts, failures
- User Management — list users, suspend/terminate, view subscription status
- Revenue — MRR, active subscriptions, churn
- Capacity — current usage vs max, projected fill rate
Resource Budgets & Limits
Per-Slice Docker Limits
deploy:
  resources:
    limits:
      memory: 512M
      cpus: '0.5'
    reservations:
      memory: 256M
      cpus: '0.25'
Per-Slice Disk Quota
| Item | Limit |
|---|---|
| Media storage | 2 GB included |
| Additional storage | Paid add-on (TBD pricing) |
| Enforcement | Orchestrator checks usage, blocks media downloads at quota |
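The enforcement rule reduces to one predicate the orchestrator checks before each media download, using the `storage_bytes` counter from `gateway.slices` (function name is illustrative):

```typescript
// Quota gate: block a media download if it would push the slice past its quota.
const MEDIA_QUOTA_BYTES = 2 * 1024 ** 3; // 2 GB included

function mediaDownloadAllowed(
  usedBytes: number,       // gateway.slices.storage_bytes
  incomingBytes: number,   // size of the media about to be fetched
  quota = MEDIA_QUOTA_BYTES,
): boolean {
  return usedBytes + incomingBytes <= quota;
}
```

Checking before the download (rather than after) keeps a slice from ever exceeding its quota on disk.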
Chromium Memory Optimization Flags
--disable-gpu
--disable-dev-shm-usage
--no-sandbox
--disable-setuid-sandbox
--disable-extensions
--disable-background-networking
--js-flags="--max-old-space-size=128"
--single-process
Security & Isolation
Container Isolation
- Each slice runs in its own Docker container with resource limits
- Containers have no access to other containers’ volumes
- Containers cannot access the host network directly
- No --privileged flag
- Read-only root filesystem where possible
Network Isolation
- Slice backends only listen on localhost (127.0.0.1)
- Only nginx is exposed to the internet (ports 80/443)
- Gateway communicates with slices via localhost ports
- No inter-slice communication possible
Data Isolation
- Schema-per-user in PostgreSQL (not shared tables)
- Media stored in /data/slices/{id}/ with Linux user/group permissions
- Gateway validates every request belongs to the authenticated user’s slice
Auth Security
- Passwords hashed with bcrypt (cost 12)
- Session tokens: 256-bit random, HTTP-only cookies, Secure flag, SameSite=Strict
- Rate limiting on login (5 attempts per minute per IP)
- HTTPS everywhere (Let’s Encrypt)
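The 256-bit token generation is a one-liner with Node's crypto module (function name is illustrative; `base64url` keeps the token cookie-safe):

```typescript
import { randomBytes } from 'node:crypto';

// 256-bit cryptographically random session token, URL/cookie-safe.
function newSessionToken(): string {
  return randomBytes(32).toString('base64url');
}
```

The token is the primary key of `gateway.sessions`; only the token travels in the cookie, everything else is looked up server-side.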
Backup & Data Policy
Server-Level
- OVH automated daily backup (included in plan)
- Covers full disk including all slice data
Future Enhancement
- Per-slice backup to S3/R2 (when customer base justifies cost)
- User-initiated export (download all messages, photos, contacts as ZIP)
Data Destruction
- User can destroy their own data from within the CRM (existing feature)
- On cancellation: data retained for 7 days, then slice destroyed completely
- Nothing retained after destruction
Deployment & Updates
Rolling Update Strategy
For each active slice:
1. Pull new image
2. Stop container (WhatsApp session persists in local auth store)
3. Start new container with same volume mounts
4. Health check — verify backend responds
5. Verify WhatsApp reconnects (usually automatic)
6. Move to next slice
- Update 5-10 slices at a time (parallel batches)
- Monitor for failures between batches
- Rollback: keep previous image tagged, revert if issues
Code Deployment Pipeline
Local development
│
▼
Test on localhost (existing dev setup)
│
▼
Build Docker image
│
▼
Push to server (rsync or registry)
│
▼
Rolling update via orchestrator command
Domain & SSL
Domain Structure (TBD — needs domain choice)
| Subdomain | Purpose |
|---|---|
| domain.com | Marketing site |
| app.domain.com | CRM application (gateway + frontend) |
SSL
- Let’s Encrypt with certbot
- Auto-renewal via cron
- Wildcard cert if needed for subdomains
Open Questions
- Domain name — What domain for the SaaS product?
- Trial period — Free trial before payment? If so, how long?
- Additional storage pricing — How much per extra GB?
- Email provider — Resend for transactional emails? (already in use for WIT)
- Max slices cap — Hard limit per server? Suggest 80 for safety.
- Onboarding flow — Any guided tour for new users?
- Support channel — How do users get help? Email? In-app chat?
- Terms of service / Privacy policy — Need legal docs before launch.
- Data residency — Server is in Canada. Any user concerns?
- Branding — Product name? Logo?