WhatsApp CRM SaaS — Production Deployment Plan
Master plan for deploying the WhatsApp CRM as a multi-tenant SaaS product on OVH VPS.
Table of Contents
- Infrastructure Overview
- Server Specification
- Architecture
- Pricing & Business Model
- Build Order
- Phase 1: Auth Gateway & Routing
- Phase 2: Shared Frontend
- Phase 3: Orchestrator Service
- Phase 4: Health Monitoring & Alerts
- Phase 5: Stripe Billing
- Phase 6: Marketing Site & Signup Flow
- Phase 7: Admin Dashboard
- Resource Budgets & Limits
- Security & Isolation
- Backup & Data Policy
- Deployment & Updates
- Domain & SSL
- Open Questions
Infrastructure Overview
Previous State (Development — now decommissioned)
- 4 slices running on a DigitalOcean droplet (178.128.183.166) — removed 2026-02-24
- Each slice = Docker container with its own backend, frontend, Chromium, and PostgreSQL
- Manual port assignment, no auth gateway, no billing
- Local development on localhost:3101
Current State (Production SaaS on OVH)
- Multi-tenant platform on dedicated OVH VPS
- 60-80 slices capacity (optimized), scalable to additional servers
- Automated provisioning, billing, monitoring
- Users log in to a web portal, get routed to their slice
- Self-service signup, cancellation, and data management
- $9/month per slice
Server Specification
| Item | Details |
|---|---|
| Provider | OVH Cloud |
| Plan | VPS-5 |
| Cost | $40.40/month |
| CPU | 16 vCores |
| RAM | 64 GB |
| Storage | 350 GB SSD NVMe |
| Bandwidth | 2.5 Gbps unlimited |
| OS | Ubuntu 25.04 |
| Location | Canada East (Beauharnois) |
| Backup | Automated daily (OVH option) |
| IP | TBD (awaiting provisioning) |
Architecture
High-Level Diagram
Internet
│
┌─────┴─────┐
│ nginx │
│ (SSL + │
│ routing) │
└─────┬─────┘
│
┌────────────────┼────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Gateway │ │ Marketing │ │ Static │
│ Service │ │ Site │ │ Frontend │
│ │ │ │ │ (shared) │
│ - Auth │ │ - Landing │ │ │
│ - Session │ │ - Pricing │ │ One build │
│ - Routing │ │ - Signup │ │ serves all │
│ - Billing │ │ │ │ users │
│ - Monitor │ │ │ │ │
└──────┬──────┘ └─────────────┘ └──────┬──────┘
│ │
│ API calls routed │
│ by session cookie │
│◄─────────────────────────────────┘
│
┌──────┴──────────────────────────────────┐
│ Slice Pool │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Slice 1 │ │ Slice 2 │ │ Slice N │ │
│ │ Backend │ │ Backend │ │ Backend │ │
│ │ Chrome │ │ Chrome │ │ Chrome │ │
│ │ │ │ │ │ │ │
│ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
│ ┌────┴───────────┴───────────┴────┐ │
│ │ Shared PostgreSQL Instance │ │
│ │ (schema per user) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Pre-provisioned Empty Slice │ │
│ │ (ready for next signup) │ │
│ └─────────────────────────────────┘ │
└──────────────────────────────────────────┘
Key Architectural Decisions
| Decision | Rationale |
|---|---|
| Shared PostgreSQL | One Postgres instance with schema-per-user instead of DB-per-container. Saves ~2-3GB RAM overhead. |
| Single shared frontend | The React frontend is identical for all users. One build served by nginx. API routing determined by auth session. Eliminates N frontend containers. |
| Pre-provisioned “next” slice | One empty container always ready. On signup: assign it, spin up the next empty one in background. No user waiting. |
| Gateway service | Central Node.js service handling auth, session management, Stripe webhooks, health checks, and API proxying to the correct slice backend. |
| Chromium per slice | Each slice must maintain its own WhatsApp Web session. Cannot be shared. This is the main memory consumer. |
Per-Slice Composition (Optimized)
| Component | Memory Budget |
|---|---|
| Node.js backend | ~100-150 MB |
| Chromium (whatsapp-web.js) | ~250-350 MB (with memory flags) |
| Media storage | Varies (1-2 GB disk quota) |
| Total per slice | ~400-500 MB RAM |
Shared Services (Run Once)
| Service | Memory Budget |
|---|---|
| PostgreSQL (shared) | ~500 MB - 2 GB |
| nginx | ~50 MB |
| Gateway service | ~100 MB |
| Marketing site | ~50 MB |
| Total shared | ~1-2 GB |
Capacity Estimate
- 64 GB total RAM
- ~2 GB for OS + shared services
- ~62 GB for slices
- At 450 MB per slice: ~130 slices theoretical max
- With headroom for spikes: 60-80 slices comfortable, 100 aggressive
- Max capacity enforced by the gateway service
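The capacity arithmetic above can be sketched as a small helper the gateway might use to enforce its cap (the function name and headroom factor are illustrative, not from the codebase; exact results also depend on whether GB is taken as 1000 or 1024 MB):

```typescript
// Capacity estimate: (total RAM - OS/shared overhead) / per-slice budget,
// scaled by a headroom factor to leave room for spikes.
function sliceCapacity(
  totalGb: number,     // physical RAM, e.g. 64
  sharedGb: number,    // OS + shared services, e.g. 2
  perSliceMb: number,  // per-slice budget, e.g. 450
  headroom = 1.0,      // 1.0 = theoretical max; ~0.5-0.6 = comfortable
): number {
  const availableMb = (totalGb - sharedGb) * 1024;
  return Math.floor((availableMb * headroom) / perSliceMb);
}
```

With the plan's numbers, `sliceCapacity(64, 2, 450)` lands near the ~130 theoretical figure, and a headroom factor around 0.55 lands inside the 60-80 comfortable band.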
Pricing & Business Model
| Item | Details |
|---|---|
| Monthly price | $9/slice |
| Launch discount | 50% ($4.50/slice) |
| Server cost | $40.40/month |
| Break-even | 5 slices at full price, 9 at discount |
| Target capacity | 60-80 slices per server |
| Revenue at 60 slices | $540/month (full) or $270/month (discount) |
| Trial period | TBD |
Build Order
| Phase | What | Priority | Depends On |
|---|---|---|---|
| 1 | Auth Gateway & Routing | FIRST | Server provisioned |
| 2 | Shared Frontend | HIGH | Phase 1 |
| 3 | Orchestrator (provision/destroy) | HIGH | Phase 1 |
| 4 | Health Monitoring & Alerts | HIGH | Phase 3 |
| 5 | Stripe Billing | MEDIUM | Phase 3 |
| 6 | Marketing Site & Signup | MEDIUM | Phase 5 |
| 7 | Admin Dashboard | MEDIUM | Phase 4 |
Phase 1: Auth Gateway & Routing
The foundation. A central service that authenticates users and routes their API calls to the correct slice backend.
Components
1.1 Gateway Service (Node.js + Express)
A new lightweight service that runs once on the server:
/gateway/
  src/
    index.ts
    auth.ts        # Login, register, session management
    proxy.ts       # Route API calls to correct slice
    middleware.ts  # Session validation, rate limiting
  package.json
  Dockerfile
1.2 Authentication Flow
User visits app.domain.com
│
▼
┌─────────┐ No session
│ nginx │ ──────────────► /login page
└────┬────┘
│ Has valid session cookie
▼
┌─────────┐
│ Gateway │ ── Looks up user → finds their slice port
└────┬────┘
│
▼
┌─────────┐
│ Slice N │ ── Backend on port 50XX
│ Backend │
└─────────┘
1.3 Session Management
| Item | Details |
|---|---|
| Method | Signed HTTP-only cookie |
| Duration | 30 days (configurable) |
| Storage | Gateway’s PostgreSQL schema |
| Contains | User ID, slice ID, expiry |
| Refresh | Rolling — extends on each request |
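The rolling refresh can be reduced to two pure helpers (names are illustrative; the gateway would persist the new expiry to `gateway.sessions` on each request):

```typescript
// Rolling 30-day session: every authenticated request pushes expiry forward.
const SESSION_TTL_MS = 30 * 24 * 60 * 60 * 1000;

function isExpired(expiresAt: Date, now: Date): boolean {
  return expiresAt.getTime() <= now.getTime();
}

// New expiry after a validated request (the "rolling" part).
function rolledExpiry(now: Date): Date {
  return new Date(now.getTime() + SESSION_TTL_MS);
}
```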
1.4 Gateway Database Schema
The gateway has its own schema in the shared PostgreSQL:
CREATE SCHEMA gateway;

-- users and slices reference each other, so the users → slices FK
-- is added after both tables exist.
CREATE TABLE gateway.users (
  id SERIAL PRIMARY KEY,
  email TEXT UNIQUE NOT NULL,
  password_hash TEXT NOT NULL,
  slice_id INTEGER, -- FK to gateway.slices added below
  stripe_customer TEXT, -- Stripe customer ID
  subscription_status TEXT DEFAULT 'trial', -- trial, active, past_due, cancelled
  created_at TIMESTAMPTZ DEFAULT NOW(),
  last_login_at TIMESTAMPTZ
);

CREATE TABLE gateway.slices (
  id SERIAL PRIMARY KEY,
  port INTEGER UNIQUE NOT NULL, -- Backend port (5001, 5003, etc.)
  status TEXT DEFAULT 'available', -- available, assigned, suspended, destroying
  user_id INTEGER REFERENCES gateway.users(id),
  wa_connected BOOLEAN DEFAULT FALSE,
  wa_phone TEXT, -- Connected WhatsApp number
  last_health_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  storage_bytes BIGINT DEFAULT 0
);

ALTER TABLE gateway.users
  ADD CONSTRAINT users_slice_id_fkey FOREIGN KEY (slice_id) REFERENCES gateway.slices(id);

CREATE TABLE gateway.sessions (
  id TEXT PRIMARY KEY, -- Random token
  user_id INTEGER REFERENCES gateway.users(id),
  expires_at TIMESTAMPTZ NOT NULL,
  created_at TIMESTAMPTZ DEFAULT NOW()
);
1.5 nginx Configuration
# Marketing site
server {
  listen 443 ssl;
  server_name domain.com www.domain.com;
  # ... SSL config ...
  root /var/www/marketing;
  index index.html;
}

# App (gateway + slices)
server {
  listen 443 ssl;
  server_name app.domain.com;
  # ... SSL config ...

  # Static frontend (shared)
  location / {
    root /var/www/app;
    try_files $uri $uri/ /index.html;
  }

  # API calls — proxied through gateway
  location /api/ {
    proxy_pass http://127.0.0.1:3000; # Gateway service
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
  }

  # Auth endpoints (login, register, logout)
  location /auth/ {
    proxy_pass http://127.0.0.1:3000;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
  }
}
1.6 Gateway API Proxy Logic
The gateway intercepts all /api/* requests:
- Read session cookie → look up user → find their slice port
- Rewrite the request to http://127.0.0.1:{slice_port}/api/*
- Stream the response back
- For real-time events, the gateway connects to the slice’s Socket.io as a client and re-emits via SSE (GET /api/events)
// Simplified proxy logic (Express + cookie-parser + http-proxy;
// validateSession and getSliceForUser are gateway DB helpers)
import express from 'express';
import cookieParser from 'cookie-parser';
import httpProxy from 'http-proxy';

const app = express();
app.use(cookieParser());
const proxy = httpProxy.createProxyServer();

app.use('/api', async (req, res) => {
  const session = await validateSession(req.cookies.session);
  if (!session) return res.status(401).json({ error: 'Not authenticated' });
  const slice = await getSliceForUser(session.userId);
  if (!slice) return res.status(503).json({ error: 'No slice assigned' });
  // Proxy to the user's slice backend on localhost
  proxy.web(req, res, { target: `http://127.0.0.1:${slice.port}` });
});
1.7 Gateway Auth Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /auth/register | POST | Create account (email + password) |
| /auth/login | POST | Authenticate, set session cookie |
| /auth/logout | POST | Clear session |
| /auth/me | GET | Return current user + slice status |
| /auth/forgot-password | POST | Send reset email |
| /auth/reset-password | POST | Set new password with token |
Phase 2: Shared Frontend
Current Problem
Each slice currently runs its own frontend container. With 80 slices that’s 80 identical React apps consuming memory.
Solution
- Build the frontend once
- Serve it as static files from nginx
- The frontend talks to /api/*, which nginx routes through the gateway
- The gateway proxies to the correct slice backend based on the session
- The frontend code doesn’t need to know which slice it’s talking to
Changes Required
- Remove frontend container from slice Docker composition — slices only run backend + Chromium
- Frontend API base URL — already uses relative /api paths (confirmed in vite.config.ts), so no changes needed
- Add login page — new routes in the React app for /login, /register
- Add auth context — React context that checks /auth/me on load, redirects to login if not authenticated
- Build pipeline — single npm run build, output deployed to /var/www/app/
Frontend Auth Flow
App loads
│
▼
Check /auth/me
│
├─── 200 + user data ──► Load CRM (existing app)
│
└─── 401 ──► Show login page
│
▼
POST /auth/login
│
├─── 200 ──► Set cookie, redirect to /
│
└─── 401 ──► Show error
Phase 3: Orchestrator Service
Manages the lifecycle of slices. Can be part of the gateway service or a separate process.
Responsibilities
- Provision new slices — create Docker container, assign port, create DB schema
- Destroy slices — stop container, remove data, free port
- Maintain “next up” slice — always one empty container ready for instant assignment
- Assign slices to users — on signup/payment confirmation
- Suspend slices — on payment failure (stop container, retain data for grace period)
- Track resource usage — disk, memory per slice
Slice Lifecycle
[Pre-provisioned] ──signup──► [Assigned] ──cancel──► [Destroying]
│ │ │
│ │ payment fail │
│ ▼ ▼
│ [Suspended] [Destroyed]
│ │ (data deleted)
│ │ payment restored
│ ▼
│ [Assigned]
│
└── Always one available. When assigned, provision next.
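The lifecycle diagram can be enforced with a transition table, so the orchestrator rejects illegal moves (status names match `gateway.slices.status`; the function name is illustrative):

```typescript
// Legal status transitions from the lifecycle diagram. Anything not listed
// (e.g. destroying → assigned) is rejected.
type SliceStatus = 'available' | 'assigned' | 'suspended' | 'destroying';

const TRANSITIONS: Record<SliceStatus, SliceStatus[]> = {
  available: ['assigned'],                 // signup
  assigned: ['suspended', 'destroying'],   // payment fail / cancel
  suspended: ['assigned', 'destroying'],   // payment restored / grace expired
  destroying: [],                          // terminal: data deleted
};

function canTransition(from: SliceStatus, to: SliceStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```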
Provisioning a New Slice
# Orchestrator runs these steps:
1. Pick next available port (5001, 5003, 5005, ...)
2. Create DB schema: CREATE SCHEMA slice_{id};
3. Run migrations on the new schema
4. Start Docker container (bound to 127.0.0.1 only, per the network isolation policy):
   docker run -d \
     --name slice-{id} \
     --memory=512m \
     --cpus=0.5 \
     -p 127.0.0.1:{port}:4001 \
     -e DB_SCHEMA=slice_{id} \
     -e SLICE_ID={id} \
     -v /data/slices/{id}/media:/app/media \
     whatsapp-slice:latest
5. Update gateway.slices table
6. Health check until container responds
Port Allocation
- Backend ports: 5001, 5003, 5005, … (odd numbers, room for 200+)
- WebSocket on same port as backend (upgrade)
- No frontend ports needed (shared frontend)
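Port allocation is simple enough to sketch directly (function name and the 5401 upper bound are illustrative; the bound gives room for 200 odd ports):

```typescript
// Next free odd backend port in the 5001+ range.
function nextSlicePort(used: Set<number>, base = 5001, max = 5401): number | null {
  for (let port = base; port <= max; port += 2) {
    if (!used.has(port)) return port;
  }
  return null; // pool exhausted: capacity cap reached
}
```

In practice the orchestrator would build `used` from `gateway.slices.port` so freed ports are reused.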
Phase 4: Health Monitoring & Alerts
What To Monitor
| Check | Frequency | Alert If |
|---|---|---|
| Container running | 30 seconds | Container stopped/crashed |
| WhatsApp connected | 60 seconds | state !== 'ready' for >2 min |
| Backend responding | 60 seconds | HTTP health check fails |
| Memory usage | 60 seconds | Container >90% of limit |
| Disk usage per slice | 5 minutes | >90% of quota |
| Server disk | 5 minutes | >85% of 350GB |
| Server RAM | 60 seconds | >90% of 64GB |
| Server CPU | 60 seconds | >90% sustained for 5 min |
Health Check Endpoint (Per Slice)
Each slice backend already has or will have:
GET /api/status → { state: 'ready', phoneNumber: '+1234...', uptime: 3600 }
The gateway polls this for every assigned slice.
Alert Channels
- Email — via Resend (already used for WIT project)
- WhatsApp — send alert to the user’s connected WhatsApp number (use the slice’s own connection)
- Admin email — alert the admin (you) for server-level issues
Alert Flow for WhatsApp Disconnection
Health check detects state !== 'ready'
│
▼
Wait 2 minutes (transient disconnections resolve themselves)
│
▼
Still disconnected?
│
├── YES ──► Send email: "Your WhatsApp is disconnected.
│ Click here to reconnect: app.domain.com/reconnect"
│ Update gateway.slices.wa_connected = false
│
└── NO ──► Clear alert, no action
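The grace-window logic above can be expressed as a pure function the monitor calls on every poll (names and state shape are illustrative):

```typescript
// Debounced disconnection alert: fire once, only after the 2-minute grace
// window, and reset as soon as the slice reports 'ready' again.
const GRACE_MS = 2 * 60 * 1000;

interface AlertState { downSince: number | null; alerted: boolean; }

function evaluateHealth(
  prev: AlertState,
  waState: string,
  nowMs: number,
): { next: AlertState; sendAlert: boolean } {
  if (waState === 'ready') {
    // Recovered: clear any pending or fired alert.
    return { next: { downSince: null, alerted: false }, sendAlert: false };
  }
  const downSince = prev.downSince ?? nowMs; // first poll that saw it down
  const sendAlert = !prev.alerted && nowMs - downSince >= GRACE_MS;
  return { next: { downSince, alerted: prev.alerted || sendAlert }, sendAlert };
}
```

Keeping this pure (state in, state out) makes the debounce behaviour trivially testable without a real clock.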
Reconnection Page
When a user’s WhatsApp disconnects:
- They receive an email with a link
- Link goes to app.domain.com/reconnect
- Page shows the live QR code from their slice (via WebSocket)
- User scans with phone
- Connection restores, alert clears automatically
This uses the existing QR code WebSocket flow that whatsapp-web.js already provides.
Prototype Telemetry — Swap & Capacity Tracking
CRITICAL: This is a prototype deployment. Understanding real-world memory behaviour under aggressive swapping is essential before scaling. Every capacity decision depends on data collected during this phase.
Why This Matters
The capacity estimates (150-200 slices on 64GB + 32GB swap) are based on modelling, not production data. Real-world WhatsApp Web sessions may behave differently over days/weeks — memory leaks, Chromium bloat, message volume spikes. We must instrument aggressively from day one.
Validated: Swap Does Not Break Connections
Tested on the DigitalOcean droplet (2026-02-23): forced idle slices down to 10MB resident using cgroups v2 memory.reclaim. Slice 1 (live WhatsApp connection, phone +50499311987) remained state: "ready" throughout. Pages back into RAM in milliseconds when activity resumes. NVMe swap latency is negligible for WebSocket heartbeats.
What To Track (Logged Every 60 Seconds)
| Metric | Source | Purpose |
|---|---|---|
| Per-slice RSS (resident memory) | cgroups memory.current | Track actual RAM usage per slice over time |
| Per-slice swap usage | cgroups memory.swap.current | How much each slice has paged out |
| Total server RAM used/free | /proc/meminfo | Overall headroom |
| Total swap used/free | /proc/meminfo | How close to swap ceiling |
| Per-slice WhatsApp state | /api/status (state field) | Connection health correlated with swap pressure |
| Per-slice last activity time | Network bytes delta | Idle duration tracking |
| Swap-in rate | vmstat or /proc/vmstat | Detect thrashing (high swap-in = too many slices) |
| OOM kill events | dmesg / kernel log | Hard failures |
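A minimal sketch of the /proc/meminfo half of this collector (assumes the standard `Key: value kB` line format; field coverage beyond the memory/swap totals is up to the daemon):

```typescript
// Parse /proc/meminfo (values reported in kB) into a bytes map,
// feeding the server_telemetry row.
function parseMeminfo(text: string): Record<string, number> {
  const out: Record<string, number> = {};
  for (const line of text.split('\n')) {
    const m = line.match(/^(\w+):\s+(\d+)\s*kB/);
    if (m) out[m[1]] = parseInt(m[2], 10) * 1024;
  }
  return out;
}
```

On the server this would read `/proc/meminfo` every 60 seconds and insert `MemTotal`, `MemAvailable`, `SwapTotal`, and `SwapFree` into `gateway.server_telemetry`.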
Stored As Time-Series Data
CREATE TABLE gateway.telemetry (
  id BIGSERIAL PRIMARY KEY,
  slice_id INTEGER REFERENCES gateway.slices(id),
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  rss_bytes BIGINT,
  swap_bytes BIGINT,
  wa_state TEXT,
  idle_seconds INTEGER,
  cpu_percent REAL
);
CREATE TABLE gateway.server_telemetry (
  id BIGSERIAL PRIMARY KEY,
  timestamp TIMESTAMPTZ DEFAULT NOW(),
  ram_used_bytes BIGINT,
  ram_total_bytes BIGINT,
  swap_used_bytes BIGINT,
  swap_total_bytes BIGINT,
  swap_in_rate BIGINT, -- pages/sec swapped in
  cpu_percent REAL,
  disk_used_bytes BIGINT,
  active_slices INTEGER,
  total_slices INTEGER
);
Key Questions This Data Must Answer
- Stable idle footprint: Do swapped slices stay at ~10MB resident, or does Chromium gradually pull pages back in?
- Swap ceiling: At what total swap usage do we see increased swap-in rates (thrashing)?
- Connection stability: Over days/weeks of aggressive swapping, do WhatsApp disconnections increase?
- Memory leaks: Do slices grow over time even when idle? How fast?
- Peak behaviour: When a user actively browses the CRM, how quickly does the slice page back in? Is there noticeable latency?
- Safe slice count: At what number of slices does the server start showing stress signals?
Admin Dashboard Must Surface
- Real-time graph: total RAM vs swap usage over last 24h/7d/30d
- Per-slice memory timeline (RSS + swap)
- Available capacity: “N more slices can be provisioned” based on current usage patterns
- Swap-in rate trend — the single best indicator of overcommitment
- Alert if swap-in rate exceeds threshold (e.g., >1000 pages/sec sustained)
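The threshold alert can be stated precisely as a sustained-rate check over the 60-second telemetry samples (function name and sample count are illustrative):

```typescript
// Thrashing detector: swap-in rate above threshold for N consecutive samples.
// Samples arrive every 60s, so 5 samples ≈ 5 minutes sustained.
function isThrashing(
  swapInRates: number[],   // most recent sample last
  threshold = 1000,        // pages/sec
  sustainedSamples = 5,
): boolean {
  if (swapInRates.length < sustainedSamples) return false;
  return swapInRates.slice(-sustainedSamples).every((r) => r > threshold);
}
```

Requiring consecutive samples avoids alerting on a single page-in burst when one user's slice wakes up.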
Aggressive Swap Script (Already Tested)
The swap-test.sh script (formerly at /root/swap-test.sh on the now-decommissioned droplet) used memory.reclaim via cgroups v2 to force idle slice memory into swap after 10 seconds of inactivity. It will be refined into a production-grade daemon on the OVH server:
- Configurable idle threshold
- Graduated reclaim (swap more aggressively the longer a slice is idle)
- Logging to the telemetry database instead of flat files
- Excluded from reclaim during active user sessions (detected via WebSocket connections or HTTP activity)
Phase 5: Stripe Billing
Integration Points
| Event | Action |
|---|---|
| User signs up | Create Stripe customer, start subscription |
| Payment succeeds | Assign slice (from pre-provisioned pool) |
| Payment fails | Send dunning email, 7-day grace period |
| Grace period expires | Suspend slice (stop container, retain data) |
| Payment restored | Resume slice |
| User cancels | Offer data export, then destroy slice at end of billing period |
| Stripe webhook | Gateway receives and processes all events |
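The event table above maps naturally onto a dispatch table in the webhook handler. The event type strings are standard Stripe webhook types; which ones the gateway subscribes to, and the internal action names, are assumptions for illustration:

```typescript
// Map incoming Stripe webhook event types to orchestrator actions.
const EVENT_ACTIONS: Record<string, string> = {
  'checkout.session.completed': 'assign_slice',        // payment succeeded at signup
  'invoice.paid': 'resume_slice',                      // payment restored
  'invoice.payment_failed': 'start_dunning',           // begin 7-day grace period
  'customer.subscription.deleted': 'schedule_destroy', // destroy at period end
};

function actionFor(eventType: string): string {
  // Unknown events are acknowledged with 200 but not processed,
  // so Stripe doesn't retry them.
  return EVENT_ACTIONS[eventType] ?? 'ignore';
}
```

In the real handler the webhook signature must be verified (Stripe's signing secret) before any event is dispatched.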
Stripe Configuration
| Item | Details |
|---|---|
| Product | WhatsApp CRM Slice |
| Price | $9.00/month |
| Launch price | $4.50/month (50% coupon) |
| Trial | TBD |
| Payment methods | Card (via Stripe Checkout) |
Gateway Endpoints for Billing
| Endpoint | Purpose |
|---|---|
| POST /billing/create-checkout | Generate Stripe Checkout session |
| POST /billing/webhook | Receive Stripe events |
| GET /billing/portal | Redirect to Stripe Customer Portal (manage subscription) |
| GET /billing/status | Return current subscription status |
Phase 6: Marketing Site & Signup Flow
Structure
Separate static site at domain.com (the gateway/app lives at app.domain.com):
/var/www/marketing/
index.html # Landing page
pricing.html # Pricing details
features.html # Feature showcase
signup.html # Redirects to Stripe Checkout or app.domain.com/register
assets/ # CSS, images, JS
Signup Flow
User visits domain.com
│
▼
Clicks "Get Started" → app.domain.com/register
│
▼
Creates account (email + password)
│
▼
Redirected to Stripe Checkout ($9/month or $4.50 launch)
│
▼
Payment succeeds → Stripe webhook fires
│
▼
Gateway assigns pre-provisioned slice
│
▼
User redirected to app.domain.com → QR code page
│
▼
User scans QR with phone → WhatsApp connected → CRM ready
Phase 7: Admin Dashboard
Accessible at app.domain.com/admin (admin-only route)
Panels
- Server Overview — CPU, RAM, disk, uptime
- Slice Grid — all slices with status indicators (green/yellow/red)
- Alert Log — history of disconnections, restarts, failures
- User Management — list users, suspend/terminate, view subscription status
- Revenue — MRR, active subscriptions, churn
- Capacity — current usage vs max, projected fill rate
Resource Budgets & Limits
Per-Slice Docker Limits
deploy:
  resources:
    limits:
      memory: 512M
      cpus: '0.5'
    reservations:
      memory: 256M
      cpus: '0.25'
Per-Slice Disk Quota
| Item | Limit |
|---|---|
| Media storage | 2 GB included |
| Additional storage | Paid add-on (TBD pricing) |
| Enforcement | Orchestrator checks usage, blocks media downloads at quota |
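The enforcement rule reduces to one predicate the orchestrator checks before each media download, using the `storage_bytes` counter from `gateway.slices` (function name is illustrative):

```typescript
// Quota gate: block a media download if it would push the slice past its quota.
const MEDIA_QUOTA_BYTES = 2 * 1024 ** 3; // 2 GB included

function mediaDownloadAllowed(
  usedBytes: number,       // gateway.slices.storage_bytes
  incomingBytes: number,   // size of the media about to be fetched
  quota = MEDIA_QUOTA_BYTES,
): boolean {
  return usedBytes + incomingBytes <= quota;
}
```

Checking before the download (rather than after) keeps a slice from ever exceeding its quota on disk.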
Chromium Memory Optimization Flags
--disable-gpu
--disable-dev-shm-usage
--no-sandbox
--disable-setuid-sandbox
--disable-extensions
--disable-background-networking
--js-flags="--max-old-space-size=128"
--single-process
Security & Isolation
Container Isolation
- Each slice runs in its own Docker container with resource limits
- Containers have no access to other containers’ volumes
- Containers cannot access the host network directly
- No --privileged flag
- Read-only root filesystem where possible
Network Isolation
- Slice backends only listen on localhost (127.0.0.1)
- Only nginx is exposed to the internet (ports 80/443)
- Gateway communicates with slices via localhost ports
- No inter-slice communication possible
Data Isolation
- Schema-per-user in PostgreSQL (not shared tables)
- Media stored in /data/slices/{id}/ with Linux user/group permissions
- Gateway validates every request belongs to the authenticated user’s slice
Auth Security
- Passwords hashed with bcrypt (cost 12)
- Session tokens: 256-bit random, HTTP-only cookies, Secure flag, SameSite=Strict
- Rate limiting on login (5 attempts per minute per IP)
- HTTPS everywhere (Let’s Encrypt)
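The 256-bit token generation is a one-liner with Node's crypto module (function name is illustrative; `base64url` keeps the token cookie-safe):

```typescript
import { randomBytes } from 'node:crypto';

// 256-bit cryptographically random session token, URL/cookie-safe.
function newSessionToken(): string {
  return randomBytes(32).toString('base64url');
}
```

The token is the primary key of `gateway.sessions`; only the token travels in the cookie, everything else is looked up server-side.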
Backup & Data Policy
Server-Level
- OVH automated daily backup (included in plan)
- Covers full disk including all slice data
Future Enhancement
- Per-slice backup to S3/R2 (when customer base justifies cost)
- User-initiated export (download all messages, photos, contacts as ZIP)
Data Destruction
- User can destroy their own data from within the CRM (existing feature)
- On cancellation: data retained for 7 days, then slice destroyed completely
- Nothing retained after destruction
Deployment & Updates
Rolling Update Strategy
For each active slice:
1. Pull new image
2. Stop container (WhatsApp session persists in local auth store)
3. Start new container with same volume mounts
4. Health check — verify backend responds
5. Verify WhatsApp reconnects (usually automatic)
6. Move to next slice
- Update 5-10 slices at a time (parallel batches)
- Monitor for failures between batches
- Rollback: keep previous image tagged, revert if issues
Code Deployment Pipeline
Local development
│
▼
Test on localhost (existing dev setup)
│
▼
Build Docker image
│
▼
Push to server (rsync or registry)
│
▼
Rolling update via orchestrator command
Domain & SSL
Domain Structure (TBD — needs domain choice)
| Subdomain | Purpose |
|---|---|
| domain.com | Marketing site |
| app.domain.com | CRM application (gateway + frontend) |
SSL
- Let’s Encrypt with certbot
- Auto-renewal via cron
- Wildcard cert if needed for subdomains
Open Questions
- Domain name — What domain for the SaaS product?
- Trial period — Free trial before payment? If so, how long?
- Additional storage pricing — How much per extra GB?
- Email provider — Resend for transactional emails? (already in use for WIT)
- Max slices cap — Hard limit per server? Suggest 80 for safety.
- Onboarding flow — Any guided tour for new users?
- Support channel — How do users get help? Email? In-app chat?
- Terms of service / Privacy policy — Need legal docs before launch.
- Data residency — Server is in Canada. Any user concerns?
- Branding — Product name? Logo?