Collection System
Overview
The collection system systematically archives all WhatsApp data (contacts, messages, media) into PostgreSQL. It operates in three modes depending on user presence: CollectionService (user disconnected), FillService (user present but idle), and on-demand (user clicks a contact).
How It Works
CollectionService (user disconnected)
Triggered when the last browser tab closes and WhatsApp is in ready state. Processes every contact that hasn’t been collected yet:
- Resolve contact info from WhatsApp fiber data
- Cache avatar image
- Open chat via searchAndOpenChat(phone), which types the phone number into the WhatsApp search box and presses Enter
- Verify the correct chat opened (message data-id attributes contain the chat’s waId)
- Deep-load and archive all messages
- Download media files
- Insert boundary marker (“No earlier messages available”)
- Mark collection_completed = true
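The verification step can be sketched as a pure check over the data-id values read from the rendered messages. This is only an illustration: isExpectedChat is a hypothetical helper (not taken from DOMReader.ts), the sample data-id strings in the usage note are made up, and the rule "at least one message rendered, and every data-id contains the waId" is an assumption about how strict the real check is.

```typescript
// Hypothetical helper, not the project's actual implementation.
// Assumption: the chat is considered verified only when at least one message
// is rendered and every rendered message's data-id contains the expected waId.
function isExpectedChat(messageDataIds: string[], waId: string): boolean {
  if (messageDataIds.length === 0) {
    return false; // nothing rendered, so nothing to verify against
  }
  return messageDataIds.every((dataId) => dataId.includes(waId));
}
```

For example, isExpectedChat(["false_15551234567@c.us_ABC"], "15551234567@c.us") passes, while a list whose ids reference a different waId fails, which would signal that the search opened the wrong chat.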
Empty chat handling: If openChatId doesn’t contain @, it’s a header label (e.g., “Business Account”) — treated as empty chat, boundary marker inserted, marked complete.
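A minimal sketch of that rule, assuming openChatId arrives as a plain string. The function and type names are hypothetical and used only for illustration.

```typescript
// Hypothetical names; only the "contains @" rule comes from the text above.
type OpenChatKind = "chat" | "header-label";

function classifyOpenChatId(openChatId: string): OpenChatKind {
  // Real chat ids contain "@"; anything else is treated as a header label,
  // which the collector handles as an empty chat.
  return openChatId.includes("@") ? "chat" : "header-label";
}
```

A caller would skip message loading for "header-label", insert the boundary marker, and mark the contact complete.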
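Pace: 3 seconds between contacts. Full collection of ~340 contacts takes ~3 hours, which is roughly 32 seconds per contact; the pauses themselves add up to only about 17 minutes, so most of the time goes to per-contact processing.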
Abort conditions: User reconnects, WhatsApp disconnects, Chrome crashes.
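The pacing and abort behaviour can be sketched as a loop that checks for an abort reason before each contact and pauses between contacts. Everything here is an assumption made for illustration: the CollectionDeps interface, the reason strings, and the choice to check only between contacts rather than mid-contact. The real CollectionService may be structured differently.

```typescript
// Hypothetical types and names; only the 3-second pause and the three abort
// conditions come from the text above.
type AbortReason = "user-reconnected" | "whatsapp-disconnected" | "chrome-crashed";

interface CollectionDeps {
  processContact: (contactId: string) => Promise<void>; // full per-contact pipeline
  checkAbort: () => AbortReason | null;                 // null means keep going
  sleep: (ms: number) => Promise<void>;                 // injected so tests need not wait
}

interface CollectionResult {
  completed: string[];
  abortedBecause: AbortReason | null;
}

const PAUSE_BETWEEN_CONTACTS_MS = 3000;

async function runCollection(pending: string[], deps: CollectionDeps): Promise<CollectionResult> {
  const completed: string[] = [];
  let index = 0;
  for (const contactId of pending) {
    // Check before starting each contact so an abort never interrupts one mid-way here.
    const reason = deps.checkAbort();
    if (reason !== null) {
      return { completed, abortedBecause: reason };
    }
    await deps.processContact(contactId);
    completed.push(contactId);
    // Pause between contacts, but not after the last one.
    if (index < pending.length - 1) {
      await deps.sleep(PAUSE_BETWEEN_CONTACTS_MS);
    }
    index++;
  }
  return { completed, abortedBecause: null };
}
```

Injecting sleep and checkAbort keeps the loop testable without real timers or a browser.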
FillService (user present, idle)
Runs the same pipeline as CollectionService, but only when the user has been idle for 15+ seconds. Pauses immediately when user activity is detected.
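One way to model the idle gate is a small tracker that records the last activity time and answers whether filling may proceed. The class name, method names, and the use of millisecond timestamps are assumptions; only the 15-second threshold comes from the description above.

```typescript
// Hypothetical sketch of an idle gate; not the actual FillService code.
const IDLE_THRESHOLD_MS = 15000;

class IdleTracker {
  private lastActivityAt: number;

  constructor(now: number) {
    this.lastActivityAt = now;
  }

  // Call on any user activity (click, scroll, keypress); this resets the idle clock.
  recordActivity(now: number): void {
    this.lastActivityAt = now;
  }

  // True once the user has been idle for at least the threshold.
  canFill(now: number): boolean {
    return now - this.lastActivityAt >= IDLE_THRESHOLD_MS;
  }
}
```

A fill loop would call canFill(Date.now()) before each contact and stop as soon as it returns false.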
On-Demand Loading
When the user clicks a contact with no cached messages, messageController loads messages from the DOM on the fly.
Key Files
| File | Purpose |
|---|---|
| backend/src/services/CollectionService.ts | Main collection loop, runs when user disconnects |
| backend/src/services/FillService.ts | Idle-based collection when user is present |
| backend/src/services/DataCollectionSteps.ts | Shared per-contact functions (resolve, avatar, archive, media) |
| backend/src/services/DOMReader.ts | searchAndOpenChat(), scrapeAllContacts(), getRenderedMessages() |
| gateway/src/monitor.ts | checkCollectionOpportunities(): triggers collection on disconnected slices |
| database/migrations/021-collection-status.sql | Schema additions for collection tracking |
Database Columns
connection_state:
- collection_status: idle | collecting | filling | upkeep
- collection_progress: JSONB with { total, completed, current, percent }
- collection_completed_at: timestamp when all contacts processed

contacts:
- collection_completed: boolean, true when fully archived
- messages_archived_count: number of messages saved
- last_collected_at: timestamp of last collection run

gateway.slices:
- collection_status: mirrors the slice’s collection state
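As an illustration of the collection_progress shape, the following sketch builds the object that would be serialized into the JSONB column. The field types are assumptions (in particular, current is guessed to be a contact identifier or null), as are the rounding of percent and reporting 0 when total is 0.

```typescript
// Hypothetical typing of the JSONB payload; only the four key names come from the docs.
interface CollectionProgress {
  total: number;
  completed: number;
  current: string | null; // assumed: identifier of the contact being processed
  percent: number;        // assumed: whole-number percentage
}

function buildProgress(total: number, completed: number, current: string | null): CollectionProgress {
  // Guard against division by zero when there are no contacts yet.
  const percent = total === 0 ? 0 : Math.round((completed / total) * 100);
  return { total, completed, current, percent };
}
```

JSON.stringify(buildProgress(340, 85, null)) yields a payload with percent 25 that can be written to the column as-is.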
Boundary Markers
Every collected contact gets a boundary marker message:
- wa_id: boundary_no_earlier_{chatId}
- message_type: boundary
- from_wa_id: system
- body: “No earlier messages available on the WhatsApp Web app…”
- timestamp: epoch 1 (sorts to beginning of chat)
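A marker with these fields could be constructed as in the sketch below. The interface and function names are hypothetical, the body text is passed in rather than hard-coded because the full wording is not given here, and storing the timestamp as the number 1 is an assumption about how "epoch 1" is represented.

```typescript
// Hypothetical shape mirroring the field list above.
interface BoundaryMarker {
  wa_id: string;
  message_type: "boundary";
  from_wa_id: "system";
  body: string;
  timestamp: number; // assumed numeric epoch value; 1 sorts before any real message
}

function buildBoundaryMarker(chatId: string, body: string): BoundaryMarker {
  return {
    wa_id: `boundary_no_earlier_${chatId}`,
    message_type: "boundary",
    from_wa_id: "system",
    body,
    timestamp: 1,
  };
}
```

Because the timestamp is smaller than any real message's, sorting a chat by timestamp places the marker first. Deriving wa_id from the chat id also gives each chat a single predictable marker id.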
Smart Scrape Gating
The full sidebar scrape (scrapeAllContacts()) only runs if:
- Contacts table is empty, OR
- last_connected_at is more than 10 minutes ago, OR
- last_connected_at is null
Quick restarts within 10 minutes skip the 90-second sidebar scroll.
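The gating rule translates directly into a predicate. This is a sketch under assumptions: the function name and its parameters (a contact count, a nullable Date, and the current time) are illustrative, not the project's actual signature.

```typescript
// Hypothetical predicate; the three conditions mirror the list above.
const RESCRAPE_AFTER_MS = 10 * 60 * 1000; // 10 minutes

function shouldRunFullScrape(contactCount: number, lastConnectedAt: Date | null, now: Date): boolean {
  if (contactCount === 0) return true;     // contacts table is empty
  if (lastConnectedAt === null) return true; // never connected before
  // Stale connection: strictly more than 10 minutes ago.
  return now.getTime() - lastConnectedAt.getTime() > RESCRAPE_AFTER_MS;
}
```

Passing now explicitly keeps the check deterministic in tests; production code would pass new Date().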