DocHub
Systematic data archiving — CollectionService, FillService, and on-demand loading

Collection System

Overview

The collection system systematically archives all WhatsApp data (contacts, messages, media) into PostgreSQL. It operates in three modes depending on user presence: CollectionService (user disconnected), FillService (user present but idle), and on-demand (user clicks a contact).

How It Works

CollectionService (user disconnected)

Triggered when the last browser tab closes and WhatsApp is in the ready state. Processes every contact that hasn’t been collected yet:

  1. Resolve contact info from WhatsApp fiber data
  2. Cache avatar image
  3. Open chat via searchAndOpenChat(phone), which types the phone number into the WhatsApp search box and presses Enter
  4. Verify correct chat opened (message data-id attributes contain the chat’s waId)
  5. Deep-load and archive all messages
  6. Download media files
  7. Insert boundary marker (“No earlier messages available”)
  8. Mark collection_completed = true
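The eight steps above can be sketched as a single per-contact function. This is only an illustration: the CollectionSteps interface, its method names (other than searchAndOpenChat), and the collectContact function are hypothetical stand-ins for the helpers in DataCollectionSteps.ts and DOMReader.ts, whose real signatures are not shown here.

```typescript
// Hypothetical step signatures; the real helpers may look different.
interface CollectionSteps {
  resolveContact(phone: string): Promise<{ waId: string }>;
  cacheAvatar(waId: string): Promise<void>;
  searchAndOpenChat(phone: string): Promise<void>;
  getRenderedMessageIds(): Promise<string[]>; // data-id values currently in the DOM
  archiveAllMessages(waId: string): Promise<number>;
  downloadMedia(waId: string): Promise<void>;
  insertBoundaryMarker(waId: string): Promise<void>;
  markCompleted(waId: string): Promise<void>;
}

// Runs steps 1-8 for one contact. Returns false when verification fails,
// in which case nothing is archived and the contact is not marked complete.
async function collectContact(phone: string, steps: CollectionSteps): Promise<boolean> {
  const { waId } = await steps.resolveContact(phone); // step 1
  await steps.cacheAvatar(waId);                      // step 2
  await steps.searchAndOpenChat(phone);               // step 3

  // Step 4: at least one rendered message id must contain the chat's waId.
  // A chat with no rendered messages also fails this check in the sketch;
  // the source handles empty chats through a separate path.
  const ids = await steps.getRenderedMessageIds();
  if (!ids.some((id) => id.includes(waId))) return false;

  await steps.archiveAllMessages(waId);   // step 5
  await steps.downloadMedia(waId);        // step 6
  await steps.insertBoundaryMarker(waId); // step 7
  await steps.markCompleted(waId);        // step 8
  return true;
}
```

Passing the steps in as an object keeps the ordering logic testable without a browser; a fake implementation can record which steps ran.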

Empty chat handling: If openChatId doesn’t contain @, it is a header label (e.g., “Business Account”) rather than a chat ID. The contact is treated as an empty chat: a boundary marker is inserted and the contact is marked complete.
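The empty-chat check reduces to a test for the @ character. A minimal sketch, assuming openChatId is a plain string; the OpenChatResult type and classifyOpenChat function are hypothetical names introduced for illustration.

```typescript
// Hypothetical result type: either a real chat or a header label.
type OpenChatResult =
  | { kind: "chat"; chatId: string }
  | { kind: "empty"; label: string };

// A value without "@" is a header label, not a chat id, so the
// contact is handled as an empty chat.
function classifyOpenChat(openChatId: string): OpenChatResult {
  if (openChatId.includes("@")) {
    return { kind: "chat", chatId: openChatId };
  }
  return { kind: "empty", label: openChatId };
}
```

For example, classifyOpenChat("Business Account") yields kind "empty", while an id such as "15551234567@c.us" (an illustrative value) yields kind "chat".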

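Pace: 3 seconds between contacts. Full collection of ~340 contacts takes ~3 hours, or roughly 32 seconds per contact on average, so the pauses account for only about 17 minutes of the total.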

Abort conditions: the user reconnects, WhatsApp disconnects, or Chrome crashes.
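The pacing and abort rules can be combined in one loop. The sketch below is an assumption about structure, not the actual CollectionService code: LoopOptions, checkAbort, collectOne, and runCollectionLoop are hypothetical names, and the abort probe is simply polled before each contact.

```typescript
type AbortReason = "user_reconnected" | "whatsapp_disconnected" | "chrome_crashed";

interface LoopOptions {
  paceMs: number;                               // 3000 for the pace described above
  checkAbort: () => AbortReason | null;         // hypothetical probe for the abort conditions
  collectOne: (phone: string) => Promise<void>; // hypothetical per-contact pipeline
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Processes contacts in order and stops as soon as an abort condition is seen.
async function runCollectionLoop(
  phones: string[],
  opts: LoopOptions,
): Promise<{ completed: number; abortedBy: AbortReason | null }> {
  let completed = 0;
  for (const phone of phones) {
    const reason = opts.checkAbort();
    if (reason !== null) return { completed, abortedBy: reason };
    await opts.collectOne(phone);
    completed += 1;
    await sleep(opts.paceMs); // pause before moving on to the next contact
  }
  return { completed, abortedBy: null };
}
```

Because completion is tracked per contact, an aborted run can resume later with only the contacts that are not yet marked complete.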

FillService (user present, idle)

Runs the same pipeline as CollectionService, but only when the user has been idle for 15+ seconds. Pauses immediately when user activity is detected.
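The idle rule can be sketched as a small gate that FillService would consult before each unit of work. IdleGate and its methods are hypothetical names; only the 15-second threshold comes from the description above. The clock is injectable so the behavior can be checked without waiting.

```typescript
const IDLE_THRESHOLD_MS = 15 * 1000; // "idle for 15+ seconds"

class IdleGate {
  private lastActivityAt: number;

  // The clock defaults to wall time; tests can pass a fake.
  constructor(private readonly now: () => number = () => Date.now()) {
    this.lastActivityAt = this.now();
  }

  // Call on any user activity; filling must pause right away.
  recordActivity(): void {
    this.lastActivityAt = this.now();
  }

  // True only after at least 15 seconds without activity.
  canFill(): boolean {
    return this.now() - this.lastActivityAt >= IDLE_THRESHOLD_MS;
  }
}
```

Checking canFill() between small steps, rather than once per contact, is what would make the pause feel immediate to the user.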

On-Demand Loading

When the user clicks a contact with no cached messages, messageController loads messages from the DOM on the fly.
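This is a cache-miss fallback, sketched below. The function name getMessagesForChat, the ChatMessage shape, and the two loader callbacks are hypothetical; the real messageController is not shown in this document.

```typescript
interface ChatMessage {
  waId: string;
  body: string;
}

type Loader = (chatId: string) => Promise<ChatMessage[]>;

// Serves cached messages when any exist; otherwise reads them from the DOM.
async function getMessagesForChat(
  chatId: string,
  loadCached: Loader,
  loadFromDom: Loader,
): Promise<{ source: "cache" | "dom"; messages: ChatMessage[] }> {
  const cached = await loadCached(chatId);
  if (cached.length > 0) {
    return { source: "cache", messages: cached };
  }
  return { source: "dom", messages: await loadFromDom(chatId) };
}
```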

Key Files

File                                           Purpose
backend/src/services/CollectionService.ts      Main collection loop, runs when user disconnects
backend/src/services/FillService.ts            Idle-based collection when user is present
backend/src/services/DataCollectionSteps.ts    Shared per-contact functions (resolve, avatar, archive, media)
backend/src/services/DOMReader.ts              searchAndOpenChat(), scrapeAllContacts(), getRenderedMessages()
gateway/src/monitor.ts                         checkCollectionOpportunities(): triggers collection on disconnected slices
database/migrations/021-collection-status.sql  Schema additions for collection tracking

Database Columns

connection_state:

  • collection_status: idle | collecting | filling | upkeep
  • collection_progress — JSONB with { total, completed, current, percent }
  • collection_completed_at — timestamp when all contacts processed

contacts:

  • collection_completed — boolean, true when fully archived
  • messages_archived_count — number of messages saved
  • last_collected_at — timestamp of last collection run

gateway.slices:

  • collection_status — mirrors the slice’s collection state

Boundary Markers

Every collected contact gets a boundary marker message:

  • wa_id: boundary_no_earlier_{chatId}
  • message_type: boundary
  • from_wa_id: system
  • body: “No earlier messages available on the WhatsApp Web app…”
  • timestamp: epoch 1 (sorts to beginning of chat)
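The fields above can be assembled as in the sketch below. buildBoundaryMarker and the BoundaryMarker interface are hypothetical names. The body text is passed in rather than hard-coded because the full string is truncated in this document, and the unit of the timestamp value 1 (seconds or milliseconds) is not stated, so the sketch stores the bare number.

```typescript
interface BoundaryMarker {
  wa_id: string;
  message_type: "boundary";
  from_wa_id: "system";
  body: string;
  timestamp: number;
}

function buildBoundaryMarker(chatId: string, body: string): BoundaryMarker {
  return {
    wa_id: `boundary_no_earlier_${chatId}`,
    message_type: "boundary",
    from_wa_id: "system",
    body,
    timestamp: 1, // epoch 1: earlier than any real message, so it sorts first
  };
}
```

Because wa_id is derived from the chat ID, inserting the marker twice for the same chat produces the same wa_id, which makes it straightforward to deduplicate if wa_id is unique in the messages table (an assumption about the schema).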

Smart Scrape Gating

The full sidebar scrape (scrapeAllContacts()) only runs if:

  • Contacts table is empty, OR
  • last_connected_at is more than 10 minutes ago, OR
  • last_connected_at is null

Quick restarts within 10 minutes skip the 90-second sidebar scroll.
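The gating conditions can be expressed as one predicate. A minimal sketch: shouldRunFullScrape and its parameters are hypothetical, and it assumes the caller has already read the contact count and last_connected_at from the database.

```typescript
const SCRAPE_GAP_MS = 10 * 60 * 1000; // 10 minutes

function shouldRunFullScrape(
  contactCount: number,
  lastConnectedAt: Date | null,
  now: Date = new Date(),
): boolean {
  if (contactCount === 0) return true;       // contacts table is empty
  if (lastConnectedAt === null) return true; // never connected before
  // Strictly more than 10 minutes since the last connection.
  return now.getTime() - lastConnectedAt.getTime() > SCRAPE_GAP_MS;
}
```

A restart five minutes after the last connection, with a populated contacts table, returns false and skips the sidebar scroll.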