The Collector

The Collector is the primary data ingestion engine of Swarm Nexus. It is a Node.js service that uses Puppeteer to automate scraping, ensuring that valuable community conversations are captured and stored in the Swarm Memory.

Core Responsibilities

  • Data Ingestion: The Collector's main job is to find and process mentions of @SwarmNexus on Twitter (X).

  • Context Hydration: It doesn't just save the mention itself; it fetches the entire parent thread to capture the full context of the conversation.

  • Metadata Extraction: It parses the text to identify and extract key metadata, such as cryptocurrency tickers ($TICKER) and other relevant patterns.

  • Database Writing: It is the only component with write access to the production database, where it upserts data into the predictions and tickers tables.

  • Health Reporting: After every successful collection cycle, it writes to a collector_heartbeat.json file. This file is monitored by a separate process to ensure the Collector is alive and functioning correctly.
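Two of the responsibilities above, metadata extraction and health reporting, can be illustrated with a minimal Node.js sketch. It is not the Collector's actual code: the ticker pattern, the heartbeat fields (`lastSuccess`, `mentionsProcessed`), and the helper names `extractTickers`, `writeHeartbeat`, and `isHeartbeatFresh` are all assumptions made for illustration.

```javascript
const fs = require("node:fs");

// Assumed ticker shape: "$" followed by a letter and up to 9 more letters or digits.
// The Collector's real pattern is not documented here.
const TICKER_PATTERN = /\$([A-Za-z][A-Za-z0-9]{0,9})\b/g;

// Return the unique tickers mentioned in a tweet, upper-cased, in first-seen order.
function extractTickers(text) {
  const seen = new Set();
  for (const match of text.matchAll(TICKER_PATTERN)) {
    seen.add(match[1].toUpperCase());
  }
  return [...seen];
}

// Write the heartbeat via a temp file and a rename, so the monitoring
// process never reads a half-written file. Field names are hypothetical.
function writeHeartbeat(filePath, stats, now = new Date()) {
  const payload = { lastSuccess: now.toISOString(), ...stats };
  const tmpPath = `${filePath}.tmp`;
  fs.writeFileSync(tmpPath, JSON.stringify(payload, null, 2));
  fs.renameSync(tmpPath, filePath);
  return payload;
}

// Monitor side: the heartbeat counts as fresh if it is at most maxAgeMs old.
function isHeartbeatFresh(filePath, maxAgeMs, now = new Date()) {
  try {
    const { lastSuccess } = JSON.parse(fs.readFileSync(filePath, "utf8"));
    const ageMs = now.getTime() - Date.parse(lastSuccess);
    return Number.isFinite(ageMs) && ageMs <= maxAgeMs;
  } catch {
    return false; // a missing or unreadable file is treated as unhealthy
  }
}
```

A monitor built this way would likely pick a `maxAgeMs` of a few polling intervals, so that a single slow cycle does not trigger an alert.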

How It Works

  1. Authentication: The Collector logs into Twitter using a set of pre-configured browser cookies from a dedicated scraping account.

  2. Polling: It polls Twitter's notifications endpoint at a configurable interval, set by POLL_SECONDS (60 seconds by default).

  3. Scraping: Using Puppeteer, it controls a headless browser to navigate the Twitter UI, read notifications, and click through to parent tweets.

  4. Data Processing: The raw HTML is parsed, and the relevant text, authors, timestamps, and URLs are extracted and structured.

  5. Database Upsert: The structured data is written to the swarm.db SQLite database using an upsert (insert a new row, or update the existing one), which avoids creating duplicate entries.
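The last two steps can be sketched as follows. This is an illustration under stated assumptions, not the Collector's real schema or code: the column names (`tweet_id`, `author`, `text`, `url`, `created_at`), the shape of the scraped `mention` object, and the helper `toPredictionParams` are hypothetical. The `INSERT ... ON CONFLICT ... DO UPDATE` form is standard SQLite upsert syntax and requires a UNIQUE or PRIMARY KEY constraint on the conflict column.

```javascript
// Hypothetical columns; the real predictions schema is not shown in this document.
// tweet_id is assumed to carry a UNIQUE or PRIMARY KEY constraint.
const UPSERT_PREDICTION_SQL = `
  INSERT INTO predictions (tweet_id, author, text, url, created_at)
  VALUES (@tweetId, @author, @text, @url, @createdAt)
  ON CONFLICT(tweet_id) DO UPDATE SET
    author = excluded.author,
    text = excluded.text,
    url = excluded.url,
    created_at = excluded.created_at
`;

// Turn one scraped mention into named parameters for the statement above.
// The tweet id is taken from the ".../status/<id>" part of the tweet URL.
function toPredictionParams(mention) {
  const idMatch = /\/status\/(\d+)/.exec(mention.url);
  if (!idMatch) {
    throw new Error(`No tweet id found in URL: ${mention.url}`);
  }
  return {
    tweetId: idMatch[1],
    author: mention.author.replace(/^@/, ""),
    text: mention.text.trim(),
    url: mention.url,
    createdAt: new Date(mention.timestamp).toISOString(),
  };
}
```

Keying the upsert on the tweet id means that a mention scraped again in a later cycle would update its existing row instead of adding a second one.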

Configuration

The Collector's behavior is controlled by environment variables defined in /opt/swarmnexus/collector/.env.

| Variable | Description | Default Value |
| --- | --- | --- |
| SOURCE | The scraping source (e.g., notifications). | notifications |
| POLL_SECONDS | The interval in seconds between scraping cycles. | 60 |
| MAX_MENTIONS_PER_CYCLE | The maximum number of mentions to process in a single run. | 80 |
| BACKOFF_BASE_MS | The base time in milliseconds for exponential backoff if rate-limited. | 120000 |
| TW_COOKIES_PATH | The path to the Twitter cookies file. | twitter-cookies.json |
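As an illustration of how these variables might be consumed, the sketch below reads them with the documented defaults and derives an exponential backoff delay from BACKOFF_BASE_MS. It assumes the .env file has already been loaded into `process.env`. The helper names (`loadConfig`, `backoffDelayMs`), the doubling factor, and the 30-minute cap are assumptions, not documented Collector behavior.

```javascript
// Defaults taken from the configuration table.
const DEFAULTS = {
  SOURCE: "notifications",
  POLL_SECONDS: "60",
  MAX_MENTIONS_PER_CYCLE: "80",
  BACKOFF_BASE_MS: "120000",
  TW_COOKIES_PATH: "twitter-cookies.json",
};

// Parse a numeric setting and reject anything that is not a positive integer.
function positiveInt(name, raw) {
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return value;
}

// Build a typed config object; unset or empty variables fall back to the defaults.
function loadConfig(env = process.env) {
  const get = (name) =>
    env[name] === undefined || env[name] === "" ? DEFAULTS[name] : env[name];
  return {
    source: get("SOURCE"),
    pollSeconds: positiveInt("POLL_SECONDS", get("POLL_SECONDS")),
    maxMentionsPerCycle: positiveInt("MAX_MENTIONS_PER_CYCLE", get("MAX_MENTIONS_PER_CYCLE")),
    backoffBaseMs: positiveInt("BACKOFF_BASE_MS", get("BACKOFF_BASE_MS")),
    cookiesPath: get("TW_COOKIES_PATH"),
  };
}

// Exponential backoff: attempt 0 waits baseMs, attempt 1 waits 2x, attempt 2 waits 4x,
// up to an assumed ceiling of maxMs (30 minutes here).
function backoffDelayMs(attempt, baseMs, maxMs = 30 * 60 * 1000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

With the default base of 120000 ms, this scheme would wait 2, 4, and then 8 minutes after three consecutive rate-limit responses.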
