The Collector
The Collector is the primary data ingestion engine of Swarm Nexus. It is a Node.js service that uses Puppeteer to automate scraping, ensuring that valuable community conversations are captured and stored in the Swarm Memory.
Core Responsibilities
Data Ingestion: The Collector's main job is to find and process mentions of @SwarmNexus on Twitter (X).
Context Hydration: It doesn't just save the mention itself; it fetches the entire parent thread to capture the full context of the conversation.
Metadata Extraction: It parses the text to identify and extract key metadata, such as cryptocurrency tickers ($TICKER) and other relevant patterns.
Database Writing: It is the only component with write access to the production database, where it upserts data into the predictions and tickers tables.
Health Reporting: After every successful collection cycle, it writes to a collector_heartbeat.json file. This file is monitored by a separate process to ensure the Collector is alive and functioning correctly.
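Two of the responsibilities above, metadata extraction and health reporting, can be illustrated with a minimal sketch. The ticker pattern (a dollar sign followed by 2 to 10 letters) and the heartbeat fields lastSuccess and mentionsProcessed are assumptions made for illustration; the Collector's actual pattern and file layout may differ.

```javascript
// Assumed pattern: "$" followed by 2-10 letters, not glued to a preceding
// word character or another "$". Prices such as "$50" do not match.
const TICKER_PATTERN = /(?<![A-Za-z0-9_$])\$([A-Za-z]{2,10})\b/g;

// Returns the unique tickers found in a tweet, upper-cased, in order of
// first appearance.
function extractTickers(text) {
  const seen = new Set();
  for (const match of text.matchAll(TICKER_PATTERN)) {
    seen.add(match[1].toUpperCase());
  }
  return [...seen];
}

// Builds the JSON body for collector_heartbeat.json.
// The field names are hypothetical.
function buildHeartbeat(mentionsProcessed, now = new Date()) {
  return JSON.stringify({
    lastSuccess: now.toISOString(),
    mentionsProcessed,
  });
}

// What a monitoring process might do: treat the Collector as unhealthy when
// the last successful cycle is older than maxAgeMs.
function isHeartbeatStale(heartbeatJson, maxAgeMs, now = new Date()) {
  const { lastSuccess } = JSON.parse(heartbeatJson);
  return now.getTime() - Date.parse(lastSuccess) > maxAgeMs;
}
```

In a real cycle, the string returned by buildHeartbeat would be written to collector_heartbeat.json only after the database write succeeds, so a stale file signals a stuck or failing Collector.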
How It Works
Authentication: The Collector logs into Twitter using a set of pre-configured browser cookies from a dedicated scraping account.
Polling: It polls Twitter's notifications endpoint at a configurable interval (e.g., every 60 seconds).
Scraping: Using Puppeteer, it controls a headless browser to navigate the Twitter UI, read notifications, and click through to parent tweets.
Data Processing: The raw HTML is parsed, and the relevant text, authors, timestamps, and URLs are extracted and structured.
Database Upsert: The structured data is written to the swarm.db SQLite database. An "upsert" operation is used to avoid creating duplicate entries.
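The upsert step can be sketched as a helper that builds an SQLite INSERT ... ON CONFLICT statement. This is an illustration only: the helper name buildUpsertSql and the column names tweet_id, author, and text are hypothetical, since the schema of the predictions and tickers tables is not described here. The statement also assumes a UNIQUE or PRIMARY KEY constraint on the key column.

```javascript
// Builds a parameterized SQLite upsert: insert a row, or update the
// non-key columns when a row with the same key already exists.
// Table and column names are interpolated directly, so they must be
// trusted constants, never user input.
function buildUpsertSql(table, keyColumn, columns) {
  const allColumns = [keyColumn, ...columns];
  const placeholders = allColumns.map(() => '?').join(', ');
  const insert =
    `INSERT INTO ${table} (${allColumns.join(', ')}) VALUES (${placeholders})`;

  if (columns.length === 0) {
    // Nothing to update: silently skip rows that already exist.
    return `${insert} ON CONFLICT(${keyColumn}) DO NOTHING`;
  }

  // "excluded" refers to the row that the INSERT tried to add.
  const updates = columns.map((c) => `${c} = excluded.${c}`).join(', ');
  return `${insert} ON CONFLICT(${keyColumn}) DO UPDATE SET ${updates}`;
}

// Example with hypothetical columns for the predictions table.
const sql = buildUpsertSql('predictions', 'tweet_id', ['author', 'text']);
console.log(sql);
```

Because the conflict target is the key column, processing the same mention in two consecutive cycles updates the existing row instead of inserting a duplicate.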
Configuration
The Collector's behavior is controlled by environment variables defined in /opt/swarmnexus/collector/.env.
SOURCE: The scraping source (e.g., notifications). Default: notifications
POLL_SECONDS: The interval in seconds between scraping cycles. Default: 60
MAX_MENTIONS_PER_CYCLE: The maximum number of mentions to process in a single run. Default: 80
BACKOFF_BASE_MS: The base time in milliseconds for exponential backoff if rate-limited. Default: 120000
TW_COOKIES_PATH: The path to the Twitter cookies file. Default: twitter-cookies.json
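As a rough sketch, these variables could be read and validated as follows, treating the values listed above as defaults. The function names loadConfig and backoffDelayMs, the doubling factor, and the 30-minute cap are assumptions for illustration; the text only states that BACKOFF_BASE_MS is the base of an exponential backoff.

```javascript
// Values from the list above, treated here as defaults.
const DEFAULTS = {
  SOURCE: 'notifications',
  POLL_SECONDS: '60',
  MAX_MENTIONS_PER_CYCLE: '80',
  BACKOFF_BASE_MS: '120000',
  TW_COOKIES_PATH: 'twitter-cookies.json',
};

// Reads one variable as a positive integer, falling back to its default.
function readInt(env, name) {
  const value = Number.parseInt(env[name] ?? DEFAULTS[name], 10);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${name} must be a positive integer`);
  }
  return value;
}

// Builds a typed config object from an environment-like object.
function loadConfig(env = process.env) {
  return {
    source: env.SOURCE ?? DEFAULTS.SOURCE,
    pollSeconds: readInt(env, 'POLL_SECONDS'),
    maxMentionsPerCycle: readInt(env, 'MAX_MENTIONS_PER_CYCLE'),
    backoffBaseMs: readInt(env, 'BACKOFF_BASE_MS'),
    cookiesPath: env.TW_COOKIES_PATH ?? DEFAULTS.TW_COOKIES_PATH,
  };
}

// Assumed backoff policy: double the delay on each consecutive rate-limited
// attempt (attempt 0 waits baseMs), capped at maxMs.
function backoffDelayMs(baseMs, attempt, maxMs = 30 * 60 * 1000) {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}
```

With the default base of 120000 ms, this policy waits 2 minutes after the first rate limit, then 4, 8, and 16 minutes, and never more than 30 minutes.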