
Discovery Sources

ColdReach runs all enabled sources concurrently and merges results. Each source contributes a confidence_hint that feeds into the final email score.
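This fan-out-and-merge can be sketched with a thread pool. The two source functions and the keep-the-highest-hint merge rule below are illustrative assumptions, not ColdReach's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real sources: each returns a list of
# (email, confidence_hint) tuples for the target domain.
def website_source(domain):
    return [(f"hello@{domain}", 35)]

def whois_source(domain):
    return [(f"admin@{domain}", 10), (f"hello@{domain}", 10)]

def run_sources(domain, sources):
    """Run every enabled source concurrently and merge results,
    keeping the highest confidence_hint seen per address."""
    merged = {}
    with ThreadPoolExecutor() as pool:
        for results in pool.map(lambda s: s(domain), sources):
            for email, hint in results:
                merged[email] = max(merged.get(email, 0), hint)
    return merged

print(run_sources("example.com", [website_source, whois_source]))
# {'hello@example.com': 35, 'admin@example.com': 10}
```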


Website Crawler

Source ID: website/*
Requires: Nothing — built-in
Speed: Fast (1–3s per page, ~10s total)

Fetches the company's public pages and extracts email addresses using three methods:

  1. mailto: link extraction (highest quality — explicit intent)
  2. RFC 5322 email regex
  3. Obfuscated patterns (hello [at] example.com, hello(at)example.com)
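A minimal sketch of the obfuscation step — normalize known patterns, then let the standard email regex run over the result. The `[dot]` variant is an extra assumption beyond the patterns listed above:

```python
import re

# Normalize common obfuscations before applying the email regex.
OBFUSCATED = [
    (re.compile(r"\s*\[\s*at\s*\]\s*", re.I), "@"),   # hello [at] example.com
    (re.compile(r"\s*\(\s*at\s*\)\s*", re.I), "@"),   # hello(at)example.com
    (re.compile(r"\s*\[\s*dot\s*\]\s*", re.I), "."),  # example [dot] com (assumed)
]

def deobfuscate(text):
    for pattern, replacement in OBFUSCATED:
        text = pattern.sub(replacement, text)
    return text

print(deobfuscate("hello [at] example [dot] com"))  # hello@example.com
```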

Pages crawled: homepage, /contact, /contact-us, /team, /our-team, /about, /about-us, /people, /staff, /leadership, /management, /company

Confidence hints by page:

Page type            Confidence delta
Contact page         +35
Team / people page   +30
About page           +25
Homepage / other     +15

Skip with: --no-web


WHOIS

Source ID: whois
Requires: Nothing — built-in
Speed: Fast (1–2s)

Queries the WHOIS registry for the domain's registrant and administrative contact email. Often masked by privacy services (WhoisGuard, Domains by Proxy), but useful for smaller companies.
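Extracting contacts from a raw WHOIS response might look like the sketch below. The sample response is made up, and real records vary widely by registrar:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Invented sample: registrars label contacts differently, and privacy
# services substitute forwarding addresses like the third line.
SAMPLE = """\
Registrant Email: owner@example.com
Admin Email: admin@example.com
Tech Email: abuse@privacy-proxy.example
"""

def whois_contacts(raw):
    """Pull every email out of raw WHOIS text, deduplicated in order."""
    seen = []
    for email in EMAIL_RE.findall(raw):
        if email not in seen:
            seen.append(email)
    return seen

print(whois_contacts(SAMPLE))
```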

Skip with: --no-whois


GitHub

Source ID: github/commit, github/profile
Requires: Nothing — uses unauthenticated GitHub API
Speed: Fast (2–5s)

Searches GitHub for repositories matching the company domain, then mines commit history for author email addresses. Also checks organization member profiles for public email fields.

Tip

Works best for tech companies and developer tools where founders/employees commit publicly.

Rate limiting: The unauthenticated GitHub API allows 60 requests/hour, and ColdReach respects this. If you hit the limit, set a GITHUB_TOKEN env var to authenticate and raise it to 5,000 requests/hour.
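Picking up the token from the environment is a one-liner; this is a generic sketch of GitHub API headers, not ColdReach's internals:

```python
import os

def github_headers():
    """Headers for the GitHub REST API: unauthenticated calls get
    60 requests/hour; a token raises that to 5,000."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

print(github_headers())
```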

Skip with: --no-github


Reddit

Source ID: reddit
Requires: Nothing — uses Reddit JSON API
Speed: Fast (1–3s)

Searches Reddit for posts mentioning the company domain and extracts any email addresses from post bodies and comments. Useful for finding support contacts or founders who have posted publicly.
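Extraction from a Reddit listing might look like this. The payload shape mirrors Reddit's public JSON API (`data.children[].data` with `selftext` for posts and `body` for comments); the sample content is invented:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Invented sample payload in Reddit's listing shape.
listing = {
    "data": {"children": [
        {"data": {"selftext": "Reach us at support@example.com"}},
        {"data": {"body": "founder here - ping me: jane@example.com"}},
    ]}
}

def emails_from_listing(listing, domain):
    """Collect on-domain addresses from post bodies and comments."""
    found = set()
    for child in listing["data"]["children"]:
        text = child["data"].get("selftext", "") + " " + child["data"].get("body", "")
        found.update(e for e in EMAIL_RE.findall(text) if e.endswith("@" + domain))
    return sorted(found)

print(emails_from_listing(listing, "example.com"))
# ['jane@example.com', 'support@example.com']
```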

Skip with: --no-reddit


Search Engine (SearXNG / DDG)

Source ID: search
Requires: docker compose up searxng for SearXNG; DDG and Brave are fallbacks
Speed: Medium (3–8s)

Performs web searches for the domain with email-targeted queries (e.g. site:acme.com email contact). Falls back automatically:

  1. SearXNG (self-hosted, 40+ engines) — preferred
  2. DuckDuckGo Lite — if SearXNG unavailable
  3. Brave Search — final fallback
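The fallback chain above boils down to "try each provider in order, take the first that answers". A sketch with hypothetical provider functions:

```python
def search_with_fallback(query, providers):
    """Try each (name, provider) pair in order; return the first
    non-empty result set. Providers raise on failure."""
    for name, provider in providers:
        try:
            results = provider(query)
        except Exception:
            continue  # provider down - fall through to the next one
        if results:
            return name, results
    return None, []

# Faked providers: SearXNG is unreachable, DDG answers.
def searxng(q): raise ConnectionError("not running")
def ddg(q): return [f"result for {q}"]

print(search_with_fallback("site:acme.com email", [("searxng", searxng), ("ddg", ddg)]))
# ('ddg', ['result for site:acme.com email'])
```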

Skip with: --no-search


theHarvester

Source ID: osint/theharvester
Requires: docker compose up theharvester
Speed: Slow (15–60s depending on sources)

Runs theHarvester — a mature OSINT tool that searches certificate transparency logs, Bing, Google dorks, PGP keyservers, and more. ColdReach calls the theHarvester REST API running in Docker.

Best for: finding emails that have appeared in public records or search engine indexes.

Skip with: --no-harvester or --quick


SpiderFoot

Source ID: osint/spiderfoot
Requires: docker compose up spiderfoot
Speed: Slow (30–120s)

Runs SpiderFoot — a deep OSINT framework that correlates WHOIS, DNS records, social networks, threat intel feeds, and web crawling into a unified graph. ColdReach submits a scan via the SpiderFoot REST API and polls for results.

Best for: thorough investigations where accuracy matters more than speed.
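The submit-and-poll loop can be sketched as below. FINISHED is a real SpiderFoot scan status, but the two callables stand in for its REST endpoints and the timings are illustrative:

```python
import time

def poll_scan(get_status, fetch_results, timeout=120.0, interval=2.0):
    """Poll until the scan reports FINISHED, then fetch results.
    get_status/fetch_results stand in for SpiderFoot REST calls."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_status() == "FINISHED":
            return fetch_results()
        time.sleep(interval)
    raise TimeoutError("scan did not finish before timeout")

# Demo with faked endpoints: finishes on the third poll.
states = iter(["RUNNING", "RUNNING", "FINISHED"])
emails = poll_scan(lambda: next(states), lambda: ["ceo@example.com"],
                   timeout=5, interval=0.01)
print(emails)  # ['ceo@example.com']
```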

Skip with: --no-spiderfoot or --quick


Firecrawl (opt-in)

Source ID: website/*
Requires: pip install firecrawl-py + self-hosted Firecrawl server
Enable with: --firecrawl
Speed: Slow (varies by site)

Uses the Firecrawl SDK to scrape JS-rendered pages that plain httpx cannot handle. Before scraping, ColdReach fetches the site's sitemap.xml to discover the most relevant contact/about/team pages automatically. Falls back to hardcoded paths (/contact, /about, /team, etc.) if no sitemap exists.
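The sitemap-discovery step might look like this keyword filter over `<loc>` entries; the keyword list is an assumption based on the page types named above:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
KEYWORDS = ("contact", "about", "team", "people", "staff", "leadership")

SITEMAP = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
  <url><loc>https://example.com/contact</loc></url>
  <url><loc>https://example.com/our-team</loc></url>
</urlset>"""

def relevant_urls(sitemap_xml):
    """Keep only URLs whose path suggests a contact/about/team page."""
    root = ET.fromstring(sitemap_xml)
    locs = [el.text for el in root.findall(".//sm:loc", NS)]
    return [u for u in locs if any(k in u.lower() for k in KEYWORDS)]

print(relevant_urls(SITEMAP))
# ['https://example.com/contact', 'https://example.com/our-team']
```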

Firecrawl is not included in the default docker-compose.yml — it requires its own multi-service stack (API server, workers, Playwright). See github.com/mendableai/firecrawl for setup.


Crawl4AI (opt-in)

Source ID: website/*
Requires: pip install crawl4ai && crawl4ai-setup
Enable with: --crawl4ai
Speed: Medium–Slow (Playwright browser)

Uses crawl4ai to render pages with a headless Playwright browser, extracting markdown from the rendered DOM. Handles JS-heavy SPAs that return blank pages to plain httpx.

ColdReach runs crawl4ai alongside (not instead of) the built-in website crawler. Known anti-bot platforms (LinkedIn, Booking.com, etc.) are skipped automatically. Junk content (bot-block pages, "JavaScript is disabled" pages) is detected and discarded.
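A junk-content check like the one described can be a simple heuristic; the markers and length threshold below are illustrative, not ColdReach's exact list:

```python
JUNK_MARKERS = (
    "javascript is disabled",
    "enable javascript",
    "verify you are a human",
    "access denied",
)

def is_junk(markdown, min_length=80):
    """Discard near-blank renders and bot-block boilerplate."""
    text = markdown.lower().strip()
    if len(text) < min_length:
        return True  # blank or near-blank render
    return any(marker in text for marker in JUNK_MARKERS)

print(is_junk("Please enable JavaScript to continue."))  # True
```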


Pattern Generator

Source ID: generated/pattern
Requires: --name "First Last" to be specified
Speed: Instant

Not a network source — generates likely email addresses from a person's name and the company's inferred email format. See Pattern Generation for the full algorithm.
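The idea can be sketched as below for a two-part "First Last" name. The authoritative pattern list and ordering live in Pattern Generation; these five formats are common conventions, not ColdReach's exact set:

```python
def candidate_patterns(name, domain):
    """Generate common corporate address formats from a two-part
    "First Last" name (middle names are not handled here)."""
    first, last = name.lower().split()
    local_parts = [
        f"{first}.{last}",    # jane.doe
        f"{first}{last}",     # janedoe
        f"{first[0]}{last}",  # jdoe
        first,                # jane
        f"{first}_{last}",    # jane_doe
    ]
    return [f"{part}@{domain}" for part in local_parts]

print(candidate_patterns("Jane Doe", "example.com"))
```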

Patterns are verified through the pipeline like any discovered email; their initial confidence is lower than directly found emails.


Role Emails (always-on)

Source ID: generated/pattern
Requires: Nothing
Speed: Instant

ColdReach always generates common role-based email candidates for every domain:

info@, contact@, hello@, sales@, marketing@, partnerships@, press@, support@, business@, growth@

These are added with a low initial confidence score (5) and only included if not already found by a real source. They go through the full verification pipeline — with Reacher running, genuine addresses get confirmed as valid; non-existent ones resolve as undeliverable.
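The generation-plus-dedupe step amounts to a filter over the role list; a sketch under the assumption that deduplication happens by exact address match:

```python
ROLE_LOCALS = ["info", "contact", "hello", "sales", "marketing",
               "partnerships", "press", "support", "business", "growth"]

def role_candidates(domain, already_found, base_confidence=5):
    """Generate role-based candidates at low initial confidence,
    skipping addresses a real source already discovered."""
    return [
        (f"{local}@{domain}", base_confidence)
        for local in ROLE_LOCALS
        if f"{local}@{domain}" not in already_found
    ]

print(role_candidates("example.com", {"hello@example.com"}))
```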


Intelligent Search (Groq-enhanced)

Source ID: search/searxng
Requires: COLDREACH_GROQ_API_KEY in .env for best results; falls back to smart heuristics
Speed: Slow (30–60s — Groq + multi-query + URL crawling)

Multi-stage pipeline that outperforms simple keyword searches:

  1. Company context — scrapes the company homepage + SearXNG meta-descriptions (works for JS SPAs)
  2. Query generation — Groq LLM generates 6 targeted queries tailored to the company's industry, location, and type (e.g. "Snapdeal investor relations contact email India")
  3. Parallel search — runs all queries through SearXNG concurrently
  4. URL crawl — crawls both on-domain pages AND external press/media articles from results (press releases and interview articles often contain direct email addresses)
  5. Reddit search — Groq identifies 4 relevant subreddits, searches for company email mentions
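When no Groq key is present, step 2 falls back to heuristic queries. The six templates below are illustrative of that kind of fallback, not ColdReach's actual list:

```python
def heuristic_queries(domain, company=None):
    """Build email-targeted search queries from the domain alone,
    used when no LLM is available to tailor them."""
    name = company or domain.split(".")[0]
    return [
        f"site:{domain} email contact",
        f'site:{domain} "@{domain}"',
        f'"{name}" press contact email',
        f'"{name}" founder email',
        f'"@{domain}" interview press release',
        f'"{name}" investor relations contact',
    ]

print(heuristic_queries("acme.com"))
```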

Groq free tier

Groq's free tier (14,400 tokens/minute on llama-3.1-8b-instant) is more than enough. Get a key at console.groq.com. Without a Groq key, the source still runs with smart heuristic queries.


Source comparison

Source               Speed    Accuracy         Requires
Website crawler      Fast     High             Nothing
WHOIS                Fast     Low–Medium       Nothing
GitHub               Fast     Medium (tech)    Nothing
Reddit               Fast     Low              Nothing
Search engine        Medium   Medium           Nothing (DDG fallback)
Intelligent search   Slow     Medium–High      Nothing (Groq key for best results)
theHarvester         Slow     Medium–High      docker compose up theharvester
SpiderFoot           Slow     High             docker compose up spiderfoot
Firecrawl            Slow     High (JS sites)  pip install firecrawl-py + server
Crawl4AI             Medium   High (JS SPAs)   pip install crawl4ai && crawl4ai-setup
Pattern generator    Instant  Low–Medium       --name "First Last"
Role emails          Instant  Varies           Nothing (always generated)

Use --quick for most lookups

--quick skips theHarvester, SpiderFoot, and Intelligent Search, cutting a typical lookup to about 10 seconds. Standard mode runs all sources and is best for thorough discovery.