Discovery Sources API¶
All sources implement BaseSource and are designed to run concurrently via asyncio.gather().
Base class¶
coldreach.sources.base ¶
Abstract base class for all ColdReach email discovery sources.
Every source follows the same contract:

- Accepts a domain (and optionally a person name hint)
- Returns a list of SourceResult objects
- Never raises — catches its own errors and returns an empty list with a note
Sources are designed to run concurrently via asyncio.gather().
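For illustration, a minimal fan-out sketch. It assumes only the documented contract: `run()` never raises and returns a `(results, summary)` pair (see `run` below).

```python
import asyncio

from coldreach.sources.base import BaseSource, SourceResult

# Fan-out sketch: run every source concurrently and flatten the
# results. Relies on run() returning a (results, summary) pair and
# never raising, as documented below.
async def discover(domain: str, sources: list[BaseSource]) -> list[SourceResult]:
    outcomes = await asyncio.gather(*(s.run(domain) for s in sources))
    emails: list[SourceResult] = []
    for results, _summary in outcomes:
        emails.extend(results)
    return emails
```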
SourceResult dataclass ¶
A single email address found by a source.
Attributes:

| Name | Type | Description |
|---|---|---|
| `email` | `str` | The discovered email address (raw, not yet normalised). |
| `source` | `EmailSource` | Which source found this email. |
| `url` | `str` | The page or endpoint where the email was found. |
| `context` | `str` | Surrounding text snippet that contained the email. |
| `confidence_hint` | `int` | Optional score delta hint from the source (0 = no hint). |
SourceSummary dataclass ¶
Execution summary for one source run.
BaseSource ¶
Bases: ABC
Abstract base for all email discovery sources.
Subclasses must implement `fetch`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `timeout` | `float` | HTTP / subprocess timeout in seconds. | `10.0` |
Source code in coldreach/sources/base.py
fetch abstractmethod async ¶

Discover emails for `domain`.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `domain` | `str` | The target domain, e.g. `acme.com`. | *required* |
| `person_name` | `str \| None` | Optional full name hint for pattern-based sources. | `None` |

Returns:

| Type | Description |
|---|---|
| `list[SourceResult]` | Discovered email addresses. Empty list if nothing found or on error. |
Source code in coldreach/sources/base.py
run async ¶

Safe wrapper around `fetch` — never raises.
Returns the results list and a summary suitable for logging/display.
Source code in coldreach/sources/base.py
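A hypothetical subclass sketch; the class name and body are illustrative, and the `fetch` signature follows the table above.

```python
from coldreach.sources.base import BaseSource, SourceResult

# Hypothetical subclass illustrating the contract. fetch() may raise;
# run() is the safe wrapper that catches errors and returns an empty
# list with a note.
class SitemapSource(BaseSource):
    async def fetch(
        self, domain: str, person_name: str | None = None
    ) -> list[SourceResult]:
        results: list[SourceResult] = []
        # ...perform I/O, extract emails, append SourceResult objects...
        return results

# Callers prefer run() over fetch():
#     results, summary = await SitemapSource().run("acme.com")
```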
Website crawler¶
coldreach.sources.web_crawler ¶
Website crawler source.
Crawls the target domain's public pages (homepage, /contact, /team, /about, /people, /staff, /leadership) and extracts email addresses using:

- RFC 5322-compliant regex
- mailto: link href extraction
- Common obfuscation patterns ([at], (at), " at ", [dot], etc.)
Uses httpx for async HTTP (no Playwright/JS required for most B2B sites). Falls back gracefully on SSL errors, redirects, and timeouts.
Source priority for scoring: contact page > team page > generic.
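An illustrative sketch of the extraction pass. The patterns below are simplified stand-ins, not the exact ones in web_crawler.py, and the address regex is deliberately looser than full RFC 5322.

```python
import re

# Normalise common obfuscations first, then apply a plain address regex.
_AT = re.compile(r"(?:\s*\[at\]\s*|\s*\(at\)\s*|\s+at\s+)", re.IGNORECASE)
_DOT = re.compile(r"(?:\s*\[dot\]\s*|\s*\(dot\)\s*|\s+dot\s+)", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> list[str]:
    # "jane [at] acme [dot] com" -> "jane@acme.com", then extract.
    text = _DOT.sub(".", _AT.sub("@", text))
    return EMAIL_RE.findall(text)
```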
WebCrawlerSource ¶
Bases: BaseSource
Crawl company website pages to find email addresses.
Fetches the homepage plus a fixed list of high-value paths (/contact, /team, /about, etc.) concurrently and extracts emails from each page.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `timeout` | `float` | Per-request HTTP timeout in seconds. | `10.0` |
| `max_pages` | `int` | Maximum number of pages to fetch (homepage + paths). | `15` |
| `follow_homepage_links` | `bool` | If True, also parse any internal links on the homepage that match the high-value path patterns. | `True` |
Source code in coldreach/sources/web_crawler.py
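A usage sketch based on the parameters above and the `(results, summary)` pair returned by `BaseSource.run()`.

```python
import asyncio

from coldreach.sources.web_crawler import WebCrawlerSource

async def main() -> None:
    source = WebCrawlerSource(timeout=10.0, max_pages=15, follow_homepage_links=True)
    results, summary = await source.run("acme.com")
    for r in results:
        print(r.email, r.url)

asyncio.run(main())
```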
WHOIS¶
coldreach.sources.whois_source ¶
WHOIS source — extract registrant/admin/tech contact emails from WHOIS records.
Uses python-whois (synchronous) run in an executor to avoid blocking the event loop. WHOIS data often contains registrant email, admin email, and tech contact email — these are high-value leads for small companies.
Note: Many large companies use privacy-protecting WHOIS proxies (e.g. domains@squarespace.com). These are filtered out via a blocklist.
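A sketch of the executor pattern described above, using python-whois's `whois.whois()`. The handling of the `emails` field (str, list, or None) reflects typical python-whois output and is an assumption about this module's internals.

```python
import asyncio

import whois  # python-whois (synchronous)

# Run the blocking lookup off the event loop and apply a timeout.
async def whois_emails(domain: str, timeout: float = 10.0) -> list[str]:
    loop = asyncio.get_running_loop()
    record = await asyncio.wait_for(
        loop.run_in_executor(None, whois.whois, domain),
        timeout=timeout,
    )
    emails = record.get("emails") or []  # str, list, or None
    return [emails] if isinstance(emails, str) else list(emails)
```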
WhoisSource ¶
Bases: BaseSource
Fetch WHOIS registrant contact emails for a domain.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `timeout` | `float` | Timeout passed to the thread executor (seconds). | `10.0` |
Source code in coldreach/sources/whois_source.py
GitHub¶
coldreach.sources.github ¶
GitHub source — mine commit author emails for a company domain.
Strategy:

1. Search GitHub for repos whose owner matches the company domain slug (e.g. "stripe.com" → org slug "stripe")
2. Fetch recent commits from top repos → extract author.email
3. Filter to emails belonging to the target domain
Uses the public GitHub REST API (unauthenticated: 60 req/hr, authenticated: 5000/hr). Set COLDREACH_GITHUB_TOKEN in .env for higher rate limits.
Rate limit handling:

- Checks the X-RateLimit-Remaining header
- Stops early if remaining < 5 to avoid exhausting the allowance
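A sketch of this guard. The endpoint, header, and commit-author path are standard GitHub REST API; the early-stop threshold of 5 matches the behaviour described above.

```python
import httpx

async def recent_commit_emails(repo: str, domain: str, token: str | None = None) -> list[str]:
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.get(
            f"https://api.github.com/repos/{repo}/commits",
            headers=headers,
            params={"per_page": 30},
        )
        if int(resp.headers.get("X-RateLimit-Remaining", "0")) < 5:
            return []  # stop early to preserve the allowance
        resp.raise_for_status()
        # author emails live at commit.author.email in each entry
        return [
            email
            for c in resp.json()
            if (email := c["commit"]["author"]["email"]).endswith(f"@{domain}")
        ]
```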
GitHubSource ¶
Bases: BaseSource
Mine public GitHub commits for company domain email addresses.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `token` | `str \| None` | Optional GitHub personal access token for higher rate limits. | `None` |
| `timeout` | `float` | Per-request HTTP timeout in seconds. | `10.0` |
Source code in coldreach/sources/github.py
Reddit¶
coldreach.sources.reddit ¶
Reddit source — search Reddit posts/comments for company contact emails.
Uses the public Reddit JSON API — no authentication required. Rate limit: 1 request per second (enforced via asyncio.sleep).
Queries:

1. Search for "@domain.com" mentions across all of Reddit
2. Search for "company name" + "email" or "contact"
Both queries parse post titles, selftext, and comment bodies for email patterns matching the target domain.
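A sketch of the first query, assuming the standard `search.json` shape of the public Reddit API; the User-Agent string is illustrative.

```python
import asyncio
import re

import httpx

async def search_reddit(domain: str, max_results: int = 25) -> list[str]:
    email_re = re.compile(rf"[A-Za-z0-9._%+-]+@{re.escape(domain)}")
    async with httpx.AsyncClient(
        timeout=10.0, headers={"User-Agent": "coldreach-sketch/0.1"}
    ) as client:
        resp = await client.get(
            "https://www.reddit.com/search.json",
            params={"q": f'"@{domain}"', "limit": max_results},
        )
        resp.raise_for_status()
        posts = resp.json()["data"]["children"]
        await asyncio.sleep(1.0)  # 1 req/s rate limit
    found: list[str] = []
    for post in posts:
        data = post["data"]
        text = f'{data.get("title", "")} {data.get("selftext", "")}'
        found.extend(email_re.findall(text))
    return found
```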
RedditSource ¶
Bases: BaseSource
Search Reddit for company email addresses via the public JSON API.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `timeout` | `float` | Per-request HTTP timeout in seconds. | `10.0` |
| `max_results` | `int` | Maximum number of Reddit posts to inspect per query. | `25` |
Source code in coldreach/sources/reddit.py
Search engine¶
coldreach.sources.search_engine ¶
Search engine source — queries SearXNG (self-hosted) for domain email mentions.
Fallback chain (in order):

1. SearXNG local instance (http://localhost:8080 by default)
2. DuckDuckGo Lite HTML (no JS, scrapeable, no auth)
3. Brave Search API (free tier: 2000 req/month, requires API key)

Queries run against the target domain:

- "@domain.com" — direct email format search
- site:domain.com email contact — on-site contact pages
- "domain.com" email — general mentions

Rate limiting:

- SearXNG: 1 req/3s per query (configurable)
- DDG Lite: 1 req/5s (more aggressive block)
- Brave: respects the X-RateLimit-Remaining header
All results are filtered to emails belonging to the target domain.
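A sketch of step 1 of the chain (SearXNG's JSON output via `/search?format=json` must be enabled on the instance); the DDG Lite and Brave fallbacks are indicated as comments only, since their request shapes vary.

```python
import httpx

async def search_fallback(query: str, searxng_url: str | None) -> list[dict]:
    if searxng_url:
        try:
            async with httpx.AsyncClient(timeout=10.0) as client:
                resp = await client.get(
                    f"{searxng_url}/search",
                    params={"q": query, "format": "json"},
                )
                resp.raise_for_status()
                return resp.json().get("results", [])
        except httpx.HTTPError:
            pass  # fall through to the next engine
    # 2. scrape DuckDuckGo Lite HTML  3. call the Brave Search API
    return []
```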
SearchEngineSource ¶
    SearchEngineSource(
        searxng_url=_SEARXNG_DEFAULT,
        brave_api_key=None,
        query_delay=3.0,
        timeout=10.0,
    )
Bases: BaseSource
Search for domain email addresses via SearXNG → DDG Lite → Brave fallback chain.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `searxng_url` | `str \| None` | URL of the local SearXNG instance. Set to None to skip SearXNG. | `_SEARXNG_DEFAULT` |
| `brave_api_key` | `str \| None` | Brave Search API key (free: 2000 req/month). Set to None to skip. | `None` |
| `query_delay` | `float` | Seconds to wait between queries to avoid rate limiting. | `3.0` |
| `timeout` | `float` | Per-request HTTP timeout in seconds. | `10.0` |
Source code in coldreach/sources/search_engine.py
theHarvester¶
coldreach.sources.harvester ¶
theHarvester source — HTTP REST API client.
The coldreach-theharvester Docker container runs restfulHarvest, which exposes a REST API on port 5050. We call it directly with httpx instead of docker exec (which failed silently because the container's entrypoint is the API server, not the CLI).
API endpoint: `GET http://localhost:5050/query?domain=acme.com&source=duckduckgo,bing,crtsh&limit=500`
Swagger docs (when container is running): http://localhost:5050/docs
Response JSON: `{ "emails": ["user@acme.com", ...], "hosts": [...], "interesting_urls": [...], ... }`
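A minimal client sketch for this endpoint; the parameter names and the `emails` key come from the examples above.

```python
import httpx

async def harvest_emails(domain: str) -> list[str]:
    # theHarvester fans out to several external APIs, so allow a
    # generous timeout (the default documented below is 240s).
    async with httpx.AsyncClient(timeout=240.0) as client:
        resp = await client.get(
            "http://localhost:5050/query",
            params={
                "domain": domain,
                "source": "duckduckgo,bing,crtsh",
                "limit": 500,
            },
        )
        resp.raise_for_status()
        return resp.json().get("emails", [])
```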
Free sources (no API key needed): duckduckgo, yahoo, bing, baidu, crtsh, certspotter, hackertarget, rapiddns, dnsdumpster, urlscan, otx, robtex
Excluded (slow / decommissioned):

- commoncrawl — terabyte dataset, queries take 10+ minutes
- waybackarchive — rarely contains emails, very slow
- thc — unreliable timeouts
- threatcrowd — decommissioned
Service: docker compose up theharvester
HarvesterSource ¶
    HarvesterSource(
        api_base=_API_BASE,
        sources=None,
        limit=500,
        timeout=240.0,
        container="coldreach-theharvester",
        max_wait=240.0,
        harvester_sources=None,
    )
Bases: BaseSource
Discover emails via the theHarvester REST API (localhost:5050).
Calls GET /query?domain=...&source=...&limit=... on the running coldreach-theharvester container. No docker exec needed — the container's REST server is the correct integration point.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_base` | `str` | Base URL of the theHarvester REST server. | `_API_BASE` |
| `sources` | `str \| None` | Comma-separated source names. Defaults to all free sources. | `None` |
| `limit` | `int` | Maximum results per source query. | `500` |
| `timeout` | `float` | HTTP request timeout in seconds. theHarvester queries several external APIs, so allow generous time. | `240.0` |
Source code in coldreach/sources/harvester.py
SpiderFoot¶
coldreach.sources.spiderfoot ¶
SpiderFoot source — REST API client with incremental result streaming.
Uses SpiderFoot's built-in CherryPy REST API (localhost:5001) instead of docker exec. This gives us real-time results as the scan progresses:
- POST /startscan → get scan_id immediately
- Poll GET /scaneventresults?id=SCAN_ID&eventType=EMAILADDR every 15s
- Emit each new email as it appears (streaming-friendly)
- GET /stopscan?id=SCAN_ID when done/timeout reached
Why not docker exec sf.py?

- sf.py outputs ALL results only at the very end (no streaming)
- The scan continues running in SpiderFoot's DB even after we kill docker exec
- No way to cancel from Python side
- REST API solves all three problems
Key endpoints confirmed working (SpiderFoot v4.0.0):

- GET /ping → ["SUCCESS", "4.0.0"]
- GET /scanlist → list of all scans
- POST /startscan (form data) → ["SUCCESS", "SCAN_ID"]
- GET /scaneventresults?id=ID&eventType=X → [[...], [...]] rows
- GET /scanstatus?id=ID → [name, target, started, ...]
- GET /stopscan?id=ID → stops the scan
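A lifecycle sketch built on these endpoints. The `/startscan` form field names and the result-row column index are assumptions; the endpoints, poll interval, and response shapes are the ones documented on this page.

```python
import asyncio

import httpx

async def spiderfoot_emails(domain: str, max_wait: float = 300.0) -> set[str]:
    base = "http://localhost:5001"
    emails: set[str] = set()
    async with httpx.AsyncClient(timeout=30.0) as client:
        resp = await client.post(f"{base}/startscan", data={
            "scanname": f"coldreach-{domain}",  # assumed field name
            "scantarget": domain,               # assumed field name
            "modulelist": "sfp_pgp,sfp_whois",  # assumed field/format
            "typelist": "",
            "usecase": "",
        })
        _status, scan_id = resp.json()  # ["SUCCESS", "SCAN_ID"]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + max_wait
        try:
            while loop.time() < deadline:
                await asyncio.sleep(15)  # poll interval from the docs
                rows = (await client.get(
                    f"{base}/scaneventresults",
                    params={"id": scan_id, "eventType": "EMAILADDR"},
                )).json()
                emails.update(row[1] for row in rows)  # data column (assumed index)
        finally:
            await client.get(f"{base}/stopscan", params={"id": scan_id})
    return emails
```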
Modules used (fast, no API keys, effective):

- sfp_pgp — PGP keyservers: finds 20+ emails per domain in ~60s
- sfp_emailformat — email-format.com database: instant
- sfp_whois — WHOIS registrant: instant
- sfp_email — extracts emails from any content fed by other modules
- sfp_citadel — breach/enrichment databases (free sources: PeopleDataLabs)
SpiderFootSource ¶
    SpiderFootSource(
        api_base=_API_BASE,
        max_wait=_MAX_SCAN_SECONDS,
        container="coldreach-spiderfoot",
        timeout=30.0,
    )
Bases: BaseSource
Email discovery via SpiderFoot REST API with incremental streaming.
Creates a scan via SpiderFoot's REST API, polls for results every 15s, and yields emails as they are found. Stops the scan when done or after the configured timeout.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `api_base` | `str` | SpiderFoot web server base URL. | `_API_BASE` |
| `max_wait` | `float` | Maximum seconds before the scan is aborted. | `_MAX_SCAN_SECONDS` |
| `timeout` | `float` | HTTP request timeout for individual API calls. | `30.0` |
Source code in coldreach/sources/spiderfoot.py
fetch async ¶
Run scan and return all found emails (blocking until done or timeout).
Source code in coldreach/sources/spiderfoot.py
fetch_stream async ¶
Stream email results as they are found by SpiderFoot.
Yields a SourceResult each time SpiderFoot discovers a new email, allowing callers to forward results to SSE streams immediately.
Source code in coldreach/sources/spiderfoot.py
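A usage sketch, assuming `fetch_stream()` takes the same `domain` argument as `fetch()`.

```python
from coldreach.sources.spiderfoot import SpiderFootSource

# Forward each result as it arrives, e.g. to an SSE channel.
async def stream_emails(domain: str) -> None:
    source = SpiderFootSource()
    async for result in source.fetch_stream(domain):
        print(result.email, result.url)  # push to the SSE stream here
```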