CYFARE/CrawlKit

License: AGPL-3.0  Version: 3.0

Advanced Mass Web Crawler For ClearWeb & DeepWeb

Features

  • Dual-scope crawling -- clearweb (cw) and deepweb/Tor (dw) modes
  • High concurrency -- configurable async workers with per-host connection limits
  • Crawl modes -- full_crawl (all URLs) or unique_domains (one URL per domain)
  • Depth control -- limit crawl depth or crawl without bounds
  • URL filtering -- regex-based include/exclude patterns
  • Multiple export formats -- JSON, JSONL, CSV, SQLite
  • Session persistence -- save and resume crawls across restarts
  • Web Admin panel -- multi-job campaign management with authentication
  • Live Web UI -- real-time stats dashboard via WebSocket
  • Domain graph -- tracks link relationships between domains

Quick Start

# Install
pip install -e .

# Crawl from a seed file
crawlkit crawl -s seeds.txt --scope cw

# Crawl specific URLs
crawlkit crawl --url https://example.com --scope cw

# Resume a previous crawl
crawlkit resume

# Launch the web admin panel
crawlkit webadmin
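
A fuller invocation that combines several of the options documented in the CLI Reference below. The seed file name, output directory, and the repeated -f flag for multiple export formats are illustrative assumptions, not verbatim project examples:

# Deepweb crawl, one URL per domain, capped depth, JSONL + SQLite output, live stats UI
crawlkit crawl -s onion_seeds.txt --scope dw --mode unique_domains --max-depth 3 \
  -f jsonl -f sqlite -o results/ --webui --webui-port 8080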

Configuration

CrawlKit reads crawlkit.toml from the working directory. CLI flags override config file values.

[crawl]
concurrency = 20
timeout = 20
scope = "dw"                    # "dw" (deepweb/.onion) or "cw" (clearweb)
mode = "full_crawl"             # "full_crawl" or "unique_domains"
max_depth = 0                   # 0 = unlimited
user_agent = "Mozilla/5.0 (Windows NT 10.0; rv:102.0) Gecko/20100101 Firefox/102.0"
include_pattern = ""            # regex, empty = match all
exclude_pattern = ""            # regex, empty = exclude none

[output]
formats = ["json"]
directory = "."

[session]
auto_save = true
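
With the config above, a single flag is enough to change one setting for a run; any option not passed on the command line keeps its crawlkit.toml value. The seed file name below is illustrative:

# Overrides scope = "dw" and concurrency = 20 from crawlkit.toml for this run only
crawlkit crawl -s seeds.txt --scope cw --concurrency 50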

CLI Reference

crawlkit crawl

Flag                 Description
-s, --seeds          Seed file(s) with one URL per line
--url                Direct seed URL(s)
-c, --concurrency    Number of concurrent workers
--timeout            Request timeout in seconds
--scope              cw (clearweb) or dw (deepweb)
--mode               full_crawl or unique_domains
--max-depth          Maximum crawl depth (0 = unlimited)
-f, --format         Output format(s): json, jsonl, csv, sqlite
-o, --output-dir     Output directory
--include            Regex pattern -- only crawl matching URLs
--exclude            Regex pattern -- skip matching URLs
--config             Path to config file (default: crawlkit.toml)
--webui              Enable live stats Web UI
--webui-port         Port for the Web UI
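
The include/exclude filters take regular expressions matched against discovered URLs. The patterns below are only an illustration of the intended usage, not patterns taken from the project:

# Crawl only blog/docs pages on the clearweb, skipping common binary assets
crawlkit crawl --url https://example.com --scope cw \
  --include "/(blog|docs)/" --exclude "\.(jpg|png|zip|pdf)$"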

crawlkit resume

Resume from a session file. Pass the path as an argument or let CrawlKit find the latest session automatically.
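
For example (the session file path here is hypothetical; with no argument, CrawlKit picks the most recent session):

# Resume a specific session
crawlkit resume sessions/crawl-session.json

# Resume the latest session automatically
crawlkit resume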

crawlkit webadmin

Flag           Description
--port         Server port (default: 8471)
--host         Bind address (default: 127.0.0.1)
--username     Admin username (default: admin)
--password     Admin password (auto-generated if omitted)
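
A typical invocation, exposing the panel on all interfaces with explicit credentials; the values are illustrative and the password is a placeholder you should replace:

# Web admin on 0.0.0.0:8471 with a fixed username and password
crawlkit webadmin --host 0.0.0.0 --port 8471 --username admin --password "change-me"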

Dependencies

  • aiohttp -- async HTTP client/server
  • beautifulsoup4 + lxml -- HTML parsing
  • tldextract -- domain extraction
  • rich -- terminal UI
  • typer -- CLI framework
  • aiosqlite -- async SQLite

Documentation

See the docs/ directory for detailed guides.

Security

To report a vulnerability, see SECURITY.md.

License

GNU Affero General Public License v3 (AGPL-3.0). See LICENSE.
