Async-first web scraping framework built on rnet (HTTP with browser impersonation) and scraper-rs (fast HTML parsing). Silkworm gives you a minimal Spider/Request/Response model, middlewares, and pipelines so you can script quick scrapes or build larger crawlers without boilerplate.
- Async engine with configurable concurrency, bounded queue backpressure (defaults to `concurrency * 10`), and per-request timeouts.
- rnet-powered HTTP client: browser impersonation, redirect following with loop detection, query merging, and proxy support via `request.meta["proxy"]` (see the sketch after this list).
- Typed spiders and callbacks that can return items or `Request` objects; `HTMLResponse` ships helper methods plus `Response.follow` to reuse callbacks.
- Middlewares: User-Agent rotation/default, proxy rotation, retry with exponential backoff + optional sleep codes, flexible delays (fixed/random/custom), and `SkipNonHTMLMiddleware` to drop non-HTML callbacks.
- Pipelines: JSON Lines, SQLite, XML (nested data preserved), and CSV (flattens dicts and lists) out of the box.
- Structured logging via `logly` (`SILKWORM_LOG_LEVEL=DEBUG`), plus periodic/final crawl statistics (requests/sec, queue size, memory, seen URLs).
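For example, a single request can be routed through a proxy by setting that meta key. A minimal sketch, assuming `Request.meta` is an ordinary dict that starts out empty; the spider and proxy URL are placeholders:

```python
from silkworm import Request, Spider


class ProxiedSpider(Spider):  # hypothetical spider, for illustration only
    name = "proxied"
    start_urls = ()

    async def start_requests(self):
        # Attach a per-request proxy via the meta key the HTTP client reads.
        # Assumes Request.meta is a plain dict; the proxy URL is a placeholder.
        request = Request(url="https://example.com/", callback=self.parse)
        request.meta["proxy"] = "http://user:pass@proxy1:8080"
        yield request

    async def parse(self, response):
        yield {"fetched_via_proxy": True}
```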
From PyPI with pip:

```bash
pip install silkworm-rs
```

From PyPI with uv (recommended for faster installs):

```bash
uv pip install --prerelease=allow silkworm-rs

# or if using uv's project management:
uv add --prerelease=allow silkworm-rs
```

Note: The `--prerelease=allow` flag is required because silkworm-rs depends on prerelease versions of some packages (e.g., rnet).

From source:

```bash
uv venv  # install uv from https://docs.astral.sh/uv/getting-started/ if needed
source .venv/bin/activate  # Windows: .venv\Scripts\activate
uv pip install --prerelease=allow -e .
```

Targets Python 3.13+; dependencies are pinned in pyproject.toml.
Define a spider by subclassing Spider, implementing parse, and yielding items or follow-up Request objects. This example writes quotes to data/quotes.jl and enables basic user agent, retry, and non-HTML filtering middlewares.
```python
from silkworm import HTMLResponse, Response, Spider, run_spider
from silkworm.middlewares import (
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import JsonLinesPipeline


class QuotesSpider(Spider):
    name = "quotes"
    start_urls = ("https://quotes.toscrape.com/",)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        html = response
        for quote in await html.select(".quote"):
            text_el = await quote.select_first(".text")
            author_el = await quote.select_first(".author")
            if text_el is None or author_el is None:
                continue
            tags = await quote.select(".tag")
            yield {
                "text": text_el.text,
                "author": author_el.text,
                "tags": [t.text for t in tags],
            }
        if next_link := await html.select_first("li.next > a"):
            yield html.follow(next_link.attr("href"), callback=self.parse)


if __name__ == "__main__":
    run_spider(
        QuotesSpider,
        request_middlewares=[UserAgentMiddleware()],
        response_middlewares=[
            SkipNonHTMLMiddleware(),
            RetryMiddleware(max_times=3, sleep_http_codes=[429, 503]),
        ],
        item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
        concurrency=16,
        request_timeout=10,
        log_stats_interval=30,
    )
```

`run_spider` / `crawl` knobs (a combined sketch follows the list):
- `concurrency`: number of concurrent HTTP requests; default 16.
- `max_pending_requests`: queue bound to avoid unbounded memory use (defaults to `concurrency * 10`).
- `request_timeout`: per-request timeout (seconds).
- `keep_alive`: reuse HTTP connections when supported by the underlying client (sends `Connection: keep-alive`).
- `html_max_size_bytes`: limit HTML parsed into `Document` to avoid huge payloads.
- `log_stats_interval`: seconds between periodic stats logs; final stats are always emitted.
- `request_middlewares` / `response_middlewares` / `item_pipelines`: plug-ins run on every request/response/item.
- Use `run_spider_uvloop(...)` instead of `run_spider(...)` to run under uvloop (requires `pip install silkworm-rs[uvloop]`).
- Use `run_spider_winloop(...)` instead of `run_spider(...)` to run under winloop on Windows (requires `pip install silkworm-rs[winloop]`).
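A minimal sketch combining several of these knobs; the values are illustrative, and `QuotesSpider` is the spider from the quick start above. Swapping `run_spider` for `run_spider_uvloop` or `run_spider_winloop` keeps the same signature:

```python
from silkworm import run_spider
from silkworm.middlewares import UserAgentMiddleware
from silkworm.pipelines import JsonLinesPipeline

run_spider(
    QuotesSpider,                   # spider class from the quick start above
    concurrency=32,                 # more parallel fetches
    max_pending_requests=320,       # explicit queue bound (default: concurrency * 10)
    request_timeout=15,             # seconds per request
    keep_alive=True,                # reuse connections when the client supports it
    html_max_size_bytes=2_000_000,  # cap HTML parsed into a Document (~2 MB)
    log_stats_interval=60,          # periodic stats once a minute
    request_middlewares=[UserAgentMiddleware()],
    item_pipelines=[JsonLinesPipeline("data/quotes.jl")],
)
```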
```python
from silkworm.middlewares import (
    DelayMiddleware,
    ProxyMiddleware,
    RetryMiddleware,
    SkipNonHTMLMiddleware,
    UserAgentMiddleware,
)
from silkworm.pipelines import (
    CallbackPipeline,  # invoke a custom callback function on each item
    CSVPipeline,
    JsonLinesPipeline,
    MsgPackPipeline,  # requires: pip install silkworm-rs[msgpack]
    SQLitePipeline,
    XMLPipeline,
    TaskiqPipeline,  # requires: pip install silkworm-rs[taskiq]
    PolarsPipeline,  # requires: pip install silkworm-rs[polars]
    ExcelPipeline,  # requires: pip install silkworm-rs[excel]
    YAMLPipeline,  # requires: pip install silkworm-rs[yaml]
    AvroPipeline,  # requires: pip install silkworm-rs[avro]
    ElasticsearchPipeline,  # requires: pip install silkworm-rs[elasticsearch]
    MongoDBPipeline,  # requires: pip install silkworm-rs[mongodb]
    MySQLPipeline,  # requires: pip install silkworm-rs[mysql]
    PostgreSQLPipeline,  # requires: pip install silkworm-rs[postgresql]
    S3JsonLinesPipeline,  # requires: pip install silkworm-rs[s3]
    VortexPipeline,  # requires: pip install silkworm-rs[vortex]
    WebhookPipeline,  # sends items to webhook endpoints using rnet
    GoogleSheetsPipeline,  # requires: pip install silkworm-rs[gsheets]
    SnowflakePipeline,  # requires: pip install silkworm-rs[snowflake]
    FTPPipeline,  # requires: pip install silkworm-rs[ftp]
    SFTPPipeline,  # requires: pip install silkworm-rs[sftp]
    CassandraPipeline,  # requires: pip install silkworm-rs[cassandra]
    CouchDBPipeline,  # requires: pip install silkworm-rs[couchdb]
    DynamoDBPipeline,  # requires: pip install silkworm-rs[dynamodb]
    DuckDBPipeline,  # requires: pip install silkworm-rs[duckdb]
)
```
```python
run_spider(
    QuotesSpider,
    request_middlewares=[
        UserAgentMiddleware(),  # rotate/custom user agent
        DelayMiddleware(min_delay=0.3, max_delay=1.2),  # polite throttling
        # ProxyMiddleware with round-robin selection (default)
        # ProxyMiddleware(proxies=["http://user:pass@proxy1:8080", "http://proxy2:8080"]),
        # ProxyMiddleware with random selection
        # ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True),
        # ProxyMiddleware from file with random selection
        # ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
    ],
    response_middlewares=[
        RetryMiddleware(max_times=3, sleep_http_codes=[403, 429]),  # backoff + retry
        SkipNonHTMLMiddleware(),  # drop callbacks for images/APIs/etc
    ],
    item_pipelines=[
        JsonLinesPipeline("data/quotes.jl"),
        SQLitePipeline("data/quotes.db", table="quotes"),
        XMLPipeline("data/quotes.xml", root_element="quotes", item_element="quote"),
        CSVPipeline("data/quotes.csv", fieldnames=["author", "text", "tags"]),
        MsgPackPipeline("data/quotes.msgpack"),
    ],
)
```

- `DelayMiddleware` strategies: `delay=1.0` (fixed), `min_delay`/`max_delay` (random), or `delay_func` (custom) — see the sketch after this list.
- `ProxyMiddleware` supports three modes:
  - Round-robin (default): `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"])` cycles through proxies in order.
  - Random selection: `ProxyMiddleware(proxies=["http://proxy1:8080", "http://proxy2:8080"], random_selection=True)` randomly selects a proxy for each request.
  - From file: `ProxyMiddleware(proxy_file="proxies.txt")` loads proxies from a file (one proxy per line, blank lines ignored). Combine with `random_selection=True` for random selection from the file.
- `RetryMiddleware` backs off with `asyncio.sleep`; any status in `sleep_http_codes` is retried even if not in `retry_http_codes`.
- `SkipNonHTMLMiddleware` checks `Content-Type` and optionally sniffs the body (`sniff_bytes`) to avoid running HTML callbacks on binary/API responses.
- `JsonLinesPipeline` writes items to a local JSON Lines file and, when `opendal` is installed, appends asynchronously via the filesystem backend (`use_opendal=False` to stick to a regular file handle).
- `CSVPipeline` flattens nested dicts (e.g., `{"user": {"name": "Alice"}}` -> `user_name`) and joins lists with commas; `XMLPipeline` preserves nesting.
- `MsgPackPipeline` writes items in binary MessagePack format using ormsgpack for fast and compact serialization (requires `pip install silkworm-rs[msgpack]`).
- `TaskiqPipeline` sends items to a Taskiq queue for distributed processing (requires `pip install silkworm-rs[taskiq]`).
- `PolarsPipeline` writes items to a Parquet file using Polars for efficient columnar storage (requires `pip install silkworm-rs[polars]`).
- `ExcelPipeline` writes items to an Excel .xlsx file (requires `pip install silkworm-rs[excel]`).
- `YAMLPipeline` writes items to a YAML file (requires `pip install silkworm-rs[yaml]`).
- `AvroPipeline` writes items to an Avro file with optional schema (requires `pip install silkworm-rs[avro]`).
- `ElasticsearchPipeline` sends items to an Elasticsearch index (requires `pip install silkworm-rs[elasticsearch]`).
- `MongoDBPipeline` sends items to a MongoDB collection (requires `pip install silkworm-rs[mongodb]`).
- `MySQLPipeline` sends items to a MySQL database table as JSON (requires `pip install silkworm-rs[mysql]`).
- `PostgreSQLPipeline` sends items to a PostgreSQL database table as JSONB (requires `pip install silkworm-rs[postgresql]`).
- `S3JsonLinesPipeline` writes items to AWS S3 in JSON Lines format using async OpenDAL (requires `pip install silkworm-rs[s3]`).
- `VortexPipeline` writes items to a Vortex file for high-performance columnar storage with 100x faster random access and 10-20x faster scans compared to Parquet (requires `pip install silkworm-rs[vortex]`).
- `WebhookPipeline` sends items to webhook endpoints via HTTP POST/PUT using rnet (the same HTTP client as the spider), with support for batching and custom headers.
- `GoogleSheetsPipeline` appends items to Google Sheets with automatic flattening of nested data structures (requires `pip install silkworm-rs[gsheets]` and service account credentials).
- `SnowflakePipeline` sends items to Snowflake data warehouse tables as JSON (requires `pip install silkworm-rs[snowflake]`).
- `FTPPipeline` writes items to an FTP server in JSON Lines format (requires `pip install silkworm-rs[ftp]`).
- `SFTPPipeline` writes items to an SFTP server in JSON Lines format with support for password or key-based authentication (requires `pip install silkworm-rs[sftp]`).
- `CassandraPipeline` sends items to Apache Cassandra database tables (requires `pip install silkworm-rs[cassandra]`).
- `CouchDBPipeline` sends items to CouchDB databases as documents (requires `pip install silkworm-rs[couchdb]`).
- `DynamoDBPipeline` sends items to AWS DynamoDB tables with automatic table creation (requires `pip install silkworm-rs[dynamodb]`).
- `DuckDBPipeline` sends items to a DuckDB database table as JSON (requires `pip install silkworm-rs[duckdb]`).
- `CallbackPipeline` invokes a custom callback function (sync or async) on each item, enabling inline processing logic without creating a full pipeline class. See example below.
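A sketch of the three delay strategies alongside the file-based proxy mode. The `delay_func` signature shown here is an assumption (a no-argument callable returning seconds), so verify it against `DelayMiddleware`'s docstring before relying on it:

```python
import random

from silkworm.middlewares import DelayMiddleware, ProxyMiddleware


def jittered_delay():
    # Assumed signature: a callable returning the delay in seconds. The real
    # DelayMiddleware may pass extra context (e.g., the request); check its
    # docstring before use.
    return 0.5 + random.random()


request_middlewares = [
    # DelayMiddleware(delay=1.0),                     # fixed delay
    # DelayMiddleware(min_delay=0.3, max_delay=1.2),  # random delay in a range
    DelayMiddleware(delay_func=jittered_delay),       # custom strategy
    ProxyMiddleware(proxy_file="proxies.txt", random_selection=True),
]
```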
Process items with custom callback functions without creating a full pipeline class:
```python
from silkworm.pipelines import CallbackPipeline


# Sync callback
def print_item(item, spider):
    print(f"[{spider.name}] {item}")
    return item


# Async callback
async def validate_item(item, spider):
    # Could do async operations like database checks
    if len(item.get("text", "")) < 10:
        print("Warning: short text in item")
    return item


# Modifying callback
def enrich_item(item, spider):
    item["spider_name"] = spider.name
    item["processed"] = True
    return item


run_spider(
    QuotesSpider,
    item_pipelines=[
        CallbackPipeline(callback=print_item),
        CallbackPipeline(callback=validate_item),
        CallbackPipeline(callback=enrich_item),
    ],
)
```

Callbacks receive `(item, spider)` and should return the processed item (or `None` to keep the original item unchanged).
Stream scraped items to a Taskiq queue for distributed processing:
```python
from taskiq import InMemoryBroker

from silkworm.pipelines import TaskiqPipeline

broker = InMemoryBroker()


@broker.task
async def process_item(item):
    # Your item processing logic here
    print(f"Processing: {item}")
    # Save to database, send to another service, etc.


pipeline = TaskiqPipeline(broker, task=process_item)
run_spider(MySpider, item_pipelines=[pipeline])
```

This enables distributed processing, retries, rate limiting, and other Taskiq features. See examples/taskiq_quotes_spider.py for a complete example.
Keep crawls cheap when URLs mix HTML and binaries/APIs:
```python
run_spider(
    MySpider,
    response_middlewares=[SkipNonHTMLMiddleware(sniff_bytes=1024)],
    # Tighten HTML parsing size (bytes) to avoid loading huge bodies into scraper-rs
    html_max_size_bytes=1_000_000,
)
```

For improved async performance, enable uvloop (a fast, drop-in replacement for asyncio's event loop):
```bash
pip install silkworm-rs[uvloop]
# or with uv:
uv pip install --prerelease=allow silkworm-rs[uvloop]
```

Then call `run_spider_uvloop` (same signature as `run_spider`):

```python
from silkworm import run_spider_uvloop

run_spider_uvloop(
    QuotesSpider,
    concurrency=32,
)
```

uvloop can provide a 2-4x performance improvement for I/O-bound workloads.
For Windows users who want improved async performance, enable winloop (a Windows-compatible alternative to uvloop):
```bash
pip install silkworm-rs[winloop]
# or with uv:
uv pip install --prerelease=allow silkworm-rs[winloop]
```

Then call `run_spider_winloop` (same signature as `run_spider`):

```python
from silkworm import run_spider_winloop

run_spider_winloop(
    QuotesSpider,
    concurrency=32,
)
```

winloop provides significant performance improvements on Windows, similar to what uvloop offers on Unix-like systems.
If you prefer trio over asyncio, you can use `run_spider_trio` instead of `run_spider`:

```bash
pip install silkworm-rs[trio]
# or with uv:
uv pip install --prerelease=allow silkworm-rs[trio]
```

Then use `run_spider_trio`:

```python
from silkworm import run_spider_trio

run_spider_trio(
    QuotesSpider,
    concurrency=16,
    request_timeout=10,
)
```

This runs your spider with trio as the async backend via the trio-asyncio compatibility layer.
For pages that require JavaScript execution, you can use Lightpanda (or any CDP-compatible browser) instead of the standard HTTP client. This uses the Chrome DevTools Protocol (CDP) to control a browser.
```bash
pip install silkworm-rs[cdp]
# or with uv:
uv pip install --prerelease=allow silkworm-rs[cdp]
```

Start Lightpanda with remote debugging enabled:

```bash
lightpanda --remote-debugging-port=9222
```

Or use Chrome/Chromium:

```bash
chromium --remote-debugging-port=9222 --headless
```

There are two ways to use CDP: the convenience API or custom spider integration.
Convenience API:

```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Fetch HTML with JavaScript rendering
    text, doc = await fetch_html_cdp(
        "https://example.com",
        ws_endpoint="ws://127.0.0.1:9222",
        timeout=30.0,
    )
    # Extract data from the rendered page
    title = doc.select_first("title")
    print(title.text if title else "No title")


asyncio.run(main())
```

Custom spider integration:

```python
from silkworm import HTMLResponse, Request, Response, Spider
from silkworm.cdp import CDPClient


class LightpandaSpider(Spider):
    name = "lightpanda"
    start_urls = ("https://example.com/",)

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._cdp_client = None

    async def start_requests(self):
        # Connect to the CDP endpoint
        self._cdp_client = CDPClient(
            ws_endpoint="ws://127.0.0.1:9222",
            timeout=30.0,
        )
        await self._cdp_client.connect()
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    async def parse(self, response: Response):
        if not isinstance(response, HTMLResponse):
            return
        # Extract links from the JavaScript-rendered page
        for link in await response.select("a"):
            href = link.attr("href")
            if href:
                yield {"url": href}

    async def close(self):
        if self._cdp_client:
            await self._cdp_client.close()
```

See examples/lightpanda_simple.py and examples/lightpanda_spider.py for complete working examples.
Note: CDP support is experimental. For production use, consider using dedicated browser automation tools or the standard HTTP client when JavaScript rendering is not required.
- Structured logs via `logly`; set `SILKWORM_LOG_LEVEL=DEBUG` for verbose request/response/middleware output (see the sketch below).
- Periodic statistics with `log_stats_interval`; final stats always include elapsed time, queue size, requests/sec, seen URLs, items scraped, errors, and memory MB.
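A minimal sketch of turning on debug logging from Python, assuming the level is read when silkworm initializes logging, so the variable must be set before the spider runs (exporting it in the shell works just as well):

```python
import os

# Assumption: SILKWORM_LOG_LEVEL is read when silkworm configures logly at
# startup, so set it before importing/running the spider. Exporting it in the
# shell instead is equivalent.
os.environ.setdefault("SILKWORM_LOG_LEVEL", "DEBUG")

from silkworm import run_spider  # noqa: E402  (imported after the env tweak on purpose)

# QuotesSpider as defined in the quick start above.
run_spider(QuotesSpider, log_stats_interval=30)
```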
- By default, HTTP fetches are rnet-based without JavaScript execution; pages requiring client-side rendering can use the optional CDP integration (see the "JavaScript rendering with Lightpanda" section) or external browser automation tools.
- Request deduplication keys only on `Request.url`; query params, HTTP method, and body are ignored, so same-URL requests with different params/data are dropped unless you set `dont_filter=True` or make the URL unique yourself (see the sketch after this list).
- HTML parsing auto-detects encoding (BOM, HTTP headers/meta, charset detection fallback) but still enforces an `html_max_size_bytes`/`doc_max_size_bytes` cap (default 5 MB) in scraper-rs selectors, so very large pages may need a higher limit or preprocessing.
- Several pipelines buffer all items in memory until close (PolarsPipeline, ExcelPipeline, YAMLPipeline, AvroPipeline, VortexPipeline, S3JsonLinesPipeline, FTPPipeline, SFTPPipeline), which can bloat RAM on long crawls; prefer streaming pipelines like JsonLines/CSV/SQLite for high-volume runs.
- Many destination pipelines rely on optional extras; CassandraPipeline is disabled on Windows because cassandra-driver depends on libev there.
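A minimal sketch of both workarounds for the URL-only deduplication described above, assuming `dont_filter` is a keyword argument on `Request`; the spider and URLs are placeholders:

```python
from silkworm import Request, Spider


class RefreshSpider(Spider):  # hypothetical spider, for illustration only
    name = "refresh"
    start_urls = ()

    async def start_requests(self):
        # Without dont_filter=True the second request would be dropped:
        # deduplication keys only on Request.url, so an identical URL looks
        # like a duplicate even if method/params/body differ.
        yield Request(url="https://example.com/prices", callback=self.parse)
        yield Request(url="https://example.com/prices", callback=self.parse, dont_filter=True)

        # Alternative: vary the URL itself so the default filter keeps it.
        yield Request(url="https://example.com/prices?page=2", callback=self.parse)

    async def parse(self, response):
        yield {"fetched": True}
```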
- `python examples/quotes_spider.py` → data/quotes.jl
- `python examples/quotes_spider_trio.py` → data/quotes_trio.jl (demonstrates the trio backend)
- `python examples/quotes_spider_winloop.py` → data/quotes_winloop.jl (demonstrates the winloop backend for Windows)
- `python examples/hackernews_spider.py --pages 5` → data/hackernews.jl
- `python examples/lobsters_spider.py --pages 2` → data/lobsters.jl
- `python examples/url_titles_spider.py --urls-file data/url_titles.jl --output data/titles.jl` (includes `SkipNonHTMLMiddleware` and stricter HTML size limits)
- `python examples/export_formats_demo.py --pages 2` → JSONL, XML, and CSV outputs in data/
- `python examples/taskiq_quotes_spider.py --pages 2` → demonstrates TaskiqPipeline for queue-based processing
- `python examples/sitemap_spider.py --sitemap-url https://example.com/sitemap.xml --pages 50` → data/sitemap_meta.jl (extracts meta tags and Open Graph data from sitemap URLs)
- `python examples/lightpanda_simple.py` → demonstrates CDP/Lightpanda for JavaScript rendering (requires `pip install silkworm-rs[cdp]` and a running Lightpanda)
- `python examples/lightpanda_spider.py` → full spider example using CDP/Lightpanda
For one-off fetches without a full spider:
```python
import asyncio

from silkworm import fetch_html


async def main():
    text, doc = await fetch_html("https://example.com")
    print(doc.select_first("title").text)


asyncio.run(main())
```

The CDP variant, for JavaScript-rendered pages:

```python
import asyncio

from silkworm import fetch_html_cdp


async def main():
    # Requires Lightpanda/Chrome running with CDP enabled
    text, doc = await fetch_html_cdp("https://example.com")
    print(doc.select_first("title").text)


asyncio.run(main())
```

Pull requests and issues are welcome. To set up a dev environment, install uv, create a Python 3.13 virtualenv, and sync dev dependencies:
```bash
uv venv --python python3.13
uv sync --group dev
```

Run the checks before opening a PR:

```bash
just fmt && just lint && just typecheck && just test
```

Silkworm is built on top of excellent open-source projects:
- rnet - HTTP client with browser impersonation capabilities
- scraper-rs - Fast HTML parsing library
- logly - Structured logging
- rxml - XML parsing and writing
We are grateful to the maintainers and contributors of these projects for their work.
MIT License. See LICENSE for details.