A customizable web crawler framework for .NET written in C#. Supports .NET 8, .NET 9, and .NET 10.
```bash
dotnet add package Tharga.Crawler
```

The simplest way to crawl a site:
```csharp
var crawler = new Crawler();
var result = await crawler.StartAsync(new Uri("https://example.com/"));
```

You can also crawl multiple starting points at once:
```csharp
var crawler = new Crawler();
var uris = new[] { new Uri("https://example.com/"), new Uri("https://example.com/blog") };
var result = await crawler.StartAsync(uris);
```

There are three events available for monitoring crawl progress:
CrawlerCompleteEvent fires when the entire crawl has finished. Useful for background crawling without awaiting the result.
```csharp
var crawler = new Crawler();
crawler.CrawlerCompleteEvent += (s, e) =>
{
    var result = e.CrawlerResult;
    Console.WriteLine($"Completed with {result.GetRequestedPages().Count()} requests " +
                      $"and {result.GetFinalPages().Count()} final pages.");
};
await crawler.StartAsync(new Uri("https://example.com/"));
```

PageCompleteEvent fires each time a page is successfully downloaded (HTTP 2xx).
```csharp
crawler.PageCompleteEvent += (s, e) =>
{
    Console.WriteLine($"Downloaded: {e.CrawlContent.FinalUri} ({e.CrawlContent.StatusCode})");
};
```

PageFailedEvent fires when a page download fails (non-2xx status or exception).
```csharp
crawler.PageFailedEvent += (s, e) =>
{
    Console.WriteLine($"Failed: {e.CrawlContent.RequestUri} - {e.CrawlContent.StatusCode}");
};
```

Register the crawler in your service collection using AddCrawler(). All components are registered as transient, so multiple parallel crawler instances are supported.

```csharp
services.AddCrawler();
```

Then inject ICrawler into your services:
```csharp
public class MyService
{
    private readonly ICrawler _crawler;

    public MyService(ICrawler crawler)
    {
        _crawler = crawler;
    }

    public async Task Crawl(Uri uri)
    {
        var result = await _crawler.StartAsync(uri);
    }
}
```

For scenarios where you need to create crawler instances with custom components at runtime, inject ICrawlerProvider:
```csharp
public class MyService
{
    private readonly ICrawlerProvider _crawlerProvider;

    public MyService(ICrawlerProvider crawlerProvider)
    {
        _crawlerProvider = crawlerProvider;
    }

    public async Task Crawl(Uri uri)
    {
        var crawler = _crawlerProvider.GetCrawlerInstance(scheduler: myCustomScheduler);
        var result = await crawler.StartAsync(uri);
    }
}
```

You can replace any built-in component by passing CrawlerRegistrationOptions:
```csharp
services.AddCrawler(options =>
{
    options.Scheduler = provider => new MyCustomScheduler();
    options.Downloader = provider => new MyCustomDownloader();
});
```

There are several options that can be configured for each crawl.
CrawlerOptions:

| Option | Default | Description |
|---|---|---|
| `MaxCrawlTime` | No limit | Maximum total duration for the crawl |
| `NumberOfProcessors` | 3 | Number of parallel page processors |

DownloadOptions:

| Option | Default | Description |
|---|---|---|
| `RetryCount` | 3 | Number of retries for HTTP 5xx errors |
| `Timeout` | No limit | Timeout per individual page download |
| `UserAgent` | `UserAgentLibrary.Chrome` | User agent string sent with requests |

SchedulerOptions:

| Option | Default | Description |
|---|---|---|
| `MaxQueueCount` | No limit | Maximum items in the queue. New URIs are dropped when the limit is reached |
```csharp
var crawler = new Crawler();

var options = new CrawlerOptions
{
    MaxCrawlTime = TimeSpan.FromMinutes(10),
    NumberOfProcessors = 5,
    DownloadOptions = new DownloadOptions
    {
        RetryCount = 3,
        Timeout = TimeSpan.FromSeconds(30),
        UserAgent = UserAgentLibrary.Chrome
    },
    SchedulerOptions = new SchedulerOptions
    {
        MaxQueueCount = 1000
    }
};

var result = await crawler.StartAsync(new Uri("https://example.com/"), options);
```

The UserAgentLibrary class provides built-in user agent strings:
- `UserAgentLibrary.Chrome` (default)
- `UserAgentLibrary.Firefox`
- `UserAgentLibrary.Edge`
- `UserAgentLibrary.Googlebot`
- `UserAgentLibrary.Bingbot`
- `UserAgentLibrary.DuckDuckBot`
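For example, to identify as Firefox instead of the default Chrome string (a minimal variant of the options example above):

```csharp
// Only the user agent is changed; all other options keep their defaults.
var options = new CrawlerOptions
{
    DownloadOptions = new DownloadOptions { UserAgent = UserAgentLibrary.Firefox }
};
var result = await crawler.StartAsync(new Uri("https://example.com/"), options);
```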
All StartAsync overloads accept a CancellationToken:
```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));
var result = await crawler.StartAsync(new Uri("https://example.com/"), cancellationToken: cts.Token);
Console.WriteLine($"Cancelled: {result.IsCancelled}, Elapsed: {result.Elapsed}");
```

The CrawlerResult returned from StartAsync provides:
- `IsCancelled`: whether the crawl was cancelled
- `Elapsed`: total crawl duration
- `GetRequestedPages()`: all pages that were requested
- `GetFinalPages()`: distinct final pages (after following redirects), ordered by redirect count
Each crawled page includes the HTTP status code, redirect chain, final URI, content type, download time, and page title.
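As a short sketch of inspecting those pages (the FinalUri and StatusCode property names are assumed here, borrowed from the CrawlContent members used in the event examples above):

```csharp
var result = await crawler.StartAsync(new Uri("https://example.com/"));

// List each distinct final page with its status code. The property names
// mirror the CrawlContent members shown in the event examples; the exact
// shape of the page object is an assumption.
foreach (var page in result.GetFinalPages())
{
    Console.WriteLine($"{page.StatusCode}: {page.FinalUri}");
}
```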
The crawler is built from four pluggable components. The Crawler orchestrates the overall process, delegating to the IDownloader, IScheduler, and IPageProcessor.
The downloader retrieves page content using HttpClient. It handles HTTP redirects (301, 302, 303, 307, 308) automatically, tracking the full redirect chain, and extracts the page `<title>` from HTML content.

The page processor parses downloaded HTML to extract links. It uses HtmlAgilityPack to parse the DOM and find all `<a href="...">` elements, resolves relative URLs, and stays within the original domain.

The scheduler is an in-memory queue that uses a breadth-first (shallow) crawl strategy: pages closest to the starting URI are crawled first. It handles retry logic and uses IUriService for URI filtering and mutation.
All major components can be replaced by implementing the corresponding interface and registering your implementation via DI.
IDownloader handles downloading page content. Override this to use a headless browser (e.g., Playwright, Puppeteer) instead of HttpClient.
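As a rough illustration only: this README does not show the IDownloader members, so the method name, parameters, and CrawlContent return type below are assumptions, not the library's actual contract.

```csharp
// Hypothetical sketch of a browser-based downloader. The DownloadAsync
// member and its CrawlContent return type are assumed shapes; consult the
// real IDownloader interface before implementing.
public class BrowserDownloader : IDownloader
{
    public Task<CrawlContent> DownloadAsync(Uri uri, CancellationToken cancellationToken)
    {
        // Drive a headless browser here (e.g., Playwright) so that
        // JavaScript-rendered links are present in the downloaded HTML,
        // then map the rendered page into the crawler's content type.
        throw new NotImplementedException("Sketch only.");
    }
}
```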
IPageProcessor controls how HTML is parsed to extract links. Override this to change link extraction logic or to process non-HTML content.

IScheduler manages the crawl queue and tracks what has been crawled. Override this to persist the queue to a database for resumable crawls.

IUriService provides URI filtering and mutation. It is called by the scheduler before enqueuing a URI.

- `ShouldEnqueueAsync(Uri parentUri, Uri uri)`: return false to skip a URI
- `MutateUriAsync(Uri uri)`: transform a URI before it is enqueued (e.g., strip query parameters)
By default, the crawler stays on the same domain as the starting URI.
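For example, a custom IUriService could extend that default by also skipping binary assets and stripping query strings. A minimal sketch, assuming the two methods return Task&lt;bool&gt; and Task&lt;Uri&gt; respectively (the return types are not shown in this README):

```csharp
// Sketch of a custom IUriService. Method names and parameters come from the
// list above; the Task<bool>/Task<Uri> return types are assumptions.
public class NoQueryUriService : IUriService
{
    // Keep PDFs out of the queue; returning false skips the URI.
    public Task<bool> ShouldEnqueueAsync(Uri parentUri, Uri uri)
    {
        var isPdf = uri.AbsolutePath.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase);
        return Task.FromResult(!isPdf);
    }

    // Strip query parameters so the same page is enqueued only once.
    public Task<Uri> MutateUriAsync(Uri uri)
    {
        return Task.FromResult(new Uri(uri.GetLeftPart(UriPartial.Path)));
    }
}
```

Register the implementation via DI as described above.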