A Python command-line tool that crawls a website, extracts every third-party domain referenced in its HTML, and resolves the IPv4 and IPv6 addresses of those domains. It is useful for auditing external dependencies, mapping third-party scripts and assets, and understanding a site's infrastructure footprint.
The tool performs a breadth-first crawl of a site, starting from a given URL and limited to a configurable number of pages. While crawling, it follows internal links and collects URLs referenced in common HTML attributes such as href, src, action, data-src, and data-href. Any domain that differs from the starting domain is recorded as external. Once crawling finishes, each external domain is resolved via DNS to obtain its IPv4 and/or IPv6 addresses.
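The crawl described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so the example has no dependencies, and `crawl`, `LinkCollector`, and the injected `fetch` callable are hypothetical names introduced here. Unlike the real tool, it queues every same-host URL it finds and relies on `fetch` returning `None` for anything that is not an HTML page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Attributes named in the description above.
URL_ATTRS = {"href", "src", "action", "data-src", "data-href"}


class LinkCollector(HTMLParser):
    """Collects the values of URL-bearing attributes from every tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                self.urls.append(value)


def crawl(start_url, fetch, max_pages=30):
    """Breadth-first crawl; fetch(url) returns HTML text, or None to skip."""
    base_host = urlparse(start_url).hostname
    queue = deque([start_url])   # FIFO queue gives breadth-first order
    seen = {start_url}
    external = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages += 1
        if html is None:
            continue
        collector = LinkCollector()
        collector.feed(html)
        for raw in collector.urls:
            absolute = urljoin(url, raw)
            parsed = urlparse(absolute)
            if parsed.scheme not in ("http", "https") or not parsed.hostname:
                continue  # ignore mailto:, javascript:, and similar
            if parsed.hostname == base_host:
                page = absolute.split("#", 1)[0]  # drop the fragment
                if page not in seen:
                    seen.add(page)
                    queue.append(page)
            else:
                external.add(parsed.hostname)
    return external
```

Passing `fetch` in as a parameter keeps the traversal logic separate from networking, which makes it easy to try against a dictionary of canned pages.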
The crawler reuses a single HTTP session with automatic retries on transient server errors, skips non-HTML responses, caps the amount of data downloaded per page, and can optionally respect robots.txt and add a delay between requests.
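The content-type check and the download cap could look something like the sketch below. The helper names `is_html` and `read_capped` are hypothetical, and treating 5 MB as 5 × 1024 × 1024 bytes is an assumption. With requests, the chunks would typically come from `response.iter_content(chunk_size=8192)` on a request made with `stream=True`; the retry configuration is not shown here.

```python
def is_html(content_type):
    """Return True when a Content-Type header value denotes HTML."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def read_capped(chunks, max_bytes=5 * 1024 * 1024):
    """Concatenate byte chunks, stopping once max_bytes have been collected."""
    body = bytearray()
    for chunk in chunks:
        remaining = max_bytes - len(body)
        if remaining <= 0:
            break  # cap reached: stop pulling data from the stream
        body.extend(chunk[:remaining])
    return bytes(body)
```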
- Python 3.8 or higher
- Libraries:
  - requests
  - beautifulsoup4
  - colorama (optional, used only for colored terminal output)
Clone the repository or download the script directly, then install the dependencies with pip:

    pip install -r requirements.txt

The tool also runs without colorama; in that case, output is shown without color.

    python3 main.py <URL> [options]

The URL can be provided with or without a protocol (http:// or https://). If the protocol is omitted, http:// is used.
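A minimal sketch of that normalization rule, assuming a hypothetical helper named `normalize_url` (the tool's own function may differ):

```python
def normalize_url(url):
    """Prepend http:// when the URL has no http:// or https:// prefix."""
    url = url.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url
```

So `example.com` would be crawled as `http://example.com`, while `https://example.com` is left untouched.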
| Option | Description |
|---|---|
| `-m, --max-pages N` | Maximum number of internal pages to crawl (default: 30). |
| `-t, --timeout N` | Timeout in seconds for HTTP requests (default: 10). |
| `--user-agent UA` | Custom User-Agent string for requests. |
| `--no-ipv4` | Disables IPv4 address resolution. |
| `--no-ipv6` | Disables IPv6 address resolution. |
| `--include-internal` | Includes the base domain in the results (by default only external domains are listed). |
| `--delay SECONDS` | Adds a delay between requests to reduce load on the target server. |
| `--respect-robots` | Honors robots.txt disallow rules while crawling internal pages. |
| `--max-content-size BYTES` | Caps how much of each response body is downloaded (default: 5 MB). |
| `-v, --verbose` | Displays detailed information during execution. |
| `--export FILE.csv` | Exports results to a CSV file. |
| `--json FILE.json` | Exports results to a JSON file. |
| `--simple` | Forces simple table output even when an export is requested. |
- Quick analysis of a website: `python3 main.py example.com`
- Increase the page limit and show detailed logs: `python3 main.py https://example.com -m 50 -v`
- Export results to CSV and JSON at the same time: `python3 main.py example.com --export domains.csv --json domains.json`
- Skip IPv6 resolution and include the base domain in the output: `python3 main.py example.com --no-ipv6 --include-internal`
- Crawl politely, respecting `robots.txt` and waiting between requests: `python3 main.py example.com --respect-robots --delay 1.5`
By default, the tool prints a formatted table to the terminal with each domain name and the IP addresses found for it. Domains that could not be resolved are marked as (unresolved).
Example:

    ====================================================
    DOMAIN                  IP ADDRESSES
    ====================================================
    cdn.example.com         192.0.2.10, 2001:db8::1
    api.thirdparty.net      203.0.113.5
    fonts.googleapis.com    (unresolved)
    ====================================================
    Total: 3 domains.

The file written by --export is a CSV with the columns Domain, IPv4, and IPv6. Multiple IP addresses of the same type are separated by semicolons.
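As an illustration of that layout, the sketch below writes such a file with the standard `csv` module. The `write_csv` helper is hypothetical, and the input records are assumed to be dictionaries with `domain`, `ipv4`, and `ipv6` keys; the semicolon is assumed to be written without surrounding spaces.

```python
import csv
import io


def write_csv(results, handle):
    """Write one row per domain, joining multiple addresses with ';'."""
    writer = csv.writer(handle)
    writer.writerow(["Domain", "IPv4", "IPv6"])
    for row in results:
        writer.writerow([
            row["domain"],
            ";".join(row["ipv4"]),
            ";".join(row["ipv6"]),
        ])


# Usage: write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_csv(
    [{"domain": "cdn.example.com",
      "ipv4": ["192.0.2.10", "192.0.2.11"],
      "ipv6": ["2001:db8::1"]}],
    buffer,
)
```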
The file written by --json contains a list of objects with the following structure:

    [
      {
        "domain": "cdn.example.com",
        "ipv4": ["192.0.2.10"],
        "ipv6": ["2001:db8::1"],
        "ips": ["192.0.2.10", "2001:db8::1"],
        "resolved": true
      }
    ]

- URL normalization: adds `http://` if no protocol is provided.
- Crawling: uses a queue to traverse internal pages, respecting the `--max-pages` limit and, optionally, `robots.txt` rules.
- Fetching: streams each response, skips content that is not `text/html`, and stops downloading once `--max-content-size` is reached.
- URL extraction: parses HTML with BeautifulSoup and collects links from common tags (`a`, `link`, `script`, `img`, `iframe`, `form`, `source`, `embed`, `object`) and custom attributes (`data-src`, `data-href`).
- Classification: separates internal URLs (same domain as the start URL) from external ones.
- DNS caching: avoids repeated lookups for domains seen more than once.
- Resolution: uses `socket.getaddrinfo` to obtain IPv4 and/or IPv6 addresses for each external domain.

- Crawling only follows links found within the same origin domain as the starting URL.
- JavaScript is not executed; only links present in the static HTML are considered.
- DNS resolution happens after crawling completes, not while a page is being processed.
- `robots.txt` enforcement, when enabled, only governs which internal pages are crawled; it does not affect which external domains are reported.
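One way to apply robots.txt rules to internal pages only is the standard library's `urllib.robotparser`, sketched below. Whether the tool uses this module is an assumption, and `build_robots` and `allowed` are hypothetical helper names. The check would run before an internal page is queued; external domains are recorded regardless of its result.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser, url, user_agent="*"):
    """Return True when the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# Usage with an inline rule set instead of a network fetch.
rules = "User-agent: *\nDisallow: /private/\n"
robots = build_robots(rules)
```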
This tool only requests publicly available pages and performs standard DNS lookups, the same kind of information any browser or nslookup/dig command would reveal. Always make sure you are authorized to crawl a target site, and consider using --respect-robots and --delay to avoid placing unnecessary load on the servers you scan.