Tadakai/Domain-Mapper

Domain Mapper

A Python command-line tool that crawls a website, extracts every third-party domain referenced in its HTML, and resolves the IPv4 and IPv6 addresses of those domains. It is useful for auditing external dependencies, mapping third-party scripts and assets, and understanding a site's infrastructure footprint.

Description

The tool performs a breadth-first crawl of a site, starting from a given URL and limited to a configurable number of pages. While crawling, it follows internal links and collects URLs referenced in common HTML attributes such as href, src, action, data-src, and data-href. Any domain that differs from the starting domain is recorded as external. Once crawling finishes, each external domain is resolved via DNS to obtain its IPv4 and/or IPv6 addresses.
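The extraction and classification described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: it swaps in the standard-library html.parser for BeautifulSoup so it runs without dependencies, the names LinkCollector and external_domains are hypothetical, and it assumes that "differs from the starting domain" means an exact hostname comparison.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Attributes named in the description above.
URL_ATTRS = {"href", "src", "action", "data-src", "data-href"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the attributes in URL_ATTRS (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in URL_ATTRS and value:
                # Resolve relative and protocol-relative references against the page URL.
                self.urls.append(urljoin(self.base_url, value))


def external_domains(html, page_url):
    """Return the hostnames referenced in html that differ from page_url's hostname."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    base_host = urlparse(page_url).hostname
    hosts = set()
    for url in collector.urls:
        parsed = urlparse(url)
        # Assumption: only http(s) URLs count; mailto:, javascript:, etc. are ignored.
        if parsed.scheme in ("http", "https") and parsed.hostname:
            if parsed.hostname != base_host:
                hosts.add(parsed.hostname)
    return hosts
```

Whether a subdomain such as www.example.com counts as internal depends on how the comparison is done; the exact-match rule here is only one possible choice.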

The crawler reuses a single HTTP session with automatic retries on transient server errors, skips non-HTML responses, caps the amount of data downloaded per page, and can optionally respect robots.txt and add a delay between requests.
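The content-type check and the download cap can be illustrated with two small helpers. This is a sketch under stated assumptions: read_capped and is_html are hypothetical names, the 5 MB default is taken to mean 5 * 1024 * 1024 bytes, and the chunks are assumed to come from a streamed response (with requests, for example, from response.iter_content on a request made with stream=True).

```python
def is_html(content_type):
    """True when a Content-Type header value denotes text/html."""
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def read_capped(chunks, max_bytes=5 * 1024 * 1024):
    """Concatenate byte chunks, stopping once max_bytes have been collected."""
    buf = bytearray()
    for chunk in chunks:
        remaining = max_bytes - len(buf)
        if remaining <= 0:
            break  # Cap reached: stop pulling more data from the stream.
        buf.extend(chunk[:remaining])
    return bytes(buf)
```

Because read_capped stops iterating as soon as the cap is reached, the rest of a large response is never pulled from the stream.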

Requirements

  • Python 3.8 or higher
  • Libraries: requests, beautifulsoup4, colorama (optional, used only for colored terminal output)

Installation

Clone the repository or download the script directly, then install the dependencies with pip:

pip install -r requirements.txt

The tool runs without colorama as well; in that case, output is shown without color.

Basic Usage

python3 main.py <URL> [options]

The URL can be provided with or without a protocol (http:// or https://). If the protocol is omitted, http:// is used.
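The defaulting rule above amounts to something like the following sketch; normalize_url is a hypothetical name, and the tool's actual handling of edge cases (whitespace, uppercase schemes) may differ.

```python
def normalize_url(url):
    """Prefix http:// when the URL has no http:// or https:// scheme."""
    url = url.strip()
    if not url.lower().startswith(("http://", "https://")):
        url = "http://" + url
    return url
```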

Available Options

Option                      Description
-m, --max-pages N           Maximum number of internal pages to crawl (default: 30).
-t, --timeout N             Timeout in seconds for HTTP requests (default: 10).
--user-agent UA             Custom User-Agent string for requests.
--no-ipv4                   Disables IPv4 address resolution.
--no-ipv6                   Disables IPv6 address resolution.
--include-internal          Includes the base domain in the results (by default only external domains are listed).
--delay SECONDS             Adds a delay between requests to reduce load on the target server.
--respect-robots            Honors robots.txt disallow rules while crawling internal pages.
--max-content-size BYTES    Caps how much of each response body is downloaded (default: 5 MB).
-v, --verbose               Displays detailed information during execution.
--export FILE.csv           Exports results to a CSV file.
--json FILE.json            Exports results to a JSON file.
--simple                    Forces simple table output even when an export is requested.

Examples

  1. Quick analysis of a website:

    python3 main.py example.com

  2. Increase the page limit and show detailed logs:

    python3 main.py https://example.com -m 50 -v

  3. Export results to CSV and JSON at the same time:

    python3 main.py example.com --export domains.csv --json domains.json

  4. Skip IPv6 resolution and include the base domain in the output:

    python3 main.py example.com --no-ipv6 --include-internal

  5. Crawl politely, respecting robots.txt and waiting between requests:

    python3 main.py example.com --respect-robots --delay 1.5
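To show what honoring robots.txt in example 5 could involve, here is a minimal sketch using the standard-library urllib.robotparser. Whether the tool uses this module is an assumption; build_robots and the "DomainMapper" user-agent string are hypothetical, and the rules are parsed from a string so the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical rules for illustration.
rules = build_robots("User-agent: *\nDisallow: /private/\n")

# A crawler would skip any internal URL for which can_fetch returns False.
allowed = rules.can_fetch("DomainMapper", "http://example.com/public")
blocked = rules.can_fetch("DomainMapper", "http://example.com/private/page")
```

A delay such as --delay 1.5 would then typically be a time.sleep call between consecutive page fetches.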

Output

Simple Table (default)

Displays a formatted list with the domain name and the IP addresses found. Domains that could not be resolved are marked as (unresolved).

Example:

====================================================
DOMAIN                                    IP ADDRESSES
====================================================
cdn.example.com                           192.0.2.10, 2001:db8::1
api.thirdparty.net                        203.0.113.5
fonts.googleapis.com                      (unresolved)
====================================================
Total: 3 domains.

CSV Export

A file with the columns Domain, IPv4, and IPv6. Multiple IP addresses of the same type are separated by semicolons.
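A sketch of how such a file could be written with the standard csv module. The function name write_csv and the shape of the input records (dictionaries with "domain", "ipv4", and "ipv6" keys) are assumptions made for illustration, as is joining addresses with a bare semicolon and no space.

```python
import csv
import io


def write_csv(results, fileobj):
    """Write one row per domain; multiple addresses are joined with ';'."""
    writer = csv.writer(fileobj)
    writer.writerow(["Domain", "IPv4", "IPv6"])
    for record in results:
        writer.writerow([
            record["domain"],
            ";".join(record["ipv4"]),
            ";".join(record["ipv6"]),
        ])


# Usage: write to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_csv(
    [{"domain": "cdn.example.com",
      "ipv4": ["192.0.2.10", "192.0.2.11"],
      "ipv6": ["2001:db8::1"]}],
    buffer,
)
```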

JSON Export

A file containing a list of objects with the following structure:

[
  {
    "domain": "cdn.example.com",
    "ipv4": ["192.0.2.10"],
    "ipv6": ["2001:db8::1"],
    "ips": ["192.0.2.10", "2001:db8::1"],
    "resolved": true
  }
]
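One way a record of this shape could be assembled is sketched below. The helper make_record is hypothetical, and two relationships are inferred from the example rather than documented: that "ips" is the IPv4 list followed by the IPv6 list, and that "resolved" is true whenever at least one address was found.

```python
import json


def make_record(domain, ipv4, ipv6):
    """Build one JSON-serializable result object for a domain."""
    return {
        "domain": domain,
        "ipv4": list(ipv4),
        "ipv6": list(ipv6),
        "ips": list(ipv4) + list(ipv6),    # assumed ordering: IPv4 first
        "resolved": bool(ipv4 or ipv6),    # assumed meaning: any address found
    }


record = make_record("cdn.example.com", ["192.0.2.10"], ["2001:db8::1"])
text = json.dumps([record], indent=2)
```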

Internal Operation

  1. URL normalization: adds http:// if no protocol is provided.
  2. Crawling: uses a queue to traverse internal pages, respecting the --max-pages limit and, optionally, robots.txt rules.
  3. Fetching: streams each response, skips content that is not text/html, and stops downloading once --max-content-size is reached.
  4. URL extraction: parses HTML with BeautifulSoup and collects links from common tags (a, link, script, img, iframe, form, source, embed, object) and custom attributes (data-src, data-href).
  5. Classification: separates internal URLs (same domain as the start URL) from external ones.
  6. DNS caching: avoids repeated lookups for domains seen more than once.
  7. Resolution: uses socket.getaddrinfo to obtain IPv4 and/or IPv6 addresses for each external domain.
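Steps 6 and 7 can be sketched together as a cached wrapper around socket.getaddrinfo. The names resolve and _cache are hypothetical, and the lookup parameter exists only so the function can be exercised without real DNS traffic; the tool's own caching strategy may differ.

```python
import socket

_cache = {}  # (domain, want_ipv4, want_ipv6) -> (ipv4 list, ipv6 list)


def resolve(domain, want_ipv4=True, want_ipv6=True, lookup=socket.getaddrinfo):
    """Return (ipv4, ipv6) address lists for domain, caching each result."""
    key = (domain, want_ipv4, want_ipv6)
    if key in _cache:
        return _cache[key]
    ipv4, ipv6 = [], []
    try:
        infos = lookup(domain, None)
    except socket.gaierror:
        infos = []  # Unresolvable domains yield two empty lists.
    for family, _type, _proto, _canonname, sockaddr in infos:
        address = sockaddr[0]
        # getaddrinfo may return one entry per socket type, so skip duplicates.
        if family == socket.AF_INET and want_ipv4 and address not in ipv4:
            ipv4.append(address)
        elif family == socket.AF_INET6 and want_ipv6 and address not in ipv6:
            ipv6.append(address)
    _cache[key] = (ipv4, ipv6)
    return _cache[key]
```

Calling resolve("example.com", want_ipv6=False) would mirror the effect of --no-ipv6 in this sketch.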

Limitations

  • The crawler follows only links that stay on the same domain as the starting URL.
  • JavaScript is not executed; only links present in the static HTML are considered.
  • DNS resolution happens after crawling completes, not while a page is being processed.
  • robots.txt enforcement, when enabled, only governs which internal pages are crawled; it does not affect which external domains are reported.
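The first and third limitations follow from how the crawl is organized. The sketch below is illustrative only: crawl and get_links are hypothetical names, get_links stands in for fetching a page and extracting its absolute URLs, and hostnames are compared exactly.

```python
from collections import deque
from urllib.parse import urlparse


def crawl(start_url, get_links, max_pages=30):
    """Breadth-first crawl; returns the set of external hostnames seen.

    get_links(url) is a hypothetical callback returning the absolute URLs
    found on that page.
    """
    base_host = urlparse(start_url).hostname
    queue = deque([start_url])
    seen = {start_url}
    external = set()
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        for link in get_links(url):
            host = urlparse(link).hostname
            if host is None:
                continue
            if host == base_host:
                # Internal link: schedule it once for a later visit.
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
            else:
                # External link: record the host, never follow it.
                external.add(host)
    return external
```

DNS resolution would run only after crawl returns, once per collected hostname, which is why no addresses are available while a page is still being processed.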

Responsible Use

This tool requests only publicly available pages and performs standard DNS lookups, which reveal the same kind of information that any browser or an nslookup/dig command would. Always make sure you are authorized to crawl a target site, and consider using --respect-robots and --delay to avoid placing unnecessary load on the servers you scan.
