SafeNav is an web and link analysis platform designed to evaluate the safety of URLs, websites, and application links.
It performs multi-layer analysis using static inspection, reputation checks, and machine-learning–based scoring to identify potentially malicious or unsafe links.
The project is structured as a full-stack system with a React frontend and a Python-based backend, focusing on real-world web security use cases such as phishing detection, suspicious domain analysis, and unsafe content identification.
Handles different types of links including shortened URLs, redirects, and malformed URLs. Applies RFC 3986-compliant sanitization, percent-decoding (including double-encoded attacks), Punycode (IDN) conversion, and scheme/host standardization before any analysis begins.
Detects suspicious patterns such as abnormal URL length, special characters, and domain structure anomalies. Includes typosquatting detection via Levenshtein/Jaro-Winkler distance, keyword analysis, and Shannon entropy scoring for DGA (Domain Generation Algorithm) detection.
Follows HTTP redirect chains (301, 302, 307) without executing JavaScript, detecting cross-domain hops, redirect loops, and excessive hop counts that indicate cloaking or obfuscation. Flags client-side redirects (meta-refresh, window.location) for deeper analysis.
Analyzes certificate type (DV/OV/EV), issuer, age (newly issued certs under 48 hours are high-risk), and cipher suite strength. Over 80% of phishing sites now use HTTPS — the padlock alone is not a safety signal.
Evaluates domain age via WHOIS/RDAP, suspicious TLD detection, and registrar reputation. Domains registered under one week are flagged as critical risk. Caches results via Redis to handle rate limits efficiently.
Uses a Random Forest classifier trained on lexical, host-based, and statistical URL features to compute a probabilistic malice score. Feature importances make the model's verdict explainable (e.g., "70% contribution from Domain Age").
Combines the ML score with additive heuristic penalties into a single 0–100 Risk Score. Critical indicators (e.g., insecure login form, blacklist hit) immediately override to 100. Every verdict includes a human-readable reasoning list.
Designed with separable components for easy extension, testing, and experimentation. Each of the eight analysis modules operates independently and feeds into a central score aggregator.
SafeNav is organized as a full-stack application:
- frontend/ – React-based user interface
- backend/ – Python backend responsible for API handling and the eight-module static analysis pipeline
The backend implements a "Fail-Fast" architecture: all Phase 1 checks run within milliseconds to a few seconds using only the URL string, DNS records, SSL handshake, and HTTP response headers — no browser rendering, no JavaScript execution.
Why Fail-Fast? Approximately 90% of malicious links can be caught through surface-level inspection alone. By filtering these at Phase 1, expensive dynamic sandboxing (Phase 2) is reserved only for ambiguous or heavily obfuscated targets.
Before any security check runs, the raw URL is cleaned and converted to a canonical form to defeat common obfuscation tricks.
| Step | What Happens | Why It Matters |
|---|---|---|
| Percent-decode (recursive) | Decodes %xx escapes repeatedly until stable |
Defeats double-encoding attacks like %2520 |
| Control character stripping | Removes ASCII 0–31 and surrounding whitespace | Prevents parser-breaking invisible characters |
| Scheme & host lowercasing | HTTP:// → http://, domain to lowercase |
Ensures consistent, case-insensitive matching |
| Punycode (IDN) conversion | Converts Unicode domains to xn--... ASCII |
Defeats homograph attacks (Cyrillic 'а' vs Latin 'a') |
| Length guard | Rejects inputs over 2048 characters | Prevents regex backtracking / DoS |
Plain English: Think of this step as spell-checking and standardizing the URL before the real analysis starts — the same way a browser normalizes what you type before making a request.
Once normalized, the URL is fingerprinted to classify its intent. Categories are non-mutually exclusive.
| Type | Detection Method | Risk Signal |
|---|---|---|
| Standard Website | http/https scheme, valid domain |
Baseline |
| IP-Based Link | ipaddress library validates raw IPs in netloc |
High — phishing kits avoid domain blocklists this way |
| Shortened URL | Domain matched against shortener database (bit.ly, t.co, etc.) | Medium — destination is hidden |
| Direct Download | Path extension checked against .exe .apk .zip .bat .ps1 blacklist |
High — immediate malware risk |
| App Deep Link | Non-http scheme detected (e.g., whatsapp://, tg://) |
Medium — may trigger unauthorized app actions |
| Android Intent | intent:// scheme parsed for package and target app |
High — reveals exactly which app is targeted |
Query parameter values are also scanned for suspicious extensions (e.g., ?file=malware.exe) to prevent false negatives from indirect download links.
Phishers use redirect chains to bounce through legitimate-looking domains before reaching the malicious page. This module traces the full path without executing any client-side code.
- User-Agent masquerading – requests mimic a real browser to bypass basic bot detection
- Stream mode – headers and redirects are followed without downloading the full response body
- Chain analysis – each hop's domain is compared; cross-domain transitions (e.g.,
google.com → attacker.xyz) increase the risk score - Loop guard – redirect chains capped at 10 hops to prevent infinite loops
- Client-side redirect detection – response body is scanned via regex for
meta http-equiv="refresh"andwindow.location, flagged for Phase 2 deep scan
Plain English: Like following every "click here" button automatically and reporting each stop on the journey, without actually loading the pages in a browser.
HTTPS no longer implies safety. This module inspects the quality of the certificate, not just its presence.
| Check | Logic | Risk Implication |
|---|---|---|
| Certificate type | DV (domain-only) vs OV/EV (organization verified) | DV certs are free, automated — standard for phishing |
| Issuer | Flags Let's Encrypt, cPanel issuers on login pages | High-risk combination |
| Certificate age | Current Time − notBefore |
Under 48 hours → critical "burn domain" signal |
| Cipher suite | Checks for deprecated TLS 1.0, SSLv3, RC4, NULL | Indicates neglected or compromised server |
| SNI compatibility | server_hostname included in socket handshake |
Required for multi-tenant hosts |
| Self-signed certs | Handshake errors caught, flagged as "Invalid/Untrusted" | Never crash — always report |
A domain's registration history is one of the strongest predictors of malicious intent.
| Signal | Threshold | Risk Level |
|---|---|---|
| Domain age | < 7 days since registration | Critical |
| Domain age | < 30 days since registration | High |
| Suspicious TLD | .xyz, .top, .tk, .gq, .zip |
Medium (amplified by other signals) |
| WHOIS privacy/redaction | Creation date missing | Indeterminate (confidence-adjusted) |
| Rate limiting | Handled via Redis cache (24-hour TTL) + rotating proxies | Operational |
Data is fetched via WHOIS (port 43) or the modern RDAP JSON API. The tldextract library isolates the effective second-level domain before lookup.
This module analyzes the text of the URL for visual and semantic deception patterns.
Typosquatting Detection
Levenshtein distance is calculated between the analyzed domain and a reference list of high-value phishing targets (Google, PayPal, Amazon, Microsoft, etc.). A distance of 1–2 flags the domain as a potential typosquat (e.g., gooogle.com). Jaro-Winkler distance is used additionally for subdomain spoofing detection. Checks are limited to the top 50–100 most-phished brands using the optimized python-Levenshtein C library for performance.
Keyword Analysis
The URL is scanned for trust-inducing keywords in subdomains and paths: login, secure, account, verify, update, support, billing. A URL like paypal-secure.com or apple.verify-id.com triggers a Suspicious Keyword flag.
Shannon Entropy (DGA Detection)
High entropy in a domain name indicates random character distribution — a hallmark of Domain Generation Algorithms used by malware C2 servers (e.g., xkzj194.com). Known CDN providers (AWS, Akamai) are whitelisted to prevent false positives.
Heuristic rules catch known patterns. The ML module catches novel attacks through probabilistic pattern matching.
Feature Vector
| Feature Category | Features |
|---|---|
| Lexical | URL length, domain length, dot/hyphen/digit count, @ presence, path depth |
| Statistical | Shannon entropy of domain and path |
| Host-Based | Domain age (days), SSL validity (binary), redirect count, is_HTTPS (binary) |
Model: Random Forest Classifier
Trained on balanced datasets from PhishTank/OpenPhish (malicious) and Alexa/Tranco Top 1M (benign). Random Forest is chosen over deep learning for three reasons: it performs better on tabular URL feature data, it is resistant to overfitting with smaller datasets, and it produces feature importance scores — making its verdicts explainable rather than a black box.
Inference uses model.predict_proba() to output a 0.0–1.0 probability that feeds directly into the Risk Score formula. An MLOps feedback loop retrains the model periodically on false positives/negatives reported by users and Phase 2 scans to counter model drift.
A lightweight HTML parser that looks for gross security violations in the page source without rendering it.
Insecure Login Form Detection
Using BeautifulSoup (bs4):
- Parse the HTML body
- Find
<input type="password">elements - Check the parent
<form action="...">attribute - Violation: If the page URL or form action uses
http://→ immediately flagged as "Insecure Login Form" (credential theft risk)
If <script> tags are prevalent but no forms are found, the system notes "Dynamic Content Detected" and recommends Phase 2 analysis — covering React/Angular apps where forms are JS-generated.
All eight module outputs are synthesized into a single Risk Score (0–100) using a Weighted Risk Fusion model:
If a Critical Indicator is detected (e.g., phishing blacklist hit, insecure login form), the score is immediately forced to 100.
| Detected Signal | Severity | Penalty |
|---|---|---|
| Typosquatting Match | High | +50 |
| Domain Age < 7 Days | High | +40 |
| ML Probability > 80% | High | +30 |
| Cross-Domain Redirect | Low | +15 |
| Suspicious Keyword | Medium | +20 |
| DV SSL Certificate | Low | +10 |
| Insecure Login Form (HTTP) | Critical | +100 (Override) |
If a check fails (e.g., WHOIS timeout), it is marked Indeterminate and the remaining weights are normalized — the score reflects only verified data without artificially deflating the result.
| Score | Verdict | UI |
|---|---|---|
| 0 – 30 | Safe | 🟢 Green Shield — "Safe to Visit" |
| 31 – 69 | Caution | 🟡 Warning — "Proceed with Caution" (Phase 2 suggested) |
| 70 – 100 | High Risk | 🔴 Red Alert — "Dangerous Link Detected" |
Every verdict includes a Reasoning section — a plain-English list of exactly why the score was assigned:
- "Domain registered only 3 days ago."
- "Contains login keywords but uses a low-trust DV certificate."
- "Redirects through 3 different domains."
This transparency builds user trust and turns SafeNav into an educational tool, not just a black-box filter.
- Built with React (Vite)
- Responsible for user interaction and result visualization
- Communicates with backend APIs to request URL analysis and display safety reports
| Technology | Role |
|---|---|
| UI framework — component-based result dashboard | |
| Build tool — fast hot-module reload in development | |
| Primary frontend language | |
| Markup & styling |
- Phishing link detection
- Unsafe website analysis
- Educational research on web security and threat detection
- Full-stack development practice with a security focus
Current Phase: Phase 1 – Static Analysis Engine (In Development)
| Phase | Description | Status |
|---|---|---|
| Phase 1 | Static Analysis Engine — 8-module URL inspection pipeline | 🔄 In Development |
| Phase 2 | Dynamic Analysis — full browser sandboxing & JS execution | 🔜 Planned |
| Phase 3 | MLOps Pipeline — automated model retraining on new threat data | 🔜 Planned |
| Phase 4 | Scale & Deploy — Kubernetes horizontal scaling, extended reporting | 🔜 Planned |
What's done in Phase 1:
- ✅ Architecture fully designed and documented
- ✅ All 8 analysis modules specified (normalization → ML → risk fusion)
- ✅ Weighted Risk Scoring algorithm defined
- ✅ Docker Compose full-stack setup
- 🔄 Module implementation in progress
Coming next:
- Phase 2 dynamic sandboxing (headless browser, JS execution, behavioral fingerprinting)
- MLOps feedback loop for continuous model improvement
- Kubernetes-based horizontal scaling for production workloads
SafeNav can be executed in two different modes depending on the use case:
- Docker Mode – Recommended for demo, evaluation, and deployment
- Development Mode – Recommended while coding and debugging
This mode runs the frontend, backend, and database together using Docker Compose.
- Docker Desktop installed
- Docker Compose enabled
- Clone the repository:
git clone https://github.com/su7ox/SafeNav.git
cd SafeNav- Build and start all services:
docker-compose up -d- Verify running containers:
docker ps- Frontend UI: http://localhost:5173
- Backend API: http://localhost:8000
- API Documentation (Swagger): http://localhost:8000/docs
docker-compose downDocker does not automatically reflect code changes.
docker-compose build
docker-compose up -dThis mode supports hot reload and is recommended during development.
- Python 3.10 or higher
- Navigate to backend directory:
cd backend- Create virtual environment (one-time):
python -m venv venv- Activate virtual environment:
venv\Scripts\activate- Install dependencies:
pip install -r requirements.txt- Run backend server:
uvicorn app.main:app --reloadBackend will be available at:
- Node.js (LTS version recommended)
- Open a new terminal and navigate to frontend directory:
cd frontend- Install dependencies:
npm install- Start frontend development server:
npm run devFrontend will be available at:
su7ox
GitHub: @su7ox