SafeNav – Web & Link Safety Analysis Platform

SafeNav is a web and link analysis platform designed to evaluate the safety of URLs, websites, and application links.
It performs multi-layer analysis using static inspection, reputation checks, and machine-learning–based scoring to identify potentially malicious or unsafe links. The project is structured as a full-stack system with a React frontend and a Python-based backend, focusing on real-world web security use cases such as phishing detection, suspicious domain analysis, and unsafe content identification.

🚀 Key Features

URL Normalization & Parsing

Handles different types of links including shortened URLs, redirects, and malformed URLs. Applies RFC 3986-compliant sanitization, percent-decoding (including double-encoded attacks), Punycode (IDN) conversion, and scheme/host standardization before any analysis begins.

Static & Lexical Analysis

Detects suspicious patterns such as abnormal URL length, special characters, and domain structure anomalies. Includes typosquatting detection via Levenshtein/Jaro-Winkler distance, keyword analysis, and Shannon entropy scoring for DGA (Domain Generation Algorithm) detection.

Redirect Chain Tracing

Follows HTTP redirect chains (301, 302, 307) without executing JavaScript, detecting cross-domain hops, redirect loops, and excessive hop counts that indicate cloaking or obfuscation. Flags client-side redirects (meta-refresh, window.location) for deeper analysis.

SSL/TLS Certificate Inspection

Analyzes certificate type (DV/OV/EV), issuer, age (newly issued certs under 48 hours are high-risk), and cipher suite strength. Over 80% of phishing sites now use HTTPS — the padlock alone is not a safety signal.

Reputation & Domain Intelligence

Evaluates domain age via WHOIS/RDAP, suspicious TLD detection, and registrar reputation. Domains registered under one week are flagged as critical risk. Caches results via Redis to handle rate limits efficiently.

Machine Learning–Based Risk Scoring

Uses a Random Forest classifier trained on lexical, host-based, and statistical URL features to compute a probabilistic malice score. Feature importances make the model's verdict explainable (e.g., "70% contribution from Domain Age").

Weighted Risk Fusion Engine

Combines the ML score with additive heuristic penalties into a single 0–100 Risk Score. Critical indicators (e.g., insecure login form, blacklist hit) immediately override to 100. Every verdict includes a human-readable reasoning list.

Modular Detection Pipeline

Designed with separable components for easy extension, testing, and experimentation. Each of the eight analysis modules operates independently and feeds into a central score aggregator.

🧱 Project Architecture

SafeNav is organized as a full-stack application:

frontend/ – React-based user interface
backend/ – Python backend responsible for API handling and the eight-module static analysis pipeline

⚙️ Backend – Phase 1: Static Analysis Engine

The backend implements a "Fail-Fast" architecture: all Phase 1 checks run within milliseconds to a few seconds using only the URL string, DNS records, SSL handshake, and HTTP response headers — no browser rendering, no JavaScript execution.

Why Fail-Fast? Approximately 90% of malicious links can be caught through surface-level inspection alone. By filtering these at Phase 1, expensive dynamic sandboxing (Phase 2) is reserved only for ambiguous or heavily obfuscated targets.

Module I – Link Intake, Sanitization & Normalization

Before any security check runs, the raw URL is cleaned and converted to a canonical form to defeat common obfuscation tricks.

Step	What Happens	Why It Matters
Percent-decode (recursive)	Decodes `%xx` escapes repeatedly until stable	Defeats double-encoding attacks like `%2520`
Control character stripping	Removes ASCII 0–31 and surrounding whitespace	Prevents parser-breaking invisible characters
Scheme & host lowercasing	`HTTP://` → `http://`, domain to lowercase	Ensures consistent, case-insensitive matching
Punycode (IDN) conversion	Converts Unicode domains to `xn--...` ASCII	Defeats homograph attacks (Cyrillic 'а' vs Latin 'a')
Length guard	Rejects inputs over 2048 characters	Prevents regex backtracking / DoS

Plain English: Think of this step as spell-checking and standardizing the URL before the real analysis starts — the same way a browser normalizes what you type before making a request.
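
The steps in the table above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not SafeNav's actual implementation: the function name `normalize_url` and the `MAX_DECODE_ROUNDS` cap are assumptions, and decoding the whole URL before parsing is a simplification (userinfo is also dropped).

```python
import re
from urllib.parse import unquote, urlsplit, urlunsplit

MAX_URL_LENGTH = 2048    # length guard from the table above
MAX_DECODE_ROUNDS = 5    # assumed cap so recursive decoding always terminates


def normalize_url(raw: str) -> str:
    """Return a canonical form of `raw`; raise ValueError if it is rejected."""
    if len(raw) > MAX_URL_LENGTH:
        raise ValueError("URL exceeds length guard")

    # Percent-decode repeatedly until the string stops changing (%2520 -> %20 -> space).
    url = raw
    for _ in range(MAX_DECODE_ROUNDS):
        decoded = unquote(url)
        if decoded == url:
            break
        url = decoded

    # Strip surrounding whitespace, then remove ASCII control characters (0-31).
    url = re.sub(r"[\x00-\x1f]", "", url.strip())

    parts = urlsplit(url)            # scheme and hostname come back lowercased
    host = parts.hostname or ""
    if host:
        # Convert Unicode labels to Punycode (xn--...) with the built-in IDNA codec.
        host = host.encode("idna").decode("ascii")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

For example, `normalize_url("http://bücher.de/")` yields the Punycode host `xn--bcher-kva.de`, so a lookalike Unicode domain can no longer pass as its ASCII twin in later string comparisons.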

Module II – Link Type Identification & Taxonomy

Once normalized, the URL is fingerprinted to classify its intent. Categories are not mutually exclusive.

Type	Detection Method	Risk Signal
Standard Website	`http`/`https` scheme, valid domain	Baseline
IP-Based Link	`ipaddress` library validates raw IPs in netloc	High — phishing kits avoid domain blocklists this way
Shortened URL	Domain matched against shortener database (bit.ly, t.co, etc.)	Medium — destination is hidden
Direct Download	Path extension checked against `.exe .apk .zip .bat .ps1` blacklist	High — immediate malware risk
App Deep Link	Non-http scheme detected (e.g., `whatsapp://`, `tg://`)	Medium — may trigger unauthorized app actions
Android Intent	`intent://` scheme parsed for package and target app	High — reveals exactly which app is targeted

Query parameter values are also scanned for suspicious extensions (e.g., ?file=malware.exe) to prevent false negatives from indirect download links.
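
A rough sketch of how the taxonomy above might be applied, assuming a small in-memory shortener set in place of the real shortener database. The function name `classify_link` and the label strings are hypothetical; only `ipaddress` and `urllib.parse` from the standard library are used.

```python
import ipaddress
from typing import Set
from urllib.parse import parse_qsl, urlsplit

SHORTENERS = {"bit.ly", "t.co", "tinyurl.com"}   # illustrative subset, not the real database
RISKY_EXTENSIONS = (".exe", ".apk", ".zip", ".bat", ".ps1")


def classify_link(url: str) -> Set[str]:
    """Return every category label that applies to `url` (labels can overlap)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""
    labels: Set[str] = set()

    if scheme == "intent":
        labels.add("android_intent")
    elif scheme not in ("http", "https"):
        labels.add("app_deep_link")
    else:
        labels.add("standard_website")

    # A host that parses as an IPv4/IPv6 address is an IP-based link.
    try:
        ipaddress.ip_address(host)
        labels.add("ip_based")
    except ValueError:
        pass

    if host in SHORTENERS:
        labels.add("shortened_url")

    # Check the path and every query parameter value for risky extensions.
    candidates = [parts.path] + [value for _, value in parse_qsl(parts.query)]
    if any(c.lower().endswith(RISKY_EXTENSIONS) for c in candidates):
        labels.add("direct_download")
    return labels
```

Returning a set rather than a single label reflects that a link can, for instance, be both IP-based and a direct download.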

Module III – Lightweight Redirect Tracing

Phishers use redirect chains to bounce through legitimate-looking domains before reaching the malicious page. This module traces the full path without executing any client-side code.

User-Agent masquerading – requests mimic a real browser to bypass basic bot detection
Stream mode – headers and redirects are followed without downloading the full response body
Chain analysis – each hop's domain is compared; cross-domain transitions (e.g., google.com → attacker.xyz) increase the risk score
Loop guard – redirect chains capped at 10 hops to prevent infinite loops
Client-side redirect detection – response body is scanned via regex for meta http-equiv="refresh" and window.location, flagged for Phase 2 deep scan

Plain English: Like following every "click here" button automatically and reporting each stop on the journey, without actually loading the pages in a browser.
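
The chain analysis and client-side redirect checks listed above could look something like this. The sketch assumes the hops have already been collected by an HTTP client as an ordered list of URLs, and it compares full hostnames rather than registrable domains for brevity. `analyze_chain` and `has_client_side_redirect` are hypothetical names.

```python
import re
from typing import Dict, List
from urllib.parse import urlsplit

MAX_HOPS = 10  # loop guard from the list above

# Matches a meta refresh tag or a window.location reference.
CLIENT_REDIRECT_RE = re.compile(
    r"""<meta[^>]+http-equiv\s*=\s*["']?refresh|window\.location""",
    re.IGNORECASE,
)


def analyze_chain(hops: List[str]) -> Dict[str, object]:
    """Summarize a redirect chain given as an ordered list of visited URLs."""
    hosts = [urlsplit(u).hostname or "" for u in hops]
    hop_count = max(len(hops) - 1, 0)
    # Count transitions where the host changes between consecutive hops.
    cross_domain = sum(1 for a, b in zip(hosts, hosts[1:]) if a != b)
    return {
        "hop_count": hop_count,
        "cross_domain_hops": cross_domain,
        "loop_detected": len(set(hops)) < len(hops),   # same URL visited twice
        "too_many_hops": hop_count > MAX_HOPS,
    }


def has_client_side_redirect(html: str) -> bool:
    """Flag pages that redirect via meta refresh or window.location."""
    return CLIENT_REDIRECT_RE.search(html) is not None
```

A regex match here is only a hint that the page may redirect in the browser, which is why the source flags such pages for Phase 2 rather than scoring them directly.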

Module IV – SSL/TLS Certificate Inspection

HTTPS no longer implies safety. This module inspects the quality of the certificate, not just its presence.

Check	Logic	Risk Implication
Certificate type	DV (domain-only) vs OV/EV (organization verified)	DV certs are free, automated — standard for phishing
Issuer	Flags Let's Encrypt, cPanel issuers on login pages	High-risk combination
Certificate age	`Current Time − notBefore`	Under 48 hours → critical "burn domain" signal
Protocol & cipher suite	Checks for deprecated protocols (SSLv3, TLS 1.0) and weak ciphers (RC4, NULL)	Indicates neglected or compromised server
SNI compatibility	`server_hostname` included in socket handshake	Required for multi-tenant hosts
Self-signed certs	Handshake errors caught, flagged as "Invalid/Untrusted"	Never crash — always report
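
As an illustration of the certificate age check, the sketch below computes `Current Time − notBefore` with the standard library's `ssl.cert_time_to_seconds`. The helper names are hypothetical, and `fetch_not_before` is shown only to indicate where the SNI-aware handshake would happen; a caller would need to catch `ssl.SSLError` for self-signed or otherwise untrusted certificates.

```python
import socket
import ssl
import time
from typing import Optional

BURN_THRESHOLD_HOURS = 48  # threshold from the table above


def cert_age_hours(not_before: str, now: Optional[float] = None) -> float:
    """Hours elapsed since the certificate's notBefore timestamp."""
    issued = ssl.cert_time_to_seconds(not_before)
    current = time.time() if now is None else now
    return (current - issued) / 3600


def is_burn_cert(not_before: str, now: Optional[float] = None) -> bool:
    """True when the certificate was issued less than 48 hours ago."""
    return cert_age_hours(not_before, now) < BURN_THRESHOLD_HOURS


def fetch_not_before(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch notBefore from a live server (network access required)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # server_hostname enables SNI, as the table requires for multi-tenant hosts.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notBefore"]
```

Passing `now` explicitly keeps the age calculation deterministic in tests, while production code would rely on the default of the current time.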

Module V – Domain Reputation & History

A domain's registration history is one of the strongest predictors of malicious intent.

Signal	Threshold	Risk Level
Domain age	< 7 days since registration	Critical
Domain age	< 30 days since registration	High
Suspicious TLD	`.xyz`, `.top`, `.tk`, `.gq`, `.zip`	Medium (amplified by other signals)
WHOIS privacy/redaction	Creation date missing	Indeterminate (confidence-adjusted)
Rate limiting	Handled via Redis cache (24-hour TTL) + rotating proxies	Operational

Data is fetched via WHOIS (port 43) or the modern RDAP JSON API. The tldextract library isolates the effective second-level domain before lookup.
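
The thresholds in the table above could be applied along these lines. The sketch assumes the creation date has already been obtained from WHOIS or RDAP as a timezone-aware `datetime` (or `None` when it is redacted); the function names and the `"none"` label are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

SUSPICIOUS_TLDS = {"xyz", "top", "tk", "gq", "zip"}


def domain_age_risk(created: Optional[datetime], now: Optional[datetime] = None) -> str:
    """Map a registration date to the risk levels from the table above."""
    if created is None:
        return "indeterminate"          # WHOIS privacy/redaction: no creation date
    now = now or datetime.now(timezone.utc)
    age_days = (now - created).days
    if age_days < 7:
        return "critical"
    if age_days < 30:
        return "high"
    return "none"


def has_suspicious_tld(domain: str) -> bool:
    """Check the last label of the domain against the suspicious TLD list."""
    return domain.rsplit(".", 1)[-1].lower() in SUSPICIOUS_TLDS
```

Returning `"indeterminate"` instead of guessing keeps a missing creation date from being mistaken for either a new or an old domain.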

Module VI – Advanced Lexical Analysis & Phishing Detection

This module analyzes the text of the URL for visual and semantic deception patterns.

Typosquatting Detection
Levenshtein distance is calculated between the analyzed domain and a reference list of high-value phishing targets (Google, PayPal, Amazon, Microsoft, etc.). A distance of 1–2 flags the domain as a potential typosquat (e.g., gooogle.com). Jaro-Winkler similarity is additionally used to detect subdomain spoofing. For performance, checks are limited to the top 50–100 most-phished brands and use the C-optimized python-Levenshtein library.
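
To make the distance check concrete, here is a pure-Python edit distance used as a stand-in for the python-Levenshtein library named above. The brand list is a tiny illustrative subset and `typosquat_target` is a hypothetical helper.

```python
from typing import Optional

BRANDS = ["google", "paypal", "amazon", "microsoft"]   # illustrative subset only


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def typosquat_target(label: str, max_distance: int = 2) -> Optional[str]:
    """Return the brand that `label` imitates, or None if there is no near match."""
    for brand in BRANDS:
        # Distance 0 is the genuine brand name, so only 1..max_distance is flagged.
        if 1 <= levenshtein(label.lower(), brand) <= max_distance:
            return brand
    return None
```

Excluding a distance of 0 matters: without it, the legitimate domain itself would be reported as a typosquat of its own brand.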

Keyword Analysis
The URL is scanned for trust-inducing keywords in subdomains and paths: login, secure, account, verify, update, support, billing. A URL like paypal-secure.com or apple.verify-id.com triggers a Suspicious Keyword flag.
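
One simple way to implement the keyword scan is to split the host and path into tokens and intersect them with the keyword list. This is a sketch under that assumption; token matching will miss keywords glued to other text (for example `securelogin`), which a substring search would catch at the cost of more false positives.

```python
import re
from typing import Set
from urllib.parse import urlsplit

TRUST_KEYWORDS = {"login", "secure", "account", "verify", "update", "support", "billing"}


def suspicious_keywords(url: str) -> Set[str]:
    """Return the trust-inducing keywords found in the host and path of `url`."""
    parts = urlsplit(url)
    text = f"{parts.hostname or ''}{parts.path}".lower()
    # Split on anything that is not a letter or digit (dots, hyphens, slashes).
    tokens = set(re.split(r"[^a-z0-9]+", text))
    return tokens & TRUST_KEYWORDS
```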

Shannon Entropy (DGA Detection)
$$H(X) = - \sum_{i=1}^{n} P(x_i) \log_2 P(x_i)$$

High entropy in a domain name indicates random character distribution — a hallmark of Domain Generation Algorithms used by malware C2 servers (e.g., xkzj194.com). Known CDN providers (AWS, Akamai) are whitelisted to prevent false positives.
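
The entropy formula above translates directly into a few lines of Python. The function name `shannon_entropy` is an assumption, and the threshold at which a domain counts as "high entropy" is not specified in this document, so none is hard-coded here.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character of `text`."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H(X) = -sum(p * log2(p)) over the observed character frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A label made of one repeated character scores 0 bits, while a label whose characters are all distinct reaches the maximum for its length, which is why random-looking DGA names score higher than ordinary words.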

Module VII – Machine Learning Risk Prediction

Heuristic rules catch known patterns. The ML module catches novel attacks through probabilistic pattern matching.

Feature Vector

Feature Category	Features
Lexical	URL length, domain length, dot/hyphen/digit count, `@` presence, path depth
Statistical	Shannon entropy of domain and path
Host-Based	Domain age (days), SSL validity (binary), redirect count, is_HTTPS (binary)
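
The lexical part of the feature vector could be extracted as sketched below. The function name and the dictionary keys are hypothetical; host-based features such as domain age and redirect count would be appended from the earlier modules before the row is handed to the classifier.

```python
from typing import Dict
from urllib.parse import urlsplit


def lexical_features(url: str) -> Dict[str, int]:
    """Compute the purely string-based features listed in the table above."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "digit_count": sum(ch.isdigit() for ch in url),
        "has_at": int("@" in url),
        # Path depth counts non-empty path segments.
        "path_depth": len([seg for seg in parts.path.split("/") if seg]),
        "is_https": int(parts.scheme == "https"),
    }
```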

Model: Random Forest Classifier
Trained on balanced datasets from PhishTank/OpenPhish (malicious) and Alexa/Tranco Top 1M (benign). Random Forest is chosen over deep learning for three reasons: it performs better on tabular URL feature data, it is resistant to overfitting with smaller datasets, and it produces feature importance scores — making its verdicts explainable rather than a black box.

Inference uses model.predict_proba() to output a 0.0–1.0 probability that feeds directly into the Risk Score formula. An MLOps feedback loop retrains the model periodically on false positives/negatives reported by users and Phase 2 scans to counter model drift.

Module VIII – Static Content Inspection

A lightweight HTML parser that looks for gross security violations in the page source without rendering it.

Insecure Login Form Detection
Using BeautifulSoup (bs4):

Parse the HTML body
Find <input type="password"> elements
Check the parent <form action="..."> attribute
Violation: If the page URL or form action uses http:// → immediately flagged as "Insecure Login Form" (credential theft risk)

If <script> tags are prevalent but no forms are found, the system notes "Dynamic Content Detected" and recommends Phase 2 analysis — covering React/Angular apps where forms are JS-generated.
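
The checks above can be illustrated with the standard library's `html.parser`, used here as a dependency-free stand-in for BeautifulSoup. The class and function names are hypothetical, and treating "prevalent" as "at least one script tag" is an assumption made to keep the sketch short.

```python
from html.parser import HTMLParser


class LoginFormScanner(HTMLParser):
    """Collects the form action for every password input found in the page."""

    def __init__(self):
        super().__init__()
        self._in_form = False
        self._form_action = ""
        self.password_form_actions = []
        self.form_count = 0
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._in_form = True
            self.form_count += 1
            self._form_action = attrs.get("action") or ""
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            # Record the enclosing form's action (empty if outside any form).
            self.password_form_actions.append(self._form_action if self._in_form else "")
        elif tag == "script":
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag == "form":
            self._in_form = False


def inspect_html(page_url: str, html: str) -> str:
    scanner = LoginFormScanner()
    scanner.feed(html)
    for action in scanner.password_form_actions:
        if page_url.lower().startswith("http://") or action.lower().startswith("http://"):
            return "Insecure Login Form"
    if scanner.script_count > 0 and scanner.form_count == 0:
        return "Dynamic Content Detected"
    return "OK"
```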

🧮 Risk Scoring & Fusion

All eight module outputs are synthesized into a single Risk Score (0–100) using a Weighted Risk Fusion model:

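$$\text{Risk Score} = \min\left(100,\; (\alpha \cdot S_{ML}) + \sum_i (P_i \times W_i)\right)$$

Here $S_{ML}$ is the malice probability from Module VII, $\alpha$ is the weight applied to it, and $P_i$ and $W_i$ are the penalty and weight of the $i$-th detected heuristic signal.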

If a Critical Indicator is detected (e.g., phishing blacklist hit, insecure login form), the score is immediately forced to 100.

Detected Signal	Severity	Penalty
Typosquatting Match	High	+50
Domain Age < 7 Days	High	+40
ML Probability > 80%	High	+30
Cross-Domain Redirect	Low	+15
Suspicious Keyword	Medium	+20
DV SSL Certificate	Low	+10
Insecure Login Form (HTTP)	Critical	+100 (Override)

If a check fails (e.g., WHOIS timeout), it is marked Indeterminate and the remaining weights are normalized — the score reflects only verified data without artificially deflating the result.
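
A minimal sketch of the fusion formula with the critical override might look like this. The penalties come from the table above, but the value of `alpha`, the default weight of 1.0, and the signal names are assumptions, since the document does not fix them. Normalization for Indeterminate checks is not modelled here.

```python
from typing import Dict, Iterable, Optional

# Penalties from the table above; the keys are hypothetical signal names.
PENALTIES = {
    "typosquatting_match": 50,
    "domain_age_lt_7_days": 40,
    "ml_probability_gt_80": 30,
    "suspicious_keyword": 20,
    "cross_domain_redirect": 15,
    "dv_ssl_certificate": 10,
}
CRITICAL = {"insecure_login_form", "blacklist_hit"}


def risk_score(
    ml_probability: float,
    signals: Iterable[str],
    alpha: float = 20.0,                       # assumed weight for the ML score
    weights: Optional[Dict[str, float]] = None,
) -> int:
    """Fuse the ML probability and heuristic penalties into a 0-100 score."""
    signals = set(signals)
    if signals & CRITICAL:
        return 100                             # critical indicators override everything
    weights = weights or {}
    heuristic = sum(PENALTIES[s] * weights.get(s, 1.0) for s in signals if s in PENALTIES)
    return min(100, round(alpha * ml_probability + heuristic))
```

With these assumed values, an ML probability of 0.5 plus a suspicious keyword and a DV certificate gives 10 + 20 + 10 = 40.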

🚦 Result Tiers & Explainability

Score	Verdict	UI
0 – 30	Safe	🟢 Green Shield — "Safe to Visit"
31 – 69	Caution	🟡 Warning — "Proceed with Caution" (Phase 2 suggested)
70 – 100	High Risk	🔴 Red Alert — "Dangerous Link Detected"
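
The tier boundaries in the table above map onto a small lookup function. The name `verdict` is hypothetical; the cut-offs are taken directly from the table.

```python
def verdict(score: int) -> str:
    """Translate a 0-100 risk score into the verdict tiers from the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 30:
        return "Safe"
    if score <= 69:
        return "Caution"
    return "High Risk"
```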

Every verdict includes a Reasoning section — a plain-English list of exactly why the score was assigned:

"Domain registered only 3 days ago."
"Contains login keywords but uses a low-trust DV certificate."
"Redirects through 3 different domains."

This transparency builds user trust and turns SafeNav into an educational tool, not just a black-box filter.

🎨 Frontend

Built with React (Vite)
Responsible for user interaction and result visualization
Communicates with backend APIs to request URL analysis and display safety reports

🛠 Tech Stack

🎨 Frontend

Technology	Role
React	UI framework — component-based result dashboard
Vite	Build tool — fast hot-module reload in development
	Primary frontend language
HTML / CSS	Markup & styling

⚙️ Backend

Technology	Role
Python	Core language — analysis pipeline, ML, networking
FastAPI	ASGI web framework — async-first, handles concurrent module calls
scikit-learn	Random Forest classifier — ML risk prediction engine
Redis	WHOIS result caching (24h TTL) + Celery message broker
Celery	Parallel task dispatch — runs modules concurrently
`BeautifulSoup (bs4)`	Static HTML parsing — insecure form detection
`tldextract`	Accurate domain/subdomain/TLD isolation
`dnspython` · `ssl` · `socket`	DNS resolution, TLS handshake, certificate retrieval
`python-Levenshtein`	C-optimized edit distance — typosquatting detection
`httpx` / `requests`	HTTP client — redirect tracing, User-Agent masquerading

🚀 Infrastructure & Tooling

Technology	Role
Docker	Containerization — encapsulates runtime, OpenSSL, ML models
Docker Compose	Orchestrates frontend + backend + Redis as a unified stack
Git / GitHub	Version control & repository hosting
	Primary development environment

📌 Use Cases

Phishing link detection
Unsafe website analysis
Educational research on web security and threat detection
Full-stack development practice with a security focus

📈 Project Status

Current Phase: Phase 1 – Static Analysis Engine (In Development)

Phase	Description	Status
Phase 1	Static Analysis Engine — 8-module URL inspection pipeline	🔄 In Development
Phase 2	Dynamic Analysis — full browser sandboxing & JS execution	🔜 Planned
Phase 3	MLOps Pipeline — automated model retraining on new threat data	🔜 Planned
Phase 4	Scale & Deploy — Kubernetes horizontal scaling, extended reporting	🔜 Planned

What's done in Phase 1:

✅ Architecture fully designed and documented
✅ All 8 analysis modules specified (normalization → ML → risk fusion)
✅ Weighted Risk Scoring algorithm defined
✅ Docker Compose full-stack setup
🔄 Module implementation in progress

Coming next:

Phase 2 dynamic sandboxing (headless browser, JS execution, behavioral fingerprinting)
MLOps feedback loop for continuous model improvement
Kubernetes-based horizontal scaling for production workloads

▶️ How to Run SafeNav

SafeNav can be executed in two different modes depending on the use case:

Docker Mode – Recommended for demo, evaluation, and deployment
Development Mode – Recommended while coding and debugging

🐳 Running with Docker (Recommended)

This mode runs the frontend, backend, and database together using Docker Compose.

Prerequisites

Docker Desktop installed
Docker Compose enabled

Steps

Clone the repository:

git clone https://github.com/su7ox/SafeNav.git
cd SafeNav

Build and start all services:

docker-compose up -d

Verify running containers:

docker ps

Access the Application

Frontend UI: http://localhost:5173
Backend API: http://localhost:8000
API Documentation (Swagger): http://localhost:8000/docs

Stop the Application

docker-compose down

Apply Code Changes

Docker does not automatically pick up code changes. Rebuild the images and restart the containers:

docker-compose build
docker-compose up -d

🧑‍💻 Running in Development Mode (Without Docker)

This mode supports hot reload and is recommended during development.

🔹 Backend (FastAPI)

Prerequisites

Python 3.10 or higher

Steps

Navigate to backend directory:

cd backend

Create virtual environment (one-time):

python -m venv venv

Activate virtual environment (Windows):

venv\Scripts\activate

On macOS/Linux, use instead:

source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Run backend server:

uvicorn app.main:app --reload

Backend will be available at:

http://localhost:8000

🔹 Frontend (React + Vite)

Prerequisites

Node.js (LTS version recommended)

Steps

Open a new terminal and navigate to frontend directory:

cd frontend

Install dependencies:

npm install

Start frontend development server:

npm run dev

Frontend will be available at:

http://localhost:5173

👤 Author

su7ox

GitHub: @su7ox
