Python AI Network Analysis

A machine learning-based network traffic analysis and anomaly detection system — a complete solution from data collection to model deployment


Overview

This project is an end-to-end network anomaly detection system for sample DNS logs and network flow records. It covers data collection, parsing, feature extraction, model training, local visualization, and service-style inference through a small Flask API.

The bundled sample dataset and demo scripts support the following 9 traffic categories:

| Threat Type | Description |
| --- | --- |
| C2 Beaconing (malicious) | Command & control communication with periodic, low-volume persistent connections |
| Data Exfiltration (data_exfil) | Large outbound data transfers with high upload and low download volume |
| Port Scanning (scan) | SYN-only probing across multiple destination ports |
| DDoS Attack (ddos) | High-volume packet bursts from distributed source addresses |
| Lateral Movement (lateral_movement) | Internal host-to-host communication over sensitive ports |
| Brute Force (brute_force) | High-frequency short SSH/RDP connections with repeated authentication failures |
| DNS Tunneling (dns_tunnel) | High-frequency, large-payload UDP traffic over the DNS port |
| Cryptomining (cryptomining) | Sustained connections to mining-pool ports with stable bidirectional traffic |
| Normal Traffic (normal) | Legitimate HTTP/HTTPS/DNS/NTP and other business traffic |

Capabilities

  • Data Collection: Supports batch CSV loading and streaming reads (generator mode)
  • Log Parsing: Regex-based parsing of DNS logs and NetFlow traffic records
  • Feature Engineering: Extracts 12+ statistical features (byte rate, packet rate, entropy, DGA detection, etc.)
  • Multi-Model Support: Random Forest / Gradient Boosting / Isolation Forest
  • REST API: Flask-based deployment with /api/v1/predict, /api/v1/health, and model metadata endpoints
  • Containerization: One-command Docker build with automated GitHub Actions CI testing
  • Visualization: Multi-workspace Plotly/Dash dashboard with Overview, Incidents, Sources, and DNS views
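
The entropy-based DGA check mentioned above can be illustrated with a short sketch. Note that `shannon_entropy` is a hypothetical helper written for this example; the project's actual `FeatureExtractor` implementation may differ:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character of a string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Algorithmically generated (DGA-style) labels tend to score noticeably
# higher than dictionary-word domains, which is the intuition behind
# using entropy as one of the statistical features.
print(shannon_entropy("google"))        # lower
print(shannon_entropy("xj9q2kf7zp1v"))  # higher
```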

Repository Layout

  • src/: Core pipeline code for shared configuration, data collection, parsing, feature extraction, and model training/inference.
  • visualization/: Plotly and dashboard helper code used by the Dash UI.
  • data/: Bundled sample DNS and flow datasets plus dataset notes in data/README.md.
  • examples/: Runnable demo scripts, including examples/demo_analysis.py.
  • scripts/: Utility scripts such as scripts/generate_data.py for regenerating sample data.
  • tests/: Unit, integration, API, and dashboard smoke tests.
  • models/: Generated model artifacts. Git only keeps .gitkeep; trained model files are created locally.
  • .github/workflows/: GitHub Actions CI configuration.
  • .env.example: Example runtime configuration for local and Docker deployment.
  • api_server.py: Flask API entry point.
  • dashboard_server.py: Dash dashboard entry point.
  • docker-compose.yml: Docker deployment for the API and dashboard together.
  • start.sh: Local bootstrap script that installs dependencies, generates data if needed, trains the model if needed, and starts the API and dashboard.
  • stop.sh: Local shutdown script for the API and dashboard.
  • pyproject.toml: Packaging metadata, console scripts, and tool configuration.
  • Dockerfile: Base container image for the API and dashboard services.

Installation

Prerequisites:

  • Python 3.9+
  • pip

git clone https://github.com/zenithyangg/python-ai-network-analysis
cd python-ai-network-analysis

python -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r requirements.txt
pip install -e .

For development and testing:

pip install -r requirements-dev.txt
pip install -e ".[dev]"

Quick Start

Start Everything Locally

bash start.sh

This starts:

  • Dashboard: http://localhost:8050
  • API health: http://localhost:55000/api/v1/health

Run the End-to-End Demo

python examples/demo_analysis.py

The demo prints data collection stats, feature extraction output, model metrics, suspicious connection summaries, and DNS anomaly highlights.

Deployment

The system uses the same operator commands in every mode:

  • bash start.sh
  • bash stop.sh

Both scripts support:

  • RUN_MODE=local: local Python processes
  • RUN_MODE=docker: Docker Compose deployment for local machines or Linux VPS hosts

Local Python Mode

bash start.sh
bash stop.sh

Docker Mode for Local Machine or VPS

cp .env.example .env

Then edit .env as needed:

  • Local Docker: keep PUBLIC_HOST=localhost
  • Linux VPS: set PUBLIC_HOST=<your-server-ip-or-domain>
  • Linux VPS: set AUTO_OPEN_BROWSER=0
  • Use API_BIND_IP=127.0.0.1 and DASHBOARD_BIND_IP=127.0.0.1 for local-only exposure
  • Use API_BIND_IP=0.0.0.0 and DASHBOARD_BIND_IP=0.0.0.0 to expose the system on a VPS
  • Set DASHBOARD_THEME=dark or DASHBOARD_THEME=light for the default UI theme
  • Keep DATA_SOURCE_MODE=local_files for the bundled CSV snapshot, or switch to sqlite
  • Set RUN_MODE=docker if you want bash start.sh and bash stop.sh to drive Docker by default
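
Putting those settings together, a VPS-oriented .env might look like the following (values are illustrative; .env.example is the authoritative list of supported variables):

```shell
# Illustrative .env for a Linux VPS deployment
RUN_MODE=docker
PUBLIC_HOST=203.0.113.10   # example address; use your server IP or domain
AUTO_OPEN_BROWSER=0
API_BIND_IP=0.0.0.0
DASHBOARD_BIND_IP=0.0.0.0
DASHBOARD_THEME=dark
DATA_SOURCE_MODE=local_files
```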

Then start and stop the deployed system with the same scripts:

bash start.sh
bash stop.sh

Use the Python Components Directly

from src.data_collector import NetworkDataCollector
from src.parser import NetworkLogParser
from src.feature_extractor import FeatureExtractor
from src.model import AnomalyDetector

# Collect → Parse → Extract features
collector = NetworkDataCollector()
flow_df = collector.collect_flow_data("data/sample_flow_data.csv")

parser = NetworkLogParser()
flow_df = parser.parse_flow_dataframe(flow_df)

extractor = FeatureExtractor()
features = extractor.extract_flow_features(flow_df)

# Train and detect
detector = AnomalyDetector(model_type="random_forest")
X = features[["bytes_sent", "bytes_received", "packets_sent",
              "packets_received", "duration_sec", "byte_rate",
              "packet_rate", "sent_recv_ratio"]].fillna(0)
y = (features["label"] != "normal").astype(int)
metrics = detector.train(X, y)
print(metrics)  # {'accuracy': ..., 'precision': ..., 'recall': ..., 'f1': ...}
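
The start.sh script persists the trained model as a pickle file. The round-trip can be sketched generically as follows, with a plain dict standing in for the trained AnomalyDetector (whether the real class is pickled exactly this way is an assumption):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; the real trained AnomalyDetector instance would be
# serialized the same way (the project stores it at models/classifier.pkl).
artifact = {"model_type": "random_forest", "metrics": {"f1": 0.97}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.pkl"
    path.write_bytes(pickle.dumps(artifact))    # save after training
    restored = pickle.loads(path.read_bytes())  # load at API startup

print(restored["model_type"])  # random_forest
```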

Services

REST API

python api_server.py

After installation, you can also run:

network-analysis-api

API base URL: http://localhost:55000

If models/classifier.pkl is not present yet, run python examples/demo_analysis.py first, or use bash start.sh, which will generate the model automatically.

curl -X POST http://localhost:55000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
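
The same request can be issued from Python with only the standard library; the endpoint and field names are taken from the curl example above, and the API server must already be running:

```python
import json
import urllib.request

record = {
    "src_port": 52431, "dst_port": 443,
    "bytes_sent": 1024, "bytes_received": 65536,
    "packets_sent": 15, "packets_received": 42,
    "duration_sec": 3.5,
}
payload = json.dumps({"records": [record]}).encode()

req = urllib.request.Request(
    "http://localhost:55000/api/v1/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp))
except OSError as exc:  # server not running, connection refused, etc.
    print(f"API not reachable: {exc}")
```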

Dashboard

python dashboard_server.py

After installation, you can also run:

network-analysis-dashboard

Dashboard URL: http://localhost:8050

The dashboard is organized into four workspaces — Overview, Incidents, Sources, and DNS — and includes drilldown selectors for incidents, source IPs, and domains, plus a runtime dark/light theme switch.

Data source (default = bundled CSV):

  • By default, the repository's bundled data/sample_flow_data.csv and data/sample_dns_logs.csv are used; start.sh will generate them automatically if they are missing.

  • To use SQLite, configure via environment variables only — no code changes required:

    1. Generate the sample database:
      python - <<'PY'
      import pandas as pd, sqlite3
      conn = sqlite3.connect("data/network_analysis.db")
      pd.read_csv("data/sample_flow_data.csv").to_sql("flow_records", conn, if_exists="replace", index=False)
      pd.read_csv("data/sample_dns_logs.csv").to_sql("dns_logs", conn, if_exists="replace", index=False)
      conn.close()
      PY
    2. Set the following in .env:
      DATA_SOURCE_MODE=sqlite
      SQLITE_PATH=data/network_analysis.db
      SQLITE_FLOW_TABLE=flow_records
      SQLITE_DNS_TABLE=dns_logs
      
    3. Restart with bash stop.sh && bash start.sh

    Docker reads these same environment variables.
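
After loading, a quick sanity check confirms the two tables exist and are populated. The sketch below runs against a throwaway in-memory database; point connect() at data/network_analysis.db to inspect the file created in step 1:

```python
import sqlite3

def table_counts(conn: sqlite3.Connection) -> dict:
    """Return row counts for every user table in the database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

# Demo on an in-memory DB; use sqlite3.connect("data/network_analysis.db")
# to check the real flow_records and dns_logs tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_records (src TEXT)")
conn.executemany("INSERT INTO flow_records VALUES (?)",
                 [("10.0.0.1",), ("10.0.0.2",)])
print(table_counts(conn))  # {'flow_records': 2}
```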

Model role:

  • The model calculates anomaly predictions and risk scores for traffic, powering the /api/v1/predict endpoint and the risk ranking in the dashboard. The model file is stored at models/classifier.pkl by default and is trained and generated automatically during start.sh or the Docker build.
  • If the model is missing, the API cannot provide inference; the dashboard will fall back to the labels present in the data (if any), but risk scores and model-based insights will be unavailable.

Model in action (quickly verify ML is working):

  1. Generate/update the model and sample data:
    bash start.sh    # or: python examples/demo_analysis.py
  2. Call the API for inference (returns anomaly/normal + probability):
    curl -X POST http://localhost:55000/api/v1/predict \
      -H "Content-Type: application/json" \
      -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
  3. In the dashboard, view the alert queue, risk ranking, and event timeline (all depend on the model output fields prediction and risk_score).

Docker

cp .env.example .env
# Set RUN_MODE=docker in .env if Docker should be the default runtime
bash start.sh

Docker mode starts both the API and the dashboard via docker-compose.yml. The image builds a model artifact during image creation, so /api/v1/health and /api/v1/predict are usable immediately after startup.

Testing

pytest -q

pytest tests/ --cov=src --cov-report=term-missing

Current status: all 45 test cases pass locally.

Sample Data

The repository includes simulated enterprise network telemetry in data/:

  • 500 DNS log entries: Normal queries, DGA domains, DNS tunneling, NXDOMAIN flooding
  • 500 traffic records: 9 label types covering normal and malicious scenarios

Data is generated by scripts/generate_data.py with a fixed seed for reproducibility. See data/README.md for field definitions and label distribution details.
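
The fixed-seed approach used by scripts/generate_data.py can be illustrated generically (the actual script's internals are not shown here; make_flow_records is a hypothetical stand-in):

```python
import random

def make_flow_records(seed: int, n: int = 3) -> list:
    """Generate n synthetic flow records deterministically from a seed."""
    rng = random.Random(seed)  # local RNG, independent of global state
    return [{"src_port": rng.randint(1024, 65535),
             "bytes_sent": rng.randint(64, 1_000_000)}
            for _ in range(n)]

# The same seed always yields the same records, so the bundled dataset
# can be regenerated reproducibly across runs and machines.
assert make_flow_records(42) == make_flow_records(42)
```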

Tech Stack

| Category | Technology |
| --- | --- |
| Language | Python 3.9+ |
| Data Processing | pandas, numpy |
| Machine Learning | scikit-learn (RandomForest, GradientBoosting, IsolationForest) |
| Visualization | matplotlib, seaborn, Plotly, Dash |
| Web API | Flask |
| Containerization | Docker |
| CI/CD | GitHub Actions |
| Testing | pytest, pytest-cov |

Contributing

Contributions, bug reports, and suggestions are welcome! Please refer to CONTRIBUTING.md.

License

This project is licensed under the MIT License.
