Python AI Network Analysis

A machine learning-based network traffic analysis and anomaly detection system — a complete solution from data collection to model deployment


Overview

This project is an end-to-end network anomaly detection system for sample DNS logs and network flow records. It covers data collection, parsing, feature extraction, model training, local visualization, and service-style inference through a small Flask API.

The bundled sample dataset and demo scripts support the following 9 traffic categories:

| Threat Type | Description |
| --- | --- |
| C2 Beaconing (malicious) | Command & control communication with periodic, low-volume persistent connections |
| Data Exfiltration (data_exfil) | Large outbound data transfers with high upload and low download volume |
| Port Scanning (scan) | SYN-only probing across multiple destination ports |
| DDoS Attack (ddos) | High-volume packet bursts from distributed source addresses |
| Lateral Movement (lateral_movement) | Internal host-to-host communication over sensitive ports |
| Brute Force (brute_force) | High-frequency short SSH/RDP connections with repeated authentication failures |
| DNS Tunneling (dns_tunnel) | High-frequency, large-payload UDP traffic over the DNS port |
| Cryptomining (cryptomining) | Sustained connections to mining-pool ports with stable bidirectional traffic |
| Normal Traffic (normal) | Legitimate HTTP/HTTPS/DNS/NTP and other business traffic |

Capabilities

  • Data Collection: Supports batch CSV loading and streaming reads (generator mode)
  • Log Parsing: Regex-based parsing of DNS logs and NetFlow traffic records
  • Feature Engineering: Extracts 12+ statistical features (byte rate, packet rate, entropy, DGA detection, etc.)
  • Multi-Model Support: Random Forest / Gradient Boosting / Isolation Forest
  • REST API: Flask-based deployment with /api/v1/predict, /api/v1/health, and model metadata endpoints
  • Containerization: One-command Docker build with automated GitHub Actions CI testing
  • Visualization: Multi-workspace Plotly/Dash dashboard with Overview, Incidents, Sources, and DNS views
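
The entropy-based DGA check mentioned above can be illustrated with a short sketch. Note that `shannon_entropy` is a hypothetical helper written for this example; the project's actual `FeatureExtractor` implementation may differ:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy in bits per character of a string."""
    counts = Counter(s)
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Algorithmically generated (DGA-style) labels tend to score noticeably
# higher than dictionary-word domains, which is the intuition behind
# using entropy as one of the statistical features.
print(shannon_entropy("google"))        # lower
print(shannon_entropy("xj9q2kf7zp1v"))  # higher
```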

Repository Layout

  • src/: Core pipeline code for shared configuration, data collection, parsing, feature extraction, and model training/inference.
  • visualization/: Plotly and dashboard helper code used by the Dash UI.
  • data/: Bundled sample DNS and flow datasets plus dataset notes in data/README.md.
  • examples/: Runnable demo scripts, including examples/demo_analysis.py.
  • scripts/: Utility scripts such as scripts/generate_data.py for regenerating sample data.
  • tests/: Unit, integration, API, and dashboard smoke tests.
  • models/: Generated model artifacts. Git only keeps .gitkeep; trained model files are created locally.
  • .github/workflows/: GitHub Actions CI configuration.
  • .env.example: Example runtime configuration for local and Docker deployment.
  • api_server.py: Flask API entry point.
  • dashboard_server.py: Dash dashboard entry point.
  • docker-compose.yml: Docker deployment for the API and dashboard together.
  • start.sh: Local bootstrap script that installs dependencies, generates data if needed, trains the model if needed, and starts the API and dashboard.
  • stop.sh: Local shutdown script for the API and dashboard.
  • pyproject.toml: Packaging metadata, console scripts, and tool configuration.
  • Dockerfile: Base container image for the API and dashboard services.

Installation

Prerequisites:

  • Python 3.9+
  • pip

git clone https://github.com/zenithyangg/python-ai-network-analysis
cd python-ai-network-analysis

python -m venv venv
source venv/bin/activate  # macOS/Linux
# venv\Scripts\activate   # Windows

pip install -r requirements.txt
pip install -e .

For development and testing:

pip install -r requirements-dev.txt
pip install -e ".[dev]"

Quick Start

Start Everything Locally

bash start.sh

This starts:

  • Dashboard: http://localhost:8050
  • API health: http://localhost:55000/api/v1/health

Run the End-to-End Demo

python examples/demo_analysis.py

The demo prints data collection stats, feature extraction output, model metrics, suspicious connection summaries, and DNS anomaly highlights.

Deployment

The system uses the same operator commands in every mode:

  • bash start.sh
  • bash stop.sh

Both scripts support:

  • RUN_MODE=local: local Python processes
  • RUN_MODE=docker: Docker Compose deployment for local machines or Linux VPS hosts

Local Python Mode

bash start.sh
bash stop.sh

Docker Mode for Local Machine or VPS

cp .env.example .env

Then edit .env as needed:

  • Local Docker: keep PUBLIC_HOST=localhost
  • Linux VPS: set PUBLIC_HOST=<your-server-ip-or-domain>
  • Linux VPS: set AUTO_OPEN_BROWSER=0
  • Use API_BIND_IP=127.0.0.1 and DASHBOARD_BIND_IP=127.0.0.1 for local-only exposure
  • Use API_BIND_IP=0.0.0.0 and DASHBOARD_BIND_IP=0.0.0.0 to expose the system on a VPS
  • Set DASHBOARD_THEME=dark or DASHBOARD_THEME=light for the default UI theme
  • Keep DATA_SOURCE_MODE=local_files for the bundled CSV snapshot, or switch to sqlite
  • Set RUN_MODE=docker if you want bash start.sh and bash stop.sh to drive Docker by default
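
Putting those settings together, a VPS-oriented .env might look like the following (values are illustrative; .env.example is the authoritative list of supported variables):

```shell
# Illustrative .env for a Linux VPS deployment
RUN_MODE=docker
PUBLIC_HOST=203.0.113.10   # example address; use your server IP or domain
AUTO_OPEN_BROWSER=0
API_BIND_IP=0.0.0.0
DASHBOARD_BIND_IP=0.0.0.0
DASHBOARD_THEME=dark
DATA_SOURCE_MODE=local_files
```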

Then start and stop the deployed system with the same scripts:

bash start.sh
bash stop.sh

Use the Python Components Directly

from src.data_collector import NetworkDataCollector
from src.parser import NetworkLogParser
from src.feature_extractor import FeatureExtractor
from src.model import AnomalyDetector

# Collect → Parse → Extract features
collector = NetworkDataCollector()
flow_df = collector.collect_flow_data("data/sample_flow_data.csv")

parser = NetworkLogParser()
flow_df = parser.parse_flow_dataframe(flow_df)

extractor = FeatureExtractor()
features = extractor.extract_flow_features(flow_df)

# Train and detect
detector = AnomalyDetector(model_type="random_forest")
X = features[["bytes_sent", "bytes_received", "packets_sent",
              "packets_received", "duration_sec", "byte_rate",
              "packet_rate", "sent_recv_ratio"]].fillna(0)
y = (features["label"] != "normal").astype(int)
metrics = detector.train(X, y)
print(metrics)  # {'accuracy': ..., 'precision': ..., 'recall': ..., 'f1': ...}
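
The start.sh script persists the trained model as a pickle file. The round-trip can be sketched generically as follows, with a plain dict standing in for the trained AnomalyDetector (whether the real class is pickled exactly this way is an assumption):

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in object; the real trained AnomalyDetector instance would be
# serialized the same way (the project stores it at models/classifier.pkl).
artifact = {"model_type": "random_forest", "metrics": {"f1": 0.97}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "classifier.pkl"
    path.write_bytes(pickle.dumps(artifact))    # save after training
    restored = pickle.loads(path.read_bytes())  # load at API startup

print(restored["model_type"])  # random_forest
```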

Services

REST API

python api_server.py

After installation, you can also run:

network-analysis-api

API base URL: http://localhost:55000

If models/classifier.pkl is not present yet, run python examples/demo_analysis.py first, or use bash start.sh, which will generate the model automatically.

curl -X POST http://localhost:55000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
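
The same request can be issued from Python with only the standard library; the endpoint and field names are taken from the curl example above, and the API server must already be running:

```python
import json
import urllib.request

record = {
    "src_port": 52431, "dst_port": 443,
    "bytes_sent": 1024, "bytes_received": 65536,
    "packets_sent": 15, "packets_received": 42,
    "duration_sec": 3.5,
}
payload = json.dumps({"records": [record]}).encode()

req = urllib.request.Request(
    "http://localhost:55000/api/v1/predict",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp))
except OSError as exc:  # server not running, connection refused, etc.
    print(f"API not reachable: {exc}")
```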

Dashboard

python dashboard_server.py

After installation, you can also run:

network-analysis-dashboard

Dashboard URL: http://localhost:8050

The dashboard is organized into four workspaces — Overview, Incidents, Sources, and DNS — and includes drilldown selectors for incidents, source IPs, and domains, plus a runtime dark/light theme switch.

Data source (default = bundled CSV):

  • By default, the repository's bundled data/sample_flow_data.csv and data/sample_dns_logs.csv are used; start.sh will generate them automatically if they are missing.

  • To use SQLite, configure via environment variables only — no code changes required:

    1. Generate the sample database:
      python - <<'PY'
      import pandas as pd, sqlite3
      conn = sqlite3.connect("data/network_analysis.db")
      pd.read_csv("data/sample_flow_data.csv").to_sql("flow_records", conn, if_exists="replace", index=False)
      pd.read_csv("data/sample_dns_logs.csv").to_sql("dns_logs", conn, if_exists="replace", index=False)
      conn.close()
      PY
    2. Set the following in .env:
      DATA_SOURCE_MODE=sqlite
      SQLITE_PATH=data/network_analysis.db
      SQLITE_FLOW_TABLE=flow_records
      SQLITE_DNS_TABLE=dns_logs
      
    3. Restart with bash stop.sh && bash start.sh

    Docker reads these same environment variables.
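
After loading, a quick sanity check confirms the two tables exist and are populated. The sketch below runs against a throwaway in-memory database; point connect() at data/network_analysis.db to inspect the file created in step 1:

```python
import sqlite3

def table_counts(conn: sqlite3.Connection) -> dict:
    """Return row counts for every user table in the database."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {t: conn.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
            for t in tables}

# Demo on an in-memory DB; use sqlite3.connect("data/network_analysis.db")
# to check the real flow_records and dns_logs tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_records (src TEXT)")
conn.executemany("INSERT INTO flow_records VALUES (?)",
                 [("10.0.0.1",), ("10.0.0.2",)])
print(table_counts(conn))  # {'flow_records': 2}
```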

Model role:

  • The model calculates anomaly predictions and risk scores for traffic, powering the /api/v1/predict endpoint and the risk ranking in the dashboard. The model file is stored at models/classifier.pkl by default and is trained and generated automatically during start.sh or the Docker build.
  • If the model is missing, the API cannot provide inference; the dashboard will fall back to the labels present in the data (if any), but risk scores and model-based insights will be unavailable.

Model in action (quickly verify ML is working):

  1. Generate/update the model and sample data:
    bash start.sh    # or: python examples/demo_analysis.py
  2. Call the API for inference (returns anomaly/normal + probability):
    curl -X POST http://localhost:55000/api/v1/predict \
      -H "Content-Type: application/json" \
      -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
  3. In the dashboard, view the alert queue, risk ranking, and event timeline (all depend on the model output fields prediction and risk_score).

Docker

cp .env.example .env
# Set RUN_MODE=docker in .env if Docker should be the default runtime
bash start.sh

Docker mode starts both the API and the dashboard via docker-compose.yml. The image builds a model artifact during image creation, so /api/v1/health and /api/v1/predict are usable immediately after startup.

Testing

pytest -q

pytest tests/ --cov=src --cov-report=term-missing

Current status: all 45 test cases pass locally.

Sample Data

The repository includes simulated enterprise network telemetry in data/:

  • 500 DNS log entries: Normal queries, DGA domains, DNS tunneling, NXDOMAIN flooding
  • 500 traffic records: 9 label types covering normal and malicious scenarios

Data is generated by scripts/generate_data.py with a fixed seed for reproducibility. See data/README.md for field definitions and label distribution details.
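
The fixed-seed approach used by scripts/generate_data.py can be illustrated generically (the actual script's internals are not shown here; make_flow_records is a hypothetical stand-in):

```python
import random

def make_flow_records(seed: int, n: int = 3) -> list:
    """Generate n synthetic flow records deterministically from a seed."""
    rng = random.Random(seed)  # local RNG, independent of global state
    return [{"src_port": rng.randint(1024, 65535),
             "bytes_sent": rng.randint(64, 1_000_000)}
            for _ in range(n)]

# The same seed always yields the same records, so the bundled dataset
# can be regenerated reproducibly across runs and machines.
assert make_flow_records(42) == make_flow_records(42)
```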

Tech Stack

| Category | Technology |
| --- | --- |
| Language | Python 3.9+ |
| Data Processing | pandas, numpy |
| Machine Learning | scikit-learn (RandomForest, GradientBoosting, IsolationForest) |
| Visualization | matplotlib, seaborn, Plotly, Dash |
| Web API | Flask |
| Containerization | Docker |
| CI/CD | GitHub Actions |
| Testing | pytest, pytest-cov |

Contributing

Contributions, bug reports, and suggestions are welcome! Please refer to CONTRIBUTING.md.

License

This project is licensed under the MIT License.
