A machine learning-based network traffic analysis and anomaly detection system — a complete solution from data collection to model deployment
This project is an end-to-end network anomaly detection system for sample DNS logs and network flow records. It covers data collection, parsing, feature extraction, model training, local visualization, and service-style inference through a small Flask API.
The bundled sample dataset and demo scripts support the following 9 traffic categories:
| Threat Type | Description |
|---|---|
| C2 Beaconing (malicious) | Command & control communication with periodic low-traffic persistent connections |
| Data Exfiltration (data_exfil) | Large outbound data transfers with high upload and low download |
| Port Scanning (scan) | SYN-only probing across multiple destination ports |
| DDoS Attack (ddos) | High-volume packets in short bursts from distributed source addresses |
| Lateral Movement (lateral_movement) | Internal communication over sensitive ports between hosts |
| Brute Force (brute_force) | High-frequency short SSH/RDP connections with repeated auth failures |
| DNS Tunneling (dns_tunnel) | High-frequency, large-payload UDP traffic over DNS port |
| Cryptomining (cryptomining) | Sustained persistent connections to mining pool ports with stable bidirectional traffic |
| Normal Traffic (normal) | Legitimate HTTP/HTTPS/DNS/NTP and other business traffic |
- Data Collection: Supports batch CSV loading and streaming reads (generator mode)
- Log Parsing: Regex-based parsing of DNS logs and NetFlow traffic records
- Feature Engineering: Extracts 12+ statistical features (byte rate, packet rate, entropy, DGA detection, etc.)
- Multi-Model Support: Random Forest / Gradient Boosting / Isolation Forest
- REST API: Flask-based deployment with `/api/v1/predict`, `/api/v1/health`, and model metadata endpoints
- Containerization: One-command Docker build with automated GitHub Actions CI testing
- Visualization: Multi-workspace Plotly/Dash dashboard with Overview, Incidents, Sources, and DNS views
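To make the feature list concrete, here is a minimal sketch of two of the statistical features named above: byte rate and domain character entropy (a common DGA signal). The function names are illustrative; the real implementations live in `src/feature_extractor.py`.

```python
import math
from collections import Counter

def shannon_entropy(domain: str) -> float:
    """Bits per character; algorithmically generated (DGA) domains tend to score high."""
    if not domain:
        return 0.0
    total = len(domain)
    return -sum((n / total) * math.log2(n / total) for n in Counter(domain).values())

def byte_rate(bytes_total: int, duration_sec: float) -> float:
    """Bytes per second for a flow, guarding against zero-duration flows."""
    return bytes_total / duration_sec if duration_sec > 0 else 0.0

print(shannon_entropy("google") < shannon_entropy("xj9q2kf7zp1m"))  # True
print(byte_rate(66_560, 3.5))  # ≈ 19017 bytes/s
```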
- `src/`: Core pipeline code for shared configuration, data collection, parsing, feature extraction, and model training/inference.
- `visualization/`: Plotly and dashboard helper code used by the Dash UI.
- `data/`: Bundled sample DNS and flow datasets plus dataset notes in `data/README.md`.
- `examples/`: Runnable demo scripts, including `examples/demo_analysis.py`.
- `scripts/`: Utility scripts such as `scripts/generate_data.py` for regenerating sample data.
- `tests/`: Unit, integration, API, and dashboard smoke tests.
- `models/`: Generated model artifacts. Git only keeps `.gitkeep`; trained model files are created locally.
- `.github/workflows/`: GitHub Actions CI configuration.
- `.env.example`: Example runtime configuration for local and Docker deployment.
- `api_server.py`: Flask API entry point.
- `dashboard_server.py`: Dash dashboard entry point.
- `docker-compose.yml`: Docker deployment for the API and dashboard together.
- `start.sh`: Local bootstrap script that installs dependencies, generates data if needed, trains the model if needed, and starts the API and dashboard.
- `stop.sh`: Local shutdown script for the API and dashboard.
- `pyproject.toml`: Packaging metadata, console scripts, and tool configuration.
- `Dockerfile`: Base container image for the API and dashboard services.
- Python 3.9+
- pip
```bash
git clone https://github.com/zenithyangg/python-ai-network-analysis
cd python-ai-network-analysis
```

```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
# venv\Scripts\activate    # Windows
```

```bash
pip install -r requirements.txt
pip install -e .
```

For development and testing:

```bash
pip install -r requirements-dev.txt
pip install -e ".[dev]"
```

```bash
bash start.sh
```

This starts:
- Dashboard: http://localhost:8050
- API health: http://localhost:55000/api/v1/health
```bash
python examples/demo_analysis.py
```

The demo prints data collection stats, feature extraction output, model metrics, suspicious connection summaries, and DNS anomaly highlights.
The system uses the same operator commands in every mode:
```bash
bash start.sh
bash stop.sh
```
Both scripts support:
- `RUN_MODE=local`: local Python processes
- `RUN_MODE=docker`: Docker Compose deployment for local machines or Linux VPS hosts
```bash
bash start.sh
bash stop.sh
```

```bash
cp .env.example .env
```

Then edit `.env` as needed:
- Local Docker: keep `PUBLIC_HOST=localhost`
- Linux VPS: set `PUBLIC_HOST=<your-server-ip-or-domain>`
- Linux VPS: set `AUTO_OPEN_BROWSER=0`
- Use `API_BIND_IP=127.0.0.1` and `DASHBOARD_BIND_IP=127.0.0.1` for local-only exposure
- Use `API_BIND_IP=0.0.0.0` and `DASHBOARD_BIND_IP=0.0.0.0` to expose the system on a VPS
- Set `DASHBOARD_THEME=dark` or `DASHBOARD_THEME=light` for the default UI theme
- Keep `DATA_SOURCE_MODE=local_files` for the bundled CSV snapshot, or switch to `sqlite`
- Set `RUN_MODE=docker` if you want `bash start.sh` and `bash stop.sh` to drive Docker by default
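Putting these options together, a `.env` for a public VPS deployment might look like the following (values are illustrative; `.env.example` is the authoritative list of keys):

```
PUBLIC_HOST=<your-server-ip-or-domain>
AUTO_OPEN_BROWSER=0
API_BIND_IP=0.0.0.0
DASHBOARD_BIND_IP=0.0.0.0
DASHBOARD_THEME=dark
DATA_SOURCE_MODE=local_files
RUN_MODE=docker
```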
Then start and stop the deployed system with the same scripts:

```bash
bash start.sh
bash stop.sh
```

```python
from src.data_collector import NetworkDataCollector
from src.parser import NetworkLogParser
from src.feature_extractor import FeatureExtractor
from src.model import AnomalyDetector

# Collect → Parse → Extract features
collector = NetworkDataCollector()
flow_df = collector.collect_flow_data("data/sample_flow_data.csv")
parser = NetworkLogParser()
flow_df = parser.parse_flow_dataframe(flow_df)

extractor = FeatureExtractor()
features = extractor.extract_flow_features(flow_df)

# Train and detect
detector = AnomalyDetector(model_type="random_forest")
X = features[["bytes_sent", "bytes_received", "packets_sent",
              "packets_received", "duration_sec", "byte_rate",
              "packet_rate", "sent_recv_ratio"]].fillna(0)
y = (features["label"] != "normal").astype(int)
metrics = detector.train(X, y)
print(metrics)  # {'accuracy': ..., 'precision': ..., 'recall': ..., 'f1': ...}
```

```bash
python api_server.py
```

After installation, you can also run:
```bash
network-analysis-api
```

API base URL: http://localhost:55000
If `models/classifier.pkl` is not present yet, run `python examples/demo_analysis.py` first, or use `bash start.sh`, which will generate the model automatically.
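The predict endpoint can also be driven from Python. The sketch below only builds and prints the JSON payload; the `requests.post` call is left commented out so it does not need a running server. The field names follow the flow record schema used throughout this README:

```python
import json

payload = {
    "records": [{
        "src_port": 52431, "dst_port": 443,
        "bytes_sent": 1024, "bytes_received": 65536,
        "packets_sent": 15, "packets_received": 42,
        "duration_sec": 3.5,
    }]
}

# With the API running locally:
# import requests
# resp = requests.post("http://localhost:55000/api/v1/predict", json=payload, timeout=10)
# print(resp.json())

print(json.dumps(payload, indent=2))
```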
```bash
curl -X POST http://localhost:55000/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
```

```bash
python dashboard_server.py
```

After installation, you can also run:
```bash
network-analysis-dashboard
```

Dashboard URL: http://localhost:8050
The dashboard is organized into four workspaces — Overview, Incidents, Sources, and DNS — and includes drilldown selectors for incidents, source IPs, and domains, plus a runtime dark/light theme switch.
Data source (default = bundled CSV):

- By default, the repository's bundled `data/sample_flow_data.csv` and `data/sample_dns_logs.csv` are used; `start.sh` will generate them automatically if they are missing.
- To use SQLite, configure via environment variables only; no code changes are required:
  1. Generate the sample database:

     ```bash
     python - <<'PY'
     import pandas as pd, sqlite3
     conn = sqlite3.connect("data/network_analysis.db")
     pd.read_csv("data/sample_flow_data.csv").to_sql("flow_records", conn, if_exists="replace", index=False)
     pd.read_csv("data/sample_dns_logs.csv").to_sql("dns_logs", conn, if_exists="replace", index=False)
     conn.close()
     PY
     ```

  2. Set the following in `.env`:

     ```
     DATA_SOURCE_MODE=sqlite
     SQLITE_PATH=data/network_analysis.db
     SQLITE_FLOW_TABLE=flow_records
     SQLITE_DNS_TABLE=dns_logs
     ```

  3. Restart with `bash stop.sh && bash start.sh`.

Docker reads these same environment variables.
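As a quick sanity check of the layout SQLite mode expects, this standalone sketch builds the two tables in an in-memory database. The column names here are illustrative assumptions; the real columns come straight from the bundled CSVs:

```python
import sqlite3

# Use "data/network_analysis.db" instead of ":memory:" to inspect the real file
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_records (src_ip TEXT, dst_port INTEGER, bytes_sent INTEGER, label TEXT)")
conn.execute("CREATE TABLE dns_logs (query_domain TEXT, record_type TEXT, label TEXT)")
conn.execute("INSERT INTO flow_records VALUES ('10.0.0.5', 443, 1024, 'normal')")
conn.execute("INSERT INTO dns_logs VALUES ('example.com', 'A', 'normal')")
conn.commit()

# The collector only needs the table names configured in .env
for table in ("flow_records", "dns_logs"):
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {rows} row(s)")
```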
Model role:

- The model calculates anomaly predictions and risk scores for traffic, powering the `/api/v1/predict` API and the risk ranking in the dashboard. The model file is stored at `models/classifier.pkl` by default and is trained and generated automatically during `start.sh` or the Docker build.
- If the model is missing, the API cannot provide inference; the dashboard falls back to the labels present in the data (if any), but risk scores and model-based insights are unavailable.
Model in action (quickly verify ML is working):

1. Generate/update the model and sample data:

   ```bash
   bash start.sh   # or: python examples/demo_analysis.py
   ```

2. Call the API for inference (returns anomaly/normal plus a probability):

   ```bash
   curl -X POST http://localhost:55000/api/v1/predict \
     -H "Content-Type: application/json" \
     -d '{"records": [{"src_port": 52431, "dst_port": 443, "bytes_sent": 1024, "bytes_received": 65536, "packets_sent": 15, "packets_received": 42, "duration_sec": 3.5}]}'
   ```

3. In the dashboard, view the alert queue, risk ranking, and event timeline (all depend on the model output fields `prediction` and `risk_score`).
```bash
cp .env.example .env
# Set RUN_MODE=docker in .env if Docker should be the default runtime
bash start.sh
```

Docker mode starts both the API and the dashboard via `docker-compose.yml`. The image trains a model artifact during the image build, so `/api/v1/health` and `/api/v1/predict` are usable immediately after startup.
```bash
pytest -q
pytest tests/ --cov=src --cov-report=term-missing
```

Current status: 45 test cases, all passing locally.
The repository includes simulated enterprise network telemetry in data/:
- 500 DNS log entries: Normal queries, DGA domains, DNS tunneling, NXDOMAIN flooding
- 500 traffic records: 9 label types covering normal and malicious scenarios
Data is generated by scripts/generate_data.py with a fixed seed for reproducibility. See data/README.md for field definitions and label distribution details.
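The reproducibility property is easy to demonstrate: with a fixed seed, a generator emits identical records on every run. The following is a generic sketch of that pattern, not the actual `scripts/generate_data.py` logic (the field names and the seed value are illustrative):

```python
import random

def generate_flows(n: int, seed: int = 42) -> list:
    """Emit n pseudo-random flow records, reproducibly for a given seed."""
    rng = random.Random(seed)  # local RNG: no global state, fully reproducible
    labels = ["normal", "scan", "ddos", "data_exfil"]
    return [
        {
            "src_port": rng.randint(1024, 65535),
            "dst_port": rng.choice([53, 80, 443, 3389]),
            "bytes_sent": rng.randint(64, 1_000_000),
            "label": rng.choice(labels),
        }
        for _ in range(n)
    ]

# Same seed, same records: the property the bundled dataset relies on
print(generate_flows(3) == generate_flows(3))  # True
```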
| Category | Technology |
|---|---|
| Language | Python 3.9+ |
| Data Processing | pandas, numpy |
| Machine Learning | scikit-learn (RandomForest, GradientBoosting, IsolationForest) |
| Visualization | matplotlib, seaborn, Plotly, Dash |
| Web API | Flask |
| Containerization | Docker |
| CI/CD | GitHub Actions |
| Testing | pytest, pytest-cov |
Contributions, bug reports, and suggestions are welcome! Please refer to CONTRIBUTING.md.
This project is licensed under the MIT License.