🛡️ MobileGuard AI

Advanced AI-Powered Mobile Threat Intelligence Platform

A multi-stage AI security assessment platform for Android applications combining static analysis, dynamic sandboxing, LLM-powered behavioral analysis, and explainable machine learning.

Features • Architecture • Installation • Usage • API Reference • Development

📋 Table of Contents

Overview
Features
Architecture
Technology Stack
Installation
Configuration
Usage
API Reference
Pipeline Components
Machine Learning
Frontend Architecture
Testing
Deployment
Project Structure
Contributing
License

🔍 Overview

MobileGuard AI is an enterprise-grade Android malware detection system designed for financial institutions, security operations centers (SOCs), and cybersecurity agencies. It provides comprehensive threat analysis through a 7-stage pipeline:

Static Analysis - APK decompilation, permission analysis, API usage patterns, certificate validation, code obfuscation detection
YARA Signature Scanning - Multi-rule malware family detection with severity-based scoring
MITRE ATT&CK Mapping - Permission and API call mapping to mobile attack techniques
Malware Family Classification - Rule-based family detection (BankBot, Spyware, RAT, Ransomware)
Dynamic Analysis - Sandbox execution with runtime event collection, behavioral anomaly detection, and ADB-based monitoring
LLM Analysis - Resilient multi-tier LLM routing (Gemini 2.5 Flash → GPT-4o) with smart truncation
Risk Scoring - XGBoost-based ML classifier with SHAP explainability and context-aware boost rules

Key Differentiators:

Explainable AI - SHAP values show which features drove the risk score
Regional Threat Intelligence - India-specific banking malware patterns (UPI, OTP interception)
Real-time Streaming - Server-sent events provide live analysis progress
Multi-modal Analysis - Combines YARA signatures, MITRE mapping, ML, and LLM approaches
Resilient LLM Infrastructure - Automatic failover between Gemini and OpenAI with context window management
Production-Ready Sandbox - Full ADB + Frida integration with runtime event collection
VirusTotal Integration - Hash-based reputation checking
Audit Trail - JSONL-based audit logging with SQLite feature caching

✨ Features

🔬 Static Analysis Engine

APK Parsing - Androguard-based decompilation with manifest extraction and DEX analysis
Permission Profiling - Detection of 22+ dangerous permissions with combo risk scoring (e.g., READ_SMS + INTERNET adds 10 points)
API Fingerprinting - Call graph analysis matched against 14+ suspicious API patterns (sendTextMessage, Runtime.exec, DexClassLoader)
Obfuscation Detection - Shannon entropy analysis on strings (values above 4.5 are flagged), base64 pattern matching
Certificate Validation - Self-signed cert detection, validity period analysis, issuer verification, debug cert flagging
Native Code Inspection - .so library enumeration with known malware signature matching (libfrida-gadget.so, libinject.so)
Call Graph Construction - NetworkX-based control flow analysis with graph density metrics
VirusTotal Integration - SHA256 hash reputation checking with malicious/suspicious count extraction
C2 Domain Detection - URL extraction with threat intelligence feed matching
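
The entropy check listed under Obfuscation Detection can be sketched as follows. This is a minimal illustration, not the project's code: it assumes entropy is measured per character with a base-2 logarithm, and the names `shannon_entropy` and `is_high_entropy` are hypothetical.

```python
import math
from collections import Counter

ENTROPY_THRESHOLD = 4.5  # threshold quoted in the feature list above


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def is_high_entropy(text: str, threshold: float = ENTROPY_THRESHOLD) -> bool:
    """Flag strings whose character distribution looks random or encoded."""
    return shannon_entropy(text) > threshold


# A string of 32 distinct characters has entropy log2(32) = 5 bits per character.
print(is_high_entropy("".join(chr(65 + i) for i in range(32))))
print(is_high_entropy("aaaaaaaaaaaaaaaa"))
```

Repetitive strings such as ordinary identifiers score low, while packed or encrypted payloads approach the maximum for their alphabet.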

🏗️ Dynamic Analysis Sandbox

Execution Modes - Live sandbox (ADB + Frida + logcat) with automatic device detection and fallback to emulated mode
Behavioral Monitoring - SMS send attempts, accessibility service abuse, silent install detection, overlay detection
Runtime Event Collection - CollectorOrchestrator with multiple specialized collectors (Frida hooks, dumpsys, logcat parsing)
Network Traffic Analysis - Domain extraction from logcat, C2 server detection with threat intel matching
System State Analysis - dumpsys inspection for accessibility services, device admin, foreground services, overlay windows
Device Admin Detection - Privilege escalation attempt monitoring via device_policy dumpsys
Behavioral Scoring - Evidence-based anomaly score (0-100) calculated from observed runtime events
Safe Failure Handling - Graceful degradation to emulated mode when ADB unavailable or device unreachable
Malware Family Matching - Behavioral pattern correlation against known family signatures

🤖 LLM-Powered Intelligence

Resilient Multi-Tier Routing - Primary: Gemini 2.5 Flash → Fallback: GPT-4o with automatic failover
Smart Truncation - Center-out code truncation (top 40% + bottom 40%) for context window management
Native JSON Output - Structured response format with strict schema validation and markdown stripping
Contextual Analysis - Decompiled code interpretation with malicious behavior extraction and evidence citations
Evidence-Based Reasoning - Cites specific class names, methods, API calls, and permissions in findings
Zero-Day Hypothesis Generation - Novel threat detection for unknown malware families (triggered when severity > 0.6 AND family_similarity < 0.4)
India-Specific Risk Assessment - UPI, BHIM, PhonePe, Paytm targeting detection with regional threat patterns
Structured JSON Output - Confidence scores, verdict classification (APPROVE/MONITOR/ESCALATE/BLOCK), executive summaries
Security Analyst Persona - System prompt casts the model as an analyst with 15+ years of malware analysis experience working to national SOC-level assessment standards
Automatic Retry Logic - 2-attempt system with progressive truncation on context window errors

📊 Machine Learning & Explainability

XGBoost Classifier - 300-tree ensemble with early stopping, class imbalance handling, and the CPU-optimized hist tree method
Dataset Feature Extraction - 37 engineered features from Drebin/CIC-AndMal2017 compatible format
SHAP Explainability - TreeExplainer integration with top-5 feature attribution and waterfall visualization
Synthetic Training Data - Statistical distribution matching for benign (n=800) and malicious (n=800) samples
Multi-Dimensional Scoring - 6 weighted risk dimensions:
- Permission Abuse (10%)
- Obfuscation (10%)
- Behavioral Anomaly (15%)
- ML Malware (45%)
- Developer Trust (10%)
- LLM Severity (10%)
Context-Aware Boost Rules - Dynamic score amplification:
- SMS send attempts (+15)
- C2 domains contacted (+20)
- Accessibility abuse (+12)
- Static C2 IPs (+10)
- LLM verdict BLOCK (+10)
- Silent install attempt (+15)
- Root activity detected (+15)
- Shell execution (+10)
- Dynamic code loading (+10)
- VirusTotal malicious ≥5 (+25)
- YARA signature matches (+3 to +25)

📝 Reporting & Compliance

Threat Reports - Structured JSON with verdict, forensic indicators, recommended actions, MITRE techniques
Audit Logging - ISO 8601 timestamped JSONL logs with dimension scores, SHAP values, and YARA matches
Feature Store - SQLite-based result caching with model version tracking for duplicate APK detection
CERT-In Compliance - Reporting format aligned with Indian cybersecurity standards
YARA Match Reporting - Matched rule names, families, severity levels, and specific string matches
MITRE ATT&CK Coverage - Tactic-grouped technique IDs with evidence traceability
Family Attribution - Confidence-scored malware family classification with matched behavioral signals
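
A JSONL audit trail is one JSON object per line, appended after each analysis. The sketch below shows the idea under stated assumptions: the field names mirror the `/audit-log` response shown in the API Reference, while `append_audit_entry` and `read_audit_entries` are hypothetical helpers rather than the project's logger.

```python
import json
from datetime import datetime, timezone


def append_audit_entry(path: str, apk_hash: str, filename: str,
                       score: float, action: str) -> dict:
    """Append one analysis record to a JSONL file and return it."""
    entry = {
        "apk_hash": apk_hash,
        "filename": filename,
        "score": score,
        "action": action,
        # ISO 8601 timestamp in UTC, for example 2026-06-18T10:30:45Z
        "analyzed_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")  # one JSON object per line
    return entry


def read_audit_entries(path: str) -> list[dict]:
    """Load every non-empty line of the JSONL file as a dict."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is independent, the log can be appended to without rewriting the file and tailed or filtered with ordinary line-based tools.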

🎨 Modern React Frontend

Real-time Updates - Server-sent events with live progress tracking across 7 pipeline stages
Interactive Visualizations - Recharts-based risk gauge, 6-axis dimension radar charts, SHAP waterfall plots
Framer Motion Animations - Smooth page transitions and component mounting with spring physics
Tailwind CSS Design System - Dark mode with glassmorphism effects and responsive 12-column grid
Responsive Layout - Mobile-first design with adaptive breakpoints
Lucide Icons - Shield, Activity, AlertTriangle, FileSearch, Brain, BarChart, and 50+ security icons
Component Library - 8 specialized components:
- UploadZone (drag-and-drop with 150MB client-side validation)
- ProgressTracker (5-stage pipeline visualization)
- RiskGauge (circular gauge with dynamic gradient coloring)
- DimensionChart (6-axis radar with tooltips)
- ShapExplainer (top-5 feature attribution bars)
- ThreatReport (collapsible sections with copy-to-clipboard)
- ActionBanner (verdict display with color-coded badges)
- AuditLog (paginated table with filters)

🏛️ Architecture

High-Level System Architecture

graph TB
    subgraph "Client Layer"
        UI[React Frontend<br/>Vite + Tailwind CSS]
        Browser[Web Browser<br/>Chrome/Firefox/Safari]
    end
    
    subgraph "API Gateway"
        API[FastAPI Server<br/>Uvicorn ASGI]
        CORS[CORS Middleware]
        SSE[Server-Sent Events<br/>Streaming]
    end
    
    subgraph "Core Pipeline"
        Orch[Pipeline Orchestrator<br/>Event Coordinator]
        Cache{Cache Check<br/>SHA256 + Version}
    end
    
    subgraph "Analysis Engines"
        Static[Static Analyzer<br/>Androguard + NetworkX]
        YARA[YARA Engine<br/>Signature Scanner]
        MITRE[MITRE Mapper<br/>ATT&CK Techniques]
        Family[Family Classifier<br/>Rule-Based Detection]
        Dynamic[Dynamic Analyzer<br/>ADB + Frida + Logcat]
        LLM[LLM Analyzer<br/>Resilient Router]
        Scorer[Risk Scorer<br/>XGBoost + SHAP]
        Report[Report Generator<br/>Intelligence Synthesis]
    end
    
    subgraph "External Services"
        Gemini[Google Gemini 2.5 Flash<br/>Primary LLM]
        GPT[OpenAI GPT-4o<br/>Fallback LLM]
        VT[VirusTotal API<br/>Hash Reputation]
        ADB[Android Device<br/>Live Sandbox]
    end
    
    subgraph "Data Layer"
        FS[(SQLite<br/>Feature Store)]
        Audit[(JSONL<br/>Audit Logs)]
        Model[(XGBoost Model<br/>SHAP Explainer)]
        Rules[(YARA Rules<br/>.yar Files)]
        Intel[(Threat Intel<br/>C2 Blocklists)]
    end
    
    Browser -->|Upload APK| UI
    UI <-->|REST API<br/>SSE Stream| API
    API --> CORS
    CORS --> Orch
    
    Orch --> Cache
    Cache -->|Hit| UI
    Cache -->|Miss| Static
    
    Static --> YARA
    Static -->|Hash Check| VT
    YARA --> MITRE
    MITRE --> Family
    Family --> Dynamic
    
    Dynamic -->|If Available| ADB
    Dynamic --> LLM
    
    LLM -->|Primary| Gemini
    LLM -->|Fallback| GPT
    
    LLM --> Scorer
    Scorer --> Report
    
    Static -.->|Load| Model
    Scorer -.->|Load| Model
    YARA -.->|Load| Rules
    Static -.->|Query| Intel
    Dynamic -.->|Query| Intel
    
    Report -->|Cache| FS
    Report -->|Log| Audit
    Cache -.->|Query| FS
    
    Report -->|Final Result| SSE
    SSE --> UI
    
    style UI fill:#3B82F6,stroke:#1E40AF,color:#fff
    style API fill:#10B981,stroke:#047857,color:#fff
    style Orch fill:#F59E0B,stroke:#D97706,color:#fff
    style Gemini fill:#8B5CF6,stroke:#6D28D9,color:#fff
    style GPT fill:#8B5CF6,stroke:#6D28D9,color:#fff
    style FS fill:#EF4444,stroke:#B91C1C,color:#fff
    style Model fill:#EF4444,stroke:#B91C1C,color:#fff

Data Flow Diagram

sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Orchestrator
    participant Analyzers
    participant LLM
    participant Database
    
    User->>Frontend: Upload APK File (150MB max)
    Frontend->>API: POST /analyze (multipart/form-data)
    API->>Orchestrator: Initialize Pipeline
    
    Orchestrator->>Database: Check Cache (SHA256)
    alt Cache Hit
        Database-->>Orchestrator: Return Cached Result
        Orchestrator-->>Frontend: SSE: cache_hit (100%)
    else Cache Miss
        Orchestrator-->>Frontend: SSE: static_analysis (10%)
        Orchestrator->>Analyzers: Static Analysis
        Analyzers-->>Orchestrator: Static Features
        
        Orchestrator-->>Frontend: SSE: yara_scan (25%)
        Orchestrator->>Analyzers: YARA + MITRE + Family
        Analyzers-->>Orchestrator: Signatures + Techniques
        
        Orchestrator-->>Frontend: SSE: dynamic_analysis (40%)
        Orchestrator->>Analyzers: Dynamic Analysis
        Analyzers-->>Orchestrator: Runtime Events
        
        alt Composite Score > 40
            Orchestrator-->>Frontend: SSE: llm_analysis (80%)
            Orchestrator->>LLM: Analyze Code
            LLM->>LLM: Try Gemini 2.5 Flash
            alt Gemini Success
                LLM-->>Orchestrator: Analysis Result
            else Gemini Fail
                LLM->>LLM: Fallback to GPT-4o
                LLM-->>Orchestrator: Analysis Result
            end
        else Score ≤ 40
            Orchestrator-->>Frontend: SSE: llm_skipped (80%)
        end
        
        Orchestrator-->>Frontend: SSE: risk_scoring (90%)
        Orchestrator->>Analyzers: ML Score + SHAP
        Analyzers-->>Orchestrator: Risk Score + Attribution
        
        Orchestrator->>Database: Cache Result
        Orchestrator->>Database: Log Audit Entry
        
        Orchestrator-->>Frontend: SSE: complete (100%)
    end
    
    Frontend->>User: Display Risk Gauge + Report

Component Interaction Flow

graph TD
    A[Frontend React SPA] -->|HTTP POST /analyze| B[FastAPI Backend]
    B -->|Stream SSE Events| A
    
    B --> C[Pipeline Orchestrator]
    
    C --> D[Static Analyzer]
    D -->|Androguard| D1[APK Decompilation]
    D1 --> D2[Permission Analysis]
    D2 --> D3[API Call Graph]
    D3 --> D4[Certificate Validation]
    D4 --> D5[Obfuscation Detection]
    D5 --> D6[VirusTotal Hash Check]
    
    C --> Y[YARA Engine]
    Y -->|Scan DEX/Manifest/.so| Y1[Signature Matching]
    Y1 --> Y2[Severity Scoring]
    
    C --> M[MITRE Mapper]
    M --> M1[Permission → Technique]
    M1 --> M2[API → Technique]
    
    C --> FC[Family Classifier]
    FC --> FC1[BankBot/Spyware/RAT Rules]
    
    C --> E[Dynamic Analyzer]
    E -->|ADB + Frida| E1[Sandbox Execution]
    E1 --> E2[Runtime Collectors]
    E2 --> E3[Behavioral Scoring]
    E3 --> E4[Event Mapping]
    
    C --> F[LLM Analyzer]
    F -->|Resilient Router| F1[Gemini 2.5 Flash]
    F1 -->|Fallback| F2[GPT-4o]
    F2 --> F3[Smart Truncation]
    F3 --> F4[JSON Validation]
    
    C --> G[Risk Scorer]
    G -->|Dataset Features| G1[XGBoost Model]
    G1 -->|SHAP| G2[Feature Attribution]
    G2 --> G3[6 Dimension Scores]
    G3 --> G4[Boost Rules]
    
    C --> H[Report Generator]
    H --> I[Threat Report]
    
    G --> J[Feature Store]
    J -->|SQLite| K[(Cache DB)]
    
    H --> L[Audit Logger]
    L -->|JSONL| N[(Audit Logs)]

Data Flow

User uploads APK → Frontend sends multipart/form-data POST
Backend validates → Size check (150MB max), magic byte verification (PK header)
Orchestrator streams events → Each stage emits SSE with progress percentage
Static analysis → 14 numeric features + graph topology metrics
Dynamic analysis → Sandbox execution (if enabled) or emulated mode
LLM analysis → Gemini API call with decompiled code context
Risk scorer builds feature vector → 37-dimensional array for XGBoost
SHAP explainer → Top-5 feature contributions extracted
Report generated → JSON with verdict (APPROVE/MONITOR/ESCALATE/BLOCK)
Results cached → SQLite feature store + JSONL audit log
Frontend renders → Risk gauge, dimension chart, SHAP waterfall, threat report
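
The upload validation step (size limit plus PK magic bytes) could look roughly like this. It is a sketch under assumptions: the backend's real check may inspect fewer bytes or use python-magic, and `validate_apk_upload` and `InvalidUpload` are hypothetical names.

```python
MAX_APK_SIZE_MB = 150
ZIP_MAGIC = b"PK\x03\x04"  # local file header signature that starts a ZIP/APK


class InvalidUpload(ValueError):
    """Raised when an uploaded file fails basic APK validation."""


def validate_apk_upload(data: bytes, max_size_mb: int = MAX_APK_SIZE_MB) -> None:
    """Reject uploads that are too large or do not start with a ZIP header."""
    if len(data) > max_size_mb * 1024 * 1024:
        raise InvalidUpload(f"file exceeds {max_size_mb} MB")
    if not data.startswith(ZIP_MAGIC):
        raise InvalidUpload("not a ZIP/APK archive (missing PK header)")


validate_apk_upload(b"PK\x03\x04" + b"\x00" * 32)  # passes silently
```

In an HTTP handler the two failure cases would map naturally onto the 413 and 422 responses listed in the API Reference.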

🔧 Technology Stack

Backend

Component	Technology	Version	Purpose
API Framework	FastAPI	0.111.0	Async REST API with OpenAPI docs
ASGI Server	Uvicorn	0.30.1	Production ASGI server with WebSocket support
APK Analysis	Androguard	3.3.5	DEX decompilation, manifest parsing
ML Framework	XGBoost	2.0.3	Gradient boosting classifier
Explainability	SHAP	0.45.1	TreeExplainer for feature attribution
LLM API	Google Gemini + OpenAI	2.5 Flash / GPT-4o	Resilient multi-tier routing
Graph Analysis	NetworkX	3.3	Call graph construction
Data Processing	Pandas + NumPy	2.2.2 + 1.26.4	Feature engineering
Database	SQLAlchemy	2.0.30	ORM for SQLite feature store
File Type Detection	python-magic	0.4.27	APK validation
YARA Scanner	yara-python	4.5.1	Signature-based detection
Testing	pytest + httpx	8.2.2 + 0.27.0	Unit/integration tests

Frontend

Component	Technology	Version	Purpose
Framework	React	19.2.6	Component-based UI
Build Tool	Vite	8.0.12	Fast HMR development server
Styling	Tailwind CSS	3.4.19	Utility-first CSS framework
Charts	Recharts	3.8.1	D3-based data visualization
Animations	Framer Motion	12.40.0	Declarative animations
Icons	Lucide React	1.20.0	SVG icon library
HTTP Client	Fetch API	Native	Server-sent events streaming

Infrastructure

Containerization - Docker + Docker Compose
Reverse Proxy - Nginx (frontend static serving)
Storage - SQLite (feature cache), JSONL (audit logs)

📦 Installation

Prerequisites

# Python 3.11+
python --version  # Should be >= 3.11

# Node.js 20+
node --version    # Should be >= 20

# Docker & Docker Compose (optional)
docker --version
docker-compose --version

# Java Runtime (for Androguard)
java -version     # Required for APK decompilation

# ADB (for live sandbox mode)
adb version       # Optional - only if USE_LIVE_SANDBOX=true

Quick Start (Docker)

# 1. Clone the repository
git clone https://github.com/yourusername/mobileguard-ai.git
cd mobileguard-ai

# 2. Configure environment variables
cp .env.example .env
nano .env  # Add your GEMINI_API_KEY and OPENAI_API_KEY

# 3. Build and run with Docker Compose
docker-compose up --build

# 4. Access the application
# Frontend: http://localhost:3000
# Backend API: http://localhost:8000
# API Docs: http://localhost:8000/docs

Development Setup (Local)

Backend

cd backend

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Train the XGBoost model (generates models/xgboost_mobileguard.json)
# The backend.* module paths resolve from the repository root, so step back out first
cd ..
python -m backend.training.train_xgboost

# Start the API server
uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

Frontend

cd frontend

# Install dependencies
npm install

# Start development server with HMR
npm run dev

# Build for production
npm run build
npm run preview  # Preview production build

⚙️ Configuration

Environment Variables (.env)

# Required
GEMINI_API_KEY="your-gemini-api-key-here"
OPENAI_API_KEY="your-openai-api-key-here"    # For LLM fallback tier

# Optional
VIRUSTOTAL_API_KEY="your-vt-api-key"          # For threat intelligence enrichment
USE_LIVE_SANDBOX="false"                      # Enable ADB-based sandbox (requires devices)
MAX_APK_SIZE_MB="150"                         # Max upload size
SANDBOX_TIMEOUT_SECS="90"                     # Dynamic analysis timeout

# Paths (auto-configured)
FEATURE_CACHE_DB="data/feature_cache.sqlite"
AUDIT_LOG_PATH="data/audit.jsonl"
MODEL_PATH="models/xgboost_mobileguard.json"

Configuration Files

Backend Configuration (backend/config.py):

LLM_MODEL - Gemini model name (default: gemini-2.0-flash)
RISK_THRESHOLDS - Score boundaries for APPROVE/MONITOR/ESCALATE/BLOCK
DANGEROUS_PERMISSIONS - Permission risk weights (0-5 scale)
SUSPICIOUS_API_PATTERNS - Regex patterns for malicious API detection

Frontend Configuration (frontend/vite.config.js):

Build settings for production optimization
Proxy configuration for local development

Tailwind Config (frontend/tailwind.config.js):

Custom color palette (background, accent, danger, success)
Animation keyframes for glow effects

🚀 Usage

Web Interface

Navigate to http://localhost:3000
Check System Status - Verify API health and model loading
Upload APK - Drag & drop or click to select .apk file (max 150MB)
Monitor Progress - Watch real-time analysis stages:
- Static Analysis (0-30%)
- Dynamic Analysis (30-50%)
- LLM Analysis (50-70%)
- Risk Scoring (70-90%)
- Report Generation (90-100%)
Review Results:
- Risk Gauge - Composite score with action recommendation
- Dimension Chart - 6 risk dimension breakdown
- SHAP Explainer - Top-5 features driving the score
- Threat Report - Executive summary with forensic indicators
Audit Log - View historical analyses with scores and timestamps

API Usage

Analyze APK

curl -X POST "http://localhost:8000/analyze" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@sample.apk" \
  --no-buffer  # Required for SSE streaming

Response (Server-Sent Events):

data: {"stage":"static_analysis","status":"running","progress":10}

data: {"stage":"yara_scan","status":"running","progress":25}

data: {"stage":"cache_hit","status":"done","progress":100}  # Sent instead of the remaining stages if the APK was previously analyzed

data: {"stage":"dynamic_analysis","status":"running","progress":40}

data: {"stage":"llm_skipped","status":"done","progress":80}  # If composite_score ≤ 40

data: {"stage":"llm_analysis","status":"running","progress":80}  # If composite_score > 40

data: {"stage":"risk_scoring","status":"running","progress":90}

data: {"stage":"report_generation","status":"running","progress":90}

data: {"stage":"complete","status":"done","progress":100,"result":{...}}
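
A client can consume this stream by reading it line by line and decoding each `data:` payload. The sketch below assumes every event is a single `data:` line containing JSON, as in the sample above (the trailing `#` notes in the sample are annotations, not part of the stream); `iter_sse_events` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the decoded JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data fields
        yield json.loads(line[len("data:"):].strip())


sample = [
    'data: {"stage":"static_analysis","status":"running","progress":10}',
    "",
    'data: {"stage":"complete","status":"done","progress":100,"result":{}}',
]
for event in iter_sse_events(sample):
    print(event["stage"], event["progress"])
```

With httpx, which the backend already lists for testing, the same generator can be fed from `response.iter_lines()` inside a `client.stream("POST", ...)` context.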

Health Check

curl http://localhost:8000/health

Response:

{
  "status": "ok",
  "version": "1.0.0",
  "model_loaded": true,
  "llm_available": true,
  "sandbox_available": false
}

Retrieve Cached Analysis

curl http://localhost:8000/analysis/{apk_sha256_hash}

Fetch Audit Log

curl "http://localhost:8000/audit-log?limit=10&offset=0"

📚 API Reference

Endpoints

`POST /analyze`

Analyze an APK file with full pipeline execution.

Request:

Body - multipart/form-data
Field - file (APK binary, max 150MB)

Response:

Content-Type - text/event-stream
Events - JSON objects with stage, status, progress, error, result fields

Error Codes:

422 - Invalid APK format (not a ZIP/PK header)
413 - File too large (> 150MB)
500 - Analysis pipeline failure

`GET /health`

System health and component availability.

Response:

{
  "status": "ok",
  "version": "1.0.0",
  "model_loaded": true,
  "llm_available": true,
  "sandbox_available": false
}

`GET /analysis/{apk_hash}`

Retrieve cached analysis by SHA256 hash.

Response: Full AnalysisResult JSON

Error Codes:

404 - Hash not found in cache
503 - Feature store unavailable

`GET /audit-log`

Fetch audit log entries.

Query Parameters:

limit (int) - Max entries (default: 50)
offset (int) - Pagination offset (default: 0)

Response:

{
  "entries": [
    {
      "apk_hash": "abc123...",
      "filename": "sample.apk",
      "score": 68.5,
      "action": "ESCALATE",
      "analyzed_at": "2026-06-18T10:30:45Z"
    }
  ]
}

`DELETE /cache/{apk_hash}`

Remove cached analysis result.

Response: {"status": "ok"}

🔬 Pipeline Components

Pipeline Orchestrator

File: backend/pipeline/orchestrator.py

The orchestrator coordinates the 7 analysis stages, preceded by a cache check and followed by report generation, with conditional LLM invocation:

Cache Check - Query feature store by SHA256 hash + model version
Static Analysis - APK decompilation and feature extraction
YARA Scanning - Signature-based malware family detection
MITRE Mapping - Permission/API → ATT&CK technique mapping
Family Classification - Rule-based family detection
Dynamic Analysis - Runtime behavior monitoring (if sandbox available)
Conditional LLM Analysis - Only invoked if composite_score > 40 (cost optimization)
Risk Scoring - ML prediction + SHAP + boost rules
Report Generation - Threat report with actionable intelligence

Key Optimizations:

Smart LLM Gating - Skip expensive LLM calls for low-risk apps (saves 70% of API costs)
Hash-Based Caching - Instant results for previously analyzed APKs
Model Version Tracking - Cache invalidation when model updates
SSE Streaming - Real-time progress updates to frontend
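
The caching and gating logic described above can be sketched in a few lines. This is an illustration only: the cache key format, the `MODEL_VERSION` value, and the function names are assumptions, while the threshold of 40 comes from the stage list.

```python
import hashlib
import io

MODEL_VERSION = "1.0.0"   # hypothetical version string
LLM_GATE_THRESHOLD = 40   # the LLM stage runs only above this composite score


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream in chunks so large APKs are never fully in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def cache_key(apk_sha256: str, model_version: str = MODEL_VERSION) -> str:
    """Combine hash and model version so a model update invalidates old entries."""
    return f"{apk_sha256}:{model_version}"


def should_run_llm(composite_score: float) -> bool:
    """Skip the LLM stage for low-risk apps."""
    return composite_score > LLM_GATE_THRESHOLD


apk_hash = sha256_of_stream(io.BytesIO(b"PK\x03\x04 fake apk bytes"))
print(cache_key(apk_hash), should_run_llm(55.0))
```

Including the model version in the key means a retrained model never serves a score computed by its predecessor.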

1. Static Analyzer

Implementation: backend/pipeline/static_analyzer.py

Extracts 27+ static features using Androguard with VirusTotal enrichment.

Core Capabilities:

APK decompilation and manifest parsing
Permission risk scoring with combo detection
Call graph construction using NetworkX
Certificate chain validation
Obfuscation detection via Shannon entropy
Native library inspection
C2 domain/IP extraction
VirusTotal hash reputation check

2. YARA Signature Engine

Implementation: backend/detection/yara_engine.py

Production-grade signature scanning with metadata-aware severity scoring.

Architecture:

Independent rule compilation (one broken rule doesn't disable engine)
Unpacked content scanning (DEX, AndroidManifest.xml, .so files)
Cross-component deduplication
Severity-based score weighting (CRITICAL: 100, HIGH: 70, MEDIUM: 40, LOW: 15)
Action prioritization (BLOCK > ESCALATE > MONITOR > APPROVE)
Safe failure handling with scan error reporting

Score Contribution:

severity_score - Highest single rule weight (0-100)
score_boost - Additive boost for risk composite (capped at 40)
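
One way to read the two score contributions is shown below. The severity weights and the cap of 40 are taken from the lists above; the per-match boost values are an assumption (only the +3 and +25 endpoints appear in this README), and `score_matches` is a hypothetical name.

```python
SEVERITY_WEIGHTS = {"CRITICAL": 100, "HIGH": 70, "MEDIUM": 40, "LOW": 15}
# Assumed per-match boosts: only the +3 and +25 endpoints come from the README.
SEVERITY_BOOSTS = {"CRITICAL": 25, "HIGH": 15, "MEDIUM": 8, "LOW": 3}
MAX_SCORE_BOOST = 40


def score_matches(severities: list[str]) -> tuple[int, int]:
    """Return (severity_score, score_boost) for the severities of matched rules."""
    if not severities:
        return 0, 0
    # severity_score is the highest single rule weight, not a sum
    severity_score = max(SEVERITY_WEIGHTS[s] for s in severities)
    # score_boost is additive across matches but capped
    score_boost = min(sum(SEVERITY_BOOSTS[s] for s in severities), MAX_SCORE_BOOST)
    return severity_score, score_boost


print(score_matches(["LOW", "HIGH"]))
```

Taking the maximum for `severity_score` keeps many low-severity hits from outranking a single critical one, while the cap stops YARA alone from dominating the composite.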

3. MITRE ATT&CK Mapper

Implementation: backend/intel/mitre_mapper.py

Maps permissions and APIs to MITRE Mobile ATT&CK techniques.

Coverage:

20+ permission mappings (SMS → T1636.004, Accessibility → T1417)
10+ API mappings (DexClassLoader → T1407, Runtime.exec → T1406)
Tactic-grouped findings (Collection, Persistence, Defense Evasion, etc.)
Evidence traceability (each technique links to triggering signal)
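
The mapping with evidence traceability can be pictured as two lookup tables plus a record of what triggered each hit. The technique IDs below are the examples quoted above; the concrete permission strings and the `map_techniques` helper are assumptions made for illustration.

```python
# Technique IDs as quoted in the coverage list; permission names are assumed.
PERMISSION_TECHNIQUES = {
    "android.permission.READ_SMS": "T1636.004",
    "android.permission.BIND_ACCESSIBILITY_SERVICE": "T1417",
}
API_TECHNIQUES = {
    "DexClassLoader": "T1407",
    "Runtime.exec": "T1406",
}


def map_techniques(permissions: list[str], api_calls: list[str]) -> list[dict]:
    """Return one finding per matched signal, each linked to its evidence."""
    findings = []
    for perm in permissions:
        if perm in PERMISSION_TECHNIQUES:
            findings.append({"technique": PERMISSION_TECHNIQUES[perm],
                             "evidence": f"permission:{perm}"})
    for api in api_calls:
        for pattern, technique in API_TECHNIQUES.items():
            if pattern in api:  # substring match on the fully qualified call
                findings.append({"technique": technique,
                                 "evidence": f"api:{api}"})
    return findings


print(map_techniques(["android.permission.READ_SMS"], ["java.lang.Runtime.exec"]))
```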

4. Family Classifier

Implementation: backend/intel/family_classifier.py

Rule-based classification with confidence scoring.

Supported Families:

BankBot-like - SMS + Accessibility overlay (confidence: 0.75+)
Spyware-like - Contacts + Location exfiltration (confidence: 0.75+)
RAT-like - Dynamic code loading + shell execution (confidence: 0.70+)
Ransomware-like - Storage encryption + device locking (confidence: 0.70+)

Algorithm:

Required signals (all must match)
Trigger signals (any N must match)
Bonus signals (each adds +0.05 confidence)
Threshold filtering (default: 0.8)
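
The four-step algorithm above can be sketched as a small rule evaluator. Everything named here (`FamilyRule`, `classify`, the signal strings, and the example rule) is hypothetical; only the +0.05 bonus and the 0.8 default threshold come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FamilyRule:
    name: str
    base_confidence: float
    required: frozenset[str]                 # all must be present
    triggers: frozenset[str] = frozenset()   # at least min_triggers must be present
    min_triggers: int = 0
    bonus: frozenset[str] = frozenset()      # each adds +0.05


def classify(signals: set[str], rules: list[FamilyRule],
             threshold: float = 0.8) -> list[tuple[str, float]]:
    """Return (family, confidence) pairs that pass the threshold, best first."""
    results = []
    for rule in rules:
        if not rule.required <= signals:
            continue
        if len(rule.triggers & signals) < rule.min_triggers:
            continue
        raw = rule.base_confidence + 0.05 * len(rule.bonus & signals)
        confidence = round(min(1.0, raw), 2)  # rounding avoids float edge cases
        if confidence >= threshold:
            results.append((rule.name, confidence))
    return sorted(results, key=lambda item: item[1], reverse=True)


bankbot = FamilyRule(
    name="BankBot-like",
    base_confidence=0.75,
    required=frozenset({"sms_permission", "accessibility_service"}),
    bonus=frozenset({"overlay_window", "c2_domain"}),
)
print(classify({"sms_permission", "accessibility_service", "overlay_window"}, [bankbot]))
```

Under these assumptions a rule that starts at 0.75 needs at least one bonus signal to clear the default threshold, which keeps bare permission matches from producing a family attribution.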

5. Dynamic Analyzer

Implementation: backend/pipeline/dynamic_analyzer.py

Full ADB + Frida + logcat integration with automatic fallback.

Live Sandbox Mode:

APK installation via ADB
Logcat capture with signal extraction
UI interaction via adb shell monkey
dumpsys inspection (accessibility, device_policy, window, activity services)
Runtime event collection (Frida hooks, system state)
Behavioral anomaly scoring (0-100)

Emulated Mode:

Graceful degradation when no device available
Returns neutral values instead of blocking analysis
Allows ML/LLM tiers to carry the decision weight

6. LLM Analyzer

Implementation: backend/pipeline/llm_analyzer.py + backend/pipeline/resilient_router.py

Multi-tier routing with smart truncation and strict JSON validation.

Routing Strategy:

Primary Tier - Gemini 2.5 Flash with native JSON output
Fallback Tier - GPT-4o with structured response format
Retry Logic - 2 attempts with progressive truncation on context errors

Smart Truncation:

Center-out strategy (top 40% + bottom 40%)
Preserves class headers/imports and execution tails
Triggered automatically on context window errors
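
Center-out truncation keeps the head and tail of the decompiled source and drops the middle. The sketch below assumes truncation is line-based and uses a hypothetical marker string; the router may instead work on characters or tokens.

```python
def center_out_truncate(code: str, keep_ratio: float = 0.4,
                        marker: str = "\n/* ... truncated ... */\n") -> str:
    """Keep the first and last `keep_ratio` of lines and drop the middle."""
    lines = code.splitlines()
    keep = int(len(lines) * keep_ratio)
    if keep == 0 or 2 * keep >= len(lines):
        return code  # too short to be worth truncating
    head = "\n".join(lines[:keep])   # class headers and imports
    tail = "\n".join(lines[-keep:])  # execution tails
    return head + marker + tail


source = "\n".join(f"line{i}" for i in range(10))
print(center_out_truncate(source))
```

For progressive truncation on a retry, the same function can be called again with a smaller `keep_ratio`.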

Security Analyst Persona:

15+ years malware analysis experience
Evidence-based reasoning (no speculation)
Distinguishes capability from intent
Maps to MITRE ATT&CK Mobile
India-specific threat assessment

7. Risk Scorer

Implementation: backend/pipeline/risk_scorer.py

XGBoost classifier with SHAP explainability and context-aware boost rules.

Multi-Dimensional Scoring:

dimension_scores = {
    "permission_abuse": 10% weight,      # Dangerous permissions
    "obfuscation": 10% weight,           # Code obfuscation
    "behavioral_anomaly": 15% weight,    # Runtime behavior
    "ml_malware": 45% weight,            # XGBoost prediction
    "developer_trust": 10% weight,       # Certificate validation
    "llm_severity": 10% weight,          # LLM assessment
}

composite_score = Σ(dimension_score × weight)
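
Written as runnable Python, the weighted sum above looks like this. The weights are the ones listed; the assumption is that every dimension score is on a 0-100 scale, and `composite_score` as a function name is hypothetical.

```python
DIMENSION_WEIGHTS = {
    "permission_abuse": 0.10,
    "obfuscation": 0.10,
    "behavioral_anomaly": 0.15,
    "ml_malware": 0.45,
    "developer_trust": 0.10,
    "llm_severity": 0.10,
}


def composite_score(dimension_scores: dict[str, float]) -> float:
    """Weighted sum of the six dimension scores (each assumed to be 0-100)."""
    return sum(dimension_scores[name] * weight
               for name, weight in DIMENSION_WEIGHTS.items())


example = {
    "permission_abuse": 80.0,
    "obfuscation": 40.0,
    "behavioral_anomaly": 60.0,
    "ml_malware": 90.0,
    "developer_trust": 20.0,
    "llm_severity": 70.0,
}
print(round(composite_score(example), 2))
```

Because the weights sum to 1, the composite stays on the same 0-100 scale as its inputs before any boost rules are applied.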

Boost Rules (Context-Aware Amplification):

SMS send attempts → +15 points
C2 domains contacted → +20 points
Accessibility service abuse → +12 points
Static C2 IPs found → +10 points
LLM verdict BLOCK → +10 points
Silent install attempt → +15 points
Root activity detected → +15 points
Shell execution → +10 points
Dynamic code loading → +10 points
VirusTotal malicious ≥5 → +25 points
YARA matches → +3 to +25 points (capped at +40)
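
Applying the boost rules amounts to adding fixed increments for each triggered condition. In this sketch the rule keys are hypothetical identifiers, and clamping the result to 100 is an assumption based on the 0-100 action thresholds.

```python
BOOST_RULES = {
    "sms_send_attempts": 15,
    "c2_domains_contacted": 20,
    "accessibility_abuse": 12,
    "static_c2_ips": 10,
    "llm_verdict_block": 10,
    "silent_install": 15,
    "root_activity": 15,
    "shell_execution": 10,
    "dynamic_code_loading": 10,
    "virustotal_malicious_ge_5": 25,
}
MAX_YARA_BOOST = 40


def apply_boosts(base_score: float, triggered: set[str], yara_boost: int = 0) -> float:
    """Add the boost for every triggered rule, cap the YARA part, clamp to 100."""
    boost = sum(BOOST_RULES[name] for name in triggered)
    boost += min(yara_boost, MAX_YARA_BOOST)
    return min(100.0, base_score + boost)


print(apply_boosts(30.0, {"sms_send_attempts", "c2_domains_contacted"}))
```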

Action Thresholds:

RISK_THRESHOLDS = {
    "LOW":    0-25  → APPROVE,
    "MEDIUM": 26-50 → MONITOR,
    "HIGH":   51-75 → ESCALATE,
    "CRITICAL": 76-100 → BLOCK
}
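
The threshold table translates into a simple lookup. The sketch treats each upper bound as inclusive and sends fractional scores that fall between bands (for example 25.5) to the higher band; that tie-breaking choice is an assumption, as is the name `action_for_score`.

```python
def action_for_score(score: float) -> str:
    """Map a 0-100 composite score to the recommended action."""
    if score <= 25:
        return "APPROVE"   # LOW
    if score <= 50:
        return "MONITOR"   # MEDIUM
    if score <= 75:
        return "ESCALATE"  # HIGH
    return "BLOCK"         # CRITICAL


for value in (12, 26, 68.5, 91):
    print(value, action_for_score(value))
```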

SHAP Explainability:

explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer.shap_values(feature_vector)

# Extract top 5 contributors
top_features = [
    ("permission_danger", +18.5),
    ("obfuscation_score", +12.3),
    ("c2_hit_count", +9.7),
    ("has_native_code", +5.2),
    ("graph_density", -3.1)
]

1. Static Analyzer

File: backend/pipeline/static_analyzer.py

Features Extracted:

from dataclasses import dataclass
from typing import List

@dataclass
class StaticFeatures:
    apk_hash: str                      # SHA256 hash
    package_name: str                  # com.example.app
    permission_list: List[str]         # Manifest permissions
    permission_danger_score: float     # 0-100 weighted risk
    dangerous_permission_count: int    # Count of high-risk perms
    suspicious_api_count: int          # Matches against SUSPICIOUS_API_PATTERNS
    api_suspicion_score: float         # 0-100 API risk
    top_apis: List[str]                # Most called methods
    high_entropy_count: int            # Shannon > 4.5
    obfuscation_score: float           # 0-100 code obfuscation
    suspicious_urls: List[str]         # Extracted HTTP(S) URLs
    c2_hit_count: int                  # C2 IP matches
    is_self_signed: bool               # Certificate issuer = subject
    cert_trust_score: float            # 0-100 certificate trust
    has_native_code: bool              # .so libraries present
    native_risk_score: float           # 0-100 native lib risk
    receiver_list: List[str]           # Broadcast receivers
    service_list: List[str]            # Background services
    graph_density: float               # NetworkX call graph density
    graph_node_count: int              # Methods in call graph
    graph_edge_count: int              # Method calls
    min_sdk: int                       # Minimum Android SDK
    target_sdk: int                    # Target Android SDK

Risk Calculation:

Permission Combo Bonus - READ_SMS + INTERNET = +10 points
Self-Signed Penalty - Subtracts 40 from cert_trust_score
Native Library Check - Matches KNOWN_MALICIOUS_LIBS (libfrida-gadget.so, etc.)

2. Dynamic Analyzer

File: backend/pipeline/dynamic_analyzer.py

Sandbox Modes:

Live Mode (USE_LIVE_SANDBOX=true) - Requires ADB; Frida and mitmproxy are optional add-ons
- Installs APK on connected device
- Injects Frida hooks for API monitoring
- Captures network traffic with mitmproxy
- Runs monkey for UI interaction
Emulated Mode (default) - Returns neutral values when no sandbox available

Features Extracted:

from dataclasses import dataclass
from typing import List

@dataclass
class DynamicFeatures:
    sandbox_mode: str                  # "live" or "emulated"
    sms_send_attempts: int             # sendTextMessage() calls
    network_domains_contacted: List[str]
    c2_domains_hit: int                # Known C2 matches
    data_exfil_bytes: int              # Total outbound traffic
    accessibility_service_abused: bool # Overlay attack detection
    clipboard_hijack_detected: bool    # ClipboardManager hooks
    silent_install_attempted: bool     # PackageInstaller calls
    camera_accessed: bool              # Camera.open() detected
    microphone_accessed: bool          # MediaRecorder usage
    location_accessed: bool            # GPS provider access
    device_admin_requested: bool       # DevicePolicyManager
    behavioural_anomaly_score: float   # 0-100 runtime risk
    matched_malware_family: str        # e.g. "BankBot", "Unknown"
    family_similarity_score: float     # 0.0-1.0 confidence

Live Sandbox Requirements:

# Android Debug Bridge
adb devices  # Must show at least one device

# Frida (optional - for runtime hooking)
pip install frida-tools
frida-ps -U  # List processes on USB device

# mitmproxy (optional - for network capture)
pip install mitmproxy
mitmdump --version

3. LLM Analyzer

File: backend/pipeline/llm_analyzer.py

System Prompt:

You are an elite Android malware analyst at a national cybersecurity agency. You have 15 years of experience with banking trojans, spyware, SMS stealers, and overlay attack frameworks. Never speculate without evidence from the code. Never produce generic statements — cite specific class names, method names, API calls, or string literals from the code.

Features Extracted:

from dataclasses import dataclass
from typing import List

@dataclass
class LLMFeatures:
    primary_function: str              # "What this app really does"
    malicious_behaviors: List[str]     # Specific behaviors with evidence
    data_collection: List[str]         # Data exfiltration methods
    obfuscation_techniques: List[str]  # Code obfuscation patterns
    attack_vectors: List[str]          # Technical attack chains
    india_specific_risks: List[str]    # UPI/OTP/Banking risks
    severity_score: float              # 0.0-1.0 LLM-assessed severity
    confidence: float                  # 0.0-1.0 verdict confidence
    verdict: str                       # CRITICAL/HIGH/MEDIUM/LOW/UNKNOWN
    recommended_action: str            # Next steps for analyst
    executive_summary: str             # 2-3 sentence summary
    zero_day_hypotheses: List[str]     # Novel threat theories
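One way the model's reply could be turned into these fields is sketched below. It assumes the LLM is asked to answer in JSON; the parser name `parse_llm_reply`, the fallback to UNKNOWN, and the clamping of scores are illustrative assumptions, not the project's confirmed behaviour.

```python
import json
from typing import Any, Dict

VALID_VERDICTS = {"CRITICAL", "HIGH", "MEDIUM", "LOW", "UNKNOWN"}


def clamp01(value: Any) -> float:
    """Coerce a value to a float in [0.0, 1.0]; non-numeric input becomes 0.0."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return 0.0


def parse_llm_reply(raw: str) -> Dict[str, Any]:
    """Hypothetical parser: map a JSON reply onto a subset of LLMFeatures fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    verdict = str(data.get("verdict", "UNKNOWN")).upper()
    return {
        "verdict": verdict if verdict in VALID_VERDICTS else "UNKNOWN",
        "severity_score": clamp01(data.get("severity_score")),
        "confidence": clamp01(data.get("confidence")),
        "malicious_behaviors": list(data.get("malicious_behaviors") or []),
    }


print(parse_llm_reply('{"verdict": "high", "severity_score": 1.7}'))
print(parse_llm_reply("not json")["verdict"])
```

Falling back to UNKNOWN on malformed output keeps a bad reply from being read as a clean verdict.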

Zero-Day Detection:

Triggered when severity_score > 0.6 AND family_similarity_score < 0.4
Generates 3 ranked threat hypotheses for unknown malware
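The trigger condition above can be expressed as a small predicate. The function name and the keyword defaults are illustrative; only the two thresholds come from the text.

```python
def should_generate_zero_day_hypotheses(
    severity_score: float,
    family_similarity_score: float,
    severity_threshold: float = 0.6,
    similarity_threshold: float = 0.4,
) -> bool:
    """True when the LLM rates the sample as severe but no known family matches well."""
    return (
        severity_score > severity_threshold
        and family_similarity_score < similarity_threshold
    )


print(should_generate_zero_day_hypotheses(0.8, 0.2))  # True
print(should_generate_zero_day_hypotheses(0.8, 0.7))  # False
```

Both comparisons are strict, so a sample sitting exactly on either threshold does not trigger hypothesis generation.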

API Configuration:

import os

import google.generativeai as genai

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # set in .env
genai.configure(api_key=GEMINI_API_KEY)
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content(prompt)

4. Risk Scorer

File: backend/pipeline/risk_scorer.py

Multi-Dimensional Scoring:

# Weights sum to 1.0
DIMENSION_WEIGHTS = {
    "permission_abuse": 0.20,    # Dangerous permissions
    "obfuscation": 0.15,         # Code obfuscation
    "behavioral_anomaly": 0.25,  # Runtime behavior
    "ml_malware": 0.20,          # XGBoost prediction
    "developer_trust": 0.10,     # Certificate validation
    "llm_severity": 0.10,        # Gemini assessment
}

# dimension_scores maps the same keys to the per-dimension scores
composite_score = sum(
    dimension_scores[name] * weight for name, weight in DIMENSION_WEIGHTS.items()
)

Boost Rules (Context-Aware Amplification):

SMS send attempts → +15 points
C2 domains contacted → +20 points
Accessibility service abuse → +12 points
Static C2 IPs found → +10 points
LLM verdict CRITICAL → +10 points
Silent install attempt → +15 points
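A worked sketch of how the weighted composite and the boost rules might combine. The signal names, the idea that boosts are added to the composite, and the cap at 100 are assumptions; the weights and point values come from the lists above.

```python
from typing import Dict

DIMENSION_WEIGHTS = {
    "permission_abuse": 0.20,
    "obfuscation": 0.15,
    "behavioral_anomaly": 0.25,
    "ml_malware": 0.20,
    "developer_trust": 0.10,
    "llm_severity": 0.10,
}

# Boost points per triggered signal (signal names are hypothetical).
BOOSTS = {
    "sms_send_attempts": 15.0,
    "c2_domains_contacted": 20.0,
    "accessibility_service_abuse": 12.0,
    "static_c2_ips_found": 10.0,
    "llm_verdict_critical": 10.0,
    "silent_install_attempt": 15.0,
}


def boosted_score(dimension_scores: Dict[str, float], signals: Dict[str, bool]) -> float:
    """Weighted sum of 0-100 dimension scores plus boosts, capped at 100 (assumed)."""
    base = sum(
        weight * dimension_scores.get(name, 0.0)
        for name, weight in DIMENSION_WEIGHTS.items()
    )
    boost = sum(points for name, points in BOOSTS.items() if signals.get(name))
    return min(100.0, base + boost)


example_scores = {
    "permission_abuse": 80, "obfuscation": 60, "behavioral_anomaly": 70,
    "ml_malware": 90, "developer_trust": 40, "llm_severity": 50,
}
print(round(boosted_score(example_scores, {}), 1))  # 69.5
print(boosted_score(example_scores, {"sms_send_attempts": True,
                                     "c2_domains_contacted": True}))  # 100.0 (capped)
```

Without the cap, the second call would give 104.5 (69.5 + 15 + 20), which falls outside the 0-100 range used by the action thresholds.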

Action Thresholds:

RISK_THRESHOLDS = {
    "LOW":      (0, 25, "APPROVE"),    # (min score, max score, action)
    "MEDIUM":   (26, 50, "MONITOR"),
    "HIGH":     (51, 75, "ESCALATE"),
    "CRITICAL": (76, 100, "BLOCK"),
}
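Since scores such as 82.3 are fractional, a lookup needs a rule for values between the integer bands (for example 25.5). The sketch below assumes each band extends up to and including its upper bound; that rule and the function name `classify` are assumptions.

```python
from typing import Tuple

# (inclusive upper bound, risk level, action), in ascending order
BANDS = [
    (25.0, "LOW", "APPROVE"),
    (50.0, "MEDIUM", "MONITOR"),
    (75.0, "HIGH", "ESCALATE"),
    (100.0, "CRITICAL", "BLOCK"),
]


def classify(score: float) -> Tuple[str, str]:
    """Map a 0-100 composite score to (risk level, action)."""
    if not 0.0 <= score <= 100.0:
        raise ValueError(f"score out of range: {score}")
    for upper, level, action in BANDS:
        if score <= upper:
            return level, action
    raise AssertionError("unreachable: the last band ends at 100")


print(classify(82.3))  # ('CRITICAL', 'BLOCK')
print(classify(25.5))  # ('MEDIUM', 'MONITOR')
```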

SHAP Explainability:

import shap

explainer = shap.TreeExplainer(xgb_model)            # xgb_model: the trained XGBoost model
shap_values = explainer.shap_values(feature_vector)  # one contribution per feature

# Top 5 contributors (illustrative values)
top_features = [
    ("permission_danger", +18.5),
    ("obfuscation_score", +12.3),
    ("c2_hit_count", +9.7),
    ("has_native_code", +5.2),
    ("graph_density", -3.1),
]
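The step from raw SHAP values to the top-five list can be sketched as below. Ranking by absolute contribution is an assumption (the source only shows the resulting list), and `cert_age_days` is a made-up extra feature included to show the cut-off.

```python
from typing import List, Sequence, Tuple


def top_contributors(
    feature_names: Sequence[str],
    shap_row: Sequence[float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k features with the largest absolute SHAP contribution."""
    pairs = list(zip(feature_names, shap_row))
    pairs.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return pairs[:k]


names = ["graph_density", "c2_hit_count", "cert_age_days",
         "permission_danger", "has_native_code", "obfuscation_score"]
values = [-3.1, 9.7, 0.4, 18.5, 5.2, 12.3]

for name, value in top_contributors(names, values):
    print(f"{name}: {value:+.1f}")
```

Sorting on the absolute value keeps strong negative contributors such as `graph_density` in the list, which is what lets the UI show factors that lowered the score alongside those that raised it.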

5. Report Generator

File: backend/pipeline/report_generator.py

Report Structure:

VERDICT: BLOCK — App exhibits clear signs of malicious intent.
RISK SCORE: 82.3/100 — Score driven by: permission_danger (+18.5), c2_hit_count (+12.0)

TECHNICAL FINDINGS:
  - Permission Analysis: 8 dangerous permissions requested.
  - Code Behaviour: Banking overlay with SMS interception. Accessibility service abuse for OTP capture.
  - Network Activity: Contacted 3 domains. C2 hits: 1.
  - Obfuscation: 127 high entropy strings detected (Score: 64.2).

INDIA-SPECIFIC THREAT: UPI transaction overlay, OTP SMS interception targeting Bank of India users.

RECOMMENDED ACTIONS:
  1. Immediate: Block application execution and network access.
  2. Investigation: Identify affected devices and reset credentials.
  3. Reporting: File a formal report with CERT-In.

EVIDENCE SUMMARY:
  * Hardcoded C2 IPs detected in code
  * Requests Accessibility Service (Overlay/Keylogger potential)
  * LLM identified: SMS interception with runtime code injection
  * Network traffic to known C2 domains
  * Suspicious API usage: sendTextMessage, getDeviceId, Runtime.exec

Forensic Indicators: Top 5 evidence items ranked by criticality, with technical citations (class names, method names, API calls).

🤖 Machine Learning

Model Training

Dataset Generation:

python -m backend.training.train_xgboost

Synthetic Data Distribution:

Benign Apps (n=800)
- Permission danger: μ=15, σ=10
- API suspicion: μ=16, σ=8
- Obfuscation: μ=12, σ=8
- Self-signed: 60%
Malicious Apps (n=800)
- Permission danger: μ=70, σ=18
- API suspicion: μ=72, σ=18
- Obfuscation: μ=65, σ=20
- Self-signed: 90%
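As a rough illustration, the distribution above could be sampled as follows. The column names, the fixed seed, and the clipping of scores to 0-100 are assumptions; the training script may generate its data differently.

```python
import random
from typing import Dict, List

# (mean, std) per score, plus the self-signed rate, taken from the table above.
PROFILES = {
    0: {"permission_danger": (15, 10), "api_suspicion": (16, 8),
        "obfuscation": (12, 8), "self_signed_rate": 0.60},   # benign
    1: {"permission_danger": (70, 18), "api_suspicion": (72, 18),
        "obfuscation": (65, 20), "self_signed_rate": 0.90},  # malicious
}
SCORE_COLUMNS = ("permission_danger", "api_suspicion", "obfuscation")


def make_samples(n_per_class: int = 800, seed: int = 42) -> List[Dict[str, float]]:
    """Draw n_per_class rows for each label from the Gaussian profiles."""
    rng = random.Random(seed)
    rows = []
    for label, profile in PROFILES.items():
        for _ in range(n_per_class):
            row = {"label": label}
            for name in SCORE_COLUMNS:
                mu, sigma = profile[name]
                # Clip to the 0-100 range (assumed) so scores stay valid.
                row[name] = max(0.0, min(100.0, rng.gauss(mu, sigma)))
            row["self_signed"] = int(rng.random() < profile["self_signed_rate"])
            rows.append(row)
    return rows


samples = make_samples()
print(len(samples))  # 1600
```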

Feature Engineering (backend/training/feature_engineering.py):

Missing value imputation (median strategy)
Column removal (>40% missing)
StandardScaler normalization
SMOTE oversampling (if class imbalance > 5:1)
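The four steps can be sketched with the standard library alone. This is a simplified stand-in, not the project's code: the real module presumably uses scikit-learn and imbalanced-learn, and all function names here are hypothetical. Missingness is measured before imputation, because the fraction cannot be recovered afterwards.

```python
import statistics
from typing import Dict, List, Optional

Column = List[Optional[float]]


def drop_sparse_columns(table: Dict[str, Column], max_missing: float = 0.40) -> Dict[str, Column]:
    """Keep only columns whose share of missing (None) values is at most max_missing."""
    return {
        name: values
        for name, values in table.items()
        if sum(v is None for v in values) / len(values) <= max_missing
    }


def impute_median(values: Column) -> List[float]:
    """Replace missing entries with the median of the observed ones."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def standardize(values: List[float]) -> List[float]:
    """Zero mean, unit variance, using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # constant columns are left at zero
    return [(v - mean) / std for v in values]


def needs_oversampling(labels: List[int], max_ratio: float = 5.0) -> bool:
    """True when the majority class outnumbers the minority by more than max_ratio."""
    positives = sum(labels)
    minority, majority = sorted((positives, len(labels) - positives))
    return minority > 0 and majority / minority > max_ratio


table = {
    "permission_danger": [10.0, None, 30.0, 50.0, 70.0],  # 20% missing: kept
    "rare_feature": [None, None, None, 1.0, 2.0],         # 60% missing: dropped
}
kept = drop_sparse_columns(table)
scaled = standardize(impute_median(kept["permission_danger"]))
print(list(kept), scaled)
print(needs_oversampling([0] * 12 + [1] * 2))  # True (6:1)
```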

Model Hyperparameters:

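from xgboost import XGBClassifier

# ratio = negative-class count / positive-class count in the training set
XGBClassifier(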
    n_estimators=300,           # 300 boosting rounds
    max_depth=6,                # Tree depth
    learning_rate=0.05,         # Step size shrinkage
    subsample=0.8,              # Row sampling
    colsample_bytree=0.8,       # Column sampling
    scale_pos_weight=ratio,     # Class imbalance weight
    eval_metric=["logloss", "auc"],
    early_stopping_rounds=20,   # Validation patience
    tree_method="hist"          # CPU-optimized
)

Evaluation Metrics (backend/training/evaluate.py):

Precision, Recall, F1-Score
ROC-AUC
Confusion Matrix
Feature Importance (gain/weight/cover)
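For reference, the confusion matrix and the three threshold metrics relate as in the sketch below. It is a plain-Python illustration with a hypothetical function name; ROC-AUC and feature importance are not covered here.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, object]:
    """Confusion matrix plus precision, recall and F1 for binary labels (1 = malicious)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    return {
        "confusion_matrix": [[tn, fp], [fn, tp]],  # rows: actual, columns: predicted
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
print(classification_metrics(y_true, y_pred))
```

In this example one malicious app is missed and one benign app is flagged, so precision, recall and F1 all come out at 0.75.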

Model Artifacts:

models/
├── xgboost_mobileguard.json      # Trained XGBoost model
├── scaler.pkl                    # StandardScaler object
├── feature_columns.json          # 37 feature names
└── shap_feature_importance.png  # SHAP summary plot

Real-World Dataset Integration

Replace synthetic data with:

Drebin Dataset - 5,560 malware samples, 123K+ benign apps
CIC-AndMal2017 - 426 malware samples from 42 families across 4 categories
AndroZoo - 10M+ APKs with VirusTotal labels

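import pandas as pd

# Example: Load Drebin parquet
# engineer_features is assumed to live in backend/training/feature_engineering.py
df = pd.read_parquet("data/drebin_features.parquet")
X, y, feature_columns, scaler = engineer_features(df)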

🎨 Frontend Architecture

Component Hierarchy

App.jsx
├── Header (System Status)
│   ├── API Health Indicator
│   ├── Analysis Engine Status
│   └── Last Scan Timestamp
│
├── Left Panel (4 cols)
│   ├── UploadZone (Drag & Drop)
│   └── ProgressTracker (5 stages with icons)
│
└── Right Panel (8 cols)
    ├── ActionBanner (Verdict + Score)
    ├── RiskGauge (Circular gauge with gradient)
    ├── DimensionChart (Radar chart with 6 axes)
    ├── ShapExplainer (Waterfall plot)
    ├── ThreatReport (Collapsible sections)
    └── AuditLog (Paginated table)

Key Components

UploadZone (src/components/UploadZone.jsx):

Drag-and-drop zone with hover state
File type validation (.apk only)
Size validation (150MB max client-side check)
Lucide Upload icon with animation
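The two validation rules are simple enough to mirror outside the browser. The sketch below restates them in Python as an illustration only; whether the backend applies the same checks, the function name, and reading "150MB" as 150 MiB are all assumptions.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 150 * 1024 * 1024  # assumes "150MB" means 150 MiB


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None when the upload passes both checks."""
    if not filename.lower().endswith(".apk"):
        return "Only .apk files are accepted"
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File exceeds the 150 MB limit"
    return None


print(validate_upload("bank_app.apk", 12_000_000))  # None
print(validate_upload("notes.txt", 1_000))          # Only .apk files are accepted
```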

ProgressTracker (src/components/ProgressTracker.jsx):

5 stages with icons (FileSearch, Activity, Brain, BarChart, FileText)
Progress bar with gradient fill
Real-time status updates from SSE
Error state with AlertTriangle icon

RiskGauge (src/components/RiskGauge.jsx):

Recharts RadialBarChart
Dynamic color gradient (green → yellow → orange → red)
Score label with action badge
Animated arc fill with easeElastic

DimensionChart (src/components/DimensionChart.jsx):

Recharts RadarChart with 6 dimensions
Permission Abuse, Obfuscation, Behavioral Anomaly, ML Malware, Developer Trust, LLM Severity
Gradient fill with opacity
Tooltip with dimension explanations

ShapExplainer (src/components/ShapExplainer.jsx):

Top 5 feature contributions
Positive values (red) vs Negative values (green)
Horizontal bar chart with labels
Explanation text from risk_scorer

ThreatReport (src/components/ThreatReport.jsx):

Collapsible sections (Executive Summary, Technical Findings, Evidence)
Copy-to-clipboard functionality
Malware family badge
India-specific risk flag
Forensic indicators with checkboxes

ActionBanner (src/components/ActionBanner.jsx):

Color-coded by action (APPROVE=green, MONITOR=yellow, ESCALATE=orange, BLOCK=red)
Large composite score display
Icon (Shield, AlertTriangle, XCircle)
Framer Motion slide-in animation

Design System

Colors (Tailwind config):

colors: {
  background: "#07111F",     // Deep navy
  card: "rgba(255,255,255,0.04)", // Glassmorphism
  accent: "#3B82F6",         // Blue
  success: "#22C55E",        // Green
  warning: "#F59E0B",        // Amber
  danger: "#EF4444",         // Red
  muted: "#64748B",          // Slate
  textPrimary: "#F8FAFC",    // Off-white
  textSecondary: "#94A3B8"   // Light slate
}

Animations:

Page load: Staggered fade-in (Framer Motion)
Card mount: Scale + opacity transition
Progress bar: Smooth width animation with spring physics
Gauge fill: Arc sweep with easeElastic timing

Typography:

Font: Inter (variable font for optimal performance)
Heading: 4xl/5xl bold with tight tracking
Body: Base/lg with relaxed line height
Code: Monospace (JetBrains Mono fallback)

🧪 Testing

Backend Tests

cd backend
pytest tests/ -v --cov=backend --cov-report=html

Test Files:

tests/test_api.py - FastAPI endpoint tests
tests/test_static.py - Static analyzer unit tests
tests/test_scorer.py - Risk scoring validation

Example Test:

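from fastapi.testclient import TestClient

from backend.main import app  # import path assumes the repo root is on sys.path

client = TestClient(app)


def test_health_endpoint_returns_ok():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "ok"


def test_analyze_rejects_non_apk_files(tmp_path):
    # tmp_path is a pytest fixture, so no stray file is left in the working directory
    fake_apk = tmp_path / "test.txt"
    fake_apk.write_text("Not an APK")
    with fake_apk.open("rb") as f:
        response = client.post("/analyze", files={"file": f})
    assert response.status_code == 422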

Frontend Tests

cd frontend
npm run test  # Vitest + React Testing Library

Testing Strategy:

Unit tests for API client functions
Component tests with mocked API responses
Integration tests for upload flow
Visual regression tests (optional - with Playwright)

🚢 Deployment

Docker Compose (Recommended)

# docker-compose.yml
services:
  backend:
    build:
      context: .
      dockerfile: Dockerfile.backend
    ports:
      - "8000:8000"
    env_file: .env
    volumes:
      - ./data:/app/data        # Persistent cache & logs
      - ./models:/app/models    # Pre-trained model
    restart: unless-stopped

  frontend:
    build:
      context: .
      dockerfile: Dockerfile.frontend
    ports:
      - "3000:80"
    depends_on:
      - backend
    restart: unless-stopped

Deployment Commands:

docker-compose up -d           # Start in detached mode
docker-compose logs -f backend # View backend logs
docker-compose down            # Stop all services

Production Considerations

Environment Variables:
- Use Docker secrets or AWS Secrets Manager for API keys
- Never commit .env to version control
Reverse Proxy:
- Configure Nginx for SSL termination
- Set up rate limiting (e.g., 10 uploads/minute per IP)
- Enable CORS only for trusted origins
Database:
- Replace SQLite with PostgreSQL for multi-node deployments
- Use connection pooling (SQLAlchemy pool_size=20)
Storage:
- Mount persistent volumes for data/ and models/
- Use S3 for audit log archival
Monitoring:
- Prometheus metrics for API latency, error rates
- Grafana dashboards for pipeline stage durations
- Sentry for exception tracking
Security:
- Run containers as non-root user
- Scan Docker images with Trivy
- Enable AppArmor/SELinux profiles

Cloud Deployment

AWS ECS (Fargate):

# Build and push to ECR
aws ecr get-login-password | docker login --username AWS --password-stdin <ecr-repo>
docker build -f Dockerfile.backend -t mobileguard-backend .
docker tag mobileguard-backend:latest <ecr-repo>/mobileguard-backend:latest
docker push <ecr-repo>/mobileguard-backend:latest

# Deploy with Fargate task definition
aws ecs update-service --cluster prod --service mobileguard --force-new-deployment

Kubernetes (GKE/EKS):

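# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mobileguard-backend
spec:
  replicas: 3
  selector:                        # required for apps/v1 Deployments
    matchLabels:
      app: mobileguard-backend
  template:
    metadata:
      labels:
        app: mobileguard-backend   # must match the selector above
    spec:
      containers:
      - name: api
        image: gcr.io/project/mobileguard-backend:v1.0.0
        env:
        - name: GEMINI_API_KEY
          valueFrom:
            secretKeyRef:
              name: api-secrets
              key: gemini-key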

📂 Project Structure

mobileguard-ai/
├── backend/                          # Python FastAPI backend
│   ├── config.py                     # Environment config & constants
│   ├── main.py                       # FastAPI app & endpoints
│   ├── requirements.txt              # Python dependencies
│   ├── dataset_feature_extractor.py  # 37-feature extraction for ML model
│   │
│   ├── pipeline/                     # Analysis pipeline modules
│   │   ├── orchestrator.py          # Pipeline coordinator with SSE streaming
│   │   ├── static_analyzer.py       # Androguard-based APK analysis
│   │   ├── dynamic_analyzer.py      # ADB + Frida sandbox execution
│   │   ├── llm_analyzer.py          # LLM analysis with structured output
│   │   ├── resilient_router.py      # Multi-tier LLM routing (Gemini → GPT-4o)
│   │   ├── risk_scorer.py           # XGBoost + SHAP scoring
│   │   ├── report_generator.py      # Threat report generation
│   │   ├── behavior_scorer.py       # Runtime behavior anomaly scoring
│   │   ├── runtime_collectors.py    # Frida hooks + dumpsys collectors
│   │   ├── runtime_events.py        # Event data structures
│   │   └── event_mapper.py          # Event → feature mapping
│   │
│   ├── detection/                    # Signature-based detection
│   │   ├── yara_engine.py           # YARA scanner with metadata scoring
│   │   └── yara_rules/              # .yar signature files
│   │
│   ├── intel/                        # Threat intelligence
│   │   ├── mitre_mapper.py          # ATT&CK technique mapping
│   │   └── family_classifier.py     # Malware family classification
│   │
│   ├── data/                         # Data management
│   │   ├── feature_store.py         # SQLite caching layer
│   │   ├── audit_logger.py          # JSONL audit logging
│   │   └── threat_intel.py          # C2 blocklist + VirusTotal integration
│   │
│   ├── training/                     # ML model training
│   │   ├── train_xgboost.py         # Model training script
│   │   ├── feature_engineering.py   # SMOTE + StandardScaler
│   │   └── evaluate.py              # Metrics & SHAP plots
│   │
│   └── tests/                        # Pytest test suite
│       ├── test_api.py              # FastAPI endpoint tests
│       ├── test_static.py           # Static analyzer tests
│       └── test_scorer.py           # Risk scorer tests
│
├── frontend/                         # React + Vite frontend
│   ├── src/
│   │   ├── App.jsx                  # Main application component
│   │   ├── main.jsx                 # React entry point
│   │   ├── index.css                # Tailwind base styles
│   │   │
│   │   ├── api/
│   │   │   └── client.js            # Fetch API wrapper (SSE support)
│   │   │
│   │   └── components/              # React components
│   │       ├── UploadZone.jsx       # Drag & drop file upload
│   │       ├── ProgressTracker.jsx  # 5-stage progress indicator
│   │       ├── RiskGauge.jsx        # Recharts radial gauge
│   │       ├── DimensionChart.jsx   # 6-axis radar chart
│   │       ├── ShapExplainer.jsx    # Feature attribution viz
│   │       ├── ThreatReport.jsx     # Collapsible report card
│   │       ├── ActionBanner.jsx     # Verdict display banner
│   │       └── AuditLog.jsx         # Paginated log table
│   │
│   ├── public/
│   │   ├── favicon.svg
│   │   └── icons.svg
│   │
│   ├── package.json                 # NPM dependencies
│   ├── vite.config.js               # Vite build config
│   ├── tailwind.config.js           # Tailwind theme
│   └── postcss.config.js            # PostCSS plugins
│
├── models/                           # ML model artifacts
│   ├── xgboost_mobileguard.json     # Trained XGBoost model
│   ├── scaler.pkl                   # StandardScaler object
│   ├── feature_columns.json         # 37 feature names
│   └── shap_feature_importance.png  # Feature importance plot
│
├── data/                             # Runtime data storage
│   ├── feature_cache.sqlite         # APK analysis cache
│   ├── audit_2026-06-18.jsonl       # Daily audit logs
│   └── certin_iocs.json             # Threat intel feed
│
├── docker-compose.yml                # Multi-container orchestration
├── Dockerfile.backend                # Backend container image
├── Dockerfile.frontend               # Frontend container image
├── nginx.conf                        # Nginx config for frontend
├── .env                              # Environment variables
└── README.md                         # This file

🚀 Future Improvements

Phase 1: Enhanced Detection Capabilities

Advanced Static Analysis

DEX2JAR Integration - Decompile to Java bytecode for deeper semantic analysis
Control Flow Graph (CFG) Analysis - Detect code reachability and dead code patterns
Data Flow Tracking - Trace sensitive data from source to sink (taint analysis)
String Encryption Detection - Pattern matching for common encryption libraries (AES, RSA)
Anti-Analysis Detection - Identify emulator checks, debugger detection, root detection
Resource Analysis - Inspect assets, raw files, and embedded payloads

Live Sandbox Enhancements

Automated Device Farm - Integrate with AWS Device Farm or BrowserStack
Multi-Device Testing - Test across Android 8-14 with different screen sizes
Kernel-Level Monitoring - eBPF-based syscall tracing for privilege escalation detection
UI Automation - Selenium-like APK interaction for permission dialog testing
Memory Dump Analysis - Extract runtime strings, loaded libraries, decrypted payloads
SSL Pinning Bypass - Automatic certificate unpinning for network analysis

LLM Intelligence Evolution

Multi-Model Ensemble - Combine Gemini, GPT-4, Claude for consensus scoring
Code Summarization - Generate human-readable pseudocode from smali/DEX
Threat Actor Attribution - Link malware samples to known APT groups
Natural Language Queries - "Show me all apps that access SMS and call APIs"
Automated IOC Extraction - Extract IPs, domains, file hashes from analysis
Fine-Tuned Security Model - Train Gemini on labeled malware corpus

Phase 2: Scale & Performance

Distributed Processing

Celery Task Queue - Asynchronous APK processing with Redis backend
Horizontal Scaling - Load balancer with 3+ API replicas
Database Migration - PostgreSQL with read replicas for feature store
Caching Layer - Redis for hot APK hashes (< 1ms retrieval)
Batch Analysis API - Upload 100+ APKs with parallel processing
GraphQL API - Flexible querying for frontend/integrations

Performance Optimizations

Model Quantization - Reduce XGBoost model size by 60% (int8 inference)
Lazy Feature Extraction - Extract only features needed by ML model
Incremental Analysis - Cache intermediate results (static → dynamic → LLM)
APK Deduplication - SHA256-based early termination for known samples
Streaming Decompilation - Process APK classes incrementally
CDN Integration - Serve frontend assets via CloudFront/Cloudflare

Phase 3: Extended Threat Intelligence

Real-Time Threat Feeds

VirusTotal Integration - Cross-reference hashes with 70+ AV engines
MISP Integration - Ingest IOCs from Malware Information Sharing Platform
AlienVault OTX - Community threat intelligence feed
CERT-In Feed - Official Indian government threat bulletins
Custom IOC Management - Upload enterprise-specific C2 domains/IPs
Threat Actor Profiles - Link samples to known groups (Lazarus, APT28)

Malware Family Classification

Drebin Feature Vectors - Train classifier on Drebin's static feature sets (the dataset covers 179 malware families)
Signature Database - 500+ malware family YARA rules
Similarity Hashing - SSDeep/TLSH for variant detection
Behavioral Clustering - Group unknown samples by runtime behavior
Family Evolution Tracking - Detect new variants of known families

Phase 4: Regional & Compliance Features

India-Specific Enhancements

UPI Deep Inspection - Detect PhonePe/Paytm/Google Pay overlay attacks
Aadhaar OTP Monitoring - Flag apps intercepting UIDAI SMS
Banking App Whitelist - Trusted app signatures for 30+ Indian banks
Regional Language Support - Hindi/Tamil/Bengali UI translations
RBI Compliance Reporting - Generate reports aligned with RBI guidelines
NPCI Notification Integration - Alert on suspicious UPI transaction apps

Enterprise Features

SIEM Integration - Export logs to Splunk/ELK/QRadar
SOAR Playbooks - Automated response workflows (quarantine, alert, block)
Active Directory SSO - LDAP/SAML authentication
Multi-Tenancy - Isolated workspaces for different business units
Role-Based Access Control (RBAC) - Analyst/Admin/Auditor roles
Compliance Reports - SOC 2, ISO 27001, GDPR audit trails

Phase 5: Advanced ML & AI

Deep Learning Models

MalConv - 1D CNN for raw APK byte sequence classification
DexRay - Graph neural network on call graphs
Transformer-Based Classifier - BERT fine-tuned on decompiled code
Generative Adversarial Network (GAN) - Synthetic malware generation for training
Reinforcement Learning Sandbox - AI-driven APK interaction for maximum coverage

Explainability Improvements

LIME Integration - Local interpretable model-agnostic explanations
Counterfactual Analysis - "What changes would flip the verdict?"
Feature Interaction Plots - 2D SHAP dependence plots
Natural Language Explanations - LLM-generated risk summaries
Interactive Decision Trees - Visualize XGBoost tree paths

Continuous Learning

Active Learning Pipeline - Flag uncertain samples for analyst review
Model Drift Detection - Monitor prediction distribution shifts
Online Learning - Update model with new labeled samples
A/B Testing Framework - Compare model versions in production
AutoML Integration - Hyperparameter tuning with Optuna/Ray Tune

Phase 6: User Experience & Visualization

Frontend Enhancements

3D Call Graph Visualization - Three.js interactive network diagram
Timeline View - Chronological analysis stage progression
Comparison Mode - Side-by-side analysis of 2+ APKs
Dark/Light Mode Toggle - User preference persistence
Export Reports - PDF/DOCX generation with branding
Mobile App - React Native companion for on-the-go analysis

Collaboration Features

Team Comments - Annotate analysis results with threaded discussions
Shared Workspaces - Collaborative investigations
Notification System - Email/Slack alerts for high-risk APKs
Analyst Dashboard - Personal queue, statistics, leaderboard
API Webhooks - Push notifications to external systems

Phase 7: Open Source Ecosystem

Community Contributions

Plugin System - Custom analyzers via Python entry points
YARA Rule Repository - Community-contributed malware signatures
Threat Hunt Queries - Sigma-style detection rules
Sample Exchange - Secure APK sharing platform (hashed uploads)
Public API - Rate-limited free tier for researchers
Documentation Portal - Interactive API explorer, tutorials, blog

Research Initiatives

Academic Partnerships - Collaborate with universities on novel techniques
Conference Papers - Publish findings at BlackHat, DEF CON, USENIX
Bug Bounty Program - Reward security researchers for vulnerabilities
Open Dataset Release - Anonymized analysis results for research
Benchmark Suite - Standard test set for comparing malware detectors

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Fork the repository and create a feature branch:

git checkout -b feature/your-feature-name

Make changes with clear commit messages:

git commit -m "feat(static): Add native library signature matching"

Write tests for new features:

pytest tests/test_your_feature.py -v

Update documentation if adding public APIs
Submit a pull request with:
- Description of changes
- Test results
- Screenshots (for UI changes)

Commit Convention:

feat: - New feature
fix: - Bug fix
docs: - Documentation update
refactor: - Code refactoring
test: - Test additions/updates
chore: - Build/tooling changes

📄 License

MIT License - See LICENSE file for details.

🙏 Acknowledgments

Androguard - APK analysis framework
XGBoost - Gradient boosting library
SHAP - Explainable AI toolkit
Google Gemini - LLM API for contextual analysis
Drebin Dataset - Android malware research dataset
CERT-In - Indian cybersecurity standards
Recharts - React charting library
Framer Motion - Animation library

📞 Support

Documentation: docs.mobileguard.ai
Issues: GitHub Issues
Email: indiser01@gmail.com

Built with ❤️ for cybersecurity professionals

