Course: Applied Artificial Intelligence (B.Tech Sem 6)
Dataset: Android.log — 1,555,005 lines from a Huawei Android device (Dec 17, 2016)
Every Android device produces a continuous stream of log messages — thousands per second from the OS kernel, background services, UI framework, and apps. This project treats that stream as raw text data and answers a single question:
Which log entries are anomalous, and what does the normal structure of system logs look like?
This is an end-to-end unsupervised NLP + anomaly detection pipeline. There are no labels. The system must learn what "normal" looks like from the data alone and surface entries that deviate from it. The pipeline covers the full lifecycle: raw log parsing → statistical characterisation → text vectorisation → clustering → sentiment → anomaly scoring.
| Property | Value |
|---|---|
| File | Android.log |
| Lines | 1,555,005 |
| Format | Android logcat (MM-DD HH:MM:SS.mmm PID TID LEVEL TAG: message) |
| Device | Huawei (EMUI) |
| Period | Single day — Dec 17, 2016 |
| Log levels | V (Verbose), D (Debug), I (Info), W (Warning), E (Error), F (Fatal) |
Each line is one log event, emitted by a process (PID) on a thread (TID) from a component (TAG). The message is free-form text.
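A regex parser for this line format can be sketched as follows. The pattern and the `parse_lines` helper are illustrative, not the notebook's exact code, and the sample line is synthetic:

```python
import re
import pandas as pd

# Matches "MM-DD HH:MM:SS.mmm PID TID LEVEL TAG: message".
# Field names are illustrative, not taken from the project's notebook.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+"
    r"(?P<level>[VDIWEF])\s+"
    r"(?P<tag>\S+?)\s*:\s"
    r"(?P<message>.*)$"
)

def parse_lines(lines):
    """Parse raw logcat lines into a DataFrame, skipping malformed rows."""
    rows = [m.groupdict() for line in lines if (m := LOG_PATTERN.match(line))]
    return pd.DataFrame(rows)

sample = ["12-17 19:31:36.263  1795  2115 D PowerManagerService: ready=true"]
df = parse_lines(sample)
```

Malformed lines simply fail the match and are dropped, which is the usual trade-off when parsing free-form logcat output.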
Android.log (1.55M lines)
│
▼
1. Parse with regex → structured DataFrame
│
▼
2. EDA — level distribution, time series, tag activity, message lengths, hourly heatmap
│
▼
3. Statistical Tests
├── Chi-Square — does log-level distribution shift over the day?
├── Welch T-Test — are error messages longer than info messages?
├── Mann-Whitney — non-parametric confirmation
├── One-Way ANOVA — message length across all five levels
└── KS Test — CDF shape: Debug vs Verbose
│
▼
4. Text Processing (50k sample)
└── lowercase → strip hex/addresses/paths/numbers → remove stopwords → Porter stem
│
▼
5. Vector Embeddings
├── TF-IDF (5,000 features, unigrams + bigrams, sublinear TF)
└── LSA via TruncatedSVD (30 components) → dense semantic vectors
│
▼
6. Clustering
├── K-Means — elbow + silhouette to pick k; clusters interpreted as subsystems
├── DBSCAN — density-based; noise points = structural anomalies
└── t-SNE — 2-D projection of embedding space
│
▼
7. Sentiment Analysis (TextBlob)
└── polarity + subjectivity per message; aggregated by level and over time
│
▼
8. Anomaly Detection
├── Isolation Forest — tree-based; fast on large data
├── Local Outlier Factor — density-based; catches local deviations
└── Ensemble (IF ∩ LOF) — intersection for high-confidence anomalies
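Step 5 above can be sketched as follows. The vectoriser parameters are the ones stated in the pipeline; the toy corpus and the clamping of the component count (needed for a corpus this small) are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Stand-in for the 50k cleaned, stemmed log messages
docs = [
    "power manag servic updat wakelock",
    "keyguard unlock screen displai",
    "network broadcast connect wifi state",
    "step counter health sensor updat",
]

# TF-IDF: 5,000 features, unigrams + bigrams, sublinear TF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(docs)

# LSA: the notebook uses 30 components; clamp so the toy corpus still works
n_comp = min(30, X.shape[0] - 1, X.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=42)
embeddings = svd.fit_transform(X)

variance_kept = svd.explained_variance_ratio_.sum()
```

The dense `embeddings` matrix is what the clustering and anomaly-detection stages consume.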
| File | Purpose |
|---|---|
| log_anomaly_detection.ipynb | Primary deliverable — 33 cells, fully documented |
| app.py | Streamlit interactive dashboard |
| setup_notebook.py | Regenerates the notebook from source |
| requirements.txt | All Python dependencies |
| venv/ | Isolated Python 3.12 environment |
| Android.log | Raw dataset (192 MB) |
| # | Section | Key Output |
|---|---|---|
| 1 | Data Loading & Parsing | Regex parser for logcat format; 1.55M rows → DataFrame |
| 2 | EDA | Level pie/bar, events-per-minute time series with spike detection, top-20 tags, message length violin/box, hour × level heatmap |
| 3 | Statistical Tests | Chi-Square, T-Test, Mann-Whitney U, ANOVA, KS — each with plot and conclusion |
| 4 | Text Processing | Cleaning pipeline, word frequency, word cloud, bigram/trigram charts |
| 5 | Vector Embeddings | TF-IDF matrix, LSA variance curve, 30-d dense embeddings |
| 6 | Clustering | K-Means elbow + silhouette, cluster interpretation by top terms, DBSCAN noise analysis, t-SNE coloured by cluster and level |
| 7 | Sentiment Analysis | TextBlob polarity/subjectivity, distribution by level, polarity time series, polarity vs subjectivity scatter |
| 8 | Anomaly Detection | Isolation Forest score distribution, LOF scores, ensemble comparison pie, top-10 anomaly table |
| 9 | Summary | Findings table |
An interactive version of the full pipeline with live filters.
```bash
source venv/bin/activate
streamlit run app.py
```

Tabs:
| Tab | Content |
|---|---|
| Overview | Dataset metrics, level distribution |
| EDA | Time series, top tags, message length, heatmap — filter by log level |
| Statistical Tests | All 5 tests with live metric cards |
| Text & Embeddings | Word cloud, frequency chart, LSA variance |
| Clustering | k slider, eps slider, t-SNE projection |
| Sentiment | Polarity dashboard, time series |
| Anomaly Detection | Contamination slider, model comparison, top anomaly table |
| Log Explorer | Filter by level / tag / message substring, CSV download |
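The filtering logic behind the Log Explorer tab reduces to plain DataFrame masking. A minimal sketch, assuming a parsed DataFrame with `level`, `tag`, and `message` columns (the `explore_logs` helper name is hypothetical):

```python
import pandas as pd

def explore_logs(df, level=None, tag=None, substring=None):
    """Filter parsed logs by level, tag, and message substring (case-insensitive)."""
    out = df
    if level:
        out = out[out["level"] == level]
    if tag:
        out = out[out["tag"] == tag]
    if substring:
        out = out[out["message"].str.contains(substring, case=False, regex=False)]
    return out

logs = pd.DataFrame({
    "level": ["E", "I"],
    "tag": ["WifiService", "ActivityManager"],
    "message": ["connection failed", "activity resumed"],
})
errors = explore_logs(logs, level="E")
```

The CSV download is then just `out.to_csv(index=False)` handed to Streamlit's `st.download_button`.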
EDA
Debug (~50%) and Info (~30%) dominate the log. Error + Fatal together are less than 2% of all entries. Activity spikes occur in short bursts — likely screen-on events, notification cycles, and sync operations.
Statistical Tests
The Chi-Square test rejects the null hypothesis: log-level proportions shift measurably across the day. Error messages are significantly longer than Info messages (Welch's t-test, p ≪ 0.05), which makes sense: error messages tend to include stack traces, context strings, and diagnostic data. ANOVA confirms that at least one group mean differs across the five levels.
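The two headline tests reduce to standard SciPy calls. A minimal sketch on synthetic data (the message lengths and the contingency table below are illustrative, not values from Android.log):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic message lengths: errors longer and more variable than info
error_lens = rng.normal(120, 30, 500)
info_lens = rng.normal(60, 15, 2000)

# Welch's t-test: equal_var=False handles the unequal variances
t_stat, p_welch = stats.ttest_ind(error_lens, info_lens, equal_var=False)

# Chi-square on a time-bucket x level contingency table
# (rows: two parts of the day; columns: three log levels)
table = np.array([
    [500, 300, 40],
    [450, 380, 90],
])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

Rejecting at p < 0.05 in both tests corresponds to the notebook's conclusions: level proportions drift over the day, and error messages are longer on average.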
Text Processing
After stripping memory addresses, hex values, file paths, and numbers — the dominant noise in system logs — each message reduces to ~8–12 meaningful tokens. Top stems relate to power management, keyguard, display, and network operations.
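The cleaning pipeline can be sketched as a single function. The regexes and the tiny inline stopword set are illustrative; the notebook uses NLTK's full English stopword list with the same Porter stemmer:

```python
import re
from nltk.stem import PorterStemmer

# Abbreviated for the sketch; the notebook uses NLTK's full stopword list
STOPWORDS = {"the", "a", "an", "is", "to", "of", "at", "in", "for", "on"}
stemmer = PorterStemmer()

def clean_message(msg):
    """lowercase -> strip hex/paths/numbers -> drop stopwords -> Porter stem."""
    msg = msg.lower()
    msg = re.sub(r"0x[0-9a-f]+", " ", msg)   # hex values / memory addresses
    msg = re.sub(r"(/[\w.\-]+)+", " ", msg)  # file paths
    msg = re.sub(r"\d+", " ", msg)           # bare numbers
    tokens = re.findall(r"[a-z]+", msg)
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

tokens = clean_message("Battery level at 87, reading /sys/class/power 0x7fa3")
```

Order matters: hex values must be stripped before bare numbers, otherwise `0x7fa3` would leave an `x` fragment behind.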
Clustering
K-Means clusters map loosely onto Android subsystems: display/brightness, keyguard/security, network/broadcast, health/step-counting, and general system. DBSCAN marks ~5% of entries as noise — these are structurally isolated messages that don't belong to any dense region of the embedding space and are the strongest candidates for anomalies.
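The K-Means/DBSCAN split described above can be reproduced on synthetic data. The 2-D blobs below stand in for the 30-d LSA embeddings; the `eps`/`min_samples` values are illustrative, not the notebook's tuned settings:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two dense regions plus a handful of isolated points (structural outliers)
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),
    rng.normal(5, 0.3, (100, 2)),
    rng.uniform(-10, 10, (5, 2)),
])

# K-Means with k chosen by elbow + silhouette in the notebook; k=2 here
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sil = silhouette_score(X, km.labels_)

# DBSCAN: points labelled -1 fall in no dense region -> structural anomalies
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
noise_frac = float(np.mean(db.labels_ == -1))
```

K-Means assigns every point to a cluster, including the outliers; DBSCAN's noise label is what makes it useful as an anomaly signal.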
Sentiment
Most messages are neutral (as expected for machine-generated logs). Error and Fatal messages carry a slight negative polarity signal — this validates that TextBlob polarity has some discriminative signal even on technical text.
Anomaly Detection
Isolation Forest and LOF independently flag ~5% of entries. Their intersection (the ensemble) is a smaller, higher-confidence set. Top anomalies tend to be unusually long messages from rarely-active components — consistent with error conditions that generate verbose diagnostic output.
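The ensemble intersection can be sketched as follows, with synthetic 5-d vectors standing in for the LSA embeddings; the contamination rate matches the ~5% flagged in the findings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# 500 "normal" points plus 10 injected far-away anomalies
X = np.vstack([rng.normal(0, 1, (500, 5)), rng.uniform(8, 12, (10, 5))])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
if_flags = iso.predict(X) == -1                 # -1 marks anomalies

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_flags = lof.fit_predict(X) == -1

# High-confidence set: flagged by BOTH detectors
ensemble = if_flags & lof_flags
```

The intersection is always at least as small as either individual set, which is why it trades recall for precision.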
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Regenerate notebook (already done)
python setup_notebook.py

# Open notebook
jupyter notebook log_anomaly_detection.ipynb

# Launch dashboard
streamlit run app.py
```

pandas · numpy · matplotlib · seaborn · scipy
scikit-learn · nltk · textblob · wordcloud
streamlit · plotly · nbformat
Python 3.10+ required. Tested on Python 3.12.