Course: Applied Artificial Intelligence (B.Tech Sem 6)
Dataset: Android.log — 1,555,005 lines from a Huawei Android device (Dec 17, 2016)
Every Android device produces a continuous stream of log messages — thousands per second from the OS kernel, background services, UI framework, and apps. This project treats that stream as raw text data and answers a single question:
Which log entries are anomalous, and what does the normal structure of system logs look like?
This is an end-to-end unsupervised NLP + anomaly detection pipeline. There are no labels. The system must learn what "normal" looks like from the data alone and surface entries that deviate from it. The pipeline covers the full lifecycle: raw log parsing → statistical characterisation → text vectorisation → clustering → sentiment → anomaly scoring.
| Property | Value |
|---|---|
| File | Android.log |
| Lines | 1,555,005 |
| Format | Android logcat (MM-DD HH:MM:SS.mmm PID TID LEVEL TAG: message) |
| Device | Huawei (EMUI) |
| Period | Single day — Dec 17, 2016 |
| Log levels | V (Verbose), D (Debug), I (Info), W (Warning), E (Error), F (Fatal) |
Each line is one log event, emitted by a process (PID) on a thread (TID) from a component (TAG). The message is free-form text.
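A regex parser for this line format can be sketched as follows. The pattern and the `parse_lines` helper are illustrative, not the notebook's exact code, and the sample line is synthetic:

```python
import re
import pandas as pd

# Matches "MM-DD HH:MM:SS.mmm PID TID LEVEL TAG: message".
# Field names are illustrative, not taken from the project's notebook.
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{2}-\d{2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<pid>\d+)\s+(?P<tid>\d+)\s+"
    r"(?P<level>[VDIWEF])\s+"
    r"(?P<tag>\S+?)\s*:\s"
    r"(?P<message>.*)$"
)

def parse_lines(lines):
    """Parse raw logcat lines into a DataFrame, skipping malformed rows."""
    rows = [m.groupdict() for line in lines if (m := LOG_PATTERN.match(line))]
    return pd.DataFrame(rows)

sample = ["12-17 19:31:36.263  1795  2115 D PowerManagerService: ready=true"]
df = parse_lines(sample)
```

Malformed lines simply fail the match and are dropped, which is the usual trade-off when parsing free-form logcat output.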
Android.log (1.55M lines)
│
▼
1. Parse with regex → structured DataFrame
│
▼
2. EDA — level distribution, time series, tag activity, message lengths, hourly heatmap
│
▼
3. Statistical Tests
├── Chi-Square — does log-level distribution shift over the day?
├── Welch T-Test — are error messages longer than info messages?
├── Mann-Whitney — non-parametric confirmation
├── One-Way ANOVA — message length across all five levels
└── KS Test — CDF shape: Debug vs Verbose
│
▼
4. Text Processing (50k sample)
└── lowercase → strip hex/addresses/paths/numbers → remove stopwords → Porter stem
│
▼
5. Vector Embeddings
├── TF-IDF (5,000 features, unigrams + bigrams, sublinear TF)
└── LSA via TruncatedSVD (30 components) → dense semantic vectors
│
▼
6. Clustering
├── K-Means — elbow + silhouette to pick k; clusters interpreted as subsystems
├── DBSCAN — density-based; noise points = structural anomalies
└── t-SNE — 2-D projection of embedding space
│
▼
7. Sentiment Analysis (TextBlob)
└── polarity + subjectivity per message; aggregated by level and over time
│
▼
8. Anomaly Detection
├── Isolation Forest — tree-based; fast on large data
├── Local Outlier Factor — density-based; catches local deviations
└── Ensemble (IF ∩ LOF) — intersection for high-confidence anomalies
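Step 5 above can be sketched as follows. The vectoriser parameters are the ones stated in the pipeline; the toy corpus and the clamping of the component count (needed for a corpus this small) are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Stand-in for the 50k cleaned, stemmed log messages
docs = [
    "power manag servic updat wakelock",
    "keyguard unlock screen displai",
    "network broadcast connect wifi state",
    "step counter health sensor updat",
]

# TF-IDF: 5,000 features, unigrams + bigrams, sublinear TF
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), sublinear_tf=True)
X = vectorizer.fit_transform(docs)

# LSA: the notebook uses 30 components; clamp so the toy corpus still works
n_comp = min(30, X.shape[0] - 1, X.shape[1] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=42)
embeddings = svd.fit_transform(X)

variance_kept = svd.explained_variance_ratio_.sum()
```

The dense `embeddings` matrix is what the clustering and anomaly-detection stages consume.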
| File | Purpose |
|---|---|
| log_anomaly_detection.ipynb | Primary deliverable — 33 cells, fully documented |
| app.py | Streamlit interactive dashboard |
| setup_notebook.py | Regenerates the notebook from source |
| requirements.txt | All Python dependencies |
| venv/ | Isolated Python 3.12 environment |
| Android.log | Raw dataset (192 MB) |
| # | Section | Key Output |
|---|---|---|
| 1 | Data Loading & Parsing | Regex parser for logcat format; 1.55M rows → DataFrame |
| 2 | EDA | Level pie/bar, events-per-minute time series with spike detection, top-20 tags, message length violin/box, hour × level heatmap |
| 3 | Statistical Tests | Chi-Square, T-Test, Mann-Whitney U, ANOVA, KS — each with plot and conclusion |
| 4 | Text Processing | Cleaning pipeline, word frequency, word cloud, bigram/trigram charts |
| 5 | Vector Embeddings | TF-IDF matrix, LSA variance curve, 30-d dense embeddings |
| 6 | Clustering | K-Means elbow + silhouette, cluster interpretation by top terms, DBSCAN noise analysis, t-SNE coloured by cluster and level |
| 7 | Sentiment Analysis | TextBlob polarity/subjectivity, distribution by level, polarity time series, polarity vs subjectivity scatter |
| 8 | Anomaly Detection | Isolation Forest score distribution, LOF scores, ensemble comparison pie, top-10 anomaly table |
| 9 | Summary | Findings table |
An interactive version of the full pipeline with live filters.
```bash
source venv/bin/activate
streamlit run app.py
```

Tabs:
| Tab | Content |
|---|---|
| Overview | Dataset metrics, level distribution |
| EDA | Time series, top tags, message length, heatmap — filter by log level |
| Statistical Tests | All 5 tests with live metric cards |
| Text & Embeddings | Word cloud, frequency chart, LSA variance |
| Clustering | k slider, eps slider, t-SNE projection |
| Sentiment | Polarity dashboard, time series |
| Anomaly Detection | Contamination slider, model comparison, top anomaly table |
| Log Explorer | Filter by level / tag / message substring, CSV download |
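The filtering logic behind the Log Explorer tab reduces to plain DataFrame masking. A minimal sketch, assuming a parsed DataFrame with `level`, `tag`, and `message` columns (the `explore_logs` helper name is hypothetical):

```python
import pandas as pd

def explore_logs(df, level=None, tag=None, substring=None):
    """Filter parsed logs by level, tag, and message substring (case-insensitive)."""
    out = df
    if level:
        out = out[out["level"] == level]
    if tag:
        out = out[out["tag"] == tag]
    if substring:
        out = out[out["message"].str.contains(substring, case=False, regex=False)]
    return out

logs = pd.DataFrame({
    "level": ["E", "I"],
    "tag": ["WifiService", "ActivityManager"],
    "message": ["connection failed", "activity resumed"],
})
errors = explore_logs(logs, level="E")
```

The CSV download is then just `out.to_csv(index=False)` handed to Streamlit's `st.download_button`.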
EDA
Debug (~50%) and Info (~30%) dominate the log. Error + Fatal together are less than 2% of all entries. Activity spikes occur in short bursts — likely screen-on events, notification cycles, and sync operations.
Statistical Tests
The Chi-Square test rejects the null hypothesis: log-level proportions shift measurably across the day. Error messages are significantly longer than Info messages (Welch's t-test, p ≪ 0.05), which makes sense: error messages tend to include stack traces, context strings, and diagnostic data. ANOVA confirms that at least one group mean differs across the five levels.
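The two headline tests reduce to standard SciPy calls. A minimal sketch on synthetic data (the message lengths and the contingency table below are illustrative, not values from Android.log):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic message lengths: errors longer and more variable than info
error_lens = rng.normal(120, 30, 500)
info_lens = rng.normal(60, 15, 2000)

# Welch's t-test: equal_var=False handles the unequal variances
t_stat, p_welch = stats.ttest_ind(error_lens, info_lens, equal_var=False)

# Chi-square on a time-bucket x level contingency table
# (rows: two parts of the day; columns: three log levels)
table = np.array([
    [500, 300, 40],
    [450, 380, 90],
])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
```

Rejecting at p < 0.05 in both tests corresponds to the notebook's conclusions: level proportions drift over the day, and error messages are longer on average.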
Text Processing
After stripping memory addresses, hex values, file paths, and numbers — the dominant noise in system logs — each message reduces to ~8–12 meaningful tokens. Top stems relate to power management, keyguard, display, and network operations.
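The cleaning pipeline can be sketched as a single function. The regexes and the tiny inline stopword set are illustrative; the notebook uses NLTK's full English stopword list with the same Porter stemmer:

```python
import re
from nltk.stem import PorterStemmer

# Abbreviated for the sketch; the notebook uses NLTK's full stopword list
STOPWORDS = {"the", "a", "an", "is", "to", "of", "at", "in", "for", "on"}
stemmer = PorterStemmer()

def clean_message(msg):
    """lowercase -> strip hex/paths/numbers -> drop stopwords -> Porter stem."""
    msg = msg.lower()
    msg = re.sub(r"0x[0-9a-f]+", " ", msg)   # hex values / memory addresses
    msg = re.sub(r"(/[\w.\-]+)+", " ", msg)  # file paths
    msg = re.sub(r"\d+", " ", msg)           # bare numbers
    tokens = re.findall(r"[a-z]+", msg)
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

tokens = clean_message("Battery level at 87, reading /sys/class/power 0x7fa3")
```

Order matters: hex values must be stripped before bare numbers, otherwise `0x7fa3` would leave an `x` fragment behind.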
Clustering
K-Means clusters map loosely onto Android subsystems: display/brightness, keyguard/security, network/broadcast, health/step-counting, and general system. DBSCAN marks ~5% of entries as noise — these are structurally isolated messages that don't belong to any dense region of the embedding space and are the strongest candidates for anomalies.
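The K-Means/DBSCAN split described above can be reproduced on synthetic data. The 2-D blobs below stand in for the 30-d LSA embeddings; the `eps`/`min_samples` values are illustrative, not the notebook's tuned settings:

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Two dense regions plus a handful of isolated points (structural outliers)
X = np.vstack([
    rng.normal(0, 0.3, (100, 2)),
    rng.normal(5, 0.3, (100, 2)),
    rng.uniform(-10, 10, (5, 2)),
])

# K-Means with k chosen by elbow + silhouette in the notebook; k=2 here
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
sil = silhouette_score(X, km.labels_)

# DBSCAN: points labelled -1 fall in no dense region -> structural anomalies
db = DBSCAN(eps=0.5, min_samples=5).fit(X)
noise_frac = float(np.mean(db.labels_ == -1))
```

K-Means assigns every point to a cluster, including the outliers; DBSCAN's noise label is what makes it useful as an anomaly signal.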
Sentiment
Most messages are neutral (as expected for machine-generated logs). Error and Fatal messages carry a slight negative polarity signal — this validates that TextBlob polarity has some discriminative signal even on technical text.
Anomaly Detection
Isolation Forest and LOF independently flag ~5% of entries. Their intersection (the ensemble) is a smaller, higher-confidence set. Top anomalies tend to be unusually long messages from rarely-active components — consistent with error conditions that generate verbose diagnostic output.
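The ensemble intersection can be sketched as follows, with synthetic 5-d vectors standing in for the LSA embeddings; the contamination rate matches the ~5% flagged in the findings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
# 500 "normal" points plus 10 injected far-away anomalies
X = np.vstack([rng.normal(0, 1, (500, 5)), rng.uniform(8, 12, (10, 5))])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
if_flags = iso.predict(X) == -1                 # -1 marks anomalies

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_flags = lof.fit_predict(X) == -1

# High-confidence set: flagged by BOTH detectors
ensemble = if_flags & lof_flags
```

The intersection is always at least as small as either individual set, which is why it trades recall for precision.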
```bash
# Create and activate virtual environment
python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Regenerate notebook (already done)
python setup_notebook.py

# Open notebook
jupyter notebook log_anomaly_detection.ipynb

# Launch dashboard
streamlit run app.py
```

pandas · numpy · matplotlib · seaborn · scipy
scikit-learn · nltk · textblob · wordcloud
streamlit · plotly · nbformat
Python 3.10+ required. Tested on Python 3.12.