This repository presents a high-throughput Java-based log analytics engine designed to perform real-time anomaly detection using incremental statistical methods.
The objective of this project is to characterize the performance impact of different Java concurrency models on a deterministic, real-time statistical workload.
Rather than introducing machine learning complexity, the system uses incremental statistical techniques to isolate and measure concurrency behavior under controlled, reproducible conditions.
The research question is: what is the throughput and consistency impact of different Java concurrency models when they execute identical real-time statistical anomaly detection workloads?
This project evaluates:
- Single-threaded execution (baseline)
- Thread pool–based multithreading (ExecutorService)
- Parallel stream execution (ForkJoinPool)
- Partitioned local processing (isolated statistical windows)
All engines execute the same deterministic anomaly detection pipeline. Only the execution strategy changes.
Each log entry flows through a fixed processing pipeline:
Raw Log Entry
↓
LogRecord Object
↓
Sliding Window Buffer (bounded, circular)
↓
Incremental Statistics (Welford’s algorithm)
↓
Z-score Anomaly Detection
- O(1) statistical updates per log
- No full-history scans
- Bounded memory usage
- Deterministic output
- No machine learning
- No distributed systems
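The bounded, circular window in the pipeline above can be illustrated with a fixed-size array that overwrites its oldest slot. This is a minimal sketch, not the repository's `SlidingWindowBuffer`; the class name `CircularWindowSketch` and its methods are hypothetical.

```java
// Hypothetical sketch of a bounded circular window; the real SlidingWindowBuffer may differ.
final class CircularWindowSketch {
    private final double[] slots;
    private int next = 0; // index of the slot that will be overwritten next
    private int size = 0; // number of values currently held

    CircularWindowSketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        slots = new double[capacity];
    }

    /** Adds a value in O(1); returns the evicted value, or NaN if the window was not yet full. */
    double add(double value) {
        double evicted = Double.NaN;
        if (size == slots.length) {
            evicted = slots[next]; // when full, the next slot holds the oldest value
        } else {
            size++;
        }
        slots[next] = value;
        next = (next + 1) % slots.length;
        return evicted;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        CircularWindowSketch window = new CircularWindowSketch(3);
        for (double v : new double[] {1, 2, 3, 4}) {
            System.out.println("added " + v + ", evicted " + window.add(v));
        }
    }
}
```

Returning the evicted value lets a caller adjust running statistics without rescanning the window, which keeps memory bounded and updates constant-time.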
Anomalies are detected using a Z-score–based approach:
Z = (value - mean) / standard_deviation
If:
|Z| > threshold
the log entry is classified as an anomaly.
Statistics are maintained incrementally using Welford’s algorithm to ensure:
- Numerical stability
- Constant-time updates
- No need to retain full history
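The Z-score rule and Welford's update can be sketched together as follows. This is an illustrative, add-only version under stated assumptions: it uses the sample standard deviation and treats a zero deviation as "no anomaly", neither of which is confirmed by this document. The class name `WelfordSketch` is hypothetical, and how the repository combines the update with window eviction is not shown here.

```java
// Hypothetical sketch of Welford's running mean/variance with a Z-score check.
final class WelfordSketch {
    private long count;
    private double mean;
    private double m2; // running sum of squared deviations from the mean

    /** O(1) update; no history is retained. */
    void add(double x) {
        count++;
        double delta = x - mean;
        mean += delta / count;
        m2 += delta * (x - mean); // uses the already-updated mean
    }

    double mean() {
        return mean;
    }

    /** Sample standard deviation (assumption); 0 until two values have been seen. */
    double stdDev() {
        return count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
    }

    /** Z = (value - mean) / standard_deviation; returns 0 when the deviation is 0. */
    double zScore(double x) {
        double sd = stdDev();
        return sd == 0.0 ? 0.0 : (x - mean) / sd;
    }

    boolean isAnomaly(double x, double threshold) {
        return Math.abs(zScore(x)) > threshold;
    }

    public static void main(String[] args) {
        WelfordSketch stats = new WelfordSketch();
        for (double v : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            stats.add(v);
        }
        System.out.println("mean=" + stats.mean() + " stdDev=" + stats.stdDev());
        System.out.println("20 is anomaly at threshold 3: " + stats.isAnomaly(20, 3.0));
    }
}
```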
SingleThreadEngine:
- Sequential processing
- Baseline performance reference
- Minimal overhead
ThreadPoolEngine:
- Uses ExecutorService
- Configurable worker threads
- Shared sliding window with synchronization
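A thread pool that updates shared state under a lock might look roughly like the sketch below. It is not the repository's `ThreadPoolEngine`: the class name `ThreadPoolSketch`, the chunking scheme, and the use of a running mean in place of a full sliding window are all assumptions made to keep the example short. How the actual engine orders updates so that anomaly counts match the baseline is not shown here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: worker threads share one running statistic guarded by a lock.
final class ThreadPoolSketch {
    private final Object lock = new Object();
    private long count;
    private double mean;

    private void update(double x) {
        synchronized (lock) { // every update to the shared state takes the same lock
            count++;
            mean += (x - mean) / count;
        }
    }

    long count() {
        synchronized (lock) { return count; }
    }

    double mean() {
        synchronized (lock) { return mean; }
    }

    void process(double[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            int chunk = (data.length + workers - 1) / workers;
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(data.length, start + chunk);
                tasks.add(() -> {
                    for (int i = from; i < to; i++) {
                        update(data[i]);
                    }
                    return null;
                });
            }
            pool.invokeAll(tasks); // blocks until every chunk has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = new double[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        ThreadPoolSketch engine = new ThreadPoolSketch();
        engine.process(data, 4);
        System.out.println("count=" + engine.count() + " mean=" + engine.mean());
    }
}
```

Because every update contends for one lock, the synchronization cost can offset the benefit of extra threads, which is one reason a shared-window design may run close to the single-threaded baseline.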
ParallelStreamEngine:
- Uses parallelStream()
- Backed by ForkJoinPool
- High-level implicit parallelism
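For comparison, the sketch below shows the general shape of `parallelStream()` usage with a stateless predicate. It assumes the mean and standard deviation are fixed up front, which simplifies away the sliding window and is not how the repository's pipeline is described. The class name `ParallelStreamSketch` and the method `countAnomalies` are hypothetical.

```java
import java.util.List;

// Hypothetical sketch: the common ForkJoinPool splits the work implicitly.
final class ParallelStreamSketch {
    /** Counts values whose |Z| exceeds the threshold, using fixed statistics (assumption). */
    static long countAnomalies(List<Double> values, double mean, double stdDev, double threshold) {
        return values.parallelStream()
                .filter(v -> Math.abs((v - mean) / stdDev) > threshold) // stateless, so safe in parallel
                .count();
    }

    public static void main(String[] args) {
        List<Double> values = List.of(10.0, 10.5, 9.5, 50.0, 10.2);
        System.out.println(countAnomalies(values, 10.0, 1.0, 3.0)); // prints 1
    }
}
```

The predicate reads no shared mutable state, so the result does not depend on how the stream is split across threads.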
PartitionedLocalEngine:
- Dataset partitioned across threads
- Each partition maintains its own sliding window
- Results merged post-processing
Note: The partitioned model may introduce statistical drift due to isolated windows. This is an intentional trade-off and part of the performance analysis.
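The drift can be reproduced on a toy input. The sketch below is an assumption-laden illustration rather than the repository's `PartitionedLocalEngine`: each partition keeps its own add-only Welford state (no window eviction), and the merge step simply sums per-partition anomaly counts. The names `PartitionedSketch`, `detect`, and `detectPartitioned` are hypothetical.

```java
import java.util.stream.IntStream;

// Hypothetical sketch: isolated per-partition statistics, merged by summing counts.
final class PartitionedSketch {
    /** Scans values[from, to) with local running statistics and counts |Z| > threshold. */
    static long detect(double[] values, int from, int to, double threshold) {
        long n = 0;
        long anomalies = 0;
        double mean = 0.0;
        double m2 = 0.0;
        for (int i = from; i < to; i++) {
            double x = values[i];
            if (n > 1) { // need two prior values for a sample deviation
                double sd = Math.sqrt(m2 / (n - 1));
                if (sd > 0 && Math.abs(x - mean) / sd > threshold) {
                    anomalies++;
                }
            }
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return anomalies;
    }

    /** Splits the data into contiguous partitions; no state is shared between them. */
    static long detectPartitioned(double[] values, int partitions, double threshold) {
        if (partitions <= 0) {
            throw new IllegalArgumentException("partitions must be positive");
        }
        int chunk = (values.length + partitions - 1) / partitions;
        return IntStream.range(0, partitions)
                .parallel()
                .mapToLong(p -> {
                    int from = Math.min(values.length, p * chunk);
                    int to = Math.min(values.length, from + chunk);
                    return detect(values, from, to, threshold);
                })
                .sum();
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 1, 100, 1, 2, 1, 100};
        System.out.println("single pass: " + detect(data, 0, data.length, 3.0));
        System.out.println("two partitions: " + detectPartitioned(data, 2, 3.0));
    }
}
```

On this input the single pass flags one value while two partitions flag two: the second partition starts with fresh statistics and has not yet seen the earlier spike, so it judges the second spike against a narrower baseline.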
Reproducibility is a core design goal.
- Synthetic datasets generated using a fixed random seed (42)
- Dataset created once per execution
- All engines process identical data
- Consistency verification mode checks that anomaly counts match across runs
Because every engine processes identical data, differences in measured performance reflect the execution strategy rather than data variation.
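Seeded generation can be sketched with `java.util.Random`, which produces the same sequence for the same seed. This is not the repository's `SyntheticDataGenerator`; the class name `SeededDataSketch`, the Gaussian shape, and the scale constants are assumptions chosen only for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical sketch of reproducible synthetic data generation.
final class SeededDataSketch {
    static final long SEED = 42L; // the seed value named in this document

    /** Generates `size` values; the distribution and constants are assumptions. */
    static double[] generate(int size, long seed) {
        Random rng = new Random(seed);
        double[] values = new double[size];
        for (int i = 0; i < size; i++) {
            values[i] = 100.0 + 15.0 * rng.nextGaussian();
        }
        return values;
    }

    public static void main(String[] args) {
        double[] first = generate(1_000, SEED);
        double[] second = generate(1_000, SEED);
        System.out.println("identical: " + Arrays.equals(first, second)); // prints "identical: true"
    }
}
```

Generating the array once and passing the same instance to every engine avoids any dependence on generation order between runs.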
To verify determinism:
java -cp src Main consistency 1000000
If anomaly counts remain identical across executions, determinism is verified.
To compare engine performance:
java -cp src Main benchmark 1000000
Example output:
================ ENGINE BENCHMARK =================
Dataset Size: 1,000,000
Random Seed: 42
----------------------------------------------------------
Engine                    Time (ms)    Anomalies
----------------------------------------------------------
SingleThreadEngine        38           3229
ThreadPoolEngine(4)       37           3229
ParallelStreamEngine      30           3229
PartitionedLocal(4)       54           3232
----------------------------------------------------------
Fastest Engine: ParallelStreamEngine
Slowest Engine: PartitionedLocal(4)
Deterministic anomaly results verified.
This mode ensures:
- Fair comparison
- Single dataset generation
- Structured performance summary
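- Deterministic anomaly counts (repeatable for each engine; the partitioned engine may report a slightly different count than the others, as in the example output)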
The following metrics are evaluated:
- Total execution time (ms)
- Throughput (logs processed per second)
- Anomaly count consistency
- Relative speedup vs single-thread baseline
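Throughput and relative speedup follow directly from the measured times. The helper below is a small sketch with hypothetical names (`BenchmarkMetricsSketch`, `throughputPerSecond`, `speedup`); the numbers in `main` are taken from the example benchmark output in this document.

```java
// Hypothetical sketch of the derived benchmark metrics.
final class BenchmarkMetricsSketch {
    /** Logs processed per second, given a log count and elapsed milliseconds. */
    static double throughputPerSecond(long logs, double elapsedMs) {
        return logs / (elapsedMs / 1000.0);
    }

    /** Speedup relative to the single-thread baseline; values above 1 mean faster. */
    static double speedup(double baselineMs, double engineMs) {
        return baselineMs / engineMs;
    }

    public static void main(String[] args) {
        // ParallelStreamEngine row: 1,000,000 logs in 30 ms; baseline took 38 ms.
        System.out.printf("throughput: %.0f logs/s%n", throughputPerSecond(1_000_000, 30));
        System.out.printf("speedup: %.2fx%n", speedup(38, 30));
    }
}
```

At these run times (tens of milliseconds), timer resolution and JVM warm-up can account for a noticeable share of the difference between engines, so single-run figures are best read as indicative.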
No synthetic optimizations are applied between engines. All results reflect raw execution behavior under identical workloads.
javac src/*.java
java -cp src Main <mode> [dataset_size]
The mode is one of:
- single
- threadpool
- parallel
- partitioned
- consistency
- benchmark
java -cp src Main single 100000
java -cp src Main threadpool 1000000
java -cp src Main benchmark 1000000
java -cp src Main consistency 500000
If dataset size is omitted, a default value is used.
Datasets tested:
- 50,000 logs
- 100,000 logs
- 1,000,000+ logs
All experiments use:
- Identical hardware
- Same JVM
- Same random seed
- Same anomaly detection threshold
- Identical workload
Only the concurrency model changes.
Based on benchmark runs:
- ParallelStreamEngine often performs best under high load
- ThreadPoolEngine performs comparably with explicit control
- Single-threaded execution remains competitive at smaller dataset sizes
- PartitionedLocalEngine trades statistical consistency for potential scalability gains
- Deterministic anomaly detection is feasible at high throughput without ML overhead
src/
Main.java
SyntheticDataGenerator.java
ConsistencyChecker.java
EngineBenchmarkRunner.java
ExecutionEngine.java
SingleThreadEngine.java
ThreadPoolEngine.java
ParallelStreamEngine.java
PartitionedLocalEngine.java
SlidingWindowBuffer.java
WindowedStatistics.java
IncrementalStatistics.java
AnomalyDetector.java
LogAnomalyEngine.java
...
The design separates:
- Data representation
- Statistical logic
- Execution strategy
- Benchmarking utilities
This project is:
- Systems engineering study
- Concurrency performance comparison
- Deterministic real-time statistical processing
- Reproducible benchmarking framework
This project is not:
- Machine learning research
- Distributed systems framework
- Big data platform
- Production log aggregation service
This project contributes:
- A reproducible experimental framework for evaluating Java concurrency models
- Evidence that incremental statistics enable real-time anomaly detection
- Insight into performance trade-offs between abstraction and manual thread control
- Analysis of consistency vs scalability trade-offs in partitioned processing
This repository provides a controlled experimental environment for studying concurrency behavior in Java under statistically grounded real-time workloads.
It demonstrates that meaningful performance insights can be obtained without distributed infrastructure or machine learning complexity, provided that experimental rigor and determinism are enforced.
Potential extensions include:
- Support for streaming input sources
- Integration with real log datasets
- Adaptive threshold mechanisms
- Visualization dashboards
- Distributed execution experiments
This project is intended for academic and educational use.
- Abhinav Sai Gunnampalli (Abhiix0)
- Yashwanth Abhishek Guvvala (Yashabhi0)