Abhiix0/java-log-analytics-engine

Java Log Analytics Engine

Deterministic Real-Time Anomaly Detection with Concurrency Benchmarking


Overview

This repository presents a high-throughput Java-based log analytics engine designed to perform real-time anomaly detection using incremental statistical methods.

The objective of this project is to characterize the performance impact of different Java concurrency models on a deterministic, real-time statistical workload.

Rather than introducing machine learning complexity, the system uses incremental statistical techniques to isolate and measure concurrency behavior under controlled, reproducible conditions.


Core Research Question

What is the throughput and consistency impact of different Java concurrency models when executing identical real-time statistical anomaly detection workloads?

This project evaluates:

  • Single-threaded execution (baseline)
  • Thread pool–based multithreading (ExecutorService)
  • Parallel stream execution (ForkJoinPool)
  • Partitioned local processing (isolated statistical windows)

All engines execute the same deterministic anomaly detection pipeline. Only the execution strategy changes.
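The idea that only the execution strategy changes can be sketched as a small strategy interface. This is an illustrative sketch, not the repository's `ExecutionEngine` API: the names `EngineSketch`, `SequentialEngineSketch`, and `ParallelStreamEngineSketch` are hypothetical, and the detector is assumed to be a stateless predicate so that the parallel variant is thread-safe (the real pipeline keeps a stateful sliding window).

```java
import java.util.function.DoublePredicate;
import java.util.stream.DoubleStream;

/** Hypothetical strategy interface: the detector is fixed, only execution differs. */
interface EngineSketch {
    long countAnomalies(double[] values, DoublePredicate isAnomaly);
}

/** Baseline: a plain sequential loop over the values. */
final class SequentialEngineSketch implements EngineSketch {
    @Override
    public long countAnomalies(double[] values, DoublePredicate isAnomaly) {
        long count = 0;
        for (double v : values) {
            if (isAnomaly.test(v)) {
                count++;
            }
        }
        return count;
    }
}

/** Parallel variant: the same predicate, evaluated on a parallel stream. */
final class ParallelStreamEngineSketch implements EngineSketch {
    @Override
    public long countAnomalies(double[] values, DoublePredicate isAnomaly) {
        // Runs on the common ForkJoinPool; safe here only because the
        // predicate is assumed to be stateless.
        return DoubleStream.of(values).parallel().filter(isAnomaly).count();
    }
}
```

With a stateless predicate, both implementations return the same count for the same input, which is the property the benchmark relies on when comparing engines.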


System Architecture

Each log entry flows through a fixed processing pipeline:

Raw Log Entry
    ↓
LogRecord Object
    ↓
Sliding Window Buffer (bounded, circular)
    ↓
Incremental Statistics (Welford’s algorithm)
    ↓
Z-score Anomaly Detection

Key Properties

  • O(1) statistical updates per log
  • No full-history scans
  • Bounded memory usage
  • Deterministic output
  • No machine learning
  • No distributed systems
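A bounded window with O(1) updates can be sketched as a circular buffer that adjusts the running mean and sum of squared deviations whenever the oldest value is evicted. The class name `RollingWindowSketch`, the use of sample variance, and the replace-oldest update rule are assumptions made for illustration; the repository's `SlidingWindowBuffer` and `WindowedStatistics` may be organized differently.

```java
/** Hypothetical sketch: fixed-capacity circular buffer with O(1) rolling statistics. */
final class RollingWindowSketch {
    private final double[] buffer; // circular storage, so memory stays bounded
    private int next = 0;          // slot that will be overwritten next
    private int count = 0;         // number of values currently in the window
    private double mean = 0.0;
    private double m2 = 0.0;       // sum of squared deviations from the mean

    RollingWindowSketch(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        buffer = new double[capacity];
    }

    /** Adds a value, evicting the oldest one once the window is full. */
    void add(double x) {
        if (count < buffer.length) {
            // Window still filling: standard Welford update.
            count++;
            double delta = x - mean;
            mean += delta / count;
            m2 += delta * (x - mean);
        } else {
            // Window full: swap the oldest value for the new one in O(1).
            double old = buffer[next];
            double oldMean = mean;
            mean += (x - old) / count;
            m2 += (x - old) * (x - mean + old - oldMean);
            if (m2 < 0.0) {
                m2 = 0.0; // guard against rounding slightly below zero
            }
        }
        buffer[next] = x;
        next = (next + 1) % buffer.length;
    }

    int size() { return count; }

    double mean() { return mean; }

    /** Sample variance of the values currently in the window. */
    double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }
}
```

No step iterates over the buffer, so each update costs the same regardless of window size or how many logs have been processed.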

Statistical Model

Anomalies are detected using a Z-score–based approach:

Z = (value - mean) / standard_deviation

If:

|Z| > threshold

the log entry is classified as an anomaly.

Statistics are maintained incrementally using Welford’s algorithm to ensure:

  • Numerical stability
  • Constant-time updates
  • No need to retain full history
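The Z-score rule combined with Welford's update can be sketched as follows. This is a minimal illustration rather than the repository's `IncrementalStatistics` or `AnomalyDetector` code: the class name `WelfordZScoreSketch`, the use of sample standard deviation, and the choice to return a Z-score of 0 when the deviation is zero are all assumptions.

```java
/** Hypothetical sketch: Welford's running mean/variance plus a Z-score check. */
final class WelfordZScoreSketch {
    private long n = 0;        // number of values seen
    private double mean = 0.0; // running mean
    private double m2 = 0.0;   // running sum of squared deviations from the mean

    /** Folds one value into the running statistics in constant time. */
    void update(double x) {
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the already-updated mean
    }

    double mean() { return mean; }

    /** Sample standard deviation; 0 until at least two values are seen. */
    double stdDev() { return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0; }

    /** Z = (value - mean) / standard_deviation, or 0 if the deviation is 0. */
    double zScore(double x) {
        double sd = stdDev();
        return sd == 0.0 ? 0.0 : (x - mean) / sd;
    }

    /** A value is an anomaly when |Z| exceeds the threshold. */
    boolean isAnomaly(double x, double threshold) {
        return Math.abs(zScore(x)) > threshold;
    }
}
```

Whether a value is scored before or after it is folded into the statistics is a design choice; this sketch leaves the order to the caller.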

Concurrency Models Compared

1. SingleThreadEngine

  • Sequential processing
  • Baseline performance reference
  • Minimal overhead

2. ThreadPoolEngine

  • Uses ExecutorService
  • Configurable worker threads
  • Shared sliding window with synchronization

3. ParallelStreamEngine

  • Uses parallelStream()
  • Backed by ForkJoinPool
  • High-level implicit parallelism

4. PartitionedLocalEngine

  • Dataset partitioned across threads
  • Each partition maintains its own sliding window
  • Results merged post-processing

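Note: Because each partition computes its mean and standard deviation from its own window only, the partitioned model's anomaly counts can differ slightly from those of the engines that share one window (3232 versus 3229 in the example benchmark output). This is an intentional trade-off and part of the performance analysis.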
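The partitioned model can be sketched as one task per contiguous slice of the dataset, each with statistics local to that slice, followed by a merge of the per-partition counts. This is a simplified illustration under stated assumptions: `PartitionedSketch` is a hypothetical name, it uses unbounded running statistics instead of a sliding window, and it scores each value before folding it into the statistics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: per-partition statistics, merged after processing. */
final class PartitionedSketch {

    /** Counts anomalies in data[from, to) using statistics local to that range. */
    static int countAnomalies(double[] data, int from, int to, double threshold) {
        long n = 0;
        double mean = 0.0;
        double m2 = 0.0;
        int anomalies = 0;
        for (int i = from; i < to; i++) {
            double x = data[i];
            // Score against the statistics seen so far in this partition.
            if (n > 1) {
                double sd = Math.sqrt(m2 / (n - 1));
                if (sd > 0.0 && Math.abs((x - mean) / sd) > threshold) {
                    anomalies++;
                }
            }
            // Then fold the value in (Welford update).
            n++;
            double delta = x - mean;
            mean += delta / n;
            m2 += delta * (x - mean);
        }
        return anomalies;
    }

    /** Splits the data into contiguous partitions and sums the per-partition counts. */
    static int countPartitioned(double[] data, int partitions, double threshold) {
        ExecutorService pool = Executors.newFixedThreadPool(partitions);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            int chunk = (data.length + partitions - 1) / partitions;
            for (int p = 0; p < partitions; p++) {
                final int from = Math.min(data.length, p * chunk);
                final int to = Math.min(data.length, from + chunk);
                Callable<Integer> task = () -> countAnomalies(data, from, to, threshold);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> f : futures) {
                total += f.get(); // merge step: add up the partition results
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("partitioned run failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because no state is shared between tasks, no synchronization is needed, and the result is repeatable for a fixed partition count. Each partition restarts its statistics from scratch, though, which is why its count can differ from a single sequential pass.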


Determinism & Reproducibility

Reproducibility is a core design goal.

  • Synthetic datasets generated using a fixed random seed (42)
  • Dataset created once per execution
  • All engines process identical data
  • Consistency verification mode checks that each engine's anomaly count is identical across repeated runs

Deterministic guarantees ensure that performance comparisons are not influenced by data variation, only execution strategy.

To verify determinism:

java -cp src Main consistency 1000000

If anomaly counts remain identical across executions, determinism is verified.
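The role of the fixed seed can be illustrated with `java.util.Random`, which is documented to produce the same sequence for the same seed and the same sequence of calls. The class name `SeededDatasetSketch` and the Gaussian metric (mean 100, standard deviation 15) are assumptions for illustration; the repository's `SyntheticDataGenerator` may generate different fields and distributions.

```java
import java.util.Random;

/** Hypothetical sketch: a seeded generator yields the same dataset on every run. */
final class SeededDatasetSketch {

    /** Generates `size` synthetic metric values from the given seed. */
    static double[] generate(int size, long seed) {
        Random rng = new Random(seed);
        double[] values = new double[size];
        for (int i = 0; i < size; i++) {
            // Assumed latency-like metric: mean 100, standard deviation 15.
            values[i] = 100.0 + 15.0 * rng.nextGaussian();
        }
        return values;
    }
}
```

Calling `generate(1000, 42L)` twice returns element-wise identical arrays, so every engine can be handed exactly the same input.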


Benchmark Mode

To compare engine performance:

java -cp src Main benchmark 1000000

Example output:

================ ENGINE BENCHMARK =================

Dataset Size: 1,000,000
Random Seed: 42

----------------------------------------------------------
Engine                  Time (ms)    Anomalies
----------------------------------------------------------
SingleThreadEngine      38           3229
ThreadPoolEngine(4)     37           3229
ParallelStreamEngine    30           3229
PartitionedLocal(4)     54           3232
----------------------------------------------------------

Fastest Engine: ParallelStreamEngine
Slowest Engine: PartitionedLocal(4)

Deterministic anomaly results verified.

This mode ensures:

  • Fair comparison
  • Single dataset generation
  • Structured performance summary
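  • Deterministic anomaly counts per engine (PartitionedLocal may report a different count from the other engines, as with 3232 versus 3229 above)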

Performance Metrics

The following metrics are evaluated:

  • Total execution time (ms)
  • Throughput (logs processed per second)
  • Anomaly count consistency
  • Relative speedup vs single-thread baseline

No engine-specific tuning is applied. All results reflect raw execution behavior under identical workloads.
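Throughput and relative speedup follow directly from the measured times. The helper below is a hypothetical sketch (`BenchmarkMetricsSketch` is not part of the repository) that shows the arithmetic, using the figures from the example benchmark output.

```java
/** Hypothetical sketch: derived benchmark metrics from raw timings. */
final class BenchmarkMetricsSketch {

    /** Logs processed per second, given the log count and elapsed milliseconds. */
    static double throughputPerSecond(long logs, double elapsedMs) {
        return logs / (elapsedMs / 1000.0);
    }

    /** Speedup relative to the baseline; values above 1.0 mean faster than baseline. */
    static double speedup(double baselineMs, double engineMs) {
        return baselineMs / engineMs;
    }
}
```

Applied to the example run, 1,000,000 logs in 38 ms is roughly 26.3 million logs per second for the single-threaded baseline. The parallel stream engine's 30 ms gives a speedup of about 1.27, and the partitioned engine's 54 ms gives about 0.70, meaning it was slower than the baseline in that run.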


Usage

Compile

javac src/*.java

Run

java -cp src Main

Available Modes

single
threadpool
parallel
partitioned
consistency
benchmark

Examples

java -cp src Main single 100000
java -cp src Main threadpool 1000000
java -cp src Main benchmark 1000000
java -cp src Main consistency 500000

If the dataset size is omitted, a default value is used.


Experimental Setup

Datasets tested:

  • 50,000 logs
  • 100,000 logs
  • 1,000,000+ logs

All experiments use:

  • Identical hardware
  • Same JVM
  • Same random seed
  • Same anomaly detection threshold
  • Identical workload

Only the concurrency model changes.


Key Observations

Based on benchmark runs:

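  • ParallelStreamEngine is often the fastest on large datasets (30 ms versus 38 ms for the single-threaded baseline in the 1,000,000-log example run)
  • ThreadPoolEngine performs close to the baseline (37 ms versus 38 ms in the same run) while giving explicit control over the worker count
  • Single-threaded execution remains competitive at smaller dataset sizes
  • PartitionedLocalEngine trades statistical consistency for potential scalability gains; in the example run it was the slowest engine (54 ms) and reported a different anomaly count (3232 versus 3229)
  • Deterministic anomaly detection is feasible at high throughput without ML overhead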

Project Structure

src/
  Main.java
  SyntheticDataGenerator.java
  ConsistencyChecker.java
  EngineBenchmarkRunner.java
  ExecutionEngine.java
  SingleThreadEngine.java
  ThreadPoolEngine.java
  ParallelStreamEngine.java
  PartitionedLocalEngine.java
  SlidingWindowBuffer.java
  WindowedStatistics.java
  IncrementalStatistics.java
  AnomalyDetector.java
  LogAnomalyEngine.java
  ...

The design separates:

  • Data representation
  • Statistical logic
  • Execution strategy
  • Benchmarking utilities

What This Project Is

  • Systems engineering study
  • Concurrency performance comparison
  • Deterministic real-time statistical processing
  • Reproducible benchmarking framework

What This Project Is Not

  • Machine learning research
  • Distributed systems framework
  • Big data platform
  • Production log aggregation service

Research Framing

This project contributes:

  • A reproducible experimental framework for evaluating Java concurrency models
  • Evidence that incremental statistics enable real-time anomaly detection
  • Insight into performance trade-offs between abstraction and manual thread control
  • Analysis of consistency vs scalability trade-offs in partitioned processing

Conclusion

This repository provides a controlled experimental environment for studying concurrency behavior in Java under statistically grounded real-time workloads.

It demonstrates that meaningful performance insights can be obtained without distributed infrastructure or machine learning complexity, provided experimental rigor and determinism are enforced.


Future Work

Potential extensions include:

  • Support for streaming input sources
  • Integration with real log datasets
  • Adaptive threshold mechanisms
  • Visualization dashboards
  • Distributed execution experiments

License

This project is intended for academic and educational use.


Author

Abhinav Sai Gunnampalli (Abhiix0)
Yashwanth Abhishek Guvvala (Yashabhi0)

About

A high-performance Java engine for real-time log processing and statistical anomaly detection.
