Hey! I’m Aman. I built this project to demonstrate how a production-grade data pipeline handles massive workloads while staying resilient. Instead of a simple script, I engineered a decoupled microservices architecture so that even when individual components crash, your data remains safe in the "Vault."
I designed this with a "fail-safe" mindset. Each component is isolated so that a crash in one node doesn't halt the entire pipeline.
```
[Streamlit UI] ──(HTTP POST)──> [FastAPI Gateway]
                                        │
                          ┌─────────────┴─────────────┐
                          ▼                           ▼
                    [Redis Cache]               [Redis Broker]
                 (Rate Limiter/429)          (Task Distribution)
                                                      │
                                                      ▼
                                           [Celery Workers (x3)]
                                          (Distributed Execution)
                          ┌───────────────────────────┴───────────────┐
                          ▼                                           ▼
                 [PostgreSQL Vault]                          [Auto-Retry Logic]
                (Success/DLQ Storage)                      (Exponential Backoff)
                                                                      │
                                                                      ▼
                                                              [Discord Alerts]
                                                            (Fatal Failure Hook)
```
- The Entry Point: Jobs are submitted via the Streamlit dashboard or the FastAPI Swagger UI.
- Safety First: I’ve integrated Redis-backed, IP-based rate limiting to prevent API abuse.
- Fire & Forget: FastAPI assigns a JOB-ID and pushes the task to Redis, returning a response instantly (see the gateway sketch after this list).
- Heavy Lifting: 3 scalable Celery workers process tasks in parallel to maximize throughput.
- Handling Chaos: If a task hits a network snag, the retry logic triggers an Exponential Backoff strategy (sketched under the key features further below).
- The Vault (DLQ): Results are saved in PostgreSQL; failures are routed to a Dead Letter Queue and announced on Discord.
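To make the flow above concrete, here is a minimal sketch of the gateway side. It is written for this walkthrough rather than copied from `api/main.py`: the `/jobs` route, the `worker.tasks.process_job` task name, and the rate-limit numbers are my assumptions for the example.

```python
# gateway_sketch.py -- illustrative only; see api/main.py for the real routes.
import uuid

import redis
from celery import Celery
from fastapi import FastAPI, HTTPException, Request

app = FastAPI()
cache = redis.Redis(host="redis", port=6379, decode_responses=True)
celery_app = Celery("worker", broker="redis://redis:6379/0")

RATE_LIMIT = 10      # assumed cap: max requests per window per IP
WINDOW_SECONDS = 60  # assumed window length

@app.post("/jobs")
async def submit_job(request: Request, payload: dict):
    # IP-based rate limiting: count requests per client IP in a fixed window.
    key = f"rate:{request.client.host}"
    hits = cache.incr(key)
    if hits == 1:
        cache.expire(key, WINDOW_SECONDS)  # first hit opens the window
    if hits > RATE_LIMIT:
        raise HTTPException(status_code=429, detail="Too Many Requests")

    # Fire & forget: assign a JOB-ID, hand the task to the Redis broker,
    # and return immediately without waiting for a worker.
    job_id = str(uuid.uuid4())
    celery_app.send_task("worker.tasks.process_job", args=[job_id, payload])
    return {"job_id": job_id, "status": "queued"}
```

The `INCR` + `EXPIRE` pair is the classic fixed-window limiter: the first request opens the window, and every request past the cap gets a 429.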
Note: These are live captures from my development environment showing the system under load.
The Streamlit dashboard for real-time monitoring of processing loads and system health metrics.
Protection in action: this is what happens when the Redis-backed request limit is breached.
Monitoring 3 concurrent worker nodes handling parallel execution to maximize throughput.
The automatic retry sequence (2s -> 4s -> 8s) before a Fatal Failure is logged.
A peek into the persistent storage where the system tracks every SUCCESS and FAILURE.
An automated Celery Beat daemon extracting, transforming, and loading live crypto prices into the Vault.
A direct SQL audit proving that every failed job is preserved for manual recovery.
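The ETL capture above is driven by Celery Beat. As a hedged sketch of what that schedule can look like (the task name and the 60-second interval are illustrative; the real definitions live in `worker/tasks.py`):

```python
# beat_sketch.py -- illustrative Celery Beat schedule, not the repo's config.
from celery import Celery

celery_app = Celery("worker", broker="redis://redis:6379/0")

celery_app.conf.beat_schedule = {
    "load-crypto-prices": {
        "task": "worker.tasks.fetch_crypto_prices",  # hypothetical task name
        "schedule": 60.0,  # run the ETL every minute
    },
}
```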
distributed-data-pipeline/
├── api/ # FastAPI Gateway & Rate Limiting
│ ├── database.py # SQLAlchemy Models & DB Connection
│ ├── main.py # Entry point & API Routes
│ └── requirements.txt # API-specific dependencies
├── assets/ # Live System Screenshots for documentation
├── ui/ # Monitoring Dashboard (Streamlit)
│ ├── app.py # Real-time Metrics & Vault UI
│ └── requirements.txt # UI-specific dependencies
├── worker/ # Distributed Task Execution (Celery)
│ ├── database.py # SQLAlchemy Models for worker node
│ ├── tasks.py # Heavy logic, Backoff & Chaos Testing
│ └── requirements.txt # Worker-specific dependencies
├── .env.example # Template for environment variables & Webhooks
├── docker-compose.yml # Multi-container Orchestration
├── LICENSE # MIT License
└── README.md # Project documentation
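For a rough idea of what the Vault tables in `database.py` can look like, here is a hedged SQLAlchemy sketch; the actual table, column names, and connection string in this repo may differ.

```python
# vault_sketch.py -- illustrative schema; credentials are placeholders.
from sqlalchemy import Column, DateTime, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class JobResult(Base):
    """One row per job; failed jobs stay here too, flagged for the DLQ."""
    __tablename__ = "job_results"

    id = Column(Integer, primary_key=True)
    job_id = Column(String, unique=True, index=True)
    status = Column(String)   # "SUCCESS" or "FAILURE"
    payload = Column(String)  # raw task payload, kept for manual recovery
    created_at = Column(DateTime, server_default=func.now())

# Lazily configured engine; create_all() runs once the database is reachable.
engine = create_engine("postgresql://user:pass@postgres:5432/vault")
Base.metadata.create_all(engine)
```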
- Fault Tolerance: Uses an Exponential Backoff formula (2^n seconds) to avoid overwhelming downstream services (see the sketch after this list).
- Chaos Engineering: Built-in logic simulates a 50% failure rate to prove the system's resilience.
- Zero Data Loss: All failed tasks are captured in the Dead Letter Queue (PostgreSQL).
- Sub-Millisecond Caching: Dashboard metrics are served from Redis RAM for extreme speed.
- Scalable by Design: Scale from 1 to 100+ workers by adjusting a single Docker parameter.
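Here is the retry/chaos sketch referenced above. It is a hedged illustration of the pattern, not a copy of `worker/tasks.py`: the task name, the 50% coin flip, and the webhook message format are my assumptions.

```python
# tasks_sketch.py -- illustrative retry/chaos pattern; see worker/tasks.py.
import os
import random

import requests
from celery import Celery

celery_app = Celery("worker", broker="redis://redis:6379/0")
WEBHOOK_URL = os.environ.get("DISCORD_WEBHOOK_URL", "")

@celery_app.task(bind=True, max_retries=3)
def process_job(self, job_id: str, payload: dict):
    try:
        # Chaos engineering: pretend a dependency fails ~50% of the time.
        if random.random() < 0.5:
            raise ConnectionError("simulated network snag")
        return {"job_id": job_id, "status": "SUCCESS"}
    except ConnectionError as exc:
        if self.request.retries >= self.max_retries:
            # Fatal failure: the project routes the job to the DLQ and pings
            # a Discord webhook; this message format is illustrative.
            if WEBHOOK_URL:
                requests.post(
                    WEBHOOK_URL,
                    json={"content": f"FATAL: job {job_id} exhausted retries"},
                    timeout=5,
                )
            raise
        # Exponential backoff: wait 2^(n+1) seconds -> 2s, 4s, 8s.
        raise self.retry(exc=exc, countdown=2 ** (self.request.retries + 1))
```

With `max_retries=3`, the delays come out to 2s, 4s, and 8s, matching the retry sequence shown in the screenshots, before the fatal-failure branch fires.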
- Cumulative Metrics: All system metrics (Total Jobs Processed, Total Data Handled) are intentionally cumulative to reflect continuous system load and real-world pipeline behavior. Reset functionality can be added if required, but data retention is prioritized for observability and debugging.
- Data Integrity Handling: The system ignores invalid or negative data inputs during processing, so only meaningful, positive workloads are executed and reflected in system metrics (see the sketch after this list). This prevents distortion of performance insights and keeps the analytics consistent.
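As a rough sketch of how these two design notes combine (the Redis key names here are illustrative, not necessarily what `ui/app.py` reads):

```python
# metrics_sketch.py -- illustrative cumulative-counter pattern.
import redis

cache = redis.Redis(host="redis", port=6379, decode_responses=True)

def record_job(bytes_processed: int) -> None:
    # Data integrity: skip invalid or negative workloads so they never
    # distort the cumulative counters.
    if bytes_processed <= 0:
        return
    # Cumulative metrics: plain INCR/INCRBY counters that are never reset,
    # served straight from Redis RAM for sub-millisecond dashboard reads.
    cache.incr("metrics:total_jobs")
    cache.incrby("metrics:total_bytes", bytes_processed)
```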
- Clone & Setup

  ```bash
  git clone https://github.com/iamanpathak/distributed-data-pipeline.git
  cd distributed-data-pipeline
  cp .env.example .env  # Add your Discord Webhook URL here
  ```

- Launch Infrastructure

  ```bash
  docker-compose up -d --build --scale worker=3
  ```

- Explore the Services
- Live Dashboard: http://localhost:8501
- API Swagger Docs: http://localhost:8000/docs
- Celery Task Monitor (Flower): http://localhost:5555
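Once the stack is up, a quick smoke test might look like the following. The `/jobs` endpoint and payload shape are assumptions carried over from the gateway sketch above; check the Swagger docs at `/docs` for the actual routes.

```python
# smoke_test.py -- hypothetical smoke test; verify routes in Swagger first.
import requests

resp = requests.post(
    "http://localhost:8000/jobs",     # assumed endpoint; see /docs
    json={"data": "hello pipeline"},  # assumed payload shape
    timeout=5,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"job_id": "...", "status": "queued"}
```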
This project is licensed under the MIT License. See the LICENSE file for details.
Made with ❤️ by Aman Pathak