
A collection of eight production-grade system designs covering MLOps, data engineering, AI security, orbital autonomy, infrastructure architecture, and database architecture. Built to teach how systems behave in production, not just at design time.


TAM-DS/System-Design-


Production Systems Design

From Blank Whiteboard to Production

A collection of eight production-grade system designs covering MLOps, data engineering, AI security, orbital autonomy, infrastructure architecture, and database architecture.

"Software bug = satellite is now space junk" — This is why systems thinking matters, especially at 2 am.


🚀 Systems

Design a RAG system that grounds its answers in retrieved data instead of hallucinating. Covers data ingestion, vector databases, context building, monitoring, and the 7 attack surfaces.

Key Insight: Most RAG failures happen at the retrieval layer, not the LLM.

Tech Stack: Python, LangChain, Pinecone/Weaviate, MLflow, Prometheus
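Since the key insight puts most failures at the retrieval layer, here is a minimal sketch of a retrieval gate: rank chunks by cosine similarity and return no context at all when nothing clears a relevance threshold, so the caller can refuse instead of letting the LLM guess. It assumes embeddings are already computed; `Chunk`, `retrieve`, `build_context`, and the `min_score=0.75` cutoff are hypothetical names and values, not the project's code, and in practice Pinecone or Weaviate would perform the similarity search.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    embedding: list[float]  # assumed precomputed by the ingestion pipeline


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: list[float], chunks: list[Chunk], k: int = 3,
             min_score: float = 0.75) -> list[tuple[float, Chunk]]:
    """Top-k chunks by similarity, dropping anything below min_score."""
    scored = sorted(((cosine(query, c.embedding), c) for c in chunks),
                    key=lambda pair: pair[0], reverse=True)
    return [(score, c) for score, c in scored[:k] if score >= min_score]


def build_context(query: list[float], chunks: list[Chunk]) -> Optional[str]:
    """Grounded context for the prompt, or None so the caller can refuse."""
    hits = retrieve(query, chunks)
    if not hits:
        return None
    return "\n\n".join(f"[{score:.2f}] {c.text}" for score, c in hits)


docs = [
    Chunk("Delta Lake supports Z-ordering.", [1.0, 0.0]),
    Chunk("Redis is an in-memory store.", [0.0, 1.0]),
]
print(build_context([0.9, 0.1], docs))  # only the Delta Lake chunk passes the gate
print(build_context([0.7, 0.7], docs))  # None: nothing is relevant enough
```

Returning `None` rather than an empty string makes "no grounded answer" an explicit state that monitoring can count.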


Build a cloud-agnostic ML deployment that runs on AWS, GCP, and Azure. A single Terraform configuration deploys to any of them, with no vendor lock-in.

Key Insight: Vendor lock-in isn't a technical problem; it's an architectural one. Design for portability from day one.

Tech Stack: Terraform, Kubernetes, MLflow, Prometheus, Grafana

Impact: Eliminated vendor lock-in risk and preserved negotiating power with cloud providers.
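"Design for portability" can be illustrated by confining every cloud-specific detail to one narrow adapter. The project does this in Terraform; to keep one language across these examples, the sketch below is Python. It builds a Kubernetes Deployment manifest that is identical on EKS, GKE, and AKS except for the model artifact URI. The adapter classes, the `model_deployment` helper, the bucket and image names, and the `az://` URI style (as used by fsspec/adlfs) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Protocol


class CloudTarget(Protocol):
    """The only cloud-specific surface the workload is allowed to see."""
    managed_k8s: str

    def artifact_uri(self, bucket: str, path: str) -> str: ...


@dataclass(frozen=True)
class AWS:
    managed_k8s: str = "EKS"

    def artifact_uri(self, bucket: str, path: str) -> str:
        return f"s3://{bucket}/{path}"


@dataclass(frozen=True)
class GCP:
    managed_k8s: str = "GKE"

    def artifact_uri(self, bucket: str, path: str) -> str:
        return f"gs://{bucket}/{path}"


@dataclass(frozen=True)
class Azure:
    managed_k8s: str = "AKS"

    def artifact_uri(self, bucket: str, path: str) -> str:
        # Assumption: fsspec/adlfs-style "az://container/path" URI.
        return f"az://{bucket}/{path}"


def model_deployment(name: str, image: str, bucket: str, model_path: str,
                     target: CloudTarget, replicas: int = 2) -> dict:
    """Cloud-neutral Kubernetes Deployment; only MODEL_URI depends on the target."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "env": [{"name": "MODEL_URI",
                                 "value": target.artifact_uri(bucket, model_path)}],
                    }],
                },
            },
        },
    }


# One workload definition, three clouds.
manifests = {
    t.managed_k8s: model_deployment("scorer", "registry.example.com/scorer:1.0",
                                    "ml-artifacts", "models/scorer/1", t)
    for t in (AWS(), GCP(), Azure())
}
```

Switching clouds means swapping the adapter, not rewriting the workload, which is the architectural (rather than technical) form of lock-in avoidance the insight describes.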


Process 10TB/day through a medallion architecture (Bronze → Silver → Gold), optimized for cost and performance at scale.

Key Insight: Architecture determines cost. Bronze/Silver/Gold + Delta Lake + Z-ordering = 63% cost reduction.

Tech Stack: Databricks, Delta Lake, Spark, Airflow, Terraform

Impact: $244K → $91K/year (63% cost reduction)
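To make the three layers concrete, here is a minimal pure-Python sketch of what each one is responsible for: Bronze keeps raw records as ingested, Silver types, cleans, and deduplicates them, and Gold holds a business-level aggregate. In the project these layers would be Delta Lake tables processed with Spark; the record schema, the function names, and the "daily revenue per customer" aggregate are hypothetical, and Z-ordering (a Delta Lake file-layout optimization) is not modeled.

```python
from collections import defaultdict
from datetime import datetime

# Bronze: raw records exactly as ingested (strings, duplicates, and bad rows kept for replay).
bronze = [
    {"event_id": "e1", "customer": " Acme ", "amount": "10.5", "ts": "2026-01-05T09:00:00"},
    {"event_id": "e1", "customer": "Acme", "amount": "10.5", "ts": "2026-01-05T09:00:00"},  # duplicate delivery
    {"event_id": "e2", "customer": "acme", "amount": "4.5", "ts": "2026-01-05T17:30:00"},
    {"event_id": "e3", "customer": "Globex", "amount": "n/a", "ts": "2026-01-05T10:00:00"},  # malformed amount
]


def to_silver(rows: list[dict]) -> list[dict]:
    """Silver: typed, cleaned, and deduplicated on event_id."""
    seen: set[str] = set()
    silver = []
    for row in rows:
        try:
            event = {
                "event_id": row["event_id"],
                "customer": row["customer"].strip().lower(),
                "amount": float(row["amount"]),
                "ts": datetime.fromisoformat(row["ts"]),
            }
        except (KeyError, ValueError, AttributeError):
            continue  # a production pipeline would quarantine these rows, not silently skip them
        if event["event_id"] in seen:
            continue
        seen.add(event["event_id"])
        silver.append(event)
    return silver


def to_gold(silver: list[dict]) -> dict[tuple[str, str], float]:
    """Gold: business-level aggregate (daily revenue per customer)."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for e in silver:
        totals[(e["ts"].date().isoformat(), e["customer"])] += e["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is never modified, Silver and Gold can be rebuilt from it whenever cleaning rules change. As a check on the stated impact: (244 − 91) / 244 ≈ 62.7%, which rounds to the 63% reduction.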


Secure AI systems across 7 attack surfaces: Data, Model, Inference API, Output, Infrastructure, Supply Chain, and Humans.

Key Insight: Most teams secure ONE layer; attackers exploit the other SIX. Defense in depth is non-negotiable.

Tech Stack: Python, OWASP, HashiCorp Vault, Prometheus, Kubernetes
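One way to act on "most teams secure ONE layer" is to map every security control to the attack surfaces it covers and list the surfaces left thin. The sketch below encodes the seven surfaces named above; the control names and the `min_controls` depth parameter are hypothetical examples, not the project's actual control set.

```python
from collections import Counter
from enum import Enum


class Surface(Enum):
    DATA = "Data"
    MODEL = "Model"
    INFERENCE_API = "Inference API"
    OUTPUT = "Output"
    INFRASTRUCTURE = "Infrastructure"
    SUPPLY_CHAIN = "Supply Chain"
    HUMANS = "Humans"


def weak_surfaces(controls: dict[str, set[Surface]], min_controls: int = 1) -> list[Surface]:
    """Surfaces covered by fewer than min_controls controls, in declaration order."""
    counts = Counter(s for surfaces in controls.values() for s in surfaces)
    return [s for s in Surface if counts[s] < min_controls]


# Hypothetical control inventory for a team that has mostly hardened the API edge.
controls = {
    "API authentication + rate limiting": {Surface.INFERENCE_API},
    "Secrets stored in HashiCorp Vault": {Surface.INFRASTRUCTURE},
    "Output filtering for leaked data": {Surface.OUTPUT},
}

print([s.value for s in weak_surfaces(controls)])  # surfaces with no control at all
```

Raising `min_controls` to 2 turns the same check into a rough defense-in-depth test: every surface should have more than one independent control.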


Design K8s infrastructure for distributed ML training, covering GPU scheduling, storage strategy, and cost optimization.

Key Insight: K8s for ML ≠ K8s for web services. Treat them the same and you waste money.

Tech Stack: Kubernetes, CUDA, PyTorch, Databricks, Kubecost

Impact: $200K → $80K/month (60% cost reduction)
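A few lines of a training pod spec show where ML and web workloads diverge: GPUs requested as the `nvidia.com/gpu` extended resource (exposed by the NVIDIA device plugin), a toleration for tainted GPU nodes, run-to-completion restart semantics, and a memory-backed `/dev/shm` for PyTorch data-loader workers. This is a sketch generating the manifest as a Python dict; the `gpu-type` node label, the taint key, and the `16Gi` shared-memory size are assumptions that depend on how a given cluster is set up.

```python
from typing import Optional


def gpu_training_pod(name: str, image: str, gpus: int,
                     node_selector: Optional[dict] = None,
                     shm_size: str = "16Gi") -> dict:
    """Pod manifest for a single-node GPU training job (illustrative values)."""
    if gpus < 1:
        raise ValueError("a training pod needs at least one GPU")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"workload": "ml-training"}},
        "spec": {
            # Training runs to completion; a web service would restart forever instead.
            "restartPolicy": "Never",
            # Hypothetical node label; pins the job to the intended GPU node pool.
            "nodeSelector": node_selector or {"gpu-type": "a100"},
            # Assumes GPU nodes are tainted so ordinary pods cannot land on them.
            "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists",
                             "effect": "NoSchedule"}],
            "containers": [{
                "name": "trainer",
                "image": image,
                # GPUs are set in limits; Kubernetes uses the limit as the request.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
                "volumeMounts": [{"name": "dshm", "mountPath": "/dev/shm"}],
            }],
            # Memory-backed shared memory for multi-process data loading.
            "volumes": [{"name": "dshm",
                         "emptyDir": {"medium": "Memory", "sizeLimit": shm_size}}],
        },
    }


pod = gpu_training_pod("resnet-train", "registry.example.com/train:latest", gpus=4)
```

Tainting GPU nodes and tolerating the taint only in training pods keeps small web pods from occupying expensive GPU machines, one common source of the waste the insight warns about. As a check on the stated impact: (200 − 80) / 200 = 60%.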


Build autonomous systems that keep working with 240ms latency and no ground contact 92% of the time. Covers orbital mechanics, launch constraints, and safe mode design.

Key Insight: You can't remote-control what physics won't allow. Autonomy is required, not optional.

Tech Stack: C++, Python, RTOS, Fault-tolerant systems

Signature: "Software bug = satellite is now space junk"
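Safe mode design reduces to an onboard decision the spacecraft must make without waiting for the ground. The sketch below is a small mode state machine: the vehicle enters SAFE on its own when health data crosses a limit, but leaves SAFE only when the ground explicitly clears it and telemetry allows. That exit policy is a common conservative choice, assumed here rather than taken from the project; flight code would be C++ on an RTOS, and every threshold below is a hypothetical illustration, not a flight value.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Hypothetical thresholds for illustration only.
SAFE_BATTERY_PCT = 20.0
DEGRADED_BATTERY_PCT = 40.0
MAX_ATTITUDE_ERROR_DEG = 15.0


class Mode(Enum):
    NOMINAL = auto()
    DEGRADED = auto()
    SAFE = auto()


@dataclass(frozen=True)
class Telemetry:
    battery_pct: float
    attitude_error_deg: float


def assess(t: Telemetry) -> Mode:
    """Mode that onboard health data calls for, with no ground input."""
    if t.battery_pct < SAFE_BATTERY_PCT or t.attitude_error_deg > MAX_ATTITUDE_ERROR_DEG:
        return Mode.SAFE
    if t.battery_pct < DEGRADED_BATTERY_PCT:
        return Mode.DEGRADED
    return Mode.NOMINAL


def next_mode(current: Mode, t: Telemetry, ground_clears_safe: bool = False) -> Mode:
    """Enter SAFE autonomously; leave it only when the ground clears it and health allows."""
    if current is Mode.SAFE and not ground_clears_safe:
        return Mode.SAFE
    return assess(t)


mode = next_mode(Mode.NOMINAL, Telemetry(battery_pct=15.0, attitude_error_deg=2.0))
```

Because the ground link is down most of the time, the transition into SAFE cannot depend on a command; only the exit waits for operators, who can inspect the fault first.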


Everything runs on Linux. Master the foundation: kernel, networking, security, and performance tuning.

Key Insight: You can't optimize what you don't understand. Know your stack from silicon to application.

Tech Stack: Linux, Bash, systemd, eBPF, perf
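As a small example of knowing what the kernel actually reports, here is a parser for `/proc/loadavg`, whose fields are the 1-, 5-, and 15-minute load averages, currently runnable versus total scheduling entities, and the most recently assigned PID. The dataclass and function names are illustrative. One detail worth knowing before tuning: Linux load averages also count tasks in uninterruptible sleep (typically blocked on I/O), so a high load is not necessarily CPU saturation.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LoadAvg:
    load1: float
    load5: float
    load15: float
    runnable: int   # scheduling entities currently runnable
    total: int      # scheduling entities that exist
    last_pid: int   # most recently assigned PID


def parse_loadavg(text: str) -> LoadAvg:
    """Parse the single line of /proc/loadavg, e.g. '0.20 0.18 0.12 1/80 11206'."""
    l1, l5, l15, sched, last_pid = text.split()
    runnable, total = sched.split("/")
    return LoadAvg(float(l1), float(l5), float(l15), int(runnable), int(total), int(last_pid))


def read_loadavg(path: str = "/proc/loadavg") -> LoadAvg:
    """Read live values; only works on Linux, where procfs is mounted."""
    return parse_loadavg(Path(path).read_text())


sample = parse_loadavg("0.20 0.18 0.12 1/80 11206\n")
```

Comparing `load1` against the CPU count is a rough first check; confirming whether the pressure comes from CPU or I/O needs tools such as `perf` or eBPF from the tech stack.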


Design a database architecture that matches each data structure to the right database type. Covers the 5 database types (In-Memory, Time-Series, Graph, Relational, Distributed), polyglot persistence, and a decision framework.

Key Insight: Performance problems are often architecture problems, not hardware problems. Cutting 30-second queries to 100 milliseconds took an architecture change, not new hardware (300x faster on the same machines).

Tech Stack: PostgreSQL, Redis, Neo4j, Cassandra, InfluxDB, MongoDB, Elasticsearch

Impact: 300x performance improvement, $50K/month infrastructure savings, no emergency migrations
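The "match data structure to database type" idea can be sketched as a rule table over workload traits, where more than one match is exactly the polyglot-persistence case. The traits, the rule order, and the relational default below are a hypothetical simplification for illustration, not the project's actual decision framework; the example engines come from the tech stack above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical traits of a data access pattern."""
    sub_ms_key_lookups: bool = False
    append_only_timestamped: bool = False
    relationship_traversal: bool = False
    multi_row_transactions: bool = False
    writes_exceed_one_node: bool = False


# Ordered rules: trait -> database type (example engine from the tech stack).
RULES = [
    ("sub_ms_key_lookups", "In-Memory (e.g. Redis)"),
    ("append_only_timestamped", "Time-Series (e.g. InfluxDB)"),
    ("relationship_traversal", "Graph (e.g. Neo4j)"),
    ("multi_row_transactions", "Relational (e.g. PostgreSQL)"),
    ("writes_exceed_one_node", "Distributed (e.g. Cassandra)"),
]


def recommend(w: Workload) -> list[str]:
    """Every matching store; several matches means polyglot persistence."""
    picks = [store for trait, store in RULES if getattr(w, trait)]
    return picks or ["Relational (e.g. PostgreSQL)"]  # assumed default for general workloads


# A recommendation service: cached sessions plus "customers who bought X" traversals.
stores = recommend(Workload(sub_ms_key_lookups=True, relationship_traversal=True))
```

Each matching trait is a workload that would otherwise be forced onto the wrong engine, which is how a 30-second query can end up on a store whose data model does not fit it.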



🔗 Connect

Tracy Manning
Staff MLOps Engineer | Multi-Cloud + Linux Infrastructure Architect | Austin, TX

💼 LinkedIn
🐦 X/Twitter
📊 Tableau Public
📱 WhatsApp Channel


📄 License

MIT License. Feel free to use it for learning and adapt it for your systems.


💬 Final Thought

"Software bug = satellite is now space junk."

This is why systems thinking matters.

When you can't send a technician, you design better systems. When you can't afford downtime, you design for failure. When you can't waste money, you architect for cost.

Production systems that work at 2am.

That's the standard.


Last updated: January 2026
