A quick overview of every problem in this repo. Use the Category, Difficulty and Interview columns to filter by what you want to practice. Each row links to the problem statement and the reference solution.
This file is generated by `scripts/build_index.py` from the frontmatter in each problem's `question.md`. Do not edit by hand.
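For readers curious how a frontmatter block can become an index row, here is a minimal sketch. It is not the actual `scripts/build_index.py`. The field names (`title`, `category`, `difficulty`, `interview`) and the flat `key: value` layout are assumptions made for illustration, and the real script may well use a YAML library instead of this hand-rolled parser.

```python
from typing import Dict


def parse_frontmatter(text: str) -> Dict[str, str]:
    """Parse flat `key: value` pairs between the leading `---` fences.

    Assumption: frontmatter is a simple, non-nested block at the top of the file.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing fence reached
        if ":" in line:
            key, value = line.split(":", 1)
            meta[key.strip()] = value.strip()
    return {}  # no closing fence: treat the file as having no frontmatter


def index_row(number: int, meta: Dict[str, str]) -> str:
    """Render one markdown table row; link targets are omitted in this sketch."""
    cells = [
        str(number),
        meta.get("title", ""),        # hypothetical field name
        meta.get("category", ""),     # hypothetical field name
        meta.get("difficulty", ""),   # hypothetical field name
        meta.get("interview", ""),    # hypothetical field name
        "Question",
        "Solution",
    ]
    return "| " + " | ".join(cells) + " |"


# Hypothetical question.md content, used only to exercise the functions.
sample = """---
title: OLTP vs OLAP
category: Database Internals
difficulty: Easy
interview: must
---
Body of the question.
"""

row = index_row(8, parse_frontmatter(sample))
print(row)
```

A file without a leading `---` fence, or without a closing one, yields an empty dictionary, so a malformed `question.md` produces a row with blank cells rather than a crash. Whether the real generator behaves that way is not stated here.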
| # | Problem | Category | Difficulty | Interview | Question | Solution |
|---|---|---|---|---|---|---|
| 1 | Log File Error Analysis | Batch Pipelines | Easy | strong | Question | Solution |
| 2 | Rolling Average of Sensor Readings | Streaming | Easy | optional | Question | Solution |
| 3 | Transform and Clean Raw Data for Analytics | Batch Pipelines | Medium | optional | Question | Solution |
| 4 | Schema Evolution and Validation for Streaming Events | Streaming | Medium | strong | Question | Solution |
| 5 | Merging Messy CSVs from Multiple Partners | Batch Pipelines | Medium | optional | Question | Solution |
| 6 | Partitioning vs Clustering in BigQuery | Storage & Formats | Easy | strong | Question | Solution |
| 7 | ETL vs ELT and Why ELT Won | Batch Pipelines | Easy | ⭐ must | Question | Solution |
| 8 | OLTP vs OLAP | Database Internals | Easy | ⭐ must | Question | Solution |
| 9 | Idempotency in Data Pipelines | Batch Pipelines | Medium | ⭐ must | Question | Solution |
| 10 | Slowly Changing Dimensions | Data Modeling | Medium | ⭐ must | Question | Solution |
| 11 | Data Contracts in Plain Words | Batch Pipelines | Medium | strong | Question | Solution |
| 12 | Parquet vs CSV vs JSON | Storage & Formats | Easy | ⭐ must | Question | Solution |
| 13 | Data Lake vs Warehouse vs Lakehouse | Storage & Formats | Medium | ⭐ must | Question | Solution |
| 14 | Exactly Once Delivery | Streaming | Medium | ⭐ must | Question | Solution |
| 15 | Teaching SQL Performance to a Junior | SQL & Querying | Medium | strong | Question | Solution |
| 16 | SELECT DISTINCT Hiding Join Bugs | SQL & Querying | Medium | strong | Question | Solution |
| 17 | Reading an EXPLAIN Plan | SQL & Querying | Medium | ⭐ must | Question | Solution |
| 18 | CTE vs Subquery | SQL & Querying | Medium | strong | Question | Solution |
| 19 | Same Query Different Answers | SQL & Querying | Medium | strong | Question | Solution |
| 20 | Window Functions vs GROUP BY | SQL & Querying | Medium | ⭐ must | Question | Solution |
| 21 | Data Platform for an Electricity Retailer | System Design | Hard | optional | Question | Solution |
| 22 | Banking App Monthly Spending Widget | System Design | Hard | ⭐ must | Question | Solution |
| 23 | Ride Hailing Surge Pricing | System Design | Hard | strong | Question | Solution |
| 24 | Spotify Minutes Listened This Week | System Design | Hard | ⭐ must | Question | Solution |
| 25 | Smart Meter to Monthly Bill PDF | System Design | Hard | optional | Question | Solution |
| 26 | Delivery Idle Driver Tracking | System Design | Hard | optional | Question | Solution |
| 27 | Year in Review Recap | System Design | Medium | strong | Question | Solution |
| 28 | Low Balance Notification Pipeline | System Design | Medium | strong | Question | Solution |
| 29 | Daily Report Quietly Wrong for Two Weeks | Debugging & Reliability | Medium | ⭐ must | Question | Solution |
| 30 | Warehouse Cost Doubled in Two Months | Cost Optimization | Medium | ⭐ must | Question | Solution |
| 31 | The Dashboard is Wrong | Debugging & Reliability | Easy | strong | Question | Solution |
| 32 | Inheriting a Pipeline No One Owns | People & Process | Medium | optional | Question | Solution |
| 33 | Executive Needs a Number Tomorrow | People & Process | Medium | optional | Question | Solution |
| 34 | Three Days of Data Lost | Debugging & Reliability | Hard | ⭐ must | Question | Solution |
| 35 | Lambda vs Cloud Function vs Cloud Run | Cloud Services | Medium | strong | Question | Solution |
| 36 | Scheduled Pipeline Pay Only When Run | Cloud Services | Easy | optional | Question | Solution |
| 37 | BigQuery vs Snowflake for New Team | Cloud Services | Medium | strong | Question | Solution |
| 38 | Store Partner Files in S3 or Warehouse | Cloud Services | Easy | optional | Question | Solution |
| 39 | Managed Airflow vs Self Hosted | Cloud Services | Medium | strong | Question | Solution |
| 40 | BigQuery Access Control for 50 Person Company | Cloud Services | Medium | optional | Question | Solution |
| 41 | Tables for an Airbnb Like App | Data Modeling | Medium | ⭐ must | Question | Solution |
| 42 | Tracking Subscription Plan History | Data Modeling | Medium | strong | Question | Solution |
| 43 | Mixing Facts and Dimensions | Data Modeling | Medium | strong | Question | Solution |
| 44 | Explaining Fact Table Grain | Data Modeling | Easy | ⭐ must | Question | Solution |
| 45 | Current State and Full History | Data Modeling | Medium | optional | Question | Solution |
| 46 | Region Suddenly Shows Zero Revenue | Debugging & Reliability | Medium | ⭐ must | Question | Solution |
| 47 | Airflow Green but Output Empty | Debugging & Reliability | Medium | ⭐ must | Question | Solution |
| 48 | Query Suddenly 80x Slower | Debugging & Reliability | Medium | ⭐ must | Question | Solution |
| 49 | User Says Data Is Wrong | Debugging & Reliability | Easy | optional | Question | Solution |
| 50 | Partition Always Ten Percent Smaller | Debugging & Reliability | Medium | strong | Question | Solution |
| 51 | BigQuery Bill Eight Times Higher | Cost Optimization | Medium | ⭐ must | Question | Solution |
| 52 | Four Hour Spark Job Under One Hour | Cost Optimization | Medium | strong | Question | Solution |
| 53 | Hourly Scan on Daily Data | Cost Optimization | Easy | strong | Question | Solution |
| 54 | Just Throw More Memory At It | Cost Optimization | Medium | strong | Question | Solution |
| 55 | Partitioning Clustering Materialized Views | Storage & Formats | Easy | strong | Question | Solution |
| 56 | Watermarks in Plain Words | Streaming | Medium | ⭐ must | Question | Solution |
| 57 | Kafka Ordering Guarantee | Streaming | Medium | ⭐ must | Question | Solution |
| 58 | Streaming Consumer Lag Diagnosis | Streaming | Medium | ⭐ must | Question | Solution |
| 59 | Onboarding a New Analyst | People & Process | Easy | optional | Question | Solution |
| 60 | Metric by Tomorrow vs Doing It Right | People & Process | Easy | optional | Question | Solution |
| 61 | Two Teams Disagree on Active User | People & Process | Medium | strong | Question | Solution |
| 62 | Postmortem After a Bad Day | People & Process | Medium | ⭐ must | Question | Solution |
| 63 | Inherited Pipeline No Docs No Tests | People & Process | Medium | optional | Question | Solution |
| 64 | Breaking Change in dbt Model 200 Consumers | People & Process | Medium | strong | Question | Solution |
| 65 | 4000 DAG Airflow at 90 Percent CPU | Debugging & Reliability | Medium | strong | Question | Solution |
| 66 | Indexes When to Add and When They Hurt | Database Internals | Easy | ⭐ must | Question | Solution |
| 67 | Transactions and ACID | Database Internals | Easy | ⭐ must | Question | Solution |
| 68 | Isolation Levels in Plain Words | Database Internals | Medium | ⭐ must | Question | Solution |
| 69 | Normalization and When to Denormalize | Database Internals | Medium | strong | Question | Solution |
| 70 | B-Tree vs Hash vs LSM Tree | Database Internals | Medium | ⭐ must | Question | Solution |
| 71 | Read Replicas and Replication Lag | Database Operations | Medium | strong | Question | Solution |
| 72 | Sharding and Picking a Shard Key | Database Operations | Hard | ⭐ must | Question | Solution |
| 73 | Database Connection Pooling | Database Operations | Medium | strong | Question | Solution |
| 74 | Deadlocks and Lock Escalation | Database Operations | Medium | strong | Question | Solution |
| 75 | SQL vs NoSQL | Database Internals | Medium | strong | Question | Solution |
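If you prefer filtering the table above with a script rather than by eye, the sketch below parses a markdown table into dictionaries and selects rows by column value. The helper names are hypothetical and the sample holds only three rows copied from the table. It assumes no cell contains a literal `|` character.

```python
from typing import Dict, List


def split_cells(line: str) -> List[str]:
    """Split one markdown table line into trimmed cell values."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_table(markdown: str) -> List[Dict[str, str]]:
    """Turn a markdown table into a list of {header: cell} dictionaries."""
    rows = [ln for ln in markdown.splitlines() if ln.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header = split_cells(rows[0])
    # rows[1] is the |---|---| separator line, so data starts at rows[2].
    return [dict(zip(header, split_cells(ln))) for ln in rows[2:]]


sample_table = """
| # | Problem | Category | Difficulty | Interview | Question | Solution |
|---|---|---|---|---|---|---|
| 2 | Rolling Average of Sensor Readings | Streaming | Easy | optional | Question | Solution |
| 7 | ETL vs ELT and Why ELT Won | Batch Pipelines | Easy | ⭐ must | Question | Solution |
| 14 | Exactly Once Delivery | Streaming | Medium | ⭐ must | Question | Solution |
"""

problems = parse_table(sample_table)

# Keep Streaming problems tagged "must" in the Interview column.
streaming_must = [
    p["Problem"]
    for p in problems
    if p["Category"] == "Streaming" and "must" in p["Interview"]
]
print(streaming_must)
```

Matching on the substring `must` rather than the full cell keeps the filter working whether or not the star prefix is present.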
| Category | What you practice |
|---|---|
| SQL & Querying | Writing, reading and reasoning about SQL like a senior engineer |
| Data Modeling | Star schemas, history tracking, grain, dimensions, SCDs |
| Database Internals | Engines, ACID, isolation levels, indexes, B-tree vs LSM, normalization |
| Database Operations | Running databases at scale: replicas, sharding, connection pools, deadlocks |
| Batch Pipelines | ETL/ELT, idempotency, data cleaning, contracts, orchestration |
| Streaming | Kafka, watermarks, exactly-once, ordering, consumer lag |
| Storage & Formats | Parquet, lakehouse, partitioning, clustering, materialized views |
| System Design | End-to-end pipelines for real consumer and energy-sector products |
| Cloud Services | Picking between AWS, GCP and Azure services with clear trade-offs |
| Cost Optimization | Finding waste in queries, jobs, and infrastructure when the bill spikes |
| Debugging & Reliability | Step-by-step investigation when the number is wrong or the job died |
| People & Process | Mentoring, comms, postmortems, ownership, rollouts |
- Easy — A focused warm-up. Solvable or explainable in under an hour.
- Medium — Realistic interview question. Has edge cases that matter.
- Hard — Multi-step or system-design heavy. Closer to a take-home task.
- ⭐ must — Shows up in the majority of senior data engineer interview loops right now.
- strong — Common follow-up territory. Worth knowing cold for a senior bar.
- optional — Niche, situational, or domain-specific. Read when curious.
Twenty-nine problems that cover the questions you cannot dodge in a senior data engineer loop. Read them in the order shown, top to bottom.
- Window Functions vs GROUP BY — SQL & Querying
- Reading an EXPLAIN Plan — SQL & Querying
- OLTP vs OLAP — Database Internals
- Transactions and ACID — Database Internals
- Isolation Levels in Plain Words — Database Internals
- Indexes When to Add and When They Hurt — Database Internals
- B-Tree vs Hash vs LSM Tree — Database Internals
- Sharding and Picking a Shard Key — Database Operations
- Explaining Fact Table Grain — Data Modeling
- Slowly Changing Dimensions — Data Modeling
- Tables for an Airbnb Like App — Data Modeling
- ETL vs ELT and Why ELT Won — Batch Pipelines
- Idempotency in Data Pipelines — Batch Pipelines
- Parquet vs CSV vs JSON — Storage & Formats
- Data Lake vs Warehouse vs Lakehouse — Storage & Formats
- Watermarks in Plain Words — Streaming
- Kafka Ordering Guarantee — Streaming
- Exactly Once Delivery — Streaming
- Streaming Consumer Lag Diagnosis — Streaming
- Banking App Monthly Spending Widget — System Design
- Spotify Minutes Listened This Week — System Design
- Region Suddenly Shows Zero Revenue — Debugging & Reliability
- Airflow Green but Output Empty — Debugging & Reliability
- Daily Report Quietly Wrong for Two Weeks — Debugging & Reliability
- Query Suddenly 80x Slower — Debugging & Reliability
- Three Days of Data Lost — Debugging & Reliability
- BigQuery Bill Eight Times Higher — Cost Optimization
- Warehouse Cost Doubled in Two Months — Cost Optimization
- Postmortem After a Bad Day — People & Process
All 75 problems arranged as a pedagogical sequence. Read top to bottom for a complete path from SQL to senior-level data engineering.
- Teaching SQL Performance to a Junior — SQL & Querying, Medium
- CTE vs Subquery — SQL & Querying, Medium
- Window Functions vs GROUP BY — SQL & Querying, Medium
- SELECT DISTINCT Hiding Join Bugs — SQL & Querying, Medium
- Reading an EXPLAIN Plan — SQL & Querying, Medium
- Same Query Different Answers — SQL & Querying, Medium
- OLTP vs OLAP — Database Internals, Easy
- SQL vs NoSQL — Database Internals, Medium
- Transactions and ACID — Database Internals, Easy
- Isolation Levels in Plain Words — Database Internals, Medium
- Normalization and When to Denormalize — Database Internals, Medium
- Indexes When to Add and When They Hurt — Database Internals, Easy
- B-Tree vs Hash vs LSM Tree — Database Internals, Medium
- Database Connection Pooling — Database Operations, Medium
- Deadlocks and Lock Escalation — Database Operations, Medium
- Read Replicas and Replication Lag — Database Operations, Medium
- Sharding and Picking a Shard Key — Database Operations, Hard
- Explaining Fact Table Grain — Data Modeling, Easy
- Mixing Facts and Dimensions — Data Modeling, Medium
- Slowly Changing Dimensions — Data Modeling, Medium
- Tables for an Airbnb Like App — Data Modeling, Medium
- Tracking Subscription Plan History — Data Modeling, Medium
- Current State and Full History — Data Modeling, Medium
- ETL vs ELT and Why ELT Won — Batch Pipelines, Easy
- Idempotency in Data Pipelines — Batch Pipelines, Medium
- Data Contracts in Plain Words — Batch Pipelines, Medium
- Log File Error Analysis — Batch Pipelines, Easy
- Transform and Clean Raw Data for Analytics — Batch Pipelines, Medium
- Merging Messy CSVs from Multiple Partners — Batch Pipelines, Medium
- Parquet vs CSV vs JSON — Storage & Formats, Easy
- Data Lake vs Warehouse vs Lakehouse — Storage & Formats, Medium
- Partitioning vs Clustering in BigQuery — Storage & Formats, Easy
- Partitioning Clustering Materialized Views — Storage & Formats, Easy
- Store Partner Files in S3 or Warehouse — Cloud Services, Easy
- Lambda vs Cloud Function vs Cloud Run — Cloud Services, Medium
- Scheduled Pipeline Pay Only When Run — Cloud Services, Easy
- BigQuery vs Snowflake for New Team — Cloud Services, Medium
- Managed Airflow vs Self Hosted — Cloud Services, Medium
- BigQuery Access Control for 50 Person Company — Cloud Services, Medium
- Rolling Average of Sensor Readings — Streaming, Easy
- Schema Evolution and Validation for Streaming Events — Streaming, Medium
- Watermarks in Plain Words — Streaming, Medium
- Kafka Ordering Guarantee — Streaming, Medium
- Exactly Once Delivery — Streaming, Medium
- Streaming Consumer Lag Diagnosis — Streaming, Medium
- Banking App Monthly Spending Widget — System Design, Hard
- Spotify Minutes Listened This Week — System Design, Hard
- Ride Hailing Surge Pricing — System Design, Hard
- Delivery Idle Driver Tracking — System Design, Hard
- Year in Review Recap — System Design, Medium
- Low Balance Notification Pipeline — System Design, Medium
- Smart Meter to Monthly Bill PDF — System Design, Hard
- Data Platform for an Electricity Retailer — System Design, Hard
- User Says Data Is Wrong — Debugging & Reliability, Easy
- The Dashboard is Wrong — Debugging & Reliability, Easy
- Region Suddenly Shows Zero Revenue — Debugging & Reliability, Medium
- Airflow Green but Output Empty — Debugging & Reliability, Medium
- Daily Report Quietly Wrong for Two Weeks — Debugging & Reliability, Medium
- Partition Always Ten Percent Smaller — Debugging & Reliability, Medium
- Query Suddenly 80x Slower — Debugging & Reliability, Medium
- Three Days of Data Lost — Debugging & Reliability, Hard
- 4000 DAG Airflow at 90 Percent CPU — Debugging & Reliability, Medium
- Just Throw More Memory At It — Cost Optimization, Medium
- Hourly Scan on Daily Data — Cost Optimization, Easy
- BigQuery Bill Eight Times Higher — Cost Optimization, Medium
- Warehouse Cost Doubled in Two Months — Cost Optimization, Medium
- Four Hour Spark Job Under One Hour — Cost Optimization, Medium
- Onboarding a New Analyst — People & Process, Easy
- Metric by Tomorrow vs Doing It Right — People & Process, Easy
- Executive Needs a Number Tomorrow — People & Process, Medium
- Two Teams Disagree on Active User — People & Process, Medium
- Inheriting a Pipeline No One Owns — People & Process, Medium
- Inherited Pipeline No Docs No Tests — People & Process, Medium
- Breaking Change in dbt Model 200 Consumers — People & Process, Medium
- Postmortem After a Bad Day — People & Process, Medium
New problems are added regularly. If you want to contribute, see the Contribution Guide.