Skip to content

Latest commit

 

History

History
234 lines (215 loc) · 31 KB

File metadata and controls

234 lines (215 loc) · 31 KB

Problems Index

A quick overview of every problem in this repo. Use the Category and Topics columns to filter by what you want to practice. Each row links to the problem statement and the reference solution.

This file is generated by scripts/build_index.py from the frontmatter in each problem's question.md. Do not edit by hand.

# Problem Category Difficulty Interview Question Solution
1 Log File Error Analysis Batch Pipelines Easy strong Question Solution
2 Rolling Average of Sensor Readings Streaming Easy optional Question Solution
3 Transform and Clean Raw Data for Analytics Batch Pipelines Medium optional Question Solution
4 Schema Evolution and Validation for Streaming Events Streaming Medium strong Question Solution
5 Merging Messy CSVs from Multiple Partners Batch Pipelines Medium optional Question Solution
6 Partitioning vs Clustering in BigQuery Storage & Formats Easy strong Question Solution
7 ETL vs ELT and Why ELT Won Batch Pipelines Easy ⭐ must Question Solution
8 OLTP vs OLAP Database Internals Easy ⭐ must Question Solution
9 Idempotency in Data Pipelines Batch Pipelines Medium ⭐ must Question Solution
10 Slowly Changing Dimensions Data Modeling Medium ⭐ must Question Solution
11 Data Contracts in Plain Words Batch Pipelines Medium strong Question Solution
12 Parquet vs CSV vs JSON Storage & Formats Easy ⭐ must Question Solution
13 Data Lake vs Warehouse vs Lakehouse Storage & Formats Medium ⭐ must Question Solution
14 Exactly Once Delivery Streaming Medium ⭐ must Question Solution
15 Teaching SQL Performance to a Junior SQL & Querying Medium strong Question Solution
16 SELECT DISTINCT Hiding Join Bugs SQL & Querying Medium strong Question Solution
17 Reading an EXPLAIN Plan SQL & Querying Medium ⭐ must Question Solution
18 CTE vs Subquery SQL & Querying Medium strong Question Solution
19 Same Query Different Answers SQL & Querying Medium strong Question Solution
20 Window Functions vs GROUP BY SQL & Querying Medium ⭐ must Question Solution
21 Data Platform for an Electricity Retailer System Design Hard optional Question Solution
22 Banking App Monthly Spending Widget System Design Hard ⭐ must Question Solution
23 Ride Hailing Surge Pricing System Design Hard strong Question Solution
24 Spotify Minutes Listened This Week System Design Hard ⭐ must Question Solution
25 Smart Meter to Monthly Bill PDF System Design Hard optional Question Solution
26 Delivery Idle Driver Tracking System Design Hard optional Question Solution
27 Year in Review Recap System Design Medium strong Question Solution
28 Low Balance Notification Pipeline System Design Medium strong Question Solution
29 Daily Report Quietly Wrong for Two Weeks Debugging & Reliability Medium ⭐ must Question Solution
30 Warehouse Cost Doubled in Two Months Cost Optimization Medium ⭐ must Question Solution
31 The Dashboard is Wrong Debugging & Reliability Easy strong Question Solution
32 Inheriting a Pipeline No One Owns People & Process Medium optional Question Solution
33 Executive Needs a Number Tomorrow People & Process Medium optional Question Solution
34 Three Days of Data Lost Debugging & Reliability Hard ⭐ must Question Solution
35 Lambda vs Cloud Function vs Cloud Run Cloud Services Medium strong Question Solution
36 Scheduled Pipeline Pay Only When Run Cloud Services Easy optional Question Solution
37 BigQuery vs Snowflake for New Team Cloud Services Medium strong Question Solution
38 Store Partner Files in S3 or Warehouse Cloud Services Easy optional Question Solution
39 Managed Airflow vs Self Hosted Cloud Services Medium strong Question Solution
40 BigQuery Access Control for 50 Person Company Cloud Services Medium optional Question Solution
41 Tables for an Airbnb Like App Data Modeling Medium ⭐ must Question Solution
42 Tracking Subscription Plan History Data Modeling Medium strong Question Solution
43 Mixing Facts and Dimensions Data Modeling Medium strong Question Solution
44 Explaining Fact Table Grain Data Modeling Easy ⭐ must Question Solution
45 Current State and Full History Data Modeling Medium optional Question Solution
46 Region Suddenly Shows Zero Revenue Debugging & Reliability Medium ⭐ must Question Solution
47 Airflow Green but Output Empty Debugging & Reliability Medium ⭐ must Question Solution
48 Query Suddenly 80x Slower Debugging & Reliability Medium ⭐ must Question Solution
49 User Says Data Is Wrong Debugging & Reliability Easy optional Question Solution
50 Partition Always Ten Percent Smaller Debugging & Reliability Medium strong Question Solution
51 BigQuery Bill Eight Times Higher Cost Optimization Medium ⭐ must Question Solution
52 Four Hour Spark Job Under One Hour Cost Optimization Medium strong Question Solution
53 Hourly Scan on Daily Data Cost Optimization Easy strong Question Solution
54 Just Throw More Memory At It Cost Optimization Medium strong Question Solution
55 Partitioning Clustering Materialized Views Storage & Formats Easy strong Question Solution
56 Watermarks in Plain Words Streaming Medium ⭐ must Question Solution
57 Kafka Ordering Guarantee Streaming Medium ⭐ must Question Solution
58 Streaming Consumer Lag Diagnosis Streaming Medium ⭐ must Question Solution
59 Onboarding a New Analyst People & Process Easy optional Question Solution
60 Metric by Tomorrow vs Doing It Right People & Process Easy optional Question Solution
61 Two Teams Disagree on Active User People & Process Medium strong Question Solution
62 Postmortem After a Bad Day People & Process Medium ⭐ must Question Solution
63 Inherited Pipeline No Docs No Tests People & Process Medium optional Question Solution
64 Breaking Change in dbt Model 200 Consumers People & Process Medium strong Question Solution
65 4000 DAG Airflow at 90 Percent CPU Debugging & Reliability Medium strong Question Solution
66 Indexes When to Add and When They Hurt Database Internals Easy ⭐ must Question Solution
67 Transactions and ACID Database Internals Easy ⭐ must Question Solution
68 Isolation Levels in Plain Words Database Internals Medium ⭐ must Question Solution
69 Normalization and When to Denormalize Database Internals Medium strong Question Solution
70 B-Tree vs Hash vs LSM Tree Database Internals Medium ⭐ must Question Solution
71 Read Replicas and Replication Lag Database Operations Medium strong Question Solution
72 Sharding and Picking a Shard Key Database Operations Hard ⭐ must Question Solution
73 Database Connection Pooling Database Operations Medium strong Question Solution
74 Deadlocks and Lock Escalation Database Operations Medium strong Question Solution
75 SQL vs NoSQL Database Internals Medium strong Question Solution

Category Legend

Category What you practice
SQL & Querying Writing, reading and reasoning about SQL like a senior engineer
Data Modeling Star schemas, history tracking, grain, dimensions, SCDs
Database Internals Engines, ACID, isolation levels, indexes, B-tree vs LSM, normalization
Database Operations Running databases at scale: replicas, sharding, connection pools, deadlocks
Batch Pipelines ETL/ELT, idempotency, data cleaning, contracts, orchestration
Streaming Kafka, watermarks, exactly-once, ordering, consumer lag
Storage & Formats Parquet, lakehouse, partitioning, clustering, materialized views
System Design End-to-end pipelines for real consumer and energy-sector products
Cloud Services Picking between AWS, GCP and Azure services with clear trade-offs
Cost Optimization Finding waste in queries, jobs, and infrastructure when the bill spikes
Debugging & Reliability Step-by-step investigation when the number is wrong or the job died
People & Process Mentoring, comms, postmortems, ownership, rollouts

Difficulty Guide

  • Easy — A focused warm-up. Solvable or explainable in under an hour.
  • Medium — Realistic interview question. Has edge cases that matter.
  • Hard — Multi-step or system-design heavy. Closer to a take-home task.

Interview Value

  • ⭐ must — Shows up in the majority of senior data engineer interview loops right now.
  • strong — Common follow-up territory. Worth knowing cold for a senior bar.
  • optional — Niche, situational, or domain-specific. Read when curious.

⭐ Interview Must-Haves

Twenty-nine problems that cover the questions you cannot dodge in a senior data engineer loop. Read them in the order shown, top to bottom.

  1. Window Functions vs GROUP BYSQL & Querying
  2. Reading an EXPLAIN PlanSQL & Querying
  3. OLTP vs OLAPDatabase Internals
  4. Transactions and ACIDDatabase Internals
  5. Isolation Levels in Plain WordsDatabase Internals
  6. Indexes When to Add and When They HurtDatabase Internals
  7. B-Tree vs Hash vs LSM TreeDatabase Internals
  8. Sharding and Picking a Shard KeyDatabase Operations
  9. Explaining Fact Table GrainData Modeling
  10. Slowly Changing DimensionsData Modeling
  11. Tables for an Airbnb Like AppData Modeling
  12. ETL vs ELT and Why ELT WonBatch Pipelines
  13. Idempotency in Data PipelinesBatch Pipelines
  14. Parquet vs CSV vs JSONStorage & Formats
  15. Data Lake vs Warehouse vs LakehouseStorage & Formats
  16. Watermarks in Plain WordsStreaming
  17. Kafka Ordering GuaranteeStreaming
  18. Exactly Once DeliveryStreaming
  19. Streaming Consumer Lag DiagnosisStreaming
  20. Banking App Monthly Spending WidgetSystem Design
  21. Spotify Minutes Listened This WeekSystem Design
  22. Region Suddenly Shows Zero RevenueDebugging & Reliability
  23. Airflow Green but Output EmptyDebugging & Reliability
  24. Daily Report Quietly Wrong for Two WeeksDebugging & Reliability
  25. Query Suddenly 80x SlowerDebugging & Reliability
  26. Three Days of Data LostDebugging & Reliability
  27. BigQuery Bill Eight Times HigherCost Optimization
  28. Warehouse Cost Doubled in Two MonthsCost Optimization
  29. Postmortem After a Bad DayPeople & Process

Recommended Reading Order

All 75 problems arranged as a pedagogical sequence. Read top to bottom for a complete path from SQL to senior-level data engineering.

  1. Teaching SQL Performance to a JuniorSQL & Querying, Medium
  2. CTE vs SubquerySQL & Querying, Medium
  3. Window Functions vs GROUP BYSQL & Querying, Medium
  4. SELECT DISTINCT Hiding Join BugsSQL & Querying, Medium
  5. Reading an EXPLAIN PlanSQL & Querying, Medium
  6. Same Query Different AnswersSQL & Querying, Medium
  7. OLTP vs OLAPDatabase Internals, Easy
  8. SQL vs NoSQLDatabase Internals, Medium
  9. Transactions and ACIDDatabase Internals, Easy
  10. Isolation Levels in Plain WordsDatabase Internals, Medium
  11. Normalization and When to DenormalizeDatabase Internals, Medium
  12. Indexes When to Add and When They HurtDatabase Internals, Easy
  13. B-Tree vs Hash vs LSM TreeDatabase Internals, Medium
  14. Database Connection PoolingDatabase Operations, Medium
  15. Deadlocks and Lock EscalationDatabase Operations, Medium
  16. Read Replicas and Replication LagDatabase Operations, Medium
  17. Sharding and Picking a Shard KeyDatabase Operations, Hard
  18. Explaining Fact Table GrainData Modeling, Easy
  19. Mixing Facts and DimensionsData Modeling, Medium
  20. Slowly Changing DimensionsData Modeling, Medium
  21. Tables for an Airbnb Like AppData Modeling, Medium
  22. Tracking Subscription Plan HistoryData Modeling, Medium
  23. Current State and Full HistoryData Modeling, Medium
  24. ETL vs ELT and Why ELT WonBatch Pipelines, Easy
  25. Idempotency in Data PipelinesBatch Pipelines, Medium
  26. Data Contracts in Plain WordsBatch Pipelines, Medium
  27. Log File Error AnalysisBatch Pipelines, Easy
  28. Transform and Clean Raw Data for AnalyticsBatch Pipelines, Medium
  29. Merging Messy CSVs from Multiple PartnersBatch Pipelines, Medium
  30. Parquet vs CSV vs JSONStorage & Formats, Easy
  31. Data Lake vs Warehouse vs LakehouseStorage & Formats, Medium
  32. Partitioning vs Clustering in BigQueryStorage & Formats, Easy
  33. Partitioning Clustering Materialized ViewsStorage & Formats, Easy
  34. Store Partner Files in S3 or WarehouseCloud Services, Easy
  35. Lambda vs Cloud Function vs Cloud RunCloud Services, Medium
  36. Scheduled Pipeline Pay Only When RunCloud Services, Easy
  37. BigQuery vs Snowflake for New TeamCloud Services, Medium
  38. Managed Airflow vs Self HostedCloud Services, Medium
  39. BigQuery Access Control for 50 Person CompanyCloud Services, Medium
  40. Rolling Average of Sensor ReadingsStreaming, Easy
  41. Schema Evolution and Validation for Streaming EventsStreaming, Medium
  42. Watermarks in Plain WordsStreaming, Medium
  43. Kafka Ordering GuaranteeStreaming, Medium
  44. Exactly Once DeliveryStreaming, Medium
  45. Streaming Consumer Lag DiagnosisStreaming, Medium
  46. Banking App Monthly Spending WidgetSystem Design, Hard
  47. Spotify Minutes Listened This WeekSystem Design, Hard
  48. Ride Hailing Surge PricingSystem Design, Hard
  49. Delivery Idle Driver TrackingSystem Design, Hard
  50. Year in Review RecapSystem Design, Medium
  51. Low Balance Notification PipelineSystem Design, Medium
  52. Smart Meter to Monthly Bill PDFSystem Design, Hard
  53. Data Platform for an Electricity RetailerSystem Design, Hard
  54. User Says Data Is WrongDebugging & Reliability, Easy
  55. The Dashboard is WrongDebugging & Reliability, Easy
  56. Region Suddenly Shows Zero RevenueDebugging & Reliability, Medium
  57. Airflow Green but Output EmptyDebugging & Reliability, Medium
  58. Daily Report Quietly Wrong for Two WeeksDebugging & Reliability, Medium
  59. Partition Always Ten Percent SmallerDebugging & Reliability, Medium
  60. Query Suddenly 80x SlowerDebugging & Reliability, Medium
  61. Three Days of Data LostDebugging & Reliability, Hard
  62. 4000 DAG Airflow at 90 Percent CPUDebugging & Reliability, Medium
  63. Just Throw More Memory At ItCost Optimization, Medium
  64. Hourly Scan on Daily DataCost Optimization, Easy
  65. BigQuery Bill Eight Times HigherCost Optimization, Medium
  66. Warehouse Cost Doubled in Two MonthsCost Optimization, Medium
  67. Four Hour Spark Job Under One HourCost Optimization, Medium
  68. Onboarding a New AnalystPeople & Process, Easy
  69. Metric by Tomorrow vs Doing It RightPeople & Process, Easy
  70. Executive Needs a Number TomorrowPeople & Process, Medium
  71. Two Teams Disagree on Active UserPeople & Process, Medium
  72. Inheriting a Pipeline No One OwnsPeople & Process, Medium
  73. Inherited Pipeline No Docs No TestsPeople & Process, Medium
  74. Breaking Change in dbt Model 200 ConsumersPeople & Process, Medium
  75. Postmortem After a Bad DayPeople & Process, Medium

New problems are added regularly. If you want to contribute, see the Contribution Guide.