Welcome to the repository for the GenAI Book. This repository contains the markdown chapters detailing enterprise architectures, Agentic AI, LangGraph, and MLflow 3.0 on the Databricks Platform.
In this book, Yip provides a great (and practical) technical blueprint for bringing Lakehouse foundations to the GenAI and Agentic era. A lot has happened since his last book, from Agent Bricks to the new Lakebase/OLTP architecture, and he provides a comprehensive resource with the practical focus of someone who works on the platform daily, doing the real, hard work. —David Meyer, SVP of Products, Databricks
This book is super-exciting! It is an impressively comprehensive resource for both beginners looking to learn the platform and practitioners wanting to understand everything under the covers of Databricks. Jason Yip does a great job going through all the key features: Lakebase, Agent Bricks, composable agents, UC and governance, Lakeflow, and Data Warehousing with DBSQL. I especially found it helpful because of the abundance of inside-the-product screenshots, architecture diagrams, and code examples. This is a must-read for those looking to begin or deepen their Databricks journey! —Ari Kaplan, Head of Technical Evangelism, Databricks
Jason has done an excellent job of walking the readers through the breadth and depth of the Databricks Data Intelligence platform, including Agent Bricks, Lakebase, Databricks Apps, Unity Catalog, Genie, Lakeflow, Databricks SQL, and MLflow 3. Building on his first book, this is a powerful resource for data and AI practitioners driving transformation and innovation with AI agents built on enterprise data. —Amit Singh, Global Head of Partner GTM - AI, Lakebase and Genie, Databricks
The Lakehouse paradigm championed by Databricks is changing the way the industry thinks about data. The industry is moving to an Extract Transform Catalog pattern where the data never leaves the lake and is accessible to all who need it. This book gives you insight into how and why a governance-first approach is key to the new data landscape many enterprises are building. —Robert Thompson, MTS Solutions Architecture, T-Mobile; Databricks MVP
This book clearly captures Databricks' evolution from a Lakehouse to a true Data Intelligence Platform. What stands out is its deep, practical treatment of agentic AI, showing not just what agents are but how to operationalize them with evaluation, governance, cost controls, and real-time intelligence. It is a pragmatic blueprint for architects and engineers looking to move AI from experimentation to production at enterprise scale. —Srivathsan RL, Global Databricks Solution Lead, Cognizant; Databricks AI MVP
This is the definitive guide for organizations looking to operationalize Generative AI. It offers an excellent balance of hands-on implementation and strategic governance, demonstrating exactly how to leverage Databricks data pipelines to transform GenAI concepts into secure, scalable, and high-value real-world solutions. —Nilton Ueda, Executive Senior Manager, Deloitte; Databricks MVP
The book is an essential read for both newcomers and experienced professionals. It captures nuanced insights that are often missing from standard documentation, enabling readers to develop a broader perspective on its usage and the value it can bring to Databricks projects. Each chapter is thoughtfully structured and clearly articulated, reflecting a high level of expertise. Additionally, the book comprehensively covers the latest advancements and best practices available today. —Maulik Dixit, VP of Data Engineering, Tredence; Databricks MVP
Databricks Data Intelligence Platform has redefined the modern data stack by unifying Generative AI with the reliable foundation of Delta Lake and the governance of Unity Catalog. The authors expertly deconstruct complex patterns, from serverless GPU orchestration to real-time stateful AI with Lakebase, providing the practical examples needed to build secure, scalable applications. It is the master manual for the next generation of AI engineering. —Scott Davis, Head of Data and AI, Lumenalta; Databricks MVP
This book is a solid and comprehensive overview of key Databricks components, providing a good foundation of knowledge for both well-known and lesser-known (but equally important) capabilities. —Josue Bogran, VP of Data + AI Architecture, zeb; Databricks MVP
Moving far beyond the AI hype, this book transforms beginners into seasoned AI practitioners on Databricks. It provides the rare bridge between how to build and why it works, offering clarity that sticks. —Bartosz Konieczny, Databricks MVP
This is a wonderful book! It felt like someone thoughtfully connecting the dots across the Databricks platform. What stood out to me was how naturally it tied together the platform foundations, governance, SQL, and the newer AI direction, all without losing the practical angle. My personal favorite part was the Agent Bricks coverage, because it made a fast-moving space feel much more understandable and actionable. Highly recommended!! —Shekhar Shukla, CEO and Founder, EZ; Databricks MVP; MLflow Ambassador
A really strong, engineering-focused book for building modern AI systems on the Databricks Data Intelligence Platform. What I really liked is that it goes beyond explaining how things work and shows how to design, evaluate, and improve them in production. —Jaco van Gelder, Co-founder, Dtyped; Databricks MVP
Excellent book if you need to prepare for passing an interview about Databricks or before taking Databricks exams. —Hubert Dudek, Databricks MVP
Finding a resource that covers the full breadth of the Databricks platform while making it visual and accessible is no easy task. This book nails it. A valuable read for anyone looking to get a solid, practical understanding of the platform. —Maria Vechtomova, Co-founder at Cauchy, Databricks MVP, MLflow Ambassador
This book captured the Databricks Data Intelligence Platform concepts in great detail, a good book for beginners and advanced practitioners of Databricks. —Rajaniesh Kaushikk, Director Technology, Virtusa Corporation, Databricks MVP, Microsoft MVP
Read more praise from Databricks executives and industry leaders here
Below is an AI‑optimized index of all chapters in this book, including a brief summary (TL;DR) and the core topics covered in each chapter.
TL;DR Answer: The transition from passive data storage to active, reasoning AI systems requires a fundamental shift in architecture. The 12 canonical components of an end-to-end AI system diagram presented in Databricks’ AI Security Framework outline this modern blueprint—a framework connecting raw data directly to real-world inference.
Topics Covered:
- 1. Navigating the Data Intelligence Architecture
- 2. The Foundation and the Future
- 3. Building and Evaluating Agentic AI
- 4. Secure, Scalable Infrastructure
- 5. Data Engineering and Governance
- 6. Operations and Real-World Execution
Common Question: What is Chapter 1: Databricks Platform: From Lakehouse to Data Intelligence? See details →
Common Question: What is Data Platforms: Historical Perspective? See details →
Common Question: What is Emergence of the Lakehouse? See details →
Common Question: What Is a Lakehouse? See details →
Common Question: What Is the Databricks Lakehouse? See details →
Common Question: What is Key Features of the Databricks Lakehouse Platform? See details →
Common Question: What is Introducing the Databricks Data Intelligence Platform? See details →
Common Question: What is From Databricks to Agent Bricks? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: The intensifying pace of digital transformation has led companies to amass increasing volumes of diverse data from various sources. This data explosion carries enormous potential for organizations to uncover transformative insights to guide innovation and decision-making through advanced analytics.
Topics Covered:
- Data Platforms: Historical Perspective
- Emergence of the Lakehouse
- What Is a Lakehouse?
- What Is the Databricks Lakehouse?
- Key Features of the Databricks Lakehouse Platform
- Introducing the Databricks Data Intelligence Platform
- From Databricks to Agent Bricks
- Conclusion
Common Question: What is Chapter 2: Generative AI on Databricks: Foundations, Capabilities, and Use Cases? See details →
Common Question: What Is Generative AI? See details →
Common Question: What is Databricks Generative AI? See details →
Common Question: What is The GenAI Journey? See details →
Common Question: What is Prompt Engineering? See details →
Common Question: What is AI Playground? See details →
Common Question: What is Use Cases? See details →
Common Question: What is MCP Servers vs. Unity Catalog functions? See details →
Common Question: What is Retrieval-Augmented Generation (RAG)? See details →
Common Question: What is Metadata Filtering? See details →
Common Question: What is Result Enrichment Without Lookups? See details →
Common Question: What is Testing the Vector Search Index? See details →
Common Question: What is Mosaic AI Fine-Tuning API? See details →
Common Question: What is Fine-Tuning Example? See details →
Common Question: What is Pre-Training? See details →
Common Question: What is A Case Study of AI2’s OLMo, a Truly Open-Source Large Language Model? See details →
Common Question: What is GenAI Pricing? See details →
Common Question: What Are Tokens and Tokenizers? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: Ever since ChatGPT was released to the public, there has been no shortage of interest in chatbots or generative artificial intelligence (GenAI). But what exactly is GenAI, and how does Databricks come into the picture? And how can it help organizations deploy their own chatbot or develop their own GenAI applications? In this chapter, we will first learn the concepts around GenAI. We will then discuss how Databricks and MosaicML, now an integral part of Databricks, work together to transform t...
Topics Covered:
- What Is Generative AI?
- Databricks Generative AI
- The GenAI Journey
- Prompt Engineering
- Retrieval-Augmented Generation (RAG)
- Testing the Vector Search Index
- Mosaic AI Fine-Tuning API
- Pre-Training
- GenAI Pricing
- Conclusion
Common Question: What is Chapter 3: Composable Agents with Agent Bricks: From Prototype to Production? See details →
Common Question: What is Agent Bricks: Databricks’ No-Code Solution for Production Agents? See details →
Common Question: What is An Agent Bricks Walkthrough? See details →
Common Question: What is Information Extraction? See details →
Common Question: What is Optimization and Production? See details →
Common Question: What is Custom LLM? See details →
Common Question: What is Iterating on Quality in Custom LLM? See details →
Common Question: What is Adding Appropriate Criteria? See details →
Common Question: How to Do Proper Evaluations? See details →
Common Question: What is A Structured Evaluation Framework: The 3x3 Approach? See details →
Common Question: What is Real-Life Example? See details →
Common Question: What is Conclusion? See details →
Common Question: What is From Bespoke Orchestration to Integrated Workflows? See details →
Common Question: What is Maximizing Business Value? See details →
TL;DR Answer: In the rapidly evolving world of Generative AI (GenAI), there is an elephant in the room (Figure 3-1). While the potential of GenAI is immense, according to a 2024 Forbes report, 90% of GenAI projects never reach production.1 Further, in August 2025, an MIT survey states that 95% of GenAI projects don’t see a good return on investment.2 This predicament still holds true today in 2026. The challenges often boil down...
Topics Covered:
- Agent Bricks: Databricks’ No-Code Solution for Production Agents
- An Agent Bricks Walkthrough
- Custom LLM
- Conclusion
Common Question: What is Chapter 4: Agent Bricks Deep Dive: Retrieval, Reasoning, and Tooling Architectures? See details →
Common Question: What is Retrieval Augmented Generation: The Knowledge Assistant? See details →
Common Question: What is Similarity Search: The Magic Behind the Scenes? See details →
Common Question: What is Reranking? See details →
Common Question: What is Vector Search Infrastructure? See details →
Common Question: What is A Crash Course on Chunking? See details →
Common Question: What is Fixed-Size Chunking? See details →
Common Question: What is Recursive Chunking? See details →
Common Question: What is Semantic Chunking? See details →
Common Question: What is Document-Based Chunking? See details →
Common Question: What is Agentic Chunking? See details →
Common Question: What is A Practical Example for RAG: Using Unstructured Data? See details →
Common Question: What is Quality Improvement? See details →
Common Question: How Many Iterations Are Needed to Improve Quality? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: In Chapter 3, we discussed the first two use cases for Agent Bricks: Information Extraction and Custom LLM. To recap, Agent Bricks is a UI-based product that packs the best of Databricks research behind the scenes, so we don’t need to learn or keep up with the latest techniques to make our GenAI solutions work with our data. There are two different types of bricks: generative bricks and non-generative bricks.
Topics Covered:
- Retrieval Augmented Generation: The Knowledge Assistant
- Similarity Search: The Magic Behind the Scenes
- Vector Search Infrastructure
- A Practical Example for RAG: Using Unstructured Data
- Quality Improvement
- Conclusion
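As a rough illustration of the Fixed-Size Chunking entry listed for this chapter, the sketch below splits text into fixed-width character windows with a small overlap. It is not taken from the book: the function name `fixed_size_chunks` and the parameters `chunk_size` and `overlap` are hypothetical, and character counts stand in for whatever unit (tokens, sentences) a real pipeline would use.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of chunk_size characters; neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Example: 10 characters, windows of 4 with 1 character of overlap.
print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
# → ['abcd', 'defg', 'ghij']
```

The overlap is a common way to reduce the chance that a sentence relevant to a query is cut in half at a chunk boundary; the other strategies listed above (recursive, semantic, document-based, agentic) pick boundaries from the content rather than from a fixed width.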
Common Question: What is Chapter 5: Agentic AI Design Patterns for Enterprise Systems? See details →
Common Question: What is A Case Study? See details →
Common Question: What is Deep Dive into Agentic Patterns? See details →
Common Question: What is Visualizing Agents with LangGraph? See details →
Common Question: What is Tool Use? See details →
Common Question: What is Reflection? See details →
Common Question: What is Multi-Agent Collaboration? See details →
Common Question: What is ReAct (Reason + Act)? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: In the previous chapters, we have discussed different types of agents, including Information Extraction, Custom LLM, and Knowledge Assistant. These are great use cases, but at the same time, they are isolated for a specific use case. This is obviously by design because they are built for domain-specific purposes. What if we can chain all these agents together to create a very powerful system? This is called an Agentic AI workflow. We will first discuss the various Agentic AI design patterns a...
Topics Covered:
- A Case Study
- Deep Dive into Agentic Patterns
- Conclusion
Common Question: What is Chapter 6: Databricks As an Agentic Platform: Orchestration, Context, and Control? See details →
Common Question: What is Data Intelligence? See details →
Common Question: What is Deep Dive into Data Intelligence? See details →
Common Question: What is Genie Code (Databricks Assistant)? See details →
Common Question: What is AI-Powered Governance? See details →
Common Question: What is Search and Discovery? See details →
Common Question: What is AI/BI Genie? See details →
Common Question: How to Set Up Genie? See details →
Common Question: What is Genie Deep Research? See details →
Common Question: What is Databricks MCP Servers? See details →
Common Question: What is Conclusion? See details →
Common Question: What is Core Components of the Data Intelligence Platform? See details →
Common Question: What is Integration and Connectivity? See details →
TL;DR Answer: In the previous chapters, we learned about the new Agent Bricks offerings. Since 2023, with the rise of GenAI and LLMs, Databricks has integrated them into its platform. The Databricks data intelligence platform (see Figure 6-1) combines the lakehouse platform and AI/LLMs to add the “data intelligence” engine that understands the uniqueness of your data and uses that understanding across everything in the platform. However, tools are increasingly designed for AI Agen...
Topics Covered:
- Data Intelligence
- Deep Dive into Data Intelligence
- Conclusion
Common Question: What is Chapter 7: Quality Tuning and Evaluation with Agent Bricks? See details →
Common Question: What is Evaluating Information Extraction and Custom LLM? See details →
Common Question: What is Evaluation-Driven Development with MLflow? See details →
Common Question: What is Manual Evaluation? See details →
Common Question: What is Systematic Evaluation? See details →
Common Question: What is Monitoring? See details →
Common Question: What is Databricks Traces? See details →
Common Question: What is Dataset? See details →
Common Question: What is New Evaluation? See details →
Common Question: What is Fix Responses? See details →
Common Question: What is Knowledge Assistant and Multi-Agent Supervisor? See details →
Common Question: What is Agent Learning from Human Feedback (ALHF) in Knowledge Assistant? See details →
Common Question: What is Extending the Paradigm to the Multi-Agent Supervisor? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: So far, we have discussed how to create agents with Agent Bricks; it’s a new paradigm for UI-driven AI. It is understandable in a user interface-driven framework that all we are seeing are spinning circles or progress bars. Scientists and engineers alike often want more details, and companies that adopt the latest AI products also need to evaluate the pros and cons of bringing in new technology, since they can no longer analyze the underlying model internals to understand how things work. How...
Topics Covered:
- Evaluating Information Extraction and Custom LLM
- Evaluation-Driven Development with MLflow
- Manual Evaluation
- Systematic Evaluation
- Monitoring
- Dataset
- Agent Learning from Human Feedback (ALHF) in Knowledge Assistant
- Extending the Paradigm to the Multi-Agent Supervisor
- Conclusion
Common Question: What is Chapter 8: Lakebase: The OLTP Engine for Intelligent Applications? See details →
Common Question: What Is Databricks Lakebase? See details →
Common Question: What is Getting Started with Lakebase? See details →
Common Question: What is Welcome to the Lakebase Interface? See details →
Common Question: What is Storage and Compute Decoupling? See details →
Common Question: What is Copy-on-Write? See details →
Common Question: What is Data Sharing Between Lakebase and Lakehouse? See details →
Common Question: What is Lakebase Core Features? See details →
Common Question: What is Managed Data Synchronization? See details →
Common Question: What is Lakebase: Before and After? See details →
Common Question: What is The Operational Complexity? See details →
Common Question: What is The Governance Layer? See details →
Common Question: What is Lakebase (Relational/Key-Value Storage)? See details →
Common Question: What is Vector Database (Semantic Search)? See details →
Common Question: What is Real-Life Use Cases? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: Databricks is the most optimized platform for data and analytics, including Online Analytical Processing (OLAP). However, several use cases require sub-second data latency, in which case Lakehouse data must be moved to a low-latency online transactional processing (OLTP) database to serve online applications. In the past, organizations have adopted a strategy of copying data from Delta Lake into a separate OLTP database like Azure SQL Database, Amazon Aurora, or GCP Cloud SQL. The purpose was...
Topics Covered:
- What Is Databricks Lakebase?
- Getting Started with Lakebase
- Welcome to the Lakebase Interface
- Lakebase Core Features
- Managed Data Synchronization
- Lakebase: Before and After
- Conclusion
Common Question: What is Chapter 9: Building Security-First, Production-Grade Databricks Apps? See details →
Common Question: What is Building Apps Where Your Data Lives? See details →
Common Question: What is Core Architecture: A Serverless, Containerized Foundation? See details →
Common Question: What is The Serverless Compute Plane? See details →
Common Question: What is Isolation and Security by Design? See details →
Common Question: What is The Runtime Environment? See details →
Common Question: What is Unity Catalog As the Governance Layer? See details →
Common Question: What is Distinguishing Permissions from Authorization? See details →
Common Question: What is App Authorization (Service Principal Model)? See details →
Common Question: What is User Authorization (“On-Behalf-Of” Model)? See details →
Common Question: What is From Code to Production? See details →
Common Question: What is For Python Applications? See details →
Common Question: What is For Node.js Applications? See details →
Common Question: What is Deployment Using Declarative Automation Bundles? See details →
Common Question: What is A CI/CD Workflow Example? See details →
Common Question: What is Auditing and Observability? See details →
Common Question: What is Integrating with External Monitoring Tools? See details →
Common Question: What is Strategic Use Cases: Beyond Dashboards? See details →
Common Question: What is Conclusion: The Future of the Full-Stack Data Intelligence Platform? See details →
TL;DR Answer: In the previous chapters, we learned about the Databricks Lakehouse, which essentially means storing all your data in open storage in an open format with Unity Catalog providing a single governance layer. Databricks provides features that support all use cases, including data engineering, data science, streaming, and warehousing. Since 2023, with the rise of GenAI and LLMs, Databricks has integrated them into its platform. The Databricks data intelligence platform (see Figure [9-1](A625176_2_...
Topics Covered:
- Building Apps Where Your Data Lives
- Core Architecture: A Serverless, Containerized Foundation
- The Serverless Compute Plane
- Isolation and Security by Design
- The Runtime Environment
- Unity Catalog As the Governance Layer
- App Authorization (Service Principal Model)
- User Authorization (“On-Behalf-Of” Model)
- From Code to Production
- Deployment Using Declarative Automation Bundles
- A CI/CD Workflow Example
- Auditing and Observability
- Integrating with External Monitoring Tools
- Strategic Use Cases: Beyond Dashboards
- Conclusion: The Future of the Full-Stack Data Intelligence Platform
Common Question: What is Chapter 10: AI Runtime: Elastic Compute for AI Workloads? See details →
Common Question: What is AI (Serverless GPU) Runtime? See details →
Common Question: What is Serverless API at a Glance? See details →
Common Question: What is Serverless CPU for Machine Learning? See details →
Common Question: What is Ray on Databricks Serverless GPU? See details →
Common Question: What Is Ray? See details →
Common Question: What is Serverless GPU API? See details →
Common Question: What is The Launcher Module: @distributed and Orchestration? See details →
Common Question: What is The Runtime Module: The Core of Distributed Processing? See details →
Common Question: What is World Size? See details →
Common Question: What is Global Rank (Rank)? See details →
Common Question: What is Local Rank? See details →
Common Question: What is Ray: The “Spark Engine” for Distributed AI Training? See details →
Common Question: What is The Distributed Training Frameworks? See details →
Common Question: What is Multimodal Fine-Tuning: Where the Ray Stack Shines? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: When it comes to graphics processing units (GPUs), people undoubtedly think of Copilot or ChatGPT. Large language models are indeed powered by a massive number of GPUs, and we will also discuss AI models in this chapter. A lesser-known limitation is that serverless CPU compute has limited support for training traditional machine learning models. The reason is that serverless uses the Spark Connect architecture. As shown in Figure 10-1, the Spark Connect a...
Topics Covered:
- AI (Serverless GPU) Runtime
- Serverless API at a Glance
- Conclusion
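The World Size, Global Rank, and Local Rank entries listed for this chapter can be related with a little arithmetic. The sketch below is an illustration only, assuming every node has the same number of GPUs and following the numbering convention used by launchers such as `torchrun` (which exposes `WORLD_SIZE`, `RANK`, and `LOCAL_RANK`). The names `WorkerId`, `world_size`, and `enumerate_workers` are hypothetical and are not the Serverless GPU API described in the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerId:
    node_rank: int    # which machine the worker runs on
    local_rank: int   # GPU index within that machine
    global_rank: int  # unique index across the whole job


def world_size(num_nodes: int, gpus_per_node: int) -> int:
    """Total number of worker processes in the job (one per GPU)."""
    return num_nodes * gpus_per_node


def enumerate_workers(num_nodes: int, gpus_per_node: int) -> list[WorkerId]:
    """List every worker, assuming homogeneous nodes (an assumption of this sketch)."""
    return [
        WorkerId(node, local, node * gpus_per_node + local)
        for node in range(num_nodes)
        for local in range(gpus_per_node)
    ]


# Example: 2 nodes with 4 GPUs each.
workers = enumerate_workers(num_nodes=2, gpus_per_node=4)
print(world_size(2, 4))  # → 8
print(workers[5])        # → WorkerId(node_rank=1, local_rank=1, global_rank=5)
```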
Common Question: What is Chapter 11: MLflow 3 and the GenAI Agents? See details →
Common Question: What is Tracing? See details →
Common Question: What is Auto Tracing? See details →
Common Question: What is Scope of Auto-Trace? See details →
Common Question: What is Custom Tracing? See details →
Common Question: What is Function Decorator? See details →
Common Question: What is Span Tracing? See details →
Common Question: What is Evaluation? See details →
Common Question: What is Evaluating Traces with LLM Judges? See details →
Common Question: What is Template-Based Judge? See details →
Common Question: What is Guidelines-Based Judges? See details →
Common Question: What is Creating Judges (Scorers) from the MLflow UI? See details →
Common Question: What is Review App: Labeling Session? See details →
Common Question: What is Hello Prompt Optimization, Goodbye, Prompt Management? See details →
Common Question: What is Introducing DSPy? See details →
Common Question: What is The Setup? See details →
Common Question: What is The Contract? See details →
Common Question: What is The Strategy? See details →
Common Question: What is The Self-Improver? See details →
Common Question: What is The Output? See details →
Common Question: What is Conclusion? See details →
Common Question: What is Enhanced Observability Through Tracing? See details →
Common Question: What is Rigorous Quality Assurance? See details →
Common Question: What is Declarative Optimization with DSPy? See details →
TL;DR Answer: Since launching in 2018, MLflow has been one of the major open-source tools for managing the machine learning lifecycle. In the previous edition, we discussed the machine learning lifecycle and how to use MLflow to track experiments. Since MLflow 3.0, the community and Databricks have evolved MLflow to support various GenAI use cases, from basics like prompt versioning to agent tracing and custom judges.
Topics Covered:
- Tracing
- Evaluation
- Creating Judges (Scorers) from the MLflow UI
- Review App: Labeling Session
- Hello Prompt Optimization, Goodbye, Prompt Management
- Conclusion
Common Question: What is Chapter 12: Real-Time Intelligence with Spark Structured Streaming? See details →
Common Question: What is The Foundation of Structured Streaming? See details →
Common Question: What is Structured Streaming? See details →
Common Question: What is Spark Real-Time Mode? See details →
Common Question: What is Triggers? See details →
Common Question: What is Output Modes? See details →
Common Question: What is Windowed Grouped Aggregation? See details →
Common Question: What is State Management? See details →
Common Question: What is Late-Arrival Handling: Watermark? See details →
Common Question: What is Auto Loader? See details →
Common Question: What is Spark Real-Time Mode Deep Dive? See details →
Common Question: What is Architecture: Continuous Processing vs. Micro-Batching? See details →
Common Question: What is Advanced State Management? See details →
Common Question: What is Use Case: Real-Time Stock Tracking? See details →
Common Question: What is Solution: Spark Real-Time Mode and Stateful Processing? See details →
Common Question: What is Structured Streaming Best Practices? See details →
Common Question: What is Real-Time Machine Learning? See details →
Common Question: What is Online Store? See details →
Common Question: What is Specialized Libraries: Tecton? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: Many people think of streaming as very low-latency, continuous real-time events, such as feeds from X (formerly Twitter) or IoT devices. While that was the original use case, streaming has evolved over the years to allow integration with other non-real-time tables and to serve as a useful technique for incremental processing in batch pipelines. In this chapter, we will first go back in time to visit Spark Streaming; then we will look at the latest Databricks Structured Streaming engine. They are lar...
Topics Covered:
- The Foundation of Structured Streaming
- Structured Streaming
- Spark Real-Time Mode
- Triggers
- Output Modes
- Windowed Grouped Aggregation
- State Management
- Late-Arrival Handling: Watermark
- Auto Loader
- Spark Real-Time Mode Deep Dive
- Advanced State Management
- Use Case: Real-Time Stock Tracking
- Structured Streaming Best Practices
- Real-Time Machine Learning
- Online Store
- Conclusion
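To give a feel for how Windowed Grouped Aggregation and Late-Arrival Handling: Watermark fit together, here is a plain-Python simulation. It is a simplified sketch, not Spark code and not an example from the book: the function `windowed_counts` and its parameters are hypothetical, events are `(event_time_seconds, key)` tuples, and the watermark is advanced after every event, whereas a micro-batch engine would typically advance it between batches.

```python
from collections import defaultdict


def windowed_counts(events, window_seconds=60, watermark_delay=30):
    """Count events per (tumbling window start, key), dropping events older than the watermark.

    events: iterable of (event_time_seconds, key) in arrival order.
    Returns (counts, dropped) where counts maps (window_start, key) -> count.
    """
    counts = defaultdict(int)
    dropped = []
    max_event_time = None  # highest event time seen so far

    for event_time, key in events:
        # Watermark = latest event time seen minus the allowed lateness.
        if max_event_time is not None and event_time < max_event_time - watermark_delay:
            dropped.append((event_time, key))  # too late: ignored by the aggregation
            continue

        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
        max_event_time = event_time if max_event_time is None else max(max_event_time, event_time)

    return dict(counts), dropped


# Arrival order is not event-time order: 50 arrives after 65 but is still within
# the 30-second delay; 70 arrives after 130 and falls behind the watermark (100).
events = [(5, "a"), (65, "a"), (50, "a"), (130, "a"), (70, "a")]
print(windowed_counts(events))
# → ({(0, 'a'): 2, (60, 'a'): 1, (120, 'a'): 1}, [(70, 'a')])
```

The watermark bounds how long the engine must keep state for old windows: once no sufficiently recent event can still land in a window, its state can be finalized and released.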
Common Question: What is Chapter 13: Lakeflow Connect: Data Ingestion for the Lakehouse? See details →
Common Question: What is Cloud Ingestion? See details →
Common Question: What is Files Ingestion? See details →
Common Question: What is Auto Loader? See details →
Common Question: What is COPY INTO (Legacy)? See details →
Common Question: What is Beyond Ingestion? See details →
Common Question: What is Change Data Capture (CDC)? See details →
Common Question: What is Automated Schema Evolution? See details →
Common Question: What is Unified Governance with Unity Catalog? See details →
Common Question: What is Zerobus: Direct to Delta streaming? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: Organizations have a wealth of information siloed in various data sources. It could be relational databases, on-premises data warehouses, big data storage systems like Hadoop, ERP/CRM systems, or real-time streaming sources. Many analytics use cases require not only efficient processing of this data but also a unified approach to produce meaningful reports and predictions. To start this journey, organizations need to ingest data from different sources into a single location. In this chapter, ...
Topics Covered:
- Cloud Ingestion
- Files Ingestion
- Beyond Ingestion
- Conclusion
Common Question: What is Chapter 14: Open Data Governance with Unity Catalog? See details →
Common Question: What Is Databricks Unity Catalog? See details →
Common Question: What is Unity Catalog: Before and After? See details →
Common Question: What is Unity Catalog Hierarchy? See details →
Common Question: What is Unity Catalog Admin Roles? See details →
Common Question: What is Getting Started with Unity Catalog? See details →
Common Question: What is Create a Metastore? See details →
Common Question: What is Organizing Data in Unity Catalog? See details →
Common Question: What is Key Features of Unity Catalog? See details →
Common Question: What is Centralized Metadata and User Management? See details →
Common Question: What is Centralized Access Controls? See details →
Common Question: What is Data Lineage? See details →
Common Question: What is Data Access Auditing? See details →
Common Question: What is Data Search and Discovery? See details →
Common Question: What is Row-Level Security and Column-Level Masking? See details →
Common Question: What is Row Filters? See details →
Common Question: What is Create a Row Filter? See details →
Common Question: What is Apply the Row Filter to a Table? See details →
Common Question: What is Column Masks? See details →
Common Question: What is Dynamic Views vs. Row Filters and Column Masks? See details →
Common Question: What is Delta Sharing? See details →
Common Question: What is An Open Standard for Data Sharing? See details →
Common Question: How Does Delta Sharing Work? See details →
Common Question: What is Delta Sharing: Iceberg Version? See details →
Common Question: What is The Catalog War? See details →
Common Question: What is Conclusion? See details →
Common Question: What is Key Governance Capabilities? See details →
Common Question: What is Open Data Sharing? See details →
TL;DR Answer: Data is one of an organization’s most significant assets. An important determinant of a company’s performance and growth is how well its data is handled regarding quality, management, and ownership. Organizations today, especially with ever-expanding use cases for GenAI, face increasingly stringent data privacy regulations. Nonetheless, the reliance on data is increasing as organizations look to help optimize operations and drive business decision-making. Therefore, they are looking for data ...
Topics Covered:
- What Is Databricks Unity Catalog?
- Unity Catalog: Before and After
- Unity Catalog Hierarchy
- Unity Catalog Admin Roles
- Organizing Data in Unity Catalog
- Key Features of Unity Catalog
- Data Lineage
- Data Access Auditing
- Data Search and Discovery
- Row-Level Security and Column-Level Masking
- Delta Sharing
- Delta Sharing: Iceberg Version
- Conclusion
Common Question: What is Chapter 15: Delta Lake: The Foundation of Reliable Data and AI? See details →
Common Question: What is The Challenges of Other Formats? See details →
Common Question: What Is Delta Lake? See details →
Common Question: What is Transaction Log: Single Source of Truth? See details →
Common Question: What is Understanding the Transaction Log Protocol? See details →
Common Question: What is Traditional DELETE Operation (Copy-on-Write)? See details →
Common Question: What is Deletion Vectors (Merge-on-Read)? See details →
Common Question: What is Delta Lake: Medallion Architecture? See details →
Common Question: What is Delta Lake Key Features? See details →
Common Question: What is Update, Delete, and Upsert in Delta Tables? See details →
Common Question: What is Schema Evolution? See details →
Common Question: What is Time Travel? See details →
Common Question: What is Clone Delta Tables? See details →
Common Question: What is Generated Column? See details →
Common Question: What is Change Data Feed? See details →
Common Question: How CDF Works See details →
Common Question: What is Application in CDC (Change Data Capture)? See details →
Common Question: What is Other Use Cases? See details →
Common Question: What is Universal Format? See details →
Common Question: What is Delta Optimization? See details →
Common Question: What is Liquid Clustering? See details →
Common Question: What is Working with Liquid Clustering? See details →
Common Question: What is Current Limitations? See details →
Common Question: What is Predictive I/O? See details →
Common Question: What is ML/AI to the Rescue? See details →
Common Question: What is Delta Lake 4.0: The Future of Open Data Lakehouses? See details →
Common Question: What is Delta Connect (Support for Spark Connect)? See details →
Common Question: What is Expanded UniForm (Universal Format)? See details →
Common Question: What is Delta Kernel? See details →
Common Question: What is Delta Sharing? See details →
Common Question: What is Delta Sharing Use Cases? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: In this chapter, we will examine a crucial aspect of the lakehouse paradigm: the data storage format. As discussed in Chapter 1, the ideal storage format for a lakehouse is one that provides data management and performance features similar to those of a data warehouse while remaining an open format built on top of cloud data storage. Delta Lake is a storage protocol that fits these requirements exactly. Delta Lake is an open, performant s...
Topics Covered:
- The Challenges of Other Formats
- What Is Delta Lake?
- Transaction Log: Single Source of Truth
- Delta Lake: Medallion Architecture
- Delta Lake Key Features
- Time Travel
- Clone Delta Tables
- Generated Column
- Change Data Feed
- Universal Format
- Delta Optimization
- Liquid Clustering
- Working with Liquid Clustering
- Current Limitations
- Predictive I/O
- Delta Lake 4.0: The Future of Open Data Lakehouses
- Delta Sharing
- Conclusion
Common Question: What is Chapter 16: Lakeflow Declarative Pipelines: Managing Data and AI Workflows? See details →
Common Question: What Are Declarative Pipelines? See details →
Common Question: What is Data Ingestion Using Lakeflow Declarative Pipelines? See details →
Common Question: What is Real-Time Fraud Detection with Lakeflow? See details →
Common Question: What is Streaming Event Deduplication? See details →
Common Question: What is AUTO CDC API? See details →
Common Question: What is Ensuring Data Quality in Real Time? See details →
Common Question: What is Lakeflow Designer? See details →
Common Question: What is Pipeline Task Type? See details →
Common Question: What is Logging and Monitoring? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: It is no secret that well-curated, trustworthy data is the foundation of the lakehouse architecture. Organizations need clean, fresh, and reliable data to drive their analytics and data science projects, which in turn help them make decisions for key business initiatives.
Topics Covered:
- What Are Declarative Pipelines?
- Lakeflow Designer
- Logging and Monitoring
- Conclusion
Common Question: What is Chapter 17: Data Warehousing with Databricks SQL? See details →
Common Question: What Is Databricks SQL? See details →
Common Question: What is SQL Warehouses? See details →
Common Question: What is Photon? See details →
Common Question: What is SQL Editor? See details →
Common Question: What is Introduction to AI/BI Dashboards? See details →
Common Question: What is Alerts? See details →
Common Question: What is Query History and Profile? See details →
Common Question: What is Serverless Compute? See details →
Common Question: What is Constraints in DBSQL? See details →
Common Question: What is Constraints on Databricks? See details →
Common Question: What is Enforced Constraints? See details →
Common Question: What is Informational Constraints: Primary Key and Foreign Key? See details →
Common Question: What is Streaming Tables and Materialized Views? See details →
Common Question: What is Streaming Tables? See details →
Common Question: What is Materialized Views? See details →
Common Question: What is Create a Materialized View? See details →
Common Question: What is Refresh a Materialized View? See details →
Common Question: What is Lakehouse Federation? See details →
Common Question: What is AI Functions in DBSQL? See details →
Common Question: What is Consume LLM Models in DBSQL? See details →
Common Question: What is Custom Functions Backed by a Serverless Serving Endpoint? See details →
Common Question: What is Integrate BI Tools with Databricks? See details →
Common Question: What is Publish to Power BI Online from Databricks? See details →
Common Question: What is Connect Power BI Desktop to Databricks? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: If you’re a data analyst who primarily uses SQL to write queries and reports and to build dashboards with your favorite business intelligence (BI) tools, Databricks SQL (DBSQL) provides a comprehensive environment for running ad hoc queries and creating dashboards on data stored in your data lakehouse.
Topics Covered:
- What Is Databricks SQL?
- SQL Warehouses
- Constraints in DBSQL
- Streaming Tables and Materialized Views
- Materialized Views
- Connect Power BI Desktop to Databricks
- Conclusion
Common Question: What is Chapter 18: Databricks Pricing and Observability Using System Tables? See details →
Common Question: What is Costs Associated with the Databricks Platform? See details →
Common Question: What is Cloud Infrastructure Costs? See details →
Common Question: What is Databricks Pricing? See details →
Common Question: What Are Databricks Units? See details →
Common Question: What is SQL Warehouse Pricing? See details →
Common Question: What is Databricks Cost Management Best Practices? See details →
Common Question: What is Databricks Observability: System Tables? See details →
Common Question: What is Introduction to System Tables? See details →
Common Question: What is Common Schemas/Tables Available with System Tables? See details →
Common Question: What is System Table: Billing Usage Example? See details →
Common Question: What is Conclusion? See details →
TL;DR Answer: In this chapter, we will investigate how pricing for running workloads on Databricks works. It is important to understand how Databricks calculates the cost of your workloads. We will see what factors determine the pricing model and recommend which compute SKU should be used for running your specific workloads.
Topics Covered:
- Costs Associated with the Databricks Platform
- Cloud Infrastructure Costs
- Databricks Pricing
- Databricks Cost Management Best Practices
- Databricks Observability: System Tables
- Conclusion
Common Question: What is Chapter 19: From Ideation to Creation: Building Intelligent Products with GenAI? See details →
Common Question: What is The Problem Statement? See details →
Common Question: What is End-to-End Architecture? See details →
Common Question: What is Data Ingestion? See details →
Common Question: What is Loading Multimedia Files? See details →
Common Question: What is Bounding Box? See details →
Common Question: What is Fine-Tuning Loop? See details →
Common Question: What is Deployment? See details →
Common Question: What is Deep Research Agent? See details →
Common Question: What is Conclusion: From Pixels to Saving Lives? See details →
TL;DR Answer: In the last edition, we created a chatbot for understanding diabetes; you are encouraged to check out that chapter. Although that chatbot can now be built with Agent Bricks’ Knowledge Assistant, it’s still critical to understand the end-to-end workflow. In this second edition, we will walk through a practical healthcare case study in which we will attempt to fine-tune a vision language model (VLM), specifically Qwen3-VL-2B-Instruct, to watch real surgery videos and generate post-surgical reports t...
Topics Covered:
- The Problem Statement
- End-to-End Architecture
For an exhaustive mapping of all 91 code snippets, SQL pipelines, and Python LangGraph architectures used in this book, see code_map.csv. It provides language metadata, captions, and the raw code blocks for easy retrieval.
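As a rough illustration of how the mapping might be consumed, the sketch below groups snippets by language using only Python's standard `csv` module. The column names (`language`, `caption`, `code`) and the helper names are assumptions made for this example, not confirmed from the file; check the header row of code_map.csv and adjust the constants to match.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List

# Hypothetical column names, assumed for illustration only.
# Check the header row of code_map.csv and adjust as needed.
LANGUAGE_COL = "language"
CAPTION_COL = "caption"
CODE_COL = "code"


def group_snippets_by_language(lines: Iterable[str]) -> Dict[str, List[dict]]:
    """Group rows of a code-map CSV by their lowercased language value."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in csv.DictReader(lines):
        # Rows with a missing or empty language cell fall under "unknown".
        language = (row.get(LANGUAGE_COL) or "unknown").strip().lower()
        grouped[language].append(
            {"caption": row.get(CAPTION_COL, ""), "code": row.get(CODE_COL, "")}
        )
    return dict(grouped)


def load_code_map(path: str = "code_map.csv") -> Dict[str, List[dict]]:
    """Read the CSV from disk and group its snippets by language."""
    # newline="" lets the csv module handle line breaks inside quoted code cells.
    with open(path, newline="", encoding="utf-8") as handle:
        return group_snippets_by_language(handle)
```

Calling `load_code_map()` from the repository root would then return a dictionary keyed by language, so a lookup such as `load_code_map().get("sql", [])` retrieves only the SQL snippets, provided the assumed column names match the real header.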
Jason Yip is a Databricks Most Valued Professional, recognized by Databricks for his exceptional technical expertise and commitment to the data and AI community. He serves on multiple advisory boards at Databricks, including the Partner Product Advisory Board and the Solution Architect Champion Advisory Board. He currently serves as Director of Data and AI at Tredence, a leading data science and analytics company and a Databricks Gold Partner. He advises Fortune 500 companies on implementing data and Generative AI strategies in the cloud. He is a top voice on Databricks and a former Microsoft employee who successfully led the Microsoft Corporate Finance big data transformation using Databricks.
🔗 Connect with Jason Yip on LinkedIn
Nikhil Gupta is a seasoned data professional with over 20 years of experience in big data technologies, driving innovation and strategic growth in the field. He serves as a Solution Architect at Stripe and previously held the same role at Databricks. He helps customers across various industries, including retail, CPG, financial services, banking, and manufacturing, modernize their data and AI implementations on the Databricks platform. His expertise spans a range of big data technologies, including data warehousing, data lakes, and real-time data processing, making him a trusted advisor for Fortune 500 companies.
🔗 Connect with Nikhil Gupta on LinkedIn
Marcin Wojtyczka is a Practice Lead Resident Solutions Architect at Databricks and the creator of the open-source DQX data quality framework. A former software engineer and data architect, he has spent years building large-scale data platforms and ML/AI systems. Today, he builds reusable libraries and frameworks that enhance and complement the Databricks platform, powering the next generation of data intelligence. Beyond the code, Marcin is a frequent speaker, an active open-source contributor, a product builder, and an expedition sailor.