From 6da17fa65b8d2271e70a72d7fe19937dd15b679b Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Thu, 30 Apr 2026 23:59:26 +0200 Subject: [PATCH 01/14] Add Vector Search semantic product discovery example Demonstrates a Direct Access Vector Search index and endpoint declared as bundle resources (vector_search_endpoints, vector_search_indexes), tested e2e against staging with the direct engine. Key design decisions: - Jobs use resource references (${resources.*.name}) for endpoint and index names so dev-mode prefixing flows through automatically - schema_json uses flat {"col":"type"} format required by the API - Notebooks embed descriptions/queries explicitly (Direct Access indexes don't auto-embed; that's a Delta Sync feature) - engine: direct set in bundle config so no env var is needed Co-authored-by: Isaac --- .../vector_search_product_discovery/README.md | 157 ++++++++++++++ .../data/products.json | 202 ++++++++++++++++++ .../databricks.yml | 46 ++++ .../resources/endpoint.yml | 5 + .../resources/index.yml | 16 ++ .../resources/query_demo.job.yml | 38 ++++ .../resources/schema.yml | 6 + .../resources/setup_job.job.yml | 38 ++++ .../src/01_upsert_products.py | 54 +++++ .../src/02_query_demo.py | 107 ++++++++++ 10 files changed, 669 insertions(+) create mode 100644 contrib/vector_search_product_discovery/README.md create mode 100644 contrib/vector_search_product_discovery/data/products.json create mode 100644 contrib/vector_search_product_discovery/databricks.yml create mode 100644 contrib/vector_search_product_discovery/resources/endpoint.yml create mode 100644 contrib/vector_search_product_discovery/resources/index.yml create mode 100644 contrib/vector_search_product_discovery/resources/query_demo.job.yml create mode 100644 contrib/vector_search_product_discovery/resources/schema.yml create mode 100644 contrib/vector_search_product_discovery/resources/setup_job.job.yml create mode 100644 contrib/vector_search_product_discovery/src/01_upsert_products.py create mode 
100644 contrib/vector_search_product_discovery/src/02_query_demo.py diff --git a/contrib/vector_search_product_discovery/README.md b/contrib/vector_search_product_discovery/README.md new file mode 100644 index 0000000..38ff3f6 --- /dev/null +++ b/contrib/vector_search_product_discovery/README.md @@ -0,0 +1,157 @@ +# Vector Search: Semantic Product Discovery + +A Declarative Automation Bundle demonstrating **semantic product search** using +[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html). + +## The problem + +Keyword search fails when shoppers use different words than what appears in product +descriptions. A customer searching for *"something to keep my coffee hot all day"* won't +match a product described as an *"insulated stainless water bottle with double-wall vacuum +insulation"* — even though it's the right answer. + +Semantic search using vector embeddings matches on **meaning**, not words. + +## How it works + +Product descriptions are embedded at upsert time by the setup job using +[`databricks-gte-large-en`](https://docs.databricks.com/en/machine-learning/foundation-models/supported-models.html). +At query time the query is embedded with the same model and the index returns the nearest +products in vector space. + +``` +data/products.json (synced to workspace by bundle deploy) + ↓ embed descriptions → upsert_data() +product_index (Direct Access Vector Search index) + ↓ embed query → similarity_search(query_vector=...) 
+ranked results +``` + +## Bundle resources + +| Resource | Type | Description | +|---|---|---| +| `product_search_schema` | `schemas` | Unity Catalog schema that namespaces the index | +| `product_search_endpoint` | `vector_search_endpoints` | Managed ANN serving endpoint | +| `product_index` | `vector_search_indexes` | Direct Access index — schema defined in `resources/index.yml` | +| `product_discovery_setup` | `jobs` | Embeds product descriptions and upserts into the index | +| `product_discovery_query` | `jobs` | Embeds a query and returns ranked results | + +## Prerequisites + +- Databricks workspace with Unity Catalog enabled +- Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources +- An existing Unity Catalog catalog (default: `main`) + +## Quick start + +1. **Authenticate** + ```bash + databricks auth login --host https://your-workspace.cloud.databricks.com + ``` + +2. **Configure** `databricks.yml` — set the `dev` workspace host and any variable overrides + +3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json` + ```bash + databricks bundle deploy + ``` + > Vector Search endpoint creation takes a few minutes to reach ONLINE status. + +4. **Load the catalog** — embeds all product descriptions and upserts them into the index + ```bash + databricks bundle run product_discovery_setup + ``` + +5. **Search** — pass any natural-language query + ```bash + databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails" + ``` + +6. 
**Or open** `src/02_query_demo.py` in your workspace to run queries interactively + +## Configuration + +Override variables at deploy time or run time: + +```bash +databricks bundle deploy \ + --var catalog=my_catalog \ + --var schema=product_search \ + --var endpoint_name=my-vs-endpoint \ + --var embedding_model=databricks-gte-large-en \ + --var embedding_dimension=1024 +``` + +| Variable | Default | Description | +|---|---|---| +| `catalog` | `main` | Existing Unity Catalog catalog | +| `schema` | `product_search` | Schema created by the bundle | +| `endpoint_name` | `product-search-endpoint` | Vector Search endpoint name (must be unique per workspace) | +| `embedding_model` | `databricks-gte-large-en` | Foundation model used for embeddings | +| `embedding_dimension` | `1024` | Vector dimension — must match `embedding_dimension` in `resources/index.yml` | + +> **Note:** `embedding_dimension` in `resources/index.yml` is hardcoded to `1024` because +> it is immutable after index creation. If you need a different dimension, change the value +> in `index.yml` before the first deploy. + +## Index schema + +The index schema lives entirely in `resources/index.yml`: + +```yaml +direct_access_index_spec: + schema_json: >- + {"product_id":"int","name":"string","category":"string","brand":"string", + "price":"float","description":"string","description_vector":"array<float>"} + embedding_vector_columns: + - name: description_vector + embedding_dimension: 1024 +``` + +`schema_json` is a flat `{"column_name": "type"}` JSON string. `description_vector` stores +the pre-computed embedding produced by `01_upsert_products.py`. + +## Updating the product catalog + +Edit `data/products.json`, then re-deploy and re-run setup: + +```bash +databricks bundle deploy +databricks bundle run product_discovery_setup +``` + +Upserts are idempotent on `product_id` — existing records are updated, new records added. 
+ +## Variant: Delta Sync index + +This example uses a **Direct Access** index, which gives full control over when and how +records enter the index via `upsert_data`. If you already have a pipeline writing to a +Delta table, a **Delta Sync** index is often simpler — you point the index at the source +table and it keeps itself up to date. Replace `index_type: DIRECT_ACCESS` and +`direct_access_index_spec` with `index_type: DELTA_SYNC` and `delta_sync_index_spec` in +`resources/index.yml`, and remove the upsert job. + +## Project structure + +``` +. +├── databricks.yml +├── data/ +│ └── products.json # Product catalog — synced to workspace on deploy +├── resources/ +│ ├── schema.yml # Unity Catalog schema +│ ├── endpoint.yml # Vector Search endpoint +│ ├── index.yml # Direct Access index +│ ├── setup_job.job.yml # Embed + upsert job +│ └── query_demo.job.yml # Query job (--params "query=...") +└── src/ + ├── 01_upsert_products.py # Reads products.json, embeds, calls upsert_data + └── 02_query_demo.py # Semantic search — runs as job or interactively +``` + +## Resources + +- [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html) +- [Declarative Automation Bundles](https://docs.databricks.com/dev-tools/bundles/) +- [Foundation Models — GTE Large](https://docs.databricks.com/en/machine-learning/foundation-models/supported-models.html) diff --git a/contrib/vector_search_product_discovery/data/products.json b/contrib/vector_search_product_discovery/data/products.json new file mode 100644 index 0000000..96259cd --- /dev/null +++ b/contrib/vector_search_product_discovery/data/products.json @@ -0,0 +1,202 @@ +[ + { + "product_id": 1, + "name": "Alpine Thermal Jacket", + "category": "Outdoor Clothing", + "brand": "SummitGear", + "price": 289.99, + "description": "Insulated hardshell designed for alpine conditions. Features a windproof outer layer, sealed seams, and a 700-fill-power down inner. Packs into its own pocket. 
Ideal for mountaineering, ski touring, and above-treeline travel in sub-zero temperatures." + }, + { + "product_id": 2, + "name": "Merino Wool Base Layer Top", + "category": "Outdoor Clothing", + "brand": "WoolTech", + "price": 89.99, + "description": "Next-to-skin mid-weight top made from 100% New Zealand merino wool. Naturally temperature-regulating and odor-resistant. Flatlock seams prevent chafing on long days. Worn as a standalone layer or under a shell in cold weather." + }, + { + "product_id": 3, + "name": "Softshell Fleece Jacket", + "category": "Outdoor Clothing", + "brand": "TrailRidge", + "price": 149.99, + "description": "Four-way stretch softshell with a bonded fleece backer. Wind-resistant without being fully waterproof — ideal as a mid-layer or standalone jacket on dry, cool days. Two hand pockets and a chest zip pocket." + }, + { + "product_id": 4, + "name": "Rain Jacket with Hood", + "category": "Outdoor Clothing", + "brand": "StormShield", + "price": 199.99, + "description": "3-layer waterproof breathable shell rated at 20,000mm hydrostatic head. Helmet-compatible hood with a single-hand adjustment. Pit-zip vents for temperature control during high output activities. Packs to fist size." + }, + { + "product_id": 5, + "name": "Waterproof Mid Hiking Boot", + "category": "Footwear", + "brand": "TrailTread", + "price": 179.99, + "description": "Full-grain leather upper with a waterproof membrane. Vibram Megagrip outsole provides traction on wet rock and loose trail. Mid-cut ankle collar supports the ankle on uneven terrain. Recommended for day hikes and multi-day trips with a loaded pack." + }, + { + "product_id": 6, + "name": "Trail Running Shoe", + "category": "Footwear", + "brand": "SpeedTrail", + "price": 139.99, + "description": "Lightweight trail runner with a rock plate and aggressive lug pattern. 8mm drop and a wide toe box promote natural foot strike. Drainage ports shed water quickly on stream crossings. 
Built for technical singletrack and ultra-distance racing." + }, + { + "product_id": 7, + "name": "Ultralight Backpacking Tent", + "category": "Camping", + "brand": "SilNylon Co", + "price": 349.99, + "description": "Two-person freestanding tent weighing 1.1 kg. Silnylon fly sheds rain and condensation. Interior mesh canopy maximizes airflow on warm nights. Sets up in under three minutes. Rated for three-season use; not designed for heavy snow loads." + }, + { + "product_id": 8, + "name": "20°F Down Sleeping Bag", + "category": "Camping", + "brand": "NightFrost", + "price": 279.99, + "description": "Mummy-cut bag with 800-fill-power hydrophobic down. EN-tested lower limit of -7°C. Footbox baffle prevents cold spots at the toes. YKK zipper with anti-snag tape. Compresses to the size of a Nalgene bottle in the included stuff sack." + }, + { + "product_id": 9, + "name": "Rechargeable Headlamp 350 lm", + "category": "Camping", + "brand": "BrightBeam", + "price": 49.99, + "description": "USB-C rechargeable lamp with a 350-lumen flood beam and a 100-lumen red night-vision mode. IPX4 splash-resistant housing. Single button cycles through brightness levels. Runtime up to 40 hours on low. Tilt mechanism adjusts beam angle hands-free." + }, + { + "product_id": 10, + "name": "Gravity Water Filter", + "category": "Camping", + "brand": "ClearFlow", + "price": 59.99, + "description": "Hollow-fiber gravity filter removes bacteria, protozoa, and microplastics to 0.1 micron. No pumping required — hang the dirty reservoir and let gravity do the work. Filters 1.5 liters per minute. Includes clean and dirty reservoirs and a hydration hose adapter." + }, + { + "product_id": 11, + "name": "Carbon Fiber Trekking Poles", + "category": "Camping", + "brand": "TrailPro", + "price": 119.99, + "description": "100% carbon fiber shaft reduces arm fatigue on long days. Quick-lock mechanism adjusts from 100 to 135 cm in seconds. Cork grip wicks sweat and molds to hand shape over time. 
Tungsten carbide tips with interchangeable rubber feet for paved surfaces." + }, + { + "product_id": 12, + "name": "Noise-Canceling Wireless Headphones", + "category": "Electronics", + "brand": "SoundWave", + "price": 329.99, + "description": "Over-ear headphones with hybrid active noise cancellation that adapts to ambient sound levels. 30-hour battery life. Multipoint pairing connects to two devices simultaneously. Foldable design with a hard carry case. Hi-Res Audio certified with a 4 Hz–40 kHz range." + }, + { + "product_id": 13, + "name": "Wireless Mechanical Keyboard", + "category": "Electronics", + "brand": "KeyForge", + "price": 149.99, + "description": "Tenkeyless layout with hot-swappable tactile switches. Bluetooth 5.0 pairs with up to three devices; a 2.4 GHz dongle provides sub-1ms latency for gaming. PBT keycaps resist shine. Per-key RGB lighting with 15 preset effects. 2000 mAh battery lasts two weeks on a single charge with lighting off." + }, + { + "product_id": 14, + "name": "Portable Laptop Stand", + "category": "Electronics", + "brand": "ErgaDesk", + "price": 49.99, + "description": "Adjustable aluminum stand raises a laptop screen to eye level, reducing neck strain during long work sessions. Six height settings from 15 to 32 cm. Folds flat to 3 mm for bag transport. Supports laptops from 10 to 17 inches and up to 8 kg." + }, + { + "product_id": 15, + "name": "Smart Air Purifier", + "category": "Electronics", + "brand": "PureHome", + "price": 219.99, + "description": "HEPA H13 filter captures 99.97% of particles down to 0.3 microns including pollen, dust mite debris, and pet dander. Activated carbon layer adsorbs VOCs and cooking odors. App-controlled with air quality sensor and auto mode. Covers rooms up to 50 m². Night mode drops fan noise to 22 dB." 
+ }, + { + "product_id": 16, + "name": "Voice-Controlled Smart Speaker", + "category": "Electronics", + "brand": "EchoBox", + "price": 99.99, + "description": "360-degree speaker with a woofer and two tweeters. Built-in voice assistant controls smart home devices, plays music, answers questions, and sets timers. Connects via Wi-Fi and Bluetooth. Multi-room audio links speakers across the home. Privacy mic mute button." + }, + { + "product_id": 17, + "name": "Cast Iron Dutch Oven 5.5 qt", + "category": "Kitchen", + "brand": "IronChef", + "price": 89.99, + "description": "Enameled cast iron with a tight-fitting lid that seals in moisture for braises, stews, and bread baking. Oven-safe to 260°C. Works on all cooktops including induction. Interior cream enamel shows browning clearly. Self-basting dimpled lid. Lifetime warranty against defects." + }, + { + "product_id": 18, + "name": "Burr Coffee Grinder", + "category": "Kitchen", + "brand": "RoastMate", + "price": 79.99, + "description": "40mm stainless steel conical burrs produce consistent grind size from coarse French press to fine espresso. 40g hopper capacity. 18 click-stop settings. Static-reducing grounds bin with a rubber seal. Quiet 120W motor. Removable upper burr for easy cleaning." + }, + { + "product_id": 19, + "name": "10-Piece Stainless Knife Block Set", + "category": "Kitchen", + "brand": "CutMaster", + "price": 159.99, + "description": "High-carbon German stainless blades forged from a single billet for full tang strength. Set includes 8-inch chef's, 8-inch bread, 7-inch santoku, 5-inch utility, 3.5-inch paring, six steak knives, shears, honing rod, and a beechwood block. Blades hand-sharpened to 15° per side." + }, + { + "product_id": 20, + "name": "Pour-Over Coffee Dripper Set", + "category": "Kitchen", + "brand": "BrewCraft", + "price": 44.99, + "description": "Borosilicate glass dripper sits on a matching carafe. Spiral ribs promote even extraction by allowing air to escape uniformly. 
Includes 40 bleached paper filters and a stainless gooseneck pouring kettle. Produces a clean, bright cup that highlights single-origin floral and fruity notes." + }, + { + "product_id": 21, + "name": "Extra-Thick Yoga Mat", + "category": "Fitness", + "brand": "ZenGrip", + "price": 69.99, + "description": "6mm natural rubber mat with a microfiber top layer that grips when wet. Non-slip bottom prevents sliding on hardwood and tile. Alignment lines guide stance width in warrior and standing poses. Rolled dimensions: 61 × 183 cm. Includes carrying strap. Free from latex, PVC, and phthalates." + }, + { + "product_id": 22, + "name": "Vibrating Foam Roller", + "category": "Fitness", + "brand": "RecoverPro", + "price": 89.99, + "description": "High-density EPP foam roller with four built-in vibration frequencies (20–40 Hz). Vibration penetrates deeper tissue than static rolling for myofascial release and delayed-onset muscle soreness. USB rechargeable; 2-hour runtime per charge. Hollow core stores the charging cable." + }, + { + "product_id": 23, + "name": "Resistance Band Set", + "category": "Fitness", + "brand": "FlexBand", + "price": 34.99, + "description": "Five fabric-wrapped loop bands in progressive resistances from 5 to 40 lbs. Non-roll design stays in place during squats, hip thrusts, and lateral walks. Used for glute activation, mobility work, and upper-body accessory exercises. Includes a mesh carry bag and a printed exercise guide." + }, + { + "product_id": 24, + "name": "Insulated Stainless Water Bottle 32 oz", + "category": "Fitness", + "brand": "HydroKeep", + "price": 39.99, + "description": "Double-wall vacuum insulation keeps beverages cold 24 hours or hot 12 hours. 18/8 stainless steel; no plastic liner means no flavor transfer. Wide-mouth lid accepts ice cubes. Compatible with most car cup holders. Powder-coat finish resists dents and scratches." 
+ }, + { + "product_id": 25, + "name": "Compression Running Tights", + "category": "Fitness", + "brand": "PaceWear", + "price": 79.99, + "description": "Four-way stretch fabric with graduated compression from ankle to waist improves circulation and reduces muscle oscillation during runs. UPF 50+ sun protection. Rear zip pocket fits a key or gel. Reflective piping increases visibility in low light. Available in lengths for inseams 28–34 inches." + } +] diff --git a/contrib/vector_search_product_discovery/databricks.yml b/contrib/vector_search_product_discovery/databricks.yml new file mode 100644 index 0000000..1bfedc5 --- /dev/null +++ b/contrib/vector_search_product_discovery/databricks.yml @@ -0,0 +1,46 @@ +# This is a Declarative Automation Bundle definition for vector_search_product_discovery. +# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation. +bundle: + name: vector_search_product_discovery + engine: direct + +variables: + catalog: + description: Unity Catalog catalog name + default: main + schema: + description: Unity Catalog schema name for the product search use case + default: product_search + endpoint_name: + description: Name of the Vector Search endpoint + default: product-search-endpoint + embedding_model: + description: Model serving endpoint used to embed product descriptions + default: databricks-gte-large-en + embedding_dimension: + description: >- + Vector dimension passed to the embedding model and upsert/query notebooks. + Must match the dimension used when the index was created (see resources/index.yml). + default: "1024" + +include: + - resources/*.yml + +targets: + dev: + # The default target uses 'mode: development' to create a development copy. + # - Deployed resources get prefixed with '[dev my_user_name]' + # - Any job schedules and triggers are paused by default. + # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. 
+ mode: development + default: true + workspace: + host: https://your-workspace.cloud.databricks.com + + prod: + mode: production + workspace: + host: https://your-workspace.cloud.databricks.com + permissions: + - group_name: users + level: CAN_VIEW diff --git a/contrib/vector_search_product_discovery/resources/endpoint.yml b/contrib/vector_search_product_discovery/resources/endpoint.yml new file mode 100644 index 0000000..3c08b72 --- /dev/null +++ b/contrib/vector_search_product_discovery/resources/endpoint.yml @@ -0,0 +1,5 @@ +resources: + vector_search_endpoints: + product_search_endpoint: + name: ${var.endpoint_name} + endpoint_type: STANDARD diff --git a/contrib/vector_search_product_discovery/resources/index.yml b/contrib/vector_search_product_discovery/resources/index.yml new file mode 100644 index 0000000..92e52c4 --- /dev/null +++ b/contrib/vector_search_product_discovery/resources/index.yml @@ -0,0 +1,16 @@ +resources: + vector_search_indexes: + product_index: + name: ${var.catalog}.${var.schema}.product_index + endpoint_name: ${resources.vector_search_endpoints.product_search_endpoint.name} + primary_key: product_id + index_type: DIRECT_ACCESS + direct_access_index_spec: + # schema_json is a flat {"column_name": "type"} map serialised as a JSON string. + # description_vector stores the pre-computed embedding produced by 01_upsert_products.py. 
+ schema_json: >- + {"product_id":"int","name":"string","category":"string","brand":"string", + "price":"float","description":"string","description_vector":"array<float>"} + embedding_vector_columns: + - name: description_vector + embedding_dimension: 1024 diff --git a/contrib/vector_search_product_discovery/resources/query_demo.job.yml b/contrib/vector_search_product_discovery/resources/query_demo.job.yml new file mode 100644 index 0000000..1f6e50d --- /dev/null +++ b/contrib/vector_search_product_discovery/resources/query_demo.job.yml @@ -0,0 +1,38 @@ +resources: + jobs: + product_discovery_query: + name: product_discovery_query + + parameters: + - name: index_name + default: ${resources.vector_search_indexes.product_index.name} + - name: endpoint_name + default: ${resources.vector_search_endpoints.product_search_endpoint.name} + - name: embedding_model + default: ${var.embedding_model} + - name: embedding_dimension + default: ${var.embedding_dimension} + - name: query + default: warm insulated jacket for cold mountain weather + - name: num_results + default: "5" + + environments: + - environment_key: serverless_env + spec: + client: "3" + dependencies: + - databricks-vectorsearch + + tasks: + - task_key: query + environment_key: serverless_env + notebook_task: + notebook_path: ../src/02_query_demo.py + base_parameters: + index_name: "{{job.parameters.index_name}}" + endpoint_name: "{{job.parameters.endpoint_name}}" + embedding_model: "{{job.parameters.embedding_model}}" + embedding_dimension: "{{job.parameters.embedding_dimension}}" + query: "{{job.parameters.query}}" + num_results: "{{job.parameters.num_results}}" diff --git a/contrib/vector_search_product_discovery/resources/schema.yml b/contrib/vector_search_product_discovery/resources/schema.yml new file mode 100644 index 0000000..98e16dd --- /dev/null +++ b/contrib/vector_search_product_discovery/resources/schema.yml @@ -0,0 +1,6 @@ +resources: + schemas: + product_search_schema: + catalog_name: 
${var.catalog} + 
name: ${var.schema} + comment: "Schema for the vector search product discovery example" diff --git a/contrib/vector_search_product_discovery/resources/setup_job.job.yml b/contrib/vector_search_product_discovery/resources/setup_job.job.yml new file mode 100644 index 0000000..f2e2873 --- /dev/null +++ b/contrib/vector_search_product_discovery/resources/setup_job.job.yml @@ -0,0 +1,38 @@ +resources: + jobs: + product_discovery_setup: + name: product_discovery_setup + + parameters: + - name: index_name + default: ${resources.vector_search_indexes.product_index.name} + - name: endpoint_name + default: ${resources.vector_search_endpoints.product_search_endpoint.name} + - name: embedding_model + default: ${var.embedding_model} + - name: embedding_dimension + default: ${var.embedding_dimension} + - name: data_path + default: ${workspace.file_path}/data/products.json + + environments: + - environment_key: serverless_env + spec: + client: "3" + dependencies: + - databricks-vectorsearch + + tasks: + - task_key: upsert_products + description: Load products from JSON, embed descriptions, and upsert into the Vector Search index + environment_key: serverless_env + notebook_task: + notebook_path: ../src/01_upsert_products.py + base_parameters: + index_name: "{{job.parameters.index_name}}" + endpoint_name: "{{job.parameters.endpoint_name}}" + embedding_model: "{{job.parameters.embedding_model}}" + embedding_dimension: "{{job.parameters.embedding_dimension}}" + data_path: "{{job.parameters.data_path}}" + + max_concurrent_runs: 1 diff --git a/contrib/vector_search_product_discovery/src/01_upsert_products.py b/contrib/vector_search_product_discovery/src/01_upsert_products.py new file mode 100644 index 0000000..2bbebe6 --- /dev/null +++ b/contrib/vector_search_product_discovery/src/01_upsert_products.py @@ -0,0 +1,54 @@ +# Databricks notebook source +# MAGIC %md +# MAGIC # Upsert Products into Vector Search Index +# MAGIC +# MAGIC Reads the product catalog from the JSON file deployed 
with the bundle, +# MAGIC embeds each product description, then upserts all records into the Vector +# MAGIC Search index. Re-running is safe — upsert is idempotent on `product_id`. + +# COMMAND ---------- + +dbutils.widgets.text("index_name", "main.product_search.product_index", "Index name (3-part UC name)") +dbutils.widgets.text("endpoint_name", "product-search-endpoint", "Endpoint name") +dbutils.widgets.text("embedding_model", "databricks-gte-large-en", "Embedding model endpoint") +dbutils.widgets.text("embedding_dimension", "1024", "Embedding dimension") +dbutils.widgets.text("data_path", "", "Path to products.json") + +index_name = dbutils.widgets.get("index_name") +endpoint_name = dbutils.widgets.get("endpoint_name") +embedding_model = dbutils.widgets.get("embedding_model") +embedding_dim = int(dbutils.widgets.get("embedding_dimension")) +data_path = dbutils.widgets.get("data_path") + +# COMMAND ---------- + +import json +from mlflow.deployments import get_deploy_client + +with open(data_path) as f: + products = json.load(f) + +embed = get_deploy_client("databricks") +descriptions = [p["description"] for p in products] + +vectors = [] +batch_size = 32 +for i in range(0, len(descriptions), batch_size): + response = embed.predict( + endpoint=embedding_model, + inputs={"input": descriptions[i : i + batch_size], "dimensions": embedding_dim}, + ) + vectors.extend(item["embedding"] for item in response["data"]) + +for product, vector in zip(products, vectors): + product["description_vector"] = vector + +# COMMAND ---------- + +from databricks.vector_search.client import VectorSearchClient + +vsc = VectorSearchClient(disable_notice=True) +index = vsc.get_index(endpoint_name=endpoint_name, index_name=index_name) +index.upsert(products) + +print(f"Upserted {len(products)} products into {index_name}") diff --git a/contrib/vector_search_product_discovery/src/02_query_demo.py b/contrib/vector_search_product_discovery/src/02_query_demo.py new file mode 100644 index 
0000000..74473a4 --- /dev/null +++ b/contrib/vector_search_product_discovery/src/02_query_demo.py @@ -0,0 +1,107 @@ +# Databricks notebook source +# MAGIC %md +# MAGIC # Semantic Product Search Demo +# MAGIC +# MAGIC Queries the Vector Search index to find products that match a natural-language +# MAGIC description. Try queries that would fail keyword search — e.g. *"something to +# MAGIC keep my coffee hot all day"* or *"gear for sleeping outside in freezing weather"*. + +# COMMAND ---------- + +dbutils.widgets.text("index_name", "main.product_search.product_index", "Index name (3-part UC name)") +dbutils.widgets.text("endpoint_name", "product-search-endpoint", "Endpoint name") +dbutils.widgets.text("embedding_model", "databricks-gte-large-en", "Embedding model endpoint") +dbutils.widgets.text("embedding_dimension", "1024", "Embedding dimension") +dbutils.widgets.text("query", "warm insulated jacket for cold mountain weather", "Search query") +dbutils.widgets.text("num_results", "5", "Number of results") + +index_name = dbutils.widgets.get("index_name") +endpoint_name = dbutils.widgets.get("endpoint_name") +embedding_model = dbutils.widgets.get("embedding_model") +embedding_dim = int(dbutils.widgets.get("embedding_dimension")) +query = dbutils.widgets.get("query") +num_results = int(dbutils.widgets.get("num_results")) + +# COMMAND ---------- + +from mlflow.deployments import get_deploy_client +from databricks.vector_search.client import VectorSearchClient + +embed = get_deploy_client("databricks") +vsc = VectorSearchClient(disable_notice=True) +index = vsc.get_index(endpoint_name=endpoint_name, index_name=index_name) + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Run a query + +# COMMAND ---------- + +import pandas as pd + +# Direct access indexes don't auto-embed queries — embed the query text first. 
+query_vector = embed.predict( + endpoint=embedding_model, + inputs={"input": [query], "dimensions": embedding_dim}, +)["data"][0]["embedding"] + +results = index.similarity_search( + query_vector=query_vector, + columns=["product_id", "name", "category", "brand", "price", "description"], + num_results=num_results, +) + +result_columns = ["product_id", "name", "category", "brand", "price", "description", "score"] +rows = results["result"]["data_array"] +df = pd.DataFrame(rows, columns=result_columns) +df.index += 1 +print(df.to_string()) + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Example queries to try +# MAGIC +# MAGIC These queries use different vocabulary than the product descriptions, demonstrating +# MAGIC that semantic search finds the right products even without exact keyword matches. +# MAGIC +# MAGIC | Query | Expected top results | +# MAGIC |---|---| +# MAGIC | `jacket built for mountaineering in sub-zero conditions` | Alpine Thermal Jacket | +# MAGIC | `something to keep beverages hot or cold all day` | Insulated Stainless Water Bottle | +# MAGIC | `footwear for slippery wet trails` | Waterproof Mid Hiking Boot | +# MAGIC | `staying warm overnight in below-freezing temperatures outdoors` | 20°F Down Sleeping Bag | +# MAGIC | `grinding beans at home for a fresh espresso` | Burr Coffee Grinder | +# MAGIC | `lightweight shelter for a solo overnight trip` | Ultralight Backpacking Tent | +# MAGIC | `reduce muscle soreness after a hard workout` | Vibrating Foam Roller | +# MAGIC | `improve posture while working at a computer` | Portable Laptop Stand | + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## Batch comparison across several queries + +# COMMAND ---------- + +example_queries = [ + "jacket built for mountaineering in sub-zero conditions", + "footwear for slippery wet trails", + "staying warm overnight in below-freezing temperatures outdoors", + "grinding beans at home for a fresh espresso", + "reduce muscle soreness after a hard workout", +] + 
+print(f"{'Query':<55} {'Top result'}") +print("-" * 90) + +for q in example_queries: + qv = embed.predict( + endpoint=embedding_model, + inputs={"input": [q], "dimensions": embedding_dim}, + )["data"][0]["embedding"] + r = index.similarity_search(query_vector=qv, columns=["name", "category"], num_results=1) + top = r["result"]["data_array"] + top_name = top[0][0] if top else "—" + top_cat = top[0][1] if top else "" + print(f"{q:<55} {top_name} ({top_cat})") From 2e83d4c82d177af6b4746a8da7a4b74a04f0c9f1 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Fri, 1 May 2026 11:16:35 +0200 Subject: [PATCH 02/14] Apply ruff format to upsert/query notebooks Co-authored-by: Isaac --- .../src/01_upsert_products.py | 18 +++++---- .../src/02_query_demo.py | 40 +++++++++++++------ 2 files changed, 39 insertions(+), 19 deletions(-) diff --git a/contrib/vector_search_product_discovery/src/01_upsert_products.py b/contrib/vector_search_product_discovery/src/01_upsert_products.py index 2bbebe6..0d16d55 100644 --- a/contrib/vector_search_product_discovery/src/01_upsert_products.py +++ b/contrib/vector_search_product_discovery/src/01_upsert_products.py @@ -8,17 +8,21 @@ # COMMAND ---------- -dbutils.widgets.text("index_name", "main.product_search.product_index", "Index name (3-part UC name)") +dbutils.widgets.text( + "index_name", "main.product_search.product_index", "Index name (3-part UC name)" +) dbutils.widgets.text("endpoint_name", "product-search-endpoint", "Endpoint name") -dbutils.widgets.text("embedding_model", "databricks-gte-large-en", "Embedding model endpoint") +dbutils.widgets.text( + "embedding_model", "databricks-gte-large-en", "Embedding model endpoint" +) dbutils.widgets.text("embedding_dimension", "1024", "Embedding dimension") dbutils.widgets.text("data_path", "", "Path to products.json") -index_name = dbutils.widgets.get("index_name") -endpoint_name = dbutils.widgets.get("endpoint_name") -embedding_model = dbutils.widgets.get("embedding_model") -embedding_dim = 
int(dbutils.widgets.get("embedding_dimension")) -data_path = dbutils.widgets.get("data_path") +index_name = dbutils.widgets.get("index_name") +endpoint_name = dbutils.widgets.get("endpoint_name") +embedding_model = dbutils.widgets.get("embedding_model") +embedding_dim = int(dbutils.widgets.get("embedding_dimension")) +data_path = dbutils.widgets.get("data_path") # COMMAND ---------- diff --git a/contrib/vector_search_product_discovery/src/02_query_demo.py b/contrib/vector_search_product_discovery/src/02_query_demo.py index 74473a4..e00ae97 100644 --- a/contrib/vector_search_product_discovery/src/02_query_demo.py +++ b/contrib/vector_search_product_discovery/src/02_query_demo.py @@ -8,19 +8,25 @@ # COMMAND ---------- -dbutils.widgets.text("index_name", "main.product_search.product_index", "Index name (3-part UC name)") +dbutils.widgets.text( + "index_name", "main.product_search.product_index", "Index name (3-part UC name)" +) dbutils.widgets.text("endpoint_name", "product-search-endpoint", "Endpoint name") -dbutils.widgets.text("embedding_model", "databricks-gte-large-en", "Embedding model endpoint") +dbutils.widgets.text( + "embedding_model", "databricks-gte-large-en", "Embedding model endpoint" +) dbutils.widgets.text("embedding_dimension", "1024", "Embedding dimension") -dbutils.widgets.text("query", "warm insulated jacket for cold mountain weather", "Search query") +dbutils.widgets.text( + "query", "warm insulated jacket for cold mountain weather", "Search query" +) dbutils.widgets.text("num_results", "5", "Number of results") -index_name = dbutils.widgets.get("index_name") -endpoint_name = dbutils.widgets.get("endpoint_name") -embedding_model = dbutils.widgets.get("embedding_model") -embedding_dim = int(dbutils.widgets.get("embedding_dimension")) -query = dbutils.widgets.get("query") -num_results = int(dbutils.widgets.get("num_results")) +index_name = dbutils.widgets.get("index_name") +endpoint_name = dbutils.widgets.get("endpoint_name") +embedding_model = 
dbutils.widgets.get("embedding_model") +embedding_dim = int(dbutils.widgets.get("embedding_dimension")) +query = dbutils.widgets.get("query") +num_results = int(dbutils.widgets.get("num_results")) # COMMAND ---------- @@ -52,7 +58,15 @@ num_results=num_results, ) -result_columns = ["product_id", "name", "category", "brand", "price", "description", "score"] +result_columns = [ + "product_id", + "name", + "category", + "brand", + "price", + "description", + "score", +] rows = results["result"]["data_array"] df = pd.DataFrame(rows, columns=result_columns) df.index += 1 @@ -100,8 +114,10 @@ endpoint=embedding_model, inputs={"input": [q], "dimensions": embedding_dim}, )["data"][0]["embedding"] - r = index.similarity_search(query_vector=qv, columns=["name", "category"], num_results=1) + r = index.similarity_search( + query_vector=qv, columns=["name", "category"], num_results=1 + ) top = r["result"]["data_array"] top_name = top[0][0] if top else "—" - top_cat = top[0][1] if top else "" + top_cat = top[0][1] if top else "" print(f"{q:<55} {top_name} ({top_cat})") From 880182099226cec913bf882f432084ad88445c84 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Mon, 1 Jun 2026 15:26:35 +0200 Subject: [PATCH 03/14] Use schema resource reference --- contrib/vector_search_product_discovery/resources/index.yml | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/contrib/vector_search_product_discovery/resources/index.yml b/contrib/vector_search_product_discovery/resources/index.yml index 92e52c4..ba38450 100644 --- a/contrib/vector_search_product_discovery/resources/index.yml +++ b/contrib/vector_search_product_discovery/resources/index.yml @@ -1,7 +1,10 @@ resources: vector_search_indexes: product_index: - name: ${var.catalog}.${var.schema}.product_index + # Reference the schema *resource* (not ${var.schema}) so the index lands in the + # actually-deployed schema. In development mode the schema name is prefixed, + # and this reference picks that up. 
+ name: ${var.catalog}.${resources.schemas.product_search_schema.name}.product_index endpoint_name: ${resources.vector_search_endpoints.product_search_endpoint.name} primary_key: product_id index_type: DIRECT_ACCESS From 3a06d9fa00c732a20b93c4bed87417a3d8a7f2d5 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Tue, 2 Jun 2026 14:04:10 +0200 Subject: [PATCH 04/14] Cleanup --- .../vector_search_product_discovery/README.md | 2 +- .../databricks.yml | 16 +++++----------- .../resources/index.yml | 3 --- 3 files changed, 6 insertions(+), 15 deletions(-) diff --git a/contrib/vector_search_product_discovery/README.md b/contrib/vector_search_product_discovery/README.md index 38ff3f6..133efef 100644 --- a/contrib/vector_search_product_discovery/README.md +++ b/contrib/vector_search_product_discovery/README.md @@ -50,7 +50,7 @@ ranked results databricks auth login --host https://your-workspace.cloud.databricks.com ``` -2. **Configure** `databricks.yml` — set the `dev` workspace host and any variable overrides +2. **Configure** `databricks.yml` — set the workspace host and any variable overrides 3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json` ```bash diff --git a/contrib/vector_search_product_discovery/databricks.yml b/contrib/vector_search_product_discovery/databricks.yml index 1bfedc5..e79df17 100644 --- a/contrib/vector_search_product_discovery/databricks.yml +++ b/contrib/vector_search_product_discovery/databricks.yml @@ -6,7 +6,7 @@ bundle: variables: catalog: - description: Unity Catalog catalog name + description: Unity Catalog catalog name for an existing catalog default: main schema: description: Unity Catalog schema name for the product search use case @@ -27,20 +27,14 @@ include: - resources/*.yml targets: - dev: - # The default target uses 'mode: development' to create a development copy. - # - Deployed resources get prefixed with '[dev my_user_name]' - # - Any job schedules and triggers are paused by default. 
- # See also https://docs.databricks.com/dev-tools/bundles/deployment-modes.html. - mode: development - default: true - workspace: - host: https://your-workspace.cloud.databricks.com - prod: mode: production + default: true workspace: host: https://your-workspace.cloud.databricks.com + root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} permissions: + - user_name: ${workspace.current_user.userName} + level: CAN_MANAGE - group_name: users level: CAN_VIEW diff --git a/contrib/vector_search_product_discovery/resources/index.yml b/contrib/vector_search_product_discovery/resources/index.yml index ba38450..684bf68 100644 --- a/contrib/vector_search_product_discovery/resources/index.yml +++ b/contrib/vector_search_product_discovery/resources/index.yml @@ -1,9 +1,6 @@ resources: vector_search_indexes: product_index: - # Reference the schema *resource* (not ${var.schema}) so the index lands in the - # actually-deployed schema. In development mode the schema name is prefixed, - # and this reference picks that up. 
name: ${var.catalog}.${resources.schemas.product_search_schema.name}.product_index endpoint_name: ${resources.vector_search_endpoints.product_search_endpoint.name} primary_key: product_id From fa94643d0d26e08a70ca345d6a1be17fe77b22d9 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Tue, 2 Jun 2026 14:55:52 +0200 Subject: [PATCH 05/14] Move files --- .../vector_search_product_discovery/README.md | 0 .../vector_search_product_discovery/data/products.json | 0 .../vector_search_product_discovery/databricks.yml | 0 .../vector_search_product_discovery/resources/endpoint.yml | 0 .../vector_search_product_discovery/resources/index.yml | 0 .../vector_search_product_discovery/resources/query_demo.job.yml | 0 .../vector_search_product_discovery/resources/schema.yml | 0 .../vector_search_product_discovery/resources/setup_job.job.yml | 0 .../vector_search_product_discovery/src/01_upsert_products.py | 0 .../vector_search_product_discovery/src/02_query_demo.py | 0 10 files changed, 0 insertions(+), 0 deletions(-) rename {contrib => knowledge_base}/vector_search_product_discovery/README.md (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/data/products.json (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/databricks.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/resources/endpoint.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/resources/index.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/resources/query_demo.job.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/resources/schema.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/resources/setup_job.job.yml (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/src/01_upsert_products.py (100%) rename {contrib => knowledge_base}/vector_search_product_discovery/src/02_query_demo.py (100%) diff --git 
a/contrib/vector_search_product_discovery/README.md b/knowledge_base/vector_search_product_discovery/README.md similarity index 100% rename from contrib/vector_search_product_discovery/README.md rename to knowledge_base/vector_search_product_discovery/README.md diff --git a/contrib/vector_search_product_discovery/data/products.json b/knowledge_base/vector_search_product_discovery/data/products.json similarity index 100% rename from contrib/vector_search_product_discovery/data/products.json rename to knowledge_base/vector_search_product_discovery/data/products.json diff --git a/contrib/vector_search_product_discovery/databricks.yml b/knowledge_base/vector_search_product_discovery/databricks.yml similarity index 100% rename from contrib/vector_search_product_discovery/databricks.yml rename to knowledge_base/vector_search_product_discovery/databricks.yml diff --git a/contrib/vector_search_product_discovery/resources/endpoint.yml b/knowledge_base/vector_search_product_discovery/resources/endpoint.yml similarity index 100% rename from contrib/vector_search_product_discovery/resources/endpoint.yml rename to knowledge_base/vector_search_product_discovery/resources/endpoint.yml diff --git a/contrib/vector_search_product_discovery/resources/index.yml b/knowledge_base/vector_search_product_discovery/resources/index.yml similarity index 100% rename from contrib/vector_search_product_discovery/resources/index.yml rename to knowledge_base/vector_search_product_discovery/resources/index.yml diff --git a/contrib/vector_search_product_discovery/resources/query_demo.job.yml b/knowledge_base/vector_search_product_discovery/resources/query_demo.job.yml similarity index 100% rename from contrib/vector_search_product_discovery/resources/query_demo.job.yml rename to knowledge_base/vector_search_product_discovery/resources/query_demo.job.yml diff --git a/contrib/vector_search_product_discovery/resources/schema.yml b/knowledge_base/vector_search_product_discovery/resources/schema.yml 
similarity index 100% rename from contrib/vector_search_product_discovery/resources/schema.yml rename to knowledge_base/vector_search_product_discovery/resources/schema.yml diff --git a/contrib/vector_search_product_discovery/resources/setup_job.job.yml b/knowledge_base/vector_search_product_discovery/resources/setup_job.job.yml similarity index 100% rename from contrib/vector_search_product_discovery/resources/setup_job.job.yml rename to knowledge_base/vector_search_product_discovery/resources/setup_job.job.yml diff --git a/contrib/vector_search_product_discovery/src/01_upsert_products.py b/knowledge_base/vector_search_product_discovery/src/01_upsert_products.py similarity index 100% rename from contrib/vector_search_product_discovery/src/01_upsert_products.py rename to knowledge_base/vector_search_product_discovery/src/01_upsert_products.py diff --git a/contrib/vector_search_product_discovery/src/02_query_demo.py b/knowledge_base/vector_search_product_discovery/src/02_query_demo.py similarity index 100% rename from contrib/vector_search_product_discovery/src/02_query_demo.py rename to knowledge_base/vector_search_product_discovery/src/02_query_demo.py From 24daec93e94c0d07f93b5456c6fc76d26689de1d Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Tue, 2 Jun 2026 15:02:39 +0200 Subject: [PATCH 06/14] Format schema_json --- .../resources/index.yml | 13 ++++++++++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/knowledge_base/vector_search_product_discovery/resources/index.yml b/knowledge_base/vector_search_product_discovery/resources/index.yml index 684bf68..e812f28 100644 --- a/knowledge_base/vector_search_product_discovery/resources/index.yml +++ b/knowledge_base/vector_search_product_discovery/resources/index.yml @@ -8,9 +8,16 @@ resources: direct_access_index_spec: # schema_json is a flat {"column_name": "type"} map serialised as a JSON string. # description_vector stores the pre-computed embedding produced by 01_upsert_products.py. 
- schema_json: >- - {"product_id":"int","name":"string","category":"string","brand":"string", - "price":"float","description":"string","description_vector":"array<float>"} + schema_json: |- + { + "product_id": "int", + "name": "string", + "category": "string", + "brand": "string", + "price": "float", + "description": "string", + "description_vector": "array<float>" + } embedding_vector_columns: - name: description_vector embedding_dimension: 1024 From a9a8957ca768c5a98d297f083d3ebef927fc4fff Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Tue, 2 Jun 2026 15:04:10 +0200 Subject: [PATCH 07/14] Strip .job.yml from .yml files --- knowledge_base/vector_search_product_discovery/README.md | 4 ++-- .../resources/{query_demo.job.yml => query_demo.yml} | 0 .../resources/{setup_job.job.yml => setup_job.yml} | 0 3 files changed, 2 insertions(+), 2 deletions(-) rename knowledge_base/vector_search_product_discovery/resources/{query_demo.job.yml => query_demo.yml} (100%) rename knowledge_base/vector_search_product_discovery/resources/{setup_job.job.yml => setup_job.yml} (100%) diff --git a/knowledge_base/vector_search_product_discovery/README.md b/knowledge_base/vector_search_product_discovery/README.md index 133efef..249b1e2 100644 --- a/knowledge_base/vector_search_product_discovery/README.md +++ b/knowledge_base/vector_search_product_discovery/README.md @@ -143,8 +143,8 @@ table and it keeps itself up to date. 
Replace `index_type: DIRECT_ACCESS` and │ ├── schema.yml # Unity Catalog schema │ ├── endpoint.yml # Vector Search endpoint │ ├── index.yml # Direct Access index -│ ├── setup_job.job.yml # Embed + upsert job -│ └── query_demo.job.yml # Query job (--params "query=...") +│ ├── setup_job.yml # Embed + upsert job +│ └── query_demo.yml # Query job (--params "query=...") └── src/ ├── 01_upsert_products.py # Reads products.json, embeds, calls upsert_data └── 02_query_demo.py # Semantic search — runs as job or interactively diff --git a/knowledge_base/vector_search_product_discovery/resources/query_demo.job.yml b/knowledge_base/vector_search_product_discovery/resources/query_demo.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/query_demo.job.yml rename to knowledge_base/vector_search_product_discovery/resources/query_demo.yml diff --git a/knowledge_base/vector_search_product_discovery/resources/setup_job.job.yml b/knowledge_base/vector_search_product_discovery/resources/setup_job.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/setup_job.job.yml rename to knowledge_base/vector_search_product_discovery/resources/setup_job.yml From 54777527dba84b15f371738449e07fcaf6c5f665 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Wed, 3 Jun 2026 14:25:09 +0200 Subject: [PATCH 08/14] Anonymise product brands Co-authored-by: Isaac --- .../data/products.json | 54 +++++++++---------- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/knowledge_base/vector_search_product_discovery/data/products.json b/knowledge_base/vector_search_product_discovery/data/products.json index 96259cd..ddf4be2 100644 --- a/knowledge_base/vector_search_product_discovery/data/products.json +++ b/knowledge_base/vector_search_product_discovery/data/products.json @@ -3,7 +3,7 @@ "product_id": 1, "name": "Alpine Thermal Jacket", "category": "Outdoor Clothing", - "brand": "SummitGear", + "brand": "Highcairn", "price": 
289.99, "description": "Insulated hardshell designed for alpine conditions. Features a windproof outer layer, sealed seams, and a 700-fill-power down inner. Packs into its own pocket. Ideal for mountaineering, ski touring, and above-treeline travel in sub-zero temperatures." }, @@ -11,7 +11,7 @@ "product_id": 2, "name": "Merino Wool Base Layer Top", "category": "Outdoor Clothing", - "brand": "WoolTech", + "brand": "Wollund", "price": 89.99, "description": "Next-to-skin mid-weight top made from 100% New Zealand merino wool. Naturally temperature-regulating and odor-resistant. Flatlock seams prevent chafing on long days. Worn as a standalone layer or under a shell in cold weather." }, @@ -19,7 +19,7 @@ "product_id": 3, "name": "Softshell Fleece Jacket", "category": "Outdoor Clothing", - "brand": "TrailRidge", + "brand": "Glidewen", "price": 149.99, "description": "Four-way stretch softshell with a bonded fleece backer. Wind-resistant without being fully waterproof — ideal as a mid-layer or standalone jacket on dry, cool days. Two hand pockets and a chest zip pocket." }, @@ -27,7 +27,7 @@ "product_id": 4, "name": "Rain Jacket with Hood", "category": "Outdoor Clothing", - "brand": "StormShield", + "brand": "Stormvel", "price": 199.99, "description": "3-layer waterproof breathable shell rated at 20,000mm hydrostatic head. Helmet-compatible hood with a single-hand adjustment. Pit-zip vents for temperature control during high output activities. Packs to fist size." }, @@ -35,15 +35,15 @@ "product_id": 5, "name": "Waterproof Mid Hiking Boot", "category": "Footwear", - "brand": "TrailTread", + "brand": "Treadwen", "price": 179.99, - "description": "Full-grain leather upper with a waterproof membrane. Vibram Megagrip outsole provides traction on wet rock and loose trail. Mid-cut ankle collar supports the ankle on uneven terrain. Recommended for day hikes and multi-day trips with a loaded pack." + "description": "Full-grain leather upper with a waterproof membrane. 
A grippy rubber outsole bites into wet rock and loose trail. Mid-cut ankle collar supports the ankle on uneven terrain. Recommended for day hikes and multi-day trips with a loaded pack." }, { "product_id": 6, "name": "Trail Running Shoe", "category": "Footwear", - "brand": "SpeedTrail", + "brand": "Fellrun", "price": 139.99, "description": "Lightweight trail runner with a rock plate and aggressive lug pattern. 8mm drop and a wide toe box promote natural foot strike. Drainage ports shed water quickly on stream crossings. Built for technical singletrack and ultra-distance racing." }, @@ -51,7 +51,7 @@ "product_id": 7, "name": "Ultralight Backpacking Tent", "category": "Camping", - "brand": "SilNylon Co", + "brand": "Tarnost", "price": 349.99, "description": "Two-person freestanding tent weighing 1.1 kg. Silnylon fly sheds rain and condensation. Interior mesh canopy maximizes airflow on warm nights. Sets up in under three minutes. Rated for three-season use; not designed for heavy snow loads." }, @@ -59,15 +59,15 @@ "product_id": 8, "name": "20°F Down Sleeping Bag", "category": "Camping", - "brand": "NightFrost", + "brand": "Frostlin", "price": 279.99, - "description": "Mummy-cut bag with 800-fill-power hydrophobic down. EN-tested lower limit of -7°C. Footbox baffle prevents cold spots at the toes. YKK zipper with anti-snag tape. Compresses to the size of a Nalgene bottle in the included stuff sack." + "description": "Mummy-cut bag with 800-fill-power hydrophobic down. EN-tested lower limit of -7°C. Footbox baffle prevents cold spots at the toes. Corrosion-resistant zipper with anti-snag tape. Compresses to the size of a 1 L water bottle in the included stuff sack." }, { "product_id": 9, "name": "Rechargeable Headlamp 350 lm", "category": "Camping", - "brand": "BrightBeam", + "brand": "Beamwick", "price": 49.99, "description": "USB-C rechargeable lamp with a 350-lumen flood beam and a 100-lumen red night-vision mode. IPX4 splash-resistant housing. 
Single button cycles through brightness levels. Runtime up to 40 hours on low. Tilt mechanism adjusts beam angle hands-free." }, @@ -75,7 +75,7 @@ "product_id": 10, "name": "Gravity Water Filter", "category": "Camping", - "brand": "ClearFlow", + "brand": "Streamwell", "price": 59.99, "description": "Hollow-fiber gravity filter removes bacteria, protozoa, and microplastics to 0.1 micron. No pumping required — hang the dirty reservoir and let gravity do the work. Filters 1.5 liters per minute. Includes clean and dirty reservoirs and a hydration hose adapter." }, @@ -83,7 +83,7 @@ "product_id": 11, "name": "Carbon Fiber Trekking Poles", "category": "Camping", - "brand": "TrailPro", + "brand": "Polaract", "price": 119.99, "description": "100% carbon fiber shaft reduces arm fatigue on long days. Quick-lock mechanism adjusts from 100 to 135 cm in seconds. Cork grip wicks sweat and molds to hand shape over time. Tungsten carbide tips with interchangeable rubber feet for paved surfaces." }, @@ -91,7 +91,7 @@ "product_id": 12, "name": "Noise-Canceling Wireless Headphones", "category": "Electronics", - "brand": "SoundWave", + "brand": "Aurivox", "price": 329.99, "description": "Over-ear headphones with hybrid active noise cancellation that adapts to ambient sound levels. 30-hour battery life. Multipoint pairing connects to two devices simultaneously. Foldable design with a hard carry case. Hi-Res Audio certified with a 4 Hz–40 kHz range." }, @@ -99,7 +99,7 @@ "product_id": 13, "name": "Wireless Mechanical Keyboard", "category": "Electronics", - "brand": "KeyForge", + "brand": "Clackton", "price": 149.99, "description": "Tenkeyless layout with hot-swappable tactile switches. Bluetooth 5.0 pairs with up to three devices; a 2.4 GHz dongle provides sub-1ms latency for gaming. PBT keycaps resist shine. Per-key RGB lighting with 15 preset effects. 2000 mAh battery lasts two weeks on a single charge with lighting off." 
}, @@ -107,7 +107,7 @@ "product_id": 14, "name": "Portable Laptop Stand", "category": "Electronics", - "brand": "ErgaDesk", + "brand": "Deskwen", "price": 49.99, "description": "Adjustable aluminum stand raises a laptop screen to eye level, reducing neck strain during long work sessions. Six height settings from 15 to 32 cm. Folds flat to 3 mm for bag transport. Supports laptops from 10 to 17 inches and up to 8 kg." }, @@ -115,7 +115,7 @@ "product_id": 15, "name": "Smart Air Purifier", "category": "Electronics", - "brand": "PureHome", + "brand": "Puralto", "price": 219.99, "description": "HEPA H13 filter captures 99.97% of particles down to 0.3 microns including pollen, dust mite debris, and pet dander. Activated carbon layer adsorbs VOCs and cooking odors. App-controlled with air quality sensor and auto mode. Covers rooms up to 50 m². Night mode drops fan noise to 22 dB." }, @@ -123,7 +123,7 @@ "product_id": 16, "name": "Voice-Controlled Smart Speaker", "category": "Electronics", - "brand": "EchoBox", + "brand": "Voxhome", "price": 99.99, "description": "360-degree speaker with a woofer and two tweeters. Built-in voice assistant controls smart home devices, plays music, answers questions, and sets timers. Connects via Wi-Fi and Bluetooth. Multi-room audio links speakers across the home. Privacy mic mute button." }, @@ -131,7 +131,7 @@ "product_id": 17, "name": "Cast Iron Dutch Oven 5.5 qt", "category": "Kitchen", - "brand": "IronChef", + "brand": "Ferralto", "price": 89.99, "description": "Enameled cast iron with a tight-fitting lid that seals in moisture for braises, stews, and bread baking. Oven-safe to 260°C. Works on all cooktops including induction. Interior cream enamel shows browning clearly. Self-basting dimpled lid. Lifetime warranty against defects." 
}, @@ -139,7 +139,7 @@ "product_id": 18, "name": "Burr Coffee Grinder", "category": "Kitchen", - "brand": "RoastMate", + "brand": "Roastel", "price": 79.99, "description": "40mm stainless steel conical burrs produce consistent grind size from coarse French press to fine espresso. 40g hopper capacity. 18 click-stop settings. Static-reducing grounds bin with a rubber seal. Quiet 120W motor. Removable upper burr for easy cleaning." }, @@ -147,7 +147,7 @@ "product_id": 19, "name": "10-Piece Stainless Knife Block Set", "category": "Kitchen", - "brand": "CutMaster", + "brand": "Bladely", "price": 159.99, "description": "High-carbon German stainless blades forged from a single billet for full tang strength. Set includes 8-inch chef's, 8-inch bread, 7-inch santoku, 5-inch utility, 3.5-inch paring, six steak knives, shears, honing rod, and a beechwood block. Blades hand-sharpened to 15° per side." }, @@ -155,7 +155,7 @@ "product_id": 20, "name": "Pour-Over Coffee Dripper Set", "category": "Kitchen", - "brand": "BrewCraft", + "brand": "Pournal", "price": 44.99, "description": "Borosilicate glass dripper sits on a matching carafe. Spiral ribs promote even extraction by allowing air to escape uniformly. Includes 40 bleached paper filters and a stainless gooseneck pouring kettle. Produces a clean, bright cup that highlights single-origin floral and fruity notes." }, @@ -163,7 +163,7 @@ "product_id": 21, "name": "Extra-Thick Yoga Mat", "category": "Fitness", - "brand": "ZenGrip", + "brand": "Mattra", "price": 69.99, "description": "6mm natural rubber mat with a microfiber top layer that grips when wet. Non-slip bottom prevents sliding on hardwood and tile. Alignment lines guide stance width in warrior and standing poses. Rolled dimensions: 61 × 183 cm. Includes carrying strap. Free from latex, PVC, and phthalates." 
}, @@ -171,7 +171,7 @@ "product_id": 22, "name": "Vibrating Foam Roller", "category": "Fitness", - "brand": "RecoverPro", + "brand": "Rollwen", "price": 89.99, "description": "High-density EPP foam roller with four built-in vibration frequencies (20–40 Hz). Vibration penetrates deeper tissue than static rolling for myofascial release and delayed-onset muscle soreness. USB rechargeable; 2-hour runtime per charge. Hollow core stores the charging cable." }, @@ -179,7 +179,7 @@ "product_id": 23, "name": "Resistance Band Set", "category": "Fitness", - "brand": "FlexBand", + "brand": "Bandwell", "price": 34.99, "description": "Five fabric-wrapped loop bands in progressive resistances from 5 to 40 lbs. Non-roll design stays in place during squats, hip thrusts, and lateral walks. Used for glute activation, mobility work, and upper-body accessory exercises. Includes a mesh carry bag and a printed exercise guide." }, @@ -187,7 +187,7 @@ "product_id": 24, "name": "Insulated Stainless Water Bottle 32 oz", "category": "Fitness", - "brand": "HydroKeep", + "brand": "Vesslo", "price": 39.99, "description": "Double-wall vacuum insulation keeps beverages cold 24 hours or hot 12 hours. 18/8 stainless steel; no plastic liner means no flavor transfer. Wide-mouth lid accepts ice cubes. Compatible with most car cup holders. Powder-coat finish resists dents and scratches." }, @@ -195,7 +195,7 @@ "product_id": 25, "name": "Compression Running Tights", "category": "Fitness", - "brand": "PaceWear", + "brand": "Stridon", "price": 79.99, "description": "Four-way stretch fabric with graduated compression from ankle to waist improves circulation and reduces muscle oscillation during runs. UPF 50+ sun protection. Rear zip pocket fits a key or gel. Reflective piping increases visibility in low light. Available in lengths for inseams 28–34 inches." 
} From c7aab932ee22e400738428cfee2d9e883ab47dd5 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Wed, 3 Jun 2026 14:32:59 +0200 Subject: [PATCH 09/14] Rename resource files to reflect resource type Co-authored-by: Isaac --- .../resources/{query_demo.yml => query-job.yml} | 0 .../resources/{setup_job.yml => setup-job.yml} | 0 .../resources/{endpoint.yml => vector-search-endpoint.yml} | 0 .../resources/{index.yml => vector-search-index.yml} | 0 4 files changed, 0 insertions(+), 0 deletions(-) rename knowledge_base/vector_search_product_discovery/resources/{query_demo.yml => query-job.yml} (100%) rename knowledge_base/vector_search_product_discovery/resources/{setup_job.yml => setup-job.yml} (100%) rename knowledge_base/vector_search_product_discovery/resources/{endpoint.yml => vector-search-endpoint.yml} (100%) rename knowledge_base/vector_search_product_discovery/resources/{index.yml => vector-search-index.yml} (100%) diff --git a/knowledge_base/vector_search_product_discovery/resources/query_demo.yml b/knowledge_base/vector_search_product_discovery/resources/query-job.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/query_demo.yml rename to knowledge_base/vector_search_product_discovery/resources/query-job.yml diff --git a/knowledge_base/vector_search_product_discovery/resources/setup_job.yml b/knowledge_base/vector_search_product_discovery/resources/setup-job.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/setup_job.yml rename to knowledge_base/vector_search_product_discovery/resources/setup-job.yml diff --git a/knowledge_base/vector_search_product_discovery/resources/endpoint.yml b/knowledge_base/vector_search_product_discovery/resources/vector-search-endpoint.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/endpoint.yml rename to knowledge_base/vector_search_product_discovery/resources/vector-search-endpoint.yml diff --git 
a/knowledge_base/vector_search_product_discovery/resources/index.yml b/knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml similarity index 100% rename from knowledge_base/vector_search_product_discovery/resources/index.yml rename to knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml From b9dfbdc61e612f0ffdf5e3d7d962690057b8dac8 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Wed, 3 Jun 2026 15:01:39 +0200 Subject: [PATCH 10/14] Update README Co-authored-by: Isaac --- .../vector_search_product_discovery/README.md | 86 ++++++++----------- 1 file changed, 38 insertions(+), 48 deletions(-) diff --git a/knowledge_base/vector_search_product_discovery/README.md b/knowledge_base/vector_search_product_discovery/README.md index 249b1e2..4a68333 100644 --- a/knowledge_base/vector_search_product_discovery/README.md +++ b/knowledge_base/vector_search_product_discovery/README.md @@ -1,18 +1,18 @@ # Vector Search: Semantic Product Discovery -A Declarative Automation Bundle demonstrating **semantic product search** using +A Declarative Automation Bundle demonstrating semantic product search using [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html). +It automates the full setup — the Unity Catalog schema, the Vector Search endpoint and +index, and the jobs that load and query the catalog — so a single `databricks bundle deploy` +gives you a working semantic-search example to explore and adapt. -## The problem +## How it works Keyword search fails when shoppers use different words than what appears in product -descriptions. A customer searching for *"something to keep my coffee hot all day"* won't -match a product described as an *"insulated stainless water bottle with double-wall vacuum -insulation"* — even though it's the right answer. - -Semantic search using vector embeddings matches on **meaning**, not words. - -## How it works +descriptions. 
A customer searching for "something to keep my coffee hot all day" won't +match a product described as an "insulated stainless water bottle with double-wall vacuum +insulation" even though it's the right answer. Semantic search using vector embeddings +matches on meaning, not words. Product descriptions are embedded at upsert time by the setup job using [`databricks-gte-large-en`](https://docs.databricks.com/en/machine-learning/foundation-models/supported-models.html). @@ -27,48 +27,56 @@ product_index (Direct Access Vector Search index) ranked results ``` -## Bundle resources +## Project structure -| Resource | Type | Description | -|---|---|---| -| `product_search_schema` | `schemas` | Unity Catalog schema that namespaces the index | -| `product_search_endpoint` | `vector_search_endpoints` | Managed ANN serving endpoint | -| `product_index` | `vector_search_indexes` | Direct Access index — schema defined in `resources/index.yml` | -| `product_discovery_setup` | `jobs` | Embeds product descriptions and upserts into the index | -| `product_discovery_query` | `jobs` | Embeds a query and returns ranked results | +``` +. 
+├── databricks.yml # Bundle name, variables, and the deploy target +├── data/ +│ └── products.json # Product catalog — synced to the workspace on deploy +├── resources/ +│ ├── schema.yml # Unity Catalog schema that namespaces the index +│ ├── vector-search-endpoint.yml # Vector Search endpoint (managed ANN serving) +│ ├── vector-search-index.yml # Direct Access index — schema defined inline +│ ├── setup-job.yml # Job: embed product descriptions and upsert them +│ └── query-job.yml # Job: embed a query and return ranked results +└── src/ + ├── 01_upsert_products.py # Reads products.json, embeds, calls upsert_data + └── 02_query_demo.py # Semantic search — runs as a job or interactively +``` ## Prerequisites - Databricks workspace with Unity Catalog enabled -- Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources +- Databricks CLI version 1.1.0 or above - An existing Unity Catalog catalog (default: `main`) -## Quick start +## Usage -1. **Authenticate** +1. Authenticate the CLI: ```bash databricks auth login --host https://your-workspace.cloud.databricks.com ``` -2. **Configure** `databricks.yml` — set the workspace host and any variable overrides +2. Configure `databricks.yml`. Set the workspace host and any variable overrides. -3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json` +3. Deploy the bundle. This creates the schema, endpoint, index, jobs, and syncs `data/products.json`. ```bash databricks bundle deploy ``` > Vector Search endpoint creation takes a few minutes to reach ONLINE status. -4. **Load the catalog** — embeds all product descriptions and upserts them into the index +4. Load the catalog by running the bundle. This embeds all product descriptions and upserts them into the index. ```bash databricks bundle run product_discovery_setup ``` -5. **Search** — pass any natural-language query +5. Pass any natural-language query to search. 
```bash databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails" ``` -6. **Or open** `src/02_query_demo.py` in your workspace to run queries interactively +6. Or open `src/02_query_demo.py` in your workspace to run queries interactively. ## Configuration @@ -89,15 +97,15 @@ databricks bundle deploy \ | `schema` | `product_search` | Schema created by the bundle | | `endpoint_name` | `product-search-endpoint` | Vector Search endpoint name (must be unique per workspace) | | `embedding_model` | `databricks-gte-large-en` | Foundation model used for embeddings | -| `embedding_dimension` | `1024` | Vector dimension — must match `embedding_dimension` in `resources/index.yml` | +| `embedding_dimension` | `1024` | Vector dimension — must match `embedding_dimension` in `resources/vector-search-index.yml` | -> **Note:** `embedding_dimension` in `resources/index.yml` is hardcoded to `1024` because +> **Note:** `embedding_dimension` in `resources/vector-search-index.yml` is hardcoded to `1024` because > it is immutable after index creation. If you need a different dimension, change the value -> in `index.yml` before the first deploy. +> in `vector-search-index.yml` before the first deploy. ## Index schema -The index schema lives entirely in `resources/index.yml`: +The index schema lives entirely in `resources/vector-search-index.yml`: ```yaml direct_access_index_spec: @@ -130,25 +138,7 @@ records enter the index via `upsert_data`. If you already have a pipeline writin Delta table, a **Delta Sync** index is often simpler — you point the index at the source table and it keeps itself up to date. Replace `index_type: DIRECT_ACCESS` and `direct_access_index_spec` with `index_type: DELTA_SYNC` and `delta_sync_index_spec` in -`resources/index.yml`, and remove the upsert job. - -## Project structure - -``` -. 
-├── databricks.yml -├── data/ -│ └── products.json # Product catalog — synced to workspace on deploy -├── resources/ -│ ├── schema.yml # Unity Catalog schema -│ ├── endpoint.yml # Vector Search endpoint -│ ├── index.yml # Direct Access index -│ ├── setup_job.yml # Embed + upsert job -│ └── query_demo.yml # Query job (--params "query=...") -└── src/ - ├── 01_upsert_products.py # Reads products.json, embeds, calls upsert_data - └── 02_query_demo.py # Semantic search — runs as job or interactively -``` +`resources/vector-search-index.yml`, and remove the upsert job. ## Resources From 0114d5f67bf6dc6d988395790857266443c9ea30 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Fri, 12 Jun 2026 00:06:30 +0200 Subject: [PATCH 11/14] Add dev target and isolate per-deploy resources Co-authored-by: Isaac --- .../vector_search_product_discovery/databricks.yml | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/knowledge_base/vector_search_product_discovery/databricks.yml b/knowledge_base/vector_search_product_discovery/databricks.yml index e79df17..464cfaa 100644 --- a/knowledge_base/vector_search_product_discovery/databricks.yml +++ b/knowledge_base/vector_search_product_discovery/databricks.yml @@ -27,9 +27,15 @@ include: - resources/*.yml targets: + dev: + mode: development + default: true + workspace: + host: https://your-workspace.cloud.databricks.com + variables: + endpoint_name: ${workspace.current_user.short_name}-product-search-endpoint prod: mode: production - default: true workspace: host: https://your-workspace.cloud.databricks.com root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target} From 59559af45901fad69ba91bbd08483aebacc3a8d4 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Fri, 12 Jun 2026 00:09:12 +0200 Subject: [PATCH 12/14] Make embedding_dimension a single source of truth Co-authored-by: Isaac --- .../vector_search_product_discovery/databricks.yml | 7 ++++--- 
.../resources/vector-search-index.yml | 2 +- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/knowledge_base/vector_search_product_discovery/databricks.yml b/knowledge_base/vector_search_product_discovery/databricks.yml index 464cfaa..8a30467 100644 --- a/knowledge_base/vector_search_product_discovery/databricks.yml +++ b/knowledge_base/vector_search_product_discovery/databricks.yml @@ -19,9 +19,10 @@ variables: default: databricks-gte-large-en embedding_dimension: description: >- - Vector dimension passed to the embedding model and upsert/query notebooks. - Must match the dimension used when the index was created (see resources/index.yml). - default: "1024" + Vector dimension. Referenced by the index spec (resources/vector-search-index.yml) + and passed to the upsert/query notebooks, so it is set once here. Immutable after + the index is created. + default: 1024 include: - resources/*.yml diff --git a/knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml b/knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml index e812f28..76d1012 100644 --- a/knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml +++ b/knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml @@ -20,4 +20,4 @@ resources: } embedding_vector_columns: - name: description_vector - embedding_dimension: 1024 + embedding_dimension: ${var.embedding_dimension} From 6b88daa95d58c9f65bcead1f2fd42a247cb8c83c Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Fri, 12 Jun 2026 00:09:58 +0200 Subject: [PATCH 13/14] Return query results from bundle run Co-authored-by: Isaac --- .../vector_search_product_discovery/src/02_query_demo.py | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/knowledge_base/vector_search_product_discovery/src/02_query_demo.py b/knowledge_base/vector_search_product_discovery/src/02_query_demo.py index e00ae97..f938e3a 100644 --- 
a/knowledge_base/vector_search_product_discovery/src/02_query_demo.py +++ b/knowledge_base/vector_search_product_discovery/src/02_query_demo.py @@ -72,6 +72,10 @@ df.index += 1 print(df.to_string()) +# Surface the ranked results to `databricks bundle run` / `jobs get-run-output`. +# (The print above stays for interactive notebook use.) +dbutils.notebook.exit(df.to_json(orient="records")) + # COMMAND ---------- # MAGIC %md From 49e06517c31a3cff189e7e58cd3ddd36e84ed282 Mon Sep 17 00:00:00 2001 From: Jan Rose Date: Fri, 12 Jun 2026 00:12:33 +0200 Subject: [PATCH 14/14] Update README for dev/prod targets and query output Co-authored-by: Isaac --- .../vector_search_product_discovery/README.md | 20 +++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/knowledge_base/vector_search_product_discovery/README.md b/knowledge_base/vector_search_product_discovery/README.md index 4a68333..6c46e8d 100644 --- a/knowledge_base/vector_search_product_discovery/README.md +++ b/knowledge_base/vector_search_product_discovery/README.md @@ -64,6 +64,11 @@ ranked results ```bash databricks bundle deploy ``` + This deploys the default `dev` target in development mode, so resources are namespaced + per user — jobs and the schema get a `[dev you]` prefix and the endpoint is named after + you — and several people can deploy into the same workspace without colliding. Use + `databricks bundle deploy --target prod` for the shared production copy. + > Vector Search endpoint creation takes a few minutes to reach ONLINE status. 4. Load the catalog by running the bundle. This embeds all product descriptions and upserts them into the index. @@ -76,6 +81,9 @@ ranked results databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails" ``` + The job returns the ranked results as JSON — view them with + `databricks jobs get-run-output ` or on the run page. + 6. Or open `src/02_query_demo.py` in your workspace to run queries interactively. 
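The JSON returned by the query job can be post-processed locally. The following is a minimal sketch, assuming the run output is the records-oriented string the notebook produces with `df.to_json(orient="records")`. The sample rows, the column names, and the `format_ranked_results` helper are hypothetical and stand in for whatever the index actually returns:

```python
import json

# Hypothetical sample of the string passed to dbutils.notebook.exit().
# The column names are illustrative, not taken from the example's index schema.
sample_output = json.dumps([
    {"name": "Trail Grip Hiking Boot", "price": 129.0, "score": 0.71},
    {"name": "All-Weather Running Shoe", "price": 89.5, "score": 0.64},
])


def format_ranked_results(raw_json: str) -> list[str]:
    """Turn a records-oriented JSON string into numbered, readable lines."""
    records = json.loads(raw_json)
    lines = []
    # Records arrive already ranked, so the position in the list is the rank.
    for rank, record in enumerate(records, start=1):
        fields = ", ".join(f"{key}={value}" for key, value in record.items())
        lines.append(f"{rank}. {fields}")
    return lines


if __name__ == "__main__":
    for line in format_ranked_results(sample_output):
        print(line)
```

Because the helper only relies on the records layout, it should keep working if the columns returned by the query change.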
## Configuration @@ -95,13 +103,13 @@ databricks bundle deploy \ |---|---|---| | `catalog` | `main` | Existing Unity Catalog catalog | | `schema` | `product_search` | Schema created by the bundle | -| `endpoint_name` | `product-search-endpoint` | Vector Search endpoint name (must be unique per workspace) | +| `endpoint_name` | `product-search-endpoint` | Vector Search endpoint name. Shared in prod; the `dev` target overrides it per user. | | `embedding_model` | `databricks-gte-large-en` | Foundation model used for embeddings | -| `embedding_dimension` | `1024` | Vector dimension — must match `embedding_dimension` in `resources/vector-search-index.yml` | +| `embedding_dimension` | `1024` | Vector dimension. Drives both the index and the embedding requests; immutable after the index is created. | -> **Note:** `embedding_dimension` in `resources/vector-search-index.yml` is hardcoded to `1024` because -> it is immutable after index creation. If you need a different dimension, change the value -> in `vector-search-index.yml` before the first deploy. +> **Note:** `embedding_dimension` is immutable after the index is created. Set it (via the +> `embedding_dimension` variable) before the first deploy — the index and the upsert/query +> jobs all read from that one variable. ## Index schema @@ -114,7 +122,7 @@ direct_access_index_spec: "price":"float","description":"string","description_vector":"array"} embedding_vector_columns: - name: description_vector - embedding_dimension: 1024 + embedding_dimension: ${var.embedding_dimension} ``` `schema_json` is a flat `{"column_name": "type"}` JSON string. `description_vector` stores