Add Vector Search semantic product discovery example #153
Demonstrates a Direct Access Vector Search index and endpoint declared
as bundle resources (vector_search_endpoints, vector_search_indexes),
tested e2e against staging with the direct engine.
Key design decisions:
- Jobs use resource references (${resources.*.name}) for endpoint and
index names so dev-mode prefixing flows through automatically
- schema_json uses flat {"col":"type"} format required by the API
- Notebooks embed descriptions/queries explicitly (Direct Access indexes
don't auto-embed; that's a Delta Sync feature)
- engine: direct set in bundle config so no env var is needed
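As a rough illustration of the flat `{"col":"type"}` shape mentioned above, the sketch below serializes a column-to-type mapping into a `schema_json` string. The column names and types are hypothetical and are not taken from this example's real schema; `to_flat_schema_json` is an illustrative helper, not part of any Databricks API.

```python
import json

# Hypothetical columns for illustration only; the real example's schema may differ.
columns = {
    "product_id": "string",
    "description": "string",
    "embedding": "array<float>",
}


def to_flat_schema_json(cols: dict) -> str:
    """Serialize a column -> type mapping as a flat JSON object string."""
    for name, col_type in cols.items():
        # A flat schema maps plain string names to plain string types (no nesting).
        if not isinstance(name, str) or not isinstance(col_type, str):
            raise TypeError("flat schema needs string names and string types")
    return json.dumps(cols)


schema_json = to_flat_schema_json(columns)
```

The resulting string is a single-level JSON object, which is the form the commit message says the API requires.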
Co-authored-by: Isaac
pietern left a comment
Ran this example end-to-end on a dogfood workspace with the released CLI (v1.1.0): validate → deploy → run (setup + query) → destroy. The embed → upsert → similarity-search logic is correct — all three README example queries returned the documented top result, so the substance is solid. Also confirmed v1.1.0 recognizes vector_search_endpoints / vector_search_indexes, so the cli#5123 dependency has shipped (correctly struck through in the description).
Nice to see the index name reference ${resources.schemas.product_search_schema.name} rather than the raw ${var.schema} — that's the mode-prefix-safe form.
Remaining feedback is about per-deploy isolation and the CLI run experience, flagged inline. Nothing blocks the single-user happy path; it's mostly "what happens when a second person deploys this into the same workspace."
> # Vector Search: Semantic Product Discovery
>
> A Declarative Automation Bundle demonstrating **semantic product search** using
> [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html).
I have heard "Vector Search" is being renamed?
Let's keep it at Vector Search for consistency with DABs resource names (vector_search_*). We can do a follow-up where we update this for all renamed resources in bundle-examples (I'm sure there's more)
> embedding_model: "{{job.parameters.embedding_model}}"
> embedding_dimension: "{{job.parameters.embedding_dimension}}"
> query: "{{job.parameters.query}}"
> num_results: "{{job.parameters.num_results}}"
I think these are automatically pushed down from job parameters into the task.
If so you don't need to specify any of them.
Follow-up to review feedback on #153: job-level `parameters` are automatically pushed down to notebook tasks, so the `base_parameters` blocks that mirrored them 1:1 via `{{job.parameters.x}}` were redundant.

Confirmed against the docs ([Parameterize jobs](https://docs.databricks.com/aws/en/jobs/parameters), [Access parameter values from a task](https://docs.databricks.com/aws/en/jobs/parameter-use)):
- Job parameters are pushed down to tasks that use key-value parameters; notebooks read them with `dbutils.widgets.get()`, which is exactly how `01_upsert_products.py` and `02_query_demo.py` already read them.
- When a task parameter and a job parameter share a name, the job parameter takes precedence, so these `base_parameters` could never take effect anyway; removing them is behavior-preserving.

`databricks bundle validate` passes for both targets against a real workspace (dev and prod, host placeholder swapped locally for validation only).

This pull request and its description were generated with Claude Code.
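The precedence rule described above can be modeled with a small sketch. This is a toy model of the documented behavior, not Databricks code: `resolve_notebook_parameters` is a hypothetical helper, and the parameter values are made up for illustration.

```python
def resolve_notebook_parameters(job_parameters: dict, base_parameters: dict) -> dict:
    """Toy model: merge task-level and job-level parameters, job-level winning."""
    resolved = dict(base_parameters)  # start from the task-level values
    resolved.update(job_parameters)   # same-named job parameters override them
    return resolved


# Hypothetical job-level values for illustration.
job_parameters = {"query": "warm drinks", "num_results": "5"}

# Task-level block that only mirrors the job parameters 1:1.
mirrored = {
    "query": "{{job.parameters.query}}",
    "num_results": "{{job.parameters.num_results}}",
}

with_mirror = resolve_notebook_parameters(job_parameters, mirrored)
without_mirror = resolve_notebook_parameters(job_parameters, {})
```

In this model the notebook sees the same values whether or not the mirrored block is present, which is why removing it is behavior-preserving.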
Summary
Adds a Declarative Automation Bundle under `knowledge_base/vector_search_product_discovery/` that demonstrates semantic product search end-to-end with Databricks Vector Search:
- `vector_search_endpoints` + `vector_search_indexes` declared as bundle resources; jobs reference them via `${resources.*.name}` so dev-mode prefixing flows through automatically
- `dev` (default, `mode: development`) and a `prod` target; a plain `bundle deploy` is isolated per user (per-user endpoint name; schema/jobs/index dev-prefixed), so several people can deploy into one workspace without colliding
- `engine: direct` is set in the bundle config; descriptions are embedded explicitly in `01_upsert_products.py` and the query notebook embeds the query before `similarity_search`, because Direct Access indexes don't auto-embed (that's a Delta Sync feature)
- The `embedding_dimension` variable feeds both the index spec and the notebooks (immutable after index creation, so one knob prevents a silent mismatch)
- `dbutils.notebook.exit(...)` is called so ranked results come back from `databricks bundle run` / `jobs get-run-output`
- `schema_json` uses the flat `{"col":"type"}` form required by the API

Requirements
Databricks CLI v1.1.0+, which ships `vector_search_endpoints` / `vector_search_indexes` as first-class DABs resources (was databricks/cli#5123).

Test plan
Verified with CLI v1.1.0 on a workspace (default `dev` target):
- `databricks bundle validate`: `dev` and `prod`
- `databricks bundle deploy`: endpoint ONLINE, index created
- `databricks bundle run product_discovery_setup`: products embedded + upserted
- `databricks bundle run product_discovery_query --params "query=something to keep my coffee hot all day"`: returns ranked results as JSON (e.g. the insulated water bottle surfaces with no keyword overlap)
- `databricks bundle destroy`: clean teardown

This pull request and its description were written by Isaac.
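To illustrate the kind of silent mismatch a single `embedding_dimension` knob is meant to prevent, here is a minimal guard that could run before an upsert. It is a sketch under assumptions: `check_embedding_dimension` is a hypothetical helper, the dimension of 4 is chosen only to keep the example small, and nothing here is taken from the example's notebooks.

```python
def check_embedding_dimension(vectors, expected_dimension: int) -> int:
    """Raise if any vector's length differs from the index's fixed dimension.

    Returns the number of vectors checked.
    """
    for position, vector in enumerate(vectors):
        if len(vector) != expected_dimension:
            raise ValueError(
                f"vector {position} has length {len(vector)}, "
                f"expected {expected_dimension}"
            )
    return len(vectors)


# Hypothetical tiny embeddings; real embedding models produce far longer vectors.
embeddings = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]
checked = check_embedding_dimension(embeddings, expected_dimension=4)
```

Failing fast like this surfaces a model/index mismatch at upsert time instead of at query time.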