Add Vector Search semantic product discovery example#153

Merged
janniklasrose merged 14 commits into main from janniklasrose/vector-search-example on Jun 12, 2026

@janniklasrose janniklasrose commented May 1, 2026

Contributor

Summary

Adds a Declarative Automation Bundle under knowledge_base/vector_search_product_discovery/ that demonstrates semantic product search end-to-end with Databricks Vector Search:

  • vector_search_endpoints + vector_search_indexes declared as bundle resources; jobs reference them via ${resources.*.name} so dev-mode prefixing flows through automatically
  • A dev (default, mode: development) and a prod target — a plain bundle deploy is isolated per user (per-user endpoint name; schema/jobs/index dev-prefixed), so several people can deploy into one workspace without colliding
  • Direct Access index (engine: direct); descriptions are embedded explicitly in 01_upsert_products.py and the query notebook embeds the query before similarity_search — Direct Access indexes don't auto-embed (a Delta Sync feature)
  • A single embedding_dimension variable feeds both the index spec and the notebooks (immutable after index creation, so one knob prevents a silent mismatch)
  • The query job calls dbutils.notebook.exit(...) so ranked results come back from databricks bundle run / jobs get-run-output
  • schema_json uses the flat {"col":"type"} form required by the API
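
The last two points can be sketched in plain Python. This is a minimal illustration, not the bundle's actual code: the column names, the `array<float>` type string, and the dimension value of 4 are assumptions chosen for the example. It builds a flat `{"col":"type"}` schema string and checks every vector against one shared dimension before anything is upserted.

```python
import json

# Assumed value for illustration; in the bundle this would come from the
# single embedding_dimension variable.
EMBEDDING_DIMENSION = 4


def build_schema_json(embedding_column="embedding"):
    """Return a flat {"col": "type"} schema as a JSON string (hypothetical columns)."""
    schema = {
        "product_id": "string",
        "description": "string",
        embedding_column: "array<float>",
    }
    return json.dumps(schema)


def validate_rows(rows, dimension=EMBEDDING_DIMENSION, embedding_column="embedding"):
    """Raise early if any vector length differs from the index dimension."""
    for row in rows:
        actual = len(row[embedding_column])
        if actual != dimension:
            raise ValueError(
                f"product {row['product_id']}: expected {dimension} dimensions, got {actual}"
            )
    return rows
```

Checking the length on the client side turns a dimension mismatch into an immediate error rather than something discovered after the index already exists.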

Requirements

Databricks CLI v1.1.0+, which ships vector_search_endpoints / vector_search_indexes as first-class DABs resources (previously gated on databricks/cli#5123, which has since shipped).

Test plan

Verified with CLI v1.1.0 on a workspace (default dev target):

  • databricks bundle validate (dev and prod)
  • databricks bundle deploy — endpoint ONLINE, index created
  • databricks bundle run product_discovery_setup — products embedded + upserted
  • databricks bundle run product_discovery_query --params "query=something to keep my coffee hot all day" — returns ranked results as JSON (e.g. the insulated water bottle surfaces with no keyword overlap)
  • databricks bundle destroy — clean teardown
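
For the query step, returning ranked results as JSON might look roughly like the sketch below. The `format_results` helper and the `product_id` / `score` field names are hypothetical; the only Databricks call referenced is `dbutils.notebook.exit`, which takes a string and is left as a comment because `dbutils` exists only inside a notebook.

```python
import json


def format_results(hits, num_results):
    """Sort hits by descending score and serialize the top ones as JSON."""
    ranked = sorted(hits, key=lambda hit: hit["score"], reverse=True)[:num_results]
    return json.dumps({"results": ranked})


hits = [
    {"product_id": "p2", "score": 0.41},
    {"product_id": "p1", "score": 0.87},
    {"product_id": "p3", "score": 0.12},
]
payload = format_results(hits, num_results=2)

# Inside a notebook task, the payload would be handed back to the caller with:
# dbutils.notebook.exit(payload)
```

Keeping the exit value a single JSON string means the caller can parse it without scraping notebook output.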

This pull request and its description were written by Isaac.

Demonstrates a Direct Access Vector Search index and endpoint declared
as bundle resources (vector_search_endpoints, vector_search_indexes),
tested e2e against staging with the direct engine.

Key design decisions:
- Jobs use resource references (${resources.*.name}) for endpoint and
  index names so dev-mode prefixing flows through automatically
- schema_json uses flat {"col":"type"} format required by the API
- Notebooks embed descriptions/queries explicitly (Direct Access indexes
  don't auto-embed; that's a Delta Sync feature)
- engine: direct set in bundle config so no env var is needed

Co-authored-by: Isaac
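
The explicit-embedding decision can be illustrated with a toy, in-memory stand-in. Everything here is an assumption made for the sketch: `embed` is a word-count embedder over a fixed vocabulary (not a real embedding model, so it cannot match text without keyword overlap), and `InMemoryIndex` is not the Vector Search client. The point is only the shape of the flow: the index accepts and searches vectors, so both descriptions and queries are embedded by the caller first.

```python
import math

# Hypothetical fixed vocabulary standing in for a real embedding model.
VOCAB = ["coffee", "hot", "bottle", "insulated", "headphones", "music", "shoes", "running"]


def embed(text):
    """Toy embedder: count vocabulary words. Not a real semantic model."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    """Cosine similarity, returning 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Stores caller-supplied vectors only; it never embeds text itself."""

    def __init__(self, dimension):
        self.dimension = dimension
        self.rows = {}

    def upsert(self, product_id, vector):
        if len(vector) != self.dimension:
            raise ValueError("vector length does not match index dimension")
        self.rows[product_id] = vector

    def similarity_search(self, query_vector, num_results):
        scored = [(pid, cosine(query_vector, vec)) for pid, vec in self.rows.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:num_results]


products = {
    "p1": "insulated bottle keeps coffee hot",
    "p2": "wireless headphones for music",
    "p3": "lightweight running shoes",
}

index = InMemoryIndex(dimension=len(VOCAB))
for product_id, description in products.items():
    index.upsert(product_id, embed(description))  # embed before upsert

# The query is embedded with the same function before searching.
results = index.similarity_search(embed("keep my coffee hot"), num_results=2)
```

Using one embedding function and one dimension for both paths is what keeps stored vectors and query vectors comparable.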
Comment thread contrib/vector_search_product_discovery/resources/index.yml Outdated

@pietern pietern left a comment

Contributor

Ran this example end-to-end on a dogfood workspace with the released CLI (v1.1.0): validate → deploy → run (setup + query) → destroy. The embed → upsert → similarity-search logic is correct: all three README example queries returned the documented top result, so the substance is solid. Also confirmed v1.1.0 recognizes vector_search_endpoints / vector_search_indexes, so the cli#5123 dependency has shipped (correctly struck through in the description).

Nice to see the index name reference ${resources.schemas.product_search_schema.name} rather than the raw ${var.schema} — that's the mode-prefix-safe form.

Remaining feedback is about per-deploy isolation and the CLI run experience, flagged inline. Nothing blocks the single-user happy path; it's mostly "what happens when a second person deploys this into the same workspace."

Comment thread knowledge_base/vector_search_product_discovery/databricks.yml Outdated
Comment thread knowledge_base/vector_search_product_discovery/resources/vector-search-index.yml Outdated
Comment thread knowledge_base/vector_search_product_discovery/src/02_query_demo.py
Comment thread knowledge_base/vector_search_product_discovery/data/products.json
Comment thread knowledge_base/vector_search_product_discovery/README.md Outdated
# Vector Search: Semantic Product Discovery

A Declarative Automation Bundle demonstrating **semantic product search** using
[Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html).

Contributor

I have heard "Vector Search" is being renamed?

Contributor Author

Let's keep it at Vector Search for consistency with DABs resource names (vector_search_*). We can do a follow-up where we update this for all renamed resources in bundle-examples (I'm sure there's more)

Comment thread knowledge_base/vector_search_product_discovery/README.md Outdated
@janniklasrose janniklasrose requested a review from pietern June 11, 2026 22:59
@janniklasrose janniklasrose enabled auto-merge (squash) June 12, 2026 08:21
embedding_model: "{{job.parameters.embedding_model}}"
embedding_dimension: "{{job.parameters.embedding_dimension}}"
query: "{{job.parameters.query}}"
num_results: "{{job.parameters.num_results}}"

Contributor

I think these are automatically pushed down from job parameters into the task.

If so you don't need to specify any of them.

@janniklasrose janniklasrose merged commit e6d00cb into main Jun 12, 2026
1 check passed
@janniklasrose janniklasrose deleted the janniklasrose/vector-search-example branch June 12, 2026 09:31
janniklasrose added a commit that referenced this pull request Jun 12, 2026
Follow-up to review feedback on #153:
job-level `parameters` are automatically pushed down to notebook tasks,
so the `base_parameters` blocks that mirrored them 1:1 via
`{{job.parameters.x}}` were redundant.

Confirmed against the docs ([Parameterize
jobs](https://docs.databricks.com/aws/en/jobs/parameters), [Access
parameter values from a
task](https://docs.databricks.com/aws/en/jobs/parameter-use)):

- Job parameters are pushed down to tasks that use key-value parameters;
notebooks read them with `dbutils.widgets.get()` — which is exactly how
`01_upsert_products.py` and `02_query_demo.py` already read them.
- When a task parameter and a job parameter share a name, the job
parameter is fetched — so these `base_parameters` could never take
effect anyway; removing them is behavior-preserving.
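
The precedence rule in the second bullet can be modeled in a few lines. This is a simplified model of the documented behavior, not Databricks code: `effective_parameters` is a hypothetical helper, and the parameter values are made up for the example.

```python
def effective_parameters(job_parameters, base_parameters):
    """Model the values a notebook task sees: job parameters win on name clashes."""
    merged = dict(base_parameters)
    merged.update(job_parameters)  # a job parameter overrides a same-named task parameter
    return merged


job_parameters = {"query": "something to keep my coffee hot all day", "num_results": "3"}

# Task-level parameters that mirror the job parameters 1:1.
mirrored = {name: "{{job.parameters." + name + "}}" for name in job_parameters}

with_mirror = effective_parameters(job_parameters, mirrored)
without_mirror = effective_parameters(job_parameters, {})
```

Under this model `with_mirror` and `without_mirror` are identical, which is why dropping the mirrored block does not change what the notebooks read.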

`databricks bundle validate` passes for both targets against a real
workspace (dev and prod, host placeholder swapped locally for validation
only).

This pull request and its description were generated with Claude Code.