Digital Twin Builder for Databricks
OntoBricks is a web application that transforms Databricks tables into a materialized knowledge graph. It lets you design ontologies (OWL), map them to Unity Catalog tables via R2RML, materialize triples into a Delta or LadybugDB triple store, reason over the graph (OWL 2 RL, SWRL, SHACL), and query it through an auto-generated GraphQL API. The entire pipeline — from metadata import to a queryable knowledge graph — can run in four clicks using LLM-powered automation.
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
OntoBricks uses `uv` for dependency management. All dependencies are declared in `pyproject.toml`.
```bash
# Clone the repository
git clone <repository-url>
cd OntoBricks

# Install dependencies (uv resolves them from pyproject.toml)
uv sync

# Or use the setup script
scripts/setup.sh
```

- Python 3.10 or higher
- Databricks workspace access with a Personal Access Token
- A SQL Warehouse ID
- A Unity Catalog Volume for the domain registry
```bash
# Configure credentials
cp .env.example .env
# Edit .env with your Databricks host, token, and warehouse ID

# Start the application
scripts/start.sh
# Open http://localhost:8000
```

```bash
# Install and configure the Databricks CLI
pip install databricks-cli
databricks configure --token

# Deploy
make deploy
# Or: scripts/deploy.sh
```

After deployment, bind the sql-warehouse and volume resources in the Databricks Apps UI (Compute > Apps > ontobricks > Resources). If the registry volume is empty, open the app and click Settings > Registry > Initialize.
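The authoritative variable names live in `.env.example`; the sketch below uses common Databricks conventions and is only an assumption about what that file contains, so copy from the example file rather than from here:

```bash
# .env (illustrative only; variable names are assumptions, check .env.example)
DATABRICKS_HOST=https://my-workspace.cloud.databricks.com
DATABRICKS_TOKEN=dapi-your-personal-access-token
DATABRICKS_WAREHOUSE_ID=your-sql-warehouse-id
```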
First deploy only:
`make deploy` runs `scripts/bootstrap-app-permissions.sh` automatically, which grants each app's service principal `CAN_MANAGE` on itself. Without that grant the middleware cannot read the app's own ACL and every first-time visitor — including the deploying `CAN_MANAGE` user — lands on the access-denied page. If you deploy via `databricks bundle deploy` directly, run `make bootstrap-perms` once afterwards (it is idempotent).
See Deployment Guide for the full checklist including resource configuration and permissions.
- Ensure all tests pass:

  ```bash
  make test
  ```

- Update the version in `pyproject.toml`
- Commit, tag, and push:

  ```bash
  git add -A && git commit -m "Release vX.Y.Z"
  git tag vX.Y.Z
  git push origin main --tags
  ```

- Deploy the new version:

  ```bash
  make deploy
  ```
| Step | Action | What Happens |
|---|---|---|
| 1 | Import Metadata (Domain > Metadata) | Fetches table and column metadata from Unity Catalog |
| 2 | Generate Ontology (Ontology > Wizard) | LLM designs entities, relationships, and attributes from your metadata |
| 3 | Auto-Map (Mapping > Auto-Map) | LLM generates SQL mappings for every entity and relationship |
| 4 | Synchronize (Digital Twin > Status) | Executes mappings and populates the triple store |
- Design an ontology visually using the OntoViz canvas, or import OWL/RDFS/industry standards (FIBO, CDISC, IOF)
- Map ontology entities to Databricks tables with column-level precision
- Build the Digital Twin — materializes triples into the triple store (incremental by default)
- Query through the GraphQL playground or explore the interactive knowledge graph
- Reason over the graph — run OWL 2 RL inference, SWRL rules, SHACL validation, and constraint checks
- Two-phase search — preview matching entities in a flat list, then select specific ones to expand into the full graph with relationships and neighbors
- Configurable search depth — control the maximum traversal depth and entity cap for graph expansion
- Bridge navigation — follow cross-domain bridges to automatically switch domains and focus on the target entity in the knowledge graph
- Data cluster detection — detect communities in the knowledge graph using Louvain, Label Propagation, or Greedy Modularity algorithms; available client-side (Graphology) for the visible subgraph and server-side (NetworkX) for the full graph; cluster results can be visualized with color-by-cluster mode and collapsed into super-nodes
- Data quality violation limits — cap the number of violations displayed per rule (configurable via dropdown, default 10) for faster quality checks
- Per-rule progress tracking — SWRL inference and data quality checks report progress for each individual rule
The Ontology Model view includes a floating AI Assistant (bottom-right of the canvas) that lets you modify your ontology through natural language commands — add entities, remove orphans, list relationships, and more. Conversation history is maintained within the session.
- Deep-linked sidebar sections — shareable URLs, browser Back/Forward support
- Breadcrumb navigation — always see your position (Registry > Domain > Ontology > Section)
- Keyboard shortcuts —
Cmd/Ctrl+Ssave,Cmd/Ctrl+Ksearch,?help overlay - SQL connection pooling — reusable database connections, no per-query TLS handshake
- CSRF protection — double-submit cookie for all state-changing requests
- Structured JSON logging — set
LOG_FORMAT=jsonfor production-grade observability
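In the double-submit-cookie pattern, the server accepts a state-changing request only when a request header echoes the CSRF token it previously set as a cookie; since cross-site attackers cannot read the cookie, they cannot forge the header. A minimal illustrative sketch of that check (not OntoBricks' actual middleware):

```python
import hmac
import secrets

def issue_csrf_token() -> str:
    # Set this value as a cookie; the client must echo it in a request header.
    return secrets.token_urlsafe(32)

def csrf_ok(cookie_token: str, header_token: str) -> bool:
    # Constant-time comparison; request is allowed only if header == cookie.
    return bool(cookie_token) and hmac.compare_digest(cookie_token, header_token)

token = issue_csrf_token()
print(csrf_ok(token, token), csrf_ok(token, "forged-value"))  # True False
```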
OntoBricks exposes the knowledge graph to LLM agents via the Model Context Protocol. Deploy the companion mcp-ontobricks app and connect from Cursor, Claude Desktop, or the Databricks Playground.
Promote domains between Databricks environments with the `scripts/registry_transfer.sh` command-line tool — export a curated subset of domains/versions from a source registry into a .zip, then preview and commit it into the target registry. No UI, no HTTP endpoint. See Registry Import / Export (CLI) for the full reference and examples.
Full documentation is available in docs/. For a comprehensive feature list and architecture details, see INFO.md.