[DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export by michael-ruder · Pull Request #19 · noom/databricks-ai-dev-kit

michael-ruder · 2026-06-02T18:53:01Z

Why

Returning query results through the MCP serializes them into the model's context window. One real analytical query (the Meristem ups_ex360_med grid) came back as 198 × 35 ≈ 68 KB ≈ 37,000 tokens and overflowed.

This is structural, not a one-off: a tool result is tokenized into context by design, so any non-trivial extract — by any engineer, on any project — hits the same wall. It also makes the model a lossy conduit for data that downstream tools (viz, the model app) need exact and whole.

What

A customization-layer tool, export_query_to_file, that has the server write results to a local file and return only a manifest ({path, row_count, bytes, columns, format}). The rows never enter the model context — local tools read the file off disk.

Pass the path, not the payload. The MCP is a control plane, not a data plane.

before:  warehouse ──▶ MCP ──▶ [ ~37,000 tokens into context ]  💥
after:   warehouse ──▶ MCP server ──▶ disk: results.csv
                            └──▶ context: {path, rows, bytes}  ≈ 50 tokens

How

Does NOT reuse SQLExecutor.execute() — that path uses the INLINE disposition and reads only result.data_array (the first chunk), silently truncating large results.
Issues the statement with disposition=EXTERNAL_LINKS, format=CSV. Databricks writes CSV chunk files to cloud storage and returns presigned URLs; the server streams each chunk to disk following next_chunk_index. Memory-bounded, no truncation. Header arrives in the first chunk; empty result → header-only file from the manifest schema.
Governance preserved — reuses get_sql_sp_client / get_sql_warehouse_id / get_mcp_user_identity: SP execution, forced warehouse override, mcp_user:<identity> tagging, plus a mcp_tool:export_query_to_file tag.
Path-sandboxed to DATABRICKS_MCP_EXPORT_DIR (default ~/databricks-mcp-exports); rejects .. traversal and absolute escapes.
Registered in run.py before the allowlist (Step 2b) so the allowlist stays the single source of truth; tool name added to ALLOWED_TOOLS. No upstream files touched, same pattern as Raise SQL tool-call timeout ceiling (60s → 600s), keep low default + per-call knob #18.

Files

File	Change
`customization/export_query_patch.py`	new — tool, chunk streaming, path sandbox
`customization/tool_allowlist_patch.py`	add `export_query_to_file` to `ALLOWED_TOOLS`
`run.py`	Step 2b — register before allowlist
`tests/test_export_query_patch.py`	new — 14 unit tests
`.env.example`	document `DATABRICKS_MCP_EXPORT_DIR`
`docs/export-tool-design.md`	full design note (rationale + alternatives considered)

Validation

14 new unit tests; full suite 51 passed / 1 skipped; lint clean.
Live-validated against the warehouse: synthetic deterministic result, a real 122-column table slice, and an empty result → header-only file.
The live test caught and fixed a duplicate-header bug (we initially synthesized a header the CSV chunks already include) — exactly the kind of thing unit tests with mocks would have missed.

Follow-ups (not in this PR)

Optional Parquet output (Format.ARROW_STREAM + pyarrow) if a consumer prefers it.
Engineers must restart their MCP client to pick up the new tool.

🤖 Generated with Claude Code

Note

Medium Risk
Introduces a new local file-write path for full query results (sandboxed but still sensitive data on disk) while reusing existing SP/warehouse/query-tag governance; implementation is large but heavily unit-tested for path safety and atomic writes.

Overview
Adds a governed MCP tool export_query_to_file so large query results are written on the server to a sandboxed local CSV and the model only gets a small manifest (path, row_count, bytes, columns, format).

Execution uses Databricks EXTERNAL_LINKS + CSV (not inline execute_sql), with lazy chunk downloads, atomic temp-then-move writes, optional overwrite via os.link/os.replace, timeout clamp/cancel at 600s, and path confinement under DATABRICKS_MCP_EXPORT_DIR (documented in .env.example). Wiring: register_export_query_tool in run.py (Step 2b, before allowlist), export_query_to_file on ALLOWED_TOOLS, plus tests/test_export_query_patch.py and docs/export-tool-design.md.

Note: .env.example now has duplicate DATABRICKS_WAREHOUSE_ID blocks (placeholder + default)—likely accidental cleanup outside this feature.

^{Reviewed by Cursor Bugbot for commit 724eb1d. Bugbot is set up for automated code reviews on this repo. Configure here.}

Returning large query results through the MCP serializes them into the model's context window. A real analytical query (the Meristem ups_ex360_med grid) came back as ~68 KB / ~37,000 tokens and overflowed. This is structural, not a one-off: a tool result is tokenized into context by design, so any non-trivial extract — by any engineer, on any project — hits the same wall. It also makes the model a lossy conduit for data that downstream tools (viz, the model app) need exact and whole. This adds a customization-layer tool, export_query_to_file, that has the SERVER write results to a local file and return only a manifest ({path, row_count, bytes, columns, format}). The rows never enter the model context — local tools read the file off disk. Pass the path, not the payload. Mechanism: - Does NOT reuse SQLExecutor.execute(): that path uses the INLINE disposition and reads only result.data_array (the first chunk), silently truncating large results. - Issues the statement with disposition=EXTERNAL_LINKS, format=CSV. Databricks writes CSV chunk files to cloud storage and returns presigned URLs; the server streams each chunk to disk, following next_chunk_index. Memory- bounded, no truncation. The header arrives in the first chunk; an empty result yields a header-only file from the manifest schema. - Reuses the existing governance chokepoints (get_sql_sp_client, get_sql_warehouse_id, get_mcp_user_identity): SP execution, forced warehouse override, and mcp_user:<identity> tagging, plus a mcp_tool tag. - Sandboxes writes to DATABRICKS_MCP_EXPORT_DIR (default ~/databricks-mcp-exports); rejects .. traversal and absolute escapes. Registered in run.py before the tool allowlist (Step 2b) so the allowlist stays the single source of truth for what is exposed; the tool name is added to ALLOWED_TOOLS. No upstream files touched, per the fork's governance model. Validation: 14 new unit tests; full suite 51 passed / 1 skipped; lint clean; live-validated against the warehouse (synthetic deterministic result, a real 122-column table slice, and an empty result -> header-only file). The live test caught and fixed a duplicate-header bug (we initially synthesized a header that the CSV chunks already include). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CI runs `ruff format --check` in addition to `ruff check`; the new module needed reformatting (line wrapping only, no logic change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

michael-ruder · 2026-06-02T19:57:56Z

@beipang
https://docs.google.com/document/d/1pmCLe8k9vvZq2TkAn9VCi7vPWoPOVGbEm4vFLOwyZR4/edit?tab=t.0

beipang · 2026-06-02T20:29:13Z

+   `INLINE` disposition and reads only the first chunk — it truncates large
+   results). Instead it issues the statement with `disposition=EXTERNAL_LINKS`
+   and `format=CSV`. Databricks writes the result to cloud storage as CSV chunk
+   files and returns presigned download URLs.


One small Q: What is the retention and lifecycle for the CSV files persisted in Databricks?

Review follow-up (PR #19): the only artifact this tool owns with a real lifecycle is the LOCAL CSV — Databricks-side cloud-fetch staging is transient and platform-managed (presigned links expire ~15 min, measured). The local files previously had no retention; since exports can contain prod PII, an unmanaged pile of CSVs is a governance concern. Adds sweep_old_exports(): removes files in the export dir older than DATABRICKS_MCP_EXPORT_RETENTION_DAYS (default 7; 0 disables), pruning emptied subdirectories. Runs at startup and before each export. Sweep errors are logged, never raised, so cleanup can't block an export. Also documents the env var in .env.example and adds a "Data lifecycle" section to the design note distinguishing the transient Databricks staging from the local file. Tests: +4 (old removed / recent kept, disabled at <=0, empty-dir prune, env parsing). Full suite 55 passed / 1 skipped; ruff check + format clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

michael-ruder · 2026-06-02T21:24:47Z

@beipang — good question. There are two distinct "CSV" artifacts here, so let me separate them, because only one is ours to manage:

1. Databricks-side (the cloud-fetch staging in "How it works"). With disposition=EXTERNAL_LINKS, Databricks stages the result to its own managed cloud storage and returns presigned URLs. These are transient and platform-managed — I measured the presigned-link expiration at ~15 minutes, and the staged result only lives under Databricks' statement-result retention before Databricks purges it. We don't pick the location, create a table/volume, or control that lifecycle. Nothing for us to retain or clean up there.

2. Local-side (the actual deliverable). The .csv lands on the engineer's disk at DATABRICKS_MCP_EXPORT_DIR (default ~/databricks-mcp-exports). This is the one with a real lifecycle question — and you're right that it needed one. Exports can contain prod PII, so an unmanaged pile of local CSVs is a governance concern, not just disk hygiene.

Addressed in 35c1098:

sweep_old_exports() deletes files older than DATABRICKS_MCP_EXPORT_RETENTION_DAYS (default 7, 0 disables), pruning emptied subdirs. Runs at startup and before each export; sweep errors are logged, never raised, so cleanup can't block an export.
Documented the env var in .env.example and added a Data lifecycle table to docs/export-tool-design.md spelling out the two artifacts above.
+4 unit tests (old removed / recent kept, disabled at ≤0, empty-dir prune, env parsing).

tl;dr: nothing durable is persisted in Databricks by this tool; the only thing we own is the local file, and it's now retention-bounded with PII in mind.

michael-ruder · 2026-06-02T21:27:13Z

Correction to my earlier comment (re: the "~15 minute" figure) — I want to keep the record accurate.

I checked the official Databricks docs and they do not document a specific expiration/garbage-collection duration. Two things to fix in what I said above:

"~15 minutes" was my own measurement, not a documented value. I read the expiration timestamp on a single live EXTERNAL_LINKS response and it came back ~15 min out. That's one observation, not a guaranteed SLA — it can vary by cloud/region/API version.
"garbage collection" was the wrong framing. What I measured is the presigned-URL validity window, which is distinct from when Databricks purges the underlying staged result (which the docs don't quantify at all).

The accurate, defensible statement:

The EXTERNAL_LINKS presigned URLs are short-lived; each response carries an explicit expiration timestamp (we observed ~15 min in testing). Consumers should treat that field as the source of truth rather than assume a fixed duration. The underlying staged result is Databricks-managed and not fetchable after the statement closes; Databricks does not publish a specific retention/GC interval.

None of this changes the design: we still persist nothing durable or self-owned in Databricks, and the only artifact with a lifecycle we manage is the local file (now retention-bounded). I'll tighten the wording in docs/export-tool-design.md to match — attributing the 15 min as an observation and pointing at the per-response expiration field.

Sources (neither states a duration):

Statement Execution API tutorial — AWS — describes the URLs as "short-lived" with an expiration field
executeStatement API reference — external_links.expiration

bugbot (PR #19): _write_csv opened output_path with "wb", truncating it immediately, then streamed chunk downloads. A later chunk failure, an expired presigned link, or the tool's wall-clock ceiling firing mid-stream would leave a partial/empty CSV at the real path — and with overwrite=true, destroy the prior good export. A half-written CSV that looks complete is exactly the silent corruption this tool exists to prevent. Now streams to a temp file in the destination directory and os.replace()s it into place only after the last chunk is written (atomic on the same filesystem). On any BaseException (including timeout/cancellation) the temp file is removed and dest is left untouched. Tests: +2 (mid-stream failure preserves prior file and leaves no .part; failure creates no destination when none existed). Full suite 57 passed / 1 skipped; ruff check + format clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nks lazily Two bugbot findings (PR #19): 1. Statement timeout exceeds tool ceiling. _run_export polled on the raw caller timeout, so a value above the 600s FastMCP ceiling let the tool call be aborted before our cancellation fired — leaving the statement running on the warehouse (orphaned query). Added _clamp_statement_timeout(): the per-call timeout is now bounded by EXPORT_TOOL_TIMEOUT_CEILING so the poll loop cancels within the ceiling. (Same orphaned-query concern as #18.) 2. Presigned URLs expire during download. We collected every external link up front, then downloaded — so on a long multi-chunk export the links issued first could expire (~15 min) before the later downloads. Replaced _collect_external_links with _iter_external_links, a lazy generator that the writer drives: each chunk's link is fetched and downloaded just-in-time, so the URL is freshly issued when used. The at-risk window is now one chunk, not the whole export. (The prior atomic-write fix already prevents a truncated CSV from being reported as success — a failed download raises and leaves dest untouched; this makes long exports actually complete rather than fail.) _write_csv now consumes an iterable lazily (downloads as it iterates) and writes a header-only file when the iterable is empty. Tests: +clamp + lazy-iteration (asserts the next chunk isn't fetched until the prior link is consumed). Full suite 59 passed / 1 skipped; ruff clean; re-validated live (synthetic, 122-col real table, clamp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…n sweep Two bugbot findings (PR #19): 1. Overwrite checked before query (TOCTOU). The existence check runs at the top of _run_export, before the (possibly long) query, but _write_csv ended with an unconditional os.replace — so a file appearing during the query was clobbered despite overwrite=false. The early check stays as a fast-fail; the authoritative guard is now atomic at the move step: overwrite=true uses os.replace (clobber), overwrite=false uses os.link (fails if dest exists). 2. Retention sweep sandbox escape. sweep_old_exports walked with base.rglob("*"). On Python 3.13 (recurse_symlinks=False) this does NOT actually reach files behind a directory symlink — verified empirically — so it was not exploitable on our interpreter. But a delete path shouldn't depend on a version-specific glob default, and the prune pass followed dir symlinks via is_dir(). Rewrote with os.walk(followlinks=False) + explicit symlink skip + a containment check (resolved path must stay under base). Now sandbox-safe regardless of version. Tests: +4 (no-clobber when overwrite=false / clobber when true; dir-symlink not descended; file-symlink target untouched). Full suite 63 passed / 1 skipped; ruff clean; re-validated live (export + overwrite=false refusal). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bugbot (PR #19): resolve_export_path validates containment before the (long) query, but _write_csv never re-checked — so if dest.parent became a symlink pointing outside DATABRICKS_MCP_EXPORT_DIR during the query, mkstemp/os.link/ os.replace would follow it and write outside the sandbox while the manifest still reported the intended path. (Low real-world severity — local single-user server, so the actor needs fs write access as the user — but it's the same class we just hardened on the delete path, so worth closing for consistency.) _write_csv now takes base_dir and, immediately before writing, re-resolves dest.parent and re-checks containment, refusing to write if it resolves outside base. Containment-based (a symlink resolving back inside base is still allowed), not a blanket symlink ban. _run_export captures the base once and passes it to both resolve_export_path and _write_csv. Not a full TOCTOU-free solution (openat + O_NOFOLLOW per component) — that's over-engineering for a local dev tool; this narrows the window to near-zero and catches symlink-to-outside. Tests: +2 (parent symlink outside base refused / inside base allowed). Full suite 65 passed / 1 skipped; ruff clean; re-validated live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Updates the design note to the shipped behavior after the bugbot review round: atomic write, no-clobber overwrite guard, lazy just-in-time link fetching, timeout clamp, write-time sandbox re-check, and the retention/PII lifecycle. Adds a correctness/safety property table, corrects the presigned-link "~15 min" wording to an observation (not a documented SLA), and updates file/test counts (6 files, 65 tests). Includes a changelog of what changed since the prior draft. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cursor · 2026-06-03T14:25:46Z

+            break
+        result = client.statement_execution.get_statement_result_chunk_n(
+            statement_id=statement_id, chunk_index=nxt
+        )


Chunk loop stops on empty links

Medium Severity

_iter_external_links only continues the while loop when result.external_links is truthy. If a paginated chunk response has no links but still sets next_chunk_index, the loop exits without fetching later chunks, so the CSV can be truncated while the manifest row_count still reflects the full result.

^{Reviewed by Cursor Bugbot for commit 4be1cc1. Configure here.}

AI is telling me that this would work. But it is true that for paginated read, the "next_chunk_index" is usually used for the loop termination condition.

Removes the time-based retention feature (sweep_old_exports, get_export_retention_days, DATABRICKS_MCP_EXPORT_RETENTION_DAYS, and the startup + per-export sweep calls). It was added in response to a review question, justified by "exports can contain prod PII" — but that premise is wrong here: this MCP masks PII at source, which is exactly what makes local downloads and cross-model work acceptable. And auto-deleting prior exports works directly against reproducibility of analyses, which the team values. Exported files now persist until the engineer deletes them. The export dir is the user's to manage. Removing the sweep also moots the sweep-sandbox-escape review finding (no sweep, nothing to harden). .env.example and the design note updated accordingly (Data lifecycle now states no automatic cleanup, with the PII-masking + reproducibility rationale). Tests for the sweep removed. Full suite 59 passed / 1 skipped; ruff check + format clean; live-validated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

beipang · 2026-06-03T21:15:18Z

+    """Yield external links in row order, following the chunk chain lazily.
+
+    Laziness is deliberate. The presigned ``external_link`` URLs are short-lived
+    (Databricks issues them per response and they expire — observed ~15 min). If


Found more details: https://databricks-sdk-py.readthedocs.io/en/stable/workspace/sql/statement_execution.html

"The results are only available for one hour after success"

A few more limitations:

Statements with disposition=EXTERNAL_LINKS are limited to 100 GiB. Result sets larger than this limit will be truncated. Truncation is indicated by the truncated field in the result manifest. - The maximum query text size is 16 MiB. - Cancelation might silently fail.

More about the request:
Not really related to your change but interesting.

A successful response from a cancel request indicates that the cancel request was successfully received and sent to the processing engine. However, an outstanding statement might have already completed execution when the cancel request arrives. Polling for status until a terminal state is reached is a reliable way to determine the final state. - Wait timeouts are approximate, occur server-side, and cannot account for things such as caller delays and network latency from caller to service. - To guarantee that the statement is kept alive, you must poll at least once every 15 minutes. - The results are only available for one hour after success; polling does not extend this.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.}

cursor · 2026-06-04T18:02:01Z

+        "bytes": dest.stat().st_size,
+        "columns": columns,
+        "format": fmt,
+    }


Ignores manifest truncated flag

Medium Severity

After a successful statement, _run_export always returns a success manifest using manifest.total_row_count and never reads manifest.truncated. Databricks can truncate EXTERNAL_LINKS results (e.g. the documented size cap), so exports can be incomplete while the tool still reports the full row count.

^{Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.}

michael-ruder and others added 2 commits June 2, 2026 14:52

style(noom-mcp-server): ruff-format export_query_patch

d9778ec

CI runs `ruff format --check` in addition to `ruff check`; the new module needed reformatting (line wrapping only, no logic change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

michael-ruder changed the title ~~feat(noom-mcp-server): add export_query_to_file for bulk result export~~ [DS-1699] feat(noom-mcp-server): add export_query_to_file for bulk result export Jun 2, 2026

beipang reviewed Jun 2, 2026

View reviewed changes

beipang approved these changes Jun 2, 2026

View reviewed changes

cursor Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread noom-mcp-server/customization/export_query_patch.py Outdated

Comment thread noom-mcp-server/customization/export_query_patch.py

Comment thread noom-mcp-server/customization/export_query_patch.py Outdated

michael-ruder and others added 2 commits June 2, 2026 17:36

cursor Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread noom-mcp-server/customization/export_query_patch.py Outdated

Comment thread noom-mcp-server/customization/export_query_patch.py

michael-ruder changed the title ~~[DS-1699] feat(noom-mcp-server): add export_query_to_file for bulk result export~~ [DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export Jun 3, 2026

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread noom-mcp-server/customization/export_query_patch.py Outdated

michael-ruder and others added 2 commits June 3, 2026 10:03

cursor Bot reviewed Jun 3, 2026

View reviewed changes

beipang reviewed Jun 3, 2026

View reviewed changes

Merge branch 'main' into feat/export-query-to-file

724eb1d

michael-ruder merged commit 84829bb into main Jun 4, 2026
8 checks passed

cursor Bot reviewed Jun 4, 2026

View reviewed changes

Conversation

michael-ruder commented Jun 2, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

How

Files

Validation

Follow-ups (not in this PR)

Uh oh!

michael-ruder commented Jun 2, 2026

Uh oh!

beipang Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

michael-ruder commented Jun 2, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

michael-ruder commented Jun 2, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

cursor Bot Jun 3, 2026

Choose a reason for hiding this comment

Chunk loop stops on empty links

Uh oh!

beipang Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

beipang Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Jun 4, 2026

Choose a reason for hiding this comment

Ignores manifest truncated flag

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

michael-ruder commented Jun 2, 2026 •

edited by cursor Bot

Loading

beipang Jun 3, 2026 •

edited

Loading