Skip to content
This repository was archived by the owner on Jun 11, 2026. It is now read-only.

[DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export#19

Merged
michael-ruder merged 10 commits into
mainfrom
feat/export-query-to-file
Jun 4, 2026
Merged

[DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export#19
michael-ruder merged 10 commits into
mainfrom
feat/export-query-to-file

Conversation

@michael-ruder

@michael-ruder michael-ruder commented Jun 2, 2026

Copy link
Copy Markdown

Why

Returning query results through the MCP serializes them into the model's context window. One real analytical query (the Meristem ups_ex360_med grid) came back as 198 × 35 ≈ 68 KB ≈ 37,000 tokens and overflowed.

This is structural, not a one-off: a tool result is tokenized into context by design, so any non-trivial extract — by any engineer, on any project — hits the same wall. It also makes the model a lossy conduit for data that downstream tools (viz, the model app) need exact and whole.

What

A customization-layer tool, export_query_to_file, that has the server write results to a local file and return only a manifest ({path, row_count, bytes, columns, format}). The rows never enter the model context — local tools read the file off disk.

Pass the path, not the payload. The MCP is a control plane, not a data plane.

before:  warehouse ──▶ MCP ──▶ [ ~37,000 tokens into context ]  💥
after:   warehouse ──▶ MCP server ──▶ disk: results.csv
                            └──▶ context: {path, rows, bytes}  ≈ 50 tokens

How

  • Does NOT reuse SQLExecutor.execute() — that path uses the INLINE disposition and reads only result.data_array (the first chunk), silently truncating large results.
  • Issues the statement with disposition=EXTERNAL_LINKS, format=CSV. Databricks writes CSV chunk files to cloud storage and returns presigned URLs; the server streams each chunk to disk following next_chunk_index. Memory-bounded, no truncation. Header arrives in the first chunk; empty result → header-only file from the manifest schema.
  • Governance preserved — reuses get_sql_sp_client / get_sql_warehouse_id / get_mcp_user_identity: SP execution, forced warehouse override, mcp_user:<identity> tagging, plus a mcp_tool:export_query_to_file tag.
  • Path-sandboxed to DATABRICKS_MCP_EXPORT_DIR (default ~/databricks-mcp-exports); rejects .. traversal and absolute escapes.
  • Registered in run.py before the allowlist (Step 2b) so the allowlist stays the single source of truth; tool name added to ALLOWED_TOOLS. No upstream files touched, same pattern as Raise SQL tool-call timeout ceiling (60s → 600s), keep low default + per-call knob #18.

Files

File Change
customization/export_query_patch.py new — tool, chunk streaming, path sandbox
customization/tool_allowlist_patch.py add export_query_to_file to ALLOWED_TOOLS
run.py Step 2b — register before allowlist
tests/test_export_query_patch.py new — 14 unit tests
.env.example document DATABRICKS_MCP_EXPORT_DIR
docs/export-tool-design.md full design note (rationale + alternatives considered)

Validation

  • 14 new unit tests; full suite 51 passed / 1 skipped; lint clean.
  • Live-validated against the warehouse: synthetic deterministic result, a real 122-column table slice, and an empty result → header-only file.
  • The live test caught and fixed a duplicate-header bug (we initially synthesized a header the CSV chunks already include) — exactly the kind of thing unit tests with mocks would have missed.

Follow-ups (not in this PR)

  • Optional Parquet output (Format.ARROW_STREAM + pyarrow) if a consumer prefers it.
  • Engineers must restart their MCP client to pick up the new tool.

🤖 Generated with Claude Code


Note

Medium Risk
Introduces a new local file-write path for full query results (sandboxed but still sensitive data on disk) while reusing existing SP/warehouse/query-tag governance; implementation is large but heavily unit-tested for path safety and atomic writes.

Overview
Adds a governed MCP tool export_query_to_file so large query results are written on the server to a sandboxed local CSV and the model only gets a small manifest (path, row_count, bytes, columns, format).

Execution uses Databricks EXTERNAL_LINKS + CSV (not inline execute_sql), with lazy chunk downloads, atomic temp-then-move writes, optional overwrite via os.link/os.replace, timeout clamp/cancel at 600s, and path confinement under DATABRICKS_MCP_EXPORT_DIR (documented in .env.example). Wiring: register_export_query_tool in run.py (Step 2b, before allowlist), export_query_to_file on ALLOWED_TOOLS, plus tests/test_export_query_patch.py and docs/export-tool-design.md.

Note: .env.example now has duplicate DATABRICKS_WAREHOUSE_ID blocks (placeholder + default)—likely accidental cleanup outside this feature.

Reviewed by Cursor Bugbot for commit 724eb1d. Bugbot is set up for automated code reviews on this repo. Configure here.

michael-ruder and others added 2 commits June 2, 2026 14:52
Returning large query results through the MCP serializes them into the model's
context window. A real analytical query (the Meristem ups_ex360_med grid) came
back as ~68 KB / ~37,000 tokens and overflowed. This is structural, not a
one-off: a tool result is tokenized into context by design, so any non-trivial
extract — by any engineer, on any project — hits the same wall. It also makes
the model a lossy conduit for data that downstream tools (viz, the model app)
need exact and whole.

This adds a customization-layer tool, export_query_to_file, that has the SERVER
write results to a local file and return only a manifest
({path, row_count, bytes, columns, format}). The rows never enter the model
context — local tools read the file off disk. Pass the path, not the payload.

Mechanism:
  - Does NOT reuse SQLExecutor.execute(): that path uses the INLINE disposition
    and reads only result.data_array (the first chunk), silently truncating
    large results.
  - Issues the statement with disposition=EXTERNAL_LINKS, format=CSV. Databricks
    writes CSV chunk files to cloud storage and returns presigned URLs; the
    server streams each chunk to disk, following next_chunk_index. Memory-
    bounded, no truncation. The header arrives in the first chunk; an empty
    result yields a header-only file from the manifest schema.
  - Reuses the existing governance chokepoints (get_sql_sp_client,
    get_sql_warehouse_id, get_mcp_user_identity): SP execution, forced warehouse
    override, and mcp_user:<identity> tagging, plus a mcp_tool tag.
  - Sandboxes writes to DATABRICKS_MCP_EXPORT_DIR (default
    ~/databricks-mcp-exports); rejects .. traversal and absolute escapes.

Registered in run.py before the tool allowlist (Step 2b) so the allowlist stays
the single source of truth for what is exposed; the tool name is added to
ALLOWED_TOOLS. No upstream files touched, per the fork's governance model.

Validation: 14 new unit tests; full suite 51 passed / 1 skipped; lint clean;
live-validated against the warehouse (synthetic deterministic result, a real
122-column table slice, and an empty result -> header-only file). The live test
caught and fixed a duplicate-header bug (we initially synthesized a header that
the CSV chunks already include).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI runs `ruff format --check` in addition to `ruff check`; the new module
needed reformatting (line wrapping only, no logic change).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@michael-ruder michael-ruder changed the title feat(noom-mcp-server): add export_query_to_file for bulk result export [DS-1699] feat(noom-mcp-server): add export_query_to_file for bulk result export Jun 2, 2026
@michael-ruder

Copy link
Copy Markdown
Author

`INLINE` disposition and reads only the first chunk — it truncates large
results). Instead it issues the statement with `disposition=EXTERNAL_LINKS`
and `format=CSV`. Databricks writes the result to cloud storage as CSV chunk
files and returns presigned download URLs.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small Q: What is the retention and lifecycle for the CSV files persisted in Databricks?

Review follow-up (PR #19): the only artifact this tool owns with a real
lifecycle is the LOCAL CSV — Databricks-side cloud-fetch staging is transient
and platform-managed (presigned links expire ~15 min, measured). The local
files previously had no retention; since exports can contain prod PII, an
unmanaged pile of CSVs is a governance concern.

Adds sweep_old_exports(): removes files in the export dir older than
DATABRICKS_MCP_EXPORT_RETENTION_DAYS (default 7; 0 disables), pruning emptied
subdirectories. Runs at startup and before each export. Sweep errors are logged,
never raised, so cleanup can't block an export.

Also documents the env var in .env.example and adds a "Data lifecycle" section
to the design note distinguishing the transient Databricks staging from the
local file.

Tests: +4 (old removed / recent kept, disabled at <=0, empty-dir prune, env
parsing). Full suite 55 passed / 1 skipped; ruff check + format clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@michael-ruder

Copy link
Copy Markdown
Author

@beipang — good question. There are two distinct "CSV" artifacts here, so let me separate them, because only one is ours to manage:

1. Databricks-side (the cloud-fetch staging in "How it works"). With disposition=EXTERNAL_LINKS, Databricks stages the result to its own managed cloud storage and returns presigned URLs. These are transient and platform-managed — I measured the presigned-link expiration at ~15 minutes, and the staged result only lives under Databricks' statement-result retention before Databricks purges it. We don't pick the location, create a table/volume, or control that lifecycle. Nothing for us to retain or clean up there.

2. Local-side (the actual deliverable). The .csv lands on the engineer's disk at DATABRICKS_MCP_EXPORT_DIR (default ~/databricks-mcp-exports). This is the one with a real lifecycle question — and you're right that it needed one. Exports can contain prod PII, so an unmanaged pile of local CSVs is a governance concern, not just disk hygiene.

Addressed in 35c1098:

  • sweep_old_exports() deletes files older than DATABRICKS_MCP_EXPORT_RETENTION_DAYS (default 7, 0 disables), pruning emptied subdirs. Runs at startup and before each export; sweep errors are logged, never raised, so cleanup can't block an export.
  • Documented the env var in .env.example and added a Data lifecycle table to docs/export-tool-design.md spelling out the two artifacts above.
  • +4 unit tests (old removed / recent kept, disabled at ≤0, empty-dir prune, env parsing).

tl;dr: nothing durable is persisted in Databricks by this tool; the only thing we own is the local file, and it's now retention-bounded with PII in mind.

Comment thread noom-mcp-server/customization/export_query_patch.py Outdated
Comment thread noom-mcp-server/customization/export_query_patch.py
Comment thread noom-mcp-server/customization/export_query_patch.py Outdated
@michael-ruder

Copy link
Copy Markdown
Author

Correction to my earlier comment (re: the "~15 minute" figure) — I want to keep the record accurate.

I checked the official Databricks docs and they do not document a specific expiration/garbage-collection duration. Two things to fix in what I said above:

  1. "~15 minutes" was my own measurement, not a documented value. I read the expiration timestamp on a single live EXTERNAL_LINKS response and it came back ~15 min out. That's one observation, not a guaranteed SLA — it can vary by cloud/region/API version.
  2. "garbage collection" was the wrong framing. What I measured is the presigned-URL validity window, which is distinct from when Databricks purges the underlying staged result (which the docs don't quantify at all).

The accurate, defensible statement:

The EXTERNAL_LINKS presigned URLs are short-lived; each response carries an explicit expiration timestamp (we observed ~15 min in testing). Consumers should treat that field as the source of truth rather than assume a fixed duration. The underlying staged result is Databricks-managed and not fetchable after the statement closes; Databricks does not publish a specific retention/GC interval.

None of this changes the design: we still persist nothing durable or self-owned in Databricks, and the only artifact with a lifecycle we manage is the local file (now retention-bounded). I'll tighten the wording in docs/export-tool-design.md to match — attributing the 15 min as an observation and pointing at the per-response expiration field.

Sources (neither states a duration):

michael-ruder and others added 2 commits June 2, 2026 17:36
bugbot (PR #19): _write_csv opened output_path with "wb", truncating it
immediately, then streamed chunk downloads. A later chunk failure, an expired
presigned link, or the tool's wall-clock ceiling firing mid-stream would leave a
partial/empty CSV at the real path — and with overwrite=true, destroy the prior
good export. A half-written CSV that looks complete is exactly the silent
corruption this tool exists to prevent.

Now streams to a temp file in the destination directory and os.replace()s it
into place only after the last chunk is written (atomic on the same filesystem).
On any BaseException (including timeout/cancellation) the temp file is removed
and dest is left untouched.

Tests: +2 (mid-stream failure preserves prior file and leaves no .part; failure
creates no destination when none existed). Full suite 57 passed / 1 skipped;
ruff check + format clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nks lazily

Two bugbot findings (PR #19):

1. Statement timeout exceeds tool ceiling. _run_export polled on the raw caller
   timeout, so a value above the 600s FastMCP ceiling let the tool call be
   aborted before our cancellation fired — leaving the statement running on the
   warehouse (orphaned query). Added _clamp_statement_timeout(): the per-call
   timeout is now bounded by EXPORT_TOOL_TIMEOUT_CEILING so the poll loop cancels
   within the ceiling. (Same orphaned-query concern as #18.)

2. Presigned URLs expire during download. We collected every external link up
   front, then downloaded — so on a long multi-chunk export the links issued
   first could expire (~15 min) before the later downloads. Replaced
   _collect_external_links with _iter_external_links, a lazy generator that the
   writer drives: each chunk's link is fetched and downloaded just-in-time, so
   the URL is freshly issued when used. The at-risk window is now one chunk, not
   the whole export. (The prior atomic-write fix already prevents a truncated CSV
   from being reported as success — a failed download raises and leaves dest
   untouched; this makes long exports actually complete rather than fail.)

_write_csv now consumes an iterable lazily (downloads as it iterates) and writes
a header-only file when the iterable is empty.

Tests: +clamp + lazy-iteration (asserts the next chunk isn't fetched until the
prior link is consumed). Full suite 59 passed / 1 skipped; ruff clean;
re-validated live (synthetic, 122-col real table, clamp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread noom-mcp-server/customization/export_query_patch.py Outdated
Comment thread noom-mcp-server/customization/export_query_patch.py
…n sweep

Two bugbot findings (PR #19):

1. Overwrite checked before query (TOCTOU). The existence check runs at the top
   of _run_export, before the (possibly long) query, but _write_csv ended with an
   unconditional os.replace — so a file appearing during the query was clobbered
   despite overwrite=false. The early check stays as a fast-fail; the
   authoritative guard is now atomic at the move step: overwrite=true uses
   os.replace (clobber), overwrite=false uses os.link (fails if dest exists).

2. Retention sweep sandbox escape. sweep_old_exports walked with base.rglob("*").
   On Python 3.13 (recurse_symlinks=False) this does NOT actually reach files
   behind a directory symlink — verified empirically — so it was not exploitable
   on our interpreter. But a delete path shouldn't depend on a version-specific
   glob default, and the prune pass followed dir symlinks via is_dir(). Rewrote
   with os.walk(followlinks=False) + explicit symlink skip + a containment check
   (resolved path must stay under base). Now sandbox-safe regardless of version.

Tests: +4 (no-clobber when overwrite=false / clobber when true; dir-symlink not
descended; file-symlink target untouched). Full suite 63 passed / 1 skipped;
ruff clean; re-validated live (export + overwrite=false refusal).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@michael-ruder michael-ruder changed the title [DS-1699] feat(noom-mcp-server): add export_query_to_file for bulk result export [DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export Jun 3, 2026
Comment thread noom-mcp-server/customization/export_query_patch.py Outdated
michael-ruder and others added 2 commits June 3, 2026 10:03
bugbot (PR #19): resolve_export_path validates containment before the (long)
query, but _write_csv never re-checked — so if dest.parent became a symlink
pointing outside DATABRICKS_MCP_EXPORT_DIR during the query, mkstemp/os.link/
os.replace would follow it and write outside the sandbox while the manifest
still reported the intended path. (Low real-world severity — local single-user
server, so the actor needs fs write access as the user — but it's the same class
we just hardened on the delete path, so worth closing for consistency.)

_write_csv now takes base_dir and, immediately before writing, re-resolves
dest.parent and re-checks containment, refusing to write if it resolves outside
base. Containment-based (a symlink resolving back inside base is still allowed),
not a blanket symlink ban. _run_export captures the base once and passes it to
both resolve_export_path and _write_csv.

Not a full TOCTOU-free solution (openat + O_NOFOLLOW per component) — that's
over-engineering for a local dev tool; this narrows the window to near-zero and
catches symlink-to-outside.

Tests: +2 (parent symlink outside base refused / inside base allowed). Full
suite 65 passed / 1 skipped; ruff clean; re-validated live.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Updates the design note to the shipped behavior after the bugbot review round:
atomic write, no-clobber overwrite guard, lazy just-in-time link fetching,
timeout clamp, write-time sandbox re-check, and the retention/PII lifecycle.
Adds a correctness/safety property table, corrects the presigned-link "~15 min"
wording to an observation (not a documented SLA), and updates file/test counts
(6 files, 65 tests). Includes a changelog of what changed since the prior draft.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread noom-mcp-server/customization/export_query_patch.py
break
result = client.statement_execution.get_statement_result_chunk_n(
statement_id=statement_id, chunk_index=nxt
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chunk loop stops on empty links

Medium Severity

_iter_external_links only continues the while loop when result.external_links is truthy. If a paginated chunk response has no links but still sets next_chunk_index, the loop exits without fetching later chunks, so the CSV can be truncated while the manifest row_count still reflects the full result.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4be1cc1. Configure here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI is telling me that this would work. But it is true that for paginated read, the "next_chunk_index" is usually used for the loop termination condition.

Removes the time-based retention feature (sweep_old_exports,
get_export_retention_days, DATABRICKS_MCP_EXPORT_RETENTION_DAYS, and the
startup + per-export sweep calls). It was added in response to a review
question, justified by "exports can contain prod PII" — but that premise is
wrong here: this MCP masks PII at source, which is exactly what makes local
downloads and cross-model work acceptable. And auto-deleting prior exports
works directly against reproducibility of analyses, which the team values.

Exported files now persist until the engineer deletes them. The export dir is
the user's to manage. Removing the sweep also moots the sweep-sandbox-escape
review finding (no sweep, nothing to harden).

.env.example and the design note updated accordingly (Data lifecycle now states
no automatic cleanup, with the PII-masking + reproducibility rationale). Tests
for the sweep removed.

Full suite 59 passed / 1 skipped; ruff check + format clean; live-validated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
"""Yield external links in row order, following the chunk chain lazily.

Laziness is deliberate. The presigned ``external_link`` URLs are short-lived
(Databricks issues them per response and they expire — observed ~15 min). If

@beipang beipang Jun 3, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Found more details: https://databricks-sdk-py.readthedocs.io/en/stable/workspace/sql/statement_execution.html

"The results are only available for one hour after success"

A few more limitations:

Statements with disposition=EXTERNAL_LINKS are limited to 100 GiB. Result sets larger than this limit will be truncated. Truncation is indicated by the truncated field in the result manifest. - The maximum query text size is 16 MiB. - Cancelation might silently fail.

More about the request:
Not really related to your change but interesting.

A successful response from a cancel request indicates that the cancel request was successfully received and sent to the processing engine. However, an outstanding statement might have already completed execution when the cancel request arrives. Polling for status until a terminal state is reached is a reliable way to determine the final state. - Wait timeouts are approximate, occur server-side, and cannot account for things such as caller delays and network latency from caller to service. - To guarantee that the statement is kept alive, you must poll at least once every 15 minutes. - The results are only available for one hour after success; polling does not extend this.

@michael-ruder michael-ruder merged commit 84829bb into main Jun 4, 2026
8 checks passed

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.

"bytes": dest.stat().st_size,
"columns": columns,
"format": fmt,
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ignores manifest truncated flag

Medium Severity

After a successful statement, _run_export always returns a success manifest using manifest.total_row_count and never reads manifest.truncated. Databricks can truncate EXTERNAL_LINKS results (e.g. the documented size cap), so exports can be incomplete while the tool still reports the full row count.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants