[DS-1639] feat(noom-mcp-server): add export_query_to_file for bulk result export#19
Conversation
Returning large query results through the MCP serializes them into the model's
context window. A real analytical query (the Meristem ups_ex360_med grid) came
back as ~68 KB / ~37,000 tokens and overflowed. This is structural, not a
one-off: a tool result is tokenized into context by design, so any non-trivial
extract — by any engineer, on any project — hits the same wall. It also makes
the model a lossy conduit for data that downstream tools (viz, the model app)
need exact and whole.
This adds a customization-layer tool, export_query_to_file, that has the SERVER
write results to a local file and return only a manifest
({path, row_count, bytes, columns, format}). The rows never enter the model
context — local tools read the file off disk. Pass the path, not the payload.
Mechanism:
- Does NOT reuse SQLExecutor.execute(): that path uses the INLINE disposition
and reads only result.data_array (the first chunk), silently truncating
large results.
- Issues the statement with disposition=EXTERNAL_LINKS, format=CSV. Databricks
writes CSV chunk files to cloud storage and returns presigned URLs; the
server streams each chunk to disk, following next_chunk_index. Memory-
bounded, no truncation. The header arrives in the first chunk; an empty
result yields a header-only file from the manifest schema.
- Reuses the existing governance chokepoints (get_sql_sp_client,
get_sql_warehouse_id, get_mcp_user_identity): SP execution, forced warehouse
override, and mcp_user:<identity> tagging, plus a mcp_tool tag.
- Sandboxes writes to DATABRICKS_MCP_EXPORT_DIR (default
~/databricks-mcp-exports); rejects .. traversal and absolute escapes.
Registered in run.py before the tool allowlist (Step 2b) so the allowlist stays
the single source of truth for what is exposed; the tool name is added to
ALLOWED_TOOLS. No upstream files touched, per the fork's governance model.
Validation: 14 new unit tests; full suite 51 passed / 1 skipped; lint clean;
live-validated against the warehouse (synthetic deterministic result, a real
122-column table slice, and an empty result -> header-only file). The live test
caught and fixed a duplicate-header bug (we initially synthesized a header that
the CSV chunks already include).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CI runs `ruff format --check` in addition to `ruff check`; the new module needed reformatting (line wrapping only, no logic change). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| `INLINE` disposition and reads only the first chunk — it truncates large | ||
| results). Instead it issues the statement with `disposition=EXTERNAL_LINKS` | ||
| and `format=CSV`. Databricks writes the result to cloud storage as CSV chunk | ||
| files and returns presigned download URLs. |
There was a problem hiding this comment.
One small Q: What is the retention and lifecycle for the CSV files persisted in Databricks?
Review follow-up (PR #19): the only artifact this tool owns with a real lifecycle is the LOCAL CSV — Databricks-side cloud-fetch staging is transient and platform-managed (presigned links expire ~15 min, measured). The local files previously had no retention; since exports can contain prod PII, an unmanaged pile of CSVs is a governance concern. Adds sweep_old_exports(): removes files in the export dir older than DATABRICKS_MCP_EXPORT_RETENTION_DAYS (default 7; 0 disables), pruning emptied subdirectories. Runs at startup and before each export. Sweep errors are logged, never raised, so cleanup can't block an export. Also documents the env var in .env.example and adds a "Data lifecycle" section to the design note distinguishing the transient Databricks staging from the local file. Tests: +4 (old removed / recent kept, disabled at <=0, empty-dir prune, env parsing). Full suite 55 passed / 1 skipped; ruff check + format clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@beipang — good question. There are two distinct "CSV" artifacts here, so let me separate them, because only one is ours to manage: 1. Databricks-side (the cloud-fetch staging in "How it works"). With 2. Local-side (the actual deliverable). The Addressed in
tl;dr: nothing durable is persisted in Databricks by this tool; the only thing we own is the local file, and it's now retention-bounded with PII in mind. |
|
Correction to my earlier comment (re: the "~15 minute" figure) — I want to keep the record accurate. I checked the official Databricks docs and they do not document a specific expiration/garbage-collection duration. Two things to fix in what I said above:
The accurate, defensible statement:
None of this changes the design: we still persist nothing durable or self-owned in Databricks, and the only artifact with a lifecycle we manage is the local file (now retention-bounded). I'll tighten the wording in Sources (neither states a duration):
|
bugbot (PR #19): _write_csv opened output_path with "wb", truncating it immediately, then streamed chunk downloads. A later chunk failure, an expired presigned link, or the tool's wall-clock ceiling firing mid-stream would leave a partial/empty CSV at the real path — and with overwrite=true, destroy the prior good export. A half-written CSV that looks complete is exactly the silent corruption this tool exists to prevent. Now streams to a temp file in the destination directory and os.replace()s it into place only after the last chunk is written (atomic on the same filesystem). On any BaseException (including timeout/cancellation) the temp file is removed and dest is left untouched. Tests: +2 (mid-stream failure preserves prior file and leaves no .part; failure creates no destination when none existed). Full suite 57 passed / 1 skipped; ruff check + format clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nks lazily Two bugbot findings (PR #19): 1. Statement timeout exceeds tool ceiling. _run_export polled on the raw caller timeout, so a value above the 600s FastMCP ceiling let the tool call be aborted before our cancellation fired — leaving the statement running on the warehouse (orphaned query). Added _clamp_statement_timeout(): the per-call timeout is now bounded by EXPORT_TOOL_TIMEOUT_CEILING so the poll loop cancels within the ceiling. (Same orphaned-query concern as #18.) 2. Presigned URLs expire during download. We collected every external link up front, then downloaded — so on a long multi-chunk export the links issued first could expire (~15 min) before the later downloads. Replaced _collect_external_links with _iter_external_links, a lazy generator that the writer drives: each chunk's link is fetched and downloaded just-in-time, so the URL is freshly issued when used. The at-risk window is now one chunk, not the whole export. (The prior atomic-write fix already prevents a truncated CSV from being reported as success — a failed download raises and leaves dest untouched; this makes long exports actually complete rather than fail.) _write_csv now consumes an iterable lazily (downloads as it iterates) and writes a header-only file when the iterable is empty. Tests: +clamp + lazy-iteration (asserts the next chunk isn't fetched until the prior link is consumed). Full suite 59 passed / 1 skipped; ruff clean; re-validated live (synthetic, 122-col real table, clamp). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…n sweep Two bugbot findings (PR #19): 1. Overwrite checked before query (TOCTOU). The existence check runs at the top of _run_export, before the (possibly long) query, but _write_csv ended with an unconditional os.replace — so a file appearing during the query was clobbered despite overwrite=false. The early check stays as a fast-fail; the authoritative guard is now atomic at the move step: overwrite=true uses os.replace (clobber), overwrite=false uses os.link (fails if dest exists). 2. Retention sweep sandbox escape. sweep_old_exports walked with base.rglob("*"). On Python 3.13 (recurse_symlinks=False) this does NOT actually reach files behind a directory symlink — verified empirically — so it was not exploitable on our interpreter. But a delete path shouldn't depend on a version-specific glob default, and the prune pass followed dir symlinks via is_dir(). Rewrote with os.walk(followlinks=False) + explicit symlink skip + a containment check (resolved path must stay under base). Now sandbox-safe regardless of version. Tests: +4 (no-clobber when overwrite=false / clobber when true; dir-symlink not descended; file-symlink target untouched). Full suite 63 passed / 1 skipped; ruff clean; re-validated live (export + overwrite=false refusal). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bugbot (PR #19): resolve_export_path validates containment before the (long) query, but _write_csv never re-checked — so if dest.parent became a symlink pointing outside DATABRICKS_MCP_EXPORT_DIR during the query, mkstemp/os.link/ os.replace would follow it and write outside the sandbox while the manifest still reported the intended path. (Low real-world severity — local single-user server, so the actor needs fs write access as the user — but it's the same class we just hardened on the delete path, so worth closing for consistency.) _write_csv now takes base_dir and, immediately before writing, re-resolves dest.parent and re-checks containment, refusing to write if it resolves outside base. Containment-based (a symlink resolving back inside base is still allowed), not a blanket symlink ban. _run_export captures the base once and passes it to both resolve_export_path and _write_csv. Not a full TOCTOU-free solution (openat + O_NOFOLLOW per component) — that's over-engineering for a local dev tool; this narrows the window to near-zero and catches symlink-to-outside. Tests: +2 (parent symlink outside base refused / inside base allowed). Full suite 65 passed / 1 skipped; ruff clean; re-validated live. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Updates the design note to the shipped behavior after the bugbot review round: atomic write, no-clobber overwrite guard, lazy just-in-time link fetching, timeout clamp, write-time sandbox re-check, and the retention/PII lifecycle. Adds a correctness/safety property table, corrects the presigned-link "~15 min" wording to an observation (not a documented SLA), and updates file/test counts (6 files, 65 tests). Includes a changelog of what changed since the prior draft. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| break | ||
| result = client.statement_execution.get_statement_result_chunk_n( | ||
| statement_id=statement_id, chunk_index=nxt | ||
| ) |
There was a problem hiding this comment.
Chunk loop stops on empty links
Medium Severity
_iter_external_links only continues the while loop when result.external_links is truthy. If a paginated chunk response has no links but still sets next_chunk_index, the loop exits without fetching later chunks, so the CSV can be truncated while the manifest row_count still reflects the full result.
Reviewed by Cursor Bugbot for commit 4be1cc1. Configure here.
There was a problem hiding this comment.
AI is telling me that this would work. But it is true that for paginated read, the "next_chunk_index" is usually used for the loop termination condition.
Removes the time-based retention feature (sweep_old_exports, get_export_retention_days, DATABRICKS_MCP_EXPORT_RETENTION_DAYS, and the startup + per-export sweep calls). It was added in response to a review question, justified by "exports can contain prod PII" — but that premise is wrong here: this MCP masks PII at source, which is exactly what makes local downloads and cross-model work acceptable. And auto-deleting prior exports works directly against reproducibility of analyses, which the team values. Exported files now persist until the engineer deletes them. The export dir is the user's to manage. Removing the sweep also moots the sweep-sandbox-escape review finding (no sweep, nothing to harden). .env.example and the design note updated accordingly (Data lifecycle now states no automatic cleanup, with the PII-masking + reproducibility rationale). Tests for the sweep removed. Full suite 59 passed / 1 skipped; ruff check + format clean; live-validated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| """Yield external links in row order, following the chunk chain lazily. | ||
|
|
||
| Laziness is deliberate. The presigned ``external_link`` URLs are short-lived | ||
| (Databricks issues them per response and they expire — observed ~15 min). If |
There was a problem hiding this comment.
Found more details: https://databricks-sdk-py.readthedocs.io/en/stable/workspace/sql/statement_execution.html
"The results are only available for one hour after success"
A few more limitations:
Statements with disposition=EXTERNAL_LINKS are limited to 100 GiB. Result sets larger than this limit will be truncated. Truncation is indicated by the truncated field in the result manifest. - The maximum query text size is 16 MiB. - Cancelation might silently fail.
More about the request:
Not really related to your change but interesting.
A successful response from a cancel request indicates that the cancel request was successfully received and sent to the processing engine. However, an outstanding statement might have already completed execution when the cancel request arrives. Polling for status until a terminal state is reached is a reliable way to determine the final state. - Wait timeouts are approximate, occur server-side, and cannot account for things such as caller delays and network latency from caller to service. - To guarantee that the statement is kept alive, you must poll at least once every 15 minutes. - The results are only available for one hour after success; polling does not extend this.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.
| "bytes": dest.stat().st_size, | ||
| "columns": columns, | ||
| "format": fmt, | ||
| } |
There was a problem hiding this comment.
Ignores manifest truncated flag
Medium Severity
After a successful statement, _run_export always returns a success manifest using manifest.total_row_count and never reads manifest.truncated. Databricks can truncate EXTERNAL_LINKS results (e.g. the documented size cap), so exports can be incomplete while the tool still reports the full row count.
Reviewed by Cursor Bugbot for commit 724eb1d. Configure here.


Why
Returning query results through the MCP serializes them into the model's context window. One real analytical query (the
Meristem ups_ex360_medgrid) came back as 198 × 35 ≈ 68 KB ≈ 37,000 tokens and overflowed.This is structural, not a one-off: a tool result is tokenized into context by design, so any non-trivial extract — by any engineer, on any project — hits the same wall. It also makes the model a lossy conduit for data that downstream tools (viz, the model app) need exact and whole.
What
A customization-layer tool,
export_query_to_file, that has the server write results to a local file and return only a manifest ({path, row_count, bytes, columns, format}). The rows never enter the model context — local tools read the file off disk.How
SQLExecutor.execute()— that path uses theINLINEdisposition and reads onlyresult.data_array(the first chunk), silently truncating large results.disposition=EXTERNAL_LINKS,format=CSV. Databricks writes CSV chunk files to cloud storage and returns presigned URLs; the server streams each chunk to disk followingnext_chunk_index. Memory-bounded, no truncation. Header arrives in the first chunk; empty result → header-only file from the manifest schema.get_sql_sp_client/get_sql_warehouse_id/get_mcp_user_identity: SP execution, forced warehouse override,mcp_user:<identity>tagging, plus amcp_tool:export_query_to_filetag.DATABRICKS_MCP_EXPORT_DIR(default~/databricks-mcp-exports); rejects..traversal and absolute escapes.run.pybefore the allowlist (Step 2b) so the allowlist stays the single source of truth; tool name added toALLOWED_TOOLS. No upstream files touched, same pattern as Raise SQL tool-call timeout ceiling (60s → 600s), keep low default + per-call knob #18.Files
customization/export_query_patch.pycustomization/tool_allowlist_patch.pyexport_query_to_filetoALLOWED_TOOLSrun.pytests/test_export_query_patch.py.env.exampleDATABRICKS_MCP_EXPORT_DIRdocs/export-tool-design.mdValidation
Follow-ups (not in this PR)
Format.ARROW_STREAM+ pyarrow) if a consumer prefers it.🤖 Generated with Claude Code
Note
Medium Risk
Introduces a new local file-write path for full query results (sandboxed but still sensitive data on disk) while reusing existing SP/warehouse/query-tag governance; implementation is large but heavily unit-tested for path safety and atomic writes.
Overview
Adds a governed MCP tool
export_query_to_fileso large query results are written on the server to a sandboxed local CSV and the model only gets a small manifest (path,row_count,bytes,columns,format).Execution uses Databricks
EXTERNAL_LINKS+ CSV (not inlineexecute_sql), with lazy chunk downloads, atomic temp-then-move writes, optional overwrite viaos.link/os.replace, timeout clamp/cancel at 600s, and path confinement underDATABRICKS_MCP_EXPORT_DIR(documented in.env.example). Wiring:register_export_query_toolinrun.py(Step 2b, before allowlist),export_query_to_fileonALLOWED_TOOLS, plustests/test_export_query_patch.pyanddocs/export-tool-design.md.Note:
.env.examplenow has duplicateDATABRICKS_WAREHOUSE_IDblocks (placeholder + default)—likely accidental cleanup outside this feature.Reviewed by Cursor Bugbot for commit 724eb1d. Bugbot is set up for automated code reviews on this repo. Configure here.