feat(cubesql): add disable_post_processing to /v1/cubesql endpoint#11183
Open
rynmccrmck wants to merge 2 commits into
Open
feat(cubesql): add disable_post_processing to /v1/cubesql endpoint#11183rynmccrmck wants to merge 2 commits into
rynmccrmck wants to merge 2 commits into
Conversation
Queries that require DataFusion post-processing fetch intermediate rows from the source DB capped at CUBEJS_DB_QUERY_LIMIT (default 50k). On large datasets this silently truncates the intermediate result, producing wrong aggregates. The sql4sql (/v1/sql?format=sql) path already classifies plans as pushdown/regular/post-processing and reports it to the caller, but never executes — so it cannot truncate. The /v1/cubesql path executes and streams results, so to guarantee accuracy it must block before execution. This adds the same control to /v1/cubesql with a hard guarantee: if the final plan still requires post-processing after the cost model is biased against it, the request fails (the error is delivered in the streamed response body) rather than returning truncated data. Changes: - env.ts: register CUBESQL_DISABLE_POST_PROCESSING env var (server-wide default, off by default) so all services are covered without per-request changes - gateway.ts: read disable_post_processing from request body, fall back to env var, pass through to execSql - sql-server.ts, index.ts: thread disablePostProcessing param to native - node_export.rs: set CUBESQL_PENALIZE_POST_PROCESSING_VAR before planning (biases the cost model toward pushdown) and, after planning, fail if the plan root is not a single CubeScan(Wrapped) node Tests: - smoke-cubesql.test.ts: error path (asserts the post-processing error is surfaced in the streamed body), push-down success, default-off baseline - api-gateway/test/index.test.ts: gateway unit test asserting the param is forwarded to execSql; existing /v1/cubesql call-signature assertions updated for the new trailing argument - backend-shared/test/env.test.ts: unit test for CUBESQL_DISABLE_POST_PROCESSING
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes here.
Queries that require DataFusion post-processing fetch intermediate rows from the source DB capped at CUBEJS_DB_QUERY_LIMIT (default 50k). On large datasets this silently truncates the intermediate result, producing wrong aggregates.
The sql4sql (/v1/sql?format=sql) path already classifies plans as pushdown/regular/post-processing and reports it to the caller, but never executes so it cant truncate. The /v1/cubesql path executes and streams results, so to guarantee accuracy it should block before execution.
This adds the same control to /v1/cubesql with a hard guarantee: if the final plan still requires post-processing after the cost model is biased against it, the request fails (the error is delivered in the streamed response body) rather than returning truncated data.
Changes:
Tests:
Check List