fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2
fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2poche123 wants to merge 5 commits into
Conversation
…ion, pipeline robustness Surface column values into the grounding/generation prompts and harden the query + ingestion pipelines so the NL→SQL agents stop guessing literals and the control flow stops reporting wrong-but-empty results as confident answers. Grounding / value anchoring: - profile: store the full distinct value set for low-cardinality columns (was top-5) - retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the grounding and generation prompts; instructions to copy WHERE/CASE literals verbatim, handle tall/EAV layouts and scaled-duplicate measures - structurally detect self-referencing hierarchy columns (value containment, with a row-wise discriminator) and teach "a person's team/reports/structure" -> filter the hierarchy column Control flow / governance: - treat 0-row / degenerate aggregate results as suspicious -> clarification + confidence cap - gate auto-learning on row_count > 0 (stop poisoning the example store) - prefer a candidate that passes the firewall and returns non-empty rows over candidates[0] - exclude CTE names from AST table_refs (firewall no longer rejects valid CTE SQL) - snapshot-scope retrieval; honor and persist conversation snapshot pins - unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards - thread a re-firewalled extra_filter through the semantic-metric bind() Ingestion robustness / determinism: - header-quality scoring (skip numeric spacer rows) + stable leading-blank handling - temperature=0 for the naming/describe agents (cuts re-ingest schema churn ~94%) - rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors) - index column values into embeddings + content_tsv - warn when the cross-encoder reranker degrades to a no-op
- builder.py is a lock-step (canon-pinned) module; revert it and pass the
determinism setting via the existing extra_settings={"temperature": 0.0}
instead of a new parameter, so the lockstep gate stays green
- move the table-kinds lookup out of the query controller into TableResolver
(service layer) so the "no raw SQL in controllers" gate passes
- ruff: inline a negated return (SIM103), drop a forward-ref quote (UP037),
and apply ruff format
- tests: extend the fake TableResolvers with pins / current_snapshots /
table_kinds_by_name; seed current_snapshot_id in the retrieval integration
fixture (retrieval is now scoped to current_snapshot_id, as publish sets it)
Financial exports (Orbis/BvD) place one date-header row above a stack of sub-sections. Section-splitting orphaned that header above each sub-section's title, dropping the year columns and making the detailed financial tables (Ratios de Rentabilidad, Memo lineas, etc.) unqueryable by period. A data-first section now inherits the nearest recent period header as its column row, encoded as a backward-compatible 3-index section path ([header:data_start:end]); contiguous sections keep the legacy 2-index form. Guards against misfires: - period values are full-match only (INV-2024-0007 / Form 2020 are not years) - only a single leftmost label column may precede the inherited years - a label-headed table closes the period band (no stale-header bleed) - phantom null columns are compacted; header-row index is bounds-checked Adds unit coverage for inheritance, band-closing/no-false-inherit, path round-trip and legacy-path parsing.
…anch Reconciles the two parallel lines of the NL→SQL quality work into a single branch that carries every refinement from both, with no capability lost. Kept from this branch (execution-layer robustness): - TableResolver / AstClassifier / profile hardening and the query_controller surface, plus their unit/integration tests. - The XLSX multi-row period-header reconstruction (embedded-year rejection, band-closing, 3-index section paths) and its test. - PII-safe value rendering in the embed corpus. Brought in (the value-anchoring superset): - value_anchoring.py: the dataset-agnostic backbone (per-column value catalogue, value→column + group resolution, entity disambiguation by live row-match count + hierarchy intent, degenerate/zero-row detection, signed (mixed-sign) measure handling incl. the SUM(CASE WHEN cost THEN -m) shape, CTE-aware guards, function firewall). - query_service orchestration: candidate execution-selection, a non-destructive repair loop (never replaces a correct answer with a worse one), zero-row and signed-measure repair, group-coverage advisory. - Deterministic period/date column naming at ingest (year_YYYY / period_YYYY_MM_DD), original-header preservation, and the value-anchoring prompt rules for grounding / generation / critic / explainer. - LexicalReranker as the dependency-free default (a strict win over the no-op). - temperature=0 for the deterministic ingest agents (describe / column-name / rename / relation), applied via the existing extra_settings hook. Cross-line graft: the column hierarchy hint (references_column, computed by the profile stage) is now surfaced in the retrieval column fingerprint, so a self-referencing manager/owner column tells the agent to filter THAT column to find a person's reports instead of their own row. Gate notes: - builder.py stays byte-identical to canon (lockstep) -- temperature=0 rides the existing extra_settings path, no new parameter. - ruff check + format clean; lockstep 21/21. - test_query_service: the _FakeSettings stub gains the value-anchoring knobs (features off, so the orchestration tests stay focused on the core answer() flow), and the retry-on-error test now uses a realistic non-degenerate refined SQL with a table-aware resolver so the critic round is actually exercised.
miguelgfierro
left a comment
There was a problem hiding this comment.
Está bien, he puesto algunos comentarios, pásaselos a la IA
| dataset_id: uuid.UUID, | ||
| table_names: list[str], | ||
| object_store_base: str | None = None, | ||
| pins: dict[str, str] | None = None, |
There was a problem hiding this comment.
resolve() tiene un parámetro pins pensado para fijar una pregunta de drill-down al snapshot del esquema con el que se respondió la primera vez, para que un re-ingest a mitad de conversación no cambie la respuesta sin avisar. Pero ninguno de los tres puntos de llamada en query_service.py (líneas 1166, 1281, 1384) pasa pins=, y current_snapshots() (línea 122) nunca se llama. prior_snapshot_pins simplemente se guarda y se recarga sin cambios (línea 988). Ahora mismo esta función no hace nada — un re-ingest entre dos mensajes de la misma conversación cambiará la respuesta al nuevo esquema sin que nadie se entere.
| END AS score | ||
| FROM flyquery_examples | ||
| WHERE workspace_id = :workspace_id AND quality = 'APPROVED' {ds_filter} | ||
| WHERE workspace_id = :workspace_id AND quality IN ('APPROVED', 'PROPOSED') {ds_filter} |
There was a problem hiding this comment.
Esta consulta ahora trae quality IN ('APPROVED', 'PROPOSED') en lugar de solo 'APPROVED'. Las filas PROPOSED vienen de auto_learner.py, que guarda un ejemplo cada vez que una consulta se ejecutó una vez, sin reintentos, y devolvió más de 0 filas — eso no es lo mismo que "correcta". Estos ejemplos se le muestran al LLM bajo el título "approved Q→SQL examples", sin forma de distinguir lo aprobado de lo no revisado. Una consulta incorrecta pero plausible ahora puede auto-enseñarse a preguntas futuras antes de que ningún humano la revise.
| gen_out = await generation_agent.run( | ||
| {"grounded": grounded, "question": body.question, "starting_point_sql": None} | ||
| gen_run = await generation_agent.run( | ||
| _render_generation_prompt( |
There was a problem hiding this comment.
answer() (la ruta principal /query) pasa los nuevos datos de grounding — catálogo de valores de columna, entidades/grupos resueltos, pistas de jerarquía, pistas de valores — a los constructores de prompts. /explain (línea 398), /validate (línea 465) y la ruta SSE /query/stream (líneas 607 y 641) llaman a los mismos constructores de prompts pero sin pasar nada de esto, así que no reciben ninguno de los aportes de grounding de este PR. El código de streaming incluso tiene un comentario que dice que coincide con la ruta síncrona, y no es cierto. Además, la línea 660 deja fijo snapshot_pins: {} en la respuesta del stream — nunca se rellena de verdad.
| mixed = False | ||
| try: | ||
| if mn is not None and mx is not None and float(mn) < 0 < float(mx): | ||
| mixed = True | ||
| except (TypeError, ValueError): | ||
| pass |
There was a problem hiding this comment.
mixed_sign se marca como verdadero cada vez que el mínimo de una columna es negativo y el máximo es positivo — no se comprueba nada más. Una columna de ingresos con 9.998 filas positivas normales y 2 filas negativas de reembolsos se marca exactamente igual que una columna que de verdad es mitad positiva, mitad negativa por diseño. Esa marca luego dispara una instrucción estricta de "SIGN ERROR — must fix" en query_service.py, incluso en casos donde restar es lo correcto. Conviene comprobar qué porcentaje de filas es negativo, no solo si hay alguna.
| reranker_top_n: int = 30 | ||
| query_expansion_enabled: bool = False | ||
|
|
||
| # Value-anchoring / grounding-quality knobs. The ingest pipeline already |
There was a problem hiding this comment.
por qué la config lo tenemos aquí en vez de en el .env?
| # computes a per-column value catalogue (profile_json.top_values, min/max, | ||
| # distinct_estimate) + a semantic_type; these control how it is surfaced to | ||
| # the SQL writer at query time. All dataset-agnostic. | ||
| value_catalog_enabled: bool = True # render real column values into prompts |
There was a problem hiding this comment.
value_catalog_enabled se define aquí pero nunca se comprueba en ningún sitio — column_value_catalog() siempre se ejecuta, sin importar esta opción. Ahora mismo no hay forma de desactivar esta función. (Relacionado: render_column_catalog_line() en value_anchoring.py línea 150 también está muerta — nada la llama; todos los puntos de llamada reales usan render_catalog_from_meta(). Vale la pena borrar ambas, o conectar la opción si de verdad debe hacer algo.)
| conn = duckdb.connect(":memory:") | ||
| conn.execute("SET threads=2") | ||
| for t in terms: | ||
| mq = t["measure_col"].replace(chr(34), chr(34) * 2) |
There was a problem hiding this comment.
chr(34) es confuso, sería más sencillo usar '"'
What this PR does
Makes flyquery's Text‑to‑SQL actually answer real questions over uploaded
workbooks. It combines value‑anchored grounding, a non‑destructive
self‑repair loop, deterministic ingestion, and execution‑layer
robustness into one branch — every refinement from both lines of this work,
with nothing lost.
The problems we hit (and how we fixed them)
1. It scored 0/9 on a real structured‑data test — not from weak SQL, but from
value‑anchoring. The per‑column value catalogue is computed at ingest
(
profile_json.top_values,sample_values_json,governance_json.semantic_type)but was never read at query time, so the model invented every filter literal
(
'Revenue'vs the stored'Total Revenue', integer2023vs the stored'FY23',Brand='DAPA'vs the realFamily='DAPA').→
value_anchoring.py(new, dataset‑agnostic) renders each column's real valuesinto the grounding/generation/critic prompts, resolves question literals to the
columns that own them, and indexes values into BM25 + vector retrieval. Nothing
hardcodes a column name, label, or domain rule.
2. Wrong queries failed silently. A 0‑row result and a
WHERE 1=0no‑op werestamped
OKand narrated as "no data".→ A non‑destructive repair loop detects zero‑row / degenerate SQL, feeds the real
distinct values back to the critic, and never replaces a correct answer with a
worse one (
best_okis restored if a repair regresses it).3. Hierarchy / multi‑team questions failed. "Who reports to X", "average for
teams A, B and C".
→ Group‑aware value resolution (a term that umbrellas ≥2 catalogued values → the
whole set), entity disambiguation by live row‑match count + hierarchy intent,
and a self‑referencing‑column hint (
references_column, surfaced in the columnfingerprint) that tells the agent to filter the manager column to find a
person's reports rather than their own row.
4. Signed measures gave the wrong total. Operating profit over a column that
already stores some line items as negatives was double‑subtracted.
→ Mixed‑sign detection (
min < 0 < max) plus an observed‑sign probe that rewritesboth the
SUM(...) - SUM(...)chain and the equivalentSUM(CASE WHEN cost THEN -m ...)shape into one correct signed sum.5. Ingestion was non‑deterministic, with a silent off‑by‑one. Dashboard
exports (Orbis/BvD) put the period header in a band above each
label → datasection; the section pass treated a data row as the header, and DuckDB's
content‑based guess made it unstable — mis‑mapping years by one column.
→ Multi‑row period‑header reconstruction (embedded‑year rejection so
INV‑2024‑0007isn't a period, band‑closing, 3‑index section paths), thendeterministic period/date naming (
year_YYYY/period_YYYY_MM_DD) with theoriginal header preserved.
temperature=0for the classification/mapping ingestagents makes column naming stable across re‑ingests.
6. Retrieval truncated by order only. The default reranker was a no‑op.
→
LexicalReranker(dependency‑free token overlap) is the default fallback — astrict win that lets value‑aware retrieval actually reach the prompt when a
cross‑encoder model isn't installed.
Plus pipeline robustness: execution‑based multi‑candidate selection, a CTE‑aware
unknown‑table guard, column‑existence validation, a synthesis‑path function
firewall (blocks
read_csv_auto/pg_read_file/ …), and theTableResolver/AstClassifier/profilehardening.Results
operating profit with the by‑therapeutic‑area breakdown, named team rosters).
previously‑silent off‑by‑one is gone (Venta 2024/2020/2016 =
7,200,102.62 / 3,986,533.08 / 2,284,370.94).
Gates
ruff format --checkclean; lockstep 21/21 (builder.pystays byte‑identical to canon —
temperature=0rides the existingextra_settingshook, no new parameter).as the PR gate). Test fixtures updated: the
_FakeSettingsstub gains thevalue‑anchoring knobs, the retry‑on‑error test exercises a realistic
non‑degenerate refined SQL, and the reranker tests assert the
LexicalRerankerdefault.
All changes are pure
src/+tests/— no Docker/compose, no env, nolocal‑setup files.