fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness by poche123 · Pull Request #2 · firefly-operationOS/flyquery

poche123 · 2026-06-02T08:09:32Z

What this PR does

Makes flyquery's Text‑to‑SQL actually answer real questions over uploaded
workbooks. It combines value‑anchored grounding, a non‑destructive
self‑repair loop, deterministic ingestion, and execution‑layer
robustness into one branch — every refinement from both lines of this work,
with nothing lost.

The problems we hit (and how we fixed them)

1. It scored 0/9 on a real structured‑data test — not from weak SQL, but from
value‑anchoring. The per‑column value catalogue is computed at ingest
(profile_json.top_values, sample_values_json, governance_json.semantic_type)
but was never read at query time, so the model invented every filter literal
('Revenue' vs the stored 'Total Revenue', integer 2023 vs the stored
'FY23', Brand='DAPA' vs the real Family='DAPA').
→ value_anchoring.py (new, dataset‑agnostic) renders each column's real values
into the grounding/generation/critic prompts, resolves question literals to the
columns that own them, and indexes values into BM25 + vector retrieval. Nothing
hardcodes a column name, label, or domain rule.

2. Wrong queries failed silently. A 0‑row result and a WHERE 1=0 no‑op were
stamped OK and narrated as "no data".
→ A non‑destructive repair loop detects zero‑row / degenerate SQL, feeds the real
distinct values back to the critic, and never replaces a correct answer with a
worse one (best_ok is restored if a repair regresses it).

3. Hierarchy / multi‑team questions failed. "Who reports to X", "average for
teams A, B and C".
→ Group‑aware value resolution (a term that umbrellas ≥2 catalogued values → the
whole set), entity disambiguation by live row‑match count + hierarchy intent,
and a self‑referencing‑column hint (references_column, surfaced in the column
fingerprint) that tells the agent to filter the manager column to find a
person's reports rather than their own row.

4. Signed measures gave the wrong total. Operating profit over a column that
already stores some line items as negatives was double‑subtracted.
→ Mixed‑sign detection (min < 0 < max) plus an observed‑sign probe that rewrites
both the SUM(...) - SUM(...) chain and the equivalent
SUM(CASE WHEN cost THEN -m ...) shape into one correct signed sum.

5. Ingestion was non‑deterministic, with a silent off‑by‑one. Dashboard
exports (Orbis/BvD) put the period header in a band above each label → data
section; the section pass treated a data row as the header, and DuckDB's
content‑based guess made it unstable — mis‑mapping years by one column.
→ Multi‑row period‑header reconstruction (embedded‑year rejection so
INV‑2024‑0007 isn't a period, band‑closing, 3‑index section paths), then
deterministic period/date naming (year_YYYY / period_YYYY_MM_DD) with the
original header preserved. temperature=0 for the classification/mapping ingest
agents makes column naming stable across re‑ingests.

6. Retrieval truncated by order only. The default reranker was a no‑op.
→ LexicalReranker (dependency‑free token overlap) is the default fallback — a
strict win that lets value‑aware retrieval actually reach the prompt when a
cross‑encoder model isn't installed.

Plus pipeline robustness: execution‑based multi‑candidate selection, a CTE‑aware
unknown‑table guard, column‑existence validation, a synthesis‑path function
firewall (blocks read_csv_auto / pg_read_file / …), and the
TableResolver / AstClassifier / profile hardening.

Results

AstraZeneca structured‑data test: 0/9 → strong (exact revenue/CAGR/ratio,
operating profit with the by‑therapeutic‑area breakdown, named team rosters).
Orbis / Moody's (IVI Málaga) financial profile: 9/9 exact, and the
previously‑silent off‑by‑one is gone (Venta 2024/2020/2016 =
7,200,102.62 / 3,986,533.08 / 2,284,370.94).

Gates

ruff check + ruff format --check clean; lockstep 21/21 (builder.py
stays byte‑identical to canon — temperature=0 rides the existing
extra_settings hook, no new parameter).
Unit 351 passed, integration 97 passed / 0 failed locally (same config
as the PR gate). Test fixtures updated: the _FakeSettings stub gains the
value‑anchoring knobs, the retry‑on‑error test exercises a realistic
non‑degenerate refined SQL, and the reranker tests assert the LexicalReranker
default.

All changes are pure src/ + tests/ — no Docker/compose, no env, no
local‑setup files.

…ion, pipeline robustness Surface column values into the grounding/generation prompts and harden the query + ingestion pipelines so the NL→SQL agents stop guessing literals and the control flow stops reporting wrong-but-empty results as confident answers. Grounding / value anchoring: - profile: store the full distinct value set for low-cardinality columns (was top-5) - retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the grounding and generation prompts; instructions to copy WHERE/CASE literals verbatim, handle tall/EAV layouts and scaled-duplicate measures - structurally detect self-referencing hierarchy columns (value containment, with a row-wise discriminator) and teach "a person's team/reports/structure" -> filter the hierarchy column Control flow / governance: - treat 0-row / degenerate aggregate results as suspicious -> clarification + confidence cap - gate auto-learning on row_count > 0 (stop poisoning the example store) - prefer a candidate that passes the firewall and returns non-empty rows over candidates[0] - exclude CTE names from AST table_refs (firewall no longer rejects valid CTE SQL) - snapshot-scope retrieval; honor and persist conversation snapshot pins - unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards - thread a re-firewalled extra_filter through the semantic-metric bind() Ingestion robustness / determinism: - header-quality scoring (skip numeric spacer rows) + stable leading-blank handling - temperature=0 for the naming/describe agents (cuts re-ingest schema churn ~94%) - rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors) - index column values into embeddings + content_tsv - warn when the cross-encoder reranker degrades to a no-op

- builder.py is a lock-step (canon-pinned) module; revert it and pass the determinism setting via the existing extra_settings={"temperature": 0.0} instead of a new parameter, so the lockstep gate stays green - move the table-kinds lookup out of the query controller into TableResolver (service layer) so the "no raw SQL in controllers" gate passes - ruff: inline a negated return (SIM103), drop a forward-ref quote (UP037), and apply ruff format - tests: extend the fake TableResolvers with pins / current_snapshots / table_kinds_by_name; seed current_snapshot_id in the retrieval integration fixture (retrieval is now scoped to current_snapshot_id, as publish sets it)

Financial exports (Orbis/BvD) place one date-header row above a stack of sub-sections. Section-splitting orphaned that header above each sub-section's title, dropping the year columns and making the detailed financial tables (Ratios de Rentabilidad, Memo lineas, etc.) unqueryable by period. A data-first section now inherits the nearest recent period header as its column row, encoded as a backward-compatible 3-index section path ([header:data_start:end]); contiguous sections keep the legacy 2-index form. Guards against misfires: - period values are full-match only (INV-2024-0007 / Form 2020 are not years) - only a single leftmost label column may precede the inherited years - a label-headed table closes the period band (no stale-header bleed) - phantom null columns are compacted; header-row index is bounds-checked Adds unit coverage for inheritance, band-closing/no-false-inherit, path round-trip and legacy-path parsing.

…anch Reconciles the two parallel lines of the NL→SQL quality work into a single branch that carries every refinement from both, with no capability lost. Kept from this branch (execution-layer robustness): - TableResolver / AstClassifier / profile hardening and the query_controller surface, plus their unit/integration tests. - The XLSX multi-row period-header reconstruction (embedded-year rejection, band-closing, 3-index section paths) and its test. - PII-safe value rendering in the embed corpus. Brought in (the value-anchoring superset): - value_anchoring.py: the dataset-agnostic backbone (per-column value catalogue, value→column + group resolution, entity disambiguation by live row-match count + hierarchy intent, degenerate/zero-row detection, signed (mixed-sign) measure handling incl. the SUM(CASE WHEN cost THEN -m) shape, CTE-aware guards, function firewall). - query_service orchestration: candidate execution-selection, a non-destructive repair loop (never replaces a correct answer with a worse one), zero-row and signed-measure repair, group-coverage advisory. - Deterministic period/date column naming at ingest (year_YYYY / period_YYYY_MM_DD), original-header preservation, and the value-anchoring prompt rules for grounding / generation / critic / explainer. - LexicalReranker as the dependency-free default (a strict win over the no-op). - temperature=0 for the deterministic ingest agents (describe / column-name / rename / relation), applied via the existing extra_settings hook. Cross-line graft: the column hierarchy hint (references_column, computed by the profile stage) is now surfaced in the retrieval column fingerprint, so a self-referencing manager/owner column tells the agent to filter THAT column to find a person's reports instead of their own row. Gate notes: - builder.py stays byte-identical to canon (lockstep) -- temperature=0 rides the existing extra_settings path, no new parameter. - ruff check + format clean; lockstep 21/21. - test_query_service: the _FakeSettings stub gains the value-anchoring knobs (features off, so the orchestration tests stay focused on the core answer() flow), and the retry-on-error test now uses a realistic non-degenerate refined SQL with a table-aware resolver so the critic round is actually exercised.

miguelgfierro

Está bien, he puesto algunos comentarios, pásaselos a la IA

miguelgfierro · 2026-07-01T13:36:42Z

        dataset_id: uuid.UUID,
        table_names: list[str],
        object_store_base: str | None = None,
+        pins: dict[str, str] | None = None,


resolve() tiene un parámetro pins pensado para fijar una pregunta de drill-down al snapshot del esquema con el que se respondió la primera vez, para que un re-ingest a mitad de conversación no cambie la respuesta sin avisar. Pero ninguno de los tres puntos de llamada en query_service.py (líneas 1166, 1281, 1384) pasa pins=, y current_snapshots() (línea 122) nunca se llama. prior_snapshot_pins simplemente se guarda y se recarga sin cambios (línea 988). Ahora mismo esta función no hace nada — un re-ingest entre dos mensajes de la misma conversación cambiará la respuesta al nuevo esquema sin que nadie se entere.

miguelgfierro · 2026-07-01T13:37:28Z

                           END AS score
                    FROM flyquery_examples
-                    WHERE workspace_id = :workspace_id AND quality = 'APPROVED' {ds_filter}
+                    WHERE workspace_id = :workspace_id AND quality IN ('APPROVED', 'PROPOSED') {ds_filter}


Esta consulta ahora trae quality IN ('APPROVED', 'PROPOSED') en lugar de solo 'APPROVED'. Las filas PROPOSED vienen de auto_learner.py, que guarda un ejemplo cada vez que una consulta se ejecutó una vez, sin reintentos, y devolvió más de 0 filas — eso no es lo mismo que "correcta". Estos ejemplos se le muestran al LLM bajo el título "approved Q→SQL examples", sin forma de distinguir lo aprobado de lo no revisado. Una consulta incorrecta pero plausible ahora puede auto-enseñarse a preguntas futuras antes de que ningún humano la revise.

miguelgfierro · 2026-07-01T13:38:53Z

-            gen_out = await generation_agent.run(
-                {"grounded": grounded, "question": body.question, "starting_point_sql": None}
+            gen_run = await generation_agent.run(
+                _render_generation_prompt(


answer() (la ruta principal /query) pasa los nuevos datos de grounding — catálogo de valores de columna, entidades/grupos resueltos, pistas de jerarquía, pistas de valores — a los constructores de prompts. /explain (línea 398), /validate (línea 465) y la ruta SSE /query/stream (líneas 607 y 641) llaman a los mismos constructores de prompts pero sin pasar nada de esto, así que no reciben ninguno de los aportes de grounding de este PR. El código de streaming incluso tiene un comentario que dice que coincide con la ruta síncrona, y no es cierto. Además, la línea 660 deja fijo snapshot_pins: {} en la respuesta del stream — nunca se rellena de verdad.

miguelgfierro · 2026-07-01T13:41:35Z

+    mixed = False
+    try:
+        if mn is not None and mx is not None and float(mn) < 0 < float(mx):
+            mixed = True
+    except (TypeError, ValueError):
+        pass


mixed_sign se marca como verdadero cada vez que el mínimo de una columna es negativo y el máximo es positivo — no se comprueba nada más. Una columna de ingresos con 9.998 filas positivas normales y 2 filas negativas de reembolsos se marca exactamente igual que una columna que de verdad es mitad positiva, mitad negativa por diseño. Esa marca luego dispara una instrucción estricta de "SIGN ERROR — must fix" en query_service.py, incluso en casos donde restar es lo correcto. Conviene comprobar qué porcentaje de filas es negativo, no solo si hay alguna.

miguelgfierro · 2026-07-01T13:43:03Z

    reranker_top_n: int = 30
    query_expansion_enabled: bool = False

+    # Value-anchoring / grounding-quality knobs. The ingest pipeline already


por qué la config lo tenemos aquí en vez de en el .env?

miguelgfierro · 2026-07-01T13:43:17Z

+    # computes a per-column value catalogue (profile_json.top_values, min/max,
+    # distinct_estimate) + a semantic_type; these control how it is surfaced to
+    # the SQL writer at query time. All dataset-agnostic.
+    value_catalog_enabled: bool = True  # render real column values into prompts


value_catalog_enabled se define aquí pero nunca se comprueba en ningún sitio — column_value_catalog() siempre se ejecuta, sin importar esta opción. Ahora mismo no hay forma de desactivar esta función. (Relacionado: render_column_catalog_line() en value_anchoring.py línea 150 también está muerta — nada la llama; todos los puntos de llamada reales usan render_catalog_from_meta(). Vale la pena borrar ambas, o conectar la opción si de verdad debe hacer algo.)

miguelgfierro · 2026-07-01T13:45:41Z

+        conn = duckdb.connect(":memory:")
+        conn.execute("SET threads=2")
+        for t in terms:
+            mq = t["measure_col"].replace(chr(34), chr(34) * 2)


chr(34) es confuso, sería más sencillo usar '"'

Jose Agustin Puente and others added 5 commits June 2, 2026 10:08

Merge commit '7b02792' into recon/nl2sql-superset

3b30b74

miguelgfierro approved these changes Jul 1, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2
poche123 wants to merge 5 commits into
mainfrom
fix/nl2sql-grounding-and-robustness

poche123 commented Jun 2, 2026 •

edited

Loading

Uh oh!

miguelgfierro left a comment

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

miguelgfierro Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

poche123 commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

The problems we hit (and how we fixed them)

Results

Gates

Uh oh!

miguelgfierro left a comment

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

miguelgfierro Jul 1, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

poche123 commented Jun 2, 2026 •

edited

Loading