Skip to content

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2

Open
poche123 wants to merge 5 commits into
mainfrom
fix/nl2sql-grounding-and-robustness
Open

fix(query+ingest): value-anchored NL→SQL grounding, hierarchy resolution & pipeline robustness#2
poche123 wants to merge 5 commits into
mainfrom
fix/nl2sql-grounding-and-robustness

Conversation

@poche123

@poche123 poche123 commented Jun 2, 2026

Copy link
Copy Markdown

What this PR does

Makes flyquery's Text‑to‑SQL actually answer real questions over uploaded
workbooks. It combines value‑anchored grounding, a non‑destructive
self‑repair loop
, deterministic ingestion, and execution‑layer
robustness
into one branch — every refinement from both lines of this work,
with nothing lost.

The problems we hit (and how we fixed them)

1. It scored 0/9 on a real structured‑data test — not from weak SQL, but from
value‑anchoring.
The per‑column value catalogue is computed at ingest
(profile_json.top_values, sample_values_json, governance_json.semantic_type)
but was never read at query time, so the model invented every filter literal
('Revenue' vs the stored 'Total Revenue', integer 2023 vs the stored
'FY23', Brand='DAPA' vs the real Family='DAPA').
value_anchoring.py (new, dataset‑agnostic) renders each column's real values
into the grounding/generation/critic prompts, resolves question literals to the
columns that own them, and indexes values into BM25 + vector retrieval. Nothing
hardcodes a column name, label, or domain rule.

2. Wrong queries failed silently. A 0‑row result and a WHERE 1=0 no‑op were
stamped OK and narrated as "no data".
→ A non‑destructive repair loop detects zero‑row / degenerate SQL, feeds the real
distinct values back to the critic, and never replaces a correct answer with a
worse one
(best_ok is restored if a repair regresses it).

3. Hierarchy / multi‑team questions failed. "Who reports to X", "average for
teams A, B and C".
→ Group‑aware value resolution (a term that umbrellas ≥2 catalogued values → the
whole set), entity disambiguation by live row‑match count + hierarchy intent,
and a self‑referencing‑column hint (references_column, surfaced in the column
fingerprint) that tells the agent to filter the manager column to find a
person's reports rather than their own row.

4. Signed measures gave the wrong total. Operating profit over a column that
already stores some line items as negatives was double‑subtracted.
→ Mixed‑sign detection (min < 0 < max) plus an observed‑sign probe that rewrites
both the SUM(...) - SUM(...) chain and the equivalent
SUM(CASE WHEN cost THEN -m ...) shape into one correct signed sum.

5. Ingestion was non‑deterministic, with a silent off‑by‑one. Dashboard
exports (Orbis/BvD) put the period header in a band above each label → data
section; the section pass treated a data row as the header, and DuckDB's
content‑based guess made it unstable — mis‑mapping years by one column.
→ Multi‑row period‑header reconstruction (embedded‑year rejection so
INV‑2024‑0007 isn't a period, band‑closing, 3‑index section paths), then
deterministic period/date naming (year_YYYY / period_YYYY_MM_DD) with the
original header preserved. temperature=0 for the classification/mapping ingest
agents makes column naming stable across re‑ingests.

6. Retrieval truncated by order only. The default reranker was a no‑op.
LexicalReranker (dependency‑free token overlap) is the default fallback — a
strict win that lets value‑aware retrieval actually reach the prompt when a
cross‑encoder model isn't installed.

Plus pipeline robustness: execution‑based multi‑candidate selection, a CTE‑aware
unknown‑table guard, column‑existence validation, a synthesis‑path function
firewall (blocks read_csv_auto / pg_read_file / …), and the
TableResolver / AstClassifier / profile hardening.

Results

  • AstraZeneca structured‑data test: 0/9 → strong (exact revenue/CAGR/ratio,
    operating profit with the by‑therapeutic‑area breakdown, named team rosters).
  • Orbis / Moody's (IVI Málaga) financial profile: 9/9 exact, and the
    previously‑silent off‑by‑one is gone (Venta 2024/2020/2016 =
    7,200,102.62 / 3,986,533.08 / 2,284,370.94).

Gates

  • ruff check + ruff format --check clean; lockstep 21/21 (builder.py
    stays byte‑identical to canon — temperature=0 rides the existing
    extra_settings hook, no new parameter).
  • Unit 351 passed, integration 97 passed / 0 failed locally (same config
    as the PR gate). Test fixtures updated: the _FakeSettings stub gains the
    value‑anchoring knobs, the retry‑on‑error test exercises a realistic
    non‑degenerate refined SQL, and the reranker tests assert the LexicalReranker
    default.

All changes are pure src/ + tests/ — no Docker/compose, no env, no
local‑setup files.

Jose Agustin Puente and others added 5 commits June 2, 2026 10:08
…ion, pipeline robustness

Surface column values into the grounding/generation prompts and harden the
query + ingestion pipelines so the NL→SQL agents stop guessing literals and the
control flow stops reporting wrong-but-empty results as confident answers.

Grounding / value anchoring:
- profile: store the full distinct value set for low-cardinality columns (was top-5)
- retrieval: value_fingerprint() on hits; render a "Column value catalogue" into the
  grounding and generation prompts; instructions to copy WHERE/CASE literals verbatim,
  handle tall/EAV layouts and scaled-duplicate measures
- structurally detect self-referencing hierarchy columns (value containment, with a
  row-wise discriminator) and teach "a person's team/reports/structure" -> filter the
  hierarchy column

Control flow / governance:
- treat 0-row / degenerate aggregate results as suspicious -> clarification + confidence cap
- gate auto-learning on row_count > 0 (stop poisoning the example store)
- prefer a candidate that passes the firewall and returns non-empty rows over candidates[0]
- exclude CTE names from AST table_refs (firewall no longer rejects valid CTE SQL)
- snapshot-scope retrieval; honor and persist conversation snapshot pins
- unify /stream, :explain, :validate on the rendered prompts + firewall/scope guards
- thread a re-firewalled extra_filter through the semantic-metric bind()

Ingestion robustness / determinism:
- header-quality scoring (skip numeric spacer rows) + stable leading-blank handling
- temperature=0 for the naming/describe agents (cuts re-ingest schema churn ~94%)
- rewrite Parquet column names on the no-key fallback (fixes profile/sample Binder errors)
- index column values into embeddings + content_tsv
- warn when the cross-encoder reranker degrades to a no-op
- builder.py is a lock-step (canon-pinned) module; revert it and pass the
  determinism setting via the existing extra_settings={"temperature": 0.0}
  instead of a new parameter, so the lockstep gate stays green
- move the table-kinds lookup out of the query controller into TableResolver
  (service layer) so the "no raw SQL in controllers" gate passes
- ruff: inline a negated return (SIM103), drop a forward-ref quote (UP037),
  and apply ruff format
- tests: extend the fake TableResolvers with pins / current_snapshots /
  table_kinds_by_name; seed current_snapshot_id in the retrieval integration
  fixture (retrieval is now scoped to current_snapshot_id, as publish sets it)
Financial exports (Orbis/BvD) place one date-header row above a stack of
sub-sections. Section-splitting orphaned that header above each
sub-section's title, dropping the year columns and making the detailed
financial tables (Ratios de Rentabilidad, Memo lineas, etc.) unqueryable
by period.

A data-first section now inherits the nearest recent period header as its
column row, encoded as a backward-compatible 3-index section path
([header:data_start:end]); contiguous sections keep the legacy 2-index form.

Guards against misfires:
- period values are full-match only (INV-2024-0007 / Form 2020 are not years)
- only a single leftmost label column may precede the inherited years
- a label-headed table closes the period band (no stale-header bleed)
- phantom null columns are compacted; header-row index is bounds-checked

Adds unit coverage for inheritance, band-closing/no-false-inherit, path
round-trip and legacy-path parsing.
…anch

Reconciles the two parallel lines of the NL→SQL quality work into a single
branch that carries every refinement from both, with no capability lost.

Kept from this branch (execution-layer robustness):
- TableResolver / AstClassifier / profile hardening and the query_controller
  surface, plus their unit/integration tests.
- The XLSX multi-row period-header reconstruction (embedded-year rejection,
  band-closing, 3-index section paths) and its test.
- PII-safe value rendering in the embed corpus.

Brought in (the value-anchoring superset):
- value_anchoring.py: the dataset-agnostic backbone (per-column value
  catalogue, value→column + group resolution, entity disambiguation by live
  row-match count + hierarchy intent, degenerate/zero-row detection, signed
  (mixed-sign) measure handling incl. the SUM(CASE WHEN cost THEN -m) shape,
  CTE-aware guards, function firewall).
- query_service orchestration: candidate execution-selection, a non-destructive
  repair loop (never replaces a correct answer with a worse one), zero-row and
  signed-measure repair, group-coverage advisory.
- Deterministic period/date column naming at ingest (year_YYYY /
  period_YYYY_MM_DD), original-header preservation, and the value-anchoring
  prompt rules for grounding / generation / critic / explainer.
- LexicalReranker as the dependency-free default (a strict win over the no-op).
- temperature=0 for the deterministic ingest agents (describe / column-name /
  rename / relation), applied via the existing extra_settings hook.

Cross-line graft: the column hierarchy hint (references_column, computed by the
profile stage) is now surfaced in the retrieval column fingerprint, so a
self-referencing manager/owner column tells the agent to filter THAT column to
find a person's reports instead of their own row.

Gate notes:
- builder.py stays byte-identical to canon (lockstep) -- temperature=0 rides
  the existing extra_settings path, no new parameter.
- ruff check + format clean; lockstep 21/21.
- test_query_service: the _FakeSettings stub gains the value-anchoring knobs
  (features off, so the orchestration tests stay focused on the core answer()
  flow), and the retry-on-error test now uses a realistic non-degenerate refined
  SQL with a table-aware resolver so the critic round is actually exercised.

@miguelgfierro miguelgfierro left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Está bien, he puesto algunos comentarios, pásaselos a la IA

dataset_id: uuid.UUID,
table_names: list[str],
object_store_base: str | None = None,
pins: dict[str, str] | None = None,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolve() tiene un parámetro pins pensado para fijar una pregunta de drill-down al snapshot del esquema con el que se respondió la primera vez, para que un re-ingest a mitad de conversación no cambie la respuesta sin avisar. Pero ninguno de los tres puntos de llamada en query_service.py (líneas 1166, 1281, 1384) pasa pins=, y current_snapshots() (línea 122) nunca se llama. prior_snapshot_pins simplemente se guarda y se recarga sin cambios (línea 988). Ahora mismo esta función no hace nada — un re-ingest entre dos mensajes de la misma conversación cambiará la respuesta al nuevo esquema sin que nadie se entere.

END AS score
FROM flyquery_examples
WHERE workspace_id = :workspace_id AND quality = 'APPROVED' {ds_filter}
WHERE workspace_id = :workspace_id AND quality IN ('APPROVED', 'PROPOSED') {ds_filter}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Esta consulta ahora trae quality IN ('APPROVED', 'PROPOSED') en lugar de solo 'APPROVED'. Las filas PROPOSED vienen de auto_learner.py, que guarda un ejemplo cada vez que una consulta se ejecutó una vez, sin reintentos, y devolvió más de 0 filas — eso no es lo mismo que "correcta". Estos ejemplos se le muestran al LLM bajo el título "approved Q→SQL examples", sin forma de distinguir lo aprobado de lo no revisado. Una consulta incorrecta pero plausible ahora puede auto-enseñarse a preguntas futuras antes de que ningún humano la revise.

gen_out = await generation_agent.run(
{"grounded": grounded, "question": body.question, "starting_point_sql": None}
gen_run = await generation_agent.run(
_render_generation_prompt(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

answer() (la ruta principal /query) pasa los nuevos datos de grounding — catálogo de valores de columna, entidades/grupos resueltos, pistas de jerarquía, pistas de valores — a los constructores de prompts. /explain (línea 398), /validate (línea 465) y la ruta SSE /query/stream (líneas 607 y 641) llaman a los mismos constructores de prompts pero sin pasar nada de esto, así que no reciben ninguno de los aportes de grounding de este PR. El código de streaming incluso tiene un comentario que dice que coincide con la ruta síncrona, y no es cierto. Además, la línea 660 deja fijo snapshot_pins: {} en la respuesta del stream — nunca se rellena de verdad.

Comment on lines +141 to +146
mixed = False
try:
if mn is not None and mx is not None and float(mn) < 0 < float(mx):
mixed = True
except (TypeError, ValueError):
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed_sign se marca como verdadero cada vez que el mínimo de una columna es negativo y el máximo es positivo — no se comprueba nada más. Una columna de ingresos con 9.998 filas positivas normales y 2 filas negativas de reembolsos se marca exactamente igual que una columna que de verdad es mitad positiva, mitad negativa por diseño. Esa marca luego dispara una instrucción estricta de "SIGN ERROR — must fix" en query_service.py, incluso en casos donde restar es lo correcto. Conviene comprobar qué porcentaje de filas es negativo, no solo si hay alguna.

Comment thread src/flyquery/config.py
reranker_top_n: int = 30
query_expansion_enabled: bool = False

# Value-anchoring / grounding-quality knobs. The ingest pipeline already

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

por qué la config lo tenemos aquí en vez de en el .env?

Comment thread src/flyquery/config.py
# computes a per-column value catalogue (profile_json.top_values, min/max,
# distinct_estimate) + a semantic_type; these control how it is surfaced to
# the SQL writer at query time. All dataset-agnostic.
value_catalog_enabled: bool = True # render real column values into prompts

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value_catalog_enabled se define aquí pero nunca se comprueba en ningún sitio — column_value_catalog() siempre se ejecuta, sin importar esta opción. Ahora mismo no hay forma de desactivar esta función. (Relacionado: render_column_catalog_line() en value_anchoring.py línea 150 también está muerta — nada la llama; todos los puntos de llamada reales usan render_catalog_from_meta(). Vale la pena borrar ambas, o conectar la opción si de verdad debe hacer algo.)

conn = duckdb.connect(":memory:")
conn.execute("SET threads=2")
for t in terms:
mq = t["measure_col"].replace(chr(34), chr(34) * 2)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chr(34) es confuso, sería más sencillo usar '"'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants