feat: data-parity skill — algorithm guardrails and output style by suryaiyer95 · Pull Request #493 · AltimateAI/altimate-code

suryaiyer95 · 2026-03-27T00:29:44Z

Summary

Two improvements to the data-parity LLM skill based on real-world testing, plus a default-model change:

Algorithm guardrail — joindiff physically cannot see a second table when source_warehouse ≠ target_warehouse. It runs a single FULL OUTER JOIN on one connection, so it always reports 0 differences cross-database. Added a CRITICAL warning to the skill so the LLM always chooses hashdiff or auto for cross-DB comparisons.

Output style — Added explicit instruction to report facts only: counts, changed values, missing rows. No editorializing, no pitching the tool, no "this is exactly why row-level diffing matters" commentary.

Default model — Set anthropic/claude-sonnet-4-6 as the default in opencode.jsonc.
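
The algorithm guardrail above is implemented as a prompt instruction in the skill. As a rough illustration only, the same rule could be written as a code-level check. The `DiffRequest` type and `chooseAlgorithm` function below are hypothetical; only the algorithm names and the `source_warehouse` / `target_warehouse` parameters come from this PR.

```typescript
// Hypothetical types for illustration; not the tool's actual schema.
type Algorithm = "joindiff" | "hashdiff" | "auto";

interface DiffRequest {
  source_warehouse: string;
  target_warehouse: string;
  algorithm?: Algorithm;
}

// Returns the algorithm to run, never allowing joindiff across warehouses.
function chooseAlgorithm(req: DiffRequest): Algorithm {
  const crossDb = req.source_warehouse !== req.target_warehouse;
  const requested: Algorithm = req.algorithm ?? "auto";
  if (crossDb && requested === "joindiff") {
    // joindiff issues one FULL OUTER JOIN on a single connection,
    // so it cannot see a table that lives in another warehouse.
    return "hashdiff";
  }
  return requested;
}
```

With this check, a request for `joindiff` between `pg_source` and `pg_target` would be redirected to `hashdiff`, while same-warehouse requests pass through unchanged.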

Test plan

Ran cross-DB comparison (pg_source vs pg_target) — agent now uses hashdiff automatically
Ran TPC-H migration validation — output is clean fact-reporting, no promotional commentary
Ran SQL query comparison (same-warehouse) — joindiff still used correctly for same-DB

coderabbitai · 2026-03-27T00:29:52Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 311b8513-03fc-440b-9e52-c471c38b2bf1


suryaiyer95 · 2026-03-27T00:32:58Z

Closing — .opencode/ skill config and model defaults should not live in the open source repo.

- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)

Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.

New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns

New output field:
- partition_results: per-partition breakdown (identical / differ / error)

Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.
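
A minimal sketch of how numeric bucketing could work, assuming half-open ranges aligned to multiples of the bucket width. The helper names and the range convention are assumptions for illustration; the commit does not show the engine's actual SQL.

```typescript
// Hypothetical helper: start values of each bucket covering [min, max].
function bucketStarts(min: number, max: number, bucketSize: number): number[] {
  if (bucketSize <= 0) throw new Error("bucketSize must be positive");
  const starts: number[] = [];
  // Align the first bucket to a multiple of bucketSize (an assumption).
  for (let s = Math.floor(min / bucketSize) * bucketSize; s <= max; s += bucketSize) {
    starts.push(s);
  }
  return starts;
}

// Hypothetical helper: WHERE clause selecting one half-open bucket.
function numericBucketWhere(column: string, start: number, bucketSize: number): string {
  return `${column} >= ${start} AND ${column} < ${start + bucketSize}`;
}
```

Each bucket would then be diffed independently with its own WHERE clause, and the per-bucket outcomes aggregated into `partition_results`.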

When partition_column is set without partition_granularity or partition_bucket_size, groups by raw DISTINCT values. Works for any non-date, non-numeric column: status, region, country, etc. WHERE clause uses equality: col = 'value' with proper escaping.
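
The equality filter described above might be built along these lines. This is a sketch, not the PR's code: `categoricalPartitionWhere` is a hypothetical name, and doubling single quotes is the standard SQL string-literal escape rather than a confirmed detail of the implementation. Identifier quoting and NULL values are left out.

```typescript
// Hypothetical helper: WHERE clause for one distinct value of a categorical column.
function categoricalPartitionWhere(column: string, value: string): string {
  // Double any single quote so the value stays inside the string literal.
  const escaped = value.replace(/'/g, "''");
  return `${column} = '${escaped}'`;
}
```

For example, a `region` value of `Côte d'Ivoire` yields `region = 'Côte d''Ivoire'`.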

Rust serializes ReladiffOutcome with serde tag 'mode', producing:

{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}

Previous code checked for {Match: {...}} / {Diff: {...}} shapes that never matched, causing partitioned diff to report all partitions as 'identical' with 0 rows.

- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
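
To illustrate the shape mismatch, here is a sketch of a mode-based `extractStats`. The field names come from the commit message; the function body, the `DiffStats` type, and the `hasDifferences` helper are assumptions, not the repository's code.

```typescript
interface DiffStats {
  rows_table1: number;
  rows_table2: number;
  exclusive_table1: number;
  exclusive_table2: number;
  updated: number;
  unchanged: number;
}

// Reads stats from the serde-tagged shape {mode: 'diff', stats: {...}}.
// Returns null for anything else, including the old {Diff: {...}} shape.
function extractStats(outcome: unknown): DiffStats | null {
  if (typeof outcome !== "object" || outcome === null) return null;
  const o = outcome as { mode?: unknown; stats?: unknown };
  if (o.mode !== "diff" || typeof o.stats !== "object" || o.stats === null) return null;
  return o.stats as DiffStats;
}

// A partition differs if any row is exclusive to one side or updated.
function hasDifferences(stats: DiffStats): boolean {
  return stats.exclusive_table1 + stats.exclusive_table2 + stats.updated > 0;
}
```

Code that looked for a top-level `Diff` key would never find one in the tagged shape, which is consistent with the 'identical, 0 rows' symptom described above.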

Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections
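
The row-count thresholds above live in the skill prompt, but the decision they describe can be sketched as a small function. The `NextStep` labels and the function name are hypothetical; treating exactly 10M rows as part of the middle tier is also an assumption, since the commit message does not say which side the boundary falls on.

```typescript
// Hypothetical labels for the three tiers described in the commit message.
type NextStep = "proceed" | "ask_with_where_clause" | "offer_window_partition_full";

function nextStepForRowCount(rowCount: number): NextStep {
  if (rowCount < 100000) return "proceed"; // <100K rows: proceed automatically
  if (rowCount <= 10000000) return "ask_with_where_clause"; // 100K-10M rows
  return "offer_window_partition_full"; // >10M rows
}
```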

…from data diff

The Rust engine only compares columns explicitly listed in extra_columns. When omitted, it was silently reporting all key-matched rows as 'identical' even when non-key values differed: a false positive bug.

Changes:
- Auto-discover columns from information_schema when extra_columns is omitted and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
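
A sketch of the name-pattern exclusion, assuming the discovered column list is already available. The patterns cover only the names listed in the commit message (the real list ends in "etc."), and `splitColumns` is a hypothetical helper, not the PR's function.

```typescript
// Patterns taken from the names listed above; the actual list is longer.
const AUDIT_PATTERNS: RegExp[] = [
  /^(updated|created|inserted|modified)_at$/i,
  /^_fivetran_/i,
  /^_airbyte_/i,
  /^publisher_last_updated_/i,
];

// Splits non-key columns into those to compare and those excluded as audit columns.
function splitColumns(
  allColumns: string[],
  keyColumns: string[],
): { compared: string[]; excluded: string[] } {
  const keys = new Set(keyColumns);
  const compared: string[] = [];
  const excluded: string[] = [];
  for (const col of allColumns) {
    if (keys.has(col)) continue; // key columns are matched on, not compared
    const isAudit = AUDIT_PATTERNS.some((p) => p.test(col));
    (isAudit ? excluded : compared).push(col);
  }
  return { compared, excluded };
}
```

Returning the `excluded` list alongside `compared` makes it straightforward to report skipped columns in the tool output.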

…ult truncation

All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to 1000 rows. This silently drops rows when the data-diff engine needs complete result sets: a FULL OUTER JOIN returning >1000 diff rows would be truncated, causing the engine to undercount differences.

- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql, redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`

Interactive SQL callers are unaffected; they continue to get the default 1000-row limit. Only the data-diff pipeline opts out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
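
The limit logic described above can be sketched as three small functions. Only `ExecuteOptions`, `noLimit`, and `effectiveLimit` are named in the commit; the helper functions here are hypothetical, and real drivers presumably also check whether the SQL already carries a LIMIT, which this sketch omits.

```typescript
interface ExecuteOptions {
  noLimit?: boolean;
}

const DEFAULT_LIMIT = 1000;

// 0 means "no limit"; it is falsy, so the injection guard below skips it.
function effectiveLimitFor(limit: number | undefined, options?: ExecuteOptions): number {
  return options?.noLimit ? 0 : (limit ?? DEFAULT_LIMIT);
}

// Requests one extra row so truncation can be detected afterwards.
function injectLimit(sql: string, effectiveLimit: number): string {
  return effectiveLimit ? `${sql} LIMIT ${effectiveLimit + 1}` : sql;
}

// The effectiveLimit > 0 guard prevents slicing the result to zero rows.
function truncateRows<T>(rows: T[], effectiveLimit: number): { rows: T[]; truncated: boolean } {
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit;
  return { rows: truncated ? rows.slice(0, effectiveLimit) : rows, truncated };
}
```

Without the `effectiveLimit > 0` guard, a limit of 0 would make `rows.length > effectiveLimit` true for any non-empty result, and `slice(0, 0)` would discard every row.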

aidtya · 2026-03-28T08:18:07Z

Companion fix: column name collision in `join_diff_sql`

The joindiff algorithm produces incorrect results for compound keys due to a column name collision between the Rust SQL generator and the node-postgres driver.

Root cause: join_diff_sql emits unaliased COALESCE and CAST expressions. PostgreSQL auto-assigns duplicate column names (e.g., two columns both named coalesce). node-postgres converts rows to JS objects keyed by column name — duplicates silently overwrite earlier values, so the engine receives corrupted key data.

Example: Comparing rakuten.warehouse_metadata vs rakutenvthree.warehouse_metadata with key_columns: ["rk", "instance_id"], the rk column was lost (overwritten by instance_id), collapsing 460 distinct keys into 2 groups. Tool reported only_in_source = 1 instead of the correct 307.

Fix: https://github.com/AltimateAI/altimate-core-internal/pull/114 — every SELECT expression in join_diff_sql now gets a unique alias (_k0, _k1, _v0_l, _v0_r, …). No changes needed in this PR's orchestrator code — it already uses positional indexing.
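
The overwrite described in the root cause can be reproduced without a database. The sketch below mimics a driver that turns each row into an object keyed by column name; `rowToObject` and the sample values are made up for illustration, and only the `coalesce`, `_k0`, and `_k1` names come from the comment.

```typescript
// Builds a row object keyed by column name, as a name-keyed driver result would be.
function rowToObject(columns: string[], values: unknown[]): Record<string, unknown> {
  const row: Record<string, unknown> = {};
  columns.forEach((name, i) => {
    row[name] = values[i]; // a repeated name overwrites the earlier value
  });
  return row;
}

// Unaliased expressions: both key columns are named "coalesce".
const collided = rowToObject(["coalesce", "coalesce"], ["rk-1", "inst-9"]);

// Uniquely aliased expressions keep both key values.
const aliased = rowToObject(["_k0", "_k1"], ["rk-1", "inst-9"]);
```

Here `collided` ends up with a single `coalesce` property holding `"inst-9"`, so the first key value is lost, while `aliased` retains both.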

To pick up the fix: bump @altimateai/altimate-core in packages/opencode/package.json after the altimate-core-internal binary is rebuilt.

…m exclusions with user

Column exclusion now has two layers:
1. Name-pattern matching (existing): updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new): queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc. Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).

The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
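
A sketch of the schema-level layer, assuming the `column_default` text has already been fetched. The regular expression covers only the five functions named above (the commit's list ends in "etc."), and `isAutoTimestampDefault` is a hypothetical name.

```typescript
// Matches defaults that call a current-time function; covers only the listed names.
const AUTO_TIMESTAMP_DEFAULT =
  /\b(now\s*\(|current_timestamp|getdate\s*\(|sysdate|systimestamp)/i;

// True when a column's default expression regenerates a timestamp on insert.
function isAutoTimestampDefault(columnDefault: string | null): boolean {
  return columnDefault !== null && AUTO_TIMESTAMP_DEFAULT.test(columnDefault);
}
```

Columns flagged this way would be shown to the user for confirmation rather than excluded silently, in line with the prompt change described above.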

github-actions bot added the contributor label Mar 27, 2026

suryaiyer95 closed this Mar 27, 2026

suryaiyer95 reopened this Mar 27, 2026

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 2bc4608 to 0f8c7ac Compare March 27, 2026 00:39

suryaiyer95 force-pushed the feat/data-parity-skill-improvements branch from 0f8c7ac to 7909e55 Compare March 27, 2026 00:41

suryaiyer95 and others added 6 commits March 26, 2026 18:21

aidtya added 2 commits March 28, 2026 12:49

fix: implement Connector interface methods in ClickHouse driver

8c7ef31

Replace getSchema() (not in interface) with the required listSchemas(), listTables(), and describeTable() methods to match the Connector contract.

suryaiyer95 wants to merge 9 commits into main from feat/data-parity-skill-improvements
