feat: data-parity skill — algorithm guardrails and output style#493
suryaiyer95 wants to merge 9 commits into main
Closing: .opencode/ skill config and model defaults should not live in the open source repo.
2bc4608 to 0f8c7ac
- Add DataParity engine integration via native Rust bindings
- Add data-diff tool for LLM agent (profile, joindiff, hashdiff, cascade, auto)
- Add ClickHouse driver support
- Add data-parity skill: profile-first workflow, algorithm selection guide, CRITICAL warning that joindiff cannot run cross-database (always returns 0 diffs), output style rules (facts only, no editorializing)
- Gitignore .altimate-code/ (credentials) and *.node (platform binaries)
0f8c7ac to 7909e55
Split large tables by a date or numeric column before diffing. Each partition is diffed independently, then results are aggregated.
New params:
- partition_column: column to split on (date or numeric)
- partition_granularity: day | week | month | year (for dates)
- partition_bucket_size: bucket width for numeric columns
New output field:
- partition_results: per-partition breakdown (identical / differ / error)
Dialect-aware SQL: Postgres, Snowflake, BigQuery, ClickHouse, MySQL. Skill updated with partition guidance and examples.
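The numeric bucketing described above can be sketched as follows. This is an illustration only, not the PR's code: the helper name `numericPartitions`, the `NumericPartition` type, and the half-open `[lower, upper)` bucket convention are all assumptions.

```typescript
// Hypothetical sketch: split a numeric range into fixed-width buckets and
// emit one WHERE clause per bucket. Names and conventions are assumed.
interface NumericPartition {
  lower: number; // inclusive lower bound
  upper: number; // exclusive upper bound
  where: string; // predicate selecting this bucket
}

function numericPartitions(
  column: string,
  min: number,
  max: number,
  bucketSize: number,
): NumericPartition[] {
  if (bucketSize <= 0) throw new Error("bucketSize must be positive");
  const partitions: NumericPartition[] = [];
  // Align the first bucket to a multiple of bucketSize so both tables
  // produce the same bucket boundaries regardless of their own min value.
  const start = Math.floor(min / bucketSize) * bucketSize;
  for (let lower = start; lower <= max; lower += bucketSize) {
    const upper = lower + bucketSize;
    partitions.push({
      lower,
      upper,
      where: `${column} >= ${lower} AND ${column} < ${upper}`,
    });
  }
  return partitions;
}
```

Each returned `where` string would be applied to both sides of the comparison, so a partition that differs can be re-run on its own.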
When partition_column is set without partition_granularity or partition_bucket_size, groups by raw DISTINCT values. Works for any non-date, non-numeric column: status, region, country, etc. WHERE clause uses equality: col = 'value' with proper escaping.
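A minimal sketch of the equality predicate with escaping, assuming standard SQL quote doubling. The helper names are hypothetical, the NULL branch is an assumption not stated in the commit message, and some dialects may need additional escaping (for example backslashes).

```typescript
// Hypothetical sketch: build `col = 'value'` for a categorical partition.
// Single quotes are doubled, which is the standard SQL way to escape them.
function sqlStringLiteral(value: string): string {
  return `'${value.replace(/'/g, "''")}'`;
}

function categoricalWhere(column: string, value: string | null): string {
  // Assumption: NULL values get their own partition via IS NULL,
  // because `col = NULL` never matches any row.
  if (value === null) return `${column} IS NULL`;
  return `${column} = ${sqlStringLiteral(value)}`;
}
```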
Rust serializes ReladiffOutcome with serde tag 'mode', producing:
{mode: 'diff', diff_rows: [...], stats: {rows_table1, rows_table2, exclusive_table1, exclusive_table2, updated, unchanged}}
Previous code checked for {Match: {...}} / {Diff: {...}} shapes that
never matched, causing partitioned diff to report all partitions as
'identical' with 0 rows.
- extractStats(): check outcome.mode === 'diff', read from stats fields
- mergeOutcomes(): aggregate mode-based outcomes correctly
- summarize()/formatOutcome(): display mode-based shape with correct labels
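The mode-based handling can be sketched roughly as below. The field names follow the serialized shape quoted in the commit message; the type definitions, the `Sketch` function names, and the choice to return `null` for unrecognized shapes are assumptions rather than the PR's implementation.

```typescript
// Field names mirror the serialized shape quoted above; types are assumed.
interface DiffStats {
  rows_table1: number;
  rows_table2: number;
  exclusive_table1: number;
  exclusive_table2: number;
  updated: number;
  unchanged: number;
}

interface ModeTaggedOutcome {
  mode: string;
  diff_rows?: unknown[];
  stats?: DiffStats;
}

// Return stats only for a mode-tagged diff outcome. Anything else yields
// null so the caller can flag it instead of counting it as identical.
function extractStatsSketch(outcome: ModeTaggedOutcome): DiffStats | null {
  if (outcome.mode !== "diff" || outcome.stats === undefined) return null;
  return outcome.stats;
}

// Sum per-partition stats into one aggregate.
function mergeStatsSketch(all: DiffStats[]): DiffStats {
  const total: DiffStats = {
    rows_table1: 0,
    rows_table2: 0,
    exclusive_table1: 0,
    exclusive_table2: 0,
    updated: 0,
    unchanged: 0,
  };
  for (const s of all) {
    total.rows_table1 += s.rows_table1;
    total.rows_table2 += s.rows_table2;
    total.exclusive_table1 += s.exclusive_table1;
    total.exclusive_table2 += s.exclusive_table2;
    total.updated += s.updated;
    total.unchanged += s.unchanged;
  }
  return total;
}

// A partition differs if any row is exclusive to one side or was updated.
function hasDifferences(s: DiffStats): boolean {
  return s.exclusive_table1 + s.exclusive_table2 + s.updated > 0;
}
```

Under this sketch, the old `{Diff: {...}}` shape has no `mode` field, so it produces `null` rather than a silent "identical".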
Key changes based on feedback:
- Always generate TODO plan before any tool is called
- Enforce data_diff tool usage (never manual EXCEPT/JOIN SQL)
- Add PK discovery + explicit user confirmation step
- Profile pass is now mandatory before row-level diff
- Ask user before expensive row-level diff on large tables:
  - <100K rows: proceed automatically
  - 100K-10M rows: ask with where_clause option
  - >10M rows: offer window/partition/full choices
- Document partition modes (date/numeric/categorical) with examples
- Add warehouse_list as first step to confirm connections
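The row-count thresholds above live in the skill prompt rather than in code, but the rule can be expressed as a small function for clarity. The type and function names here are hypothetical, and placing exactly 10M rows in the middle tier is an assumption.

```typescript
// Hypothetical sketch of the size-based confirmation rule in the skill.
type RowLevelPlan =
  | "proceed" // small table: run the row-level diff without asking
  | "ask_with_where_clause" // medium: ask, offering a where_clause filter
  | "offer_window_partition_full"; // large: let the user pick a strategy

function planRowLevelDiff(rowCount: number): RowLevelPlan {
  if (rowCount < 100_000) return "proceed";
  if (rowCount <= 10_000_000) return "ask_with_where_clause";
  return "offer_window_partition_full";
}
```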
…from data diff
The Rust engine only compares columns explicitly listed in extra_columns. When omitted, it was silently reporting all key-matched rows as 'identical' even when non-key values differed, which is a false-positive bug.
Changes:
- Auto-discover columns from information_schema when extra_columns is omitted and source is a plain table name (not a SQL query)
- Exclude audit/timestamp columns (updated_at, created_at, inserted_at, modified_at, _fivetran_*, _airbyte_*, publisher_last_updated_*, etc.) from comparison by default since they typically differ due to ETL timing
- Report excluded columns in tool output so users know what was skipped
- Fix misleading tool description that said 'Omit to compare all columns'
- Update SKILL.md with critical guidance on extra_columns behavior
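A minimal sketch of the name-pattern exclusion, assuming the discovered column list is already available. The pattern list covers only the names spelled out in the commit message (the "etc." is not filled in), and the function and type names are hypothetical.

```typescript
// Hypothetical sketch: split discovered columns into compared vs excluded.
// Patterns cover only the names listed in the commit message.
const AUDIT_NAME_PATTERNS: RegExp[] = [
  /^(updated|created|inserted|modified)_at$/i,
  /^_fivetran_/i,
  /^_airbyte_/i,
  /^publisher_last_updated_/i,
];

interface ColumnSelection {
  compared: string[]; // would be passed to the engine as extra_columns
  excluded: string[]; // reported back to the user as skipped
}

function selectComparedColumns(
  allColumns: string[],
  keyColumns: string[],
): ColumnSelection {
  const keys = new Set(keyColumns.map((k) => k.toLowerCase()));
  const compared: string[] = [];
  const excluded: string[] = [];
  for (const col of allColumns) {
    if (keys.has(col.toLowerCase())) continue; // keys are matched, not compared
    if (AUDIT_NAME_PATTERNS.some((p) => p.test(col))) excluded.push(col);
    else compared.push(col);
  }
  return { compared, excluded };
}
```

Returning the excluded list alongside the compared list is what lets the tool output state which columns were skipped.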
…ult truncation
All drivers default to `LIMIT 1001` on SELECT queries and post-truncate to
1000 rows. This silently drops rows when the data-diff engine needs complete
result sets — a FULL OUTER JOIN returning >1000 diff rows would be truncated,
causing the engine to undercount differences.
- Add `ExecuteOptions { noLimit?: boolean }` to the `Connector` interface
- When `noLimit: true`, set `effectiveLimit = 0` (falsy) so the existing
LIMIT injection guard is skipped, and add `effectiveLimit > 0` to the
truncation check so rows aren't sliced to zero
- Update all 12 drivers: postgres, clickhouse, snowflake, bigquery, mysql,
redshift, databricks, duckdb, oracle, sqlserver, sqlite, mongodb
- Pass `{ noLimit: true }` from `data-diff.ts` `executeQuery()`
Interactive SQL callers are unaffected — they continue to get the default
1000-row limit. Only the data-diff pipeline opts out.
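The limit handling described above might look roughly like this inside a driver. `ExecuteOptions` and `effectiveLimit` come from the commit message; the helper functions, the regex-based SELECT/LIMIT detection, and `DEFAULT_LIMIT` are assumptions made for illustration.

```typescript
// Sketch only: helper names and detection logic are assumed, not the PR's.
interface ExecuteOptions {
  noLimit?: boolean;
}

const DEFAULT_LIMIT = 1000;

// noLimit maps to 0, which is falsy and therefore skips LIMIT injection.
function effectiveLimitFor(limit?: number, options?: ExecuteOptions): number {
  return options?.noLimit ? 0 : (limit ?? DEFAULT_LIMIT);
}

// Fetch one extra row (limit + 1) so truncation can be detected afterwards.
function injectLimit(sql: string, effectiveLimit: number): string {
  const isSelect = /^\s*select\b/i.test(sql);
  const hasLimit = /\blimit\s+\d+/i.test(sql);
  if (effectiveLimit && isSelect && !hasLimit) {
    return `${sql} LIMIT ${effectiveLimit + 1}`;
  }
  return sql;
}

// The `effectiveLimit > 0` guard keeps noLimit results from being
// sliced down to zero rows.
function truncateRows<T>(
  rows: T[],
  effectiveLimit: number,
): { rows: T[]; truncated: boolean } {
  const truncated = effectiveLimit > 0 && rows.length > effectiveLimit;
  return { rows: truncated ? rows.slice(0, effectiveLimit) : rows, truncated };
}
```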
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Companion fix: column name collision in
…m exclusions with user
Column exclusion now has two layers:
1. Name-pattern matching (existing): updated_at, created_at, _fivetran_synced, etc.
2. Schema-level default detection (new): queries column_default for NOW(), CURRENT_TIMESTAMP, GETDATE(), SYSDATE, SYSTIMESTAMP, etc. Covers PostgreSQL, MySQL, Snowflake, SQL Server, Oracle, ClickHouse, DuckDB, SQLite, and Redshift in a single round-trip (no extra query).
The skill prompt now instructs the agent to present detected auto-timestamp columns to the user and ask for confirmation before excluding them, since migrations should preserve timestamps while ETL replication regenerates them.
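The second layer can be sketched as a check on each column's default expression. The regex below covers only the functions named in the commit message, and the function name and the nullable-string input type are assumptions about how `column_default` would be surfaced.

```typescript
// Hypothetical sketch: flag columns whose default is an auto-timestamp.
// Covers only the functions named in the commit message.
const AUTO_TIMESTAMP_DEFAULT =
  /\b(now\s*\(|current_timestamp|getdate\s*\(|sysdate|systimestamp)/i;

function isAutoTimestampDefault(columnDefault: string | null): boolean {
  if (columnDefault === null) return false; // no default declared
  return AUTO_TIMESTAMP_DEFAULT.test(columnDefault);
}
```

Columns flagged this way would be shown to the user for confirmation rather than excluded silently, per the skill prompt change.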
Replace getSchema() (not in interface) with the required listSchemas(), listTables(), and describeTable() methods to match the Connector contract.
Summary
Two improvements to the data-parity LLM skill based on real-world testing:
Algorithm guardrail: joindiff physically cannot see a second table when source_warehouse ≠ target_warehouse. It runs a single FULL OUTER JOIN on one connection, so it always reports 0 differences cross-database. Added a CRITICAL warning to the skill so the LLM always chooses hashdiff or auto for cross-DB comparisons.
Output style: Added explicit instruction to report facts only: counts, changed values, missing rows. No editorializing, no pitching the tool, no "this is exactly why row-level diffing matters" commentary.
Default model: Set anthropic/claude-sonnet-4-6 as the default in opencode.jsonc.
Test plan
- … hashdiff automatically
- joindiff still used correctly for same-DB
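The algorithm guardrail in this PR is a warning in the skill prompt, not code. As an illustration of the same rule, a programmatic check could look like the sketch below; the function name and the choice to fall back to hashdiff (rather than auto) are assumptions.

```typescript
// Hypothetical sketch of the cross-database rule stated in the summary.
type Algorithm = "auto" | "joindiff" | "hashdiff" | "profile" | "cascade";

function guardAlgorithm(
  requested: Algorithm,
  sourceWarehouse: string,
  targetWarehouse: string,
): Algorithm {
  const crossDb = sourceWarehouse !== targetWarehouse;
  // joindiff runs one FULL OUTER JOIN on a single connection, so across
  // warehouses it cannot see the second table and would report 0 diffs.
  if (crossDb && requested === "joindiff") return "hashdiff";
  return requested;
}
```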