A CLI toolkit for comparing, analyzing, and exporting data across databases and file formats.
Supports MSSQL, PostgreSQL, Databricks, and Parquet.
```shell
pip install datatk
# or
uv tool install datatk
```

Requires Python 3.13+ and uv.
```shell
git clone https://github.com/nathanthorell/datatk.git
cd datatk
uv sync --extra dev
```

```shell
datatk --help
```

Available tools:

```shell
datatk data-compare
datatk object-compare
datatk schema-size
datatk db-diagram
datatk export-parquet
datatk proc-tester
datatk view-tester
datatk data-cleanup
```

Copy `.env.example` to `.env` and update the connection strings for your databases:
```shell
cp .env.example .env
```

Connection string formats:

- MSSQL: `Server=host,port;Database=db;UID=user;PWD=pass`
- MSSQL (Azure AD interactive): `Server=host,port;Database=db;Authentication=ActiveDirectoryInteractive` (opens a browser for Entra ID login; `UID` is optional as a login hint)
- PostgreSQL: `postgresql://user:pass@host:port/database`
- Databricks: `databricks://token:ACCESS_TOKEN@host/catalog?http_path=/sql/1.0/warehouses/ID`
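As an illustration, a `.env` file combining two of these formats might look like the sketch below. The variable names `SOURCE_DB` and `TARGET_DB` are invented for this example; use the keys actually defined in `.env.example`.

```shell
# Hypothetical .env — variable names are examples only; see .env.example for the real keys
SOURCE_DB="Server=sql.example.com,1433;Database=sales;UID=reader;PWD=secret"
TARGET_DB="postgresql://reader:secret@pg.example.com:5432/sales"
```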
Copy `config-example.toml` to `config.toml` and configure the tools you want to use:

```shell
cp config-example.toml config.toml
```

The `[datatk]` section sets global defaults (e.g. `logging_level`) that apply to all tools unless overridden in a tool-specific section.
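A minimal global section might look like this (`logging_level` is the only key documented above; any additional keys in `config-example.toml` would sit alongside it):

```toml
[datatk]
# Global default applied to every tool unless a tool-specific section overrides it
logging_level = "INFO"
```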
Compare data across different database platforms.

```shell
datatk data-compare
```

- Supports MSSQL, PostgreSQL, Databricks, and local files (Parquet, CSV, JSON)
- Compare data using inline SQL, query files, or file paths (`db_type = "file"`, `query` set to the file path)
- Output options: `left_only`, `right_only`, `common`, `differences`, or `all`
- Optional case-insensitive string comparison (`case_insensitive`), configurable globally or per comparison
- Reports differences and execution time per source (suppress with `show_performance = false`)
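Putting those options together, a comparison config might look roughly like the sketch below. Only `db_type`, `query`, `case_insensitive`, and `show_performance` are named above; the section layout and every other name here are illustrative guesses — check `config-example.toml` for the real structure.

```toml
# Hypothetical layout — section names are assumptions, not datatk's documented schema
[data_compare]
case_insensitive = true   # documented option: case-insensitive string comparison
show_performance = false  # documented option: suppress per-source timing

[data_compare.left]       # section name is an assumption
db_type = "file"          # documented: compare against a local file
query = "exports/customers.parquet"

[data_compare.right]
db_type = "mssql"         # value is illustrative
query = "SELECT id, name FROM dbo.customers"
```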
Compare database object definitions across environments (DEV, QA, TEST, PROD).

```shell
datatk object-compare
```

- Supports MSSQL and PostgreSQL
- Object types: stored procedures, views, functions, tables, triggers, sequences, indexes, types, extensions (PostgreSQL), external tables (MSSQL), and foreign keys
- Detects objects that exist in only some environments
- Uses MD5 checksums for efficient definition comparison
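The checksum idea can be sketched in Python — this illustrates the technique, not datatk's actual code: hash each object's (normalized) definition and compare short digests instead of diffing full text.

```python
import hashlib


def definition_checksum(definition: str) -> str:
    """MD5 digest of an object definition, normalized so that
    whitespace- and case-only differences do not register as drift."""
    normalized = " ".join(definition.split()).lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def differs(left_sql: str, right_sql: str) -> bool:
    # Comparing two 32-character digests is cheaper than diffing full definitions
    return definition_checksum(left_sql) != definition_checksum(right_sql)


# The same view reformatted across environments compares equal
dev = "CREATE VIEW v AS\nSELECT id FROM t"
prod = "create view v as select id from t"
print(differs(dev, prod))  # False: only whitespace and case differ
```

Whether datatk normalizes before hashing is an assumption here; the digest comparison itself is the documented mechanism.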
Analyze storage across databases by measuring schema sizes.

```shell
datatk schema-size
```

- Connects to multiple servers and calculates data and index space in megabytes
- Summary and detail modes
- Comparative reports across servers and databases
Generate ERD diagrams from database metadata.

```shell
datatk db-diagram
```

- Output formats: DBML (default), Mermaid, PlantUML
- Column display modes: all columns, keys only, or table names only
- Hierarchical mode: focus on relationships around a specific base table with directional traversal (up, down, or both)
- Detects relationships from foreign key constraints
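For a sense of the default DBML output, a two-table schema with one foreign key renders roughly like this (table and column names are invented for the example):

```dbml
Table customers {
  id int [pk]
  name varchar
}

Table orders {
  id int [pk]
  customer_id int
}

Ref: orders.customer_id > customers.id
```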
Export database objects to Parquet files.

```shell
datatk export-parquet
```

- Connects to MSSQL databases and exports tables or query results
- Configurable batch size
- Tracks export timing per object
Batch test stored procedures with configurable default parameters.

```shell
datatk proc-tester
```

- Executes all stored procedures in a configured schema
- Applies default values for common parameter types
- Reports execution status and timing
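One way to read "default values for common parameter types" is sketched below — an illustration of the idea, not datatk's implementation: map each SQL parameter type to a safe placeholder before invoking the procedure.

```python
from datetime import date

# Illustrative mapping from SQL parameter types to safe test defaults
DEFAULTS = {
    "int": 0,
    "bigint": 0,
    "bit": 0,
    "varchar": "",
    "nvarchar": "",
    "datetime": date(2000, 1, 1),
    "date": date(2000, 1, 1),
}


def default_args(param_types: list[str]) -> list:
    """Build a positional argument list for a stored procedure,
    falling back to None for unmapped types."""
    return [DEFAULTS.get(t.lower(), None) for t in param_types]


print(default_args(["INT", "NVARCHAR", "uniqueidentifier"]))
# [0, '', None]
```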
Batch test database views.

```shell
datatk view-tester
```

- Runs a `SELECT TOP 1 *` against each view in a configured schema
- Reports execution status and timing
Delete data using foreign key hierarchy traversal to handle dependencies automatically.

```shell
datatk data-cleanup
```

- Traverses foreign key relationships to determine deletion order
- Summary and execute modes (run summary first to preview)
- Configurable batch size and threshold
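The deletion-order idea can be sketched as a post-order walk of the foreign-key graph (illustrative only — the table names and graph shape below are made up): every table holding a foreign key to a parent must be cleared before that parent.

```python
def deletion_order(referencing: dict[str, list[str]], root: str) -> list[str]:
    """Return tables in a safe deletion order for `root`: each table
    that references a parent via a foreign key is emitted before that
    parent (post-order depth-first traversal)."""
    order: list[str] = []
    seen: set[str] = set()

    def visit(table: str) -> None:
        if table in seen:  # guards against cyclic FK graphs
            return
        seen.add(table)
        for child in referencing.get(table, []):
            visit(child)       # clear referencing rows first
        order.append(table)    # then the table itself

    visit(root)
    return order


# Hypothetical FK chain: order_items references orders, orders references customers
refs = {"customers": ["orders"], "orders": ["order_items"]}
print(deletion_order(refs, "customers"))
# ['order_items', 'orders', 'customers']
```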
```shell
uv run ruff check src/          # Run ruff linter
uv run ruff check src/ --fix    # Run ruff with auto-fix
uv run mypy src/                # Run mypy type checker
uv run ruff format src/         # Format code with ruff
```

Or use the Makefile:

```shell
make lint    # Run ruff and mypy linters
make format  # Format code with ruff
make clean   # Remove temporary files and virtual environment
```