datatk — Data Toolkit

A CLI toolkit for comparing, analyzing, and exporting data across databases and file formats.

Supports MSSQL, PostgreSQL, Databricks, and Parquet.

Installation

pip install datatk
# or
uv tool install datatk

From Source

Requires Python 3.13+ and uv.

git clone https://github.com/nathanthorell/datatk.git
cd datatk
uv sync --extra dev

Quick Start

datatk --help
datatk data-compare
datatk object-compare
datatk schema-size
datatk db-diagram
datatk export-parquet
datatk proc-tester
datatk view-tester
datatk data-cleanup

Configuration

Environment Variables

Copy .env.example to .env and update the connection strings for your databases:

cp .env.example .env

Connection string formats:

MSSQL: Server=host,port;Database=db;UID=user;PWD=pass
MSSQL (Azure AD interactive): Server=host,port;Database=db;Authentication=ActiveDirectoryInteractive (opens browser for Entra ID login; UID is optional as a login hint)
PostgreSQL: postgresql://user:pass@host:port/database
Databricks: databricks://token:ACCESS_TOKEN@host/catalog?http_path=/sql/1.0/warehouses/ID

Tool Configuration

Copy config-example.toml to config.toml and configure the tools you want to use:

cp config-example.toml config.toml

The [datatk] section sets global defaults (e.g. logging_level) that apply to all tools unless overridden in a tool-specific section.

Tools

`data-compare`

Compare data across different database platforms.

datatk data-compare

Supports MSSQL, PostgreSQL, Databricks, and local files (Parquet, CSV, JSON)
Compare data using inline SQL, query files, or file paths (db_type = "file", query set to the file path)
Output options: left_only, right_only, common, differences, or all
Optional case-insensitive string comparison (case_insensitive), configurable globally or per comparison
Reports differences and execution time per source (suppress with show_performance = false)

`object-compare`

Compare database object definitions across environments (DEV, QA, TEST, PROD).

datatk object-compare

Supports MSSQL and PostgreSQL
Object types: stored procedures, views, functions, tables, triggers, sequences, indexes, types, extensions (PostgreSQL), external tables (MSSQL), and foreign keys
Detects objects that exist in only some environments
Uses MD5 checksums for efficient definition comparison

`schema-size`

Analyze storage across databases by measuring schema sizes.

datatk schema-size

Connects to multiple servers and calculates data and index space in megabytes
Summary and detail modes
Comparative reports across servers and databases

`db-diagram`

Generate ERD diagrams from database metadata.

datatk db-diagram

Output formats: DBML (default), Mermaid, PlantUML
Column display modes: all columns, keys only, or table names only
Hierarchical mode: focus on relationships around a specific base table with directional traversal (up, down, or both)
Detects relationships from foreign key constraints

`export-parquet`

Export database objects to Parquet files.

datatk export-parquet

Connects to MSSQL databases and exports tables or query results
Configurable batch size
Tracks export timing per object

`proc-tester`

Batch test stored procedures with configurable default parameters.

datatk proc-tester

Executes all stored procedures in a configured schema
Applies default values for common parameter types
Reports execution status and timing

`view-tester`

Batch test database views.

datatk view-tester

Runs a SELECT TOP 1 * against each view in a configured schema
Reports execution status and timing

`data-cleanup`

Delete data using foreign key hierarchy traversal to handle dependencies automatically.

datatk data-cleanup

Traverses foreign key relationships to determine deletion order
Summary and execute modes (run summary first to preview)
Configurable batch size and threshold

Development

Linting and Formatting

uv run ruff check src/        # Run ruff linter
uv run ruff check src/ --fix  # Run ruff with auto-fix
uv run mypy src/              # Run mypy type checker
uv run ruff format src/       # Format code with ruff

Or use the Makefile:

make lint    # Run ruff and mypy linters
make format  # Format code with ruff
make clean   # Remove temporary files and virtual environment

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
.github/workflows		.github/workflows
src/datatk		src/datatk
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
config-example.toml		config-example.toml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

datatk — Data Toolkit

Installation

From Source

Quick Start

Configuration

Environment Variables

Tool Configuration

Tools

`data-compare`

`object-compare`

`schema-size`

`db-diagram`

`export-parquet`

`proc-tester`

`view-tester`

`data-cleanup`

Development

Linting and Formatting

About

Uh oh!

Releases 2

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

datatk — Data Toolkit

Installation

From Source

Quick Start

Configuration

Environment Variables

Tool Configuration

Tools

data-compare

object-compare

schema-size

db-diagram

export-parquet

proc-tester

view-tester

data-cleanup

Development

Linting and Formatting

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Uh oh!

Contributors

Uh oh!

Languages

`data-compare`

`object-compare`

`schema-size`

`db-diagram`

`export-parquet`

`proc-tester`

`view-tester`

`data-cleanup`