Query any data file using SQL. One command, instant results.
Quick Start • Features • Installation • Usage • LLM Integration • Documentation
DataQL is a CLI tool written in Go that lets you query and manipulate data files using SQL. It loads data into an SQLite database (in-memory or file-based), giving you the full power of SQL over your data.
Working with data files has always been tedious. You either write throwaway scripts, load everything into pandas, or copy-paste into spreadsheets. With LLMs entering the workflow, a new problem emerged: how do you analyze a 10MB CSV without burning through your entire context window?
Traditional approaches fail:
- Send file to LLM context: 10MB CSV = ~100,000+ tokens. Expensive, slow, often impossible.
- Write a script: Context switch, setup overhead, not conversational.
- Use pandas/Excel: Great for humans, useless for LLM automation.
DataQL lets you query any data file using SQL. One command, instant results:
# Instead of sending 50,000 rows to an LLM...
dataql run -f sales.csv -q "SELECT region, SUM(revenue) FROM sales GROUP BY region"
# You get just what you need:
# region | SUM(revenue)
# North | 1,234,567
# South | 987,654

| Scenario | Without DataQL | With DataQL |
|---|---|---|
| Analyze 10MB CSV with LLM | ~100,000 tokens ($3+) | ~500 tokens ($0.01) |
| Query data from S3 | Download → Script → Parse | One command |
| Join CSV + JSON + Database | Custom ETL pipeline | Single SQL query |
| Automate data reports | Complex scripts | Simple CLI + cron |
| LLM data analysis | Context overflow | No size limit |
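The "automate data reports" row, for instance, comes down to scheduling a single command. A minimal sketch — the schedule, paths, and column names below are placeholders, not from a real setup:

# Hypothetical crontab entry: nightly regional revenue summary exported to JSON
0 2 * * * dataql run -f /data/sales.csv -q "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region" -e /reports/summary.json -t json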
- Token Efficient: LLMs get query results, not raw data. 99% reduction in token usage.
- Universal Format Support: CSV, JSON, Parquet, Excel, XML, YAML, Avro, ORC - all queryable with SQL.
- Any Data Source: Local files, URLs, S3, GCS, Azure, PostgreSQL, MySQL, MongoDB.
- LLM-Native: Built-in MCP server for Claude, Codex, Gemini. Skills for Claude Code.
- Zero Setup: Single binary, no dependencies, no configuration files.
- Familiar Syntax: If you know SQL, you know DataQL.
# Install DataQL
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash
# Query a CSV file
dataql run -f data.csv -q "SELECT * FROM data WHERE amount > 100"
# Query JSON from a URL
dataql run -f "https://api.example.com/data.json" -q "SELECT name, value FROM data"
# Query data from S3
dataql run -f "s3://bucket/data.parquet" -q "SELECT * FROM data LIMIT 10"
# Export results to JSON
dataql run -f data.csv -q "SELECT * FROM data" -e output.json -t json
# Interactive REPL mode
dataql run -f data.csv
# dataql> SELECT COUNT(*) FROM data;
# dataql> .tables
# dataql> .exit

Supported File Formats:
- CSV (with configurable delimiter)
- JSON (arrays or single objects)
- JSONL/NDJSON (newline-delimited JSON)
- XML
- YAML
- Parquet
- Excel (.xlsx, .xls)
- Avro
- ORC
Data Sources:
- Local files
- HTTP/HTTPS URLs
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
- Standard input (stdin)
- Message Queues (SQS, Kafka, RabbitMQ - peek without consuming)
Database Connectors:
- PostgreSQL
- MySQL
- DuckDB
- MongoDB
- DynamoDB
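Each connector is addressed through a connection string passed to `-f`. For example, a MongoDB collection can be queried like a table — a sketch where host, database, and collection names are placeholders:

dataql run -f "mongodb://localhost:27017/shop?collection=orders" \
  -q "SELECT status, COUNT(*) AS n FROM orders GROUP BY status"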
Key Capabilities:
- Execute SQL queries using SQLite syntax
- Export results to CSV, JSONL, JSON, Excel, Parquet, XML, YAML formats
- Interactive REPL mode with command history
- Progress bar for large file operations
- Parallel file processing for multiple inputs
- Automatic flattening of nested JSON objects
- Join data from multiple sources
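Each `-f` input is loaded as its own table, so a mixed CSV/JSON join is a single command. A sketch assuming hypothetical `users.csv` and `orders.json` files that share a user id:

dataql run -f users.csv -f orders.json \
  -q "SELECT u.name, SUM(o.amount) AS total FROM users u JOIN orders o ON o.user_id = u.id GROUP BY u.name"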
LLM Integration:
- MCP Server for Claude Code, OpenAI Codex, Google Gemini
- Auto-activating Claude Code Skills
- Token-efficient data processing for AI assistants
DataQL is designed for efficient use with Large Language Models, enabling AI assistants to query large datasets without loading entire files into context.
# Install skills for Claude Code
dataql skills install
# Or start MCP server for any LLM
dataql mcp serve

Why use DataQL with LLMs?
| Traditional Approach | With DataQL |
|---|---|
| Send 10MB CSV to context | Run SQL query |
| ~100,000+ tokens | ~500 tokens |
| Limited by context window | No file size limit |
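Registering the MCP server with a client is then a one-liner. A sketch using Claude Code's CLI — verify the exact `claude mcp add` syntax against your client's documentation:

# Register DataQL as a stdio MCP server in Claude Code (syntax may vary by client/version)
claude mcp add dataql -- dataql mcp serve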
See LLM Integration Guide for complete documentation.
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

Specific version:
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --version v1.0.0

Windows (PowerShell):
$env:DATAQL_VERSION="v1.0.0"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

User installation (no sudo/admin required):
Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --local

Windows (PowerShell):
$env:DATAQL_USER_INSTALL="true"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

From source:

# Clone the repository
git clone https://github.com/adrianolaselva/dataql.git
cd dataql
# Build and install
make build
make install # requires sudo
# or
make install-local # installs to ~/.local/bin

Verify the installation:

dataql --version

Upgrade to latest version:
Linux / macOS:
# Only upgrades if a newer version is available
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --upgrade
# Force reinstall (same or different version)
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --force
# Clean install (remove all versions first, then install)
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.sh | bash -s -- --clean --force

Windows (PowerShell):
# Force reinstall
$env:DATAQL_FORCE="true"; irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/install.ps1 | iex

Uninstall:

Linux / macOS:
curl -fsSL https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/uninstall.sh | bash

Windows (PowerShell):
irm https://raw.githubusercontent.com/adrianolaselva/dataql/main/scripts/uninstall.ps1 | iex

Load a data file and start interactive mode (format is auto-detected):
# CSV file
dataql run -f data.csv -d ","
# JSON file (array or single object)
dataql run -f data.json
# JSONL/NDJSON file (one JSON per line)
dataql run -f data.jsonl

| Format | Extensions | Description |
|---|---|---|
| CSV | `.csv` | Comma-separated values with configurable delimiter |
| JSON | `.json` | JSON arrays or single objects |
| JSONL | `.jsonl`, `.ndjson` | Newline-delimited JSON (streaming) |
| XML | `.xml` | XML documents |
| YAML | `.yaml`, `.yml` | YAML documents |
| Parquet | `.parquet` | Apache Parquet columnar format |
| Excel | `.xlsx`, `.xls` | Microsoft Excel spreadsheets |
| Avro | `.avro` | Apache Avro format |
| ORC | `.orc` | Apache ORC format |
| Source | Format | Example |
|---|---|---|
| Local file | Path | `-f data.csv` |
| HTTP/HTTPS | URL | `-f "https://example.com/data.csv"` |
| Amazon S3 | `s3://` | `-f "s3://bucket/path/data.csv"` |
| Google Cloud Storage | `gs://` | `-f "gs://bucket/path/data.json"` |
| Azure Blob | `az://` | `-f "az://container/path/data.parquet"` |
| Standard input | `-` | `cat data.csv \| dataql run -f -` |
| PostgreSQL | `postgres://` | `-f "postgres://user:pass@host/db?table=t"` |
| MySQL | `mysql://` | `-f "mysql://user:pass@host/db?table=t"` |
| DuckDB | `duckdb://` | `-f "duckdb:///path/db.db?table=t"` |
| MongoDB | `mongodb://` | `-f "mongodb://host/db?collection=c"` |
| DynamoDB | `dynamodb://` | `-f "dynamodb://region/table-name"` |
| Flag | Short | Description | Default |
|---|---|---|---|
| `--file` | `-f` | Input file, URL, or database connection | Required |
| `--delimiter` | `-d` | CSV delimiter (only for CSV files) | `,` |
| `--query` | `-q` | SQL query to execute | - |
| `--export` | `-e` | Export path | - |
| `--type` | `-t` | Export format (`csv`, `jsonl`, `json`, `excel`, `parquet`, `xml`, `yaml`) | - |
| `--storage` | `-s` | SQLite file path (for persistence) | In-memory |
| `--lines` | `-l` | Limit number of lines/records to read | All |
| `--collection` | `-c` | Custom table name | Filename |
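Flags compose freely; for example, `--lines` and `--collection` make it easy to sample a large file under a custom table name (the file name below is a placeholder):

# Load only the first 1,000 records and register them as table "sample"
dataql run -f big_export.csv -l 1000 -c sample -q "SELECT * FROM sample LIMIT 10"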
Interactive Mode:
dataql run -f sales.csv -d ";"

dataql> SELECT product, SUM(amount) as total FROM sales GROUP BY product ORDER BY total DESC LIMIT 10;
product total
Widget Pro 125430.50
Gadget Plus 98210.00
...
Execute Query and Display Results:
dataql run -f data.csv -d "," -q "SELECT * FROM data WHERE amount > 100 LIMIT 10"

Export to JSONL:
dataql run -f input.csv -d "," \
-q "SELECT id, name, value FROM input WHERE status = 'active'" \
-e output.jsonl -t jsonl

Export to CSV:
dataql run -f input.csv -d "," \
-q "SELECT * FROM input" \
-e output.csv -t csv

Multiple Input Files:
dataql run -f file1.csv -f file2.csv -d "," \
-q "SELECT a.*, b.extra FROM file1 a JOIN file2 b ON a.id = b.id"Query JSON Files:
# JSON array
dataql run -f users.json -q "SELECT name, email FROM users WHERE status = 'active'"
# JSON with nested objects (automatically flattened)
# {"user": {"name": "John", "address": {"city": "NYC"}}}
# becomes columns: user_name, user_address_city
dataql run -f data.json -q "SELECT user_name, user_address_city FROM data"

Query JSONL/NDJSON Files:
# JSONL is ideal for large datasets (streaming, low memory)
dataql run -f logs.jsonl -q "SELECT level, message, timestamp FROM logs WHERE level = 'ERROR'"
# Works with .ndjson extension too
dataql run -f events.ndjson -q "SELECT COUNT(*) as total FROM events"

Custom Table Name:
# Use --collection to specify a custom table name
dataql run -f data.json -c my_table -q "SELECT * FROM my_table"

Persist to SQLite File:
dataql run -f data.csv -d "," -s ./database.db

Query from URL:
dataql run -f "https://raw.githubusercontent.com/datasets/population/main/data/population.csv" \
-q "SELECT Country_Name, Value FROM population WHERE Year = 2020 LIMIT 10"Query from S3:
dataql run -f "s3://my-bucket/data/sales.csv" \
-q "SELECT product, SUM(amount) as total FROM sales GROUP BY product"Query from PostgreSQL:
dataql run -f "postgres://user:pass@localhost:5432/mydb?table=orders" \
-q "SELECT * FROM orders WHERE status = 'completed'"Peek at SQS messages (without consuming):
dataql run -f "sqs://my-events-queue?region=us-east-1" \
-q "SELECT message_id, body_event_type, timestamp FROM my_events_queue"Read from stdin:
cat data.csv | dataql run -f - -q "SELECT * FROM stdin WHERE value > 100"

Complete workflow (download, query, export):

# Download sample data
wget https://www.stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2021-financial-year-provisional/Download-data/annual-enterprise-survey-2021-financial-year-provisional-csv.csv -O survey.csv
# Query and export
dataql run -f survey.csv -d "," \
-q "SELECT Year, Industry_aggregation_NZSIOC as industry, Variable_name as metric, Value as amount FROM survey WHERE Value > 1000" \
-e analysis.jsonl -t jsonl

DataQL uses SQLite under the hood, supporting standard SQL syntax:
-- Basic SELECT
SELECT column1, column2 FROM tablename;
-- Filtering
SELECT * FROM data WHERE amount > 100 AND status = 'active';
-- Aggregation
SELECT category, COUNT(*), SUM(value) FROM data GROUP BY category;
-- Joins (multiple files)
SELECT a.*, b.extra FROM file1 a JOIN file2 b ON a.id = b.id;
-- Ordering and Limiting
SELECT * FROM data ORDER BY created_at DESC LIMIT 100;

Note: Table names are derived from filenames (without extension). For `sales.csv`, `sales.json`, or `sales.jsonl`, use `SELECT * FROM sales`. Use the `--collection` flag to specify a custom table name.
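Since queries run on SQLite, its built-in functions (date handling, string functions, CASE expressions) are available inside `-q` as well — a sketch with illustrative file and column names:

# SQLite built-ins such as strftime() and CASE work in queries (assumes an ISO-8601 created_at column)
dataql run -f orders.csv \
  -q "SELECT strftime('%Y-%m', created_at) AS month, SUM(CASE WHEN status = 'refunded' THEN 0 ELSE amount END) AS net FROM orders GROUP BY month ORDER BY month"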
For detailed documentation, see:
- Getting Started - Installation and Hello World examples
- CLI Reference - Complete command-line reference
- Data Sources - Working with S3, GCS, Azure, URLs, and stdin
- Database Connections - Connect to PostgreSQL, MySQL, DuckDB, MongoDB
- LLM Integration - Use DataQL with Claude, Codex, Gemini
- MCP Setup - Configure MCP server for LLM integration
- Examples - Real-world usage examples and automation scripts
- Go 1.24 or higher
- GCC (for CGO compilation - required for SQLite, DuckDB)
- Docker and Docker Compose (for E2E tests)
make build

# Unit tests
make test
# E2E tests (requires Docker)
make e2e-up # Start infrastructure (PostgreSQL, MySQL, MongoDB, Kafka, LocalStack)
make e2e-wait # Wait for services to be healthy
make e2e-test-scripts # Run all E2E tests
make e2e-down          # Stop infrastructure

DataQL includes comprehensive E2E tests for all data sources:
| Data Source | Tests | Coverage |
|---|---|---|
| PostgreSQL | 26 | SELECT, WHERE, ORDER BY, LIMIT, aggregates, exports |
| MySQL | 26 | SELECT, WHERE, ORDER BY, LIMIT, aggregates, exports |
| MongoDB | 20+ | Collections, queries, filters, exports |
| Kafka | 10+ | Peek mode, message parsing, exports |
| S3 (LocalStack) | 13 | CSV, JSON, JSONL file reading, queries, exports |
| SQS (LocalStack) | 16 | Message reading, filtering, aggregation, exports |
See e2e/README.md for detailed E2E testing documentation.
make lint

Contributions are welcome! Please read our Contributing Guide for details on our code of conduct and the process for submitting pull requests.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This is a rewrite of csvql, an earlier experiment I did back in 2019. The original was simple and limited. This version? Built entirely with AI assistance (Claude Code). I wanted to see how far AI-assisted development could go, and honestly, it went pretty far. The code, docs, tests - all of it came from conversations with an AI. Make of that what you will.
This project is licensed under the MIT License - see the LICENSE file for details.
- csvql - The original project that inspired this rewrite
- Claude Code - AI assistant that helped build this entire project
- SQLite - Embedded database engine
- Cobra - CLI framework
- go-sqlite3 - SQLite driver for Go
