23 changes: 12 additions & 11 deletions ARCHITECTURE.md
@@ -42,7 +42,7 @@ question
┌──────────────────┐
│ safety.py        │  Parse with sqlglot in Postgres dialect. Reject anything
│ safety.py        │  Parse with sqlglot in the database dialect. Reject anything
│ validate_select  │  that is not a single SELECT / WITH / UNION /
│ _only            │  INTERSECT / EXCEPT. Reject CTEs that hide DML
│                  │  (WITH x AS (DELETE ... RETURNING ...) SELECT ...).
@@ -61,10 +61,10 @@ question
┌──────────────────┐
│ db.py            │  Execute. The Postgres session was opened with
│ Database.execute │  default_transaction_read_only = on and a 60s
│                  │  statement_timeout, so even if safety.py failed
│                  │  the database itself would refuse a write.
│ db.py            │  Execute. Postgres sessions use
│ Database.execute │  default_transaction_read_only = on; SQLite files open
│                  │  mode=ro and use PRAGMA query_only = ON, so even if
│                  │  safety.py failed the database itself refuses writes.
└────────┬─────────┘
@@ -81,8 +81,8 @@ question
| `__init__.py` | Package version. |
| `__main__.py` | `python -m promptquery` entry point. |
| `cli.py` | Click command and prompt-toolkit REPL. Orchestrates the whole pipeline. |
| `db.py` | psycopg3 connection wrapper. Sets the read-only session. |
| `schema.py` | Dataclasses + the `pg_catalog` queries that introspect them. |
| `db.py` | Postgres and SQLite connection wrappers. Sets the read-only session. |
| `schema.py` | Dataclasses + database-specific introspection adapters. |
| `retrieval.py` | Tokenizer, TF-IDF ranker, FK-graph expander. |
| `llm.py` | Provider clients (Anthropic, OpenAI), SQL extractor, provider factory. |
| `prompts.py` | System prompt template and schema-to-prompt formatter. |
@@ -95,7 +95,7 @@ question
- **`schema.py` ↔ `prompts.py`** — `format_schema` walks the same dataclasses. Schema additions usually need a prompt update.
- **`cli.py` ↔ everything else** — the only file that knows the full pipeline. New stages (query history, post-execution feedback) wire in here.
- **`safety.py` ↔ `llm.py`** — `extract_sql` runs first; `validate_select_only` runs second. Together they handle the case where the model returns malformed output.
- **`db.py` ↔ `safety.py`** — two layers, intentionally redundant. Either alone is insufficient; both together make a write impossible.
- **`db.py` ↔ `safety.py`** — two layers, intentionally redundant. Either alone is insufficient; together they keep execution read-only at both the SQL-parser and database-session layers.

## Design bets

@@ -109,7 +109,7 @@ Many natural-language questions reference one core entity but require joins thro

### Why two safety layers

`safety.py` is the primary guard. It parses every statement and rejects anything other than a SELECT. The Postgres session-level `default_transaction_read_only = on` is the fallback: if a malicious prompt somehow produces SQL that the validator misclassifies (a parser bug, an unknown construct), Postgres itself refuses the write.
`safety.py` is the primary guard. It parses every statement in the selected database dialect and rejects anything other than a SELECT. The session-level read-only mode is the fallback: Postgres uses `default_transaction_read_only = on`, while SQLite files open with `mode=ro` and use `PRAGMA query_only = ON`. If a malicious prompt somehow produces SQL that the validator misclassifies (a parser bug, an unknown construct), the database itself refuses the write.

This redundancy is not paranoia. AI-generated SQL is, by construction, less predictable than human-written SQL. The cost of one accidental `DELETE` is high enough that doubling up is the only sensible default.
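
The SQLite half of the fallback layer can be checked in isolation. The following is a minimal sketch that uses only the standard-library `sqlite3` module; the helper names `open_read_only`, `write_is_refused`, and `demo`, and the table `t`, are illustrative assumptions and not part of PromptQuery.

```python
import os
import sqlite3
import tempfile
from urllib.parse import quote


def open_read_only(path: str) -> sqlite3.Connection:
    """Open a SQLite file read-only (hypothetical helper, not PromptQuery's API)."""
    # mode=ro asks SQLite to open the file itself without write access.
    conn = sqlite3.connect(f"file:{quote(path, safe='/')}?mode=ro", uri=True)
    # query_only additionally rejects statements that would change the database.
    conn.execute("PRAGMA query_only = ON")
    return conn


def write_is_refused(conn: sqlite3.Connection) -> bool:
    """Return True if the connection rejects an INSERT into the demo table."""
    try:
        conn.execute("INSERT INTO t VALUES (2)")
    except sqlite3.OperationalError:
        return True
    return False


def demo() -> bool:
    """Seed a temporary database, reopen it read-only, and try to write."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        seed = sqlite3.connect(path)
        seed.execute("CREATE TABLE t (x INTEGER)")
        seed.execute("INSERT INTO t VALUES (1)")
        seed.commit()
        seed.close()

        conn = open_read_only(path)
        try:
            return write_is_refused(conn)
        finally:
            conn.close()
```

Calling `demo()` exercises the database-level guard with no SQL validator involved, which is the situation the fallback layer exists for.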

@@ -121,7 +121,7 @@ This redundancy is not paranoia. AI-generated SQL is, by construction, less pred

These are intentionally out of scope for the MVP. They are tracked in the [roadmap](README.md#roadmap).

- MySQL / SQLite support — needs an adapter abstraction first.
- MySQL support — needs an adapter implementation and optional driver decision.
- Multi-database sessions in one REPL.
- Data visualization (charts, plots).
- Query-history persistence between sessions.
@@ -136,7 +136,7 @@ Run the test suite:
pytest
```

All tests in v0.1 are pure Python — no live database required. The integration test harness (docker-compose + a real Postgres) is queued for v0.2 alongside the public benchmark suite against Spider / BIRD.
All core tests are pure Python — no live external database required. SQLite adapter tests use temporary local database files. The integration test harness (docker-compose + a real Postgres) is queued for v0.2 alongside the public benchmark suite against Spider / BIRD.

The most safety-critical file is `tests/test_safety.py`. Cases there encode "things the validator MUST reject." Add cases when you discover new attack vectors; do not delete cases during refactors.
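
A test built on a temporary local database file, in the spirit of the SQLite adapter tests mentioned above, might be sketched as follows. This is an illustration under stated assumptions: `column_names` and the `users` table are hypothetical, and the code relies only on the standard-library `sqlite3` and `tempfile` modules, not on PromptQuery's own adapters.

```python
import os
import sqlite3
import tempfile


def column_names(path: str, table: str) -> list[str]:
    """List a table's columns via PRAGMA table_info (hypothetical helper)."""
    conn = sqlite3.connect(path)
    try:
        # Quote the table name as a SQL string literal, doubling single quotes.
        quoted = "'" + table.replace("'", "''") + "'"
        rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    finally:
        conn.close()
    # Each row is (cid, name, type, notnull, dflt_value, pk); keep the name.
    return [row[1] for row in rows]


def test_column_names_on_temp_file() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "adapter.db")
        conn = sqlite3.connect(path)
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"
        )
        conn.commit()
        conn.close()
        assert column_names(path, "users") == ["id", "email"]
```

Because the database lives in a temporary directory, the test needs no external service and leaves nothing behind.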

@@ -160,6 +160,7 @@ PromptQuery/
│   ├── safety.py
│   └── render.py
└── tests/                 pytest suite
    ├── test_schema_adapters.py
    ├── test_safety.py
    └── test_retrieval.py
```
18 changes: 10 additions & 8 deletions README.md
@@ -1,13 +1,13 @@
# PromptQuery

> **Natural-language SQL for production-scale Postgres schemas.**
> **Natural-language SQL for production-scale Postgres and SQLite schemas.**

[![PyPI](https://img.shields.io/pypi/v/promptquery.svg)](https://pypi.org/project/promptquery/)
[![CI](https://github.com/Cyberfilo/promptquery/actions/workflows/ci.yml/badge.svg)](https://github.com/Cyberfilo/promptquery/actions/workflows/ci.yml)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python: 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](pyproject.toml)

PromptQuery is an open-source CLI that lets you query Postgres in plain English — engineered for **real production schemas with hundreds of tables**, not toy demos. It introspects your schema, generates SQL, shows it for confirmation, and runs it read-only.
PromptQuery is an open-source CLI that lets you query Postgres and SQLite in plain English — engineered for **real production schemas with hundreds of tables**, not toy demos. It introspects your schema, generates SQL, shows it for confirmation, and runs it read-only.

<p align="center">
<img src="https://raw.githubusercontent.com/Cyberfilo/promptquery/main/docs/demo.gif" alt="PromptQuery turning the plain-English question 'orders over 1000 euros with the customer name and status' into a correct multi-table JOIN — showing the SQL, asking for confirmation, then printing the result rows" width="820">
@@ -55,6 +55,7 @@ export ANTHROPIC_API_KEY=...

# Connect and start asking:
prq postgresql://localhost/mydb
prq sqlite:///local.db
```

`prq` and `pquery` are short aliases for `promptquery`. All three commands work identically.
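
How a `sqlite:///` DSN maps to a file path can be sketched as below. This is a simplified restatement of the `_sqlite_path_from_dsn` helper added in this change, not a copy of it; the function name `sqlite_path` is an assumption made for illustration. Three slashes give a relative path, four give an absolute one.

```python
def sqlite_path(dsn: str) -> str:
    """Extract the database path from a sqlite:/// DSN (illustrative helper)."""
    prefix = "sqlite:///"
    if not dsn.startswith(prefix):
        raise ValueError("SQLite DSNs must use sqlite:///path/to.db")
    # Everything after the third slash is the path; a fourth slash therefore
    # survives as the leading "/" of an absolute path.
    path = dsn[len(prefix):]
    if not path:
        raise ValueError("SQLite DSN is missing a database path")
    return path
```

For example, `sqlite_path("sqlite:///local.db")` yields the relative path `local.db`, while `sqlite_path("sqlite:////var/data/app.db")` yields `/var/data/app.db`.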
@@ -67,6 +68,7 @@ prq postgresql://localhost/mydb
prq --query "how many users in Italy" postgresql://localhost/mydb # JSON to stdout
prq --query "top 10 orders by total" --out csv postgresql://... > out.csv
prq --query "..." --out table postgresql://... # rich-formatted table
prq --query "top customers by spend" sqlite:///local.db # SQLite local file
```

Exit codes: `0` success · `1` LLM/connection error · `2` safety-guard rejection · `3` execution error.
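
A script that wraps the CLI could branch on these exit codes roughly as follows. This is a sketch, assuming `prq` is installed and on `PATH`; `describe_exit` and `run_query` are hypothetical helper names, and only the `--query` flag and the exit codes listed above are taken from this README.

```python
import subprocess

# Exit codes as documented above.
EXIT_MEANINGS = {
    0: "success",
    1: "LLM or connection error",
    2: "safety-guard rejection",
    3: "execution error",
}


def describe_exit(code: int) -> str:
    """Translate a prq exit code into a human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_query(question: str, dsn: str) -> tuple[int, str]:
    """Run one question through prq and return (exit code, stdout).

    Assumes the prq executable is available; stdout carries the JSON result
    and progress messages go to stderr, per the examples above.
    """
    proc = subprocess.run(
        ["prq", "--query", question, dsn],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout
```

Treating code `2` separately from `3` lets a caller distinguish a query that was refused before execution from one that failed while running.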
@@ -133,7 +135,7 @@ question
└────────┬──────────┘
"Run? [y/N]" → execute against a read-only Postgres session
"Run? [y/N]" → execute against a read-only database session
```

See [ARCHITECTURE.md](ARCHITECTURE.md) for the deep dive (file inventory, design bets, the patent-landmine non-goals).
@@ -167,8 +169,8 @@ If both are set, Anthropic is preferred. Override either with `--model anthropic

PromptQuery has **two independent layers** so a write is impossible, even if one layer fails:

1. **Session-level**: every Postgres session opens with `default_transaction_read_only = on` and a 60-second `statement_timeout`. The database itself refuses non-SELECT operations.
2. **Pre-execution**: every generated query is parsed with `sqlglot` and rejected unless it's a single `SELECT` / `WITH` / `UNION` / `INTERSECT` / `EXCEPT`. The validator also catches CTEs that hide DML (`WITH x AS (DELETE …) SELECT * FROM x`) and dangerous-function calls (`pg_terminate_backend`, `set_config`, `lo_export`, `dblink_exec`).
1. **Session-level**: every Postgres session opens with `default_transaction_read_only = on` and a 60-second `statement_timeout`; SQLite files open with `mode=ro` and enable `PRAGMA query_only = ON`. The database itself refuses non-SELECT operations.
2. **Pre-execution**: every generated query is parsed with `sqlglot` in the selected database dialect and rejected unless it's a single `SELECT` / `WITH` / `UNION` / `INTERSECT` / `EXCEPT`. The validator also catches CTEs that hide DML (`WITH x AS (DELETE …) SELECT * FROM x`) and dangerous-function calls (`pg_terminate_backend`, `set_config`, `lo_export`, `dblink_exec`, `load_extension`).

Every query is also shown to you before it runs. Confirm with `y`.

@@ -226,7 +228,7 @@ See [`eval/END_TO_END.md`](eval/END_TO_END.md) for the harness internals.
## What PromptQuery does NOT do (yet)

- **No writes.** `SELECT` only, by design and by belt-and-suspenders.
- **Postgres only.** MySQL and SQLite are on the v0.4 roadmap.
- **Full multi-dialect coverage.** Postgres remains the reference implementation and SQLite local files are supported; MySQL is still on the roadmap.
- **One database at a time.** No multi-DB sessions.
- **No data visualisation.** Rows out, that's it. Pipe to `csv` / `jq` / your tool of choice.

@@ -236,7 +238,7 @@ See [`eval/END_TO_END.md`](eval/END_TO_END.md) for the harness internals.

- **v0.2 (shipped)** — LLM-assisted table selector, stemmed TF-IDF.
- **v0.3** — local LLMs (Ollama), schema anonymisation (GDPR-by-default), query-history-as-few-shot.
- **v0.4** — MySQL + SQLite adapters, MCP server mode, public competitor benchmark.
- **v0.4** — MySQL adapter, MCP server mode, public competitor benchmark.

---

@@ -255,7 +257,7 @@ python3.12 -m venv .venv
.venv/bin/python -m eval.retrieval
```

37 tests, all pure-Python — no live database or API key required for the core suite.
55 tests, all pure-Python — no live database or API key required for the core suite.

---

16 changes: 9 additions & 7 deletions src/promptquery/cli.py
@@ -14,7 +14,7 @@
from rich.console import Console

from . import __version__
from .db import Database
from .db import Database, SQLiteDatabase, make_database
from .llm import LLMClient, LLMError, extract_sql, make_client
from .prompts import build_system_prompt
from .render import render_results, render_sql
@@ -81,7 +81,7 @@ def run_question(
    retriever: TfIdfRetriever,
    llm: LLMClient,
    selector_llm: LLMClient | None,
    db: Database,
    db: Database | SQLiteDatabase,
    *,
    top_k: int,
    select_n: int,
@@ -125,7 +125,7 @@ def run_question(
        return QueryResult(Outcome.EMPTY_SQL, None, [], [], "LLM returned an empty response.")

    try:
        validate_select_only(sql)
        validate_select_only(sql, dialect=getattr(db, "dialect", "postgres"))
    except UnsafeQuery as e:
        return QueryResult(Outcome.UNSAFE, sql, [], [], str(e))

@@ -210,14 +210,15 @@ def main(dsn: str, model: str | None, selector_model: str | None,
         query: str | None, out_format: str | None,
         top_k: int, select_n: int, max_tables: int,
         no_selector: bool, yes: bool) -> None:
    """PromptQuery — natural-language SQL for Postgres.
    """PromptQuery — natural-language SQL for Postgres and SQLite.

    DSN is a libpq connection string, e.g. postgresql://user:pass@host/db.
    DSN is a libpq connection string or sqlite:///path/to.db.

    Examples:

      Interactive REPL:
        promptquery postgresql://localhost/mydb
        promptquery sqlite:///local.db

      One-shot query (machine-friendly JSON to stdout, progress to stderr):
        promptquery -q "how many users in Italy" postgresql://localhost/mydb
@@ -250,7 +251,7 @@ def main(dsn: str, model: str | None, selector_model: str | None,

    progress.print(f"[dim]Connecting to[/dim] {_redact(dsn)} [dim]...[/dim]")
    try:
        db_ctx = Database(dsn).__enter__()
        db_ctx = make_database(dsn).__enter__()
    except Exception as e:
        progress.print(f"[red]Connection failed:[/red] {e}")
        sys.exit(1)
@@ -267,7 +268,8 @@ def main(dsn: str, model: str | None, selector_model: str | None,
        else (" (selector: same)" if selector_llm is not None else " (selector: off)")
    )
    progress.print(f"[green]✓[/green] {len(schema.tables)} tables found "
                   f"[dim](sql: {llm.name}/{llm.model}{selector_info})[/dim]")
                   f"[dim](db: {db_ctx.dialect}, sql: "
                   f"{llm.name}/{llm.model}{selector_info})[/dim]")

    retriever = TfIdfRetriever(schema)

90 changes: 90 additions & 0 deletions src/promptquery/db.py
@@ -1,10 +1,16 @@
from __future__ import annotations

import sqlite3
from urllib.parse import quote

import psycopg
from psycopg.rows import dict_row


class Database:
    dialect = "postgres"
    default_schema = "public"

    def __init__(self, dsn: str):
        self.dsn = dsn
        self.conn: psycopg.Connection | None = None
@@ -47,3 +53,87 @@ def __enter__(self) -> "Database":

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


class SQLiteDatabase:
    dialect = "sqlite"
    default_schema = "main"

    def __init__(self, dsn: str):
        self.dsn = dsn
        self.path = _sqlite_path_from_dsn(dsn)
        self.conn: sqlite3.Connection | None = None

    def connect(self) -> None:
        if self.path == ":memory:":
            self.conn = sqlite3.connect(self.path)
        else:
            self.conn = sqlite3.connect(
                f"file:{quote(self.path, safe='/')}?mode=ro",
                uri=True,
            )
        self.conn.row_factory = sqlite3.Row
        with self.conn:
            self.conn.execute("PRAGMA foreign_keys = ON")
            self.conn.execute("PRAGMA query_only = ON")
            self.conn.execute("PRAGMA busy_timeout = 60000")

    def close(self) -> None:
        if self.conn is not None:
            self.conn.close()
            self.conn = None

    def _require_conn(self) -> sqlite3.Connection:
        if self.conn is None:
            raise RuntimeError("Database is not connected")
        return self.conn

    def fetch_dicts(self, sql: str) -> list[dict]:
        conn = self._require_conn()
        cur = conn.execute(sql)
        return [dict(row) for row in cur.fetchall()]

    def execute(self, sql: str) -> tuple[list[str], list[tuple]]:
        conn = self._require_conn()
        cur = conn.execute(sql)
        if cur.description is None:
            return [], []
        cols = [d[0] for d in cur.description]
        rows = [tuple(row) for row in cur.fetchall()]
        return cols, rows

    def pragma_dicts(self, name: str, argument: str) -> list[dict]:
        conn = self._require_conn()
        cur = conn.execute(f"PRAGMA {name}({_quote_sqlite_literal(argument)})")
        return [dict(row) for row in cur.fetchall()]

    def __enter__(self) -> "SQLiteDatabase":
        self.connect()
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


def make_database(dsn: str) -> Database | SQLiteDatabase:
    if dsn.startswith("sqlite:///"):
        return SQLiteDatabase(dsn)
    return Database(dsn)


def _sqlite_path_from_dsn(dsn: str) -> str:
    if not dsn.startswith("sqlite:///"):
        raise ValueError("SQLite DSNs must use sqlite:///path/to.db")

    path = dsn[len("sqlite:///"):]
    if not path:
        raise ValueError("SQLite DSN is missing a database path")
    if path == ":memory:":
        return path
    if dsn.startswith("sqlite:////"):
        return "/" + dsn[len("sqlite:////"):]
    return path


def _quote_sqlite_literal(value: str) -> str:
    return "'" + value.replace("'", "''") + "'"
5 changes: 3 additions & 2 deletions src/promptquery/safety.py
@@ -40,16 +40,17 @@ class UnsafeQuery(Exception):
    "lo_export",
    "dblink_exec",
    "set_config",
    "load_extension",
}


def validate_select_only(sql: str) -> None:
def validate_select_only(sql: str, *, dialect: str = "postgres") -> None:
    sql = (sql or "").strip()
    if not sql:
        raise UnsafeQuery("empty SQL")

    try:
        statements = sqlglot.parse(sql, read="postgres")
        statements = sqlglot.parse(sql, read=dialect)
    except Exception as e:
        raise UnsafeQuery(f"could not parse SQL: {e}") from e
