NL2SQL is an enterprise-grade, agentic Natural Language to SQL pipeline built on LangGraph.
Unlike simple prompt-to-query solutions, this project employs a Map-Reduce Supervisor Architecture to handle complex, real-world database interactions. It features dynamic routing across multiple database engines (Postgres, MySQL, MSSQL, SQLite), vector-based schema retrieval for scalability, and a rigorous Plan-Validate-Execute loop to ensure query correctness and safety.
Designed for observability and reliability, it provides detailed performance metrics, step-by-step reasoning logs, and deterministic SQL generation, making it suitable for production environments where accuracy is paramount.
The system implements a sophisticated Agentic Architecture designed for enterprise scale.
- LangGraph Orchestration: Uses a stateful graph architecture (`StateGraph`) to manage complex agentic workflows, ensuring predictable control flow and state validation.
- Hybrid Execution Lanes:
- Fast Lane (Direct SQL): Optimizes for low latency (< 2s) on simple queries (e.g., "List 10 machines") by skipping complex planning.
- Slow Lane (Agentic Loop): Engages a robust Plan-Validate-Execute-Refine loop for complex analytical queries requiring reasoning.
- Map-Reduce Supervisor:
- Supervisor (Decomposer): Intelligently breaks down complex user queries into sub-tasks (Map).
- Parallel Execution: Executes sub-tasks dynamically across distributed workers (LangGraph branches).
- Reducer (Aggregator): Synthesizes results from multiple branches into a coherent final answer.
- Vector-Based Routing (RAG):
- Datasource Discovery: Uses vector search to match user queries to the correct database(s) from a registry of available connections.
- Schema Retrieval: Dynamically fetches only relevant table schemas for the LLM context window, enabling support for databases with thousands of tables.
- Intent Classification:
  - Determines query intent (`TABULAR`, `KPI`, `SUMMARY`) to select the optimal execution lane.
  - Enriches user queries with domain-specific synonyms and entity extraction.
- Self-Correcting Agents: The "Slow Lane" implements a Reflection pattern. If generated SQL fails validation or execution, the system captures the error, summarizes it, and feeds it back to the Planner for correction (Retries).
- Structured Output Enforcement: Uses OpenAI's 'Tools' (Function Calling) to enforce strict JSON schemas for critical nodes (`Planner`, `Decomposer`), preventing parsing errors.
- Pre-Execution Validation: A dedicated `ValidatorNode` performs static analysis on generated SQL to catch schema violations (e.g., non-existent columns) before execution.
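To make the pre-execution validation idea concrete, here is a minimal sketch of a static column check. It is illustrative only: the function name, schema shape, and error strings are assumptions, not the project's actual `ValidatorNode` implementation.

```python
# Hypothetical sketch of pre-execution schema validation; the real
# ValidatorNode in this project may differ.

def validate_plan(plan_columns, schema):
    """Return schema violations for the column references in a plan.

    plan_columns: iterable of "table.column" strings from the plan.
    schema: mapping of table name -> set of valid column names.
    """
    errors = []
    for ref in plan_columns:
        table, _, column = ref.partition(".")
        if table not in schema:
            errors.append(f"unknown table: {table}")
        elif column not in schema[table]:
            errors.append(f"unknown column: {table}.{column}")
    return errors

schema = {"machines": {"id", "name", "serial_number"}}
assert validate_plan(["machines.name"], schema) == []
assert validate_plan(["machines.status"], schema) == ["unknown column: machines.status"]
```

Catching a bad column here, before the query ever reaches the database, is what lets the Slow Lane retry cheaply instead of surfacing a runtime SQL error.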
Follow this guide to set up the environment and run your first query.
- Python 3.10+
- Docker & Docker Compose
- OpenAI API Key
```shell
git clone https://github.com/nadeem4/nl2sql.git
cd nl2sql
pip install -r requirements.txt
```

Start the database containers (Postgres, MySQL, MSSQL) and seed them with synthetic manufacturing data:
```shell
# Start containers
docker-compose up -d

# Seed data (wait ~10s for DBs to initialize)
python scripts/seed_databases.py --wait
```

Create a `.env` file in the root directory:
```
OPENAI_API_KEY="sk-..."
```

Run a query against the Postgres database to verify the pipeline:
```shell
python -m src.nl2sql.cli --id manufacturing_ops --query "List 5 machines"
```

Expected Output: A structured log of the AI's reasoning followed by a table of 5 machines.
Define connection strings and settings for each database.
```yaml
manufacturing_sqlite:
  engine: sqlite
  connection_string: sqlite:///data/manufacturing.db
  tables: []  # Empty list = include all

manufacturing_ops:
  engine: postgres
  connection_string: postgresql+psycopg2://user:password@localhost:5432/manufacturing_ops
```

Map specific agents to different LLM providers/models.
```yaml
default:
  provider: openai
  model: gpt-4o

agents:
  planner:
    provider: openai
    model: gpt-4o-mini  # Use a cheaper model for planning
```

Map datasource IDs to lists of sample questions to improve routing accuracy (Layer 1).
```yaml
manufacturing_ops:
  - "List all machines"
  - "Which machines are offline?"

manufacturing_supply:
  - "Show top products by price"
  - "List all products"
```

The CLI (`src.nl2sql.cli`) is the main entry point.
--query "...": The natural language question.--verbose: Display step-by-step AI reasoning.--show-perf: Display detailed performance metrics (latency, tokens).--vector-store <PATH>: Use vector search for schema selection (requires indexing).--id <ID>: Optional. Force a specific datasource, bypassing the router (e.g.,manufacturing_ops).--no-exec: Generate and validate SQL without executing it.--debug: Enable debug logging for verbose output.--include-ids <ID>...: Benchmark only specific test case IDs.
The system simulates a manufacturing enterprise distributed across 4 databases:
| ID | Engine | Content | Example Query |
|---|---|---|---|
| `manufacturing_ops` | Postgres | Machines, Maintenance, Employees | List 5 machines with their serial numbers |
| `manufacturing_supply` | MySQL | Products, Inventory, Suppliers | Show me top 3 products by price |
| `manufacturing_history` | MSSQL | Production Runs, Sales, Defects | Count total production runs |
| `manufacturing_ref` | SQLite | Factories, Shifts, Machine Types | List all factories and their locations |
Run these commands to test each database:
```shell
python -m src.nl2sql.cli --id manufacturing_ops --query "List 5 machines with their serial numbers"
python -m src.nl2sql.cli --id manufacturing_supply --query "Show me top 3 products by price"
python -m src.nl2sql.cli --id manufacturing_history --query "Count total production runs"
python -m src.nl2sql.cli --id manufacturing_ref --query "List all factories and their locations"
```

The system can automatically decompose complex queries into sub-queries, execute them in parallel, and aggregate the results.
```shell
python -m src.nl2sql.cli --query "Compare sales from manufacturing_history and inventory from manufacturing_supply"
```

For large schemas, use vector search to dynamically select relevant tables.
- Index the Schema:

  ```shell
  python -m src.nl2sql.cli --index --vector-store ./chroma_db
  ```

- Query with Context:

  ```shell
  python -m src.nl2sql.cli --query "Show top 5 products" --vector-store ./chroma_db
  ```
- Stream Reasoning: Use `--verbose` to see the Intent, Planner, and Generator steps.
- Debug Mode: Use `--debug` for detailed debug logging.
We use a hybrid Intent-Driven + Map-Reduce Architecture:
- Intent & Routing:
- Intent Node: Classifies user goal and standardizes the query.
- Decomposer (Supervisor): Uses vector search to find the right data source(s). If multiple are needed, it triggers Map-Reduce.
- Execution Lanes:
  - Fast Lane: Direct SQL generation (`DirectSQLNode`) for simple tabular data.
  - Slow Lane: Reasoning loop (`Planner` -> `Validator` -> `Generator`) for complex requests.
- Aggregation:
  - The `AggregatorNode` stitches results together, using an LLM only if necessary (Summary Mode).
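The Slow Lane's retry behavior can be sketched as a bounded loop. This is illustrative only: `plan`, `validate`, `generate`, and `execute` are hypothetical stand-ins for the actual LangGraph nodes, not the project's code.

```python
# Illustrative Plan-Validate-Execute-Refine loop with bounded retries.
# The plan/validate/generate/execute callables are hypothetical stand-ins.

def slow_lane(question, plan, validate, generate, execute, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        p = plan(question, feedback)       # Planner (AI): sees prior errors
        errors = validate(p)               # Validator: static schema checks
        if errors:
            feedback = "; ".join(errors)   # Summarizer: reflect, then retry
            continue
        sql = generate(p)                  # Generator: compile plan to SQL
        ok, result = execute(sql)          # Executor: run against the DB
        if ok:
            return result
        feedback = str(result)             # execution error becomes feedback
    raise RuntimeError(f"failed after {max_retries} attempts: {feedback}")
```

The key design point is that both validation failures and execution errors flow through the same feedback channel back to the Planner, bounded by a retry budget.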
For a deep dive into the Map-Reduce pattern, see docs/ARCHITECTURE_MAP_REDUCE.md.
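In plain Python, the Map-Reduce flow looks roughly like this. It is a sketch: a thread pool stands in for LangGraph's parallel branches, and `run_branch`/`reduce_fn` are hypothetical stand-ins for the worker lane and `AggregatorNode`.

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce(sub_queries, run_branch, reduce_fn):
    """Fan sub-queries out to parallel workers, then aggregate.

    sub_queries: list of (datasource_id, question) pairs from the Decomposer.
    run_branch:  worker executing one sub-query against one datasource.
    reduce_fn:   Aggregator combining the branch results.
    """
    with ThreadPoolExecutor(max_workers=len(sub_queries)) as pool:
        results = list(pool.map(lambda sq: run_branch(*sq), sub_queries))
    return reduce_fn(results)

# Toy usage with fake branch results (invented numbers):
fake = {"manufacturing_history": 120, "manufacturing_supply": 45}
answer = map_reduce(
    [("manufacturing_history", "total sales"), ("manufacturing_supply", "inventory")],
    run_branch=lambda ds, q: (ds, fake[ds]),  # hypothetical worker
    reduce_fn=dict,                           # trivial Aggregator
)
assert answer == fake
```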
```mermaid
flowchart TD
    %% Nodes
    user["User Query"] --> intent["Intent Node"]
    intent --> decomposer["Decomposer (Supervisor)"]

    subgraph Routing Layer
        decomposer -- Lookups --> vector_store[("Vector Store")]
        vector_store -. Context .-> decomposer
    end

    decomposer -- Single DB --> branch["Execution Branch"]
    decomposer -- "Map (Multi-DB)" --> parallel["Parallel Execution Branches"]
    parallel --> branch

    subgraph Worker Lane ["Worker Lane (Per Datasource)"]
        branch --> schema["Schema Node"]
        schema --> lane_logic{Fast or Slow?}

        %% Fast Lane
        lane_logic -- Fast --> directSQL["DirectSQL (AI)"]
        directSQL --> executor["Executor"]

        %% Slow Lane (ReAct Loop)
        lane_logic -- Slow --> planner["Planner (AI)"]
        planner --> validator["Validator"]
        validator -- "Pass" --> generator["Generator"]
        validator -- "Fail" --> summarizer["Summarizer (Reflect)"]
        generator --> executor
        executor -- "Error" --> summarizer
        summarizer -. "Feedback" .-> planner
    end

    executor -- "Success" --> aggregator["Aggregator (Reducer)"]
    aggregator --> answer["Final Answer"]

    %% Styling
    style user fill:#f6f8fa,stroke:#333
    style decomposer fill:#e1f5fe,stroke:#01579b
    style aggregator fill:#e1f5fe,stroke:#01579b
    style answer fill:#f6f8fa,stroke:#333
    style vector_store fill:#fff3e0,stroke:#e65100
    style summarizer fill:#fff9c4,stroke:#fbc02d
```
To support efficient querying across large or multiple databases, we use a two-tiered vectorization approach:
- Datasource Routing:
  - What: Indexes the `description` of each database.
  - Why: Determines which database contains the relevant data (e.g., "Sales" vs. "Inventory").
  - Strategy:
    - Layer 1 (Fast): Vector search against database descriptions and 200+ sample questions.
    - Layer 2 (Robust): Decomposer LLM uses retrieved context to form sub-queries.
- Schema Selection:
  - What: Indexes table metadata (columns, foreign keys, comments).
  - Why: Determines which tables are needed for the query within the selected database.
This allows the system to scale to hundreds of tables without overwhelming the LLM's context window.
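As a toy illustration of the two tiers: `difflib` string similarity stands in for real vector embeddings, and the registry data below is invented, not the project's actual index format.

```python
# Two-tier retrieval sketch. difflib similarity stands in for embeddings;
# the registry below is made-up example data.
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def route_datasource(query, registry):
    """Tier 1: pick the datasource whose description best matches the query."""
    return max(registry, key=lambda ds: similarity(query, registry[ds]["description"]))

def select_tables(query, registry, ds, k=2):
    """Tier 2: pick the k most relevant tables within the chosen datasource."""
    tables = registry[ds]["tables"]
    return sorted(tables, key=lambda t: similarity(query, tables[t]), reverse=True)[:k]

registry = {
    "manufacturing_ops": {
        "description": "machines, maintenance and employees",
        "tables": {"machines": "machine id name serial", "employees": "employee name role"},
    },
    "manufacturing_supply": {
        "description": "products, inventory and suppliers",
        "tables": {"products": "product name price", "suppliers": "supplier name"},
    },
}

ds = route_datasource("show top products by price", registry)
assert ds == "manufacturing_supply"
assert select_tables("top products by price", registry, ds, k=1) == ["products"]
```

Only the selected tables' metadata reaches the LLM prompt, which is what keeps the context window bounded regardless of total schema size.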
- Intent (AI): Entry point. Classifies intent (Table vs Summary) and standardizes query.
- Decomposer (Supervisor): Orchestrates routing and query splitting.
- DirectSQL (AI): Fast Lane. Generates SQL for simple queries without planning.
- Planner (AI): Slow Lane. Generates a structured plan.
- Validator (Code): Slow Lane. Verifies the plan against schema.
- Generator (Code): Compiles plan to SQL.
- Executor (Code): Runs the SQL.
- Aggregator (AI): Synthesizes final results. Skips AI for simple table outputs.
The CLI provides a detailed breakdown of time and token usage, including a top-level matrix and per-datasource details:
Enable this detailed view by adding the `--show-perf` flag.
Click to see Sample Performance Output
Top Level Performance
| Metric | Decomposer | Aggregator | Exec (manufacturing_ops) | Total |
|---|---|---|---|---|
| Latency (s) | 0.0000 | 0.0000 | 7.2700 | 7.2700 |
| Token Usage | 0 | 0 | 3029 | 3029 |
Performance: manufacturing_ops
| Node | Type | Model | Latency (s) | Tokens |
|---|---|---|---|---|
| Intent | AI | gpt-4o-mini | 1.9300 | 569 |
| Planner | AI | gpt-4o-mini | 4.9900 | 2456 |
| Generator | Non-AI | - | 0.0000 | - |
| Executor | Non-AI | - | 0.3500 | - |
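Per-node latency like the above is typically collected with a small timing wrapper; the following is a generic sketch of that idea, not the CLI's actual instrumentation (the `timed` decorator and `run_sql` stub are invented).

```python
import time

def timed(metrics, name):
    """Accumulate wall-clock latency of a node callable into `metrics`."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                metrics[name] = metrics.get(name, 0.0) + time.perf_counter() - start
        return inner
    return wrap

metrics = {}

@timed(metrics, "Executor")
def run_sql(sql):
    # Stand-in for real query execution.
    return [("m-01",), ("m-02",)]

rows = run_sql("SELECT serial_number FROM machines LIMIT 2")
assert rows == [("m-01",), ("m-02",)]
assert metrics["Executor"] >= 0.0
```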
- `src/`: Core modules (`nodes`, `langgraph_pipeline`, `datasource_config`, `llm_registry`).
- `configs/`: YAML configurations for datasources and LLMs.
- `scripts/`: Utilities (e.g., `seed_databases.py`).
- `tests/`: Unit and integration tests.
We employ a rigorous evaluation framework to ensure the pipeline's correctness and stability.
For detailed documentation on metrics, architecture, and usage, see EVALUATION.md.
- Execution Accuracy: Verifies that generated SQL returns the correct data (not just string matching).
- Stability (Pass@K): Measures pipeline reliability and routing consistency over multiple iterations.
- Routing Accuracy: Tracks the performance of the 3-layer routing system.
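One common way to compute a Pass@K-style number is the unbiased estimator from the code-generation literature; this sketch may differ from the exact definition used in EVALUATION.md.

```python
# Unbiased pass@k estimator: given c successes in n independent runs,
# estimate P(at least one success among k sampled runs).
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0  # too few failures to fill k draws without a success
    return 1.0 - comb(n - c, k) / comb(n, k)

assert pass_at_k(n=5, c=5, k=1) == 1.0   # always correct -> fully stable
assert pass_at_k(n=5, c=0, k=1) == 0.0
assert abs(pass_at_k(n=4, c=2, k=1) - 0.5) < 1e-12
```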
Run the full test suite with parallel execution:
```shell
python -m src.nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml
```

To test stability (run each question 5 times):

```shell
python -m src.nl2sql.cli --benchmark --dataset tests/golden_dataset.yaml --iterations 5
```

Run the test suite using pytest:

```shell
python -m pytest tests/
```

To add a new database:

- Add the driver to `requirements.txt`.
- Add a connection profile to `configs/datasources.yaml`.
- (Optional) Add specific DDL/Seeding logic to `scripts/seed_databases.py`.