Author: Markus van Kempen (markus.van.kempen@gmail.com)
Last Updated: January 24, 2026
Version: 2.0
A Streamlit application that converts natural language queries into SQL using IBM WatsonX LLMs with self-improving capabilities powered by embeddings.
- Embedding-based learning: Uses IBM Slate embeddings to find semantically similar past queries
- Few-shot prompting: Automatically includes successful examples in prompts
- Error avoidance: Learns from common mistakes to improve future queries
- User feedback: 👍/👎 buttons help the system learn what works
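The embedding-based lookup behind these features can be sketched as a cosine-similarity search over stored (question, SQL, vector) triples. The toy 3-dimensional vectors and helper names below are illustrative stand-ins for the real Slate embeddings and learning store:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_similar(query_vec, patterns, k=3):
    """Rank stored (question, sql, vector) patterns by similarity to the query."""
    scored = [(cosine(query_vec, vec), q, sql) for q, sql, vec in patterns]
    scored.sort(reverse=True)
    return scored[:k]

# Toy vectors standing in for real Slate embedding vectors
patterns = [
    ("show sales by country",
     "SELECT country, SUM(total_amount) FROM sales GROUP BY country",
     [0.9, 0.1, 0.0]),
    ("list all products",
     "SELECT * FROM products",
     [0.1, 0.9, 0.0]),
]
best = top_similar([0.88, 0.15, 0.0], patterns, k=1)
```

The top-k matches are then injected into the prompt as few-shot examples.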
| Mode | Description | Best For |
|---|---|---|
| Direct SQL | Prompt engineering + SQLGlot validation | Speed-critical queries |
| LangChain Agent | LangChain SQL chain with schema injection | Complex queries |
| BeeAI Agent | IBM BeeAI Framework with native WatsonX | Highest reliability |
| Self-Improving | Full learning system with pattern storage | Continuous improvement |
| SQL Validation | SQLGlot (AST Parsing) | Robustness & Safety |
| Agent Frameworks | LangChain, BeeAI Framework | Flexibility & Extensibility |
Not all questions need SQL! The system uses a 4-layer hybrid approach:
| Layer | Status | Description |
|---|---|---|
| 1. Semantic | ✅ | IBM Slate embeddings (79 examples) - understands "per" = "by" |
| 2. Keywords | ✅ | Database/general/help keyword matching |
| 3. Schema-Aware | ✅ | Validates against actual DB content |
| 4. Empty Analysis | ✅ | Explains why queries return no results |
Query Types Handled:
- Database queries → Generate SQL (e.g., "show all products")
- General questions → Direct response (e.g., "what time is it")
- Help questions → System guidance (e.g., "what can you do")
- Unknown entities → Helpful feedback (e.g., "show BMW sales" → "BMW not in database")
The semantic classifier understands that "show total revenue per country" = "revenue by country" = "sales grouped by country" without keyword rules!
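A minimal sketch of the layered fallback, assuming a precomputed semantic similarity score and hypothetical keyword lists (the real classifier uses Slate embeddings for layer 1 and richer rules below it):

```python
def classify(query, semantic_score,
             db_keywords=("show", "list", "total", "sales", "products", "revenue")):
    """Hybrid intent detection: semantic layer first, keyword fallback second.
    Thresholds and keyword lists here are illustrative assumptions."""
    q = query.lower()
    if semantic_score >= 0.8:          # layer 1: semantic match to known DB queries
        return "database"
    if any(kw in q for kw in db_keywords):   # layer 2: keyword matching
        return "database"
    if any(kw in q for kw in ("help", "what can you do")):
        return "help"
    return "general"
```

With this shape, "show revenue per country" is caught by the keyword layer even when the semantic score is low, while "what time is it" falls through to a direct response.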
See COMPARISON_RESULTS.md for full implementation details and what's missing.
When a search matches multiple products (e.g., "Wireless Mouse" matches both "Wireless Mouse" and "Wireless Mouse Pro"), the system shows a helpful disambiguation message.
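A minimal sketch of the matching rule, assuming a plain substring match over product names (the function name and message format are hypothetical):

```python
def match_products(term, catalog):
    """Return (product, message). More than one hit triggers disambiguation."""
    hits = [p for p in catalog if term.lower() in p.lower()]
    if len(hits) > 1:
        return None, f"Did you mean: {', '.join(hits)}?"
    if hits:
        return hits[0], None
    return None, f"'{term}' not found"

catalog = ["Wireless Mouse", "Wireless Mouse Pro", "USB Keyboard"]
product, message = match_products("Wireless Mouse", catalog)
```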
- Memory + disk caching of embedding vectors
- First query: API call (~450ms) → cached for reuse (<1ms)
- Reduces WatsonX API calls by 80%+ for repeated patterns
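The two-tier cache might look like this (a sketch; the real cache lives under `data/cache/`, and the hashing scheme and file layout here are assumptions):

```python
import hashlib, json, os, tempfile

class EmbeddingCache:
    """Memory + disk cache for embedding vectors."""
    def __init__(self, cache_dir):
        self.mem = {}
        self.dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, text):
        key = hashlib.sha256(text.encode()).hexdigest()
        return os.path.join(self.dir, key + ".json")

    def get(self, text, compute):
        if text in self.mem:                 # hot path: in-process dict
            return self.mem[text]
        path = self._path(text)
        if os.path.exists(path):             # warm path: disk hit
            with open(path) as f:
                vec = json.load(f)
        else:                                # cold path: API call via `compute`
            vec = compute(text)
            with open(path, "w") as f:
                json.dump(vec, f)
        self.mem[text] = vec
        return vec

# Usage: the compute callback is only invoked on a cold miss
calls = []
cache = EmbeddingCache(tempfile.mkdtemp())
vec1 = cache.get("show sales by country", lambda t: (calls.append(t) or [0.1, 0.2]))
vec2 = cache.get("show sales by country", lambda t: (calls.append(t) or [0.1, 0.2]))
```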
Understands follow-up queries in conversation:
- "show more" → expands previous results
- "same for USA" → applies new filter to previous query
- "top 5" → limits previous results
- "those products" → resolves references to previous data
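Two of the follow-up rules above, sketched with deliberately naive string handling (hypothetical; the real context manager is more robust and embedding-aware):

```python
def resolve_followup(query, previous_sql):
    """Rewrite a follow-up query against the previous SQL, or return None."""
    q = query.lower().strip()
    if q.startswith("top "):
        n = int(q.split()[1])
        return previous_sql.rstrip(";") + f" LIMIT {n}"
    if q.startswith("same for "):
        # Naive: assumes the previous query has no WHERE clause
        place = query[len("same for "):].strip()
        return previous_sql.rstrip(";") + f" WHERE country = '{place}'"
    return None
```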
- SQLGlot-powered: AST-based parsing ensures only valid SQL is executed
- Safety checks: Prevents `SELECT *` without tables, implicit cross-joins (`ON TRUE`), and missing join conditions
- Fallback mechanism: Gracefully degrades if strict validation fails
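The safety checks can be illustrated with a simplified regex stand-in (the real system uses SQLGlot's AST parsing, not regexes; this only sketches the rules):

```python
import re

def basic_sql_safety(sql):
    """Simplified stand-in for the SQLGlot AST checks.
    Rejects SELECT * with no FROM clause and joins with no/trivial conditions."""
    low = " ".join(sql.split()).rstrip(";").lower()
    if low.startswith("select *") and " from " not in low:
        return False, "SELECT * without a table"
    if " join " in low:
        # Every JOIN must carry an ON/USING clause that is not a bare ON TRUE
        for clause in re.split(r"\bjoin\b", low)[1:]:
            if " on " not in clause and " using" not in clause:
                return False, "JOIN without a join condition"
            if re.search(r"\bon\s+true\b", clause):
                return False, "implicit cross-join (ON TRUE)"
    return True, "ok"
```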
- Semantic Context: Columns have descriptions (e.g., "status: 1=New")
- Dynamic Sampling: Injects real valid values into the prompt (e.g., `country: 'USA', 'Germany'`)
- Reduced Hallucinations: LLM knows exactly which values exist in the DB
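Dynamic sampling can be sketched as a `SELECT DISTINCT` query whose results are formatted into a prompt hint; the table and column names follow the sample schema in this README, and the helper name is illustrative:

```python
import os, sqlite3, tempfile

def sample_column_values(db_path, table, column, limit=5):
    """Pull real distinct values from the DB and format them as a prompt hint."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            f"SELECT DISTINCT {column} FROM {table} LIMIT {int(limit)}"
        ).fetchall()
    finally:
        conn.close()
    return f"{column}: " + ", ".join(repr(r[0]) for r in rows)

# Usage against a throwaway database mimicking the customers table
db = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(db)
conn.execute("CREATE TABLE customers (country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("USA",), ("Germany",), ("USA",)])
conn.commit()
conn.close()
hint = sample_column_values(db, "customers", "country")
```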
```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Create a `.env` file:

```
WATSONX_API_KEY=your_api_key
WATSONX_URL=https://us-south.ml.cloud.ibm.com
WATSONX_PROJECT_ID=your_project_id
DATABASE_PATH=./data/database.db
```

Initialize the database and start the app:

```bash
python scripts/init_database.py
streamlit run app.py
```

Project structure:

```
├── app.py                       # Main Streamlit UI (entry point)
├── watsonx_client.py            # Direct SQL mode with learning
├── langchain_agent.py           # LangChain mode with learning
├── beeai_agent.py               # BeeAI mode with learning
├── self_improving_agent.py      # Full self-improving agent
├── learning_store.py            # Embedding-based pattern storage
├── query_classifier.py          # Hybrid query classification
├── semantic_classifier.py       # Embedding-based intent detection + caching
├── context_manager.py           # Conversation context awareness
├── product_mapper.py            # Product disambiguation
├── schema_loader.py             # Database schema utilities
├── logging_config.py            # Centralized logging
│
├── scripts/                     # Utility scripts
│   └── init_database.py         # Database initialization
│
├── tests/                       # Test suite
│   ├── test_comparison.py       # SQLGlot comparison tests
│   ├── test_self_improving.py   # 100-query test suite
│   ├── test_model_comparison.py # LLM model benchmark
│   ├── test_mode_comparison.py  # Mode comparison tests
│   └── test_model_mode_comparison.py  # Combined comparison
│
├── docs/                        # Documentation
│   ├── COMPARISON_RESULTS.md    # Detailed mode & model comparison
│   └── test_results_summary.md  # Test findings
│
├── data/                        # Data storage
│   ├── database.db              # SQLite database
│   ├── learning.db              # Learning store (patterns, errors)
│   ├── cache/                   # Embedding cache
│   └── results/                 # Test comparison results (JSON)
│
└── .cursor/rules/               # Cursor AI rules
```
- customers: customer_id, customer_name, email, city, country, region
- products: product_id, product_code, product_name, category, price
- orders: order_id, customer_id, product_id, order_date, quantity, total_amount
- sales: Pre-joined view combining all tables for simplified queries
```
User Query
    │
    ▼
Query Classifier ───────────────────┐
    │ (database)                    │ (general/help)
    │                               ▼
    │                        Direct Response
    ▼
Embedding Search
    │
    ▼
Find Similar Patterns (top 3)
    │
    ▼
Find Common Errors to Avoid
    │
    ▼
Build Enhanced Prompt
    │  (includes few-shot examples + error avoidance)
    ▼
Generate SQL → Execute → Display Results
    │                          │
    ▼                          ▼
Store Success Pattern    Store Error Pattern
    │
    ▼
User Feedback (👍/👎) → Updates Pattern Rating
```
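The feedback step at the bottom of the flow might update a stored pattern like this (hypothetical pattern schema with `ups`/`downs` counters; the real store keeps patterns in `learning.db`):

```python
def update_rating(pattern, thumbs_up):
    """Increment a feedback counter and recompute the pattern's success score."""
    key = "ups" if thumbs_up else "downs"
    pattern[key] = pattern.get(key, 0) + 1
    total = pattern.get("ups", 0) + pattern.get("downs", 0)
    pattern["score"] = pattern.get("ups", 0) / total
    return pattern
```

Higher-scoring patterns would then be preferred when selecting few-shot examples.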
Results using Direct SQL Mode with Self-Improving Learning enabled.
| Metric | Value |
|---|---|
| Success Rate | 81% |
| Patterns Learned | 81 |
| Errors Logged | 20 |
| Embedding Utilization | 99% |
| Category | Success Rate |
|---|---|
| Product Filtering | 100% |
| Basic Sales | 100% |
| Basic Products | 95% |
| Aggregation | 90% |
| Complex Queries | 80% |
| Rank | Combination | Accuracy | Speed | Notes |
|---|---|---|---|---|
| 1 | Granite 4 Small + LangChain | 100% | 604ms | Best overall |
| 2 | Granite 4 Small + BeeAI | 100% | ~650ms | After prompt fix |
| 3 | Mistral Small + BeeAI | 100% | 512ms | Fastest 100% combo |
| 4 | Llama 70B + Any Mode | 100% | 1200ms+ | Most accurate model |
| Mode | Accuracy | Notes |
|---|---|---|
| LangChain | 100% | Retry logic compensates for errors |
| BeeAI | 100% | After prompt optimization |
| Direct SQL | 86% | Fast but struggles with complex |
| Model | Direct SQL | LangChain | BeeAI |
|---|---|---|---|
| Llama 3.3 70B | 100% | 100% | 100% |
| Mistral Small 24B | 67% | 100% | 100% |
| Granite 4 Small | 83% | 100% | 100% |
| Granite 3.3 8B | 50% | 100% | 83% |
LangChain achieves 100% accuracy with ALL models because it handles retries and schema injection automatically.
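The retry behavior can be sketched as a validate-and-regenerate loop, where `generate` and `validate` are stand-ins for the LLM call and SQL validator (a sketch, not the LangChain internals):

```python
def generate_with_retry(generate, validate, max_attempts=3):
    """Ask for SQL, validate it, and feed the error back into the next attempt."""
    last_error = None
    for _ in range(max_attempts):
        sql = generate(last_error)   # prior error is injected into the prompt
        ok, err = validate(sql)
        if ok:
            return sql
        last_error = err
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {last_error}")
```

Feeding the validator's error message back into the prompt is what lets weaker models recover from their first bad attempt.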
| Model | Simple | Filter | Aggregation | Complex |
|---|---|---|---|---|
| Granite 4 Small | 100% | 100% | 100% | 75% |
| Mistral Small 24B | 100% | 100% | 75% | 75% |
| Llama 3.3 70B | 100% | 100% | 75% | 50% |
- `ibm/granite-4-h-small` - Recommended - Best accuracy
- `ibm/granite-3-3-8b-instruct` - Fast but struggles with complex
- `meta-llama/llama-3-3-70b-instruct` - Powerful but slower
- `meta-llama/llama-3-405b-instruct` - Largest available
- `mistralai/mistral-small-3-1-24b-instruct-2503` - Fastest
- `mistralai/mistral-medium-2505` - Balanced
- IBM WatsonX: LLM inference (Granite, Llama, Mistral models)
- IBM Slate Embeddings: `ibm/slate-125m-english-rtrvr-v2` for semantic search
Product Queries:
- "show all laptop products"
- "list products under $100"
- "what wireless products do we have"
Sales Queries:
- "show sales by country"
- "total revenue from laptops in USA"
- "top 5 products by sales"
Customer Queries:
- "how many customers are in Europe"
- "list customers who ordered smartphones"
General Questions (no SQL):
- "what time is it"
- "hello"
- "what can you do"
```bash
# Run all tests with pytest
pytest

# Run specific test file
pytest tests/test_model_comparison.py -v
```

```bash
# Quick test (5 queries, 3 models)
python tests/test_model_comparison.py --quick

# Full benchmark (12 queries, 4 models)
python tests/test_model_comparison.py

# Specific models
python tests/test_model_comparison.py --models ibm/granite-4-h-small meta-llama/llama-3-3-70b-instruct
```

```bash
# Run 100-query test
python tests/test_self_improving.py --mode direct_sql --queries 100

# Quick test (10 queries)
python tests/test_self_improving.py --quick
```

| Document | Description |
|---|---|
| ARCHITECTURE.md | System architecture, data flow, components |
| API.md | API reference for all modules |
| USAGE.md | Usage guide with examples |
| COMPARISON_RESULTS.md | Detailed mode & model comparison |
| test_results_summary.md | Test findings |
| CONTRIBUTING.md | Contribution guidelines |
| SHOWCASE.md | Self-Improving Agent Showcase |
MIT License
