A CLI-based chatbot that uses a semantic cache layer to reduce unnecessary LLM calls by reusing responses to semantically similar queries. Built with Python, Gemini, and sentence-transformers.
- Python 3.9+
- Internet connection (for Gemini API)
- A valid Gemini API key
Optional:
- `query_pairs.json` is only required if running the evaluation script (`evaluate_threshold.py`)
1. Clone the repository

   ```bash
   git clone <your-repo-url>
   cd semantic-cache-chatbot
   ```

2. Create and activate a virtual environment

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Add your Gemini API key

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_key_here
   ```

   Get a free key at: https://aistudio.google.com

5. Run the chatbot

   ```bash
   python3 chatbot.py
   ```

Commands:
- Type any message to chat
- Type `stats` to view cache metrics
- Type `exit` to quit
Instead of matching queries by exact string, the cache matches by meaning using sentence embeddings.
Flow:
- User submits a query
- Query is normalized (lowercased, trimmed)
- Query is embedded using `all-MiniLM-L6-v2`
- Cosine similarity is computed against all cached queries
- If similarity ≥ threshold, return cached response
- Otherwise, call Gemini and store the result
- Cache persists to `cache.json`
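The flow above can be sketched as follows. This is a minimal illustration, not the actual `cache.py` implementation: the `embed` callable stands in for the `all-MiniLM-L6-v2` encoder, and persistence to `cache.json` is omitted.

```python
import numpy as np

THRESHOLD = 0.74  # empirically tuned (see threshold selection below)

def normalize(query: str) -> str:
    """Normalization step: lowercase and trim."""
    return query.strip().lower()

class SemanticCache:
    """Minimal sketch of the lookup flow."""

    def __init__(self, embed):
        self.embed = embed          # stand-in for the sentence-transformer encoder
        self.entries = []           # list of (query, response, embedding)

    def lookup(self, query: str):
        vec = self.embed(normalize(query))
        best_sim, best_response = -1.0, None
        # Brute-force cosine similarity against every cached embedding.
        for _, resp, emb in self.entries:
            sim = float(np.dot(vec, emb) / (np.linalg.norm(vec) * np.linalg.norm(emb)))
            if sim > best_sim:
                best_sim, best_response = sim, resp
        if best_response is not None and best_sim >= THRESHOLD:
            return best_response    # cache hit: reuse stored response
        return None                 # cache miss: call Gemini, then store()

    def store(self, query: str, response: str):
        q = normalize(query)
        self.entries.append((q, response, self.embed(q)))
```

On a miss, the caller queries Gemini and calls `store()` so the new response is available for future semantically similar queries.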
Each cache entry stores:
- original query
- response
- embedding vector
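For illustration, a single entry holds the three fields above. The values here are made up, and the real embedding is a 384-dimensional vector (the output size of `all-MiniLM-L6-v2`), truncated here:

```python
# Illustrative shape of one cache entry (values are made up).
entry = {
    "query": "what is semantic caching?",       # original (normalized) query
    "response": "Semantic caching reuses ...",  # stored Gemini response
    "embedding": [0.0213, -0.0871, 0.0044],     # 384 dims in practice; truncated here
}
```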
Model: all-MiniLM-L6-v2
- Runs locally (no additional API calls)
- Fast and lightweight
- Widely used for semantic similarity tasks
Similarity metric: cosine similarity
- Measures directional similarity between embeddings
- Standard approach for sentence embeddings
Why not FAISS? This cache is expected to store hundreds of entries at most. A brute-force cosine search using NumPy completes in milliseconds. Using FAISS or ANN indexing would add complexity without meaningful performance benefit at this scale.
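At this scale, the brute-force search can be a single vectorized NumPy operation. A sketch (array and function names are illustrative, not from the codebase):

```python
import numpy as np

def best_match(query_vec: np.ndarray, cached_vecs: np.ndarray):
    """Return (index, cosine similarity) of the closest cached embedding.

    query_vec has shape (dim,); cached_vecs has shape (n_entries, dim).
    """
    # Normalize all vectors, then one matrix-vector product yields
    # every cosine similarity at once.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    sims = c @ q
    i = int(np.argmax(sims))
    return i, float(sims[i])
```

For a few hundred 384-dimensional embeddings this is a matrix-vector product over at most ~100k floats, which completes in well under a millisecond on commodity hardware.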
The similarity threshold was selected empirically using a labeled evaluation dataset of query pairs.
Each pair was labeled as:
- `should_hit` → cached response should be reused
- `should_not_hit` → reuse would be incorrect
Cosine similarity scores were computed using the same embedding model as the runtime system (all-MiniLM-L6-v2).
Thresholds from 0.35 to 0.95 were evaluated using the following cost function:
cost = 10 × false_positives + 1 × false_negatives
False positives (incorrect cache reuse) are penalized more heavily than false negatives (missed reuse), since returning a wrong answer is worse than making an extra LLM call.
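The sweep can be expressed compactly. This is a sketch of the idea rather than the actual `evaluate_threshold.py`; it assumes each labeled pair has been reduced to a (similarity, should_hit) tuple:

```python
def sweep_thresholds(pairs, thresholds):
    """Pick the threshold minimizing cost = 10 * FP + 1 * FN.

    `pairs` is a list of (similarity, should_hit) tuples from the labeled
    evaluation set; a pair counts as a hit when similarity >= threshold.
    """
    best = None
    for t in thresholds:
        fp = sum(1 for sim, should_hit in pairs if sim >= t and not should_hit)
        fn = sum(1 for sim, should_hit in pairs if sim < t and should_hit)
        cost = 10 * fp + 1 * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best  # (threshold, cost)
```

The 10:1 weighting encodes the asymmetry described above: one wrong cached answer costs as much as ten missed reuse opportunities.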
Final threshold: 0.74
This achieved:
- Precision: 1.0
- Recall: ~0.67
- False positive rate: 0.0
This reflects a deliberate design choice to prioritize correctness over maximizing cache hits.
**Precision vs recall.** The system prioritizes precision to avoid incorrect cache reuse. This reduces the hit rate but ensures correctness.

**Semantic similarity ≠ answer equivalence.** Two queries may be semantically similar but still require different answers, especially for time-sensitive or context-dependent queries.

**Context sensitivity.** I implemented a lightweight context-hashing approach to prevent incorrect cache reuse across different conversational contexts. In testing, however, it significantly reduced valid cache hits, so the current system relies on semantic similarity alone. Context-aware caching is left as a potential improvement.

**Cache staleness.** Cached responses persist indefinitely; there is no invalidation or expiration mechanism.
The chatbot tracks:
- Cache hit rate
- Estimated tokens saved
Token savings are estimated using a simple heuristic based on response length and are intended as approximate indicators rather than exact billing metrics.
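One common rough heuristic for English text, used here purely as an illustration of the idea (not necessarily the exact formula in the code), is about four characters per token:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This is an approximate reporting metric, not a billing-accurate count.
    """
    return max(1, len(text) // 4)
```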
- Larger and more realistic evaluation dataset
- Context-aware similarity beyond simple hashing
- Cache invalidation / TTL support
- Top-k retrieval and re-ranking for borderline matches
- Vector database integration if cache size grows
I used Claude for debugging and code review. I also used ChatGPT to generate a 500-pair labeled evaluation dataset used for threshold tuning. I reviewed and refined that dataset, then selected the final threshold based on my own evaluation, cost tradeoff, and error analysis.
```
semantic-cache-chatbot/
├── chatbot.py
├── cache.py
├── gemini_client.py
├── evaluate_threshold.py
├── requirements.txt
├── .env.example
└── README.md
```
- google-genai
- sentence-transformers
- numpy
- scikit-learn
- rich
- python-dotenv