A CLI-based chatbot that uses a semantic cache layer to reduce unnecessary LLM calls by reusing responses to semantically similar queries. Built with Python, Gemini, and sentence-transformers.
- Python 3.9+
- Internet connection (for Gemini API)
- A valid Gemini API key
Optional:
- `query_pairs.json` is only required if running the evaluation script (`evaluate_threshold.py`)
1. Clone the repository

   ```bash
   git clone <your-repo-url>
   cd semantic-cache-chatbot
   ```

2. Create and activate a virtual environment

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Add your Gemini API key

   Create a `.env` file in the project root:

   ```
   GEMINI_API_KEY=your_key_here
   ```

   Get a free key at: https://aistudio.google.com

5. Run the chatbot

   ```bash
   python3 chatbot.py
   ```

Commands:
- Type any message to chat
- Type `stats` to view cache metrics
- Type `exit` to quit
Instead of matching queries by exact string, the cache matches by meaning using sentence embeddings.
Flow:
- User submits a query
- Query is normalized (lowercased, trimmed)
- Query is embedded using `all-MiniLM-L6-v2`
- Cosine similarity is computed against all cached queries
- If similarity ≥ threshold, return cached response
- Otherwise, call Gemini and store the result
- Cache persists to `cache.json`
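The flow above can be sketched as follows. This is a minimal illustration, not the actual `cache.py` implementation: the `embed` callable stands in for the `all-MiniLM-L6-v2` encoder, and persistence to `cache.json` is omitted.

```python
import numpy as np

THRESHOLD = 0.74  # empirically tuned (see threshold selection below)

def normalize(query: str) -> str:
    """Normalization step: lowercase and trim."""
    return query.strip().lower()

class SemanticCache:
    """Minimal sketch of the lookup flow."""

    def __init__(self, embed):
        self.embed = embed          # stand-in for the sentence-transformer encoder
        self.entries = []           # list of (query, response, embedding)

    def lookup(self, query: str):
        vec = self.embed(normalize(query))
        best_sim, best_response = -1.0, None
        # Brute-force cosine similarity against every cached embedding.
        for _, resp, emb in self.entries:
            sim = float(np.dot(vec, emb) / (np.linalg.norm(vec) * np.linalg.norm(emb)))
            if sim > best_sim:
                best_sim, best_response = sim, resp
        if best_response is not None and best_sim >= THRESHOLD:
            return best_response    # cache hit: reuse stored response
        return None                 # cache miss: call Gemini, then store()

    def store(self, query: str, response: str):
        q = normalize(query)
        self.entries.append((q, response, self.embed(q)))
```

On a miss, the caller queries Gemini and calls `store()` so the new response is available for future semantically similar queries.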
Each cache entry stores:
- original query
- response
- embedding vector
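For illustration, a single entry holds the three fields above. The values here are made up, and the real embedding is a 384-dimensional vector (the output size of `all-MiniLM-L6-v2`), truncated here:

```python
# Illustrative shape of one cache entry (values are made up).
entry = {
    "query": "what is semantic caching?",       # original (normalized) query
    "response": "Semantic caching reuses ...",  # stored Gemini response
    "embedding": [0.0213, -0.0871, 0.0044],     # 384 dims in practice; truncated here
}
```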
Model: all-MiniLM-L6-v2
- Runs locally (no additional API calls)
- Fast and lightweight
- Widely used for semantic similarity tasks
Similarity metric: cosine similarity
- Measures directional similarity between embeddings
- Standard approach for sentence embeddings
Why not FAISS? This cache is expected to store hundreds of entries at most. A brute-force cosine search using NumPy completes in milliseconds. Using FAISS or ANN indexing would add complexity without meaningful performance benefit at this scale.
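At this scale, the brute-force search can be a single vectorized NumPy operation. A sketch (array and function names are illustrative, not from the codebase):

```python
import numpy as np

def best_match(query_vec: np.ndarray, cached_vecs: np.ndarray):
    """Return (index, cosine similarity) of the closest cached embedding.

    query_vec has shape (dim,); cached_vecs has shape (n_entries, dim).
    """
    # Normalize all vectors, then one matrix-vector product yields
    # every cosine similarity at once.
    q = query_vec / np.linalg.norm(query_vec)
    c = cached_vecs / np.linalg.norm(cached_vecs, axis=1, keepdims=True)
    sims = c @ q
    i = int(np.argmax(sims))
    return i, float(sims[i])
```

For a few hundred 384-dimensional embeddings this is a matrix-vector product over at most ~100k floats, which completes in well under a millisecond on commodity hardware.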
The similarity threshold was selected empirically using a labeled evaluation dataset of query pairs.
Each pair was labeled as:
- `should_hit` → cached response should be reused
- `should_not_hit` → reuse would be incorrect
Cosine similarity scores were computed using the same embedding model as the runtime system (all-MiniLM-L6-v2).
Thresholds from 0.35 to 0.95 were evaluated using the following cost function:
cost = 10 × false_positives + 1 × false_negatives
False positives (incorrect cache reuse) are penalized more heavily than false negatives (missed reuse), since returning a wrong answer is worse than making an extra LLM call.
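The sweep can be expressed compactly. This is a sketch of the idea rather than the actual `evaluate_threshold.py`; it assumes each labeled pair has been reduced to a (similarity, should_hit) tuple:

```python
def sweep_thresholds(pairs, thresholds):
    """Pick the threshold minimizing cost = 10 * FP + 1 * FN.

    `pairs` is a list of (similarity, should_hit) tuples from the labeled
    evaluation set; a pair counts as a hit when similarity >= threshold.
    """
    best = None
    for t in thresholds:
        fp = sum(1 for sim, should_hit in pairs if sim >= t and not should_hit)
        fn = sum(1 for sim, should_hit in pairs if sim < t and should_hit)
        cost = 10 * fp + 1 * fn
        if best is None or cost < best[1]:
            best = (t, cost)
    return best  # (threshold, cost)
```

The 10:1 weighting encodes the asymmetry described above: one wrong cached answer costs as much as ten missed reuse opportunities.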
Final threshold: 0.74
This achieved:
- Precision: 1.0
- Recall: ~0.67
- False positive rate: 0.0
This reflects a deliberate design choice to prioritize correctness over maximizing cache hits.
**Precision vs recall.** The system prioritizes precision to avoid incorrect cache reuse. This reduces the hit rate but ensures correctness.

**Semantic similarity ≠ answer equivalence.** Two queries may be semantically similar but still require different answers, especially for time-sensitive or context-dependent queries.

**Context sensitivity.** I implemented a lightweight context-hashing approach to prevent incorrect cache reuse across different conversational contexts. In testing, however, it significantly reduced valid cache hits, so the current system relies on semantic similarity alone. Context-aware caching is left as a potential improvement.

**Cache staleness.** Cached responses persist indefinitely; there is no invalidation or expiration mechanism.
The chatbot tracks:
- Cache hit rate
- Estimated tokens saved
Token savings are estimated using a simple heuristic based on response length and are intended as approximate indicators rather than exact billing metrics.
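One common rough heuristic for English text, used here purely as an illustration of the idea (not necessarily the exact formula in the code), is about four characters per token:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This is an approximate reporting metric, not a billing-accurate count.
    """
    return max(1, len(text) // 4)
```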
- Larger and more realistic evaluation dataset
- Context-aware similarity beyond simple hashing
- Cache invalidation / TTL support
- Top-k retrieval and re-ranking for borderline matches
- Vector database integration if cache size grows
I used Claude for debugging and code review. I also used ChatGPT to generate a 500-pair labeled evaluation dataset used for threshold tuning. I reviewed and refined that dataset, then selected the final threshold based on my own evaluation, cost tradeoff, and error analysis.
```
semantic-cache-chatbot/
├── chatbot.py
├── cache.py
├── gemini_client.py
├── evaluate_threshold.py
├── requirements.txt
├── .env.example
└── README.md
```
- google-genai
- sentence-transformers
- numpy
- scikit-learn
- rich
- python-dotenv