Semantic Cache Chatbot

A CLI-based chatbot that uses a semantic cache layer to reduce unnecessary LLM calls by reusing responses to semantically similar queries. Built with Python, Gemini, and sentence-transformers.


Setup

Requirements

  • Python 3.9+
  • Internet connection (for Gemini API)
  • A valid Gemini API key

Optional:

  • query_pairs.json is required only when running the evaluation script (evaluate_threshold.py)

1. Clone the repository

git clone <your-repo-url>
cd semantic-cache-chatbot

2. Create and activate a virtual environment

python3 -m venv venv
source venv/bin/activate

3. Install dependencies

pip install -r requirements.txt

4. Add your Gemini API key

Create a .env file in the project root:

GEMINI_API_KEY=your_key_here

Get a free key at: https://aistudio.google.com
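
The key is loaded with python-dotenv at startup. A minimal sanity check that the key is visible to Python (hypothetical snippet, not part of the repo):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.getenv("GEMINI_API_KEY"), "GEMINI_API_KEY is not set"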

5. Run the chatbot

python3 chatbot.py

Commands:

  • Type any message to chat
  • Type stats to view cache metrics
  • Type exit to quit
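
An illustrative session (the prompts and output wording here are invented; the actual interface may differ):

> what is a semantic cache?
[cache miss] calling Gemini...
> explain semantic caching
[cache hit] returning stored response
> stats
hit rate: 50% | estimated tokens saved: ~120
> exit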

How the Semantic Cache Works

Instead of matching queries by exact string comparison, the cache matches by meaning using sentence embeddings (a lookup sketch follows the two lists below).

Flow:

  1. User submits a query
  2. Query is normalized (lowercased, trimmed)
  3. Query is embedded using all-MiniLM-L6-v2
  4. Cosine similarity is computed against all cached queries
  5. If similarity ≥ threshold, return cached response
  6. Otherwise, call Gemini and store the result
  7. The cache is persisted to cache.json

Each cache entry stores:

  • original query
  • response
  • embedding vector
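
Putting the flow together, a minimal sketch of the lookup path (function and variable names are illustrative; the real implementation lives in cache.py):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def lookup(query, cache, threshold=0.74):
    """Return a cached response for a semantically similar query, or None."""
    q = query.strip().lower()                  # 2. normalize
    emb = model.encode(q)                      # 3. embed the query
    best, best_sim = None, -1.0
    for entry in cache:                        # 4. brute-force cosine scan
        cached = np.asarray(entry["embedding"])
        sim = float(emb @ cached / (np.linalg.norm(emb) * np.linalg.norm(cached)))
        if sim > best_sim:
            best, best_sim = entry, sim
    if best is not None and best_sim >= threshold:
        return best["response"]                # 5. cache hit
    return None                                # 6. miss: caller queries Gemini and stores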

Similarity Approach

Model: all-MiniLM-L6-v2

  • Runs locally (no additional API calls)
  • Fast and lightweight
  • Widely used for semantic similarity tasks

Similarity metric: cosine similarity

  • Measures directional similarity between embeddings
  • Standard approach for sentence embeddings
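
For two embedding vectors a and b:

cosine_similarity(a, b) = (a · b) / (‖a‖ × ‖b‖)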

Why not FAISS? This cache is expected to store hundreds of entries at most. A brute-force cosine search using NumPy completes in milliseconds. Using FAISS or ANN indexing would add complexity without meaningful performance benefit at this scale.
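
To illustrate, the entire scan collapses to one matrix-vector product when the cached embeddings are stored L2-normalized (illustrative NumPy, not the repo's exact code):

import numpy as np

def best_match(query_emb, cached_embs):
    """cached_embs: (n, d) matrix of L2-normalized cached embeddings."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = cached_embs @ q          # n cosine similarities in one shot
    i = int(np.argmax(sims))
    return i, float(sims[i])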


Threshold Selection

The similarity threshold was selected empirically using a labeled evaluation dataset of query pairs.

Each pair was labeled as:

  • should_hit → cached response should be reused
  • should_not_hit → reuse would be incorrect

Cosine similarity scores were computed using the same embedding model as the runtime system (all-MiniLM-L6-v2).

Thresholds from 0.35 to 0.95 were evaluated using the following cost function:

cost = 10 × false_positives + 1 × false_negatives

False positives (incorrect cache reuse) are penalized more heavily than false negatives (missed reuse), since returning a wrong answer is worse than making an extra LLM call.
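
The sweep in evaluate_threshold.py amounts to something like the following (illustrative sketch; function and variable names are assumptions):

import numpy as np

def pick_threshold(sims, should_hit, candidates=np.arange(0.35, 0.96, 0.01)):
    """sims: cosine similarity per labeled pair; should_hit: True where reuse is correct."""
    sims, should_hit = np.asarray(sims), np.asarray(should_hit)
    best_t, best_cost = None, float("inf")
    for t in candidates:
        predicted = sims >= t
        fp = int(np.sum(predicted & ~should_hit))   # incorrect reuse
        fn = int(np.sum(~predicted & should_hit))   # missed reuse
        cost = 10 * fp + 1 * fn
        if cost < best_cost:
            best_t, best_cost = float(t), cost
    return best_t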

Final threshold: 0.74

This achieved:

  • Precision: 1.0
  • Recall: ~0.67
  • False positive rate: 0.0
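
Here precision = TP / (TP + FP) and recall = TP / (TP + FN), where a "positive" is a predicted cache hit. Precision of 1.0 means no should_not_hit pair was ever reused; recall of ~0.67 means roughly a third of valid reuse opportunities were missed.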

This reflects a deliberate design choice to prioritize correctness over maximizing cache hits.


Key Trade-offs

Precision vs Recall: The system prioritizes precision to avoid incorrect cache reuse. This reduces the hit rate but ensures correctness.

Semantic similarity ≠ answer equivalence: Two queries may be semantically similar but still require different answers, especially for time-sensitive or context-dependent queries.

Context sensitivity: I implemented a lightweight context-hashing approach to prevent incorrect cache reuse across different conversational contexts. In testing, however, it significantly reduced valid cache hits, so the current system relies on semantic similarity alone. Context-aware caching is left as a potential improvement (see the sketch below).
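
For reference, the context-hashing idea can be sketched as follows (hypothetical names; the shipped version omits this check): hash the last few turns and require the hash to match in addition to the similarity test.

import hashlib

def context_key(history, n_turns=2):
    """Hash the last n_turns messages so hits are confined to the same context."""
    recent = " | ".join(history[-n_turns:])
    return hashlib.sha256(recent.encode()).hexdigest()[:12]

# A hit would then require both conditions:
#   similarity >= threshold and entry["context"] == context_key(history)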

Cache staleness: Cached responses persist indefinitely; there is no invalidation or expiration mechanism.
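
A TTL check at lookup time (listed under improvements below) could take this hypothetical shape, assuming each entry gained a created_at timestamp:

import time

TTL_SECONDS = 24 * 3600  # hypothetical 24-hour lifetime

def is_fresh(entry):
    # Skip entries older than the TTL; stale ones would be evicted or refreshed.
    return time.time() - entry["created_at"] < TTL_SECONDS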


Metrics

The chatbot tracks:

  • Cache hit rate
  • Estimated tokens saved

Token savings are estimated using a simple heuristic based on response length and are intended as approximate indicators rather than exact billing metrics.
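
The exact heuristic lives in the code; a common approximation of this kind is about four characters per token for English text (an assumption here, shown for illustration):

def estimate_tokens(text):
    # Rough rule of thumb, not a real tokenizer count.
    return max(1, len(text) // 4)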


What I Would Improve With More Time

  • Larger and more realistic evaluation dataset
  • Context-aware similarity beyond simple hashing
  • Cache invalidation / TTL support
  • Top-k retrieval and re-ranking for borderline matches (see the sketch after this list)
  • Vector database integration if cache size grows
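
One possible shape of the top-k idea, with a margin-based gate that sends borderline matches to a re-ranker such as a cross-encoder (hypothetical sketch):

import numpy as np

def top_k_candidates(query_emb, cached_embs, k=3, threshold=0.74, margin=0.05):
    q = query_emb / np.linalg.norm(query_emb)
    sims = cached_embs @ q
    top = np.argsort(sims)[::-1][:k]      # indices of the k most similar entries
    best = int(top[0])
    if sims[best] >= threshold + margin:
        return "hit", best                # confident: reuse directly
    if sims[best] >= threshold:
        return "rerank", [int(i) for i in top]  # borderline: re-rank before reuse
    return "miss", None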

AI Usage Note

I used Claude for debugging and code review. I also used ChatGPT to generate a 500-pair labeled evaluation dataset used for threshold tuning. I reviewed and refined that dataset, then selected the final threshold based on my own evaluation, cost trade-off reasoning, and error analysis.


Project Structure

semantic-cache-chatbot/
├── chatbot.py
├── cache.py
├── gemini_client.py
├── evaluate_threshold.py
├── requirements.txt
├── .env.example
└── README.md

Dependencies

  • google-genai
  • sentence-transformers
  • numpy
  • scikit-learn
  • rich
  • python-dotenv
