EvalKit

EvalKit is a QA-grade RAG evaluation platform. Its 6-layer cascading pipeline diagnoses why your RAG system fails, not just that it fails.

Architecture

EvalKit runs every query through a deterministic evaluation cascade:

Query + Context + Response
        |
   Layer A  ── Retrieval metrics (Precision@K, Recall@K, MRR, NDCG)
        |
   Layer C  ── Claim decomposition & verification
        |
   Layer B  ── Generation quality (faithfulness, relevance, completeness)
        |
  Layer D.0 ── Answerability classification
        |
   Layer D  ── Root cause attribution (17 diagnostic codes)
        |
   Anomaly  ── Regression detection, fix suggestions, cost tracking
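
The cascade order above can be sketched as an ordered list of layer functions, each receiving the results of the layers that ran before it. This is a minimal illustration, not EvalKit's implementation: `EvalInput`, `run_cascade`, and the stub layer functions are hypothetical names, the stub logic is placeholder only, and the final anomaly step is omitted.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalInput:
    query: str
    context: list[str]
    response: str


# A layer takes the input plus the results of earlier layers
# and returns its own result dict.
Layer = Callable[[EvalInput, dict], dict]


def layer_a_retrieval(inp: EvalInput, prior: dict) -> dict:
    # Placeholder: a real Layer A would compute Precision@K, Recall@K, MRR, NDCG.
    return {"chunks_retrieved": len(inp.context)}


def layer_c_claims(inp: EvalInput, prior: dict) -> dict:
    # Placeholder decomposition: treat each sentence as one claim.
    claims = [s.strip() for s in inp.response.split(".") if s.strip()]
    return {"claims": claims}


def layer_b_generation(inp: EvalInput, prior: dict) -> dict:
    # Placeholder: generation checks can build on the claims from Layer C.
    return {"claims_checked": len(prior["C"]["claims"])}


def layer_d0_answerability(inp: EvalInput, prior: dict) -> dict:
    # Placeholder rule (an assumption): no context means not answerable.
    return {"answerable": prior["A"]["chunks_retrieved"] > 0}


def layer_d_root_cause(inp: EvalInput, prior: dict) -> dict:
    # Placeholder attribution using one code from the root cause table.
    codes = [] if prior["D.0"]["answerable"] else ["NO_CONTEXT_PROVIDED"]
    return {"codes": codes}


# Fixed order, matching the diagram: A, C, B, D.0, D.
CASCADE: list[tuple[str, Layer]] = [
    ("A", layer_a_retrieval),
    ("C", layer_c_claims),
    ("B", layer_b_generation),
    ("D.0", layer_d0_answerability),
    ("D", layer_d_root_cause),
]


def run_cascade(inp: EvalInput) -> dict:
    results: dict = {}
    for name, layer in CASCADE:
        results[name] = layer(inp, results)
    return results
```

Because the order is fixed and each layer only reads earlier results, the same input always walks the same path, which is what makes the cascade deterministic.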

17 Root Cause Codes

Severity	Codes
Blocker	`INPUT_INVALID`, `NO_CONTEXT_PROVIDED`, `NO_RESPONSE_GENERATED`
Critical	`SHOULD_HAVE_REFUSED`, `HALLUCINATION`, `RETRIEVAL_MISS`, `NO_RELEVANT_DOCS_RETRIEVED`
Major	`GENERATION_UNFAITHFUL`, `EVIDENCE_NOT_USED`, `OFF_TOPIC_RESPONSE`, `EXCESSIVE_NOISE`, `FALSE_REFUSAL`, `CHUNK_BOUNDARY_BROKEN`, `EMBEDDING_DOMAIN_MISMATCH`, `EMBEDDING_DRIFT`
Minor	`CHUNK_INCOHERENT`, `CHUNK_TOO_SPARSE`, `CHUNK_TOO_DENSE`
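
As a rough sketch of how the severity mapping can be used, the snippet below ranks a handful of codes from the table and picks the worst severity among those detected. The `Severity` ordering (Blocker highest, Minor lowest), the `CODE_SEVERITY` dict, and `worst_severity` are assumptions for illustration; the actual definitions live in `models/enums.py` and may differ.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    # Assumed ordering: a higher value means a more severe finding.
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3
    BLOCKER = 4


# A subset of the table above, not the full list of codes.
CODE_SEVERITY = {
    "INPUT_INVALID": Severity.BLOCKER,
    "HALLUCINATION": Severity.CRITICAL,
    "RETRIEVAL_MISS": Severity.CRITICAL,
    "EVIDENCE_NOT_USED": Severity.MAJOR,
    "CHUNK_TOO_SPARSE": Severity.MINOR,
}


def worst_severity(codes: list[str]) -> Optional[Severity]:
    """Return the most severe level among the detected codes, or None if there are none."""
    if not codes:
        return None
    return max(CODE_SEVERITY[code] for code in codes)
```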

Judge Abstraction

Pluggable AI judges with consensus support:

OpenAI — GPT-4o, GPT-4-turbo
Anthropic — Claude Sonnet, Claude Haiku
Consensus — Multi-judge agreement with escalation
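
One way to picture the judge abstraction is an abstract base class plus a consensus function that escalates when the judges disagree. The sketch below is an assumption about the general shape, not the code in `judges/`: `Judge`, `FixedJudge`, `consensus`, the verdict labels, and the `min_agreement` threshold are all hypothetical, and the stub judge makes no API calls.

```python
from abc import ABC, abstractmethod
from collections import Counter


class Judge(ABC):
    @abstractmethod
    def verdict(self, claim: str, context: str) -> str:
        """Return a verdict label for the claim given the context."""


class FixedJudge(Judge):
    """Stub standing in for an OpenAI- or Anthropic-backed judge."""

    def __init__(self, answer: str) -> None:
        self.answer = answer

    def verdict(self, claim: str, context: str) -> str:
        return self.answer


def consensus(judges: list[Judge], claim: str, context: str,
              min_agreement: float = 0.6) -> dict:
    """Accept the majority verdict if enough judges agree, otherwise flag for escalation."""
    votes = Counter(judge.verdict(claim, context) for judge in judges)
    label, count = votes.most_common(1)[0]
    if count / len(judges) >= min_agreement:
        return {"verdict": label, "escalate": False}
    return {"verdict": None, "escalate": True}
```

With three stub judges voting `supported`, `supported`, `unsupported`, the majority label is accepted; with two judges split one to one, the result is flagged for escalation.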

Tech Stack

Component	Technology
Backend	FastAPI + Python 3.11
Database	aiosqlite (async SQLite)
Auth	Clerk (JWT)
Frontend	Next.js 14 + TypeScript + Tailwind CSS
UI Components	shadcn/ui (Radix) + Recharts
Payments	PayPal, Razorpay, Payoneer
Deployment	Docker Compose

Project Structure

evalkit/                    # Python backend
  main.py                   # FastAPI app (API endpoints)
  config.py                 # Pydantic settings
  auth.py                   # Clerk authentication
  judges/                   # AI judge implementations
    base.py                 #   Abstract base
    openai_judge.py         #   GPT-4o / GPT-4-turbo
    anthropic_judge.py      #   Claude models
    consensus.py            #   Multi-judge consensus
    escalation.py           #   Disagreement escalation
    hallucination_tier.py   #   Hallucination severity tiers
  layers/                   # Evaluation pipeline
    retrieval.py            #   Layer A: Precision, Recall, MRR, NDCG
    claims.py               #   Layer C: Claim decomposition & verification
    generation.py           #   Layer B: Faithfulness, relevance, completeness
    answerability.py        #   Layer D.0: Answerable / partial / unanswerable
    root_cause.py           #   Layer D: 17 root cause codes
    anomaly.py              #   Anomaly detection
    fix_suggestions.py      #   Actionable fix recommendations
    cost_tracker.py         #   LLM cost estimation
    chunk_quality.py        #   Chunk boundary & coherence
    embedding_fitness.py    #   Embedding domain fitness
  models/                   # Pydantic schemas
    enums.py                #   Verdict, Severity, RootCauseCode
    request.py              #   API request models
    response.py             #   API response models
  regression/               # Regression detection
    detector.py             #   Quality regression analyzer
  reporting/                # Report generation
    json_report.py          #   JSON export
    markdown_report.py      #   Markdown export
  storage/                  # Persistence
    sqlite.py               #   Async SQLite storage
  api_keys.py               # API key management
  payment.py                # Payment processing
  subscriptions.py          # Quota enforcement
  rate_limit.py             # Rate limiting
  legal.py                  # Terms, privacy, refund
  cli.py                    # CLI commands

evalkit-ui/                 # Next.js 14 frontend
  app/                      # App Router pages
    dashboard/              #   Evaluation dashboard
    pricing/                #   Pricing page
    sign-in/, sign-up/      #   Clerk auth
    privacy/, terms/, refund/ # Legal pages
  components/
    eval/                   #   CascadeViz, ClaimsTable, RootCauseCard, etc.
    dashboard/              #   EvaluationTable, HealthIndicator, Sidebar
    landing/                #   Hero, FeatureGrid, HowItWorks
    ui/                     #   shadcn/ui components

docs/                       # Documentation
  evalkit_techniques.xlsx   #   38 evaluation techniques with formulas & thresholds

API Endpoints

Method	Endpoint	Description
`POST`	`/v1/evaluate`	Run full 6-layer evaluation
`POST`	`/v1/evaluate/chunks`	Evaluate chunk quality
`POST`	`/v1/evaluate/embedding-fitness`	Evaluate embedding fitness
`POST`	`/v1/compare`	Compare two evaluation runs
`GET`	`/v1/runs/{run_id}`	Get evaluation results
`GET`	`/v1/runs`	List evaluation runs
`GET`	`/v1/health`	Health check
`POST`	`/v1/projects`	Create project
`GET`	`/v1/projects`	List projects
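
A call to `POST /v1/evaluate` might look like the sketch below, which uses only the Python standard library. The payload field names (`query`, `context`, `response`) are guesses based on the cascade inputs, and authentication headers are left out because they are not documented here; check `models/request.py` for the real schema. The base URL assumes the Docker setup from Quick Start.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8001"  # Docker backend address from Quick Start


def build_evaluate_request(query: str, context: list[str],
                           response: str) -> urllib.request.Request:
    # Field names are assumptions, not the documented request schema.
    payload = {"query": query, "context": context, "response": response}
    return urllib.request.Request(
        f"{BASE_URL}/v1/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    # Requires a running EvalKit backend; raises urllib.error.URLError otherwise.
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.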

Quick Start

1. Clone & configure

git clone https://github.com/srivsr/evalkit.git
cd evalkit
cp .env.example .env
# Fill in at least one judge API key (OpenAI or Anthropic)

2. Run with Docker

docker compose up --build

Backend: http://localhost:8001
Frontend: http://localhost:3004

3. Run locally (dev)

Backend:

pip install -e .
uvicorn evalkit.main:app --port 8000 --reload

Frontend:

cd evalkit-ui
npm install
npm run dev

4. CLI

pip install -e .
evalkit --help

Evaluation Techniques

The docs/evalkit_techniques.xlsx spreadsheet documents all 38 evaluation techniques across the pipeline:

Layer A (Retrieval): Precision@K, Recall@K, MRR, NDCG, Context Relevance
Layer C (Claims): Claim decomposition, claim verification, context coverage
Layer B (Generation): Faithfulness, relevance, completeness, coherence
Layer D.0 (Answerability): Answerable / partial / unanswerable classification
Layer D (Root Cause): 17 diagnostic root cause codes with severity mapping
Anomaly Detection: Regression flags, threshold alerts
Fix Suggestions: Actionable remediation per root cause

Each technique includes: formula/logic, examples, thresholds, RAG relationship, data flow (inputs/outputs), and architect decision trees.
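
For reference, the Layer A retrieval metrics can be computed as in the sketch below. These are the textbook definitions with binary relevance for a single query; the formulas and thresholds in the spreadsheet may differ, and the function names are illustrative. The ranked list is assumed to contain no duplicate document IDs.

```python
import math


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document; MRR is the mean of this over queries."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(retrieved[:k], start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

For example, with `retrieved = ["d3", "d1", "d7", "d2"]`, `relevant = {"d1", "d2"}`, and `k = 3`, only `d1` is a hit, at rank 2: precision is 1/3, recall is 1/2, and the reciprocal rank is 1/2.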

Environment Variables

See .env.example for the full list. Key variables:

Variable	Required	Description
`EVALKIT_OPENAI_API_KEY`	One of these	OpenAI API key for GPT judges
`EVALKIT_ANTHROPIC_API_KEY`	One of these	Anthropic API key for Claude judges
`EVALKIT_CLERK_SECRET_KEY`	For auth	Clerk backend secret
`EVALKIT_CLERK_PUBLISHABLE_KEY`	For auth	Clerk frontend key
`EVALKIT_DB_PATH`	No	SQLite path (default: `./evalkit.db`)
`EVALKIT_DEFAULT_JUDGE`	No	Default judge model (default: `gpt-4o`)

License

MIT License
