diff --git a/.planning/MILESTONES.md b/.planning/MILESTONES.md index 23c5086..e3db4ca 100644 --- a/.planning/MILESTONES.md +++ b/.planning/MILESTONES.md @@ -1,5 +1,33 @@ # Project Milestones: DataVisor +## v1.1 Deployment, Workflow & Competitive Parity (Shipped: 2026-02-13) + +**Delivered:** Production-ready Docker deployment, smart dataset ingestion UI, annotation editing, error triage workflows, interactive visualizations with grid filtering, keyboard shortcuts, and per-annotation TP/FP/FN classification. + +**Phases completed:** 8-14 (20 plans total) + +**Key accomplishments:** + +- Production-ready Docker stack (Caddy + FastAPI + Next.js) with single-user auth, GCP deployment scripts, and comprehensive documentation +- Smart dataset ingestion wizard with auto-detection of COCO layouts (Roboflow/Standard/Flat) and multi-split support +- Annotation editing via react-konva canvas (move, resize, draw, delete bounding boxes) with DuckDB persistence +- Error triage workflow: per-sample tagging, per-annotation TP/FP/FN auto-classification via IoU matching, worst-images ranking, and highlight mode +- Interactive data discovery: clickable confusion matrix, near-duplicate detection, histogram filtering, and find-similar — all piping results to the grid +- Full keyboard navigation with 16 shortcuts across grid, modal, triage, and editing contexts + +**Stats:** + +- 171 files created/modified +- ~19,460 lines of code added (9,306 Python + 10,154 TypeScript) +- 7 phases, 20 plans, 97 commits +- 2 days (Feb 12-13, 2026) + +**Git range:** `a83d6cf` → `1bed6cf` + +**What's next:** Format expansion (YOLO/VOC), PR curves, per-class AP metrics + +--- + ## v1.0 MVP (Shipped: 2026-02-12) **Delivered:** A unified CV dataset introspection tool with visual browsing, annotation overlays, model comparison, embedding visualization, error analysis, and AI-powered pattern detection. 
@@ -28,3 +56,32 @@ **What's next:** Interactive model evaluation dashboard (PR curves, confusion matrix, per-class AP metrics) --- + +## v1.2 Classification Dataset Support (Shipped: 2026-02-19) + +**Delivered:** First-class single-label classification dataset support with full feature parity to detection workflows — from JSONL ingestion through evaluation metrics to production-ready polish for high-cardinality datasets. + +**Phases completed:** 15-17 (6 plans total) + +**Key accomplishments:** + +- Classification JSONL parser with auto-detection of dataset type, multi-split ingestion, and sentinel bbox pattern for unified schema +- Grid browsing with class label badges and detail modal with dropdown class editor (PATCH mutation) +- Classification evaluation: accuracy, macro/weighted F1, per-class precision/recall/F1, and clickable confusion matrix +- Error analysis categorizing each image as correct, misclassified, or missing prediction +- Confusion matrix polish with threshold filtering and overflow scroll for 43+ classes, most-confused pairs summary +- Embedding scatter color modes: GT class, predicted class, and correct/incorrect with Tableau 20 categorical palette + +**Stats:** + +- 61 files created/modified +- ~6,052 lines of code added +- 3 phases, 6 plans, 27 commits +- 1 day (Feb 18, 2026) + +**Git range:** `5264e51` → `67a7a9c` + +**What's next:** TBD — next milestone planning + +--- + diff --git a/.planning/PROJECT.md b/.planning/PROJECT.md index 61ab957..e62013d 100644 --- a/.planning/PROJECT.md +++ b/.planning/PROJECT.md @@ -2,86 +2,63 @@ ## What This Is -DataVisor is an open-source dataset introspection tool for computer vision — an alternative to Voxel51. It combines a high-performance visual browser with VLM-powered agentic workflows to automatically discover dataset blind spots (poor lighting, rare occlusions, label errors). Built as a personal tool for exploring 100K+ image datasets with COCO format annotations. 
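The sentinel bbox pattern called out in the v1.2 accomplishments can be illustrated with a toy row builder. This is a minimal sketch of the idea only; the column names (`bbox_x`, `bbox_w`, etc.) and the helper functions are assumptions for illustration, not the actual DuckDB schema or ingestion code:

```python
# Sketch: store classification labels in the same annotation schema used for
# detection, using sentinel bbox values (0.0) instead of NULLs, so downstream
# queries need no null guards. Column names here are hypothetical.
SENTINEL = 0.0

def classification_annotation(sample_id: str, label: str) -> dict:
    """Build an annotation row whose bbox fields are all sentinel zeros."""
    return {
        "sample_id": sample_id,
        "category": label,
        "bbox_x": SENTINEL, "bbox_y": SENTINEL,
        "bbox_w": SENTINEL, "bbox_h": SENTINEL,
    }

def is_classification(row: dict) -> bool:
    # A zero-area box never occurs in real detection data, so it can
    # double as the classification marker without any null checks.
    return row["bbox_w"] == SENTINEL and row["bbox_h"] == SENTINEL
```

The payoff is that every query that filters or joins on bbox columns works unchanged for both dataset types.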
+DataVisor is an open-source dataset introspection tool for computer vision — an alternative to Voxel51. It combines a high-performance visual browser with VLM-powered agentic workflows to automatically discover dataset blind spots (poor lighting, rare occlusions, label errors). Built as a personal tool for exploring 100K+ image datasets with COCO detection or JSONL classification annotations. ## Core Value A single tool that replaces scattered one-off scripts: load any CV dataset, visually browse with annotation overlays, compare ground truth against predictions, cluster via embeddings, and surface mistakes — all in one workflow. +## Current State + +**Shipped:** v1.2 (2026-02-19) +**Codebase:** ~38K LOC (16,256+ Python + 15,924+ TypeScript) across 17 phases +**Architecture:** FastAPI + DuckDB + Qdrant (backend), Next.js + Tailwind + deck.gl + Recharts (frontend), Pydantic AI (agents), Moondream2 (VLM) + ## Requirements ### Validated -- ✓ Multi-format ingestion (COCO) with streaming parser architecture — v1.0 -- ✓ DuckDB-backed metadata storage for fast analytical queries over 100K+ samples — v1.0 -- ✓ Virtualized infinite-scroll grid view with overlaid bounding box annotations — v1.0 -- ✓ Ground Truth vs Model Predictions comparison toggle (solid vs dashed lines) — v1.0 -- ✓ Deterministic class-to-color hashing (same class = same color across sessions) — v1.0 -- ✓ t-SNE embedding generation from images (DINOv2-base) — v1.0 -- ✓ deck.gl-powered 2D embedding scatterplot with zoom, pan, and lasso selection — v1.0 -- ✓ Lasso-to-grid filtering (select cluster points → filter grid to those images) — v1.0 -- ✓ Hover thumbnails on embedding map points — v1.0 -- ✓ Qdrant vector storage for embedding similarity search — v1.0 -- ✓ Error categorization: Hard False Positives, Label Errors, False Negatives — v1.0 -- ✓ Pydantic AI agent that monitors error distribution and recommends actions — v1.0 -- ✓ Pattern detection (e.g., "90% of False Negatives occur in low-light 
images") — v1.0 -- ✓ Import pre-computed predictions (JSON) — v1.0 -- ✓ BasePlugin class for Python extensibility — v1.0 -- ✓ Local disk and GCS image source support — v1.0 -- ✓ Dynamic metadata filtering (sidebar filters on any metadata field) — v1.0 -- ✓ VLM auto-tagging (Moondream2) for scene attribute tags — v1.0 -- ✓ Search by filename and sort by metadata — v1.0 -- ✓ Save and load filter configurations (saved views) — v1.0 -- ✓ Add/remove tags (individual + bulk) — v1.0 -- ✓ Sample detail modal with full-resolution image — v1.0 -- ✓ Dataset statistics dashboard (class distribution, annotation counts) — v1.0 +- Streaming COCO ingestion with ijson at 100K+ scale, local + GCS sources — v1.0 +- DuckDB metadata storage with fast analytical queries — v1.0 +- Virtualized grid with SVG annotation overlays, deterministic color hashing — v1.0 +- GT vs Predictions comparison toggle — v1.0 +- t-SNE embeddings with deck.gl scatter plot, lasso-to-grid filtering — v1.0 +- Error categorization (TP/FP/FN/Label Error) + Qdrant similarity search — v1.0 +- Pydantic AI agent for error patterns + Moondream2 VLM auto-tagging — v1.0 +- Metadata filtering, search, saved views, bulk tagging — v1.0 +- Docker 3-service stack with Caddy auth, GCP deployment scripts — v1.1 +- Smart ingestion UI with auto-detection of COCO layouts and multi-split support — v1.1 +- Annotation editing via react-konva (move, resize, draw, delete) — v1.1 +- Error triage: sample tagging, per-annotation TP/FP/FN via IoU, worst-images ranking, highlight mode — v1.1 +- Interactive discovery: confusion matrix, near-duplicates, histogram filtering, find-similar — v1.1 +- Keyboard shortcuts: 16 shortcuts across grid, modal, triage, editing — v1.1 +- Auto-detect dataset type (detection vs classification) from annotation format — v1.2 +- JSONL classification ingestion with multi-split support — v1.2 +- Grid browsing with class label badges for classification datasets — v1.2 +- Classification prediction import and GT vs 
predicted comparison — v1.2 +- Classification stats: accuracy, F1, per-class precision/recall, confusion matrix — v1.2 +- Embedding color modes (GT class, predicted class, correct/incorrect) — v1.2 +- Confusion matrix scaling to 43+ classes with threshold filtering — v1.2 ### Active -- [ ] Dockerized deployment with single-user auth for secure cloud VM access -- [ ] GCP deployment script + local run script with setup instructions -- [ ] Smart dataset ingestion UI (point at folder → auto-detect train/val/test splits → import) -- [ ] Annotation editing in the UI (move, resize, delete bounding boxes — depth TBD) -- [ ] Error triage workflow (tag FP/TP/FN/mistake, highlight errors, dim non-errors) -- [ ] Smart "worst images" ranking (combined score: errors + confidence + uniqueness) -- [ ] Keyboard shortcuts for navigation -- [ ] Competitive feature parity with FiftyOne/Encord (gaps TBD after research) +(None — planning next milestone) ### Out of Scope -- Multi-user collaboration — personal tool, single-user auth only for VM security -- Video annotation support — image-only for now -- Training pipeline integration — DataVisor inspects data, doesn't train models +- Multi-user collaboration — personal tool, single-user auth only +- Video annotation support — image-only +- Training pipeline integration — DataVisor inspects data, doesn't train - Mobile/tablet interface — desktop browser only -- Real-time streaming inference — batch-oriented analysis -- Full annotation editor (draw new boxes, complex labeling workflows) — quick corrections only, not CVAT replacement - -## Current Milestone: v1.1 Deployment, Workflow & Competitive Parity - -**Goal:** Make DataVisor deployable (Docker + GCP), secure for cloud access, and close key workflow gaps vs FiftyOne/Encord — smart ingestion, error triage, annotation corrections, and keyboard-driven navigation. 
- -**Target features:** -- Dockerized project with single-user auth (basic auth for cloud VM security) -- GCP deployment script + local run script -- Smart dataset ingestion UI (auto-detect folder structure, train/val/test splits) -- Annotation management (organize + quick edit: move/resize/delete bboxes) -- Error triage & data curation workflow (tag, highlight, rank worst images) -- Keyboard shortcuts for navigation -- Competitive gaps from FiftyOne/Encord analysis - -## Context - -Shipped v1.0 with 12,720 LOC (6,950 Python + 5,770 TypeScript) across 7 phases and 21 plans. -Tech stack: FastAPI + DuckDB + Qdrant (backend), Next.js + Tailwind + deck.gl + Recharts (frontend), Pydantic AI (agents), Moondream2 (VLM). -59 backend tests passing. TypeScript compiles with 0 errors. -Architecture: 3 Zustand stores, FastAPI DI, source discriminator for GT/prediction separation, 4 SSE progress streams, lazy model loading. +- Full annotation editor (polygons, segmentation) — bounding box only +- Multi-label classification — single-label per image only for now ## Constraints - **Tech Stack**: FastAPI + DuckDB + Qdrant (backend), Next.js + Tailwind + deck.gl (frontend), Pydantic AI (agents) — established - **Performance**: Must handle 100K+ images without UI lag; DuckDB for metadata queries, deck.gl for WebGL rendering, virtualized scrolling - **Storage**: Supports both local filesystem and GCS bucket sources -- **GPU**: VLM inference (Moondream2) supports MPS/CUDA/CPU auto-detection; DINOv2 embeddings likewise +- **GPU**: VLM inference (Moondream2) supports MPS/CUDA/CPU auto-detection; SigLIP embeddings likewise - **Extensibility**: BasePlugin architecture exists; hooks system ready for expansion - **Python**: 3.14+ (numba/umap-learn incompatible; using scikit-learn t-SNE) @@ -89,16 +66,25 @@ Architecture: 3 Zustand stores, FastAPI DI, source discriminator for GT/predicti | Decision | Rationale | Outcome | |----------|-----------|---------| -| DuckDB over SQLite | Analytical 
queries on metadata at scale; columnar storage for filtering 100K+ rows | ✓ Good | -| Qdrant over FAISS | Payload filtering support; Rust-based performance; local deployment | ✓ Good | -| deck.gl for embedding viz | WebGL-powered; handles millions of points; lasso/interaction built-in | ✓ Good | -| Pydantic AI for agents | Type-safe agent definitions; native FastAPI/Pydantic integration | ✓ Good | -| Deterministic color hashing | Class names hash to consistent colors across sessions; no manual palette | ✓ Good | -| Plugin hooks over monolith | Ingestion/UI/transformation hooks enable domain-specific extensions without forking | ✓ Good | -| Source discriminator column | Clean GT/prediction separation in annotations table via source field | ✓ Good | -| Lazy model loading | VLM and Qdrant loaded on-demand, not at startup, to avoid memory pressure | ✓ Good | -| t-SNE over UMAP | umap-learn blocked by Python 3.14 numba incompatibility; t-SNE via scikit-learn | ⚠️ Revisit when numba supports 3.14 | -| Moondream2 via transformers | trust_remote_code with all_tied_weights_keys patch for transformers 5.x compat | ✓ Good (fragile — monitor updates) | +| DuckDB over SQLite | Analytical queries on metadata at scale; columnar storage for filtering 100K+ rows | Good | +| Qdrant over FAISS | Payload filtering support; Rust-based performance; local deployment | Good | +| deck.gl for embedding viz | WebGL-powered; handles millions of points; lasso/interaction built-in | Good | +| Pydantic AI for agents | Type-safe agent definitions; native FastAPI/Pydantic integration | Good | +| Deterministic color hashing | Class names hash to consistent colors across sessions; no manual palette | Good | +| Source discriminator column | Clean GT/prediction separation in annotations table via source field | Good | +| Caddy over nginx | Auto-HTTPS, built-in basic_auth, simpler config | Good | +| react-konva for editing | Canvas-based editing in modal; SVG stays for grid overlays | Good | +| Gemini 
2.0 Flash for agent | Fast, cheap, good structured output; replaced GPT-4o | Good | +| Pre-computed agent prompt | All data in prompt, no tool calls; avoids Pydantic AI request_limit issues | Good | +| t-SNE over UMAP | umap-learn blocked by Python 3.14 numba incompatibility | Revisit when numba supports 3.14 | +| Moondream2 via transformers | trust_remote_code with all_tied_weights_keys patch for transformers 5.x | Fragile — monitor updates | +| Sentinel bbox values (0.0) for classification | Avoids 30+ null guards; unified schema for detection and classification | Good | +| Separate classification evaluation service | ~50-line function vs modifying 560-line detection eval; clean separation | Good | +| Dataset-type routing at endpoint level | Keep classification/detection services separate; route in router layer | Good | +| Parser registry in IngestionService | Format-based dispatch to COCOParser or ClassificationJSONLParser | Good | +| Threshold slider for confusion matrix | Hide noisy off-diagonal cells at high cardinality (0-50%, default 1%) | Good | +| Client-side most-confused pairs | Derived from confusion matrix data; no new API endpoint needed | Good | +| Tableau 20 palette for embeddings | Stable categorical coloring for class-based scatter modes | Good | --- -*Last updated: 2026-02-12 after v1.1 scope redefinition* +*Last updated: 2026-02-19 after v1.2 milestone* diff --git a/.planning/ROADMAP.md b/.planning/ROADMAP.md index 6073dd0..36db1ce 100644 --- a/.planning/ROADMAP.md +++ b/.planning/ROADMAP.md @@ -2,8 +2,9 @@ ## Milestones -- v1.0 MVP - Phases 1-7 (shipped 2026-02-12) -- **v1.1 Deployment, Workflow & Competitive Parity** - Phases 8-14 +- v1.0 MVP - Phases 1-7 (shipped 2026-02-12) — [archive](.planning/milestones/v1.0-ROADMAP.md) +- v1.1 Deployment, Workflow & Competitive Parity - Phases 8-14 (shipped 2026-02-13) — [archive](.planning/milestones/v1.1-ROADMAP.md) +- v1.2 Classification Dataset Support - Phases 15-17 (shipped 2026-02-19) — 
[archive](.planning/milestones/v1.2-ROADMAP.md) ## Phases @@ -40,143 +41,57 @@ -### v1.1 Deployment, Workflow & Competitive Parity - -**Milestone Goal:** Make DataVisor deployable (Docker + GCP), secure for cloud access, and close key workflow gaps vs FiftyOne/Encord -- smart ingestion, annotation editing, error triage, interactive visualizations, and keyboard-driven navigation. - -**Phase Numbering:** -- Integer phases (8, 9, 10, ...): Planned milestone work -- Decimal phases (9.1, 9.2): Urgent insertions (marked with INSERTED) - -Decimal phases appear between their surrounding integers in numeric order. - -- [x] **Phase 8: Docker Deployment & Auth** - Dockerized 3-service stack with Caddy reverse proxy, basic auth, and deployment scripts -- [x] **Phase 9: Smart Ingestion** - No-code dataset import from folder path with auto-detection and confirmation -- [x] **Phase 10: Annotation Editing** - Move, resize, delete, and draw bounding boxes via react-konva in sample detail modal -- [x] **Phase 11: Error Triage** - Tag errors, highlight mode, and worst-images ranking with DuckDB persistence -- [x] **Phase 12: Interactive Viz & Discovery** - Confusion matrix, near-duplicates, interactive histograms, and find-similar -- [x] **Phase 13: Keyboard Shortcuts** - Keyboard navigation, triage hotkeys, edit shortcuts, and help overlay -- [x] **Phase 14: Per-Annotation Triage** - Auto-discover TP/FP/FN per bounding box via IoU overlap, color-coded boxes in detail modal, click to override classifications - -## Phase Details +
+v1.1 Deployment, Workflow & Competitive Parity (Phases 8-14) - SHIPPED 2026-02-13 ### Phase 8: Docker Deployment & Auth -**Goal**: DataVisor runs as a deployable Docker stack with single-user auth, accessible securely on a cloud VM or locally with a single command -**Depends on**: Phase 7 (v1.0 complete) -**Requirements**: DEPLOY-01, DEPLOY-02, DEPLOY-03, DEPLOY-04, DEPLOY-05 -**Success Criteria** (what must be TRUE): - 1. User can run `docker compose up` and access DataVisor at `http://localhost` with all features working (grid, embeddings, error analysis) - 2. User is prompted for username/password before accessing any page or API endpoint, and unauthenticated requests are rejected - 3. User can run a deployment script that provisions a GCP VM with persistent disk and starts DataVisor accessible at a public IP with HTTPS - 4. User can follow deployment documentation to configure environment variables, deploy to GCP, and set up a custom domain - 5. DuckDB data, Qdrant vectors, and thumbnail cache persist across container restarts without data loss -**Plans**: 5 plans - -Plans: -- [x] 08-01-PLAN.md -- Backend Dockerfile + config fixes (CORS, DuckDB CHECKPOINT) -- [x] 08-02-PLAN.md -- Frontend Dockerfile + Caddyfile reverse proxy with auth -- [x] 08-03-PLAN.md -- Docker Compose orchestration + .dockerignore + env config -- [x] 08-04-PLAN.md -- Local run script + GCP deployment scripts -- [x] 08-05-PLAN.md -- Deployment documentation + full stack verification +**Goal**: Deployable Docker stack with single-user auth, accessible on cloud VM or locally +**Plans**: 5 plans (complete) ### Phase 9: Smart Ingestion -**Goal**: Users can import datasets from the UI by pointing at a folder, reviewing auto-detected structure, and confirming import -- no CLI or config files needed -**Depends on**: Phase 8 (auth protects new endpoints) -**Requirements**: INGEST-01, INGEST-02, INGEST-03, INGEST-04, INGEST-05 -**Success Criteria** (what must be TRUE): - 1. 
User can enter a folder path in the UI and trigger a scan that returns detected dataset structure - 2. Scanner correctly identifies COCO annotation files and image directories within the folder - 3. Scanner detects train/val/test split subdirectories and presents them as separate importable splits - 4. User sees the detected structure as a confirmation step and can approve or adjust before import begins - 5. Import progress displays per-split status via real-time SSE updates until completion -**Plans**: 2 plans - -Plans: -- [x] 09-01-PLAN.md -- Backend FolderScanner service, scan/import API endpoints, split-aware ingestion pipeline -- [x] 09-02-PLAN.md -- Frontend ingestion wizard (path input, scan results, import progress) + landing page link +**Goal**: No-code dataset import from folder path with auto-detection and confirmation +**Plans**: 2 plans (complete) ### Phase 10: Annotation Editing -**Goal**: Users can make quick bounding box corrections directly in the sample detail modal without leaving DataVisor -**Depends on**: Phase 8 (auth protects mutation endpoints) -**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, ANNOT-04, ANNOT-05 -**Success Criteria** (what must be TRUE): - 1. User can enter edit mode in the sample detail modal and drag a bounding box to a new position - 2. User can grab resize handles on a bounding box and change its dimensions - 3. User can delete a bounding box and the deletion persists after closing the modal - 4. User can draw a new bounding box and assign it a class label - 5. 
Only ground truth annotations show edit controls; prediction annotations remain read-only and non-interactive -**Plans**: 3 plans - -Plans: -- [x] 10-01-PLAN.md -- Backend annotation CRUD endpoints + frontend mutation hooks and types -- [x] 10-02-PLAN.md -- Konva building blocks: coord-utils, EditableRect, DrawLayer, ClassPicker -- [x] 10-03-PLAN.md -- AnnotationEditor composition, sample modal integration, annotation list delete +**Goal**: Move, resize, delete, and draw bounding boxes via react-konva in sample detail modal +**Plans**: 3 plans (complete) ### Phase 11: Error Triage -**Goal**: Users can systematically review and tag errors with a focused triage workflow that persists decisions and surfaces the worst samples first -**Depends on**: Phase 8 (extends v1.0 error analysis) -**Requirements**: TRIAGE-01, TRIAGE-02, TRIAGE-03 -**Success Criteria** (what must be TRUE): - 1. User can tag any sample or annotation as FP, TP, FN, or mistake, and the tag persists across page refreshes - 2. User can activate highlight mode to dim non-error samples in the grid, making errors visually prominent - 3. 
User can view a "worst images" ranking that surfaces samples with the highest combined error score (error count + confidence spread + uniqueness) -**Plans**: 2 plans - -Plans: -- [x] 11-01-PLAN.md -- Backend triage endpoints (set-triage-tag, worst-images scoring) + frontend hooks and types -- [x] 11-02-PLAN.md -- Triage tag buttons in detail modal, highlight mode grid dimming, worst-images stats panel +**Goal**: Tag errors, highlight mode, and worst-images ranking with DuckDB persistence +**Plans**: 2 plans (complete) ### Phase 12: Interactive Viz & Discovery -**Goal**: Users can explore dataset quality interactively -- clicking visualization elements filters the grid, finding similar samples and near-duplicates is one click away -**Depends on**: Phase 11 (triage data informs confusion matrix), Phase 8 (auth protects endpoints) -**Requirements**: ANNOT-06, TRIAGE-04, TRIAGE-05, TRIAGE-06 -**Success Criteria** (what must be TRUE): - 1. User can click "Find Similar" on any sample to see nearest neighbors from Qdrant displayed in the grid - 2. User can view a confusion matrix and click any cell to filter the grid to samples matching that GT/prediction pair - 3. User can trigger near-duplicate detection and browse groups of visually similar images - 4. 
User can click a bar in any statistics dashboard histogram to filter the grid to samples in that bucket -**Plans**: 3 plans - -Plans: -- [x] 12-01-PLAN.md -- Discovery filter foundation + Find Similar grid filtering + interactive histogram bars -- [x] 12-02-PLAN.md -- Clickable confusion matrix cells with backend sample ID resolution -- [x] 12-03-PLAN.md -- Near-duplicate detection via Qdrant pairwise search with SSE progress +**Goal**: Confusion matrix, near-duplicates, interactive histograms, and find-similar +**Plans**: 3 plans (complete) ### Phase 13: Keyboard Shortcuts -**Goal**: Power users can navigate, triage, and edit entirely from the keyboard without reaching for the mouse -**Depends on**: Phase 10 (annotation edit shortcuts), Phase 11 (triage shortcuts), Phase 12 (all UI features exist) -**Requirements**: UX-01, UX-02, UX-03, UX-04 -**Success Criteria** (what must be TRUE): - 1. User can navigate between samples in the grid and modal using arrow keys, j/k, Enter, and Escape - 2. User can quick-tag errors during triage using number keys and toggle highlight mode with h - 3. User can delete annotations and undo edits with keyboard shortcuts while in annotation edit mode - 4. User can press ? 
to open a shortcut help overlay listing all available keyboard shortcuts -**Plans**: 2 plans - -Plans: -- [x] 13-01-PLAN.md -- Foundation (react-hotkeys-hook, shortcut registry, ui-store) + grid keyboard navigation -- [x] 13-02-PLAN.md -- Modal shortcuts (navigation, triage, editing, undo) + help overlay +**Goal**: Keyboard navigation, triage hotkeys, edit shortcuts, and help overlay +**Plans**: 2 plans (complete) ### Phase 14: Per-Annotation Triage -**Goal**: Users can see auto-discovered TP/FP/FN classifications per bounding box based on IoU overlap, with color-coded visualization in the detail modal and the ability to click individual annotations to override their classification -**Depends on**: Phase 11 (extends triage system), Phase 6 (error analysis IoU matching) -**Success Criteria** (what must be TRUE): - 1. User opens a sample with GT and predictions and sees each bounding box color-coded as TP (green), FP (red), or FN (orange) based on automatic IoU matching - 2. User can click an individual bounding box to override its auto-assigned classification (e.g. mark an auto-TP as a mistake) - 3. Per-annotation triage decisions persist across page refreshes and are stored in DuckDB - 4. Highlight mode dims samples that have no triage annotations, making triaged samples visually prominent -**Plans**: 3 plans - -Plans: -- [x] 14-01-PLAN.md -- Backend schema, IoU matching service, and annotation triage API endpoints -- [x] 14-02-PLAN.md -- Frontend types, hooks, and clickable TriageOverlay SVG component -- [x] 14-03-PLAN.md -- Wire TriageOverlay into sample modal + highlight mode integration +**Goal**: Auto-discover TP/FP/FN per bounding box via IoU overlap, color-coded boxes in detail modal, click to override classifications +**Plans**: 3 plans (complete) -## Progress +
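The Phase 14 auto-classification reduces to a small greedy IoU matcher: matched GT/prediction pairs become TP, unmatched predictions FP, unmatched GT boxes FN. A minimal sketch under assumptions, not the actual `evaluation.py` implementation; the `(x, y, w, h)` box format and the 0.5 threshold are illustrative:

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def classify_annotations(gt_boxes, pred_boxes, iou_threshold=0.5):
    """Greedy matching: each prediction claims its best unmatched GT box.
    Matched predictions -> TP, unmatched predictions -> FP, unmatched GT -> FN."""
    labels = {}
    matched_gt = set()
    for pi, pb in enumerate(pred_boxes):
        best_gi, best_iou = None, iou_threshold
        for gi, gb in enumerate(gt_boxes):
            if gi in matched_gt:
                continue
            score = iou(pb, gb)
            if score >= best_iou:
                best_gi, best_iou = gi, score
        if best_gi is not None:
            matched_gt.add(best_gi)
            labels[("pred", pi)] = "TP"
        else:
            labels[("pred", pi)] = "FP"
    for gi in range(len(gt_boxes)):
        if gi not in matched_gt:
            labels[("gt", gi)] = "FN"
    return labels
```

Per the decision log, these auto-computed labels are ephemeral (recomputed on GET) while user overrides persist in the `annotation_triage` table.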
+
+
+v1.2 Classification Dataset Support (Phases 15-17) - SHIPPED 2026-02-19
+
+### Phase 15: Classification Ingestion & Display
+**Goal**: Users can import, browse, and inspect classification datasets with the same ease as detection datasets
+**Plans**: 2 plans (complete)
+
+### Phase 16: Classification Evaluation
+**Goal**: Users can import predictions and analyze classification model performance with accuracy, F1, confusion matrix, and error categorization
+**Plans**: 2 plans (complete)
+
+### Phase 17: Classification Polish
+**Goal**: Classification workflows are production-ready for high-cardinality datasets (43+ classes) with visual aids that surface actionable insights
+**Plans**: 2 plans (complete)
-
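Phase 16's headline metrics (accuracy and macro F1) come down to per-class counting. A minimal sketch assuming parallel GT/prediction label lists, not the project's actual evaluation service:

```python
from collections import Counter

def classification_metrics(y_true, y_pred):
    """Return (accuracy, macro F1) for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = 0
    for t, p in zip(y_true, y_pred):
        if t == p:
            correct += 1
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    # Macro F1 weights every class equally, regardless of support.
    return correct / len(y_true), sum(f1s) / len(f1s)
```

The same TP/FP/FN counters are what a confusion matrix aggregates, which is why the most-confused-pairs summary could be derived client-side from matrix data alone.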
+ +## Progress | Phase | Milestone | Plans Complete | Status | Completed | |-------|-----------|----------------|--------|-----------| @@ -194,3 +109,6 @@ Phases execute in numeric order: 8 -> 9 -> 10 -> 11 -> 12 -> 13 -> 14 | 12. Interactive Viz & Discovery | v1.1 | 3/3 | Complete | 2026-02-13 | | 13. Keyboard Shortcuts | v1.1 | 2/2 | Complete | 2026-02-13 | | 14. Per-Annotation Triage | v1.1 | 3/3 | Complete | 2026-02-13 | +| 15. Classification Ingestion & Display | v1.2 | 2/2 | Complete | 2026-02-18 | +| 16. Classification Evaluation | v1.2 | 2/2 | Complete | 2026-02-18 | +| 17. Classification Polish | v1.2 | 2/2 | Complete | 2026-02-18 | diff --git a/.planning/STATE.md b/.planning/STATE.md index 1526a0a..ab9e078 100644 --- a/.planning/STATE.md +++ b/.planning/STATE.md @@ -2,19 +2,18 @@ ## Project Reference -See: .planning/PROJECT.md (updated 2026-02-12) +See: .planning/PROJECT.md (updated 2026-02-19) **Core value:** A single tool that replaces scattered scripts: load any CV dataset, visually browse with annotation overlays, compare GT vs predictions, cluster via embeddings, and surface mistakes -- all in one workflow. -**Current focus:** v1.1 complete. All 14 phases delivered. +**Current focus:** Planning next milestone ## Current Position -Phase: 14 of 14 (Per-Annotation Triage) -Plan: 3 of 3 in current phase -Status: Complete -Last activity: 2026-02-13 -- Phase 14 verified and complete +Phase: 17 of 17 (all milestones complete) +Status: v1.2 archived, ready for next milestone +Last activity: 2026-02-19 -- Completed v1.2 milestone archival -Progress: [████████████████████████████████████████████████████████████] v1.1: 41/41 plans complete +Progress: [################################] 100% (v1.0 + v1.1 + v1.2 complete) ## Performance Metrics @@ -23,96 +22,20 @@ Progress: [███████████████████████ - Average duration: 3.9 min - Total execution time: 82 min -**By Phase (v1.0):** - -| Phase | Plans | Total | Avg/Plan | -|-------|-------|-------|----------| -| 1. 
Data Foundation | 4/4 | 14 min | 3.5 min | -| 2. Visual Grid | 3/3 | 15 min | 5.0 min | -| 3. Filtering & Search | 2/2 | 10 min | 5.0 min | -| 4. Predictions & Comparison | 3/3 | 9 min | 3.0 min | -| 5. Embeddings & Visualization | 4/4 | 16 min | 4.0 min | -| 6. Error Analysis & Similarity | 2/2 | 9 min | 4.5 min | -| 7. Intelligence & Agents | 3/3 | 9 min | 3.0 min | - -**By Phase (v1.1):** - -| Phase | Plans | Total | Avg/Plan | -|-------|-------|-------|----------| -| 8. Docker Deployment & Auth | 5/5 | 25 min | 5.0 min | -| 9. Smart Ingestion | 2/2 | 10 min | 5.0 min | -| 10. Annotation Editing | 3/3 | 9 min | 3.0 min | -| 11. Error Triage | 2/2 | 6 min | 3.0 min | -| 12. Interactive Viz & Discovery | 3/3 | 10 min | 3.3 min | -| 13. Keyboard Shortcuts | 2/2 | 6 min | 3.0 min | -| 14. Per-Annotation Triage | 3/3 | 7 min | 2.3 min | +**Velocity (v1.1):** +- Total plans completed: 20 +- Average duration: 3.7 min +- Total execution time: 73 min + +**Velocity (v1.2):** +- Total plans completed: 6 +- Timeline: 1 day (2026-02-18) ## Accumulated Context ### Decisions Decisions are logged in PROJECT.md Key Decisions table. 
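One decision from that table, deterministic class-to-color hashing, is small enough to sketch. The MD5-to-HSL mapping below is illustrative of the technique, not necessarily the shipped implementation:

```python
import hashlib

def class_color(name: str) -> str:
    """Hash a class name to a stable hue: same class, same color, every session."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    hue = int.from_bytes(digest[:2], "big") % 360  # 0-359 degrees
    # Fixed saturation/lightness keep all class colors equally legible.
    return f"hsl({hue}, 70%, 50%)"
```

Because the color is a pure function of the class name, no palette needs to be stored or synchronized between backend and frontend.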
-Recent decisions affecting current work: - -- [v1.1 Roadmap]: Keep Qdrant in local mode for Docker (single-user <1M vectors) -- [v1.1 Roadmap]: Caddy over nginx for reverse proxy (auto-HTTPS, built-in basic_auth) -- [v1.1 Roadmap]: react-konva for annotation editing in detail modal only (SVG stays for grid) -- [v1.1 Roadmap]: FastAPI HTTPBasic DI over middleware (testable, composable) -- [08-01]: CPU-only PyTorch via post-sync replacement in Dockerfile (uv sync then uv pip install from CPU index) -- [08-01]: CORS restricted to localhost:3000 in dev, disabled entirely behind proxy (DATAVISOR_BEHIND_PROXY=true) -- [08-02]: NEXT_PUBLIC_API_URL=/api baked at build time for same-origin API via Caddy -- [08-02]: Caddy handles all auth at proxy layer -- zero application code changes -- [08-03]: Directory bind mount ./data:/app/data for DuckDB WAL + Qdrant + thumbnails persistence -- [08-03]: AUTH_PASSWORD_HASH has no default -- forces explicit auth configuration before deployment -- [08-03]: Only Caddy exposes ports 80/443 -- backend and frontend are Docker-internal only -- [08-04]: VM startup script does NOT auto-start docker compose -- requires manual .env setup first -- [08-04]: GCP config via env vars with defaults (only GCP_PROJECT_ID required) -- [08-05]: 10-section deployment docs covering local Docker, GCP, custom domain HTTPS, data persistence, troubleshooting -- [08-05]: opencv-python-headless replaces opencv-python in Docker builder stage (no X11/GUI libs in slim images) -- [09-01]: Three-layout priority detection: Roboflow > Standard COCO > Flat -- [09-01]: ijson peek at top-level keys for COCO detection (max 10 keys, files >500MB skipped) -- [09-01]: Optional dataset_id param on ingest_with_progress for multi-split ID sharing -- [09-01]: INSERT-or-UPDATE pattern for dataset record across multi-split imports -- [09-02]: POST SSE streaming via fetch + ReadableStream (not EventSource, which is GET-only) -- [09-02]: FolderScanner refactored to accept 
StorageBackend for GCS support -- [09-02]: Split-prefixed IDs for collision avoidance in multi-split import -- [10-01]: get_cursor DI for annotation router (auto-close cursor) -- [10-01]: source='ground_truth' enforced in SQL WHERE clauses for PUT/DELETE safety -- [10-01]: Dataset counts refreshed via subquery UPDATE (no race conditions) -- [10-02]: useDrawLayer hook pattern (handlers + ReactNode) instead of separate component -- [10-02]: Transformer scale reset to 1 on transformEnd (Konva best practice) -- [10-03]: AnnotationEditor loaded via next/dynamic with ssr:false (prevents Konva SSR errors) -- [10-03]: Draw completion shows ClassPicker before creating annotation (requires category selection) -- [10-03]: Delete buttons only appear on ground_truth rows when edit mode is active -- [11-01]: Dual router pattern (samples_router + datasets_router) from single triage module -- [11-01]: Atomic triage tag replacement via list_filter + list_append single SQL -- [11-01]: get_db DI pattern for triage router (matching statistics.py style) -- [11-02]: Triage buttons always visible in detail modal (not gated by edit mode) -- [11-02]: Highlight toggle uses yellow-500 active styling to distinguish from edit buttons -- [11-02]: Triage tag badges show short label (TP/FP/FN/MISTAKE) instead of full prefix -- [12-01]: Lasso selection takes priority over discovery filter (effectiveIds = lassoSelectedIds ?? 
sampleIdFilter) -- [12-01]: "Show in Grid" button only appears after similarity results load (progressive disclosure) -- [12-01]: getState() pattern for store access in Recharts onClick handlers (non-reactive) -- [12-01]: DiscoveryFilterChip in dataset header for cross-tab visibility -- [12-02]: Imperative fetch function (not hook) for one-shot confusion cell sample lookups -- [12-02]: Greedy IoU matching replayed per sample for consistent CM cell membership -- [12-02]: getState() pattern for Zustand store writes in async callbacks -- [12-03]: Tab bar always visible so Near Duplicates is accessible without predictions -- [12-03]: Union-find with path compression for O(alpha(n)) grouping of pairwise matches -- [12-03]: Progress updates throttled to every 10 points to avoid excessive state updates -- [13-01]: isFocused passed as prop from ImageGrid (avoids N store subscriptions per GridCell) -- [13-01]: Central shortcut registry pattern: all shortcuts as data in lib/shortcuts.ts -- [13-02]: Single useHotkeys('1, 2, 3, 4') with event.key dispatch (avoids rules-of-hooks violation) -- [13-02]: Single-level undo stack via React state for annotation delete undo -- [13-02]: Triage number keys disabled during edit mode (prevents Konva focus confusion) -- [13-02]: groupByCategory via reduce instead of Object.groupBy (avoids es2024 lib dep) -- [14-01]: Reuse _compute_iou_matrix from evaluation.py (no duplicate IoU code) -- [14-01]: Auto-computed labels ephemeral (computed on GET, not stored); overrides persist in annotation_triage table -- [14-01]: triage:annotated sample tag bridges per-annotation triage to highlight mode -- [14-02]: TriageOverlay is separate from AnnotationOverlay (interactive vs non-interactive SVG) -- [14-02]: Click handler delegates to parent via callback (overlay does not manage mutations) -- [14-02]: Annotations not in triageMap skipped (handles GT-only samples gracefully) -- [14-03]: GT boxes show category name only, predictions show category + 
confidence% (color conveys triage type) ### Pending Todos @@ -120,10 +43,16 @@ None. ### Blockers/Concerns -- [RESOLVED] SVG-to-Canvas coordinate mismatch resolved by coord-utils.ts (10-02) +- Confirm Roboflow JSONL format against actual export before finalizing parser + +### Roadmap Evolution + +- v1.0: 7 phases (1-7), 21 plans -- shipped 2026-02-12 +- v1.1: 7 phases (8-14), 20 plans -- shipped 2026-02-13 +- v1.2: 3 phases (15-17), 6 plans -- shipped 2026-02-19 ## Session Continuity -Last session: 2026-02-13 -Stopped at: Phase 14 complete, v1.1 milestone complete +Last session: 2026-02-19 +Stopped at: Completed v1.2 milestone archival Resume file: None diff --git a/.planning/REQUIREMENTS.md b/.planning/milestones/v1.1-REQUIREMENTS.md similarity index 68% rename from .planning/REQUIREMENTS.md rename to .planning/milestones/v1.1-REQUIREMENTS.md index a8c8887..12b090b 100644 --- a/.planning/REQUIREMENTS.md +++ b/.planning/milestones/v1.1-REQUIREMENTS.md @@ -1,11 +1,10 @@ -# Requirements: DataVisor v1.1 +# Requirements Archive: DataVisor v1.1 **Defined:** 2026-02-12 -**Core Value:** A single tool that replaces scattered scripts: load any CV dataset, visually browse with annotation overlays, compare GT vs predictions, cluster via embeddings, and surface mistakes — all in one workflow. +**Completed:** 2026-02-13 +**Core Value:** A single tool that replaces scattered scripts: load any CV dataset, visually browse with annotation overlays, compare GT vs predictions, cluster via embeddings, and surface mistakes -- all in one workflow. -## v1.1 Requirements - -Requirements for Deployment, Workflow & Competitive Parity milestone. +## v1.1 Requirements (All Complete) ### Deployment & Infrastructure @@ -39,7 +38,7 @@ Requirements for Deployment, Workflow & Competitive Parity milestone. 
- [x] **TRIAGE-03**: "Worst images" ranking surfaces samples with highest combined error score (error count + confidence spread + uniqueness) - [x] **TRIAGE-04**: Interactive confusion matrix that filters grid when a cell is clicked - [x] **TRIAGE-05**: Near-duplicate detection surfaces visually similar images in the dataset -- [x] **TRIAGE-06**: Interactive histograms on the statistics dashboard — clicking a bar filters the grid +- [x] **TRIAGE-06**: Interactive histograms on the statistics dashboard -- clicking a bar filters the grid ### UX @@ -48,46 +47,8 @@ Requirements for Deployment, Workflow & Competitive Parity milestone. - [x] **UX-03**: Keyboard shortcuts for annotation editing (Delete, Ctrl+Z, e for edit mode) - [x] **UX-04**: Shortcut help overlay triggered by ? key -## v1.2 Requirements - -Deferred to future milestone. Tracked but not in current roadmap. - -### Format Expansion - -- **FMT-01**: YOLO format parser (.txt annotation files with class_id + normalized xywh) -- **FMT-02**: Pascal VOC format parser (XML annotation files) -- **FMT-03**: Dataset export in COCO and YOLO formats - -### Evaluation - -- **EVAL-01**: PR curves per class -- **EVAL-02**: Per-class AP metrics dashboard - -### Advanced - -- **ADV-01**: Model zoo / in-app inference (ONNX/TorchScript) -- **ADV-02**: Custom workspaces / panel layouts -- **ADV-03**: Customizable keyboard shortcut remapping -- **ADV-04**: CVAT/Label Studio integration for complex annotation workflows - -## Out of Scope - -Explicitly excluded. Documented to prevent scope creep. 
- -| Feature | Reason | -|---------|--------| -| Multi-user collaboration / RBAC | Personal tool — single-user auth for VM security only | -| Video annotation support | Image-only for now; multiplies complexity | -| Training pipeline integration | DataVisor inspects data, doesn't train models | -| Mobile/tablet interface | Desktop browser only | -| Real-time streaming inference | Batch-oriented analysis | -| 3D point cloud visualization | Different rendering pipeline entirely | -| Full annotation editor (polygon, segmentation) | Bounding box CRUD only for v1.1 | - ## Traceability -Which phases cover which requirements. Updated during roadmap creation. - | Requirement | Phase | Status | |-------------|-------|--------| | DEPLOY-01 | Phase 8 | Complete | @@ -117,11 +78,7 @@ Which phases cover which requirements. Updated during roadmap creation. | UX-03 | Phase 13 | Complete | | UX-04 | Phase 13 | Complete | -**Coverage:** -- v1.1 requirements: 26 total -- Mapped to phases: 26 -- Unmapped: 0 +**Coverage:** 26/26 requirements complete (100%) --- -*Requirements defined: 2026-02-12* -*Last updated: 2026-02-13 — Phase 13 requirements marked Complete (v1.1 milestone complete)* +*Archived: 2026-02-13* diff --git a/.planning/milestones/v1.1-ROADMAP.md b/.planning/milestones/v1.1-ROADMAP.md new file mode 100644 index 0000000..ad66394 --- /dev/null +++ b/.planning/milestones/v1.1-ROADMAP.md @@ -0,0 +1,131 @@ +# Milestone v1.1: Deployment, Workflow & Competitive Parity + +**Status:** SHIPPED 2026-02-13 +**Phases:** 8-14 +**Total Plans:** 20 + +## Overview + +Make DataVisor deployable (Docker + GCP), secure for cloud access, and close key workflow gaps vs FiftyOne/Encord -- smart ingestion, annotation editing, error triage, interactive visualizations, and keyboard-driven navigation. 
+ +## Phases + +### Phase 8: Docker Deployment & Auth + +**Goal**: DataVisor runs as a deployable Docker stack with single-user auth, accessible securely on a cloud VM or locally with a single command +**Depends on**: Phase 7 (v1.0 complete) +**Requirements**: DEPLOY-01, DEPLOY-02, DEPLOY-03, DEPLOY-04, DEPLOY-05 +**Plans**: 5 plans + +Plans: +- [x] 08-01: Backend Dockerfile + config fixes (CORS, DuckDB CHECKPOINT) +- [x] 08-02: Frontend Dockerfile + Caddyfile reverse proxy with auth +- [x] 08-03: Docker Compose orchestration + .dockerignore + env config +- [x] 08-04: Local run script + GCP deployment scripts +- [x] 08-05: Deployment documentation + full stack verification + +### Phase 9: Smart Ingestion + +**Goal**: Users can import datasets from the UI by pointing at a folder, reviewing auto-detected structure, and confirming import -- no CLI or config files needed +**Depends on**: Phase 8 (auth protects new endpoints) +**Requirements**: INGEST-01, INGEST-02, INGEST-03, INGEST-04, INGEST-05 +**Plans**: 2 plans + +Plans: +- [x] 09-01: Backend FolderScanner service, scan/import API endpoints, split-aware ingestion pipeline +- [x] 09-02: Frontend ingestion wizard (path input, scan results, import progress) + landing page link + +### Phase 10: Annotation Editing + +**Goal**: Users can make quick bounding box corrections directly in the sample detail modal without leaving DataVisor +**Depends on**: Phase 8 (auth protects mutation endpoints) +**Requirements**: ANNOT-01, ANNOT-02, ANNOT-03, ANNOT-04, ANNOT-05 +**Plans**: 3 plans + +Plans: +- [x] 10-01: Backend annotation CRUD endpoints + frontend mutation hooks and types +- [x] 10-02: Konva building blocks: coord-utils, EditableRect, DrawLayer, ClassPicker +- [x] 10-03: AnnotationEditor composition, sample modal integration, annotation list delete + +### Phase 11: Error Triage + +**Goal**: Users can systematically review and tag errors with a focused triage workflow that persists decisions and surfaces the worst samples 
first +**Depends on**: Phase 8 (extends v1.0 error analysis) +**Requirements**: TRIAGE-01, TRIAGE-02, TRIAGE-03 +**Plans**: 2 plans + +Plans: +- [x] 11-01: Backend triage endpoints (set-triage-tag, worst-images scoring) + frontend hooks and types +- [x] 11-02: Triage tag buttons in detail modal, highlight mode grid dimming, worst-images stats panel + +### Phase 12: Interactive Viz & Discovery + +**Goal**: Users can explore dataset quality interactively -- clicking visualization elements filters the grid, finding similar samples and near-duplicates is one click away +**Depends on**: Phase 11 (triage data informs confusion matrix), Phase 8 (auth protects endpoints) +**Requirements**: ANNOT-06, TRIAGE-04, TRIAGE-05, TRIAGE-06 +**Plans**: 3 plans + +Plans: +- [x] 12-01: Discovery filter foundation + Find Similar grid filtering + interactive histogram bars +- [x] 12-02: Clickable confusion matrix cells with backend sample ID resolution +- [x] 12-03: Near-duplicate detection via Qdrant pairwise search with SSE progress + +### Phase 13: Keyboard Shortcuts + +**Goal**: Power users can navigate, triage, and edit entirely from the keyboard without reaching for the mouse +**Depends on**: Phase 10 (annotation edit shortcuts), Phase 11 (triage shortcuts), Phase 12 (all UI features exist) +**Requirements**: UX-01, UX-02, UX-03, UX-04 +**Plans**: 2 plans + +Plans: +- [x] 13-01: Foundation (react-hotkeys-hook, shortcut registry, ui-store) + grid keyboard navigation +- [x] 13-02: Modal shortcuts (navigation, triage, editing, undo) + help overlay + +### Phase 14: Per-Annotation Triage + +**Goal**: Users can see auto-discovered TP/FP/FN classifications per bounding box based on IoU overlap, with color-coded visualization in the detail modal and the ability to click individual annotations to override their classification +**Depends on**: Phase 11 (extends triage system), Phase 6 (error analysis IoU matching) +**Plans**: 3 plans + +Plans: +- [x] 14-01: Backend schema, IoU matching 
service, and annotation triage API endpoints +- [x] 14-02: Frontend types, hooks, and clickable TriageOverlay SVG component +- [x] 14-03: Wire TriageOverlay into sample modal + highlight mode integration + +## Milestone Summary + +**Key Decisions:** + +- Caddy over nginx for reverse proxy (auto-HTTPS, built-in basic_auth) +- CPU-only PyTorch via post-sync replacement in Dockerfile +- react-konva for annotation editing (SVG stays for grid overlays) +- FastAPI HTTPBasic DI over middleware (testable, composable) +- Atomic triage tag replacement via list_filter + list_append single SQL +- Union-find with path compression for near-duplicate grouping +- Central shortcut registry pattern (all shortcuts as data) +- Auto-computed triage labels ephemeral (computed on GET); overrides persist in annotation_triage table +- Switched AI agent from OpenAI GPT-4o to Google Gemini 2.0 Flash +- Pre-compute all data for AI agent prompt (no tool calls needed) + +**Issues Resolved:** + +- opencv-python-headless for Docker slim images (no X11 libs needed) +- DuckDB WAL stale file recovery via CHECKPOINT on shutdown +- PyTorch CPU install order (uv sync first, then replace with CPU wheel) +- Pydantic AI request_limit exceeded by Gemini tool-call loop (eliminated tools) +- GEMINI_API_KEY not loading (load_dotenv for third-party libs) +- pyvips missing for Moondream2 auto-tag (added dependency) + +**Issues Deferred:** + +- UMAP blocked by Python 3.14 numba incompatibility (using t-SNE) +- Moondream2 trust_remote_code fragile with transformers updates + +**Technical Debt Incurred:** + +- Module-level cache for Intelligence panel results (should use React Query cache) +- Old triage tags filtered client-side (OBSOLETE_TRIAGE_TAGS set in grid-cell.tsx) + +--- + +_For current project status, see .planning/ROADMAP.md_ diff --git a/.planning/milestones/v1.2-REQUIREMENTS.md b/.planning/milestones/v1.2-REQUIREMENTS.md new file mode 100644 index 0000000..28bec2b --- /dev/null +++ 
b/.planning/milestones/v1.2-REQUIREMENTS.md @@ -0,0 +1,103 @@ +# Requirements Archive: v1.2 Classification Dataset Support + +**Archived:** 2026-02-19 +**Status:** SHIPPED + +For current requirements, see `.planning/REQUIREMENTS.md`. + +--- + +# Requirements: DataVisor + +**Defined:** 2026-02-18 +**Core Value:** A single tool that replaces scattered scripts: load any CV dataset, visually browse with annotation overlays, compare GT vs predictions, cluster via embeddings, and surface mistakes -- all in one workflow. + +## v1.2 Requirements + +Requirements for classification dataset support. Each maps to roadmap phases. + +### Ingestion + +- [x] **INGEST-01**: User can import a classification dataset from a directory containing JSONL annotations and images +- [x] **INGEST-02**: System auto-detects dataset type (detection vs classification) from annotation format during import +- [x] **INGEST-03**: User can import multi-split classification datasets (train/valid/test) in a single operation +- [x] **INGEST-04**: Schema stores dataset_type on the datasets table and handles classification annotations without bbox values + +### Display + +- [x] **DISP-01**: User sees class label badges on grid thumbnails for classification datasets +- [x] **DISP-02**: User sees class label (GT and prediction) prominently in the sample detail modal +- [x] **DISP-03**: User can edit the GT class label via dropdown in the detail modal +- [x] **DISP-04**: Statistics dashboard shows classification-appropriate metrics (labeled images, class distribution) and hides detection-only elements (bbox area, IoU slider) + +### Evaluation + +- [x] **EVAL-01**: User can import classification predictions in JSONL format with confidence scores +- [x] **EVAL-02**: User sees accuracy, macro F1, weighted F1, and per-class precision/recall/F1 metrics +- [x] **EVAL-03**: User sees a confusion matrix for classification with click-to-filter support +- [x] **EVAL-04**: User sees error analysis categorizing each 
image as correct, misclassified, or missing prediction +- [x] **EVAL-05**: User sees GT vs predicted label comparison on grid thumbnails and in the modal + +### Polish + +- [x] **POLISH-01**: Confusion matrix scales to 43+ classes with readable rendering +- [x] **POLISH-02**: User can color embedding scatter by GT class, predicted class, or correct/incorrect status +- [x] **POLISH-03**: User sees most-confused class pairs summary from the confusion matrix +- [x] **POLISH-04**: User sees per-class performance sparklines with color-coded thresholds + +## Future Requirements + +### Multi-label Classification + +- **MLABEL-01**: User can import multi-label classification datasets (multiple labels per image) +- **MLABEL-02**: User sees multi-label metrics (hamming loss, subset accuracy) + +### Advanced Evaluation + +- **ADVEVAL-01**: User can import top-K predictions with full probability distributions +- **ADVEVAL-02**: User sees confidence calibration plot (reliability diagram) +- **ADVEVAL-03**: User can compare performance across train/valid/test splits side-by-side + +## Out of Scope + +| Feature | Reason | +|---------|--------| +| Multi-label classification | Different data model, metrics, and UI; scope explosion for v1.2 | +| Top-K evaluation | Requires importing full probability distributions; complicates schema | +| PR curves for classification | Less informative than confusion matrix + per-class metrics for multi-class | +| mAP for classification | Detection metric, not applicable to classification | +| Bbox editing for classification | No bounding boxes in classification datasets | +| IoU threshold controls for classification | No spatial matching in classification | + +## Traceability + +Which phases cover which requirements. Updated during roadmap creation. 
+
+| Requirement | Phase | Status |
+|-------------|-------|--------|
+| INGEST-01 | Phase 15 | Done |
+| INGEST-02 | Phase 15 | Done |
+| INGEST-03 | Phase 15 | Done |
+| INGEST-04 | Phase 15 | Done |
+| DISP-01 | Phase 15 | Done |
+| DISP-02 | Phase 15 | Done |
+| DISP-03 | Phase 15 | Done |
+| DISP-04 | Phase 15 | Done |
+| EVAL-01 | Phase 16 | Done |
+| EVAL-02 | Phase 16 | Done |
+| EVAL-03 | Phase 16 | Done |
+| EVAL-04 | Phase 16 | Done |
+| EVAL-05 | Phase 16 | Done |
+| POLISH-01 | Phase 17 | Done |
+| POLISH-02 | Phase 17 | Done |
+| POLISH-03 | Phase 17 | Done |
+| POLISH-04 | Phase 17 | Done |
+
+**Coverage:**
+- v1.2 requirements: 17 total
+- Mapped to phases: 17
+- Unmapped: 0
+
+---
+*Requirements defined: 2026-02-18*
+*Last updated: 2026-02-18 after roadmap creation*
diff --git a/.planning/milestones/v1.2-ROADMAP.md b/.planning/milestones/v1.2-ROADMAP.md
new file mode 100644
index 0000000..cb34b45
--- /dev/null
+++ b/.planning/milestones/v1.2-ROADMAP.md
@@ -0,0 +1,145 @@
+# Roadmap: DataVisor
+
+## Milestones
+
+- v1.0 MVP - Phases 1-7 (shipped 2026-02-12) — [archive](.planning/milestones/v1.0-ROADMAP.md)
+- v1.1 Deployment, Workflow & Competitive Parity - Phases 8-14 (shipped 2026-02-13) — [archive](.planning/milestones/v1.1-ROADMAP.md)
+- v1.2 Classification Dataset Support - Phases 15-17 (shipped 2026-02-19)
+
+## Phases
+
+v1.0 MVP (Phases 1-7) - SHIPPED 2026-02-12 + +### Phase 1: Data Foundation +**Goal**: DuckDB-backed streaming ingestion pipeline for COCO datasets at 100K+ scale +**Plans**: 4 plans (complete) + +### Phase 2: Visual Grid +**Goal**: Virtualized infinite-scroll grid with SVG annotation overlays +**Plans**: 3 plans (complete) + +### Phase 3: Filtering & Search +**Goal**: Full metadata filtering, search, saved views, and bulk tagging +**Plans**: 2 plans (complete) + +### Phase 4: Predictions & Comparison +**Goal**: Model prediction import with GT vs Predictions comparison +**Plans**: 3 plans (complete) + +### Phase 5: Embeddings & Visualization +**Goal**: DINOv2 embeddings with t-SNE reduction and deck.gl scatter plot +**Plans**: 4 plans (complete) + +### Phase 6: Error Analysis & Similarity +**Goal**: Error categorization pipeline and Qdrant-powered similarity search +**Plans**: 2 plans (complete) + +### Phase 7: Intelligence & Agents +**Goal**: Pydantic AI agent for error patterns and Moondream2 VLM auto-tagging +**Plans**: 3 plans (complete) + +
+ +
+v1.1 Deployment, Workflow & Competitive Parity (Phases 8-14) - SHIPPED 2026-02-13 + +### Phase 8: Docker Deployment & Auth +**Goal**: Deployable Docker stack with single-user auth, accessible on cloud VM or locally +**Plans**: 5 plans (complete) + +### Phase 9: Smart Ingestion +**Goal**: No-code dataset import from folder path with auto-detection and confirmation +**Plans**: 2 plans (complete) + +### Phase 10: Annotation Editing +**Goal**: Move, resize, delete, and draw bounding boxes via react-konva in sample detail modal +**Plans**: 3 plans (complete) + +### Phase 11: Error Triage +**Goal**: Tag errors, highlight mode, and worst-images ranking with DuckDB persistence +**Plans**: 2 plans (complete) + +### Phase 12: Interactive Viz & Discovery +**Goal**: Confusion matrix, near-duplicates, interactive histograms, and find-similar +**Plans**: 3 plans (complete) + +### Phase 13: Keyboard Shortcuts +**Goal**: Keyboard navigation, triage hotkeys, edit shortcuts, and help overlay +**Plans**: 2 plans (complete) + +### Phase 14: Per-Annotation Triage +**Goal**: Auto-discover TP/FP/FN per bounding box via IoU overlap, color-coded boxes in detail modal, click to override classifications +**Plans**: 3 plans (complete) + +
+
+### v1.2 Classification Dataset Support (Shipped 2026-02-19)
+
+**Milestone Goal:** First-class single-label classification dataset support with full feature parity with detection workflows -- from ingestion through evaluation to polish.
+
+#### Phase 15: Classification Ingestion & Display
+**Goal**: Users can import, browse, and inspect classification datasets with the same ease as detection datasets
+**Depends on**: Phase 14 (existing codebase)
+**Requirements**: INGEST-01, INGEST-02, INGEST-03, INGEST-04, DISP-01, DISP-02, DISP-03, DISP-04
+**Success Criteria** (what must be TRUE):
+ 1. User can point the ingestion wizard at a folder with JSONL annotations and images, and the system auto-detects it as a classification dataset
+ 2. User can import multi-split classification datasets (train/valid/test) in a single operation, just like detection datasets
+ 3. User sees class label badges on grid thumbnails instead of bounding box overlays when browsing a classification dataset
+ 4. User sees GT class label prominently in the sample detail modal and can change it via a dropdown
+ 5. Statistics dashboard shows classification-appropriate metrics (labeled images count, class distribution) with no detection-only elements visible (no bbox area histogram, no IoU slider)
+**Plans**: 2 plans (complete)
+Plans:
+- [x] 15-01-PLAN.md -- Backend: schema migration, ClassificationJSONLParser, FolderScanner detection, IngestionService dispatch, API endpoints
+- [x] 15-02-PLAN.md -- Frontend: type updates, grid class badges, detail modal class label/dropdown, classification-aware statistics
+
+#### Phase 16: Classification Evaluation
+**Goal**: Users can import predictions and analyze classification model performance with accuracy, F1, confusion matrix, and error categorization
+**Depends on**: Phase 15
+**Requirements**: EVAL-01, EVAL-02, EVAL-03, EVAL-04, EVAL-05
+**Success Criteria** (what must be TRUE):
+ 1. 
User can import classification predictions in JSONL format with confidence scores and see them alongside ground truth + 2. User sees accuracy, macro F1, weighted F1, and per-class precision/recall/F1 metrics in the evaluation panel + 3. User sees a confusion matrix and can click any cell to filter the grid to images with that GT/predicted class pair + 4. User sees each image categorized as correct, misclassified, or missing prediction in the error analysis view + 5. User sees GT vs predicted label comparison on grid thumbnails and in the detail modal +**Plans**: 2 plans (complete) +Plans: +- [x] 16-01-PLAN.md -- Backend: classification prediction parser, evaluation service, error analysis service, endpoint routing +- [x] 16-02-PLAN.md -- Frontend: types, hooks, prediction import dialog, evaluation panel, error analysis panel, grid badges + +#### Phase 17: Classification Polish +**Goal**: Classification workflows are production-ready for high-cardinality datasets (43+ classes) with visual aids that surface actionable insights +**Depends on**: Phase 16 +**Requirements**: POLISH-01, POLISH-02, POLISH-03, POLISH-04 +**Success Criteria** (what must be TRUE): + 1. Confusion matrix renders readably at 43+ classes with threshold filtering and overflow handling + 2. User can color the embedding scatter plot by GT class, predicted class, or correct/incorrect status + 3. User sees a ranked list of most-confused class pairs derived from the confusion matrix + 4. User sees per-class performance sparklines with color-coded thresholds (green/yellow/red) in the metrics table +**Plans**: 2 plans (complete) +Plans: +- [x] 17-01-PLAN.md -- Confusion matrix threshold/overflow, most-confused pairs, F1 bars in per-class table +- [x] 17-02-PLAN.md -- Embedding scatter color modes (GT class, predicted class, correct/incorrect) + +## Progress + +| Phase | Milestone | Plans Complete | Status | Completed | +|-------|-----------|----------------|--------|-----------| +| 1. 
Data Foundation | v1.0 | 4/4 | Complete | 2026-02-10 | +| 2. Visual Grid | v1.0 | 3/3 | Complete | 2026-02-10 | +| 3. Filtering & Search | v1.0 | 2/2 | Complete | 2026-02-11 | +| 4. Predictions & Comparison | v1.0 | 3/3 | Complete | 2026-02-11 | +| 5. Embeddings & Visualization | v1.0 | 4/4 | Complete | 2026-02-11 | +| 6. Error Analysis & Similarity | v1.0 | 2/2 | Complete | 2026-02-12 | +| 7. Intelligence & Agents | v1.0 | 3/3 | Complete | 2026-02-12 | +| 8. Docker Deployment & Auth | v1.1 | 5/5 | Complete | 2026-02-12 | +| 9. Smart Ingestion | v1.1 | 2/2 | Complete | 2026-02-12 | +| 10. Annotation Editing | v1.1 | 3/3 | Complete | 2026-02-12 | +| 11. Error Triage | v1.1 | 2/2 | Complete | 2026-02-12 | +| 12. Interactive Viz & Discovery | v1.1 | 3/3 | Complete | 2026-02-13 | +| 13. Keyboard Shortcuts | v1.1 | 2/2 | Complete | 2026-02-13 | +| 14. Per-Annotation Triage | v1.1 | 3/3 | Complete | 2026-02-13 | +| 15. Classification Ingestion & Display | v1.2 | 2/2 | Complete | 2026-02-18 | +| 16. Classification Evaluation | v1.2 | 2/2 | Complete | 2026-02-18 | +| 17. 
Classification Polish | v1.2 | 2/2 | Complete | 2026-02-18 | diff --git a/.planning/phases/15-classification-ingestion-display/15-01-PLAN.md b/.planning/phases/15-classification-ingestion-display/15-01-PLAN.md new file mode 100644 index 0000000..be35187 --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-01-PLAN.md @@ -0,0 +1,256 @@ +--- +phase: 15-classification-ingestion-display +plan: 01 +type: execute +wave: 1 +depends_on: [] +files_modified: + - app/repositories/duckdb_repo.py + - app/models/dataset.py + - app/models/scan.py + - app/ingestion/base_parser.py + - app/ingestion/classification_jsonl_parser.py + - app/services/folder_scanner.py + - app/services/ingestion.py + - app/routers/ingestion.py + - app/routers/datasets.py + - app/routers/annotations.py + - app/routers/statistics.py +autonomous: true + +must_haves: + truths: + - "POST /ingestion/scan on a folder with JSONL + images returns format='classification_jsonl' with correct splits" + - "POST /ingestion/import with classification_jsonl splits creates dataset with dataset_type='classification' and annotations with sentinel bbox values (0.0)" + - "GET /datasets/{id} returns dataset_type field" + - "PATCH /annotations/{id}/category updates category_name for classification label editing" + - "Classification annotations have bbox_x=0, bbox_y=0, bbox_w=0, bbox_h=0, area=0 as sentinel values" + artifacts: + - path: "app/ingestion/classification_jsonl_parser.py" + provides: "ClassificationJSONLParser extending BaseParser" + contains: "class ClassificationJSONLParser" + - path: "app/repositories/duckdb_repo.py" + provides: "dataset_type column migration" + contains: "dataset_type" + - path: "app/services/folder_scanner.py" + provides: "Classification JSONL layout detection" + contains: "classification_jsonl" + key_links: + - from: "app/services/folder_scanner.py" + to: "app/models/scan.py" + via: "ScanResult with format='classification_jsonl'" + pattern: "classification_jsonl" + - from: 
"app/services/ingestion.py" + to: "app/ingestion/classification_jsonl_parser.py" + via: "parser dispatch by format string" + pattern: "ClassificationJSONLParser" + - from: "app/services/ingestion.py" + to: "app/repositories/duckdb_repo.py" + via: "stores dataset_type on dataset record" + pattern: "dataset_type" +--- + + +Add classification dataset ingestion support to the backend: schema migration, JSONL parser, folder scanner detection, parser dispatch, and annotation category update endpoint. + +Purpose: Enable the system to auto-detect, parse, and store classification datasets using the existing ingestion pipeline with sentinel bbox values. This is the backend foundation for all classification display work. +Output: ClassificationJSONLParser, extended FolderScanner, updated IngestionService with parser dispatch, dataset_type column, and category update endpoint. + + + +@/Users/ortizeg/.claude/get-shit-done/workflows/execute-plan.md +@/Users/ortizeg/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/phases/15-classification-ingestion-display/15-RESEARCH.md +@app/ingestion/base_parser.py +@app/ingestion/coco_parser.py +@app/services/folder_scanner.py +@app/services/ingestion.py +@app/repositories/duckdb_repo.py +@app/models/dataset.py +@app/models/scan.py +@app/routers/ingestion.py +@app/routers/datasets.py +@app/routers/annotations.py + + + + + + Task 1: Schema migration, Pydantic models, and ClassificationJSONLParser + + app/repositories/duckdb_repo.py + app/models/dataset.py + app/models/scan.py + app/ingestion/base_parser.py + app/ingestion/classification_jsonl_parser.py + + + **1. Schema migration** (`app/repositories/duckdb_repo.py`): + Add after existing ALTER TABLE statements in `initialize_schema()`: + ```python + self.connection.execute( + "ALTER TABLE datasets ADD COLUMN IF NOT EXISTS dataset_type VARCHAR DEFAULT 'detection'" + ) + ``` + + **2. 
Pydantic models**: + - `app/models/dataset.py`: Add `dataset_type: str = "detection"` field to `DatasetResponse`. + - `app/models/scan.py`: No change needed -- `ScanResult.format` already accepts any string. The format field will carry `"classification_jsonl"` for classification datasets. + + **3. BaseParser update** (`app/ingestion/base_parser.py`): + Add `image_dir: str = ""` parameter to `build_image_batches` abstract method signature if not already present (check COCOParser -- it already has it in the concrete method, ensure the ABC matches). + + **4. ClassificationJSONLParser** (`app/ingestion/classification_jsonl_parser.py` -- NEW FILE): + Create parser extending BaseParser with: + + - `format_name` property returns `"classification_jsonl"` + - `parse_categories(file_path)`: Single pass over JSONL, collect unique labels from flexible keys (`label`, `class`, `category`, `class_name`). Return `{i: name for i, name in enumerate(sorted(labels))}`. + - `build_image_batches(file_path, dataset_id, split, image_dir)`: Read JSONL line by line. For each line, extract filename from flexible keys (`filename`, `file_name`, `image`, `path`). Generate sample_id as `f"{split}_{i}"` if split else `str(i)`. Yield DataFrames with columns matching samples table: `id, dataset_id, file_name, width, height, thumbnail_path, split, metadata, image_dir`. Set width=0, height=0 (resolved during thumbnail generation). Use `self.batch_size` for batching. + - `build_annotation_batches(file_path, dataset_id, categories, split)`: Read JSONL again. For each line, extract label using same flexible keys. Create annotation row with sentinel bbox values: `bbox_x=0.0, bbox_y=0.0, bbox_w=0.0, bbox_h=0.0, area=0.0, is_crowd=False, source="ground_truth", confidence=None, metadata=None`. Sample IDs must match those from `build_image_batches` (same `f"{split}_{i}"` pattern). Annotation IDs: `f"{split}_ann_{i}"` or `f"ann_{i}"`. 
+ + Handle edge cases: + - Skip empty lines + - If `label` is an array, emit one annotation row per label (forward-compatible for multi-label) + - Use `"unknown"` as fallback label if no label key found + + + - `python -c "from app.ingestion.classification_jsonl_parser import ClassificationJSONLParser; p = ClassificationJSONLParser(); print(p.format_name)"` prints `classification_jsonl` + - `python -c "from app.models.dataset import DatasetResponse; print(DatasetResponse.model_fields.keys())"` includes `dataset_type` + - All existing tests pass: `cd app && python -m pytest tests/ -x -q` + + ClassificationJSONLParser exists with parse_categories, build_image_batches, build_annotation_batches producing sentinel bbox annotations. DatasetResponse includes dataset_type. Schema migration adds dataset_type column. + + + + Task 2: FolderScanner detection, IngestionService dispatch, and API endpoints + + app/services/folder_scanner.py + app/services/ingestion.py + app/routers/ingestion.py + app/routers/datasets.py + app/routers/annotations.py + app/routers/statistics.py + + + **1. 
FolderScanner** (`app/services/folder_scanner.py`): + Extend `scan()` to detect classification JSONL layouts BEFORE trying COCO layouts (classification is more specific -- a JSONL file is never COCO): + + In the local scan path, add before `_try_layout_b`: + ```python + splits = self._try_layout_d(Path(resolved), warnings) + if not splits: + splits = self._try_layout_e(Path(resolved), warnings) + if splits: + return ScanResult( + root_path=resolved, + dataset_name=_basename(resolved), + format="classification_jsonl", + splits=splits, + warnings=warnings, + ) + ``` + + Add two new layout detectors: + + `_try_layout_d(root, warnings)` -- **Split directories with JSONL + images**: + - Use existing `_detect_split_dirs()` to find split dirs + - In each split dir, look for `.jsonl` files + - For each `.jsonl` file, call `_is_classification_jsonl(file_path)` (new static method) + - If valid, count images in the split dir, create DetectedSplit + - Return list of splits + + `_try_layout_e(root, warnings)` -- **Flat JSONL at root**: + - Look for `.jsonl` files in root (no recursion) + - Check if any are classification JSONL via `_is_classification_jsonl()` + - Image dir: prefer `images/` subdir, else root itself + - Return single-element split list with name=root.name + + `_is_classification_jsonl(file_path)` -- **Static method**: + - Open file, read first 5 non-empty lines + - Parse each as JSON + - Return True if line has (`filename` or `file_name` or `image` or `path`) AND (`label` or `class` or `category` or `class_name`) AND NOT (`bbox` or `annotations`) + - Catch all exceptions, return False + + For GCS: Add similar classification detection in `_scan_gcs()` -- check for `.jsonl` files before `.json` files. Use `_is_classification_jsonl_remote()` that reads via `self.storage.open()`. + + **2. 
IngestionService** (`app/services/ingestion.py`): + - Add import for ClassificationJSONLParser at top + - In `ingest_with_progress()`, replace hardcoded `COCOParser(batch_size=1000)` with format-based dispatch: + ```python + if format == "coco": + parser = COCOParser(batch_size=1000) + elif format == "classification_jsonl": + parser = ClassificationJSONLParser(batch_size=1000) + else: + raise ValueError(f"Unsupported format: {format}") + ``` + - After the dataset INSERT (step 4), for new datasets set dataset_type: + ```python + dataset_type = "classification" if format == "classification_jsonl" else "detection" + ``` + Include `dataset_type` in the INSERT VALUES. For the UPDATE path (existing dataset), no change needed -- dataset_type is set on first insert. + - Update the existing INSERT INTO datasets to include the `dataset_type` column. The INSERT currently uses positional VALUES -- add `dataset_type` after `prediction_count` (or adjust the column list), taking care to match column order. + + **3. Ingestion router** (`app/routers/ingestion.py`): + - The `/ingestion/import` endpoint must thread `format` through to `ingest_with_progress`; it currently may not. Ensure the ImportRequest or the stored ScanResult format is threaded through. The simplest approach: add a `format: str = "coco"` field to the `ImportRequest` model in `app/models/scan.py`. + - The router calls `ingest_splits_with_progress()` (not `ingest_with_progress` directly), so the full threading chain is: + 1. Add `format: str = "coco"` param to `ingest_splits_with_progress()` signature in `app/services/ingestion.py` + 2. Inside `ingest_splits_with_progress()`, pass `format=format` to each `self.ingest_with_progress(...)` call in the loop (replacing the hardcoded `format="coco"` default) + 3. In the router's import endpoint, pass `request.format` (or `scan_result.format`) to `ingest_splits_with_progress(format=...)` + + **4.
Datasets router** (`app/routers/datasets.py`): + - Ensure the `GET /datasets` and `GET /datasets/{id}` queries include `dataset_type` in SELECT. Currently using `SELECT *` or explicit columns -- add `dataset_type` to the result mapping into `DatasetResponse`. + + **5. Annotations router** (`app/routers/annotations.py`): + - Add a new endpoint: `PATCH /annotations/{annotation_id}/category` accepting `{"category_name": "new_label"}`. It should UPDATE the annotation's `category_name` in DuckDB. Return 200 with the updated annotation. Use a simple Pydantic model `CategoryUpdateRequest(BaseModel): category_name: str`. + + **6. Statistics router** (`app/routers/statistics.py`): + - For classification datasets, the `gt_annotations` stat should reflect "labeled images" (count of distinct sample_ids with GT annotations) rather than raw annotation count. Check `dataset_type` from the datasets table, and if `"classification"`, adjust the query. This is a minor conditional in the existing statistics aggregation. + + + - Create a test JSONL file and verify scanner detection: + ```bash + mkdir -p /tmp/test_cls/train && echo '{"filename": "a.jpg", "label": "cat"}' > /tmp/test_cls/train/annotations.jsonl && touch /tmp/test_cls/train/a.jpg + python -c " + from app.services.folder_scanner import FolderScanner + s = FolderScanner() + r = s.scan('/tmp/test_cls') + print(f'format={r.format}, splits={len(r.splits)}, split_name={r.splits[0].name if r.splits else None}') + assert r.format == 'classification_jsonl' + print('PASS') + " + ``` + - All existing tests pass: `cd app && python -m pytest tests/ -x -q` + - Server starts without errors: `cd app && timeout 5 python -c "from app.main import app; print('OK')" 2>&1 || true` + + FolderScanner detects classification JSONL layouts (D and E). IngestionService dispatches to ClassificationJSONLParser for classification_jsonl format and stores dataset_type. ImportRequest carries format. PATCH /annotations/{id}/category endpoint exists. 
Statistics endpoint is classification-aware. GET /datasets returns dataset_type. All existing tests pass. + + + + + +1. Scanner returns format="classification_jsonl" for split-dir JSONL layout +2. Scanner returns format="classification_jsonl" for flat JSONL layout +3. Scanner still returns format="coco" for existing COCO layouts (no regression) +4. Dataset INSERT includes dataset_type="classification" for classification imports +5. DatasetResponse includes dataset_type field +6. PATCH /annotations/{id}/category updates category_name +7. All existing tests pass + + + +- Classification JSONL folders are auto-detected by the scanner +- Parser produces correct annotations with sentinel bbox values +- dataset_type is stored and returned via API +- Category update endpoint works for classification label editing +- Zero regressions in existing detection workflow + + + +After completion, create `.planning/phases/15-classification-ingestion-display/15-01-SUMMARY.md` + diff --git a/.planning/phases/15-classification-ingestion-display/15-01-SUMMARY.md b/.planning/phases/15-classification-ingestion-display/15-01-SUMMARY.md new file mode 100644 index 0000000..e131558 --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-01-SUMMARY.md @@ -0,0 +1,131 @@ +--- +phase: 15-classification-ingestion-display +plan: 01 +subsystem: api, ingestion, database +tags: [classification, jsonl, parser, duckdb, fastapi, sentinel-bbox] + +requires: + - phase: 07-evaluation + provides: "statistics router and evaluation service" + - phase: 02-ingestion + provides: "BaseParser, COCOParser, FolderScanner, IngestionService" +provides: + - ClassificationJSONLParser with sentinel bbox values + - FolderScanner classification JSONL detection (layouts D and E) + - dataset_type column and API field + - PATCH /annotations/{id}/category endpoint + - Format-based parser dispatch in IngestionService + - Classification-aware statistics +affects: [15-02, 16-classification-evaluation, 
frontend-classification-display] + +tech-stack: + added: [] + patterns: [sentinel-bbox-for-classification, format-based-parser-dispatch, layout-detection-priority] + +key-files: + created: + - app/ingestion/classification_jsonl_parser.py + modified: + - app/repositories/duckdb_repo.py + - app/models/dataset.py + - app/models/scan.py + - app/models/annotation.py + - app/ingestion/base_parser.py + - app/services/folder_scanner.py + - app/services/ingestion.py + - app/routers/ingestion.py + - app/routers/datasets.py + - app/routers/annotations.py + - app/routers/statistics.py + +key-decisions: + - "Classification JSONL layouts checked before COCO layouts since JSONL is never COCO" + - "Sentinel bbox values (all 0.0) for classification annotations to avoid nullable columns" + - "Format string threaded through ImportRequest -> ingest_splits_with_progress -> ingest_with_progress" + - "Classification gt_annotations stat uses COUNT(DISTINCT sample_id) instead of COUNT(*)" + +patterns-established: + - "Format dispatch: IngestionService selects parser by format string, extensible for future formats" + - "Layout priority: classification-specific layouts tested before generic COCO layouts" + +duration: 5min +completed: 2026-02-18 +--- + +# Phase 15 Plan 01: Classification Ingestion & Backend Summary + +**ClassificationJSONLParser with sentinel bbox values, FolderScanner auto-detection of JSONL layouts, format-based parser dispatch, and category update endpoint** + +## Performance + +- **Duration:** 5 min +- **Started:** 2026-02-19T02:13:50Z +- **Completed:** 2026-02-19T02:18:51Z +- **Tasks:** 2 +- **Files modified:** 12 + +## Accomplishments +- ClassificationJSONLParser that produces annotations with sentinel bbox values (0.0) and supports multi-label via array labels +- FolderScanner detects classification JSONL in split dirs (Layout D) and flat (Layout E) with GCS support +- Format-based parser dispatch in IngestionService with dataset_type stored on dataset record +- PATCH 
/annotations/{id}/category endpoint for classification label editing +- Classification-aware statistics (gt_annotations = distinct labeled images) + +## Task Commits + +Each task was committed atomically: + +1. **Task 1: Schema migration, Pydantic models, and ClassificationJSONLParser** - `5264e51` (feat) +2. **Task 2: FolderScanner detection, IngestionService dispatch, and API endpoints** - `8af8a11` (feat) + +## Files Created/Modified +- `app/ingestion/classification_jsonl_parser.py` - New parser extending BaseParser with sentinel bbox annotations +- `app/repositories/duckdb_repo.py` - dataset_type column migration +- `app/models/dataset.py` - dataset_type field on DatasetResponse +- `app/models/scan.py` - format field on ImportRequest +- `app/models/annotation.py` - CategoryUpdateRequest model +- `app/ingestion/base_parser.py` - image_dir parameter on build_image_batches ABC +- `app/services/folder_scanner.py` - Layout D/E detectors, GCS classification detection, _is_classification_jsonl +- `app/services/ingestion.py` - Format dispatch, dataset_type on INSERT, format threading +- `app/routers/ingestion.py` - Format passthrough, .jsonl in browse, updated error message +- `app/routers/datasets.py` - dataset_type in SELECT and DatasetResponse mapping +- `app/routers/annotations.py` - PATCH /annotations/{id}/category endpoint +- `app/routers/statistics.py` - Classification-aware gt_annotations aggregation + +## Decisions Made +- Classification JSONL layouts checked before COCO layouts since JSONL files are never COCO (more specific detection first) +- Used sentinel bbox values (all 0.0) for classification annotations, matching the project decision to avoid nullable columns +- gt_annotations stat for classification uses COUNT(DISTINCT sample_id) to represent "labeled images" rather than raw annotation count +- Added .jsonl to browse endpoint extensions for file navigation + +## Deviations from Plan + +### Auto-fixed Issues + +**1. 
[Rule 2 - Missing Critical] Added .jsonl to browse endpoint file extensions** +- **Found during:** Task 2 (API endpoints) +- **Issue:** Browse endpoint only showed .json files, users couldn't see .jsonl files when navigating +- **Fix:** Added ".jsonl" to _BROWSE_EXTENSIONS set +- **Files modified:** app/routers/ingestion.py +- **Verification:** Import and app start verified +- **Committed in:** 8af8a11 (Task 2 commit) + +--- + +**Total deviations:** 1 auto-fixed (1 missing critical) +**Impact on plan:** Minor addition necessary for classification JSONL usability. No scope creep. + +## Issues Encountered +None + +## User Setup Required +None - no external service configuration required. + +## Next Phase Readiness +- Backend fully supports classification dataset ingestion, ready for frontend display work in Plan 02 +- Parser dispatch is extensible for future formats (YOLO, VOC, etc.) +- dataset_type field available for frontend to branch display logic + +--- +*Phase: 15-classification-ingestion-display* +*Completed: 2026-02-18* diff --git a/.planning/phases/15-classification-ingestion-display/15-02-PLAN.md b/.planning/phases/15-classification-ingestion-display/15-02-PLAN.md new file mode 100644 index 0000000..3c9997c --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-02-PLAN.md @@ -0,0 +1,232 @@ +--- +phase: 15-classification-ingestion-display +plan: 02 +type: execute +wave: 2 +depends_on: ["15-01"] +files_modified: + - frontend/src/types/dataset.ts + - frontend/src/types/scan.ts + - frontend/src/app/datasets/[datasetId]/page.tsx + - frontend/src/components/grid/grid-cell.tsx + - frontend/src/components/grid/image-grid.tsx + - frontend/src/components/detail/sample-modal.tsx + - frontend/src/components/detail/annotation-list.tsx + - frontend/src/components/stats/stats-dashboard.tsx + - frontend/src/components/stats/annotation-summary.tsx + - frontend/src/components/ingest/scan-results.tsx +autonomous: true + +must_haves: + truths: + - "User 
sees class label badges on grid thumbnails for classification datasets instead of bbox overlays" + - "User sees GT class label prominently in sample detail modal with a dropdown to change it" + - "Statistics dashboard shows 'Labeled Images' and 'Classes' instead of 'GT Annotations' and 'Categories' for classification datasets" + - "Detection-only elements (bbox area histogram, IoU slider) are hidden for classification datasets" + - "Scan results page shows 'Classification JSONL' format badge for classification datasets" + artifacts: + - path: "frontend/src/types/dataset.ts" + provides: "Dataset type with dataset_type field" + contains: "dataset_type" + - path: "frontend/src/components/grid/grid-cell.tsx" + provides: "ClassBadge rendering for classification datasets" + contains: "ClassBadge" + - path: "frontend/src/components/detail/sample-modal.tsx" + provides: "Class label display and dropdown editor" + contains: "classification" + - path: "frontend/src/components/stats/stats-dashboard.tsx" + provides: "Detection-only tab hiding for classification" + contains: "datasetType" + key_links: + - from: "frontend/src/app/datasets/[datasetId]/page.tsx" + to: "frontend/src/components/grid/image-grid.tsx" + via: "datasetType prop threading" + pattern: "datasetType" + - from: "frontend/src/components/grid/grid-cell.tsx" + to: "frontend/src/types/dataset.ts" + via: "dataset_type determines badge vs overlay" + pattern: "classification" + - from: "frontend/src/components/detail/sample-modal.tsx" + to: "PATCH /annotations/{id}/category" + via: "category update mutation" + pattern: "category" +--- + + +Adapt the frontend to display classification datasets appropriately: class label badges on grid, class label with dropdown in detail modal, classification-aware statistics, and format badge in scan results. + +Purpose: Users browsing classification datasets see class-appropriate UI instead of detection-oriented displays (no bbox overlays, no area histograms). 
Classification labels are the primary annotation visual. +Output: Updated grid, modal, stats, and scan results components with datasetType-aware branching. + + + +@/Users/ortizeg/.claude/get-shit-done/workflows/execute-plan.md +@/Users/ortizeg/.claude/get-shit-done/templates/summary.md + + + +@.planning/PROJECT.md +@.planning/ROADMAP.md +@.planning/phases/15-classification-ingestion-display/15-RESEARCH.md +@.planning/phases/15-classification-ingestion-display/15-01-SUMMARY.md +@frontend/src/types/dataset.ts +@frontend/src/types/scan.ts +@frontend/src/app/datasets/[datasetId]/page.tsx +@frontend/src/components/grid/grid-cell.tsx +@frontend/src/components/grid/image-grid.tsx +@frontend/src/components/detail/sample-modal.tsx +@frontend/src/components/detail/annotation-list.tsx +@frontend/src/components/stats/stats-dashboard.tsx +@frontend/src/components/stats/annotation-summary.tsx +@frontend/src/components/ingest/scan-results.tsx + + + + + + Task 1: Types, page threading, grid class badges, and scan results format badge + + frontend/src/types/dataset.ts + frontend/src/types/scan.ts + frontend/src/app/datasets/[datasetId]/page.tsx + frontend/src/components/grid/grid-cell.tsx + frontend/src/components/grid/image-grid.tsx + frontend/src/components/ingest/scan-results.tsx + + + **1. TypeScript types**: + - `frontend/src/types/dataset.ts`: Add `dataset_type: string;` to the `Dataset` interface (after `prediction_count`). Default is `"detection"`. + - `frontend/src/types/scan.ts`: No structural change needed -- `ScanResult.format` is already a string and will carry `"classification_jsonl"`. + + **2. Dataset page prop threading** (`frontend/src/app/datasets/[datasetId]/page.tsx`): + - The page fetches the dataset object which now includes `dataset_type`. + - Thread `datasetType={dataset.dataset_type}` as a prop to `<ImageGrid>`, `<SampleModal>`, and `<StatsDashboard>` (and any stats sub-components that need it). + - Also thread it to any component that renders differently for classification vs detection. + + **3.
Grid class badges** (`frontend/src/components/grid/grid-cell.tsx`): + - Add `datasetType?: string` prop to GridCell. + - When `datasetType === "classification"`: + - Do NOT render `<AnnotationOverlay>` (skip bbox rendering entirely) + - Instead render a `ClassBadge` inline component: + ```tsx + function ClassBadge({ label }: { label?: string }) { + if (!label) return null; + return ( +
+ <span className="class-badge">
+ {label}
+ </span>
+ ); + } + ``` + - Extract the GT annotation's `category_name` from the annotations map for this sample: `const gtAnnotation = annotations?.find(a => a.source === "ground_truth");` + - Render `<ClassBadge label={gtAnnotation?.category_name} />` + - When `datasetType !== "classification"` (or undefined): render existing `<AnnotationOverlay>` as before (no change). + + **4. ImageGrid prop threading** (`frontend/src/components/grid/image-grid.tsx`): + - Add `datasetType?: string` prop to ImageGrid. + - Pass it through to each `<GridCell>`. + + **5. Scan results format badge** (`frontend/src/components/ingest/scan-results.tsx`): + - Where the format is displayed, show "Classification JSONL" when `format === "classification_jsonl"` and "COCO" when `format === "coco"`. + - Use the existing badge/styling pattern (likely a colored span). Example: a small badge showing the format type near the dataset name. +
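A minimal sketch of that format-to-display-label mapping -- `FORMAT_LABELS` and `formatLabel` are illustrative names for this plan, not existing project code:

```typescript
// Map raw scanner format strings to human-friendly badge labels.
// Formats without an entry fall back to an uppercased raw string.
const FORMAT_LABELS: Record<string, string> = {
  coco: "COCO",
  classification_jsonl: "Classification JSONL",
};

function formatLabel(format: string): string {
  return FORMAT_LABELS[format] ?? format.toUpperCase();
}

console.log(formatLabel("classification_jsonl")); // → Classification JSONL
```

Keeping the mapping in one table means future formats (YOLO, VOC) only add an entry rather than another conditional in the badge component.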
+ + - `cd frontend && npx tsc --noEmit` passes without errors + - `cd frontend && npm run build` succeeds + - Grep confirms: `grep -r "ClassBadge" frontend/src/components/grid/grid-cell.tsx` + - Grep confirms: `grep -r "datasetType" frontend/src/app/datasets/*/page.tsx` + + Dataset type flows from API through page to grid. Classification datasets show class label badges instead of bbox overlays. Scan results show format badge. TypeScript compiles cleanly. +
+ + Task 2: Detail modal class label display/edit and classification-aware statistics + + frontend/src/components/detail/sample-modal.tsx + frontend/src/components/detail/annotation-list.tsx + frontend/src/components/stats/stats-dashboard.tsx + frontend/src/components/stats/annotation-summary.tsx + + + **1. Sample modal** (`frontend/src/components/detail/sample-modal.tsx`): + - Add `datasetType?: string` prop. + - Pass `datasetType` down to child components: `<AnnotationList datasetType={datasetType} />` in the render. + - When `datasetType === "classification"`: + - Show a prominent class label section above or instead of the annotation overlay. Display format: + ``` + Class: [dropdown with all categories] + ``` + - Extract GT annotation: `const gtAnnotation = annotations?.find(a => a.source === "ground_truth");` + - If predictions exist, also show: `Predicted: [predicted class label]` with confidence if available. + - The class dropdown uses the categories list (from `useFilterFacets` or a categories fetch). On change, call `PATCH /annotations/{gtAnnotation.id}/category` with the new `category_name`. + - Create a TanStack Query mutation hook inline or in a hooks file: `usePatchCategory` that calls `apiPatch(\`/annotations/\${annotationId}/category\`, { category_name })` and invalidates the annotation queries on success. + - Do NOT render the annotation overlay / bounding box editor for classification datasets. Hide the bbox editing canvas (react-konva editor). The image should display without any overlay. + - When `datasetType !== "classification"`: render everything as before (no change). + + **2. Annotation list** (`frontend/src/components/detail/annotation-list.tsx`): + - Add `datasetType?: string` prop. + - When `datasetType === "classification"`: + - Hide the Bounding Box columns (bbox_x, bbox_y, bbox_w, bbox_h) and Area column from the table. + - Show: Class, Source, Confidence columns only. + - When detection: show all columns as before. + + **3.
Stats dashboard** (`frontend/src/components/stats/stats-dashboard.tsx`): + - Add `datasetType?: string` prop. + - When `datasetType === "classification"`: + - Hide the "Evaluation" tab entirely (no IoU-based evaluation for classification in this phase). + - Hide the "Error Analysis" sub-panel (detection-specific error categories: TP/FP/FN based on IoU). + - Keep: Class Distribution chart, Split Breakdown chart, Summary cards (with relabeled metrics). + - Hide: Any bbox area histogram or IoU-related controls. + - When detection: show all tabs/panels as before. + + **4. Annotation summary** (`frontend/src/components/stats/annotation-summary.tsx`): + - Add `datasetType?: string` prop. + - When `datasetType === "classification"`: + - Swap summary card labels: + - "GT Annotations" -> "Labeled Images" + - "Categories" -> "Classes" + - Keep "Total Images" and "Predictions" labels as-is. + - When detection: show original labels. + - Use a conditional card definitions array pattern: + ```tsx + const cards = datasetType === "classification" + ? CLASSIFICATION_CARDS + : DETECTION_CARDS; + ``` + + + - `cd frontend && npx tsc --noEmit` passes without errors + - `cd frontend && npm run build` succeeds + - Grep confirms classification branching: `grep -r "classification" frontend/src/components/detail/sample-modal.tsx` + - Grep confirms stats adaptation: `grep -r "classification" frontend/src/components/stats/stats-dashboard.tsx` + - Grep confirms annotation-summary adaptation: `grep -r "Labeled Images" frontend/src/components/stats/annotation-summary.tsx` + + Detail modal shows class label with editable dropdown for classification datasets. Annotation list hides bbox columns for classification. Stats dashboard hides detection-only tabs/panels. Summary cards use classification-appropriate labels. PATCH mutation for category update wired. All TypeScript compiles cleanly. + + +
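The conditional card-definitions pattern above can be fleshed out as follows; the card keys and array contents are illustrative, only the label swaps come from the plan:

```typescript
// Sketch of the conditional card-definition arrays for the summary
// dashboard. Detection and classification share keys; only labels differ.
type CardDef = { key: string; label: string };

const DETECTION_CARDS: CardDef[] = [
  { key: "total_images", label: "Total Images" },
  { key: "gt_annotations", label: "GT Annotations" },
  { key: "categories", label: "Categories" },
  { key: "predictions", label: "Predictions" },
];

const CLASSIFICATION_CARDS: CardDef[] = [
  { key: "total_images", label: "Total Images" },
  { key: "gt_annotations", label: "Labeled Images" },
  { key: "categories", label: "Classes" },
  { key: "predictions", label: "Predictions" },
];

function summaryCards(datasetType?: string): CardDef[] {
  return datasetType === "classification" ? CLASSIFICATION_CARDS : DETECTION_CARDS;
}
```

Because the keys stay identical, the rendering loop is unchanged; only the label lookup branches on dataset type.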
+ + +1. Classification dataset grid shows class label badges (no bbox overlays) +2. Detail modal shows "Class: [dropdown]" for classification, with working category edit +3. Stats dashboard hides Evaluation tab and Error Analysis for classification +4. Summary cards show "Labeled Images" and "Classes" for classification +5. Annotation list hides bbox/area columns for classification +6. Scan results show "Classification JSONL" format badge +7. Detection datasets are completely unaffected (no regression) +8. TypeScript compiles and Next.js builds succeed + + + +- Classification datasets display class badges on grid thumbnails +- Detail modal has class label display with dropdown editor that persists changes +- Statistics dashboard shows only classification-relevant metrics +- Detection workflow is unchanged +- Frontend builds without errors + + + +After completion, create `.planning/phases/15-classification-ingestion-display/15-02-SUMMARY.md` + diff --git a/.planning/phases/15-classification-ingestion-display/15-02-SUMMARY.md b/.planning/phases/15-classification-ingestion-display/15-02-SUMMARY.md new file mode 100644 index 0000000..3127e00 --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-02-SUMMARY.md @@ -0,0 +1,111 @@ +--- +phase: 15-classification-ingestion-display +plan: 02 +subsystem: frontend, ui +tags: [classification, react, tanstack-query, dataset-type, class-badge, dropdown-editor] + +requires: + - phase: 15-classification-ingestion-display + plan: 01 + provides: "dataset_type field, PATCH /annotations/{id}/category, classification-aware statistics" +provides: + - ClassBadge grid overlay for classification datasets + - Class label dropdown editor in detail modal with PATCH mutation + - Classification-aware statistics dashboard (hidden detection tabs) + - Classification-appropriate summary card labels + - Format badge in scan results for classification JSONL +affects: [16-classification-evaluation, frontend-polish] + +tech-stack: + 
added: [] + patterns: [datasetType-prop-threading, isClassification-branching-at-component-boundaries] + +key-files: + created: [] + modified: + - frontend/src/types/dataset.ts + - frontend/src/app/datasets/[datasetId]/page.tsx + - frontend/src/components/grid/grid-cell.tsx + - frontend/src/components/grid/image-grid.tsx + - frontend/src/components/ingest/scan-results.tsx + - frontend/src/components/detail/sample-modal.tsx + - frontend/src/components/detail/annotation-list.tsx + - frontend/src/components/stats/stats-dashboard.tsx + - frontend/src/components/stats/annotation-summary.tsx + +key-decisions: + - "Thread datasetType from page level, branch at component boundaries with isClassification flag" + - "Hide entire edit toolbar and annotation editor for classification (no bbox editing needed)" + - "Hide Evaluation, Error Analysis, Worst Images, and Intelligence tabs for classification (IoU-based)" + - "Keep Near Duplicates tab visible for classification (embedding-based, not IoU-dependent)" + +patterns-established: + - "datasetType prop threading: page fetches dataset, threads type to all children" + - "isClassification branching: components check datasetType === 'classification' to show/hide detection UI" + +duration: 5min +completed: 2026-02-18 +--- + +# Phase 15 Plan 02: Classification Frontend Display Summary + +**Classification-aware grid badges, modal class dropdown editor, and detection-tab hiding via datasetType prop threading** + +## Performance + +- **Duration:** 5 min +- **Started:** 2026-02-19T02:20:50Z +- **Completed:** 2026-02-19T02:25:44Z +- **Tasks:** 2 +- **Files modified:** 9 + +## Accomplishments +- Grid shows class label badges instead of bbox overlays for classification datasets +- Detail modal displays class dropdown editor with PATCH category mutation and predicted class with confidence +- Statistics dashboard hides detection-only tabs (Evaluation, Error Analysis, Worst Images, Intelligence) +- Summary cards show "Labeled Images" and 
"Classes" labels for classification datasets +- Annotation list hides bbox and area columns for classification +- Scan results show "Classification JSONL" format badge + +## Task Commits + +Each task was committed atomically: + +1. **Task 1: Types, page threading, grid class badges, and scan results format badge** - `b96ce5e` (feat) +2. **Task 2: Detail modal class label display/edit and classification-aware statistics** - `e7ad776` (feat) + +## Files Created/Modified +- `frontend/src/types/dataset.ts` - Added dataset_type field to Dataset interface +- `frontend/src/app/datasets/[datasetId]/page.tsx` - Thread datasetType prop to ImageGrid, SampleModal, StatsDashboard +- `frontend/src/components/grid/grid-cell.tsx` - ClassBadge component, classification branching in overlay +- `frontend/src/components/grid/image-grid.tsx` - datasetType prop acceptance and passthrough +- `frontend/src/components/ingest/scan-results.tsx` - "Classification JSONL" friendly format badge +- `frontend/src/components/detail/sample-modal.tsx` - Class dropdown editor, PATCH mutation, hide bbox editor/toolbar +- `frontend/src/components/detail/annotation-list.tsx` - Hide bbox/area columns for classification +- `frontend/src/components/stats/stats-dashboard.tsx` - Hide detection-only tabs for classification +- `frontend/src/components/stats/annotation-summary.tsx` - Classification card labels (Labeled Images, Classes) + +## Decisions Made +- Thread datasetType from page level, branch at component boundaries -- consistent pattern, easy to test +- Hide entire edit toolbar and annotation editor for classification (no bounding boxes to edit) +- Hide Evaluation/Error Analysis/Worst Images/Intelligence tabs for classification (all IoU-based detection features) +- Keep Near Duplicates tab visible for classification since it uses embeddings, not IoU + +## Deviations from Plan + +None - plan executed exactly as written. 
+ +## Issues Encountered +None + +## User Setup Required +None - no external service configuration required. + +## Next Phase Readiness +- Frontend fully supports classification dataset display, ready for classification evaluation in Phase 16 +- datasetType prop threading pattern established for any future dataset-type-specific UI +- PATCH /annotations/{id}/category wired end-to-end for label editing + +--- +*Phase: 15-classification-ingestion-display* +*Completed: 2026-02-18* diff --git a/.planning/phases/15-classification-ingestion-display/15-RESEARCH.md b/.planning/phases/15-classification-ingestion-display/15-RESEARCH.md new file mode 100644 index 0000000..3d60dc5 --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-RESEARCH.md @@ -0,0 +1,555 @@ +# Phase 15: Classification Ingestion & Display - Research + +**Researched:** 2026-02-18 +**Domain:** Classification dataset ingestion, schema extension, frontend display adaptation +**Confidence:** HIGH (this is internal codebase extension, not new technology) + +## Summary + +Phase 15 adds classification dataset support to a codebase currently built exclusively for object detection. The work spans four layers: (1) a new JSONL annotation parser and format auto-detection in the ingestion pipeline, (2) schema changes to track dataset type and store classification annotations using sentinel bbox values, (3) frontend grid/modal display changes to show class labels instead of bounding boxes, and (4) statistics dashboard adaptation to hide detection-only metrics. + +The codebase is well-structured with clear separation of concerns -- parsers in `app/ingestion/`, Pydantic models in `app/models/`, services in `app/services/`, and component-per-feature in `frontend/src/components/`. The existing `BaseParser` ABC and streaming batch pattern provide a natural extension point for a classification JSONL parser. 
The sentinel bbox approach (bbox values = 0.0) means the annotations table schema is untouched, avoiding null guards in 30+ SQL queries and frontend components. + +**Primary recommendation:** Extend the existing parser registry pattern with a `ClassificationJSONLParser` that produces annotation rows with sentinel bbox values (0.0), add `dataset_type VARCHAR DEFAULT 'detection'` to the datasets table, and use the `datasetType` prop threaded from the page level to branch rendering at component boundaries (grid cell, sample modal, stats dashboard). + +## Standard Stack + +### Core (already in use -- no new dependencies) + +| Library | Purpose | Status | +|---------|---------|--------| +| DuckDB | Schema storage, SQL queries | In use | +| FastAPI | API layer | In use | +| Pydantic | Request/response models | In use | +| ijson | Streaming JSON parsing | In use (COCO parser) | +| pandas | DataFrame batch construction | In use | +| Next.js + React | Frontend framework | In use | +| Zustand | State management | In use | +| TanStack Query | Data fetching/caching | In use | +| Recharts | Charts (class distribution) | In use | + +### Supporting (no new libraries needed) + +This phase requires zero new dependencies. Classification JSONL files are simple enough to parse with Python's built-in `json` module line-by-line, or with the existing `ijson` dependency if streaming is desired. The frontend changes are pure React component branching. + +### Alternatives Considered + +| Instead of | Could Use | Tradeoff | +|------------|-----------|----------| +| Sentinel bbox (0.0) | Nullable bbox columns | Nullable requires 30+ null guards in SQL queries, filter builder, evaluation, frontend annotation types. Sentinel avoids this entirely. | +| Separate classification_annotations table | Shared annotations table with sentinels | Separate table would require duplicating all annotation queries, filter logic, statistics queries. Shared table is simpler. 
| +| Dynamic format detection at query time | Stored `dataset_type` column | Stored column is a single lookup; dynamic detection requires scanning annotations for non-zero bboxes every time. | + +## Architecture Patterns + +### Recommended Change Map + +``` +Backend: + app/ingestion/ + classification_jsonl_parser.py # NEW: ClassificationJSONLParser + app/services/ + folder_scanner.py # MODIFY: detect JSONL + images layout + ingestion.py # MODIFY: dispatch to parser by format + evaluation.py # LEAVE (classification eval is Phase 16+) + app/repositories/ + duckdb_repo.py # MODIFY: add dataset_type column + app/models/ + dataset.py # MODIFY: add dataset_type field + scan.py # MODIFY: format can be "classification_jsonl" + annotation.py # NO CHANGE (sentinel bbox values fit existing schema) + statistics.py # POSSIBLY MODIFY: add labeled_images_count + app/routers/ + ingestion.py # MODIFY: error message wording + statistics.py # MODIFY: classification-aware summary stats + +Frontend: + types/dataset.ts # MODIFY: add dataset_type field + types/scan.ts # MODIFY: format can include "classification_jsonl" + app/datasets/[datasetId]/page.tsx # MODIFY: thread datasetType prop + components/grid/grid-cell.tsx # MODIFY: show class badge instead of bbox overlay + components/grid/annotation-overlay.tsx # NO CHANGE (just not rendered for classification) + components/detail/sample-modal.tsx # MODIFY: show class label + dropdown + components/detail/annotation-list.tsx # MODIFY: hide bbox columns for classification + components/stats/stats-dashboard.tsx # MODIFY: hide detection-only tabs + components/stats/annotation-summary.tsx # MODIFY: classification-appropriate labels + components/ingest/scan-results.tsx # MODIFY: show format badge for classification +``` + +### Pattern 1: Sentinel BBox Values for Classification + +**What:** Classification annotations use bbox_x=0, bbox_y=0, bbox_w=0, bbox_h=0, area=0 as sentinel values. The `category_name` field carries the class label. 
One annotation per sample (for single-label classification). + +**When to use:** When inserting classification annotations into the shared annotations table. + +**Example:** +```python +# Classification annotation row (sentinel bboxes) +{ + "id": str(uuid.uuid4()), + "dataset_id": dataset_id, + "sample_id": sample_id, + "category_name": "dog", # The class label + "bbox_x": 0.0, # Sentinel + "bbox_y": 0.0, # Sentinel + "bbox_w": 0.0, # Sentinel + "bbox_h": 0.0, # Sentinel + "area": 0.0, # Sentinel + "is_crowd": False, + "source": "ground_truth", + "confidence": None, + "metadata": None, +} +``` + +### Pattern 2: Parser Dispatch by Format + +**What:** The IngestionService currently hardcodes `COCOParser()`. Extend to dispatch by format string. + +**When to use:** When `ingest_with_progress` is called. + +**Example:** +```python +# In IngestionService.ingest_with_progress(): +if format == "coco": + parser = COCOParser(batch_size=1000) +elif format == "classification_jsonl": + parser = ClassificationJSONLParser(batch_size=1000) +else: + raise ValueError(f"Unsupported format: {format}") +``` + +### Pattern 3: Format Auto-Detection in FolderScanner + +**What:** The FolderScanner currently only detects COCO JSON files. Extend to detect classification JSONL files. + +**When to use:** During `FolderScanner.scan()`. + +**Detection heuristic:** Look for `.jsonl` files in the directory tree. A classification JSONL file contains lines like: +```json +{"filename": "image001.jpg", "label": "dog"} +``` +Peek at the first few lines: if they parse as JSON with `filename` and `label` keys (no `bbox`/`annotations` key), classify as `classification_jsonl`. 
+
+**Example:**
+```python
+@staticmethod
+def _is_classification_jsonl(file_path: Path) -> bool:
+    """Check if a file is a classification JSONL annotation file."""
+    try:
+        with open(file_path) as f:
+            for i, line in enumerate(f):
+                if i >= 5:
+                    break
+                line = line.strip()
+                if not line:
+                    continue
+                obj = json.loads(line)
+                if "label" in obj and ("filename" in obj or "file_name" in obj):
+                    if "bbox" not in obj and "annotations" not in obj:
+                        return True
+        return False
+    except Exception:
+        return False
+```
+
+### Pattern 4: datasetType Prop Threading
+
+**What:** The dataset page fetches `dataset.dataset_type` and threads it as a prop to child components. Components branch at their boundary rather than deep inside.
+
+**When to use:** Any component whose rendering differs between detection and classification.
+
+**Example:**
+```tsx
+// page.tsx -- props other than datasetType are illustrative
+<ImageGrid datasetType={dataset.dataset_type} />
+<SampleModal datasetType={dataset.dataset_type} />
+<StatsDashboard datasetType={dataset.dataset_type} />
+
+// grid-cell.tsx
+if (datasetType === "classification") {
+  // Show class label badge instead of AnnotationOverlay
+  const gtAnnotation = annotations.find(a => a.source === "ground_truth");
+  return <ClassBadge label={gtAnnotation?.category_name} />;
+} else {
+  return <AnnotationOverlay annotations={annotations} />;
+}
+```
+
+### Anti-Patterns to Avoid
+
+- **Checking dataset_type deep inside components:** Branch at component boundaries (GridCell, SampleModal, StatsDashboard), not inside utility functions or hooks that are shared across both types.
+- **Adding nullable bbox columns:** The sentinel approach was a prior decision. Do not add nullable bbox columns to the annotations table.
+- **Modifying the existing 560-line evaluation.py:** Classification evaluation is separate (~50 lines, Phase 16+). Do not touch `evaluation.py` in this phase.
+- **Storing dataset_type on samples:** It belongs on the datasets table -- one type per dataset, not per sample.
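The if/elif dispatch shown in Pattern 2 generalizes to the parser-registry pattern this research recommends. A minimal sketch, with stub classes standing in for the real parsers in `app/ingestion/` (the registry dict itself is an assumption, not existing code):

```python
# Sketch of a dict-based parser registry -- an alternative to if/elif
# dispatch that keeps format handling in one place as formats grow.
# COCOParser / ClassificationJSONLParser are stubs standing in for the
# real parser classes.
class COCOParser:
    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size


class ClassificationJSONLParser:
    def __init__(self, batch_size: int = 1000):
        self.batch_size = batch_size


PARSER_REGISTRY = {
    "coco": COCOParser,
    "classification_jsonl": ClassificationJSONLParser,
}


def get_parser(format_name: str, batch_size: int = 1000):
    """Look up a parser class by format string; reject unknown formats."""
    try:
        parser_cls = PARSER_REGISTRY[format_name]
    except KeyError:
        raise ValueError(f"Unsupported format: {format_name}") from None
    return parser_cls(batch_size=batch_size)
```

Registering a future YOLO or VOC parser then becomes a one-line dict entry rather than another elif branch.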
+ +## Don't Hand-Roll + +| Problem | Don't Build | Use Instead | Why | +|---------|-------------|-------------|-----| +| JSONL parsing | Custom streaming parser | Python `json.loads()` per line | JSONL files are small enough (one line per image), no need for ijson streaming | +| Image dimension reading | Manual PIL/cv2 calls | Existing `ImageService` | Already handles dimension extraction during thumbnail generation | +| SQL schema migration | Migration framework | `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` | Already established pattern (see `duckdb_repo.py` lines 84-103) | +| Frontend format badge | Custom badge component | Tailwind utility classes inline | Consistent with existing scan-results.tsx `splitColor()` pattern | + +**Key insight:** This phase is mostly wiring -- connecting an existing architecture to a new data shape. The risky parts are not technical but completeness: ensuring every SQL query, every frontend component, and every display path handles the classification case. + +## Common Pitfalls + +### Pitfall 1: JSONL Format Ambiguity + +**What goes wrong:** Different classification tools produce different JSONL schemas. Some use `"label"`, others `"class"`, `"category"`, or `"class_name"`. Some use `"filename"`, others `"file_name"`, `"image"`, or `"path"`. + +**Why it happens:** No industry standard for classification JSONL format. + +**How to avoid:** Support the most common key variants in the parser. Normalize on read: +```python +filename = obj.get("filename") or obj.get("file_name") or obj.get("image") or obj.get("path", "") +label = obj.get("label") or obj.get("class") or obj.get("category") or obj.get("class_name", "unknown") +``` + +**Warning signs:** Parser silently produces zero annotations because key names don't match. + +### Pitfall 2: Classification Samples Without Annotations + +**What goes wrong:** If an image file exists in the directory but has no line in the JSONL, it gets inserted as a sample with zero annotations. 
The grid shows it with no badge, confusingly. + +**Why it happens:** JSONL may not list every image (unlabeled images are common in classification datasets). + +**How to avoid:** During ingestion, only insert samples that appear in the JSONL file. Or, insert all images but mark unlabeled ones clearly in the UI. Decision: follow the COCO parser pattern -- only insert samples listed in the annotation file. + +**Warning signs:** Image count in dataset doesn't match directory image count. + +### Pitfall 3: Detection-Only UI Elements Leaking Through + +**What goes wrong:** Classification datasets show bbox area histograms, IoU sliders, or empty bounding box overlays with sentinel values rendered as tiny dots at (0,0). + +**Why it happens:** Forgetting to gate UI elements on `datasetType`. + +**How to avoid:** Audit every component that references bbox values or detection-specific concepts: +- `AnnotationOverlay` -- skip rendering when `datasetType === "classification"` +- `annotation-list.tsx` -- hide Bounding Box and Area columns +- `evaluation-panel.tsx` -- hide IoU slider, use accuracy instead of mAP +- `stats-dashboard.tsx` -- rename "GT Annotations" to "Labeled Images" +- `annotation-summary.tsx` -- swap card labels + +**Warning signs:** Sentinel bbox values (0,0,0,0) rendered visually anywhere. + +### Pitfall 4: Category Ingestion for Classification + +**What goes wrong:** The COCO parser extracts categories from a dedicated `categories` array. Classification JSONL files don't have one -- categories are implicitly defined by the set of unique labels. + +**Why it happens:** Different format, different category discovery mechanism. + +**How to avoid:** The ClassificationJSONLParser must do a first pass to collect unique labels, assign sequential category IDs, then do a second pass to emit annotation batches. Or, single pass collecting labels as encountered. + +**Warning signs:** Empty categories table for classification datasets, breaking filter facets. 
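The key-variant normalization from Pitfall 1 is worth factoring into one helper so both parser passes (Pitfall 4) share it. A sketch -- the key lists come from the pitfalls above; the function name is illustrative:

```python
import json

# Accepted key variants, in priority order (from Pitfall 1).
FILENAME_KEYS = ("filename", "file_name", "image", "path")
LABEL_KEYS = ("label", "class", "category", "class_name")


def normalize_jsonl_line(line: str) -> tuple[str, str]:
    """Parse one JSONL line and normalize filename/label key variants.

    Returns ("", ...) or (..., "unknown") when no variant matches, mirroring
    the fallback defaults suggested in Pitfall 1.
    """
    obj = json.loads(line)
    filename = next((obj[k] for k in FILENAME_KEYS if obj.get(k)), "")
    label = next((obj[k] for k in LABEL_KEYS if obj.get(k)), "unknown")
    return filename, label
```

A silent zero-annotation parse (the warning sign for Pitfall 1) then reduces to checking for the `"unknown"` fallback during ingestion.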
+ +### Pitfall 5: Multi-Label Classification Collision + +**What goes wrong:** If a future dataset has multiple labels per image, the single-annotation-per-sample assumption breaks. + +**Why it happens:** Single-label is the common case, but multi-label exists. + +**How to avoid:** Design the JSONL parser to handle `"label": ["dog", "outdoor"]` by emitting multiple annotation rows per sample. The sentinel bbox approach supports this naturally (each annotation row has its own category_name). But for Phase 15, scope to single-label only and document the multi-label extension path. + +**Warning signs:** JSONL lines with array-valued `label` fields. + +## Code Examples + +### Classification JSONL Parser Structure + +```python +class ClassificationJSONLParser(BaseParser): + """Parse a JSONL file where each line maps filename -> class label.""" + + @property + def format_name(self) -> str: + return "classification_jsonl" + + def parse_categories(self, file_path: Path) -> dict[int, str]: + """First pass: collect unique labels -> sequential IDs.""" + labels: set[str] = set() + with open(file_path) as f: + for line in f: + line = line.strip() + if not line: + continue + obj = json.loads(line) + label = obj.get("label") or obj.get("class") or obj.get("category", "unknown") + labels.add(label) + return {i: name for i, name in enumerate(sorted(labels))} + + def build_image_batches( + self, file_path: Path, dataset_id: str, split: str | None = None, image_dir: str = "" + ) -> Iterator[pd.DataFrame]: + """Yield sample rows from JSONL. 
Each line = one image.""" + batch = [] + for i, line in enumerate(open(file_path)): + line = line.strip() + if not line: + continue + obj = json.loads(line) + filename = obj.get("filename") or obj.get("file_name", "") + sample_id = f"{split}_{i}" if split else str(i) + batch.append({ + "id": sample_id, + "dataset_id": dataset_id, + "file_name": filename, + "width": obj.get("width", 0), + "height": obj.get("height", 0), + "thumbnail_path": None, + "split": split, + "metadata": None, + "image_dir": image_dir, + }) + if len(batch) >= self.batch_size: + yield pd.DataFrame(batch) + batch = [] + if batch: + yield pd.DataFrame(batch) + + def build_annotation_batches( + self, file_path: Path, dataset_id: str, categories: dict[int, str], split: str | None = None + ) -> Iterator[pd.DataFrame]: + """Yield annotation rows with sentinel bbox values.""" + batch = [] + cat_name_to_id = {v: k for k, v in categories.items()} + for i, line in enumerate(open(file_path)): + line = line.strip() + if not line: + continue + obj = json.loads(line) + label = obj.get("label") or obj.get("class") or obj.get("category", "unknown") + sample_id = f"{split}_{i}" if split else str(i) + ann_id = f"{split}_ann_{i}" if split else f"ann_{i}" + batch.append({ + "id": ann_id, + "dataset_id": dataset_id, + "sample_id": sample_id, + "category_name": label, + "bbox_x": 0.0, + "bbox_y": 0.0, + "bbox_w": 0.0, + "bbox_h": 0.0, + "area": 0.0, + "is_crowd": False, + "source": "ground_truth", + "confidence": None, + "metadata": None, + }) + if len(batch) >= self.batch_size: + yield pd.DataFrame(batch) + batch = [] + if batch: + yield pd.DataFrame(batch) +``` + +### Schema Migration (DuckDB) + +```python +# In duckdb_repo.py initialize_schema(): +self.connection.execute( + "ALTER TABLE datasets ADD COLUMN IF NOT EXISTS dataset_type VARCHAR DEFAULT 'detection'" +) +``` + +### Frontend Class Badge (Grid Cell) + +```tsx +// Inside GridCell, replacing AnnotationOverlay for classification datasets: +function 
ClassBadge({ label }: { label?: string }) {
+  if (!label) return null;
+  // Styling classes are illustrative
+  return (
+    <div className="absolute bottom-1 left-1">
+      <span className="rounded bg-black/70 px-1.5 py-0.5 text-xs text-white">
+        {label}
+      </span>
+    </div>
+  );
+}
+```
+
+### Frontend Class Label in Detail Modal
+
+```tsx
+// In SampleModal, for classification datasets:
+// Show GT class label prominently with dropdown to change it
+// (handler and variable names are illustrative)
+<div className="flex items-center gap-2">
+  <span className="font-medium">Class:</span>
+  <select
+    value={gtAnnotation?.category_name}
+    onChange={(e) => updateCategory.mutate(e.target.value)}
+  >
+    {categories.map((c) => (
+      <option key={c} value={c}>
+        {c}
+      </option>
+    ))}
+  </select>
+</div>
+``` + +### Classification-Aware Statistics Summary + +```tsx +// In AnnotationSummary, swap card definitions based on datasetType: +const DETECTION_CARDS = [ + { key: "total_images", label: "Total Images" }, + { key: "gt_annotations", label: "GT Annotations" }, + { key: "pred_annotations", label: "Predictions" }, + { key: "total_categories", label: "Categories" }, +]; + +const CLASSIFICATION_CARDS = [ + { key: "total_images", label: "Total Images" }, + { key: "gt_annotations", label: "Labeled Images" }, + { key: "pred_annotations", label: "Predictions" }, + { key: "total_categories", label: "Classes" }, +]; +``` + +## Existing Codebase Surface Area + +### Files That MUST Change + +| File | Change | Reason | +|------|--------|--------| +| `app/repositories/duckdb_repo.py` | Add `dataset_type` column | INGEST-04 | +| `app/ingestion/classification_jsonl_parser.py` | NEW file | INGEST-01 | +| `app/services/folder_scanner.py` | Detect JSONL layouts | INGEST-02 | +| `app/services/ingestion.py` | Parser dispatch, store dataset_type | INGEST-01, INGEST-02 | +| `app/models/dataset.py` | Add `dataset_type` to response | INGEST-04 | +| `app/models/scan.py` | Format can be `classification_jsonl` | INGEST-02 | +| `app/routers/datasets.py` | Return dataset_type in responses | INGEST-04 | +| `frontend/src/types/dataset.ts` | Add `dataset_type` field | INGEST-04 | +| `frontend/src/types/scan.ts` | Format type update | INGEST-02 | +| `frontend/src/app/datasets/[datasetId]/page.tsx` | Thread `datasetType` prop | DISP-01 through DISP-04 | +| `frontend/src/components/grid/grid-cell.tsx` | Show class badge for classification | DISP-01 | +| `frontend/src/components/detail/sample-modal.tsx` | Show class label + dropdown | DISP-02, DISP-03 | +| `frontend/src/components/detail/annotation-list.tsx` | Hide bbox columns for classification | DISP-02 | +| `frontend/src/components/stats/stats-dashboard.tsx` | Hide detection-only tabs | DISP-04 | +| 
`frontend/src/components/stats/annotation-summary.tsx` | Classification-appropriate labels | DISP-04 | +| `frontend/src/components/ingest/scan-results.tsx` | Format badge for classification | INGEST-02 | + +### Files That SHOULD NOT Change + +| File | Reason | +|------|--------| +| `app/services/evaluation.py` | Detection evaluation untouched; classification eval is separate (future phase) | +| `app/ingestion/coco_parser.py` | COCO format unchanged | +| `app/ingestion/prediction_parser.py` | Detection predictions unchanged | +| `app/services/error_analysis.py` | Detection-specific error categories | +| `app/ingestion/detection_annotation_parser.py` | Detection predictions unchanged | + +### Backend API Changes Needed + +1. **New annotation update endpoint for category_name** (DISP-03): Currently `PUT /annotations/{id}` only updates bbox. Need to add `PATCH /annotations/{id}/category` or extend the existing PUT to accept `category_name`. + +2. **Statistics endpoint** (DISP-04): The `GET /datasets/{id}/statistics` endpoint returns detection-centric summary stats. For classification datasets, `gt_annotations` should reflect "labeled images" (distinct sample_ids with GT annotations) rather than raw annotation count. + +3. **Dataset response**: `GET /datasets/{id}` needs to include `dataset_type`. 
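Change 2 above reduces to a single aggregate swap. The sketch below uses `sqlite3` as a stand-in for DuckDB (the SQL is identical in both engines); table and column names follow the annotations schema described in this research:

```python
import sqlite3

# In-memory stand-in for the DuckDB annotations table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE annotations (sample_id TEXT, source TEXT)")
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?)",
    [
        ("s1", "ground_truth"),
        ("s1", "ground_truth"),  # duplicate GT rows inflate the raw count
        ("s2", "ground_truth"),
        ("s3", "prediction"),
    ],
)

# Detection-style summary: raw GT annotation rows.
raw = conn.execute(
    "SELECT COUNT(*) FROM annotations WHERE source = 'ground_truth'"
).fetchone()[0]

# Classification-style summary: distinct samples with at least one GT label.
labeled = conn.execute(
    "SELECT COUNT(DISTINCT sample_id) FROM annotations WHERE source = 'ground_truth'"
).fetchone()[0]

print(raw, labeled)  # 3 2
```

For clean single-label data the two counts coincide; they diverge exactly when a sample carries more than one GT row, which is why "Labeled Images" should use the `DISTINCT` form.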
+ +### Classification JSONL Expected Format + +```jsonl +{"filename": "img001.jpg", "label": "cat"} +{"filename": "img002.jpg", "label": "dog"} +{"filename": "img003.jpg", "label": "cat"} +``` + +Alternative accepted keys: +- `filename` / `file_name` / `image` / `path` +- `label` / `class` / `category` / `class_name` +- Optional: `width`, `height`, `confidence`, `split` + +### Folder Layouts to Detect + +**Layout D (Classification JSONL):** Split directories with JSONL + images: +``` +dataset/ + train/ + annotations.jsonl + img001.jpg + img002.jpg + val/ + annotations.jsonl + img003.jpg +``` + +**Layout E (Flat Classification):** Single JSONL at root: +``` +dataset/ + labels.jsonl + images/ + img001.jpg + img002.jpg +``` + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| Hard-coded COCO parser | Parser dispatch by format string | Phase 15 | Enables multi-format support | +| No dataset_type tracking | `dataset_type` column on datasets | Phase 15 | Frontend can branch rendering | +| Detection-only statistics | Type-aware statistics | Phase 15 | Classification users see relevant metrics | + +## Open Questions + +1. **Image dimensions for classification JSONL** + - What we know: COCO JSON includes width/height per image. Classification JSONL typically doesn't. + - What's unclear: Should the parser read image dimensions from disk during ingestion, or store 0/0 and resolve later during thumbnail generation? + - Recommendation: Read dimensions during thumbnail generation (existing `ImageService` path). Store 0/0 initially if not present in JSONL. The grid cell uses `object-cover` which doesn't need dimensions. The annotation overlay (not used for classification) needs dimensions. Detail modal image loads at full-res naturally. + +2. **Multi-label classification** + - What we know: Phase 15 scopes to single-label. Multi-label is a future extension. 
+ - What's unclear: Should the JSONL parser error on array labels or silently take the first? + - Recommendation: If `label` is an array, emit one annotation row per label. This is forward-compatible and costs nothing with the sentinel bbox approach. + +3. **Classification prediction import** + - What we know: Detection predictions use `DetectionAnnotationParser` or `PredictionParser`. Classification predictions would be a different format. + - What's unclear: Is classification prediction import in scope for Phase 15? + - Recommendation: Out of scope. Phase 15 focuses on GT ingestion and display. Classification prediction import + evaluation are natural follow-ups. + +4. **Annotation update for category_name change (DISP-03)** + - What we know: Current `AnnotationUpdate` model only has bbox fields. Current `PUT /annotations/{id}` only updates bbox. + - What's unclear: Should we extend the existing endpoint or create a new one? + - Recommendation: Add a new `PATCH /annotations/{id}/category` endpoint or extend `AnnotationUpdate` to include optional `category_name`. Extending is simpler since the existing pattern already handles updates. A new field `category_name: str | None = None` on AnnotationUpdate, applied when present, is clean. 
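The multi-label path recommended in Open Question 2 reduces to one normalization step before emitting annotation rows. A minimal sketch (function name is illustrative):

```python
def labels_for(obj: dict) -> list[str]:
    """Normalize a JSONL 'label' value to a list of class names.

    Scalars become one-element lists; arrays pass through unchanged, so the
    parser can emit one annotation row per label -- the forward-compatible
    multi-label path from Open Question 2. Missing labels fall back to
    "unknown", matching the parser's default.
    """
    label = obj.get("label", "unknown")
    if isinstance(label, list):
        return [str(x) for x in label]
    return [str(label)]
```

The emit loop then iterates `for label in labels_for(obj):`, and single-label Phase 15 behavior is unchanged.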
+ +## Sources + +### Primary (HIGH confidence) +- **Codebase analysis** -- direct file reads of all affected files listed above + - `app/ingestion/base_parser.py` -- BaseParser ABC interface + - `app/ingestion/coco_parser.py` -- reference parser implementation + - `app/repositories/duckdb_repo.py` -- schema and migration pattern + - `app/services/ingestion.py` -- ingestion orchestration + - `app/services/folder_scanner.py` -- format detection heuristics + - `app/services/evaluation.py` -- evaluation pipeline (560 lines, leave alone) + - `app/models/` -- all Pydantic models + - `app/routers/` -- all API endpoints + - `frontend/src/components/` -- all display components + - `frontend/src/types/` -- all TypeScript type definitions + - `frontend/src/stores/` -- Zustand stores (filter, UI, ingest) + +### Secondary (MEDIUM confidence) +- Prior decisions from phase description: sentinel bbox values, separate classification eval function, datasetType prop threading, parser registry + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH -- no new dependencies, all existing libraries +- Architecture: HIGH -- extending well-established patterns in the codebase +- Pitfalls: HIGH -- derived from direct codebase analysis, not external sources +- Code examples: HIGH -- based on actual codebase patterns and verified file contents + +**Research date:** 2026-02-18 +**Valid until:** 2026-03-18 (stable -- internal codebase patterns, no external dependency risk) diff --git a/.planning/phases/15-classification-ingestion-display/15-VERIFICATION.md b/.planning/phases/15-classification-ingestion-display/15-VERIFICATION.md new file mode 100644 index 0000000..58773d9 --- /dev/null +++ b/.planning/phases/15-classification-ingestion-display/15-VERIFICATION.md @@ -0,0 +1,98 @@ +--- +phase: 15-classification-ingestion-display +verified: 2026-02-19T02:31:00Z +status: passed +score: 5/5 must-haves verified +re_verification: false +--- + +# Phase 15: Classification Ingestion & Display 
Verification Report + +**Phase Goal:** Users can import, browse, and inspect classification datasets with the same ease as detection datasets +**Verified:** 2026-02-19T02:31:00Z +**Status:** passed +**Re-verification:** No — initial verification + +## Goal Achievement + +### Observable Truths + +| # | Truth | Status | Evidence | +|----|-------|--------|----------| +| 1 | User can point the ingestion wizard at a folder with JSONL annotations and images, and the system auto-detects it as a classification dataset | VERIFIED | `FolderScanner._try_layout_d` and `_try_layout_e` detect JSONL layouts before COCO; `_is_classification_jsonl` heuristic reads first 5 lines for filename+label keys. GCS path also supported via `_scan_gcs_classification`. `ScanResult.format="classification_jsonl"` returned. | +| 2 | User can import multi-split classification datasets (train/valid/test) in a single operation, just like detection datasets | VERIFIED | `ImportRequest.format` field added (default `"coco"`, accepts `"classification_jsonl"`). `ingest_splits_with_progress(format=request.format)` threads format into per-split calls. `IngestionService` dispatches to `ClassificationJSONLParser` by format string. `dataset_type="classification"` stored in INSERT. | +| 3 | User sees class label badges on grid thumbnails instead of bounding box overlays when browsing a classification dataset | VERIFIED | `GridCell` accepts `datasetType?: string`; when `"classification"` renders `` with GT `category_name` instead of ``. `ImageGrid` threads `datasetType` through. Page threads `dataset.dataset_type` to ``. | +| 4 | User sees GT class label prominently in the sample detail modal and can change it via a dropdown | VERIFIED | `SampleModal` shows `{isClassification &&
Class: setColorMode(e.target.value as ColorMode)} + className="rounded border border-zinc-300 dark:border-zinc-600 bg-white dark:bg-zinc-800 text-xs px-2 py-1 text-zinc-900 dark:text-zinc-100" +> + + + + + +``` + +Pass `colorMode` to the `` component. + +**3. Thread datasetType to EmbeddingPanel (page.tsx):** + +In `frontend/src/app/datasets/[datasetId]/page.tsx`, find where `` is rendered and add `datasetType={dataset?.dataset_type}` prop. This enables the panel to know whether to show color mode controls. + +Note: The color mode dropdown should always be shown (not just for classification) since detection datasets also have GT/pred annotations. The `hasPredictions` check handles disabling prediction-dependent modes. + +**4. Invalidate embedding coordinates after prediction import:** + +In the prediction import dialog or wherever predictions are imported, add `embedding-coordinates` to the list of query keys invalidated on import success. Check `frontend/src/components/detail/prediction-import-dialog.tsx` for the `onSuccess` callback of the prediction import mutation -- add: +```typescript +queryClient.invalidateQueries({ queryKey: ["embedding-coordinates", datasetId] }); +``` + +This prevents stale coordinates (missing predLabel) after importing predictions (Pitfall 4 from research). + + +Run `cd frontend && npx tsc --noEmit` to confirm no type errors. +Verify that the EmbeddingPanel toolbar has the color mode dropdown. +Verify that EmbeddingScatter accepts and uses the colorMode prop. +Verify that the page passes datasetType to EmbeddingPanel. + + +Embedding scatter plot supports 4 color modes: Default (uniform blue), GT Class (categorical palette), Predicted Class (categorical palette), and Correct/Incorrect (green/red/gray). Color mode dropdown in toolbar with prediction-dependent options disabled when no predictions. Embedding coordinates cache invalidated after prediction import. + + + + + + +1. `cd frontend && npx tsc --noEmit` passes with no errors +2. 
Backend `get_coordinates` SQL includes LEFT JOIN annotations and returns gtLabel/predLabel +3. EmbeddingPoint type has optional gtLabel/predLabel fields +4. EmbeddingScatter getFillColor branches on colorMode with categorical palette +5. EmbeddingPanel toolbar has color mode dropdown +6. Prediction-dependent modes disabled when hasPredictions is false +7. Page passes datasetType to EmbeddingPanel +8. Prediction import invalidates embedding-coordinates query key + + + +- Color mode dropdown visible in embedding toolbar +- Points change color when switching modes (GT Class, Predicted Class, Correct/Incorrect) +- Predicted Class and Correct/Incorrect disabled when no predictions imported +- Backend returns gtLabel and predLabel without breaking existing consumers +- No regressions to lasso selection, hover tooltip, or detection embedding functionality + + + +After completion, create `.planning/phases/17-classification-polish/17-02-SUMMARY.md` + diff --git a/.planning/phases/17-classification-polish/17-02-SUMMARY.md b/.planning/phases/17-classification-polish/17-02-SUMMARY.md new file mode 100644 index 0000000..f195c4f --- /dev/null +++ b/.planning/phases/17-classification-polish/17-02-SUMMARY.md @@ -0,0 +1,97 @@ +--- +phase: 17-classification-polish +plan: 02 +subsystem: ui +tags: [deck.gl, scatter-plot, categorical-coloring, embedding, color-mode] + +requires: + - phase: 17-classification-polish + provides: "Embedding scatter plot with lasso selection" +provides: + - "Color mode dropdown (Default, GT Class, Predicted Class, Correct/Incorrect)" + - "Categorical Tableau 20 palette for class-based coloring" + - "Enriched coordinates API returning gtLabel and predLabel per point" +affects: [] + +tech-stack: + added: [] + patterns: + - "Categorical palette with stable label-to-index mapping for consistent colors" + - "Color mode as prop threaded from panel to scatter component" + +key-files: + created: [] + modified: + - app/services/reduction_service.py + - 
frontend/src/types/embedding.ts + - frontend/src/components/embedding/embedding-scatter.tsx + - frontend/src/components/embedding/embedding-panel.tsx + - frontend/src/app/datasets/[datasetId]/page.tsx + - frontend/src/hooks/use-import-predictions.ts + +key-decisions: + - "LEFT JOIN annotations with MIN() + GROUP BY to collapse multi-annotation to one label per sample" + - "Color mode dropdown always visible (not gated on dataset type) since detection datasets also have annotations" + +patterns-established: + - "Tableau 20 categorical palette for class-based visualizations" + +duration: 2min +completed: 2026-02-18 +--- + +# Phase 17 Plan 02: Embedding Color Modes Summary + +**Categorical color modes for embedding scatter (GT Class, Predicted Class, Correct/Incorrect) with Tableau 20 palette and enriched coordinates API** + +## Performance + +- **Duration:** 2 min +- **Started:** 2026-02-19T03:56:39Z +- **Completed:** 2026-02-19T03:58:47Z +- **Tasks:** 2 +- **Files modified:** 6 + +## Accomplishments +- Backend coordinates endpoint enriched with gtLabel and predLabel via LEFT JOIN annotations +- 4 color modes in embedding scatter: Default (uniform blue), GT Class, Predicted Class, Correct/Incorrect +- Color mode dropdown in toolbar with prediction-dependent options disabled when no predictions exist +- Embedding coordinates cache invalidated after prediction import to prevent stale data + +## Task Commits + +Each task was committed atomically: + +1. **Task 1: Backend coordinates enrichment with GT/pred labels** - `4ff366a` (feat) +2. 
**Task 2: Embedding scatter color mode dropdown and categorical coloring** - `1f4c858` (feat) + +## Files Created/Modified +- `app/services/reduction_service.py` - LEFT JOIN annotations for GT/pred labels in get_coordinates SQL +- `frontend/src/types/embedding.ts` - Added optional gtLabel/predLabel to EmbeddingPoint interface +- `frontend/src/components/embedding/embedding-scatter.tsx` - ColorMode type, Tableau 20 palette, getFillColor branching +- `frontend/src/components/embedding/embedding-panel.tsx` - Color mode dropdown, hasPredictions memo, datasetType prop +- `frontend/src/app/datasets/[datasetId]/page.tsx` - Thread datasetType to EmbeddingPanel +- `frontend/src/hooks/use-import-predictions.ts` - Invalidate embedding-coordinates on prediction import + +## Decisions Made +- LEFT JOIN annotations with MIN() + GROUP BY to collapse multi-annotation edge cases to one label per sample +- Color mode dropdown always visible (not gated on dataset type) since detection datasets also have GT/pred annotations +- Lasso selection overrides color mode coloring (selection highlight takes priority) + +## Deviations from Plan + +None - plan executed exactly as written. + +## Issues Encountered +None + +## User Setup Required +None - no external service configuration required. 
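The "LEFT JOIN annotations with MIN() + GROUP BY" decision above can be sketched as follows, with `sqlite3` standing in for DuckDB and deliberately simplified table shapes (the real query lives in `app/services/reduction_service.py`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embeddings (sample_id TEXT, x REAL, y REAL)")
conn.execute(
    "CREATE TABLE annotations (sample_id TEXT, category_name TEXT, source TEXT)"
)
conn.executemany(
    "INSERT INTO embeddings VALUES (?, ?, ?)",
    [("s1", 0.1, 0.2), ("s2", 0.3, 0.4)],
)
conn.executemany(
    "INSERT INTO annotations VALUES (?, ?, ?)",
    [
        ("s1", "cat", "ground_truth"),
        ("s1", "dog", "prediction"),
        ("s2", "dog", "ground_truth"),
        ("s2", "cat", "ground_truth"),  # multi-annotation edge case
    ],
)

# MIN() over conditional aggregates collapses any multi-annotation sample
# to a single deterministic gt/pred label per point; samples with no
# predictions get NULL for pred_label.
rows = conn.execute(
    """
    SELECT e.sample_id,
           MIN(CASE WHEN a.source = 'ground_truth' THEN a.category_name END) AS gt_label,
           MIN(CASE WHEN a.source = 'prediction'   THEN a.category_name END) AS pred_label
    FROM embeddings e
    LEFT JOIN annotations a ON a.sample_id = e.sample_id
    GROUP BY e.sample_id
    ORDER BY e.sample_id
    """
).fetchall()
print(rows)  # [('s1', 'cat', 'dog'), ('s2', 'cat', None)]
```

`MIN` ignores NULLs, so the conditional-aggregate form yields exactly one row per embedding point without dropping unpredicted samples.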
+ +## Next Phase Readiness +- Milestone v1.2 Classification Dataset Support is complete +- All 3 phases (15-17) delivered: classification ingestion/UI, evaluation, and polish + +--- +*Phase: 17-classification-polish* +*Completed: 2026-02-18* diff --git a/.planning/phases/17-classification-polish/17-RESEARCH.md b/.planning/phases/17-classification-polish/17-RESEARCH.md new file mode 100644 index 0000000..3ad2e73 --- /dev/null +++ b/.planning/phases/17-classification-polish/17-RESEARCH.md @@ -0,0 +1,441 @@ +# Phase 17: Classification Polish - Research + +**Researched:** 2026-02-18 +**Domain:** High-cardinality confusion matrix rendering, embedding scatter coloring, most-confused pairs, per-class sparklines +**Confidence:** HIGH (all four requirements are frontend-focused UI enhancements on existing infrastructure, no new backend services or libraries needed) + +## Summary + +Phase 17 polishes the classification evaluation experience for production use with high-cardinality datasets (43+ classes). It addresses four distinct UI gaps: (1) the current confusion matrix renders as an HTML table which becomes unreadable at 43+ classes -- it needs threshold filtering to hide low-value cells and overflow handling; (2) the embedding scatter plot currently colors all points uniformly blue but should support coloring by GT class, predicted class, or correct/incorrect status; (3) the confusion matrix data already contains all information needed to derive a ranked list of most-confused class pairs, but no summary is surfaced; (4) the per-class metrics table shows raw numbers but lacks visual sparklines with color-coded thresholds for quick scanning. + +All four requirements are frontend-focused with minimal backend changes. The existing `ClassificationEvaluationResponse` already returns `confusion_matrix`, `confusion_matrix_labels`, and `per_class_metrics` -- enough data for requirements POLISH-01, POLISH-03, and POLISH-04 without backend changes. 
POLISH-02 requires enriching the embedding coordinates endpoint to include GT and predicted labels per sample, or fetching annotation data separately to join client-side. + +**Primary recommendation:** Implement all four requirements as frontend enhancements. For the confusion matrix (POLISH-01), use the existing HTML table approach with a threshold filter slider (hide cells below N%) and `overflow-auto` with `max-h`/`max-w` constraints rather than migrating to canvas -- the HTML table already uses cell-level color intensity and is easier to maintain. For embedding coloring (POLISH-02), extend the backend `/coordinates` endpoint to include `gtLabel` and `predLabel` per point so the `getFillColor` accessor can use a categorical color palette. For most-confused pairs (POLISH-03), derive from the existing confusion matrix client-side. For sparklines (POLISH-04), use Recharts `LineChart` with hidden axes to create inline SVG sparklines. + +## Standard Stack + +### Core (already in use -- no new dependencies) + +| Library | Version | Purpose | Status | +|---------|---------|---------|--------| +| Recharts | ^3.7.0 | Sparkline mini charts in per-class table | In use | +| deck.gl | ^9.2.6 | ScatterplotLayer `getFillColor` accessor for categorical coloring | In use | +| React/Next.js | - | Component rendering, memoization | In use | +| DuckDB | - | JOIN annotations to embedding coordinates (backend) | In use | +| Tailwind CSS | - | Styling, responsive overflow containers | In use | + +### Supporting (no new libraries needed) + +The sparkline requirement can be met with Recharts `LineChart` + `Line` with hidden axes in a small container (~60x20px). No dedicated sparkline library is needed. The Recharts `LineChart` component supports `width`/`height` props directly (no `ResponsiveContainer` needed for fixed-size inline use). + +For categorical color palettes in the embedding scatter, a static array of 20-50 distinct colors is sufficient. 
D3's categorical color scales (`d3-scale-chromatic`) are NOT in the dependency tree and would be overkill -- a hardcoded palette of ~20 colors with hashing for overflow is simpler and has zero bundle impact.
+
+### Alternatives Considered
+
+| Instead of | Could Use | Tradeoff |
+|------------|-----------|----------|
+| HTML table with threshold filter (confusion matrix) | Canvas-based heatmap (e.g., custom Canvas2D) | Canvas handles extreme sizes better but loses click interactivity and accessibility, and requires a more complex implementation. An HTML table with threshold filtering handles 43 classes well enough. |
+| Recharts inline LineChart (sparklines) | SVG `<polyline>`/`<path>` by hand | Recharts is already imported; hand-rolling SVG paths saves ~1KB per sparkline but adds maintenance burden. |
+| Recharts inline LineChart (sparklines) | `react-sparklines` library | Adds a new dependency for something Recharts already supports. |
+| Backend label enrichment on /coordinates | Client-side JOIN via separate annotations fetch | Backend is cleaner (single fetch, no N+1). A frontend JOIN would require fetching all annotations for all samples -- already available via batch-annotations, but it couples the embedding panel to annotation data.
| + +## Architecture Patterns + +### Recommended Change Map + +``` +Backend: + app/services/reduction_service.py # MODIFY: get_coordinates JOIN to include GT/pred labels + (or) app/routers/embeddings.py # MODIFY: accept color_mode query param, enrich response + +Frontend: + types/embedding.ts # MODIFY: add gtLabel, predLabel to EmbeddingPoint + components/embedding/embedding-scatter.tsx # MODIFY: accept colorMode prop, use categorical getFillColor + components/embedding/embedding-panel.tsx # MODIFY: add color mode dropdown, pass datasetType, pass colorMode + components/stats/confusion-matrix.tsx # MODIFY: add threshold slider, overflow constraints, max-h/max-w + components/stats/evaluation-panel.tsx # MODIFY: add MostConfusedPairs component, add sparklines to per-class table + (or) components/stats/most-confused-pairs.tsx # NEW: ranked list of most confused (gt, pred) pairs + (or) components/stats/per-class-sparkline.tsx # NEW: inline Recharts sparkline + app/datasets/[datasetId]/page.tsx # MODIFY: pass datasetType to EmbeddingPanel +``` + +### Pattern 1: Confusion Matrix Threshold Filtering (POLISH-01) + +**What:** Add a slider that filters out confusion matrix cells below a threshold percentage, making high-cardinality matrices readable. +**When to use:** When label count >= ~15 classes. + +The current `ConfusionMatrix` component row-normalizes and renders all cells. At 43 classes, this is 1,849 cells -- most with 0.00 values. The fix: + +1. Add a `threshold` state (0.0 to 0.5, default 0.01) with a slider in the matrix header +2. Cells below threshold render as empty (no text, transparent background) +3. Wrap the table in a container with `max-h-[500px] max-w-full overflow-auto` for scroll +4. Make cell size smaller for high-cardinality: `min-w-[24px]` instead of `min-w-[32px]` when labels > 20 +5. 
Truncate long label text with `max-w-[80px] truncate` on row/column headers
+
+```tsx
+// In confusion-matrix.tsx
+const [threshold, setThreshold] = useState(0.01);
+const isHighCardinality = labels.length > 20;
+
+// In the cell render:
+{norm >= threshold ? (
+  <span>{norm.toFixed(2)}</span>
+) : null}
+```
+
+No canvas rendering needed. The HTML table with threshold filtering, scroll overflow, and compact cell sizing handles 43 classes adequately. Tested reasoning: 43x43 = 1,849 `<td>` elements is trivial for the browser DOM. Canvas would only be justified at 200+ classes.
+
+### Pattern 2: Embedding Scatter Color Modes (POLISH-02)
+
+**What:** A dropdown in the embedding toolbar that switches point coloring between: "Default" (uniform blue), "GT Class", "Predicted Class", "Correct/Incorrect".
+**When to use:** Classification datasets with predictions imported.
+
+**Backend change:** Enrich `get_coordinates` to JOIN annotation labels:
+
+```python
+# In reduction_service.py get_coordinates
+SELECT e.sample_id, e.x, e.y, s.file_name, s.thumbnail_path,
+       gt.category_name as gt_label,
+       pred.category_name as pred_label
+FROM embeddings e
+JOIN samples s ON e.sample_id = s.id AND e.dataset_id = s.dataset_id
+LEFT JOIN annotations gt ON gt.sample_id = s.id AND gt.dataset_id = s.dataset_id
+    AND gt.source = 'ground_truth'
+LEFT JOIN annotations pred ON pred.sample_id = s.id AND pred.dataset_id = s.dataset_id
+    AND pred.source != 'ground_truth'
+WHERE e.dataset_id = ? AND e.x IS NOT NULL
+```
+
+Note: This LEFT JOINs so points without annotations still appear. For multi-source predictions, pick the first non-GT source or accept NULL.
+ +**Frontend change:** The `EmbeddingScatter` component's `getFillColor` accessor switches based on `colorMode`: + +```tsx +type ColorMode = "default" | "gt_class" | "pred_class" | "correctness"; + +// Categorical palette (20 distinct colors, cycle with modulo for overflow) +const PALETTE: [number,number,number,number][] = [ + [31,119,180,200], [255,127,14,200], [44,160,44,200], [214,39,40,200], + [148,103,189,200], [140,86,75,200], [227,119,194,200], [127,127,127,200], + // ... 12 more ... +]; + +getFillColor: (d) => { + if (colorMode === "gt_class" && d.gtLabel) { + return PALETTE[labelIndex.get(d.gtLabel)! % PALETTE.length]; + } + if (colorMode === "pred_class" && d.predLabel) { + return PALETTE[labelIndex.get(d.predLabel)! % PALETTE.length]; + } + if (colorMode === "correctness") { + if (!d.predLabel) return [180,180,180,100]; // no prediction: gray + return d.gtLabel === d.predLabel + ? [44,160,44,200] // correct: green + : [214,39,40,200]; // incorrect: red + } + return [100,120,220,200]; // default blue +} +``` + +The `labelIndex` is a `Map` built from unique labels in the points array, sorted alphabetically for stable color assignment. + +**Key concern:** The `EmbeddingPanel` currently receives only `datasetId`. It needs `datasetType` to know whether to show the color mode dropdown. The page already has `dataset?.dataset_type` -- thread it through as a prop. + +### Pattern 3: Most-Confused Class Pairs (POLISH-03) + +**What:** A ranked list derived from the confusion matrix showing the top-N most confused (actual, predicted) pairs. +**When to use:** Always shown below/beside the confusion matrix when classification evaluation data is available. + +This is a pure frontend derivation -- no backend change needed. The confusion matrix and labels are already in `ClassificationEvaluationResponse`. 
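The `labelIndex` map referenced above can be built as a small pure helper. A minimal sketch, assuming the enriched point shape described earlier; the palette is truncated to four of the Tableau 20 entries for brevity, and the helper names (`buildLabelIndex`, `colorFor`) are illustrative:

```typescript
// Minimal sketch: stable label -> color mapping for the embedding scatter.
// PALETTE is truncated to four RGBA entries; the real component would use
// the full 20-color Tableau palette.
type RGBA = [number, number, number, number];

interface PointLabels {
  gtLabel?: string | null;
  predLabel?: string | null;
}

const PALETTE: RGBA[] = [
  [31, 119, 180, 200],
  [255, 127, 14, 200],
  [44, 160, 44, 200],
  [214, 39, 40, 200],
];

// Collect unique labels and sort alphabetically so color assignment is
// stable across re-renders regardless of point order.
function buildLabelIndex(points: PointLabels[], key: "gtLabel" | "predLabel"): Map<string, number> {
  const labels = new Set<string>();
  for (const p of points) {
    const label = p[key];
    if (label) labels.add(label);
  }
  return new Map([...labels].sort().map((label, i) => [label, i]));
}

// Cycle through the palette with modulo when there are more labels than colors.
function colorFor(label: string | null | undefined, index: Map<string, number>): RGBA {
  const i = label ? index.get(label) : undefined;
  return i === undefined ? [100, 120, 220, 200] : PALETTE[i % PALETTE.length];
}
```

Note that deck.gl's `updateTriggers` for `getFillColor` would likely need to include `colorMode` so the layer re-evaluates colors when the dropdown changes.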
+ +```tsx +function getMostConfusedPairs( + matrix: number[][], + labels: string[], + topN: number = 10, +): { actual: string; predicted: string; count: number; pct: number }[] { + const pairs: { actual: string; predicted: string; count: number; pct: number }[] = []; + for (let i = 0; i < matrix.length; i++) { + const rowSum = matrix[i].reduce((a, b) => a + b, 0); + for (let j = 0; j < matrix[i].length; j++) { + if (i === j) continue; // skip diagonal (correct predictions) + if (matrix[i][j] === 0) continue; + pairs.push({ + actual: labels[i], + predicted: labels[j], + count: matrix[i][j], + pct: rowSum > 0 ? matrix[i][j] / rowSum : 0, + }); + } + } + pairs.sort((a, b) => b.count - a.count); + return pairs.slice(0, topN); +} +``` + +Render as a compact table: rank, actual class, arrow, predicted class, count, percentage. Clicking a row could trigger the existing confusion cell click-to-filter behavior. + +### Pattern 4: Per-Class Sparklines with Color-Coded Thresholds (POLISH-04) + +**What:** Add a small inline sparkline to each row of the per-class metrics table, with color coding: green (F1 >= 0.8), yellow (0.5 <= F1 < 0.8), red (F1 < 0.5). +**When to use:** Always shown in the classification per-class table. + +The "sparkline" for per-class metrics is a bit ambiguous since each class has a single F1 value, not a time series. Two interpretations: + +**Interpretation A: Per-class metric bar (precision/recall/F1 as a small bar chart)** +A tiny 3-bar chart (P, R, F1) for each class, giving a visual summary per row. This is more useful than a line sparkline for single-point-in-time data. + +**Interpretation B: Confidence-threshold sweep sparkline** +Show how F1 varies as confidence threshold changes. This requires computing F1 at multiple thresholds (backend change needed -- return F1 at e.g. 5 threshold values). + +**Recommendation: Interpretation A** is simpler and requires no backend changes. 
Three small bars (P, R, F1) using Recharts `BarChart` with hidden axes, color-coded by the F1 threshold:
+
+```tsx
+function PerClassSparkline({ precision, recall, f1 }: { precision: number; recall: number; f1: number }) {
+  const color = f1 >= 0.8 ? "#22c55e" : f1 >= 0.5 ? "#eab308" : "#ef4444";
+  const data = [
+    { name: "P", value: precision },
+    { name: "R", value: recall },
+    { name: "F1", value: f1 },
+  ];
+  return (
+    <BarChart width={60} height={20} data={data}>
+      <Bar dataKey="value" fill={color} />
+    </BarChart>
+  );
+}
+```
+
+Alternatively, a simpler approach: just a colored horizontal bar representing F1 (0-1 scale) with a background showing "full" (1.0). No Recharts needed -- pure CSS:
+
+```tsx
+<div className="h-2 w-full rounded bg-gray-200">
+  <div
+    className="h-2 rounded"
+    style={{
+      width: `${f1 * 100}%`,
+      background: f1 >= 0.8 ? "#22c55e" : f1 >= 0.5 ? "#eab308" : "#ef4444",
+    }}
+  />
+</div>
+```
+
+The CSS bar is simpler, zero-dependency, and arguably clearer for a single metric. **Recommend the CSS bar approach** unless the user specifically wants a multi-metric sparkline.
+
+### Anti-Patterns to Avoid
+
+- **Canvas confusion matrix for 43 classes:** Canvas loses click interactivity, text rendering quality, and accessibility. An HTML table with threshold filtering is adequate for this scale.
+- **Fetching all annotations separately for embedding coloring:** This creates an N+1 or large-batch problem. Better to enrich the `/coordinates` endpoint with a JOIN.
+- **Computing most-confused pairs on the backend:** The confusion matrix is already transmitted. Deriving pairs client-side avoids a new endpoint and keeps the backend simple.
+- **Using ResponsiveContainer for sparklines in table cells:** ResponsiveContainer requires a parent with explicit dimensions. In table cells, use fixed `width`/`height` props on the chart directly.
+
+## Don't Hand-Roll
+
+| Problem | Don't Build | Use Instead | Why |
+|---------|-------------|-------------|-----|
+| Categorical color palette | Dynamic color generation via HSL math | Static 20-color palette array | Reproducible, visually distinct, no computation |
+| Sparkline chart | Custom SVG `<path>` calculation | Recharts BarChart or CSS bar | Already in dependency tree, consistent styling |
+| Most-confused pairs | New backend endpoint | Client-side derivation from confusion matrix | Data already on client, O(N^2) is trivial for N<=50 |
+| Overflow scroll on confusion matrix | Custom virtual scrolling | CSS `overflow-auto` with `max-h`/`max-w` | 43x43 DOM elements is trivial, no virtualization needed |
+
+**Key insight:** All four requirements are UI refinements on data that is already available in the frontend. The only backend change needed is enriching embedding coordinates with annotation labels for POLISH-02.
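To make the "client-side derivation is trivial" claim concrete, here is a condensed, self-contained restatement of the Pattern 3 derivation run on a hypothetical 3x3 matrix (labels and counts invented for illustration):

```typescript
// Condensed restatement of the most-confused-pairs derivation, applied to a
// small hypothetical matrix. Rows are actual classes, columns are predictions.
interface ConfusedPair { actual: string; predicted: string; count: number; pct: number }

function mostConfusedPairs(matrix: number[][], labels: string[], topN = 10): ConfusedPair[] {
  const pairs: ConfusedPair[] = [];
  for (let i = 0; i < matrix.length; i++) {
    const rowSum = matrix[i].reduce((a, b) => a + b, 0);
    for (let j = 0; j < matrix[i].length; j++) {
      if (i === j || matrix[i][j] === 0) continue; // skip diagonal and empty cells
      pairs.push({
        actual: labels[i],
        predicted: labels[j],
        count: matrix[i][j],
        pct: rowSum > 0 ? matrix[i][j] / rowSum : 0,
      });
    }
  }
  return pairs.sort((a, b) => b.count - a.count).slice(0, topN);
}

// Hypothetical counts: 8 cats predicted as dog, 5 foxes predicted as dog, etc.
const matrix = [
  [50, 8, 2], // actual cat
  [3, 60, 1], // actual dog
  [0, 5, 40], // actual fox
];
const top = mostConfusedPairs(matrix, ["cat", "dog", "fox"], 3);
// top[0] is { actual: "cat", predicted: "dog", count: 8, ... }
```

The O(N^2) scan over a 50x50 matrix touches at most 2,500 cells, well below any interactivity budget.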
+ +## Common Pitfalls + +### Pitfall 1: Threshold Slider Hides All Cells +**What goes wrong:** If the user sets the confusion matrix threshold too high, all off-diagonal cells disappear, making the matrix look like nothing is wrong. +**Why it happens:** Most off-diagonal values in a well-performing model are very small fractions (0.01-0.05). +**How to avoid:** Set a sensible default (0.01 = 1%), show a "N cells hidden" count, and never hide diagonal cells regardless of threshold. +**Warning signs:** Confusion matrix appears nearly empty with all off-diagonal cells blank. + +### Pitfall 2: Embedding Color Mode Without Predictions +**What goes wrong:** User selects "Predicted Class" or "Correct/Incorrect" color mode but no predictions are imported. All points turn gray. +**Why it happens:** The `predLabel` field is null for all points when no predictions exist. +**How to avoid:** Disable "Predicted Class" and "Correct/Incorrect" options in the dropdown when no predictions exist. Check for the presence of prediction sources (same `hasPredictions` logic used in stats dashboard). +**Warning signs:** All points are gray/identical color in a non-"Default" mode. + +### Pitfall 3: Too Many Classes for Color Palette +**What goes wrong:** With 43+ classes, the 20-color palette cycles and multiple classes share the same color, reducing the scatter plot's usefulness. +**Why it happens:** Human color discrimination is limited to ~20 distinct hues. +**How to avoid:** Accept this limitation and mitigate: (1) show a legend that maps colors to classes (scrollable), (2) use hover tooltip to show the exact class name, (3) for 20+ classes, recommend "Correct/Incorrect" mode which only needs 3 colors (correct/incorrect/no-prediction). +**Warning signs:** Multiple visually distinct clusters in the scatter plot share the same color. 
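The Pitfall 1 mitigations (never hide the diagonal, surface a hidden-cell count) reduce to two small pure functions. A minimal sketch; `norm` values are row-normalized fractions as in the existing component, and the function names are illustrative:

```typescript
// Diagonal cells are always visible; off-diagonal cells must clear the threshold.
function cellVisible(norm: number, row: number, col: number, threshold: number): boolean {
  return row === col || norm >= threshold;
}

// Count how many cells the current threshold hides, for an "N cells hidden" badge.
function countHidden(normMatrix: number[][], threshold: number): number {
  let hidden = 0;
  for (let i = 0; i < normMatrix.length; i++) {
    for (let j = 0; j < normMatrix[i].length; j++) {
      if (!cellVisible(normMatrix[i][j], i, j, threshold)) hidden++;
    }
  }
  return hidden;
}
```

Showing the hidden count next to the slider keeps a high threshold from silently presenting an empty-looking matrix.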
+ +### Pitfall 4: Stale Embedding Coordinates After Prediction Import +**What goes wrong:** User imports predictions, switches to Embeddings tab, but coordinates don't include the new `predLabel` because the TanStack Query cache is stale (staleTime: Infinity). +**Why it happens:** Embedding coordinates query uses `staleTime: Infinity` -- it never refetches automatically. +**How to avoid:** After prediction import completes, invalidate the `embedding-coordinates` query key. The prediction import dialog already invalidates several query keys on success -- add `embedding-coordinates` to that list. +**Warning signs:** Color mode shows all points as "no prediction" (gray) even after importing predictions. + +### Pitfall 5: Multiple Prediction Sources Per Sample +**What goes wrong:** If a sample has predictions from multiple sources (e.g., "model_v1" and "model_v2"), the JOIN in get_coordinates returns duplicate rows per sample. +**Why it happens:** LEFT JOIN on annotations with source != 'ground_truth' matches multiple rows. +**How to avoid:** Either: (1) use a subquery with LIMIT 1 per sample, or (2) accept a `source` query parameter on the coordinates endpoint to filter to one prediction source, or (3) pick the first non-GT source with ROW_NUMBER(). Option (2) is cleanest -- matches how the evaluation panel already handles source selection. +**Warning signs:** Duplicate points in the scatter plot (same x,y but different pred labels). + +### Pitfall 6: Classification-Only Multi-Label GT in Embeddings +**What goes wrong:** If a sample has multiple GT annotations (multi-label), the coordinates JOIN returns duplicate rows. +**Why it happens:** Same as Pitfall 5 but for GT side. +**How to avoid:** Use MIN(gt.category_name) or GROUP BY to collapse to one GT label per sample, matching the pattern in `compute_classification_evaluation`. +**Warning signs:** Point count in scatter differs from embedding count shown in toolbar. 
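The MIN() + GROUP BY collapse from Pitfalls 5 and 6 has a simple client-side equivalent, shown here purely for illustration (the row shape is hypothetical; the backend does this in SQL):

```typescript
// Collapse multiple annotation rows per sample to one deterministic label,
// mirroring SQL's MIN(category_name) ... GROUP BY sample_id.
interface AnnotationRow { sampleId: string; categoryName: string }

function collapseToOneLabel(rows: AnnotationRow[]): Map<string, string> {
  const bySample = new Map<string, string>();
  for (const row of rows) {
    const current = bySample.get(row.sampleId);
    // Keep the lexicographically smallest label, like SQL MIN().
    if (current === undefined || row.categoryName < current) {
      bySample.set(row.sampleId, row.categoryName);
    }
  }
  return bySample;
}
```

One map entry per sample is exactly the invariant Pitfall 6's warning sign checks: the scatter's point count must match the embedding count shown in the toolbar.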
+
+## Code Examples
+
+### Confusion Matrix with Threshold Filter
+
+```tsx
+// confusion-matrix.tsx additions
+const [threshold, setThreshold] = useState(0.01);
+
+// In the header area:
+<div className="flex items-center gap-2">
+  <label className="text-xs">Threshold</label>
+  <input
+    type="range"
+    min={0}
+    max={0.5}
+    step={0.01}
+    value={threshold}
+    onChange={(e) => setThreshold(parseFloat(e.target.value))}
+    className="w-20 accent-blue-500"
+  />
+  <span className="text-xs">{(threshold * 100).toFixed(0)}%</span>
+</div>
+
+// Wrap table in scrollable container:
+<div className="max-h-[500px] max-w-full overflow-auto">
+  <table>...</table>
+</div>
+
+// Cell rendering:
+const showValue = norm >= threshold || ri === ci; // always show diagonal
+```
+
+### Enriched Coordinates Query (Backend)
+
+```python
+# reduction_service.py get_coordinates -- enriched for classification
+def get_coordinates(self, dataset_id: str, cursor, source: str | None = None) -> list[dict]:
+    source_clause = "AND pred.source = ?" if source else ""
+    # Bind parameters in placeholder order: the optional pred.source filter
+    # appears in the JOIN clause before dataset_id in the WHERE clause.
+    params = ([source] if source else []) + [dataset_id]
+
+    result = cursor.execute(f"""
+        SELECT e.sample_id, e.x, e.y, s.file_name, s.thumbnail_path,
+               MIN(gt.category_name) as gt_label,
+               MIN(pred.category_name) as pred_label
+        FROM embeddings e
+        JOIN samples s ON e.sample_id = s.id AND e.dataset_id = s.dataset_id
+        LEFT JOIN annotations gt ON gt.sample_id = s.id AND gt.dataset_id = s.dataset_id
+            AND gt.source = 'ground_truth'
+        LEFT JOIN annotations pred ON pred.sample_id = s.id AND pred.dataset_id = s.dataset_id
+            AND pred.source != 'ground_truth' {source_clause}
+        WHERE e.dataset_id = ? AND e.x IS NOT NULL
+        GROUP BY e.sample_id, e.x, e.y, s.file_name, s.thumbnail_path
+        ORDER BY e.sample_id
+    """, params).fetchall()
+
+    return [
+        {
+            "sampleId": r[0], "x": r[1], "y": r[2],
+            "fileName": r[3], "thumbnailPath": r[4],
+            "gtLabel": r[5], "predLabel": r[6],
+        }
+        for r in result
+    ]
+```
+
+### Color Mode Dropdown and Palette
+
+```tsx
+// embedding-panel.tsx toolbar addition
+const COLOR_MODES = [
+  { value: "default", label: "Default" },
+  { value: "gt_class", label: "GT Class" },
+  { value: "pred_class", label: "Predicted Class" },
+  { value: "correctness", label: "Correct / Incorrect" },
+] as const;
+
+<select value={colorMode} onChange={(e) => setColorMode(e.target.value as ColorMode)}>
+  {COLOR_MODES.map((m) => (
+    <option
+      key={m.value}
+      value={m.value}
+      disabled={(m.value === "pred_class" || m.value === "correctness") && !hasPredictions}
+    >
+      {m.label}
+    </option>
+  ))}
+</select>
+```
+
+### CSS F1 Bar (Sparkline Alternative)
+
+```tsx
+function F1Bar({ f1 }: { f1: number }) {
+  const color = f1 >= 0.8 ? "bg-green-500" : f1 >= 0.5 ? "bg-yellow-500" : "bg-red-500";
+  return (
+    <div className="h-2 w-16 rounded bg-muted">
+      <div className={`h-2 rounded ${color}`} style={{ width: `${f1 * 100}%` }} />
+    </div>
+ ); +} +``` + +## State of the Art + +| Old Approach | Current Approach | When Changed | Impact | +|--------------|------------------|--------------|--------| +| Full HTML table for all cells | Threshold-filtered table with overflow scroll | Phase 17 | Readable at 43+ classes | +| Uniform blue scatter points | Categorical coloring by class/correctness | Phase 17 | Instant visual insight on embedding clusters | +| Raw confusion matrix only | Most-confused pairs summary | Phase 17 | Actionable: top error modes at a glance | +| Numbers-only per-class table | Color-coded F1 bars | Phase 17 | Scan 43 classes in seconds | + +**Deprecated/outdated:** +- Nothing deprecated. All enhancements build on Phase 16 output. + +## Open Questions + +1. **Sparkline interpretation: single-metric bar vs multi-threshold sweep** + - What we know: Each class has one P, R, F1 value at the current confidence threshold. A "sparkline" traditionally implies a time-series line. + - What's unclear: Does the user want a single F1 bar per class, or a mini-chart showing how F1 varies across confidence thresholds? + - Recommendation: Implement a color-coded F1 bar (green/yellow/red) first. If confidence-sweep sparklines are desired, they require a backend change to return F1 at multiple thresholds per class (more complex). Defer to a follow-up. + +2. **Embedding color legend visibility at 43+ classes** + - What we know: A legend for 43 classes takes significant vertical space and many colors are visually similar. + - What's unclear: Should the legend be always-visible, collapsed/expandable, or omitted in favor of hover tooltips? + - Recommendation: Show a scrollable legend panel (max-h with overflow) for GT/Pred class modes. For "Correct/Incorrect" mode, show a simple 3-item legend (correct/incorrect/no prediction). + +3. **Prediction source selection for embedding coloring** + - What we know: Evaluation panel has a source dropdown. Embedding panel does not. 
+ - What's unclear: Should embedding coloring respect a selected prediction source, or always use the first available source? + - Recommendation: Add an optional source query param to the coordinates endpoint. Default to first non-GT source. If the user has multiple prediction sources, they can switch via a dropdown in the embedding toolbar (add only if multiple sources exist). + +## Sources + +### Primary (HIGH confidence) +- Codebase inspection: `frontend/src/components/stats/confusion-matrix.tsx` (current HTML table implementation, 138 lines) +- Codebase inspection: `frontend/src/components/embedding/embedding-scatter.tsx` (deck.gl ScatterplotLayer, getFillColor accessor, updateTriggers pattern) +- Codebase inspection: `frontend/src/components/embedding/embedding-panel.tsx` (toolbar, lasso toggle, hover state) +- Codebase inspection: `frontend/src/types/embedding.ts` (EmbeddingPoint interface -- no gtLabel/predLabel yet) +- Codebase inspection: `frontend/src/components/stats/evaluation-panel.tsx` (ClassificationMetricsCards, ClassificationPerClassTable) +- Codebase inspection: `app/services/reduction_service.py` (get_coordinates SQL, JOIN samples only) +- Codebase inspection: `app/services/classification_evaluation.py` (confusion matrix computation, per-class metrics) +- Codebase inspection: `frontend/src/hooks/use-embeddings.ts` (staleTime: Infinity for coordinates) +- Codebase inspection: `package.json` (Recharts ^3.7.0, deck.gl ^9.2.6) + +### Secondary (MEDIUM confidence) +- deck.gl ScatterplotLayer documentation: `getFillColor` accessor supports per-point RGBA arrays with `updateTriggers` for reactive updates +- Recharts BarChart/LineChart support fixed-size rendering via `width`/`height` props without ResponsiveContainer + +### Tertiary (LOW confidence) +- Canvas vs HTML table performance for large matrices: Based on general web performance knowledge. HTML table with 1,849 cells (43x43) is well within browser capabilities. 
Canvas would be warranted at ~200+ classes (40,000+ cells). + +## Metadata + +**Confidence breakdown:** +- Standard stack: HIGH - No new dependencies, all existing libraries sufficient +- Architecture: HIGH - Clear extension patterns on existing components, minimal backend change (one SQL JOIN enrichment) +- Pitfalls: HIGH - Identified from direct codebase inspection (threshold UX, stale cache, multi-source JOIN, palette limits) + +**Research date:** 2026-02-18 +**Valid until:** 2026-03-18 (internal codebase patterns, stable) diff --git a/.planning/phases/17-classification-polish/17-VERIFICATION.md b/.planning/phases/17-classification-polish/17-VERIFICATION.md new file mode 100644 index 0000000..c21c35e --- /dev/null +++ b/.planning/phases/17-classification-polish/17-VERIFICATION.md @@ -0,0 +1,131 @@ +--- +phase: 17-classification-polish +verified: 2026-02-19T04:01:46Z +status: passed +score: 8/8 must-haves verified +re_verification: false +--- + +# Phase 17: Classification Polish Verification Report + +**Phase Goal:** Classification workflows are production-ready for high-cardinality datasets (43+ classes) with visual aids that surface actionable insights +**Verified:** 2026-02-19T04:01:46Z +**Status:** passed +**Re-verification:** No — initial verification + +--- + +## Goal Achievement + +### Observable Truths (from Success Criteria) + +| # | Truth | Status | Evidence | +|----|----------------------------------------------------------------------------------------------------|------------|----------------------------------------------------------------------------------------------| +| 1 | Confusion matrix renders readably at 43+ classes with threshold filtering and overflow handling | VERIFIED | Threshold slider (0–50%, default 1%), `overflow-auto max-h-[500px]`, compact mode for >20 classes (text-[10px], min-w-[24px], max-w-[80px] truncate) — confusion-matrix.tsx lines 34, 65, 102, 163, 180 | +| 2 | User can color the embedding scatter plot by GT class, 
predicted class, or correct/incorrect status | VERIFIED | `ColorMode` type exported from embedding-scatter.tsx; dropdown in embedding-panel.tsx with all 4 options; `getFillColor` branches on colorMode with CATEGORICAL_PALETTE — embedding-scatter.tsx lines 23, 150–169 | +| 3 | User sees a ranked list of most-confused class pairs derived from the confusion matrix | VERIFIED | `MostConfusedPairs` component in evaluation-panel.tsx (lines 96–191) derives top 10 off-diagonal pairs by raw count; rendered between ConfusionMatrix and per-class table (lines 399–407) | +| 4 | User sees per-class performance sparklines with color-coded thresholds in the metrics table | VERIFIED | `F1Bar` component (lines 86–93): green >= 0.8, yellow >= 0.5, red < 0.5; used in `ClassificationPerClassTable` Performance column (line 262); table has explicit "Performance" header (line 234) | + +**Score: 4/4 success-criteria truths verified** + +--- + +## Must-Have Artifacts (17-01-PLAN.md) + +| Artifact | Provides | Status | Details | +|-----------------------------------------------------------------|----------------------------------------------------------------|------------|---------------------------------------------------------------------------------------------------| +| `frontend/src/components/stats/confusion-matrix.tsx` | Threshold slider, compact cells, overflow scroll container | VERIFIED | Exists, 195 lines, substantive. Contains `threshold`, `hiddenCount`, `isCompact`, `overflow-auto max-h-[500px]`. Imported + used in evaluation-panel.tsx line 22 and rendered at lines 391, 507. | +| `frontend/src/components/stats/evaluation-panel.tsx` | MostConfusedPairs component, F1Bar component in per-class table | VERIFIED | Exists, 524 lines, substantive. Contains `MostConfusedPairs` (line 96), `F1Bar` (line 86), `ClassificationPerClassTable` with Performance column (line 234), and `` — lines 27, 73–77, 274–283, 339. 
|
+
+---
+
+## Key Link Verification
+
+### 17-01-PLAN.md Key Links
+
+| From | To | Via | Status | Details |
+|----------------------------------|---------------------------------|--------------------------|---------|------------------------------------------------------------------------------------|
+| `evaluation-panel.tsx` | `confusion-matrix.tsx` | `<ConfusionMatrix>` render | WIRED | `<ConfusionMatrix ... />` at line 339. |
+| `reduction_service.py` | `frontend/src/types/embedding.ts` | API response shape includes gtLabel, predLabel | WIRED | Backend returns `"gtLabel": r[5], "predLabel": r[6]` (lines 167–168). Frontend type has matching `gtLabel?`, `predLabel?` (lines 16–17). |
+| `datasets/[datasetId]/page.tsx` | `embedding-panel.tsx` | `datasetType` prop threading | WIRED | `<EmbeddingPanel ... />` at line 117. |
+
+### Bonus Key Link (not in plan frontmatter)
+
+| From | To | Via | Status | Details |
+|----------------------------------------|---------------------------------|--------------------------------------------------------|---------|--------------------------------------------------------------------------------|
+| `use-import-predictions.ts` | `embedding-panel.tsx` | `embedding-coordinates` query key invalidation on import | WIRED | `qc.invalidateQueries({ queryKey: ["embedding-coordinates", datasetId] })` at line 23 of use-import-predictions.ts. |
+
+---
+
+## Requirements Coverage
+
+All 4 phase success criteria map directly to verified truths above. No unmet requirements found.
+
+---
+
+## Anti-Patterns Found
+
+None. Zero TODO/FIXME/placeholder comments in any modified files. No stub implementations (empty handlers, static returns, or unreachable branches). TypeScript compiler (`npx tsc --noEmit`) exits with zero errors.
+ +--- + +## Commit Verification + +All four commits documented in SUMMARY files confirmed to exist: + +| Commit | Description | +|----------|----------------------------------------------------------| +| `10a3230`| feat(17-01): add threshold filtering and overflow scroll to confusion matrix | +| `660d287`| feat(17-01): add most-confused pairs and F1 bars to classification eval | +| `4ff366a`| feat(17-02): enrich coordinates endpoint with GT/pred labels | +| `1f4c858`| feat(17-02): add color mode dropdown and categorical coloring to embedding scatter | + +--- + +## Human Verification Recommended + +The following items pass automated checks but benefit from visual confirmation: + +### 1. Confusion Matrix Readability at 43+ Classes + +**Test:** Load a classification dataset with 43+ classes and open the Evaluation tab. Adjust the threshold slider. +**Expected:** Matrix cells below threshold disappear, "N cells hidden" counter updates, labels are truncated with ellipsis, cell values use 10px font. Matrix scrolls vertically/horizontally without breaking layout. +**Why human:** Cell density, truncation appearance, and scroll UX cannot be verified programmatically. + +### 2. Color Mode Visual Correctness + +**Test:** With a classification dataset that has predictions imported, open the Embeddings tab, select "GT Class" then "Predicted Class" then "Correct / Incorrect" from the dropdown. +**Expected:** Points change color per the Tableau 20 palette (GT Class / Predicted Class), or green/red/gray (Correct / Incorrect). Lasso selection still overrides coloring. +**Why human:** Color rendering accuracy and visual distinction between modes requires visual inspection. + +### 3. Most Confused Pairs Click-Through + +**Test:** In the Evaluation tab for a classification dataset, click a row in the "Most Confused Pairs" table. +**Expected:** The UI switches to the Grid tab and filters images to only those misclassified in that direction. 
+**Why human:** State transitions and filter application require runtime verification. + +--- + +## Gaps Summary + +No gaps. All must-haves from both 17-01-PLAN.md and 17-02-PLAN.md are verified at all three levels (exists, substantive, wired). The phase goal is achieved. + +--- + +_Verified: 2026-02-19T04:01:46Z_ +_Verifier: Claude (gsd-verifier)_ diff --git a/.planning/research/ARCHITECTURE.md b/.planning/research/ARCHITECTURE.md index 536938c..22a4287 100644 --- a/.planning/research/ARCHITECTURE.md +++ b/.planning/research/ARCHITECTURE.md @@ -1,1057 +1,549 @@ -# Architecture Research: v1.1 Feature Integration +# Architecture Patterns: Classification Dataset Support -**Domain:** CV Dataset Introspection Tooling -- Feature Integration into Existing Architecture -**Researched:** 2026-02-12 -**Confidence:** HIGH (grounded in codebase analysis of 12,720 LOC across 50+ source files) +**Domain:** Single-label classification integration into existing detection-centric DataVisor +**Researched:** 2026-02-18 +**Confidence:** HIGH -- based on direct codebase analysis, no external dependencies needed --- -## Existing Architecture Snapshot +## Executive Summary -Before defining integration points, here is the current v1.0 architecture as built (not as planned -- verified against actual source files): +Classification support requires threading a `dataset_type` discriminator through every layer of the stack: schema, ingestion, API responses, frontend rendering, and evaluation. The key architectural decision is to **reuse the existing `annotations` table with nullable bbox columns** rather than creating a separate table. This preserves all existing query patterns, filtering, and statistics while classification annotations simply have `NULL` bbox values. 
The frontend conditionally renders class labels (pill/chip) vs bounding boxes based on the dataset type, and the evaluation service branches between detection metrics (mAP/IoU) and classification metrics (accuracy/precision/recall/F1). -``` -CURRENT ARCHITECTURE (v1.0 -- 12,720 LOC) -========================================== - -Frontend (Next.js 16 + React 19) Backend (FastAPI + Python 3.14) --------------------------------------- ------------------------------------ -app/page.tsx -- Dataset list app/main.py -- Lifespan, CORS, router mounts -app/datasets/[id]/ -- Dataset view app/config.py -- Pydantic Settings (env prefix DATAVISOR_) - app/dependencies.py -- DI: get_db, get_cursor, get_*_service -3 Zustand stores: - filter-store.ts -- Filters, selection 9 Routers: - ui-store.ts -- Modal, tabs, sources datasets.py, samples.py, images.py, views.py, - embedding-store.ts -- Lasso selection statistics.py, embeddings.py, similarity.py, - agent.py, vlm.py -14 Hooks (TanStack Query): - use-samples.ts -- Infinite scroll 7 Services: - use-annotations.ts -- Batch + per-sample ingestion.py, embedding_service.py, reduction_service.py, - use-error-analysis.ts similarity_service.py, vlm_service.py, - use-evaluation.ts error_analysis.py, evaluation.py, agent_service.py - use-embedding-progress.ts filter_builder.py, image_service.py - use-vlm-progress.ts - ... (8 more) Data Layer: - DuckDB (data/datavisor.duckdb) -- 6 tables -lib/api.ts -- apiFetch, apiPost, etc. 
Qdrant (data/qdrant/) -- local mode, disk-persisted -lib/constants.ts -- API_BASE, PAGE_SIZE StorageBackend (fsspec: local + GCS) -lib/color-hash.ts -- Deterministic class colors - -Component Tree: DuckDB Tables: - grid/image-grid.tsx (TanStack Virtual) datasets, samples, annotations, categories, - grid/grid-cell.tsx saved_views, embeddings - grid/annotation-overlay.tsx (SVG-based) - detail/sample-modal.tsx (HTML dialog) - detail/annotation-list.tsx - detail/similarity-panel.tsx - embedding/embedding-scatter.tsx (deck.gl) - embedding/lasso-overlay.tsx - filters/filter-sidebar.tsx - stats/stats-dashboard.tsx (6 sub-panels) - toolbar/auto-tag-button.tsx -``` +--- -### Key Architectural Properties +## Existing Architecture Snapshot (Relevant Surfaces) -1. **DuckDB is single-connection, cursor-per-request** (`app/dependencies.py:24-32`) -2. **Qdrant runs in LOCAL mode** (no Docker service -- `QdrantClient(path=...)` in `similarity_service.py:27`) -3. **SSE pattern established** -- 4 existing SSE streams (ingestion, embeddings, reduction, VLM) -4. **Services are injected via `app.state`** at lifespan startup, retrieved via `get_*` dependencies -5. **Annotation overlay uses SVG** (NOT react-konva) -- `annotation-overlay.tsx` renders `` with `` elements -6. **Frontend talks to `http://localhost:8000`** by default (`NEXT_PUBLIC_API_URL` env var) -7. **No auth exists** -- CORS allows all origins (`allow_origins=["*"]`) -8. **No Docker files exist** -- project runs via `uvicorn` and `next dev` directly +Before defining integration points, here are the exact existing structures that classification support touches: ---- +### DuckDB Schema (from `duckdb_repo.py`) +```sql +-- datasets: NO dataset_type column. format is always "coco". +datasets(id, name, format, source_path, image_dir, image_count, + annotation_count, category_count, prediction_count, created_at, metadata) -## Feature 1: Docker Deployment +-- annotations: bbox columns are NOT NULL. 
Every row must have bbox values. +annotations(id, dataset_id, sample_id, category_name, + bbox_x DOUBLE NOT NULL, bbox_y DOUBLE NOT NULL, + bbox_w DOUBLE NOT NULL, bbox_h DOUBLE NOT NULL, + area, is_crowd, source, confidence, metadata) -### Compose Topology +-- samples: dataset-agnostic, works for both detection and classification +samples(id, dataset_id, file_name, width, height, thumbnail_path, split, metadata, image_dir, tags) -``` -docker-compose.yml -================== - - +-----------------+ - | nginx | :80 / :443 - | (reverse proxy)| - +--------+--------+ - | - +--------------+--------------+ - | | - +---------v--------+ +---------v--------+ - | backend | | frontend | - | FastAPI + DuckDB| | Next.js 16 | - | (uvicorn :8000) | | (standalone) | - | | | (:3000) | - +--------+---------+ +------------------+ - | - +--------v---------+ - | qdrant | - | (qdrant/qdrant) | - | :6333 (REST) | - | :6334 (gRPC) | - +-------------------+ - -Volumes: - - data_volume:/app/data (DuckDB + thumbnails, mounted into backend) - - qdrant_storage:/qdrant/storage (Qdrant persistent data) - - images:/data/images (bind mount for local image datasets) +-- categories: dataset-agnostic, works for both types +categories(dataset_id, category_id, name, supercategory) ``` -### Integration Points +### Ingestion Pipeline (from `ingestion.py`, `coco_parser.py`, `folder_scanner.py`) +- `FolderScanner` detects COCO layouts only (checks for `"images"` key in JSON) +- `IngestionService.ingest_with_progress()` hardcodes `COCOParser()` +- `ScanResult.format` is always `"coco"` +- All parsers yield DataFrames with bbox columns -**Files to create:** -| File | Purpose | -|------|---------| -| `Dockerfile.backend` | Multi-stage: Python 3.14, install deps, copy app/, expose 8000 | -| `Dockerfile.frontend` | Multi-stage: Node 22, build standalone, expose 3000 | -| `docker-compose.yml` | 4 services: backend, frontend, qdrant, nginx | -| `nginx/default.conf` | Reverse proxy: `/api/*` -> backend:8000, `/*` 
-> frontend:3000 | -| `.env.docker` | Docker-specific env vars | +### Evaluation (from `evaluation.py`) +- `compute_evaluation()` builds `sv.Detections` objects with xyxy bounding boxes +- IoU matching is hardcoded throughout (no concept of non-spatial matching) +- `_load_detections()` queries `bbox_x, bbox_y, bbox_w, bbox_h` from annotations +- Response model: `EvaluationResponse` has `pr_curves`, `ap_metrics`, `iou_threshold` -**Files to modify:** -| File | Change | Rationale | -|------|--------|-----------| -| `app/config.py` | Add `qdrant_url` setting (default `None` = local mode, set to `http://qdrant:6333` in Docker) | Switch between local Qdrant (dev) and Docker Qdrant (prod) | -| `app/services/similarity_service.py` | Conditional: `QdrantClient(path=...)` vs `QdrantClient(url=...)` based on `qdrant_url` setting | Current code hardcodes local mode | -| `frontend/next.config.ts` | Add `output: "standalone"` for Docker-optimized builds | Reduces image size from ~1GB to ~100MB | -| `frontend/src/lib/constants.ts` | Already uses `NEXT_PUBLIC_API_URL` env var -- no change needed | Works as-is with Docker | +### Frontend (from `annotation-overlay.tsx`, `grid-cell.tsx`, `sample-modal.tsx`) +- `AnnotationOverlay` renders SVG `` elements using `ann.bbox_x/y/w/h` +- `Annotation` type has required `bbox_x/y/w/h: number` fields +- `SampleModal` shows annotation editor (Konva bbox editing), annotation table with bbox columns +- `EvaluationPanel` shows PR curves, mAP cards, per-class AP table, confusion matrix -### Qdrant Mode Switch Design +--- -The critical architecture decision is Qdrant's mode. 
Currently `SimilarityService.__init__` creates a local-mode client: +## Recommended Architecture -```python -# CURRENT (app/services/similarity_service.py:27) -self.client = QdrantClient(path=str(path)) - -# PROPOSED -- conditional based on settings -settings = get_settings() -if settings.qdrant_url: - # Docker mode: connect to Qdrant service - self.client = QdrantClient(url=settings.qdrant_url) -else: - # Dev mode: local embedded storage - path = Path(qdrant_path) - path.mkdir(parents=True, exist_ok=True) - self.client = QdrantClient(path=str(path)) -``` +### High-Level Integration Pattern -### DuckDB in Docker - -DuckDB is embedded (in-process) -- it runs INSIDE the backend container. The `.duckdb` file must persist across container restarts via a Docker volume: - -```yaml -# docker-compose.yml (backend service) -services: - backend: - build: - context: . - dockerfile: Dockerfile.backend - volumes: - - data_volume:/app/data # DuckDB + thumbnails persist here - - /path/to/images:/data/images:ro # Bind-mount image datasets (read-only) - environment: - - DATAVISOR_DB_PATH=/app/data/datavisor.duckdb - - DATAVISOR_THUMBNAIL_CACHE_DIR=/app/data/thumbnails - - DATAVISOR_QDRANT_URL=http://qdrant:6333 - ports: - - "8000:8000" ``` - -**Single worker constraint remains:** DuckDB requires `--workers 1` in Docker too. The `CMD` should be `uvicorn app.main:app --host 0.0.0.0 --port 8000 --workers 1`. 
- -### Backend Dockerfile Pattern - -```dockerfile -# Multi-stage build for Python 3.14 + uv -FROM python:3.14-slim AS builder -WORKDIR /app -COPY pyproject.toml uv.lock ./ -RUN pip install uv && uv sync --frozen --no-dev - -FROM python:3.14-slim AS runtime -WORKDIR /app -COPY --from=builder /app/.venv /app/.venv -COPY app/ ./app/ -COPY plugins/ ./plugins/ -ENV PATH="/app/.venv/bin:$PATH" -EXPOSE 8000 -CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"] + dataset_type = "detection" | "classification" + | + +---------------------+---------------------+ + | | | + Ingestion Rendering Evaluation + (parser per (conditional (metric strategy + format) overlay) per type) ``` -**GPU consideration:** The base image does NOT include CUDA. For VLM/embedding features in Docker, users either (a) use CPU-only inference (slow but works), or (b) use `nvidia/cuda` base image with GPU passthrough. Recommend CPU-only as default Docker profile, GPU as optional override. - -### Frontend Dockerfile Pattern - -```dockerfile -FROM node:22-alpine AS builder -WORKDIR /app -COPY frontend/package.json frontend/package-lock.json ./ -RUN npm ci -COPY frontend/ . -ENV NEXT_PUBLIC_API_URL=/api -RUN npm run build - -FROM node:22-alpine AS runner -WORKDIR /app -COPY --from=builder /app/.next/standalone ./ -COPY --from=builder /app/.next/static ./.next/static -COPY --from=builder /app/public ./public -EXPOSE 3000 -CMD ["node", "server.js"] -``` +The `dataset_type` field on the `datasets` table is the single source of truth that drives conditional behavior across all layers. Every component reads this value and branches accordingly. Simple if/else branching at well-defined boundary points -- no polymorphism or plugin system needed. 
-### Nginx Reverse Proxy +### Component Boundaries -```nginx -# nginx/default.conf -upstream backend { - server backend:8000; -} -upstream frontend { - server frontend:3000; -} +| Component | Responsibility | Communicates With | Change Type | +|-----------|---------------|-------------------|-------------| +| `datasets` table | Stores `dataset_type` column | All components read it | ADD column | +| `annotations` table | Stores all annotations (bbox nullable for classification) | Parsers write, API reads | ALTER bbox to nullable | +| `ClassificationFolderParser` | Parses folder-of-folders layout | IngestionService | NEW | +| `ClassificationPredictionParser` | Parses classification prediction CSV/JSON | Ingestion router | NEW | +| `IngestionService` | Routes to correct parser based on format | Parsers, DuckDB | MODIFY | +| `FolderScanner` | Auto-detects dataset format | Ingestion router | MODIFY | +| `classification_evaluation.py` | Computes accuracy/F1/confusion matrix | Statistics router | NEW | +| `AnnotationOverlay` (frontend) | Renders bbox SVG or class label pill | GridCell, SampleModal | MODIFY | +| `EvaluationPanel` (frontend) | Shows detection or classification metrics | Stats dashboard | MODIFY | +| `DatasetResponse` / `AnnotationResponse` | API response models | Frontend types | MODIFY | + +### Data Flow: Classification Ingestion -server { - listen 80; - - # API routes -> FastAPI backend - location /api/ { - proxy_pass http://backend/; - proxy_set_header Host $host; - proxy_set_header X-Real-IP $remote_addr; - proxy_buffering off; # Required for SSE streams - proxy_read_timeout 300s; # Long-running SSE connections - } - - # Everything else -> Next.js frontend - location / { - proxy_pass http://frontend; - proxy_set_header Host $host; - } -} +``` +User points scanner at folder + | + v +FolderScanner detects structure: + folder-of-folders? -> format = "classification_folders" + CSV with labels? -> format = "classification_csv" + COCO with bbox? 
-> format = "coco" (existing, unchanged) + | + v +ScanResult returned with format string + detected splits + | + v +IngestionService dispatches to ClassificationFolderParser + | + v +Parser yields sample batches (same schema as detection -- width/height from PIL) + | + v +Parser yields annotation batches: + - category_name = folder name (class label) + - bbox_x/y/w/h = NULL + - area = NULL + - source = "ground_truth" + - ONE annotation per sample (single-label classification) + | + v +DuckDB bulk insert (same INSERT INTO annotations pattern) + | + v +datasets row created with dataset_type = "classification" ``` -### Build Order Implication +### Data Flow: Classification Evaluation -Docker deployment is **independent of all other v1.1 features** and should be built first. It creates the deployment scaffold that other features (auth, ingestion UI) build upon. +``` +GET /datasets/{id}/evaluation + | + v +Router reads dataset_type from datasets table + | + v +if dataset_type == "classification": + compute_classification_evaluation() # NEW function +else: + compute_evaluation() # existing detection path + | + v +Classification evaluation: + 1. Load GT: one annotation per sample (source='ground_truth') + 2. Load preds: highest-confidence prediction per sample + 3. Match by sample_id (no IoU, no spatial matching) + 4. Build confusion matrix (no "background" row/col) + 5. Compute per-class precision/recall/F1 + 6. Compute overall accuracy, macro-F1, weighted-F1 + 7. Return ClassificationEvaluationResponse +``` --- -## Feature 2: Single-User Auth Middleware +## Key Architectural Decisions -### Architecture Decision: Dependency Injection (not middleware) +### Decision 1: Reuse `annotations` Table with Nullable Bbox -**Recommendation: Use FastAPI's `Depends()` pattern, NOT ASGI middleware.** +**Recommendation:** Reuse the existing `annotations` table. Make bbox columns nullable. -Rationale from research and codebase analysis: -1. 
The codebase already uses `Depends()` extensively (9 dependency functions in `dependencies.py`). Adding auth as another dependency is consistent. -2. Middleware approach would wrap ALL routes including `/health` and SSE streams, requiring complex exclusion logic. -3. The FastAPI community consensus (GitHub Discussion #8867, #3277) strongly favors DI for auth because it is testable, composable, and explicit per-route. -4. Single-user auth is simple: one username/password from environment variables, verified via HTTP Basic Auth. +**Why this is clearly the right choice:** +- Every existing query path (statistics, filtering, batch annotations, triage) filters on `category_name`, `source`, `dataset_id` -- none require non-null bbox. +- The statistics endpoint (`GROUP BY category_name`) works identically for classification. +- Saved views, tags, embeddings, similarity search -- all sample-level features work without changes. +- A separate `classification_annotations` table would require parallel query paths in every service, doubling the maintenance surface. 
-### Integration Point +**Schema migration (in `duckdb_repo.py:initialize_schema`):** +```sql +-- Add dataset_type to datasets +ALTER TABLE datasets ADD COLUMN IF NOT EXISTS dataset_type VARCHAR DEFAULT 'detection'; -**File to create:** -| File | Purpose | -|------|---------| -| `app/auth.py` | `verify_credentials()` dependency using `fastapi.security.HTTPBasic` | +-- Make bbox columns nullable for classification support +-- DuckDB supports DROP NOT NULL via ALTER TABLE +ALTER TABLE annotations ALTER COLUMN bbox_x DROP NOT NULL; +ALTER TABLE annotations ALTER COLUMN bbox_y DROP NOT NULL; +ALTER TABLE annotations ALTER COLUMN bbox_w DROP NOT NULL; +ALTER TABLE annotations ALTER COLUMN bbox_h DROP NOT NULL; +``` -**Files to modify:** -| File | Change | -|------|--------| -| `app/config.py` | Add `auth_username: str = "admin"` and `auth_password: str` settings | -| `app/main.py` | Add auth dependency to ALL router includes (single line each) | -| `app/routers/*.py` | No changes -- auth applied at router level via `dependencies=[Depends(verify_auth)]` | +**Risk note (MEDIUM confidence):** DuckDB's `ALTER COLUMN DROP NOT NULL` syntax needs verification against current DuckDB version. Fallback approach: change the `CREATE TABLE IF NOT EXISTS annotations` statement to remove `NOT NULL` from bbox columns. Since the table already exists, this alone does nothing -- but combined with a migration that creates a new table, copies data, drops old, and renames, it works. Verify during implementation. -### Implementation Pattern +**Simpler fallback:** Change the `CREATE TABLE` statement to not have `NOT NULL` on bbox columns. For existing databases, store classification bbox as `0.0` instead of `NULL`. This avoids ALTER entirely but is semantically less clean. The code paths would check `bbox_w == 0 AND bbox_h == 0` as "no bbox" rather than `IS NULL`. 
-```python -# app/auth.py -import secrets -from fastapi import Depends, HTTPException, status -from fastapi.security import HTTPBasic, HTTPBasicCredentials -from app.config import get_settings - -security = HTTPBasic() - -def verify_auth(credentials: HTTPBasicCredentials = Depends(security)) -> str: - """Verify single-user basic auth credentials. - - Returns the username on success. Raises 401 on failure. - Uses secrets.compare_digest to prevent timing attacks. - """ - settings = get_settings() - correct_username = secrets.compare_digest( - credentials.username.encode("utf-8"), - settings.auth_username.encode("utf-8"), - ) - correct_password = secrets.compare_digest( - credentials.password.encode("utf-8"), - settings.auth_password.encode("utf-8"), - ) - if not (correct_username and correct_password): - raise HTTPException( - status_code=status.HTTP_401_UNAUTHORIZED, - detail="Invalid credentials", - headers={"WWW-Authenticate": "Basic"}, - ) - return credentials.username -``` +### Decision 2: `dataset_type` on `datasets` Table -### Router-Level Application +**Recommendation:** Yes. Add `dataset_type VARCHAR DEFAULT 'detection'`. -Apply auth at the router include level in `main.py` so every endpoint on every router requires auth, without modifying individual router files: +**Why:** +- Single source of truth for conditional behavior across all layers. +- Default of `'detection'` means zero migration impact on existing datasets. +- Frontend reads it once per dataset load and threads it through props. +- Evaluation router uses it to select the metric strategy. +- Future types (segmentation, keypoints) extend the same pattern. -```python -# app/main.py -- modified includes -from app.auth import verify_auth +**Not on annotations:** All annotations in a dataset share the same type. There is no mixed detection+classification dataset in DataVisor's model. The dataset-level discriminator is sufficient. 
-app.include_router(datasets.router, dependencies=[Depends(verify_auth)]) -app.include_router(samples.router, dependencies=[Depends(verify_auth)]) -# ... repeat for all routers +### Decision 3: Frontend Conditional Rendering -# /health remains unprotected (no dependency) -@app.get("/health") -async def health_check() -> dict[str, str]: - return {"status": "ok"} -``` +**Recommendation:** Thread `datasetType` through component props from the dataset query. Branch at component boundaries, not deep inside components. -### Frontend Auth Integration +**Where conditional rendering applies:** -The frontend `api.ts` functions (`apiFetch`, `apiPost`, `apiPatch`, `apiDelete`) all call `fetch()` directly. For Basic Auth, add the `Authorization` header: +| Component | Detection Behavior | Classification Behavior | +|-----------|-------------------|------------------------| +| `AnnotationOverlay` | SVG bbox rectangles with class labels | Class label pill/chip in top-left corner | +| `GridCell` | Overlay shows boxes | Overlay shows label pill | +| `SampleModal` image area | SVG bbox overlays | Class label overlay (no boxes) | +| `SampleModal` annotation table | Columns: class, bbox, area, source | Columns: class, confidence, source (no bbox) | +| `AnnotationEditor` (Konva) | Draggable/resizable bbox editing | Class picker dropdown (no Konva canvas) | +| `DrawLayer` / `EditableRect` | Shown in edit mode | Hidden (no bbox to draw) | +| `EvaluationPanel` header | mAP@50, mAP@75, mAP@50:95 cards | Accuracy, Macro-F1, Weighted-F1 cards | +| `PRCurveChart` | Shown (per-class PR curves) | Hidden (not meaningful for classification) | +| `PerClassTable` | Columns: AP50, AP75, AP50:95, P, R | Columns: Precision, Recall, F1, Support | +| `ConfusionMatrix` | Has "background" row/col for FP/FN | No "background" -- pure NxN class matrix | +| `ErrorAnalysis` panel | IoU-based error categories | Misclassification categories (simpler) | +| `PredictionImportDialog` | Accepts COCO results JSON | 
Accepts classification CSV/JSON |
+| Filter sidebar | Bbox area filter shown | Bbox area filter hidden |
+
+**Implementation pattern:**
+```typescript
+// Dataset type flows from page -> components
+const { data: dataset } = useDataset(datasetId);
+const datasetType = dataset?.dataset_type ?? "detection";
+
+// AnnotationOverlay branches at the top
+export function AnnotationOverlay({ annotations, imageWidth, imageHeight, datasetType }) {
+  if (datasetType === "classification") {
+    // Label-chip component (name illustrative)
+    return <ClassLabelPill label={annotations[0]?.category_name} />;
+  }
+  // Existing SVG bbox rendering unchanged
+  return <svg viewBox={`0 0 ${imageWidth} ${imageHeight}`}>{/* existing <rect> elements */}</svg>;
+}
+```
- ---- - -## Feature 3: Smart Ingestion UI - -### Current Ingestion Flow - -The current ingestion is API-only (`POST /datasets/ingest` with `annotation_path` and `image_dir` as strings). There is no UI -- users must know exact file paths. - -### Smart Ingestion Architecture - -The smart ingestion feature adds three new components: - -``` -User points at folder - | - v -POST /datasets/scan { root_path } <-- NEW endpoint - | - v -FolderScanner service <-- NEW service - - Walk directory tree - - Detect COCO annotation files (*.json with "images" key) - - Detect image directories (dirs with .jpg/.png files) - - Detect train/val/test splits by directory naming - - Return structured scan result - | - v -Response: ScanResult { - annotation_files: [{ path, format, est_images }], - image_dirs: [{ path, image_count, split_guess }], - suggested_imports: [{ annotation, image_dir, split, name }] -} - | - v -Frontend: Ingestion wizard UI <-- NEW page/component - - Shows detected files and directories - - User confirms/adjusts import configuration - - Clicks "Import" -> triggers SSE ingestion stream - | - v -POST /datasets/ingest (existing endpoint) <-- REUSE with minor extension - - SSE progress stream (existing pattern) - - New: accept optional split parameter -``` +**Recommendation:** Add `compute_classification_evaluation()` as a separate function. Do NOT retrofit the detection evaluation to handle both. 
-### Backend Integration Points - -**Files to create:** -| File | Purpose | -|------|---------| -| `app/services/folder_scanner.py` | `FolderScanner` class: walk dir tree, detect formats, suggest imports | -| `app/models/scan.py` | Pydantic models: `ScanRequest`, `ScanResult`, `DetectedFile`, `SuggestedImport` | -| `app/routers/ingestion.py` | New router: `POST /ingestion/scan`, mounted under auth | - -**Files to modify:** -| File | Change | -|------|--------| -| `app/main.py` | Add `app.include_router(ingestion.router)` | -| `app/dependencies.py` | Add `get_folder_scanner()` dependency | -| `app/models/dataset.py` | Add optional `split` field to `IngestRequest` | -| `app/services/ingestion.py` | Pass `split` to image batch builder (set `split` column during parsing) | -| `app/ingestion/coco_parser.py` | Accept optional `split` parameter in `build_image_batches()` | - -**Frontend files to create:** -| File | Purpose | -|------|---------| -| `frontend/src/app/ingest/page.tsx` | Ingestion wizard page | -| `frontend/src/components/ingest/scan-results.tsx` | Display detected files with checkboxes | -| `frontend/src/components/ingest/import-progress.tsx` | SSE progress display (reuses existing pattern) | -| `frontend/src/hooks/use-scan.ts` | TanStack Query mutation for scan endpoint | -| `frontend/src/hooks/use-ingest.ts` | SSE hook for ingestion progress (similar to `use-embedding-progress.ts`) | -| `frontend/src/types/scan.ts` | TypeScript types matching backend Pydantic models | - -### Folder Scanner Design +**Why the detection code cannot be reused:** +- Detection evaluation builds `sv.Detections` objects with xyxy bounding boxes -- classification has none. +- Detection uses IoU matching to determine TP/FP/FN -- classification matches by sample_id. +- Detection's confusion matrix includes a "background" class -- classification does not. +- Detection computes AP (area under PR curve at multiple IoU thresholds) -- classification computes F1. 
+- The `_load_detections()` helper queries bbox columns that are NULL for classification. +**New response model:** ```python -# app/services/folder_scanner.py - -class FolderScanner: - """Walk a directory tree and detect importable CV datasets. - - Detection heuristics: - 1. JSON files containing "images" key at top level -> COCO annotation file - 2. Directories containing 10+ image files (.jpg/.jpeg/.png) -> image directory - 3. Directory names matching train/val/test/validation -> split assignment - 4. Paired annotation + image directory at same level -> suggested import - """ +class ClassificationEvaluationResponse(BaseModel): + """Evaluation payload for classification datasets.""" + accuracy: float + macro_precision: float + macro_recall: float + macro_f1: float + weighted_f1: float + per_class_metrics: list[ClassificationPerClassMetrics] + confusion_matrix: list[list[int]] + confusion_matrix_labels: list[str] + conf_threshold: float + +class ClassificationPerClassMetrics(BaseModel): + class_name: str + precision: float + recall: float + f1: float + support: int # number of GT samples for this class +``` + +**Router branching:** +```python +@router.get("/{dataset_id}/evaluation") +def get_evaluation(dataset_id, source, iou_threshold, conf_threshold, split, db): + cursor = db.connection.cursor() + dataset_type = cursor.execute( + "SELECT dataset_type FROM datasets WHERE id = ?", [dataset_id] + ).fetchone()[0] - def scan(self, root_path: str) -> ScanResult: - ... + if dataset_type == "classification": + return compute_classification_evaluation( + cursor, dataset_id, source, conf_threshold, split + ) + else: + return compute_evaluation( + cursor, dataset_id, source, iou_threshold, conf_threshold, split + ) ``` -### SSE Pattern Reuse - -The existing SSE pattern in `datasets.py:37-73` (wrapping `IngestionService.ingest_with_progress()` as a `StreamingResponse`) is directly reusable. 
The new ingestion UI will call the same `POST /datasets/ingest` endpoint and consume the same SSE event format. - -### Build Order Implication - -Smart ingestion depends on Docker (for deployment context) and auth (new endpoints need auth). Can be built independently of annotation editing and error triage. - ---- - -## Feature 4: Annotation Editing (Browser-Based BBox Editing) - -### Critical Observation: Current Overlay is SVG, NOT react-konva - -The milestone context mentions react-konva, but **react-konva is NOT in the project**. The current annotation rendering in `annotation-overlay.tsx` is pure SVG: +**Frontend union type:** +```typescript +// The hook returns different shapes based on dataset_type +// Use discriminated union or simply check for presence of `accuracy` field +type EvaluationData = DetectionEvaluationResponse | ClassificationEvaluationResponse; -```tsx -// CURRENT: frontend/src/components/grid/annotation-overlay.tsx - - {annotations.map((ann) => ( - - ))} - +function isClassificationEval(data: EvaluationData): data is ClassificationEvaluationResponse { + return "accuracy" in data; +} ``` -### Architecture Decision: Use Konva ONLY in Detail Modal, Keep SVG for Grid +### Decision 5: Ingestion Auto-Detection via FolderScanner -**Recommendation: Do NOT replace the grid overlay with react-konva.** Introduce react-konva ONLY in the sample detail modal for editing. - -Rationale: -1. The SVG grid overlay works well for read-only display at scale (dozens of cells visible simultaneously). Replacing SVG with canvas per grid cell would multiply canvas contexts and hurt performance. -2. Editing happens in the detail modal (`sample-modal.tsx`), where only ONE image is displayed at a time. This is the right place for an interactive canvas. -3. Konva's `Transformer` component provides native drag/resize handles for bounding boxes. -4. The grid overlay continues rendering the latest annotation data from the server (refetched after edits). 
- -### Component Architecture for Annotation Editing +**Recommendation:** Extend `FolderScanner.scan()` to detect classification layouts BEFORE falling through to COCO detection. Classification layouts are cheaper to detect (structural directory patterns, no JSON parsing needed). +**Classification layout: folder-of-folders (ImageNet-style)** ``` -sample-modal.tsx (MODIFIED) - | - +-- [Read-only mode] AnnotationOverlay (SVG, existing) - | - +-- [Edit mode] AnnotationEditor (NEW, react-konva) - | - +-- with - | | - | +-- (full-res image as Konva.Image) - | +-- per annotation (draggable) - | +-- (attached to selected rect) - | - +-- EditToolbar (NEW) - | | - | +-- Select / Move / Delete buttons - | +-- Save / Cancel buttons - | - +-- Zustand: useAnnotationEditStore (NEW store) - | - +-- editingAnnotations: Annotation[] (local copy during edit) - +-- selectedAnnotationId: string | null - +-- isDirty: boolean - +-- saveEdits() -> PATCH /annotations/batch - +-- discardEdits() +dataset/ + train/ + cat/ # class label = directory name + img001.jpg + img002.jpg + dog/ + img003.jpg + val/ + cat/ + img004.jpg + dog/ + img005.jpg ``` -### Integration Points +Detection heuristic: +1. Root or split subdirectories contain subdirectories whose names are NOT known split names. +2. Those subdirectories contain image files (no JSON files). +3. Multiple sibling class directories exist (>= 2 classes). 
-**New npm dependency:** +**Classification layout: CSV labels** ``` -npm install react-konva konva +dataset/ + labels.csv # columns: filename, label (or image, class) + images/ + img001.jpg ``` -**Files to create:** -| File | Purpose | -|------|---------| -| `frontend/src/components/detail/annotation-editor.tsx` | react-konva Stage with draggable/resizable Rects | -| `frontend/src/components/detail/edit-toolbar.tsx` | Edit mode controls (select, delete, save, cancel) | -| `frontend/src/stores/annotation-edit-store.ts` | Zustand store for edit-mode state (NEW 4th store) | -| `frontend/src/types/annotation-edit.ts` | Types for annotation edit operations | - -**Backend files to create:** -| File | Purpose | -|------|---------| -| `app/routers/annotations.py` | New router: `PATCH /annotations/batch`, `DELETE /annotations/{id}` | -| `app/models/annotation.py` (modify) | Add `AnnotationUpdateRequest` model | - -**Files to modify:** -| File | Change | -|------|--------| -| `frontend/src/components/detail/sample-modal.tsx` | Add toggle between read (SVG) and edit (Konva) modes | -| `frontend/package.json` | Add `react-konva` and `konva` dependencies | -| `app/main.py` | Add `app.include_router(annotations.router)` | - -### Backend API for Annotation Updates +Detection heuristic: +1. Root contains a CSV file. +2. CSV has 2+ columns, first column values match filenames in an image directory. +**Scanner modification (in `folder_scanner.py`):** ```python -# app/routers/annotations.py (NEW) - -@router.patch("/annotations/batch") -def update_annotations_batch( - request: AnnotationBatchUpdateRequest, - db: DuckDBRepo = Depends(get_db), -) -> dict: - """Update bbox coordinates for multiple annotations. - - Used by the frontend annotation editor to save moved/resized boxes. - Only ground_truth annotations are editable (predictions are immutable). 
- """ - cursor = db.connection.cursor() - try: - for update in request.updates: - cursor.execute( - "UPDATE annotations SET bbox_x = ?, bbox_y = ?, bbox_w = ?, bbox_h = ?, " - "area = ? * ? WHERE id = ? AND source = 'ground_truth'", - [update.bbox_x, update.bbox_y, update.bbox_w, update.bbox_h, - update.bbox_w, update.bbox_h, update.id], - ) - finally: - cursor.close() - return {"updated": len(request.updates)} - - -@router.delete("/annotations/{annotation_id}") -def delete_annotation( - annotation_id: str, - db: DuckDBRepo = Depends(get_db), -) -> None: - """Delete a single annotation. Only ground_truth annotations are deletable.""" - cursor = db.connection.cursor() - try: - cursor.execute( - "DELETE FROM annotations WHERE id = ? AND source = 'ground_truth'", - [annotation_id], - ) - finally: - cursor.close() -``` +def scan(self, root_path: str) -> ScanResult: + # 1. Try classification folder-of-folders (cheapest check) + splits = self._try_classification_folders(root, warnings) + if splits: + return ScanResult(format="classification_folders", splits=splits, ...) -### Konva Transformer Integration + # 2. Try classification CSV + splits = self._try_classification_csv(root, warnings) + if splits: + return ScanResult(format="classification_csv", splits=splits, ...) -The key technical pattern from Konva docs: the Transformer changes `scaleX`/`scaleY`, not `width`/`height`. On `onTransformEnd`, compute the new bbox from the node's position and scale: - -```typescript -// Pattern for annotation-editor.tsx -const handleTransformEnd = (e: KonvaEventObject) => { - const node = e.target; - const scaleX = node.scaleX(); - const scaleY = node.scaleY(); - - // Reset scale, apply to dimensions - node.scaleX(1); - node.scaleY(1); - - const updated: AnnotationUpdate = { - id: node.id(), - bbox_x: node.x(), - bbox_y: node.y(), - bbox_w: Math.max(5, node.width() * scaleX), - bbox_h: Math.max(5, node.height() * scaleY), - }; - - editStore.updateAnnotation(updated); -}; + # 3. 
Fall through to existing COCO detection (unchanged) + splits = self._try_layout_b(root, warnings) + if not splits: + splits = self._try_layout_a(root, warnings) + if not splits: + splits = self._try_layout_c(root, warnings) + return ScanResult(format="coco", splits=splits, ...) ``` -### Data Flow for Annotation Edits +**Important:** The `ScanResult.format` field currently is always `"coco"`. This now becomes the actual detected format string that drives parser dispatch in `IngestionService`. -``` -User clicks "Edit" in sample modal - -> AnnotationEditStore.startEditing(annotations) // copy current annotations - -> Modal switches from SVG AnnotationOverlay to Konva AnnotationEditor - -> User drags/resizes boxes (Konva handles visual updates in real-time) - -> User clicks "Save" - -> AnnotationEditStore.saveEdits() - -> PATCH /annotations/batch { updates: [...] } - -> On success: invalidate TanStack Query cache for this sample's annotations - -> Modal switches back to SVG AnnotationOverlay - -> Grid refetches batch annotations (sees updated boxes) -``` - -### Build Order Implication +--- -Annotation editing depends on the sample modal existing (already built). It is independent of Docker, auth, and smart ingestion. Can be built in parallel with error triage. 
+## New Components to Build + +### Backend + +| Component | File | Purpose | +|-----------|------|---------| +| `ClassificationFolderParser` | `app/ingestion/classification_folder_parser.py` | Parse ImageNet-style folder-of-folders into samples + annotations | +| `ClassificationCSVParser` | `app/ingestion/classification_csv_parser.py` | Parse CSV label files into samples + annotations | +| `ClassificationPredictionParser` | `app/ingestion/classification_prediction_parser.py` | Parse classification prediction CSV/JSON | +| `compute_classification_evaluation` | `app/services/classification_evaluation.py` | Accuracy/F1/confusion matrix (pure numpy) | +| `ClassificationEvaluationResponse` | `app/models/evaluation.py` | Response model for classification metrics | +| Schema migration | `app/repositories/duckdb_repo.py` | `dataset_type` column, nullable bbox columns | +| Scanner extensions | `app/services/folder_scanner.py` | `_try_classification_folders()`, `_try_classification_csv()` | + +### Frontend + +| Component | File | Purpose | +|-----------|------|---------| +| `ClassificationLabel` | `src/components/grid/classification-label.tsx` | Class label pill overlay for grid cells and modal | +| `ClassificationEvaluationPanel` | `src/components/stats/classification-eval-panel.tsx` | Accuracy/F1 metrics display with confusion matrix | +| `ClassificationPerClassTable` | `src/components/stats/classification-per-class-table.tsx` | Per-class P/R/F1/Support table | + +### Modified Components (Existing Files) + +| Component | File | What Changes | +|-----------|------|-------------| +| `DuckDBRepo.initialize_schema` | `duckdb_repo.py` | Add `dataset_type` column, make bbox nullable | +| `FolderScanner` | `folder_scanner.py` | Add classification layout detection methods | +| `IngestionService` | `services/ingestion.py` | Parser dispatch by format (registry pattern) | +| `DatasetResponse` | `models/dataset.py` | Add `dataset_type: str = "detection"` field | +| 
`AnnotationResponse` | `models/annotation.py` | Make bbox fields `Optional[float] = None` | +| `AnnotationCreate` | `models/annotation.py` | Make bbox fields optional | +| `BaseParser` | `ingestion/base_parser.py` | Relax bbox requirement in docstring | +| `get_evaluation` router | `routers/statistics.py` | Branch on dataset_type | +| `get_dataset_statistics` router | `routers/statistics.py` | Adjust summary labels for classification | +| `AnnotationOverlay` | `annotation-overlay.tsx` | Conditional bbox vs label rendering | +| `GridCell` | `grid-cell.tsx` | Pass `datasetType` prop to overlay | +| `SampleModal` | `sample-modal.tsx` | Conditional annotation display, hide bbox editing for classification | +| `StatsDashboard` | `stats-dashboard.tsx` | Route to correct evaluation panel | +| `EvaluationPanel` | `evaluation-panel.tsx` | Branch on dataset type | +| `AnnotationList` | `annotation-list.tsx` | Hide bbox columns for classification | +| `ScanResults` UI | `scan-results.tsx` | Show correct format badge | +| `PredictionImportDialog` | `prediction-import-dialog.tsx` | Support classification prediction format | +| `Dataset` type | `types/dataset.ts` | Add `dataset_type` field | +| `Annotation` type | `types/annotation.ts` | Make bbox fields optional (`number | null`) | +| `useEvaluation` hook | `hooks/use-evaluation.ts` | Handle union response type | +| `useFilteredEvaluation` hook | `hooks/use-filtered-evaluation.ts` | Handle classification eval response | --- -## Feature 5: Error Triage Workflow +## Patterns to Follow -### Current Error Analysis State +### Pattern 1: Type Discriminator Threading -The error analysis system already exists (`error_analysis.py` service, `error-analysis-panel.tsx` component). It categorizes detections into TP, Hard FP, Label Error, and FN. 
However, it is **read-only** -- there is no way to: -- Tag individual errors (confirm/dismiss/flag for review) -- Highlight errors while dimming non-errors in the grid -- Rank "worst" images by error severity +**What:** Pass `dataset_type` as a prop from the top-level dataset query down to components that need conditional behavior. Never re-fetch it inside child components. -### Triage Workflow Architecture +**When:** Any component that renders differently for detection vs classification. -``` -Error Triage Flow -================= - -error-analysis-panel.tsx (EXISTING -- add triage actions) - | - +-- ErrorSamplesGrid (EXISTING -- add "Tag as reviewed" button) - | - +-- TriageActionBar (NEW component) - | | - | +-- "Mark as FP" / "Mark as TP" / "Mark as Mistake" buttons - | +-- "Highlight errors only" toggle - | +-- "Rank worst images" button - | - +-- useTriageStore (NEW Zustand store -- 4th store) - | - +-- triageLabels: Map // annotation_id -> label - +-- highlightMode: "all" | "errors_only" - +-- worstImagesRanking: ScoredSample[] - +-- setTriageLabel(annotationId, label) - +-- toggleHighlightMode() +**Why:** Single fetch, single source of truth. Components remain pure. + +```typescript +// Page level: fetch once +const { data: dataset } = useDataset(datasetId); + +// Thread to children + + + ``` -### New DuckDB Table: `triage_labels` +### Pattern 2: Parser Registry for Ingestion Dispatch -The triage labels need to persist. Add a new table: +**What:** Map format strings to parser classes instead of hardcoding `COCOParser()`. -```sql -CREATE TABLE IF NOT EXISTS triage_labels ( - annotation_id VARCHAR NOT NULL, - dataset_id VARCHAR NOT NULL, - label VARCHAR NOT NULL, -- 'confirmed', 'dismissed', 'needs_review', 'mistake' - created_at TIMESTAMP DEFAULT current_timestamp -) -``` +**When:** `IngestionService` creates a parser for ingestion. 
-### Backend Integration Points +```python +PARSER_REGISTRY: dict[str, type[BaseParser]] = { + "coco": COCOParser, + "classification_folders": ClassificationFolderParser, + "classification_csv": ClassificationCSVParser, +} -**Files to create:** -| File | Purpose | -|------|---------| -| `app/routers/triage.py` | New router: `POST /triage/label`, `GET /triage/labels`, `GET /triage/worst-images` | -| `app/models/triage.py` | Pydantic models: `TriageLabelRequest`, `TriageLabelResponse`, `ScoredSample` | -| `app/services/triage_service.py` | Triage label CRUD + "worst images" ranking algorithm | +# In ingest_with_progress(): +parser_class = PARSER_REGISTRY.get(format) +if parser_class is None: + raise ValueError(f"Unsupported format: {format}") +parser = parser_class(batch_size=1000) +``` -**Files to modify:** -| File | Change | -|------|--------| -| `app/repositories/duckdb_repo.py` | Add `triage_labels` table creation in `initialize_schema()` | -| `app/main.py` | Add `app.include_router(triage.router)` | -| `app/dependencies.py` | Add `get_triage_service()` dependency | +### Pattern 3: Evaluation Strategy Selection -### "Worst Images" Ranking Algorithm +**What:** The evaluation router reads `dataset_type` and dispatches to the correct evaluation function. Each function returns its own response model. -The ranking combines multiple error signals into a single score: +**When:** Evaluation endpoint is called. ```python -# app/services/triage_service.py - -def rank_worst_images( - cursor: DuckDBPyConnection, - dataset_id: str, - source: str, - limit: int = 50, -) -> list[ScoredSample]: - """Rank images by combined error severity score. - - Score = (2 * hard_fp_count) + (3 * label_error_count) + (1 * fn_count) - + (0.5 * low_confidence_count) - (0.1 * tp_count) - - Higher score = worse image (more problems). - """ - # Use existing error_analysis.categorize_errors() to get per-sample breakdown - # Then aggregate and sort - ... 
+if dataset_type == "classification": + return compute_classification_evaluation(cursor, dataset_id, source, conf_threshold, split) +else: + return compute_evaluation(cursor, dataset_id, source, iou_threshold, conf_threshold, split) ``` -### Frontend Integration Points - -**Files to create:** -| File | Purpose | -|------|---------| -| `frontend/src/stores/triage-store.ts` | Zustand store for triage state (5th store) | -| `frontend/src/components/stats/triage-action-bar.tsx` | Triage controls and actions | -| `frontend/src/components/stats/worst-images-panel.tsx` | Ranked worst images display | -| `frontend/src/hooks/use-triage.ts` | TanStack Query hooks for triage API | -| `frontend/src/types/triage.ts` | TypeScript types | - -**Files to modify:** -| File | Change | -|------|--------| -| `frontend/src/components/stats/error-analysis-panel.tsx` | Add triage action buttons to error samples | -| `frontend/src/components/stats/error-samples-grid.tsx` | Add per-sample triage label badges | -| `frontend/src/components/grid/grid-cell.tsx` | Support highlight/dim mode from triage store | -| `frontend/src/stores/ui-store.ts` | Add `highlightMode: "all" \| "errors_only"` state | - -### Highlight/Dim Mode in Grid - -When triage highlight mode is active, grid cells for non-error images get reduced opacity: - -```tsx -// grid-cell.tsx modification -const triageHighlight = useTriageStore((s) => s.highlightMode); -const isError = useTriageStore((s) => s.errorSampleIds.has(sample.id)); - -const opacity = triageHighlight === "errors_only" && !isError ? 0.2 : 1.0; - -return ( -
- ... -
-); -``` +### Pattern 4: One Annotation Per Sample for Classification + +**What:** Classification datasets have exactly one ground-truth annotation per sample (the class label). Predictions also have one annotation per sample (the predicted class with confidence). This is enforced by the parser, not by the schema. -### Build Order Implication +**When:** Classification ingestion and evaluation. -Error triage depends on the existing error analysis system (already built). It extends the stats dashboard. Independent of Docker, auth, smart ingestion, and annotation editing. +**Why this matters:** The evaluation service can safely do `GROUP BY sample_id` and take the first row, rather than needing to handle multiple annotations per sample. --- -## Feature 6: Keyboard Shortcuts +## Anti-Patterns to Avoid -### Architecture Decision: react-hotkeys-hook +### Anti-Pattern 1: Separate Tables for Each Dataset Type -**Recommendation: Use `react-hotkeys-hook` (v5.x) library** rather than building custom keyboard handling. +**What:** Creating `classification_annotations`, `detection_annotations`, etc. -Rationale: -1. Actively maintained (last published 9 days ago as of research date) -2. Lightweight (~4KB) -3. Supports scoped shortcuts (component-level) and global shortcuts -4. Works with React 19 and Next.js 16 -5. Handles modifier keys, key combinations, and key sequences -6. Prevents shortcuts from firing when user is typing in inputs +**Why bad:** Every query in the codebase touches the `annotations` table. Statistics (`GROUP BY category_name`), filtering, batch fetch, triage -- all would need parallel implementations. The codebase has ~15 queries against the annotations table across 6 services. -### Integration Point: Global vs Component-Level Shortcuts +**Instead:** Nullable bbox columns in the existing table. Classification rows have NULL bbox. -``` -Shortcut Architecture -===================== - -Global shortcuts (active everywhere): - ? 
-> Show shortcut help modal - Escape -> Close any open modal / exit edit mode / clear selection - / -> Focus search input - g -> Switch to Grid tab - s -> Switch to Statistics tab - e -> Switch to Embeddings tab - -Component-level shortcuts (active when component is focused): - - Grid View: - j/k -> Navigate samples (down/up) - Enter -> Open detail modal for focused sample - x -> Toggle selection mode - Shift+A -> Select all visible - - Detail Modal: - Left/Right arrow -> Previous/next sample - d -> Delete annotation (in edit mode) - Ctrl+S -> Save annotation edits - Escape -> Close modal / cancel edit - - Error Triage: - 1 -> Mark as confirmed TP - 2 -> Mark as needs review - 3 -> Mark as mistake - h -> Toggle highlight mode -``` +### Anti-Pattern 2: Repurposing Detection Response Fields -### Integration Points +**What:** Stuffing accuracy into `map50`, precision into `map75`, etc. to avoid a new response model. -**New npm dependency:** -``` -npm install react-hotkeys-hook -``` +**Why bad:** The frontend would need to know that `map50` really means "accuracy" when `dataset_type === "classification"`. Field names become lies. API consumers are confused. 
-**Files to create:** -| File | Purpose | -|------|---------| -| `frontend/src/hooks/use-keyboard-shortcuts.ts` | Central shortcut registration hook | -| `frontend/src/components/shortcuts/shortcut-help-modal.tsx` | Help modal showing all available shortcuts | -| `frontend/src/lib/shortcuts.ts` | Shortcut definitions map (key -> action -> description) | - -**Files to modify:** -| File | Change | -|------|--------| -| `frontend/src/app/datasets/[datasetId]/page.tsx` | Register global shortcuts (tab switching, search focus) | -| `frontend/src/components/grid/image-grid.tsx` | Register grid navigation shortcuts (j/k, Enter, x) | -| `frontend/src/components/detail/sample-modal.tsx` | Register modal shortcuts (arrows, Escape, d, Ctrl+S) | -| `frontend/src/components/stats/error-analysis-panel.tsx` | Register triage shortcuts (1/2/3, h) | -| `frontend/src/stores/ui-store.ts` | Add `shortcutHelpOpen: boolean` state | -| `frontend/package.json` | Add `react-hotkeys-hook` dependency | - -### Implementation Pattern +**Instead:** Separate response models. The frontend discriminates on the response shape. -```typescript -// frontend/src/hooks/use-keyboard-shortcuts.ts -import { useHotkeys } from 'react-hotkeys-hook'; -import { useUIStore } from '@/stores/ui-store'; - -export function useGlobalShortcuts() { - const setActiveTab = useUIStore((s) => s.setActiveTab); - const openShortcutHelp = useUIStore((s) => s.setShortcutHelpOpen); - - // ? 
-> show help - useHotkeys('shift+/', () => openShortcutHelp(true), { preventDefault: true }); - - // g/s/e -> tab switching - useHotkeys('g', () => setActiveTab('grid'), { preventDefault: true }); - useHotkeys('s', () => setActiveTab('statistics'), { preventDefault: true }); - useHotkeys('e', () => setActiveTab('embeddings'), { preventDefault: true }); - - // / -> focus search - useHotkeys('/', () => { - document.querySelector('[data-shortcut-target="search"]')?.focus(); - }, { preventDefault: true }); -} -``` +### Anti-Pattern 3: Making the Detection Evaluation Handle Both Types -### Build Order Implication +**What:** Adding `if dataset_type == "classification"` branches inside `compute_evaluation()`, `_load_detections()`, `_build_detections()`, etc. -Keyboard shortcuts are the most independent feature. They layer on top of existing components without changing data flow or APIs. Can be built last or in parallel with any other feature. +**Why bad:** The detection evaluation is deeply spatial -- every helper function deals with bounding boxes, xyxy conversion, IoU matrices. Grafting classification logic into this creates an unmaintainable chimera. ---- +**Instead:** A separate, clean `compute_classification_evaluation()` function. Classification evaluation is simple (array comparison, confusion matrix) -- it does not need supervision library or IoU machinery. 
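To make the contrast concrete, the entire core of a standalone classification evaluation fits in one small function. Sketched here in plain Python for readability (the plan's version would use numpy, and the function name is illustrative):

```python
def classification_metrics(gt: list[str], pred: list[str], labels: list[str]) -> dict:
    """Accuracy, per-class precision/recall/F1, macro F1, and confusion matrix."""
    assert len(gt) == len(pred)
    idx = {label: i for i, label in enumerate(labels)}
    n = len(labels)
    cm = [[0] * n for _ in range(n)]  # rows = GT class, cols = predicted class
    for g, p in zip(gt, pred):
        cm[idx[g]][idx[p]] += 1

    accuracy = sum(cm[i][i] for i in range(n)) / len(gt)
    per_class, f1s = [], []
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp  # predicted i, GT was not i
        fn = sum(cm[i]) - tp                       # GT i, predicted was not i
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class.append({"class_name": label, "precision": precision,
                          "recall": recall, "f1": f1, "support": sum(cm[i])})
        f1s.append(f1)
    return {"accuracy": accuracy, "macro_f1": sum(f1s) / n,
            "per_class": per_class, "confusion_matrix": cm}
```

No IoU, no greedy matching, no spatial helpers -- which is exactly why this logic should live outside the detection evaluation path.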
-## New Components Summary - -### Backend (New Files) - -| File | Feature | Type | -|------|---------|------| -| `Dockerfile.backend` | Docker | Build | -| `Dockerfile.frontend` | Docker | Build | -| `docker-compose.yml` | Docker | Config | -| `nginx/default.conf` | Docker | Config | -| `.env.docker` | Docker | Config | -| `app/auth.py` | Auth | Module | -| `app/services/folder_scanner.py` | Smart Ingestion | Service | -| `app/models/scan.py` | Smart Ingestion | Model | -| `app/routers/ingestion.py` | Smart Ingestion | Router | -| `app/routers/annotations.py` | Annotation Editing | Router | -| `app/routers/triage.py` | Error Triage | Router | -| `app/models/triage.py` | Error Triage | Model | -| `app/services/triage_service.py` | Error Triage | Service | - -### Frontend (New Files) - -| File | Feature | Type | -|------|---------|------| -| `src/app/ingest/page.tsx` | Smart Ingestion | Page | -| `src/components/ingest/scan-results.tsx` | Smart Ingestion | Component | -| `src/components/ingest/import-progress.tsx` | Smart Ingestion | Component | -| `src/components/detail/annotation-editor.tsx` | Annotation Editing | Component | -| `src/components/detail/edit-toolbar.tsx` | Annotation Editing | Component | -| `src/stores/annotation-edit-store.ts` | Annotation Editing | Store | -| `src/components/stats/triage-action-bar.tsx` | Error Triage | Component | -| `src/components/stats/worst-images-panel.tsx` | Error Triage | Component | -| `src/stores/triage-store.ts` | Error Triage | Store | -| `src/hooks/use-scan.ts` | Smart Ingestion | Hook | -| `src/hooks/use-ingest.ts` | Smart Ingestion | Hook | -| `src/hooks/use-triage.ts` | Error Triage | Hook | -| `src/components/shortcuts/shortcut-help-modal.tsx` | Shortcuts | Component | -| `src/hooks/use-keyboard-shortcuts.ts` | Shortcuts | Hook | -| `src/lib/shortcuts.ts` | Shortcuts | Lib | - -### Modified Files - -| File | Features Affecting It | -|------|----------------------| -| `app/config.py` | Docker (qdrant_url), Auth 
(credentials) | -| `app/main.py` | Auth (router dependencies), New routers (ingestion, annotations, triage) | -| `app/dependencies.py` | Smart Ingestion (folder_scanner), Error Triage (triage_service) | -| `app/repositories/duckdb_repo.py` | Error Triage (triage_labels table) | -| `app/services/similarity_service.py` | Docker (conditional Qdrant client mode) | -| `app/services/ingestion.py` | Smart Ingestion (split parameter) | -| `app/ingestion/coco_parser.py` | Smart Ingestion (split parameter) | -| `app/models/dataset.py` | Smart Ingestion (split field on IngestRequest) | -| `app/models/annotation.py` | Annotation Editing (update models) | -| `frontend/next.config.ts` | Docker (standalone output) | -| `frontend/package.json` | Annotation Editing (react-konva), Shortcuts (react-hotkeys-hook) | -| `frontend/src/lib/api.ts` | Auth (credentials header) | -| `frontend/src/stores/ui-store.ts` | Shortcuts (help modal), Triage (highlight mode) | -| `frontend/src/components/detail/sample-modal.tsx` | Annotation Editing (edit mode toggle), Shortcuts | -| `frontend/src/components/stats/error-analysis-panel.tsx` | Error Triage (action buttons) | -| `frontend/src/components/stats/error-samples-grid.tsx` | Error Triage (label badges) | -| `frontend/src/components/grid/grid-cell.tsx` | Error Triage (highlight/dim mode) | -| `frontend/src/app/datasets/[datasetId]/page.tsx` | Shortcuts (global registration) | -| `frontend/src/components/grid/image-grid.tsx` | Shortcuts (grid navigation) | +### Anti-Pattern 4: Frontend Feature Detection Instead of Type Discrimination ---- - -## Data Flow Changes +**What:** Checking `if annotations[0]?.bbox_x === null` to determine rendering mode. -### New DuckDB Tables +**Why bad:** Fragile. Fails on samples with no annotations. Requires loading annotations before knowing how to render. Creates subtle bugs. 
-| Table | Feature | Schema | -|-------|---------|--------| -| `triage_labels` | Error Triage | `annotation_id VARCHAR, dataset_id VARCHAR, label VARCHAR, created_at TIMESTAMP` | +**Instead:** Use `dataset_type` from the dataset metadata (loaded once, always available). The type determines rendering, not the data shape. -### New API Endpoints - -| Method | Path | Feature | SSE? | -|--------|------|---------|------| -| `POST` | `/ingestion/scan` | Smart Ingestion | No | -| `PATCH` | `/annotations/batch` | Annotation Editing | No | -| `DELETE` | `/annotations/{id}` | Annotation Editing | No | -| `POST` | `/triage/label` | Error Triage | No | -| `GET` | `/triage/labels?dataset_id=X` | Error Triage | No | -| `GET` | `/triage/worst-images?dataset_id=X` | Error Triage | No | +--- -### New Zustand Stores +## Scalability Considerations -| Store | Feature | Slices | -|-------|---------|--------| -| `annotation-edit-store.ts` | Annotation Editing | editingAnnotations, selectedId, isDirty, save/discard actions | -| `triage-store.ts` | Error Triage | triageLabels map, highlightMode, worstImagesRanking | +| Concern | At 1K images | At 100K images | At 1M images | +|---------|-------------|---------------|-------------| +| Classification annotations (1 per image) | 1K rows, trivial | 100K rows, fast | 1M rows, may want index on (dataset_id, sample_id) | +| Confusion matrix computation | In-memory numpy, instant | In-memory numpy, <1s | In-memory numpy, ~2s (1M label comparisons) | +| Folder-of-folders ingestion (many small files) | Fast | Moderate (100K filesystem stats) | Slow -- but same as image loading | +| NULL bbox storage | None (DuckDB columnar compression) | None | None -- NULLs compress to near-zero in columnar | +| Statistics queries on mixed tables | No impact | No impact | No impact -- DuckDB predicate pushdown handles it | -Total stores: 3 existing + 2 new = **5 Zustand stores** +Classification datasets are strictly simpler than detection: 1 annotation per image, 
no spatial matching, no IoU. The existing architecture handles the scale without modification. --- ## Suggested Build Order -Based on dependency analysis: +Build order follows data flow dependencies: schema before parsers, parsers before frontend display, evaluation needs data. -``` -Phase 1: Docker Deployment - - Dockerfile.backend + Dockerfile.frontend - - docker-compose.yml (backend, frontend, qdrant, nginx) - - Qdrant client mode switch (local vs server) - - Next.js standalone output - - Nginx reverse proxy with SSE support - DEPENDS ON: nothing - ENABLES: cloud deployment, auth - -Phase 2: Single-User Auth - - app/auth.py (HTTPBasic + verify_credentials) - - Router-level dependency injection - - Frontend credential handling - - SSE auth via cookies - DEPENDS ON: Docker (for HTTPS context) - ENABLES: secure cloud access - -Phase 3: Smart Ingestion UI - - FolderScanner service - - /ingestion/scan endpoint - - Ingestion wizard page + components - - Split detection in existing parser - DEPENDS ON: Auth (new endpoints need it) - ENABLES: no-code dataset import - -Phase 4: Error Triage Workflow - - triage_labels table - - Triage API endpoints - - Triage Zustand store - - Worst images ranking - - Grid highlight/dim mode - DEPENDS ON: existing error analysis (already built) - CAN PARALLEL WITH: Phase 3 - -Phase 5: Annotation Editing - - react-konva integration in detail modal - - AnnotationEditor component with Transformer - - AnnotationEditStore (new Zustand store) - - PATCH /annotations/batch endpoint - DEPENDS ON: nothing new (builds on existing modal) - CAN PARALLEL WITH: Phases 3, 4 - -Phase 6: Keyboard Shortcuts - - react-hotkeys-hook integration - - Global and component-level shortcuts - - Shortcut help modal - DEPENDS ON: all other UI features complete (shortcuts reference them) - BUILD LAST: shortcuts layer on top of everything -``` +| Order | What | Dependencies | Rationale | +|-------|------|--------------|-----------| +| 1 | Schema migration + API model 
updates | None | Foundation: must exist before anything else | +| 2 | Classification folder parser + scanner detection | Step 1 | End-to-end ingestion works | +| 3 | Frontend conditional rendering (grid + modal) | Step 2 | Users can see classification datasets | +| 4 | Classification evaluation service + frontend | Step 3 | Metrics for classification predictions | +| 5 | Classification prediction import | Step 1 | Import predictions for evaluation | +| 6 | CSV parser + additional format support | Step 1 | Secondary ingestion format, lower priority | -### Dependency Graph +**Critical path:** Steps 1 -> 2 -> 3 -> 4 are sequential. Steps 5 and 6 can proceed in parallel after step 1. -``` -Phase 1 (Docker) - | - v -Phase 2 (Auth) - | - v -Phase 3 (Smart Ingestion) - -Phase 4 (Error Triage) -- parallel, independent -Phase 5 (Annotation Edit) -- parallel, independent -Phase 6 (Shortcuts) -- last, references all UI -``` +**What stays unchanged (no work needed):** +- Embeddings + scatter plot (sample-level, no bbox dependency) +- Similarity search (sample-level, no bbox dependency) +- Saved views (filter state, no bbox dependency) +- Tags / triage (annotation-level, uses category_name not bbox) +- Thumbnail generation (image-level, no annotation dependency) --- ## Sources -### HIGH Confidence (Official Documentation + Codebase Analysis) -- DataVisor codebase: `app/main.py`, `app/dependencies.py`, `app/config.py`, `app/repositories/duckdb_repo.py`, `app/services/similarity_service.py`, `app/services/ingestion.py`, `app/services/error_analysis.py` -- verified existing architecture -- DataVisor codebase: `frontend/src/stores/*.ts`, `frontend/src/lib/api.ts`, `frontend/src/components/**/*.tsx` -- verified frontend architecture -- [FastAPI Dependency Injection vs Middleware (GitHub Discussion #8867)](https://github.com/fastapi/fastapi/discussions/8867) -- DI recommended for auth -- [FastAPI Auth with Dependency Injection 
(PropelAuth)](https://www.propelauth.com/post/fastapi-auth-with-dependency-injection) -- DI pattern reference -- [Konva Transformer for React](https://konvajs.org/docs/react/Transformer.html) -- select/resize/rotate shapes -- [Konva Drag and Resize Limits](https://konvajs.org/docs/select_and_transform/Resize_Limits.html) -- boundBoxFunc for clamping -- [Qdrant Installation Docker](https://qdrant.tech/documentation/guides/installation/) -- Docker service configuration -- [Qdrant Python Client](https://python-client.qdrant.tech/qdrant_client.qdrant_client) -- local mode vs server mode -- [react-hotkeys-hook (npm)](https://www.npmjs.com/package/react-hotkeys-hook) -- v5.2.4, actively maintained -- [DuckDB Docker Container](https://duckdb.org/docs/stable/operations_manual/duckdb_docker) -- volume mount patterns -- [Next.js Standalone Output](https://nextjs.org/docs/pages/api-reference/config/next-config-js/output) -- Docker-optimized builds - -### MEDIUM Confidence (WebSearch + Cross-Verification) -- [FastAPI + Next.js Docker examples (GitHub)](https://github.com/YsrajSingh/nextjs-fastapi-docker) -- compose topology reference -- [Qdrant Docker Compose configuration (DeepWiki)](https://deepwiki.com/qdrant/qdrant_demo/2.1-quick-start-with-docker-compose) -- service definition -- [Building canvas-based editors with Konva](https://www.alikaraki.me/blog/canvas-editors-konva) -- production patterns for drag/resize -- [Next.js Dockerization 2025 guide](https://medium.com/front-end-world/dockerizing-a-next-js-application-in-2025-bacdca4810fe) -- multi-stage build best practices - ---- -*Architecture research for: DataVisor v1.1 Feature Integration* -*Researched: 2026-02-12* -*Grounded in: 12,720 LOC codebase analysis + official documentation* +- **Direct codebase analysis:** `duckdb_repo.py` (schema), `evaluation.py` (metrics), `coco_parser.py` + `base_parser.py` (ingestion), `folder_scanner.py` (detection), `annotation-overlay.tsx` + `grid-cell.tsx` + `sample-modal.tsx` 
(frontend rendering), `statistics.py` (API), `evaluation-panel.tsx` (frontend metrics display) +- **DuckDB ALTER TABLE:** Need to verify `ALTER COLUMN DROP NOT NULL` support in current version -- MEDIUM confidence on exact syntax +- **ImageNet folder-of-folders convention:** Standard classification dataset layout -- HIGH confidence +- **scikit-learn classification metrics patterns:** Standard accuracy/precision/recall/F1 computation -- HIGH confidence (though we use pure numpy, not sklearn) diff --git a/.planning/research/FEATURES.md b/.planning/research/FEATURES.md index ee1d75b..e69f2e2 100644 --- a/.planning/research/FEATURES.md +++ b/.planning/research/FEATURES.md @@ -1,1038 +1,428 @@ -# Feature Gap Analysis: DataVisor v1.1 vs FiftyOne & Encord +# Feature Landscape: Classification Dataset Support -**Domain:** Computer Vision Dataset Introspection / Exploration Tooling -**Researched:** 2026-02-12 -**Mode:** Competitive analysis (FiftyOne + Encord vs DataVisor) -**Overall Confidence:** HIGH (grounded in official documentation from both platforms) +**Domain:** Single-label image classification dataset introspection +**Researched:** 2026-02-18 +**Scope:** NEW features needed for classification support -- does NOT repeat existing detection features --- -## How to Read This Document +## How Classification Differs from Detection -Each feature gap is categorized by: -- **Priority:** Table Stakes (expected by CV engineers) / Differentiator (competitive edge) / Nice-to-Have (marginal value for v1.1) -- **Complexity:** Low (< 1 day) / Medium (1-3 days) / High (3+ days) -- **Depends On:** Existing DataVisor v1.0 features or new features needed first -- **Competitor Reference:** Specific documentation or behavior observed +Understanding these differences drives every feature decision below. ---- - -## 1. Dataset Ingestion & Format Support - -### 1A. 
Multi-Format Import (YOLO, VOC, KITTI, TFRecords, BDD) - -**What competitors do:** - -FiftyOne's `fo.Dataset.from_dir()` supports 15+ formats out of the box: -- COCO Detection, VOC Detection, YOLOv4, YOLOv5, KITTI Detection -- TFRecords (classification + detection), BDD100K, CVAT (image + video) -- OpenLABEL, DICOM, GeoJSON, GeoTIFF -- Image/Video classification directory trees -- FiftyOne native format - -The API requires explicit `dataset_type` specification -- there is no automatic format detection. Example: -```python -dataset = fo.Dataset.from_dir( - dataset_dir="/path/to/data", - dataset_type=fo.types.COCODetectionDataset, - label_field="ground_truth", -) -``` - -Encord ingestion is cloud-native: users register files from AWS S3, GCS, Azure, or OTC OSS buckets. Local upload is supported but the primary workflow is cloud storage integration via SDK. Encord's SDK enables programmatic ETL pipelines for ingestion. +| Aspect | Detection (current) | Classification (new) | +|--------|---------------------|---------------------| +| **Label granularity** | Per-annotation (many per image) | Per-image (one label per image) | +| **Spatial info** | Bounding boxes (x, y, w, h) | None -- label applies to entire image | +| **Matching logic** | IoU-based greedy matching | Direct string comparison (GT label vs predicted label) | +| **Error types** | TP, Hard FP, Label Error, FN | Correct, Misclassified (with confused pair) | +| **Key metrics** | mAP, AP@50/75, per-class AP | Accuracy, macro/micro F1, per-class precision/recall | +| **Confusion matrix** | Background row/col for unmatched | No background -- every image has exactly one GT and one prediction | +| **Display** | SVG bbox overlays on thumbnails | Text badge/label on thumbnail corner | +| **Ingestion format** | COCO JSON (images + annotations arrays) | JSONL (one JSON object per line: image, prefix, suffix) | -**What DataVisor has:** COCO format only via streaming ijson parser. 
+### Classification Matching (the "IoU equivalent") -**Gap:** DataVisor only supports COCO. CV engineers commonly have datasets in YOLO (especially YOLOv5/v8 from Ultralytics) and VOC (legacy Pascal datasets). Missing YOLO support is the most critical gap -- it is the most popular training format today. +In classification, matching is trivial: compare the predicted label to the ground truth label for each image. There is no spatial matching, no IoU threshold. The evaluation reduces to a standard confusion matrix. -**Priority:** TABLE STAKES -- YOLO and VOC are the two most common formats after COCO. Missing them means users must convert externally before loading. +FiftyOne's `evaluate_classifications()` supports three methods: +- **"simple"** (default): Direct GT label vs prediction label comparison. Each sample is marked correct/incorrect. This is what DataVisor needs. +- **"top-k"**: Prediction is correct if GT label appears in top-k predicted classes. Requires multi-class probability output (not applicable to DataVisor's single-prediction format). +- **"binary"**: Binary classification with configurable positive class. Missing labels treated as negative class. -**Complexity:** MEDIUM per format. Each format needs: (a) parser that maps to DataVisor's internal schema, (b) path resolution for images/labels, (c) tests with real-world datasets. - -- YOLO: Parse `dataset.yaml` for class names, read `.txt` label files (class_id cx cy w h), resolve image paths from `images/` directory structure -- VOC: Parse XML annotation files with `<object>` elements, resolve image paths from `JPEGImages/` -- KITTI: Parse space-delimited `.txt` files with 15 columns per object - -**Depends on:** Existing ingestion pipeline. DataVisor's streaming parser architecture should be extended with a format-detection step before parsing begins. - -**Recommendation:** Add YOLO and VOC for v1.1. KITTI and others can wait for v1.2+.
Design a `FormatDetector` class that inspects folder contents (presence of `*.yaml`, `*.xml`, `*.json`) and recommends the parser. +For DataVisor's use case (single-label classification with one prediction per image), the "simple" method is the only one that matters. --- -### 1B. Train/Val/Test Split Handling - -**What competitors do:** - -FiftyOne handles splits via the `split` parameter on `add_dir()`: -```python -dataset = fo.Dataset(name) -for split in ["train", "val", "test"]: - dataset.add_dir( - dataset_dir=dataset_dir, - dataset_type=fo.types.YOLOv5Dataset, - split=split, - tags=split, # Tags each sample with its split name - ) -``` - -This means: (a) each split is loaded separately, (b) samples are tagged with their split, (c) users can filter by split tag in the App. FiftyOne also has `Brain.compute_leaky_splits()` to detect data leakage between train/test. +## Table Stakes -Encord handles splits at the project level -- datasets are created per split, and projects reference specific datasets. The platform does not auto-detect folder structure. +Features users expect when inspecting classification datasets. Missing any of these would feel like a broken product. -**What DataVisor has:** No split awareness. The ingestion UI takes a single annotations file and image directory. +### TS-1: JSONL Ingestion Parser -**Gap:** Most real-world datasets have train/val/test directories. Users must currently load each split separately and cannot filter by split. There is no detection of the common `train/`, `val/`, `test/` folder pattern. +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Classification datasets from Roboflow export as JSONL. This is the target format. | +| **Complexity** | Medium | +| **Depends on** | `BaseParser` abstract class (existing), `DuckDBRepo` schema (needs extension) | -**Priority:** TABLE STAKES -- every real dataset has splits. Without this, the first thing a user does after loading is wonder "where are my val images?" 
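To make the ingestion concrete, here is a minimal sketch of parsing the classification JSONL format described above, where each line labels one image with a single class via its `suffix` field. The function name, record shape, and sample data are hypothetical; a real `ClassificationParser` would subclass `BaseParser` and stream large files rather than read them whole.

```python
import json

def parse_classification_jsonl(text: str) -> list[dict]:
    """Parse JSONL where each line labels one image with a single class."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        records.append({
            "file_name": obj["image"],
            # The class label lives in "suffix"; "prefix" is the prompt text.
            "category_name": obj["suffix"],
        })
    return records

# Hypothetical Roboflow-style JSONL content: one object per line.
sample = "\n".join([
    '{"image": "img_001.jpg", "prefix": "What number?", "suffix": "7"}',
    '{"image": "img_002.jpg", "prefix": "What number?", "suffix": "23"}',
])
annotations = parse_classification_jsonl(sample)
```

Each record would then be stored as one annotation row per image, with bbox columns left NULL (or filled with the sentinel full-image values) depending on the schema decision above.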
+**What to build:** A `ClassificationParser` that reads JSONL files where each line is `{"image":"filename.jpg","prefix":"prompt","suffix":"class_label"}`. The parser maps `suffix` to `category_name` and stores one annotation per image with sentinel bbox values (0,0,width,height -- full image) or a new schema approach. -**Complexity:** MEDIUM. -- Auto-detect: Scan for `train/`, `val/`, `test/` subdirectories; check for YOLO's `dataset.yaml` split definitions -- Tag on ingest: Add a `split` metadata field to each sample during ingestion -- Filter: The existing sidebar filtering system handles this automatically once the field exists +**Schema decision:** The current `annotations` table requires `bbox_x/y/w/h`. Two options: +1. **Store with sentinel values** (bbox = full image dimensions). Simpler, avoids schema migration, but semantically wrong. +2. **Add `task_type` column to datasets table** and make bbox columns nullable. Cleaner, enables task-aware rendering throughout the app. -**Depends on:** Ingestion pipeline (existing). Sidebar filtering (existing -- works on any metadata field). +Recommendation: Option 2 -- add `task_type VARCHAR DEFAULT 'detection'` to `datasets` table. Make bbox columns `DOUBLE` (already are) and allow NULLs for classification. This is a one-line schema change and pays forward for future task types (segmentation, etc.). -**Recommendation:** During ingestion, scan the target directory for split subdirectories. If found, present them in the UI and let the user select which splits to load. Tag each sample with its split name. The existing metadata filtering will handle split-based browsing. +**Folder scanner:** Extend `folder_scanner.py` to detect JSONL files alongside images. A JSONL file with `{"image":..., "suffix":...}` structure identifies a classification dataset. The existing split detection (`train/`, `valid/`, `test/` directories) works as-is since Roboflow classification exports use the same directory structure. --- -### 1C. 
Smart Folder Detection UI +### TS-2: Class Label Display on Thumbnails -**What competitors do:** +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Every classification tool shows the class label on the thumbnail. Without it, users see unlabeled images. | +| **Complexity** | Low | +| **Depends on** | Grid cell component (existing), dataset `task_type` field (TS-1) | -FiftyOne requires Python code to load datasets -- there is no folder-detection UI. The user must know the format and write `fo.Dataset.from_dir(...)` with the correct type. This is a pain point evidenced by multiple GitHub issues about `from_dir()` failing on slightly non-standard folder structures (issues #1780, #1781, #1951). +**What to build:** A `ClassificationBadge` component that replaces the `AnnotationOverlay` (SVG bboxes) when `task_type === 'classification'`. Renders as a text badge in the top-left corner of the thumbnail. -Encord's workflow is: (1) register cloud storage, (2) create a dataset in the platform, (3) upload/sync files. It is guided but requires configuration. +**How competitors do it:** +- **Roboflow:** Classification label displayed as text in the top-left corner of the image with a semi-transparent colored background. +- **FiftyOne:** Classification fields shown as text tags in the sample's sidebar panel, not overlaid on the image in grid view. Image-level labels appear as fields, not spatial overlays. +- **Label Studio:** Text label below the image during annotation. -Neither competitor has a "point at folder and auto-detect" experience. +**Design decision:** Top-left corner badge with semi-transparent background, colored by class (using the existing `color-hash.ts`). Show GT label by default; when predictions exist, show both with GT solid and prediction as an outline/dashed badge below. This mirrors the existing convention where GT is solid stroke and predictions are dashed stroke. -**What DataVisor has:** Manual file selection in the ingestion UI. 
- -**Gap:** There is an opportunity to leapfrog both competitors with a smart ingestion UI that: (a) accepts a root directory, (b) scans for annotation files and image directories, (c) infers the format, (d) detects splits, (e) shows a preview before import. - -**Priority:** DIFFERENTIATOR -- neither FiftyOne nor Encord does this well. FiftyOne forces Python; Encord forces cloud config. A "drag-and-drop folder" experience is genuinely better. - -**Complexity:** MEDIUM. -- Directory scanner: Look for `*.json` (COCO), `*.yaml` + `*.txt` (YOLO), `*.xml` (VOC) -- Preview: Show detected format, split count, image count, class count before import -- Confirmation: Let user override detected format if wrong - -**Depends on:** Multi-format import (1A). Split handling (1B). - -**Recommendation:** Build a `DatasetDetector` service that returns a `DetectionResult` with: format type, annotation paths, image directories, splits found, sample counts. The frontend renders this as a confirmation dialog before ingestion begins. +**When GT and prediction differ:** Show both badges stacked, with the prediction badge having a red tint or strikethrough to indicate misclassification. When they match, show a single green-tinted badge. This gives immediate visual signal without opening the detail modal. --- -### 1D. Dataset Zoo / Pre-Built Datasets - -**What competitors do:** - -FiftyOne has a Dataset Zoo with one-line loading of 20+ benchmark datasets: -```python -import fiftyone.zoo as foz -dataset = foz.load_zoo_dataset("coco-2017", split="validation") -``` -Available datasets include COCO-2017, CIFAR-10/100, ImageNet, BDD100K, Open Images, Cityscapes, ActivityNet, KITTI, and a `quickstart` dataset with 200 samples for demos. +### TS-3: Classification Evaluation Metrics -Encord does not have a dataset zoo -- users bring their own data. +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Accuracy, F1, precision, recall are the universal classification metrics. 
Every ML practitioner expects these. | +| **Complexity** | Medium | +| **Depends on** | Evaluation service (existing, needs classification branch), prediction import (existing) | -**What DataVisor has:** No pre-built dataset loading. +**What to build:** A `ClassificationEvaluationService` that computes: -**Gap:** The quickstart experience matters. FiftyOne users can go from `pip install` to exploring a dataset in 30 seconds. DataVisor users must have their own COCO dataset ready. +**Aggregate metrics:** +- **Accuracy:** correct / total +- **Macro F1:** unweighted average of per-class F1 scores (treats all classes equally) +- **Micro F1:** equivalent to accuracy for single-label classification +- **Weighted F1:** F1 weighted by class support (handles imbalance) -**Priority:** NICE-TO-HAVE for v1.1 (a single demo dataset is sufficient). TABLE STAKES for onboarding/documentation purposes. +**Per-class metrics:** +- Precision, Recall, F1, Support (count of GT instances) -**Complexity:** LOW. Bundle a small demo dataset (50-100 COCO images with annotations and predictions) for first-run experience. Not a full zoo. +**Why these specific metrics:** For the jersey number dataset (43 classes, likely imbalanced), accuracy alone is misleading. Macro F1 exposes classes with poor performance regardless of their frequency. Weighted F1 gives the overall picture accounting for class sizes. Per-class precision/recall identifies which specific classes the model struggles with. -**Depends on:** Nothing. Just needs a sample dataset bundled or downloadable. +**Implementation:** Use `sklearn.metrics.classification_report()` and `sklearn.metrics.confusion_matrix()` rather than building from scratch. scikit-learn is already a transitive dependency (via supervision). Classification evaluation is dramatically simpler than detection evaluation -- no IoU, no confidence sweeping, no greedy matching. The entire evaluation is one confusion matrix computation. 
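While the recommendation above is to call `sklearn.metrics.classification_report()` directly, the underlying computation is small enough to sketch in pure Python, which also illustrates how simple the evaluation is. The function name and labels below are hypothetical; the sketch computes the same accuracy, macro F1, weighted F1, and per-class precision/recall/F1/support described above from (GT, prediction) label pairs.

```python
from collections import defaultdict

def classification_metrics(gt: list[str], pred: list[str]) -> dict:
    """Confusion matrix plus accuracy / macro F1 / weighted F1 from label pairs."""
    classes = sorted(set(gt) | set(pred))
    confusion = {c: defaultdict(int) for c in classes}  # confusion[gt_label][pred_label]
    for g, p in zip(gt, pred):
        confusion[g][p] += 1

    per_class, f1s, weights = {}, [], []
    total = len(gt)
    for c in classes:
        tp = confusion[c][c]
        fp = sum(confusion[g][c] for g in classes if g != c)
        fn = sum(confusion[c][p] for p in classes if p != c)
        support = tp + fn  # count of GT instances of this class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_class[c] = {"precision": precision, "recall": recall,
                        "f1": f1, "support": support}
        f1s.append(f1)
        weights.append(support)

    accuracy = sum(confusion[c][c] for c in classes) / total
    macro_f1 = sum(f1s) / len(f1s)           # unweighted: all classes equal
    weighted_f1 = sum(f * w for f, w in zip(f1s, weights)) / total
    return {"accuracy": accuracy, "macro_f1": macro_f1,
            "weighted_f1": weighted_f1, "per_class": per_class,
            "confusion": confusion}

# Hypothetical jersey-number labels: three images correct, one "3" predicted as "8".
m = classification_metrics(["3", "3", "8", "7"], ["3", "8", "8", "7"])
```

The same `confusion` mapping feeds the confusion matrix endpoint; no IoU sweep or greedy matching is involved.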
-**Recommendation:** Ship a `quickstart` command or UI button that loads a bundled demo dataset. This is critical for documentation, demos, and first-time users. Defer a full dataset zoo indefinitely -- it is not core to the tool's value. +**Response model:** New `ClassificationEvaluationResponse` alongside the existing detection `EvaluationResponse`. The router checks `task_type` and dispatches to the correct service. --- -### 1E. Dataset Export - -**What competitors do:** - -FiftyOne exports to all the same formats it imports, via: -```python -dataset_or_view.export( - export_dir="/path/to/export", - dataset_type=fo.types.YOLOv5Dataset, - label_field="ground_truth", -) -``` -Key parameters: `export_media` (copy/move/symlink/omit), `abs_paths`, `classes` (explicit class list), `data_path`/`labels_path` (separate media and labels). Views can be exported -- so a filtered subset exports only matching samples. +### TS-4: Classification Confusion Matrix -Encord exports labels via SDK in JSON format and supports integration with training pipelines. +| Attribute | Detail | +|-----------|--------| +| **Why expected** | The confusion matrix is THE diagnostic tool for classification. It shows which classes get confused with which. | +| **Complexity** | Low (existing confusion matrix component needs minor adaptation) | +| **Depends on** | Confusion matrix component (existing), classification evaluation (TS-3) | -**What DataVisor has:** No export functionality. +**What to build:** Adapt the existing `ConfusionMatrix` component for classification. -**Gap:** After users curate a dataset (filter, tag errors, exclude bad samples), they need to export the cleaned subset for training. Without export, the curation work is stranded inside DataVisor. +**Key differences from detection confusion matrix:** +1. **No "background" row/column.** In classification, every image has exactly one GT label and one predicted label. There are no unmatched items. +2. 
**Simpler cell semantics.** Each cell (i, j) = count of images with GT class i predicted as class j. No IoU threshold. +3. **No IoU threshold slider.** The existing evaluation panel has IoU/confidence threshold controls -- these should be hidden for classification datasets. +4. **Confidence threshold still relevant.** If predictions have confidence scores, filtering by confidence can still be useful (exclude low-confidence predictions). -**Priority:** TABLE STAKES -- the end-to-end workflow is: load -> explore -> curate -> export for training. Export completes the loop. +**The existing `ConfusionMatrix` component and `use-confusion-cell.ts` hook already support click-to-filter** (clicking a cell shows the contributing samples). This works perfectly for classification -- clicking cell (i, j) shows all images where GT=class_i and prediction=class_j. The only change is the backend query: instead of IoU-based matching, do a simple SQL join on sample_id between GT and prediction annotations. -**Complexity:** MEDIUM. -- Export a DatasetView (filtered subset) to COCO, YOLO, or VOC format -- Handle media: copy vs symlink vs manifest-only -- Write annotation files in target format - -**Depends on:** Multi-format import (1A, for the format writers). Saved views / filtering (existing). - -**Recommendation:** Implement export for COCO and YOLO formats in v1.1. The current view (with all active filters) should be exportable. Support `copy` and `symlink` media modes. This closes the curation loop. +**For 43 classes (jersey numbers):** The matrix will be 43x43. The existing component needs to handle this density well. Consider adding: (a) row/column sorting by error count, (b) a "most confused pairs" summary table showing the top-N off-diagonal cells. FiftyOne surfaces "most confused" pairs as a first-class concept and it is extremely useful. --- -## 2. Annotation Management - -### 2A. 
In-App Annotation Editing (Move, Resize, Delete Bounding Boxes) - -**What competitors do:** - -FiftyOne does NOT have in-app annotation editing. It delegates to external tools (CVAT, Label Studio, Labelbox) via `dataset.annotate()`: -```python -anno_key = "corrections" -view.annotate( - anno_key, - backend="cvat", - label_field="ground_truth", - allow_additions=True, - allow_deletions=True, - allow_spatial_edits=True, -) -# Later, after editing in CVAT: -view.load_annotations(anno_key) -``` -This is a roundtrip: FiftyOne -> CVAT -> FiftyOne. Annotations are not editable in the FiftyOne App itself. +### TS-5: Classification Error Analysis -Encord has a full-featured annotation editor with: -- Bounding boxes, rotatable bounding boxes, polygons, polylines, keypoints, bitmasks, object primitives -- Vertex management: add/remove/move vertices on polygons -- Brush tool and eraser for freehand polygon refinement -- Copy/paste labels across frames (Ctrl+C, Ctrl+V) -- Undo/redo (Ctrl+Z, Ctrl+Shift+Z) -- Merge and subtract polygons -- SAM 2 model-assisted segmentation -- Interpolation for frame-to-frame tracking -- Bulk label operations: merge objects, mass-delete by class/confidence/frame range -- Wacom tablet support +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Users need to know not just aggregate metrics but which specific images are wrong and why. | +| **Complexity** | Medium | +| **Depends on** | Error analysis service (existing, needs classification branch), classification evaluation (TS-3) | -**What DataVisor has:** Read-only annotation display. No editing. +**What to build:** A `ClassificationErrorAnalysis` service that categorizes each image as: -**Gap:** DataVisor's PROJECT.md scopes this as "quick corrections only, not CVAT replacement." This is the right call. The question is: what is the minimum viable annotation editing for an introspection tool? 
+| Category | Detection Equivalent | Definition | +|----------|---------------------|------------| +| **Correct** | True Positive | GT label == predicted label | +| **Misclassified** | Label Error | GT label != predicted label (the GT/predicted pair is recorded) | +| **Missing prediction** | False Negative | Image has GT but no prediction | +| **Spurious prediction** | False Positive | Image has prediction but no GT (rare for classification) | -**Priority:** TABLE STAKES at the "quick correction" level. When a user spots a wrong bounding box during error triage, they should be able to fix it immediately without context-switching to CVAT. +**No "Hard FP" category.** In detection, Hard FP means a prediction with no nearby GT box. In classification, there is no spatial component -- a wrong prediction is simply a misclassification. The error taxonomy is simpler. -**Complexity:** HIGH for full editing. MEDIUM for the minimum viable set: -- Delete a bounding box (click -> delete key) -- Move a bounding box (drag) -- Resize a bounding box (drag corners/edges) -- Change class label (dropdown or hotkey) -- Undo/redo (Ctrl+Z / Ctrl+Shift+Z) +**Per-class error breakdown:** For each class, show: TP count, misclassified count (broken down by which class they were confused with), missed count. This is richer than the detection per-class table because we can show the confusion target. -Full polygon editing, brush tools, interpolation, etc. are out of scope per PROJECT.md. - -**Depends on:** Sample detail modal (existing). Annotation overlay rendering (existing). - -**Recommendation:** Implement bbox-only editing in the sample detail modal: select -> move/resize/delete -> save. No polygon editing, no new annotation creation (that is CVAT territory). Add undo/redo with a simple command stack. This covers 90% of "quick correction" needs. 
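A minimal sketch of the per-image categorization described above, assuming GT and predictions are available as filename-to-label mappings (the function name and filenames are hypothetical; a real service would query the annotations table instead):

```python
def categorize_images(gt: dict[str, str], pred: dict[str, str]) -> dict[str, dict]:
    """Assign each image one of the four classification error categories."""
    result = {}
    for image, gt_label in gt.items():
        if image not in pred:
            result[image] = {"category": "missing_prediction", "gt": gt_label}
        elif pred[image] == gt_label:
            result[image] = {"category": "correct", "gt": gt_label}
        else:
            # Record the confused pair for the "most confused pairs" summary.
            result[image] = {"category": "misclassified",
                             "gt": gt_label, "pred": pred[image]}
    for image, p in pred.items():
        if image not in gt:  # rare for classification
            result[image] = {"category": "spurious_prediction", "pred": p}
    return result

# Hypothetical filenames and jersey-number labels.
triage = categorize_images(
    gt={"a.jpg": "3", "b.jpg": "3", "c.jpg": "8"},
    pred={"a.jpg": "3", "b.jpg": "8", "d.jpg": "7"},
)
```

Aggregating the recorded (gt, pred) pairs from the misclassified entries directly yields the per-class breakdown and the most-confused-pairs ranking.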
+**"Most confused pairs" summary:** Extract the top-N off-diagonal confusion matrix cells and present as a ranked list: "Class '3' confused with '8' (N=23 times)" etc. This is the single most actionable view for classification debugging. --- -### 2B. Create New Annotations - -**What competitors do:** - -FiftyOne: Not possible in-app. Must use CVAT/Label Studio integration. +### TS-6: Classification Prediction Import -Encord: Full creation workflow -- select a tool (bbox, polygon, etc.), draw on the image, assign class from ontology, save. Ontology-driven: classes are defined upfront in a project ontology. +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Users need to compare model predictions against ground truth. | +| **Complexity** | Low | +| **Depends on** | Prediction parser (existing), JSONL parser (TS-1) | -**What DataVisor has:** No annotation creation. +**What to build:** Extend the prediction import dialog to accept classification predictions. Two formats: -**Gap:** Sometimes during triage, a user finds a missing annotation (false negative) and wants to add a bounding box. This is a natural part of the correction workflow. +1. **JSONL format** (matching ingestion): `{"image":"filename.jpg","suffix":"predicted_class","confidence":0.95}` +2. **CSV format** (simpler): `filename,predicted_class,confidence` -**Priority:** NICE-TO-HAVE for v1.1. The primary workflow is editing existing annotations, not creating new ones. New annotation creation can be added in v1.2 if users request it. +The existing prediction import flow stores predictions as annotations with `source != 'ground_truth'`. For classification, each prediction is one annotation per image (instead of potentially many bboxes). -**Complexity:** MEDIUM. Requires: draw-to-create interaction, class assignment UI, persistence to DuckDB. - -**Depends on:** Annotation editing (2A). - -**Recommendation:** Defer to v1.2. Focus v1.1 on edit/delete of existing annotations. 
If added later, scope to bounding boxes only (click-drag to create, assign class from existing class list). +**Confidence handling:** Classification models often output a probability distribution over all classes. For DataVisor's single-label scope, only the top-1 prediction and its confidence are imported. Top-k support is a future enhancement (see Differentiators). --- -### 2C. Annotation Backend Integration (CVAT, Label Studio) - -**What competitors do:** +### TS-7: Sample Detail Modal Adaptation -FiftyOne's annotation integration is a key feature: -- `dataset.annotate()` uploads samples to CVAT/Label Studio/Labelbox -- Configurable permissions: `allow_additions`, `allow_deletions`, `allow_label_edits`, `allow_spatial_edits` -- Label schema defines task type, classes, and custom attributes -- `dataset.load_annotations()` merges results back -- Annotation runs are tracked: rename, inspect, delete +| Attribute | Detail | +|-----------|--------| +| **Why expected** | Clicking an image must show useful information. The detection modal shows bboxes; the classification modal should show class info. | +| **Complexity** | Medium | +| **Depends on** | Sample modal (existing), annotation editor (existing), dataset `task_type` | -Encord IS an annotation platform, so this is not "integration" but rather native capability. +**What to build:** Conditional rendering in `sample-modal.tsx` based on `task_type`: -**What DataVisor has:** No integration with external annotation tools. 
+**For classification datasets:** +- Remove bbox overlay, editable-rect, draw-layer components +- Show GT class label prominently (large text above image) +- Show predicted class label (if exists) with confidence score +- Show correct/incorrect status with color coding (green/red) +- Show class change dropdown (for editing the GT label -- replaces bbox editing) +- Retain: similarity panel, tags, triage overlay, keyboard navigation -**Gap:** For heavy annotation tasks (re-labeling hundreds of samples), an integration with CVAT would be valuable. But DataVisor is a personal tool, and setting up CVAT is non-trivial. - -**Priority:** NICE-TO-HAVE for v1.1. Most users of a personal introspection tool will make quick fixes in-app, not set up a separate CVAT instance. Defer until there is demonstrated need. - -**Complexity:** HIGH. Requires CVAT API integration, task creation, status tracking, result merging. - -**Depends on:** Nothing, but is only useful if the user has CVAT/Label Studio running. - -**Recommendation:** Defer indefinitely. Instead, support exporting flagged samples to COCO format (from 1E), which can be imported into any annotation tool. This achieves the same goal without tight coupling. +**The annotation editor** currently supports bbox move/resize/delete and class change. For classification, only class change is relevant. The `class-picker.tsx` component (dropdown to change category) works as-is. --- -## 3. Error Triage & Quality Analysis - -### 3A. 
Interactive Evaluation Dashboard (Confusion Matrix, PR Curves, Per-Class AP) - -**What competitors do:** - -FiftyOne's Model Evaluation panel is a standout feature: -- Interactive confusion matrix: click any cell to filter the grid to those specific GT/prediction pairs -- PR curves with adjustable confidence thresholds -- Per-class metrics: precision, recall, F1, AP -- All metrics are linked to the dataset view -- changing filters updates the evaluation metrics -- Subset evaluation: `use_subset()` to evaluate on specific conditions (e.g., only nighttime images) +### TS-8: Statistics Dashboard Adaptation -Encord Active provides model quality metrics focused on active learning: -- Entropy, Least Confidence, Margin, Variance, Mean Object Confidence -- These rank samples by uncertainty for prioritized re-annotation +| Attribute | Detail | +|-----------|--------| +| **Why expected** | The overview tab shows annotation counts, class distribution, split breakdown. These need to reflect classification semantics. | +| **Complexity** | Low | +| **Depends on** | Stats dashboard (existing), statistics hooks (existing) | -**What DataVisor has:** Error categorization (TP/FP/FN/Label Error) and dataset statistics dashboard (class distribution, annotation counts). No confusion matrix, no PR curves, no per-class AP. +**What to build:** Adapt dashboard text and metrics for classification context: -**Gap:** The confusion matrix with click-to-filter is FiftyOne's killer evaluation feature. A CV engineer evaluating a model needs to see "my model confuses 'car' with 'truck' 40% of the time" and then immediately see those misclassified samples. DataVisor has the error categorization but lacks the statistical visualization layer. 
+| Detection term | Classification term | +|----------------|-------------------| +| "Annotations" | "Labeled images" | +| "Annotations per image" histogram | "Class distribution" (same chart, simpler) | +| "Bounding box area" histogram | Remove (not applicable) | +| mAP/AP metrics cards | Accuracy/F1 metrics cards | +| IoU threshold slider | Remove | -**Priority:** TABLE STAKES for a model evaluation tool. DataVisor already has GT vs Predictions comparison, but without aggregate metrics (confusion matrix, mAP, per-class AP), the evaluation is sample-by-sample rather than systematic. - -**Complexity:** HIGH. -- Confusion matrix: Aggregate TP/FP/FN by class pair, render interactive heatmap, click-to-filter -- PR curves: Sweep confidence thresholds, compute precision/recall per class, Recharts line chart -- Per-class AP: Standard COCO-style AP computation -- All must link to the grid view for click-to-filter - -**Depends on:** Evaluation pipeline (existing -- TP/FP/FN matching). Statistics dashboard (existing -- extend it). Recharts (existing in stack). - -**Recommendation:** Build the confusion matrix with click-to-filter as the centerpiece. This single feature closes the biggest evaluation gap. PR curves and per-class AP can follow. Use Recharts (already in the stack) for visualization. +The underlying data is the same (annotations table rows), but the presentation changes. The class distribution chart works identically -- it counts annotations per category, which for classification is images per class. --- -### 3B. Quality Scoring Metrics (Uniqueness, Hardness, Mistakenness) - -**What competitors do:** - -FiftyOne Brain provides four computed quality scores: -- **Uniqueness:** Non-duplicate detection, comparing image content across the dataset. Useful for deduplication and early-stage data selection. -- **Hardness:** Per-sample difficulty during training, computed from model logits. Helps identify which unlabeled examples deserve annotation budget. 
-- **Mistakenness:** Annotation error probability, computed from model logits. Identifies likely mislabeled samples. Works on classification and detection. -- **Representativeness:** How typical a sample is, revealing common data modes vs outliers. - -Additionally: -- **Exact duplicate detection:** Identifies identical files with different names. -- **Near-duplicate detection:** Finds visually similar images that may cause data quality issues. -- **Leaky splits detection:** Finds potential data leakage between train/test/val splits. - -Encord Active provides 25+ quality metrics in three categories: -- **Data quality:** Brightness (0-1), Sharpness (0-1), Uniqueness, Area, Diversity -- **Label quality:** Border Proximity, Broken Object Tracks, Classification Quality, Label Duplicates, Object Classification Quality, Annotation Quality Score, Relative Area, Aspect Ratio -- **Issue shortcuts:** Pre-configured filters for common problems -- Duplicates (uniqueness < 0.00001), Blur (sharpness < 0.005), Dark (brightness < 0.1), Bright (brightness > 0.7), Low Annotation Quality (quality < 0.02) +## Differentiators -**What DataVisor has:** Error categorization (Hard FP, Label Error, FN) and Pydantic AI agent for pattern detection. No per-sample quality scores. No deduplication. No hardness/mistakenness scoring. +Features that set DataVisor apart. Not expected, but valuable. Build these after table stakes. -**Gap:** DataVisor categorizes errors but does not score individual samples on quality dimensions. FiftyOne's mistakenness score and Encord's issue shortcuts are the most actionable features -- they surface the "worst" samples automatically. +### D-1: Misclassification Drill-Down View -**Priority:** TABLE STAKES for the "worst images ranking" feature planned in v1.1. A combined quality score requires component metrics. 
+| Attribute | Detail | +|-----------|--------| +| **Value proposition** | Click a confused pair in the confusion matrix and see side-by-side examples of "predicted 8, actually 3" with the images. No other lightweight tool does this well. | +| **Complexity** | Medium | +| **Depends on** | Confusion matrix click-to-filter (TS-4), classification error analysis (TS-5) | -**Complexity:** MEDIUM per metric. -- Image uniqueness: Compute from existing DINOv2 embeddings + Qdrant similarity search (infrastructure exists) -- Image brightness/sharpness: Simple image processing metrics (OpenCV) -- Near-duplicate detection: Cosine similarity threshold on existing embeddings -- Mistakenness: Requires model logits (not just predictions), which DataVisor does not currently import +**What to build:** When a user clicks a confusion matrix cell (i, j), show a dedicated panel with: +1. All images where GT=class_i and prediction=class_j +2. Thumbnails with both labels visible (badge: "GT: 3 / Pred: 8") +3. Sort by confidence (most confident mistakes first -- these are the most concerning) +4. One-click ability to correct the GT label if it is actually wrong (label error) -**Depends on:** DINOv2 embeddings (existing). Qdrant (existing). Evaluation pipeline (existing). - -**Recommendation:** For v1.1, implement: -1. **Near-duplicate detection** using existing embeddings (low effort, high value) -2. **Image quality metrics** (brightness, sharpness, contrast) for the AI agent -3. **Composite "worst sample" score** combining: error count + low confidence + low uniqueness - -Defer hardness and mistakenness -- they require model logits, which would need a new import schema. +This extends the existing click-to-filter behavior to be richer for classification. FiftyOne shows the filtered sample list, but DataVisor can show a purpose-built comparison view. --- -### 3C. Error Triage Workflow (Review, Tag, Resolve) - -**What competitors do:** - -FiftyOne's triage workflow is programmatic: -1. 
Run `evaluate_detections()` to tag TP/FP/FN -2. Create a view filtering to FP or FN samples -3. Browse in the App, optionally clicking confusion matrix cells -4. Batch-tag samples via the App's tag icon -5. Programmatically process tagged samples - -FiftyOne App batch operations include: select samples in grid -> tag selected -> clone selected -> delete selected -> delete selected labels. Selection works via checkbox on each sample. - -Encord's triage workflow is more structured: -1. Encord Active surfaces issues via quality metrics and shortcuts (Blur, Dark, Low Quality, etc.) -2. Users create "Collections" -- saved groups of problematic data units -3. Issues can be tagged and tracked in Project Analytics -4. Workflows route flagged samples back to annotation stages (Annotate -> Review -> Approve pipeline) -5. Review mode: approve (N key), reject (B key), toggle review edit mode (Ctrl+E) -6. "Data Agents" automate triage by integrating foundation models into workflows - -**What DataVisor has:** Error categorization (Hard FP, Label Error, FN), bulk tagging, saved views, AI agent recommendations. No structured review-approve workflow. No issue tracking. +### D-2: Class-Level Performance Sparklines -**Gap:** The gap is not in error detection (DataVisor's categorization + AI agent is strong) but in the triage workflow UX: -- No dedicated "error review" mode that dims non-error samples -- No approve/reject/skip workflow for reviewing flagged items -- No progress tracking (reviewed 45/120 flagged samples) +| Attribute | Detail | +|-----------|--------| +| **Value proposition** | At-a-glance view of which classes perform well and which are disasters, without reading a table of numbers. 
| +| **Complexity** | Low | +| **Depends on** | Per-class metrics (TS-3), Recharts (existing) | -**Priority:** DIFFERENTIATOR -- a focused error triage mode with keyboard-driven review (approve/reject/skip) would be faster than both FiftyOne's programmatic approach and Encord's multi-platform workflow. - -**Complexity:** MEDIUM. -- Review mode: filter to error samples, highlight current, dim others -- Keyboard: N = correct (remove error tag), B = confirmed error, Space = skip -- Progress: "Reviewed 45/120 -- 23 confirmed errors, 22 false alarms" -- Persistence: Track review status per sample - -**Depends on:** Error categorization (existing). Tagging (existing). Keyboard shortcuts (new, see section 5). - -**Recommendation:** Build a dedicated "Triage Mode" that enters a focused review workflow: shows one error sample at a time, keyboard-driven approve/reject/skip, progress tracking, auto-advances to next sample. This is the kind of opinionated UX that makes DataVisor better than FiftyOne for error review, where FiftyOne forces users to manually browse and tag. +**What to build:** In the per-class metrics table, add inline sparkline-style bars for precision, recall, and F1. Color-code: green (>0.9), yellow (0.7-0.9), red (<0.7). Sort by worst-performing class by default to surface problems immediately. --- -### 3D. Worst Images Ranking (Combined Quality Score) - -**What competitors do:** - -FiftyOne Brain's `compute_hardness()` and `compute_mistakenness()` each produce a per-sample float score that can be sorted to find the worst samples. Users combine multiple scores by creating computed fields: -```python -dataset.set_field("quality_score", F("mistakenness") + F("hardness")) -``` - -Encord Active's issue shortcuts pre-define thresholds (blur < 0.005, dark < 0.1, etc.) and surface samples that fail multiple checks. - -Neither platform has a single "worst images" composite ranking out of the box. 
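The per-class table that D-2 ranks reduces to plain counting -- no metrics library required. A minimal sketch, assuming predictions are available as `(gt, pred)` label pairs; the function name and record shape are illustrative, not the actual DataVisor API:

```python
from collections import Counter

def per_class_metrics(pairs):
    """Compute precision/recall/F1 per class from (gt, pred) label pairs.

    Returns one row per class, sorted worst-F1-first to match D-2's
    default "surface problems immediately" ordering.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    classes = set()
    for gt, pred in pairs:
        classes.update((gt, pred))
        if gt == pred:
            tp[gt] += 1
        else:
            fn[gt] += 1    # the true class was missed
            fp[pred] += 1  # the predicted class gained a false positive
    rows = []
    for cls in classes:
        p = tp[cls] / (tp[cls] + fp[cls]) if tp[cls] + fp[cls] else 0.0
        r = tp[cls] / (tp[cls] + fn[cls]) if tp[cls] + fn[cls] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        rows.append({"class": cls, "precision": p, "recall": r, "f1": f1})
    return sorted(rows, key=lambda row: row["f1"])
```

Macro F1 (TS-3) is then just the mean of the `f1` column; the worst-first ordering feeds D-2's default sort.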
+### D-3: Confidence Distribution Analysis
-**What DataVisor has:** Error categorization but no numeric ranking. The AI agent detects patterns but does not rank individual samples.
+| Attribute | Detail |
+|-----------|--------|
+| **Value proposition** | Shows where the model is uncertain vs confident, separated by correct/incorrect predictions. Reveals overconfident mistakes. |
+| **Complexity** | Medium |
+| **Depends on** | Classification predictions with confidence (TS-6), Recharts (existing) |
-**Gap:** A composite "data quality score" that ranks every sample by how problematic it is. This would power the "Smart worst images ranking" feature planned for v1.1.
+**What to build:** Two overlaid histograms:
+1. Confidence distribution of **correct** predictions (expect: skewed right, high confidence)
+2. Confidence distribution of **incorrect** predictions (expect: more spread out)
-**Priority:** DIFFERENTIATOR -- neither competitor does this as a first-class feature. A "Problems" tab showing samples ranked by composite badness score is novel.
+If the incorrect predictions have high confidence, the model is dangerously overconfident. If they cluster at low confidence, a simple threshold can filter them.
-**Complexity:** MEDIUM.
-- Define component metrics: error count, confidence variance, brightness, sharpness, near-duplicate distance, annotation density
-- Normalize each to 0-1
-- Weighted combination into single score
-- Store in DuckDB, expose as sortable field
-- UI: "Worst Images" view sorted by composite score
-**Depends on:** Quality metrics (3B). Error categorization (existing).
-**Recommendation:** Define a `quality_score` field computed from: (a) number of errors on sample, (b) mean prediction confidence (low = uncertain), (c) near-duplicate distance (high = unusual), (d) image quality metrics. Surface as a sortable column and as a dedicated "Worst Images" view.
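Binning the two populations is a one-pass job on the backend. A hedged sketch in pure Python -- the record shape `(gt, pred, confidence)` is an assumption for illustration, not the actual prediction import schema:

```python
def confidence_histograms(records, bins=10):
    """Split top-1 prediction confidences into correct/incorrect histograms.

    `records` is an iterable of (gt, pred, confidence) with confidence
    in [0, 1]. Returns two `bins`-length count lists: correct, incorrect.
    """
    correct = [0] * bins
    incorrect = [0] * bins
    for gt, pred, conf in records:
        idx = min(int(conf * bins), bins - 1)  # clamp conf == 1.0 into the last bin
        (correct if gt == pred else incorrect)[idx] += 1
    return correct, incorrect
```

A large mass in the top bins of `incorrect` is exactly the "dangerously overconfident" signature described above; mass in the low bins means a simple confidence threshold will catch most mistakes.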
+Also: **confidence calibration plot** (reliability diagram) showing predicted confidence vs actual accuracy. Most models are poorly calibrated, and this visualization makes it obvious. --- -## 4. Deployment & Infrastructure - -### 4A. Docker Deployment - -**What competitors do:** - -FiftyOne (open-source) provides a Dockerfile for building custom images: -- Configurable Python version -- Persistent `/fiftyone` directory for databases and datasets -- Docker Compose not officially provided for OSS, but community examples exist - -FiftyOne Enterprise provides: -- Helm chart for Kubernetes deployment (helm.fiftyone.ai) -- Docker Compose for smaller deployments -- Central Authentication Service (CAS) -- Multi-container architecture: app, API, database, CAS - -Encord is SaaS-only -- no self-hosted Docker deployment. Data stays in user's cloud storage; the platform is hosted by Encord. +### D-4: Per-Split Evaluation Comparison -**What DataVisor has:** No Docker support. Runs locally with `uvicorn` + `npm run dev`. +| Attribute | Detail | +|-----------|--------| +| **Value proposition** | Compare model performance across train/val/test splits. Large gap between train and val accuracy immediately reveals overfitting. | +| **Complexity** | Low | +| **Depends on** | Split handling (existing), classification evaluation (TS-3) | -**Gap:** DataVisor needs Docker for cloud VM deployment (per PROJECT.md). This is the most basic deployment gap. - -**Priority:** TABLE STAKES for v1.1 (explicitly in scope per milestone definition). - -**Complexity:** MEDIUM. -- Dockerfile: Multi-stage build (Python backend + Node frontend build) -- Docker Compose: Backend, frontend, Qdrant services -- Volume mounts: Dataset storage, DuckDB database, Qdrant data -- Environment configuration: Image source paths, GPU support (optional) - -**Depends on:** Nothing. Can be built in parallel with features. 
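Since classification evaluation already runs per split, the D-4 comparison reduces to a group-by plus a thresholded diff against the train split. A minimal sketch under assumed names (none of this is the real DataVisor schema):

```python
def accuracy(pairs):
    """Plain top-1 accuracy over (gt, pred) label pairs."""
    return sum(gt == pred for gt, pred in pairs) / len(pairs)

def per_split_comparison(samples, metric):
    """Score each split and flag suspicious gaps versus the train split.

    `samples` maps split name -> list of (gt, pred) pairs; `metric` is any
    callable over such a list (accuracy, macro F1, ...).
    """
    scores = {split: metric(pairs) for split, pairs in samples.items()}
    flags = {}
    if "train" in scores:
        for split, score in scores.items():
            # mirrors the ">5% absolute drop vs train" cell-highlighting rule
            flags[split] = (scores["train"] - score) > 0.05
    return scores, flags
```

The `flags` dict is what would drive the highlighted cells in the comparison table.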
- -**Recommendation:** Multi-stage Dockerfile: (1) Node build stage for frontend, (2) Python runtime with bundled frontend. Docker Compose with three services: app, qdrant, and an init container for setup. Map volumes for `/data` (datasets), `/db` (DuckDB), `/qdrant` (vectors). +**What to build:** A comparison table/chart showing accuracy, macro F1, and per-class F1 side-by-side for each split. Highlight cells where test performance is significantly worse than train (>5% drop). This is trivial to compute (run classification evaluation per split) but extremely informative. --- -### 4B. Authentication - -**What competitors do:** - -FiftyOne OSS: No authentication. Anyone with access to the port can use it. - -FiftyOne Enterprise: Full auth via Central Authentication Service (CAS), supporting OIDC/OAuth2, Auth0, and air-gapped deployments. Role-based access control with user groups and permissions. +### D-5: Embedding Scatter with Classification Coloring -Encord: Cloud-hosted with SSO, SAML, team management, SOC-2/HIPAA/GDPR compliance. +| Attribute | Detail | +|-----------|--------| +| **Value proposition** | The existing t-SNE scatter plot colored by class label instantly shows cluster quality. Misclassifications visible as dots in the wrong cluster. | +| **Complexity** | Low | +| **Depends on** | Embedding scatter (existing), classification labels | -**What DataVisor has:** No authentication. Open port. +**What to build:** The existing embedding scatter already supports coloring by class. For classification datasets, default to coloring by GT class label. Add a toggle to color by: +1. **GT class** (default) -- shows natural cluster structure +2. **Predicted class** -- shows model's view of the data +3. **Correct/incorrect** -- highlights all misclassifications as red dots -**Gap:** When deployed on a cloud VM, the app is exposed to the internet. Basic auth is the minimum security requirement. This is explicitly in scope for v1.1. 
- -**Priority:** TABLE STAKES for cloud deployment. Without auth, anyone who discovers the URL can access your dataset. - -**Complexity:** LOW. -- Single-user basic auth (username/password from environment variable) -- Applied as middleware on all API routes and frontend routes -- No user management, no RBAC, no SSO -- just a password gate - -**Depends on:** Docker deployment (4A). - -**Recommendation:** Implement as FastAPI middleware: check `Authorization: Basic ...` header against `DATAVISOR_USERNAME` / `DATAVISOR_PASSWORD` env vars. Frontend: show login form, store token in session. This is explicitly scoped as single-user in PROJECT.md -- do not over-engineer. +Option 3 is the killer feature: overlay misclassification status on the embedding plot. Misclassified samples that are near the decision boundary (cluster edge) are expected. Misclassified samples deep inside a correct cluster suggest label errors. --- -### 4C. Cloud Deployment Scripts - -**What competitors do:** - -FiftyOne Enterprise: Helm chart for Kubernetes with detailed docs (helm.fiftyone.ai). Community Docker deployment guides. - -FiftyOne OSS: Remote sessions via SSH port forwarding (`fiftyone app connect --destination user@host`). This is the simplest cloud access pattern. +## Anti-Features -Encord: No deployment needed (SaaS). +Features to explicitly NOT build for this milestone. -**What DataVisor has:** No deployment scripts. - -**Gap:** PROJECT.md specifies "GCP deployment script + local run script with setup instructions." - -**Priority:** TABLE STAKES for v1.1 (explicitly in scope). - -**Complexity:** LOW-MEDIUM. -- `scripts/deploy-gcp.sh`: Create GCE instance, install Docker, pull/build image, start compose -- `scripts/run-local.sh`: Docker compose up with sensible defaults -- Documentation: Setup instructions, port configuration, data mounting - -**Depends on:** Docker deployment (4A). Auth (4B). 
-
-**Recommendation:** Provide two scripts: (1) `run-local.sh` for Docker Compose on local machine, (2) `deploy-gcp.sh` for GCE VM provisioning with startup script. Both use Docker Compose. The GCP script should configure firewall rules for port 443 only with auth required.
+| Anti-Feature | Why Avoid | What to Do Instead |
+|--------------|-----------|-------------------|
+| **Multi-label classification** | Different data model (multiple labels per image), different metrics (Hamming loss, subset accuracy), different UI (checkbox lists instead of a single badge). Scope explosion. | Scope to single-label only. Add multi-label in a future milestone if needed. |
+| **Top-K evaluation** | Requires importing the full probability distribution (one score per class for every image), which complicates the prediction import schema significantly. | Import only the top-1 prediction with confidence. Note: the confidence score captures some of this info already. |
+| **PR curves for classification** | PR curves are less informative for multi-class classification than for detection. The confusion matrix and per-class precision/recall table are better tools. Confidence-based filtering (existing) handles the threshold-sweep use case. | Show per-class precision/recall in a table. Use the confidence histogram (D-3) for threshold analysis. |
+| **mAP for classification** | mAP is a detection metric (requires IoU). Accuracy and macro F1 are the standard classification metrics. Showing mAP would confuse users. | Show accuracy, macro F1, weighted F1. |
+| **Bbox editing for classification** | No bounding boxes. The editable-rect and draw-layer components are irrelevant. | Show a class label editor (dropdown) instead. |
+| **IoU threshold controls** | No spatial matching, no IoU. Showing an IoU slider would confuse users. | Hide IoU controls when `task_type === 'classification'`. |
+| **Detection-specific error categories** | "Hard FP" (no nearby GT box) has no meaning in classification.
"Label Error" (correct box, wrong class) conflates with misclassification. | Use simpler categories: Correct, Misclassified, Missing Prediction. | --- -### 4D. Remote Sessions / Tunnel Access - -**What competitors do:** - -FiftyOne supports remote sessions natively: -```bash -# On remote machine -fiftyone app launch --remote --port 5151 +## Feature Dependencies -# On local machine -fiftyone app connect --destination user@remote --port 5151 ``` -This sets up SSH port forwarding automatically. Users can also manually forward: `ssh -N -L 5151:localhost:5151 user@remote`. - -**What DataVisor has:** No remote session support. - -**Gap:** Minor gap if Docker + auth is implemented (users just hit the URL). SSH tunneling is a nice developer convenience but not essential when basic auth exists. - -**Priority:** NICE-TO-HAVE. Docker + auth covers the primary use case. - -**Complexity:** LOW. Document the SSH tunnel approach: `ssh -N -L 8080:localhost:8080 user@vm`. - -**Depends on:** Docker deployment (4A). - -**Recommendation:** Document SSH tunneling as an alternative to basic auth for security-conscious users. No code needed -- just docs. - ---- - -## 5. Keyboard Shortcuts & Power-User UX - -### 5A. Core Navigation Shortcuts - -**What competitors do:** - -FiftyOne App shortcuts (accessed via `?` key): -- `?` -- Show all shortcuts -- `z` -- Crop/zoom to visible labels -- `ESC` -- Reset view -- Arrow keys (up/down) -- Rotate z-order of overlapping labels -- Spacebar -- Play/pause video -- `<` / `>` -- Frame-by-frame navigation (video, when paused) -- `0-9` -- Seek to 0%-90% of video -- Grid filtering and sorting via sidebar (no keyboard shortcuts for grid navigation) - -FiftyOne notably does NOT have keyboard shortcuts for: navigating between samples in the grid, toggling label visibility by keyboard, or sample selection by keyboard. These are open feature requests (GitHub issues #2120, #1761). 
- -Encord annotation editor shortcuts (comprehensive): -- **Navigation:** Arrow keys (next/previous sample, frame navigation), Space (play/pause) -- **Editing:** Ctrl+Z/Ctrl+Shift+Z (undo/redo), Backspace (delete), Ctrl+C/V (copy/paste) -- **Review:** N (approve), B (reject), Ctrl+E (toggle review edit) -- **Tools:** D (freehand drawing), G (brush), H (eraser), `[`/`]` (brush size) -- **Annotation:** A (add vertex), S (remove vertex), F (edit vertex), Enter (complete), Esc (cancel) -- **Display:** Shift+H (hide all labels), Shift+N (show object names) -- **Bulk:** Ctrl+A (select all), Shift+D (remove from frame) -- **Meta:** Ctrl+Shift+K (open shortcuts menu), Ctrl+S (save), Shift+Enter (submit task) - -**What DataVisor has:** No keyboard shortcuts. - -**Gap:** Keyboard navigation is expected by power users. Both competitors support it, though FiftyOne's implementation is incomplete (no grid navigation shortcuts). - -**Priority:** TABLE STAKES for power-user adoption. CV engineers reviewing hundreds of samples expect keyboard navigation. The triage workflow (3C) depends on this. - -**Complexity:** MEDIUM. - -**Depends on:** Sample detail modal (existing). Grid view (existing). Triage mode (new, 3C). 
- -**Recommendation:** Implement in two tiers: - -**Tier 1 (v1.1 must-have):** -| Shortcut | Action | -|----------|--------| -| `?` | Show shortcuts help overlay | -| `ArrowLeft` / `ArrowRight` | Previous/next sample in modal | -| `ESC` | Close modal / cancel action | -| `Space` | Toggle label visibility | -| `G` | Toggle GT labels | -| `P` | Toggle prediction labels | -| `T` | Tag current sample | -| `Delete` / `Backspace` | Delete selected annotation (when editing) | -| `Ctrl+Z` / `Cmd+Z` | Undo (when editing) | -| `1-9` | Quick-assign class by index (when editing) | - -**Tier 2 (v1.1 nice-to-have):** -| Shortcut | Action | -|----------|--------| -| `J` / `K` | Navigate grid (previous/next row) | -| `Enter` | Open selected sample in modal | -| `E` | Enter edit mode on selected annotation | -| `F` | Toggle fullscreen on modal | -| `/` | Focus search bar | -| `N` / `B` | Approve / Reject in triage mode | - ---- - -### 5B. Customizable Hotkeys - -**What competitors do:** - -Encord allows customizable hotkeys: users can remap keyboard shortcuts to match their workflow preferences. Shortcuts menu via Ctrl+Shift+K. - -FiftyOne does not support customizable hotkeys. - -**What DataVisor has:** No shortcuts at all. - -**Gap:** Minor. Fixed shortcuts with good defaults cover 95% of needs. - -**Priority:** NICE-TO-HAVE. Not worth the complexity for v1.1. - -**Complexity:** MEDIUM. Requires a settings UI and keymap storage. - -**Recommendation:** Defer. Ship with sensible fixed defaults. Revisit if users request remapping. - ---- - -## 6. View & Workspace Management - -### 6A. 
Custom Workspaces / Panel Layouts - -**What competitors do:** - -FiftyOne Spaces (since v0.19) allow: -- Multiple panels open simultaneously (Grid, Embeddings, Histograms, Map, Model Evaluation) -- Split panels horizontally or vertically -- Drag tabs between panels -- Save workspace layouts with name, description, and color -- Load saved workspaces programmatically or via UI -- Workspace state includes panel types, sizes, positions, and internal panel state - -Encord does not have customizable workspace layouts -- it uses a fixed editor interface. - -**What DataVisor has:** Fixed layout with grid view and side-by-side embedding panel. - -**Gap:** FiftyOne's workspace system is mature and powerful. However, DataVisor's fixed layout already shows grid + embeddings + sidebar, which covers the primary workflow. Multi-panel workspaces are a power feature with diminishing returns for a personal tool. - -**Priority:** NICE-TO-HAVE for v1.1. The current layout works. - -**Complexity:** HIGH. Requires a panel framework, drag-and-drop layout, persistence. - -**Depends on:** Nothing, but affects all existing UI components. - -**Recommendation:** Defer to v1.2+. Focus v1.1 on the single-layout experience with the planned new features (triage mode, evaluation dashboard). If workspaces are ever added, start with a simple tab system rather than full drag-and-drop panels. - ---- - -### 6B. Histograms / Distribution Panels - -**What competitors do:** - -FiftyOne has a Histograms panel that shows: -- Distribution of any field (class labels, confidence scores, metadata values) -- Interactive: click histogram bars to filter the grid -- Updates automatically as the view changes - -Encord Active shows metric distributions for each quality metric. - -**What DataVisor has:** Dataset statistics dashboard with class distribution (bar chart) and annotation counts. Not interactive (clicking does not filter). 
- -**Gap:** Interactive histograms that filter the grid are a natural extension of the existing statistics dashboard. - -**Priority:** DIFFERENTIATOR -- interactive histograms (click bar to filter) would connect the statistics dashboard to the grid view, enabling quick data exploration by distribution. - -**Complexity:** MEDIUM. -- Render histograms for any numeric/categorical field (Recharts, already in stack) -- Click handler: clicking a bar adds a filter to the sidebar -- Bidirectional: changing sidebar filters updates histogram highlighting - -**Depends on:** Statistics dashboard (existing). Sidebar filtering (existing). Recharts (existing). - -**Recommendation:** Make the existing statistics dashboard interactive. When a user clicks on a class in the distribution chart, filter the grid to that class. When they click a confidence range bar, filter to that range. This requires minimal new UI -- just adding click handlers to existing Recharts components and dispatching filter actions to the Zustand store. - ---- - -### 6C. Map / Geolocation Panel - -**What competitors do:** - -FiftyOne has a Map panel (Mapbox GL JS) for datasets with GeoLocation fields: -- Scatterplot of sample locations on a map -- Lasso selection on the map filters the grid -- Multiple map types - -Encord does not have a map panel. - -**What DataVisor has:** No geolocation support. - -**Gap:** Only relevant for datasets with GPS metadata (autonomous driving, satellite imagery, drone footage). - -**Priority:** NICE-TO-HAVE. Out of scope for v1.1 unless the user's datasets include geolocation. - -**Complexity:** MEDIUM. Mapbox GL JS integration, GeoJSON field handling. - -**Recommendation:** Defer. Only build if there is a specific need for geolocation-aware datasets. - ---- - -## 7. Advanced Features - -### 7A. 
Model Zoo (Run Inference In-App) - -**What competitors do:** - -FiftyOne Model Zoo provides: -```python -import fiftyone.zoo as foz -model = foz.load_zoo_model("faster-rcnn-resnet50-fpn-coco-torch") -dataset.apply_model(model, label_field="predictions") +[TS-1: JSONL Parser + Schema] + | + +---> [TS-2: Class Label Badge] + | | + | +---> [D-5: Embedding Coloring] (uses class labels) + | + +---> [TS-6: Prediction Import] + | | + | +---> [TS-3: Classification Eval Metrics] + | | | + | | +---> [TS-4: Confusion Matrix Adaptation] + | | | | + | | | +---> [D-1: Misclassification Drill-Down] + | | | + | | +---> [TS-5: Error Analysis] + | | | | + | | | +---> [D-3: Confidence Distribution] + | | | + | | +---> [D-2: Per-Class Sparklines] + | | | + | | +---> [D-4: Per-Split Comparison] + | | + +---> [TS-7: Detail Modal Adaptation] + | + +---> [TS-8: Stats Dashboard Adaptation] ``` -- 70+ pre-trained models from PyTorch and TensorFlow -- `apply_model()` runs inference and stores predictions as label fields -- `compute_embeddings()` generates embeddings from any model -- Custom model support via `TorchImageModel` class - -Encord integrates models via "Data Agents" for pre-labeling and automated review. - -**What DataVisor has:** Import pre-computed predictions (JSON). VLM auto-tagging (Moondream2). No general model inference. -**Gap:** DataVisor imports predictions but does not run inference. Users must run models externally and import results. +**Critical path:** TS-1 (parser/schema) unblocks everything. TS-6 (prediction import) unblocks all evaluation features. TS-3 (metrics) unblocks all downstream analysis. -**Priority:** NICE-TO-HAVE for v1.1. The import-predictions workflow is sufficient for a personal tool. Running inference adds GPU management complexity. - -**Complexity:** HIGH. Model download, GPU scheduling, inference pipeline, result storage. - -**Depends on:** Prediction import (existing). - -**Recommendation:** Defer. 
The existing "import predictions" workflow is pragmatic. Running inference is a different product surface. If added later, start with a single model (e.g., YOLOv8) rather than a full zoo. +**Parallelizable:** TS-2 (badge display), TS-7 (detail modal), and TS-8 (stats dashboard) can be built in parallel once TS-1 is complete. They all depend on having classification data in the database but not on each other. --- -### 7B. Similarity Search UX - -**What competitors do:** +## MVP Recommendation -FiftyOne supports multiple similarity backends: -- scikit-learn, Qdrant, Redis, Pinecone, MongoDB, Elasticsearch, Milvus, LanceDB -- "Find similar" from any sample: `dataset.sort_by_similarity(sample_id, k=25)` -- Image-level and patch-level (object crop) similarity -- Text-to-image similarity via CLIP embeddings +**Phase 1 (Core Ingestion + Display):** +1. TS-1: JSONL ingestion parser + schema extension +2. TS-2: Class label badge on thumbnails +3. TS-7: Sample detail modal adaptation +4. TS-8: Statistics dashboard adaptation -Encord Active provides similarity search, natural language search, and image-based search. +This gets a classification dataset loaded, browsable, and visually meaningful. Users can explore the dataset, see class distribution, filter by class, use the embedding scatter. -**What DataVisor has:** Qdrant vector storage for similarity search. The infrastructure exists but there is no "find similar" UI interaction. +**Phase 2 (Evaluation + Error Analysis):** +5. TS-6: Classification prediction import +6. TS-3: Classification evaluation metrics +7. TS-4: Confusion matrix adaptation +8. TS-5: Classification error analysis -**Gap:** The backend capability exists but the UX is missing. Users cannot right-click a sample and say "find similar images." +This enables the full GT-vs-predictions workflow: import predictions, see accuracy/F1, explore the confusion matrix, identify misclassified samples. -**Priority:** TABLE STAKES -- the infrastructure is already built. 
Exposing it via UI is low-hanging fruit with high value. +**Phase 3 (Differentiators):** +9. D-5: Embedding coloring by correct/incorrect (low effort, high impact) +10. D-1: Misclassification drill-down view +11. D-2: Per-class sparklines +12. D-3: Confidence distribution histogram +13. D-4: Per-split comparison -**Complexity:** LOW. Add a "Find Similar" button/context menu item on each sample that queries Qdrant and updates the grid view. - -**Depends on:** Qdrant similarity search (existing). Grid view (existing). - -**Recommendation:** Add a "Find Similar" action to the sample detail modal and grid context menu. Query Qdrant for the k nearest neighbors by embedding, display results in the grid. This is one of the highest value-to-effort features available. +**Defer:** Multi-label classification, top-k evaluation, PR curves, mAP. --- -### 7C. Plugin System Enhancement (Python Panels, Operators) - -**What competitors do:** +## Existing Features That Work As-Is for Classification -FiftyOne's plugin system (mature, since v0.17+): -- **Panels:** Full React components embedded in the App, with Python backend logic -- **Python Panels (since v0.25):** Write panels entirely in Python (no JS needed) -- **Operators:** User-facing actions (simple to complex) that can be composed -- Configuration via `fiftyone.yml` manifest -- Plugin marketplace and curated plugin list +These features require NO changes: -**What DataVisor has:** `BasePlugin` class with ingestion/UI/transformation hooks. - -**Gap:** DataVisor's plugin system is simpler by design (Python-only). The gap is not in architecture but in ecosystem -- there are no third-party plugins yet. - -**Priority:** NICE-TO-HAVE for v1.1. The plugin system exists and works. Enhancements are not urgent. - -**Complexity:** Varies by enhancement. - -**Recommendation:** No plugin system changes for v1.1. Focus on core features. The existing `BasePlugin` is sufficient for extensibility. 
+| Feature | Why It Works | +|---------|-------------| +| **Image grid browser** | Renders thumbnails. Classification just needs a different overlay (badge instead of bbox). | +| **t-SNE embedding scatter** | DINOv2 embeddings are computed from images, not annotations. Works identically. | +| **Lasso filtering** | Selects by sample ID. Task-agnostic. | +| **Find similar** | Qdrant similarity search uses image embeddings. Task-agnostic. | +| **Near-duplicates** | Embedding distance. Task-agnostic. | +| **Saved views** | Filter state persistence. Task-agnostic. | +| **Tags / triage workflow** | Sample-level operations. Task-agnostic. | +| **Keyboard shortcuts** | Sample navigation. Task-agnostic. | +| **Split filtering** | Filters by split field. Task-agnostic. | +| **Search by filename** | Text search. Task-agnostic. | +| **VLM auto-tagging** | Uses image content, not annotations. Task-agnostic. | +| **AI agent analysis** | Operates on statistics and error data. Needs updated prompts for classification context but architecture is the same. | --- -### 7D. 3D Visualization (Point Clouds, Meshes) - -**What competitors do:** - -FiftyOne (since v0.17/0.24): 3D point cloud visualization, 3D bounding boxes, 3D polylines, mesh rendering, orthographic projection in grid view, dedicated 3D visualizer with configurable lights and materials. - -Encord (2025): LiDAR point cloud support (.pcd, .ply, .las, .laz, .mcap), sensor fusion visualization. - -**What DataVisor has:** 2D images only. - -**Gap:** Only relevant for 3D CV datasets (autonomous driving, robotics). - -**Priority:** OUT OF SCOPE per PROJECT.md. "3D point cloud visualization -- different rendering pipeline entirely." - -**Recommendation:** Defer indefinitely per project constraints. - ---- - -### 7E. Video Support - -**What competitors do:** - -FiftyOne: Video datasets with frame-by-frame browsing, temporal detection, playback controls (spacebar play/pause, `<`/`>` frame navigation, `0-9` seek). 
- -Encord: Full video annotation with keyframe interpolation, object tracking, temporal ranges. - -**What DataVisor has:** Image-only. - -**Gap:** Out of scope per PROJECT.md. - -**Priority:** OUT OF SCOPE. "Video annotation support -- image-only for now." - -**Recommendation:** Defer per project constraints. - ---- - -## 8. Data Operations - -### 8A. View Expressions / Advanced Filtering - -**What competitors do:** - -FiftyOne provides a rich Python API for dataset views: -```python -from fiftyone import ViewField as F - -# Chain view stages -view = ( - dataset - .match_tags("validation") - .match(F("metadata.size_bytes") >= 48 * 1024) - .filter_labels("predictions", F("confidence") > 0.8) - .sort_by("filepath") - .limit(100) -) -``` - -View stages include: `match()`, `filter_labels()`, `filter_field()`, `exists()`, `select()`, `exclude()`, `select_fields()`, `exclude_fields()`, `sort_by()`, `limit()`, `skip()`, `take()`, `shuffle()`, `match_tags()`, plus array operations (`.length()`, `.filter()`, `.map()`). - -Saved views store the filter rules, not the data -- storage efficient. - -**What DataVisor has:** Sidebar metadata filtering (dynamic on any field), search by filename, sort by metadata, saved views. No programmatic view API. - -**Gap:** DataVisor's UI-based filtering covers the common cases. The gap is the lack of a programmatic API for complex multi-stage filter chains. This matters for power users who want reproducible, scriptable data exploration. - -**Priority:** NICE-TO-HAVE for v1.1. The UI-based filtering covers 90% of use cases. A Python API is a v2 feature. - -**Complexity:** HIGH for a full view expression system. LOW for extending the existing filter system. - -**Depends on:** Sidebar filtering (existing). DuckDB (existing -- already supports complex SQL). - -**Recommendation:** Defer the Python view API. For v1.1, extend the sidebar to support: (a) filter by annotation count, (b) filter by prediction confidence range, (c) filter by error type. 
These cover the most common advanced filtering needs without a programmatic API. - ---- - -### 8B. Computed / Derived Fields - -**What competitors do:** - -FiftyOne allows adding computed fields: -```python -dataset.add_sample_field("num_objects", fo.IntField) -dataset.set_values("num_objects", [len(s.ground_truth.detections) for s in dataset]) -``` -And ViewExpressions for on-the-fly computation: -```python -view = dataset.set_field("quality", F("mistakenness") + F("hardness")) -``` - -**What DataVisor has:** Metadata fields from ingestion. No user-defined computed fields. - -**Gap:** Computed fields are useful for combining multiple metrics into composite scores (like the quality score from 3D). - -**Priority:** NICE-TO-HAVE for v1.1. Can be implemented server-side with DuckDB computed columns. - -**Complexity:** LOW for server-side computed fields in DuckDB. MEDIUM for exposing in UI. - -**Depends on:** DuckDB (existing). - -**Recommendation:** Implement the quality score (3D) as a computed field in DuckDB. Do not build a general user-defined field system for v1.1 -- just pre-compute the fields DataVisor needs. 
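The pre-compute approach fits in a few lines of SQL. A minimal sketch, with `sqlite3` standing in for DuckDB (the SQL shape is the same); the table name, column names, and the 0.5/0.5 weighting are illustrative assumptions, not the project's actual schema:

```python
import sqlite3

# sqlite3 stands in for DuckDB here; only the SQL shape matters.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (id TEXT PRIMARY KEY, brightness DOUBLE, sharpness DOUBLE)")
con.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("s1", 0.2, 0.9), ("s2", 0.8, 0.1)],
)

# Pre-compute one composite score column server-side, instead of
# building a general user-defined field system.
con.execute("ALTER TABLE samples ADD COLUMN quality DOUBLE")
con.execute("UPDATE samples SET quality = 0.5 * brightness + 0.5 * sharpness")

for row in con.execute("SELECT id, quality FROM samples ORDER BY id"):
    print(row)
```

Because the score lives in an ordinary column, the existing sidebar filtering and sorting work on it with no new query machinery.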
- ---- - -## Feature Priority Summary - -### Must Build for v1.1 (Table Stakes + High-Value Differentiators) - -| # | Feature | Priority | Complexity | Section | -|---|---------|----------|------------|---------| -| 1 | YOLO + VOC format import | Table Stakes | Medium | 1A | -| 2 | Train/val/test split handling | Table Stakes | Medium | 1B | -| 3 | Smart folder detection UI | Differentiator | Medium | 1C | -| 4 | Dataset export (COCO, YOLO) | Table Stakes | Medium | 1E | -| 5 | Bbox editing (move/resize/delete) | Table Stakes | High | 2A | -| 6 | Interactive confusion matrix + click-to-filter | Table Stakes | High | 3A | -| 7 | Near-duplicate detection | Table Stakes | Low | 3B | -| 8 | Image quality metrics (brightness, sharpness) | Table Stakes | Low | 3B | -| 9 | Error triage mode (keyboard review workflow) | Differentiator | Medium | 3C | -| 10 | Worst images composite ranking | Differentiator | Medium | 3D | -| 11 | Docker deployment | Table Stakes | Medium | 4A | -| 12 | Basic auth | Table Stakes | Low | 4B | -| 13 | Deployment scripts (local + GCP) | Table Stakes | Low-Medium | 4C | -| 14 | Keyboard shortcuts (Tier 1) | Table Stakes | Medium | 5A | -| 15 | "Find Similar" UI button | Table Stakes | Low | 7B | -| 16 | Interactive histograms (click-to-filter) | Differentiator | Medium | 6B | - -### Defer to v1.2+ - -| # | Feature | Why Defer | Section | -|---|---------|-----------|---------| -| 17 | Create new annotations | Quick corrections (edit/delete) are sufficient for v1.1 | 2B | -| 18 | CVAT/Label Studio integration | Export to COCO format achieves same goal | 2C | -| 19 | PR curves + per-class AP | Confusion matrix is the priority; curves follow naturally | 3A | -| 20 | Mistakenness / hardness scoring | Requires model logits import schema | 3B | -| 21 | Custom workspaces | Current layout works; panels are a large refactor | 6A | -| 22 | Customizable hotkeys | Fixed defaults are sufficient | 5B | -| 23 | Model zoo / in-app inference | Import predictions 
workflow is pragmatic | 7A | -| 24 | View expression Python API | UI filtering covers 90% of use cases | 8A | -| 25 | Demo / quickstart dataset | Low effort but not core to v1.1 delivery | 1D | - -### Explicitly Out of Scope - -| Feature | Reason | Section | -|---------|--------|---------| -| 3D point cloud visualization | Different rendering pipeline, per PROJECT.md | 7D | -| Video support | Image-only, per PROJECT.md | 7E | -| Map / geolocation panel | No current need for geo datasets | 6C | -| Multi-user auth / RBAC | Personal tool, per PROJECT.md | 4B | -| Plugin system overhaul | Existing BasePlugin is sufficient | 7C | - ---- - -## Feature Dependencies (v1.1 Build Order) +## Sources -``` -[Docker + Auth + Deploy Scripts] (parallel with everything) - | - v -[YOLO + VOC Parsers] ──> [Smart Folder Detection UI] ──> [Split Handling] - | - v -[Dataset Export] (requires format writers from parsers) - | - v -[Image Quality Metrics] ──> [Near-Duplicate Detection] ──> [Composite Score] - | | - v v -[Bbox Editing in Modal] ──> [Keyboard Shortcuts] ──> [Error Triage Mode] - | | - v v -[Interactive Confusion Matrix] ──────────────────────> [Click-to-Filter] - | - v -[Interactive Histograms] - | - v -["Find Similar" Button] (uses existing Qdrant infrastructure) -``` +### FiftyOne (HIGH confidence -- official documentation) +- [FiftyOne Classification Evaluation API](https://docs.voxel51.com/api/fiftyone.utils.eval.classification.html) +- [FiftyOne Evaluating Models](https://docs.voxel51.com/user_guide/evaluation.html) +- [FiftyOne Evaluate Classifications Tutorial](https://docs.voxel51.com/tutorials/evaluate_classifications.html) +- [FiftyOne Drawing Labels](https://docs.voxel51.com/user_guide/draw_labels.html) -**Critical path:** Docker/Auth and Format Parsers can start simultaneously. Most features build on existing infrastructure (DuckDB, Qdrant, Zustand stores). 
The confusion matrix and triage mode are the two highest-complexity features and should be prioritized early in development. +### Cleanlab (HIGH confidence -- official documentation) +- [Cleanlab Image Classification Tutorial](https://docs.cleanlab.ai/master/tutorials/image.html) +- [Cleanlab Datalab Image Issues](https://docs.cleanlab.ai/master/tutorials/datalab/image.html) +- [Cleanlab GitHub](https://github.com/cleanlab/cleanlab) ---- +### Roboflow (MEDIUM confidence -- product documentation) +- [Roboflow Classification Label Visualization](https://docs.roboflow.com/workflow-blocks/visualize-predictions/classification-label-visualization) -## Sources +### Classification Metrics (HIGH confidence -- authoritative references) +- [Google ML Classification Metrics](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall) +- [Evidently AI Multi-class Metrics](https://www.evidentlyai.com/classification-metrics/multi-class-metrics) +- [scikit-learn confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) -### FiftyOne (HIGH confidence -- official documentation) -- [FiftyOne Import Datasets (v1.12.0)](https://docs.voxel51.com/user_guide/import_datasets.html) -- [FiftyOne Export Datasets (v1.11.1)](https://docs.voxel51.com/user_guide/export_datasets.html) -- [FiftyOne Using Datasets (v1.12.0)](https://docs.voxel51.com/user_guide/using_datasets.html) -- [FiftyOne Dataset Views (v1.12.0)](https://docs.voxel51.com/user_guide/using_views.html) -- [FiftyOne App (v1.12.0)](https://docs.voxel51.com/user_guide/app.html) -- [FiftyOne Evaluation (v1.11.1)](https://docs.voxel51.com/user_guide/evaluation.html) -- [FiftyOne Brain](https://docs.voxel51.com/brain.html) -- [FiftyOne Annotation (v1.11.0)](https://docs.voxel51.com/user_guide/annotation.html) -- [FiftyOne Environments](https://docs.voxel51.com/installation/environments.html) -- [FiftyOne Model Zoo 
(v1.11.1)](https://docs.voxel51.com/model_zoo/index.html) -- [FiftyOne Dataset Zoo (v1.11.1)](https://docs.voxel51.com/dataset_zoo/datasets.html) -- [FiftyOne Plugins Development (v1.11.1)](https://docs.voxel51.com/plugins/developing_plugins.html) -- [FiftyOne Interactive Plots (v1.12.0)](https://docs.voxel51.com/user_guide/plots.html) -- [FiftyOne Enterprise Helm Chart](https://helm.fiftyone.ai/) -- [FiftyOne Teams Deployment (GitHub)](https://github.com/voxel51/fiftyone-teams-app-deploy) - -### FiftyOne (MEDIUM confidence -- blog posts, GitHub issues) -- [FiftyOne v0.24 Announcement (3D, Workspaces)](https://voxel51.com/blog/announcing-fiftyone-0-24-with-3d-meshes-and-custom-workspaces) -- [FiftyOne v0.25 Announcement (Python Panels, SAM 2)](https://voxel51.com/blog/announcing-fiftyone-0-25) -- [FiftyOne GitHub Issue #2120 (Selection shortcut FR)](https://github.com/voxel51/fiftyone/issues/2120) -- [FiftyOne GitHub Issue #1761 (Hide labels shortcut FR)](https://github.com/voxel51/fiftyone/issues/1761) -- [FiftyOne GitHub Issue #1780 (from_dir failure bug)](https://github.com/voxel51/fiftyone/issues/1780) -- [FiftyOne GitHub Issue #1781 (VOC same-directory bug)](https://github.com/voxel51/fiftyone/issues/1781) -- [FiftyOne Model Evaluation Blog](https://voxel51.com/blog/unified-model-insights-with-fiftyone-model-evaluation-workflows) - -### Encord (HIGH confidence -- official documentation) -- [Encord Getting Started](https://docs.encord.com/platform-documentation/GettingStarted/gettingstarted-welcome) -- [Encord Annotate Overview](https://docs.encord.com/platform-documentation/Annotate/annotate-overview) -- [Encord Label Editor](https://docs.encord.com/platform-documentation/Annotate/annotate-label-editor) -- [Encord Editor Shortcuts](https://docs.encord.com/platform-documentation/Annotate/annotate-label-editor/annotate-label-editor-settings-shortcuts) -- [Encord Active Overview](https://docs.encord.com/platform-documentation/Active/active-overview) -- [Encord 
Active Issue Shortcuts](https://docs.encord.com/platform-documentation/Active/active-basics/active-issue-shortcuts-prediction-types) -- [Encord Active Model Quality Metrics](https://docs.encord.com/platform-documentation/Active/active-quality-metrics/active-model-quality-metrics) -- [Encord 2025 Release Notes](https://docs.encord.com/release-notes/releasenotes-2025) - -### Encord (MEDIUM confidence -- marketing/blog) -- [Encord Product Updates Feb 2025](https://encord.com/blog/encord-product-updates-february-2025/) -- [Encord Data Quality Metrics Blog](https://encord.com/blog/data-quality-metrics/) -- [Encord Annotate Product Page](https://encord.com/annotate/) +### Label Studio (MEDIUM confidence -- official documentation) +- [Label Studio Image Classification Template](https://labelstud.io/templates/image_classification) --- -*Competitive feature analysis for: DataVisor v1.1 vs FiftyOne (Voxel51) + Encord* -*Researched: 2026-02-12* +*Classification feature landscape for: DataVisor classification support milestone* +*Researched: 2026-02-18* diff --git a/.planning/research/PITFALLS.md b/.planning/research/PITFALLS.md index 34ea13f..57a27d6 100644 --- a/.planning/research/PITFALLS.md +++ b/.planning/research/PITFALLS.md @@ -1,767 +1,392 @@ -# Domain Pitfalls: DataVisor v1.1 +# Domain Pitfalls -**Domain:** Adding Docker deployment, auth, annotation editing, smart ingestion, and error triage to an existing FastAPI + DuckDB + Next.js CV dataset introspection tool -**Researched:** 2026-02-12 -**Scope:** Pitfalls specific to v1.1 features on the existing v1.0 codebase (12,720 LOC, 59 tests) -**Overall confidence:** MEDIUM-HIGH +**Domain:** Adding single-label classification support to an existing detection-focused CV dataset tool +**Researched:** 2026-02-18 +**Confidence:** HIGH (all findings grounded in actual codebase analysis) --- ## Critical Pitfalls -Mistakes that cause rewrites, data loss, or deployment failures. 
+Mistakes that cause rewrites, data corruption, or broken existing workflows. -### Pitfall 1: DuckDB WAL and Lock Files Not Surviving Docker Container Restarts +### Pitfall 1: Schema Pollution -- Nullable BBox Columns Infect Every Query -**Severity:** CRITICAL -**Affects:** Docker containerization, data persistence +**What goes wrong:** The `annotations` table has `bbox_x`, `bbox_y`, `bbox_w`, `bbox_h` as `DOUBLE NOT NULL`. Classification annotations have no bounding boxes. The naive fix is making these columns nullable or stuffing sentinel values (0,0,0,0), but then every existing query that touches bbox columns -- `_load_detections()`, `_compute_iou_matrix()`, `AnnotationOverlay`, `EditableRect`, area calculations, `AnnotationUpdate` -- must guard against null/sentinel bboxes. Miss one query and you get silent wrong results or crashes. -**What goes wrong:** -DuckDB creates three filesystem artifacts alongside the database file: `datavisor.duckdb`, `datavisor.duckdb.wal` (write-ahead log), and a `datavisor.duckdb.tmp/` directory for intermediate processing. The WAL file is deleted on clean shutdown but persists if the container is killed (SIGKILL from `docker stop` after the 10s grace period, OOM kill, or crash). On next container start, DuckDB replays the WAL to recover uncommitted data. If the WAL file is missing (because the volume mount was only for the `.duckdb` file, not the directory), data loss occurs silently -- DuckDB opens without error but the last transactions are gone. +**Why it happens:** The annotations table was designed as a detection-first schema. Every column assumes spatial data exists. The `area` column is computed as `bbox_w * bbox_h`. The `AnnotationCreate` model requires all four bbox fields. The `AnnotationUpdate` model only has bbox fields -- it literally cannot update a classification annotation. 
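To make the structural mismatch concrete, here is a minimal sketch with `sqlite3` standing in for DuckDB and the schema heavily simplified; the `classifications` table is a hypothetical alternative shape, not existing code:

```python
import sqlite3

# sqlite3 stands in for DuckDB; the DDL shape is what matters, not the engine.
con = sqlite3.connect(":memory:")

# Detection-first schema (simplified): every row must carry spatial data.
con.execute("""
    CREATE TABLE annotations (
        id TEXT PRIMARY KEY,
        sample_id TEXT NOT NULL,
        category_name TEXT NOT NULL,
        bbox_x DOUBLE NOT NULL,
        bbox_y DOUBLE NOT NULL,
        bbox_w DOUBLE NOT NULL,
        bbox_h DOUBLE NOT NULL
    )
""")

# A classification label has no bbox -- the insert violates NOT NULL.
try:
    con.execute(
        "INSERT INTO annotations VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("a1", "s1", "cat", None, None, None, None),
    )
except sqlite3.IntegrityError as exc:
    print("detection schema rejects it:", exc)

# Hypothetical separate table: one label per image, no spatial columns at all.
con.execute("""
    CREATE TABLE classifications (
        id TEXT PRIMARY KEY,
        sample_id TEXT NOT NULL,
        category_name TEXT NOT NULL,
        source TEXT NOT NULL,   -- 'ground_truth' or a prediction source
        confidence DOUBLE       -- NULL for ground truth
    )
""")
con.execute(
    "INSERT INTO classifications VALUES (?, ?, ?, ?, ?)",
    ("c1", "s1", "cat", "ground_truth", None),
)
print("classification rows:", con.execute("SELECT COUNT(*) FROM classifications").fetchone()[0])
```

The point of the sketch: nothing about a classification label maps onto the bbox columns, so the failure mode is structural, not a missing default.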
-The existing `DuckDBRepo.__init__` in `app/repositories/duckdb_repo.py` creates the parent directory via `db_path.parent.mkdir(parents=True, exist_ok=True)` and connects to a file at `data/datavisor.duckdb` (from `config.py`). In Docker, this `data/` directory must be a volume mount, not just the `.duckdb` file.

+**Concrete code locations affected:**
+- `duckdb_repo.py:57-72` -- annotations table DDL with `NOT NULL` bbox columns
+- `app/models/annotation.py:6-57` -- all three Pydantic models hardcode bbox fields
+- `app/routers/annotations.py:42` -- `area = body.bbox_w * body.bbox_h`
+- `app/services/evaluation.py:225` -- `_BoxRow` type alias includes bbox coordinates
+- `frontend/src/types/annotation.ts:6-19` -- `Annotation` interface requires bbox fields
+- `frontend/src/components/grid/annotation-overlay.tsx:63-72` -- renders `<rect>` from bbox

-**Why it happens:**
-Developers volume-mount only the database file (`-v ./data/datavisor.duckdb:/app/data/datavisor.duckdb`) instead of the entire directory. The WAL and tmp files are created as siblings on the container filesystem (ephemeral layer) and vanish when the container restarts. DuckDB's official documentation states: "If DuckDB exits normally, the WAL file is deleted upon exit. If DuckDB crashes, the WAL file is required to recover data."

+**Consequences:**
+- Classification annotations with NULL bboxes break `NOT NULL` constraints on insert
+- Sentinel values (0,0,0,0) produce 0-area rectangles in SVG overlays, 0-area in stats
+- Every SQL query selecting `bbox_*` columns returns meaningless data for classification
+- IoU computation on zero-sized boxes produces NaN or 0, silently breaking evaluation

-Additionally, Docker's default stop signal is SIGTERM with a 10-second timeout before SIGKILL.
If FastAPI's shutdown handler (the `lifespan` context manager's cleanup in `app/main.py`) takes longer than 10 seconds -- possible during a large ingestion with thumbnail generation -- the container is killed before `db.close()` runs, leaving the WAL behind. +**Prevention:** Add a `task_type` discriminator column to the `datasets` table (not annotations). Classification datasets never create bbox data. Use a separate code path for classification annotations that maps to a simpler schema view. Specifically: +1. Add `task_type VARCHAR DEFAULT 'detection'` to `datasets` table +2. For classification, annotations table still has bbox columns but they store 0.0 (not NULL) to preserve NOT NULL constraint, and a `task_type`-aware query layer skips them +3. Better: create a `classifications` table with just `(id, dataset_id, sample_id, category_name, source, confidence, metadata)` -- one row per image, no bbox columns at all. This is cleaner but requires more code changes. -**Prevention:** -1. Volume-mount the entire `data/` directory, never individual files: `volumes: ["./data:/app/data"]` -2. Add a `STOPSIGNAL SIGTERM` to the Dockerfile and set `stop_grace_period: 30s` in docker-compose.yml to give the lifespan handler time to close DuckDB cleanly -3. Add an explicit `CHECKPOINT` call in the lifespan shutdown before `db.close()` to flush the WAL to the database file: `self.connection.execute("CHECKPOINT")` -4. Ensure the container user has write permission to the entire mounted directory, not just the `.duckdb` file -5. 
Set `checkpoint_threshold` via `PRAGMA checkpoint_threshold='8MB'` to checkpoint more frequently (default is 16MB), reducing WAL size and recovery window - -**Warning signs:** -- Data disappears after `docker-compose restart` but not after `docker-compose down && docker-compose up` -- A `.wal` file appears in the data directory after `docker stop` but is missing after `docker start` -- `docker logs` shows DuckDB opening successfully but with fewer rows than expected +**Recommendation:** Separate `classifications` table. The bbox columns are not "optional detection data" -- they are structurally meaningless for classification. Trying to reuse the annotations table forces every consumer to handle two shapes of data from one table. A separate table with shared query interfaces (via a service abstraction) is cleaner. -**Phase to address:** Docker containerization (Phase 1 of v1.1) +**Detection:** If you go the shared-table route, grep for `bbox_` across the codebase -- every hit is a location that needs a conditional. Currently 30+ references. -**Confidence:** HIGH -- verified against DuckDB official documentation on [files created by DuckDB](https://duckdb.org/docs/stable/operations_manual/footprint_of_duckdb/files_created_by_duckdb) and [WAL recovery behavior](https://duckdb.org/docs/stable/connect/concurrency). WAL lock file issue confirmed in [DuckDB Issue #10002](https://github.com/duckdb/duckdb/issues/10002). +**Phase to address:** Phase 1 (schema design). Get this wrong and everything downstream is a rewrite. --- -### Pitfall 2: Qdrant Local Mode Cannot Run in Docker -- Must Migrate to Server Mode +### Pitfall 2: Metric Confusion -- mAP/IoU Leaking into Classification Evaluation -**Severity:** CRITICAL -**Affects:** Docker containerization, Qdrant integration +**What goes wrong:** The entire evaluation pipeline is built on IoU matching. 
`compute_evaluation()` uses `supervision.MeanAveragePrecision` and `supervision.ConfusionMatrix.from_detections()` which expect `sv.Detections` objects with `xyxy` bounding boxes. Classification evaluation needs accuracy, precision, recall, F1, and per-class metrics computed by exact label matching (no spatial component). If you try to reuse the detection evaluation with dummy bboxes, you get meaningless mAP scores. -**What goes wrong:** -The current codebase uses Qdrant in **local embedded mode**: `QdrantClient(path=str(path))` in `app/services/similarity_service.py`. This runs Qdrant as an in-process Python library with on-disk persistence at `data/qdrant/`. In Docker, you need Qdrant as a separate container service (server mode) because: (a) the embedded Qdrant client does not support concurrent access, which matters when multiple uvicorn workers run, (b) it adds ~500MB to the FastAPI container image, and (c) Qdrant's Docker image (`qdrant/qdrant`) is the canonical deployment path and provides proper health checks, metrics, and persistence. +**Why it happens:** The evaluation service (`app/services/evaluation.py`) is 560 lines of detection-specific logic: IoU matrix computation, greedy matching, COCO-style interpolated AP. The API response model (`EvaluationResponse`) returns `map50`, `map75`, `map50_95`, `iou_threshold` -- all detection-specific fields. The frontend `evaluation-panel.tsx` renders PR curves and the confusion matrix with IoU/confidence sliders. -Switching from `QdrantClient(path=...)` to `QdrantClient(host="qdrant", port=6333)` is a one-line code change, but the data migration is not. The local-mode on-disk format is not compatible with the server-mode storage. All existing embeddings synced to Qdrant must be re-synced from DuckDB after the migration. 
+**Concrete code locations affected:** +- `app/services/evaluation.py` -- entire file assumes detection +- `app/services/error_analysis.py` -- `categorize_errors()` uses IoU matching +- `app/services/annotation_matching.py` -- `match_sample_annotations()` is IoU-based +- `app/models/evaluation.py` -- `APMetrics` has mAP fields, `EvaluationResponse` has `iou_threshold` +- `frontend/src/types/evaluation.ts` -- TypeScript mirrors backend detection-specific types +- `frontend/src/components/stats/evaluation-panel.tsx` -- IoU slider, PR curves +- `frontend/src/components/stats/metrics-cards.tsx` -- likely shows mAP -**Why it happens:** -Local mode is the recommended development path ("useful for development, prototyping and testing") and the existing code was designed for single-process local execution. Developers assume the migration is just changing the constructor, but forget about: (a) data format incompatibility, (b) network connectivity in docker-compose, (c) the need for an API key for security, and (d) health check dependencies (FastAPI should wait for Qdrant to be healthy before starting). +**Consequences:** +- Showing mAP for a classification dataset is nonsensical and misleading +- IoU slider has no meaning -- users will be confused +- PR curves per class are meaningful for classification but computed differently (no spatial matching) +- Error analysis categories (Hard FP, Label Error based on IoU) do not apply -**Prevention:** -1. In `docker-compose.yml`, add Qdrant as a service with a volume for persistence: - ```yaml - qdrant: - image: qdrant/qdrant:latest - volumes: ["./data/qdrant_server:/qdrant/storage"] - ports: ["6333:6333"] - ``` -2. Update `SimilarityService.__init__` to accept either `path` (local) or `url` (server) based on environment: - ```python - if qdrant_url: - self.client = QdrantClient(url=qdrant_url) - else: - self.client = QdrantClient(path=str(path)) - ``` -3. 
Add `DATAVISOR_QDRANT_URL` environment variable to `config.py` Settings class (default None for local dev) -4. Add a `depends_on` with health check in docker-compose so FastAPI waits for Qdrant: - ```yaml - depends_on: - qdrant: - condition: service_healthy - ``` -5. On first Docker startup, the `ensure_collection` + `_sync_from_duckdb` flow in `SimilarityService` already handles syncing -- but verify it works when the collection is empty in a fresh Qdrant server - -**Warning signs:** -- `ConnectionRefusedError` on FastAPI startup because Qdrant container is not yet ready -- Similarity search returns empty results in Docker but works locally -- FastAPI container image is 8GB+ because it bundles the Qdrant Rust binaries via qdrant-client's local mode - -**Phase to address:** Docker containerization (Phase 1 of v1.1) - -**Confidence:** HIGH -- verified against [Qdrant quickstart docs](https://qdrant.tech/documentation/quickstart/) and [qdrant-client README](https://github.com/qdrant/qdrant-client) which explicitly states "If you require concurrent access to local mode, you should use Qdrant server instead." +**Prevention:** Build a separate `compute_classification_evaluation()` function and a `ClassificationEvaluationResponse` model. Route based on `dataset.task_type`. Classification evaluation is actually simpler: compare `gt_category` to `pred_category` per sample. Metrics: accuracy, macro/micro precision/recall/F1, per-class precision/recall/F1, confusion matrix (still works, but simpler -- no "background" row/column from unmatched detections). ---- +**Detection:** If the evaluation endpoint returns `iou_threshold` for a classification dataset, something went wrong. -### Pitfall 3: NEXT_PUBLIC_API_URL Baked at Build Time, Not Configurable at Runtime +**Phase to address:** Phase 2 (evaluation logic). Must come after schema but before UI work. 
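The label-matching evaluation described above fits in a short function. A hedged sketch: the `compute_classification_evaluation()` name follows the prevention note, but the signature, data shapes, and stdlib-only implementation are assumptions, not the project's actual API:

```python
from collections import Counter

def compute_classification_evaluation(gt: dict, pred: dict) -> dict:
    """Exact label matching: gt and pred map sample_id -> class name.

    Sketch only -- the real service would read labels from the database
    and also surface a 'missing prediction' error bucket.
    """
    labels = sorted(set(gt.values()) | set(pred.values()))
    # Confusion matrix: rows = ground truth, cols = prediction. No spatial
    # matching, so no "background" row/column.
    confusion = {g: Counter() for g in labels}
    correct = 0
    for sample_id, g in gt.items():
        p = pred.get(sample_id)
        if p is None:
            continue  # "missing prediction" error category
        confusion[g][p] += 1
        correct += (g == p)

    per_class = {}
    for c in labels:
        tp = confusion[c][c]
        fp = sum(confusion[g][c] for g in labels if g != c)
        fn = sum(confusion[c][p] for p in labels if p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class[c] = {"precision": prec, "recall": rec, "f1": f1}

    n = sum(sum(row.values()) for row in confusion.values())
    return {
        "accuracy": correct / n if n else 0.0,
        "macro_f1": sum(m["f1"] for m in per_class.values()) / len(labels) if labels else 0.0,
        "per_class": per_class,
        "confusion": {g: dict(row) for g, row in confusion.items()},
    }

gt = {"s1": "cat", "s2": "dog", "s3": "cat", "s4": "dog"}
pred = {"s1": "cat", "s2": "cat", "s3": "cat", "s4": "dog"}
result = compute_classification_evaluation(gt, pred)
print(result["accuracy"], result["macro_f1"])
```

Note there is no IoU threshold anywhere in the signature -- which is exactly why the detection `EvaluationResponse` model cannot be reused as-is.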
-**Severity:** CRITICAL -**Affects:** Docker containerization, deployment flexibility +--- -**What goes wrong:** -The frontend's API base URL is set in `frontend/src/lib/constants.ts`: +### Pitfall 3: UI Conditional Spaghetti -- `if detection else classification` Everywhere + +**What goes wrong:** Instead of polymorphic components, developers scatter `if (taskType === 'detection')` checks throughout the frontend. Components like `AnnotationOverlay`, `SampleModal`, `EvaluationPanel`, `ErrorAnalysisPanel`, `TriageOverlay`, `FilterSidebar`, `StatsDashboard` all need different rendering for classification vs detection. With 10+ components each having 2-3 conditionals, you get 30+ branching points that are easy to miss and hard to test. + +**Why it happens:** The fastest way to add classification support is to add conditionals to existing components. Each one is small and "just one more if-statement." But they compound: +- `AnnotationOverlay`: render bbox rect vs class label badge +- `SampleModal`: bbox editor vs class label display +- `EvaluationPanel`: IoU slider vs no IoU slider +- `MetricsCards`: mAP vs accuracy +- `ErrorAnalysisPanel`: spatial error types vs correct/incorrect +- `PerClassTable`: AP columns vs precision/recall/F1 columns +- `ConfusionMatrix`: background row vs no background row +- `AnnotationList`: bbox coordinates vs class label +- `DrawLayer`: bbox drawing vs class assignment +- `TriageOverlay`: per-bbox triage vs per-image triage + +**Consequences:** +- Adding a third task type (segmentation, keypoint) requires touching every component again +- Testing combinatorial explosion: each component x each task type +- Easy to miss one conditional, producing a detection UI for classification data +- Code reviews become "did you check all 30 places?" + +**Prevention:** Use a strategy/adapter pattern at the component boundary. 
Create a `TaskAdapter` that provides task-specific sub-components: ```typescript -export const API_BASE = process.env.NEXT_PUBLIC_API_URL ?? "http://localhost:8000"; +// Instead of 30 if-statements: +const adapter = useTaskAdapter(dataset.task_type); +// adapter.AnnotationOverlay -- renders bboxes or class badges +// adapter.EvaluationPanel -- detection or classification metrics +// adapter.getMetricLabel() -- "mAP@50" or "Accuracy" ``` +Alternatively, create parallel component trees: `detection/EvaluationPanel` and `classification/EvaluationPanel` with shared layout components. The dataset page picks the right tree once. -`NEXT_PUBLIC_` environment variables are **inlined into the JavaScript bundle at `next build` time**. They are string-replaced in the compiled JS -- there is no runtime resolution. If you build the Docker image with `NEXT_PUBLIC_API_URL=http://localhost:8000` (or leave it unset), the compiled JS will contain the literal string `"http://localhost:8000"`. When you deploy to a GCP VM at `http://35.202.x.x:8000`, the frontend still calls `localhost:8000`, which fails because the browser is on the user's machine, not the VM. - -**Why it happens:** -Next.js explicitly documents this: "Public environment variables will be inlined into the JavaScript bundle during `next build`." Developers either: (a) hardcode the URL and rebuild per environment, (b) set it at build time and forget it cannot change, or (c) try to set it in `docker run -e` and discover it has no effect. - -**Prevention:** -1. **Option A (simplest for this project):** Use a reverse proxy (nginx/caddy) that serves both frontend and API from the same origin, eliminating the need for a separate API URL. Frontend calls `/api/...` which the proxy routes to the FastAPI backend. No CORS issues, no URL configuration. -2. **Option B:** Use Next.js `publicRuntimeConfig` with `getServerSideProps` to inject the API URL at request time. But this forces SSR for every page. -3. 
**Option C:** Use the `next-runtime-env` library to read environment variables at runtime via a thin server-side injection. -4. **Option D:** Pass the API URL via a `