feat: persist duplicate detection results in PostgreSQL #85

@JohnRDOrazio

Description

Problem

Duplicate detection results are currently cached only in Redis with a 10-minute TTL. When the page is reloaded, results are gone and the user must click "Find Duplicates" again (~56 seconds for a 15K-class ontology).

Proposal

  1. New duplicate_detection_results table — stores the latest detection results per project/branch, including clusters, threshold, and timestamp
  2. Auto-update on index rebuild — when run_ontology_index_task completes (or the index is updated via entity edits), automatically re-run duplicate detection and persist the results
  3. Frontend loads persisted results on mount — the Duplicates tab should check for stored results first, showing them immediately without requiring a manual "Find Duplicates" click
  4. "Find Duplicates" button re-runs detection — still available for on-demand refresh, updating the persisted results
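A minimal sketch of the persistence semantics proposed above, assuming one latest result per `(project_id, branch)` with keyed-upsert behavior. The names `DetectionResult` and `DetectionResultStore` are hypothetical stand-ins, not the actual schema; the real table would be a SQLAlchemy model behind an Alembic migration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical shape of one persisted row in duplicate_detection_results.
@dataclass
class DetectionResult:
    project_id: int
    branch: str
    clusters: list          # e.g. [["ClassA", "ClassA_copy"], ...]
    threshold: float
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

class DetectionResultStore:
    """In-memory stand-in for the proposed table: stores the latest
    result per (project_id, branch), overwritten on each detection run."""

    def __init__(self):
        self._rows = {}

    def upsert(self, result: DetectionResult) -> None:
        # "Latest results per project/branch" -> keyed upsert,
        # so a re-run replaces the stored clusters.
        self._rows[(result.project_id, result.branch)] = result

    def latest(self, project_id: int, branch: str):
        # What the Duplicates tab would read on mount (proposal item 3).
        return self._rows.get((project_id, branch))
```

Keying the upsert on project/branch keeps the table bounded (one row per branch) while the timestamp lets the frontend show how stale the displayed clusters are.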

Context

PR #80 moved duplicate detection to the ARQ worker queue and rewrote it to use PostgreSQL's pg_trgm GIN index instead of in-memory rdflib parsing. The detection itself now completes in ~56 seconds for large ontologies. Persisting results would make the UX seamless.
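For context on what the pg_trgm-backed detection ranks by, here is a simplified pure-Python illustration of trigram similarity: pg_trgm lowercases, pads each string, extracts 3-character substrings, and scores two strings by shared trigrams over distinct trigrams. This is only an illustration of the measure; the actual detection runs against the GIN index inside PostgreSQL.

```python
def trigrams(s: str) -> set:
    # pg_trgm lowercases and pads: two spaces before, one space after.
    padded = "  " + s.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    # pg_trgm's similarity(): |shared trigrams| / |distinct trigrams|.
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0
```

The GIN index makes lookups like `label % 'Protein'` index-assisted rather than a full scan, which is what brought detection down to ~56 seconds for large ontologies.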

Tasks

  • Create duplicate_detection_results table and Alembic migration
  • Store results after run_duplicate_detection_task completes
  • Load persisted results in the Duplicates tab on mount (frontend)
  • Trigger duplicate detection automatically after run_ontology_index_task
  • Update "Find Duplicates" to refresh persisted results
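The fourth task above can be sketched as task chaining: when the index rebuild finishes, it kicks off detection, which persists its clusters. The bodies below are hypothetical stand-ins for the real ARQ tasks named in this issue (only the task names come from the issue; the chaining mechanism and `ctx["store"]` persistence are assumptions).

```python
import asyncio

async def run_duplicate_detection_task(ctx, project_id, branch):
    # Placeholder for the real pg_trgm-based detection.
    clusters = [["ExampleClass", "ExampleClass_copy"]]
    # Persist the latest results per project/branch (proposal item 1).
    ctx["store"][(project_id, branch)] = clusters
    return clusters

async def run_ontology_index_task(ctx, project_id, branch):
    # ... rebuild the pg_trgm-indexed ontology rows here ...
    # Proposal item 2: once the index is fresh, immediately
    # refresh the persisted duplicate results.
    await run_duplicate_detection_task(ctx, project_id, branch)

ctx = {"store": {}}
asyncio.run(run_ontology_index_task(ctx, 1, "main"))
```

With this shape, the "Find Duplicates" button just enqueues `run_duplicate_detection_task` directly, so on-demand refresh and post-rebuild refresh share one code path.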

Metadata

Labels

enhancement (New feature or request)
