Db sharding#3

Merged
philippesaade-wmde merged 21 commits into main from db_sharding
Apr 15, 2026

@philippesaade-wmde

The vector database architecture will be migrated to a sharded design based on language and entity type. Instead of a single database containing all vectors, the system will use a separate vector database per language per entity type.

Changes include:

  • Separate each search implementation into its own file under /services/search.
  • VectorSearch no longer queries a single database; it accesses the Astra collection of each database directly, with item and property collections split by language.
  • HybridSearch now creates multiple VectorSearch instances (one per language), queries them in parallel, and merges results per entity.
  • RRF is used to merge vector search and keyword search results in the search routes, while max similarity is used in the /similarity-score/ route.
  • Query embeddings are reused across searches (vector search + keyword rescoring) instead of being recomputed independently.
  • Similarity is now computed locally for better efficiency. This affects /similarity-score/ and keyword result similarity scoring.
  • /similarity-score/ now supports both QIDs and PIDs, and validates that each result includes at least one of QID or PID.
  • /similarity-score/ now enforces max 100 IDs.
  • Item/property routes now cap K with MAX_VECTORDB_K (default 50) and enforce bounds during query validation.
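
To illustrate the two merge strategies named in the list above, here is a minimal sketch, not the code from this PR. The function names `rrf_merge` and `max_similarity_merge` and the constant `k=60` (the value conventionally used for Reciprocal Rank Fusion) are assumptions made for the example.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked ID lists with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every ID it contains,
    with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, entity_id in enumerate(ranking, start=1):
            scores[entity_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda e: (-scores[e], e))


def max_similarity_merge(shard_results):
    """Keep the highest similarity seen for each entity across shards."""
    best = {}
    for results in shard_results:
        for entity_id, similarity in results:
            if similarity > best.get(entity_id, float("-inf")):
                best[entity_id] = similarity
    return best


vector_hits = ["Q1", "Q2", "Q3"]   # hypothetical vector search ranking
keyword_hits = ["Q2", "Q4"]        # hypothetical keyword search ranking
fused = rrf_merge([vector_hits, keyword_hits])

per_language = [[("Q1", 0.8), ("Q2", 0.5)], [("Q1", 0.9)]]
best_scores = max_similarity_merge(per_language)
```

RRF only looks at ranks, so it can combine vector and keyword results whose raw scores are not comparable, whereas taking the maximum suits a route that reports one similarity value per ID.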

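The parallel per-language querying and local similarity computation described in the list above could look roughly like the following sketch. It uses in-memory dictionaries as stand-ins for the per-language collections; `search_shard`, `search_all_shards`, and the shard layout are hypothetical and do not reflect the actual VectorSearch or Astra API.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, computed locally."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_shard(shard, query_vec, k):
    """Stand-in for one per-language search: rank an in-memory shard."""
    scored = [(eid, cosine_similarity(query_vec, vec)) for eid, vec in shard.items()]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]


def search_all_shards(shards, query_vec, k):
    """Query every language shard in parallel with the same query embedding."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        futures = {
            lang: pool.submit(search_shard, shard, query_vec, k)
            for lang, shard in shards.items()
        }
        return {lang: future.result() for lang, future in futures.items()}


# Hypothetical shards: language -> {entity ID -> embedding}.
shards = {
    "en": {"Q1": [1.0, 0.0], "Q2": [0.0, 1.0]},
    "fr": {"Q1": [1.0, 1.0]},
}
query_vec = [1.0, 0.0]  # computed once, reused for every shard
results = search_all_shards(shards, query_vec, k=2)
```

The query embedding is passed in once and shared by all shards, which mirrors the point about not recomputing it per search.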
Comment thread frontend/dist/index.html

@exowanderer exowanderer left a comment

Hello @philippesaade-wmde

I did my best to review the files. I still have about half to complete. Please feel free to add responses / comments on the 43 comments I left so far.

Comment thread doc/adr/001_ADR_ Language_and_Entity_Type_Sharding.md
Comment thread frontend/dist/assets/index-OLX4RYVc.js
Comment thread frontend/src/views/ChatView.vue
Comment thread tests/unit/conftest.py
Comment thread tests/integration/test_live_routes.py
Comment thread wikidatasearch/services/search/HybridSearch.py
Comment thread wikidatasearch/services/search/KeywordSearch.py
Comment thread wikidatasearch/services/search/Search.py
Comment thread wikidatasearch/services/search/VectorSearch.py
Comment thread wikidatasearch/main.py Outdated

@exowanderer exowanderer left a comment

I was able to add a few more comments, mostly about whether the new functionality could break downstream (user) behaviour.

Comment thread wikidatasearch/services/translator.py Outdated
Comment thread wikidatasearch/services/search/HybridSearch.py

@exowanderer exowanderer left a comment

I requested several descriptions per file in the "search" subdir

Comment thread wikidatasearch/services/search/HybridSearch.py
Comment thread wikidatasearch/services/search/KeywordSearch.py
Comment thread wikidatasearch/services/search/Search.py
Comment thread wikidatasearch/services/search/VectorSearch.py

@exowanderer exowanderer left a comment

Added a few more requests for information.

Comment thread wikidatasearch/services/__init__.py
Comment thread wikidatasearch/services/datastax.py Outdated

Will this break any downstream functionality for community contributors or WDEmbedding reusers?

Comment thread wikidatasearch/services/jina.py
Comment thread wikidatasearch/services/search.py Outdated

Is this the original file that is now called HybridSearch.py?

Comment thread wikidatasearch/__init__.py
Comment thread docker-compose.yml
Comment thread pyproject.toml
Member

@itamargiv itamargiv left a comment

Alright, I left a few comments about accessibility and behavior. The client looks fine when I run it locally, but I'd still recommend clicking around some more with the back-end on to see if there are any odd behaviors I missed (I only ran the Vue dev server).

All in all it looks great, some nice improvements and clear user instructions.

Comment thread frontend/src/views/ChatView.vue Outdated
Comment thread frontend/src/views/ChatView.vue
Comment thread frontend/src/views/ChatView.vue Outdated
Comment thread frontend/src/views/ChatView.vue
Comment thread frontend/src/views/ChatView.vue
Comment thread frontend/src/views/ChatView.vue Outdated
Comment thread wikidatasearch/routes/similarity.py Outdated
Comment thread wikidatasearch/routes/similarity.py Outdated
Comment thread wikidatasearch/main.py Outdated
Member

@itamargiv itamargiv left a comment

Looks good on my part, thank you!

@philippesaade-wmde philippesaade-wmde merged commit 509229a into main Apr 15, 2026
1 check passed
@philippesaade-wmde philippesaade-wmde deleted the db_sharding branch April 15, 2026 15:46

3 participants