Merged
21 commits
51e925b
Major Change: Language and Entity Type Sharding
philippesaade-wmde Mar 13, 2026
53afc75
Change version to 3.0.0
philippesaade-wmde Mar 13, 2026
c64e7ac
Clamping and removing negative similarities
philippesaade-wmde Mar 13, 2026
cc84b19
Refactor ADR for Language and Entity-Type Sharding
philippesaade-wmde Mar 16, 2026
b1de483
Including docstrings for the unit tests and benchmarks
philippesaade-wmde Mar 30, 2026
70d1761
Apply suggestion from @itamargiv
philippesaade-wmde Mar 30, 2026
39a0fd0
Apply suggestion from @itamargiv
philippesaade-wmde Mar 30, 2026
18b96c3
Apply suggestion from @itamargiv
philippesaade-wmde Mar 30, 2026
b3398f9
Adding docstrings to API output representations
philippesaade-wmde Mar 31, 2026
99f7892
Apply suggestion from @exowanderer
philippesaade-wmde Mar 31, 2026
8137db3
Apply suggestion from @exowanderer
philippesaade-wmde Mar 31, 2026
8360c92
Update User-Agent header for translator service
philippesaade-wmde Mar 31, 2026
de116b8
Apply suggestion from @philippesaade-wmde
philippesaade-wmde Mar 31, 2026
1962ee8
Adding role=tooltip
philippesaade-wmde Mar 31, 2026
8c4c921
Retruning vectors is re-enabled, unit tests are fixed to validate tha…
philippesaade-wmde Mar 31, 2026
7bb74c4
Adding Ruff Lint and Github action for ruff checks
philippesaade-wmde Apr 1, 2026
0a562a3
Include missing docstrings and improve consistency on all docstrings
philippesaade-wmde Apr 1, 2026
1198973
Adding pytest in dependency-groups
philippesaade-wmde Apr 3, 2026
d2352fd
Adding pytest testpath
philippesaade-wmde Apr 3, 2026
44613cc
Merge pull request #6 from wmde/add_linter
philippesaade-wmde Apr 10, 2026
7399010
Fix Ruff Lint errors
philippesaade-wmde Apr 10, 2026
26 changes: 26 additions & 0 deletions .github/workflows/lint.yml
@@ -0,0 +1,26 @@
name: "Ruff Lint"

on:
  pull_request:
    branches: ["main"]

permissions:
  contents: read

jobs:
  ruff:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install uv
        uses: astral-sh/setup-uv@v5

      - name: Set up Python
        run: uv python install

      - name: Run Ruff linter
        run: uv run ruff check .

      - name: Run Ruff formatter check
        run: uv run ruff format --check .
23 changes: 0 additions & 23 deletions .github/workflows/pylint.yml

This file was deleted.

39 changes: 0 additions & 39 deletions .github/workflows/python-app.yml

This file was deleted.

3 changes: 3 additions & 0 deletions .gitignore
@@ -1,6 +1,9 @@
# Data
data/*

# Ruff Lint
.ruff_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
142 changes: 134 additions & 8 deletions README.md
@@ -1,14 +1,140 @@
# WikidataSearch

## Introduction
-WikidataSearch is a web application and API designed to facilitate the connection between users and the Wikidata Vector Database developed as pasrt of the [Wikidata Embedding Project](https://www.wikidata.org/wiki/Wikidata:Embedding_Project).
WikidataSearch is the API and web app for semantic retrieval over the Wikidata Vector Database from the [Wikidata Embedding Project](https://www.wikidata.org/wiki/Wikidata:Embedding_Project).

-**Webapp:** [wd-vectordb.wmcloud.org](https://wd-vectordb.wmcloud.org/) \
-**Docs:** [wd-vectordb.wmcloud.org/docs](https://wd-vectordb.wmcloud.org/docs) \
-**Project Page:** [wikidata.org/wiki/Wikidata:Embedding_Project](https://www.wikidata.org/wiki/Wikidata:Embedding_Project)
This repository powers the public service. The intended usage is the hosted API, not running your own deployment.

**Hosted Web App:** [https://wd-vectordb.wmcloud.org/](https://wd-vectordb.wmcloud.org/)
**Hosted API Docs (OpenAPI):** [https://wd-vectordb.wmcloud.org/docs](https://wd-vectordb.wmcloud.org/docs)
**Project Page:** [https://www.wikidata.org/wiki/Wikidata:Vector_Database](https://www.wikidata.org/wiki/Wikidata:Vector_Database)

## Hosted API Usage

Base URL:

```text
https://wd-vectordb.wmcloud.org
```

Use a descriptive `User-Agent` for query endpoints. Generic user agents are rejected.

Example header:

```text
User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)
```

Current operational constraints:

- Rate limit is applied per `User-Agent` (default: `30/minute`).
- `return_vectors=true` is currently disabled and returns `422`.
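
A minimal Python sketch of the two constraints above, using only the standard library. The `build_request` and `Throttle` names are hypothetical helpers, not part of this repository, and the two-second spacing is simply 60 seconds divided by the documented default of 30 requests per minute.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://wd-vectordb.wmcloud.org"
USER_AGENT = "WikidataSearch-Client/1.0 (your-email@example.org)"
MIN_INTERVAL = 60 / 30  # seconds between calls to stay under 30/minute


def build_request(path, params):
    """Build a GET request that carries a descriptive User-Agent header."""
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


class Throttle:
    """Hypothetical client-side helper that spaces calls MIN_INTERVAL apart."""

    def __init__(self, interval=MIN_INTERVAL):
        self.interval = interval
        self._last = None  # timestamp of the previous call, if any

    def wait_time(self, now):
        """Return how many seconds to sleep before the next call at time `now`."""
        if self._last is None:
            return 0.0
        return max(0.0, self.interval - (now - self._last))

    def mark(self, now):
        """Record that a call was made at time `now`."""
        self._last = now


request = build_request("/item/query/", {"query": "Douglas Adams", "lang": "en"})
```

Passing `request` to `urllib.request.urlopen` would perform the call; a caller would typically sleep for `Throttle.wait_time(time.monotonic())` seconds first.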

## API Endpoints

### `GET /item/query/`

Semantic + keyword search for Wikidata items (QIDs), fused with Reciprocal Rank Fusion (RRF).

Parameters:

- `query` (required): natural-language query or ID.
- `lang` (default: `all`): vector shard language; unknown languages are translated then searched globally.
- `K` (default/max: `50`): number of top results requested.
- `instanceof` (optional): comma-separated QIDs used as `P31` filter.
- `rerank` (default: `false`): apply reranker on textified Wikidata content.
- `return_vectors` (currently disabled).

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/item/query/' \
--data-urlencode 'query=Douglas Adams' \
--data-urlencode 'lang=en' \
--data-urlencode 'K=10' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```
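
The same parameters can be assembled from Python. The sketch below is illustrative: `item_query_params` is a hypothetical helper, the lower bound of 1 for `K` is an assumption (only the maximum of 50 is documented), and booleans are sent as lowercase `true`/`false` to match the curl examples in this README.

```python
import urllib.parse


def item_query_params(query, lang="all", k=50, instanceof=(), rerank=False):
    """Return the query-string parameters for GET /item/query/ as a dict."""
    if not query:
        raise ValueError("query is required")
    if not 1 <= k <= 50:  # 50 is the documented maximum; 1 is an assumed minimum
        raise ValueError("K must be between 1 and 50")
    params = {
        "query": query,
        "lang": lang,
        "K": str(k),
        "rerank": str(rerank).lower(),
    }
    if instanceof:
        # Comma-separated QIDs, used by the API as a P31 filter.
        params["instanceof"] = ",".join(instanceof)
    return params


params = item_query_params("Douglas Adams", lang="en", k=10, instanceof=["Q5"])
query_string = urllib.parse.urlencode(params)
```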

### `GET /property/query/`

Semantic + keyword search for Wikidata properties (PIDs), fused with RRF.

Parameters:

- `query` (required)
- `lang` (default: `all`)
- `K` (default/max: `50`)
- `instanceof` (optional): comma-separated QIDs used as `P31` filter.
- `exclude_external_ids` (default: `false`): excludes properties with datatype `external-id`.
- `rerank` (default: `false`)
- `return_vectors` (currently disabled)

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/property/query/' \
--data-urlencode 'query=instance of' \
--data-urlencode 'lang=en' \
--data-urlencode 'exclude_external_ids=true' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```

### `GET /similarity-score/`

Similarity scoring for a fixed list of Wikidata IDs (QIDs and/or PIDs) against one query.

Parameters:

- `query` (required)
- `qid` (required): comma-separated IDs, for example `Q42,Q5,P31`.
- `lang` (default: `all`)
- `return_vectors` (currently disabled)

Example:

```bash
curl -sG 'https://wd-vectordb.wmcloud.org/similarity-score/' \
--data-urlencode 'query=science fiction writer' \
--data-urlencode 'qid=Q42,Q25169,P31' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
```
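
Because `qid` accepts a mix of QIDs and PIDs, a client may want to validate the list before sending it. The following is a sketch only: `build_id_list` is a hypothetical helper, and the pattern (a `Q` or `P` followed by a positive integer) reflects the usual Wikidata ID format rather than any validation confirmed in this repository.

```python
import re

# Assumed ID format: "Q" or "P" followed by a positive integer without leading zeros.
ID_PATTERN = re.compile(r"^[QP][1-9][0-9]*$")


def build_id_list(ids):
    """Normalize and validate IDs, then join them for the `qid` parameter."""
    cleaned = [i.strip().upper() for i in ids]
    invalid = [i for i in cleaned if not ID_PATTERN.match(i)]
    if invalid:
        raise ValueError(f"not a QID or PID: {invalid}")
    return ",".join(cleaned)


qid_param = build_id_list(["Q42", " q25169", "P31"])
```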

## Response Shape

`/item/query/` returns objects with:

- `QID`
- `similarity_score`
- `rrf_score`
- `source` (`Vector Search`, `Keyword Search`, or both)
- `reranker_score` (when `rerank=true`)

`/property/query/` returns the same shape with `PID` instead of `QID`.

`/similarity-score/` returns:

- `QID` or `PID`
- `similarity_score`
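
On the client side, the item fields listed above could be mapped onto a small typed record. This is a sketch under assumptions: it treats the response body as a JSON array of objects, keeps `source` as the raw value returned, and uses made-up sample values rather than real API output.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemResult:
    qid: str
    similarity_score: float
    rrf_score: float
    source: str
    reranker_score: Optional[float] = None  # present only when rerank=true


def parse_item_results(payload: str) -> list[ItemResult]:
    """Parse /item/query/ results (a JSON array shape is assumed here)."""
    return [
        ItemResult(
            qid=row["QID"],
            similarity_score=float(row["similarity_score"]),
            rrf_score=float(row["rrf_score"]),
            source=row["source"],
            reranker_score=row.get("reranker_score"),
        )
        for row in json.loads(payload)
    ]


# Made-up values for illustration only; not real API output.
sample = '[{"QID": "Q42", "similarity_score": 0.5, "rrf_score": 0.25, "source": "Vector Search"}]'
results = parse_item_results(sample)
```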

## Architecture

High-level request flow:

1. FastAPI route receives the query and enforces the user-agent policy and the rate limit.
2. `HybridSearch` orchestrates retrieval:
   - Vector path: embeds query with Jina embeddings and searches Astra DB vector collections across language shards in parallel.
   - Keyword path: runs Wikidata keyword search against `wikidata.org`.
3. Results are fused with Reciprocal Rank Fusion (RRF), preserving source attribution.
4. Optional reranking fetches Wikidata text representations and reorders top hits with Jina reranker.
5. JSON response is returned and request metadata is logged for analytics.
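
The fusion in step 3 can be sketched with the standard RRF formula, where each document scores the sum of `1 / (k + rank)` over the rankings that contain it. This is a generic illustration and not the code in `HybridSearch.py`: the function name is hypothetical, and the constant `k=60` is a commonly used default that this README does not confirm.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; `rankings` maps a source name to its ordered IDs.

    Returns (id, rrf_score, sources) tuples sorted by descending score.
    k=60 is an assumed smoothing constant, not taken from this repository.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ids in rankings.items():
        for rank, item_id in enumerate(ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
            sources[item_id].append(source)  # keep source attribution
    ordered = sorted(scores, key=lambda item_id: scores[item_id], reverse=True)
    return [(item_id, scores[item_id], sources[item_id]) for item_id in ordered]


fused = reciprocal_rank_fusion(
    {
        "Vector Search": ["Q42", "Q5"],
        "Keyword Search": ["Q42", "Q25169"],
    }
)
```

Here `Q42` appears first in both lists, so it accumulates two contributions and ranks ahead of the IDs found by only one path.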

Main components in this repo:

- API app and routing: `wikidatasearch/main.py`, `wikidatasearch/routes/`
- Retrieval orchestration: `wikidatasearch/services/search/HybridSearch.py`
- Vector retrieval backend: `wikidatasearch/services/search/VectorSearch.py`
- Keyword retrieval backend: `wikidatasearch/services/search/KeywordSearch.py`
- Embeddings/reranking client: `wikidatasearch/services/jina.py`

## License
-WikidataSearch is open-source software licensed under the MIT License. You are free to use, modify, and distribute the software as you wish. We kindly ask for a citation to this repository if you use WikidataSearch in your projects.
-
-## Contact
-For questions, comments, or discussions, please open an issue on this GitHub repository. We are committed to fostering a welcoming and collaborative community.
See [LICENSE](LICENSE).