
V3 #422 (Open)

thomashacker wants to merge 12 commits into main from v3

Conversation

@thomashacker
Collaborator

No description provided.

thomashacker and others added 6 commits April 17, 2026 16:46
- ci.yml: run pytest + ruff on Python 3.11/3.12 and ESLint on Node 20
  for every PR targeting main or v3
- release.yml: auto-publish to PyPI (trusted publishing) and push versioned
  Docker tags (v*.*.* + latest) on git tag push
- docker-image.yml: upgrade build-push-action to v6, add GHA build cache
- ruff.toml: Python linting config (E, W, F, I, UP, B rules, 100 char limit)
- .pre-commit-config.yaml: ruff, ruff-format, prettier for frontend,
  and standard file hygiene hooks
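As a rough sketch, the ruff.toml described above could look like this (rule selection and limit taken from the commit message; the rest is illustrative):

```toml
# Hypothetical sketch of the ruff.toml this commit adds.
line-length = 100

[lint]
select = ["E", "W", "F", "I", "UP", "B"]
```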

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- frontend/package.json: bump version from 2.1.0 to 2.1.3 to match backend
- SECURITY.md: add responsible disclosure policy pointing to GitHub's
  private vulnerability reporting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track all v3 changes: DeepSeek/LM Studio integrations (pending merge),
chunk deserialization fix, CORS fix, CI/CD automation, ruff, pre-commit,
Dependabot, and SECURITY.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Supports deepseek-chat and deepseek-reasoner via DeepSeek's
OpenAI-compatible API. R1 reasoning content is rendered in a collapsible
<details> section with a Show Reasoning toggle. Model list is fetched
dynamically at startup with fallback to defaults.

Env vars: DEEPSEEK_API_KEY, DEEPSEEK_BASE_URL, DEEPSEEK_MODEL
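A minimal sketch of the rendering idea: deepseek-reasoner returns a separate `reasoning_content` field alongside `content` in its OpenAI-compatible responses, and the reasoning is wrapped in a collapsible `<details>` block. The helper name here is illustrative, not the actual Verba code.

```python
# Hypothetical helper: wrap R1 reasoning in a collapsible <details> section.
# Field names follow DeepSeek's OpenAI-compatible message format.

def render_with_reasoning(message: dict) -> str:
    answer = message.get("content", "")
    reasoning = message.get("reasoning_content")
    if not reasoning:
        return answer  # deepseek-chat responses carry no reasoning field
    return (
        "<details><summary>Show Reasoning</summary>\n\n"
        f"{reasoning}\n\n</details>\n\n{answer}"
    )
```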

Closes #368
Co-Authored-By: yaowubarbara <113857460+yaowubarbara@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds LMStudioEmbedder and LMStudioGenerator using LM Studio's
OpenAI-compatible local API (default: http://localhost:1234/v1).
Supports dynamic model discovery, custom base URL, and optional API key.

Fixes applied during merge:
- Use json= instead of BytesIO in embedder (consistent with other embedders)
- Use aiohttp.ClientTimeout instead of bare int for timeout
- Remove redundant Content-Type header (set automatically by aiohttp json=)
- Add missing EOF newlines to both files
- Explicitly declare httpx>=0.27.0 in setup.py (was already used by
  OpenAIGenerator/UpstageGenerator but undeclared)

Env vars: LMSTUDIO_BASE_URL, LMSTUDIO_API_KEY, LMSTUDIO_MODEL, LMSTUDIO_EMBEDDER_MODEL
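The first three merge fixes can be sketched in one request, assuming LM Studio's OpenAI-compatible `/embeddings` endpoint; the function name is illustrative, not the actual LMStudioEmbedder code:

```python
import os

import aiohttp  # third-party: pip install aiohttp


async def lmstudio_embed(texts: list[str], model: str) -> dict:
    """Illustrative embedder request after the merge fixes."""
    base_url = os.getenv("LMSTUDIO_BASE_URL", "http://localhost:1234/v1")
    # Fix: a proper ClientTimeout object, not a bare int.
    timeout = aiohttp.ClientTimeout(connect=10, sock_read=300)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(
            f"{base_url}/embeddings",
            # Fix: json= instead of BytesIO; aiohttp also sets the
            # Content-Type header itself, so none is passed explicitly.
            json={"model": model, "input": texts},
        ) as resp:
            resp.raise_for_status()
            return await resp.json()
```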

Co-Authored-By: supmo668 <mymm.psu@gmail.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Detail all additions (DeepSeek, LM Studio), fixes (chunk deserialization,
CORS), dependency changes, and infrastructure work landed in v3 so far.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@orca-security-eu (bot) left a comment


Orca Security Scan Summary

Status  Check                    high  medium  low  info
Passed  Infrastructure as Code   0     0       0    0
Passed  SAST                     0     0       0    0
Passed  Secrets                  0     0       0    0
Passed  Vulnerabilities          0     0       0    0

thomashacker and others added 6 commits April 17, 2026 16:58
- Delete pypi_commands.sh (replaced by release.yml workflow)
- .gitignore: add data/, reorganise with section comments
- Add CODE_OF_CONDUCT.md (Contributor Covenant 2.1)
- Add .github/PULL_REQUEST_TEMPLATE.md to guide contributors

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Track 1 — bug fixes and performance:
- setup.py: remove deprecated asyncio==3.4.3 PyPI package
- verba_manager.py: fix mutable default LoggerManager, lock creation race
  (setdefault), ClientManager dict-mutation races in clean_up/disconnect,
  O(n²) string concat in generate_stream_answer (list+join), simplify
  create_task immediately-awaited pattern
- api.py: fix Exception object in WebSocket send_json (was not str), move
  msg.good inside try block, add _periodic_cleanup background task in
  lifespan, fix CORS allow_credentials, simplify streaming WebSocket
- SentenceTransformersEmbedder.py: cache model by name, wrap model.encode
  with asyncio.to_thread to avoid blocking event loop
- All generators (9 files): replace timeout=None with explicit connect=10/
  read=300 timeout (httpx.Timeout or aiohttp.ClientTimeout)
- server/types.py: add Pydantic max_length validators on query (50k),
  context (500k), and conversation (100 items) in GeneratePayload and
  QueryPayload to prevent unbounded memory use
- server/helpers.py: add 5-minute TTL eviction to BatchManager so abandoned
  uploads do not leak memory indefinitely
- components/util.py: remove debug print() statements from pca()
- components/document.py: bare except → except Exception
- components/generation/GeminiGenerator.py: bare except → except ImportError

Track 2 — tests (71 tests, all passing):
- goldenverba/tests/components/test_chunkers.py: unit tests for
  TokenChunker, SentenceChunker, MarkdownChunker
- goldenverba/tests/server/test_helpers.py: BatchManager (normal flow,
  TTL eviction, out-of-order, abandoned, duplicate chunk, isLastChunk flag)
  and LoggerManager (send_report, create_new_document)
- goldenverba/tests/document/test_document.py: expanded coverage

Track 3 — developer guide:
- BACKEND.md: architecture overview, plugin system, data flow, step-by-step
  guides for adding Generator/Embedder/Reader/Chunker, config system,
  WebSocket protocol, Weaviate schema, local dev setup

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
README:
- Add DeepSeek and LM Studio to model/embedding support tables
- Add DeepSeek and LM Studio env var entries to API keys table
- Add DeepSeek and LM Studio provider sections with usage examples
- Fix Anthropic spelling (was "Anthrophic")
- Update intro blurb to mention new providers

CHANGELOG:
- Expand v3.0.0 Fixed section with all backend bug fixes (mutable default
  LoggerManager, lock race, dict mutation, Exception in send_json, bare
  excepts, debug prints, SentenceTransformer caching, streaming timeouts,
  asyncio patterns, string concat)
- Add Changed entries for payload size limits and BatchManager TTL
- Add test suite and BACKEND.md to Added section
- Add CODE_OF_CONDUCT.md and PR template to Infrastructure section

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…+ API tests

Bug fixes:
- verba_manager.py process_single_document: remove pointless create_task wrapper
  around weaviate_manager.import_document (was equivalent to direct await but
  with extra overhead)
- verba_manager.py verify_config: replace zip()-based structural comparison with
  set-based key comparison so adding/removing a component or config key cannot
  silently pass validation (zip truncates to the shorter iterable)
- managers.py WeaviateManager: give verify_cache_collection its own cache_table
  dict so it no longer shares keys with embedding_table — previously whichever
  method ran first would win and the other would silently skip collection creation
- managers.py delete_document: guard json.loads(meta) and the "Embedder" key
  lookup with try/except so a malformed or missing meta field no longer raises
  an unhandled TypeError/KeyError
- managers.py add_suggestion: remove the redundant aggregate.over_all() call
  before fetch_objects — one filter query is sufficient regardless of whether
  the collection is empty
- managers.py get_vectors(showAll=True): replace N sequential get_document()
  calls inside the async iterator with a two-pass approach — collect all items
  first, then batch-fetch unique documents with asyncio.gather()
- managers.py connect_to_cluster/docker/embedded/custom: change from async def
  to def (these methods only call synchronous factory functions; the async
  keyword was misleading)
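The verify_config pitfall and its fix can be shown side by side; these function names are illustrative, not the actual verify_config signature:

```python
def keys_match_zip(saved: dict, expected: dict) -> bool:
    # Buggy pattern: zip() truncates to the shorter iterable, so an extra
    # key on either side is never compared and validation silently passes.
    return all(a == b for a, b in zip(sorted(saved), sorted(expected)))


def keys_match_sets(saved: dict, expected: dict) -> bool:
    # Fixed pattern: compare the full key sets, so any added or removed
    # component/config key fails validation.
    return set(saved) == set(expected)
```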

Tests (test_api.py — 46 tests, all passing):
- Health, connect (success + failure), get/set RAG config, query (including
  payload size validation), get/delete document, get all documents, reset
  (ALL / DOCUMENTS / CONFIG / SUGGESTIONS modes), get_meta, suggestions
  (get, get_all, delete), get_content, get_datacount, get_labels
- Fixtures patch the module-level manager and client_manager singletons;
  mock_weaviate_client uses spec=WeaviateAsyncClient so isinstance() checks
  in the API pass; TestClient receives origin: http://testserver to satisfy
  the same-origin middleware
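The spec= trick rests on a standard unittest.mock behavior: a mock built with spec=SomeClass reports that class from isinstance(). A minimal illustration, with DummyClient standing in for WeaviateAsyncClient:

```python
from unittest.mock import MagicMock


class DummyClient:
    """Stand-in for WeaviateAsyncClient in this illustration."""

    async def close(self) -> None: ...


# A spec'd mock passes isinstance() checks against the spec class, so API
# code guarded by isinstance(client, WeaviateAsyncClient) still executes.
mock_client = MagicMock(spec=DummyClient)
assert isinstance(mock_client, DummyClient)

# A bare mock does not, and would make those guards skip the code under test.
assert not isinstance(MagicMock(), DummyClient)
```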

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Reader

Reduce reader maintenance surface from 7 integrations to 5:

Removed:
- FirecrawlReader: 239-line async polling reader with paid usage-based API;
  HTMLReader covers the primary use case (static site scraping)
- UpstageDocumentParse: redundant with UnstructuredAPI for complex PDFs,
  paid API with no unique capability not already covered
- AssemblyAIReader: replaced by WhisperReader (paid-per-minute → free local)

Added:
- WhisperReader: local audio/video transcription via faster-whisper.
  35+ formats, no API key, no external service. Configurable model size
  (tiny → large-v3) and device (cpu/cuda/auto).

Also fixed two latent bugs discovered during reader testing:
- HTMLReader: config dict not threaded into process_url(), causing NameError
  at runtime on any URL fetch
- GitReader: FileConfig built without metadata field, causing Pydantic
  ValidationError on real GitHub/GitLab API calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
verify_collection() was calling client.collections.create(name=...) with
no properties, producing schema-less collections. Any subsequent filter
on a named property (e.g. Filter.by_property("query").equal(...)) raised
a gRPC UNKNOWN error from Weaviate because the property didn't exist in
the schema.

Fix: add _DOCUMENT_PROPERTIES, _CONFIG_PROPERTIES, _SUGGESTION_PROPERTIES,
and _CHUNK_PROPERTIES class constants, and a _schema_for() helper that
selects the right schema from the collection name. verify_collection() now
passes the matching property list at creation time. All 27 Weaviate
integration tests now pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>