feat: tokenize endpoints + property-level TextAnalyzer + StopwordPresets (Weaviate 1.37.0+) #329
Open
mpartipilo wants to merge 4 commits into main
Port of python-client PR #2012, aligned with the TS client's `tokenize`
namespace design. Adds:
- `client.Tokenize.Text(text, tokenization, analyzerConfig?, stopwordPresets?)`
→ POST /v1/tokenize
- `collection.Tokenize.Property(propertyName, text)`
→ POST /v1/schema/{class}/properties/{prop}/tokenize
Version-gated at 1.37.0 via `[RequiresWeaviateVersion]`. `AsciiFold` is
modeled as a nullable record (null = disabled, non-null = enabled with
optional `Ignore` list) so the invalid "ignore without fold" state is
unrepresentable without a validator.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
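The nullable-record modeling of `AsciiFold` described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the record shapes, the `Ignore` element type, and the `AsciiFoldDemo.Describe` helper are assumptions made up for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shapes for illustration only; the PR's real records may differ.
public sealed record AsciiFoldConfig(IReadOnlyList<string>? Ignore = null);

public sealed record TextAnalyzerConfig(
    AsciiFoldConfig? AsciiFold = null,   // null = folding disabled
    string? StopwordPreset = null);

public static class AsciiFoldDemo
{
    // Describes the folding state. There is no branch for "ignore without
    // fold": an Ignore list only exists inside a non-null AsciiFoldConfig.
    public static string Describe(TextAnalyzerConfig config)
    {
        if (config.AsciiFold is null)
            return "fold: off";

        var ignore = config.AsciiFold.Ignore;
        return ignore is { Count: > 0 }
            ? $"fold: on, ignore: {string.Join(",", ignore)}"
            : "fold: on";
    }
}
```

Because the ignore list lives on the fold record itself, a caller cannot supply ignore characters while folding is off, which is why no separate validator is needed under this design.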
Orca Security Scan Summary
| Check | Report |
|---|---|
| Infrastructure as Code | View in Orca |
| SAST | View in Orca |
| Secrets | View in Orca |
| Vulnerabilities | View in Orca |
Summary - Weaviate C# Client Coverage
- Weaviate.Client - 48.8%
- Weaviate.Client.Analyzers - 0%
- Weaviate.Client.VectorData - 50.3%
- New docs/TOKENIZE_API_USAGE.md covers both `client.Tokenize.Text` and `collection.Tokenize.Property`, analyzer config (ASCII folding, stopwords), the result shape, and common usage patterns.
- Link the guide from README under "Additional Guides".
- Add an "Unreleased" CHANGELOG entry for the tokenize endpoints.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port weaviate-python-client PR #2006 on top of the tokenize-endpoint stack for Weaviate 1.37.0:
- Property.TextAnalyzer: pin ASCII folding and stopword preset per property at index time. Reuses the TextAnalyzerConfig record already introduced for /v1/tokenize so tokenize-at-query and index-at-insert stay aligned. Propagates through nested properties via Property -> NestedProperties recursion.
- InvertedIndexConfig.StopwordPresets: named preset -> word-list map on the collection inverted-index config. Properties reference presets via TextAnalyzer.StopwordPreset. Round-trips through create + update.
- InvertedIndexConfigUpdate.StopwordPresets: mirrors the set accessor on the update wrapper so c.InvertedIndexConfig.StopwordPresets = ... works inside collection.Config.Update(...).
- Preflight in CollectionsClient.Create: detects either feature in the incoming schema and throws WeaviateVersionMismatchException when the connected server is older than 1.37.0, before any REST call.
- Rename TokenizeAnalyzerConfig -> TextAnalyzerConfig: same shape now serves both the tokenize endpoint and the property-level analyzer, matching the server type name and Python naming.
- Integration tests in TestCollectionTextAnalyzer.cs cover preset round-trip, update, referenced-removal rejection, ascii-fold combos, and version-gate behaviour.
- CHANGELOG + docs/TOKENIZE_API_USAGE.md extended with worked examples for the schema-side analyzer and stopword presets.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…pwordPresets rejections
The `StopwordPresets_RemoveInUse_RejectedByServer` and `StopwordPresets_RemoveReferencedByNested_RejectedByServer` tests expected `WeaviateClientException`, but the server returns HTTP 422, which the client maps to `WeaviateUnprocessableEntityException : WeaviateServerException`. The test names already indicate these are server-side rejections, so the assertions are aligned with the actual (and correct) exception type.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Ports two related Weaviate 1.37.0 features from the Python client into one stacked PR:

1. python-client PR #2012: `/v1/tokenize` endpoints
   - `client.Tokenize.Text(text, tokenization, analyzerConfig?, stopwordPresets?, ct)` (POST /v1/tokenize)
   - `collection.Tokenize.Property(propertyName, text, ct)` (POST /v1/schema/{class}/properties/{prop}/tokenize)
   - Aligned with the TS client's `tokenize` namespace design.
   - `AsciiFold` is a nullable record (`AsciiFoldConfig? AsciiFold`): null = disabled, non-null = enabled with optional `Ignore` list. The invalid "ignore without fold" state is unrepresentable, so no runtime validator is needed.
   - Version-gated at 1.37.0 via `[RequiresWeaviateVersion(1, 37, 0)]` + `EnsureVersion<T>()`.
2. python-client PR #2006: property-level `TextAnalyzer` + collection-level `StopwordPresets`
   - `Property.TextAnalyzer`: pin ASCII folding and stopword preset per property at index time. Reuses the `TextAnalyzerConfig` record from (1) so tokenize-at-query and index-at-insert stay aligned. Propagates through nested properties.
   - `InvertedIndexConfig.StopwordPresets`: named preset → word-list map on the collection inverted-index config. Properties reference presets via `TextAnalyzer.StopwordPreset`.
   - `InvertedIndexConfigUpdate.StopwordPresets`: mirrors the set accessor on the update wrapper so `c.InvertedIndexConfig.StopwordPresets = ...` works inside `collection.Config.Update(...)`.
   - `CollectionsClient.Create` detects either feature in the incoming schema and throws `WeaviateVersionMismatchException` when the server is older than 1.37.0, before any REST call.
   - `TextAnalyzerConfig` shape and version gate.
   - `TokenizeAnalyzerConfig` was renamed to `TextAnalyzerConfig` to match the server type name.

Docs: TOKENIZE_API_USAGE.md is an end-to-end guide covering both scopes, including schema-time analyzer + preset examples.
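The preflight described above for `CollectionsClient.Create` could look roughly like the sketch below. It is only an illustration of the idea (walk the schema, including nested properties, and fail before any request is sent). `PropertySketch`, `CollectionSketch`, and `Preflight.EnsureSupported` are hypothetical names, `InvalidOperationException` stands in for `WeaviateVersionMismatchException`, and treating an empty preset map as "feature not used" is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified schema types; not the client's real classes.
public sealed record PropertySketch(
    string Name,
    bool HasTextAnalyzer = false,
    IReadOnlyList<PropertySketch>? NestedProperties = null);

public sealed record CollectionSketch(
    string Name,
    IReadOnlyList<PropertySketch> Properties,
    IReadOnlyDictionary<string, IReadOnlyList<string>>? StopwordPresets = null);

public static class Preflight
{
    private static readonly Version MinVersion = new(1, 37, 0);

    // True when any property, at any nesting depth, sets a text analyzer.
    private static bool UsesTextAnalyzer(IEnumerable<PropertySketch>? props) =>
        props is not null &&
        props.Any(p => p.HasTextAnalyzer || UsesTextAnalyzer(p.NestedProperties));

    // Throws if the schema uses a 1.37.0+ feature and the server is older.
    // Intended to run before any request is sent.
    public static void EnsureSupported(CollectionSketch schema, Version server)
    {
        bool needsGate = schema.StopwordPresets is { Count: > 0 }
                         || UsesTextAnalyzer(schema.Properties);

        if (needsGate && server < MinVersion)
            throw new InvalidOperationException(
                $"Collection '{schema.Name}' requires Weaviate {MinVersion}; server is {server}.");
    }
}
```

Checking client-side keeps the failure local and descriptive: a schema that uses neither feature passes on any server version, while one that uses either fails fast on an older server.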
Out of scope
- `gse_ch` fix: separate tracked work.

Test plan
- `dotnet build src/Weaviate.Client/` → 0 errors
- `dotnet build src/Weaviate.Client.Tests/` → 0 warnings, 0 errors
- `dotnet test --filter FullyQualifiedName~TestTokenize` → 16/16 passed against Weaviate 1.37.1
- `dotnet test --filter FullyQualifiedName~TestCollectionTextAnalyzer` against Weaviate 1.37.1

🤖 Generated with Claude Code