6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -9,7 +9,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

#### Tokenization

- **Tokenize Endpoints** ([#329](https://github.com/weaviate/csharp-client/pull/329)): Expose the `POST /v1/tokenize` and `POST /v1/schema/{class}/properties/{prop}/tokenize` endpoints introduced in Weaviate 1.37.0. Inspect how text is tokenized for a given method and analyzer configuration, or how a specific collection property would tokenize it. Access via `client.Tokenize.Text(...)` and `collection.Tokenize.Property(...)`. `AsciiFoldConfig` is modeled as a nullable record so the invalid "ignore without fold" state is unrepresentable. See [TOKENIZE_API_USAGE.md](docs/TOKENIZE_API_USAGE.md). Requires Weaviate ≥ 1.37.0.
- **Property-Level `TextAnalyzerConfig`** ([#329](https://github.com/weaviate/csharp-client/pull/329)): `Property.TextAnalyzer` (also applies to nested properties) lets a collection schema pin ASCII folding and/or a stopword preset per property at index time. The same `TextAnalyzerConfig` record is reused from the `Tokenize` endpoint so tokenize-at-query and index-at-insert stay aligned. A preflight version check on `CollectionsClient.Create` raises `WeaviateVersionMismatchException` when the server is older than 1.37.0. Requires Weaviate ≥ 1.37.0.
- **Collection-Level `StopwordPresets`** ([#329](https://github.com/weaviate/csharp-client/pull/329)): `InvertedIndexConfig.StopwordPresets` and `InvertedIndexConfigUpdate.StopwordPresets` define named presets (a map of preset name → word list) on the inverted-index config. Properties reference these presets via `TextAnalyzer.StopwordPreset`. Preset changes flow through `CollectionClient.Config.Update(c => c.InvertedIndexConfig.StopwordPresets = ...)`. Requires Weaviate ≥ 1.37.0.

---

1 change: 1 addition & 0 deletions README.md
@@ -126,6 +126,7 @@ For more detailed information on specific features, please refer to the official
- **[Backup API Usage](docs/BACKUP_API_USAGE.md)**: Creating and restoring backups
- **[Nodes API Usage](docs/NODES_API_USAGE.md)**: Querying cluster node information
- **[Aggregate Result Accessors](docs/AGGREGATE_RESULT_ACCESSORS.md)**: Type-safe access to aggregation results
- **[Tokenize API Usage](docs/TOKENIZE_API_USAGE.md)**: Inspect how text is tokenized with a given method or for a specific collection property. Requires Weaviate ≥ 1.37.0.
- **[Microsoft.Extensions.VectorData Integration](docs/VECTORDATA.md)**: Standard .NET vector store abstraction support

---
352 changes: 352 additions & 0 deletions docs/TOKENIZE_API_USAGE.md
@@ -0,0 +1,352 @@
# Tokenize API Usage Guide

> **Version Requirement:**
> The tokenize endpoints require Weaviate **v1.37.0** or newer. Calls against earlier versions throw `WeaviateVersionMismatchException`.

This guide covers the Weaviate C# client's tokenize API — a pair of endpoints that let you inspect how the server would tokenize a piece of text, either with an ad-hoc tokenization strategy or using the one already configured on a collection property.
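Since the version requirement surfaces as an exception at call time, a guard like the following sketch (catching the `WeaviateVersionMismatchException` mentioned above) keeps code paths on older deployments from crashing:

```csharp
// Sketch: tolerate servers older than 1.37.0 by catching the
// version-mismatch exception instead of letting it propagate.
try
{
    var result = await client.Tokenize.Text("hello", PropertyTokenization.Word);
    Console.WriteLine(string.Join(", ", result.Indexed));
}
catch (WeaviateVersionMismatchException)
{
    Console.WriteLine("Tokenize API unavailable: server is older than 1.37.0.");
}
```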

## Table of Contents

- [Overview](#overview)
- [Tokenization Methods](#tokenization-methods)
- [Ad-hoc Tokenization (`client.Tokenize.Text`)](#ad-hoc-tokenization-clienttokenizetext)
- [Property-scoped Tokenization (`collection.Tokenize.Property`)](#property-scoped-tokenization-collectiontokenizeproperty)
- [Analyzer Configuration](#analyzer-configuration)
- [Stopwords](#stopwords)
- [Result Shape](#result-shape)
- [Property-level Text Analyzer (schema)](#property-level-text-analyzer-schema)
- [Collection-level Stopword Presets (schema)](#collection-level-stopword-presets-schema)
- [Common Patterns](#common-patterns)

## Overview

The tokenize API exposes two REST endpoints:

| Method | Endpoint | Use when… |
|---|---|---|
| `client.Tokenize.Text(...)` | `POST /v1/tokenize` | You want to preview tokenization for arbitrary text with any method/config — no collection required. |
| `collection.Tokenize.Property(...)` | `POST /v1/schema/{class}/properties/{prop}/tokenize` | You want to tokenize text *exactly as it would be indexed* by a specific property of an existing collection. |

Both return a `TokenizeResult` containing two token lists:

- **`Indexed`** — tokens as they are stored in the inverted index.
- **`Query`** — tokens as they are used for query matching (after stopword removal, etc.).

These differ when stopwords are configured: a stopword like `"the"` is still indexed (so `BM25` can count it), but dropped from `Query` so it doesn't inflate match scores.

## Tokenization Methods

The `PropertyTokenization` enum covers all nine server-supported strategies:

| Method | Input | Output (`Indexed`) |
|---|---|---|
| `Word` | `"The quick brown fox"` | `["the", "quick", "brown", "fox"]` |
| `Lowercase` | `"Hello World Test"` | `["hello", "world", "test"]` |
| `Whitespace` | `"Hello World Test"` | `["Hello", "World", "Test"]` |
| `Field` | `" Hello World "` | `["Hello World"]` *(entire field, trimmed)* |
| `Trigram` | `"Hello"` | `["hel", "ell", "llo"]` |
| `Gse` | Chinese/Japanese | Requires `ENABLE_TOKENIZER_GSE=true` on the server |
| `GseCh` | Chinese-only GSE | Requires `ENABLE_TOKENIZER_GSE_CH=true` |
| `KagomeJa` | Japanese | Requires `ENABLE_TOKENIZER_KAGOME_JA=true` |
| `KagomeKr` | Korean | Requires `ENABLE_TOKENIZER_KAGOME_KR=true` |
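As a quick way to compare strategies, the same input can be run through several methods in a loop. This is a sketch built on the `client.Tokenize.Text` call covered in the next section:

```csharp
// Compare how different tokenization strategies split the same input.
// Assumes a connected `client`, as used throughout this guide.
var methods = new[]
{
    PropertyTokenization.Word,
    PropertyTokenization.Whitespace,
    PropertyTokenization.Trigram,
};

foreach (var method in methods)
{
    var result = await client.Tokenize.Text("Hello World", method);
    Console.WriteLine($"{method}: {string.Join(", ", result.Indexed)}");
}
```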

## Ad-hoc Tokenization (`client.Tokenize.Text`)

The simplest call takes only a text and a tokenization method:

```csharp
using Weaviate.Client.Models;

var result = await client.Tokenize.Text(
    text: "The quick brown fox",
    tokenization: PropertyTokenization.Word
);

Console.WriteLine(string.Join(", ", result.Indexed));
// the, quick, brown, fox
```

Signature:

```csharp
Task<TokenizeResult> Tokenize.Text(
    string text,
    PropertyTokenization tokenization,
    TextAnalyzerConfig? analyzerConfig = null,
    IDictionary<string, StopwordConfig>? stopwordPresets = null,
    CancellationToken cancellationToken = default
);
```

## Property-scoped Tokenization (`collection.Tokenize.Property`)

When you want to see how a specific property would tokenize text — using that property's configured tokenization — use the collection-scoped variant:

```csharp
var collection = await client.Collections.Get("Article");

var result = await collection.Tokenize.Property(
    propertyName: "title",
    text: " Hello World "
);

Console.WriteLine(result.Tokenization); // Field (whatever the property is configured with)
Console.WriteLine(string.Join(", ", result.Indexed)); // Hello World
```

The server uses the property's configured tokenization method and any analyzer config attached to the property — you don't pass either yourself.

## Analyzer Configuration

`TextAnalyzerConfig` controls two optional analyzer stages: **ASCII folding** and **stopword removal**.

### ASCII Folding

`AsciiFoldConfig` is a nullable record — `null` means folding is disabled, non-`null` means it's enabled. The `Ignore` list lets you exempt specific characters from folding.

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(), // folding enabled, nothing ignored
};

var result = await client.Tokenize.Text(
    "L'école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
// result.Indexed == ["l", "ecole", "est", "fermee"]
```

Ignore a specific character:

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
};

var result = await client.Tokenize.Text(
    "L'école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
// result.Indexed == ["l", "école", "est", "fermée"]
```

> **Tip:** Modeling `AsciiFold` as a nullable record makes the "ignore without fold" state unrepresentable — you can't accidentally pass `Ignore` without enabling folding.

### Built-in Stopword Presets

Use a built-in preset (`"en"`, `"none"`) via the `StopwordPreset` field:

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "en" };

var result = await client.Tokenize.Text(
    "The quick brown fox",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
);

// result.Indexed → ["the", "quick", "brown", "fox"] (all tokens kept in index)
// result.Query → ["quick", "brown", "fox"] ("the" removed for queries)
```

## Stopwords

For more control, define a named preset via the `stopwordPresets` dictionary and reference it from `StopwordPreset`.

### Add words to a preset

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "custom" };

var presets = new Dictionary<string, StopwordConfig>
{
    ["custom"] = new StopwordConfig
    {
        Preset = StopwordConfig.Presets.None,
        Additions = ["test"],
    },
};

var result = await client.Tokenize.Text(
    "hello world test",
    PropertyTokenization.Word,
    analyzerConfig: cfg,
    stopwordPresets: presets
);

// result.Indexed → ["hello", "world", "test"]
// result.Query → ["hello", "world"] ("test" dropped)
```

### Start from a base preset and remove words

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "en-no-the" };

var presets = new Dictionary<string, StopwordConfig>
{
    ["en-no-the"] = new StopwordConfig
    {
        Preset = StopwordConfig.Presets.EN,
        Removals = ["the"],
    },
};

var result = await client.Tokenize.Text(
    "the quick",
    PropertyTokenization.Word,
    analyzerConfig: cfg,
    stopwordPresets: presets
);

// "the" is no longer a stopword in this preset, so it survives in both lists.
```

### Combining folding and stopwords

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
    StopwordPreset = "en",
};

var result = await client.Tokenize.Text(
    "The école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);

// result.Indexed → ["the", "école", "est", "fermée"]
// result.Query → ["école", "est", "fermée"] ("the" dropped)
```

## Result Shape

`TokenizeResult` is a sealed record:

| Member | Type | Description |
|---|---|---|
| `Tokenization` | `PropertyTokenization` | The method that was applied. |
| `Indexed` | `ImmutableList<string>` | Tokens as stored in the inverted index. |
| `Query` | `ImmutableList<string>` | Tokens used at query time (after stopword removal). |
| `AnalyzerConfig` | `TextAnalyzerConfig?` | Echo of the analyzer config that was applied, or `null`. |
| `StopwordConfig` | `StopwordConfig?` | Echo of the resolved stopword config, or `null`. |

The `AnalyzerConfig` echo is the server's view of what was applied — useful for verifying that your config was parsed correctly. The round-trip also normalizes wire-format quirks (the server represents `asciiFold` as a `bool` + separate `asciiFoldIgnore[]`, but the client unwraps it back into the nested `AsciiFoldConfig` record).
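Putting the table together, a single call surfaces all of these members. A sketch reusing the built-in `"en"` preset from earlier:

```csharp
var result = await client.Tokenize.Text(
    "The quick brown fox",
    PropertyTokenization.Word,
    analyzerConfig: new TextAnalyzerConfig { StopwordPreset = "en" }
);

Console.WriteLine(result.Tokenization);                    // Word
Console.WriteLine(string.Join(", ", result.Indexed));      // the, quick, brown, fox
Console.WriteLine(string.Join(", ", result.Query));        // quick, brown, fox
Console.WriteLine(result.AnalyzerConfig?.StopwordPreset);  // en
```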

## Property-level Text Analyzer (schema)

Beyond the ad-hoc tokenize endpoint, Weaviate 1.37.0 also lets you pin analyzer options directly on a property at **collection-creation time**. The same `TextAnalyzerConfig` record is reused: whatever you would pass to `client.Tokenize.Text(...)` can also be attached to a property so every value indexed through that property gets the same treatment.

```csharp
await client.Collections.Create(new CollectionCreateParams
{
    Name = "Article",
    Properties =
    [
        new Property
        {
            Name = "title",
            DataType = [DataType.Text],
            Tokenization = PropertyTokenization.Word,
            TextAnalyzer = new TextAnalyzerConfig
            {
                AsciiFold = new AsciiFoldConfig(),
                StopwordPreset = "en",
            },
        },
    ],
});
```

Nested properties (object / object-array) accept `TextAnalyzer` too — they are `Property` records themselves, so the same field is available at every depth.
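A sketch of a nested text property carrying its own analyzer; note that the `NestedProperties` member name is an assumption here, not confirmed by this guide:

```csharp
// Hypothetical: `NestedProperties` is an assumed member name on `Property` —
// check the actual record definition before relying on it.
new Property
{
    Name = "metadata",
    DataType = [DataType.Object],
    NestedProperties =
    [
        new Property
        {
            Name = "caption",
            DataType = [DataType.Text],
            Tokenization = PropertyTokenization.Word,
            TextAnalyzer = new TextAnalyzerConfig { StopwordPreset = "en" },
        },
    ],
};
```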

> **Version requirement:** `Property.TextAnalyzer` is only wired up for servers at Weaviate ≥ 1.37.0. `CollectionsClient.Create` performs a preflight version check and throws `WeaviateVersionMismatchException` if the connected server is older, before the schema request is sent.

## Collection-level Stopword Presets (schema)

Named stopword lists live on the collection's inverted-index config. A preset is a `preset-name → word-list` pair; properties reference one by name via `TextAnalyzer.StopwordPreset`.

```csharp
await client.Collections.Create(new CollectionCreateParams
{
    Name = "Article",
    InvertedIndexConfig = new InvertedIndexConfig
    {
        StopwordPresets = new Dictionary<string, IList<string>>
        {
            ["fr"] = new[] { "le", "la", "les" },
            ["custom_en"] = new[] { "foo", "bar" },
        },
    },
    Properties =
    [
        new Property
        {
            Name = "body",
            DataType = [DataType.Text],
            TextAnalyzer = new TextAnalyzerConfig { StopwordPreset = "fr" },
        },
    ],
});
```

Updating presets on an existing collection goes through the normal update path:

```csharp
await collection.Config.Update(c =>
{
    c.InvertedIndexConfig.StopwordPresets = new Dictionary<string, IList<string>>
    {
        ["fr"] = new[] { "le", "la", "les", "un", "une" },
    };
});
```

Setting `StopwordPresets` replaces the whole preset map on the server. The server rejects removing a preset that is still referenced by a property's `TextAnalyzer.StopwordPreset` — keep preset removals and property-config changes in the same update, or unwire the property first.
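Because the map is replaced wholesale, removing a preset means re-sending every preset you keep. A sketch removing `custom_en` from the collection created above while preserving `fr`:

```csharp
await collection.Config.Update(c =>
{
    // Re-send the full map: any preset omitted here is removed on the server.
    c.InvertedIndexConfig.StopwordPresets = new Dictionary<string, IList<string>>
    {
        ["fr"] = new[] { "le", "la", "les" }, // still referenced by the "body" property
    };
});
```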

> **Version requirement:** Requires Weaviate ≥ 1.37.0. The preflight in `CollectionsClient.Create` also trips on `InvertedIndexConfig.StopwordPresets` before contacting the server.

## Common Patterns

### Previewing a query

Use `collection.Tokenize.Property` to see exactly what tokens the server will match your search against:

```csharp
var tokens = (await collection.Tokenize.Property("title", userQuery)).Query;
// Show tokens in the UI as "searching for: X, Y, Z"
```

### Debugging a BM25 miss

If a search misses a term you expected, tokenize both the query and a sample document with the same property:

```csharp
var queryTokens = (await collection.Tokenize.Property("body", "running")).Query;
var docTokens = (await collection.Tokenize.Property("body", "I was running")).Indexed;

// If the sets don't intersect, BM25 can't match — check for stemming / stopwords.
```
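The intersection check itself can be made explicit, continuing the snippet above (needs `System.Linq`):

```csharp
using System.Linq;

var overlap = queryTokens.Intersect(docTokens).ToList();
Console.WriteLine(overlap.Count == 0
    ? "No shared tokens: BM25 cannot match."
    : $"Shared tokens: {string.Join(", ", overlap)}");
```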

### Verifying analyzer config round-trip

When you configure ASCII folding or a stopword preset, the server echoes back its interpretation on every call:

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
    StopwordPreset = "en",
};

var result = await client.Tokenize.Text("L'école", PropertyTokenization.Word, analyzerConfig: cfg);

Debug.Assert(result.AnalyzerConfig!.AsciiFold!.Ignore!.SequenceEqual(new[] { "é" }));
Debug.Assert(result.AnalyzerConfig.StopwordPreset == "en");
```