6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -9,7 +9,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

#### Tokenization

- **Tokenize Endpoints** ([#329](https://github.com/weaviate/csharp-client/pull/329)): Expose the `POST /v1/tokenize` and `POST /v1/schema/{class}/properties/{prop}/tokenize` endpoints introduced in Weaviate 1.37.0. Inspect how text is tokenized for a given method and analyzer configuration, or how a specific collection property would tokenize it. Access via `client.Tokenize.Text(...)` and `collection.Tokenize.Property(...)`. `AsciiFoldConfig` is modeled as a nullable record so the invalid "ignore without fold" state is unrepresentable. See [TOKENIZE_API_USAGE.md](docs/TOKENIZE_API_USAGE.md). Requires Weaviate ≥ 1.37.0.
- **Property-Level `TextAnalyzerConfig`** ([#329](https://github.com/weaviate/csharp-client/pull/329)): `Property.TextAnalyzer` (also applies to nested properties) lets a collection schema pin ASCII folding and/or a stopword preset per property at index time. The same `TextAnalyzerConfig` record is reused from the `Tokenize` endpoint so tokenize-at-query and index-at-insert stay aligned. A preflight version check on `CollectionsClient.Create` raises `WeaviateVersionMismatchException` when the server is older than 1.37.0. Requires Weaviate ≥ 1.37.0.
- **Collection-Level `StopwordPresets`** ([#329](https://github.com/weaviate/csharp-client/pull/329)): `InvertedIndexConfig.StopwordPresets` and `InvertedIndexConfigUpdate.StopwordPresets` define named presets (a map of preset name → word list) on the inverted-index config. Properties reference these presets via `TextAnalyzer.StopwordPreset`. Preset changes flow through `CollectionClient.Config.Update(c => c.InvertedIndexConfig.StopwordPresets = ...)`. Requires Weaviate ≥ 1.37.0.

---

1 change: 1 addition & 0 deletions README.md
@@ -126,6 +126,7 @@ For more detailed information on specific features, please refer to the official
- **[Backup API Usage](docs/BACKUP_API_USAGE.md)**: Creating and restoring backups
- **[Nodes API Usage](docs/NODES_API_USAGE.md)**: Querying cluster node information
- **[Aggregate Result Accessors](docs/AGGREGATE_RESULT_ACCESSORS.md)**: Type-safe access to aggregation results
- **[Tokenize API Usage](docs/TOKENIZE_API_USAGE.md)**: Inspect how text is tokenized with a given method or for a specific collection property. Requires Weaviate ≥ 1.37.0.
- **[Microsoft.Extensions.VectorData Integration](docs/VECTORDATA.md)**: Standard .NET vector store abstraction support

---
352 changes: 352 additions & 0 deletions docs/TOKENIZE_API_USAGE.md
@@ -0,0 +1,352 @@
# Tokenize API Usage Guide

> **Version Requirement:**
> The tokenize endpoints require Weaviate **v1.37.0** or newer. Calls against earlier versions throw `WeaviateVersionMismatchException`.

This guide covers the Weaviate C# client's tokenize API — a pair of endpoints that let you inspect how the server would tokenize a piece of text, either with an ad-hoc tokenization strategy or using the one already configured on a collection property.
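Since the version requirement surfaces as an exception at call time, a guard like the following sketch (catching the `WeaviateVersionMismatchException` mentioned above) keeps code paths on older deployments from crashing:

```csharp
// Sketch: tolerate servers older than 1.37.0 by catching the
// version-mismatch exception instead of letting it propagate.
try
{
    var result = await client.Tokenize.Text("hello", PropertyTokenization.Word);
    Console.WriteLine(string.Join(", ", result.Indexed));
}
catch (WeaviateVersionMismatchException)
{
    Console.WriteLine("Tokenize API unavailable: server is older than 1.37.0.");
}
```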

## Table of Contents

- [Overview](#overview)
- [Tokenization Methods](#tokenization-methods)
- [Ad-hoc Tokenization (`client.Tokenize.Text`)](#ad-hoc-tokenization-clienttokenizetext)
- [Property-scoped Tokenization (`collection.Tokenize.Property`)](#property-scoped-tokenization-collectiontokenizeproperty)
- [Analyzer Configuration](#analyzer-configuration)
- [Stopwords](#stopwords)
- [Result Shape](#result-shape)
- [Property-level Text Analyzer (schema)](#property-level-text-analyzer-schema)
- [Collection-level Stopword Presets (schema)](#collection-level-stopword-presets-schema)
- [Common Patterns](#common-patterns)

## Overview

The tokenize API exposes two REST endpoints:

| Method | Endpoint | Use when… |
|---|---|---|
| `client.Tokenize.Text(...)` | `POST /v1/tokenize` | You want to preview tokenization for arbitrary text with any method/config — no collection required. |
| `collection.Tokenize.Property(...)` | `POST /v1/schema/{class}/properties/{prop}/tokenize` | You want to tokenize text *exactly as it would be indexed* by a specific property of an existing collection. |

Both return a `TokenizeResult` containing two token lists:

- **`Indexed`** — tokens as they are stored in the inverted index.
- **`Query`** — tokens as they are used for query matching (after stopword removal, etc.).

These differ when stopwords are configured: a stopword like `"the"` is still indexed (so `BM25` can count it), but dropped from `Query` so it doesn't inflate match scores.

## Tokenization Methods

The `PropertyTokenization` enum covers all nine server-supported strategies:

| Method | Input | Output (`Indexed`) |
|---|---|---|
| `Word` | `"The quick brown fox"` | `["the", "quick", "brown", "fox"]` |
| `Lowercase` | `"Hello World Test"` | `["hello", "world", "test"]` |
| `Whitespace` | `"Hello World Test"` | `["Hello", "World", "Test"]` |
| `Field` | `" Hello World "` | `["Hello World"]` *(entire field, trimmed)* |
| `Trigram` | `"Hello"` | `["hel", "ell", "llo"]` |
| `Gse` | Chinese/Japanese | Requires `ENABLE_TOKENIZER_GSE=true` on the server |
| `GseCh` | Chinese-only GSE | Requires `ENABLE_TOKENIZER_GSE_CH=true` |
| `KagomeJa` | Japanese | Requires `ENABLE_TOKENIZER_KAGOME_JA=true` |
| `KagomeKr` | Korean | Requires `ENABLE_TOKENIZER_KAGOME_KR=true` |
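As a quick way to compare strategies, the same input can be run through several methods in a loop. This is a sketch built on the `client.Tokenize.Text` call covered in the next section:

```csharp
// Compare how different tokenization strategies split the same input.
// Assumes a connected `client`, as used throughout this guide.
var methods = new[]
{
    PropertyTokenization.Word,
    PropertyTokenization.Whitespace,
    PropertyTokenization.Trigram,
};

foreach (var method in methods)
{
    var result = await client.Tokenize.Text("Hello World", method);
    Console.WriteLine($"{method}: {string.Join(", ", result.Indexed)}");
}
```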

## Ad-hoc Tokenization (`client.Tokenize.Text`)

The simplest call takes only a text and a tokenization method:

```csharp
using Weaviate.Client.Models;

var result = await client.Tokenize.Text(
    text: "The quick brown fox",
    tokenization: PropertyTokenization.Word
);

Console.WriteLine(string.Join(", ", result.Indexed));
// the, quick, brown, fox
```

Signature:

```csharp
Task<TokenizeResult> Tokenize.Text(
    string text,
    PropertyTokenization tokenization,
    TextAnalyzerConfig? analyzerConfig = null,
    IDictionary<string, StopwordConfig>? stopwordPresets = null,
    CancellationToken cancellationToken = default
);
```

## Property-scoped Tokenization (`collection.Tokenize.Property`)

When you want to see how a specific property would tokenize text — using that property's configured tokenization — use the collection-scoped variant:

```csharp
var collection = await client.Collections.Get("Article");

var result = await collection.Tokenize.Property(
    propertyName: "title",
    text: " Hello World "
);

Console.WriteLine(result.Tokenization); // Field (whatever the property is configured with)
Console.WriteLine(string.Join(", ", result.Indexed)); // Hello World
```

The server uses the property's configured tokenization method and any analyzer config attached to the property — you don't pass either yourself.

## Analyzer Configuration

`TextAnalyzerConfig` controls two optional analyzer stages: **ASCII folding** and **stopword removal**.

### ASCII Folding

`AsciiFoldConfig` is a nullable record — `null` means folding is disabled, non-`null` means it's enabled. The `Ignore` list lets you exempt specific characters from folding.

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(), // folding enabled, nothing ignored
};

var result = await client.Tokenize.Text(
    "L'école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
// result.Indexed == ["l", "ecole", "est", "fermee"]
```

Ignore a specific character:

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
};

var result = await client.Tokenize.Text(
    "L'école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
// result.Indexed == ["l", "école", "est", "fermée"]
```

> **Tip:** Modeling `AsciiFold` as a nullable record makes the "ignore without fold" state unrepresentable — you can't accidentally pass `Ignore` without enabling folding.

### Built-in Stopword Presets

Use a built-in preset (`"en"`, `"none"`) via the `StopwordPreset` field:

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "en" };

var result = await client.Tokenize.Text(
    "The quick brown fox",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);
);

// result.Indexed → ["the", "quick", "brown", "fox"] (all tokens kept in index)
// result.Query → ["quick", "brown", "fox"] ("the" removed for queries)
```

## Stopwords

For more control, define a named preset via the `stopwordPresets` dictionary and reference it from `StopwordPreset`.

### Add words to a preset

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "custom" };

var presets = new Dictionary<string, StopwordConfig>
{
    ["custom"] = new StopwordConfig
    {
        Preset = StopwordConfig.Presets.None,
        Additions = ["test"],
    },
};

var result = await client.Tokenize.Text(
    "hello world test",
    PropertyTokenization.Word,
    analyzerConfig: cfg,
    stopwordPresets: presets
);

// result.Indexed → ["hello", "world", "test"]
// result.Query → ["hello", "world"] ("test" dropped)
```

### Start from a base preset and remove words

```csharp
var cfg = new TextAnalyzerConfig { StopwordPreset = "en-no-the" };

var presets = new Dictionary<string, StopwordConfig>
{
    ["en-no-the"] = new StopwordConfig
    {
        Preset = StopwordConfig.Presets.EN,
        Removals = ["the"],
    },
};

var result = await client.Tokenize.Text(
    "the quick",
    PropertyTokenization.Word,
    analyzerConfig: cfg,
    stopwordPresets: presets
);

// "the" is no longer a stopword in this preset, so it survives in both lists.
```

### Combining folding and stopwords

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
    StopwordPreset = "en",
};

var result = await client.Tokenize.Text(
    "The école est fermée",
    PropertyTokenization.Word,
    analyzerConfig: cfg
);

// result.Indexed → ["the", "école", "est", "fermée"]
// result.Query → ["école", "est", "fermée"] ("the" dropped)
```

## Result Shape

`TokenizeResult` is a sealed record:

| Member | Type | Description |
|---|---|---|
| `Tokenization` | `PropertyTokenization` | The method that was applied. |
| `Indexed` | `ImmutableList<string>` | Tokens as stored in the inverted index. |
| `Query` | `ImmutableList<string>` | Tokens used at query time (after stopword removal). |
| `AnalyzerConfig` | `TextAnalyzerConfig?` | Echo of the analyzer config that was applied, or `null`. |
| `StopwordConfig` | `StopwordConfig?` | Echo of the resolved stopword config, or `null`. |

The `AnalyzerConfig` echo is the server's view of what was applied — useful for verifying that your config was parsed correctly. The round-trip also normalizes wire-format quirks (the server represents `asciiFold` as a `bool` + separate `asciiFoldIgnore[]`, but the client unwraps it back into the nested `AsciiFoldConfig` record).
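Putting the table together, a single call surfaces all of these members. A sketch reusing the built-in `"en"` preset from earlier:

```csharp
var result = await client.Tokenize.Text(
    "The quick brown fox",
    PropertyTokenization.Word,
    analyzerConfig: new TextAnalyzerConfig { StopwordPreset = "en" }
);

Console.WriteLine(result.Tokenization);                    // Word
Console.WriteLine(string.Join(", ", result.Indexed));      // the, quick, brown, fox
Console.WriteLine(string.Join(", ", result.Query));        // quick, brown, fox
Console.WriteLine(result.AnalyzerConfig?.StopwordPreset);  // en
```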

## Property-level Text Analyzer (schema)

Beyond the ad-hoc tokenize endpoint, Weaviate 1.37.0 also lets you pin analyzer options directly on a property at **collection-creation time**. The same `TextAnalyzerConfig` record is reused: whatever you would pass to `client.Tokenize.Text(...)` can also be attached to a property so every value indexed through that property gets the same treatment.

```csharp
await client.Collections.Create(new CollectionCreateParams
{
    Name = "Article",
    Properties =
    [
        new Property
        {
            Name = "title",
            DataType = [DataType.Text],
            Tokenization = PropertyTokenization.Word,
            TextAnalyzer = new TextAnalyzerConfig
            {
                AsciiFold = new AsciiFoldConfig(),
                StopwordPreset = "en",
            },
        },
    ],
});
```

Nested properties (object / object-array) accept `TextAnalyzer` too — they are `Property` records themselves, so the same field is available at every depth.
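A sketch of a nested text property carrying its own analyzer; note that the `NestedProperties` member name is an assumption here, not confirmed by this guide:

```csharp
// Hypothetical: `NestedProperties` is an assumed member name on `Property` —
// check the actual record definition before relying on it.
new Property
{
    Name = "metadata",
    DataType = [DataType.Object],
    NestedProperties =
    [
        new Property
        {
            Name = "caption",
            DataType = [DataType.Text],
            Tokenization = PropertyTokenization.Word,
            TextAnalyzer = new TextAnalyzerConfig { StopwordPreset = "en" },
        },
    ],
};
```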

> **Version requirement:** `Property.TextAnalyzer` is only wired up for servers at Weaviate ≥ 1.37.0. `CollectionsClient.Create` performs a preflight version check and throws `WeaviateVersionMismatchException` if the connected server is older, before the schema request is sent.

## Collection-level Stopword Presets (schema)

Named stopword lists live on the collection's inverted-index config. A preset is a `preset-name → word-list` pair; properties reference one by name via `TextAnalyzer.StopwordPreset`.

```csharp
await client.Collections.Create(new CollectionCreateParams
{
    Name = "Article",
    InvertedIndexConfig = new InvertedIndexConfig
    {
        StopwordPresets = new Dictionary<string, IList<string>>
        {
            ["fr"] = new[] { "le", "la", "les" },
            ["custom_en"] = new[] { "foo", "bar" },
        },
    },
    Properties =
    [
        new Property
        {
            Name = "body",
            DataType = [DataType.Text],
            TextAnalyzer = new TextAnalyzerConfig { StopwordPreset = "fr" },
        },
    ],
});
```

Updating presets on an existing collection goes through the normal update path:

```csharp
await collection.Config.Update(c =>
{
    c.InvertedIndexConfig.StopwordPresets = new Dictionary<string, IList<string>>
    {
        ["fr"] = new[] { "le", "la", "les", "un", "une" },
    };
});
```

Setting `StopwordPresets` replaces the whole preset map on the server. The server rejects removing a preset that is still referenced by a property's `TextAnalyzer.StopwordPreset` — keep preset removals and property-config changes in the same update, or unwire the property first.
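Because the map is replaced wholesale, removing a preset means re-sending every preset you keep. A sketch removing `custom_en` from the collection created above while preserving `fr`:

```csharp
await collection.Config.Update(c =>
{
    // Re-send the full map: any preset omitted here is removed on the server.
    c.InvertedIndexConfig.StopwordPresets = new Dictionary<string, IList<string>>
    {
        ["fr"] = new[] { "le", "la", "les" }, // still referenced by the "body" property
    };
});
```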

> **Version requirement:** Requires Weaviate ≥ 1.37.0. The preflight in `CollectionsClient.Create` also trips on `InvertedIndexConfig.StopwordPresets` before contacting the server.

## Common Patterns

### Previewing a query

Use `collection.Tokenize.Property` to see exactly what tokens the server will match your search against:

```csharp
var tokens = (await collection.Tokenize.Property("title", userQuery)).Query;
// Show tokens in the UI as "searching for: X, Y, Z"
```

### Debugging a BM25 miss

If a search misses a term you expected, tokenize both the query and a sample document with the same property:

```csharp
var queryTokens = (await collection.Tokenize.Property("body", "running")).Query;
var docTokens = (await collection.Tokenize.Property("body", "I was running")).Indexed;

// If the sets don't intersect, BM25 can't match — check for stemming / stopwords.
```
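The intersection check itself can be made explicit, continuing the snippet above (needs `System.Linq`):

```csharp
using System.Linq;

var overlap = queryTokens.Intersect(docTokens).ToList();
Console.WriteLine(overlap.Count == 0
    ? "No shared tokens: BM25 cannot match."
    : $"Shared tokens: {string.Join(", ", overlap)}");
```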

### Verifying analyzer config round-trip

When you configure ASCII folding or a stopword preset, the server echoes back its interpretation on every call:

```csharp
var cfg = new TextAnalyzerConfig
{
    AsciiFold = new AsciiFoldConfig(Ignore: ["é"]),
    StopwordPreset = "en",
};

var result = await client.Tokenize.Text("L'école", PropertyTokenization.Word, analyzerConfig: cfg);

Debug.Assert(result.AnalyzerConfig!.AsciiFold!.Ignore!.SequenceEqual(new[] { "é" }));
Debug.Assert(result.AnalyzerConfig.StopwordPreset == "en");
```