
Support non-ASCII characters in Tokenizer (or directly accept tokenized text) #5

@ndandanov


Thank you for the great and performant package!

I was experimenting with it today and found a potential limitation in the tokenizer related to language support.
Currently, it seems to support only ASCII characters:

        for ch in text.chars() {
            if ch.is_ascii_alphanumeric() {
                current.push(ch.to_ascii_lowercase());

If one wanted to use bb25 for other texts, e.g., containing Cyrillic alphabet, the non-ASCII characters would be stripped and tokens would be missing.

With this issue, I would like to demonstrate the limitation.
Perhaps the tokenizer could be improved to support non-ASCII characters.

Alternatively, Corpus could simply accept pre-tokenized text when adding a new document (below is a simple example of how to enable this).
Or Corpus could accept a custom Tokenizer written in Python which inherits from bb25's Tokenizer.
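To illustrate what a Unicode-aware tokenizer could look like, here is a minimal Python sketch that mirrors the current Rust char loop but swaps the ASCII-only checks for Unicode-aware ones (`str.isalnum()` / `str.lower()` instead of `is_ascii_alphanumeric()` / `to_ascii_lowercase()`). This is just a sketch of the proposed behavior, not bb25's actual API:

```python
def tokenize_unicode(text: str) -> list[str]:
    """Split text on non-alphanumeric characters, Unicode-aware."""
    tokens: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch.isalnum():  # Unicode-aware, unlike is_ascii_alphanumeric()
            current.append(ch.lower())
        elif current:
            # Non-alphanumeric character ends the current token
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize_unicode("BM25 (Best Matching 25) е усъвършенстван алгоритъм"))
# ['bm25', 'best', 'matching', '25', 'е', 'усъвършенстван', 'алгоритъм']
```

In Rust, the equivalent change would be to use `char::is_alphanumeric()` and `char::to_lowercase()` in place of the ASCII variants.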

This is a short example featuring a document which contains Bulgarian text:

import bb25 as bb

corpus = bb.Corpus()

corpus.add_document("d0", "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене", [0.2] * 15)
corpus.add_document("d1", "neural networks for ranking", [0.1] * 8)
corpus.build_index()  # must be called before creating scorers

bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)
print(bm25.idf("bm25"))
# Prints out: 0.0

for doc in corpus.documents():
    print(f'Document "{doc.id}"')
    print(f'- text "{doc.text}"')
    print(f'- tokens "{doc.tokens}"')
    print(f'- length (number of tokens) "{doc.length}"')
    print(f'- term frequencies "{doc.term_freq}"')
    print()

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['bm25', 'best', 'matching', '25', 'ranking']"
# - length (number of tokens) "5"
# - term frequencies "{'ranking': 1, 'bm25': 1, '25': 1, 'best': 1, 'matching': 1}"
#
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'networks': 1, 'for': 1, 'neural': 1, 'ranking': 1}"

You can see that d0 is treated as containing only 5 tokens: the ones made up entirely of ASCII characters.

If the tokenizer supported non-ASCII characters, the result would be:

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['BM25', 'Best', 'Matching', '25', 'е', 'усъвършенстван', 'алгоритъм', 'за', 'класиране', 'на', 'документи', 'ranking', 'в', 'информационното', 'търсене']"
# - length (number of tokens) "15"
# - term frequencies "{'класиране': 1, 'алгоритъм': 1, 'ranking': 1, 'усъвършенстван': 1, '25': 1, 'Matching': 1, 'е': 1, 'документи': 1, 'търсене': 1, 'информационното': 1, 'за': 1, 'BM25': 1, 'Best': 1, 'в': 1, 'на': 1}"
# 
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'neural': 1, 'ranking': 1, 'for': 1, 'networks': 1}"

And the IDF would be correct: print(bm25.idf("bm25")) would print out 1.6094379124341003.

Adding support for passing tokens directly to add_document would require only small changes to Corpus:

diff --git a/src/corpus.rs b/src/corpus.rs
index e11b77d..7993289 100644
--- a/src/corpus.rs
+++ b/src/corpus.rs
@@ -35,6 +35,10 @@ impl Corpus {
 
     pub fn add_document(&mut self, doc_id: &str, text: &str, embedding: Vec<f64>) {
         let tokens = self.tokenizer.tokenize(text);
+        self.add_document_with_tokens(doc_id, text, tokens, embedding);
+    }
+
+    pub fn add_document_with_tokens(&mut self, doc_id: &str, text: &str, tokens: Vec<String>, embedding: Vec<f64>) {
         let mut term_freq = HashMap::new();
         for token in &tokens {
             *term_freq.entry(token.clone()).or_insert(0) += 1;

and the corresponding binding:

diff --git a/src/pybindings.rs b/src/pybindings.rs
index aafdd43..f17df5f 100644
--- a/src/pybindings.rs
+++ b/src/pybindings.rs
@@ -171,7 +171,8 @@ impl PyCorpus {
         }
     }
 
-    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>) -> PyResult<()> {
+    #[pyo3(signature = (doc_id, text, embedding, tokens=None))]
+    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>, tokens: Option<Vec<String>>) -> PyResult<()> {
         if self.shared.borrow().is_some() {
             return Err(PyRuntimeError::new_err(
                 "Corpus is frozen and cannot be modified",
@@ -181,7 +182,11 @@ impl PyCorpus {
         let Some(corpus) = inner.as_mut() else {
             return Err(PyRuntimeError::new_err("Corpus is unavailable"));
         };
-        corpus.add_document(doc_id, text, embedding);
+        if let Some(toks) = tokens {
+            corpus.add_document_with_tokens(doc_id, text, toks, embedding);
+        } else {
+            corpus.add_document(doc_id, text, embedding);
+        }
         Ok(())
     }

Then, documents can be added like this:

corpus = bb.Corpus()

text = "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
tokens = list(tokenize(text))
corpus.add_document(doc_id="d0", text=text, tokens=tokens, embedding=[])

where tokenize is a custom tokenization function, e.g., as simple as:

def tokenize(text: str):
    return text.split()
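A slightly more robust custom tokenizer could use a regular expression instead of a bare split, so that punctuation is stripped while Unicode letters are kept (in Python 3, `\w` matches Unicode word characters by default). The regex and the lowercasing below are illustrative choices, not something bb25 requires:

```python
import re

def tokenize(text: str) -> list[str]:
    # \w+ matches runs of Unicode word characters (letters, digits, underscore),
    # so Cyrillic, Greek, etc. are kept while punctuation is dropped.
    return [token.lower() for token in re.findall(r"\w+", text)]

print(tokenize("BM25 (Best Matching 25) е усъвършенстван алгоритъм"))
# ['bm25', 'best', 'matching', '25', 'е', 'усъвършенстван', 'алгоритъм']
```

Because the tokens are supplied from Python, any tokenization scheme (stemming, stop-word removal, language-specific rules) could be plugged in without touching the Rust side.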

I hope this is useful!

Kind regards,
Nikolay
