Description
Thank you for the great and performant package!
I was experimenting with it today and found a potential limitation in the tokenizer related to language support.
Currently, it seems to support only ASCII characters:
```rust
for ch in text.chars() {
    if ch.is_ascii_alphanumeric() {
        current.push(ch.to_ascii_lowercase());
```

If one wanted to use bb25 for other texts, e.g. ones containing the Cyrillic alphabet, the non-ASCII characters would be stripped and tokens would go missing.
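To illustrate outside of Rust, here is a small Python sketch (my own illustration, not bb25 code) that mimics the character loop above; under this filter, everything non-ASCII is silently dropped:

```python
def ascii_only_tokenize(text: str) -> list[str]:
    # Python sketch of the Rust loop: only ASCII alphanumerics survive,
    # and any other character acts as a token boundary.
    tokens, current = [], ""
    for ch in text:
        if ch.isascii() and ch.isalnum():
            current += ch.lower()
        elif current:
            tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    return tokens

print(ascii_only_tokenize(
    "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране "
    "на документи (ranking) в информационното търсене"
))
# ['bm25', 'best', 'matching', '25', 'ranking']
```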
With this issue, I would like to demonstrate this limitation. There are a few ways it could be addressed:
- the tokenizer could be improved to support non-ASCII characters;
- alternatively, Corpus could support passing tokens directly when adding a new document (a simple example of how to enable this is below);
- or Corpus could accept a custom Tokenizer written in Python that inherits from bb25's Tokenizer.
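On the first option: in Python, a Unicode-aware tokenizer can be sketched in a few lines (again my illustration, not bb25's API); on the Rust side, `char::is_alphanumeric` would play the same role as the regex here:

```python
import re

def unicode_tokenize(text: str) -> list[str]:
    # \w+ is Unicode-aware in Python 3, so Cyrillic words survive intact
    # (case is preserved here; lowercasing is a separate design decision)
    return re.findall(r"\w+", text)

print(unicode_tokenize("BM25 (Best Matching 25) е алгоритъм"))
# ['BM25', 'Best', 'Matching', '25', 'е', 'алгоритъм']
```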
Here is a short example featuring a document that contains Bulgarian text:
```python
import bb25 as bb

corpus = bb.Corpus()
corpus.add_document("d0", "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене", [0.2] * 15)
corpus.add_document("d1", "neural networks for ranking", [0.1] * 8)
corpus.build_index()  # must be called before creating scorers
bm25 = bb.BM25Scorer(corpus, 1.2, 0.75)

print(bm25.idf("bm25"))
# Prints out: 0.0

for doc in corpus.documents():
    print(f'Document "{doc.id}"')
    print(f'- text "{doc.text}"')
    print(f'- tokens "{doc.tokens}"')
    print(f'- length (number of tokens) "{doc.length}"')
    print(f'- term frequencies "{doc.term_freq}"')
    print()

# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['bm25', 'best', 'matching', '25', 'ranking']"
# - length (number of tokens) "5"
# - term frequencies "{'ranking': 1, 'bm25': 1, '25': 1, 'best': 1, 'matching': 1}"
#
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'networks': 1, 'for': 1, 'neural': 1, 'ranking': 1}"
```

You can see that d0 is treated as containing only 5 tokens: the ones that do not contain non-ASCII characters.
If the tokenizer supported non-ASCII characters, the result would have been:
```python
# Prints out:
# Document "d0"
# - text "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
# - tokens "['BM25', 'Best', 'Matching', '25', 'е', 'усъвършенстван', 'алгоритъм', 'за', 'класиране', 'на', 'документи', 'ranking', 'в', 'информационното', 'търсене']"
# - length (number of tokens) "15"
# - term frequencies "{'класиране': 1, 'алгоритъм': 1, 'ranking': 1, 'усъвършенстван': 1, '25': 1, 'Matching': 1, 'е': 1, 'документи': 1, 'търсене': 1, 'информационното': 1, 'за': 1, 'BM25': 1, 'Best': 1, 'в': 1, 'на': 1}"
#
# Document "d1"
# - text "neural networks for ranking"
# - tokens "['neural', 'networks', 'for', 'ranking']"
# - length (number of tokens) "4"
# - term frequencies "{'neural': 1, 'ranking': 1, 'for': 1, 'networks': 1}"
```

And the IDF would be correct: `print(bm25.idf("bm25"))` would print 1.6094379124341003.
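Incidentally, both printed idf values are consistent with the classic BM25 idf formula, idf(t) = ln((N - n_t + 0.5) / (n_t + 0.5)) (this is an assumption about which variant bb25 uses; I have not verified it against the source):

```python
import math

def bm25_idf(N: int, df: int) -> float:
    # Classic BM25 idf: N documents, df of them containing the term.
    # Assumed to match bb25's variant; not verified against its source.
    return math.log((N - df + 0.5) / (df + 0.5))

# Today the tokenizer lowercases, so "bm25" occurs in 1 of 2 documents:
print(bm25_idf(2, 1))  # 0.0
# With the case-preserving tokens above, lowercase "bm25" occurs in 0 of 2:
print(bm25_idf(2, 0))  # 1.6094379124341003
```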
Adding support for directly passing tokens to add_document would require only small changes to Corpus:
```diff
diff --git a/src/corpus.rs b/src/corpus.rs
index e11b77d..7993289 100644
--- a/src/corpus.rs
+++ b/src/corpus.rs
@@ -35,6 +35,10 @@ impl Corpus {
     pub fn add_document(&mut self, doc_id: &str, text: &str, embedding: Vec<f64>) {
         let tokens = self.tokenizer.tokenize(text);
+        self.add_document_with_tokens(doc_id, text, tokens, embedding);
+    }
+
+    pub fn add_document_with_tokens(&mut self, doc_id: &str, text: &str, tokens: Vec<String>, embedding: Vec<f64>) {
         let mut term_freq = HashMap::new();
         for token in &tokens {
             *term_freq.entry(token.clone()).or_insert(0) += 1;
```

and the respective binding:
```diff
diff --git a/src/pybindings.rs b/src/pybindings.rs
index aafdd43..f17df5f 100644
--- a/src/pybindings.rs
+++ b/src/pybindings.rs
@@ -171,7 +171,8 @@ impl PyCorpus {
         }
     }
-    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>) -> PyResult<()> {
+    #[pyo3(signature = (doc_id, text, embedding, tokens=None))]
+    fn add_document(&self, doc_id: &str, text: &str, embedding: Vec<f64>, tokens: Option<Vec<String>>) -> PyResult<()> {
         if self.shared.borrow().is_some() {
             return Err(PyRuntimeError::new_err(
                 "Corpus is frozen and cannot be modified",
@@ -181,7 +182,11 @@ impl PyCorpus {
         let Some(corpus) = inner.as_mut() else {
             return Err(PyRuntimeError::new_err("Corpus is unavailable"));
         };
-        corpus.add_document(doc_id, text, embedding);
+        if let Some(toks) = tokens {
+            corpus.add_document_with_tokens(doc_id, text, toks, embedding);
+        } else {
+            corpus.add_document(doc_id, text, embedding);
+        }
         Ok(())
     }
```

Then, documents can be added like this:
```python
corpus = bb.Corpus()
text = "BM25 (Best Matching 25) е усъвършенстван алгоритъм за класиране на документи (ranking) в информационното търсене"
tokens = list(tokenize(text))
corpus.add_document(doc_id="d0", text=text, tokens=tokens, embedding=[])
```

where `tokenize` is a custom tokenization function, e.g. one as simple as:
```python
def tokenize(text: str):
    return text.split()
```

I hope this is useful!
Kind regards,
Nikolay