This package wraps the PyThaiNLP library to add Thai language support for spaCy.
Support List
- Word segmentation (tokenization)
- Part-of-speech tagging
- Named entity recognition (NER)
- Sentence segmentation
- Dependency parsing
- Word vectors

Requirements
- Python 3.9 or higher
- spaCy 3.0 or higher
- PyThaiNLP 3.1.0 or higher
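
The version floors above can be checked before installing. The following is a minimal sketch, not part of this package: `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helper names, and it assumes the distributions are published as `spacy` and `pythainlp`. Pre-release suffixes such as `rc1` are simply ignored, so treat the result as a rough guide.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Minimum versions copied from the requirements list above.
MINIMUMS = {"spacy": (3, 0), "pythainlp": (3, 1, 0)}


def parse_version(text):
    """Turn '3.1.0' or '3.7.2.dev1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version tuples, padding the shorter one with zeros."""
    width = max(len(installed), len(minimum))
    padded_installed = installed + (0,) * (width - len(installed))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_installed >= padded_minimum


def check_requirements():
    """Return a dict mapping each requirement to 'ok', 'too old', or 'missing'."""
    report = {"python": "ok" if sys.version_info[:2] >= (3, 9) else "too old"}
    for name, minimum in MINIMUMS.items():
        try:
            installed = parse_version(version(name))
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report


if __name__ == "__main__":
    for name, status in check_requirements().items():
        print(f"{name}: {status}")
```

Anything reported as `missing` or `too old` should be installed or upgraded before adding the pipeline component.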
pip install spacy-pythainlp

import spacy
import spacy_pythainlp.core
# Create a blank Thai language model
nlp = spacy.blank("th")
# Add the PyThaiNLP pipeline component
nlp.add_pipe("pythainlp")
# Process text
doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
# Access sentences
for sent in doc.sents:
    print(sent)
# Output:
# ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน
# ผมอยากไปเที่ยว

import spacy
import spacy_pythainlp.core
nlp = spacy.blank("th")
nlp.add_pipe("pythainlp")
doc = nlp("ผมเป็นคนไทย แต่มะลิอยากไปโรงเรียนส่วนผมจะไปไหน ผมอยากไปเที่ยว")
# Get sentences
sentences = list(doc.sents)
print(f"Number of sentences: {len(sentences)}")
for i, sent in enumerate(sentences, 1):
    print(f"Sentence {i}: {sent.text}")

import spacy
import spacy_pythainlp.core
nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"pos": True})
doc = nlp("ผมเป็นคนไทย")
# Print tokens with POS tags
for token in doc:
    print(f"{token.text}: {token.pos_}")

import spacy
import spacy_pythainlp.core
nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"ner": True})
doc = nlp("วันที่ 15 กันยายน 2564 ทดสอบระบบที่กรุงเทพ")
# Print named entities
for ent in doc.ents:
    print(f"{ent.text}: {ent.label_}")

import spacy
import spacy_pythainlp.core
nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"dependency_parsing": True})
doc = nlp("ผมเป็นคนไทย")
# Print dependency relations
for token in doc:
    print(f"{token.text}: {token.dep_} <- {token.head.text}")

import spacy
import spacy_pythainlp.core
nlp = spacy.blank("th")
nlp.add_pipe("pythainlp", config={"word_vector": True, "word_vector_model": "thai2fit_wv"})
doc = nlp("แมว สุนัข")
# Access word vectors
for token in doc:
    print(f"{token.text}: vector shape = {token.vector.shape}")
# Calculate similarity
token1 = doc[0]  # แมว
token2 = doc[1]  # สุนัข
print(f"Similarity: {token1.similarity(token2)}")

You can customize the PyThaiNLP pipeline component by passing a configuration dictionary to `nlp.add_pipe()`:
nlp.add_pipe(
    "pythainlp",
    config={
        "pos_engine": "perceptron",
        "pos": True,
        "pos_corpus": "orchid_ud",
        "sent_engine": "crfcut",
        "sent": True,
        "ner_engine": "thainer",
        "ner": True,
        "tokenize_engine": "newmm",
        "tokenize": False,
        "dependency_parsing": False,
        "dependency_parsing_engine": "esupar",
        "dependency_parsing_model": None,
        "word_vector": True,
        "word_vector_model": "thai2fit_wv"
    }
)

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tokenize` | `bool` | `False` | Enable/disable word tokenization (spaCy uses PyThaiNLP's newmm by default) |
| `tokenize_engine` | `str` | `"newmm"` | Tokenization engine. See options |
| `sent` | `bool` | `True` | Enable/disable sentence segmentation |
| `sent_engine` | `str` | `"crfcut"` | Sentence tokenizer engine. See options |
| `pos` | `bool` | `True` | Enable/disable part-of-speech tagging |
| `pos_engine` | `str` | `"perceptron"` | POS tagging engine. See options |
| `pos_corpus` | `str` | `"orchid_ud"` | Corpus for POS tagging |
| `ner` | `bool` | `True` | Enable/disable named entity recognition |
| `ner_engine` | `str` | `"thainer"` | NER engine. See options |
| `dependency_parsing` | `bool` | `False` | Enable/disable dependency parsing |
| `dependency_parsing_engine` | `str` | `"esupar"` | Dependency parsing engine. See options |
| `dependency_parsing_model` | `str` | `None` | Dependency parsing model. See options |
| `word_vector` | `bool` | `True` | Enable/disable word vectors |
| `word_vector_model` | `str` | `"thai2fit_wv"` | Word vector model. See options |
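
Because the options are passed as a plain dictionary, a misspelled key or a string where a boolean is expected is easy to miss. The sketch below is a hypothetical pre-check, not part of this package: `OPTION_TYPES` and `validate_config` are illustrative names, and the expected types are copied from the parameter table rather than read from the library.

```python
# Option names and expected types, copied from the parameter table.
OPTION_TYPES = {
    "tokenize": bool,
    "tokenize_engine": str,
    "sent": bool,
    "sent_engine": str,
    "pos": bool,
    "pos_engine": str,
    "pos_corpus": str,
    "ner": bool,
    "ner_engine": str,
    "dependency_parsing": bool,
    "dependency_parsing_engine": str,
    # The table lists str with a default of None, so both are accepted here.
    "dependency_parsing_model": (str, type(None)),
    "word_vector": bool,
    "word_vector_model": str,
}


def validate_config(config):
    """Return a list of problems found in config; an empty list means none."""
    problems = []
    for key, value in config.items():
        if key not in OPTION_TYPES:
            problems.append(f"unknown option: {key}")
            continue
        expected = OPTION_TYPES[key]
        if not isinstance(value, expected):
            types = expected if isinstance(expected, tuple) else (expected,)
            names = " or ".join(t.__name__ for t in types)
            problems.append(
                f"{key}: expected {names}, got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    # 'poss' is a deliberate typo and "yes" is a deliberate type error.
    for problem in validate_config({"poss": True, "ner": "yes"}):
        print(problem)
```

Running the check on a dictionary before handing it to `nlp.add_pipe()` catches typos early; it does not verify that an engine or model name is one PyThaiNLP actually provides.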
Important Notes:
- When `dependency_parsing` is enabled, word segmentation and sentence segmentation are automatically disabled so that the tokenization from the dependency parser is used.
- All configuration options are optional and have sensible defaults.
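
The two notes above can be modelled in a few lines. This is a sketch of the described behaviour only, assuming the defaults listed in the parameter table; `DEFAULTS` and `effective_config` are hypothetical names and this is not the package's own code.

```python
# Defaults copied from the parameter table.
DEFAULTS = {
    "tokenize": False,
    "tokenize_engine": "newmm",
    "sent": True,
    "sent_engine": "crfcut",
    "pos": True,
    "pos_engine": "perceptron",
    "pos_corpus": "orchid_ud",
    "ner": True,
    "ner_engine": "thainer",
    "dependency_parsing": False,
    "dependency_parsing_engine": "esupar",
    "dependency_parsing_model": None,
    "word_vector": True,
    "word_vector_model": "thai2fit_wv",
}


def effective_config(overrides=None):
    """Merge user overrides onto the defaults and apply the parsing rule."""
    config = {**DEFAULTS, **(overrides or {})}
    if config["dependency_parsing"]:
        # The dependency parser supplies its own tokens, so word and
        # sentence segmentation are switched off.
        config["tokenize"] = False
        config["sent"] = False
    return config


if __name__ == "__main__":
    config = effective_config({"dependency_parsing": True})
    print(config["tokenize"], config["sent"])
```

Passing only the keys you want to change is enough; every other option keeps its default.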
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
Copyright 2016-2026 PyThaiNLP Project
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.