functionwordsets

Comprehensive multilingual function-word datasets with a simple Python API

Overview

functionwordsets ships ready-to-use function-word lists for several languages and time periods.
Each dataset is a tiny Python module located in functionwordsets/datasets/ and is loaded on demand through a minimal API.

Supported out of the box:

ID	Language / period	Entries*
`fr_21c`	French – 21st century	688
`en_21c`	English – 21st century	390
`sp_21c`	Spanish – 21st century	481
`it_21c`	Italian – 21st century	495
`nl_21c`	Dutch – 21st century	287
`gr_5cbc`	Ancient Greek – 5th-4th c. BCE	264
`oc_13c`	Old Occitan – 12th-13th c.	360
`la_1cbc`	Classical Latin – 1st c. BCE	353

*Number of distinct word-forms in the union of all categories.

You can also add your own datasets or fork the existing ones: drop a <id>.py file into functionwordsets/datasets/, following the template in the Dataset layout section.

💡 Supported grammatical categories

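Category keys used in this README include articles, prepositions, coord_conj, subord_conj, negations and the aux_<lemma> auxiliary keys. The exact set varies by language; see each dataset file for the full list.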

Installation

pip install functionwordsets         # from PyPI
# or, from a cloned repo
pip install -e .

Python ≥ 3.8 – zero runtime dependencies – wheel < 20 kB zipped.

Quick start

import functionwordsets as fw

# List available datasets
print(fw.available_ids())            # ['fr_21c', 'en_21c', …]

# Load one set (defaults to fr_21c)
fr = fw.load()                       # same as fw.load('fr_21c')
print(fr.name, len(fr.all))          # French – 21st century 688

# Membership test
if 'ne' in fr.all:
    ...

# Build a custom stop-set: only articles + prepositions
stops = fr.subset(['articles', 'prepositions'])
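
As a rough illustration of how such a stop-set might be used, the sketch below filters function words out of a tokenized sentence. To stay runnable without the package installed, it uses a small hand-written set as a stand-in for the value returned by fr.subset(...). The helper name strip_function_words and the sample words are assumptions made for this example, not part of functionwordsets.

```python
# Hypothetical stand-in for fr.subset(['articles', 'prepositions']);
# the real call returns many more entries.
stops = frozenset({"le", "la", "les", "de", "à", "dans"})


def strip_function_words(tokens, stop_set):
    """Return the tokens whose lowercased form is not in stop_set."""
    return [tok for tok in tokens if tok.lower() not in stop_set]


tokens = "Le chat dort dans la maison".split()
content_words = strip_function_words(tokens, stops)
print(content_words)  # ['chat', 'dort', 'maison']
```

Lowercasing before the membership test assumes the lists store lowercase forms; check the dataset file if your language needs a different normalization.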

Command-line helpers

# List dataset IDs
fw-list

# Export every French function word to a text file
fw-export fr_21c -o fr.txt

# Export only conjunctions & negations from Spanish as JSON
fw-export sp_21c --include coord_conj subord_conj negations -o sp_stop.json

Dataset layout

Internally each dataset is defined as a small Python dictionary:

data = {
    "name": "English – 21st century",
    "language": "en",
    "period": "21c",
    "categories": {
        "articles": [...],
        "prepositions": [...],
        # …
    }
}

functionwordsets treats the object as read-only, so feel free to edit or extend it in your fork.
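
To make the layout concrete, here is a minimal sketch of how a dictionary in this shape can be flattened into a set of distinct word-forms (the union of all categories) or into a subset of selected categories. The toy word lists and the helper names all_forms and subset_of are assumptions for illustration; they are not the package's internal implementation.

```python
# Toy dataset in the layout shown above; the word lists are illustrative only.
data = {
    "name": "Toy English",
    "language": "en",
    "period": "21c",
    "categories": {
        "articles": ["a", "an", "the"],
        "prepositions": ["in", "on", "for"],
        "coord_conj": ["and", "but", "for"],  # "for" appears twice on purpose
    },
}


def all_forms(dataset):
    """Union of every category, so a repeated word-form is counted once."""
    return frozenset(
        word for words in dataset["categories"].values() for word in words
    )


def subset_of(dataset, keys):
    """Union of the named categories only; an unknown key raises KeyError."""
    return frozenset(word for key in keys for word in dataset["categories"][key])


print(len(all_forms(data)))  # 8
print(sorted(subset_of(data, ["articles", "prepositions"])))
# ['a', 'an', 'for', 'in', 'on', 'the']
```

Counting over the union rather than summing the category lengths is why a word listed under two categories contributes only once to a total.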

Notes on auxiliary categories

Keys for auxiliary verbs follow the pattern aux_<lemma> (e.g. aux_être, aux_be, aux_ser). They vary by language; see each dataset file for the exact key.
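
Because the auxiliary keys differ between datasets, one way to discover them is to scan the category names for the aux_ prefix. The sketch below does this on a hand-written categories dictionary; the helper name auxiliary_keys, the aux_have key and the word lists are hypothetical and only serve the example.

```python
def auxiliary_keys(categories):
    """Return the category keys that follow the aux_<lemma> pattern, sorted."""
    return sorted(key for key in categories if key.startswith("aux_"))


# Illustrative categories mapping; real datasets define their own keys.
categories = {
    "articles": ["the"],
    "aux_be": ["am", "is", "are"],
    "aux_have": ["have", "has", "had"],  # hypothetical key
}

print(auxiliary_keys(categories))  # ['aux_be', 'aux_have']
```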

Enjoy!
