muse

This repository contains in-progress experimental research software for the CDH project MuSE (Multilingual Semantic Embeddings).

For developer setup instructions, including Google Cloud Translation and Hugging Face authentication configuration, see DEVELOPERNOTES.md.

Phase 1

In the first phase of the project, we assess how well off-the-shelf multilingual translation models perform in the music-theoretical domain.

Models

We evaluate three models: a commercial state-of-the-art model and two open-weights models available on 🤗 Hugging Face.

TTLM. Google's Translation LLM (TTLM) model available through Google Cloud Translation.
HY-MT1.5. Tencent's Hunyuan Translation Model Version 1.5. We use the 1.8B parameter model.
TranslateGemma. Google's TranslateGemma translation model that supports over 400 languages. We use the 4B parameter model. Note: This is a gated model that requires authentication via Hugging Face. See DEVELOPERNOTES.md for more details.

During this phase, we also experimented with two additional open-weights models available on 🤗 Hugging Face. While we ultimately chose not to include them in our evaluation, our translation module supports them.

NLLB-200. Facebook AI Research's No Language Left Behind (NLLB) translation model that supports over 200 languages. We used the 3.3B parameter model.
MADLAD-400. Google's MADLAD-400 translation model that supports over 400 languages. We used the 3B parameter model.

Software Pipeline

The software for this phase can be broken into four stages: (1) building parallel corpora, (2) translating these corpora, (3) evaluating the translations via machine translation metrics, and (4) evaluating the translations via human annotation tasks. Each of these stages corresponds to a module within the muse package.

parallel_corpus: module for building the parallel text corpora used to assess select machine translation models
translation: module for generating machine translations with select machine translation models
evaluation: module for evaluating machine translations via quantitative metrics
annotation: module for supporting our Prodigy annotation tasks
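
The four stages above can be illustrated with a minimal, self-contained sketch. This is not the muse package's API: every name here (Segment, build_parallel_corpus, translate, token_f1, export_for_annotation, stub_translator) is hypothetical, the example sentences are invented, the translator is a lookup table standing in for a real model, and the token-overlap F1 is only a toy stand-in for real machine translation metrics.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    """One aligned pair in a toy parallel corpus (hypothetical structure)."""
    source: str
    reference: str


def build_parallel_corpus(pairs):
    """Stage 1: wrap aligned (source, reference) pairs as Segment records."""
    return [Segment(source=s, reference=r) for s, r in pairs]


def translate(corpus, translate_fn):
    """Stage 2: apply any callable translator to each source segment."""
    return [translate_fn(seg.source) for seg in corpus]


def token_f1(hypothesis, reference):
    """Stage 3: toy token-overlap F1 (a stand-in, not a real MT metric)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def export_for_annotation(corpus, hypotheses, scores):
    """Stage 4: flatten results into dict records for human review."""
    return [
        {
            "source": seg.source,
            "reference": seg.reference,
            "translation": hyp,
            "score": score,
        }
        for seg, hyp, score in zip(corpus, hypotheses, scores)
    ]


# Invented example data; a lookup table replaces a real model call.
GLOSSARY = {"Der Dreiklang": "The triad", "Die Tonleiter": "The scale"}


def stub_translator(text):
    """Return the glossary entry, or the input unchanged if unknown."""
    return GLOSSARY.get(text, text)


corpus = build_parallel_corpus(
    [("Der Dreiklang", "The triad"), ("Die Tonleiter", "The musical scale")]
)
hypotheses = translate(corpus, stub_translator)
scores = [token_f1(h, seg.reference) for h, seg in zip(hypotheses, corpus)]
records = export_for_annotation(corpus, hypotheses, scores)
```

Passing the translator in as a callable keeps the corpus, metric, and export steps independent of any particular model, which mirrors the one-module-per-stage split described above.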

Additional Materials

Below is a list of additional materials created during this phase:

notebooks/mt_browser.py: marimo notebook for viewing and exploring machine translation corpora and their quantitative evaluation scores
docs/data-design.md: living document for recording the current designs of the various data produced during the workflows of this project
docs/della-guide.md: guide for running MuSE translation jobs on Della
examples/slurm/translate-della.slurm: example Slurm script for running machine translation jobs on Della
test_scripts: test scripts created during development

License

This project is licensed under the Apache 2.0 License.
