This repository contains in-progress experimental research software for the CDH project MuSE (Multilingual Semantic Embeddings).
For developer setup instructions, including Google Cloud Translation and Hugging Face authentication configuration, see DEVELOPERNOTES.md.
In the first phase of the project, we assess how well off-the-shelf multilingual translation models perform in the music-theoretical domain.
We evaluate three models: a commercial state-of-the-art model and two open-weights models available on 🤗 Hugging Face.
- TTLM. Google's Translation LLM (TTLM) model, available through Google Cloud Translation.
- HY-MT1.5. Tencent's Hunyuan Translation Model Version 1.5. We use the 1.8B parameter model.
- TranslateGemma. Google's TranslateGemma translation model that supports over 400 languages. We use the 4B parameter model. Note: This is a gated model that requires authentication via Hugging Face. See DEVELOPERNOTES.md for more details.
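Because TranslateGemma is gated, a job that reaches the model-loading step without credentials fails only after startup. The following is a minimal sketch of an early check, assuming the token is supplied through the `HF_TOKEN` environment variable, which the `huggingface_hub` library reads. The helper name `require_hf_token` is hypothetical and not part of the muse package; DEVELOPERNOTES.md remains the reference for the project's actual setup.

```python
import os
from typing import Mapping


def require_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from the environment, or fail early.

    Hypothetical helper for illustration; not part of the muse package.
    """
    # Treat a missing or whitespace-only value as "not configured".
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated models such as TranslateGemma "
            "require Hugging Face authentication (see DEVELOPERNOTES.md)."
        )
    return token
```

Calling a check like this at the start of a translation job surfaces a missing token before any model weights are requested.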
During this phase, we also experimented with two additional open-weights models available on 🤗 Hugging Face. While we ultimately chose not to include them in our evaluation, our translation module supports them.
- NLLB-200. Facebook AI Research's No Language Left Behind (NLLB) translation model that supports over 200 languages. We used the 3.3B parameter model.
- MADLAD-400. Google's MADLAD-400 translation model that supports over 400 languages. We used the 3B parameter model.
The software for this phase can be broken into four stages: (1) building parallel corpora, (2) translating these corpora, (3) evaluating these translations via machine translation metrics, and (4) evaluating these machine translations via human annotation tasks.
Each of these stages corresponds to a module within the muse package.
- parallel_corpus: module for building the parallel text corpora used to assess select machine translation models
- translation: module for generating machine translations with select machine translation models
- evaluation: module for evaluating machine translations via quantitative metrics
- annotation: module for supporting our Prodigy annotation tasks
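The data flow across the four stages can be sketched as follows. This is an illustration only: every name in it (`SentencePair`, `build_corpus`, `translate_corpus`, `token_overlap`, `select_for_annotation`) is hypothetical and does not reflect the muse package's actual API, the overlap score is a toy stand-in for real machine translation metrics, and the sample sentences are invented.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SentencePair:
    source: str                       # source-language sentence
    reference: str                    # human reference translation
    hypothesis: Optional[str] = None  # machine translation, filled in at stage 2


def build_corpus(sources: List[str], references: List[str]) -> List[SentencePair]:
    """Stage 1: align source sentences with their reference translations."""
    if len(sources) != len(references):
        raise ValueError("sources and references must be the same length")
    return [SentencePair(s, r) for s, r in zip(sources, references)]


def translate_corpus(
    corpus: List[SentencePair], translate_fn: Callable[[str], str]
) -> List[SentencePair]:
    """Stage 2: attach a machine translation to every pair."""
    for pair in corpus:
        pair.hypothesis = translate_fn(pair.source)
    return corpus


def token_overlap(hypothesis: str, reference: str) -> float:
    """Stage 3 (toy metric): fraction of reference tokens found in the hypothesis."""
    ref_tokens = reference.lower().split()
    hyp_tokens = set(hypothesis.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(tok in hyp_tokens for tok in ref_tokens) / len(ref_tokens)


def select_for_annotation(
    corpus: List[SentencePair], scores: List[float], threshold: float = 0.5
) -> List[SentencePair]:
    """Stage 4: pick low-scoring pairs to send to human annotators."""
    return [pair for pair, score in zip(corpus, scores) if score < threshold]


# Invented sample data and a dictionary lookup standing in for a real model.
fake_model = {"Der Dreiklang": "the triad", "die Quinte": "a quint"}
corpus = build_corpus(["Der Dreiklang", "die Quinte"], ["the triad", "the fifth"])
corpus = translate_corpus(corpus, fake_model.__getitem__)
scores = [token_overlap(p.hypothesis or "", p.reference) for p in corpus]
flagged = select_for_annotation(corpus, scores)
```

Keeping the hypothesis on the same record as the source and reference means each later stage can consume the output of the previous one without re-aligning sentences.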
Below is a list of additional materials created during this phase:
- notebooks/mt_browser.py: marimo notebook for viewing and exploring machine translation corpora and their quantitative evaluation scores
- docs/data-design.md: living document for recording the current designs of the various data produced during the workflows of this project
- docs/della-guide.md: guide for running MuSE translation jobs on Della
- examples/slurm/translate-della.slurm: example Slurm script for running machine translation jobs on Della
- test_scripts: test scripts created during development
This project is licensed under the Apache 2.0 License.
(c) 2025-2026 Trustees of Princeton University. Permission granted for non-commercial distribution online under a standard Open Source license.