Add TTS evaluation by EduardoPach · Pull Request #100 · argmaxinc/OpenBench

EduardoPach · 2026-05-01T13:56:39Z

What does this PR do

Adds end-to-end TTS evaluation to OpenBench: a speech-generation pipeline produces audio from a text prompt, and a WER metric transcribes that audio with any registered ASR backend and scores the transcript against the prompt. Because transcription runs inside the metric rather than the pipeline, the reported prediction_time reflects synthesis only.
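For readers unfamiliar with the score itself, here is a minimal sketch of word error rate as it is commonly defined: word-level edit distance divided by the number of reference words. The PR text does not show the metric's actual scoring code or its text normalization, so the function name `word_error_rate` and the lowercase-and-split normalization below are assumptions for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption, not OpenBench behavior.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

In the TTS setting the reference is the text prompt and the hypothesis is the ASR transcript of the generated audio, so one misheard word in a four-word prompt gives a score of 0.25.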

What's in

Speech-generation infra: PipelineType.SPEECH_GENERATION, SpeechGenerationSample/Dataset, GeneratedAudio prediction, runner/wandb/CLI plumbing.
SpeechGenerationWordErrorRate: accepts a TranscriptionConfig, a registered pipeline alias string, or None (defaults to whisperkit-large-v3-turbo). ASR pipeline is built lazily.
ArgmaxOSS TTS: new ArgmaxOpenSourceEngine.tts(), new ArgmaxOpenSourceSpeechGenerationPipeline, new argmax-oss-speech-generation alias.
Infra: PipelineRegistry.create_pipeline_from_config() reverse-lookup; process-wide CLI-path cache so the WER metric reuses the TTS pipeline's argmax-cli build.

Heads-up

Python 3.11+ now required (StrEnum for TTS speaker/language values).

Try it

uv run openbench-cli evaluate -p argmax-oss-speech-generation \
    -d customer-service-tts-prompts-vocalized -m wer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…line Extend the engine with a tts() method mirroring transcribe()/diarize(), add a new ArgmaxOpenSourceSpeechGenerationPipeline that uses it, and retire the old WhisperKit-based speech-gen pipeline (whisperkit-cli no longer exists since #99). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add PipelineRegistry.create_pipeline_from_config that reverse-looks up a registered Pipeline by matching _config_class, then instantiates it. Refactor SpeechGenerationWordErrorRate to accept a TranscriptionConfig instead of a fully-built Pipeline — callers no longer have to construct the transcription pipeline themselves; the metric does it lazily on first use. Default remains WhisperKitPro / parakeet-v2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
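The reverse lookup described in this commit can be sketched roughly as follows. This is a toy registry, not the OpenBench implementation: the `SketchRegistry` class, its `register` method, the `_pipelines` dict, and both config and pipeline classes are hypothetical stand-ins, and matching on the exact config type is an assumption. Only the `_config_class` attribute and the `create_pipeline_from_config` name come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class SketchTranscriptionConfig:  # hypothetical stand-in config
    model: str = "demo-asr"


@dataclass
class SketchSpeechConfig:  # hypothetical stand-in config
    voice: str = "demo-voice"


class SketchTranscriptionPipeline:
    _config_class = SketchTranscriptionConfig

    def __init__(self, config):
        self.config = config


class SketchSpeechPipeline:
    _config_class = SketchSpeechConfig

    def __init__(self, config):
        self.config = config


class SketchRegistry:
    """Toy registry: maps alias -> pipeline class."""

    _pipelines = {}

    @classmethod
    def register(cls, alias, pipeline_class):
        cls._pipelines[alias] = pipeline_class

    @classmethod
    def create_pipeline_from_config(cls, config):
        # Reverse lookup: find the registered pipeline whose declared
        # config class matches the type of the config we were handed.
        for pipeline_class in cls._pipelines.values():
            if type(config) is pipeline_class._config_class:
                return pipeline_class(config)
        raise ValueError(f"no pipeline registered for {type(config).__name__}")


SketchRegistry.register("sketch-asr", SketchTranscriptionPipeline)
SketchRegistry.register("sketch-tts", SketchSpeechPipeline)
```

With a lookup like this, a caller that only holds a config object can get a ready pipeline without knowing which pipeline class owns it.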

…rgmax-oss Switch the default ASR from WhisperKitPro / parakeet-v2 to the Argmax OSS 'whisperkit-large-v3-turbo' alias so the metric works out of the box without WHISPERKITPRO_CLI_PATH. The transcription argument now also accepts a pipeline alias string (resolved via PipelineRegistry.create_pipeline) in addition to a TranscriptionConfig, so callers can point at any registered alias without constructing a config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
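A rough sketch of how a metric might accept a config, an alias string, or None and still defer the expensive pipeline build until first use. Everything here except the default alias string is hypothetical: `LazyAsrMetric`, `FakeAsrPipeline`, `SketchConfig`, and the two `build_from_*` helpers stand in for the real metric, pipeline, config, and registry calls, whose signatures are not shown in the PR text.

```python
from dataclasses import dataclass
from typing import Optional, Union

DEFAULT_ALIAS = "whisperkit-large-v3-turbo"  # default named in the PR


@dataclass
class SketchConfig:  # hypothetical stand-in for a transcription config
    model: str


class FakeAsrPipeline:
    """Hypothetical pipeline that counts how often it gets built."""

    built = 0

    def __init__(self, source: str):
        FakeAsrPipeline.built += 1
        self.source = source


def build_from_alias(alias: str) -> FakeAsrPipeline:
    # Stands in for an alias lookup in the registry.
    return FakeAsrPipeline(f"alias:{alias}")


def build_from_config(config: SketchConfig) -> FakeAsrPipeline:
    # Stands in for a config-based reverse lookup in the registry.
    return FakeAsrPipeline(f"config:{config.model}")


class LazyAsrMetric:
    def __init__(self, transcription: Optional[Union[SketchConfig, str]] = None):
        # None falls back to the default alias; nothing is built yet.
        self._transcription = DEFAULT_ALIAS if transcription is None else transcription
        self._pipeline = None

    @property
    def pipeline(self) -> FakeAsrPipeline:
        # Build on first access only, then reuse the same instance.
        if self._pipeline is None:
            if isinstance(self._transcription, str):
                self._pipeline = build_from_alias(self._transcription)
            else:
                self._pipeline = build_from_config(self._transcription)
        return self._pipeline
```

The design point is that constructing the metric stays cheap, so listing or configuring metrics never triggers an ASR model load.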

The 'Generated TTS audio' line and the 'TTS WER hypothesis transcript' line both fire on every sample, which spams the run log and leaks the predicted transcript into INFO-level output. Drop them to DEBUG; the per-run summary printed by the benchmark runner is still INFO. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
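The effect of this change can be illustrated with the standard `logging` module. The logger name, function names, and message wording below are made up for the example; only the split between per-sample DEBUG lines and an INFO summary reflects the commit.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("tts_eval_sketch")


def log_sample(sample_id: int, hypothesis: str) -> None:
    # Per-sample detail: hidden unless the logger is set to DEBUG.
    logger.debug("hypothesis transcript for sample %d: %s", sample_id, hypothesis)


def log_summary(num_samples: int, wer: float) -> None:
    # Per-run summary: still visible at the default INFO level.
    logger.info("evaluated %d samples, WER=%.3f", num_samples, wer)
```

With the logger at INFO, a run over many samples emits one summary line and no transcripts; switching the level to DEBUG brings the per-sample lines back for troubleshooting.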

ArgmaxOpenSourceEngine.__init__ now consults a module-level dict keyed by (resolved cache_root, commit_hash) before cloning and building. The second engine constructed in a run — e.g. the WER metric's transcription engine after the TTS engine — reuses the resolved cli_path instead of re-running `swift build`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
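A minimal sketch of a process-wide cache keyed by (resolved cache root, commit hash), as this commit describes. The names `_CLI_PATH_CACHE`, `resolve_cli_path`, and `_clone_and_build`, as well as the returned path layout, are assumptions for illustration; the fake build step only records that it was called instead of cloning or running `swift build`.

```python
from pathlib import Path

# Module-level dict: shared by every caller in the same process.
_CLI_PATH_CACHE = {}

# Records each (cache_root, commit_hash) that triggered a fake build.
build_calls = []


def _clone_and_build(cache_root: Path, commit_hash: str) -> Path:
    # Stand-in for the expensive clone-and-build step.
    build_calls.append((cache_root, commit_hash))
    # Hypothetical output location, not the real build layout.
    return cache_root / commit_hash / "cli-binary"


def resolve_cli_path(cache_root, commit_hash: str) -> Path:
    # Resolve the root first so "cache" and "./cache" share one key.
    key = (Path(cache_root).expanduser().resolve(), commit_hash)
    if key not in _CLI_PATH_CACHE:
        _CLI_PATH_CACHE[key] = _clone_and_build(*key)
    return _CLI_PATH_CACHE[key]
```

Keying on the resolved path matters: two engines that spell the same directory differently would otherwise each pay for a build.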

Mirror the argmax-cli tts --help allowed values: invalid speakers / languages now fail at config construction with a clear Pydantic error instead of being passed to the CLI and failing there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

StrEnum (added to the enum module in Python 3.11) gives us typed CLI values without the verbosity of a long Literal[...] and without an explicit .value when the string flows through to subprocess.run, because StrEnum members are str subclasses. Drop 3.10 from requires-python and .python-version so we can rely on StrEnum and any other 3.11-only stdlib bits going forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dberkin1 and others added 16 commits February 13, 2026 20:52

Add TTS evals

1e59781

refactor

7d3ab4d

reformatting

d418181

reformat

ef86575

Merge remote-tracking branch 'origin/main' into eduardo/tts-evals

d0a9216

Slim TTS evaluation: TTS-only pipeline, WER metric does ASR

4b7aca5

Address review: drop common.py, metric takes TranscriptionPipeline

e94ae97

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge branch 'main' into eduardo/tts-evals

49f5e3c

nit: copyright date

c6b0eb3

EduardoPach wants to merge 16 commits into main from eduardo/tts-evals
