Add TTS evaluation #100

Draft
EduardoPach wants to merge 16 commits into main from eduardo/tts-evals

EduardoPach (Collaborator) commented May 1, 2026

What does this PR do

Adds end-to-end TTS evaluation to OpenBench. A speech-generation pipeline produces audio from a text prompt; a WER metric then transcribes that audio with any registered ASR backend and scores the transcript against the prompt. Because transcription runs inside the metric rather than the pipeline, the reported prediction_time reflects synthesis only.
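
For readers unfamiliar with the metric: word error rate is conventionally the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. Below is a minimal sketch of that calculation, with the prompt as the reference and the ASR transcript as the hypothesis. The `word_error_rate` function and its lowercase/whitespace normalization are illustrative assumptions, not the code in this PR.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion: reference word missing
                    curr[j - 1] + 1,     # insertion: extra hypothesis word
                    prev[j - 1] + cost,  # substitution or exact match
                )
            )
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, one substituted word in a four-word prompt gives a WER of 0.25, and a WER above 1.0 is possible when the transcript contains many extra words.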

What's in

  • Speech-generation infra: PipelineType.SPEECH_GENERATION, SpeechGenerationSample/Dataset, GeneratedAudio prediction, runner/wandb/CLI plumbing.
  • SpeechGenerationWordErrorRate: accepts a TranscriptionConfig, a registered pipeline alias string, or None (defaults to whisperkit-large-v3-turbo). The ASR pipeline is built lazily, on first use.
  • ArgmaxOSS TTS: new ArgmaxOpenSourceEngine.tts(), new ArgmaxOpenSourceSpeechGenerationPipeline, new argmax-oss-speech-generation alias.
  • Infra: PipelineRegistry.create_pipeline_from_config() reverse-lookup; process-wide CLI-path cache so the WER metric reuses the TTS pipeline's argmax-cli build.

Heads-up

  • Python 3.11+ now required (StrEnum for TTS speaker/language values).

Try it

uv run openbench-cli evaluate -p argmax-oss-speech-generation \
    -d customer-service-tts-prompts-vocalized -m wer

dberkin1 and others added 16 commits February 13, 2026 20:52
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line

Extend the engine with a tts() method mirroring transcribe()/diarize(),
add a new ArgmaxOpenSourceSpeechGenerationPipeline that uses it, and
retire the old WhisperKit-based speech-gen pipeline (whisperkit-cli no
longer exists since #99).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add PipelineRegistry.create_pipeline_from_config that reverse-looks up
a registered Pipeline by matching _config_class, then instantiates it.
Refactor SpeechGenerationWordErrorRate to accept a TranscriptionConfig
instead of a fully-built Pipeline — callers no longer have to construct
the transcription pipeline themselves; the metric does it lazily on
first use. Default remains WhisperKitPro / parakeet-v2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
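
The reverse lookup and lazy construction described in this commit could be sketched roughly as follows. Only the names `_config_class` and `create_pipeline_from_config` come from the commit message; `ToyRegistry`, `LazyWerMetric`, the stand-in config and pipeline classes, and the exact-type matching rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionConfig:
    """Stand-in for the real config class (fields are hypothetical)."""
    model: str = "example-model"


class TranscriptionPipeline:
    """Stand-in pipeline that declares which config type it accepts."""
    _config_class = TranscriptionConfig

    def __init__(self, config: TranscriptionConfig) -> None:
        self.config = config


class ToyRegistry:
    """Hypothetical registry holding the known pipeline classes."""
    _pipelines = [TranscriptionPipeline]

    @classmethod
    def create_pipeline_from_config(cls, config):
        # Reverse lookup: find the pipeline whose _config_class matches
        # the type of the config we were handed, then instantiate it.
        # Exact-type matching is an assumption of this sketch.
        for pipeline_cls in cls._pipelines:
            if type(config) is pipeline_cls._config_class:
                return pipeline_cls(config)
        raise ValueError(f"no pipeline registered for {type(config).__name__}")


class LazyWerMetric:
    """Stores the config and builds the ASR pipeline on first access."""

    def __init__(self, transcription: TranscriptionConfig) -> None:
        self._transcription = transcription
        self._pipeline = None  # nothing is built at construction time

    @property
    def pipeline(self):
        if self._pipeline is None:
            self._pipeline = ToyRegistry.create_pipeline_from_config(
                self._transcription
            )
        return self._pipeline
```

Deferring construction this way means that creating the metric is cheap, and the cost of building the transcription pipeline is only paid if the metric is actually evaluated.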
…rgmax-oss

Switch the default ASR from WhisperKitPro / parakeet-v2 to the Argmax OSS
'whisperkit-large-v3-turbo' alias so the metric works out of the box
without WHISPERKITPRO_CLI_PATH. The transcription argument now also
accepts a pipeline alias string (resolved via PipelineRegistry.create_pipeline)
in addition to a TranscriptionConfig, so callers can point at any
registered alias without constructing a config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 'Generated TTS audio' line and the 'TTS WER hypothesis transcript'
line both fire on every sample, which spams the run log and leaks the
predicted transcript into INFO-level output. Drop them to DEBUG; the
per-run summary printed by the benchmark runner is still INFO.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
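
As an illustration of the logging change, the sketch below emits the two per-sample lines at DEBUG and keeps a per-run summary at INFO, using the standard `logging` module. The logger name, the function names, and the wording of the summary line are hypothetical; only the two per-sample message labels come from the commit message.

```python
import logging

logger = logging.getLogger("tts_eval_example")  # hypothetical logger name


def log_sample(audio_path: str, transcript: str) -> None:
    # Per-sample details: hidden unless the logger is set to DEBUG.
    logger.debug("Generated TTS audio: %s", audio_path)
    logger.debug("TTS WER hypothesis transcript: %s", transcript)


def log_summary(wer: float) -> None:
    # Per-run summary: still visible at the INFO level.
    logger.info("Run summary: WER=%.3f", wer)
```

With the logger level at INFO, only the summary line is emitted, so predicted transcripts no longer appear in normal run output.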
ArgmaxOpenSourceEngine.__init__ now consults a module-level dict keyed
by (resolved cache_root, commit_hash) before cloning and building. The
second engine constructed in a run — e.g. the WER metric's transcription
engine after the TTS engine — reuses the resolved cli_path instead of
re-running `swift build`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
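
A minimal sketch of such a process-wide cache, assuming a module-level dict keyed by the resolved cache root and the commit hash as the commit message describes. The helper names, the returned binary path layout, and the `build_log` list are hypothetical and exist only to make the example observable.

```python
from pathlib import Path

# Module-level cache shared by every engine constructed in this process.
_CLI_PATH_CACHE: dict[tuple[Path, str], Path] = {}

# Records each simulated build so the example can show how often it runs.
build_log: list[tuple[Path, str]] = []


def _clone_and_build(cache_root: Path, commit_hash: str) -> Path:
    """Stand-in for the expensive clone and `swift build` step."""
    build_log.append((cache_root, commit_hash))
    # Hypothetical location of the built binary.
    return cache_root / commit_hash / "argmax-cli"


def resolve_cli_path(cache_root: str, commit_hash: str) -> Path:
    # Resolve the root first so that equivalent spellings of the same
    # directory map to a single cache entry.
    key = (Path(cache_root).expanduser().resolve(), commit_hash)
    if key not in _CLI_PATH_CACHE:
        _CLI_PATH_CACHE[key] = _clone_and_build(*key)
    return _CLI_PATH_CACHE[key]
```

Under these assumptions, a second call with the same root and commit returns the cached path without triggering another build, while a different commit hash produces a separate entry.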
Mirror the argmax-cli tts --help allowed values: invalid speakers /
languages now fail at config construction with a clear Pydantic error
instead of being passed to the CLI and failing there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
StrEnum (added to the enum module in Python 3.11) gives us typed CLI values without the
verbosity of a long Literal[...] and without an explicit .value when
the string flows through to subprocess.run, since StrEnum members are
str instances. Drop 3.10 from requires-python and .python-version so we
can rely on StrEnum and any other 3.11-only stdlib bits going forward.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>