Add TTS evaluation #100
Draft
EduardoPach wants to merge 16 commits into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…line Extend the engine with a tts() method mirroring transcribe()/diarize(), add a new ArgmaxOpenSourceSpeechGenerationPipeline that uses it, and retire the old WhisperKit-based speech-gen pipeline (whisperkit-cli no longer exists since #99). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add PipelineRegistry.create_pipeline_from_config that reverse-looks up a registered Pipeline by matching _config_class, then instantiates it. Refactor SpeechGenerationWordErrorRate to accept a TranscriptionConfig instead of a fully-built Pipeline — callers no longer have to construct the transcription pipeline themselves; the metric does it lazily on first use. Default remains WhisperKitPro / parakeet-v2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
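The reverse lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ToyRegistry`, `ToyTranscriptionPipeline`, and the stand-in `TranscriptionConfig` are hypothetical names, and it assumes each pipeline class declares its config type in a `_config_class` attribute, as the commit message suggests.

```python
from dataclasses import dataclass


@dataclass
class TranscriptionConfig:
    """Hypothetical stand-in for a real pipeline config."""
    model: str = "example-model"


class Pipeline:
    _config_class = None  # each concrete pipeline declares its config type

    def __init__(self, config):
        self.config = config


class ToyTranscriptionPipeline(Pipeline):
    _config_class = TranscriptionConfig


class ToyRegistry:
    """Illustrative registry; not the real PipelineRegistry."""
    _pipelines = {}

    @classmethod
    def register(cls, alias, pipeline_cls):
        cls._pipelines[alias] = pipeline_cls

    @classmethod
    def create_pipeline_from_config(cls, config):
        # Reverse lookup: find the registered pipeline whose config class
        # is exactly the type of the config we were handed, then build it.
        for pipeline_cls in cls._pipelines.values():
            if pipeline_cls._config_class is type(config):
                return pipeline_cls(config)
        raise ValueError(f"No registered pipeline accepts {type(config).__name__}")


ToyRegistry.register("toy-asr", ToyTranscriptionPipeline)
pipeline = ToyRegistry.create_pipeline_from_config(TranscriptionConfig())
```

Matching on the exact type keeps the lookup unambiguous in this sketch; whether the real registry also accepts config subclasses is not stated in the commit.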
…rgmax-oss Switch the default ASR from WhisperKitPro / parakeet-v2 to the Argmax OSS 'whisperkit-large-v3-turbo' alias so the metric works out of the box without WHISPERKITPRO_CLI_PATH. The transcription argument now also accepts a pipeline alias string (resolved via PipelineRegistry.create_pipeline) in addition to a TranscriptionConfig, so callers can point at any registered alias without constructing a config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
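A rough sketch of how a metric could accept `None`, an alias string, or a config and still defer construction until first use. The class `LazyWerMetric` and the two factory callables are hypothetical; in the PR the factories would correspond to `PipelineRegistry.create_pipeline` and `PipelineRegistry.create_pipeline_from_config`, whose exact signatures are assumed here.

```python
from dataclasses import dataclass

DEFAULT_ASR_ALIAS = "whisperkit-large-v3-turbo"  # default named in the commit


@dataclass
class TranscriptionConfig:
    """Hypothetical stand-in for the real config class."""
    model: str


class LazyWerMetric:
    """Illustrative only: resolves its ASR pipeline on first access."""

    def __init__(self, transcription=None, *, from_alias, from_config):
        # None falls back to the default alias; nothing is built yet.
        self._spec = DEFAULT_ASR_ALIAS if transcription is None else transcription
        self._from_alias = from_alias
        self._from_config = from_config
        self._pipeline = None

    @property
    def pipeline(self):
        if self._pipeline is None:
            if isinstance(self._spec, str):
                self._pipeline = self._from_alias(self._spec)
            else:
                self._pipeline = self._from_config(self._spec)
        return self._pipeline


built = []  # records every factory call so laziness is observable


def fake_from_alias(alias):
    built.append(("alias", alias))
    return f"pipeline:{alias}"


def fake_from_config(config):
    built.append(("config", config.model))
    return f"pipeline:{config.model}"


metric = LazyWerMetric(from_alias=fake_from_alias, from_config=fake_from_config)
calls_before_use = len(built)
resolved = metric.pipeline
resolved_again = metric.pipeline  # reuses the cached pipeline
```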
The 'Generated TTS audio' line and the 'TTS WER hypothesis transcript' line both fire on every sample, which spams the run log and leaks the predicted transcript into INFO-level output. Drop them to DEBUG; the per-run summary printed by the benchmark runner is still INFO. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
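The effect of the level change can be illustrated with the standard `logging` module. The logger name, helper functions, and summary format below are hypothetical; only the two per-sample message labels come from the commit message.

```python
import logging

logger = logging.getLogger("tts_eval_sketch")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def log_sample(audio_path, hypothesis):
    # Per-sample lines are DEBUG: hidden at the default INFO level.
    logger.debug("Generated TTS audio: %s", audio_path)
    logger.debug("TTS WER hypothesis transcript: %s", hypothesis)


def log_summary(wer):
    # The per-run summary stays at INFO.
    logger.info("Run summary: WER=%.3f", wer)


handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

log_sample("sample_0.wav", "hello world")
log_summary(0.125)
```

With the logger at INFO, only the summary line reaches the handler; setting the level to DEBUG would bring the per-sample lines back.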
ArgmaxOpenSourceEngine.__init__ now consults a module-level dict keyed by (resolved cache_root, commit_hash) before cloning and building. The second engine constructed in a run — e.g. the WER metric's transcription engine after the TTS engine — reuses the resolved cli_path instead of re-running `swift build`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
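A minimal sketch of a process-wide cache keyed by `(resolved cache_root, commit_hash)`. The function names, the returned binary path layout, and the `build_calls` list are assumptions for illustration; the real clone and `swift build` step is replaced by a placeholder.

```python
from pathlib import Path

# Module-level cache shared by every engine constructed in this process.
_CLI_PATH_CACHE: dict[tuple[Path, str], Path] = {}

build_calls = []  # records how often the expensive step runs


def _clone_and_build(cache_root, commit_hash):
    """Placeholder for the expensive clone + `swift build` step."""
    build_calls.append((cache_root, commit_hash))
    return cache_root / commit_hash / "argmax-cli"  # hypothetical layout


def resolve_cli_path(cache_root, commit_hash):
    # Resolve the root first so "cache" and "./cache" share one cache entry.
    key = (Path(cache_root).resolve(), commit_hash)
    if key not in _CLI_PATH_CACHE:
        _CLI_PATH_CACHE[key] = _clone_and_build(*key)
    return _CLI_PATH_CACHE[key]


first = resolve_cli_path("cache", "abc123")     # builds
second = resolve_cli_path("./cache", "abc123")  # cache hit, no rebuild
```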
Mirror the argmax-cli tts --help allowed values: invalid speakers / languages now fail at config construction with a clear Pydantic error instead of being passed to the CLI and failing there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
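The fail-fast idea can be sketched without Pydantic. The example below swaps in a plain dataclass with `__post_init__` checks as a stand-in for Pydantic validation, and the allowed speaker and language values are hypothetical placeholders rather than the actual `argmax-cli tts --help` values.

```python
from dataclasses import dataclass

# Hypothetical allowed values; the real ones mirror `argmax-cli tts --help`.
ALLOWED_SPEAKERS = frozenset({"speaker-a", "speaker-b"})
ALLOWED_LANGUAGES = frozenset({"english", "spanish"})


@dataclass(frozen=True)
class ToyTtsConfig:
    """Dataclass stand-in for a Pydantic config; validates at construction."""
    speaker: str
    language: str

    def __post_init__(self):
        if self.speaker not in ALLOWED_SPEAKERS:
            raise ValueError(
                f"speaker must be one of {sorted(ALLOWED_SPEAKERS)}, got {self.speaker!r}"
            )
        if self.language not in ALLOWED_LANGUAGES:
            raise ValueError(
                f"language must be one of {sorted(ALLOWED_LANGUAGES)}, got {self.language!r}"
            )


ok = ToyTtsConfig(speaker="speaker-a", language="english")

try:
    ToyTtsConfig(speaker="nobody", language="english")
    error_message = None
except ValueError as exc:
    error_message = str(exc)  # raised before any CLI call is attempted
```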
StrEnum (added to the standard library in Python 3.11) gives us typed CLI values without the verbosity of a long Literal[...] and without an explicit .value when the string flows through to subprocess.run, since StrEnum members are str subclasses. Drop 3.10 from requires-python and .python-version so we can rely on StrEnum and any other 3.11-only stdlib bits going forward. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What does this PR do
Adds end-to-end TTS evaluation to OpenBench: a speech-generation pipeline produces audio from a text prompt, and a WER metric transcribes that audio with any registered ASR backend and scores it against the prompt. Transcription runs inside the metric rather than the pipeline, so the reported `prediction_time` reflects synthesis only.
What's in
- `PipelineType.SPEECH_GENERATION`, `SpeechGenerationSample`/`Dataset`, `GeneratedAudio` prediction, runner/wandb/CLI plumbing.
- `SpeechGenerationWordErrorRate`: accepts a `TranscriptionConfig`, a registered pipeline alias string, or `None` (defaults to `whisperkit-large-v3-turbo`). The ASR pipeline is built lazily.
- `ArgmaxOpenSourceEngine.tts()`, new `ArgmaxOpenSourceSpeechGenerationPipeline`, new `argmax-oss-speech-generation` alias.
- `PipelineRegistry.create_pipeline_from_config()` reverse lookup; process-wide CLI-path cache so the WER metric reuses the TTS pipeline's `argmax-cli` build.
Heads-up
Try it