feat: add opt-in TwelveLabs Pegasus on-screen visual context #580
Open
mohit-twelvelabs wants to merge 1 commit into
The summary/terminology step works from the transcript only and is blind to on-screen text, product/brand names, UI labels and charts. When enabled via the new pegasus config block, TwelveLabs Pegasus describes that visual layer once and feeds it into the summary prompt to disambiguate segmentation and translation of on-screen content. Opt-in and non-breaking: disabled by default, and with no key configured the pipeline behaves exactly as before. Adds focused tests (no-network unit tests plus a live Pegasus check gated on TWELVELABS_API_KEY).
Hi! I'm Mohit, and I work at TwelveLabs (@mohit-twelvelabs).
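Summary
The current summary / terminology-extraction step (`_4_1_summarize.py`) relies only on the audio transcript, so it cannot see on-screen text, product / brand names, UI labels, or charts. This PR adds an optional TwelveLabs Pegasus visual context: when enabled, Pegasus analyzes the video once and injects a description of "what is on screen" into the summary / terminology prompt, which helps resolve ambiguity when segmenting and translating on-screen content (for example proper nouns and brand names shown on screen). It is fully optional and does not break existing behavior: it is off by default (`pegasus.enabled: false`), and with no key configured the pipeline behaves exactly as before. A free API key is available at twelvelabs.io, with a generous free tier.
What this adds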
VideoLingo's summary/terminology step (`core/_4_1_summarize.py`) builds the theme and glossary from the transcript only, so it's blind to on-screen text, product/brand names, UI labels, and charts that the narration never says aloud. This PR adds an opt-in TwelveLabs Pegasus visual-context pass:
- `core/utils/pegasus_context.py`: uploads the input video once and asks Pegasus 1.5 to describe the on-screen visual layer.
- The description is passed to `get_summary_prompt(...)` (new optional `visual_context` arg) so the summary + terminology extraction can disambiguate proper nouns and on-screen terms before translation/segmentation.
- A new `pegasus:` block in `config.yaml`.
Why it helps VideoLingo
The pipeline's translation quality hinges on the summary/terminology step. Feeding it what's actually on screen (e.g. a brand logo, an app name in the UI, a chart label) lets it pick the right translation for ambiguous terms and proper nouns that the audio alone can't pin down.
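To make the prompt change concrete, here is a minimal sketch of how an optional `visual_context` argument could be folded into a summary prompt. It is not the code from this PR: the function name `get_summary_prompt_sketch`, the `transcript` parameter, and the prompt wording are all assumptions made for illustration. The only behavior taken from the description is that the prompt stays unchanged when no context is supplied and includes it when present.

```python
def get_summary_prompt_sketch(transcript: str, visual_context: str = "") -> str:
    """Hypothetical stand-in for the repo's prompt builder (names are assumptions)."""
    prompt = (
        "Summarize the video and extract key terminology.\n\n"
        f"Transcript:\n{transcript}\n"
    )
    context = visual_context.strip()
    # Append the visual section only when there is something to add,
    # so the prompt is byte-for-byte unchanged when the feature is off.
    if context:
        prompt += f"\nOn-screen visual context:\n{context}\n"
    return prompt


if __name__ == "__main__":
    # Example: a brand name that appears on screen but is never spoken.
    print(get_summary_prompt_sketch("Welcome to the demo.", "Logo text: Acme Cloud"))
```

Defaulting `visual_context` to an empty string keeps every existing call site working without changes, which is what makes the argument non-breaking.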
Opt-in / non-breaking
- Disabled by default (`pegasus.enabled: false`). With no key configured, `get_visual_context()` returns `""` and the pipeline behaves exactly as before.
- The API key is read from `pegasus.api_key`, falling back to the `TWELVELABS_API_KEY` env var; no key is hardcoded.
- The Pegasus description is cached under `output/log/` for resumed runs. The 200MB direct-upload cap is guarded.
How it was tested
- `tests/test_pegasus_context.py`: 4 no-network unit tests verifying the feature is genuinely opt-in (disabled / no-key → no-op) and that the prompt is unchanged without context but injects it when present. All pass.
- A live test gated on `TWELVELABS_API_KEY` (skipped without it) that performs a real asset upload + Pegasus analyze. Verified passing locally against the TwelveLabs API (returns a non-empty on-screen description).
Note: I couldn't run VideoLingo's full GPU/WhisperX pipeline end-to-end in my Linux sandbox (heavy CUDA/ML deps), so the integration follows the repo's existing backend conventions (`requests`-style modules, `load_key`, `rprint`, `output/log/` caching) and was validated at the unit + live-API level. Happy to adjust to your preferences.
`twelvelabs>=1.2.8` is added to `requirements.txt` (pure-Python SDK; the import is guarded so it's only needed when the feature is enabled).
You can grab a free API key at https://twelvelabs.io; there's a generous free tier.
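For reviewers, the opt-in behavior listed under "Opt-in / non-breaking" can be sketched roughly as below. This is an illustration under stated assumptions, not the module from this PR: `get_visual_context_sketch`, `resolve_api_key`, the `describe` callable (a stand-in for the real upload + Pegasus call), and the choice to skip oversized files by returning `""` are all hypothetical. Only the ordering of checks (disabled, no key, cache, size cap) follows the description.

```python
import os

MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # 200MB direct-upload cap from the PR text


def resolve_api_key(pegasus_cfg, env=os.environ):
    """Prefer pegasus.api_key, then fall back to TWELVELABS_API_KEY."""
    key = (pegasus_cfg.get("api_key") or "").strip()
    return key or env.get("TWELVELABS_API_KEY", "").strip()


def get_visual_context_sketch(pegasus_cfg, video_path, cache_path, describe,
                              env=os.environ):
    """Return an on-screen description, or "" when the feature is a no-op.

    `describe(video_path, api_key)` is a hypothetical callable standing in
    for the real upload + analyze step; it is injected so the guard logic
    can be exercised without network access.
    """
    if not pegasus_cfg.get("enabled", False):
        return ""  # disabled by default
    api_key = resolve_api_key(pegasus_cfg, env)
    if not api_key:
        return ""  # no key configured: behave exactly as before
    if os.path.exists(cache_path):
        # Resumed run: reuse the earlier description instead of re-uploading.
        with open(cache_path, encoding="utf-8") as f:
            return f.read()
    if os.path.getsize(video_path) > MAX_UPLOAD_BYTES:
        return ""  # assumption: oversized files are skipped, not an error
    text = describe(video_path, api_key)
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(text)
    return text
```

Injecting `describe` mirrors how the no-network unit tests can check the disabled and no-key paths without touching the TwelveLabs API.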