feat: add opt-in TwelveLabs Pegasus on-screen visual context #580

Open
mohit-twelvelabs wants to merge 1 commit into Huanshere:main from mohit-twelvelabs:feat/twelvelabs-integration

Conversation

@mohit-twelvelabs

Hi! I'm Mohit, and I work at TwelveLabs (@mohit-twelvelabs).


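Summary (translated from Simplified Chinese)

The current summary / terminology extraction step (_4_1_summarize.py) relies only on the audio transcript, so it cannot see on-screen text, product / brand names, UI labels, or charts. This PR adds an optional TwelveLabs Pegasus visual context: when enabled, Pegasus analyzes the video frames once and injects a description of "what is on screen" into the summary / terminology prompt, which helps resolve ambiguity about on-screen content during segmentation and translation (for example, proper nouns and brand names that appear on screen).

Fully optional and non-breaking: disabled by default (pegasus.enabled: false), and with no key configured the pipeline behaves exactly as before. A free API key is available at twelvelabs.io, with a generous free tier.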


What this adds

VideoLingo's summary/terminology step (core/_4_1_summarize.py) builds the theme and glossary from the transcript only, so it's blind to on-screen text, product/brand names, UI labels, and charts that the narration never says aloud. This PR adds an opt-in TwelveLabs Pegasus visual-context pass:

  • New core/utils/pegasus_context.py — uploads the input video once and asks Pegasus 1.5 to describe the on-screen visual layer.
  • The description is injected into get_summary_prompt(...) (new optional visual_context arg) so the summary + terminology extraction can disambiguate proper nouns and on-screen terms before translation/segmentation.
  • New pegasus: block in config.yaml.
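To make the prompt change concrete, here is a minimal sketch of how an optional visual-context argument might be appended. The function name `build_summary_prompt` and the prompt wording are hypothetical stand-ins, not the actual `get_summary_prompt(...)` in the repo; only the idea of an optional `visual_context` argument that leaves the prompt unchanged when empty comes from this PR.

```python
from typing import Optional


def build_summary_prompt(source_content: str,
                         visual_context: Optional[str] = None) -> str:
    """Hypothetical stand-in for the real prompt builder (wording assumed)."""
    prompt = (
        "Summarize the theme of this video and extract key terminology.\n\n"
        f"Transcript:\n{source_content}\n"
    )
    # Append the visual section only when a non-blank description exists,
    # so the prompt is identical to the old one when the feature is off.
    if visual_context and visual_context.strip():
        prompt += (
            "\nOn-screen visual context (may contain names never spoken aloud):\n"
            f"{visual_context.strip()}\n"
        )
    return prompt
```

Treating `None`, `""`, and whitespace the same way keeps the no-context path byte-for-byte stable, which is what the "prompt is unchanged without context" unit test relies on.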

Why it helps VideoLingo

The pipeline's translation quality hinges on the summary/terminology step. Feeding it what's actually on screen (e.g. a brand logo, an app name in the UI, a chart label) lets it pick the right translation for ambiguous terms and proper nouns that the audio alone can't pin down.

Opt-in / non-breaking

  • Disabled by default (pegasus.enabled: false). With no key configured, get_visual_context() returns "" and the pipeline behaves exactly as before.
  • Key is read from pegasus.api_key, falling back to the TWELVELABS_API_KEY env var — no key is hardcoded.
  • All Pegasus errors are caught and reported as a warning, so a hiccup never breaks translation. The result is cached to output/log/ for resumed runs. Files larger than the 200MB direct-upload cap are guarded against.
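The guards listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in core/utils/pegasus_context.py: the function name `get_visual_context_sketch`, the injected `describe` callable (standing in for the real upload + analyze call), the dict-based config, and the choice to skip oversized files are all hypothetical.

```python
import os
from pathlib import Path
from typing import Callable

MAX_UPLOAD_BYTES = 200 * 1024 * 1024  # direct-upload cap mentioned in the PR


def resolve_api_key(config: dict) -> str:
    """Prefer pegasus.api_key, then fall back to the TWELVELABS_API_KEY env var."""
    key = (config.get("pegasus") or {}).get("api_key") or ""
    return key or os.environ.get("TWELVELABS_API_KEY", "")


def get_visual_context_sketch(video_path: str, config: dict,
                              describe: Callable[[str, str], str],
                              cache_path: str) -> str:
    """Return an on-screen description, or "" whenever the feature cannot run."""
    if not (config.get("pegasus") or {}).get("enabled", False):
        return ""  # disabled by default: no-op
    api_key = resolve_api_key(config)
    if not api_key:
        return ""  # no key configured: no-op
    cache = Path(cache_path)
    if cache.exists():
        return cache.read_text(encoding="utf-8")  # reuse result on resumed runs
    try:
        if os.path.getsize(video_path) > MAX_UPLOAD_BYTES:
            print("warning: video exceeds 200MB upload cap; skipping Pegasus")
            return ""
        description = describe(video_path, api_key)  # hypothetical API wrapper
    except Exception as exc:  # never let a Pegasus failure break translation
        print(f"warning: Pegasus visual context skipped: {exc}")
        return ""
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(description, encoding="utf-8")
    return description
```

Passing `describe` in as a parameter keeps the sketch free of any specific SDK call and makes the opt-in paths testable without network access.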

How it was tested

  • tests/test_pegasus_context.py: 4 no-network unit tests verifying the feature is genuinely opt-in (disabled / no-key → no-op) and that the prompt is unchanged without context but injects it when present — all pass.
  • A live test gated on TWELVELABS_API_KEY (skipped without it) that performs a real asset upload + Pegasus analyze — verified passing locally against the TwelveLabs API (returns a non-empty on-screen description).
  • Modules compile and import cleanly; config loads.
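For readers unfamiliar with the pattern, the split between no-network tests and a key-gated live test could look something like the sketch below. It uses the standard library's `unittest` purely for illustration (the PR's actual test framework is not stated here), and `visual_context` is a hypothetical stand-in rather than the real `get_visual_context()`.

```python
import os
import unittest

LIVE_KEY = os.environ.get("TWELVELABS_API_KEY", "")


def visual_context(enabled: bool, api_key: str) -> str:
    """Hypothetical stand-in: returns "" whenever the feature cannot run."""
    if not enabled or not api_key:
        return ""
    raise NotImplementedError("the real upload + analyze call would go here")


class NoNetworkTests(unittest.TestCase):
    def test_disabled_is_noop(self):
        self.assertEqual(visual_context(False, "some-key"), "")

    def test_missing_key_is_noop(self):
        self.assertEqual(visual_context(True, ""), "")


# The whole class is skipped unless the env var is present, so CI without
# credentials stays green and never touches the network.
@unittest.skipUnless(LIVE_KEY, "TWELVELABS_API_KEY not set; skipping live test")
class LiveTests(unittest.TestCase):
    def test_live_description_is_non_empty(self):
        self.skipTest("placeholder: wire in the real Pegasus call here")
```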

Note: I couldn't run VideoLingo's full GPU/WhisperX pipeline end-to-end in my Linux sandbox (heavy CUDA/ML deps), so the integration follows the repo's existing backend conventions (requests-style modules, load_key, rprint, output/log/ caching) and was validated at the unit + live-API level. Happy to adjust to your preferences.

twelvelabs>=1.2.8 is added to requirements.txt (pure-Python SDK; the import is guarded so it's only needed when the feature is enabled).
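A guarded import of this kind is often written along the following lines. This is a sketch only: the helper name `load_twelvelabs_sdk` is hypothetical, and returning `None` with a printed warning (rather than raising) is an assumption chosen to match the "never breaks translation" behavior described above.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_twelvelabs_sdk(enabled: bool) -> Optional[ModuleType]:
    """Import the SDK lazily; return None when the feature is off or unavailable."""
    if not enabled:
        return None  # the package is never imported on the default path
    try:
        return importlib.import_module("twelvelabs")
    except ImportError:
        print("warning: pegasus.enabled is true but 'twelvelabs' is not installed")
        return None
```

With this shape, users who leave the feature disabled pay no import cost and see no error even if the package is absent.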

You can grab a free API key at https://twelvelabs.io — there's a generous free tier.

The summary/terminology step works from the transcript only and is blind
to on-screen text, product/brand names, UI labels and charts. When enabled
via the new pegasus config block, TwelveLabs Pegasus describes that visual
layer once and feeds it into the summary prompt to disambiguate
segmentation and translation of on-screen content.

Opt-in and non-breaking: disabled by default, and with no key configured
the pipeline behaves exactly as before. Adds focused tests (no-network unit
tests plus a live Pegasus check gated on TWELVELABS_API_KEY).