diff --git a/GenAI.md b/GenAI.md
index 3c1fd31b..a99d8fad 100644
--- a/GenAI.md
+++ b/GenAI.md
@@ -26,7 +26,164 @@ No code required. We want a **clear, practical proposal** with architecture and
 
 ### Your Solution for problem 1:
 
-You need to put your solution here.
+
+Before picking any approach, there's one thing worth calling out that the problem doesn't explicitly mention — **not all videos have audio worth transcribing.**
+
+If the video is a muted screen recording or a slide presentation, running Whisper on it is a waste of time. The pipeline needs to detect this first:
+
+- If the video has spoken audio → use transcription (Whisper or cloud ASR)
+- If it's muted / slide-based → extract frames with ffmpeg, then run a Vision LLM (GPT-4o Vision or Gemini 1.5 Pro) on those frames to read slide content, diagrams, and text
+- If it's a narrated slideshow → do both and merge
+
+This detection step sits before all three approaches. Once past it, the pipelines below take over.
+
+---
+
+#### Approach A — Online / Cloud-Based
+
+The idea here is to use existing cloud APIs (Deepgram for transcription, GPT-4o for summarization) and handle only the media cutting locally with ffmpeg.
+
+**How it works:**
+
+```
+Local video folder
+    ↓
+ffmpeg → split audio into 10-min chunks
+    ↓
+Assign chunk_id + start_sec offset to each chunk
+    ↓
+Upload all chunks in parallel (async) → cloud ASR processes them simultaneously
+    ↓
+Poll/webhook until all jobs done
+    ↓
+Merge transcripts by chunk_id → one final transcript with correct timestamps
+    ↓
+Send transcript → GPT-4o → returns summary + highlight timestamps as JSON
+    ↓
+ffmpeg → cut clips at those timestamps
+ffmpeg → extract screenshots at those timestamps
+    ↓
+Write Summary.md + save to output//clips/ and screenshots/
+```
+
+The key insight for this approach is **parallel chunked uploads**. If you send a 4-hour video to Deepgram as one file, it processes sequentially and takes a long time.
But if you split it into 24 × 10-minute chunks and upload them all at once, they process in parallel and you get results in a fraction of the time. Each chunk stores its `start_sec` so when you merge the transcripts, timestamps are absolute (not reset to 0 per chunk).
+
+| Component | Tool |
+|---|---|
+| Audio chunking | ffmpeg |
+| Transcription | Deepgram (async API) |
+| Summarization | GPT-4o or Gemini 1.5 Pro |
+| Clip + screenshot | ffmpeg |
+| Orchestration | Python with asyncio |
+
+**Pros:** Fastest to get running. No GPU or local model needed. High transcription accuracy out of the box.
+It also translates naturally into a user-friendly online SaaS product.
+
+**Cons:** Audio goes to a third party — not suitable for confidential videos. Cost adds up at scale (~$0.50–1.00 per video). Rate limits can become an issue on large folders.
+
+**Best when:** You want to get something working fast and the content isn't sensitive.
+
+---
+
+#### Approach B — Hybrid (Local Media + Cloud LLM) (My Recommendation)
+
+This is the approach I'd go with. All the heavy media work happens locally — ffmpeg handles the cutting, Whisper handles the transcription. Only the text (transcript) goes to a cloud LLM for summarization. The video itself never leaves your machine.
+
+**How it works:**
+
+```
+Local video folder
+    ↓
+ffmpeg → extract audio as WAV
+    ↓
+Split audio into 10-min chunks
+    ↓
+Run Whisper on all chunks in parallel (Python multiprocessing)
+    ↓
+Merge transcripts by chunk_id → final_transcript.json
+    ↓
+Split transcript into ~4000-token segments (with overlap so context isn't lost)
+    ↓
+GPT-4o-mini → summarize + identify highlights with timestamps
+    ↓
+LLM returns: { summary, key_topics, highlights[{start_sec, end_sec, title}], takeaways }
+    ↓
+ffmpeg → cut clips + extract screenshots at highlight timestamps
+    ↓
+Write Summary.md → output//
+```
+
+**Why Whisper + parallel chunks?**
+Whisper large-v3 runs at about 4× realtime on a CPU.
So a 4-hour video takes ~1 hour if processed sequentially. With 4 parallel workers, that drops to ~15–20 minutes — similar speed to cloud, but fully local. + + + +| Component | Tool | +|---|---| +| Audio extraction | ffmpeg | +| Transcription | Whisper large-v3 (local) | +| Parallel processing | Python multiprocessing.Pool | +| Summarization | GPT-4o-mini API | +| Clips + screenshots | ffmpeg | + +**Estimated cost:** ~$0.10–0.30 per 4-hour video +**Estimated time:** ~15–20 min with parallel chunking + +**Pros:** Video stays local (privacy). LLM quality is high. Full control over output. If you ever want to go fully offline, you just swap the GPT-4o-mini call to Ollama — nothing else changes. + +**Cons:** Whisper large-v3 needs ~8–10 GB RAM. Takes 1–2 days to set up properly. + +**Best when:** You want a production-ready pipeline that works for sensitive and non-sensitive content alike. + +--- + +#### Approach C — Fully Offline + +Same as Approach B, but replace the cloud LLM with a local model running through Ollama or llama.cpp. Zero API cost. Nothing leaves the machine at any step. + +**How it works:** + +``` +(Same as Approach B up to transcript merge) + ↓ +Ollama (LLaMA 3 / Mistral) → summarization + highlight extraction + ↓ +(Same ffmpeg steps for clips + screenshots) + ↓ +Write Summary.md +``` + +**Model options:** + +| Model | RAM Needed | Quality | +|---|---|---| +| Mistral 7B | ~8 GB RAM | Decent — fine for simple videos | +| LLaMA 3 8B | ~10 GB RAM | Good | +| LLaMA 3 70B | ~48 GB VRAM | Close to GPT-4o quality, needs a real GPU | + +**Pros:** No API cost after setup. 100% air-gapped. No rate limits. + +**Cons:** 7B models produce noticeably weaker summaries than GPT-4o on long, complex transcripts. 70B needs serious hardware. Slower overall. + +**Best when:** The content is confidential and can't touch any external service, or you have GPU infrastructure available. 
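The chunk-merge step shared by all three approaches (chunk-relative timestamps plus each chunk's `start_sec` offset) can be sketched in a few lines of Python. The segment shape mirrors what Whisper reports per chunk; `merge_chunk_transcripts` is a hypothetical helper name, not a library function:

```python
def merge_chunk_transcripts(chunks):
    """Merge per-chunk transcripts into one absolute timeline.

    Each chunk is assumed to look like:
      {"chunk_id": 3, "start_sec": 1800.0,
       "segments": [{"start": 2.1, "end": 5.8, "text": "..."}]}
    Segment times are chunk-relative (as Whisper reports them);
    adding the chunk's start_sec offset makes them absolute.
    """
    merged = []
    for chunk in sorted(chunks, key=lambda c: c["start_sec"]):
        for seg in chunk["segments"]:
            merged.append({
                "start": chunk["start_sec"] + seg["start"],
                "end": chunk["start_sec"] + seg["end"],
                "text": seg["text"],
            })
    return merged
```

Because each worker only needs its chunk and its offset, this merge works identically whether the segments came from Deepgram, local Whisper, or an Ollama-backed pipeline.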
+
+---
+
+#### Final Comparison
+
+| | A: Cloud | B: Hybrid ✅ | C: Offline |
+|---|---|---|---|
+| Setup time | Hours | 1–2 days | 2–4 days |
+| Privacy | ❌ Audio to cloud | ✅ Only text to cloud | ✅ 100% local |
+| Cost per video | $0.50–1.00 | $0.10–0.30 | $0 (hardware) |
+| Summary quality | High | High | Medium–High |
+| Speed (4hr video) | ~12–15 min | ~15–20 min | ~30–60 min |
+
+**I'd go with Approach B.** The parallel chunking keeps speed comparable to cloud. The video never leaves the machine. Cost is minimal. And the upgrade path to fully offline is a one-line change.
+
+---
+
 
 ## Problem 2: **Zero-Shot Prompt to generate 3 LinkedIn Posts**
 
@@ -36,7 +193,214 @@ Design a **single zero-shot prompt** that takes a user’s persona configuration
 
 ### Your Solution for problem 2:
 
-You need to put your solution here.
+
+#### My thinking before writing the prompt
+
+A few things that can go wrong with LinkedIn post generation:
+- All 3 posts end up sounding like the same post with slightly different wording
+- The LLM invents statistics or quotes that the persona never said
+- The output is valid prose but not valid JSON — the app can't parse it
+- It ignores the persona's don'ts completely
+
+To solve these, I made a few deliberate decisions:
+- Give each style specific structural rules — not just "write in 3 different tones"
+- Explicit rule: no invented stats. Use `[STAT]` as a placeholder instead
+- Set `response_format: json_object` at the API level so non-JSON output is impossible
+- Persona don'ts are injected directly into the prompt as a list the LLM has to follow
+
+---
+
+#### Persona Config Schema (stored once, reused per generation)
+
+```json
+{
+  "name": "Priya Sharma",
+  "role": "VP of Engineering",
+  "industry": "FinTech / SaaS",
+  "tone": "direct, no-fluff, occasionally self-deprecating",
+  "audience": "engineering leaders and CTOs",
+  "dos": [
+    "Share hard-won lessons",
+    "Use concrete examples",
+    "End with a question that invites discussion"
+  ],
+  "donts": [
+    "No corporate buzzwords like leverage, synergy, ecosystem",
+    "No humble-bragging",
+    "No vague motivational filler"
+  ],
+  "example_posts": [
+    "Spent 3 years building the wrong thing. Here's what I'd tell myself."
+  ]
+}
+```
+
+---
+
+#### Simple rule for a good prompt: context → prompt → constraints → output format
+
+#### The Prompt
+
+---
+
+**📌 CONTEXT**
+
+```
+You are a LinkedIn ghostwriter. Your job is to write 3 distinct LinkedIn post drafts
+for the persona provided below. You write entirely in their voice — matching their tone,
+vocabulary, and style based on their profile and example posts.
+
+The user has a one-time persona config (name, role, industry, tone, dos/donts, examples).
+They will provide a topic each time. Your output goes directly into an app that shows
+3 draft cards to the user — so it must be machine-readable JSON, nothing else.
+```
+
+---
+
+**✍️ PROMPT**
+
+```
+Generate 3 LinkedIn post drafts for the persona and topic below.
+
+PERSONA
+Name: {{name}}
+Role: {{role}}
+Industry: {{industry}}
+Tone: {{tone}}
+Audience: {{audience}}
+Dos: {{dos}}
+Donts: {{donts}}
+Example posts (match this voice exactly): {{example_posts}}
+
+TOPIC
+
+Topic: {{topic}}
+Context (optional): {{context}}
+Goal: {{goal}}
+
+STYLES
+
+Style 1 — INSIGHT:
+One sharp, concise observation or counterintuitive truth about the topic.
+Short paragraphs. No filler. Feels like something you'd stop scrolling for. +Target: 100–180 words. + +Style 2 — STORY: +Open with a specific moment or scene. Build toward a lesson or realization. +First-person, conversational, emotionally grounded. +Target: 200–300 words. + +Style 3 — CHECKLIST: +Lead with a clear promise ("3 things I wish I knew..." / "5 signs that..."). +Numbered list, each point has a short explanation. +Immediately useful and easy to share. +Target: 180–260 words. +``` + +--- + +**⚙️ CONSTRAINTS** + +``` +RULES +- Do NOT start any post with "I" or "We" as the first word +- First line (hook) must be ≤ 12 words — strong enough to stop scrolling +- End every post with exactly one CTA — a question or a direct call to action +- Max 5 hashtags per post, relevant and specific (no generic ones like #motivation) +- Never fabricate statistics, research, or quotes — use [STAT] as a placeholder if needed +- Each post must feel structurally and emotionally different from the other two +- Strictly follow the persona's dos and donts — do not ignore them +``` + +--- + +**📤 OUTPUT FORMAT** + +``` +Return ONLY this JSON. No markdown. No explanation. Nothing before or after it. + +{ + "topic": "", + "persona": "", + "posts": [ + { + "style": "Insight", + "hook": "", + "body": "", + "cta": "", + "hashtags": ["#Tag1", "#Tag2", "#Tag3"], + "word_count": + }, + { + "style": "Story", + "hook": "...", + "body": "...", + "cta": "...", + "hashtags": [...], + "word_count": + }, + { + "style": "Checklist", + "hook": "...", + "body": "...", + "cta": "...", + "hashtags": [...], + "word_count": + } + ] +} +``` + +--- + +**✅ RESULT (example filled output)** + +```json +{ + "topic": "Why most engineering teams underestimate technical debt", + "persona": "Priya Sharma", + "posts": [ + { + "style": "Insight", + "hook": "Technical debt doesn't slow you down gradually. 
It stops you suddenly.", + "body": "Most teams treat it like a future problem.\nIt's not.\n\nEvery shortcut you ship today is a tax on every feature you build tomorrow. The codebase gets heavier. Velocity drops. But it happens slowly enough that nobody notices until the quarter you miss your roadmap completely.\n\nI've seen teams lose [STAT] months of engineering capacity to debt they thought was 'manageable'. It never stays manageable.\n\nThe teams that stay fast are the ones who treat debt repayment as a recurring line item — not a someday project.", + "cta": "How does your team currently track and prioritize technical debt?", + "hashtags": ["#EngineeringLeadership", "#TechDebt", "#SoftwareEngineering"], + "word_count": 112 + }, + { + "style": "Story", + "hook": "Three years ago, our team was shipping fast. Then we weren't.", + "body": "We'd been cutting corners since month one. Nothing dramatic — just small things. A test skipped here, a refactor deferred there. Each decision made sense in isolation.\n\nThen we tried to add payments.\n\nWhat should have been a 6-week project took 7 months. Every new feature required touching 4 other things that were fragile. Every deploy was a prayer.\n\nThe irony? We thought we were moving fast because we skipped the boring stuff. We were just borrowing time from our future selves — at a terrible interest rate.\n\nWe eventually paid it down. But it cost us a full roadmap cycle and two engineers who burnt out and left.", + "cta": "Has your team ever had a moment where debt caught up all at once? What did you do?", + "hashtags": ["#EngineeringLeadership", "#TechDebt", "#CTOLife", "#LessonsLearned"], + "word_count": 148 + }, + { + "style": "Checklist", + "hook": "4 signs your team is deeper in technical debt than you think:", + "body": "1. New engineers take 3+ weeks to make their first meaningful PR\nOnboarding time is a direct measure of codebase complexity. If it's long, the code is tangled.\n\n2. 
Estimates are consistently off by 2x or more\nWhen engineers can't predict how long things take, it's usually because they don't know what they'll have to touch.\n\n3. Bug fixes keep introducing new bugs\nYou're not writing bad code — you're working in a system with too many invisible dependencies.\n\n4. Refactors keep getting bumped from the sprint\nIf it's always 'next quarter', it'll never happen. Debt that doesn't have a slot in the roadmap doesn't get paid.",
+      "cta": "Which one of these sounds most familiar to your team right now?",
+      "hashtags": ["#EngineeringLeadership", "#TechDebt", "#SoftwareTeams", "#ProductEngineering"],
+      "word_count": 143
+    }
+  ]
+}
+```
+
+#### What can go wrong and how the prompt handles it
+
+| Risk | How it's handled |
+|---|---|
+| All 3 posts sound the same | Each style has distinct structure, mechanics, and word count targets |
+| LLM invents a stat | Explicit rule + `[STAT]` placeholder shown in example output |
+| Persona don'ts ignored | Injected directly as a named constraint block — hard to miss |
+| Starts with "I" | Stated explicitly in constraints |
+| Output isn't valid JSON | `response_format: json_object` at API level prevents this entirely |
+
+---
 
 ## Problem 3: **Smart DOCX Template → Bulk DOCX/PDF Generator (Proposal + Prompt)**
 
@@ -54,7 +418,213 @@ Submit a **proposal** for building this system using GenAI (OpenAI/Gemini) for
 
 ### Your Solution for problem 3:
 
-You need to put your solution here.
+ +#### How the system works — big picture + +``` +User uploads DOCX + ↓ +Module 1: AI Field Detection +LLM reads the doc → identifies variable fields → returns JSON schema + ↓ +Module 2: User Review + Template Store +User confirms or edits the fields → system tokenizes the DOCX → stores it + ↓ + ├─→ Module 3: Single Generation + │ User fills a form → tokens get substituted → download DOCX or PDF + │ + └─→ Module 4: Bulk Generation + Upload Excel/Sheet → map columns → validate → generate all → ZIP + report +``` + +--- + +#### Module 1 — AI Field Detection + +When the user uploads a DOCX, the system extracts its full text and sends it to GPT-4o with this prompt: + +``` +You are a document analysis assistant. + +Read the document below and identify all fields that would change between different +document instances — things like names, dates, amounts, addresses, job titles, IDs. + +For each field return: + - field_id: snake_case key (e.g. candidate_name) + - label: human-readable name (e.g. "Candidate Full Name") + - type: one of [text, date, number, currency, email, phone] + - sample_value: the actual value in the doc right now (this is what gets replaced) + - required: true or false + - context_snippet: short excerpt (~10 words) showing where this field appears + +Return ONLY valid JSON: { "fields": [ ... ] } +No explanation. No preamble. + +Document: +{{extracted_text}} +``` + +**Example output:** +```json +{ + "fields": [ + { + "field_id": "candidate_name", + "label": "Candidate Full Name", + "type": "text", + "sample_value": "John Smith", + "required": true, + "context_snippet": "Dear John Smith, we are pleased to offer..." 
+ }, + { + "field_id": "start_date", + "label": "Start Date", + "type": "date", + "sample_value": "January 15, 2025", + "required": true, + "context_snippet": "Your employment begins on January 15, 2025" + }, + { + "field_id": "annual_salary", + "label": "Annual Salary", + "type": "currency", + "sample_value": "$85,000", + "required": true, + "context_snippet": "...base salary of $85,000, paid monthly..." + } + ] +} +``` + +The user sees this list in the UI and can rename, merge, delete, or manually add fields before confirming. + +--- + +#### Module 2 — Template Store + +After the user confirms: +1. Replace every `sample_value` in the DOCX XML directly with `{{field_id}}` tokens +2. Store the tokenized DOCX in object storage (S3/GCS) — this is the master template +3. Store the field schema in the DB against a `template_id` + +Replacements happen in the raw DOCX XML (`word/document.xml`) — not via export/re-import. This is what preserves formatting, logos, tables, headers/footers, and signatures exactly. + +**Production note on XML replacement**: Manipulating raw DOCX XML directly is a trap — Word frequently splits a single visible word across multiple run elements due to autocorrect history or formatting changes, so a search for "John Smith" in the XML may find nothing and fail silently. The practical fix is to use docxtpl (python-docx-template), a library built specifically for this. It handles Jinja2-style {{field_id}} tags inside Word documents, automatically merges split runs before substitution, and preserves all native formatting. This is the industry-standard solution instead of writing a custom XML parser. 
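The confirm-then-tokenize step can be illustrated on plain extracted text. This is a sketch of the idea only (in the real pipeline docxtpl performs the replacement inside the document's runs, as noted above); `tokenize_text` and the longest-value-first ordering are my own illustrative choices:

```python
def tokenize_text(text, fields):
    """Replace each field's sample_value with a {{field_id}} tag.

    Longer values are replaced first, so an overlapping shorter value
    (e.g. "John Smith" inside "John Smithson") can't clobber it.
    """
    ordered = sorted(fields, key=lambda f: len(f["sample_value"]), reverse=True)
    for field in ordered:
        text = text.replace(field["sample_value"], "{{%s}}" % field["field_id"])
    return text
```

The same field list that the LLM returned in Module 1 drives the substitution, so nothing is replaced that the user didn't confirm.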
+ + +**Stored schema:** +```json +{ + "template_id": "tmpl_offer_v2", + "template_name": "Offer Letter v2", + "fields": [ + { + "field_id": "candidate_name", + "label": "Candidate Full Name", + "type": "text", + "required": true, + "validation": { "min_length": 2, "max_length": 100 } + }, + { + "field_id": "start_date", + "label": "Start Date", + "type": "date", + "required": true, + "format": "MMMM D, YYYY" + }, + { + "field_id": "annual_salary", + "label": "Annual Salary", + "type": "currency", + "required": true, + "currency_code": "USD" + } + ] +} +``` + +--- + +#### Module 3 — Single Generation + +The frontend reads the schema and auto-renders a typed form — date pickers for date fields, number inputs for currency, etc. User fills it, submits, and gets their file back. + +``` +POST /generate/single +{ + "template_id": "tmpl_offer_v2", + "output_format": "pdf", + "fields": { + "candidate_name": "Alice Johnson", + "start_date": "February 1, 2025", + "annual_salary": "$92,000" + } +} +→ Alice_Johnson_OfferLetter_20250201.pdf +``` + +PDF conversion: LibreOffice headless or Gotenberg Docker. Both preserve complex Word formatting reliably. 
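The filename in the example response can be produced by a small helper. A sketch, assuming the `{CandidateName}_{TemplateName}_{YYYYMMDD}` convention used in this proposal; the underscore sanitization rule and the zero-padded `dup_index` suffix are assumptions:

```python
import re
from datetime import date

def output_filename(person, template_name, on_date, ext="pdf", dup_index=None):
    """Build '{CandidateName}_{TemplateName}_{YYYYMMDD}.{ext}'.

    Runs of non-alphanumeric characters collapse to underscores so every
    name is filesystem-safe; dup_index disambiguates duplicate rows.
    """
    def safe(s):
        return re.sub(r"[^A-Za-z0-9]+", "_", s).strip("_")

    base = f"{safe(person)}_{safe(template_name)}_{on_date.strftime('%Y%m%d')}"
    if dup_index is not None:
        base += f"_{dup_index:03d}"
    return f"{base}.{ext}"
```

Keeping the naming logic in one function means single and bulk generation can't drift apart in how they name files.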
+ +--- + +#### Module 4 — Bulk Generation + +``` +User uploads .xlsx or provides Google Sheet URL + ↓ +System reads column headers +Fuzzy-match to field schema ("Full Name" → candidate_name, "Joining Date" → start_date) + ↓ +User confirms column mapping (auto-confirmed if match confidence > 90%) + ↓ +Pre-flight validation (before anything is generated): + - Check every required field is present in every row + - Type check: dates parse correctly, numbers are numbers + - Flag invalid rows with row number + specific error + ↓ +Show validation report: "182 rows valid, 3 rows with errors" +User can fix + re-upload, or skip bad rows and proceed + ↓ +Async generation queue — one worker per row +Each worker: load template → substitute fields → render DOCX → convert to PDF + ↓ +ZIP file assembled +report.csv included in ZIP +Download link sent (or push to Google Drive) +``` + +**Output naming:** +``` +{CandidateName}_{TemplateName}_{YYYYMMDD}.pdf + +Examples: + Alice_Johnson_OfferLetter_20250201.pdf + Raj_Patel_Certificate_20250115.pdf + +If duplicates: append row index → Alice_Johnson_OfferLetter_20250201_002.pdf +``` + +**report.csv (included in ZIP):** + +| Row | candidate_name | Status | Error | +|---|---|---|---| +| 2 | Alice Johnson | ✅ Success | — | +| 7 | — | ❌ Failed | start_date: required field missing | +| 12 | Carol White | ❌ Failed | annual_salary: expected number, got "TBD" | + +--- + +#### Error handling + +| Scenario | What happens | +|---|---| +| Missing required field | Pre-flight catches it. Row skipped. Listed in report. | +| Wrong data type | "Row 7, start_date: expected date, got 'ASAP'" | +| Token in template not in schema | Warning shown: `{{signing_date}}` not mapped — renders as literal text | +| PDF conversion fails for one row | DOCX fallback delivered for that row. Flagged in report. Rest of ZIP unaffected. | +| 1000+ row batch | Chunk into groups of 100. Parallel workers. Progress bar in UI. 
| + +--- ## Problem 4: Architecture Proposal for 5-Min Character Video Series Generator @@ -66,4 +636,212 @@ Create a **small, clear architecture proposal** (no code, no prompts) describing ### Your Solution for problem 4: -You need to put your solution here. + +#### How I'm thinking about this + +The core challenge is consistency. Characters need to look, sound, and behave the same across every episode — but each episode has a new story. So the architecture needs a "bible" layer that stays fixed, and a generation layer that runs fresh each time. + +I also think of video generation as something that should happen last, not first. It's expensive (~$12–18 per episode). The storyboard + audio package should be validated by the user before we commit to rendering a full video. + +--- + +#### System overview + +``` +Layer 1: Series Bible Store (set up once) +Characters + relationships + world rules → saved to DB + +Layer 2: Script Engine (LLM) +Episode brief + Bible → structured scene JSON + + User reviews script here before anything else happens + +Layer 3: Visual Generation Layer 4: Voice Generation +One image per scene (parallel) TTS per dialogue line (parallel) + +Layer 5: Assembly + Delivery +Stitch everything → Episode Package → optional video render +``` + +--- + +#### Layer 1 — Series Bible Store + +The user defines this once and it gets injected into every episode generation. + +**Character schema:** +```json +{ + "character_id": "char_001", + "name": "Maya", + "role": "protagonist", + "personality": "curious, sarcastic, fiercely loyal", + "speech_style": "quick wit, rhetorical questions, dry humor", + "behavioral_rules": [ + "Never backs down from a challenge", + "Deflects vulnerability with sarcasm" + ], + "visual_prompt": "woman, 28, short dark hair, red jacket, anime style", + "visual_consistency_notes": "Always wears red jacket. 
Scar above left eyebrow.", + "reference_image": "storage://series_01/maya_ref.png", + "voice_profile": { + "tts_service": "elevenlabs", + "voice_id": "voice_abc123", + "speed": 1.05 + }, + "relationships": [ + { + "character_id": "char_002", + "type": "best friend — disagrees on methods, not goals" + } + ] +} +``` + +The `reference_image` is the most important field for visual consistency — it's the seed image used in every scene this character appears in. + +**World / series config:** +```json +{ + "series_id": "series_01", + "setting": "near-future tech startup, comedic tone", + "style": "slice-of-life comedy with drama", + "platform_format": "16:9", + "language": "English", + "episode_history": [ + { "episode": 1, "summary": "Maya joins the team. Tension with Leo introduced." } + ] +} +``` + +--- + +#### Layer 2 — Script Engine (LLM) + +**Input:** Episode brief from user + full Character Bible + episode history (for continuity) + +**Constraints passed to the LLM:** +- ~5 minutes = 8–12 scenes +- Each scene: 25–45 seconds +- ~750 spoken words total +- Must follow character personalities and relationship rules from the Bible + +**Scene output schema:** +```json +{ + "scene_id": "ep02_sc04", + "scene_number": 4, + "location": "open-plan office, late afternoon", + "characters_present": ["char_001", "char_002"], + "action_description": "Maya spins her chair. Her expression shifts to controlled anger.", + "dialogue": [ + { "character_id": "char_001", "line": "How long have you been touching my code?" }, + { "character_id": "char_002", "line": "Long enough to stop it crashing production." } + ], + "visual_mood": "tense, warm backlighting, close shots", + "narration": null, + "target_duration_sec": 35, + "transition": "cut", + "music_cue": "low ambient tension" +} +``` + +**User review gate sits here.** Before any image or audio is generated, the user sees the full script in a readable format and can edit dialogue, swap characters, or adjust scenes. 
This prevents spending $0.40+ on images for a script that needs rework. + +--- + +#### Layer 3 — Visual Generation (Storyboard) + +One image per scene, generated in parallel after script approval. + +**Image prompt per scene:** +``` +{character.visual_prompt} + {scene.location} + {scene.visual_mood} + {scene.action_description} ++ style anchor: "anime style, consistent with series_01 character design" +``` + +**Consistency challenges and how they're handled:** + +| Problem | Solution | +|---|---| +| Character looks different each scene | Use `reference_image` as IP-Adapter seed for every frame | +| Multiple characters in one frame | Generate separately → composite them | +| Style drifts across scenes | Global style token on every single prompt — no exceptions | +| Same location looks different | Store approved background per location, reuse as seed | + +**Tool options:** DALL-E 3 (easiest API), Stable Diffusion + IP-Adapter (best consistency), Midjourney API (best quality, harder to automate). + +--- + +#### Layer 4 — Voice Generation (TTS) + +Each dialogue line gets its own audio file, driven by the character's `voice_profile`. + +**Process:** +1. Call TTS API per line with character's `voice_id` and speed/style settings +2. Save individual files (`ep02_sc04_char001_line1.mp3`) +3. Stitch lines with 300–500ms natural pauses +4. Output one audio file per scene + one full episode audio file +5. Tag each file with `start_time_ms` for video sync + +**TTS options:** + +| Service | Quality | Notes | +|---|---|---| +| ElevenLabs | Best | Supports voice cloning for truly unique character voices | +| OpenAI TTS | Good | Fast and cheap, 6 built-in voices | +| Azure Neural | Good | Wide language support | + +--- + +#### Layer 5 — Episode Package + Assembly + +``` +/episodes/ep02/ + ├── script.md ← readable script + ├── script.json ← source of truth + ├── storyboard/ + │ ├── ep02_sc01.png + │ ├── ep02_sc02.png + │ └── ... 
+ ├── audio/ + │ ├── ep02_sc01_full.mp3 + │ └── ep02_full_voiceover.mp3 + ├── episode_manifest.json ← asset paths + timing per scene + └── ep02_final.mp4 ← optional rendered video +``` + +**Video render options:** +- **ffmpeg slideshow** — each storyboard frame shown for `target_duration_sec`, audio overlaid. Free, fast, looks like a motion comic or visual podcast. +- **Runway Gen-3 / Pika Labs** — each frame animated with motion. Much more realistic. ~$12–18 per episode. + +The `episode_manifest.json` makes selective regeneration possible — if one scene needs a dialogue fix, only that scene's audio gets re-generated. Everything else in the episode stays. + +--- + +#### Iteration support + +Since the constraints specifically ask for easy iteration: + +| User wants to | System does | +|---|---| +| Fix one line of dialogue | Re-generate TTS for that line only. Re-stitch scene audio. | +| Change a scene's mood/visuals | Re-generate that scene's image only. | +| Swap a character out | Re-run script generation with updated cast. User reviews new script. | +| Use only some characters | `characters_present` in episode brief accepts any subset — supported natively. | +| Adjust pacing | Re-run with new `target_duration_sec`. LLM adjusts dialogue density. | + +--- + +#### Cost breakdown + +| Step | Tool | Approx cost | +|---|---|---| +| Script | GPT-4o | $0.05–0.15 | +| Storyboard (10 frames) | DALL-E 3 | $0.40 | +| Voiceover | OpenAI TTS | $0.05–0.10 | +| **Total without video** | | **~$0.60–0.70** | +| Video render | Runway Gen-3 | $12–18 | +| **Total with video** | | **~$13–19** | + +The storyboard + audio package at ~$0.65 is good enough to review and iterate on. Only render the full video once the episode is locked in.