From c5935e7b491c116b640f91344be992e45a542518 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Fri, 20 Feb 2026 00:26:35 +0530 Subject: [PATCH 1/9] Add enhanced production-grade solution for Problem 1 --- GenAI.md | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 203 insertions(+), 1 deletion(-) diff --git a/GenAI.md b/GenAI.md index 3c1fd31b..3e5819e6 100644 --- a/GenAI.md +++ b/GenAI.md @@ -26,7 +26,209 @@ No code required. We want a **clear, practical proposal** with architecture and ### Your Solution for problem 1: -You need to put your solution here. +# Approach Comparison +## 1. Cloud / SaaS-Based Solution +### Architecture +``` +Local Video → Upload → SaaS Transcription → SaaS Summarization → Export → Post-processing +``` +### Pros +* Fastest to implement +* High ASR quality +* Minimal infrastructure + +### Cons +* High per-minute cost for 3–4 hour videos +* Privacy concerns (videos leave local environment) +* Hard blocker for clip/screenshot extraction + > Most SaaS tools (Grain, Otter, Fireflies) do not expose an API to extract raw video clips at custom timestamps. We get a shareable link at best not a downloadable MP4 segment. Screenshot extraction at specific frames is not supported at all in any major SaaS offering. This alone makes SaaS non-viable for this use case's core output requirement. +* Difficult to customise output structure + +### Best For +Small, non-sensitive workloads where speed matters more than cost control. + +## 2. 
Hybrid Architecture +Local media processing + Cloud LLM for structured summarization +### High-Level Pipeline +``` +Batch Orchestrator + ↓ +Proxy Video Generation (Bitrate Laddering) + ↓ +Audio Extraction + ↓ +Local Transcription (faster-whisper) + ↓ +Sliding Window Chunking + ↓ +Recursive LLM Summarization + ↓ +Highlight Merge + Confidence Scoring + ↓ +Clip Extraction + ↓ +Screenshot Extraction + ↓ +Markdown + Report Generation +``` +### Production Optimizations +1. **Bitrate Laddering (Compute Optimization)** + Instead of processing full-resolution video: + ``` + ffmpeg -i input.mp4 -vf scale=-2:480 -c:v libx264 -preset veryfast proxy.mp4 + ``` + * Proxy (480p) used for screenshots and frame sampling + * Original video used only for final clip extraction + * Reduces CPU/GPU usage by 60–80% + * Improves batch throughput significantly + +2. **Sliding Window + Recursive Summarization** + **Problem**: 3–4 hour transcript exceeds safe context window limits. + **Solution** + **Step 1 — Sliding Window Transcription** + * 20–30 minute transcript chunks + * 1–2 minute overlap + * Word-level timestamps preserved + **Step 2 — Level 1 Summaries** + Each chunk → structured JSON summary + **Step 3 — Meta-Summary Pass** + Summaries of summaries → unified highlight ranking + **Step 4 — Deduplication** + * Merge overlapping highlights (±30 sec) + * Remove duplicates + * Enforce minimum clip duration + + This ensures: + * No mid-video information loss + * Balanced coverage + * Reduced hallucination risk + + 3. 
**Timestamp Confidence Scoring** + Each highlight includes a computed confidence score: +``` + { + "title": "Core Architecture Decision", + "start_time": "01:12:33", + "end_time": "01:14:02", + "confidence_score": 0.87, + "confidence_reason": "High transcript clarity, clean silence boundaries" + } +``` +**Confidence factors:** +* Whisper transcription confidence +* Silence detection at boundaries +* Semantic completeness +* Cross-chunk consistency +This improves trust and usability. + +4. **Token Cost Control Strategy** + To control API cost for long transcripts: + * Remove filler words before LLM call + * Use chunk-level summarization before meta-summary + * Temperature = 0 for deterministic JSON + * Strict JSON schema enforcement + * Avoid verbose model outputs +Estimated cost per 3-hour video: +*~$1–3 depending on model and chunk count* + +5. **Idempotent Batch Processing** + Each video maintains processing state: +``` + { + "video_id": "video_001", + "status": "TRANSCRIBED", + "last_successful_stage": "chunk_summarization", + "retry_count": 1 + } +``` +**Pipeline states:** +* INGESTED +* PROXY_GENERATED +* TRANSCRIBED +* CHUNK_SUMMARIZED +* META_SUMMARIZED +* CLIPPED +* SCREENSHOTS_DONE +* COMPLETED +* FAILED +If interrupted, the system resumes from last completed stage. + +6. **Observability & Cost Reporting** + Each video generates: + ``` + processing_report.json + ``` +Includes: +* Duration +* Stage-wise processing time +* LLM token usage +* Estimated API cost +* Highlight count + +Example: +``` +Video: product_masterclass.mp4 +Duration: 3h 18m +Transcription Time: 12m +LLM Tokens: 58,400 +Estimated Cost: $1.94 +Highlights: 11 +Avg Confidence: 0.83 +``` +### Output Folder Structure +``` +output// +├── Summary.md +├── processing_report.json +├── clips/ +│ ├── highlight_01.mp4 +│ └── highlight_02.mp4 +└── screenshots/ + ├── highlight_01.jpg + └── highlight_02.jpg +``` +## 3. 
Fully Offline Architecture +All components run locally: +* Transcription: faster-whisper +* LLM: LLaMA / Mistral (via Ollama or llama.cpp) +* Media: FFmpeg + +### Pros +* Maximum privacy +* No API cost +* Suitable for air-gapped environments +### Cons +* Requires strong GPU +* Lower summarization quality vs GPT-4 class models +* Higher setup complexity +### JSON Reliability Note +Local LLMs (LLaMA/Mistral) are less reliable at strict JSON output than GPT-4 class models. +Mitigation: Use **grammar-constrained decoding** to enforce schema at the token level: +- **llama.cpp**: pass a `.gbnf` grammar file that matches your highlight JSON schema +- **Ollama**: use `format: "json"` in the API call + +This ensures the offline pipeline produces parseable output without post-processing fallbacks. +### Decision Matrix + +| Factor | SaaS | Hybrid (Recommended) | Offline | +| -------------------- | -------- | -------------------- | ------------------ | +| Privacy | Low | High | Maximum | +| Cost Control | Low | High | High | +| Quality | High | Highest | Medium | +| Customization | Limited | Full | Full | +| Batch Reliability | Moderate | High | Hardware-dependent | +| Production Evolution | Low | High | Medium | + +## **My Recommendation** +The **Hybrid Architecture** with recursive summarization, bitrate laddering, and confidence scoring provides the best balance of: +* Cost efficiency +* Privacy +* High-quality summaries +* Deterministic clip alignment +* Batch reliability +* Production-readiness +**It is scalable from a small POC to a production-grade media processing pipeline.** + ## Problem 2: **Zero-Shot Prompt to generate 3 LinkedIn Post** From eb61967df1f2033d4c789d03abf9450158df3543 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Fri, 20 Feb 2026 22:38:51 +0530 Subject: [PATCH 2/9] Enhance GenAI.md with solution comparisons and details Updated the document to include detailed descriptions of cloud-based, hybrid, and 
fully offline solutions for video processing and summarization. Added pros and cons for each approach, along with architecture diagrams and model selection based on hardware. --- GenAI.md | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 71 insertions(+), 5 deletions(-) diff --git a/GenAI.md b/GenAI.md index 3e5819e6..3e9cd429 100644 --- a/GenAI.md +++ b/GenAI.md @@ -28,15 +28,25 @@ No code required. We want a **clear, practical proposal** with architecture and # Approach Comparison ## 1. Cloud / SaaS-Based Solution +**How it works:** Upload videos to an AI-powered SaaS platform (e.g. Grain, Otter.ai, Fireflies, AssemblyAI video pipeline). The platform handles transcription, summarisation, and chapter generation automatically through its hosted API. + ### Architecture ``` Local Video → Upload → SaaS Transcription → SaaS Summarization → Export → Post-processing ``` +| Stage | SaaS Provider Workflow | +| ---------------- | ---------------------- | +| Upload | Videos uploaded via API or web UI to the SaaS provider. | +| Transcription | Provider's ASR engine (e.g., Whisper-based) generates time-coded transcript. | +| Summarisation | Provider's LLM layer produces summary and highlights with timestamps. | +| Asset Extraction | Limited: some tools export clip URLs; screenshots not always supported. | +| Output | JSON / Markdown export via provider API. Custom folder structure requires post-processing script. | + + ### Pros * Fastest to implement * High ASR quality * Minimal infrastructure - ### Cons * High per-minute cost for 3–4 hour videos * Privacy concerns (videos leave local environment) @@ -47,9 +57,10 @@ Local Video → Upload → SaaS Transcription → SaaS Summarization → Export ### Best For Small, non-sensitive workloads where speed matters more than cost control. -## 2. Hybrid Architecture +## 2. 
Hybrid Solution Local media processing + Cloud LLM for structured summarization -### High-Level Pipeline +**How it works**: All media processing (transcription, clip cutting, screenshot extraction) runs locally using open-source tools (Whisper, FFmpeg, OpenCV). Only the text transcript is sent to a cloud LLM (OpenAI GPT-4o / Gemini 1.5 Pro) for intelligent summarisation and highlight detection. This is my recommended approach. +### High-Level Architecture ``` Batch Orchestrator ↓ @@ -71,6 +82,20 @@ Screenshot Extraction ↓ Markdown + Report Generation ``` + +| # | Stage | Description | +|----|----------------------------|-------------| +| 1 | Ingest | Folder watcher detects new video files → adds to SQLite job queue with status INGESTED | +| 2 | Proxy Generation | FFmpeg transcodes to 480p proxy (libx264 veryfast). Used for all frame sampling & screenshots. Original preserved for final clip cuts only. Saves 60–80% CPU vs processing at source resolution. | +| 3 | Audio Extraction | FFmpeg extracts audio track to mono 16 kHz WAV — optimal format for Whisper input. | +| 4 | Transcription | faster-whisper (large-v3, GPU). Produces word-level timestamped JSON. Batch mode: multiple videos transcribed concurrently. Status → TRANSCRIBED. | +| 5 | Sliding Window Chunk | Transcript split into 20–30 min chunks with 1–2 min overlap at silence boundaries. Word timestamps preserved across chunks. | +| 6 | Chunk Summarisation | Each chunk → GPT-4o / Gemini 1.5 Pro (async, parallel calls). Returns structured JSON per chunk: chapter title, highlight timestamps, key points, confidence score. Status → CHUNK_SUMMARIZED. | +| 7 | Meta-Summary Pass | All chunk JSONs merged → second LLM call produces unified highlight ranking. Deduplicates overlapping highlights (±30 s window). Enforces minimum clip duration. Status → META_SUMMARIZED. | +| 8 | Clip Extraction | FFmpeg stream-copies highlight segments from original (full-res) video using precise start/end timestamps (±30 s padding). 
Lossless, fast. Status → CLIPPED. | +| 9 | Screenshot Extraction | FFmpeg -ss -vframes 1 on proxy file extracts keyframe JPG at each highlight timestamp. Fast single-frame decode. Status → SCREENSHOTS_DONE. | +| 10 | Summary Build | Summary.md assembled from meta-summary JSON + relative asset paths. processing_report.json written with token usage, cost estimate, stage timings. Status → COMPLETED. | + ### Production Optimizations 1. **Bitrate Laddering (Compute Optimization)** Instead of processing full-resolution video: @@ -128,6 +153,7 @@ This improves trust and usability. * Temperature = 0 for deterministic JSON * Strict JSON schema enforcement * Avoid verbose model outputs + Estimated cost per 3-hour video: *~$1–3 depending on model and chunk count* @@ -151,7 +177,26 @@ Estimated cost per 3-hour video: * SCREENSHOTS_DONE * COMPLETED * FAILED -If interrupted, the system resumes from last completed stage. + +If the pipeline is interrupted at any stage, the batch orchestrator resumes from the last completed stage — no reprocessing of already-finished steps +``` +INGESTED + ↓ +PROXY_GENERATED + ↓ +TRANSCRIBED + ↓ +CHUNK_SUMMARIZED + ↓ +META_SUMMARIZED + ↓ +CLIPPED + ↓ +SCREENSHOTS_DONE → FAILED (resumable from last stage) + ↓ +COMPLETED + +``` 6. **Observability & Cost Reporting** Each video generates: @@ -187,11 +232,28 @@ output// ├── highlight_01.jpg └── highlight_02.jpg ``` -## 3. Fully Offline Architecture +## 3. Fully Offline Solution All components run locally: * Transcription: faster-whisper * LLM: LLaMA / Mistral (via Ollama or llama.cpp) * Media: FFmpeg +### Architecture +| Component | Fully Offline (Approach 3) Description | +|------------------|----------------------------------------| +| Transcription | faster-whisper (large-v3) on GPU — same as Approach 2. | +| LLM | Local model served via Ollama or llama.cpp. Recommended models: Llama-3.1 70B (if GPU VRAM ≥ 48 GB) or Mistral-7B / Gemma-2 27B for lighter setups. 
| +| Prompt | Same chunked approach as Approach 2. JSON output enforced via grammar-constrained decoding (llama.cpp grammar / Ollama `format: json`). | +| Video Processing | FFmpeg + OpenCV — identical to Approach 2. | +| Limitations | Local LLM quality is lower than GPT-4o for complex summarisation, especially for domain-specific or technical content. | + +## Model Selection by Hardware +| GPU VRAM | Recommended Model | Quality vs Hybrid | Throughput (3-hr video) | +|-------------------------------|--------------------------------|------------------------------|--------------------------| +| ≥ 48 GB (A100 / H100) | Llama-3.1 70B (Q4) | ~85% of GPT-4o quality | ~20 min | +| 24–48 GB (A40 / 3090) | Gemma-2 27B (Q4) | ~75% of GPT-4o quality | ~35 min | +| 12–24 GB (4090 / 3080) | Mistral-7B or Phi-3 Mini | ~60–65% of GPT-4o quality | ~50 min | +| < 12 GB | Not recommended | Quality insufficient for production | — | + ### Pros * Maximum privacy @@ -201,6 +263,7 @@ All components run locally: * Requires strong GPU * Lower summarization quality vs GPT-4 class models * Higher setup complexity + ### JSON Reliability Note Local LLMs (LLaMA/Mistral) are less reliable at strict JSON output than GPT-4 class models. Mitigation: Use **grammar-constrained decoding** to enforce schema at the token level: @@ -208,6 +271,9 @@ Mitigation: Use **grammar-constrained decoding** to enforce schema at the token - **Ollama**: use `format: "json"` in the API call This ensures the offline pipeline produces parseable output without post-processing fallbacks. + +> [!NOTE] +> When to choose Offline: Regulated or confidential content (legal, medical, financial) where no data upload is permitted, and the organisation owns ≥ 24 GB VRAM GPU hardware. Expect ~75–85% of Hybrid quality at the cost of higher setup complexity. 
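
The Ollama `format: "json"` mitigation from the JSON Reliability Note can be sketched as a small request builder. This is a minimal sketch under stated assumptions, not the pipeline's confirmed implementation: the function names, the prompt wording, and the `llama3.1:70b` model tag are placeholders chosen for illustration, and the URL is Ollama's default local address.

```javascript
// Sketch only: builds a request for Ollama's /api/generate endpoint with JSON
// output enabled. Function names, model tag, and prompt text are assumptions.
const OLLAMA_URL = "http://localhost:11434/api/generate"; // Ollama's default local address

function buildOllamaRequest(transcriptChunk, model = "llama3.1:70b") {
  return {
    model,
    prompt:
      "Summarise this transcript chunk and return highlights as JSON " +
      "with keys title, start_time, end_time.\n\n" + transcriptChunk,
    format: "json", // ask Ollama to constrain the response to valid JSON
    stream: false, // return one complete response object instead of a stream
    options: { temperature: 0 },
  };
}

async function summariseChunk(transcriptChunk) {
  const res = await fetch(OLLAMA_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(transcriptChunk)),
  });
  if (!res.ok) throw new Error(`Ollama request failed: ${res.status}`);
  const data = await res.json();
  // The generated text arrives as a string in the `response` field.
  return JSON.parse(data.response);
}
```

Note that `format: "json"` only guarantees syntactically valid JSON. Whether the keys match the highlight schema still needs to be checked after parsing.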
### Decision Matrix | Factor | SaaS | Hybrid (Recommended) | Offline | From 5a651530183beaff879234509293cc3455b1583a Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Sun, 22 Feb 2026 18:08:39 +0530 Subject: [PATCH 3/9] Add zero-shot structured prompt design for Problem 2 (LinkedIn generator) --- GenAI.md | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 179 insertions(+), 2 deletions(-) diff --git a/GenAI.md b/GenAI.md index 3e9cd429..64fc8f18 100644 --- a/GenAI.md +++ b/GenAI.md @@ -274,6 +274,7 @@ This ensures the offline pipeline produces parseable output without post-process > [!NOTE] > When to choose Offline: Regulated or confidential content (legal, medical, financial) where no data upload is permitted, and the organisation owns ≥ 24 GB VRAM GPU hardware. Expect ~75–85% of Hybrid quality at the cost of higher setup complexity. + ### Decision Matrix | Factor | SaaS | Hybrid (Recommended) | Offline | @@ -302,9 +303,185 @@ Design a **single zero-shot prompt** that takes a user’s persona configuration **TASK:** Write a prompt that can work. -### Your Solution for problem 2: +### Problem 2 — Zero-Shot Prompt: LinkedIn Post Generator -You need to put your solution here. +A single prompt call (no fine-tuning, no multi-turn) that accepts a user persona configuration + a topic and returns 3 structurally distinct, LinkedIn-ready post drafts as directly parseable JSON. + +## Design Decisions +| Decision | Rationale | +|----------|-----------| +| Structured Output | Prompt mandates a strict JSON schema. The app calls `JSON.parse()` directly — no regex scraping, no free-text post-processing. | +| Zero-Shot Reliability | Explicit constraints + a worked schema example (few-shot schema, not few-shot content) reduces hallucination risk without inflating token cost with full examples. | +| Style Separation | Three named styles are defined with concrete structural rules and hard word count bounds. 
Prevents the model returning three tonal variations of the same structure. | +| Enum-Constrained Style Field ▸ NEW | The `style` field in the JSON schema is shown as an enum directly inside the schema definition — not just mentioned in the rules. The model sees allowed values where it fills them in, which is more reliable than a rule listed elsewhere. | +| Persona Injection | All persona fields injected as a typed, structured block with inline examples. Avoids vague “write in my voice” instructions that models routinely under-follow. | +| Do/Don't Enforcement | `do_rules` and `dont_rules` serialised as numbered lists. LLMs comply more reliably with numbered constraints than free-form narrative instructions. | +| Hallucination Guard | Explicit ban: the model must not invent statistics, quotes, or external references unless they appear in `topic_context`. Named specifically, not bundled into a general “be accurate” instruction. | +| API-Level JSON Enforcement | In addition to prompt instructions, the API call itself enforces JSON mode: `response_format: { type: 'json_object' }` (OpenAI) or `responseMimeType: 'application/json'` (Gemini). Prompt + API constraint together eliminate malformed output. | + +## The Prompt +### SYSTEM PROMPT +``` +You are a professional LinkedIn ghostwriter and content strategist. +Your only job is to produce valid JSON — nothing else. +Do not include any text, explanation, or markdown outside the JSON object. +Do not wrap the output in code fences. + +Your output must always be a single JSON object matching this exact schema: + +{ + "posts": [ + { + "style": "punchy_insight | narrative_story | actionable_checklist", + "hook": "", + "body": "", + "cta": "", + "hashtags": ["", "", ""], + "estimated_word_count": , + "style_notes": "" + } + ] +} + +Rules you must follow at all times: +1. Return exactly 3 post objects — no more, no fewer. +2. 
Each post must use one of these exact style values: + "punchy_insight" | "narrative_story" | "actionable_checklist" + Each style value must appear exactly once across the 3 posts. +3. Do not add extra keys to the schema. +4. Never invent statistics, case study numbers, quotes, or external references + unless they were explicitly provided in the topic_context field. +5. Obey every item in do_rules and dont_rules without exception. +6. Each post must be meaningfully different in structure — not just tone. + A reader must instantly recognise which style they are reading. +7. LinkedIn formatting: use \n\n between paragraphs. No markdown headers. + Emojis only if emoji_preference is 'yes' or 'sometimes'. +8. All three posts must be ready to publish — no [brackets], no placeholders. +``` + +### STYLE DEFINITIONS (part of system prompt) +``` +STYLE 1 — "punchy_insight" + Structure: + - Hook: one strong declarative sentence. + - 3–5 very short paragraphs (1–2 lines each). + - White space between every paragraph. No story arc. No list. + - End with a thought-provoking question or sharp closing statement. + - Target: under 150 words. + +STYLE 2 — "narrative_story" + Structure: + - Hook opens mid-scene (in medias res — present tense). + - 2–3 paragraph arc: situation → tension/realisation → outcome/lesson. + - Transition to broader takeaway (1 paragraph). + - CTA asks the reader to share their own experience. + - Target: 180–250 words. + +STYLE 3 — "actionable_checklist" + Structure: + - Hook states a clear value promise ("5 things I learned about X"). + - Numbered list of 4–6 items. + - Each item: bold short label + 1–2 sentence explanation. + - Closing paragraph ties the list to the user's broader expertise. + - CTA drives a save or share action. + - Target: 200–280 words. +``` + +### USER PROMPT (filled per API call) +``` +Generate 3 LinkedIn post drafts using the persona and topic below. 
+ +=== PERSONA === +name: {{name}} +professional_background: {{background}} +current_role: {{current_role}} +industry: {{industry}} +tone: {{tone}} + (e.g. "conversational and warm" | "authoritative and direct" | "humble and reflective") +language_style: {{language_style}} + (e.g. "plain English, no jargon" | "industry terms welcome" | "bilingual EN/HI mix") +typical_post_length: {{length_preference}} + (e.g. "short and punchy < 150 words" | "medium 150–250 words" | "long-form") +emoji_preference: {{emoji_preference}} + (yes | no | sometimes) +audience: {{audience}} + (e.g. "startup founders" | "engineering students" | "HR professionals") + +do_rules: +{{numbered list of do rules}} + +dont_rules: +{{numbered list of dont rules}} + +=== TOPIC === +topic: {{topic}} +topic_context: {{optional: key points, personal anecdotes, data to include}} +goal: {{goal}} + (e.g. "build thought leadership" | "drive profile visits" | "encourage comments") + +=== INSTRUCTIONS === +- Follow all system rules strictly. +- Produce exactly 3 posts, one per style. +- Return only the JSON object. No other text. +``` + +## API Call Configuration +The prompt alone is not sufficient: the API call must also enforce JSON mode. Together the two layers make syntactically malformed output very unlikely. JSON mode does not check the schema, so the app should still verify after parsing that exactly 3 posts came back, one per style. 
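
The OpenAI call in this section passes `buildUserPrompt(persona, topic)` without defining it. A minimal sketch of that helper follows, assuming `persona` and `topic` are plain objects whose keys mirror the `{{placeholders}}` in the user prompt and that `do_rules` / `dont_rules` are arrays of strings. Those object shapes, the `numbered` helper, and the `"none provided"` fallback are illustrative assumptions, not part of the original design.

```javascript
// Hypothetical helper: fills the USER PROMPT template from plain objects.
// Object shapes are assumed for illustration; adapt them to the real app.

// Render an array of rules as "1. ...\n2. ..." so the model sees a numbered list.
function numbered(rules) {
  return rules.map((rule, i) => `${i + 1}. ${rule}`).join("\n");
}

function buildUserPrompt(persona, topic) {
  return [
    "Generate 3 LinkedIn post drafts using the persona and topic below.",
    "",
    "=== PERSONA ===",
    `name: ${persona.name}`,
    `professional_background: ${persona.background}`,
    `current_role: ${persona.current_role}`,
    `industry: ${persona.industry}`,
    `tone: ${persona.tone}`,
    `language_style: ${persona.language_style}`,
    `typical_post_length: ${persona.length_preference}`,
    `emoji_preference: ${persona.emoji_preference}`,
    `audience: ${persona.audience}`,
    "",
    "do_rules:",
    numbered(persona.do_rules),
    "",
    "dont_rules:",
    numbered(persona.dont_rules),
    "",
    "=== TOPIC ===",
    `topic: ${topic.topic}`,
    // topic_context is optional; the fallback text is an assumption.
    `topic_context: ${topic.topic_context || "none provided"}`,
    `goal: ${topic.goal}`,
    "",
    "=== INSTRUCTIONS ===",
    "- Follow all system rules strictly.",
    "- Produce exactly 3 posts, one per style.",
    "- Return only the JSON object. No other text.",
  ].join("\n");
}
```

Keeping the builder as a pure function makes the prompt easy to snapshot-test and keeps persona data out of the system prompt, which stays constant across calls.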
+ +### OpenAI +``` +const response = await openai.chat.completions.create({ + model: "gpt-4o", + response_format: { type: 'json_object' }, // ← API-level JSON enforcement + temperature: 0, // ← reduces run-to-run variation + messages: [ + { role: "system", content: SYSTEM_PROMPT }, + { role: "user", content: buildUserPrompt(persona, topic) } + ] +}); +const posts = JSON.parse(response.choices[0].message.content).posts; +``` + +### Gemini +``` +const response = await model.generateContent({ + generationConfig: { + responseMimeType: "application/json", // ← API-level JSON enforcement + temperature: 0, + }, + contents: [{ role: 'user', parts: [{ text: FULL_PROMPT }] }] +}); +const posts = JSON.parse(response.response.text()).posts; +``` +> [!NOTE] +> **Why both layers?** Prompt instructions tell the model what to produce. API-level JSON mode constrains the response to syntactically valid JSON, although a response cut off at the token limit can still be incomplete. JSON mode does not check the schema, so one layer without the other leaves a gap. + +## Sample Filled Persona (Reference) +| Field | Value | +|-------|-------| +| name | Priya Nair | +| current_role | Senior Product Manager at a B2B SaaS startup | +| industry | Product Management / B2B SaaS | +| tone | Conversational, warm, occasionally vulnerable | +| language_style | Plain English, light PM terminology, zero buzzwords | +| emoji_preference | Sometimes (max 2 per post) | +| audience | Early-career PMs, startup founders, product enthusiasts | +| do_rules | 1. Share real personal experiences. 2. Use specific details when available. 3. End with a question that invites discussion. | +| dont_rules | 1. No corporate jargon. 2. No humblebrag tone. 3. Never claim expertise not earned. 4. Avoid passive voice. | +| topic | Why saying no is the most important product skill | +| topic_context | Recently declined a high-visibility feature request from the CEO. The team thanked me later. No external stats, personal experience only. | +| goal | Build thought leadership; encourage PMs to comment with their own "no" stories | + +## Why This Prompt Is Reliable +| Property | How It Is Achieved | +|----------|--------------------| +| Schema-first design | JSON schema is defined before any content rules. The model anchors its output format before reading persona or topic. | +| Enum in schema (not just rules) | `style` field shows allowed values inline: `punchy_insight \| narrative_story \| actionable_checklist`. Model sees the constraint exactly where it writes the value. | +| Hard structural differentiation | Each style has mutually exclusive structural rules (word count, format, arc type), so the model cannot satisfy them by returning three tonal variations of the same structure. | +| Named hallucination ban | Rule 4 specifically bans invented statistics, numbers, quotes, and external references, rather than relying on a vague “be accurate” instruction. | +| Typed persona fields | Each field includes an inline example value. Reduces misinterpretation of abstract descriptors like `tone` or `language_style`. | +| API + prompt JSON enforcement | `response_format` / `responseMimeType` enforces JSON syntax at the API level, so the app can call `JSON.parse()` on the raw completion with no stripping or regex. Schema checks (3 posts, one per style) still run after parsing. | +| Temperature = 0 | Near-deterministic output. The same persona + topic produces structurally consistent drafts, with minimal creative drift between calls. 
| ## Problem 3: **Smart DOCX Template → Bulk DOCX/PDF Generator (Proposal + Prompt)** From 695d07d13003f1727caf32abbc5fbf4769bfd589 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Tue, 24 Feb 2026 02:03:44 +0530 Subject: [PATCH 4/9] Add architecture and AI-assisted design for Problem 3 (DOCX templating system) --- GenAI.md | 204 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 203 insertions(+), 1 deletion(-) diff --git a/GenAI.md b/GenAI.md index 64fc8f18..33f452da 100644 --- a/GenAI.md +++ b/GenAI.md @@ -499,7 +499,209 @@ Submit a **proposal** for building this system using GenAI (OpenAI/Gemini) for ### Your Solution for problem 3: -You need to put your solution here. +## Problem 3 — Smart DOCX Template → Bulk DOCX/PDF Generator + +Users maintain Word templates (offer letters, invoices, certificates, contracts) where only a handful of fields change per document. The system detects editable fields via AI, then handles single and bulk generation with deterministic rendering — no hallucinations, no formatting loss. + +| Principle | What It Means Here | +|-----------|-------------------| +| LLM Containment | LLM is invoked exactly once — for field detection only. Every other step (rendering, validation, PDF conversion) is deterministic. Zero hallucination risk in final documents. | +| Deterministic Rendering | python-docxtpl (Jinja2) replaces only placeholders. Original DOCX XML — tables, headers, footers, logos, signatures — is never touched. | +| Schema Versioning | Every confirmed template gets a version stamp. Bulk jobs record which version was used. Critical for auditing offer letters, contracts, compliance docs. | +| Row-Level Failure Isolation | A single bad spreadsheet row does not abort the batch. The job continues; the report flags every failure independently. | +| Template Injection Defense ▸ NEW | User-supplied field values are sanitized before Jinja2 rendering. 
Jinja2 control characters (`{% %}`, `{{ }}`, `{# #}`) are escaped to prevent template injection attacks that could break rendering or expose config data. | +| Streaming ZIP ▸ NEW | ZIP bundle is streamed to disk as files are rendered — not assembled in memory. Prevents OOM errors on large batches (hundreds/thousands of rows). | +| Data Minimisation | Spreadsheet row data is processed in memory and discarded after the job. It is never persisted to the database — only the generation report is retained. | +| Auditability | Each bulk job produces a structured job summary JSON alongside the CSV report. Supports SLA tracking and compliance reporting. | + +## High-Level Architecture +``` +User Upload DOCX + ↓ +Text Extraction Layer + ↓ +LLM Field Detection + ↓ +Field Schema Confirmation + ↓ +Template Conversion (Placeholder Injection) + ↓ +------------------------------------------- +Single Generation Flow: +Form Input → Validation → Render → DOCX/PDF Download + +Bulk Generation Flow: +Excel/Sheet Upload → Row Validation → Parallel Rendering → ZIP Bundle + Report +``` +## Phase 1 — Template Creation (AI-Assisted Field Detection) +### Step 1: Document Upload & Extraction +| Item | Description | +|------|------------| +| Tool | python-docx | +| What is extracted | Full text with paragraph boundaries and table cell context preserved. Repeated values flagged for `appears_multiple_times` detection. | +| What is NOT sent to LLM | Raw file bytes, images, embedded fonts, binary data. Only the extracted text string is sent. | + +### Step 2 — LLM Field Detection Prompt +LLM is used exactly once. The prompt enforces a strict JSON-only response: +``` +SYSTEM: +You are a document analysis AI. Your only job is to identify fields that +change between different instances of this template document. +Return a JSON array only — no explanation, no markdown, no extra keys. 
+ +Each object in the array must match this schema exactly: +[ + { + "field_key": "snake_case_identifier", + "display_label": "Human Readable Label", + "field_type": "text | date | currency | number | email | boolean", + "sample_value": "", + "context_hint": "", + "required": true | false, + "appears_multiple_times": true | false + } +] + +Rules: +1. Only extract values that realistically differ per document instance. + >(names, dates, amounts, roles, addresses, IDs) +2. Do NOT extract: document title, company name, static boilerplate, + >unless they explicitly vary per instance. +3. If a value appears multiple times (e.g. candidate name in greeting AND signature), set appears_multiple_times: true. + >The system will replace all occurrences. +4. Return the JSON array only. No other text. +``` +### Step 3 — Schema Confirmation UI +| Item | Description | +|------|------------| +| User actions | Add missed fields, remove false positives, rename labels, change field types, mark optional vs required, set date format / currency symbol. | +| Output | Confirmed fields saved as `template_schema.json`. DOCX converted to Jinja2 template via exact string replacement (not a second LLM call). | +| Version stamp | First confirmation = version 1.0. Any schema edit increments the minor version. Major structural changes prompt user to confirm a new major version. 
| + +### Template Schema (Saved JSON) +``` +{ + "template_id": "offer_letter", + "version": "1.0", + "created_at": "2026-02-10T09:00:00Z", + "fields": [ + { + "field_key": "candidate_name", + "display_label": "Candidate Full Name", + "field_type": "text", + "required": true, + "appears_multiple_times": true + }, + { + "field_key": "start_date", + "display_label":"Start Date", + "field_type": "date", + "format": "DD MMMM YYYY", + "required": true + }, + { + "field_key": "salary_annual", + "display_label": "Annual Salary", + "field_type": "currency", + "currency_symbol":"₹", + "required": true + } + ] +} +``` +## Phase 2 — Single Document Generation +### Pipeline +| Stage | Description | +|-------|------------| +| Select Template | User picks a saved template. Form is auto-generated from the field schema — date picker for date fields, currency input for currency, etc. | +| Client Validation | Required field check, type validation (date format, numeric range). Runs in browser before any network call. | +| Server Validation | Re-validates all fields server-side before rendering. Type enforcement, sanitization. Never trust client-side validation alone. | +| Injection Sanitize | Jinja2 control characters (`{% %}`, `{{ }}`, `{# #}`) are escaped in every field value before the render call. Prevents template injection — a docxtpl-specific attack vector where a user submits a malicious Jinja2 expression as a field value. | +| Render DOCX | python-docxtpl renders the Jinja2 template with sanitized values. Original formatting — tables, headers/footers, logos, signatures — is untouched. | +| Convert to PDF | LibreOffice headless: `soffice --headless --convert-to pdf`. Spawned as a subprocess per document. | +| Download | DOCX and/or PDF served as a file download. Filename: `__.` — pattern configurable in template settings. 
| + +## Phase 3 — Bulk Document Generation + +### Spreadsheet Interface +The system generates a downloadable Excel template where column headers exactly match field_key values (with display_label as a comment on the header cell). The user fills one row per document. For Google Sheets, the user provides a share link, the system reads it via Google Sheets API. + +### Pipeline +| Stage | Description | +|-------|------------| +| Parse Sheet | Read Excel (openpyxl) or Google Sheet rows. Map column headers to `field_key` values from schema. Flag unrecognised columns as warnings in the report. | +| Row Validation | Each row validated independently. Checks: required fields present, type correctness, date format parseable, numeric fields numeric. Invalid rows are flagged and skipped — the batch continues. | +| Sanitize (per row) | Jinja2 control characters escaped in every field value of every row before any render call. Applied uniformly regardless of source (Excel or Sheets). | +| Parallel Render | Configurable worker pool (default: CPU count). Each valid row: render DOCX → convert PDF. Worker failures are caught per-row and logged to the report without stopping other workers. | +| Streaming ZIP | Files are written into the ZIP bundle as they are rendered — not buffered in memory first. Prevents OOM errors on large batches. ZIP streamed to disk and made available for download when all rows are done. | +| Report Generation | CSV report: `row_number`, `status`, `file_name`, `error_reason`. Job summary JSON: `rows_total`, `rows_success`, `rows_failed`, `processing_time_seconds`, `template_version`. Both are included in the ZIP and shown as a summary table in the UI. 
| + +## ZIP Output Structure +``` +generated_docs/ +└── offer_letter_v1.0_2026-02-10/ + ├── pdf/ + │ ├── Rohan_Sharma_OfferLetter_20260210.pdf + │ ├── Super_Man_OfferLetter_20260210.pdf + │ └── Anjali_Pathania_OfferLetter_20260210.pdf + ├── docx/ + │ ├── Rohan_Sharma_OfferLetter_20260210.docx + │ └── Super_Man_OfferLetter_20260210.docx + ├── generation_report.csv + └── job_summary.json +``` +## Generation Report (CSV + UI Table) +| Row | Status | File Name | Error Reason | +|-------|------------|--------------------------------------------|--------------------------------| +| 1 | ✓ Success | Rohan_Sharma_OfferLetter_20260210.pdf | — | +| 2 | ✓ Success | Super_Man_OfferLetter_20260210.pdf | — | +| 3 | ✗ Skipped | — | Missing required field: salary_annual | +| 4 | ✗ Skipped | — | Invalid date format: start_date | +| 5 | ✓ Success | Anjali_Pathania_OfferLetter_20260210.pdf | — | + +## Job Summary JSON + +``` +{ + "template_id": "offer_letter", + "template_version": "1.0", + "job_id": "bulk_20260210_001", + "rows_total": 120, + "rows_success": 115, + "rows_skipped": 5, + "processing_time_seconds": 48, + "generated_at": "2026-02-10T11:23:00Z" +} +``` +## Security Boundaries +| Threat | Mitigation | +|--------|------------| +| Template Injection | All field values sanitized before Jinja2 render. Jinja2 control sequences (`{% %}`, `{{ }}`, `{# #}`) are HTML-escaped or stripped. Sandbox mode enabled in Jinja2 environment to prevent code execution. | +| Malicious DOCX upload | File type validated by magic bytes (not extension). DOCX unpacked and inspected before processing. Macro-enabled `.docm` files rejected. | +| Google Sheets data exposure | Service account scoped to read-only. Credentials stored in server environment variables, never in client code or logs. | +| Spreadsheet data retention | Row data processed in memory only. Never written to the database. Only the job summary and error report are retained. 
| +| Rendered document access | Download links are signed, time-limited URLs (expire after 30 minutes). Generated files deleted from server after download or after 24h. | + +## Scalability & Observability + +| Component | Description | +|-----------|------------| +| Worker Pool | Configurable concurrency (default: CPU count). Each worker handles one row end-to-end (validate → render → PDF). Worker failures are isolated — no shared state between rows. | +| Streaming ZIP | Files streamed into ZIP as they complete. No full batch buffer in memory. Handles batches of thousands of rows without OOM risk. | +| Job Queue | Bulk jobs tracked in a lightweight job table (SQLite for single-server, Redis/Postgres for multi-server). Supports retry and resume if server restarts mid-batch. | +| Horizontal Scaling | Workers are stateless — can be run on separate machines. Job queue acts as the coordinator. ZIP assembly happens on the output server. | +| LibreOffice Pool | LibreOffice headless has a high startup cost (~1–2 sec per process). Production setup maintains a warm pool of LibreOffice instances to eliminate per-document startup latency. | + +## Why This Architecture Works +> [!IMPORTANT] +> **LLM is a scalpel, not a paintbrush:** Used once, for the one task where creative inference is needed (field detection). Everything else — rendering, validation, PDF conversion, report generation — is deterministic. This makes the system safe to run at scale on legal and financial documents. + +> [!NOTE] +> **Schema versioning = auditability:** Every bulk job records which template version was used. If an offer letter is disputed 6 months later, you can replay exactly which fields and which template were active at generation time. + +> [!CAUTION] +> **Template injection is a real risk:** docxtpl uses Jinja2 under the hood. A user who submits {% for x in config %} as a field value can crash the render or expose internal configuration. 
Sanitization and Jinja2 sandbox mode are non-negotiable in production. + ## Problem 4: Architecture Proposal for 5-Min Character Video Series Generator From 34eaf94aa5573436d0989f66a752a75ca5d0875b Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Tue, 24 Feb 2026 02:54:14 +0530 Subject: [PATCH 5/9] Add modular architecture proposal for Problem 4 (Character-based video generator) --- GenAI.md | 231 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 230 insertions(+), 1 deletion(-) diff --git a/GenAI.md b/GenAI.md index 33f452da..f976c5e1 100644 --- a/GenAI.md +++ b/GenAI.md @@ -713,4 +713,233 @@ Create a **small, clear architecture proposal** (no code, no prompts) describing ### Your Solution for problem 4: -You need to put your solution here. +## Problem 4 — Character-Based 5-Min Episode Video Series Generator + +Users define a cast of characters once with reference images, personality, voice profiles, and relationships. For each new episode, the user provides a short story prompt. The system generates a complete episode package: script, storyboard, visual assets, audio, and a final rendered video, while maintaining character consistency across every episode in the series. + +## System Overview: Two-Phase Design +| Phase | Description | +|-------|------------| +| A — Series Bible Setup (one-time) | Character definition, relationship graph, world/style rules. Stored as versioned JSON. Injected into every episode generation call as the single source of truth. | +| B — Episode Generation (per episode) | User provides a short story prompt → system runs a 5-stage pipeline: Script → Storyboard → Visuals → Audio → Video Assembly. Each stage is an independent service. 
| + +``` +PHASE A (one-time) PHASE B (per episode) +───────────────────── ────────────────────────────────────────── +Character Setup Episode Prompt + ↓ ↓ +Relationship Graph Stage 1: Script Generation + ↓ ↓ +World / Style Rules Stage 2: Storyboard / Shot Planning + ↓ ↓ +Series Bible JSON ─────────────────→ Stage 3: Visual Asset Generation + ↑ ↓ + │ Episode Memory (written back) Stage 4: Audio Generation + └──────────────────────────────── Stage 5: Video Assembly + ↓ + Final MP4 + Production Package +``` +## Phase A — Series Bible (Single Source of Truth) +The Series Bible is the authoritative configuration for the entire series. It is injected in full into every episode generation call. Every stage — script, visuals, audio — reads from it. Characters never need to be re-described per episode. + +### Character Schema +``` +{ + "character_id": "char_maya", + "name": "Maya", + "age": 28, + "visual_description": "South Asian woman, shoulder-length black hair, + often in yellow kurta, warm-toned skin, expressive dark eyes", + "reference_images": ["maya_ref_01.png", "maya_ref_02.png"], + "personality_traits": ["optimistic", "impulsive", "loyal", "bad at taking advice"], + "speaking_style": "Fast-paced, metaphor-heavy, ends sentences with rhetorical questions", + "behavioral_rules": [ + "Never admits fear directly — deflects with humour", + "Protective of her younger brother Rohan" + ], + "voice_profile": { + "provider": "ElevenLabs", + "voice_id": "xyz123", + "speed": 1.05, + "pitch_shift": 0 + } +} +``` +### Relationship Graph (Structured Edges) +Stored as a directed graph of edge objects. Injected into script generation to enforce interaction consistency — the LLM knows not just who the characters are, but how they relate and where that relationship currently stands. 
+``` +{ + "edges": [ + { + "from": "char_maya", + "to": "char_rohan", + "type": "sibling", + "dynamic": "protective", + "current_state": "supportive but argumentative after last episode's fight" + }, + { + "from": "char_maya", + "to": "char_priya", + "type": "mentor", + "dynamic": "warm", + "current_state": "maya increasingly doubts priya's advice" + } + ] +} +``` +### World Rules +``` +{ + "setting": "Urban Indian neighbourhood, present day", + "tone": "Light drama with humour", + "recurring_themes": ["family responsibility", "career growth", "self-doubt"] +} +``` +### Episode Memory Log +After each episode is generated and approved, the system writes a short continuity summary back into the Series Bible. This is what enables true multi-episode coherence, the script LLM in Episode 3 knows what happened in Episodes 1 and 2. + +``` +"episode_log": [ + { + "episode_id": "ep_001", + "title": "The Interview", + "summary": "Maya gets a job offer in another city. Rohan is supportive but hurt.", + "relationship_state_changes": [ + { "edge": "maya→rohan", "new_state": "strained — maya feeling guilty" } + ], + "unresolved_threads": ["Maya has not yet told her parents about the job offer"] + } +] +``` +> [!IMPORTANT] +> **Why this matters:** Without episode memory, the script LLM treats every episode as isolated. With it, character arcs carry forward — Maya's guilt from Episode 1 can surface in Episode 2's dialogue naturally, without the user having to re-explain it in every prompt. + +## Phase B — Episode Generation Pipeline + +### Stage 1 — Script Generation +| Stage | Description | +|-------|------------| +| Input | Series Bible JSON (full) + episode prompt (situation, cast subset, tone, goal) + `episode_log` (previous episode summaries for continuity). | +| LLM | GPT-4o or Claude 3.5 Sonnet. Long-context model to hold full Series Bible + episode log without truncation. | +| Output Schema | JSON array of scenes. 
Each scene: `scene_id`, `setting`, `characters_present`, `action_description`, `dialogue` [{`character_id`, `line`, `emotion`, `direction`}], `estimated_duration_seconds`. | +| Pacing Control | Speech duration estimated at 130–150 WPM. Sum of scene durations must be 270–310 seconds (4.5–5.5 min). If over → trim lowest-priority scene. If under → expand a key dialogue exchange. | +| Character Fidelity | Series Bible fields — `personality_traits`, `speaking_style`, `behavioral_rules` — injected per-character into the prompt. Instruction: “Every dialogue line must be consistent with the character's speaking_style and behavioral_rules above.” Applied to every scene, not just globally. | + +### Script Scene Schema +``` +[ + { + "scene_id": 1, + "setting": "Kitchen, morning", + "characters_present": ["char_maya", "char_rohan"], + "action_description": "Morning disagreement about career decision", + "estimated_duration_seconds": 65, + "dialogue": [ + { "character_id": "char_maya", "line": "You think it's that simple?", "emotion": "frustrated", "direction": "turns away" }, + { "character_id": "char_rohan", "line": "I think you're scared.", "emotion": "calm", "direction": "steady eye contact" } + ] + } +] +``` +### Stage 2 — Storyboard / Shot Planning +| Stage | Description | +|-------|------------| +| Input | Generated script JSON. | +| Process | Second LLM pass produces a shot list. One or more shots per scene. Each shot: `shot_type` (wide/medium/close-up/reaction), `camera_movement` (static/pan/zoom), `characters_in_frame`, `background_description`, `mood/lighting`. Separates narrative logic from visual composition. | +|| Output | Shot list JSON attached to the episode package. Each shot becomes one image generation call in Stage 3. | + +### Stage 3 — Visual Asset Generation +| Stage | Description | +|-------|------------| +| Character Images | Each shot generated via SDXL or DALL-E 3. Prompt = `visual_description` (Series Bible) + emotion + setting + lighting. 
Reference image fed as IP-Adapter style anchor for face/style consistency. | +| Backgrounds | Generated separately per unique setting. Reused across shots with the same setting — avoids per-shot inconsistency and reduces generation cost. | +| CLIP Similarity Gate | Each generated image scored against character reference via CLIP cosine similarity. Images below 0.85 threshold are flagged for user review or auto-regenerated (max 2 retries before escalating to user). | + +### **Character Consistency: Layered Strategy** + +| Layer | Method | How It Works | Cost | +|-------|--------|-------------|------| +| 1 | Textual Anchoring | Detailed `visual_description` injected into every image prompt. Age, skin tone, hair, clothing signature all specified. | Zero — always active | +| 2 | IP-Adapter | Reference image fed as style/identity anchor at inference time. No training required. | Per-call inference overhead only | +| 3 | Character LoRA | Fine-tuned LoRA trained on 15–30 reference images per character. Best consistency results. One-time training cost. | One-time GPU training (~30 min) | +| 4 | CLIP Gating | Cosine similarity vs reference. Score < 0.85 → regenerate (max 2 retries) → escalate to user. | Per-image scoring: negligible | + +### Stage 4 — Audio Generation +| Stage | Description | +|-------|------------| +| Dialogue / Voiceover | Each dialogue line sent to ElevenLabs (or Azure Neural TTS) with character's `voice_id` and `speed` from Series Bible. Emotion tags mapped to ElevenLabs expression controls (e.g., `frustrated` → raised energy, faster pace). | +| Narration | Optional narrator voice for scene transitions. Defined in Series Bible as a separate voice profile if the series uses narration style. | +| Background Music | Royalty-free track generated or selected (Mubert / Suno AI) based on episode tone tag. Mixed at lower volume than dialogue. Fade-in/fade-out applied at episode start/end. | +| Audio Sync | TTS audio duration measured programmatically per line. 
Scene image display duration adjusted to match audio length. Ensures total episode stays within ±10 seconds of 5 minutes — audio is the timing master. | + +### Stage 5 — Video Assembly +| Stage | Description | +|-------|------------| +| Composition | FFmpeg or MoviePy. Background image + character images composited as layers per shot. Subtle Ken Burns pan/zoom applied to static images to add motion. Scene transitions: hard cut or 0.5s cross-dissolve based on tone. | +| Subtitles | Dialogue lines timestamped to TTS audio output. Auto-generated SRT subtitle file. Burnt-in or as a separate track — user selects at export. | +| Output Formats | 16:9 (YouTube/desktop) or 9:16 (Reels/Shorts). Resolution: 1080p. User selects at episode creation time. | +| Production Package | ZIP bundle: `final_episode.mp4` + `script.json` + `shot_list.json` + `images/` + `audio/` + `subtitles.srt`. Enables manual re-edit in DaVinci Resolve or Premiere without regenerating assets. | + +### Output Package Structure +``` +episodes/ep_002_the_decision/ +├── final_episode.mp4 +├── script.json +├── shot_list.json +├── subtitles.srt +├── images/ +│ ├── scene_01_shot_01.png +│ ├── scene_01_shot_02.png +│ └── scene_02_shot_01.png +└── audio/ + ├── maya_line_01.mp3 + ├── rohan_line_01.mp3 + └── bgm_episode.mp3 +``` +## Scene-Level Regeneration +Full episode regeneration is expensive — 5 stages × multiple API calls. The system supports surgical re-generation so users can iterate without restarting the entire pipeline. 
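As a rough illustration of this selective re-run, the sketch below maps each kind of edit to the stages it touches. The `Stage` enum, the action keys, and both helper functions are hypothetical names introduced for illustration; the stage sets themselves are taken from the user-action table in this section.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The five pipeline stages, numbered in execution order."""
    SCRIPT = 1
    STORYBOARD = 2
    VISUALS = 3
    AUDIO = 4
    ASSEMBLY = 5


# Hypothetical action keys; each set lists the stages that must re-run.
RERUN_PLAN = {
    "regenerate_dialogue": {Stage.SCRIPT, Stage.AUDIO, Stage.ASSEMBLY},
    "change_emotion": {Stage.SCRIPT, Stage.VISUALS, Stage.AUDIO, Stage.ASSEMBLY},
    "swap_character": {Stage.VISUALS, Stage.ASSEMBLY},
    "adjust_pacing": {Stage.SCRIPT, Stage.ASSEMBLY},
    "change_bgm": {Stage.AUDIO, Stage.ASSEMBLY},
}


def stages_to_rerun(action: str) -> list[Stage]:
    """Return the stages to re-run for one edit, in pipeline order."""
    if action not in RERUN_PLAN:
        raise ValueError(f"unknown action: {action}")
    return sorted(RERUN_PLAN[action])


def stages_skipped(action: str) -> list[Stage]:
    """Return the stages whose cached outputs can be reused as-is."""
    rerun = set(stages_to_rerun(action))
    return [stage for stage in Stage if stage not in rerun]
```

Under these assumptions, `stages_skipped("change_bgm")` leaves the script, storyboard, and visuals untouched, which is where most of the generation cost sits.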
+ +| User Action | What Reruns | What Is Skipped | +|-------------|------------|-----------------| +| Regenerate scene dialogue | Stage 1 (that scene only) → Stage 4 audio for that scene → Stage 5 re-assembly | All other scenes, all images | +| Change scene emotion/tone | Stage 1 (that scene) → Stage 3 images for affected shots → Stage 4 audio → Stage 5 | Unaffected scenes and shots | +| Swap a character from episode | Stage 3 image re-generation for all shots with that character → Stage 5 re-assembly | Script, audio, other characters | +| Adjust pacing / trim scene | Stage 1 duration recalculation → Stage 5 re-assembly only | All assets — no regeneration cost | +| Change background music tone | Stage 4 BGM only → Stage 5 re-assembly | All character assets and dialogue | +> [!Warning] +> **Cost impact:** Regenerating a single scene's dialogue costs ~5% of a full episode generation. Without scene-level granularity, every small edit forces a full pipeline re-run — unacceptable for iterative content creation. + +## Microservice Architecture +Each stage runs as an independent service. This enables parallel processing, horizontal scaling, and service replacement without affecting the rest of the pipeline. 
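A minimal sketch of what "independent and replaceable" could look like in code, assuming each stage exposes a single `run` method over a shared episode package. The `StageService` protocol, the stub classes, and `run_pipeline` are illustrative assumptions, not part of the proposal; real services would call an LLM, a TTS provider, and so on.

```python
from typing import Any, Protocol


class StageService(Protocol):
    """Interface every stage is assumed to expose (hypothetical)."""

    name: str

    def run(self, package: dict[str, Any]) -> dict[str, Any]: ...


class StubScriptService:
    name = "script"

    def run(self, package: dict[str, Any]) -> dict[str, Any]:
        # A real implementation would call an LLM API here.
        scene = {"scene_id": 1, "dialogue": [{"character_id": "char_maya", "line": "Hi"}]}
        return {**package, "script": [scene]}


class StubAudioService:
    name = "audio"

    def run(self, package: dict[str, Any]) -> dict[str, Any]:
        # A real implementation would send one TTS request per dialogue line.
        line_count = sum(len(scene["dialogue"]) for scene in package["script"])
        return {**package, "audio_line_count": line_count}


def run_pipeline(services: list[StageService], prompt: str) -> dict[str, Any]:
    """Pass the episode package through each service in order."""
    package: dict[str, Any] = {"prompt": prompt}
    for service in services:
        package = service.run(package)
    return package
```

Because the pipeline depends only on the `run` contract, swapping one TTS or image provider for another would mean replacing a single class in the list passed to `run_pipeline`.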
+ +| Service | Responsibility | Scales With | Replaceable With | +|----------|---------------|-------------|------------------| +| Script Service | LLM call → scene JSON | Episode request volume | Any LLM API | +| Visual Service | Image generation per shot | Shot count (most expensive) | Any T2I model or API | +| Audio Service | TTS per dialogue line + BGM | Dialogue line count | Any TTS provider | +| Assembly Service | FFmpeg composition → MP4 | Resolution + scene count | Any video rendering tool | +| Series Bible Store | JSON versioning + episode log | Series count (lightweight) | Any key-value store | + +## Core Challenges & Solutions + +| Challenge | Solution | +|------------|----------| +| Character visual drift across episodes | 4-layer consistency strategy: textual anchoring → IP-Adapter → LoRA (optional) → CLIP similarity gating at 0.85 threshold | +| Personality inconsistency in dialogue | `behavioral_rules` + `speaking_style` injected per-character into every scene prompt, not just globally | +| Duration mismatch | Word-count-based speech estimation (130–150 WPM). Trim/expand pass before finalising script. Audio sync makes TTS the timing master for video assembly | +| Multi-episode continuity drift | Episode memory log written back to Series Bible after each episode. Unresolved threads and relationship state changes persist as context for the next episode's LLM call | +| High regeneration cost | Scene-level regeneration: only re-run affected stages for the changed scene. Full pipeline only on new episodes | +| Partial cast episodes | `cast_subset` field in episode prompt. 
Script LLM instructed to write only for listed characters; absent characters may be referenced but not present on screen | + +## Architecture Summary +| Stage | Technology | Output | Service | +|-------|------------|--------|---------| +| Series Bible | JSON Schema + Web UI | Character/world config + episode log | Series Bible Store | +| Script Gen | GPT-4o / Claude 3.5 Sonnet | Scene-by-scene script JSON | Script Service | +| Storyboard | LLM (2nd pass) | Shot list per scene | Script Service | +| Visuals | SDXL + IP-Adapter / LoRA | Character + background images | Visual Service | +| Audio | ElevenLabs + Mubert/Suno | Dialogue audio + BGM | Audio Service | +| Video Assembly | FFmpeg / MoviePy | Final 5-min MP4 + package ZIP | Assembly Service | + +> [!TIP] +> **Key design principle:** The Series Bible is the single source of truth, injected into every stage of every episode. Episode Memory ensures the series has a living continuity: what happened in Episode 1 shapes how characters behave in Episode 5, without the user having to re-explain it every time. From d24b4b7e2661f3d20faa49d746b7ed30fef6d78f Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Tue, 24 Feb 2026 03:04:21 +0530 Subject: [PATCH 6/9] Update GenAI.md --- GenAI.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/GenAI.md b/GenAI.md index f976c5e1..d651ae5e 100644 --- a/GenAI.md +++ b/GenAI.md @@ -564,11 +564,11 @@ Each object in the array must match this schema exactly: Rules: 1. Only extract values that realistically differ per document instance. - >(names, dates, amounts, roles, addresses, IDs) + (names, dates, amounts, roles, addresses, IDs) 2. Do NOT extract: document title, company name, static boilerplate, - >unless they explicitly vary per instance. + unless they explicitly vary per instance. 3. If a value appears multiple times (e.g. candidate name in greeting AND signature), set appears_multiple_times: true. 
- >The system will replace all occurrences. + The system will replace all occurrences. 4. Return the JSON array only. No other text. ``` ### Step 3 — Schema Confirmation UI @@ -846,7 +846,7 @@ After each episode is generated and approved, the system writes a short continui |-------|------------| | Input | Generated script JSON. | | Process | Second LLM pass produces a shot list. One or more shots per scene. Each shot: `shot_type` (wide/medium/close-up/reaction), `camera_movement` (static/pan/zoom), `characters_in_frame`, `background_description`, `mood/lighting`. Separates narrative logic from visual composition. | -|| Output | Shot list JSON attached to the episode package. Each shot becomes one image generation call in Stage 3. | +| Output | Shot list JSON attached to the episode package. Each shot becomes one image generation call in Stage 3. | ### Stage 3 — Visual Asset Generation | Stage | Description | From 909614560aa15260cd6ab8eba897a8517a0bc557 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Tue, 24 Feb 2026 13:27:23 +0530 Subject: [PATCH 7/9] Enhance GenAI.md with failure handling and cost details --- GenAI.md | 21 ++++++++++++++++++++- 1 file changed, 20 insertions(+), 1 deletion(-) diff --git a/GenAI.md b/GenAI.md index d651ae5e..dd48e9d3 100644 --- a/GenAI.md +++ b/GenAI.md @@ -232,6 +232,14 @@ output// ├── highlight_01.jpg └── highlight_02.jpg ``` + +### LLM Failure Handling Strategy + +- Each chunk-level LLM call has retry logic (max 3 attempts with exponential backoff). +- If a single chunk fails permanently, that chunk is marked `FAILED` and excluded from meta-summary. +- The final Summary.md includes a warning section listing any skipped chunks. +- The job does not fail entirely due to a single chunk failure. + ## 3. 
Fully Offline Solution All components run locally: * Transcription: faster-whisper @@ -303,7 +311,7 @@ Design a **single zero-shot prompt** that takes a user’s persona configuration **TASK:** Write a prompt that can work. -### Problem 2 — Zero-Shot Prompt: LinkedIn Post Generator +## Problem 2 — Zero-Shot Prompt: LinkedIn Post Generator A single prompt call (no fine-tuning, no multi-turn) that accepts a user persona configuration + a topic and returns 3 structurally distinct, LinkedIn-ready post drafts as directly parseable JSON. @@ -357,6 +365,7 @@ Rules you must follow at all times: 7. LinkedIn formatting: use \n\n between paragraphs. No markdown headers. Emojis only if emoji_preference is 'yes' or 'sometimes'. 8. All three posts must be ready to publish — no [brackets], no placeholders. +9. 9. If topic_context is empty and the topic is ambiguous, do not fabricate assumptions. Instead, interpret the topic generically and avoid specific claims. ``` ### STYLE DEFINITIONS (part of system prompt) @@ -909,6 +918,16 @@ Full episode regeneration is expensive — 5 stages × multiple API calls. The s > [!Warning] > **Cost impact:** Regenerating a single scene's dialogue costs ~5% of a full episode generation. Without scene-level granularity, every small edit forces a full pipeline re-run — unacceptable for iterative content creation. +## Cost Considerations + +Image generation is the dominant cost driver (per-shot calls). +To control cost: +- Reuse background images across scenes when setting unchanged. +- Cache character expressions where emotion is repeated. +- Limit regeneration retries to a maximum threshold. +- Allow “Script + Asset Package Only” mode without auto-rendering final MP4. + + ## Microservice Architecture Each stage runs as an independent service. This enables parallel processing, horizontal scaling, and service replacement without affecting the rest of the pipeline. 
From dbeb13811138c8d24edf505c995942be37cbf316 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Tue, 24 Feb 2026 13:31:25 +0530 Subject: [PATCH 8/9] Fix duplicate numbering in GenAI.md --- GenAI.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/GenAI.md b/GenAI.md index dd48e9d3..934a383b 100644 --- a/GenAI.md +++ b/GenAI.md @@ -365,7 +365,7 @@ Rules you must follow at all times: 7. LinkedIn formatting: use \n\n between paragraphs. No markdown headers. Emojis only if emoji_preference is 'yes' or 'sometimes'. 8. All three posts must be ready to publish — no [brackets], no placeholders. -9. 9. If topic_context is empty and the topic is ambiguous, do not fabricate assumptions. Instead, interpret the topic generically and avoid specific claims. +9. If topic_context is empty and the topic is ambiguous, do not fabricate assumptions. Instead, interpret the topic generically and avoid specific claims. ``` ### STYLE DEFINITIONS (part of system prompt) From f220ec13053eabafb190a6fe026ecd78241d4f52 Mon Sep 17 00:00:00 2001 From: Rohan Sharma <155351638+Rohan29-De@users.noreply.github.com> Date: Sat, 28 Feb 2026 13:23:49 +0530 Subject: [PATCH 9/9] Fix code block formatting in GenAI.md --- GenAI.md | 52 ++++++++++++++++++++++++++-------------------------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/GenAI.md b/GenAI.md index 934a383b..439449f2 100644 --- a/GenAI.md +++ b/GenAI.md @@ -31,7 +31,7 @@ No code required. We want a **clear, practical proposal** with architecture and **How it works:** Upload videos to an AI-powered SaaS platform (e.g. Grain, Otter.ai, Fireflies, AssemblyAI video pipeline). The platform handles transcription, summarisation, and chapter generation automatically through its hosted API. 
### Architecture -``` +```text Local Video → Upload → SaaS Transcription → SaaS Summarization → Export → Post-processing ``` | Stage | SaaS Provider Workflow | @@ -61,7 +61,7 @@ Small, non-sensitive workloads where speed matters more than cost control. Local media processing + Cloud LLM for structured summarization **How it works**: All media processing (transcription, clip cutting, screenshot extraction) runs locally using open-source tools (Whisper, FFmpeg, OpenCV). Only the text transcript is sent to a cloud LLM (OpenAI GPT-4o / Gemini 1.5 Pro) for intelligent summarisation and highlight detection. This is my recommended approach. ### High-Level Architecture -``` +```text Batch Orchestrator ↓ Proxy Video Generation (Bitrate Laddering) @@ -99,7 +99,7 @@ Markdown + Report Generation ### Production Optimizations 1. **Bitrate Laddering (Compute Optimization)** Instead of processing full-resolution video: - ``` + ```bash ffmpeg -i input.mp4 -vf scale=-2:480 -c:v libx264 -preset veryfast proxy.mp4 ``` * Proxy (480p) used for screenshots and frame sampling @@ -130,7 +130,7 @@ Markdown + Report Generation 3. **Timestamp Confidence Scoring** Each highlight includes a computed confidence score: -``` +```json { "title": "Core Architecture Decision", "start_time": "01:12:33", @@ -159,7 +159,7 @@ Estimated cost per 3-hour video: 5. **Idempotent Batch Processing** Each video maintains processing state: -``` +```json { "video_id": "video_001", "status": "TRANSCRIBED", @@ -179,7 +179,7 @@ Estimated cost per 3-hour video: * FAILED If the pipeline is interrupted at any stage, the batch orchestrator resumes from the last completed stage — no reprocessing of already-finished steps -``` +```text INGESTED ↓ PROXY_GENERATED @@ -200,7 +200,7 @@ COMPLETED 6. 
**Observability & Cost Reporting** Each video generates: - ``` + ```text processing_report.json ``` Includes: @@ -211,7 +211,7 @@ Includes: * Highlight count Example: -``` +```text Video: product_masterclass.mp4 Duration: 3h 18m Transcription Time: 12m @@ -221,7 +221,7 @@ Highlights: 11 Avg Confidence: 0.83 ``` ### Output Folder Structure -``` +```text output// ├── Summary.md ├── processing_report.json @@ -329,7 +329,7 @@ A single prompt call (no fine-tuning, no multi-turn) that accepts a user persona ## The Prompt ### SYSTEM PROMPT -``` +```text You are a professional LinkedIn ghostwriter and content strategist. Your only job is to produce valid JSON — nothing else. Do not include any text, explanation, or markdown outside the JSON object. @@ -369,7 +369,7 @@ Rules you must follow at all times: ``` ### STYLE DEFINITIONS (part of system prompt) -``` +```text STYLE 1 — "punchy_insight" Structure: - Hook: one strong declarative sentence. @@ -397,7 +397,7 @@ STYLE 3 — "actionable_checklist" ``` ### USER PROMPT (filled per API call) -``` +```text Generate 3 LinkedIn post drafts using the persona and topic below. === PERSONA === @@ -438,7 +438,7 @@ goal: {{goal}} The prompt alone is not sufficient — the API call must also enforce JSON mode. Both layers together make malformed output practically impossible. ### OpenAI -``` +```javascript const response = await openai.chat.completions.create({ model: "gpt-4o", response_format: { type: 'json_object' }, // ← API-level JSON enforcement @@ -452,7 +452,7 @@ const posts = JSON.parse(response.choices[0].message.content).posts; ``` ### Gemini -``` +```javascript const response = await model.generateContent({ generationConfig: { responseMimeType: "application/json", // ← API-level JSON enforcement @@ -524,7 +524,7 @@ Users maintain Word templates (offer letters, invoices, certificates, contracts) | Auditability | Each bulk job produces a structured job summary JSON alongside the CSV report. 
Supports SLA tracking and compliance reporting. | ## High-Level Architecture -``` +```text User Upload DOCX ↓ Text Extraction Layer @@ -552,7 +552,7 @@ Excel/Sheet Upload → Row Validation → Parallel Rendering → ZIP Bundle + Re ### Step 2 — LLM Field Detection Prompt LLM is used exactly once. The prompt enforces a strict JSON-only response: -``` +```text SYSTEM: You are a document analysis AI. Your only job is to identify fields that change between different instances of this template document. @@ -588,7 +588,7 @@ Rules: | Version stamp | First confirmation = version 1.0. Any schema edit increments the minor version. Major structural changes prompt user to confirm a new major version. | ### Template Schema (Saved JSON) -``` +```json { "template_id": "offer_letter", "version": "1.0", @@ -646,7 +646,7 @@ The system generates a downloadable Excel template where column headers exactly | Report Generation | CSV report: `row_number`, `status`, `file_name`, `error_reason`. Job summary JSON: `rows_total`, `rows_success`, `rows_failed`, `processing_time_seconds`, `template_version`. Both are included in the ZIP and shown as a summary table in the UI. | ## ZIP Output Structure -``` +```text generated_docs/ └── offer_letter_v1.0_2026-02-10/ ├── pdf/ @@ -670,7 +670,7 @@ generated_docs/ ## Job Summary JSON -``` +```json { "template_id": "offer_letter", "template_version": "1.0", @@ -732,7 +732,7 @@ Users define a cast of characters once with reference images, personality, voice | A — Series Bible Setup (one-time) | Character definition, relationship graph, world/style rules. Stored as versioned JSON. Injected into every episode generation call as the single source of truth. | | B — Episode Generation (per episode) | User provides a short story prompt → system runs a 5-stage pipeline: Script → Storyboard → Visuals → Audio → Video Assembly. Each stage is an independent service. 
|
-```
+```text
 PHASE A (one-time)          PHASE B (per episode)
 ─────────────────────       ──────────────────────────────────────────
 Character Setup             Episode Prompt
@@ -752,7 +752,7 @@ Series Bible JSON ─────────────────→ Stage
 The Series Bible is the authoritative configuration for the entire series. It is injected in full into every episode generation call. Every stage — script, visuals, audio — reads from it. Characters never need to be re-described per episode.
 ### Character Schema
-```
+```json
 {
   "character_id": "char_maya",
   "name": "Maya",
@@ -776,7 +776,7 @@ The Series Bible is the authoritative configuration for the entire series. It is
 ```
 ### Relationship Graph (Structured Edges)
 Stored as a directed graph of edge objects. Injected into script generation to enforce interaction consistency — the LLM knows not just who the characters are, but how they relate and where that relationship currently stands.
-```
+```json
 {
   "edges": [
     {
@@ -797,7 +797,7 @@ Stored as a directed graph of edge objects. Injected into script generation to e
 }
 ```
 ### World Rules
-```
+```json
 {
   "setting": "Urban Indian neighbourhood, present day",
   "tone": "Light drama with humour",
@@ -807,7 +807,7 @@ Stored as a directed graph of edge objects. Injected into script generation to e
 ### Episode Memory Log
 After each episode is generated and approved, the system writes a short continuity summary back into the Series Bible. This is what enables true multi-episode coherence, the script LLM in Episode 3 knows what happened in Episodes 1 and 2.
-```
+```json
 "episode_log": [
   {
     "episode_id": "ep_001",
@@ -835,7 +835,7 @@ After each episode is generated and approved, the system writes a short continui
 | Character Fidelity | Series Bible fields — `personality_traits`, `speaking_style`, `behavioral_rules` — injected per-character into the prompt. Instruction: “Every dialogue line must be consistent with the character's speaking_style and behavioral_rules above.” Applied to every scene, not just globally.
|
 ### Script Scene Schema
-```
+```json
 [
   {
     "scene_id": 1,
@@ -890,7 +890,7 @@ After each episode is generated and approved, the system writes a short continui
 | Production Package | ZIP bundle: `final_episode.mp4` + `script.json` + `shot_list.json` + `images/` + `audio/` + `subtitles.srt`. Enables manual re-edit in DaVinci Resolve or Premiere without regenerating assets. |
 ### Output Package Structure
-```
+```text
 episodes/ep_002_the_decision/
 ├── final_episode.mp4
 ├── script.json