diff --git a/GenAI.md b/GenAI.md index 3c1fd31b..439449f2 100644 --- a/GenAI.md +++ b/GenAI.md @@ -26,7 +26,284 @@ No code required. We want a **clear, practical proposal** with architecture and ### Your Solution for problem 1: -You need to put your solution here. +# Approach Comparison +## 1. Cloud / SaaS-Based Solution +**How it works:** Upload videos to an AI-powered SaaS platform (e.g. Grain, Otter.ai, Fireflies, AssemblyAI video pipeline). The platform handles transcription, summarisation, and chapter generation automatically through its hosted API. + +### Architecture +```text +Local Video → Upload → SaaS Transcription → SaaS Summarization → Export → Post-processing +``` +| Stage | SaaS Provider Workflow | +| ---------------- | ---------------------- | +| Upload | Videos uploaded via API or web UI to the SaaS provider. | +| Transcription | Provider's ASR engine (e.g., Whisper-based) generates time-coded transcript. | +| Summarisation | Provider's LLM layer produces summary and highlights with timestamps. | +| Asset Extraction | Limited: some tools export clip URLs; screenshots not always supported. | +| Output | JSON / Markdown export via provider API. Custom folder structure requires post-processing script. | + + +### Pros +* Fastest to implement +* High ASR quality +* Minimal infrastructure +### Cons +* High per-minute cost for 3–4 hour videos +* Privacy concerns (videos leave local environment) +* Hard blocker for clip/screenshot extraction + > Most SaaS tools (Grain, Otter, Fireflies) do not expose an API to extract raw video clips at custom timestamps. We get a shareable link at best not a downloadable MP4 segment. Screenshot extraction at specific frames is not supported at all in any major SaaS offering. This alone makes SaaS non-viable for this use case's core output requirement. +* Difficult to customise output structure + +### Best For +Small, non-sensitive workloads where speed matters more than cost control. + +## 2. 
Hybrid Solution
+Local media processing + Cloud LLM for structured summarization
+**How it works**: All media processing (transcription, clip cutting, screenshot extraction) runs locally using open-source tools (Whisper, FFmpeg, OpenCV). Only the text transcript is sent to a cloud LLM (OpenAI GPT-4o / Gemini 1.5 Pro) for intelligent summarisation and highlight detection. This is my recommended approach.
+### High-Level Architecture
+```text
+Batch Orchestrator
+ ↓
+Proxy Video Generation (Bitrate Laddering)
+ ↓
+Audio Extraction
+ ↓
+Local Transcription (faster-whisper)
+ ↓
+Sliding Window Chunking
+ ↓
+Recursive LLM Summarization
+ ↓
+Highlight Merge + Confidence Scoring
+ ↓
+Clip Extraction
+ ↓
+Screenshot Extraction
+ ↓
+Markdown + Report Generation
+```
+
+| # | Stage | Description |
+|----|----------------------------|-------------|
+| 1 | Ingest | Folder watcher detects new video files → adds to SQLite job queue with status INGESTED |
+| 2 | Proxy Generation | FFmpeg transcodes to 480p proxy (libx264 veryfast). Used for all frame sampling & screenshots. Original preserved for final clip cuts only. Saves 60–80% CPU vs processing at source resolution. |
+| 3 | Audio Extraction | FFmpeg extracts audio track to mono 16 kHz WAV — optimal format for Whisper input. |
+| 4 | Transcription | faster-whisper (large-v3, GPU). Produces word-level timestamped JSON. Batch mode: multiple videos transcribed concurrently. Status → TRANSCRIBED. |
+| 5 | Sliding Window Chunk | Transcript split into 20–30 min chunks with 1–2 min overlap at silence boundaries. Word timestamps preserved across chunks. |
+| 6 | Chunk Summarisation | Each chunk → GPT-4o / Gemini 1.5 Pro (async, parallel calls). Returns structured JSON per chunk: chapter title, highlight timestamps, key points, confidence score. Status → CHUNK_SUMMARIZED. |
+| 7 | Meta-Summary Pass | All chunk JSONs merged → second LLM call produces unified highlight ranking. Deduplicates overlapping highlights (±30 s window). 
Enforces minimum clip duration. Status → META_SUMMARIZED. | +| 8 | Clip Extraction | FFmpeg stream-copies highlight segments from original (full-res) video using precise start/end timestamps (±30 s padding). Lossless, fast. Status → CLIPPED. | +| 9 | Screenshot Extraction | FFmpeg -ss -vframes 1 on proxy file extracts keyframe JPG at each highlight timestamp. Fast single-frame decode. Status → SCREENSHOTS_DONE. | +| 10 | Summary Build | Summary.md assembled from meta-summary JSON + relative asset paths. processing_report.json written with token usage, cost estimate, stage timings. Status → COMPLETED. | + +### Production Optimizations +1. **Bitrate Laddering (Compute Optimization)** + Instead of processing full-resolution video: + ```bash + ffmpeg -i input.mp4 -vf scale=-2:480 -c:v libx264 -preset veryfast proxy.mp4 + ``` + * Proxy (480p) used for screenshots and frame sampling + * Original video used only for final clip extraction + * Reduces CPU/GPU usage by 60–80% + * Improves batch throughput significantly + +2. **Sliding Window + Recursive Summarization** + **Problem**: 3–4 hour transcript exceeds safe context window limits. + **Solution** + **Step 1 — Sliding Window Transcription** + * 20–30 minute transcript chunks + * 1–2 minute overlap + * Word-level timestamps preserved + **Step 2 — Level 1 Summaries** + Each chunk → structured JSON summary + **Step 3 — Meta-Summary Pass** + Summaries of summaries → unified highlight ranking + **Step 4 — Deduplication** + * Merge overlapping highlights (±30 sec) + * Remove duplicates + * Enforce minimum clip duration + + This ensures: + * No mid-video information loss + * Balanced coverage + * Reduced hallucination risk + + 3. 
**Timestamp Confidence Scoring** + Each highlight includes a computed confidence score: +```json + { + "title": "Core Architecture Decision", + "start_time": "01:12:33", + "end_time": "01:14:02", + "confidence_score": 0.87, + "confidence_reason": "High transcript clarity, clean silence boundaries" + } +``` +**Confidence factors:** +* Whisper transcription confidence +* Silence detection at boundaries +* Semantic completeness +* Cross-chunk consistency +This improves trust and usability. + +4. **Token Cost Control Strategy** + To control API cost for long transcripts: + * Remove filler words before LLM call + * Use chunk-level summarization before meta-summary + * Temperature = 0 for deterministic JSON + * Strict JSON schema enforcement + * Avoid verbose model outputs + +Estimated cost per 3-hour video: +*~$1–3 depending on model and chunk count* + +5. **Idempotent Batch Processing** + Each video maintains processing state: +```json + { + "video_id": "video_001", + "status": "TRANSCRIBED", + "last_successful_stage": "chunk_summarization", + "retry_count": 1 + } +``` +**Pipeline states:** +* INGESTED +* PROXY_GENERATED +* TRANSCRIBED +* CHUNK_SUMMARIZED +* META_SUMMARIZED +* CLIPPED +* SCREENSHOTS_DONE +* COMPLETED +* FAILED + +If the pipeline is interrupted at any stage, the batch orchestrator resumes from the last completed stage — no reprocessing of already-finished steps +```text +INGESTED + ↓ +PROXY_GENERATED + ↓ +TRANSCRIBED + ↓ +CHUNK_SUMMARIZED + ↓ +META_SUMMARIZED + ↓ +CLIPPED + ↓ +SCREENSHOTS_DONE → FAILED (resumable from last stage) + ↓ +COMPLETED + +``` + +6. 
**Observability & Cost Reporting**
+   Each video generates a `processing_report.json` file that includes:
+* Duration
+* Stage-wise processing time
+* LLM token usage
+* Estimated API cost
+* Highlight count
+
+Example:
+```text
+Video: product_masterclass.mp4
+Duration: 3h 18m
+Transcription Time: 12m
+LLM Tokens: 58,400
+Estimated Cost: $1.94
+Highlights: 11
+Avg Confidence: 0.83
+```
+### Output Folder Structure
+```text
+output/<video_id>/
+├── Summary.md
+├── processing_report.json
+├── clips/
+│   ├── highlight_01.mp4
+│   └── highlight_02.mp4
+└── screenshots/
+    ├── highlight_01.jpg
+    └── highlight_02.jpg
+```
+
+### LLM Failure Handling Strategy
+
+- Each chunk-level LLM call has retry logic (max 3 attempts with exponential backoff).
+- If a single chunk fails permanently, that chunk is marked `FAILED` and excluded from meta-summary.
+- The final Summary.md includes a warning section listing any skipped chunks.
+- The job does not fail entirely due to a single chunk failure.
+
+## 3. Fully Offline Solution
+All components run locally:
+* Transcription: faster-whisper
+* LLM: LLaMA / Mistral (via Ollama or llama.cpp)
+* Media: FFmpeg
+### Architecture
+| Component | Fully Offline (Approach 3) Description |
+|------------------|----------------------------------------|
+| Transcription | faster-whisper (large-v3) on GPU — same as Approach 2. |
+| LLM | Local model served via Ollama or llama.cpp. Recommended models: Llama-3.1 70B (if GPU VRAM ≥ 48 GB) or Mistral-7B / Gemma-2 27B for lighter setups. |
+| Prompt | Same chunked approach as Approach 2. JSON output enforced via grammar-constrained decoding (llama.cpp grammar / Ollama `format: json`). |
+| Video Processing | FFmpeg + OpenCV — identical to Approach 2. |
+| Limitations | Local LLM quality is lower than GPT-4o for complex summarisation, especially for domain-specific or technical content. 
| + +## Model Selection by Hardware +| GPU VRAM | Recommended Model | Quality vs Hybrid | Throughput (3-hr video) | +|-------------------------------|--------------------------------|------------------------------|--------------------------| +| ≥ 48 GB (A100 / H100) | Llama-3.1 70B (Q4) | ~85% of GPT-4o quality | ~20 min | +| 24–48 GB (A40 / 3090) | Gemma-2 27B (Q4) | ~75% of GPT-4o quality | ~35 min | +| 12–24 GB (4090 / 3080) | Mistral-7B or Phi-3 Mini | ~60–65% of GPT-4o quality | ~50 min | +| < 12 GB | Not recommended | Quality insufficient for production | — | + + +### Pros +* Maximum privacy +* No API cost +* Suitable for air-gapped environments +### Cons +* Requires strong GPU +* Lower summarization quality vs GPT-4 class models +* Higher setup complexity + +### JSON Reliability Note +Local LLMs (LLaMA/Mistral) are less reliable at strict JSON output than GPT-4 class models. +Mitigation: Use **grammar-constrained decoding** to enforce schema at the token level: +- **llama.cpp**: pass a `.gbnf` grammar file that matches your highlight JSON schema +- **Ollama**: use `format: "json"` in the API call + +This ensures the offline pipeline produces parseable output without post-processing fallbacks. + +> [!NOTE] +> When to choose Offline: Regulated or confidential content (legal, medical, financial) where no data upload is permitted, and the organisation owns ≥ 24 GB VRAM GPU hardware. Expect ~75–85% of Hybrid quality at the cost of higher setup complexity. 
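Both the Hybrid and Offline pipelines apply the same meta-summary merge rule: combine highlights that overlap within a ±30 s window, then enforce a minimum clip duration. A minimal sketch of that rule in Python — the function name, tuple layout, and the 20 s duration threshold are illustrative, not from an existing codebase:

```python
# Highlights as (start_sec, end_sec, confidence) tuples; names illustrative.
def merge_highlights(highlights, window=30.0, min_duration=20.0):
    """Merge highlights whose spans fall within ±window seconds of each
    other, then drop any merged clip shorter than min_duration."""
    merged = []
    for start, end, conf in sorted(highlights):
        if merged and start <= merged[-1][1] + window:
            # Overlaps (or nearly overlaps) the previous highlight: extend it
            # and keep the higher confidence score.
            prev_start, prev_end, prev_conf = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_conf, conf))
        else:
            merged.append((start, end, conf))
    # Enforce minimum clip duration after merging.
    return [h for h in merged if h[1] - h[0] >= min_duration]
```

Sorting first turns deduplication into a single linear pass, so its cost is negligible next to transcription and LLM calls.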
+ +### Decision Matrix + +| Factor | SaaS | Hybrid (Recommended) | Offline | +| -------------------- | -------- | -------------------- | ------------------ | +| Privacy | Low | High | Maximum | +| Cost Control | Low | High | High | +| Quality | High | Highest | Medium | +| Customization | Limited | Full | Full | +| Batch Reliability | Moderate | High | Hardware-dependent | +| Production Evolution | Low | High | Medium | + +## **My Recommendation** +The **Hybrid Architecture** with recursive summarization, bitrate laddering, and confidence scoring provides the best balance of: +* Cost efficiency +* Privacy +* High-quality summaries +* Deterministic clip alignment +* Batch reliability +* Production-readiness +**It is scalable from a small POC to a production-grade media processing pipeline.** + ## Problem 2: **Zero-Shot Prompt to generate 3 LinkedIn Post** @@ -34,9 +311,186 @@ Design a **single zero-shot prompt** that takes a user’s persona configuration **TASK:** Write a prompt that can work. -### Your Solution for problem 2: +## Problem 2 — Zero-Shot Prompt: LinkedIn Post Generator + +A single prompt call (no fine-tuning, no multi-turn) that accepts a user persona configuration + a topic and returns 3 structurally distinct, LinkedIn-ready post drafts as directly parseable JSON. + +## Design Decisions +| Decision | Rationale | +|----------|-----------| +| Structured Output | Prompt mandates a strict JSON schema. The app calls `JSON.parse()` directly — no regex scraping, no free-text post-processing. | +| Zero-Shot Reliability | Explicit constraints + a worked schema example (few-shot schema, not few-shot content) reduces hallucination risk without inflating token cost with full examples. | +| Style Separation | Three named styles are defined with concrete structural rules and hard word count bounds. Prevents the model returning three tonal variations of the same structure. 
|
+| Enum-Constrained Style Field ▸ NEW | The `style` field in the JSON schema is shown as an enum directly inside the schema definition — not just mentioned in the rules. The model sees allowed values where it fills them in, which is more reliable than a rule listed elsewhere. |
+| Persona Injection | All persona fields injected as a typed, structured block with inline examples. Avoids vague “write in my voice” instructions that models routinely under-follow. |
+| Do/Don't Enforcement | `do_rules` and `dont_rules` serialised as numbered lists. LLMs comply more reliably with numbered constraints than free-form narrative instructions. |
+| Hallucination Guard | Explicit ban: the model must not invent statistics, quotes, or external references unless they appear in `topic_context`. Named specifically, not bundled into a general “be accurate” instruction. |
+| API-Level JSON Enforcement | In addition to prompt instructions, the API call itself enforces JSON mode: `response_format: { type: 'json_object' }` (OpenAI) or `responseMimeType: 'application/json'` (Gemini). Prompt + API constraint together eliminate malformed output. |
+
+## The Prompt
+### SYSTEM PROMPT
+```javascript
+You are a professional LinkedIn ghostwriter and content strategist.
+Your only job is to produce valid JSON — nothing else.
+Do not include any text, explanation, or markdown outside the JSON object.
+Do not wrap the output in code fences.
+
+Your output must always be a single JSON object matching this exact schema:
+
+{
+  "posts": [
+    {
+      "style": "punchy_insight | narrative_story | actionable_checklist",
+      "hook": "<string>",
+      "body": "<string>",
+      "cta": "<string>",
+      "hashtags": ["<string>", "<string>", "<string>"],
+      "estimated_word_count": <number>,
+      "style_notes": "<string>"
+    }
+  ]
+}
+
+Rules you must follow at all times:
+1. Return exactly 3 post objects — no more, no fewer.
+2. 
Each post must use one of these exact style values: + "punchy_insight" | "narrative_story" | "actionable_checklist" + Each style value must appear exactly once across the 3 posts. +3. Do not add extra keys to the schema. +4. Never invent statistics, case study numbers, quotes, or external references + unless they were explicitly provided in the topic_context field. +5. Obey every item in do_rules and dont_rules without exception. +6. Each post must be meaningfully different in structure — not just tone. + A reader must instantly recognise which style they are reading. +7. LinkedIn formatting: use \n\n between paragraphs. No markdown headers. + Emojis only if emoji_preference is 'yes' or 'sometimes'. +8. All three posts must be ready to publish — no [brackets], no placeholders. +9. If topic_context is empty and the topic is ambiguous, do not fabricate assumptions. Instead, interpret the topic generically and avoid specific claims. +``` + +### STYLE DEFINITIONS (part of system prompt) +```javascript +STYLE 1 — "punchy_insight" + Structure: + - Hook: one strong declarative sentence. + - 3–5 very short paragraphs (1–2 lines each). + - White space between every paragraph. No story arc. No list. + - End with a thought-provoking question or sharp closing statement. + - Target: under 150 words. + +STYLE 2 — "narrative_story" + Structure: + - Hook opens mid-scene (in medias res — present tense). + - 2–3 paragraph arc: situation → tension/realisation → outcome/lesson. + - Transition to broader takeaway (1 paragraph). + - CTA asks the reader to share their own experience. + - Target: 180–250 words. + +STYLE 3 — "actionable_checklist" + Structure: + - Hook states a clear value promise ("5 things I learned about X"). + - Numbered list of 4–6 items. + - Each item: bold short label + 1–2 sentence explanation. + - Closing paragraph ties the list to the user's broader expertise. + - CTA drives a save or share action. + - Target: 200–280 words. 
+``` + +### USER PROMPT (filled per API call) +```javascript +Generate 3 LinkedIn post drafts using the persona and topic below. + +=== PERSONA === +name: {{name}} +professional_background: {{background}} +current_role: {{current_role}} +industry: {{industry}} +tone: {{tone}} + (e.g. "conversational and warm" | "authoritative and direct" | "humble and reflective") +language_style: {{language_style}} + (e.g. "plain English, no jargon" | "industry terms welcome" | "bilingual EN/HI mix") +typical_post_length: {{length_preference}} + (e.g. "short and punchy < 150 words" | "medium 150–250 words" | "long-form") +emoji_preference: {{emoji_preference}} + (yes | no | sometimes) +audience: {{audience}} + (e.g. "startup founders" | "engineering students" | "HR professionals") + +do_rules: +{{numbered list of do rules}} + +dont_rules: +{{numbered list of dont rules}} + +=== TOPIC === +topic: {{topic}} +topic_context: {{optional: key points, personal anecdotes, data to include}} +goal: {{goal}} + (e.g. "build thought leadership" | "drive profile visits" | "encourage comments") + +=== INSTRUCTIONS === +- Follow all system rules strictly. +- Produce exactly 3 posts — one per style. +- Return only the JSON object. No other text. +``` -You need to put your solution here. +## API Call Configuration +The prompt alone is not sufficient — the API call must also enforce JSON mode. Both layers together make malformed output practically impossible. 
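JSON mode guarantees syntactically valid JSON, not the right shape. A cheap post-parse check — exactly three posts, each style used exactly once — catches the remaining structural failure mode before drafts reach the user. A sketch in Python (the provider snippets in this section are JavaScript; `validate_posts` is a hypothetical helper, not part of either SDK):

```python
# Post-parse schema check mirroring system-prompt rules 1 and 2
# (exactly three posts, each allowed style appearing exactly once).
ALLOWED_STYLES = {"punchy_insight", "narrative_story", "actionable_checklist"}

def validate_posts(payload: dict) -> bool:
    posts = payload.get("posts", [])
    if len(posts) != 3:
        return False
    styles = [p.get("style") for p in posts]
    # With three posts, set equality implies each style appears exactly once.
    return set(styles) == ALLOWED_STYLES
```

A failed check can trigger a single retry or surface an error rather than shipping malformed drafts.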
+ +### OpenAI +```javascript +const response = await openai.chat.completions.create({ + model: "gpt-4o", + response_format: { type: 'json_object' }, // ← API-level JSON enforcement + temperature: 0, // ← deterministic output + messages: [ + { role: "system", content: SYSTEM_PROMPT }, + { role: "user", content: buildUserPrompt(persona, topic) } + ] +}); +const posts = JSON.parse(response.choices[0].message.content).posts; +``` + +### Gemini +```javascript +const response = await model.generateContent({ + generationConfig: { + responseMimeType: "application/json", // ← API-level JSON enforcement + temperature: 0, + }, + contents: [{ role: 'user', parts: [{ text: FULL_PROMPT }] }] +}); +const posts = JSON.parse(response.response.text()).posts; +``` +> [!Note] +> **Why both layers?** Prompt instructions tell the model what to do. API json_mode enforces it at the token-sampling level — the model cannot physically emit a non-JSON token. One layer without the other leaves a gap. + +## Sample Filled Persona (Reference) +| Field | Value | +|-------|-------| +| name | Priya Nair | +| current_role | Senior Product Manager at a B2B SaaS startup | +| industry | Product Management / B2B SaaS | +| tone | Conversational, warm, occasionally vulnerable | +| language_style | Plain English, light PM terminology, zero buzzwords | +| emoji_preference | Sometimes — max 2 per post | +| audience | Early-career PMs, startup founders, product enthusiasts | +| do_rules | 1. Share real personal experiences.
2. Use specific details when available.
3. End with a question that invites discussion. | +| dont_rules | 1. No corporate jargon.
2. No humblebrag tone.
3. Never claim expertise not earned.
4. Avoid passive voice. | +| topic | Why saying no is the most important product skill | +| topic_context | Recently declined a high-visibility feature request from the CEO. The team thanked me later. No external stats — personal experience only. | +| goal | Build thought leadership; encourage PMs to comment with their own "no" stories | + +## Why This Prompt Is Reliable +| Property | How It Is Achieved | +|----------|--------------------| +| Schema-first design | JSON schema is defined before any content rules. The model anchors its output format before reading persona or topic. | +| Enum in schema (not just rules) | `style` field shows allowed values inline: `punchy_insight \| narrative_story \| actionable_checklist`. Model sees the constraint exactly where it writes the value. | +| Hard structural differentiation | Each style has mutually exclusive structural rules (word count, format, arc type). Three tonal variations of the same structure are structurally impossible. | +| Named hallucination ban | Rule 4 specifically bans invented statistics, numbers, quotes, and external references — not a vague “be accurate” instruction. | +| Typed persona fields | Each field includes an inline example value. Reduces misinterpretation of abstract descriptors like `tone` or `language_style`. | +| API + prompt JSON enforcement | `response_format` / `responseMimeType` enforces JSON at the token level. `JSON.parse()` on the raw completion — no stripping, no regex, no fallback parser. | +| Temperature = 0 | Deterministic output. Same persona + topic always produces structurally consistent drafts. Avoids creative drift between calls. | ## Problem 3: **Smart DOCX Template → Bulk DOCX/PDF Generator (Proposal + Prompt)** @@ -54,7 +508,209 @@ Submit a **proposal** for building this system using GenAI (OpenAI/Gemini) for ### Your Solution for problem 3: -You need to put your solution here. 
+## Problem 3 — Smart DOCX Template → Bulk DOCX/PDF Generator + +Users maintain Word templates (offer letters, invoices, certificates, contracts) where only a handful of fields change per document. The system detects editable fields via AI, then handles single and bulk generation with deterministic rendering — no hallucinations, no formatting loss. + +| Principle | What It Means Here | +|-----------|-------------------| +| LLM Containment | LLM is invoked exactly once — for field detection only. Every other step (rendering, validation, PDF conversion) is deterministic. Zero hallucination risk in final documents. | +| Deterministic Rendering | python-docxtpl (Jinja2) replaces only placeholders. Original DOCX XML — tables, headers, footers, logos, signatures — is never touched. | +| Schema Versioning | Every confirmed template gets a version stamp. Bulk jobs record which version was used. Critical for auditing offer letters, contracts, compliance docs. | +| Row-Level Failure Isolation | A single bad spreadsheet row does not abort the batch. The job continues; the report flags every failure independently. | +| Template Injection Defense ▸ NEW | User-supplied field values are sanitized before Jinja2 rendering. Jinja2 control characters (`{% %}`, `{{ }}`, `{# #}`) are escaped to prevent template injection attacks that could break rendering or expose config data. | +| Streaming ZIP ▸ NEW | ZIP bundle is streamed to disk as files are rendered — not assembled in memory. Prevents OOM errors on large batches (hundreds/thousands of rows). | +| Data Minimisation | Spreadsheet row data is processed in memory and discarded after the job. It is never persisted to the database — only the generation report is retained. | +| Auditability | Each bulk job produces a structured job summary JSON alongside the CSV report. Supports SLA tracking and compliance reporting. 
| + +## High-Level Architecture +```text +User Upload DOCX + ↓ +Text Extraction Layer + ↓ +LLM Field Detection + ↓ +Field Schema Confirmation + ↓ +Template Conversion (Placeholder Injection) + ↓ +------------------------------------------- +Single Generation Flow: +Form Input → Validation → Render → DOCX/PDF Download + +Bulk Generation Flow: +Excel/Sheet Upload → Row Validation → Parallel Rendering → ZIP Bundle + Report +``` +## Phase 1 — Template Creation (AI-Assisted Field Detection) +### Step 1: Document Upload & Extraction +| Item | Description | +|------|------------| +| Tool | python-docx | +| What is extracted | Full text with paragraph boundaries and table cell context preserved. Repeated values flagged for `appears_multiple_times` detection. | +| What is NOT sent to LLM | Raw file bytes, images, embedded fonts, binary data. Only the extracted text string is sent. | + +### Step 2 — LLM Field Detection Prompt +LLM is used exactly once. The prompt enforces a strict JSON-only response: +```javascript +SYSTEM: +You are a document analysis AI. Your only job is to identify fields that +change between different instances of this template document. +Return a JSON array only — no explanation, no markdown, no extra keys. + +Each object in the array must match this schema exactly: +[ + { + "field_key": "snake_case_identifier", + "display_label": "Human Readable Label", + "field_type": "text | date | currency | number | email | boolean", + "sample_value": "", + "context_hint": "", + "required": true | false, + "appears_multiple_times": true | false + } +] + +Rules: +1. Only extract values that realistically differ per document instance. + (names, dates, amounts, roles, addresses, IDs) +2. Do NOT extract: document title, company name, static boilerplate, + unless they explicitly vary per instance. +3. If a value appears multiple times (e.g. candidate name in greeting AND signature), set appears_multiple_times: true. + The system will replace all occurrences. +4. 
Return the JSON array only. No other text. +``` +### Step 3 — Schema Confirmation UI +| Item | Description | +|------|------------| +| User actions | Add missed fields, remove false positives, rename labels, change field types, mark optional vs required, set date format / currency symbol. | +| Output | Confirmed fields saved as `template_schema.json`. DOCX converted to Jinja2 template via exact string replacement (not a second LLM call). | +| Version stamp | First confirmation = version 1.0. Any schema edit increments the minor version. Major structural changes prompt user to confirm a new major version. | + +### Template Schema (Saved JSON) +```json +{ + "template_id": "offer_letter", + "version": "1.0", + "created_at": "2026-02-10T09:00:00Z", + "fields": [ + { + "field_key": "candidate_name", + "display_label": "Candidate Full Name", + "field_type": "text", + "required": true, + "appears_multiple_times": true + }, + { + "field_key": "start_date", + "display_label":"Start Date", + "field_type": "date", + "format": "DD MMMM YYYY", + "required": true + }, + { + "field_key": "salary_annual", + "display_label": "Annual Salary", + "field_type": "currency", + "currency_symbol":"₹", + "required": true + } + ] +} +``` +## Phase 2 — Single Document Generation +### Pipeline +| Stage | Description | +|-------|------------| +| Select Template | User picks a saved template. Form is auto-generated from the field schema — date picker for date fields, currency input for currency, etc. | +| Client Validation | Required field check, type validation (date format, numeric range). Runs in browser before any network call. | +| Server Validation | Re-validates all fields server-side before rendering. Type enforcement, sanitization. Never trust client-side validation alone. | +| Injection Sanitize | Jinja2 control characters (`{% %}`, `{{ }}`, `{# #}`) are escaped in every field value before the render call. 
Prevents template injection — a docxtpl-specific attack vector where a user submits a malicious Jinja2 expression as a field value. | +| Render DOCX | python-docxtpl renders the Jinja2 template with sanitized values. Original formatting — tables, headers/footers, logos, signatures — is untouched. | +| Convert to PDF | LibreOffice headless: `soffice --headless --convert-to pdf`. Spawned as a subprocess per document. | +| Download | DOCX and/or PDF served as a file download. Filename: `__.` — pattern configurable in template settings. | + +## Phase 3 — Bulk Document Generation + +### Spreadsheet Interface +The system generates a downloadable Excel template where column headers exactly match field_key values (with display_label as a comment on the header cell). The user fills one row per document. For Google Sheets, the user provides a share link, the system reads it via Google Sheets API. + +### Pipeline +| Stage | Description | +|-------|------------| +| Parse Sheet | Read Excel (openpyxl) or Google Sheet rows. Map column headers to `field_key` values from schema. Flag unrecognised columns as warnings in the report. | +| Row Validation | Each row validated independently. Checks: required fields present, type correctness, date format parseable, numeric fields numeric. Invalid rows are flagged and skipped — the batch continues. | +| Sanitize (per row) | Jinja2 control characters escaped in every field value of every row before any render call. Applied uniformly regardless of source (Excel or Sheets). | +| Parallel Render | Configurable worker pool (default: CPU count). Each valid row: render DOCX → convert PDF. Worker failures are caught per-row and logged to the report without stopping other workers. | +| Streaming ZIP | Files are written into the ZIP bundle as they are rendered — not buffered in memory first. Prevents OOM errors on large batches. ZIP streamed to disk and made available for download when all rows are done. 
| +| Report Generation | CSV report: `row_number`, `status`, `file_name`, `error_reason`. Job summary JSON: `rows_total`, `rows_success`, `rows_failed`, `processing_time_seconds`, `template_version`. Both are included in the ZIP and shown as a summary table in the UI. | + +## ZIP Output Structure +```text +generated_docs/ +└── offer_letter_v1.0_2026-02-10/ + ├── pdf/ + │ ├── Rohan_Sharma_OfferLetter_20260210.pdf + │ ├── Super_Man_OfferLetter_20260210.pdf + │ └── Anjali_Pathania_OfferLetter_20260210.pdf + ├── docx/ + │ ├── Rohan_Sharma_OfferLetter_20260210.docx + │ └── Super_Man_OfferLetter_20260210.docx + ├── generation_report.csv + └── job_summary.json +``` +## Generation Report (CSV + UI Table) +| Row | Status | File Name | Error Reason | +|-------|------------|--------------------------------------------|--------------------------------| +| 1 | ✓ Success | Rohan_Sharma_OfferLetter_20260210.pdf | — | +| 2 | ✓ Success | Super_Man_OfferLetter_20260210.pdf | — | +| 3 | ✗ Skipped | — | Missing required field: salary_annual | +| 4 | ✗ Skipped | — | Invalid date format: start_date | +| 5 | ✓ Success | Anjali_Pathania_OfferLetter_20260210.pdf | — | + +## Job Summary JSON + +```json +{ + "template_id": "offer_letter", + "template_version": "1.0", + "job_id": "bulk_20260210_001", + "rows_total": 120, + "rows_success": 115, + "rows_skipped": 5, + "processing_time_seconds": 48, + "generated_at": "2026-02-10T11:23:00Z" +} +``` +## Security Boundaries +| Threat | Mitigation | +|--------|------------| +| Template Injection | All field values sanitized before Jinja2 render. Jinja2 control sequences (`{% %}`, `{{ }}`, `{# #}`) are HTML-escaped or stripped. Sandbox mode enabled in Jinja2 environment to prevent code execution. | +| Malicious DOCX upload | File type validated by magic bytes (not extension). DOCX unpacked and inspected before processing. Macro-enabled `.docm` files rejected. | +| Google Sheets data exposure | Service account scoped to read-only. 
Credentials stored in server environment variables, never in client code or logs. | +| Spreadsheet data retention | Row data processed in memory only. Never written to the database. Only the job summary and error report are retained. | +| Rendered document access | Download links are signed, time-limited URLs (expire after 30 minutes). Generated files deleted from server after download or after 24h. | + +## Scalability & Observability + +| Component | Description | +|-----------|------------| +| Worker Pool | Configurable concurrency (default: CPU count). Each worker handles one row end-to-end (validate → render → PDF). Worker failures are isolated — no shared state between rows. | +| Streaming ZIP | Files streamed into ZIP as they complete. No full batch buffer in memory. Handles batches of thousands of rows without OOM risk. | +| Job Queue | Bulk jobs tracked in a lightweight job table (SQLite for single-server, Redis/Postgres for multi-server). Supports retry and resume if server restarts mid-batch. | +| Horizontal Scaling | Workers are stateless — can be run on separate machines. Job queue acts as the coordinator. ZIP assembly happens on the output server. | +| LibreOffice Pool | LibreOffice headless has a high startup cost (~1–2 sec per process). Production setup maintains a warm pool of LibreOffice instances to eliminate per-document startup latency. | + +## Why This Architecture Works +> [!IMPORTANT] +> **LLM is a scalpel, not a paintbrush:** Used once, for the one task where creative inference is needed (field detection). Everything else — rendering, validation, PDF conversion, report generation — is deterministic. This makes the system safe to run at scale on legal and financial documents. + +> [!NOTE] +> **Schema versioning = auditability:** Every bulk job records which template version was used. If an offer letter is disputed 6 months later, you can replay exactly which fields and which template were active at generation time. 
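The sanitisation step called out in the Security Boundaries table is small enough to show in full. A minimal sketch in Python (illustrative; a production setup would also render through Jinja2's `SandboxedEnvironment` rather than rely on escaping alone):

```python
# Break Jinja2 control sequences in user-supplied field values so that a
# value like "{% for x in config %}" renders as literal text, not a template.
JINJA_SEQUENCES = ("{{", "}}", "{%", "%}", "{#", "#}")

def sanitize_field(value: str) -> str:
    for seq in JINJA_SEQUENCES:
        # Inserting a space between the two characters defuses the sequence.
        value = value.replace(seq, seq[0] + " " + seq[1])
    return value
```

Applied uniformly to every field value — single and bulk flows alike — before the docxtpl render call.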

> [!CAUTION]
> **Template injection is a real risk:** docxtpl uses Jinja2 under the hood. A user who submits `{% for x in config %}` as a field value can crash the render or expose internal configuration. Sanitization and Jinja2 sandbox mode are non-negotiable in production.

## Problem 4: Architecture Proposal for 5-Min Character Video Series Generator

Create a **small, clear architecture proposal** (no code, no prompts) describing

### Your Solution for problem 4:

## Problem 4 — Character-Based 5-Min Episode Video Series Generator

Users define a cast of characters once with reference images, personality, voice profiles, and relationships. For each new episode, the user provides a short story prompt. The system generates a complete episode package: script, storyboard, visual assets, audio, and a final rendered video, while maintaining character consistency across every episode in the series.

## System Overview: Two-Phase Design
| Phase | Description |
|-------|------------|
| A — Series Bible Setup (one-time) | Character definition, relationship graph, world/style rules. Stored as versioned JSON. Injected into every episode generation call as the single source of truth. |
| B — Episode Generation (per episode) | User provides a short story prompt → system runs a 5-stage pipeline: Script → Storyboard → Visuals → Audio → Video Assembly. Each stage is an independent service. 
|

```text
PHASE A (one-time)                      PHASE B (per episode)
─────────────────────                   ──────────────────────────────────────
Character Setup                         Episode Prompt
      ↓                                       ↓
Relationship Graph                      Stage 1: Script Generation
      ↓                                       ↓
World / Style Rules                     Stage 2: Storyboard / Shot Planning
      ↓                                       ↓
Series Bible JSON ────────────────────→ Stage 3: Visual Asset Generation
      ↑                                       ↓
      │ Episode Memory (written back)   Stage 4: Audio Generation
      └───────────────────────────────  Stage 5: Video Assembly
                                              ↓
                                  Final MP4 + Production Package
```
## Phase A — Series Bible (Single Source of Truth)
The Series Bible is the authoritative configuration for the entire series. It is injected in full into every episode generation call. Every stage — script, visuals, audio — reads from it. Characters never need to be re-described per episode.

### Character Schema
```json
{
  "character_id": "char_maya",
  "name": "Maya",
  "age": 28,
  "visual_description": "South Asian woman, shoulder-length black hair, often in yellow kurta, warm-toned skin, expressive dark eyes",
  "reference_images": ["maya_ref_01.png", "maya_ref_02.png"],
  "personality_traits": ["optimistic", "impulsive", "loyal", "bad at taking advice"],
  "speaking_style": "Fast-paced, metaphor-heavy, ends sentences with rhetorical questions",
  "behavioral_rules": [
    "Never admits fear directly — deflects with humour",
    "Protective of her younger brother Rohan"
  ],
  "voice_profile": {
    "provider": "ElevenLabs",
    "voice_id": "xyz123",
    "speed": 1.05,
    "pitch_shift": 0
  }
}
```
### Relationship Graph (Structured Edges)
Stored as a directed graph of edge objects. Injected into script generation to enforce interaction consistency — the LLM knows not just who the characters are, but how they relate and where that relationship currently stands. 
```json
{
  "edges": [
    {
      "from": "char_maya",
      "to": "char_rohan",
      "type": "sibling",
      "dynamic": "protective",
      "current_state": "supportive but argumentative after last episode's fight"
    },
    {
      "from": "char_maya",
      "to": "char_priya",
      "type": "mentor",
      "dynamic": "warm",
      "current_state": "maya increasingly doubts priya's advice"
    }
  ]
}
```
### World Rules
```json
{
  "setting": "Urban Indian neighbourhood, present day",
  "tone": "Light drama with humour",
  "recurring_themes": ["family responsibility", "career growth", "self-doubt"]
}
```
### Episode Memory Log
After each episode is generated and approved, the system writes a short continuity summary back into the Series Bible. This is what enables true multi-episode coherence: the script LLM in Episode 3 knows what happened in Episodes 1 and 2.

```json
{
  "episode_log": [
    {
      "episode_id": "ep_001",
      "title": "The Interview",
      "summary": "Maya gets a job offer in another city. Rohan is supportive but hurt.",
      "relationship_state_changes": [
        { "edge": "maya→rohan", "new_state": "strained — maya feeling guilty" }
      ],
      "unresolved_threads": ["Maya has not yet told her parents about the job offer"]
    }
  ]
}
```
> [!IMPORTANT]
> **Why this matters:** Without episode memory, the script LLM treats every episode as isolated. With it, character arcs carry forward — Maya's guilt from Episode 1 can surface in Episode 2's dialogue naturally, without the user having to re-explain it in every prompt.

## Phase B — Episode Generation Pipeline

### Stage 1 — Script Generation
| Stage | Description |
|-------|------------|
| Input | Series Bible JSON (full) + episode prompt (situation, cast subset, tone, goal) + `episode_log` (previous episode summaries for continuity). |
| LLM | GPT-4o or Claude 3.5 Sonnet. Long-context model to hold the full Series Bible + episode log without truncation. |
| Output Schema | JSON array of scenes. 
Each scene: `scene_id`, `setting`, `characters_present`, `action_description`, `dialogue` [{`character_id`, `line`, `emotion`, `direction`}], `estimated_duration_seconds`. |
| Pacing Control | Speech duration estimated at 130–150 WPM. Sum of scene durations must be 270–310 seconds (4.5–5.2 min). If over → trim the lowest-priority scene. If under → expand a key dialogue exchange. |
| Character Fidelity | Series Bible fields — `personality_traits`, `speaking_style`, `behavioral_rules` — injected per-character into the prompt. Instruction: “Every dialogue line must be consistent with the character's speaking_style and behavioral_rules above.” Applied to every scene, not just globally. |

### Script Scene Schema
```json
[
  {
    "scene_id": 1,
    "setting": "Kitchen, morning",
    "characters_present": ["char_maya", "char_rohan"],
    "action_description": "Morning disagreement about career decision",
    "estimated_duration_seconds": 65,
    "dialogue": [
      { "character_id": "char_maya", "line": "You think it's that simple?", "emotion": "frustrated", "direction": "turns away" },
      { "character_id": "char_rohan", "line": "I think you're scared.", "emotion": "calm", "direction": "steady eye contact" }
    ]
  }
]
```
### Stage 2 — Storyboard / Shot Planning
| Stage | Description |
|-------|------------|
| Input | Generated script JSON. |
| Process | Second LLM pass produces a shot list. One or more shots per scene. Each shot: `shot_type` (wide/medium/close-up/reaction), `camera_movement` (static/pan/zoom), `characters_in_frame`, `background_description`, `mood/lighting`. Separates narrative logic from visual composition. |
| Output | Shot list JSON attached to the episode package. Each shot becomes one image generation call in Stage 3. |

### Stage 3 — Visual Asset Generation
| Stage | Description |
|-------|------------|
| Character Images | Each shot generated via SDXL or DALL-E 3. Prompt = `visual_description` (Series Bible) + emotion + setting + lighting. 
Reference image fed as an IP-Adapter identity/style anchor for face consistency (this applies to the SDXL path; DALL-E 3 has no reference-image conditioning, so it relies on textual anchoring alone). |
| Backgrounds | Generated separately per unique setting. Reused across shots with the same setting — avoids per-shot inconsistency and reduces generation cost. |
| CLIP Similarity Gate | Each generated image scored against the character reference via CLIP cosine similarity. Images below the 0.85 threshold are flagged for user review or auto-regenerated (max 2 retries before escalating to the user). |

### **Character Consistency: Layered Strategy**

| Layer | Method | How It Works | Cost |
|-------|--------|-------------|------|
| 1 | Textual Anchoring | Detailed `visual_description` injected into every image prompt. Age, skin tone, hair, clothing signature all specified. | Zero — always active |
| 2 | IP-Adapter | Reference image fed as style/identity anchor at inference time. No training required. | Per-call inference overhead only |
| 3 | Character LoRA | Fine-tuned LoRA trained on 15–30 reference images per character. Best consistency results. One-time training cost. | One-time GPU training (~30 min) |
| 4 | CLIP Gating | Cosine similarity vs reference. Score < 0.85 → regenerate (max 2 retries) → escalate to user. | Per-image scoring: negligible |

### Stage 4 — Audio Generation
| Stage | Description |
|-------|------------|
| Dialogue / Voiceover | Each dialogue line sent to ElevenLabs (or Azure Neural TTS) with the character's `voice_id` and `speed` from the Series Bible. Emotion tags mapped to ElevenLabs expression controls (e.g., `frustrated` → raised energy, faster pace). |
| Narration | Optional narrator voice for scene transitions. Defined in the Series Bible as a separate voice profile if the series uses a narration style. |
| Background Music | Royalty-free track generated or selected (Mubert / Suno AI) based on the episode tone tag. Mixed at lower volume than dialogue. Fade-in/fade-out applied at episode start/end. |
| Audio Sync | TTS audio duration measured programmatically per line. 
Scene image display duration adjusted to match audio length. Ensures total episode stays within ±10 seconds of 5 minutes — audio is the timing master. | + +### Stage 5 — Video Assembly +| Stage | Description | +|-------|------------| +| Composition | FFmpeg or MoviePy. Background image + character images composited as layers per shot. Subtle Ken Burns pan/zoom applied to static images to add motion. Scene transitions: hard cut or 0.5s cross-dissolve based on tone. | +| Subtitles | Dialogue lines timestamped to TTS audio output. Auto-generated SRT subtitle file. Burnt-in or as a separate track — user selects at export. | +| Output Formats | 16:9 (YouTube/desktop) or 9:16 (Reels/Shorts). Resolution: 1080p. User selects at episode creation time. | +| Production Package | ZIP bundle: `final_episode.mp4` + `script.json` + `shot_list.json` + `images/` + `audio/` + `subtitles.srt`. Enables manual re-edit in DaVinci Resolve or Premiere without regenerating assets. | + +### Output Package Structure +```text +episodes/ep_002_the_decision/ +├── final_episode.mp4 +├── script.json +├── shot_list.json +├── subtitles.srt +├── images/ +│ ├── scene_01_shot_01.png +│ ├── scene_01_shot_02.png +│ └── scene_02_shot_01.png +└── audio/ + ├── maya_line_01.mp3 + ├── rohan_line_01.mp3 + └── bgm_episode.mp3 +``` +## Scene-Level Regeneration +Full episode regeneration is expensive — 5 stages × multiple API calls. The system supports surgical re-generation so users can iterate without restarting the entire pipeline. 
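
One way to express this surgical re-running is a static stage-dependency map consulted by the orchestrator. The sketch below mirrors the table that follows; the action keys and stage names are hypothetical, not fixed identifiers from the design.

```python
# Hypothetical mapping: user action -> pipeline stages that must re-run.
# Anything not listed for an action is reused from the previous render.
REGEN_PLAN: dict[str, list[str]] = {
    "regenerate_dialogue": ["script", "audio", "assembly"],
    "change_emotion":      ["script", "visuals", "audio", "assembly"],
    "swap_character":      ["visuals", "assembly"],
    "adjust_pacing":       ["script", "assembly"],
    "change_bgm":          ["audio", "assembly"],
}

FULL_PIPELINE = ["script", "storyboard", "visuals", "audio", "assembly"]

def stages_to_rerun(action: str) -> list[str]:
    """Unknown or unmapped actions fall back to a full pipeline run."""
    return REGEN_PLAN.get(action, FULL_PIPELINE)
```

Only the stages returned here are dispatched; everything else is served from the previous build's cached assets.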

| User Action | What Reruns | What Is Skipped |
|-------------|------------|-----------------|
| Regenerate scene dialogue | Stage 1 (that scene only) → Stage 4 audio for that scene → Stage 5 re-assembly | All other scenes, all images |
| Change scene emotion/tone | Stage 1 (that scene) → Stage 3 images for affected shots → Stage 4 audio → Stage 5 | Unaffected scenes and shots |
| Swap a character out of an episode | Stage 3 image re-generation for all shots with that character → Stage 5 re-assembly | Script, audio, other characters |
| Adjust pacing / trim scene | Stage 1 duration recalculation → Stage 5 re-assembly only | All assets — no regeneration cost |
| Change background music tone | Stage 4 BGM only → Stage 5 re-assembly | All character assets and dialogue |

> [!WARNING]
> **Cost impact:** Regenerating a single scene's dialogue costs ~5% of a full episode generation. Without scene-level granularity, every small edit forces a full pipeline re-run — unacceptable for iterative content creation.

## Cost Considerations

Image generation is the dominant cost driver (one or more calls per shot).
To control cost:
- Reuse background images across scenes when the setting is unchanged.
- Cache character expressions where an emotion is repeated.
- Limit regeneration retries to a maximum threshold.
- Allow a “Script + Asset Package Only” mode that skips auto-rendering the final MP4.


## Microservice Architecture
Each stage runs as an independent service. This enables parallel processing, horizontal scaling, and service replacement without affecting the rest of the pipeline. 
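
A thin common contract is what makes each stage replaceable. The sketch below is illustrative only — class and field names are assumptions, and the real services would communicate over a queue rather than in-process calls.

```python
from typing import Protocol

class StageService(Protocol):
    """Contract every pipeline stage implements; swapping a provider
    (e.g. a different TTS vendor) replaces one class, nothing else."""
    def run(self, episode_id: str, artifact: dict) -> dict: ...

class ScriptService:
    def run(self, episode_id: str, artifact: dict) -> dict:
        # Would call the LLM with the Series Bible + episode prompt; stubbed here.
        return {"episode_id": episode_id, "scenes": [], "source": artifact["prompt"]}

def run_pipeline(services: list[StageService], episode_id: str, prompt: str) -> dict:
    artifact: dict = {"prompt": prompt}
    for service in services:  # each hop could equally be a queue message
        artifact = service.run(episode_id, artifact)
    return artifact
```

Because each service only sees the artifact passed to it, the Visual Service could switch from SDXL to another T2I backend without touching the orchestrator or its neighbours.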
+ +| Service | Responsibility | Scales With | Replaceable With | +|----------|---------------|-------------|------------------| +| Script Service | LLM call → scene JSON | Episode request volume | Any LLM API | +| Visual Service | Image generation per shot | Shot count (most expensive) | Any T2I model or API | +| Audio Service | TTS per dialogue line + BGM | Dialogue line count | Any TTS provider | +| Assembly Service | FFmpeg composition → MP4 | Resolution + scene count | Any video rendering tool | +| Series Bible Store | JSON versioning + episode log | Series count (lightweight) | Any key-value store | + +## Core Challenges & Solutions + +| Challenge | Solution | +|------------|----------| +| Character visual drift across episodes | 4-layer consistency strategy: textual anchoring → IP-Adapter → LoRA (optional) → CLIP similarity gating at 0.85 threshold | +| Personality inconsistency in dialogue | `behavioral_rules` + `speaking_style` injected per-character into every scene prompt, not just globally | +| Duration mismatch | Word-count-based speech estimation (130–150 WPM). Trim/expand pass before finalising script. Audio sync makes TTS the timing master for video assembly | +| Multi-episode continuity drift | Episode memory log written back to Series Bible after each episode. Unresolved threads and relationship state changes persist as context for the next episode's LLM call | +| High regeneration cost | Scene-level regeneration: only re-run affected stages for the changed scene. Full pipeline only on new episodes | +| Partial cast episodes | `cast_subset` field in episode prompt. 
Script LLM instructed to write only for listed characters; absent characters may be referenced but not present on screen | + +## Architecture Summary +| Stage | Technology | Output | Service | +|-------|------------|--------|---------| +| Series Bible | JSON Schema + Web UI | Character/world config + episode log | Series Bible Store | +| Script Gen | GPT-4o / Claude 3.5 Sonnet | Scene-by-scene script JSON | Script Service | +| Storyboard | LLM (2nd pass) | Shot list per scene | Script Service | +| Visuals | SDXL + IP-Adapter / LoRA | Character + background images | Visual Service | +| Audio | ElevenLabs + Mubert/Suno | Dialogue audio + BGM | Audio Service | +| Video Assembly | FFmpeg / MoviePy | Final 5-min MP4 + package ZIP | Assembly Service | + +> [!TIP] +> **Key design principle:** The Series Bible is the single source of truth, injected into every stage of every episode. Episode Memory ensures the series has a living continuity: what happened in Episode 1 shapes how characters behave in Episode 5, without the user having to re-explain it every time.