diff --git a/GenAI.md b/GenAI.md index 3c1fd31b..c7e10cf9 100644 --- a/GenAI.md +++ b/GenAI.md @@ -1,3 +1,16 @@ +
+
+### 👀 Submitted by: **Nithin N**
+
+> ⚠️ **Note:** My primary GitHub account is **[Nithin9585](https://github.com/Nithin9585)** — due to login issues, this submission is made from a secondary account.
+> To verify my profile, projects, and work history, please visit:
+>
+> ## 🔗 [github.com/Nithin9585](https://github.com/Nithin9585)
+
+ +--- + # GenAI Assignment: **Evaluation Criteria** @@ -26,7 +39,7 @@ No code required. We want a **clear, practical proposal** with architecture and ### Your Solution for problem 1: -You need to put your solution here. +[**> Click here to view the Architectural Proposal (Solution_1_Video_Notes.md)**](./Solution_1_Video_Notes.md) ## Problem 2: **Zero-Shot Prompt to generate 3 LinkedIn Post** @@ -36,7 +49,7 @@ Design a **single zero-shot prompt** that takes a user’s persona configuration ### Your Solution for problem 2: -You need to put your solution here. +[**> Click here to view the Prompt Design (Solution_2_LinkedIn.md)**](./Solution_2_LinkedIn.md) ## Problem 3: **Smart DOCX Template β†’ Bulk DOCX/PDF Generator (Proposal + Prompt)** @@ -45,7 +58,7 @@ Users have many Word documents that act like templates (offer letters, certifica We want a system that: 1. Converts an uploaded **DOCX** into a reusable **template** by identifying editable fields. -2. Supports **single generation** (form-fill β†’ DOCX/PDF download). +2. Supports **single generation** (form-fill β†’ DOCX/PDF output). 3. Supports **bulk generation** via **Excel/Google Sheet** rows. ### **Task (No coding)** @@ -54,7 +67,7 @@ Submit a **proposal** for building this system using GenAI (OpenAI/Gemini) for ### Your Solution for problem 3: -You need to put your solution here. +[**> Click here to view the Template Engine Proposal (Solution_3_Doc_Template.md)**](./Solution_3_Doc_Template.md) ## Problem 4: Architecture Proposal for 5-Min Character Video Series Generator @@ -66,4 +79,4 @@ Create a **small, clear architecture proposal** (no code, no prompts) describing ### Your Solution for problem 4: -You need to put your solution here. 
+[**> Click here to view the Character Video Architecture (Solution_4_Character_Video.md)**](./Solution_4_Character_Video.md) diff --git a/Solution_1_Video_Notes.md b/Solution_1_Video_Notes.md new file mode 100644 index 00000000..8155f4fa --- /dev/null +++ b/Solution_1_Video_Notes.md @@ -0,0 +1,221 @@ +# Solution: Proposal for "Video-to-Notes" Platform + +## The Problem + +We have a local folder of long videos (3–4 hours each, 200MB+). We need an automated pipeline to generate a **Summary Package** per video: +- `Summary.md` β€” structured notes with key takeaways +- **Highlight clips** β€” short video segments of key moments +- **Screenshots** β€” frames from important slides/moments + +--- + +## Approach Comparison + +### Approach 1: Online/Cloud-Based SaaS (e.g., Pictory, ScreenApp, Exemplary.ai) + +```mermaid +graph LR + User[User] -->|Upload 5GB+ per video| Cloud[SaaS Platform] + Cloud -->|Black-Box AI| Output[Summary + Clips] + Output -->|Download| User +``` + +**How it works:** Upload videos to a third-party platform. The platform transcribes, summarizes, and generates clips automatically. + +| Factor | Assessment | +|--------|-----------| +| File Size | [NO] Upload bottleneck β€” uploading 200MB–9GB per video is slow and fragile | +| Duration | [NO] Most platforms cap at 3 hours (ScreenApp Business Plan) β€” our 3–4hr videos may fail | +| Batch Processing | [NO] No bulk automation β€” manual upload per file via browser | +| Customization | [NO] Black-box AI optimized for "viral" clips, not technical/informational content | +| Cost | [NO] Subscription-based; 10 Γ— 4hr videos = 2,400 min, exceeds most Pro plan limits | + +**Verdict: REJECTED** β€” Upload friction, duration limits, and no batch control make this unworkable. 
+
+---
+
+### Approach 2: Hybrid Architecture — Local Processing + Cloud AI (RECOMMENDED)
+
+```mermaid
+graph LR
+    A[Local Video 2GB+] -->|FFmpeg: Extract Audio| B[Audio File ~60MB]
+    B -->|Upload only audio| C[Deepgram STT API]
+    C -->|Transcript + Timestamps| D[Claude 3.5 Sonnet LLM]
+    D -->|Structured JSON| E[Local FFmpeg]
+    A -->|Original quality source| E
+    E --> F[Clips + Screenshots + Summary.md]
+```
+
+**How it works:**
+1. **Local FFmpeg** extracts only the audio from each video (Opus codec, 32 kbps speech quality -> ~60MB for 4hrs)
+2. **Deepgram API** transcribes the audio with word-level timestamps (~12 sec per hour of audio)
+3. **Claude 3.5 Sonnet** (200k token context) reads the full transcript and returns a JSON with summary + highlight timestamps
+4. **Local FFmpeg** cuts clips and screenshots from the original high-quality video using those timestamps
+
+**Full Pipeline (Detailed):**
+
+```mermaid
+graph TD
+    subgraph "Phase 1: Ingestion"
+        Start([Start Batch]) --> Scan[Scan Input Folder]
+        Scan --> Check{Valid File?}
+        Check -- No --> LogError[Log to skipped.csv]
+        Check -- Yes --> FFprobe[Extract Metadata via ffprobe]
+    end
+
+    subgraph "Phase 2: Audio Extraction"
+        FFprobe --> Extract["FFmpeg: Extract Opus Audio (-vn -acodec libopus -b:a 32k)"]
+        Extract --> AudioFile(output_audio.opus ~60MB)
+    end
+
+    subgraph "Phase 3: Transcription"
+        AudioFile --> Deepgram["Deepgram Nova-2 API (diarize + timestamps)"]
+        Deepgram --> Transcript[Full Transcript + Word Timestamps JSON]
+    end
+
+    subgraph "Phase 4: Intelligence"
+        Transcript --> LLM["Claude 3.5 Sonnet (200k context window)"]
+        LLM --> Analysis[Structured JSON: Summary + Highlight Segments]
+    end
+
+    subgraph "Phase 5 & 6: Asset Production"
+        Analysis --> Cut["FFmpeg: Cut Clips (-ss start -t duration)"]
+        Analysis --> Snap["FFmpeg: Screenshots (-vframes 1)"]
+        Cut --> Assets[assets folder]
+        Snap --> Assets
+        Assets --> Assemble[Generate Summary.md via Jinja2]
+    end
+
+    Assemble --> End([Done])
+```
+
+| Factor | Assessment |
+|--------|-----------|
+| File Size | [YES] Only ~60MB audio uploaded (97% bandwidth reduction) |
+| Duration | [YES] No limit — Claude 3.5 handles 200k tokens (full 4hr transcript) |
+| Batch Processing | [YES] Python script with retry logic, state persistence, skips corrupt files |
+| Customization | [YES] Full control over prompt — prioritize technical/informational content |
+| Cost | ~$1.50 per 4hr video (Deepgram $0.0043/min + Claude API) |
+
+**Verdict: RECOMMENDED** — Solves the bandwidth problem (audio extraction) and the context problem (200k token LLM).
+
+---
+
+### Approach 3: Fully Offline — Open-Source Models (Faster-Whisper + Llama 3)
+
+```mermaid
+graph LR
+    A[Local Video] -->|Faster-Whisper on GPU| B[Local Transcript]
+    B -->|Llama 3 70B| C[Local JSON Summary]
+    C -->|FFmpeg| D[Clips + Screenshots]
+```
+
+**How it works:** Run everything locally — Faster-Whisper for transcription, Llama 3 70B for summarization, FFmpeg for asset generation. Zero data leaves the machine.
+
+| Factor | Assessment |
+|--------|-----------|
+| File Size | [YES] No upload needed |
+| Duration | [WARN] Llama 3 70B needs 40GB VRAM (dual GPU or A6000 $4,000+) |
+| Batch Processing | [WARN] Prone to OOM crashes on long files; requires chunking (lossy summaries) |
+| Customization | [YES] Full control |
+| Cost | [WARN] High CapEx (hardware); $0 per-run after setup |
+| Privacy | [YES] Air-gapped — no data leaves premises |
+
+**Verdict: CONDITIONAL (viable only if data is classified)** — Requires enterprise GPU hardware. Smaller models (8B) hallucinate timestamps and lose context on 4hr videos.
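
The audio-extraction step at the heart of the hybrid approach is small enough to sketch in Python. This is a minimal illustration only; the helper names and the `.opus` output path are illustrative, not part of the proposal:

```python
import subprocess
from pathlib import Path

def build_extract_cmd(video: Path, audio: Path) -> list[str]:
    """Build the FFmpeg command that strips the video stream and keeps
    a small Opus audio track. 32 kbps Opus is ample for speech and keeps
    a 4-hour track near 60 MB."""
    return [
        "ffmpeg", "-y",
        "-i", str(video),
        "-vn",                    # drop the video stream entirely
        "-acodec", "libopus",
        "-b:a", "32k",
        str(audio),
    ]

def extract_audio(video: Path) -> Path:
    """Run the extraction and return the path of the audio file."""
    audio = video.with_suffix(".opus")
    subprocess.run(build_extract_cmd(video, audio), check=True)
    return audio
```

Uploading the resulting ~60 MB Opus file instead of the full video is what delivers the large bandwidth reduction claimed above.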
+
+---
+
+## Strategic Recommendation Summary
+
+| Feature | SaaS (Cloud Only) | **Hybrid (Local + API)** | Offline (Local Only) |
+|---|---|---|---|
+| Data Movement | [NO] Upload GBs | [YES] Upload MBs (audio only) | [YES] Zero transfer |
+| Long Context (4hr) | [NO] Often capped <3hrs | [YES] 200k+ tokens | [WARN] Hardware limited |
+| Cost Efficiency | [NO] High subscriptions | ~$1.50/video | [WARN] High CapEx |
+| Privacy | [NO] 3rd party storage | [WARN] Transient API calls | [YES] Air-gapped |
+| Batch Automation | [NO] Manual uploads | [YES] Fully scripted | [WARN] OOM risk |
+| **Recommendation** | **Reject** | **Adopt** | **Reject (unless classified)** |
+
+---
+
+## JSON Schema (LLM Output Contract)
+
+The LLM must return strict JSON so FFmpeg commands can be generated reliably:
+
+```json
+{
+  "meta": {
+    "title": "Q3 All-Hands Meeting",
+    "main_topics": ["Financials", "Roadmap", "Q&A"]
+  },
+  "summary_content": {
+    "executive_summary": "200-300 word overview...",
+    "key_takeaways": ["Insight 1", "Insight 2"],
+    "action_items": ["Follow up on budget", "Schedule roadmap review"]
+  },
+  "segments": [
+    {
+      "id": "seg_001",
+      "timestamp_start": "00:15:20",
+      "timestamp_end": "00:18:45",
+      "segment_title": "Q3_Financials_Overview",
+      "description": "CFO presents Q3 revenue breakdown",
+      "reasoning": "High information density — key financial decision point",
+      "assets_to_generate": { "clip": true, "screenshot": false }
+    }
+  ]
+}
+```
+
+**Key design decisions:**
+- `timestamp_start/end` validated against an `HH:MM:SS` regex before any FFmpeg command is built — malformed timestamps are rejected up front
+- `reasoning` field forces Chain-of-Thought, reducing hallucinated timestamps
+- `assets_to_generate` flags let the LLM decide: not every moment needs a 50MB clip
+
+---
+
+## Zero-Shot Prompt (The LLM Instruction)
+
+```
+You are a Senior Technical Archivist. Process the transcript below into a structured JSON knowledge artifact.
+
+RULES (Anti-Hallucination Protocol):
+1. Only use timestamps that exist verbatim in the transcript. Never guess.
+2. Add a 10-second pad: subtract 10s from the start, add 10s to the end of each clip.
+3. Clips must be 30 seconds–3 minutes long.
+4. Prioritize: technical demos, decisions, debates, conclusions. Skip banter/logistics.
+5. Output ONLY valid JSON. No markdown fencing, no preamble.
+
+PROCESS:
+1. Scan the full transcript to map the video structure.
+2. Identify 5–10 highlight candidates.
+3. Verify timestamps exist in the source text.
+4. Output the JSON.
+
+[TRANSCRIPT BELOW]
+```
+
+**Why Zero-Shot?** Few-shot examples waste context-window tokens. With a 4hr transcript (40k tokens), we need every token for the actual content. Claude 3.5 follows detailed zero-shot instructions reliably.
+
+---
+
+## Bulk Processing & Error Handling
+
+**Resilience features:**
+- `ffprobe` validates each file before processing — corrupt files are logged to `skipped.csv` and the batch continues
+- API calls are wrapped in exponential-backoff retries (2s, 4s, 8s, ..., max 5 attempts)
+- `job_status.json` tracks completed videos — if the script crashes while processing video #50, the 49 completed videos are skipped on restart and processing resumes at #50
+
+**Output structure:**
+```
+Output/
+  2024-11-05_Q3_All_Hands/
+    Summary.md
+    manifest.json
+    assets/
+      Clip_01_Financials_00-15-20.mp4
+      Clip_02_Roadmap_01-10-00.mp4
+      Screenshot_01_Slide_A.jpg
+```
+
+**Batch Report** generated at the end: `Batch_Report.csv` with filename, duration, status, and cost estimate per video.
diff --git a/Solution_2_LinkedIn.md b/Solution_2_LinkedIn.md
new file mode 100644
index 00000000..6e890a93
--- /dev/null
+++ b/Solution_2_LinkedIn.md
@@ -0,0 +1,234 @@
+# Solution: Zero-Shot Prompt for LinkedIn Post Generation
+
+## The Task
+
+Design a **single zero-shot prompt** that takes a user's persona configuration + a topic and generates **3 LinkedIn post drafts in 3 distinct styles**, each aligned to the user's voice. Output must be structured JSON so the app can display the 3 drafts.
+ +--- + +## The 3 Post Styles + +| Style | Name | Goal | Hook Type | +|---|---|---|---| +| Style 1 | Personal Narrative | Empathy & Trust | Vulnerability / "I" statement | +| Style 2 | Actionable Listicle | Saves & Utility | Value promise ("X steps to...") | +| Style 3 | Contrarian Insight | Comments & Debate | Pattern interruption / myth-busting | + +--- + +## JSON Output Schema + +The LLM must return this exact structure so the app can render 3 draft cards: + +```json +{ + "meta": { + "topic": "string", + "persona_analysis": "string β€” how the AI interpreted the voice settings" + }, + "posts": [ + { + "style_id": "narrative", + "style_label": "Personal Narrative", + "hook": "string β€” first 2-3 lines (the 'See More' bait)", + "body": "string β€” full post body using \\n for line breaks", + "cta": "string β€” closing call to action", + "hook_analysis": "string β€” why this hook works for the persona", + "estimated_length_words": 150 + }, + { + "style_id": "listicle", + "style_label": "Actionable Listicle", + "hook": "string", + "body": "string", + "cta": "string", + "hook_analysis": "string", + "estimated_length_words": 120 + }, + { + "style_id": "contrarian", + "style_label": "Contrarian Insight", + "hook": "string", + "body": "string", + "cta": "string", + "hook_analysis": "string", + "estimated_length_words": 130 + } + ] +} +``` + +**Critical JSON rules:** +- Use `\n` (literal backslash-n) for line breaks inside strings β€” never actual newlines +- No markdown fencing, no preamble β€” raw JSON only +- `hook_analysis` forces Chain-of-Thought reasoning before committing to the hook text + +--- + +## The Zero-Shot Prompt + +This is the full system prompt to paste into the API call: + +``` +ROLE +You are an elite LinkedIn Personal Brand Strategist and Ghostwriter. +Your job: take a Topic and User Persona and generate 3 high-fidelity LinkedIn post drafts. + +HARD RULES +1. Output ONLY raw valid JSON. No markdown fencing, no preamble, no explanation. +2. 
Use the literal characters \n for line breaks inside JSON strings. Never break the line. +3. Adopt the user's exact voice. Do not revert to generic AI tone. +4. Each post must be meaningfully different in structure and hook β€” not just paraphrases. +5. Do NOT use: "In today's fast-paced world", "game-changer", "synergy", or generic buzzwords. +6. Do NOT use hashtags in the hook. + +INPUT FORMAT +You will receive a JSON object with: +- topic: the subject to write about +- persona.role: the user's professional identity +- persona.tone: array of tone adjectives (e.g. ["direct", "analytical"]) +- persona.experience: years of experience +- persona.formatting: "emojis" or "no-emojis" +- persona.dos: content guidelines to follow +- persona.donts: content guidelines to avoid + +VOICE CALIBRATION +- Tone "direct/no-nonsense" β†’ short sentences, no adjectives, zero emojis +- Tone "empathetic/coach" β†’ softer transitions, question marks, moderate emojis +- Role "executive" β†’ high-level strategy, avoid tactical weeds +- Role "builder/engineer" β†’ technical accuracy, specifics over fluff +- Experience "10+ years" β†’ speak with authority; "junior" β†’ speak with enthusiasm + +STYLE DEFINITIONS + +STYLE 1 β€” PERSONAL NARRATIVE (Framework: SLA β€” Story, Lesson, Application) +- Hook: Cold open. Start mid-action. Use "I" statements. Vulnerability or failure. + Example pattern: "I [did X]. It [went wrong/changed everything]." +- Body: Chronological micro-story. Short paragraphs. Emotional arc. +- Takeaway: One universal lesson from the story. +- Grammar: First-person singular throughout. + +STYLE 2 β€” ACTIONABLE LISTICLE (Framework: EDF β€” Educational Framework) +- Hook: Specific value promise with a number. + Example pattern: "X [things/steps/mistakes] that [outcome]:" +- Body: Vertical list. Each item on its own line. One idea per bullet. No fluff. +- CTA: Tell the reader what to do with this list (save it, share it, try #1 today). 
+- Grammar: Second-person ("you") or imperative voice. + +STYLE 3 β€” CONTRARIAN INSIGHT (Framework: CA β€” Contrarian Approach) +- Hook: Challenge a widely-held belief or "best practice" directly. + Example pattern: "Stop [doing X]. Here's why it's hurting you." +- Body: Dismantle the myth with logic or data. Offer the "new way." +- Tone: Firm, authoritative, slightly polarizing β€” but professional, not aggressive. +- Grammar: Short, punchy lines. One sentence per paragraph. + +CHAIN-OF-THOUGHT PROCESS (internal β€” do not output this) +1. Read the persona. Identify 3 linguistic rules to apply (vocabulary, sentence length, emoji use). +2. Read the topic. Identify the core insight, a personal angle, and a contrarian angle. +3. For each style, write the hook_analysis first, then write the post. +4. Verify: Do all 3 posts sound like the same person but look structurally different? +5. Verify: Is the JSON valid? Are all line breaks escaped as \n? + +OUTPUT SCHEMA +Return exactly this JSON structure with all 3 posts populated: + +{ + "meta": { + "topic": "...", + "persona_analysis": "..." + }, + "posts": [ + { + "style_id": "narrative", + "style_label": "Personal Narrative", + "hook": "...", + "body": "...", + "cta": "...", + "hook_analysis": "...", + "estimated_length_words": 0 + }, + { + "style_id": "listicle", + "style_label": "Actionable Listicle", + "hook": "...", + "body": "...", + "cta": "...", + "hook_analysis": "...", + "estimated_length_words": 0 + }, + { + "style_id": "contrarian", + "style_label": "Contrarian Insight", + "hook": "...", + "body": "...", + "cta": "...", + "hook_analysis": "...", + "estimated_length_words": 0 + } + ] +} +``` + +--- + +## How the App Uses This Prompt + +The app injects the user's persona + topic as the user message: + +```python +import json + +SYSTEM_PROMPT = "..." 
# The full prompt above
+
+user_input = {
+    "topic": "The future of remote work for creative agencies",
+    "persona": {
+        "role": "Agency Founder",
+        "tone": ["direct", "ambitious"],
+        "experience": "15 years",
+        "formatting": "no-emojis",
+        "dos": ["share real lessons", "use specific numbers"],
+        "donts": ["avoid corporate jargon", "no motivational fluff"]
+    }
+}
+
+user_message = f"Generate 3 LinkedIn posts for this input:\n{json.dumps(user_input, indent=2)}"
+
+# OpenAI
+response = client.chat.completions.create(
+    model="gpt-4o",
+    messages=[
+        {"role": "system", "content": SYSTEM_PROMPT},
+        {"role": "user", "content": user_message}
+    ],
+    response_format={"type": "json_object"}  # Forces valid JSON output
+)
+
+# Gemini (google-generativeai): the system prompt goes in system_instruction
+model = genai.GenerativeModel(
+    "gemini-1.5-pro",
+    system_instruction=SYSTEM_PROMPT
+)
+response = model.generate_content(
+    user_message,
+    generation_config={"response_mime_type": "application/json"}  # Forces valid JSON output
+)
+```
+
+---
+
+## Why Zero-Shot Works Here
+
+- **No examples needed** — the style definitions (SLA, EDF, CA) are precise enough to guide structure
+- **Saves context window** — few-shot examples would consume tokens needed for the persona and topic
+- **Chain-of-thought via `hook_analysis`** — forces the model to reason before generating, reducing hallucination
+- **Negative constraints** — "Do NOT use..." instructions are more effective than positive ones at preventing generic output
+
+---
+
+## Error Handling
+
+| Failure | Detection | Fix |
+|---|---|---|
+| Invalid JSON | Pydantic/Zod parse error | Retry with error message: "Invalid JSON at line X. Regenerate." |
+| All 3 posts sound the same | Style-bleed | Add to prompt: "Verify posts are structurally distinct before outputting." |
+| Contrarian post is too soft | Safety alignment | Reframe as "professional debate" not "attack" |
+| Line breaks broken | Literal newline in JSON | Enforce the `\n` rule in the prompt + post-process: `content.replace('\n', '\\n')` |
+
+---
diff --git a/Solution_3_Doc_Template.md b/Solution_3_Doc_Template.md
new file mode 100644
index 00000000..647ca0d8
--- /dev/null
+++ b/Solution_3_Doc_Template.md
@@ -0,0 +1,202 @@
+# Solution: Smart DOCX Template → Bulk DOCX/PDF Generator
+
+## The Task (from GenAI.md)
+
+Build a system that:
+1. Converts an uploaded DOCX into a reusable template by **auto-detecting editable fields using GenAI**
+2. Supports **single generation** (form-fill → DOCX/PDF output)
+3. Supports **bulk generation** via Excel/Google Sheet rows
+
+No code required — a practical design using GenAI (OpenAI/Gemini) for field detection and schema generation.
+
+---
+
+## System Architecture
+
+```mermaid
+graph TD
+    subgraph "Step 1: Template Creation"
+        Upload[User uploads DOCX] --> Cleaner[XML Run Merger]
+        Cleaner --> LLM[GenAI Field Detector]
+        LLM --> Schema[JSON Field Schema]
+        Schema --> UI[Smart Mapper UI]
+    end
+
+    subgraph "Step 2: Single Generation"
+        UI --> Form[Dynamic Web Form]
+        Form --> Engine[Jinja2 Template Engine]
+        Engine --> Gotenberg[PDF Renderer - Gotenberg]
+        Gotenberg --> Output[DOCX / PDF Download]
+    end
+
+    subgraph "Step 3: Bulk Generation"
+        Sheet[Excel / Google Sheet] --> Validator[Schema Validator]
+        Validator --> Queue[Task Queue - Redis/Celery]
+        Queue --> Workers[Worker Pool]
+        Workers --> Gotenberg
+        Workers --> S3[S3 Storage]
+        S3 --> ZIP[ZIP Bundle + Report.csv]
+    end
+```
+
+---
+
+## The Core GenAI Role: Field Detection
+
+The hardest part of this system is **automatically identifying which parts of a DOCX are dynamic fields**. This is where GenAI is used.
+
+### The Problem: Split Runs in DOCX XML
+
+When a user types `{{CandidateName}}` in Word, the internal XML often looks like this due to autocorrect/spell-check interruptions:
+
+```xml
+<w:p>
+  <w:r><w:t>{{Candidate</w:t></w:r>
+  <w:r><w:t>Na</w:t></w:r>
+  <w:r><w:t>me}}</w:t></w:r>
+</w:p>
+```
+
+A regex search for `{{CandidateName}}` fails. The system must first **merge split runs**, then scan.
+
+### GenAI Field Detection Prompt
+
+After XML cleaning, the plain text of the document is sent to the LLM:
+
+```
+ROLE
+You are a document analysis engine. Analyze the following document text and identify all dynamic fields that should be replaced per-recipient.
+
+RULES
+1. Identify explicit placeholders: {{FieldName}}, [FieldName], <FieldName>, or ALL_CAPS_WORDS used as variables.
+2. Identify implicit fields: dates, names, amounts, addresses that appear to be instance-specific.
+3. For each field, infer its data type: String, Date, Currency, Email, Boolean, List.
+4. Identify any repeating blocks (e.g., invoice line items) as Loop fields.
+5. Identify any conditional blocks (e.g., "If EU client, include GDPR clause") as Boolean fields.
+6. Output ONLY valid JSON. No explanation.
+ +OUTPUT SCHEMA: +{ + "fields": [ + { + "name": "string β€” camelCase field name", + "label": "string β€” human readable label", + "type": "String | Date | Currency | Email | Boolean | Number", + "required": true, + "detected_from": "string β€” the exact text that triggered detection", + "description": "string β€” hint for the user filling the form" + } + ], + "loops": [ + { + "name": "string β€” loop variable name", + "description": "string β€” what each row represents", + "fields": ["field1", "field2"] + } + ], + "conditionals": [ + { + "name": "string β€” boolean flag name", + "description": "string β€” what block this controls" + } + ] +} + +DOCUMENT TEXT: +[DOCUMENT PLAIN TEXT HERE] +``` + +### Example Output + +For an offer letter containing `Dear {{CandidateName}}`, `Start Date: [StartDate]`, and a salary table: + +```json +{ + "fields": [ + { "name": "candidateName", "label": "Candidate Full Name", "type": "String", "required": true, "detected_from": "{{CandidateName}}", "description": "Enter the candidate's full legal name" }, + { "name": "startDate", "label": "Start Date", "type": "Date", "required": true, "detected_from": "[StartDate]", "description": "First day of employment" }, + { "name": "salary", "label": "Annual Salary", "type": "Currency", "required": true, "detected_from": "{{Salary}}", "description": "Gross annual compensation in USD" }, + { "name": "includeRelocation", "label": "Include Relocation Package?", "type": "Boolean", "required": false, "detected_from": "Relocation Allowance clause", "description": "Toggle to include/exclude relocation terms" } + ], + "loops": [], + "conditionals": [ + { "name": "includeRelocation", "description": "Entire relocation package paragraph" } + ] +} +``` + +--- + +## Template Engine: How Fields Get Injected + +The DOCX template uses **Jinja2 syntax** (via `python-docx-template`): + +| Use Case | Syntax in DOCX | +|---|---| +| Simple field | `{{ candidateName }}` | +| Date formatting | `{{ startDate \| 
date_format }}` | +| Currency formatting | `{{ salary \| currency }}` | +| Conditional block | `{%p if includeRelocation %}...{%p endif %}` | +| Table row loop | `{%tr for item in lineItems %}...{%tr endfor %}` | + +The system wraps the cleaned XML with these tags based on the GenAI-detected schema. + +--- + +## Single Generation Flow + +1. User uploads DOCX β†’ GenAI detects fields β†’ JSON schema saved +2. App renders a **dynamic web form** from the schema (date pickers for Date fields, currency inputs for Currency, toggles for Boolean) +3. User fills form β†’ app injects data into template β†’ sends to **Gotenberg** (Dockerized LibreOffice) for PDF conversion +4. User downloads DOCX + PDF + +--- + +## Bulk Generation Flow + +```mermaid +graph LR + A[Upload Excel / Connect Google Sheet] --> B[Validate columns against schema] + B --> C{All rows valid?} + C -- No --> D[Pre-flight Report: show errors before running] + C -- Yes --> E[Push one task per row to Redis queue] + E --> F[Worker pool: inject data + render PDF per row] + F --> G[Upload PDFs to S3] + G --> H[Stream ZIP + Generation_Report.csv to user] +``` + +**Key design decisions:** +- **Pre-flight validation** before any generation starts β€” show all errors upfront, not mid-job +- **Fan-out architecture** β€” each row is an independent task; one failure doesn't kill the batch +- **Streaming ZIP** β€” PDFs piped directly from S3 into ZIP stream; server never holds full file in RAM +- **Generation_Report.csv** β€” lists every row with status (Success/Failed) and error reason for failed rows + +--- + +## File Naming + +User defines a naming pattern during template setup using the same field names: + +``` +Pattern: {{ candidateName }}_OfferLetter_{{ startDate }}.pdf +Output: John-Doe_OfferLetter_2024-03-15.pdf +``` + +**Sanitization:** `/`, `\`, `:`, `*` and other illegal characters in field values are replaced with `-` before filename construction. Duplicates get `_1`, `_2` suffixes. 
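
The sanitization and de-duplication rules above can be sketched in a few lines of Python. The helper names and the exact illegal-character set are illustrative assumptions, not a prescribed implementation:

```python
import os
import re

# Characters that are invalid in Windows/NTFS filenames (assumed set)
_ILLEGAL = re.compile(r'[\\/:*?"<>|]')

def sanitize_filename(value: str) -> str:
    """Replace illegal filename characters in a field value with '-'."""
    return _ILLEGAL.sub("-", value).strip()

def dedupe(name: str, taken: set[str]) -> str:
    """Append _1, _2, ... before the extension until the name is unique."""
    if name not in taken:
        return name
    stem, ext = os.path.splitext(name)   # ext keeps the dot ('' if none)
    n = 1
    while (candidate := f"{stem}_{n}{ext}") in taken:
        n += 1
    return candidate
```

`sanitize_filename` would run on each rendered field value and `dedupe` on the final constructed name, so `John/Doe` becomes `John-Doe` and a second identical row gets a `_1` suffix.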
+ +--- + +## Technology Stack + +| Layer | Choice | Why | +|---|---|---| +| Field Detection | OpenAI GPT-4o / Gemini 1.5 Pro | Best at inferring field types from context | +| Template Engine | `python-docx-template` (Jinja2) | Handles loops, conditionals natively in DOCX XML | +| Excel Parsing | `python-calamine` | Rust-based, ~10x faster than pandas for large files | +| PDF Rendering | Gotenberg (Dockerized LibreOffice) | Preserves fonts, tables, headers β€” no Word license needed | +| Task Queue | Redis + Celery | Industry standard for Python async bulk jobs | +| Storage | AWS S3 / MinIO | Lifecycle rules auto-delete temp files after 24hrs | +| Google Sheets | Sheets API v4 + OAuth 2.0 | Batch fetch entire range in one API call | + +--- + diff --git a/Solution_4_Character_Video.md b/Solution_4_Character_Video.md new file mode 100644 index 00000000..d5021cb3 --- /dev/null +++ b/Solution_4_Character_Video.md @@ -0,0 +1,373 @@ +# Architecture: Character-Based Video Series Generator (5-Min Episodes) + +> **Real-World Reference Implementation:** +> This architecture is based on **[LoraFrame (IDLock Engine)](https://github.com/Nithin9585/LoraFrame_)** β€” a project I built and presented at the **CENI AI Hackathon, Hyderabad** (Top 10 Finalist). +> LoraFrame is a persistent character memory & video generation system that combines episodic memory, LLM reasoning, and identity-preservation technology to create "permanent digital actors" that maintain visual consistency across generated images and videos. +> *(Note: This is my personal GitHub account β€” [Nithin9585](https://github.com/Nithin9585))* + +--- + +## The Problem + +We need to generate a **5-minute episode video** using AI-generated characters, but **no current AI video model can produce 5 minutes in one shot**. State-of-the-art models (Veo 3.1, Runway Gen-4, Kling 3.0) cap out at **5–15 seconds per clip**. The core challenge is: + +1. **Generate many short clips** that are individually high-quality +2. 
**Maintain character identity** (face, clothing, style) across all clips +3. **Assemble clips** into a coherent 5-minute narrative with transitions, audio, and pacing +4. **Remember character history** across episodes so a series feels continuous + +This proposal describes a system that solves all four problems. + +--- + +## System Architecture (Core Engine) + +The following diagram shows the **LoraFrame IDLock Engine** β€” the core system that powers character creation, identity-locked generation, memory, and self-healing refinement. + +```mermaid +graph TD + + %% ===== CLIENT LAYER ===== + U["User / Client App"] --> API["API Gateway (FastAPI)"] + + %% ===== JOB QUEUE ===== + API --> Q["Redis Queue (RQ)"] + + %% ===== WORKERS ===== + Q --> GEN["Generator Worker"] + Q --> REF["Refiner Worker"] + Q --> STA["State Analyzer Worker"] + Q --> COL["LoRA Collector Worker"] + Q --> TRN["LoRA Trainer Worker"] + + %% ===== DATA LAYER ===== + PG["Postgres Metadata DB"] + VDB["Vector DB (Embeddings)"] + OBJ["Object Storage (Images / Models)"] + + %% ===== MEMORY ===== + GEN --> PG + GEN --> VDB + STA --> PG + STA --> VDB + + %% ===== PROMPT ENGINE ===== + GEN --> LLM["LLM Prompt Engine"] + + %% ===== LORA SYSTEM ===== + GEN --> LR["LoRA Registry"] + LR --> GEN + COL --> TRN + TRN --> OBJ + TRN --> LR + LR --> PG + + %% ===== GENERATION ===== + GEN --> AI["Image / Video Generator"] + AI --> OBJ + + %% ===== VALIDATION ===== + AI --> VAL["Vision Validator (IDR)"] + VAL -->|Pass| STA + VAL -->|Fail| REF + + %% ===== REFINEMENT LOOP ===== + REF --> AI + + %% ===== LORA DATA PIPELINE ===== + STA --> COL + + %% ===== ADMIN UI ===== + API --> UI["Admin / Dashboard"] +``` + +### Component Breakdown + +| Component | Technology | Role | +|---|---|---| +| **API Gateway** | FastAPI (Python) | REST API for character creation, episode requests, job status | +| **Redis Queue** | Redis / RQ | Async job dispatch to workers; decouples API from heavy GPU tasks | +| **Generator Worker** | Veo 3.1 
/ Imagen 3 / Kling 3.0 | Produces identity-locked images/video clips per scene | +| **Refiner Worker** | InsightFace + Inpainting | Self-healing loop β€” if IDR detects identity drift, re-generates the face region | +| **State Analyzer** | LLM + Vector DB | Updates episodic memory after each generation (injuries, costume changes, mood) | +| **LoRA Collector/Trainer** | SDXL LoRA training pipeline | Collects validated images β†’ fine-tunes a character-specific LoRA adapter | +| **LLM Prompt Engine** | Groq (Llama 3 70B) | Converts simple user prompts into rich, context-aware scene descriptions | +| **Vision Validator (IDR)** | InsightFace + ONNX Runtime | Compares generated face embeddings against canonical reference β€” reject if similarity < threshold | +| **LoRA Registry** | Postgres + Object Storage | Tracks which LoRA weights belong to which character; version-controlled | +| **Postgres** | PostgreSQL (SQLAlchemy) | Stores character metadata, episode scripts, scene timelines, generation logs | +| **Vector DB** | Pinecone / FAISS | Stores episodic memory embeddings for RAG-based character state retrieval | +| **Object Storage** | GCS / S3 | Stores generated images, video clips, LoRA model weights | + +--- + +## Long Video Assembly Pipeline (5-Minute Episodes) + +Since no single AI model can generate 5 minutes of video at once, we use a **Scene-by-Scene Generation + Assembly** pipeline. This is the key architectural addition that turns short AI clips into a full episode. 
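
The chaining trick at the core of the pipeline (detailed below) can be sketched as a small loop. Here `generate_clip` and `extract_last_frame` are hypothetical placeholders for the video-model API call and an FFmpeg frame-grab helper; this is a logic sketch, not the actual LoraFrame code:

```python
from pathlib import Path
from typing import Callable, Optional

def extend_scene(
    generate_clip: Callable[[str, Optional[Path]], Path],
    extract_last_frame: Callable[[Path], Path],
    prompt: str,
    target_sec: float,
    clip_sec: float = 6.0,
) -> list[Path]:
    """Chain short generations until the scene reaches its target length.

    generate_clip(prompt, reference_frame) stands in for a Veo/Kling API
    call returning a new ~6s clip; extract_last_frame grabs the clip's
    final frame (e.g. via `ffmpeg -sseof -1 -i clip.mp4 -frames:v 1 ...`).
    """
    clips: list[Path] = []
    reference: Optional[Path] = None        # first call: no reference frame yet
    elapsed = 0.0
    while elapsed < target_sec:
        clip = generate_clip(prompt, reference)
        clips.append(clip)
        reference = extract_last_frame(clip)  # seeds the next generation
        elapsed += clip_sec
    return clips
```

A 20-second scene built from ~6-second clips therefore takes four chained calls, each seeded with the previous clip's final frame.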
+
+```mermaid
+graph TD
+    subgraph "Phase 1: Script & Storyboard"
+        EP["Episode Prompt (user story)"] --> SCRIPT["LLM Script Engine"]
+        BIBLE["Series Bible (characters, relationships)"] --> SCRIPT
+        MEM["Episodic Memory (Vector DB)"] --> SCRIPT
+        SCRIPT --> SCENES["Scene Breakdown JSON"]
+        SCENES --> |"12-18 scenes × 15-25s each"| SB["Storyboard Plan"]
+    end
+
+    subgraph "Phase 2: Scene-by-Scene Generation"
+        SB --> LOOP["Scene Generation Loop"]
+        LOOP --> |"For each scene"| IDGEN["IDLock Generator (Core Engine)"]
+        IDGEN --> |"5-8s clip"| EXTEND["Scene Extension (last-frame chaining)"]
+        EXTEND --> |"15-25s clip"| VALID["IDR Validation"]
+        VALID --> |Pass| CLIP["Validated Scene Clip"]
+        VALID --> |Fail| REFINE["Refiner → Re-generate"]
+        REFINE --> IDGEN
+    end
+
+    subgraph "Phase 3: Audio Pipeline"
+        SCENES --> TTS["TTS Engine (per character voice)"]
+        SCENES --> MUSIC["Background Music / SFX Selection"]
+        TTS --> AUDIO["Scene Audio Tracks"]
+        MUSIC --> AUDIO
+    end
+
+    subgraph "Phase 4: Assembly & Post-Production"
+        CLIP --> ASSEMBLE["FFmpeg Video Assembler"]
+        AUDIO --> ASSEMBLE
+        ASSEMBLE --> TRANS["Transition Engine (cross-fade, cuts)"]
+        TRANS --> FINAL["Final 5-Min Episode MP4"]
+        FINAL --> QC["Quality Check (duration, lip-sync, continuity)"]
+        QC --> |Pass| DELIVER["Deliver to User"]
+        QC --> |Fail| RETRY["Flag scenes for regeneration"]
+        RETRY --> LOOP
+    end
+```
+
+### How the Long Video Pipeline Works
+
+#### Phase 1: Script & Storyboard Generation
+
+The user submits a **short episode prompt** (e.g., *"Ava discovers the secret lab; her mentor warns her about the consequences"*). The LLM Script Engine:
+
+1. **Loads the Series Bible** — character profiles, relationship maps, visual rules
+2. **Retrieves episodic memory** via RAG — what happened in previous episodes (Vector DB)
+3. **Generates a structured scene breakdown** — typically **12–18 scenes**, each 15–25 seconds long, totaling ~5 minutes
+
+Example Scene Breakdown JSON:
+```json
+{
+  "episode": {
+    "title": "The Hidden Lab",
+    "total_target_duration_sec": 300,
+    "scenes": [
+      {
+        "scene_id": "sc_001",
+        "description": "Establishing shot — Ava walks toward the abandoned building at dusk",
+        "characters": ["ava"],
+        "duration_sec": 20,
+        "camera": "wide tracking shot, golden hour lighting",
+        "dialogue": null,
+        "narration": "Ava had always been curious. But tonight, curiosity felt dangerous.",
+        "mood": "tense, mysterious"
+      },
+      {
+        "scene_id": "sc_002",
+        "description": "Close-up — Ava pushes open the heavy metal door, revealing blue lab lighting inside",
+        "characters": ["ava"],
+        "duration_sec": 15,
+        "camera": "close-up face, rack focus to door interior",
+        "dialogue": null,
+        "narration": null,
+        "mood": "suspenseful"
+      }
+    ]
+  }
+}
+```
+
+#### Phase 2: Scene-by-Scene Generation with Last-Frame Chaining
+
+This is the critical technique for producing **long, continuous video from short AI clips**:
+
+1. **Generate an initial 5–8 second clip** for each scene using the IDLock Generator (Veo 3.1 / Kling 3.0 API)
+2. **Extract the last frame** of the generated clip
+3. **Feed the last frame as a reference image** to the next generation call → this is **"Scene Extension" / "Last-Frame Chaining"**
+4. **Repeat 2–3 times** per scene to extend each scene to 15–25 seconds
+5. **IDR Validation** checks every clip — if the character's face drifts beyond the similarity threshold, the Refiner regenerates that clip
+
+This approach is inspired by how **Veo 3.1's SceneBuilder** and **Kling 3.0's multi-shot generation** work:
+
+| Technique | Source | Used For |
+|---|---|---|
+| **Last-Frame Chaining** | Veo 3.1 Scene Extension API | Extend a scene from 8s → 20s while maintaining visual continuity |
+| **Multi-Shot Generation** | Kling 3.0 MVL Architecture | Generate 2–6 distinct scenes with character consistency in a single session |
+| **Element Reference** | Kling 3.0 Character Reference 3.0 | Lock character identity across all shots using reference images |
+| **IDR Self-Healing** | LoraFrame (InsightFace) | If face similarity drops below 0.85, regenerate that specific clip |
+
+#### Phase 3: Audio Pipeline (Parallel)
+
+While video is being generated, the audio pipeline runs in parallel:
+
+- **TTS Engine** (ElevenLabs / Coqui XTTS) generates character-specific voice lines from the script
+- **Music Selection** picks background tracks matching mood tags (tense, joyful, dramatic)
+- **SFX Engine** adds ambient sounds (footsteps, door creaks, wind)
+
+Each character has a **fixed voice profile** stored in the Series Bible — ensuring the same voice across episodes.
+
+#### Phase 4: Assembly & Post-Production
+
+```
+FFmpeg Assembly Pipeline:
+1. Concatenate scene clips in order → raw_video.mp4
+2. Add cross-fade transitions (0.5s) → smooth_video.mp4
+3. Mix dialogue audio with music/SFX → mixed_audio.aac
+4. Merge video + audio → episode_final.mp4
+5. Validate total duration ≈ 300 seconds → QC pass/fail
+```
+
+If QC fails (e.g., total duration is 247s instead of 300s), the system flags the shortest scenes and regenerates them with longer durations.
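A minimal sketch of that QC step in pure Python (illustrative only: the names `qc_duration` and the 10 s tolerance/recovery figures are assumptions, not the actual LoraFrame code):

```python
# Hypothetical sketch of the duration QC check described above.
# qc_duration and the tolerance values are illustrative, not real LoraFrame APIs.

TARGET_SEC = 300
TOLERANCE_SEC = 10  # assumed: accept episodes within 290-310 s


def qc_duration(scene_durations: dict[str, float]) -> list[str]:
    """Return scene_ids to regenerate; an empty list means QC passes."""
    total = sum(scene_durations.values())
    if abs(total - TARGET_SEC) <= TOLERANCE_SEC:
        return []  # QC pass
    shortfall = TARGET_SEC - total
    # Flag the shortest scenes first until the shortfall is covered,
    # assuming each regeneration can add up to ~10 s to a scene.
    flagged, recovered = [], 0.0
    for scene_id, dur in sorted(scene_durations.items(), key=lambda kv: kv[1]):
        if recovered >= shortfall:
            break
        flagged.append(scene_id)
        recovered += 10.0
    return flagged


# Example: 15 scenes totalling 247 s -> QC fails, shortest scenes get flagged
durations = {f"sc_{i:03d}": d for i, d in enumerate(
    [20, 15, 18, 16, 15, 17, 16, 15, 18, 16, 15, 17, 16, 15, 18], start=1)}
print(sum(durations.values()))   # 247
print(qc_duration(durations))
```

The greedy shortest-first ordering mirrors the "flags the shortest scenes" rule above; in the real system this logic would live in the assembler worker rather than a standalone function.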
+
+---
+
+## Key System: Identity Persistence (IDLock)
+
+The biggest challenge in AI video series is **keeping the same character looking the same** across hundreds of generated clips. LoraFrame solves this with a multi-layer identity system:
+
+```
+┌─────────────────────────────────────────────────────┐
+│                    IDLock Stack                     │
+├─────────────────────────────────────────────────────┤
+│ Layer 1: Reference Images (canonical face + angles) │
+│ Layer 2: InsightFace Embeddings (512-d face vector) │
+│ Layer 3: LoRA Weights (fine-tuned on character)     │
+│ Layer 4: Style Anchors (clothing, palette, props)   │
+│ Layer 5: Episodic Memory (RAG — what happened)      │
+└─────────────────────────────────────────────────────┘
+
+Generation Flow:
+User Prompt → LLM enriches with memory + style anchors
+            → Generator uses reference images + LoRA weights
+            → Output validated by InsightFace (cosine similarity ≥ 0.85)
+            → PASS: save to memory | FAIL: refine and retry (max 3 loops)
+```
+
+---
+
+## Episodic Memory: How Characters "Remember"
+
+Each generation event creates a **memory record** stored in the Vector DB:
+
+```json
+{
+  "character_id": "char_ava_001",
+  "episode": 3,
+  "scene": "sc_007",
+  "state": {
+    "clothing": "torn lab coat, no glasses",
+    "injuries": "bandaged left hand",
+    "mood": "determined but shaken",
+    "location": "underground lab corridor"
+  },
+  "embedding": [0.012, -0.445, 0.893, ...]
+}
+```
+
+Before generating a new scene, the **LLM Prompt Engine** runs a RAG query:
+- *"What was Ava wearing in the most recent scene?"*
+- Vector DB returns the latest state → LLM includes `"torn lab coat, bandaged left hand"` in the generation prompt
+- This ensures **visual continuity within and across episodes**
+
+---
+
+## Technology Stack
+
+### Backend (LoraFrame Engine)
+| Layer | Technology |
+|---|---|
+| Framework | Python 3.10+, FastAPI |
+| Database | PostgreSQL (SQLAlchemy) |
+| Cache / Queue | Redis (RQ workers) |
+| Vector DB | Pinecone / FAISS |
+| LLM Inference | Groq (Llama 3 70B/8B) |
+| Image/Video Gen | Google Veo 3.1, Imagen 3, Kling 3.0 (multi-provider) |
+| Identity Lock | InsightFace, ONNX Runtime |
+| LoRA Training | SDXL LoRA fine-tuning pipeline |
+| Storage | Google Cloud Storage (GCS) / AWS S3 |
+
+### Long Video Assembly Layer
+| Component | Technology |
+|---|---|
+| Scene Extension | Veo 3.1 Scene Extension API (last-frame chaining) |
+| Multi-Shot | Kling 3.0 Multi-Shot Generation |
+| TTS | ElevenLabs API / Coqui XTTS (self-hosted) |
+| Music/SFX | Mubert API / local library |
+| Video Assembly | FFmpeg (concat, transitions, audio merge) |
+| Quality Control | Duration validator + lip-sync checker |
+
+### Frontend
+| Layer | Technology |
+|---|---|
+| Framework | React.js |
+| Language | JavaScript |
+
+---
+
+## Project Structure (LoraFrame Backend)
+
+```
+cineAI/
+├── app/
+│   ├── api/              # API Routes (characters, generate, video, episodes)
+│   ├── core/             # Config, Database, Redis setup
+│   ├── models/           # SQLAlchemy Database Models
+│   ├── schemas/          # Pydantic Request/Response Models
+│   ├── services/         # Core Logic (Groq, Gemini, MemoryEngine)
+│   └── workers/          # Async Task Workers (generator, refiner, trainer, assembler)
+├── assembly/
+│   ├── script_engine/    # LLM-based scene breakdown generator
+│   ├── chainer/          # Last-frame chaining / scene extension logic
+│   ├── audio/            # TTS, music selection, SFX mixing
+│   └── ffmpeg_ops/       # FFmpeg concat, transitions, final render
+├── scripts/              # Utility scripts
+├── tests/                # Pytest suite
+├── uploads/              # Local storage for dev
+├── .env.example          # Environment variable template
+├── requirements.txt      # Python dependencies
+└── README.md
+```
+
+---
+
+## API Endpoints
+
+| Method | Endpoint | Description |
+|---|---|---|
+| `POST` | `/api/v1/characters` | Create a new character from reference images |
+| `POST` | `/api/v1/generate` | Generate a consistent image for a character |
+| `POST` | `/api/v1/video/generate` | Generate a single video scene (short clip) |
+| `POST` | `/api/v1/episodes/create` | Create a full 5-min episode from a story prompt |
+| `GET` | `/api/v1/episodes/{episode_id}/status` | Poll episode assembly progress |
+| `GET` | `/api/v1/episodes/{episode_id}/download` | Download the final episode MP4 + assets |
+| `GET` | `/api/v1/jobs/{job_id}` | Check individual generation job status |
+
+---
+
+## Live Deployment
+
+| Component | URL |
+|---|---|
+| **Frontend (Live App)** | [https://lore-frame-in.vercel.app](https://lore-frame-in.vercel.app) |
+| **Backend API (GCP)** | [https://cineai-api-4sjsy6xola-uc.a.run.app/docs](https://cineai-api-4sjsy6xola-uc.a.run.app/docs) |
+
+---
+
+## Why This Architecture Works for 5-Minute Videos
+
+| Challenge | Solution |
+|---|---|
+| AI models only generate 5–15s clips | **Scene-by-scene generation** with last-frame chaining extends each to 15–25s; 15 scenes = 5 min |
+| Character faces change between clips | **IDLock (InsightFace + LoRA)** validates every frame; self-healing refiner fixes drift |
+| Stories lack continuity across episodes | **Episodic Memory (RAG + Vector DB)** ensures characters "remember" past events and state |
+| Audio doesn't match video | **Parallel audio pipeline** with per-character voice profiles + mood-tagged music |
+| Quality varies across scenes | **Vision Validator + QC pipeline** — reject and regenerate below-threshold clips |
+| No single tool does everything | **Modular, multi-provider architecture** — swap Veo for Kling or Runway per scene as needed |
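
To make the IDR self-healing loop concrete, here is a minimal sketch of the validate-or-retry logic (illustrative only: `validate_or_retry` is a hypothetical name, and `generate_clip` stands in for the real Veo/Kling generation plus InsightFace embedding extraction, which are not shown):

```python
import math

SIM_THRESHOLD = 0.85  # from the IDLock spec: cosine similarity >= 0.85
MAX_RETRIES = 3       # refine-and-retry loop cap


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def validate_or_retry(reference: list[float], generate_clip) -> tuple[bool, int]:
    """Run the IDR loop: generate, compare to the canonical embedding, retry on drift.

    `generate_clip` is a stand-in callable returning the face embedding
    extracted from a freshly generated clip.
    """
    for attempt in range(1, MAX_RETRIES + 1):
        candidate = generate_clip()
        if cosine_similarity(reference, candidate) >= SIM_THRESHOLD:
            return True, attempt   # PASS: save clip and update memory
    return False, MAX_RETRIES      # FAIL: flag the scene for regeneration


# Toy demo with 4-d vectors instead of InsightFace's 512-d embeddings:
ref = [1.0, 0.0, 0.0, 0.0]
drifted = iter([[0.0, 1.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0]])
ok, attempts = validate_or_retry(ref, lambda: next(drifted))
print(ok, attempts)  # True 2  (first clip drifted, second passed)
```

In production the reference embedding would come from the LoRA Registry's canonical images, and a FAIL would route the clip back through the Refiner Worker rather than simply regenerating blind.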