diff --git a/Backend.md b/Backend.md index 4f7ff501..3246a4a2 100644 --- a/Backend.md +++ b/Backend.md @@ -26,7 +26,164 @@ **Your Solution for problem 1:** -You need to put your solution here. +--- + +### System Design + +**Stack:** Spring Boot (Java 21), PostgreSQL, Redis (queue), S3-compatible object storage. + +Spring Boot was chosen deliberately over Node.js. Java 21 virtual threads handle concurrent I/O-bound AI API calls cheaply without extra infrastructure. Spring Security, HikariCP connection pooling, and @Async pipeline stages are production-grade and reduce the amount of code we need to write from scratch. + +``` +[Client / CLI Script] + | + v +[API Service - Spring Boot] + | + |-----> [PostgreSQL] metadata, job state, artifact references + | + |-----> [S3 Object Storage] raw video files, all generated outputs + | + |-----> [Redis Queue] job queue only + | + v + [Worker Service] separate deployable, scales independently + | + |-----> FFmpeg clip extraction, screenshots + |-----> Whisper/Deepgram transcription (external HTTP) + |-----> OpenAI/Gemini summarization + highlights (external HTTP) +``` + +**What each component owns and deliberately does not own:** + +The API Service handles all client-facing concerns: authentication, upload URL generation, job creation, status polling, and issuing download URLs. It never calls FFmpeg or any AI service. Its job is to accept requests quickly and hand off work. + +The Worker Service is the only component that does heavy processing. It picks jobs from the Redis queue, runs the pipeline stages in order, writes all outputs to S3, and updates job state in PostgreSQL. It is deployed separately from the API so that slow processing (30 minutes for a 4-hour video) never blocks client requests. + +S3 stores all file data — raw video, transcripts, clips, screenshots, summary markdown. The database stores only the S3 key (a path string) alongside metadata. Storing binary files in the database would destroy query performance. 
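The virtual-thread claim in the stack choice above can be made concrete. A minimal, self-contained sketch assuming Java 21; the `Thread.sleep` stands in for a blocking AI/transcription HTTP call, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class VirtualThreadFanout {

    // One virtual thread per blocking call (Java 21). The sleep simulates an
    // I/O-bound external AI API request; thousands of these can block
    // concurrently without exhausting OS threads.
    public static List<String> fanOut(List<String> chunks) throws Exception {
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (String chunk : chunks) {
                futures.add(pool.submit(() -> {
                    Thread.sleep(10);           // stand-in for the external HTTP call
                    return "summary:" + chunk;  // stand-in for the AI response
                }));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get());           // preserves input order
            }
            return results;
        } // close() waits for all submitted tasks to finish
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fanOut(List.of("part1", "part2", "part3")));
    }
}
```

One virtual thread per blocking call scales to many in-flight AI requests without tuning a platform-thread pool, which is the reason given above for preferring Java 21 over adding extra infrastructure.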
+ +Redis is used only as a job queue. No application state lives in Redis. If Redis is lost, jobs already in PostgreSQL can be re-queued from there. + +**Batch Folder Processing (core requirement):** + +The spec requires processing a local folder of videos with no manual work per video. A browser UI alone cannot satisfy this — forcing a user to upload 50 files individually through a form violates the requirement. The API is therefore designed headless-first: every endpoint is consumable by a CLI script, not just a browser. + +``` +[Local Folder /videos/] [CLI Script] + lecture_01.mp4 ------------> 1. Scan folder for video files + lecture_02.mp4 2. For each file concurrently (default pool=5): + lecture_03.mp4 POST /api/v1/uploads/initiate -> pre-signed S3 URL + ... Stream file directly to S3 + POST /api/v1/uploads/complete -> video_asset_id + 3. POST /api/v1/jobs/batch with all asset IDs + 4. Print job IDs + Web UI monitoring link +``` + +The batch endpoint is the only addition required. Everything else — upload, processing, storage — is reused unchanged. The worker queue absorbs the bulk ingestion spike naturally. + +CLI script handles idempotency: before uploading, it checks whether a video_asset already exists for the same filename and file size. If yes, it skips the upload and reuses the existing asset ID. This makes re-running the script on the same folder safe. + +--- + +### DB Choice: PostgreSQL + +PostgreSQL. The job state machine requires ACID transactions — when two workers both try to claim the same queued job, only one must win. Postgres achieves this with `SELECT FOR UPDATE SKIP LOCKED`, a single atomic operation. Without this, two workers process the same video simultaneously, producing duplicate outputs and wasting AI API credits. MongoDB could do this with transactions but adds complexity without benefit here. The data model is relational (User → VideoAsset → Job → Artifact → Highlight), and foreign keys enforce integrity automatically. 
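The `SELECT FOR UPDATE SKIP LOCKED` claim described above can be expressed as one atomic statement. A minimal JDBC sketch, assuming the `jobs` table from the schema below; the oldest-first `ORDER BY` and UUID primary keys are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;
import java.util.UUID;

public class JobClaimer {

    // Atomic claim: lock exactly one queued job, skip rows another worker has
    // already locked, and flip it to processing in the same statement.
    static final String CLAIM_SQL = """
        UPDATE jobs SET status = 'processing', started_at = now()
        WHERE id = (SELECT id FROM jobs
                    WHERE status = 'queued'
                    ORDER BY created_at
                    LIMIT 1
                    FOR UPDATE SKIP LOCKED)
        RETURNING id
        """;

    public static Optional<UUID> claimNext(Connection db) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(CLAIM_SQL);
             ResultSet rs = ps.executeQuery()) {
            // Empty result set: every queued job was already claimed by a peer.
            return rs.next() ? Optional.of(rs.getObject(1, UUID.class))
                             : Optional.empty();
        }
    }
}
```

If two workers race on the same row, the loser's subquery skips the locked row instead of blocking, so each queued job is handed to exactly one worker.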
+ +Scaling path: read replicas for polling queries at v1. Partition `job_events` by `created_at` when log volume becomes large. + +--- + +### Schema (Tables) + +**users** +- id, email, password_hash, created_at + +**video_assets** +- id, user_id (FK → users), filename, storage_key (unique), size_bytes, duration_sec, mime_type, status, created_at +- status values: uploaded | processing | ready | error + +**jobs** +- id, user_id (FK → users), video_asset_id (FK → video_assets), status, attempt_count, max_attempts, idempotency_key (unique), error_message, started_at, completed_at, created_at +- status values: queued | processing | success | failed | cancelled + +**job_events** ← append-only progress log +- id, job_id (FK → jobs), event_type, message, metadata, created_at +- event_type examples: upload_confirmed, transcription_started, transcription_done, summary_generated, clips_extracted, failed, retrying + +**artifacts** +- id, job_id (FK → jobs), user_id (FK → users), artifact_type, storage_key, filename, size_bytes, metadata, created_at +- artifact_type values: summary_md | transcript | clip | screenshot + +**highlights** +- id, job_id (FK → jobs), timestamp_sec, end_timestamp_sec, title, description, clip_artifact_id (FK → artifacts), screenshot_artifact_id (FK → artifacts), sort_order, created_at + +Migration order: users → video_assets → jobs → job_events → artifacts → highlights + +--- + +### Key Constraints / Indexes + +- `video_assets.storage_key` unique — no duplicate S3 keys +- `jobs.idempotency_key` unique — prevents duplicate job creation for the same video +- Index on `jobs(user_id)` — jobs list page queries by user +- Index on `jobs(status)` filtered to queued only — worker polls this constantly, must be small and fast +- Index on `jobs(started_at)` filtered to processing — stuck job detection cron scans this +- Index on `artifacts(job_id)` — results page loads all artifacts for one job +- Index on `highlights(job_id, sort_order)` — ordered highlight 
list per job +- Index on `job_events(job_id, created_at)` — event log timeline per job +- FK: `jobs.user_id` → `users.id` +- FK: `jobs.video_asset_id` → `video_assets.id` +- FK: `artifacts.job_id` → `jobs.id` with CASCADE DELETE +- FK: `highlights.clip_artifact_id` → `artifacts.id` + +--- + +### Job Lifecycle + +``` +queued --> processing --> success + | + └--> failed --> queued (explicit user retry, only allowed from failed) + +cancelled (terminal state, reachable from queued or processing) +``` + +**Idempotency:** `idempotency_key = SHA256(user_id + ":" + video_asset_id)`. When a job creation request arrives, the API attempts an INSERT. If the key already exists, it returns the existing job instead of creating a duplicate. Safe for retried HTTP requests and CLI re-runs. + +**Worker claim (atomic):** Worker updates `status = processing` with a WHERE clause requiring `status = queued`. If zero rows are updated, another worker already claimed the job — skip it. This prevents two workers processing the same video. + +**Retry limit:** Max 3 attempts. Each failure increments `attempt_count`. At attempt 3, status becomes `failed` permanently. User must explicitly retry. + +**Stuck job recovery:** A scheduled task runs every 5 minutes. Any job stuck in `processing` for more than 30 minutes has its status reset to `queued` and `attempt_count` incremented. This handles crashed workers. Without this, a worker crash permanently loses that job — the user sees infinite "processing" with no result. + +**Checkpoint-based resumption:** Worker checks `job_events` before each pipeline stage. If `transcription_done` event already exists, it skips re-running transcription. If a retry happens after partial completion, only the remaining stages run. Partial outputs from successful stages are preserved across retries. 
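The idempotency key derivation above is a pure function and can be pinned down directly. Hex encoding of the digest is an assumption; any stable encoding works with the unique column:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class IdempotencyKey {

    // SHA256(user_id + ":" + video_asset_id), hex-encoded. Deterministic, so a
    // retried HTTP request or a CLI re-run derives the same key and the unique
    // index turns the second INSERT into "return the existing job".
    public static String forJob(String userId, String videoAssetId) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha256.digest(
                (userId + ":" + videoAssetId).getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required by the JDK spec", e);
        }
    }

    public static void main(String[] args) {
        String first = forJob("user-1", "asset-9");
        String retry = forJob("user-1", "asset-9");
        System.out.println(first.equals(retry)); // true: same request, same key
        System.out.println(first.length());      // 64 hex chars
    }
}
```

The 64-character hex digest fits a `VARCHAR(64)` unique column, and the `":"` separator prevents ambiguous concatenations of adjacent IDs.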
+ +--- + +### Storage Layout + Safe Downloads + +``` +bucket/ + uploads/ + {user_id}/{video_asset_id}/original.mp4 + + outputs/ + {user_id}/{job_id}/summary.md + {user_id}/{job_id}/transcript.txt + {user_id}/{job_id}/clips/ + highlight_01_t120s.mp4 + highlight_02_t1840s.mp4 + {user_id}/{job_id}/screenshots/ + highlight_01_t120s.jpg + highlight_02_t1840s.jpg +``` + +The `user_id` prefix enables IAM policy scoping per user and makes GDPR deletion simple — delete the prefix. The `job_id` subfolder for outputs means retries get a fresh folder and never collide with partial outputs from a previous failed run. + +**Safe downloads:** The S3 bucket is fully private. No file URL is ever exposed directly to the client. When a download is requested, the API verifies that `artifact.user_id` matches the authenticated user (prevents IDOR), then generates a pre-signed S3 URL with a 15-minute TTL. The client downloads directly from S3 using that URL. The API server is never in the download data path. + +**Direct uploads:** For files over 10MB, the API returns a pre-signed S3 multipart upload URL. The client streams the file directly to S3, bypassing the API server entirely. After upload completes, the client confirms with the API, which verifies the S3 object exists before accepting. + --- @@ -44,7 +201,131 @@ You need to put your solution here. **Your Solution for problem 2:** -You need to put your solution here. 
+--- + +### System Design + +``` +[Client] + | + v +[API Service - Spring Boot] + | + |-----> [PostgreSQL] users, accounts, personas, drafts, schedules, logs + | + |-----> [LinkedIn OAuth 2.0] external: token exchange + | + |-----> [GenAI Service] internal: draft generation (separate service boundary) + | + |-----> [Quartz Scheduler] persistent scheduler, survives server restarts + | + v + [Worker / Posting Job] + | + |-----> [LinkedIn API] publish post using decrypted token +``` + +**OAuth flow:** + +User clicks "Connect LinkedIn" → API generates a state token (stored in DB for CSRF validation) → redirects user to LinkedIn authorization URL. LinkedIn redirects back to the callback endpoint with an authorization code. API validates the state token, exchanges the code for access and refresh tokens via LinkedIn's token endpoint, encrypts both tokens immediately using AES-256-GCM, and stores the encrypted bytes in the database. The plaintext token never touches the database. + +**Token refresh flow:** + +Before any LinkedIn API call, the worker checks whether the access token expires within the next 5 minutes. If so, it calls LinkedIn's refresh endpoint, re-encrypts the new token pair, and updates the stored record. If refresh fails (user revoked access), the account status is set to `reauth_required` and the user is notified. No posting is attempted. + +**Draft generation flow:** + +Client sends persona_id and topic to the API. API loads the active prompt version from the `prompt_versions` table, calls the GenAI service with the prompt plus persona context, receives 3 draft objects back (one per style), and stores them as Draft rows with a reference to the prompt_version used. This reference allows rollback analysis — if a prompt version produced poor drafts, every affected draft can be identified. + +**Scheduling and posting flow:** + +User approves a draft and creates a schedule. A Quartz scheduler job is registered with a trigger at `scheduled_at` (UTC). 
When the trigger fires, Quartz enqueues a posting job. The worker decrypts the access token, posts to LinkedIn, and writes the result to `post_attempts`. Success or failure is recorded regardless of outcome. + +--- + +### Schema (Tables) + +**users** +- id, email, password_hash, created_at + +**linkedin_accounts** +- id, user_id (FK → users), linkedin_member_id (unique), encrypted_access_token, encrypted_refresh_token, token_iv, access_token_expires_at, scopes, status, connected_at, last_used_at +- status values: active | reauth_required | revoked + +**personas** +- id, user_id (FK → users), display_name, background, tone, language, dos (array), donts (array), is_active, created_at, updated_at + +**drafts** +- id, user_id (FK → users), persona_id (FK → personas), topic, topic_context, style_variant, content, prompt_version_id (FK → prompt_versions), status, approved_at, created_at +- style_variant values: concise_insight | story_based | actionable_checklist +- status values: pending | approved | rejected | scheduled | posted + +**schedules** +- id, user_id (FK → users), draft_id (FK → drafts, unique), linkedin_account_id (FK → linkedin_accounts), scheduled_at (UTC), timezone, status, quartz_job_key (unique), created_at +- status values: pending | posted | failed | cancelled +- draft_id is unique: one draft can only have one schedule + +**post_attempts** +- id, schedule_id (FK → schedules), draft_id (FK → drafts), user_id (FK → users), linkedin_post_id, attempt_number, status, http_status_code, error_message, attempted_at +- status values: success | failed | rate_limited + +**prompt_versions** +- id, feature_key, version (unique per feature_key), prompt_text, config (model, temperature, max_tokens), is_active, deployed_at, created_by, notes, created_at + +Migration order: users → linkedin_accounts → personas → prompt_versions → drafts → schedules → post_attempts + +--- + +### Key Constraints / Indexes + +- `linkedin_accounts.linkedin_member_id` unique — one account 
record per LinkedIn member +- `schedules.draft_id` unique — prevents scheduling the same draft twice +- `schedules.quartz_job_key` unique — prevents duplicate Quartz jobs +- `(prompt_versions.feature_key, prompt_versions.version)` unique — no duplicate versions per feature +- Index on `schedules(scheduled_at)` filtered to `status = pending` — scheduler polls this frequently +- Index on `drafts(user_id, created_at DESC)` — drafts list page +- Index on `post_attempts(schedule_id)` — audit log and retry history per schedule +- Index on `prompt_versions(feature_key)` filtered to `is_active = true` — fast active prompt lookup +- FK: `drafts.persona_id` → `personas.id` +- FK: `schedules.draft_id` → `drafts.id` +- FK: `post_attempts.schedule_id` → `schedules.id` + +--- + +### Security: Token Encryption + +Tokens are encrypted at the application layer using AES-256-GCM before being written to the database. A random 12-byte IV is generated per token and stored alongside the encrypted bytes. The encryption key comes from an environment variable (or AWS KMS at v1) and never touches the database. + +On read: load encrypted bytes and IV from DB → fetch decryption key from env/KMS → decrypt in application memory → use for the API call → discard. The decrypted token is never logged, never returned to the client, and never stored anywhere except in-memory for the duration of the operation. + +Scopes requested from LinkedIn: `r_liteprofile`, `r_emailaddress`, `w_member_social` only. No company admin scopes, no full profile scopes. + +--- + +### Reliability: Double-Post Prevention, Retries, Rate Limiting + +**Double-post prevention (the hard case):** The obvious protection is `schedules.draft_id UNIQUE` — a draft can only be scheduled once. But there is a race condition that this alone does not solve: the worker posts to LinkedIn successfully, then crashes before marking `schedule.status = posted`. On restart, it would try to post again. 
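One way to close this window is a pre-send check against `post_attempts`. A minimal JDBC sketch, assuming the table and status values from the schema above (numeric IDs are an assumption); this is a guard, not a full exactly-once guarantee:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class DoublePostGuard {

    // True if any earlier attempt for this schedule already reached LinkedIn
    // successfully. Table and status values follow the schema in this section.
    static final String SQL =
        "SELECT EXISTS (SELECT 1 FROM post_attempts"
        + " WHERE schedule_id = ? AND status = 'success')";

    public static boolean alreadyPosted(Connection db, long scheduleId) throws SQLException {
        try (PreparedStatement ps = db.prepareStatement(SQL)) {
            ps.setLong(1, scheduleId);
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                // true -> skip the send entirely and log a dedup event instead
                return rs.getBoolean(1);
            }
        }
    }
}
```

Because `post_attempts` is written before the schedule status flip, this check survives a worker crash between those two steps.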
+ +The fix: before every posting attempt, the worker queries `post_attempts` for the current `schedule_id`. If any row has `status = success`, the worker aborts with a no-op and logs a dedup event. This check catches the post-then-crash scenario. The `linkedin_post_id` returned by LinkedIn is stored in `post_attempts`, so even if the check is somehow bypassed, the duplicate post ID would be detectable. + +**Retry with backoff:** On failure or rate limiting, Quartz reschedules the posting job with exponential backoff: 5 min → 15 min → 60 min. After 3 failures, `schedule.status = failed`. Each attempt writes a new row to `post_attempts` — full audit trail preserved. + +**LinkedIn rate limits:** Worker checks how many posts this user has made today via `post_attempts` count before posting. If the daily limit is approached, the job is delayed to the next available window rather than failed outright. + +--- + +### Prompt / Config Storage Proposal + +**Decision: Hybrid — Git as review gate, database as runtime source of truth.** + +The `prompt_versions` table stores all prompt versions with versioning metadata. GenAI team submits prompts via a Git PR → CI validates the schema → on merge, a migration script upserts the new version into the database with `is_active = false`. The backend team then activates it by flipping `is_active = true` on the new version and `false` on the old one. + +Why not Git-only: rollback requires a new deploy. A database flag flip is instant and zero-downtime. + +Why not database-only: prompts need peer review before production. Git enforces that review process. + +Every draft stores `prompt_version_id` as a foreign key. This means every draft is traceable to the exact prompt that generated it. If a prompt version produces poor output quality, all affected drafts can be identified and the version rolled back by flipping its `is_active` flag. + --- @@ -62,7 +343,146 @@ You need to put your solution here. 
**Your Solution for problem 3:** -You need to put your solution here. +--- + + +### System Design + +``` +[Client] + | + v +[API Service - Spring Boot] + | + |-----> [PostgreSQL] template metadata, fields, runs, row results + | + |-----> [S3 Object Storage] template files, CSVs, generated docs, ZIPs + | + |-----> [Redis Queue] bulk job queue + | + v + [Worker Service] + | + |-----> [Template Parser] Apache POI reads DOCX, extracts field placeholders + |-----> [Field Extractor] LLM-assisted field type inference + |-----> [Doc Generator] Apache POI fills template per row → DOCX + |-----> [PDF Converter] LibreOffice headless converts DOCX to PDF + |-----> [ZIP Bundler] assembles successful outputs + report CSV +``` + +**Template ingestion:** User uploads a DOCX file. The API stores it in S3 and creates a TemplateVersion record. The worker parses the document using Apache POI to identify placeholder fields (e.g. `{{candidate_name}}`), then calls the LLM to infer field types and whether fields are required. The result is stored as TemplateField rows. The user reviews detected fields, corrects any errors, and confirms — at which point the template status becomes `ready`. + +**Single generation:** User selects a ready template, fills a form with field values, submits. The API validates required fields, calls the worker synchronously (or via a lightweight queue), which fills the template and converts to DOCX/PDF. A signed S3 URL is returned for download. + +**Bulk generation:** User uploads a CSV where each row contains values for one document. The API validates the CSV header matches template fields, stores the CSV in S3, creates a BulkRun record, and enqueues the job. The worker processes each row independently. A failure on row 47 does not stop rows 48 onward. When all rows are done, the worker builds a ZIP from successful outputs and generates a per-row report CSV showing success or failure with error reasons for each row. 
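The per-row independence described above can be sketched as a plain loop; `generate` is a hypothetical stand-in for the template-fill plus PDF-conversion step:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class BulkRowLoop {

    public record RowResult(int rowNumber, boolean success, String error) {}

    // Processes every row independently: an exception on row N is recorded
    // and the loop moves on to row N+1, so a failure on row 47 never blocks
    // row 48. `generate` is a hypothetical stand-in for the POI fill and
    // PDF conversion.
    public static List<RowResult> run(List<Map<String, String>> rows,
                                      Function<Map<String, String>, String> generate) {
        List<RowResult> results = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            try {
                generate.apply(rows.get(i));
                results.add(new RowResult(i + 1, true, null));
            } catch (Exception e) {
                results.add(new RowResult(i + 1, false, e.getMessage()));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("name", "Alice"),
            Map.<String, String>of(),   // missing required field -> this row fails
            Map.of("name", "Bob"));
        run(rows, r -> {
            if (!r.containsKey("name")) throw new IllegalArgumentException("missing field: name");
            return "doc for " + r.get("name");
        }).forEach(System.out::println);
    }
}
```

In the real worker each `RowResult` becomes an update to the corresponding `bulk_rows` row, and the collected failures feed the per-row report CSV.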
+ +--- + +### Schema (Tables) + +**users** +- id, email, password_hash, created_at + +**templates** +- id, user_id (FK → users), name, description, status, created_at, updated_at +- status values: draft | fields_detected | ready | archived + +**template_versions** +- id, template_id (FK → templates), version_number, storage_key (unique), size_bytes, is_active, uploaded_at +- Each re-upload creates a new version. Old versions are never mutated. + +**template_fields** +- id, template_version_id (FK → template_versions), field_key (unique per version), display_label, field_type, is_required, default_value, validation_regex, sort_order +- field_type values: text | number | date | currency | boolean + +**bulk_runs** +- id, user_id (FK → users), template_version_id (FK → template_versions), status, total_rows, success_count, failure_count, csv_storage_key, zip_storage_key, report_storage_key, idempotency_key (unique), started_at, completed_at, created_at +- status values: queued | processing | partial | success | failed + +**bulk_rows** +- id, bulk_run_id (FK → bulk_runs), row_number, input_data (JSON field values), status, output_filename, artifact_id (FK → artifacts), error_message, processed_at +- status values: pending | processing | success | failed + +**artifacts** +- id, user_id (FK → users), bulk_run_id (FK → bulk_runs, nullable — null for single-generate), artifact_type, storage_key (unique), filename, size_bytes, created_at +- artifact_type values: docx | pdf | zip | report_csv + +**job_events** +- id, bulk_run_id (FK → bulk_runs), event_type, message, metadata, created_at + +Migration order: users → templates → template_versions → template_fields → bulk_runs → bulk_rows → artifacts → job_events + +--- + +### Key Constraints / Indexes + +- `template_versions.storage_key` unique — no duplicate S3 keys +- `(template_fields.template_version_id, template_fields.field_key)` unique — no duplicate fields per version +- `bulk_runs.idempotency_key` unique — prevents 
duplicate run creation from same CSV upload +- `(bulk_rows.bulk_run_id, bulk_rows.row_number)` unique — no duplicate row numbers per run +- `artifacts.storage_key` unique +- Index on `templates(user_id)` — template list page +- Index on `template_versions(template_id)` filtered to `is_active = true` — active version lookup +- Index on `bulk_runs(user_id, created_at DESC)` — run history page +- Index on `bulk_runs(status)` filtered to `queued` — worker poll +- Index on `bulk_rows(bulk_run_id, row_number)` — progress display and ordered processing +- Index on `bulk_rows(bulk_run_id)` filtered to `status = failed` — retry-failed-only queries + +--- + +### Bulk Storage Strategy + +``` +bucket/ + templates/ + {user_id}/{template_id}/v{n}/template.docx immutable per version + + bulk-inputs/ + {user_id}/{bulk_run_id}/input.csv kept 30 days for debug/rerun + + bulk-outputs/ + {user_id}/{bulk_run_id}/docs/ + Alice_OfferLetter_2025-01-15.pdf + Bob_OfferLetter_2025-01-15.pdf + {user_id}/{bulk_run_id}/bundle.zip + {user_id}/{bulk_run_id}/report.csv + + single-outputs/ + {user_id}/single/{artifact_id}/output.pdf +``` + +Cleanup policy: + +| Asset | Retention | Trigger | +|---|---|---| +| Template DOCX | Until user deletes template | Manual | +| Bulk CSV input | 30 days after run completes | Nightly job | +| Individual generated docs (pre-ZIP) | Deleted after ZIP verified | Post-bundle step | +| ZIP + report CSV | 7 days | Nightly job | +| Single-generate outputs | 24 hours | Nightly job | + +--- + +### Reliability: Partial Success, Resumable Runs, Per-Row Retries + +Each row in `bulk_rows` is an independent unit of work. The worker processes rows in `row_number` order and updates each row's status immediately after processing. A failure on any row is recorded with an `error_message` and processing continues to the next row. The final run status is `partial` if there are any failures alongside successes, `success` if all rows succeeded, `failed` if every row failed. 
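The final-status rule above is small enough to pin down as a pure function; mapping the empty run (zero rows) to `success` is an assumption the text does not specify:

```java
public class RunStatus {

    // Derives bulk_runs.status from per-row outcomes, per the rule above:
    // any failures alongside successes -> partial, all succeeded -> success,
    // every row failed -> failed. An empty run (0/0) maps to success here,
    // which is an assumption.
    public static String finalStatus(int successCount, int failureCount) {
        if (failureCount == 0) return "success";
        if (successCount == 0) return "failed";
        return "partial";
    }

    public static void main(String[] args) {
        System.out.println(finalStatus(50, 0));  // success
        System.out.println(finalStatus(0, 50));  // failed
        System.out.println(finalStatus(47, 3));  // partial
    }
}
```

Keeping this as one function of the two counters means the worker can recompute the run status idempotently after every row, including after a crash-restart resume.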
+ +If the worker crashes mid-run and restarts, it queries `bulk_rows WHERE bulk_run_id = ? AND status = pending` — this automatically skips rows already processed. The run is resumable without any additional state management. + +The `POST /api/v1/bulk-runs/{id}/retry-failed` endpoint re-queues only rows with `status = failed`, resetting them to `pending`. Successful rows are never re-processed. This allows users to fix data errors in a subset of rows and retry just those rows. + +`bulk_runs.idempotency_key = SHA256(user_id + template_version_id + csv_storage_key)` — re-submitting the exact same CSV for the same template returns the existing run rather than creating a duplicate. + +--- + +### Security: Tenant Isolation, Safe Downloads, Path Traversal + +Every database query for templates and bulk runs is scoped with `WHERE user_id = authenticated_user_id`. No user can reference another user's template or run by ID, even if they guess the UUID. S3 storage keys are also prefixed by `user_id`, so even a leaked key would be blocked by the API ownership check. + +Output filenames are generated server-side from field values. All user-provided field data passes through a sanitizer that keeps only `[a-zA-Z0-9_\-]` characters before being used in a filename. No user-controlled string ever reaches the filesystem or S3 key construction directly. This prevents path traversal attacks. + +Safe downloads follow the same pre-signed URL pattern as Problem 1 — ownership checked, 15-minute TTL, bucket is private. + --- @@ -80,7 +500,179 @@ You need to put your solution here. **Your Solution for problem 4:** -You need to put your solution here. 
+--- + + +### System Design + +``` +[Client] + | + v +[API Service - Spring Boot] + | + |-----> [PostgreSQL] characters, relationships, episodes, scenes, jobs + | + |-----> [S3 Object Storage] character images, generated audio, scene renders, final video + | + |-----> [Redis Queue] episode generation jobs + | + v + [Episode Pipeline Worker] + | + Stage 1: Scene Breakdown (LLM) + Stage 2: Dialogue Generation (LLM + series bible snapshot) + Stage 3: Asset Plan (LLM → image/audio prompts) + Stage 4: Image Generation (DALL-E / Stable Diffusion) + Stage 5: Voiceover Generation (TTS per character voice) + Stage 6: Render Plan Assembly (structured JSON output) + Stage 7: [Optional] Video Render (FFmpeg assembles final video) +``` + +**The core challenge is character consistency.** The LLM that generates episode 3 has no memory of episodes 1 and 2. It also has no guarantee that the character definitions in the database are the same as they were when episode 1 was generated — users may have edited traits since then. Both of these problems are solved by the series bible snapshot. + +At the moment an episode generation job starts, the worker reads all characters and relationships referenced for that episode from the database and assembles a structured JSON object called the series bible snapshot. This snapshot is stored in `episodes.series_bible_snapshot` and also embedded verbatim in the system prompt of every LLM call throughout the pipeline. 
This means: + +- All LLM calls in the pipeline use exactly the same character data +- The episode is fully reproducible even if characters are edited later +- Character changes never silently affect already-generated episodes +- Debugging is straightforward — the snapshot is stored and inspectable + +--- + +### Schema (Tables) + +**users** +- id, email, password_hash, created_at + +**characters** +- id, user_id (FK → users), name, age, personality_traits (array), speaking_style, visual_notes, behavior_rules (array), reference_image_key (S3), content_hash, is_active, created_at, updated_at + +**relationships** +- id, user_id (FK → users), character_a_id (FK → characters), character_b_id (FK → characters), relationship_type, description, is_bidirectional, created_at +- relationship_type values: friend | rival | mentor | parent_child | colleague +- character_a_id must differ from character_b_id +- (user_id, character_a_id, character_b_id) is unique + +**episodes** +- id, user_id (FK → users), title, story_prompt, episode_goal, style, target_duration_sec, language, format, series_bible_snapshot (JSON), status, created_at +- style values: comedy | motivational | drama | slice_of_life +- format values: 16:9 | 9:16 | 1:1 +- status values: draft | queued | processing | package_ready | render_queued | rendered | failed + +**episode_characters** (junction) +- episode_id (FK → episodes), character_id (FK → characters) +- PK: (episode_id, character_id) + +**scenes** +- id, episode_id (FK → episodes), scene_number, location, summary, duration_sec, dialogues (JSON array of character_id + line + emotion), created_at +- (episode_id, scene_number) is unique + +**assets** +- id, user_id (FK → users), episode_id (FK → episodes, nullable), scene_id (FK → scenes, nullable), asset_type, storage_key, content_hash, generation_prompt, metadata, created_at +- asset_type values: character_image | background_image | voice_line | music_cue | scene_render | final_video + +**render_jobs** +- id, 
episode_id (FK → episodes), user_id (FK → users), status, attempt_count, render_plan (JSON), output_asset_id (FK → assets), error_message, started_at, completed_at, created_at +- status values: queued | processing | success | failed + +**job_events** +- id, episode_id (FK → episodes), event_type, message, metadata, created_at +- event_type examples: scene_breakdown_done | dialogues_done | asset_plan_done | images_generated | render_started | render_done | failed + +Migration order: users → characters → relationships → episodes → episode_characters → scenes → assets → render_jobs → job_events + +--- + +### Key Constraints / Indexes + +- `(relationships.user_id, relationships.character_a_id, relationships.character_b_id)` unique — no duplicate relationship pairs +- `(scenes.episode_id, scenes.scene_number)` unique — ordered scenes per episode +- Index on `characters(user_id)` — character library page +- Index on `relationships(character_a_id)` and `relationships(character_b_id)` — lookup relationships involving a character from either direction +- Index on `episodes(user_id, created_at DESC)` — episode history page +- Index on `scenes(episode_id, scene_number)` — ordered scene list per episode +- Index on `assets(content_hash)` — deduplication check before generating an asset +- Index on `assets(scene_id)` — asset gallery per scene +- Index on `render_jobs(status)` filtered to `queued` — render worker poll +- FK: `episode_characters.episode_id` → `episodes.id` with CASCADE DELETE +- FK: `scenes.episode_id` → `episodes.id` with CASCADE DELETE + +--- + +### Consistency Data: What Gets Persisted + +The series bible snapshot stored in `episodes.series_bible_snapshot` contains: + +``` +{ + "captured_at": "timestamp", + "characters": [ + { + "id": "uuid", + "name": "Maya", + "personality_traits": ["curious", "warm", "occasionally anxious"], + "speaking_style": "asks clarifying questions, uses 'actually' often", + "visual_notes": "short curly hair, always wears yellow", + 
"behavior_rules": ["never aggressive", "overthinks decisions aloud"] + } + ], + "relationships": [ + { + "character_a": "Maya", + "character_b": "Raj", + "type": "mentor", + "description": "Raj mentors Maya but respects her independence" + } + ] +} +``` + +This snapshot is embedded into every LLM call's system prompt during episode generation. It is stored permanently with the episode record. Editing a character after episode generation does not affect any existing episode. + +**Asset deduplication:** Before generating any image or audio asset, the worker hashes the generation prompt and checks `assets.content_hash` for an existing match belonging to this user. If a matching asset exists, the existing S3 key is reused — no duplicate generation call is made, no duplicate storage cost. + +--- + +### Storage Layout + +``` +bucket/ + characters/ + {user_id}/{character_id}/reference.jpg + + episodes/ + {user_id}/{episode_id}/ + assets/ + scene_01_background.jpg + scene_01_maya_voice.mp3 + scene_02_background.jpg + final_video.mp4 + render_plan.json + script.md +``` + +--- + +### Security + Cost Controls + +Quotas enforced at the API layer before job creation: + +| Resource | MVP Limit | Reason | +|---|---|---| +| Characters per user | 20 | Prompt size grows with character count | +| Episodes per day | 5 | AI API cost per episode | +| Render jobs per day | 2 | Compute cost per render | +| Scenes per episode | 12 | Pacing control for 5-minute target | +| Character image size | 5 MB | Storage cost | + +A global concurrency semaphore limits how many render jobs run simultaneously across all users. If the semaphore is full, new render jobs queue instead of starting. This prevents cost spikes from concurrent renders. + +Character reference images are validated for file size (max 5 MB) and MIME type (jpeg/png only) before the pre-signed upload URL is issued. The validation happens before any file reaches S3. + +All queries are scoped by `user_id`. 
Episode and character IDs are UUID v4 — not guessable by enumeration. Ownership is verified server-side before every read or write operation.

---

## Problem 5: Cross-Cutting

Answer briefly for the whole platform:

**Your Answer for problem 5:**

---

### 1. Multi-Tenancy

User-level tenancy for MVP. Every resource table carries a `user_id` column as a non-nullable foreign key. This column appears in every query's WHERE clause at the service layer — it is never optional. No resource is addressable without going through an ownership check.

Workspace-level tenancy is a v1 concern. When added, a `workspaces` table and a `workspace_members` junction table with roles would be introduced. Resource tables would gain a `workspace_id` FK alongside `user_id`. Queries would filter by workspace membership rather than direct user ownership.

Row-Level Security (RLS) in Postgres can be added at v1 as a second layer of enforcement. The application-layer check is the primary gate; RLS is defense in depth for the case where application code has a bug.

---

### 2. AuthZ Model

MVP uses simple ownership-based access control. Every resource has an owner (`user_id`). The service layer verifies `resource.user_id == JWT subject` before any read, write, or delete operation.

JWT access tokens are stateless (HS256, 1-hour TTL). Refresh tokens are stored hashed in a `user_sessions` table and are revocable. Refresh tokens are delivered via HttpOnly cookie — never in localStorage, which is accessible to JavaScript and therefore vulnerable to XSS.

Two enforcement points:
- API layer (primary): ownership check in service methods before any DB operation
- DB layer (secondary, v1): Postgres RLS policies as a backstop

Admin access is a boolean `is_admin` flag on the `users` table, checked only for internal tooling endpoints. No RBAC at MVP.

---

### 3. Observability

Every inbound HTTP request is assigned a `correlation_id` (UUID, generated if not present in the `X-Correlation-ID` header). This ID is propagated to all downstream calls — Redis queue messages, worker log lines, AI service HTTP headers, and every `job_events` row. It is returned in the HTTP response header so the frontend can include it in bug reports.

Structured JSON log format for every job-related event:

```
{
  "timestamp": "...",
  "level": "INFO",
  "correlation_id": "...",
  "job_id": "...",
  "user_id": "...",
  "event": "transcription_started",
  "duration_ms": 4200
}
```

`user_id` is logged as an internal identifier. Email, name, and other PII never appear in log lines.

Metrics exposed via Micrometer to Prometheus:

| Metric | Type | Purpose |
|---|---|---|
| jobs_created_total | Counter | Throughput tracking |
| jobs_duration_seconds | Histogram | P95 completion time |
| jobs_failed_total (by reason) | Counter | Failure categorization |
| queue_depth | Gauge | Backpressure detection |
| ai_api_latency_seconds | Histogram | External dependency health |
| bulk_rows_processed_total | Counter | Bulk job throughput |

Alert triggers: failure rate above 5 jobs/min, queue depth above 500, AI API P95 latency above 30 seconds.

---

### 4. Data Retention

| Data | Retention | Mechanism |
|---|---|---|
| Raw video uploads | 7 days after job success | Nightly cleanup job |
| Video processing artifacts | 30 days | Nightly cleanup job |
| Job event logs (DB) | 90 days | Archival job |
| Bulk CSV inputs | 30 days after run completes | Nightly cleanup job |
| Generated individual docs (pre-ZIP) | Deleted after ZIP verified | Post-bundle worker step |
| ZIP + report CSV | 7 days | Nightly cleanup job |
| Single-generate outputs | 24 hours | Nightly cleanup job |
| LinkedIn post_attempts | 1 year | Annual archival (compliance) |
| Character + episode data | Until user deletes | Manual only |
| User account (GDPR deletion) | Immediate hard delete + S3 prefix wipe | User-triggered |

The nightly cleanup job always deletes from S3 first, and removes the DB row only after the S3 deletion is confirmed. This order ensures no S3 object is ever orphaned: if the DB delete fails, the row survives and the next nightly run retries the idempotent deletion. The brief window in which a row references an already-deleted object is acceptable, since the data has expired anyway.

---

### 5. Secrets and Compliance

**Encryption:** LinkedIn OAuth tokens are encrypted with AES-256-GCM at the application layer before storage. A random 12-byte IV is generated per token. The encryption key lives in environment variables at MVP and moves to AWS KMS (envelope encryption) at v1. Key rotation at v1 does not require redeploying the application.

**Key storage path:**
- MVP: environment variable → application-layer encryption/decryption
- v1: AWS KMS → GenerateDataKey → encrypt data with the data key → encrypt the data key with the CMK → store both encrypted values

**PII fields:** `users.email`, `personas.background`, `linkedin_accounts.*`, `bulk_rows.input_data` (may contain names, addresses from CSV).
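One low-effort way to keep these fields out of logs is a key whitelist at the logging boundary: only known-safe keys ever reach a structured log line, so a newly added PII field cannot leak by default. A sketch, assuming the structured log fields shown under Observability (`LogSanitizer` is a hypothetical name, not an existing component):

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical log-field whitelist; a sketch, not part of the design above.
class LogSanitizer {
    // Only the keys of the structured log format are allowed through.
    private static final Set<String> SAFE_KEYS = Set.of(
            "timestamp", "level", "correlation_id",
            "job_id", "user_id", "event", "duration_ms");

    // Unknown keys (email, persona text, CSV row data, ...) are dropped
    // silently rather than redacted, keeping log lines small.
    static Map<String, Object> sanitize(Map<String, Object> fields) {
        Map<String, Object> safe = new TreeMap<>();
        fields.forEach((k, v) -> {
            if (SAFE_KEYS.contains(k)) {
                safe.put(k, v);
            }
        });
        return safe;
    }
}
```

A whitelist is preferable to a blacklist here: forgetting to list a new safe field costs one missing log attribute, while forgetting to blacklist a new PII field would leak it.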
PII rules:
- PII never appears in log lines
- PII is never stored in `job_events.metadata`
- `bulk_rows.input_data` is subject to the 90-day retention policy and is scrubbed at that point
- GDPR deletion: a hard delete on the `users` table cascades to all linked rows, followed by an async S3 prefix deletion job

**API secrets** (OpenAI key, LinkedIn client secret): environment variables only. Never committed to Git. Rotated via the CI pipeline's secrets manager.

**Transport:** all endpoints are TLS-only, with the HSTS header enforced. Internal service-to-service communication also uses TLS.

---
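The application-layer AES-256-GCM scheme described under Secrets and Compliance might look like the following in outline. This is a minimal sketch under stated assumptions: `TokenCrypto` and the IV-prefixed blob layout are illustrative choices, and key loading from the environment, error handling, and DB storage are elided:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;

// Sketch only: key sourcing and persistence are out of scope here.
class TokenCrypto {
    private static final int IV_BYTES = 12;   // random IV per token, as stated above
    private static final int TAG_BITS = 128;  // GCM authentication tag length
    private final SecretKey key;              // MVP: decoded from an environment variable
    private final SecureRandom rng = new SecureRandom();

    TokenCrypto(SecretKey key) {
        this.key = key;
    }

    // Illustrative blob layout: IV || ciphertext+tag, stored in one column.
    byte[] encrypt(String token) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        rng.nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = c.doFinal(token.getBytes(StandardCharsets.UTF_8));
        byte[] blob = new byte[IV_BYTES + ct.length];
        System.arraycopy(iv, 0, blob, 0, IV_BYTES);
        System.arraycopy(ct, 0, blob, IV_BYTES, ct.length);
        return blob;
    }

    String decrypt(byte[] blob) throws Exception {
        byte[] iv = Arrays.copyOfRange(blob, 0, IV_BYTES);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] pt = c.doFinal(Arrays.copyOfRange(blob, IV_BYTES, blob.length));
        return new String(pt, StandardCharsets.UTF_8);
    }
}
```

Prefixing the IV to the ciphertext keeps each stored token self-contained in a single column and makes the per-token random IV automatic; GCM's tag then authenticates the whole blob on decryption, so a tampered token fails closed.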