From aa8507b4d5448b6405e9653f1cbe00376eebb5c0 Mon Sep 17 00:00:00 2001 From: Mohmed Husain Date: Sat, 28 Feb 2026 13:44:44 +0530 Subject: [PATCH] All questions ans --- Backend.md | 314 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 309 insertions(+), 5 deletions(-) diff --git a/Backend.md b/Backend.md index 4f7ff501..bb8b28ca 100644 --- a/Backend.md +++ b/Backend.md @@ -26,7 +26,70 @@ **Your Solution for problem 1:** -You need to put your solution here. +**Database & Storage Selection** + +PostgreSQL for all metadata — it gives us ACID transactions (critical for job state integrity), JSONB for flexible metadata fields, strong indexing, and mature migration tooling (Flyway/Alembic). For MVP it scales vertically; for v1 we add read replicas and PgBouncer. + +S3 (or MinIO self-hosted for MVP) for blob storage — videos, transcripts, clips, screenshots, and Summary.md files. Objects are stored under a predictable key layout: +`uploads/{user_id}/{video_id}/original.mp4` and `artifacts/{job_id}/transcript.json|summary.md|clips/|screenshots/`. + +**Schema (tables, key constraints, indexes)** + +- **User** — `id (PK, UUID)`, email (UNIQUE), password_hash, display_name, created_at +- **VideoAsset** — `id (PK)`, `user_id (FK → User, CASCADE)`, original_filename, `storage_key (UNIQUE)`, mime_type, size_bytes, duration_secs, metadata (JSONB), uploaded_at. *Index on user_id.* +- **Job** — `id (PK)`, `video_asset_id (FK → VideoAsset, CASCADE)`, `idempotency_key (UNIQUE)` — prevents duplicate jobs for the same video, status (CHECK: queued/processing/success/failed), stage (CHECK: transcription/summarization/asset_extraction/complete), attempt_count, max_attempts (default 3), locked_by, locked_at, error_message, timestamps. *Index on (status, stage) for worker polling.* +- **JobEvent** — `id (PK)`, `job_id (FK → Job, CASCADE)`, event_type, from_status, to_status, message, metadata (JSONB), created_at. 
*Index on job_id and created_at for audit queries.* +- **Artifact** — `id (PK)`, `job_id (FK → Job, CASCADE)`, artifact_type (CHECK: transcript/summary_md/clip/screenshot), `storage_key (UNIQUE)`, mime_type, size_bytes, created_at. *Index on job_id.* +- **Highlight** — `id (PK)`, `job_id (FK → Job, CASCADE)`, title, description, start_ts_secs, end_ts_secs, `clip_artifact_id (FK → Artifact, SET NULL)`, `screenshot_artifact_id (FK → Artifact, SET NULL)`, sort_order. *Index on (job_id, sort_order).* + +*Migration order:* User → VideoAsset → Job → JobEvent → Artifact → Highlight (follows FK dependencies). + +**Job Lifecycle & Reliability** + +The job follows: **queued → processing → success/failed**. Internally, the job tracks the current `stage` (transcription → summarization → asset_extraction → complete). Each stage completion publishes a message for the next stage, so if a failure happens mid-pipeline, retry resumes from the failed stage — not from scratch. + +- **Idempotency:** The `idempotency_key` (hash of user_id + video storage_key) prevents duplicate job creation at the API level. +- **Worker locking:** Workers claim jobs using `SELECT ... FOR UPDATE SKIP LOCKED` — no double-processing. +- **Retries:** On transient failure, the worker increments attempt_count, logs a JobEvent, and re-enqueues with exponential backoff. After max_attempts (3), the job moves to `failed`. +- **Stale lock recovery:** A cron reaper reclaims jobs where locked_at is older than 10 minutes. +- **Observability:** Every state transition is recorded in JobEvent with timestamps and worker IDs. We log correlation_ids through the queue for end-to-end tracing. + +**Security** + +- All S3 buckets are private. Uploads and downloads use **pre-signed URLs** (15-min expiry) so the API never proxies large files. +- Storage keys are server-generated UUIDs — users never control path components (prevents path traversal). 
+- API authenticates via JWT; every request checks `user_id` ownership before returning data. +- Content-Disposition header is set on downloads to force the original filename. + +**Cost & Scalability (MVP → v1)** + +- *MVP:* Single API instance, 2–3 workers, one Postgres, MinIO. Handles tens of concurrent jobs. +- *v1:* Horizontal worker autoscaling based on queue depth, Postgres read replicas, move to managed S3, add CDN (CloudFront) for frequently accessed artifacts. + +**System Design** + +``` +┌──────────┐ ┌──────────────┐ ┌───────────────────┐ +│ Client │──────▶│ API Server │──────▶│ Message Queue │ +│ │◀──────│ (REST) │ │ (Redis/RabbitMQ) │ +└──────────┘ └──────┬───────┘ └────────┬──────────┘ + │ │ + ▼ ▼ + ┌──────────────┐ ┌──────────────────┐ + │ PostgreSQL │ │ Worker Pool │ + │ (metadata) │ │ • Transcription │ + └──────────────┘ │ • Summarization │ + │ │ • Asset Extractor│ + ┌──────────────┐ └────────┬─────────┘ + │ S3 / MinIO │ │ + │ (blob store) │◀───────────────┘ + └──────────────┘ ┌──────────────────┐ + │ External AI │ + │ (Whisper, GPT) │ + └──────────────────┘ +``` + +**Flow:** Client uploads video via pre-signed URL → API creates Job (queued) → Queue dispatches to Transcription Worker → worker calls Whisper API, stores transcript in S3, advances stage → Summarization Worker calls LLM, produces Summary.md + Highlights → Asset Extraction Worker uses ffmpeg to cut clips/screenshots at highlight timestamps → Job marked success → Client polls status and downloads artifacts via pre-signed URLs. --- @@ -44,7 +107,70 @@ You need to put your solution here. **Your Solution for problem 2:** -You need to put your solution here. +**Database & Storage** + +PostgreSQL — relational integrity between users, accounts, personas, drafts, and schedules is essential. Token encryption is done at the application level before storage. No blob storage needed here (text-only data). 
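One flow worth pinning down early is the signed OAuth `state` check called out under Security. A minimal stdlib sketch of a signed, short-lived state value (the secret sourcing, TTL, and helper names are illustrative assumptions, not a prescribed implementation):

```python
import hashlib
import hmac
import secrets
import time

# Assumption: in production this secret comes from the secrets manager, never code.
STATE_SECRET = b"illustrative-server-side-secret"

def make_state(ttl_secs: int = 300) -> str:
    """Build a signed, short-lived OAuth state value: nonce.expiry.signature."""
    nonce = secrets.token_urlsafe(16)       # token_urlsafe never contains "."
    expires = str(int(time.time()) + ttl_secs)
    payload = f"{nonce}.{expires}"
    sig = hmac.new(STATE_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_state(state: str) -> bool:
    """Reject tampered or expired state values in constant time."""
    try:
        nonce, expires, sig = state.rsplit(".", 2)
    except ValueError:
        return False
    expected = hmac.new(
        STATE_SECRET, f"{nonce}.{expires}".encode(), hashlib.sha256
    ).hexdigest()
    return hmac.compare_digest(sig, expected) and int(expires) >= time.time()
```

Because the value is self-validating, the API server needs no session storage for the OAuth round trip.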
+ +**Schema** + +- **User** — `id (PK, UUID)`, email (UNIQUE), password_hash, display_name, created_at +- **LinkedInAccount** — `id (PK)`, `user_id (FK → User, CASCADE)`, linkedin_user_id, `access_token_enc (BYTEA)` — AES-256-GCM encrypted, refresh_token_enc (BYTEA), token_expires_at, scopes, is_active, timestamps. *UNIQUE(user_id, linkedin_user_id).* +- **Persona** — `id (PK)`, `user_id (FK → User, CASCADE)`, name, background, tone, language_style, dos, donts, extra_config (JSONB), is_default, timestamps. *Index on user_id.* +- **Draft** — `id (PK)`, `user_id (FK)`, `persona_id (FK → Persona, CASCADE)`, topic, context, style_variant (concise_insight / story_based / actionable_checklist), content (TEXT), status (CHECK: generated/approved/rejected/posted/failed), prompt_version_id (FK), generation_meta (JSONB), timestamps. *Index on (user_id, status).* +- **Schedule** — `id (PK)`, `draft_id (FK → Draft, CASCADE, UNIQUE)` — a draft can only be scheduled once (dedupe), `linkedin_account_id (FK)`, scheduled_at, timezone, status (CHECK: pending/enqueued/posted/failed/cancelled). *Partial index on (status, scheduled_at) WHERE status = 'pending' for scheduler queries.* +- **PostAttempt** — `id (PK)`, `schedule_id (FK → Schedule, CASCADE)`, attempt_number, linkedin_post_id, http_status, response_body, status (success/failed/rate_limited), attempted_at. *UNIQUE(schedule_id, attempt_number).* +- **PromptVersion** — `id (PK)`, name, version, prompt_template (TEXT), model_config (JSONB), is_active, changelog, created_at. *UNIQUE(name, version).* *Partial index on (name, is_active) WHERE is_active = true.* + +*Migration order:* User → LinkedInAccount → Persona → PromptVersion → Draft → Schedule → PostAttempt. + +**Security** + +- LinkedIn tokens are encrypted with **AES-256-GCM** before DB storage. The encryption key lives in a secrets manager (AWS KMS / Vault) — never in code. +- OAuth flow uses the `state` parameter (signed, short-lived) to prevent CSRF. 
+- Only least-privilege LinkedIn scopes are requested: `w_member_social` (post) and `r_liteprofile` (name). +- All queries enforce `WHERE user_id = :current_user` — users can only access their own accounts, personas, and drafts. + +**Reliability** + +- **Retry posting:** On 5xx or network timeout, retry up to 3 times with exponential backoff. Permanent errors (401/403) are not retried — user is notified to re-authorize. +- **Double-post prevention:** `draft_id UNIQUE` on Schedule means one draft = one schedule only. Post Worker checks for an existing `success` PostAttempt before calling LinkedIn. Scheduler uses `FOR UPDATE SKIP LOCKED` so two workers never pick the same job. +- **Rate limiting:** Worker respects LinkedIn's 429 `Retry-After` header. API also limits users to max 5 posts/day and 10 draft generations/hour. +- **Token refresh:** A background cron refreshes tokens 7 days before expiry. If refresh fails, `is_active` is set to false and the user is notified. + +**Prompt/Config Storage Proposal** + +Hybrid approach: the GenAI team maintains prompts in a **Git repo** (source of truth, reviewed via PRs). On merge, a CI step upserts the new version into the **PromptVersion DB table** and marks it active. Workers read the active version from the DB at runtime. Rollback is instant: flip `is_active` to a previous version via an admin API. Every Draft stores `prompt_version_id` so we can trace which prompt produced which output. + +**Cost & Scalability (MVP → v1)** + +- *MVP:* One API server, one scheduler cron (runs every minute), one post worker. Handles hundreds of users. +- *v1:* Separate draft-generation workers (slow, I/O-bound external LLM calls) from lightweight post workers. Queue-based autoscaling. Rate-limit-aware scheduling with priority lanes.
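The double-post guard described under Reliability (`draft_id UNIQUE` on Schedule) can be exercised end-to-end. A minimal sketch using sqlite3 as a stand-in for PostgreSQL (table shape abbreviated to the relevant columns):

```python
import sqlite3

# Stand-in for Postgres: the UNIQUE(draft_id) constraint from the Schedule table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE schedule (
        id INTEGER PRIMARY KEY,
        draft_id TEXT NOT NULL UNIQUE,  -- one draft = one schedule
        scheduled_at TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending'
    )
""")

conn.execute("INSERT INTO schedule (draft_id, scheduled_at) VALUES (?, ?)",
             ("draft-123", "2026-03-01T09:00:00Z"))

# A second attempt to schedule the same draft fails at the database level,
# even if an application-level check is bypassed or raced.
try:
    conn.execute("INSERT INTO schedule (draft_id, scheduled_at) VALUES (?, ?)",
                 ("draft-123", "2026-03-02T09:00:00Z"))
    duplicated = True
except sqlite3.IntegrityError:
    duplicated = False

print(duplicated)  # → False
```

The point of pushing the rule into a constraint is that it holds under concurrency: two racing workers cannot both insert, regardless of application logic.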
+ +**System Design** + +``` +┌──────────┐ ┌──────────────────┐ ┌──────────────┐ +│ Frontend │─────▶│ API Server │─────▶│ PostgreSQL │ +│ │◀─────│ (REST + OAuth) │ └──────────────┘ +└──────────┘ └───────┬──────────┘ + │ │ + │ OAuth redirect │ enqueue + ▼ ▼ +┌──────────────┐ ┌─────────────────────┐ +│ LinkedIn │ │ Task Queue │ +│ OAuth 2.0 │ │ (Redis + BullMQ) │ +└──────────────┘ └──────┬──────────────┘ + │ + ┌──────────┼───────────┐ + ▼ ▼ ▼ + ┌───────────┐ ┌─────────┐ ┌───────────┐ + │ Draft Gen │ │Scheduler│ │Post Worker│ + │ Worker │ │ (cron) │ │(publishes │ + │(calls AI) │ │ │ │to LinkedIn│ + └───────────┘ └─────────┘ └───────────┘ +``` + +**Flow:** User connects LinkedIn via OAuth → tokens encrypted and stored → User sets up Persona → Requests drafts → Draft Gen Worker calls GenAI with persona + topic + active PromptVersion, returns 3 variants → User approves one, schedules it → Scheduler cron picks up due schedules, enqueues publish jobs → Post Worker decrypts token, posts to LinkedIn API, logs result in PostAttempt. --- @@ -62,7 +188,75 @@ You need to put your solution here. **Your Solution for problem 3:** -You need to put your solution here. +**Database & Storage** + +PostgreSQL for metadata (templates, fields, bulk run tracking, per-row status). S3/MinIO for blob storage — uploaded DOCX templates, CSV inputs, generated documents, and ZIP bundles. + +Storage layout: `templates/{user_id}/{template_id}/v{N}/template.docx`, `inputs/{bulk_run_id}/input.csv`, `outputs/{bulk_run_id}/rows/001_Name.pdf`, `outputs/{bulk_run_id}/bundle.zip`. + +**Cleanup policy:** CSV inputs deleted after 7 days, generated docs after 30 days, ZIPs after 14 days (large, expensive). Automated via S3 lifecycle rules. + +**Schema** + +- **Template** — `id (PK)`, `user_id (FK → User, CASCADE)`, name, description, current_version, timestamps. 
*Index on user_id.* +- **TemplateVersion** — `id (PK)`, `template_id (FK → Template, CASCADE)`, version, `storage_key (UNIQUE)`, original_filename, file_hash (SHA-256), created_at. *UNIQUE(template_id, version).* +- **TemplateField** — `id (PK)`, `template_version_id (FK → TemplateVersion, CASCADE)`, field_key, field_type (CHECK: text/number/date/currency/boolean/optional_block), is_required, default_value, sort_order. *UNIQUE(template_version_id, field_key).* +- **BulkRun** — `id (PK)`, `template_version_id (FK)`, `user_id (FK)`, input_storage_key, total_rows, processed_rows, succeeded_rows, failed_rows, status (CHECK: pending/processing/completed/failed/cancelled), zip_storage_key, report_storage_key, timestamps. *Index on (user_id) and (status).* +- **BulkRow** — `id (PK)`, `bulk_run_id (FK → BulkRun, CASCADE)`, row_number, input_data (JSONB), status (CHECK: pending/processing/success/failed/skipped), error_message, output_storage_key, attempt_count. *UNIQUE(bulk_run_id, row_number). Index on (bulk_run_id, status).* +- **Artifact** — `id (PK)`, `user_id (FK)`, source_type (single_fill/bulk_row/bulk_zip), source_id, `storage_key (UNIQUE)`, filename, mime_type, size_bytes, created_at. +- **JobEvent** — `id (PK)`, entity_type, entity_id, event_type, message, metadata (JSONB), created_at. *Index on (entity_type, entity_id).* + +*Migration order:* User → Template → TemplateVersion → TemplateField → BulkRun → BulkRow → Artifact → JobEvent. + +**Reliability** + +- **Partial success:** Each row is processed independently. Failed rows don't stop the bulk run. Final status shows `succeeded_rows` vs `failed_rows` counts. +- **Per-row status:** Every BulkRow tracks its own status and error_message. The generation report (CSV) lists each row's outcome. +- **Retries:** Transient errors (PDF conversion timeout) retry up to 3 times. Validation errors (missing field) fail immediately — no retry. 
+- **Resumable runs:** If the worker crashes, it restarts by querying `WHERE status IN ('pending','processing')` on that BulkRun and resumes. Already-succeeded rows are skipped. +- **Idempotency:** `UNIQUE(bulk_run_id, row_number)` forces a retried row to upsert in place rather than insert again, so reprocessing never produces duplicate documents. +- **Concurrency:** Worker processes rows in batches of 10 to avoid overloading the PDF renderer. + +**Security** + +- All queries scoped to `user_id`. Templates and outputs are isolated per user via S3 key prefixes (`{user_id}/...`). +- Downloads via pre-signed S3 URLs (15-min expiry), Content-Disposition forces download. +- Storage keys are server-generated UUIDs — no user-controlled paths (anti-path-traversal). +- DOCX uploads are scanned for macros/VBA and rejected if found. Field values are escaped to prevent XML injection. +- Max file sizes enforced: DOCX 20MB, CSV 50MB. + +**Cost & Scalability (MVP → v1)** + +- *MVP:* Single API + one bulk worker. PDF conversion via LibreOffice headless (or Gotenberg). Handles hundreds of rows per run. +- *v1:* Pool of bulk workers with concurrency-limited row processing. Move PDF conversion to a dedicated microservice for better resource isolation. Add progress webhooks for real-time UI updates.
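The per-row status model is what makes both partial success and crash resume work. A minimal in-memory sketch, with table rows as dicts and `render_pdf` as a hypothetical fill-and-convert stage:

```python
# In-memory stand-in for the BulkRow table: per-row status drives both
# partial success and crash resume. render_pdf is a hypothetical stage.

def render_pdf(row):
    if "name" not in row["input_data"]:
        raise ValueError("missing required field: name")  # validation error: no retry
    return f"outputs/row_{row['row_number']:03d}.pdf"

def process_run(rows):
    """Process only rows not already succeeded, so a restarted worker resumes."""
    for row in rows:
        if row["status"] == "success":
            continue  # resume: skip work completed before the crash
        try:
            row["output_storage_key"] = render_pdf(row)
            row["status"] = "success"
        except ValueError as exc:
            row["status"] = "failed"  # a bad row never stops the run
            row["error_message"] = str(exc)
    return {
        "succeeded": sum(r["status"] == "success" for r in rows),
        "failed": sum(r["status"] == "failed" for r in rows),
    }

rows = [
    {"row_number": 1, "input_data": {"name": "Alice"}, "status": "success"},  # done pre-crash
    {"row_number": 2, "input_data": {"name": "Bob"}, "status": "pending"},
    {"row_number": 3, "input_data": {}, "status": "pending"},                 # invalid row
]
print(process_run(rows))  # → {'succeeded': 2, 'failed': 1}
```

Re-running `process_run` on the same rows is harmless: succeeded rows are skipped, which is exactly the resume behaviour described above.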
+ +**System Design** + +``` +┌──────────┐ ┌──────────────────┐ ┌──────────────┐ +│ Frontend │─────▶│ API Server │─────▶│ PostgreSQL │ +│ (form UI,│◀─────│ (REST API) │ └──────────────┘ +│ uploads)│ └───────┬──────────┘ +└──────────┘ │ + ┌──────────┼──────────────┐ + ▼ ▼ ▼ + ┌──────────────┐ ┌────────────┐ ┌──────────────────┐ + │ Template │ │ Export │ │ Bulk Job Worker │ + │ Ingestion │ │ Service │ │ (parallel rows, │ + │ (parse DOCX, │ │ (single │ │ ZIP builder, │ + │ detect │ │ fill + │ │ report gen) │ + │ fields) │ │ PDF) │ └──────────────────┘ + └──────────────┘ └────────────┘ │ + │ │ ▼ + ▼ ▼ ┌──────────────┐ + ┌────────────────────┐ │ PDF Converter│ + │ S3 / MinIO │ │ (LibreOffice │ + │ (templates, CSVs, │ │ headless) │ + │ outputs, ZIPs) │ └──────────────┘ + └────────────────────┘ +``` + +**Flow:** User uploads DOCX → Template Ingestion parses it, detects `{{field}}` placeholders, stores fields in DB → For single fill: user fills form → Export Service fills template + converts to PDF → returns download URL. For bulk fill: user uploads CSV → API creates BulkRun → Bulk Worker processes each row (fill + convert), tracks per-row status → builds ZIP + report → uploads to S3 → user downloads via pre-signed URL. --- @@ -80,7 +274,89 @@ You need to put your solution here. **Your Solution for problem 4:** -You need to put your solution here. +**Database & Storage** + +PostgreSQL for all structured data (characters, relationships, episodes, scenes, render jobs). S3/MinIO for all media assets — character reference images, generated scene images, voice lines, music, and rendered videos. + +Storage layout: `series/{series_id}/characters/{char_id}/reference_v{N}.png`, `series/{series_id}/episodes/{ep_id}/scenes/{scene_num}/scene_image.png|voice_{char}.mp3`, `series/{series_id}/episodes/{ep_id}/render/final.mp4`. 
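To keep every path component server-controlled, the key layout above can be centralized in one helper module. A minimal sketch (the function names are assumptions; the point is that IDs are typed UUIDs, so user-supplied strings and `../` traversal can never reach a key):

```python
from uuid import UUID

# Each builder accepts typed UUIDs, so callers cannot smuggle arbitrary
# path segments into the object key.

def character_ref_key(series_id: UUID, char_id: UUID, version: int) -> str:
    return f"series/{series_id}/characters/{char_id}/reference_v{version}.png"

def scene_image_key(series_id: UUID, ep_id: UUID, scene_num: int) -> str:
    return f"series/{series_id}/episodes/{ep_id}/scenes/{scene_num}/scene_image.png"

def render_key(series_id: UUID, ep_id: UUID) -> str:
    return f"series/{series_id}/episodes/{ep_id}/render/final.mp4"
```

Keeping the layout in one place also makes later changes (e.g. adding a version segment to renders) a single-file edit.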
+ +**Schema** + +- **Character** — `id (PK)`, `user_id (FK)`, series_id, name, personality (TEXT), speaking_style, visual_desc, `ref_image_key` (S3), voice_profile (JSONB — voice clone ID, pitch, speed), consistency_embedding (BYTEA — CLIP vector for visual consistency), extra_meta (JSONB), timestamps. *Index on series_id and user_id.* +- **Relationship** — `id (PK)`, series_id, `character_a_id (FK → Character, CASCADE)`, `character_b_id (FK → Character, CASCADE)`, rel_type (friend/rival/parent_child/mentor), description, is_bidirectional. *UNIQUE(character_a_id, character_b_id). CHECK(a ≠ b).* +- **Episode** — `id (PK)`, series_id, `user_id (FK)`, episode_number, title, story_prompt, style (CHECK: comedy/motivational/slice_of_life/drama/action), target_duration_secs (default 300), output_format (16:9/9:16/1:1), language, narration_ratio, status (CHECK: draft/generating/ready/rendering/complete/failed), timestamps. *UNIQUE(series_id, episode_number).* +- **EpisodeCharacter** — (episode_id, character_id) composite PK — tracks which characters appear in each episode. +- **Scene** — `id (PK)`, `episode_id (FK → Episode, CASCADE)`, scene_number, description, dialogue (JSONB array of {character_id, line, emotion}), narration, duration_secs, background_desc, music_cue. *UNIQUE(episode_id, scene_number).* +- **Asset** — `id (PK)`, episode_id (FK), scene_id (FK), character_id (FK), asset_type (CHECK: scene_image/character_image/voice_line/background_music/sound_effect/script_doc), `storage_key (UNIQUE)`, `content_hash` (SHA-256 for dedup), mime_type, size_bytes, generation_meta (JSONB), version. *Index on episode_id and content_hash.* +- **RenderJob** — `id (PK)`, `episode_id (FK → Episode, CASCADE)`, stage (CHECK: script_gen/asset_gen/video_render/complete), status (CHECK: queued/processing/success/failed), attempt_count, max_attempts, locked_by, locked_at, error_message, timestamps. 
*Index on (status, stage).* +- **Artifact** — `id (PK)`, `episode_id (FK)`, artifact_type (CHECK: final_video/script_pdf/storyboard/asset_bundle), `storage_key (UNIQUE)`, mime_type, size_bytes, created_at. + +*Migration order:* User → Character → Relationship → Episode → EpisodeCharacter → Scene → Asset → RenderJob → Artifact. + +**Consistency Data (Character Continuity Across Episodes)** + +This is the core challenge. We persist: +- **Reference image + CLIP embedding** — stored per character. Every image generation call receives the reference image and embedding as conditioning input (via IP-Adapter or similar) to maintain the same face/style. +- **Visual description** — detailed text ("short brown hair, blue hoodie, round glasses") appended to every scene image prompt. +- **Voice profile** — voice clone ID (ElevenLabs) ensures the same voice across all episodes. +- **Personality + speaking style** — injected into the LLM system prompt so dialogue stays in character. +- **Relationship graph** — fed to the LLM so characters interact correctly (mentor gives advice, rivals disagree). +- **Episode history** — short LLM-generated summaries of past episodes stored in DB, fed as context for new episodes to maintain narrative continuity. + +Before generating any episode, the system assembles a "character context pack" (all the above) and passes it to both the LLM and the image/voice generation models. + +**Versioning & Dedup** + +- Character reference images are versioned (v1, v2, ...). The Character row points to the current version. +- Every generated asset stores a `content_hash` (SHA-256). Before writing a new asset, the worker checks for an existing match — if found, it reuses that S3 object. Background music and sound effects benefit most from dedup. + +**Reliability** + +The episodic pipeline runs as sequential stages per episode: **script_gen → asset_gen → video_render → complete**. 
Each stage is a RenderJob with its own state machine (queued → processing → success/failed). Same retry/locking patterns as Problem 1 (SKIP LOCKED, exponential backoff, max 3 attempts, stale lock reaper). Episodes can be processed in parallel across different episodes. + +**Security + Cost Controls** + +- **Quotas:** Per-user limits (e.g., 5 series, 50 episodes/series, 20 characters/series) enforced at the API layer. +- **Rate limits:** Max 10 episode generations/hour per user. External AI calls tracked in a `usage_ledger` table (tokens × cost) with daily/monthly caps. +- **Large assets:** Max reference image 10MB, max rendered video 500MB. Render jobs timeout after 30 minutes. +- **Access control:** All resources scoped by user_id. Downloads via pre-signed URLs. + +**Cost & Scalability (MVP → v1)** + +- *MVP:* Single API, 1–2 workers covering all stages, one Postgres. Good for single-user / small team. +- *v1:* Separate worker pools per stage (script gen is LLM-heavy, asset gen is GPU-heavy, render is CPU-heavy). Queue-based autoscaling. Usage-based billing tied to the cost ledger. + +**System Design** + +``` +┌──────────┐ ┌──────────────────┐ ┌──────────────┐ +│ Frontend │─────▶│ API Server │─────▶│ PostgreSQL │ +│ (series │◀─────│ (REST API) │ └──────────────┘ +│ bible, │ └───────┬──────────┘ +│ episode │ │ +│ editor) │ ┌─────────┼──────────────┐ +└──────────┘ ▼ ▼ ▼ + ┌──────────┐ ┌────────────┐ ┌──────────────┐ + │ Script │ │ Asset Gen │ │ Render │ + │ Generator│ │ Worker │ │ Pipeline │ + │ (LLM → │ │ (image gen,│ │ (compose │ + │ script + │ │ voice gen, │ │ video, │ + │ scenes) │ │ music) │ │ export mp4) │ + └──────────┘ └────────────┘ └──────────────┘ + │ │ + ▼ ▼ + ┌──────────────────────────────────┐ + │ S3 / MinIO (blob store) │ + └──────────────────────────────────┘ + │ + ┌───────────┴────────────┐ + │ External AI Services │ + │ (GPT-4, DALL-E, │ + │ ElevenLabs, etc.) 
│ + └────────────────────────┘ +``` + +**Flow:** User defines characters with reference images and personality in a series bible → Creates an episode with a story prompt and selects characters → System assembles the "character context pack" → Script Generator (LLM) produces script + scene breakdown → Asset Gen Worker generates scene images (using character embeddings for consistency), voice lines (using cloned voices), and music cues → Render Pipeline composites everything into a ~5min video → Artifact stored in S3, episode marked complete. ## Problem 5: Cross-Cutting @@ -94,4 +370,32 @@ Answer briefly for the whole platform: **Your Answer for problem 5:** -You need to put your solution here. +**1. Multi-Tenancy** + +Workspace-level. Every resource belongs to a `workspace` (not directly to a user). A `workspace_member` join table links users to workspaces with a role. This way, for personal use it's one user = one workspace, but teams can share resources (templates, personas, series) without data duplication. All resource tables carry a `workspace_id` FK, and every query filters by `WHERE workspace_id = :current_workspace`. + +**2. AuthZ Model** + +RBAC with four roles: **owner** (full control + billing), **admin** (CRUD all resources, manage members), **member** (CRUD own resources, read shared), **viewer** (read-only). Enforced at two layers: (1) API middleware checks the user's role from `workspace_member` against the endpoint's required permission, (2) PostgreSQL Row-Level Security as a defense-in-depth backup — even a bug in the API can't leak data across workspaces. + +**3. Observability** + +Every job state transition is logged in a JobEvent table with `job_id`, `from_status`, `to_status`, `worker_id`, `timestamp`, and `duration_ms`. Every inbound API request gets a `correlation_id` (UUID) that propagates through queue messages and worker logs, enabling end-to-end tracing. 
Key metrics (Prometheus): `jobs_total{status, stage}`, `job_duration_seconds{stage}`, `queue_depth{queue_name}`, `external_api_latency{service}`, `http_request_duration{method, path}`. Logs are structured JSON shipped to a centralized store (ELK/Loki). + +**4. Data Retention** + +- Uploaded inputs (videos, CSVs, DOCX): deleted 30 days after job completion. +- Generated artifacts (transcripts, PDFs, clips): 60 days, then deleted. +- ZIP bundles: 14 days (large, expensive). +- Job events / audit logs: 90 days hot, archived for 1 year, then deleted. +- LinkedIn tokens: hard-deleted immediately on user disconnect. +- Deleted user accounts: 30-day soft-delete grace period, then hard-purged (GDPR). + +Automated via S3 lifecycle rules + a daily DB cleanup cron. + +**5. Secrets & Compliance** + +- **Token encryption:** AES-256-GCM envelope encryption — a Data Encryption Key (DEK) encrypts the token, the DEK itself is encrypted by a Key Encryption Key (KEK) in AWS KMS / Vault. KEK never leaves the HSM. +- **Key rotation:** DEK rotated every 90 days. A cron re-encrypts active tokens. Old DEKs retained (encrypted) to decrypt legacy data. +- **PII handling:** Emails, names, LinkedIn profiles are classified as PII. On account deletion, all PII is hard-deleted within 30 days. PII is never logged — emails are hashed in logs if needed. GDPR data export supported via an API endpoint. +- **Transport:** All traffic over TLS 1.2+. Internal services use mTLS or private VPC. S3 enforces SecureTransport policy.
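
As a closing illustration of the "PII is never logged" rule: a keyed digest lets log lines about the same user correlate without ever storing the address. A minimal stdlib sketch (key sourcing and the token format are illustrative assumptions):

```python
import hashlib
import hmac

# Assumption: this key would live in the secrets manager alongside the
# DEK/KEK material, not in code, and would rotate on its own schedule.
LOG_PII_KEY = b"illustrative-log-pseudonymization-key"

def pii_token(email: str) -> str:
    """Stable keyed digest: the same email always maps to the same token,
    so log lines can be joined per user, but the raw address never appears
    in logs and cannot be recovered without the key."""
    digest = hmac.new(LOG_PII_KEY, email.strip().lower().encode(), hashlib.sha256)
    return "pii_" + digest.hexdigest()[:16]
```

Using HMAC rather than a bare hash matters: without the key, an attacker with log access could confirm a guessed email by hashing it themselves.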