diff --git a/Backend.md b/Backend.md
index 4f7ff501..14bd2faf 100644
--- a/Backend.md
+++ b/Backend.md
@@ -15,7 +15,7 @@
 **Goal:** Upload video → async processing → outputs: transcript, **Summary.md**, highlights (timestamps), screenshot/clip references.
 [READ MORE ABOUT THE PROJECT](./Video-summary-platform.md)
 
-**Your solution must include**
+**My solution includes**
 
 * **System design:** API + worker(s) + storage + external AI/transcription boundaries
 * **DB choice:** Postgres vs other (short justification)
@@ -24,9 +24,58 @@
 * **Job lifecycle:** queued → processing → success/failed (+ retry/idempotency rule)
 * **Storage layout:** how files/artifacts are stored + safe download strategy
 
-**Your Solution for problem 1:**
+**My Solution for problem 1:**
+
+## System Design
+The API Service (Spring Boot) handles video upload, job creation, and artifact retrieval.
+The Worker Service processes jobs asynchronously via a queue (Redis/RabbitMQ).
+PostgreSQL stores metadata.
+Object storage (S3-compatible) stores videos and generated outputs.
+External AI services handle transcription and summarization.
+
+The design supports batch folder processing, chunked upload, and streaming for large files (200 MB+, 3–4 hours).
+
+Flow:
+1. Upload video → create VideoAsset + Job (QUEUED)
+2. Worker picks up the job → PROCESSING → calls AI services
+3. Worker stores transcript, summary, and highlights → SUCCESS/FAILED
+
+## Database Choice
+PostgreSQL, for strong relational modeling, indexing on job status, and JSONB columns for AI metadata.
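The flow above is effectively a small state machine with a retry rule. A minimal sketch, assuming the QUEUED → PROCESSING → SUCCESS | FAILED transitions and the max-3-retries rule described in this solution (`JobLifecycle` and its method names are illustrative, not the actual implementation):

```java
import java.util.EnumSet;
import java.util.Set;

class JobLifecycle {
    enum Status { QUEUED, PROCESSING, SUCCESS, FAILED }

    // Legal transitions of the job state machine.
    static Set<Status> nextStates(Status current) {
        switch (current) {
            case QUEUED:     return EnumSet.of(Status.PROCESSING);
            case PROCESSING: return EnumSet.of(Status.SUCCESS, Status.FAILED);
            default:         return EnumSet.noneOf(Status.class); // terminal states
        }
    }

    static boolean canTransition(Status from, Status to) {
        return nextStates(from).contains(to);
    }

    // Retry rule: a FAILED job may be re-enqueued at most 3 times.
    static boolean canRetry(Status status, int retryCount) {
        return status == Status.FAILED && retryCount < 3;
    }
}
```

Centralizing the transition check like this lets both the API and the worker reject illegal updates (e.g., SUCCESS → PROCESSING) before touching the database.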
+
+## Schema (Tables)
+User(id, email, created_at)
+VideoAsset(id, user_id, storage_path, duration, created_at)
+Job(id, video_id, status, retry_count, idempotency_key, created_at)
+JobEvent(id, job_id, status, message, created_at)
+Artifact(id, job_id, type, storage_path, created_at)
+Highlight(id, job_id, timestamp_start, timestamp_end, text)
+
+## Constraints & Indexes
+FK: VideoAsset.user_id → User.id
+FK: Job.video_id → VideoAsset.id
+Index on Job(status)
+Unique index on Job.idempotency_key
+Index on Artifact.job_id
+
+## Job Lifecycle
+QUEUED → PROCESSING → SUCCESS | FAILED
+At most 3 retries, each guarded by an idempotency check.
+
+## Storage Strategy
+/tenant/{userId}/videos/{videoId}.mp4
+/tenant/{userId}/jobs/{jobId}/summary.md
+
+Secure downloads via pre-signed URLs issued after auth validation.
+
+## Security
+User ownership validation and private object storage.
+
+## Reliability
+Job state machine, JobEvent logging, retries with backoff.
+
+## Cost & Scalability
+MVP: single worker → v1: horizontal workers + S3 + queue partitioning.
-You need to put your solution here.
 
 ---
 
@@ -34,7 +83,7 @@ You need to put your solution here.
 **Goal:** Connect LinkedIn → store persona → generate drafts (handled by GenAI team) → approve → schedule → auto-post + audit logs.
 [READ MORE ABOUT THE PROJECT](./linkedin-automation.md)
 
-**Your solution must include**
+**My solution includes**
 
 * **System design:** OAuth flow, token storage/refresh, scheduler/worker design
 * **Schema:** User, LinkedInAccount, Persona, Draft, Schedule, PostAttempt/PostLog
@@ -42,9 +91,115 @@
 * **Reliability:** retry posting, dedupe to prevent double-post, rate limiting
 * **Prompt/config storage proposal:** how backend stores “prompt versions” or “config packs” provided by GenAI team (DB vs repo vs hybrid, versioning + rollback)
 
-**Your Solution for problem 2:**
-
-You need to put your solution here.
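The reliability bullets above call for dedupe to prevent double-posting. One way to derive such a key — a hedged sketch, assuming a dedupe key computed as a hash over the draft id and the scheduled time (`DedupeKey` is a hypothetical name):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class DedupeKey {
    // Deterministic SHA-256 over (draft_id, scheduled_time): retrying the same
    // schedule produces the same key, so a unique index on the key rejects a
    // second insert and the post cannot go out twice.
    static String of(long draftId, String scheduledTimeIso) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(
                (draftId + "|" + scheduledTimeIso).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

The key is computed before enqueueing, so even two racing workers collide on the same database row rather than both posting.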
+**My Solution for problem 2:**
+
+## System Design
+The user connects LinkedIn via OAuth → the backend stores encrypted access + refresh tokens.
+The user creates a Persona (tone, topics, do/don’t rules).
+The GenAI service generates 3 drafts → stored as Draft records.
+The user approves one draft → creates a Schedule entry.
+The Scheduler service polls due schedules → sends a job to a worker.
+The Worker posts to the LinkedIn API → stores the result in PostLog.
+
+Components:
+- API Service (Spring Boot): OAuth, persona CRUD, draft approval, scheduling
+- Scheduler (cron/queue based): finds due posts using the scheduled_time index
+- Worker Service: posts to LinkedIn, handles retries and token refresh
+- PostgreSQL: metadata storage
+- Redis/Queue: async posting jobs
+
+Flow:
+1. OAuth connect → store encrypted tokens
+2. Create persona → request draft generation (GenAI boundary)
+3. Save drafts → user approves one
+4. Create Schedule (PENDING)
+5. Scheduler → enqueue job when scheduled_time is reached
+6. Worker → post to LinkedIn → update status to POSTED/FAILED → write PostLog
+
+## Schema
+User(id, email)
+
+LinkedInAccount(
+  id,
+  user_id,
+  access_token_enc,
+  refresh_token_enc,
+  expires_at,
+  created_at
+)
+
+Persona(
+  id,
+  user_id,
+  tone,
+  topics,
+  do_dont_rules,
+  created_at
+)
+
+Draft(
+  id,
+  user_id,
+  persona_id,
+  content,
+  status, -- DRAFT | APPROVED | REJECTED
+  created_at
+)
+
+Schedule(
+  id,
+  draft_id,
+  scheduled_time,
+  status, -- PENDING | POSTED | FAILED
+  dedupe_key,
+  created_at
+)
+
+PostLog(
+  id,
+  schedule_id,
+  status,
+  linkedin_post_id,
+  response,
+  created_at
+)
+
+PromptConfig(
+  id,
+  version,
+  content,
+  created_at
+)
+
+## Constraints & Indexes
+Unique index on LinkedInAccount.user_id
+Index on Schedule.scheduled_time for scheduler polling
+Unique index on Schedule.dedupe_key to prevent double posting
+FK: Draft.persona_id → Persona.id
+FK: Schedule.draft_id → Draft.id
+FK: PostLog.schedule_id → Schedule.id
+
+## Security
+OAuth with least-privilege scopes.
+Access and refresh tokens are encrypted with AES-256.
+Tokens are decrypted only inside the worker at posting time.
+User ownership validation on personas, drafts, and schedules.
+
+## Reliability
+Dedupe key = hash(draft_id + scheduled_time) to avoid double posting.
+Retries with exponential backoff for transient LinkedIn failures.
+Automatic token refresh using the refresh_token before expiry.
+Per-user rate limiting to respect LinkedIn API limits.
+PostLog keeps a full audit trail.
+
+## Prompt / Config Storage
+A PromptConfig table stores versioned prompt templates from the GenAI team.
+Each draft records the prompt version used → enables rollback and reproducibility.
+
+## Cost & Scalability
+MVP: single scheduler + single worker.
+v1: horizontally scalable workers, queue partitioning, and delayed job queues.
+Use the scheduled_time index to avoid full table scans.
 
 ---
 
@@ -52,7 +207,7 @@ You need to put your solution here.
 **Goal:** Upload DOCX template → detect fields → single fill export → bulk fill via CSV/Sheet → ZIP + per-row report.
 [READ MORE ABOUT THE PROJECT](./docs-template-output-generation.md)
 
-**Your solution must include**
+**My solution includes**
 
 * **System design:** template ingestion, field extraction service, bulk job worker, export service
 * **Schema:** Template, TemplateVersion, TemplateField, BulkRun, BulkRow, Artifact, JobEvent
@@ -60,9 +215,31 @@
 * **Reliability:** partial success handling, per-row status, retries, resumable bulk run
 * **Security:** template isolation per tenant/user, safe downloads, anti-path traversal
 
-**Your Solution for problem 3:**
+**My Solution for problem 3:**
+
+## System Design
+Template upload → extract fields → store TemplateVersion + TemplateField records.
+CSV upload → BulkRun → worker processes rows → generates DOCX/PDF → ZIP export.
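The "extract fields" step above can be sketched minimally. This assumes a `{{field_name}}` placeholder syntax and plain text input — both assumptions; a real extractor would parse the DOCX XML, and `FieldExtractor` is a hypothetical name:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldExtractor {
    // Matches {{ field_name }} tokens; one TemplateField row per distinct name.
    private static final Pattern FIELD =
        Pattern.compile("\\{\\{\\s*([A-Za-z0-9_]+)\\s*\\}\\}");

    // Returns distinct field names in order of first appearance.
    static Set<String> extract(String templateText) {
        Set<String> fields = new LinkedHashSet<>();
        Matcher m = FIELD.matcher(templateText);
        while (m.find()) fields.add(m.group(1));
        return fields;
    }
}
```

The extracted set becomes the TemplateField rows for the new TemplateVersion, and later doubles as the expected CSV header for bulk runs.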
+
+## Schema
+Template(id, user_id, name)
+TemplateVersion(id, template_id, storage_path, created_at)
+TemplateField(id, template_version_id, field_name)
+BulkRun(id, template_version_id, status, total_rows, processed_rows)
+BulkRow(id, bulk_run_id, status, output_artifact_id)
+Artifact(id, storage_path, type)
+JobEvent(id, bulk_run_id, message)
+
+## Storage Strategy
+CSV inputs, generated documents, and ZIPs are stored in object storage.
+Temporary files are deleted after ZIP creation.
+
+## Reliability
+Per-row status; runs are resumable via processed_rows; partial success is supported.
+
+## Security
+Tenant isolation via user_id, signed URLs, and path traversal prevention.
-You need to put your solution here.
 
 ---
 
@@ -70,7 +247,7 @@ You need to put your solution here.
 **Goal:** Define characters once (image + traits + relationships). For each episode story → output episode package (script/scenes/assets plan/render plan), optionally render.
 [READ MORE ABOUT THE PROJECT](./char-based-video-generation.md)
 
-**Your solution must include**
+**My solution includes**
 
 * **System design:** episodic pipeline as jobs, asset management, consistency strategy storage
 * **Schema:** Character, Relationship, Episode, Scene, Asset, RenderJob, Artifact
@@ -78,9 +255,30 @@
 * **Storage:** images/audio/video assets, versioning, dedupe strategy
 * **Security + cost controls:** quotas, rate limits, large asset constraints
 
-**Your Solution for problem 4:**
+**My Solution for problem 4:**
+
+## System Design
+Characters are defined once, with traits and assets.
+Episode pipeline: Episode → scenes → assets → render jobs, processed asynchronously.
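The fan-out step of the pipeline above can be sketched as one episode expanding into queued render jobs, one per scene, which workers then pick up asynchronously. Names (`EpisodePipeline`, `enqueueScenes`) are illustrative, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

class EpisodePipeline {
    // Minimal stand-in for a RenderJob row: episode, scene, and status.
    static final class RenderJob {
        final long episodeId;
        final long sceneId;
        String status = "QUEUED";

        RenderJob(long episodeId, long sceneId) {
            this.episodeId = episodeId;
            this.sceneId = sceneId;
        }
    }

    // Expand an episode into one queued render job per scene.
    static List<RenderJob> enqueueScenes(long episodeId, List<Long> sceneIds) {
        List<RenderJob> jobs = new ArrayList<>();
        for (long sceneId : sceneIds) {
            jobs.add(new RenderJob(episodeId, sceneId));
        }
        return jobs;
    }
}
```

Per-scene jobs keep failures isolated: one failed scene retries on its own without re-rendering the whole episode.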
+
+## Schema
+Character(id, user_id, name, traits, voice_id, appearance_ref)
+Relationship(id, character_a_id, character_b_id, type)
+Episode(id, user_id, title, character_snapshot_version)
+Scene(id, episode_id, script_text)
+Asset(id, scene_id, type, storage_path, hash)
+RenderJob(id, episode_id, status, retry_count)
+Artifact(id, render_job_id, storage_path)
+
+## Consistency
+A character snapshot version is stored per episode to maintain continuity.
+
+## Storage
+Versioned assets with hash-based deduplication and per-user quotas.
+
+## Security & Cost
+Rate limits on render jobs, file size limits, and per-user storage quotas.
-You need to put your solution here.
 
 ## Problem 5: Cross-Cutting
 
@@ -92,6 +290,25 @@ Answer briefly for the whole platform:
 
 4. **Data retention:** what to delete and when (inputs, artifacts, logs)
 5. **Secrets & compliance:** token encryption, key management approach, PII handling
 
-**Your Answer for problem 5:**
+**My Answer for problem 5:**
+
+## Multi-Tenancy
+User-level tenancy, with user_id present in all tables.
+
+## AuthZ Model
+RBAC (USER, ADMIN) enforced at the API layer and in query filters.
+
+## Observability
+Logs include job_id, correlation_id, and status transitions.
+Metrics: job latency, failure rate, queue depth.
+
+## Data Retention
+Raw inputs: 30 days.
+Logs: 14 days.
+Artifacts: user-controlled deletion.
+
+## Secrets & Compliance
+Tokens encrypted with AES-256.
+Keys stored in an environment/secret manager.
+Minimal PII storage.
-You need to put your solution here.
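The retention rules above (raw inputs 30 days, logs 14 days, artifacts only on user request) reduce to one predicate a cleanup job can apply. A minimal sketch; the kind names and `RetentionPolicy` are illustrative:

```java
import java.time.Duration;

class RetentionPolicy {
    // Decide whether an object of the given kind and age is due for deletion.
    static boolean shouldDelete(String kind, Duration age) {
        switch (kind) {
            case "raw_input": return age.toDays() >= 30;
            case "log":       return age.toDays() >= 14;
            case "artifact":  return false; // user-controlled deletion only
            default:          return false; // unknown kinds: keep, fail safe
        }
    }
}
```

A scheduled cleanup worker would sweep object storage and logs with this predicate, so the retention windows live in one place instead of being scattered across services.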