
**Goal:** Upload video → async processing → outputs: transcript, **Summary.md**, highlights (timestamps), screenshot/clip references. [READ MORE ABOUT THE PROJECT](./Video-summary-platform.md)

**My solution includes**

* **System design:** API + worker(s) + storage + external AI/transcription boundaries
* **DB choice:** Postgres vs other (short justification)
* **Job lifecycle:** queued → processing → success/failed (+ retry/idempotency rule)
* **Storage layout:** how files/artifacts are stored + safe download strategy

**My Solution for problem 1:**
## System Design
API Service (Spring Boot) handles video upload, job creation, and artifact retrieval.
Worker Service processes jobs asynchronously using a queue (Redis/RabbitMQ).
PostgreSQL stores metadata.
Object storage (S3-compatible) stores videos and generated outputs.
External AI services handle transcription and summarization.

Supports batch folder processing, chunked uploads, and streaming so that large files (200 MB+, 3–4-hour videos) never have to fit in memory.

Flow:
1. Upload video → create VideoAsset + Job (QUEUED)
2. Worker picks job → PROCESSING → calls AI
3. Stores transcript, summary, highlights → SUCCESS/FAILED

## Database Choice
PostgreSQL for strong relational modeling, indexing on job status, and JSONB for AI metadata.

## Schema (Tables)
User(id, email, created_at)
VideoAsset(id, user_id, storage_path, duration, created_at)
Job(id, video_id, status, retry_count, idempotency_key, created_at)
JobEvent(id, job_id, status, message, created_at)
Artifact(id, job_id, type, storage_path, created_at)
Highlight(id, job_id, timestamp_start, timestamp_end, text)

## Constraints & Indexes
FK: video_asset.user_id → user.id
FK: job.video_id → video_asset.id
Index on job(status)
Unique index on job.idempotency_key
Index on artifact.job_id

## Job Lifecycle
QUEUED → PROCESSING → SUCCESS | FAILED
A failed job is retried at most 3 times; the unique idempotency_key ensures a duplicate or retried upload request never creates a second job for the same video.
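
The lifecycle above can be sketched as a small state machine with a retry cap. This is a minimal in-memory illustration, not the actual service: the class, field, and method names are invented for the example, and a real worker would persist each transition (and a JobEvent row) to Postgres.

```java
import java.util.function.Supplier;

// Sketch of the QUEUED → PROCESSING → SUCCESS/FAILED lifecycle with a
// retry cap. All names are illustrative; real transitions would be
// persisted to the job/job_event tables.
public class JobLifecycle {
    enum Status { QUEUED, PROCESSING, SUCCESS, FAILED }
    static final int MAX_RETRIES = 3;

    static class Job {
        Status status = Status.QUEUED;
        int retryCount = 0;
    }

    // Runs the job, re-enqueueing on failure until the retry cap is hit.
    static Status run(Job job, Supplier<Boolean> process) {
        while (true) {
            job.status = Status.PROCESSING;
            if (process.get()) {
                job.status = Status.SUCCESS;
                return job.status;
            }
            job.retryCount++;
            if (job.retryCount >= MAX_RETRIES) {
                job.status = Status.FAILED;
                return job.status;
            }
            job.status = Status.QUEUED; // back to the queue for another attempt
        }
    }
}
```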

## Storage Strategy
/tenant/{userId}/videos/{videoId}.mp4
/tenant/{userId}/jobs/{jobId}/summary.md

Secure downloads via pre-signed URLs after auth validation.
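
A small sketch of how those object keys and the ownership check might fit together. The class and method names are invented for illustration; actual pre-signing would go through the S3 SDK after this check passes.

```java
import java.util.Optional;

// Illustrative helpers for the storage layout above. Names are
// assumptions; pre-signing itself (S3 SDK) is omitted.
public class StoragePaths {
    static String videoKey(long userId, long videoId) {
        return String.format("/tenant/%d/videos/%d.mp4", userId, videoId);
    }

    static String summaryKey(long userId, long jobId) {
        return String.format("/tenant/%d/jobs/%d/summary.md", userId, jobId);
    }

    // Only hand a key to the pre-signer when the requester owns the asset.
    static Optional<String> downloadKey(long requesterId, long ownerId, String key) {
        return requesterId == ownerId ? Optional.of(key) : Optional.empty();
    }
}
```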

## Security
User ownership validation and private object storage.

## Reliability
Job state machine, JobEvent logging, retries with backoff.

## Cost & Scalability
MVP single worker → v1 horizontal workers + S3 + queue partitioning.


---

## **Problem 2: LinkedIn Automation Platform (Backend Architecture)**

**Goal:** Connect LinkedIn → store persona → generate drafts (handled by GenAI team) → approve → schedule → auto-post + audit logs. [READ MORE ABOUT THE PROJECT](./linkedin-automation.md)

**My solution includes**

* **System design:** OAuth flow, token storage/refresh, scheduler/worker design
* **Schema:** User, LinkedInAccount, Persona, Draft, Schedule, PostAttempt/PostLog
* **Security:** encrypt tokens at rest, least-privilege scopes, access control
* **Reliability:** retry posting, dedupe to prevent double-post, rate limiting
* **Prompt/config storage proposal:** how backend stores “prompt versions” or “config packs” provided by GenAI team (DB vs repo vs hybrid, versioning + rollback)

**My Solution for problem 2:**

## System Design
User connects LinkedIn via OAuth → backend stores encrypted access + refresh tokens.
User creates Persona (tone, topics, do/don’t rules).
GenAI service generates 3 drafts → stored as Draft records.
User approves one draft → creates Schedule entry.
Scheduler service polls due schedules → sends job to worker.
Worker posts to LinkedIn API → stores result in PostLog.

Components:
- API Service (Spring Boot): OAuth, persona CRUD, draft approval, scheduling
- Scheduler (cron/queue based): finds due posts using scheduled_time index
- Worker Service: posts to LinkedIn, handles retries, token refresh
- PostgreSQL: metadata storage
- Redis/Queue: async posting jobs

Flow:
1. OAuth connect → store encrypted tokens
2. Create persona → request draft generation (GenAI boundary)
3. Save drafts → user approves one
4. Create Schedule (PENDING)
5. Scheduler → enqueue job when scheduled_time reached
6. Worker → post to LinkedIn → update status POSTED/FAILED → write PostLog
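
Step 5 — the scheduler finding due posts — can be sketched like this. It is a minimal in-memory stand-in with invented names: in Postgres the same filter is a single query served by the index on scheduled_time (e.g. `WHERE status = 'PENDING' AND scheduled_time <= now()`).

```java
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the scheduler's polling step. Illustrative only; the real
// scheduler would run this as an indexed SQL query, not in memory.
public class SchedulerPoll {
    record Schedule(long id, Instant scheduledTime, String status) {}

    // Due = still PENDING and scheduled_time has passed.
    static List<Schedule> due(List<Schedule> all, Instant now) {
        return all.stream()
                .filter(s -> "PENDING".equals(s.status()))
                .filter(s -> !s.scheduledTime().isAfter(now))
                .collect(Collectors.toList());
    }
}
```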

## Schema
User(id, email)

LinkedInAccount(
id,
user_id,
access_token_enc,
refresh_token_enc,
expires_at,
created_at
)

Persona(
id,
user_id,
tone,
topics,
do_dont_rules,
created_at
)

Draft(
id,
user_id,
persona_id,
prompt_config_id, -- which PromptConfig version produced this draft
content,
status, -- DRAFT | APPROVED | REJECTED
created_at
)

Schedule(
id,
draft_id,
scheduled_time,
status, -- PENDING | POSTED | FAILED
dedupe_key,
created_at
)

PostLog(
id,
schedule_id,
status,
linkedin_post_id,
response,
created_at
)

PromptConfig(
id,
version,
content,
created_at
)

## Constraints & Indexes
Unique index on LinkedInAccount.user_id
Index on Schedule.scheduled_time for scheduler polling
Unique index on Schedule.dedupe_key to prevent double posting
FK: Draft.persona_id → Persona.id
FK: Schedule.draft_id → Draft.id
FK: PostLog.schedule_id → Schedule.id

## Security
OAuth with least-privilege scopes.
Access and refresh tokens encrypted using AES-256.
Tokens decrypted only inside worker at posting time.
User ownership validation on persona, drafts, and schedules.
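
A minimal sketch of the AES-256 token encryption, using AES-GCM via the standard `javax.crypto` API. The class and method names are assumptions; in production the key comes from a secret manager rather than being generated inline, and the IV is stored alongside the ciphertext as shown.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.Base64;

// Sketch of AES-256-GCM encryption at rest for OAuth tokens.
// Key management (secret manager, rotation) is out of scope here.
public class TokenCrypto {
    static SecretKey newKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(256);
        return kg.generateKey();
    }

    // Returns base64(iv || ciphertext) so the IV travels with the token.
    static String encrypt(SecretKey key, String token) throws Exception {
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ct = c.doFinal(token.getBytes(StandardCharsets.UTF_8));
        byte[] out = new byte[iv.length + ct.length];
        System.arraycopy(iv, 0, out, 0, iv.length);
        System.arraycopy(ct, 0, out, iv.length, ct.length);
        return Base64.getEncoder().encodeToString(out);
    }

    static String decrypt(SecretKey key, String blob) throws Exception {
        byte[] in = Base64.getDecoder().decode(blob);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
                new GCMParameterSpec(128, Arrays.copyOfRange(in, 0, 12)));
        return new String(c.doFinal(Arrays.copyOfRange(in, 12, in.length)),
                StandardCharsets.UTF_8);
    }
}
```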

## Reliability
Dedupe key = hash(draft_id + scheduled_time) to avoid double posting.
Retry with exponential backoff for transient LinkedIn failures.
Automatic token refresh using refresh_token before expiry.
Rate limiting per user to respect LinkedIn API limits.
PostLog keeps full audit trail.
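
The dedupe key can be sketched as a stable hash of (draft_id, scheduled_time); the exact encoding below is an assumption. The unique index on Schedule.dedupe_key then makes a second insert for the same draft and slot fail at the database, which is what actually prevents the double post.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.time.Instant;

// Sketch of dedupe_key = hash(draft_id + scheduled_time).
// Deterministic, so a retried request computes the same key and the
// unique index rejects the duplicate row.
public class DedupeKey {
    static String of(long draftId, Instant scheduledTime) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        byte[] digest = sha.digest(
                (draftId + "|" + scheduledTime.toEpochMilli())
                        .getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }
}
```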

## Prompt / Config Storage
The PromptConfig table stores versioned prompt templates supplied by the GenAI team; keeping them in the DB (rather than only in a repo) lets versions change without a deploy.
Each draft records the prompt version used, which enables rollback to an earlier version and reproducibility of generated content.

## Cost & Scalability
MVP: single scheduler + single worker.
v1: horizontally scalable workers, queue partitioning, and delayed job queues.
Use scheduled_time index to avoid full table scans.

---

## **Problem 3: DOCX Template → Bulk DOCX/PDF Generator (Backend + Storage)**

**Goal:** Upload DOCX template → detect fields → single fill export → bulk fill via CSV/Sheet → ZIP + per-row report. [READ MORE ABOUT THE PROJECT](./docs-template-output-generation.md)

**My solution includes**

* **System design:** template ingestion, field extraction service, bulk job worker, export service
* **Schema:** Template, TemplateVersion, TemplateField, BulkRun, BulkRow, Artifact, JobEvent
* **Bulk storage strategy:** where CSV input + generated docs + ZIP live, cleanup policy
* **Reliability:** partial success handling, per-row status, retries, resumable bulk run
* **Security:** template isolation per tenant/user, safe downloads, anti-path traversal

**My Solution for problem 3:**

## System Design
Template upload → extract fields → store TemplateVersion + TemplateFields.
CSV upload → BulkRun → Worker processes rows → generate DOCX/PDF → ZIP export.

## Schema
Template(id, user_id, name)
TemplateVersion(id, template_id, storage_path, created_at)
TemplateField(id, template_version_id, field_name)
BulkRun(id, template_version_id, status, total_rows, processed_rows)
BulkRow(id, bulk_run_id, status, output_artifact_id)
Artifact(id, storage_path, type)
JobEvent(id, bulk_run_id, message)

## Storage Strategy
CSV input, generated docs, and ZIP stored in object storage.
Temporary files deleted after ZIP creation.

## Reliability
Per-row status, resumable using processed_rows, partial success supported.
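
A minimal sketch of the resumable loop, with invented names: rows already marked DONE are skipped, so a restarted worker continues where the previous attempt stopped, and rows that fail individually don't block the rest (partial success).

```java
import java.util.Map;

// Sketch of a resumable bulk run. Per-row status lives in BulkRow in the
// real system; document generation itself is elided.
public class BulkResume {
    static int process(Map<Integer, String> rowStatus, int totalRows) {
        int processed = 0;
        for (int row = 0; row < totalRows; row++) {
            if (!"DONE".equals(rowStatus.get(row))) {
                // generate DOCX/PDF for this row here (omitted), then:
                rowStatus.put(row, "DONE");
            }
            processed++;
        }
        return processed;
    }
}
```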

## Security
Tenant isolation via user_id, signed URLs, path traversal prevention.
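
Path traversal prevention can be sketched as: resolve the requested name against the tenant's root and reject anything that normalizes outside it. Names below are illustrative.

```java
import java.nio.file.Path;

// Sketch of an anti-path-traversal check for downloads: a request like
// "../8/doc.pdf" normalizes outside the tenant root and is rejected.
public class SafeDownload {
    static boolean isSafe(Path tenantRoot, String requested) {
        Path resolved = tenantRoot.resolve(requested).normalize();
        return resolved.startsWith(tenantRoot.normalize());
    }
}
```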


---

## **Problem 4: Character-Based Video Series Generator (Backend Architecture)**

**Goal:** Define characters once (image + traits + relationships). For each episode story → output episode package (script/scenes/assets plan/render plan), optionally render. [READ MORE ABOUT THE PROJECT](./char-based-video-generation.md)

**Your solution must include**
**My solution includes**

* **System design:** episodic pipeline as jobs, asset management, consistency strategy storage
* **Schema:** Character, Relationship, Episode, Scene, Asset, RenderJob, Artifact
* **Consistency data:** what gets persisted to keep character continuity across episodes
* **Storage:** images/audio/video assets, versioning, dedupe strategy
* **Security + cost controls:** quotas, rate limits, large asset constraints

**My Solution for problem 4:**

## System Design
Characters defined once with traits and assets.
Episode pipeline: Episode → scenes → assets → render jobs processed asynchronously.

## Schema
Character(id, user_id, name, traits, voice_id, appearance_ref)
Relationship(id, character_a_id, character_b_id, type)
Episode(id, user_id, title, character_snapshot_version)
Scene(id, episode_id, script_text)
Asset(id, scene_id, type, storage_path, hash)
RenderJob(id, episode_id, status, retry_count)
Artifact(id, render_job_id, storage_path)

## Consistency
Character snapshot version stored per episode to maintain continuity.

## Storage
Versioned assets with hash-based deduplication and per-user quotas.
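
Hash-based dedupe can be sketched as content-addressed storage: identical bytes hash to the same key, so a re-uploaded asset reuses the existing object instead of storing a copy. The class is an in-memory stand-in with invented names.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Sketch of content-addressed asset storage: the Asset.hash column maps
// identical bytes to one stored object. Object storage is simulated here.
public class AssetDedupe {
    private final Map<String, byte[]> store = new HashMap<>();

    // Returns the content hash; stores the bytes only if they are new.
    String put(byte[] content) throws Exception {
        MessageDigest sha = MessageDigest.getInstance("SHA-256");
        StringBuilder hex = new StringBuilder();
        for (byte b : sha.digest(content)) hex.append(String.format("%02x", b));
        String key = hex.toString();
        store.putIfAbsent(key, content);
        return key;
    }

    int size() { return store.size(); }
}
```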

## Security & Cost
Rate limits on render jobs, file size limits, per-user storage quotas.


## Problem 5: Cross-Cutting

Answer briefly for the whole platform:

1. **Multi-tenancy:** how users/tenants are isolated across data and storage
2. **AuthZ model:** roles and how access is enforced
3. **Observability:** what gets logged and which metrics matter
4. **Data retention:** what to delete and when (inputs, artifacts, logs)
5. **Secrets & compliance:** token encryption, key management approach, PII handling

**My Answer for problem 5:**

## Multi-Tenancy
User-level tenancy with user_id present in all tables.

## AuthZ Model
RBAC (USER, ADMIN) enforced at API layer and query filters.

## Observability
Logs include job_id, correlation_id, and status transitions.
Metrics: job latency, failure rate, queue depth.

## Data Retention
Raw inputs: 30 days
Logs: 14 days
Artifacts: user-controlled deletion.

## Secrets & Compliance
Tokens encrypted with AES-256.
Keys stored in environment/secret manager.
Minimal PII storage.
