ClipAI is an end-to-end AI-powered autonomous video editing pipeline that transforms raw talking-head videos into professional, engagement-ready shorts — completely hands-free.
It uses cutting-edge Large Language Models (LLaMA-3), Text-to-Image Diffusion (Stable Diffusion XL), and FFmpeg compositing to:
- 🎤 Transcribe your video using Groq Whisper
- 🧠 Analyze context via LLaMA-3.3-70B to find visually interesting moments
- 🎨 Generate cinematic B-Roll images matching the speaker's words
- 🎬 Composite everything with zoompan animations, fades, and subtitle burns
- ☁️ Deliver the final cut via Cloudinary CDN
💡 Why Generative AI instead of Stock APIs?
Stock footage APIs (like Pexels) return generic results. If a speaker says "A glowing coffee cup next to a 1980s computer," a stock API returns plain coffee images. Our Stable Diffusion pipeline generates context-aware visuals that depict what the speaker actually describes, giving far tighter semantic relevance than keyword-based stock search.
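To make this concrete, here is a minimal Python sketch of how a spoken line could be wrapped into an SDXL request for the Hugging Face Inference API. The model id, prompt template, and generation parameters below are illustrative assumptions, not the repo's exact code.

```python
# Sketch: turn a transcript line into an SDXL request payload.
# Model id and endpoint follow the Hugging Face Inference API
# convention; treat both as assumptions.

HF_MODEL = "stabilityai/stable-diffusion-xl-base-1.0"
HF_URL = f"https://api-inference.huggingface.co/models/{HF_MODEL}"

def build_sdxl_payload(spoken_line: str) -> dict:
    """Wrap what the speaker said in a cinematic prompt."""
    return {
        "inputs": f"cinematic still, photorealistic, 35mm film look: {spoken_line}",
        "parameters": {
            "negative_prompt": "blurry, low quality, watermark, text",
            "width": 1024,
            "height": 1024,
        },
    }

payload = build_sdxl_payload(
    "A glowing coffee cup next to a 1980s computer"
)
# POST this payload to HF_URL with an
# "Authorization: Bearer <HF_API_KEY>" header to get image bytes back.
```

Because the prompt is built from the transcript itself, the generated B-roll follows the speech rather than whatever a stock library happens to index.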
┌──────────────────────────────────────────────────────────────────┐
│ ClipAI Pipeline │
├──────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌───────────┐ ┌──────────────────────────┐ │
│ │ Next.js │───▶│ Node.js │───▶│ FastAPI (Python) │ │
│ │ Client │ │ Proxy │ │ │ │
│ │ :3000 │ │ :5001 │ │ ┌────────────────────┐ │ │
│ └─────────┘ └───────────┘ │ │ 1. FFmpeg Extract │ │ │
│ ▲ │ │ 2. Groq Whisper │ │ │
│ │ │ │ 3. LLaMA-3 Analysis │ │ │
│ │ ┌───────────┐ │ │ 4. SDXL Image Gen │ │ │
│ └─────────│ Cloudinary│◀───│ │ 5. FFmpeg Composite │ │ │
│ │ CDN │ │ └────────────────────┘ │ │
│ └───────────┘ └──────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────┘
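The five stages inside the FastAPI box can be sketched as plain functions. The names, signatures, and return values below are illustrative stubs, not the actual helpers in `ai-service/main.py`.

```python
# Structural sketch of the pipeline; each stage is a stub standing in
# for the real FFmpeg / Groq / SDXL calls in ai-service/main.py.

def extract_audio(video_path: str) -> str:
    # 1. FFmpeg: demux audio to an .mp3 next to the video
    return video_path.rsplit(".", 1)[0] + ".mp3"

def transcribe(audio_path: str) -> list[dict]:
    # 2. Groq Whisper: timestamped text segments
    return [{"start": 0.0, "end": 2.5, "text": "placeholder"}]

def analyze(segments: list[dict]) -> list[str]:
    # 3. LLaMA-3: turn visually interesting segments into image prompts
    return [f"cinematic still: {s['text']}" for s in segments]

def generate_images(prompts: list[str]) -> list[str]:
    # 4. SDXL: one generated image per prompt
    return [f"broll_{i:02d}.png" for i, _ in enumerate(prompts)]

def composite(video_path: str, images: list[str]) -> str:
    # 5. FFmpeg: overlay B-roll, burn subtitles, write the final cut
    return "final_cut.mp4"

def run_pipeline(video_path: str) -> str:
    audio = extract_audio(video_path)
    segments = transcribe(audio)
    prompts = analyze(segments)
    images = generate_images(prompts)
    return composite(video_path, images)
```

Each stage's output feeds the next, which is why a failure in transcription or prompt generation halts the run before any FFmpeg work starts.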
| Layer | Technology | Purpose |
|---|---|---|
| 🖥️ Frontend | Next.js 14 + Tailwind CSS | Responsive UI with real-time status |
| 🔌 Backend Proxy | Node.js + Express + Multer | File upload buffering & API routing |
| 🧠 AI Engine | Python + FastAPI | Core pipeline orchestration |
| 🗣️ Transcription | Groq Whisper API | Speech-to-text with timestamps |
| 💬 LLM | LLaMA-3.3-70B (Groq) | Context analysis & prompt engineering |
| 🎨 Image Gen | Stable Diffusion XL (HuggingFace) | Text-to-image B-roll generation |
| 🎬 Video Engine | FFmpeg | Compositing, transitions, subtitles |
| ☁️ CDN | Cloudinary | Cloud storage & video delivery |
graph LR
A[📹 Upload Video] --> B[🎵 Extract Audio]
B --> C[🗣️ Groq Whisper<br/>Transcription]
C --> D[🧠 LLaMA-3 70B<br/>Context Analysis]
D --> E[🎨 Stable Diffusion XL<br/>Image Generation]
E --> F[🎬 FFmpeg Compositing<br/>Zoompan + Subtitles]
F --> G[☁️ Cloudinary Upload]
G --> H[✅ Final Video Ready]
style A fill:#f97316,stroke:#ea580c,color:#fff
style B fill:#f59e0b,stroke:#d97706,color:#fff
style C fill:#10b981,stroke:#059669,color:#fff
style D fill:#3b82f6,stroke:#2563eb,color:#fff
style E fill:#a855f7,stroke:#9333ea,color:#fff
style F fill:#ec4899,stroke:#db2777,color:#fff
style G fill:#06b6d4,stroke:#0891b2,color:#fff
style H fill:#22c55e,stroke:#16a34a,color:#fff
| Step | Process | Technology | What Happens |
|---|---|---|---|
| 1️⃣ | Audio Extraction | FFmpeg subprocess | Video → .mp3 audio file extracted |
| 2️⃣ | Transcription | Groq Whisper API | Audio → timestamped text segments |
| 3️⃣ | Context Analysis | LLaMA-3.3-70B | Transcript → cinematic image prompts |
| 4️⃣ | B-Roll Generation | Stable Diffusion XL | Prompts → photorealistic images |
| 5️⃣ | Motion Animation | FFmpeg zoompan | Static images → animated video clips |
| 6️⃣ | Compositing | FFmpeg filter_complex | Overlay B-roll + burn SRT subtitles |
| 7️⃣ | Cloud Delivery | Cloudinary API | Upload → global CDN URL returned |
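Step 5️⃣ can be illustrated with a small helper that assembles the FFmpeg `zoompan` command. The zoom rate, 9:16 output size, and filenames are illustrative assumptions, not values taken from the repo.

```python
# Sketch of step 5: animate one still image into a short clip with
# FFmpeg's zoompan filter (slow push-in, portrait 1080x1920 for shorts).
import subprocess

def zoompan_cmd(image: str, out: str,
                seconds: float = 3.0, fps: int = 30) -> list[str]:
    frames = int(seconds * fps)
    # Zoom grows ~0.1% per frame, capped at 1.2x for a subtle push-in.
    vf = (
        f"zoompan=z='min(zoom+0.001,1.2)':d={frames}:fps={fps}"
        f":s=1080x1920,format=yuv420p"
    )
    return [
        "ffmpeg", "-y", "-loop", "1", "-i", image,
        "-vf", vf, "-t", str(seconds), "-an", out,
    ]

cmd = zoompan_cmd("broll_01.png", "broll_01.mp4")
# subprocess.run(cmd, check=True)  # requires FFmpeg on the system PATH
```

The resulting clips are what step 6️⃣ overlays onto the talking-head track with `filter_complex`.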
✅ Node.js v18+
✅ Python 3.9+
✅ FFmpeg (in system PATH)
git clone https://github.com/samay-hash/ClipAI_Intern.git
cd ClipAI_Intern

📁 ai-service/.env
# Sarvam API — for speech-to-text / Hindi captions
SARVAM_API_KEY="sk_..."
# Groq API — for LLaMA-3 and Whisper
GROQ_API_KEY="gsk_..."
# Hugging Face — for Stable Diffusion XL
HF_API_KEY="hf_..."
# Cloudinary — for video cloud storage
CLOUDINARY_CLOUD_NAME="..."
CLOUDINARY_API_KEY="..."
CLOUDINARY_API_SECRET="..."

📁 backend/.env
PORT=5001
AI_SERVICE_URL="http://localhost:8000"
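These `.env` files are typically loaded with `python-dotenv`; as a dependency-free illustration, here is a tiny stand-in parser (the real service may load its configuration differently).

```python
# Minimal .env loader: puts KEY="value" lines into os.environ.
# A stand-in for python-dotenv's load_dotenv(), for illustration only.
import os
import pathlib

def load_env(path: str = ".env") -> None:
    for line in pathlib.Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault: real environment variables win over file values
        os.environ.setdefault(key.strip(), value.strip().strip('"'))
```

Keys already present in the environment are left untouched, so deployment platforms like Render can still override the file.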
⚠️ Run each in a separate terminal
# Terminal 1: AI Engine (Python)
cd ai-service
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python main.py
# → Running on http://localhost:8000

# Terminal 2: Backend Proxy (Node.js)
cd backend
npm install
npm run dev
# → Running on http://localhost:5001

# Terminal 3: Frontend (Next.js)
cd frontend
npm install
npm run dev
# → Running on http://localhost:3000

Navigate to http://localhost:3000 → Upload a video → Watch the AI magic happen! ✨
Deployment targets: Vercel · Render (Node.js) · Render (Docker)
ClipAI_Intern/
├── 🎨 frontend/ # Next.js 14 + Tailwind CSS
│ ├── src/app/page.tsx # Main UI (upload, progress, gallery)
│ ├── src/app/globals.css # Design system (CSS variables)
│ └── package.json
│
├── 🔌 backend/ # Node.js Express Proxy
│ ├── server.js # Upload handling, AI service proxy
│ └── package.json
│
├── 🧠 ai-service/ # Python FastAPI AI Engine
│ ├── main.py # Core pipeline (Whisper → LLaMA → SDXL → FFmpeg)
│ ├── Dockerfile # Docker config for Render deployment
│ └── requirements.txt # Python dependencies
│
└── 📄 README.md # You are here!
Contributions are welcome! Here's how to get involved:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Commit your changes: git commit -m 'Add amazing feature'
- Push to the branch: git push origin feature/amazing-feature
- Open a Pull Request
| Area | Idea | Difficulty |
|---|---|---|
| 🎨 | Add more visual styles (Watercolor, Pixel Art) | 🟢 Easy |
| 🔊 | Add background music overlay | 🟡 Medium |
| 📊 | Redis/Celery job queue for scaling | 🟡 Medium |
| 🎥 | AI video generation (instead of images) | 🔴 Hard |
| 🌐 | Multi-language subtitle support | 🟡 Medium |
This project is licensed under the MIT License — see the LICENSE file for details.