+```
+
+### Flags
+
+Name of the Flash app to delete. Required explicitly for safety.
+
+Unlike other subcommands, `delete` requires the `--app` flag explicitly. This is a safety measure for destructive operations.
+
+### Process
+
+1. Shows app details and resources to be deleted.
+2. Prompts for confirmation (required).
+3. Deletes all environments and their resources.
+4. Deletes all builds.
+5. Deletes the app.
+
+
+
+This operation is irreversible. All environments, builds, endpoints, volumes, and configuration will be permanently deleted.
+
+
+
+---
+
+## App hierarchy
+
+A Flash app contains environments and builds:
+
+```text
+Flash App (my-project)
+│
+├── Environments
+│ ├── dev
+│ │ ├── Endpoints (ep1, ep2)
+│ │ └── Volumes (vol1)
+│ ├── staging
+│ │ ├── Endpoints (ep1, ep2)
+│ │ └── Volumes (vol1)
+│ └── production
+│ ├── Endpoints (ep1, ep2)
+│ └── Volumes (vol1)
+│
+└── Builds
+ ├── build_v1 (2024-01-15)
+ ├── build_v2 (2024-01-18)
+ └── build_v3 (2024-01-20)
+```
+
+## Auto-detection
+
+Flash CLI automatically detects the app name from your current directory:
+
+```bash
+cd /path/to/APP_NAME
+flash deploy # Deploys to 'APP_NAME' app
+flash env list # Lists 'APP_NAME' environments
+```
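The detection rule is simple enough to sketch in a few lines; `detect_app_name` is a hypothetical helper shown only to illustrate the behavior, not part of the Flash CLI:

```python
from pathlib import Path

# Sketch of the rule above: the app name defaults to the basename of
# the working directory. Illustrative helper, not Flash's internals.
def detect_app_name(cwd=None):
    return Path(cwd or Path.cwd()).name
```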
+
+Override with the `--app` flag:
+
+```bash
+flash deploy --app other-project
+flash env list --app other-project
+```
+
+## Related commands
+
+- [`flash env`](/flash/cli/env) - Manage environments within an app
+- [`flash deploy`](/flash/cli/deploy) - Deploy to an app's environment
+- [`flash init`](/flash/cli/init) - Create a new project
diff --git a/flash/cli/build.mdx b/flash/cli/build.mdx
new file mode 100644
index 00000000..fb6da58f
--- /dev/null
+++ b/flash/cli/build.mdx
@@ -0,0 +1,184 @@
+---
+title: "build"
+sidebarTitle: "build"
+---
+
+Build a deployment-ready artifact for your Flash application without deploying. Use it when you need more control over the build process or want to inspect the artifact before deploying.
+
+```bash
+flash build [OPTIONS]
+```
+
+## Examples
+
+Build with all dependencies:
+
+```bash
+flash build
+```
+
+Build and launch local preview environment:
+
+```bash
+flash build --preview
+```
+
+Build with excluded packages (for smaller deployment size):
+
+```bash
+flash build --exclude torch,torchvision,torchaudio
+```
+
+Keep the build directory for inspection:
+
+```bash
+flash build --keep-build
+```
+
+## Flags
+
+
+Skip transitive dependencies during pip install. Only installs direct dependencies specified in `@remote` decorators. Useful when the base image already includes dependencies.
+
+
+
+Keep the `.flash/.build` directory after creating the archive. Useful for debugging build issues or inspecting generated files.
+
+
+
+Custom name for the output archive file.
+
+
+
+Comma-separated list of packages to exclude from the build (e.g., `torch,torchvision`). Use this to skip packages already in the base image.
+
+
+
+Launch a local Docker-based test environment after building. Automatically enables `--keep-build`.
+
+
+## What happens during build
+
+1. **Function discovery**: Finds all `@remote` decorated functions.
+2. **Grouping**: Groups functions by their `resource_config`.
+3. **Manifest generation**: Creates `.flash/flash_manifest.json` with endpoint definitions.
+4. **Dependency installation**: Installs Python packages for Linux x86_64.
+5. **Packaging**: Bundles everything into `.flash/artifact.tar.gz`.
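The grouping step (2) can be sketched as follows. The function records and the `group_by_resource_config` helper are illustrative, not Flash's actual internals; the idea is that each distinct `resource_config` becomes one endpoint:

```python
from collections import defaultdict

# Illustrative sketch: bucket discovered @remote functions by the name
# of their resource_config, so each config maps to one endpoint.
def group_by_resource_config(functions):
    groups = defaultdict(list)
    for fn in functions:
        groups[fn["resource_config"]["name"]].append(fn["name"])
    return dict(groups)

funcs = [
    {"name": "embed", "resource_config": {"name": "gpu-worker"}},
    {"name": "rank", "resource_config": {"name": "gpu-worker"}},
    {"name": "clean", "resource_config": {"name": "cpu-worker"}},
]
```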
+
+## Build artifacts
+
+After running `flash build`:
+
+| File/Directory | Description |
+|----------------|-------------|
+| `.flash/artifact.tar.gz` | Deployment package ready for Runpod |
+| `.flash/flash_manifest.json` | Service discovery configuration |
+| `.flash/.build/` | Temporary build directory (removed unless `--keep-build`) |
+
+## Cross-platform builds
+
+Flash automatically handles cross-platform builds:
+
+- **Automatic platform targeting**: Dependencies are installed for Linux x86_64, regardless of your build platform.
+- **Python version matching**: Uses your current Python version for package compatibility.
+- **Binary wheel enforcement**: Only pre-built wheels are used, preventing compilation issues.
+
+You can build on macOS, Windows, or Linux, and the deployment will work on Runpod.
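The three behaviors above map onto standard `pip` flags. This sketch only shows how such a command could be assembled; the exact flags Flash passes are not documented here:

```python
import sys

# Hedged sketch of a cross-platform install command using real pip
# flags. Builds the argument list without running it.
def linux_install_cmd(packages, target_dir):
    major, minor = sys.version_info[:2]
    return [
        sys.executable, "-m", "pip", "install",
        "--platform", "manylinux2014_x86_64",    # automatic platform targeting
        "--python-version", f"{major}.{minor}",  # python version matching
        "--only-binary", ":all:",                # binary wheel enforcement
        "--target", target_dir,
        *packages,
    ]
```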
+
+## Managing deployment size
+
+Runpod Serverless has a **500MB deployment limit**. Use `--exclude` to skip packages already in your base image:
+
+```bash
+# For GPU deployments (PyTorch pre-installed)
+flash build --exclude torch,torchvision,torchaudio
+```
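If you want to verify the result before deploying, a quick size check against the limit looks like this (the artifact path comes from the build artifacts table above; the helper itself is not part of Flash):

```python
import os

# Check the built artifact against the 500MB Serverless limit.
LIMIT_BYTES = 500 * 1024 * 1024

def artifact_fits(path=".flash/artifact.tar.gz"):
    return os.path.getsize(path) <= LIMIT_BYTES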
+
+### Base image reference
+
+| Resource type | Base image | Safe to exclude |
+|--------------|------------|-----------------|
+| GPU | PyTorch base | `torch`, `torchvision`, `torchaudio` |
+| CPU | Python slim | Do not exclude ML packages |
+
+
+
+Check the [worker-flash repository](https://github.com/runpod-workers/worker-flash) for current base images and pre-installed packages.
+
+
+
+## Preview environment
+
+Test your deployment locally before pushing to Runpod:
+
+```bash
+flash build --preview
+```
+
+This:
+
+1. Builds your project (creates archive and manifest).
+2. Creates a Docker network for inter-container communication.
+3. Starts one container per resource config (mothership + workers).
+4. Exposes the mothership on `localhost:8000`.
+5. On shutdown (`Ctrl+C`), stops and removes all containers.
+
+### When to use preview
+
+- Test deployment configuration before production.
+- Validate manifest structure.
+- Debug resource provisioning.
+- Verify cross-endpoint function calls.
+
+## Troubleshooting
+
+### Build fails with "functions not found"
+
+Ensure your project has `@remote` decorated functions:
+
+```python
+from runpod_flash import remote, LiveServerless
+
+config = LiveServerless(name="my-worker")
+
+@remote(resource_config=config)
+def my_function(data):
+ return {"result": data}
+```
+
+### Archive is too large
+
+Use `--exclude` or `--no-deps`:
+
+```bash
+flash build --exclude torch,torchvision,torchaudio
+```
+
+### Dependency installation fails
+
+If a package doesn't have Linux x86_64 wheels:
+
+1. Ensure standard pip is installed: `python -m ensurepip --upgrade`
+2. Check PyPI for Linux wheel availability.
+3. For Python 3.13+, some packages may require newer manylinux versions.
+
+### Need to examine generated files
+
+Use `--keep-build`:
+
+```bash
+flash build --keep-build
+ls .flash/.build/
+```
+
+## Related commands
+
+- [`flash deploy`](/flash/cli/deploy) - Build and deploy in one step
+- [`flash run`](/flash/cli/run) - Start development server
+- [`flash env`](/flash/cli/env) - Manage environments
+
+
+
+Most users should use `flash deploy` instead, which runs build and deploy in one step. Use `flash build` when you need more control or want to inspect the artifact.
+
+
diff --git a/flash/cli/deploy.mdx b/flash/cli/deploy.mdx
new file mode 100644
index 00000000..bd4224fa
--- /dev/null
+++ b/flash/cli/deploy.mdx
@@ -0,0 +1,247 @@
+---
+title: "deploy"
+sidebarTitle: "deploy"
+---
+
+Build and deploy your Flash application to Runpod Serverless endpoints in one step. This is the primary command for getting your application running in the cloud.
+
+```bash
+flash deploy [OPTIONS]
+```
+
+## Examples
+
+Build and deploy a Flash app from the current directory (auto-selects environment if only one exists):
+
+```bash
+flash deploy
+```
+
+Deploy to a specific environment:
+
+```bash
+flash deploy --env production
+```
+
+Deploy with excluded packages to reduce size:
+
+```bash
+flash deploy --exclude torch,torchvision,torchaudio
+```
+
+Build and test locally before deploying:
+
+```bash
+flash deploy --preview
+```
+
+## Flags
+
+
+Target environment name (e.g., `dev`, `staging`, `production`). Auto-selected if only one exists. Creates the environment if it doesn't exist.
+
+
+
+Flash app name. Auto-detected from the current directory if not specified.
+
+
+
+Skip transitive dependencies during pip install. Useful when the base image already includes dependencies.
+
+
+
+Comma-separated packages to exclude (e.g., `torch,torchvision`). Use this to stay under the 500MB deployment limit.
+
+
+
+Custom archive name for the build artifact.
+
+
+
+Build and launch a local Docker-based preview environment instead of deploying to Runpod.
+
+
+
+Bundle local `runpod_flash` source instead of the PyPI version. For development and testing only.
+
+
+## What happens during deployment
+
+1. **Build phase**: Creates the deployment artifact (same as `flash build`).
+2. **Environment resolution**: Detects or creates the target environment.
+3. **Upload**: Sends the artifact to Runpod storage.
+4. **Provisioning**: Creates or updates Serverless endpoints.
+5. **Configuration**: Sets up environment variables and service discovery.
+6. **Verification**: Confirms endpoints are healthy.
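The verification step (6) amounts to polling until endpoints report healthy. A generic sketch, where `check` stands in for the real health call (which this page doesn't specify):

```python
import time

# Poll a health check until it passes or a timeout expires.
def wait_until_healthy(check, timeout=60.0, interval=2.0):
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)
    return False
```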
+
+## Architecture
+
+After deployment, your entire application runs on Runpod Serverless:
+
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ Users(["USERS"])
+
+ subgraph Runpod ["RUNPOD SERVERLESS"]
+    Mothership["MOTHERSHIP ENDPOINT
+(your FastAPI app from main.py)
+• Your HTTP routes
+• Orchestrates @remote calls
+• Public URL for users"]
+    GPU["gpu-worker
+(your @remote function)"]
+    CPU["cpu-worker
+(your @remote function)"]
+
+ Mothership -->|"internal"| GPU
+ Mothership -->|"internal"| CPU
+ end
+
+ Users -->|"HTTPS (authenticated)"| Mothership
+
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Users fill:#4D38F5,stroke:#4D38F5,color:#fff
+ style Mothership fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+
+## Environment management
+
+### Automatic creation
+
+If the specified environment doesn't exist, `flash deploy` creates it:
+
+```bash
+# Creates 'staging' if it doesn't exist
+flash deploy --env staging
+```
+
+### Auto-selection
+
+When you have only one environment, it's selected automatically:
+
+```bash
+# Auto-selects the only available environment
+flash deploy
+```
+
+When multiple environments exist, you must specify one:
+
+```bash
+# Required when multiple environments exist
+flash deploy --env staging
+```
+
+### Default environment
+
+If no environment exists and none is specified, Flash creates a `production` environment by default.
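Taken together, the resolution rules above can be summarized in a short sketch (illustrative only, not Flash's actual implementation):

```python
# Environment-resolution rules: explicit flag wins, otherwise
# auto-select a sole environment, default to 'production' if none
# exist, and error when the choice is ambiguous.
def resolve_env(existing, requested=None):
    if requested:
        return requested          # created on deploy if it doesn't exist
    if not existing:
        return "production"       # default environment
    if len(existing) == 1:
        return existing[0]        # auto-selection
    raise ValueError("Multiple environments found: " + ", ".join(existing))
```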
+
+## Post-deployment
+
+After successful deployment, Flash displays:
+
+```text
+✓ Deployment Complete
+
+Your mothership is deployed at:
+https://api-xxxxx.runpod.net
+
+Available Routes:
+POST /api/hello
+POST /gpu/process
+
+All endpoints require authentication:
+curl -X POST https://api-xxxxx.runpod.net/api/hello \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"param": "value"}'
+```
+
+### Authentication
+
+All deployed endpoints require authentication with your Runpod API key:
+
+```bash
+export RUNPOD_API_KEY="your_key_here"
+
+curl -X POST https://YOUR_ENDPOINT_URL/path \
+ -H "Authorization: Bearer $RUNPOD_API_KEY" \
+ -H "Content-Type: application/json" \
+ -d '{"param": "value"}'
+```
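The same call can be made from Python using only the standard library. As in the curl example, the URL is a placeholder, and `call_endpoint` is just an illustrative wrapper:

```python
import json
import os
import urllib.request

# Build the auth headers the endpoints expect.
def auth_headers():
    return {
        "Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}",
        "Content-Type": "application/json",
    }

# POST a JSON payload to a deployed endpoint and decode the response.
def call_endpoint(url, payload):
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(), headers=auth_headers()
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```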
+
+## Preview mode
+
+Test locally before deploying:
+
+```bash
+flash deploy --preview
+```
+
+This builds your project and runs it in Docker containers locally:
+
+- Mothership exposed on `localhost:8000`.
+- All containers communicate via Docker network.
+- Press `Ctrl+C` to stop.
+
+## Managing deployment size
+
+Runpod Serverless has a **500MB limit**. Use `--exclude` to skip packages in the base image:
+
+```bash
+# GPU deployments (PyTorch pre-installed)
+flash deploy --exclude torch,torchvision,torchaudio
+```
+
+| Resource type | Safe to exclude |
+|--------------|-----------------|
+| GPU | `torch`, `torchvision`, `torchaudio` |
+| CPU | Do not exclude ML packages |
+
+## flash run vs flash deploy
+
+| Aspect | `flash run` | `flash deploy` |
+|--------|-------------|----------------|
+| FastAPI app runs on | Your machine | Runpod Serverless |
+| `@remote` functions run on | Runpod Serverless | Runpod Serverless |
+| Endpoint naming | `live-` prefix | No prefix |
+| Automatic updates | Yes | No |
+| Use case | Development | Production |
+
+## Troubleshooting
+
+### Multiple environments error
+
+```text
+Error: Multiple environments found: dev, staging, production
+```
+
+Specify the target environment:
+
+```bash
+flash deploy --env staging
+```
+
+### Deployment size limit
+
+Use `--exclude` to reduce size:
+
+```bash
+flash deploy --exclude torch,torchvision,torchaudio
+```
+
+### Authentication fails
+
+Ensure your API key is set:
+
+```bash
+echo $RUNPOD_API_KEY
+export RUNPOD_API_KEY="your_key_here"
+```
+
+## Related commands
+
+- [`flash build`](/flash/cli/build) - Build without deploying
+- [`flash run`](/flash/cli/run) - Local development server
+- [`flash env`](/flash/cli/env) - Manage environments
+- [`flash app`](/flash/cli/app) - Manage applications
+- [`flash undeploy`](/flash/cli/undeploy) - Remove endpoints
diff --git a/flash/cli/env.mdx b/flash/cli/env.mdx
new file mode 100644
index 00000000..7d4494ba
--- /dev/null
+++ b/flash/cli/env.mdx
@@ -0,0 +1,255 @@
+---
+title: "env"
+sidebarTitle: "env"
+---
+
+Manage deployment environments for Flash applications. Environments are isolated deployment contexts (like `dev`, `staging`, `production`) within a Flash app.
+
+```bash Command
+flash env SUBCOMMAND [OPTIONS]
+```
+
+## Subcommands
+
+| Subcommand | Description |
+|------------|-------------|
+| `list` | Show all environments for an app |
+| `create` | Create a new environment |
+| `get` | Show details of an environment |
+| `delete` | Delete an environment and its resources |
+
+---
+
+## env list
+
+Show all available environments for an app.
+
+```bash Command
+flash env list [OPTIONS]
+```
+
+### Example
+
+```bash
+# List environments for current app
+flash env list
+
+# List environments for specific app
+flash env list --app APP_NAME
+```
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Output
+
+```text
+┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃ Active Build ┃ Created At ┃
+┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
+│ dev │ env_abc123 │ build_xyz789 │ 2024-01-15 10:30 │
+│ staging │ env_def456 │ build_uvw456 │ 2024-01-16 14:20 │
+│ production │ env_ghi789 │ build_rst123 │ 2024-01-20 09:15 │
+└────────────┴─────────────────────┴───────────────────┴──────────────────┘
+```
+
+---
+
+## env create
+
+Create a new deployment environment.
+
+```bash Command
+flash env create ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Example
+
+```bash
+# Create staging environment
+flash env create staging
+
+# Create environment in specific app
+flash env create production --app APP_NAME
+```
+
+### Arguments
+
+
+Name for the new environment (e.g., `dev`, `staging`, `production`).
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Notes
+
+- If the app doesn't exist, it's created automatically.
+- Environment names must be unique within an app.
+- Newly created environments have no active build until first deployment.
+
+
+
+You don't always need to create environments explicitly. Running `flash deploy --env ENVIRONMENT_NAME` creates the environment automatically if it doesn't exist.
+
+
+
+---
+
+## env get
+
+Show detailed information about a deployment environment.
+
+```bash Command
+flash env get ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Example
+
+```bash
+# Get details for production environment
+flash env get production
+
+# Get details for specific app's environment
+flash env get staging --app APP_NAME
+```
+
+### Arguments
+
+
+Name of the environment to inspect.
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Output
+
+```text
+╭────────────────────────────────────╮
+│ Environment: production │
+├────────────────────────────────────┤
+│ ID: env_ghi789 │
+│ State: DEPLOYED │
+│ Active Build: build_rst123 │
+│ Created: 2024-01-20 09:15:00 │
+╰────────────────────────────────────╯
+
+ Associated Endpoints
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
+│ my-gpu │ ep_abc123 │
+│ my-cpu │ ep_def456 │
+└────────────────┴────────────────────┘
+
+ Associated Network Volumes
+┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┓
+┃ Name ┃ ID ┃
+┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━┩
+│ model-cache │ nv_xyz789 │
+└────────────────┴────────────────────┘
+```
+
+---
+
+## env delete
+
+Delete a deployment environment and all its associated resources.
+
+```bash Command
+flash env delete ENVIRONMENT_NAME [OPTIONS]
+```
+
+### Examples
+
+```bash
+# Delete development environment
+flash env delete dev
+
+# Delete environment in specific app
+flash env delete staging --app APP_NAME
+```
+
+### Arguments
+
+
+Name of the environment to delete.
+
+
+### Flags
+
+
+Flash app name. Auto-detected from current directory if not specified.
+
+
+### Process
+
+1. Shows environment details and resources to be deleted.
+2. Prompts for confirmation (required).
+3. Undeploys all associated endpoints.
+4. Removes all associated network volumes.
+5. Deletes the environment from the app.
+
+
+
+This operation is irreversible. All endpoints, volumes, and configuration associated with the environment will be permanently deleted.
+
+
+
+---
+
+## Environment states
+
+| State | Description |
+|-------|-------------|
+| PENDING | Environment created but not deployed |
+| DEPLOYING | Deployment in progress |
+| DEPLOYED | Successfully deployed and running |
+| FAILED | Deployment or health check failed |
+| DELETING | Deletion in progress |
+
+## Common workflows
+
+### Three-tier deployment
+
+```bash
+# Create environments
+flash env create dev
+flash env create staging
+flash env create production
+
+# Deploy to each
+flash deploy --env dev
+flash deploy --env staging
+flash deploy --env production
+```
+
+### Feature branch testing
+
+```bash
+# Create feature environment
+flash env create FEATURE_NAME
+
+# Deploy feature branch
+git checkout FEATURE_NAME
+flash deploy --env FEATURE_NAME
+
+# Clean up after merge
+flash env delete FEATURE_NAME
+```
+
+## Related commands
+
+- [`flash deploy`](/flash/cli/deploy) - Deploy to an environment
+- [`flash app`](/flash/cli/app) - Manage applications
+- [`flash undeploy`](/flash/cli/undeploy) - Remove specific endpoints
diff --git a/flash/cli/init.mdx b/flash/cli/init.mdx
new file mode 100644
index 00000000..12f93b93
--- /dev/null
+++ b/flash/cli/init.mdx
@@ -0,0 +1,89 @@
+---
+title: "init"
+sidebarTitle: "init"
+---
+
+Create a new Flash project with a ready-to-use template structure including a FastAPI server, example GPU and CPU workers, and configuration files.
+
+```bash
+flash init [PROJECT_NAME] [OPTIONS]
+```
+
+## Example
+
+Create a new project directory:
+
+```bash
+flash init PROJECT_NAME
+cd PROJECT_NAME
+pip install -r requirements.txt
+flash run
+```
+
+Initialize in the current directory:
+
+```bash
+flash init .
+```
+
+## Arguments
+
+
+Name of the project directory to create. If omitted or set to `.`, initializes in the current directory.
+
+
+## Flags
+
+
+Overwrite existing files if they already exist in the target directory.
+
+
+## What it creates
+
+The command creates the following project structure:
+
+```text
+PROJECT_NAME/
+├── main.py # FastAPI application entry point
+├── workers/
+│ ├── gpu/ # GPU worker example
+│ │ ├── __init__.py
+│ │ └── endpoint.py
+│ └── cpu/ # CPU worker example
+│ ├── __init__.py
+│ └── endpoint.py
+├── .env # Environment variables template
+├── .gitignore # Git ignore patterns
+├── .flashignore # Flash deployment ignore patterns
+├── requirements.txt # Python dependencies
+└── README.md # Project documentation
+```
+
+### Template contents
+
+- **main.py**: FastAPI application that imports routers from the `workers/` directory.
+- **workers/gpu/endpoint.py**: Example GPU worker with a `@remote` decorated function using `LiveServerless`.
+- **workers/cpu/endpoint.py**: Example CPU worker with a `@remote` decorated function using CPU configuration.
+- **.env**: Template for environment variables including `RUNPOD_API_KEY`.
+
+## Next steps
+
+After initialization:
+
+1. Copy `.env.example` to `.env` (if needed) and add your `RUNPOD_API_KEY`.
+2. Install dependencies: `pip install -r requirements.txt`
+3. Start the development server: `flash run`
+4. Open http://localhost:8888/docs to explore the API.
+5. Customize the workers for your use case.
+6. Deploy with `flash deploy` when ready.
+
+
+
+This command only creates local files. It doesn't interact with Runpod or create any cloud resources. Cloud resources are created when you run `flash run` or `flash deploy`.
+
+
+
+## Related commands
+
+- [`flash run`](/flash/cli/run) - Start the development server
+- [`flash deploy`](/flash/cli/deploy) - Build and deploy to Runpod
diff --git a/flash/cli/overview.mdx b/flash/cli/overview.mdx
new file mode 100644
index 00000000..db53b4bb
--- /dev/null
+++ b/flash/cli/overview.mdx
@@ -0,0 +1,84 @@
+---
+title: "CLI overview"
+sidebarTitle: "Overview"
+description: "Learn how to use the Flash CLI for local development and deployment."
+---
+
+The Flash CLI provides commands for initializing projects, running local development servers, building deployment artifacts, and managing your applications on Runpod Serverless.
+
+Before using the CLI, make sure you've [installed Flash](/flash/overview#install-flash) and set your [Runpod API key](/get-started/api-keys) in your environment.
+
+## Available commands
+
+| Command | Description |
+|---------|-------------|
+| [`flash init`](/flash/cli/init) | Create a new Flash project with a template structure |
+| [`flash run`](/flash/cli/run) | Start the local development server with automatic updates |
+| [`flash build`](/flash/cli/build) | Build a deployment artifact without deploying |
+| [`flash deploy`](/flash/cli/deploy) | Build and deploy your application to Runpod |
+| [`flash env`](/flash/cli/env) | Manage deployment environments |
+| [`flash app`](/flash/cli/app) | Manage Flash applications |
+| [`flash undeploy`](/flash/cli/undeploy) | Remove deployed endpoints |
+
+## Getting help
+
+View help for any command by adding `--help`:
+
+```bash
+flash --help
+flash deploy --help
+flash env --help
+```
+
+## Common workflows
+
+### Local development
+
+```bash
+# Create a new project
+flash init PROJECT_NAME
+cd PROJECT_NAME
+
+# Install dependencies
+pip install -r requirements.txt
+
+# Add your API key to .env
+# Start the development server
+flash run
+```
+
+### Deploy to production
+
+```bash
+# Build and deploy
+flash deploy
+
+# Deploy to a specific environment
+flash deploy --env ENVIRONMENT_NAME
+```
+
+### Manage deployments
+
+```bash
+# List environments
+flash env list
+
+# Check environment status
+flash env get ENVIRONMENT_NAME
+
+# Remove an environment
+flash env delete ENVIRONMENT_NAME
+```
+
+### Clean up endpoints
+
+```bash
+# List deployed endpoints
+flash undeploy list
+
+# Remove specific endpoint
+flash undeploy ENDPOINT_NAME
+
+# Remove all endpoints
+flash undeploy --all
+```
\ No newline at end of file
diff --git a/flash/cli/run.mdx b/flash/cli/run.mdx
new file mode 100644
index 00000000..4dab9e6c
--- /dev/null
+++ b/flash/cli/run.mdx
@@ -0,0 +1,156 @@
+---
+title: "run"
+sidebarTitle: "run"
+---
+
+Start the Flash development server for local testing with automatic updates. Your FastAPI app runs locally while `@remote` functions execute on Runpod Serverless.
+
+```bash
+flash run [OPTIONS]
+```
+
+## Example
+
+Start the development server with defaults:
+
+```bash
+flash run
+```
+
+Start with auto-provisioning to eliminate cold-start delays:
+
+```bash
+flash run --auto-provision
+```
+
+Start on a custom port:
+
+```bash
+flash run --port 3000
+```
+
+## Flags
+
+
+Host address to bind the server to.
+
+
+
+Port number to bind the server to.
+
+
+
+Enable or disable auto-reload on code changes. Enabled by default.
+
+
+
+Auto-provision all Serverless endpoints on startup instead of lazily on first call. Eliminates cold-start delays during development.
+
+
+## Architecture
+
+With `flash run`, your system runs in a hybrid architecture:
+
+```mermaid
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'14px','fontFamily':'font-inter'}}}%%
+
+flowchart TB
+ subgraph Local ["YOUR MACHINE (localhost:8888)"]
+    FastAPI["FastAPI App (main.py)
+• Your HTTP routes
+• Orchestrates @remote calls
+• Updates automatically"]
+  end
+
+  subgraph Runpod ["RUNPOD SERVERLESS"]
+    GPU["live-gpu-worker
+(your @remote function)"]
+    CPU["live-cpu-worker
+(your @remote function)"]
+ end
+
+ FastAPI -->|"HTTPS"| GPU
+ FastAPI -->|"HTTPS"| CPU
+
+ style Local fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style Runpod fill:#1a1a2e,stroke:#5F4CFE,stroke-width:2px,color:#fff
+ style FastAPI fill:#5F4CFE,stroke:#5F4CFE,color:#fff
+ style GPU fill:#22C55E,stroke:#22C55E,color:#000
+ style CPU fill:#22C55E,stroke:#22C55E,color:#000
+```
+
+**Key points:**
+
+- Your FastAPI app runs locally and updates automatically for rapid iteration.
+- `@remote` functions run on Runpod as Serverless endpoints.
+- Endpoints are prefixed with `live-` to distinguish from production.
+- Changes to local code are picked up instantly.
+
+This is different from `flash deploy`, where everything runs on Runpod.
+
+## Auto-provisioning
+
+By default, endpoints are provisioned lazily on first `@remote` function call. Use `--auto-provision` to provision all endpoints at server startup:
+
+```bash
+flash run --auto-provision
+```
+
+### How it works
+
+1. **Discovery**: Scans your app for `@remote` decorated functions.
+2. **Deployment**: Deploys resources concurrently (up to 3 at a time).
+3. **Confirmation**: Asks for confirmation if deploying more than 5 endpoints.
+4. **Caching**: Stores deployed resources in `.runpod/resources.pkl` for reuse.
+5. **Updates**: Recognizes existing endpoints and updates if configuration changed.
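The concurrent deployment step (2) can be sketched with a thread pool. Here `provision` stands in for the real per-endpoint deployment call, which this page doesn't expose:

```python
from concurrent.futures import ThreadPoolExecutor

# Deploy resources concurrently, at most three at a time.
def provision_all(configs, provision, max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(provision, configs))
```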
+
+### Benefits
+
+- **Zero cold start**: All endpoints ready before you test them.
+- **Faster development**: No waiting for deployment on first HTTP call.
+- **Resource reuse**: Cached endpoints are reused across server restarts.
+
+### When to use
+
+- Local development with multiple endpoints.
+- Testing workflows that call multiple remote functions.
+- Debugging where you want deployment separated from handler logic.
+
+## Provisioning modes
+
+| Mode | When endpoints are deployed |
+|------|----------------------------|
+| Default (lazy) | On first `@remote` function call |
+| `--auto-provision` | At server startup |
+
+## Testing your API
+
+Once the server is running, test your endpoints:
+
+```bash
+# Health check
+curl http://localhost:8888/
+
+# Call a GPU endpoint
+curl -X POST http://localhost:8888/gpu/hello \
+ -H "Content-Type: application/json" \
+ -d '{"message": "Hello from GPU!"}'
+```
+
+Open http://localhost:8888/docs for the interactive API explorer.
+
+## Requirements
+
+- `RUNPOD_API_KEY` must be set in your `.env` file or environment.
+- A valid Flash project structure (created by `flash init` or manually).
+
+## flash run vs flash deploy
+
+| Aspect | `flash run` | `flash deploy` |
+|--------|-------------|----------------|
+| FastAPI app runs on | Your machine (localhost) | Runpod Serverless |
+| `@remote` functions run on | Runpod Serverless | Runpod Serverless |
+| Endpoint naming | `live-` prefix | No prefix |
+| Automatic updates | Yes | No |
+| Use case | Development | Production |
+
+## Related commands
+
+- [`flash init`](/flash/cli/init) - Create a new project
+- [`flash deploy`](/flash/cli/deploy) - Deploy to production
+- [`flash undeploy`](/flash/cli/undeploy) - Remove endpoints
diff --git a/flash/cli/undeploy.mdx b/flash/cli/undeploy.mdx
new file mode 100644
index 00000000..8225182f
--- /dev/null
+++ b/flash/cli/undeploy.mdx
@@ -0,0 +1,213 @@
+---
+title: "undeploy"
+sidebarTitle: "undeploy"
+---
+
+Manage and delete Runpod Serverless endpoints deployed via Flash. Use this command to clean up endpoints created during local development with `flash run`.
+
+```bash
+flash undeploy [NAME|list] [OPTIONS]
+```
+
+## Example
+
+List all tracked endpoints:
+
+```bash
+flash undeploy list
+```
+
+Remove a specific endpoint:
+
+```bash
+flash undeploy ENDPOINT_NAME
+```
+
+Remove all endpoints:
+
+```bash
+flash undeploy --all
+```
+
+## Usage modes
+
+### List endpoints
+
+Display all tracked endpoints with their current status:
+
+```bash
+flash undeploy list
+```
+
+Output includes:
+
+- **Name**: Endpoint name
+- **Endpoint ID**: Runpod endpoint identifier
+- **Status**: Current health status (Active/Inactive/Unknown)
+- **Type**: Resource type (Live Serverless, Cpu Live Serverless, etc.)
+
+**Status indicators:**
+
+| Status | Meaning |
+|--------|---------|
+| Active | Endpoint is running and responding |
+| Inactive | Tracking exists but endpoint deleted externally |
+| Unknown | Error during health check |
+
+### Undeploy by name
+
+Delete a specific endpoint:
+
+```bash
+flash undeploy ENDPOINT_NAME
+```
+
+This:
+
+1. Searches for endpoints matching the name.
+2. Shows endpoint details.
+3. Prompts for confirmation.
+4. Deletes the endpoint from Runpod.
+5. Removes from local tracking.
+
+### Undeploy all
+
+Delete all tracked endpoints (requires double confirmation):
+
+```bash
+flash undeploy --all
+```
+
+Safety features:
+
+1. Shows total count of endpoints.
+2. First confirmation: Yes/No prompt.
+3. Second confirmation: Type "DELETE ALL" exactly.
+4. Deletes all endpoints from Runpod.
+5. Removes all from tracking.
+
+### Interactive selection
+
+Select endpoints to undeploy using checkboxes:
+
+```bash
+flash undeploy --interactive
+```
+
+Use arrow keys to navigate, space bar to select/deselect, and Enter to confirm.
+
+### Clean up stale tracking
+
+Remove inactive endpoints from tracking without API deletion:
+
+```bash
+flash undeploy --cleanup-stale
+```
+
+Use this when endpoints were deleted via the Runpod console or API (not through Flash). The local tracking file (`.runpod/resources.pkl`) becomes stale, and this command cleans it up.
+
+## Flags
+
+
+Undeploy all tracked endpoints. Requires double confirmation for safety.
+
+
+
+Interactive checkbox selection mode. Select multiple endpoints to undeploy.
+
+
+
+Remove inactive endpoints from local tracking without attempting API deletion. Use when endpoints were deleted externally.
+
+
+## Arguments
+
+
+Name of the endpoint to undeploy. Use `list` to show all endpoints.
+
+
+## undeploy vs env delete
+
+| Command | Scope | When to use |
+|---------|-------|-------------|
+| `flash undeploy` | Individual endpoints from local tracking | Development cleanup, granular control |
+| `flash env delete` | Entire environment + all resources | Production cleanup, full teardown |
+
+For production deployments, use `flash env delete` to remove entire environments and all associated resources.
+
+## How tracking works
+
+Flash tracks deployed endpoints in `.runpod/resources.pkl`. Endpoints are added when you:
+
+- Run `flash run --auto-provision`
+- Run `flash run` and call `@remote` functions
+- Run `flash deploy`
+
+The tracking file is in `.gitignore` and should never be committed. It contains local deployment state.
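Conceptually, the tracking file behaves like a pickled mapping from endpoint name to endpoint ID. The sketch below is illustrative; the real format of `.runpod/resources.pkl` is a Flash internal and may differ:

```python
import os
import pickle

# Load the tracking mapping, treating a missing file as empty state.
def read_tracking(path):
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)

# Persist the tracking mapping back to disk.
def write_tracking(entries, path):
    with open(path, "wb") as f:
        pickle.dump(entries, f)
```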
+
+## Common workflows
+
+### Basic cleanup
+
+```bash
+# Check what's deployed
+flash undeploy list
+
+# Remove a specific endpoint
+flash undeploy ENDPOINT_NAME
+
+# Clean up stale tracking
+flash undeploy --cleanup-stale
+```
+
+### Bulk operations
+
+```bash
+# Undeploy all endpoints
+flash undeploy --all
+
+# Interactive selection
+flash undeploy --interactive
+```
+
+### Managing external deletions
+
+If you delete endpoints via the Runpod console:
+
+```bash
+# Check status - will show as "Inactive"
+flash undeploy list
+
+# Remove stale tracking entries
+flash undeploy --cleanup-stale
+```
+
+## Troubleshooting
+
+### Endpoint shows as "Inactive"
+
+The endpoint was deleted via Runpod console or API. Clean up:
+
+```bash
+flash undeploy --cleanup-stale
+```
+
+### Can't find endpoint by name
+
+Check the exact name:
+
+```bash
+flash undeploy list
+```
+
+### Undeploy fails with API error
+
+1. Check `RUNPOD_API_KEY` in `.env`.
+2. Verify network connectivity.
+3. Check if the endpoint still exists on Runpod.
+
+## Related commands
+
+- [`flash run`](/flash/cli/run) - Development server (creates endpoints)
+- [`flash deploy`](/flash/cli/deploy) - Deploy to Runpod
+- [`flash env delete`](/flash/cli/env) - Delete entire environment
diff --git a/flash/custom-docker-images.mdx b/flash/custom-docker-images.mdx
new file mode 100644
index 00000000..da7b4842
--- /dev/null
+++ b/flash/custom-docker-images.mdx
@@ -0,0 +1,327 @@
+---
+title: "Use custom Docker images with Flash"
+sidebarTitle: "Custom Docker images"
+description: "Deploy pre-built Docker images with Flash using ServerlessEndpoint."
+tag: "BETA"
+---
+
+Flash's `LiveServerless` configuration handles most use cases by automatically managing dependencies and executing arbitrary Python code. However, for specialized environments that require custom Docker images—such as pre-built ML frameworks, specific CUDA versions, or system-level dependencies—you can use `ServerlessEndpoint` or `CpuServerlessEndpoint`.
+
+## When to use custom Docker images
+
+Use custom Docker images when you need:
+
+- **Pre-built inference servers**: vLLM, TensorRT-LLM, or other specialized serving frameworks.
+- **System-level dependencies**: Custom CUDA versions, cuDNN, or system libraries not installable via `pip`.
+- **Baked-in models**: Large models pre-downloaded in the image to avoid runtime downloads.
+- **Existing Serverless workers**: You already have a working Runpod Serverless Docker image that you want to use with Flash.
+
+
+For most use cases, you should use `LiveServerless` and [remote functions](/flash/remote-functions). It's simpler, faster, and lets you execute arbitrary Python code remotely.
+
+
+## How it works
+
+Unlike `LiveServerless`, which delivers your Python code to pre-built Flash workers, `ServerlessEndpoint` creates a traditional [Runpod Serverless endpoint](/serverless/overview) using any Docker image you specify.
+
+
+
+Here are the key differences between `ServerlessEndpoint` and `LiveServerless` resources:
+
+| Aspect | LiveServerless | ServerlessEndpoint |
+|--------|---------------|-------------------|
+| **Code execution** | Delivers Python code with each request | Uses the [handler function](/serverless/workers/handler-functions) in your Docker image |
+| **Input format** | Any Python arguments | Dictionary: `{"input": {...}}` |
+| **Docker image** | Pre-built Flash images | Your custom image |
+| **Dependencies** | Specified in decorator | Baked into Docker image |
+| **Use case** | Dynamic Python functions | Pre-built inference servers |
+
+## Basic usage
+
+
+
+Create a `ServerlessEndpoint` resource configuration pointing to your Docker image. For example:
+
+```python
+from runpod_flash import ServerlessEndpoint, GpuGroup
+
+config = ServerlessEndpoint(
+ name="my-custom-worker",
+ imageName="your-registry/your-image:tag",
+ gpus=[GpuGroup.AMPERE_24],
+ workersMax=3
+)
+```
+
+
+
+
+Call `.run()` with a dictionary payload in the format `{"input": {...}}`:
+
+```python
+import asyncio
+from runpod_flash import ServerlessEndpoint, GpuGroup, ResourceManager
+
+async def main():
+ # Explicitly provision the endpoint if it doesn't already exist
+ manager = ResourceManager()
+ deployed_endpoint = await manager.get_or_deploy_resource(config)
+
+ # Send a request to the endpoint
+    result = await deployed_endpoint.run({
+ "input": {
+ "prompt": "Your input data",
+ "param1": "value1"
+ }
+ })
+ print(result)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+**No `@remote` decorator is needed**. The endpoint will process the request using the [handler function](/serverless/workers/handler-functions) that's baked into your Docker image.
+
+
+
+
+## Complete example: vLLM inference
+
+This example uses Runpod's official [vLLM worker](/serverless/vllm/overview) to deploy the `microsoft/Phi-3.5-mini-instruct` language model:
+
+```python title="vllm_example.py"
+import asyncio
+from runpod_flash import ServerlessEndpoint, GpuGroup, ResourceManager
+
+# Configure vLLM endpoint
+vllm_config = ServerlessEndpoint(
+ name="vllm-small-model",
+ imageName="runpod/worker-vllm:stable-cuda12.1.0",
+ gpus=[GpuGroup.AMPERE_24], # RTX 4090 or similar (24GB)
+ workersMax=3,
+ env={
+ "MODEL_NAME": "microsoft/Phi-3.5-mini-instruct",
+ "MAX_MODEL_LEN": "4096",
+ "GPU_MEMORY_UTILIZATION": "0.9",
+ "MAX_CONCURRENCY": "30",
+ }
+)
+
+async def main():
+ # Explicitly provision the endpoint if it doesn't exist
+ manager = ResourceManager()
+ deployed_endpoint = await manager.get_or_deploy_resource(vllm_config)
+
+ print(f"Endpoint deployed at: {deployed_endpoint.endpoint_url}")
+
+ # Generate text
+ result = await deployed_endpoint.run({
+ "input": {
+ "prompt": "Explain quantum computing in simple terms:",
+ "max_tokens": 100,
+ "temperature": 0.7
+ }
+ })
+
+ # Extract the generated text
+ text = result.output[0]['choices'][0]['tokens'][0]
+ print(f"\nGenerated text: {text}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Here's what happens when you run this code:
+
+1. **Resource configuration**: The `ServerlessEndpoint` configuration specifies the official Runpod [vLLM worker](/serverless/vllm/overview) Docker image and GPU requirements.
+2. **Environment variables**: Model and vLLM settings are configured via `env`.
+3. **Provisioning**: In `main()`, `ResourceManager.get_or_deploy_resource()` creates the endpoint if it doesn't already exist.
+4. **Request**: The input is sent as a dictionary via `.run()` to the deployed vLLM endpoint, matching the worker's expected input format.
+5. **Response**: The results are extracted from the nested response structure.
+
+## Available Docker images
+
+### Official Runpod workers
+
+Runpod provides pre-built worker images for common frameworks:
+
+| Framework | Image | Image link |
+|-----------|-------|---------------|
+| vLLM | `runpod/worker-vllm` | [Link](https://hub.docker.com/r/runpod/worker-vllm) |
+| Automatic1111 | `runpod/worker-a1111:stable` | [A1111 docs](/serverless/workers/sdxl-a1111) |
+| ComfyUI | `runpod/worker-comfyui` | [Link](https://hub.docker.com/r/runpod/worker-comfyui) |
+
+### Custom images
+
+To use your own Docker image:
+
+1. **Build a handler**: Follow the [Serverless handler guide](/serverless/workers/handler-functions).
+2. **Create a Dockerfile**: Package your handler with dependencies.
+3. **Push to registry**: Upload to Docker Hub, GitHub Container Registry, or Runpod's registry.
+4. **Use in Flash**: Reference the image in `imageName`.
+
+See [Deploy custom workers](/serverless/workers/deploy) for details.
+
+## Configuration options
+
+All parameters from `LiveServerless` are available:
+
+```python
+config = ServerlessEndpoint(
+ name="custom-worker",
+ imageName="your-registry/image:tag", # Required
+ gpus=[GpuGroup.AMPERE_80],
+ workersMin=0,
+ workersMax=5,
+ idleTimeout=10,
+ env={
+ "MODEL_PATH": "/models/llama",
+ "MAX_BATCH_SIZE": "32"
+ },
+ networkVolumeId="vol_abc123", # Optional: persistent storage
+ executionTimeoutMs=300000 # 5 minutes
+)
+```
+
+See the [resource configuration reference](/flash/resource-configuration) for all available options.
+
+## CPU endpoints
+
+For CPU workloads, use `CpuServerlessEndpoint`:
+
+```python
+from runpod_flash import CpuServerlessEndpoint, CpuInstanceType
+
+config = CpuServerlessEndpoint(
+ name="cpu-worker",
+ imageName="your-registry/cpu-worker:latest",
+ instanceIds=[CpuInstanceType.CPU5C_4_8] # 4 vCPU, 8GB RAM
+)
+```
+
+## Environment variables
+
+Pass configuration to your Docker image via environment variables. For example:
+
+```python
+config = ServerlessEndpoint(
+ name="vllm-worker",
+ imageName="runpod/worker-vllm:stable-cuda12.1.0",
+ env={
+ "MODEL_NAME": "meta-llama/Llama-3.2-3B-Instruct",
+ "MAX_MODEL_LEN": "8192",
+ "HF_TOKEN": "hf_...", # For gated models
+ "TRUST_REMOTE_CODE": "True"
+ }
+)
+```
+
+## Explicit provisioning
+
+Before you can make requests, you'll need to provision the endpoint if it doesn't already exist. For example:
+
+```python
+from runpod_flash import ResourceManager
+
+async def main():
+ manager = ResourceManager()
+ deployed = await manager.get_or_deploy_resource(config)
+
+ print(f"Endpoint ID: {deployed.id}")
+ print(f"Endpoint URL: {deployed.endpoint_url}")
+
+ # Now make requests
+ result = await deployed.run({"input": {...}})
+```
+
+## Request/response format
+
+### Request structure
+
+All requests must use the format `{"input": {...}}`. For example:
+
+```python
+{
+ "input": {
+ # Your worker-specific parameters
+ "param1": "value1",
+ "param2": "value2"
+ }
+}
+```
+
+### Response structure
+
+The response is a `JobOutput` object with these attributes:
+
+```python
+result.id # Job ID
+result.workerId # Worker that processed the request
+result.status # COMPLETED, IN_PROGRESS, FAILED
+result.delayTime # Queue delay in ms
+result.executionTime # Execution time in ms
+result.output # Worker response (structure varies by worker)
+result.error # Error message if failed
+```
+
+Extract data from `result.output` based on your worker's output format.
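+
+Because the nested shape varies by worker, it's safer to navigate it defensively. A small, hypothetical helper for the vLLM-style response shown above (`extract_generated_text` is not part of the Flash SDK):
+
+```python
+def extract_generated_text(output):
+    """Pull the first generated token entry from a vLLM-style
+    response; return None if the shape doesn't match."""
+    try:
+        return output[0]["choices"][0]["tokens"][0]
+    except (KeyError, IndexError, TypeError):
+        return None
+```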
+
+
+
+## Limitations
+
+- **Input format**: Only supports dictionary payloads `{"input": {...}}`. You cannot pass arbitrary Python arguments like with `LiveServerless`.
+- **Code execution**: Cannot execute arbitrary Python code remotely. Your Docker image must include all logic.
+- **@remote decorator**: The `@remote` decorator does not work with `ServerlessEndpoint`. Use `.run()` directly.
+- **Handler required**: Your Docker image must implement a Runpod Serverless [handler function](/serverless/workers/handler-functions).
+
+## Troubleshooting
+
+### Endpoint fails to initialize
+
+**Problem**: Workers fail to start or crash immediately.
+
+**Solutions**:
+
+- Check that your Docker image is compatible with [Runpod Serverless](/serverless/overview).
+- Verify environment variables are correct.
+- Ensure the image includes a valid handler function.
+- Check worker logs in the Runpod console.
+
+### Out of memory errors
+
+**Problem**: Workers crash with CUDA OOM or RAM errors.
+
+**Solutions**:
+
+- Use a larger GPU: `gpus=[GpuGroup.AMPERE_80]`
+- Reduce `GPU_MEMORY_UTILIZATION` (for vLLM/ML frameworks).
+- Lower `MAX_MODEL_LEN` or batch size.
+- Reduce `workersMax` to limit parallel execution.
+
+### Wrong response format
+
+**Problem**: Cannot extract data from `result.output`.
+
+**Solutions**:
+
+- Check your worker's documentation for response format.
+- Print the full `result` to see the structure.
+- Look at worker logs for errors.
+
+### Authentication errors
+
+**Problem**: Cannot download gated models or private images.
+
+**Solutions**:
+
+- Add `HF_TOKEN` to `env` for Hugging Face gated models.
+- Configure Docker registry authentication in Runpod console for private images.
+- Verify API keys are correct.
+
+## Next steps
+
+- [View the resource configuration reference](/flash/resource-configuration) for all `ServerlessEndpoint` options.
+- [Learn about vLLM deployment](/serverless/vllm/overview) for LLM inference.
+- [Build custom Serverless workers](/serverless/workers/overview) for specialized use cases.
+- [Create Flash apps](/flash/apps/build-app) combining custom images with FastAPI.
diff --git a/flash/monitoring.mdx b/flash/monitoring.mdx
new file mode 100644
index 00000000..96212791
--- /dev/null
+++ b/flash/monitoring.mdx
@@ -0,0 +1,177 @@
+---
+title: "Monitor and debug remote functions"
+sidebarTitle: "Monitor and debug"
+description: "Monitor, debug, and troubleshoot Flash deployments."
+tag: "BETA"
+---
+
+This page covers how to monitor and debug your Flash deployments, including viewing logs, troubleshooting common issues, and optimizing performance.
+
+## Viewing logs
+
+When running Flash functions, logs are displayed in your terminal. The output includes:
+
+- Endpoint creation and reuse status.
+- Job submission and queue status.
+- Execution progress.
+- Worker information (delay time, execution time).
+
+Example output:
+
+```text
+2025-11-19 12:35:15,109 | INFO | Created endpoint: rb50waqznmn2kg - flash-quickstart-fb
+2025-11-19 12:35:15,112 | INFO | URL: https://console.runpod.io/serverless/user/endpoint/rb50waqznmn2kg
+2025-11-19 12:35:15,114 | INFO | LiveServerless:rb50waqznmn2kg | API /run
+2025-11-19 12:35:15,655 | INFO | LiveServerless:rb50waqznmn2kg | Started Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2
+2025-11-19 12:35:15,762 | INFO | Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2 | Status: IN_QUEUE
+2025-11-19 12:36:09,983 | INFO | Job:b0b341e7-e460-4305-9acd-fc2dfd1bd65c-u2 | Status: COMPLETED
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Delay Time: 51842 ms
+2025-11-19 12:36:10,068 | INFO | Worker:icmkdgnrmdf8gz | Execution Time: 1533 ms
+```
+
+### Log levels
+
+You can control log verbosity using the `LOG_LEVEL` environment variable:
+
+```bash
+LOG_LEVEL=DEBUG python your_script.py
+```
+
+Available log levels: `DEBUG`, `INFO`, `WARNING`, `ERROR`.
+
+## Monitoring in the Runpod console
+
+View detailed metrics and logs in the [Runpod console](https://www.runpod.io/console/serverless):
+
+1. Navigate to the **Serverless** section.
+2. Click on your endpoint to view:
+ - Active workers and queue depth.
+ - Request history and job status.
+ - Worker logs and execution details.
+ - Metrics (requests, latency, errors).
+
+### Endpoint metrics
+
+The console provides metrics including:
+
+- **Request rate**: Number of requests per minute.
+- **Queue depth**: Number of pending requests.
+- **Latency**: Average response time.
+- **Worker count**: Active and idle workers.
+- **Error rate**: Failed requests percentage.
+
+## Debugging common issues
+
+### Cold start delays
+
+If you're experiencing slow initial responses:
+
+- **Cause**: Workers need time to start, load dependencies, and initialize models.
+- **Solutions**:
+ - Set `workersMin=1` to keep at least one worker warm.
+ - Use smaller models or optimize model loading.
+ - Use `--auto-provision` with `flash run` for development.
+
+```python
+config = LiveServerless(
+ name="always-warm",
+ workersMin=1, # Keep one worker always running
+ idleTimeout=30 # Longer idle timeout
+)
+```
+
+### Timeout errors
+
+If requests are timing out:
+
+- **Cause**: Execution taking longer than the timeout limit.
+- **Solutions**:
+ - Increase `executionTimeoutMs` in your configuration.
+ - Optimize your function to run faster.
+ - Break long operations into smaller chunks.
+
+```python
+config = LiveServerless(
+ name="long-running",
+ executionTimeoutMs=600000 # 10 minutes
+)
+```
+
+### Memory errors
+
+If you're seeing out-of-memory errors:
+
+- **Cause**: Model or data too large for available GPU/CPU memory.
+- **Solutions**:
+ - Use a larger GPU type (e.g., `GpuGroup.AMPERE_80` for 80GB VRAM).
+ - Use model quantization or smaller batch sizes.
+ - Clear GPU memory between operations.
+
+```python
+config = LiveServerless(
+ name="large-model",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ template=PodTemplate(containerDiskInGb=100) # More disk space
+)
+```
+
+### Dependency errors
+
+If packages aren't being installed correctly:
+
+- **Cause**: Missing or incompatible dependencies.
+- **Solutions**:
+ - Verify package names and versions in the `dependencies` list.
+ - Check that packages have Linux `x86_64` wheels available.
+ - Import packages inside the function, not at the top of the file.
+
+```python
+@remote(
+ resource_config=config,
+ dependencies=["torch==2.0.0", "transformers==4.36.0"] # Pin versions
+)
+def my_function(data):
+ import torch # Import inside the function
+ import transformers
+ # ...
+```
+
+### Authentication errors
+
+If you're seeing API key errors:
+
+- **Cause**: Missing or invalid Runpod API key.
+- **Solutions**:
+ - Verify your API key is set in the environment.
+ - Check that the `.env` file is in the correct directory.
+ - Ensure the API key has the required permissions.
+
+```bash
+# Check if API key is set
+echo $RUNPOD_API_KEY
+
+# Set API key directly
+export RUNPOD_API_KEY=your_api_key_here
+```
+
+## Performance optimization
+
+### Reducing cold starts
+
+- Set `workersMin=1` for endpoints that need fast responses.
+- Use `idleTimeout` to balance cost and warm worker availability.
+- Cache models on network volumes to reduce loading time.
+
+### Optimizing execution time
+
+- Profile your functions to identify bottlenecks.
+- Use appropriate GPU types for your workload.
+- Batch multiple inputs into a single request when possible.
+- Use async operations to parallelize independent tasks.
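+
+One way to reduce per-request overhead is to batch inputs client-side before calling a remote function. A minimal, hypothetical helper (not part of the Flash SDK):
+
+```python
+def make_batches(items, batch_size=8):
+    # Group items into fixed-size batches; the last batch may be smaller
+    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
+```
+
+Each batch can then be passed to a single `@remote` call instead of one call per item.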
+
+### Managing costs
+
+- Set appropriate `workersMax` limits to control scaling.
+- Use CPU workers for non-GPU tasks.
+- Monitor usage in the console to identify optimization opportunities.
+- Use shorter `idleTimeout` for sporadic workloads.
\ No newline at end of file
diff --git a/flash/overview.mdx b/flash/overview.mdx
new file mode 100644
index 00000000..106a00b8
--- /dev/null
+++ b/flash/overview.mdx
@@ -0,0 +1,260 @@
+---
+title: "Overview"
+sidebarTitle: "Overview"
+description: "Rapidly develop and deploy AI/ML apps with the Flash Python SDK."
+tag: "BETA"
+---
+
+import { ServerlessTooltip, PodsTooltip, WorkersTooltip } from "/snippets/tooltips.jsx";
+
+
+Flash is currently in beta. [Join our Discord](https://discord.gg/cUpRmau42V) to provide feedback and get support.
+
+
+Flash is a Python SDK for developing and deploying AI workflows on [Runpod Serverless](/serverless/overview). You write Python functions locally, and Flash handles infrastructure management, GPU/CPU provisioning, dependency installation, and data transfer automatically.
+
+
+
+ Write a standalone Flash script for instant access to Runpod infrastructure.
+
+
+ Create a Flash app with a FastAPI server and deploy it on Runpod to serve production endpoints.
+
+
+
+## Why use Flash?
+
+Flash is the easiest and fastest way to develop and deploy AI/ML workloads on Runpod:
+
+- **No Docker images or manual resource management:** Unlike traditional Runpod Serverless (which requires you to build custom Docker images) or Pods (which require manual management and bill 24/7), Flash automatically handles infrastructure using simple Python decorators.
+- **Write [remote functions](#remote-functions) using local Python scripts:** Run the script, and Flash provisions endpoints, installs dependencies, and scales GPU/CPU automatically.
+- **Instant updates without rebuilds:** When you update your code, changes can be deployed to your workers instantly without requiring you to rebuild/redeploy the worker image—just run the script again.
+- **Granular hardware control:** Specify the [exact hardware](#resource-configuration) you need for each function, from RTX 4090s to A100 80GB GPUs, enabling you to optimize for cost and performance for AI inference, training, and other compute-intensive tasks.
+- **Production-ready architecture:** When you're ready to deploy your code to production, build a [Flash app](/flash/apps/overview) with a FastAPI server to route requests between GPU/CPU workers. The [Flash CLI](/flash/cli/overview) gives you full control over the app's development and deployment lifecycle.
+- **Pay only for compute time:** Flash uses the same per-second pricing model as [Runpod Serverless](/serverless/pricing). You're only charged for actual compute time—there are no costs when your code isn't running.
+
+## Install Flash
+
+Create a Python virtual environment and use `pip` to install Flash:
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install runpod-flash
+```
+
+
+Flash requires Python 3.10 or higher.
+
+
+In your project directory, create a `.env` file and add your [Runpod API key](/get-started/api-keys), replacing `YOUR_API_KEY` with your actual API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+
+Your Flash API key needs **All** access permissions to your Runpod account. You can generate an API key with the correct permissions from [Settings > API Keys](https://www.runpod.io/console/user/settings) in the Runpod console.
+
+
+
+## Core concepts
+
+### Remote functions
+
+The `@remote` decorator marks functions for execution on Runpod's infrastructure. Code inside the decorated function runs remotely on a Serverless worker, while code outside the function runs locally on your machine.
+
+```python
+@remote(resource_config=config, dependencies=["pandas"])
+def process_data(data):
+ # This code runs remotely on Runpod
+ import pandas as pd
+ df = pd.DataFrame(data)
+ return df.describe().to_dict()
+
+async def main():
+ # This code runs locally
+ result = await process_data(my_data)
+```
+
+When you run a remote function, Flash:
+- Automatically provisions resources on Runpod's infrastructure.
+- Installs your dependencies automatically.
+- Runs your function on a remote GPU/CPU.
+- Returns the result to your local environment.
+
+[Learn more about remote functions](/flash/remote-functions).
+
+### Resource configuration
+
+Flash provides fine-grained control over hardware allocation through configuration objects. You can configure GPU types, worker counts, idle timeouts, environment variables, and more.
+
+```python
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ workersMax=5
+)
+```
+
+[View the complete resource configuration reference](/flash/resource-configuration).
+
+### Dependency management
+
+Specify Python packages in the decorator, and Flash installs them automatically on the remote worker:
+
+```python
+@remote(
+ resource_config=gpu_config,
+ dependencies=["transformers==4.36.0", "torch", "pillow"]
+)
+def generate_image(prompt):
+ # Import inside the function
+ from transformers import pipeline
+ # ...
+```
+
+Imports should be placed inside the function body because they need to happen on the remote worker, not in your local environment.
+
+[Learn more about dependency management](/flash/remote-functions#dependency-management).
+
+### Parallel execution
+
+Run multiple remote functions concurrently using Python's async capabilities:
+
+```python
+results = await asyncio.gather(
+ process_item(item1),
+ process_item(item2),
+ process_item(item3)
+)
+```
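+
+Because `@remote` functions are awaitable, standard asyncio patterns apply. Here's a self-contained sketch using plain local coroutines as stand-ins for remote calls (no Runpod resources involved):
+
+```python
+import asyncio
+
+async def process_item(item):
+    # Stand-in for an awaitable @remote function call
+    await asyncio.sleep(0)
+    return item * 2
+
+async def main():
+    # Fan out three calls concurrently and collect the results in order
+    return await asyncio.gather(*(process_item(i) for i in range(3)))
+
+if __name__ == "__main__":
+    print(asyncio.run(main()))  # [0, 2, 4]
+```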
+
+## Development workflows
+
+Flash supports two primary workflows for running workloads on Runpod: standalone scripts and Flash apps.
+
+### Standalone scripts
+
+This is the fastest way to get started with Flash. Just write a Python script with `@remote` decorated functions and run it locally with `python script.py`.
+
+```python
+import asyncio
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+config = LiveServerless(
+ name="gpu-inference",
+ gpus=[GpuGroup.ADA_24],
+)
+
+@remote(resource_config=config, dependencies=["torch"])
+def process_on_gpu(data):
+ import torch
+ # Your GPU workload here
+ return {"result": "processed"}
+
+async def main():
+ result = await process_on_gpu({"input": "data"})
+ print(result)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Run the script locally, and Flash executes the `@remote` function on Runpod's infrastructure:
+
+```bash
+python script.py
+```
+
+**Use this approach for:**
+- Quick prototypes and experiments.
+- Batch processing jobs.
+- One-off data processing tasks.
+- Local development and testing.
+
+[Follow the quickstart](/flash/quickstart) to create your first Flash script.
+
+### Flash apps
+
+When you're ready to build a production-ready API, you can create a [Flash app](/flash/apps/overview) with FastAPI and deploy it to Runpod. Flash apps provide a complete development and deployment workflow with local testing and production deployment.
+
+To get started, initialize a new Flash app project in your current directory:
+
+```bash
+flash init
+```
+
+This creates a new project with a FastAPI server and example workers. Remote functions are defined in the `workers/` directory.
+
+Start a local development server to test your app:
+
+```bash
+flash run
+```
+
+Deploy your app to production when ready:
+
+```bash
+flash deploy
+```
+
+**Use this approach for:**
+
+- Production HTTP APIs.
+- Persistent endpoints.
+- Long-running services.
+- Team collaboration with staging/production environments.
+
+[Follow this tutorial](/flash/apps/build-app) to build your first Flash app.
+
+## Use cases
+
+Flash is well-suited for a range of AI and data processing workloads:
+
+- **Multi-modal AI pipelines**: Orchestrate unified workflows combining text, image, and audio models with GPU acceleration.
+- **Distributed model training**: Scale training operations across multiple GPU workers for faster model development.
+- **AI research experimentation**: Rapidly prototype and test complex model combinations without infrastructure overhead.
+- **Production inference systems**: Deploy multi-stage inference pipelines for real-world applications.
+- **Data processing workflows**: Process large datasets using CPU workers for general computation and GPU workers for accelerated tasks.
+- **Hybrid GPU/CPU workflows**: Optimize cost and performance by combining CPU preprocessing with GPU inference.
+
+## Coding agent integration
+
+Flash provides a skill package for AI coding agents like Claude Code, Cline, and Cursor. The skill gives these agents detailed context about the Flash SDK, CLI, best practices, and common patterns.
+
+Install the Flash skill by running the following command in your terminal:
+
+```bash
+npx skills add runpod/skills
+```
+
+This allows your coding agent to provide more accurate Flash code suggestions and troubleshooting help. You can find the Flash `SKILL.md` file in the [runpod/skills repository](https://github.com/runpod/skills/blob/main/flash/SKILL.md).
+
+## Limitations
+
+- Serverless deployments using Flash are currently restricted to the `EU-RO-1` datacenter.
+- Be aware of your account's maximum worker capacity limits. Flash can rapidly scale workers across multiple endpoints, and you may hit capacity constraints. Contact [Runpod support](https://www.runpod.io/contact) to increase your account's capacity allocation if needed.
+
+## Next steps
+
+
+
+ Write your first standalone script with Flash
+
+
+ Create a FastAPI app with Flash
+
+
+ Complete reference for resource configuration
+
+
+ Learn about Flash CLI commands
+
+
+
+## Getting help
+
+Join the [Runpod community on Discord](https://discord.gg/cUpRmau42V) for support and discussion.
diff --git a/flash/pricing.mdx b/flash/pricing.mdx
new file mode 100644
index 00000000..28ca0df8
--- /dev/null
+++ b/flash/pricing.mdx
@@ -0,0 +1,109 @@
+---
+title: "Pricing"
+sidebarTitle: "Pricing"
+description: "Understand Flash pricing and optimize your costs."
+tag: "BETA"
+---
+
+Flash follows the same pricing model as [Runpod Serverless](/serverless/pricing). You pay per second of compute time, with no charges when your code isn't running. Pricing depends on the GPU or CPU type you configure for your endpoints.
+
+## How pricing works
+
+You're billed from when a worker starts until it completes your request, plus any idle time before scaling down. If a worker is already warm, you skip the cold start and only pay for execution time.
+
+### Compute cost breakdown
+
+Flash workers incur charges during these periods:
+
+1. **Start time**: The time required to initialize a worker and load models into GPU memory. This includes starting the container, installing dependencies, and preparing the runtime environment.
+2. **Execution time**: The time spent processing your request (running your `@remote` decorated function).
+3. **Idle time**: The period a worker remains active after completing a request, waiting for additional requests before scaling down.
+
+### Pricing by resource type
+
+Flash supports both GPU and CPU workers. Pricing varies based on the hardware type:
+
+- **GPU workers**: Use `LiveServerless` or `ServerlessEndpoint` with GPU configurations. Pricing depends on the GPU type (e.g., RTX 4090, A100 80GB).
+- **CPU workers**: Use `LiveServerless` or `CpuServerlessEndpoint` with CPU configurations. Pricing depends on the CPU instance type.
+
+See the [Serverless pricing page](/serverless/pricing) for current rates by GPU and CPU type.
+
+## How to estimate and optimize costs
+
+To estimate costs for your Flash workloads, consider:
+
+- How long each function takes to execute.
+- How many concurrent workers you need (`workersMax` setting).
+- Which GPU or CPU types you'll use.
+- Your idle timeout configuration (`idleTimeout` setting).
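+
+As a back-of-envelope model, these factors combine into a simple estimate. The rate below is a placeholder for illustration, not an actual Runpod price; check the [Serverless pricing page](/serverless/pricing) for current rates:
+
+```python
+RATE_PER_SECOND = 0.00076  # hypothetical GPU rate in USD per second
+
+def estimate_cost(requests, exec_seconds, idle_seconds=5):
+    """Rough cost: each request pays for execution time plus the
+    idle window before the worker scales down."""
+    billable_seconds = requests * (exec_seconds + idle_seconds)
+    return billable_seconds * RATE_PER_SECOND
+```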
+
+### Cost optimization strategies
+
+#### Choose appropriate hardware
+
+Select the smallest GPU or CPU that meets your performance requirements. For example, if your workload fits in 24GB of VRAM, use `GpuGroup.ADA_24` or `GpuGroup.AMPERE_24` instead of larger GPUs.
+
+```python
+# Cost-effective configuration for workloads that fit in 24GB VRAM
+config = LiveServerless(
+ name="cost-optimized",
+ gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24], # RTX 4090, L4, A5000, 3090
+)
+```
+
+#### Configure idle timeouts
+
+Balance responsiveness and cost by adjusting the `idleTimeout` parameter. Shorter timeouts reduce idle costs but increase cold starts for sporadic traffic.
+
+```python
+# Lower idle timeout for cost savings (more cold starts)
+config = LiveServerless(
+ name="low-idle",
+ idleTimeout=5, # 5 seconds (default)
+)
+
+# Higher idle timeout for responsiveness (higher idle costs)
+config = LiveServerless(
+ name="responsive",
+ idleTimeout=30, # 30 seconds
+)
+```
+
+#### Use CPU workers for non-GPU tasks
+
+For data preprocessing, postprocessing, or other tasks that don't require GPU acceleration, use CPU workers instead of GPU workers.
+
+```python
+from runpod_flash import LiveServerless, CpuInstanceType
+
+# CPU configuration for non-GPU tasks
+cpu_config = LiveServerless(
+ name="data-processor",
+ instanceIds=[CpuInstanceType.CPU5C_2_4], # 2 vCPU, 4GB RAM
+)
+```
+
+#### Limit maximum workers
+
+Set `workersMax` to prevent runaway scaling and unexpected costs:
+
+```python
+config = LiveServerless(
+ name="controlled-scaling",
+ workersMax=3, # Limit to 3 concurrent workers
+)
+```
+
+### Monitoring costs
+
+Monitor your usage in the [Runpod console](https://www.runpod.io/console/serverless) to track:
+
+- Total compute time across endpoints.
+- Worker utilization and idle time.
+- Cost breakdown by endpoint.
+
+## Next steps
+
+- [Create remote functions](/flash/remote-functions) with optimized resource configurations.
+- [View Serverless pricing details](/serverless/pricing) for current rates.
+- [Configure resources](/flash/resource-configuration) for your workloads.
diff --git a/flash/quickstart.mdx b/flash/quickstart.mdx
new file mode 100644
index 00000000..60770d89
--- /dev/null
+++ b/flash/quickstart.mdx
@@ -0,0 +1,358 @@
+---
+title: "Get started with Flash"
+sidebarTitle: "Quickstart"
+description: "Run your first GPU workload with Flash in less than 5 minutes."
+tag: "BETA"
+---
+
+This quickstart gets you running GPU workloads on Runpod in minutes. You'll execute a function on a remote GPU and see the results immediately.
+
+## Requirements
+
+- [Runpod account](/get-started/manage-accounts).
+- [An API key](/get-started/api-keys) with **All** access permissions to your Runpod account.
+- [Python 3.10+](https://www.python.org/downloads/) installed.
+
+## Step 1: Install Flash
+
+Create a virtual environment and install Flash:
+
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install runpod-flash
+```
+
+## Step 2: Add your API key to your environment
+
+Create a `.env` file with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual API key from the [Runpod console](https://www.runpod.io/console/user/settings).
+
+
+Your API key needs **All** access permissions to your Runpod account.
+
+
+## Step 3: Copy this code
+
+Create a file called `gpu_demo.py` and paste this code into it:
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+# Load API key from .env
+load_dotenv()
+
+# Configure GPU resources
+config = LiveServerless(
+ name="flash-quickstart",
+ gpus=[GpuGroup.ADA_24], # RTX 4090
+ workersMax=3
+)
+
+# Define a function that runs on GPU
+@remote(resource_config=config, dependencies=["numpy", "torch"])
+def gpu_matrix_multiply(size):
+ # IMPORTANT: Import packages INSIDE the function
+ import numpy as np
+ import torch
+
+ # Get GPU name
+ device_name = torch.cuda.get_device_name(0)
+
+ # Create random matrices
+ A = np.random.rand(size, size)
+ B = np.random.rand(size, size)
+
+ # Multiply matrices
+ C = np.dot(A, B)
+
+ return {
+ "matrix_size": size,
+ "result_mean": float(np.mean(C)),
+ "gpu": device_name
+ }
+
+# Call the function
+async def main():
+ print("Running matrix multiplication on Runpod GPU...")
+ result = await gpu_matrix_multiply(1000)
+
+ print(f"\n✓ Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"✓ Result mean: {result['result_mean']:.4f}")
+ print(f"✓ GPU used: {result['gpu']}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Step 4: Run it
+
+Execute the script:
+
+```bash
+python gpu_demo.py
+```
+
+You'll see Flash provision a GPU worker and execute your function:
+
+```text
+Running matrix multiplication on Runpod GPU...
+Creating endpoint: server_LiveServerless_a1b2c3d4
+Provisioning Serverless endpoint...
+Endpoint ready
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+
+✓ Matrix size: 1000x1000
+✓ Result mean: 249.8286
+✓ GPU used: NVIDIA GeForce RTX 4090
+```
+
+The first run takes 30-60 seconds while Runpod provisions the endpoint, installs dependencies, and starts a worker. Subsequent runs take only 2-3 seconds, because the worker is already running.
+
+
+Try running the script again immediately and notice how much faster it is. Flash reuses the same endpoint and cached dependencies. You can even update the code and run it again to see the changes take effect instantly.
+
+
+## Step 5: Understand what you just did
+
+Let's break down the code you just ran:
+
+### Imports and setup
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+load_dotenv()
+```
+
+- **`asyncio`**: Enables asynchronous execution (Flash functions run async).
+- **`load_dotenv()`**: Loads your `RUNPOD_API_KEY` from the `.env` file for authentication.
+- **`remote`, `LiveServerless`, `GpuGroup`**: Core Flash components.
+
+### Resource configuration
+
+```python
+config = LiveServerless(
+ name="flash-quickstart",
+ gpus=[GpuGroup.ADA_24],
+ workersMax=3
+)
+```
+
+This tells Flash what hardware to use:
+
+- **`name`**: Identifies your endpoint in the [Runpod console](https://www.runpod.io/console/serverless).
+- **`gpus`**: Which GPU types to use (here: RTX 4090 with 24GB VRAM).
+- **`workersMax`**: Maximum parallel workers (allows 3 concurrent executions).
+
+See [GPU types](/references/gpu-types#gpu-pools) for all available GPUs or [resource configuration](/flash/resource-configuration) for all options.
+
+### Remote function
+
+```python
+@remote(resource_config=config, dependencies=["numpy", "torch"])
+def gpu_matrix_multiply(size):
+ import numpy as np
+ import torch
+
+ # Get GPU name
+ device_name = torch.cuda.get_device_name(0)
+
+ # Create random matrices
+ A = np.random.rand(size, size)
+ B = np.random.rand(size, size)
+
+ # Multiply matrices
+ C = np.dot(A, B)
+
+ return {
+ "matrix_size": size,
+ "result_mean": float(np.mean(C)),
+ "gpu": device_name
+ }
+```
+
+The `@remote` decorator marks the function to run on Runpod's infrastructure:
+
+- **`resource_config=config`**: Uses the GPU configuration you defined.
+- **`dependencies=["numpy", "torch"]`**: Packages to install on the worker.
+- **Function body**: The matrix multiplication code runs on the remote GPU, not your local machine.
+- **Return value**: The result is returned to your local machine as a Python dictionary.
+
+
+You must import packages **inside the function body**, not at the top of your file. These imports need to happen on the remote worker.
+
+
+### Calling the function
+
+```python
+async def main():
+ print("Running matrix multiplication on Runpod GPU...")
+ result = await gpu_matrix_multiply(1000)
+
+ print(f"\n✓ Matrix size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f"✓ Result mean: {result['result_mean']:.4f}")
+ print(f"✓ GPU used: {result['gpu']}")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Here's what happens when you run the `@remote` decorated function:
+
+1. Flash checks if the endpoint specified in your `LiveServerless` configuration already exists.
+ - If yes: It updates the endpoint if the configuration has changed.
+ - If no: It creates a new endpoint, initializes a worker, and installs your dependencies.
+2. Flash sends your code to the GPU worker.
+3. The GPU worker executes the function with the provided inputs.
+4. The result is returned to your local machine as a Python dictionary, where it's printed in your terminal.
+
+Everything outside the `@remote` function (all the `print` statements, etc.) runs **locally on your machine**. Only the decorated function runs remotely.
+
+## Step 6: Run multiple operations in parallel
+
+Flash makes it easy to run multiple GPU operations concurrently. Replace your `main()` function with the code below:
+
+```python
+async def main():
+ print("Running 3 matrix operations in parallel...")
+
+ # Run all three operations at once
+ results = await asyncio.gather(
+ gpu_matrix_multiply(500),
+ gpu_matrix_multiply(1000),
+ gpu_matrix_multiply(2000)
+ )
+
+ # Print results
+ for i, result in enumerate(results, 1):
+ print(f"\n{i}. Size: {result['matrix_size']}x{result['matrix_size']}")
+ print(f" Mean: {result['result_mean']:.4f}")
+ print(f" GPU: {result['gpu']}")
+```
+
+Run the script again:
+
+```bash
+python gpu_demo.py
+```
+
+All three operations execute simultaneously:
+
+```text
+Running 3 matrix operations in parallel...
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Initial job status: IN_QUEUE
+Job completed, output received
+Job completed, output received
+Job completed, output received
+
+1. Size: 500x500
+ Mean: 125.3097
+ GPU: NVIDIA GeForce RTX 4090
+
+2. Size: 1000x1000
+ Mean: 249.9442
+ GPU: NVIDIA GeForce RTX 4090
+
+3. Size: 2000x2000
+ Mean: 500.1321
+ GPU: NVIDIA GeForce RTX 4090
+```
+
+## Clean up
+
+When you're done testing, clean up the endpoints:
+
+```bash
+# List all endpoints
+flash undeploy list
+
+# Remove the quickstart endpoint
+flash undeploy flash-quickstart
+
+# Or remove all endpoints
+flash undeploy --all
+```
+
+## Next steps
+
+You've successfully run GPU code on Runpod! Now you're ready to learn more about Flash:
+
+
+
+ Use Hugging Face transformers to generate text with GPT-2
+
+
+ Learn how to configure and optimize remote functions
+
+
+ Choose GPUs, adjust workers, set timeouts
+
+
+ Deploy production APIs with FastAPI
+
+
+
+## Troubleshooting
+
+### Authentication error
+
+```text
+Error: API key is not set
+```
+
+**Solution**: Make sure your `.env` file is in the same directory as your Python script and contains `RUNPOD_API_KEY=your_key`.
+
+You can also export the API key in your terminal as a workaround:
+
+```bash
+export RUNPOD_API_KEY=your_key
+```
+
+### Job stuck in queue
+
+```text
+Initial job status: IN_QUEUE
+[Stays in queue for >60 seconds]
+```
+
+**Solution**: No GPUs in your selected pools are currently available. Add more GPU types as fallbacks:
+
+```python
+config = LiveServerless(
+ name="flash-quickstart",
+ gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24, GpuGroup.AMPERE_48]
+)
+```
+
+Or check [GPU availability](https://www.runpod.io/console/serverless) in the console.
+
+### Import errors
+
+```text
+ModuleNotFoundError: No module named 'numpy'
+```
+
+**Solution**: Move imports inside the `@remote` function:
+
+```python
+@remote(resource_config=config, dependencies=["numpy"])
+def my_function():
+ import numpy as np # Import here, not at top of file
+ # ...
+```
+
+See the [execution model](/flash/execution-model#common-execution-issues) for more troubleshooting.
diff --git a/flash/remote-functions.mdx b/flash/remote-functions.mdx
new file mode 100644
index 00000000..3a5da808
--- /dev/null
+++ b/flash/remote-functions.mdx
@@ -0,0 +1,273 @@
+---
+title: "Create remote functions"
+sidebarTitle: "Create remote functions"
+description: "Learn how to create and configure remote functions with Flash."
+tag: "BETA"
+---
+
+Remote functions are the core building blocks of Flash. The `@remote` decorator marks Python functions for execution on Runpod's Serverless infrastructure, handling resource provisioning, dependency installation, and data transfer automatically.
+
+## How remote functions work
+
+A remote function is just a Python function that's been marked with the `@remote` decorator. For example:
+
+```python
+@remote(resource_config=config, dependencies=["torch"])
+def run_inference(data):
+ import torch
+ # Your inference code here
+ return result
+```
+
+When you call a remote function from a local Python script or [Flash app](/flash/apps/overview), the function code is sent to a Runpod worker. The worker executes the function code and returns the result to your local environment.
+
+## Resource configuration
+
+Every remote function requires a resource configuration that specifies the compute resources to use.
+
+`LiveServerless` is the primary configuration class for Flash. It supports full remote code execution, allowing you to run arbitrary Python functions on Runpod's infrastructure.
+
+### GPU configuration
+
+For GPU workloads, create a `LiveServerless` configuration and specify the [GPU pool(s)](/references/gpu-types#gpu-pools) that your workers will use with the `gpus` parameter.
+
+```python
+from runpod_flash import LiveServerless, GpuGroup
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ workersMax=5,
+ idleTimeout=10
+)
+
+@remote(resource_config=gpu_config, dependencies=["torch"])
+def run_inference(data):
+ import torch
+ # Your inference code here
+ return result
+```
+
+Here are the common configuration options for `LiveServerless`:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `name` | Name for your endpoint (required) | - |
+| `gpus` | [GPU pool IDs](/references/gpu-types#gpu-pools) that can be used by workers | `[GpuGroup.ANY]` |
+| `workersMax` | Maximum number of workers | 3 |
+| `workersMin` | Minimum number of workers | 0 |
+| `idleTimeout` | Minutes before scaling down | 5 |
+
+See the [resource configuration reference](/flash/resource-configuration) for all available options.
+
+
+Learn how to view, update, and delete your endpoints in the [managing endpoints guide](/flash/managing-endpoints).
+
+
+### CPU configuration
+
+For CPU workloads, specify `instanceIds` instead of `gpus`:
+
+```python
+from runpod_flash import LiveServerless, CpuInstanceType
+
+cpu_config = LiveServerless(
+ name="data-processor",
+ instanceIds=[CpuInstanceType.CPU5C_4_8], # 4 vCPU, 8GB RAM
+ workersMax=3
+)
+
+@remote(resource_config=cpu_config, dependencies=["pandas"])
+def process_data(data):
+ import pandas as pd
+ df = pd.DataFrame(data)
+ return df.describe().to_dict()
+```
+
+### Custom Docker images
+
+For specialized environments that require pre-built Docker images—such as vLLM, TensorRT, or images with custom system dependencies—you'll need to use the `ServerlessEndpoint` configuration.
+
+See [Custom Docker images](/flash/custom-docker-images) for details.
+
+
+## Dependency management
+
+Specify Python packages in the `dependencies` parameter of the `@remote` decorator. Flash installs these packages on the remote worker before executing your function.
+
+```python
+@remote(
+ resource_config=config,
+ dependencies=["transformers==4.36.0", "torch", "pillow"]
+)
+def generate_image(prompt):
+ from transformers import pipeline
+ import torch
+ from PIL import Image
+ # Your code here
+```
+
+
+Some packages (like PyTorch) are pre-installed on GPU workers, but including them in dependencies ensures the correct version is available.
+
+
+
+### Import packages inside the function body
+
+You must import packages **inside the decorated function body**, not at the top of your file. This ensures the imports happen on the remote worker, not in your local environment.
+
+
+**Correct:** imports inside the function.
+```python
+@remote(resource_config=config, dependencies=["numpy"])
+def compute(data):
+ import numpy as np # Import here
+ return np.sum(data)
+```
+**Incorrect:** imports at top of file won't work.
+
+```python
+import numpy as np # This import happens locally, not on the worker
+
+@remote(resource_config=config, dependencies=["numpy"])
+def compute(data):
+ return np.sum(data) # numpy not available on the remote worker
+```
+
+### Version pinning
+
+You can pin specific versions using standard pip syntax:
+
+```python
+dependencies=["transformers==4.36.0", "torch>=2.0.0"]
+```
+
+## Parallel execution
+
+Flash functions are asynchronous by default. Use Python's `asyncio` to run multiple functions in parallel:
+
+```python
+import asyncio
+
+async def main():
+ # Run three functions in parallel
+ results = await asyncio.gather(
+ process_item(item1),
+ process_item(item2),
+ process_item(item3)
+ )
+ return results
+```
+
+This is particularly useful for:
+
+- Batch processing multiple inputs.
+- Running different models on the same data.
+- Parallelizing independent pipeline stages.
+
+### Example: Parallel batch processing
+
+```python
+import asyncio
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+config = LiveServerless(
+ name="batch-processor",
+ gpus=[GpuGroup.ADA_24],
+ workersMax=5 # Allow up to 5 parallel workers
+)
+
+@remote(resource_config=config, dependencies=["torch"])
+def process_batch(batch_id, data):
+ import torch
+ # Process batch
+ return {"batch_id": batch_id, "result": len(data)}
+
+async def main():
+ batches = [
+ (1, [1, 2, 3]),
+ (2, [4, 5, 6]),
+ (3, [7, 8, 9])
+ ]
+
+ # Process all batches in parallel
+ results = await asyncio.gather(*[
+ process_batch(batch_id, data)
+ for batch_id, data in batches
+ ])
+
+ print(results)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+## Using persistent storage
+
+Attach [network volumes](/storage/network-volumes) for persistent storage across workers and endpoints. This is useful for sharing large models or datasets between workers without downloading them each time.
+
+```python
+from runpod_flash import LiveServerless, PodTemplate
+
+config = LiveServerless(
+ name="model-server",
+ networkVolumeId="vol_abc123", # Your network volume ID
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+To find your network volume ID:
+
+1. Go to the [Storage page](https://www.runpod.io/console/storage) in the Runpod console.
+2. Click on your network volume.
+3. Copy the volume ID from the URL or volume details.
+
+### Example: Using a network volume for model storage
+
+```python
+from runpod_flash import LiveServerless, GpuGroup, PodTemplate
+
+config = LiveServerless(
+ name="model-inference",
+ gpus=[GpuGroup.AMPERE_80],
+ networkVolumeId="vol_abc123",
+ template=PodTemplate(containerDiskInGb=100)
+)
+
+@remote(resource_config=config, dependencies=["torch", "transformers"])
+def run_inference(prompt):
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model from network volume
+ model_path = "/runpod-volume/models/llama-7b"
+ model = AutoModelForCausalLM.from_pretrained(model_path)
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+
+ # Run inference
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs)
+ return tokenizer.decode(outputs[0])
+```
+
+## Environment variables
+
+Pass environment variables to remote functions using the `env` parameter:
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ env={"HF_TOKEN": "your_token", "MODEL_ID": "gpt2"}
+)
+```
+
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, which allows different processes to load environment variables from `.env` files without causing false drift detection.
+
+
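+Inside the function body, these variables are read with ordinary environment lookups via `os.environ`. A minimal sketch (the `MODEL_ID` variable name and the fallback value are illustrative, matching the example above):
+
+```python
+import os
+
+def resolve_model_id():
+ # On the worker, MODEL_ID is populated from the endpoint's `env` configuration;
+ # fall back to a default when it isn't set (e.g. during local testing).
+ return os.environ.get("MODEL_ID", "gpt2")
+```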
+
+
+## Next steps
+
+- [Create API endpoints](/flash/apps/build-app) using FastAPI.
+- [Deploy Flash applications](/flash/apps/deploy-apps) for production.
+- [View the resource configuration reference](/flash/resource-configuration) for all available options.
+- [Clean up development endpoints](/flash/cli/undeploy) when you're done testing.
diff --git a/flash/resource-configuration.mdx b/flash/resource-configuration.mdx
new file mode 100644
index 00000000..bd773ad9
--- /dev/null
+++ b/flash/resource-configuration.mdx
@@ -0,0 +1,324 @@
+---
+title: "Resource configuration reference"
+sidebarTitle: "Configuration reference"
+description: "A complete reference for Flash GPU/CPU resource configuration options."
+tag: "BETA"
+---
+
+Flash provides several resource configuration classes for different use cases. This reference covers all available parameters and options.
+
+## LiveServerless
+
+`LiveServerless` is the primary configuration class for Flash. It supports full remote code execution, allowing you to run arbitrary Python functions on Runpod's infrastructure.
+
+```python
+from runpod_flash import LiveServerless, GpuGroup, CpuInstanceType, PodTemplate
+
+gpu_config = LiveServerless(
+ name="ml-inference",
+ gpus=[GpuGroup.AMPERE_80],
+ workersMax=5,
+ idleTimeout=10,
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `name` | `string` | Name for your endpoint (required) | - |
+| `gpus` | `list[GpuGroup]` | GPU pool IDs that can be used by workers | `[GpuGroup.ANY]` |
+| `gpuCount` | `int` | Number of GPUs per worker | 1 |
+| `instanceIds` | `list[CpuInstanceType]` | CPU instance types (forces CPU endpoint) | `None` |
+| `workersMin` | `int` | Minimum number of workers | 0 |
+| `workersMax` | `int` | Maximum number of workers | 3 |
+| `idleTimeout` | `int` | Minutes before scaling down | 5 |
+| `env` | `dict` | Environment variables | `None` |
+| `networkVolumeId` | `string` | Persistent storage volume ID | `None` |
+| `executionTimeoutMs` | `int` | Max execution time in milliseconds | 0 (no limit) |
+| `scalerType` | `string` | Scaling strategy | `QUEUE_DELAY` |
+| `scalerValue` | `int` | Scaling parameter value | 4 |
+| `locations` | `string` | Preferred datacenter locations | `None` |
+| `template` | `PodTemplate` | Pod template overrides | `None` |
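+
+The timeout and scaling parameters are easy to overlook. Here is a hedged sketch combining them (the values shown are illustrative, not recommendations):
+
+```python
+from runpod_flash import LiveServerless
+
+config = LiveServerless(
+ name="bounded-worker",
+ executionTimeoutMs=600000, # cancel any job that runs longer than 10 minutes
+ scalerType="QUEUE_DELAY", # default strategy: scale based on queue wait time
+ scalerValue=4, # default value for the strategy
+)
+```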
+
+### GPU configuration example
+
+```python
+from runpod_flash import LiveServerless, GpuGroup, PodTemplate
+
+config = LiveServerless(
+ name="gpu-inference",
+ gpus=[GpuGroup.AMPERE_80], # A100 80GB
+ gpuCount=1,
+ workersMin=0,
+ workersMax=5,
+ idleTimeout=10,
+ template=PodTemplate(containerDiskInGb=100),
+ env={"MODEL_ID": "llama-7b"}
+)
+```
+
+### Handling GPU unavailability issues
+
+If no GPUs in your list are available, you'll see:
+
+```text
+Initial job status: IN_QUEUE
+```
+
+The job will stay in the queue until capacity becomes available.
+
+**Solutions**:
+1. **Check console**: View [GPU availability](https://www.runpod.io/console/serverless) in the Runpod console.
+2. **Add more GPU types**: Expand your `gpus` list to include more fallback options.
+3. **Use GpuGroup.ANY**: Switch to `[GpuGroup.ANY]` for maximum availability.
+4. **Contact support**: If capacity is consistently unavailable, contact [Runpod support](https://www.runpod.io/contact) to discuss increasing your account limits.
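+
+Solutions 2 and 3 can be combined by listing several pools, with `GpuGroup.ANY` as a catch-all (a sketch; the specific pools are illustrative):
+
+```python
+from runpod_flash import LiveServerless, GpuGroup
+
+config = LiveServerless(
+ name="gpu-inference",
+ # Multiple pools give the scheduler fallback options when one is full
+ gpus=[GpuGroup.AMPERE_80, GpuGroup.ADA_80_PRO, GpuGroup.ANY],
+)
+```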
+
+### CPU configuration example
+
+```python
+from runpod_flash import LiveServerless, CpuInstanceType
+
+config = LiveServerless(
+ name="cpu-processor",
+ instanceIds=[CpuInstanceType.CPU5C_4_8], # 4 vCPU, 8GB RAM
+ workersMax=3,
+ idleTimeout=5
+)
+```
+
+## ServerlessEndpoint
+
+`ServerlessEndpoint` is for GPU workloads that require custom Docker images.
+
+These resources work similarly to [traditional Serverless endpoints](/serverless/overview). Before you can run your function, you'll need to:
+- Write a [handler function](/serverless/workers/handler-functions) that processes the input dictionary.
+- [Create a Dockerfile](/serverless/workers/create-dockerfile) that packages your handler function and its dependencies.
+- [Push the image to a container registry](/serverless/workers/deploy).
+
+You'll then add the image name to your resource configuration:
+
+```python highlight="5"
+from runpod_flash import ServerlessEndpoint, GpuGroup
+
+config = ServerlessEndpoint(
+ name="custom-ml-env",
+ imageName="pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime",
+ gpus=[GpuGroup.AMPERE_80]
+)
+```
+
+
+### Request structure
+
+When you make requests to the endpoint, you'll need to provide the input as a dictionary in the form of `{"input": {...}}`. For example:
+
+```json
+{
+ "input": {
+ "prompt": "Hello, world!"
+ }
+}
+```
+
+### Parameters
+
+All parameters from `LiveServerless` are available, plus:
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `imageName` | `string` | Custom Docker image | - |
+
+### Limitations
+
+- Only supports dictionary payloads in the form of `{"input": {...}}`.
+- Cannot execute arbitrary Python functions remotely.
+- Requires a custom Docker image with a [handler function](/serverless/workers/handler-functions) that processes the input dictionary.
+
+### Example
+
+```python
+from runpod_flash import ServerlessEndpoint, GpuGroup
+
+# Custom image with pre-installed models
+config = ServerlessEndpoint(
+ name="stable-diffusion",
+ imageName="my-registry/stable-diffusion:v1.0",
+ gpus=[GpuGroup.AMPERE_24],
+ workersMax=3
+)
+
+# Send requests as dictionaries
+result = await config.run({
+ "input": {
+ "prompt": "A beautiful sunset over mountains",
+ "width": 512,
+ "height": 512
+ }
+})
+```
+
+## CpuServerlessEndpoint
+
+`CpuServerlessEndpoint` is for CPU workloads that require custom Docker images. Like `ServerlessEndpoint`, it only supports dictionary payloads.
+
+```python
+from runpod_flash import CpuServerlessEndpoint, CpuInstanceType
+
+config = CpuServerlessEndpoint(
+ name="data-processor",
+ imageName="python:3.11-slim",
+ instanceIds=[CpuInstanceType.CPU5C_4_8]
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `name` | `string` | Name for your endpoint (required) | - |
+| `imageName` | `string` | Custom Docker image | - |
+| `instanceIds` | `list[CpuInstanceType]` | CPU instance types | - |
+| `workersMin` | `int` | Minimum number of workers | 0 |
+| `workersMax` | `int` | Maximum number of workers | 3 |
+| `idleTimeout` | `int` | Minutes before scaling down | 5 |
+| `env` | `dict` | Environment variables | `None` |
+| `networkVolumeId` | `string` | Persistent storage volume ID | `None` |
+| `executionTimeoutMs` | `int` | Max execution time in milliseconds | 0 (no limit) |
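+
+Requests follow the same dictionary shape as `ServerlessEndpoint`. A sketch, assuming the `run` method shown in the `ServerlessEndpoint` example applies here too and that your image's handler accepts this payload:
+
+```python
+import asyncio
+
+async def main():
+ # Same request structure: {"input": {...}}
+ result = await config.run({"input": {"rows": [1, 2, 3]}})
+ print(result)
+
+asyncio.run(main())
+```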
+
+## Resource class comparison
+
+| Feature | LiveServerless | ServerlessEndpoint | CpuServerlessEndpoint |
+|---------|----------------|--------------------|-----------------------|
+| Remote code execution | ✅ Full Python function execution | ❌ Dictionary payload only | ❌ Dictionary payload only |
+| Custom Docker images | ❌ Fixed optimized images | ✅ Any Docker image | ✅ Any Docker image |
+| Use case | Dynamic remote functions | Traditional API endpoints | Traditional CPU endpoints |
+| Function returns | Any Python object | Dictionary only | Dictionary only |
+| `@remote` decorator | Full functionality | Limited to payload passing | Limited to payload passing |
+
+## Available GPU types
+
+The `GpuGroup` enum provides access to GPU pools. Each pool contains specific GPU models grouped by architecture and VRAM.
+
+| GpuGroup | GPUs Included | VRAM | Best For |
+|----------|---------------|------|----------|
+| `GpuGroup.ANY` | Any available GPU | Varies | Fast provisioning, prototyping, development |
+| `GpuGroup.AMPERE_16` | RTX A4000 | 16GB | Small models, basic inference |
+| `GpuGroup.AMPERE_24` | L4, A5000, RTX 3090 | 24GB | General ML workloads, mid-size models |
+| `GpuGroup.ADA_24` | RTX 4090 | 24GB | Cost-effective inference, gaming GPUs |
+| `GpuGroup.AMPERE_48` | A40, RTX A6000 | 48GB | Large models, fine-tuning |
+| `GpuGroup.ADA_48_PRO` | L40, L40S | 48GB | Professional inference, large models |
+| `GpuGroup.AMPERE_80` | A100 80GB | 80GB | XL models, intensive training |
+| `GpuGroup.ADA_80_PRO` | H100 | 80GB | Cutting-edge inference, latest architecture |
+| `GpuGroup.HOPPER_141` | H200 | 141GB | Largest models, maximum VRAM |
+
+### Pool naming conventions
+
+GPU pool names follow the pattern `{ARCHITECTURE}_{VRAM}`, with an optional `_{TIER}` suffix:
+
+- **AMPERE**: NVIDIA Ampere architecture (A-series, RTX 30-series)
+- **ADA**: NVIDIA Ada Lovelace architecture (RTX 40-series, L40)
+- **HOPPER**: NVIDIA Hopper architecture (H-series)
+- **VRAM number**: Memory capacity in GB (16, 24, 48, 80, 141)
+- **PRO suffix**: Professional/datacenter GPUs (L40, H100, H200)
+
+**Examples**:
+- `AMPERE_80` = Ampere architecture with 80GB VRAM (A100)
+- `ADA_24` = Ada Lovelace with 24GB VRAM (RTX 4090)
+- `ADA_48_PRO` = Professional Ada GPUs with 48GB (L40/L40S)
+
+See the [complete GPU types reference](/references/gpu-types#gpu-pools) for detailed specifications and availability.
+
+## Available CPU instance types
+
+The `CpuInstanceType` enum provides access to CPU configurations:
+
+### 3rd generation general purpose
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU3G_1_4` | cpu3g-1-4 | 1 | 4GB |
+| `CPU3G_2_8` | cpu3g-2-8 | 2 | 8GB |
+| `CPU3G_4_16` | cpu3g-4-16 | 4 | 16GB |
+| `CPU3G_8_32` | cpu3g-8-32 | 8 | 32GB |
+
+### 3rd generation compute-optimized
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU3C_1_2` | cpu3c-1-2 | 1 | 2GB |
+| `CPU3C_2_4` | cpu3c-2-4 | 2 | 4GB |
+| `CPU3C_4_8` | cpu3c-4-8 | 4 | 8GB |
+| `CPU3C_8_16` | cpu3c-8-16 | 8 | 16GB |
+
+### 5th generation compute-optimized
+
+| CpuInstanceType | ID | vCPU | RAM |
+|-----------------|-----|------|-----|
+| `CPU5C_1_2` | cpu5c-1-2 | 1 | 2GB |
+| `CPU5C_2_4` | cpu5c-2-4 | 2 | 4GB |
+| `CPU5C_4_8` | cpu5c-4-8 | 4 | 8GB |
+| `CPU5C_8_16` | cpu5c-8-16 | 8 | 16GB |
+
+## PodTemplate
+
+Use `PodTemplate` to configure additional pod settings:
+
+```python
+from runpod_flash import LiveServerless, PodTemplate
+
+config = LiveServerless(
+ name="custom-template",
+ template=PodTemplate(
+ containerDiskInGb=100,
+ env=[{"key": "PYTHONPATH", "value": "/workspace"}]
+ )
+)
+```
+
+### Parameters
+
+| Parameter | Type | Description | Default |
+|-----------|------|-------------|---------|
+| `containerDiskInGb` | `int` | Container disk size in GB | 20 |
+| `env` | `list[dict]` | Environment variables as key-value pairs | `None` |
+
+## Environment variables
+
+Environment variables can be set in two ways:
+
+### Using the `env` parameter
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ env={"HF_TOKEN": "your_token", "MODEL_ID": "gpt2"}
+)
+```
+
+### Using PodTemplate
+
+```python
+config = LiveServerless(
+ name="api-worker",
+ template=PodTemplate(
+ env=[
+ {"key": "HF_TOKEN", "value": "your_token"},
+ {"key": "MODEL_ID", "value": "gpt2"}
+ ]
+ )
+)
+```
+
+
+
+Environment variables are excluded from configuration hashing. Changing environment values won't trigger endpoint recreation, which allows different processes to load environment variables from `.env` files without causing false drift detection. Only structural changes (like GPU type, image, or template modifications) trigger endpoint updates.
+
+
+
+## Next steps
+
+- [Create remote functions](/flash/remote-functions) using these configurations.
+- [Deploy Flash applications](/flash/apps/deploy-apps) for production.
+- [Learn about pricing](/flash/pricing) to optimize costs.
diff --git a/images/flash_sdxl_output.png b/images/flash_sdxl_output.png
new file mode 100644
index 00000000..07dbbe29
Binary files /dev/null and b/images/flash_sdxl_output.png differ
diff --git a/snippets/tooltips.jsx b/snippets/tooltips.jsx
index 1751ca50..ace6967f 100644
--- a/snippets/tooltips.jsx
+++ b/snippets/tooltips.jsx
@@ -83,7 +83,7 @@ export const WorkerTooltip = () => {
export const WorkersTooltip = () => {
return (
- worker
+ workers
);
};
diff --git a/tutorials/flash/image-generation-with-sdxl.mdx b/tutorials/flash/image-generation-with-sdxl.mdx
new file mode 100644
index 00000000..5cb122a6
--- /dev/null
+++ b/tutorials/flash/image-generation-with-sdxl.mdx
@@ -0,0 +1,657 @@
+---
+title: "Generate images with Flash and SDXL"
+sidebarTitle: "Generate images with Flash + SDXL"
+description: "Learn how to use Flash with Stable Diffusion XL to generate high-quality images from text prompts."
+---
+
+This tutorial shows you how to build an image generation application using Flash and Stable Diffusion XL (SDXL). You'll learn how to load a pretrained diffusion model on a GPU worker and generate images from text prompts.
+
+
+
+
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Use the Hugging Face diffusers library with Flash.
+- Load and run Stable Diffusion XL models on GPU workers.
+- Generate high-quality images from text prompts.
+- Save generated images to disk.
+- Configure generation parameters like guidance scale and steps.
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.10 or higher](https://www.python.org/downloads/).
+- You've completed the [Flash quickstart](/flash/quickstart) or are familiar with Flash basics.
+
+## What you'll build
+
+By the end of this tutorial, you'll have a working image generation application that:
+
+- Accepts text prompts as input.
+- Generates photorealistic images using Stable Diffusion XL.
+- Runs entirely on Runpod's GPU infrastructure.
+- Saves generated images to your local machine.
+
+## Step 1: Set up your project
+
+Create a new directory for your project and set up a Python virtual environment:
+
+```bash
+mkdir flash-image-generation
+cd flash-image-generation
+python3 -m venv venv
+source venv/bin/activate
+```
+
+Install Flash:
+
+```bash
+pip install runpod-flash
+```
+
+Create a `.env` file with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual API key from the [Runpod console](https://www.runpod.io/console/user/settings).
+
+## Step 2: Understand Stable Diffusion XL
+
+[Stable Diffusion XL (SDXL)](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) is a state-of-the-art text-to-image model from Stability AI. It offers:
+
+- **High-quality images**: Generates photorealistic 1024x1024 images
+- **Better prompt understanding**: Improved text comprehension compared to SD 1.5
+- **Fine details**: Enhanced rendering of hands, faces, and text
+- **Open source**: Available for free on Hugging Face
+
+SDXL requires significant GPU resources:
+- **Model size**: ~7GB of weights
+- **VRAM requirement**: Minimum 16GB (24GB recommended)
+- **Generation time**: 20-40 seconds per image on RTX 4090
+
+We'll use the [diffusers](https://huggingface.co/docs/diffusers/index) library from Hugging Face, which provides a clean Python API for Stable Diffusion models.
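+
+The VRAM and download figures above follow from simple parameter arithmetic. As a rough sketch (the ~3.5B figure is an approximation for the base model's UNet, text encoders, and VAE combined):
+
+```python
+# Back-of-the-envelope size estimate for SDXL weights.
+params = 3.5e9       # approximate total parameter count (assumption)
+bytes_per_param = 2  # float16 stores 2 bytes per parameter
+weights_gb = params * bytes_per_param / 1e9
+
+print(f"~{weights_gb:.0f} GB of fp16 weights")        # ~7 GB
+print(f"~{weights_gb * 2:.0f} GB at fp32 precision")  # ~14 GB
+```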
+
+## Step 3: Create your project file
+
+Create a new file called `image_generation.py`:
+
+```bash
+touch image_generation.py
+```
+
+Open this file in your code editor. The following steps walk through building the image generation application.
+
+## Step 4: Add imports and configuration
+
+Add the necessary imports and Flash configuration:
+
+```python
+import asyncio
+import base64
+from pathlib import Path
+from dotenv import load_dotenv
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configuration for GPU workers
+gpu_config = LiveServerless(
+ name="image-generation",
+ gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24], # 24GB GPUs
+ workersMax=2,
+ idleTimeout=15
+)
+```
+
+**Configuration breakdown**:
+
+- **`name="image-generation"`**: Identifies your endpoint in the Runpod console.
+- **`gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24]`**: Uses RTX 4090 or L4/A5000 GPUs (both have 24GB VRAM, sufficient for SDXL).
+- **`workersMax=2`**: Allows up to 2 parallel workers.
+- **`idleTimeout=15`**: Keeps workers active for 15 minutes (SDXL models are large, so we want longer caching).
+
+
+SDXL requires at least 16GB VRAM. Using 24GB GPUs provides comfortable headroom and faster generation.
+
+
+## Step 5: Define the image generation function
+
+Add the remote function that will run on the GPU worker:
+
+```python
+@remote(
+ resource_config=gpu_config,
+ dependencies=["diffusers", "torch", "transformers", "accelerate"]
+)
+def generate_image(prompt, negative_prompt="", num_steps=30, guidance_scale=7.5):
+ """Generate an image using Stable Diffusion XL."""
+ import torch
+ from diffusers import StableDiffusionXLPipeline
+ import base64
+ from io import BytesIO
+
+ # Load the SDXL model
+ model_id = "stabilityai/stable-diffusion-xl-base-1.0"
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16"
+ )
+
+ # Move model to GPU
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ pipe = pipe.to(device)
+
+ # Generate image
+ image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_inference_steps=num_steps,
+ guidance_scale=guidance_scale,
+ height=1024,
+ width=1024
+ ).images[0]
+
+ # Convert image to base64 for transmission
+ buffered = BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+
+ return {
+ "image_base64": img_str,
+ "prompt": prompt,
+ "negative_prompt": negative_prompt,
+ "num_steps": num_steps,
+ "guidance_scale": guidance_scale,
+ "device": device,
+ "resolution": "1024x1024"
+ }
+```
+
+**Key concepts**:
+
+**1. Dependencies**: The function requires four packages:
+ - `diffusers`: Hugging Face library for diffusion models
+ - `torch`: PyTorch for GPU computation
+ - `transformers`: Text encoder dependencies
+ - `accelerate`: Efficient model loading
+
+**2. Model loading**:
+ ```python
+ pipe = StableDiffusionXLPipeline.from_pretrained(
+ model_id,
+ torch_dtype=torch.float16,
+ use_safetensors=True,
+ variant="fp16"
+ )
+ ```
+ This downloads SDXL from Hugging Face. Key parameters:
+ - `torch_dtype=torch.float16`: Use half-precision (saves VRAM, faster)
+ - `use_safetensors=True`: Use safe tensor format
+ - `variant="fp16"`: Download the fp16 version (~7GB instead of ~14GB)
+
+**3. GPU acceleration**:
+ ```python
+ pipe = pipe.to(device)
+ ```
+ Moves the entire pipeline (text encoder, UNet, VAE) to GPU.
+
+**4. Image generation**:
+ ```python
+ image = pipe(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_inference_steps=num_steps,
+ guidance_scale=guidance_scale,
+ height=1024,
+ width=1024
+ ).images[0]
+ ```
+
+ Parameters:
+ - **`prompt`**: What you want to see in the image
+ - **`negative_prompt`**: What you don't want (e.g., "blurry, low quality")
+ - **`num_inference_steps`**: More steps = better quality but slower (20-50 typical)
+ - **`guidance_scale`**: How closely to follow the prompt (7-10 recommended)
+ - **`height/width`**: SDXL is trained for 1024x1024
+
+**5. Image encoding**:
+ ```python
+ buffered = BytesIO()
+ image.save(buffered, format="PNG")
+ img_str = base64.b64encode(buffered.getvalue()).decode()
+ ```
+ We encode the image as base64 to return it through Flash. This allows us to transmit the image data as a string.
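+
+You can sanity-check this round trip locally with plain bytes; no model is needed. The payload below is just stand-in binary data:
+
+```python
+import base64
+
+# Stand-in binary payload (a fake PNG header plus padding)
+payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16
+
+encoded = base64.b64encode(payload).decode()  # plain str, safe to return from a remote function
+decoded = base64.b64decode(encoded)
+
+assert decoded == payload
+print(f"{len(payload)} bytes -> {len(encoded)} characters")
+```
+
+Base64 adds roughly 33% overhead, which is the price of moving binary data through string-based transports.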
+
+## Step 6: Add the main function and image saving
+
+Create functions to call the generator and save images:
+
+```python
+def save_image(base64_string, filename):
+ """Save a base64-encoded image to disk."""
+ import base64
+ from PIL import Image
+ from io import BytesIO
+
+ # Decode base64 string
+ img_data = base64.b64decode(base64_string)
+
+ # Open and save image
+ image = Image.open(BytesIO(img_data))
+ image.save(filename)
+    print(f"✓ Image saved to {filename}")
+
+async def main():
+ print("Generating image with Stable Diffusion XL on Runpod GPU...")
+ print("This may take 1-2 minutes on first run (downloading model)...\n")
+
+ # Define your prompt
+ prompt = "A serene landscape with mountains, a lake, and sunset, highly detailed, photorealistic"
+ negative_prompt = "blurry, low quality, distorted, ugly"
+
+ # Generate image
+ result = await generate_image(
+ prompt=prompt,
+ negative_prompt=negative_prompt,
+ num_steps=30,
+ guidance_scale=7.5
+ )
+
+ # Save the generated image
+ output_dir = Path("generated_images")
+ output_dir.mkdir(exist_ok=True)
+
+ filename = output_dir / "sdxl_output.png"
+ save_image(result["image_base64"], filename)
+
+ # Display metadata
+ print(f"\n{'='*60}")
+ print("GENERATION DETAILS")
+ print('='*60)
+ print(f"Prompt: {result['prompt']}")
+ print(f"Negative prompt: {result['negative_prompt']}")
+ print(f"Steps: {result['num_steps']}")
+ print(f"Guidance scale: {result['guidance_scale']}")
+ print(f"Resolution: {result['resolution']}")
+ print(f"Device: {result['device']}")
+ print('='*60)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+This main function:
+
+- Calls the remote function with `await`.
+- Creates a `generated_images` directory if it doesn't exist.
+- Decodes and saves the base64 image to disk.
+- Displays generation metadata.
+
+## Step 7: Run your first generation
+
+Run the application:
+
+```bash
+python image_generation.py
+```
+
+**First run output** (takes 2-3 minutes):
+
+```text
+Generating image with Stable Diffusion XL on Runpod GPU...
+This may take 1-2 minutes on first run (downloading model)...
+
+Creating endpoint: server_LiveServerless_a1b2c3d4
+Provisioning Serverless endpoint...
+Endpoint ready
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Downloading model weights from Hugging Face...
+Model loaded, generating image...
+Job completed, output received
+✓ Image saved to generated_images/sdxl_output.png
+
+============================================================
+GENERATION DETAILS
+============================================================
+Prompt: A serene landscape with mountains, a lake, and sunset, highly detailed, photorealistic
+Negative prompt: blurry, low quality, distorted, ugly
+Steps: 30
+Guidance scale: 7.5
+Resolution: 1024x1024
+Device: cuda
+============================================================
+```
+
+**Subsequent runs** (takes 30-40 seconds):
+
+```text
+Generating image with Stable Diffusion XL on Runpod GPU...
+
+Resource LiveServerless_a1b2c3d4 already exists, reusing.
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+✓ Image saved to generated_images/sdxl_output.png
+
+[Results appear]
+```
+
+Open `generated_images/sdxl_output.png` to see your generated image!
+
+
+The first run downloads ~7GB of model weights, which takes 1-2 minutes. Subsequent runs reuse the cached model and complete in 30-40 seconds.
+
+
+## Step 8: Experiment with different prompts
+
+Try various prompts to see SDXL's capabilities:
+
+```python
+async def main():
+ # Create output directory
+ output_dir = Path("generated_images")
+ output_dir.mkdir(exist_ok=True)
+
+ # Try different prompts
+ prompts = [
+ {
+ "prompt": "A cyberpunk city at night with neon lights, flying cars, rain, cinematic",
+ "negative": "blurry, low quality",
+ "filename": "cyberpunk_city.png"
+ },
+ {
+ "prompt": "A cute corgi puppy wearing a space suit, floating in space, highly detailed",
+ "negative": "distorted, ugly, bad anatomy",
+ "filename": "space_corgi.png"
+ },
+ {
+ "prompt": "An ancient wizard's study filled with books, potions, magical artifacts, candlelight",
+ "negative": "blurry, modern, plastic",
+ "filename": "wizard_study.png"
+ }
+ ]
+
+ for i, p in enumerate(prompts, 1):
+ print(f"\n{'='*60}")
+ print(f"Generating image {i}/{len(prompts)}")
+ print(f"Prompt: {p['prompt'][:50]}...")
+ print('='*60)
+
+ result = await generate_image(
+ prompt=p['prompt'],
+ negative_prompt=p['negative'],
+ num_steps=30,
+ guidance_scale=7.5
+ )
+
+ filename = output_dir / p['filename']
+ save_image(result["image_base64"], filename)
+        print(f"✓ Saved to {filename}\n")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Run it:
+
+```bash
+python image_generation.py
+```
+
+You'll see three different images generated sequentially on the same GPU worker. Each generation takes about 30-40 seconds after the first one.
+
+## Understanding generation parameters
+
+Let's explore how different parameters affect image quality:
+
+### Number of inference steps
+
+```python
+# Fast but lower quality (15-20 steps)
+result = await generate_image(prompt, num_steps=20)
+
+# Balanced (30 steps) - recommended
+result = await generate_image(prompt, num_steps=30)
+
+# High quality but slower (50 steps)
+result = await generate_image(prompt, num_steps=50)
+```
+
+**Effects**:
+- **15-20 steps**: Faster (15-20 seconds) but less refined details
+- **30 steps**: Good balance of quality and speed (30-40 seconds) - **recommended**
+- **50+ steps**: Diminishing returns, minimal quality improvement
+
+### Guidance scale
+
+```python
+# Low guidance - more creative, less faithful to prompt
+result = await generate_image(prompt, guidance_scale=5.0)
+
+# Medium guidance - balanced (recommended)
+result = await generate_image(prompt, guidance_scale=7.5)
+
+# High guidance - very faithful to prompt, may oversaturate
+result = await generate_image(prompt, guidance_scale=12.0)
+```
+
+**Effects**:
+- **3-5**: More artistic freedom, less literal interpretation
+- **7-10**: Balanced, follows prompt closely - **recommended**
+- **12+**: Very literal, may produce oversaturated or exaggerated images
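+
+Under the hood, `guidance_scale` is the weight in classifier-free guidance: the pipeline blends an unconditional noise prediction with the prompt-conditioned one. A scalar sketch of that blend (the prediction values are made up for illustration):
+
+```python
+def guided_prediction(uncond, cond, guidance_scale):
+    """Classifier-free guidance: push the prediction toward the prompt."""
+    return uncond + guidance_scale * (cond - uncond)
+
+uncond, cond = 1.0, 3.0  # hypothetical noise-prediction values
+
+print(guided_prediction(uncond, cond, 1.0))  # 3.0: just the conditioned prediction
+print(guided_prediction(uncond, cond, 7.5))  # 16.0: pulled much harder toward the prompt
+```
+
+This is also why very high scales can oversaturate: the blend extrapolates well past the conditioned prediction.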
+
+### Negative prompts
+
+Negative prompts tell the model what to avoid:
+
+```python
+# Good negative prompts for photorealistic images
+negative_prompt = "blurry, low quality, distorted, ugly, bad anatomy, watermark"
+
+# Good negative prompts for artistic images
+negative_prompt = "realistic, photograph, blurry, low quality"
+
+# Good negative prompts for portraits
+negative_prompt = "distorted face, bad anatomy, extra limbs, low quality"
+```
+
+Use negative prompts to:
+
+- Remove common artifacts ("distorted", "low quality").
+- Avoid unwanted styles ("cartoon", "3D render").
+- Fix common issues ("bad anatomy", "extra fingers").
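+
+If you reuse the same artifact terms across prompts, a small helper keeps them in one place. `build_negative_prompt` below is a hypothetical convenience, not part of Flash or diffusers:
+
+```python
+BASE_NEGATIVES = ["blurry", "low quality", "distorted"]
+
+def build_negative_prompt(*extra_terms):
+    """Combine base artifact terms with prompt-specific additions, skipping duplicates."""
+    terms = list(BASE_NEGATIVES)
+    for term in extra_terms:
+        if term not in terms:
+            terms.append(term)
+    return ", ".join(terms)
+
+print(build_negative_prompt("bad anatomy", "watermark"))
+# blurry, low quality, distorted, bad anatomy, watermark
+```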
+
+## Troubleshooting
+
+### Out of memory error
+
+**Issue**: `RuntimeError: CUDA out of memory`
+
+**Cause**: SDXL requires significant VRAM (16GB minimum)
+
+**Solutions**:
+1. Verify you're using 24GB GPUs:
+ ```python
+ gpus=[GpuGroup.ADA_24, GpuGroup.AMPERE_24] # 24GB GPUs
+ ```
+
+2. Use half-precision (already in the example):
+ ```python
+ torch_dtype=torch.float16 # Half precision
+ ```
+
+3. If still failing, use 48GB GPUs:
+ ```python
+ gpus=[GpuGroup.AMPERE_48] # A40/A6000 with 48GB
+ ```
+
+### Model download fails
+
+**Issue**: `Error: Failed to download model from Hugging Face`
+
+**Solutions**:
+1. Increase execution timeout for first run:
+ ```python
+ gpu_config = LiveServerless(
+ name="image-generation",
+ executionTimeoutMs=600000 # 10 minutes for first download
+ )
+ ```
+
+2. Check Hugging Face Hub status at [status.huggingface.co](https://status.huggingface.co)
+
+3. Try a smaller model first to test connectivity:
+ ```python
+ model_id = "runwayml/stable-diffusion-v1-5" # Smaller SD 1.5
+ ```
+
+### Image quality is poor
+
+**Issue**: Generated images look blurry or low quality
+
+**Solutions**:
+1. Increase inference steps:
+ ```python
+ num_steps=40 # More steps = better quality
+ ```
+
+2. Adjust guidance scale:
+ ```python
+ guidance_scale=8.5 # Higher guidance
+ ```
+
+3. Improve your prompt:
+ ```python
+ prompt = "A detailed portrait, highly detailed, sharp focus, 8k, professional photography"
+ ```
+
+4. Add quality keywords to your prompt:
+ - "highly detailed"
+ - "sharp focus"
+ - "8k"
+ - "photorealistic"
+ - "professional"
+
+### Slow generation
+
+**Issue**: Image generation takes >60 seconds per image
+
+**Possible causes**:
+1. Worker scaled down (cold start)
+2. Model not cached
+3. Too many inference steps
+
+**Solutions**:
+1. Increase `idleTimeout` to keep workers active:
+ ```python
+ idleTimeout=30 # Keep active for 30 minutes
+ ```
+
+2. Reduce inference steps:
+ ```python
+ num_steps=20 # Faster but slightly lower quality
+ ```
+
+3. Set `workersMin=1` to always have a warm worker ready
+
+### Images look distorted or have artifacts
+
+**Issue**: Generated images have weird artifacts or distortions
+
+**Solutions**:
+1. Use negative prompts:
+ ```python
+ negative_prompt="distorted, ugly, bad anatomy, extra limbs, disfigured"
+ ```
+
+2. Adjust guidance scale (try 7-9 range):
+ ```python
+ guidance_scale=8.0
+ ```
+
+3. Increase inference steps for better refinement:
+ ```python
+ num_steps=35
+ ```
+
+## Next steps
+
+Now that you've built an image generation app with Flash, you can:
+
+### Try other Stable Diffusion models
+
+Explore different models from Hugging Face:
+
+```python
+# SDXL Turbo - 4x faster, 1 step generation
+model_id = "stabilityai/sdxl-turbo"
+
+# Stable Diffusion 1.5 - smaller, faster
+model_id = "runwayml/stable-diffusion-v1-5"
+
+# Stable Diffusion 2.1 - better at artistic styles
+model_id = "stabilityai/stable-diffusion-2-1"
+```
+
+### Add image-to-image generation
+
+Use an existing image as a starting point:
+
+```python
+import torch
+from diffusers import StableDiffusionXLImg2ImgPipeline
+
+# Load the img2img pipeline (assuming the same SDXL base weights and fp16 settings as above)
+pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
+# Generate variations of an existing image (init_image is a PIL image you provide)
+image = pipe(prompt, image=init_image, strength=0.75).images[0]
+```
+
+### Build a Flash app
+
+Convert your script to a production [Flash app](/flash/apps/overview):
+
+```bash
+flash init image-generation-app
+# Move your function to workers/gpu/endpoint.py
+# Add FastAPI routes for HTTP API
+flash deploy
+```
+
+### Optimize with network volumes
+
+Use [network volumes](/flash/managing-endpoints) to cache models across workers:
+
+```python
+from runpod_flash import LiveServerless, PodTemplate  # PodTemplate import path assumed
+
+config = LiveServerless(
+ name="image-generation",
+ networkVolumeId="vol_abc123", # Pre-loaded SDXL model
+ template=PodTemplate(containerDiskInGb=100)
+)
+```
+
+### Explore advanced features
+
+- **LoRA fine-tuning**: Customize SDXL for specific styles
+- **ControlNet**: Guide generation with edge maps, depth, or pose
+- **Inpainting**: Edit specific parts of images
+- **Upscaling**: Generate higher resolution images
+
+## Related resources
+
+- [Flash remote functions guide](/flash/remote-functions)
+- [Flash resource configuration](/flash/resource-configuration)
+- [Managing Flash endpoints](/flash/managing-endpoints)
+- [Hugging Face diffusers documentation](https://huggingface.co/docs/diffusers/index)
+- [Stable Diffusion XL model card](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0)
+- [Prompt engineering guide](https://huggingface.co/docs/diffusers/using-diffusers/write_good_prompt)
diff --git a/tutorials/flash/text-generation-with-transformers.mdx b/tutorials/flash/text-generation-with-transformers.mdx
new file mode 100644
index 00000000..fda30664
--- /dev/null
+++ b/tutorials/flash/text-generation-with-transformers.mdx
@@ -0,0 +1,450 @@
+---
+title: "Generate text with Flash and transformers"
+sidebarTitle: "Generate text with Flash + transformers"
+description: "Learn how to use Flash with Hugging Face transformers to build a GPU-accelerated text generation application."
+---
+
+This tutorial shows you how to build a text generation application using Flash and Hugging Face's transformers library. You'll learn how to load a pretrained language model on a GPU worker and generate text from prompts.
+
+## What you'll learn
+
+In this tutorial you'll learn how to:
+
+- Install and use the Hugging Face transformers library with Flash.
+- Load pretrained models on remote GPU workers.
+- Move models to GPU for faster inference.
+- Configure text generation parameters like temperature and max length.
+- Return structured results with metadata.
+
+## Requirements
+
+- You've [created a Runpod account](/get-started/manage-accounts).
+- You've [created a Runpod API key](/get-started/api-keys).
+- You've installed [Python 3.10 or higher](https://www.python.org/downloads/).
+- You've completed the [Flash quickstart](/flash/quickstart) or are familiar with Flash basics.
+
+## What you'll build
+
+By the end of this tutorial, you'll have a working text generation application that:
+
+- Accepts text prompts as input.
+- Generates natural language completions using GPT-2.
+- Runs entirely on Runpod's GPU infrastructure.
+- Returns generated text with execution metadata.
+
+## Step 1: Set up your project
+
+Create a new directory for your project and set up a Python virtual environment:
+
+```bash
+mkdir flash-text-generation
+cd flash-text-generation
+python3 -m venv venv
+source venv/bin/activate
+```
+
+Install Flash:
+
+```bash
+pip install runpod-flash
+```
+
+Create a `.env` file with your Runpod API key:
+
+```bash
+touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
+```
+
+Replace `YOUR_API_KEY` with your actual API key from the [Runpod console](https://www.runpod.io/console/user/settings).
+
+## Step 2: Understand the Hugging Face transformers library
+
+[Hugging Face transformers](https://huggingface.co/docs/transformers/index) is a popular Python library for working with pretrained language models. It provides:
+
+- **Thousands of pretrained models**: GPT-2, BERT, T5, LLaMA, and many more
+- **Unified API**: Same code works across different model architectures
+- **Model hub integration**: Download models directly from [Hugging Face Hub](https://huggingface.co/models)
+- **Production-ready**: Used by companies and researchers worldwide
+
+For this tutorial, we'll use **GPT-2**, a 124M parameter language model from OpenAI. It's small enough to load quickly but powerful enough to generate coherent text.
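+
+The 124M parameter count also explains the ~500MB download you'll see later. A quick back-of-the-envelope:
+
+```python
+# GPT-2 small ships roughly 124M float32 parameters.
+params = 124e6
+bytes_per_param = 4  # float32 stores 4 bytes per parameter
+size_mb = params * bytes_per_param / 1e6
+
+print(f"~{size_mb:.0f} MB of weights")  # ~496 MB
+```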
+
+## Step 3: Create your project file
+
+Create a new file called `text_generation.py`:
+
+```bash
+touch text_generation.py
+```
+
+Open this file in your code editor. The following steps walk through building the text generation application.
+
+## Step 4: Add imports and configuration
+
+Add the necessary imports and Flash configuration:
+
+```python
+import asyncio
+from dotenv import load_dotenv
+from runpod_flash import remote, LiveServerless, GpuGroup
+
+# Load environment variables from .env file
+load_dotenv()
+
+# Configuration for GPU workers
+gpu_config = LiveServerless(
+ name="text-generation",
+ gpus=[GpuGroup.AMPERE_24, GpuGroup.ADA_24], # 24GB GPUs
+ workersMax=3,
+ idleTimeout=10
+)
+```
+
+**Configuration breakdown**:
+
+- **`name="text-generation"`**: Identifies your endpoint in the Runpod console
+- **`gpus=[GpuGroup.AMPERE_24, GpuGroup.ADA_24]`**: Allows workers to use L4, A5000, RTX 3090, or RTX 4090 GPUs (all have 24GB VRAM)
+- **`workersMax=3`**: Allows up to 3 parallel workers for concurrent requests
+- **`idleTimeout=10`**: Keeps workers active for 10 minutes after last use (reduces cold starts)
+
+
+GPT-2 only requires about 2GB of VRAM, so 24GB GPUs are more than sufficient. For larger models like LLaMA or GPT-J, you might need 48GB or 80GB GPUs.
+
+
+## Step 5: Define the text generation function
+
+Add the remote function that will run on the GPU worker:
+
+```python
+@remote(
+ resource_config=gpu_config,
+ dependencies=["transformers", "torch", "accelerate"]
+)
+def generate_text(prompt, max_length=50):
+ """Generate text using a pretrained language model."""
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load the GPT-2 model and tokenizer
+ model_name = "gpt2"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Move model to GPU if available
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ device_name = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
+ model = model.to(device)
+
+ # Tokenize the input prompt
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
+
+ # Generate text
+ with torch.no_grad():
+ outputs = model.generate(
+ **inputs,
+ max_length=max_length,
+ num_return_sequences=1,
+ temperature=0.7,
+ do_sample=True,
+ pad_token_id=tokenizer.eos_token_id
+ )
+
+ # Decode the generated tokens back to text
+ generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+
+ return {
+ "prompt": prompt,
+ "generated_text": generated_text,
+ "model_name": model_name,
+ "device": device,
+ "device_name": device_name,
+ "max_length": max_length
+ }
+```
+
+**Key concepts**:
+
+**1. Dependencies**: The function requires three packages:
+ - `transformers`: Hugging Face library for language models
+ - `torch`: PyTorch for GPU computation
+ - `accelerate`: Helper library for loading large models efficiently
+
+**2. Model loading**:
+ ```python
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+ ```
+ These lines download and load the GPT-2 model from Hugging Face Hub. The first time this runs, it downloads ~500MB of model weights. Subsequent runs use the cached version.
+
+**3. GPU acceleration**:
+ ```python
+ device = "cuda" if torch.cuda.is_available() else "cpu"
+ model = model.to(device)
+ ```
+ This moves the model to GPU for faster inference. On Runpod workers, `torch.cuda.is_available()` returns `True`.
+
+**4. Tokenization**:
+ ```python
+ inputs = tokenizer(prompt, return_tensors="pt").to(device)
+ ```
+ Converts your text prompt into token IDs that the model understands. The `.to(device)` moves these tokens to GPU memory.
+
+**5. Generation parameters**:
+ - `max_length=50`: Maximum number of tokens to generate
+  - `temperature=0.7`: Controls randomness (lower values are more focused and deterministic, 1.0+ is very random)
+ - `do_sample=True`: Use sampling instead of greedy decoding for more diverse outputs
+ - `num_return_sequences=1`: Generate one completion per prompt
+
+**6. No gradient tracking**:
+ ```python
+ with torch.no_grad():
+ ```
+ Disables gradient computation, reducing memory usage and speeding up inference.
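+
+To see numerically what `temperature` does: the logits are divided by it before the softmax, so low values sharpen the distribution and high values flatten it. A self-contained sketch (the logit values are made up):
+
+```python
+import math
+
+def softmax_with_temperature(logits, temperature):
+    """Scale logits by 1/temperature, then normalize to probabilities."""
+    scaled = [l / temperature for l in logits]
+    m = max(scaled)  # subtract the max for numerical stability
+    exps = [math.exp(s - m) for s in scaled]
+    total = sum(exps)
+    return [e / total for e in exps]
+
+logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
+
+sharp = softmax_with_temperature(logits, 0.7)  # low temperature: top token dominates
+flat = softmax_with_temperature(logits, 1.5)   # high temperature: closer to uniform
+print([round(p, 3) for p in sharp])
+print([round(p, 3) for p in flat])
+```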
+
+## Step 6: Add the main function
+
+Create the main function to test your text generator:
+
+```python
+async def main():
+ print("Starting text generation on Runpod GPU...")
+
+ # Define a prompt
+ prompt = "The future of artificial intelligence is"
+
+ # Generate text
+ result = await generate_text(prompt, max_length=100)
+
+ # Display results
+ print("\n" + "="*60)
+ print("TEXT GENERATION RESULTS")
+ print("="*60)
+ print(f"\nPrompt: {result['prompt']}")
+ print(f"\nGenerated text:\n{result['generated_text']}")
+ print("\n" + "-"*60)
+ print(f"Model: {result['model_name']}")
+ print(f"Device: {result['device']}")
+ print(f"GPU: {result['device_name']}")
+ print(f"Max length: {result['max_length']} tokens")
+ print("="*60)
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+This main function:
+
+- Calls the remote function with `await` (runs asynchronously).
+- Waits for the GPU worker to complete text generation.
+- Displays the results in a formatted output.
+
+## Step 7: Run your first generation
+
+Run the application:
+
+```bash
+python text_generation.py
+```
+
+**First run output** (takes 60-90 seconds):
+
+```text
+Starting text generation on Runpod GPU...
+Creating endpoint: server_LiveServerless_a1b2c3d4
+Provisioning Serverless endpoint...
+Endpoint ready
+Registering RunPod endpoint at https://api.runpod.ai/xvf32dan8rcilp
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Installing dependencies: transformers torch accelerate
+Downloading model weights...
+Job completed, output received
+
+============================================================
+TEXT GENERATION RESULTS
+============================================================
+
+Prompt: The future of artificial intelligence is
+
+Generated text:
+The future of artificial intelligence is bright and full of possibilities. With advancements in machine learning and deep learning, we're seeing AI systems that can understand natural language, recognize images, and even create art. The potential applications are endless, from healthcare to transportation to education.
+
+------------------------------------------------------------
+Model: gpt2
+Device: cuda
+GPU: NVIDIA GeForce RTX 4090
+Max length: 100 tokens
+============================================================
+```
+
+**Subsequent runs** (takes 2-5 seconds):
+
+```text
+Starting text generation on Runpod GPU...
+Resource LiveServerless_a1b2c3d4 already exists, reusing.
+Registering RunPod endpoint at https://api.runpod.ai/xvf32dan8rcilp
+Executing function on RunPod endpoint ID: xvf32dan8rcilp
+Initial job status: IN_QUEUE
+Job completed, output received
+
+[Results appear immediately]
+```
+
+Notice the dramatic speed improvement on subsequent runs—the endpoint is already provisioned, dependencies are installed, and the model is cached.
+
+## Step 8: Experiment with different prompts
+
+Modify the main function to try different prompts:
+
+```python
+async def main():
+ print("Starting text generation on Runpod GPU...")
+
+ # Try multiple prompts
+ prompts = [
+ "Once upon a time in a distant galaxy",
+ "The secret to happiness is",
+ "In the year 2050, technology will"
+ ]
+
+ for prompt in prompts:
+ print(f"\n{'='*60}")
+ print(f"Generating for: {prompt}")
+ print('='*60)
+
+ result = await generate_text(prompt, max_length=80)
+ print(f"\n{result['generated_text']}\n")
+
+if __name__ == "__main__":
+ asyncio.run(main())
+```
+
+Run it again:
+
+```bash
+python text_generation.py
+```
+
+You'll see three different completions generated sequentially on the same GPU worker.
+
+## Troubleshooting
+
+### Model download fails
+
+**Issue**: `Error: Failed to download model from Hugging Face`
+
+**Solutions**:
+1. Check internet connectivity from workers (rare issue on Runpod)
+2. Try a different model that might be available faster
+3. Increase execution timeout in configuration:
+ ```python
+ gpu_config = LiveServerless(
+ name="text-generation",
+ executionTimeoutMs=300000 # 5 minutes
+ )
+ ```
+
+### Out of memory error
+
+**Issue**: `RuntimeError: CUDA out of memory`
+
+**Solutions**:
+1. Use smaller models (GPT-2 instead of GPT-2 Large)
+2. Reduce `max_length` parameter
+3. Use larger GPUs:
+ ```python
+ gpus=[GpuGroup.AMPERE_48] # 48GB GPUs
+ ```
+
+### Slow generation
+
+**Issue**: Text generation takes >30 seconds per request
+
+**Possible causes**:
+1. Worker scaled down (cold start)
+2. Model not cached
+3. Large `max_length` value
+
+**Solutions**:
+1. Increase `idleTimeout` to keep workers active:
+ ```python
+ idleTimeout=30 # Keep active for 30 minutes
+ ```
+2. Set `workersMin=1` to always have a warm worker ready
+3. Reduce `max_length` to generate fewer tokens
+
+### Generation quality is poor
+
+**Issue**: Generated text is incoherent or repetitive
+
+**Solutions**:
+1. Adjust `temperature` (try 0.7-0.9)
+2. Add `top_p` and `top_k` sampling:
+ ```python
+ outputs = model.generate(
+ **inputs,
+ max_length=max_length,
+ temperature=0.8,
+ top_p=0.9,
+ top_k=50,
+ do_sample=True
+ )
+ ```
+3. Try a larger model (GPT-2 Medium or Large)
+
+## Next steps
+
+Now that you've built a text generation app with Flash, you can:
+
+### Explore other models
+
+Try different models from Hugging Face:
+
+```python
+# Larger general-purpose language model
+model_name = "facebook/opt-1.3b"
+
+# Code generation model
+model_name = "Salesforce/codegen-350M-mono"
+
+# Dialogue model
+model_name = "microsoft/DialoGPT-medium"
+```
+
+### Build a chat interface
+
+Extend your app to handle multi-turn conversations:
+
+```python
+@remote(resource_config=gpu_config, dependencies=["transformers", "torch"])
+def chat(conversation_history):
+    """Multi-turn chat with context."""
+    from transformers import AutoTokenizer, AutoModelForCausalLM
+
+    tokenizer = AutoTokenizer.from_pretrained("gpt2")
+    model = AutoModelForCausalLM.from_pretrained("gpt2").to("cuda")
+
+    # Concatenate conversation history into one prompt
+    prompt = "\n".join(conversation_history)
+    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+
+    # Generate a response and return only the newly generated tokens
+    outputs = model.generate(**inputs, max_new_tokens=60, do_sample=True, pad_token_id=tokenizer.eos_token_id)
+    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
+```
+
+### Deploy as a Flash app
+
+Convert your script to a production [Flash app](/flash/apps/overview):
+
+```bash
+flash init text-generation-app
+# Move your function to workers/gpu/endpoint.py
+# Add FastAPI routes
+flash deploy
+```
+
+### Optimize performance
+
+- Use [network volumes](/flash/managing-endpoints) to cache models across workers.
+- Implement [request batching](/flash/remote-functions#parallel-execution) for higher throughput.
+- Try [quantized models](https://huggingface.co/docs/transformers/main_classes/quantization) for faster inference.
+
+## Related resources
+
+- [Flash remote functions guide](/flash/remote-functions)
+- [Flash resource configuration](/flash/resource-configuration)
+- [Managing Flash endpoints](/flash/managing-endpoints)
+- [Hugging Face transformers documentation](https://huggingface.co/docs/transformers/index)
+- [Hugging Face model hub](https://huggingface.co/models)