Merged
31 changes: 18 additions & 13 deletions 01_getting_started/01_hello_world/README.md
@@ -30,16 +30,14 @@ Server starts at **http://localhost:8888**

### 4. Test the API

Visit **http://localhost:8888/docs** for interactive API documentation. QB endpoints are auto-generated by `flash run` based on your `@Endpoint` functions.

```bash
curl -X POST http://localhost:8888/gpu_worker/runsync \
-H "Content-Type: application/json" \
-d '{"message": "Hello GPU!"}'
```
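The same request can be sent from Python using only the standard library (a sketch mirroring the curl call above; the helper name is ours, not part of the SDK):

```python
import json
import urllib.request

def build_runsync_request(message: str,
                          url: str = "http://localhost:8888/gpu_worker/runsync"):
    # Build the same POST request as the curl command above.
    return urllib.request.Request(
        url,
        data=json.dumps({"message": message}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Send it with: urllib.request.urlopen(build_runsync_request("Hello GPU!"))
```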

### Full CLI Documentation

For complete CLI usage including deployment, environment management, and troubleshooting:
@@ -58,14 +56,14 @@ Simple GPU-based serverless function that:
- Runs on any available GPU

The worker demonstrates:
- Remote execution with the `@Endpoint` decorator
- GPU resource configuration via `gpu=` parameter
- Automatic scaling via `workers=` parameter
- Local development and testing

## API Endpoints

QB (queue-based) endpoints are auto-generated from `@Endpoint` functions. Visit `/docs` for the full API schema.

### `gpu_hello`

@@ -100,7 +98,7 @@ Executes a simple GPU worker and returns system/GPU information.

```
01_hello_world/
├── gpu_worker.py # GPU worker with @Endpoint decorator
├── pyproject.toml # Project metadata
├── requirements.txt # Dependencies
├── .env.example # Environment variables template
@@ -110,16 +108,23 @@ Executes a simple GPU worker and returns system/GPU information.
## Key Concepts

### Remote Execution
The `@Endpoint` decorator transparently executes functions on serverless infrastructure:
- Code runs locally during development
- Automatically deploys to Runpod when configured
- Handles serialization and resource management

```python
from runpod_flash import Endpoint, GpuGroup

@Endpoint(name="my-worker", gpu=GpuGroup.ANY, workers=(0, 3))
async def my_function(data: dict) -> dict:
    return {"result": "processed"}
```

### Resource Scaling
The GPU worker scales to zero when idle:
- **workers=(0, 3)**: Scale from 0 to 3 workers
- **idle_timeout=5**: 5 minutes before scaling down
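As a toy illustration only (this is not Runpod's actual autoscaler), the bounds clamp a naive one-worker-per-queued-job target:

```python
def desired_workers(queue_depth: int, workers: tuple = (0, 3)) -> int:
    # Clamp a naive one-worker-per-queued-job target to the (min, max) bounds.
    lo, hi = workers
    return max(lo, min(hi, queue_depth))
```

With `workers=(0, 3)` an empty queue scales to zero; with `workers=(1, 3)` one worker stays warm.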

### GPU Detection
The worker uses PyTorch to detect and report GPU information:
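The detection code is collapsed in this view; a minimal sketch of what such logic can look like (assuming it mirrors the fields shown in `gpu_worker.py`, and degrading gracefully when torch is absent):

```python
import platform

def detect_gpu() -> dict:
    # Report CUDA availability and device details; fall back to CPU-only info without torch.
    try:
        import torch
        available = torch.cuda.is_available()
    except ImportError:
        torch, available = None, False
    info = {"cuda_available": available, "platform": platform.system()}
    if available:
        info["gpu_name"] = torch.cuda.get_device_name(0)
        info["gpu_count"] = torch.cuda.device_count()
    return info
```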
@@ -142,7 +147,7 @@ flash run

## Next Steps

- Customize GPU type: Change `GpuGroup.ANY` to a specific GPU (e.g. `GpuGroup.ADA_24`, `GpuGroup.AMPERE_80`)
- Add your own GPU-accelerated code
- Implement error handling and validation
- Deploy to production with `flash deploy`
27 changes: 12 additions & 15 deletions 01_getting_started/01_hello_world/gpu_worker.py
@@ -1,20 +1,17 @@
# GPU serverless worker -- detects available GPU hardware.
# Run with: flash run
# Test directly: python gpu_worker.py
from runpod_flash import Endpoint, GpuGroup


@Endpoint(
    name="01_01_gpu_worker",
    gpu=GpuGroup.ANY,
    workers=(0, 3),
    idle_timeout=5,
)
async def gpu_hello(input_data: dict) -> dict:
    """GPU worker that returns GPU hardware info."""
    import platform
    from datetime import datetime

@@ -25,7 +22,7 @@ async def gpu_hello(payload: dict) -> dict:
    gpu_count = torch.cuda.device_count()
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)

    message = input_data.get("message", "Hello from GPU worker!")

    return {
        "status": "success",
42 changes: 29 additions & 13 deletions 01_getting_started/02_cpu_worker/README.md
@@ -30,7 +30,7 @@ Server starts at **http://localhost:8888**

### 4. Test the API

Visit **http://localhost:8888/docs** for interactive API documentation. QB endpoints are auto-generated by `flash run` based on your `@Endpoint` functions.

```bash
curl -X POST http://localhost:8888/cpu_worker/runsync \
@@ -56,14 +56,14 @@ Simple CPU-based serverless function that:
- Runs on general-purpose CPU instances

The worker demonstrates:
- Remote execution with the `@Endpoint` decorator
- CPU resource configuration via `cpu=` parameter
- Automatic scaling via `workers=` parameter
- Lightweight API request handling

## API Endpoints

QB (queue-based) endpoints are auto-generated from `@Endpoint` functions. Visit `/docs` for the full API schema.

### `cpu_hello`

@@ -92,7 +92,7 @@ Executes a simple CPU worker and returns a greeting with system information.

```
02_cpu_worker/
├── cpu_worker.py # CPU worker with @Endpoint decorator
├── pyproject.toml # Project metadata
├── requirements.txt # Dependencies
├── .env.example # Environment variables template
@@ -102,23 +102,39 @@ Executes a simple CPU worker and returns a greeting with system information.
## Key Concepts

### Remote Execution
The `@Endpoint` decorator transparently executes functions on serverless infrastructure:
- Code runs locally during development
- Automatically deploys to Runpod when configured
- Handles serialization and resource management

```python
from runpod_flash import Endpoint

@Endpoint(name="my-worker", cpu="cpu3c-1-2", workers=(0, 3))
async def my_function(data: dict) -> dict:
    return {"result": "processed"}
```

### CPU Instance Types
Available CPU configurations:
- `CpuInstanceType.CPU3G_2_8`: 2 vCPU, 8GB RAM (General Purpose)
- `CpuInstanceType.CPU3C_4_8`: 4 vCPU, 8GB RAM (Compute Optimized)
- `CpuInstanceType.CPU5G_4_16`: 4 vCPU, 16GB RAM (Latest Gen)
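The string shorthands appear to follow the enum names (e.g. `CPU3C_4_8` ↔ `cpu3c-4-8`); a hypothetical helper for the conversion, assuming that pattern holds:

```python
def to_shorthand(enum_name: str) -> str:
    # Assumed mapping inferred from the examples above: CPU3C_4_8 -> "cpu3c-4-8".
    family, vcpu, ram = enum_name.split("_")
    return f"{family.lower()}-{vcpu}-{ram}"
```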

CPU type can be specified as an enum or a string shorthand:
```python
# Enum
@Endpoint(name="worker", cpu=CpuInstanceType.CPU3C_1_2)
async def worker_enum(data: dict) -> dict: ...

# String shorthand
@Endpoint(name="worker", cpu="cpu3c-1-2")
async def worker_str(data: dict) -> dict: ...
```

### Resource Scaling
The CPU worker scales to zero when idle:
- **workers=(0, 3)**: Scale from 0 to 3 workers
- **idle_timeout=5**: 5 minutes before scaling down

## Development

### Test Worker Locally
@@ -148,7 +164,7 @@ Compare with GPU workers when you need:

## Next Steps

- Customize CPU type: Change `"cpu3c-1-2"` to a different instance type
- Add request validation and error handling
- Integrate with databases or external APIs
- Deploy to production with `flash deploy`
27 changes: 12 additions & 15 deletions 01_getting_started/02_cpu_worker/cpu_worker.py
@@ -1,24 +1,21 @@
# CPU serverless worker -- lightweight processing without GPU.
# Run with: flash run
# Test directly: python cpu_worker.py
from runpod_flash import CpuInstanceType, Endpoint


@Endpoint(
    name="01_02_cpu_worker",
    cpu=CpuInstanceType.CPU3C_1_2,
    workers=(0, 3),
    idle_timeout=5,
)
async def cpu_hello(input_data: dict) -> dict:
    """CPU worker that returns a greeting."""
    import platform
    from datetime import datetime

    message = f"Hello, {input_data.get('name', 'Anonymous Panda')}!"

    return {
        "status": "success",
55 changes: 28 additions & 27 deletions 01_getting_started/03_mixed_workers/README.md
@@ -6,7 +6,7 @@ Learn the production pattern of combining CPU and GPU workers for cost-effective

- **Mixed worker architecture** - Combining CPU and GPU workers intelligently
- **Cost optimization** - Using GPU only when necessary
- **Pipeline orchestration** - Coordinating multiple worker types via a load-balanced endpoint
- **Production patterns** - Real-world ML service architecture

## Architecture
@@ -131,41 +131,42 @@ Total: $0.0019/sec
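The total is simply the sum of the three per-stage rates quoted below:

```python
# Approximate per-second prices quoted in this README.
preprocess_rate = 0.0002   # CPU preprocessing
inference_rate = 0.0015    # GPU inference
postprocess_rate = 0.0002  # CPU postprocessing

total_rate = preprocess_rate + inference_rate + postprocess_rate
print(f"${total_rate:.4f}/sec")
```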

### CPU Preprocessing Worker
```python
@Endpoint(
    name="preprocess_worker",
    cpu=CpuInstanceType.CPU3G_2_8,  # 2 vCPU, 8GB
    workers=(0, 10),
    idle_timeout=3,
)
async def preprocess_text(input_data: dict) -> dict: ...
```

**Cost:** ~$0.0002/sec
**Best for:** Validation, cleaning, tokenization
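The preprocessing body is not shown here; a hypothetical sketch of such a stage (validation and whitespace normalization only):

```python
def preprocess(payload: dict) -> dict:
    # Validate and normalize the input text before it ever reaches a GPU.
    text = payload.get("text", "")
    if not text.strip():
        return {"status": "error", "error": "empty text"}
    cleaned = " ".join(text.split())
    return {"status": "success", "text": cleaned, "token_count": len(cleaned.split())}
```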

### GPU Inference Worker
```python
@Endpoint(
    name="inference_worker",
    gpu=GpuGroup.ADA_24,  # RTX 4090
    workers=(0, 3),
    idle_timeout=5,
    dependencies=["torch"],
)
async def gpu_inference(input_data: dict) -> dict: ...
```

**Cost:** ~$0.0015/sec
**Best for:** ML model inference only

### CPU Postprocessing Worker
```python
@Endpoint(
    name="postprocess_worker",
    cpu=CpuInstanceType.CPU3G_2_8,  # 2 vCPU, 8GB
    workers=(0, 10),
    idle_timeout=3,
)
async def postprocess_results(input_data: dict) -> dict: ...
```

**Cost:** ~$0.0002/sec
@@ -176,7 +177,13 @@ postprocess_config = CpuLiveServerless(
The `/classify` load-balanced endpoint orchestrates all workers:

```python
from cpu_worker import postprocess_results, preprocess_text
from gpu_worker import gpu_inference
from runpod_flash import Endpoint

pipeline = Endpoint(name="classify_pipeline", cpu="cpu3c-1-2", workers=(1, 3))

@pipeline.post("/classify")
async def classify(text: str) -> dict:
    """Complete ML pipeline: CPU preprocess -> GPU inference -> CPU postprocess."""
    preprocess_result = await preprocess_text({"text": text})
```
@@ -234,15 +241,9 @@ For higher volumes, savings multiply significantly.

```python
try:
    preprocess_result = await preprocess_text(data)
    gpu_result = await gpu_inference(preprocess_result)
    final_result = await postprocess_results(gpu_result)
    return final_result
except Exception as e:
    logger.error(f"Pipeline failed: {e}")
```
@@ -251,13 +252,13 @@

### 2. Timeouts

Set appropriate timeouts for each stage via `execution_timeout_ms`:
```python
# CPU stages: short timeouts
@Endpoint(name="preprocess", cpu="cpu3c-1-2", execution_timeout_ms=30000)

# GPU stage: longer timeout
@Endpoint(name="inference", gpu=GpuGroup.ADA_24, execution_timeout_ms=120000)
```

### 3. Monitoring
Expand Down Expand Up @@ -299,7 +300,7 @@ Review worker usage:
### Slow Performance

- Increase CPU worker max count for preprocessing
- Check if GPU cold start is the issue (set `workers=(1, 3)` for always-warm)
- Consider caching preprocessed data

## Next Steps