Optimization involves measuring performance with benchmarking, identifying bottlenecks, and tuning your endpoint configurations.
## Quick optimization checklist
| Strategy | Impact | When to use |
|---|---|---|
| Use cached models | ⬇️ Cold start (major) | Models on Hugging Face |
| Bake models into image | ⬇️ Cold start | Private models |
| Set active workers > 0 | ⬇️ Cold start (eliminates) | Latency-sensitive apps |
| Select multiple GPU types | ⬆️ Availability | Production workloads |
| Increase max workers | ⬆️ Throughput | High concurrency |
| Lower queue delay threshold | ⬇️ Response time | Traffic spikes |
## Understanding delay time
Two metrics affect request response time:
| Metric | Description | Optimization |
|---|---|---|
| Delay time | Waiting for a worker (includes cold start) | Model caching, active workers |
| Execution time | GPU processing the request | Code optimization, GPU selection |
Delay time breaks down into:
- Initialization time: Downloading Docker image
- Cold start time: Loading model into GPU memory
Use benchmarking to measure these metrics for your workload.
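The breakdown above can be sketched as a simple latency model (the function and parameter names are illustrative, not an actual API response shape):

```python
def total_response_time(init_s: float, cold_start_s: float, execution_s: float) -> float:
    """Total request latency: delay time (init + cold start) plus execution time."""
    delay = init_s + cold_start_s  # delay time: waiting for a ready worker
    return delay + execution_s     # execution time: GPU processing the request

# Example: 20s image download, 40s model load, 5s inference
print(total_response_time(20, 40, 5))  # 65.0
```

With an active worker already running, `init_s` and `cold_start_s` drop to zero and only execution time remains.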
If cold start exceeds 7 minutes, the worker is marked unhealthy. Extend this limit with `RUNPOD_INIT_TIMEOUT=800` (seconds).
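For example, the timeout can be raised to 800 seconds by setting the environment variable in your worker's Dockerfile (it can also be set in the endpoint's environment variable settings):

```dockerfile
# Allow up to 800 seconds of initialization before the worker is marked unhealthy
ENV RUNPOD_INIT_TIMEOUT=800
```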
## Reduce cold starts
### Use cached models (recommended)
For models on Hugging Face, cached models provide the fastest cold starts and lowest cost.
### Bake models into images
For private models, embed them in your Docker image. Models load from high-speed local NVMe storage instead of downloading at runtime.
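A minimal sketch of baking a model into the image at build time, assuming a Hugging Face-hosted model and the `huggingface-cli` tool; the base image tag, model ID, and paths below are placeholders:

```dockerfile
# Base image tag is a placeholder; use your worker's actual base image
FROM runpod/base:0.6.2-cuda12.1.0
RUN pip install "huggingface_hub[cli]"
# Download model weights at build time so workers read them from local NVMe
# instead of downloading at runtime
RUN huggingface-cli download <your-model-id> --local-dir /models/<your-model>
COPY handler.py /handler.py
CMD ["python", "/handler.py"]
```

The tradeoff is a larger image (see the architecture table below), which increases initialization time on the first pull but avoids per-request download delays.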
### Maintain active workers
Set active workers > 0 to eliminate cold starts entirely. Active workers cost up to 30% less than flex workers.
Formula: Active workers = (Requests/min × Request duration in seconds) / 60
Example: 6 requests/min × 30 seconds = 3 active workers needed.
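The formula above as a small helper (the function name is illustrative); rounding up ensures enough capacity for fractional results:

```python
import math

def active_workers_needed(requests_per_min: float, request_duration_s: float) -> int:
    """Active workers = (requests/min × request duration in seconds) / 60, rounded up."""
    return math.ceil(requests_per_min * request_duration_s / 60)

print(active_workers_needed(6, 30))  # 3, matching the example above
```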
## Improve availability
### Select multiple GPU types
Specify multiple GPU types in priority order so requests can fall back to the next type when your first choice is unavailable. Note that a single high-end GPU often outperforms multiple lower-tier cards.
### Add headroom to max workers
Set max workers ~20% above expected concurrency to handle load spikes without throttling.
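The ~20% headroom rule as a quick calculation (a sketch; the multiplier is a starting point to tune, not a fixed rule):

```python
import math

def max_workers(expected_concurrency: int, headroom: float = 0.20) -> int:
    """Max workers with ~20% headroom above expected peak concurrency."""
    return math.ceil(expected_concurrency * (1 + headroom))

print(max_workers(10))  # 12
```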
### Tune auto-scaling
Lower the queue delay threshold to 2-3 seconds (default: 4) for faster worker provisioning.
## Architecture considerations
| Choice | Tradeoff |
|---|---|
| Baked models | Fastest loading, but larger images |
| Network volumes | Flexible, but ties workers to specific data centers |
| Multiple GPU types | Higher availability, variable performance |