Optimization involves measuring performance with benchmarking, identifying bottlenecks, and tuning your endpoint configurations.

Quick optimization checklist

| Strategy | Impact | When to use |
|---|---|---|
| Use cached models | ⬇️ Cold start (major) | Models on Hugging Face |
| Bake models into image | ⬇️ Cold start | Private models |
| Set active workers > 0 | ⬇️ Cold start (eliminates) | Latency-sensitive apps |
| Select multiple GPU types | ⬆️ Availability | Production workloads |
| Increase max workers | ⬆️ Throughput | High concurrency |
| Lower queue delay threshold | ⬇️ Response time | Traffic spikes |

Understanding delay time

Two metrics affect request response time:
| Metric | Description | Optimization |
|---|---|---|
| Delay time | Waiting for a worker (includes cold start) | Model caching, active workers |
| Execution time | GPU processing the request | Code optimization, GPU selection |
Delay time breaks down into:
  • Initialization time: downloading the Docker image
  • Cold start time: loading the model into GPU memory
Use benchmarking to measure these metrics for your workload.
If cold start exceeds 7 minutes, the worker is marked unhealthy. Extend the limit with RUNPOD_INIT_TIMEOUT=800 (value in seconds).
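To see where time goes, you can split a completed request's total response time into its delay and execution components. A minimal sketch, assuming a job-status payload with millisecond `delayTime` and `executionTime` fields (the payload shape here is illustrative; check the fields your endpoint actually returns):

```python
# Sketch: break a request's response time into delay vs. execution.
# The "delayTime"/"executionTime" millisecond fields are assumed here,
# not guaranteed -- verify against your endpoint's status response.

def summarize_job(status: dict) -> dict:
    delay_ms = status.get("delayTime", 0)
    exec_ms = status.get("executionTime", 0)
    total_ms = delay_ms + exec_ms
    return {
        "delay_ms": delay_ms,
        "execution_ms": exec_ms,
        "total_ms": total_ms,
        # A high delay share usually points at cold starts or queueing,
        # not slow model code.
        "delay_share": delay_ms / total_ms if total_ms else 0.0,
    }

print(summarize_job({"delayTime": 1800, "executionTime": 600}))
```

If the delay share dominates, focus on caching, baked models, or active workers; if execution dominates, focus on code and GPU selection.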

Reduce cold starts

For models on Hugging Face, cached models provide the fastest cold starts and lowest cost.

Bake models into images

For private models, embed them in your Docker image. Models load from high-speed local NVMe storage instead of downloading at runtime.

Maintain active workers

Set active workers > 0 to eliminate cold starts entirely. Active workers cost up to 30% less than flex workers.

Formula: `Active workers = (Requests/min × Request duration in seconds) / 60`

Example: 6 requests/min × 30 seconds ÷ 60 = 3 active workers needed.
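The formula above is a direct application of Little's law; a minimal sketch, rounding up so steady traffic never queues:

```python
import math

# Active workers = (requests/min x request duration in seconds) / 60,
# rounded up so a fractional worker becomes a whole one.
def active_workers(requests_per_min: float, request_seconds: float) -> int:
    return math.ceil(requests_per_min * request_seconds / 60)

print(active_workers(6, 30))  # 6 x 30 / 60 = 3
```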

Improve availability

Select multiple GPU types

Specify multiple GPU types in priority order so requests can fall back when your first choice is unavailable. A single high-end GPU often outperforms multiple lower-tier cards.

Add headroom to max workers

Set max workers ~20% above expected concurrency to handle load spikes without throttling.
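The sizing rule can be sketched as a one-line helper (the 20% default headroom mirrors the guidance above; the function name is illustrative):

```python
import math

# Size max workers ~20% above expected peak concurrency so load
# spikes don't get throttled at the ceiling.
def max_workers(expected_concurrency: int, headroom: float = 0.2) -> int:
    return math.ceil(expected_concurrency * (1 + headroom))

print(max_workers(10))  # 10 x 1.2 = 12
```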

Tune auto-scaling

Lower the queue delay threshold to 2-3 seconds (default: 4) for faster worker provisioning.
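To see why a lower threshold provisions workers sooner, consider a simplified model of the scaling decision (illustrative only; Runpod evaluates queue delay server-side from your endpoint settings, not via this function):

```python
# Sketch: count how many queued jobs have waited past the threshold,
# i.e. how aggressively the scaler would add workers. Lowering the
# threshold makes more of the same queue count as "waiting too long".
def jobs_over_threshold(queue_waits_s: list[float], threshold_s: float) -> int:
    return sum(1 for wait in queue_waits_s if wait > threshold_s)

waits = [0.5, 2.5, 4.1, 6.0]
print(jobs_over_threshold(waits, threshold_s=4.0))  # default: 2 jobs over
print(jobs_over_threshold(waits, threshold_s=2.0))  # lowered: 3 jobs over
```

The tradeoff: a lower threshold absorbs traffic spikes faster but can spin up workers for momentary blips, raising cost.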

Architecture considerations

| Choice | Tradeoff |
|---|---|
| Baked models | Fastest loading, but larger images |
| Network volumes | Flexible, but restricted to specific data centers |
| Multiple GPU types | Higher availability, variable performance |