Optimization involves measuring performance with benchmarking, identifying bottlenecks, and tuning your endpoint configurations.
## Quick optimization checklist
| Strategy | Impact | When to use |
|---|---|---|
| Use cached models | ⬇️ Cold start (major) | Models on Hugging Face |
| Bake models into image | ⬇️ Cold start | Private models |
| Set active workers > 0 | ⬇️ Cold start (eliminates) | Latency-sensitive apps |
| Select multiple GPU types | ⬆️ Availability | Production workloads |
| Increase max workers | ⬆️ Throughput | High concurrency |
| Lower queue delay threshold | ⬇️ Response time | Traffic spikes |
## Understanding delay time
Two metrics affect request response time:
| Metric | Description | Optimization |
|---|---|---|
| Delay time | Waiting for a worker (includes cold start) | Model caching, active workers |
| Execution time | GPU processing the request | Code optimization, GPU selection |
Delay time breaks down into:
- Initialization time: Downloading Docker image
- Cold start time: Loading model into GPU memory
Use benchmarking to measure these metrics for your workload.
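The breakdown above can be sketched as a simple latency model (the function and parameter names are illustrative, not an actual API response shape):

```python
def total_response_time(init_s: float, cold_start_s: float, execution_s: float) -> float:
    """Total request latency: delay time (init + cold start) plus execution time."""
    delay = init_s + cold_start_s  # delay time: waiting for a ready worker
    return delay + execution_s     # execution time: GPU processing the request

# Example: 20s image download, 40s model load, 5s inference
print(total_response_time(20, 40, 5))  # 65.0
```

With an active worker already running, `init_s` and `cold_start_s` drop to zero and only execution time remains.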
If cold start exceeds 7 minutes, the worker is marked unhealthy. Extend this limit with `RUNPOD_INIT_TIMEOUT=800` (seconds).
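For example, the timeout can be raised to 800 seconds by setting the environment variable in your worker's Dockerfile (it can also be set in the endpoint's environment variable settings):

```dockerfile
# Allow up to 800 seconds of initialization before the worker is marked unhealthy
ENV RUNPOD_INIT_TIMEOUT=800
```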
## Reduce cold starts
### Use cached models (recommended)
For models on Hugging Face, cached models provide the fastest cold starts and lowest cost.
### Bake models into images
For private models, embed them in your Docker image. Models load from high-speed local NVMe storage instead of downloading at runtime.
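A minimal sketch of baking a model into the image at build time, assuming a Hugging Face-hosted model and the `huggingface-cli` tool; the base image tag, model ID, and paths below are placeholders:

```dockerfile
# Base image tag is a placeholder; use your worker's actual base image
FROM runpod/base:0.6.2-cuda12.1.0
RUN pip install "huggingface_hub[cli]"
# Download model weights at build time so workers read them from local NVMe
# instead of downloading at runtime
RUN huggingface-cli download <your-model-id> --local-dir /models/<your-model>
COPY handler.py /handler.py
CMD ["python", "/handler.py"]
```

The tradeoff is a larger image (see the architecture table below), which increases initialization time on the first pull but avoids per-request download delays.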
### Maintain active workers
Set active workers > 0 to eliminate cold starts entirely. Active workers cost up to 30% less than flex workers.
Formula: Active workers = (Requests/min × Request duration in seconds) / 60
Example: 6 requests/min × 30 seconds = 3 active workers needed.
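The formula above as a small helper (the function name is illustrative); rounding up ensures enough capacity for fractional results:

```python
import math

def active_workers_needed(requests_per_min: float, request_duration_s: float) -> int:
    """Active workers = (requests/min × request duration in seconds) / 60, rounded up."""
    return math.ceil(requests_per_min * request_duration_s / 60)

print(active_workers_needed(6, 30))  # 3, matching the example above
```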
## Improve availability
### Select multiple GPU types
Specify multiple GPU types in priority order so requests can fall back to the next type when your first choice is unavailable. Note that a single high-end GPU often outperforms multiple lower-tier cards.
### Add headroom to max workers
Set max workers ~20% above expected concurrency to handle load spikes without throttling.
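The ~20% headroom rule as a quick calculation (a sketch; the multiplier is a starting point to tune, not a fixed rule):

```python
import math

def max_workers(expected_concurrency: int, headroom: float = 0.20) -> int:
    """Max workers with ~20% headroom above expected peak concurrency."""
    return math.ceil(expected_concurrency * (1 + headroom))

print(max_workers(10))  # 12
```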
### Tune auto-scaling
Lower the queue delay threshold to 2-3 seconds (default: 4) for faster worker provisioning.
## Architecture considerations
| Choice | Tradeoff |
|---|---|
| Baked models | Fastest loading, but larger images |
| Network volumes | Flexible, but ties workers to specific data centers |
| Multiple GPU types | Higher availability, variable performance |