This guide details the configuration options available for Runpod Serverless endpoints.
Some settings can only be updated after deploying your endpoint. See Edit an endpoint.
## Quick reference
| Setting | Default | Description |
|---|---|---|
| Active workers | 0 | Always-on workers (eliminates cold starts) |
| Max workers | 3 | Maximum concurrent workers |
| GPUs per worker | 1 | GPU count per worker instance |
| Idle timeout | 5s | Time before idle worker shuts down |
| Execution timeout | 600s (10 min) | Max job duration |
| Job TTL | 24h | Total job lifespan in system |
| FlashBoot | Enabled | Faster cold starts via state retention |
## General configuration
### Endpoint name
Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests.
### Endpoint type

Queue-based endpoints use a built-in queueing system with guaranteed execution and automatic retries. They are ideal for asynchronous tasks, batch processing, and long-running jobs, and are implemented using handler functions. Load balancing endpoints route traffic directly to workers, bypassing the queue; they are designed for low-latency, real-time applications or custom REST APIs. See Load balancing endpoints.
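For a queue-based endpoint, the worker code is a handler function registered with the Runpod Python SDK. A minimal sketch (the echo logic is purely illustrative):

```python
import runpod

def handler(job):
    """Process one job pulled from the endpoint's queue."""
    job_input = job["input"]  # the "input" object from the request payload
    # Run your model or task here; this sketch just echoes the input back.
    return {"echo": job_input}

# Register the handler and start the serverless worker loop.
runpod.serverless.start({"handler": handler})
```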
### GPU configuration

Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand.

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
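Because flex and active workers bill per second, a job's cost is roughly the per-second rate times the billed seconds (execution plus any idle time before shutdown). A back-of-the-envelope sketch using the A100 flex rate from the table above (numbers are illustrative):

```python
# Rough per-job cost estimate on an A100 flex worker, using the
# per-second rate from the table above (illustrative only).
flex_rate = 0.00076          # $/s for an A100 flex worker
execution_seconds = 30       # time the handler spends on the job
idle_seconds = 5             # default idle timeout billed after the job

cost = flex_rate * (execution_seconds + idle_seconds)
print(f"~${cost:.5f} per job")  # ~$0.02660 per job
```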
## Worker scaling
### Active workers
Minimum number of workers that remain warm and ready at all times. Setting this to 1 or higher eliminates cold starts. Active workers incur charges even when idle, but receive a 20-30% discount.

### Max workers
Maximum concurrent instances your endpoint can scale to. Acts as both a cost safety limit and a concurrency cap. Set it ~20% higher than your expected maximum concurrency to absorb traffic spikes smoothly.
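Both scaling knobs can also be set at deploy time through the API. A minimal sketch assuming the runpod-python SDK's `create_endpoint` helper (the API key and template ID are placeholders, and parameter names may vary across SDK versions):

```python
import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

# Sketch: create an endpoint with one always-warm worker and a
# ceiling of five concurrent workers. Assumes runpod-python's
# create_endpoint helper; check your SDK version for exact names.
endpoint = runpod.create_endpoint(
    name="my-endpoint",
    template_id="YOUR_TEMPLATE_ID",  # placeholder
    workers_min=1,   # active (always-on) workers
    workers_max=5,   # concurrency and cost ceiling
)
```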
### GPUs per worker

Number of GPUs assigned to each worker instance. Default: 1. In general, prefer fewer high-end GPUs over multiple lower-tier GPUs.

### Auto-scaling type
- **Queue delay**: Adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable in exchange for higher utilization.
- **Request count**: More aggressive scaling based on pending plus active work, using the formula `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness. Recommended for LLM workloads or frequent short requests.
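The request-count strategy resolves to this calculation (a direct transcription of the formula above, capped at the endpoint's max workers):

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int,
                   scaler_value: int, max_workers: int) -> int:
    """Request-count auto-scaling: ceil((queue + in-progress) / scalerValue),
    capped at the endpoint's max workers setting."""
    desired = math.ceil((requests_in_queue + requests_in_progress) / scaler_value)
    return min(desired, max_workers)

print(target_workers(7, 2, 1, 5))  # -> 5 (9 desired, capped at max workers)
print(target_workers(7, 2, 4, 5))  # -> 3 (ceil(9 / 4))
```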
## Lifecycle and timeouts
### Idle timeout
How long a worker stays active after completing a request before shutting down. You’re billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.
### Execution timeout

Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep this enabled to prevent runaway jobs. Default: 600 seconds (10 minutes). Range: 5 seconds to 7 days. Configure it in Advanced settings, or override it per request via `executionTimeout` in the job policy.
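A per-request override might look like this sketch against the `/run` route (the endpoint ID and API key are placeholders; `executionTimeout` is given in milliseconds here, so verify the units against the job policy reference):

```python
import requests

# Sketch: override the endpoint's execution timeout for one job.
resp = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",  # ENDPOINT_ID is a placeholder
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={
        "input": {"prompt": "hello"},
        "policy": {"executionTimeout": 120_000},  # fail the job after 120s
    },
)
print(resp.json())  # includes the job id for later /status polling
```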
### Job TTL (time-to-live)
Total lifespan of a job in the system. When the TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10 seconds to 7 days. The timer starts at submission, not execution: if a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution. Override per request via `ttl` in the job policy.
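The override uses the same job policy object shown in the execution timeout sketch above; for example (again assuming milliseconds):

```python
payload = {
    "input": {"prompt": "hello"},
    "policy": {
        "ttl": 3_600_000,  # delete the job's data after 1 hour, whatever its state
    },
}
```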
### Result retention
| Request type | Retention | Notes |
|---|---|---|
| Async (`/run`) | 30 min | Retrieve via `/status/{job_id}` |
| Sync (`/runsync`) | 1 min | Returned in the response; also available via `/status/{job_id}` |
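To fetch a result within the retention window, poll the status route. A sketch (endpoint ID, job ID, and API key are placeholders):

```python
import requests

# Sketch: retrieve an async job's result before its retention window expires.
resp = requests.get(
    "https://api.runpod.ai/v2/ENDPOINT_ID/status/JOB_ID",  # placeholders
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())  # e.g. {"status": "COMPLETED", "output": ...} once finished
```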