This guide details the configuration options available for Runpod Serverless endpoints.
Some settings can only be updated after deploying your endpoint. See Edit an endpoint.
## Quick reference
| Setting | Default | Description |
|---|---|---|
| Active workers | 0 | Always-on workers (eliminates cold starts) |
| Max workers | 3 | Maximum concurrent workers |
| GPUs per worker | 1 | GPU count per worker instance |
| Idle timeout | 5s | Time before idle worker shuts down |
| Execution timeout | 600s (10 min) | Max job duration |
| Job TTL | 24h | Total job lifespan in system |
| FlashBoot | Enabled | Faster cold starts via state retention |
## General configuration
### Endpoint name
Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests.
### Endpoint type

Queue-based endpoints use a built-in queueing system with guaranteed execution and automatic retries. They are ideal for asynchronous tasks, batch processing, and long-running jobs, and are implemented using handler functions. Load balancing endpoints route traffic directly to workers, bypassing the queue; they are designed for low-latency, real-time applications or custom REST APIs. See Load balancing endpoints.
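For a queue-based endpoint, the worker code is a handler function registered with the Runpod Python SDK. A minimal sketch (the echo logic is purely illustrative):

```python
import runpod

def handler(job):
    """Process one job pulled from the endpoint's queue."""
    job_input = job["input"]  # the "input" object from the request payload
    # Run your model or task here; this sketch just echoes the input back.
    return {"echo": job_input}

# Register the handler and start the serverless worker loop.
runpod.serverless.start({"handler": handler})
```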
### GPU configuration

Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand.

| GPU type(s) | Memory | Flex cost per second | Active cost per second | Description |
|---|---|---|---|---|
| A4000, A4500, RTX 4000 | 16 GB | $0.00016 | $0.00011 | The most cost-effective for small models. |
| 4090 PRO | 24 GB | $0.00031 | $0.00021 | Extreme throughput for small-to-medium models. |
| L4, A5000, 3090 | 24 GB | $0.00019 | $0.00013 | Great for small-to-medium sized inference workloads. |
| L40, L40S, 6000 Ada PRO | 48 GB | $0.00053 | $0.00037 | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40 | 48 GB | $0.00034 | $0.00024 | A cost-effective option for running big models. |
| H100 PRO | 80 GB | $0.00116 | $0.00093 | Extreme throughput for big models. |
| A100 | 80 GB | $0.00076 | $0.00060 | High throughput GPU, yet still very cost-effective. |
| H200 PRO | 141 GB | $0.00155 | $0.00124 | Extreme throughput for huge models. |
| B200 | 180 GB | $0.00240 | $0.00190 | Maximum throughput for huge models. |
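Because flex and active workers bill per second, a job's cost is roughly the per-second rate times the billed seconds (execution plus any idle time before shutdown). A back-of-the-envelope sketch using the A100 flex rate from the table above (numbers are illustrative):

```python
# Rough per-job cost estimate on an A100 flex worker, using the
# per-second rate from the table above (illustrative only).
flex_rate = 0.00076          # $/s for an A100 flex worker
execution_seconds = 30       # time the handler spends on the job
idle_seconds = 5             # default idle timeout billed after the job

cost = flex_rate * (execution_seconds + idle_seconds)
print(f"~${cost:.5f} per job")  # ~$0.02660 per job
```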
## Worker scaling
### Active workers
Minimum number of workers that remain warm and ready at all times. Setting this to 1 or higher eliminates cold starts. Active workers incur charges even when idle, but receive a 20-30% discount.

### Max workers
Maximum concurrent instances your endpoint can scale to. Acts as both a cost safety limit and a concurrency cap. Set it ~20% higher than your expected maximum concurrency to absorb traffic spikes smoothly.
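Both scaling knobs can also be set at deploy time through the API. A minimal sketch assuming the runpod-python SDK's `create_endpoint` helper (the API key and template ID are placeholders, and parameter names may vary across SDK versions):

```python
import runpod

runpod.api_key = "YOUR_API_KEY"  # placeholder

# Sketch: create an endpoint with one always-warm worker and a
# ceiling of five concurrent workers. Assumes runpod-python's
# create_endpoint helper; check your SDK version for exact names.
endpoint = runpod.create_endpoint(
    name="my-endpoint",
    template_id="YOUR_TEMPLATE_ID",  # placeholder
    workers_min=1,   # active (always-on) workers
    workers_max=5,   # concurrency and cost ceiling
)
```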
### GPUs per worker

Number of GPUs assigned to each worker instance. Default: 1. In general, prefer fewer high-end GPUs over multiple lower-tier GPUs.

### Auto-scaling type
- **Queue delay**: Adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable in exchange for higher utilization.
- **Request count**: More aggressive scaling based on pending plus active work, using the formula `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness. Recommended for LLM workloads or frequent short requests.
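The request-count strategy resolves to this calculation (a direct transcription of the formula above, capped at the endpoint's max workers):

```python
import math

def target_workers(requests_in_queue: int, requests_in_progress: int,
                   scaler_value: int, max_workers: int) -> int:
    """Request-count auto-scaling: ceil((queue + in-progress) / scalerValue),
    capped at the endpoint's max workers setting."""
    desired = math.ceil((requests_in_queue + requests_in_progress) / scaler_value)
    return min(desired, max_workers)

print(target_workers(7, 2, 1, 5))  # -> 5 (9 desired, capped at max workers)
print(target_workers(7, 2, 4, 5))  # -> 3 (ceil(9 / 4))
```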
## Lifecycle and timeouts
### Idle timeout
How long a worker stays active after completing a request before shutting down. You’re billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.
### Execution timeout

Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep this enabled to prevent runaway jobs. Default: 600 seconds (10 minutes). Range: 5 seconds to 7 days. Configure it in Advanced settings, or override it per request via `executionTimeout` in the job policy.
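A per-request override might look like this sketch against the `/run` route (the endpoint ID and API key are placeholders; `executionTimeout` is given in milliseconds here, so verify the units against the job policy reference):

```python
import requests

# Sketch: override the endpoint's execution timeout for one job.
resp = requests.post(
    "https://api.runpod.ai/v2/ENDPOINT_ID/run",  # ENDPOINT_ID is a placeholder
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={
        "input": {"prompt": "hello"},
        "policy": {"executionTimeout": 120_000},  # fail the job after 120s
    },
)
print(resp.json())  # includes the job id for later /status polling
```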
### Job TTL (time-to-live)
Total lifespan of a job in the system. When the TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10 seconds to 7 days. The timer starts at submission, not execution: if a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution. Override per request via `ttl` in the job policy.
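The override uses the same job policy object shown in the execution timeout sketch above; for example (again assuming milliseconds):

```python
payload = {
    "input": {"prompt": "hello"},
    "policy": {
        "ttl": 3_600_000,  # delete the job's data after 1 hour, whatever its state
    },
}
```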
### Result retention
| Request type | Retention | Notes |
|---|---|---|
| Async (`/run`) | 30 min | Retrieve via `/status/{job_id}` |
| Sync (`/runsync`) | 1 min | Returned in the response; also available via `/status/{job_id}` |
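To fetch a result within the retention window, poll the status route. A sketch (endpoint ID, job ID, and API key are placeholders):

```python
import requests

# Sketch: retrieve an async job's result before its retention window expires.
resp = requests.get(
    "https://api.runpod.ai/v2/ENDPOINT_ID/status/JOB_ID",  # placeholders
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
print(resp.json())  # e.g. {"status": "COMPLETED", "output": ...} once finished
```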