# Endpoint settings

> Reference guide for all Serverless endpoint settings and parameters.

export const InferenceTooltip = () => {
  return <Tooltip headline="AI inference" tip="The execution phase where a trained model makes predictions on new data. When you prompt a model and it responds, that's inference.">inference</Tooltip>;
};

export const MachinesTooltip = () => {
  return <Tooltip headline="Machine" tip="The physical server hardware within a data center that hosts your compute resources.">machines</Tooltip>;
};

This guide details the configuration options available for Runpod Serverless endpoints.

<Note>
  Some settings can only be updated after deploying your endpoint. See [Edit an endpoint](/serverless/endpoints/overview#edit-an-endpoint).
</Note>

## Quick reference

| Setting               | Default       | Description                                |
| --------------------- | ------------- | ------------------------------------------ |
| **Active workers**    | 0             | Always-on workers (eliminates cold starts) |
| **Max workers**       | 3             | Maximum concurrent workers                 |
| **GPUs per worker**   | 1             | GPU count per worker instance              |
| **Idle timeout**      | 5s            | Time before idle worker shuts down         |
| **Execution timeout** | 600s (10 min) | Max job duration                           |
| **Job TTL**           | 24h           | Total job lifespan in system               |
| **FlashBoot**         | Enabled       | Faster cold starts via state retention     |

## General configuration

### Endpoint name

Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests.

### Endpoint type

**Queue-based endpoints** use a built-in queueing system with guaranteed execution and automatic retries. Ideal for async tasks, batch processing, and long-running jobs. Implemented using [handler functions](/serverless/workers/handler-functions).

**Load balancing endpoints** route traffic directly to workers, bypassing the queue. Designed for low-latency applications like real-time <InferenceTooltip /> or custom REST APIs. See [Load balancing endpoints](/serverless/load-balancing/overview).

### GPU configuration

Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand.

| **GPU type(s)**         | **Memory** | **Flex cost per second** | **Active cost per second** | **Description**                                       |
| ----------------------- | ---------- | ------------------------ | -------------------------- | ----------------------------------------------------- |
| A4000, A4500, RTX 4000  | 16 GB      | \$0.00016                | \$0.00011                  | The most cost-effective for small models.             |
| 4090 PRO                | 24 GB      | \$0.00031                | \$0.00021                  | Extreme throughput for small-to-medium models.        |
| L4, A5000, 3090         | 24 GB      | \$0.00019                | \$0.00013                  | Great for small-to-medium sized inference workloads.  |
| L40, L40S, 6000 Ada PRO | 48 GB      | \$0.00053                | \$0.00037                  | Extreme inference throughput on LLMs like Llama 3 7B. |
| A6000, A40              | 48 GB      | \$0.00034                | \$0.00024                  | A cost-effective option for running big models.       |
| H100 PRO                | 80 GB      | \$0.00116                | \$0.00093                  | Extreme throughput for big models.                    |
| A100                    | 80 GB      | \$0.00076                | \$0.00060                  | High throughput GPU, yet still very cost-effective.   |
| H200 PRO                | 141 GB     | \$0.00155                | \$0.00124                  | Extreme throughput for huge models.                   |
| B200                    | 180 GB     | \$0.00240                | \$0.00190                  | Maximum throughput for huge models.                   |
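
As a rough illustration of how the per-second rates translate into spend, the sketch below multiplies a rate from the table by a duration. The `RATES` dictionary and `job_cost` helper are hypothetical names used only for this example. Only three rows of the table are copied in, and the estimate covers a single one-GPU worker with no taxes, storage, or idle time.

```python
# Per-second rates in USD, copied from three rows of the table above.
# RATES and job_cost are illustrative names, not part of any Runpod SDK.
RATES = {
    "A4000": {"flex": 0.00016, "active": 0.00011},
    "A100": {"flex": 0.00076, "active": 0.00060},
    "H100 PRO": {"flex": 0.00116, "active": 0.00093},
}


def job_cost(gpu: str, seconds: float, worker_type: str = "flex") -> float:
    """Estimate the compute cost in USD of running one worker for `seconds`."""
    return RATES[gpu][worker_type] * seconds


# A 10-minute job on a flex A100 worker.
print(round(job_cost("A100", 600), 3))

# One active A100 worker kept running for a full day.
print(round(job_cost("A100", 24 * 60 * 60, "active"), 2))
```

Under these assumptions, the 10-minute flex job comes to about \$0.46, and a full day of one active A100 worker to about \$51.84.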

## Worker scaling

### Active workers

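Minimum number of workers that remain warm and ready at all times. Setting this to 1 or more eliminates cold starts for requests those workers can handle. Active workers are billed even when idle, but at a rate roughly 20-30% below the flex rate (compare the flex and active columns in the [GPU configuration](#gpu-configuration) table).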
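
Whether an active worker pays off depends on how busy it would be. The sketch below is a simplified comparison, not a billing calculator: it assumes a flex worker is billed only while processing (ignoring idle timeout and cold start time) and that an active worker is billed for all 24 hours. The function names are hypothetical, and the rates come from the GPU pricing table.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def active_discount(flex_rate: float, active_rate: float) -> float:
    """Fractional discount of the active rate relative to the flex rate."""
    return 1 - active_rate / flex_rate


def cheaper_worker_type(flex_rate: float, active_rate: float,
                        busy_seconds_per_day: float) -> str:
    """Compare one always-on active worker with one flex worker.

    Simplification: the flex worker is charged only for busy seconds,
    so idle timeout and cold start billing are ignored.
    """
    flex_cost = flex_rate * busy_seconds_per_day
    active_cost = active_rate * SECONDS_PER_DAY
    return "active" if active_cost < flex_cost else "flex"


# A100 rates from the pricing table: flex 0.00076, active 0.00060.
print(f"{active_discount(0.00076, 0.00060):.0%}")
print(cheaper_worker_type(0.00076, 0.00060, busy_seconds_per_day=20 * 3600))
print(cheaper_worker_type(0.00076, 0.00060, busy_seconds_per_day=12 * 3600))
```

With these A100 rates, the active worker only comes out ahead once the worker is busy for roughly 79% of the day or more. Below that, the case for an active worker rests on avoiding cold starts rather than on cost.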

### Max workers

Maximum concurrent instances your endpoint can scale to. Acts as a cost safety limit and concurrency cap. Set \~20% higher than expected max concurrency to handle traffic spikes smoothly.
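
The 20% headroom guideline can be turned into a quick calculation. This is only a sketch: `suggested_max_workers` is a hypothetical helper, and the result is rounded up so that the headroom is never lost to truncation.

```python
import math


def suggested_max_workers(expected_peak_concurrency: int,
                          headroom_percent: int = 20) -> int:
    """Return the expected peak concurrency plus headroom, rounded up."""
    if expected_peak_concurrency < 0:
        raise ValueError("expected_peak_concurrency must be non-negative")
    return math.ceil(expected_peak_concurrency * (100 + headroom_percent) / 100)


print(suggested_max_workers(10))  # 10 concurrent jobs plus 20% headroom
print(suggested_max_workers(8))   # 9.6 rounds up
```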

### GPUs per worker

Number of GPUs assigned to each worker instance. Default is 1. Generally prioritize fewer high-end GPUs over multiple lower-tier GPUs.

### Auto-scaling type

**Queue delay**: Adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable for higher utilization.

**Request count**: Scales more aggressively, based on queued plus in-progress requests. The target worker count is `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness. Recommended for LLM workloads or frequent short requests.
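
The request count formula can be reproduced in a few lines to see how the scaler value changes the target. The sketch below mirrors the formula with Python's `math.ceil`. Clamping the result between the active worker count and max workers is an assumption about how the settings combine, based on their descriptions on this page, and `request_count_target` is a hypothetical name.

```python
import math


def request_count_target(requests_in_queue: int, requests_in_progress: int,
                         scaler_value: int, max_workers: int,
                         active_workers: int = 0) -> int:
    """Target worker count under request count scaling.

    The ceil() term is the documented formula. The clamp to
    [active_workers, max_workers] is an assumption for illustration.
    """
    if scaler_value <= 0:
        raise ValueError("scaler_value must be positive")
    raw = math.ceil((requests_in_queue + requests_in_progress) / scaler_value)
    return max(active_workers, min(raw, max_workers))


# 7 queued + 3 in progress with a scaler value of 4: ceil(10 / 4) = 3 workers.
print(request_count_target(7, 3, scaler_value=4, max_workers=10))

# With a scaler value of 1, the formula asks for 10 workers; max workers caps it.
print(request_count_target(7, 3, scaler_value=1, max_workers=5))
```

A lower scaler value means fewer requests per worker and faster scale-out, at the price of more workers running at once.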

## Lifecycle and timeouts

### Idle timeout

How long a worker stays active after completing a request before shutting down. You're billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds.

### Execution timeout

Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep enabled to prevent runaway jobs. Default: 600s (10 min). Range: 5s to 7 days.

Configure in **Advanced** settings, or override per-request via `executionTimeout` in the [job policy](/serverless/endpoints/send-requests#execution-policies).

### Job TTL (time-to-live)

Total lifespan of a job in the system. When TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10s to 7 days.

The timer starts at submission, not execution. If a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution.

<Warning>
  TTL is a hard limit. If it expires while a job is running, the job is immediately removed and status checks return 404. Set TTL to cover both expected queue time and execution time.
</Warning>
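
Because the TTL clock starts at submission, the time left for execution is the TTL minus the time spent queued. The sketch below reproduces the 45-minute example and adds a simple way to size a TTL. Both helper names are hypothetical, and the clamp uses the 10s to 7 days range stated above.

```python
MIN_TTL_SECONDS = 10
MAX_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


def execution_budget_seconds(ttl_seconds: int, queue_seconds: int) -> int:
    """Seconds left for execution once the job leaves the queue."""
    return max(0, ttl_seconds - queue_seconds)


def ttl_covering(expected_queue_seconds: int, expected_execution_seconds: int) -> int:
    """Smallest TTL that covers queue time plus execution time, within range."""
    needed = expected_queue_seconds + expected_execution_seconds
    return min(MAX_TTL_SECONDS, max(MIN_TTL_SECONDS, needed))


# A 1-hour TTL with 45 minutes in the queue leaves 15 minutes (900 s) to run.
print(execution_budget_seconds(ttl_seconds=3600, queue_seconds=45 * 60))

# A job expected to queue for 45 minutes and run for 30 needs at least 75 minutes.
print(ttl_covering(45 * 60, 30 * 60))
```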

Override per-request via `ttl` in the [job policy](/serverless/endpoints/send-requests#execution-policies).
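
A per-request override might be assembled as in the sketch below. The `executionTimeout` and `ttl` field names come from this page. The top-level `input` and `policy` keys and the millisecond units are assumptions made for illustration, so confirm them against the job policy reference before relying on them. `build_request_body` is a hypothetical helper, not an SDK function.

```python
import json
from typing import Any, Optional


def build_request_body(job_input: dict[str, Any],
                       execution_timeout_ms: Optional[int] = None,
                       ttl_ms: Optional[int] = None) -> dict[str, Any]:
    """Build a request body with optional per-request policy overrides.

    Assumes overrides live under a top-level "policy" key and are
    expressed in milliseconds.
    """
    body: dict[str, Any] = {"input": job_input}
    policy: dict[str, int] = {}
    if execution_timeout_ms is not None:
        policy["executionTimeout"] = execution_timeout_ms
    if ttl_ms is not None:
        policy["ttl"] = ttl_ms
    if policy:
        body["policy"] = policy
    return body


# Allow up to 15 minutes of execution within a 1-hour total lifespan.
body = build_request_body(
    {"prompt": "Hello"},
    execution_timeout_ms=15 * 60 * 1000,
    ttl_ms=60 * 60 * 1000,
)
print(json.dumps(body, indent=2))
```

Omitting both overrides leaves the `policy` key out entirely, so the endpoint defaults apply.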

### Result retention

| Request type      | Retention | Notes                                                       |
| ----------------- | --------- | ----------------------------------------------------------- |
| Async (`/run`)    | 30 min    | Retrieve via `/status/{job_id}`                             |
| Sync (`/runsync`) | 1 min     | Returned in response; also available via `/status/{job_id}` |

Results are permanently deleted after retention expires.
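
A client that fetches results later needs to do so inside the retention window. The sketch below computes the latest retrieval time from the table above. It assumes the retention clock starts when the job completes, which this page does not state explicitly, and `result_available_until` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Retention windows from the table above, keyed by request type.
RETENTION = {
    "run": timedelta(minutes=30),
    "runsync": timedelta(minutes=1),
}


def result_available_until(completed_at: datetime, request_type: str) -> datetime:
    """Latest time a result can be fetched, assuming the clock starts at completion."""
    return completed_at + RETENTION[request_type]


completed = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(result_available_until(completed, "run"))
print(result_available_until(completed, "runsync"))
```

If results must outlive these windows, persist them yourself (for example, by having the client store the output as soon as it is retrieved).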

## Performance features

### FlashBoot

Reduces cold starts by retaining worker state after spin-down, allowing faster "revival" than fresh boots. Most effective on endpoints with consistent traffic where workers frequently cycle between active and idle.

### Model

Select from [cached models](/serverless/endpoints/model-caching) to schedule workers on <MachinesTooltip /> with model files pre-loaded. Significantly reduces model loading time during initialization.

## Advanced settings

### Data centers

Restrict your endpoint to specific regions. For maximum availability, allow all data centers; restricting them decreases the available GPU pool.

### Network volumes

[Network volumes](/storage/network-volumes) provide persistent storage across worker restarts. Tradeoffs: adds network latency and restricts your endpoint to the volume's data center. Use only when you need shared persistence or datasets exceeding container limits.

### CUDA version selection

Ensures workers run on <MachinesTooltip /> with compatible drivers. Select your required version plus all newer versions, since CUDA is backward compatible and a wider range increases available hardware.
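
The "required version plus all newer versions" rule amounts to a numeric comparison on version components. The sketch below applies it to a made-up list of versions. The list is hypothetical and does not reflect the options actually offered in the console.

```python
def compatible_cuda_versions(required: str, available: list[str]) -> list[str]:
    """Return the versions at or above `required`, in ascending order."""

    def as_tuple(version: str) -> tuple[int, ...]:
        # Compare numerically so that "12.10" sorts after "12.4".
        return tuple(int(part) for part in version.split("."))

    minimum = as_tuple(required)
    return sorted((v for v in available if as_tuple(v) >= minimum), key=as_tuple)


# Hypothetical list of selectable versions.
options = ["11.8", "12.0", "12.1", "12.4", "12.8"]
print(compatible_cuda_versions("12.1", options))
```

Comparing parsed integers rather than raw strings avoids mis-ordering versions such as 12.10 and 12.4.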

### Expose HTTP/TCP ports

Exposes the worker's public IP and port for direct external communication. Required for persistent connections like WebSockets.
