This quickstart gets you running a Serverless endpoint on Runpod in minutes, using a ready-to-use template to deploy a language model and send a test request.
Requirements

To follow this guide, you'll need a Runpod account and a Runpod API key.

Step 1: Set up your API key
Export your Runpod API key as an environment variable so you can use it in commands:
```bash
export RUNPOD_API_KEY="your_api_key_here"
```
Step 2: Deploy an endpoint
Deploy a vLLM worker with a small, fast language model.
1. Go to the Serverless section and click New Endpoint.
2. Under The Hub, click vLLM.
3. Click Deploy vX.X.X.
4. In the Model field, enter: Qwen/Qwen2.5-0.5B-Instruct
5. Click Next, then Create Endpoint.

Once deployed, note your Endpoint ID from the endpoint details page—you'll need it for API requests.
First, create a template using the vLLM worker image:

```bash
curl --request POST \
  --url https://rest.runpod.io/v1/templates \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "name": "vllm-qwen",
    "imageName": "runpod/worker-vllm:stable-cuda12.1.0",
    "isServerless": true,
    "env": {
      "MODEL_NAME": "Qwen/Qwen2.5-0.5B-Instruct"
    }
  }'
```
Note the id from the response. Then create an endpoint using that template:

```bash
curl --request POST \
  --url https://rest.runpod.io/v1/endpoints \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "name": "my-first-endpoint",
    "templateId": "YOUR_TEMPLATE_ID",
    "gpuTypeIds": ["NVIDIA GeForce RTX 4090", "NVIDIA L4", "NVIDIA RTX A4000"],
    "workersMin": 0,
    "workersMax": 3,
    "idleTimeout": 5
  }'
```
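If you're scripting endpoint creation rather than pasting curl commands, the same request body can be built as a plain dictionary. This is a sketch of the payload only (values mirror the curl example above); the comments on scaling fields reflect how Runpod Serverless generally behaves, so treat them as informal guidance rather than a spec:

```python
import json

# Sketch: the endpoint-creation payload from the curl example, as a Python dict.
payload = {
    "name": "my-first-endpoint",
    "templateId": "YOUR_TEMPLATE_ID",  # the id returned by the template request
    # Listing several GPU types gives the scheduler fallbacks if one is unavailable.
    "gpuTypeIds": ["NVIDIA GeForce RTX 4090", "NVIDIA L4", "NVIDIA RTX A4000"],
    "workersMin": 0,   # scale to zero when idle, so no workers run between requests
    "workersMax": 3,   # cap on how many workers can run concurrently
    "idleTimeout": 5,  # how long an idle worker stays warm before scaling down
}

# Serialize it the same way curl's --data flag would send it.
body = json.dumps(payload)
print(body)
```

You would POST this body to https://rest.runpod.io/v1/endpoints with the same Authorization header as the curl command.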
The response includes your endpoint ID:

```json
{
  "id": "abc123xyz",
  "name": "my-first-endpoint",
  ...
}
```
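If you're automating the flow, a small sketch for pulling the id out of a response body like the one above (the helper name is mine, not part of Runpod's tooling; the same approach works for the template response):

```python
import json

def extract_id(response_body: str) -> str:
    """Return the "id" field from a Runpod REST API response body."""
    data = json.loads(response_body)
    return data["id"]

# Example with the sample response shown above:
sample = '{"id": "abc123xyz", "name": "my-first-endpoint"}'
print(extract_id(sample))  # → abc123xyz
```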
Your endpoint will begin initializing. This takes 1-2 minutes while Runpod provisions resources and loads the model.
Step 3: Send a request
Once your endpoint shows Ready status, send a test request.
Replace YOUR_ENDPOINT_ID with your actual endpoint ID:

```bash
curl --request POST \
  --url "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "input": {
      "prompt": "What is the capital of France?",
      "max_tokens": 100
    }
  }'
```
Create a file called test_endpoint.py:

```python
import os

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # Replace with your endpoint ID
API_KEY = os.environ.get("RUNPOD_API_KEY")

if not API_KEY:
    raise ValueError("RUNPOD_API_KEY environment variable not set")

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "What is the capital of France?",
            "max_tokens": 100,
        }
    },
)
print(response.json())
```
Install the dependency and run the script:

```bash
pip install requests
python test_endpoint.py
```
You should receive a response like this:
```json
{
  "id": "sync-abc123-xyz",
  "status": "COMPLETED",
  "output": {
    "text": "The capital of France is Paris.",
    ...
  }
}
```
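When you consume this response in code, it's worth checking status before reading output. A minimal sketch, assuming the worker returns output in the {"output": {"text": ...}} shape shown above (other workers or vLLM configurations may shape output differently):

```python
def extract_text(result: dict) -> str:
    """Pull generated text from a /runsync response dict.

    Assumes {"output": {"text": ...}} as in the sample response above;
    raises if the job did not finish successfully.
    """
    status = result.get("status")
    if status != "COMPLETED":
        raise RuntimeError(f"Job not completed: {status}")
    return result["output"]["text"]

# Example with the sample response shown above:
sample = {
    "id": "sync-abc123-xyz",
    "status": "COMPLETED",
    "output": {"text": "The capital of France is Paris."},
}
print(extract_text(sample))  # → The capital of France is Paris.
```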
The first request may take 30-60 seconds as the worker loads the model into GPU memory. Subsequent requests complete in just a few seconds.
Step 4: Clean up
To avoid ongoing charges, delete your endpoint when you’re done testing.
1. Go to the Serverless section.
2. Click the three dots on your endpoint and select Delete Endpoint.
3. Type the endpoint name to confirm.
You can also delete the endpoint via the REST API:

```bash
curl --request DELETE \
  --url "https://rest.runpod.io/v1/endpoints/YOUR_ENDPOINT_ID" \
  --header "Authorization: Bearer $RUNPOD_API_KEY"
```
You’ve successfully deployed and tested your first Serverless endpoint.
Next steps
- Build a custom worker: create your own handler function and Docker image.
- Send requests: learn about sync, async, and streaming requests.
- Endpoint settings: configure scaling, timeouts, and GPU selection.
- Configure vLLM: customize your vLLM deployment for different models.