
This quickstart gets you running a Serverless endpoint on Runpod in minutes, using a ready-to-use template to deploy a language model and send a test request.

Requirements

Before you begin, you'll need a Runpod account and a Runpod API key.

Step 1: Set up your environment

Choose your preferred method for interacting with Runpod. If using the CLI or REST API, you’ll need to configure your API key.
Install and configure the Runpod CLI.

macOS/Linux:
# Install runpodctl
curl -fsSL https://install.runpod.io | bash

# Configure with your API key
runpodctl config --apiKey YOUR_API_KEY
Windows:
# Install using PowerShell
iwr -useb https://install.runpod.io/windows | iex

# Configure with your API key
runpodctl config --apiKey YOUR_API_KEY
Verify the installation:
runpodctl version

Step 2: Deploy an endpoint

Deploy a vLLM worker with a small, fast language model.
First, create a Serverless template with the vLLM worker image:
runpodctl template create \
  --name "vllm-qwen" \
  --image "runpod/worker-v1-vllm:stable-cuda12.1.0" \
  --env '{"MODEL_NAME": "Qwen/Qwen2.5-0.5B-Instruct"}' \
  --serverless
Note the template ID from the output. Then create an endpoint using that template:
runpodctl serverless create \
  --name "my-first-endpoint" \
  --template-id YOUR_TEMPLATE_ID \
  --gpu-id "NVIDIA GeForce RTX 4090" \
  --workers-min 0 \
  --workers-max 3
The output includes your endpoint ID:
Endpoint created successfully
ID: abc123xyz
Name: my-first-endpoint
Your endpoint will begin initializing. This takes 1-2 minutes while Runpod provisions resources and loads the model.
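If you'd rather script this step than copy IDs by hand, the two commands above can be chained from Python. This is a minimal sketch: it assumes the `runpodctl` output format matches the `ID: ...` line in the sample above, so adjust the regex if your CLI version prints something different.

```python
import re
import subprocess

def parse_id(cli_output: str) -> str:
    """Extract the resource ID from runpodctl output.

    Assumes an 'ID: <value>' line, matching the sample output above.
    """
    match = re.search(r"^ID:\s*(\S+)", cli_output, re.MULTILINE)
    if not match:
        raise ValueError("No ID found in CLI output")
    return match.group(1)

def deploy(template_name="vllm-qwen", endpoint_name="my-first-endpoint"):
    """Create the template, then the endpoint, returning the endpoint ID."""
    tpl_out = subprocess.run(
        ["runpodctl", "template", "create",
         "--name", template_name,
         "--image", "runpod/worker-v1-vllm:stable-cuda12.1.0",
         "--env", '{"MODEL_NAME": "Qwen/Qwen2.5-0.5B-Instruct"}',
         "--serverless"],
        capture_output=True, text=True, check=True).stdout
    template_id = parse_id(tpl_out)

    ep_out = subprocess.run(
        ["runpodctl", "serverless", "create",
         "--name", endpoint_name,
         "--template-id", template_id,
         "--gpu-id", "NVIDIA GeForce RTX 4090",
         "--workers-min", "0",
         "--workers-max", "3"],
        capture_output=True, text=True, check=True).stdout
    return parse_id(ep_out)
```

Running `deploy()` performs both creations and returns the new endpoint ID for use in Step 3.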

Step 3: Send a request

Once your endpoint shows Ready status, send a test request. If you haven’t already, export your API key in your terminal:
export RUNPOD_API_KEY="your_api_key_here"
Run this command in your terminal, replacing YOUR_ENDPOINT_ID with your actual endpoint ID:
curl --request POST \
  --url "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "input": {
      "prompt": "What is the capital of France?",
      "max_tokens": 100
    }
  }'
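If you prefer Python over curl, the same `/runsync` call can be sketched with the standard library. The URL, headers, and payload below mirror the curl command exactly; the endpoint ID is a placeholder, and the API key is read from the `RUNPOD_API_KEY` environment variable you exported above.

```python
import json
import os
import urllib.request

def build_runsync_request(endpoint_id, prompt, max_tokens=100):
    """Assemble the URL, headers, and JSON body for a /runsync call."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    headers = {
        "Authorization": "Bearer " + os.environ.get("RUNPOD_API_KEY", ""),
        "Content-Type": "application/json",
    }
    body = {"input": {"prompt": prompt, "max_tokens": max_tokens}}
    return url, headers, body

def run_sync(endpoint_id, prompt, max_tokens=100):
    """Send the request and return the parsed JSON response."""
    url, headers, body = build_runsync_request(endpoint_id, prompt, max_tokens)
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(), headers=headers, method="POST"
    )
    # Allow a generous timeout: the first request may wait on a cold start.
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

For example, `run_sync("YOUR_ENDPOINT_ID", "What is the capital of France?")` returns the same JSON shown below.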
You should receive a response like this:
{
  "id": "sync-abc123-xyz",
  "status": "COMPLETED",
  "output": {
    "text": "The capital of France is Paris.",
    ...
  }
}
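When handling this response in code, check the status before reading the output. A small sketch, assuming the field names match the sample response above (other status values such as FAILED are possible):

```python
def extract_text(response: dict) -> str:
    """Return the generated text from a /runsync response, or raise if the job failed."""
    if response.get("status") != "COMPLETED":
        raise RuntimeError(f"Job did not complete: {response.get('status')}")
    return response["output"]["text"]
```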
The first request may take 30-60 seconds as the worker loads the model into GPU memory. Subsequent requests will complete in just a few seconds until the worker scales down due to inactivity.

Step 4: Clean up

To avoid ongoing charges, delete your endpoint when you’re done testing.
List your endpoints to find the ID:
runpodctl serverless list
Delete the endpoint:
runpodctl serverless delete YOUR_ENDPOINT_ID
Optionally, delete the template you created:
runpodctl template delete YOUR_TEMPLATE_ID
You’ve successfully deployed and tested your first Serverless endpoint.

Next steps

Build a custom worker

Create your own handler function and Docker image.

Send requests

Learn about sync, async, and streaming requests.

Endpoint settings

Configure scaling, timeouts, and GPU selection.

Configure vLLM

Customize your vLLM deployment for different models.