This quickstart gets you running a Serverless endpoint on Runpod in minutes, using a ready-to-use template to deploy a language model and send a test request.
Requirements

To follow this guide, you'll need a Runpod account and a Runpod API key.

Step 1: Set up your API key
Export your Runpod API key as an environment variable so you can use it in commands:
```bash
export RUNPOD_API_KEY="your_api_key_here"
```
Step 2: Deploy an endpoint
Deploy a vLLM worker with a small, fast language model.
1. Go to the Serverless section and click New Endpoint.
2. Under The Hub, click vLLM.
3. Click Deploy vX.X.X.
4. In the Model field, enter: Qwen/Qwen2.5-0.5B-Instruct
5. Click Next, then Create Endpoint.

Once deployed, note your Endpoint ID from the endpoint details page—you'll need it for API requests.
First, create a template using the vLLM worker image:

```bash
curl --request POST \
  --url https://rest.runpod.io/v1/templates \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "name": "vllm-qwen",
    "imageName": "runpod/worker-vllm:stable-cuda12.1.0",
    "isServerless": true,
    "env": {
      "MODEL_NAME": "Qwen/Qwen2.5-0.5B-Instruct"
    }
  }'
```
Note the id from the response. Then create an endpoint using that template:

```bash
curl --request POST \
  --url https://rest.runpod.io/v1/endpoints \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "name": "my-first-endpoint",
    "templateId": "YOUR_TEMPLATE_ID",
    "gpuTypeIds": ["NVIDIA GeForce RTX 4090", "NVIDIA L4", "NVIDIA RTX A4000"],
    "workersMin": 0,
    "workersMax": 3,
    "idleTimeout": 5
  }'
```
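If you're scripting endpoint creation rather than pasting curl commands, the same request body can be built as a plain dictionary. This is a sketch of the payload only (values mirror the curl example above); the comments on scaling fields reflect how Runpod Serverless generally behaves, so treat them as informal guidance rather than a spec:

```python
import json

# Sketch: the endpoint-creation payload from the curl example, as a Python dict.
payload = {
    "name": "my-first-endpoint",
    "templateId": "YOUR_TEMPLATE_ID",  # the id returned by the template request
    # Listing several GPU types gives the scheduler fallbacks if one is unavailable.
    "gpuTypeIds": ["NVIDIA GeForce RTX 4090", "NVIDIA L4", "NVIDIA RTX A4000"],
    "workersMin": 0,   # scale to zero when idle, so no workers run between requests
    "workersMax": 3,   # cap on how many workers can run concurrently
    "idleTimeout": 5,  # how long an idle worker stays warm before scaling down
}

# Serialize it the same way curl's --data flag would send it.
body = json.dumps(payload)
print(body)
```

You would POST this body to https://rest.runpod.io/v1/endpoints with the same Authorization header as the curl command.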
The response includes your endpoint ID:

```json
{
  "id": "abc123xyz",
  "name": "my-first-endpoint",
  ...
}
```
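If you're automating the flow, a small sketch for pulling the id out of a response body like the one above (the helper name is mine, not part of Runpod's tooling; the same approach works for the template response):

```python
import json

def extract_id(response_body: str) -> str:
    """Return the "id" field from a Runpod REST API response body."""
    data = json.loads(response_body)
    return data["id"]

# Example with the sample response shown above:
sample = '{"id": "abc123xyz", "name": "my-first-endpoint"}'
print(extract_id(sample))  # → abc123xyz
```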
Your endpoint will begin initializing. This takes 1-2 minutes while Runpod provisions resources and loads the model.
Step 3: Send a request
Once your endpoint shows Ready status, send a test request.
Replace YOUR_ENDPOINT_ID with your actual endpoint ID:

```bash
curl --request POST \
  --url "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "input": {
      "prompt": "What is the capital of France?",
      "max_tokens": 100
    }
  }'
```
Create a file called test_endpoint.py:

```python
import os

import requests

ENDPOINT_ID = "YOUR_ENDPOINT_ID"  # Replace with your endpoint ID
API_KEY = os.environ.get("RUNPOD_API_KEY")

if not API_KEY:
    raise ValueError("RUNPOD_API_KEY environment variable not set")

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "What is the capital of France?",
            "max_tokens": 100,
        }
    },
)
print(response.json())
```
Install the dependency and run the script:

```bash
pip install requests
python test_endpoint.py
```
You should receive a response like this:
```json
{
  "id": "sync-abc123-xyz",
  "status": "COMPLETED",
  "output": {
    "text": "The capital of France is Paris.",
    ...
  }
}
```
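When you consume this response in code, it's worth checking status before reading output. A minimal sketch, assuming the worker returns output in the {"output": {"text": ...}} shape shown above (other workers or vLLM configurations may shape output differently):

```python
def extract_text(result: dict) -> str:
    """Pull generated text from a /runsync response dict.

    Assumes {"output": {"text": ...}} as in the sample response above;
    raises if the job did not finish successfully.
    """
    status = result.get("status")
    if status != "COMPLETED":
        raise RuntimeError(f"Job not completed: {status}")
    return result["output"]["text"]

# Example with the sample response shown above:
sample = {
    "id": "sync-abc123-xyz",
    "status": "COMPLETED",
    "output": {"text": "The capital of France is Paris."},
}
print(extract_text(sample))  # → The capital of France is Paris.
```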
The first request may take 30-60 seconds as the worker loads the model into GPU memory. Subsequent requests complete in just a few seconds.
Step 4: Clean up
To avoid ongoing charges, delete your endpoint when you’re done testing.
1. Go to the Serverless section.
2. Click the three dots on your endpoint and select Delete Endpoint.
3. Type the endpoint name to confirm.
You can also delete the endpoint via the REST API:

```bash
curl --request DELETE \
  --url "https://rest.runpod.io/v1/endpoints/YOUR_ENDPOINT_ID" \
  --header "Authorization: Bearer $RUNPOD_API_KEY"
```
You’ve successfully deployed and tested your first Serverless endpoint.
Next steps
- Build a custom worker: create your own handler function and Docker image.
- Send requests: learn about sync, async, and streaming requests.
- Endpoint settings: configure scaling, timeouts, and GPU selection.
- Configure vLLM: customize your vLLM deployment for different models.