This quickstart walks you through running a Serverless endpoint on Runpod in minutes: you'll use a ready-made template to deploy a language model, then send it a test request.

Requirements

To follow this quickstart, you'll need a Runpod account and a Runpod API key (you can create one from the Settings page of the Runpod console).
Step 1: Set up your API key

Export your Runpod API key as an environment variable so you can use it in commands:
export RUNPOD_API_KEY="your_api_key_here"
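If you're scripting the steps in this guide, it helps to fail fast when the key is missing rather than getting an opaque 401 later. A minimal sketch (the fallback placeholder value is illustrative only; use your real key):

```shell
# Use the exported key, falling back to a placeholder for illustration only.
RUNPOD_API_KEY="${RUNPOD_API_KEY:-your_api_key_here}"

# Abort early with a clear message if the key is unset or empty.
: "${RUNPOD_API_KEY:?Set RUNPOD_API_KEY before continuing}"

# Confirm the key is present without printing the secret itself.
echo "API key detected (${#RUNPOD_API_KEY} characters)"
```

Avoid echoing the key itself in scripts or logs; checking its length is enough to confirm it's set.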

Step 2: Deploy an endpoint

Deploy a vLLM worker with a small, fast language model.
  1. Go to the Serverless section and click New Endpoint.
  2. Under The Hub, click vLLM.
  3. Click Deploy vX.X.X (the version number on the button reflects the current vLLM worker release).
  4. In the Model field, enter: Qwen/Qwen2.5-0.5B-Instruct
  5. Click Next then Create Endpoint.
  6. Once deployed, note your Endpoint ID from the endpoint details page—you’ll need it for API requests.
Your endpoint will begin initializing. This takes 1-2 minutes while Runpod provisions resources and loads the model.

Step 3: Send a request

Once your endpoint shows Ready status, send a test request.
Replace YOUR_ENDPOINT_ID with your actual endpoint ID:
curl --request POST \
  --url "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \
  --header "Authorization: Bearer $RUNPOD_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "input": {
      "prompt": "What is the capital of France?",
      "max_tokens": 100
    }
  }'
You should receive a response like this:
{
  "id": "sync-abc123-xyz",
  "status": "COMPLETED",
  "output": {
    "text": "The capital of France is Paris.",
    ...
  }
}
The first request may take 30-60 seconds while the worker loads the model into GPU memory (a cold start). Subsequent requests complete in just a few seconds.
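If you want to use the reply in a script, you can extract just the generated text from the JSON response. A minimal sketch using jq (assuming jq is installed), shown here against the sample response above; in practice, pipe the curl output in instead:

```shell
# Sample response matching the shape shown above; replace with real curl output.
response='{"id":"sync-abc123-xyz","status":"COMPLETED","output":{"text":"The capital of France is Paris."}}'

# Check the job finished before reading the output.
status=$(echo "$response" | jq -r '.status')
if [ "$status" = "COMPLETED" ]; then
  # Pull out only the generated text.
  echo "$response" | jq -r '.output.text'
else
  echo "Job not complete: $status" >&2
fi
```

The exact output fields can vary by worker and template version, so inspect a real response from your endpoint before relying on a specific structure.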

Step 4: Clean up

To avoid ongoing charges, delete your endpoint when you’re done testing.
  1. Go to the Serverless section.
  2. Click the three dots on your endpoint and select Delete Endpoint.
  3. Type the endpoint name to confirm.
You’ve successfully deployed and tested your first Serverless endpoint.

Next steps

Build a custom worker

Create your own handler function and Docker image.

Send requests

Learn about sync, async, and streaming requests.

Endpoint settings

Configure scaling, timeouts, and GPU selection.

Configure vLLM

Customize your vLLM deployment for different models.