Flash apps let you build APIs to serve AI/ML workloads on Runpod Serverless. This guide walks you through the process of building a Flash app from scratch, from project initialization and local testing to production deployment.
If you haven’t already, we recommend starting with the Quickstart guide to get a feel for how Flash @Endpoint functions work.

Requirements:

Step 1: Initialize a new project

Create a new directory and install Flash using uv:
# Create the project directory and navigate into it:
mkdir flash_app
cd flash_app

# Install Flash:
uv venv
source .venv/bin/activate
uv pip install runpod-flash
Use the flash init command to generate a structured project template with a preconfigured application entry point:
flash init
Make sure your API key is set in the environment, either by creating a .env file or exporting the RUNPOD_API_KEY environment variable:
# Set the API key as an environment variable:
export RUNPOD_API_KEY=YOUR_API_KEY

# Or create a `.env` file:
touch .env && echo "RUNPOD_API_KEY=YOUR_API_KEY" > .env
Replace YOUR_API_KEY with your actual Runpod API key.

Step 2: Explore the project template

This is the structure of the project template created by flash init:
flash_app/
├── lb_worker.py
├── gpu_worker.py
├── cpu_worker.py
├── .env.example
├── .flashignore
├── .gitignore
├── pyproject.toml
├── requirements.txt
└── README.md
This template includes:
  • Example worker files with @Endpoint decorated functions for load-balanced and queue-based endpoints.
  • Templates for requirements.txt, .env.example, .gitignore, etc.
  • Preconfigured endpoint settings for GPU and CPU workers.
When you start the server, it creates API endpoints at /gpu/hello and /cpu/hello, which call the endpoint functions defined in their respective worker files.

Step 3: Install Python dependencies

Install required dependencies:
uv pip install -r requirements.txt

Step 4: Configure your API key

Open the .env template file in a text editor and add your Runpod API key:
# Use your text editor of choice, e.g.
cursor .env
Remove the # symbol from the beginning of the RUNPOD_API_KEY line and replace your_api_key_here with your actual Runpod API key:
RUNPOD_API_KEY=your_api_key_here
# FLASH_HOST=localhost
# FLASH_PORT=8888
# LOG_LEVEL=INFO
Save the file and close it.
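Flash reads these values from the environment at startup. If your own scripts also need the key, a minimal stdlib-only check can fail fast with a clear message instead of surfacing later as a 401; the helper name below is illustrative, not part of the Flash API:

```python
import os

def require_api_key() -> str:
    """Return the Runpod API key from the environment, or fail fast."""
    key = os.environ.get("RUNPOD_API_KEY")
    if not key:
        raise RuntimeError(
            "RUNPOD_API_KEY is not set; export it or add it to your .env file"
        )
    return key
```

Calling `require_api_key()` once at the top of a script makes a missing key an immediate, readable error.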

Step 5: Start the local API server

Use flash run to start the API server:
flash run
Open a new terminal tab or window and test your endpoints using cURL:
# Test the queue-based GPU endpoint
curl -X POST http://localhost:8888/gpu_worker/runsync \
    -H "Content-Type: application/json" \
    -d '{"message": "Hello from the GPU!"}'

# Test the load-balanced endpoint
curl -X POST http://localhost:8888/lb_worker/process \
    -H "Content-Type: application/json" \
    -d '{"data": "test"}'
If you switch back to the terminal tab where you used flash run, you’ll see the details of the job’s progress.
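If you prefer testing from Python instead of cURL, a small stdlib-only helper can send the same JSON payloads. This is a sketch (the `post_json` name is illustrative), assuming the local server from `flash run` is listening on port 8888:

```python
import json
import urllib.request

def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Mirrors the cURL examples above:
# post_json("http://localhost:8888/gpu_worker/runsync",
#           {"message": "Hello from the GPU!"})
# post_json("http://localhost:8888/lb_worker/process", {"data": "test"})
```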

Faster testing with auto-provisioning

For development with multiple endpoints, use --auto-provision to deploy all resources before testing:
flash run --auto-provision
This eliminates cold-start delays by provisioning all serverless endpoints upfront. Endpoints are cached and reused across server restarts, making subsequent runs faster. Resources are identified by name, so the same endpoint won’t be re-deployed if the configuration hasn’t changed.

Step 6: Open the API explorer

Besides starting the API server, flash run also starts an interactive API explorer. Point your web browser at http://localhost:8888/docs to explore the API. To run endpoint functions in the explorer:
  1. Expand one of the functions under GPU Workers or CPU Workers.
  2. Click Try it out and then Execute.
You’ll get a response from your workers right in the explorer.

Step 7: Customize your endpoints

To customize your endpoints:
  1. Edit the @Endpoint functions in your worker files (lb_worker.py, gpu_worker.py, cpu_worker.py).
  2. Add new worker files for new endpoints.
  3. Test individual workers by running them as scripts (e.g., python gpu_worker.py).
  4. Restart the development server to pick up changes.
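Because the example endpoint functions are async, running a worker as a script usually means driving the function with `asyncio`. A minimal sketch of the pattern, using a stand-in `hello` function in place of your `@Endpoint`-decorated one:

```python
import asyncio

async def hello(message: str) -> dict:
    # Stand-in for an @Endpoint-decorated function in gpu_worker.py.
    return {"echo": message}

if __name__ == "__main__":
    # Drive the async function directly for a quick local check.
    result = asyncio.run(hello("Hello from the GPU!"))
    print(result)
```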

Example: Adding a custom GPU endpoint

To add a new GPU endpoint for image generation, create a new worker file or modify an existing one. For deployed apps, each queue-based function needs its own unique endpoint configuration:
from runpod_flash import Endpoint, GpuType

@Endpoint(
    name="image_generator",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,
    workers=2,
    dependencies=["diffusers", "torch", "transformers", "pillow"]
)
async def generate_image(prompt: str, width: int = 512, height: int = 512) -> dict:
    # Import heavy dependencies inside the function so they're only
    # loaded on the worker where they're installed.
    import torch
    from diffusers import StableDiffusionPipeline
    import base64
    import io

    # Load the Stable Diffusion pipeline onto the GPU in half precision.
    pipeline = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16
    ).to("cuda")

    # Generate the image from the prompt.
    image = pipeline(prompt=prompt, width=width, height=height).images[0]

    # Encode the PNG bytes as base64 so they fit in a JSON response.
    buffered = io.BytesIO()
    image.save(buffered, format="PNG")
    img_str = base64.b64encode(buffered.getvalue()).decode()

    return {"image": img_str, "prompt": prompt}
This creates a new Serverless endpoint specifically for image generation. When deployed, it will be available at its own endpoint URL with its own /run or /runsync routes.
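On the client side, the base64 string returned by `generate_image` has to be decoded back into PNG bytes before it can be viewed. A sketch of that step (the helper name is illustrative), assuming a response shaped like the return value above:

```python
import base64
from pathlib import Path

def save_image_from_response(response: dict, path: str) -> int:
    """Decode the base64 `image` field of a response and write it to disk.

    Returns the number of bytes written.
    """
    png_bytes = base64.b64decode(response["image"])
    return Path(path).write_bytes(png_bytes)

# Example with a stand-in payload instead of a real model output:
# fake = {"image": base64.b64encode(b"...png bytes...").decode(), "prompt": "a cat"}
# save_image_from_response(fake, "out.png")
```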

Step 8: Deploy to Runpod

When you’re ready to deploy your app to Runpod, use flash deploy:
flash deploy
This command:
  1. Builds your application into a deployment artifact.
  2. Uploads it to Runpod’s storage.
  3. Provisions independent Serverless endpoints for each endpoint configuration.
  4. Configures service discovery for inter-endpoint communication.
After deployment, you’ll receive URLs for all deployed endpoints, grouped by configuration type:
✓ Deployment Complete

Load-balanced endpoints:
  https://abc123xyz.api.runpod.ai  (lb_worker)
    POST   /process
    GET    /health

Queue-based endpoints:
  https://api.runpod.ai/v2/def456xyz  (gpu_worker)
  https://api.runpod.ai/v2/ghi789xyz  (cpu_worker)
All requests to deployed endpoints require authentication with your Runpod API key. For example:
# Call a load-balanced endpoint
curl -X POST https://abc123xyz.api.runpod.ai/process \
    -H "Authorization: Bearer $RUNPOD_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": {}}'

# Call a queue-based endpoint
curl -X POST https://api.runpod.ai/v2/def456xyz/runsync \
    -H "Authorization: Bearer $RUNPOD_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"input": {}}'
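The same authenticated calls can be made from Python. A stdlib-only sketch that attaches the bearer-token header (the function name is illustrative, and the endpoint ID in the commented call is a placeholder):

```python
import json
import os
import urllib.request

def call_endpoint(url: str, payload: dict, api_key: str) -> dict:
    """POST to a deployed endpoint with bearer-token authentication."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# call_endpoint(
#     "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync",
#     {"input": {}},
#     os.environ["RUNPOD_API_KEY"],
# )
```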
For detailed deployment options including environment management, see Deploy Flash apps.

Next steps