After running flash init, you have a working project template with example endpoints and configuration. This guide shows you how to customize the template to build your own application.

Endpoint types

Flash supports two endpoint types, each suited for different use cases:
Type | Best for | Functions per endpoint
--- | --- | ---
Queue-based | Long-running GPU tasks | One
Load-balanced | Fast HTTP APIs | Multiple (via routes)
Each @Endpoint function creates a separate Serverless endpoint:
from runpod_flash import Endpoint, GpuType

@Endpoint(name="preprocess", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
def preprocess(data): ...

@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
def run_model(input): ...
Call each endpoint via /run (asynchronous) or /runsync (synchronous): https://api.runpod.ai/v2/{endpoint_id}/runsync
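As a concrete example, here is a minimal client sketch for calling a deployed queue-based endpoint synchronously. The endpoint ID and API key are placeholders you would get from the Runpod console after deploying; queue-based Serverless endpoints expect the payload wrapped in an "input" key.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"   # placeholder: your Runpod API key
ENDPOINT_ID = "abc123"     # placeholder: the ID of a deployed endpoint

def runsync_url(endpoint_id: str) -> str:
    """Build the synchronous invocation URL for a queue-based endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/runsync"

def build_request(endpoint_id: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in an 'input' key, as queue-based endpoints expect."""
    body = json.dumps({"input": payload}).encode()
    return urllib.request.Request(
        runsync_url(endpoint_id),
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

# req = build_request(ENDPOINT_ID, {"data": [1, 2, 3]})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

For long-running jobs, prefer /run over /runsync so the request returns a job ID immediately instead of blocking until completion.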

Add load balancing routes

To add routes to an existing load-balanced endpoint, use the route decorator pattern:
lb_worker.py
from runpod_flash import Endpoint

api = Endpoint(name="lb_worker", cpu="cpu5c-4-8", workers=(1, 5))

# Existing routes
@api.post("/process")
async def process(input_data: dict) -> dict:
    # ... existing code ...
    pass

# Add a new route
@api.get("/status")
async def get_status() -> dict:
    return {"status": "healthy", "version": "1.0"}
All routes share the same lb_worker Serverless endpoint. Each route is accessible at its defined path. Key points:
  • Multiple routes can share one endpoint configuration
  • Each route has its own HTTP method and path
  • All routes on the same endpoint deploy to one Serverless endpoint
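For instance, once the app is running locally with flash run, the two routes above can be exercised with a small client sketch. The base URL assumes the default local dev server address; for a deployed endpoint you would substitute its public URL.

```python
import json
import urllib.request

BASE = "http://localhost:8888"  # flash run dev server; use the deployed URL in production

def status_request() -> urllib.request.Request:
    # GET route defined by @api.get("/status")
    return urllib.request.Request(f"{BASE}/status")

def process_request(payload: dict) -> urllib.request.Request:
    # POST route defined by @api.post("/process")
    return urllib.request.Request(
        f"{BASE}/process",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# with urllib.request.urlopen(process_request({"text": "hello"})) as resp:
#     print(json.load(resp))
```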

Add queue-based endpoints

To add a new queue-based endpoint, create a new endpoint with a unique name:
gpu_worker.py
from runpod_flash import Endpoint, GpuType

# Existing endpoint
@Endpoint(
    name="gpu-inference",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3,
    dependencies=["torch"]
)
async def run_inference(input: dict) -> dict:
    import torch
    # Inference logic
    return {"result": "processed"}

# New endpoint for a different workload
@Endpoint(
    name="gpu-training",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=1,
    dependencies=["torch", "transformers"]
)
async def train_model(config: dict) -> dict:
    import torch
    from transformers import Trainer
    # Training logic
    return {"model_path": "/models/trained"}
This creates two separate Serverless endpoints, each with its own URL and scaling configuration.
Do not reuse the same endpoint name across multiple queue-based functions when deploying Flash apps: each queue-based @Endpoint must have a unique name parameter.
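A quick way to catch accidental name collisions before deploying is a simple uniqueness check. This is just a sketch, assuming you keep your endpoint names listed in one place:

```python
from collections import Counter

def duplicate_names(names: list[str]) -> list[str]:
    """Return endpoint names that appear more than once."""
    return sorted(n for n, count in Counter(names).items() if count > 1)

# The two queue-based endpoints above use distinct names, so this passes:
assert duplicate_names(["gpu-inference", "gpu-training"]) == []
```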

Modify endpoint configurations

Customize endpoint configurations for each worker function in your app. Each @Endpoint function can have its own GPU type, scaling parameters, and timeouts optimized for its specific workload.
# Example: Different configs for different workloads
@Endpoint(
    name="preprocess",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,  # Cost-effective for preprocessing
    workers=(0, 5)
)
async def preprocess(data): ...

@Endpoint(
    name="inference",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,  # High VRAM for large models
    workers=(1, 10)  # Keep one worker ready
)
async def inference(data): ...
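When several endpoints share most of their configuration, one tidy pattern is to keep shared defaults in a dict and override only what differs per workload. A sketch (the default values here are illustrative, not recommendations):

```python
# Shared defaults for every endpoint in the app (values are illustrative).
DEFAULTS = {"dependencies": ["torch"], "workers": (0, 3)}

def endpoint_config(**overrides) -> dict:
    """Merge per-endpoint overrides over the shared defaults."""
    return {**DEFAULTS, **overrides}

preprocess_cfg = endpoint_config(name="preprocess", workers=(0, 5))
inference_cfg = endpoint_config(name="inference", workers=(1, 10))

# Applied as: @Endpoint(gpu=..., **preprocess_cfg)
```

This keeps workload-specific choices (GPU type, scaling) visible at each decorator while shared settings live in one place.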
For a complete reference of configuration options, see Configure hardware resources below.

Test your customizations

After customizing your app, test locally with flash run:
flash run
This starts a development server at http://localhost:8888 with:
  • Interactive API documentation at /docs
  • Auto-reload on code changes
  • Real remote execution on Runpod workers
Make sure to test:
  • All HTTP routes work as expected
  • Endpoint functions execute correctly
  • Dependencies install properly
  • Error handling works
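The checklist above can be partly automated with a small smoke-test script against the local dev server. This is a sketch: it assumes the default flash run address and only checks that a few known paths respond, reporting failures rather than raising.

```python
import urllib.request

BASE = "http://localhost:8888"  # flash run dev server (default address)

def smoke_checks(paths=("/docs", "/status")) -> list[str]:
    """Hit a few routes and collect failure descriptions instead of raising."""
    failures = []
    for path in paths:
        try:
            with urllib.request.urlopen(BASE + path, timeout=5) as resp:
                if resp.status != 200:
                    failures.append(f"{path}: HTTP {resp.status}")
        except OSError as exc:  # connection refused, timeout, HTTP errors
            failures.append(f"{path}: {exc}")
    return failures

# print(smoke_checks() or "all checks passed")
```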

Next steps

Test locally

Use flash run for local development and testing.

Deploy to Runpod

Deploy your application to production with flash deploy.

Configure hardware resources

Complete reference for configuration options.

Create endpoint functions

Learn more about writing and optimizing endpoint functions.