After running flash init, you have a working project template with example endpoints and configuration. This guide shows you how to customize the template to build your own application.

Endpoint types

Flash supports two endpoint types, each suited for different use cases:
Type | Best for | Functions per endpoint
--- | --- | ---
Queue-based | Long-running GPU tasks | One
Load-balanced | Fast HTTP APIs | Multiple (via routes)
Each @Endpoint function creates a separate Serverless endpoint:
from runpod_flash import Endpoint, GpuType

@Endpoint(name="preprocess", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
def preprocess(data): ...

@Endpoint(name="inference", gpu=GpuType.NVIDIA_A100_80GB_PCIe)
def run_model(input): ...
Call each endpoint via /run (asynchronous) or /runsync (synchronous): https://api.runpod.ai/v2/{endpoint_id}/runsync
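As a concrete example, here is a minimal client sketch for calling a deployed queue-based endpoint synchronously. The endpoint ID and API key are placeholders you would get from the Runpod console after deploying; queue-based Serverless endpoints expect the payload wrapped in an "input" key.

```python
import json
import urllib.request

API_KEY = "YOUR_API_KEY"   # placeholder: your Runpod API key
ENDPOINT_ID = "abc123"     # placeholder: the ID of a deployed endpoint

def runsync_url(endpoint_id: str) -> str:
    """Build the synchronous invocation URL for a queue-based endpoint."""
    return f"https://api.runpod.ai/v2/{endpoint_id}/runsync"

def build_request(endpoint_id: str, payload: dict) -> urllib.request.Request:
    """Wrap the payload in an 'input' key, as queue-based endpoints expect."""
    body = json.dumps({"input": payload}).encode()
    return urllib.request.Request(
        runsync_url(endpoint_id),
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )

# req = build_request(ENDPOINT_ID, {"data": [1, 2, 3]})
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

For long-running jobs, prefer /run over /runsync so the request returns a job ID immediately instead of blocking until completion.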

Add load balancing routes

To add routes to an existing load-balanced endpoint, use the route decorator pattern:
lb_worker.py
from runpod_flash import Endpoint

api = Endpoint(name="lb_worker", cpu="cpu5c-4-8", workers=(1, 5))

# Existing routes
@api.post("/process")
async def process(input_data: dict) -> dict:
    # ... existing code ...
    pass

# Add a new route
@api.get("/status")
async def get_status() -> dict:
    return {"status": "healthy", "version": "1.0"}
All routes share the same lb_worker Serverless endpoint. Each route is accessible at its defined path. Key points:
  • Multiple routes can share one endpoint configuration
  • Each route has its own HTTP method and path
  • All routes on the same endpoint deploy to one Serverless endpoint
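For instance, once the app is running locally with flash run, the two routes above can be exercised with a small client sketch. The base URL assumes the default local dev server address; for a deployed endpoint you would substitute its public URL.

```python
import json
import urllib.request

BASE = "http://localhost:8888"  # flash run dev server; use the deployed URL in production

def status_request() -> urllib.request.Request:
    # GET route defined by @api.get("/status")
    return urllib.request.Request(f"{BASE}/status")

def process_request(payload: dict) -> urllib.request.Request:
    # POST route defined by @api.post("/process")
    return urllib.request.Request(
        f"{BASE}/process",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# with urllib.request.urlopen(process_request({"text": "hello"})) as resp:
#     print(json.load(resp))
```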

Add queue-based endpoints

To add a new queue-based endpoint, create a new endpoint with a unique name:
gpu_worker.py
from runpod_flash import Endpoint, GpuType

# Existing endpoint
@Endpoint(
    name="gpu-inference",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=3,
    dependencies=["torch"]
)
async def run_inference(input: dict) -> dict:
    import torch
    # Inference logic
    return {"result": "processed"}

# New endpoint for a different workload
@Endpoint(
    name="gpu-training",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,
    workers=1,
    dependencies=["torch", "transformers"]
)
async def train_model(config: dict) -> dict:
    import torch
    from transformers import Trainer
    # Training logic
    return {"model_path": "/models/trained"}
This creates two separate Serverless endpoints, each with its own URL and scaling configuration.
Do not reuse the same endpoint name across multiple queue-based functions when deploying Flash apps: each queue-based @Endpoint must have a unique name parameter.
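A quick way to catch accidental name collisions before deploying is a simple uniqueness check. This is just a sketch, assuming you keep your endpoint names listed in one place:

```python
from collections import Counter

def duplicate_names(names: list[str]) -> list[str]:
    """Return endpoint names that appear more than once."""
    return sorted(n for n, count in Counter(names).items() if count > 1)

# The two queue-based endpoints above use distinct names, so this passes:
assert duplicate_names(["gpu-inference", "gpu-training"]) == []
```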

Modify endpoint configurations

Customize endpoint configurations for each worker function in your app. Each @Endpoint function can have its own GPU type, scaling parameters, and timeouts optimized for its specific workload.
# Example: Different configs for different workloads
@Endpoint(
    name="preprocess",
    gpu=GpuType.NVIDIA_GEFORCE_RTX_4090,  # Cost-effective for preprocessing
    workers=(0, 5)
)
async def preprocess(data): ...

@Endpoint(
    name="inference",
    gpu=GpuType.NVIDIA_A100_80GB_PCIe,  # High VRAM for large models
    workers=(1, 10)  # Keep one worker ready
)
async def inference(data): ...
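When several endpoints share most of their configuration, one tidy pattern is to keep shared defaults in a dict and override only what differs per workload. A sketch (the default values here are illustrative, not recommendations):

```python
# Shared defaults for every endpoint in the app (values are illustrative).
DEFAULTS = {"dependencies": ["torch"], "workers": (0, 3)}

def endpoint_config(**overrides) -> dict:
    """Merge per-endpoint overrides over the shared defaults."""
    return {**DEFAULTS, **overrides}

preprocess_cfg = endpoint_config(name="preprocess", workers=(0, 5))
inference_cfg = endpoint_config(name="inference", workers=(1, 10))

# Applied as: @Endpoint(gpu=..., **preprocess_cfg)
```

This keeps workload-specific choices (GPU type, scaling) visible at each decorator while shared settings live in one place.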
For a complete reference of configuration options, see Configure hardware resources below.

Test your customizations

After customizing your app, test locally with flash run:
flash run
This starts a development server at http://localhost:8888 with:
  • Interactive API documentation at /docs
  • Auto-reload on code changes
  • Real remote execution on Runpod workers
Make sure to test:
  • All HTTP routes work as expected
  • Endpoint functions execute correctly
  • Dependencies install properly
  • Error handling works
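The checklist above can be partly automated with a small smoke-test script against the local dev server. This is a sketch: it assumes the default flash run address and only checks that a few known paths respond, reporting failures rather than raising.

```python
import urllib.request

BASE = "http://localhost:8888"  # flash run dev server (default address)

def smoke_checks(paths=("/docs", "/status")) -> list[str]:
    """Hit a few routes and collect failure descriptions instead of raising."""
    failures = []
    for path in paths:
        try:
            with urllib.request.urlopen(BASE + path, timeout=5) as resp:
                if resp.status != 200:
                    failures.append(f"{path}: HTTP {resp.status}")
        except OSError as exc:  # connection refused, timeout, HTTP errors
            failures.append(f"{path}: {exc}")
    return failures

# print(smoke_checks() or "all checks passed")
```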

Next steps

Test locally

Use flash run for local development and testing.

Deploy to Runpod

Deploy your application to production with flash deploy.

Configure hardware resources

Complete reference for configuration options.

Create endpoint functions

Learn more about writing and optimizing endpoint functions.