- Scale beyond single machines: Train models too large for one GPU, or accelerate training across multiple nodes.
- High-speed networking: 1600-3200 Gbps between nodes for efficient gradient synchronization and data movement.
- Zero configuration: Pre-configured static IPs, environment variables, and framework support.
- On-demand: Deploy in minutes, pay only for what you use.
Get started
Deploy a Slurm cluster
Managed Slurm for HPC workloads.
PyTorch distributed training
Multi-node PyTorch for deep learning.
Axolotl fine-tuning
Fine-tune LLMs across multiple GPUs.
How it works
Runpod provisions multiple GPU nodes in the same data center, connected with high-speed networking. One node is designated primary (NODE_RANK=0), and all nodes receive pre-configured environment variables for distributed communication.
The high-speed interfaces (ens1-ens8) handle inter-node communication, such as gradient synchronization and data movement between nodes. The eth0 interface on the primary node handles external traffic. See the configuration reference for environment variables and network details.
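As a minimal sketch of how a training script can use these pre-configured variables, the snippet below initializes PyTorch's distributed process group on each node. NODE_RANK comes from the cluster environment as described above; the MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK names are the standard PyTorch conventions and are assumed here, so check the configuration reference for the exact variables your cluster exposes.

```python
# Sketch: joining the distributed process group on an Instant Cluster node.
# Assumes the standard PyTorch env vars (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE, LOCAL_RANK) are populated, e.g. by torchrun or the cluster
# environment; only NODE_RANK is explicitly documented above.
import os

import torch
import torch.distributed as dist


def init_distributed() -> None:
    # With init_method="env://", PyTorch reads MASTER_ADDR, MASTER_PORT,
    # RANK, and WORLD_SIZE from the environment.
    dist.init_process_group(backend="nccl", init_method="env://")

    # Pin this process to its local GPU before creating models or tensors.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    print(
        f"node_rank={os.environ.get('NODE_RANK')} "
        f"rank={dist.get_rank()} world_size={dist.get_world_size()}"
    )


if __name__ == "__main__":
    init_distributed()
    # ... build model, wrap in DistributedDataParallel, train ...
    dist.destroy_process_group()
```

In a typical multi-node setup, one copy of this script runs per GPU on every node (for example, launched by torchrun with the primary node's address as the rendezvous endpoint), and NCCL uses the high-speed interfaces for the collective operations.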
Supported hardware
| GPU | Network speed | Nodes |
|---|---|---|
| B200 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| H200 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| H100 | 3200 Gbps | 2-8 nodes (16-64 GPUs) |
| A100 | 1600 Gbps | 2-8 nodes (16-64 GPUs) |
Pricing
Pricing is based on GPU type and the number of nodes. See Instant Clusters pricing for current rates. Custom pricing is available for enterprise workloads; contact our sales team for details. All accounts have a default spending limit. To deploy larger clusters, contact help@runpod.io.