Training GPT-4 reportedly used around 25,000 A100 GPUs for roughly 90 days. LLaMA 4's training reportedly involved a cluster of more than 100,000 H100s. The arithmetic sounds simple - more GPUs, bigger model - but getting 100,000 GPUs to work together efficiently is one of the hardest distributed systems problems in existence.

The headline GPU count is misleading. The hard part isn’t having 100,000 GPUs. The hard part is ensuring all 100,000 are actually computing useful work at any given moment.

The Utilization Problem

A single H100 GPU running a transformer forward pass can achieve 70-80% of theoretical FLOP utilization. 100,000 H100s working together achieve closer to 30-40% on large models. The gap represents the cost of coordination.

GPUs go idle when they’re waiting for:

  • Data from other GPUs (communication overhead)
  • Gradient synchronization across the cluster (AllReduce operations)
  • The pipeline bubble in pipeline parallelism
  • Checkpointing to persistent storage

The entire discipline of training at this scale is minimizing these idle periods. Each percentage point of GPU utilization recovered represents millions of dollars of training cost and weeks of training time.
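As a sanity check on those utilization numbers, model FLOP utilization (MFU) can be estimated from training throughput using the common ~6 FLOPs-per-parameter-per-token approximation for transformer training. The model size and throughput below are illustrative assumptions, not figures from any real run:

```python
# Rough model-FLOP-utilization (MFU) estimate, using the common
# ~6 * params FLOPs-per-token approximation for transformer training.
# All numbers below are illustrative, not from any real run.

def mfu(params: float, tokens_per_sec: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    achieved = 6 * params * tokens_per_sec   # training FLOP/s actually delivered
    peak = num_gpus * peak_flops_per_gpu     # cluster's theoretical peak FLOP/s
    return achieved / peak

# Hypothetical: 400B-param model, 16M tokens/s, 100k H100s (~989 TFLOP/s bf16)
print(round(mfu(400e9, 16e6, 100_000, 989e12), 3))  # -> 0.388
```

At these hypothetical numbers the cluster lands in the 30-40% MFU band described above.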

The Parallelism Stack

At 100,000+ GPU scale, you can’t just use data parallelism (copies of the model training on different batches). A single LLaMA 4 model with hundreds of billions of parameters doesn’t fit in the memory of a single GPU or even a single node. Meta uses a combination:

Tensor parallelism: Individual weight matrices are split across multiple GPUs. During a matrix multiply, each GPU handles a portion of the computation and they communicate the partial results. Requires high-bandwidth interconnects because communication is frequent and in the critical path.

Pipeline parallelism: Model layers are split into stages, each running on different GPUs. GPUs process different micro-batches simultaneously (the pipeline). The “pipeline bubble” - when early stages are idle waiting for backpropagation to reach them - is the efficiency cost.
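The bubble cost has a simple closed form for a GPipe-style schedule: with p pipeline stages and m micro-batches per step, the idle fraction is (p - 1) / (m + p - 1), which is why more micro-batches make deeper pipelines tolerable. A small sketch:

```python
# Idle ("bubble") fraction of a GPipe-style pipeline schedule:
# with p stages and m micro-batches, bubble = (p - 1) / (m + p - 1).
# More micro-batches amortize the ramp-up/ramp-down idle time.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (8, 32, 128):
    print(m, round(bubble_fraction(8, m), 3))  # 8 stages: 0.467, 0.179, 0.052
```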

Data parallelism: Multiple copies of the model (or model shards) process different batches. Gradients are averaged across copies. This scales the batch size and amortizes the fixed communication costs.

Meta's 4D parallelism combines all of these: the four dimensions are data, tensor, pipeline, and sequence (context) parallelism, with the last added to handle very long context training. Each GPU participates in all four dimensions simultaneously.
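The tensor-parallel idea can be shown in a few lines of NumPy, with arrays standing in for GPUs. This is a column-parallel split; in a real system the final concatenation would be an all-gather collective over the interconnect:

```python
import numpy as np

# Toy column-parallel linear layer: each "GPU" holds a vertical slice of
# the weight matrix, computes its partial output locally, and the slices
# are concatenated (an all-gather in a real system).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))      # activations: batch x hidden
w = rng.standard_normal((16, 32))     # full weight matrix

shards = np.split(w, 4, axis=1)       # 4-way tensor parallelism
partials = [x @ s for s in shards]    # each device's local matmul
y = np.concatenate(partials, axis=1)  # reassemble the full output

assert np.allclose(y, x @ w)          # matches the unsharded result
```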

The Networking Requirement

The communication between GPUs during training is not like regular networking. Moving a 100-billion-parameter model’s gradients requires transferring terabytes of data per second across the cluster.

NVIDIA's NVLink provides 900 GB/s of GPU-to-GPU bandwidth within a node (8 H100s connected via NVLink and NVSwitch). Between nodes, InfiniBand NDR (400 Gb/s) or HDR (200 Gb/s) is the standard. Meta's clusters use network topologies such as fat-tree or dragonfly to minimize the number of hops between any two nodes.
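Some back-of-envelope arithmetic shows why the gap between intra-node and inter-node bandwidth matters. The 100B-parameter gradient size is illustrative, and real clusters bond many links and overlap communication with compute, so these are upper bounds on naive single-link transfers, not measured step times:

```python
# Back-of-envelope: time to move one model's fp16 gradients at NVLink
# vs. single-link InfiniBand speeds. Illustrative only - real clusters
# bond many links and overlap communication with compute.
grad_bytes = 100e9 * 2          # 100B parameters in fp16
nvlink_bw = 900e9               # ~900 GB/s NVLink bandwidth per H100
ndr_bw = 400e9 / 8              # 400 Gb/s InfiniBand NDR = 50 GB/s

print(round(grad_bytes / nvlink_bw, 2), "s at NVLink speed")   # -> 0.22
print(round(grad_bytes / ndr_bw, 2), "s over one NDR link")    # -> 4.0
```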

The ratio of compute to communication bandwidth is carefully managed. If communication is the bottleneck, GPUs sit idle; training is efficient only when compute is the limiting factor and communication can be overlapped behind it.

For AllReduce operations (averaging gradients across all data-parallel workers), the standard algorithm is Ring-AllReduce: each GPU sends a chunk of the gradients to the next GPU in a ring, forwarding partially accumulated values as they go. This is bandwidth-optimal, but its latency grows with ring size, which requires careful management at scale.
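A minimal pure-Python simulation of Ring-AllReduce makes the two phases concrete, with lists standing in for GPU buffers (real implementations live in libraries like NCCL):

```python
# Minimal Ring-AllReduce sketch: n workers, each holding n chunks.
# Phase 1 (reduce-scatter): after n-1 steps, worker i holds the fully
# summed chunk (i + 1) % n. Phase 2 (all-gather): the reduced chunks
# circulate until every worker has all of them. Per-worker traffic is
# 2 * (n - 1) / n of the data size - the bandwidth-optimal bound.

def ring_allreduce(buffers):
    n = len(buffers)
    for step in range(n - 1):                    # reduce-scatter
        for i in range(n):
            c = (i - step) % n                   # chunk worker i forwards
            buffers[(i + 1) % n][c] += buffers[i][c]
    for step in range(n - 1):                    # all-gather
        for i in range(n):
            c = (i + 1 - step) % n               # reduced chunk to forward
            buffers[(i + 1) % n][c] = buffers[i][c]
    return buffers

workers = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0],
           [9.0, 10.0, 11.0, 12.0], [13.0, 14.0, 15.0, 16.0]]
print(ring_allreduce(workers)[0])   # -> [28.0, 32.0, 36.0, 40.0]
```

Every worker ends the all-gather phase holding the same column sums, having sent only one chunk per step.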

Fault Tolerance at 100,000 GPUs

Hardware failure probability scales with fleet size. At 100,000 GPUs, if each GPU fails once every 3 years, you expect a failure every 15 minutes on average. Training a model over 90 days without fault tolerance would be impossible.
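The arithmetic behind that failure rate, assuming independent failures so the fleet's mean time between failures is the per-device figure divided by fleet size:

```python
# Fleet-level failure rate: with independent failures, the fleet's mean
# time between failures is the per-device MTBF divided by fleet size.
per_gpu_mtbf_hours = 3 * 365 * 24               # one failure per GPU per 3 years
fleet_size = 100_000
minutes_between_failures = per_gpu_mtbf_hours / fleet_size * 60
print(round(minutes_between_failures, 1))       # -> 15.8
```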

The solutions:

Frequent checkpointing: Save the model state to fast storage (NVMe SSDs on the nodes, then to distributed storage) every few hundred steps. Failure cost = time since last checkpoint + restart time.

Resilient training code: Training infrastructure at Meta and other labs is designed to detect node failures and restart the training job excluding the failed node. The remaining GPUs reorganize their communication topology and continue.

Spare capacity: Running 101,000 GPUs when you need 100,000 means a failure doesn’t stall the job. The spare node takes over.

Staggered checkpoints: Not all nodes checkpoint simultaneously (which would cause a pause in training). Checkpoints are staggered to keep training throughput high.
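How often to checkpoint is itself an optimization problem. Young's classic approximation picks the interval that minimizes expected lost work; the checkpoint cost and MTBF below are illustrative assumptions, not measured values from any real cluster:

```python
import math

# Young's approximation for checkpoint frequency: the interval that
# minimizes expected lost work is T_opt ~= sqrt(2 * C * MTBF), where C
# is the time to write a checkpoint and MTBF is the fleet's mean time
# between failures. Inputs below are illustrative assumptions.

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# e.g. 60 s to write a checkpoint, one failure every ~16 minutes (960 s):
print(round(optimal_checkpoint_interval(60, 960)))   # -> 339 seconds
```

Shorter MTBFs push the optimal interval down fast, which is why checkpoint write speed matters so much at this scale.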

The Memory Wall

H100 GPUs have 80 GB of HBM3 memory. A naive calculation shows a 70B parameter model (in fp16) requires 140 GB for the weights alone - nearly two H100s just for the weights. Then add:

  • Optimizer states (mixed-precision AdamW keeps fp32 master weights plus first and second moment estimates: 12 bytes per parameter = 840 GB more)
  • Gradients in fp16 (another 140 GB)
  • Activations for backpropagation

All told, a 70B parameter model trained with mixed-precision AdamW needs roughly 1.1 TB of GPU memory for model state, and around 1.4 TB once activations are counted. Distributing this across 18+ H100s just to fit the model is why tensor parallelism is necessary.
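That accounting in code, using the common 16-bytes-per-parameter rule for mixed-precision AdamW model state (activations come on top of this):

```python
import math

# Mixed-precision AdamW model-state memory, per the common accounting:
#   2 B fp16 weights + 2 B fp16 gradients
# + 4 B fp32 master weights + 8 B fp32 moment estimates = 16 B/param.
# Activations are extra and depend on batch size and sequence length.
params = 70e9
bytes_per_param = 2 + 2 + 4 + 8
total_tb = params * bytes_per_param / 1e12
h100s_for_state = math.ceil(params * bytes_per_param / 80e9)   # 80 GB each

print(round(total_tb, 2), "TB of model state;", h100s_for_state,
      "H100s before activations")   # -> 1.12 TB ... 14 H100s
```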

Techniques that help:

  • Mixed precision training: compute the forward/backward passes in bf16 (or fp16 with loss scaling) while keeping fp32 master weights and optimizer states for the update
  • Activation checkpointing: Don’t store all activations for backprop - recompute them. Saves memory at the cost of 30-40% extra compute
  • ZeRO (Zero Redundancy Optimizer): Shard optimizer states, gradients, and weights across data-parallel workers
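A rough sketch of how the ZeRO stages shrink per-GPU model-state memory, using the same 16-bytes-per-parameter accounting as above (stage boundaries follow the ZeRO paper; the worker count is illustrative):

```python
# Per-parameter model-state bytes under the three ZeRO stages, using
# the 16 B/param mixed-precision accounting: 12 B optimizer state,
# 2 B gradients, 2 B weights. Worker count below is illustrative.

def zero_bytes_per_param(stage: int, n_workers: int) -> float:
    opt, grad, weight = 12.0, 2.0, 2.0
    if stage >= 1:                # ZeRO-1: shard optimizer states
        opt /= n_workers
    if stage >= 2:                # ZeRO-2: also shard gradients
        grad /= n_workers
    if stage >= 3:                # ZeRO-3: also shard the weights
        weight /= n_workers
    return weight + grad + opt

# Model-state GB per GPU for a 70B model across 64 data-parallel workers:
for stage in (0, 1, 2, 3):
    print(stage, round(70e9 * zero_bytes_per_param(stage, 64) / 1e9, 1), "GB")
```

ZeRO-3 is what FSDP implements: at 64 workers the 16 bytes per parameter collapse to a quarter of a byte per parameter per GPU.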

The Software Behind the Scale

PyTorch is the foundation. Meta’s internal training stack (used for LLaMA) wraps it with:

  • TorchDynamo/torch.compile: Just-in-time compilation of training code for better GPU kernel utilization
  • FSDP (Fully Sharded Data Parallel): Production data parallelism with full parameter sharding
  • Custom CUDA kernels: Hand-optimized kernels for attention, layer normalization, and activation functions

The gap between “runs on one GPU” and “efficiently runs on 100,000 GPUs” is measured in engineer-years. The frameworks abstract some of it, but production training at scale requires infrastructure engineering expertise that is genuinely rare.

Bottom Line

Training LLaMA 4 at 100,000 GPU scale requires solving distributed systems, networking, fault tolerance, and memory management problems simultaneously. The actual computation - transformer forward and backward passes - is the easy part. The hard parts are keeping 100,000 GPUs busy, communicating gradients fast enough not to bottleneck training, and recovering gracefully from the near-constant stream of hardware failures. This is why frontier model training is limited to a handful of organizations with both the capital and the engineering expertise to build these systems.