member_2e5ba30f
Posted on May 31 • Originally published at waynehacking8.github.io

Where Tensor-Parallel Inference Hits the NVLink Wall

#cuda #gpu #machinelearning #performance

2026-05-31 · GPU / distributed systems

Tensor parallelism splits each layer across GPUs, so every forward pass pays for an all-reduce over the network fabric. On a single node that fabric is NVLink/NVSwitch, and how close you get to its theoretical budget decides whether TP helps or hurts. This post measures it on 4× H100 and explains where the wall is.

Repo with the full harness and CSVs: nccl-collectives-bench.

What was measured

A bandwidth sweep (message size 8 B → 8 GB) of the three collectives that bound distributed LLM work (all-reduce, all-gather, reduce-scatter), driving the canonical nvidia/nccl-tests and adding a parser + analysis layer on top. The headline number:

- All-reduce bus bandwidth ≈ 366 GB/s, about 77% of the per-GPU NVLink uni-directional budget on this box. That 77% is the practical ceiling TP communication runs into; the remaining gap is protocol overhead and the algorithm's traffic multiplier.
- Algorithm ranking at large messages: NVLS > Ring > Tree. NVLink SHARP (NVLS) offloads the reduction into the switch, which is why it pulls ahead once messages are big enough to amortise setup.
- A protocol study (Simple / LL / LL128) showing the small-message latency floor, the regime that actually matters for decode, where each token's all-reduce is tiny.

Why it matters for inference

Training all-reduces gradients on big tensors, so it lives in the bandwidth-bound regime where 366 GB/s is good news. Decode is the opposite: one token at a time means small messages, so you're pinned against the latency floor, not the bandwidth ceiling. That is the real "TP wall": past a certain TP degree, the per-token all-reduce latency dominates and adding GPUs makes decode slower, not faster.

The repo also includes an
LIVE
