Collectives Explainer Series
Author: Yue Lu
Date: April 2026
A walkthrough of collective communication for distributed GPU workloads — from the topology-free α-β cost model down to how dynamic contention shifts the Pareto rankings on real clusters. The primitives and cost models here apply to LLM training, LLM inference, and HPC workloads alike; where worked examples use inference-scale message sizes or LLM parallelism mappings, they’re concrete illustrations, not restrictions on scope. Every file in this folder is self-contained; you can start with 01_collective_algorithms.md, or jump straight to any topic-specific note once you know the vocabulary. 00_summary.md is a one-page cheatsheet of symbols, primitives, and the canonical α-β formulas for readers who already know the material and want fast lookup.
Reading order
01_collective_algorithms.md ← start here
│
▼
02_topology_mapping.md (single-tier: star, torus, mesh)
│
▼
03_hierarchical_topologies.md (multi-tier: fat-tree/Clos + composition)
│
┌─────────┴────────────┐
▼ ▼
04_in_network_ 05_contention_
collectives.md and_congestion.md
Branch structure. Read 01_collective_algorithms.md → 02_topology_mapping.md → 03_hierarchical_topologies.md first (topology: single-tier, then multi-tier). After that, two independent branches:
04_in_network_collectives.mddeepens SHARP / NVLS / Quantum SHARP — in-network reduction and switch multicast, on both star (02_topology_mapping.md§2) and fat-tree spine (03_hierarchical_topologies.md§1).05_contention_and_congestion.mdextends the ideal-model ladders from02_topology_mapping.mdand03_hierarchical_topologies.mdunder realistic contention coefficients $\eta_\alpha, \eta_\beta$; per-tier $\eta$ profile for hierarchical fabrics.
Read 04_in_network_collectives.md for SHARP’s $O(N) \to O(1)$ hop-count collapse; 05_contention_and_congestion.md for how real-cluster contention changes ideal-model rankings.
File index
| File | Topic | Prereq |
|---|---|---|
00_summary.md | One-page cheatsheet: symbol table, seven primitives, per-algorithm $(n_\alpha, n_\beta)$ table, per-topology specializations, hierarchical composition rule, INC effects, η-realistic form, $N = 512$ anchor numbers | familiarity with the rest of the series |
01_collective_algorithms.md | α-β cost model; seven primitives (BC, Reduce, AR, RS, AG, A2A, P2P); binomial / pipelined BC / Reduce; ring / tree / recursive-doubling / Rabenseifner AR; mapping to TP/EP/SP/PP; algbw/busbw conventions | None |
02_topology_mapping.md | Three-topology catalog of single-tier scale-up fabrics: star, torus, mesh. Per-topology cost derivations; torus dim-decomp AR with 2×2 worked example; side-by-side comparison at $N = 512$ | 01_collective_algorithms.md |
03_hierarchical_topologies.md | Multi-tier Clos / fat-tree (§1, including the NVL72 SuperPOD case study); composition rules (RS → sub-AR → AG; A2A outlier) (§2); INC and per-tier $\eta$ in hierarchies (§3); rail-optimized SuperPOD topology and k-ary fat-tree appendices | 02_topology_mapping.md |
04_in_network_collectives.md | SHARP / NVLS / Quantum SHARP — in-network reduction (switch ALU), switch multicast, and emerging HW A2A as distinct capabilities; how $n_\alpha$ collapses from $O(N)$ to $O(1)$ | 01_collective_algorithms.md, 02_topology_mapping.md §2, 03_hierarchical_topologies.md §1 |
05_contention_and_congestion.md | Contention coefficients $\eta_\alpha, \eta_\beta$; single-tier and per-tier calibration (fat-tree oversubscription $s \to \eta_\beta \leq 1/s$); re-running $N = 512$ under realistic $\eta$ | 02_topology_mapping.md §5, 03_hierarchical_topologies.md §1 |
Primitive → section map
| Primitive | Introduced in |
|---|---|
| Point-to-point (p2p) | 01_collective_algorithms.md §8 |
| Broadcast (ring-chain / binomial tree / pipelined tree) | 01_collective_algorithms.md §3 |
| Reduce (binomial / pipelined) | 01_collective_algorithms.md §4 |
| Ring all-reduce | 01_collective_algorithms.md §5.1 |
| Double binary tree all-reduce (NCCL) | 01_collective_algorithms.md §5.2 |
| Simple recursive-doubling AR | 01_collective_algorithms.md App. B.1 |
| Rabenseifner halving-doubling AR | 01_collective_algorithms.md App. B.2 |
| Ring all-gather / reduce-scatter | 01_collective_algorithms.md §6 |
| PAT all-gather / reduce-scatter (NCCL 2.23+, scale-out) | 01_collective_algorithms.md App. A |
| Recursive-doubling AG / recursive-halving RS | 01_collective_algorithms.md App. B.4 |
| Ring-relay all-to-all | 01_collective_algorithms.md §7.1 |
| Pairwise direct-send all-to-all | 01_collective_algorithms.md §7.2 |
| Bruck all-to-all | 01_collective_algorithms.md App. B.5 |
| Switch-multicast-assisted AG / BC | 04_in_network_collectives.md §1.2 |
| Torus BC / Reduce (dim-decomposed) | 02_topology_mapping.md §3.2, §3.3 |
| Torus dim-decomposed ring AR | 02_topology_mapping.md §3.4 |
| Torus dim-decomposed Rabenseifner AR | 02_topology_mapping.md App. A |
| Torus dim-decomposed AG / RS | 02_topology_mapping.md §3.5 |
| Torus A2A (bisection-limited) | 02_topology_mapping.md §3.6 |
| Full mesh (direct, no switch) | 02_topology_mapping.md §4.1 |
| $k$-D mesh (torus without wraparound) | 02_topology_mapping.md §4.2 |
| Fat-tree / Clos (leaf-spine, 3-tier) | 03_hierarchical_topologies.md §1 |
| Hierarchical AR (RS → sub-AR → AG) | 03_hierarchical_topologies.md §2.1 |
| In-network AR (SHARP / NVLS / INC) | 04_in_network_collectives.md |
What’s not here
- Formal derivations of the analytical cost model and its integration into a specific performance tool — those live in the tool-specific documentation, not in this general explainer series.
- Workload-specific end-to-end latency models (inference decode / prefill, training iteration time) — the collectives here are one ingredient; the full pipeline treatment is out of scope.
- Runnable benchmarks or Pareto sweeps — this folder is reading material. Calibration experiments belong wherever the reader’s performance tool lives.
For readers vs for practitioners
- Readers who want intuition and visuals:
01_collective_algorithms.md→02_topology_mapping.md→03_hierarchical_topologies.md→ (04_in_network_collectives.mdand05_contention_and_congestion.mdby interest). - Practitioners who want to plug numbers into a cost formula: skim
00_summary.mdfor the symbol table and per-algorithm $(n_\alpha, n_\beta)$, then jump to the cost summary in02_topology_mapping.md§5.1 (single-tier ideal) and05_contention_and_congestion.md§5 (realistic); the formulas are stated in-line and self-contained. For multi-tier fabrics,03_hierarchical_topologies.md§2 extends the single-tier formulas with the hierarchical composition rule. - Reviewers / decision-makers comparing architectures: start at
02_topology_mapping.md§5.1 (ideal single-tier formulas) and04_in_network_collectives.md§3.1 (concrete N=512 ladder with INC), then05_contention_and_congestion.md§5 (realistic), and cross-check the margin-compression discussion in05_contention_and_congestion.md§5.3–§5.4. For composing INC across tiers in a hierarchy,03_hierarchical_topologies.md§3 covers SHARP at the inner / outer tier and per-tier $\eta$.
Notation and Key Equations Cheatsheet
Author: Yue Lu
Date: April 2026
A one-page reference for the symbols, primitives, and canonical α-β cost formulas used throughout 01_collective_algorithms.md – 05_contention_and_congestion.md. Each row points to the section that derives it; equations here are stated, not justified.
Table of Contents
- Symbols
- The seven primitives
- Cost-model skeleton
- Per-algorithm costs (topology-free)
- Per-topology specializations
- Hierarchical composition
- In-network collectives (INC)
- Realistic cost: η coefficients
1. Symbols
| Symbol | Meaning | Defined in |
|---|---|---|
| $\alpha$ | per-hop latency (setup + switch traversal + propagation) | 01_collective_algorithms.md §1 |
| $M$ | message size in bytes | 01_collective_algorithms.md §1 |
| $\mathrm{BW}$ | per-link bandwidth (one direction) | 01_collective_algorithms.md §1 |
| $N$ | rank count in the collective group | 01_collective_algorithms.md §2 |
| $V_i$ | payload held by rank $i$ (size $M$) | 01_collective_algorithms.md §2 |
| $n_\alpha$ | latency-term count (sequential synchronization hops) | 01_collective_algorithms.md §1 |
| $n_\beta$ | bandwidth-term count (per-rank bytes moved, in units of $M$) | 01_collective_algorithms.md §1 |
| $\mathrm{BW_{eff}}$ | effective per-rank bandwidth seen by the collective ($\mathrm{BW}/n_\beta$, accounting for dual-touch on AR) | 04_in_network_collectives.md §1.4 |
| $D$ | network diameter (worst-case hop count between rank pair) | 02_topology_mapping.md §1, App. B |
| $d_i$ | dim-$i$ rank count on a torus / mesh ($\prod_i d_i = N$) | 02_topology_mapping.md §1.2, App. B |
| $d_{\max}$ | $\max_i d_i$ — controls torus A2A bisection penalty | 02_topology_mapping.md §3.6 |
| $k$ | torus / mesh dimensionality (number of dims) | 02_topology_mapping.md §3 |
| $\alpha_{\mathrm{switch}}$ | switch cut-through latency (~100–400 ns; $\ll \alpha$) | 04_in_network_collectives.md §1.1 |
| $s$ | fat-tree / Clos oversubscription ratio at a tier boundary ($s \geq 1$) | 03_hierarchical_topologies.md §1 |
| $L$ | outer-tier rank count in a 2-tier hierarchy ($N = L \cdot N_{\mathrm{inner}}$) | 03_hierarchical_topologies.md §2.1 |
| $\eta_\alpha \geq 1$ | realistic α-side contention coefficient (inflation factor) | 05_contention_and_congestion.md §4 |
| $\eta_\beta \in (0, 1]$ | realistic BW-side contention coefficient (utilization factor) | 05_contention_and_congestion.md §4 |
| algbw | algorithm bandwidth: $M / t$ (per-call effective rate) | 01_collective_algorithms.md App. D |
| busbw | bus bandwidth: $C_{\mathrm{primitive}}(N) \cdot M / t$ (link-level rate) | 01_collective_algorithms.md App. D |
2. The seven primitives
| Primitive | Group state in → out |
|---|---|
| Broadcast (BC) | rank 0 holds $V$ → all hold $V$ |
| Reduce | rank $i$ holds $V_i$ → rank 0 holds $\sum_i V_i$ |
| All-reduce (AR) | rank $i$ holds $V_i$ → all hold $\sum_i V_i$ |
| Reduce-scatter (RS) | rank $i$ holds $V_i$ (size $M$) → rank $i$ holds chunk $\sum_j V_j[\mathrm{chunk}_i]$ (size $M/N$) |
| All-gather (AG) | rank $i$ holds chunk $i$ (size $M/N$) → all hold concatenation (size $M$) |
| All-to-all (A2A) | rank $i$ holds $N$ chunks indexed by destination → rank $i$ holds $N$ chunks indexed by sender |
| Point-to-point (P2P) | rank $a$ holds $V$ → rank $b$ holds $V$ |
Identity: $\mathrm{AR} \equiv \mathrm{RS} + \mathrm{AG}$ (both halves over the same group). Used by ring AR, Rabenseifner halving-doubling, hierarchical RS → sub-AR → AG.
3. Cost-model skeleton
Every collective cost in this series factors as
$$t \;=\; n_\alpha \cdot \alpha \;+\; n_\beta \cdot \frac{M}{\mathrm{BW}}$$
- $n_\alpha$ counts sequential hops on the algorithm’s critical path.
- $n_\beta$ counts per-rank byte movement in units of $M$ (not the total fabric traffic).
A primitive’s BW-side lower bound is set by the cut-through structure:
| Primitive | $n_\beta$ lower bound | Why |
|---|---|---|
| BC, Reduce, AG, RS | $(N-1)/N \to 1$ | one rank’s share crosses the cut once |
| AR | $2(N-1)/N \to 2$ | every byte is touched twice (RS in, AG out) — unless INC fuses the two halves |
| A2A | $(N-1)/N$ per rank | $N(N-1)$ pairwise distinct payloads bisection-bound |
4. Per-algorithm costs (topology-free)
Costs assume an abstract fully-connected fabric (every rank can reach every other in 1 hop). Topology corrections come from §5.
| Algorithm | Primitive | $n_\alpha$ | $n_\beta$ | Source |
|---|---|---|---|---|
| Ring (chain) | BC | $(N-1)$ raw / $\to 1$ pipelined | $1$ | 01_collective_algorithms.md §3.1 |
| Binomial tree | BC | $\lceil \log_2 N \rceil$ raw / pipelined to $\to 1$ | $1$ | 01_collective_algorithms.md §3.2 |
| Ring | Reduce | $(N-1)$ raw / pipelined to $\to 1$ | $1$ | 01_collective_algorithms.md §4.1 |
| Binomial tree | Reduce | $\lceil \log_2 N \rceil$ raw / pipelined to $\to 1$ | $1$ | 01_collective_algorithms.md §4.2 |
| Ring | AR | $2(N-1)$ | $2(N-1)/N \to 2$ | 01_collective_algorithms.md §5.1 |
| Double binary tree (DBT) | AR | $2 \lceil \log_2 N \rceil$ | $2$ (BW-eff = BW/2) | 01_collective_algorithms.md §5.2 |
| Recursive-halving + doubling (Rabenseifner) | AR | $2 \lceil \log_2 N \rceil$ | $2(N-1)/N \to 2$ | 01_collective_algorithms.md App. B.2 |
| Simple recursive-doubling | AR | $\lceil \log_2 N \rceil$ | $\lceil \log_2 N \rceil$ | 01_collective_algorithms.md App. B.1 |
| Ring | AG / RS | $(N-1)$ | $(N-1)/N \to 1$ | 01_collective_algorithms.md §6 |
| PAT (parallel aggregated trees) | AG / RS | $\lceil \log_2 N \rceil$ | $(N-1)/N \to 1$ | 01_collective_algorithms.md App. A |
| Recursive-doubling AG / -halving RS | AG / RS | $\lceil \log_2 N \rceil$ | $(N-1)/N \to 1$ | 01_collective_algorithms.md App. B.4 |
| Pairwise direct-send | A2A | $(N-1)$ | $(N-1)/N$ | 01_collective_algorithms.md §7.2 |
| Ring-relay | A2A | $(N-1)$ | varies (bisection-bound on torus) | 01_collective_algorithms.md §7.1 |
| Bruck (log-round) | A2A | $\lceil \log_2 N \rceil$ | $\lceil \log_2 N \rceil / 2$ | 01_collective_algorithms.md App. B.5 |
Headline AR formulas:
$$t_{\mathrm{ring\,AR}} \;=\; 2(N-1)\,\alpha \;+\; \frac{2(N-1)}{N} \cdot \frac{M}{\mathrm{BW}}$$
$$t_{\mathrm{tree\,AR}} \;=\; 2\lceil \log_2 N \rceil \,\alpha \;+\; 2\,\frac{M}{\mathrm{BW}}$$
(DBT hits the BW floor as a hardware ceiling under asymptotic pipelining; the practice gap is documented in 01_collective_algorithms.md §5.3.)
5. Per-topology specializations
Single-tier scale-up fabrics. AR formula entries; AG / RS halve both terms; BC / Reduce hit $M/\mathrm{BW}$ under pipelining and inherit the topology’s α term.
| Topology | $n_\alpha$ (AR) | $n_\beta$ (AR) | A2A BW term | Source |
|---|---|---|---|---|
| Star (single switch) | $2(N-1)$ ring / $2 \lceil \log_2 N \rceil$ DBT | $2(N-1)/N \to 2$ | $(N-1)/N \cdot M/\mathrm{BW}$ pairwise | 02_topology_mapping.md §2 |
| Torus ($k$-D, $\prod d_i = N$, dim-decomposed ring) | $2 \sum_i (d_i - 1)$ | $2(N-1)/N \to 2$ | $(d_{\max}/8) \cdot M/\mathrm{BW}$ at cubic shapes | 02_topology_mapping.md §3.4, §3.6 |
| Torus dim-decomposed Rabenseifner (not shipped) | $2 \sum_i \lceil \log_2 d_i \rceil$ | $2(N-1)/N \to 2$ | — | 02_topology_mapping.md App. A |
| Full mesh | identical to star at $\alpha \to \alpha_{\mathrm{link}}$ | $2(N-1)/N \to 2$ | $(N-1)/N \cdot M/\mathrm{BW}$ | 02_topology_mapping.md §4.1 |
| $k$-D mesh (no wraparound) | $2 \sum_i (d_i - 1)$ | $2(N-1)/N \to 2$ | $(d_{\max}/4) \cdot M/\mathrm{BW}$ — 2× worse than torus | 02_topology_mapping.md §4.2 |
Bisection: torus $\mathrm{BW}{\mathrm{bisect}} = (2N/d{\max}) \cdot \mathrm{BW}$; fat-tree $s \cdot N \cdot \mathrm{BW}$ (oversubscription).
6. Hierarchical composition
Multi-tier Clos / fat-tree. Two-tier AR with $L$ outer ranks each containing $N/L$ inner ranks decomposes as inner RS → outer sub-AR → inner AG:
$$t_{\mathrm{AR}}^{\mathrm{hier}} \;=\; t_{\mathrm{RS,inner}}(M) \;+\; t_{\mathrm{AR,outer}}\!\left(\tfrac{M\,L}{N}\right) \;+\; t_{\mathrm{AG,inner}}(M)$$
The middle phase ships the telescoped payload $ML/N$ — which is why hierarchical AR scales: the costliest tier (typically outer) carries less data.
Oversubscription at any tier boundary multiplies that tier’s BW term by $s$ (equivalent to $\eta_\beta = 1/s$ on cross-tier traffic):
$$t_{\mathrm{BW,outer}} \;=\; s \cdot \frac{M\,L/N}{\mathrm{BW}_{\mathrm{outer}}}$$
A2A is the outlier: it cannot telescope, so per-tier ranks see $(N-1)/N \cdot M/\mathrm{BW}$ at the outer tier’s bisection rate. See 03_hierarchical_topologies.md §2.2.
7. In-network collectives (INC)
INC moves the reduction or replication into the switch ASIC (NVLS / Quantum SHARP / Spectrum-X SHARP / Tomahawk Ultra). On a single-switch star, $n_\alpha$ collapses from $O(N)$ or $O(\log N)$ down to $\sim 2$, and AR uniquely also gets a BW-eff doubling because the switch fuses the reduce and broadcast halves.
| Primitive | Required HW | $n_\alpha$ (INC) | $\mathrm{BW_{eff}}$ (INC) | Speedup vs software |
|---|---|---|---|---|
| AR | switch ALU + multicast xbar | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\mathrm{BW}$ (vs SW $\mathrm{BW}/2$) | α: $\sim (N-1)$ or $\log_2 N$ ×; BW: 2× |
| Reduce | switch ALU | $\sim \alpha_{\mathrm{switch}}$ | $\mathrm{BW}$ | α: $\sim \log_2 N$ ×; BW: 1× |
| RS | switch ALU | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\mathrm{BW}$ | α: $\sim (N-1)/2$ ×; BW: 1× |
| AG | multicast xbar | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\mathrm{BW}$ | α: $\sim (N-1)/2$ ×; BW: 1× |
| BC | multicast xbar | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\mathrm{BW}$ | α: $\sim \log_2 N$ ×; BW: 1× |
| A2A | crossbar scatter-gather (HW A2A; emerging) | $\sim \alpha_{\mathrm{switch}}$ | $(N-1)/N \cdot \mathrm{BW}$ | α: $\sim N$ ×; BW: 1× |
Scope. SHARP-class INC is single-switch-domain (NVLS) or SHARP-enabled aggregation tree (Quantum SHARP, Spectrum-X SHARP). HW A2A ships today on Tomahawk Ultra and is on Rubin’s roadmap. Cross-domain traffic falls back to software DBT / ring.
8. Realistic cost: η coefficients
Ideal-formula upper bounds inflate to realized cost via two scalar coefficients per (fabric, primitive):
$$t_{\mathrm{realized}} \;=\; \eta_\alpha \cdot n_\alpha \cdot \alpha \;+\; n_\beta \cdot \frac{M}{\eta_\beta \cdot \mathrm{BW}}$$
with $\eta_\alpha \geq 1$ and $\eta_\beta \in (0, 1]$.
| Fabric / regime | $\eta_\alpha$ | $\eta_\beta$ | Calibration source |
|---|---|---|---|
| Crossbar (NVLink + NVSwitch, no SHARP) | 1.00 | 0.80 | NCCL-tests H100 AR busbw 360 / 450 GB/s |
| NVLS (NVLink SHARP, in-network reduction) | 1.00 | 0.52 | NVLS busbw 470 / DBT 360 GB/s, 1.3× lift / framing |
| Torus (off-prefix + concurrent groups) | 1.20 | 0.60 | TPU v4 twisted-vs-untwisted 1.63× upper bound |
| Fat-tree / Clos upper tier at oversubscription $s$ | 1.00 | $\min(\eta_\beta^{\mathrm{hw}}, 1/s)$ | structural |
Per-tier η in a hierarchy. Each phase of the §6 hierarchical decomposition uses its own tier’s $(\eta_\alpha^{\mathrm{tier}}, \eta_\beta^{\mathrm{tier}})$.
See 05_contention_and_congestion.md §4 for the full calibration profile and §5 for the realistic re-run of the $N = 512$ ladder.
Further reading
01_collective_algorithms.md— α-β model and per-primitive algorithm derivations; algbw / busbw appendix.02_topology_mapping.md— single-tier specializations (star, torus, mesh) and the $N = 512$ ideal comparison.03_hierarchical_topologies.md— multi-tier Clos / fat-tree, composition rules, NVL72 SuperPOD case study, INC at inner / outer tier.04_in_network_collectives.md— SHARP-class INC mechanics, AR-only BW-eff doubling, per-primitive ceilings, scale-up vs scale-out.05_contention_and_congestion.md— η calibration, per-tier η on hierarchies, realistic re-run of the $N = 512$ ladder.references.md— primary sources for each formula and empirical value.
Collective Algorithms: A Network Topology-Free Introduction
Author: Yue Lu
Date: April 2026
Collective operations are the vocabulary of distributed GPU workloads — broadcast, reduce, all-reduce, all-gather, reduce-scatter, all-to-all, point-to-point. They show up in LLM training, LLM inference, HPC simulation, and any other multi-rank compute that needs to synchronize or redistribute state. This note defines each primitive, walks through the dominant algorithms on small concrete examples — ring and binomial tree for broadcast and reduce, ring and double binary tree (DBT) for all-reduce, ring for all-gather / reduce-scatter, pairwise direct-send for all-to-all — and maps every primitive to the parallelism pattern — tensor parallelism (TP), expert parallelism (EP), sequence parallelism (SP), pipeline parallelism (PP) — that uses it. A generic pipelining derivation (Appendix C) covers the mechanism by which all log-depth schedules collapse their bandwidth terms from $L \cdot M/\mathrm{BW}$ to $M/\mathrm{BW}$. No physical topology yet — just algorithms as they’d run on an abstract fully-connected fabric. Topology enters in the companion note 02_topology_mapping.md.
Table of Contents
- The α-β cost model
- The seven primitives
- Broadcast
- 3.1 Ring BC
- 3.2 Binomial tree BC
- 3.3 Comparison and practical adoption
- Reduce
- All-reduce
- All-gather / reduce-scatter
- All-to-all
- Point-to-point hop
- Mapping primitives to DP, TP, EP, SP, PP
- Appendix A: Parallel Aggregated Trees (PAT) — scale-out AG / RS
- Appendix B: Non-mainline AR / AG / RS / A2A variants
- Appendix C: Asymptotic form of linear schedules (via pipelining)
- Appendix D: algbw vs busbw — the NCCL-tests vocabulary
- Further reading
1. The α-β cost model
Every collective algorithm on every topology costs time in the same two-part shape. The time to move a message of $M$ bytes between two ranks on a link of bandwidth $\mathrm{BW}$ is modeled as
$$t = \alpha + \frac{M}{\mathrm{BW}}$$
- $\alpha$ (per-hop latency): fixed setup + switch traversal + propagation, measured in seconds. Does not depend on message size.
- $M / \mathrm{BW}$ (bandwidth term): time spent pushing bytes through the wire.
A collective-level schedule summarizes its $k$ sequential hops and its per-rank total bytes moved into two dimensionless counts:
$$t = n_\alpha \cdot \alpha + n_\beta \cdot \frac{M}{\mathrm{BW}}$$
- $n_\alpha$ (latency count) — the number of sequential synchronization hops. Ring and tree schedules on $N$ ranks chain $O(N)$ and $O(\log_2 N)$ hops respectively; in-network collective (INC) primitives that host the reduction inside a switch collapse $n_\alpha$ to $2$ regardless of $N$.
- $n_\beta$ (bandwidth count) — the multiplicative factor on the per-rank payload $M/\mathrm{BW}$, aggregating how many link-traversal-equivalents of the full $M$-byte message each rank’s critical path moves. Good algorithms keep $n_\beta$ close to the lower bound set by the primitive: $(N-1)/N \to 1$ for one-way redistribution (shipping one rank’s share across the cut is unavoidable) and $2(N-1)/N \to 2$ for all-reduce (every byte of the result must visit an endpoint link twice — once in, once out). Switch-hosted INC further collapses the all-reduce $n_\beta$ to 1 by moving the reduction into the fabric (
04_in_network_collectives.md §1.4).
The primitives themselves (all-reduce, all-gather, reduce-scatter, all-to-all, and the others) are defined in §2; INC is covered in 04_in_network_collectives.md. The per-primitive and per-algorithm values of $(n_\alpha, n_\beta)$ are worked out in §§3–8.
Writing both costs in $(n_\alpha, n_\beta)$ form makes latency-term and bandwidth-term efficiency directly comparable across topologies and across in-network vs software schedules. “Cost” means wall-clock time under this model; contention and congestion (which inflate $\alpha$ and deflate $\mathrm{BW}$) are covered in 05_contention_and_congestion.md.
2. The seven primitives
Seven collectives cover essentially all multi-rank communication in distributed GPU workloads (LLM training and inference, HPC simulation, distributed gradient descent, etc.). The first six are defined on a group of $N$ ranks each holding data of size $M$; the last is point-to-point. In the table below, $V$ denotes a payload of size $M$ bytes (treated as a vector for reduction purposes — e.g., a gradient tensor, an activation shard, a key-value (KV) cache chunk); $V_i$ is the copy held by rank $i$, and $\sum_i V_i$ is the element-wise sum across ranks. For the primitives whose name contains “reduce”, the reduction operator is commutative-associative (sum, max, min, bit-op); the pure-movement primitives (broadcast, all-gather, all-to-all) just redistribute data and don’t combine it.
| Primitive | What each rank starts with | What each rank ends with |
|---|---|---|
| Broadcast | rank 0 holds $V$; others hold nothing | all ranks hold $V$ |
| Reduce | rank $i$ holds $V_i$ | rank 0 holds $\sum_i V_i$ (others unchanged) |
| All-reduce (AR) | rank $i$ holds $V_i$ | all ranks hold $\sum_i V_i$ |
| Reduce-scatter (RS) | rank $i$ holds length-$M$ vector $V_i$ | rank $i$ holds one chunk of size $M/N$ equal to $\sum_j V_j[\mathrm{chunk}_i]$ |
| All-gather (AG) | rank $i$ holds one chunk of size $M/N$ | all ranks hold the concatenation of all $N$ chunks |
| All-to-all (A2A) | rank $i$ holds $N$ chunks of size $M/N$; chunk $j$ is destined for rank $j$ | rank $i$ holds $N$ chunks of size $M/N$; chunk $j$ was sent by rank $j$ (specifically, rank $j$’s original “chunk $i$”) |
| P2P (send/recv) | rank $a$ holds $V$ | rank $b$ holds $V$ |
A2A indexing convention: chunks in the send buffer are indexed by destination (rank $i$’s chunk $j$ is what rank $i$ wants to give to rank $j$); chunks in the receive buffer are indexed by sender (rank $i$’s received chunk $j$ is what rank $j$ gave it, which was rank $j$’s original chunk $i$). This matches the MPI MPI_Alltoall layout.
Three useful identities:
- AR ≡ RS + AG (logically). The end state of AR is identical to running RS then AG: RS leaves each rank with one fully-reduced $M/N$-chunk, and AG then redistributes those chunks so everyone has all $N$ of them. This is a semantic equivalence, not a prescription — some AR algorithms (bandwidth-optimal ring AR [Patarasuk-Yuan 2009], dim-decomposed ring AR on torus/mesh) do execute as a literal RS phase followed by an AG phase, but others (recursive halving-doubling / Rabenseifner, recursive doubling, tree AR, in-network AR) never materialize the intermediate RS end-state as a distinct phase. The identity is still useful for reasoning about cost lower bounds (§5) regardless of implementation.
- AR ≡ Reduce + Broadcast (also logically, but strictly worse in cost). Reduce aggregates every $V_i$ into rank 0’s $\sum_i V_i$; broadcast then replicates it to all ranks. The end state matches AR’s, but composing the two primitives ships the full reduced $M$-byte result across the fabric twice — each phase carries an $M/\mathrm{BW}$-floor bandwidth term — whereas tree-based and ring AR fuse the two directions so each byte traverses an endpoint link only twice total (matching the $2(N-1)/N$ lower bound). This identity is useful as a conceptual baseline; no production AR implementation uses it. §4.3 works out the cost gap.
- A2A is a permutation, not a reduction. No summation happens; every rank ends up holding data that belongs elsewhere. Mixture-of-Experts (MoE) expert dispatch is exactly this pattern (plus a reverse A2A for the combine).
Sections 3–7 walk through the six collective primitives in the table order: §3 Broadcast, §4 Reduce, §5 AR, §6 AG / RS, §7 A2A. Broadcast and Reduce lead because their tree-based schedules are the structural building blocks for tree AR in §5, so putting them up front lets the DBT construction cite primitives already in scope. AR, AG / RS, and A2A are the reduction / redistribution primitives that dominate time in any collective-heavy workload (distributed data parallelism (DDP) / fully-sharded data parallelism (FSDP) training, LLM inference, MoE routing, HPC all-reduce). Each of §§5–7 follows the same template: a ring-based derivation (the NCCL large-$M$ default for all three primitives), the tree-based variant that NCCL actually ships alongside ring when one exists (DBT for AR), and a comparison that explains the NCCL selection rule between them. Variants that appear in the literature and in other runtimes — simple recursive-doubling AR, Rabenseifner halving-doubling AR, recursive-doubling AG / recursive-halving RS, Parallel Aggregated Trees (PAT), Bruck A2A — are not in the NCCL shipping menu but remain useful reference points; their full step-by-step derivations live in Appendix B, with Appendix A giving the same treatment to the one shipping non-ring AG / RS algorithm — Parallel Aggregated Trees (PAT, NCCL 2.23+, scale-out). Cross-references to each appendix subsection appear inline where the main-text comparisons call for them. §8 covers point-to-point, §9 maps primitives to parallelism axes, Appendix C derives the generic pipelining mechanism that the §3 / §4 / §5 collectives invoke to collapse their bandwidth terms, and Appendix D defines the algbw / busbw reporting convention used in NCCL-tests.
3. Broadcast
Broadcast (BC) is the simplest one-to-many primitive: rank 0 starts holding $V$ (size $M$ bytes), and all $N$ ranks must end holding a copy of $V$. It shows up in parameter loading from rank 0 at initialization, in metadata and embedding distribution, in the down-multicast half of tree AR (§5.2), and in any dataflow that needs to replicate a single reference payload across a group. Two algorithms cover the practical cost landscape — ring (§3.1, bandwidth-optimal via pipelining, linear-$\alpha$) and binomial tree (§3.2, latency-optimal log-$\alpha$; with chunked pipelining the BW term collapses to the same $M/\mathrm{BW}$ floor as ring — see Appendix C for the generic derivation). §3.3 explains which one NCCL ncclBroadcast dispatches for a given $(N, M)$ regime and points to 04_in_network_collectives.md §1.1 for the switch-multicast primitive that collapses $n_\alpha$ to 2 regardless of $N$.
3.1 Ring BC
Arrange ranks in a ring $R_0 \to R_1 \to \ldots \to R_{N-1} \to R_0$ with $R_0$ as the source. BC has a single source, so data flows one way from $R_0$ down the ring and the wrap-around edge $R_{N-1} \to R_0$ is idle — the schedule uses the ring’s $N-1$ forward edges as a chain. The same ring topology (same physical wiring, same per-rank in/out neighbor set) is reused by ring AR in §5.1 and ring AG / RS in §6, where every rank sends to every other and the wrap-around carries real traffic.
Ring for $N = 4$ (arrows are the forward direction used by BC):
R0 ───→ R1 ───→ R2 ───→ R3
Chunk $V$ into $P$ equal pieces of size $M/P$ and stream them down the ring: once a rank has received chunk $s$ from its left neighbor, it forwards chunk $s$ to its right neighbor while concurrently receiving chunk $s+1$ from its left. Every forward link carries one chunk per step in steady state — a software analogue of physical wormhole forwarding. We walk through $N = 4$ with $P = N$ (one chunk per rank, matching ring AR’s convention in §5.1 and giving the cleanest symmetric fill/drain), so $V = [V_0, V_1, V_2, V_3]$ with each $V_s$ of size $M/N$ initially held by $R_0$; the other three ranks start empty. The $P = N$ choice is illustrative — the asymptotic cost reached for larger $P$ per Appendix C is discussed after the walkthrough.
Initial state.
R0: [V0 V1 V2 V3]
R1: [ — ]
R2: [ — ]
R3: [ — ]
After step 1 — $R_0$ ships $V_0$ to $R_1$. Pipeline is filling; only one link is active.
Step 1: R0 → R1 (V0)
R0: [V0 V1 V2 V3]
R1: [V0 ] ← received V0
R2: [ — ]
R3: [ — ]
After step 2 — two links active concurrently: $R_0 \to R_1$ streams $V_1$ while $R_1 \to R_2$ forwards $V_0$.
Step 2: R0 → R1 (V1); R1 → R2 (V0) concurrently
R0: [V0 V1 V2 V3]
R1: [V0 V1 ]
R2: [V0 ]
R3: [ — ]
After step 3 — pipeline fully populated: three links carry $V_2, V_1, V_0$ in the same step. $V_0$ reaches the tail rank $R_3$.
Step 3: R0 → R1 (V2); R1 → R2 (V1); R2 → R3 (V0) concurrently
R0: [V0 V1 V2 V3]
R1: [V0 V1 V2 ]
R2: [V0 V1 ]
R3: [V0 ] ← first chunk arrives at tail
After step 4 — last chunk enters the pipeline; $R_1$ is now complete.
Step 4: R0 → R1 (V3); R1 → R2 (V2); R2 → R3 (V1) concurrently
R0: [V0 V1 V2 V3]
R1: [V0 V1 V2 V3] ← complete
R2: [V0 V1 V2 ]
R3: [V0 V1 ]
After step 5 — $R_0$ is idle; two links still active for drain.
Step 5: R1 → R2 (V3); R2 → R3 (V2) concurrently
R0: [V0 V1 V2 V3]
R1: [V0 V1 V2 V3]
R2: [V0 V1 V2 V3] ← complete
R3: [V0 V1 V2 ]
After step 6 — final drain.
Step 6: R2 → R3 (V3)
R0: [V0 V1 V2 V3]
R1: [V0 V1 V2 V3]
R2: [V0 V1 V2 V3]
R3: [V0 V1 V2 V3] ← complete — BC done
Total: $N - 1 + P - 1 = 6$ steps (fill $N-1$, then drain $P-1$ after the last chunk enters at $R_0$). Each step ships $M/P$ bytes over its active link at cost $\alpha + M/(P\,\mathrm{BW})$:
$$t_{\mathrm{ring\,BC}} = (N + P - 2)\left(\alpha + \frac{M}{P\,\mathrm{BW}}\right)$$
At optimal $P^* = \sqrt{(N{-}2)M/(\alpha \mathrm{BW})}$, in the large-$M$ limit, the cost approaches the asymptotic form:
$$t_{\mathrm{ring\,BC}} \approx (N - 1)\,\alpha + \frac{M}{\mathrm{BW}}$$
The “$\approx$” hides an $O(\sqrt{M})$ correction of order $2\sqrt{(N{-}2)\alpha M/\mathrm{BW}}$ between the two floors — $(N{-}1)\alpha$ is the minimum the latency term can hit ($P \to 1$), $M/\mathrm{BW}$ is the minimum the BW term can hit ($P \to \infty$), and neither limit is physical; $P^*$ lives between them. The correction is $O(\sqrt{M})$ while the dominant $M/\mathrm{BW}$ term is $O(M)$, so it vanishes relative to the asymptotic floor as $M \to \infty$ but is nonzero at any finite $M$. See Appendix C for the full derivation.
$n_\alpha = N - 1$, $n_\beta = 1$ — bandwidth-optimal (matches the asymptotic tree floor in §3.2) but linear $\alpha$ in $N$. Ring BC loses to pipelined tree on the $\alpha$ side but keeps the same BW floor, so it wins on simple or low-radix fabrics where tree-schedule bookkeeping is expensive relative to the $\alpha$ advantage — or on ring-wired physical topologies where the logical ring maps directly onto the copper (same logic that keeps ring AR in NCCL’s shipping menu for large-$M$; see §5.3).
3.2 Binomial tree BC
The binomial tree BC doubles the set of ranks holding $V$ at every step. At step $k$, every rank already holding $V$ pairs with one rank that doesn’t and ships $V$ over the link; concurrent pairs run in parallel. After $\lceil \log_2 N \rceil$ steps every rank holds $V$. We walk through $N = 4$ with $V = [v_0, v_1, v_2, v_3]$ (size $M$) initially held by $R_0$; the other three ranks start empty.
Tree structure for $N = 4$. The send pattern below traces out the binomial tree $B_2$ rooted at $R_0$:
R0 (depth 0, source)
│ │
▼ ▼
R1 R2 (depth 1)
│
▼
R3 (depth 2)
$R_0$’s two children are the roots of $B_1$ (on $\{R_1, R_3\}$) and $B_0$ (on $\{R_2\}$) — the recursive construction of $B_k$ hangs a $B_{k-1}, B_{k-2}, \ldots, B_0$ off the root in descending size. Level-counts $(1, 2, 1)$ are row $2$ of Pascal’s triangle — $\binom{2}{0}, \binom{2}{1}, \binom{2}{2}$ — which is what “binomial” refers to: the number of nodes at depth $d$ in $B_k$ is $\binom{k}{d}$. Depth is $\lceil \log_2 N \rceil = 2$, setting $n_\alpha$.
Initial state.
R0: [v0 v1 v2 v3]
R1: [ — ]
R2: [ — ]
R3: [ — ]
holding-set: {R0}
After step 1 — $R_0 \to R_1$. The holding set doubles from $\{R_0\}$ to $\{R_0, R_1\}$.
Step 1: R0 sends V to R1
R0: [v0 v1 v2 v3]
R1: [v0 v1 v2 v3] ← received from R0
R2: [ — ]
R3: [ — ]
holding-set: {R0, R1}
After step 2 — $R_0 \to R_2$ and $R_1 \to R_3$, concurrently (two disjoint pairs, both sides of the holding set each sending to one new partner). The holding set doubles to all 4 ranks.
Step 2: R0 → R2 and R1 → R3 concurrently
R0: [v0 v1 v2 v3]
R1: [v0 v1 v2 v3]
R2: [v0 v1 v2 v3] ← received from R0
R3: [v0 v1 v2 v3] ← received from R1
holding-set: {R0, R1, R2, R3} — BC complete
BC completes in $\lceil \log_2 4 \rceil = 2$ sequential steps; at general $N$ in $\lceil \log_2 N \rceil$ steps. Each step moves the full $M$-byte payload over one link per active pair, so the per-step cost is $\alpha + M/\mathrm{BW}$ and the sequential steps sum:
$$t_{\mathrm{bin\,BC}} = \lceil \log_2 N \rceil \cdot \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}}$$
$n_\alpha = \lceil \log_2 N \rceil$ (latency-optimal — a lower bound for BC over a binary-combinable fabric), $n_\beta = \lceil \log_2 N \rceil$ (bandwidth-suboptimal — same $\log N$ coefficient weakness as simple recursive-doubling AR in App. B.1).
Asymptotic form (bandwidth-bound regime; pipelined implementation). Chunk $V$ into $P$ sub-segments of size $M/P$ and stream them through the tree — each internal node forwards chunk $s$ to its children while receiving chunk $s+1$ from its parent — turning every tree edge into a conveyor. Substituting $L = \lceil \log_2 N \rceil$ (tree depth) into the Appendix C master formula and optimizing at $P^* = \sqrt{(\lceil \log_2 N \rceil - 1)M/(\alpha \mathrm{BW})}$ in the large-$M$ limit gives the asymptotic floor:
$$t_{\mathrm{bin\,BC}} \approx \lceil \log_2 N \rceil \cdot \alpha + \frac{M}{\mathrm{BW}}$$
The “$\approx$” hides an $O(\sqrt{M})$ interference correction $2\sqrt{(\lceil \log_2 N \rceil - 1)\alpha M/\mathrm{BW}}$ between the two floors — the $\alpha$-count picks up a $(P^{-}1)\alpha$ surcharge, and the BW term carries a fill/drain residue of the same magnitude — both vanishing relative to $M/\mathrm{BW}$ as $M \to \infty$ but nonzero at any finite $M$. This is the form NCCL’s tree-path ncclBroadcast runs for bulk $M$ — simultaneously latency- and bandwidth-optimal for BC on unbounded-port fabrics. Full derivation of the $(L+P-1)$ slot count, $P^$, the correction, and the port-budget caveat that licenses the collapse on real fabrics are all in Appendix C; the binomial tree satisfies the port budget (busiest rank holds ≤ 3 concurrent tree-edge partners) so the collapse applies.
3.3 Comparison and practical adoption
Three software implementations (binomial tree $P=1$, binomial tree $P^$, ring $P^$) plus a hardware-multicast path sit on the table. Before presenting the combined view, the hardware option needs a brief introduction — the two software algorithms were fully specified in §§3.1–3.2, but the multicast primitive has not yet appeared in this chapter.
Hardware multicast primitive. Switched fabrics expose a switch-multicast primitive (InfiniBand Quantum Scalable Hierarchical Aggregation and Reduction Protocol (SHARP), NVSwitch Gen3+ / NVLink SHARP (NVLS), many PCIe switches1) that replicates a single $M$-byte payload across all destination ports in one switch-local operation. The source rank pushes $V$ upstream to the switch (1 hop); the switch crossbar then drives $V$ out all $N-1$ destination ports concurrently in one multicast fan-out (1 more hop). Two switch-crossing hops total, independent of $N$:
$$t_{\mathrm{INC,\,BC}} \;\approx\; 2\,\alpha_\mathrm{switch} + \frac{M}{\mathrm{BW}}$$
$\alpha_\mathrm{switch}$ is the switch cut-through latency — $\sim 100$–$200$ ns for NVSwitch Gen3/4, $\sim 250$ ns for Broadcom Tomahawk Ultra — typically well below the $\alpha \approx 0.5$–$1\,\mu$s that software BC accumulates across $\lceil \log_2 N \rceil$ or $N-1$ endpoint-driven hops.
All four options side by side. The two software algorithms each have a finite-$P$ and an asymptotic form; which form wins is governed by the $M$-regime, since the choice of $P$ (non-pipelined $P = 1$ vs asymptotic $P = P^*$) is dictated by whether the α term or the BW term dominates. Switch multicast is an orthogonal hardware-dependent alternative and applies across regimes where the fabric exposes it:
| Algorithm | α term | BW term | Regime / adoption |
|---|---|---|---|
| Binomial tree, non-pipelined ($P = 1$) | $\lceil \log_2 N \rceil \cdot \alpha$ | $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ | Small $M$ (α-bound, $M/\mathrm{BW} \ll \alpha$); NCCL/MPI tree path |
| Binomial tree, asymptotic ($P^*$) | $\lceil \log_2 N \rceil \cdot \alpha$ | $M/\mathrm{BW}$ | Medium $M$; NCCL/MPI tree path |
| Ring, asymptotic ($P^*$) | $(N-1) \cdot \alpha$ | $M/\mathrm{BW}$ | Large $M$ (BW-bound, $M/\mathrm{BW} \gg \alpha$); NCCL/MPI ring path |
| Switch multicast (INC) | $2 \cdot \alpha_\mathrm{switch}$ | $M/\mathrm{BW}$ | All $M$ where fabric exposes multicast; NVLS (intra-NVLink), SHARP (IB), Tomahawk Ultra (Ethernet) |
Reading the rows as a progression. At small $M$ the BW term is negligible no matter how many times $L$ multiplies it, so software picks whichever algorithm has the smallest $L$ on the $\alpha$ side — binomial tree, with $L = \lceil \log_2 N \rceil$ vs ring’s $N-1$. At medium $M$ the BW term matters, so the tree starts pipelining to collapse its $\lceil\log_2 N\rceil \cdot M/\mathrm{BW}$ coefficient down to $M/\mathrm{BW}$; the α term stays at log depth, so tree remains favored. At large $M$ the BW term dominates and both tree-asymptotic and ring-asymptotic sit at the same $M/\mathrm{BW}$ floor; the pure α-β model says tree still wins (smaller α term), but in practice implementation overhead (finite pipeline depth $P \ll P^*$, per-step kernel cost) keeps tree’s BW coefficient above the asymptotic 1, while ring’s intrinsically-pipelined $P = N$ schedule hits the floor cleanly — the same ring-vs-tree crossover logic that §5.3 derives in full for AR.
INC sits alongside this progression as a hardware alternative rather than a fourth regime. It collapses the α term to an $N$-independent constant (the α-side win) but — unlike AR — does not lift the BW ceiling, since pipelined tree and ring BC both already reach $M/\mathrm{BW}$, which is the single-rank-port floor that switch multicast also sits at. INC BC’s entire win is therefore on the α side: dramatic at small $M$ (where software BC pays the full $\lceil \log_2 N \rceil \cdot \alpha$ or $(N-1)\alpha$) and converging toward software as $M$ grows and the $M/\mathrm{BW}$ term dominates. See 04_in_network_collectives.md §1.1–§1.2 for the mechanism and per-primitive scope (multicast helps BC, AG, and the broadcast-half of AR).
4. Reduce
Reduce is the time-reverse dual of Broadcast: every rank $i$ starts with its own $V_i$ (size $M$), and rank 0 ends holding $\sum_i V_i$. It shows up in gradient accumulation onto a parameter server, loss aggregation for logging, score collection in some serving paths, and the up-reduce half of tree AR (§5.2). The two schedules mirror BC with time reversed: ring Reduce is a chain accumulating toward the sink (time-reverse of the §3.1 ring BC chain spreading from the source), and binomial-tree Reduce is the same tree as §3.2 with every edge arrow flipped — data flows leaves → root with a summation at every internal node instead of root → leaves. §4.1 covers the ring variant and §4.2 the binomial-tree variant, both with pipelining applied to collapse $n_\beta$ to 1 via the generic mechanism in Appendix C. §4.3 relates Reduce back to AR (showing why naïvely composing Reduce + BC is strictly worse than the fused DBT AR in §5.2) and points to the in-network Reduce primitive in 04_in_network_collectives.md §1.1.
4.1 Ring Reduce
Arrange the ranks in a chain $R_{N-1} \to R_{N-2} \to \ldots \to R_1 \to R_0$ with $R_0$ as the root. Reduce has a single sink — the mirror image of BC’s single source — so data flows one way along the chain toward $R_0$ and the wrap-around edge $R_0 \to R_{N-1}$ is idle. Each forwarding rank adds its own $V_i$ into the incoming partial sum before passing it downstream; the partial sum arriving at $R_0$ has already been summed across $R_{N-1}$ down to $R_1$, and $R_0$ folds in its own $V_0$ to complete. The same ring topology (same physical wiring, same in/out neighbor set) is reused by ring AR in §5.1 and ring AG / RS in §6.1, where every rank contributes on every link and the wrap-around carries real traffic.
Ring for $N = 4$ (arrows are the forward direction used by Reduce — data flows toward the root):
R3 ───→ R2 ───→ R1 ───→ R0
Chunk each rank’s $V_i$ into $P$ equal pieces of size $M/P$ and stream them down the chain: once a rank has received partial-sum chunk $s$ from its upstream neighbor, it adds its own chunk $s$ into the arriving vector and forwards the updated partial sum downstream, while concurrently receiving chunk $s+1$ from upstream. Every forward link carries one chunk per step in steady state — the same conveyor mechanic as ring BC (§3.1), with one on-chip add per hop. We walk through $N = 4$ with $P = N$ (one chunk per rank position, matching ring AR’s convention in §5.1 and giving the cleanest symmetric fill/drain). Each $V_i$ splits into four chunks $V_{i,s}$ of size $M/4$; write $S_s = V_{0,s} + V_{1,s} + V_{2,s} + V_{3,s}$ for the target full-sum chunk that $R_0$ must hold at the end.
Initial state.
R3: [V30 V31 V32 V33] ← source end of chain
R2: [V20 V21 V22 V23]
R1: [V10 V11 V12 V13]
R0: [V00 V01 V02 V03] ← root (final destination)
After step 1 — $R_3$ ships $V_{3,0}$ to $R_2$; $R_2$ sums it into its own slot 0. Pipeline is filling; only one link is active.
Step 1: R3 → R2 (V30); R2 accumulates
R3: [V30 V31 V32 V33]
R2: [V30+V20 V21 V22 V23] ← slot 0 now 2-way partial over {R3, R2}
R1: [V10 V11 V12 V13]
R0: [V00 V01 V02 V03]
After step 2 — two links active concurrently: $R_3 \to R_2$ streams $V_{3,1}$ while $R_2 \to R_1$ forwards the 2-way partial for slot 0; $R_1$ folds in $V_{1,0}$.
Step 2: R3 → R2 (V31); R2 → R1 (V30+V20) concurrently
R3: [V30 V31 V32 V33]
R2: [V30+V20 V31+V21 V22 V23]
R1: [V30+V20+V10 V11 V12 V13] ← slot 0 now 3-way partial over {R3, R2, R1}
R0: [V00 V01 V02 V03]
After step 3 — pipeline fully populated: three links carry chunks at three stages of accumulation, and $R_0$ receives the first complete slot.
Step 3: R3 → R2 (V32); R2 → R1 (V31+V21); R1 → R0 (V30+V20+V10) concurrently
R3: [V30 V31 V32 V33]
R2: [V30+V20 V31+V21 V32+V22 V23]
R1: [V30+V20+V10 V31+V21+V11 V12 V13]
R0: [S0 V01 V02 V03] ← slot 0 fully reduced at root
After step 4 — last chunk enters the pipeline; slot 1 completes at $R_0$.
Step 4: R3 → R2 (V33); R2 → R1 (V32+V22); R1 → R0 (V31+V21+V11) concurrently
R3: [V30 V31 V32 V33]
R2: [V30+V20 V31+V21 V32+V22 V33+V23]
R1: [V30+V20+V10 V31+V21+V11 V32+V22+V12 V13]
R0: [S0 S1 V02 V03]
After step 5 — $R_3$ is idle (its last chunk has already entered the pipeline); two links still active for drain.
Step 5: R2 → R1 (V33+V23); R1 → R0 (V32+V22+V12) concurrently
R3: [V30 V31 V32 V33]
R2: [V30+V20 V31+V21 V32+V22 V33+V23]
R1: [V30+V20+V10 V31+V21+V11 V32+V22+V12 V33+V23+V13]
R0: [S0 S1 S2 V03]
After step 6 — final drain; $R_0$ receives the last partial and folds in $V_{0,3}$.
Step 6: R1 → R0 (V33+V23+V13)
R0: [S0 S1 S2 S3] ← Reduce complete at root
Total: $N - 1 + P - 1 = 6$ steps (fill $N-1$ for the head chunk to traverse the chain, then drain $P-1$ after the last chunk enters at $R_{N-1}$). Each step ships $M/P$ bytes over its active link at cost $\alpha + M/(P\,\mathrm{BW})$:
$$t_{\mathrm{ring\,reduce}} = (N + P - 2)\left(\alpha + \frac{M}{P\,\mathrm{BW}}\right)$$
At optimal $P^* = \sqrt{(N{-}2)M/(\alpha \mathrm{BW})}$, in the large-$M$ limit, the cost approaches the asymptotic form:
$$t_{\mathrm{ring\,reduce}} \approx (N - 1)\,\alpha + \frac{M}{\mathrm{BW}}$$
The “$\approx$” hides an $O(\sqrt{M})$ correction of order $2\sqrt{(N{-}2)\alpha M/\mathrm{BW}}$ between the two floors — same structure as ring BC in §3.1, same Appendix C derivation, modified only by the extra on-chip add at each hop. The add is a reduce-op FLOP local to each rank’s accelerator; under the standard α-β accounting it is assumed non-bottlenecking (modern accelerators run reduce-ops at a throughput well above network BW) and contributes to neither $\alpha$ nor the BW term.
$n_\alpha = N - 1$, $n_\beta = 1$ — bandwidth-optimal (matches the asymptotic tree floor in §4.2) but linear $\alpha$ in $N$. Unlike the ring-vs-tree AR crossover at large $M$ (§5.3), for standalone Reduce the pipelined tree (§4.2) matches the same $M/\mathrm{BW}$ floor with $\lceil \log_2 N \rceil$ latency instead of $N-1$, so tree strictly dominates on port-adequate fabrics. Ring Reduce does ship in the production algorithm menus — NCCL reuses the shared ring infrastructure that Ring AllReduce runs on and exposes it via NCCL_ALGO=Ring, and MPICH / OpenMPI carry ring Reduce in their algorithm tables — but the auto-tuner rarely selects it for standalone Reduce because there is no ring-vs-tree crossover incentive at the single-sink level. The schedule remains load-bearing in two places even so: (i) the RS phase of ring AR / ring RS (§§5.1, 6.1) executes exactly this chain-accumulate mechanic on each chunk, composed with a ring wrap-around so every rank ends up owning one fully-reduced chunk instead of funneling everything to a single root; and (ii) in parameter-server-style deployments where only $R_0$ needs the result and the topology is a literal chain (early MapReduce-era setups, some inference-serving paths aggregating scores), ring Reduce is the schedule the software falls back to.
4.2 Binomial tree Reduce
At step $k$, every rank whose $k$-th bit is 1 sends its current vector to its paired “bit-0” partner, which adds the received vector into its local copy. The active-contributor set halves each step: $\{R_0, R_1, R_2, R_3\} \to \{R_0, R_2\} \to \{R_0\}$ at $N = 4$. After $\lceil \log_2 N \rceil$ steps only rank 0 remains, holding the full $\sum_i V_i$. We trace $N = 4$ with the same initial state as §5.1 (each $R_i$ holds $V_i = [v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}]$).
Tree structure for $N = 4$. The same binomial tree $B_2$ as §3.2, with every edge’s arrow reversed — data flows leaves → root with summation at every internal node:
R0 (depth 0, root — collects sum)
▲ ▲
│ │
R1 R2 (depth 1)
▲
│
R3 (depth 2)
Read side-by-side with §3.2: identical edges, opposite direction; BC and Reduce are the same tree schedule read in opposite time directions. Depth is $\lceil \log_2 N \rceil = 2$, setting $n_\alpha$ just as for BC.
Initial state.
R0: [v00 v01 v02 v03]
R1: [v10 v11 v12 v13]
R2: [v20 v21 v22 v23]
R3: [v30 v31 v32 v33]
active-set: {R0, R1, R2, R3}
After step 1 — $R_1 \to R_0$ and $R_3 \to R_2$, concurrently (two disjoint pairs; the “bit-1 senders” fold into their “bit-0 receivers”). $R_0$ and $R_2$ each sum the incoming vector into their local copy; $R_1$ and $R_3$ drop out.
Step 1: R1 → R0 (R0 sums in); R3 → R2 (R2 sums in)
R0: [v00+v10 v01+v11 v02+v12 v03+v13] ← 2-way sum over {R0, R1}
R1: stale ← sent up
R2: [v20+v30 v21+v31 v22+v32 v23+v33] ← 2-way sum over {R2, R3}
R3: stale ← sent up
active-set: {R0, R2}
After step 2 — $R_2 \to R_0$. The active set halves to $\{R_0\}$; $R_0$ now holds the full 4-way sum.
Step 2: R2 → R0 (R0 sums in)
R0: [S0 S1 S2 S3] ← full 4-way sum at root
R1: stale
R2: stale
R3: stale
where S_k = v_{0,k} + v_{1,k} + v_{2,k} + v_{3,k}.
active-set: {R0} — Reduce complete
Reduce is done in $\lceil \log_2 4 \rceil = 2$ sequential steps; at general $N$ in $\lceil \log_2 N \rceil$ steps. Each step moves the full $M$-byte vector on the active link:
$$t_{\mathrm{bin\,reduce}} = \lceil \log_2 N \rceil \cdot \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}}$$
Same $(n_\alpha, n_\beta)$ as binomial tree BC (§3.2), and for the same reason: the tree depth sets both the sequential step count and the per-step full-$M$ transfer. Read side-by-side with the BC matrix in §3.2, the relationship is literal: flip the arrow direction on every edge and replace “send $V$ and overwrite” with “send current vector and sum in” — BC and Reduce are the same tree schedule read in opposite time directions.
Asymptotic form (bandwidth-bound regime; pipelined implementation). Chunks of size $M/P$ flow up the tree with internal nodes reducing on the fly: once a parent has received chunk $s$ from both of its children, it forwards the summed chunk upward while reducing chunk $s+1$ in parallel. Substituting $L = \lceil \log_2 N \rceil$ (tree depth) into the Appendix C master formula and optimizing at $P^* = \sqrt{(\lceil \log_2 N \rceil - 1)M/(\alpha \mathrm{BW})}$ in the large-$M$ limit, the BW coefficient drops from $\lceil \log_2 N \rceil$ to 1 up to an $O(\sqrt{M})$ correction $2\sqrt{(\lceil \log_2 N \rceil - 1)\alpha M/\mathrm{BW}}$:
$$t_{\mathrm{bin\,reduce}} \approx \lceil \log_2 N \rceil \cdot \alpha + \frac{M}{\mathrm{BW}}$$
This is what NCCL’s tree-path ncclReduce runs for bulk $M$ — simultaneously latency- and bandwidth-optimal for Reduce on port-adequate fabrics, mirroring pipelined BC. The generic derivation and the port-budget caveat are in Appendix C; the binomial tree satisfies the port budget so the collapse applies. Ring Reduce (§4.1) reaches the same $M/\mathrm{BW}$ floor but with $N-1$ latency hops instead of $\lceil \log_2 N \rceil$, so at the standalone-Reduce level there is no ring-vs-tree crossover — tree dominates.
4.3 Comparison and practical adoption
Three software implementations (binomial tree $P=1$, binomial tree $P^$, ring $P^$) plus a hardware in-network path sit on the table. Before presenting the combined view, the hardware option needs a brief introduction — the two software algorithms were fully specified in §§4.1–4.2, but the switch-ALU reduce primitive has not yet appeared in this chapter.
Hardware in-network Reduce primitive. Switched fabrics expose a switch-hosted ALU (InfiniBand Quantum SHARP, NVSwitch Gen3+ / NVLS) that sums $N$ incoming $M$-byte flits into one $M$-byte output in one switch-local operation. Each rank pushes $V_i$ upstream to the switch (1 hop); the switch crossbar reduces the $N$ payloads on-chip and forwards the single reduced result to the root rank (1 more hop). Two switch-crossing hops total, independent of $N$:
$$t_{\mathrm{INC,\,Reduce}} \;\approx\; 2\,\alpha_\mathrm{switch} + \frac{M}{\mathrm{BW}}$$
Only the switch-ALU half of the INC primitive is used here, symmetric to BC in §3.3 which used only the multicast half — no multicast is needed because the output goes to a single destination. $\alpha_\mathrm{switch}$ is the switch cut-through plus ALU latency — $\sim 100$–$200$ ns for NVSwitch Gen3/4 with NVLS and comparable on IB Quantum-2 with SHARPv2 — typically well below the $\alpha \approx 0.5$–$1\,\mu$s that software Reduce accumulates across $\lceil \log_2 N \rceil$ or $N-1$ endpoint-driven hops. PCIe switches appear in §3.3’s BC list but are absent here because they have no ALU (see footnote at §3.3); Ethernet in-network reduction (RoCE-based aggregation on newer AI-focused switches) is an evolving direction, omitted until its ALU-inclusive latency is well-characterized.
All four options side by side. The two software algorithms each have a finite-$P$ and an asymptotic form; which form wins is governed by the $M$-regime, since the choice of $P$ (non-pipelined $P = 1$ vs asymptotic $P = P^*$) is dictated by whether the α term or the BW term dominates. Switch-ALU Reduce is an orthogonal hardware-dependent alternative and applies across regimes where the fabric exposes it:
| Algorithm | α term | BW term | Regime / adoption |
|---|---|---|---|
| Binomial tree, non-pipelined ($P = 1$) | $\lceil \log_2 N \rceil \cdot \alpha$ | $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ | Small $M$ (α-bound, $M/\mathrm{BW} \ll \alpha$); NCCL/MPI tree path |
| Binomial tree, asymptotic ($P^*$) | $\lceil \log_2 N \rceil \cdot \alpha$ | $M/\mathrm{BW}$ | Medium-to-large $M$; NCCL/MPI tree path (default for bulk $M$) |
| Ring, asymptotic ($P^*$) | $(N-1) \cdot \alpha$ | $M/\mathrm{BW}$ | Available in NCCL (NCCL_ALGO=Ring, shared ring-AR infra) and MPI menus, but rarely selected for standalone Reduce — tree dominates at the same BW floor with smaller α. Same mechanic composes into the RS phase of ring AR / RS (§§5.1, 6) |
| Switch-ALU Reduce (INC) | $2 \cdot \alpha_\mathrm{switch}$ | $M/\mathrm{BW}$ | All $M$ where fabric exposes it; IB SHARP, NVSwitch NVLS |
Reading the rows as a progression. At small $M$ the BW term is negligible no matter what coefficient multiplies it, so software picks whichever algorithm has the smallest $L$ on the α side — binomial tree, with $L = \lceil \log_2 N \rceil$ vs ring’s $N-1$. At medium $M$ the BW term matters, so the tree pipelines to collapse its $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ coefficient down to $M/\mathrm{BW}$; the α term stays at log depth, so tree remains favored. At large $M$ — unlike BC (§3.3) and AR (§5.3) where ring wins on the BW side via its intrinsically-pipelined $P = N$ schedule — Reduce has no ring-vs-tree crossover: ring’s BW floor matches tree’s $M/\mathrm{BW}$ but its α term remains $(N-1)\alpha$ vs tree’s $\lceil\log_2 N\rceil \cdot \alpha$, so tree strictly dominates across the full $M$ range. This is why ncclReduce’s tuner reaches for the tree path across small and large $M$ alike — the same $(N, M)$ crossover exists between $P=1$ and $P^*$ within the tree family, but not between tree and ring.
INC sits alongside this progression as a hardware alternative rather than a fifth regime. It collapses the α term to an $N$-independent constant (the α-side win) but — unlike AR — does not lift the BW ceiling, since pipelined tree and ring both already reach $M/\mathrm{BW}$, which is the single-destination-port floor that switch-ALU Reduce also sits at. INC Reduce’s entire win is therefore on the α side: dramatic at small $M$ (where software Reduce pays the full $\lceil \log_2 N \rceil \cdot \alpha$) and converging toward software tree as $M$ grows. See 04_in_network_collectives.md §1.1–§1.2 for the switch-ALU mechanism and per-primitive scope; the BW-side ceiling lift that AR uniquely receives (from fusing the reduce and broadcast halves so each byte crosses the fabric once instead of twice) is laid out in 04_in_network_collectives.md §1.4.
Why Reduce stays standalone and does not build AR. Reduce + BC is semantically equivalent to AR but strictly worse in cost. Composing two pipelined-tree primitives costs $2\lceil \log_2 N \rceil \cdot \alpha + 2 M/\mathrm{BW}$ (pipelined tree in each direction). DBT AR’s $2\lceil \log_2 N \rceil \cdot \alpha + M/\mathrm{BW}$ (§5.2) has the same α count but half the BW term, because DBT’s two complementary trees each carry half the message concurrently on the opposite direction of the same full-duplex link instead of shipping the full reduced result a second time. Modern AR schedules (§5) exploit this fusion and avoid the decomposition. Reduce therefore appears as a standalone primitive only when the output is single-rank by design — gradient accumulation onto a parameter server, loss aggregation for logging, score collection in some serving paths — not as a building block for AR.
5. All-reduce
AR is the workhorse reduction primitive: every rank starts with its own length-$M$ vector $V_i$ and ends holding the elementwise sum $\sum_i V_i$. Two families of algorithms dominate in practice: ring-based (§5.1), which chains $2(N-1)$ sequential exchanges along a logical ring and hits the Patarasuk-Yuan bandwidth lower bound, and tree-based (§5.2), which collapses the step count to $O(\log N)$ via hypercube / binary-tree pairings. NCCL ships ring and double binary tree (DBT) — the two algorithms covered in the main text below. Two tree variants from the MPI literature and older HPC work — simple recursive doubling and Rabenseifner halving-doubling — remain instructive as reference points even though they are not in NCCL’s shipping menu; their derivations live in Appendix B, and Appendix B.3 consolidates the port-budget / partner-cycling argument against them. §5.3 summarizes the comparison between the shipped algorithms — ring, DBT, and the hardware in-network AR primitive — and explains the NCCL selection rule between ring and DBT.
5.1 Ring AR
Ring AR on $N$ ranks runs RS in $N-1$ steps, then AG in $N-1$ steps — the Patarasuk-Yuan bandwidth-optimal construction. We walk through $N=4$, each rank holding a length-4 vector, chunked into 4 pieces.
Setup. Four ranks $R_0, R_1, R_2, R_3$ wired into a ring — each rank concurrently sends to its right neighbor and receives from its left neighbor every step:
R0 ───→ R1 ───→ R2 ───→ R3
▲ │
└───────────────────────┘
Each rank $R_i$ holds a vector $V_i = [v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}]$. Target: every rank ends up holding $[S_0, S_1, S_2, S_3]$ where $S_k = \sum_i v_{i,k}$.
Phase 1: reduce-scatter (3 steps).
At step $t \in \{1, 2, \ldots, N-1\}$, rank $i$ sends chunk $(i - t + 1) \bmod N$ to its right neighbor and receives chunk $(i - t) \bmod N$ from its left neighbor, adding the incoming value into its local copy at that slot. That slot — $(i - t) \bmod N$ — is the rank’s accumulator chunk for step $t$. Sanity-check at $t=1$: $R_i$ sends its own-index chunk $i$ rightward and accumulates into slot $(i-1) \bmod N$.
The accumulator slot index shifts by $-1 \bmod N$ every step. After $N-1$ steps, each rank’s accumulator sits at slot $(i - (N-1)) \bmod N = (i+1) \bmod N$ and holds the fully reduced $N$-way sum for that slot.
Initial (before any step):
R0: [v00 v01 v02 v03]
R1: [v10 v11 v12 v13]
R2: [v20 v21 v22 v23]
R3: [v30 v31 v32 v33]
After step 1 — each rank sends one chunk right and receives one chunk from its left neighbor, adding it into its local copy. Only the accumulator chunk changes; every other slot is untouched.
Step 1: each rank adds one incoming chunk into its accumulator slot
R0: [v00 v01 v02 v03+v33 ] ← accumulator: chunk 3
R1: [v10+v00 v11 v12 v13 ] ← accumulator: chunk 0
R2: [v20 v21+v11 v22 v23 ] ← accumulator: chunk 1
R3: [v30 v31 v32+v22 v33 ] ← accumulator: chunk 2
Every accumulator now holds a 2-term partial sum. The slots not marked as accumulators still hold the rank’s original values — they’ll stay that way until AG phase overwrites them.
After step 2 — each rank forwards the chunk it just updated to its right neighbor, so the partial sum grows by one more term. The accumulator slot index shifts by $-1 \bmod 4$ each step.
Step 2: each rank forwards its step-1 accumulator; receiver adds it in
R0: [v00 v01 v02+v22+v32 v03+v33 ] ← accumulator: chunk 2 (3 terms)
R1: [v10+v00 v11 v12 v13+v03+v33 ] ← accumulator: chunk 3 (3 terms)
R2: [v20+v10+v00 v21+v11 v22 v23 ] ← accumulator: chunk 0 (3 terms)
R3: [v30 v31+v21+v11 v32+v22 v33 ] ← accumulator: chunk 1 (3 terms)
Trace of what flowed into each accumulator:
- R0 chunk 2 picked up $v_{22}+v_{32}$ from R3 (R3’s chunk 2 after step 1).
- R1 chunk 3 picked up $v_{03}+v_{33}$ from R0.
- R2 chunk 0 picked up $v_{10}+v_{00}$ from R1.
- R3 chunk 1 picked up $v_{21}+v_{11}$ from R2.
After step 3 — one more forward, and each accumulator absorbs the final term. The accumulator is now a full 4-way sum — that’s $S_k$, the fully-reduced chunk for slot $k$.
Step 3: each rank forwards its step-2 accumulator; receiver adds the last summand
R0: [v00 S1 v02+v22+v32 v03+v33 ] ← fully reduced: chunk 1 = S1
R1: [v10+v00 v11 S2 v13+v03+v33 ] ← fully reduced: chunk 2 = S2
R2: [v20+v10+v00 v21+v11 v22 S3 ] ← fully reduced: chunk 3 = S3
R3: [S0 v31+v21+v11 v32+v22 v33 ] ← fully reduced: chunk 0 = S0
where S_k = v_{0,k} + v_{1,k} + v_{2,k} + v_{3,k}.
This is exactly the RS end-state: each rank holds one fully-reduced chunk (on a different slot), and the other three slots still contain junk partial sums. Those will be overwritten in the next phase.
Cost accounting. Each of the 3 steps costs $\alpha + M/(4\,\mathrm{BW})$ (one handshake plus one $M/N$-sized chunk over the link). The 3 steps run sequentially (step $t+1$ forwards the chunk that step $t$ just accumulated), so the costs add: the whole RS phase takes
$$t_{\mathrm{RS}} = 3\alpha + 3 \cdot \frac{M}{4\,\mathrm{BW}} \qquad \text{(totals across all 3 steps, not per step).}$$
Generalizing to $N$ ranks: $(N-1)\alpha + (N-1)\,M/(N\,\mathrm{BW})$.
Phase 2: all-gather (3 steps).
After RS, each rank owns exactly one fully-reduced chunk — $R_i$ owns slot $(i+1) \bmod N$ — and needs the other three. The plan: same ring, same direction (send right, receive from left), no reduction — each rank forwards what it just received, overwriting the junk in the receiver’s corresponding slot.
At step $t \in \{1, 2, \ldots, N-1\}$, rank $i$ sends chunk $(i - t + 2) \bmod N$ to its right neighbor and overwrites its local slot $(i - t + 1) \bmod N$ with the chunk received from its left neighbor. Sanity-check at $t=1$: $R_i$ forwards its freshly-reduced slot $(i+1) \bmod N$ rightward and overwrites its own-index slot $i$ with an incoming fully-reduced chunk.
These are exactly the RS formulas shifted by $+1$ in slot index. Same ring, same direction, same $-1 \bmod N$ slot-index shift per step — just a $+1$ offset in where the action starts. The reason for the offset: RS leaves $R_i$ owning slot $(i+1) \bmod N$ (not slot $i$), so AG begins by forwarding that $+1$-offset slot.
Starting state (using ? for the stale partial sums from RS — they’re unreadable garbage, about to be overwritten):
Before AG step 1 (only the fully-reduced chunk per rank is known-good):
R0: [ ? S1 ? ? ]
R1: [ ? ? S2 ? ]
R2: [ ? ? ? S3 ]
R3: [S0 ? ? ? ]
After AG step 1 — each rank sends its one fully-reduced chunk right. The receiver overwrites its corresponding slot.
AG step 1: R_i sends its one known chunk to R_{i+1}
R0: [S0 S1 ? ? ] (received S0 from R3 → chunk 0 overwritten)
R1: [ ? S1 S2 ? ] (received S1 from R0 → chunk 1 overwritten)
R2: [ ? ? S2 S3 ] (received S2 from R1 → chunk 2 overwritten)
R3: [S0 ? ? S3 ] (received S3 from R2 → chunk 3 overwritten)
Each rank now holds two fully-reduced chunks.
After AG step 2 — each rank forwards the chunk it just received (not its original one) to the right. Same overwrite pattern.
AG step 2: forward what you just received
R0: [S0 S1 ? S3 ] (received S3 from R3 → chunk 3 overwritten)
R1: [S0 S1 S2 ? ] (received S0 from R0 → chunk 0 overwritten)
R2: [ ? S1 S2 S3 ] (received S1 from R1 → chunk 1 overwritten)
R3: [S0 ? S2 S3 ] (received S2 from R2 → chunk 2 overwritten)
Three of the four slots are now full-reduced on every rank.
After AG step 3 — one last forward completes the ring.
AG step 3: forward what you just received
R0: [S0 S1 S2 S3 ]
R1: [S0 S1 S2 S3 ]
R2: [S0 S1 S2 S3 ]
R3: [S0 S1 S2 S3 ]
Every rank now holds the complete reduced vector $[S_0, S_1, S_2, S_3]$. AR is done.
Cost accounting (AG). Same structure as RS — 3 sequential steps, each $\alpha + M/(4\,\mathrm{BW})$. Total for the AG phase:
$$t_{\mathrm{AG}} = 3\alpha + 3 \cdot \frac{M}{4\,\mathrm{BW}} \qquad \text{(totals across all 3 steps).}$$
Generalizing to $N$ ranks: $(N-1)\alpha + (N-1)\,M/(N\,\mathrm{BW})$.
Total cost. Combining RS + AG for $N$ ranks:
$$t_{\mathrm{ring\,AR}} = 2(N-1)\,\alpha + 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
For $N=4$: $6\alpha + 1.5 \cdot M/\mathrm{BW}$. In the $N \to \infty$ limit the BW coefficient $2(N{-}1)/N$ approaches the asymptotic form $2 M/\mathrm{BW}$ — every byte must leave its originating rank once and arrive at each of the other $N{-}1$ ranks once, giving a floor of $2 M/\mathrm{BW}$ that ring hits nearly perfectly.
Asymptotic form (bandwidth-bound regime; intrinsically pipelined implementation). Unlike ring BC (§3.1) and ring Reduce (§4.1) — which trace a single message through $N{-}1$ sequential hops and reach their asymptote only at optimal pipeline depth $P^{*} = \sqrt{(N{-}2)M/(\alpha\mathrm{BW})}$ — ring AR’s closed form above is already the pipelined asymptote with $P = N$. The Patarasuk-Yuan construction bakes the pipelining into the derivation via the $M/N$ chunked payload; the $2(N{-}1)$-step schedule is not the $P = 1$ baseline that gets pipelined later, it is the pipelined schedule. Three structural reasons no further collapse is possible:
- No fill or drain. Every rank starts with its own full vector, so at step 1 every rank is already concurrently sending one chunk on its right-neighbor link and receiving a different chunk from its left — steady state from step 1 to step $2(N{-}1)$. Unlike ring BC / Reduce whose pipeline fills over $N{-}1$ inactive steps while the first message traverses the chain, ring AR runs $N$ parallel chunk-pipelines in lockstep (each originating rank drives its own $N$-hop journey), so there is no fill phase to amortize.
- $P > N$ cannot improve the BW floor. The bottleneck neighbor-link already carries $2(N{-}1) \cdot M/N$ total bytes across the collective — matching the per-rank BW lower bound for AR (each of $M$ bytes must leave its origin rank once and arrive at each of the other $N{-}1$ ranks once). Finer segmentation shrinks per-step payload but not total bytes per link; the BW coefficient $2(N{-}1)/N \to 2\,M/\mathrm{BW}$ is a true lower bound, not an asymptote approached in a limit.
- No $O(\sqrt{M})$ correction. The Appendix C master formula’s $2\sqrt{(L{-}1)\alpha M/\mathrm{BW}}$ correction originates in the $L{-}1$ idle steps while a single-source pipeline fills. Ring AR has no single source and no fill, so that correction is absent — the closed form is exact under the conflict-free ring assumption, not an $\approx$-approximation:
$$t_{\mathrm{ring\,AR}} \;=\; 2(N{-}1)\,\alpha + 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
This property — bandwidth-optimal without needing the pipelining trick — is the structural reason ring stays competitive at large $M$ despite its $O(N) \cdot \alpha$ latency, and the reason ring re-takes the crown from DBT at bulk $M$ in NCCL’s tuner. Revisited in §5.3 where ring’s exact asymptote sits beside DBT’s pipelined $M/\mathrm{BW}$ floor (whose real-world coefficient $c_{\mathrm{real}} \geq 1$ inflates above that floor under implementation overhead) and the INC primitive’s $M/\mathrm{BW}$ floor.
5.2 Double binary tree (DBT) AR
Tree-based AR collapses ring’s $N-1$ sequential steps into $\lceil \log_2 N \rceil$ by using a binary-hypercube pairing pattern instead of a ring. At each step, every rank exchanges data with exactly one partner, and the “reachable set” doubles in size each step — so the whole $N$-rank group is covered in $\log_2 N$ steps.
Three tree-flavored AR variants appear in the literature, each reading AR through one of its two natural decompositions (RS + AG, or reduce + broadcast):
- Simple recursive doubling — does not decompose. Single-pass butterfly / hypercube sweep, equivalent to recursive-doubling AG with addition substituted for concatenation (adding two $M$-sized vectors keeps the result $M$-sized, so no half grows or shrinks). Minimum latency, but the price of single-pass simplicity is sending the full $M$ at every step, so bandwidth scales as $\log_2 N \cdot M$. Not shipped by NCCL. Full derivation: Appendix B.1.
- Rabenseifner halving-doubling ≡ RS + AG. Two complementary hypercube passes with chunk-exponential payloads. Recognizes that step $k$ only needs the step-$k$-relevant chunks (size $2^{k-1} \cdot M/N$ or its complement), which shrinks the BW coefficient from $\log_2 N$ to the optimal $2(N{-}1)/N$ at the cost of $2\times$ more steps. Not shipped by NCCL. Full derivation: Appendix B.2.
- Double binary tree (DBT) [SST09] ≡ reduce + broadcast (× 2 trees). Two complementary tree passes, full-vector payloads split into pipeline segments. The second tree $T_2$ runs concurrently with complementary interior/leaf roles so both link directions stay busy, halving the per-link message at every step. Shipped by NCCL as the non-ring multi-node default; derived in full below.
Why NCCL ships DBT and not the other two is the subject of Appendix B.3; the short version is that DBT is the only tree-flavored variant whose per-rank concurrency requirement fits the port budget of real fabrics, which in turn lets pipelining collapse its $\log_2 N$ BW coefficient down to near-ring. The other two serialize on bounded-port fabrics.
Single binary tree at $N=4$. Tree structure:
R0 (root, depth 0)
/ \
R1 R2 (depth 1)
/
R3 (depth 2)
$R_0$ is root. $R_1$ and $R_2$ are $R_0$’s children. $R_3$ is $R_1$’s child. Depth = 2 = $\lceil \log_2 4 \rceil$.
Initial state (same $v_{i,k}$ as before):
R0: [v00 v01 v02 v03]
R1: [v10 v11 v12 v13]
R2: [v20 v21 v22 v23]
R3: [v30 v31 v32 v33]
Phase 1 — Reduce (depth $\log_2 N$ steps, leaves → root).
Step 1 (depth 2 → depth 1). $R_3$ sends its full vector up to $R_1$; $R_1$ sums. $R_2$ is idle (no children).
R0: [v00 v01 v02 v03 ] ← unchanged (root waiting)
R1: [v10+v30 v11+v31 v12+v32 v13+v33 ] ← 2-way sum over {R1, R3}
R2: [v20 v21 v22 v23 ] ← unchanged (leaf, waiting)
R3: stale ← sent up
Step 2 (depth 1 → depth 0). $R_1$ and $R_2$ concurrently send their vectors up to $R_0$. $R_0$ sums both incoming into its local copy.
R0: [S0 S1 S2 S3 ] ← 4-way full sum
R1: stale ← sent up
R2: stale ← sent up
R3: stale
Phase 2 — Broadcast (depth $\log_2 N$ steps, root → leaves).
Step 3 (depth 0 → depth 1). $R_0$ concurrently sends $S$ down to $R_1$ and $R_2$.
R0: [S0 S1 S2 S3]
R1: [S0 S1 S2 S3] ← received
R2: [S0 S1 S2 S3] ← received
R3: stale
Step 4 (depth 1 → depth 2). $R_1$ sends $S$ down to $R_3$. $R_2$ is idle (no children).
R0: [S0 S1 S2 S3]
R1: [S0 S1 S2 S3]
R2: [S0 S1 S2 S3]
R3: [S0 S1 S2 S3] ← received
All 4 ranks now hold $S$. Single-tree AR done in $2 \lceil \log_2 N \rceil = 4$ sequential steps.
Cost of single binary tree. Each step moves the full $M$-byte vector on the active link: $\alpha + M/\mathrm{BW}$. 4 sequential steps → total $4\alpha + 4M/\mathrm{BW}$. Generalizing:
$$t_{\mathrm{single\,tree\,AR}} = 2\lceil \log_2 N \rceil \, \alpha + 2\lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}}$$
Latency hits the log-depth lower bound for a binary-combinable reduction, but the BW coefficient is $\log_2 N$ — a factor of $\log_2 N / 2$ above the ring-optimal $2(N{-}1)/N \to 2$ floor. The inefficiency: at steps 1 and 4, only one link is active; at steps 2 and 3, multiple links are active but each one is driving only one direction. Half the full-duplex link capacity is idle across the algorithm.
The double-tree optimization. Construct a second tree $T_2$ whose role assignments are the complement of $T_1$: every rank that is interior in $T_1$ is a leaf in $T_2$, and vice versa. For $N=4$:
T1: R0 T2: R3
/ \ / \
R1 R2 R0 R2
/ \
R3 R1
Role check:
| Rank | $T_1$ | $T_2$ |
|---|---|---|
| $R_0$ | root (internal) | leaf |
| $R_1$ | internal | leaf |
| $R_2$ | leaf | internal |
| $R_3$ | leaf | root (internal) |
Each rank is interior in exactly one tree. Now run AR on $T_1$ and $T_2$ concurrently, each on a different half of the message: chunks $\{0, 1\}$ (left half, $M/2$ bytes) flow through $T_1$ — reduced at $R_0$, broadcast from $R_0$. Chunks $\{2, 3\}$ (right half, $M/2$ bytes) flow through $T_2$ — reduced at $R_3$, broadcast from $R_3$. Because the two trees have complementary roles, a rank’s $T_1$ traffic travels on one direction of the full-duplex link while its $T_2$ traffic travels on the other — both halves make progress every step.
Initial state (| separates the left half carried by $T_1$ from the right half carried by $T_2$):
left half (T1) | right half (T2)
R0: [ v00 v01 | v02 v03 ]
R1: [ v10 v11 | v12 v13 ]
R2: [ v20 v21 | v22 v23 ]
R3: [ v30 v31 | v32 v33 ]
Step 1 — reduce, depth 2 → 1 in both trees. $T_1$: $R_3 \to R_1$ (left half). $T_2$: $R_1 \to R_2$ (right half). Concurrent.
R0: [ v00 v01 | v02 v03 ] ← idle both trees
R1: [ v10+v30 v11+v31 | stale stale ] T1: +R3 left; T2: right sent up
R2: [ v20 v21 | v22+v12 v23+v13 ] T2: +R1 right
R3: [ stale stale | v32 v33 ] T1: left sent up
Step 2 — reduce, depth 1 → 0 in both trees. $T_1$: $R_1, R_2 \to R_0$ (left halves, two concurrent sends into $R_0$). $T_2$: $R_0, R_2 \to R_3$ (right halves, two concurrent sends into $R_3$). Concurrent.
R0: [ S0 S1 | stale stale ] T1: 4-way left; T2: right sent up
R1: [ stale stale | stale stale ]
R2: [ stale stale | stale stale ]
R3: [ stale stale | S2 S3 ] T2: 4-way right
where S_k = v_{0,k} + v_{1,k} + v_{2,k} + v_{3,k}.
After the two reduce steps, $R_0$ holds the fully-reduced left half and $R_3$ holds the fully-reduced right half — mirror-image completions of each tree’s reduce phase.
Step 3 — broadcast, depth 0 → 1 in both trees. $T_1$: $R_0 \to R_1, R_2$ (left half). $T_2$: $R_3 \to R_0, R_2$ (right half). Concurrent.
R0: [ S0 S1 | S2 S3 ] T2: received right from R3
R1: [ S0 S1 | stale stale ] T1: received left from R0
R2: [ S0 S1 | S2 S3 ] both trees: received
R3: [ stale stale | S2 S3 ]
Step 4 — broadcast, depth 1 → 2 in both trees. $T_1$: $R_1 \to R_3$ (left half). $T_2$: $R_2 \to R_1$ (right half). Concurrent.
R0: [ S0 S1 | S2 S3 ]
R1: [ S0 S1 | S2 S3 ] T2: received right from R2
R2: [ S0 S1 | S2 S3 ]
R3: [ S0 S1 | S2 S3 ] T1: received left from R1
All 4 ranks now hold $[S_0, S_1, S_2, S_3]$ in $2\lceil \log_2 N \rceil = 4$ sequential steps — same step count as single-tree, but each step moves only $M/2$ bytes per link (the other half is on the sibling tree, which occupies the opposite direction of the same full-duplex link).
Cost. $2\lceil \log_2 N \rceil$ sequential steps, each costing $\alpha + (M/2)/\mathrm{BW}$ (handshake + half-message on the active link):
$$t_{\mathrm{double\,tree\,AR}} = 2\lceil \log_2 N \rceil \, \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}}$$
For $N=4$: $4\alpha + 2\,M/\mathrm{BW}$ — $2\times$ better than single-tree’s $4\,M/\mathrm{BW}$, and at this size coincidentally matching ring’s $2\,M/\mathrm{BW}$. The bandwidth term improves on single-tree’s $2\log_2 N \cdot M/\mathrm{BW}$ by a factor of $2$, thanks to the half-message-per-link property the second tree enables.
Asymptotic form (bandwidth-bound regime; pipelined implementation). The closed form above is the non-pipelined schedule with $P = 1$ — each tree carries one full $M/2$-byte segment per step. Cut each tree’s half-message into $P$ equal segments and stream them through the $L = 2\lceil \log_2 N \rceil$-step schedule: segments at different tree depths use disjoint physical edges (a segment-$s$ forward on a depth-$k$ edge and a segment-$(s{+}1)$ forward on a depth-$(k{-}1)$ edge don’t conflict), and the two trees occupy opposite directions of each full-duplex link (so $T_1$ and $T_2$ segments don’t conflict either). Per-rank concurrency peaks at $\sim 3$ partners — the busiest role is a tree’s root at the final reduce step, concurrently receiving from its up-to-2 children while participating in the sibling tree’s broadcast as a leaf — well within the port budget of any modern fabric tier (3+ NVLink channels on-node, 3+ NICs off-node). Substituting $L = 2\lceil \log_2 N \rceil$ and per-segment payload $M/(2P)$ into the Appendix C master formula and optimizing:
$$t_{\mathrm{DBT,\,pipe}}(P^{*}) \;\approx\; 2\lceil \log_2 N \rceil \cdot \alpha + \frac{M}{\mathrm{BW}}$$
The “$\approx$” hides an $O(\sqrt{M})$ correction of order $2\sqrt{(2\lceil \log_2 N \rceil - 1)\,\alpha M/\mathrm{BW}}$ — same Appendix C master-formula structure as ring BC (§3.1) and ring Reduce (§4.1), now with $L = 2\lceil \log_2 N \rceil$; the correction vanishes relative to the $M/\mathrm{BW}$ floor as $M \to \infty$. The pure-model $\lceil \log_2 N \rceil$ factor in the BW term is gone — replaced by the App-C floor of 1 — while the latency term stays at $2\lceil \log_2 N \rceil \alpha$. That is the “log-depth latency AND near-ring bandwidth” combination that makes DBT the shipping choice for small-to-medium $M$ on NCCL. This $M/\mathrm{BW}$ floor is a lower bound only: implementation reality (finite pipeline depth $P \ll P^*$, per-step kernel launch overhead, imperfect edge-by-edge overlap) pushes the real-world DBT BW coefficient to some $c_{\mathrm{real}} \geq 1$ above this floor — §5.3’s practice caveat explains how that inflation determines where NCCL’s tuner flips from DBT to ring at bulk $M$.
5.3 Comparison and practical adoption
Three software forms plus a hardware in-network path sit on the table: ring (§5.1, intrinsically pipelined with $P = N$), DBT non-pipelined ($P = 1$, §5.2), DBT asymptotic ($P^*$, §5.2), and switch-ALU AR (INC). Before presenting the combined view, the hardware option needs a brief introduction — the three software forms were fully specified in §§5.1–5.2, but INC AR has not yet appeared in this chapter. Two MPI-era variants — simple recursive doubling and Rabenseifner — are omitted from the table because NCCL / RCCL do not ship them; their derivations and the port-budget / partner-cycling argument that rules them out live in Appendix B.3.
Hardware in-network AR primitive. Switched fabrics expose a switch-hosted ALU (InfiniBand Quantum SHARP / SHARPv2, NVSwitch Gen3+ / NVLS) that fuses AR’s two halves into a single on-chip operation: the switch crossbar reduces the $N$ incoming $M$-byte flits into one output, then multicasts that output back down to all $N$ ranks. Each endpoint link carries $M$ bytes up (one flit into the switch) and $M$ bytes down (the multicast copy) on opposite directions of the full-duplex link. Because the up and down halves use opposite directions they overlap — unlike software AR, whose two halves are serialized within the step schedule — so the BW term drops from software AR’s $2\,M/\mathrm{BW}$ floor to $M/\mathrm{BW}$. Two switch-crossing hops total, independent of $N$:
$$t_{\mathrm{INC,\,AR}} \;\approx\; 2\,\alpha_\mathrm{switch} + \frac{M}{\mathrm{BW}}$$
Both halves of the INC primitive are used here — the switch-ALU half (compare §4.3 INC Reduce, which used only the ALU half) and the switch-multicast half (compare §3.3 INC BC, which used only the multicast half). $\alpha_\mathrm{switch}$ is the switch cut-through plus ALU latency — $\sim 100$–$200$ ns for NVSwitch Gen3/4 with NVLS and comparable on IB Quantum-2 with SHARPv2 — typically well below the $\alpha \approx 0.5$–$1\,\mu$s that software AR accumulates across $2\lceil \log_2 N \rceil$ or $2(N-1)$ endpoint-driven hops. PCIe switches are absent here for the same ALU-less reason they are absent from §4.3 (see footnote at §3.3). The BW-side lift — unique to AR among the primitives in this chapter — is derived in 04_in_network_collectives.md §1.4.
All four options side by side. The two DBT rows differ by the choice of $P$ (non-pipelined $P = 1$ vs asymptotic $P = P^*$), and which wins is dictated by the $M$-regime since that choice is governed by whether the α term or the BW term dominates. Switch-ALU AR is an orthogonal hardware-dependent alternative and applies across regimes where the fabric exposes it:
| Algorithm | α term | BW term | Regime / adoption |
|---|---|---|---|
| Ring, intrinsically pipelined ($P = N$, §5.1) | $2(N{-}1)\,\alpha$ | $\dfrac{2(N{-}1)}{N} \cdot \dfrac{M}{\mathrm{BW}}$ | Large-$M$ AR in NCCL (ring re-takes the crown — see practice caveat) |
| DBT, non-pipelined ($P = 1$, §5.2) | $2\lceil \log_2 N \rceil \, \alpha$ | $\lceil \log_2 N \rceil \cdot \dfrac{M}{\mathrm{BW}}$ | Small-$M$ AR (α-bound regime); NCCL DBT path before pipelining kicks in |
| DBT, asymptotic ($P^*$, §5.2) | $2\lceil \log_2 N \rceil \, \alpha$ | $\dfrac{M}{\mathrm{BW}}$ | Small-to-medium $M$ AR in NCCL (default DBT path at bulk $M$) |
| Switch-ALU AR (INC) | $2 \cdot \alpha_\mathrm{switch}$ | $M/\mathrm{BW}$ | All $M$ where fabric exposes it; IB SHARP/SHARPv2, NVSwitch NVLS |
Reading the rows as a progression. At small $M$ the BW term is negligible no matter what multiplies it, so software picks the algorithm with the smallest $\alpha$ count — DBT’s $2\lceil \log_2 N \rceil$ beats ring’s $2(N{-}1)$, and NCCL runs DBT. At medium $M$ the BW term matters, and DBT pipelines to collapse its $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ coefficient down to the $M/\mathrm{BW}$ α-β floor; α stays at log depth, so DBT remains favored. Under the pure α-β model, DBT strictly dominates ring at every $M > 0$ — both are at log-depth or linear-depth α and DBT’s BW coefficient is half of ring’s. At large $M$, however — unlike Reduce (§4.3) where tree strictly dominates ring — AR exhibits a genuine ring-vs-DBT crossover in practice: ring’s $2(N{-}1)/N \to 2$ floor is attained intrinsically (no pipelining overhead to pay), while DBT’s $M/\mathrm{BW}$ is only a lower bound and the real-world coefficient inflates above it (see practice caveat below), so ring re-takes the crown once $M$ is large enough that its $O(N) \cdot \alpha$ latency amortizes into the pipeline.
INC sits alongside this progression as a hardware alternative rather than a fourth regime. Unlike BC (§3.3) and Reduce (§4.3) where INC’s entire win is on the α side, AR uniquely receives a BW-side lift: the switch-fused reduce+multicast makes each byte cross each endpoint link once in each direction concurrently rather than sequentially as software schedules do, pushing the BW coefficient from ring’s $2(N{-}1)/N \to 2\,M/\mathrm{BW}$ floor down to $M/\mathrm{BW}$. INC matches DBT’s α-β asymptote on the BW side — but hits it as a true hardware floor without the practice inflation ($c_{\mathrm{real}} \geq 1$) that DBT pays in software. See 04_in_network_collectives.md §1.4 for the effective-BW derivation. The α-side win is the same $N$-independent $2\,\alpha_\mathrm{switch}$ — dramatic relative to software’s $O(\log N)$ or $O(N)$ α hops.
No α-β crossover exists between ring and DBT. On the pure α-β model with DBT at its pipelined floor $M/\mathrm{BW}$, DBT strictly dominates ring at every $M > 0$: its latency term $2\lceil \log_2 N \rceil \alpha$ is strictly below ring’s $2(N{-}1)\alpha$ for all $N \geq 4$, and its BW coefficient $1$ is half of ring’s $2(N{-}1)/N \to 2$. The observed large-$M$ inversion where ring re-takes the crown is entirely an implementation-practice effect — see practice caveat below. Crossovers against the non-shipped rec-doub and Rabenseifner variants are in Appendix B.3.
Practice caveat — where ring re-takes the crown. NCCL’s tuner picks DBT for small-$M$ AR and ring for large-$M$ AR, and published benchmarks [DEMYST-NCCL] confirm this inversion. The $M/\mathrm{BW}$ floor from §5.2 is a lower bound only: tree-specific implementation overhead (finite pipeline depth $P \ll P^{*}$, per-step kernel complexity, CUDA launch and synchronization granularity, imperfect edge-by-edge overlap) pushes the real-world DBT BW coefficient to some $c_{\mathrm{real}} \geq 1$ above the floor, while ring’s $2(N{-}1)/N \to 2$ is attained intrinsically with no pipelining overhead to pay. Crossover in $M$ from equating the two:
$$M_*^{\,\mathrm{practice}} \;=\; \frac{2\,(N - 1 - \lceil \log_2 N \rceil)\,\alpha \cdot \mathrm{BW}}{c_{\mathrm{real}} - 2(N{-}1)/N}.$$
For $c_{\mathrm{real}} = 2$ (a representative setting where DBT’s per-byte cost roughly doubles the floor) at scale-up fabric parameters ($\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$ — NVLink-5 / NVSwitch class): $N = 72 \Rightarrow M_^{\,\mathrm{practice}} \approx 2\,\mathrm{GB}$; $N = 512 \Rightarrow M_^{\,\mathrm{practice}} \approx 116\,\mathrm{GB}$. For $c_{\mathrm{real}} \to 2(N{-}1)/N$ the crossover goes to infinity (DBT never loses); for $c_{\mathrm{real}} < 2(N{-}1)/N$ there is no real crossover and DBT dominates ring everywhere, matching the α-β result. The α-β-only reasoning remains the right intuition for small-to-medium $M$ where DBT’s lower $\alpha$ dominates; ring wins when per-step BW overhead is small enough that its $O(N) \cdot \alpha$ latency amortizes into the pipeline.
6. All-gather / reduce-scatter
AG and RS are the two halves of AR and also useful primitives in their own right. AG starts with each rank holding one $M/N$-byte chunk and ends with all ranks holding the concatenation of all $N$ chunks. RS is the dual: each rank starts with the full $M$-byte vector and ends holding one $M/N$ chunk reduced across all ranks. Both appear pervasively in sharded-parameter training (gradient RS into sharded optimizer state, weight AG before each forward pass), sequence-sharded activations, and any scheme that splits a tensor across ranks and needs to rematerialize or reduce it.
Like AR, both primitives have ring-based and tree-flavored implementations. Unlike AR, however, no tree-flavored AG / RS variant beats ring on α-β for scale-up. The two trees in the literature — recursive-doubling AG / recursive-halving RS (MPI menu, App. B.4) and Parallel Aggregated Trees (PAT, NCCL 2.23+, App. A) — both match ring’s $(N{-}1)/N$ BW coefficient without beating it, so their only edge is a $\lceil \log_2 N \rceil$ vs $N{-}1$ α term. That α edge only pays off when $N \cdot \alpha$ is non-negligible — the scale-out regime where $N$ = node count and inter-node α is in the μs range. On scale-up ($N \lesssim 72$, NVLink/NVSwitch α ≈ 0.5 μs), $(N{-}1)\alpha$ stays within tens of μs and ring wins on concrete practice grounds that the α-β model misses. So NCCL ships ring as the sole scale-up AG / RS path, and dispatches to PAT only for inter-node AG / RS at 1 rank per node. This section derives ring, explains why it wins the scale-up comparison despite its higher α, and calls out the scale-out PAT exception at the end. Rec-doubling / rec-halving stays in App. B.4 as MPI reference; PAT mechanics are in App. A.
Ring wiring. Ring-based AG is exactly Phase 2 of ring AR from §5.1; ring-based RS is exactly Phase 1. Same wiring, same forward direction — each rank concurrently sends one chunk right and receives one chunk from its left neighbor every step:
R0 ───→ R1 ───→ R2 ───→ R3
▲ │
└───────────────────────┘
AG forwards received chunks (overwrite on arrival). RS forwards an accumulating partial sum (add on arrival). As standalone collectives they each take $N-1$ steps:
$$t_{\mathrm{ring\,RS}} = t_{\mathrm{ring\,AG}} = (N-1)\,\alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
where $M$ is the per-rank final total volume — for AG, the size each rank ends with; for RS, the size each rank starts with before the reduce. The bandwidth term is exactly half of ring AR’s because you only do one of the two phases.
Asymptotic form (bandwidth-bound regime; intrinsically pipelined implementation). Ring AG / RS inherits ring AR’s property (§5.1): the closed form above is already the pipelined asymptote with $P = N$. The $M/N$ chunked payload bakes pipelining into the derivation; the $N{-}1$-step schedule is not a $P = 1$ baseline that gets pipelined later, it is the pipelined schedule. Three structural reasons no further collapse is possible, paralleling §5.1:
- No fill or drain. $N$ parallel chunk-pipelines run in lockstep — for AG each of the $N$ original chunks drives its own $N{-}1$-hop journey from its origin rank; for RS each output chunk accumulates along a symmetric $N{-}1$-hop path. Every link is busy from step 1 to step $N{-}1$; no fill phase to amortize.
- $P > N$ cannot improve the BW floor. The bottleneck neighbor-link carries $(N{-}1) \cdot M/N$ total bytes over the collective — matching the per-rank BW lower bound (each byte crosses each endpoint link $(N{-}1)/N$ times on average). Finer segmentation shrinks per-step payload but not total bytes per link; the BW coefficient $(N{-}1)/N$ is a true lower bound, not an asymptote approached in a limit.
- No $O(\sqrt{M})$ correction. The Appendix C master formula’s $\sqrt{M}$ correction originates in the $L{-}1$ idle steps while a single-source pipeline fills. Ring AG / RS has no single source and no fill, so that correction is absent — the closed form is exact under the conflict-free ring assumption, not an $\approx$-approximation:
$$t_{\mathrm{ring\,RS}} \;=\; t_{\mathrm{ring\,AG}} \;=\; (N{-}1)\,\alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
Exactly half of ring AR’s BW coefficient — consistent with AG / RS being one of AR’s two phases. This is the structural reason ring is BW-optimal for AG / RS without needing pipelining gymnastics, and the reason the tree-flavored alternatives (App. B.4, App. A) can only match this floor, not beat it — discussed next.
Why ring is the scale-up default. On α-β arithmetic the tree variants strictly dominate ring: both App. B.4 and App. A match ring’s $(N{-}1)/N$ BW coefficient with a $\lceil \log_2 N \rceil$ α vs ring’s $N{-}1$ α. Yet NCCL ships ring for every scale-up AG / RS configuration. The inversion is an implementation-practice effect — four concrete advantages the α-β model misses add up to more than the trees’ log-depth α edge at scales where $\alpha \cdot N$ is small:
- Pipeline feasibility. Ring’s per-round chunk is a constant $M/N$, so chunks stream through the ring at wire speed — the α term amortizes across the pipeline and vanishes at large $M$, exactly as derived above. Rec-doubling AG ships chunks of geometrically growing size $M/N, 2M/N, 4M/N, \ldots$; a pipelined schedule would need custom per-round chunk geometry, or drain the pipeline at every round boundary. Rec-halving RS has the mirror problem (chunks halving each round). PAT’s reversed-Bruck schedule is pipelineable but engineered around the inter-node fabric’s staging constraints, not the scale-up fabric’s.
- Non-power-of-2 support. Rec-doub / rec-halv strictly need $N = 2^k$ for clean bit-pairings. Ring handles any $N$, including the odd counts that fall out of pipeline-parallel stage assignments or irregular job configurations — NCCL routinely runs on $N = 6$, $N = 72$ (NVL72), and other non-power-of-2 counts.
- Kernel simplicity. Ring is a single communication loop with fixed chunk size; rec-doubling is $\lceil \log_2 N \rceil$ distinct rounds, each with a different partner offset and chunk size. Each additional kernel-launch boundary costs CUDA setup time that the α-β model treats as free.
- Compute overlap. The steady-chunk pattern of ring AG overlaps naturally with per-chunk compute (e.g., FSDP / ZeRO gather-then-compute). Rec-doubling’s round boundaries are global sync points that break overlap.
Also relevant: unlike AR, no hardware in-network primitive exists for AG / RS. SHARP / NVLS fuse the AR-specific simultaneous reduce+multicast in the switch ALU; neither pure concatenation (AG) nor pure per-chunk reduction (RS) receives that two-halves-at-once BW-side lift (see 04_in_network_collectives.md §2), so the scale-up choice is purely between software variants and ring wins.
MPI (MPICH, OpenMPI) does ship rec-doubling AG and rec-halving RS as algorithm options, typically dispatched by message-size threshold at runtime; NCCL’s single-algorithm-per-primitive default prioritizes the pipelineable ring case on scale-up.
Scale-out exception: PAT at 1 rank per node. PAT [NCCL-PAT] addresses the one regime where ring’s $(N{-}1)\alpha$ term becomes prohibitive — one rank per node communicating over the NIC / inter-node path, so $N$ is the node count rather than the intra-node GPU count. On scale-up ($\alpha \approx 0.5\,\mu$s, $N \lesssim 72$) $(N{-}1)\alpha$ stays in the tens of μs; on scale-out with inter-node α in the μs range and $N$ reaching the hundreds, the same term blows up to hundreds of μs. PAT reverses the Bruck offset schedule (largest hops first) and ships a bounded intermediate buffer so it composes with the inter-node path’s finite staging memory. Two constraints limit where PAT ships:
- Inter-node only, 1 rank per node. The NCCL 2.23 implementation restricts PAT to one rank per node because only the inter-node phase is implemented; intra-node AG / RS still runs ring. This matches the scale-out design intent — PAT’s log-depth structure pays off against the scale-out fabric’s α cost, not against on-node NVLink latency.
- Scale-up doesn’t benefit. On a scale-up NVLink / NVSwitch domain, ring’s α cost is already small and its BW-optimal pipelining keeps it at the BW floor. Replacing ring with PAT would trade pipeline-friendliness for a log-depth α schedule that doesn’t fit the scale-up port budget any better than rec-doubling does — the same partner-cycling objection from AR (Appendix B.3) applies.
The scale-out motivation and the specific “reversed Bruck + bounded buffer” mechanism are worked step-by-step at $N = 8$ in Appendix A.
7. All-to-all
A2A permutes data: rank $i$ holds $N$ chunks where chunk $j$ is destined for rank $j$; after the collective, rank $i$ holds $N$ chunks where chunk $j$ was contributed by rank $j$. The transpose pattern is a pure permutation — no summation, no reduction — so aggregate cross-fabric traffic is exactly $(N{-}1)M$ bytes (each of the $N$ ranks ships $(N{-}1)/N$ of its per-rank payload to other ranks). This makes A2A the most bandwidth-hungry of the primitives in this note: unlike AR, whose RS+AG decomposition compresses the BW term to $2(N{-}1)/N \cdot M/\mathrm{BW}$, A2A has no such decomposition to hide behind.
Worked example at $N=4$. Each rank $R_i$ starts with $N$ chunks $v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}$, where $v_{i,j}$ is the chunk that $R_i$ contributes to $R_j$. The transpose pattern arises in any permutation-shaped redistribution — MoE expert dispatch, distributed FFTs, matrix transposes, shuffle-in-parallel sorting — and the derivation below is agnostic to which.
Initial (each rank holds 4 chunks, one per destination):
R0: [v00 → R0 v01 → R1 v02 → R2 v03 → R3]
R1: [v10 → R0 v11 → R1 v12 → R2 v13 → R3]
R2: [v20 → R0 v21 → R1 v22 → R2 v23 → R3]
R3: [v30 → R0 v31 → R1 v32 → R2 v33 → R3]
After A2A, each rank holds the column of chunks destined for it:
Final state:
R0: [v00 v10 v20 v30] ← all chunks whose destination was R0
R1: [v01 v11 v21 v31]
R2: [v02 v12 v22 v32]
R3: [v03 v13 v23 v33]
The (source, destination) matrix is transposed along the diagonal. Two α-β-equivalent software schedules implement this transpose — §7.1 derives the ring relay (the natural fit on bisection-limited fabrics such as torus / hypercube / on-chip rings), §7.2 derives the pairwise direct-send schedule that NCCL ships on switched fabrics (fat-tree / Clos / NVSwitch), and §7.3 compares both against Bruck’s $O(\log N)$-latency alternative (derivation in Appendix B.5) and explains the NCCL selection rule.
7.1 Ring A2A
The ring relay runs A2A on the same wiring as ring AR’s Phase 1 / 2 from §5.1 — but with both directions of the full-duplex ring in use at every step. Shortest-arc routing sends each chunk the shorter way around the ring (right if the forward distance $d(i, j) = (j - i) \bmod N \le N/2$, left otherwise); intermediate ranks forward hop-by-hop when $d > 1$. We walk through $N = 4$.
Setup. Four ranks $R_0, R_1, R_2, R_3$ wired into a bi-directional ring — each rank concurrently drives both its right-edge and left-edge links every step:
┌──────────────────┐
▼ │
R0 ⇄ R1 ⇄ R2 ⇄ R3
│ ▲
└──────────────────┘
Each rank $R_i$ starts with 4 chunks $\{v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}\}$ (same initial layout as the §7 intro); target is to deliver each $v_{i,j}$ to $R_j$.
Per-rank schedule. Shortest-arc routing assigns each of $R_i$’s three foreign chunks to one direction: $v_{i, i+1}$ goes 1 hop right, $v_{i, i+2}$ goes 2 hops right (antipode tie, broken rightward), $v_{i, i-1}$ goes 1 hop left. Each rank ships 2 chunks rightward (one direct, one via a single relay hop) and 1 leftward over the $N{-}1 = 3$ steps:
| Step | Right-edge send | Left-edge send |
|---|---|---|
| 1 | inject $v_{i, i+1}$ (1 hop; arrives at $R_{i+1}$) | inject $v_{i, i-1}$ (1 hop; arrives at $R_{i-1}$) |
| 2 | inject $v_{i, i+2}$ (hop 1 of 2; parks at $R_{i+1}$) | idle |
| 3 | forward $v_{i-1, i+1}$ (received at step 2 from $R_{i-1}$; arrives at $R_{i+1}$) | idle |
Starting state (same as §7 intro):
R0: {v00, v01, v02, v03}
R1: {v10, v11, v12, v13}
R2: {v20, v21, v22, v23}
R3: {v30, v31, v32, v33}
After step 1 — every 1-hop chunk arrives. Each rank receives two chunks, one from each neighbor:
Right sends: R0→R1: v01 R1→R2: v12 R2→R3: v23 R3→R0: v30
Left sends: R0→R3: v03 R1→R0: v10 R2→R1: v21 R3→R2: v32
R0 holds: {v00, v10, v30} ← missing v20 (antipode, 2 hops away)
R1 holds: {v01, v11, v21 } ← missing v31
R2 holds: { v12, v22, v32} ← missing v02
R3 holds: {v03, v23, v33} ← missing v13
After step 2 — every rank injects its antipode (2-hop) chunk onto the right channel. The chunks park at the intermediate rank with one more hop to go. Left channel is idle — all leftward chunks already delivered at step 1.
Right sends: R0→R1: v02 R1→R2: v13 R2→R3: v20 R3→R0: v31
(left channel idle)
Mid-relay state — chunks parked at intermediate ranks:
v02 at R1 (destined R2) v13 at R2 (destined R3)
v20 at R3 (destined R0) v31 at R0 (destined R1)
(No ranks' "missing" set changes at step 2 — every received chunk is mid-relay.)
After step 3 — each parked chunk makes its final hop by being forwarded rightward:
Forwards: R0→R1: v31 R1→R2: v02 R2→R3: v13 R3→R0: v20
(left channel idle)
R0: {v00, v10, v20, v30} ✓
R1: {v01, v11, v21, v31} ✓
R2: {v02, v12, v22, v32} ✓
R3: {v03, v13, v23, v33} ✓
A2A complete. Each right edge carried 3 chunks of size $M/N$ each (one per step); each left edge carried 1 chunk at step 1 and was idle afterward. At larger $N$ the right/left split becomes more balanced (more chunks fall on the left half of the shortest-arc partition), but the busier direction always sets the BW bound.
Cost accounting. The bottleneck (right-edge) link carries $N{-}1$ chunks of size $M/N$ across the $N{-}1$ sequential steps; each step costs $\alpha + (M/N)/\mathrm{BW}$. Summing:
$$t_{\mathrm{ring\,A2A}} \;=\; (N{-}1)\,\alpha \;+\; \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
Asymptotic form (bandwidth-bound regime; intrinsically pipelined implementation). Like ring AR (§5.1) and ring AG / RS (§6), ring A2A’s closed form above is already the pipelined asymptote with $P = N$. The $M/N$ chunked payload bakes pipelining into the derivation; the $N{-}1$-step schedule is not a $P = 1$ baseline that gets pipelined later, it is the pipelined schedule. Three structural reasons no further collapse is possible:
- No fill or drain. Every rank starts with all $N$ of its chunks, so at step 1 every rank is already concurrently driving both its right-edge and left-edge links — steady state from step 1 to step $N{-}1$. No fill phase to amortize.
- $P > N$ cannot improve the BW floor. The bottleneck right-edge link carries $(N{-}1) \cdot M/N$ total bytes across the collective — matching the per-rank BW lower bound for A2A (each rank must ship $(N{-}1)/N$ of its $M$ bytes to other ranks). Finer segmentation shrinks per-step payload but not total bytes per link; the BW coefficient $(N{-}1)/N$ is a true lower bound, not an asymptote approached in a limit.
- No $O(\sqrt{M})$ correction. The Appendix C master formula’s $\sqrt{M}$ correction originates in the $L{-}1$ idle steps while a single-source pipeline fills. Ring A2A has no single source and no fill, so the closed form is exact under the conflict-free ring assumption, not an $\approx$-approximation.
The schedule lands naturally on bisection-limited fabrics — torus, hypercube, pure-ring / NVLink-ring interconnects — where not every rank pair has a direct fabric path, because each rank drives only its two adjacent links (well within any fabric’s port budget). On switched fabrics with full bisection, the alternative pairwise schedule (§7.2) matches this α-β cost without the intermediate-hop relay.
7.2 Pairwise direct-send A2A
Pairwise direct-send runs A2A as $N{-}1$ rounds of simultaneous send/receive between rank pairs — step $t$ uses partner offset $+t$, so each chunk crosses exactly one fabric hop (no intermediate relay). This requires a fabric where every rank pair has a direct path: fat-tree / Clos / NVSwitch.
Setup. Four ranks $R_0, R_1, R_2, R_3$ on a full-bisection fabric. At step $t \in \{1, 2, 3\}$ every rank concurrently sends to partner $(i+t) \bmod N$ and receives from partner $(i-t) \bmod N$ over the full-duplex link. Step 1 uses offset $+1$, step 2 offset $+2$, step 3 offset $+3$:
step 1 (offset +1) step 2 (offset +2) step 3 (offset +3)
R0 ──→ R1 R0 ──→ R2 R0 ──→ R3
R1 ──→ R2 R1 ──→ R3 R1 ──→ R0
R2 ──→ R3 R2 ──→ R0 R2 ──→ R1
R3 ──→ R0 R3 ──→ R1 R3 ──→ R2
(each step: one send + one receive per rank, full-duplex; every
concurrent send goes to a distinct switch port — no contention)
Each rank $R_i$ starts with 4 chunks $\{v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}\}$ (same initial layout as the §7 intro); target is to deliver each $v_{i,j}$ to $R_j$.
After step 1 (offset $+1$) — each rank ships its “right-neighbor” chunk and receives from its “left-neighbor”.
Sends: R0→R1: v01 R1→R2: v12 R2→R3: v23 R3→R0: v30
R0: {v00, v02, v03, v30} ← received v30 from R3
R1: {v01, v10, v11, v13} ← received v01 from R0
R2: { v12, v20, v21, v22} ← received v12 from R1
R3: { v23, v31, v32, v33} ← received v23 from R2
Each rank now holds 2 of its 4 final chunks (its own self-chunk plus one received).
After step 2 (offset $+2$) — each rank ships its “2-away” chunk directly to the antipode.
Sends: R0→R2: v02 R1→R3: v13 R2→R0: v20 R3→R1: v31
R0: {v00, v03, v20, v30} ← received v20 from R2
R1: {v01, v10, v11, v31} ← received v31 from R3
R2: { v02, v12, v21, v22} ← received v02 from R0
R3: { v13, v23, v32, v33} ← received v13 from R1
After step 3 (offset $+3$) — each rank ships its last remaining foreign chunk.
Sends: R0→R3: v03 R1→R0: v10 R2→R1: v21 R3→R2: v32
R0: {v00, v10, v20, v30} ← received v10 from R1
R1: {v01, v11, v21, v31} ← received v21 from R2
R2: {v02, v12, v22, v32} ← received v32 from R3
R3: {v03, v13, v23, v33} ← received v03 from R0
Every rank now holds its column of the transpose — A2A complete in $N{-}1 = 3$ steps.
Cost accounting. Each of the $N{-}1$ steps costs $\alpha + (M/N)/\mathrm{BW}$ (one handshake plus one $M/N$-sized chunk over the full-duplex link). The steps run sequentially (step $t+1$ cannot overlap step $t$’s send/receive on the same endpoint):
$$t_{\mathrm{pairwise\,A2A}} \;=\; (N{-}1)\,\alpha \;+\; \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
Identical α-β form to the ring relay (§7.1) — both schedules hit the $(N{-}1)/N \cdot M/\mathrm{BW}$ BW lower bound. The practical distinction is routing: pairwise requires full bisection (no chunk traverses more than one fabric hop), ring does not.
Asymptotic form (bandwidth-bound regime; already exact). The closed form above is exact under the full-bisection assumption — there is nothing to pipeline away. The three structural reasons parallel ring A2A (§7.1):
- No fill or drain. Every rank drives one send and one receive on a full-duplex link from step 1; the $N{-}1$ permutation rounds are steady state throughout, with no single-source pipeline to fill.
- Segmenting $M$ does not improve the BW floor. Each rank ships $(N{-}1)/N \cdot M$ bytes total across $N{-}1$ distinct partner rounds — the per-rank BW lower bound for A2A. Splitting each round’s $M/N$ chunk into finer sub-chunks replaces one handshake with several (more α, same bytes), and so cannot beat the floor.
- No $O(\sqrt{M})$ correction. No single source, no fill — the Appendix C pipelining correction is absent; the formula is exact under the conflict-free bisection assumption.
NCCL implementation. NCCL ships pairwise direct-send via its staggered P2P scheduler (scheduleP2pTasksToPlan), which offsets the per-rank partner order by rank index so step 1 routes $R_0 \to R_1, R_1 \to R_2, \ldots$ (adjacent offsets), step 2 routes offset $+2$, and so on — spreading concurrent sends across distinct switch-port pairs and avoiding per-step head-of-line blocking on any single port. The pairwise schedule is the NCCL default on scale-up NVSwitch fabrics and on switched scale-out fabrics (fat-tree / Clos); the ring relay surfaces only on bisection-limited topologies whose fabric physically lacks a direct path between every rank pair.
Workloads that run A2A back-to-back (e.g., MoE dispatch followed by a reverse A2A combine) double the total, giving $2(N{-}1)\alpha + 2(N{-}1)M/(N\,\mathrm{BW})$ for the round trip under either schedule.
7.3 Comparison and practical adoption
Two software forms sit on the table: ring relay (§7.1) and pairwise direct-send (§7.2). Unlike AR (§5.3) there is no hardware in-network option — switch primitives (SHARP / NVLS / Tomahawk Ultra INC) cannot accelerate A2A’s BW side, for structural reasons summarized at the end of this section. One MPI-era variant — Bruck — is omitted from the table because NCCL / RCCL do not ship it; its $O(\log N)$-latency / $O(\log N)$-BW-coefficient derivation and the single-algorithm-default reasoning that rules it out of GPU collectives live in Appendix B.5.
| Algorithm | α term | BW term | Regime / adoption |
|---|---|---|---|
| Ring relay (§7.1) | $(N{-}1)\,\alpha$ | $\dfrac{N-1}{N} \cdot \dfrac{M}{\mathrm{BW}}$ | Bisection-limited fabrics (torus, hypercube, NVLink-ring); $N$-hop relay lands naturally on fabrics without full bisection |
| Pairwise direct-send (§7.2) | $(N{-}1)\,\alpha$ | $\dfrac{N-1}{N} \cdot \dfrac{M}{\mathrm{BW}}$ | Full-bisection fabrics (fat-tree / Clos / NVSwitch). NCCL default |
Reading the rows as a progression. Ring relay and pairwise direct-send are α-β-equivalent — same $(N{-}1)\alpha$ latency, same BW-optimal $(N{-}1)/N$ coefficient — so there is no α-β crossover between them; the choice is dictated by the fabric’s bisection, not the message size. Pairwise needs every rank pair to have a direct path (no intermediate relay), so it runs on switched / full-bisection fabrics; ring ships the same α-β cost on bisection-limited topologies by using both directions of the ring and forwarding through intermediate ranks. The topology dimension matters more here than in §5.3 or §6 for one reason specific to A2A:
- A2A is the collective most likely to be bandwidth-bound on a bisection-constrained fabric. The aggregate $(N{-}1)M$ cross-fabric traffic has to pass through the bisection of whatever physical topology the $N$ ranks sit on. On topologies with limited bisection (torus), this bound turns A2A into the rate-limiting step for permutation-heavy workloads (MoE, distributed FFT, distributed sort) — and drives physical-layout decisions about which ranks sit inside which high-BW island. The topology discussion in
02_topology_mapping.mdmakes this quantitative.
Why NCCL ships pairwise (and not ring relay) on scale-up. Scale-up NVLink / NVSwitch fabrics offer full bisection, so both schedules are feasible — and NCCL picks pairwise. Three practical reasons the α-β model misses:
- Zero relay overhead. Ring relay requires intermediate ranks to forward chunks hop-by-hop (step 3 at $N=4$ forwards step-2 arrivals); every relay is an extra kernel entry and SM-side copy. Pairwise is zero-copy endpoint-to-endpoint — each chunk crosses one fabric hop via a single DMA.
- Switch port-spreading. Pairwise’s staggered partner offsets (
scheduleP2pTasksToPlan) route concurrent sends across distinct NVSwitch port pairs, so no switch port sees more than one outgoing chunk per step. Ring relay concentrates all traffic on each rank’s two ring neighbors; on NVSwitch that’s two fixed ports, and head-of-line blocking between concurrent flows on those ports slows the whole schedule. - Any $N$. Ring relay’s shortest-arc routing has a tie-break asymmetry at antipode (rightward for even $N$), which makes the BW split between left and right edges unequal at small $N$ (left-channel idle after step 1 at $N=4$); pairwise is symmetric at every $N$.
The ring relay remains the schedule of choice where the fabric forces it — pure-ring or torus topologies without full bisection — which is exactly its role in 02_topology_mapping.md §3.3.
Switch primitives don’t help A2A on the BW side. Unlike BC / AR / AG, A2A benefits from neither the switch-ALU reduction primitive nor switch-multicast replication: the $N(N{-}1)$ per-destination payloads are all distinct (no shared payload for the switch to multicast) and there is no reduction (no aggregation for the switch ALU to compute). The BW term is structurally fixed at $(N{-}1)/N \cdot M/\mathrm{BW}$ per rank regardless of switch capability. Hardware A2A primitives (Tomahawk Ultra INC, Rubin-generation NVSwitches) collapse the α side only — one switch-crossbar operation instead of $(N{-}1)$ endpoint-scheduled ones — not the BW side. See 04_in_network_collectives.md §1.1–§1.3 for the structural argument and per-primitive treatment.
8. Point-to-point hop
A single send/recv between two ranks — the degenerate $N=2$ case of any collective, shown for completeness in the same style as §§3–7.
Worked example. Sender $R_A$ holds payload $v$ (size $M$ bytes); receiver $R_B$ is empty. One step, one link, one $M$-byte transfer:
Initial:
R_A: {v}
R_B: { }
Step 1 — R_A sends v to R_B over the link:
R_A: {v} ← kept (send is non-destructive)
R_B: {v} ← received
Cost. One handshake plus $M$ bytes on a single link:
$$t_{\mathrm{P2P}} = \alpha + \frac{M}{\mathrm{BW}}$$
Trivial, but it shows up once per pipeline step: pipeline parallelism (PP) passes activations stage-by-stage as a chain of P2P hops. If the PP stage count is $P$, the per-step PP cost is $\alpha + M_{\mathrm{PP}} / \mathrm{BW}$ per hop, contributing to the stage-level comm time.
9. Mapping primitives to DP, TP, EP, SP, PP
Large-model distributed execution uses up to five orthogonal parallelism axes, each mapped to a collective primitive. Four of them (TP/SP/EP/PP) contribute cost inside every forward / decode step and dominate inference latency; DP (data parallel) adds a once-per-training-step gradient AR that is absent from pure inference but often the largest single collective in training.
| Parallelism | Regime | Collective | Algorithm | α term | BW term |
|---|---|---|---|---|---|
| DP | training only | AR | Ring (§5.1) | $2(N{-}1)\,\alpha$ | $2 \cdot (N{-}1)/N \cdot M/\mathrm{BW}$ |
| DP | training only | AR | DBT (§5.2) | $2\,\lceil \log_2 N \rceil \, \alpha$ | $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ |
| DP | training only | AR | hw INC — SHARP / NVLS (§5.3) | $2\,\alpha_\mathrm{switch}$ | $M/\mathrm{BW}$ |
| TP | train + infer | AR | Ring (§5.1) | $2(N{-}1)\,\alpha$ | $2 \cdot (N{-}1)/N \cdot M/\mathrm{BW}$ |
| TP | train + infer | AR | DBT (§5.2) | $2\,\lceil \log_2 N \rceil \, \alpha$ | $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ |
| TP | train + infer | AR | hw INC — SHARP / NVLS (§5.3) | $2\,\alpha_\mathrm{switch}$ | $M/\mathrm{BW}$ |
| SP | train + infer | AG | Ring (§6) | $(N{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ |
| EP | train + infer | A2A | Ring relay (§7.1, bisection-limited fabrics) | $(N{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ |
| EP | train + infer | A2A | Pairwise direct-send (§7.2, full-bisection fabrics) | $(N{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ |
| PP | train + infer | P2P | Single send/recv (§8) | $\alpha$ | $M/\mathrm{BW}$ |
DP is the only axis whose collective is training-exclusive: pure inference has no gradient to reduce, so no DP collective fires per decode step. TP / SP / EP / PP all contribute on every forward / decode step and therefore show up in inference latency as well as training iteration time. A single decoding step on a model sharded across TP/SP/EP/PP issues one AR per TP layer, one AG per SP layer, one A2A per MoE layer (dispatch + combine = two A2A), and one P2P per pipeline hop. Training adds a backward-pass AG / RS per layer plus a once-per-step DP gradient AR on the full gradient tensor (or once per gradient-accumulation window) — that DP AR runs the same ring / DBT / hw-INC path as TP but is typically the single largest collective in a training run because it ships every parameter’s gradient.
Vocabulary aside: why these mappings and not others?
- DP → AR (not A2A or AG): DP replicas each compute gradients on different mini-batches and must sum them to get the true batch gradient — a once-per-training-step AR that’s absent from pure inference. Same logic as TP’s partial-output reduction, just at a different granularity (whole-gradient vs per-layer partial).
- TP → AR (not AG): TP splits weight matrices across ranks; each rank computes a partial output of row-sharded attention / MLP projections that must be summed to produce the true output. RS+AG is the bandwidth-optimal way to do that sum.
- SP → AG (not AR): SP splits tokens across ranks; each rank holds distinct partial KV cache shards that other ranks need — no reduction, just gather.
- EP → A2A (not AR): MoE routing is a permutation — token $t$ belongs to whichever expert the router picked; it needs to physically move there. Dispatch and combine are both A2A.
- PP → P2P (not a collective): PP is a chain — the activation at stage $s$ only goes to stage $s+1$. No group operation, just one send/recv per hop.
Appendix A: Parallel Aggregated Trees (PAT) — scale-out AG / RS
PAT [NCCL-PAT] ships in NCCL starting with the 2.23 release and is the first NCCL-shipped collective algorithm whose designed-for operating point is the scale-out fabric (NICs, InfiniBand / RoCE) — as distinguished from the scale-up NVLink / NVSwitch island that ring and DBT already cover well. It implements AG and RS with $O(\log N)$ latency in the node count $N$ while matching ring’s bandwidth-optimal $(N{-}1)/N$ coefficient — strictly dominating ring on α-β at the scale-out regime where $N$ runs into the hundreds and per-node α is in the microseconds. The rest of this appendix motivates PAT against the §6 ring baseline, lays out the parallel-trees construction and why its specific partner order fits scale-out NIC pipelines, works through the schedule at $N = 8$, derives the cost, and closes with the NCCL dispatch rules. PAT gets its own appendix (rather than sitting alongside the non-mainline variants in Appendix B) because it is the one log-depth AG / RS variant NCCL has chosen to ship, and its design is worth the same “practical adoption” treatment the main text gives ring, DBT, and switch-ALU AR.
Why scale-out AG / RS needs a different algorithm. At scale-out the communicator $N$ is the node count, not the intra-island GPU count. Large training and inference jobs run with $N$ in the hundreds to thousands of nodes. Two properties flip relative to the intra-NVLink regime §6’s ring was designed for:
- $\alpha$ is much larger. Scale-out NIC $\alpha$ is in the microseconds (InfiniBand EDR/HDR ≈ 1–2 μs, plus a kernel-launch and proxy-thread floor on NCCL’s scale-out path of a few additional μs). Ring AG at $N = 512$ nodes is $(N{-}1)\alpha \approx 511 \times 2\,\mu$s $\approx 1$ ms of pure α — swamping most realistic per-rank $M/N$ payloads in bandwidth cost.
- Port budget is 1–2. Scale-out endpoints present 1–2 active NIC ports per rank (GPUDirect RDMA from one GPU typically drives one NIC, though 2–8 NICs per node exist). The partner-cycling schedule that kills rec-doubling on-node — needs $\lceil \log_2 N \rceil$ concurrent partners at steady state to pipeline, see Appendix B.3 — kills it even harder at scale-out.
Ring’s $(N{-}1)\alpha$ dominates the collective; partner-cycling schedules serialize on the 1–2 NIC budget; the algorithm PAT ships needs to combine log-depth latency with a bounded-port, bounded-buffer pipeline. This is the niche PAT was designed for.
The parallel-trees concept. PAT structures an $N$-node AG as $N$ independent binary-tree broadcasts, one per source chunk, advancing in lockstep. Chunk $c_i$ originates at rank $R_i$ and reaches all other ranks along its own tree of depth $\lceil \log_2 N \rceil$; each tree has the same shape, just rooted at a different rank. At round $r$, every tree advances by exactly one edge, so the whole set of $N$ trees covers all $N$ ranks in $\lceil \log_2 N \rceil$ rounds.
PAT at N=8 — the tree rooted at R0 (broadcasting c0).
Offsets are taken in reverse Bruck order: farthest first (4), then 2, then 1.
Round 1 (offset 4): R0 ──────────► R4
Round 2 (offset 2): R0 ──► R2 R4 ──► R6
Round 3 (offset 1): R0 → R1 R2 → R3 R4 → R5 R6 → R7
After 3 rounds: R0, R1, … R7 all hold c0.
All 8 trees (one rooted at each Ri, same shape) run concurrently — each link carries
exactly one chunk per round, which is the bounded-buffer property PAT needs.
Two properties fall out of the construction:
- Per-link payload per round is constant at $M/N$. Unlike rec-doubling AG (payload $M/N, 2M/N, 4M/N, \ldots$), PAT does not let the exchange buffer grow. The intermediate buffer required per rank is therefore bounded by $O(M/N)$, which matters because scale-out proxy threads stage data through finite registered-memory buffers and don’t have room for the full $M$.
- Partner order “farthest first” gets long-RTT transfers in flight early. PAT uses the Bruck-family offset sequence ($i \pm 2^k$ for $k = 0, \ldots, \lceil \log_2 N \rceil - 1$) in reverse order — at $N = 8$ the partner offset sequence is $4, 2, 1$ rather than Bruck’s $1, 2, 4$. On scale-out, the longest-logical-distance exchange (offset $N/2$) also corresponds to the physically most distant node pair (worst-case cross-fabric RTT). Scheduling it first means the long transfer overlaps with the later, shorter-offset transfers on the NIC’s DMA engines. Bruck’s “nearest first” would leave the long transfer on the critical path at the end, after the NIC has already drained its pipeline.
Schedule at $N = 8$. Each rank starts holding its own chunk $c_i$ of size $M/N = M/8$.
Round 1 (offset 4, partner i XOR 4):
Pairs: (R0,R4), (R1,R5), (R2,R6), (R3,R7)
Each pair exchanges its one chunk.
After: R0 has {c0,c4}; R1 has {c1,c5}; … ; R7 has {c3,c7}. [2 chunks each]
Round 2 (offset 2, partner i XOR 2):
Pairs: (R0,R2), (R1,R3), (R4,R6), (R5,R7)
Each pair exchanges one selected chunk of the "current owned" set
(the one the other side still needs from this pair's perspective).
After: each rank holds 4 chunks of the 8. [4 chunks each]
Round 3 (offset 1, partner i XOR 1):
Pairs: (R0,R1), (R2,R3), (R4,R5), (R6,R7)
Each pair exchanges one selected chunk.
After: every rank holds all 8 chunks. [8 chunks — complete]
The “one selected chunk” per round per pair is determined by the chunk-to-tree assignment: round $r$ advances, for every source $s$, exactly the one edge of $s$’s broadcast tree at depth $r$. With $N$ trees each of depth $\log_2 N$ running concurrently, the round advances all trees one level at once. The bounded-buffer property follows because a rank participates in exactly one tree-edge per round per direction.
Cost. $\lceil \log_2 N \rceil$ sequential rounds, each shipping $M/N$ bytes per link, yielding a per-round cost of $\alpha + (M/N)/\mathrm{BW}$. Total per-rank on-wire volume is $(N{-}1)\cdot M/N$ — the AG BW lower bound — accumulated across the $\lceil \log_2 N \rceil$ rounds. Crucially the bandwidth term is not $\lceil \log_2 N \rceil \cdot M/N \cdot \mathrm{BW}^{-1}$ (which would be log-depth × per-round payload summed naïvely) because on a full-duplex link each rank both sends and receives each round, and across the $\lceil \log_2 N \rceil$ rounds the total data each rank actually ships out is exactly $(N{-}1) \cdot M/N$ bytes — one per chunk it doesn’t originate, forwarded once on its outbound direction:
$$t_{\mathrm{PAT}} = \lceil \log_2 N \rceil \, \alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
RS is symmetric (reverse the schedule and replace forward-concat with sum-in). Compared to ring’s $(N{-}1)\alpha + (N{-}1)/N \cdot M/\mathrm{BW}$: same BW coefficient, $O(\log N)$ α vs ring’s $O(N)$ α — exactly the α reduction that makes sense of PAT’s shipping niche.
Practical adoption. The NCCL tuner dispatches PAT only when the scale-out operating conditions line up; everywhere else stays on ring. Two shipping constraints define the scope:
- Inter-node only, 1 rank per node. The NCCL 2.23 implementation restricts PAT to one rank per node because only the inter-node phase is implemented; intra-node AG / RS still runs ring. With $G > 1$ ranks per node, the intra-node ranks share a NIC (the scale-out port), and the tree-edge-per-round property degenerates — multiple tree-edges at the same round land on the same NIC and serialize, collapsing the log-depth advantage. The 2.23 release notes document this restriction directly; a future “multi-rank-per-node PAT” would need a hierarchical composition (intra-node ring / tree at the leaves, PAT between node leaders) that NCCL has not yet shipped.
- Scale-up doesn’t benefit. On a scale-up NVLink / NVSwitch domain ($N \leq 72$, $\alpha \approx 0.5\,\mu$s), ring AG’s $(N{-}1)\alpha$ is already in the tens of μs and ring’s pipelining keeps BW at the floor. Replacing ring with PAT would trade pipeline-friendliness for a log-depth α schedule without any α budget to recover on the NVLink side, and it would inherit a port-budget obstruction similar to rec-doubling’s (see Appendix B.3) because tree-edge-per-round still requires more than the 2 NVLink ring directions at intermediate tree levels. NCCL 2.23 explicitly routes intra-node traffic through ring.
The scope is therefore: PAT is the inter-node scale-out AG / RS algorithm at 1 rank per node; ring covers every other AG / RS configuration. Unlike the non-mainline variants in Appendix B — which never escape the MPI-menu status — PAT is the single non-ring AG / RS algorithm NCCL has chosen to ship, and the scale-out α budget is what made the engineering investment worth it.
Appendix B: Non-mainline AR / AG / RS / A2A variants
Four algorithms from the MPI literature and older HPC work appear prominently in MPICH / OpenMPI’s algorithm menus but are absent from the NCCL / RCCL shipping menu: simple recursive-doubling AR (B.1), Rabenseifner halving-doubling AR (B.2), recursive-doubling AG / recursive-halving RS (B.4), and Bruck A2A (B.5). Each has a clean pure α-β cost that looks competitive on paper — some strictly dominate the shipped algorithm — yet the port-budget, pipeline-compatibility, and single-default arguments rule them out on real GPU fabrics. B.1 and B.2 derive the two AR variants at $N = 4$; B.3 consolidates the AR-specific “why neither ships” analysis — numerical crossovers against DBT, partner-cycling as pipeline-kill, power-of-2 and topology-adversarial secondary factors — that §5.3 only gestures at. B.4 specializes the same schedule family to standalone AG / RS and carries the scale-up-ring-wins argument of §6 through explicitly; B.5 does the analogous job for A2A and explains why NCCL’s single-default picks pairwise rather than Bruck. Each subsection gives the full step-by-step trace and the cost formula cited by the main-text comparison.
B.1 Simple recursive-doubling AR
Simple recursive-doubling AR is the one-phase butterfly / hypercube sweep: at step $k \in \{1, \ldots, \lceil \log_2 N \rceil\}$, every rank $i$ exchanges its full current vector with partner $i \oplus 2^{k-1}$ and sums the received vector into its local copy. After $\lceil \log_2 N \rceil$ steps, every rank holds the $N$-way sum. We walk through $N=4$; same initial state as §5.1 (each $R_i$ holds $V_i = [v_{i,0}, v_{i,1}, v_{i,2}, v_{i,3}]$).
Step 1 (partner $i \oplus 1$, offset $2^0 = 1$). Pairs: $(R_0, R_1)$ and $(R_2, R_3)$. Each pair exchanges full vectors and sums.
After step 1 (2-way partial sums):
R0: [v00+v10 v01+v11 v02+v12 v03+v13]
R1: [v00+v10 v01+v11 v02+v12 v03+v13] ← identical to R0
R2: [v20+v30 v21+v31 v22+v32 v23+v33]
R3: [v20+v30 v21+v31 v22+v32 v23+v33] ← identical to R2
Step 2 (partner $i \oplus 2$, offset $2^1 = 2$). Pairs: $(R_0, R_2)$ and $(R_1, R_3)$. Each pair again exchanges full vectors and sums — but now the “full vector” is already a 2-way partial sum from step 1.
After step 2 (4-way full sums):
R0: [S0 S1 S2 S3]
R1: [S0 S1 S2 S3]
R2: [S0 S1 S2 S3]
R3: [S0 S1 S2 S3]
where S_k = v_{0,k} + v_{1,k} + v_{2,k} + v_{3,k}.
All four ranks hold the complete reduced vector after $\lceil \log_2 4 \rceil = 2$ sequential steps.
Cost. Each step moves the full $M$-byte vector across the active link (full-duplex, so both partners send concurrently over opposite directions; per-step cost is $\alpha + M/\mathrm{BW}$). With $\lceil \log_2 N \rceil$ sequential steps:
$$t_{\mathrm{rec\,doubling\,AR}} = \lceil \log_2 N \rceil \, \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}}$$
Strengths. Minimum latency term $\lceil \log_2 N \rceil \alpha$ — a lower bound for any reduction over a binary-combinable tree. Single-phase schedule (no separate RS / AG split), so the runtime is simpler than Rabenseifner’s.
Weakness. The BW coefficient grows as $\log_2 N$ rather than saturating at the ring-optimal $2(N-1)/N \to 2$. At $N = 512$, that’s $9 \cdot M/\mathrm{BW}$ versus ring’s $\approx 2 \cdot M/\mathrm{BW}$ — a $4.5\times$ BW penalty that kills it on large-$M$ AR. The §A.3 pipelining analysis further explains why this BW coefficient cannot be rescued by segmentation on a bounded-port fabric: each step targets a different partner, so a pipelined schedule would need $\lceil \log_2 N \rceil$ concurrent physical ports per rank, which no real scale-up or scale-out fabric provides.
B.2 Rabenseifner halving-doubling AR
Rabenseifner’s halving-doubling AR (RHD) [TRG05] recognizes the AR ≡ RS + AG identity from §2 and builds each phase from a chunk-exponential hypercube schedule: the RS phase halves the per-step payload, the AG phase doubles it, and together the two phases ship only $2(N-1)/N \cdot M$ bytes per rank — matching ring’s BW-optimal lower bound at only $2\lceil \log_2 N \rceil$ steps. We trace $N=4$ with the same $V_i$ as §5.1, chunked into 4 pieces.
Phase 1 — Reduce-scatter (recursive halving, $\lceil \log_2 N \rceil$ steps).
At step $k \in \{1, \ldots, \lceil \log_2 N \rceil\}$, rank $i$ pairs with partner $i \oplus 2^{k-1}$, sends the half of its chunks the partner will own at the end of this step, and receives (summing in) the complementary half. The per-step payload halves each round because the “chunks I still own” set halves.
Step 1 ($k=1$, partner $i \oplus 1$). Split the 4-chunk vector into lower-half $\{0, 1\}$ and upper-half $\{2, 3\}$. Pairs: $(R_0, R_1)$ → $R_0$ keeps lower, $R_1$ keeps upper. $(R_2, R_3)$ → $R_2$ keeps lower, $R_3$ keeps upper. Each rank sends $M/2$ bytes.
After RS step 1:
R0: [v00+v10 v01+v11 ? ? ] (owns {0,1}; sent {2,3} to R1)
R1: [ ? ? v02+v12 v03+v13 ] (owns {2,3})
R2: [v20+v30 v21+v31 ? ? ] (owns {0,1})
R3: [ ? ? v22+v32 v23+v33 ] (owns {2,3})
Step 2 ($k=2$, partner $i \oplus 2$). Each pair subdivides its half again. $(R_0, R_2)$ both own $\{0, 1\}$; $R_0$ keeps chunk $0$, $R_2$ keeps chunk $1$. $(R_1, R_3)$ both own $\{2, 3\}$; $R_1$ keeps chunk $3$, $R_3$ keeps chunk $2$. Each rank sends $M/4$ bytes.
After RS step 2 (full 4-way sums in each rank's one owned chunk):
R0: [S0 ? ? ? ] ← chunk 0 fully reduced
R1: [ ? ? ? S3 ] ← chunk 3 fully reduced
R2: [ ? S1 ? ? ] ← chunk 1 fully reduced
R3: [ ? ? S2 ? ] ← chunk 2 fully reduced
This is the RS end-state. Each rank owns exactly one fully-reduced $M/N$ chunk.
Phase 2 — All-gather (recursive doubling, $\lceil \log_2 N \rceil$ steps). The same partner schedule in reverse order: step 1 uses partner $i \oplus 2$ (last RS partner), step 2 uses partner $i \oplus 1$ (first RS partner). Each step doubles the chunks-owned set; payload grows from $M/N$ to $2M/N$, $\ldots$, to $M/2$ by the last step.
Step 1 ($k=1$ of AG, partner $i \oplus 2$). $R_0$ sends its $S_0$ to $R_2$; $R_2$ sends its $S_1$ to $R_0$. Pair $(R_1, R_3)$ similarly exchanges $\{S_3, S_2\}$. Each rank sends $M/4$.
After AG step 1:
R0: [S0 S1 ? ?]
R1: [ ? ? S2 S3]
R2: [S0 S1 ? ?]
R3: [ ? ? S2 S3]
Step 2 ($k=2$ of AG, partner $i \oplus 1$). $R_0$ sends its $\{S_0, S_1\}$ half to $R_1$; $R_1$ sends its $\{S_2, S_3\}$ half to $R_0$. Symmetric for $(R_2, R_3)$. Each rank sends $M/2$.
After AG step 2:
R0: [S0 S1 S2 S3]
R1: [S0 S1 S2 S3]
R2: [S0 S1 S2 S3]
R3: [S0 S1 S2 S3]
AR done in $2\lceil \log_2 4 \rceil = 4$ sequential steps.
Cost accounting. Per-step payloads follow a geometric schedule:
| Phase | Step | Payload | Per-step cost |
|---|---|---|---|
| RS | 1 | $M/2$ | $\alpha + (M/2)/\mathrm{BW}$ |
| RS | 2 | $M/4$ | $\alpha + (M/4)/\mathrm{BW}$ |
| AG | 1 | $M/4$ | $\alpha + (M/4)/\mathrm{BW}$ |
| AG | 2 | $M/2$ | $\alpha + (M/2)/\mathrm{BW}$ |
Total at $N=4$: $4\alpha + (M/2 + M/4 + M/4 + M/2)/\mathrm{BW} = 4\alpha + (3/2)\,M/\mathrm{BW}$. The bandwidth series is $M \cdot \sum_{k=1}^{\log_2 N} 2^{-k} = M \cdot (N-1)/N$ for one phase, doubled for the two-phase RS+AG. Generalizing:
$$t_{\mathrm{Rabenseifner\,AR}} = 2\lceil \log_2 N \rceil \, \alpha + 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
At $N=4$ this evaluates to $4\alpha + 1.5\,M/\mathrm{BW}$, matching the trace above.
Strengths. Bandwidth-optimal (same $2(N-1)/N$ BW floor as ring) at $O(\log N)$ latency instead of ring’s $O(N)$. Strictly dominates both ring (same BW, fewer α) and simple recursive doubling (same α order, better BW) on α-β arithmetic. This is the “best on paper” AR algorithm for power-of-2 $N$.
Weakness. Same partner-cycling failure mode as simple recursive doubling — each RS step and each AG step exchanges with a different partner, so pipelining requires $\lceil \log_2 N \rceil$ concurrent ports per rank. On bounded-port fabrics (2 ring directions on NVLink, 2–8 NICs off-node, small-digit scale-up switch uplinks), the pipeline serializes and the non-pipelined BW term of $2(N-1)/N \cdot M/\mathrm{BW}$ is the best it attains. DBT (§5.2), which starts from a worse $\log_2 N \cdot M/\mathrm{BW}$ non-pipelined BW but pipelines successfully on a 3-port-budget fabric, matches or beats Rabenseifner’s asymptote — which is why NCCL ships DBT and not Rabenseifner despite Rabenseifner’s more attractive paper cost. Also strictly needs $N = 2^k$; non-power-of-2 requires a reduction-first embedding step that breaks the clean halving geometry. §B.3 consolidates the partner-cycling argument and the numerical crossovers against the shipped algorithms.
B.3 Why neither AR variant is shipped
The per-subsection “Weakness” paragraphs of B.1 and B.2 each sketch why their variant doesn’t ship. This subsection consolidates those arguments — introducing the shared port-budget / partner-cycling mechanism and the numerical crossovers against shipped DBT / ring that §5.3 only gestures at — so they can be cited as one unit from the main text and from the AG / RS / A2A subsections that follow.
Simple recursive-doubling AR (B.1) and Rabenseifner halving-doubling AR (B.2) arrive at the §5.3 comparison with pure α-β costs that look competitive or better than DBT:
- Rec-doubling: $\lceil \log_2 N \rceil \, \alpha + \lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ — minimum possible latency term (half of DBT’s) at the price of a $\log_2 N$ BW coefficient.
- Rabenseifner: $2\lceil \log_2 N \rceil \, \alpha + 2(N{-}1)/N \cdot M/\mathrm{BW}$ — matches ring’s BW-optimal floor at log-depth latency; strictly dominates ring on α-β arithmetic at every $(N, M)$.
Yet NCCL / RCCL ship neither. The reason is a structural mismatch between their partner schedules and the port budget of every realistic fabric tier — the same obstruction Appendix C §C.4 describes for linear schedules in general, specialized here to the partner-cycling family.
The port-budget / partner-cycling failure mode. Both variants exchange with a different partner at every step: the hypercube schedule $i \oplus 2^{k-1}$ cycles through $\lceil \log_2 N \rceil$ distinct partners across the algorithm. Pipelining (Appendix C), which is the mechanism that collapses a log-depth schedule’s BW coefficient from $L$ down to $\sim 1$, requires the segments-in-flight at steady state to use physically disjoint links simultaneously — segment $s$ heading to partner $2^{k-1}$ on one port must not queue behind segment $s{+}1$ heading to partner $2^{k}$ on the same port. That needs $\lceil \log_2 N \rceil$ concurrent physical ports per rank. Real fabrics expose ~2 NVLink ring directions on-node, 2–8 NICs off-node, and small-single-digit switch uplinks at the scale-up tier. For $N = 512$ that is $\lceil \log_2 N \rceil = 9$ concurrent partners needed — past every budget. The pipeline serializes, and each variant is stuck at its non-pipelined BW coefficient.
DBT escapes this trap because its schedule targets at most ~3 concurrent partners per rank at peak — a tree node’s ≤ 2 children during reduce plus the sibling-tree leaf edge on the opposite direction — well inside every realistic fabric port budget. That is why DBT’s non-pipelined $\lceil \log_2 N \rceil \cdot M/\mathrm{BW}$ collapses under the Appendix C master formula to the $M/\mathrm{BW}$ α-β asymptote (§5.2), while rec-doubling and Rabenseifner cannot collapse theirs.
Numerical crossover: rec-doubling vs DBT. Rec-doubling’s non-pipelined cost vs DBT’s pipelined α-β asymptote:
$$\lceil \log_2 N \rceil \, \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{\mathrm{BW}} \;\;\text{vs}\;\; 2\lceil \log_2 N \rceil \, \alpha + \frac{M}{\mathrm{BW}}$$
Rec-doubling wins when $(\lceil \log_2 N \rceil - 1) \cdot M/\mathrm{BW} < \lceil \log_2 N \rceil \, \alpha$, i.e., at
$$M_* \;=\; \frac{\lceil \log_2 N \rceil}{\lceil \log_2 N \rceil - 1} \cdot \alpha \cdot \mathrm{BW} \;\;\longrightarrow\;\; M_* \to \alpha \cdot \mathrm{BW} \;\text{as}\; N \to \infty.$$
At $N = 512$, $\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$: $M_* = 9/8 \cdot 0.5\,\mu\mathrm{s} \cdot 900\,\mathrm{GB/s} \approx 506\,\mathrm{KB}$. Rec-doubling’s α edge only buys anything on sub-MB messages — a regime NCCL already serves via the LL / LL128 fast-path on ring, without adding a separate algorithm. Past the crossover the $\log_2 N$ BW coefficient means a $9\times$ BW penalty at $N = 512$ versus DBT’s near-1. (Practice inflation $c_{\mathrm{real}} \geq 1$ of DBT’s coefficient — see §5.3 practice caveat — only shifts $M_*$ slightly; the sub-MB conclusion is robust.)
Numerical crossover: Rabenseifner vs DBT. Same α term ($2\lceil \log_2 N \rceil \, \alpha$ on both sides), so the comparison is purely on the BW coefficient: Rabenseifner’s $2(N{-}1)/N \to 2$ vs DBT pipelined’s $1$. DBT wins at every $M > 0$ on α-β — the margin is a factor of $\sim 2$ at large $N$. Practice inflation $c_{\mathrm{real}} \geq 1$ of DBT’s coefficient (see §5.3 practice caveat) closes this margin somewhat but, for $c_{\mathrm{real}} < 2(N{-}1)/N$, keeps DBT ahead. Rabenseifner’s paper dominance over ring and over non-pipelined DBT never materializes for the shipping choice because the algorithm DBT is compared against is the pipelined DBT, which is what NCCL actually runs. Had Rabenseifner been able to pipeline, it would drop below $2(N{-}1)/N$ toward its own $M/\mathrm{BW}$ floor and the comparison would tighten further — but it cannot, for the port-budget reason above.
Compared to ring. Ring (§5.1) is intrinsically pipelined with $P = N$ and achieves its $2(N{-}1)/N \to 2$ BW floor for free, without any of the port-budget gymnastics. Rabenseifner matches ring’s BW floor with a log-depth α (an improvement), but without pipelining it only reaches that floor, not below. Rec-doubling’s $\log_2 N$ BW coefficient is worse than ring’s $\sim 2$ at every $N \geq 8$, so rec-doubling is ring-dominated at large $M$ and DBT-dominated at small $M$ — no regime where it is the right choice.
Two secondary factors reinforce the choice:
- Power-of-2 restriction. Both variants strictly require $N = 2^k$; non-power-of-2 rank counts need a reduction-first embedding step (pair off $2\lfloor N/2 \rfloor$ ranks into virtual pairs, run on $2^{\lfloor \log_2 N \rfloor}$, re-distribute) that breaks the clean halving geometry and adds 2 extra α hops plus an asymmetric BW term. NCCL routinely runs on $N = 6$, $N = 72$ (NVL72), and other non-power-of-2 counts.
- Topology-adversarial partner sequence. The $i \oplus 2^{k-1}$ partner at the largest $k$ crosses the communicator bisection on every fabric, whereas tree and ring schedules have clean dim-decomposed embeddings on torus / fat-tree that keep most traffic local (see
02_topology_mapping.md). Partner-cycling repeatedly pays the worst-case link latency that dim-decomposed schedules amortize.
Together: partner-cycling kills the pipeline, DBT’s pipelined asymptote dominates Rabenseifner on BW and rec-doubling on BW for all but sub-MB messages, the α edge rec-doubling carries on those sub-MB messages is already served by the LL fast-path on ring, and the non-power-of-2 / topology issues add further friction. NCCL / RCCL ship ring and DBT; rec-doubling and Rabenseifner stay in the MPI menu as reference points.
The port-budget / partner-cycling argument above generalizes beyond AR. The next two subsections specialize it: B.4 applies it to standalone AG / RS (where Rabenseifner’s two phases become two independent log-depth algorithms), and B.5 applies the single-default reasoning to the one A2A variant with log-depth α — Bruck’s — where pipelining is not even the central objection because no BW-optimal log-depth A2A exists for pipelining to rescue.
B.4 Recursive-doubling AG / recursive-halving RS
Recursive-doubling AG is exactly Phase 2 of Rabenseifner AR in B.2, run standalone: each rank starts with one $M/N$-byte chunk and ends holding the full concatenation. Recursive-halving RS is the dual (Phase 1 of Rabenseifner run standalone): each rank starts with the full $M$-byte vector and ends holding one $M/N$-byte reduced chunk. Both use the same hypercube partner schedule ($i \oplus 2^{k-1}$ at step $k$) with the payload doubling in AG and halving in RS across $\lceil \log_2 N \rceil$ steps.
AG trace at $N = 4$. Start: $R_i$ holds only its own chunk $c_i$ of size $M/4$.
Step 1 ($k = 1$, partner $i \oplus 1$). Pairs $(R_0, R_1)$ and $(R_2, R_3)$ each exchange their one chunk. After: each rank holds 2 chunks ($M/2$ total).
R0: [c0 c1 ? ?]
R1: [c0 c1 ? ?]
R2: [ ? ? c2 c3]
R3: [ ? ? c2 c3]
Step 2 ($k = 2$, partner $i \oplus 2$). Pairs $(R_0, R_2)$ and $(R_1, R_3)$ each exchange their 2-chunk block. After: every rank holds all 4 chunks.
R0: [c0 c1 c2 c3]
R1: [c0 c1 c2 c3]
R2: [c0 c1 c2 c3]
R3: [c0 c1 c2 c3]
AG done in $\lceil \log_2 4 \rceil = 2$ steps. RS is the mirror image: start from a full vector, run the same partner schedule in reverse order (step 1 partner $i \oplus 2$, step 2 partner $i \oplus 1$), and the payload halves each step so each rank ends owning one fully-reduced $M/N$ chunk.
Cost. The per-step payload follows a geometric schedule. For AG with $\lceil \log_2 N \rceil$ steps, step $k$ ships $2^{k-1} \cdot M/N$ bytes per rank per direction; total per-rank ship = $M/N \cdot \sum_{k=1}^{\log_2 N} 2^{k-1} = (N-1) \cdot M/N$ bytes — matching ring’s BW lower bound. RS is symmetric. Per-step cost $\alpha + 2^{k-1} M / (N \, \mathrm{BW})$; summing:
$$t_{\mathrm{rec\,doub\,AG}} = t_{\mathrm{rec\,halv\,RS}} = \lceil \log_2 N \rceil \, \alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
where $M$ is the per-rank final total volume (for AG) or initial total volume (for RS).
Strengths. BW-optimal (matches ring) at $O(\log N)$ latency instead of ring’s $O(N)$. On paper, strictly dominates ring AG / RS on α-β arithmetic for all $(N, M)$.
Weakness. The same partner-cycling issue from Rabenseifner AR applies: each step targets a different partner, so pipelining fails on bounded-port fabrics and the non-pipelined log-depth BW term of $(N-1)/N \cdot M/\mathrm{BW}$ is the best this algorithm attains. On a 2-port-budget fabric (NVLink pair, 2-NIC scale-out), pipelining ring ($P = N$ segments on the 2 neighbor links) already saturates the asymptotic floor; switching to rec-doubling gains the $\log_2 N \to N-1$ α reduction but at the cost of a pipeline-unfriendly schedule that can’t amortize that α into the BW path. Also, the geometric payload makes per-round chunk sizes irregular, which complicates overlap with compute. Needs $N = 2^k$. For all these reasons NCCL does not ship rec-doubling AG / rec-halving RS; MPI (MPICH, OpenMPI) ships it and dispatches by message-size threshold.
The one AG / RS variant NCCL does ship — Parallel Aggregated Trees (PAT) — solves this problem by keeping the per-round payload constant at $M/N$ (bounded buffer, pipeline-friendly) at the cost of a more involved chunk-to-tree assignment; see Appendix A for the full derivation and shipping rules. PAT and rec-doubling AG use the same partner-offset family but in opposite orders: rec-doubling takes offsets $1, 2, 4, \ldots$ with exponentially growing per-round buffers; PAT takes $N/2, \ldots, 2, 1$ with a constant $M/N$ per-round buffer. The difference is precisely what moves PAT into NCCL’s shipping menu and keeps rec-doubling out.
B.5 Bruck A2A
For A2A, the log-depth partner-cycling variant is Bruck’s algorithm [BHKUW97]. The partner-cycling pipeline-kill argument from B.3 applies here too, but it is not the central objection: A2A has no natural BW-optimal log-depth algorithm for pipelining to rescue — the inflated BW coefficient is intrinsic to the bit-butterfly schedule, not a pipelining artifact. What rules Bruck out of NCCL is the single-algorithm-per-primitive default, which picks pairwise (§7.2) for its large-$M$ BW optimality and accepts the larger α term at small $M$ rather than shipping a second algorithm with a runtime dispatch threshold.
Bruck’s A2A achieves $\lceil \log_2 N \rceil$ rounds — vs pairwise’s $N-1$ — by running a butterfly / hypercube pattern on the bits of the destination index, combined with a pre-rotation and a post-rotation of the local buffer that together turn “send chunk to rank $j$” into “send the slots of my buffer whose current index has bit $k$ set to partner $i + 2^k$”. The cost is a $\log_2 N / 2 \cdot M/\mathrm{BW}$ BW term — inflated relative to pairwise’s $(N-1)/N \cdot M/\mathrm{BW}$ — plus $O(M)$ local rotation work that’s not on the wire. We trace the schedule at $N = 4$ and give the cost.
Pre-rotation. Before any on-wire communication, rank $i$ cyclically rotates its send buffer left by $i$ positions (i.e., the chunk originally at slot $j$ moves to slot $(j - i) \bmod N$). The purpose is to align the “bit-$k$-of-destination-offset” pattern across ranks so that a single partner offset $+2^k$ per round suffices.
Initial (slots indexed by destination): After pre-rotation (rank i shifts left by i):
R0: [v00 v01 v02 v03] R0: [v00 v01 v02 v03] (i=0: no shift)
R1: [v10 v11 v12 v13] R1: [v11 v12 v13 v10] (i=1: shift left by 1)
R2: [v20 v21 v22 v23] R2: [v22 v23 v20 v21] (i=2: shift left by 2)
R3: [v30 v31 v32 v33] R3: [v33 v30 v31 v32] (i=3: shift left by 3)
Communication rounds ($\lceil \log_2 N \rceil = 2$ at $N=4$). At round $k \in \{0, 1, \ldots, \lceil \log_2 N \rceil - 1\}$, rank $i$ sends to partner $(i + 2^k) \bmod N$ and receives from $(i - 2^k) \bmod N$. The chunks sent at round $k$ are the slots of the current buffer whose index has bit $k$ set ($N/2$ slots, total $M/2$ bytes per round); the chunks at the other $N/2$ slots stay put. The partner’s corresponding slots overwrite the local ones on receive.
Round 0 (partner $i+1 \bmod 4$, send slots with bit 0 set — i.e., slots 1 and 3, $M/2$ bytes).
Sends (slots 1 and 3 of current buffer):
R0 → R1: [v01, v03] R1 → R2: [v12, v10]
R2 → R3: [v23, v21] R3 → R0: [v30, v32]
After round 0 (slots 1 and 3 overwritten by incoming):
R0: [v00 v30 v02 v32] (slots 1,3 from R3)
R1: [v11 v01 v13 v03] (slots 1,3 from R0)
R2: [v22 v12 v20 v10] (slots 1,3 from R1)
R3: [v33 v23 v31 v21] (slots 1,3 from R2)
Round 1 (partner $i+2 \bmod 4$, send slots with bit 1 set — i.e., slots 2 and 3, $M/2$ bytes).
Sends (slots 2 and 3 of current buffer):
R0 → R2: [v02, v32] R1 → R3: [v13, v03]
R2 → R0: [v20, v10] R3 → R1: [v31, v21]
After round 1 (slots 2 and 3 overwritten by incoming):
R0: [v00 v30 v20 v10] (slots 2,3 from R2)
R1: [v11 v01 v31 v21] (slots 2,3 from R3)
R2: [v22 v12 v02 v32] (slots 2,3 from R0)
R3: [v33 v23 v13 v03] (slots 2,3 from R1)
Post-rotation. Each rank $i$ cyclically rotates its buffer right by $i$ positions and optionally reverses the order (implementation detail; the net effect is to permute the received chunks into MPI / NCCL A2A output layout). After post-rotation:
Final (MPI/NCCL A2A output layout: slot j holds the chunk sent from R_j):
R0: [v00 v10 v20 v30]
R1: [v01 v11 v21 v31]
R2: [v02 v12 v22 v32]
R3: [v03 v13 v23 v33]
Matches the expected A2A end-state in §7.
Cost. The pre- and post-rotations are local memory copies (not on-wire; they do consume memory bandwidth and an intermediate buffer of size $M$, but in the α-β model we count on-wire time only). The $\lceil \log_2 N \rceil$ rounds each ship $M/2$ bytes per rank per direction, giving
$$t_{\mathrm{Bruck\,A2A}} = \lceil \log_2 N \rceil \, \alpha + \lceil \log_2 N \rceil \cdot \frac{M}{2\,\mathrm{BW}}$$
At $N = 4$: $2\alpha + M/\mathrm{BW}$, compared to pairwise’s $3\alpha + (3/4)M/\mathrm{BW}$ at the same $N$. Bruck wins on α; pairwise wins on BW — the crossover is at $M^* \sim \alpha \cdot \mathrm{BW}$, computed in §7.3.
Strengths. Minimum-latency A2A (log-depth). The BW term is $\log_2 N / 2 \cdot M/\mathrm{BW}$ rather than pairwise’s $(N-1)/N \cdot M/\mathrm{BW}$, so Bruck wins at small $M$ where the α term dominates.
Weakness. BW term scales with $\log_2 N$ rather than saturating at the lower bound; above the $M^*$ crossover, pairwise is strictly faster. The pre/post rotations require an intermediate buffer of $O(M)$ per rank, plus $O(M)$ local-memory copies that are free in the α-β model but non-trivial in practice. The bit-test routing is cleanest at $N = 2^k$; non-power-of-2 needs padding or a modified final round. Partner cycling across the $\log_2 N$ rounds is the same bounded-port objection that killed rec-doubling AR, though A2A is less bothered by it because there is no natural bandwidth-optimal log-depth A2A for pipelining to rescue — the inflated BW coefficient is fundamental, not a pipelining artifact. NCCL’s single-algorithm policy picks pairwise for its large-$M$ optimality; MPI ships both Bruck and pairwise and dispatches by message-size threshold.
Appendix C: Asymptotic form of linear schedules (via pipelining)
The collectives in §3 (binomial tree BC), §4 (binomial tree Reduce), and §5 (ring AR and double binary tree AR) all cite an “asymptotic form” $L\alpha + M/\mathrm{BW}$ that collapses the otherwise-$L$-scaled bandwidth term to the single-link floor. This appendix separates two things that the main text glues together:
- Pipelining is the implementation mechanism: chunk the message into $P$ sub-segments and stream them through the $L$-step schedule like parts on an assembly line. The finite-$P$ cost is the master formula $t_\mathrm{pipe}(P) = (L + P - 1)(\alpha + M/(P\,\mathrm{BW}))$ — exact, not asymptotic.
- The asymptotic form is the cost floor the implementation reaches in the bandwidth-bound limit $M \to \infty$ at optimal segmentation $P^*$. The two terms $L\alpha$ and $M/\mathrm{BW}$ are separate floors — each the minimum its coefficient can attain — and the sum is approached up to an $O(\sqrt{M})$ correction that vanishes relative to $M/\mathrm{BW}$.
C.1–C.4 derive the finite-$P$ master formula, its optimum $P^*$, the $\sqrt{M}$ correction between exact and asymptotic, and the port-budget caveat that licenses the collapse on real fabrics. C.5 lists how each primitive instantiates $L(N)$. Everything here applies identically to any schedule with a linear dependency chain — so future variants (larger $k$-ary trees, alternate butterfly orderings) inherit the same formulas.
C.1 Why pipeline?
Start with the non-pipelined schedule: $L$ sequential steps, each shipping the full $M$ bytes. Total cost $L\alpha + L\,M/\mathrm{BW}$. The ugly part is the second term — the bandwidth coefficient scales linearly with $L$. The inefficiency is easy to see: while step $k$ is busy pushing bytes, steps $1, \ldots, k{-}1$ sit idle (they finished their part of the transfer several slots ago, but there’s nothing queued for them to do). If we could keep every step busy at once, the whole algorithm would finish in roughly the time of one step’s work regardless of $L$. Pipelining is the trick that makes that happen.
C.2 The conveyor idea and space-time diagram
Cut the message $M$ into $P$ equal sub-segments of size $M/P$ and feed them through the schedule one after another, like parts on an assembly line. Segment 1 enters step 1 at time slot 1, moves to step 2 at slot 2, then to step 3 at slot 3, and so on. Segment 2 enters step 1 at slot 2 (one slot behind segment 1) and trails it through. By the time segment $P$ has entered the pipeline, segments $1, \ldots, P{-}1$ are already distributed across the later steps — and each slot now advances every segment by one step simultaneously.
Rows = steps in the schedule, columns = time slots. Each slot costs exactly $\alpha + (M/P)/\mathrm{BW}$ (one handshake plus one sub-segment’s worth of transfer). Cell $(k, t)$ = “segment at step $k$ during slot $t$”. Small example with $L=3$ steps and $P=3$ segments:
slot 1 slot 2 slot 3 slot 4 slot 5
step 1: seg1 seg2 seg3 · ·
step 2: · seg1 seg2 seg3 ·
step 3: · · seg1 seg2 seg3
└──── fill ────┘ steady └──── drain ────┘
(P-1 = 2) (1 slot) (L-1 = 2)
Three phases:
- Fill (slots 1 to $P{-}1$) — the pipeline is warming up. Step 1 has work from slot 1; step 2 doesn’t start until slot 2; step $L$ doesn’t start until slot $L$.
- Steady state (slots where all $L$ steps are busy simultaneously). In the picture: slot 3 has $\text{seg3}$ at step 1, $\text{seg2}$ at step 2, $\text{seg1}$ at step 3 — every step advancing one segment per slot. This is the regime where the pipeline pays for itself.
- Drain (slots $P+1$ to $L+P{-}1$) — segments that entered late are still finishing. Step 1 has nothing left to do.
Total slot count: $L + P - 1$. The assembly-line identity: the last segment (segment $P$) enters step 1 at slot $P$ and needs $L - 1$ additional slots to advance through the remaining steps, exiting at slot $P + (L-1) = L + P - 1$. In the diagram: $3 + 3 - 1 = 5$ slots. Matches.
C.3 Master formula and optimal segmentation
Parameters at a glance. $L$ is the schedule’s sequential-step depth (fixed by the algorithm once $N$ is chosen); $P$ is the segment count (a free knob the scheduler picks, then optimizes below). $L$ and $P$ are independent — the mapping $L(N)$ is schedule-specific:
| Schedule | $L$ | source |
|---|---|---|
| Ring BC (§3.1) | $N - 1$ | hops from source to tail rank |
| Ring Reduce (§4.1) | $N - 1$ | hops from source rank to root along chain |
| Binomial tree BC / Reduce (§3.2, §4.2) | $\lceil \log_2 N \rceil$ | tree depth |
| Ring AR (§5.1) | $2(N - 1)$ | RS pass + AG pass |
| DBT AR (§5.2) | $2 \lceil \log_2 N \rceil$ | reduce pass + broadcast pass |
The derivation below treats $L$ as given; each primitive substitutes its own $L(N)$ at the end. The full instantiation table (including per-rank port requirements) is in C.5.
Each slot costs $\alpha + (M/P)/\mathrm{BW}$; there are $L + P - 1$ of them:
$$t_{\mathrm{pipe}}(P) = (L + P - 1)\left(\alpha + \frac{M/P}{\mathrm{BW}}\right)$$
To see where the $L$ factor goes, expand the bandwidth piece. Using $(L{+}P{-}1)/P = 1 + (L{-}1)/P$:
$$t_{\mathrm{pipe}}(P) \;=\; \underbrace{(L + P - 1)\,\alpha}{\substack{\text{hop count}\\ L\alpha\,+\,(P-1)\alpha\\ \text{extra handshakes for fill/drain}}} \;+\; \underbrace{\frac{M}{\mathrm{BW}}}{\substack{\text{steady-state BW}\\ \text{(saturates at 1}\cdot M/\mathrm{BW}\text{)}}} \;+\; \underbrace{\frac{(L - 1)\,M}{P\,\mathrm{BW}}}_{\substack{\text{fill+drain overhead}\\ \text{(shrinks like }1/P\text{)}}}$$
Three things to notice:
- The $L$-dependence has left the dominant BW term. Non-pipelined BW was $L \cdot M/\mathrm{BW}$; here the $L$ survives only inside the fill/drain piece, which vanishes as $P \to \infty$.
- The saturated BW floor is $M/\mathrm{BW}$ — the volume a single rank-port must push regardless of $P$ or $L$. You can’t go below that without using more ports.
- $P$ can’t grow arbitrarily large, because the hop-count term picks up $(P{-}1)\alpha$ extra handshakes. More segments = more slot boundaries = more $\alpha$.
Sanity check with numbers. Set $L = 3$.
- Non-pipelined ($P = 1$): cost $= 3\alpha + 3M/\mathrm{BW}$. BW coefficient 3.
- Pipelined with $P = 3$: $5\alpha + M/\mathrm{BW} + (2/3)M/\mathrm{BW} = 5\alpha + 1.67\,M/\mathrm{BW}$. BW coefficient dropped from 3 to 1.67 at the cost of 2 extra handshakes.
- Pipelined with $P = 10$: $12\alpha + M/\mathrm{BW} + 0.2\,M/\mathrm{BW} = 12\alpha + 1.2\,M/\mathrm{BW}$. Within 20 % of the asymptotic floor, at the cost of 9 extra handshakes.
- $P \to \infty$: cost $\to \infty \cdot \alpha + M/\mathrm{BW}$. Asymptotic BW is achieved, but latency has exploded — clearly past the optimum.
Optimal segmentation. Minimize by differentiating with respect to $P$ (continuous approximation):
$$\frac{dt_{\mathrm{pipe}}}{dP} \;=\; \alpha \;-\; \frac{(L - 1)\,M}{P^2\,\mathrm{BW}} \;=\; 0 \quad\Longrightarrow\quad P^{*} \;=\; \sqrt{\frac{(L - 1)\,M}{\alpha\,\mathrm{BW}}}$$
Substituting back:
$$t_{\mathrm{pipe}}(P^{*}) \;=\; L\,\alpha \;+\; \frac{M}{\mathrm{BW}} \;+\; 2\sqrt{\frac{(L - 1)\,\alpha\,M}{\mathrm{BW}}}$$
The square-root correction grows as $\sqrt{M}$ — much slower than the original $LM/\mathrm{BW}$ or the floor $M/\mathrm{BW}$. The punch line:
In the bandwidth-bound regime, pipelining collapses the BW coefficient from $L$ down to (essentially) 1, at a modest $O(\sqrt{M})$ latency-term surcharge. The base $L\alpha$ hop-count cost is untouched.
C.4 Port-budget caveat
The diagram above is a logical picture — “row $k$” is “step $k$ of the schedule”, and each cell shows which segment is being logically processed there. To turn this into real wall-clock time, every cell in the steady-state column must correspond to a physical transfer on a distinct link. If multiple cells in the same column try to drive the same physical port on the same rank, only one of them actually runs; the others queue, and the pipeline collapses partway back to serial.
Counting concurrency at a single rank $R$: during steady state, $R$ is simultaneously participating in some number of steps — at most $L$, and typically the distinct partners across the schedule’s steps. Each concurrent step corresponds to one edge out of $R$, so $R$ needs that many concurrent physical ports to sustain the pipeline fully. “Port” here is deliberately generic: whichever level of the fabric the collective traverses, the budget is set by the number of distinct physical links out of $R$ at that level. Concretely, the tight budget can be any of:
- Scale-up on-package or on-node: NVLink lanes / ring-direction channels out of a GPU (typically 2 ring directions or 4–18 NVLink channels per chip), on-package chiplet-to-chiplet interconnect ports, or PCIe lanes to a host switch — the scale-up fabric that stays inside a high-bandwidth island.
- Scale-up switch ports: the upstream ports a single endpoint presents into an NVSwitch / UALink / proprietary scale-up switch tier. A GPU typically connects with a handful of links into that switch, and the switch itself has a fixed radix.
- Scale-out off-node: NIC count per GPU or per node (typically 2–8 NICs on modern training nodes), which sets the per-rank port budget for inter-node collectives.
- Mesh / torus fabrics: on a $k$-dimensional torus each rank has exactly $2k$ neighbor ports (2 for a 1-D ring, 4 for 2-D, 6 for 3-D, etc.), fixed by the topology rather than by the endpoint. This is the tightest and least flexible port budget of any fabric family: an algorithm needing $\log_2 N$ concurrent partners has no hope on a 3-D torus, and even ring-family algorithms have to be carefully dimension-decomposed to stay inside the budget.
Whatever level applies, the number is $O(1)$-to-single-digit, not $O(\log N)$. Algorithms whose steady-state concurrency exceeds that number serialize on the oversubscribed port regardless of whether the oversubscription is inside a scale-up switch, on a torus neighbor link, on the scale-out NIC, or across on-package links. Torus-specific consequences — dimension-decomposed ring AR on torus, and the full mapping from the algorithms here to each topology — are worked through in 02_topology_mapping.md.
Case A — schedule's per-rank port requirement fits the budget:
slot 3 (steady state at rank R):
step 1 edge ─────▶ port A carrying seg3
step 2 edge ─────▶ port B carrying seg2
step 3 edge ─────▶ port C carrying seg1
└── 3 distinct links, 3 segments in parallel ✓
conveyor runs as drawn
Case B — schedule's per-rank port requirement exceeds the budget:
slot 3 (steady state at rank R, suppose only port A exists):
step 1 edge ───┐
step 2 edge ───┼───▶ port A ← only one gets served per slot
step 3 edge ───┘ other two queue behind it
└── pipeline collapses toward serial L·(α + M/BW)
The guiding question for any schedule is: does the per-rank steady-state concurrency fit the fabric’s port budget at every tier the collective traverses? If yes, the pipelined formula from C.3 holds; if no, the schedule serializes back to (or toward) its non-pipelined cost.
C.5 Instantiation across primitives
The algorithms elsewhere in this note invoke the above construction with the following parameters. “Steady-state concurrency” is the distinct-port count required per rank when the pipeline is full; “fits typical budget?” is the answer under a 2–8 port fabric tier (either 2 NVLink ring directions, 2–8 NICs, or the 3+ switch uplinks of a modern NVSwitch).
| Primitive | $L$ | Per-rank steady-state concurrency | Fits typical budget? | Pipelined BW term |
|---|---|---|---|---|
| Ring BC (§3.1) | $N-1$ | 2 neighbor links (in + out) | yes | $M/\mathrm{BW}$ |
| Ring Reduce (§4.1) | $N-1$ | 2 neighbor links (in + out) | yes | $M/\mathrm{BW}$ |
| Binomial tree BC (§3.2) | $\lceil\log_2 N\rceil$ | tree parent + up to $\lceil\log_2 N\rceil$ children at root; $\leq 2$ at interior | yes at interior; root fan-out needs $\lceil\log_2 N\rceil$ ports for full pipeline | $M/\mathrm{BW}$ interior-limited; root bound by port count |
| Binomial tree Reduce (§4.2) | $\lceil\log_2 N\rceil$ | same as BC (time-reverse dual) | same | $M/\mathrm{BW}$ |
| Ring AR (§5.1) | $2(N-1)$ | 2 neighbor links throughout | yes — intrinsically pipelined with $P = N$ | $2(N-1)/N \cdot M/\mathrm{BW}$ |
| Double binary tree AR (§5.2) | $2\lceil\log_2 N\rceil$ | $\leq 3$ (interior in one tree + leaf in sibling) | yes on any 3+ port tier | near $2\,M/\mathrm{BW}$ in practice |
| Plain recursive doubling AR (App. B.1) | $\lceil\log_2 N\rceil$ | $\lceil\log_2 N\rceil$ distinct partners | no at $N = 512$ (9 partners vs 2–8 ports) | does not improve — collapses to $\log_2 N \cdot M/\mathrm{BW}$ |
| Rabenseifner AR (App. B.2) | $2\lceil\log_2 N\rceil$ | $\lceil\log_2 N\rceil$ distinct partners per phase | no at typical $N$ | does not improve — collapses to $2(N-1)/N \cdot M/\mathrm{BW}$ |
Two observations:
- Tree BC/Reduce (§3.2 / §4.2) technically need $\lceil\log_2 N\rceil$ concurrent ports at the root to fully pipeline the initial fan-out (or final fan-in). In practice NCCL’s pipelined tree implementation sustains a modest $P$ that keeps the bandwidth close to $M/\mathrm{BW}$ at mid-to-large $M$ without saturating the root’s port count; the asymptotic floor is $M/\mathrm{BW}$ and the root is the tightest bottleneck.
- AR on partner-cycling schedules (plain recursive doubling and Rabenseifner) is where the port-budget caveat bites hardest. Their table-form BW terms assume a pipeline that can’t actually overlap on real fabrics, which is the asymmetry Appendix B.3 expands on and the reason NCCL ships ring + double binary tree instead.
Appendix D: algbw vs busbw — the NCCL-tests vocabulary
NCCL-tests [NCCL-TESTS] and most collective-benchmark workflows report two bandwidth metrics: algbw (algorithm bandwidth) and busbw (bus bandwidth). Both derive from the same measured wall-clock time $t$ for a collective of size $M$, but they normalize differently — and picking the right one for a given question is what makes NCCL-tests output interpretable.
algbw — application-facing throughput. $M$ bytes delivered in time $t$, treating the collective as an opaque data transfer:
$$\mathrm{algbw} = \frac{M}{t}$$
This is the direct input to end-to-end latency budgets: a user issuing “reduce 1 GB of gradients in 10 ms” sees algbw = 100 GB/s, and that’s exactly what a step-time model should plug into its communication-wall-clock term.
busbw — link-level throughput. The $M$-byte-equivalent traffic that flowed on any single endpoint’s link during the collective. It multiplies algbw by a primitive-dependent correction factor that accounts for how many bytes each endpoint actually moved per unit of delivered work:
$$\mathrm{busbw} = \mathrm{algbw} \cdot C_{\mathrm{primitive}}(N)$$
where $C_{\mathrm{primitive}}(N)$ is the algorithm’s $n_\beta$ coefficient (§1) — the per-rank byte-move count normalized by $M$. Per-primitive factors per the NCCL-tests convention [NCCL-TESTS]:
| Primitive | $C(N)$ | Rationale |
|---|---|---|
| AR | $2(N-1)/N$ | RS + AG both move $(N-1)/N \cdot M$ per rank |
| AG | $(N-1)/N$ | each rank forwards the other $N-1$ ranks’ slices |
| RS | $(N-1)/N$ | each rank sends its non-own slices |
| BC | $(N-1)/N$ | root fans out, leaves receive once (asymptotic) |
| A2A | $(N-1)/N$ | each rank’s one $(N-1)/N$ of its payload goes off-rank |
| Reduce | 1 | only the root’s final accumulator link matters |
Example: on an $N = 100$ AR measurement with algbw $= 450\,\mathrm{GB/s}$, busbw $= 450 \cdot 2 \cdot 99/100 \approx 891\,\mathrm{GB/s}$ — which is the number to compare directly against the fabric’s peak link BW to judge algorithmic efficiency.
Why both metrics coexist — and when to use which. algbw is the right number for performance modeling: step time is (compute + communication), and the communication term is $M / \mathrm{algbw}$. busbw is the right number for diagnosing whether the fabric ceiling is reached: it normalizes out the algorithm’s structural $n_\beta$ factor (e.g., AR’s 2× from the RS + AG decomposition) so the resulting number is directly comparable across primitives to the fabric’s peak link BW. A collective whose busbw equals peak link BW has hit the fabric ceiling; one whose busbw is well below peak has headroom that a better algorithm or better placement might recover.
This is the measurement convention that 05_contention_and_congestion.md §4.1 uses to back out $\eta_\beta$: realized busbw divided by peak link BW gives the contention-coefficient ratio directly.
Further reading
02_topology_mapping.md— how ring / tree / log algorithms map to star (crossbar) and torus fabrics, with latency + BW derivations per topology, the torus dim-decomp AR mechanism worked out on a 2×2 example, and a side-by-side comparison at fixed $G$.04_in_network_collectives.md— how switch-resident reduction engines (SHARP, NVLS) replace $O(N)$ endpoint hops with $O(1)$ switch hops on star topologies.05_contention_and_congestion.md— how dynamic contention coefficients $\eta_\alpha \ge 1$ and $\eta_\beta \in (0, 1]$ modify the idealized costs above under realistic traffic.03_hierarchical_topologies.md— how the primitives above compose across tiers (RS → sub-AR → AG for AR, and AG / RS / A2A generalizations) when a cluster stacks star / torus / mesh / Clos fabrics.references.md— primary-source bibliography (Hockney’s α-β model, Patarasuk-Yuan ring AR, Thakur-Rabenseifner-Gropp RHD, Sanders-Speck-Träff DBT, Bruck log A2A, PAT / NCCL 2.23 scale-out AG/RS, Jeaugey et al. “Demystifying NCCL” 2025).- Patarasuk & Yuan (2009), “Bandwidth optimal all-reduce algorithms for clusters of workstations” — the foundational proof that ring AR achieves the $2(N-1)/N \cdot M/\mathrm{BW}$ bandwidth lower bound.
-
PCIe switches (Broadcom PEX, Microchip Switchtec, etc.) have supported hardware multicast since the PCIe Gen2 spec added it as an optional feature, and Gen4/5 AI-oriented switches expose it reliably. They are therefore useful for BC and AG — and the BC-half of AR where the target fabric is a PCIe tree — but they have no in-switch ALU: the multicast primitive replicates payloads only, it does not combine them. PCIe switches are consequently absent from the INC rows of the Reduce and RS tables (§4.3, §6), and do not contribute to the BW-side effective-BW lift that SHARP / NVLS give AR (see
04_in_network_collectives.md §1.4). ↩
Mapping Collective Algorithms to Physical Topologies
Author: Yue Lu
Date: April 2026
The previous note (01_collective_algorithms.md) costs collective primitives on an abstract fully-connected fabric. Real clusters don’t have that. Inside a scale-up domain they have a star topology of GPUs around an NVSwitch, a 2D / 3D torus for TPU pods, or a mesh for chiplet interposers or short-reach PCIe fabrics. This note focuses on those single-tier, scale-up topologies — walking through each with a 2×2-style diagram, showing how ring / tree / dim-decomposed ring AR maps onto each, and deriving the per-primitive cost formulas summarized by topology in §5. The canonical scale-out / multi-tier fabric — fat-tree / Clos — is inherently hierarchical and lives in 03_hierarchical_topologies.md §2 alongside the composition rules that apply once topologies layer.
No network contention modeled here — every physical link is assumed to carry the active collective alone at peak bandwidth, with no concurrent flows from other collectives, other groups, or background traffic sharing the link. Realistic link sharing, bisection oversubscription, and concurrent-group conflicts are scored in 05_contention_and_congestion.md.
Table of Contents
- Topology primer
- Star topology
- 2.1 Broadcast
- 2.2 Reduce
- 2.3 All-reduce
- 2.4 All-gather / reduce-scatter
- 2.5 All-to-all
- Torus topology
- 3.1 Dim-decomposition
- 3.2 Broadcast
- 3.3 Reduce
- 3.4 All-reduce
- 3.5 All-gather / reduce-scatter
- 3.6 All-to-all
- Mesh topology
- Summary and limitations
- 5.1 Cost summary by topology
- 5.2 Limitations
- Appendix A: Dim-decomposed Rabenseifner halving-doubling
- Appendix B: N, D, d — rank count, diameter, and dim sizes
- Appendix C: Per-chunk-per-step analysis vs bottleneck analysis
- Further reading
1. Topology primer
This note is a three-topology catalog of single-tier, scale-up fabrics: star (endpoints around a central high-radix switch — e.g., NVSwitch, §2), torus (a $k$-dimensional lattice with wraparound — e.g., TPU pods, §3), and mesh (direct all-to-all or $k$-D lattice without wraparound — e.g., chiplet interposer, PCIe point-to-point, §4). Star and torus cover scale-up domains at pod level; mesh covers short-reach direct fabrics. The canonical scale-out / multi-tier fabric — fat-tree / Clos — is inherently hierarchical and covered in 03_hierarchical_topologies.md §2. Four properties describe any topology:
- Diameter $D$: hop count between the farthest rank pair.
- Bisection bandwidth: total BW across the narrowest cut that splits the ranks into two equal halves.
- Radix: ports per switch (for switched fabrics) or ports per router (for direct fabrics).
- Scalability: how wiring scales with $N$.
1.1 Star (single high-radix switch)
All $N$ endpoints attach to one central switch. $D = 2$ link hops (endpoint → switch → endpoint). This is NVLink 5 + NVSwitch Gen4 within a single NVL72 domain.
┌──────────────────────────────────────┐
│ NVSwitch │
│ (N ports, full crossbar) │
└──┬─────┬─────┬─────┬─────┬─────┬────┘
│ │ │ │ │ │
R0 R1 R2 R3 ... RN-1
- Pros: any-to-any in 2 hops. Collective algorithm is unconstrained by wiring since star can emulate any logical topology (ring, tree).
- Cons: scale limit is switch radix. A single NVSwitch Gen4 domain caps at ~72 GPUs; beyond that you need a multi-tier fabric.
1.2 Torus (2D or 3D lattice with wraparound)
Ranks are laid out on a $k$-dimensional lattice with dim sizes $(d_1, \ldots, d_k)$; each rank connects to its +1 and −1 neighbors per dim, with wraparound closing each axis into a ring. The neighbor count per rank is $\sum_i \min(d_i - 1, 2)$ — at most $2k$, reached when every $d_i \geq 3$. At a dim with $d_i = 2$ the +1 and −1 neighbors coincide, so that dim contributes 1 neighbor instead of 2. Asymmetric shapes may or may not trigger this — it depends per-dim: e.g., an $A \times A \times B$ torus with $A \geq 3$ and $B = 2$ gives $2 + 2 + 1 = 5$ neighbors per rank, but the same shape with $B = 4$ (or any $B \geq 3$) saturates back to $2 + 2 + 2 = 6$ because the $\min(d_i - 1, 2)$ term caps at 2 once $d_i$ clears 3. Google TPU pods use a 3D torus (typically symmetric — TPU v5p’s 16×16×16 — so all dims are non-degenerate).
Torus structure: d = 2 per axis (degenerate, 1 link/dim) vs d ≥ 3 per axis (non-degenerate, 2 links/dim). Neighbors per rank = Σᵢ min(dᵢ − 1, 2).
──────────── Degenerate d = 2 per axis (direct ≡ wraparound, k links/rank) ────────────
2D: 2×2 = 4 ranks 3D: 2×2×2 = 8 ranks
(k = 2, 2 links per rank) (k = 3, 3 links per rank)
(0,1,1)━━━━━(1,1,1)
╱┃ ╱┃
╱ ┃ ╱ ┃
(0,1,0)━━━━━━(1,1,0) ┃
(0,1)━━(1,1) ┃ ┃ ┃ ┃
┃ ┃ ┃ (0,0,1)━━━━━┃━(1,0,1)
┃ ┃ ┃ ╱ ┃ ╱
(0,0)━━(1,0) ┃ ╱ ┃ ╱
(0,0,0)━━━━━━(1,0,0)
every drawn edge is both direct-neighbor and wraparound-neighbor: (0,y) ↔ (1,y)
is the single X-edge (similarly Y, Z). At d = 2 the 2k rule collapses to k.
────────── Non-degenerate d = 3 per axis (direct ≠ wraparound, 2k links/rank) ──────────
2D: 3×3 = 9 ranks 2k = 4 links per rank
(flat view, wraparound marked with ↔ ↕)
↕ ↕ ↕ ↔ X-wraparound: (0,y) ━ (2,y) per y
↔ (0,2)━(1,2)━(2,2) ↔ ↕ Y-wraparound: (x,0) ━ (x,2) per x
┃ ┃ ┃
↔ (0,1)━(1,1)━(2,1) ↔ example: (0,0)'s 4 neighbors are
┃ ┃ ┃ (1,0), (2,0) [X]
↔ (0,0)━(1,0)━(2,0) ↔ (0,1), (0,2) [Y]
↕ ↕ ↕
3D: 3×3×3 = 27 ranks 2k = 6 links per rank
(stacked Z-slices, each slice a 3×3 2D torus as above)
Z = 0 slice Z = 1 slice Z = 2 slice
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
↔(0,2,0)━(1,2,0)━(2,2,0)↔ ↔(0,2,1)━(1,2,1)━(2,2,1)↔ ↔(0,2,2)━(1,2,2)━(2,2,2)↔
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
↔(0,1,0)━(1,1,0)━(2,1,0)↔ ↔(0,1,1)━(1,1,1)━(2,1,1)↔ ↔(0,1,2)━(1,1,2)━(2,1,2)↔
┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃ ┃
↔(0,0,0)━(1,0,0)━(2,0,0)↔ ↔(0,0,1)━(1,0,1)━(2,0,1)↔ ↔(0,0,2)━(1,0,2)━(2,0,2)↔
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
━━Z━━→ ━━Z━━→
┌───────────────── Z-wraparound (Z = 2 slice → Z = 0 slice) ───────────────┐
▼ │
Z = 0 slice Z = 2 slice
↔ X-wraparound within each slice (row closes into a 3-ring, 3 rows per slice)
↕ Y-wraparound within each slice (column closes into a 3-ring, 3 columns per slice)
Z direct Z-links between adjacent slices: (x,y,0) ━ (x,y,1), (x,y,1) ━ (x,y,2);
plus Z-wraparound (x,y,2) ━ (x,y,0) closes each Z-chain into a 3-ring (9 total)
examples — neighbors grouped by axis, with the slice of each one spelled out:
(1,1,1) body-center (all 6 direct, no wraparound):
X: (0,1,1), (2,1,1) — both in Z = 1 slice (left and right of (1,1,1))
Y: (1,0,1), (1,2,1) — both in Z = 1 slice (below and above (1,1,1))
Z: (1,1,0), (1,1,2) — (1,1,0) sits in Z = 0 slice, (1,1,2) in Z = 2 slice
(0,0,0) corner (3 direct + 3 via wraparound, marked *):
X: (1,0,0), (2,0,0)* — both in Z = 0 slice; * wraparound from X = 0 to X = 2
Y: (0,1,0), (0,2,0)* — both in Z = 0 slice; * wraparound from Y = 0 to Y = 2
Z: (0,0,1), (0,0,2)* — (0,0,1) in Z = 1 slice; (0,0,2)* in Z = 2 slice (wrap)
(0,1,2) edge-midpoint (4 direct + 2 via wraparound, marked *):
X: (1,1,2), (2,1,2)* — both in Z = 2 slice; * wraparound from X = 0 to X = 2
Y: (0,0,2), (0,2,2) — both in Z = 2 slice; Y = 1 is interior so both direct
Z: (0,1,1), (0,1,0)* — (0,1,1) in Z = 1 slice; (0,1,0)* in Z = 0 slice (wrap Z = 2 → 0)
production TPU v5p pods: 16×16×16 = 4096 ranks (same structure, each axis a 16-ring)
──────────── Asymmetric shapes: degeneration is per-dim, not per-shape ────────────
Asymmetry alone does NOT collapse anything — each dim is judged independently by its
own dᵢ. Only a dim with dᵢ = 2 contributes 1 neighbor instead of 2.
per-dim rule: dᵢ = 1 → 0 neighbors on axis i (axis is trivial, single rank)
dᵢ = 2 → 1 neighbor on axis i (direct ≡ wraparound, collapsed)
dᵢ ≥ 3 → 2 neighbors on axis i (direct ≠ wraparound, full)
total neighbors per rank = Σᵢ min(dᵢ − 1, 2), bounded above by 2k.
worked examples (k = 3):
3 × 3 × 3 (symmetric, all ≥ 3) → 2 + 2 + 2 = 6 (full 2k, non-degenerate)
3 × 3 × 4 (asymmetric, all ≥ 3) → 2 + 2 + 2 = 6 (still full 2k — asymmetry alone doesn't degenerate)
16 × 16 × 4 (asymmetric, all ≥ 3) → 2 + 2 + 2 = 6 (same; common TPU slice shape)
4 × 4 × 2 (asymmetric, one = 2) → 2 + 2 + 1 = 5 (Z-axis collapsed to a 2-ring)
2 × 2 × 2 (symmetric, all = 2) → 1 + 1 + 1 = 3 (all axes collapsed — the k case)
8 × 1 × 1 (one axis trivial) → 2 + 0 + 0 = 2 (effectively a 1D 8-ring)
key distinction: "asymmetric" means the dᵢ differ; "degenerate on axis i" means dᵢ = 2.
They are orthogonal — 3 × 3 × 4 is asymmetric but fully non-degenerate.
- Pros: no central switch — each rank has at most $2k$ links (fewer when any $d_i = 2$; general formula $\sum_i \min(d_i - 1, 2)$ per §1.2 intro). Linear wire cost in $N$. Dim structure enables dim-by-dim ring collectives that compress latency from $O(N)$ to $O(N^{1/k})$.
- Cons: diameter $D = \sum_i \lfloor d_i / 2 \rfloor$ grows as $N^{1/k}$; bisection scales as $N^{(k-1)/k}$ — sub-linear. All-to-all is bisection-constrained (more below). Realistic efficiency $\eta$ is typically lower than star’s (longer path length accumulates per-hop overhead — details in
05_contention_and_congestion.md§4). Fault tolerance is the weakest among the topologies here: a single rank failure breaks ring continuity in each of its $k$ dims, disrupting dim-decomposition; remapping either requires OCS-style slice reshaping (TPU v4+) or marking the whole slice unavailable (simpler torus fabrics).
1.3 Mesh (direct all-to-all or $k$-D lattice without wraparound)
Mesh is the direct-wiring counterpart to both star and torus — no central switch (the star-side simplification) and no wraparound (the torus-side simplification). Two regimes bracket the range, each corresponding to one of those parents: full mesh (the star analog), where every rank has a direct link to every other ($N(N-1)/2$ edges) — same any-to-any connectivity as star, but with the switch replaced by a dedicated per-pair link; and $k$-D mesh (the torus analog), where ranks sit on a $k$-D lattice with neighbor links only and no wraparound — same product-structured wiring as torus, but with the “seams” cut. The torus neighbor formula $\sum_i \min(d_i - 1, 2)$ applies unchanged to interior ranks; boundary ranks (at position 0 or $d_i - 1$ on any axis with $d_i \geq 3$) lose 1 neighbor per boundary axis — the wraparound that would have supplied the missing direction isn’t there. §4 develops each precisely.
Full mesh (4 ranks): 2D mesh example: 4×4 = 16 ranks (no wraparound)
R0 ━━━━━ R1 (0,3)━(1,3)━(2,3)━(3,3)
┃╲ ╱┃ ┃ ┃ ┃ ┃
┃ ╲ ╱ ┃ (0,2)━(1,2)━(2,2)━(3,2)
┃ ╲ ╱ ┃ ┃ ┃ ┃ ┃
┃ ╳ ┃ (0,1)━(1,1)━(2,1)━(3,1)
┃ ╱ ╲ ┃ ┃ ┃ ┃ ┃
┃ ╱ ╲ ┃ (0,0)━(1,0)━(2,0)━(3,0)
┃╱ ╲┃
R3 ━━━━━ R2 (no edges wrap around between (3,·) and (0,·))
- Pros: full mesh removes the switch entirely — endpoint-to-endpoint α drops the cut-through hop — and every pair owns its own dedicated link so A2A is free of bisection or port-sharing pressure. $k$-D mesh retains torus’s product-structured dim-decomposition — same $O(N^{1/k})$ α scaling for AR / AG / RS, just with each dim-phase running an open-line RS/AG instead of a closed-ring RS/AG (roughly 2× the α-coefficient per dim because the line loses the ring’s bidirectional parallelism, but the $O(N^{1/k})$ order is preserved) — while using fewer physical links (no wraparound edges).
- Cons: full mesh has $O(N^2)$ wire count and $O(N)$ ports-per-rank, practically capping at ~8–16 endpoints (legacy DGX-1/DGX-2 hybrid cube-mesh, chiplet interposer fabrics). $k$-D mesh has half the bisection of a same-shape torus because the wraparound links are missing — A2A pays a 2× BW penalty relative to torus. $k$-D mesh inherits the same realistic-η and fault-tolerance weaknesses as torus (lattice continuity breaks on rank failure; details in
05_contention_and_congestion.md§4). Full mesh is more forgiving on both axes — single-hop paths avoid the per-hop η accumulation, and any-to-any wiring means a rank failure only removes that endpoint without breaking any continuity structure.
2. Star topology
Star is the most flexible substrate for mapping collective algorithms to hardware. Any-to-any in 2 hops means every logical “communication edge” in a ring, DBT, or pairwise schedule lands on a single switch cut-through — the algorithm’s logical topology is the physical topology, with no wiring-induced translation. Any of the seven primitives from 01_collective_algorithms.md runs on a star at its pure α-β cost — no topology correction, no dim-decomposition, no hierarchical sub-tiers. The algorithm is a software choice, not a wiring constraint. This direct-mapping property is unique to star among the topologies in this series: torus forces dim-decomposition (§3); multi-tier fat-tree / Clos fabrics force tier-aware schedules that sub-divide the collective across per-tier sub-groups. There are no star-specific “special” schedules; this section only records the α calibration that makes the cost formulas concrete on a representative scale-up star, and defers algorithm selection to the prior note. In-network collectives (SHARP, covered in 04_in_network_collectives.md) build on this by collapsing even the tree’s $\log_2 N$ endpoint hops to $O(1)$ switch-hop latency — an amplification of star’s algorithmic freedom, not a separate algorithm class.
α and BW calibration on a scale-up star. On NVLink-5 / NVSwitch Gen4 class hardware:
- $\alpha \approx 0.5\,\mu$s — switch cut-through ($\sim$100 ns) plus endpoint software overhead ($\sim$400 ns). Each algorithm “hop” in the star primitive costs this $\alpha$, which already includes the physical 2-link endpoint→switch→endpoint traversal as a single cut-through event.
- $\mathrm{BW} \approx 900\,\mathrm{GB/s}$ per direction per GPU — a single NVLink-5 port’s unidirectional bandwidth. All cost formulas from the prior note use this as the per-port BW.
2.1 Broadcast
Relation to 01_collective_algorithms.md §3. All three options from the prior note — ring BC (§3.1), binomial-tree BC (§3.2), and hardware switch multicast (§3.3) — run on a star at their pure costs. Star contributes only the α and BW calibration above.
Commercial shipment. NCCL ships binomial-tree BC on star as the default software path. NVSwitch Gen3+ / NVLS exposes the switch-multicast primitive that collapses α to the $2\alpha_{\mathrm{switch}}$ $N$-independent floor (see 04_in_network_collectives.md §1.2).
2.2 Reduce
Relation to 01_collective_algorithms.md §4. Reduce is the time-reverse dual of BC: ring Reduce (§4.1), binomial-tree Reduce (§4.2), and switch-ALU Reduce / SHARP / NVLS (§4.3) all run on a star at their pure costs. Per 01_collective_algorithms.md §4.3, tree strictly dominates ring across the full $M$-range for standalone Reduce (same BW floor, smaller α), so star’s effective menu reduces to tree + INC.
Commercial shipment. NCCL ships tree Reduce on star. NVSwitch Gen3+ / NVLS and InfiniBand SHARP provide the switch-ALU path that collapses the α count to an $N$-independent constant (see 04_in_network_collectives.md §1.1).
2.3 All-reduce
Relation to 01_collective_algorithms.md §5. All four AR algorithms from the prior note (ring in §5.1, DBT in §5.2, plus simple rec-doubling in App. B.1 and Rabenseifner in App. B.2 for reference) run on a star at their pure costs — no topology correction, no dim-decomposition. Star contributes only the α and BW calibration above.
Commercial shipment. NCCL ships both ring and DBT on star. The $(N, M)$ selection behavior between them is a runtime-scheduler concern, not a topology concern — see 01_collective_algorithms.md §5.3 for the algorithm-selection discussion.
2.4 All-gather / reduce-scatter
Relation to 01_collective_algorithms.md §6. Star runs both ring AG/RS (§6) and rec-doubling AG / rec-halving RS (App. B.4) at their pure costs from the prior note. No star-specific correction to the cost formulas: BW = per-port BW, α = cut-through + software, plug in and read off.
Commercial shipment. NCCL ships ring AG/RS on star. The ring-vs-log-depth selection rationale is a runtime-scheduler concern — see 01_collective_algorithms.md §6.
2.5 All-to-all
Relation to 01_collective_algorithms.md §7. A2A on a star is bounded by per-rank port BW, not by the switch fabric. The switch has aggregate BW $N \cdot \mathrm{BW}$ — far more than any collective needs — but each endpoint has only one duplex link, so the information-theoretic lower bound on cost is $\frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}} \approx M/\mathrm{BW}$ (each rank must ship $(N-1)M/N$ bytes through its one port). Star’s win over torus for A2A is that it achieves this BW lower bound with no bisection bottleneck and no dim-decomposition to engineer — just peak per-port BW applied uniformly across all $N(N-1)$ pairs.
Commercial shipment. NCCL ships pairwise direct-send (01_collective_algorithms.md §7.2) on star. Full algorithm-selection rationale — including the bisection-limited vs full-bisection fabric rule that picks pairwise on star but ring-relay on torus — lives in 01_collective_algorithms.md §7.3.
3. Torus topology
A torus trades star’s algorithmic flexibility for linear wire cost at scale. The return on that trade is a single powerful optimization that distinguishes torus from star / fat-tree / Clos: dim-decomposition — rewriting each primitive as $k$ serial phases, one per physical dim, each running an independent $d_i$-rank ring on its own disjoint wiring. The payoff is structural: the flat-ring $O(N)$ α term collapses to $O(N^{1/k})$ while the bandwidth term stays at the flat-ring bandwidth optimum. This is the sole reason a torus can match or beat a star for AR / AG / RS despite its longer average path length. Take dim-decomposition away and a torus is strictly worse than a star for every primitive.
§3.1 develops the general dim-decomposition mechanism with a 2×2×2 visual; §3.2 and §3.3 cover BC and Reduce, and §3.4–§3.6 specialize dim-decomp to AR / AG / RS / A2A. Realistic-$\eta$ re-scoring of both wins and losses lives in 05_contention_and_congestion.md.
3.1 Dim-decomposition
Dim-decomposition is the topology-aware rewrite that makes torus competitive on collective cost. This subsection defines the rewrite, visualizes the per-dim phases on a 2×2×2 torus, states the correctness / BW-optimality / latency-compression properties, and catalogs where the decomposition applies and where it breaks. The primitive-specific subsections (§3.2 BC, §3.3 Reduce, §3.4 AR, §3.5 AG/RS, §3.6 A2A) all specialize this framework — chunk-size bookkeeping, AG-vs-RS pairing, cost formulas.
The rewrite. A flat collective over $N$ ranks is decomposed into $k$ serial phases, one per physical dim. Within phase $i$, $(N/d_i)$ independent rings of size $d_i$ run concurrently on the $(N/d_i)$ dim-lines perpendicular to axis $i$; each ring uses only dim-$i$ links, and different rings share no wiring.
Visual — three phases on a 2×2×2 torus. The 2×2×2 has 8 ranks and 12 edges split into three axis families: 4 X-edges, 4 Y-edges, 4 Z-edges. Dim-decomposition runs three sequential phases; each phase uses only one axis’s edges while the other two are idle.
Torus 2×2×2 — 8 ranks labeled a–h (link families labeled; X horizontal, Y into-page, Z vertical):
g━━━━━━h ━━━ horizontal edges = X-axis links (4 total)
╱┃ ╱┃ ╱ diagonal edges = Y-axis links (4 total)
╱ ┃ ╱ ┃ ┃ vertical edges = Z-axis links (4 total)
c━━━━━━d ┃
┃ ┃ ┃ ┃
┃ e━━━┃━━f
┃ ╱ ┃ ╱
┃╱ ┃╱
a━━━━━━b
Phase X (X-links carry all traffic; Y and Z links idle):
4 concurrent 2-rank rings along horizontal edges —
{ a ↔ b }, { c ↔ d }, { e ↔ f }, { g ↔ h }.
Phase Y (Y-links carry all traffic; X and Z links idle):
4 concurrent 2-rank rings along diagonal (into-page) edges —
{ a ↔ e }, { b ↔ f }, { c ↔ g }, { d ↔ h }.
Phase Z (Z-links carry all traffic; X and Y links idle):
4 concurrent 2-rank rings along vertical edges —
{ a ↔ c }, { b ↔ d }, { e ↔ g }, { f ↔ h }.
At $d_i = 2$ per axis (shown here) each “ring” collapses to a pair-exchange (direct ≡ wraparound); at $d_i \geq 3$ each phase runs genuine $d_i$-rings, but the structure “each phase uses only its axis’s links, rings run concurrently on disjoint wiring” is unchanged. A TPU v5p $16 \times 16 \times 16$ torus runs 256 concurrent 16-rings per dim-phase.
Concrete savings vs flat ring on the same 8 ranks. To see what dim-decomposition buys, compare against running a flat 8-ring schedule on the same 2×2×2 torus (ignoring the product structure — the linearized rank order still runs correctly, but each logical hop crosses many physical torus links on average):
Flat-ring schedule on 2×2×2 torus (8 ranks linearized into a single ring):
a ━ b ━ e ━ f ━ c ━ d ━ g ━ h ━┐
↑ │
└──────────────────────────────┘
wraparound (flat-ring closure)
N − 1 = 7 sequential hops per ring traversal; every rank sends / receives
one neighbor exchange at a time. No inter-rank parallelism.
Dim-decomposed schedule on 2×2×2 torus:
Phase X → Phase Y → Phase Z
───────────────── ───────────────── ─────────────────
4 pair-exchanges 4 pair-exchanges 4 pair-exchanges
concurrent concurrent concurrent
on X-links on Y-links on Z-links
Σ (dᵢ − 1) = 3 sequential hops per ring traversal — less than half of 7.
Within each phase, 4 pair-rings run on physically disjoint wiring simultaneously.
7 vs 3 sequential hops → ~2.3× α-compression at $N = 8$, for free — same total bytes moved, same BW term, just fewer serial waits. The ratio grows as $(N - 1) / \sum_i (d_i - 1)$: at $8 \times 8 \times 8$ that is $511 / 21 = 24\times$; at $16 \times 16 \times 16$ it is $4095 / 45 = 91\times$. Any specific primitive (AR = two traversals, AG or RS = one, A2A = different schedule entirely) multiplies through the same ratio — the compression is a pure schedule rewrite discovering the $k$ parallel dim-rings the torus’s wiring was already providing.
Why it works. Three properties follow from the torus’s $k$-D Cartesian-product wiring:
- Correctness — reduction (and concatenation, for AG) is associative, so the collective factors across dims: $\sum_{x, y, z} V_{x,y,z} = \sum_z \big( \sum_y ( \sum_x V_{x,y,z} ) \big).$ Each inner sum is an independent problem handled by one dim-phase.
- Bandwidth-optimality — the $(N/d_i)$ independent rings in phase $i$ use physically disjoint link sets (different rings of the same dim live on different dim-lines). No inter-ring contention wastes BW.
- Latency compression — $k$ serial phases, each $O(d_i) = O(N^{1/k})$ long, total $O(k \cdot N^{1/k}) = O(N^{1/k})$ for fixed $k$.
Where dim-decomposition breaks. The compression is not a blanket win on torus:
- You can still embed a flat $N$-ring on a torus (each logical hop then crosses many physical links), paying the full flat-ring $O(N)$ cost — dim-decomposition is the optimization that skips this, not the only correct schedule.
- Compression requires dim-aligned group layouts (§3.4). Off-prefix groups collapse the decomposition back toward flat-ring.
- A2A is bisection-bound, not per-port (§3.6) — dim-decomposition doesn’t help the BW side because A2A’s aggregate cross-cut traffic is fixed by semantics, not schedule. Cost scales with $d_{\max}$, not $N$.
- Asymmetric dim sizes pay disproportionately — a pathological shape like $256 \times 2 \times 2$ costs far more than cubic $8 \times 8 \times 8$ at the same $N$.
3.2 Broadcast
BC on a torus ships as dim-decomposed ring BC — the §3.1 framework with ring BC (01_collective_algorithms.md §3.1) as the inner per-dim primitive. Unlike AR, BC has only $k$ phases (no reverse half): the source fans out along dim 1, every dim-1 recipient then fans out along dim 2 concurrently, and so on until all $N$ ranks hold the payload. Unidirectional ring BC per dim costs $(d_i - 1)\alpha + M/\mathrm{BW}$; a bidirectional variant using both ring directions concurrently reaches the ring-diameter lower bound of $\lfloor d_i / 2 \rfloor \alpha$ per dim (≈2× α compression for $d_i \geq 3$).
Aside — why tree BC doesn’t help on torus. On fully-connected substrates (star, §2.1), binomial tree BC compresses α to $\lceil \log_2 N \rceil$ because every logical tree edge crosses one physical hop. On a $d$-ring, a tree’s logical “distance-$d/2$ step” covers $d/2$ physical hops, and the subsequent halvings sum back to the same $d - 1$ total physical α as ring BC. The log-depth α advantage requires any-to-any 1-hop reachability, which torus per-dim wiring doesn’t provide. Bidirectional ring BC’s $\lfloor d_i / 2 \rfloor \alpha$ is the best α achievable on ring wiring.
Algorithm. Run per-dim ring BC one dim at a time. Phase order on a 3D torus: X-BC → Y-BC → Z-BC — $k$ phases total. After phase $i$, the set of ranks holding the payload has grown by a factor of $d_i$.
Why it works — 2×2×2 worked example. Eight ranks labeled a–h, same slice convention as §3.4 (a at origin (0,0,0), z=0 slice holds a/b/e/f, z=1 slice holds c/d/g/h). Source is rank a, holding an $M$-byte payload $V$. All other ranks start empty (shown as $\cdot$).
Initial state — only a holds V:
Z = 0 slice: Z = 1 slice:
┌─────┬─────┐ ┌─────┬─────┐
y=1 │ · │ · │ y=1 │ · │ · │
├─────┼─────┤ ├─────┼─────┤
y=0 │ V │ · │ y=0 │ · │ · │
└─────┴─────┘ └─────┴─────┘
x=0 x=1 x=0 x=1
Phase 1 (X-BC). One active 2-ring along X — the ring containing the source: {a, b}. a sends $V$ to b along the X-link. The other three X-rings ({e,f}, {c,d}, {g,h}) are dormant: no source holds $V$ yet.
After Phase 1 (X-BC) — a, b hold V:
Z = 0 slice: Z = 1 slice:
┌─────┬─────┐ ┌─────┬─────┐
y=1 │ · │ · │ y=1 │ · │ · │
├─────┼─────┤ ├─────┼─────┤
y=0 │ V │ V │ y=0 │ · │ · │
└─────┴─────┘ └─────┴─────┘
x=0 x=1 x=0 x=1
Phase 2 (Y-BC). Two active 2-rings along Y, one per (x, z=0): {a, e}, {b, f}. a → e and b → f concurrently. The other two Y-rings ({c,g}, {d,h}) are still dormant.
After Phase 2 (Y-BC) — entire Z=0 slice holds V:
Z = 0 slice: Z = 1 slice:
┌─────┬─────┐ ┌─────┬─────┐
y=1 │ V │ V │ y=1 │ · │ · │
├─────┼─────┤ ├─────┼─────┤
y=0 │ V │ V │ y=0 │ · │ · │
└─────┴─────┘ └─────┴─────┘
x=0 x=1 x=0 x=1
Phase 3 (Z-BC). Four concurrent 2-rings along Z, one per (x, y): {a, c}, {b, d}, {e, g}, {f, h}. Every Z-ring now has a source (the entire Z=0 slice holds $V$), so all four rings broadcast concurrently. After Phase 3, all 8 ranks hold $V$.
After Phase 3 (Z-BC) — all 8 ranks hold V. BC complete:
Z = 0 slice: Z = 1 slice:
┌─────┬─────┐ ┌─────┬─────┐
y=1 │ V │ V │ y=1 │ V │ V │
├─────┼─────┤ ├─────┼─────┤
y=0 │ V │ V │ y=0 │ V │ V │
└─────┴─────┘ └─────┴─────┘
x=0 x=1 x=0 x=1
BC complete: every rank holds $V$. Total phase count: $k = 3$; total sequential α-hops: $\sum_i (d_i - 1) = 3$ — half of AR’s $2\sum_i (d_i - 1) = 6$, since BC is one-way (source → all, no reverse half).
Bidirectional variant — 3×3 torus example. For $d_i \geq 3$, using both ring directions concurrently cuts α roughly in half. On a 3×3 torus (9 ranks, source = a at (0, 0)):
3×3 torus grid (wraparound on every row and column):
┌───┬───┬───┐
y=2 │ g │ h │ i │ X-wraps (per row): a↔c, d↔f, g↔i
├───┼───┼───┤ Y-wraps (per col): a↔g, b↔h, c↔i
y=1 │ d │ e │ f │
├───┼───┼───┤
y=0 │ a │ b │ c │
└───┴───┴───┘
x=0 x=1 x=2
Initial — only a holds V: Phase 1 (X-BC, bidirectional):
a sends V → b (forward)
┌───┬───┬───┐ and V → c (via X-wrap),
y=2 │ · │ · │ · │ concurrently on row y=0.
├───┼───┼───┤ Only row y=0 has a source; rows
y=1 │ · │ · │ · │ y=1, y=2 stay dormant this phase.
├───┼───┼───┤
y=0 │ V │ · │ · │
└───┴───┴───┘
x=0 x=1 x=2
After Phase 1 (X-BC) — row y=0 holds V:
┌───┬───┬───┐
y=2 │ · │ · │ · │
├───┼───┼───┤
y=1 │ · │ · │ · │
├───┼───┼───┤ 1 α elapsed
y=0 │ V │ V │ V │
└───┴───┴───┘
x=0 x=1 x=2
Phase 2 (Y-BC, bidirectional on every column in parallel):
Each column's y=0 rank sends V → y=1 (forward) AND V → y=2 (via Y-wrap)
concurrently. All 3 columns fire in parallel because every column now
has a V-source in row y=0.
After Phase 2 (Y-BC) — all 9 ranks hold V:
┌───┬───┬───┐
y=2 │ V │ V │ V │
├───┼───┼───┤
y=1 │ V │ V │ V │ 1 α elapsed
├───┼───┼───┤
y=0 │ V │ V │ V │
└───┴───┴───┘
x=0 x=1 x=2
Total: 2 α + M/BW (vs unidirectional ring's (3-1) + (3-1) = 4 α — 2× α win.)
On a 3-ring, bidirectional completes in 1 α per dim because each rank’s two ring-neighbors cover the rest of the ring. For larger $d_i$ the pattern generalizes to a $\lfloor d_i/2 \rfloor$-step wave radiating outward from the source (e.g., 4-ring = 2 α, 8-ring = 4 α). The 2×2×2 worked example below uses $d_i = 2$ where bidirectional and unidirectional coincide trivially (1 α per dim).
Cost formula. Each dim-$i$ phase is a $d_i$-rank pipelined ring BC at cost $(d_i - 1)\alpha + M/\mathrm{BW}$ unidirectional or $\lfloor d_i/2 \rfloor \alpha + M/\mathrm{BW}$ bidirectional (01_collective_algorithms.md §3.1). With pipelining across the $k$ dim phases, the BW term telescopes into a single $M/\mathrm{BW}$ (each phase re-uses the payload pipelined in from the previous dim, so only the source-side dispatch bottleneck — one $M$-worth of bytes through one outbound port — lands on the wall clock):
$$t_{\mathrm{torus,BC,ring}} \;\approx\; \sum_i (d_i - 1)\,\alpha \;+\; \frac{M}{\mathrm{BW}} \quad\text{(unidirectional)}$$
$$t_{\mathrm{torus,BC,bidir}} \;\approx\; \sum_i \lfloor d_i / 2 \rfloor\,\alpha \;+\; \frac{M}{\mathrm{BW}} \quad\text{(bidirectional)}$$
Switch-hosted multicast (NVLS / SHARP) is structurally unavailable on torus — there is no central switch to fan out from — so the α floor is purely software-driven, unlike star (§2.1) where INC collapses α to $2\alpha_{\mathrm{switch}}$.
Commercial shipment. TPU/Trainium expose BC as a dim-decomposed pipelined primitive; whether XLA / NeuronX run unidirectional or bidirectional per-dim ring is a tuner detail not publicly documented. Cross-pod BC on composite fabrics (scale-up torus + scale-out Clos) reverts to per-tier schedules — see 03_hierarchical_topologies.md.
3.3 Reduce
Reduce on a torus ships as dim-decomposed ring Reduce — the time-reverse dual of §3.2’s BC. Every rank contributes its own $V_i$; contributions accumulate toward a single root via the inverse dim-decomposition. Phase order reverses BC’s: collapse the last dim first, so that the set of ranks holding live partial sums shrinks by a factor of $d_i$ each phase until only the root remains. As with BC, the bidirectional ring variant reaches $\lfloor d_i/2 \rfloor \alpha$ per dim by reducing from both halves of the ring toward the root concurrently, vs unidirectional’s $d_i - 1$; the tree-BC caveat from §3.2 applies symmetrically (log-depth α needs any-to-any reachability, which ring wiring lacks).
Algorithm. Run per-dim ring Reduce one dim at a time. Phase order on a 3D torus with root a at (0, 0, 0): Z-Reduce → Y-Reduce → X-Reduce — $k$ phases total. Each phase accumulates partial sums toward the root’s coordinate on that axis.
Why it works — 2×2×2 worked example. Eight ranks labeled a–h, same slice convention as §3.4. Root is rank a. Each rank initially holds its own $M$-byte vector $V_i$.
Initial state — each rank holds its own V_i:
Z = 0 slice: Z = 1 slice:
┌─────┬─────┐ ┌─────┬─────┐
y=1 │ V_e │ V_f │ y=1 │ V_g │ V_h │
├─────┼─────┤ ├─────┼─────┤
y=0 │ V_a │ V_b │ y=0 │ V_c │ V_d │
└─────┴─────┘ └─────┴─────┘
Phase 1 (Z-Reduce). Four concurrent 2-rings along Z, one per (x, y): {a, c}, {b, d}, {e, g}, {f, h}. Each pair reduces toward its z=0 endpoint — c → a, d → b, g → e, h → f, with z=0 ranks summing into their own buffer. After Phase 1, the Z=0 slice holds 2-rank partial sums; the Z=1 slice is dormant (its state has been absorbed).
After Phase 1 (Z-Reduce) — Z=0 slice holds pairwise sums; Z=1 dormant:
Z = 0 slice: Z = 1 slice:
┌───────────┬───────────┐ ┌──────────┬──────────┐
y=1 │ V_e + V_g │ V_f + V_h │ y=1 │ dormant │ dormant │
├───────────┼───────────┤ ├──────────┼──────────┤
y=0 │ V_a + V_c │ V_b + V_d │ y=0 │ dormant │ dormant │
└───────────┴───────────┘ └──────────┴──────────┘
Phase 2 (Y-Reduce). Two active 2-rings along Y in the Z=0 slice: {a, e}, {b, f}. Each pair reduces toward y=0: e → a, f → b. Row y=1 becomes dormant.
After Phase 2 (Y-Reduce) — row y=0 of Z=0 holds 4-rank partial sums:
Z = 0 slice: Z = 1 slice: dormant
┌───────────────────────┬───────────────────────┐
y=1 │ dormant │ dormant │
├───────────────────────┼───────────────────────┤
y=0 │ V_a+V_c+V_e+V_g │ V_b+V_d+V_f+V_h │
└───────────────────────┴───────────────────────┘
Phase 3 (X-Reduce). One active 2-ring along X in row (y=0, z=0): {a, b}. b reduces toward x=0: b → a. After Phase 3, only rank a holds live state, and it holds the full reduction.
After Phase 3 (X-Reduce) — a holds the full sum. Reduce complete:
Z = 0 slice:
┌───────────────────────────────────┬─────────┐
y=1 │ dormant │ dormant │
├───────────────────────────────────┼─────────┤
y=0 │ V_a+V_b+V_c+V_d+V_e+V_f+V_g+V_h │ dormant │
└───────────────────────────────────┴─────────┘
Z = 1 slice: dormant
Rank a holds Σ = V_a + V_b + V_c + V_d + V_e + V_f + V_g + V_h.
Reduce complete: root a holds $\Sigma = \sum_i V_i$. Total phase count: $k = 3$; total sequential α-hops: $\sum_i (d_i - 1) = 3$ — same as BC (time-reverse dual), half of AR.
Bidirectional variant — 3×3 torus example. Time-reverse of §3.2’s bidirectional BC. On a 3×3 torus (9 ranks, root = a at (0, 0)), let $S_x$ denote the sum of column $x$: $S_0 = V_a + V_d + V_g$, $S_1 = V_b + V_e + V_h$, $S_2 = V_c + V_f + V_i$. Final answer: $\Sigma = S_0 + S_1 + S_2 = \sum_i V_i$.
3×3 torus grid (same layout as §3.2; root = a at (0,0)):
┌───┬───┬───┐
y=2 │ g │ h │ i │ X-wraps: a↔c, d↔f, g↔i
├───┼───┼───┤ Y-wraps: a↔g, b↔h, c↔i
y=1 │ d │ e │ f │
├───┼───┼───┤
y=0 │ a │ b │ c │
└───┴───┴───┘
x=0 x=1 x=2
Initial — each rank holds its own V_i:
┌─────┬─────┬─────┐
y=2 │ V_g │ V_h │ V_i │
├─────┼─────┼─────┤
y=1 │ V_d │ V_e │ V_f │
├─────┼─────┼─────┤
y=0 │ V_a │ V_b │ V_c │
└─────┴─────┴─────┘
x=0 x=1 x=2
Phase 1 (Y-Reduce, bidirectional on every column in parallel):
For each column, y=1 sends its V forward to y=0 AND y=2 sends its V via Y-wrap
to y=0, concurrently. Root-of-column (y=0 rank) sums both into its own V.
All 3 columns fire in parallel (3 concurrent 3-rings).
After Phase 1 (Y-Reduce) — row y=0 holds column-sums S_0, S_1, S_2:
┌─────┬─────┬─────┐
y=2 │ — │ — │ — │ (dormant)
├─────┼─────┼─────┤
y=1 │ — │ — │ — │ (dormant)
├─────┼─────┼─────┤
y=0 │ S_0 │ S_1 │ S_2 │ 1 α elapsed
└─────┴─────┴─────┘
x=0 x=1 x=2
Phase 2 (X-Reduce, bidirectional on row y=0):
b sends S_1 forward to a AND c sends S_2 via X-wrap to a, concurrently.
a sums both into its own S_0.
After Phase 2 (X-Reduce) — a holds the full sum Σ:
┌─────┬─────┬─────┐
y=2 │ — │ — │ — │
├─────┼─────┼─────┤
y=1 │ — │ — │ — │
├─────┼─────┼─────┤
y=0 │ Σ │ — │ — │ 1 α elapsed
└─────┴─────┴─────┘
x=0 x=1 x=2
Total: 2 α + M/BW (vs unidirectional ring's (3-1) + (3-1) = 4 α — 2× α win.)
The 2×2×2 worked example above uses $d_i = 2$ where bidirectional and unidirectional coincide (1 α per dim either way); the 3×3 illustration makes the bidirectional wave-toward-root mechanic visible.
Cost formula. Each dim-$i$ phase is a $d_i$-rank pipelined ring Reduce at cost $(d_i - 1)\alpha + M/\mathrm{BW}$ unidirectional or $\lfloor d_i/2 \rfloor \alpha + M/\mathrm{BW}$ bidirectional (01_collective_algorithms.md §4.1). Symmetric to BC (§3.2), the BW term telescopes to $M/\mathrm{BW}$ overall under pipelining:
$$t_{\mathrm{torus,Reduce,ring}} \;\approx\; \sum_i (d_i - 1)\,\alpha \;+\; \frac{M}{\mathrm{BW}} \quad\text{(unidirectional)}$$
$$t_{\mathrm{torus,Reduce,bidir}} \;\approx\; \sum_i \lfloor d_i / 2 \rfloor\,\alpha \;+\; \frac{M}{\mathrm{BW}} \quad\text{(bidirectional)}$$
As with BC, the torus has no in-network ALU (ICI is link-only) — so the α floor is software-driven; the RS phase of the full torus AR (§3.4) stays the preferred path when every rank needs the result anyway, avoiding the Reduce+BC decomposition (see 01_collective_algorithms.md §4.3).
3.4 All-reduce
AR on a torus ships as dim-decomposed ring — the §3.1 framework with ring RS + AG as the inner per-dim primitive (Patarasuk-Yuan, 01_collective_algorithms.md §5.1, just on a $d_i$-rank ring). TPU (XLA / JAX) and AWS Trainium (NeuronX CCL) both default to it. A tree-flavored alternative — dim-decomposed Rabenseifner halving-doubling — swaps the inner primitive from ring to halving-doubling for additional α compression; it is not shipping on any production torus stack (Appendix A has the full derivation) but serves as a reference point that sharpens the production-ring analysis.
Algorithm. Run ring RS+AG one dim at a time. Phase order on a 3D torus: X-RS, Y-RS, Z-RS, then Z-AG, Y-AG, X-AG — $2k$ phases total. The AG dims traverse in reverse of the RS dims so that chunk sizes grow back symmetrically to how they shrank during RS (required for clean BW telescoping below; not for correctness — any AG dim order finishes AR correctly).
Why it works — 2×2×2 worked example. Eight ranks labeled a–h; each holds a length-8 input vector of size $M$ bytes (8 chunks of size $M/8$). Rank a’s input is $[a_0, a_1, \ldots, a_7]$, rank b’s is $[b_0, \ldots, b_7]$, and so on through h. The target AR output at every rank is $[\Sigma_0, \Sigma_1, \ldots, \Sigma_7]$ where $\Sigma_i = a_i + b_i + c_i + d_i + e_i + f_i + g_i + h_i$.
We visualize state as two 2×2 slices (Z = 0 front, Z = 1 back) — the stacked-slice form from §1.2 — where the letter in each cell names the rank at that position. The schedule runs 3 RS phases (X-RS → Y-RS → Z-RS) followed by 3 AG phases (Z-AG → Y-AG → X-AG); each phase uses only its axis’s links (§3.1).
Initial state — each rank holds its own 8-chunk input vector:
Z = 0 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [e₀,e₁,e₂,e₃,e₄,e₅,e₆,e₇] │ [f₀,f₁,f₂,f₃,f₄,f₅,f₆,f₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [a₀,a₁,a₂,a₃,a₄,a₅,a₆,a₇] │ [b₀,b₁,b₂,b₃,b₄,b₅,b₆,b₇] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Z = 1 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [g₀,g₁,g₂,g₃,g₄,g₅,g₆,g₇] │ [h₀,h₁,h₂,h₃,h₄,h₅,h₆,h₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [c₀,c₁,c₂,c₃,c₄,c₅,c₆,c₇] │ [d₀,d₁,d₂,d₃,d₄,d₅,d₆,d₇] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Phase 1 (X-RS). 4 concurrent 2-rings along X, one per (y, z) pair: {a,b}, {c,d}, {e,f}, {g,h}. Representative ring {a, b}: a keeps chunks 0–3, sends 4–7 to b; b keeps 4–7, sends 0–3 to a; each adds received into kept. Other rings do the same on their own pairs.
After Phase 1 (X-RS) — chunk size M → M/2; each rank holds 4 chunks:
Z = 0 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [e₀+f₀,e₁+f₁,e₂+f₂,e₃+f₃] │ [e₄+f₄,e₅+f₅,e₆+f₆,e₇+f₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [a₀+b₀,a₁+b₁,a₂+b₂,a₃+b₃] │ [a₄+b₄,a₅+b₅,a₆+b₆,a₇+b₇] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Z = 1 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [g₀+h₀,g₁+h₁,g₂+h₂,g₃+h₃] │ [g₄+h₄,g₅+h₅,g₆+h₆,g₇+h₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [c₀+d₀,c₁+d₁,c₂+d₂,c₃+d₃] │ [c₄+d₄,c₅+d₅,c₆+d₆,c₇+d₇] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
All four X-rings ran concurrently on physically disjoint X-links — row-0 of the Z=0 slice doesn’t wait on row-1, nor on anything in the Z=1 slice.
Phase 2 (Y-RS). 4 concurrent 2-rings along Y, one per (x, z) pair: {a,e}, {b,f}, {c,g}, {d,h}. Representative ring {a, e}: each now holds 4 chunks; a keeps chunks 0–1, sends 2–3 to e; e keeps 2–3, sends 0–1 to a; each adds.
After Phase 2 (Y-RS) — chunk size M/2 → M/4; each rank holds 2 chunks:
Z = 0 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [a₂+b₂+e₂+f₂,a₃+b₃+e₃+f₃] │ [a₆+b₆+e₆+f₆,a₇+b₇+e₇+f₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [a₀+b₀+e₀+f₀,a₁+b₁+e₁+f₁] │ [a₄+b₄+e₄+f₄,a₅+b₅+e₅+f₅] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Z = 1 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [c₂+d₂+g₂+h₂,c₃+d₃+g₃+h₃] │ [c₆+d₆+g₆+h₆,c₇+d₇+g₇+h₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [c₀+d₀+g₀+h₀,c₁+d₁+g₁+h₁] │ [c₄+d₄+g₄+h₄,c₅+d₅+g₅+h₅] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Phase 3 (Z-RS). 4 concurrent 2-rings along Z, one per (x, y) pair — this is the first phase that crosses slices: {a,c}, {b,d}, {e,g}, {f,h}. Representative ring {a, c}: each holds 2 chunks; a keeps chunk 0, sends 1 to c; c keeps 1, sends 0 to a; each adds.
After Phase 3 (Z-RS) — chunk size M/4 → M/8 = M/N; each rank holds 1 fully-reduced chunk:
Z = 0 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [a₂+b₂+c₂+d₂+e₂+f₂+g₂+h₂] │ [a₆+b₆+c₆+d₆+e₆+f₆+g₆+h₆] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [a₀+b₀+c₀+d₀+e₀+f₀+g₀+h₀] │ [a₄+b₄+c₄+d₄+e₄+f₄+g₄+h₄] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
Z = 1 slice:
┌───────────────────────────┬───────────────────────────┐
y=1 │ [a₃+b₃+c₃+d₃+e₃+f₃+g₃+h₃] │ [a₇+b₇+c₇+d₇+e₇+f₇+g₇+h₇] │
├───────────────────────────┼───────────────────────────┤
y=0 │ [a₁+b₁+c₁+d₁+e₁+f₁+g₁+h₁] │ [a₅+b₅+c₅+d₅+e₅+f₅+g₅+h₅] │
└───────────────────────────┴───────────────────────────┘
x=0 x=1
RS complete. 8 fully-reduced chunks (Σ₀..Σ₇ where Σᵢ = aᵢ+bᵢ+cᵢ+dᵢ+eᵢ+fᵢ+gᵢ+hᵢ)
scattered across 8 ranks with no duplication.
Phases 4–6 (Z-AG → Y-AG → X-AG). Reverse the RS dim order. Each AG phase is a 2-ring pair-exchange with no reduction, doubling per-rank chunk count. Below, $\Sigma_i = a_i+b_i+c_i+d_i+e_i+f_i+g_i+h_i$ as in Phase 3.
Starting state for AG (Phase-3 state re-expressed using Σ notation):
Z = 0 slice:
┌──────────┬──────────┐
y=1 │ [Σ₂] │ [Σ₆] │
├──────────┼──────────┤
y=0 │ [Σ₀] │ [Σ₄] │
└──────────┴──────────┘
x=0 x=1
Z = 1 slice:
┌──────────┬──────────┐
y=1 │ [Σ₃] │ [Σ₇] │
├──────────┼──────────┤
y=0 │ [Σ₁] │ [Σ₅] │
└──────────┴──────────┘
x=0 x=1
Phase 4 (Z-AG). 4 concurrent 2-rings: {a,c}, {b,d}, {e,g}, {f,h}. Each Z-pair exchanges its single chunk — a sends Σ₀ to c, c sends Σ₁ to a; similarly for the other three pairs.
After Phase 4 (Z-AG) — chunk count 1 → 2 per rank:
Z = 0 slice:
┌──────────┬──────────┐
y=1 │ [Σ₂, Σ₃] │ [Σ₆, Σ₇] │
├──────────┼──────────┤
y=0 │ [Σ₀, Σ₁] │ [Σ₄, Σ₅] │
└──────────┴──────────┘
x=0 x=1
Z = 1 slice:
┌──────────┬──────────┐
y=1 │ [Σ₂, Σ₃] │ [Σ₆, Σ₇] │
├──────────┼──────────┤
y=0 │ [Σ₀, Σ₁] │ [Σ₄, Σ₅] │
└──────────┴──────────┘
x=0 x=1
Z=0 and Z=1 slices now hold identical chunk pairings — Z-AG copied each
single chunk to its Z-partner, so ranks sharing an (x, y) position match.
Phase 5 (Y-AG). 4 concurrent 2-rings: {a,e}, {b,f}, {c,g}, {d,h}. Each Y-pair exchanges its 2 chunks — a sends [Σ₀, Σ₁] to e, e sends [Σ₂, Σ₃] to a; similarly for the other three pairs.
After Phase 5 (Y-AG) — chunk count 2 → 4 per rank:
Z = 0 slice:
┌─────────────────────┬─────────────────────┐
y=1 │ [Σ₀, Σ₁, Σ₂, Σ₃] │ [Σ₄, Σ₅, Σ₆, Σ₇] │
├─────────────────────┼─────────────────────┤
y=0 │ [Σ₀, Σ₁, Σ₂, Σ₃] │ [Σ₄, Σ₅, Σ₆, Σ₇] │
└─────────────────────┴─────────────────────┘
x=0 x=1
Z = 1 slice:
┌─────────────────────┬─────────────────────┐
y=1 │ [Σ₀, Σ₁, Σ₂, Σ₃] │ [Σ₄, Σ₅, Σ₆, Σ₇] │
├─────────────────────┼─────────────────────┤
y=0 │ [Σ₀, Σ₁, Σ₂, Σ₃] │ [Σ₄, Σ₅, Σ₆, Σ₇] │
└─────────────────────┴─────────────────────┘
x=0 x=1
Both slices identical, and within each slice every x-column is uniform —
Y-AG merged the (x, z)-pairs, so the 4-chunk set depends only on x.
Phase 6 (X-AG). 4 concurrent 2-rings: {a,b}, {c,d}, {e,f}, {g,h}. Each X-pair exchanges its 4 chunks — a sends [Σ₀, Σ₁, Σ₂, Σ₃] to b, b sends [Σ₄, Σ₅, Σ₆, Σ₇] to a; similarly for the other three pairs.
After Phase 6 (X-AG) — chunk count 4 → 8 per rank. AR complete:
Z = 0 slice:
┌───────────────────────────────────┬───────────────────────────────────┐
y=1 │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │
├───────────────────────────────────┼───────────────────────────────────┤
y=0 │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │
└───────────────────────────────────┴───────────────────────────────────┘
x=0 x=1
Z = 1 slice:
┌───────────────────────────────────┬───────────────────────────────────┐
y=1 │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │
├───────────────────────────────────┼───────────────────────────────────┤
y=0 │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │ [Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇] │
└───────────────────────────────────┴───────────────────────────────────┘
x=0 x=1
Every rank a through h now holds the full reduced vector
[Σ₀, Σ₁, Σ₂, Σ₃, Σ₄, Σ₅, Σ₆, Σ₇].
AR complete: every rank holds $[\Sigma_0, \Sigma_1, \ldots, \Sigma_7]$. Total phase count: $2k = 6$; total sequential α-hops: $2 \sum_i (d_i - 1) = 6$, matching the §3.1 savings comparison.
Why AG reverses the RS dim order. The pairing is what makes the cost derivation telescope cleanly. Phase 1 (X-RS) shrinks chunks by $d_x$; the matching Phase 6 (X-AG) grows them back by $d_x$. Phase 2 (Y-RS) shrinks by $d_y$; the matching Phase 5 (Y-AG) grows by $d_y$. Phase 3 (Z-RS) / Phase 4 (Z-AG) pair similarly. Each RS–AG pair moves the same total bytes per rank, and the per-phase BW term depends only on the chunk size at that phase. Running AG in a different dim order would still produce correct output — each dim’s AG stage is a self-contained copy — but the chunk-size bookkeeping would not match up against the RS phases and the telescoping derivation below would be messier.
Cost formula — α-term derivation. Each dim-$i$ phase runs a $d_i$-rank ring RS (or AG), whose ring-primitive α-cost from 01_collective_algorithms.md §5.1 is $(d_i - 1)\,\alpha$: a $d_i$-rank ring takes $d_i - 1$ sequential neighbor-exchange steps to complete one full RS (or AG) traversal. The schedule runs $k$ RS phases followed by $k$ AG phases, each on its own dim’s ring, so the total sequential α-hops is the sum over all $2k$ phases:
$$n_\alpha = \underbrace{\sum_{i=1}^{k} (d_i - 1)}{\text{k RS phases}} \;+\; \underbrace{\sum{i=1}^{k} (d_i - 1)}_{\text{k AG phases}} \;=\; 2 \sum_i (d_i - 1)$$
Concrete counts — the index $i$ ranges over the $k$ dims (for a 3D torus, $i \in \{1, 2, 3\}$ with $d_1, d_2, d_3$ the per-axis sizes):
-
2×2×2 torus ($k = 3$, all $d_i = 2$): $$n_\alpha = \sum_{i=1}^{3}(2 - 1) \;+\; \sum_{i=1}^{3}(2 - 1) \;=\; 3 + 3 \;=\; 6$$ (3 RS hops + 3 AG hops; matches the §3.1 savings comparison.)
-
4×4×4 torus — TPU v4 / v5p single-rack cube ($N = 64$, $k = 3$, all $d_i = 4$): $$n_\alpha = \sum_{i=1}^{3}(4 - 1) \;+\; \sum_{i=1}^{3}(4 - 1) \;=\; 9 + 9 \;=\; 18$$
-
8×8×8 torus — TPU v4 / v5p multi-rack pod slice ($N = 512$, used as the canonical anchor in §5 and
05_contention_and_congestion.md§6): $$n_\alpha = \sum_{i=1}^{3}(8 - 1) \;+\; \sum_{i=1}^{3}(8 - 1) \;=\; 21 + 21 \;=\; 42$$ -
16×16×16 — TPU v4 / v5p full pod ($N = 4096$, $k = 3$, all $d_i = 16$): $$n_\alpha = \sum_{i=1}^{3}(16 - 1) \;+\; \sum_{i=1}^{3}(16 - 1) \;=\; 45 + 45 \;=\; 90$$
-
Asymmetric 16×16×4 ($k = 3$, $d_1 = d_2 = 16$, $d_3 = 4$): $$n_\alpha = \big[(16-1) + (16-1) + (4-1)\big] + \big[(16-1) + (16-1) + (4-1)\big] \;=\; 33 + 33 \;=\; 66$$
Note that phases are sequential across dims (X-RS waits for Y-RS to start; cannot overlap because the next phase’s input is the current phase’s output). The parallelism is within a phase — the $(N/d_i)$ independent dim-rings run concurrently on disjoint wiring (§3.1), which is what keeps the BW term bandwidth-optimal rather than inflating the α term.
Cost formula — BW-term derivation. The BW cost for a single dim-$i$ ring RS on a $d_i$-rank ring (from 01_collective_algorithms.md §5.1) is $\frac{d_i - 1}{d_i} \cdot \frac{M’}{\mathrm{BW}}$, where $M’$ is the per-rank input size at phase entry. Each rank sends $(d_i - 1)/d_i$ fraction of its data through its one outgoing link during that ring.
The subtlety: $M’$ shrinks from phase to phase because each RS phase reduces the chunk size by the factor $d_i$ of that phase’s ring. On a 3D torus running X-RS → Y-RS → Z-RS, chunk size evolves $M \to M/d_x \to M/(d_x d_y) \to M/(d_x d_y d_z) = M/N$. Per-RS-phase BW contribution:
| RS phase | Ring size | Per-rank input at entry | BW term |
|---|---|---|---|
| 1 (X-RS) | $d_x$ | $M$ | $\frac{d_x - 1}{d_x} \cdot \frac{M}{\mathrm{BW}}$ |
| 2 (Y-RS) | $d_y$ | $M / d_x$ | $\frac{d_y - 1}{d_y} \cdot \frac{M}{d_x \mathrm{BW}} = \frac{d_y - 1}{d_x d_y} \cdot \frac{M}{\mathrm{BW}}$ |
| 3 (Z-RS) | $d_z$ | $M / (d_x d_y)$ | $\frac{d_z - 1}{d_z} \cdot \frac{M}{d_x d_y \mathrm{BW}} = \frac{d_z - 1}{d_x d_y d_z} \cdot \frac{M}{\mathrm{BW}}$ |
Summing the three RS phases:
$$\text{RS BW term} \;=\; \frac{M}{\mathrm{BW}} \cdot \left( \frac{d_x - 1}{d_x} \;+\; \frac{d_y - 1}{d_x d_y} \;+\; \frac{d_z - 1}{d_x d_y d_z} \right)$$
Telescoping trick. Rewrite each summand using $(d - 1)/d = 1 - 1/d$:
$$\underbrace{\left(1 - \tfrac{1}{d_x}\right)}{\text{phase 1}} \;+\; \underbrace{\left(\tfrac{1}{d_x} - \tfrac{1}{d_x d_y}\right)}{\text{phase 2}} \;+\; \underbrace{\left(\tfrac{1}{d_x d_y} - \tfrac{1}{d_x d_y d_z}\right)}_{\text{phase 3}} \;=\; 1 - \tfrac{1}{d_x d_y d_z} \;=\; \frac{N - 1}{N}$$
Inner fractions cancel pairwise — $1/d_x$ from phase 1 cancels $1/d_x$ from phase 2’s positive side; $1/(d_x d_y)$ cancels between phase 2 and phase 3 likewise. What survives is $1 - 1/N$. So:
$$\text{RS BW term} \;=\; \frac{N - 1}{N} \cdot \frac{M}{\mathrm{BW}}$$
AG contribution — same derivation in reverse. AG phases run in reverse dim order (Z-AG → Y-AG → X-AG), each growing the chunk size by the paired dim’s factor. Ring AG on a $d_i$-rank ring has BW cost $\frac{d_i - 1}{d_i} \cdot \frac{M’}{\mathrm{BW}}$, identical to RS but with $M’$ now the per-rank output size at phase exit. Chunk size evolves $M/N \to M/(d_x d_y) \to M/d_x \to M$ across the three AG phases. Per-AG-phase BW contribution:
| AG phase | Ring size | Per-rank output at exit | BW term |
|---|---|---|---|
| 4 (Z-AG) | $d_z$ | $M / (d_x d_y)$ | $\frac{d_z - 1}{d_z} \cdot \frac{M}{d_x d_y \mathrm{BW}} = \frac{d_z - 1}{d_x d_y d_z} \cdot \frac{M}{\mathrm{BW}}$ |
| 5 (Y-AG) | $d_y$ | $M / d_x$ | $\frac{d_y - 1}{d_y} \cdot \frac{M}{d_x \mathrm{BW}} = \frac{d_y - 1}{d_x d_y} \cdot \frac{M}{\mathrm{BW}}$ |
| 6 (X-AG) | $d_x$ | $M$ | $\frac{d_x - 1}{d_x} \cdot \frac{M}{\mathrm{BW}}$ |
Summing the three AG phases:
$$\text{AG BW term} \;=\; \frac{M}{\mathrm{BW}} \cdot \left( \frac{d_z - 1}{d_x d_y d_z} \;+\; \frac{d_y - 1}{d_x d_y} \;+\; \frac{d_x - 1}{d_x} \right)$$
These are the same three summands as the RS sum (just indexed in reverse), so the same telescoping collapse applies. Using $(d - 1)/d = 1 - 1/d$:
$$\underbrace{\left(\tfrac{1}{d_x d_y} - \tfrac{1}{d_x d_y d_z}\right)}{\text{phase 4}} \;+\; \underbrace{\left(\tfrac{1}{d_x} - \tfrac{1}{d_x d_y}\right)}{\text{phase 5}} \;+\; \underbrace{\left(1 - \tfrac{1}{d_x}\right)}_{\text{phase 6}} \;=\; 1 - \tfrac{1}{N} \;=\; \frac{N - 1}{N}$$
Same pairwise cancellation as RS — $1/d_x$ between phases 5 and 6, $1/(d_x d_y)$ between phases 4 and 5, with $1$ and $-1/N$ surviving. So:
$$\text{AG BW term} \;=\; \frac{N - 1}{N} \cdot \frac{M}{\mathrm{BW}}$$
RS and AG are BW-symmetric by construction: each AG phase moves the same total bytes per rank as its paired RS phase (phase 4 mirrors phase 3, phase 5 mirrors phase 2, phase 6 mirrors phase 1). The byte-count match is why the two sums evaluate to the same $(N-1)/N \cdot M / \mathrm{BW}$. Combined:
$$\text{Total BW term} \;=\; \underbrace{\frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}}{\text{RS}} \;+\; \underbrace{\frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}}{\text{AG}} \;=\; 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
This is the flat-ring bandwidth-optimal floor, unchanged by dim-decomposition.
Final cost formula:
$$t_{\mathrm{torus,AR}} \;=\; 2 \sum_i (d_i - 1)\,\alpha \;+\; 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}, \quad N = \prod_i d_i$$
Concrete BW counts (AR BW term $= 2 \cdot (N-1)/N \cdot M / \mathrm{BW}$):
- 2×2×2 torus ($N = 8$): BW term $= 2 \cdot (7/8) \cdot M/\mathrm{BW} = (7/4) \cdot M/\mathrm{BW} \approx 1.75 \cdot M/\mathrm{BW}$.
- 8×8×8 torus ($N = 512$): BW term $= 2 \cdot (511/512) \cdot M/\mathrm{BW} \approx 1.996 \cdot M/\mathrm{BW}$.
- 16×16×16 TPU v5p ($N = 4096$): BW term $= 2 \cdot (4095/4096) \cdot M/\mathrm{BW} \approx 2 \cdot M/\mathrm{BW}$.
At large $N$ the factor $(N-1)/N \to 1$, so AR asymptotically costs $2M/\mathrm{BW}$ on BW — same as flat-ring AR’s BW bound.
You pay less α, the same BW as flat ring — that’s the entire arbitrage of dim-decomposition.
Commercial adoption. Dim-decomposed ring AR is the default AR primitive on production torus fabrics:
- Google TPU v4 / v5p — 3D torus up to $16 \times 16 \times 16$ per slice; XLA / JAX emit dim-decomposed ring RS+AG for every sharded reduction [TPU-V4].
- AWS Trainium2 — each Trn2 instance is a 2D NeuronLink torus of 16 chips; Trn2 UltraServer composes four instances into a 3D 64-chip torus via a Z-dim NeuronLink [TRN2-ARCH]. NeuronX CCL ships dim-decomposed ring AR as the default kernel [NEURON-CC].
Message-size-specialized variants and runtime-scheduler tuning are outside this note’s topology-mapping scope.
Floating-point limitation. Dim-decomposed AR reorders the floating-point sum — a rank adds partial sums coming from different dim-phases — which is not bit-identical to a flat-ring ordering. This is the one numerical caveat specific to AR; §3.5’s standalone AG drops it (no reduction, bit-copying is exact) while standalone RS inherits it.
The caveat is scoped to floating-point only. Integer reductions — quantized-gradient AR, histogram aggregation, counting workloads — are unaffected because two’s-complement integer addition is associative modulo $2^n$: any AR schedule (flat ring, dim-decomposed, tree, hierarchical) produces bit-identical output for integer inputs, with overflow (if any) occurring deterministically.
3.5 All-gather / reduce-scatter
Relation to §3.4. AG and RS are the two halves of dim-decomposed AR, so everything from §3.4 carries over directly — the X-phase / Y-phase decomposition, the dim-decomposed ring primitive (with the Rabenseifner-per-dim variant in Appendix A as a reference point), the disjoint-wiring parallelism argument (per §3.1), and the chunk-size telescoping.
Cost formula. Cost halves vs §3.4 because only one half (RS or AG) runs:
$$t_{\mathrm{torus,AG\,or\,RS,ring}} = \sum_i (d_i - 1)\,\alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
Commercial shipment. Dim-decomposed ring AG and RS ship on the same production torus stacks as the AR variant in §3.4: TPU via XLA / JAX, Trn2 via NeuronX CCL. The Rabenseifner-per-dim variant (Appendix A) is not shipped as an AG / RS kernel for the reasons discussed there.
Limitations. The §3.4 limitations (plus the Appendix A caveats if the Rabenseifner path is ever chosen) all apply verbatim with one relaxation: standalone AG has no reduction, so the floating-point non-associativity concern (§3.4’s “FP limitation” paragraph) drops out entirely — any AG schedule on any layout produces bit-identical output because bit-copying is exact. Standalone RS still reduces and inherits the AR floating-point behavior unchanged.
3.6 All-to-all
Relation to 01_collective_algorithms.md §7. The shipped algorithm is ring relay (§7.1) — §7.3’s selection rule pegs ring relay as the natural fit on bisection-limited fabrics precisely because not every rank pair has a direct fabric path, so shortest-arc forwarding through intermediate ranks is the only reachable schedule; pairwise direct-send (§7.2) requires the full bisection that torus lacks. The α-β cost $(N-1)\alpha + (N-1)/N \cdot M/\mathrm{BW}$ and the partner-selection / chunk-routing / pre- and post-rotation bookkeeping port over verbatim from §7.1. What changes on a torus is the delivered bandwidth. A2A has no RS+AG-style decomposition to hide behind (§7 opener makes this explicit: aggregate cross-fabric traffic is $(N-1) M$ bytes regardless of schedule), so unlike AR in §3.4 — where per-dim ring kept torus BW at the star level while only paying more α hops — A2A pays a hard bisection penalty on torus that no scheduling trick can recover on the BW side, while still enjoying the latency compression from dim-decomposition on the α side. Whether torus helps or hurts vs star therefore depends entirely on which term dominates at the target $(N, M, \alpha, \mathrm{BW})$.
How ring-relay A2A moves packets on a torus. Each rank has $N$ chunks of size $M/N$, one destined for each rank (including itself). Under ring-relay, each chunk travels from source to destination along a shortest-arc path through the torus lattice, hop by hop. Intermediate ranks forward chunks they don’t own without combining, reducing, or holding — pure point-to-point forwarding. Multiple chunks move concurrently: every link can carry an independent in-flight chunk on each direction of its full-duplex channel.
To make this concrete, label the 16 ranks of the 4×4 torus with letters a–p and write chunk $x{\to}Y$ for the chunk held by rank $x$ destined for rank $Y$:
4×4 torus — ranks labeled a–p (wraparound on every row and column):
↕ ↕ ↕ ↕ ↕ Y-wraparound per column:
┌──┴──┬──┴──┬──┴──┬──┴──┐ a ↔ m, b ↔ n, c ↔ o, d ↔ p
↔ y=3 │ m │ n │ o │ p │ ↔
├─────┼─────┼─────┼─────┤ ↔ X-wraparound per row:
↔ y=2 │ i │ j │ k │ l │ ↔ a ↔ d, e ↔ h, i ↔ l, m ↔ p
├─────┼─────┼─────┼─────┤
↔ y=1 │ e │ f │ g │ h │ ↔ (these wraparound edges are the
├─────┼─────┼─────┼─────┤ "seams" that distinguish torus
↔ y=0 │ a │ b │ c │ d │ ↔ from the same-shape k-D mesh
└──┬──┴──┬──┴──┬──┴──┬──┘ in §4.2, which has them cut)
↕ ↕ ↕ ↕
x=0 x=1 x=2 x=3
initial state: each rank x holds 16 chunks [x→a, x→b, x→c, ..., x→p],
one destined for each rank (including itself).
final state: each rank y holds 16 chunks [a→y, b→y, c→y, ..., p→y],
one received from each source.
Sample chunk journeys from rank a. Rank a has 15 chunks to dispatch (keeping $a{\to}a$ for itself). Four representative routes, showing the range of path lengths and how wraparound can serve as an equally-short alternative:
a→b direct east, 1 hop: a ──→ b (1 hop)
a→d via X-wraparound, 1 hop: a ═══ d (1 hop)
└─ X-wrap (a↔d), not a→b→c→d (3 hops)
a→c east, relay through b: a ──→ b ──→ c (2 hops)
step 1 step 2
a→k diameter, X then Y (direct): a ──→ b ──→ c ──→ g ──→ k (4 hops)
X X Y Y
a→k alternative via X-wraparound: a ══─► d ──→ c ──→ g ──→ k (4 hops)
X-wrap X Y Y
(also: a→e→f→g→k Y-first, a→b→f→j→k X-Y-X-Y, a→m→n→o→k via
Y-wrap + X hops, etc. — multiple equally-short paths exist.)
At step 1, rank a injects 4 chunks — one onto each of its 4 outbound links (east to b, west via wraparound to d, north to e, south via wraparound to m). At step 2, rank b has received $a{\to}b$ (keep it — destination reached) and also forwards $a{\to}c$ east, while rank a injects the next-hop chunks. The schedule continues; by step $\mathrm{diam} = 4$, every chunk has arrived. All 15 other ranks run the same protocol in parallel with their own 15 chunks each, so 240 chunks are in flight concurrently across the fabric.
From this mechanism the two cost terms split cleanly:
- α-term: every chunk takes at most $\mathrm{diam} = \sum_i \lfloor d_i / 2 \rfloor$ hops. All ranks forward concurrently, so the wall-clock α-cost is set by the longest single-chunk journey — not the sum of all chunk hops. For the 4×4: diameter = 4, so α-term = $4\alpha$. (Contrast §3.4’s ring RS, where α-cost was $(d - 1)\alpha$ per phase because chunks moved in lockstep one-hop-per-step; A2A ring-relay doesn’t step-sync.)
- BW-term: ring RS’s step-by-step summing (§3.4) relies on uniform per-step link loading, which A2A’s pipelined, asymmetric traffic doesn’t provide. Appendix D develops the replacement framework — bottleneck analysis (
re-use count × per-use time) — and works this case end-to-end. The structural fact for torus A2A is that cross-half traffic can only use the bisection cut links, so the cut’s aggregate capacity upper-bounds cross-half throughput, and since every cross-cut chunk uses exactly one cut edge the total byte-count is fixed by A2A’s semantics (not by scheduling).
Bisection cut for the 4×4 torus. Drawing a vertical line between columns $x = 1$ and $x = 2$ partitions the ranks into two equal halves, severing the X-axis links that connect left to right:
Bisection cut (│ marks severed X-links; ranks labeled as in the grid above):
│ │
y=3 m ─── n │ o ─── p │ (X-wrap m ↔ p crosses the cut)
y=2 i ─── j │ k ─── l │ (X-wrap i ↔ l crosses the cut)
y=1 e ─── f │ g ─── h │ (X-wrap e ↔ h crosses the cut)
y=0 a ─── b │ c ─── d │ (X-wrap a ↔ d crosses the cut)
│ │
x=0 x=1 x=2 x=3
Per row: 1 direct cut (between x=1 and x=2) + 1 wraparound cut = 2 severed links.
4 rows × 2 = 8 severed full-duplex links total; each at per-link BW.
Each left-half rank owns exactly one outbound cut link — 4 direct links at $x{=}1 \to x{=}2$ and 4 X-wraparound links at $x{=}0 \to x{=}3$ — so the cut has 8 severed edges total.
Deriving the BW term. Using a single-link view: pick any one cut link, count how many chunks re-use it over the phase, and multiply by the per-chunk transit time. Assuming ideal pipelining — each re-use back-to-back, with the link at 100% utilization — the phase ends when the link’s last occupant finishes its hop, so the phase time equals that link’s total busy-time. (By symmetry on a regular torus every cut link carries the same load, so any one link works.)
Focus on b→c (direct cut at $y=0$, L→R direction). Under uniform shortest-path routing, the 64 cross-cut chunks distribute across the 8 cut links → 8 chunks per cut link. The 8 chunks passing through b→c under X-first routing from row $y=0$ to the $x=2$ column:
| # | Chunk | Route | Total hops |
|---|---|---|---|
| 1 | $a \to c$ | $a \to b \to c$ | 2 |
| 2 | $a \to g$ | $a \to b \to c \to g$ | 3 |
| 3 | $a \to k$ | $a \to b \to c \to g \to k$ | 4 |
| 4 | $a \to o$ | $a \to b \to c \to o$ (Y-wrap on last hop) | 3 |
| 5 | $b \to c$ | $b \to c$ | 1 |
| 6 | $b \to g$ | $b \to c \to g$ | 2 |
| 7 | $b \to k$ | $b \to c \to g \to k$ | 3 |
| 8 | $b \to o$ | $b \to c \to o$ (Y-wrap on last hop) | 2 |
Each chunk uses b→c once for $M/(16\,\mathrm{BW})$. With 8 back-to-back re-uses:
$$t_{\mathrm{BW}} \;=\; 8 \cdot \frac{M}{16\,\mathrm{BW}} \;=\; \frac{M}{2\,\mathrm{BW}}$$
Appendix D.3 derives the same result via the equivalent aggregate whole-cut view (4M cross-cut bytes / $8\,\mathrm{BW}$) and contrasts both against per-chunk-per-step reasoning.
Generalizing to arbitrary torus shape. Extending the whole-cut view to arbitrary $d_i$, the BW term scales to $d_{\max}/8 \cdot M/\mathrm{BW}$. Appendix D.3 tracks how each of the three whole-cut quantities (severed-link count, cross-cut traffic, cut throughput) scales with $N$ and $d_{\max}$, and derives this formula.
General cost formula. Combining the BW term with the α-term (one full ring-relay traversal across the diameter):
$$t_{\mathrm{torus,A2A}} \;\approx\; \mathrm{diam} \cdot \alpha \;+\; \frac{d_{\max}}{8} \cdot \frac{M}{\mathrm{BW}}, \qquad \mathrm{diam} = \sum_i \lfloor d_i / 2 \rfloor$$
The bandwidth term scales with $d_{\max}$ — not $N$. On star (the pairwise-direct baseline from 01_collective_algorithms.md §7.2) the equivalent BW term is $(N-1)/N \cdot M/\mathrm{BW} \approx M/\mathrm{BW}$; the torus-vs-star penalty is therefore $d_{\max}/8$.
Aside — “bisection bandwidth” in other literature. Network-design texts often introduce bisection bandwidth as a named quantity — the total one-direction throughput across the worst-case equal-halves cut — and write the A2A BW term as $(NM/4) / \mathrm{BW_{bisect}}$. For a wraparound torus, $\mathrm{BW_{bisect}} = (2N/d_{\max}) \cdot \mathrm{BW}$, which is exactly what “severed-link count × per-link BW” evaluates to in the Appendix D.3 derivation. This note keeps the per-link BW as the only symbol and computes the cut throughput inline; readers encountering $\mathrm{BW_{bisect}}$ elsewhere should mentally map it to $(2N/d_{\max}) \cdot \mathrm{BW}$ on a torus, or $s \cdot N \cdot \mathrm{BW}$ with $s$ the oversubscription ratio on a fat-tree (
03_hierarchical_topologies.md§1).
Dim-decomposed A2A — alternative schedule. Running $k$ successive per-dim A2As (§3.1-style decomposition, with bidirectional ring-relay A2A as the per-dim primitive — each phase runs A2A on every $d_i$-ring in parallel, using dim-$i$ wiring alone) gives the approximate cost
$$t_{\mathrm{torus,A2A,decomp}} \;\approx\; \mathrm{diam} \cdot \alpha \;+\; \frac{\sum_i d_i}{8} \cdot \frac{M}{\mathrm{BW}}$$
— per-rank holding stays $M$ across phases (data redistributes without changing total per-rank volume), so each phase costs $\lfloor d_i/2 \rfloor \alpha + d_i/8 \cdot M/\mathrm{BW}$ and the $k$ phases sum sequentially. The α-term matches flat’s $\mathrm{diam} \cdot \alpha$; the BW-term is worse than flat’s $d_{\max}/8$ by factor $(\sum_i d_i) / d_{\max}$ — ${\sim}k\times$ on a cubic $k$-D torus — because dim-decomposed idles $k{-}1$ dims’ wiring per phase while flat uses all dims concurrently via shortest-arc routing.
Commercial shipment. Both flat ring-relay A2A (01_collective_algorithms.md §7.1, shortest-arc forwarding on a linearized rank order) and the dim-decomposed schedule above ship on TPU and Trainium as the MoE expert-dispatch / expert-combine primitive. The bisection penalty is the dominant design pressure: TPU v4’s optical circuit switches [TPU-V4] exist partly so that operators can reshape the torus slice into the most cubic layout that matches the MoE’s expert count, minimizing $d_{\max}$. Trainium’s Trn2 topology [TRN2-ARCH] is fixed at 2D per instance + Z-dim across instances — roughly cubic for 64-chip UltraServers — by the same reasoning. MoE workloads that outgrow a single square slice and must span multiple pods see sharp A2A-cost cliffs, which is why production MoE training/inference pipelines pin expert count to match the slice’s natural dimensions whenever possible.
Limitations (new vs §3.4 / §3.5).
- No decomposition escape for the BW term. The dim-decomposed AR from §3.4 kept torus BW at star-level by routing each per-dim phase over disjoint wiring. For A2A, dim-decomposition is strictly worse than flat (${\sim}k\times$ on BW, as derived above) — splitting into per-dim A2As idles most of the fabric each phase, and the total cross-bisection aggregate is fixed by the collective’s semantics regardless of schedule. The $d_{\max}/8$ penalty of flat A2A is architectural, not algorithmic.
- Layout sensitivity is severe. Unlike AR (§3.4 note), A2A cost scales with $d_{\max}$ rather than $\sum (d_i - 1)$. A $16 \times 8 \times 4$ layout is 2× worse than $8 \times 8 \times 8$ for the same $N = 512$; a $32 \times 4 \times 4$ layout is 4× worse. Operators tuning MoE on torus spend real effort finding the most cubic viable slice.
What mitigates the bisection penalty. Two mitigations actually move the A2A cost on torus:
- Shape choice — minimize $d_{\max}$. Pick the most cubic layout at the given rank count (cubic > skinny; higher-D > lower-D). For $N = 512$, moving from $32 \times 16$ (2D, $d_{\max} = 32$) to $8 \times 8 \times 8$ (3D, $d_{\max} = 8$) is a 4× BW improvement for free. On TPU v4 / v5p this is a job-launch decision because OCS [TPU-V4] can reconfigure the physical slice shape (and dimensionality) between jobs; on fixed-topology systems like Trainium [TRN2-ARCH] it’s baked into the hardware generation.
- Per-link ICI bandwidth bumps. Each TPU / Trainium generation raises per-link $\mathrm{BW}$, shrinking the absolute BW term even though the $d_{\max}/8$ coefficient is unchanged.
4. Mesh topology
Mesh is the direct-wiring complement to star and the wraparound-free complement to torus. Two regimes use the word “mesh” interchangeably in the literature: full mesh (§4.1), where every rank-pair has a dedicated link, and $k$-D mesh (§4.2), a torus with the wraparound edges removed. Cost behavior can be summarized as “star minus the switch α” for full mesh and “torus minus the wraparound bisection” for $k$-D mesh — this section makes each precise.
4.1 Full mesh
Full mesh replaces the central switch with $N(N-1)/2$ direct edges, one per rank pair. Every collective primitive runs at its star cost with one adjustment: the per-hop α drops the switch cut-through term and becomes pure endpoint-to-endpoint link latency. On chiplet-scale copper this is ~100–300 ns per hop (vs NVSwitch’s ~500 ns cut-through); at board-scale NVLink it is comparable to star’s α because the PHY cost dominates. No switch ASIC means no in-network reduction path — SHARP / NVLS are structurally unavailable — and no switch multicast, so AG and BC cannot benefit from switch-resident replication.
Algorithm coverage. Full mesh has any-to-any direct links between every rank pair, so every logical edge in any schedule from 01_collective_algorithms.md lands on exactly one physical link. All software algorithms, such as ring, binomial tree, DBT, rec-doubling / rec-halving, Rabenseifner, pairwise direct-send, Bruck, and PAT, run at their pure α-β costs, with the single substitution $\alpha \to \alpha_{\mathrm{link}}$. The lone exception is the INC paths (SHARP / NVLS / switch multicast): full mesh has no switch ASIC, so the in-network formulas from 01_collective_algorithms.md §3.3 / §4.3 / §5.3 do not apply.
Example — ring AR. The 01_collective_algorithms.md §5.1 ring-AR cost
$$t_{\mathrm{ring\,AR}} = 2(N-1)\,\alpha + 2\,\frac{N-1}{N}\cdot\frac{M}{\mathrm{BW}}$$
maps onto full mesh as
$$t_{\mathrm{mesh,\,ring\,AR}} = 2(N-1)\,\alpha_{\mathrm{link}} + 2\,\frac{N-1}{N}\cdot\frac{M}{\mathrm{BW}}$$
Same formula; only α changes (switch cut-through term drops, leaving pure PHY-to-PHY link latency). Every other algorithm substitutes the same way — no topology correction beyond this α calibration.
In the pure α-β model, A2A gains essentially nothing over star: NVSwitch-class star already hits the per-port BW bound via pairwise direct-send, and removing the switch at most recovers the cut-through time. Note the $(N-1)/N$ coefficient is serial over each rank’s single outbound α-β port-slot per step — the $N-1$ dedicated wires don’t dissolve the per-rank port budget in the pure α-β model, so the formula has the same serialized shape as star’s. The structural advantage of full mesh is not cost-theoretic; it is that dedicated per-pair links carry no inter-pair contention — which matters under concurrent collective groups (scored in 05_contention_and_congestion.md).
Commercial shipment — where full mesh actually appears. Three regimes dominate:
- Chiplet interposer (UCIe, EMIB, HBM base-die): small $N$ (2–8 chiplets), direct PHY-to-PHY links on silicon / organic substrate; no switch in the package.
- Legacy pre-NVSwitch NVLink (DGX-1 / DGX-2 hybrid cube-mesh): 8-way or 16-way NVLink-1/2 used hybrid cube-mesh wiring where each GPU linked directly to a subset of peers. NVSwitch (NVLink-3+) replaced this because full mesh’s $O(N^2)$ wire count stops scaling past ~16 endpoints.
- PCIe peer-to-peer inside a root complex: GPUs DMA to each other over per-pair PCIe lanes; an informal full-mesh analog, though traffic in practice still traverses a root-complex switch.
Limitations.
- Scalability wall at $O(N^2)$ wires. Each added rank needs $N$ new links; total wire count grows quadratically. Practical limit is 8–16 ranks on any substrate; beyond that, switches win on wire-count economics.
- No switch means no in-network primitive. The $n_\alpha \to 2$ collapse via NVLS / SHARP is unavailable; switch-multicast-accelerated AG is unavailable. Full mesh is the “purest” α-β fabric in that sense — every cost reduces to the analytical formula with no hardware-side shortcut.
4.2 $k$-D mesh
A $k$-D mesh is a torus without the wraparound edges. The torus neighbor formula $\sum_i \min(d_i - 1, 2)$ (§1.2) becomes the interior-rank ceiling that only ranks not on any boundary reach; a rank at position 0 or $d_i - 1$ on any axis with $d_i \geq 3$ has 1 neighbor on that axis instead of 2 (wraparound missing). Axes with $d_i \leq 2$ behave the same in mesh and torus (wraparound was already degenerate at $d_i = 2$ or absent at $d_i = 1$). Concretely: for a 3×3×3 mesh, the body-center rank (1, 1, 1) still has 6 neighbors, a face-center rank like (1, 1, 0) has 5, an edge-midpoint like (0, 1, 0) has 4, and a corner like (0, 0, 0) has only 3 — one per non-boundary direction. The dim-decomposition argument from §3.1 carries over verbatim — each dim’s RS/AG runs on a $d_i$-rank open line rather than a $d_i$-rank closed ring.
Algorithm coverage — what transfers from torus (§3). The dim-decomposition framework (§3.1) and every per-primitive derivation in §3.2–§3.6 port over directly with “open $d_i$-line” substituted for “$d_i$-ring” as the per-dim inner primitive. Bucket-brigade RS/AG on an open line has the same α and BW as ring RS/AG (chunk-size telescoping is identical — each rank still sends $(d_i - 1)/d_i$ of its chunk per half); pipelined per-dim BC/Reduce inherit the same $\sum_i (d_i - 1)\alpha + M/\mathrm{BW}$ form. So the unidirectional torus formulas transfer verbatim:
$$t_{\mathrm{mesh,AR}} = 2\sum_i (d_i - 1)\,\alpha + 2\,\frac{N-1}{N}\cdot\frac{M}{\mathrm{BW}}, \qquad t_{\mathrm{mesh,BC}} = t_{\mathrm{mesh,Reduce}} = \sum_i (d_i - 1)\,\alpha + \frac{M}{\mathrm{BW}}$$
(AG / RS are half of AR.) Like torus, mesh is link-only — no switch ASIC, so the INC paths from 01 §3.3 / §4.3 / §5.3 are unavailable.
What does not transfer. Two effects distinguish mesh from torus:
-
Bisection halves → A2A BW doubles. The bisection cut along the longest dim severs $N/d_{\max}$ links — half of the torus’s $2N/d_{\max}$ (wraparound edges are missing). The flat A2A BW term rises from torus’s $d_{\max}/8 \cdot M/\mathrm{BW}$ to $d_{\max}/4 \cdot M/\mathrm{BW}$:
$$t_{\mathrm{mesh,A2A}} \approx \mathrm{diam}\cdot\alpha + \frac{d_{\max}}{4}\cdot\frac{M}{\mathrm{BW}}, \qquad \mathrm{diam} = \sum_i (d_i - 1)$$
Mesh A2A is consistently 2× worse than same-shape torus A2A. At cubic $8 \times 8 \times 8$ the penalty is $d_{\max}/4 = 2\times$ vs star’s $\approx M/\mathrm{BW}$; a 2D $32 \times 16$ mesh pays $8\times$, vs 2D torus’s $4\times$. The dim-decomposed A2A alternative (§3.6-style) inherits the same 2× cut penalty per phase: $t_{\mathrm{mesh,A2A,decomp}} \approx \mathrm{diam}\cdot\alpha + (\sum_i d_i)/4 \cdot M/\mathrm{BW}$.
-
Bidirectional BC/Reduce α-halving doesn’t apply cleanly. Torus’s bidirectional ring BC/Reduce (§3.2/§3.3) cuts per-dim α to $\lfloor d_i/2 \rfloor$ by radiating through both ring directions — including the wrap. On an open line the wrap is absent: a source at one end (typical for mesh BC) reaches the far end only via unidirectional forward, so α stays at $(d_i - 1)$ per dim. A source at the middle can use both directions concurrently to achieve $\lceil (d_i-1)/2 \rceil$ α, but that’s a source-placement optimization, not a substrate property. The formulas above use the worst-case unidirectional α.
Scheduling caveat on realized BW. The α-β formulas assume bidirectional pipelining that keeps every link fully utilized — matching the ring case’s uniform per-link load. A naïve unidirectional schedule on an open line can realize up to 2× worse BW because middle links bottleneck while endpoint links idle; well-tuned implementations hit the formula, but worst-case-bounded cost models should apply a factor up to 2 on the BW term when scheduling discipline is not guaranteed.
Commercial shipment. $k$-D mesh appears primarily as an implementation detail rather than a top-level AI-fabric choice: Intel’s Xeon Phi KNL used a 2D mesh-on-die for its CHA tile interconnect; some pre-production Trainium / TPU prototypes used meshes before upgrading to torus; chiplet-scale HBM base-die interconnects often use small 2×2 or 4×4 meshes because on-silicon wraparound wiring is expensive and dim sizes are tiny. In production large-scale AI fabrics, torus dominates — the wraparound cost is small (~$N$ extra links) and the 2× bisection payoff is large.
Limitations.
- 2× bisection penalty vs same-shape torus. Applies to A2A directly and, under concurrent traffic, to any primitive that saturates bisection. For pure AR/AG/RS the α-β formula matches torus, but the real link-load skew (middle-heavy on open lines) erases part of that at high utilization.
- Dim-decomposition still applies unchanged. Ring-per-dim becomes line-per-dim. The correctness proof is identical; only the per-dim endpoint behavior changes.
5. Summary and limitations
5.1 Cost summary by topology
Cost table for each (topology, primitive) pair at its dominant shipped algorithm under the ideal α-β model (no contention, uniform per-link BW, uniform per-hop $\alpha$; torus / k-D mesh rows assume contiguous dim-aligned group placement). All BW terms assume asymptotic pipelining (the $P \to P^*$ limit from 01_collective_algorithms.md Appendix C).
| Topology | Primitive | Algorithm | Latency term | BW term | Shipped by |
|---|---|---|---|---|---|
| Star (§2) | BC | Binomial tree (pipelined) | $\lceil \log_2 N \rceil\,\alpha$ | $M/\mathrm{BW}$ | NCCL (default BC path) |
| Reduce | Binomial tree (pipelined) | $\lceil \log_2 N \rceil\,\alpha$ | $M/\mathrm{BW}$ | NCCL (default Reduce path) | |
| AR | Double binary tree (pipelined) | $2\,\lceil \log_2 N \rceil\,\alpha$ | $M/\mathrm{BW}$ | NCCL (small–mid $M$) | |
| AR | Ring | $2(N{-}1)\,\alpha$ | $2(N{-}1)/N \cdot M/\mathrm{BW}$ | NCCL (bulk $M$) | |
| AG / RS | Ring | $(N{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | NCCL | |
| A2A | Pairwise direct-send | $(N{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | NCCL | |
| Torus (§3) | BC | Dim-decomp ring (bidirectional) | $\sum_i \lfloor d_i/2 \rfloor\,\alpha$ | $M/\mathrm{BW}$ | TPU, Trainium |
| Reduce | Dim-decomp ring (bidirectional) | $\sum_i \lfloor d_i/2 \rfloor\,\alpha$ | $M/\mathrm{BW}$ | TPU, Trainium | |
| AR | Dim-decomp ring | $2 \sum_i (d_i{-}1)\,\alpha$ | $2(N{-}1)/N \cdot M/\mathrm{BW}$ | TPU (XLA / JAX), Trainium (NeuronX CCL) | |
| AG / RS | Dim-decomp ring | $\sum_i (d_i{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | TPU, Trainium | |
| A2A | Ring relay (bisection-bound) | $\mathrm{diam}\cdot\alpha$, $\mathrm{diam} = \sum_i \lfloor d_i/2 \rfloor$ | $(d_{\max}/8) \cdot M/\mathrm{BW}$ | TPU, Trainium | |
| Full mesh (§4.1) | BC | Binomial tree (pipelined) | $\lceil \log_2 N \rceil\,\alpha_{\mathrm{link}}$ | $M/\mathrm{BW}$ | Chiplet interposer (UCIe), DGX-1/2 hybrid cube-mesh (legacy), PCIe P2P |
| Reduce | Binomial tree (pipelined) | $\lceil \log_2 N \rceil\,\alpha_{\mathrm{link}}$ | $M/\mathrm{BW}$ | same | |
| AR | Double binary tree (pipelined) | $2\,\lceil \log_2 N \rceil\,\alpha_{\mathrm{link}}$ | $M/\mathrm{BW}$ | same | |
| AR | Ring | $2(N{-}1)\,\alpha_{\mathrm{link}}$ | $2(N{-}1)/N \cdot M/\mathrm{BW}$ | same | |
| AG / RS | Ring | $(N{-}1)\,\alpha_{\mathrm{link}}$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | same | |
| A2A | Pairwise direct-send | $(N{-}1)\,\alpha_{\mathrm{link}}$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | same | |
| k-D mesh (§4.2) | BC | Dim-decomp open-line | $\sum_i (d_i{-}1)\,\alpha$ | $M/\mathrm{BW}$ | Chiplet HBM base-die, Xeon Phi KNL tile mesh |
| Reduce | Dim-decomp open-line | $\sum_i (d_i{-}1)\,\alpha$ | $M/\mathrm{BW}$ | same | |
| AR | Dim-decomp open-line | $2 \sum_i (d_i{-}1)\,\alpha$ | $2(N{-}1)/N \cdot M/\mathrm{BW}$ | same | |
| AG / RS | Dim-decomp open-line | $\sum_i (d_i{-}1)\,\alpha$ | $(N{-}1)/N \cdot M/\mathrm{BW}$ | same | |
| A2A | Ring relay (bisection halved) | $\mathrm{diam}\cdot\alpha$, $\mathrm{diam} = \sum_i (d_i{-}1)$ | $(d_{\max}/4) \cdot M/\mathrm{BW}$ | same |
Six observations that the rest of this series builds on:
- Torus preserves BW-optimality for AR / AG / RS but pays in α. The torus ring entries keep the same $(N{-}1)/N$ (or $2(N{-}1)/N$) BW coefficient as their star counterparts — dim-decomposition routes each phase over disjoint per-dim copper, so no BW is wasted. The torus-vs-star gap is entirely in the α term: $\sum (d_i{-}1)$ vs star’s $\log_2 N$ (DBT) or $N{-}1$ (ring). Production uses ring on both sides because the α gap is small once $M$ is large.
- BC and Reduce hit the $M/\mathrm{BW}$ BW ceiling under pipelining, on every topology. The entire topology-vs-topology gap for BC / Reduce lives in the α term: $\log_2 N$ (star tree), $\sum_i \lfloor d_i/2 \rfloor$ (bidirectional torus), $\sum_i (d_i{-}1)$ (open-line k-D mesh or unidirectional torus). The wraparound halves torus’s α relative to same-shape k-D mesh; star’s log-depth α beats both at any $N$ the switch can accommodate.
- Torus pays a hard BW penalty on A2A under asymmetric shapes. The $d_{\max}/8 \cdot M/\mathrm{BW}$ BW term has no algorithmic escape — it’s set by the bisection cut, not by the schedule. At cubic $N^{1/k}$ per dim the penalty is $1\times$ (torus matches star’s BW term exactly), but it scales linearly with $d_{\max}$ and reaches $32\times$ at $256 \times 2 \times 2$ — which is why torus pods aggressively reshape slices toward cubic layouts for MoE workloads.
- k-D mesh is torus with a 2× A2A penalty. For AR / AG / RS / BC / Reduce, k-D mesh’s open-line substitutes for torus’s ring with no change to the unidirectional α form ($\sum (d_i{-}1)\alpha$ either way) and no change to BW ($(N{-}1)/N \cdot M/\mathrm{BW}$ telescoping still holds on bucket brigade). For A2A, missing wraparound edges halve the bisection cut, so the BW term is $d_{\max}/4$ vs torus’s $d_{\max}/8$ — exactly $2\times$ worse.
- Tree-flavored algorithms ship on star but not on torus / mesh. DBT is NCCL’s default on switched fabrics and is selected by the tuner for small-$M$ AR; the structurally analogous dim-decomposed Rabenseifner variant on torus (Appendix A) is absent from both TPU and Trainium runtime stacks because the 1.5–4× α compression at production dim sizes $d_i \in \{4, 8, 16\}$ rarely beats the simpler ring kernel in practice. This is a fabric-economics decision, not a correctness one.
- Star has an additional α-compression escape hatch via in-network collectives — direct-wired topologies (torus, full mesh, k-D mesh) do not. Within a switched fabric, the reduction operation can move into the switch ASIC itself: switch-resident ALUs reduce flits on the fly and multicast the result back, collapsing $n_\alpha$ from $\log_2 N$ (DBT) all the way to $2$, independent of $N$. NVLS (NVSwitch), Quantum SHARP (InfiniBand fat-tree / Clos), and Tomahawk Ultra INC (Ethernet) all exploit this. Direct-wired topologies have no analogous path — the $N$-dependent $\alpha$ term comes from neighbor-router hops, which cannot be collapsed by moving the reduce into a central switch that does not exist. See
04_in_network_collectives.mdfor the full mechanism and cost model.
5.2 Limitations
Every cost formula above assumes the collective runs alone with perfect link-level scheduling. Real deployments break this in four ways; 05_contention_and_congestion.md covers each with coefficient models.
- Concurrent collective groups. DP replicas simultaneously issue TP all-reduces. On a star, non-overlapping port subsets handle them trivially. On torus, if all replicas share a common dim, they contend for the same physical links. Best-case torus assumes this doesn’t happen.
- Off-prefix group layouts. Torus dim-decomp only hits $2 \sum(d_i - 1)$ hops when the group ranks are contiguous along dim prefixes. A scatter-pattern allocator (typical in shared clusters) produces groups whose ranks are spread arbitrarily — the physical hop count can be 2–4× the ideal, and the formula falls back to flat-ring bound.
- Skewed A2A traffic. Real MoE routing shows 3–10× skew between hot and cold experts. The uniform-bisection bound underestimates tail latency. Star shrugs this off (every port has the same BW); torus pays — hot-expert destinations concentrated along one dim saturate that dim’s bisection well before the uniform bound predicts.
- Mixed traffic classes. Gradient AR, activation AG, optimizer RS, KV cache streaming, and control messages compete for the same fabric — the specifics depend on workload (training vs inference, dense vs MoE), but the structural point is the same: wormhole / cut-through routing keeps the BW per message stable at peak utilization but adds queuing delay at high saturation.
Also worth noting: everything above assumes the software picks the right algorithm for the fabric. Any dispatch layer (NCCL, MPI, a custom tuner) that maps a fabric to its matching algorithm delivers these costs; one that forces flat-ring AR on a torus will see the flat-ring cost, not the dim-decomp cost — the formula follows the algorithm, not the wiring.
Appendix A: Dim-decomposed Rabenseifner halving-doubling
This appendix preserves the full derivation of the tree-flavored torus AR variant for reference. It is not shipping on any production torus stack — neither XLA / JAX on TPU nor NeuronX CCL on Trainium chooses it — and the §3.4 main-text discussion selects dim-decomposed ring instead. The material here exists so the comparison remains complete: it sharpens why ring-per-dim wins on torus by showing exactly what the Rabenseifner-per-dim alternative gives up.
Relation to §3.1. The compositional framework is identical: run $k$ serial dim-phases for RS, then $k$ more for AG, with $(N/d_i)$ concurrent rings per phase on disjoint copper. The only swap is the inner per-dim primitive — replace each dim’s $d_i$-rank ring RS (or AG) with the $d_i$-rank recursive halving-doubling schedule from 01_collective_algorithms.md Appendix B.2 (AR) / Appendix B.4 (standalone RS or AG) (power-of-2 $d_i$ required; for non-power-of-2 $d_i$ the schedule reduces to ring or to a hybrid form). The dim-decomposition argument itself — associativity of reduction, chunk-size telescoping across phases, concurrent ring execution on disjoint copper — is unchanged.
Why swap the inner primitive? Recursive halving-doubling takes $\lceil \log_2 d_i \rceil$ α hops per dim-phase instead of $(d_i - 1)$ for ring, while keeping the same $(d_i - 1)/d_i \cdot \mathrm{chunk}/\mathrm{BW}$ bandwidth term per phase (because each rank still sends $(d_i - 1)/d_i$ of its current chunk, just in $\log_2 d_i$ bursts of doubling size rather than $d_i - 1$ bursts of fixed size). The per-dim α compression is modest — $d_i = 4 \to 2$ hops vs 3 for ring (1.5×); $d_i = 8 \to 3$ vs 7 (2.3×); $d_i = 16 \to 4$ vs 15 (3.75×) — but compounds across dims.
Worked example — 4×4 torus with Rabenseifner per dim. Same 2D grid as §3.4, but each row/column now runs halving-doubling instead of ring. On $d_x = 4$ the X-phase RS runs $\lceil \log_2 4 \rceil = 2$ halving-doubling steps per row (step 1: exchange halves with partner 2 away; step 2: exchange quarters with partner 1 away), giving $n_\alpha = 2$ per phase versus ring’s $n_\alpha = 3$. The bandwidth telescoping matches ring’s exactly — after the X-phase RS each rank holds $M/4$ bytes, after Y-phase RS $M/16$ — because halving-doubling’s BW coefficient $(D-1)/D$ is identical to ring’s for the same chunk math. Total for 2D AR: $n_\alpha = 2 \cdot 2 \cdot 2 = 8$ versus ring’s $2 \cdot 2 \cdot 3 = 12$, with identical BW term.
Cost formula. Summing $\lceil \log_2 d_i \rceil$ α hops per dim per half (RS or AG) and applying the same BW telescoping as §3.4:
$$t_{\mathrm{torus,AR,Rab}} \;=\; 2 \sum_i \lceil \log_2 d_i \rceil\,\alpha + 2 \cdot \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}, \quad N = \prod_i d_i$$
For the AG / RS halves:
$$t_{\mathrm{torus,AG\,or\,RS,Rab}} = \sum_i \lceil \log_2 d_i \rceil\,\alpha + \frac{N-1}{N} \cdot \frac{M}{\mathrm{BW}}$$
Latency compression at $N = 512$. For the $8 \times 8 \times 8$ layout: $\sum \lceil \log_2 d_i \rceil = 9$ hops, so $n_\alpha = 18$ for AR versus dim-decomp ring’s 42 (2.3× compression) and flat ring’s 1022 (57× compression). The marginal step from ring-per-dim (42) to Rabenseifner-per-dim (18) saves 24 α hops; at $\alpha = 0.5\,\mu$s that is 12 μs — meaningful at very small $M$, negligible once the BW term crosses a few tens of μs.
Why it does not ship on production torus fabrics. Four constraints together:
- Power-of-2 dim sizes required. The halving-doubling schedule assumes $d_i$ is a power of 2 along each active dim. TPU v4 / v5p pods configured via OCS can choose $d_i \in \{2, 4, 8, 16\}$ per dim — compatible — but Trainium’s fixed 2D-plus-Z 64-chip shape bakes in $d_i = \{4, 4, 4\}$. Any dim mismatch forces fallback to ring-per-dim for that dim, eroding the compression.
- Small per-dim $d_i$ limits the α savings. At $d_i = 4$ the α compression is 1.5× (3 hops → 2); at $d_i = 8$ it is 2.3×. The $\log_2 d_i$ advantage only widens for large $d_i$, but production dim sizes top out around 16 because larger dims degrade A2A (§3.6 bisection-penalty scales as $d_{\max}$).
- Kernel complexity dominates the α savings at bulk $M$. Each halving-doubling step has butterfly-pattern neighbor exchange with doubling chunk size; mapping this onto the torus’s fixed $2k$-neighbor wiring requires per-step chunk recomputation on the chip’s reduce-engine. The ring kernel uses a single static schedule. For the $M \geq$ few hundred KB bulk regime where production workloads live, the BW term dominates and the extra hops the Rabenseifner variant saves do not pay for the kernel-complexity overhead.
- α term is not on the critical path once BW dominates. At $N = 512$, $\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$, and $M = 16\,\mathrm{MB}$, the ring-per-dim AR cost is 21 μs α + 35.5 μs BW = 57 μs total; Rabenseifner-per-dim trims this to 9 μs α + 35.5 μs BW = 45 μs — a 21% saving on paper that is routinely erased by the kernel-complexity overhead in point 3.
This is a fabric-economics and software-simplicity decision, not a correctness one: dim-decomposed Rabenseifner-per-dim produces the same numerical result (up to floating-point ordering) as dim-decomposed ring, and either could be implemented on the same hardware. Production stacks ship the simpler ring-per-dim kernel because its α disadvantage is small at production dim sizes and its BW coefficient is exactly the same.
Limitations. All of §3.4’s floating-point-associativity, dim-aligned-group-layout, and wraparound-wiring caveats apply verbatim. The per-dim halving-doubling schedule adds the power-of-2 $d_i$ constraint on top.
Appendix B: N, D, d — rank count, diameter, and dim sizes
Three symbols recur throughout the torus / mesh cost formulas in §3–§4: N (total rank count), D (diameter), and $d_i$ (per-axis dim size). They are related but distinct — N and D are both derived from the set of $d_i$ values, but each plays a different role in the cost formulas. This appendix defines each, shows how they connect, and works through examples.
B.1 Definitions
-
$d_i$ — dim size along axis $i$. The number of ranks sitting on axis $i$ of a k-D lattice. For a k-D torus or mesh, the collection $(d_1, d_2, \ldots, d_k)$ is the shape. A 4×3×2 lattice has $d_1 = 4$, $d_2 = 3$, $d_3 = 2$.
- Symmetric: all $d_i$ equal — write $d^k$ or “$d \times d \times d$ …”. TPU v5p’s $16 \times 16 \times 16$ has $d = 16$.
- Asymmetric: $d_i$ differ — e.g., $16 \times 16 \times 4$.
- $d_{\max} = \max_i d_i$: the largest dim size, which sets the A2A bisection penalty (§3.6).
-
$N$ — total rank count. For a k-D lattice, $N$ is the product of dim sizes:
$$N = \prod_i d_i = d_1 \cdot d_2 \cdots d_k$$
Pick the shape, multiply the $d_i$s, get $N$.
-
$D$ — diameter. The hop count between the farthest rank pair (worst-case shortest path). A single scalar describing the fabric, derived from the $d_i$ values via topology-specific formulas:
Topology Diameter formula Torus (wraparound, §3) $D = \sum_i \lfloor d_i / 2 \rfloor$ k-D mesh (no wraparound, §4.2) $D = \sum_i (d_i - 1)$ Star (single switch, §2) $D = 2$ Full mesh (§4.1) $D = 1$ 2-tier leaf-spine fat-tree ( 03_hierarchical_topologies.md§1)$D = 4$ 3-tier Clos ( 03_hierarchical_topologies.md§1)$D = 6$
B.2 Connecting the three
Pick the shape first; both $N$ and $D$ fall out:
shape = (d₁, d₂, ..., dₖ)
│
├──── product ────────→ N (total ranks; grows as dᵏ in k-D)
│
└──── sum over axes ──→ D (diameter; grows linearly in k)
per-axis worst-case hops:
torus: ⌊dᵢ / 2⌋
mesh: dᵢ − 1
Per-axis worst case:
- Torus axis (ring of $d$ ranks): walking to the opposite rank takes $\lfloor d / 2 \rfloor$ hops in either direction (wraparound gives the shortcut).
- Mesh axis (open line of $d$ ranks): walking end-to-end takes $d - 1$ hops; no shortcut.
Sum over axes: routing factors across axes (dim-independent routing), so the worst-case pair’s hop count is the sum of per-axis worst cases.
B.3 Worked example — 3×3 torus
Concrete small case: $d = 3$ on both axes, $k = 2$.
- Shape: (3, 3).
- $N$ = 3 · 3 = 9 ranks.
- $D$ (torus) = $\lfloor 3/2 \rfloor + \lfloor 3/2 \rfloor$ = 1 + 1 = 2 hops.
- $D$ (mesh, same shape) = (3−1) + (3−1) = 4 hops.
Trace worst-case from rank (0, 0) in the torus:
3 × 3 torus (from §1.2, d = 3 per axis):
↕ ↕ ↕
↔ (0,2)━(1,2)━(2,2) ↔
┃ ┃ ┃
↔ (0,1)━(1,1)━(2,1) ↔ Hop count from (0, 0) to each other rank:
┃ ┃ ┃
↔ (0,0)━(1,0)━(2,0) ↔ (1, 0) = 1 (2, 0) = 1 (X-wrap shortcut)
↕ ↕ ↕ (0, 1) = 1 (0, 2) = 1 (Y-wrap shortcut)
(1, 1) = 2 (2, 1) = 2
(1, 2) = 2 (2, 2) = 2
Worst case: 2 hops → D = 2 ✓
On a 3 × 3 mesh (wraparound removed), (0, 0) ↔ (2, 2) takes 2 + 2 = 4 hops — no X-wrap or Y-wrap shortcut.
B.4 Same $N$, different shapes
One rank count $N$ admits many shapes, each with its own diameter:
| Shape | $(d_1, \ldots, d_k)$ | k | $N$ | $D$ (torus) | $D$ (mesh) |
|---|---|---|---|---|---|
| 1D 512-ring | (512) | 1 | 512 | $\lfloor 512/2 \rfloor$ = 256 | 511 |
| 2D 32 × 16 | (32, 16) | 2 | 512 | 16 + 8 = 24 | 31 + 15 = 46 |
| 3D 8 × 8 × 8 | (8, 8, 8) | 3 | 512 | 4 + 4 + 4 = 12 | 7·3 = 21 |
| 4D 8 × 4 × 4 × 4 | (8, 4, 4, 4) | 4 | 512 | 4 + 2·3 = 10 | 7 + 3·3 = 16 |
Two patterns:
- Higher $k$ → smaller $D$ at fixed $N$. From 1D (D = 256) to 4D (D = 10), diameter compresses ~25×. This is the $O(N^{1/k})$ scaling that §3 attributes to dim-decomposition — factoring the collective across $k$ dims lets each dim-ring be short.
- Wraparound halves per-axis $D$. 3D 8³ torus has D = 12; same-shape mesh has D = 21 (~1.75× larger). At larger $d$ the ratio approaches 2× — the structural reason torus outperforms mesh on α-term (§4.2 pros).
B.5 Role in cost formulas
The three quantities map to different parts of the α-β cost model:
| Quantity | Appears in | Example formula |
|---|---|---|
| $N$ | BW term | $(N-1)/N \cdot M / \mathrm{BW}$ (flat-ring AR bandwidth coefficient) |
| $d_i$ | BW-telescoping during dim-decomp | after phase-$i$ RS, chunk size becomes $M / \prod_{j \leq i} d_j$ |
| $D$ | α-term (traversal latency) | flat-ring AR pays $2(N-1)\alpha$; dim-decomposed AR pays $2 \sum_i (d_i - 1) \alpha$ |
| $d_{\max}$ | A2A bisection penalty | torus A2A BW term = $d_{\max}/8 \cdot M/\mathrm{BW}$ |
Practical rule of thumb:
- Whole-fabric size → $N$.
- Per-axis width → $d_i$.
- Worst-case hop count → $D$.
- Largest axis → $d_{\max}$.
Conflating any two of these (e.g., writing $D$ for both the per-axis width and the diameter) makes formulas like $D = \sum_i \lfloor d_i / 2 \rfloor$ read as $D = \sum_i \lfloor D_i / 2 \rfloor$ — same letter on both sides, requiring the reader to mentally disambiguate inner-$D$ vs outer-$D$. The uppercase-$D$ / lowercase-$d$ split avoids this entirely.
Appendix C: Per-chunk-per-step analysis vs bottleneck analysis
The BW-term for different primitives in §3 and §4 is computed two different ways:
- Per-chunk-per-step analysis — used for ring RS / AG and dim-decomposed AR (§3.4, §3.5).
- Bottleneck analysis — used for A2A (§3.6, §4.2) and fat-tree A2A (
03_hierarchical_topologies.md§2.2).
This appendix explains when each applies, how they differ, and why bottleneck analysis is the more general formulation. §3.6 references the results here.
C.1 Per-chunk-per-step analysis (lockstep, symmetric traffic)
For ring RS on a $d$-rank ring, every rank sends one chunk per step through its one outbound link. By symmetry:
- Every rank performs the same operation at every step.
- Every link carries the same bytes per step.
Total BW cost is computed step by step:
$$t_{\mathrm{BW, ring\,RS}} = \sum_{\text{steps}} \frac{\text{chunk size at this step}}{\mathrm{BW}} \;=\; (d - 1) \cdot \frac{M/d}{\mathrm{BW}} \;=\; \frac{d - 1}{d} \cdot \frac{M}{\mathrm{BW}}$$
This works because the traffic is lockstep symmetric — no link is more loaded than any other, and summing per-step-per-rank costs captures the full BW cost. For dim-decomposed AR (§3.4), the same technique is applied per dim-phase with telescoping chunk sizes, summing to $(N-1)/N \cdot M/\mathrm{BW}$ across all phases.
C.2 Bottleneck analysis (pipelined, possibly asymmetric traffic)
For A2A ring-relay, chunks flow independently along shortest-arc paths across the fabric. At any given moment, different links carry different chunks; there is no synchronous per-step pattern where every rank does the same thing. Per-step-per-rank summing doesn’t apply directly.
Instead, identify the link (or group of links) carrying the most bytes over the whole phase and compute time to drain:
$$t_{\mathrm{BW}} \;=\; \frac{\text{bytes through the bottleneck}}{\text{bottleneck throughput}}$$
Why this works. The phase ends only when every byte has reached its destination. The most-loaded link is the last to drain — lighter-loaded links finish earlier but don’t reduce the phase time. Any scheduling trick that doesn’t reduce the bottleneck’s load doesn’t reduce this time.
For A2A on torus, the natural bottleneck is the bisection cut: all cross-half traffic must pass through it (intra-half links can’t carry it), so the cut’s capacity is the structural upper bound on cross-half throughput.
C.3 Concrete example — 4×4 torus A2A
Take the 4×4 torus A2A worked example (§3.6). We compute the BW-term three ways: (1) per-chunk-per-step reasoning applied to a single chunk’s journey, (2) bottleneck analysis on the whole cut (aggregate bytes / aggregate throughput), and (3) bottleneck analysis zoomed to one specific cut link. (2) and (3) are the same bottleneck analysis at different scopes and give the same answer; (1) gives a different (smaller) number, and that gap is the lesson.
(1) Per-chunk-per-step reasoning on a single chunk — what $a{\to}k$’s own bytes pay for its journey. Chunk $a \to k$ takes a 4-hop route:
$$a \to b \to c \to g \to k$$
Each hop moves a chunk of size $M/N = M/16$ at per-link $\mathrm{BW}$, so each hop’s transit time is $M/(16 \cdot \mathrm{BW})$. Summed across all 4 hops:
$$t_{a{\to}k\,\text{personal}} \;=\; 4 \cdot \frac{M}{16 \, \mathrm{BW}} \;=\; \frac{M}{4 \, \mathrm{BW}}$$
This is not the phase time — after $a{\to}k$ arrives, the cut link that carried its cross-half hop is still busy serving other chunks.
(2) Bottleneck analysis — aggregate whole-cut view (total cross-cut bytes drained by aggregate cut throughput).
Total cross-cut bytes, one direction. Every left-half rank (8 of them) sends one chunk of size $M/N = M/16$ to every right-half rank (8 of them):
$$\text{cross-cut bytes, L}\to\text{R} \;=\; \tfrac{N}{2} \cdot \tfrac{N}{2} \cdot \tfrac{M}{N} \;=\; 8 \cdot 8 \cdot \tfrac{M}{16} \;=\; 4M$$
Byte-count is fixed by A2A semantics (not by routing). A shortest-arc route from a left-half source to a right-half destination transitions halves exactly once — it uses exactly one cut edge, regardless of which shortest path is chosen. For $a{\to}k$, three equally-short routes exist; each uses exactly one cut hop:
X-first direct: a ─I─→ b ─C─→ c ─I─→ g ─I─→ k (cut at hop 2)
Y-first: a ─I─→ e ─I─→ f ─C─→ g ─I─→ k (cut at hop 3)
X-wraparound: a ═C═→ d ─I─→ c ─I─→ g ─I─→ k (cut at hop 1)
I = intra-half hop, C = cut hop.
So all 4M cross-cut bytes pass through the 8 cut links exactly once, total — no scheduling trick reduces this.
Aggregate cut throughput, one direction. 8 cut links × per-link $\mathrm{BW}$ per direction = $8\,\mathrm{BW}$. (Per-link BW is per-direction on a full-duplex link, as calibrated in §2; the symmetric R→L flow runs on the opposite channels of the same 8 links and drains concurrently.)
Divide:
$$t_{\mathrm{BW}} \;=\; \frac{4M}{8\,\mathrm{BW}} \;=\; \frac{M}{2\,\mathrm{BW}}$$
(3) Bottleneck analysis — single-link view (zoom in on one cut link’s busy-time). Focus on b→c (direct cut at $y=0$, L→R direction). Under uniform shortest-path routing, the 64 cross-cut chunks distribute across the 8 cut links → 8 chunks per cut link. The 8 chunks passing through b→c under X-first routing from row $y=0$ to the $x=2$ column:
| # | Chunk | Route | Total hops |
|---|---|---|---|
| 1 | $a \to c$ | $a \to b \to c$ | 2 |
| 2 | $a \to g$ | $a \to b \to c \to g$ | 3 |
| 3 | $a \to k$ | $a \to b \to c \to g \to k$ | 4 |
| 4 | $a \to o$ | $a \to b \to c \to o$ (Y-wrap on last hop) | 3 |
| 5 | $b \to c$ | $b \to c$ | 1 |
| 6 | $b \to g$ | $b \to c \to g$ | 2 |
| 7 | $b \to k$ | $b \to c \to g \to k$ | 3 |
| 8 | $b \to o$ | $b \to c \to o$ (Y-wrap on last hop) | 2 |
Each chunk uses b→c exactly once somewhere in its journey (position doesn’t matter for the BW arithmetic — only total occupancy does). Each use costs $M/(16 \cdot \mathrm{BW})$. Total occupancy:
$$t_{b{\to}c\,\text{occupancy}} \;=\; 8 \cdot \frac{M}{16 \, \mathrm{BW}} \;=\; \frac{M}{2 \, \mathrm{BW}}$$
(2) and (3) agree, as they must for a symmetric torus where every cut link carries the same load — cut-wide average = any-one-link occupancy.
Why the phase time is the bottleneck occupancy, not the per-chunk journey. $a{\to}k$ personally uses b→c for only $M/(16 \cdot \mathrm{BW})$ — one of the 8 transits that link carries. After $a{\to}k$ finishes its own hop through b→c, the link still has 7 more chunks to serve. The phase doesn’t end until all chunks have arrived, including the 7 others. So phase time = link’s total occupancy = $M/(2 \cdot \mathrm{BW})$, not $a{\to}k$’s personal $M/(4 \cdot \mathrm{BW})$.
Generalizing to arbitrary torus shape. The whole-cut view (2) extends to arbitrary $d_i$ by tracking how each of its quantities scales with $N$ and $d_{\max}$:
| Quantity | 4×4 value | General formula |
|---|---|---|
| Severed-link count | 8 | $2N / d_{\max}$ (2 links per dim-line × $N/d_{\max}$ dim-lines across the cut) |
| Cross-cut chunks (one direction) | 64 | $(N/2) \cdot (N/2) = N^2/4$ |
| Cross-cut bytes (one direction) | $4M$ | $(N^2/4) \cdot (M/N) = NM/4$ |
| Cut throughput (one direction) | $8\,\mathrm{BW}$ | $(2N / d_{\max}) \cdot \mathrm{BW}$ |
| Chunks per cut link | 8 | $(N^2/4) / (2N/d_{\max}) = N \cdot d_{\max}/8$ |
Two quantities of interest fall out of this table. The BW term, dividing one-direction traffic by one-direction throughput (full-duplex → the symmetric opposite direction finishes concurrently, so this is the total BW phase time):
$$t_{\mathrm{BW}} \;=\; \frac{NM/4}{(2N/d_{\max}) \cdot \mathrm{BW}} \;=\; \frac{d_{\max}}{8} \cdot \frac{M}{\mathrm{BW}}$$
4×4 check: $d_{\max}/8 \cdot M/\mathrm{BW} = 4/8 \cdot M/\mathrm{BW} = M/(2\,\mathrm{BW})$ ✓. This $d_{\max}/8$ factor is what §3.6’s general cost formula uses.
The bottleneck-vs-personal ratio, quantifying how much more loaded the bottleneck link is than one chunk’s private path (chunks sharing the bottleneck, divided by hops in one chunk’s journey):
$$\frac{t_{\mathrm{bottleneck}}}{t_{\mathrm{longest chunk personal}}} \;=\; \frac{\text{chunks per cut link}}{\text{longest chunk hop count}} \;=\; \frac{N \cdot d_{\max} / 8}{\mathrm{diam}}$$
4×4 check: $(16 \cdot 4 / 8) / 4 = 8/4 = 2$ ✓ — the bottleneck serves 8 chunks while a single chunk’s journey has only 4 hops, so the link is “shared more broadly” than any one chunk’s path is “long.”
Evaluated across a few shapes:
| Shape | $N$ | $d_{\max}$ | $\mathrm{diam}$ | Chunks per cut link | Longest journey (hops) | Ratio |
|---|---|---|---|---|---|---|
| 4×4 (our case) | 16 | 4 | 4 | 8 | 4 | 2 |
| 8×8 | 64 | 8 | 8 | 64 | 8 | 8 |
| 8×8×8 | 512 | 8 | 12 | 512 | 12 | ~43 |
| 16×16×16 (TPU v5p) | 4096 | 16 | 24 | 8192 | 24 | ~341 |
The ratio grows roughly as $N / \mathrm{diam}$: large tori accumulate many chunks on each cut link while the longest journey grows only slowly (linearly in dim-size), so the bottleneck term increasingly dominates any individual chunk’s personal BW cost. The 4×4 case is pedagogically convenient because the ratio is a clean small number; at production scale the bottleneck is hundreds of times larger than a single chunk’s BW cost.
Re-use interpretation (bridge analogy). What the bottleneck formula is actually computing: how many times is the bottleneck link re-used over the phase, and how long does each re-use take?
$$t_{\mathrm{BW, bottleneck}} \;=\; \underbrace{(\text{\# re-uses of bottleneck link})}{\text{= \# chunks through it}} \;\times\; \underbrace{(\text{per-use time})}{\text{= chunk size / BW}}$$
At b→c: 8 re-uses × M/(16·BW) per re-use = M/(2·BW). The ideal formula assumes perfect pipelining — each re-use happens back-to-back, with the link at 100% utilization during its busy period. (Real deployments fall short due to switch queueing, flow-control stalls, and scheduler bubbles; that gap is priced by $\eta_\beta$ in 05_contention_and_congestion.md §4.)
Picture 8 people each needing to cross a 1-minute bridge. Each person’s own crossing takes 1 minute — short. But the bridge is re-used 8 times, back-to-back, so it’s occupied for 8 minutes total. The party finishes when the last person exits, at minute 8, even though no single person’s crossing took longer than 1. Phase time tracks the bridge’s total busy-time (= re-use count × per-use time), not any one person’s crossing time.
The 4×4 A2A case maps exactly:
- “People” = chunks (8 of them passing through b→c)
- “1-minute crossing” = M/(16·BW) per-chunk transit
- “Party finishes” = last chunk exits b→c → M/(2·BW)
C.4 When the two analyses coincide
On regular torus with uniform shortest-path routing, graph symmetry forces every link to carry identical load ($M/2$ bytes per direction in the 4×4 A2A case). The bottleneck load equals the average load, so bottleneck-based and per-link-uniform calculations give the same answer. Per-chunk-per-step analysis — if applied carefully to account for pipelining — reaches the same number.
For ring RS, traffic is balanced by construction. Per-step-per-rank is the easiest formulation and gives the right answer directly; bottleneck analysis would give the same result but is unnecessary overhead.
So when traffic is symmetric, pick whichever formulation is easier to compute. For ring, it’s per-step-per-rank. For A2A on regular torus, it’s the bisection bottleneck.
C.5 When bottleneck analysis is necessary
Bottleneck analysis is strictly more general; it gives the tight bound even when per-link symmetry breaks:
- Oversubscribed fat-tree (
03_hierarchical_topologies.md§1): upper-tier links loaded $s \times$ more than lower-tier. Per-link-uniform doesn’t apply. - Off-prefix torus group layouts (§7): some dim-lines loaded more than others when the collective’s ranks don’t tile the torus cleanly.
- Hotspot traffic (e.g., skewed MoE expert routing): some destinations receive disproportionately more traffic, concentrating load on specific links.
- Non-uniform routing policies: deterministic routing may not evenly distribute chunks across alternate paths.
In each of these cases, per-link loads differ from the average and only bottleneck analysis gives the tight time bound. “Average-based” reasoning ($t = \text{total bytes} / \text{total fabric BW}$) underestimates because it assumes perfect load balancing that the fabric doesn’t deliver.
C.6 Relationship to $\eta_\beta$ in 05_contention_and_congestion.md
The bottleneck formula derived above is the ideal lower bound — it does NOT include any $\eta_\beta$ discount. Every $t_{\mathrm{BW}}$ computed in this note (§3.6’s $M/(2\,\mathrm{BW})$ for 4×4 torus A2A, 03_hierarchical_topologies.md §2.2’s fat-tree A2A term, etc.) assumes:
- The bottleneck link runs at 100% utilization during its busy period.
- Every re-use of the bottleneck happens back-to-back with no gaps between chunks.
- No switch queueing delay, no head-of-line blocking, no flow-control stalls.
- No cross-traffic interference, no imperfect pipelining bubbles.
Under those assumptions, the formula gives the fastest possible phase time on the bottleneck — the ideal target that a perfectly-scheduled deployment would hit. Real deployments sit above this lower bound because those assumptions fail in practice: switches do queue, flow-control does stall, chunks don’t arrive at relay points perfectly aligned, and other traffic shares the same ports.
The coefficient $\eta_\beta \in (0, 1]$ introduced in 05_contention_and_congestion.md §4 is a separate multiplicative discount applied on top of this note’s ideal formulas, quantifying the real-vs-ideal gap:
$$t_{\mathrm{BW, realized}} \;=\; \frac{\text{bottleneck bytes}}{\eta_\beta \cdot \mathrm{BW}} \;=\; \frac{t_{\mathrm{BW, ideal}}}{\eta_\beta}$$
Bottom line. Throughout 02_topology_mapping.md, wherever you see a cost formula with plain $\mathrm{BW}$ (no $\eta_\beta$), it is the $\eta_\beta = 1$ idealization. The $\eta_\beta$ factor is not hidden inside any of the derivations here; it is strictly a 05_contention_and_congestion.md-layer concept, applied externally when scoring realistic performance. When reading a 02_topology_mapping.md BW-term, assume perfect pipelining and zero contention.
Further reading
01_collective_algorithms.md— topology-free derivations of the ring, tree, RS, AG, and A2A primitives. Prerequisite for the per-topology cost formulas above.04_in_network_collectives.md— how SHARP / NVLS collapses the $O(N)$ endpoint-hop cost on star topologies to $O(1)$ switch-hop cost.05_contention_and_congestion.md— extending the ideal formulas above with $\eta_\alpha$, $\eta_\beta$ coefficients to score concurrent-group and off-prefix effects.03_hierarchical_topologies.md— composing the single-tier topologies above into multi-tier hierarchies (NVL72 star inside a Clos, chiplet mesh inside a star inside a Clos), with the associated hierarchical-collective cost rules and device-domain allocation problem.references.md— primary-source citations for the cost formulas in this note (Patarasuk-Yuan bandwidth-optimal AR, Chan et al. dim-decomposed AR).
Hierarchical Topologies: Composition, Partitioning, and Optimization
Author: Yue Lu
Date: April 2026
Single-tier fabrics hit hard scaling ceilings — NVSwitch radix, torus dimension wiring cost, chiplet pin count — and once a deployment needs more ranks than any one tier can support, hierarchical composition is inevitable. Real clusters layer tiers accordingly: a GB200 NVL72 pod is a star of 72 GPUs; pods connect through an InfiniBand Clos; chiplets inside each GPU package form a mesh on the base die. This note covers the canonical multi-tier fabric (§1 Clos architecture), what changes when topologies compose across tiers (§2 composition rules), and how in-network collectives and contention coefficients interact with hierarchical schedules (§3).
Table of Contents
- Multi-tier Clos architecture
- Composition rules for hierarchical collectives
- INC and contention in hierarchies
- Appendix A: Worked rail-optimized NVL72 SuperPOD topology (L = 32)
- Appendix B: The k-ary fat-tree
- Further reading
1. Multi-tier Clos architecture
Each single-layer topology from 02_topology_mapping.md has a hard scaling ceiling:
- Star caps at switch radix (NVSwitch Gen4: 72 ports).
- Torus caps at dimensional-wiring cost (TPU v5p: $16^3 = 4096$).
- Full mesh caps at $O(N^2)$ wire count (~8–16 endpoints).
Beyond each ceiling, the only scaling path is to compose topologies across tiers — one topology inner, a different one outer. Modern production stacks are always multi-tier: NVL72 (star) inside InfiniBand (Clos); chiplets (mesh) inside a GPU inside NVLink (star) inside InfiniBand (Clos); TPU chips (torus) inside an ICI slice (torus) inside an inter-slice link. The canonical multi-tier switched construction is the Clos: endpoints attach to leaf switches, leaves uplink to spine switches, and optionally a super-spine tier closes a 3-tier Clos. Its cost behavior is controlled by per-tier α-β pairs and an oversubscription ratio that caps upper-tier bandwidth.
Per-tier characterization. Each tier is described by four parameters: domain size $N_i$, per-hop latency $\alpha_i$, per-link BW $\mathrm{BW}_i$, and (for Clos tiers) oversubscription $s_i \geq 1$. The cost of any collective over the full $N = \prod_i N_i$ ranks is a composition of per-tier primitive costs — §2 derives the composition rules; this section anchors them on the Clos topology itself.
Leaf-spine structure. $N$ endpoints split across $L$ leaf switches ($N/L$ per leaf). Each leaf switch has two port populations:
- Downlink ports face endpoints below — $p_{\mathrm{down}} = N/L$ ports per leaf, each at port-BW.
- Uplink ports face the spine tier above — $p_{\mathrm{up}}$ ports per leaf, each at the same port-BW (normally all ports on a switch are identical; “downlink” vs “uplink” is a role assignment, not hardware).
A spine tier of $S$ switches carries all inter-leaf traffic; multiple spines provide multi-path routing and aggregate cross-leaf bandwidth. A small concrete instance:
Two-tier Clos (N = 16 endpoints; L = 4 leaves × S = 2 spines; s = 2):
┌──────┐ ┌──────┐
spine tier: │ S0 │ │ S1 │ each spine: 4 downlinks
└──────┘ └──────┘ (one per leaf)
full bipartite — 8 leaf-spine links total
(every leaf connects to every spine)
┌────┐ ┌────┐ ┌────┐ ┌────┐
leaf tier: │ L0 │ │ L1 │ │ L2 │ │ L3 │ each leaf:
└────┘ └────┘ └────┘ └────┘ 2 uplinks (→ S0, S1)
▼▼▼▼ ▼▼▼▼ ▼▼▼▼ ▼▼▼▼ 4 downlinks
R0-3 R4-7 R8-11 R12-15
Wiring:
every L_i has 2 uplinks — one to S0, one to S1.
every S_j has 4 downlinks — one to each of L0 … L3.
Port-budget: 4 down-port BW per leaf (to endpoints) vs. 2 up-port BW per leaf (to spines)
→ oversubscription s = 4 / 2 = 2 (defined below).
Cost anatomy. Intra-leaf traffic traverses 2 link hops (endpoint → leaf → endpoint), at α ≈ a single-switch star (~0.5 μs on InfiniBand NDR class). Cross-leaf traffic traverses 4 hops (endpoint → leaf → spine → leaf → endpoint), at roughly 2× the α. Bandwidth is capped by the minimum of endpoint BW, leaf-uplink aggregate BW, and spine bisection.
Oversubscription. The oversubscription ratio at the leaf-to-spine boundary (same notion at any switch tier) is
$$s \;=\; \frac{p_{\mathrm{down}}\cdot\mathrm{BW}{\mathrm{down}}}{p{\mathrm{up}}\cdot\mathrm{BW}_{\mathrm{up}}} \;\geq\; 1$$
The special case $s = 1$ ($p_{\mathrm{up}}\cdot\mathrm{BW}{\mathrm{up}} = p{\mathrm{down}}\cdot\mathrm{BW}{\mathrm{down}}$) is full-bisection / non-blocking — every endpoint can drive its full BW to any other endpoint through the spine without contention. NVIDIA Quantum-class InfiniBand pods are engineered this way by default; Ethernet leaf-spine pods are often full-bisection at the pod level. When full-bisection is too expensive, upper tiers are oversubscribed at $s = 2$, $s = 4$, etc. At $s = 4$ a leaf can receive $4\times$ more traffic from its endpoints than it can push to the spine: under uniform cross-leaf traffic each flow gets only $\mathrm{BW}/s$ northbound. The uplinks are the bottleneck, not the endpoint ports or the spines. The contention-coefficient model in 05_contention_and_congestion.md captures this directly as $\eta\beta \approx 1/s$ on upper-tier-crossing traffic — a 4:1 oversubscribed super-spine inflates cross-pod AR’s BW term by 4×. The α term is not inflated by $s$ — oversubscription is a BW knob, not a latency knob. §3.2 below reuses this $\eta_\beta$ hook when scoring hierarchical AR with realistic contention.
Scaling to three tiers. When a single spine tier’s radix runs out, pods of leaf-spine hierarchies hang under a super-spine tier that routes cross-pod traffic. Diameter grows to 6 link hops (endpoint → leaf → spine → super-spine → spine → leaf → endpoint), the per-tier α-β story extends one layer, and each new boundary carries its own $s$:
3-tier Clos (schematic):
[Super-spine] cross-pod traffic only
↑
[Spine (per pod)] cross-leaf within pod, plus up to super-spine
↑
[Leaf (per pod)] endpoints attached
↑
[Endpoints]
Production examples: large InfiniBand SuperPOD deployments (thousands of GPUs), Meta’s research clusters, and most hyperscaler Ethernet scale-out fabrics. Further layering (4-tier or beyond) is rarely built — once a pod reaches super-spine scale, it’s usually cheaper to add pods laterally than another vertical tier.
Naming: Clos vs fat-tree. Historically distinct: Clos (Charles Clos, 1953) is the general multi-tier switched family, allowing any $s \geq 1$; Leiserson fat-tree [LEIS85] is a tree with link bandwidth growing toward the root. The modern datacenter realization — the k-ary fat-tree of [AL-FARES08] — is built from commodity $k$-port switches with $k/2$ up / $k/2$ down at every switch, yielding a Clos with $s = 1$ at every tier; Appendix B has the port-by-port build. Production AI fabrics (NVIDIA DGX SuperPOD [DGX-SUPERPOD], Meta’s Research SuperCluster [META-RSC], hyperscaler Ethernet training pods) follow this structural pattern — tiered Clos with full bipartite at every tier boundary — but don’t rigidly hit Al-Fares’s exact port accounting: vendors mix switch radices across tiers, and the common design runs $s = 1$ at the intra-pod leaf-spine tier (for tight training locality within a Scalable Unit) while oversubscribing the super-spine between pods where cost dominates over bandwidth. In the rest of this note, Clos means the general multi-tier fabric at any $s$, and fat-tree is shorthand for the $s = 1$ k-ary variant when the non-blocking property matters — the usage aligns with production InfiniBand / Ethernet vendor literature.
1.1 Case study: NVIDIA NVL72 SuperPOD
The NVL72 SuperPOD grounds the abstract Clos above in real hardware and surfaces a common source of reader confusion: every GPU is attached to two physically separate fabrics, so “leaf” and “innermost tier” are both valid terms in this setup but refer to different switches.
Two independent fabrics per GPU. Every GPU in an NVL72 pod has two distinct egress paths:
- NVLink → NVSwitch (scale-up, intra-pod). 72 GPUs per NVL72 pod on NVLink Gen5 at 1800 GB/s per GPU. Traffic stays within the pod — NVLink ports do not reach a NIC or ToR.
- PCIe → ConnectX NIC → IB ToR (scale-out, inter-pod). One ConnectX-7 (NDR) or ConnectX-8 (XDR) NIC per GPU via PCIe Gen5 — 72 NICs per NVL72. NDR: 50 GB/s per NIC; XDR / Quantum-X800: 100 GB/s. NICs uplink to IB Quantum-2 / X800 ToR switches, which uplink to IB spines.
NCCL routes intra-pod peers over NVLink and cross-pod peers over IB; the two fabrics never share a data path.
Two-fabric structure (one NVL72 pod, 72 GPUs):
┌──────────────── compute pod ────────────────┐
│ │
│ ┌───────┐ NVLink ┌─────────────┐ │
│ │ GPU_0 │────────────│ │ │ ← scale-up (NVSwitch)
│ └───┬───┘ │ NVSwitch │ │ NVLink Gen5: 1800 GB/s
│ │ PCIe │ fabric │ │ (72 GPUs per pod)
│ ┌───┴───┐ └─────────────┘ │
│ │ NIC_0 │ (1 NIC per GPU — 72 total) │
│ └───┬───┘ │
└───────┼─────────────────────────────────────┘
│ PCIe Gen5 → ConnectX NIC → IB
▼
┌─────────────┐ ┌─────────────┐
│ IB ToR │ ◄───────► │ IB Spine │ ← scale-out Clos
│ (Quantum-2 │ │ (Quantum-2) │ (leaf ↔ spine per §1)
│ "leaf") │ └─────────────┘ 4-hop cross-pod path:
└─────────────┘ NIC → ToR → Spine → ToR → NIC
Mapping to the abstract hierarchy.
| Abstract tier (used in §2) | NVL72 SuperPOD realization |
|---|---|
| Innermost (scale-up) | NVLink Gen5 within the pod, 72 GPUs, 1800 GB/s per GPU |
| Outer leaf | IB Quantum-2 / Quantum-X800 ToR |
| Outer spine | IB Quantum-2 / Quantum-X800 spine |
| Outer super-spine | Multi-pod deployments (>72 GPUs across multiple NVL72s) |
The common confusion — “leaf” ≠ “innermost tier.” In Clos literature, leaf is a position in the switched fabric — the first switch an endpoint hits. For NVL72 SuperPODs, that first switch is the IB ToR. But the innermost tier of the overall hierarchy is the NVSwitch/NVLink fabric, which sits below the NIC in the topology — intra-pod traffic never passes through the NIC or ToR at all. The IB ToR is the entry point of the outer tier, not the inside. The gap between the two tiers is large — roughly ~36× in per-GPU BW (1800 GB/s NVLink vs ~50 GB/s NDR) and several × in α (NVSwitch cut-through vs cross-pod IB path). §2 picks this up and shows how to compose collectives across tiers so that the fast inner fabric carries most of the work and the slow outer fabric is crossed as little as possible.
How are 72 NICs actually wired to IB ToRs? The two-fabric diagram above draws the IB ToR as a single box, which is a deliberate abstraction — the previous paragraphs care only about which fabric a packet uses (NVLink vs IB), not about how many ToRs the scale-out fabric has. But in real hardware, an NVL72’s 72 NICs fan out to actual switches, and the choice of fan-out pattern determines port-count requirements, failure blast radius, and whether NCCL can exploit rail locality. Two layouts appear in production deployments.
A natural first guess: attach all 72 NICs to one ToR. For full bisection ($s = 1$), that ToR would need 72 downlinks to NICs + 72 uplinks to the spine — exactly 144 ports. Quantum-X800 Q3400’s radix is 144 [QX800], so this math works by construction. Whether this “pod-local” layout is what NVIDIA actually ships is a separate question; it turns out the SuperPOD reference architecture uses a different pattern (rail-optimized) that also lands on 144-port ToRs but for a different reason. Both are worth walking through.
Layout 1 — Pod-local ToR. All 72 NICs from one NVL72 attach to a single ToR; 72 downlinks to NICs + 72 uplinks to the spine = exactly 144 ports at $s = 1$ full-bisection, matching Quantum-X800 Q3400’s radix [QX800] one-for-one.
Pod-local layout (1 ToR per NVL72; all 72 NICs into one X800):
┌─────────────── one NVL72 (72 GPUs / 72 NICs) ──────────────┐
│ │
│ tray 0: [N] [N] [N] [N] │
│ tray 1: [N] [N] [N] [N] │
│ ... │
│ tray 17: [N] [N] [N] [N] │
│ │
└─────────────────────────────┬──────────────────────────────┘
│ all 72 NICs (one bundle)
▼
┌─────────────────────────┐
│ Quantum-X800 Q3400 │ ← 144 ports:
│ (one ToR per NVL72) │ 72 down = 72 NICs
└────────────┬────────────┘ 72 up → IB spine
│
▼
═══ IB spine fabric ═══
Simple, but the entire NVL72 shares one failure domain, and one ToR per pod adds up fast at SU scale.
Layout 2 — Rail-optimized (NVIDIA SuperPOD reference [DGX-SUPERPOD]). NICs at the same position within a compute tray across different NVL72 pods share one ToR. Each GB200 NVL72 compute tray holds 4 GPUs / 4 NICs, so each NIC goes to one of 4 rails. Every NVL72 is striped across 4 rail-ToRs, so any one ToR sees only 18 NICs per pod (one per tray), letting a 144-port X800 aggregate multiple pods on the same rail within one SU.
Rail-optimized fan-out (NVL72 striped across 4 rail-ToRs):
┌─────────────── one NVL72 (72 GPUs / 72 NICs) ──────────────┐
│ │
│ tray 0: [r0] [r1] [r2] [r3] │
│ tray 1: [r0] [r1] [r2] [r3] │
│ ... │
│ tray 17: [r0] [r1] [r2] [r3] │
│ │ │ │ │ │
└─────────────────┼─────┼─────┼─────┼────────────────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌─────┬─────┬─────┬─────┐
│ToR_0│ToR_1│ToR_2│ToR_3│ ← one Quantum-X800 ToR per rail
│144p │144p │144p │144p │ (18 NICs from this NVL72 each)
└──┬──┴──┬──┴──┬──┴──┬──┘
│ │ │ │
▼ ▼ ▼ ▼
═══ IB spine fabric ═══
Each ToR's 144 ports = 72 NIC downlinks (= 18 × up to 4 NVL72s on this rail
at s = 1 full-bisection)
+ 72 spine uplinks.
The inverse view — what one rail-i ToR sees. The diagram above shows one NVL72 fanning out to 4 rail-ToRs. The flip side is what each rail-ToR collects: it aggregates the rail-i NICs from up to 4 NVL72 pods. Critically, each pod contributes only 18 NICs to this ToR (its rail-i slice — one NIC per tray); the pod’s other 54 NICs land on the other three rails’ ToRs. So one ToR is not “consumed” by a single pod — four pods can share it.
Inverse view: one rail-i ToR_Z aggregates rail-i NICs from up to 4 NVL72s
┌─NVL72_A─┐ ┌─NVL72_B─┐ ┌─NVL72_C─┐ ┌─NVL72_D─┐
│ rail-i │ │ rail-i │ │ rail-i │ │ rail-i │
│ 18 NICs │ │ 18 NICs │ │ 18 NICs │ │ 18 NICs │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
▼ ▼ ▼ ▼
┌────┬────────────┬────────────┬────────────┬────┐
│ rail-i ToR_Z (Quantum-X800) │ ← 144 ports:
│ │ 72 down = 4 × 18 NICs
└───────────────────────┬────────────────────────┘ 72 up → IB spine
│
▼
═══ IB spine fabric ═══
This is what makes “Same ToR” cross-pod traffic feasible: ToR_Z physically sees 4 pods’ rail-i NICs and can bridge any pair of them in 2 hops without involving the spine. Beyond 4 NVL72s on a rail, a second rail-i ToR is needed and cross-ToR-on-rail-i traffic must use the spine (4-hop path).
Traffic behavior under rail-optimized. First, the NVLink rule: intra-pod traffic (GPU_A ↔ GPU_B inside the same NVL72) always uses NVLink — IB never enters the picture, regardless of rail assignment. The cases below are all cross-pod; they must use IB because NVLink does not extend off-pod, and the rail-optimized topology determines how many IB hops are needed.
- Cross-pod, same rail, same ToR — NVL72_X and NVL72_Y both attach their rail-i NICs to the same physical ToR_Z: 2 hops (NIC → ToR_Z → NIC). The IB spine is not needed because one ToR bridges both pods. This applies when up to 4 NVL72s share a rail’s ToR (144-port X800 at $s = 1$: 72 down ÷ 18 NICs/pod = 4 pods).
- Cross-pod, same rail, different ToRs — once the rail spans more than 4 NVL72s, rail-i NICs spread across rail-i ToR_A and rail-i ToR_B: 4 hops (NIC → ToR_A → IB spine → ToR_B → NIC). The spine is the only physical path between two same-rail ToRs.
- Cross-pod, cross-rail — GPU_i on rail i in pod A ↔ GPU_j on rail j ≠ i in pod B: NVLink rail bridge — 1 NVLink hop + 4 IB hops on rail j. The 4 rail fabrics are physically disjoint at every tier (no IB cable connects rail-i switches to rail-j switches), so cross-rail does not go through any IB spine. Instead, the source first sends over NVLink intra-pod-A to a sibling GPU whose NIC lands on rail j, then that sibling uplinks normally on rail j (NIC → ToR_j → spine_j → ToR_j → NIC). Cross-rail pays one extra ~0.5 μs intra-pod NVLink hop on top of the destination rail’s 4-hop IB cost.
NCCL auto-detects rails and prefers same-rail peering for α-sensitive TP traffic [DGX-SUPERPOD] — a 2-hop same-rail path beats the 4-hop cross-rail one. This is why rail-optimized is the production norm despite using 4× more ToRs than pod-local.
Bottom line on the 144-port ToR usage. Both layouts use 144-port X800, but sized for different jobs:
- Pod-local: 144 = 72 NICs + 72 spine uplinks, one NVL72 exactly ([QX800]).
- Rail-optimized: 144 ports aggregates a rail across up to 4 NVL72s — ~72 NICs from multiple pods + 72 spine uplinks ([DGX-SUPERPOD]).
The X800 radix matches both interpretations comfortably, which is why Q3400 is the reference ToR for GB200 SuperPODs.
1.2 Cost-model symbols anchored on this SuperPOD
§2’s cost formulas and case studies use a small set of symbols. We anchor their meaning and concrete values once here so downstream subsections can plug in numbers without re-introducing notation. The values come directly from the NVL72 + IB hardware described in §1.1 above.
Latency (α). Three distinct switch paths surface in production, matching the three traffic classes from §1.1’s “Traffic behavior under rail-optimized”:
| Symbol | Path | NVL72 + IB value |
|---|---|---|
| $\alpha_{\mathrm{inner}}$ | Intra-pod GPU↔GPU on scale-up (NVSwitch cut-through) | $\approx 0.5\,\mu$s |
| $\alpha_{\mathrm{leaf}}$ | Cross-pod 2-hop IB path (NIC → shared rail-ToR → NIC) | $\approx 2\,\mu$s |
| $\alpha_{\mathrm{spine}}$ | Cross-pod 4-hop IB path (NIC → ToR → spine → ToR → NIC), end-to-end including NIC serialization | $\approx 8\,\mu$s |
$\alpha_{\mathrm{leaf}}$ applies when pods share a rail-ToR — possible only at small scale, since a 144-port X800 at $s = 1$ holds at most 4 NVL72s per rail. $\alpha_{\mathrm{spine}}$ is the general cross-pod cost once the rail spans more than 4 NVL72s.
§2.1’s hierarchical-AR formulas use a derived symbol $\alpha_{\mathrm{outer}}$ for the cross-pod α — the effective cross-pod latency for the deployment, equal to:
- $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{leaf}}$ when all participating pods share a rail-ToR (small scale).
- $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}}$ when pods span multiple rail-ToRs (production scale).
- A weighted average of the two when the schedule has both same-ToR and spine-traversing hops (e.g., a ring AR across 8 pods split 4-and-4 across two rail-ToRs sees 6 same-ToR hops at $\alpha_{\mathrm{leaf}}$ and 2 cross-ToR hops at $\alpha_{\mathrm{spine}}$).
Case studies plug in concrete values: §2.1’s $L = 2$ walk-through uses $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{leaf}}$ (small-scale, pods share a rail-ToR); a production-scale $L = 32$ deployment would use $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}}$ instead. §2.2’s A2A formula itemizes by destination, so it doesn’t use $\alpha_{\mathrm{outer}}$ — all three primitive α values ($\alpha_{\mathrm{inner}}$, $\alpha_{\mathrm{leaf}}$, $\alpha_{\mathrm{spine}}$) appear directly.
Bandwidth (BW), per GPU per direction.
| Symbol | Source | NVL72 + IB value |
|---|---|---|
| $\mathrm{BW}_{\mathrm{inner}}$ | NVLink Gen5 (per direction; 1800 GB/s bidirectional aggregate) | $\approx 900\,\mathrm{GB/s}$ |
| $\mathrm{BW}_{\mathrm{outer}}$ | ConnectX-7 NDR (ConnectX-8 XDR doubles to 100 GB/s) | $\approx 50\,\mathrm{GB/s}$ |
In multi-tier formulas, $\mathrm{BW}{\mathrm{bottleneck}} \approx \mathrm{BW}{\mathrm{outer}}$ once any cross-pod traffic is involved.
Shape parameters. $N$ = total ranks; $L$ = number of pods; $N_{\mathrm{inner}} = N/L$ = GPUs per pod (= 72 on NVL72); $p$ = pods sharing a rail-ToR ($\leq 4$ at $s = 1$).
These symbols carry through §2 unchanged.
2. Composition rules for hierarchical collectives
The core idea. In single-tier, a collective like AR is one algorithm running over one fabric — pick a ring, a DBT, or Rabenseifner, evaluate with that fabric’s $(\alpha, \mathrm{BW})$, done. In multi-tier, the same user-level collective often gets re-expressed as a combination of different primitives running on different tiers. A hierarchical AR isn’t “run AR on tier 1, then AR on tier 2” — it’s a sequence like RS on the inner tier, AR on the outer tier, AG on the inner tier that composes to the same result. The per-tier primitives need not be the same primitive the user asked for; they’re chosen so each tier does the work it’s best at, and so the outer tier sees as little data as possible.
Why this re-expression is possible (and necessary). Two reasons.
- Tiers are heterogeneous. An intra-pod hop costs ~0.5 μs at ~900 GB/s on NVSwitch while a cross-pod hop costs 2–8 μs at ~50 GB/s on InfiniBand. A flat schedule that treats the Clos as one big star pays $\log N \cdot \alpha_{\mathrm{spine}}$ on every algorithmic hop and $2(N-1)/N \cdot M/\mathrm{BW}_{\mathrm{outer}}$ on every byte — wasting the fast intra-pod fabric entirely. Decomposing across tiers lets each tier pay only its own α on its own phase, and lets the outer tier carry a shrunk payload rather than the full $M$.
- The collective’s algebraic structure allows it. Reduction is associative — reducing locally first then globally gives the same answer as reducing in one flat pass. Broadcast is replication — copying locally then globally is equivalent to one flat copy. These are the structural hooks that let the single-tier collective be factored into a multi-phase schedule. Each of AR, AG, RS, BC, Reduce has such a factoring; the specific phase pattern differs (§2.1 tabulates all five).
Two design levers in every hierarchical schedule. Once you accept that a single-tier collective can be re-expressed as a cross-tier combination, the schedule has two knobs: which primitive runs on which tier, and how much payload crosses each tier boundary. The canonical AR pattern (RS → sub-AR → AG) uses both: it puts the bulk of the α-hops on the fast inner tier and shrinks the payload from $M$ to $ML/N$ before any byte crosses the slow outer tier. §2.1 develops this pattern for AR and shows how AG, RS, BC, Reduce fall out as halves or trivial cascades of it.
The exception: A2A. Not every collective has the algebraic structure to decompose. All-to-all has no associativity or replication hook — every source-destination pair carries a distinct payload, so the cross-tier permutation traffic equals the full $(N-1)/N \cdot M$ bytes regardless of how the schedule is phased. §2.2 works through why A2A doesn’t decompose and how its cost formula becomes a destination-weighted sum over distance classes rather than a phased pipeline.
The cost formulas below use the symbols defined in §1.2 — $\alpha_{\mathrm{inner}}$, $\alpha_{\mathrm{leaf}}$, $\alpha_{\mathrm{spine}}$, $\mathrm{BW}{\mathrm{inner}}$, $\mathrm{BW}{\mathrm{outer}}$, plus the shape parameters $N$, $L$, $N_{\mathrm{inner}}$, $p$. Refer back to that table for concrete NVL72 + IB values when plugging numbers in.
2.1 The RS → sub-AR → AG pattern
Hierarchical AR composes the primitives from 01_collective_algorithms.md across tiers by exploiting reduction’s associativity. For a 2-tier hierarchy with $L$ outer groups of $N/L$ inner ranks each, the canonical schedule is:
Phase 1 (intra-group RS): each of L groups runs independent RS across its N/L ranks
→ each rank now holds (L/N)·M bytes of partially-reduced data
→ per-group RS runs on the inner tier's fabric at (α_inner, BW_inner)
Phase 2 (cross-group AR): L ranks-per-rank-position run AR across the outer tier
→ the M·L/N byte payload now sees the outer tier's (α_outer, BW_outer)
(α_outer = α_leaf, α_spine, or a mix — see §1.2)
→ this AR can itself be another hierarchical AR if k > 2 tiers
Phase 3 (intra-group AG): each of L groups runs independent AG across its N/L ranks
→ each rank now holds the fully-reduced M bytes
→ reverse of phase 1 on the inner tier
Why this composes correctly. Reduction is associative, so the inner-then-outer order produces the same result as a flat AR across all $N$ ranks. The per-rank chunk size telescopes: after inner RS, each rank holds $M / (N/L) = ML/N$ bytes (not $M$), so the outer AR moves less data. This is the same bandwidth-telescoping argument as torus dim-decomposition (02_topology_mapping.md §3.1), generalized across heterogeneous tiers.
Worked example — N = 4, L = 2. Four ranks {A, B, C, D}; leaf 0 holds {A, B}, leaf 1 holds {C, D}. Each rank starts with a 4-chunk vector; target is every rank ending with $\Sigma = [\Sigma_0, \Sigma_1, \Sigma_2, \Sigma_3]$ where $\Sigma_k = a_k + b_k + c_k + d_k$ sums all four ranks’ chunk-$k$ values.
Initial state Phase 1: intra-leaf RS Phase 2: cross-leaf sub-AR
(each rank full M) (N/L = 2 ranks per leaf) (L = 2 ranks per chunk pair)
[runs on NVSwitch fabric] [runs on IB ToR + spine]
[intra-pod, fast] [cross-pod, slow]
Leaf 0 Leaf 0 Leaf 0
A: [a0,a1,a2,a3] A: [a0+b0, a1+b1, · , · ] A: [ Σ0 , Σ1 , · , · ]
B: [b0,b1,b2,b3] ──► B: [ · , · , a2+b2,a3+b3] B: [ · , · , Σ2 , Σ3 ]
Leaf 1 Leaf 1 Leaf 1
C: [c0,c1,c2,c3] C: [c0+d0, c1+d1, · , · ] C: [ Σ0 , Σ1 , · , · ]
D: [d0,d1,d2,d3] ──► D: [ · , · , c2+d2,c3+d3] D: [ · , · , Σ2 , Σ3 ]
Payload per rank: M Payload per rank: M·L/N Sub-AR pairs (one AR per
= M/2 (leaf-partial sums chunk-position group):
on 2 chunks) {A, C} sum chunks 0, 1
{B, D} sum chunks 2, 3
Phase 3: intra-leaf AG (exchange chunks within each leaf)
[runs on NVSwitch fabric — mirrors Phase 1]
Leaf 0: A ↔ B exchange A: [Σ0, Σ1, Σ2, Σ3]
B: [Σ0, Σ1, Σ2, Σ3]
Leaf 1: C ↔ D exchange C: [Σ0, Σ1, Σ2, Σ3]
D: [Σ0, Σ1, Σ2, Σ3]
AR complete: every rank holds the full Σ.
Decoding “sub-AR”. The phase-2 AR runs on a sub-group of $L$ ranks (one per leaf), not on all $N$. There are $N/L$ such sub-groups — one per chunk-position — and they run concurrently. Each sub-AR completes the reduction for its chunk positions across leaves: if A holds the leaf-0 partial of chunks 0–1 and C holds the leaf-1 partial of the same chunks, AR on {A, C} produces the full $N$-way reduction for chunks 0–1. The trick is that RS in phase 1 shrinks the “which chunks each rank owns” set to $L/N$ of the full vector, so the phase-2 AR only needs to operate on that narrow slice — paying $L$-ranks α-cost on an $ML/N$-byte payload instead of $N$-ranks α-cost on $M$ bytes. Phase 3 AG then broadcasts the fully-reduced slices back within each leaf so every rank ends up with all $M$ bytes.
Cost formula. Summing the three phases:
$$t_{\mathrm{AR}} \;=\; t_{\mathrm{RS,inner}}\!\left(\tfrac{N}{L}, M\right) \;+\; t_{\mathrm{AR,outer}}\!\left(L, \tfrac{ML}{N}\right) \;+\; t_{\mathrm{AG,inner}}\!\left(\tfrac{N}{L}, M\right)$$
Each term instantiates the primitive cost formulas from 01_collective_algorithms.md / 02_topology_mapping.md with the matching tier’s $(\alpha, \mathrm{BW})$. For $k > 2$ tiers, the outer AR term is itself hierarchical — recursion down the tree until the innermost tier is reached.
Inner-tier cost in composed deployments. When the Clos leaf port attaches a whole scale-up pod (e.g., an NVL72 with 72 GPUs behind one NVSwitch) rather than a single GPU, the “inner tier” in the hierarchical schedule is that scale-up fabric: $t_{\mathrm{RS,inner}}$ and $t_{\mathrm{AG,inner}}$ instantiate the star (or torus, or mesh) cost from 02_topology_mapping.md §5.1 at the pod’s $(\alpha_{\mathrm{inner}}, \mathrm{BW}{\mathrm{inner}})$; the outer AR phase instantiates the Clos AR at $(\alpha{\mathrm{outer}}, \mathrm{BW}{\mathrm{outer}})$, where $\alpha{\mathrm{outer}}$ resolves to $\alpha_{\mathrm{leaf}}$, $\alpha_{\mathrm{spine}}$, or a mix per the §1.2 catalog. Intra-pod α-β cost is part of the hierarchical total by construction — not an overhead layered on top. The same holds for AG, RS, BC, Reduce: each row of the summary table below evaluates its inner term at the scale-up fabric’s cost.
The $\alpha$ and BW structure. For 2-tier with ring-on-ring:
$$t_{\mathrm{AR}}^{\mathrm{2\text{-}tier}} \approx 2\!\left(\tfrac{N}{L} - 1\right)\alpha_{\mathrm{inner}} + 2(L-1)\alpha_{\mathrm{outer}} + \frac{2(N-1)}{N}\cdot\frac{M}{\mathrm{BW}_{\mathrm{bottleneck}}}$$
where $\mathrm{BW}{\mathrm{bottleneck}} = \min(\mathrm{BW}{\mathrm{inner}}, \mathrm{BW}{\mathrm{outer}} \cdot N/(L \cdot \text{cross-tier share}))$ accounts for the lower-throughput tier dominating BW. In practice $\mathrm{BW}{\mathrm{outer}} \ll \mathrm{BW}{\mathrm{inner}}$ (IB at 50 GB/s vs NVLink at 900 GB/s), so $\mathrm{BW}{\mathrm{bottleneck}} \approx \mathrm{BW}_{\mathrm{outer}}$ once the outer tier participates.
AG, RS, BC, Reduce drop out. The three-phase pattern above already contains AG and RS as its two halves: standalone AG runs inner AG → outer AG (phase 3 logic only, no reduction); standalone RS runs outer RS → inner RS (phase 1 logic, time-reverse). BC and Reduce cascade trivially across tiers — no reduction to amortize. The full summary:
| Primitive | 2-tier composition | α term | BW term (at $s=1$, uniform BW) |
|---|---|---|---|
| AR | inner RS → outer AR → inner AG | $2(N/L - 1)\alpha_{\mathrm{inner}} + 2(L - 1)\alpha_{\mathrm{outer}}$ | $2(N-1)/N \cdot M/\mathrm{BW}$ |
| AG | inner AG → outer AG | $(N/L - 1)\alpha_{\mathrm{inner}} + (L - 1)\alpha_{\mathrm{outer}}$ | $(N-1)/N \cdot M/\mathrm{BW}$ |
| RS | outer RS → inner RS | $(L - 1)\alpha_{\mathrm{outer}} + (N/L - 1)\alpha_{\mathrm{inner}}$ | $(N-1)/N \cdot M/\mathrm{BW}$ |
| BC | outer BC → inner BC | $\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} + \lceil\log_2(N/L)\rceil\,\alpha_{\mathrm{inner}}$ | $M/\mathrm{BW}$ (pipelined) |
| Reduce | inner Reduce → outer Reduce | $\lceil\log_2(N/L)\rceil\,\alpha_{\mathrm{inner}} + \lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}}$ | $M/\mathrm{BW}$ (pipelined) |
(Plug in $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{leaf}}$ for small deployments where pods share a rail-ToR, $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}}$ for production deployments spanning multiple rail-ToRs, or a weighted mix when the schedule has both same-ToR and spine-traversing hops; see §1.2.)
For oversubscription $s > 1$ at any tier boundary, multiply the BW term by $s$ on traffic crossing that boundary — equivalent to $\eta_\beta \approx 1/s$ in 05_contention_and_congestion.md §4.2. AG and RS cost half of AR on both α (one pass, not two) and BW (one RS-style telescoping, not the doubled AR round-trip); BC and Reduce stay at the $M/\mathrm{BW}$ pipelined ceiling regardless of tier count, with α scaling log-depth on each tier.
Case study: NVL72 + IB SuperPOD (AR). Using the shared terminology values, take a 2-pod deployment ($L = 2$, $N = 144$, $M = 16\,\mathrm{MB}$, ring-on-ring schedule). At this scale the two pods share a rail-ToR ($p = 2$, well within the $p \leq 4$ budget), so $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{leaf}} \approx 2\,\mu$s for the outer phase:
- Inner RS (ring on 72 GPUs over NVSwitch): $(N_{\mathrm{inner}}-1)\,\alpha_{\mathrm{inner}} = 71 \cdot 0.5 = \mathbf{35.5\,\mu s}$ α; $\frac{71}{72}\cdot\frac{M}{\mathrm{BW}_{\mathrm{inner}}} \approx \mathbf{17.5\,\mu s}$ BW.
- Outer sub-AR (ring on 2 pods through the shared rail-ToR): payload has telescoped to $ML/N = 0.22\,\mathrm{MB}$; $2(L-1)\,\alpha_{\mathrm{outer}} = 2 \cdot 1 \cdot 2 = \mathbf{4\,\mu s}$ α + $\frac{2(L-1)}{L}\cdot\frac{ML/N}{\mathrm{BW}_{\mathrm{outer}}} \approx \mathbf{4.4\,\mu s}$ BW.
- Inner AG: mirrors inner RS — 35.5 μs α + 17.5 μs BW.
Hierarchical total: ≈ 75 μs α + 39 μs BW ≈ 114 μs.
How does this compare to two hypothetical flat-ring baselines at the same $N = 144$?
| Schedule (all at $N = 144$, $M = 16\,\mathrm{MB}$) | α (μs) | BW (μs) | Total |
|---|---|---|---|
| Flat ring on a hypothetical 144-GPU NVSwitch domain (NVLink doesn’t actually span pods, but assume it could): $2(N-1)\,\alpha_{\mathrm{inner}} + 2(N-1)/N \cdot M/\mathrm{BW}_{\mathrm{inner}}$ | 143 | 35 | ~178 |
| Hierarchical NVLink + IB (this case study) | 75 | 39 | ~114 |
| Flat ring forced over IB only (ignore NVLink): $2(N-1)\,\alpha_{\mathrm{spine}} + 2(N-1)/N \cdot M/\mathrm{BW}_{\mathrm{outer}}$ | 2,288 | 636 | ~2,920 |
Two takeaways:
- Hierarchical beats flat-NVSwitch (114 vs 178 μs) even on the same fast fabric. The α saving comes from running $N/L$ parallel inner rings of length $N/L - 1$ instead of one ring of length $N - 1$. The α formula goes from $2(N-1)\,\alpha$ to $2(N/L - 1)\,\alpha + 2(L - 1)\,\alpha_{\mathrm{outer}}$ — at $L = 2$ this is $286\alpha \to 144\alpha$ on a uniform-α basis, a ~50% reduction before the outer tier even gets involved. The “shorter parallel rings” effect; optimal $L \approx \sqrt{N}$ gives roughly $\sqrt{N}$ speedup on the α term.
- Hierarchical beats flat-IB by ~26×, almost entirely on α, because 71 of the 75 μs of α work runs on NVSwitch at $\alpha_{\mathrm{inner}} = 0.5\,\mu$s instead of on IB at $\alpha_{\mathrm{spine}} = 8\,\mu$s, and payload telescoping cuts what crosses the outer tier from $M$ down to $M/72$.
So hierarchical wins on two distinct axes: shorter parallel inner rings (works on any fabric, beats flat-NVSwitch) plus payload telescoping when the outer tier is slower (turns the IB-side BW from 636 μs into 4.4 μs). At production scale ($L = 32$, pods spanning multiple rail-ToRs so $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}}$), the gap to flat baselines widens further, and SHARP on the spine (§3.1) collapses the outer α to $\sim 2\alpha_{\mathrm{switch}}$.
2.2 A2A on hierarchies — the outlier
A2A is the one primitive where hierarchical composition doesn’t help. No reduction or replication semantics to exploit across tiers — every source-destination pair carries a distinct payload, and the cross-tier permutation traffic equals the full $(N-1) \cdot (1 - 1/N) \cdot M$ bytes regardless of schedule. The BW bound is the outermost tier’s bisection; no hierarchical schedule can beat it. If the outer tier is oversubscribed at $s$ or has per-leaf BW below per-endpoint BW, A2A realized BW drops by the full factor. Unlike §2.1, A2A’s cost itemizes per destination — each pairwise send pays its destination’s actual path latency rather than an average — so all three α values from the §2 shared terminology ($\alpha_{\mathrm{inner}}$, $\alpha_{\mathrm{leaf}}$, $\alpha_{\mathrm{spine}}$) appear simultaneously in its cost formula.
Cost formula. Implementations ship as pairwise direct-send: each rank serializes $N{-}1$ outgoing messages of size $M/N$ through its outbound port. Each send pays its destination’s actual path cost ($\alpha + (M/N)/\mathrm{BW}$); the total is a destination-weighted sum. In a real NVL72 + IB SuperPOD, peers split across three distance classes (counts shown for $L$ pods total, $p$ pods sharing a rail-ToR, $N_{\mathrm{inner}}$ GPUs per pod):
| Peer count | Destination class | Path α | Path BW |
|---|---|---|---|
| $N_{\mathrm{inner}} - 1$ | Intra-pod (NVSwitch scale-up) | $\alpha_{\mathrm{inner}}$ | $\mathrm{BW}_{\mathrm{inner}}$ |
| $(p - 1)\,N_{\mathrm{inner}}$ | Same-leaf cross-pod (through shared rail-ToR) | $\alpha_{\mathrm{leaf}}$ | $\mathrm{BW}_{\mathrm{outer}}$ |
| $(L - p)\,N_{\mathrm{inner}}$ | Cross-leaf cross-pod (through IB spine) | $\alpha_{\mathrm{spine}}$ | $\mathrm{BW}_{\mathrm{outer}}$ |
The α and BW terms separate out as:
$$t_\alpha^{\mathrm{A2A}} \;=\; (N_{\mathrm{inner}} - 1)\,\alpha_{\mathrm{inner}} \;+\; (p - 1)\,N_{\mathrm{inner}}\,\alpha_{\mathrm{leaf}} \;+\; (L - p)\,N_{\mathrm{inner}}\,\alpha_{\mathrm{spine}}$$
$$t_{\mathrm{BW}}^{\mathrm{A2A}} \;=\; (N_{\mathrm{inner}} - 1)\,\frac{M/N}{\mathrm{BW}{\mathrm{inner}}} \;+\; (N - N{\mathrm{inner}})\,\frac{M/N}{\mathrm{BW}_{\mathrm{outer}}} \cdot s$$
Intra-pod α contributes as a summand over same-pod peers — not a literal “intra-pod A2A added on top of cross-pod A2A,” but a single $N{-}1$-send schedule split by destination class. The inner tier’s high BW only helps on the $N_{\mathrm{inner}} - 1$ intra-pod sends; the rest pays $\mathrm{BW}_{\mathrm{outer}}$, slowed further by any outer-tier oversubscription $s$.
Worst case (all peers cross-leaf, e.g., MoE EP spread across the full SU): $t_\alpha \to (N{-}1)\,\alpha_{\mathrm{spine}}$ — the dominant scenario at production scale. Best case (all peers same-pod): $t_\alpha \to (N{-}1)\,\alpha_{\mathrm{inner}}$ — only feasible when $N \leq N_{\mathrm{inner}} = 72$.
Single-GPU endpoint Clos (the simpler abstract case where each leaf-ToR port holds one GPU, no scale-up pods). The intra-pod term drops out and the formula reduces to $t_\alpha = (N/L - 1)\,\alpha_{\mathrm{leaf}} + (N - N/L)\,\alpha_{\mathrm{spine}}$ where $L$ counts leaf-ToRs and $N/L$ counts GPUs per leaf-ToR. BW = $(N{-}1)/N \cdot M/\mathrm{BW}_{\mathrm{outer}} \cdot s$, uniform across all sends.
Case study: NVL72 + IB SuperPOD (A2A). Using the shared terminology values, take the same $L = 2$ pods sharing rail-ToRs as the §2.1 case study ($p = 2$, $N = 144$, $M = 16\,\mathrm{MB}$, so $M/N \approx 0.111\,\mathrm{MB}$). A2A’s per-destination accounting splits the $N-1 = 143$ peers into two distance classes here (no $\alpha_{\mathrm{spine}}$ peers because $p = L$). Each rank serializes 143 pairwise sends:
- 71 intra-pod sends over NVLink: $71 \cdot \!\bigl(\alpha_{\mathrm{inner}} + (M/N)/\mathrm{BW}_{\mathrm{inner}}\bigr) = 71 \cdot (0.5 + 0.12)\,\mu$s ≈ 44 μs.
- 72 same-leaf cross-pod sends over IB through shared ToR: $72 \cdot \!\bigl(\alpha_{\mathrm{leaf}} + (M/N)/\mathrm{BW}_{\mathrm{outer}}\bigr) = 72 \cdot (2 + 2.2)\,\mu$s ≈ 304 μs.
Hierarchical A2A total ≈ 348 μs.
How does this compare to flat-A2A baselines at the same $N = 144$?
| Schedule ($N = 144$, $M = 16\,\mathrm{MB}$) | α (μs) | BW (μs) | Total |
|---|---|---|---|
| Flat A2A on hypothetical 144-GPU NVSwitch (NVLink can’t actually span pods): $(N{-}1)\,\alpha_{\mathrm{inner}} + (N{-}1)\,(M/N)/\mathrm{BW}_{\mathrm{inner}}$ | 71.5 | 17.6 | ~89 |
| Hierarchical A2A on NVL72 + IB at $L=2$ same rail-ToR (this case study) | 179 | 169 | ~348 |
| Hierarchical A2A on NVL72 + IB at $L=2$ different rail-ToRs (spine traversal): $\alpha_{\mathrm{leaf}}$ replaced by $\alpha_{\mathrm{spine}}$ for cross-pod sends | 611 | 169 | ~780 |
| Flat A2A on IB only (no NVLink, every peer via spine): $(N{-}1)\,\alpha_{\mathrm{spine}} + (N{-}1)\,(M/N)/\mathrm{BW}_{\mathrm{outer}}$ | 1,144 | 318 | ~1,460 |
Lesson: don’t pull A2A out of the inner domain. Unlike AR, A2A loses every time it crosses a tier boundary. AR’s hierarchical decomposition saves on both α (parallel inner rings) and BW (payload telescoping); A2A has neither lever — every send carries a distinct payload (no reduction to amortize, no telescoping), and serial pairwise sends pay full α per cross-pod hop (no shorter-rings shortcut). On the same $N = 144$, flat-NVSwitch A2A (89 μs) is ~4× faster than the same-ToR hierarchical version (348 μs) and ~9× faster than the cross-ToR variant (780 μs).
This is why MoE expert parallelism placement on the inner-most fabric (NVL72 NVSwitch within a single pod) is the production rule whenever it fits. NCCL’s same-rail / same-ToR preference for A2A-heavy traffic mitigates the penalty when EP must span pods, but no scheduling trick recovers the full intra-pod cost — every GPU pair pulled out of NVLink roughly quadruples its A2A contribution.
2.3 When hierarchical helps, when it hurts
Hierarchical composition helps when:
- BW is tier-mismatched. If the inner tier is 10–20× faster than the outer (NVLink vs InfiniBand), doing as much of the reduction as possible inside the inner tier shrinks the outer tier’s payload from $M$ to $ML/N$. For a 2-tier system with $L = 16$ inner groups and $N = 1024$, the outer tier carries only $M/64$ of the original payload — a 64× reduction in cross-tier BW demand.
- Latency is tier-mismatched. Inner-tier α is ~0.5 μs (NVSwitch), outer-tier α is ~2–5 μs (IB), so doing most algorithmic hops on the inner tier wins both on the α term and on the BW term.
Hierarchical composition hurts when:
- Collective is A2A (see §2.2 — no decomposition gain).
- Outer tier is fast enough to do a flat AR. A deployment entirely inside one NVL72 pod has no outer-tier penalty, and flat star AR (DBT / INC) beats any hierarchical scheme.
- Group shape doesn’t factor nicely. If $N = 513$ doesn’t divide into a clean (inner, outer) factorization, hierarchical phases pay padding overhead.
- Outer tier oversubscription $s$ is large. See §3 — high $s$ compresses the hierarchical advantage.
3. INC and contention in hierarchies
In a multi-tier hierarchy, every reduction-or-replication primitive accumulates α at both tiers (§2.1’s summary table). At production scale ($N_{\mathrm{inner}} = 72$, $L = 32$, $\alpha_{\mathrm{inner}} = 0.5\,\mu$s, $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}} = 8\,\mu$s, software DBT), AR pays ~7 μs of inner α + ~80 μs of outer α; AG/RS/BC/Reduce pay half each. The outer term is the biggest single contributor in absolute terms, but the inner term is not negligible — and once the outer term is compressed, the inner term becomes the new bottleneck.
The single biggest lever to compress these costs is in-network collectives (INC): pushing reduction or multicast logic into the switch ASIC itself, so the affected phase pays $\sim 2\,\alpha_{\mathrm{switch}}$ (a few hundred ns of switch cut-through) instead of $O(\log L)$ endpoint-driven rounds. INC applies at both tiers of the hierarchy, and both are valuable:
- Outer-tier INC (Quantum SHARP on IB Quantum-2 / X800, Spectrum-X SHARP on Ethernet, Tomahawk Ultra-based fabrics) sees the biggest absolute α saving — collapses ~80 μs of outer DBT down to ~1 μs at $L = 32$. This is what gives hierarchical AR at $L = 32$ pods its production-relevance.
- Inner-tier INC (NVLink SHARP / NVLS on NVSwitch Gen4 within an NVL72 pod) is equally important for two reasons: (i) it compresses the inner α from ~7 μs to ~0.4 μs, which becomes the dominant residual once outer INC is deployed; (ii) it lifts intra-pod AR’s $\mathrm{BW_{eff}}$ from $\mathrm{BW}/2$ to $\mathrm{BW}$ — a measured ~1.3× BW win (470+ GB/s vs ~360 GB/s on H100, [NVLINK-SHARP]) that the outer tier can’t replicate because the dual-touch pattern only manifests on endpoint-driven trees.
The two compose: a deployment with INC at both tiers gets the outer α collapse plus the inner BW lift simultaneously. INC is hardware-gated; see 04_in_network_collectives.md §2 for per-fabric mechanics. This section focuses on how INC composes with the hierarchical schedules of §2.
Two cross-cuts:
- §3.1: SHARP at either or both tiers collapses the corresponding phase of every reduction-or-replication primitive — uniform structural saving across AR, AG, RS, BC, Reduce. A2A is the lone exception (no fit on traditional INC hardware; dedicated HW A2A on Rubin / Tomahawk Ultra is a separate primitive).
- §3.2: per-tier η coefficients (NVSwitch crossbar contention, IB Clos oversubscription) feed back into §2.1’s hierarchical formulas, modifying realized cost without changing the algorithm shape.
3.1 SHARP composes at any tier of the hierarchy
The SHARP family collapses $n_\alpha$ to ~2 on its host fabric (04_in_network_collectives.md §1). In a hierarchical schedule (§2.1), SHARP can be installed independently at the inner tier, the outer tier, or both — each replacing its tier’s endpoint-driven phase with an INC operation at switch cut-through latency $\alpha_{\mathrm{switch}}$. Tiers without INC continue to use software DBT or ring.
Per-tier SHARP impact on the α term, at production scale ($N_{\mathrm{inner}} = 72$, $L = 32$, $\alpha_{\mathrm{inner}} = 0.5\,\mu$s, $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}} = 8\,\mu$s; switch latencies $\alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.2\,\mu$s for NVLS, $\alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 0.5\,\mu$s for Quantum SHARP):
| Primitive | Inner α (software DBT) | Inner α (NVLS) | Outer α (software DBT) | Outer α (Quantum SHARP) |
|---|---|---|---|---|
| AR | $2\lceil\log_2 N_{\mathrm{inner}}\rceil\,\alpha_{\mathrm{inner}} \approx 7\,\mu$s | $\sim 2\,\alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.4\,\mu$s | $2\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} \approx 80\,\mu$s | $\sim 2\,\alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 1\,\mu$s |
| AG | $\lceil\log_2 N_{\mathrm{inner}}\rceil\,\alpha_{\mathrm{inner}} \approx 3.5\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.2\,\mu$s | $\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} \approx 40\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 0.5\,\mu$s |
| RS | $\lceil\log_2 N_{\mathrm{inner}}\rceil\,\alpha_{\mathrm{inner}} \approx 3.5\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.2\,\mu$s | $\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} \approx 40\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 0.5\,\mu$s |
| BC | $\lceil\log_2 N_{\mathrm{inner}}\rceil\,\alpha_{\mathrm{inner}} \approx 3.5\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.2\,\mu$s | $\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} \approx 40\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 0.5\,\mu$s |
| Reduce | $\lceil\log_2 N_{\mathrm{inner}}\rceil\,\alpha_{\mathrm{inner}} \approx 3.5\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{inner}} \approx 0.2\,\mu$s | $\lceil\log_2 L\rceil\,\alpha_{\mathrm{outer}} \approx 40\,\mu$s | $\sim \alpha_{\mathrm{sw}}^{\mathrm{outer}} \approx 0.5\,\mu$s |
| A2A | $(N_{\mathrm{inner}}{-}1)\,\alpha_{\mathrm{inner}} \approx 35.5\,\mu$s | No NVLS fit on Gen4* | — (no decomposition; §2.2) | No fit on traditional SHARP |
*NVSwitch Gen4 NVLS supports AR / AG / BC but not A2A; Rubin-generation NVSwitches are expected to extend to A2A, and Tomahawk Ultra adds HW A2A on Ethernet today (04_in_network_collectives.md §1.2).
Three structural points:
- Both tiers benefit from INC, in different ways. Outer-tier INC dominates absolute α savings (80 → 1 μs is the biggest single move at $L = 32$). But once outer INC is in place, inner-tier α becomes the new dominant cost (7 of 8 μs total residual). Inner-tier NVLS compresses that residual to ~0.4 μs — the only lever that brings hierarchical AR within striking distance of single-pod cost.
- AR uniquely compounds α and BW wins, and the BW win is inner-tier-specific. Endpoint-driven AR pays the $\mathrm{BW_{eff}} = \mathrm{BW}/2$ dual-touch penalty; SHARP reduces in the switch ALU and eliminates the pattern. NVLS-measured intra-pod AR busbw rises from ~360 GB/s to 470+ GB/s (~1.3×) on H100 [NVLINK-SHARP]. The outer tier inherits the same BW lift on its phase, but the inner tier’s lift is what matters at large $M$ because the outer payload is already telescoped to $ML/N$. AG / RS / BC / Reduce never had the dual-touch penalty, so SHARP gives them only the α collapse at either tier.
- A2A has no INC path on traditional SHARP at either tier. Rubin-gen NVSwitches and Tomahawk Ultra add dedicated HW A2A as a separate primitive (efficiency win, not the same structural collapse), but on current-shipping NVL72 + Quantum-X800, A2A stays software-scheduled.
Composing both tiers. A deployment with NVLS + Quantum SHARP at $L = 32$ pays roughly $0.4 + 1 = 1.4\,\mu$s of α total + ~9 μs BW (telescoped outer + NVLS-lifted inner) ≈ ~10 μs end-to-end — essentially as fast as a single-pod NVLS AR despite spanning 32 pods. Outer-only SHARP (without NVLS) gets to ~$7 + 1 = 8\,\mu$s α + ~13 μs BW ≈ ~21 μs; the NVLS contribution is the difference between “10 μs” and “21 μs” — half the remaining cost. That’s why NVLS + Quantum SHARP is the standard combo on production GB200 SuperPODs.
3.2 Per-tier η in a hierarchical schedule
The per-tier $\eta$ profile from 05_contention_and_congestion.md §4.2 feeds directly into §2.1’s hierarchical formulas: each phase uses its own tier’s $(\eta_\alpha, \eta_\beta)$ when computing realistic cost. For a 2-tier AR with oversubscription $s > 1$ at the outer tier:
$$t_{\mathrm{AR,realistic}}^{\mathrm{2\text{-}tier}} \;=\; t_{\mathrm{RS,inner}}^{(\eta_\alpha^\mathrm{inner},\, \eta_\beta^\mathrm{inner})} \;+\; t_{\mathrm{AR,outer}}^{(\eta_\alpha^\mathrm{outer},\, \min(\eta_\beta^\mathrm{hw},\, 1/s))} \;+\; t_{\mathrm{AG,inner}}^{(\eta_\alpha^\mathrm{inner},\, \eta_\beta^\mathrm{inner})}$$
The “inner” and “outer” labels here match the tier labels from §2.1’s hierarchical schedule (NVSwitch and IB Clos respectively) — not the path-specific α subscripts (α_leaf, α_spine) introduced in §1.2 for A2A’s per-destination accounting.
Typical per-tier η values for NVL72 + IB SuperPOD (from 05_contention_and_congestion.md §4.2):
| Tier | Role | $\eta_\alpha$ | $\eta_\beta$ at $s = 1$ | $\eta_\beta$ at $s = 2$ | $\eta_\beta$ at $s = 4$ |
|---|---|---|---|---|---|
| Inner (NVSwitch within pod) | Intra-pod scale-up | ≈ 1.00 | ≈ 0.80 | n/a (NVSwitch typically not oversubscribed) | n/a |
| Outer (IB Quantum-2 / X800 Clos) | Cross-pod scale-out | ≈ 1.10–1.30 | ≈ 0.80 | $\min(0.80, 0.50) = 0.50$ | $\min(0.80, 0.25) = 0.25$ |
The outer-tier $\eta_\beta$ is bounded by $\min(\eta_\beta^{\mathrm{hw}}, 1/s)$: oversubscription pins it at $1/s$ once $s > 1.25$. The outer-tier $\eta_\alpha$ inflates additively from cross-tier queueing under load (1.10–1.30 typical, climbs to 2–3× past 80% utilization). The inner-tier coefficients are deployment-stable because NVSwitch is engineered for full bisection.
Implication. Oversubscription at the outer tier makes that tier relatively more expensive for BW-heavy primitives — the outer phase’s BW term inflates by $s$ regardless of which collective is running. Any tier-assignment optimizer (e.g., a partition-strategy search over hierarchical cost) must feed $\eta_\beta^{\mathrm{outer}}(s)$ into the formula, not just the ideal $(\alpha, \mathrm{BW})$, or it will miss-rank deployments where oversubscription matters.
Interaction with SHARP. SHARP-on-outer (§3.1) collapses the outer-α term but does not sidestep the tier-BW cap from $s$ — oversubscription still binds the outer BW side regardless of INC, because the switch ALU can only forward at the tier’s aggregate bandwidth. This is why production SHARP-enabled deployments often run $s = 1$ at the SHARP tier even when other tiers are oversubscribed — to keep the BW-side benefit intact.
3.3 Pareto implications
Three patterns emerge from combining hierarchical cost with INC and η:
- SHARP compresses the “extra tier” penalty almost completely for AR-heavy workloads. A 3-tier Clos with SHARP at the super-spine runs AR at a cost close to 2-tier without SHARP.
- A2A has no INC path on shipping hardware (
04_in_network_collectives.md §1.1— no reduction or replication semantics). EP A2A is the primitive whose cost is hardest to compress by going hierarchical — which is why MoE inference workloads are so sensitive to tier topology choice. - Contention η compresses Pareto margins, not rankings. Under realistic η, the winning hierarchical schedule typically doesn’t change — the gap to the runner-up just narrows. This matches the observation in
05_contention_and_congestion.md §5.3that realistic η tightens every margin without reshuffling winners.
Appendix A: Worked rail-optimized NVL72 SuperPOD topology (L = 32)
§1.1’s case study introduced the rail-optimized layout (one rail-ToR per NIC position in a compute tray; each NVL72 striped across 4 rails). This appendix works through the actual switch counts and Clos topology for a production-scale $L = 32$ NVL72 + Quantum-X800 deployment — the canonical “GB200 SuperPOD” anchor referenced from 04_in_network_collectives.md §3.
The build-up walks bottom-up. A.1 the smallest physical layers — compute tray and the NVL72 rack (pod). A.2 the pod ↔ leaf-ToR connection (fan-out and inverse view, rail-optimized). A.3 the leaf ↔ spine connection — what “full bipartite” actually means in cables, and why the leaf-spine pair count is $8 \times 4$ but the cable count is $8 \times 4 \times 18$. A.4 puts everything together at the cluster scale (per-rail topology + switch counts + 4-rail independence).
Cluster parameters.
- $L = 32$ NVL72 pods, $N_{\mathrm{inner}} = 72$ GPUs per pod → $N = 2304$ GPUs total.
- 4 rails (one per NIC position in a GB200 compute tray; 18 trays per NVL72, so 18 rail-$i$ NICs per pod).
- Quantum-X800 Q3400 leaf and spine switches: 144 ports at 800 Gbps XDR each.
- Full bisection ($s = 1$) within each rail’s fat-tree.
A.1 The compute tray and the NVL72 rack (pod)
This subsection grounds the smallest two physical layers — (1) compute tray and (2) NVL72 pod = 1 rack — before the rest of the appendix walks up to the cluster scale.
(1) Compute tray. The smallest replicable unit is a 1U GB200 compute tray with 4 GPUs and 4 NICs (one NIC per GPU, mounted next to it on the tray). Each GPU has two physically distinct egress paths off the tray:
GB200 compute tray (1 of 18 per NVL72 rack — 4 GPUs / 4 NICs):
to NVSwitch trays in this rack (NVLink, rack-internal — every GPU has its own)
▲ ▲ ▲ ▲
║ ║ ║ ║
┌────────╫────────────╫────────────╫────────────╫─────┐
│ ║ ║ ║ ║ │
│ GPU_0 GPU_1 GPU_2 GPU_3 │
│ │ │ │ │ │ PCIe Gen5
│ NIC_0 NIC_1 NIC_2 NIC_3 │ (1 NIC/GPU)
│ │ │ │ │ │
└────────┼────────────┼────────────┼────────────┼─────┘
▼ ▼ ▼ ▼
(rail 0) (rail 1) (rail 2) (rail 3)
4 IB cables out of the tray, one per rail, up to the IB ToR layer
Rail rule (cluster-wide). GPU_$i$’s NIC always lands on rail $i$, and this rule is identical across every tray (0–17) of every pod in the cluster (NVL72_0, NVL72_1, …). So slot $i$ across every tray in every pod forms a single rail-$i$ communicator of $L$ GPUs (32 GPUs at $L = 32$) — one GPU per pod, all reachable through rail-$i$’s IB fabric without ever leaving the rail. A GPU’s rail is set by its tray-slot position, not by which tray or which pod it lives in. (Cross-rail traffic — GPU_$i$ ↔ GPU_$j$ for $i \neq j$ across pods — uses NVLink as a rail bridge; see §1.1’s “Cross-pod, cross-rail” bullet.)
(2) NVL72 pod = 1 rack. The full NVL72 is one physical rack containing 18 compute trays + 9 NVSwitch trays plus power/management hardware. The 9 NVSwitch trays form the NVLink fabric that connects all 72 GPUs intra-pod — every GPU’s NVLink ports cable to the NVSwitch trays, forming a rack-internal switched fabric. No NVLink cable leaves the rack. The 72 NIC outputs (4 NICs/tray × 18 trays) all exit upward toward the IB ToR layer:
NVL72 pod = 1 rack (72 GPUs / 9 NVSwitch trays / 72 NICs out):
╔══════════════════ NVL72 rack ══════════════════╗
║ ║
║ compute tray 0: [GPU_0|GPU_1|GPU_2|GPU_3] ║
║ compute tray 1: [GPU_0|GPU_1|GPU_2|GPU_3] ║
║ ... (more compute trays) ║
║ ║
║ ──────── 9 NVSwitch trays ──────── ║
║ (NVLink fabric for all 72 GPUs; ║
║ rack-internal, no cable exits) ║
║ ────────────────────────────── ║
║ ║
║ ... (more compute trays) ║
║ compute tray 17: [GPU_0|GPU_1|GPU_2|GPU_3] ║
║ ║
╚═════════════════════ ║║║║ ═════════════════════╝
▼▼▼▼ 72 IB-bound NIC cables exit upward
(4 NICs/tray × 18 trays)
→ IB ToR(s) on top of rack(s)
The two physical fabrics (NVLink scale-up + IB scale-out) are the two egress paths from §1.1’s “Two independent fabrics per GPU” — what changes here is the granularity: the rail bullet labels in (1) tell you which IB rail each GPU’s NIC lands on. Every NVL72 rack contributes 18 NICs (one per tray’s slot $i$) to each of the 4 rails.
A.2 walks the next layer up — how each rack’s 72 NICs reach IB ToRs (pod → leaf). A.3 then covers leaf ↔ spine. A.4 puts everything together at full $L = 32$ cluster scale.
A.2 Pod ↔ leaf-ToR connection (rail-optimized)
Two complementary views capture how each NVL72’s 72 NICs reach IB leaf-ToRs: fan-out (one pod → 4 rail-ToRs) and aggregate / inverse (one rail-ToR ← up to 4 pods). Both are introduced in §1.1’s Layout 2; this subsection rephrases them in the appendix’s $L = 32$ context.
View 1 — Fan-out (one pod → 4 rail-ToRs). NICs at slot $i$ across every tray (one per tray × 18 trays = 18 NICs) bundle together and exit the rack to the rail-$i$ leaf-ToR; the same pattern at slots 1, 2, 3 fills the other 3 rail-ToRs. Each pod is striped across 4 rail-ToRs, contributing 18 NICs to each.
Pod-side fan-out (one NVL72 → 4 rail-ToRs):
┌────────────── NVL72_0 (72 GPUs / 72 NICs) ───────────────┐
│ │
│ tray 0: [r0] [r1] [r2] [r3] │
│ tray 1: [r0] [r1] [r2] [r3] │
│ ... │
│ tray 17: [r0] [r1] [r2] [r3] │
│ │ │ │ │ │
└─────────────────┼────────┼────────┼────────┼─────────────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──┬───┐ ┌──┬───┐ ┌──┬───┐ ┌──┬───┐
│rail-0│ │rail-1│ │rail-2│ │rail-3│ ← 4 distinct (separate) leaf-ToRs
│Leaf 0│ │Leaf 0│ │Leaf 0│ │Leaf 0│ that this pod connects to (rail-i
│ 144p │ │ 144p │ │ 144p │ │ 144p │ Leaf 0 for i = 0, 1, 2, 3). At L
└──────┘ └──────┘ └──────┘ └──────┘ = 32 each rail has 8 such leaves
(A.4); this pod hits 1 per rail
— its leaf-group's rail-i Leaf 0.
View 2 — Aggregate / inverse (one rail-ToR ← up to 4 pods). A 144-port X800 Q3400 leaf-ToR at $s = 1$ has 72 down-ports for NICs. Since each pod contributes only 18 NICs to a given rail, one leaf-ToR can absorb the rail-$i$ slice of up to 4 NVL72 pods ($72 / 18 = 4$). At $L = 32$, this rule replicates 8 times per rail: pods are partitioned into 8 leaf-groups of 4 pods each, and each leaf-group attaches to one rail-leaf-ToR per rail (pods 0–3 → rail-$i$ Leaf 0; pods 4–7 → rail-$i$ Leaf 1; …; pods 28–31 → rail-$i$ Leaf 7).
ToR-side aggregate (one rail-i leaf-ToR ← 4 NVL72 pods — repeated 8× per rail):
┌─NVL72_0─┐ ┌─NVL72_1─┐ ┌─NVL72_2─┐ ┌─NVL72_3─┐
│ rail-i │ │ rail-i │ │ rail-i │ │ rail-i │ ← only the rail-i slice
│ 18 NICs │ │ 18 NICs │ │ 18 NICs │ │ 18 NICs │ of each pod is shown;
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ the other 54 NICs/pod
│ │ │ │ land on rails j ≠ i
▼ ▼ ▼ ▼ (4 × 18 = 72 NICs total
┌────┬────────────┬────────────┬────────────┬────┐ into one rail-i leaf)
│ rail-i Leaf 0 (X800 Q3400, 144 ports) │
│ 72 down (= 4 × 18 NICs) + 72 up (→ 4 spines) │
└────────────────────────┬───────────────────────┘
│
▼ 72 up-ports — leaf-to-spine fan-out covered in A.3
The two views are reciprocal: the fan-out shows where one pod’s 4 NIC bundles land; the aggregate shows what one leaf-ToR collects. The 18 NICs each pod contributes to a rail-$i$ leaf-ToR are precisely that pod’s slot-$i$ NICs (NIC_$i$ across trays 0–17) — A.1’s rail rule, applied 4 times. So rail-i Leaf 0 is aggregating same-slot peers across 4 different pods: 18 GPU_$i$ from pod 0, 18 from pod 1, 18 from pod 2, 18 from pod 3 — 72 same-slot GPUs total. Across the cluster:
- 8 leaf-groups per rail × 4 rails = 32 leaf-groups cluster-wide; each leaf-group is one (rail, 4-pod) tuple wired to one leaf-ToR. Scaling beyond $L = 32$ adds more leaf-groups per rail (and more spines accordingly — see A.4).
- One pod participates in 4 leaf-groups simultaneously, one per rail. Pod 0’s rail-0 NICs land on rail-0 Leaf 0; its rail-1 NICs land on rail-1 Leaf 0; etc. The pod’s failure domain spans 4 leaves rather than 1 — a small intentional blast-radius decoupling vs the §1.1 Layout 1 (“pod-local ToR”) alternative.
A.3 Up-side micro-architecture: rail-leaf-ToRs → rail-spines
Each leaf-ToR’s 72 up-ports reach the spine layer. With 4 spines per rail (derived in A.4) and full bisection ($s = 1$), the 72 up-ports split into 4 groups of $72 / 4 = 18$. The key fact, easy to miss: every leaf-spine pair is connected by 18 parallel cables, not one. This is the “fattening” mechanism of the k-ary fat-tree (Appendix B): aggregate cross-tier BW grows toward the root through path multiplicity, not through link width.
Aside — is there a rule like A.2’s tray-slot rule for the 18-cable groups? No. A.2’s pod-to-leaf wiring has a structural rule rooted in hardware: the 18 NICs from a pod that go to rail-$i$ are exactly slot $i$ across all 18 trays (GPU_$i$ → rail-$i$). At the leaf-spine layer, the 4 spines are equivalent destinations within the same rail’s fabric — nothing about a packet picks one over another, and ECMP (5-tuple hashing) distributes flows across all 4 at runtime. Any partition of the leaf’s 72 up-ports into 4 groups of 18 satisfies the k-ary fat-tree property; production cabling typically uses contiguous port blocks (ports 0–17 → spine 0, 18–35 → spine 1, etc.), often aligned with switch ASIC dies for lower same-die latency, but this is install convention, not a workload constraint. (A 3-tier fat-tree adds a real rule: each spine indexed $j$ connects only to “plane $j$” of super-spines per the Al-Fares construction — see Appendix B.2 — but that’s beyond the 2-tier per-rail scope here.)
Per-leaf fan-out (one leaf, 72 up-ports to 4 spines):
┌──────── rail-i Leaf 0 (72 up-ports) ────────┐
└────┬───────────┬───────────┬───────────┬────┘
│ │ │ │
18 │ 18 │ 18 │ 18 │ (parallel cables each)
│ │ │ │
▼ ▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Spine 0 │ │ Spine 1 │ │ Spine 2 │ │ Spine 3 │
└─────────┘ └─────────┘ └─────────┘ └─────────┘
Same fan-out shape repeats for every one of the rail's 8 leaves.
Inverse view — what one spine collects. A rail’s spine has 144 down-ports. With 8 leaves per rail each contributing 18 parallel cables, the spine’s down-ports are exactly filled $8 \times 18 = 144$:
Leaf 0 Leaf 1 Leaf 2 Leaf 3 Leaf 4 Leaf 5 Leaf 6 Leaf 7
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ 18 │ │ 18 │ │ 18 │ │ 18 │ │ 18 │ │ 18 │ │ 18 │ │ 18 │
└─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘ └─┬──┘
│ │ │ │ │ │ │ │
▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ Spine j (X800 Q3400, 144 ports — all 144 used as down-ports) │
│ No up-ports: in a 2-tier Clos, the spine is the top of the fabric. │
└─────────────────────────────────────────────────────────────────────────┘
All 144 spine ports are downlinks — that’s what makes the 4 fat-trees “independent.” Each spine’s full radix is consumed serving its 8 rail-local leaves ($8 \times 18 = 144$); no port is left over for an inter-rail uplink. With no spine on rail $i$ wired to anything on rail $j \neq i$, the 4 rails are 4 physically disjoint IB fabrics by port allocation, not just by cabling convention — the spine’s port budget itself forecloses any cross-rail connection. (A 3-tier deployment would split each spine’s 144 ports as $72$ down + $72$ up to a per-rail super-spine — still rail-separated, since cross-rail traffic continues through the NVLink rail bridge described in §1.1’s “Cross-pod, cross-rail” bullet.)
A.4 Putting it together: cluster overview and switch counts
Combining A.1 (tray + rack), A.2 (pod ↔ leaf), and A.3 (leaf ↔ spine) gives the full $L = 32$ cluster: 4 independent rail fat-trees side by side, with the same 32 pods wired into all 4 rails via the rail-stripe fan-out (A.2 View 1). The macro view first, then a per-rail detail.
Cluster macro view — 4 rails + 32 shared pods:
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ rail 0 │ │ rail 1 │ │ rail 2 │ │ rail 3 │
│ 4 spines│ │ 4 spines│ │ 4 spines│ │ 4 spines│
│ 8 leaves│ │ 8 leaves│ │ 8 leaves│ │ 8 leaves│
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────── 32 NVL72 pods ─────────────────────────┐
│ Each pod's 72 NICs split 18+18+18+18 across the 4 rails. │
│ Same 32 pods are shared by all 4 rails (no replication). │
└────────────────────────────────────────────────────────────┘
Switches in different rails are physically distinct: e.g., the cluster has
4 separate X800s named "Spine 0" — one per rail; same for "Leaf 0..7".
No IB cable crosses rails (A.3). Cluster totals: 16 spines + 32 leaves.
Full cluster topology — each Spine/Leaf box below stands for 4 distinct X800 switches (one per rail), shown via the ×4 row inside each box. Pods at the bottom are shared across all 4 rails.
NVL72 SuperPOD full cluster topology at L = 32 — 2304 GPUs across 4 rails
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
Spine layer: │Spine │ │Spine │ │Spine │ │Spine │ 4 spines × 4 rails =
│ 0 │ │ 1 │ │ 2 │ │ 3 │ 16 X800 spines total
│dn:144│ │dn:144│ │dn:144│ │dn:144│ (144 ports each, all
│up: 0│ │up: 0│ │up: 0│ │up: 0│ consumed as downlinks
│ ×4 │ │ ×4 │ │ ×4 │ │ ×4 │ — no super-spine in
└─┬────┘ └─┬────┘ └─┬────┘ └─┬────┘ a 2-tier per rail)
│ │ │ │
full bipartite within each rail: every leaf ↔ every
spine, 18 parallel cables per pair at s = 1.
No IB cable crosses rails (A.3 callout above).
│ │ │ │
┌─┴────┐ ┌─┴────┐ ┌─┴────┐ ┌─┴────┐ ┌─┴────┐
Leaf layer: │Leaf 0│ │Leaf 1│ │Leaf 2│ │Leaf 3│ ... │Leaf 7│ 8 leaves × 4 rails
│ ToR │ │ ToR │ │ ToR │ │ ToR │ │ ToR │ = 32 X800 leaves total
│dn: 72│ │dn: 72│ │dn: 72│ │dn: 72│ │dn: 72│ (dn = 4 pods × 18 NICs;
│up: 72│ │up: 72│ │up: 72│ │up: 72│ │up: 72│ up = 4 spines × 18
│ ×4 │ │ ×4 │ │ ×4 │ │ ×4 │ │ ×4 │ parallel cables)
└──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘ └──┬───┘
│ │ │ │ │
│ each leaf collects 72 NICs (= 18 from each of 4 pods)
▼ ▼ ▼ ▼ ▼
┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐ ┌──────┐
│ pods │ │ pods │ │ pods │ │ pods │ ... │ pods │ pods SHARED across
│ 0-3 │ │ 4-7 │ │ 8-11 │ │12-15 │ │28-31 │ all 4 rails (no ×4):
└──────┘ └──────┘ └──────┘ └──────┘ └──────┘ each pod's 72 NICs
already split 18+18+18+18
across the 4 rails
Switch-count derivation. Per rail at $L = 32$:
- NICs to absorb per rail: $32\,\text{pods} \times 18\,\text{NICs/pod} = 576\,\text{NICs}$.
- Each X800 leaf at $s = 1$: $72$ down-ports (NICs) + $72$ up-ports (spines) = $144$ → leaves per rail = $576 / 72 = 8$ (matches A.2’s “8 leaf-groups per rail”).
- Each leaf has $72$ up-ports, distributed across $S$ spines as $72/S$ parallel links per leaf-spine pair. For non-blocking ($s = 1$) at the spine tier, $S$ must absorb all $8 \times 72 = 576$ leaf-up-port total; with 144-port spines: $S = 576 / 144 = 4$.
- Spines per rail = 4, each with $8\,\text{leaves} \times 18 = 144$ down-ports fully utilized (matches A.3’s column-sum).
Cluster totals. Switches multiply by 4 across rails; pods do not.
| Per rail | Cluster (× 4 rails) | |
|---|---|---|
| Leaf switches | 8 | 32 |
| Spine switches | 4 | 16 |
| Total X800 switches | 12 | 48 |
| Pods (shared across rails) | — | 32 (not multiplied) |
This is the topology referenced as the “production GB200 SuperPOD” anchor in 03_hierarchical_topologies.md §1.1’s case study, 03_hierarchical_topologies.md §1.2’s symbol catalog (where $\alpha_{\mathrm{outer}} = \alpha_{\mathrm{spine}} = 8\,\mu$s applies because pods span 8 leaves per rail), and in 04_in_network_collectives.md §3’s discussion of why the imaginary 512-port single-switch is hypothetical.
Appendix B: The k-ary fat-tree
§1 treats “fat-tree” and “Clos” as the general multi-tier switched family, parameterized by the oversubscription ratio $s$. In practice, virtually every AI datacenter deploys a specific realization: the k-ary fat-tree of [AL-FARES08]. This appendix gives an explicit diagram and port-by-port accounting for a small $k = 4$ instance, contrasts it with the original Leiserson (1985) fat-tree concept, and explains why the k-ary variant dominates deployment.
B.1 Traditional (Leiserson) fat-tree
Leiserson [LEIS85] proposed the fat-tree as a universal routing network for hardware-efficient supercomputing. The defining property: link bandwidth grows toward the root. Unlike a plain binary tree (where every link has the same BW and the root becomes a bottleneck), a fat-tree has “fat” branches at the top with BW scaled to match the aggregate of everything below.
Leiserson fat-tree (conceptual, BW grows toward root):
┌─────┐
│ R │ root switch
└──┬──┘
│ ═══ 4× link bandwidth
┌─────────┴─────────┐
│ │
┌──┴──┐ ┌──┴──┐
│ I1 │ │ I2 │ intermediate tier
└─┬─┬─┘ └─┬─┬─┘
│ │ ═══ 2× link │ │
┌───┘ └───┐ ┌───┘ └───┐
│ │ │ │
┌──┴─┐ ┌──┴─┐ ┌──┴─┐ ┌──┴─┐
│ E1 │ │ E2 │ │ E3 │ │ E4 │ edge tier
└──┬─┘ └──┬─┘ └──┬─┘ └──┬─┘
│ 1× │ │ │
● ● ● ● endpoints
BW scaling rule: link BW doubles at each level going up.
Aggregate BW at every tier stays constant; the root never bottlenecks.
Properties:
- Switches at different tiers can have different port counts and link speeds.
- BW scaling rule is abstract — factor-of-2 per level, factor-of-k, or any monotone increase.
- Mostly an academic / theoretical construction — not practical to build from commodity parts.
Practical limitation: requires custom (non-uniform) switch hardware. You can’t go to a vendor catalog and buy a “root” switch with 4× the BW of an “intermediate” switch — that’s not how the commodity-switch market works.
B.2 k-ary fat-tree (Al-Fares construction)
Al-Fares et al. [AL-FARES08] realized fat-tree properties using only commodity $k$-port switches — every switch in the entire fabric is identical. BW growth toward the root is achieved not by fattening individual links but by multiplying parallel paths at upper tiers.
Parameter: a single even integer $k$ (the switch radix) determines everything.
| Quantity | Formula | $k = 4$ | $k = 48$ | $k = 64$ |
|---|---|---|---|---|
| Switch radix (ports/switch) | $k$ | 4 | 48 | 64 |
| Number of pods | $k$ | 4 | 48 | 64 |
| Edge switches per pod | $k/2$ | 2 | 24 | 32 |
| Aggregation switches per pod | $k/2$ | 2 | 24 | 32 |
| Core switches | $(k/2)^2$ | 4 | 576 | 1,024 |
| Endpoints per edge switch | $k/2$ | 2 | 24 | 32 |
| Endpoints per pod | $(k/2)^2$ | 4 | 576 | 1,024 |
| Total endpoints | $k^3/4$ | 16 | 27,648 | 65,536 |
| Total switches | $5k^2/4$ | 20 | 2,880 | 5,120 |
Uniform port split per switch (every switch — edge, aggregation, core — has $k$ ports split evenly):
┌───────────────┐ k/2 "up" ports (toward higher tier)
│ │ ←──────────────────────
│ SWITCH │
│ (k ports) │ ←──────────────────────
└───────────────┘ k/2 "down" ports (toward lower tier)
This uniform k/2-up / k/2-down split enforces s = 1 at every tier:
the switch cannot receive more traffic downward than it can push upward.
Full diagram: $k = 4$ fat-tree (20 switches, 16 endpoints)
CORE tier: (k/2)² = 4 switches; each has k = 4 ports (all downward, one per pod)
┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐
│ C(1,1) │ │ C(1,2) │ │ C(2,1) │ │ C(2,2) │ 4 core switches
└──┬┬┬┬──┘ └──┬┬┬┬──┘ └──┬┬┬┬──┘ └──┬┬┬┬──┘ each with 4 ports
││││ ││││ ││││ ││││ (one per pod)
││││ 16 core↔aggregation links total (each core has one link to
││││ each pod's aggregation tier — specifically, core C(i,j)
││││ connects to aggregation switch indexed j in each pod)
▼▼▼▼ ▼▼▼▼ ▼▼▼▼ ▼▼▼▼
POD 0 POD 1 POD 2 POD 3
AGG tier (k/2 = 2 per pod; k/2 up-ports to core, k/2 down-ports to edge)
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ A₀₀│ │ A₀₁│ │ A₁₀│ │ A₁₁│ │ A₂₀│ │ A₂₁│ │ A₃₀│ │ A₃₁│
└─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘
⇅──────⇅ ⇅─────⇅ ⇅─────⇅ ⇅─────⇅
(within-pod full bipartite: each agg ↔ each edge, 4 links per pod, 16 total)
EDGE tier (k/2 = 2 per pod; k/2 up-ports to agg, k/2 down-ports to endpoints)
┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐ ┌────┐
│ E₀₀│ │ E₀₁│ │ E₁₀│ │ E₁₁│ │ E₂₀│ │ E₂₁│ │ E₃₀│ │ E₃₁│
└─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘ └─┬┬─┘
││ ││ ││ ││ ││ ││ ││ ││
ENDPOINTS: 2 per edge switch × 2 edges per pod × 4 pods = 16 total
R₀R₁ R₂R₃ R₄R₅ R₆R₇ R₈R₉ R₁₀R₁₁ R₁₂R₁₃ R₁₄R₁₅
Per-switch port accounting (every switch has $k = 4$ ports; split 2 up, 2 down):
| Switch type | 2 “up” ports connect to | 2 “down” ports connect to |
|---|---|---|
| Edge $E_{p,i}$ (pod $p$, edge $i$) | Both aggregation switches in pod $p$ | 2 endpoints |
| Aggregation $A_{p,j}$ (pod $p$, agg $j$) | 2 of the 4 core switches — specifically cores $C(i, j)$ for $i \in \{1, 2\}$ | Both edge switches in pod $p$ |
| Core $C(i, j)$ | N/A (top tier) | 1 aggregation switch per pod — $k$ = 4 ports, one per pod, connecting to agg $j$ of each pod |
Wiring rule (where the magic happens): Core switches are indexed by a pair $(i, j)$ with $i, j \in \{1, \ldots, k/2\}$. Core $C(i, j)$’s $p$-th port connects to aggregation switch $j$ in pod $p$. This partitions the $(k/2)^2$ cores into $k/2$ parallel planes — each plane consists of $k/2$ cores all talking to the same aggregation index across every pod. Different planes (different $j$) carry independent cross-pod traffic, enabling ECMP load balancing.
B.3 Comparison: Leiserson vs Al-Fares
| Property | Leiserson (1985) | k-ary (Al-Fares 2008) |
|---|---|---|
| Switch uniformity | Not required — switches can differ per tier | Required — every switch has exactly $k$ ports |
| How “fat” is achieved | Wider / higher-BW links at upper tiers | More parallel paths at upper tiers (same link BW everywhere) |
| Parameterization | Abstract BW scaling rule, per tier | Single integer $k$ |
| Construction | Theoretical / conceptual | Concrete recipe buildable from catalog parts |
| Oversubscription | Not specified | $s = 1$ at every tier by construction |
| Commodity hardware | No | Yes — identical $k$-port switches |
| Is it a Clos? | Not necessarily | Yes, rearrangeably non-blocking |
| Where you find it | Academic / custom supercomputers | Ubiquitous in AI datacenters |
The two share the “fat-tree” name but achieve the fattening through different mechanisms: Leiserson by link width, Al-Fares by parallel path count. The Al-Fares version is what “fat-tree” means in datacenter practice.
B.4 Why k-ary dominates AI datacenters
- Commodity economics: single switch SKU throughout the fabric → single vendor negotiation, single firmware stream, interchangeable spares. One 48-port switch model can build fabrics from 48 endpoints (small pilot) to 27,648 endpoints (full $k^3/4$ deployment).
- Cabling predictability: every switch has the same $k/2$-up / $k/2$-down port assignment, so cabling templates repeat verbatim per pod.
- Full bisection by construction: $s = 1$ at every tier means no cross-tier BW bottleneck under permutation traffic — any pairing of endpoints can run at full link rate.
- ECMP-native: $(k/2)^2$ core switches give $(k/2)^2$ equivalent paths between any two pods; standard ECMP hashing spreads flows evenly without fabric-specific logic.
- Single-knob scaling: need more endpoints? Pick a larger $k$ (different commodity switch), rebuild. Don’t need to redesign the topology itself — the recipe is the same at any $k$.
Production examples:
| Fabric | Switch radix $k$ | Total endpoint capacity | Deployments |
|---|---|---|---|
| NVIDIA SuperPOD (Quantum-2 IB NDR) | 40 (HCA) / 64 (switch) | ~16–32K GPUs per pod | DGX SuperPOD, Meta RSC |
| NVIDIA Quantum-X800 | 144 | >500K endpoints per fabric | Blackwell-class GB200 deployments |
| Broadcom Tomahawk 5 (Ethernet) | 256–512 | Millions via 3-tier | Hyperscaler AI clusters |
| Arista 7060X / 7800 (Ethernet) | 32–128 | Varies | Meta, Microsoft, Google |
Further reading
01_collective_algorithms.md— per-primitive cost formulas (ring, DBT, RHD, Rabenseifner, pairwise); the building blocks composed here.02_topology_mapping.md— single-tier scale-up costs for star, torus, mesh; §5.1 has the per-topology cost-formula summary that §2.2 here builds on.04_in_network_collectives.md— SHARP / NVLS / Quantum SHARP / Spectrum-X SHARP mechanics; §3 above relies on the scale-out INC cost model from §2.2 there.05_contention_and_congestion.md— per-tier $\eta$ calibration (§5.1); feeds §3.2 here.references.md— primary sources on Megatron-LM parallelism (Shoeybi et al.), hierarchical collective scheduling (Thakur & Gropp, MagPie), and the Leiserson (1985) fat-tree foundational paper.
In-Network Collectives: SHARP, NVLS, and the $N_{\mathrm{hops}}$ Collapse
Author: Yue Lu
Date: April 2026
On switched fabrics — single-switch star (e.g., NVSwitch) or multi-tier star-of-stars (fat-tree / Clos) — conventional collectives pay an endpoint-driven latency term that grows with group size: $2(N-1)\alpha$ for ring AR, $2\lceil \log_2 N \rceil\alpha$ for the best software tree (DBT / RHD). Every algorithmic step is an endpoint round trip through the switch, and the $\alpha$ floor ($\sim 1\,\mu$s) is set by endpoint software — scheduling, kernel launch, NIC engine setup — paid once per step. For latency-sensitive traffic (small $M$, interactive collectives, fine-grained reductions), this $n_\alpha \cdot \alpha$ term dominates total cost, and no software schedule can eliminate it because the $\alpha$ floor is outside the algorithm’s control.
In-network collectives (INC) attack this term at its source by moving the reduction into the switch ASIC itself. Switch-resident ALUs reduce flits as they arrive and multicast the result back; endpoints see the reduced output in a single logical round trip, and the endpoint software overhead is paid once rather than $O(N)$ or $O(\log N)$ times. On a single-switch star, $n_\alpha$ collapses from $2(N-1)$ or $2\lceil \log_2 N \rceil$ down to $2$, independent of $N$. On a multi-tier switched fabric, the aggregation tree is built from switches rather than endpoints, so $n_\alpha = O(\log_r N)$ but at switch cut-through latency ($\sim$ 200-400 ns) rather than endpoint RTT ($\sim 1-3\,\mu$s) — a 5-10× per-hop saving on top of the structural collapse.
The speedup is scoped to switched fabrics: star, fat-tree / Clos. Torus-native collectives don’t benefit — their $N$-dependent $\alpha$ term comes from neighbor-router hops rather than endpoint-driven switch round trips, and there is no switch-hosted ALU in the reduce path. This note focuses on the switched-fabric case: what the hardware does, how the $O(N) \to O(1)$ latency collapse manifests at $N = 512$ on a hypothetical single-switch star (chosen for consistency with the running example in 02_topology_mapping.md §5.1 and 05_contention_and_congestion.md §5), and where the mechanism breaks down (reducible types, precision, cross-domain fallback). Three shipping implementations anchor the discussion — NVLS (NVLink SHARP on NVSwitch, single-switch star), Quantum SHARP (InfiniBand, multi-tier), and Tomahawk Ultra INC (Ethernet, emerging).
Table of Contents
- How INC speeds up collectives
- Scale-up vs scale-out
- Worked example at $N = 512$
- INC-star vs torus
- Further reading
1. How INC speeds up collectives
Conventional software collectives treat the switch as a passive forwarder: each algorithmic step is an endpoint-to-endpoint message routed through the switch’s crossbar without modification. Every step pays one endpoint-software $\alpha$ (NCCL scheduling + CUDA launch + NIC engine setup, $\sim 1\,\mu$s) plus switch cut-through ($\sim 200$ ns) plus negligible wire propagation. For AR on $N$ endpoints: ring needs $2(N-1)$ such steps, DBT / RHD needs $2\lceil \log_2 N \rceil$ (§5 of 01_collective_algorithms.md). The $\alpha$ floor is software-set and cannot be reduced by any choice of algorithm.
Software ring AR on a star — every arrow is an endpoint → switch → endpoint round trip,
with one α worth of endpoint software overhead paid at each step
R0 → switch → R1 → switch → R2 → switch → ... → R(N-1) → switch → R0
[N-1 RS steps] then [N-1 AG steps]
INC replaces the passive-forwarder assumption with an active-switch one. Three structurally distinct hardware capabilities have shipped or are emerging on production switch ASICs, each accelerating a different set of collectives:
- §1.1 — In-network reduction (switch ALU). Many in, one out semantics. Matches Reduce, RS, and the reduce-phase of AR.
- §1.2 — Switch multicast (crossbar fanout). One in, many out semantics. Matches BC, AG, and the multicast-phase of AR.
- §1.3 — Crossbar scatter-gather (HW A2A, emerging). Many in, many out via per-destination routing semantics. Matches A2A only.
§1.1 and §1.2 are the SHARP-class INC family (NVLink SHARP / NVLS, Quantum SHARP, Spectrum-X SHARP, Tomahawk Ultra) — they collapse $O(N) \to O(1)$ on $\alpha$ and (for AR alone) lift the BW ceiling. §1.3 is a structurally different primitive (Broadcom Tomahawk Ultra today, Rubin-generation NVSwitches next), giving an efficiency win on A2A’s $\alpha$ but no structural data collapse. §1.4 consolidates the per-primitive α-term and BW-term savings into a single summary table.
1.1 In-network reduction (switch ALU)
Mechanism. The switch ASIC has ALUs in its crossbar fabric alongside the port-routing logic. They consume $N$ incoming $M$-byte flits from participating ports, compute sum / max / min / bit-op elementwise at line rate, and emit one $M$-byte reduced output. The semantics — many in, one out — match Reduce, RS, and the reduce-phase of AR. They do not match AG, BC, or A2A (no reduction in their semantics).
How it accelerates each affected primitive.
Reduce — one root R* receives the sum of all N values
─────────────────────────────────────────────────────────
Software tree-Reduce:
Level 0: pairs of endpoints reduce (N/2 pairs in parallel; one α per level)
Level 1: pairs of pair-results reduce (N/4 pairs)
...
Total: ⌈log₂ N⌉ levels of endpoint round trips
INC Reduce:
R0 ─push M─►┐
R1 ─push M─►├─► [Switch ALU: elementwise sum] ─push M─► root R*
R2 ─push M─►│
... │
R(N-1) ─push M─►┘
Total: ~1 logical round trip — one push per rank, one receive at root
RS — every rank receives a distinct M/N-byte slice of the reduced result
────────────────────────────────────────────────────────────────────────
Software ring RS:
N-1 sequential ring steps, each forwarding (N-1)/N · M bytes
INC RS:
R0 ─push M─►┐ ┌─slice 0 (M/N) ─► R0
R1 ─push M─►├─► [Switch ALU: sum] ─► ├─slice 1 (M/N) ─► R1
R2 ─push M─►│ then scatter slices ├─slice 2 (M/N) ─► R2
... │ | ...
R(N-1) ─push M─►┘ └─slice N-1 (M/N) ─► R(N-1)
Total: ~2 logical round trips — push, in-fabric ALU + scatter, receive
AR’s reduce phase uses the same switch-ALU mechanism as Reduce; the M-byte reduced result feeds straight into the multicast engine (§1.2) for AR’s broadcast phase, instead of being delivered to a single root.
Improvement vs software. The α and BW effects on these primitives:
- α-term: from $O(N)$ endpoint round trips (ring) or $O(\log N)$ levels (tree) down to $\sim 2$ logical round trips through the switch, regardless of $N$. Speedup $\sim (N-1)/2$ vs ring, $\sim \log_2 N$ vs tree.
- BW-term: stays at the same per-rank ceiling as software ($M/\mathrm{BW}$ for Reduce; $(N-1)/N \cdot M/\mathrm{BW}$ for RS) — software ring already saturates the link via full-duplex forwarding. INC reduction is α-only on these two primitives.
- AR is special: composes the switch ALU + multicast xbar and additionally lifts $\mathrm{BW_{eff}}$ from $\mathrm{BW}/2$ to $\mathrm{BW}$ — see §1.4.
Commercial shipment (switch-ALU support):
- NVLS (NVLink SHARP) — NVSwitch Gen3 (H100 era) and Gen4 (B200 / GB200 / NVL72). $\alpha_{\mathrm{switch}} \approx 100$–$200$ ns. Capped at NVSwitch radix (72 GPUs in NVL72).
- Quantum SHARP — IB Quantum-2 / Quantum-X800 switches.
- Spectrum-X SHARP — NVIDIA Spectrum-X Ethernet switches.
- Tomahawk Ultra (Broadcom, shipped 2025) — first commodity-Ethernet INC ASIC, $\alpha_{\mathrm{switch}} \approx 250$ ns.
1.2 Switch multicast (crossbar fanout)
Mechanism. The switch has a multicast routing table. A single $M$-byte write to a multicast group ID is replicated by the crossbar to every destination port in the group in one switch-local operation, without traversing the aggregate ingress bandwidth $N$ times. The semantics — one in, many out — match BC, AG (which decomposes as $N$ concurrent broadcasts, one per rank’s slice), and the broadcast-phase of AR. They do not match RS, Reduce, or A2A.
How it accelerates each affected primitive.
BC — root R0 sends M bytes to all other ranks
─────────────────────────────────────────────
Software pipelined-tree BC:
Level 0: R0 sends M to one peer (one α)
Level 1: both forward M to one peer each (one α, parallel)
...
Total: ⌈log₂ N⌉ α-rounds
INC BC:
┌─► R1
R0 ─push M─► [Switch ─┤─► R2
multicast: ├─► ...
fanout] └─► R(N-1)
Total: ~1 logical round trip — single multicast write, parallel fanout
AG — every rank pushes its M/N slice; every rank receives all N slices
──────────────────────────────────────────────────────────────────────
Software ring AG:
N-1 sequential ring steps; each rank forwards its current slice + receives next
INC AG (= N concurrent broadcasts, one per source slice):
R0 ─push slice_0─► ┌─► R1, R2, ..., R(N-1) receive slice_0
R1 ─push slice_1─► ─┤─► R0, R2, ..., R(N-1) receive slice_1
R2 ─push slice_2─► switch ─► R0, R1, ..., R(N-1) receive slice_2
... multicast ...
R(N-1) ─push slice_(N-1)─► └─► R0, R1, ..., R(N-2) receive slice_(N-1)
All N multicasts pipeline through the switch's multicast engine concurrently;
every rank ends with N slices = M bytes total.
Total: ~2 logical round trips — N concurrent pushes + N concurrent receives
AR’s multicast phase uses the same switch-multicast mechanism as BC; the multicast source is the switch ALU’s reduced output (§1.1) rather than a designated root rank.
Improvement vs software.
- α-term: from $O(N)$ (ring AG) or $O(\log N)$ (pipelined tree BC) down to $\sim 2$ logical round trips. Speedup $\sim (N-1)/2$ vs ring AG, $\sim \log_2 N$ vs tree BC.
- BW-term: stays at $M/\mathrm{BW}$ per rank — software ring AG / pipelined tree BC already saturate the link in full-duplex via concurrent forward + receive. Multicast’s gain is α-only here too.
- AR is special — see §1.4.
Commercial shipment.
- SHARP-class INC — switch multicast support is bundled with switch-ALU support; the same generations from §1.1 apply. NVLS supports BC / AG / AR within an NVSwitch domain; Quantum SHARP and Spectrum-X SHARP support multicast across multi-tier IB / Ethernet fat-trees; Tomahawk Ultra supports it on Ethernet scale-up.
- PCIe switch multicast — Broadcom PEX series (formerly PLX), Microchip Switchtec, and Astera Labs PCIe switches implement the PCIe-spec MCAST capability (introduced in PCIe 2.1, ~2010): a single posted write to a multicast-group ID is replicated by the crossbar to all destination ports in one switch operation. Structurally the same as the SHARP-class multicast above, but scoped to a PCIe domain (covers BC / AG along the CPU↔GPU and GPU↔NIC paths within a host). For modern NVLink-based AI clusters this path is less central than NVLS, since intra-node GPU↔GPU traffic stays on NVLink — but PCIe multicast is a valid INC primitive for non-NVLink accelerator topologies (some AMD MI-series configurations, custom inference cards, accelerators interconnected over PCIe fabric) and was the dominant scale-up multicast option in pre-NVLink HPC clusters.
1.3 Crossbar scatter-gather — HW A2A (emerging)
Mechanism. A2A doesn’t fit either §1.1 or §1.2’s SHARP-class capabilities — there’s no reduction (every send carries a distinct payload, nothing combines) and no replication (every destination receives different data, nothing is multicast). So none of the structural $O(N) \to O(1)$ machinery from those subsections applies. But there is a third, separate INC primitive that does accelerate A2A: hardware-driven crossbar scatter-gather.
Software A2A has each rank serialize $N{-}1$ outbound sends through its single port; each send pays $\alpha$ (endpoint scheduling + kernel launch + NIC engine setup). HW A2A flips this:
- Each rank concatenates its $N{-}1$ destination chunks into one combined buffer of size $(N{-}1)\,M/N$, with per-chunk destination metadata.
- The rank submits the buffer to the switch in a single transaction (one $\alpha$).
- The switch crossbar reads the per-chunk routing tags and forwards each chunk to its destination port in parallel within the switch.
- Each destination receives its $M/N$ chunk in one switch-driven transaction.
The $N{-}1$ endpoint-driven scheduling rounds collapse to ~1 endpoint submit + 1 switch routing pass + 1 endpoint receive. Per-rank α drops from $(N{-}1)\,\alpha$ to $\sim\alpha_{\mathrm{switch}}$.
A2A — every source-destination pair carries a distinct payload
──────────────────────────────────────────────────────────────
R0's send_buffer in local memory (same data layout in software and HW A2A):
[chunk_0 (→R0)][chunk_1 (→R1)][chunk_2 (→R2)]...[chunk_(N-1) (→R(N-1))]
←─ M/N ─► ←─ M/N ─► ←─ M/N ─► ←─ M/N ─►
└──────── (N-1)·M/N bytes leave the rank ────────┘
Software pairwise A2A (N-1 separate sends per rank):
R0 ─send(chunk_1, R1)──────► switch ─► R1 [α + (M/N)/BW]
R0 ─send(chunk_2, R2)──────► switch ─► R2 [α + (M/N)/BW]
... ← N-1 endpoint-scheduled rounds
R0 ─send(chunk_(N-1),R(N-1))► switch ─► R(N-1) [α + (M/N)/BW]
Each send pulls one chunk from send_buffer; each pays a separate α
(NCCL scheduling + kernel launch + NIC engine setup).
Hardware A2A (one bulk transaction per rank):
R0 ──(submit base ptr + descriptor)──► [Switch crossbar:
(descriptor = per-chunk parses descriptor;
dest tags for chunks 1..N-1) routes each chunk in
parallel to its dest port]
─► R1 receives chunk_1
─► R2 receives chunk_2
...
─► R(N-1) receives chunk_(N-1)
Same send_buffer bytes are physically transferred; only the submission
descriptor differs (1 descriptor + 1 endpoint α, vs N-1 descriptors and α's).
Per-rank cost: ~1 submit + 1 switch routing pass + 1 receive ≈ ~α_switch
Crossbar routing structure — multicast xbar (§1.2) vs scatter-gather engine (§1.3). Both bypass the switch ALU; they differ in how the crossbar dispatches data:
Multicast (one input → N replicas; data fanout)
─────────────────────────────────────────────────
┌─► R1 [same chunk M]
│
R0 ─push chunk M─► switch ──┼─► R2 [same chunk M]
│
...
│
└─► R(N-1) [same chunk M]
Crossbar primitive: replicate one input to many outputs.
Bytes: 1 input chunk → N identical output copies.
HW blocks: multicast group table + crossbar fanout duplication logic.
Scatter-gather (one bulk input → N distinct chunks; data permutation)
─────────────────────────────────────────────────────────────────────
┌─► R1 [chunk_1]
│
R0 ─push [chunk_1│chunk_2│...│chunk_(N-1)]──────┼─► R2 [chunk_2]
│
+ descriptor (dest tags) ...
│
└─► R(N-1) [chunk_(N-1)]
Crossbar primitive: parse descriptor, route each chunk to its tagged port.
Bytes: 1 bulk input → N distinct outputs (no replication).
HW blocks: descriptor parser + per-chunk address routing + concurrent
multi-source ingress arbitration (N ranks may submit at once).
The two diagrams above show the contrasting data flows. The table below summarizes the feature-level differences side-by-side — same crossbar fabric, but each capability has its own ingress format, routing rule, output pattern, and required HW blocks:
| Aspect | Multicast xbar (§1.2) | Scatter-gather engine (§1.3) |
|---|---|---|
| Semantics | one in, many out (replication) | many in, many out (permutation) |
| Ingress | 1 buffer of M bytes at 1 input port | 1 bulk buffer of (N−1)·M/N bytes + descriptor table at 1 input port |
| Routing rule | replicate every byte to every port in the multicast group | parse per-chunk destination tags; route each chunk to its specific destination port |
| Per-byte fanout | 1 source → N destinations (with byte replication) | 1 source → 1 destination per chunk (no replication) |
| Egress | same bytes appear on every group port | different bytes appear on different ports |
| Required HW blocks | multicast group table + crossbar fanout duplication logic | descriptor parser + per-chunk address routing + concurrent multi-source ingress arbitration |
| Uses switch ALU? | no | no |
| Collectives accelerated | BC, AG, AR-multicast-phase | A2A only |
The two operations are silicon-distinct: multicast needs replication logic on the data path; scatter-gather needs descriptor parsing and per-chunk dispatch. Existing SHARP-class switches (NVSwitch Gen4, Quantum-X800) ship multicast but not the scatter-gather descriptor handler — that’s why HW A2A requires a generation jump (Tomahawk Ultra today, Rubin NVSwitches next), even though both capabilities live in the crossbar fabric and neither needs the ALU.
HW A2A is an “efficiency” win — same per-rank bytes, fewer endpoint α’s. This puts it in the same category as multicast for AG/BC and ALU-reduce for RS/Reduce: in all of these, the switch does parallel work the endpoints would have done sequentially, but the per-rank byte count is unchanged. The unique structural INC win remains AR alone, where the composed switch-ALU + multicast-xbar pair halves per-rank bytes from $\sim 2M$ to $M$ ($\mathrm{BW_{eff}}$: $\mathrm{BW}/2 \to \mathrm{BW}$); §1.4 makes this explicit in the summary table.
What distinguishes HW A2A from the SHARP-class efficiency wins is how the switch parallelizes:
- SHARP-class multicast / ALU reduce — the switch parallelizes a tree of endpoint operations in its crossbar fabric (multicast group fanout, or ALU aggregation). Software does $\log_2 N$ or $N{-}1$ sequential rounds of pairwise endpoint forwarding; the switch does the equivalent work in 1 hardware operation. The α saving comes from collapsing the algorithmic tree depth.
- HW A2A scatter-gather — the switch parallelizes the endpoint submission via descriptor batching. Per-chunk routing inside the switch is essentially normal point-to-point switching; what’s new is that one bulk transaction at the ingress port subsumes $N{-}1$ separate endpoint scheduling events. The α saving comes from descriptor batching at the rank, not from algorithmic-tree collapse.
In both cases the per-rank wire-side byte count is identical to software ($(N{-}1)/N \cdot M$ for A2A, AG, RS; $M$ for BC, Reduce; etc.) — so the BW term is unchanged and the win is α-only. AR is the lone structural exception.
Cost comparison (intra-pod A2A at $N = 72$, $M = 16\,\mathrm{MB}$, $\alpha_{\mathrm{inner}} = 0.5\,\mu$s, $\alpha_{\mathrm{switch}} \approx 0.2\,\mu$s, $\mathrm{BW}_{\mathrm{inner}} = 900\,\mathrm{GB/s}$):
| α term | BW term | Total | |
|---|---|---|---|
| Software pairwise A2A | $(N{-}1)\,\alpha = 35.5\,\mu$s | $(N{-}1)/N \cdot M/\mathrm{BW} \approx 17.5\,\mu$s | ~53 μs |
| Hardware A2A (Rubin / Tomahawk Ultra) | $\sim 2\,\alpha_{\mathrm{switch}} \approx 0.4\,\mu$s | unchanged ≈ 17.5 μs | ~18 μs |
About a 3× total speedup at $M = 16\,\mathrm{MB}$, dominated by the α collapse; at small $M$ where α dominates, the speedup approaches $(N{-}1)/2 \approx 36\times$. At large $M$ where BW dominates, both schedules converge toward the bisection bound ($\sim M/\mathrm{BW}$).
Commercial shipment:
- NVSwitch Gen4 (current GB200 NVL72): no HW A2A — A2A runs software-scheduled through the crossbar.
- Rubin-generation NVSwitches (next gen): planned HW A2A support, extending NVLS beyond AR / BC / AG.
- Broadcom Tomahawk Ultra (Ethernet, shipped 2025): yes — first commodity-Ethernet HW A2A primitive [TH-ULTRA].
- Quantum-X800 (current IB): no HW A2A on shipping silicon.
1.4 Cost-model savings — consolidated summary
§1.1, §1.2, and §1.3 each accelerate a different subset of collectives. The net α-term and BW-term effects across all primitives, evaluated on a single-switch star with $N$ ranks, $M$-byte payload, switch cut-through $\alpha_{\mathrm{switch}}$:
| Primitive | Required INC HW | α (best software) | α (INC) | α speedup | $\mathrm{BW_{eff}}$ (software) | $\mathrm{BW_{eff}}$ (INC) | BW lift |
|---|---|---|---|---|---|---|---|
| AR | Switch ALU + Multicast xbar | $2(N{-}1)\alpha$ (ring) or $2\lceil\log_2 N\rceil\alpha$ (DBT) | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\sim (N{-}1)$ or $\sim \log_2 N$ × | $\mathrm{BW}/2$ | $\mathrm{BW}$ | 2× |
| Reduce | Switch ALU | $\lceil\log_2 N\rceil\alpha$ | $\sim \alpha_{\mathrm{switch}}$ | $\sim \log_2 N$ × | $\mathrm{BW}$ (saturated) | $\mathrm{BW}$ | $1\times$ |
| RS | Switch ALU | $(N{-}1)\alpha$ | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\sim (N{-}1)/2$ × | $\mathrm{BW}$ | $\mathrm{BW}$ | $1\times$ |
| AG | Multicast xbar | $(N{-}1)\alpha$ | $\sim 2\,\alpha_{\mathrm{switch}}$ | $\sim (N{-}1)/2$ × | $\mathrm{BW}$ | $\mathrm{BW}$ | $1\times$ |
| BC | Multicast xbar | $\lceil\log_2 N\rceil\alpha$ (pipelined tree) | $\sim \alpha_{\mathrm{switch}}$ | $\sim \log_2 N$ × | $\mathrm{BW}$ | $\mathrm{BW}$ | $1\times$ |
| A2A | Scatter-gather engine | $(N{-}1)\alpha$ (pairwise) | $\sim \alpha_{\mathrm{switch}}$ | $\sim (N{-}1)$ × | $\mathrm{BW}$ (bisection) | $\mathrm{BW}$ (bisection) | $1\times$ |
Three structural observations:
1. AR is the unique primitive that uses both the switch ALU and the multicast xbar. Endpoints push their contributions into the switch ALU (reduce phase from §1.1); the ALU emits the single $M$-byte result into the multicast xbar; the crossbar replicates it back to all ports (multicast phase from §1.2). Together these complete the entire AR in one logical round trip, regardless of $N$ — endpoint software overhead is paid once for the whole collective.
This composition is why AR alone gets both a structural α collapse AND a BW-eff doubling. Every byte of the reduced result must visit an endpoint link twice in software AR (once into the reduction, once back out for redistribution); ring makes this explicit (RS phase + AG phase), and DBT pays the same dual-touch in fewer steps. NCCL’s DBT in fact moves the full $2M$ per rank, hitting $\mathrm{BW_{eff}} = \mathrm{BW}/2$ exactly. INC’s switch ALU + multicast pair does both phases inside the fabric, so each endpoint moves only $M$ bytes total — one upstream, one downstream, on opposite directions of the full-duplex link. The $2\times$ ratio is the algorithmic ceiling; realized lift is smaller (~$1.3\times$ measured on NVLS) and is treated quantitatively in 05_contention_and_congestion.md.
2. AG / BC / RS / Reduce get α-only wins because software already saturates link BW. A full-duplex star runs ring AG / pipelined tree BC at $\mathrm{BW_{eff}} = \mathrm{BW}$ already — each step, a rank forwards $M/N$ outbound while receiving $M/N$ inbound concurrently, so per-rank wall-clock BW is $(N-1)/N \cdot M/\mathrm{BW} \approx M/\mathrm{BW}$. The two-touch pattern that costs AR a factor of two never applied. INC’s BW-side numbers for these primitives match software (not lift it), so their speedups are dramatic at small $M$ (α-bound) and converge to $1\times$ at large $M$ (BW-bound).
3. A2A’s win is α-only and “efficiency” rather than “structural”. Total cross-sectional bytes ($(N{-}1)\,M$) are unchanged because the switch routes them verbatim — no aggregation collapse, no replication. The α reduction from $(N{-}1)\alpha$ to $\sim\alpha_{\mathrm{switch}}$ (§1.3) comes from batching $N{-}1$ endpoint-scheduling rounds into one switch transaction, not from an algorithmic-tree collapse. Where HW A2A doesn’t ship (current NVSwitch Gen4 / Quantum-X800), A2A pays the full software $(N{-}1)\alpha$ on its only path — software pairwise direct-send. SHARP-class INC at the same fabrics gives A2A nothing because the switch ALU and multicast xbar need aggregation or replication semantics, which A2A lacks.
Per-hop α substitution at multi-tier scale. Even for primitives where INC’s $n_\alpha$ collapse stays $O(\log N)$ on a multi-tier fabric (the aggregation-tree depth $k = \lceil\log_r N\rceil$), each switch-level α is $\sim 200$–$400$ ns (cut-through) instead of the $\sim 1$–$3\,\mu$s endpoint RTT — a 5–10× per-hop saving stacked on top of the structural collapse, because the endpoint software overhead is paid once at collective launch rather than once per level. §2.2 picks up the multi-tier story.
2. Scale-up vs scale-out
The INC mechanism — switch ASIC reduces in-fabric, endpoints see one round trip — applies at two very different deployment scales. Within one switch domain (scale-up), $n_\alpha = 2$ regardless of $N$. Across a multi-tier switched fabric (scale-out), $n_\alpha = 2k$ where $k = \lceil\log_r N\rceil$ is the aggregation-tree depth, but each hop is switch cut-through rather than endpoint RTT. Each scale category has two shipping implementations — one NVLink / InfiniBand (NVIDIA) and one Ethernet (Broadcom / NVIDIA).
2.1 Scale-up: single-switch star
The entire AR runs inside one switch domain, so $n_\alpha = 2$ end-to-end, independent of $N$ up to the switch radix cap:
$$t_{\mathrm{INC, AR}}^{\mathrm{scale-up}} \;\approx\; 2\alpha_{\mathrm{switch}} + \frac{M}{\mathrm{BW}}$$
per the §1.4 derivation ($\mathrm{BW_{eff}} = \mathrm{BW}$, the full algorithmic ceiling for AR). Two hardware features compose: hardware multicast (a single write replicated by the switch to all destination ports in one switch-local operation) and hardware all-reduce (switch ALU combines contributions across the SHARP group and multicasts the reduced result back).
Two shipping implementations on different fabrics:
- NVLS — NVLink / NVSwitch. NVIDIA’s NVLink SHARP, introduced with NVSwitch Gen3 (H100 era) and extended in NVSwitch Gen4 (B200 / GB200 / NVL72). Domain size is capped by NVSwitch radix — 72 GPUs per NVL72 pod. $\alpha_{\mathrm{switch}} \approx 100$–$200$ ns. Current gen supports AR / BC / AG; cross-pod traffic falls back to software (NCCL picks DBT or ring per message size). Rubin-generation NVSwitches are expected to extend NVLS to A2A.
- Tomahawk Ultra INC — Ethernet. Broadcom Tomahawk Ultra (shipped 2025) is the first Ethernet switch ASIC with in-network collectives [TH-ULTRA]. 51.2 Tbps full-duplex, $\alpha_{\mathrm{switch}} \approx 250$ ns. Structurally equivalent to NVLS — single-switch star with in-fabric reduction — but on commodity Ethernet, breaking the NVLink-only monopoly on scale-up INC and opening the door to INC-enabled Ethernet fabrics at HPC / AI scale-up cost points.
2.2 Scale-out: multi-tier aggregation tree
Across thousands of endpoints spread over many switch levels, the INC primitive is an aggregation tree whose internal nodes are switch ASICs. Each level reduces its incoming flits and forwards the $M$-byte partial upward; the root completes the reduction and multicasts back down:
$$t_{\mathrm{INC, AR}}^{\mathrm{scale-out}} \;\approx\; 2k \cdot \alpha_{\mathrm{switch}} + \frac{M}{\mathrm{BW}}$$
where $k$ is the number of switch tiers an $M$-byte flit traverses from an endpoint to the root of the aggregation tree. For a fat-tree with per-switch radix $r$ (fan-in per level), covering $N$ endpoints requires $k = \lceil\log_r N\rceil$ tiers — e.g., $r = 64$ ports per switch covers $N = 4096$ endpoints in $k = 2$ tiers (leaf + spine), or $N = 262{,}144$ in $k = 3$ (leaf + spine + super-spine). The factor of $2k$ on the $\alpha$ term accounts for the round trip: $k$ switch hops up to the root during reduce, then $k$ back down during multicast. $\alpha_{\mathrm{switch}} \approx 200$–$400$ ns is the switch cut-through, not endpoint software RTT. The BW term stays at $M/\mathrm{BW}$ — the same algorithmic ceiling as scale-up ($\mathrm{BW_{eff}} = \mathrm{BW}$ from §1.4), because each endpoint still only pushes $M$ up and receives $M$ down, and cut-through pipelining at each tier keeps the endpoint’s outbound and inbound directions overlapping once the pipeline is filled ($\sim 2k\alpha_\mathrm{switch}$).
Concretely, for $N = 4096$ across a 3-tier aggregation tree, AR latency is $\sim 2$–$3\,\mu$s — even though software ring AR at the same $N$ would need 8190 sequential endpoint $\alpha$s ($\sim 8$ ms at endpoint $\alpha = 1\,\mu$s). The $\sim 3000\times$ speedup is dominated by the endpoint-to-switch $\alpha$ substitution combined with the structural $O(N) \to O(\log N)$ collapse.
Two shipping implementations on different fabrics:
- Quantum SHARP — InfiniBand. NVIDIA Mellanox Quantum-2 and Quantum-X800 switches on IB fat-tree / Clos topologies. The established scale-out INC path for large training clusters.
- Spectrum-X SHARP — Ethernet. NVIDIA’s Ethernet-side analog, running the SHARP protocol over RoCE on Spectrum-X switches. Brings the scale-out INC story to commodity-Ethernet clusters — complementing Tomahawk Ultra on the scale-up side.
3. Worked example at $N = 512$
To track the same $N = 512$ anchor used in 05_contention_and_congestion.md §5, apply scale-up INC to a hypothetical single-switch star with 512 ports. Real scale-up INC (NVL72) caps at $N = 72$ — a 512-port single-switch INC ASIC does not exist today, so a production $N \geq 512$ deployment would use scale-out INC over a multi-tier Clos (03_hierarchical_topologies.md Appendix A works through the rail-optimized GB200 SuperPOD topology at $L = 32$ NVL72s / $N = 2304$ GPUs, showing the actual 8-leaves + 4-spines per-rail fat-tree that current high-count clusters deploy). We pick the single-switch abstraction for the worked example below because it applies the §1.1 / §1.2 / §1.4 algorithmic ceilings directly ($n_\alpha = 2$, $\mathrm{BW_{eff}} = \mathrm{BW}$); scale-out INC at $k = 2$, $\alpha_\mathrm{switch} = 0.5\,\mu$s gives essentially the same total ($\sim 20\,\mu$s vs $\sim 19\,\mu$s for scale-up) because the $M / \mathrm{BW}$ term dominates at $M = 16\,\mathrm{MB}$.
Per-link $\alpha = 0.5\,\mu$s (switch cut-through + endpoint software), $\mathrm{BW} = 900\,\mathrm{GB/s}$, $M = 16\,\mathrm{MB}$.
3.1 The full $N = 512$ ladder (AR)
Evaluating the three software / torus rows from the formulas in 02_topology_mapping.md §5.1 and adding one NVLS-style INC row on the hypothetical single-switch star:
| Topology | Algorithm | $n_\alpha$ | $\alpha$ term | BW term | Total |
|---|---|---|---|---|---|
| Star (crossbar) | Ring AR | 1022 | 511 μs | 35.5 μs | 546 μs |
| Star (crossbar) | DBT / RHD | 18 | 9 μs | 35.5 μs | 45 μs |
| Hypothetical 512-port star | NVLS-style INC | 2 | 1 μs | 17.8 μs | 18.8 μs |
| Torus $8 \times 8 \times 8$ | Dim-decomp ring | 42 | 21 μs | 35.5 μs | 57 μs |
INC closes two gaps at once:
- $\alpha$ side. $n_\alpha$ collapses from 18 (DBT) or 1022 (ring) to 2. The α term shrinks from 9 μs (DBT) to 1 μs (INC) — small in absolute terms at this $M$, but consequential at smaller $M$ (see §3.2).
- BW side. $\mathrm{BW_{eff}}$ doubles from $\mathrm{BW}/2$ (any software schedule) to $\mathrm{BW}$ (INC), halving the BW term from 35.5 μs to 17.8 μs.
Net: star + INC at 18.8 μs beats the best software pairing (star + DBT at 45 μs) by $\sim 2.4\times$ and the torus at 57 μs by $\sim 3\times$. Against the pathological star+ring row (546 μs) the speedup is $\sim 29\times$, but that row is a cautionary baseline, not a design we’d ship. The $\sim 2\times$ ceiling from §1.4 is the asymptotic BW-side INC speedup for AR; the full $\sim 2.4\times$ vs DBT at $M = 16\,\mathrm{MB}$ reflects the residual $\alpha$-side contribution.
3.2 Regime sensitivity (AR)
The $N = 512$ speedup of INC over the best software pairing (DBT) varies sharply with $M$. Sweeping $M$ from $\alpha$-bound to BW-bound:
| $M$ | Star + DBT | Star + INC | Speedup (DBT → INC) |
|---|---|---|---|
| 10 KB ($\alpha$-bound) | 9.02 μs | 1.01 μs | ~9× |
| 1 MB | 11.2 μs | 2.1 μs | ~5× |
| 16 MB (anchor) | 45 μs | 18.8 μs | ~2.4× |
| 1 GB (BW-bound) | 2.23 ms | 1.11 ms | ~2× |
At small $M$ the $\alpha$-term collapse dominates: DBT still needs 18 synchronizations at 0.5 μs each (= 9 μs); INC needs 2 (= 1 μs). At large $M$ the BW-term ratio takes over and the speedup converges to the $2\times$ payload-count ceiling. The transition is governed entirely by the $\alpha$-vs-BW crossover for DBT: $M^\star \approx n_\alpha^{\mathrm{DBT}} \cdot \alpha \cdot \mathrm{BW} / 2 = 18 \cdot 0.5\,\mu\mathrm{s} \cdot 900\,\mathrm{GB/s} / 2 \approx 4\,\mathrm{MB}$. Below $M^\star$ INC’s $\alpha$ savings dominate; above, its BW savings do.
The ceilings in this section assume frictionless cut-through, no ALU or multicast contention, and no scheduler overhead. 05_contention_and_congestion.md §5 re-runs the same $N = 512$ ladder under realistic $\eta$ and quantifies how much of each ceiling survives in deployment.
3.3 Cross-primitive comparison (non-AR primitives)
§3.1–§3.2 focused on AR. The $\alpha$-side INC collapse applies to AG, RS, BC, and Reduce as well; the BW-side collapse is AR-exclusive (§1.4 — the other reduction-or-replication primitives already hit $\mathrm{BW_{eff}} = \mathrm{BW}$ in software via full-duplex operation). A2A gets no SHARP-class INC lift on any hardware; HW A2A (§1.3) is the separate path. The table runs the same $N = 512$, $M = 16\,\mathrm{MB}$, $\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$ anchor across all five non-AR primitives:
| Primitive | Topology / Algorithm | $n_\alpha$ | $\alpha$ term | BW term | Total |
|---|---|---|---|---|---|
| AG / RS | Star ring | 511 | 255.5 μs | 17.7 μs | 273 μs |
Star rec-doub AG / rec-halv RS (01_collective_algorithms.md App. B.4) | 9 | 4.5 μs | 17.7 μs | 22.2 μs | |
| Hypothetical 512-port star + INC | 2 | 1.0 μs | 17.7 μs | 18.7 μs | |
| Torus $8 \times 8 \times 8$ dim-decomp ring | 21 | 10.5 μs | 17.7 μs | 28.2 μs | |
| BC / Reduce | Star ring | 511 | 255.5 μs | 17.8 μs | 273 μs |
| Star pipelined tree (DBT) | 9 | 4.5 μs | 17.8 μs | 22.3 μs | |
| Hypothetical 512-port star + INC | 1 | 0.5 μs | 17.8 μs | 18.3 μs | |
| Torus $8 \times 8 \times 8$ dim-decomp bidirectional | 12 | 6.0 μs | 17.8 μs | 23.8 μs | |
| A2A | Star pairwise (NCCL) | 511 | 255.5 μs | 17.7 μs | 273 μs |
| Hypothetical 512-port star + SHARP-class INC | — | — | — | N/A (no aggregation/replication semantics; §1.1) | |
| Hypothetical 512-port star + HW A2A (Tomahawk Ultra / Rubin) | $\sim 2$ | $\sim 1\,\mu$s | 17.7 μs | ~19 μs (α-only collapse; §1.3) | |
| Torus $8 \times 8 \times 8$ bisection-bound (TPU / Trainium) | 12 | 6 μs | 17.8 μs | 23.8 μs |
Three observations:
- AG / RS / BC / Reduce INC closes the α-side gap but not the BW-side gap. Star + INC (~18 μs) beats the best software (~22 μs) by only $\sim 1.2\times$ at this anchor — a much tighter margin than AR’s $2.4\times$ at the same $M$. The entire gap is the $\alpha$-term collapse (e.g., AG / RS rec-doub: $4.5 \to 1.0\,\mu$s; BC / Reduce DBT: $4.5 \to 0.5\,\mu$s); BW terms match because software rec-doub / DBT already hits $\mathrm{BW_{eff}} = \mathrm{BW}$ on a full-duplex star. At small $M$ the ratio widens toward the $\alpha$-only ceiling ($\sim 4.5\times$ for AG/RS, $\sim 9\times$ for BC/Reduce); at large $M$ all four collapse to $1\times$.
- A2A’s INC story is split across two capabilities, neither giving the AR-style structural collapse. SHARP-class INC (switch ALU + multicast) gives A2A no win at all on any hardware — the primitive’s per-destination payloads have no aggregation or replication structure to exploit. The separate HW A2A primitive (§1.3) ships today on Tomahawk Ultra and is planned for Rubin-generation NVSwitches; it collapses the α-side per-destination scheduling to $\sim 1\alpha$ but leaves the BW term at the bisection bound (which a switch ALU cannot reduce by forwarding-verbatim semantics). Current shipping NVSwitch Gen4 (NVL72) and Quantum-X800 include neither, so A2A on those platforms falls back to software-scheduled pairwise sends and the topology-side win path: torus bisection-bound reduces $(N{-}1)\alpha$ to $\mathrm{diam}\cdot\alpha$ at the cost of a $D_{\max}/8$ BW penalty (which vanishes at cubic shapes).
- Torus stays competitive when INC is unavailable. At $N = 512$, $M = 16\,\mathrm{MB}$, torus 8³ dim-decomp is $\sim 1.3$–$1.5\times$ slower than hypothetical INC across AG / RS (28.2 μs vs 18.7) and BC / Reduce (23.8 vs 18.3), but $\sim 10\times$ faster than star ring — the dim-decomposition’s $\sum(D_i - 1)$ or $\sum\lfloor D_i/2 \rfloor$ $\alpha$ collapse carries most of the win. Torus is the best A2A option by a wide margin when INC is unavailable (23.8 μs vs star pairwise’s 273 μs — $\sim 11.5\times$), because the dim-decomposed $\mathrm{diam}\cdot\alpha$ latency term cuts deeper while the $D_{\max}/8$ BW penalty is $1\times$ at the cubic 8³ shape.
The pattern across all six primitives: AR is the only primitive where INC pays back at every $M$ (both $\alpha$ and BW wins); AG / RS / BC / Reduce wins concentrate at small $M$ (α-only); A2A’s only INC path is the separate HW A2A primitive (§1.3), which gives an α-side efficiency win on Tomahawk Ultra today (and on Rubin NVSwitches in the next generation) but no BW lift. 05_contention_and_congestion.md §5 layers realistic $\eta$ on each of these rows and quantifies which speedups survive contention.
4. INC-star vs torus
§1.4 summarizes the per-primitive INC vs software-on-star comparison. This section completes the architectural picture by adding torus as the third design point — the natural choice when INC is unavailable — and asking, for each primitive, how INC-on-star compares to a same-$N$ torus at their algorithmic ceilings. The table echoes 02_topology_mapping.md §5.1 (per-primitive cost on each topology) and adds INC rows under Star:
| Topology | Primitive | Algorithm | α term | BW term | INC-on-star speedup (small M / large M) |
|---|---|---|---|---|---|
| Star | AR | Ring | $2(N-1)\,\alpha$ | $2(N-1)/N \cdot M/\mathrm{BW}$ | $\sim (N-1)\times$ / $\sim 2\times$ |
| AR | DBT | $2\lceil\log_2 N\rceil\,\alpha$ | $2(N-1)/N \cdot M/\mathrm{BW}$ | $\sim \log_2 N\times$ / $\sim 2\times$ | |
| AR | INC (ALU + multicast) | $2\,\alpha_{\mathrm{switch}}$ | $M/\mathrm{BW}$ | — (reference) | |
| AG / RS | Ring | $(N-1)\,\alpha$ | $(N-1)/N \cdot M/\mathrm{BW}$ | $\sim (N-1)/2\,\times$ / $1\times$ | |
| AG / RS | INC (multicast or ALU+scatter) | $2\,\alpha_{\mathrm{switch}}$ | $(N-1)/N \cdot M/\mathrm{BW}$ | — (reference) | |
| BC / Reduce | Pipelined tree (DBT) | $\lceil\log_2 N\rceil\,\alpha$ | $M/\mathrm{BW}$ | $\sim \log_2 N\times$ / $1\times$ | |
| BC / Reduce | INC (multicast or ALU) | $\alpha_{\mathrm{switch}}$ | $M/\mathrm{BW}$ | — (reference) | |
| A2A | Pairwise | $(N-1)\,\alpha$ | $(N-1)/N \cdot M/\mathrm{BW}$ | $\sim (N-1)\times$ / $1\times$ (vs HW A2A); no speedup with SHARP-class only | |
| A2A | INC (HW A2A scatter-gather; §1.3) | $\alpha_{\mathrm{switch}}$ | $(N-1)/N \cdot M/\mathrm{BW}$ | — (reference; ships on Tomahawk Ultra / Rubin) | |
| Torus | AR | Dim-decomp ring | $2\sum_i (D_i-1)\,\alpha$ | $2(N-1)/N \cdot M/\mathrm{BW}$ | $\sim \sum_i (D_i-1)\times$ / $\sim 2\times$ |
| AG / RS | Dim-decomp ring | $\sum_i (D_i-1)\,\alpha$ | $(N-1)/N \cdot M/\mathrm{BW}$ | $\sim \sum_i (D_i-1)/2\,\times$ / $1\times$ | |
| BC / Reduce | Dim-decomp bidirectional | $\sum_i \lfloor D_i/2 \rfloor\,\alpha$ | $M/\mathrm{BW}$ | $\sim \sum_i \lfloor D_i/2 \rfloor\,\times$ / $1\times$ | |
| A2A | Pairwise — bisection-bound | $\mathrm{diam}\cdot\alpha$ | $D_{\max}/8 \cdot M/\mathrm{BW}$ | $\sim \mathrm{diam}\times$ / $\sim D_{\max}/8\,\times$ ($1\times$ at $8^3$; $\sim 2\times$ at $16^3$ — typical TPU pod slice shape) |
Star-row α and BW-eff values match §1.4’s per-primitive summary; the torus rows are the dim-decomposed costs from 02_topology_mapping.md §3 plugged in for the architectural-swap comparison. Three observations on the INC-star vs torus comparison:
- AR is the only primitive where INC-on-star structurally beats torus on BOTH axes. α-side: $\sim \sum_i(D_i-1)$ hops on torus vs $\sim 2$ on INC-star — a deep-vs-shallow tree gap. BW-side: torus AR pays the same $2\times$ dual-touch BW penalty as software-on-star ($\mathrm{BW_{eff}} = \mathrm{BW}/2$), while INC-on-star halves per-rank traffic via the composed switch ALU + multicast xbar pair → $\mathrm{BW_{eff}} = \mathrm{BW}$. Torus has no analog of this composition. So INC-star beats torus on AR at every $M$: ~2× from BW alone at large $M$, plus the α-side gap at smaller $M$.
- For α-only primitives (AG / RS / BC / Reduce), INC-star and torus are α-side competitors; both saturate link BW. Torus dim-decomp on a full-duplex ring achieves $\mathrm{BW_{eff}} = \mathrm{BW}$, just like ring AG / pipelined-tree BC on a star. The remaining gap is α-only — INC-star’s $\sim \alpha_{\mathrm{switch}}$ or $2\alpha_{\mathrm{switch}}$ vs torus’s $\sum_i(D_i-1)\,\alpha$ or $\sum_i \lfloor D_i/2\rfloor\,\alpha$. At production sizes ($N = 512$ as 8³ torus: $\sum(D_i-1) = 21$, $\sum\lfloor D_i/2\rfloor = 12$), INC-star’s α-side advantage is order of magnitude in the latency-bound regime; at large $M$ both architectures converge to the same $M/\mathrm{BW}$ ceiling. The choice between INC-star and torus on these four is rarely a deal-breaker — deployment-side considerations (rack power, switch radix, cabling) often dominate.
- A2A’s topology choice flips with hardware availability. SHARP-class INC gives A2A no win on any hardware. With HW A2A (Tomahawk Ultra today, Rubin NVSwitches next), star + HW A2A collapses α to $\sim \alpha_{\mathrm{switch}}$ but the BW term stays bisection-bound — same as torus. So both architectures end up bisection-bound; INC-star edges out torus only on α. Without HW A2A (current GB200 NVL72 / Quantum-X800), A2A on a star pays the full software $(N-1)\,\alpha$, and torus’s $\mathrm{diam}\cdot\alpha$ saves an order of magnitude — making torus the natural fallback for A2A-heavy workloads (TPU / Trainium clusters lean on this; star-only deployments without HW A2A struggle on MoE EP traffic). The right topology decision flips depending on whether the deployment has HW A2A — the most fabric-architecture-sensitive primitive in this comparison.
Limitations — every cost formula above is an algorithmic ceiling. Six restrictions apply in practice:
- Scope: single switch domain (or SHARP-enabled hierarchy). NVLS works only within one NVSwitch domain. Quantum SHARP works across a SHARP-enabled IB fabric, but only if all switches in the aggregation tree support the protocol. Cross-domain traffic falls back to software.
- Reducible types: limited. Production switch ALUs support sum, max, min, bit-and/or/xor, and a few related ops over BF16/FP16/FP32 (FP64 varies by switch generation). Compound reductions like softmax (which requires exp() + sum + per-element division) and top-k are not supported in shipping SHARP-class hardware — they fall back to either in-fabric approximations on programmable-pipeline switches (Tofino-class research prototypes; published academic work but not productized) or full software execution. Roadmap extensions (Rubin-generation NVSwitches and successor Quantum SHARP versions) may broaden the op set; verify against current vendor documentation when reduction semantics matter beyond the standard set.
- Precision: hardware-fixed. Switch ALUs typically operate in BF16 with FP32 accumulators; some SHARP-enabled IB switches support FP32 directly. The accumulation order is also fixed by the switch tree shape, which can produce slightly different numerical results than ring AR — usually within numerical tolerance but worth flagging for bitwise-reproducibility use cases.
- Message alignment. INC operations require specific alignment (typically 128 B or higher); very small messages can fall through to software — NCCL’s heuristic handles this, but a sweep across $M$ may show a sudden floor where the runtime switches paths.
- BW-regime convergence. INC’s dominant win is in the latency-term regime. AR’s $\sim$70× speedup at small $M$ shrinks to $\sim 2\times$ at large $M$; AG / RS / BC / Reduce shrink to $1\times$; A2A’s HW A2A gain similarly converges to $1\times$ at the bisection bound.
- Ceilings vs realized speedups. The ratios above are algorithmic ceilings — they assume frictionless cut-through pipelining, no multicast contention, and no NCCL / switch-ALU scheduling overhead. Realized speedups on production hardware are smaller (e.g., NVLink NVLS measures $\sim 1.3\times$ BW-regime lift vs the $2\times$ ceiling). The full contention-coefficient treatment — calibrating $\eta_\alpha, \eta_\beta$ against published busbw measurements and re-running the comparison under realistic factors — is in
05_contention_and_congestion.md.
Further reading
01_collective_algorithms.md— baseline ring and tree AR costs that INC compresses against; the $\alpha$-$\beta$ model and the Rabenseifner / DBT derivations referenced throughout.02_topology_mapping.md§2 — $\alpha$ / BW calibration on a scale-up star and star-specific observations (e.g., DBT’s port-uniform utilization on the crossbar). §5.1 observation 6 flags INC as star’s $\alpha$-compression escape hatch.05_contention_and_congestion.md— contention coefficients that modify software collectives; INC paths are less sensitive to contention because they use only one switch operation.03_hierarchical_topologies.md§3 — how INC at both tiers of the hierarchy composes: outer-tier INC (Quantum SHARP / Spectrum-X SHARP / Tomahawk Ultra at the Clos spine) for the biggest absolute α saving, and inner-tier INC (NVLS within an NVL72 pod) for the residual α plus the BW-eff lift on intra-pod AR. NVLS + Quantum SHARP combined brings hierarchical AR at L = 32 pods to roughly the cost of a single-pod NVLS AR.references.md— primary sources for SHARP (Graham et al. 2016), NVLink SHARP / NVLS (NVIDIA 2023 whitepapers, NCCL 2.27 release notes), and Tomahawk Ultra INC (Broadcom 2025).- Graham et al. (2016), “Scalable Hierarchical Aggregation Protocol (SHARP)” — the architectural paper.
- NVIDIA (2023), NVLink SHARP and NVSwitch Multicast/Reduce Whitepapers.
Contention and Congestion: From Algorithmic Ceilings to Realized Performance
Author: Yue Lu
Date: April 2026
02_topology_mapping.md scored each collective as if it ran alone on a perfectly-scheduled fabric; 04_in_network_collectives.md §1.4 tightened that upper bound to the algorithmic ceiling ($n_\alpha, \mathrm{BW_{eff}}$) under in-network reduction. Real deployments fall below both — some of the gap is baked into the silicon and fabric (link utilization ceilings, switch queueing, cross-tier backpressure), and some is left on the table by the runtime scheduler (which rank layout the allocator produces, which concurrent groups share physical links, how expert traffic is routed). This note splits the gap along that axis, introduces per-fabric contention coefficients $\eta_\alpha \ge 1$ and $\eta_\beta \in (0, 1]$ as the scalar aggregate, and re-runs the $N = 512$ comparison under realistic $\eta$ to show how much the ideal-model margins compress.
Table of Contents
- From ceilings to realized
- Hardware-level inefficiencies
- Software/scheduling-level inefficiencies
- Contention coefficients $\eta_\alpha$ and $\eta_\beta$
- Re-running $N = 512$ under realistic $\eta$
- What the coefficient model doesn’t capture
- When $\eta$ is load-bearing
- Further reading
1. From ceilings to realized
The cost formulas in 02_topology_mapping.md and the INC ceilings in 04_in_network_collectives.md §1.4 both describe upper bounds. Realized deployment performance sits below those ceilings, and the gap decomposes cleanly into two sources:
- Hardware-level inefficiencies — what an isolated, perfectly-scheduled collective still pays because the fabric cannot sustain its nominal link BW or switch latency under realistic load (§2).
- Software/scheduling-level inefficiencies — what the runtime scheduler leaves on the table by accepting an off-prefix rank layout, scheduling concurrent groups that share physical links, or routing expert traffic non-uniformly (§3).
A useful way to keep the two separated: if the inefficiency persists when a single, isolated collective runs on a freshly-booted cluster, it’s hardware-level. If it disappears when the scheduler makes better choices without touching the silicon, it’s software/scheduling-level.
Scope assumption. We assume the deployment has already selected the right algorithm for its fabric at design time (DBT or NVLS on star, dim-decomp on torus, INC where available). Algorithm pairing is a one-time design decision, not a runtime contention effect; a well-designed deployment pays it only in the form of “pick the right one,” after which the $(\eta_\alpha, \eta_\beta)$ model below prices what’s left.
Both sources show up in the cost model as a combination of “more $\alpha$” and “less effective BW.” §4 introduces the scalar $(\eta_\alpha, \eta_\beta)$ coefficients that aggregate them per (fabric, collective).
1.1 Bottleneck analysis
The cost formulas in 02_topology_mapping.md and the realistic discounts in this note are both computed via the same technique: bottleneck analysis. Identify the link (or group of links) carrying the most bytes during a collective operation, and compute the time to drain it as
$$t_{\mathrm{BW}} \;=\; \frac{\text{bytes through the bottleneck}}{\text{bottleneck throughput}}$$
This works because the operation isn’t complete until every byte has reached its destination — the most-loaded link is the last to finish, and every other (lighter-loaded) link has already completed by the time it’s done. The bottleneck’s load and throughput therefore set wall-clock BW time; any scheduling trick that doesn’t reduce the bottleneck’s load doesn’t reduce this time. See 02_topology_mapping.md §3.6 for the concrete derivation on torus A2A.
Ideal regime. Under symmetric assumptions (uniform routing, isolated groups, no hotspot traffic, no hardware saturation), the bottleneck is identified cleanly by fabric geometry — the bisection cut for torus A2A (02_topology_mapping.md §3.6), the upper-tier uplink under oversubscription ratio $s$ for fat-tree (03_hierarchical_topologies.md §1), the per-dim ring’s single link for dim-decomposed AR (02_topology_mapping.md §3.4). Per-link loads are either uniform by graph symmetry or set by tight structural arguments, so the bottleneck bound matches measured performance closely. This is the starting point that 02_topology_mapping.md’s cost formulas describe.
Realistic regime — three ways contention breaks the ideal. Contention doesn’t invalidate the framework; it shifts the inputs:
- The bottleneck location can change. Concurrent collective groups sharing a dim, off-prefix layouts scattering chunks across multiple dim-lines, hotspot MoE destinations concentrated on one axis — each of these moves “which link is worst-loaded” away from the ideal-model location.
- Per-link load variance widens. Uniform routing symmetry no longer holds; under skewed MoE routing some links carry 3–10× more traffic than others, under oversubscription upper-tier links carry $s$× more than lower-tier ones. The bottleneck is strictly worse than the average, so “average load × BW” style reasoning underestimates realized time.
- The bottleneck’s own throughput degrades. Switch queueing, head-of-line blocking at congested ports, and flow-control backpressure reduce the bottleneck link’s effective BW below its nominal per-link value. The BW we divide by shrinks.
Mapping to $(\eta_\alpha, \eta_\beta)$. The coefficient model introduced in §4 aggregates these three effects into two scalars per (fabric, collective): $\eta_\beta \in (0, 1]$ discounts the BW term to account for (2) and (3) above, $\eta_\alpha \geq 1$ inflates the α term to account for queueing hops. §2 (hardware-level) and §3 (scheduling-level) catalog which specific mechanisms drive each shift.
2. Hardware-level inefficiencies
These are unavoidable without different silicon or different fabric architecture. Each shows up in the coefficient model as a floor on $\eta_\beta$ or a floor on $\eta_\alpha$ that the software stack cannot drive away.
2.1 Link BW utilization ceiling
Nominal per-link bandwidth is set by pin count $\times$ signaling rate, but the fraction a collective actually sees is discounted by framing overheads, endpoint DMA efficiency, and flow-control headroom required to keep the link from stalling. NCCL-tests on an H100 NVLink4 node measures AR busbw $\approx 360\,\mathrm{GB/s}$ against $450\,\mathrm{GB/s}$ raw unidirectional link BW — a baseline $\eta_\beta \approx 0.80$ that shows up without any concurrency at all. This is the floor the software stack cannot beat: even a single TP AR on an otherwise-idle node pays it.
2.2 Switch/fabric congestion under load
Wormhole / cut-through routing keeps per-message BW close to peak once a path is established. Queue delay, not per-hop latency, is what inflates under load: as fabric utilization crosses $\sim 80\%$, head-of-line blocking and arbitration delay at each switch grow super-linearly. Effective $\alpha$ per message can double or triple when the fabric carries multiple concurrent traffic classes — training mixes gradient AR with activation AG/RS and DP AR; inference mixes TP AR with KV streaming and scheduler messages; HPC mixes MPI collectives with RDMA p2p. The α-β model doesn’t price this implicitly, so congestion shows up as $\eta_\alpha > 1$.
Hierarchical specialization. Multi-tier Clos and fat-tree fabrics typically oversubscribe upper tiers relative to leaves (e.g., 2:1 or 4:1 at the aggregation layer). When an upper-tier link saturates, credit-based flow control pushes backpressure into lower tiers — the same congestion mechanism, but per-tier $\eta$ treats tiers as independent and under-counts the coupling. Tight flow-controlled fabrics (InfiniBand, RoCE with PFC) violate the independence assumption; loose store-and-forward or timeout-based fabrics approximate it better. For scale-out INC paths (04_in_network_collectives.md §2.2), cross-tier coupling is what makes the $2k\alpha_\mathrm{switch}$ floor grow faster than the per-tier walk under saturation.
3. Software/scheduling-level inefficiencies
These are fixable in the scheduler — different rank layout, different group placement, different routing policy — without touching the hardware. They show up as an additional discount on top of the hardware floor: if the scheduler picks well, realized $\eta$ approaches the hardware-only floor; if it picks badly, realized $\eta$ can be much worse.
3.1 Off-prefix group layouts
Torus dim-decomp AR hits $n_\alpha = 2 \sum (D_i - 1)$ only when the collective group’s ranks are contiguous along dim prefixes. A group of 16 ranks on an $8 \times 8 \times 8$ torus where the ranks are $\{(0,0,0), (1,0,0), \ldots, (7,0,0), (0,1,0), \ldots\}$ is prefix-contiguous along dim-0 — the algorithm runs as a single 8-ring.
But SLURM-style scatter allocation, or a job that inherits a random-looking rank layout, produces groups where ranks are spread across dim coordinates. Each algorithmic hop becomes a multi-hop path in the physical topology. The formula degrades back to the flat-ring bound ($n_\alpha \approx 2(N-1)$), often 2-4× the ideal. This is a scheduler choice, not a hardware limit: prefix-aligned allocation policies exist, and when available they eliminate this penalty entirely.
Residual unavoidable case. Two cases escape even an optimal allocator: (a) group-shape mismatch — a TP = 6 group on an $8 \times 8 \times 8$ torus has no prefix-contiguous placement because 6 doesn’t divide any dim; (b) multi-tenant fragmentation — a shared cluster with asynchronous job arrivals eventually runs out of prefix-aligned free slots and must place the next job off-prefix. Both are set by workload shape × topology geometry, not by the scheduler.
3.2 Concurrent collective groups
If the cluster runs DP = 8 replicas each doing TP = 8 AR simultaneously, where those 8 groups land on the physical fabric matters. On a star with $N = 64$ ports, the 8 groups partition into 8 disjoint port octets — each sees full BW with zero interference. Torus is different: if the 8 replicas share a common dim (e.g., all TP groups run along the X-axis), those 8 rings physically traverse the same links, and their aggregate BW is split. The ideal dim-decomp AR formula assumes rings occupy disjoint links across dims; concurrent-group layouts can violate that assumption.
Like off-prefix layouts, this is a layout choice made by the job scheduler. A layout-aware scheduler that distributes concurrent TP groups across orthogonal dims avoids the penalty; a naive scheduler pays it.
Residual unavoidable case. Sharing is forced when: (a) concurrency exceeds dim count — 8 concurrent TP groups on a 3D torus have only 3 orthogonal dims to spread across, so at least 3 groups must share a dim; (b) TP size pins the dim — TP = 8 on an $8 \times 8 \times 8$ torus must run along a full-dim axis, and every concurrent DP replica doing TP = 8 is forced onto that axis. Once forced, the per-dim BW splits across groups.
3.3 Skewed MoE routing
The uniform-bisection A2A bound assumes all $(i, j)$ pairs carry equal traffic. Real MoE traces show 3-10× skew between hot and cold experts — a handful of experts receive the bulk of routed tokens. The bisection cut still carries $N M / 2$ bytes on average, but tail latency is set by the hottest link, not the average. Torus has hotspots under skewed routing: hot-expert destinations concentrated along one dim saturate that dim’s bisection well before the uniform bound predicts. Star absorbs skew uniformly because every port has the same BW.
Skew is a policy-level artifact of the routing function (top-$k$ gating, capacity factor, expert placement). Load-balancing losses, expert shuffling, and capacity-factor tuning all discount the skew penalty without hardware changes.
Residual unavoidable case. Routing is input-dependent: even with perfectly load-balanced long-run averages, a single batch can be skewed by what tokens it happens to contain. Average-case throughput can be driven close to uniform, but per-step tail latency always sees some residual hotspot — which is what latency-SLA workloads actually pay on.
4. Contention coefficients $\eta_\alpha$ and $\eta_\beta$
The idealized collective cost is $t = n_\alpha \cdot \alpha + n_\beta \cdot (M / \mathrm{BW})$. Under the combined discount from §2 and §3, replace with
$$t_{\mathrm{effective}} = n_\alpha \cdot \alpha \cdot \eta_\alpha + n_\beta \cdot \frac{M}{\mathrm{BW} \cdot \eta_\beta}$$
Interpretation:
- $\eta_\alpha \ge 1$ inflates the latency term. Aggregates arbitration delay, head-of-line blocking, and cross-tier backpressure under congestion (§2.2), plus any layout-driven extra hops (§3.1).
- $\eta_\beta \in (0, 1]$ deflates the effective bandwidth. Aggregates the hardware BW ceiling (§2.1), link sharing between concurrent groups (§3.2), and bisection saturation from skewed traffic (§3.3).
Both default to 1.0, which recovers the ideal model. The cleanest way to apply them is at the fabric-tier level — each switch tier or link class carries its own $(\eta_\alpha, \eta_\beta)$, and the primitive cost function sees an “effective” tier with already-discounted $\alpha$ and $\mathrm{BW}$ values. No algorithm-level plumbing is required. The two scalars are deliberately coarse — they capture only the first-order “how much worse than the isolated-primitive upper bound”; §6 catalogs what scalar coefficients can’t express, and §7 covers when to escalate to packet-level simulation.
4.1 Calibration from public benchmarks
A representative $\eta$ profile calibrated from public measurements:
| Fabric | $\eta_\alpha$ | $\eta_\beta$ | Source |
|---|---|---|---|
| Crossbar (NVLink+NVSwitch, no SHARP) | 1.00 | 0.80 | NCCL-tests H100 AR busbw 360 GB/s measured vs 450 GB/s peak → 0.80 |
| NVLS (NVLink SHARP, in-network reduction) | 1.00 | 0.52 | NVLS busbw 470 GB/s measured vs DBT 360 GB/s ($\sim 1.3\times$ lift) on same hardware; back-solves to $\eta_\beta^{\mathrm{INC}} \approx 1.3 \cdot \eta_\beta^{\mathrm{DBT}} / 2 = 0.52$ under the $n_\beta^{\mathrm{INC}} = 1$ framing |
| Torus (off-prefix + concurrent groups) | 1.20 | 0.60 | TPU v4 twisted-vs-untwisted 1.63× A2A reported gap, interpreted as upper-bound $\eta_\beta \approx 1/1.63 \approx 0.6$ under adversarial layouts |
Why $\eta_\beta^{\mathrm{INC}} < \eta_\beta^{\mathrm{DBT}}$ even though INC wins. $\eta_\beta$ is measured against raw link BW, and the algorithmic factor sits in $n_\beta$. DBT’s $n_\beta = 2$ already bakes in a 2× slack, so its $\eta_\beta = 0.80$ applies to a target that was never the full link. INC’s $n_\beta = 1$ attempts the full link, and switch-ALU throughput plus multicast contention keep it from fully realizing that — so its $\eta_\beta$ looks lower, even though its realized per-rank BW ($0.52 \cdot \mathrm{BW}$) still beats DBT’s ($0.80 \cdot \mathrm{BW} / 2 = 0.40 \cdot \mathrm{BW}$) by $\sim 1.3\times$.
These are worst-case-realistic, not typical. A well-tuned deployment with prefix-aligned groups (§3.1) and SHARP-enabled AR will see much smaller $\eta$ penalties — the profile is a lower bound on performance, not an expectation.
Calibrating proprietary or newer fabrics. Run NCCL-tests (or an equivalent collective benchmark) under the intended concurrent-group pattern and back out $\eta$ from the ratio of measured busbw to the theoretical peak of algbw.
4.2 Per-tier $\eta$ on hierarchical fabrics
The flat-scalar profile above (one $(\eta_\alpha, \eta_\beta)$ pair per fabric) is appropriate for single-tier topologies — star, torus within one pod, full mesh — where all traffic sees the same link class. Hierarchical fabrics (fat-tree, Clos, and the layered compositions in 03_hierarchical_topologies.md) require a per-tier $\eta$ pair because each tier has distinct saturation behavior. Leaf-level links rarely saturate because per-endpoint BW is sized to match endpoint ingress; upper tiers (spine, super-spine) saturate first under cross-leaf fan-in, and their saturation is gated by the oversubscription ratio.
Oversubscription upper-bounds upper-tier $\eta_\beta$. A tier oversubscribed at ratio $s \geq 1$ (03_hierarchical_topologies.md §1) has aggregate upper-tier BW equal to $1/s$ of the aggregate downlink demand. Under uniform cross-tier traffic, every byte climbing the tier competes for that $1/s$-fraction of capacity, so the realized $\eta_\beta$ at that tier is capped:
$$\eta_\beta^{\mathrm{upper\text{-}tier}} \;\leq\; \min\!\left(\eta_\beta^{\mathrm{hw\,floor}}, \; \frac{1}{s}\right)$$
The inequality binds at $1/s$ whenever $s$ drives the cap below the tier’s hardware floor (e.g., $s = 2$ at a tier whose hardware floor would otherwise be $0.80$ caps realized $\eta_\beta$ at $0.50$). Below $1/s$ the actual value is further discounted by ECMP (Equal-Cost Multi-Path) hash collisions, queue-buildup arbitration, and tier-coupling backpressure. ECMP load-balances each flow across one of several equal-cost upper-tier paths by hashing the packet’s 5-tuple (src/dst IP, src/dst port, protocol) — deterministic per flow (so packets within a flow stay in order), but when distinct flows hash to the same path they pile onto it and leave other equivalent paths idle. The $\alpha$ side gets an additive inflation $\eta_\alpha > 1$ from cross-tier queueing under load (§2.2), not from oversubscription directly — oversubscription is a BW lever.
Per-tier $\eta$ profile for a 2-tier leaf-spine fabric. At representative oversubscription ratios:
| Tier | Role | $\eta_\alpha$ | $\eta_\beta$ at $s = 1$ | $\eta_\beta$ at $s = 2$ | $\eta_\beta$ at $s = 4$ |
|---|---|---|---|---|---|
| Leaf (endpoint ↔ leaf switch) | Intra-leaf traffic only | 1.00 | 0.80 | 0.80 | 0.80 |
| Spine (leaf ↔ spine) | Cross-leaf traffic | 1.10–1.30 | 0.80 | 0.40 | 0.20 |
| Aggregate cross-leaf | $\eta_\alpha^{\mathrm{spine}}$ | $\min(0.80, 1)$ | $\min(0.80, 0.50) = 0.50$ | $\min(0.80, 0.25) = 0.25$ |
The leaf row matches the single-tier NVSwitch-class floor (§4.1’s crossbar row) because per-endpoint BW is not oversubscribed. The spine row’s $\eta_\beta$ is $\min(0.80, 1/s)$ — oversubscription dominates the hardware floor once $s > 1.25$. $\eta_\alpha^{\mathrm{spine}} = 1.10\text{–}1.30$ reflects cross-tier queueing at moderate load on credit-based flow control (InfiniBand) or PFC-configured Ethernet; the factor climbs to 2–3× as fabric utilization crosses 80%.
3-tier Clos. A super-spine tier extends the table with a third row. If the super-spine is oversubscribed at $s_3$ independently of the spine-tier $s_2$, cross-pod BW compounds: $\eta_\beta^{\mathrm{cross\text{-}pod}} \leq \min(1/s_2, \, 1/s_3)$ when traffic crosses both tiers. Most production Clos fabrics oversubscribe only one tier (typically the super-spine at $s_3 \in \{2, 4\}$), keeping the leaf-spine tier at $s_2 = 1$ — which pins $\eta_\beta^{\mathrm{cross\text{-}pod}} \leq 1/s_3$.
Applying per-tier $\eta$ in the AR cost formula. For the two-tier RS → sub-AR → AG decomposition from 03_hierarchical_topologies.md §2.1, each phase uses its own tier’s coefficients:
$$t_{\mathrm{AR,realistic}} \;=\; \underbrace{t_{\mathrm{RS,inner}}\!\left(\tfrac{N}{L},\, M\right)}{\text{at } (\eta\alpha^{\mathrm{inner}},\, \eta_\beta^{\mathrm{inner}})} \;+\; \underbrace{t_{\mathrm{AR,outer}}\!\left(L,\, \tfrac{ML}{N}\right)}{\text{at } (\eta\alpha^{\mathrm{outer}},\, \eta_\beta^{\mathrm{outer}})} \;+\; \underbrace{t_{\mathrm{AG,inner}}\!\left(\tfrac{N}{L},\, M\right)}{\text{at } (\eta\alpha^{\mathrm{inner}},\, \eta_\beta^{\mathrm{inner}})}$$
The inner/outer naming follows 03_hierarchical_topologies.md §1.2. In a flat leaf-spine Clos (single GPUs attached directly to leaves), inner is intra-leaf (η table’s leaf row) and outer is cross-leaf (aggregate cross-leaf row, capped by $\min(\eta_\beta^{\mathrm{hw}},\, 1/s)$). In a NVL72 + IB SuperPOD, inner is the intra-pod NVLink fabric and outer is the IB Clos itself, with the leaf/spine sub-tiers feeding into $\eta^{\mathrm{outer}}$.
SHARP at the spine tier of the outer Clos (04_in_network_collectives.md §2.2) replaces the middle term with an INC AR at $n_\alpha = 2k$ switch hops. The spine-row $\eta_\beta$ still applies because oversubscription limits the BW the switch ALU can forward between tiers — SHARP sidesteps the $\alpha$-side contention but not the tier-BW cap from $s$.
5. Re-running $N = 512$ under realistic $\eta$
Using the realistic profile from §4.1, re-run the AR ladder from 04_in_network_collectives.md §3.1 at $N = 512$, $M = 16\,\mathrm{MB}$, $\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$. We score the optimized pairings on each fabric — DBT on star, dim-decomp on torus, NVLS-style INC on the hypothetical single-switch star from 04_in_network_collectives.md §3 — per the scope assumption in §1.
5.1 Ideal vs realistic AR cost
| Topology + algo | $n_\alpha$ | $n_\beta$ | Ideal $\alpha$ | Ideal BW | Ideal total | $\eta_\alpha$ | $\eta_\beta$ | Realistic $\alpha$ | Realistic BW | Realistic total |
|---|---|---|---|---|---|---|---|---|---|---|
| Hypothetical star + NVLS (INC) | 2 | 1 | 1 μs | 17.8 μs | 18.8 μs | 1.00 | 0.52 | 1 μs | 34.2 μs | 35 μs |
| Star + DBT (NCCL) | 18 | 2 | 9 μs | 35.5 μs | 45 μs | 1.00 | 0.80 | 9 μs | 44.4 μs | 53 μs |
| Torus $8^3$ + dim-ring | 42 | 2 | 21 μs | 35.5 μs | 57 μs | 1.20 | 0.60 | 25.2 μs | 59.2 μs | 84 μs |
5.2 What changed
- Star + INC loses $\sim 16\,\mu$s (86% increase — the largest relative penalty of the three). Its ideal BW term is already at the raw-link ceiling ($M / \mathrm{BW}$), so the $\eta_\beta = 0.52$ realization loss nearly doubles the wall-clock cost. Despite this, INC is still the fastest at 35 μs — ~1.5× ahead of star+DBT and ~2.4× ahead of torus.
- Star + DBT loses $\sim 8\,\mu$s (18% increase). The BW-dominated cost is inflated by $\eta_\beta = 0.80$; $\alpha$ is unaffected. A flat penalty on an already small BW term.
- Torus loses $\sim 27\,\mu$s (47% increase). Both $\eta_\alpha$ and $\eta_\beta$ hit it. The BW term goes from matching star (35.5 μs) to 65% worse (59.2 μs). The $\alpha$ term inflates 20%.
5.3 Margin compression
| Ideal | Realistic | Gap to INC (ideal → realistic) | |
|---|---|---|---|
| Star + INC | 18.8 μs | 35 μs | — |
| Star + DBT | 45 μs | 53 μs | $2.4\times$ → $1.5\times$ |
| Torus dim-decomp | 57 μs | 84 μs | $3.0\times$ → $2.4\times$ |
Takeaway: the ideal-model ranking is preserved — INC < DBT < torus — but the margins compress toward INC as realistic contention kicks in. This is the mirror image of the naive expectation. INC’s ideal ceiling is the highest above its realistic cost (86% inflation) because its BW term is closest to the raw-link limit and has nowhere left to hide; DBT and torus start further from their limits and absorb contention into already-padded BW terms. The gap between INC and the next-best software collective compresses from $2.4\times$ to $1.5\times$, and the gap to torus from $3.0\times$ to $2.4\times$ — INC is still the winner, but the ceiling-vs-realized story flagged in 04_in_network_collectives.md §4 (limit 6) is where most of the “marketed 2× vs measured 1.3×” NVLS lift goes.
The non-INC margins also compress: star+DBT vs torus goes from 45 vs 57 μs (~25% gap) ideal to 53 vs 84 μs (~60% gap) realistic, because torus pays $\eta$ at both $\alpha$ and BW while star+DBT only pays it on BW. The “star vs torus” decision under realistic $\eta$ is more contention-sensitive than the ideal numbers suggest.
5.4 AG / RS and A2A under realistic $\eta$
§5.1–§5.3 covered AR. Re-running the AG / RS and A2A ladders from 04_in_network_collectives.md §3.3 at the same anchor ($N = 512$, $M = 16\,\mathrm{MB}$, $\alpha = 0.5\,\mu$s, $\mathrm{BW} = 900\,\mathrm{GB/s}$) uses a different $\eta_\beta$ for the INC row than AR did: there is no switch-ALU on the AG / RS critical path (only multicast), so INC AG / RS inherits the crossbar-multicast baseline $\eta_\beta = 0.80$ rather than NVLS’s 0.52. This is the coefficient-level reflection of the “no BW ceiling lift” structural point from 04_in_network_collectives.md §1.4: INC AG / RS runs at the same realized BW as a well-scheduled software AG / RS on a full-duplex crossbar.
AG / RS.
| Topology + algo | Ideal $\alpha$ | Ideal BW | Ideal total | $\eta_\alpha$ | $\eta_\beta$ | Realistic $\alpha$ | Realistic BW | Realistic total |
|---|---|---|---|---|---|---|---|---|
| Hypothetical star + INC | 1.0 μs | 17.7 μs | 18.7 μs | 1.00 | 0.80 | 1.0 μs | 22.2 μs | 23.2 μs |
| Star + RHD | 4.5 μs | 17.7 μs | 22.2 μs | 1.00 | 0.80 | 4.5 μs | 22.2 μs | 26.7 μs |
| Torus $8^3$ + dim-decomp | 10.5 μs | 17.7 μs | 28.2 μs | 1.20 | 0.60 | 12.6 μs | 29.6 μs | 42.2 μs |
A2A.
| Topology + algo | Ideal $\alpha$ | Ideal BW | Ideal total | $\eta_\alpha$ | $\eta_\beta$ | Realistic $\alpha$ | Realistic BW | Realistic total |
|---|---|---|---|---|---|---|---|---|
| Star + pairwise (NCCL) | 255.5 μs | 17.7 μs | 273 μs | 1.00 | 0.80 | 255.5 μs | 22.2 μs | 278 μs |
| Hypothetical star + INC | — | — | — | — | — | — | — | N/A on shipping hardware |
| Torus $8^3$ + bisection (TPU / Trainium) | 6.0 μs | 17.8 μs | 23.8 μs | 1.20 | 0.60 | 7.2 μs | 29.6 μs | 36.8 μs |
Three observations:
- AG / RS margins are already tight and stay tight. Star INC (23.2 μs) vs star RHD (26.7 μs) vs torus (42.2 μs). The ideal $1.2\times$ INC-vs-RHD gap compresses slightly to $1.15\times$ under realistic $\eta$ — $\alpha$ savings are preserved (INC still wins on $n_\alpha$), but both rows hit the same BW floor so contention doesn’t open the gap. AG / RS is a fundamentally $\alpha$-dominated primitive at this $M$; the whole INC-vs-software fight happens on the $\alpha$ side.
- A2A ranking holds torus under any realistic scoring. Ideal: torus 23.8 μs < pairwise 273 μs. Realistic: torus 36.8 μs < pairwise 278 μs — ordering preserved, but torus’s $\sim 11.5\times$ ideal advantage over the shipped NCCL pairwise compresses to $\sim 7.5\times$ under $\eta$ because pairwise’s $\alpha$-dominated cost absorbs contention only on the tiny BW term while torus pays $\eta$ on both. With no INC path available on shipping hardware, torus is the only primitive that scales A2A without a bisection penalty at large $N$; this is the forward-looking reason Rubin-generation HW-A2A and Tomahawk Ultra’s INC A2A exist, and the main A2A mitigation until they ship at scale.
- Realistic-$\eta$ collapses the cross-primitive ordering into a simple rule. AR wins by ~$1.5\times$ on INC-capable star; AG / RS wins by ~$1.2\times$ on INC; A2A wins by ~$7.5\times$ on cubic torus (no INC help on shipping HW). The realistic $\eta$ profile doesn’t reshuffle the winner within each primitive but does tighten every margin — which directly feeds the topology-choice Pareto sweep of
03_hierarchical_topologies.md §3.3and §7 below.
6. What the coefficient model doesn’t capture
Scalar $(\eta_\alpha, \eta_\beta)$ per (fabric, collective) is a back-of-envelope model. It cannot express:
6.1 Layout-dependent contention
Two TP groups whose ranks are all in dim-0 conflict. Two TP groups whose ranks are orthogonal (one in dim-0, one in dim-2) don’t. A single $\eta_\beta$ averages over layout. A layout-aware model would need to take the rank-to-coordinate map as input and compute link-sharing per pair of groups.
6.2 Payload-size-dependent contention
Contention often manifests at small messages (queue delay dominates) and vanishes at large messages (BW utilization is steady). The coefficient is a workload-average — it over-predicts penalty for large-$M$ collectives (prefill, bulk gradient AR) and under-predicts for small-$M$ collectives (decode AR, latency-critical HPC reductions).
6.3 Dynamic patterns
Expert skew, traffic bursts, and temporal collisions between collective steps are all dynamic. A coefficient matched to average load under-reports tail latency, which is often what the user cares about (latency SLA > average throughput).
6.4 Cross-tier propagation in multi-tier fabrics
In hierarchical fabrics (multi-tier Clos, fat-tree, or any switched fabric where upper tiers are oversubscribed relative to lower tiers), saturation at an upper tier can push backpressure into lower tiers via credit-based flow control — but a per-tier $\eta$ treats tiers as independent. Tight flow-controlled fabrics violate this assumption; loose store-and-forward or timeout-based fabrics approximate it better.
For any of these, the right answer is a higher-fidelity simulator (packet-level or event-driven, e.g. SST, Booksim). The coefficient model is the layer between “zero contention” and “simulate it” — cheap enough to use in a topology-choice Pareto sweep, principled enough that the rankings mean something.
7. When $\eta$ is load-bearing
The $(\eta_\alpha, \eta_\beta)$ defaults in §4.1 are calibrated from public measurements but still broadly applicable. They’re good enough for three classes of decision:
- First-pass topology selection. “Star vs torus for a new 512-GPU cluster” — the ideal-$\eta$ ranking often holds, but realistic-$\eta$ shifts the margins. Both sweeps should agree on the top choice; if they don’t, the decision is contention-sensitive and warrants simulation.
- Scheduler-policy review. “Are our rank-allocation and concurrent-group policies prefix-aligned on torus?” — off-prefix (§3.1) and concurrent-group (§3.2) penalties are the primary reason realistic torus $\eta_\beta$ sits at 0.60 vs the 0.80 a star pays; a policy change can recover much of that.
- Sensitivity analysis. “How much does our end-to-end step time depend on contention?” (TPOT for inference, step-time for training) — sweep $\eta$ from (1.0, 1.0) ideal to (1.5, 0.4) aggressive and report the envelope.
Cases where $\eta$-based ranking is not enough and you should escalate to real simulation or measurement:
- Deployment budgeting for a new fabric architecture. “What’s the actual max-throughput step time for configuration X?” (TPOT for inference, iteration time for training) — coefficient model will be off by factor-of-2 under unmodeled cross-tier dynamics; need hardware-trace calibration.
- Production SLA guarantees. Tail latency dominates; coefficients give average-case estimates.
- Comparing fabrics within 20% of each other. The error bars on $\eta$ defaults are wider than 20%; any “ideal-$\eta$ ranking changes under realistic-$\eta$” result should prompt a closer look, not a definitive call.
Further reading
04_in_network_collectives.md§3.1 — the ideal-$\eta$ $N = 512$ AR ladder that this note re-runs under realistic $\eta$ in §5.04_in_network_collectives.md§1 — SHARP / NVLS effectively dodges contention by doing the reduction in the switch itself; the $\eta$ story for SHARP-enabled AR is much tighter than for software collectives.03_hierarchical_topologies.md§3.2 — how the $\eta$ profile here generalizes when topologies compose across tiers: each tier carries its own $(\eta_\alpha, \eta_\beta)$, and the hierarchical cost rule propagates them per phase.02_topology_mapping.md§5.2 — the four ways the ideal-$\eta$ model breaks (concurrent groups, off-prefix layouts, skewed A2A, mixed traffic), foreshadowed there and priced quantitatively in this note.references.md— NCCL-tests methodology and TPU v4 paper for the calibration sources.
References: Collectives Explainer Series
Author: Yue Lu
Date: April 2026
Self-contained bibliography for 01–05 in this folder. Every non-obvious equation and empirical value in the series has been cross-checked against the sources below.
Primary algorithm papers
[ALPHA-BETA] Hockney, R. (1994). The Communication Challenge for MPP: Intel Paragon and Meiko CS-2. Parallel Computing 20(3). → Canonical α-β latency model $t = \alpha + M/\mathrm{BW}$. Used throughout 01_collective_algorithms.md §1.
[PY09] Patarasuk, P., & Yuan, X. (2009). Bandwidth Optimal All-Reduce Algorithms for Clusters of Workstations. JPDC 69(2):117–124. → Ring AR achieves $2(N-1)/N \cdot M/\mathrm{BW}$ on any tree-connected fabric. Backs up 01_collective_algorithms.md §5.1 and 02_topology_mapping.md §3.4. https://doi.org/10.1016/j.jpdc.2008.09.002
[TRG05] Thakur, R., Rabenseifner, R., & Gropp, W. (2005). Optimization of Collective Communication Operations in MPICH. IJHPCA 19(1):49–66. → Recursive halving-doubling all-reduce achieves $2\lceil \log_2 N \rceil \alpha + 2 M/\mathrm{BW}$ for power-of-2 N; describes the ring / RHD / Rabenseifner crossover rules implemented by modern MPI libraries. Backs up 01_collective_algorithms.md Appendix B.2, 01_collective_algorithms.md Appendix B.4, and 02_topology_mapping.md Appendix A. https://journals.sagepub.com/doi/10.1177/1094342005051521, author preprint: https://web.cels.anl.gov/~thakur/papers/ijhpca-coll.pdf
[CHPV07] Chan, E., Heimlich, M., Purkayastha, A., & van de Geijn, R. (2007). Collective Communication: Theory, Practice, and Experience. Concurrency and Computation: Practice and Experience, 19(13):1749–1783. → Dim-decomposed all-reduce framework and telescoping derivation of multi-dim ring costs. Backs up the torus dim-decomp cost and BW telescoping in 02_topology_mapping.md §3.1.
[BHKUW97] Bruck, J., Ho, C.-T., Kipnis, S., Upfal, E., & Weathersby, D. (1997). Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE TPDS 8(11):1143–1156. → Log-round all-to-all via circular shifts and bit-wise exchange; $\lceil \log_2 N \rceil$ steps at the cost of moving the full payload per step. Backs up the “log A2A” variant in 01_collective_algorithms.md Appendix B.5. https://doi.org/10.1109/71.642949
[SST09] Sanders, P., Speck, J., & Träff, J.L. (2009). Two-Tree Algorithms for Full Bandwidth Broadcast, Reduction and Scan. Parallel Computing 35(12):581–594. → Double binary tree (DBT) construction: two complementary trees that each rank is interior in exactly one of, used to saturate both directions of every full-duplex link during broadcast / reduce / AR. Backs up the DBT derivation in 01_collective_algorithms.md §5.2, the DBT shipping rationale in 01_collective_algorithms.md §5.3, and the binomial-tree BC / Reduce cost in 01_collective_algorithms.md §3.2 and §4.1. https://doi.org/10.1016/j.parco.2009.09.001
[LEIS85] Leiserson, C.E. (1985). Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers C-34(10):892–901. → Foundational fat-tree network construction: universal routing with bandwidth that grows fatter toward the root; the basis for Clos topology analysis in modern AI fabrics. Backs up the fat-tree / Clos topology treatment in 03_hierarchical_topologies.md §1 and the Leiserson-vs-Al-Fares comparison in 03_hierarchical_topologies.md Appendix B. https://doi.org/10.1109/TC.1985.6312192
[AL-FARES08] Al-Fares, M., Loukissas, A., & Vahdat, A. (2008). A Scalable, Commodity Data Center Network Architecture. ACM SIGCOMM 2008, pp. 63–74. → The k-ary fat-tree construction from commodity k-port switches: each switch splits its k ports k/2 up / k/2 down, yielding a rearrangeably non-blocking Clos with $s = 1$ at every tier. The equivalence “k-ary fat-tree = Clos with $s = 1$” that modern datacenter fabrics rely on is proved constructively in §3. Backs up the Clos-vs-fat-tree distinction in 03_hierarchical_topologies.md §1 and the k-ary fat-tree derivation in 03_hierarchical_topologies.md Appendix B. https://doi.org/10.1145/1402958.1402967
[THAKUR-HIER] Thakur, R., & Gropp, W. (2003). Improving the Performance of Collective Operations in MPICH. EuroPVM/MPI 2003, LNCS 2840:257–267. → Hierarchical-aware collective scheduling for MPI: tier-aware intra-node RS + inter-node AR + intra-node AG decomposition with empirical validation on IBM SP and Myrinet clusters. The canonical academic reference for the RS→AR→AG hierarchical pattern formalized in 03_hierarchical_topologies.md §2.1. https://doi.org/10.1007/978-3-540-39924-7_38
[MAGPIE] Kielmann, T., Hofman, R.F.H., Bal, H.E., Plaat, A., & Bhoedjang, R.A.F. (1999). MagPIe: MPI’s Collective Communication Operations for Clustered Wide Area Systems. PPoPP 1999, pp. 131–140. → Hierarchical collective algorithms for wide-area clusters with heterogeneous tier bandwidths; introduces the principle of “do most of the work on the fastest tier” that informs the inner-vs-outer tier tradeoff in 03_hierarchical_topologies.md §2. https://doi.org/10.1145/301104.301116
[NCCL-PAT] NVIDIA Corporation. (2024). NCCL 2.23 Release Notes / “New collective algorithms for small message sizes — PAT for inter-node AllGather and ReduceScatter at 1 rank per node.” Accompanying blog: Jeaugey, S., “Introducing NCCL 2.23: Parallel Aggregated Trees for AllGather and ReduceScatter at Scale.” NVIDIA Developer Blog, Aug 2024. → PAT (Parallel Aggregated Trees): reversed-Bruck offset schedule ($4, 2, 1, \ldots$), $\log_2 N$ rounds, $M/N$-byte bounded buffer per round, total per-rank on-wire volume $(N-1)M/N$. Shipping scope: inter-node AG / RS only, at 1 rank per node. Backs up 01_collective_algorithms.md §6 (scale-out rationale) and 01_collective_algorithms.md Appendix A (full derivation). https://docs.nvidia.com/deeplearning/nccl/release-notes/rel_2-23-4.html
[DEMYST-NCCL] Jeaugey, S., Addair, T., et al. (2025). Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms. arXiv:2507.04786. → Empirical tuner-trace analysis of NCCL’s algorithm selection across AR / AG / RS at representative scales; confirms the Tree-vs-Ring selection rule (DBT for small-$M$, ring for large-$M$) on NVLink / NVSwitch fabrics and documents the per-regime crossover points that the α-β model alone does not predict. Backs up the “practice caveat” at the end of 01_collective_algorithms.md §5.3 and the corresponding note in 02_topology_mapping.md §2. https://arxiv.org/abs/2507.04786
Topology architecture papers
[TPU-V4] Jouppi, N.P., et al. (2023). TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings. ISCA 2023. → 3D torus with optical circuit switching for slice reconfiguration; twisted-torus 1.63× A2A gain on asymmetric layouts (§V). Backs up the TPU torus deployment note in 02_topology_mapping.md §3.1 and the OCS slice-reshaping commentary in 02_topology_mapping.md §3.6; source for the torus realistic $\eta_\beta \approx 0.60$ calibration in 05_contention_and_congestion.md §4.1. https://doi.org/10.1145/3579371.3589350
[TRN2-ARCH] Amazon Web Services. (2024–2025). Amazon EC2 Trn2 Architecture. AWS Neuron Documentation. → Trn2 server: 16 Trainium2 chips arranged as a 2D NeuronLink torus (each chip connects to 4 neighbors). Trn2 UltraServer: four Trn2 instances joined via a Z-dimension NeuronLink into a 3D 64-chip torus. Backs up the Trainium adoption note in 02_topology_mapping.md §3.1 and §3.6. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/about-neuron/arch/neuron-hardware/trn2-arch.html
[NEURON-CC] Amazon Web Services. (2024–2025). Neuron Collective Communication / Intra-node Collectives. AWS Neuron Documentation. → NeuronX Collective Communication Library ships ring / mesh / KangaRing / RDH all-reduce variants purpose-built for Trainium’s 2D/3D NeuronLink torus; per-NeuronCore CC Cores offload the orchestration of collective phases. Backs up the NeuronX CCL reference in 02_topology_mapping.md §3.1. https://awsdocs-neuron.readthedocs-hosted.com/en/latest/neuron-runtime/explore/intranode-collective-comm.html
[DGX-SUPERPOD] NVIDIA Corporation. (2023–2024). NVIDIA DGX SuperPOD Reference Architecture: Featuring NVIDIA DGX H100 / H200 / GB200 Systems. NVIDIA Enterprise Documentation. → Canonical production reference architecture for InfiniBand-based AI training pods: full-bisection (s = 1) leaf-spine Clos within each Scalable Unit (SU), optional oversubscription between SUs at the super-spine when cost dominates over locality. Backs up the production-Clos deployment claim in 03_hierarchical_topologies.md §1 and the rail-optimized layout discussion in 03_hierarchical_topologies.md §1.1. https://docs.nvidia.com/https:/docs.nvidia.com/dgx-superpod/reference-architecture-scalable-infrastructure-h100/latest/
[META-RSC] Meta AI. (2022). Introducing the AI Research SuperCluster — Meta’s cutting-edge AI supercomputer for AI research. Meta AI Blog, Jan 2022. → Meta Research SuperCluster (RSC): 16,000 A100 GPUs on a 3-tier InfiniBand HDR Clos fabric with full pod-level bisection; representative of production hyperscaler AI training deployments. Backs up the production-Clos deployment claim in 03_hierarchical_topologies.md §1. https://ai.meta.com/blog/ai-rsc/
In-network collectives
[SHARP-IB] Graham, R.L., et al. (2016). Scalable Hierarchical Aggregation Protocol (SHArP): A Hardware Architecture for Efficient Data Reduction. COMHPC Workshop at SC16. → Original SHARP specification for InfiniBand; hardware-offloaded reduction trees; reports ~95% of network bandwidth utilization and 2–5× speedup over host-based reduction. Backs up the in-network reduction (switch ALU) mechanism in 04_in_network_collectives.md §1.1 and the SHARP-class Quantum SHARP / Spectrum-X SHARP commercial discussion in 04_in_network_collectives.md §2.2. https://network.nvidia.com/pdf/solutions/hpc/paperieee_copyright.pdf
[NVLINK-SHARP] NVIDIA Corporation. (2023–2024). NVLink SHARP (NVLS) — In-Network All-Reduce Acceleration. GTC S62129; NCCL release notes; GB200 NVL Multi-Node Tuning Guide. → NVSwitch-based in-network reduction; AR busbw rises from ~360 GB/s to 470+ GB/s under NVLS (≈1.3× measured). Backs up the NVLS mechanism in 04_in_network_collectives.md §1.1 and the scale-up INC commercial summary in 04_in_network_collectives.md §2.1. The measured 1.3× AR BW lift is the calibration source for the 03_hierarchical_topologies.md §3.1 inner-tier-INC commentary and for the NVLS-row $\eta_\beta \approx 0.52$ in 05_contention_and_congestion.md §4.1. https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/nccl.html
[NCCL-NVLS-2.27] NVIDIA Corporation. (2025). Enabling Fast Inference and Resilient Training with NCCL 2.27. NVIDIA Technical Blog, May 2025. → Up to 2.5× improvement on small-to-medium messages in Symmetric Memory configurations with NVLS. Backs up the small-$M$ speedup regime discussion in 04_in_network_collectives.md §3.2. https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27/
[TH-ULTRA] Broadcom. (2025). Broadcom Ships Tomahawk Ultra: Reimagining the Ethernet Switch for HPC and AI Scale-up. Jul 2025. → First Ethernet switch ASIC with in-network collectives (INC); 51.2 Tbps full-duplex, 250 ns switch latency. Backs up the Tomahawk Ultra HW A2A mention in 04_in_network_collectives.md §1.3 and the Ethernet INC entry in 04_in_network_collectives.md §2.1. https://investors.broadcom.com/news-releases/news-release-details/broadcom-ships-tomahawk-ultra-reimagining-ethernet-switch-hpc
Hardware specifications
[NVLINK5-SPEC] NVIDIA Corporation. (2024). NVLink 5 / NVSwitch Gen4 / GB200 NVL72 Architecture. NVIDIA product pages. → NVLink 5: 18 links × 100 GB/s per direction = 1.8 TB/s bidirectional per GPU (900 GB/s unidirectional). NVSwitch Gen4: 72 NVLink 5 ports per chip. NVL72: 72 GPUs, 130 TB/s aggregate all-to-all bandwidth. Backs up the per-link bandwidth numbers in 01_collective_algorithms.md §5.1, 02_topology_mapping.md §1.1, and 03_hierarchical_topologies.md §1.1. https://www.nvidia.com/en-us/data-center/nvlink/
[UCIE] UCIe Consortium. (2023–2024). Universal Chiplet Interconnect Express (UCIe) Specification v1.1 / v2.0. → Standard die-to-die interface for chiplet interconnects; per-lane signaling rates up to 32 GT/s (v1.1) / 40 GT/s (v2.0), target latency ~2 ns per hop for advanced package PHY. Backs up the chiplet-mesh $\alpha$ and BW numbers in 02_topology_mapping.md §1.3 and §4.1. https://www.uciexpress.org/specifications
[NCCL-TESTS] NVIDIA Corporation. (2020–2025). NCCL Performance Tests. → Canonical busbw/algbw measurement methodology for NCCL collectives; H100/A100 intra-node AR busbw ≈ 360 GB/s against 450 GB/s peak NVLink4 unidirectional. Backs up the busbw/algbw vocabulary in 01_collective_algorithms.md Appendix D and is the calibration source for the crossbar $\eta_\beta \approx 0.80$ in 05_contention_and_congestion.md §4.1. https://github.com/NVIDIA/nccl-tests
[H100-SPEC] NVIDIA Corporation. (2022). NVIDIA H100 Tensor Core GPU Architecture. NVIDIA Whitepaper WP-10792-001. → H100 SXM5 NVLink 4.0: 900 GB/s bidirectional per GPU (450 GB/s unidirectional). Baseline “pre-NVLink5” BW numbers used in 02_topology_mapping.md §2 and the calibration baseline in 05_contention_and_congestion.md §4.1. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
[QX800] NVIDIA Corporation. (2024). NVIDIA Quantum-X800 InfiniBand Platform / Q3400 Switch Series. NVIDIA product pages and datasheet. → 144-port 800 Gbps (XDR) InfiniBand switch (Q3400-LD air-cooled and Q3400-RA liquid-cooled variants); 115.2 Tbps aggregate bidirectional throughput; designed for GB200 NVL72 SuperPOD scale-out fabric. Backs up the 144-port ToR radix used in both pod-local and rail-optimized X800 layouts in 03_hierarchical_topologies.md §1.1. https://www.nvidia.com/en-us/networking/quantum-x800/
Tag → section map
Where each citation is used in the series:
| Tag | Used in |
|---|---|
[ALPHA-BETA] | 01_collective_algorithms.md §1 |
[PY09] | 01_collective_algorithms.md §5.1, 02_topology_mapping.md §3.4 |
[TRG05] | 01_collective_algorithms.md App. B.2, App. B.4; 02_topology_mapping.md App. A |
[SST09] | 01_collective_algorithms.md §3.2, §4.1, §5.1, §5.2, §5.3 |
[CHPV07] | 02_topology_mapping.md §3.1 |
[BHKUW97] | 01_collective_algorithms.md App. B.5 |
[NCCL-PAT] | 01_collective_algorithms.md §6, App. A |
[DEMYST-NCCL] | 01_collective_algorithms.md §5.3; 02_topology_mapping.md §2 |
[LEIS85] | 03_hierarchical_topologies.md §1, App. B (B.1) |
[AL-FARES08] | 03_hierarchical_topologies.md §1, App. B (B.2) |
[THAKUR-HIER] | 03_hierarchical_topologies.md §2.1 |
[MAGPIE] | 03_hierarchical_topologies.md §2 (inner-vs-outer tier tradeoff) |
[TPU-V4] | 02_topology_mapping.md §3.1, §3.6; 05_contention_and_congestion.md §4.1 |
[TRN2-ARCH] | 02_topology_mapping.md §3.1, §3.6 |
[NEURON-CC] | 02_topology_mapping.md §3.1 |
[DGX-SUPERPOD] | 03_hierarchical_topologies.md §1, §1.1 |
[META-RSC] | 03_hierarchical_topologies.md §1 |
[NCCL-TESTS] | 01_collective_algorithms.md App. D; 05_contention_and_congestion.md §4.1 |
[SHARP-IB] | 04_in_network_collectives.md §1.1, §2.2 |
[NVLINK-SHARP] | 04_in_network_collectives.md §1.1, §2.1; 03_hierarchical_topologies.md §3.1; 05_contention_and_congestion.md §4.1 |
[NCCL-NVLS-2.27] | 04_in_network_collectives.md §3.2 |
[TH-ULTRA] | 04_in_network_collectives.md §1.3, §2.1 |
[NVLINK5-SPEC] | 01_collective_algorithms.md §5.1; 02_topology_mapping.md §1.1; 03_hierarchical_topologies.md §1.1 |
[UCIE] | 02_topology_mapping.md §1.3, §4.1 |
[H100-SPEC] | 02_topology_mapping.md §2; 05_contention_and_congestion.md §4.1 |
[QX800] | 03_hierarchical_topologies.md §1.1 |
Verification notes (what was cross-checked)
The following equations and empirical values were independently verified (not just transcribed from secondary sources):
| Claim | Verified against | Status |
|---|---|---|
| Ring AR cost $2(N-1)\alpha + 2(N-1)/N \cdot M/\mathrm{BW}$ | [PY09] direct formula | ✓ |
| Tree (RHD) AR $2 \lceil \log_2 N \rceil \alpha + 2 M/\mathrm{BW}$ | [TRG05] §3.3 | ✓ |
| Torus dim-decomp $2 \sum(D_i-1)\alpha + 2(N-1)/N \cdot M/\mathrm{BW}$ | [CHPV07], [PY09] telescoping | ✓ |
| NVL72 = 72 GPUs per NVSwitch domain | [NVLINK5-SPEC] | ✓ |
| NVLink 5 per-GPU unidirectional 900 GB/s | [NVLINK5-SPEC]: 18 × 100 GB/s unidir | ✓ |
| Log A2A $\lceil \log_2 N \rceil$ steps | [BHKUW97] §III | ✓ |
| NVLS measured 470+ GB/s busbw | [NVLINK-SHARP] GB200 tuning guide | ✓ (≈1.3× over 360 GB/s non-SHARP; paper’s “1.7×” is the IB Quantum-2 figure, not NVLS) |
| Tomahawk Ultra 250 ns switch latency | [TH-ULTRA] | ✓ |
| Crossbar $\eta_\beta \approx 0.80$ (360/450) | [NCCL-TESTS] PERFORMANCE.md | ✓ |
| Torus $\eta_\beta \approx 0.60$ from twisted-torus 1.63× | [TPU-V4] §V | ✓ (upper-bound interpretation) |
Known numerical caveats (documented in-line)
- The $\alpha$ values used in worked examples ($\alpha = 0.5\,\mu$s intra-NVLink, $\alpha \approx 1\,\mu$s inter-domain) are order-of-magnitude estimates consistent with NVSwitch cut-through (~100 ns) + endpoint software floor (~800 ns). Exact values depend on the NCCL path chosen at runtime; the model’s conclusions are robust to $\pm 2\times$ variation on $\alpha$ because the bandwidth term dominates for $M \geq$ few hundred KB.
- The A2A on star entries in
02_topology_mapping.md§2.5 and §5.1 use the per-rank pairwise direct-send formula (NVSwitch aggregate bisection saturated), appropriate for MoE workloads where all ranks participate simultaneously.