Occupancy Math on the AMD MI355X: A From-First-Principles Guide
Decoding Occupancy Math for the AMD MI355X (CDNA4)
A First-Principles Technical Deep Dive
When you ask a GPU kernel engineer about the health of their code, "occupancy" is almost always one of the first terms they mention. Yet, despite its ubiquity, it remains one of the most misunderstood metrics in the field.
Most developers treat occupancy as a "black box" percentage provided by a profiler, rather than a predictable result of hardware constraints.
In reality, occupancy is a fully derivable value. By analyzing a kernel's resource consumption against fixed hardware ceilings, you can calculate it by hand. Mastering this derivation transforms how you approach performance tuning.
On the MI355X, occupancy represents the fraction of a SIMD's wavefront slots that are actively filled. This value is dictated by the "bottleneck" resource—whichever of these four limits is exhausted first:
- VGPRs (Vector General Purpose Registers)
- SGPRs (Scalar General Purpose Registers)
- LDS (Local Data Share)
- Workgroup/Barrier slots
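The fundamental math is simple: occupancy = min(VGPR limit, SGPR limit, LDS limit, workgroup limit) / maximum waves per SIMD, where each limit is the number of waves per SIMD that the resource allows.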
The "Occupancy Trap"
It is critical to realize that maximizing occupancy is often the wrong objective. In tests using MXFP8 MFMA (Matrix Fused Multiply-Add), the matrix cores can maintain a large fraction of peak throughput even when occupancy is low. This suggests that throughput is tied to matrix-engine utilization, not necessarily to how many wavefronts are resident on the SIMD.
Roadmap of this Guide
- Part 1: The MI355X Architecture. Understanding the silicon, the CU, and the memory hierarchy.
- Part 2: The Math of Limiters. Calculating the ceiling and using rocprofv3.
- Part 3: The Nuance of Performance. Little’s Law, ILP, and why "full" isn't always "fast."
Part 1 — The MI355X Architecture
Occupancy is essentially a resource allocation problem. To solve the math, you must first distinguish between private and shared resources.
High-Level Topology
The MI355X is composed of eight Accelerator Complex Dies (XCDs) connected via 4th-gen Infinity Fabric.
| Component | Specification |
|---|---|
| Total Compute Units (CUs) | 256 (32 per XCD) |
| Clock Speed | Up to 2.4 GHz |
| VRAM | 288 GB HBM3E |
| Memory Bandwidth | 8 TB/s |
| Last-Level Cache | 256 MB Infinity Cache |
While XCD-aware swizzling is vital for data locality (keeping traffic within a single L2 slice), occupancy is calculated at the Compute Unit (CU) level.
Inside the Compute Unit
A single CU consists of four SIMD units and supporting infrastructure. The critical resources for our calculations are:
1. The Vector Register File (VGPRs)
Each lane has a 512-entry register file. This pool is shared between standard registers and accumulator registers.
- The Rule: A wavefront can use up to 512 total VGPRs, with a flexible split (up to 256 of each type).
- Old architecture (MI100/MI200): separate physical ACC files.
- CDNA4: unified budget.
In most MXFP8 GEMM kernels, the compiler prioritizes regular VGPRs; accumulators only spill into the dedicated pool for the largest tile sizes.
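As a rough illustration of how the register budget turns into a wave count, the sketch below divides the 512-entry per-lane file by a kernel's combined register usage and caps the result at the 8 wavefront slots per SIMD used throughout this guide. The function name `waves_from_vgprs` and the `granule` parameter (an allocation granularity of 8 registers) are assumptions made for this example, not figures taken from this guide.

```python
def waves_from_vgprs(vgprs_per_thread, file_depth=512, max_waves=8, granule=8):
    """Waves per SIMD allowed by the vector register file.

    vgprs_per_thread: regular plus accumulator registers used by one thread.
    granule: assumed allocation granularity; requests are rounded up to it.
    """
    if not 0 < vgprs_per_thread <= file_depth:
        raise ValueError("register count must be in 1..file_depth")
    # Round the request up to the assumed allocation granularity.
    allocated = -(-vgprs_per_thread // granule) * granule
    # Each resident wave reserves `allocated` entries in every lane.
    return min(file_depth // allocated, max_waves)


for regs in (64, 128, 200, 512):
    print(regs, "VGPRs ->", waves_from_vgprs(regs), "waves per SIMD")
```

Under these assumptions, a kernel that uses the full 512-register budget leaves room for only one wave per SIMD, while one that stays at or below 64 registers is no longer register-limited.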
2. Local Data Share (LDS)
The LDS is the primary cooperation mechanism for workgroups. On CDNA4, the LDS grows to 160 KB per CU, up from the 64 KB found in CDNA3.
The Critical Distinction:
- VGPRs are per-SIMD.
- LDS is per-CU.
This asymmetry is why the denominators in occupancy calculations differ depending on the resource.
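The per-CU versus per-SIMD distinction can be made concrete with a small sketch. It uses the 160 KB LDS figure and the four SIMDs per CU stated in this guide, and it assumes that a workgroup's waves are spread evenly across those SIMDs; the function name `waves_from_lds` is hypothetical.

```python
LDS_BYTES_PER_CU = 160 * 1024   # per-CU budget quoted in this guide
SIMDS_PER_CU = 4
MAX_WAVES_PER_SIMD = 8


def waves_from_lds(lds_bytes_per_workgroup, waves_per_workgroup):
    """Waves per SIMD allowed by LDS, assuming waves spread evenly over SIMDs."""
    if lds_bytes_per_workgroup == 0:
        return MAX_WAVES_PER_SIMD          # LDS is not a limiter
    # LDS is a per-CU resource: count whole workgroups that fit in one CU.
    workgroups_per_cu = LDS_BYTES_PER_CU // lds_bytes_per_workgroup
    # Convert workgroups per CU into waves per SIMD so the result can be
    # compared directly with the per-SIMD register limit.
    waves_per_simd = workgroups_per_cu * waves_per_workgroup // SIMDS_PER_CU
    return min(waves_per_simd, MAX_WAVES_PER_SIMD)


# A 256-thread workgroup (4 waves) that uses 64 KB of LDS:
print(waves_from_lds(64 * 1024, 4))
```

In this example only two workgroups fit in a CU, so each SIMD holds two waves no matter how few registers the kernel uses.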
Threads, Lanes, and Wavefronts
The hardware does not schedule individual threads. Instead, it uses wavefronts: bundles of 64 threads executing in lockstep.
- Wavefront Size: 64 threads (the counterpart of an NVIDIA warp, which is 32 threads wide).
- Workgroup Example: A 256-thread workgroup = 4 wavefronts.
- Resident Waves: Waves whose registers are reserved on the hardware.
- Scheduling: The scheduler switches between resident waves to hide latency when one stalls.
Occupancy Definition: The ratio of resident waves to the maximum possible (8 waves per SIMD, or 32 per CU).
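The bookkeeping above can be sketched in a few lines: round a workgroup's thread count up to whole wavefronts, then express a resident-wave count as a fraction of the 8 slots per SIMD. The helper names are hypothetical and exist only for this illustration.

```python
WAVE_SIZE = 64
MAX_WAVES_PER_SIMD = 8


def waves_per_workgroup(threads):
    """Number of 64-thread wavefronts needed for a workgroup (rounded up)."""
    return (threads + WAVE_SIZE - 1) // WAVE_SIZE


def occupancy(resident_waves_per_simd):
    """Fraction of a SIMD's wavefront slots that are filled."""
    return min(resident_waves_per_simd, MAX_WAVES_PER_SIMD) / MAX_WAVES_PER_SIMD


print(waves_per_workgroup(256))   # 4 wavefronts, matching the workgroup example
print(occupancy(2))               # 0.25
```

Note that a workgroup whose size is not a multiple of 64 still consumes whole wave slots: 100 threads occupy two wavefronts, the second one only partially filled.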
![Architecture Diagram Placeholder: A visual representation of a SIMD unit showing 64 lanes and the scheduler switching between 8 possible wavefront slots]
The "Rosetta Stone" for CUDA Developers
If you are coming from an NVIDIA background, the concepts map almost perfectly:
| AMD Term | NVIDIA Equivalent | Note |
|---|---|---|
| Wavefront | Warp | AMD is 64-wide; NVIDIA is 32-wide |
| CU (Compute Unit) | SM (Streaming Multiprocessor) | The primary scheduling unit |
| LDS | Shared Memory | Per-SM/CU scratchpad |
| VGPR | Register | Per-thread vector registers |
To calculate the actual ceiling, one might use a logic flow similar to this:
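```python
def calculate_occupancy(vgpr_used, sgpr_used, lds_used, waves_per_workgroup):
    """Estimate resident waves per SIMD from the VGPR and LDS limiters."""
    # Hardware limits for MI355X, as stated in this guide
    MAX_WAVES_PER_SIMD = 8
    SIMDS_PER_CU = 4
    VGPRS_PER_LANE = 512           # register file depth, per SIMD
    MAX_LDS_PER_CU = 160 * 1024    # bytes, shared by the whole CU
    # VGPR limiter (per SIMD): how many waves fit in the register file
    vgpr_limit = VGPRS_PER_LANE // max(vgpr_used, 1)
    # LDS limiter (per CU): workgroups that fit, converted to waves per SIMD
    workgroups = MAX_LDS_PER_CU // lds_used if lds_used else float("inf")
    lds_limit = workgroups * waves_per_workgroup / SIMDS_PER_CU
    # sgpr_used is accepted but not modeled: the SGPR and workgroup-slot
    # budgets are not given in this section.
    return int(min(vgpr_limit, lds_limit, MAX_WAVES_PER_SIMD))
```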