Decoding Occupancy Math for the AMD MI355X (CDNA4)

A First-Principles Technical Deep Dive

When you ask a GPU kernel engineer about the health of their code, "occupancy" is almost always one of the first terms they mention. Yet, despite its ubiquity, it remains one of the most misunderstood metrics in the field.

Most developers treat occupancy as a "black box" percentage provided by a profiler, rather than a predictable result of hardware constraints.

In reality, occupancy is a fully derivable value. By analyzing a kernel's resource consumption against fixed hardware ceilings, you can calculate it by hand. Mastering this derivation transforms how you approach performance tuning.

On the MI355X, occupancy is the fraction of a SIMD's wavefront slots that hold resident wavefronts. It is set by the bottleneck resource, that is, whichever of these four limits is exhausted first:

VGPRs (Vector General Purpose Registers)
SGPRs (Scalar General Purpose Registers)
LDS (Local Data Share)
Workgroup/Barrier slots

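The fundamental math is simple. Each resource caps the number of resident waves, the tightest cap wins, and occupancy is that wave count divided by the number of wavefront slots: $\text{Occupancy} = \frac{1}{W_{\max}} \min\left(W_{\max},\ \min_i \left\lfloor \frac{\text{Hardware Budget}_i}{\text{Kernel Usage}_i} \right\rfloor\right)$, where $W_{\max}$ is the number of wavefront slots and $i$ ranges over the four resources above.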

The "Occupancy Trap"

It is critical to realize that maximizing occupancy is often the wrong objective. In tests using MXFP8 MFMA (Matrix Fused Multiply-Add), the matrix cores can maintain $\approx 97\%$ of peak throughput even when occupancy is low. This indicates that throughput tracks matrix-engine utilization, not the number of wavefronts resident on the SIMD.

Roadmap of this Guide

Part 1: The MI355X Architecture $\rightarrow$ Understanding the silicon, the CU, and the memory hierarchy.
Part 2: The Math of Limiters $\rightarrow$ Calculating the ceiling and using rocprofv3.
Part 3: The Nuance of Performance $\rightarrow$ Little’s Law, ILP, and why "full" isn't always "fast."

Part 1 — The MI355X Architecture

Occupancy is essentially a resource allocation problem. To solve the math, you must first distinguish between private and shared resources.

High-Level Topology

The MI355X is composed of eight Accelerator Complex Dies (XCDs) connected via 4th-gen Infinity Fabric.

| Component | Specification |
| --- | --- |
| Total Compute Units (CUs) | 256 (32 per XCD) |
| Clock Speed | Up to 2.4 GHz |
| VRAM | 288 GB HBM3E |
| Memory Bandwidth | 8 TB/s |
| Last-Level Cache | 256 MB Infinity Cache |

While XCD-aware swizzling is vital for data locality (keeping related workgroups' traffic within a single XCD's L2 cache), occupancy is calculated at the Compute Unit (CU) level.
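To make the swizzling idea concrete, here is a minimal sketch of a workgroup-ID remap. It assumes the hardware dispatches workgroups round-robin across XCDs (workgroup `pid` lands on XCD `pid % 8`); that dispatch order is an assumption of this sketch, not something established above, and the name `xcd_swizzle` and its parameters are hypothetical. The remap gives each XCD a contiguous range of logical tile IDs, so neighbouring tiles share one L2.

```python
def xcd_swizzle(pid, num_pids, num_xcds=8):
    """Map a hardware workgroup ID to a logical tile ID.

    ASSUMPTION: workgroup `pid` runs on XCD `pid % num_xcds` (round-robin).
    For simplicity the sketch requires num_pids to be a multiple of num_xcds.
    """
    if num_pids % num_xcds != 0:
        raise ValueError("sketch requires num_pids divisible by num_xcds")
    if not 0 <= pid < num_pids:
        raise ValueError("pid out of range")

    tiles_per_xcd = num_pids // num_xcds
    xcd = pid % num_xcds       # which XCD this workgroup is assumed to land on
    slot = pid // num_xcds     # its position among that XCD's workgroups
    # Give each XCD one contiguous block of logical tile IDs.
    return xcd * tiles_per_xcd + slot


if __name__ == "__main__":
    # 16 workgroups on 8 XCDs: each XCD receives two adjacent logical tiles.
    for pid in range(16):
        print(pid, "->", xcd_swizzle(pid, 16))
```

With 16 workgroups, hardware IDs 0 and 8 (both assumed to land on XCD 0) become logical tiles 0 and 1. None of this changes occupancy; it only changes which tiles share a cache.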

Inside the Compute Unit

A single CU consists of four SIMD units and supporting infrastructure. The critical resources for our calculations are:

1. The Vector Register File (`VGPRs`)

Each SIMD lane has a 512-entry register file, and that file is divided among all waves resident on the SIMD. A wave's allocation from this pool covers both its regular VGPRs and its accumulator registers.

The Rule: A wavefront can use up to 512 total VGPRs, with a flexible split (up to 256 of each type).
~~Old Architecture: Separate physical ACC file (MI100)~~ $\rightarrow$ CDNA4: Unified budget (the unified file first appeared with MI200).

In most MXFP8 GEMM kernels, the compiler allocates regular VGPRs first; it draws on the accumulator half of the budget only for the largest tile sizes.
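As a rough illustration of the VGPR limit, the sketch below divides the 512-entry per-lane budget by a wave's combined register count. The helper name `vgpr_limited_waves` is hypothetical, and the cap of 8 wave slots per SIMD is the maximum this guide uses. Register allocation granularity is ignored, so real hardware may round a kernel's usage up and admit fewer waves.

```python
VGPRS_PER_LANE = 512       # unified per-lane budget on each SIMD
MAX_VGPRS_PER_TYPE = 256   # ceiling for regular and for accumulator registers
MAX_WAVES_PER_SIMD = 8     # wavefront slots per SIMD


def vgpr_limited_waves(regular_vgprs, acc_vgprs=0):
    """Waves per SIMD permitted by VGPR usage alone (granularity ignored)."""
    for count in (regular_vgprs, acc_vgprs):
        if not 0 <= count <= MAX_VGPRS_PER_TYPE:
            raise ValueError("each register type must be in 0..256")

    total = regular_vgprs + acc_vgprs
    if total == 0:
        return MAX_WAVES_PER_SIMD
    # The whole pool is split among resident waves; cap at the slot count.
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // total)


if __name__ == "__main__":
    for regular, acc in [(64, 0), (128, 0), (96, 0), (256, 256)]:
        print(regular, acc, "->", vgpr_limited_waves(regular, acc))
```

A kernel at 64 VGPRs or fewer keeps all 8 slots available; doubling to 128 halves that to 4, and a wave using the full 512 leaves room for only one.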

2. Local Data Share (`LDS`)

The LDS is the primary cooperation mechanism for workgroups. On CDNA4, the LDS is $2.5\times$ the size of the 64 KB LDS in CDNA3, for a total of 160 KB.

The Critical Distinction:

VGPRs are per-SIMD.
LDS is per-CU.

This asymmetry is why the denominators in occupancy calculations differ depending on the resource.
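The per-CU nature of the LDS limit can be sketched as follows. LDS is allocated per workgroup, so the limit is counted first in workgroups per CU and only then converted to waves. The helper name `lds_limited_waves_per_cu` is hypothetical; the sketch ignores LDS allocation granularity and assumes a single workgroup may request up to the full 160 KB.

```python
LDS_BYTES_PER_CU = 160 * 1024   # shared by all four SIMDs in the CU
MAX_WAVES_PER_CU = 32           # 8 wavefront slots x 4 SIMDs


def lds_limited_waves_per_cu(lds_bytes_per_workgroup, waves_per_workgroup):
    """Waves per CU permitted by LDS usage alone (granularity ignored)."""
    if waves_per_workgroup < 1:
        raise ValueError("a workgroup has at least one wave")
    if not 0 <= lds_bytes_per_workgroup <= LDS_BYTES_PER_CU:
        raise ValueError("LDS request must be between 0 and 160 KB")

    # Only whole workgroups can be resident, so cap by wave slots first.
    workgroups = MAX_WAVES_PER_CU // waves_per_workgroup
    if lds_bytes_per_workgroup > 0:
        workgroups = min(workgroups, LDS_BYTES_PER_CU // lds_bytes_per_workgroup)
    return workgroups * waves_per_workgroup


if __name__ == "__main__":
    for kib in (16, 48, 64, 160):
        print(kib, "KB ->", lds_limited_waves_per_cu(kib * 1024, 4), "waves per CU")
```

To compare this result with a per-SIMD VGPR limit, put both on the same basis first, either by multiplying the VGPR limit by 4 or by dividing the LDS result by 4.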

Threads, Lanes, and Wavefronts

The hardware does not schedule individual threads. Instead, it uses wavefronts: bundles of 64 threads executing in lockstep.

Wavefront Size: 64 threads (the counterpart of an NVIDIA warp, which is 32 threads wide).
Workgroup Example: A 256-thread workgroup = 4 wavefronts.
Resident Waves: Waves whose registers are reserved on the hardware.
Scheduling: The scheduler switches between resident waves to hide latency when one stalls.

Occupancy Definition: The ratio of resident waves to the maximum possible (8 waves per SIMD, or 32 per CU).
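The bookkeeping in these definitions can be written down directly. This is a small sketch with hypothetical helper names (`waves_per_workgroup`, `occupancy`); it rounds a partial wavefront up to a whole one, since a partly filled wave still takes a full slot.

```python
import math

WAVEFRONT_SIZE = 64
MAX_WAVES_PER_SIMD = 8
SIMDS_PER_CU = 4
MAX_WAVES_PER_CU = MAX_WAVES_PER_SIMD * SIMDS_PER_CU  # 32


def waves_per_workgroup(threads_per_workgroup):
    """Number of wavefronts needed to hold a workgroup's threads."""
    if threads_per_workgroup < 1:
        raise ValueError("a workgroup has at least one thread")
    return math.ceil(threads_per_workgroup / WAVEFRONT_SIZE)


def occupancy(resident_waves_per_cu):
    """Occupancy as a fraction of the CU's 32 wavefront slots."""
    if not 0 <= resident_waves_per_cu <= MAX_WAVES_PER_CU:
        raise ValueError("resident waves must be in 0..32")
    return resident_waves_per_cu / MAX_WAVES_PER_CU


if __name__ == "__main__":
    print(waves_per_workgroup(256))   # 4 wavefronts
    print(waves_per_workgroup(100))   # 2 wavefronts, the second partly filled
    print(occupancy(8))               # 0.25
```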

![Architecture Diagram Placeholder: A visual representation of a SIMD unit showing 64 lanes and the scheduler switching between 8 possible wavefront slots]

The "Rosetta Stone" for CUDA Developers

If you are coming from an NVIDIA background, the concepts map almost perfectly:

| AMD Term | NVIDIA Equivalent | Note |
| --- | --- | --- |
| Wavefront | Warp | AMD is 64-wide; NVIDIA is 32-wide |
| CU (Compute Unit) | SM (Streaming Multiprocessor) | The primary scheduling unit |
| LDS | Shared Memory | Per-SM/CU scratchpad |
| VGPR | Register | Per-thread vector registers |

To calculate the actual ceiling, one might use a logic flow similar to this:

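def calculate_occupancy(vgpr_used, sgpr_used, lds_used, waves_per_workgroup):
    """Estimate occupancy (0.0 to 1.0) for one CU from the VGPR and LDS limits."""
    # Hardware limits for MI355X, as given in this guide
    VGPRS_PER_LANE = 512          # per SIMD, divided among resident waves
    MAX_LDS_PER_CU = 160 * 1024   # bytes, shared by all four SIMDs
    SIMDS_PER_CU = 4
    MAX_WAVES_PER_CU = 8 * SIMDS_PER_CU

    # Calculate limiters, both expressed in waves per CU.
    # vgpr_used: VGPRs per wave; lds_used: LDS bytes per workgroup.
    vgpr_limit = (VGPRS_PER_LANE // max(vgpr_used, 1)) * SIMDS_PER_CU
    lds_limit = (MAX_LDS_PER_CU // lds_used) * waves_per_workgroup if lds_used else MAX_WAVES_PER_CU

    # sgpr_used is accepted but not modeled: the SGPR and workgroup-slot
    # budgets are not specified in this section.
    return min(vgpr_limit, lds_limit, MAX_WAVES_PER_CU) / MAX_WAVES_PER_CU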
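A hand-worked instance of this kind of calculation may help. The kernel is hypothetical and its numbers are chosen for illustration, not measured: 128 VGPRs per wave, 64 KB of LDS per workgroup, and 256-thread workgroups (4 waves each). Allocation granularity and the SGPR and workgroup-slot limits are ignored, and both limits are expressed in waves per CU so they can be compared.

```latex
\begin{align*}
W_{\text{VGPR}} &= \left\lfloor \frac{512}{128} \right\rfloor \times 4\ \text{SIMDs} = 16\ \text{waves per CU} \\
W_{\text{LDS}}  &= \left\lfloor \frac{160\ \text{KB}}{64\ \text{KB}} \right\rfloor \times 4\ \text{waves per workgroup} = 8\ \text{waves per CU} \\
\text{Occupancy} &= \frac{\min(16,\ 8,\ 32)}{32} = 25\%
\end{align*}
```

In this example LDS, not registers, is the bottleneck. Trimming VGPR usage would not raise occupancy, whereas cutting the LDS footprint to 53 KB or less would admit a third workgroup.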