Zigzag Decoding with AVX-512

Originally published on zeux.io by Arseny Kapoulkine

Recently, while optimizing vertex decoding within the meshoptimizer library using AVX-512, I discovered two specific optimizations. Although I didn't implement them in the final version, they are interesting enough to document.

The optimizations that were actually integrated require a deeper architectural explanation that I'll save for another time. However, the two methods discussed here leverage AVX-512 features that were new to me, both focusing on a single problem: the efficient decoding of zigzag encoded integers.

Understanding Zigzag Encoding

When dealing with large sets of signed integers—particularly those generated via delta encoding (the difference between consecutive values)—we often find that the magnitudes are small, but the signs fluctuate unpredictably.

If we used standard two's complement, small negative numbers would have many leading ones, making them inefficient for variable-bitrate encoding. To solve this, we use Zigzag Encoding, which ensures that the number of bits required scales with the absolute magnitude of the value.

The Transformation Logic

The encoding maps signed integers to unsigned integers using the following rules:

Input Value ( $v$ )	Transformation Formula	Result Property
Positive ( $v \ge 0$ )	$2 \times v$	Always Even
Negative ( $v < 0$ )	$2 \times (\sim v) + 1$	Always Odd

Essentially, the sign bit is migrated to the Least Significant Bit (LSB).

Decoding the Value

To reverse this process, we can use a conditional approach: (v & 1) == 0 ? (v >> 1) : ~(v >> 1)

However, to avoid performance-killing branches, a branchless formulation is preferred:

(v >> 1) ^ -(v & 1)

How it works:

-(v & 1) produces a mask of all zeros if $v$ is even (positive).
It produces a mask of all ones ( $-1$ in two's complement) if $v$ is odd (negative).
XORing the shifted value with this mask preserves the magnitude for positive numbers and inverts it for negative numbers.

This allows the resulting small positive value to be stored using variable-width schemes such as VByte (LEB128) or other bit-wise encoding methods.

Implementing SIMD Decoding

The branchless logic translates seamlessly into SIMD. If you have a vector of unsigned values v, you can decode them in parallel.

SSE2 Implementation

Using 32-bit integers in SSE2, the implementation looks like this:

__m128i xs = _mm_and_si128(v, _mm_set1_epi32(1)); 
__m128i xl = _mm_sub_epi32(_mm_setzero_si128(), xs); 
__m128i xr = _mm_srli_epi32(v, 1); 
return _mm_xor_si128(xl, xr);

Assembly Breakdown

The compiler translates the above into the following sequence:

vpsrld xmm1, xmm0, 1    ; Shift right by 1
vpandq xmm0, xmm0, xmm2 ; AND with 1 (xmm2 preloaded)
vpsubd xmm0, xmm3, xmm0 ; Subtract from 0 (xmm3 preloaded)
vpxorq xmm0, xmm0, xmm1 ; XOR the results

Performance Note: On Zen 4 architecture, each of these instructions has a latency of 1 cycle. Since vpandq does not depend on vpsrld, the total cumulative latency is $\approx 3$ cycles.

Exploring AVX-512 Masks

The transformation is fundamentally about bit manipulation. While one might hope for a dedicated hardware bit-permutation engine, we must work with the AVX-512 instruction set.

~~I initially thought vpternlogd (the 3-input truth table instruction) would be the answer~~, but it doesn't provide a clear advantage here.

Instead, we can leverage AVX-512 predication (execution masks) to implement the branching version of the logic directly in hardware.

The Predicated Approach

Instead of using arithmetic to create a mask, we can use the following logic:

Compute sign = v & 1
Create mask = (sign != 0)
res = v >> 1 (Executed for all lanes)
res = ~res (Executed only for lanes where mask is true)

In pseudo-code:

sign = v & 1;
mask = sign != 0;
res = v >> 1; 
res = ~res {mask}; // Predicated inversion

Naively, this still appears to be four instructions: calculating the sign, comparing it to zero, shifting, and conditionally inverting. AVX-512 does not ha...