Zigzag Decoding with AVX-512
Zigzag Decoding with AVX-512
Originally published on zeux.io by Arseny Kapoulkine
Recently, while optimizing vertex decoding within the meshoptimizer library using AVX-512, I discovered two specific optimizations. Although I didn't implement them in the final version, they are interesting enough to document.
The optimizations that were actually integrated require a deeper architectural explanation that I'll save for another time. However, the two methods discussed here leverage AVX-512 features that were new to me, both focusing on a single problem: the efficient decoding of zigzag encoded integers.
Understanding Zigzag Encoding
When dealing with large sets of signed integers—particularly those generated via delta encoding (the difference between consecutive values)—we often find that the magnitudes are small, but the signs fluctuate unpredictably.
If we used standard two's complement, small negative numbers would have many leading ones, making them inefficient for variable-bitrate encoding. To solve this, we use Zigzag Encoding, which ensures that the number of bits required scales with the absolute magnitude of the value.
The Transformation Logic
The encoding maps signed integers to unsigned integers using the following rules:
| Input Value () | Transformation Formula | Result Property |
|---|---|---|
| Positive () | Always Even | |
| Negative () | Always Odd |
Essentially, the sign bit is migrated to the Least Significant Bit (LSB).
Decoding the Value
To reverse this process, we can use a conditional approach:
(v & 1) == 0 ? (v >> 1) : ~(v >> 1)
However, to avoid performance-killing branches, a branchless formulation is preferred:
(v >> 1) ^ -(v & 1)
How it works:
-(v & 1)produces a mask of all zeros if is even (positive).- It produces a mask of all ones ( in two's complement) if is odd (negative).
- XORing the shifted value with this mask preserves the magnitude for positive numbers and inverts it for negative numbers.
This allows the resulting small positive value to be stored using variable-width schemes such as VByte (LEB128) or other bit-wise encoding methods.
Implementing SIMD Decoding
The branchless logic translates seamlessly into SIMD. If you have a vector of unsigned values v, you can decode them in parallel.
SSE2 Implementation
Using 32-bit integers in SSE2, the implementation looks like this:
__m128i xs = _mm_and_si128(v, _mm_set1_epi32(1));
__m128i xl = _mm_sub_epi32(_mm_setzero_si128(), xs);
__m128i xr = _mm_srli_epi32(v, 1);
return _mm_xor_si128(xl, xr);
Assembly Breakdown
The compiler translates the above into the following sequence:
vpsrld xmm1, xmm0, 1 ; Shift right by 1
vpandq xmm0, xmm0, xmm2 ; AND with 1 (xmm2 preloaded)
vpsubd xmm0, xmm3, xmm0 ; Subtract from 0 (xmm3 preloaded)
vpxorq xmm0, xmm0, xmm1 ; XOR the results
Performance Note: On Zen 4 architecture, each of these instructions has a latency of 1 cycle. Since vpandq does not depend on vpsrld, the total cumulative latency is cycles.
Exploring AVX-512 Masks
The transformation is fundamentally about bit manipulation. While one might hope for a dedicated hardware bit-permutation engine, we must work with the AVX-512 instruction set.
I initially thought , but it doesn't provide a clear advantage here.vpternlogd (the 3-input truth table instruction) would be the answer
Instead, we can leverage AVX-512 predication (execution masks) to implement the branching version of the logic directly in hardware.
The Predicated Approach
Instead of using arithmetic to create a mask, we can use the following logic:
- Compute
sign = v & 1 - Create
mask = (sign != 0) -
res = v >> 1(Executed for all lanes) -
res = ~res(Executed only for lanes wheremaskis true)
In pseudo-code:
sign = v & 1;
mask = sign != 0;
res = v >> 1;
res = ~res {mask}; // Predicated inversion
Naively, this still appears to be four instructions: calculating the sign, comparing it to zero, shifting, and conditionally inverting. AVX-512 does not ha...