1. Basic examples sheet

To get this out of the way, here are the most useful (imo) instructions.

// Multiple variable-types
    double MySingleValue {0.0};
    double MyDoubleArray[2] {0.0, 1.1};
    FVector MyVector {0.0, 1.1, 2.2};
    alignas(16) FVector MyAlignedVector;

// Loading instructions for two doubles

    // Set two doubles from 1 value
    __m128d RegisterSSE_1 = _mm_set1_pd(MySingleValue);
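    // Set two doubles directly. Arguments are upper lane first, so here
    // lower = 1.1 and upper = 0.0; _mm_setr_pd(0.0, 1.1) takes memory order.
    RegisterSSE_1 = _mm_set_pd(0.0, 1.1);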
    // Load array (pointer to first element)
    RegisterSSE_1 = _mm_loadu_pd(MyDoubleArray);
    // FVector is stored as an array, point to first element
    RegisterSSE_1 = _mm_loadu_pd(&MyVector.X);

    // Set both lanes to zero
    __m128d RegisterSSE_2 = _mm_setzero_pd();

// Operation examples

    // Addition
    __m128d RegisterSSE_3 = _mm_add_pd(RegisterSSE_1, RegisterSSE_2);
    // Subtraction
    RegisterSSE_3 = _mm_sub_pd(RegisterSSE_1, RegisterSSE_2);
    // Multiplication
    RegisterSSE_3 = _mm_mul_pd(RegisterSSE_1, RegisterSSE_2);
    // Division (RegisterSSE_2 is all zeros here, so this yields NaN/inf;
    // use a non-zero divisor in real code)
    RegisterSSE_3 = _mm_div_pd(RegisterSSE_1, RegisterSSE_2);

// Using the result

    // Array[2] to store the result
    double ResultArray[2];
    alignas(16) double AlignedResultArray[2];


    // Move register values to ResultArray
    _mm_storeu_pd(ResultArray, RegisterSSE_3);
    _mm_store_pd(AlignedResultArray, RegisterSSE_3);

    // Only X and Y, because two doubles (64-bit each) fill a 128-bit register.
    FVector ResultVector = FVector(ResultArray[0], ResultArray[1], 0.0);

There are other load and set intrinsics, as well as other operators. These examples also only cover doubles; floats work the same way with the "__m128" type and the "_ps" intrinsics (four floats per register). Anyways, this should be enough to get one started.
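
As a self-contained sketch of the load, operate, store pattern above (plain C++, not Unreal code): the struct "Vec3d" and the function "AddVec3dSse" are hypothetical stand-ins, and "Vec3d" only assumes the same layout of three contiguous doubles that the FVector example relies on. It also shows one possible way to handle the Z component, which does not fit into the first register.

```cpp
#include <emmintrin.h> // SSE2 intrinsics (__m128d, _mm_*_pd, _mm_*_sd)

// Hypothetical stand-in for FVector: three contiguous doubles.
struct Vec3d
{
    double X, Y, Z;
};

// Adds two vectors: X and Y share one 128-bit register, Z uses the lower lane only.
inline Vec3d AddVec3dSse(const Vec3d& A, const Vec3d& B)
{
    // Lower lane = X, upper lane = Y.
    const __m128d SumXY = _mm_add_pd(_mm_loadu_pd(&A.X), _mm_loadu_pd(&B.X));
    // _mm_load_sd reads one double into the lower lane and zeroes the upper lane.
    const __m128d SumZ = _mm_add_sd(_mm_load_sd(&A.Z), _mm_load_sd(&B.Z));

    Vec3d Result;
    _mm_storeu_pd(&Result.X, SumXY); // writes X and Y
    _mm_store_sd(&Result.Z, SumZ);   // writes only the lower lane into Z
    return Result;
}
```

Whether splitting Z off like this pays off is exactly the kind of thing that should be measured rather than assumed.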

2. Performance tips

Initialization.

If you use SIMD instructions, in this case SSE, it is important to consider the overhead of getting data into and out of the SSE registers. For example, if you just add two FVector2D using SSE, it may be no faster than adding X and Y directly. This is because loading the data into the SIMD registers, operating on it, and storing the result can cost as much as, or more than, the simple scalar operations for such a trivial task.
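
To make the overhead visible, here is a small sketch of both paths side by side. "Vec2d", "AddScalar2D" and "AddSse2D" are hypothetical names, with "Vec2d" standing in for a two-double vector such as FVector2D. Both functions return the same result; the SSE version just needs two loads and a store around its single addition.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical stand-in for a 2D vector made of two contiguous doubles.
struct Vec2d
{
    double X, Y;
};

// Scalar path: two independent additions.
inline Vec2d AddScalar2D(const Vec2d& A, const Vec2d& B)
{
    return Vec2d{A.X + B.X, A.Y + B.Y};
}

// SSE path: two loads, one packed addition, one store.
inline Vec2d AddSse2D(const Vec2d& A, const Vec2d& B)
{
    Vec2d Result;
    _mm_storeu_pd(&Result.X, _mm_add_pd(_mm_loadu_pd(&A.X), _mm_loadu_pd(&B.X)));
    return Result;
}
```

On x86-64 the scalar additions already run in SSE registers, and an optimizing compiler may vectorize them on its own, which is one plausible reason why the hand-written version often shows no measurable gain.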

Set vs. load

"_mm_load_pd" (or the safer "_mm_loadu_pd") copies two adjacent double values directly from a memory location, such as the X and Y components of an "FVector", into an SSE register. In contrast, "_mm_set_pd" builds the SSE register from two separate double values passed as arguments, which is useful when combining data from different variables or constants. Note that "_mm_set_pd" takes the upper lane first, while "_mm_setr_pd" takes the values in memory order. Neither approach is universally faster: the speed of a load depends heavily on the memory cache (fast if the data is cached, slow if not), while the speed of a set is generally more consistent. Choose load for contiguous data already in memory and set for constructing vectors from separate pieces.

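A short sketch of the three ways to fill a register with the same two values. The helper names ("StoreLanes", "FromArray", "FromPartsSet", "FromPartsSetr") are made up for this illustration; only the intrinsics themselves are real. The thing to watch is the argument order of "_mm_set_pd", which is upper lane first.

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Copies the register into Out[0] (lower lane) and Out[1] (upper lane).
inline void StoreLanes(__m128d Register, double Out[2])
{
    _mm_storeu_pd(Out, Register);
}

// Contiguous data: one load picks up both values in memory order.
inline __m128d FromArray(const double* Values)
{
    return _mm_loadu_pd(Values);
}

// Separate values: _mm_set_pd takes the UPPER lane first, then the lower lane.
inline __m128d FromPartsSet(double Lower, double Upper)
{
    return _mm_set_pd(Upper, Lower);
}

// _mm_setr_pd takes its arguments in memory order (lower lane first).
inline __m128d FromPartsSetr(double Lower, double Upper)
{
    return _mm_setr_pd(Lower, Upper);
}
```

All three helpers produce the same register for the same input, so mixing them up only goes wrong when the reversed order of "_mm_set_pd" is forgotten.
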
loadu vs load

As you may have seen, I used "_mm_loadu_pd()" in my example. But there is also another option "_mm_load_pd()". So what's the difference?

_mm_load_pd(): Requires a 16-byte aligned address (can be faster, but an unaligned address may raise a general-protection fault).
_mm_loadu_pd(): Accepts any address (minimally slower at worst, but much safer).

I have not yet been able to confirm that FVector is 16-byte aligned by default, so using "loadu" is a good choice unless the FVector is explicitly aligned where it is declared:

alignas(16) FVector ResultVec0 {};

This will align the memory correctly for a 128bit register (16 bytes = 128 bits).
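
If you want to use the aligned load anyway, one option is to check the address in debug builds. This is only a sketch: "IsAligned16" and "LoadAlignedChecked" are hypothetical helpers, and the check relies on the standard "assert", which disappears when NDEBUG is defined.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

// True if Address sits on a 16-byte boundary, which _mm_load_pd requires.
inline bool IsAligned16(const void* Address)
{
    return (reinterpret_cast<std::uintptr_t>(Address) % 16) == 0;
}

// Aligned load that verifies its precondition in debug builds.
inline __m128d LoadAlignedChecked(const double* Values)
{
    assert(IsAligned16(Values) && "_mm_load_pd needs a 16-byte aligned address");
    return _mm_load_pd(Values);
}
```

With an "alignas(16) double Values[4]" array, the first element passes the check while "Values + 1" (8 bytes further) does not, so that second address has to go through "_mm_loadu_pd".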

Best use-cases

When using arrays.

When re-using the 128bit registers.

Sometimes simply replacing an "FVector +, -, *, / FVector" expression with SIMD could be beneficial, though in all my testing the difference never got above margin of error.
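
The first two cases usually come together, so here is a sketch that combines them: a loop over an array that re-uses one register for every iteration. "ScaleArray" is a hypothetical function name, and the plain "double*" is an assumption; in engine code it would more likely be the data pointer of a TArray.

```cpp
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics

// Multiplies every element of Data by Factor, two doubles per iteration.
inline void ScaleArray(double* Data, std::size_t Count, double Factor)
{
    // Built once and re-used by every iteration.
    const __m128d FactorRegister = _mm_set1_pd(Factor);

    std::size_t Index = 0;
    for (; Index + 2 <= Count; Index += 2)
    {
        const __m128d Values = _mm_loadu_pd(Data + Index);
        _mm_storeu_pd(Data + Index, _mm_mul_pd(Values, FactorRegister));
    }

    // Scalar tail for an odd element count.
    for (; Index < Count; ++Index)
    {
        Data[Index] *= Factor;
    }
}
```

The set-up cost of the factor register is paid once, while the loads and stores are spread over the whole array, which is where SSE has the best chance to beat the scalar loop.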

3. Other tips

It is always best to test your performance. Sometimes it may seem that it is faster, even though it's not. For example, I have tried using SSE for Line-Traces (calculating the end point), yet it wasn't really faster than just adding and multiplying FVectors.

A good example of how to test your result:

// Requires #include <chrono> at the top of the file.

/*  Quick sleep trying to force a context switch before the test
    to prevent (if possible) context switching during the test. */
FPlatformProcess::Sleep(0.f);

// Start benchmark
const auto BenchmarkStart = std::chrono::high_resolution_clock::now();

    // DO YOUR TEST HERE.

// Get duration
const auto BenchmarkDuration = std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::high_resolution_clock::now() - BenchmarkStart);
// Get nanoseconds
const int64 BenchmarkResult = static_cast<int64>(BenchmarkDuration.count());

// Print your result (GEngine can be null, e.g. early during startup).
// FString::FromInt takes an int32, so the int64 is formatted with Printf instead.
if (GEngine)
{
    GEngine->AddOnScreenDebugMessage(
        -1,
        25.f,
        FColor::White,
        FString::Printf(TEXT("SSE took %lld ns"), BenchmarkResult));
}

Also, check that the results are actually correct (in case some memory alignment didn't work as planned). Hope this helps to get you started!
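
Outside the engine, both ideas (timing and checking the result) can be sketched in standard C++. "MeasureBestNanoseconds" and "NearlyEqual" are hypothetical helper names, not Unreal or standard library functions. The sketch assumes that repeating the test and keeping the fastest run reduces noise from context switches, and it swaps in std::chrono::steady_clock, which is guaranteed to be monotonic, for high_resolution_clock.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <limits>

// Runs Work() Runs times and returns the fastest run in nanoseconds.
template <typename WorkType>
std::int64_t MeasureBestNanoseconds(WorkType&& Work, int Runs)
{
    assert(Runs > 0 && "at least one run is needed");
    using Clock = std::chrono::steady_clock;

    std::int64_t Best = std::numeric_limits<std::int64_t>::max();
    for (int Run = 0; Run < Runs; ++Run)
    {
        const Clock::time_point Start = Clock::now();
        Work();
        const auto Elapsed =
            std::chrono::duration_cast<std::chrono::nanoseconds>(Clock::now() - Start);
        Best = std::min<std::int64_t>(Best, Elapsed.count());
    }
    return Best;
}

// Compares an SSE result against a scalar reference value within a tolerance.
inline bool NearlyEqual(double A, double B, double Tolerance = 1e-9)
{
    return std::fabs(A - B) <= Tolerance;
}
```

A call could look like "MeasureBestNanoseconds([&] { /* your test */ }, 100)" for the SSE and the scalar version each, followed by a "NearlyEqual" check per component. Keep in mind that the compiler may optimize away work whose result is never used, so the test should write its result somewhere that is read afterwards.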

4. Exceptions

After a lot of testing I have found some weird exceptions:

If you, for example, re-use an array very often, the advantage of the SIMD instructions may disappear, so that they are only as fast as the default (scalar) instructions. This seems to be due to a "warm-up" of the data, which is by then located in the CPU's cache. Long story short: when data is used frequently and locally, SIMD instructions may be no faster, or even slower. For short usages, SIMD may still prove faster.

Daves Developer Blog

Thursday, 3 April 2025

UE5: SSE Instructions
