Tuesday, 8 April 2025

UE5: TArray<>.Reserve(N) vs .SetNumUninitialized(N)

TArray<>.Reserve(N)

A safe way to pre-allocate memory capacity for an array to avoid reallocations during dynamic growth.

+ rather safe
+ pre-allocating prevents runtime heap-allocations
- still pays the per-element ".Add()" overhead, so slower than ".SetNumUninitialized()"

Correct example:

TArray<FVector> MyVectors;
MyVectors.Reserve(32);
const int32 LoopLength {18};

for (int32 i = 0; i < LoopLength; i++)
MyVectors.Add(FVector(42.0, 42.0, 42.0));

"LoopLength" can be anything between 0 and 32 in this case. If the loop exceeds the reserved capacity (32 in this case), subsequent ".Add()" calls will still work but will trigger memory reallocations, reducing the performance benefit gained from the initial reservation. So try to enter a correct number if possible.

TArray<>.SetNumUninitialized(N)

A less safe but very fast way to allocate an array of N elements without initializing them; you are responsible for initializing every element manually.

+ fast (removes ".Add()" overhead)
+ pre-allocating prevents runtime heap-allocations
- unsafe if not all elements are correctly initialized (direct access using [index])

Correct example:

TArray<FVector> MyVectors;
MyVectors.SetNumUninitialized(32);
const int32 LoopLength {32};

for (int32 i = 0; i < LoopLength; i++)
MyVectors[i] = FVector(42.0, 42.0, 42.0);

Crucially, you must ensure your code explicitly initializes every element index "i" (from 0 up to N-1) that you intend to read from later. If your code attempts to access an index "i" where "i" >= N (the size set by SetNumUninitialized), it will result in an out-of-bounds access, likely causing a crash. As a best practice: simply match "SetNumUninitialized()" with "LoopLength" to prevent any issues.

Thursday, 3 April 2025

UE5: SSE Instructions

1. Basic examples sheet

To get this out of the way, here are the most useful (imo) instructions.

// Multiple variable-types
    double MySingleValue {0.0};
    double MyDoubleArray[2] {0.0, 1.1};
    FVector MyVector {0.0, 1.1, 2.2};
    alignas(16) FVector MyAlignedVector;

// Loading instructions for two doubles

// Set two doubles from 1 value
__m128d RegisterSSE_1 = _mm_set1_pd(MySingleValue);
// Set two doubles directly (note: arguments are given in high, low order)
RegisterSSE_1 = _mm_set_pd(1.1, 0.0);
// Load array (pointer to first element)
RegisterSSE_1 = _mm_loadu_pd(MyDoubleArray);
// FVector is stored as an array, point to first element
RegisterSSE_1 = _mm_loadu_pd(&MyVector.X);

// Set zero array
__m128d RegisterSSE_2 = _mm_setzero_pd();

// Operation examples

// Addition
__m128d RegisterSSE_3 = _mm_add_pd(RegisterSSE_1, RegisterSSE_2);
// Subtraction
RegisterSSE_3 = _mm_sub_pd(RegisterSSE_1, RegisterSSE_2);
// Multiplication
RegisterSSE_3 = _mm_mul_pd(RegisterSSE_1, RegisterSSE_2);
// Division
RegisterSSE_3 = _mm_div_pd(RegisterSSE_1, RegisterSSE_2);

// Using the result

// Array[2] to store the result
double ResultArray[2];
alignas(16) double AlignedResultArray[2];

// Move register values to ResultArray
_mm_storeu_pd(ResultArray, RegisterSSE_3);
_mm_store_pd(AlignedResultArray, RegisterSSE_3);

// Only X and Y, because we used two doubles (64 bit each) in a 128-bit register.
FVector ResultVector = FVector(ResultArray[0], ResultArray[1], 0.0);

There are other load and set operations, as well as further arithmetic intrinsics. These examples also only cover doubles; floats work the same way with the corresponding "_ps" intrinsics. Anyway, this should be enough to get one started.

2. Performance tips

Initialization.

If you use SIMD instructions, in this case SSE, it is important to consider the overhead of initializing SSE registers. For example, if you just add two FVector2D using SSE, it could be just as fast as adding X and Y directly. This is due to the overhead involved in loading data into, operating on, and storing results from the special SIMD registers, which might exceed the cost of simple scalar operations for trivial tasks.

Set vs. load

"_mm_load_pd" (or the safer "_mm_loadu_pd") copies adjacent double values directly from a memory location, like "FVector" components, into an SSE register. In contrast, "_mm_set_pd" builds the SSE register from two separate double values provided as arguments, useful when combining data from different variables or constants. Neither intrinsic is universally faster; "load"'s speed depends heavily on memory cache performance (fast if data is cached, slow if not), while set's speed is generally more consistent. Choose load for contiguous data already in memory and set for constructing vectors from separate pieces.
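One pitfall worth a sketch: "_mm_set_pd" takes its arguments in high, low order, so to reproduce what "_mm_loadu_pd" reads from an array you have to pass the elements reversed (plain C++, compiles without Unreal):

```cpp
#include <immintrin.h>

// Returns true if both registers hold the same two lanes.
bool SameLanes(__m128d L, __m128d R)
{
    double A[2], B[2];
    _mm_storeu_pd(A, L);
    _mm_storeu_pd(B, R);
    return A[0] == B[0] && A[1] == B[1];
}

bool SetMatchesLoad()
{
    const double Data[2] {0.0, 1.1};
    const __m128d Loaded = _mm_loadu_pd(Data);    // lane 0 = 0.0, lane 1 = 1.1
    const __m128d Set    = _mm_set_pd(1.1, 0.0);  // arguments: high lane, low lane
    return SameLanes(Loaded, Set);
}
```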

loadu vs load

As you may have seen, I used "_mm_loadu_pd()" in my example. But there is also another option "_mm_load_pd()". So what's the difference?

_mm_load_pd(): Requires 16-byte-aligned memory (faster, but crashes or misbehaves on unaligned addresses).
_mm_loadu_pd(): Works with unaligned memory (minimally slower on modern CPUs, but much safer).

I wasn't yet able to fully make sure that FVector is aligned correctly by default, so using "loadu" is a good choice unless the FVector is aligned before usage:

alignas(16) FVector ResultVec0 {};

This will align the memory correctly for a 128bit register (16 bytes = 128 bits).
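As a small standalone sketch (plain C++), here is aligned storage paired with the aligned load/store variants:

```cpp
#include <immintrin.h>

// With alignas(16), the address is guaranteed to be 16-byte aligned,
// so the faster aligned load/store variants are safe to use.
double SumAligned()
{
    alignas(16) double Data[2] {2.2, 3.3};
    const __m128d Reg = _mm_load_pd(Data);    // aligned load: needs 16-byte alignment
    alignas(16) double Out[2];
    _mm_store_pd(Out, _mm_add_pd(Reg, Reg));  // aligned store of Data + Data
    return Out[0] + Out[1];                   // (2.2 + 2.2) + (3.3 + 3.3)
}
```

Remove the "alignas(16)" and "_mm_load_pd" may crash, while "_mm_loadu_pd" would keep working; that trade-off is the whole point of the two variants.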

Best use-cases

When using arrays.

When re-using the 128bit registers.

Sometimes simply replacing plain FVector arithmetic (+, -, *, /) with SIMD could be beneficial, though in all my testing the difference never rose above margin-of-error territory.

3. Other tips

It is always best to test your performance. Sometimes it may seem that it is faster, even though it's not. For example, I have tried using SSE for Line-Traces (calculating the end point), yet it wasn't really faster than just adding and multiplying FVectors.

A good example on how to test your result:

/*  Quick sleep trying to force a context switch before the test
to prevent (if possible) context switching during the test. */
FPlatformProcess::Sleep(0.f);

// Start Benchmark
auto BenchmarkStart = std::chrono::high_resolution_clock::now();

// DO YOUR TEST HERE.

// Get duration
const auto BenchmarkDuration = std::chrono::duration_cast<std::chrono::nanoseconds>(
    std::chrono::high_resolution_clock::now() - BenchmarkStart);
// Get nanoseconds
const int64 BenchmarkResult = static_cast<int64>(BenchmarkDuration.count());

// Print your result
GEngine->AddOnScreenDebugMessage(
    -1,
    25.f,
    FColor::White,
    FString("SSE took " + FString::FromInt(BenchmarkResult)));

Also, check whether the results are actually correct (in case some memory alignment didn't work as planned). Hope this helps to get you started!

UE5: Enable SIMD

1. Build Configuration

In order to enable SIMD instructions beyond SSE in Unreal Engine, the following entry must first be added to the "[...].Build.cs":

MinCpuArchX64 = MinimumCpuArchitectureX64.AVX512;

You can choose between various options:

- AVX
- AVX2
- AVX512

Important: When enabling this in a plugin, for example, the main project's setting will always override the plugin's settings. The best approach is to enable everything you want to support in the plugin, and then set what the project should support for the current build.

2. Preprocessor directives/Macros

With the above done, let's go for some compiler specific macros. The following is dummy code, but should get across what we want to do.

#if defined(__AVX2__)
// Do AVX2 logic
#elif defined(__AVX__)
// Do AVX logic
#elif defined(__SSE4_1__)
// Do SSE logic
#else
// Non-vector logic
#endif

The above are macros defined by the compiler. Note that MSVC defines "__AVX__" and "__AVX2__", but (unlike GCC and Clang) it does not define "__SSE4_1__", so that branch may need a different check on MSVC.
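Here is a self-contained sketch of this kind of compile-time dispatch (plain C++; the function names and branch choices are illustrative, and on a machine without AVX it simply falls through to a narrower path):

```cpp
// Returns the sum of four doubles, choosing an implementation at compile time
// based on which instruction-set macros the compiler defines.
#if defined(__AVX__)
#include <immintrin.h>
double SumFour(const double* V)
{
    const __m256d Reg = _mm256_loadu_pd(V);    // load all four lanes at once
    alignas(32) double Out[4];
    _mm256_storeu_pd(Out, Reg);
    return Out[0] + Out[1] + Out[2] + Out[3];
}
#elif defined(__SSE2__) || defined(_M_X64)     // every x86-64 CPU has SSE2
#include <emmintrin.h>
double SumFour(const double* V)
{
    double Out[2];
    _mm_storeu_pd(Out, _mm_add_pd(_mm_loadu_pd(V), _mm_loadu_pd(V + 2)));
    return Out[0] + Out[1];
}
#else
double SumFour(const double* V)
{
    return V[0] + V[1] + V[2] + V[3];          // scalar fallback
}
#endif
```

Whichever branch the compiler selects, callers see the same "SumFour" signature, which is exactly what you want for this pattern.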

Now Unreal has its own macros for that, but I have found them to be rather confusing. Additionally, you'll always get a warning (at least in Rider) about a "macro redefinition". I wasn't yet able to fix that and there's no real documentation there. In short: instead of the default compiler macros, you could also use:

#if PLATFORM_ALWAYS_HAS_AVX_2
// AVX2
#elif PLATFORM_ALWAYS_HAS_AVX
// AVX
#elif PLATFORM_ALWAYS_HAS_SSE4_2
// SSE4.2
#else
// Non-vector logic
#endif

Yet I have run into several issues; for example, after switching the "MinimumCpuArchitectureX64" value, the code sometimes wouldn't compile with certain macros. Hence, for now, I am using the default compiler ones.

Tip: When using AVX FMA instructions, you could also check for just that:

#if defined(__FMA3__)
// FMA3 default compiler
#elif PLATFORM_ALWAYS_HAS_FMA3
// FMA Unreal
#else
// Non-FMA logic
#endif

Note: I am aware that you can choose between "PLATFORM_ALWAYS_HAS_[...]" and "PLATFORM_MAYBE_HAS_[...]". But I have yet to figure out the actual difference.

3. Compiling the code

Let's say, for example, your function has AVX512, AVX2 and SSE4.1 implementations. If the project is set to support AVX512, it will correctly compile using the AVX512 implementation. If instead you set it to AVX2, it will compile for that.

In theory, if you try to compile on a system that does not support AVX512, it will only compile for AVX2 by default, despite the "MinimumCpuArchitectureX64" setting.

Let's say you have set everything up correctly and now want to build your project for different instruction sets, you could either:

1. Compile with the correct "MinimumCpuArchitectureX64" setting.
2. Compile on the system that you want to support directly (using the default compiler macros this should work fine).

A cool note: you could also use macros like "PLATFORM_64BITS" or "LINUX_ARM64" to support multiple SIMD instruction sets depending on the platform. For example, first check the CPU architecture, then implement both AVX instructions for x86-64 and NEON instructions for ARM64.
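A minimal sketch of such per-architecture dispatch, using the common compiler macros for x86-64 and ARM64 (plain C++; actual NEON/AVX bodies are omitted so it compiles everywhere, and the string labels are just illustrative):

```cpp
#include <cstring>

// Reports which SIMD path would be selected: first by CPU architecture,
// then (on x86-64) by the enabled instruction set.
const char* SimdPath()
{
#if defined(__aarch64__) || defined(_M_ARM64)
    return "NEON";          // ARM64: NEON is always available
#elif defined(__x86_64__) || defined(_M_X64)
    #if defined(__AVX2__)
        return "AVX2";
    #elif defined(__AVX__)
        return "AVX";
    #else
        return "SSE2";      // baseline for every x86-64 CPU
    #endif
#else
    return "scalar";        // unknown architecture: no vector path
#endif
}
```

Each branch would then wrap the corresponding intrinsics behind one shared function signature.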

4. Issues/anomalies

In Rider you may see that, when using these macros, your actual implementation is greyed out in the IDE. Let's say you have enabled AVX512 and it compiles for that, yet it is greyed out: this seems to be normal behavior for now. If you want to verify what has actually been compiled, look at step 5.

Also, when enabling "MinCpuArchX64", you'll get the following warnings when compiling (UE 5.5):

11>command line: Warning C5106 : macro redefined with different parameter names
11>WindowsPlatform.h(77): Reference C5106 : see previous definition of 'PLATFORM_ENABLE_VECTORINTRINSICS'
11>command line: Warning C4005 : 'PLATFORM_MAYBE_HAS_AVX': macro redefinition
11>Platform.h(199): Reference C4005 : see previous definition of 'PLATFORM_MAYBE_HAS_AVX'
11>command line: Warning C4005 : 'PLATFORM_ALWAYS_HAS_AVX': macro redefinition
11>Platform.h(202): Reference C4005 : see previous definition of 'PLATFORM_ALWAYS_HAS_AVX'
11>command line: Warning C4005 : 'PLATFORM_ALWAYS_HAS_AVX_2': macro redefinition
11>Platform.h(205): Reference C4005 : see previous definition of 'PLATFORM_ALWAYS_HAS_AVX_2'

I wasn't yet able to identify the cause; it seems to be something within Unreal's own macro setup. Hence the tip to use the default compiler preprocessor directives.

5. Compiler Tips

If you want to make sure that the correct implementation is used, there are two key things you can use.

1. Use compiler messages.

#pragma message ("SIMD: SSE - ENABLED")

This way you'll see what is being used when compiling the code. Important: if you happen to see multiple messages during compilation despite having used "#if", "#elif" etc. correctly, that is normal; the last entry is the one that counts. This behavior stems from Unreal compiling more than once, which can especially occur when using plugins.

2. Use Unreal Engine on-screen messages.

If the compiler messages get too confusing or you want to be 100% sure, simply use the runtime on-screen messages Unreal Engine offers.

GEngine->AddOnScreenDebugMessage(
-1, 25.f, FColor::White, "SIMD: SSE - ENABLED");

Now when running the game, you can easily see which instruction set is being used. Nice, eh?
BUT: don't compare performance while printing something to the screen. FString allocates on the heap and will therefore make everything MUCH slower (not to mention the actual rendering of the message)!