2023-05-09 07:25 AM
I have the following function to sum 32 16-bit signed integers, as optimally as I think is possible:
int16_t sum32elements2(int16_t vals[])
{
    /* Load pairs of 16-bit values as 32-bit words and add both lanes in
       parallel with SADD16. The casts assume vals is 4-byte aligned. */
    uint32_t s1 = __SADD16(*(uint32_t*)&vals[0],  *(uint32_t*)&vals[2]);
    uint32_t s2 = __SADD16(*(uint32_t*)&vals[4],  *(uint32_t*)&vals[6]);
    uint32_t s3 = __SADD16(*(uint32_t*)&vals[8],  *(uint32_t*)&vals[10]);
    uint32_t s4 = __SADD16(*(uint32_t*)&vals[12], *(uint32_t*)&vals[14]);
    uint32_t s5 = __SADD16(*(uint32_t*)&vals[16], *(uint32_t*)&vals[18]);
    uint32_t s6 = __SADD16(*(uint32_t*)&vals[20], *(uint32_t*)&vals[22]);
    uint32_t s7 = __SADD16(*(uint32_t*)&vals[24], *(uint32_t*)&vals[26]);
    uint32_t s8 = __SADD16(*(uint32_t*)&vals[28], *(uint32_t*)&vals[30]);
    /* Reduction tree: 8 -> 4 -> 2 -> 1 packed partial sums. */
    s1 = __SADD16(s1, s2);
    s2 = __SADD16(s3, s4);
    s3 = __SADD16(s5, s6);
    s4 = __SADD16(s7, s8);
    s1 = __SADD16(s1, s2);
    s2 = __SADD16(s3, s4);
    s1 = __SADD16(s1, s2);
    /* Final horizontal add of the two 16-bit lanes. */
    typedef union { uint32_t u32; int16_t s16[2]; } S;
    S s = { s1 };
    int16_t result = s.s16[0] + s.s16[1];
    return result;
}
In Godbolt this produces the following 27 lines of assembly at -O3, which makes perfect sense. It is almost a one-to-one translation (and if I do say so myself, that looks like some nice tight code):
sum32elements2(short*):
push {r4, r5, r6, r7, lr}
ldrd r3, r2, [r0]
sadd16 r3, r3, r2
ldrd r7, r2, [r0, #8]
sadd16 r7, r7, r2
ldrd r2, r1, [r0, #16]
sadd16 ip, r2, r1
ldrd r6, r2, [r0, #24]
sadd16 r6, r6, r2
ldrd r2, r1, [r0, #32]
sadd16 r2, r2, r1
ldrd r5, r1, [r0, #40]
sadd16 r5, r5, r1
ldrd r1, r4, [r0, #48]
sadd16 r1, r1, r4
ldrd r4, r0, [r0, #56]
sadd16 lr, r4, r0
sadd16 r0, r3, r7
sadd16 ip, ip, r6
sadd16 r3, r2, r5
sadd16 r1, r1, lr
sadd16 r0, r0, ip
sadd16 r3, r3, r1
sadd16 r0, r0, r3
add r0, r0, r0, asr #16
sxth r0, r0
pop {r4, r5, r6, r7, pc}
Why does this code take exactly 67 cycles to run when executed from CCM SRAM?
CCM SRAM should be zero wait state and run as fast as possible.
By my calculations, even at the very worst latency, there are 8 ldrd instructions at supposedly 3 cycles each (per https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions), 15 sadd16 at 1 cycle each, an add at 1 cycle, and an sxth at 1 cycle. I am inlining the function, so the call and return cycles are not relevant. For precise, repeatable timing I am using the DWT cycle counter (DWT->CYCCNT) with the following helper functions:
__STATIC_FORCEINLINE void PT_Enable() { DWT->CTRL |= 1; }                           // enable the DWT->CYCCNT counter
__STATIC_FORCEINLINE uint32_t PT_Start() { return DWT->CYCCNT; }                    // sample the counter
__STATIC_FORCEINLINE uint32_t PT_Elapsed(uint32_t pt) { return DWT->CYCCNT - pt; }  // cycles elapsed since pt
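As an aside on setup: on the Cortex-M4, CYCCNT does not advance unless the trace unit is enabled first via the TRCENA bit in the DEMCR (a debugger session often leaves TRCENA set already, which may be why PT_Enable alone appears to work). A sketch of a fuller init, assuming the standard CMSIS core register names:

```c
/* Sketch (CMSIS names assumed): CYCCNT only counts once TRCENA is set. */
__STATIC_FORCEINLINE void PT_Init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable DWT/ITM unit */
    DWT->CYCCNT = 0;                                 /* start from zero     */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* enable CYCCNT       */
}
```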
This should total:
(8*3) + 15 + 1 + 1 = 41, or thereabouts, but I measure 67.
Why is that? Is there a way in STM32CubeIDE to test memory latency and pipeline occupancy for the STM32G474RE?
I can't seem to reach the theoretical performance.
For reference, running from FLASH with zero wait states takes exactly 75 cycles versus exactly 67 from CCM SRAM. Running the code from plain SRAM is slower than either option.
Data is in SRAM for the sake of discussion, but that should be irrelevant. Placing code in CCM SRAM should allow the instruction and data fetches to proceed in parallel, which seems to be where the main benefit of the code-in-CCM / data-in-SRAM split comes from.
What could I do to achieve theoretical performance?
NOTE: for context, the code is used like so (s16_sum32elements is the in-project name of sum32elements2 above):
uint32_t sum_timer = PT_Start();
int32_t sum = s16_sum32elements(data);
total_clocks_sum += PT_Elapsed(sum_timer);
Where data is in SRAM. Also note that in the context where it is used, the function is declared __STATIC_FORCEINLINE. I omit that above only because Godbolt emits no code for a static inline function with no caller.
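One contributor to the gap worth ruling out is the cost of the measurement itself: the two CYCCNT reads and the inlined glue around them are counted along with the summing code. A sketch (on-target only, using the helpers above) of calibrating that overhead with an empty measurement and subtracting it:

```c
/* Time an empty interval first; whatever it reports is pure measurement
   overhead (back-to-back CYCCNT reads), to be subtracted from real runs. */
uint32_t t0 = PT_Start();
uint32_t overhead = PT_Elapsed(t0);

uint32_t sum_timer = PT_Start();
int32_t sum = s16_sum32elements(data);
total_clocks_sum += PT_Elapsed(sum_timer) - overhead;
```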
My compiler flags are:
gcc:
-mcpu=cortex-m4 -std=gnu11 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb
g++:
-mcpu=cortex-m4 -std=gnu++14 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti -fno-use-cxa-atexit -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb
Any other combination of flags I tried made it run slower.
2023-05-10 10:14 PM
Great thread, thanks a lot!
Too much "CubeMX not working / plz do my work" around here... ;)
2023-05-11 03:48 AM
Thanks. If I get it down to around 41 cycles I'll post full details of my insane journey. It's been a blast. The responses from people here have been super helpful.