STM32G4 series fastest way to add 32 16bit signed integers

etheory · ‎2023-05-09

I have the following function to sum 32 16bit signed integers as optimally as I think is possible:

int16_t sum32elements2(int16_t vals[])
{
  uint32_t s1 = __SADD16(*(uint32_t*)&vals[0], *(uint32_t*)&vals[2]);
  uint32_t s2 = __SADD16(*(uint32_t*)&vals[4], *(uint32_t*)&vals[6]);
  uint32_t s3 = __SADD16(*(uint32_t*)&vals[8], *(uint32_t*)&vals[10]);
  uint32_t s4 = __SADD16(*(uint32_t*)&vals[12], *(uint32_t*)&vals[14]);
  uint32_t s5 = __SADD16(*(uint32_t*)&vals[16], *(uint32_t*)&vals[18]);
  uint32_t s6 = __SADD16(*(uint32_t*)&vals[20], *(uint32_t*)&vals[22]);
  uint32_t s7 = __SADD16(*(uint32_t*)&vals[24], *(uint32_t*)&vals[26]);
  uint32_t s8 = __SADD16(*(uint32_t*)&vals[28], *(uint32_t*)&vals[30]);
  s1 = __SADD16(s1, s2);
  s2 = __SADD16(s3, s4);
  s3 = __SADD16(s5, s6);
  s4 = __SADD16(s7, s8);
  s1 = __SADD16(s1, s2);
  s2 = __SADD16(s3, s4);
  s1 = __SADD16(s1, s2);
  typedef union { uint32_t u32; int16_t s16[2]; } S;
  S s = { s1 };
  int16_t result = s.s16[0] + s.s16[1];
  return result;
}

In godbolt this produces the following 27 lines of assembly when using -O3 which makes perfect sense, it's almost a perfect code translation, (and if I do say so myself, that looks like some nice tight code):

sum32elements2(short*):
  push {r4, r5, r6, r7, lr}
  ldrd r3, r2, [r0]
  sadd16 r3, r3, r2
  ldrd r7, r2, [r0, #8]
  sadd16 r7, r7, r2
  ldrd r2, r1, [r0, #16]
  sadd16 ip, r2, r1
  ldrd r6, r2, [r0, #24]
  sadd16 r6, r6, r2
  ldrd r2, r1, [r0, #32]
  sadd16 r2, r2, r1
  ldrd r5, r1, [r0, #40]
  sadd16 r5, r5, r1
  ldrd r1, r4, [r0, #48]
  sadd16 r1, r1, r4
  ldrd r4, r0, [r0, #56]
  sadd16 lr, r4, r0
  sadd16 r0, r3, r7
  sadd16 ip, ip, r6
  sadd16 r3, r2, r5
  sadd16 r1, r1, lr
  sadd16 r0, r0, ip
  sadd16 r3, r3, r1
  sadd16 r0, r0, r3
  add r0, r0, r0, asr #16
  sxth r0, r0
  pop {r4, r5, r6, r7, pc}

Why does this code take 67 cycles exactly to run when loaded into CCM SRAM?

CCM SRAM should be zero wait state and run as fast as possible.

By my calculations even at the very worst latency, there are 8 ldrd's with supposedly 3 clocks each (as per https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions) 15 sadd16 with 1 cycle each an add with 1 cycle and a sxth with 1 cycle. I am inlining the function so the call and return cycles are not relevant. For precision repeatable exact timing I am using DWT->CTRL with the following helper functions:

__STATIC_FORCEINLINE void PT_Enable() { DWT->CTRL |= 1; }; // enable DWT->CYCCNT counter
__STATIC_FORCEINLINE uint32_t PT_Start() { return DWT->CYCCNT; };
__STATIC_FORCEINLINE uint32_t PT_Elapsed(uint32_t pt) { return DWT->CYCCNT - pt; };

This should total:

(8*3)+15+1+1 = 41 or around that, but I get 67.

Why is that? Is there a way in STM32CubeIDE to do memory latency and pipeline occupancy testing for the STM32G474RE?

I seem to be unable to get to the right theoretical performance.

For reference, running from FLASH with zero wait state takes exactly 75 cycles vs exactly 67 for CCM SRAM. Running code from SRAM is slower than either option.

Data is in SRAM for sake of discussion, but that should be irrelevant. Placing code in CCM SRAM should allow parallelization of code and data streams, which seems to be where my main benefit of the code in CCM and data in SRAM split comes from.

What could I do to achieve theoretical performance?

NOTE: for context, the code is used like so:

uint32_t sum_timer = PT_Start();
    int32_t sum = s16_sum32elements(data);
    total_clocks_sum += PT_Elapsed(sum_timer);

Where data is in SRAM. Also note that in the context it is used, the function has "__STATIC_FORCEINLINE" placed in front of it. I only omit that above as godbolt doesn't emit code if this is used for obvious reasons.

My compiler flags are:

gcc:
 
-mcpu=cortex-m4 -std=gnu11 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb
 
g++:
 
-mcpu=cortex-m4 -std=gnu++14 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti -fno-use-cxa-atexit -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb

Any other combination of flags I tried made it run slower.

etheory · ‎2023-05-09

An update, using:

__attribute__(( aligned(16)))

In front of my data array brings the timing down to 63 clocks from 67.

Better, but still not perfect.

Tesla DeLorean · ‎2023-05-09

>>What could I do to achieve theoretical performance?

Perhaps reorder so the pipeline doesn't constantly stall?

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

AScha.3 · ‎2023-05-09

Just for fun; try optimizer setting -02 and -Ofast /speed, in my tests sometimes better than -03 .

If you feel a post has answered your question, please click "Accept as Solution".

waclawek.jan · ‎2023-05-09

First of all, post disasm of the actual binary you are running, i.e. including the DWT stuff.

> For reference, running from FLASH with zero wait state takes exactly 75 cycles vs exactly 67 for CCM SRAM.

Then something is not set correctly. Prefetch enabled?

JW

etheory · ‎2023-05-09

Why would this code constantly stall the pipeline? Could you please suggest a link to documentation that might help me understand how to resolve this? Thanks.

etheory · ‎2023-05-09

Thanks for the suggestion, I did earlier in my testing but I'll try again, cheers.

etheory · ‎2023-05-09

I'd love to post the ASM from STM32CubeIDE but it doesn’t seem to allow you to view it unless you compile in debug mode. Could you please point me to a method I could use to do this? Thanks.

waclawek.jan · ‎2023-05-09

I don't use CubeIDE, but surely you can set whatever optimization level you want for Debug, too.

Or else just use the standard objdump from gcc's suite to disasm the .elf.

JW

Tesla DeLorean · ‎2023-05-09

Load more data so you don't cause and instant dependency on data you're fetching, instructions all take multiple cycles to complete

ldrd r3, r2, [r0]
sadd16 r3, r3, r2 ; dependency on prior
ldrd r7, r2, [r0, #8]
sadd16 r7, r7, r2 ; dependency on prior

ldrd r3, r2, [r0] ; see if this pair run faster, then expand
ldrd r7, r6, [r0, #8]
sadd16 r3, r3, r2
sadd16 r7, r7, r6

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..