STM32G4 series: fastest way to add 32 16-bit signed integers

etheory
Senior

I have the following function to sum 32 16-bit signed integers, written as optimally as I think is possible:

int16_t sum32elements2(int16_t vals[])
{
  uint32_t s1 = __SADD16(*(uint32_t*)&vals[0], *(uint32_t*)&vals[2]);
  uint32_t s2 = __SADD16(*(uint32_t*)&vals[4], *(uint32_t*)&vals[6]);
  uint32_t s3 = __SADD16(*(uint32_t*)&vals[8], *(uint32_t*)&vals[10]);
  uint32_t s4 = __SADD16(*(uint32_t*)&vals[12], *(uint32_t*)&vals[14]);
  uint32_t s5 = __SADD16(*(uint32_t*)&vals[16], *(uint32_t*)&vals[18]);
  uint32_t s6 = __SADD16(*(uint32_t*)&vals[20], *(uint32_t*)&vals[22]);
  uint32_t s7 = __SADD16(*(uint32_t*)&vals[24], *(uint32_t*)&vals[26]);
  uint32_t s8 = __SADD16(*(uint32_t*)&vals[28], *(uint32_t*)&vals[30]);
  s1 = __SADD16(s1, s2);
  s2 = __SADD16(s3, s4);
  s3 = __SADD16(s5, s6);
  s4 = __SADD16(s7, s8);
  s1 = __SADD16(s1, s2);
  s2 = __SADD16(s3, s4);
  s1 = __SADD16(s1, s2);
  typedef union { uint32_t u32; int16_t s16[2]; } S;
  S s = { s1 };
  int16_t result = s.s16[0] + s.s16[1];
  return result;
}

In godbolt this produces the following 27 lines of assembly when using -O3, which makes perfect sense, as it's almost a direct translation of the source (and if I do say so myself, that looks like some nice, tight code):

sum32elements2(short*):
  push {r4, r5, r6, r7, lr}
  ldrd r3, r2, [r0]
  sadd16 r3, r3, r2
  ldrd r7, r2, [r0, #8]
  sadd16 r7, r7, r2
  ldrd r2, r1, [r0, #16]
  sadd16 ip, r2, r1
  ldrd r6, r2, [r0, #24]
  sadd16 r6, r6, r2
  ldrd r2, r1, [r0, #32]
  sadd16 r2, r2, r1
  ldrd r5, r1, [r0, #40]
  sadd16 r5, r5, r1
  ldrd r1, r4, [r0, #48]
  sadd16 r1, r1, r4
  ldrd r4, r0, [r0, #56]
  sadd16 lr, r4, r0
  sadd16 r0, r3, r7
  sadd16 ip, ip, r6
  sadd16 r3, r2, r5
  sadd16 r1, r1, lr
  sadd16 r0, r0, ip
  sadd16 r3, r3, r1
  sadd16 r0, r0, r3
  add r0, r0, r0, asr #16
  sxth r0, r0
  pop {r4, r5, r6, r7, pc}

Why does this code take 67 cycles exactly to run when loaded into CCM SRAM?

CCM SRAM should be zero wait state and run as fast as possible.

By my calculations, even at the very worst latency, there are 8 ldrd's at supposedly 3 cycles each (as per https://developer.arm.com/documentation/ddi0439/b/Programmers-Model/Instruction-set-summary/Cortex-M4-instructions), 15 sadd16's at 1 cycle each, an add at 1 cycle, and an sxth at 1 cycle. I am inlining the function, so the call and return cycles are not relevant. For precise, repeatable timing I am using the DWT cycle counter (DWT->CYCCNT, enabled via DWT->CTRL) with the following helper functions:

__STATIC_FORCEINLINE void PT_Enable() { DWT->CTRL |= 1; }                           // enable the DWT->CYCCNT counter (CYCCNTENA)
__STATIC_FORCEINLINE uint32_t PT_Start() { return DWT->CYCCNT; }                    // snapshot the cycle counter
__STATIC_FORCEINLINE uint32_t PT_Elapsed(uint32_t pt) { return DWT->CYCCNT - pt; }  // cycles since the snapshot
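
For completeness, a one-time setup along these lines is assumed (a sketch, not from the post; the helper name PT_Init is mine), since CYCCNT only counts after TRCENA is set in DEMCR:

/* Sketch only: enable trace (TRCENA) so the DWT cycle counter actually runs. */
__STATIC_FORCEINLINE void PT_Init()
{
  CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;   // enable the DWT/ITM block
  DWT->CYCCNT = 0;                                  // start counting from zero
  PT_Enable();                                      // set CYCCNTENA in DWT->CTRL
}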

Those per-instruction timings should total:

(8*3) + 15 + 1 + 1 = 41, or thereabouts, but I get 67.

Why is that? Is there a way in STM32CubeIDE to do memory latency and pipeline occupancy testing for the STM32G474RE?

I can't seem to reach the theoretical performance.

For reference, running from FLASH with zero wait state takes exactly 75 cycles vs exactly 67 for CCM SRAM. Running code from SRAM is slower than either option.

Data is in SRAM for the sake of discussion, but that should be irrelevant. Placing code in CCM SRAM should allow the code and data streams to proceed in parallel, which seems to be where the main benefit of the code-in-CCM / data-in-SRAM split comes from.
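
For reference, this is roughly how the function gets placed in CCM SRAM (a sketch only: the ".ccmram" section name is an assumption and must match the linker script, which also has to arrange for that section to be copied from FLASH into CCM SRAM at startup):

/* Sketch only: the section name is an assumption, check the project's .ld file. */
__attribute__((section(".ccmram")))
int16_t sum32elements2(int16_t vals[]);   /* same function as defined above */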

What could I do to achieve theoretical performance?

NOTE: for context, the code is used like so:

uint32_t sum_timer = PT_Start();
int32_t sum = s16_sum32elements(data);
total_clocks_sum += PT_Elapsed(sum_timer);

Where data is in SRAM. Also note that in the context where it is used, the function is declared __STATIC_FORCEINLINE. I only omit that above because godbolt won't emit any code for a static inline function that is never called.
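
The surrounding harness is essentially a loop that accumulates and averages the counts, something like this (the iteration count is illustrative, and PT_Enable() is assumed to have been called once beforehand):

/* Illustrative measurement loop (my sketch of the harness, not the exact code): */
uint32_t total_clocks_sum = 0;
for (uint32_t i = 0; i < 1000u; ++i)
{
  uint32_t sum_timer = PT_Start();
  volatile int32_t sum = s16_sum32elements(data);   /* volatile so the call isn't optimised away */
  (void)sum;
  total_clocks_sum += PT_Elapsed(sum_timer);
}
uint32_t avg_clocks = total_clocks_sum / 1000u;     /* average cycles per call */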

My compiler flags are:

gcc:
 
-mcpu=cortex-m4 -std=gnu11 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb
 
g++:
 
-mcpu=cortex-m4 -std=gnu++14 -DUSE_HAL_DRIVER -DSTM32G474xx -c -I../Core/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc -I../Drivers/STM32G4xx_HAL_Driver/Inc/Legacy -I../Drivers/CMSIS/Device/ST/STM32G4xx/Include -I../Drivers/CMSIS/Include -O3 -ffunction-sections -fdata-sections -fno-exceptions -fno-rtti -fno-use-cxa-atexit -Wall -fstack-usage -fcyclomatic-complexity --specs=nano.specs -mfpu=fpv4-sp-d16 -mfloat-abi=hard -mthumb

Any other combination of flags I tried made it run slower.

21 REPLIES

Thanks, that makes a lot of sense. After your post I did a lot more reading about the M4 pipeline and which operations cause stalls. I really appreciate the comment; it really helps.

I feel like such an idiot, haha. Of course, you are absolutely correct; that is easy to do. When I get home from work tonight I'll give it a go and post it here. I appreciate your response.

Here it is; it's almost exactly the same as godbolt:

10000450:   ldrd    r3, r2, [r6, #128]      ; 0x80
10000454:   sadd16  r12, r3, r2
10000458:   ldrd    r7, r3, [r6, #136]      ; 0x88
1000045c:   sadd16  r7, r7, r3
10000460:   ldrd    r1, r3, [r6, #144]      ; 0x90
10000464:   sadd16  r1, r1, r3
10000468:   ldrd    r5, r3, [r6, #152]      ; 0x98
1000046c:   sadd16  r5, r5, r3
10000470:   ldrd    r4, r3, [r6, #160]      ; 0xa0
10000474:   sadd16  r4, r4, r3
10000478:   ldrd    r3, r2, [r6, #168]      ; 0xa8
1000047c:   sadd16  r3, r3, r2
10000480:   ldrd    r2, r0, [r6, #176]      ; 0xb0
10000484:   str     r3, [sp, #44]   ; 0x2c
10000486:   sadd16  r2, r2, r0
1000048a:   ldrd    r0, r3, [r6, #184]      ; 0xb8
1000048e:   sadd16  r0, r0, r3
10000492:   sadd16  r7, r12, r7
10000496:   sadd16  r1, r1, r5
1000049a:   ldr     r3, [sp, #44]   ; 0x2c
1000049c:   sadd16  r3, r4, r3
100004a0:   sadd16  r0, r2, r0
100004a4:   sadd16  r1, r7, r1
100004a8:   sadd16  r3, r3, r0
100004ac:   sadd16  r2, r1, r3
100004b0:   ldr.w   r3, [r10, #4]
100004b4:   ldr     r1, [sp, #40]   ; 0x28

gcc is seemingly too stupid to understand this concept. I've tried every compiler flag I can think of, and it seems incapable of understanding the M4 pipeline and its load-stall issues. Do you always do this by hand-assembly, or is there a magic flag to stop gcc, or "GNU Tools for STM32 (10.3-2021.10)", from being so silly? Thanks!
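
For what it's worth, one way to keep a hand-scheduled sequence out of gcc's scheduler entirely is a naked function whose whole body is a single asm block. A minimal sketch (not from the thread, only 8 elements to keep it short, and the function name is illustrative; the same pattern extends to 32 elements):

#include <stdint.h>

/* AAPCS: the argument pointer arrives in r0, the 16-bit result is returned in r0. */
__attribute__((naked))
int16_t sum8elements_asm(const int16_t vals[8])
{
  __asm volatile(
    "ldr    r1, [r0]            \n"   /* vals[0..1] packed in one word */
    "ldr    r2, [r0, #4]        \n"   /* vals[2..3] */
    "ldr    r3, [r0, #8]        \n"   /* vals[4..5] */
    "ldr    r0, [r0, #12]       \n"   /* vals[6..7] */
    "sadd16 r1, r1, r2          \n"   /* consumers issue well after their loads */
    "sadd16 r3, r3, r0          \n"
    "sadd16 r0, r1, r3          \n"
    "add    r0, r0, r0, asr #16 \n"   /* fold the high halfword into the low one */
    "sxth   r0, r0              \n"
    "bx     lr                  \n"
  );
}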

-O2, -O3 and -Ofast all show precisely the same performance metrics. No difference at all.

You were correct about prefetch.

After improving a few more things:

New stats.

1.) CCM RAM: 63 cycles

2.) FLASH no prefetch: 83 cycles

3.) FLASH prefetch: 69 cycles

Wow, FLASH delivers impressive performance with prefetch enabled in this case. Nice. Thanks for the prompt.
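
For anyone reproducing this: prefetch is controlled by the PRFTEN bit in FLASH_ACR (CubeMX/HAL normally set the equivalent for you). A register-level sketch:

#include "stm32g4xx.h"   /* device header: FLASH registers and FLASH_ACR_PRFTEN */

/* Sketch: enable the FLASH prefetch buffer directly at the register level. */
static void flash_prefetch_enable(void)
{
  FLASH->ACR |= FLASH_ACR_PRFTEN;   /* set PRFTEN in FLASH_ACR */
}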

@Community member - thanks again for your help. Your tip about the pipeline stalls led me down the right path, but something is still up.

I managed to find something that takes the 67-cycle execution time of the above down to 61. It looks like this (reordering the ldrd's instead gave 63; I think because they are scheduled differently from the other loads, it didn't help much):

sum32elements4(short*):
  push {r4, lr}
  ldr r3, [r0]
  ldr r4, [r0, #4]
  ldr r2, [r0, #8]
  ldr r1, [r0, #12]
  sadd16 r3, r3, r4
  sadd16 r2, r2, r1
  sadd16 r3, r3, r2
  ldr ip, [r0, #16]
  ldr r4, [r0, #20]
  ldr r2, [r0, #24]
  ldr r1, [r0, #28]
  sadd16 ip, ip, r4
  sadd16 r2, r2, r1
  sadd16 ip, ip, r2
  ldr r2, [r0, #32]
  ldr lr, [r0, #36]
  ldr r1, [r0, #40]
  ldr r4, [r0, #44]
  sadd16 r2, r2, lr
  sadd16 r1, r1, r4
  sadd16 r2, r2, r1
  ldr r1, [r0, #48]
  ldr r4, [r0, #52]
  ldr lr, [r0, #56]
  ldr r0, [r0, #60]
  sadd16 r1, r1, r4
  sadd16 r0, lr, r0
  sadd16 r1, r1, r0
  sadd16 r0, r3, ip
  sadd16 r3, r2, r1
  sadd16 r0, r0, r3
  add r0, r0, r0, asr #16
  sxth r0, r0
  pop {r4, pc}

As far as I can see, this now has no load-hazard/load-use pipeline stalls from the memory reads (but there are a couple left in the sadd16 reduction chain that I need to address).

Having said that, it SHOULD run much faster, since the loads theoretically no longer cause load hazards, but it's only a few cycles quicker. That seems odd to me. It is still far from the roughly 41-cycle theoretical limit.

I'll work next on removing/mitigating the dependency chain stalls of the sadd16's.

Now we are talking: down to 56 cycles:

sum32elements4(short*):
  push {r4, r5, r6, lr}
  ldr r3, [r0]
  ldr r1, [r0, #4]
  ldr r4, [r0, #8]
  ldr r2, [r0, #12]
  sadd16 r3, r3, r1
  sadd16 r4, r4, r2
  ldr ip, [r0, #16]
  ldr r5, [r0, #20]
  ldr r2, [r0, #24]
  ldr r1, [r0, #28]
  sadd16 ip, ip, r5
  sadd16 r2, r2, r1
  ldr r1, [r0, #32]
  ldr r2, [r0, #36]
  ldr lr, [r0, #40]
  ldr r5, [r0, #44]
  sadd16 r1, r1, r2
  sadd16 lr, lr, r5
  ldr r2, [r0, #48]
  ldr r6, [r0, #52]
  ldr r5, [r0, #56]
  ldr r0, [r0, #60]
  sadd16 r2, r2, r6
  sadd16 r0, r5, r0
  sadd16 r2, r2, r0
  sadd16 r0, r3, r4
  sadd16 ip, ip, r2
  sadd16 r3, r1, lr
  sadd16 r0, r0, ip
  sadd16 r2, r3, r2
  sadd16 r0, r0, r2
  add r0, r0, r0, asr #16
  sxth r0, r0
  pop {r4, r5, r6, pc}

Thanks for your help. It really does mostly come down to pipeline stalling. Every time I remove another dependency, it makes a huge difference.

Post disasm of the actual binary you are running, i.e. including the DWT stuff.

The reason is that it contributes to the posted cycle count, too.

Make sure that number still holds.

Also, don't stop on a breakpoint before the start of measurement; that influences the internal state of the processor (e.g. the prefetch queue).

Also, tell us the content of any registers that isn't explicitly clear from the disasm (i.e. the position of the data in memory).

Details matter.

JW

Thanks, all great points; that really helps. I will do those things shortly.