fhu.11
Associate III
December 7, 2021
Question

STM32G474RE NUCLEO board SIMD instruction test

Hi

I want to test the execution speed of the SIMD instructions. My demo is shown below.

16-bit addition, for example.

[Image: 0693W00000HnjelQAB.png]

I found the intrinsic __SADD16() in cmsis_gcc.h. The execution time is 27us with __SADD16 and 23us without it, so the code actually seems slower when using __SADD16.

Am I calling the wrong function? Please tell me how to use the SIMD instructions correctly to speed up addition or multiplication.

Thanks

4 replies

TDK
December 7, 2021

SADD16 is a dual 16-bit signed addition. Each operand is a uint32_t holding two packed int16_t values, and both sums are calculated in a single instruction.
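To make the packing concrete, here is a minimal sketch in portable C of what a dual 16-bit add computes. The names pack_s16x2, sadd16_model, lane_lo and lane_hi are hypothetical helpers, not CMSIS functions; on the target, __SADD16() from cmsis_gcc.h would take the place of sadd16_model. The sketch assumes the usual two's-complement conversion when a lane is read back as int16_t.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one word:
   lo goes to bits 0..15, hi goes to bits 16..31. */
static inline uint32_t pack_s16x2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Portable model of the lane-wise add: two independent 16-bit sums,
   each wrapping modulo 2^16, with no carry from the low lane into
   the high lane. */
static inline uint32_t sadd16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16; /* bits above 31 are dropped */
    return hi | lo;
}

/* Read the lanes back as signed values (assumes two's complement). */
static inline int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }
static inline int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
```

For example, sadd16_model(pack_s16x2(100, -200), pack_s16x2(23, 50)) yields a word whose low lane is 123 and whose high lane is -150. A plain 32-bit add would not be equivalent, because a carry out of the low lane would leak into the high lane.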

It looks like you're only performing one addition at a time, so the two things you're profiling are not equivalent. It would have been nice to see how uiCal1 and uiCal2 are defined.

Your comments also say u16 but SADD is for signed addition, not unsigned.

Examining the disassembly would also be enlightening. You're running more than a simple SADD16 instruction in each loop.

If you feel a post has answered your question, please click "Accept as Solution".
fhu.11 (Author)
Associate III
December 7, 2021

[Figure 1: image 0693W00000HnjwBQAR.png]

[Figure 2: image 0693W00000HnjwLQAR.png]

Figure 2 is the body of __UADD16(), and figure 1 is the corresponding disassembly.

[Figure 3: image 0693W00000HnjwpQAB.png]

Figure 3 is the disassembly of uiRes[i]=uiCal1[i]+uiCal2[i];

It can be seen that figure 1 contains more instructions than figure 3.

So I changed my code as follows.

for(i=0;i<100;i++)
{
  __ASM volatile ("uadd16 %0, %1, %2" : "=r" (uiRes[i]) : "r" (uiCal1[i]), "r" (uiCal2[i]) );
  // uiRes[i]=uiCal1[i]+uiCal2[i];
}

With this change, the execution time of the new UADD16 calling method is 22us, which is faster than the old one.

Does using a SIMD instruction really improve the calculation speed only this little, from 23us to 22us?

fhu.11 (Author)
Associate III
December 7, 2021

uint16_t uiCal1[100];

uint16_t uiCal2[100];

These are the definitions of uiCal1 and uiCal2.

I also changed __SADD16() to __UADD16(); the result is the same.

TDK
December 7, 2021

Why would it be different? You still aren't using the instruction properly. You need to pack two uint16_t values into a uint32_t in order for it to be efficient.
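As a rough sketch of what "packing" means for the arrays in this thread: load two adjacent uint16_t elements as one 32-bit word, add both lanes at once, and store one word back, so 100 elements take 50 loop iterations. The function names uadd16 and add_u16_packed are hypothetical. The inline asm path is only compiled when the ACLE macro __ARM_FEATURE_SIMD32 is defined (on the target, __UADD16() from CMSIS would serve the same purpose); otherwise a portable model is used so the code also builds on a PC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_FEATURE_SIMD32)
/* Target path: one UADD16 instruction adds both 16-bit lanes. */
static inline uint32_t uadd16(uint32_t a, uint32_t b)
{
    uint32_t r;
    __asm volatile ("uadd16 %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
    return r;
}
#else
/* Portable model: two independent 16-bit sums, no carry between lanes. */
static inline uint32_t uadd16(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
#endif

/* res[i] = a[i] + b[i] (modulo 2^16) for n elements, two per iteration. */
void add_u16_packed(uint16_t *res, const uint16_t *a, const uint16_t *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        uint32_t wa, wb, wr;
        /* memcpy expresses a word load/store without aliasing problems;
           compilers normally turn it into a single load or store. */
        memcpy(&wa, &a[i], sizeof wa);
        memcpy(&wb, &b[i], sizeof wb);
        wr = uadd16(wa, wb);
        memcpy(&res[i], &wr, sizeof wr);
    }
    if (i < n) {
        /* Odd element count: handle the last element on its own. */
        res[i] = (uint16_t)(a[i] + b[i]);
    }
}
```

With the arrays from this thread the call would be add_u16_packed(uiRes, uiCal1, uiCal2, 100). Whether this is actually faster than the plain loop depends on what the compiler generates, so the disassembly is still worth checking.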

Maybe don't worry about optimization at this stage.
fhu.11 (Author)
Associate III
December 7, 2021

Here are two ways of doing the addition: one uses the uadd16 instruction, the other calculates directly.

__ASM volatile ("uadd16 %0, %1, %2" : "=r" (*(uint32_t*)(uiRes+0)) : "r" (((uint32_t)uiCal1[1]<<16)+uiCal1[0]), "r" (((uint32_t)uiCal2[1]<<16)+uiCal2[0]));

uiRes[0]=uiCal1[0]+uiCal2[0];
uiRes[1]=uiCal1[1]+uiCal2[1];

Both versions produce the same result and take the same execution time, 996ns.

So there seems to be no speed advantage in using uadd16.

waclawek.jan
Super User
December 7, 2021

Again, observe/compare asm.

If the compiler performs the four halfword loads separately, try using the same type punning for the inputs that you used for the output. Or simply keep the inputs and outputs as packed unions of a couple of halfwords overlapped by a word.
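One possible reading of the union suggestion, as a sketch only: each array element holds two halfwords that share storage with one word, so the word member can be handed to the dual add directly and no shift-and-combine code is needed. The names u16x2_t, N_PAIRS, cal1, cal2, res, uadd16_model and add_all are all hypothetical. The model function stands in for the instruction so the code builds anywhere; on the target it would be replaced by __UADD16(). This relies on C union type punning and is not valid C++.

```c
#include <stdint.h>

/* Two 16-bit lanes overlapped by one 32-bit word. */
typedef union {
    uint32_t w;      /* whole word, as passed to the dual add */
    uint16_t h[2];   /* the individual halfword values */
} u16x2_t;

#define N_PAIRS 50   /* 50 words hold 100 halfwords */

u16x2_t cal1[N_PAIRS], cal2[N_PAIRS], res[N_PAIRS];

/* Portable stand-in for UADD16: lane-wise add, no carry between lanes. */
static inline uint32_t uadd16_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}

/* One word load per operand and one word store per pair of sums. */
void add_all(void)
{
    for (int i = 0; i < N_PAIRS; i++) {
        res[i].w = uadd16_model(cal1[i].w, cal2[i].w);
    }
}
```

Individual values remain accessible through the h member, for example cal1[3].h[1], while the loop itself only touches whole words.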

JW