2021-12-06 06:03 PM
Hi
I want to measure the execution speed of SIMD instructions. My demo is below.
16-bit addition, for example.
I found the intrinsic __SADD16() in cmsis_gcc.h. The execution time is 27us with __SADD16 and 23us without it, so the code seems to be slower when using __SADD16.
Am I calling the wrong function? Please tell me how to use SIMD instructions correctly to speed up addition or multiplication.
Thanks
2021-12-06 06:37 PM
SADD16 is a dual 16-bit signed addition. Each operand is a uint32_t holding two packed int16_t values, and the instruction adds both pairs in a single instruction.
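To make the packing concrete, here is a plain C model of what the dual 16-bit add computes. This is only a sketch: model_uadd16, pack16 and check_model are hypothetical names, not CMSIS functions, and the model ignores the GE flags that the real instruction also updates. As far as the result bits go, SADD16 and UADD16 give the same word; they differ in how they set those flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack two halfwords into one word,
 * 'lo' in bits 15:0 and 'hi' in bits 31:16. */
static uint32_t pack16(uint16_t lo, uint16_t hi)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Hypothetical model of the result of a dual 16-bit add.
 * Each lane wraps modulo 2^16 and no carry crosses between lanes. */
static uint32_t model_uadd16(uint32_t a, uint32_t b)
{
    uint16_t lo = (uint16_t)((a & 0xFFFFu) + (b & 0xFFFFu));
    uint16_t hi = (uint16_t)((a >> 16) + (b >> 16));
    return ((uint32_t)hi << 16) | lo;
}

/* A few spot checks of the lane behaviour. */
void check_model(void)
{
    /* Two independent additions in one call: 1+10 and 2+20. */
    assert(model_uadd16(pack16(1, 2), pack16(10, 20)) == pack16(11, 22));
    /* The low lane overflows to 0 without disturbing the high lane. */
    assert(model_uadd16(pack16(0xFFFF, 1), pack16(1, 1)) == pack16(0, 2));
}
```

So a single call only pays off when both lanes carry useful data; passing one halfword per call leaves half of the instruction unused.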
It looks like you're only performing one addition at a time, so the two things you're profiling are not equivalent. Would have been nice to see how uiCal1 and uiCal2 are defined.
Your comments also say u16 but SADD is for signed addition, not unsigned.
Examining the disassembly would also be enlightening. You're running more than a simple SADD16 instruction in each loop.
2021-12-06 06:46 PM
uint16_t uiCal1[100];
uint16_t uiCal2[100];
Here are the definitions of uiCal1 and uiCal2.
I also changed __SADD16() to __UADD16(), and the result is the same.
2021-12-06 07:09 PM
Why would it be different? You still aren't using the instruction properly. You need to pack two uint16_t values into a uint32_t in order for it to be efficient.
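For illustration, a loop that handles two elements per iteration might look roughly like this. It is a sketch under assumptions: add16x2 and add_u16_arrays are made-up names, the asm path assumes the compiler defines the ACLE macro __ARM_FEATURE_SIMD32 for the target, and the plain C branch is only there so the code also builds on a PC. The 32-bit loads go through memcpy, which avoids aliasing problems; whether the compiler turns each one into a single word load is something to confirm in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add two pairs of packed halfwords. */
static inline uint32_t add16x2(uint32_t a, uint32_t b)
{
#if defined(__ARM_FEATURE_SIMD32)
    uint32_t r;
    __asm volatile ("uadd16 %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
    return r;
#else
    /* Portable fallback: add each lane separately, dropping lane carries. */
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
#endif
}

/* Hypothetical helper: res[i] = x[i] + y[i] for n halfwords, n even.
 * Each iteration loads one word per input, i.e. two elements at once. */
void add_u16_arrays(uint16_t *res, const uint16_t *x,
                    const uint16_t *y, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        uint32_t a, b, r;
        memcpy(&a, &x[i], sizeof a);   /* two halfwords of x */
        memcpy(&b, &y[i], sizeof b);   /* two halfwords of y */
        r = add16x2(a, b);
        memcpy(&res[i], &r, sizeof r); /* two results at once */
    }
}
```

With the arrays from this thread the call would be add_u16_arrays(uiRes, uiCal1, uiCal2, 100), which runs 50 iterations instead of 100.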
Maybe don't worry about optimization at this stage.
2021-12-06 07:17 PM
figure 1
figure 2
Figure 2 is the body of __UADD16(), and figure 1 is the corresponding disassembly.
figure 3
Figure 3 is the disassembly of uiRes[i]=uiCal1[i]+uiCal2[i];
It can be seen that figure 1 contains more instructions than figure 3.
So I changed my code as follows.
for(i=0;i<100;i++)
{
__ASM volatile ("uadd16 %0, %1, %2" : "=r" (uiRes[i]) : "r" (uiCal1[i]), "r" (uiCal2[i]) );
// uiRes[i]=uiCal1[i]+uiCal2[i];
}
Then I found that the execution time with the new way of calling UADD16 is 22us, which is faster than the old way.
I wonder whether using a SIMD instruction improves the calculation speed only a little bit, from 23us to 22us?
2021-12-06 10:59 PM
Here are two kinds of addition: one uses the uadd16 instruction, the other calculates directly.
__ASM volatile ("uadd16 %0, %1, %2" : "=r" (*(uint32_t*)(uiRes+0)) : "r" (((uint32_t)uiCal1[1]<<16)+uiCal1[0]), "r" (((uint32_t)uiCal2[1]<<16)+uiCal2[0]));
uiRes[0]=uiCal1[0]+uiCal2[0];
uiRes[1]=uiCal1[1]+uiCal2[1];
Both versions produce the same result and take the same time, 996ns.
It seems that there is no advantage in calculation speed from using uadd16.
2021-12-06 11:36 PM
Again, observe/compare asm.
If the compiler performs the four halfword loads separately, try the same type punning for the inputs that you used for the output. Or simply keep the inputs and outputs as unions of two halfwords overlaid on a word.
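A rough sketch of the union variant, assuming C (not C++) so that reading a union member other than the one last written is allowed. The names pair16_t, add16x2 and add_pairs are invented for the example, and the asm path again assumes __ARM_FEATURE_SIMD32 is defined for the target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: two halfwords overlaid on one word. */
typedef union {
    uint32_t w;     /* both lanes, for word-wide loads, stores and uadd16 */
    uint16_t h[2];  /* the individual 16-bit elements */
} pair16_t;

/* Hypothetical helper: add two pairs of packed halfwords. */
static inline uint32_t add16x2(uint32_t a, uint32_t b)
{
#if defined(__ARM_FEATURE_SIMD32)
    uint32_t r;
    __asm volatile ("uadd16 %0, %1, %2" : "=r" (r) : "r" (a), "r" (b));
    return r;
#else
    /* Portable fallback so the sketch also builds off-target. */
    return (((a >> 16) + (b >> 16)) << 16) | ((a + b) & 0x0000FFFFu);
#endif
}

/* Hypothetical helper: one word load per input and one word store per
 * pair, with no shifting or masking to assemble the operands. */
void add_pairs(pair16_t *res, const pair16_t *x,
               const pair16_t *y, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        res[i].w = add16x2(x[i].w, y[i].w);
    }
}
```

The 100-element arrays would then be declared as pair16_t uiCal1[50] and so on, with element k reached as uiCal1[k / 2].h[k % 2] wherever it has to be accessed individually.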
JW