optimizations in string.h ?

guyvo67
Associate II
Posted on February 23, 2009 at 15:45

guyvo67
Associate II
Posted on May 17, 2011 at 13:03

Hi,

I needed a memmove in my project, so my first step was simply to use the memmove from string.h, like this:

memmove(&Xi[1],Xi,72);

This takes 85 µs at 8 MHz in the simulator. Quite a lot to move only 72 bytes, don't you think? I did the same thing on an 8-bit AVR and, surprisingly, got 75 µs!

I stepped into the assembler code for memmove, and apparently there is a loop somewhere that runs 72 times. This means the default memmove is byte-oriented and, in my opinion, not optimized to move data in chunks of 4 bytes (32 bits). I would at least expect a Cortex STM32 to be four times faster at this operation than the 8-bit AVR target.
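To make the byte-versus-word point concrete, here is a minimal sketch in C. It is not the library's actual memmove: `copy_bytes_fwd` and `copy_words_fwd` are hypothetical names, both assume non-overlapping buffers, and the word version assumes the buffers really are `uint32_t` arrays (and therefore 4-byte aligned). Each function returns its loop iteration count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-oriented forward copy: one loop iteration per byte.
 * Hypothetical helper; the buffers must not overlap. */
size_t copy_bytes_fwd(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t iterations = 0;
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
        iterations++;
    }
    return iterations;
}

/* Word-oriented forward copy: one loop iteration per 32-bit word.
 * Hypothetical helper; assumes aligned, non-overlapping uint32_t arrays. */
size_t copy_words_fwd(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    assert(dst != NULL && src != NULL);
    size_t iterations = 0;
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = src[i];
        iterations++;
    }
    return iterations;
}
```

For 72 bytes the byte loop runs 72 times and the word loop 18 times. Fewer iterations do not by themselves guarantee a fourfold speedup; that depends on the generated code and on the memory system.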

Changing the optimization option (O1/O2/O3/SIZE) does not change the performance of this function. The GCC compiler options are:

-MD -D_STM32F103VBT6_ -D_STM3x_ -D_STM32x_ -mthumb -mcpu=cortex-m3

Can anyone shed some light on this? I'm rather new to this target. I have not tested this yet on the real target running with the PLL at 72 MHz, but the speed will only increase 9 times, and that says nothing about the efficiency of the implementation. Using a memory-to-memory DMA transfer could also bring a speed gain, but again that says basically nothing about the logic used.

Thanks

Guy

domen2
Associate III
Posted on May 17, 2011 at 13:03

Are you sure you need memmove and not memcpy? Former must work when areas overlap, and that's probably the reason for byte transfers.
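The overlap requirement can be sketched as follows. This is an illustrative byte-wise version, not the source of any particular C library; `byte_memmove` is a hypothetical name. The point is only that the copy direction has to depend on which buffer sits lower in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Overlap-safe byte copy (hypothetical sketch of what memmove must guarantee).
 * If dst is below src, copy forward; if dst is above src, copy backward,
 * so no source byte is overwritten before it has been read. */
void *byte_memmove(void *dst, const void *src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        for (size_t i = 0; i < n; i++) {
            d[i] = s[i];
        }
    } else if ((uintptr_t)d > (uintptr_t)s) {
        for (size_t i = n; i > 0; i--) {
            d[i - 1] = s[i - 1];
        }
    }
    return dst;
}
```

memcpy makes no such promise, which is why an implementation is free to pick whatever order and access width is fastest for it.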

Another problem is that 4-byte ("word") accesses must be aligned to a 4-byte boundary on ARM.
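A quick way to test for that alignment, as a sketch: `is_word_aligned` and `require_word_aligned` are hypothetical helper names, not part of any library. Whether an unaligned word access faults or merely costs extra cycles depends on the core and its configuration, so treat the check as a conservative guard.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if p lies on a 4-byte boundary (hypothetical helper). */
int is_word_aligned(const void *p)
{
    return ((uintptr_t)p & 3u) == 0;
}

/* Debug-time guard a word-copy routine could call before using word accesses. */
void require_word_aligned(const void *p)
{
    assert(is_word_aligned(p));
}
```

Note that when the destination is the source plus 2 bytes, as with `&Xi[1]` and `Xi` for 16-bit elements, the two pointers can never both be 4-byte aligned.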

guyvo67
Associate II
Posted on May 17, 2011 at 13:03

Quote:

Are you sure you need memmove and not memcpy?

Yes, I need memmove because I must shift the s16 array right by one position: a[n] -> a[n+1].
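The shift described here can be sketched as below. It assumes `s16` is a typedef for a signed 16-bit integer, written as `int16_t`; `shift_right_one` is a hypothetical wrapper name. The source and destination overlap, so memmove rather than memcpy is the right standard call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a[0..n-2] into a[1..n-1]; a[0] keeps its old value.
 * Hypothetical wrapper; int16_t stands in for the s16 type. */
void shift_right_one(int16_t *a, size_t n)
{
    assert(a != NULL);
    if (n < 2) {
        return;                 /* nothing to move */
    }
    /* Regions overlap by all but one element, so memmove is required. */
    memmove(&a[1], &a[0], (n - 1) * sizeof a[0]);
}
```

With a 37-element array this moves 36 elements, i.e. 72 bytes, which matches the size in the original call.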

Quote:

Former must work when areas overlap, and that's probably the reason for byte transfers.

Yes, that could be the cause, because memmove must work under all conditions. In my case, though, this particular move could be optimized: starting at the end of the buffer and moving 4 bytes at a time doesn't cause any problems. But I guess I have to write it myself 😉
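One way such a hand-written routine might look, as a sketch under stated assumptions: `shift_right_one_by_words` is a hypothetical name, `int16_t` stands in for `s16`, and the fixed-size 4-byte `memcpy` calls are used as a portable way to express a possibly unaligned word access (the assumption being that the compiler turns each into a single load or store where the target allows it; that is not guaranteed). Each word is fully loaded before it is stored 2 bytes higher, and the loop walks from the end of the buffer toward the start, so no source data is overwritten before it is read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Shift a[0..n-2] into a[1..n-1], moving 4 bytes per iteration from the end. */
void shift_right_one_by_words(int16_t *a, size_t n)
{
    assert(a != NULL);
    if (n < 2) {
        return;
    }
    unsigned char *base = (unsigned char *)a;
    size_t remaining = (n - 1) * sizeof a[0];    /* source bytes still to move */

    while (remaining >= 4) {
        uint32_t word;
        remaining -= 4;
        memcpy(&word, base + remaining, 4);      /* load 4 source bytes */
        memcpy(base + remaining + 2, &word, 4);  /* store them one element higher */
    }
    if (remaining == 2) {
        a[1] = a[0];                             /* odd element left at the front */
    }
}
```

For the 72-byte case this is 18 word moves with no leftover element. Whether it actually beats the library memmove would have to be measured on the target.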

PS:

Measurements on the target (PLL at 72 MHz) with a scope:

memcpy takes 2.6 µs (185 cycles ~ 2.5 cycles / byte)

memmove takes 8.2 µs (585 cycles ~ 8 cycles / byte)

(memmove / memcpy ~ 3)
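The arithmetic behind these figures can be checked with a small sketch. The helper names `cycles_from_ns` and `cycles_per_byte_x100` are hypothetical, and integer math is used so the results are exact.

```c
#include <assert.h>
#include <stdint.h>

/* CPU cycles elapsed in time_ns nanoseconds at f_mhz MHz, rounded to nearest. */
uint32_t cycles_from_ns(uint32_t time_ns, uint32_t f_mhz)
{
    return (uint32_t)(((uint64_t)time_ns * f_mhz + 500u) / 1000u);
}

/* Cycles per byte multiplied by 100, rounded to nearest (257 means 2.57). */
uint32_t cycles_per_byte_x100(uint32_t cycles, uint32_t nbytes)
{
    assert(nbytes > 0);
    return (cycles * 100u + nbytes / 2u) / nbytes;
}
```

With these, `cycles_per_byte_x100(185, 72)` gives 257 and `cycles_per_byte_x100(585, 72)` gives 813, i.e. about 2.57 and 8.13 cycles per byte. The simulator figure from the first post, `cycles_from_ns(85000, 8)`, comes to 680 cycles, or about 9.44 cycles per byte. Converting the rounded times back (2.6 µs and 8.2 µs at 72 MHz) gives 187 and 590 cycles, slightly above the quoted 185 and 585, which is consistent with the times having been rounded.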