Performance drop due to alignment when using memcpy or memset

jorgen2 · ‎2012-01-21

Posted on January 21, 2012 at 15:58

Hi all! :D

I've been playing around with the STM32F4-Discovery, generating signals to drive a VGA monitor using two timers (vsync and hsync) and pure software for the pixel data (no DMA). I've gotten this far:

http://www.youtube.com/watch?v=iZRwqjbeups

It uses double buffering at a resolution of 320x200 with 256 colors. A 640x200 mode with 16 colors is also available. It works quite well. But when I started looking at how many cycles some routines took, I discovered that when blitting (in this case) a 200x200 pixel image to the framebuffer, every 4th x position was faster, about 10 times faster (or the others were 10 times slower, depending on how you look at it).

Moving the image 4 px at a time (the framebuffer is aligned to start with) keeps it steadily fast.

Anyone know a memset and memcpy that is better suited for this stuff?

One solution could be to copy the first unaligned bytes "manually" and then call memcpy for the rest, since the destination is then aligned to 4?
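A minimal sketch of that idea in C, for illustration only. The name copy_dst_aligned is hypothetical and not part of any library. It aligns only the destination; whether the remaining memcpy call is actually fast depends on how the toolchain's memcpy treats a source that may still be unaligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): copy the leading bytes one
 * at a time until the destination sits on a 4-byte boundary, then hand
 * the remainder to memcpy. */
static void *copy_dst_aligned(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: byte copies until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Either nothing is left, or the destination is now aligned. */
    assert(n == 0 || ((uintptr_t)d & 3u) == 0);
    memcpy(d, s, n);
    return dst;
}
```

For a blit this would be called once per image row, for example copy_dst_aligned(fb + y * pitch + x, img + y * width, width), where fb, pitch, img and width are placeholder names.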

Am I making any sense with my question?

Best regards

Jorgen

#stm32f4-alignment-vga

flyer31 · ‎2012-01-21

Posted on January 21, 2012 at 17:56

Concerning fast memcpy without alignment restrictions, maybe the following is interesting for you:

http://blog.frankvh.com/2011/12/30/stm32f2xx-stm32f4xx-sdio-interface-part-2/

Follow his "Stellaris forum posting" link to the fast assembler memory copy routine.

Tesla DeLorean · ‎2012-01-21

Posted on January 21, 2012 at 18:37

With x86 optimized libraries, memcpy looks at the alignment of the source and destination parameters. Depending on the input parameters, one or both can be unaligned. Ideally you can get both into alignment, but having only one unaligned is still an improvement over two.
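The alignment check described here can be sketched in C roughly as follows. Both helper names are hypothetical and assume a 4-byte word. Source and destination can only be brought into alignment together when they share the same offset within a word, which would fit the observation that moving the image 4 px at a time stays fast (assuming one byte per pixel).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of bytes to copy one at a time before p
 * reaches a 4-byte boundary (0 if it is already aligned). */
static size_t head_bytes(const void *p)
{
    return (size_t)((4u - ((uintptr_t)p & 3u)) & 3u);
}

/* Hypothetical helper: true when source and destination have the same
 * offset within a word, so one head copy aligns both of them. */
static bool can_align_both(const void *dst, const void *src)
{
    return (((uintptr_t)dst ^ (uintptr_t)src) & 3u) == 0;
}
```

When can_align_both returns false, a copy routine can still align one side (usually the destination) and accept unaligned or byte-wise access on the other.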

You should also look at how classic ARM9 code does this, since it can't do unaligned word (32-bit) or half-word (16-bit) accesses. In the worst case you'd have to do it a byte at a time.

The deferred write buffer of the M3 is going to work more efficiently on 32-bit aligned destinations. One should also consider the speed of the memory behind the source and the destination; the unaligned penalty on the slower memory should be avoided. The compiler may not know this, so some things you'll have to custom code for your specific problem.
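As an illustration of custom code that keeps the bulk of the writes on 32-bit aligned destinations, here is a rough memset-style sketch in C. The name fill_aligned is hypothetical, and the code assumes the buffer may legally be written as uint32_t (for example a framebuffer declared as a uint32_t array, or a build using -fno-strict-aliasing). It has not been measured on the M3 or M4.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill routine: byte stores for the unaligned head and the
 * tail, aligned 32-bit stores for everything in between.
 * Assumption: the buffer may legally be accessed as uint32_t. */
static void fill_aligned(void *dst, uint8_t value, size_t n)
{
    uint8_t *d = dst;

    /* Head: byte stores until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = value;
        n--;
    }

    /* Middle: if at least 4 bytes remain, d is aligned at this point. */
    if (n >= 4) {
        uint32_t word = (uint32_t)value * 0x01010101u; /* value in all 4 bytes */
        uint32_t *w = (uint32_t *)(void *)d;
        for (; n >= 4; n -= 4) {
            *w++ = word;
        }
        d = (uint8_t *)w;
    }

    /* Tail: at most 3 remaining bytes. */
    while (n > 0) {
        *d++ = value;
        n--;
    }
}
```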

I wager that some toolchains implement memcpy, memset, strcpy, etc. more efficiently than others. Examine how different toolchains implement their library code.
