Memcpy vs DMA

haukeye · ‎2016-02-10

Posted on February 10, 2016 at 16:47

Hi, I'm using the DMA_FLASH_RAM example on the STM32F407 Discovery board.

Using SysTick for time measurements, I've noticed that memcpy is faster than DMA.

For example, I have a 16-bit array of 64 elements to send:

With memcpy the transfer took 293 SysTick ticks.

With DMA it took 973 SysTick ticks.

I only activate the DMA and wait for the status to change.

Thanks for your assistance!
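The post does not say how the tick counts were obtained. As a minimal sketch, assuming the raw SysTick current-value register is read before and after the copy, and assuming the reload value is the full 24-bit range (0xFFFFFF), the elapsed ticks could be computed as below. The function name systick_elapsed is hypothetical; the only hardware fact relied on is that SysTick is a 24-bit down counter.

```c
#include <stdint.h>

/* SysTick on Cortex-M is a 24-bit down counter. */
#define SYSTICK_MASK 0x00FFFFFFu

/*
 * Ticks elapsed between two raw counter reads.
 * Assumptions (not from the original post): the reload value is 0xFFFFFF
 * and the counter wrapped at most once between the two reads.
 */
uint32_t systick_elapsed(uint32_t start, uint32_t end)
{
    /* The counter counts down, so start - end is the elapsed time;
       masking to 24 bits handles a single wrap through zero. */
    return (start - end) & SYSTICK_MASK;
}
```

On the target, start and end would be two reads of the counter taken immediately around the memcpy() call or the DMA enable-and-wait sequence, so that the DMA figure includes its setup time.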

Tesla DeLorean · ‎2016-02-10

Posted on February 10, 2016 at 17:00

DMA doesn't have a cached view of memory, or write buffers. The set up time for DMA is also not insignificant.


carmine · ‎2016-02-10

Posted on February 10, 2016 at 18:17

Your results depend on several factors. What compiler are you using? And what about the libc release?

For example, newlib provides a speed-optimized version of memcpy(), which automatically detects word-aligned memory transfers. The newlib-nano memcpy(), being optimized for size, doesn't perform this type of check. Moreover, you should also try word-aligned DMA M2M (memory-to-memory) transfers: on some STM32 MCUs you can achieve more than a 4x speed-up.
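To illustrate the kind of alignment check meant here, this is a minimal sketch and not newlib's actual source: it copies 32-bit words when both pointers are 4-byte aligned and falls back to a byte loop otherwise. The names both_word_aligned and copy_checked are hypothetical, and the word casts assume the buffers may legitimately be accessed as uint32_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero when both pointers sit on a 4-byte boundary. */
int both_word_aligned(const void *dst, const void *src)
{
    return (((uintptr_t)dst | (uintptr_t)src) & 3u) == 0;
}

/* Hypothetical helper: word copy when aligned, byte copy otherwise. */
void *copy_checked(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (both_word_aligned(dst, src)) {
        /* Fast path: move four bytes per iteration. */
        uint32_t *dw = dst;
        const uint32_t *sw = src;
        for (; n >= 4; n -= 4)
            *dw++ = *sw++;
        d = (unsigned char *)dw;
        s = (const unsigned char *)sw;
    }

    /* Slow path, also used for the 0-3 leftover bytes of the fast path. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

A size-optimized copy would keep only the final byte loop, which is why it is slower on large aligned buffers.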

These are some test results I've obtained on a wide range of STM32 microcontrollers:

re.wolff9 · ‎2016-02-10

Posted on February 11, 2016 at 01:13

Simplifying the table, we toss out the "byte-aligned DMA": we're not interested in that, and its performance is bad anyway. We do the same for the "newlib-nano" memcpy: it has bad performance. And the Loop is exactly the same as "newlib", so that one goes out as well. That leaves DMA (word-aligned) vs. newlib (software copy). In many cases the results are about the same, except for the L152, where DMA is a lot slower.

So, IF you can have the CPU doing something else, doing it with DMA is faster; otherwise, doing it in software can be about as fast as DMA.

Doing something else with the CPU is difficult: chances are that the CPU will need the memory bus. So if you manage to get the CPU to do ten things, it will hold up the DMA transfer for ten extra cycles. So even if you get something done while the DMA runs, the total running time may not be less than if you did the memcpy in software and THEN did those ten things.

Something can be said for simplicity: Just do it in software.

The '407 doesn't have a cache.

carmine · ‎2016-02-10

Posted on February 11, 2016 at 07:11

There is just one other thing to take into account: linking newlib, which provides the performance-optimized version of memcpy(), costs about 10K of additional flash. For smaller STM32 MCUs (e.g. low-cost F0 parts with less than 32K of flash), doing it with DMA is preferable. Moreover, the DMA setup cost is not negligible, as clive stated, so a DMA M2M transfer makes sense only if you have at least 30-50 elements to copy.
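The break-even idea can be sketched with a simple linear cost model. This is an assumption for illustration, not something measured in this thread: a software copy costs a fixed number of ticks per element, while DMA costs a fixed setup plus a smaller per-element cost. The function names and all three parameters are hypothetical and would have to be measured on the actual part.

```c
#include <stdint.h>

/* Linear cost models, in timer ticks. All parameters must be measured. */
uint32_t cpu_copy_ticks(uint32_t n, uint32_t cpu_per_elem)
{
    return n * cpu_per_elem;
}

uint32_t dma_copy_ticks(uint32_t n, uint32_t setup, uint32_t dma_per_elem)
{
    return setup + n * dma_per_elem;
}

/*
 * Smallest element count for which DMA is strictly faster than the CPU copy
 * under the model above. Returns 0 if DMA never wins, i.e. its per-element
 * cost is not lower than the CPU's.
 */
uint32_t dma_break_even(uint32_t setup, uint32_t cpu_per_elem,
                        uint32_t dma_per_elem)
{
    if (cpu_per_elem <= dma_per_elem)
        return 0;
    /* setup + n*d < n*c  <=>  n > setup / (c - d) */
    return setup / (cpu_per_elem - dma_per_elem) + 1;
}
```

With made-up figures of 120 ticks of setup, 4 ticks per element in software and 1 tick per element by DMA, the model puts the crossover at 41 elements; real values depend on the MCU, the bus load and the libc in use.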

carl2399 · ‎2016-02-10

Posted on February 11, 2016 at 08:00

My code probably spends about 2% of execution time in memcpy, so I have the following naive implementation:

void MemCpy(void *dst, const void *src, u32 cnt)
{
    // Copy longwords, taking advantage of STM ability to read/write unaligned data
    while (cnt >= 4)
    {
        *(u32 *)dst = *(const u32 *)src;
        dst = (u8 *)dst + 4;
        src = (const u8 *)src + 4;
        cnt -= 4;
    }
    // Copy the couple of leftover bytes
    while (cnt--)
    {
        *(u8 *)dst = *(const u8 *)src;
        dst = (u8 *)dst + 1;
        src = (const u8 *)src + 1;
    }
}

When compiled with -O2 under GCC, it requires 46 bytes of flash.