
Optimising memset() and memcpy() with DMA (STM32F407)

Question asked by mjbcswitzerland on Jun 16, 2014
Hi All

I have a test where the following 6 operations are performed using standard memcpy() and memset() [working with a byte pointer and a loop], on an STM32F407 at 168MHz with M2M operations in SRAM:
1. clear 2k bytes of memory using memset() - 62us
2. perform memcpy() of 2k bytes [both arrays are long word aligned] - 96us
3. perform memcpy() of 2k bytes [source array is long word aligned, destination array is odd address aligned 0xXXXX1] - 96us
4. perform memcpy() of 2k bytes [destination array is long word aligned, source array is odd address aligned 0xXXXX1] - 96us
5. perform memcpy() of 2k bytes [source and destination arrays are odd address aligned both 0xXXXX1] - 96us
6. perform memcpy() of 2k bytes [destination and source arrays are on odd addresses - one with 0xXXXX1 and one with 0xXXXX3] - 96us

As expected, the SW memset() is faster since it only advances one pointer, and all memcpy() variations take the same time since a byte loop is unaffected by alignment.
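For reference, a byte-pointer loop of the kind described could look like the sketch below. The names sw_memset() and sw_memcpy() are hypothetical and the code is only assumed to resemble the routines that were timed; it is there to show why the copy advances two pointers per byte while the fill advances one.

```c
#include <stddef.h>

/* Hypothetical byte-loop fill, assumed to resemble the SW memset() timed above. */
void *sw_memset(void *dst, int value, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    while (len-- != 0) {
        *d++ = (unsigned char)value;   /* only the destination pointer advances */
    }
    return dst;
}

/* Hypothetical byte-loop copy, assumed to resemble the SW memcpy() timed above. */
void *sw_memcpy(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    while (len-- != 0) {
        *d++ = *s++;                   /* source and destination pointers both advance */
    }
    return dst;
}
```

Since every access is a single byte, the source and destination alignment cannot change the number of loads and stores, which fits the identical 96us results.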

Then I tested memset() and memcpy() based on M2M DMA.

1. clear 2k bytes of memory using memset() - 12.8us
2. perform memcpy() of 2k bytes [both arrays are long word aligned] - 12.8us
3. perform memcpy() of 2k bytes [source array is long word aligned, destination array is odd address aligned 0xXXXX1] - 49.2us
4. perform memcpy() of 2k bytes [destination array is long word aligned, source array is odd address aligned 0xXXXX1] - 49.2us
5. perform memcpy() of 2k bytes [source and destination arrays are odd address aligned both 0xXXXX1] - 13.2us
6. perform memcpy() of 2k bytes [destination and source arrays are on odd addresses - one with 0xXXXX1 and one with 0xXXXX3] - 24.8us

The DMA routines use the largest transfer unit possible. I found that when only one of the source or destination is long word aligned it is not possible to use long word transfers: the DMA transfer runs, but it looks like the DMA controller sets the LSB address bits to 0, so there is a shift in the buffer content. Therefore the routines align the arrays as well as they can and copy any extra bytes at the beginning and end 'manually'.
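The alignment handling described above can be sketched as follows. This is not the original code: copy_plan_t, plan_copy() and planned_copy() are hypothetical names, and the bulk loop is a software stand-in for the point where a real routine would program the DMA stream. The idea is that the usable width depends on the relative alignment of the two addresses, and a few head and tail bytes are copied manually.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy plan; not taken from the original routines. */
typedef struct {
    size_t width;  /* transfer unit in bytes: 4, 2 or 1 */
    size_t head;   /* bytes copied manually before the bulk transfer */
    size_t units;  /* number of transfer units moved in bulk */
    size_t tail;   /* bytes copied manually after the bulk transfer */
} copy_plan_t;

copy_plan_t plan_copy(uintptr_t dst, uintptr_t src, size_t len)
{
    copy_plan_t p;
    uintptr_t diff = dst ^ src;   /* low bits that differ can never be aligned together */
    size_t misalign;

    if ((diff & 3u) == 0) {
        p.width = 4;              /* same offset within a long word */
    } else if ((diff & 1u) == 0) {
        p.width = 2;              /* offsets differ by 2: half words at best */
    } else {
        p.width = 1;              /* one odd, one even: bytes only */
    }
    misalign = (size_t)(dst & (uintptr_t)(p.width - 1));
    p.head = (p.width - misalign) & (p.width - 1);
    if (p.head > len) {
        p.head = len;
    }
    p.units = (len - p.head) / p.width;
    p.tail = (len - p.head) - (p.units * p.width);
    return p;
}

void *planned_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;
    copy_plan_t p = plan_copy((uintptr_t)d, (uintptr_t)s, len);
    size_t bulk = p.units * p.width;
    size_t i;

    for (i = 0; i < p.head; i++) {
        *d++ = *s++;              /* manual head bytes */
    }
    /* After the head, both addresses must be multiples of the chosen width. */
    assert(bulk == 0 || (((uintptr_t)d % p.width) == 0 && ((uintptr_t)s % p.width) == 0));
    for (i = 0; i < bulk; i++) {
        d[i] = s[i];              /* stand-in: a real routine would start the DMA here */
    }
    d += bulk;
    s += bulk;
    for (i = 0; i < p.tail; i++) {
        *d++ = *s++;              /* manual tail bytes */
    }
    return dst;
}
```

With 2k byte buffers this plan gives long word transfers for cases 2 and 5 (case 5 after 3 manual head bytes), half word transfers for case 6 (after 1 head byte), and byte transfers for cases 3 and 4, which matches the grouping of the measured times.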

The results show that when the arrays are long word aligned, or can be manipulated to use long word transfers (cases 2 and 5), the DMA operation is about 7.5 times faster than the SW operation.
When only byte transfers are possible by DMA (cases 3 and 4) it is 1.9x faster.
When 16 bit copies can be arranged (case 6) it is 3.9x faster.
The memset() speed increase is smaller (62us down to 12.8us, about 4.8x) since the SW operation is faster anyway.
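The ratios can be checked, and converted into clock cycles per transfer, with a small helper. This is only a back-of-envelope sketch: speedup() and cycles_per_transfer() are hypothetical names, the inputs are the times listed above, and DMA setup time and the manually copied head and tail bytes are assumed to be negligible.

```c
/* Hypothetical helpers for working with the measured times above. */

/* Ratio of the SW time to the DMA time, both in microseconds. */
double speedup(double sw_us, double dma_us)
{
    return sw_us / dma_us;
}

/* Average clock cycles spent per transfer unit.
 * time_us   - measured duration in microseconds
 * clock_mhz - core clock in MHz (168 on the board described above)
 * bytes     - total bytes moved
 * width     - transfer unit in bytes (4, 2 or 1)
 */
double cycles_per_transfer(double time_us, double clock_mhz,
                           unsigned bytes, unsigned width)
{
    double cycles = time_us * clock_mhz;            /* us * MHz = cycles */
    double transfers = (double)bytes / (double)width;
    return cycles / transfers;
}
```

For example, cycles_per_transfer(12.8, 168.0, 2048, 4) comes to 4.2 cycles per long word, cycles_per_transfer(24.8, 168.0, 2048, 2) to about 4.1 per half word, and cycles_per_transfer(49.2, 168.0, 2048, 1) to about 4.0 per byte. If those assumptions hold, the DMA cost looks roughly constant per transfer whatever the width, so the gain comes almost entirely from moving more bytes per transfer.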

Questions:
- Is it true that there is no way around having to use 16 or 8 bit transfers when the arrays' boundaries don't work out? The ARM itself can access mis-aligned long words (maybe with some speed reduction) but is the DMA restricted?
- Does anyone know of a method to get more performance out of the technique (or more consistent performance, independent of alignment)?

Regards

Mark

www.uTasker.com
