Optimum data transfers, DMA vs CPU (detailed technical question)

I'm trying to understand the best method for copying data in my application. This is specifically for memory-to-memory transfers on the STM32F103xx series MCUs, but I'm also interested in memory-to-peripheral, peripheral-to-memory, and peripheral-to-peripheral transfers, and in other STM chips.

For lack of a better term I'll call my use-case "Synchronous DMA". Rather than initializing a DMA request and doing other processing while it completes, my application has nothing else to do and in fact can't proceed until the data transfer is finished. Given this situation, I see several possible approaches:

1. Don't use DMA. Execute a tight loop copying 32-bit words from source to dest, coded in C/C++ or assembly, possibly using the ARM "load multiple" and "store multiple" instructions. Or is DMA always faster than the CPU?
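For concreteness, the plain-C version of that loop would be something like this (my sketch; it assumes word-aligned buffers and a byte count that's a multiple of 4):

static void copy_words(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / 4; i++)
        dst[i] = src[i];   /* one 32-bit word per iteration */
}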

2. Start a DMA transfer and wait for it to complete in a "busy loop":

DMA1_Channel1->CCR |= DMA_CCR_EN;  /* enabling the channel starts the transfer (channel 1 for illustration) */
while (!(DMA1->ISR & DMA_ISR_TCIF1)) asm("nop");
DMA1->IFCR = DMA_IFCR_CTCIF1;      /* clear the flag for next time */
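(For completeness, the channel configuration that would precede that snippet, in CMSIS terms -- my sketch, assuming DMA1 channel 1 and 32-bit transfers, with src/dst/n_words standing in for your buffers and word count. In memory-to-memory mode the transfer starts as soon as EN is set, which is why CCR is written last:)

RCC->AHBENR |= RCC_AHBENR_DMA1EN;       /* clock the DMA controller          */
DMA1_Channel1->CPAR  = (uint32_t)src;   /* "peripheral" address = source     */
DMA1_Channel1->CMAR  = (uint32_t)dst;   /* memory address = destination      */
DMA1_Channel1->CNDTR = n_words;         /* number of 32-bit transfers        */
DMA1_Channel1->CCR   = DMA_CCR_MEM2MEM  /* memory-to-memory mode             */
                     | DMA_CCR_PSIZE_1  /* 32-bit source size                */
                     | DMA_CCR_MSIZE_1  /* 32-bit destination size           */
                     | DMA_CCR_PINC | DMA_CCR_MINC;  /* increment both sides */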

Will this loop slow down the DMA? I find the ST documentation unclear on whether the DMA is "interleaved" or "cycle-stealing". RM0008 states:

The DMA controller performs direct memory transfer by sharing the system bus with the Cortex®-M3 core. The DMA request may stop the CPU access to the system bus for some bus cycles, when the CPU and DMA are targeting the same destination (memory or peripheral).

Does this mean that the CPU never slows down DMA (only the reverse, DMA slows the CPU)? And if the CPU and DMA are reading different addresses (as would be the case here), is there no contention anyway?

3. Similar to #2, but enable a DMA "transfer complete" interrupt and use a WFI instruction instead of NOP in the loop:

volatile int transfer_complete = 0;  /* volatile: written by the ISR, polled by main */
 
void DMA1_Channel1_IRQHandler(void)
{
    DMA1->IFCR = DMA_IFCR_CTCIF1;    /* clear the flag, or the IRQ refires */
    transfer_complete = 1;
}
 
    /* main code */
    transfer_complete = 0;
    DMA1_Channel1->CCR |= DMA_CCR_EN;  /* enabling the channel starts the transfer */
    while (!transfer_complete) asm("wfi");
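(For this to work, the channel's transfer-complete interrupt must have been enabled beforehand -- again my sketch, same DMA1 channel 1 assumption as above:)

DMA1_Channel1->CCR |= DMA_CCR_TCIE;   /* transfer-complete interrupt enable */
NVIC_EnableIRQ(DMA1_Channel1_IRQn);   /* and unmask it in the NVIC          */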

All the ARM docs I've read say that WFI is a "hint" instruction -- that the CPU can treat it as a NOP. Would that happen here and make this potentially as slow as #2? Or would the WFI halt CPU execution, keeping it off the bus and avoiding any possible interference with DMA accesses?

I can, and probably will, do some experiments to see which of these (or other) methods is the fastest. I do a lot of that kind of reverse engineering of ST products. :(  But I'm hoping someone here knows the answers and can provide some insights.

As a bonus question, are there any restrictions on DMA access to the USB PMA (packet memory area) in the STM32F103 series?

MikeDB
Lead

The 'normal' arrangement in computers would be for DMA to take precedence over the CPU on a per-cycle basis, so I can't believe ST has changed this. The DMA will get a byte, and then the CPU will get the next one.

Most people would use a plain

while (!transfer_complete) {}

(with transfer_complete declared volatile). I'm not sure there's any need for assembly code here.

Or, even better, start preparing the next transfer (or doing some other work) in a working buffer, and only copy that buffer into the DMA memory once the current transfer is complete.
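Something like this ping-pong pattern, for example (just a sketch; prepare_block() and start_dma_copy() are illustrative placeholders, and transfer_complete is the volatile flag from above):

uint32_t buf[2][16];
prepare_block(buf[0]);                /* fill the first buffer                */
int active = 0;
for (;;) {
    start_dma_copy(buf[active]);      /* clears the flag, starts the channel  */
    prepare_block(buf[active ^ 1]);   /* CPU fills the other buffer meanwhile */
    while (!transfer_complete) {}     /* then wait for the DMA to finish      */
    active ^= 1;                      /* swap roles and repeat                */
}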

S.Ma
Principal

If you're only thinking of memory-to-memory transfers for the DMA, then I would simply use the core to do the job.

HW assist is in general justified by latency, critical response time, or a specific power budget. Using the core, the code remains portable and easier to maintain.

If some of the block transfers are less than 10 bytes, then programming the DMA registers might take more time than doing the job directly...

The real thing to master in embedded (and most SW tools can't help much there) is to estimate the worst-case interrupt latencies and durations, then relax these with HW assist, and finally find the lowest system clock frequency that minimizes power.

turboscrew
Senior III

I think "the same destination memory" is a general expression. In F103 there is only one RAM, but in other chips there there can be more. In F427 there are three. They are all different slaves, and as such, have different connection to busmatrix. Different masters can access different slaves at the same time. If different masters access the same slave, they need to be arbitrated. One gets there while the other waits.

I recall that DMA is usually a bit slower, and is thus useful only if the core has something else to do.

I guess DMA is always kind of "cycle-stealing". I think it never blocks the core away totally for the whole transfer. There is sense to it, because the system is not built around a single central bus: the processor gets instructions via a specific bus between the core and the flash accelerator (cache/prefetch queue), and the DMACs have dedicated buses to some peripherals, avoiding accesses via the bus matrix. Also, the core can access the same slave the DMAC is using during the time slots when the DMAC is doing its internal business.

You might be interested in this: https://www.st.com/content/ccc/resource/technical/document/application_note/47/41/32/e8/6f/42/43/bd/CD00160362.pdf/files/CD00160362.pdf/jcr:content/translations/en.CD00160362.pdf

Interleaved memory access by CPU and DMA is the classic architecture, often with no impact on the CPU because the DMA only uses the bus part of the time. I'm just looking for confirmation/details on how the ST chips do it.

Assembly code would be appropriate mainly for my case #1, where the CPU is doing the transfer without using any DMA capabilities. I'd be looking for optimizations the compiler didn't generate, like the ARM opcodes that load/store multiple registers in one instruction. But again, I don't know whether they provide any speed boost, as the process might be limited by memory bandwidth rather than instruction execution.
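For the record, the kind of thing I have in mind for case #1 (a sketch using GCC inline assembly on the Cortex-M3; it assumes word-aligned buffers and a byte count that's a multiple of 16):

static void copy_ldm(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / 16; i++) {
        __asm__ volatile(
            "ldmia %0!, {r4-r7}\n\t"   /* load 4 words, post-increment src  */
            "stmia %1!, {r4-r7}\n\t"   /* store 4 words, post-increment dst */
            : "+r"(src), "+r"(dst)
            :
            : "r4", "r5", "r6", "r7", "memory");
    }
}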

In my use-case I can't prepare the next buffer while DMA is transferring the first. The application needs to read the data in the first before deciding whether, and what, to transfer next.

Yes, the complications involved in using DMA are only justified if the application needs the speed boost it provides. DMA certainly does make the code less portable.

I should have mentioned that my transfers are at most 64 bytes in length, and often shorter. I do realize there's a point where the number of CPU instructions to do the loop (or unroll it) is less than that required to set up (and wait for) DMA. That's part of the cost/benefit equation I'm trying to solve.

Thanks. I'll go through the AN2548 DMA Application Note again to see if I can understand it better. I did know that in general there are different memories, busses and bus masters, etc.

Very interesting that you say: "I recall that DMA is usually a bit slower, and is thus useful only if the core has something else to do." I have always thought that it was faster -- that the DMA controller does data reads and writes, driven by a hardware state machine, at the maximum speed of the bus, compared to the CPU, which has to do at least instruction decode and execute (if not also fetch) between each read and write. Again, as per the comments above, this is how DMA can do its work without impacting the CPU on the same bus (because it's using bus cycles that the CPU isn't using anyway).

A microcontroller is hardly a data router; sooner or later the data being moved around will have to be processed, and that will be done by the core.

Just make the transfer with some added value... Note that DMA can't access CCM memory on the chips that have it, so you may just be increasing your SW complexity and the debug time of your projects (not measured in microseconds here)...

Anyway, I would create a memcpy-style function with both DMA and core options, so the choice stays open and things can move along.
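Something along these lines (a sketch only; DMA_THRESHOLD is a guess to be calibrated by measurement, and dma_mem2mem_copy() is a hypothetical wrapper around the channel setup shown earlier):

#define DMA_THRESHOLD 32u   /* bytes; below this the DMA setup cost likely dominates */

void fast_copy(void *dst, const void *src, size_t nbytes)
{
    if (nbytes < DMA_THRESHOLD)
        memcpy(dst, src, nbytes);             /* core handles short blocks */
    else
        dma_mem2mem_copy(dst, src, nbytes);   /* hypothetical DMA path     */
}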

I'm sure someone has more complex examples where unrolling made an ARM processor faster, but the only time I've found it helps is when summing all the elements of a small-to-medium-sized array, or similar simple tasks. Conversely, on Intel, unrolling often helps.