Optimum data transfers, DMA vs CPU (detailed technical question)

I'm trying to understand the best method for copying data in my application. This is specifically for memory-to-memory transfers on the STM32F103xx series MCUs, but I'm also interested in memory-to-peripheral, peripheral-to-memory, and peripheral-to-peripheral transfers, and in other STM chips.

For lack of a better term I'll call my use-case "Synchronous DMA". Rather than initializing a DMA request and doing other processing while it completes, my application has nothing else to do and in fact can't proceed until the data transfer is finished. Given this situation, I see several possible approaches:

1. Don't use DMA. Execute a tight loop copying 32-bit words from source to destination, coded in C/C++ or assembly, possibly using the ARM "load multiple" (LDM) and "store multiple" (STM) instructions. Or is DMA always faster than the CPU?
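For reference, approach 1 could look roughly like the sketch below. The function name `copy_words` is made up for illustration, and it assumes word-aligned, non-overlapping buffers. Moving four words per iteration only gives the compiler the opportunity to emit LDM/STM; whether it does depends on the compiler and optimization level, so the generated assembly should be checked.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `count` 32-bit words from src to dst.
   Assumes the buffers are word-aligned and do not overlap. */
void copy_words(uint32_t *dst, const uint32_t *src, size_t count)
{
    /* Main loop: four words per iteration, so the compiler may
       combine the loads and stores into LDM/STM (not guaranteed). */
    while (count >= 4) {
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        src += 4;
        dst += 4;
        count -= 4;
    }
    /* Tail: the remaining zero to three words, one at a time. */
    while (count > 0) {
        *dst++ = *src++;
        count--;
    }
}
```

A call such as `copy_words(dest_buf, src_buf, 64)` would copy 256 bytes.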

2. Start a DMA transfer and wait for it to complete in a "busy loop":

DMA1_Channeln->CCR |= DMA_CCR_MEM2MEM | DMA_CCR_EN;  /* start transfer (EN starts the channel) */
while (!(DMA1->ISR & DMA_ISR_TCIFn)) asm("nop");     /* poll the transfer-complete flag */

 Will this loop slow down the DMA? I find the ST documentation unclear on whether the DMA is "interleaved" vs "cycle stealing". RM0008 states:

The DMA controller performs direct memory transfer by sharing the system bus with the Cortex®-M3 core. The DMA request may stop the CPU access to the system bus for some bus cycles, when the CPU and DMA are targeting the same destination (memory or peripheral)

Does this mean that the CPU never slows down DMA (only the reverse, DMA slows CPU)? And if CPU and DMA are reading different addresses (as would be the case here) there's no contention anyway?
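To make approach 2 concrete, here is a minimal sketch of the polling step with a bounded spin count, so a misconfigured channel cannot hang the program forever. The helper names `dma_tcif_mask` and `dma_wait_complete` and the `max_spins` parameter are hypothetical. The registers are passed as pointers so the logic can also be exercised on a host; on the target one would presumably pass `&DMA1->ISR` and `&DMA1->IFCR`. The bit layout (four flag bits per channel, TCIF second) follows the DMA_ISR description in RM0008.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask of the transfer-complete flag (TCIFn) for channel 1..7.
   In DMA_ISR and DMA_IFCR each channel owns four bits:
   GIF, TCIF, HTIF, TEIF, starting at bit 4 * (channel - 1). */
static inline uint32_t dma_tcif_mask(unsigned channel)
{
    return 1u << (4u * (channel - 1u) + 1u);
}

/* Poll *isr until TCIF for `channel` is set, or give up after
   `max_spins` polls. On success the flag is cleared by writing 1
   to the same bit position in *ifcr. */
bool dma_wait_complete(volatile uint32_t *isr, volatile uint32_t *ifcr,
                       unsigned channel, uint32_t max_spins)
{
    const uint32_t mask = dma_tcif_mask(channel);
    while (max_spins-- > 0u) {
        if (*isr & mask) {
            *ifcr = mask;   /* write-1-to-clear */
            return true;
        }
    }
    return false;           /* timed out */
}
```

This does not settle whether the polling reads slow the DMA down; it only shows the flag handling.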

3. Similar to #2, but enable a DMA "transfer complete" interrupt and use a WFI instruction instead of NOP in the loop:

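volatile int transfer_complete = 0;  /* volatile: written by the handler, read by the wait loop */

void interrupt_handler() { DMA1->IFCR = DMA_IFCR_CTCIFn; transfer_complete = 1; }  /* clear flag, then signal */

    /* main code */
    transfer_complete = 0;
    DMA1_Channeln->CCR |= DMA_CCR_MEM2MEM | DMA_CCR_EN;  /* start transfer (EN starts the channel) */
    while (!transfer_complete) asm("wfi");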

All the ARM docs I've read say that WFI is a "hint" instruction -- that the CPU can treat it as NOP. Would that happen here and make this potentially as slow as #2? Or would the WFI halt CPU execution, keeping it off the bus and any possible interference with DMA accesses?
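A minimal sketch of the approach 3 wait loop follows. The names `wait_for_interrupt`, `dma_transfer_complete_handler` and `wait_for_transfer` are made up for illustration, and the WFI is compiled in only on 32-bit ARM targets so the same code builds on a host. Because the loop re-reads the flag after every wake-up, it stays correct whether WFI actually sleeps or behaves as a NOP; only the bus activity differs.

```c
#include <stdint.h>

/* Set by the transfer-complete handler. Declared volatile so the
   wait loop re-reads it from memory on every iteration. */
static volatile int transfer_complete = 0;

/* Sleep until an interrupt on 32-bit ARM targets; a no-op elsewhere
   so the sketch also compiles on a host. */
static inline void wait_for_interrupt(void)
{
#if defined(__arm__) || defined(__thumb__)
    __asm volatile ("wfi" ::: "memory");
#endif
}

/* Hypothetical handler body. A real handler must also clear the
   channel's TCIF flag through DMA_IFCR, otherwise the interrupt
   fires again as soon as the handler returns. */
void dma_transfer_complete_handler(void)
{
    transfer_complete = 1;
}

/* Block until the handler has signalled completion. */
void wait_for_transfer(void)
{
    while (!transfer_complete) {
        wait_for_interrupt();
    }
}
```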

I can, and probably will, do some experiments to see which of these (or other) methods is the fastest. I do a lot of that kind of reverse engineering of ST products. :(  But I'm hoping someone here knows the answers and can provide some insights.

As a bonus question, are there any restrictions on DMA access to the USB PMA (packet memory area) in the STM32F103 series?

10 REPLIES

Well, the processor fetches instructions over a different bus, through the flash accelerator. It can fetch one instruction per cycle without loading the buses connected to the bus matrix. The core also has read and write queues (for pipelining). I also think the DMA controller needs time to arbitrate between its channels, and then there is arbitration between bus masters.