anonymous.8
Senior II
February 7, 2017
Question

Does DMA really help speed things up with a serial communications peripheral

Posted on February 07, 2017 at 18:23

Hi all, I am using DMA with the STM32F7 SDMMC peripheral to read 4K buffers from the SD card. The basic SDMMC SDReadMultiBlocks function is called from disk_read, which is called from Elm-Chan's FatFs f_read.

Since, when reading from a file, we need to access the sectors and buffers sequentially, we have to wait until each DMA filled buffer transfer is complete before we move on to retrieve the next one. Therefore we use the SDWaitReadOperation() function to wait until the DMA transfer is complete.

Therein lies the problem. The SDWaitReadOperation() obviously waits in a spin loop, basically wasting CPU cycles, while we wait for the DMA transfer to complete. But the transfer elapsed time is not decided by CPU speed, RAM speed or DMA transfer speed, because the flow controller is the SDMMC peripheral and its transfer speed is determined by the SD bus speed.

So unless you can actually do some useful other work in the spin loop inside SDWaitReadOperation(), I can't see how using DMA gives you any more transfer speed than a simple polled method where you fill your application buffer as the SDMMC peripheral's FIFO is able to provide data.

I am not using an OS, just the usual while (true) {} task loop.

I looked at the possible DMA interrupts to signal when the DMA transfer is half complete, so I could start doing some useful work with the first half of the application buffer while the DMA is filling the second half. But surely that relies on the DMA knowing the size of the transfer to begin with, via the DMA stream's NDTR register. And because the SDMMC is the flow controller, not the DMA, we set NDTR to 0.

So in that case, would the DMA actually signal a half complete interrupt at all, because it basically doesn't know what "full" means, let alone half full?

The same sort of elapsed time issue surely applies to using DMA with ANY serial communications peripheral e.g., USART, I2C, SPI etc. - the transfer rate is determined by the bus clock rate, not by the DMA transfer, so if you have to wait in a spin loop until the DMA transfer completes, you might just as well have used a simple polled method to begin with.

Am I missing something here? What are your thoughts?

#dma

3 replies

Tesla DeLorean
Guru
February 7, 2017
Posted on February 07, 2017 at 19:41

Don't use spin loops then?

ST uses them to simplify the demonstration of functionality. If you want an optimized driver stack you're going to have to invest a significant amount of time/effort into that. A single thread with serialized execution is about what most people can deal with, so the demo code targets that demographic.

In a non-OS implementation the spin loops could pump a buffer processing task, i.e. process the data you got last time, or generate the next set of data, do a millisecond or so of work and leave.
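As a rough illustration of that idea, here is a minimal host-side sketch of a wait loop that runs short slices of work instead of spinning idle. All names (`wait_read_cooperative`, `checksum_slice`, `transfer_is_done`, `sim_polls_left`) are hypothetical and not part of ST's driver; the countdown stands in for the flag a real DMA/SDMMC transfer-complete interrupt would set.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-in for the hardware: the "transfer" finishes after this
   many polls. On a target, an interrupt handler would set a flag instead. */
static uint32_t sim_polls_left;

static bool transfer_is_done(void)
{
    if (sim_polls_left == 0u) {
        return true;
    }
    sim_polls_left--;
    return false;
}

/* Hypothetical job: checksum the buffer received by the previous transfer. */
typedef struct {
    const uint8_t *data;
    size_t         len;
    size_t         pos;
    uint32_t       checksum;
} prev_buffer_job;

/* One short slice of work (up to 64 bytes). Returns true if it did anything. */
static bool checksum_slice(void *ctx)
{
    prev_buffer_job *job = (prev_buffer_job *)ctx;
    size_t end = job->pos + 64u;

    if (job->pos >= job->len) {
        return false;                 /* nothing left to process */
    }
    if (end > job->len) {
        end = job->len;
    }
    while (job->pos < end) {
        job->checksum += job->data[job->pos++];
    }
    return true;
}

/* Replacement for a bare spin loop: run work slices until the transfer ends.
   Returns how many slices actually did work while waiting. */
static uint32_t wait_read_cooperative(bool (*slice)(void *), void *ctx)
{
    uint32_t useful = 0u;

    while (!transfer_is_done()) {
        if (slice != NULL && slice(ctx)) {
            useful++;
        }
    }
    return useful;
}
```

Each slice should stay short, since the loop only notices the end of the transfer between slices.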

SDIO is complicated because it has a synchronous clock that you can't stall. The FIFO masks some of this, but it is still more effective to use DMA over polling and sending data manually via the AHB/APB buses. Speed is most often dictated by the card, the micro inside it, and its progression through command processing and array management. With DMA the bus and CPU are available to do other processing of the data; whether you do something useful with that or burn the cycles is really up to you. If done correctly you can drop the CPU clock speed significantly; the advantages at 200 MHz are likely more fractional.

USART RX DMA in most cases seems to be fraught with issues, though it might work well if data is constantly streaming. With a buffering scheme, USART TX DMA allows the IRQ loading to be decimated.

The general dumbness of most of the peripherals means they need a lot more baby-sitting, so complicating the driver stack is one of those things where you'll have to balance the investment vs return.

anonymous.8
Senior II
February 7, 2017
Posted on February 07, 2017 at 19:51

Clive, thanks for that.

What are your thoughts about using the DMA half complete flag to start processing the first half of the application buffer once it has been received in the spin loop? Am I correct in thinking that won't work with SDIO/SDMMC because the DMA doesn't know when half full, or even full, has been reached because we don't tell it the size of the data transfer?

On that point, although you have said in the past that we don't set the DMA transfer size with peripheral flow control, would there be any harm if we did set the size anyway, and in that case, would the DMA then set the half done flag?

T J
Senior III
February 7, 2017
Posted on February 07, 2017 at 22:00

I set the UART receive DMA to a circular buffer of just 2 bytes.

This way I receive a DMA interrupt for each byte (the half-complete and complete DMA interrupt callbacks).

It works very well and fast.

I believe the circular buffer size is used by the hardware to set the half-complete and fully complete interrupts.
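A minimal host-side sketch of that two-byte scheme, assuming the half-complete event fires once the first slot is written and the complete event once the second is. The handler names and `sim_receive` are hypothetical; in ST's HAL the equivalent hooks would be `HAL_UART_RxHalfCpltCallback` and `HAL_UART_RxCpltCallback`.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-byte circular DMA target: slot 0 fills at "half complete",
   slot 1 fills at "transfer complete", then the DMA wraps around. */
static volatile uint8_t dma_rx[2];

/* Application-side storage for received bytes (hypothetical). */
static uint8_t line[32];
static size_t  line_len;

static void store_byte(uint8_t b)
{
    if (line_len < sizeof line) {
        line[line_len++] = b;
    }
}

/* Hypothetical handlers for the two DMA events. */
static void on_rx_half_complete(void) { store_byte(dma_rx[0]); }
static void on_rx_complete(void)      { store_byte(dma_rx[1]); }

/* Host-side simulation of the DMA writing bytes and raising both events. */
static void sim_receive(const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dma_rx[i % 2u] = src[i];
        if (i % 2u == 0u) {
            on_rx_half_complete();
        } else {
            on_rx_complete();
        }
    }
}
```

This gives one interrupt per byte, so it trades the interrupt savings of DMA for simple, immediate access to each character.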

Danish1
Lead III
February 8, 2017
Posted on February 08, 2017 at 11:44

As Clive said, a multi-threaded application allows you to get on with other tasks while one thread is waiting on a peripheral to complete. Peripherals are, in general, very slow compared with the processor's ability to throw bytes around.

For my application with UART reception, I want to avoid loss of data much more than I worry about speed of responding to incoming data.

I also don't like the way the stm32f4 USART peripheral seems to pulse RTS on reception of every character (I guess for that short time interval between the character fully arriving in the USART and the DMA taking it away).

I have a large circular buffer, with data coming in by DMA but going out by polling (typically every 10 ms) and on each poll I process all the data sitting in the buffer. I can see how much has arrived without having to stop the incoming DMA by looking at USARTn_RX_DMA_STREAM->NDTR.
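The NDTR arithmetic can be sketched as follows, assuming a circular-mode stream whose NDTR counts down from the buffer size and reloads on wrap. The type and function names and the 256-byte size are illustrative only; the `ndtr` argument would be read from the stream's NDTR register on a real target.

```c
#include <stdint.h>

#define RX_BUF_SIZE 256u   /* illustrative size, not from the post */

typedef struct {
    uint8_t  buf[RX_BUF_SIZE];   /* DMA writes here in circular mode */
    uint32_t read_idx;           /* next byte the poller will consume */
} rx_ring;

/* NDTR holds the number of transfers remaining before the wrap,
   so the DMA's current write position is size minus NDTR. */
static uint32_t rx_write_index(uint32_t ndtr)
{
    return (RX_BUF_SIZE - ndtr) % RX_BUF_SIZE;
}

/* Bytes received but not yet consumed, without stopping the DMA. */
static uint32_t rx_available(const rx_ring *r, uint32_t ndtr)
{
    return (rx_write_index(ndtr) + RX_BUF_SIZE - r->read_idx) % RX_BUF_SIZE;
}

/* Copy up to max pending bytes into out; returns the number copied. */
static uint32_t rx_drain(rx_ring *r, uint32_t ndtr, uint8_t *out, uint32_t max)
{
    uint32_t n = rx_available(r, ndtr);

    if (n > max) {
        n = max;
    }
    for (uint32_t i = 0u; i < n; i++) {
        out[i] = r->buf[r->read_idx];
        r->read_idx = (r->read_idx + 1u) % RX_BUF_SIZE;
    }
    return n;
}
```

With index arithmetic alone, a completely full buffer is indistinguishable from an empty one, which is one reason to poll well before the buffer fills.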

In order to satisfy my whim about RTS, I use DMA transfer-complete and half-complete interrupts to see how I'm getting on with pulling data out of the buffer. If (at the instant of the interrupt) the buffer is more than half-full (i.e. we won't get another interrupt before the buffer overflows) then I negate RTS. And I also flag my receive polling thread as overdue.

I re-assert RTS as appropriate from my routine that takes characters out of the buffer.

I know this wastes roughly half the UART reception buffer, but that's one of the design compromises one has to make.

As to SDIO, in my application it only makes sense to process a sector at a time, so I simply use the DMA transfer-complete interrupt.

 - Danish

AVI-crak
Senior
February 8, 2017
Posted on February 08, 2017 at 12:00

Instead of waiting for each read to complete, use linear predictive reading (read-ahead). If the address of the next outside call matches the block that was read ahead, use that block. If it does not match, start a new read cycle at the new address.

It's almost a buffer. It works well when reading a stream.
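One possible reading of this suggestion, as a host-side sketch: keep one prefetched sector, serve a request from it when the sector number matches, and otherwise restart at the new address. Every name here is hypothetical, and `card_read_sector` is a synchronous stand-in; on a real target the prefetch would start a DMA transfer and return immediately.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

typedef struct {
    uint8_t  data[SECTOR_SIZE];
    uint32_t sector;   /* which sector data[] holds */
    bool     valid;    /* false until the first prefetch */
} prefetch_slot;

static uint32_t card_reads;   /* counts simulated card accesses */

/* Stand-in for the card: fills the sector with the low byte of its number. */
static void card_read_sector(uint32_t sector, uint8_t *dst)
{
    memset(dst, (int)(sector & 0xFFu), SECTOR_SIZE);
    card_reads++;
}

/* Read the predicted next sector into the slot. */
static void prefetch_start(prefetch_slot *p, uint32_t sector)
{
    card_read_sector(sector, p->data);
    p->sector = sector;
    p->valid  = true;
}

/* Returns true when the request was served from the prefetched block. */
static bool read_sector_predictive(prefetch_slot *p, uint32_t sector, uint8_t *dst)
{
    bool hit = p->valid && p->sector == sector;

    if (hit) {
        memcpy(dst, p->data, SECTOR_SIZE);   /* prediction matched */
    } else {
        card_read_sector(sector, dst);       /* restart at the new address */
    }
    prefetch_start(p, sector + 1u);          /* predict sequential access */
    return hit;
}
```

A sequential reader hits on every call after the first, while each random access costs one wasted prefetch.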

When many processes are reading from the card at once, things start to slow down. In that case a buffer is needed for each process, each of that size.

Chan's driver was created when there was not enough memory; it is now possible to circumvent this limitation.