
Does DMA really help speed things up with a serial communications peripheral

anonymous.8
Senior II
Posted on February 07, 2017 at 18:23

Hi all, I am using DMA with the STM32F7 SDMMC peripheral to read 4K buffers from the SD card. The basic SDMMC SDReadMultiBlocks function is called from disk_read, which is called from Elm-Chan's FatFs f_read.

Since, when reading from a file, we need to access the sectors and buffers sequentially, we have to wait until each DMA filled buffer transfer is complete before we move on to retrieve the next one. Therefore we use the SDWaitReadOperation() function to wait until the DMA transfer is complete.

Therein lies the problem. The SDWaitReadOperation() obviously waits in a spin loop, basically wasting CPU cycles, while we wait for the DMA transfer to complete. But the transfer elapsed time is not decided by CPU speed, RAM speed or DMA transfer speed, because the flow controller is the SDMMC peripheral and its transfer speed is determined by the SD bus speed.

So unless you can actually do some useful other work in the spin loop inside SDWaitReadOperation(), I can't see how using DMA gives you any more transfer speed than a simple polled method where you fill your application buffer as the SDMMC peripheral's FIFO is able to provide data.

I am not using an OS, just the usual while (true) {} task loop.

I looked at the possible DMA interrupts to signal when the DMA transfer is half complete, so that I could start doing some useful work with the first half of the application buffer while the DMA fills the second half. But surely that relies on the DMA knowing the size of the transfer to begin with, via the DMA stream's NDTR register. Because the SDMMC is the flow controller, not the DMA, we set NDTR to 0.

So in that case, would the DMA actually signal a half-complete interrupt at all, given that it basically doesn't know what "full" means, let alone half full?

The same sort of elapsed time issue surely applies to using DMA with ANY serial communications peripheral e.g., USART, I2C, SPI etc. - the transfer rate is determined by the bus clock rate, not by the DMA transfer, so if you have to wait in a spin loop until the DMA transfer completes, you might just as well have used a simple polled method to begin with.

Am I missing something here? What are your thoughts?

#dma
7 REPLIES
Posted on February 07, 2017 at 19:41

Don't use spin loops then?

ST uses them to simplify the demonstration of functionality. If you want an optimized driver stack you're going to have to invest a significant amount of time/effort into that. Having a singular thread, with serialized execution is about what most people can deal with, so the demo code targets that demographic.

In a non-OS implementation the spin loops could pump a buffer processing task, i.e. process the data you got last time, or generate the next set of data, do a millisecond or so of work and leave.
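A minimal sketch of that idea, assuming a completion flag set from the interrupt handler. The names `transfer_done`, `wait_for_transfer` and `sim_work_slice` are hypothetical and not part of ST's library. The simulated slice only stands in for the interrupt so the example runs on a PC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag: in a real driver the DMA/SDMMC interrupt handler sets it. */
static volatile bool transfer_done = false;

/* One bounded slice of background work; it should return quickly. */
typedef void (*work_slice_fn)(void);

/* Wait for the transfer while running short work slices instead of idling.
   Returns the number of slices executed, or -1 if max_slices ran out. */
static int32_t wait_for_transfer(work_slice_fn do_slice, int32_t max_slices)
{
    assert(do_slice != 0);
    int32_t slices = 0;
    while (!transfer_done) {
        if (slices >= max_slices) {
            return -1;              /* treat as a timeout */
        }
        do_slice();                 /* e.g. process the previous buffer */
        slices++;
    }
    transfer_done = false;          /* re-arm for the next transfer */
    return slices;
}

/* Host-side stand-in: pretends the interrupt fires after a few slices. */
static int32_t sim_slices_until_irq = 3;

static void sim_work_slice(void)
{
    if (--sim_slices_until_irq == 0) {
        transfer_done = true;
    }
}
```

On the target, the slice function would do roughly a millisecond of real work and return, so the wait loop notices completion with bounded latency.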

SDIO is complicated because it has a synchronous clock that you can't stall. The FIFO masks some of this, but it is still more effective to use DMA than to poll and move the data manually over the AHB/APB buses. Speed is most often dictated by the card, the micro inside it, and its progression through command processing and array management. With DMA the bus and CPU are available to do other processing of the data; whether you do something useful with that or burn the cycles is really up to you. If done correctly you can drop the CPU clock speed significantly; the advantages at 200 MHz are likely more fractional.

USART RX DMA in most cases seems to be fraught with issues, although it might work well if data is constantly streaming. With a buffering scheme, USART TX DMA allows the IRQ loading to be decimated.

The general dumbness of most of the peripherals means they need a lot more baby-sitting, so complicating the driver stack is one of those things where you'll have to balance the investment vs return.

Tips, buy me a coffee, or three.. PayPal Venmo Up vote any posts that you find helpful, it shows what's working..
Posted on February 07, 2017 at 19:51

Clive, thanks for that.

What are your thoughts about using the DMA half complete flag to start processing the first half of the application buffer once it has been received in the spin loop? Am I correct in thinking that won't work with SDIO/SDMMC because the DMA doesn't know when half full, or even full, has been reached because we don't tell it the size of the data transfer?

On that point, although you have said in the past that we don't set the DMA transfer size with peripheral flow control, would there be any harm if we did set the size anyway, and in that case, would the DMA then set the half done flag?

Posted on February 07, 2017 at 22:00

I set the UART receive DMA to a circular buffer of just 2 bytes.

This way I receive a DMA interrupt for each byte (the half-complete and complete DMA interrupt callbacks).

It works very well and fast.

I believe the hardware uses the circular buffer size to decide when to raise the half-complete and fully-complete interrupts.

Posted on February 08, 2017 at 00:05

I haven't tried setting the transfer length for SDIO, but it is a known length at the outset, and the transfer is not open-ended.

If I were building a read-streaming app, I'd likely use a pipeline, where a worker task would alternately fill one half of a ping-pong buffer, and a secondary task would consume the data. The balance would be that the read takes some fraction of the consumption time. If latency at start-up was critical, one could precharge the buffer with a smaller block of data. Data consumption would occur concurrently with the reading of the next buffer.
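The ping-pong handoff described above could be sketched roughly as follows. This is an illustration under assumptions: `pingpong_t`, `pingpong_mark_filled` and `pingpong_consume` are hypothetical names, the half size is tiny for readability, and the "consumer" just sums samples in place of real processing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HALF_SAMPLES 4u   /* tiny for illustration; a real half would be far larger */

typedef struct {
    int16_t       buf[2][HALF_SAMPLES];
    volatile bool ready[2];          /* set by the reader, cleared by the consumer */
    unsigned      next_to_consume;   /* which half the consumer expects next */
} pingpong_t;

/* Reader side: call when a half has been completely filled,
   for example from a transfer-complete interrupt handler. */
static void pingpong_mark_filled(pingpong_t *pp, unsigned half)
{
    assert(half < 2u);
    pp->ready[half] = true;
}

/* Consumer side: process the next half if it is ready.
   Returns false when the reader has not finished that half yet. */
static bool pingpong_consume(pingpong_t *pp, int32_t *sum_out)
{
    unsigned half = pp->next_to_consume;
    if (!pp->ready[half]) {
        return false;
    }
    int32_t sum = 0;
    for (size_t i = 0; i < HALF_SAMPLES; i++) {
        sum += pp->buf[half][i];     /* stand-in for mixing or playback */
    }
    *sum_out = sum;
    pp->ready[half] = false;         /* hand the half back to the reader */
    pp->next_to_consume = half ^ 1u;
    return true;
}
```

The main loop would call `pingpong_consume` on each pass and start the next read into whichever half has just been handed back.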

Danish1
Lead II
Posted on February 08, 2017 at 11:44

As Clive said, a multi-threaded application allows you to get on with other tasks while one thread is waiting on a peripheral to complete. Peripherals are, in general, very slow compared with the processor's ability to throw bytes around.

For my application with UART reception, I want to avoid loss of data much more than I worry about speed of responding to incoming data.

I also don't like the way the stm32f4 USART peripheral seems to pulse RTS on reception of every character (I guess for that short time interval between the character fully arriving in the USART and the DMA taking it away).

I have a large circular buffer, with data coming in by DMA but going out by polling (typically every 10 ms) and on each poll I process all the data sitting in the buffer. I can see how much has arrived without having to stop the incoming DMA by looking at USARTn_RX_DMA_STREAM->NDTR.
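As a rough sketch of the NDTR arithmetic: this assumes a circular-mode stream whose NDTR counts down from the programmed length and then reloads. `RX_BUF_SIZE`, `rx_write_index` and `rx_bytes_available` are hypothetical names, and the NDTR value is passed in as a plain number so the example runs off-target.

```c
#include <assert.h>
#include <stdint.h>

#define RX_BUF_SIZE 256u   /* hypothetical circular buffer length in bytes */

/* Position the DMA will write next, derived from the remaining-count register.
   Assumes NDTR was programmed with RX_BUF_SIZE and the stream is circular. */
static uint32_t rx_write_index(uint32_t ndtr)
{
    assert(ndtr <= RX_BUF_SIZE);
    return (RX_BUF_SIZE - ndtr) % RX_BUF_SIZE;
}

/* Unread bytes between the software read index and the DMA write position.
   Note: a completely full buffer is indistinguishable from an empty one. */
static uint32_t rx_bytes_available(uint32_t ndtr, uint32_t read_index)
{
    assert(read_index < RX_BUF_SIZE);
    uint32_t write_index = rx_write_index(ndtr);
    return (write_index + RX_BUF_SIZE - read_index) % RX_BUF_SIZE;
}
```

On the target the first argument would be the stream's NDTR value read at poll time; the DMA does not need to be stopped for that.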

In order to satisfy my whim about RTS, I use DMA transfer-complete and half-complete interrupts to see how I'm getting on with pulling data out of the buffer. If (at the instant of the interrupt) the buffer is more than half-full (i.e. we won't get another interrupt before the buffer overflows) then I negate RTS. And I also flag my receive polling thread as overdue.

I re-assert RTS as appropriate from my routine that takes characters out of the buffer.

I know this wastes roughly half the UART reception buffer, but that's one of the design compromises one has to make.

As to SDIO, in my application things only make sense to process a sector at a time so I simply use the DMA transfer-complete interrupt.

 - Danish

AVI-crak
Senior
Posted on February 08, 2017 at 12:00

Instead of waiting for each read to complete, use linear predictive reading (read-ahead). If the next requested address matches the predicted one, hand the already-fetched block to the caller. If it does not match, start a new read cycle at the new address.

It behaves almost like a buffer, and it works well when reading a stream.
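A minimal sketch of the read-ahead bookkeeping, with the actual card access left out. `readahead_t`, `start_prefetch` and `readahead_request` are hypothetical names; in a real driver `start_prefetch` would kick off a non-blocking DMA read rather than just record the sector number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;     /* true once a prefetch has been started */
    uint32_t sector;    /* sector held (or being fetched) in the prefetch buffer */
    uint32_t hits;
    uint32_t misses;
} readahead_t;

/* Hypothetical hook: a real driver would start a non-blocking read here. */
static void start_prefetch(readahead_t *ra, uint32_t sector)
{
    ra->sector = sector;
    ra->valid = true;
}

/* Returns true when the requested sector was predicted, so the caller can
   use the prefetched block. On a miss the caller must read the sector itself.
   Either way, the next sequential sector is queued. */
static bool readahead_request(readahead_t *ra, uint32_t sector)
{
    assert(ra != 0);
    bool hit = ra->valid && (ra->sector == sector);
    if (hit) {
        ra->hits++;
    } else {
        ra->misses++;
    }
    start_prefetch(ra, sector + 1u);
    return hit;
}
```

A sequential stream misses once and then hits on every following sector; a seek costs one miss before the prediction locks on again.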

When many processes read from the card at once, things start to slow down. In that case each process needs its own buffer of that size.

Chan's driver was written when memory was scarce; it is now possible to work around that limitation.
Posted on February 08, 2017 at 18:35

Clive, what I am trying to do is this:

I need to play back up to 4 separate sound clips located at different places in a single composite sound file.

They are uncompressed 16 bits, monophonic, at 192000 samples/second.

All sounds can be played simultaneously.

So I open the same file in read mode four times using Elm Chan's FatFS and maintain four separate file ptrs so I can seek to the appropriate place in the file for each sound clip, as needed.

Each sound clip's metadata is stored in a large data structure which contains two ping-pong 4 KByte (2K 16-bit samples) data buffers to hold the sound clip data read from the SD card. So there are 8 of those buffers in total.
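For reference, the figures above work out to 384,000 bytes/s per clip, 1,536,000 bytes/s for all four clips, and about 10.67 ms of audio per 4 KByte buffer. A small sketch of that arithmetic (the function names are hypothetical, chosen only for this illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Sustained read rate needed for a number of simultaneous streams. */
static uint32_t bytes_per_second(uint32_t sample_rate_hz,
                                 uint32_t bytes_per_sample,
                                 uint32_t streams)
{
    return sample_rate_hz * bytes_per_sample * streams;
}

/* How long one buffer lasts during playback, in microseconds (rounded down). */
static uint32_t buffer_duration_us(uint32_t buf_bytes,
                                   uint32_t bytes_per_sample,
                                   uint32_t sample_rate_hz)
{
    assert(bytes_per_sample != 0u && sample_rate_hz != 0u);
    uint64_t samples = buf_bytes / bytes_per_sample;
    return (uint32_t)(samples * 1000000u / sample_rate_hz);
}
```

With the values from the post, `bytes_per_second(192000, 2, 4)` gives 1536000 and `buffer_duration_us(4096, 2, 192000)` gives 10666, so all four clip buffers need refilling roughly every 10.7 ms.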

Then I have a conventional two channel stereo audio DAC which has two 32 bit audio ping-pong buffers which are pumped into the DAC using DMA. That part works fine.

Those two audio buffers contain conventional 16 bit left and right sound samples. They can be filled only after all four sound clip buffers have been filled from the SD card, since we then have to mix down and sometimes crossfade the original SD card samples to get down to the two left/right channels that the DAC is expecting.
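One way such a mix-down could look, as a sketch only. The routing of clips to channels is not described in the post, so the per-clip left/right gains here are an assumption, as are the names `clip_gain_t`, `mix_frame` and `GAIN_UNITY`. Gains are fixed-point, with 32768 representing 1.0.

```c
#include <assert.h>
#include <stdint.h>

#define CLIP_COUNT 4u
#define GAIN_UNITY 32768   /* fixed-point 1.0 (Q15-style scaling) */

/* Assumed per-clip routing: one gain for each output channel. */
typedef struct {
    int32_t left;
    int32_t right;
} clip_gain_t;

/* Clamp a wide accumulator to the 16-bit sample range. */
static int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Mix one mono sample from each clip into a single stereo frame. */
static void mix_frame(const int16_t clip[CLIP_COUNT],
                      const clip_gain_t gain[CLIP_COUNT],
                      int16_t *left, int16_t *right)
{
    assert(left != 0 && right != 0);
    int64_t acc_l = 0;
    int64_t acc_r = 0;
    for (uint32_t i = 0; i < CLIP_COUNT; i++) {
        acc_l += (int64_t)clip[i] * gain[i].left;
        acc_r += (int64_t)clip[i] * gain[i].right;
    }
    *left  = saturate16(acc_l / GAIN_UNITY);
    *right = saturate16(acc_r / GAIN_UNITY);
}
```

A crossfade could then be expressed by ramping one clip's gains down while ramping another's up from frame to frame.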

This is all actually working at 192000 samples/second on the STM32F745, but using polled SDIO access, as I can't yet get the SDIO to work with DMA. I have put all of those buffers in DTCM.

The other thing I have to be able to do is change very quickly, on the fly, any one of the sounds to an alternate sound clip, while any or all four of the sounds are playing. So I have to have preloaded SD card buffers for another 36 different sounds, as I can't afford to wait while I retrieve the new sound clip from the SD card. Those buffers would have to be in regular RAM. So the playback code gets pretty hairy.

And of course there are other tasks that I have to do while all this is going on, such as measuring four incoming pulses using Timer input captures, and capturing and interpreting GPS receiver NMEA data coming in on a USART port.

The thing here is, whilst I have written many multi-threaded applications on the PC using MS .NET and C#, I have only ever written single threaded round-robin type applications for microcontrollers in C.

I have already invested a lot of time in this project and am unwilling at this point to start afresh with an RTOS or such.

I am making progress, just, but still trying to find out why my SDIO DMA doesn't work.