
High performance SDIO/SDMMC driver using DMA for FreeRTOS - anyone?

anonymous.8
Senior II
Posted on March 03, 2017 at 12:55

Hi, has anyone written a high performance STM32F4/F7 SDIO/SDMMC driver using DMA for FreeRTOS?

I am currently converting my complex audio application from bare metal to run on FreeRTOS, but I am still faced with STM's poor implementation of their SDMMC driver which, although it uses DMA, blocks until the DMA transfer is complete. They have not thought that driver through very well: it really gives no better performance than a polling method, since the transfer rate is dictated by the SD card bus speed, not by the CPU/memory access speed.

The mid-level functions disk_read/disk_write and the higher level FATFS f_read/f_write which call the SDMMC driver to read/write blocks of data are expecting sequential file access reads/writes. Surely this problem must have been solved by someone before now. I am new to RTOS of any kind and don't want to have to write a decent SDMMC driver for FreeRTOS, when STM should have done a decent job in the first place.

Has anyone already got a working DMA based SDMMC card driver (for STM32F4/F7) that works efficiently with FreeRTOS?

Thanks.

#stm32f7-dma-sdio-sdmmc-requests #sdio
6 REPLIES
G A
Associate II
Posted on March 22, 2017 at 17:20

I have, and it works relatively well, but it took quite some extra work. The sample code and examples are not robust enough for any real application, as you have probably already figured out. I can only give you some relatively high-level pointers; I wish I could contribute the full source :(

Here is my experience of what you have to do, probably in this order:

1) For now stay with STM32CubeF4 Firmware Package V1.14.0 / 04-November-2016. The latest one changes many things regarding SDIO and I'm still struggling to make it work. I will do a separate post about this. I think there is a bug.

2) You need to use DMA. Change the SD_read and SD_write functions to use BSP_SD_ReadBlocks_DMA and BSP_SD_WriteBlocks_DMA.

3) The SDIO global interrupt has to have a numerically smaller priority number (i.e. a higher priority) than the SDIO DMA TX and SDIO DMA RX priorities.

4) FatFS aggressively tries to optimize and coalesce writes. This is good, especially for flash memory, but sometimes it may pass buffers that are not word aligned to the SD_read/SD_write functions. When that happens DMA will quietly corrupt your data, because DMA requires word (32-bit) aligned buffers. This may not be a deal breaker for your application if you always write files in multiples of 4 bytes. If it is a deal breaker, you need to solve it. I don't know enough about the DMA implementation; you may be able to use DMA with misaligned buffers. I have a global, aligned 512-byte buffer, and when my SD_write/SD_read function gets a request with an unaligned buffer it uses memcpy to/from the aligned buffer, one sector at a time. This is not as bad as it looks. The global 512-byte buffer can be shared for reads and writes, but then add a mutex to protect it, or just be careful not to use FatFs from different tasks. (FatFS may already have this semaphore/mutex/lock, but I added my own mutex; I didn't have the time to dig into this.)

5) Use the code at http://elm-chan.org/fsw/ff/res/app4.c to test the FatFS driver implementation. It can fail two tests, and you have to decide whether you want to fix them:

  a) It will fail the over-4GB test (a limitation that is, in theory, solved in the 1.15 firmware).

  b) It will fail the unaligned buffer test. See item (4) and decide.

Don't spend time on anything else until this runs OK, except for (a) and/or (b).

6) At this point FatFS should be working, but it will monopolize the CPU with busy waits on the DMA. Time to do the important stuff:

Rewrite HAL_SD_CheckWriteOperation and HAL_SD_CheckReadOperation. Add calls to vTaskDelay() to the busy-wait while loops in these functions, to allow other tasks to run while the DMA and the SD card state machine are waiting for completion. Each while loop in each function takes a different amount of time, so there is room here to fine-tune the delays for a good balance of performance and cooperation with other tasks. The DMA operation is very uniform and linear in the number of blocks; the times of the other operations are very variable and depend on the 'class' of the SD card and on how many sectors were written before. If maximum performance is your goal, then put instrumentation code here to figure out the optimum delays. This will consume a lot of your time, and will vary with different SD cards. Seriously, it is totally worth it for optimum performance. You want to call vTaskDelay with the biggest number possible that will not delay the current task unnecessarily.
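For the unaligned-buffer workaround in item (4), here is a minimal sketch of the bounce-buffer idea. It is not the poster's code: `dma_write_blocks`, `sd_write_bounced` and `fake_card` are hypothetical names, and the DMA call is replaced by a host-side stub so the logic can be run off-target. On real hardware the stub would be something like BSP_SD_WriteBlocks_DMA, and the shared buffer would need a mutex if several tasks use the file system.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define CARD_SECTORS 8u

/* Simulated card so the sketch runs on a host (assumption, not real hardware). */
static uint8_t fake_card[CARD_SECTORS * SECTOR_SIZE];

/* Hypothetical stand-in for a DMA block write. It rejects buffers that are
   not 32-bit aligned; per the post, real DMA would quietly corrupt the data. */
static int dma_write_blocks(const uint8_t *buf, uint32_t sector, uint32_t count)
{
    if (((uintptr_t)buf & 3u) != 0u) return -1;
    if (count > CARD_SECTORS || sector > CARD_SECTORS - count) return -1;
    memcpy(&fake_card[sector * SECTOR_SIZE], buf, count * SECTOR_SIZE);
    return 0;
}

/* One shared, word-aligned sector buffer (uint32_t storage guarantees the
   alignment). Protect it with a mutex when more than one task does I/O. */
static uint32_t bounce[SECTOR_SIZE / sizeof(uint32_t)];

int sd_write_bounced(const uint8_t *buf, uint32_t sector, uint32_t count)
{
    /* Fast path: the caller's buffer is already word aligned. */
    if (((uintptr_t)buf & 3u) == 0u) {
        return dma_write_blocks(buf, sector, count);
    }
    /* Slow path: copy through the aligned buffer, one sector at a time. */
    for (uint32_t i = 0; i < count; i++) {
        memcpy(bounce, buf + i * SECTOR_SIZE, SECTOR_SIZE);
        if (dma_write_blocks((const uint8_t *)bounce, sector + i, 1u) != 0) {
            return -1;
        }
    }
    return 0;
}
```

The read direction would mirror this: DMA into the aligned buffer first, then memcpy out to the caller's unaligned buffer.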

PS: The cube does a good job of respecting your own code in many files, but many of the changes here are in places where the cube code was not designed to be changed. This creates the problem that your changes will be overwritten every time you use the cube. My solution was to add '__weak' to the functions that I need to change and implement them in my own file, a 'patches.c' file. Then when I regenerate the cube code I just need to add '__weak' to the cube files again, and as a bonus the linker gives me an error if I forget any function.

Posted on March 22, 2017 at 18:57

Thanks for that. I already have a working SDIO I/F on the STM32F746 discovery board working with DMA. So the only extra thing I see that I need is to insert a call to vTaskDelay in the wait loop while it is waiting for the DMA to complete.

The other thing I'll eventually be looking to do is replacing Elm Chan's FATFS with the thread safe FreeRTOS+FAT file system provided on the FreeRTOS.org web site.

Elm Chan's FATFS is good but by now is a bit dated and not well suited to use in a multi-tasking system.

Posted on March 22, 2017 at 19:52

David, if I remember correctly there are 3 while loops in the HAL_SD_CheckWriteOperation call: the first one for the DMA, and one or two more that are part of the state machine for the SD card. The DMA delay is fairly predictable, the second one has a lot of variability, and the third one was always ready, no wait needed. I initially put individual counters on all the loops and a vTaskDelay(1) on every turn of the loop. I printed how many times each loop spun and then fine-tuned the actual delay numbers for each loop to maximize cooperation with other tasks without extra delay on the task doing the I/O.

BTW, I'm running OS ticks at 10 µs. With the default 1 ms OS tick you may just have to sleep 1 ms on some loops and busy wait on others. If I remember correctly it takes an average of 3.5 ms to write a sector and 1.1 ms to read a sector, but every few writes there is one that takes one or two orders of magnitude more. That is the nature of SD cards.
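The tuning approach described above (sleep inside the wait loop, count how often each loop spins, then adjust the delay) could look roughly like the sketch below. Everything here is an assumption made for illustration: `transfer_busy` stands in for whatever the HAL loop polls, `os_delay_ticks` stands in for vTaskDelay, and both are simulated so the code runs on a host.

```c
#include <stdint.h>
#include <stdbool.h>

/* Host-side simulation (hypothetical): ticks left until the transfer ends. */
static uint32_t sim_busy_ticks;

/* Stand-in for the condition the HAL wait loop polls. */
static bool transfer_busy(void) { return sim_busy_ticks > 0u; }

/* Stand-in for vTaskDelay(): on target other tasks would run meanwhile. */
static void os_delay_ticks(uint32_t ticks)
{
    sim_busy_ticks = (ticks >= sim_busy_ticks) ? 0u : sim_busy_ticks - ticks;
}

typedef struct {
    uint32_t spins;      /* loop iterations of the most recent wait */
    uint32_t max_spins;  /* worst case over successful waits, for tuning */
} wait_stats_t;

/* Wait for completion, sleeping delay_ticks per iteration instead of
   busy-waiting. Returns false if timeout_ticks elapse first. */
bool wait_transfer(uint32_t delay_ticks, uint32_t timeout_ticks, wait_stats_t *st)
{
    uint32_t waited = 0u;

    if (delay_ticks == 0u) delay_ticks = 1u;  /* keep the simulation advancing */
    st->spins = 0u;
    while (transfer_busy()) {
        if (waited >= timeout_ticks) return false;
        os_delay_ticks(delay_ticks);
        waited += delay_ticks;
        st->spins++;
    }
    if (st->spins > st->max_spins) st->max_spins = st->spins;
    return true;
}
```

With a simulated 35-tick transfer, a delay of 1 wakes the task 35 times, while a delay of 10 wakes it only 4 times but overshoots completion by 5 ticks. That is the trade-off being tuned: fewer wake-ups against extra latency on the task doing the I/O.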

Posted on June 20, 2018 at 10:20

Sorry, but I think everything needed for SDIO DMA, FATFS and RTOS is already there. And I would not use a time-delay function as a fix: instead, inform the caller that the DMA is done (e.g. via events or, as here in the SD card code, via a queue).

Based on my H7 project, with SDMMC1 and RTOS it works this way:

- if you initialize FATFS you do: if(FATFS_LinkDriver(&SD_Driver, SDPath) == 0)

- this SD_Driver is a 'driver object' which points to all the read, write and init functions (via function pointers in the driver 'object struct')

- the definition of SD_Driver is in my file called 'sd_diskio.c'

- the SD_read() function there uses a message queue, called 'SDQueueID': the DMA read function waits until the DMA has completed, i.e. it returns only once a Callback has written into this RTOS queue

- this 'DMA complete' is announced in a Callback function, called 'BSP_SD_WriteCpltCallback' (find it in 'sd_diskio.c')

So, all is there and it works fine with FatFS, RTOS etc. (if you compare/see some of the example files in ST FW demo projects and applications).
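To make the callback-plus-queue flow concrete, here is a host-runnable simulation of the pattern, not the actual sd_diskio.c code. All names are hypothetical: the one-slot `mailbox` stands in for the RTOS queue, `read_complete_callback` for the completion Callback, and `sim_hw_tick` for the DMA hardware. On target the two queue stubs would map to something like xQueueSendFromISR() and xQueueReceive() (or the CMSIS-RTOS wrappers), and the waiting task would sleep in the RTOS instead of stepping a simulation.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define SECTOR_SIZE   512u
#define READ_DONE_MSG 1u

/* ---- Host-side stand-ins (all hypothetical) ---- */
static uint8_t  fake_card[4 * SECTOR_SIZE];
static uint8_t *dma_dst;                   /* destination of transfer in flight */
static uint32_t dma_sector, dma_ticks_left;
static uint32_t sim_dma_duration = 11u;    /* simulated transfer time in ticks */
static volatile uint32_t mailbox;          /* one-slot 'queue', 0 = empty */

static void queue_put_from_isr(uint32_t msg) { mailbox = msg; }

/* Stand-in for the completion Callback: it only signals the waiting task. */
static void read_complete_callback(void) { queue_put_from_isr(READ_DONE_MSG); }

/* Advance simulated hardware one tick; fire the callback when DMA finishes. */
static void sim_hw_tick(void)
{
    if (dma_ticks_left > 0u && --dma_ticks_left == 0u) {
        memcpy(dma_dst, &fake_card[dma_sector * SECTOR_SIZE], SECTOR_SIZE);
        read_complete_callback();
    }
}

/* Blocking receive with timeout. On target the task would be suspended here. */
static bool queue_get(uint32_t *msg, uint32_t timeout_ticks)
{
    for (uint32_t t = 0u; ; t++) {
        if (mailbox != 0u) { *msg = mailbox; mailbox = 0u; return true; }
        if (t >= timeout_ticks) return false;
        sim_hw_tick();
    }
}

static void dma_read_start(uint8_t *dst, uint32_t sector)
{
    dma_dst = dst;
    dma_sector = sector;
    dma_ticks_left = sim_dma_duration;
}

/* ---- The pattern: start the async transfer, then block until signalled ---- */
int sd_read_sync(uint8_t *dst, uint32_t sector, uint32_t timeout_ticks)
{
    uint32_t msg;

    dma_read_start(dst, sector);
    if (!queue_get(&msg, timeout_ticks)) {
        dma_ticks_left = 0u;   /* abort so no stale completion arrives later */
        return -1;
    }
    return (msg == READ_DONE_MSG) ? 0 : -1;
}
```

From the caller's side (FATFS in the post), `sd_read_sync` is an ordinary synchronous call. The timeout is a safety net against a lost completion, not a tuning parameter.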

What to bear in mind is just:

a) 'all' INT and DMA based functions have a related Callback function, e.g. BSP_SD_WriteCpltCallback.

     You have to implement these Callbacks (if not done already in the BSP code). The default is 'weak' and you have to override it => providing YOUR Callback function is a 'must'.

     They are called automatically when an INT, DMA etc. completion happens (called from the HAL driver).

     But if you do not provide YOUR Callback function, the empty default (weak) function is executed (not yours!).

     ==> check that you really have all the needed Callbacks! (e.g. for DMA complete)

b) such a Callback function would often just send an event or write to a queue (like here).

    The counterpart, waiting for the event or queue message, will 'return from an asynchronous call', i.e. release the waiting thread after the INT or DMA 'event' has occurred. So FATFS doesn't care: it calls the SD read or write function, and the return only happens when the DMA has finished.

OK, if you want to give the CPU back to other RTOS threads, you could yield the CPU in the SD_Driver 'object' functions, e.g. right after the DMA was kicked off in the SD card read function (to be added in the BSP function, not in FATFS).

The problem is just this: from FATFS's view (even with an RTOS), such a driver read function looks like a simple 'synchronous' call: it calls a function and waits until it returns. If you now do a context switch and another thread runs which blocks/pre-empts your FATFS call, how do you come back to FATFS, even if the 'DMA done' event/queue message was sent? (just to ask  😉   )

It is your choice: just bear in mind that real HW INTs, RTOS threads, their priorities etc. are really 'sensitive' (with a wrong system design you get a dead-lock ..., which is also a reason why CubeMX cannot handle and generate code for you for such a high-level system). I often fight with such dead-locks ... and there is no remedy other than a 'good SW system design' and a 'think twice' approach with the entire system in mind (where a tool cannot really help).

SD Card, FATFs and RTOS is not so difficult: just check other examples, check Callbacks and the SD_Driver structure, function pointers in there.

It works fine for me on many STM MCUs (F4, F7, H7) in a similar way, also easy to change from DMA based to non-DMA based (just a different set of function pointers in struct 'SD_Driver', just a different 'driver object').
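As an illustration of the 'driver object' idea, here is a small self-contained sketch. It deliberately does not reproduce the real FatFs or ST structures: `disk_driver_t`, `link_driver`, `sd_driver_dma` and `sd_driver_polled` are hypothetical names, and both back ends just copy to a simulated card so the example runs on a host. The point is only that the layer above sees nothing but function pointers, so swapping DMA for polling means linking a different object.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE  512u
#define CARD_SECTORS 4u

/* Hypothetical driver 'object': the file system only sees these pointers. */
typedef struct {
    int (*init)(void);
    int (*read)(uint8_t *buf, uint32_t sector, uint32_t count);
    int (*write)(const uint8_t *buf, uint32_t sector, uint32_t count);
} disk_driver_t;

static uint8_t  fake_card[CARD_SECTORS * SECTOR_SIZE]; /* simulated medium */
static uint32_t dma_calls, polled_calls;               /* which back end ran */

static int in_range(uint32_t sector, uint32_t count)
{
    return count <= CARD_SECTORS && sector <= CARD_SECTORS - count;
}

static int card_read(uint8_t *buf, uint32_t sector, uint32_t count)
{
    if (!in_range(sector, count)) return -1;
    memcpy(buf, &fake_card[sector * SECTOR_SIZE], count * SECTOR_SIZE);
    return 0;
}

static int card_write(const uint8_t *buf, uint32_t sector, uint32_t count)
{
    if (!in_range(sector, count)) return -1;
    memcpy(&fake_card[sector * SECTOR_SIZE], buf, count * SECTOR_SIZE);
    return 0;
}

static int common_init(void) { return 0; }

/* Polled back end: on target this would loop on the card's data register. */
static int polled_read(uint8_t *b, uint32_t s, uint32_t c)        { polled_calls++; return card_read(b, s, c); }
static int polled_write(const uint8_t *b, uint32_t s, uint32_t c) { polled_calls++; return card_write(b, s, c); }

/* DMA back end: on target this would start DMA and wait for the callback. */
static int dma_read(uint8_t *b, uint32_t s, uint32_t c)           { dma_calls++; return card_read(b, s, c); }
static int dma_write(const uint8_t *b, uint32_t s, uint32_t c)    { dma_calls++; return card_write(b, s, c); }

const disk_driver_t sd_driver_polled = { common_init, polled_read, polled_write };
const disk_driver_t sd_driver_dma    = { common_init, dma_read,    dma_write    };

static const disk_driver_t *linked;  /* currently linked driver, if any */

int link_driver(const disk_driver_t *drv)
{
    if (drv == NULL || !drv->init || !drv->read || !drv->write) return -1;
    linked = drv;
    return drv->init();
}

/* What the file system layer would call; it never knows which back end runs. */
int disk_read_blocks(uint8_t *buf, uint32_t sector, uint32_t count)
{
    return linked ? linked->read(buf, sector, count) : -1;
}

int disk_write_blocks(const uint8_t *buf, uint32_t sector, uint32_t count)
{
    return linked ? linked->write(buf, sector, count) : -1;
}
```

Switching back ends is then a single call, e.g. `link_driver(&sd_driver_dma)` instead of `link_driver(&sd_driver_polled)`, with no change to the code above the driver layer.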

And: no need to touch or modify any FATFS files (I would not do so anyway). The driver functions for read, write etc. should convert the 'asynchronous' call (DMA based, running in the background in parallel) into a synchronous one: wait via events or queues until it is done before returning to the caller. All of this is done in the BSP drivers and the SD_Driver 'driver object'. It is not part of FATFS (which just expects basic low-level routines via the 'driver object'); it is part of your BSP.

Yes, CubeMX will NOT generate such 'complex' BSP functions: you have to check the FW projects, demos and applications, where such BSP functions exist (CubeMX is very low level and chip-based, whereas here we are talking about a SW system, which is much more high-level; CubeMX is not made for that, it is just HW-related and low-level).

And:

- do not touch FATFS source code - do not add anything in third-party code (very hard to maintain, to migrate to newer versions ..., not a good idea). All is really done via SD_Driver 'driver object' and BSP functions (including Callbacks).

- do not use any timing-dependent code, e.g. waiting for X milliseconds with osDelay(), HAL_Delay() etc. In the end it will not really work (if you change clocks, SD card type etc., the timing values become incorrect). Have a real 'ping-pong' synchronization instead.

KScha
Associate

Hi,

There is an example with an RTOS-enabled driver using DMA and RTOS signaling in the latest CubeMX for the STM32F657-Eval. I will try to port this one to the STM32F746-discovery and the 767 Nucleo.