DMA RX buffer cache invalidation - BEFORE or AFTER a transfer?

Associate II

Hi, I have the following system:

NUCLEO-F767ZI board (STM32F767ZI)

Playing a waveform on DAC1, running it through a magic analog black-box, feeding it back to ADC1. Recording the signal on ADC1, stopping the transfer after one full buffer has been recorded. Then doing some calculations on the recorded signal. After some pause the next transfer is started, etc.

Data transfer takes place through DMA streams in one-shot (non-circular) mode. ADC & DAC triggered by timer update.

Control flow is like this:

  • fill the DAC output buffer with waveform data
  • configure & enable the DMA for DAC & ADC
  • start transfer by starting the trigger timer
  • busy-wait (do nothing) until transfer complete (TCx set for all DMA streams)
  • stop and disable DMA streams
  • do some calculations on the received ADC data
  • repeat after N milliseconds
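The loop above can be sketched roughly like this (a sketch only; `fill_dac_buffer`, `start_trigger_timer` and friends are hypothetical stand-ins for the actual HAL/register code, and the transfer-complete check models polling the TCx flags):

```c
#include <stdbool.h>

/* Hypothetical placeholders for the real driver calls. */
static bool tc_flags_set;                  /* models TCx set on all streams */
static void fill_dac_buffer(void)          { /* waveform -> DAC TX buffer  */ }
static void configure_and_enable_dma(void) { /* one-shot DAC & ADC streams */ }
static void start_trigger_timer(void)      { tc_flags_set = true; }
static bool all_streams_complete(void)     { return tc_flags_set; }
static void stop_and_disable_dma(void)     { }
static void process_adc_data(void)         { /* calculations on RX buffer  */ }

void run_one_measurement(void)
{
    fill_dac_buffer();
    configure_and_enable_dma();
    start_trigger_timer();            /* timer update triggers DAC and ADC */
    while (!all_streams_complete())
        ;                             /* busy-wait until transfer complete */
    stop_and_disable_dma();
    process_adc_data();
    /* caller repeats after N milliseconds */
}
```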

I enabled the F7 dcache to speed up the calculations part, and then funny things started to happen. I learned about cache coherency stuff, and added cache management code (clean/flush & invalidate) to the DMA transfer code - cache clean for the DAC (TX) buffer before the transfer, and cache invalidate for the ADC (RX) buffer after the transfer.

Then even more funny stuff happened, and I learned that the global SCB_CleanDCache and SCB_InvalidateDCache are EVIL: a global clean writes back every dirty line in the cache, and a global invalidate happily throws away unrelated dirty data that hasn't been written back to memory yet.

So I made sure that the DMA buffers have addresses and sizes aligned to 32 bytes (the Cortex-M7 cache line size), so they never share a cache line with anything else, and replaced the SCB_CleanDCache / SCB_InvalidateDCache calls with their SCB_CleanDCache_by_Addr / SCB_InvalidateDCache_by_Addr counterparts, to make sure that ONLY the DMA buffers are affected and nothing else.
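For illustration, that alignment plus the by-address maintenance looks roughly like this (buffer names and the sample count are my assumptions; the 32-byte line size is the Cortex-M7 D-cache line size, and the CMSIS calls are compiled in only when a D-cache is present):

```c
#include <stdint.h>
#include <stddef.h>

#define DCACHE_LINE 32u   /* Cortex-M7 D-cache line size */

/* Round a byte count up to a whole number of cache lines. */
#define ALIGNED_UP(n)  (((n) + (DCACHE_LINE - 1u)) & ~(DCACHE_LINE - 1u))

#define N_SAMPLES 500u    /* assumed buffer length in samples */

/* Start address aligned AND size padded to whole cache lines, so no
   unrelated variable can ever share a line with the DMA buffers. */
static uint16_t dac_tx_buf[ALIGNED_UP(N_SAMPLES * sizeof(uint16_t)) / sizeof(uint16_t)]
    __attribute__((aligned(DCACHE_LINE)));
static uint16_t adc_rx_buf[ALIGNED_UP(N_SAMPLES * sizeof(uint16_t)) / sizeof(uint16_t)]
    __attribute__((aligned(DCACHE_LINE)));

void prepare_dma_transfer(void)
{
#if defined(__DCACHE_PRESENT) && (__DCACHE_PRESENT == 1u)
    /* Write dirty TX lines back to SRAM so the DMA reads current data. */
    SCB_CleanDCache_by_Addr((uint32_t *)dac_tx_buf, (int32_t)sizeof(dac_tx_buf));
    /* Discard cached RX lines so later reads fetch what the DMA wrote,
       and so no dirty line can be evicted over the buffer mid-transfer. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)adc_rx_buf, (int32_t)sizeof(adc_rx_buf));
#endif
}
```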

Now it works again, but the strange thing is: it seems I have to invalidate the DMA RX buffer BEFORE the actual transfer. If I invalidate it AFTER the transfer, there are still some corrupt values in the RXed data every now and then.

However, the CPU doesn't touch the RX buffer until the transfer is complete, so I don't know how this is different. Did I miss something?

I know that the recommended solution is to use non-cacheable regions for DMA, but there might still be legitimate reasons for explicit cache management - for example, when the data cannot be copied to a separate buffer for evaluation, and the calculations have to be done directly in the RX buffer.

David Littell
Senior III

As long as the DMA-utilized buffers are aligned and sized for the cache lines and you flush before sending and invalidate before accessing received data you should be OK. If not, something else is afoot.


Before or after - does not matter.

What matters is where the buffer is located. The STM32F767 has DTCM RAM starting at address 0x20000000; the size depends on the device. DTCM is accessed at full core speed, so there is no performance penalty.

This means you do not need the data cache for this area. Configure the MPU settings for this memory, adjust the linker script, and add a new section for the DMA buffers. This section should come first.
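A linker-script sketch of that idea (the region name, LENGTH and section name here are made up and have to match the actual part and project):

```ld
MEMORY
{
  /* DTCM on the STM32F767 starts at 0x20000000; LENGTH must match the part */
  DTCMRAM (xrw) : ORIGIN = 0x20000000, LENGTH = 128K
}

SECTIONS
{
  /* placed first in DTCM; NOLOAD because DMA buffers need no init data */
  .dma_buffers (NOLOAD) : { *(.dma_buffers) } >DTCMRAM
}
```

The buffers are then placed there with `__attribute__((section(".dma_buffers")))`, similar to the declaration in the next reply.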

Associate II

Thank you! This resolved an issue I had with SPI in DMA mode on an STM32F7. SPI was fine in polling and interrupt mode, but would not work in DMA mode. It was indeed the cache: when I disabled the cache, SPI started working in DMA mode too. To enable the cache again, I moved the SPI DMA buffers in the linker file:

Memory_B5(xrw)  : ORIGIN = 0x2004F0A0, LENGTH = 0x40

and in the end of file:

.SPIarraySection : { *(.SPI_Buff_section) } >Memory_B5

And declared the buffers (TX and RX) in the code like this:

uint8_t spiTxRxBuff[2][32] __attribute__((section(".SPI_Buff_section")));

Note that the attribute has to name the input section (.SPI_Buff_section) that the linker rule collects, not the output section (.SPIarraySection).

DMA problems fixed and cache enabled.