How to evaluate STM32F7 bus matrix performance and solve bus matrix performance issues?


I am working with the STM32F769 microcontroller and very rarely I experience some strange issues with an SPI slave and I suspect it's related to congestion on the bus matrix (other, primarily DMA, transfers ongoing at the same time). Is it possible to somehow "look inside" the bus matrix and evaluate how it is performing? What are some recommended ways to solve congestion on the bus matrix (it's already running at maximum clock speed)?

ST Employee

Dear @arnold_w ,

Application Note AN4667 shows the impact of DMA and gives some tips for optimizing overall performance by selecting the right memories for SPI buffers and other masters. Happy reading.



Chief II

SPI overloading the bus matrix running at 216 MHz? What else is running there?

Judging by other topics, you are using HAL and CubeMX generated code. If that is the case, there is no point in suspecting bus matrix or other highly unlikely issues, while there is the HAL/Cube code, which is broken in countless ways, especially for Cortex-M7.


I think it might be because all DMA-buffers were located in DTCM and it seems like a long way (via the core!) for the data to travel between the DMA controllers and DTCM. I've moved the DMA-buffers to RAM and I'm running tests to see if the problem has disappeared.
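One way to confirm that the buffers really ended up outside DTCM after such a move is to check their addresses at start-up. The following is only a sketch: the DTCM base and size are what I assume for the STM32F76x/F77x memory map (128 KB at 0x20000000) and should be verified against the reference manual, and `range_in_dtcm` is a made-up helper name.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed STM32F76x/F77x memory map; verify against the reference manual. */
#define DTCM_BASE 0x20000000u
#define DTCM_SIZE (128u * 1024u)

/* Returns 1 if any byte of [addr, addr + len) lies inside DTCM, else 0.
 * On target, pass (uint32_t)(uintptr_t)buffer as addr. */
static int range_in_dtcm(uint32_t addr, size_t len)
{
    uint32_t end = addr + (uint32_t)len;   /* one past the last byte */
    return (addr < DTCM_BASE + DTCM_SIZE) && (end > DTCM_BASE);
}
```

Calling this once per DMA buffer during initialisation (and trapping if it returns 1) catches a linker script that silently puts the buffers back into DTCM.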

As I already said, if you are using HAL and other broken code, all of those guesses and assumptions are useless. For the device to be reliable, the code must be correct. There is no way around it! As for the correctness and testing:

Therefore one can test as much as one wants, but it still doesn't prove anything. Have you fixed the issues reported in these links?

If not, then what's the point in wasting time on "testing" broken code?

No, the SPI-slave is not using the HAL-functions. I noticed the HAL-functions wouldn't recover properly if CLK-cycles were lost, e.g. if the other microcontroller it was communicating with had a BOR reset during an SPI-transfer, so I wrote my own code. Actually, I posted the source code here ( ) some years ago (but I've improved the code since then). Other ongoing DMA traffic is Ethernet/SAI (more or less continuously) and 2 SPI-masters (sporadic burst transfers). Since the DTCM doesn't support caching, the problem can't be bad cache handling code. I did some logging and noticed that I got 6 SPI interrupts in rapid succession (unfortunately, I didn't log the interrupt status bits) and I assume these are underrun interrupts. That's why I was suspecting congestion somewhere between the DMA-controller and the memory (in this case DTCM). My allocation to this particular problem is very on and off and right now I'm not supposed to spend any time on this (but maybe that changes tomorrow 🙂).
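Capturing the status bits that were not logged could look roughly like the sketch below: a small ring buffer that stores the raw status register value on every interrupt, so the kind of error can be decoded afterwards instead of assumed. All names here (`sr_log_push`, `SR_LOG_DEPTH`, the entry layout) are made up for illustration; on target the ISR would pass in something like `SPIx->SR` and a timestamp such as `DWT->CYCCNT`.

```c
#include <stdint.h>

#define SR_LOG_DEPTH 16u   /* must be a power of two */

typedef struct {
    uint32_t sr;           /* raw status register value at interrupt entry */
    uint32_t timestamp;    /* e.g. a cycle counter value */
} sr_log_entry_t;

static volatile sr_log_entry_t sr_log[SR_LOG_DEPTH];
static volatile uint32_t sr_log_head;   /* total number of entries pushed */

/* Store one entry; the oldest entry is overwritten once the buffer is full.
 * Intended to be called from a single ISR, e.g.
 *     sr_log_push(SPIx->SR, DWT->CYCCNT);
 */
static inline void sr_log_push(uint32_t sr, uint32_t timestamp)
{
    uint32_t i = sr_log_head & (SR_LOG_DEPTH - 1u);
    sr_log[i].sr = sr;
    sr_log[i].timestamp = timestamp;
    sr_log_head = sr_log_head + 1u;
}
```

Reading the status register once at the top of the ISR and using that same value for both logging and handling keeps the log consistent with what the handler actually acted on.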

Since the DTCM doesn't support caching, the problem can't be bad cache handling code.

But missing barriers can still be an issue. For SRAM, missing barrier instructions can be worked around by configuring the descriptor memory as a device memory type (although one still needs volatile qualifiers and/or compiler barriers). DTCM, however, is always a normal memory type, so MPU configuration doesn't matter. Therefore with DTCM the proper usage of the __DMB() macro is absolutely mandatory and there is no way around it. The reworked drivers from ST are OK in this regard, but the older ones are broken. All of it is explained in my articles.
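To illustrate where the barriers go, here is a minimal sketch of handing a descriptor to a DMA engine and taking it back. The descriptor layout, the `DESC_OWN` bit and the function names are hypothetical and do not match any real ST peripheral. The sketch uses the GCC/Clang builtin `__atomic_thread_fence` so that it also builds on a host compiler; on a Cortex-M7 build one would call the CMSIS `__DMB()` at the same places.

```c
#include <stdint.h>

/* Hypothetical descriptor shared between CPU and DMA (not a real ST layout). */
typedef struct {
    volatile uint32_t status;        /* bit 31 set: descriptor owned by DMA */
    volatile uint32_t length;
    volatile uint32_t buffer_addr;
} dma_desc_t;

#define DESC_OWN (1UL << 31)

/* Stand-in for the CMSIS __DMB() call; assumption made for portability. */
static inline void mem_barrier(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* CPU -> DMA: fill in the payload fields first, then flip ownership. */
static void desc_give_to_dma(dma_desc_t *d, uint32_t buf_addr, uint32_t len)
{
    d->buffer_addr = buf_addr;
    d->length = len;
    mem_barrier();            /* payload must be written before OWN is set */
    d->status = DESC_OWN;
    mem_barrier();            /* OWN must be written before the DMA is started */
}

/* DMA -> CPU: check ownership first, then read the payload fields.
 * Returns 1 and stores the length if the CPU owns the descriptor, else 0. */
static int desc_take_from_dma(const dma_desc_t *d, uint32_t *len_out)
{
    if (d->status & DESC_OWN) {
        return 0;             /* still owned by the DMA */
    }
    mem_barrier();            /* do not read the payload before the check */
    *len_out = d->length;
    return 1;
}
```

The volatile qualifiers keep the compiler from reordering or dropping the accesses; the barriers are what keep the core itself from reordering them when the memory is of normal type.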

Other ongoing DMA traffic is Ethernet/SAI (more or less continuously) and 2 SPI-masters (sporadic burst transfers).

SPI can go up to 54 Mbps, which even with 1-byte DMA transfers turns into 6,75 MT/s. Doing 2/4-byte transfers reduces the load further. Ethernet can go up to 100 Mbps (12,5 MB/s). As the ETH DMA does 4-byte transfers, it turns into 3,125 MT/s. Is SAI used for audio? A standard stereo 48 kHz stream would load the bus with 96 kT/s, which is next to nothing. An 8-channel 192 kHz stream still loads the bus with just 1,536 MT/s. And the SAI has an 8 word deep FIFO. The DMA peripheral also has a 4 word deep FIFO.

All of it together (two SPI streams at 6,75 MT/s each, plus Ethernet and the 8-channel SAI stream) sums up to 18,161 MT/s, which is just 8,4 % of the bus matrix's 216 MT/s bandwidth. So even with all of it running simultaneously at maximum speed, the load is still very far from overloading the bus matrix.
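For anyone who wants to redo this estimate with their own numbers, the arithmetic can be sketched as below. The function names are made up. Note that the 18,161 MT/s total only comes out if two SPI streams are counted at the full 54 Mbps; that is inferred from the figures, not stated explicitly. One transfer per 216 MHz bus clock is taken as the available bandwidth, as in the post.

```c
/* Transfers per second for a stream of bits_per_s bits/s that is moved
 * in DMA transfers of bytes_per_transfer bytes each. */
static double transfers_per_s(double bits_per_s, unsigned bytes_per_transfer)
{
    return bits_per_s / (8.0 * bytes_per_transfer);
}

/* Worst-case total from the post, in transfers per second. */
static double total_bus_load_tps(void)
{
    double spi = transfers_per_s(54e6, 1);    /* 6.75 MT/s per SPI, 1-byte transfers */
    double eth = transfers_per_s(100e6, 4);   /* 3.125 MT/s, 4-byte transfers */
    double sai = 8.0 * 192e3;                 /* 1.536 MT/s, one transfer per sample */
    return 2.0 * spi + eth + sai;             /* assumption: two SPI streams counted */
}

/* Load as a percentage of a bus that does one transfer per clock. */
static double bus_load_percent(double tps, double bus_hz)
{
    return 100.0 * tps / bus_hz;
}
```

With these inputs `bus_load_percent(total_bus_load_tps(), 216e6)` comes out at roughly 8.4, matching the figure above. This is a raw throughput estimate only and does not model arbitration latency for an individual transfer.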