LwIP / CubeIDE V6.5.0, STM32H735_DISCO (https://github.com/stm32-hotspot/STM32H7-LwIP-Examples): .Rx_PoolSection (formerly .RxArraySection) seems not to be guarded by the MPU against cache coherency issues. Is this the right way?

Johi
Senior III

I have compared two STM32H735_DISCO_Eth examples, one built under CubeIDE V6.2.1 and a more recent one built with CubeIDE V6.5.0 (thanks to @Pavel A. for providing the reference to the new examples on GitHub).

The V6.5.0 example moved the Rx_PoolSection/RxArraySection from AHB D2 RAM to AXI D1 RAM. The MPU configuration in the example no longer seems to protect the Rx_PoolSection/RxArraySection area from cache coherency problems. Can such protection be omitted, and if so, why?

I made a very detailed .doc describing my findings; it is attached to this post.

8 REPLIES
Piranha (Accepted Solution)
Chief II

Your analysis seems to be correct. Similarly, I took a deeper look at some examples from the Cube packages...

Let's start with an example for NUCLEO-H743ZI. The address of RX_POOL is 0x30000400 for the GCC, Keil and IAR compilers. An offset of 0x400 is reasonable because it provides 1 KB of space for the descriptors. But then in the MPU configuration we have the address 0x30004000, which is wrong. There are two possibilities for how this happened. The first and obvious one is a simple typo, but the fact that the number still has 8 digits is suspicious. So should we change it to 0x30000400? No, because the address must also be aligned to the MPU region size, which is 16 KB = 0x4000 in this case. So now we see where the 0x4000 offset most likely came from - someone "fixed" the address to be aligned to the MPU region size and just didn't care about the consequences. Of course, the correct address is 0x30000000, and the descriptors must be configured in a region with a higher number than the region for the RX_POOL memory for the priorities to act as required.
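
For illustration, here is a minimal sketch of what the corrected setup could look like, following the CubeMX-generated MPU_Config() pattern; the strongly ordered attributes for the descriptor region mirror ST's usual choice, and the exact sizes are assumptions:

```c
/* A minimal sketch of the corrected MPU setup, following the CubeMX-generated
 * MPU_Config() pattern. Region sizes and descriptor attributes are
 * illustrative assumptions, not ST's actual code. */
static void MPU_Config(void)
{
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  HAL_MPU_Disable();

  /* Region 0: 16 KB at 0x30000000 (not 0x30004000!) covering both the
     descriptors and RX_POOL; normal, non-cacheable memory. */
  MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
  MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
  MPU_InitStruct.BaseAddress      = 0x30000000;
  MPU_InitStruct.Size             = MPU_REGION_SIZE_16KB;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;
  MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
  MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  /* Region 1: 1 KB for the DMA descriptors at the start of the same RAM,
     strongly ordered. The higher region number wins on overlap, which is
     exactly the priority ordering described above. */
  MPU_InitStruct.Number       = MPU_REGION_NUMBER1;
  MPU_InitStruct.Size         = MPU_REGION_SIZE_1KB;
  MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL0;
  MPU_InitStruct.IsShareable  = MPU_ACCESS_SHAREABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```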

Then they ported this to the STM32H735G-DK board. They moved RX_POOL to 0x30000200 for the GCC and Keil compilers, left it at the previous address for the IAR compiler (descriptors also at the previous addresses) and, of course, did not fix the region address and order in the MPU configuration.

On top of that, at least the latest CubeMX v6.7.0 does not generate the additional RX_POOL declaration code at all, and because of that the pool is placed in the .bss section, which is again wrong.
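
For reference, in ST's examples that declaration boils down to an override of lwIP's LWIP_DECLARE_MEMORY_ALIGNED macro, roughly like this (a sketch in GCC syntax; the section name must match the one in the linker script):

```c
/* Sketch (GCC syntax): route lwIP's pool declarations into the dedicated
   linker section so RX_POOL does not end up in the default .bss. */
#define LWIP_DECLARE_MEMORY_ALIGNED(variable_name, size) \
  uint8_t variable_name[(size)] __attribute__((section(".Rx_PoolSection")))
```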

Because none of these broken examples actually worked, the developers tried to solve it by adding a D-cache invalidation in the HAL_ETH_RxLinkCallback() function. But, of course, they didn't do that flawlessly either. Correct cache maintenance also requires a D-cache invalidation before reception - in this code, in the HAL_ETH_RxAllocateCallback() function. Anyway, if there is cache maintenance code, there is no need for an MPU configuration for the RX_POOL memory. Doing both for the same purpose just shows that the developers don't understand what they are doing.
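
As a sketch of what complete Rx-side maintenance would look like in ST's zero-copy callbacks (the allocation and pbuf-chaining details from ethernetif.c are elided; ETH_RX_BUFFER_SIZE is the buffer size macro from ST's ETH configuration):

```c
/* Sketch of complete Rx-side cache maintenance in ST's zero-copy callbacks;
   allocation/chaining details from ethernetif.c are elided. */

void HAL_ETH_RxAllocateCallback(uint8_t **buff)
{
  /* ... allocate a buffer from RX_POOL into *buff, as in ST's example ... */
  if (*buff != NULL)
  {
    /* Invalidate BEFORE handing the buffer to the DMA, so no dirty cache
       line can later be evicted over the incoming frame. */
    SCB_InvalidateDCache_by_Addr((uint32_t *)*buff, ETH_RX_BUFFER_SIZE);
  }
}

void HAL_ETH_RxLinkCallback(void **pStart, void **pEnd, uint8_t *buff, uint16_t Length)
{
  /* Invalidate AFTER reception too, before the CPU reads the frame. */
  SCB_InvalidateDCache_by_Addr((uint32_t *)buff, Length);
  /* ... chain the pbuf into the pStart/pEnd list, as in ST's example ... */
}
```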

And, as if that weren't enough, I will remind you of another D-cache related flaw, which was reported long ago. Regardless of whether the driver uses an MPU configuration or cache maintenance functions for the Rx buffers, that cannot solve the cache coherence problems for the Tx buffers. First, Tx buffers can come from different sources, including the RX_POOL, when received packets are "returned" to the output. Second, the PBUF_ROM/PBUF_REF pbuf types can point to arbitrary memory addresses. Therefore the Tx buffers always need a D-cache clean operation, performed in the low_level_output() function.
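
Concretely, that clean is a short loop over the outgoing pbuf chain; a sketch (descriptor handling elided):

```c
/* Sketch of the Tx-side D-cache clean inside lwIP's low_level_output();
   descriptor setup and the actual transmission are elided. */
static err_t low_level_output(struct netif *netif, struct pbuf *p)
{
  for (struct pbuf *q = p; q != NULL; q = q->next)
  {
    /* Write back any cached payload so the ETH DMA reads the real data.
       Note: depending on the CMSIS version, the address/size may need to
       be aligned manually to the 32-byte cache line. */
    SCB_CleanDCache_by_Addr((uint32_t *)q->payload, q->len);
  }

  /* ... fill the Tx descriptors and start transmission, as in ST's code ... */
  return ERR_OK;
}
```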

Finally, I'll remind you that configuring specific memory regions for data buffers as non-cacheable is generally a very poor choice performance-wise. For a proper and decent solution, I recommend reading my article "Maintaining CPU data cache coherence for DMA buffers". Additionally, this topic has some useful related information.

@Imen DAHMEN, @Amel NASRI, yeah, yet another set of severe issues!

MSG_ST
ST Employee

Hi @Johi,

You could refer to the official STM32CubeH7 release available here, and you will find that the Rx_PoolSection is still declared in RAM_D2. However, I admit that the MPU management for this section is not very efficient.

So, please note these improvements planned for the next maintenance releases:

  • An MPU region (cacheable) will be added to manage RX pool section.
  • The mismatch between the IAR and CubeIDE address configurations will be fixed.

Regards

Mahdy

> An MPU region (cacheable) will be added to manage RX pool section.

All of the normal memory is cacheable by default. So what is the purpose of adding a cacheable region? There isn't one - it's just more complete nonsense from you... It actually needs to be a non-cacheable region.

Anyway, this still cannot fix the issues with Tx buffers and PBUF_ROM/PBUF_REF buffers, which I described in my previous post.

Piranha
Chief II

A slight update from my side on this broken junk. In my previous post I assumed that the 16 KB non-cacheable MPU memory region is meant for the Rx buffers. Taking a deeper look, one can see that ST defines LWIP_RAM_HEAP_POINTER to the same 0x30004000 address and that the MPU configuration code has a comment "... for LwIP RAM heap which contains the Tx buffers", which most likely means that the real intention was to make the lwIP heap memory non-cacheable and "suitable" for Tx buffers. This raises two severe issues:

  1. Assuming that all Tx buffers come from the lwIP heap memory is just plain wrong. Tx buffers can come from the heap, memory pools, the Rx pool, FLASH and basically any address via the PBUF_ROM/PBUF_REF pbuf types.
  2. There is no MPU configuration region for the Rx buffer pool and, as explained in the previous post, ST's code does not implement correct D-cache maintenance. As a result, the received data buffers will be corrupted by D-cache evictions.

A quick and simple workaround is to put the RX_POOL at address 0x30000400 and create a non-cacheable MPU region at address 0x30000000 with a size of 32 KB. That makes the Rx buffers non-cacheable and keeps the heap memory configuration as intended by ST. But such a workaround is still incomplete because, as I said, ST's assumption about the Tx buffers is just plain wrong.
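
In terms of the MPU sketch earlier in this thread, the workaround amounts to a single wider region, roughly like this (the function name is made up; attributes are illustrative):

```c
/* Sketch of the workaround: one 32 KB non-cacheable region covering the
   descriptors, the relocated RX_POOL at 0x30000400 and ST's lwIP heap.
   The function name is hypothetical; it follows the CubeMX MPU_Config()
   shape shown earlier. */
static void MPU_Config_Workaround(void)
{
  MPU_Region_InitTypeDef MPU_InitStruct = {0};

  HAL_MPU_Disable();

  MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
  MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
  MPU_InitStruct.BaseAddress      = 0x30000000;
  MPU_InitStruct.Size             = MPU_REGION_SIZE_32KB;
  MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
  MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;   /* normal memory */
  MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
  MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
  MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
  MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
  HAL_MPU_ConfigRegion(&MPU_InitStruct);

  HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```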

> ST's assumption about the Tx buffers is just plain wrong.

Need to see how this is addressed in the Azure/NetX drivers.

Obviously, if the ETH moves data using its DMA, and this DMA cannot access some MCU memory areas... maybe the low-level output function should check that the Tx buffers are in "good" memory, and fail otherwise.

The issue here is not accessibility, but cacheability. The whole point of making the lwIP heap non-cacheable is to "ensure" that the Tx buffers come from a non-cacheable memory region. As explained in my D-cache maintenance article, all of this broken nonsense can be removed after adding a D-cache clean to the Tx code. Apparently even a single trivial line of code is too much for ST's "geniuses"...

Although less serious, DMA accessibility is also a problem that ST hasn't really addressed. At least on some parts, there are memories that are not accessible by the Ethernet DMA. For example, on the STM32H750, the Ethernet DMA can access any internal or external memory except the TCM (which is CPU or MDMA only). To ST's credit, this limitation of their driver is actually documented in at least one place.

The normal solution is to copy either the offending pbuf or the entire chain. Copying the entire chain is simpler and likely not too wasteful, as the remainder of the chain is most likely just protocol headers. Here's an example in a TI driver for a part whose Ethernet DMA can't access flash memory. (One issue with it is that it allocates the new pbuf from the Rx pool (PBUF_POOL) instead of the heap (PBUF_RAM), which is normally used for Tx.)
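
In lwIP terms, copying the whole chain can be done with pbuf_clone(); a sketch, where dma_can_access() and ensure_dma_reachable() are hypothetical names for illustration:

```c
/* Sketch: clone the whole chain into heap (PBUF_RAM) pbufs when any payload
   lies in memory the ETH DMA cannot read. dma_can_access() is a made-up
   predicate for the part's memory map (e.g. "not in TCM/flash"). */
#include <stdbool.h>
#include "lwip/pbuf.h"

extern bool dma_can_access(const void *addr);  /* hypothetical helper */

static struct pbuf *ensure_dma_reachable(struct pbuf *p)
{
  for (struct pbuf *q = p; q != NULL; q = q->next)
  {
    if (!dma_can_access(q->payload))
    {
      /* pbuf_clone() allocates and copies the entire chain in one call;
         it returns NULL on out-of-memory, which the caller must handle. */
      return pbuf_clone(PBUF_RAW, PBUF_RAM, p);
    }
  }
  return p;  /* chain is already DMA-reachable; transmit it as-is */
}
```

Note that if a clone is transmitted, the driver must free it itself after transmission; the original chain remains owned and freed by the lwIP stack as usual.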

On the topic of copying, a driver with a non-blocking output function needs to check each pbuf in the chain with PBUF_NEEDS_COPY to see if it needs to copy it before queuing it for transmission (instead of just incrementing its reference count). I don't know of any publicly available netif drivers that need to do this and handle it correctly, so here's an example of proper queuing in the ARP module. ST's output function looks like it's blocking, but IIRC it actually isn't because of the initial semaphore count bug that they're refusing to fix.
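
A sketch of that check for a non-blocking driver (queue_for_tx() is a hypothetical driver enqueue function; PBUF_NEEDS_COPY(), pbuf_clone() and pbuf_ref() are standard lwIP 2.1 APIs):

```c
/* Sketch of the PBUF_NEEDS_COPY check for a non-blocking output path. */
#include <stdbool.h>
#include "lwip/pbuf.h"

extern err_t queue_for_tx(struct pbuf *p);  /* hypothetical enqueue */

static err_t queue_pbuf_for_tx(struct pbuf *p)
{
  bool needs_copy = false;

  /* If any pbuf in the chain may be modified by its owner before the DMA
     has read it, the driver must queue a private copy of the chain. */
  for (struct pbuf *q = p; q != NULL; q = q->next)
  {
    if (PBUF_NEEDS_COPY(q))
    {
      needs_copy = true;
      break;
    }
  }

  if (needs_copy)
  {
    p = pbuf_clone(PBUF_RAW, PBUF_RAM, p);
    if (p == NULL)
    {
      return ERR_MEM;
    }
  }
  else
  {
    pbuf_ref(p);  /* just hold a reference until transmission completes */
  }

  return queue_for_tx(p);
}
```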

mslugx
Associate II

Hi all, I just discovered this issue on my H743ZI2 board as well. I read through all of the replies, but I am still not sure what to do to fix this issue. Can someone tell me what the fix is? Thank you.