2021-10-21 03:43 AM
Not sure if this is a bug or something I do not completely understand, but in the generated ethernetif.c file (HAL V1.9.0) there is the line:
LWIP_MEMPOOL_DECLARE(RX_POOL, 10, sizeof(struct pbuf_custom), "Zero-copy RX PBUF pool");
This always uses, if I understand correctly, a fixed magic number of 10 for the pbuf count. I would think that number should be tied to something like PBUF_POOL_SIZE. Or am I missing the point of this?
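For illustration, a minimal sketch of the suggestion - the same generated line with the magic 10 replaced by a configurable symbol (PBUF_POOL_SIZE is lwIP's own pool-size option; whether it is the right symbol to tie this to is exactly the question being asked):
/* hypothetical variant: pbuf count tied to a configurable symbol instead of 10 */
LWIP_MEMPOOL_DECLARE(RX_POOL, PBUF_POOL_SIZE, sizeof(struct pbuf_custom), "Zero-copy RX PBUF pool");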
2022-06-13 12:54 PM
Is this a joke? This is the code from v1.10:
#define ETH_RX_BUFFER_CNT 12U
LWIP_MEMPOOL_DECLARE(RX_POOL, ETH_RX_BUFFER_CNT, sizeof(RxBuff_t), "Zero-copy RX PBUF pool");
A magic number of 10 has been replaced with a define ETH_RX_BUFFER_CNT, which is defined as another magic number of 12. If the developers still think that the number of buffers must be different from the number of descriptors, then they should make that new ETH_RX_BUFFER_CNT define configurable in CubeMX.
/*
2.a. Rx Buffers number must be between ETH_RX_DESC_CNT and 2*ETH_RX_DESC_CNT
*/
If by 2*ETH_RX_DESC_CNT the developers were thinking about the problem related to the context descriptors, then it's the other way around - it is the number of descriptors that should be doubled, not the number of data buffers. Therefore the comment should read: "between ceil(ETH_RX_DESC_CNT/2) and ETH_RX_DESC_CNT"
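For reference, that corrected bound can be restated as a compile-time check (a sketch, assuming both symbols are plain integer defines; (ETH_RX_DESC_CNT + 1) / 2 is the integer form of ceil(ETH_RX_DESC_CNT/2)):
#if (ETH_RX_BUFFER_CNT < ((ETH_RX_DESC_CNT + 1) / 2)) || (ETH_RX_BUFFER_CNT > ETH_RX_DESC_CNT)
#error "ETH_RX_BUFFER_CNT must be between ceil(ETH_RX_DESC_CNT/2) and ETH_RX_DESC_CNT"
#endif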
2022-06-28 12:37 PM
Thanks for your tip, @Piranha! I was having some trouble with an HTTP server in a no-RTOS application, where an ETH_RX_BUFFER_SIZE greater than 1000U led to a memory fault. Setting ETH_RX_BUFFER_CNT equal to ETH_RX_DESC_CNT fixed the problem...
2022-06-28 03:04 PM
>If by 2*ETH_RX_DESC_CNT
Haven't read its code, but perhaps it implements two buffers per descriptor now.
2022-12-30 07:21 AM
Hi,
As an answer to the main issue (LWIP_MEMPOOL_DECLARE with a fixed pbuf count):
Because we use a ring structure for the descriptors, each descriptor points to 2 data buffers.
So that RX_POOL can hold all Rx buffers, the pbuf count should be equal to 2 * ETH_RX_DESC_CNT.
An internal ticket, ID 142051, has already been created to solve this issue
(PS: this is an internal tracking number and is not accessible or usable by customers).
Sorry for the delayed answer and thank you for your interaction.
Regards
Mahdy
2022-12-30 06:01 PM
And again complete nonsense from an ST employee... First, the fact that the descriptors are arranged as a ring has nothing to do with the possibility of providing two data buffers for each descriptor. But the main nonsense is the claim that each descriptor points to two data buffers, when in fact the HAL does not use buffer 2 of the Rx descriptors.
2023-11-28 05:30 AM
Hi all,
My understanding of the current HAL F7 driver state (STM32F7xx-hal-driver v1.3.0), and of the rules that apply to the RX buffer count, is as follows:
Note: the numbered list is just for easier referencing and is not the exact execution order.
@Piranha First of all, thank you for all your hints and lists about the various issues. Can you please confirm my assumptions / breakdown, and extend or comment if you find anything wrong or worth extending?
@MSG_ST Do you think it is possible for ST to answer / adapt the following points?
Regards
Philipp
2023-11-29 01:21 AM - edited 2023-11-29 01:51 AM
Hi @PhilippH ,
Thank you for your detailed comment. I fully agree with your assumptions.
1-
I confirm from our side that the number of Rx buffers should be at least equal to the number of descriptors, and that it may exceed the number of descriptors in order to provide a pool that can handle high load peaks in some heavy applications. The comment section will be updated accordingly.
And that's why you can find ETH_RX_BUFFER_CNT = 15U or 12U...
2-
ETH_RX_BUFFER_CNT will be configurable via STM32CubeMX to allow users to modify it according to their use cases.
Thank you again for your contribution
Regards
2023-11-29 06:07 PM
Your understanding is correct and there are just a few things to add.
The PBUFs can be held not only by the TCPIP thread, but also by applications. And it's not just a question of performance, as it can even be done deliberately as a feature. For example, an application can implement some buffering by holding PBUFs and only releasing them some time later. The order of releasing PBUFs also can and will differ from the reception order - both within a single application and definitely between several applications. All of this together means that the number of Rx buffers must be high enough to more or less satisfy the requirements of all applications, including their buffering requirements.
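To illustrate the holding pattern, here is a sketch against the standard lwIP raw TCP API (the queue and its size are made up for the example): a receive callback that keeps the pbuf instead of freeing it, so the underlying zero-copy Rx buffer stays checked out until the application releases it, possibly out of reception order.
#include "lwip/tcp.h"
#include "lwip/pbuf.h"

static struct pbuf *held[8];              /* hypothetical app-level holding queue */
static size_t held_cnt;

static err_t recv_cb(void *arg, struct tcp_pcb *pcb, struct pbuf *p, err_t err)
{
    (void)arg; (void)err;
    if (p == NULL) {                      /* remote side closed the connection */
        tcp_close(pcb);
        return ERR_OK;
    }
    tcp_recved(pcb, p->tot_len);          /* re-open the TCP receive window */
    if (held_cnt < 8) {
        held[held_cnt++] = p;             /* hold the pbuf; its Rx buffer stays in use */
    } else {
        pbuf_free(p);                     /* queue full: release immediately */
    }
    return ERR_OK;
}

static void release_one(size_t i)         /* called later, in any order */
{
    pbuf_free(held[i]);                   /* only now does the buffer return to the Rx pool */
    held[i] = held[--held_cnt];
}
The callback would be registered with tcp_recv(pcb, recv_cb); the point is only that pbuf_free() can happen long after, and in a different order than, reception.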
Now, let's analyse how the number of DMA descriptors relates to the number of Rx buffers.
ETH_RX_BUFFER_CNT < ETH_RX_DESC_CNT
In this configuration there are not enough buffers to attach to all descriptors and therefore the surplus descriptors are just wasted. The system can still work, but it just doesn't make sense.
ETH_RX_BUFFER_CNT > ETH_RX_DESC_CNT
In this configuration we have surplus buffers, which can be held by applications and then returned to the Rx pool. But a buffer that is sitting in the Rx pool and is not used for anything is again wasted.
ETH_RX_BUFFER_CNT == ETH_RX_DESC_CNT
In this configuration, when a buffer is released, it doesn't even have to be put in an Rx pool, but can immediately be attached to a descriptor. Therefore the software-managed Rx pool becomes useless, because it would always be empty, and it can and should be removed from the design completely.
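A minimal sketch of that design, assuming ETH_RX_BUFFER_CNT == ETH_RX_DESC_CNT: the custom-pbuf free callback hands the buffer straight back to a descriptor instead of to RX_POOL. Here rearm_rx_descriptor() is a hypothetical helper, not a HAL function, and the RxBuff_t layout mirrors the generated driver's wrapper.
#include <stdint.h>
#include "lwip/pbuf.h"

#ifndef ETH_RX_BUFFER_SIZE
#define ETH_RX_BUFFER_SIZE 1536U          /* assumption: typical Ethernet frame buffer */
#endif

typedef struct {
    struct pbuf_custom pbuf_custom;       /* must stay the first member */
    uint8_t buff[ETH_RX_BUFFER_SIZE];
} RxBuff_t;

/* Hypothetical: attach buff to the next free Rx descriptor and return
 * ownership to the DMA - the piece that replaces the software pool. */
extern void rearm_rx_descriptor(uint8_t *buff);

static void rx_pbuf_free(struct pbuf *p)
{
    RxBuff_t *rx = (RxBuff_t *)p;         /* valid cast: pbuf_custom is the first member */
    rearm_rx_descriptor(rx->buff);        /* no pool: straight back to a descriptor */
}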
I've tried to explain this several times before, but everyone has misunderstood it, including advanced users. For example, ST's code defaults to 4 descriptors and 12 buffers. When I say that those numbers must be equal, everyone interprets it as "reduce the number of buffers to 4" and counter-argues how limited such a configuration would be. But for some strange reason nobody interprets it as "increase the number of descriptors to 12", which is what I was actually saying. I will remind you again that the descriptors are small and their size, compared to the Rx buffers, is minor. Also, Alister worried about head-of-line blocking, but that is not the case here, because the buffers are not bound to particular descriptors and I'm not suggesting they should be.
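In define form, the intended reading is simply this (a sketch; 4U is the generated default being raised, not the buffer count being lowered):
#define ETH_RX_DESC_CNT    12U               /* raised from the generated 4U */
#define ETH_RX_BUFFER_CNT  ETH_RX_DESC_CNT   /* equal by design; the Rx pool then stays empty */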
2023-11-29 11:41 PM - edited 2023-11-29 11:42 PM
ETH_RX_BUFFER_CNT > ETH_RX_DESC_CNT
> In this configuration we have surplus buffers, which can be held by applications and then returned back to the Rx
> pool. But a buffer, which is sitting in an Rx pool and are not used for anything, is again wasted.
I found that this is the configuration which works best for my application: no OS, a kind of audio streaming, an HTTP REST interface, and an Rx-descriptor / pbuf reallocation function called from the main while loop / LwIpProcess (when not sleeping).
So I simply define:
#define ETH_RX_DESC_CNT ( PBUF_POOL_SIZE - 8 ) /* must be lower than pool size */
with 64 pbufs in the pool, or 32 for my test boards without external RAM (where I need more internal RAM for the more important TX buffers; the RX buffers are mostly used to handle TCP ACKs).
But don't ask me what the problem was; my H7 Ethernet driver is more than a year old now and I'm good at forgetting - maybe pbuf chains were still being worked on. I'll have to check my notes...