32F417 - how to prevent low_level_output() (in ethernetif.c) overwriting the last DMA buffer?

PHolt.1 · ‎2022-07-15

It seems that one has to wait on the three TPS bits = 000 in DMASR.

I fixed the problem by putting

if ( (((heth->Instance)->DMASR) & (0x7 << 20)) != 0 ) {}

at the end of HAL_ETH_TransmitFrame().

That loop should not hang unless the silicon is defective - right?

This is the code which is all over the internet, and I think the check of whether the buffer is available is in the wrong place. It should be BEFORE the memcpy().

      /* Check if the length of data to copy is bigger than Tx buffer size*/
      while( (byteslefttocopy + bufferoffset) > ETH_TX_BUF_SIZE )
      {
        /* Copy data to Tx buffer*/
        memcpy( (u8_t*)((u8_t*)buffer + bufferoffset), (u8_t*)((u8_t*)q->payload + payloadoffset), (ETH_TX_BUF_SIZE - bufferoffset) );
 
        /* Point to next descriptor */
        DmaTxDesc = (ETH_DMADESCTypeDef *)(DmaTxDesc->Buffer2NextDescAddr);
 
        /* Check if the buffer is available */
        if((DmaTxDesc->Status & ETH_DMATxDesc_OWN) != (u32)RESET)
        {
          errval = ERR_USE;
          goto error;
        }
 
        buffer = (u8 *)(DmaTxDesc->Buffer1Addr);
 
        byteslefttocopy = byteslefttocopy - (ETH_TX_BUF_SIZE - bufferoffset);
        payloadoffset = payloadoffset + (ETH_TX_BUF_SIZE - bufferoffset);
        framelength = framelength + (ETH_TX_BUF_SIZE - bufferoffset);
        bufferoffset = 0;
      }

PHolt.1 · ‎2022-07-22

Presumably the question was stupid in some way but it is not clear to me how.

Piranha · ‎2022-07-22

> That loop should not hang unless the silicon is defective - right?

The line you've shown is not a loop. Enabled DMA, when not running, sit's in a suspended state, which is identified by TPS value 0b110. But during intensive transfers the DMA will not enter that state. If you want to wait when the frame has been transmitted, then wait for the OWN bit on the last buffer to be cleared by DMA. But you don't need to do anything of that at all. All the software must do, is prepare the descriptors and issue poll demand.

> This is the code which is all over the internet, and I think the check of whether the buffer is available is in the wrong place.

Why search for ST code somewhere else? And you haven't shown a full code. There is an additional OWN bit check at line 239 and an additional memory copy at line 274. The internal while loop is executed only when the data crosses the DMA buffer boundaries and it copies data to all buffers except the last one. The last buffer is written at line 274, not line 253. Therefore the code is stupid but correct. Obviously the code can be restructured so that everything is done in that while loop and there are no two OWN bit checks and two memory copy calls. It should be noted that ST have released a rewritten drivers, along with whom this copy-based code is replaced by somewhat better zero-copy code, which at least for now also has some significant flaws.

The real issues are listed in my Ethernet/lwIP issue list, which I've updated recently:

https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32

PHolt.1 · ‎2022-07-23

Thank you Piranha.

I have done through this and it looks like I am up to date, except the __DMB issue which you mention at

https://community.st.com/s/question/0D50X0000C4Nk4GSQS/bug-missing-compiler-and-cpu-memory-barriers

Does this apply to code in ethernetif.c or elsewhere?

I see references to the OWN bit only in the eth_transmitframe and eth_receiveframe functions. I am using these out of stm32f4xx_hal_eth.c but with the silly hal_busy "lock" code stripped out because that was obviously useless (as you also point out).

Piranha · ‎2022-07-24

With the old non-rewritten driver barriers are required in both - ethernetif.c and stm32***_hal_eth.c. Read my article! It explains the barriers very clearly. And at the end it has a very clear and simple instruction where the barriers should be placed in code! I have to cite a user "ataradov" from the EEVblog forum:

"How can you miss this very basic info?"

By the way, there is no need for doing any DMASR flag checking or clearing, when issuing a poll demand. Also the TUS flag is only relevant when TSF (Transmit store and forward) mode is not used, which should not be done because of errata ES0182 Rev 13 section "2.11.4 Transmit frame data corruption". And the reception doesn't need issuing poll demands at all - the DMA automatically tries to re-enter the Run state on every new frame received in Rx FIFO. Of course, when the Rx poll demand is not used, the barrier before it is also useless.

low_level_output()
{
	for (/* buffers of a single frame */) {
		// Ensure the OWN bit in TDES0 is cleared
		// Process the buffer
		// Prepare the descriptor, except TDES0
		__DMB();
		// Write TDES0 with the OWN bit set
	}
	__DMB();
	ETH->DMATPDR = (uint32_t)ETH;  // Any value issues a descriptor list poll demand.
}
 
low_level_input()
{
	for (/* Rx buffers of a single frame */) {
		// Read the RDES0 and ensure the OWN bit in cleared
		__DMB();
		// Read the rest of the descriptor
		// Process the buffer
		// Prepare the descriptor, except RDES0
		__DMB();
		// Write RDES0 with the OWN bit set
	}
}

> Nano is supposed to be smaller, but there is stull like this ... stating that it does not support long-long, which is definitely incorrect.

It is correct. The printf() family (and some other) functions from Newlib-nano are a disaster for microcontroller development. As you already know, this was researched and documented completely by Dave Nadler. But again you seem to ignore all the details he wrote, like, for example, the requirement to set configUSE_NEWLIB_REENTRANT=1. Anyway, fixing this is again a useless waste of time - read my comment in this topic.

What is the value of the initial stack pointer in the vector table? It is set in the linker script. Older ST examples had a wrong values of it.

PHolt.1 · ‎2022-07-24

"How can you miss this very basic info?"

Because I am not as clever as you.

Re __DMB() insertion, your code does not relate much to mine (I am using the older drivers because the new ones are for interrupts, while I am polling) but I believe I have implemented all the fixes. I will post my ethernetif.c below but on past record if somebody does that, it ensures that nobody replies to their post. I also don't see applicable code in stm32f4xx_hal_eth.c because I moved the two HAL_ETH...frame functions into ethernetif.c, stripping out silly stuff like their HAL_BUSY "protection". Anyway, __DMB does no harm so I may have put it in a few places too many.

"It is correct. The printf() family (and some other) functions"

You are right as always. See explanation here

https://www.eevblog.com/forum/microcontrollers/32f4-hard-fault-trap-how-to-track-this-down/msg4318681/#msg4318681

Looks like I was not using newlib-nano at all, ever, because I definitely have %llu etc working. But I am still stuck with printf using the heap for floats and longs, so I need to fix those mutexes, which are currently empty functions, but I am having trouble because I don't have a source to that lib and the mutex calls are not "weak". I do need to output uint32_t and uint64_t and floats because my project uses a lot of that, although I can avoid floats if I really have to (easy to output two integers with a '.' between them) but that won't help with the malloc and thread-safety issue because as I say this standard C lib printf uses the heap for anything long or bigger. I have been tracing the code and the ST lib needs all these mutex calls implemented

https://gist.github.com/thomask77/3a2d54a482c294beec5d87730e163bdd

Just been tracing printf() and pretty soon after there is a call to __retarget_lock_acquire_recursive which is an empty function.

If you know of a solution, I am "all ears".

My ethernetif.c file is too long to post. I have put it here on db

https://www.dropbox.com/s/rsurv1prfm3a074/ethernetif.c?dl=0

and will delete the db link after a day or two.

Thank you as always for any help. I have spent many hours on this. It is currently running ok.

Piranha · ‎2022-07-24

https://community.arm.com/arm-community-blogs/b/embedded-blog/posts/shrink-your-mcu-code-size-with-gcc-arm-embedded-4-7

Newlib, the C library in the toolchain, implements printf functions that are so complicated they require about 37K bytes of FLASH and 5K bytes of RAM to run a simple hello-world program.

That's about the standard non-nano version. A printf() of a size of a whole IP stack... Wonderful! If you still want to use that Frankenstein, this link should be very informative. And look at the Reference Links at the end of that article. But the question remains - why torture yourself?

> If you know of a solution, I am "all ears".

In my previous post I gave a link to a topic with a real solution - replacing the monster with a decent library.

About your code... From where is the ethernetif_set_link() called? It doesn't conform to the multi-threading requirements. What's with the DHCP? If I'm not mistaken, in one of EEVblog topics you said that you observe low_level_output() being reentered. If that is true, then that is a very clear indicator of some of your code breaking the multi-threading rules. Generally for debugging multi-thread conformance it is highly recommended to define and implement the macros mentioned in this topic.

P.S. I downloaded that file I'll take a more detailed look on it. If it was only for me, you can delete it...

PHolt.1 · ‎2022-07-25

"Newlib, the C library in the toolchain, implements printf functions that are so complicated they require about 37K bytes of FLASH and 5K bytes of RAM to run a simple hello-world program."

37k is very possible but I don't believe the 5k RAM. I've spent ages on looking at RAM usage (largely because MbedTLS, apart from ~150k of FLASH, uses about 50k of RAM, pushing towards the 128k available). But anyway it would be ok. FLASH is not at the limit - currently 325k out of 1MB and 150k of that is MbedTLS!

" replacing the monster with a decent library."

However, I do want all the functionality. For other projects I could use a chopped-down printf, and have done so in the past. There are many many printfs out there.

The real issue is that the printf code uses the heap for floats and longs (it seems most printfs that implement floats properly use the heap for at least floats, although in years past I developed an H8/300 product which used the Hi-Tech compiler and that used no heap at all, but the floats were not IEEE float format which nowadays would be unacceptable, for some reason - I've been doing embedded since ~1979) calls __retarget_lock_acquire_recursive, then it calls _malloc_r, and _malloc_r then calls __retarget_lock_acquire_recursive. This happens only with floats and longs and is evidently an attempt to make the heap usage thread-safe. It should work, if those calls are not empty...

"From where is the ethernetif_set_link() called? "

It is called on a link status change (startup, cable insertion / removal, etc)

		case DHCP_IDLE:
		default:
			{
				// Get most recently recorded link status
				bool net_up = netif_is_link_up(&kde_netconf_netif);
 
				// Read the physical link status
				ethernetif_set_link(&kde_netconf_netif);
 
				// Has the link status changed
				if (net_up != netif_is_link_up(&kde_netconf_netif)) {
					ethernetif_update_config(&kde_netconf_netif);
 
					if (net_up) {
					   // Link was up so must have dropped
					   debug_thread("Ethernet link down");
					}
					else {
					   // Link was down so must be up - restart DHCP
					   debug_thread("Ethernet link up");
					   DHCP_state = DHCP_START;
					}
				}
 
				osDelay(1000);
			}

" in one of EEVblog topics you said that you observe low_level_output() being reentered."

No; I commented that I could not be sure that LWIP won't do that. I just don't know enough to be sure. My feeling is that even though the LWIP API (not the raw one) can be called from multiple threads (I am using the thread protection #define on it, btw) it will not be making concurrent calls to low_level_output. But, not being sure, I did implement a separate mutex on the low level output (as you can see) and another separate mutex on the low level input for the same reason. Those functions are obviously not thread safe on their own. Well, hardware isn't thread-safe : - )

The cost of a mutex is of the order of 1-2 microseconds - I timed it once. Mutex protecting those functions costs nothing.

AIUI, if you call the LWIP API (socket or netconn; they both end up in the same place, as I found out by tracing them), that eventually leads to a call to low_level_output. It doesn't do anything especially clever. It has to send out that data. So it would seem logical that if two RTOS threads made two API calls concurrently, that could call low_level_output concurrently. It may well not but I just don't know.

FWIW, the other thing I learned is that in "practical embedded applications" one is wasting time playing with TX buffer settings. Connected to a 100mbps LAN, the output data disappears down the wire far faster than a 168MHz ARM32 could fill up the ETH DMA buffers (I am pretty sure the 32F4 ETH controller can drive out at 100mbps, or close to). It isn't a 4GHz 80x86. I spent many hours on this. In practical situations you are using TCP/IP and every outgoing packet has an incoming packet, so the TX packet rate is wholly limited by the RX packet rate. As you see in my file, I am polling the input, at a packet rate of 100Hz, in low_level_input. That delivers 250kbytes/sec which is totally fine for this application. Increasing the poll rate to say 1kHz I can get 1200kbytes/sec, and my feeling is that with interrupt-driven RX one would probably exceed that, possibly by quite a bit on a stripped-down demo system but not with the RTOS running 20 other tasks. LWIP is not that fast... and people may be sending tiny packets. For a speed demo you would rig it up with nothing else running and sending MTU-sized packets. The other advantage of polling, apart from simplicity, is an inherent protection from an accidental "DOS". One has to ask: where is the RX packet filtering? It isn't in the PHY chip, AFAIK.

I am extremely grateful for any ideas on that file. It does work but there are very rare unexplained crashes (although as I posted on eevblog I believe the lack of mutex protection on printf+heap is a more likely cause).

Piranha · ‎2022-07-26

> However, I do want all the functionality. ... There are many many printfs out there.

Yes, there are many, but I gave you a two specific examples. What features exactly are missing in those implementations for your needs?

> It is called on a link status change

I mean technically - startup, thread, ISR. Looks like some thread, which is not the lwIP core thread. If it doesn't do core locking, then it breaks the multi-threading rules. I already gave the link with an advice how to catch all of such code reliably and easily...

Also, as it is written in my issue list topic, there is no need for that broken DHCP state management in application. It's done internally since lwIP v2.0.0 (year 2016)! Just set up netif_set_link_callback() for link state and netif_set_status_callback() for NETIF status, which is called on IP address changes. And there is no necessity for a separate thread just for the link state management - do the checking-updating of the link state from the Ethernet input thread, when there are no frames received for 250 ms or so.

> LWIP API (not the raw one) can be called from multiple threads (I am using the thread protection #define on it, btw)

Netconn and Socket APIs are thread-safe, which means those don't need any additional protection.

> it will not be making concurrent calls to low_level_output

That is true. The lwIP developers are smart enough to understand that the Ethernet peripheral can transfer only a single frame at a time. In each direction, of course, therefore two frames in total. As I already explained in other topic, the lwIP doesn't call low_level_output() function in parallel. And it doesn't call low_level_input() function at all... That's your Ethernet input thread, which calls it! The lwIP doesn't even know about that function and is not capable of calling it! Solving the "problems", which doesn't exist, but ignoring the real ones... That's what ST's team already does excellently!

> In practical situations you are using TCP/IP and every outgoing packet has an incoming packet, so the TX packet rate is wholly limited by the RX packet rate.

Not for UDP. Actually it depends on a higher layer protocol, but, for example, RTP source just transmits and does not receive anything.

> Connected to a 100mbps LAN, the output data disappears down the wire far faster than a 168MHz ARM32 could fill up the ETH DMA buffers

A comment in your code tells that the impact of a memory copy is negligible, but now you say that the CPU cannot feed the ETH fast enough - don't you see the contradiction? �� Copy-based driver not only decreases performance, but also increases memory usage because of additional buffers. Zero-copy driver doesn't have any buffers on a transmit side at all. Though it has to have more descriptors, because PBUFs typically consist of 1-3 segments and queuing frames becomes very fast.

You are theorizing about speeds. I've proven and demonstrated the performance publicly with my Ethernet/lwIP demonstration firmware long ago, but again you've not seen it despite it being cross-linked with the issue topic... And yes, it runs Iperf2 server, HTTP server, UART command line, LED blinking and measures CPU load without any problems while simultaneously running Ethernet load tests at a maximum. And my private project does much more simultaneously, including some critical real-time processing... Your MCU can process tens of thousands of a valid frames per second and dropping irrelevant frames in a lower layers of IP stack takes even less processing, but you are limiting it to 100 frames per second. And it's not only about an average data rate - the frames can come one per day, but if they all come in your 10 ms period then the first ones fill the buffers and the rest of frames are lost. You are crippling a powerful device to solve a "DOS problem", which is not a problem for it at all. Also I recommend reading my posts in this topic to learn a better understanding of how an actual Ethernet network works and what impact it has on your device.

PHolt.1 · ‎2022-07-27

The requirement for a full printf is due to needing uint63 and float support, and due to the future use being to some extent unknown. Otherwise I would have used a chopped-down printf. However I am not sure if there is a full-feature printf which doesn't use the heap, and is itself thread-safe. I used to have one 30 years ago but it worked in a nonstandard float format context.

I appear to have solved it, by using objcopy to weaken the relevant symbols in libc.a (took a day to locate the right libc.a, out of many options of same name) but it took a huge amount of effort. I am still not finished; something wrong with the mutexes. The lib code uses a recursive mutex for malloc and free which I think is wrong, and I changed it, but maybe that would actually work?

https://www.eevblog.com/forum/programming/st-cube-gcc-how-to-override-a-function-not-defined-as-weak-(no-sources)/

Regarding low_level_input, my mutexes are based on your code here

https://community.st.com/s/question/0D50X0000BOtUflSQF/bug-stm32-lwip-ethernet-driver-rx-deadlock

Hence I have

/**
  * This function is the ethernetif_input task. It uses the function low_level_input()
  * that should handle the actual reception of bytes from the network
  * interface. Then the type of the received packet is determined and
  * the appropriate input function is called.
  *
  * This is a standalone RTOS task so is a forever loop.
  * Could be done with interrupts but then we have the risk of hanging the KDE with fast input.
  *
  */
void ethernetif_input( void * argument )
{
	struct pbuf *p;
	struct netif *netif = (struct netif *) argument;
 
	// This mutex position is based on an idea from Piranha here, with the delay added
	// https://community.st.com/s/question/0D50X0000BOtUflSQF/bug-stm32-lwip-ethernet-driver-rx-deadlock
#if 1
	// This version is more protected from fast input data
	do
    {
		sys_mutex_lock(&lock_eth_if_in);
		p = low_level_input( netif );
		sys_mutex_unlock(&lock_eth_if_in);
 
		if (p!=NULL)
		{
			if (netif->input( p, netif) != ERR_OK )
			{
				pbuf_free(p);
			}
		}
 
		osDelay(10);	// this has a dramatic effect on ETH speed, both ways
 
    } while(true);
}

I guess LWIP monitors the input buffer structure to see if new data has arrived? In that case no mutex protection is needed, but then I don't understand your example.

I have removed mutex protection from low_level_output and it still works. Thanks for that.

How much work would a zero-copy solution be? How much would you want for doing it? I could use extra RAM... I can see one would change the low_level_output code to reference the LWIP buffers directly, without using the ETH DMA buffers. However, I never managed to work out what controls LWIP's tx buffers. A huge amount of conflicting info online. One thing which makes some difference is mem_size, but there there are the pbufs (which I thought were used for RX only but I read stuff saying otherwise.

Thanks for the DHCP ideas. My code (implemented by someone else; I don't really understand it) came from the ST Cube examples. It seems to work ok. Do you think there is something wrong with it? It runs from an RTOS thread called "DHCP".

/**
  * @brief  LwIP_DHCP thread
  * @param  None
  * @retval None
  */
void kde_DHCP_task(void * pvParameters)
{
	uint8_t DHCP_state;
	char ip_string[16];
	DHCP_state = DHCP_START;
 
	ip_string[0] = 0;
 
	osDelay(1000);
 
	debug_thread("Starting DHCP task");
 
	for (;;) {
		osDelay(250);
 
		switch (DHCP_state) {
		case DHCP_START:
			netconf_netif.ip_addr.addr = 0;
 
			dhcp_start(&netconf_netif);
			DHCP_state = DHCP_WAIT_ADDRESS;
 
			debug_thread("Looking for DHCP server....");
			break;
 
		case DHCP_WAIT_ADDRESS:
			// Read the new IP address
			if (netconf_netif.ip_addr.addr != 0) {
				DHCP_state = DHCP_ADDRESS_ASSIGNED;
 
				// Stop DHCP
				dhcp_stop(&netconf_netif);
 
				netconf_ip_address_to_string(netconf_netif.ip_addr.addr, ip_string);
 
				debug_thread_printf("DHCP assigned address %s", ip_string);
				DHCP_state = DHCP_IDLE;
 
				// Indicate DHCP success in boot.txt
				// Indicate in boot.txt the static IP address being used due to DHCP being disabled
				file_write_boot_file(ip_string, "D");
 
				// end of DHCP process
			}
			else {
				struct dhcp *dhcp = (struct dhcp *) netif_get_client_data(&netconf_netif, LWIP_NETIF_CLIENT_DATA_INDEX_DHCP);
 
				// DHCP timeout
				debug_thread_buf("DHCP tries: ", &dhcp->tries, 1);
 
				if (dhcp && dhcp->tries > dhcp_retries) {
					DHCP_state = DHCP_TIMEOUT;
 
					// Stop DHCP
					dhcp_stop(&netconf_netif);
 
					// Static address used
					netif_set_addr(&netconf_netif, &ip_static, &ip_mask, &ip_gateway);
					dns_setserver(0, &ip_dns_server);
					DHCP_state = DHCP_IDLE;
 
					netconf_ip_address_to_string(ip_static.addr, ip_string);
 
					debug_thread_printf("DHCP timeout, static address %s", ip_string);
 
					// Indicate DHCP failure in boot.txt
					// Indicate in boot.txt the static IP address being used due to DHCP being disabled
					file_write_boot_file(ip_string, "F");
				}
			}
			break;
 
		case DHCP_IDLE:
		default:
			{
				// Get most recently recorded link status
				bool net_up = netif_is_link_up(&netconf_netif);
 
				// Read the physical link status
				ethernetif_set_link(&netconf_netif);
 
				// Has the link status changed
				if (net_up != netif_is_link_up(&netconf_netif)) {
					ethernetif_update_config(&netconf_netif);
 
					if (net_up) {
					   // Link was up so must have dropped
					   debug_thread("Ethernet link down");
					}
					else {
					   // Link was down so must be up - restart DHCP
					   debug_thread("Ethernet link up");
					   DHCP_state = DHCP_START;
					}
				}
 
				osDelay(1000);
			}
			break;
		}
	}
}

Maybe a mutex is needed somewhere. I see ethernetif_update_config is indeed called from another RTOS thread.

Thank you again.