ST provided example "Nx_MQTT_Client" fails after running for up to a few minutes (can fail quickly or after a bit). nx_packet_allocate() fails to allocate memory so it looks like a memory stomp. How to fix?
I took the provided example and got it to work as is. As provided, it runs until 10 MQTT packets have been sent and received. Sometimes the provided example would fail within that normal 20 seconds of operation (by hanging or printing some unfindable log-report strings). Since I wanted to use this example as a boilerplate for an application, it was not OK to have it fail sometimes for no known reason. I delved...
I reduced the message timeout and switched the program to one of its option modes - run forever. That way I could replicate the issue without waiting for a long time. Now the program fails quite quickly and repeatedly within the first 1000 messages.
When it failed, the log message it was printing (or sometimes just trying to print) looks like this:
Number of allocated buffer for Rx and command answer 114
Number of free buffer 46
Number of command answer 45 , callback 66 , sum of both 111 (should match alloc && free)
Number of posted answer (callback + cmd answer) 114 , processed answer 113
I could not use the search in STM32CubeIDE to find portions of these strings (that would have been too easy). The source containing the strings is in:
as the body of a macro:
That means it's a problem with memory allocation for WiFi buffers, probably in AzureRTOS threads. AzureRTOS provides "TX_MINIMUM_STACK" (default 200 Bytes) and "ThreadX memory pool size" (default 1024 somethings (not specified)) in the "Middleware:THREADX" section in the "Nx_MQTT_Client.ioc" display. The problem does not appear to change when I set these values to something larger (2048 and 10240 respectively).
There's a second place to set memory parameters that might make a difference, but I've no idea if I should leave it alone. It's in the "Project Manager" tab of the same .ioc display. "Minimum Heap Size" defaults to "0x200" somethings (still units not specified) and "Minimum Stack Size" defaults to "0x400" somethings (also units not specified). Why the Heap would be set smaller than the Stack I don't know; I guess it's not expecting any user memory allocation (malloc()?).
The two stack sizes are quite different: 0x400 being 1024 Bytes vs 200 Bytes. The 0x400 may not refer to the threads at all, though; it may be a program-wide stack limit rather than the emulated thread stacks managed by Azure RTOS.
I guess I could fiddle with these numbers but can anyone let me know what's going on and what I should do to fix? Stumbling around is not meant to be the debugging method (though empirical programming is where I find myself :( ).
I tried to run the MQTT application available in the STM32U5 Firmware Package (version 1.1.1).
I just reduced the time between each pub to 10 and ran the test forever:
I reached more than 15000 messages sent and the problem never reproduced on my side.
Could you please:
Thank you so much for doing that. There is hope! Mine has failed anywhere from 300 to 9000 messages so far.
I'm running: STM32CubeIDE
Build: 14494_20230119_0724 (UTC)
on an Apple Mac M2 (same failure results when compiled on an Intel i9 Mac)
Workspace says 1.11.0 Nx_MQTT_Client, installed from scratch using:
"File->Import...", then "Next >", then
select "Import STM32Cube Example", then "Next >",
then (after new dialogs are rendered), Name: Nx_MQTT_Client, then
select line containing "B-U585I-IOT02A", followed by "Next >".
Then let the importer load with default location etc.
I don't know which version of AzureRTOS is included in this package but it's all coming from ST.com's example base.
***WARNING***... if you want to run multiple versions of the project in the same IDE, you cannot. If you rename an existing Nx_MQTT_Client to a safe new name, not all the project contents are moved to a safe hierarchy. Further, if you try to rename the project back again (to its original name) the project becomes corrupted and cannot be rebuilt. I've not reported this issue yet.
I upgraded board firmware (as per project instructions). en.x-wifi-emw3080b MXCHIP WiFi module firmware has been flashed on the board. (STM32U5 DK, 497-B-U585I-IOT02A-ND).
I don't know where to look to give you STM32u5 package details (if that's more than the part number, above). Nor do I know where to look for the current package AzureRTOS version. I'm happy to hunt that down if you could let me know where to look.
Case #00171660 has been opened. Anis B. is trying to reproduce the issue. The case has a description of minimal example-package changes to get it to fail. It sounds like you've performed those steps and not reproduced the issue. Are you working with Anis B on this?
I can provide a project hierarchy if you like, as a link to an iCloud .zip to download. It's a Mac version though. I'm about to set up a Windows 11 IDE and can provide a .zip of that if you like (once I've shown whether it fails too). Just 2 lines changed in total, plus the WiFi settings (WIFI_SSID, WIFI_PASSWORD) in the mx_wifi_conf.h file.
File: app_netxduo.h
Around line 74
From: #define NB_MESSAGE 10 /* if NB_MESSAGE = 0, client will publish messages infinitely */
To: #define NB_MESSAGE 0 /* if NB_MESSAGE = 0, client will publish messages infinitely */
File: app_netxduo.c
Around line 690
From: /* Delay 1s between each pub */ tx_thread_sleep(100);
To: /* Delay 50ms between each pub */ tx_thread_sleep(5);
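The two delays above imply a tick rate of 100 ticks per second (100 ticks = 1 s, 5 ticks = 50 ms). Here is a minimal sketch of that conversion, in case it helps anyone pick other delays. ms_to_ticks() and ASSUMED_TICKS_PER_SECOND are hypothetical names, not part of the ST example; in a ThreadX build the rate would come from TX_TIMER_TICKS_PER_SECOND, which I'm assuming is left at 100 here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed tick rate, inferred from the example's own comments
 * (tx_thread_sleep(100) is documented there as a 1 s delay). */
#define ASSUMED_TICKS_PER_SECOND 100u

/* Hypothetical helper: convert milliseconds to timer ticks, rounding up
 * so that a short non-zero delay never turns into a zero-tick sleep. */
uint32_t ms_to_ticks(uint32_t ms, uint32_t ticks_per_second)
{
    assert(ticks_per_second > 0u);
    /* 64-bit intermediate avoids overflow for large ms values. */
    uint64_t ticks = ((uint64_t)ms * ticks_per_second + 999u) / 1000u;
    return (uint32_t)ticks;
}
```

With this, the modified line would read something like tx_thread_sleep(ms_to_ticks(50, ASSUMED_TICKS_PER_SECOND)); which makes the intended delay visible in the source instead of a bare tick count.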
Are you directing the test messages at the public Mosquitto test broker? I pointed at my own LAN's mosquitto broker to avoid DOSing their server.
Thanks again for your effort.
I loaded a fresh Windows 11 with the latest STM32CubeIDE and imported the current Nx_MQTT_Client example. Changes as detailed above (plus WIFI_SSID and WIFI_PASSWORD in mx_wifi_conf.h). Downloaded the compiled image to STM32U5 p/n 497-B-U585I-IOT02A-ND. It ran really well until it got to around 9000 messages and then ***BOOM***.
That 9000 messages is not cast in concrete, it really can crash earlier or later.
Please let me know if I can help with more info or tests.
Thanks again for your help - I'm dependent on this board being stable using RTOS/WiFi/MQTT/Encryption.
New board ordered from Digi-key. Just in case it turns out to be a board WiFi hardware issue. It'll arrive in a few days. I'll let you know if the issue moves to the new board.
New board arrived - same product number. Same issue on both boards. Looks like it's not a hardware problem and not an AppleOSX vs Windows 11 issue. We're down to ST software and board-firmware revision issues. I'm flailing around trying to find firmware and software updates to make the instability go away. Are you able to start with the example Nx_MQTT_Client package downloaded using STM32CubeIDE and let me know what software and firmware packages you've loaded from that base point? Are you also able to run the example package on B-U585I-IOT02A hardware?
I really don't know how else to proceed.
Replicated with a different ST example program, Nx_SNTP_client. It exercises WiFi very little: it uses WiFi to get an initial timestamp, then passes that to the on-board Real Time Clock. After that the WiFi thread is still active but only exchanges "keep-alive" traffic. After 43 hours of reporting timestamp output to USB-serial, the following appeared:
20-02-2023 / 15:51:09
20-02-2023 / 15:51:10
Error: command 0x010e timeout(10000 ms) waiting answer 2323
20-02-2023 / 15:51:11
20-02-2023 / 15:51:12
20-02-2023 / 15:51:13
.
.
20-02-2023 / 15:52:10
20-02-2023 / 15:52:11
20-02-2023 / 15:52:12
wait flow timeout 0
20-02-2023 / 15:52:13
20-02-2023 / 15:52:14
"waiting answer" from mipc_request() in mx_wifi_ipc.c (line 267)
"wait flow timeout" from process_txrx_poll() in mx_wifi_spi.c (lines 390 and 458)
This is identical to one of the long-term failure modes in Nx_MQTT_Client, where subsequently WiFi no longer functions until the board is reset.
There are two independent WiFi driver faults. Neither is solved, but more information has been gathered.
Fault Type #1, which shows:
SPI length invalid: 1552-0
Invalid SPI type 00
then invalidates the established WiFi connection and shows:
Error: command 0x010e timeout(10000 ms) waiting answer <an incrementing decimal number>
wait flow timeout 0
regularly every minute or so.
Both boards were failing at the same time, even when they were restarted at significantly different times. This indicated they were responding to an external stimulus. I switched one of the boards to another SSID (different WiFi manufacturer, same LAN) and the board on the new SSID has not yet produced Fault Type #1, though the board that stayed on the old SSID continued to exhibit Fault #1 every so often (from minutes to hours between episodes) after board reset.
Fault #1 occurs on EERO v6 WiFi equipment, while it has not yet occurred on Verizon G3100/1104/220.127.116.11 WiFi equipment. Cisco acknowledged similar firmware driver bugs in 2014, which they subsequently fixed (https://www.cisco.com/c/en/us/support/docs/security-vpn/ipsec-negotiation-ike-protocols/115801-technote-iosvpn-00.html). Connected equipment is meant to recover from this "Invalid SPI" issue. Apparently the STM32 WiFi driver does not.
Fault Type #2, which shows:
Number of allocated buffer for Rx and command answer 2640
Number of free buffer 433
Number of command answer 432 , callback 2205 , sum of both 2637 (should match alloc && free)
Number of posted answer (callback + cmd answer) 2640 , processed answer 2639
repeated as fast as my screen can scroll. Number batches (shown above) are different each episode but stay the same during an episode.
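For anyone comparing episodes, here is a small sketch that does the arithmetic on the counters from the two reports quoted in this thread (the 114/46/45/66 one near the top and the 2640/433/432/2205 one here). The struct and function names are hypothetical and only mirror the wording of the log lines; I'm not claiming this is how the driver tracks them internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counter values exactly as printed in one episode of the log. */
typedef struct {
    uint32_t allocated;   /* "allocated buffer for Rx and command answer" */
    uint32_t free_count;  /* "free buffer" */
    uint32_t cmd_answers; /* "command answer" */
    uint32_t callbacks;   /* "callback" */
    uint32_t posted;      /* "posted answer (callback + cmd answer)" */
    uint32_t processed;   /* "processed answer" */
} wifi_buf_counters;

/* Allocated buffers minus the logged "sum of both" (cmd answers + callbacks). */
int32_t unaccounted_buffers(const wifi_buf_counters *c)
{
    assert(c != NULL);
    return (int32_t)c->allocated - (int32_t)(c->cmd_answers + c->callbacks);
}

/* Answers posted but not yet reported as processed. */
int32_t pending_answers(const wifi_buf_counters *c)
{
    assert(c != NULL);
    return (int32_t)c->posted - (int32_t)c->processed;
}

/* The two episodes quoted in this thread. */
const wifi_buf_counters episode_a = {114u, 46u, 45u, 66u, 114u, 113u};
const wifi_buf_counters episode_b = {2640u, 433u, 432u, 2205u, 2640u, 2639u};
```

Both episodes come out the same way: allocated exceeds the "sum of both" by 3, and posted exceeds processed by 1, even though the absolute numbers differ a lot. That is only arithmetic on the printed values; what the driver expects those differences to be is something ST would have to confirm.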
Fault Type #2 can be induced on both SSID WiFi equipment here (so it is not external-equipment dependent). The easiest/fastest way to replicate the fault (probably a memory stomp) is by using two boards connecting to a Mosquitto Broker using the same string "Client_ID" (parameter #3 to nxd_mqtt_client_create(...);). I put the boards in a 2-second reconnect loop every time Mosquitto disconnects. As soon as Mosquitto sees a second connect with the same Client_ID, it disconnects the old connection so with two boards in a reconnect-loop they bounce connect/disconnect against each other. The loop has:
nxd_mqtt_client_secure_connect(...); followed by nxd_mqtt_client_subscribe(...);
on every disconnect. It takes only a few seconds to put one of the boards into its memory-stomp loop, while the other continues working just fine. The fault does not stick to a specific board: one or the other shows it, at random, within a few seconds.
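To make the bounce mechanism concrete, here is a toy model of the broker-side rule described above: a new connect with a Client_ID that is already in use displaces the existing connection. This is only an illustration under my own assumptions; toy_broker, toy_broker_connect() and the table sizes are hypothetical and have nothing to do with Mosquitto's actual source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SESSIONS 8
#define MAX_ID_LEN   32

/* One entry in the toy session table. */
typedef struct {
    char id[MAX_ID_LEN]; /* Client_ID string */
    int  conn;           /* connection handle, 0 = slot unused */
} session;

typedef struct {
    session slots[MAX_SESSIONS];
} toy_broker;

/* Register connection `conn` (> 0) under client_id.
 * Returns the handle of the displaced connection (the one a broker
 * would disconnect), 0 if nothing was displaced, or -1 if the
 * arguments are unusable or the table is full. */
int toy_broker_connect(toy_broker *b, const char *client_id, int conn)
{
    session *free_slot = NULL;

    assert(b != NULL && client_id != NULL);
    if (conn <= 0 || strlen(client_id) >= MAX_ID_LEN) {
        return -1;
    }
    for (size_t i = 0; i < MAX_SESSIONS; i++) {
        session *s = &b->slots[i];
        if (s->conn != 0 && strcmp(s->id, client_id) == 0) {
            int old = s->conn; /* same Client_ID: take over the session */
            s->conn = conn;
            return old;
        }
        if (s->conn == 0 && free_slot == NULL) {
            free_slot = s;     /* remember the first empty slot */
        }
    }
    if (free_slot == NULL) {
        return -1;
    }
    strcpy(free_slot->id, client_id); /* length was checked above */
    free_slot->conn = conn;
    return 0;
}
```

With two boards sharing one Client_ID, every connect from board 2 returns board 1's handle as displaced, and the next reconnect from board 1 displaces board 2 in turn, which is the ping-pong the 2-second reconnect loop produces. A board with its own Client_ID lands in a separate slot and displaces nobody.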