Mysterious UI Freeze

MitchellB · ‎2023-10-07

We have recently been seeing GUI freezes in one of our products. We have a process that creates logs and saves them to a CSV file stored on an SD card in our unit. During this save process, we have a handful of task-switching events between our frontend and our backend. The frontend shows a non-dismissible modal to indicate that a log is being saved, while the backend handles the actual saving and logging to the SD card. The backend also handles a large number of calculations that we’re doing during the save process.

We don’t have a surefire way to recreate the freeze. Instead, we have some automation set up within our Model.tick() to “click” on spots of our screen depending on a handful of flags that get set for the currently shown screen. This automation eventually manages to hit the freeze, but how long it takes varies: sometimes it happens within the first 10 or so iterations, and other times we can go through the entire process 100 times before seeing the freeze. We have also hooked up our automation to a handful of screens that look similar to the ones in the process that causes our freeze. So far this has not resulted in a freeze. UI-wise, there isn’t a whole lot of difference between these screens and the ones that cause the freeze. There is a good amount of difference in the backend, however, in the type of calculations we’re doing: they are less intensive than those in the process that ends up freezing.

We have explored several avenues that seemed similar here in the ST forums as well as elsewhere on the internet, but none of the conclusions in those threads have helped us. Our closest solution is this exploration. A coworker of mine recently posted in that thread, with the hopes of someone seeing it and giving us some insight on a potential solution since a lot of the communication seemed to have happened over private messages.

We have set up the FreeRTOS Hard Fault Handler, including writing the registers to our EEPROM so that we can read them back after recovering from the frozen state. We have also implemented a similar setup for the Bus Fault and Usage Fault handlers. None of the three handlers was hit during any of the freezes we have experienced.
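
For readers following along, here is a minimal sketch of the kind of register capture described above. It is not the poster's handler: `fault_record_t`, `FAULT_RECORD_MAGIC`, `fault_id` and `fault_capture()` are hypothetical names, and the EEPROM write itself is left out. It only shows copying the eight words a Cortex-M core stacks on exception entry (r0-r3, r12, lr, pc, xPSR) into a record that could be persisted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* The eight words a Cortex-M core stacks on exception entry, in order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

/* Hypothetical record layout for persistent storage (an assumption). */
typedef struct {
    uint32_t magic;       /* marks the record as written */
    uint32_t fault_id;    /* hypothetical: 1 = hard, 2 = bus, 3 = usage */
    fault_frame_t frame;  /* registers stacked when the fault was taken */
} fault_record_t;

#define FAULT_RECORD_MAGIC 0xFA17C0DEu  /* arbitrary marker value */

/* Build a record from the stacked frame that `stacked_sp` points at. */
static fault_record_t fault_capture(uint32_t fault_id,
                                    const uint32_t *stacked_sp)
{
    fault_record_t rec;

    assert(stacked_sp != NULL);
    rec.magic = FAULT_RECORD_MAGIC;
    rec.fault_id = fault_id;
    /* The frame is eight consecutive 32-bit words starting at stacked_sp. */
    memcpy(&rec.frame, stacked_sp, sizeof rec.frame);
    return rec;
}
```

On target, the handler would first have to pick MSP or PSP as `stacked_sp` (bit 2 of the EXC_RETURN value in LR says which stack was in use) before calling `fault_capture()` and handing the record to whatever EEPROM driver is available.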

We have four tasks in total that we swap between: our backend task, a frontend GUI task, a USB task, and an idle task. We start the backend task and the GUI task within our main, with the backend at an above-normal priority and the guiTask at a below-normal priority. The USBTask is started within the backend task initialization, but it is handled by the USB_HOST and USB_DEVICE middleware generated by CubeMX. The USB task has a higher priority than our GUI task but a lower one than the backend task, so we can try to guarantee that when a log is saved, it's saved before the GUI transitions to a new screen. At one point we disabled the USB task, and this has been the “best” solution we’ve found. The freeze continues to happen, but we only experienced it twice, on two out of ten units, over a period of 100+ hours. While this is promising, we are not confident enough in the “solution” to move forward with it, nor do we know exactly why this change has increased the time before a freeze, with limited performance slowdown.

We have several timer interrupts that handle peripherals: updating the time on the device, monitoring a sleep timer for when the device isn’t in use, monitoring our battery voltage, and checking whether we’re running on AC power instead. None of these is essential to the process that is ongoing when a freeze happens, and the freezes continue with them removed.

We also explored disabling DMA2D on our device, but this did not seem to be useful; the freeze continued. If anything, once we disabled DMA2D, the freeze started to happen less often than we had experienced up to that point. During this time, we also disabled the ART Accelerator within CubeMX, but that did not seem to help either, since we eventually got a freeze on the device.

Tl;dr: A device of ours is experiencing an anomaly that causes the GUI to freeze, preventing the user from continuing and requiring a full power restart. We have explored several possible solutions: disabling nonessential interrupts, disabling graphical enhancements, adding verbose fault handling, and manipulating tasks. None of them has seemed to help. We are still stumped on what is causing our freeze and are looking for guidance.

To expand on our setup:

TGFX 4.22
STM32CubeMX 6.7 (Built in 6.5, not merged to higher version)
STM32CubeIDE 1.6.1
CMSIS V1.02
FreeRTOS 10.2.1
FATFS Version R0.12c
STM32F746
DMA2D LTDC FMC
SDIO DMA
SDMMC1
Multiple I2C devices

Mohammad MORADI ESFAHANIASL · ‎2023-11-03

Hello @MitchellB ,

As far as I understood, you haven't experienced the freeze with the application with similar GUI but less calculation underneath. Could you explain a bit what those heavy calculations are? Are you getting data from USB or other peripherals or is it just data from inside the application (I mean from GUI)?
Plus, can you send a picture of the call stack when the freeze happens?
Thank you

Mohammad MORADI
ST Software Developer | TouchGFX

MitchellB · ‎2023-11-09

Hello @Mohammad MORADI ESFAHANIASL Thank you for inquiring!
We have had a lot of trouble getting our device to crash gracefully enough to capture the call stack, but we did manage to get something here...

along with a good register output...

r0 537068560
r1 537198488
r2 5
r3 80
r4 537007872
r5 3223280352
r6 2779096485
r7 537198416
r8 2779096485
r9 2779096485
r10 2779096485
r11 2779096485
r12 0
sp 0x2004ff50
lr 134441659
pc 0x8037562 <xQueueGiveFromISR+78>
xpsr 553648232
msp 537198416
psp 537099488
primask 0
basepri 80
faultmask 0
control 0
fpscr 1610612736
s0 -44.632
s1 0
s2 1.85324308e-011
s3 -6.17822647
s4 0
s5 0
s6 0
s7 0
s8 -3.79935045e-007
s9 3.18894601
s10 4.07432206e-023
s11 11.8740759
s12 1.85324308e-011
s13 1024
s14 -10
s15 0
s16 0
s17 0
s18 0
s19 0
s20 0
s21 0
s22 0
s23 0
s24 0
s25 0
s26 0
s27 0
s28 0
s29 0
s30 0
s31 -nan(0x7fffff)
d0 1.6097109858027061e-314
d1 -11112.063892723678
d2 0
d3 0
d4 56.185110663984005
d5 1965111.102615694
d6 9.4447345714403685e+021
d7 1.6008220200397196e-314
d8 0
d9 0
d10 0
d11 0
d12 0
d13 0
d14 0
d15 -nan(0xfffff00000000)

Sorry for the formatting...
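
A few of the raw values in the list above are easier to read once decoded. The sketch below is host-side C, not code from the project; the helper names are made up for illustration, and the 4 priority bits passed in are what the STM32F7 implements (`__NVIC_PRIO_BITS`).

```c
#include <assert.h>
#include <stdint.h>

/* Exception number being serviced: xPSR bits [8:0] (the IPSR field).
 * 0 means thread mode. */
static uint32_t xpsr_exception_number(uint32_t xpsr)
{
    return xpsr & 0x1FFu;
}

/* External interrupt index (IRQn) for an exception number, or -1 when the
 * number refers to thread mode or a system exception (numbers below 16). */
static int32_t exception_to_irqn(uint32_t exc)
{
    return exc >= 16u ? (int32_t)(exc - 16u) : -1;
}

/* Priority level held in BASEPRI when the device implements `prio_bits`
 * priority bits; the level lives in the upper bits of the low byte. */
static uint32_t basepri_level(uint32_t basepri, uint32_t prio_bits)
{
    assert(prio_bits >= 1u && prio_bits <= 8u);
    return (basepri & 0xFFu) >> (8u - prio_bits);
}
```

Applied to the dump: xpsr 553648232 is 0x21000068, so the exception number is 104, which is external interrupt 88 in the device header's IRQn_Type enum. sp equals msp (0x2004ff50), which fits handler mode. basepri 80 is 0x50, priority level 5 with 4 priority bits, so interrupts with a priority value of 5 or numerically higher were masked when the core was paused. r6 and r8 to r11 hold 2779096485 = 0xA5A5A5A5, the byte pattern FreeRTOS uses to fill task stacks at creation. None of this pinpoints the cause by itself, but it narrows down where the core was sitting.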

To give some context, what we were doing at the time of the freeze was a recursive quickSort of about a thousand double values. Specifically, we had set a breakpoint at the start of the quickSort and continued after it was hit, and then the device froze. Normally when this happens while we're connected to our debugger, the device's CPU just stops everything and we can no longer connect to the device via the debugger. For whatever reason, this crash stopped the device but didn't stop the CPU, so we were able to pause the debugger and see it paused with the above call stack. Resuming execution just results in the same location when pausing again.
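
The thread does not show the quickSort implementation, and nothing above establishes that stack usage is the cause. Still, recursion depth is cheap to rule out for a recursive sort running in a task with a fixed stack. The following is a sketch only, with hypothetical names (`quicksort`, `quicksort_range`, `qs_max_depth`): it recurses into the smaller partition and loops over the larger one, which keeps the depth at roughly log2(n) levels even for unfavourable input, and it records the deepest level reached.

```c
#include <assert.h>
#include <stddef.h>

/* Deepest recursion level reached by the most recent call to quicksort(). */
static unsigned qs_max_depth;

static void swap_d(double *a, double *b)
{
    double t = *a;
    *a = *b;
    *b = t;
}

/* Lomuto partition of v[lo..hi] around v[hi]; returns the pivot's index. */
static size_t partition(double *v, size_t lo, size_t hi)
{
    double pivot = v[hi];
    size_t i = lo;

    for (size_t j = lo; j < hi; ++j) {
        if (v[j] < pivot) {
            swap_d(&v[i], &v[j]);
            ++i;
        }
    }
    swap_d(&v[i], &v[hi]);
    return i;
}

/* Sort v[lo..hi] (inclusive). Only the smaller side is sorted by a nested
 * call; the larger side is handled by the next loop iteration. */
static void quicksort_range(double *v, size_t lo, size_t hi, unsigned depth)
{
    assert(lo <= hi);
    if (depth > qs_max_depth) {
        qs_max_depth = depth;
    }
    while (lo < hi) {
        size_t p = partition(v, lo, hi);

        if (p - lo < hi - p) {
            /* Left side is smaller: recurse left, continue on the right. */
            if (p > lo) {
                quicksort_range(v, lo, p - 1, depth + 1);
            }
            lo = p + 1;
        } else {
            /* Right side is smaller or equal: recurse right, continue left.
             * Here p > lo always holds, so p - 1 cannot wrap around. */
            if (p < hi) {
                quicksort_range(v, p + 1, hi, depth + 1);
            }
            hi = p - 1;
        }
    }
}

/* Sort n doubles in ascending order and reset the depth counter. */
static void quicksort(double *v, size_t n)
{
    qs_max_depth = 0;
    if (n > 1) {
        quicksort_range(v, 0, n - 1, 1);
    }
}
```

For 1000 values this stays within about 10 nested calls regardless of input order, whereas a plain two-way recursive quicksort with a last-element pivot can nest close to 1000 deep on already sorted data. Comparing `qs_max_depth` times the per-call stack frame size against the task's stack allocation gives a quick sanity check.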

The peripherals we are gathering our data from are a series of analog-to-digital converters. We do have USB connectivity and other peripherals, but at the time of the freeze, none of those should be active.

We do have a sysview crash report of another crash (not one connected to the above call stack and register list), but from what we can tell there isn't anything outrageous in it. If you still want to take a look, I can message you the file directly.

Thank you for responding!!

Mohammad MORADI ESFAHANIASL · ‎2023-11-20

Hello @MitchellB ,

Sorry for the late response. I took a look at the information that you provided, but unfortunately, I couldn't infer anything useful from it. What I suggest is to be cautious about the priority of your tasks, and to try not to use Model.tick() directly. Instead, try to use the handleTickEvent functions in your View classes. However, the MCU crash that you mentioned sounds far more serious than a simple GUI freeze, because you get disconnected from the debugger and the MCU stops working. So, there is a chance that you are asking for more performance than the STM32F7 can deliver.
Sorry that I can't help you more, but if you find something new or manage to solve the issue, please share it with us.
Thanks and good luck

Mohammad MORADI
ST Software Developer | TouchGFX