Extremely strange behavior running on an STM32G0B1

JKaz.1 · ‎2022-02-06

Ok, I'm at a complete loss here so I'm looking for any ideas or suggestions. We have a code base that runs on top of FreeRTOS, and we have versions of the code that run on several different STM32L4s and on the STM32H753. I ported the code base to the G0 and immediately started running into weird HardFaults that don't make any sense. Walking back up the stack looks completely normal. It looks like random memory corruption, but I can't explain it. I've run a lot of tests and an absurd amount of diagnostic code... but you shouldn't need to know any of that for what I'm posting about here, because it looks like a strange compiler error.

If I execute the following code, the system currently crashes (if I move things around in unrelated parts of flash/ram, then this code will suddenly start to work without issue).

std::string StringOp::sizeToString(size_t sz)
{
	std::string sout("");
	do
	{
		//__asm volatile("NOP");
		char digit = (char)(sz % 10) + '0';
		sout = digit + sout;
		sz /= 10;
	} while (sz);
	return sout;
}

When I debug it and look at the assembly, the relevant part looks like this:

do
	{
		//__asm volatile("NOP");
		char digit = (char)(sz % 10) + '0';
0x0802e2c0  ldr r3, [r7, #0] 
0x0802e2c2  movs r1, #10 
0x0802e2c4  movs r0, r3 
0x0802e2c6  bl 0x80004d8 <__aeabi_uidivmod> 
0x0802e2ca  movs r3, r1 
0x0802e2cc  uxtb r2, r3 
0x0802e2ce  movs r1, #39	; 0x27 
0x0802e2d0  adds r3, r7, r1 
0x0802e2d2  adds r2, #48	; 0x30 
0x0802e2d4  strb r2, [r3, #0] 
		sout = digit + sout;

See that branch instruction? When I step into it I don't land at 0x80004d8. I land at 0x080012c8, which is not the right address, so I crash when I try to return and it pops too many things off the stack.

See that NOP assembly instruction that's commented out? If I uncomment that, it simply hard faults when it tries to jump to 0x80004d8. If I instead put a NOP at the very beginning of the function, it executes the entire thing just fine.

This code is running in a task and is not in an ISR. This exact same code base runs fine on two other STM32 processors from different families (well, like 90% the same, hardware layer is different but all this higher level stuff is the exact same code).

Can anyone think of anything that would cause the processor to go off into the weeds like this? The placement of NOP commands affects whether or not it functions properly, and I'm at a loss for words on what that could be. This is on custom hardware so could something there be causing an issue? Bad oscillator signal/capacitance/resistance? Unstable 3.3V rail? I am going to port this to a Nucleo board later this week but I was wondering if anyone could think of something I should check.

Also, I am compiling it using GCC 7.3 but I updated to the latest 10 and it still broke. I built the hardware init files using the latest version of CubeMX (6.4) and have compared my initialization code against the examples in the hardware framework that I'm using (the latest, 1.5)... Anyone got any ideas on where I should poke?

Thanks!

SHuds.2 · ‎2022-07-25

To add on in case anyone else has this issue: Section 2.2.10 here (STM32G0B1xB/xC/xE device errata - Errata sheet) says that prefetch on the STM32G0B1 has a tendency to hard fault if dual bank is enabled. Not sure if this is the main issue you are experiencing here, but I was having unexplained crashes for weeks before disabling prefetch. That document at least puts it in writing that it's a known issue.


waclawek.jan · ‎2022-02-06

Incorrectly set FLASH latency?

The major difference between Cortex-M0+ and 'M4/'M7 is the requirement to have *data* properly aligned, but if you truly single-step in disasm and are sure that the crash occurs on the branch *instruction* (i.e. not in code after the branch, which could happen if you "step over" rather than "step into"), then it's not data related.

JW

Antoine Odonne · ‎2022-02-07

Hi,

The fact that you branch to this inconsistent address is worth investigating. Can you decode the instruction and check where it should lead? If the address is legitimate code, look at the linker or tool-level configuration; otherwise it may indeed be related to device configuration.
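
Decoding the branch by hand can be sketched as below. This is a minimal sketch, assuming the instruction is a Thumb BL in the standard two-halfword encoding (first halfword 11110 S imm10, second halfword 11 J1 1 J2 imm11); the function name is hypothetical and the halfwords would have to be read from the memory view or the .lss file.

```cpp
#include <cstdint>

// Decode the target of a Thumb BL from its two halfwords.
// `addr` is the address of the BL instruction itself.
// Returns 0 if the halfwords are not a BL encoding.
std::uint32_t decodeThumbBlTarget(std::uint32_t addr,
                                  std::uint16_t hw1,
                                  std::uint16_t hw2)
{
    // hw1 = 11110 S imm10, hw2 = 11 J1 1 J2 imm11
    if ((hw1 & 0xF800u) != 0xF000u || (hw2 & 0xD000u) != 0xD000u) {
        return 0;
    }
    const std::uint32_t s     = (hw1 >> 10) & 1u;
    const std::uint32_t imm10 = hw1 & 0x03FFu;
    const std::uint32_t j1    = (hw2 >> 13) & 1u;
    const std::uint32_t j2    = (hw2 >> 11) & 1u;
    const std::uint32_t imm11 = hw2 & 0x07FFu;

    // I1 = NOT(J1 XOR S), I2 = NOT(J2 XOR S)
    const std::uint32_t i1 = (~(j1 ^ s)) & 1u;
    const std::uint32_t i2 = (~(j2 ^ s)) & 1u;

    // 25-bit signed byte offset: S:I1:I2:imm10:imm11:0
    std::uint32_t offset = (s << 24) | (i1 << 23) | (i2 << 22) |
                           (imm10 << 12) | (imm11 << 1);
    if (s != 0u) {
        offset |= 0xFE000000u; // sign-extend to 32 bits
    }
    // The PC reads as the instruction address plus 4 in Thumb state.
    return addr + 4u + offset;
}
```

If the value decoded from the bytes actually in flash matches the address shown in the disassembly, the linker output is probably fine and the problem lies in how the instruction is fetched or executed.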

Regards,

Antoine

JKaz.1 · ‎2022-02-07

I did check the flash latency after posting this, and it's set to 2. PWR_REGULATOR_VOLTAGE_SCALE1 is used. I'm less sure what that means, but I think that's right? I built the clock configuration using CubeMX and confirmed that what was in the .c file matched what I put in CubeMX. I was running at 60 MHz, but I slowed it down to 48 MHz and it was magically able to get through this function without crashing... it just went back to its normal way of crashing, which was to randomly fault at some time during execution (this is the state it's been in for most of the last 3 weeks that I've spent debugging it).
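
As a sanity check on the latency value, here is a minimal sketch of the wait-state lookup. The thresholds are assumptions: they are the figures as I recall them from the STM32G0 reference manual (RM0444) and should be verified there. The enum and function names are hypothetical.

```cpp
#include <cstdint>

enum class VoltageRange { Range1, Range2 };

// Minimum flash wait states for a given HCLK frequency.
// Thresholds are assumed from RM0444; check them against the manual.
// Returns -1 if the frequency exceeds the maximum for the range.
int minFlashWaitStates(std::uint32_t hclkHz, VoltageRange range)
{
    if (range == VoltageRange::Range1) {
        if (hclkHz <= 24000000u) return 0;
        if (hclkHz <= 48000000u) return 1;
        if (hclkHz <= 64000000u) return 2;
        return -1;
    }
    // Range 2 (low-power range)
    if (hclkHz <= 8000000u) return 0;
    if (hclkHz <= 16000000u) return 1;
    return -1;
}
```

Under those assumed thresholds, a latency of 2 in Range 1 covers both 60 MHz and 48 MHz, which would suggest the wait-state setting itself is not what is wrong.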

Yup, I am stepping into, rather than over. If I step over, the next place the processor goes is the Hard Fault. If I step into, I go to that weird address and it will actually continue to run a fair number of instructions, it just breaks when it tries to exit the function and pop things off the stack. Since it didn't enter at the correct place, it pops a value from the stack that it thinks should be the new Program Counter and it faults.

For the data alignment, do you know if it's half-word or word aligned? Also, the compiler should take care of that for RAM allocation, correct? As long as I don't try to index an array at an invalid location or access memory directly, then I should be OK, right? I don't have to worry about declaring a uint8_t and then another uint8_t and not being able to access the second one because it's not data aligned?
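
On the alignment question: as far as I know, the Cortex-M0+ (ARMv6-M) faults on any unaligned halfword or word access, so a 16-bit access needs 2-byte alignment and a 32-bit access needs 4-byte alignment, while single bytes have no constraint. The compiler places ordinary variables correctly, so two consecutive uint8_t variables are fine; the usual risk is casting a byte pointer into a buffer to a wider type. A minimal sketch of the common workaround, with a hypothetical helper name:

```cpp
#include <cstdint>
#include <cstring>

// Read a 32-bit value from an arbitrarily aligned address.
// memcpy lets the compiler emit byte accesses where needed instead of
// a single word load that would fault on an unaligned address.
std::uint32_t loadU32Unaligned(const void* p)
{
    std::uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Single bytes never have an alignment requirement.
static_assert(alignof(std::uint8_t) == 1, "uint8_t is byte aligned");
```

By contrast, something like `*reinterpret_cast<const std::uint32_t*>(buf + 1)` is the kind of access that can hard fault on this core.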

I checked the .lss file, and the address shown in the assembly (0x80004d8) is in fact the valid address of the function it is trying to call. If I slow down the clock speed from 60MHz to 48MHz (or if I put that NOP at the top of the function) it does in fact properly jump to that address and do all the things it's supposed to do.

The changing of clock speeds and the insertion of a NOP makes this sound like some sort of race condition to me, but I have no idea what could cause an incorrect address jump like that...

waclawek.jan · ‎2022-02-07

What's the primary clock source? What is VDD voltage?

Physically check the connection of all VDD and GND pins (look for bad solder joints) - yes, testing on a Nucleo is a good idea.

Read out and check/post the contents of the RCC and FLASH registers.
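
For the FLASH side, a minimal sketch of picking apart a raw FLASH_ACR value read out in the debugger. The bit positions are assumptions taken from my reading of the STM32G0 reference manual (LATENCY in bits 2:0, PRFTEN in bit 8, ICEN in bit 9), and the struct and function names are hypothetical.

```cpp
#include <cstdint>

struct FlashAcrFields {
    unsigned latency;      // wait states, FLASH_ACR bits [2:0]
    bool prefetchEnabled;  // PRFTEN, bit 8 (assumed position)
    bool icacheEnabled;    // ICEN, bit 9 (assumed position)
};

// Split a raw FLASH_ACR value into the fields of interest here.
FlashAcrFields decodeFlashAcr(std::uint32_t acr)
{
    return FlashAcrFields{
        static_cast<unsigned>(acr & 0x7u),
        (acr & (1u << 8)) != 0u,
        (acr & (1u << 9)) != 0u,
    };
}
```

For example, a raw value of 0x00000302 would decode to latency 2 with both prefetch and instruction cache enabled.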

JW

gbm · ‎2022-02-07

Looks like the task stack is too small, or the heap used by the C++ libraries is too small (the one set in CubeMX, not the FreeRTOS heap).

My STM32 stuff on github - compact USB device stack and more: https://github.com/gbm-ii/gbmUSBdevice

JKaz.1 · ‎2022-02-07

I have overridden malloc, calloc, free, new and delete to call the FreeRTOS equivalents, so all memory management should be going through the FreeRTOS heap (with the exception of the main task, which runs off of the end of memory, so that's the gcc/CubeMX stack). I've turned on all the heap overflow diagnostic stuff that FreeRTOS has, as well as used our own memory diagnostic code, and as far as I can tell, the task with the smallest amount of stack remaining still has a few hundred bytes left to play with. The main FreeRTOS heap has around 17K left. I don't think it's a simple stack overrun, buuuuuut at this point I'm honestly not ruling anything out. I spent multiple days looking at that stuff, but that doesn't mean I didn't miss something. I'll try just throwing more RAM at the problem from the CubeMX configuration (and linker) and see what happens. It can't possibly hurt at this point. I can also set all of memory to a known byte pattern and then check it after a crash, to see if that turns up an especially high watermark someplace I'm not expecting.
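
The byte-pattern idea can be sketched as follows. This is a minimal sketch with hypothetical names, assuming a full-descending stack (as on Cortex-M) so that the untouched part of a region is at its lowest addresses; 0xA5 is chosen because FreeRTOS uses the same fill byte for its own uxTaskGetStackHighWaterMark() check.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::uint8_t kFillByte = 0xA5;

// Fill a region with the known pattern before it is put to use.
void paintRegion(std::uint8_t* base, std::size_t len)
{
    std::memset(base, kFillByte, len);
}

// Count the bytes, scanning up from the lowest address, that still hold
// the pattern. For a descending stack this is the never-used headroom.
std::size_t untouchedBytes(const std::uint8_t* base, std::size_t len)
{
    std::size_t n = 0;
    while (n < len && base[n] == kFillByte) {
        ++n;
    }
    return n;
}
```

Running untouchedBytes() over each region after a crash gives the remaining headroom; a result near zero points at an overrun. The count can be slightly optimistic if the code happened to write 0xA5 at the boundary.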

The primary clock source is an external oscillator running at 16MHz. I have the PLLs configured to run the main system at 60MHz. I need to verify that the correct oscillator got installed (though this occurs on multiple boards so it's probably right). I also need to check the caps around it. According to the schematic and the datasheet what we have should be valid but double checking everything at this point... I'll read/check/post the RCC and FLASH registers in a few days, leaving in about an hour to catch a flight.
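
For checking the clock tree by hand, a minimal sketch of the PLL arithmetic, assuming the usual STM32 relationship f_VCO = f_in * N / M and SYSCLK = f_VCO / R. The function name is hypothetical, and the divider values in the usage note are illustrative only, not the poster's actual configuration.

```cpp
#include <cstdint>

// PLL "R" output frequency for a given input and divider/multiplier set.
// Returns 0 for an invalid (zero) divider.
std::uint32_t pllOutputHz(std::uint32_t inputHz,
                          std::uint32_t m,
                          std::uint32_t n,
                          std::uint32_t r)
{
    if (m == 0u || r == 0u) {
        return 0u;
    }
    // 64-bit intermediate avoids overflow of inputHz * n.
    const std::uint64_t scaled = static_cast<std::uint64_t>(inputHz) * n;
    return static_cast<std::uint32_t>(scaled / (static_cast<std::uint64_t>(m) * r));
}
```

With a 16 MHz input, M = 1, N = 15, R = 4 would give 60 MHz, and N = 12 with the same dividers would give 48 MHz. Comparing the computed value with what is measured on the MCO pin is one way to confirm the oscillator and PLL settings agree.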

This board/design almost certainly didn't go through a full and proper DVT, so there could be some electrical wonkiness messing with CPU function, so checking all VDD pins is high on my list when I get back... though probably after porting it to the nucleo board since that should eliminate any weird hardware stuff.

Thanks everyone for all your suggestions, this is driving me and the other firmware guy slowly insane!

Antoine Odonne · ‎2022-02-08

Hi,

Maybe you can monitor your system clock using MCO on PA8? Or check whether your code executes and runs OK on another chip?

I would be curious of the setting for Prefetch and Cache as well... I don't expect it would matter here, but since the failure seems to be related to jump alignment, who knows.

Your config is OK otherwise: 2 WS for PWR voltage scale Range 1 (the standard voltage for the digital blocks in the circuit).

Best regards,

Antoine

JKaz.1 · ‎2022-02-10

Unfortunately I haven't had the time to port it to the nucleo board yet but I just wanted to post that I have both PREFETCH_ENABLE and INSTRUCTION_CACHE_ENABLE enabled.

If I turn off PREFETCH_ENABLE and leave INSTRUCTION_CACHE_ENABLE on, it seems to work properly. The other firmware guy ran a test for 15 hours and it never crashed. I think the most successful test I've had to date was when I slowed the processor speed down from 60MHz to 48MHz, and that lasted 12 minutes. So this looks promising, which is great, but solutions like this irk me. I can't actually tell if I fixed the problem or just band-aided it and it'll crop up again later when I change the code in some way it doesn't like...
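
At register level, turning prefetch off while leaving the instruction cache alone amounts to clearing one bit. A minimal sketch, assuming PRFTEN is bit 8 of FLASH_ACR as I read it in the STM32G0 reference manual; the constant and function names are hypothetical.

```cpp
#include <cstdint>

// Assumed position of the PRFTEN bit in FLASH_ACR.
constexpr std::uint32_t kFlashAcrPrften = 1u << 8;

// Return the FLASH_ACR value with prefetch disabled and every other
// field (latency, instruction cache enable) left unchanged.
constexpr std::uint32_t withPrefetchDisabled(std::uint32_t acr)
{
    return acr & ~kFlashAcrPrften;
}
```

On the target this would be applied as `FLASH->ACR = withPrefetchDisabled(FLASH->ACR);`. Setting PREFETCH_ENABLE to 0 in the HAL configuration header should have the same effect, since as far as I know HAL_Init() only enables the prefetch buffer when that macro is nonzero.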
