2018-10-08 10:02 AM
Target device STM32F407VG
I am getting a hardfault that occurs only when not debugging.
Using RTC_BKP registers I have tracked the problem down to a function call that passes a pointer to a struct.
This struct contains a pointer to a dynamically allocated array and info about the array's size.
The function call in question is the last in a chain. I have tracked the address of the struct through the call chain and at the last moment it changes from 0x20001f54 (SRAM) to 0x100057c8 (CCMRAM). This only occurs when not debugging, and usually only after a power reset. When running with the debugger it executes correctly and as expected.
I am baffled.
I also have a deadline to meet.
If anyone has experienced this before and can shed some light onto what may be happening, I'd much appreciate it.
Cheers.
2018-10-08 10:20 AM
Stack overflow? A local variable on the stack being overwritten by an access to a global variable into which the stack has overflowed?
JW
2018-10-08 10:22 AM
What toolchain?
Using what memory for stack vs heap? RTOS tasks/threads?
When not attached to the debugger I use a USART for telemetry and output Hard Fault trap details.
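As a rough illustration of that kind of hard fault report (not necessarily this poster's code), the sketch below decodes the eight words a Cortex-M core stacks on exception entry and formats a few of them for a serial log. The names `fault_frame_t`, `fault_frame_read` and `fault_frame_format` are made up for this example, and the USART output routine is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Registers stacked by a Cortex-M core on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

/* Copy the stacked frame. On target, `sp` would be the MSP or PSP value
 * captured at the top of the HardFault handler (bit 2 of EXC_RETURN in LR
 * tells which of the two was in use). */
fault_frame_t fault_frame_read(const uint32_t *sp)
{
    fault_frame_t f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Format the most useful fields; returns the snprintf() result.
 * The caller would push `buf` out through a polled USART write (not shown). */
int fault_frame_format(const fault_frame_t *f, char *buf, size_t len)
{
    return snprintf(buf, len, "HardFault PC=%08lX LR=%08lX PSR=%08lX R0=%08lX",
                    (unsigned long)f->pc, (unsigned long)f->lr,
                    (unsigned long)f->xpsr, (unsigned long)f->r0);
}
```

The stacked PC identifies the faulting instruction, which can then be looked up in the map or listing file.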
Is there a bounty?
2018-10-08 05:03 PM
Your pointers are out of range... what did you do?
If you have poor syntax, changing from -O0 to -O2 can also cause similar faults.
You need professional help.
It looks to me like a pointer methodology issue.
Please post the errant code.
2018-10-09 06:06 AM
I'd go with Jan's suggestion.
Interrupts tend to "bang" on the stack, too. The CCM address may indicate a specific interrupt.
But the cause would be the stack size, nonetheless.
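One common way to confirm or rule out a stack-size problem is "stack painting": fill the unused stack with a known pattern at startup and later check how much of it survives. The sketch below shows the idea on an ordinary array. The pattern value and the function names are arbitrary choices for this example, and on a real target the painted region must lie below the current stack pointer.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xDEADBEEFu   /* arbitrary fill pattern */

/* Fill the (still unused) stack region with the pattern, once at startup. */
void stack_paint(uint32_t *lowest, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_PAINT;
    }
}

/* Count words, from the lowest address upward, that still hold the pattern.
 * A descending stack eats the region from the top down, so the result is
 * the headroom that was never touched. Zero means the stack reached (or
 * went past) the bottom of the region. */
size_t stack_untouched_words(const uint32_t *lowest, size_t words)
{
    size_t n = 0;
    while (n < words && lowest[n] == STACK_PAINT) {
        n++;
    }
    return n;
}
```

Reporting the untouched count periodically, or in the fault handler, gives a high-water mark that includes whatever the interrupts pushed.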
2018-10-10 04:20 AM
Thanks for your suggestions.
Atollic/GNU/stdlib, no RTOS or threading.
Stack/Heap are in SRAM. Build analyser reports 111KB free. There's about 200 bytes on the stack when the error occurs. I don't have a figure for heap size, but it idles at 1k and I've never seen it go much above 2k.
I've circumvented the problem by removing two global static structs from CCMRAM that were introduced in the first commit that this problem could be tracked back to.
There are several interrupts servicing peripherals, USB, I2C, USART, SPI, Systick, DMA. All application code runs from the main loop.
The structs I've placed back into SRAM are accessed somewhere in the offending call chain, but not by the interrupts.
Changing optimisation level has no effect.
Once I've completed the changes that were being blocked by this problem, I'll come back to it and find out if that address in CCMRAM pointed to anything in particular.
I'm still concerned mainly with how it is that the debugger is causing correct behaviour? Does a debug session affect the way the stack is used? Does it affect bus/ram/flash access or wait states?
If it is a stack problem, wouldn't it persist regardless of debug session or storage location of some globals? It would surely still overflow (or whatever it's doing) and still get the wrong address, read the wrong data and cause a hardfault when realloc() is called?
2018-10-10 08:33 AM
The debugger shares bus resources and may stall the processor, changing the points at which interrupts may occur and stopping clocks. It can access peripheral registers in a way that clears or changes status; this can be a particular problem for USART/SPI type peripherals, and for those with FIFOs, like SDIO or USB.
It's generally not as transparent as you might like to believe it should be.
The way I'd track flow is to instrument things, and have a bit flag where I can enable or increase/focus the level of detail. Then I'd focus on specific causes and eliminate others as potential players.
Would focus on interrupts, stacks, and overflow of auto/local variables.
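A minimal sketch of that kind of flag-controlled instrumentation, assuming output goes through `vprintf` (on target it would more likely be `vsnprintf` into a buffer sent over a USART). The category names and the `trace` functions are invented for this example.

```c
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>

/* Trace categories, one bit each (hypothetical set). */
#define TRACE_IRQ   (1u << 0)
#define TRACE_STACK (1u << 1)
#define TRACE_HEAP  (1u << 2)

static uint32_t trace_mask;   /* currently enabled categories */

void trace_enable(uint32_t cats)  { trace_mask |= cats; }
void trace_disable(uint32_t cats) { trace_mask &= ~cats; }

/* Emit a message only if its category is enabled.
 * Returns 1 if the message was emitted, 0 if it was filtered out. */
int trace(uint32_t cat, const char *fmt, ...)
{
    if ((trace_mask & cat) == 0u) {
        return 0;
    }
    va_list ap;
    va_start(ap, fmt);
    vprintf(fmt, ap);
    va_end(ap);
    return 1;
}
```

With this in place, `trace_enable(TRACE_IRQ | TRACE_STACK)` narrows the output to the suspects without rebuilding the rest of the instrumentation.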
2018-10-10 08:59 AM
How does the pointer you mentioned in the opening post, the one that changes unexpectedly, relate to those two structures?
---
I'd also try to use the debugger in the least intrusive mode possible. Unexpectedly changing memory content is a prime target for data breakpoints (even if they may be tricky to evaluate, because they fire at the writeback stage, which is several cycles after the offending instruction has executed).
Unfortunately, the "debugger" is in fact a chain of hardware and software layers, and the workings of the whole are not as clear in their minute details as one may wish; but I (maybe naively) believe that a naked gdb/openOCD/STLink chain would not try to touch any internals until instructed to do so. Eclipse (and thus Atollic), while it uses gdb, adds several layers, and I'd not trust them not to attempt "live probing" under various conditions.
JW
2018-10-17 06:30 AM
I've sussed it.
I've been trying to solve multiple errors as if they were one.
Firstly:
The change in the pointers in my original post was a red herring. The last function in the chain was being called from a different path later on in the call chain. The 'paper trail' I'd left in nvram wasn't recording values from this second entry point, giving the impression that the pointer had suddenly changed.
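A sketch of a breadcrumb trail that would have exposed the second entry point: every caller records its own site ID together with the pointer, so two different callers cannot be mistaken for one pointer changing. Here a plain array stands in for the 20 RTC backup registers (RTC_BKP0R to RTC_BKP19R on the F407); the layout and the name `crumb_record` are assumptions made for this example.

```c
#include <stdint.h>

#define BKP_REGS    20u                      /* RTC_BKP0R..RTC_BKP19R */
#define CRUMB_SLOTS ((BKP_REGS - 1u) / 2u)   /* 9 (site, pointer) pairs */

/* Host stand-in for the backup registers; on target this would be the
 * RTC backup register block. */
static uint32_t bkp[BKP_REGS];

/* Record which call site ran and which pointer it passed.
 * bkp[0] counts records; the pairs form a ring starting at bkp[1].
 * On a 32-bit target the pointer fits exactly; on a 64-bit host it is
 * truncated, which is fine for this demonstration. */
void crumb_record(uint32_t site_id, const void *ptr)
{
    uint32_t slot = bkp[0] % CRUMB_SLOTS;
    bkp[1u + 2u * slot] = site_id;
    bkp[2u + 2u * slot] = (uint32_t)(uintptr_t)ptr;
    bkp[0]++;
}
```

Calling `crumb_record()` with a distinct `site_id` at each entry to the final function makes the two paths show up as two records rather than as one value that appears to change.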
Secondly:
Even with my circumvention tactic, I was getting a hardfault on a different version of the hardware.
I was able to trace that to an uninitialised array index that was causing a fetch from an unmapped memory address.
The reason why I was getting crashes after a full power reset and not after debugging was not due to the debugger but the programming operation. It seems that when the chip is programmed from a hex file (not a dfu file), all ram addresses are reset to 0, which puts the uninitialised array index at a nice safe 0.
So, I still have a struct that crashes when it's placed in ccmram, but now I have a good idea as to the cause. It's almost certainly another uninitialised array index* which is being safely set to 0 when I debug, but is scrambled after a power reset. When the struct in question is in SRAM, the erroneous index is pointing somewhere in ram (due to SRAM's relatively larger size), but when in CCMRAM, it points outside of the mapped area and causes a hardfault.
(*except that the struct in question contains no arrays...)
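To illustrate the size argument, here is a small sketch that classifies an address against the two RAM regions of the STM32F407 (64 KB of CCM at 0x10000000, 128 KB of main SRAM at 0x20000000). The enum and function names are made up for this example; a check like this could guard a computed address before it is dereferenced.

```c
#include <stdint.h>

typedef enum { MEM_OTHER, MEM_CCM, MEM_SRAM } mem_region_t;

#define CCM_BASE  0x10000000u
#define CCM_SIZE  (64u * 1024u)
#define SRAM_BASE 0x20000000u
#define SRAM_SIZE (128u * 1024u)

/* Report which data RAM region (if any) an address falls in.
 * The unsigned subtraction wraps for addresses below the base, so a
 * single comparison covers both ends of each range. */
mem_region_t classify(uint32_t addr)
{
    if ((addr - CCM_BASE) < CCM_SIZE) {
        return MEM_CCM;
    }
    if ((addr - SRAM_BASE) < SRAM_SIZE) {
        return MEM_SRAM;
    }
    return MEM_OTHER;
}
```

Both addresses from the opening post are valid RAM (`0x20001f54` is in SRAM, `0x100057c8` is in CCM), but an offset of 64 KB or more from a CCM base already leaves CCM, while the same offset from the start of SRAM still lands in RAM.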
No stack overflows, problematic interrupts, bus collisions, or obscure hardware faults.
Much relief.
Thanks again for your input.
2018-10-17 08:32 AM
Final Piece Of The Puzzle
Static variables are initialised to zero by the startup script... unless they are placed in CCMRAM, in which case they hold whatever was in RAM at power-up, resulting in non-zero pointers to nowhere.
Best solution: change the startup script and linker script to erase CCMRAM
Easier solution: use memset to erase CCMRAM at the top of main()
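A minimal sketch of the clearing step, written as a word-wise loop like the one a startup file typically runs for .bss. The function name `zero_region` and the linker symbols `_sccmram` / `_eccmram` in the comment are hypothetical; the real symbol names depend on the project's linker script.

```c
#include <stddef.h>
#include <stdint.h>

/* Clear the half-open word range [start, end). */
void zero_region(uint32_t *start, const uint32_t *end)
{
    while (start < end) {
        *start++ = 0u;
    }
}

/* On target (assumed symbol names, exported by the linker script):
 *
 *     extern uint32_t _sccmram, _eccmram;
 *
 *     int main(void)
 *     {
 *         zero_region(&_sccmram, &_eccmram);   // before anything uses CCMRAM
 *         ...
 *     }
 *
 * Variables end up in CCMRAM via something like
 *     static my_struct_t s __attribute__((section(".ccmram")));
 * where the section name must match the linker script.
 */
```

Clearing from main() only suits variables that are meant to start at zero; anything in CCMRAM that needs a non-zero initial value would have to be copied from flash by the startup code instead.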