2024-06-13 09:29 AM
Hi experts,
I'm adding a function that stores a word in flash to my existing code, which runs on an STM32F105.
I basically copied the code from the official HAL-based example. Here is my function:
uint32_t FLASH_Write_Data(uint32_t StartPageAddress, uint32_t *Data, uint16_t NWords)
{
    static FLASH_EraseInitTypeDef EraseInitStruct;
    uint32_t PageError;
    /* Unlock the Flash memory to enable the flash control register access */
    HAL_FLASH_Unlock();
    /* Erase the FLASH area */
    EraseInitStruct.TypeErase = FLASH_TYPEERASE_PAGES;
    EraseInitStruct.PageAddress = FLASH_USER_START_ADDR;
    EraseInitStruct.NbPages = (FLASH_USER_END_ADDR - FLASH_USER_START_ADDR) / FLASH_PAGE_SIZE;
    if (HAL_FLASHEx_Erase(&EraseInitStruct, &PageError) != HAL_OK)
    {
        /* Error occurred while page erase. */
        return HAL_FLASH_GetError();
    }
    /* Program the user FLASH area word by word */
    uint32_t i = 0;
    while (i < NWords)
    {
        if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, StartPageAddress, Data[i]) == HAL_OK)
        {
            StartPageAddress += MEMORY_OFFSET;
            i++;
        }
        else
        {
            /* Error occurred while writing data in Flash memory */
            return HAL_FLASH_GetError();
        }
    }
    HAL_FLASH_Lock();
    return HAL_OK;
}
The very first strange thing is that as soon as I start debugging, the program counter often jumps to the HardFault_Handler function.
From there, I can reset the chip and restart the debug session without any apparent problem.
However, when I execute the program, it consistently ends up in the HardFault_Handler function.
If I comment out the flash write function, the program works as expected.
I started commenting out part of the flash function code, and I noticed that when the HAL_FLASHEx_Erase and the HAL_FLASH_Program functions are not executed, the program continues to work as expected.
I stepped through the code and didn't notice anything obviously wrong, as I can erase and even write the passed value to the passed memory address without any immediate error!
The only thing that happens before the program counter jumps into the HardFault_Handler function is that, after several instructions, the IMPRECISERR flag of the CFSR register is set.
I tried moving the flash function into the main code, just after SystemClock_Config or just before the while(1), but surprisingly, the instruction where the IMPRECISERR flag is set does not change.
The point where this happens is the closing brace (yes, the function's closing "}") of this function:
bool Shaft_Measure(void)
{
    static int32_t aShaft_old;
    int32_t aShaft;
    msg.a_shaft = (int16_t)aShaft;
    if (ABS(aShaft - aShaft_old) > 1){
        bRead_speed = true;
    }
    else{
        bRead_speed = false;
    }
    aShaft_old = aShaft;
    return bRead_speed;
}
I'm sure that the problem is not there, but unfortunately I don't know how to proceed to identify the issue.
The MCU I'm using is the STM32F105RC, which has 256 kB of flash, and the value I'm trying to save is just a uint32_t number.
The constants used by the flash functions are defined in my flash.h, reported here:
#define FLASH_ADDR_PAGE_127 ((uint32_t)0x0803F800)
#define FLASH_USER_START_ADDR FLASH_ADDR_PAGE_127
#define FLASH_USER_END_ADDR (FLASH_ADDR_PAGE_127 + FLASH_PAGE_SIZE)
#define MEMORY_OFFSET ((uint32_t)0x4U)
These addresses have been copied from the device datasheet.
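As a cross-check on the constants above, a page's start address can also be derived from the flash base and the page size instead of being copied by hand. The sketch below assumes a 2 KB page size, which is what the address 0x0803F800 implies for page 127, and uses hypothetical DEMO_* and demo_* names that are not part of the HAL.

```c
#include <stdint.h>

/* Assumed values: main flash starts at 0x08000000 and each page is 2 KB.
 * The DEMO_* names are illustrative only, not HAL definitions. */
#define DEMO_FLASH_BASE  ((uint32_t)0x08000000U)
#define DEMO_PAGE_SIZE   ((uint32_t)0x800U)

/* Start address of a given page, counting pages from 0. */
static uint32_t demo_page_address(uint32_t page)
{
    return DEMO_FLASH_BASE + page * DEMO_PAGE_SIZE;
}

/* Number of whole pages in the half-open address range [start, end). */
static uint32_t demo_page_count(uint32_t start, uint32_t end)
{
    return (end - start) / DEMO_PAGE_SIZE;
}
```

With these assumptions, demo_page_address(127) gives 0x0803F800, and the range from that address to one page size above it covers exactly one page.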
Here is my system:
Segger JLink base (with the latest ver: 7.96l)
STM32CubeIDE (ver: 1.15.1)
STM32CubeMX (ver: 6.11.1)
HAL libraries for STM32F1 (ver: 1.8.5)
Any help would be greatly appreciated!
2024-06-20 06:23 AM - edited 2024-06-20 06:23 AM
It's simpler than that. Look at the loop's exit condition in this function:
void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words)
{
    while (1)
    {
        *data = *(__IO uint32_t *)StartPageAddress;
        StartPageAddress += MEMORY_OFFSET;
        data++;
        if (!(n_words--)) break;
    }
}
It reads from FLASH and writes to the destination (which is given by a pointer) one more word than n_words, so
uint32_t last_boot_counter, new_boot_counter;
FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);
writes 2 words to the stack (as last_boot_counter is a local variable), and trashes whatever sits just after last_boot_counter, which happens to be the caller's stack frame pointer (r7) saved on the stack. That trashed frame pointer then just propagates to the point where it's actually used, resulting in the fault.
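The off-by-one can be reproduced on a PC with a small model of that loop. This is only a sketch: model_buggy_read is a hypothetical name, the flash address is replaced by a plain source pointer so it runs on a host, and the return value (not present in the original function) counts the words written.

```c
#include <stdint.h>

/* Host-side model of the loop above: `src` stands in for the flash
 * address, and the return value counts the words written to `dst`. */
static unsigned model_buggy_read(const uint32_t *src, uint32_t *dst,
                                 uint16_t n_words)
{
    unsigned written = 0;
    while (1) {
        *dst = *src;          /* the copy happens before any check */
        src++;
        dst++;
        written++;
        if (!(n_words--))     /* tests the old value, then decrements */
            break;
    }
    return written;
}
```

Called with n_words = 1 and a two-word destination buffer, the model writes both words; with n_words = 0 it still writes one. The exit test only fires once n_words has already reached zero, so the loop always copies n_words + 1 words.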
@aga,
> how should I start to write code to defend myself from these hidden traps?
I could preach here "proper coding methods" etc. Some folks believe religiously in the power of tools like static checkers or various fancy programming languages.
But the naked truth is, humans make errors. So, the practical solution is to a) try to be as meticulous as possible within reasonable boundaries; b) be prepared for errors at various levels.
In this particular case, for example, in FLASH_Read_Data(), I would use the for() loop rather than do/while() or while(). There's a reason for the for() loop - it's usually perceived as a simple alternative and it's more readily understood, if one sticks to the simple for(i = 0; i < max; i++) pattern; so it's less likely to result in error.
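A minimal sketch of that for() pattern applied to the read routine follows. The name flash_read_words is hypothetical, and the source is passed as a pointer so the sketch also compiles on a host; on the target the caller would presumably pass (const volatile uint32_t *)StartPageAddress.

```c
#include <stdint.h>

/* Copies exactly n_words words from src to data, and nothing when
 * n_words is 0. */
static void flash_read_words(const volatile uint32_t *src, uint32_t *data,
                             uint16_t n_words)
{
    for (uint16_t i = 0; i < n_words; i++) {
        data[i] = src[i];
    }
}
```

The loop bound is visible in one place, so a reader can check at a glance that the destination receives n_words words and no more.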
And, also, in this particular case, you have been able to track down the root cause and mitigate it (by using the correct FLASH_Read_Data()). I never consider "the problem went away, although I'm not sure why" to be an acceptable solution; that's why to me it's very important to maintain the problem (i.e. don't make any subsequent changes, unless I can surely undo them) until I am absolutely sure what the root cause was and that I removed it.
JW
2024-06-13 10:46 AM
In this context "Imprecise" means it's a deferred write, i.e. one that went through the write buffers.
The code address of the fault is therefore not exact, as the write was started a few cycles earlier in the pipeline, and you're now executing later instructions.
Look at what's actually reported, look at the address of the failed write, which will be correct, and look a little earlier in the code instructions.
Flash addresses must be 4-byte (32-bit) aligned, and a location can't be written more than once per erase cycle.
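Both conditions can be checked before calling the programming routine. The sketch below uses hypothetical demo_* names and assumes an erased word reads back as 0xFFFFFFFF; on the target, current_value would be read from the address in question before programming it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: an erased flash word reads back as all ones. */
#define DEMO_ERASED_WORD ((uint32_t)0xFFFFFFFFU)

/* True when the address falls on a 4-byte boundary. */
static bool demo_is_word_aligned(uint32_t address)
{
    return (address & 0x3U) == 0U;
}

/* True when the word is aligned and still holds the erased value. */
static bool demo_can_program_word(uint32_t address, uint32_t current_value)
{
    return demo_is_word_aligned(address) &&
           current_value == DEMO_ERASED_WORD;
}
```

A check like this turns a silent second write or a misaligned address into an explicit error return during development.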
Have a Hard Fault handler that outputs actionable data, during development and in the field, so support techs can actually identify and fix issues.
https://github.com/cturvey/RandomNinjaChef/blob/main/KeilHardFault.c
2024-06-17 05:24 AM
Hi @Tesla DeLorean ,
first of all, many thanks for your message and for providing the code to catch the HardFault exceptions!
I spent the entire day integrating the code you suggested into my project and I'm sure that it was time well spent.
Here's what I ended up with:
[Hard Fault]
CPU registers dump:
r0 = 00000000, r1 = 20000578, r2 = 200009EC, r3 = 00000000
r4 = 20000A18, r5 = 24264CEE, r6 = 8715A7C8, sp = 2000FFE8
r12= 0803F802, lr = 080045EB, pc = 080045F0, psr= 01000000
bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
Stack dump:
00000001
0800DC81
20000A18
24264CEE
2000FF90
08005833
Instructions dump:
B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D1 4B3B 2200 701A
As I have Joseph Yiu's book "The definitive guide to ARM Cortex-M3 and Cortex-M4 processors...", I started reading about this fault and discovered that I could disable the write buffer to catch the exact point that triggered the bus fault. Unfortunately, I couldn't set the DISDEFWBUF bit (SCnSCB->ACTLR), as my MCU (ARM Cortex-M3 revision r0p1) does not have it.
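The cfsr and hfsr values in the dump above can be decoded bit by bit. The sketch below uses the bus-fault and hard-fault bit positions from the ARMv7-M fault status registers; the DEMO_* and demo_* names are made up for illustration and are not CMSIS definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in CFSR (bus fault byte) and HFSR on ARMv7-M.
 * The DEMO_* names are illustrative, not CMSIS macros. */
#define DEMO_CFSR_PRECISERR    ((uint32_t)1U << 9)
#define DEMO_CFSR_IMPRECISERR  ((uint32_t)1U << 10)
#define DEMO_CFSR_BFARVALID    ((uint32_t)1U << 15)
#define DEMO_HFSR_FORCED       ((uint32_t)1U << 30)

/* True when the bus fault was imprecise (reported after the fact). */
static bool demo_is_imprecise_bus_fault(uint32_t cfsr)
{
    return (cfsr & DEMO_CFSR_IMPRECISERR) != 0U;
}

/* True when BFAR holds the address of the faulting access. */
static bool demo_bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & DEMO_CFSR_BFARVALID) != 0U;
}

/* True when a configurable fault was escalated to HardFault. */
static bool demo_is_escalated(uint32_t hfsr)
{
    return (hfsr & DEMO_HFSR_FORCED) != 0U;
}
```

For cfsr=00000400 only IMPRECISERR is set and BFARVALID is clear, so the bfar value in the dump should not be read as the faulting address; hfsr=40000000 has FORCED set, meaning the bus fault was escalated to a HardFault.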
So, I went through the stack starting from the last address (0x08005833) and I found these instructions (in the disassembly view):
Reset_Handler:
08005800: bl 0x80051a4 <SystemInit>
68 ldr r0, =_sdata
08005804: ldr r0, [pc, #44] @ (0x8005834 <Reset_Handler+51>)
69 ldr r1, =_edata
08005806: ldr r1, [pc, #48] @ (0x8005838 <LoopFillZerobss+18>)
70 ldr r2, =_sidata
08005808: ldr r2, [pc, #48] @ (0x800583c <LoopFillZerobss+22>)
71 movs r3, #0
0800580a: movs r3, #0
72 b LoopCopyDataInit
0800580c: b.n 0x8005814 <Reset_Handler+19>
75 ldr r4, [r2, r3]
0800580e: ldr r4, [r2, r3]
76 str r4, [r0, r3]
08005810: str r4, [r0, r3]
77 adds r3, r3, #4
08005812: adds r3, #4
80 adds r4, r0, r3
08005814: adds r4, r0, r3
81 cmp r4, r1
08005816: cmp r4, r1
82 bcc CopyDataInit
08005818: bcc.n 0x800580e <Reset_Handler+13>
85 ldr r2, =_sbss
0800581a: ldr r2, [pc, #36] @ (0x8005840 <LoopFillZerobss+26>)
86 ldr r4, =_ebss
0800581c: ldr r4, [pc, #36] @ (0x8005844 <LoopFillZerobss+30>)
87 movs r3, #0
0800581e: movs r3, #0
88 b LoopFillZerobss
08005820: b.n 0x8005826 <Reset_Handler+37>
91 str r3, [r2]
08005822: str r3, [r2, #0]
92 adds r2, r2, #4
08005824: adds r2, #4
95 cmp r2, r4
08005826: cmp r2, r4
96 bcc FillZerobss
08005828: bcc.n 0x8005822 <Reset_Handler+33>
99 bl __libc_init_array
0800582a: bl 0x800dc4c <__libc_init_array>
101 bl main
0800582e: bl 0x80044d4 <main>
102 bx lr
08005832: bx lr
68 ldr r0, =_sdata
08005834: movs r0, r0
08005836: movs r0, #0
69 ldr r1, =_edata
08005838: movs r4, r1
0800583a: movs r0, #0
70 ldr r2, =_sidata
0800583c: b.n 0x80050f8 <HAL_CAN_RxFifo0MsgPendingCallback+92>
0800583e: lsrs r0, r0, #32
85 ldr r2, =_sbss
08005840: movs r0, r2
08005842: movs r0, #0
86 ldr r4, =_ebss
08005844: lsrs r0, r3, #8
08005846: movs r0, #0
115 b Infinite_Loop
WWDG_IRQHandler:
08005848: b.n 0x8005848 <WWDG_IRQHandler>
266 TST lr, #4
To me the location (0x08005833) looks like the reset handler, although the address does not match perfectly.
Moreover, I don't know how to read the instructions dump.
What steps should I take next?
Many thanks!
2024-06-17 05:36 AM - edited 2024-06-17 05:36 AM
pc = 080045F0
What you want is to look at the disasm a couple of instructions before this address and, from the content of the other registers in the fault handler, discern which instruction was the offending one. If you have a mixed disasm/C view, it's usually quite obvious what the problem in the source is.
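When matching lr and stacked return values against the disassembly, keep in mind that on Cortex-M bit 0 of a return address is the Thumb state flag rather than part of the instruction address, which is why such values look odd by one. A minimal sketch, with a hypothetical helper name and assuming the value really is a code return address:

```c
#include <stdint.h>

/* Clears the Thumb bit so the value can be looked up in a listing. */
static uint32_t demo_code_address(uint32_t return_address)
{
    return return_address & ~(uint32_t)1U;
}
```

Applied to the stacked 0x08005833 this gives 0x08005832, the bx lr that follows bl main in Reset_Handler.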
JW
2024-06-17 08:56 AM
Hi @waclawek.jan,
thanks for your help.
Here is the register dump, and below it the disassembly around the address the PC points to after the HardFault is triggered.
[Hard Fault]
CPU registers dump:
r0 = 00000000, r1 = 20000578, r2 = 200009E8, r3 = 00000000
r4 = 20000A10, r5 = 64264CEE, r6 = 8715A5C8, sp = 2000FFE8
r12= 0803F802, lr = 0800460F, pc = 08004614, psr= 01000000
bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
Stack dump:
00000001
0800DCA9
20000A10
64264CEE
2000FF90
0800585B
Instructions dump:
B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D3 4B3B 2200 701A
Disassembly around the address: 0x08004614
279 if (DEV_VAR_X == NDevice_variant){
080045f8: ldr r3, [pc, #252] @ (0x80046f8 <main+512>)
080045fa: ldrb r3, [r3, #0]
080045fc: cmp r3, #0
080045fe: bne.n 0x8004620 <main+296>
282 if (bRun_task_shaft_meas){
08004600: ldr r3, [pc, #260] @ (0x8004708 <main+528>)
08004602: ldrb r3, [r3, #0]
08004604: uxtb r3, r3
08004606: cmp r3, #0
08004608: beq.n 0x8004620 <main+296>
285 bool bRead_speed = Shaft_Measure();
0800460a: bl 0x8005760 <Shaft_Measure>
0800460e: mov r3, r0
08004610: strb r3, [r7, #7]
286 Shaft_Speed(bRead_speed);
08004612: ldrb r3, [r7, #7]
08004614: mov r0, r3
08004616: bl 0x80057c0 <Shaft_Speed>
289 bRun_task_shaft_meas = false;
0800461a: ldr r3, [pc, #236] @ (0x8004708 <main+528>)
0800461c: movs r2, #0
0800461e: strb r2, [r3, #0]
298 if (DEV_VAR_X == NDevice_variant){
08004620: ldr r3, [pc, #212] @ (0x80046f8 <main+512>)
08004622: ldrb r3, [r3, #0]
08004624: cmp r3, #0
08004626: bne.n 0x8004638 <main+320>
301 if (bRun_task_LEDs){
08004628: ldr r3, [pc, #224] @ (0x800470c <main+532>)
0800462a: ldrb r3, [r3, #0]
0800462c: uxtb r3, r3
0800462e: cmp r3, #0
08004630: beq.n 0x8004638 <main+320>
306 bRun_task_LEDs = false;
08004632: ldr r3, [pc, #216] @ (0x800470c <main+532>)
08004634: movs r2, #0
08004636: strb r2, [r3, #0]
The corresponding C code is this:
#ifdef USE_HALL_SENSORS
if (DEV_VAR_X == NDevice_variant){
    /* Run the shaft measurement task - triggered by HAL_TIM_IC_CaptureCallback() */
    if (bRun_task_shaft_meas){
        bool bRead_speed = Shaft_Measure();
        Shaft_Speed(bRead_speed);
        bRun_task_shaft_meas = false;
    }
}
#endif // USE_HALL_SENSORS
Following your suggestion, as the PC points to Shaft_Speed(bRead_speed), the error should be around here.
void Shaft_Speed(bool bRead_speed)
{
    if (bRead_speed){
        WG_TX_1_msg.n_shaft = (uint16_t)(1000000/htim4.Instance->CCR1);
    }
    else{
        WG_TX_1_msg.n_shaft = 0U;
    }
}
Here I don't really see anything wrong, as the code is very simple. Looking a bit earlier, there is the other function, Shaft_Measure (reported in the first post), which also looks good to me, so I have no idea where the issue could be.
Any other help please?
Many thanks!
2024-06-17 09:08 AM
Would look at what R7 points at
00000000 B2DB uxtb r3, r3
00000002 2B00 cmp r3, #0
00000004 D00A beq.n loc_00001C
00000006 F001 F8A9 bl sub_00115C
0000000A 4603 mov r3, r0
0000000C 71FB strb r3, [r7, #7] <<<<
0000000E 79FB ldrb r3, [r7, #7]
00000010 4618 mov r0, r3
00000012 F001 F8D3 bl sub_0011BC
00000016 4B3B ldr r3, [pc, #236] ; ($000104)
00000018 2200 movs r2, #0
0000001A 701A strb r2, [r3, #0]
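The two halfwords around the marked line in the instructions dump can be decoded by hand. The sketch below handles only the 16-bit Thumb STRB/LDRB Rt, [Rn, #imm5] form (top four bits 0111, bit 11 selecting load or store); the type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of a 16-bit Thumb byte load/store with immediate. */
typedef struct {
    bool     is_load;  /* true for LDRB, false for STRB */
    unsigned rt;       /* register transferred */
    unsigned rn;       /* base register */
    unsigned imm;      /* byte offset added to the base */
} demo_byte_access;

/* Returns true and fills *out if `op` is STRB/LDRB Rt, [Rn, #imm5]. */
static bool demo_decode_byte_access(uint16_t op, demo_byte_access *out)
{
    if ((op & 0xF000U) != 0x7000U) {
        return false;              /* some other instruction class */
    }
    out->is_load = (op & 0x0800U) != 0U;
    out->imm     = (op >> 6) & 0x1FU;
    out->rn      = (op >> 3) & 0x7U;
    out->rt      = op & 0x7U;
    return true;
}
```

Feeding it 0x71FB yields a store of r3 to [r7, #7], and 0x79FB yields the matching load, in line with the listing above.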
2024-06-17 09:13 AM - edited 2024-06-17 09:13 AM
I'd say the problem happens in
08004610: strb r3, [r7, #7]
can we know r7 from the hardfault handler?
OTOH it's a local variable, so it should be located on the stack; I don't quite understand what might have happened to r7 (which is probably the local frame pointer, i.e. it points to the stack area where local variables are allocated).
Is the fault reproducible? Can you get content of r7?
JW
2024-06-18 02:25 AM
Hi both,
Here is what I found while stepping through the disassembly around the instruction you both mentioned.
I took some time to read the values to provide more detailed information.
All values were read before the execution of the line they annotate.
265 if (bRun_task_analog){
08003bde: ldr r3, [pc, #296] @ (0x8003d08 <main+524>) --> pc = 0x8003BDE
08003be0: ldrb r3, [r3, #0] --> r3 = 0x200007DC (contains 0x0)
08003be2: uxtb r3, r3 --> r3 = 0x1
08003be4: cmp r3, #0
08003be6: beq.n 0x8003bfc <main+256>
267 ADC_Measure();
08003be8: bl 0x8002214 <ADC_Measure>
269 ADC_Copy_Results(NDevice_variant); --> pc = 0x8003BEC
08003bec: ldr r3, [pc, #268] @ (0x8003cfc <main+512>) --> r3 = 0x200007E8
08003bee: ldrb r3, [r3, #0] --> r3 = 0x0
08003bf0: mov r0, r3
08003bf2: bl 0x80022bc <ADC_Copy_Results>
272 bRun_task_analog = false;
08003bf6: ldr r3, [pc, #272] @ (0x8003d08 <main+524>) --> pc = 0x8003BF6
08003bf8: movs r2, #0
08003bfa: strb r2, [r3, #0] --> r3 = 0x200007dc (contains 01010000)
280 if (DEV_VAR_X == NDevice_variant){
08003bfc: ldr r3, [pc, #252] @ (0x8003cfc <main+512>) --> pc = 0x8003BFC
08003bfe: ldrb r3, [r3, #0] --> r3 = 0x200007E8 (contains 0x0)
08003c00: cmp r3, #0
08003c02: bne.n 0x8003c24 <main+296>
283 if (bRun_task_shaft_meas){
08003c04: ldr r3, [pc, #260] @ (0x8003d0c <main+528>) --> pc = 0x8003C04
08003c06: ldrb r3, [r3, #0] --> r3 = 0x200007DD (contains 0x00010100)
08003c08: uxtb r3, r3 --> r3 = 0x1
08003c0a: cmp r3, #0
08003c0c: beq.n 0x8003c24 <main+296>
286 bool bRead_speed = Shaft_Measure();
-----------------------------------------------------------> jumps to code below
Shaft_Measure:
08004d64: push {r7} --> r7 = 0xFFFFFFFF
08004d66: sub sp, #12 --> sp = 0x2000FFE4 (contains 0xFFFFFFFF)
08004d68: add r7, sp, #0 --> sp = 0x2000FFD8 (contains 0x02F80308)
448 bool bRead_speed = false;
08004d6a: movs r3, #0
08004d6c: strb r3, [r7, #7] --> r7 = 0x2000FFD8 (contains 0x02F80308)
451 if (htim1.Instance == NULL) {
08004d6e: ldr r3, [pc, #72] @ (0x8004db8 <Shaft_Measure+84>) --> pc = 0x8004D6E
08004d70: ldr r3, [r3, #0] --> r3 = 0x20000884 (contains 002C0140)
08004d72: cmp r3, #0
08004d74: bne.n 0x8004d7a <Shaft_Measure+22> -> jumps to 0x08004d7a
-------
452 return false; // or handle the error as appropriate
08004d76: movs r3, #0
08004d78: b.n 0x8004dae <Shaft_Measure+74>
458 aShaft = (int32_t)htim1.Instance->CNT;
-------
08004d7a: ldr r3, [pc, #60] @ (0x8004db8 <Shaft_Measure+84>) --> pc = 0x8004D7A
08004d7c: ldr r3, [r3, #0] --> r3 = 0x20000884 (contains 002C0140)
08004d7e: ldr r3, [r3, #36] @ 0x24 --> r3 = 0x0
08004d80: str r3, [r7, #0]
459 WG_TX_1_msg.a_shaft = (int16_t)aShaft;
08004d82: ldr r3, [r7, #0] --> r7 = 0x2000FFD8 (contains 0x0)
08004d84: sxth r2, r3
08004d86: ldr r3, [pc, #52] @ (0x8004dbc <Shaft_Measure+88>) --> pc = 0x8004D86
08004d88: strh r2, [r3, #6] --> r3 = 0x20000614 (contains 0x0)
469 if ((int32_t)abs(aShaft - aShaft_old) > 1){
08004d8a: ldr r3, [pc, #52] @ (0x8004dc0 <Shaft_Measure+92>) --> pc = 0x8004D8A
08004d8c: ldr r3, [r3, #0] --> r3 = 0x200009E8 (contains 0x0)
08004d8e: ldr r2, [r7, #0] --> r7 = 0x2000ffd8 (contains 0x0)
08004d90: subs r3, r2, r3 --> both r2 and r3 = 0x0
08004d92: cmp r3, #0
08004d94: it lt
08004d96: neglt r3, r3
08004d98: cmp r3, #1
08004d9a: ble.n 0x8004da2 <Shaft_Measure+62> -> jumps to 08004da2
-------
470 bRead_speed = true;
08004d9c: movs r3, #1
08004d9e: strb r3, [r7, #7]
08004da0: b.n 0x8004da6 <Shaft_Measure+66>
473 bRead_speed = false;
-------
08004da2: movs r3, #0 --> r3 = 0x0
08004da4: strb r3, [r7, #7] --> r7 = 0x2000FFD8 (contains 0x0)
475 aShaft_old = aShaft;
08004da6: ldr r2, [pc, #24] @ (0x8004dc0 <Shaft_Measure+92>) --> pc = 0x8004DA6
08004da8: ldr r3, [r7, #0] --> r7 = 0x2000FFD8 (contains 0x0)
08004daa: str r3, [r2, #0] --> r2 = 0x200009E8 (contains 0x0)
476 return bRead_speed;
08004dac: ldrb r3, [r7, #7] --> r7 = 0x2000FFD8 (contains 0x0)
477 }
08004dae: mov r0, r3 --> r3 = 0x0
08004db0: adds r7, #12 --> r7 = 0x2000FFD8 (contains 0x0)
08004db2: mov sp, r7 --> r7 = 0x2000FFE4 (contains 0xFFFFFFFF)
08004db4: pop {r7}
08004db6: bx lr --> lr = 0x8003C13 (jumps to the code below)
08004db8: lsrs r4, r0, #2
08004dba: movs r0, #0
08004dbc: lsls r4, r2, #24
08004dbe: movs r0, #0
08004dc0: lsrs r0, r5, #7
08004dc2: movs r0, #0
482 {
-----------------------------------------------------------
08003c0e: bl 0x8004d64 <Shaft_Measure>
08003c12: mov r3, r0 --> r0 = 0x0
08003c14: strb r3, [r7, #7] --> r7 = 0xFFFFFFFF
287 Shaft_Speed(bRead_speed);
08003c16: ldrb r3, [r7, #7] --> r3 now contains 0x0 and r7 still 0xFFFFFFFF
08003c18: mov r0, r3
08003c1a: bl 0x8004dc4 <Shaft_Speed>
290 bRun_task_shaft_meas = false;
Why did you point to that particular instruction?
I continued to step through the code, and although the error was triggered at the same point, I was still able to proceed to other instructions below the point that normally jumps to the HardFault handler.
Why is this happening?
Still many thanks for further help!
2024-06-18 02:52 AM
Hi @waclawek.jan,
the fault happens every time I run the code, but I'm not sure if that would be reproducible... It would take some time (which I do not have) to create a similar program.
While the function Shaft_Measure is executing, the value of r7 is always 0x2000FFD8, which contains 0x02F80308, and it never changes.
Then, after the HardFault handler in the startup assembly is invoked, it changes to 0xFFFFFFFF.
When the hard_fault_handler_c function executes, it changes again to 0x2000F90, which now contains 0xC8A50587 (that was missing from the first dump, sorry).
I really hope that this helps to find the issue.
Many thanks!
2024-06-18 05:12 AM
Why does the offending code's address keep changing?
Are you adding/removing code for the various experiments? That makes it a moving target, harder to aim and hit.
In the above code, Shaft_Measure() is irrelevant: r7 is pushed to the stack at the beginning and popped back at the end, i.e. it's unchanged. As your comment also indicates, r7 was already 0xFFFFFFFF at the point where Shaft_Measure() was called, and that's an incorrect value, as r7 is used as the stack frame pointer, i.e. it should point somewhere near the top of the stack.
In other words, go back to the beginning of the calling function (i.e. top of function which called Shaft_Measure()), and observe how r7 is set up there.
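While stepping, a quick plausibility test on r7 tells you whether it can still be a frame pointer. This is a sketch with hypothetical names; the SRAM bounds are an assumption (64 KB starting at 0x20000000, consistent with sp = 2000FFE8 in the dumps) and should be taken from the linker script in a real project.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed SRAM bounds; take the real ones from the linker script. */
#define DEMO_SRAM_START ((uint32_t)0x20000000U)
#define DEMO_SRAM_END   ((uint32_t)0x20010000U)  /* one past the last byte */

/* A frame pointer copied from sp should be word aligned and in SRAM. */
static bool demo_frame_pointer_plausible(uint32_t r7)
{
    return r7 >= DEMO_SRAM_START && r7 < DEMO_SRAM_END &&
           (r7 & 0x3U) == 0U;
}
```

With these bounds 0x2000FFD8 passes and 0xFFFFFFFF does not, so the first point in the caller where the check fails brackets the code that overwrote the saved value.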
JW