2024-06-13 09:29 AM
Hi experts,
I'm adding to my existing code (that runs on a STM32F105) a function to store a word in the flash.
I basically copied the code from the official example based on the HAL libraries, here my function:
uint32_t FLASH_Write_Data(uint32_t StartPageAddress, uint32_t *Data, uint16_t NWords)
{
static FLASH_EraseInitTypeDef EraseInitStruct;
uint32_t PageError;
/* Unlock the Flash memory to enable the flash control register access */
HAL_FLASH_Unlock();
/* Erase the FLASH area*/
EraseInitStruct.TypeErase = FLASH_TYPEERASE_PAGES;
EraseInitStruct.PageAddress = FLASH_USER_START_ADDR;
EraseInitStruct.NbPages = (FLASH_USER_END_ADDR - FLASH_USER_START_ADDR) / FLASH_PAGE_SIZE;
if (HAL_FLASHEx_Erase(&EraseInitStruct, &PageError) != HAL_OK)
{
/*Error occurred while page erase.*/
return HAL_FLASH_GetError();
}
/* Program the user FLASH area word by word*/
uint32_t i = 0;
while (i < NWords)
{
if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, StartPageAddress, Data[i]) == HAL_OK)
{
StartPageAddress += MEMORY_OFFSET;
i++;
}
else
{
/* Error occurred while writing data in Flash memory*/
return HAL_FLASH_GetError();
}
}
HAL_FLASH_Lock();
return HAL_OK;
}
The very first strange thing is that as soon as I start debugging, the program counter often jumps to the HardFault_Handler function.
From there, I can reset the chip and restart the debug session without any apparent problem.
However, when I execute the program, it consistently ends up in the HardFault_Handler function.
If I comment out the flash write function, the program works as expected.
I started commenting out part of the flash function code, and I noticed that when the HAL_FLASHEx_Erase and the HAL_FLASH_Program functions are not executed, the program continues to work as expected.
I stepped through the code and I didn't notice anything obviousy wrong, as I can erase and even write the passed value to the passed memory address without any immediate error!
The only thing that happens before the program counter jumps into the HardFault_Handler function is that, after several instructions, the IMPRECISERR flag of the CFS register is set.
I tried moving the flash function into the main code, just after the SystemClock_Config or just after before the while(1), but surprisingly, the instruction where the IMPRECISERR flag is set does not change.
The point where this is happening is on the closing (yes on the close function parenthesis "}") of this function:
bool Shaft_Measure(void)
{
static int32_t aShaft_old;
int32_t aShaft;
msg.a_shaft = (int16_t)aShaft;
if (ABS(aShaft - aShaft_old) > 1){
bRead_speed = true;
}
else{
bRead_speed = false;
}
aShaft_old = aShaft;
return bRead_speed;
}
I'm sure that the problem is not there, but unfortunately I don't know how to proceed to identify the issue.
The MCU I'm using is the STM32F105RC which has 256 kB of flash, and the value I'm trying to save it is just a uint32_t number.
The constants used by the flash functions are defined in my flash.h here reported:
#define FLASH_ADDR_PAGE_127 ((uint32_t)0x0803F800)
#define FLASH_USER_START_ADDR FLASH_ADDR_PAGE_127
#define FLASH_USER_END_ADDR FLASH_ADDR_PAGE_127 + FLASH_PAGE_SIZE
#define MEMORY_OFFSET ((uint32_t)0x4U)
These addresses have been copied from the device datasheet.
Here my system:
Segger JLink base (with the latest ver: 7.96l)
STM32CubeIDE (ver: 1.15.1)
STM32CubeMX (ver: 6.11.1)
HAL libraries for STM32F1 (ver: 1.8.5)
Any help would be greatly appreciated!
Solved! Go to Solution.
2024-06-18 05:17 AM
HI @waclawek.jan,
I saw a bit of inconsistency as well, because I changed a bit the code, but never in its functionality, but you are right, next time I'll make a new branch to avoid changes in the code.
Understood, I'll try to trace r7 and see where it changes to the wrong value (0xFFFFFFFF).
Many thanks!
2024-06-19 08:48 AM - edited 2024-06-19 08:59 AM
Hi @waclawek.jan,
yes, I understood your point. By following your instructions, I was able to locate the offending code.
The issue is created in the following function that first reads and then updates the flash:
uint32_t FLASH_BootCounter(void)
{
/* Read the current boot counter */
uint32_t last_boot_counter, new_boot_counter;
FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);
/* Increment the boot counter */
new_boot_counter = last_boot_counter + 1;
FLASH_Write_Data(FLASH_USER_START_ADDR, &new_boot_counter, 0x1U);
return last_boot_counter;
}
The disassembly code related to the last function is:
24 FLASH_Write_Data(FLASH_USER_START_ADDR, &new_boot_counter, 0x1U);
08002fb0: mov r3, r7
08002fb2: movs r2, #1
08002fb4: mov r1, r3
08002fb6: ldr r0, [pc, #16] @ (0x8002fc8 <FLASH_BootCounter+48>)
08002fb8: bl 0x8002fcc <FLASH_Write_Data>
25 return last_boot_counter;
08002fbc: ldr r3, [r7, #4]
26 }
08002fbe: mov r0, r3
08002fc0: adds r7, #8
08002fc2: mov sp, r7
08002fc4: pop {r7, pc}
08002fc6: nop
08002fc8: strb.w r0, [r0, <undefined>]
When the pc points to the 08002fc0 address the memory contains:
0x2000FFD8: 4F4E0000
...
...
0x2000FFE0: FFFFFFFF
When the adds operation is executed, r7 changes from pointing to the memory address 0x2000FFD8 to the new address 0x2000FFE0 which now contains an invalid memory value: 0xFFFFFFFF.
I stepped through then the entire FLASH_BootCounter function and I found out that the issue it is in the FLASH_Read_Data function.
I then stepped through the FLASH_Read_Data function disassembly (it took me all day!) and here are the results:
152 SEGGER_RTT_printf(0, "RTT READY\n");
08003b1e: ldr r1, [pc, #416] @ (0x8003cc0 <main+488>)
08003b20: movs r0, #0
08003b22: bl 0x8000fa8 <SEGGER_RTT_printf>
159 NBoot = FLASH_BootCounter();
08003b26: bl 0x8002f98 <FLASH_BootCounter>
-----------------------------------------------------------> jumps to code below
FLASH_BootCounter:
08002f98: push {r7, lr} --> r7 = 0x2000FFF0 (contains 0x20000A10), lr contains the returning address
08002f9a: sub sp, #8 --> sp = 0x2000FFE0 (contains 0x2000FF0)
08002f9c: add r7, sp, #0 --> r7 = 0x2000FFF0 sp = 0x2000FFD8 (contains 0x8003B27)
21 FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);
08002f9e: adds r3, r7, #4 --> r7 = 0x2000FFD8 + 4 = 0x2000FFDC (contains 0x800D2C4)
08002fa0: movs r2, #1
08002fa2: mov r1, r3 --> r3 = 0x2000FFDC (contains 0x0800D2C4)
08002fa4: ldr r0, [pc, #32] @ (0x8002fc8 <FLASH_BootCounter+48>)
08002fa6: bl 0x8003068 <FLASH_Read_Data>
-----------------------------------------------------------> jumps to code below
FLASH_Read_Data:
08003068: push {r7} --> r7 = 0x2000FFD8 (contains 0x8003B27)
0800306a: sub sp, #20 --> sp = 0x2000FFD4
0800306c: add r7, sp, #0 --> r7 = 0x2000FFC0 (contains 0x800D2C0)
0800306e: str r0, [r7, #12] --> r0 = 0x803F800, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (contains 0x803F800)
08003070: str r1, [r7, #8] --> r1 = 0x2000FFDC, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (contains 0x2000FFE0)
08003072: mov r3, r2 --> r2 = r3 = 0x1
08003074: strh r3, [r7, #6] --> r7 = 0x2000FFC0 + 6 half word (contains 0x0)
At this point:
- r0: StartPageAddress (0x803F800)
- r1: &data (0x2000FFDC)
- r2: n_words (0x1)
[FIRST ITERATION]
79 *data = *(__IO uint32_t *)StartPageAddress;
08003076: ldr r3, [r7, #12] --> r7 = 0x2000FFC0 + 12 = 0x2000FFCC (contains 0x803F800)
08003078: ldr r2, [r3, #0] --> r3 = 0x803F800, r2 = 0x1
0800307a: ldr r3, [r7, #8] --> r3 = 0x803F800, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (contains 0x2000FFDC)
0800307c: str r2, [r3, #0] --> r2 = 0x5F34, r3 = 0x2000FFDC (it contains 0x800D2C4, then it will contain 0x5F34)
80 StartPageAddress += MEMORY_OFFSET;
0800307e: ldr r3, [r7, #12] --> r3 = 0x2000FFDC, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (it contains 0x803F800)
08003080: adds r3, #4 --> r3 = 0x803F800
08003082: str r3, [r7, #12] --> r3 = 0x803F804, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (it contains 0x803F800, then it will contain 0x803F804)
81 data++;
08003084: ldr r3, [r7, #8] --> r3 = 0x803F804, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (it contains 0x2000FFDC)
08003086: adds r3, #4 --> r3 = 0x2000FFDC (contains 0x00005F34)
08003088: str r3, [r7, #8] --> r3 = 0x2000FFE0, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (contains 0x2000FFDC, it will contain 0x2000FFE0)
82 if (!(n_words--)) break;
0800308a: ldrh r3, [r7, #6] --> r3 = 0x2000FFE0, r7 = 0x2000FFC0 + 6 = 0x2000FFC6 (read half word contains 0x1)
0800308c: subs r2, r3, #1 --> r2 = 0x5F34, r3 = 0x1
0800308e: strh r2, [r7, #6] --> r2 = 0, r7 = 0x2000FFC0 + 6 = 0x2000FFC6 (restore half word with 0x0)
08003090: cmp r3, #0
08003092: beq.n 0x8003096 <FLASH_Read_Data+46>
79 *data = *(__IO uint32_t *)StartPageAddress;
08003094: b.n 0x8003076 <FLASH_Read_Data+14>
82 if (!(n_words--)) break;
[SECOND ITERATION]
79 *data = *(__IO uint32_t *)StartPageAddress;
08003076: ldr r3, [r7, #12] --> r3 = 0x1, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (contains 0x803F804)
08003078: ldr r2, [r3, #0] --> r3 = 0x803F804, r2 = 0x0 <<<<<<<<<< The address pointed by r3 is empty (non flash data there), then r2 is loaded with 0xFFFFFFFF
0800307a: ldr r3, [r7, #8] --> r3 = 0x803F804, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (contains 0x2000FFE0)
0800307c: str r2, [r3, #0] --> now r3 = 0x2000FFE0 and r2 = 0xFFFFFFFF, when this instruction is executed the 0xFFFFFFFF is loaded into the 0x2000FFE0 address
80 StartPageAddress += MEMORY_OFFSET;
0800307e: ldr r3, [r7, #12] --> r3 = 0x2000FFE0, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (it contains 0x803F804)
08003080: adds r3, #4 --> r3 = 0x803F804
08003082: str r3, [r7, #12] --> r3 = 0x803F808, r7 = 0x2000FFC0 + 12 = 0x2000FFCC (contains 0x803F804, it will contain 0x803F808)
81 data++;
08003084: ldr r3, [r7, #8] --> r3 = 0x803F808, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (contains 0x2000FFE0)
08003086: adds r3, #4 --> r3 = 0x2000FFE0 (it contains 0xFFFFFFFF)
08003088: str r3, [r7, #8] --> r3 = 0x2000FFE4, r7 = 0x2000FFC0 + 8 = 0x2000FFC8 (it contains 0x2000FFE0, then it will contain 0x2000FFE4)
82 if (!(n_words--)) break;
0800308a: ldrh r3, [r7, #6] --> r3 = 0x2000FFE4, r7 = 0x2000FFC0 + 6 = 0x2000FFC6 (read half word contains 0x0)
0800308c: subs r2, r3, #1 --> r2 = 0xFFFFFFFF, r3 = 0x0 <<<<<<<<<< subtraction is not executed - registers do not change - APSR register = 0x80000000 -> N bit set
0800308e: strh r2, [r7, #6] --> r2 = 0xFFFFFFFF, r7 = 0x2000FFC0 + 6 (restore half word with 0xFF) <<<<<<<<<< is this correct? as 0x2000FFC4 contains 0xFFFF0000
08003090: cmp r3, #0 --> r3 = 0x0
08003092: beq.n 0x8003096 <FLASH_Read_Data+46>
79 *data = *(__IO uint32_t *)StartPageAddress;
08003094: b.n 0x8003076 <FLASH_Read_Data+14>
82 if (!(n_words--)) break;
08003096: nop
84 }
08003098: nop --> r1 = 0x2000FFDC, r2 = 0xFFFFFFFF, r3 = 0x0, r7 = 0x2000FFC0
0800309a: adds r7, #20 --> r7 = 0x2000FFC0 + 20 = 0x2000FFD4 (contains 0x2000FFD8)
0800309c: mov sp, r7 --> sp = 0x2000FFC0
0800309e: pop {r7} --> sp = 0x2000FFD4, r7 = 0x2000FFD4 (contains 0x2000FFD8)
080030a0: bx lr --> sp = 0x2000FFD8, r7 = 0x2000FFD8 (contains 0x08003B27)
returns from the code above <----------------------------------------------------
23 new_boot_counter = last_boot_counter + 1;
08002faa: ldr r3, [r7, #4] --> r3 = 0x0, r7 = 0x2000FFD8 + 4 = 0x2000FFDC (contains 0x00005F34)
08002fac: adds r3, #1 --> r3 = 0x5F34
08002fae: str r3, [r7, #0] --> r3 = 0x5F35, r7 = 0x2000FFD8 (it contains 0x08003B27, it will contain 0x00005F35)
24 FLASH_Write_Data(FLASH_USER_START_ADDR, &new_boot_counter, 0x1U);
08002fb0: mov r3, r7 --> r3 = 0x5F35, r7 = 0x2000FFD8 (contains 0x00005F35)
08002fb2: movs r2, #1 --> r2 = 0xFFFFFFFF
08002fb4: mov r1, r3 --> r1 = r3 = 0x2000FFD8 (contains 0x00005F35)
08002fb6: ldr r0, [pc, #16] @ (0x8002fc8 <FLASH_BootCounter+48>) --> pc = 0x8002FB6 + 16 = 0x8002FC6
08002fb8: bl 0x8002fcc <FLASH_Write_Data> --> This function not inserted here <--
25 return last_boot_counter;
08002fbc: ldr r3, [r7, #4] --> r3 = 0x0, r7 = 0x2000FFD8 + 4 = 0x2000FFDC (contains 0x00005F34)
26 }
08002fbe: mov r0, r3 --> r0 = 0x0, r3 = 0x5F34
08002fc0: adds r7, #8 --> r7 = 0x2000FFD8 (contains 0x00005F35)
08002fc2: mov sp, r7 --> sp = 0x2000FFD8, r7 = 0x2000FFE0 (contains 0xFFFFFFFF) <<<<<<<<<< here the wrong memory address is pushed into the stack
08002fc4: pop {r7, pc} --> r7 = sp = 0x2000FFE0 (contains 0xFFFFFFFF)
--> sp = 0x2000FFE8, r7 = 0xFFFFFFFF
08002fc6: nop
08002fc8: strb.w r0, [r0, <undefined>]
The FLASH_Read_Data function seems to be a kind of do-while loop, which reads the requested flash memory, but also reads the next value, which "coincidentally" it contains 0xFFFFFFFF, as that memory address is empty.
When this value is read, despite the N (negative bit) being set, the operation doesn't happen, and the registers still contain the wrong values, leading to r7 being loaded with this invalid memory address.
I have to admit that, in an attempt to be quick (which turned out to be the very opposite), I copied that function from a website. I will never do that again. I would have never written that function in that way, but I can't find anything wrong that would trigger such behavior.
Here the offending function:
void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words)
{
while (1)
{
*data = *(__IO uint32_t *)StartPageAddress;
StartPageAddress += MEMORY_OFFSET;
data++;
if (!(n_words--)) break;
}
}
I then updated that function to be like in the official example:
void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words)
{
uint32_t address = StartPageAddress;
uint32_t EndPageAddress = StartPageAddress + n_words * MEMORY_OFFSET;
while (address < EndPageAddress)
{
*data = *(__IO uint32_t *)address;
address += MEMORY_OFFSET;
}
}
... And now all works as expected, and stepping into the disassembly I didn't spot anything wrong.
Your insights on this would be greatly appreciated (@waclawek.jan, @Tesla DeLorean) as I'm sure yu would help me to finally understad what happened under the hood.
One more thing, how should I start to write code to defend myself from these hidden traps? Can you point me to some good articles that explain how provide to the code a sort of "watchdog"
Many many thanks!
2024-06-19 09:30 AM
>>Why did you point to that particular instruction?
Because it's the WRITE that's causing the IMPRECISE fault
r7 == 0xFFFFFFFF is problematic, suggesting it got trashed somewhere along the way, assuming the compiler didn't generate bad code. It's probably a pointer to one of your structures, perhaps locally on the stack.
Code/Execution level failures can occur if the FLASH Wait States are too aggressive, especially in the prefetching paths.
Perhaps it's a code inlining optimization issue? Is the boolean value used anywhere? It looks like it staged unnecessarily (where r7=sp, local on the stack frame), and then passed to the function.
2024-06-20 06:23 AM - edited 2024-06-20 06:23 AM
it's simpler than that. Look at the end of loop condition of this function:
void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words) { while (1) { *data = *(__IO uint32_t *)StartPageAddress; StartPageAddress += MEMORY_OFFSET; data++; if (!(n_words--)) break; } }
It reads from FLASH and writes to destination (which is given by pointer) one more word than is n_words, so
uint32_t last_boot_counter, new_boot_counter; FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);
writes 2 words to stack (as last_boot_counter is a local variable), and thrashes what's just after last_boot_counter, which happens to be the stacked stack frame pointer in r7 from the caller function. And that thrashed stack frame pointer then just propagates to the point where it's actually used, resulting in the fault.
@aga,
> how should I start to write code to defend myself from these hidden traps?
I could preach here "proper coding methods" etc. Some folks believe religiously in the power of tools like static checkers or various fancy programming languages.
But the naked truth is, humans make errors. So, the practical solution is to a) try to be as meticulous as possible within reasonable boundaries; b) be prepared for errors at various levels.
In this particular case, for example, in FLASH_Read_Data(), I would use the for() loop rather than do/while() or while(). There's a reason for the for() loop - it's usually perceived as a simple alternative and it's more readily understood, if one sticks to the simple for(i = 0; i < max; i++) pattern; so it's less likely to result in error.
And, also, in this particular case, you have been able to track down the root cause and mitigate it (by using the correct FLASH_Read_Data()). I never consider the "the problem went away, although I'm not sure why" to be an acceptable solution; that's why to me it's very important to maintain the problem (i.e. don't make any subsequent changes, unless I can surely undo them) until I am absolutely sure what was the root cause and that I removed it.
JW
2024-06-20 07:25 AM
Hi @Tesla DeLorean and @waclawek.jan.
Many thanks for the time spent on this thread. Very appraciated!
I now understand the issue. As you probably noticed, I hadn't marked my post as a solution because I hadn't fully grasped the root cause.
I've marked it just now thanks to your explanation. However, I still want to understand why the stack pointer retained that value. Do you have any idea?
I agree about the for loop too, I normally use it.
Cheers
2024-06-20 07:51 AM
>>it's simpler than that. Look at the end of loop condition of this function:
Yeah, responding to an earlier post, not the more detailed/focused one.
2024-06-20 09:09 AM
It's not "stack pointer" but "pointer to stack frame". Stack frame is the name of the space on stack the compiler makes upon entry to a function, so that it can place local variables there and, in some cases, also temporary variables it is unable to hold in registers (so called "register spills"). Compiler needs to maintain a pointer to that space, as it is not a fixed value, it depends on where top of stack (pointed to by stack pointer) was at the moment the function was called.
But upon function entry, before creating the stack frame, the compiler needs to push on stack registers it intends to modify in that function (except r0-r3, which are used as parameters upon entry/return values upon exit/freely modifiable in the function).
For example:
FLASH_BootCounter: 08002f98: push {r7, lr} --> r7 = 0x2000FFF0 (contains 0x20000A10), lr contains the returning address 08002f9a: sub sp, #8 --> sp = 0x2000FFE0 (contains 0x2000FF0) 08002f9c: add r7, sp, #0 --> r7 = 0x2000FFF0 sp = 0x2000FFD8 (contains 0x8003B27)
here, in the first instruction, lr (the return address) and r7 are pushed first, as the compiler intends to use r7 in that function; then moves the top of stack to make space for the stack frame - here for two local variables, and stores that address into r7 to be used as pointer when those local addresses are to be used.
So, on stack, we have
r7-> [new_boot_counter] [last_boot_counter] [saved r7 of caller] [lr i.e. return address] [... stack content of caller ...]
Now calling FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U); wrote two words instead of one, so it wrote last_boot_counter, but also corrupted [saved r7 of caller]. When FLASH_BootCounter finished:
08002fc4: pop {r7, pc} --> r7 = sp = 0x2000FFE0 (contains 0xFFFFFFFF) --> sp = 0x2000FFE8, r7 = 0xFFFFFFFF
that corrupted r7 is popped from the stack. Execution returned to the function, which called FLASH_BootCounter() (I don't know how is it called, let's call it X), where apparently r7 is again used as stack frame pointer for the local variables of function X. And then later, in function X, a local variable bRead_speed is accessed:
bool bRead_speed = Shaft_Measure();
(while bRead_speed is defined in the middle of the function, the compiler gathers all local variables throughout the function and creates one common frame at the beginning of the function). However, r7 now points to invalid address and that access then causes the fault.
JW
2024-06-20 09:46 PM
Hi @waclawek.jan,
crystal clear explanation.
I've a very last question, not directly related to issue, but I think it is still correct asking it here (because the context).
As manually gathering all the registers value, change their indianness, keeping track of everything (.. and Lord knows how many times I had to restart because I went a bit over an instruction) was super time comsuming, is there any (debug) tool that could have saved me from all of this?
Many thanks for your time!
2024-06-22 02:27 AM - edited 2024-06-22 09:19 AM
> is there any (debug) tool that could have saved me from all of this?
This is similar to your previous question:
>> how should I start to write code to defend myself from these hidden traps?
And, again, the answer is - as you probably know - disappointing: there is no magical debug tool in the same way as there is no magic tool preventing errors. (There are folks promoting such tools, nonetheless.)
Errors do happen, and some errors are simply very complicated. The best thing we can do is to try to prevent making errors as much as possible (see above); and have a full toolbox of debugging tools (i.e. not hoping in one single magic tool).
In other words, the hard way.
Btw. note, that in what happened you got quite lucky, as the fault happened not too far from the place where the bug was. Would the bug overwrite some very unrelated variable on the stack, or resulting in a complicated chain of events, and the symptoms might've been very far and unrelated to the point where the bug "acted", with even much more time and effort spent on hunting it down. But this all is part of our job, it does and will happen, and we have to be prepared to the occurences of it.
JW