Hard-to-catch fault

root · 2013-06-04

Posted on June 04, 2013 at 09:50

Hello,

I have a rather strange issue. When I really stress my system (it handles serial packets, gets the right data, and sends it out via serial; I send thousands of messages without waiting for a reply), I sometimes, unpredictably, get an invalid PC load usage fault or an invalid state usage fault.

When I get an invalid PC load usage fault, the program counter is something like 0x0 or 0x5, or sometimes it contains a RAM address, but I don't have any code in RAM. Looking at the stack trace, I have the feeling there is a stack pointer corruption somewhere, because some of the registers hold flash addresses of branch code (and LR has weird stuff in it, obviously neither flash code nor a RAM address).

Here are my stack traces :

****************************

HARD FAULT !

Stack = 0x20000660

Invalid PC load usage fault at

Program counter = 0x200082B0

Stack frame :

R0 = 0x400264B8

R1 = 0x20008318

R2 = 0x3C

R3 = 0x200082B0

R12 = 0x0

LR = 0x8002417

PC = 0x200082B0

PSR = 0x20008318

****************************

Or :

****************************

HARD FAULT !

Stack = 0x20000688

Invalid state usage fault at

Program counter = 0x20008270

Stack frame :

R0 = 0x20008288

R1 = 0x20008EA8

R2 = 0x3C

R3 = 0x200082B0

R12 = 0x0

LR = 0x20008270

PC = 0x20008270

PSR = 0x20000200

****************************

Or again:

****************************

HARD FAULT !

Stack = 0x20000670

Invalid PC load usage fault at

Program counter = 0x1

Stack frame :

R0 = 0x0

R1 = 0x80023D7

R2 = 0x8003B26

R3 = 0x21000200

R12 = 0x0

LR = 0x8003279

PC = 0x1

PSR = 0x200082B0

****************************

The problem seems to happen (I tried to track it down, but it's very hard) on the service call interrupt exit after a malloc call (but there are something like 10000 malloc calls without a problem first).

My process stacks are far from full (half empty at minimum), and I have 8k of system stack. The hard fault happens on the user stack but, again, it seems to trigger when popping registers at service call exit.
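As a rough aid for reading frames like the three above, here is a minimal sketch of a plausibility check on a stacked exception frame. It is not part of the original code: the names are hypothetical and the flash window is an assumption that has to be adjusted to the actual part. It relies on one architectural fact, that bit 24 (the T bit) of a genuine xPSR is set while Thumb code is running.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Registers stacked by a Cortex-M core on exception entry, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

/* ASSUMPTION: flash window typical of an STM32; adjust to the real device. */
#define FLASH_BASE_ADDR 0x08000000u
#define FLASH_END_ADDR  0x08100000u  /* assumed 1 MiB of flash */
#define XPSR_THUMB_BIT  (1u << 24)   /* T bit of xPSR */

static bool in_flash(uint32_t addr)
{
    return addr >= FLASH_BASE_ADDR && addr < FLASH_END_ADDR;
}

/* True when the frame looks like a genuine snapshot of code running from flash. */
static bool frame_is_plausible(const exc_frame_t *f)
{
    bool pc_ok   = in_flash(f->pc);
    bool xpsr_ok = (f->xpsr & XPSR_THUMB_BIT) != 0u;
    return pc_ok && xpsr_ok;
}
```

Applied to the traces above, all three stacked PSR values (0x20008318, 0x20000200, 0x200082B0) have the T bit clear, so none of them is a believable xPSR. In the third trace R3 = 0x21000200 has the shape of a real xPSR, which may suggest that the words being read as a frame are offset from where the core actually stacked them.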

I've spent about 10 hours trying to fix this, with no luck so far. Do you guys have any advice for me?

Thomas.

#dma2

root · 2013-08-16

Posted on August 16, 2013 at 11:37

Hello,

https://dl.dropboxusercontent.com/u/3574941/screenshot.png

Here is a screenshot of TASKING when the hard fault occurs. The expressions window shows the current task stack; item 83 is 0xFEDCCDEF, which is my stack fill value (and all items above 83 hold the same value, so the stack is far from full).

At the time of the hard fault the stack pointer is at item 86 (so the stack frame pushed by the hard fault exception is items 86 to 93). Item 94 is 0x0, which has probably been used as a link return (hence the hard fault with PC = 0x0).
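The fill-value check described above can also be done in code rather than by eye. The following is a minimal sketch, assuming the stack is viewed as an array whose index 0 is the lowest address and that it grows downward from the highest index; the function name is hypothetical, and only the fill value comes from the post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xFEDCCDEFu  /* fill value quoted in the post */

/* Counts the words at the low-address end of a descending stack that still
   hold the fill pattern, i.e. words that were never written by the task. */
static size_t stack_unused_words(const uint32_t *stack, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack[unused] == STACK_FILL) {
        unused++;
    }
    return unused;
}
```

A result of zero means the lowest word has been overwritten, which points to an overflow or to a stray write from elsewhere. A large result only shows that the task itself never went that deep; it says nothing about a corrupted stack pointer or about writes landing in the used part of the stack.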

0x800299F is the line just after a system call to pop a queued buffer

0x8003116 is the service call return address (I have functions that contain only the SVC call; this is to trick the compiler into putting the right parameters in the right registers before the call, and to get the returned value back in R0 as the function output).

0x8002997 is another system call return address (wait event call)

0x80030FD is the stacked link return address that calls the "task finished" system call.

The system call to pop a buffer from a queue sets the buffer address in R0 of the stacked frame (stacked by the SVC interrupt) ... and when there is no buffer to pop, the result is 0x0.

How is it possible that it works most of the time and then suddenly writes to a bad address? I'm still investigating ...

Thomas.

root · 2013-08-16

Posted on August 16, 2013 at 13:21

The only thing I do not have the ability to debug in depth is the malloc and free implementation from TASKING.

I checked that the interrupt priorities are OK, the interrupts are OK, system interrupts are never preempted ... I checked everything. Malloc and free calls are made via system calls to ensure that only one task/interrupt accesses those functions at a time. Each time the hard fault triggers, memory addresses for Malloc OR PopBuffer are somewhere on the screen. PopBuffer doesn't call malloc. Or maybe my service call interrupt code has a flaw? This is my ASM handler:

; This is the service call interrupt handler, it does : 
; - check if current stack is main or process 
; - retrieve parameters 
; - call OPSY_Service_Handler 
; - return with correct value in R0 
.section .text 
.align 4 
.global OPSY_SVC_Handler 
OPSY_SVC_Handler: .type func 
TST LR, #0x4 ; test bit 2 of the EXC_RETURN value in LR (to check if we were using the master or process stack) 
ITTEE EQ ; if then block 
MRSEQ R0, MSP ; if equal, then we were using the master stack 
LDREQ R1, #0 ; if equal, put 0 in R1 
MRSNE R0, PSP ; if not equal, we were using process stack 
LDRNE R1, #1 ; if not equal, put 1 in R1, this will easily tell in the handler if we were using main stack or process stack 
LDR R2,[R0,#24] ; get the service call in R2 
LDRB R2, [R2,#-2] ; and extract the parameter 
PUSH {LR} ; push link return because we will overwrite it 
BL OPSY_Service_Handler ; then call the service handler 
POP {PC} ; once done, pop back link return into PC to return from interrupt 
.endsec

That calls my C handler :

void OPSY_Service_Handler(void* stack, uint32_t isThreadCall, uint8_t parameter) 
{ 
OPSY_SetReady(_currentTaskId); 
volatile stackFrame_t* stackFrame = (stackFrame_t*) stack; 
switch (parameter) 
{ 
case taskFinished: 
OPSY_TaskFinished(); 
break; 
case addTask: 
OPSY_TaskInit((taskfunc_t) stackFrame->R0, (uint8_t) stackFrame->R1); 
break; 
case sleep: 
if (isThreadCall) 
OPSY_TaskWaitDelay(stackFrame->R0); 
else 
OPSY_Error(SleepFromInterrupt); 
break; 
case signalEvent: 
OPSY_SignalEvent((eventTag_t) stackFrame->R0); 
break; 
case waitEvent: 
if (isThreadCall) 
OPSY_SetWaitEvent(_currentTaskId, (eventTag_t) stackFrame->R0); 
else 
OPSY_Error(WaitEventFromInterrupt); 
break; 
case waitEventWithTimeout: 
if (isThreadCall) 
OPSY_SetWaitEventWithTimeout(_currentTaskId, (eventTag_t) stackFrame->R0, stackFrame->R1); 
else 
OPSY_Error(WaitEventFromInterrupt); 
break; 
case disableInterrupt: 
OPSY_DisableInterrupt((uint8_t) stackFrame->R0); 
break; 
case enableInterrupt: 
OPSY_EnableInterrupt((uint8_t) stackFrame->R0, (taskfunc_t) stackFrame->R1, (uint8_t) stackFrame->R2); 
break; 
case pushBuffer: 
stackFrame->R0 = (uint32_t)OPSY_PushBuffer((queueTag_t) stackFrame->R0, (buffer_t*) stackFrame->R1); 
break; 
case popBuffer: 
stackFrame->R0 = (uint32_t)OPSY_PopBuffer((queueTag_t) stackFrame->R0); 
break; 
case mallocSvc: 
if(stackFrame->R0 != 0) 
stackFrame->R0 = (uint32_t)malloc((size_t)stackFrame->R0); 
break; 
case freeSvc: 
if(stackFrame->R0 != 0) 
free((void*)stackFrame->R0); 
break; 
default: 
break; 
} 
if (isThreadCall != 0) 
OPSY_ExitAndSwitchContext(); 
}
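For readers following the assembly entry stub above, the two decisions it makes (which stack holds the exception frame, and which SVC number was requested) can be mirrored in plain C. This is only a host-side sketch with hypothetical function names; on the target the stacked PC is a 32-bit value read from offset 24 of the frame and cast to a byte pointer, not a pointer into a test array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 2 of EXC_RETURN (the value in LR on exception entry) selects the stack
   that holds the exception frame: 0 = main stack (MSP), 1 = process stack (PSP). */
static bool frame_is_on_process_stack(uint32_t exc_return)
{
    return (exc_return & 0x4u) != 0u;
}

/* The Thumb SVC instruction is encoded as 0xDF00 | imm8 and stored
   little-endian, so the immediate is the byte two below the stacked
   return address. */
static uint8_t svc_number(const uint8_t *stacked_pc)
{
    return stacked_pc[-2];
}
```

Note that bit 2 clear covers two cases, thread mode running on the main stack and a call made from inside another handler, so this bit alone identifies the stack rather than a thread caller.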

Any ideas?

Thomas.

Tesla DeLorean · 2013-08-16

Posted on August 16, 2013 at 20:08

I need to ponder, but

R0 in the screenshot is a FLASH-based subroutine address, and SP is a peripheral register (DMA1 Stream 6), which is rather odd.

Have you tried the SVC routine with a branch rather than ITE? The ITTEE looks valid, but I'm trying to eliminate things.

It looks to be a task swap issue perhaps? Maybe focus on the context switching code, or DMA corruption of the stack, or other things outside of normal code flow.

Have you confirmed the Hard Fault output by sanity checking the addresses/PC on a known faulting instruction? A lot of the original ones you list have bogus RAM addresses, which are not ODD (LR), and out of place.

Can you trap OPSY_Service_Handler() if there is an invalid parameter, or do some additional sanity checking, i.e. allocation of a bogus size, or freeing of wrong/previously allocated space? You could wrap malloc/free to add guards/checks.

For the sake of others who might stumble on this, the ART problem I cited is not related to the speed of the FLASH; it's on the prefetch port of the processor, related to a cache hit and data delivery from ART.
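The suggestion to wrap malloc/free with guards could look roughly like the sketch below. It is an illustration only, written against the standard C library rather than the TASKING runtime; the names guarded_malloc and guarded_free and the guard pattern are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_WORD 0xA5C3F00Du  /* arbitrary pattern, hypothetical */

typedef struct {
    size_t   size;   /* payload size requested by the caller */
    uint32_t guard;  /* front guard, checked on free */
} guard_header_t;

/* Allocates size bytes surrounded by a header and a rear guard word. */
static void *guarded_malloc(size_t size)
{
    const uint32_t guard = GUARD_WORD;
    if (size == 0 || size > SIZE_MAX - sizeof(guard_header_t) - sizeof guard) {
        return NULL;  /* trap bogus sizes instead of passing them on */
    }
    guard_header_t *h = malloc(sizeof *h + size + sizeof guard);
    if (h == NULL) {
        return NULL;
    }
    h->size  = size;
    h->guard = guard;
    uint8_t *payload = (uint8_t *)(h + 1);
    memcpy(payload + size, &guard, sizeof guard);  /* rear guard may be unaligned */
    return payload;
}

/* Returns false, and frees nothing, when either guard has been overwritten. */
static bool guarded_free(void *ptr)
{
    if (ptr == NULL) {
        return true;
    }
    guard_header_t *h = (guard_header_t *)ptr - 1;
    uint32_t rear;
    if (h->guard != GUARD_WORD) {
        return false;  /* underrun, or a pointer that guarded_malloc never returned */
    }
    memcpy(&rear, (uint8_t *)ptr + h->size, sizeof rear);
    if (rear != GUARD_WORD) {
        return false;  /* overrun past the end of the payload */
    }
    free(h);
    return true;
}
```

In a setup like the one in this thread, the mallocSvc and freeSvc cases of the service handler would be the natural place to call such wrappers, with a false return from guarded_free treated as a trap point for the debugger.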


root · 2013-08-19

Posted on August 19, 2013 at 10:18

Hello,

Here is my task switch code, first in assembly:

.section .text 
.global OPSY_PendSV_ASMHandler 
.align 4 
OPSY_PendSV_ASMHandler: .type func 
LDR R0, =__opsy_registerSave ; load __opsy_registerSave address in R0 
STM R0, {R4-R11} ; store R4 to R11 registers content to __opsy_registerSave 
PUSH {LR} ; push link return because we will overwrite it 
BL OPSY_PendSV_Handler ; then call the service handler 
LDR R0, =__opsy_registerSave ; load __opsy_registerSave address in R0 
LDM R0, {R4-R11} ; load R4 to R11 registers content back from __opsy_registerSave 
POP {PC} ; once done, pop back link return into PC to return from interrupt 
.endsec

That calls this handler in C:

void OPSY_PendSV_Handler(void) 
{ 
SCB ->ICSR &= ~SCB_ICSR_PENDSVCLR_Msk; 
int32_t nextTask = _readyList[0]; 
if (nextTask == OPSY_NOTASK) // error : next task should not be empty ! 
{ 
OPSY_Error(DidntFindTaskToRun); 
nextTask = _idleTask; 
} 
int32_t* save_registers = _tasks[_currentTaskId].registers; 
int32_t* load_registers = _tasks[nextTask].registers; 
int32_t i; 
for(i = 0; i<OPSY_REGISTERS; i++) 
{ 
if(__opsy_registerSave[i] != 0) 
__nop(); 
save_registers[i] = __opsy_registerSave[i]; 
__opsy_registerSave[i] = load_registers[i]; 
} 
OPSY_SetRunning(nextTask); 
}

That calls OPSY_SetRunning() :

static void OPSY_SetRunning(int32_t taskId) 
{ 
task_t* task = &(_tasks[taskId]); 
OPSY_RemoveFromList(taskId, _readyList); 
__set_PSP((uint32_t) (task->stackCurrent)); 
task->status = Running; 
task->lastSwitch = ++_switches; 
_currentTaskId = taskId; 
}

DMA corruption ... I'll check, but the system runs at only a few % of bandwidth here, so I would be very surprised if I could fill the 4k transmit buffer that I use only for DMA transmit (which is DMA1_Stream6 ...). I'll check my hard fault handler and add checks to the service handler.

Many thanks for your help!

Thomas.

root · 2013-08-19

Posted on August 19, 2013 at 11:08

Hello,

It was a task switching problem ... but not the task switch code itself ...

In OPSY_Service_Handler I was calling OPSY_ExitAndSwitchContext ONLY if the call had been made from the process stack, but at the start of OPSY_Service_Handler I was calling OPSY_SetReady regardless (this procedure stops the running task and puts it back on the ready list).

Now I call OPSY_ExitAndSwitchContext unconditionally, and there are no more hard faults!

The hard fault could only happen if the service handler was called from an interrupt, which is why it would trigger only under load.
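The mismatch described above can be reproduced with a toy model of the scheduler state. This is a sketch under stated assumptions, not the OPSY code: set_ready and exit_and_switch are simplified, hypothetical stand-ins for OPSY_SetReady and OPSY_ExitAndSwitchContext, and the ready list is a plain FIFO.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_TASK   (-1)
#define MAX_TASKS 4

typedef struct {
    int ready[MAX_TASKS];  /* FIFO of task ids waiting to run */
    int ready_count;
    int running;           /* id the scheduler believes is executing */
} sched_t;

/* Stand-in for OPSY_SetReady: the running task goes back on the ready list. */
static void set_ready(sched_t *s)
{
    assert(s->ready_count < MAX_TASKS);
    s->ready[s->ready_count++] = s->running;
    s->running = NO_TASK;
}

/* Stand-in for OPSY_ExitAndSwitchContext: the head of the ready list runs next. */
static void exit_and_switch(sched_t *s)
{
    assert(s->ready_count > 0);
    s->running = s->ready[0];
    for (int i = 1; i < s->ready_count; i++) {
        s->ready[i - 1] = s->ready[i];
    }
    s->ready_count--;
}

/* always_switch = false models the original handler, true models the fix. */
static void service_call(sched_t *s, bool is_thread_call, bool always_switch)
{
    set_ready(s);
    if (is_thread_call || always_switch) {
        exit_and_switch(s);
    }
}

/* Consistent: one task is recorded as running and it is not also queued. */
static bool is_consistent(const sched_t *s)
{
    if (s->running == NO_TASK) {
        return false;
    }
    for (int i = 0; i < s->ready_count; i++) {
        if (s->ready[i] == s->running) {
            return false;
        }
    }
    return true;
}
```

With task 0 running and task 1 ready, a service call from an interrupt in the original form leaves task 0 queued as ready while no task is recorded as running, even though task 0 resumes once the interrupts return. Calling the switch unconditionally restores the invariant, which matches the observation that the fault only appeared when interrupts issued service calls.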

Thanks again for your help Clive !!!!!

Thomas.