osMessageGet Crashing

temp2010 · ‎2020-06-19

I have lots of experience in the general area, but am new to both STM32 and FreeRTOS specifically.

I'm adding a message queue to my program, using the code at https://www.keil.com/pack/doc/CMSIS/RTOS/html/group__CMSIS__RTOS__Message.html#gac9a6a6276c12609793e7701afcc82326 as an example to follow.

It seems that osMessageGet is crashing. I have a breakpoint both on and after the call to osMessageGet(). I can single step debug into osMessageGet(). It behaves as expected, finds a message, sets up the "event" structure that is to be returned, and gets to the function return statement. If I let the program run from that point, I expect to reach my second breakpoint that's after the call. Instead, I don't reach it. When I pause the program, I see it's stuck in the HardFault_Handler(). So something must be wrong with the return happening.

The only thing I can think of is to ask, "huh? It's returning a complex data type that is a structure?" Do I have to set up some special flags or something in order for the compiler to do this correctly?
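For reference, returning a structure by value is plain standard C and needs no special compiler flags. Below is a minimal host-side sketch. The type `demo_event_t`, the constant `DEMO_EVENT_MESSAGE`, and the functions are hypothetical stand-ins that only mimic the three-word, 12-byte shape of `osEvent`; none of them are the CMSIS definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_EVENT_MESSAGE 0x10  /* hypothetical status code for this sketch */

/* Hypothetical stand-in with the same three-word shape as osEvent. */
typedef struct {
    int32_t  status;  /* status code */
    uint32_t value;   /* 32-bit message value */
    uint32_t def;     /* identifies the queue the message came from */
} demo_event_t;

/* Returns the whole structure by value. The compiler arranges the copy,
 * typically through a hidden pointer to storage in the caller's frame. */
static demo_event_t demo_get(uint32_t message)
{
    demo_event_t ev;
    ev.status = DEMO_EVENT_MESSAGE;
    ev.value  = message;
    ev.def    = 0u;
    return ev;
}

/* Receives the structure into a local and hands back the message value. */
static uint32_t demo_roundtrip(uint32_t message)
{
    demo_event_t ev = demo_get(message);
    assert(ev.status == DEMO_EVENT_MESSAGE);
    return ev.value;
}
```

So a fault on return from such a function points at something overwriting the caller's frame rather than at the struct return itself.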

Meanwhile, I printed out the size of the structure and it's only 12 bytes. I bumped my stack up that much and there was no change. So I don't think it's a stack size problem.

I set a breakpoint in the interrupt routine that's filling the queue with pointers to structures from a pool, and that breakpoint wasn't hit between the time I let the return in question execute and the program being in hard fault. So I do not think the issue is a task switch and the fault actually occurring due to something other than the return itself.

What could be wrong here?

Below is a screenshot of the hard fault (image not preserved in this copy):

and below are germane segments of my code:

// Global declarations
 
	int can1_rx_drop_count = 0;
 
	typedef struct {
		CAN_RxHeaderTypeDef   RxHeader;
		uint8_t               RxData[8];
	} RxPacket_t;
	RxPacket_t RxPacket;
 
	osPoolDef(RxPool, 16, RxPacket_t);
	osPoolId RxPool = NULL;
	osMessageQDef(RxQueue, 16, RxPacket_t);
	osMessageQId RxQueue = NULL;
 
	...
 
// From my CAN RX interrupt routine...
 
	RxPacket_t *ppacket;
 
	if (RxPool && RxQueue) {
		ppacket = osPoolAlloc(RxPool);								// Allocate memory for packet in pool
		if (ppacket) {
			memcpy(ppacket, &RxPacket, sizeof(RxPacket_t));				// Copy packet to pool
			osMessagePut(RxQueue, (uint32_t)ppacket, osWaitForever);	// Put pooled packet on RxQueue
		} else {
			// DROP CAN PACKET
			++can1_rx_drop_count;
		}
	}
 
	...
	
// From my main task setup
 
	  RxPool = osPoolCreate(osPool(RxPool));				// Create memory pool for RxPackets
	  RxQueue = osMessageCreate(osMessageQ(RxQueue), NULL);	// Create queue for RxPackets
 
	  ...
	  
// From my main task loop
 
	if (RxPool && RxQueue) {
		RxPacket_t localPacket;
		RxPacket_t *ppacket;
		osEvent event;
 
		// THIS IS THE OFFENDING CALL TO osMessageGet <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
		event = osMessageGet(RxQueue, osWaitForever);			// Wait for packet in queue
		if (event.status == osEventMessage) {					// Is there a packet in the queue
			...
		}
	}
 
	...

alister · ‎2020-06-22

12 bytes is negligible. Increase the stack of the task calling osMessageGet by 256 bytes and re-test.
Another task's stack may be overflowing, perhaps on interrupt, and screwing a stack or other data. Increase all your task stacks by 256 bytes and re-test.
Step to the PC that hard-faults. You may see the cause before it hard-faults.
Analyze the hard fault information. The information is in many places. Try https://interrupt.memfault.com/blog/cortex-m-fault-debug#determining-what-caused-the-fault.
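As a rough illustration of that last step, the sketch below decodes a value read from the Cortex-M3/M4 Configurable Fault Status Register. The table and the function name `cfsr_describe` are made up for this example; only the bit positions come from the ARM architecture documentation, and the table is deliberately not exhaustive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* One entry per CFSR fault flag this sketch knows about. */
typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag_t;

static const cfsr_flag_t cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL: instruction access violation" },
    { 1u << 1,  "DACCVIOL: data access violation" },
    { 1u << 3,  "MUNSTKERR: MemManage fault while unstacking" },
    { 1u << 4,  "MSTKERR: MemManage fault while stacking" },
    { 1u << 8,  "IBUSERR: instruction bus error" },
    { 1u << 9,  "PRECISERR: precise data bus error" },
    { 1u << 10, "IMPRECISERR: imprecise data bus error" },
    { 1u << 11, "UNSTKERR: bus fault while unstacking" },
    { 1u << 12, "STKERR: bus fault while stacking" },
    { 1u << 16, "UNDEFINSTR: undefined instruction" },
    { 1u << 17, "INVSTATE: invalid execution state" },
    { 1u << 18, "INVPC: invalid PC load on exception return" },
    { 1u << 19, "NOCP: no coprocessor" },
    { 1u << 24, "UNALIGNED: unaligned access" },
    { 1u << 25, "DIVBYZERO: divide by zero" },
};

/* Prints every known flag set in cfsr and returns how many were set. */
static int cfsr_describe(uint32_t cfsr, FILE *out)
{
    int count = 0;
    assert(out != NULL);
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if (cfsr & cfsr_flags[i].mask) {
            fprintf(out, "%s\n", cfsr_flags[i].name);
            count++;
        }
    }
    return count;
}
```

On the target, the argument would be the value of `SCB->CFSR` captured in the debugger or inside HardFault_Handler().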

temp2010 · ‎2020-06-23

Thanks, Alister.

Between my post and now, it did occur to me that it might all be stack related, since the hard fault was occurring upon RETURN from the function. I was going to first try reducing the pool size from 16 entries to 12 or 8 entries. I had already tried a small increase in this task's stack size, but I'll keep in mind your suggestion about other tasks (there's only the main, unsure about interrupt task).

About stack size, long ago I did a bunch of work analyzing .map files for a different processor (Microchip). But I do not yet know how to analyze the STM32 .map file for memory usage. I could go hunting on my own, or I could Google on my own, but I do ask if you know already of a tutorial on how to extract this info (memory or stack usage) from the STM32 .map file. (That old Microchip .map file included a function call tree that either directly or indirectly [don't recall] told me the stack usage. I didn't see such on a recent quick inspection of the STM32 .map file.)

ALSO, I just mentioned interrupt task. I know how to set the stack size when I create a new task. I don't know how to do so for an interrupt. Is it using the stack from the main program? Well, I don't know how to set that either, aside from just the overall stack leftover. I do need to read up on this more. Link suggestions appreciated. I know the concepts very well, just not the details for STM32 or FreeRTOS. (My prior Microchip work was bare metal, not multi-tasking, but I've done plenty of multi-tasking on bigger computers.)

alister · ‎2020-06-23

>but I'll keep in mind your suggestion about other tasks

You're misunderstanding the intention. You want to take definitive steps to isolate the problem, i.e. each step should produce a result.

Assuming you've the RAM, add 256 to _all_ your stacks. Then if the crash stops, you know a stack size is the cause and at least one stack needs increasing and your next step is to check the stacks to determine which and by how much.

>ALSO, I just mentioned interrupt task.

There is no interrupt task. Read your code to determine what it does for interrupt stack. Probably it's using the stack of the current task.

>long ago I did a bunch of work analyzing .map files

Read your code. Google configCHECK_FOR_STACK_OVERFLOW. Grep taskCHECK_FOR_STACK_OVERFLOW. Instrument your code if necessary. With the stacks very large and after executing for some time, look at each to determine how much is used and allow some more, e.g. 100, for safety.
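The "look at each to determine how much is used" step relies on the stack being pre-filled with a known pattern. FreeRTOS paints task stacks with 0xa5 bytes and offers uxTaskGetStackHighWaterMark() to report the untouched remainder. The sketch below simulates the idea on a host with an ordinary array; all `demo_` names are hypothetical and this is not the FreeRTOS implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_STACK_WORDS 64u
#define DEMO_FILL_WORD   0xA5A5A5A5u  /* four 0xa5 fill bytes */

/* Paint the whole stack area with the fill pattern. */
static void demo_paint(uint32_t *stack, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack[i] = DEMO_FILL_WORD;
    }
}

/* The stack grows down, so words never touched sit at the low-address
 * end. Count them from index 0 until the pattern is broken. */
static size_t demo_unused_words(const uint32_t *stack, size_t words)
{
    size_t n = 0;
    while (n < words && stack[n] == DEMO_FILL_WORD) {
        n++;
    }
    return n;
}

/* Simulate a task that has used 'used' words from the high-address end
 * and report how many words were never written. */
static size_t demo_simulate(size_t used)
{
    uint32_t stack[DEMO_STACK_WORDS];
    assert(used <= DEMO_STACK_WORDS);
    demo_paint(stack, DEMO_STACK_WORDS);
    for (size_t i = 0; i < used; i++) {
        stack[DEMO_STACK_WORDS - 1u - i] = 0x12345678u + (uint32_t)i;
    }
    return demo_unused_words(stack, DEMO_STACK_WORDS);
}
```

A result near zero means the stack is nearly exhausted; the safety margin mentioned above is added on top of the words actually used.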

temp2010 · ‎2020-06-24

Yes, I did indeed understand you, Alister. There were other problems, however. After further analysis, I see that STM32CubeMX defaulted me to only 4K of FreeRTOS heap, while my selected processor has 64K. This was limiting any stack growth I would try. I gather that all my task stacks come out of this FreeRTOS heap.

With that larger FreeRTOS heap, I was able to up my default task from 600 max that didn't hang to now 2048. That's much more than the 256 you requested.
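To put rough numbers on that, here is a small sketch of the budget arithmetic, assuming the stack depth is given in 32-bit words (as Alister notes later in this thread). It deliberately ignores the TCB, the idle task, the queue and pool, and allocator overhead, so it only gives a lower bound. The `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_WORD_BYTES 4u  /* one stack unit is a 32-bit word */

/* Bytes of heap consumed by a task stack of the given depth in words. */
static size_t demo_stack_bytes(size_t depth_words)
{
    assert(depth_words <= SIZE_MAX / DEMO_WORD_BYTES);
    return depth_words * DEMO_WORD_BYTES;
}

/* Non-zero if a single stack of that depth can fit in the heap at all.
 * Real usage is higher because other objects share the same heap. */
static int demo_stack_fits(size_t heap_bytes, size_t depth_words)
{
    return demo_stack_bytes(depth_words) <= heap_bytes;
}
```

With a 4096-byte heap, 600 words (2400 bytes) fits but 2048 words (8192 bytes) cannot, which matches the behavior I saw before enlarging the heap.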

I discovered the TrueSTUDIO View FreeRTOS Task List. That window now shows IDLE (which I believe is the main starting task) has "Start of Stack" 0x20000204 and "Top of Stack" 0x200011b8. It also has my defaultTask with start 0x20001498 and top 0x200033e0.

Now what confuses me is this. My debugger is on the call to osMessageGet(), that being the function I can't return from without hitting hard fault. So I have NOT YET called this function. I look at the registers. Is "sp" the stack pointer? Well, it has a value of "0x20003430 <ucHeap+8096>". Well, my current breakpoint is in a function called by defaultTask. So I should be in the default task. But this sp value is ***NOT*** between the start and top numbers I mentioned. It seems like there's already trouble. Ignoring the 0x2000 for brevity, I'm at 3430 which is beyond the 1498 to 33e0 range. Is this stack growing up from 1498 to 33e0, or growing down from 33e0 to 1498? If growing up, I've already grown too far. If growing down, how is the sp now before where it even starts? (Note I've been doing stacks for a very long time. I know how they work in general. But not FreeRTOS perhaps.)

Now, I only have these two tasks, plus I have a pool for messages and objects on messages. Without calculating their size right now, I figure I still have oodles of extra OS heap available. I bump my defaultTask stack size up to 4096. Now when I hit my breakpoint before calling osMessageGet, the FreeRTOS Task List is telling me my defaultTask has a stack start 0x20001498 (unchanged) and a top of 0x200053e0 (yep, much larger). Yet that darned sp is also huge, at 0x20005430. Again, it is BEYOND the top of the stack.

Obviously, I don't understand what's going on here. The sp is always OUTSIDE of the start and top of the stack. Is the top of the stack the END of the stack as I'm thinking? Or is it just where the stack is now? Is "sp" not the stack pointer? Is there some OS reservation making up this difference? I understand the things in general, but the actual system here is doing some special stuff and I don't know what it is.

BTW, it is still the case that when I trace through osMessageGet, all seems fine until the return executes. Then I get a hard fault.

I can't be needing to increase any stack size anywhere. I'm just totally missing the boat somewhere regarding this system's memory management method.

[EDIT: Working on CheckForStackOverflow now. Nope, vApplicationStackOverflowHook is *not* getting called. Selected option 2 for stack overflow checks.]

alister · ‎2020-06-24

>I was able to up my default task from 600 max that didn't hang to now 2048

>osMessageGet(), that being the function I can't return from without hitting hard fault

These appear contradictory. Guessing 600 does hang.

>IDLE (which I believe is the main starting task)

No. Read the code. A quick search finds this comment:

"The idle task, which as all tasks is implemented as a never ending loop.
 * The idle task is automatically created and added to the ready lists upon
 * creation of the first user task."

The starting tasks are configured in Cube.

>Is "sp" the stack pointer?

The 600 does hang?

Read the PM (programmers manual) re SP and hard faults and/or Google it.

The stack grows down. So stack start = stack base + stack size. Stack unit is 32-bit word.

For additional information how to study the stack, look for taskCHECK_FOR_STACK_OVERFLOW and read what it does. That'll show how the OS knows where each stack is.

>I should be in the default task

The current task is at pxCurrentTCB. Its name is pxCurrentTCB->pcTaskName.

>that darned sp is also huge, at 0x20005430

I can only speculate, but this has the hallmarks of stack corruption. Possible causes include:

Stack overflow, screwing a function's LR on the stack (read the PM). Once the LR is screwed any register including SP may become screwed before hard fault.
Heap overflow by allocator or overrun/underrun of a buffer allocated on the heap.
Overrun/underrun of array defined in function (i.e. local variable, stored on the stack).
Too many/too large variables defined in function.
Uninitialized or badly initialized pointer.

Many causes and they can compound too.

>stack size 4096... stack start 0x20001498... top of 0x200053e0... that darned sp is also huge, at 0x20005430

With stack size unit = 32-bit word, these look fair. 4096 is much too large for typical software.
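A worked check of that arithmetic, using the addresses reported in this thread. The sketch assumes the stack occupies `depth` units starting at "Start of Stack", and that the "Top of Stack" column shows the stack pointer saved at the last context switch rather than the end of the stack. That second point is an assumption about the TrueSTUDIO view, not something confirmed here. The `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* First address past the stack area: base + depth * unit size. */
static uint32_t demo_stack_end(uint32_t base, uint32_t depth,
                               uint32_t unit_bytes)
{
    return base + depth * unit_bytes;
}

/* A descending stack pointer is valid when it lies above the base and
 * at or below the end address. */
static int demo_sp_in_stack(uint32_t sp, uint32_t base, uint32_t depth,
                            uint32_t unit_bytes)
{
    uint32_t end = demo_stack_end(base, depth, unit_bytes);
    assert(end >= base);  /* no wrap-around in this sketch */
    return sp > base && sp <= end;
}
```

With 4-byte units, a depth of 4096 from 0x20001498 ends at 0x20005498, so sp = 0x20005430 is inside the stack, 0x68 bytes below its end. The same holds for the 2048 case (end 0x20003498, sp 0x20003430). Only if the depth were read as bytes would the sp fall outside.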

>when I trace through osMessageGet, all seems fine until the return executes. Then I get a hard fault.

Exceptions occur as you step.

Next step is to re-read everything and test all assumptions.

temp2010 · ‎2020-06-25

SOLVED.

The problem wasn't my stack settings at all, although it was stack corruption as we both thought.

The problem turns out to have been THREE different bugs, only one of which was mine! The other two were actually FreeRTOS bugs, or at a minimum poor style.

BUG #1:

If you look at my original code quote, I have:

	typedef struct {
		CAN_RxHeaderTypeDef   RxHeader;
		uint8_t               RxData[8];
	} RxPacket_t;
	RxPacket_t RxPacket;
 
	osPoolDef(RxPool, 16, RxPacket_t);
	osPoolId RxPool = NULL;
	osMessageQDef(RxQueue, 16, RxPacket_t);
	osMessageQId RxQueue = NULL;

Well, line 9 above (the osMessageQDef line) needed an ampersand on the third parameter to osMessageQDef. That ampersand was missing. The "type" on the queue was supposed to be a pointer and not a whole structure.
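To make the size difference concrete, here is a host-side sketch. `demo_can_rx_header_t` is a stand-in I wrote for this example, assuming the HAL's CAN_RxHeaderTypeDef consists of seven 32-bit fields; the field names are my recollection and should be checked against the HAL header. All `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for CAN_RxHeaderTypeDef: seven 32-bit fields (assumed). */
typedef struct {
    uint32_t StdId;
    uint32_t ExtId;
    uint32_t IDE;
    uint32_t RTR;
    uint32_t DLC;
    uint32_t Timestamp;
    uint32_t FilterMatchIndex;
} demo_can_rx_header_t;

/* Same shape as RxPacket_t: header plus eight data bytes. */
typedef struct {
    demo_can_rx_header_t RxHeader;
    uint8_t              RxData[8];
} demo_rx_packet_t;

/* Item size the queue records when given the whole structure type. */
static size_t demo_item_size_struct(void)
{
    return sizeof(demo_rx_packet_t);
}

/* Item size of the 32-bit value that a message actually carries. */
static size_t demo_item_size_slot(void)
{
    return sizeof(uint32_t);
}

/* How many bytes more than the slot each queue item claims to hold. */
static size_t demo_excess_bytes(void)
{
    assert(demo_item_size_struct() >= demo_item_size_slot());
    return demo_item_size_struct() - demo_item_size_slot();
}
```

Passing the structure type makes every queue item 36 bytes, 32 more than the 4-byte value a message is meant to carry.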

BUG #2:

My own code was based on the example code at https://www.keil.com/pack/doc/CMSIS/RTOS/html/group__CMSIS__RTOS__Message.html#gac9a6a6276c12609793e7701afcc82326 . The section of code from there is:

typedef struct {                                 // Message object structure
  float    voltage;                              // AD result of measured voltage
  float    current;                              // AD result of measured current
  int      counter;                              // A counter value
} T_MEAS;
 
osPoolDef(mpool, 16, T_MEAS);                    // Define memory pool
osPoolId  mpool;
osMessageQDef(MsgBox, 16, &T_MEAS);              // Define message queue
osMessageQId  MsgBox;

Here we see the comparable code at line 9 (the osMessageQDef line). But here's the bug with this. They have an ampersand on the name of a typedef. I do not believe one can do this. One cannot take the address of a typedef, a thing that hasn't been instantiated. I discovered this when I simply tried to add the ampersand to my own code. It wouldn't compile. I had to change the token in my code from the typedef name "RxPacket_t" to the actual instantiated structure name "RxPacket". Only then could I add the ampersand to get "&RxPacket".

Now, there might be some other version, some non-standard version, of a C compiler that allows this. But the C compiler in Atollic TrueSTUDIO does not. Therefore, I call this a bug in the example code. The intent of the example code was correct, though, so this is only a very small bug. It just means the example code was not compilable. Yes, this happens sometimes!
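Since the macro only applies sizeof to its third argument, there are a few spellings that do compile. The sketch below uses a hypothetical macro, `DEMO_ITEM_SIZE`, that mirrors just that sizeof step; it is not the CMSIS macro, and `demo_packet_t` is a stand-in for my packet type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t header[7];
    uint8_t  data[8];
} demo_packet_t;

demo_packet_t demo_packet;  /* an actual object whose address can be taken */

/* Hypothetical macro: like the sizeof (type) inside osMessageQDef, it
 * only measures whatever it is given. */
#define DEMO_ITEM_SIZE(type_or_expr) sizeof(type_or_expr)

/* Option 1: name the pointer type directly. */
static size_t demo_size_pointer_type(void)
{
    return DEMO_ITEM_SIZE(demo_packet_t *);
}

/* Option 2: take the address of an existing object (what I ended up doing). */
static size_t demo_size_address_expr(void)
{
    return DEMO_ITEM_SIZE(&demo_packet);
}

/* Option 3: name the 32-bit value type the message is stored as. */
static size_t demo_size_u32(void)
{
    return DEMO_ITEM_SIZE(uint32_t);
}

/* Options 1 and 2 always agree, because both measure a pointer. */
static size_t demo_pointer_size_checked(void)
{
    assert(demo_size_pointer_type() == demo_size_address_expr());
    return demo_size_pointer_type();
}

/* DEMO_ITEM_SIZE(&demo_packet_t) does not compile: a type name has no
 * address. */
```

On the 32-bit target all three give 4. On a 64-bit host the two pointer forms give 8, which is one reason the `uint32_t` spelling is the least ambiguous.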

BUG #3

This is the bigger bug in the FreeRTOS implementation that comes with the STM32CubeMX. The way I found this bug was to trace down through the osMessageGet() function, the one that was failing to return, in order to try to find the point at which the stack was being corrupted. The function osMessageGet() from cmsis_os.c was in turn calling xQueueReceive() from queue.c. This was in turn calling prvCopyDataFromQueue(), also from queue.c. (Notice that prvCopyDataFromQueue() is doing a copy but does not have a size parameter.) Finally, prvCopyDataFromQueue() is calling memcpy as:

( void ) memcpy( ( void * ) pvBuffer, ( void * ) pxQueue->u.pcReadFrom, ( size_t ) pxQueue->uxItemSize );

THIS RIGHT HERE was the culprit causing my memory corruption. This might not seem like a bug at this point, so let me explain. pxQueue->uxItemSize comes from that third parameter that I mentioned in BUG #1. Because I passed the name of my structure rather than the address, this uxItemSize was 36 instead of 4. Well, this should be just fine if my intent was to put whole structures on the queue rather than just pointers.

However, the destination of the memcpy, pvBuffer, was a smaller place somewhat hard coded in the code. Specifically, pvBuffer was parameter #2 to prvCopyDataFromQueue(). This was in turn parameter #2 to xQueueReceive(). Finally, in osMessageGet() of cmsis_os.c this parameter #2 was passed to xQueueReceive() as a reference to a member of the structure osEvent:

osEvent osMessageGet (osMessageQId queue_id, uint32_t millisec)
{
...
 
  osEvent event;
  
...
 
    if (xQueueReceive(queue_id, &event.value.v, ticks) == pdTRUE) {
 
...
}

Now, osEvent is a structure defined in cmsis_os.h as:

typedef struct  {
  osStatus                 status;     ///< status code: event or error information
  union  {
    uint32_t                    v;     ///< message as 32-bit value
    void                       *p;     ///< message or mail as void pointer
    int32_t               signals;     ///< signal flags
  } value;                             ///< event value
  union  {
    osMailQId             mail_id;     ///< mail id obtained by \ref osMailCreate
    osMessageQId       message_id;     ///< message id obtained by \ref osMessageCreate
  } def;                               ///< event definition
} osEvent;

The destination, event.value.v, is a uint32_t and can never be anything other than a uint32_t. It has a size of 4 bytes and can never have a size other than 4 bytes.

Meanwhile, my code and the example code quoted use osMessageQDef(), which invokes a macro in cmsis_os.h that strictly takes the size of the third parameter:

#if defined (osObjectsExternal)  // object is external
#define osMessageQDef(name, queue_sz, type)   \
extern const osMessageQDef_t os_messageQ_def_##name
#else                            // define the object
#if( configSUPPORT_STATIC_ALLOCATION == 1 )
#define osMessageQDef(name, queue_sz, type)   \
const osMessageQDef_t os_messageQ_def_##name = \
{ (queue_sz), sizeof (type), NULL, NULL  }

PUTTING IT ALL TOGETHER:

We have a parameter given to osMessageQDef, the size of which is being given to osMessageQDef_t. This is in cmsis_os.h. Then, elsewhere in cmsis_os.c, we are memcpy-ing into an object with size 4, the size of a uint32_t. This means that if ANYTHING OTHER THAN 4 is given to osMessageQDef, then we're doing things wrong. And if anything larger than 4 is given, then there's going to be a corruption of some sort.
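The effect can be simulated safely on a host by placing the 12-byte event inside a larger guard-filled byte array, so the oversized copy stays within one object. This is a sketch only: the offsets assume the status/value/def layout shown above with 4-byte members, and every `DEMO_`/`demo_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_FRAME_BYTES  64u    /* pretend stack frame holding the event */
#define DEMO_EVENT_BYTES  12u    /* size of osEvent reported in this thread */
#define DEMO_VALUE_OFFSET 4u     /* value.v follows the 4-byte status */
#define DEMO_GUARD        0xEEu  /* marks bytes that must stay untouched */

/* Copies item_size bytes into the value slot, the way the queue copy
 * uses the queue's item size instead of the destination's size, and
 * returns how many bytes beyond the event structure were overwritten. */
static size_t demo_clobbered_bytes(size_t item_size)
{
    uint8_t frame[DEMO_FRAME_BYTES];
    uint8_t item[DEMO_FRAME_BYTES];
    size_t  clobbered = 0;

    assert(DEMO_VALUE_OFFSET + item_size <= DEMO_FRAME_BYTES);
    memset(frame, DEMO_GUARD, sizeof frame);
    memset(item, 0x11, sizeof item);

    memcpy(&frame[DEMO_VALUE_OFFSET], item, item_size);

    for (size_t i = DEMO_EVENT_BYTES; i < DEMO_FRAME_BYTES; i++) {
        if (frame[i] != DEMO_GUARD) {
            clobbered++;
        }
    }
    return clobbered;
}
```

With an item size of 4 nothing outside the event is touched. With 36, the copy runs 28 bytes past the end of the event, which on the real stack is where saved registers and the return address can live.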

It seems to me that osMessageQDef should not have that third parameter at all. cmsis_os.c/.h should internally use a sizeof(uint32_t) and not allow the user to provide this size at all.
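Short of changing the API, a compile-time guard would at least catch the mistake. The sketch below is a hypothetical wrapper, not part of CMSIS or the Cube code: it rejects any item type whose size differs from uint32_t by forcing a negative array size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guard: evaluates to sizeof(type) when the type is exactly
 * as large as uint32_t, and fails to compile otherwise because the
 * array size becomes -1. */
#define DEMO_CHECKED_ITEM_SIZE(type) \
    (sizeof(char[(sizeof(type) == sizeof(uint32_t)) ? 1 : -1]) * sizeof(type))

/* A 4-byte structure passes the check just like uint32_t does. */
typedef struct {
    uint16_t id;
    uint16_t len;
} demo_small_t;

static size_t demo_checked_u32(void)
{
    size_t n = DEMO_CHECKED_ITEM_SIZE(uint32_t);
    assert(n == sizeof(uint32_t));
    return n;
}

static size_t demo_checked_small(void)
{
    return DEMO_CHECKED_ITEM_SIZE(demo_small_t);
}

/* DEMO_CHECKED_ITEM_SIZE(RxPacket_t), with the 36-byte structure from
 * this thread, would be rejected by the compiler. */
```

With a C11 compiler the same check could be written with `_Static_assert`, which gives a clearer error message.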

It's like the original author of osMessageQDef envisioned having different size objects on the queue. But then the author of osMessageGet envisioned only ever having objects of size uint32_t, as codified in the fact that it deals with the osEvent object that contains a uint32_t.

I call this a bug.

SUMMARY:

Yes, it was indeed my own bug for failing to put the ampersand on my third parameter to osMessageQDef. But in discovering my own bug, I've discovered two others in the FreeRTOS doc example code and in the FreeRTOS code from cmsis_os.c (plus involvement with queue.c)

MEANWHILE:

I still need to figure out what the TrueSTUDIO GUI is trying to tell me with the Start of Stack and Top of Stack figures, and why register sp seems to be outside of this range. Thanks very much for your assistance, Alister. You haven't ever gotten around to clarifying this, I don't think. I'll re-read what you've written to see if I missed it in there somewhere.

Also, regarding the IDLE task. You keep saying I don't understand which is the IDLE task, but I do, I really do. It is the task of the main function which does osThreadDef(defaultTask, StartDefaultTask, osPriorityNormal, 0, 4096/*excessive, yes*/), then osKernelStart(), then loops with simply while(1){}, comments omitted. And yes, these were all written into main.c by the cube, and then I made minor changes inside the USER CODE comment sections. (Interestingly, it looks like the cube doesn't want me to change the default task stack size, even though I repeatedly put it back to larger for my testing.)

By the way, when I said more than 600 on defaultTask would hang, that was when the cube setting only allowed the OS a total of 4K of heap. And the hang was a *different* hang from the one in question in this thread. Sorry I wasn't clear there. The point was that it was impossible for me to give defaultTask more than a 600 [word?] stack until after I increased the cube setting for the OS heap.

Alister, you did mention pxCurrentTCB. I'd like to learn more about this one. I can't find that token "pxCurrentTCB". I searched the whole STM32Cube_FW_F1_V1.8.0 folder and none of the .c or .h files contain "pxCurrentTCB". It's not in my own project folder anywhere, either. You may be referring to something other than this cube of Atollic TrueSTUDIO.

Alister, you did respond to my "stack size 4096... stack start 0x20001498... top of 02x00053e0... that darned sp is also huge, at 0x20005430", but you didn't comment on the fact that 5430 is ***OUTSIDE*** of the range from 1498 to 53e0. This is where I'm confused.

Nevertheless, I still thank you for your help, Alister.

dungeonlords789 · ‎2021-07-05

Do you know?

osMessageCreate is not xMessageBufferCreate but xQueueCreate in cmsis_os.c...

DAldr.1 · ‎2022-10-12

Dang. I got hit by the same bug. I was passing the typedef not the address of an existing structure. I'd rather just have a sizeof argument instead. They need to fix CMSIS_OS documentation.