Strange behavior with USB CDC as virtual COM port on NUCLEO-L476RG

Benjamin Brammer · ‎2020-01-07

Hello everybody,

I am using the NUCLEO-L476RG to send simple ASCII text over UART (RS232 adapter) or over USB CDC as a virtual COM port. UART communication works seamlessly, but when I use the USB CDC library functions from STM, something strange happens that I cannot explain:

I am sending ASCII text at regular intervals in the following format: ABP;***;***;***\n

The * stands for different numbers. Normally I would see this kind of text flow on a terminal like hterm:

ABP;***;***;***\n

... and so on

But with USB CDC I occasionally see this:

ABP;***;***;***\n

ABP;ABP;***;***;***\n

ABP;***;***;***\n

ABP;ABP;***;***;***\n

When I halt the debugger to check whether my character array is corrupted, it isn't. So my assumption is that something is not working correctly in the USB CDC device library from STM. This is the code that gets invoked when I transmit the ASCII text:

sprintf(Communication.measure,"ABP;%ld;%ld;%ld\n",Patient.Systolic, Patient.Diastolic, Patient.MAP);
if(System.Mode == UART)
{
	LPUART_transmit_message(Communication.measure);
}
else if(System.Mode == USB)
{
	CDC_Transmit_FS(Communication.measure, strlen(Communication.measure));
}

As you can see, I prepare the array the same way for the USB CDC version as for the UART version, yet only the UART version works seamlessly. The big difference between the two is that for USB CDC I only use the device library functions provided by STM, not my own.

Has anybody experienced something similar? Or does anybody have a good hint on where to look for the problem?

best regards

Benjamin

Bob S · ‎2020-01-27

OOPS - I typed this in last Friday but never hit "reply"...

You are correct, 0xffffffe9 is not a valid value for ANY kind of pointer. And it looks suspiciously like an LR value from an interrupt/fault.

> When the hardfault occurs, and this only happens sometimes if I have measurements

> running and do attach the USB cable,

Now THAT is some useful information! As I mentioned in the last paragraph of my previous post, your code should never run from 0x0000 0200 or anywhere close to that, except for the initial reset vector fetch. You probably end up there as a result of de-referencing a NULL function pointer. Since this happens when you are collecting data AND connecting the USB cable, is it possible that part of the USB CDC stack is getting called before the supporting function pointer structures have been filled in?

How are you determining when it is OK to send data out the USB interface?
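One way to answer that question in code is a single gate that every sender has to pass. The sketch below compiles on a PC and uses stand-in names (usb_link_t, usb_try_send, fake_transmit) that are NOT part of the ST library. On the target, the two flags would presumably be derived from the USB stack (device configured by the host, previous transfer completed), which you would have to check against your generated USB code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical link state; on the target these flags would come from
 * the USB stack rather than being maintained by hand. */
typedef struct {
    volatile bool configured;  /* host has enumerated the device     */
    volatile bool tx_busy;     /* previous packet is still in flight */
    uint8_t       tx_buf[64];  /* private copy owned by the transfer */
    size_t        tx_len;
} usb_link_t;

typedef enum { SEND_OK, SEND_NOT_READY, SEND_BUSY, SEND_TOO_LONG } send_result_t;

/* Stand-in for the real transmit call: it only marks the link busy. */
static void fake_transmit(usb_link_t *link) { link->tx_busy = true; }

/* Copy the message and start a transfer only when it is safe to do so. */
send_result_t usb_try_send(usb_link_t *link, const char *msg)
{
    assert(link != NULL && msg != NULL);
    size_t len = strlen(msg);

    if (!link->configured)         return SEND_NOT_READY;
    if (link->tx_busy)             return SEND_BUSY;
    if (len > sizeof link->tx_buf) return SEND_TOO_LONG;

    memcpy(link->tx_buf, msg, len); /* caller may now reuse its own buffer */
    link->tx_len = len;
    fake_transmit(link);
    return SEND_OK;
}
```

The private copy is deliberate: if a transfer is still reading the caller's buffer while the next sprintf() rewrites it, the output could be a mix of two messages. Whether that is what happens here is only a guess without seeing the rest of the code.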

Try setting a breakpoint at 0x0000 0200 and see what the call stack looks like when you hit the breakpoint. To set the breakpoint you may have to go into

As a debugging tool, I configure the memory protection unit to detect NULL pointers by setting memory down at 0x0000 0000 as forbidden for any access. Then I get an MPU fault whenever the code tries to use a NULL pointer.

Benjamin Brammer · ‎2020-01-28

Hello Bob,

The USB driver lib gets initialized together with the HAL drivers, before my main program starts.

When my program runs without USB, all the communication is done inside the LPTIM2_AutoReloadCallback(..) function, and data is sent from there via LPUART.

I now had a new kind of fault after I had tried to plug and unplug the USB cable several times:

The errors seem to occur randomly and do not always appear when the cable is disconnected and re-connected. I also tried setting up a watchpoint for the problematic address, but this time it was not triggered since the address is now different.

I have never worked with the MPU. I will try to implement your debug suggestion. Address 0x0000 0000 is the reset vector address, right? From there it points to the first startup address of the code?

Would it not always be useful to use the MPU and simply define all the reserved and peripheral memory parts of the controller as not executable? I think 16 regions was the maximum number of regions, right?

Bob S · ‎2020-01-28

Yes, in general setting the MPU to trap all access to non-existent memory/peripherals AND to the memory at 0x0 is helpful. There are tricks you can play with sub-regions to expand the areas you can cover. Alas, that would not help with the fault you just got, since it was executing from a FLASH address that should be valid to execute from (from the MPU's perspective).

Homework - download PM0214 from the ST web site. This is the "STM32 Cortex®-M4 MCUs and MPUs programming manual". Read the description on "UNDEFINSTR". See what registers are valid and what they tell you about the fault.
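To make the fault status easier to read while you go through that chapter, here is a small decoding sketch. It assumes you copy the 32-bit CFSR value (SCB->CFSR at 0xE000ED28) out of the debugger or your fault handler; the bit positions are the UsageFault bits as I know them from the programming manual, so verify them against PM0214. The names usage_fault_t and decode_usage_fault are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* UsageFault status bits inside the 32-bit CFSR register. */
#define CFSR_UNDEFINSTR (1UL << 16) /* undefined instruction executed   */
#define CFSR_INVSTATE   (1UL << 17) /* invalid EPSR state (Thumb bit)   */
#define CFSR_INVPC      (1UL << 18) /* invalid EXC_RETURN / PC load     */
#define CFSR_NOCP       (1UL << 19) /* coprocessor (FPU) not available  */
#define CFSR_UNALIGNED  (1UL << 24) /* unaligned access trapped         */
#define CFSR_DIVBYZERO  (1UL << 25) /* divide by zero trapped           */

typedef struct {
    bool undefinstr, invstate, invpc, nocp, unaligned, divbyzero;
} usage_fault_t;

/* Split a raw CFSR value into its individual UsageFault flags. */
void decode_usage_fault(uint32_t cfsr, usage_fault_t *out)
{
    assert(out != NULL);
    out->undefinstr = (cfsr & CFSR_UNDEFINSTR) != 0;
    out->invstate   = (cfsr & CFSR_INVSTATE)   != 0;
    out->invpc      = (cfsr & CFSR_INVPC)      != 0;
    out->nocp       = (cfsr & CFSR_NOCP)       != 0;
    out->unaligned  = (cfsr & CFSR_UNALIGNED)  != 0;
    out->divbyzero  = (cfsr & CFSR_DIVBYZERO)  != 0;
}
```

A raw value of 0x00010000, for example, decodes to UNDEFINSTR only.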

Whenever you get a fault, you need to see WHAT generated the fault. You have the call stack right there in the upper left of your screen. It shows you the address (in LPTIM2_IRQHandler()) that caused the fault. Click on it and see what is there. I'm betting this is constant data that the compiler embeds in the code (for example, when it needs to load the address of a global variable, it stores that address as a data word somewhere after the next "branch" or "return" instruction). You may need to view the disassembly to see that exact address.

What is similar about this fault and the other faults you have been getting? Stop and think before continuing to read.

Got an idea yet?

I suspect the common symptom is (still) corrupted function pointers causing the code to jump somewhere it is not supposed to. Or maybe corrupted stack that causes the return address to send the program to the wrong place. But that kind of depends on what you find when you look at the source of the UNDEFINSTR fault.

This is getting close to (or is already past) what can be diagnosed by someone on a forum who has no access to your full code. Or maybe it is something straightforward that I am just not seeing. But since nobody else has jumped in here, it is time to get back to the basics of debugging. Here are SOME questions you should look at, in no particular order. See where these lead you.

Does this EVER happen if you never plug in the USB cable?
Does this EVER happen if you plug in the USB cable and leave it plugged in? What if you plug in the USB **BEFORE** you start collecting data? What if you plug in the USB **AFTER** you start collecting data?
You say that when your program runs without USB, all comms are done in the LPTIM interrupt handler callback function. What changes when you are using the USB? Where is the code that sends data to the USB interface? In an interrupt function or in your main polling loop?
Can you decrease the time interval between outputting the data strings to make this fault happen more often?
You may have to add some kind of logging functions to parts of your code (or maybe the USB/CDC driver code) so you can figure out what functions have been called. This would log data to somewhere in RAM that you can then look at after a fault. Or get a debugger that can use the extended trace bits from the CPU to log the execution path. I don't know that the ST-Link devices can do that, maybe the ST-Link3. The Segger devices can.
A question you never answered - when using the USB, how does your program determine that it is OK to send data to the USB interface?
Did you look for strcpy() and sprintf() calls and change them to strncpy() and snprintf()? Do you have any other function calls that copy data to arrays and might overrun buffers?
Look very carefully at variables that are assigned values inside interrupt functions, INCLUDING callback functions, that are also used by your main polling loop. At the very least these must be declared volatile. And you may need disable/enable interrupt guards around them in your main code to keep the interrupt functions from modifying them in the middle of the main code also modifying them.
Try to disable as much of your program's functionality as possible, yet still exhibit this problem. Then re-enable one thing at a time back into the program and see what breaks it.
Are you using malloc()? Or, if this is a c++ program, are you using "new"?
Are you ever passing the address of a local variable to a function that stores that address to use later?
How much RAM are you using? Look at your map file. Your stack pointer looks reasonable, unless you are using close to 96K bytes of RAM.
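For the logging idea in the list above, a RAM trace buffer can be as small as this. It is only a sketch: the names (trace_log, trace_recent, TRACE_DEPTH) and the entry layout are invented for illustration, and the head counter is not protected against being updated from main code and an interrupt at the same time, so an occasional entry may be lost or overwritten.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_DEPTH 32u /* number of entries kept; must be a power of two */

typedef struct {
    uint16_t id;  /* which function or event was reached */
    uint32_t arg; /* one value worth remembering         */
} trace_entry_t;

static volatile trace_entry_t trace_buf[TRACE_DEPTH];
static volatile uint32_t      trace_head; /* total entries written so far */

/* Record one event; the oldest entry is overwritten when the buffer is full. */
void trace_log(uint16_t id, uint32_t arg)
{
    uint32_t slot = trace_head & (TRACE_DEPTH - 1u);
    trace_buf[slot].id  = id;
    trace_buf[slot].arg = arg;
    trace_head = trace_head + 1u;
}

/* Return the n-th most recent entry (0 = newest). */
trace_entry_t trace_recent(uint32_t n)
{
    assert(n < TRACE_DEPTH && n < trace_head);
    uint32_t slot = (trace_head - 1u - n) & (TRACE_DEPTH - 1u);
    trace_entry_t e = { trace_buf[slot].id, trace_buf[slot].arg };
    return e;
}
```

Call trace_log() at the entry of each suspicious function with a unique id. After a fault, inspect trace_buf and trace_head in the debugger's memory view to see which functions ran last.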

One key rule for debugging is "only change one thing at a time, then test it". That is the only way to know for sure WHAT fixed the problem, or made it worse, or made it change its behavior.

Benjamin Brammer · ‎2020-01-30

Hello Bob,

Thanks for your help and advice, and of course your patience.

Today was the first time I got a hard fault while just using the LPUART communication interface:

I will work through your homework list and hopefully find the problem.

As soon as I have solved it, or don't know what to do anymore, I will report back.

And one thing that all the different faults always have in common is the LR address: 0xffffffe9

Bob S · ‎2020-01-30

LR = 0xffffffe9 indicates the type of exception return needed (return to "thread" mode, exception return uses floating point state from MSP and execution uses MSP after return).

First homework step: Look at the code in the disassembly view at address 0x08002acc and see what is there. THAT is the address from which the invalid instruction was fetched. And if I recall correctly, it has been that same address for the last few faults that you posted. See the 3rd paragraph of my previous message. If it looks like this is really supposed to be executed code, instead of data embedded in the code, then is there any chance you are compiling for a different CPU than the one you are actually using?
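If you want to decode such LR values yourself, the relevant bits are few. The sketch below is mine (the names exc_return_t and decode_exc_return are made up) and follows the EXC_RETURN encoding for a Cortex-M4 with FPU as I understand it; check it against the programming manual.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool valid;       /* upper bits match the EXC_RETURN pattern      */
    bool thread_mode; /* bit 3: 1 = return to Thread, 0 = Handler     */
    bool uses_psp;    /* bit 2: 1 = PSP, 0 = MSP                      */
    bool fp_frame;    /* bit 4: 0 = stacked frame includes FP state   */
} exc_return_t;

/* Decode the LR value seen inside an exception or fault handler. */
void decode_exc_return(uint32_t lr, exc_return_t *out)
{
    assert(out != NULL);
    out->valid       = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
    out->thread_mode = (lr & (1u << 3)) != 0;
    out->uses_psp    = (lr & (1u << 2)) != 0;
    out->fp_frame    = (lr & (1u << 4)) == 0;
}
```

For 0xffffffe9 this gives: valid, Thread mode, MSP, frame with floating point state.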

Good luck, and good bug hunting!

Benjamin Brammer · ‎2020-02-03

Hello Bob,

I once again got the error. It seems to be the LPTIM2_IRQHandler() but when I want to see what is in the disassembly view, there is nothing shown:

When I start a new debug session I can go to address 0x08002acc. It has the following instructions:

          LPTIM2_IRQHandler:
08002acc:   push    {r7, lr}
08002ace:   add     r7, sp, #0
273         HAL_LPTIM_IRQHandler(&hlptim2);
08002ad0:   ldr     r0, [pc, #8]    ; (0x8002adc <LPTIM2_IRQHandler+16>)
08002ad2:   bl      0x8005df4 <HAL_LPTIM_IRQHandler>

I don't get what the problem is here. I think it means I push r7 and the LR onto the stack, right?

Bob S · ‎2020-02-05

My previous reply may have been misleading, sorry. If you had read the Cortex M4 programming manual, it would have told you that for UNDEFINSTR faults the PC shows the address of the invalid instruction (there, I did your homework for you). In your post on Jan 30th, the PC showed the same value as the LPTIM2_IRQHandler starting address, which is why I pointed you to look there. In your Feb 3rd post, the PC is zero. But I think this is all leading down the wrong rabbit hole (wild goose chase? pick your expression).

The LPTIM2_IRQHandler assembly code you posted looks like you are compiling with optimization level zero (-O0). Just for kicks, change the optimization level to "debug" (-Og). I don't really expect a compiler bug, but just curious how (or *IF*) this changes the program's behavior. If it works with -Og, DON'T CONSIDER IT FIXED. It may have just moved the problem somewhere that is less obvious.

Look at all of your callback functions. Remember, these are called from interrupts. Any non-local variables that they alter must be declared "volatile". And any variables used by both callbacks and main code that cannot be accessed atomically need disable/enable interrupt guards around them when accessed from non-interrupt functions. Have you double and triple checked that you are no longer using sprintf() or strcpy()? These are all shots in the dark because without seeing your code nobody else has any idea what is going on.
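As an illustration of the sprintf() replacement, a bounded formatter might look like the sketch below. The function name format_measure and the long parameter types are my assumptions; only the "ABP;%ld;%ld;%ld\n" format comes from your first post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format one measurement line into dst.
 * Returns the number of characters written (without the terminator),
 * or -1 if the text would not fit, in which case dst is left empty. */
int format_measure(char *dst, size_t dst_size, long sys, long dia, long map)
{
    assert(dst != NULL && dst_size > 0);
    int n = snprintf(dst, dst_size, "ABP;%ld;%ld;%ld\n", sys, dia, map);
    if (n < 0 || (size_t)n >= dst_size) {
        dst[0] = '\0'; /* never hand a truncated line to the sender */
        return -1;
    }
    return n;
}
```

The return value doubles as the length for the transmit call, so a separate strlen() on a possibly damaged buffer is not needed.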

Benjamin Brammer · ‎2020-02-05

Hello Bob,

Thanks for your answer. In fact I did read that part of the programming manual, but as I wrote in my last post, I could not see what was at the PC address. If it was 0x0, that would mean it is the reset vector, right? So does it make sense that I cannot see anything in the disassembly? Nevertheless you are right, this leads to no solution, and it is getting more obvious that your assumption of a stray pointer is correct. I hope I can find the problem.

What do you mean by atomic access? Does this mean that reading or writing data could take more than one instruction, and that an interrupt triggered in between could corrupt the data?

In fact I have some data that is not declared volatile but is used in both main and interrupt code. I will check on that.

Bob S · ‎2020-02-06

> What do you mean with atomic access? Does this mean I could have more than one instruction

> to read or write data and that if an interrupt get's triggered this could corrupt me the data?

Yes. "atomic" means "non-interruptible". For CortexM4, reading OR writing a naturally aligned 8-bit, 16-bit or 32-bit value is atomic. Read-modify-write is not. So, for example, if your non-interrupt code wants to set or clear a bit in a 32-bit variable, it (1) reads the variable from RAM into a register, (2) alters the bit, then (3) writes the new value back to RAM. If an interrupt happens between (1) and (2), or between (2) and (3), and the interrupt also modifies that variable, the non-interrupt code will overwrite the interrupt's new value with old data.

Also, if you don't declare the variable as "volatile", the compiler is free to read the variable into a register, update the value IN THE REGISTER, check the value IN THE REGISTER, and write the register back to RAM (if needed) when it is no longer used. If an interrupt occurs while the main code has this variable in a register, the main program will never see that new value, and will overwrite the interrupt's new value with the main program's register contents.
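Put together, a guarded read-modify-write on a shared flag word could look like this sketch. It compiles on a PC, so the interrupt masking is replaced by stand-ins (irq_save_and_disable, irq_restore) that I made up; on a Cortex-M target they would typically wrap the CMSIS intrinsics __get_PRIMASK(), __disable_irq() and __set_PRIMASK(). The flag variable and function names are hypothetical as well.

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-ins for the interrupt mask (assumption: replace these with
 * the real masking code on the target). */
static uint32_t irq_mask_depth;
static uint32_t irq_save_and_disable(void) { return irq_mask_depth++; }
static void     irq_restore(uint32_t saved) { irq_mask_depth = saved; }

/* Shared between the interrupt callback and the main loop, so the
 * compiler must re-read it from RAM on every access. */
static volatile uint32_t status_flags;

/* Called from the (hypothetical) timer callback. */
void isr_set_flag(uint32_t mask) { status_flags |= mask; }

/* Called from main code: the read-modify-write sequence is guarded so
 * the interrupt cannot change status_flags in the middle of it. */
void main_clear_flag(uint32_t mask)
{
    assert(mask != 0u); /* clearing nothing is a caller bug */
    uint32_t saved = irq_save_and_disable();
    status_flags &= ~mask;
    irq_restore(saved);
}
```

Saving and restoring the previous mask state, instead of blindly re-enabling, keeps the guard safe to use from code that already runs with interrupts disabled.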

And based on your other post (https://community.st.com/s/question/0D50X0000C5TILvSQO/adding-a-lf-to-a-string-of-unkown-length), you are still learning about possible buffer overflows. Though you did use strncpy() in your first post :) So, again, go back over every piece of code that writes to a buffer and make absolutely sure you are staying within the allocated buffer sizes.

WARNING: while strncpy() will limit the number of bytes it copies, it does NOT guarantee that the destination buffer will be NULL terminated (http://www.cplusplus.com/reference/cstring/strncpy/ ). So using strlen() after a strncpy() call may give you unexpected results. When I use strncpy() I always pass it the length of the destination buffer MINUS ONE, and then always write a NULL into the last byte of the destination buffer. That way I know for sure that the buffer is always a valid NULL-terminated "C" string.
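In code, that habit looks roughly like this. The helper name copy_string and its return convention are my own choice for the example, not a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy src into dst and guarantee a terminated string.
 * Returns 1 if src had to be truncated, 0 otherwise. */
int copy_string(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    strncpy(dst, src, dst_size - 1); /* pads with '\0' when src is shorter */
    dst[dst_size - 1] = '\0';        /* covers the case where src is longer */
    return strlen(src) >= dst_size;
}
```

When the source is shorter than the count, strncpy() fills the rest of the counted bytes with zero bytes, so the explicit terminator only matters when the source is as long as or longer than the count.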

Benjamin Brammer · ‎2020-02-06

Hey Bob,

Yeah, I am still learning. Although I have done quite a few embedded projects, I feel that I am still not experienced enough, especially when it comes to programming for embedded systems and what to keep in mind when using C functions that were written for an OS like Windows.

I am really thankful for your and others' help, so that I can learn more. Do you have some literature that you would recommend? I recently bought the book "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors" by Joseph Yiu, which I find quite good so far, but it would be great to have some literature about how to write good software for an embedded system and what to watch out for.

In your last paragraph you pointed out the difficulties when using strncpy() and strlen(). If I copy a string whose size is smaller than my buffer array into the buffer via strncpy() and pass the length of my buffer as the n value, then my buffer would be padded with NULL and this wouldn't be a problem, or am I wrong?