STM32F4 in hardfault caused by CAN

Quo · ‎2024-04-12

Hello everyone,

I am using an STM32F405 for communication over CAN. It receives messages on CAN1 and CAN2. For drivers I use the STM32 HAL. My initialization procedure is as follows:

Init clocks
Init GPIOs
Init CAN
- Init CAN peripheral
- Init filters
- Activate notification(CAN_IT_RX_FIFO0_MSG_PENDING, CAN_IT_RX_FIFO0_OVERRUN)
- HAL_CAN_Start

I repeat this for CAN1 and CAN2.

The code works fine when CAN messages start arriving some time after the CAN peripheral is initialized, but if a message arrives immediately, the microcontroller ends up in a HardFault.

In the production environment there are two devices on each bus: the microcontroller and a BMS. In this setup the HardFault occurs regularly, because the microcontroller doesn't ACK the CAN frame, which the BMS then retries immediately. If I add a PCAN, which ACKs the messages the BMS sends, the microcontroller works fine, but this is not the situation we have in production.

I've found a workaround: if I call HAL_CAN_Start 5 seconds later, there is no HardFault. But I would like to understand why this happens when I call it immediately. Is there some time that should be left for the peripheral to 'settle in' after the Init, or did I do something else wrong?

mƎALLEm · ‎2024-04-12

Hello,

I don't think the CAN communication is related to the HardFault. It could be something in your code.

You need to set up a test project with a minimal CAN configuration plus data communication, and reproduce the issue there. If you are using an RTOS or something else complex, remove it and check again, as the problem could be related to that.

Have you already looked at or tested the example provided in STM32CubeF4 under this path?

I think it could be a good starting point.

To give better visibility on the answered topics, please click on "Accept as Solution" on the reply which solved your issue or answered your question.

Tesla DeLorean · ‎2024-04-12

Almost certainly got nothing to do with the CAN peripheral.

And everything to do with failings in your code. Instrument the HardFault_Handler() so it provides actionable data. Look at what the MCU is complaining about. Perhaps you're violating some memory access, say via an errant pointer, or you're corrupting the stack.
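A minimal sketch of what instrumenting the handler could look like, assuming GCC and a Cortex-M4 target. The names `fault_frame_t`, `g_last_fault`, `hard_fault_capture` and `hard_fault_report` are hypothetical, not part of the HAL or CMSIS. On exception entry the core pushes R0-R3, R12, LR, PC and xPSR onto the active stack, so the handler only has to work out which stack pointer was in use and copy those eight words somewhere a debugger can read them. If the project already defines HardFault_Handler (CubeMX generates one in stm32f4xx_it.c), that definition would have to be replaced.

```c
#include <stdint.h>

/* Registers the Cortex-M core stacks on exception entry, in stack order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} fault_frame_t;

/* Hypothetical global: inspect it from the debugger after a fault. */
volatile fault_frame_t g_last_fault;

/* Copy the eight stacked words, starting at sp, into g_last_fault. */
void hard_fault_capture(const uint32_t *sp)
{
    g_last_fault.r0   = sp[0];
    g_last_fault.r1   = sp[1];
    g_last_fault.r2   = sp[2];
    g_last_fault.r3   = sp[3];
    g_last_fault.r12  = sp[4];
    g_last_fault.lr   = sp[5];   /* return address of the interrupted code */
    g_last_fault.pc   = sp[6];   /* address of the faulting instruction */
    g_last_fault.xpsr = sp[7];
}

/* Target-only part: GCC defines __ARM_ARCH_7EM__ when building for Cortex-M4. */
#if defined(__GNUC__) && defined(__ARM_ARCH_7EM__)
void hard_fault_report(const uint32_t *sp) __attribute__((used));
void hard_fault_report(const uint32_t *sp)
{
    hard_fault_capture(sp);
    for (;;) {
        /* Halt here; read g_last_fault with the debugger. */
    }
}

/* Bit 2 of EXC_RETURN (held in LR) tells whether MSP or PSP was active. */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst   lr, #4            \n"
        "ite   eq                \n"
        "mrseq r0, msp           \n"
        "mrsne r0, psp           \n"
        "b     hard_fault_report \n");
}
#endif
```

With this in place, `g_last_fault.pc` can be looked up in the map file or disassembly to find the instruction that faulted.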

The CAN HAL code might have its own issues, but you'd need to debug them the same way. You have ALL the source code available to you.

Hard Faults are usually due to gross errors, and should be relatively easy to pin down with classic debugging techniques.

Perhaps you have colleagues who can help you review and analyze the problem?

Do other devices on the bus have expectations about how frequently they are queried or sent data?

Does the failure behaviour change or move depending on whether the compiler has optimization on or off?


Quo · ‎2024-04-17

I got some time and looked into the System Control Block registers.

Every time it fails the HFSR reads FORCED = 1.

There are two lines of code where it fails:

(float) (HAL_GetTick() & 2047)) / 2047;

If the code fails on this line it fails with `No coprocessor usage fault(NOCP)` bit set in CFSR.

The other line where it fails is a call to the FreeRTOS function, but it fails on the function call itself.

If the code fails on the function call then `Invalid PC load usage fault(INVPC)` bit is set in CFSR.
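For reference, the flags mentioned above sit at fixed bit positions: FORCED is bit 30 of HFSR, and the usage fault flags occupy the upper half of CFSR (INVPC is bit 18, NOCP is bit 19). A small decoder, sketched here with hypothetical helper names, makes the register dumps easier to read. On the target the inputs would be the CMSIS registers `SCB->HFSR` and `SCB->CFSR`.

```c
#include <stdbool.h>
#include <stdint.h>

/* HardFault Status Register bits. */
#define HFSR_VECTTBL    (UINT32_C(1) << 1)   /* fault on vector table read */
#define HFSR_FORCED     (UINT32_C(1) << 30)  /* escalated configurable fault */

/* Usage fault bits as they appear in the 32-bit CFSR. */
#define CFSR_UNDEFINSTR (UINT32_C(1) << 16)
#define CFSR_INVSTATE   (UINT32_C(1) << 17)
#define CFSR_INVPC      (UINT32_C(1) << 18)
#define CFSR_NOCP       (UINT32_C(1) << 19)
#define CFSR_UNALIGNED  (UINT32_C(1) << 24)
#define CFSR_DIVBYZERO  (UINT32_C(1) << 25)

/* True when the HardFault is an escalation of another fault, in which
 * case CFSR holds the actual cause. */
bool hard_fault_is_forced(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0;
}

/* Name of the lowest usage fault bit set in cfsr, or "none". */
const char *usage_fault_name(uint32_t cfsr)
{
    if (cfsr & CFSR_UNDEFINSTR) return "UNDEFINSTR";
    if (cfsr & CFSR_INVSTATE)   return "INVSTATE";
    if (cfsr & CFSR_INVPC)      return "INVPC";
    if (cfsr & CFSR_NOCP)       return "NOCP";
    if (cfsr & CFSR_UNALIGNED)  return "UNALIGNED";
    if (cfsr & CFSR_DIVBYZERO)  return "DIVBYZERO";
    return "none";
}
```

A forced HardFault with a usage fault bit set usually means the UsageFault handler is not enabled, so the fault was escalated. The bit in CFSR is then the thing to chase, not the HardFault itself.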

And again, if I unplug all the CAN devices, or use something to listen and ACK the messages then this doesn't occur, otherwise it consistently occurs. I've checked the clock and power supplies, they are fine.

mƎALLEm · ‎2024-04-17

Hello,

(float) (HAL_GetTick() & 2047)) / 2047;

What do you mean by this line? What is the purpose?

The other line where it fails is a call to the FreeRTOS function, but it fails on the function call itself.

Could it be a stack size issue of your Task?

And again, if I unplug all the CAN devices, or use something to listen and ACK the messages then this doesn't occur, otherwise it consistently occurs. I've checked the clock and power supplies, they are fine.

Again, CAN has nothing to do with the hardfault. The behavior you're seeing is only the result and not the origin of the issue.


Andrew Neil · ‎2024-04-17

@Quo wrote:
And again, if I unplug all the CAN devices, or use something to listen and ACK the messages then this doesn't occur, otherwise it consistently occurs. I've checked the clock and power supplies, they are fine.

That just indicates that the faulty piece(s) of code don't get called when there's no CAN activity - it doesn't indicate a fault with the CAN itself.

A complex system that works is invariably found to have evolved from a simple system that worked.
A complex system designed from scratch never works and cannot be patched up to make it work.

Tesla DeLorean · ‎2024-04-17

Double check the voltage at the VCAP pins, and the bulk capacitance you have placed on the board. Expect 1.25 V and about 4.7 µF total across the pin(s).

Check the Flash latency / wait state settings.

Check the FPU is enabled, typically done in SystemInit()

Check the values in the Vector Table, make sure there aren't any errant addresses or bindings there.

Check you're not trashing the stack, say with excessively large auto/local variables.

Check you're not using auto/local variable for buffers that are used beyond the scope of the function/block within which they are defined and valid.
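One common way to check for stack overruns is to paint the stack with a known pattern before use and later count how much of the pattern survives. The sketch below is a generic illustration with hypothetical names (`STACK_FILL_WORD`, `stack_paint`, `stack_unused_words`), not an ST or FreeRTOS API. For FreeRTOS task stacks, the kernel's own `uxTaskGetStackHighWaterMark()` provides a similar measurement.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill pattern; any value unlikely to occur in real data works. */
#define STACK_FILL_WORD UINT32_C(0xA5A5A5A5)

/* Fill a stack region with the pattern before it is put to use. */
void stack_paint(uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        base[i] = STACK_FILL_WORD;
    }
}

/* Cortex-M stacks grow downward, so base[0] is the last word to be used.
 * Count the still-painted words from the low end: that is the headroom
 * that was never touched. A result of 0 means the stack was fully used
 * and may have overflowed. */
size_t stack_unused_words(const uint32_t *base, size_t words)
{
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL_WORD) {
        n++;
    }
    return n;
}
```

Calling `stack_unused_words()` periodically, or from the fault handler, shows how close the stack came to its limit.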


Quo · ‎2024-05-06

Hello, apologies for the late reply and thanks for the ideas.

The FPU was not enabled, as the compiler decided it didn't need it for some reason, so I enabled it. This solved the NOCP issue.
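For anyone hitting the same NOCP fault: on Cortex-M4F the FPU is reached as coprocessors CP10 and CP11, and access is granted by setting bits 20-23 of CPACR. The helpers below are a hypothetical sketch of that bit manipulation. In the ST-provided system_stm32f4xx.c the equivalent write in SystemInit() is, as far as I know, wrapped in a check of `__FPU_PRESENT` and `__FPU_USED`, so whether it runs can depend on the build flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* CP10 and CP11 access fields are CPACR bits 20-23; 0b11 each = full access. */
#define CPACR_CP10_CP11_FULL (UINT32_C(0xF) << 20)

/* Return the CPACR value with full access to the FPU granted. */
uint32_t cpacr_enable_fpu(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_CP11_FULL;
}

/* True when both CP10 and CP11 are set to full access. */
bool cpacr_fpu_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}

/* On the target (CMSIS), early in startup and before any float code runs:
 *     SCB->CPACR = cpacr_enable_fpu(SCB->CPACR);
 *     __DSB();
 *     __ISB();
 */
```

Checking `cpacr_fpu_enabled(SCB->CPACR)` in a debugger watch before the first floating-point instruction is a quick way to confirm the setting took effect.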

The other issue remained, however. I checked the stack and it was not being overrun, and the vector table was what we would expect it to be.

Also, VCAP was 1.25 V and there was not too much ripple on it.

The code was built with -O0. When we switched to -Og the problem no longer appeared, and the same goes for -O1 and -O2. All of the code was compiled with the default compiler supplied with CubeIDE on Windows (10.3.1 20210824). Then we built our own GCC 12.3, and the problem did not occur even at -O0, so we may be hitting a compiler issue.

Anyway, thanks for all the suggestions.