FDCAN with custom bootloader

zmiga · 2023-12-22

Hello everyone!

Background

We are working on a custom board based on the STM32G474VET6 for a specific motor-control application. The project requires a custom bootloader over FD-CAN with a 500 kbit/s arbitration bit rate and a 2 Mbit/s data bit rate, with BRS enabled. The bootloader is done and working fine.

The board supports three CAN buses and therefore uses the FDCAN1, FDCAN2 and FDCAN3 peripherals. All three buses have the same configuration and are not interconnected.

Everything works perfectly when only the application runs on the MCU. As soon as we add the bootloader, which runs on FDCAN1, the same application that was driving all three CANs (adjusted to be bootloadable, of course: offset in flash, vector table relocated) hangs as soon as any message is received or transmitted on either FDCAN2 or FDCAN3. With no traffic on FDCAN2 or FDCAN3, everything works as expected. Besides the CAN peripherals we also run ADC1 and ADC2 with injected and regular channels triggered by center-aligned PWM timers, a UART for debugging, and so on.

In short, the application boots completely normally and works until we receive something on FDCAN2 or FDCAN3; then things fall apart. Note that FDCAN2 and FDCAN3 work perfectly without the bootloader, so I suspect something in the transition from bootloader to application.

FDCAN Driver

The FDCAN driver source files are attached to this post. They show how the peripherals are configured and used.

Transition from bootloader to application

Before entering the application, the bootloader de-initializes and resets every peripheral it uses. FDCAN is de-initialized as well, so when the application boots it should find FDCAN in the same clean state as in the standalone application.

De-init routine before entering application:

    // De-Initialize cryptographic library
    if ( CMOX_INIT_SUCCESS != cmox_finalize( NULL ))
    {
         status = eBOOT_ERROR;
    }

    // De-init gpio
    gpio_deinit();

    // De-init CAN
    can_deinit( eCAN_MASTER );

    // De-init clocks
    HAL_RCC_DeInit();

    // De-init HAL
    HAL_DeInit();

    // Disable systick
    SysTick->CTRL &= ~SysTick_CTRL_ENABLE_Msk;

Problem with FDCAN on application with bootloader support

Receiving or transmitting anything on FDCAN2 triggers FDCAN1_IT1_IRQHandler.
Receiving or transmitting anything on FDCAN3 ends up in the Default Handler.

Questions

Why do we end up in FDCAN1_IT1_IRQHandler when the EINT1 bit in the FDCAN1 ILE register is cleared?
Why does traffic on FDCAN2 trigger an FDCAN1 interrupt in the first place?
Ending up in the Default Handler probably means that some interrupt is not being handled. Can you recommend a way to find out which one? Again, the standalone application works perfectly!
Is there something the bootloader should do with the FDCAN peripherals before entering the application?
Is there an official ST recommendation on how to de-initialize FDCAN before entering the application?
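
For the Default Handler question, a minimal sketch that is not taken from ST documentation: on Cortex-M the number of the currently active exception is held in the VECTACTIVE field (bits 8:0) of SCB->ICSR, and subtracting 16 gives the IRQ number as listed in the device header. The helper name `irq_number_from_icsr` is hypothetical.

```c
#include <stdint.h>

/* VECTACTIVE occupies bits [8:0] of SCB->ICSR on Cortex-M. */
#define ICSR_VECTACTIVE_MASK 0x1FFu

/* Hypothetical helper: convert a raw ICSR value into an IRQ number.
 * Exception numbers 0..15 are system exceptions, so the result is
 * negative for those and >= 0 for device interrupts. */
static inline int irq_number_from_icsr(uint32_t icsr)
{
    return (int)(icsr & ICSR_VECTACTIVE_MASK) - 16;
}
```

On the target, one option is to halt in the Default Handler with a debugger, read SCB->ICSR, decode it as above, and look the resulting number up in the IRQn_Type enum of the device header.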

Any recommendations are highly welcome. We're eager to hear your suggestions and answers!

BR, Žiga

zmiga · 2023-12-23

Update (23.12.2023): I suspected that the bootloader might be messing up the FDCAN peripherals, since everything works without problems in the standalone application. So I ran a test with a bootloader so simple that it does only one thing: jump to the application. Beforehand I flashed the bootloadable application to the proper flash location using STM32CubeProgrammer.

Even with that simple bootloader, FDCAN2 and FDCAN3 still do not work in the bootloadable application! That means the bootloader itself has nothing to do with the FDCAN problem, as it doesn't touch anything at all. The problem must be in the application as prepared for the bootloader.

With that said, I'll start looking for issues on the application side.

BR, Žiga

Tesla DeLorean · 2023-12-23

Well, personally I wouldn't tear down the entire system.

Have a plan for transferring control. Define the expectations on each side; there's little reason to do initialization TWICE, and it can be disruptive. If the clocks and peripherals are all running properly, there's no need to kill them. If the clocks are already optimal/maximal, don't run SystemClock_Config() AGAIN.

What you want to stop is interrupts at their source (peripheral register), and expectations in HAL initialization/instances you hold in RAM.

Make sure the Vector Table is correctly selected on the App side, perhaps by using the symbol for it in the SystemInit() of SCB->VTOR

Do not transfer control from an Interrupt Handler or Call-Back. Watch for RTOS and User/System execution context.
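
One way to read the advice above about stopping interrupts before handing over control, sketched under assumptions: in addition to disabling interrupts in the peripheral registers, the bootloader can disable and un-pend every NVIC line just before the jump. The function name `nvic_quiesce` is hypothetical, and the register arrays are passed in as pointers so that the routine can also be exercised off-target.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: disable and clear-pending every interrupt line.
 * icer/icpr point to the NVIC Interrupt Clear-Enable and Clear-Pending
 * register arrays; writing 1 to a bit clears the corresponding enable
 * or pending flag, writing 0 has no effect. */
static void nvic_quiesce(volatile uint32_t *icer,
                         volatile uint32_t *icpr,
                         size_t words)
{
    for (size_t i = 0; i < words; i++) {
        icer[i] = 0xFFFFFFFFu;  /* disable 32 interrupt lines */
        icpr[i] = 0xFFFFFFFFu;  /* drop any pending requests  */
    }
}
```

With CMSIS on a Cortex-M4 this would be called as nvic_quiesce(NVIC->ICER, NVIC->ICPR, 8) from the main loop, before setting the stack pointer and branching to the application's reset handler.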

zmiga · 2023-12-24

Hi Tesla DeLorean,

thank you for your comments/recommendations!

"Well, personally I wouldn't tear down the entire system."

That is my plan as well. I'll strip things out of the application until CAN starts to work.

"Have a plan for transferring control. Define the expectations on each side; there's little reason to do initialization TWICE, and it can be disruptive. If the clocks and peripherals are all running properly, there's no need to kill them. If the clocks are already optimal/maximal, don't run SystemClock_Config() AGAIN."

That is an interesting way of thinking. I had never considered not re-initializing everything the application needs. It makes perfect sense, but it also has a downside: the application must initialize only what has not already been initialized, which couples the bootloader and application code tightly. For example, if the bootloader interface changes (from FDCAN to something else), the application has to be adapted, because it now has to initialize FDCAN itself where previously the bootloader did. And there are many such real-life scenarios that can happen...

"Make sure the Vector Table is correctly selected on the App side, perhaps by using the symbol for it in the SystemInit() of SCB->VTOR"

This is OK, as otherwise nothing would work. As already mentioned, the ADC, USART and other peripherals trigger interrupts, and apart from FDCAN everything works as expected.

"Do not transfer control from an Interrupt Handler or Call-Back."

Yes, that is also taken care of. The jump to the application is done from the main loop. Right before the jump I also call "__disable_irq()".

Can you say more about "Watch for RTOS and User/System execution context"?

What do you mean by that? Do you have a good reference you can share?

Overall, you focused more on the general boot-app mechanism, but I'm more suspicious of the FDCAN peripheral, as interrupts are triggered while they are disabled.

Do you think the problem is actually at the boot-app level and only manifests at the FDCAN level, so that the FDCAN peripheral itself has nothing to do with these issues?

Thank you!

BR, Žiga

zmiga · 2023-12-26

Well, I finally got to the problem and the solution.

I stripped the application down completely, so that only the CAN code was left, sending a dummy message on all three FDCANs. The result was the same: receiving or transmitting any packet on FDCAN2 or FDCAN3 resulted in abnormal interrupt triggering (as explained above). I couldn't get the application to boot and work normally.

The false interrupt triggering got me thinking about the interrupt vector offset settings, and that turned out to be the root cause of the problem that manifests as abnormal FDCAN behaviour (triggering an IRQ that it should not). When I first inspected the VTOR settings, I found nothing wrong, as everything was according to ST recommendations. Looking into system_stm32g4xx.c:

/************************* Miscellaneous Configuration ************************/
/* Note: Following vector table addresses must be defined in line with linker
         configuration. */
/*!< Uncomment the following line if you need to relocate the vector table
     anywhere in Flash or Sram, else the vector table is kept at the automatic
     remap of boot address selected */

// ZIGA: Enable user vector table settings
#define USER_VECT_TAB_ADDRESS

#if defined(USER_VECT_TAB_ADDRESS)
/*!< Uncomment the following line if you need to relocate your vector Table
     in Sram else user remap will be done in Flash. */
/* #define VECT_TAB_SRAM */
#if defined(VECT_TAB_SRAM)
#define VECT_TAB_BASE_ADDRESS   SRAM_BASE       /*!< Vector Table base address field.
                                                     This value must be a multiple of 0x200. */
#define VECT_TAB_OFFSET         0x00000000U     /*!< Vector Table base offset field.
                                                     This value must be a multiple of 0x200. */
#else
#define VECT_TAB_BASE_ADDRESS   FLASH_BASE      /*!< Vector Table base address field.
                                                     This value must be a multiple of 0x200. */

// ZIGA: Added vector offsetting
#ifndef __BOOTLOADER_SUPPORT__
    #define VECT_TAB_OFFSET     0x0U        /*!< Vector Table base offset field. This value must be a multiple of 0x100. */
#else
    #define VECT_TAB_OFFSET     0x8100U        /*!< Vector Table base offset field. This value must be a multiple of 0x100. */  
#endif

#endif /* VECT_TAB_SRAM */
#endif /* USER_VECT_TAB_ADDRESS */
/******************************************************************************/

So I have it set up as:

SCB->VTOR = VECT_TAB_BASE_ADDRESS | VECT_TAB_OFFSET; /* Vector Table Relocation in Internal SRAM */

which results in VTOR = 0x08008100, and that address is the start of the application.

Notice the comment:

/*!< Vector Table base offset field. This value must be a multiple of 0x100. */

Everything seemed to be OK with the VTOR settings, so I didn't pay much attention to that section and moved on to investigate other parts of the boot-app system.

But then I started to play with the VTOR settings, knowing that other ST MCUs require VTOR to be a multiple of 0x200 (STM32F4/L4) or 0x400 (STM32H7). It turns out that changing the VTOR offset from a multiple of 0x100 to a multiple of 0x200 solves the problem that was causing the strange FDCAN behaviour. Now the bootloadable application works the same as the standalone application.
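
The 0x200 figure is consistent with the general Cortex-M alignment rule, sketched here as an assumption to check against the Arm documentation: the table base must be aligned to the table size rounded up to the next power of two, with a minimum of 128 bytes. The table size is (16 system exceptions + number of device interrupts) times 4 bytes. The interrupt count of 102 for the STM32G474 mentioned below is an assumption from memory of the device header and should be verified; the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: alignment (in bytes) that the vector table base
 * needs for a device with n_irq external interrupts. */
static uint32_t vtor_required_alignment(unsigned n_irq)
{
    uint32_t table_bytes = (16u + n_irq) * 4u; /* system exceptions + IRQs */
    uint32_t align = 128u;                     /* architectural minimum    */
    while (align < table_bytes) {
        align <<= 1;                           /* next power of two        */
    }
    return align;
}

/* Hypothetical helper: true if vtor satisfies that alignment. */
static bool vtor_is_aligned(uint32_t vtor, unsigned n_irq)
{
    return (vtor & (vtor_required_alignment(n_irq) - 1u)) == 0u;
}
```

For 102 interrupts the table is 472 bytes long, so the required alignment comes out as 0x200: 0x08008100 fails the check and 0x08008200 passes.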

So my questions now are:

Why did changing VTOR from a multiple of 0x100 to a multiple of 0x200 fix the problem, if the comment clearly states that the offset can be a multiple of 0x100 on the G4 family?
Does ST need to fix those comments in the "system_stm32g4xx.c" file?
It seems to me that FDCAN is only collateral damage here, as any other peripheral could be affected by "wrong" VTOR settings. Can you comment on this?
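
A possible explanation for the first and third questions, offered as a model rather than a confirmed answer: if the core forms the vector fetch address by combining the VTOR base and the exception offset bitwise instead of adding them, a base with bit 8 set makes every vector at offset 0x100 or above alias onto a lower entry. The IRQ numbers used in the note below (FDCAN1_IT1 = 22, FDCAN2_IT0 = 86, FDCAN3_IT0 = 88) are assumptions from memory of stm32g474xx.h and should be verified there; `aliased_irq` is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical model: which vector-table entry is actually fetched for
 * a given IRQ if the fetch address is (VTOR | offset) rather than
 * (VTOR + offset). Returns the IRQ number whose handler would run
 * (negative values denote system-exception slots). */
static int aliased_irq(uint32_t vtor, int irq)
{
    uint32_t offset = (uint32_t)(irq + 16) * 4u;  /* offset inside the table */
    uint32_t fetch  = vtor | offset;              /* assumed OR, not ADD     */
    return (int)((fetch - vtor) / 4u) - 16;       /* entry that was read     */
}
```

Under this model, with VTOR = 0x08008100, IRQ 86 lands on entry 22 and IRQ 88 on entry 24, while interrupts below IRQ 48 map to themselves; with VTOR = 0x08008200 every IRQ maps to itself. If the assumed IRQ numbers are right, that would match FDCAN2 traffic entering FDCAN1_IT1_IRQHandler and FDCAN3 traffic hitting an unused vector (the Default Handler), and it would mean any peripheral with a high IRQ number could be affected in the same way.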

@Tesla DeLorean I'm eager to hear from you!

Thanks in advance!

BR, Žiga