HardFault during NPU Inference on STM32N6: ATON Vector Access Failure

Hugi_ · ‎2026-02-03

Hello,

I’m working on a project using the STM32N6 NPU, and I’m consistently encountering a HardFault during model inference.

To set up my project, I followed this tutorial:
https://community.st.com/t5/stm32-mcus/how-to-build-an-ai-application-from-scratch-on-the-nucleo-n657x0/ta-p/828502

I also used a TFLite model from the Model Zoo on GitHub (I tested several models from the zoo as well as my own custom models).

Using the debugger, I was able to pinpoint that the crash occurs on the following line in network.c:

LL_Arithacc_Init(2, &Conv2D_7_mul_scale_4_init3);

More specifically, the fault happens inside ll_aton.c at:

uint32_t t;
uint32_t A = Ap != NULL ? LL_ATON_getbits(Ap, bitcnt_A, nbits_A) : (uint32_t)conf->A_scalar;

The convolution configuration generated in network.c looks like this:

static const LL_Arithacc_InitTypeDef Conv2D_7_mul_scale_4_init3 = {
    .rounding_x = 0,
    .saturation_x = 0,
    .round_mode_x = 0,
    .inbytes_x = 2,
    .outbytes_x = 2,
    .shift_x = 0,
    .rounding_y = 0,
    .saturation_y = 0,
    .round_mode_y = 0,
    .inbytes_y = 2,
    .outbytes_y = 2,
    .combinebc = 0,
    .clipout = 0,
    .shift_y = 0,
    .rounding_o = 1,
    .saturation_o = 1,
    .round_mode_o = 1,
    .relu_mode_o = 0,
    .outbytes_o = 2,
    .shift_o = 16,
    .scalar = 0,
    .dualinput = 0,
    .operation = ARITH_AFFINE,
    .bcast = ARITH_BCAST_CHAN,
    .Ax_shift = 0,
    .By_shift = 0,
    .C_shift = 0,
    .fWidth = 224,
    .fHeight = 224,
    .fChannels = 16,
    .batchDepth = 8,
    .clipmin = 0,
    .clipmax = 0,
    .A_scalar = 1,
    .B_scalar = 0,
    .C_scalar = 0,
    .A_vector = {((unsigned char *)(ATON_LIB_PHYSICAL_TO_VIRTUAL_ADDR(0x71000000UL + 11064256)))},
    .B_vector = {0},
    .C_vector = {0},
    .vec_precision = {16, 16, 16},
};

Initially, I flashed the FSBL and the application using the programmer. Later, I noticed that I could load the application directly into RAM using the debugger. I also enabled RIF for the application in CubeMX, and I still get the same HardFault when flashing the application normally.

My questions are:

Could this issue be caused by loading the application into RAM instead of flashing it?
Is there a missing RIF configuration step in the tutorial that would allow the CPU to access this memory region?
Or does this indicate a model generation issue (e.g., invalid ATON address, misaligned vector, unsupported layer configuration, etc.)?