This tutorial begins with STM32CubeMX and demonstrates an AI application for the STM32N6-DK board.
This example explains how to load an application from external flash memory, execute an AI model inference stored in external flash, and display the inference output via a serial interface.
To achieve this, the tutorial uses an example model from the ST model zoo and the STM32CubeMX package: X-CUBE-AI.
This tutorial uses the STM32N6570-DK board. You also need a USB Type-C® cable to program the board.
You can create the project in STM32CubeMX or use the STM32CubeMX embedded in STM32CubeIDE.
The following software and versions were used:
The STM32N6 microcontroller does not include internal flash memory. Therefore, to retain the application after power-off, external flash memory is typically used to store the binaries initially. Based on this, several design templates are provided to allow the user to copy the application from flash to RAM, either entirely or in multiple stages.
This tutorial is based on the FSBL LRUN template.
The five design templates available in the STM32CubeN6 package are described below.
The FSBL binary is initially stored in the external memory of the STM32N6 board. It is copied by the boot ROM at power-on into the internal SRAM and executed from there once the boot ROM finishes its task.
Two binaries, the FSBL and the application (Appli), are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes its task, the FSBL executes; after performing clock and system configuration, it copies the application binary into internal SRAM. When done, the application starts and runs.
Two binaries, the FSBL and the application (Appli), are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes, the FSBL executes; after system configuration, it maps the external memory (containing the application binary) into the memory space for XIP (execute in place). When the FSBL completes, the application executes directly from external memory.
Three binaries, the FSBL, the secure application (AppliSecure), and the nonsecure application (AppliNonSecure) are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. After boot ROM execution, the FSBL executes. It configures the system and then copies both the secure and nonsecure application binaries into internal SRAM. Once complete, the secure application runs first and configures isolation settings, followed by the execution of the nonsecure application.
Three binaries, the FSBL, the secure application, and the nonsecure application are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes, the FSBL executes; it performs clock and system configuration and then maps the external memory for execution. The secure application runs directly from external memory, configures isolation settings, and then jumps to the nonsecure application.
You can create a project in either the standalone STM32CubeMX application or the integrated STM32CubeMX version within STM32CubeIDE. The standard approach is to design the project in STM32CubeMX and then export it to STM32CubeIDE. This method is used in this tutorial.
Begin your project by clicking Access to board selector and selecting STM32N6570-DK, which corresponds to the STM32N6 Discovery kit board.
When asked to "Initialize all peripherals with their default Mode", click [No].
While some peripherals may be enabled by default, remove any that are unnecessary to avoid pin conflicts.
Select "Secure domain only" and click [OK]:
We can proceed with the configuration.
Navigate to "Pinout & Configuration."
Enable the CPU ICACHE and CPU DCACHE.
Enable CACHEAXI.
Ensure that ICACHE is disabled.
As shown in the image above, a pin conflict warning is displayed for PA8 in the RCC section. This pin is assigned to both RCC and LCD. Since this tutorial does not use the LCD, change the PA8 pin function from LTDC_B6 to RCC_MCO_1 in the "Pinout View."
To ensure that the code executes correctly, the green and red LEDs indicate the status of the FSBL and the application, respectively. Although the red LED is preconfigured as LED2, it is not set as an output by default. In the pinout view, the green LED corresponds to pin PO1, and the red LED corresponds to pin PG10.
In the "Pinout View", configure PO1 and PG10 as GPIO_Output.
Under "Pin Context Assignment", assign PO1 to the FSBL and PG10 to the Application.
Other parameters related to these pins can be left at their default values.
To enable overdrive mode, where the CPU operates at its maximum frequency (as shown below in the table from the STM32N6 datasheet), the EXT_SMPS_MODE pin must be configured. According to the STM32N6570-DK User Manual, this corresponds to pin PF4.
Therefore, configure PF4 as GPIO_Output and edit its configuration as follows:
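As an illustration only (the actual settings are made in the CubeMX GPIO panel and end up in the generated MX_GPIO_Init()), the PF4 initialization roughly amounts to the following sketch, assuming the pin is configured as a push-pull output driven high at startup:
/* Sketch: approximate equivalent of the CubeMX-generated PF4 configuration */
GPIO_InitTypeDef GPIO_InitStruct = {0};

__HAL_RCC_GPIOF_CLK_ENABLE();

/* Drive EXT_SMPS_MODE (PF4) high so that overdrive mode can be entered */
HAL_GPIO_WritePin(GPIOF, GPIO_PIN_4, GPIO_PIN_SET);

GPIO_InitStruct.Pin   = GPIO_PIN_4;
GPIO_InitStruct.Mode  = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull  = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_LOW;
HAL_GPIO_Init(GPIOF, &GPIO_InitStruct);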
Additionally, the maximum CPU frequency in overdrive mode requires setting the "Power Regulator Voltage Scale" to "0." Ensure that this configuration is applied in the RCC section.
Since intermediate activations during inference are stored in RAM, enable the RAM controllers accordingly.
You can deactivate everything in this part to avoid potential conflicts:
First, deactivate everything that CubeMX enables by default but that is not used here, in this case:
As shown in the image below from the STM32N6570-DK user manual, the MCU is connected to the octo-SPI flash memory via the XSPI (extended-SPI) interface. Thus, XSPI2 is configured under the "Connectivity" section.
In XSPIM, select the operation mode as shown:
Then, configure XSPI2 as shown below:
The serial interface USART1 (PE5/PE6), which supports the bootloader, is directly accessible as a Virtual COM port on the PC when connected via the STLINK-V3EC USB connector (CN6). We use this interface to redirect the ‘printf‘ output, enabling easy debugging through a serial terminal. Enable USART1 for the Application runtime context, set the mode to "Asynchronous", and configure the baud rate to 115200 bit/s.
As you can see below, there is a conflict shown in orange. To fix it, change the pin PA11 to USART1_CTS:
To redirect ‘printf‘, additional code needs to be added to your project. This is done later.
Nothing in this section is used, so you can deactivate everything that was enabled by default.
To ensure boot security, enable "BSEC" for the FSBL. Additionally, since the NPU is used in the application, activate "RIF" for the Application. Check the boxes as shown below:
In the RIF panel of CubeMX, make sure that the NPU is checked. Depending on your version, the NPU can be either in Peripherals (RISUP) or Domains (RIMU).
With the serial interface already configured, you can now set up the external memory loader. Start by selecting [Load and Run] under "Selection of Boot System". The remaining parameters depend on the size of the generated binaries.
If you are unsure of the sizes, you can first generate the binaries using estimated values, then adjust the configuration accordingly and regenerate.
For this project, the FSBL binary is approximately 65 KB, and the Appli binary is around 295 KB. The board’s memory map is shown below.
Note two key addresses in this map: the secure RAM block starts at address ‘0x34000000‘, and the octo-SPI flash (interfaced by XSPI2) starts at address ‘0x70000000‘.
Upon power-up, the boot ROM is executed first. After that, the FSBL (stored at the beginning of flash) is copied into RAM and executed. The FSBL then copies the Appli binary from flash to RAM and executes it. Thus, the FSBL binary is stored at ‘0x70000000‘.
The Appli binary should be placed at an address offset greater than the FSBL size. Choosing an offset of ‘0x00100000‘ (1 MB) provides ample space. The code size corresponds to the Appli binary size; ‘0x0004BAF0‘ (310 KB) offers a suitable margin.
By default, "Memory 1" is the source memory, corresponding to the XSPI2 interface. The destination address should be set to the start of the secure RAM block, where the Appli binary is loaded.
The X-CUBE-AI middleware is used to generate application code for running neural network inferences. Enable it in the Application context and select "Application Template" as the application type.
When asked if you want to automatically fix peripherals and clocks, click [No].
In this tutorial, the object detection model from the ST model zoo is used (link). Get the quantized .tflite model.
For your information: you can find scripts to retrain models, example deployment applications, and much more for different use cases (image classification, object detection, audio detection, etc.) in our GitHub model zoo services:
GitHub: AI model zoo services for STM32 devices
Select "TFLite" and browse for your model. Additionally, select the profile "n6-allmems-03" if not already used by default.
In the "Advanced Settings" (accessible via the blue icon above "Show Graph"), you can view the memory pool used by X-CUBE-AI. This pool stores the model’s fixed weights in flash and its activations in RAM. The OctoFlash pool begins at address ‘0x71000000‘, so after generating the model weights image, it should be downloaded to this address. The image generation process is explained later in this tutorial.
Configure the clock according to the maximum supported frequencies (shown earlier in the overdrive mode section).
High-speed OTP optimization is not enabled for XSPI. Therefore, the XSPI2 clock must be reduced. Configure IC3 with PLL4 as input and a prescaler of 1, then set IC3 as the input for the XSPI2 clock multiplexer. You should have this:
At this point, the CubeMX design and configuration are complete. In the "Project Manager" section, click [Generate Code] to export your project. Ensure that both FSBL and Appli are included and select STM32CubeIDE as the target toolchain.
After exporting your project from CubeMX, you will have the following structure. There are two nested projects: one for the FSBL and one for the Appli. The "Drivers" folder, which includes CMSIS and HAL drivers, is located outside the global project, and both nested projects access it by including the appropriate headers. The same applies to the ‘Middlewares‘ folder.
Now, you should add code to the main functions of each project to complete certain initializations, use the LEDs to indicate successful execution, and validate the outputs. In the FSBL, to turn on the green LED, add the function call as shown in the image below.
Turn on LED1 (the green LED) after the FSBL initialization in main.c:
HAL_GPIO_WritePin(GPIOO, GPIO_PIN_1, GPIO_PIN_SET);
Additionally, in the FSBL main code, the overdrive mode selection pin must be set high before configuring the system clock. Therefore, make sure to initialize the corresponding GPIO before the system clock configuration, as shown in the image below:
MX_GPIO_Init(); /* Drives PF4 (EXT_SMPS_MODE) high before the system clock configuration */
HAL_Delay(1);
Next, go to the Appli project. At the top of ‘main.c‘, in the private define section, declare the following macro for the ‘__io_putchar‘ function.
#define PUTCHAR_PROTOTYPE int __io_putchar(int ch)
Then, in the ‘User Code 4‘ section at the bottom of ‘main.c‘, implement the following functions:
PUTCHAR_PROTOTYPE
{
  /* Send one character over USART1 */
  HAL_UART_Transmit(&huart1, (uint8_t *)&ch, 1, 0xFFFF);
  return ch;
}

int _write(int fd, char *ptr, int len)
{
  /* Send the whole buffer over USART1 */
  HAL_UART_Transmit(&huart1, (uint8_t *)ptr, len, HAL_MAX_DELAY);
  return len;
}
Within ‘main.c‘, in the X-CUBE-AI initialization function, add the following lines to enable the RAM sections that were previously initialized and enabled:
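/* Clear the SRAMSD (shutdown) bit so that these AXI SRAM banks are powered up and can hold the activations */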
RAMCFG_SRAM2_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM3_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM4_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM5_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM6_AXI->CR &= ~RAMCFG_CR_SRAMSD;
Important note: depending on the versions of CubeMX, X-CUBE-AI, and the STM32CubeN6 package, the contents of the generated MX_X_CUBE_AI_Init() function may differ. Edit it if needed so that it matches exactly what is shown in the image above.
Add the following line to the RIF (SystemIsolation_Config) function to complete the slave configuration:
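/* Configure the NPU as a secure, privileged peripheral in the RIF */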
HAL_RIF_RISC_SetSlaveSecureAttributes(RIF_RISC_PERIPH_INDEX_NPU, RIF_ATTRIBUTE_PRIV | RIF_ATTRIBUTE_SEC);
Important note 2: make sure that the code above the added line is present in the generated code. If it is not, this may indicate an issue with the RIF configuration: go back to the Security configuration in CubeMX and verify that the NPU is selected in the RIF tab.
Finally, you need to handle the input and output buffers to feed your neural network and retrieve the inferred results. Since both the NPU and the MCU have their own cache memories, these must be properly managed before invoking the low-level inference function to ensure that both units access up-to-date data in memory and that any results are stored in a mutually accessible region. Therefore, you must perform cache clean and invalidate operations.
The ‘MX_X_CUBE_AI_Process‘ function below implements the following features: it retrieves the input and output buffer information and prints the buffer addresses and sizes, fills the input buffer with a constant value, cleans and invalidates the caches, runs ten inferences, and converts the output buffer contents to float values before printing them.
The last functionality is needed because the output buffer memory is always allocated as an array of 8-bit integers, whereas this model's output values are 32-bit floats. Reading a single 8-bit element from the array therefore does not yield a valid value: four consecutive 8-bit values must be concatenated and interpreted as a single 32-bit float. You can copy the ‘MX_X_CUBE_AI_Process‘ function from the script below. After pasting it, press [Ctrl+I] to autoindent the code properly.
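The function uses ‘buffer_in‘, ‘buffer_out‘, ‘buff_in_len‘, and ‘buff_out_len‘ as file-scope variables. If they are not already declared in your ‘main.c‘ (for example, in the private variables user section), a minimal declaration, inferred from how they are used below, could be:
/* Assumed declarations for the buffers used by MX_X_CUBE_AI_Process() */
static uint8_t *buffer_in;    /* start address of the network input buffer  */
static uint8_t *buffer_out;   /* start address of the network output buffer */
static uint32_t buff_in_len;  /* input buffer size in bytes                 */
static uint32_t buff_out_len; /* output buffer size in bytes                */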
void MX_X_CUBE_AI_Process(void)
{
  /* USER CODE BEGIN 6 */
  LL_ATON_RT_RetValues_t ll_aton_rt_ret = LL_ATON_RT_DONE;
  const LL_Buffer_InfoTypeDef * ibuffersInfos = NN_Interface_Default.input_buffers_info();
  const LL_Buffer_InfoTypeDef * obuffersInfos = NN_Interface_Default.output_buffers_info();

  buffer_in  = (uint8_t *)LL_Buffer_addr_start(&ibuffersInfos[0]);
  buffer_out = (uint8_t *)LL_Buffer_addr_start(&obuffersInfos[0]);

  // Printing buffer start and end addresses.
  printf("Input buffer: offset start = %lu, \n \r offset end = %lu \n \r", ibuffersInfos->offset_start, ibuffersInfos->offset_end);
  printf("Output buffer: offset start = %lu, \n \r offset end = %lu \n \r", obuffersInfos->offset_start, obuffersInfos->offset_end);

  // Getting buffer size and printing it.
  buff_in_len = ibuffersInfos->offset_end - ibuffersInfos->offset_start;
  buff_out_len = obuffersInfos->offset_end - obuffersInfos->offset_start;
  printf("Buffer input size = %lu \n\r Buffer output size = %lu \n\r", buff_in_len, buff_out_len);

  uint8_t val = 10;

  LL_ATON_RT_RuntimeInit();

  // Run 10 inferences
  for (int inferenceNb = 0; inferenceNb < 10; ++inferenceNb) {
    /* ------------- */
    /* - Inference - */
    /* ------------- */

    /* Pre-process and fill the input buffer */
    // Fill input buffer with constant data.
    for (uint32_t i = 0; i < buff_in_len; i++) {
      buffer_in[i] = val;
    }

    // Clean and invalidate MCU DCache and invalidate NPU cache.
    mcu_cache_clean_invalidate_range(buffer_in, buffer_in + buff_in_len);
    npu_cache_invalidate();

    // Check that input buffer was properly assigned with "val".
    printf("Buffer[1] = %d \n \r", buffer_in[1]);
    printf("Buffer[1000] = %d \n \r", buffer_in[1000]);
    printf("Buffer[10000] = %d \n \r", buffer_in[10000]);
    //_pre_process(buffer_in);

    /* Perform the inference */
    LL_ATON_RT_Init_Network(&NN_Instance_Default); // Initialize network instance
    do {
      // Execute first/next epoch block
      ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&NN_Instance_Default);
      // Wait for event if required
      if (ll_aton_rt_ret == LL_ATON_RT_WFE) {
        LL_ATON_OSAL_WFE();
      }
    } while (ll_aton_rt_ret != LL_ATON_RT_DONE);

    // Post-process the output buffer
    // Invalidate CPU cache if needed
    // Convert int8 to float. Buffer is int8, but model's output is float.
    uint8_t aux[4];
    float_t *conver;
    for (uint32_t i = 0; i < buff_out_len; i += 4) {
      aux[0] = buffer_out[i];
      aux[1] = buffer_out[i+1];
      aux[2] = buffer_out[i+2];
      aux[3] = buffer_out[i+3];
      conver = (float_t *)aux;
      printf("Out %lu = %f \n \r", i, *conver);
    }
    //_post_process(buffer_out);

    LL_ATON_RT_DeInit_Network(&NN_Instance_Default);

    /* -------------------- */
    /* - End of Inference - */
    /* -------------------- */
  }
  LL_ATON_RT_RuntimeDeInit();
  /* USER CODE END 6 */
}
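A side note on the int8-to-float conversion above: casting a ‘uint8_t‘ array to a ‘float_t *‘ works here, but a ‘memcpy‘-based conversion avoids any alignment or strict-aliasing concerns. A minimal equivalent of the conversion loop (a sketch, not part of the generated code; it requires ‘#include <string.h>‘ at the top of ‘main.c‘) could be:
for (uint32_t i = 0; i < buff_out_len; i += 4) {
  float out_val;
  /* Reassemble four consecutive bytes into one 32-bit float */
  memcpy(&out_val, &buffer_out[i], sizeof(out_val));
  printf("Out %lu = %f \n \r", i, out_val);
}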
This model has a three-dimensional output buffer. To validate the output, only the first dimension is used in this example. Configure a serial terminal, such as Tera Term, with the settings shown in the image below. You can then observe all the output buffer values.
You can use the Python script provided in the annex at the end of this article to check whether the quantized and optimized model running on your MCU produces the same outputs as the original model running on your PC.
To ensure that the application was correctly copied, that the peripheral initialization succeeded, and that the main loop is not stuck during inference, add the following two lines of code to blink LED2:
HAL_GPIO_TogglePin(GPIOG, GPIO_PIN_10);
HAL_Delay(200);
Your project is now complete. You may proceed to build it. Normally, there should be no errors. However, if you encounter dependency errors such as missing external sources, you can manually add them inside the nested projects. For instance, if you get errors indicating that LL functions are undeclared, it means that the compiler cannot locate the LL sources in the global middleware folder.
In that case, import the required source files and ensure that the folder is marked as a source location in the project settings.
To import a folder, right-click on the project, then select "Import" → "General" → "File System". Choose the folder containing the missing source files, and filter to import only ‘.c‘ files. Then, right-click the project again, go to "Properties" → "C/C++ General" → "Paths and Symbols." Under the "Source Location" tab, add the folder you just imported.
Look at this question from the ST Community product forums for additional troubleshooting: Solved: Linker garbage problem when deploying AI models on... - STMicroelectronics Community
After building, you will find the binaries in their respective "Debug" folders.
Embedded systems that implement security features such as TrustZone®, as in the STM32N6, require firmware authentication. The STM32-SignTool is a key utility that ensures a secure platform by signing binary images using ECC keys. These signed binaries are used during the STM32 secure boot process to establish a trusted boot chain. This process ensures authentication and integrity checks of the loaded images.
In short, you must sign the generated binaries before flashing them to the N6.
The Signing Tool executable is located in your STM32CubeProgrammer installation directory (by default: C:/Program Files/STMicroelectronics/STM32Cube/STM32CubeProgrammer/bin). To run the commands shown below from any directory, you can add this ‘bin‘ folder to your environment variables.
Otherwise, run the command directly from the ‘bin‘ folder, specifying the full path to the binary file:
STM32_SigningTool_CLI.exe -bin <your_project>.bin -nk -of 0x80000000 -t fsbl -o <your_project>-trusted.bin -hv 2.3 -dump <your_project>-trusted.bin
In our case, you want to sign two files: the FSBL binary and the Appli binary.
You should then end up with two new signed files, the ‘-trusted‘ versions of the FSBL and Appli binaries.
For reference, the terminal output should look like:
In your project folder, at the root, you can find a file named network_atonbuf.xSPI2.raw, which contains the weights of your model. This file is produced by X-CUBE-AI; more precisely, it is the result of the ST Edge AI Core command that runs behind it:
stedgeai generate --model Model_File.tflite --target stm32n6 --st-neural-art
Documentation: https://stedgeai-dc.st.com/assets/embedded-docs/index.html
In our case, we want to rename and convert this file to network_data.xSPI2.bin:
cp network_atonbuf.xSPI2.raw network_data.xSPI2.bin
Next, add the path to ‘arm-none-eabi-objcopy‘ to your environment variables. You can find it in your STM32CubeIDE installation, typically under: C:/ST/STM32CubeIDE_<version>/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.13.3.rel1.win32_1.0.0.202411081344/tools/bin. This tool allows you to convert the ‘.bin‘ file into a ‘.hex‘ file with a specified flash memory address:
arm-none-eabi-objcopy -I binary network_data.xSPI2.bin --change-addresses 0x71000000 -O ihex network_data.hex
You now have the hexadecimal file containing fixed weights and parameters ready for flashing.
We now have all three image files: the FSBL, the application, and the model weights.
Open STM32CubeProgrammer and ensure that your ST-LINK configuration matches the image below and confirm that the firmware is up to date.
Set your STM32N6570-DK board to development boot mode (Boot1 and Boot2 to the right) and click [Connect] in STM32CubeProgrammer.
Boot switches on the DK board:
In the image, the board is set to boot from flash.
The FSBL and the application are binary files, so their flashing addresses must be specified manually.
For example: the FSBL binary is flashed at ‘0x70000000‘ and the Appli binary at ‘0x70100000‘ (the 1 MB offset configured earlier), while the weights ‘.hex‘ file already embeds its target address (‘0x71000000‘).
To flash the images, open the "Erasing & programming" panel in STM32CubeProgrammer, select each file, specify the start address for the ‘.bin‘ files, and click [Start Programming].
At boot, the boot ROM loads the FSBL from flash to internal RAM. The FSBL then loads the application from flash and executes it.
Now, if you connect the board with both switches set to the left, the application is loaded and executed from flash memory. Here is what you should observe in a terminal:
Here are some comments:
The input and output sizes can be understood by opening the model ‘.tflite‘ file that was downloaded earlier in a model viewer such as Netron.
X-CUBE-AI allocates both input and output buffers as INT8 tables. For the model used in this tutorial, the input buffer data type is INT8, so the allocated memory size corresponds directly to its dimensions (192 x 192 x 3 = 110592 bytes).
However, the output buffers' data type is FLOAT32, which means the allocated memory size is four times greater. For example, for the third output buffer in Netron - which is the first one listed by LL_ATON_Output_Buffers_Info_Default - with dimensions (3875, 2), 3875 x 2 x 4 = 31000 bytes were allocated. Then, as you can see in MX_X_CUBE_AI_Process, the INT8 output buffer values must be "cast" to FLOAT32; therefore, a float pointer is used to point at the beginning of an INT8 table of length 4.
In the serial terminal, only this example output buffer is printed out; the Python script prints the same buffer.
By opening the ‘network.c‘ file in Appli/X-Cube-AI/App, you can find information about the outputs of the model generated by X-CUBE-AI: in the function ‘LL_Buffer_InfoTypeDef *LL_ATON_Output_Buffers_Info_Default‘, the three outputs are listed in order, and the first one corresponds to the size 3875 * 2 (that is, the last one in Netron).
If you have carefully followed the steps in this tutorial, the green LED should now be turned on, and the red LED should be blinking. The blinking interval reflects the sum of the user-defined delay, the inference execution time, and the time required to print the outputs to the serial terminal. Furthermore, the output values observed on the terminal should match those produced by executing the reference Python script provided in the annex.
This tutorial aimed to provide a minimal yet functional application that enables users, regardless of expertise level, to develop STM32N6 edge AI projects with a clear and structured workflow.
import numpy as np
import tensorflow as tf
# Path to your TFLite model file
MODEL_PATH = "ssd_mobilenet_v2_fpnlite_035_192_int8.tflite"
# The constant value to fill into the input tensor
FILL_VALUE = 10
# Load TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
# Get input and output tensor details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# You can print these to inspect:
print("Input details:", input_details)
print("Output details:", output_details)
# Getting model's input details
input_index = input_details[0]['index']
input_shape = input_details[0]['shape']
input_dtype = input_details[0]['dtype']
# Create input data filled with the constant value
input_data = np.full(input_shape, FILL_VALUE, dtype=input_dtype)
# Set the tensor
interpreter.set_tensor(input_index, input_data)
# Run inference
interpreter.invoke()
# Retrieve output tensors
outputs = []
for out in output_details:
    output_data = interpreter.get_tensor(out['index'])
    outputs.append(output_data)
# Print first buffer outputs
for dim in outputs[0]:
    for i, val in enumerate(dim):
        print(f"Output {i} = {val}")
print("\n\n\n\n")