How to build an AI application from scratch on the NUCLEO-N657X0-Q using STM32CubeMX

Julian E. · ‎2025-08-14

Introduction

This tutorial begins with STM32CubeMX and demonstrates an AI application for the NUCLEO-N657X0-Q board.

This example explains how to load an application from external flash memory, execute an AI model inference stored in external flash, and display the inference output via a serial interface.

To achieve this, the tutorial uses an example model from the ST model zoo and the STM32CubeMX package: X-CUBE-AI.

The .ioc is attached at the end of this article.

Introduction
1. Hardware and software prerequisites
2. STM32CubeN6 package
2.1 Types of templates
2.1.1 Basic template
2.1.2 FSBL Load and Run (LRUN) template
2.1.3 FSBL execute in place (XIP) template
2.1.4 Isolation LRUN template
2.1.5 Isolation XIP template
3. Design in STM32CubeMX
3.1 Create a project
3.2 Configure peripherals
3.2.1 System core
3.2.2 LEDs
3.2.3 Overdrive mode
3.2.4 Analog
3.2.5 Connectivity
3.2.5.1 Configure serial communication with external memory
3.2.5.2 Configure serial communication for debugging
3.2.6 Multimedia
3.2.7 Security
3.2.8 Middleware and software packs
3.2.8.1 Configure external memory manager middleware
3.2.8.2 Configure the X-CUBE-AI middleware
3.3 Configure the clocks
3.4 Project manager
4. STM32CubeIDE
4.1 Add code in the FSBL
4.2 Add code in Appli
4.3 Build
5. Deploying the application
5.1 Sign binary files with Signing Tool
5.2 Generate model weights binary image
5.3 Flash binaries with STM32CubeProgrammer
6. Running the application
Conclusion
Related links
Annexes

1. Hardware and software prerequisites

This tutorial uses the NUCLEO-N657X0-Q board. You also need a USB Type-C® cable to program the board.

You can create the project in STM32CubeMX or use the STM32CubeMX embedded in STM32CubeIDE.

The following software and versions were used:

STM32CubeIDE: Version 1.18.1
STM32CubeMX: Version 6.15.0
STM32CubeProgrammer: Version 2.19.0
X-CUBE-AI plugin: Version 10.2.0
ST32Cube FW_N6: Version 1.1.1
STM32Cube MCU package for STM32N6 series: Version 1.2.0

2. STM32CubeN6 package

The STM32N6 microcontroller does not include internal flash memory. Therefore, to retain the application after power-off, external flash memory is typically used to store the binaries initially. Based on this, several design templates are provided to allow the user to copy the application from flash to RAM, either entirely or in multiple stages.

This tutorial is based on the FSBL LRUN template.

2.1 Types of templates

The five design templates available in the CubeN6 package are described below.

2.1.1 Basic template

The FSBL binary is initially stored in the external memory of the STM32N6 board. It is copied by the boot ROM at power-on into the internal SRAM and executed from there once the boot ROM finishes its task.

2.1.2 FSBL Load and Run (LRUN) template

Two binaries, the FSBL and the application (Appli), are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes its task, the FSBL executes after performing clock and system configuration, it copies the application binary into internal SRAM. When done, the application starts and runs.

2.1.3 FSBL execute in place (XIP) template

Two binaries, the FSBL and the application (Appli), are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes, the FSBL executes after system configuration, it maps the external memory (containing the application binary) into the memory space for XIP (execute in place). When the FSBL completes, the application executes directly from external memory.

2.1.4 Isolation LRUN template

Three binaries, the FSBL, the secure application (AppliSecure), and the nonsecure application (AppliNonSecure) are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. After boot ROM execution, the FSBL executes. It configures the system and then copies both the secure and nonsecure application binaries into internal SRAM. Once complete, the secure application runs first and configures isolation settings, followed by the execution of the nonsecure application.

2.1.5 Isolation XIP template

Three binaries, the FSBL, the secure application, and the nonsecure application are initially stored in the external memory of the STM32N6 board. At power-on, the boot ROM copies the FSBL binary into internal SRAM. Once the boot ROM completes, the FSBL executes it performs clock and system configuration and then maps the external memory for execution. The secure application runs directly from external memory, configures isolation settings, and then jumps to the nonsecure application.

3. Design in STM32CubeMX

3.1 Create a project

You can create a project in either the standalone STM32CubeMX application or the integrated STM32CubeMX version within STM32CubeIDE. The standard approach is to design the project in STM32CubeMX and then export it to STM32CubeIDE. This method is used in this tutorial.

Begin your project by clicking Access to board selector and selecting NUCLEO-N657X0-Q, which corresponds to the STM32N6 Discovery kit board.

When asked to "Initialize all peripherals with their default Mode", click [No].

While some peripherals may be enabled by default, remove any that are unnecessary to avoid pin conflicts.

Select "Secure domain only" and click [OK]:

When prompted about board project options, you can select LD1, LD2, and LD3 as shown below:

We can proceed with the configuration.

3.2 Configure peripherals

Navigate to "Pinout & Configuration."

3.2.1 System core

Enable the CPU ICACHE and CPU DCACHE.

Enable CACHEAXI.

Ensure that ICACHE is disabled.

3.2.2 LEDs

To know if the code is running correctly, LED blue and LED red is used to flag the execution state of respectively the FSBL and the Application. The BSP package allows initializing and using these LEDs without the need to configure them directly. To do such, enable BSP for both FSBL and Application runtime contexts and select all LEDs, as shown in the following image:

3.2.3 Overdrive mode

To enable overdrive mode, where the CPU operates at its maximum frequency (as shown below in the table from the STM32N6 datasheet), the EXT_SMPS_MODE pin must be configured. According to the NUCLEO-N657X0-Q user manual, this corresponds to pin PB12.

Therefore, pin PB12 has to be set as GPIO_Output:

And configured as follow in the GPIO menu:

Additionally, the maximum CPU frequency in overdrive mode requires setting the "Power Regulator Voltage Scale" to "0." Ensure that this configuration is applied in the RCC section.

Since intermediate activations during inference are stored in RAM, enable the RAM controllers accordingly.

3.2.4 Analog

You can deactivate everything in this part to avoid potential conflict:

3.2.5 Connectivity

First, you can deactivate everything that is set by default by CubeMX that we do not use, and just keep:

LPUART1
XSPI2
XSPIM

3.2.5.1 Configure serial communication with external memory

As shown in the image below from the NUCLEO-N657X0-Q user manual, the MCU is connected to the octo-SPI flash memory via the XSPI (extended-SPI) interface. Thus, XSPI2 is configured under the "Connectivity" section.

In XSPIM, select the operation mode as shown:

Then, configure XSPI2 as shown below:

Fifo threshold: 4
Memory Size: 1 GBits
Delay Hold Quarter Cycle: Enabled

3.2.5.2 Configure serial communication for debugging

The serial interface LUSART1 (PE5/PE6), which supports the bootloader, is directly accessible as a Virtual COM port on the PC when connected via the STLINK-V3EC USB connector (CN6). We use this interface to redirect the ‘printf‘ output, enabling easy debugging through a serial terminal. Enable LUSART1 for the Application runtime context, set the mode to "Asynchronous", and configure the baud rate to 115200 bit/s

As you can see below, there is a conflict shown in orange. To fix it, change the pin PA11 to LUSART1_CTS:

To redirect ‘printf‘, additional code needs to be added to your project. This is done later.

3.2.6 Multimedia

We do not use anything, so you can deactivate everything that was set by default.

3.2.7 Security

To ensure boot security, enable "BSEC" for the FSBL. Additionally, since the NPU is used in the application, activate "RIF" for the Application. Check the boxes as shown below:

3.2.8 Middleware and software packs

3.2.8.1 Configure external memory manager middleware

With the serial interface already configured, you can now set up the external memory loader. Start by selecting [Load and Run] under "Selection of Boot System". The remaining parameters depend on the size of the generated binaries.
If you are unsure of the sizes, you can first generate the binaries using estimated values, then adjust the configuration accordingly and regenerate.

For this project, the FSBL binary is approximately 65 KB, and the Appli binary is around 295 KB. The board’s memory map is shown below.

Note two key addresses in this map: the secure RAM block starts at address ‘0x34000000‘, and the octo-SPI flash (interfaced by XSPI2) starts at address ‘0x70000000‘.

Upon power-up, the boot ROM is executed first. After that, the FSBL (stored at the beginning of flash) is copied into RAM and executed. The FSBL then copies the Appli binary from flash to RAM and executes it. Thus, the FSBL binary is stored at ‘0x70000000‘.
The Appli binary should be placed at an address offset greater than the FSBL size. Choosing an offset of ‘0x00100000‘ (1 MB) provides ample space. The code size corresponds to the Appli binary size; ‘0x0004BAF0‘ (310 KB) offers a suitable margin.

By default, "Memory 1" is the source memory, corresponding to the XSPI2 interface. The destination address should be set to the start of the secure RAM block, where the Appli binary is loaded.

3.2.8.2 Configure the X-CUBE-AI middleware

The X-CUBE-AI middleware is used to generate application code for running neural network inferences. Enable it in the Application context and select "Application Template" as the application type.

When asked if you want to automatically fix peripherals and clocks, click [No].

In this tutorial, the object detection model from the ST model zoo is used (link). Get the quantized .tflite model.

For your information: You can find script to retrain, deploy example application and much more for different use cases (image classification, object detection, audio detection etc.) on our GitHub model zoo services:

GitHub: AI model zoo services for STM32 devices

Select "TFLite" and browse for your model. Additionally, select the profile "n6-allmems-03" if not already used by default.

In the "Advanced Settings" (accessible via the blue icon above "Show Graph"), you can view the memory pool used by X-CUBE-AI. This pool stores the model’s fixed weights in flash and its activations in RAM. The OctoFlash pool begins at address ‘0x71000000‘, so after generating the model weights image, it should be downloaded to this address. The image generation process is explained later in this tutorial.

3.3 Configure the clocks

Configure the clock according to the maximum supported frequencies (shown earlier in the overdrive mode section).

High-speed OTP optimization is not enabled for XSPI. Therefore, the XSPI2 clock must be reduced. Configure IC3 with PLL4 as input and a prescaler of 1, then set IC3 as the input for the XSPI2 clock multiplexer. You should have this:

Finally, click on Resolve Clock Issues to solve the LPUART1 Source Mux issue.

3.4 Project manager

At this point, the CubeMX design and configuration are complete. In the "Project Manager" section, click [Generate Code] to export your project. Ensure that both FSBL and Appli are included and select STM32CubeIDE as the target toolchain.

4. STM32CubeIDE

After exporting your project from CubeMX, you will have the following structure. There are two nested projects: one for the FSBL and one for the Appli. The "Drivers" folder, which includes CMSIS and HAL drivers, is located outside the global project, and both nested projects access it by including the appropriate headers. The same applies to the ‘Middlewares‘ folder.

4.1 Add code in the FSBL

First of all, you should complement BSEC initialization. First, add the following constants into the private define area of stm32n6xx_hal_msp.c

#define HSLV_OTP 124
#define VDDIO3_HSLV_MASK (1<<15)

Then proceed to add the following piece of code into the HAL_XSPI_MspInit function.

BSEC_HandleTypeDef hbsec;
uint32_t fuse_data = 0;
/* Enable BSEC & SYSCFG clocks to ensure BSEC data accesses */
__HAL_RCC_BSEC_CLK_ENABLE();
__HAL_RCC_SYSCFG_CLK_ENABLE();
hbsec.Instance = BSEC;
if (HAL_BSEC_OTP_Read(&hbsec, HSLV_OTP, &fuse_data) != HAL_OK)
{
Error_Handler();
}
/* Set PWR configuration for IO speed optimization */
__HAL_RCC_PWR_CLK_ENABLE();
HAL_PWREx_EnableVddIO3();
HAL_PWREx_ConfigVddIORange(PWR_VDDIO3, PWR_VDDIO_RANGE_1V8);
/* Set SYSCFG configuration for IO speed optimization (clock already enabled)
*/
HAL_SYSCFG_EnableVDDIO3CompensationCell();
/* Enable the XSPI memory interface clock */
__HAL_RCC_XSPI2_CLK_ENABLE();

Now, you should add some code into the main functions of each project to use the LEDs to flag good execution. In the FSBL, to turn on the blue LED add functions call accordingly to the following image. Note that the initialization added by STM32CubeMX was commented out and this same piece of code was added under the User Area Code 2. This was done because the function to turn the LED blue on has to be called before booting the application. This is the only User Area Code available for such a purpose. Be aware that the section commented out may uncomment itself after STM32CubeMX modifications so be sure to comment it before rebuilding.

In FSBL main.c:

BSP_LED_Init(LED_BLUE);
BSP_LED_Init(LED_RED);
BSP_LED_Init(LED_GREEN);
BSP_LED_On(LED_BLUE);

Moreover, the Overdrive mode selection pin must be high before configuring the clock. Make sure to add the GPIO init function before the system clock initialization, as shown in the image below.

MX_GPIO_Init();
HAL_Delay(1);

4.2 Add code in Appli

Next, go to the Appli project. At the top of ‘main.c‘, in the private define section, declare the following macro for the ‘put_char‘ function.

4.2.png

#define PUTCHAR_PROTOTYPE int __io_putchar(int ch)

Then, in the ‘User Code 4‘ section at the bottom of ‘main.c‘, implement the following functions:

PUTCHAR_PROTOTYPE
{
HAL_UART_Transmit(&hlpuart1, (uint8_t *)ch, 1, 0xFFFF);
return ch;
}
int _write(int fd, char * ptr, int len){
HAL_UART_Transmit(&hlpuart1, (uint8_t *) ptr, len, HAL_MAX_DELAY);
return len;
}

Within ‘main.c‘, in the X-CUBE-AI initialization function, add the following lines (if not already present) to enable the RAM sections that were previously initialized and enabled:

4.2 - 3.png

RAMCFG_SRAM2_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM3_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM4_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM5_AXI->CR &= ~RAMCFG_CR_SRAMSD;
RAMCFG_SRAM6_AXI->CR &= ~RAMCFG_CR_SRAMSD;

Important note: Depending on the version of CubeMX, X-CUBE-AI, and the STM32N6Cube package, you may have different things in this MX_X_CubeAI_Init() function. Edit it if needed to have exactly what is in the image above.

Add the following line to the RIF (SystemIsolation_Config) function to complete the slave configuration:

4.2 - 4.png

HAL_RIF_RISC_SetSlaveSecureAttributes(RIF_RISC_PERIPH_INDEX_NPU, RIF_ATTRIBUTE_PRIV | RIF_ATTRIBUTE_SEC);

Important note 2: Make sure that the part of the code above the line we add is present in the generated code. If not, it may indicate an issue in the RIF configuration. Make sure to look back at the Security configuration in CubeMX and that the NPU is selected in the RIF tab of CubeMX.

Finally, you need to handle the input and output buffers to feed your neural network and retrieve the inferred results. Since both the NPU and MCU have their own cache memories, these must be properly managed before invoking the low-level inference function. This is to ensure that both units access up-to-date data in memory and that any results are stored in a mutually accessible region. Therefore, you must perform a cache clean and invalidate operations.

The ‘MX_X_CUBE_AI_Process‘ function below implements the following features:

Calculating buffer sizes.
Filling the input buffer with constant values.
Cleaning and invalidating the MCU DCACHE and invalidating the NPU cache prior to inference.
Converting integer table values to float and printing them via UART.

The last functionality is implemented because memory is always allocated as an int table, but for this model, the output format is float. An int is 8 bits, and a float is 32 bits. Therefore, reading a single element from the array does not yield 19 valid values. Instead, four consecutive 8-bit values must be concatenated and interpreted as a single 32-bit float. You can copy the ‘MX_X_CUBE_AI_Process‘ function from the script below. After pasting it, press [Ctrl+I] to autoindent the code properly.

// Initialize the two following variables at the begining of the file
uint32_t buff_in_len, buff_out_len;

void MX_X_CUBE_AI_Process(void)
{
    /* USER CODE BEGIN 6 */
	LL_ATON_RT_RetValues_t ll_aton_rt_ret = LL_ATON_RT_DONE;
	const LL_Buffer_InfoTypeDef * ibuffersInfos = NN_Interface_Default.input_buffers_info();
	const LL_Buffer_InfoTypeDef * obuffersInfos = NN_Interface_Default.output_buffers_info();
	buffer_in = (uint8_t *)LL_Buffer_addr_start(&ibuffersInfos[0]);
	buffer_out = (uint8_t *)LL_Buffer_addr_start(&obuffersInfos[0]);
	// Printing buffer start and end addresses.
	printf("Input buffer: offset start = %lu, \n \r offset end = %lu \n \r",ibuffersInfos->offset_start,ibuffersInfos->offset_end);
	printf("Output buffer: offset start = %lu, \n \r offset end = %lu \n \r",obuffersInfos->offset_start,obuffersInfos->offset_end);
	// Getting buffer size and printing it.
	buff_in_len = ibuffersInfos->offset_end - ibuffersInfos->offset_start;
	buff_out_len = obuffersInfos->offset_end - obuffersInfos->offset_start;
	printf("Buffer input size = %lu \n\r Buffer output size = %lu \n\r", buff_in_len, buff_out_len);
	uint8_t val = 10;
	LL_ATON_RT_RuntimeInit();
	// Run 10 inferences
	for (int inferenceNb = 0; inferenceNb < 10; ++inferenceNb) {
		/* ------------- */
		/* - Inference - */
		/* ------------- */
		/* Pre-process and fill the input buffer */
		// Fill input buffer with constant data.
		for(uint32_t i = 0; i < buff_in_len; i++){
			buffer_in[i] = val;
		}
		// Clean and invalidate MCU DCache and invalidate NPU cache.
		mcu_cache_clean_invalidate_range(buffer_in, buffer_in + buff_in_len);
		npu_cache_invalidate();
		// Check that input buffer was properly assigned with "val".
		printf("Buffer[1] = %d \n \r", buffer_in[1]);
		printf("Buffer[1000] = %d \n \r", buffer_in[1000]);
		printf("Buffer[10000] = %d \n \r", buffer_in[10000]);
		//_pre_process(buffer_in);
		/* Perform the inference */
		LL_ATON_RT_Init_Network(&NN_Instance_Default); // Initialize network instance
		do {
			// Execute first/next epoch block
			ll_aton_rt_ret = LL_ATON_RT_RunEpochBlock(&NN_Instance_Default);
			// Wait for event if required
			if (ll_aton_rt_ret == LL_ATON_RT_WFE) {
				LL_ATON_OSAL_WFE();
			}
		} while (ll_aton_rt_ret != LL_ATON_RT_DONE);
		// Post-process the output buffer
		// Invalidate CPU cache if needed
		// Convert int8 to float. Buffer is int8, but model's output is float.
		uint8_t aux[4];
		float_t *conver;
		for(uint32_t i = 0; i < buff_out_len; i += 4){
			aux[0] = buffer_out[i];
			aux[1] = buffer_out[i+1];
			aux[2] = buffer_out[i+2];
			aux[3] = buffer_out[i+3];
			conver = (float_t *)aux;
			printf("Out %lu = %f \n \r", i, *conver);
		}
		//_post_process(buffer_out);
		LL_ATON_RT_DeInit_Network(&NN_Instance_Default);
		/* -------------------- */
		/* - End of Inference - */
		/* -------------------- */
	}
	LL_ATON_RT_RuntimeDeInit();
    /* USER CODE END 6 */
}

You most likely have to enable use float with printf as shown below:

This model has a 3-dimensional output buffer. To validate the output, only the first dimension is used in this example. Configure a serial terminal, such as Tera Term, with the settings shown in the image below. You are able to observe all the output buffer values.

You can use the Python script provided in the annex (provided at the end of this article). This is used to compare whether the quantized and optimized model running on your MCU produces the same outputs as the original model running on your PC.

To ensure that the application was correctly copied, that the peripheral initialization succeeded, and that the main loop is not stuck during inference, add the following two lines of code to blink the red LED:

 BSP_LED_Toggle(LED_RED);
 HAL_Delay(200);

4.3 Build

Your project is now complete. You may proceed to build it. Normally, there should be no errors. However, if you encounter dependency errors such as missing external sources, you can manually add them inside the nested projects. For instance, if you get errors indicating that LL functions are undeclared, it means that the compiler cannot locate the LL sources in the global middleware folder.

In that case, import the required source files and ensure that the folder is marked as a source location in the project settings.

To import a folder, right-click on the project, then select "Import" → "General" → "File System". Choose the folder containing the missing source files, and filter to import only ‘.c‘ files. Then, right-click the project again, go to "Properties" → "C/C++ General" → "Paths and Symbols." Under the "Source Location" tab, add the folder you just imported.

Look at this question from the ST Community product forums for additional troubleshooting: Solved: Linker garbage problem when deploying AI models on... - STMicroelectronics Community

After building, you will find the binaries in their respective "Debug" folders.

5. Deploying the application

5.1 Sign binary files with Signing Tool

Embedded systems that implement security features such as TrustZone®, as in the STM32N6, require firmware authentication. The STM32-SignTool is a key utility that ensures a secure platform by signing binary images using ECC keys. These signed binaries are used during the STM32 secure boot process to establish a trusted boot chain. This process ensures authentication and integrity checks of the loaded images.

In short, you must sign the generated binaries before flashing them to the N6.

The Signing Tool executable is located in your STM32CubeProgrammer installation directory (by default: C:/Program Files/STMicroelectronics/STM32Cube/STM32Cube Programmer/bin). To run the commands shown below, you can add the ‘bin‘ folder to your environment variables so that you can execute them from any directory.
Otherwise, run the command directly from the bin folder, specifying the full path to the binary file. Starting from STM32CubeProgrammer version 2.21.0, a new parameter is needed at the end "-align". Please make sure you use the updated command, as follows:

STM32_SigningTool_CLI.exe -bin <your_project>.bin -nk -of 0x80000000 -t fsbl -o <your_project>-trusted.bin -hv 2.3 -dump <your_project>-trusted.bin -align

In our case, you want to sign two files:

Your_Project_Folder/FSBL/Debug/<Your_Project_Name>_FSBL.bin
Your_Project_Folder/Appli/Debug/<Your_Project_Name>_Appli.bin

And you should end up with 2 new files:

<Your_Project_Name>_FSBL-trusted.bin
<Your_Project_Name>_Appli-trusted.bin

For reference, the terminal output should look like:

5.2 Generate model weights binary image

In your project folder, you can find at the root a file named network_data.xSPI2.raw that contains the weights of your model. This file results from X-CUBE-AI and in particular is the result of the ST Edge AI Core command running behind it:

stedgeai generate --model Model_File.tflite --target stm32n6 --st-neural-art

Documentation: https://stedgeai-dc.st.com/assets/embedded-docs/index.html

In our case, we want to rename and convert this file to network_data.xSPI2.bin:

cp network_atonbuf.xSPI2.raw network_data.xSPI2.bin

Next, add the path to ‘arm-none-eabi-objcopy‘ to your environment variables. You can find it in your STM32CubeIDE installation, typically under: C:/ST/STM32CubeIDE_<version>/STM32CubeIDE/plugins/ com.st.stm32cube. ide.mcu.externaltools.gnu-tools-for-stm32.13.3.rel1.win32_1.0.0.202411081344/ tools/bin. This tool allows you to convert the ‘.bin‘ file into a ‘.hex‘ file with a specified flash memory address:

arm-none-eabi-objcopy -I binary network_data.xSPI2.bin --change-addresses 0x71000000 -O ihex network_data.hex

You now have the hexadecimal file containing fixed weights and parameters ready for flashing.

5.3 Flash binaries with STM32CubeProgrammer

We now have all three image files: the FSBL, the application, and the model weights.

Open STM32CubeProgrammer and ensure that your ST-LINK configuration matches the image below and confirm that the firmware is up to date.

Set your NUCLEO-N657X0-Q board to development boot mode (Boot1 and Boot2 to the right) and click [Connect] in STM32CubeProgrammer.

Boot switches on the Nucleo board:

On the image, the board boots from flash mode.

FSBL and application are binary files, and their flashing addresses must be specified manually.

Flash FSBL to the start of OctoFlash at ‘0x70000000‘.
Flash the application to ‘0x70100000‘, based on the offset defined in the external memory loader.
The weights image is a ‘.hex‘ file with a predefined address (‘0x71000000‘), specified in the ‘objcopy‘ command.

For example:

To flash the images:

Put the board into development boot mode by sliding both BOOT switches to the right.
Flash all binary and hex images.
After flashing, switch the board to flash boot mode. Now, both switches to the left.
Power-cycle or reset the board.

At boot, the boot ROM loads the FSBL from flash to internal RAM. The FSBL then loads the application from flash and executes it.

6. Running the application

Now, if you connect the board with both switches set to the left, the application is loaded and executed from flash memory. Here is what you should observe in a terminal:

Here are some comments:

The input size and output size can be understood by opening the model.tflite file that was downloaded earlier.

Tool: Netron

X-CUBE-AI allocates both input and output buffers as INT8 tables. For the model used in this tutorial, the input buffer data type is INT8, therefore the memory size allocated corresponds directly to its dimensions (192x192x3 = 110592 bytes).

However, the output buffers data type is FLOAT32, which means the memory size allocated is four times greater. So, for example, for the third output buffer on Netron, which is the first one you see on LL_aton_buffers_output_info whose dimensions are (3875,2), 3875x2x4 = 30700 bytes were allocated. Then, as you can see on MX_CUBE_Ai_Process, the output buffer INT8 values must be "cast" to FLOAT32. Therefore, we use a float pointer to point at the beginning of an INT8 table of length 4.

In the serial terminal, only this output buffer used as the example was printed out, as well as in the Python script.

By opening the network.c file in Appli/X-Cube-AI/App, you can find information about the outputs of the model generated by X-CUBE-AI:

In the function LL_Buffer_InfoTypeDef *LL_ATON_Output_Buffers_Info_Default, the three outputs are listed in order, and the first one corresponds to the size 3845 * 2 (so the last one in netron).

Conclusion

If you have carefully followed the steps in this tutorial, the green LED should now be turned on, and the red LED should be blinking. The blinking interval reflects the sum of the user-defined delay, the inference execution time, and the time required to print the outputs to the serial terminal. Furthermore, the output values observed on the terminal should match those produced by executing the reference Python script provided in the annex.

This tutorial aimed to provide a minimal yet functional application that enables users, regardless of expertise level, to develop STM32N6 edge AI projects with a clear and structured workflow.

Thank you for reading.

Annexes

import numpy as np
import tensorflow as tf
# Path to your TFLite model file
MODEL_PATH = "ssd_mobilenet_v2_fpnlite_035_192_int8.tflite"
# The constant value to fill into the input tensor
FILL_VALUE = 10
# Load TFLite model and allocate tensors
interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
# Get input and output tensor details
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# You can print these to inspect:
print("Input details:", input_details)
print("Output details:", output_details)
# Getting model's input details
input_index = input_details[0]['index']
input_shape = input_details[0]['shape']
input_dtype = input_details[0]['dtype']
# Create input data filled with the constant value
input_data = np.full(input_shape, FILL_VALUE, dtype=input_dtype)
# Set the tensor
interpreter.set_tensor(input_index, input_data)
# Run inference
interpreter.invoke()
# Retrieve output tensors
outputs = []
for out in output_details:
output_data = interpreter.get_tensor(out['index'])
outputs.append(output_data)
# Print first buffer outputs
for dim in outputs[0]:
for i, val in enumerate(dim):
print(f"Output {i} = {val}")
print("\n\n\n\n")

FeherAron · ‎2025-09-02

Hi!

Firstly, thank you for the extensive tutorial. It was really easy to understand and to follow along.

Secondly, do you know of some changes in the recent CubeMX changes (6.15.0) not generating the X_CUBE_AI functions? For me, inside the Appli folder, there is the middlewares folder that has the ll_aton files, and inside the X-CUBE-AI/App folder, I have the report, and a c/h pair that uses the aton files, but nothing else.

Thanks in advance.

FeherAron · ‎2025-09-02

Solved It.

For some reason, I had to click 3 - 5 times for CubeMX to remember the Application Template. And without that, there were no high-level AI functions.

ERROR · ‎2025-09-04

Hi, @Julian E.

Thank you for your efforts!

nhwzzang · ‎2025-09-25

By default, "memory 1" is the source memory corresponding to the XSPI2 interface. The destination address must be set as the beginning of the secure RAM block from which the Application binary is loaded.

Based on this, if I enabled AXIRAM3,4,5,6 in RAMCFG, should I set the LRUN destination address of EXTMEM_MANAGER to 0x34200000?

Julian E. · ‎2025-09-30

@nhwzzang,

If you enable AXIRAM3,4,5,6 in RAMCFG, and you want the application to be loaded into AXIRAM3, then yes, you should set the LRUN destination address of EXTMEM_MANAGER to 0x34200000, which is the start address of AXIRAM3.

Have a good day,

Julian

YousefDessouki · ‎2025-10-20

Hello,
Thank you very much for your efforts. This was a very well detailed and dcoumented approach.

However, as I followed the example, I faced multiple issues.

When I generate the code from CubeMX, the SystemIsolation_Config only have the pin attributes setup and it is missing code. I have tried to select the NPU in the RIF tab but can not. I have tried your iloc file as well I made my own based on the tutorial instruction and I faced the same issue. I have attached images below. If you could explain why, it will be helpful for me.
I have also tried writing commenting the AI code , so that in the while(1) I can toggle LEDS. I have generated the bin files and the hex data for the model weights. However, it seems to me that the board does not leave the FSBL stage. The LED is not blinking. I have also tried to mass erase the excterbal memory in the STM32 programmer but I could not because it is secured by a memory protection unit.

I am new to STM32 world and this is my first STM32 board and thistutorial was very helpful. However, If any clarification regarding this matter will be greatly appreciated.

willisk · ‎2025-11-01

Nice tutorial @Julian E. , i experienced the same issue as @YousefDessouki

YousefDessouki · ‎2025-11-04

Hello,

I have had probably the same issue like you when programming the NPU in the nucelo board. I will share with my experience as well as the binary files in a github repo so you can test it.

My main issue was in the external memory mangement unit. in the CubeMx, you should use 0x34000000 as the destination of the source code ,not 0x24050000 as in the tutorial images. Also RIF_Isolation function do not edit it. As for the selection of the NPU in the RIF config tab, this is not an issue. Your main issue is probbaly in the CubeMx ioc files. Make sure to use STM32N6 MCU package version v1.1.1 . You can download it from the CUbeMx itself and make sure to force use it when generating the code.

Below is the github repo link, the binary files u will find them in the debug folder of each of Appli and the FSBL. You will also find the .hex file for the weights. Try using the binary files on the STM32programmer initially.

https://github.com/YousefDessouki/STM32_N6_NuceloBoardwithNPU.git

Let me know if there is any issue :)

Regards,

imagineb · ‎2025-11-26

Hi @YousefDessouki , thanks so much for sharing your repo!

My deployment didn't work (without any LED toggling and UART), and tried to used your binary instead.

However, the debug folder isn't included in your repo (appli & fbsl / trusted bin).

Do you mind sharing these? I'd like to just at least try directly with the working ones in my board.

Not sure what is happening, but I simply used your repo and build binary & trusted binary files, and programmed it.

Thanks!

willisk · ‎2025-11-27

Had some problems where the model got stuck in inference

WANGGANG · ‎2025-12-03

Hi Julian E., thank you very much for your document. I encountered a problem when connecting the device: when my computer is connected to the development board, STM32CUBEPROGRAMMER can find the device, but the connection fails. Please help explain the root cause of the problem and how to solve it.