
Custom trained pytorch model yolov11 deployment into STM32N6570-DK

dev8
Associate II

Hello,

I am working on deploying a custom PyTorch YOLOv11 model to the STM32N6570-DK board using the STM32 AI Developer Cloud / STM32 AI Model Zoo workflow.

Currently, I have my model trained in PyTorch (.pt format), and I can also export it to ONNX. My goal is to run inference on the STM32N6570-DK with int8 quantization (quantized TFLite or ONNX converted to an STM32-supported format).

Could you please guide me on the following:

  1. What is the recommended workflow to convert a custom YOLOv11 PyTorch model into a format compatible with STM32 AI tools?

  2. Should I first export to ONNX, then quantize to TFLite, and finally use STM32Cube.AI / STEdgeAI tools?

  3. Are there any specific constraints or optimizations needed for YOLOv11 models to run efficiently on STM32N6570-DK?

Any detailed steps, documentation links, or examples would be very helpful.

Thank you in advance!

Julian E.
ST Employee

Hello @dev8,

 

We have a tutorial, written in partnership with Ultralytics, on deploying YOLOv8 and YOLOv11n models. It may be helpful, so please take a look: ultralytics/examples/YOLOv8-STEdgeAI/README.md at main · stm32-hotspot/ultralytics · GitHub

 

To use the acceleration provided by the NPU on the N6, you should use a quantized model.

The ST Neural-ART compiler supports two formats of quantized model. Both are based on the same quantization scheme: 8-bit weights / 8-bit activations, ss/sa (signed symmetric weights, signed asymmetric activations), per-channel.

  • A quantized TensorFlow Lite model generated by a post-training or quantization-aware training process. The calibration is performed by the TensorFlow Lite framework, principally through the "TF Lite converter" utility exporting a TensorFlow file.
  • A quantized ONNX model based on the Tensor-oriented (QDQ; Quantize and DeQuantize) format. DeQuantizeLinear and QuantizeLinear operators are inserted between the original (float) operators to simulate the quantization and dequantization process. It can be generated with the ONNX Runtime quantization services.

https://stedgeai-dc.st.com/assets/embedded-docs/stneuralart_operator_support.html
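As an illustration of the second option, here is a minimal sketch of post-training static quantization with ONNX Runtime in QDQ format. The file names, the calibration image folder, the input name "images" and the 640×640 preprocessing are assumptions; adapt them to your own export and training resolution.

```python
# Minimal sketch: post-training static quantization of an exported YOLO ONNX model
# to the QDQ (QuantizeLinear/DeQuantizeLinear) format with ONNX Runtime.
# "yolov11n.onnx", "calib_images/" and the preprocessing below are placeholders.
import glob
import numpy as np
from PIL import Image
from onnxruntime.quantization import (
    CalibrationDataReader,
    QuantFormat,
    QuantType,
    quantize_static,
)

class YoloCalibrationReader(CalibrationDataReader):
    """Feeds a set of representative images to the calibrator."""

    def __init__(self, image_dir, input_name="images", size=640):
        self.files = iter(glob.glob(f"{image_dir}/*.jpg"))
        self.input_name = input_name
        self.size = size

    def get_next(self):
        path = next(self.files, None)
        if path is None:
            return None  # calibration finished
        img = Image.open(path).convert("RGB").resize((self.size, self.size))
        x = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
        x = x.transpose(2, 0, 1)[None, ...]             # NCHW with batch dimension
        return {self.input_name: x}

quantize_static(
    model_input="yolov11n.onnx",
    model_output="yolov11n_int8_qdq.onnx",
    calibration_data_reader=YoloCalibrationReader("calib_images"),
    quant_format=QuantFormat.QDQ,      # insert QuantizeLinear/DeQuantizeLinear pairs
    activation_type=QuantType.QInt8,   # 8-bit activations
    weight_type=QuantType.QInt8,       # 8-bit weights
    per_channel=True,                  # per-channel weight quantization
)
```

The exact quantization options (per-channel, signed int8 for weights and activations) should be checked against the STEdgeAI documentation linked above so that the resulting model matches the scheme expected by the Neural-ART compiler.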

 

Concerning optimization of the YOLOv11 model, I would suggest using the nano version. When running the generate command (or when using Cube.AI), take a close look at where the activations are located. If they don't fit in internal memory (NPU RAMs), the inference time will most likely increase by a lot.

Weights can be in external RAM; that is not an issue, as they are only read once.

 

Have a good day,

Julian

 

 


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.

Hi @Julian E.,

I have a YOLOv11 model downloaded from Ultralytics along with my own custom dataset. My goal is to train this model from scratch (without using any pretrained weights) on my dataset, and then deploy it to the STM32N6570-DK.

I would like to confirm:

  • Is there any script in the STM Model Zoo that allows training a fresh YOLOv11 model on a custom dataset?

  • Or should I train my model entirely in Ultralytics (with pretrained=False), then export to ONNX/TFLite and only use STM Model Zoo for quantization and deployment?

Basically, I want to ensure I am not missing an STM-provided training flow, since I don’t want to rely on my previous pretrained model.

Thanks and regards,
dev8

 

Hello @dev8,

 

The second option is the way to do it:

"Or should I train my model entirely in Ultralytics (with pretrained=False), then export to ONNX/TFLite and only use STM Model Zoo for quantization and deployment?"

 

We don't have scripts to train YOLO models in the Model Zoo, so as you pointed out, you need to rely on Ultralytics for training, then export your model, quantize it, and convert it.
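For reference, a minimal sketch of that flow with the Ultralytics Python API could look like the following. The dataset YAML, epoch count and image size are placeholders for your own setup.

```python
# Minimal sketch: train YOLOv11n from scratch with Ultralytics, then export to ONNX.
# "my_dataset.yaml", epochs and imgsz are placeholders; adjust to your project.
from ultralytics import YOLO

# Building from the architecture YAML (not the .pt checkpoint) starts from random weights.
model = YOLO("yolo11n.yaml")

model.train(
    data="my_dataset.yaml",   # your custom dataset description
    epochs=300,
    imgsz=640,
    pretrained=False,         # make sure no pretrained weights are pulled in
)

# Export the trained model to ONNX for the quantization/deployment step.
model.export(format="onnx", opset=17, simplify=True)
```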

 

Have a good day,

Julian


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.
Thomas-Carl
Associate

To deploy a custom YOLOv11 PyTorch model on the STM32N6570-DK, the recommended flow is:

  1. Export your model from PyTorch → ONNX and simplify it.

  2. Remove or custom-handle unsupported ops (e.g., run NMS off-device).

  3. Quantize to int8 using ONNX Runtime (with a calibration dataset).

  4. Import the quantized ONNX into STM32Cube.AI or STM32 AI Developer Cloud for conversion and deployment.

  5. Optimize input size (e.g., 320×320) or use a “tiny” variant for memory/performance.

In short: PyTorch → ONNX → int8 quantization → STM32Cube.AI → deploy & profile.
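As a small sketch of step 1, the exported graph can be simplified with onnx-simplifier before quantization (the file paths are placeholders). For step 2, note that the default Ultralytics ONNX export does not embed an NMS operator, so decoding and NMS can be handled off-device in post-processing.

```python
# Minimal sketch of step 1: load the exported ONNX graph and simplify it with
# onnx-simplifier before quantization. "yolov11n.onnx" is a placeholder path.
import onnx
from onnxsim import simplify

model = onnx.load("yolov11n.onnx")
model_simplified, ok = simplify(model)
assert ok, "Simplified model could not be validated"
onnx.save(model_simplified, "yolov11n_simplified.onnx")
```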

Hello @Thomas-Carl,

 

The recommended flow is this one:

stm32ai-modelzoo-services/object_detection/docs/tuto/How_to_deploy_yolov8_yolov5_object_detection.md at main · STMicroelectronics/stm32ai-modelzoo-services

 

Have a good day,

Julian


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.