2026-03-16 12:34 AM
I've implemented a vision transformer for the STM32N6 NPU.
Repository: https://github.com/minchoCoin/stm32n6-transformer
The BatchMatmul operator is not yet supported on the NPU and runs on the CPU, but inference is still much faster than on the STM32H7 (inference time was reduced by about 90% compared to the STM32H7) because the fully connected layers and convolutions are executed on the NPU.
The Vision Transformer for the STM32N6 NPU has three differences from the original ViT (A. Dosovitskiy et al., 2020):
1. Patch Embedding (PE) is performed in the pre-processing step before model input: if the model contains both PE and self-attention, STEdgeAI cannot interpret the model structure (I don't know why yet...).
2. Remove the bias parameter from the fully connected layers, or replace them with 1x1 Conv2D: a fully connected layer in ONNX expects a 2-dimensional input, but the fully connected layers in ViT have a 3-dimensional input (batch, patch, embedding). The compiler therefore drops the batch axis and treats the patch axis as the batch axis; however, an error occurs when the batch size is not 1 and the fully connected layer has a bias.
3. Use the ReLU activation function in the MLP: TFLite GELU is not supported.
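Since the repository's exact layer definitions aren't shown here, the following NumPy sketch (hypothetical function and variable names) illustrates the three modifications: patch embedding moved into host-side pre-processing, a bias-free fully connected layer expressed as a 1x1 Conv2D, and a ReLU MLP:

```python
import numpy as np

def patch_embed_preprocess(img, proj, patch=16):
    """Difference 1: patch embedding done on the host before model input.
    img: (C, H, W) float image; proj: (C*patch*patch, D) projection matrix."""
    C, H, W = img.shape
    ph, pw = H // patch, W // patch
    # cut the image into non-overlapping patches and flatten each one
    patches = (img.reshape(C, ph, patch, pw, patch)
                  .transpose(1, 3, 0, 2, 4)
                  .reshape(ph * pw, C * patch * patch))
    return patches @ proj                      # (num_patches, D) tokens

def fc_as_1x1_conv(x, w):
    """Difference 2: a bias-free fully connected layer is the same matmul as a
    1x1 Conv2D applied to the tokens laid out as a (D_in, N, 1) feature map."""
    feat = x.T[:, :, None]                     # (D_in, N, 1) "image"
    out = np.einsum('oi,ihw->ohw', w, feat)    # per-pixel matmul = 1x1 conv
    return out[:, :, 0].T                      # back to (N, D_out)

def mlp_relu(x, w1, w2):
    """Difference 3: ReLU in place of GELU, since TFLite GELU is unsupported."""
    return np.maximum(x @ w1, 0.0) @ w2
```

The 1x1-conv trick is numerically exact: `fc_as_1x1_conv(x, w)` matches `x @ w.T` to floating-point precision, so the swap changes nothing in the math while giving the compiler a layer shape it can map onto the NPU.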
For detailed information, please refer to the slides: https://github.com/minchoCoin/stm32n6-transformer/blob/main/assets/stm32n6_transformer.pdf
2026-03-23 2:50 AM
Hi @mincho00,
Thank you for sharing.
I will show this to my colleague working on the compiler. It may be useful for them.
Have a good day,
Julian