2025-04-25 3:03 AM - edited 2025-04-25 7:37 AM
Hi,
Could you share documentation or examples for running quantized foundation models (e.g. Google Gemma) on the STM32MP257F-DK, first in Python and then in C/C++ using the STM32MP2 NPU? Specifically:
Does the STM32MP2 NPU support transformer-based architectures, or is it limited to CNNs (like the STM32N6)?
Which inference frameworks are supported for GenAI on this platform? Is there an ST port of llama.cpp for this NPU?
Sorry, I couldn't find the required info on the STM32 MPU wiki pages.
Thanks!
2025-11-07 2:16 AM
Hello
We are currently evaluating hardware options and have the same question. Can somebody from ST answer it here?
Thank you and best regards
Jan
2026-01-05 10:00 PM
Hello,
I have the same question. Is it possible to run LLMs on the STM32MP2 series?
Additionally, what is the expected performance/inference efficiency?
Thanks!
2026-01-08 1:48 AM
Hello,
The NPU of the STM32MP2 series does not support transformer-based model architectures.
LLMs can be run on the CPU instead.
The frameworks supported by the X-LINUX-AI expansion package are listed on this wiki page:
https://wiki.st.com/stm32mpu/wiki/Category:X-LINUX-AI_expansion_package
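As a starting point in Python, here is a minimal sketch of CPU-only inference using the llama-cpp-python binding. Note the assumptions: llama.cpp is not part of X-LINUX-AI as far as I know, so you would need to build or pip-install it on the board yourself, and the model file name below is just a placeholder for any small quantized GGUF model.

```python
# Minimal sketch: CPU-only inference of a quantized GGUF model on the
# STM32MP2 Cortex-A35 cores via llama-cpp-python.
# Assumes llama.cpp / llama-cpp-python has been built or installed on the
# board; the model path below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-2b-it-q4_k_m.gguf",  # placeholder: any small quantized GGUF model
    n_ctx=512,    # keep the context window small to fit in the board's RAM
    n_threads=2,  # the STM32MP257 has a dual-core Cortex-A35
)

output = llm("Q: What is an NPU? A:", max_tokens=64, stop=["Q:"])
print(output["choices"][0]["text"])
```

Given the dual Cortex-A35, expect throughput suited to small, heavily quantized models and short prompts rather than interactive chat with larger models.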
BR
2026-01-08 2:44 AM
Here is an example of running an LLM locally on the STM32MP257F-EV1 (as stated above, using the CPU only).
https://www.linkedin.com/posts/danilopietropau_another-great-example-of-llm-on-stm32mp2-activity-7293222309333495809-Qe0d
Regards.