2025-09-21 3:36 AM
Hello everyone,
I'm working on deploying a PyTorch model to an STM32F401RE NUCLEO board and encountering some challenging memory and quantization issues that I hope the community can help me resolve.
My project involves running a custom PyTorch model (converted to ONNX format) on the STM32F401RE NUCLEO board. The system already has USB Host (Audio Class) library and FreeRTOS integrated as essential components for my application, which means I need to work within the remaining available memory space.
When I configure X-CUBE-AI with compression set to high and optimization set to ram, the build process fails with a linker error:
C:/ST/STM32CubeIDE_1.18.1/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.13.3.rel1.win32_1.0.0.202411081344/tools/bin/../lib/gcc/arm-none-eabi/13.3.1/../../../../arm-none-eabi/bin/ld.exe: Xcube.elf section `.bss' will not fit in region `RAM'
C:/ST/STM32CubeIDE_1.18.1/STM32CubeIDE/plugins/com.st.stm32cube.ide.mcu.externaltools.gnu-tools-for-stm32.13.3.rel1.win32_1.0.0.202411081344/tools/bin/../lib/gcc/arm-none-eabi/13.3.1/../../../../arm-none-eabi/bin/ld.exe: region `RAM' overflowed by 10800 bytes
Given that the STM32F401RE has only 96 KB of SRAM, and USB Host and FreeRTOS already consume part of it, this overflow isn't entirely surprising: an overflow of 10800 bytes means the build needs roughly 96 KB + 10.5 KB ≈ 106.5 KB of RAM.
To address the memory constraints, I attempted INT8 quantization using ONNX Runtime. Here's the quantization code I used:
import numpy as np
import onnxruntime
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static
)

class MyCalibrationDataReader(CalibrationDataReader):
    def __init__(self, data, model_path):
        self.enum_data = None
        self.data = data
        # Use an inference session to get the input shape.
        session = onnxruntime.InferenceSession(model_path, None)
        batch_size, channel, length = session.get_inputs()[0].shape
        self.input_name = session.get_inputs()[0].name
        self.datasize = len(data)

    def get_next(self):
        if self.enum_data is None:
            self.enum_data = iter([
                {self.input_name: sample[np.newaxis, np.newaxis, :].astype(np.float32)}  # (2048,) → (1, 1, 2048)
                for sample in self.data
            ])
        return next(self.enum_data, None)

    def rewind(self):
        self.enum_data = None  # Reset the enumeration of calibration data

dr = MyCalibrationDataReader(cali_data, model_fp32_prep)
quantize_static(
    model_fp32_prep,
    model_quant,
    dr,
    quant_format=QuantFormat.QDQ,
    per_channel=True,
    weight_type=QuantType.QInt8,
    activation_type=QuantType.QInt8,
    reduce_range=True,
    extra_options={'WeightSymmetric': True, 'ActivationSymmetric': False}
)
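As a sanity check on the quantizer's output, something like the following (a minimal sketch using the model_quant path from the code above) verifies that the quantized file loads cleanly and reports its on-disk size:

import os
import onnx

m = onnx.load(model_quant)
onnx.checker.check_model(m)  # raises if the quantized graph is structurally invalid
print("quantized size:", os.path.getsize(model_quant) / 1024, "KB")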
However, when I try to analyze the quantized model with X-CUBE-AI, I encounter this error:
Analyzing model
C:/Users/user/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.2.0/Utilities/windows/stedgeai.exe analyze --target stm32f4 --name network -m C:/Users/user/Downloads/FRFconv-TDS_onnx.quant.onnx --compression high --verbosity 1 --no-inputs-allocation -O ram --no-outputs-allocation --memory-pool C:\Users\user\AppData\Local\Temp\mxAI_workspace1712638268650010180404439894557726\mempools.json --workspace C:/Users/user/AppData/Local/Temp/mxAI_workspace1712638268650010180404439894557726 --output C:/Users/user/.stm32cubemx/network_output
ST Edge AI Core v2.2.0-20266 2adc00962
INTERNAL ERROR: 'NoneType' object is not subscriptable
I also tried using ST Edge AI Developer Cloud for quantization, but encountered the same issue:
>>> stedgeai analyze --model FRFconv-TDS_onnx_PerTensor_quant_random_2.onnx --optimization ram --target stm32f4 --name network --workspace workspace --output output
ST Edge AI Core v2.2.0-20266 2adc00962
INTERNAL ERROR: 'NoneType' object is not subscriptable
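One way to narrow down where the error comes from (a minimal sketch; the file name is taken from the log above) is to confirm that onnxruntime itself can load and run the quantized model. If it can, the failure is more likely inside the stedgeai converter than in the ONNX file:

import numpy as np
import onnxruntime

sess = onnxruntime.InferenceSession("FRFconv-TDS_onnx.quant.onnx", None)
inp = sess.get_inputs()[0]
# Replace any symbolic/dynamic dimensions with 1 to build a dummy input.
shape = [d if isinstance(d, int) else 1 for d in inp.shape]
out = sess.run(None, {inp.name: np.zeros(shape, dtype=np.float32)})
print("output shape:", out[0].shape)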
I'm quite attached to my current model architecture as it's specifically designed for my application requirements, so I'd prefer not to change the model structure if possible.
Memory Optimization: Has anyone successfully deployed AI models on STM32F401RE with other libraries like USB Host and FreeRTOS running simultaneously? Are there additional memory optimization techniques beyond X-CUBE-AI's high compression and RAM optimization that I could try?
Quantization Error: Have you encountered the 'NoneType' error when analyzing quantized ONNX models in X-CUBE-AI? This seems to occur both locally and on the cloud platform. Could this be a compatibility issue with my quantization approach or the ONNX model format?
Alternative Approaches: Are there other strategies to make my model fit within the available memory constraints without modifying the model architecture?
I have additional resources (model files, full build logs) available if they would help with troubleshooting.
Please let me know if you need any additional information to help diagnose these issues.
Thank you in advance for your assistance!
2025-09-22 2:14 AM
Hello @SR1218,
Your model is already very small. Quantizing in QDQ format (which is what we support, and what the dev cloud produces) inserts new Quantize and Dequantize layers; for big models, the weight compression this enables more than pays for them.
But for a small model, the overhead of these many added layers increases the model size more than the weight compression saves.
Non-quantized model: 53 KB -> QDQ model: 62 KB.
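You can see this overhead directly by counting the Quantize/Dequantize nodes the quantizer inserted (a minimal sketch; the file names are placeholders for your float and quantized models):

import onnx
from collections import Counter

for path in ("model_fp32.onnx", "model_qdq.onnx"):  # placeholder file names
    ops = Counter(node.op_type for node in onnx.load(path).graph.node)
    # Counter returns 0 for op types that are absent (e.g., in the float model).
    print(path, "QuantizeLinear:", ops["QuantizeLinear"],
          "DequantizeLinear:", ops["DequantizeLinear"])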
In your case it even triggers a bug in the ST Edge AI Core (the engine behind the dev cloud and X-CUBE-AI that converts the model to C). But as I explained, even if there were no bug, it would still not help you...
As you pointed out, the main issue here is the memory available on the STM32F401RE NUCLEO.
I don't know what your use case is, but judging from your I/O I would guess it is sound-related.
For sound, our example applications are based on the B-U585I-IOT02A and the STM32N6570-DK; you may want to try one of these boards. (The N6 is interesting for big models.)
I don't see how you could reduce the size of your model to make it usable.
You could also take a look at NanoEdge AI Studio and see whether the machine-learning libraries (models) it generates are good enough for your use case. It is a free tool and very easy to use. If you have a dataset, you can very quickly see whether you get promising results.
NanoEdge AI Studio - STMicroelectronics - STM32 AI
Have a good day,
Julian