
NaN when validating on target

Dresult
Associate III

Hi everyone!
I'm trying to validate an ONNX model on my STM32H747I-DISCO board, but I'm running into issues.


I've successfully generated the project and flashed it onto the board. The validation process does run (using random inputs), but the final metrics I'm getting are all NaN.
I'm using the FP32 version of the model (not quantized), and when I run the same validation on desktop, the metrics are almost perfect — so the model itself seems fine.
I've tried enabling the --classifier option, but it didn't make any difference. I also attempted using --no-onnx-io-transpose, but got the following error: stedgeai: error: unrecognized arguments: --no-onnx-io-transpose

 

Could this be an issue with my model or the CubeIDE project, or am I missing something in the X-CUBE-AI configuration?

 

I've also noticed that there is a note in the CSV file with the inputs:

# Note: w/o datanumber of item by sample is exceeded: 4096 > 512

Is this issue compromising the classification?

Thanks in advance for your help!


9 REPLIES
Julian E.
ST Employee

Hello @Dresult,

 

Can you share your model in a zip file.

What is strange is that in the report you seem to use a .onnx model, but the screenshot of the model here looks like a .tflite?

Also, is there a specific reason for you to use conv2d instead of conv1d? Your third parameter is always 1.

 

For the note in the CSV, it is because your input is 4096x1 and the compiler thinks that you are using a batch size of 4096.

You can solve it by working with an input size of 1x4096x1.

The compiler understands it, so in the report it adds a shape for the batch size, but you still get the message.
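
For illustration, a minimal sketch of making the batch axis explicit when exporting from PyTorch (the model below is a hypothetical placeholder for the real network):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 8), nn.ReLU(), nn.Linear(8, 1))  # placeholder

# Dummy input of shape (batch=1, 4096, 1), so the exported ONNX model
# carries an explicit batch dimension instead of a bare 4096x1 input.
dummy = torch.randn(1, 4096, 1)
torch.onnx.export(model, dummy, "model_batch1.onnx",
                  input_names=["input"], output_names=["output"])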

 

Have a good day,

Julian



Hi @Julian E. , thanks for your reply!

I'm attaching the zip file with the model. The screenshot comes from the "show graph" button in X-CUBE-AI; I can confirm it corresponds to the ONNX file.

As for the Conv2D, I don't know why it shows them as Conv2D. In my PyTorch code they are defined as Conv1d. However, all of them have an H dimension that is always 1, so they should be treated as Conv1d.
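
A minimal sketch of such a layer (sizes are hypothetical): backends that only implement 2D convolutions commonly lower Conv1d to a Conv2D with an H dimension and H kernel size of 1, which would match what the graph shows.

import torch
import torch.nn as nn

class TinyConv(nn.Module):
    def __init__(self):
        super().__init__()
        # Conv1d over a length-4096 signal; after ONNX import, tools may
        # display this as a Conv2D with kernel (1, 3) and H = 1.
        self.conv = nn.Conv1d(in_channels=1, out_channels=8,
                              kernel_size=3, padding=1)

    def forward(self, x):  # x: (batch, channels, length)
        return torch.relu(self.conv(x))

torch.onnx.export(TinyConv(), torch.randn(1, 1, 4096), "conv1d_check.onnx")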

 

UPDATE: I believe the issue is caused by the flatten layer between the last Conv2D and the first Dense, which is translated in the graph as a "Reduce" layer. In fact, if I cut the ONNX model just before the Reduce (see the sketch below), I get an output with minimal errors in validation. However, when I add the Reduce layer back, the NaNs appear.
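
This kind of cut can be reproduced with onnx.utils.extract_model; the tensor names below are hypothetical and can be read from the graph (e.g. in Netron):

import onnx.utils

# Keep everything from the graph input up to the tensor feeding the
# Reduce node, dropping the Reduce and the Dense layers after it.
onnx.utils.extract_model(
    "model.onnx",                    # original model
    "model_before_reduce.onnx",      # truncated model
    input_names=["input"],           # hypothetical graph input name
    output_names=["last_conv_out"],  # hypothetical tensor feeding the Reduce
)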

@Julian E. at first, I thought the issue might be related to the "ReduceMean" layer, but I also tried a similar model without the mean operation and still got NaNs. Then, I tested a much simpler model with just a couple of linear layers and ReLU/Sigmoid activations (a sketch of it is below), and in that case the metrics were computed without any issues. Could the problem be due to my model being too complex, or is there something else I might be missing?
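
The simple model that validated cleanly looked roughly like this (layer sizes are hypothetical):

import torch
import torch.nn as nn

mlp = nn.Sequential(
    nn.Linear(4096, 64),  # a couple of linear layers ...
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),         # ... with ReLU/Sigmoid activations
)
torch.onnx.export(mlp, torch.randn(1, 4096), "simple_mlp.onnx")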

@Dresult,

I suspect that either the input or the output shape is causing the issue, though I don't know why.

What were the I/O of the model that worked in your case?

I am looking internally for help on that. I'll update you once I know more.

 

Julian



I'm using random numbers with (B, H*W, C) shape.
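
Concretely, the random inputs can be generated like this (a sketch; the batch count and the .npz file name are hypothetical):

import numpy as np

B, HW, C = 10, 4096, 1  # hypothetical batch count; 4096 samples, 1 channel
x = np.random.rand(B, HW, C).astype(np.float32)  # random validation inputs
np.savez("val_inputs.npz", x)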

Anyway, I think I found the source of the problem. I'm attaching the ONNX model I used for the test, as it is lighter than the one I attached yesterday.

In practice, I observed that if all the parameters are in the internal flash, everything works fine and there are no NaNs in the metrics. If I place the parameters in the external flash, as in the figure below, I get the NaNs (I was doing this because the model was too big for the internal flash alone).

To use the external flash, I enabled the QuadSPI for the M7; I don't know whether this information helps in some way.

[Screenshot: X-CUBE-AI memory settings with the model parameters placed in external flash]

I'm also attaching the generated project, in case you want to take a look. As it is, I opened it in STM32CubeIDE, compiled it in Release mode, and then flashed it onto the board before running the validation on target from STM32CubeMX.

In any case, I'm also going to test the case where I "enlarge" the internal flash, as explained by Yanis in https://community.st.com/t5/edge-ai/issue-running-larger-models-on-stm32h747i-disco/td-p/787460 , to see whether I still get the NaNs.

hamitiya
ST Employee

Hello

Accuracy stands for "Classification accuracy". It might be NaN if ST Edge AI Core does not detect your model as a classifier.

You can force this with the --classifier argument (Advanced settings).

Find more on pages 20 and 23 of:

Getting started with X-CUBE-AI Expansion Package for Artificial Intelligence (AI) - User manual

[Screenshot: the --classifier option in the X-CUBE-AI advanced settings]

 

Best regards,

Yanis

 



Hi @hamitiya,
I have already tried with the option both enabled and disabled. Unfortunately, the result doesn't change.

 

UPDATE:

@hamitiya @Julian E.

I did as mentioned above, placing the entire quantized model in the "extended" internal flash memory. The code runs and the validation completes successfully. I really think there was some issue with using the external flash memory...

@Dresult, thank you for the update!

Have a good day,

Julian



Thank you all for the support :)

Have a great day too!