
What does the validation command do during its runtime?

brianon
Associate III

Hi,

 

With the new STEdgeAI 2.2 update, I have run some neural network model validations on the CM55 of the nucleo-n657x0-q board. I have also used the STLink-v3pwr to measure power consumption while validating. What I wonder is whether there is a detailed explanation of what the validation command does during its runtime, more specifically what happens on the target board.

Attached is a .stpm of an example validation using my custom model with 10 data samples; as can be seen, 10 spikes are recorded with some activity between them. How can this activity be classified? Is an inference the time from the start of one spike until the next spike, or is an inference only the spike itself, with the time between spikes being some other activity?


Thanks in advance,

Brian.

6 REPLIES
SlothGrill
ST Employee

Hello @brianon ,

To better understand the graphs, it might be useful to increase the sampling rate so that shorter patterns become visible.

Most likely, the spikes you are seeing represent the inferences being executed on your 10 samples. On the whole, when the NPU is "working", an increase in power consumption is noted; in addition, depending on the current "epoch" of the inference being executed, hardware units not used during that epoch are clock-gated (and consume less), so you can see variations within the "spikes" you mention.

 

The activity between the spikes is most likely UART traffic exchanged with the host machine. For each inference:

  • Before the inference, the validation "tool" sends the input data to the target through UART/USB
  • After each inference, the board sends the output data back to the host machine through UART/USB

These two steps are rather time-consuming, because data transmission is much slower than crunching numbers on the target MCU.
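The per-sample sequence described above can be sketched as follows (a minimal sketch: the function names and the FakeLink stand-in are hypothetical, not the actual STEdgeAI host code):

```python
# Hypothetical sketch of the host-side validation loop described above.
# Names are illustrative only; the real STEdgeAI host tool differs.

class FakeLink:
    """Stand-in for the UART/USB link, so the loop can be exercised offline."""
    def send(self, sample):
        self._last = sample                  # host -> target transfer (slow step)
    def wait_done(self):
        return {"inference_ms": 73.88}       # placeholder timing report
    def receive(self):
        return [x * 2 for x in self._last]   # placeholder "model output"

def validate_on_target(link, samples):
    """For each sample: send input, run inference on target, read output back."""
    results = []
    for sample in samples:
        link.send(sample)           # 1) before the inference: input data over UART/USB
        timing = link.wait_done()   # 2) the inference itself (the power "spike")
        output = link.receive()     # 3) after the inference: output data over UART/USB
        results.append((output, timing))
    return results

runs = validate_on_target(FakeLink(), [[1, 2], [3, 4]])
```

Steps 1 and 3 are the slow, transfer-bound phases; step 2 is the spike visible in the power trace.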

 

Hi @SlothGrill ,

Thank you for the answer. While this clarifies things a bit, I'm still slightly unsure how to interpret the results. To clarify what I mean, attached is the graph of a 5-layer model instead (still using the STEdgeAI validation command with 10 samples). As you can see, there are 10 clusters of 5 spikes, each spike being the activity of one layer. Attached is also the report produced by the validation.

The report claims an average time per sample of 73.88 ms, yet when checking the graph, the first spike alone seems to be around that size, as if the remaining layers were ignored (see A in the picture below). The average time per sample including callbacks, on the other hand, states 1380.10 ms, which seems to match the time from the first spike in the 5-spike series until the end of the noisier activity (see B below). Am I correct in assuming that B in this case is the "DEVICE duration" from the validation, which only includes model-related callbacks, while C is the UART/USB communication? Or does the "DEVICE duration" also include UART/USB communication, and would it therefore be a combination of B and C? If B is the "DEVICE duration", would it then be more appropriate to count B as a single inference rather than A?

Screenshot 2025-07-16 183550.png

Thanks in advance,

Brian

 

 

SlothGrill
ST Employee

Hello !

Sorry, I was mistaken: I thought you were using the Neural-Art accelerator (so please ignore my part about NPU behavior above).

So, in your graph, if A is the inference, then I would say that:

  • C might be the phase where the inputs are sent to the board through UART (31360 bytes sent)
  • A is the inference (73.xx ms); the per-layer inference times are sent through callbacks
  • In B-A,
    • the outputs are sent from the board to the host (10 bytes; this should be quick)
    • then, I have to double-check what is going on in the "noisy part" you are referring to (it could be some kind of cleanup, but that seems a bit lengthy, or a kind of "wait state", but then maybe it should not be counted in the "device duration")

For the 1380.10 ms, I have to double-check, but it should be the sum of: sending the input tensor(s), doing the inference (including sending intermediate data through UART), and sending the output tensor(s).

 

Anyway, if you want a bit more detail, you can call checker.py with the --debug option; you will see an output listing the UART messages (with a short description of what is going on).
I will get back to you once I understand more precisely what is happening there.

Thank you for the answer; I'm looking forward to hearing more!

This is especially confusing because when I run the validation on the NPU with the same model, the difference between the durations including and excluding callbacks is almost negligible: excluding callbacks the duration was 16.89 ms, while including callbacks it only reached 17.10 ms. Compare this to the CM55 validation, where it went from 73.88 ms to 1380.10 ms.

Hi @SlothGrill,

Do you have any updates on what exactly is happening during validation and what the difference is with and without callbacks? And why is it so different between the CM55 validation and the NPU validation?

SlothGrill
ST Employee

Hello @brianon ,

Sorry for the delay.

So, as explained above:

  • With callbacks = with the added time to send information through UART
    • after executing each "node" that you see in your validation report (or when using `--debug` as proposed above), the firmware sends information about the time spent doing computations on the CPU (node duration) and the shapes of the I/O tensors of this node.
  • Without callbacks = without the time spent sending data through UART.

What is strange in your data is that you say you have made a model with "only 5 layers" (and shared its .stpm and the validation report). In fact, the validation report shows around 700 operators implemented in software, which is not even close to "5 layers".

Could you please state what your "5 layers" are (it looks like they are not "standard layers", so the tool is not generating "only 5 nodes" in C :) )? Would you mind sharing the file you use as input for stedgeai?

 

From what we can understand by comparing the report and the .stpm:

  • A lot of time is spent during inference sending data through UART (700 nodes means sending node information 700 times) --> this explains why the "duration with callbacks" is so long
  • Compared to that, the time spent doing the inference itself is very low (most nodes have a small execution time) --> this explains why the "duration w/o callbacks" is so short.
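As a rough back-of-the-envelope check, the per-node callback overhead can be estimated from the figures quoted in this thread (the ~700 node count is approximate, taken from the report discussed above):

```python
# Rough estimate of the per-node callback (UART) overhead, using the
# durations quoted in this thread; the node count is approximate.
with_callbacks_ms = 1380.10    # average time per sample, callbacks included
without_callbacks_ms = 73.88   # average time per sample, callbacks excluded
n_nodes = 700                  # software operators reported for this model

overhead_ms = with_callbacks_ms - without_callbacks_ms  # total time spent on callbacks
per_node_ms = overhead_ms / n_nodes                     # average per node message
print(f"callback overhead: {overhead_ms:.2f} ms total, ~{per_node_ms:.2f} ms per node")
```

So on the order of 2 ms per node goes into reporting, dwarfing the ~0.1 ms of actual computation per node.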

Most likely, "B" in your graph contains the inference plus the surrounding UART transfers, but it is nearly impossible to untangle pure CPU inference time from UART transfer time because the sampling rate of the measurements is far too low (i.e. you are trying to observe events that last 1 ms while sampling one point every 10 ms).

As proposed above, could you try the same experiment with a higher sampling rate (i.e. high enough to resolve the event durations you expect)?
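The sampling-rate issue can be illustrated with a toy simulation (nothing STLink-specific, just the general undersampling effect): events lasting 1 ms are mostly missed entirely when you sample one point every 10 ms.

```python
# Toy illustration of undersampling: short bursts of activity are missed
# when the sampling period is longer than the burst duration.
def samples_hitting_bursts(burst_starts_ms, burst_len_ms, period_ms, total_ms):
    """Count sample points that land inside any activity burst."""
    hits = 0
    for t in range(0, total_ms, period_ms):
        if any(s <= t < s + burst_len_ms for s in burst_starts_ms):
            hits += 1
    return hits

bursts = [5, 25, 45, 65, 85]  # five 1 ms events in a 100 ms window
coarse = samples_hitting_bursts(bursts, 1, 10, 100)  # 10 ms sampling period
fine = samples_hitting_bursts(bursts, 1, 1, 100)     # 1 ms sampling period
print(f"10 ms sampling sees {coarse} bursts; 1 ms sampling sees {fine}")
```

With the 10 ms period, every one of the five bursts falls between two sample points, so the trace shows nothing at all where activity actually occurred.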

 

For the last point: when using Neural-Art, it is possible that the model is better supported by the tool for the NPU than for the CPU. It may then end up with far fewer epochs than the number of nodes you have here. Less time is then spent sending data through UART, which is why the durations with and without callbacks can end up so close.

 

Let us know if it makes sense... 
Best regards.