What does the validation command do during its runtime?

brianon
Associate III

Hi,

 

With the new 2.2 STEdgeAI update I have run some neural network model validations on the CM55 of the nucleo-n657x0-q board. I have also used the STLink-v3pwr to measure the power consumption while validating. What I would like to know is whether there is a detailed explanation of what the validation command does during its runtime, more specifically what happens on the target board.

Attached is a .stpm of an example validation using my custom model with 10 data samples. As can be seen, 10 spikes are recorded, with some activity between them. How should this activity be classified? Is an inference the time from the start of one spike until the next spike, or is an inference only the spike itself, with the time between spikes being some other activity?


Thanks in advance,

Brian.

SlothGrill
ST Employee

Hello @brianon ,

To understand the graphs better, it might be useful to increase the sampling rate so that shorter patterns become visible.

Most likely, the spikes you are seeing are the inferences being executed on your 10 samples. Broadly, when the NPU is "working", power consumption increases, and depending on the "epoch" of the inference currently being executed, the hardware units not used during that epoch are clock-gated and consume less, which is why you can see variations within the spikes you mention.

 

The activity between the spikes is most likely UART messages sent to/from the host machine. For each inference:

  • Before the inference, the validation "tool" sends the input data to the target through UART/USB
  • After the inference, the board sends the output data back to the host machine through UART/USB

Those two steps are rather time-hungry, because data transmission is much slower than crunching numbers on the target MCU.
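To get a feel for the orders of magnitude, here is a rough sketch in Python; the tensor size, baud rate and inference time are illustrative assumptions, not values read from your actual setup:

```python
# Illustrative only: compare UART transfer time to compute time per sample.
input_bytes = 30_000      # assumed size of one input tensor sent to the board
output_bytes = 10         # assumed size of the returned output tensor
baud_rate = 921_600       # assumed UART baud rate
bits_per_byte = 10        # 8N1 framing: start bit + 8 data bits + stop bit
inference_ms = 70.0       # assumed pure compute time per sample on the MCU

transfer_ms = (input_bytes + output_bytes) * bits_per_byte / baud_rate * 1e3
print(f"UART transfer per sample : ~{transfer_ms:.0f} ms")
print(f"Inference per sample     : ~{inference_ms:.0f} ms")
print(f"Transfer / inference     : ~{transfer_ms / inference_ms:.1f}x")
```

With these assumptions the serial transfers alone take several times longer than the inference itself, which is why the gaps between spikes can dominate the trace.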

 

Hi @SlothGrill ,

Thank you for the answer. While this clarifies things a bit, I'm still slightly unsure how to interpret the results. To clarify what I mean, attached is the graph of a 5-layer model instead (still using the STEdgeAI validation command with 10 samples). As you can see, there are 10 clusters of 5 spikes, each spike being the activity of one layer. Attached is also the report produced by the validation.

The report claims an average time per sample of 73.88 ms, yet when checking the graph the first spike alone seems to be around that length, as if the remaining layers are ignored (see A in the picture below). The average time per sample including callbacks, on the other hand, states 1380.10 ms, which seems to match the time from the first spike in the 5-spike series until the end of the noisier activity (see B below). Am I correct in assuming that B in this case is the "DEVICE duration" from the validation, which only includes model-related callbacks, while C is the UART/USB communication? Or does the "DEVICE duration" also include UART/USB communication and would therefore be a combination of B and C? If B is the "DEVICE duration", would it then be more appropriate to count B as a single inference rather than A?

[Attached screenshot: Screenshot 2025-07-16 183550.png — power trace with regions A, B and C marked]

Thanks in advance,

Brian

 

 

SlothGrill
ST Employee

Hello!

Sorry, I was mistaken: I thought you were using the Neural-Art accelerator (so please ignore the part about NPU behaviour above).

So, in your graph, if you say A is the inference, then I would say that:

  • C might be the phase where the inputs are sent to the board through UART (31360 bytes sent)
  • A is the inference (73.xx ms), plus the per-layer inference times being sent through callbacks
  • In B minus A:
    • the outputs are sent from the board to the host (10 bytes, so this should be quick)
    • then I have to double-check what is going on in the "noisy part" you are referring to (it could be some kind of cleanup, but that seems a bit lengthy, or a kind of "wait state", but then it should perhaps not be counted in the "device duration")

For the 1380.10 ms I have to double-check, but it should be the sum of sending the input tensor(s), doing the inference (including sending intermediate data through UART), and sending the output tensor(s).

 

Anyway, if you want a bit more detail, you can call checker.py with the --debug option; you will see output listing the UART messages (and a short description of what is going on).
I will get back to you once I understand more precisely what is happening there.

Thank you for the answer, looking forward to hearing more!

This is especially confusing because when I run the validation on the NPU with the same model, the difference between the durations including and excluding callbacks is almost negligible: excluding callbacks the duration was 16.89 ms, while including callbacks it only reached 17.10 ms. Compare this to the CM55 validation, where it went from 73.88 ms to 1380.10 ms.

Hi @SlothGrill,

Do you have any updates on what exactly happens during validation and what the difference is with and without callbacks? And why is it so different between the CM55 validation and the NPU validation?

SlothGrill
ST Employee

Hello @brianon ,

Sorry for the delay.

So, as explained above:

  • With callbacks = with the added time to send info through UART
    • after executing each "node" that you see in your validation report (or when using `--debug` as proposed above), the firmware sends back information about the time spent computing on the CPU (node duration) and the shapes of that node's I/O tensors.
  • Without callbacks = without the time spent sending data through UART.

What is strange in your data is that you say you have made a model with "only 5 layers" (and shared its stpm plus the validation report). In fact, the validation report lists around 700 operators implemented in software, which is not even close to "5 layers".

Could you please state what your "5 layers" are (it looks like they are not "standard layers", and as such the tool is not generating "only 5 nodes" in C :))? Would you mind sharing the file you use as input for stedgeai?

 

From what we can understand by comparing the report and the stpm:

  • Lots of time is spent during inference sending data through UART (700 nodes means 700 transmissions of information about the node that just finished executing) -> this explains why the "duration with callbacks" is so long
  • Compared to that, the time spent doing the inference itself is very low (most nodes have a small execution time) -> this explains why the "duration w/o callbacks" is so short. A rough back-of-the-envelope check is sketched below.
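As a rough sanity check on those two figures (rough, because the exact callback payload and UART configuration are not known here), one can divide the extra time by the node count:

```python
# Back-of-the-envelope: average cost of each per-node callback.
duration_with_callbacks_ms = 1380.10    # from the validation report
duration_without_callbacks_ms = 73.88   # from the validation report
node_count = 700                        # approximate number of SW operators

extra_ms = duration_with_callbacks_ms - duration_without_callbacks_ms
print(f"Extra time due to callbacks: {extra_ms:.2f} ms per sample")
print(f"Average cost per callback  : {extra_ms / node_count:.2f} ms")
# -> roughly 1.9 ms per node, i.e. for most of these tiny nodes the UART
#    exchange costs far more than the node's own computation.
```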

Most likely, "B" in your graph contains the inference plus UART transfers all around it, but it is nearly impossible to untangle pure CPU inference time from UART transfer time because the sampling rate of the measurements is far too low (i.e. you are trying to observe events that last 1 ms by sampling one point every 10 ms).

As proposed above, could you try doing the same experiment with a higher sampling rate (i.e. high enough to resolve the event durations you expect)?
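As a rule of thumb (the "points per event" target below is just a heuristic, not a tool requirement), you want several samples inside the shortest event you care about:

```python
# How many measurement points land inside one short event at a given rate?
shortest_event_ms = 1.0   # e.g. one small node, or one short UART burst
for rate_hz in (100, 1_000, 10_000, 100_000):
    points = rate_hz * shortest_event_ms / 1e3
    print(f"{rate_hz:>7} Hz -> {points:>7.1f} points per {shortest_event_ms} ms event")
# At 100 Hz (one point every 10 ms) a 1 ms event is essentially invisible;
# tens of kHz are needed before its shape becomes usable.
```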

 

For the last point: when using the Neural-Art accelerator, it is possible that the model is better supported by the tool for the NPU than for the CPU. It may then end up with far fewer epochs than the number of nodes you have here. Thus, less time is spent sending data through UART, and the durations with and without callbacks end up much closer because of this.

 

Let us know if it makes sense... 
Best regards.

Hi again @SlothGrill ,

Sorry about the model confusion; calling it simply a 5-layer model was a huge understatement. Without getting into the details, the model uses custom TensorFlow layers to mimic spiking behavior as seen in Spiking Neural Networks. Additionally, the input is handled as a spike train, so each layer loops over 10 time steps to simulate the flow of time. The model is far from perfect; I think the large number of operators is connected to the custom layer functions and to the fact that some operators are not fully quantized, which forces a fallback to software.
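For readers wondering why such a model explodes into hundreds of operators: the toy layer below is not my actual implementation, just a generic sketch of a dense layer with simple integrate-and-fire behavior, which shows how a Python loop over T time steps gets unrolled into T copies of the same operations when the graph is traced and converted:

```python
import tensorflow as tf

class ToySpikingDense(tf.keras.layers.Layer):
    """Illustrative only, not the model discussed here: a dense layer with a
    simple integrate-and-fire behavior, unrolled over `time_steps` steps."""

    def __init__(self, units, time_steps=10, threshold=1.0, **kwargs):
        super().__init__(**kwargs)
        self.units = units
        self.time_steps = time_steps
        self.threshold = threshold

    def build(self, input_shape):
        self.w = self.add_weight(shape=(input_shape[-1], self.units),
                                 initializer="glorot_uniform")

    def call(self, x):
        # x: (batch, time_steps, features) -- a spike train
        membrane = tf.zeros([tf.shape(x)[0], self.units])
        spikes = []
        for t in range(self.time_steps):             # unrolled -> many graph ops
            membrane = membrane + tf.matmul(x[:, t, :], self.w)
            spike = tf.cast(membrane > self.threshold, x.dtype)
            membrane = membrane * (1.0 - spike)       # reset neurons that fired
            spikes.append(spike)
        return tf.stack(spikes, axis=1)
```

Each of the 10 iterations becomes its own set of matmul/compare/multiply nodes after conversion, and any operator that cannot be quantized falls back to software, which multiplies the node count further.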

Attached is another example model, which acts as a 1000-node "1-layer" spiking neural network, along with the validation reports and power measurements for that model (this time with a high sampling rate). Do you believe that the misleading operator/layer count affects the results, since each operator effectively acts as a layer? Also, could you explain what exactly the "PER_LAYER" mode does and whether it has any effect on the results?

Thanks in advance!

SlothGrill
ST Employee

Hey,
So this is getting clearer... Thanks for the new measurements.

With the higher temporal resolution, the graph is easier to understand (tell me if this is not the case for you).

For the MCU graph:

  • The validation report shows there are 10x "Dense" layers that take 16+ ms each, each preceded by a Slice that takes about 1 ms
  • The stpm graph clearly shows the 10x 16+ ms pattern at higher consumption levels... let's say this is the rough mean level when the CPU is doing "intense" computation.
  • Based on what has been discussed already, after the 10x16 ms batch the stpm then shows lots of "peaks" around this value (my guess is there are as many of these spikes as layers shown in the report... I did not count :)), which I would call the "inference computations" so to speak. They are separated by less intense consumption (on the core; it might be different on the IOs), which is likely time spent sending data to the host machine.
  • -> With the higher sampling rate, you can easily see the difference between:
    • "pure inference time", which is the sum of the lengths of the peaks (a rough way to extract this from an exported trace is sketched below),
    • duration with callbacks, which also takes the UART time into account (and that takes most of the time here)
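If you want to quantify that from the measurements rather than eyeball it, one rough approach (assuming you export the trace to a CSV with a time column in seconds and a current column in amps; the file and column names below are placeholders) is to threshold the trace and sum the time spent above the threshold:

```python
# Rough sketch: estimate "pure compute" time as the summed duration of the
# high-consumption peaks in an exported power trace.
import numpy as np
import pandas as pd

trace = pd.read_csv("power_trace.csv")   # placeholder: exported measurement data
t = trace["time_s"].to_numpy()           # placeholder column names
i = trace["current_a"].to_numpy()

# Simple threshold halfway between the idle level and the peak level.
threshold = i.min() + 0.5 * (i.max() - i.min())
active = i > threshold

# Sum of time spent above the threshold (assumes a uniform sampling period).
dt = np.median(np.diff(t))
print(f"Time above threshold: {active.sum() * dt * 1e3:.1f} ms")
```

This is only a crude estimate, of course: it lumps together anything that raises the consumption above the threshold.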

For the NPU graph:

  • The data can be interpreted the same way
  • You have to zoom in a lot to see the patterns, but there are lots of spikes with high consumption ("NPU working") separated by a lower-intensity signal (callbacks: sending data through UART)

 

For the last question, I'm not sure I understand:

The "validate" command is here to "benchmark" your codegeneration, and as such contains tooling around the code to do the benchmark (timers manipulation + data exchange with host).

This tooling does not change the results you get out from the inference (i.e. the output values are unchanged), but changes the "time observed by an external observer to make one inference" because time is spent sending data to the host (while the inference process is frozen).

So, the time spent for doing "pure inference" i.e. without callback is minimally impacted (or not impacted at all) by the validation process. The full time taken for doing an inference, since it takes into account sending data to the host is on its side impacted (and the impact is even bigger if you ask for "PER LAYER" information, because the callbacks are called after each layer --you can refer to the doc for more info How to use the AiRunner package, and look for PER_LAYER).
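To be transparent about the level of certainty: the snippet below is only a sketch from my reading of that AiRunner article; the module, class and mode names may not match your installed version exactly, so please double-check the documentation before relying on them.

```python
# Sketch only -- names taken from my reading of the AiRunner documentation,
# so they should be double-checked against your installed package version.
import numpy as np
from stm_ai_runner import AiRunner

runner = AiRunner()
runner.connect("serial")          # board flashed with the validation firmware

# Placeholder input: shape and dtype must match your deployed model.
inputs = np.zeros((1, 10, 1000), dtype=np.float32)

# PER_LAYER profiling: the firmware calls back after every node, which is what
# inflates the "duration with callbacks" on a model with hundreds of nodes.
outputs, profile = runner.invoke(inputs, mode=AiRunner.Mode.PER_LAYER)
print(profile)
```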

Anyway, again, this is for benchmarking purposes, so it is not expected to end up in a final product; I do not exactly understand why you are so intrigued by the time spent on the measurement machinery :).

The times reported in the validation table are the "time spent doing useful work" (i.e. inference computation), and that should be the only thing of interest to you... isn't it?

 

cheers.