Asking for Help: Why does "inference on target" seem slower than expected?

wwlkda
Associate

I've got a simple model like this in ONNX:

wwlkda_0-1698322521771.png

 

I set HCLK to 216 MHz and ran "Inference on target" on a Nucleo-F767ZI board with X-Cube-AI 8.1.0.

I got this useful information:

1) duration: 0.017 ms (17 us) per sample over 200 samples (I got a similar result, 21 us, with a single-sample validation)

2) cycles/MACC : 10.41 

However, I then generated the code and edited it like this:

wwlkda_1-1698322774620.png

I built the project using STM32CubeIDE 1.13.2 and got this:

wwlkda_0-1698323804894.png

duration: 8us / sample (prescaler set to 216)

Why is it significantly faster than the result from the default "validate on target" button?

1) Why, with the same system configuration in the .ioc file, do I get as much as x10 latency using this method?

2) Why, in contrast to the average of 6 cycles/MACC given in the manual for a Cortex-M7 handling float32 data, do I get more than 10 cycles/MACC using the Nucleo-F767ZI at 216 MHz?

3) Why does even a single dense layer run with "validate on target" show a latency amounting to 8 us, a number large enough to match the built C project mentioned above? Is this because the validation triggered from the button takes extra data read/write latency into account?

1 ACCEPTED SOLUTION

MBOB
ST Employee

Hello,

You have to set the prescaler to 108 (not 216) to get a 1 MHz timer tick, because the timer input clock is 108 MHz. Keep HCLK at 216 MHz.

Indeed, TIM14 is clocked from the "APB1 timer clock" output, which runs at 108 MHz, as you can see in STM32CubeMX > Clock Configuration, whereas the "APB2 timer clock" output runs at 216 MHz.
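The relationship between HCLK, the APB dividers and the timer clocks can be sketched in plain C as follows. This is only an illustration under assumptions: the helper names `timer_clock_hz` and `prescaler_division` are hypothetical, the APB1 /4 and APB2 /2 dividers are assumed (they are consistent with the 108 MHz and 216 MHz timer clocks quoted above, but the .ioc file is not shown in the thread), and the RCC TIMPRE bit is assumed to be at its reset value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: timer kernel clock for an APB bus.
 * Assumes TIMPRE = 0 (reset value): timers run at PCLK when the
 * APB divider is 1, otherwise at 2 x PCLK. */
static uint32_t timer_clock_hz(uint32_t hclk_hz, uint32_t apb_div)
{
    assert(apb_div != 0 && hclk_hz % apb_div == 0);
    uint32_t pclk_hz = hclk_hz / apb_div;
    return (apb_div == 1u) ? pclk_hz : 2u * pclk_hz;
}

/* Hypothetical helper: division factor that brings the timer clock
 * down to the requested tick rate (e.g. 1 MHz for 1 us per tick). */
static uint32_t prescaler_division(uint32_t timer_clk_hz, uint32_t tick_hz)
{
    assert(tick_hz != 0 && timer_clk_hz % tick_hz == 0);
    return timer_clk_hz / tick_hz;
}
```

With HCLK = 216 MHz and the assumed dividers, `timer_clock_hz(216000000u, 4u)` gives 108 MHz for the APB1 timers and `timer_clock_hz(216000000u, 2u)` gives 216 MHz for the APB2 timers, so a 1 us tick on TIM14 needs a division by 108. If the value is written directly to the timer's PSC register, keep in mind that the hardware divides by the register value plus one, so a division by 108 corresponds to a register value of 107.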

That explains why your measured inference time is divided by 2: 8 us instead of 17 us.

I ran exactly the same test using TIM14, and with the prescaler set to 108 I get the same value from "validate on target" and from the generated code.
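As a rough check of the factor of two, the tick count read from the timer can be converted back to real time. The sketch below is not taken from the original project (that code is only available as a screenshot); the helper name `ticks_to_us` is hypothetical, and the division factor is assumed to be the effective one applied to the timer clock.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: real elapsed time in microseconds for a number
 * of counted timer ticks, given the timer input clock and the effective
 * prescaler division factor. */
static double ticks_to_us(uint32_t ticks, uint32_t timer_clk_hz,
                          uint32_t division)
{
    assert(timer_clk_hz != 0 && division != 0);
    double tick_hz = (double)timer_clk_hz / (double)division;
    return (double)ticks * 1e6 / tick_hz;
}
```

With a 108 MHz timer clock divided by 216, the counter advances at 0.5 MHz, so each tick lasts 2 us and a reading of 8 ticks corresponds to about 16 us, in line with the 17 us reported by the validation. With a division by 108, one tick is 1 us and the reading can be used directly.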

You can see in the STM32F767 block diagram below that TIM14 is connected to APB1:

MBOHB1_0-1698922306548.png

 

Best Regards

 


2 REPLIES
Thanks a lot for this!