2023-10-26 05:31 AM - edited 2023-10-26 05:42 AM
I've got a simple model like this in ONNX:
I set HCLK@216MHz and ran "Inference on target" using Nucleo-f767zi board with X-Cube-AI 8.1.0
I got these useful information:
1) duration : 0.017ms (17us) by sample (200 samples, I got a similar result (21us) by a single sample validation)
2) cycles/MACC : 10.41
However, when I generated the code, edited the code like
I built the project using X-Cube IDE 1.13.2. I got this:
duration: 8us / sample (prescaler set to 216)
Why is it significantly faster than that from the default "validate on target" button?
1) Why, in the same system configuration in the .ioc file, I got as much as x10 latency using this method?
2) Why, in contrast to the average 6-Cycle/MACC for Cortex-M7 in the manual handling float32 data, I got >10Cycle/MACC using Nucleo-F767zi@216MHz?
3) Why even a single dense layer using "validate on target" could generate a latency mounted to 8us, a large enough number that could match the mentioned bulit C project? Is this because the validation from the BUTTON taken into account extra data read/write latency?
Solved! Go to Solution.
2023-11-02 04:08 AM
Hello,
You have to set the prescaler to 108 Mhz to get 1MHz (not 216 Mhz). Keep HCLK to 216 Mhz.
Indeed Timer14 is plugged on the output « APB1 timer clock » that produces 108Mhz,
as you can see in STM32CubeMX > clock configuration, whereas « APB2 timer clock » produces 216Mhz.
That explains why you have inference time divided by 2: 8us instead of 17us.
I did exactly the same test you did using TIM14, and I've got the same value using "validate on target" and generating code, with prescaler = 108 Mhz.
You can see in the block diagram of the STM32F767 below that TIM14 is connected to APB1
Best Regards
2023-11-02 04:08 AM
Hello,
You have to set the prescaler to 108 Mhz to get 1MHz (not 216 Mhz). Keep HCLK to 216 Mhz.
Indeed Timer14 is plugged on the output « APB1 timer clock » that produces 108Mhz,
as you can see in STM32CubeMX > clock configuration, whereas « APB2 timer clock » produces 216Mhz.
That explains why you have inference time divided by 2: 8us instead of 17us.
I did exactly the same test you did using TIM14, and I've got the same value using "validate on target" and generating code, with prescaler = 108 Mhz.
You can see in the block diagram of the STM32F767 below that TIM14 is connected to APB1
Best Regards
2023-11-02 04:37 AM
Thanks a lot for this!