
Does compressing the model speed up the inference (prediction)?

HKim.16.78
Associate II

Hi

I imported a simple CNN to an STM32L462RCT using the STM32CUBE-AI v5.1.2 ApplicationTemplate.

I found that compressing the model has no effect on inference time.

The aiRun procedure runs for 115 ms in both the 8-bit compression and "none" configurations, although the accuracy drops a bit.
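As a rough illustration of how such a timing can be taken (a minimal sketch; the aiRun() name and prototype below are assumptions based on the ApplicationTemplate wrapper, not the exact generated code):

/* Minimal sketch: time one inference with the HAL millisecond tick.
   aiRun() is assumed to be the wrapper produced from the ApplicationTemplate;
   its real prototype depends on the generated network code. */
#include <stdint.h>
#include "stm32l4xx_hal.h"                      /* HAL_GetTick() */

extern int aiRun(const void *in_data, void *out_data);  /* assumed prototype */

uint32_t time_one_inference(const void *in_data, void *out_data)
{
    uint32_t t0 = HAL_GetTick();                /* milliseconds since boot */
    aiRun(in_data, out_data);                   /* single inference */
    return HAL_GetTick() - t0;                  /* elapsed time, ~115 ms in this case */
}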

I thought compressing the float network parameters to uint8_t would not only save memory but also speed up the inference.

So, is compressing the model supposed to speed up the inference?

1 ACCEPTED SOLUTION

jean-michel.d
ST Employee

Hi HKim,

Indeed, for a floating-point model, compression is applied only to the fully-connected (FC) layers, and only the weights are compressed, to reduce the flash memory footprint. Concerning the impact on inference time, no significant change is expected: for a compressed FC layer (x8 or x4), the number of operations stays the same; there is only an extra indirection to retrieve the weight values (LUT-based). Only the accuracy may be affected, due to the "compression" of the weights.
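To make the point concrete, here is a minimal sketch in C of an assumed LUT-based scheme (not the actual X-CUBE-AI kernel): with 8-bit compression each weight is stored as a uint8_t index into a small table of float values, so the FC layer performs exactly the same number of multiply-accumulates and only the weight storage shrinks.

/* Minimal sketch (not the actual X-CUBE-AI kernel) of a LUT-based
   compressed FC layer. Assumption: 8-bit compression stores one uint8_t
   index per weight plus a 256-entry float look-up table. */
#include <stdint.h>
#include <stddef.h>

/* Uncompressed FC layer: one float weight per connection. */
void fc_float(const float *w, const float *x, float *y,
              size_t n_in, size_t n_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i)
            acc += w[o * n_in + i] * x[i];          /* 1 MAC per weight */
        y[o] = acc;
    }
}

/* Compressed FC layer: same MAC count, one extra table look-up per weight,
   but weight storage drops from 32 bits to 8 bits per value. */
void fc_compressed(const uint8_t *w_idx, const float lut[256],
                   const float *x, float *y,
                   size_t n_in, size_t n_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i)
            acc += lut[w_idx[o * n_in + i]] * x[i]; /* indirection, then MAC */
        y[o] = acc;
    }
}

This is why the flash footprint of the FC weights can shrink by roughly 4x or 8x while the measured inference time stays essentially the same.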

br,

Jean-Michel


2 REPLIES
HKim.16.78
Associate II

Several weeks ago I found that my model has no fully-connected layers, and that the compression only applies to FC layers.
