
Does compressing the model speed up the inference (prediction)?

HKim.16.78
Associate II

Hi

I imported a simple CNN to an STM32L462RCT using the STM32CUBE-AI v5.1.2 ApplicationTemplate.

I found that compressing the model has no effect on inference time.

The aiRun procedure runs for 115 ms in both the 8-bit compression and "none" configurations, although the accuracy drops a bit.
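As a rough illustration of how such a timing can be taken (a minimal sketch; the aiRun() name and prototype below are assumptions based on the ApplicationTemplate wrapper, not the exact generated code):

/* Minimal sketch: time one inference with the HAL millisecond tick.
   aiRun() is assumed to be the wrapper produced from the ApplicationTemplate;
   its real prototype depends on the generated network code. */
#include <stdint.h>
#include "stm32l4xx_hal.h"                      /* HAL_GetTick() */

extern int aiRun(const void *in_data, void *out_data);  /* assumed prototype */

uint32_t time_one_inference(const void *in_data, void *out_data)
{
    uint32_t t0 = HAL_GetTick();                /* milliseconds since boot */
    aiRun(in_data, out_data);                   /* single inference */
    return HAL_GetTick() - t0;                  /* elapsed time, ~115 ms in this case */
}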

I thought compressing the float network parameters to uint8_t would not only save memory but also speed up the inference.

So, is compressing the model supposed to speed up the inference?

1 ACCEPTED SOLUTION

jean-michel.d
ST Employee

Hi HKim,

Indeed, for a floating-point model, compression is applied only to the fully-connected (FC) layers, and only the weights are compressed, to reduce the flash memory footprint. Concerning the impact on inference time, no significant change is expected: for a compressed FC layer (x8 or x4), the number of operations stays the same; there is only an extra indirection to retrieve the weight values (LUT-based). Only the accuracy may be affected, due to the "compression" of the weights.
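To make the point concrete, here is a minimal sketch in C of an assumed LUT-based scheme (not the actual X-CUBE-AI kernel): with 8-bit compression each weight is stored as a uint8_t index into a small table of float values, so the FC layer performs exactly the same number of multiply-accumulates and only the weight storage shrinks.

/* Minimal sketch (not the actual X-CUBE-AI kernel) of a LUT-based
   compressed FC layer. Assumption: 8-bit compression stores one uint8_t
   index per weight plus a 256-entry float look-up table. */
#include <stdint.h>
#include <stddef.h>

/* Uncompressed FC layer: one float weight per connection. */
void fc_float(const float *w, const float *x, float *y,
              size_t n_in, size_t n_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i)
            acc += w[o * n_in + i] * x[i];          /* 1 MAC per weight */
        y[o] = acc;
    }
}

/* Compressed FC layer: same MAC count, one extra table look-up per weight,
   but weight storage drops from 32 bits to 8 bits per value. */
void fc_compressed(const uint8_t *w_idx, const float lut[256],
                   const float *x, float *y,
                   size_t n_in, size_t n_out)
{
    for (size_t o = 0; o < n_out; ++o) {
        float acc = 0.0f;
        for (size_t i = 0; i < n_in; ++i)
            acc += lut[w_idx[o * n_in + i]] * x[i]; /* indirection, then MAC */
        y[o] = acc;
    }
}

This is why the flash footprint of the FC weights can shrink by roughly 4x or 8x while the measured inference time stays essentially the same.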

br,

Jean-Michel


2 REPLIES
HKim.16.78
Associate II

Several weeks ago I found that my model has no fully-connected layers, and that the compression only applies to FC layers.
