
Model compression understanding

bardetad
Associate II

Hi ST,

I've read the part about weight/bias compression in the X-Cube-AI documentation (6.1).

If not confidential can you tell a bit more about the Weight-sharing-based algorithm (k-means clustering)?

What is the meaning of the factor (x4 or x8)?

Is the algorithm based on research papers, such as https://arxiv.org/abs/1510.00149 ?

Thanks in advance.

Best,

Adrien B

4 REPLIES
jean-michel.d
ST Employee

Hi,

Sorry for the delay,

This is not fully confidential.

Basically, when the compression option is enabled (only applicable to floating-point models), the code generator tries to compress the weight/bias parameters of the dense/fully-connected layers. The algorithm is based on a k-means clustering approach that builds a dictionary of centroids. Two fixed dictionary sizes are defined: 256 values (x4 option), where each parameter index is coded on 8 bits, and 16 values (x8 option), where each parameter index is coded on 4 bits. When the x4 or x8 option is selected for the whole model, the final gain (or compression factor) on the weight size depends on the model itself and on the parameter size of each dense layer. At network level, different heuristics decide whether or not to "compress" a given dense layer.
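To illustrate the idea, a weight-sharing pass over one layer's weights can be sketched as follows. This is a minimal NumPy sketch of my own, not the generator's actual code; the function name and the simple 1-D k-means loop are illustrative only:

```python
import numpy as np

def compress_weights(w, n_clusters=16, n_iter=20, seed=0):
    """Cluster the flat weight array into n_clusters centroids (1-D k-means)
    and return (centroids, indices) such that centroids[indices] ~ w."""
    rng = np.random.default_rng(seed)
    flat = w.ravel()
    # Initialise centroids by sampling existing weight values.
    centroids = rng.choice(flat, size=n_clusters, replace=False)
    for _ in range(n_iter):
        # Assign each weight to its nearest centroid.
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its assigned weights.
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    # Final assignment against the converged centroids.
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return centroids.astype(np.float32), idx.astype(np.uint8)

# 16 centroids -> each index fits in 4 bits, i.e. the "x8" option:
# 32-bit floats are replaced by 4-bit indices plus a small dictionary.
w = np.random.default_rng(1).normal(size=(64, 32)).astype(np.float32)
centroids, idx = compress_weights(w, n_clusters=16)
w_hat = centroids[idx].reshape(w.shape)  # decompressed weights
```

At inference time, only the small centroid dictionary and the packed indices need to be stored; the kernel looks up `centroids[idx]` on the fly.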

To know whether a layer has been compressed, check the summary (report file or log): a "(c)" in the rom column indicates that the layer has been compressed (bias and/or weights).

If you want to apply a specific factor (x4 or x8) per layer to improve the final accuracy, see the "FAQs" documentation, section "How to specify or to force a compression factor by layer?".

Best,

Jean-Michel

bardetad
Associate II

Hi,

Ok. If I understood correctly, the x4 (or x8) factor corresponds to 256 (or 16) centroids deduced from a 256- (or 16-) means clustering applied to the model weight values. Each centroid value is then supposed to represent and replace the model weights that have approximately the same value.
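That reading also matches the factor names with some quick arithmetic. These are my own back-of-the-envelope numbers, assuming 32-bit float weights and counting the dictionary overhead:

```python
def compression_factor(n_weights, n_centroids, index_bits, float_bits=32):
    """Ratio of original weight storage to (indices + centroid dictionary)."""
    original = n_weights * float_bits
    compressed = n_weights * index_bits + n_centroids * float_bits
    return original / compressed

# x4 option: 256 centroids, 8-bit indices -> factor approaches 4
# x8 option:  16 centroids, 4-bit indices -> factor approaches 8
f4 = compression_factor(100_000, 256, 8)
f8 = compression_factor(100_000, 16, 4)
```

For large dense layers the dictionary overhead becomes negligible, so the factor tends toward exactly 4 or 8.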

Then why is this technique only used for dense layers?

Is it because other layer types (e.g. conv) are more sensitive to it in terms of error?

Anyway thanks for the clear explanation.

Best,

Adrien

jean-michel.d
ST Employee

Hi Adrien,

Currently, we have only implemented this technique for the Dense layer because, to be efficient in terms of memory (RAM/ROM usage) and execution time, a specific computing kernel implementation is required. Unfortunately, for the conv layer, the implementation of this kernel is a little more complex without additional RAM and, as you mentioned, more sensitive in terms of error. We have instead prioritized and focused on other compression techniques based on the quantization of the parameters and activations to an 8-bit integer format, which is better suited to constrained embedded devices (with "limited" computational and memory resources).
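For completeness, 8-bit quantization of a parameter tensor can be sketched like this. It is a minimal symmetric per-tensor scheme for illustration only, not necessarily the exact scheme used by the tool:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: w ~ scale * q, with q in int8."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.linspace(-1.0, 1.0, 11, dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale  # dequantized values
```

The int8 values map directly onto the integer MAC instructions of Cortex-M devices, which is why this path is favoured for constrained targets.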

Best,

Jean-Michel

bardetad
Associate II

Hi,

Alright, thanks again!

Best,

Adrien