The TensorFlow model optimization docs describe reducing model size with weight pruning and weight clustering. Will either of these techniques help reduce the flash requirements for running my model with the STM AI tools?
I've been experimenting with applying these techniques to my model, but when I upload it to the STM32Cube.AI Developer Cloud they don't seem to have any effect on the reported flash size. I'm not sure if I'm doing something wrong or if these sorts of optimizations simply aren't relevant in this context.
My guess is that this sort of gzip-style compression would use too much RAM at inference time, or that something similar is already being applied in the optimization step the server performs.
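To illustrate what I mean: as I understand it, magnitude pruning only zeroes individual weights, so the dense weight blob stored in flash stays exactly the same size; the savings only appear once a compressed container format is involved. A minimal sketch of that (hypothetical 10,000-weight tensor, 80% sparsity):

```python
import gzip
import random
import struct

# Hypothetical dense layer: 10,000 float32 weights.
random.seed(0)
dense = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

# Magnitude-style pruning stand-in: zero out ~80% of the weights.
pruned = [w if random.random() > 0.8 else 0.0 for w in dense]

raw_dense = struct.pack(f"{len(dense)}f", *dense)
raw_pruned = struct.pack(f"{len(pruned)}f", *pruned)

# Both serialize to the same number of bytes: a dense blob in flash
# does not shrink just because most weights are zero.
print(len(raw_dense), len(raw_pruned))  # 40000 40000

# Only a compressed container benefits from the zeros, and that
# implies decompressing into RAM before inference.
print(len(gzip.compress(raw_dense)), len(gzip.compress(raw_pruned)))
```

If my mental model here is wrong, I'd be glad to be corrected.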
If these approaches are not useful here, are there any recommendations for reducing flash size without degrading the quality of the results? The STM32CubeMX tool lets you select low, medium, or high compression, but I'm not seeing any option like that on the STM32Cube.AI Developer Cloud.
The best way to compress a neural network is using structured pruning.
We have been collaborating with NVIDIA to propose a full flow for image classification using TAO Toolkit:
Using both quantization and structured pruning, you can achieve impressive compression factors, even up to 100×.
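To make the distinction concrete, here is a minimal NumPy sketch (the kernel shape and the set of kept channels are hypothetical, and real structured pruning would rank channels by importance rather than pick them arbitrarily) of why structured pruning reduces flash where weight-zeroing does not: entire output channels are removed, so the stored tensor itself gets smaller.

```python
import numpy as np

# Hypothetical Conv2D kernel with shape (kh, kw, in_ch, out_ch).
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 3, 16, 32)).astype(np.float32)

# Structured pruning: drop half of the 32 output channels outright.
# (Hypothetical selection; a real flow keeps the most important ones.)
keep = np.arange(16)
w_pruned = w[:, :, :, keep]

# The dense tensor genuinely shrinks, so the flash footprint shrinks
# too, with no decompression step needed at inference time.
print(w.nbytes, w_pruned.nbytes)  # 18432 9216
```

This is why structured pruning helps on a microcontroller while unstructured sparsity generally does not: the runtime still stores and walks a dense tensor, and only a physically smaller tensor saves flash.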