2021-03-13 06:11 AM
I loaded my MobileNetV3 model into STM32CubeAI, but when I choose a compression ratio of 4 or 8 to reduce its size, the reported RAM and flash requirements stay the same.
I also optimized my model with TensorFlow Lite, which brought it down to ~2 MB, but when I load it into STM32CubeAI it still reports the original size of 6 MB. Why is that?
Please help.
2021-03-17 01:13 AM
Hello,
X-CUBE-AI supports compression only for Dense (fully connected) layers stored in float, and only under certain conditions that limit the drop in accuracy. Since MobileNetV3 consists almost entirely of convolution layers, selecting a compression factor of 4 or 8 has little effect on the reported RAM and flash sizes.
Regarding the TensorFlow optimization, I suppose you used TFLiteConverter with the default optimization option to generate the .tflite file:
import tensorflow as tf

# 'model' is the trained Keras model (e.g. your MobileNetV3)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # dynamic-range ("weights only") quantization
tflite_model_quant = converter.convert()
In that case, only the weights are quantized (float32 -> int8). This reduces the size of the TFLite file for deployment on a mobile device, but it requires specific operator implementations (hybrid operators) that either "quantize" the activations (the operator inputs/outputs) on the fly or "dequantize" the weights before applying the operation. For an IoT or edge device this is not efficient in terms of processing time and memory usage, so X-CUBE-AI instead "dequantizes" the weights off-line, during code generation. This is why you see the original size again: the expected size reduction is not preserved, and you also lose some accuracy.
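As a quick check (just a sketch building on the snippet above, nothing X-CUBE-AI specific), you can inspect the converted model with the TFLite interpreter and see that only the weight tensors are int8 while the model input/output, and hence the activations, remain float32:

from collections import Counter

# Load the dynamic-range-quantized model produced above directly from memory.
interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
interpreter.allocate_tensors()

# With weight-only quantization the model interface stays float32...
print("input dtype :", interpreter.get_input_details()[0]["dtype"])
print("output dtype:", interpreter.get_output_details()[0]["dtype"])

# ...while the weight tensors show up as int8 in the tensor list.
print(Counter(str(t["dtype"]) for t in interpreter.get_tensor_details()))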
For X-CUBE-AI, as for other edge AI runtimes, it is recommended to use the TFLiteConverter options that produce a "full integer" quantized model (https://www.tensorflow.org/lite/performance/post_training_quantization):
def representative_data_gen():
    # 'train_images' is a representative set of inputs (e.g. training images)
    # used to calibrate the dynamic range of the activations
    for input_value in tf.data.Dataset.from_tensor_slices(train_images).batch(1).take(100):
        yield [input_value]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8   # or tf.float32, tf.uint8
converter.inference_output_type = tf.int8  # or tf.float32, tf.uint8
tflite_model_quant = converter.convert()
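As a small usage sketch (the file name is only an example), you can then save the fully quantized model and verify that its input/output are now int8 before importing the .tflite file into X-CUBE-AI:

# Example output path; this is the file to select in X-CUBE-AI.
with open("mobilenetv3_int8.tflite", "wb") as f:
    f.write(tflite_model_quant)

# Quick check that the model is now fully integer at its interface.
interpreter = tf.lite.Interpreter(model_content=tflite_model_quant)
print("input dtype :", interpreter.get_input_details()[0]["dtype"])   # expected: int8
print("output dtype:", interpreter.get_output_details()[0]["dtype"])  # expected: int8

With a full-integer model, X-CUBE-AI keeps the int8 weights and activations instead of dequantizing them, so the reported flash size should be close to the size of the .tflite file rather than the original 6 MB.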
br,
Jean-Michel