QInt16 quantization support

AMurz.1
Associate II

Hi,

The hardware seems to support 16-bit activations, but according to the ST Edge AI documentation this does not appear to be usable.

We have a model where, at a particular tensor between a Conv and a Sigmoid, the precision loss due to quantization causes a non-negligible accuracy loss at the output of the model. The Conv output mostly lies between -90 and 2 and then goes into the Sigmoid. The low values lose too much precision when quantized, yet they are needed for an accurate Sigmoid output (which drives an audio-related ratio where accuracy matters).
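
To put rough numbers on it, here is a small sketch (assuming asymmetric per-tensor quantization over the observed range of roughly [-90, 2]) of the quantization step and the resulting worst-case error at the Sigmoid output:

# Rough illustration: asymmetric per-tensor quantization of the Conv
# output over its observed range of roughly [-90, 2].
lo, hi = -90.0, 2.0

step_int8 = (hi - lo) / 255.0     # ~0.36 per quantization step
step_int16 = (hi - lo) / 65535.0  # ~0.0014 per quantization step

# Worst-case rounding error is half a step; the Sigmoid slope is at
# most 0.25, so the output error can reach ~0.25 * step / 2.
print("int8  worst-case Sigmoid error ~", 0.25 * step_int8 / 2)   # ~0.045
print("int16 worst-case Sigmoid error ~", 0.25 * step_int16 / 2)  # ~0.00018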

I've tried the following parameters with ONNX Runtime quantization, but ST Edge AI fails:

from onnxruntime.quantization import (
    CalibrationMethod, QuantFormat, QuantType, StaticQuantConfig)

conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt16,  # 16-bit activations
    weight_type=QuantType.QInt8,       # 8-bit weights
    per_channel=True)

The error message is:

NOT IMPLEMENTED: Unexpected type for constant input of Dequantize layer (SIGNED, 16 bit, C Size: 16 bits Scales: [1.5259021893143654e-05] Zeros: [-32768] Quantizer: UNIFORM)

The only way to fix the accuracy issue is to leave the Conv and Sigmoid layers unquantized so they run in software in float32 (at the expense of lower inference speed, since the Conv then executes in software).
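
For reference, that workaround can be expressed with nodes_to_exclude; a minimal sketch (the node names below are placeholders for the actual Conv/Sigmoid nodes in our model):

from onnxruntime.quantization import quantize_static, QuantFormat, QuantType

# Keep the problematic Conv and Sigmoid in float32 by excluding them
# from quantization (node names are placeholders).
quantize_static(
    "model.onnx",
    "model_quant.onnx",
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
    nodes_to_exclude=["Conv_before_sigmoid", "Sigmoid_output"],
)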

 

TensorFlow already has support for 16-bit activations / 8-bit weights using:

converter.target_spec.supported_ops = [tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8]

This can be used on the Arm Ethos platform.
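
For completeness, the surrounding converter setup would look roughly like this (a sketch; saved_model_dir and representative_dataset are placeholders):

import tensorflow as tf

# Sketch of the TFLite 16x8 mode (16-bit activations, 8-bit weights).
# saved_model_dir and representative_dataset are placeholders.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.EXPERIMENTAL_TFLITE_BUILTINS_ACTIVATIONS_INT16_WEIGHTS_INT8
]
tflite_model = converter.convert()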

I think ONNX needs opset 21 for this, but I'm not sure.

 

Is there a way to use 16-bit activations? Or will it maybe be implemented later?

Or maybe float16 using MVE in software somehow?

 

Thanks for your support.

Alexis Murzeau

 

1 ACCEPTED SOLUTION

Accepted Solutions
Julian E.
ST Employee

Hello @AMurz.1,

 

This is indeed a software limitation.

Support for 16-bit is on the roadmap, but it is not planned for at least the next year.

 

As you pointed out, a solution to help with the accuracy loss is to leave a few layers unquantized to preserve the original information. So you will indeed have SW fallbacks and lose some inference time.

 

Concerning your last remark about maybe using float16 in software: the compiler is definitely not able to do it. And on the N6, with a Cortex-M55, MVE is used through the Embednets library (libNetworkRuntime/Embednets), which does not seem to include anything MVE-F related.

(Note that I am not an expert on this subject, so this may be wrong, but it is the information I gathered internally.)

 

Have a good day,

Julian


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.
