Error When Converting TensorFlow Whisper Encoder to .nb Using ST Edge AI Tool
2025-02-28 9:39 PM (edited 2025-02-28 9:40 PM)
I am a new user of the STM32MP257F-EV1 board and am not very familiar with the ST Edge AI tool. I am currently trying to extract the TensorFlow Whisper encoder, convert it to an INT8 .tflite model using post-training quantization (PTQ), and then use the ST Edge AI tool to convert it into the .nb format for NPU acceleration.
I followed this procedure to perform the quantization:
import numpy as np
import tensorflow as tf

# `model` is assumed to be the TensorFlow Whisper model loaded beforehand,
# e.g. Hugging Face's TFWhisperModel.from_pretrained(...)
config = model.get_config()
encoder = model.get_encoder()

# Define a Keras model that takes input features and outputs encoder embeddings
input_features = tf.keras.Input(
    shape=(config['num_mel_bins'], 2 * config['max_source_positions']),
    dtype=tf.float32)
encoder_output = encoder(input_features)
encoder_model = tf.keras.Model(inputs=input_features, outputs=encoder_output)
encoder_model.save("whisper_encoder_saved_model", save_format="tf")

def representative_data_gen():
    # 10 random calibration samples shaped like the encoder input
    for _ in range(10):
        data = np.random.normal(size=(1, config['num_mel_bins'],
                                      2 * config['max_source_positions'])).astype(np.float32)
        yield [data]

def convert_and_quantize_to_tflite(model_path, output_tflite_path, representative_data_gen):
    model = tf.keras.models.load_model(model_path)
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_data_gen
    converter._experimental_disable_per_channel = True
    converter._experimental_new_quantizer = False
    # Ensure 8-bit asymmetric quantization
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # Keep input and output in float32 for compatibility
    converter.inference_input_type = tf.float32
    converter.inference_output_type = tf.float32
    # Convert and save the TFLite model
    tflite_model = converter.convert()
    with open(output_tflite_path, "wb") as f:
        f.write(tflite_model)
    print(f"{output_tflite_path} saved successfully.")

convert_and_quantize_to_tflite("whisper_encoder_saved_model",
                               "whisper_encoder_int8.tflite",
                               representative_data_gen)
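As an aside, the representative dataset above is random noise, so the calibration ranges it yields may not reflect real speech. A minimal sketch of calibrating on actual log-mel features instead, assuming Hugging Face's WhisperFeatureExtractor and a hypothetical list audio_clips of 16 kHz waveforms (neither appears in my script above):

import numpy as np
from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

def representative_data_gen():
    # audio_clips is a hypothetical list of 1-D float32 waveforms sampled at 16 kHz
    for clip in audio_clips[:10]:
        # input_features has shape (1, num_mel_bins, 2 * max_source_positions)
        features = feature_extractor(clip, sampling_rate=16000, return_tensors="np")
        yield [features.input_features.astype(np.float32)]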
After generating the whisper_encoder_int8.tflite model, I used the following command to convert it to .nb format:
However, I encountered the following error:
I am unsure what is causing this error. Could it be an issue with how I generated the .tflite model, maybe because the encoder has 4 layers? Or is there a limitation with the ST Edge AI tool regarding tensor shapes or dimensions?
I would appreciate any insights or guidance on resolving this issue.
Thank you!
- Labels: ST Edge AI Core, X-LINUX-AI
2025-03-04 6:46 AM
Hello @Justin_wu,
At first glance, the issue could be related to shape mismatches or to dynamic shapes that the tool cannot interpret.
It could also be due to post-processing layers being removed by ST Edge AI, leading to errors, or to certain layers of this model not being supported.
Additionally, it seems that the model quantization is done per-channel rather than per-tensor, which can significantly hurt performance by causing the model to run on the GPU instead of the NPU.
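To see which operators the quantized model actually contains (and cross-check them against the operations supported on the target), TensorFlow ships a TFLite analyzer; a minimal sketch, using the filename from your post:

import tensorflow as tf

# Prints each subgraph and the operator used by every node in the .tflite file
tf.lite.experimental.Analyzer.analyze(model_path="whisper_encoder_int8.tflite")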
I will try to replicate the issue to find where it comes from and will update you.
Have a good day,
Julian
In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.
2025-03-05 8:46 PM
Hi @Julian E. ,
Thank you very much for your assistance; I truly appreciate your support. I have a quick question regarding the quantization process: could you please clarify how you determined that the quantization is performed per-channel? My understanding is that setting converter._experimental_disable_per_channel = True makes the converter perform per-tensor quantization. Is that correct?
Thank you again for your help.
Best regards,
Wu
2025-03-07 1:26 AM
Hello @Justin_wu,
You are correct, I misread.
Additionally, you can check it in Netron, for example:
- If scale and zero_point are scalars (empty shape []), it's per-tensor quantization.
- If scale and zero_point have a shape like [C], it's per-channel quantization.
For example (Netron screenshot): on the left is per-channel, on the right is per-tensor.
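Besides Netron, the same check can be done programmatically with the TFLite interpreter; a minimal sketch, assuming the filename from earlier in the thread:

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="whisper_encoder_int8.tflite")
interpreter.allocate_tensors()
for detail in interpreter.get_tensor_details():
    scales = detail["quantization_parameters"]["scales"]
    # More than one scale on a tensor means it is quantized per-channel
    if scales.size > 1:
        print(f"per-channel: {detail['name']} ({scales.size} scales)")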
I plan to replicate your test, but for the moment I am blocked by my proxy...
I will update you if I get anything useful.
Have a good day,
Julian
In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.