Possible suboptimal STM32CubeAI Conv2D kernels? Errors for documented parameters?

Hi there, I'm not sure whether it's better to post here or on GitHub. If there is a preferred place, please let me know!

TL;DR

I'd appreciate any guidance on how to correctly configure pad_values for 1-bit conv operations within X-CUBE-AI.

I appreciate that those at ST may not be able to fully comment on the inner workings of ST's kernel implementations, but I'm concerned that 1-bit convolutions may be implemented sub-optimally.

Problem and Setup

The problem I'm working on is generating a Binary Neural Network (BNN). I've decided to trial the X-CUBE-AI framework for my company.

A terse description of my environment to replicate this is as follows (most of this should be irrelevant, but just in case):

  • OS: macOS 15.1
  • Chip: M2 Pro
  • Python: 3.11.8
  • TensorFlow: 2.15.0
  • Larq: 0.13.3
  • X-CUBE-AI: 9.1.0

Simple Self-Contained Example

I've got a very simple Python script called "test.py" (see below). It imports tensorflow and larq and creates a very simple model with a single Larq binary 2D convolution (binary weights and inputs), where we take the sign of the inputs and the weights (i.e. map to {-1, 1}) to make a simple binary conv op. Note we use "same" padding, so there will typically be some padding at the edges.

import tensorflow as tf
import larq as lq


def build_model(pad_values: int, kernel_size: int = 3, stride: int = 1, out_channels: int = 16):
    x = tf.keras.Input(shape=(32, 32, 1))
    layer = lq.layers.QuantConv2D(
        filters=out_channels,
        kernel_size=(kernel_size, kernel_size),
        kernel_quantizer="ste_sign",  # binarise weights to {-1, +1} via sign
        input_quantizer="ste_sign",   # binarise inputs to {-1, +1} via sign
        use_bias=False,
        strides=(stride, stride),
        padding="same",
        pad_values=pad_values,        # value used to pad the edges
        groups=1,
    )
    y = layer(x)
    return tf.keras.Model(inputs=x, outputs=y)


print(tf.__version__)
model = build_model(pad_values=0)  # <-- TRY VARYING THIS VALUE
# Save the model
model.save("test.h5")

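As an aside, a small variant of the save step makes it easy to emit one model per pad value and feed each to stedgeai separately (my own sketch; the file names are just my choice):

# Hypothetical helper: save one model per pad value so each
# can be passed to stedgeai in turn.
for pv in (0, -1, 1):
    build_model(pad_values=pv).save(f"test_pad_{pv}.h5")
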
As noted, the Python script saves the model to a file at the end. I have a second script (bash this time) called "test.sh"; see below:

#!/bin/bash
set -e

# Ensure we always start from a new model
rm -f test.h5

# Create a model
python test.py

# Change this path if your install is at a different location
~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/9.1.0/Utilities/macarm/stedgeai \
  analyze \
  --model test.h5 \
  --target stm32 \
  --type keras 

Note how it calls the Python test.py script to build the model, and then passes the model "test.h5" to the "stedgeai" CLI.

What I've noticed is that if "pad_values" is 0, then "stedgeai" works fine: we can transpile from Keras to the ST Edge internal framework successfully, and we'll see something like:

[Screenshot: successful "stedgeai analyze" output]

However, if I return to the Python script above and modify pad_values to either -1 or 1, then we crash and burn! We'll see an error like:

NOT IMPLEMENTED: QuantConv2D (padded with 1) with formats {'out_0': (FLOAT, 32 bit, C Size: 32 bits), 'weights': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM), 'in_0': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM)} not supported

For the record, I think this error message (and the multiple other errors I've worked through) leaves a little to be desired. That's not the objective of this post, but just to leave some feedback: if you have a closed-source library, error messages are crucial to guide users to a working solution. I only realised pad_values broke the conv op through trial and error on a number of other parameters; in my opinion this would have been much clearer with better, more verbose error messages!

Why I Think There is a Problem

I've implemented a few binary convs by hand along the way. Compared to a naive conv2d implementation, typically we'd use XNOR and popcount to do the actual convolution efficiently (as well as leveraging GEMM via im2col to help improve performance).
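
To make that concrete, here is a minimal sketch (my own illustration, not ST's kernel) of the XNOR/popcount trick for a dot product of {-1, +1} vectors:

import numpy as np


def binary_dot(a: np.ndarray, b: np.ndarray) -> int:
    """Dot product of two {-1, +1} vectors via the XNOR/popcount identity."""
    n = a.size
    # Encode -1 -> 0 and +1 -> 1 so each value fits in a single bit.
    a_bits = a > 0
    b_bits = b > 0
    # XNOR counts the positions where the two signs agree
    # (count_nonzero stands in for a hardware popcount of ~(a ^ b)).
    agree = np.count_nonzero(a_bits == b_bits)
    # Each agreement contributes +1 to the dot product, each disagreement -1.
    return 2 * agree - n


rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
b = rng.choice([-1, 1], size=64)
assert binary_dot(a, b) == int(a @ b)  # matches the float dot product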

What I don't understand here is, in my example above, my Larq conv2d layer uses the sign operator to quantise, meaning we have {-1, 1} binary values rather than {0, 1}. If we add zero "same" padding, i.e. pad the edges with zero, then the signal being fed to the Keras operation is ternary {-1, 0, 1}. In Keras this is fine; training is done in float32, so this probably doesn't matter.

But when we get to an efficient implementation in C, how are the 0 values being represented? The convolution ideally would convert {-1, 1} to {0, 1} and then go on its merry way computing the conv with XNOR/popcount. But if we have ternary {-1, 0, 1} values, then I'm not sure what's going on in the kernel under the hood, nor whether it's optimal - read here for more on this topic.
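
Here's a tiny worked example of why zero padding breaks the encoding (again my own sketch): a padded position holds 0, which maps to neither bit value, so the identity dot = 2*popcount(XNOR) - n no longer holds:

import numpy as np

a = np.array([-1, 1, 1, 0, 0])  # last two entries are zero padding
w = np.array([1, 1, -1, 1, 1])
true_dot = int(a @ w)           # = -1 (the zeros contribute nothing)

# Forcing the padded zeros into a bit (here 0 -> "-1") changes the answer:
a_bits = a > 0
w_bits = w > 0
agree = np.count_nonzero(a_bits == w_bits)
xnor_dot = 2 * agree - a.size   # = -3, not -1

print(true_dot, xnor_dot)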

Confusing Docs Don't Help Either

I'll be honest, I've found a number of discrepancies in the documentation and GitHub examples (model zoo), as well as missing documentation, which led to difficulty in using the X-CUBE-AI framework.

I'm sharing this in case anyone at ST wants to improve these areas, or to show me where I've misunderstood things/got it wrong (highly possible!).

ST Larq Docs

https://wiki.st.com/stm32mcu/wiki/AI:Deep_Quantized_Neural_Network_support#Supported_Larq_layers

See the above link, which says of the Larq Conv2D layer:

for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'

Here I can't get pad_values of -1 or 1 to work with "stedgeai" at all. The documentation suggests the writer thought -1 or 1 is required if padding is "same" (which I've set), so I find this documentation incorrect.

ST Model Zoo Code

https://github.com/STMicroelectronics/stm32ai-modelzoo/blob/e5361e76f8427b0907b67d9815101d05c32e7407/image_classification/src/models/st_resnet_8_hybrid.py#L38

One of the things I did when trying to interpret how "stedgeai" deals with binary convs was to search for all uses of larq in the model zoo repo. I'm really confused how the code here suggests we can use pad_values=1, whereas I've demonstrated in my example that this causes an obscure error.

Question(s)

  1. So what's going on here?
  2. Why does the model zoo suggest that a pad value of 1 is okay, while "stedgeai" fails on it?
  3. What happens under the hood in the ST kernels when we pad with zero for a 1-bit convolution?

Thanks in advance, and I hope this hasn't come across as too negative about the framework; I'm just keen to get the most optimised neural networks out of it and could do with a hand!

3 REPLIES

As a follow-up to this, I've noticed that if I set "pad_values" to 0 and analyze the model, it gets converted to a float32 conv anyway, so (as far as I can tell) no optimised kernel is used at the end of the day!

I used the above Python script to generate the 1-layer 1-bit conv op, and I see the following graph:

[Screenshot: generated network graph]

And I also get the following report:

ST Edge AI Core v1.0.0-19899
Created date          : 2024-11-03 17:30:57
Parameters            : analyze --target stm32h7 --name network -m /Users/xxx/weight_layer/test.h5 --compression lossless --verbosity 1 --allocate-inputs --allocate-outputs --custom /Users/xxx/custom_layers/custom_layers.json --workspace /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940 --output /Users/xxx/.stm32cubemx/network_output

Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------------
model file         :   /Users/xxx/weight_layer/test.h5                                                              
type               :   keras                                                                                               
c_name             :   network                                                                                             
compression        :   lossless                                                                                            
options            :   allocate-inputs, allocate-outputs                                                                   
optimization       :   balanced                                                                                            
target/series      :   stm32h7                                                                                             
workspace dir      :   /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940   
output dir         :   /Users/xxx/.stm32cubemx/network_output                                                       
model_fmt          :   float                                                                                               
model_name         :   test                                                                                                
model_hash         :   0x697014b17b79c93775e307a402d3e471                                                                  
params #           :   144 items (576 B)                                                                                   
---------------------------------------------------------------------------------------------------------------------------
input 1/1          :   'input_1', int1(1x32x32x1), 4.00 KBytes, 1b-32bpacked, activations                                  
output 1/1         :   'quant_conv2d', f32(1x32x32x16), 64.00 KBytes, activations                                          
macc               :   149,520                                                                                             
weights (ro)       :   640 B (640 B) (1 segment) / +64(+11.1%) vs float model                                              
activations (rw)   :   69,668 B (68.04 KiB) (1 segment) *                                                                  
ram (total)        :   69,668 B (68.04 KiB) = 69,668 + 0 + 0                                                               
---------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer

Model name - test
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
m_id   layer (type,original)                         oshape                 param/size        macc        connected to   | c_size          c_macc               c_type           
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
0      input_1 (Input, InputLayer)                   [b:1,h:32,w:32,c:1]                                                 |                 +2,048(+100.0%)      Conversion_[0]   
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
1      quant_conv2d_conv (Conversion, QuantConv2D)   [b:1,h:32,w:32,c:1]                     2,048             input_1   | +640(+100.0%)   +145,424(+7100.8%)   Conv2D_[o][1]    
       quant_conv2d (Conv2D, QuantConv2D)            [b:1,h:32,w:32,c:16]   144/576        147,456   quant_conv2d_conv   | -576(-100.0%)   -147,456(-100.0%)    
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
model/c-model: macc=149,504/149,520 +16(+0.0%) weights=576/640 +64(+11.1%) activations=--/69,668 io=--/0



Generated C-graph summary
------------------------------------------------------------------------------------------------------------------------
model name            : test
c-name                : network
c-node #              : 2
c-array #             : 6
activations size      : 69668 (1 segment)
weights size          : 640 (1 segment)
macc                  : 149520
inputs                : ['input_1_output']
outputs               : ['quant_conv2d_output']

C-Arrays (6)
------ ----------------------------- ------------- ------------------------- ------------- --------- 
c_id   name (*_array)                item/size     domain/mem-pool           c-type        comment   
------ ----------------------------- ------------- ------------------------- ------------- --------- 
0      input_1_0_conversion_output   1024/4096     activations/**default**   float                   
1      input_1_output                1024/4096     activations/**default**   s1            /input    
2      quant_conv2d_bias             16/64         weights/weights           const float             
3      quant_conv2d_output           16384/65536   activations/**default**   float         /output   
4      quant_conv2d_scratch0         9/36          activations/**default**   float                   
5      quant_conv2d_weights          144/576       weights/weights           const float             
------ ----------------------------- ------------- ------------------------- ------------- --------- 

C-Layers (2)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
c_id   name (*_layer)         id   layer_type    macc     rom   tensors                          shape (array id)      
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
0      input_1_0_conversion   0    Conversion    2048     0     I: input_1_output                int1(1x32x32x1) (1)   
                                                                O: input_1_0_conversion_output   f32(1x32x32x1) (0)    
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
1      quant_conv2d           1    Conv2D        147472   640   I: input_1_0_conversion_output   f32(1x32x32x1) (0)    
                                                                S: quant_conv2d_scratch0                               
                                                                W: quant_conv2d_weights          f32(16x3x3x1) (5)     
                                                                W: quant_conv2d_bias             f32(16) (2)           
                                                                O: quant_conv2d_output           f32(1x32x32x16) (3)   
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 



Number of operations per c-layer
------- ------ ----------------------------------- --------- -------------- 
c_id    m_id   name (type)                               #op           type 
------- ------ ----------------------------------- --------- -------------- 
0       0      input_1_0_conversion (Conversion)       2,048    smul_s1_f32 
1       1      quant_conv2d (Conv2D)                 147,472   smul_f32_f32 
------- ------ ----------------------------------- --------- -------------- 
total                                                149,520 

Number of operation types
---------------- --------- ----------- 
operation type           #           % 
---------------- --------- ----------- 
smul_s1_f32          2,048        1.4% 
smul_f32_f32       147,472       98.6% 

Complexity report (model)
------ ------------------- ------------------------- ------------------------- ------ 
m_id   name                c_macc                    c_rom                     c_id   
------ ------------------- ------------------------- ------------------------- ------ 
0      input_1             |                  1.4%   |                  0.0%   [0]    
1      quant_conv2d_conv   ||||||||||||||||  98.6%   |||||||||||||||| 100.0%   [1]    
------ ------------------- ------------------------- ------------------------- ------ 
macc=149,520 weights=640 act=69,668 ram_io=0

In this report we see that the majority of ops are smul_f32_f32, which I interpret as the convolution being converted to float32. I've read the documentation here, which mentions this fallback to float32, but it doesn't indicate any way for us to know which parameters are causing the fallback (it would be great to have some more helpful error messages here in the framework).

Is there any warning/error/reason for this falling back to the float32 conv? Does anyone know of anywhere in the docs that states exactly what is, and what isn't, supported (in terms of combinations of parameters) to get this working?
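
In the meantime, here's a quick workaround sketch (my own script, not an official stedgeai feature) that scans the text report's per-layer op table and flags anything that fell back to float:

import re
import sys

FALLBACK_TYPES = {"smul_f32_f32"}  # op types I treat as "not binarised"


def flag_float_fallback(report_path: str) -> None:
    text = open(report_path).read()
    # Rows of the "Number of operations per c-layer" table look like:
    #   1   1   quant_conv2d (Conv2D)   147,472   smul_f32_f32
    row = re.compile(r"^\s*\d+\s+\d+\s+(\S+)\s+\([^)]+\)\s+([\d,]+)\s+(\S+)\s*$",
                     re.MULTILINE)
    for name, ops, op_type in row.findall(text):
        if op_type in FALLBACK_TYPES:
            print(f"float fallback: {name} ({ops} ops of type {op_type})")


if __name__ == "__main__":
    flag_float_fallback(sys.argv[1])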

Has anyone got a working example of Larq/1-bit convs on GitHub or some other place that they can kindly point me to?

Thanks in advance

I wondered if it might be silently failing because it's the first (and only) layer; this is briefly mentioned (not as a hard requirement) in the docs:

It is preferable to leave the first layer and the last layer in higher precision: 's8' or 'f32'

So I made a 2-layer 1-bit conv2d model. Initially the graph looked suspect, because it just shows Conv2D op types, compared to what I expected to see from the docs:

[Screenshot: 2-layer model graph]

But I do see some sxor_s1_s1 ops in the analysis report, so this may well be working fine!
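
For reference, I built the 2-layer model roughly like this (a reconstruction; it follows the same pattern as test.py above, with a second QuantConv2D appended):

import tensorflow as tf
import larq as lq


def binary_conv(out_channels: int = 16):
    return lq.layers.QuantConv2D(
        filters=out_channels,
        kernel_size=(3, 3),
        kernel_quantizer="ste_sign",
        input_quantizer="ste_sign",
        use_bias=False,
        padding="same",
        pad_values=0,
    )


x = tf.keras.Input(shape=(32, 32, 1))
y = binary_conv()(binary_conv()(x))  # two stacked 1-bit convs
tf.keras.Model(inputs=x, outputs=y).save("test.h5")

The analysis report for this model: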

ST Edge AI Core v1.0.0-19899
Created date          : 2024-11-03 18:55:44
Parameters            : analyze --target stm32h7 --name network -m /Users/xxx/weight_layer/test.h5 --compression lossless --verbosity 1 --allocate-inputs --allocate-outputs --custom /Users/xxx/custom_layers/custom_layers.json --workspace /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6138963726983337144002591594854705 --output /Users/xxx/.stm32cubemx/network_output

Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------------
model file         :   /Users/xxx/weight_layer/test.h5                                                              
type               :   keras                                                                                               
c_name             :   network                                                                                             
compression        :   lossless                                                                                            
options            :   allocate-inputs, allocate-outputs                                                                   
optimization       :   balanced                                                                                            
target/series      :   stm32h7                                                                                             
workspace dir      :   /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6138963726983337144002591594854705   
output dir         :   /Users/xxx/.stm32cubemx/network_output                                                       
model_fmt          :   dqnn                                                                                                
model_name         :   test                                                                                                
model_hash         :   0xca7b92ec5d5583ea6981154d98de8eca                                                                  
params #           :   2,448 items (9.56 KiB)                                                                              
---------------------------------------------------------------------------------------------------------------------------
input 1/1          :   'input_1', int1(1x32x32x1), 4.00 KBytes, 1b-32bpacked, activations                                  
output 1/1         :   'quant_conv2d_1', f32(1x32x32x16), 64.00 KBytes, activations                                        
macc               :   2,539,536                                                                                           
weights (ro)       :   9,920 B (9.69 KiB) (1 segment) / +128(+1.3%) vs float model                                         
activations (rw)   :   70,400 B (68.75 KiB) (1 segment) *                                                                  
ram (total)        :   70,400 B (68.75 KiB) = 70,400 + 0 + 0                                                               
---------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer

Model name - test
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- --------------------------- 
m_id   layer (type,original)                           oshape                 param/size           macc          connected to   | c_size            c_macc                 c_type                      
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- --------------------------- 
0      input_1 (Input, InputLayer)                     [b:1,h:32,w:32,c:1]                                                      |                                          
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- --------------------------- 
1      quant_conv2d_conv (Conversion, QuantConv2D)     [b:1,h:32,w:32,c:1]                        2,048               input_1   | +640(+100.0%)     +178,176(+8700.0%)     Conv2D_/Conversion_[0, 1]   
       quant_conv2d (Conv2D, QuantConv2D)              [b:1,h:32,w:32,c:16]   144/576           147,456     quant_conv2d_conv   | -576(-100.0%)     -147,456(-100.0%)      
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- --------------------------- 
2      quant_conv2d_1_conv (Conversion, QuantConv2D)   [b:1,h:32,w:32,c:16]                      32,768          quant_conv2d   | +9,280(+100.0%)   +2,326,544(+7100.0%)   Conv2D_[o][2]               
       quant_conv2d_1 (Conv2D, QuantConv2D)            [b:1,h:32,w:32,c:16]   2,304/9,216     2,359,296   quant_conv2d_1_conv   | -9,216(-100.0%)   -2,359,296(-100.0%)    
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- --------------------------- 
model/c-model: macc=2,541,568/2,539,536 -2,032(-0.1%) weights=9,792/9,920 +128(+1.3%) activations=--/70,400 io=--/0



Generated C-graph summary
------------------------------------------------------------------------------------------------------------------------
model name            : test
c-name                : network
c-node #              : 3
c-array #             : 10
activations size      : 70400 (1 segment)
weights size          : 9920 (1 segment)
macc                  : 2539536
inputs                : ['input_1_output']
outputs               : ['quant_conv2d_1_output']

C-Arrays (10)
------ ---------------------------------- ------------- ------------------------- ------------- --------- 
c_id   name (*_array)                     item/size     domain/mem-pool           c-type        comment   
------ ---------------------------------- ------------- ------------------------- ------------- --------- 
0      input_1_output                     1024/4096     activations/**default**   s1            /input    
1      quant_conv2d_0_conversion_output   16384/65536   activations/**default**   float                   
2      quant_conv2d_1_bias                16/64         weights/weights           const float             
3      quant_conv2d_1_output              16384/65536   activations/**default**   float         /output   
4      quant_conv2d_1_scratch0            144/576       activations/**default**   float                   
5      quant_conv2d_1_weights             2304/9216     weights/weights           const float             
6      quant_conv2d_output                16384/4096    activations/**default**   s1                      
7      quant_conv2d_scratch0              18/72         activations/**default**   float                   
8      quant_conv2d_threshold             16/64         weights/weights           const float             
9      quant_conv2d_weights               144/576       weights/weights           const s1                
------ ---------------------------------- ------------- ------------------------- ------------- --------- 

C-Layers (3)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ---------------------- 
c_id   name (*_layer)              id   layer_type    macc      rom    tensors                               shape (array id)       
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ---------------------- 
0      quant_conv2d                1    Conv2D        147456    640    I: input_1_output                     int1(1x32x32x1) (0)    
                                                                       S: quant_conv2d_scratch0                                     
                                                                       W: quant_conv2d_weights               int1(16x3x3x1) (9)     
                                                                       W: quant_conv2d_threshold             f32(1x1x16x1) (8)      
                                                                       O: quant_conv2d_output                int1(1x32x32x16) (6)   
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ---------------------- 
1      quant_conv2d_0_conversion   1    Conversion    32768     0      I: quant_conv2d_output                int1(1x32x32x16) (6)   
                                                                       O: quant_conv2d_0_conversion_output   f32(1x32x32x16) (1)    
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ---------------------- 
2      quant_conv2d_1              2    Conv2D        2359312   9280   I: quant_conv2d_0_conversion_output   f32(1x32x32x16) (1)    
                                                                       S: quant_conv2d_1_scratch0                                   
                                                                       W: quant_conv2d_1_weights             f32(16x3x3x16) (5)     
                                                                       W: quant_conv2d_1_bias                f32(16) (2)            
                                                                       O: quant_conv2d_1_output              f32(1x32x32x16) (3)    
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ---------------------- 



Number of operations per c-layer
------- ------ ---------------------------------------- ----------- -------------- 
c_id    m_id   name (type)                                      #op           type 
------- ------ ---------------------------------------- ----------- -------------- 
0       1      quant_conv2d (Conv2D)                        147,456     sxor_s1_s1 
1       1      quant_conv2d_0_conversion (Conversion)        32,768    smul_s1_f32 
2       2      quant_conv2d_1 (Conv2D)                    2,359,312   smul_f32_f32 
------- ------ ---------------------------------------- ----------- -------------- 
total                                                     2,539,536 

Number of operation types
---------------- ----------- ----------- 
operation type             #           % 
---------------- ----------- ----------- 
sxor_s1_s1           147,456        5.8% 
smul_s1_f32           32,768        1.3% 
smul_f32_f32       2,359,312       92.9% 

Complexity report (model)
------ --------------------- ------------------------- ------------------------- -------- 
m_id   name                  c_macc                    c_rom                     c_id     
------ --------------------- ------------------------- ------------------------- -------- 
1      quant_conv2d_conv     ||                 7.1%   ||                 6.5%   [0, 1]   
2      quant_conv2d_1_conv   ||||||||||||||||  92.9%   ||||||||||||||||  93.5%   [2]      
------ --------------------- ------------------------- ------------------------- -------- 
macc=2,539,536 weights=9,920 act=70,400 ram_io=0

However, with a more complicated example (based on a larger model, tricky to share here concisely), I'm finding that all of my 1-bit convs, used in a very similar way to the simple examples above, are falling back to float32 conv2d or failing with errors.

I'm not sure if there are clearly defined lists/tables of what works and what doesn't, or any ST-led workshops/calls where I might discuss this with an engineer, because I presume stedgeai is falling back to the float32 conv2d due to a parameter being set in an unsupported way that isn't visible to me!

Hello @tiny-incy-wincy-weeny-ml,

Sorry for the late answer...

Thank you very much for this detailed report. We are checking internally and will come back to you.

Have a good day,

Julian

