2024-11-02 05:25 PM - edited 2024-11-04 08:26 AM
Hi there, I'm not sure if it's better to leave a post here or on GitHub. If there is a preferred place, please let me know!
I'd appreciate any guidance on how to correctly configure the pad_values parameter for 1-bit conv operations within X-CUBE-AI.
I appreciate that those at ST may not be able to fully comment on the inner workings of ST's kernel implementation, but I'm concerned that 1-bit convolutions may be implemented sub-optimally.
The problem I'm working on involves generating a Binary Neural Network (BNN), and I've decided to trial the X-CUBE-AI framework for my company.
A terse description of my environment to replicate this follows (most of it should be irrelevant, but just in case):
I've got a very simple Python script called "test.py" (see below). It imports tensorflow and larq, and creates a very simple model with a single larq binary 2D convolution (binary weights and inputs), where we take the sign of the inputs and the weights (i.e. map them to the range [-1, 1]) to make a simple binary conv op. Note we use "same" padding, so there will typically be some padding at the edges.
import tensorflow as tf
import larq as lq


def build_model(pad_values: int, kernel_size: int = 3, stride: int = 1, out_channels: int = 16):
    x = tf.keras.Input(shape=(32, 32, 1))
    layer = lq.layers.QuantConv2D(
        filters=out_channels,
        kernel_size=(kernel_size, kernel_size),
        kernel_quantizer="ste_sign",
        input_quantizer="ste_sign",
        use_bias=False,
        strides=(stride, stride),
        padding="same",
        pad_values=pad_values,
        groups=1,
    )
    y = layer(x)
    return tf.keras.Model(inputs=x, outputs=y)


print(tf.__version__)
model = build_model(pad_values=0)  # <-- TRY VARYING THIS VALUE

# Save the model
model.save("test.h5")
At the end, the Python script saves the model to a file. I have a second script (bash this time) called "test.sh", shown below:
#!/bin/bash
set -e

# Ensure we always start from a new model
rm -f test.h5

# Create a model
python test.py

# Change this path if your install is at a different location
~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/9.1.0/Utilities/macarm/stedgeai \
    analyze \
    --model test.h5 \
    --target stm32 \
    --type keras
Note how it calls the python test.py script to build the model, and then passes the resulting "test.h5" model to the "stedgeai" CLI.
What I've noticed is that if "pad_values" is 0, then "stedgeai" works fine: we can transpile from Keras to the ST Edge internal framework successfully.
However, if I return to the Python script above and modify pad_values to either -1 or 1, then we crash and burn! We see an error like:
NOT IMPLEMENTED: QuantConv2D (padded with 1) with formats {'out_0': (FLOAT, 32 bit, C Size: 32 bits), 'weights': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM), 'in_0': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM)} not supported
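As an aside: if I decode the quantizer parameters from that message with the usual uniform-quantization formula, real = scale * (q - zero_point), and assume the signed 1-bit codes are {-1, 0} (the code range is my assumption), the values come out as exactly {-1.0, +1.0}, which at least matches larq's ste_sign:

scale, zero_point = 2.0, -0.5        # from the error message above
for q in (-1, 0):                    # assumed signed 1-bit code range
    print(scale * (q - zero_point))  # -1.0, then 1.0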
For the record, I think this error message (and the multiple other errors I've worked through) leaves a little to be desired. That's not the objective of this post, but as feedback: with a closed-source library, error messages are crucial for guiding users to a working solution. I only realised pad_values broke the conv op through trial and error on a number of other parameters; in my opinion, this would have been much clearer with better, more verbose error messages!
I've implemented a few binary convs by hand along the way. Compared to a naive conv2d implementation, we'd typically use XNOR and popcount to perform the actual convolution efficiently (as well as leveraging GEMM via im2col to improve performance).
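To make the XNOR/popcount idea concrete, here's a minimal NumPy sketch of the standard identity (my own illustration, not ST's kernel):

import numpy as np

def binary_dot(a_bits: np.ndarray, b_bits: np.ndarray) -> int:
    # {-1, +1} vectors encoded as bits via the mapping -1 -> 0, +1 -> 1.
    # Each product a_i * b_i is +1 exactly when the bits match (XNOR),
    # so: dot = (#matches) - (#mismatches) = N - 2 * popcount(a XOR b).
    mismatches = int(np.count_nonzero(a_bits ^ b_bits))  # popcount of XOR
    return a_bits.size - 2 * mismatches

# Sanity check against a plain {-1, +1} dot product
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
b = rng.integers(0, 2, size=64)
assert binary_dot(a, b) == int((2 * a - 1) @ (2 * b - 1))

A real kernel would pack 32 such bits per word and use a hardware popcount instruction, but the arithmetic is the same.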
What I don't understand is this: in my example above, my larq conv2d layer uses the sign operator to quantise, meaning we have binary [-1, 1] values rather than [0, 1]. If we add zero "same" padding, i.e. pad the edges with zero, then the signal fed to the Keras operation is ternary: [-1, 0, 1]. In Keras this is fine; training is done in float32, so it probably doesn't matter.
But when we get to an efficient implementation in C, how are the 0 values represented? Ideally the convolution would convert [-1, 1] to [0, 1] and then go on its merry way computing the conv with XNOR/popcount. But if we have ternary [-1, 0, 1] values, I'm not sure what's going on in the kernel under the hood, nor whether it's optimal - read here for more on this topic.
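Here's a toy example of why I'm worried (entirely my own speculation; I don't know what ST's kernel actually does with the pad positions):

import numpy as np

# Flattened 3x3 window at an image corner: Keras "same" padding with
# pad_values=0 injects zeros, which have no {-1, +1} (1-bit) encoding.
window_zero_pad = np.array([0, 0, 0, -1, 1, 1, -1, 1, 1])
kernel = np.array([1, -1, 1, 1, 1, -1, -1, 1, 1])  # {-1, +1} weights

print(window_zero_pad @ kernel)  # 2: the zero-padded (ternary) reference

# Padding with -1 keeps every value in {-1, +1}, so the XNOR/popcount
# trick applies cleanly, but the border results no longer match:
window_neg_pad = np.where(window_zero_pad == 0, -1, window_zero_pad)
print(window_neg_pad @ kernel)   # 1: a different value at this border pixel

So either the kernel handles the pad positions specially (at some cost), or the arithmetic changes at the borders; I'd love to know which.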
I'll be honest: I've found a number of discrepancies in the documentation and GitHub examples (model zoo), as well as missing documentation, which led to difficulty in using the X-CUBE-AI framework.
I'm sharing this in case anyone at ST wants to improve these areas, or show me where I've misunderstood things / got it wrong (highly possible!).
https://wiki.st.com/stm32mcu/wiki/AI:Deep_Quantized_Neural_Network_support#Supported_Larq_layers
See the above link; it says of the larq Conv2D:
for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'
Here I can't get pad_values of -1 or 1 to work with "stedgeai" at all. The documentation suggests its writer thought -1 or 1 is required when padding is "same" (which I've set), so I find this documentation incorrect.
One of the things I did when trying to interpret how "stedgeai" deals with binary convs was to search for all uses of larq in the model zoo repo. I'm really confused how this code here suggests we can use pad_values=1, whereas I've demonstrated in my example that this causes an obscure error.
Thanks in advance, and I hope this hasn't come across as too negative about the framework - I'm just keen to test out the most optimised neural networks with it, and could do with a hand!
2024-11-03 08:39 AM - edited 2024-11-03 08:49 AM
As a follow-up to this, I've noticed that if I set pad_values to 0 and analyze the model, it gets converted to a float32 conv anyway, so (as far as I can tell) no optimised kernel is being used at the end of the day!
I used the above Python script to generate the 1-layer 1-bit conv op, analyzed it, and got the following report:
ST Edge AI Core v1.0.0-19899
Created date : 2024-11-03 17:30:57
Parameters : analyze --target stm32h7 --name network -m /Users/xxx/weight_layer/test.h5 --compression lossless --verbosity 1 --allocate-inputs --allocate-outputs --custom /Users/xxx/custom_layers/custom_layers.json --workspace /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940 --output /Users/xxx/.stm32cubemx/network_output
Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------------
model file : /Users/xxx/weight_layer/test.h5
type : keras
c_name : network
compression : lossless
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32h7
workspace dir : /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940
output dir : /Users/xxx/.stm32cubemx/network_output
model_fmt : float
model_name : test
model_hash : 0x697014b17b79c93775e307a402d3e471
params # : 144 items (576 B)
---------------------------------------------------------------------------------------------------------------------------
input 1/1 : 'input_1', int1(1x32x32x1), 4.00 KBytes, 1b-32bpacked, activations
output 1/1 : 'quant_conv2d', f32(1x32x32x16), 64.00 KBytes, activations
macc : 149,520
weights (ro) : 640 B (640 B) (1 segment) / +64(+11.1%) vs float model
activations (rw) : 69,668 B (68.04 KiB) (1 segment) *
ram (total) : 69,668 B (68.04 KiB) = 69,668 + 0 + 0
---------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer
Model name - test
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ----------------
m_id layer (type,original) oshape param/size macc connected to | c_size c_macc c_type
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ----------------
0 input_1 (Input, InputLayer) [b:1,h:32,w:32,c:1] | +2,048(+100.0%) Conversion_[0]
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ----------------
1 quant_conv2d_conv (Conversion, QuantConv2D) [b:1,h:32,w:32,c:1] 2,048 input_1 | +640(+100.0%) +145,424(+7100.8%) Conv2D_[o][1]
quant_conv2d (Conv2D, QuantConv2D) [b:1,h:32,w:32,c:16] 144/576 147,456 quant_conv2d_conv | -576(-100.0%) -147,456(-100.0%)
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ----------------
model/c-model: macc=149,504/149,520 +16(+0.0%) weights=576/640 +64(+11.1%) activations=--/69,668 io=--/0
Generated C-graph summary
------------------------------------------------------------------------------------------------------------------------
model name : test
c-name : network
c-node # : 2
c-array # : 6
activations size : 69668 (1 segment)
weights size : 640 (1 segment)
macc : 149520
inputs : ['input_1_output']
outputs : ['quant_conv2d_output']
C-Arrays (6)
------ ----------------------------- ------------- ------------------------- ------------- ---------
c_id name (*_array) item/size domain/mem-pool c-type comment
------ ----------------------------- ------------- ------------------------- ------------- ---------
0 input_1_0_conversion_output 1024/4096 activations/**default** float
1 input_1_output 1024/4096 activations/**default** s1 /input
2 quant_conv2d_bias 16/64 weights/weights const float
3 quant_conv2d_output 16384/65536 activations/**default** float /output
4 quant_conv2d_scratch0 9/36 activations/**default** float
5 quant_conv2d_weights 144/576 weights/weights const float
------ ----------------------------- ------------- ------------------------- ------------- ---------
C-Layers (2)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- ---------------------
c_id name (*_layer) id layer_type macc rom tensors shape (array id)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- ---------------------
0 input_1_0_conversion 0 Conversion 2048 0 I: input_1_output int1(1x32x32x1) (1)
O: input_1_0_conversion_output f32(1x32x32x1) (0)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- ---------------------
1 quant_conv2d 1 Conv2D 147472 640 I: input_1_0_conversion_output f32(1x32x32x1) (0)
S: quant_conv2d_scratch0
W: quant_conv2d_weights f32(16x3x3x1) (5)
W: quant_conv2d_bias f32(16) (2)
O: quant_conv2d_output f32(1x32x32x16) (3)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- ---------------------
Number of operations per c-layer
------- ------ ----------------------------------- --------- --------------
c_id m_id name (type) #op type
------- ------ ----------------------------------- --------- --------------
0 0 input_1_0_conversion (Conversion) 2,048 smul_s1_f32
1 1 quant_conv2d (Conv2D) 147,472 smul_f32_f32
------- ------ ----------------------------------- --------- --------------
total 149,520
Number of operation types
---------------- --------- -----------
operation type # %
---------------- --------- -----------
smul_s1_f32 2,048 1.4%
smul_f32_f32 147,472 98.6%
Complexity report (model)
------ ------------------- ------------------------- ------------------------- ------
m_id name c_macc c_rom c_id
------ ------------------- ------------------------- ------------------------- ------
0 input_1 | 1.4% | 0.0% [0]
1 quant_conv2d_conv |||||||||||||||| 98.6% |||||||||||||||| 100.0% [1]
------ ------------------- ------------------------- ------------------------- ------
macc=149,520 weights=640 act=69,668 ram_io=0
In this report we see the majority of ops being smul_f32_f32, which I interpret as the conv being converted to a float32 convolution. I've read the documentation here, which mentions this fallback to float32, but it doesn't indicate any way for us to know which egregious parameters are causing the fallback (more helpful error messages in the framework would be great here).
Is there any warning/error/reason given for this fallback to the float32 conv? Does anyone know of anywhere in the docs that states exactly what is, and what isn't, supported (in terms of combinations of parameters) to get this working?
Has anyone got a working example of larq/1-bit convs, on GitHub or elsewhere, that they can kindly point me to?
Thanks in advance
2024-11-03 10:09 AM
I wondered if perhaps, because it's the first (and only) layer, it might be silently failing on that; this is briefly mentioned (though not as a hard requirement) in the docs:
It is preferable to leave the first layer and the last layer in higher precision: 's8' or 'f32'
So I made a 2-layer 1-bit conv2d model. Initially the graph did look suspect, because it only shows Conv2D-type ops, compared to what I expected to see from the docs.
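For reference, the 2-layer model was built roughly like this (a sketch reconstructed from the 1-layer script above; the exact hyperparameters are assumptions, but both layers are QuantConv2D with pad_values=0):

import tensorflow as tf
import larq as lq

def build_model_2layer(out_channels: int = 16):
    x = tf.keras.Input(shape=(32, 32, 1))
    common = dict(
        kernel_size=(3, 3),
        kernel_quantizer="ste_sign",
        input_quantizer="ste_sign",
        use_bias=False,
        strides=(1, 1),
        padding="same",
        pad_values=0,
    )
    y = lq.layers.QuantConv2D(filters=out_channels, **common)(x)
    y = lq.layers.QuantConv2D(filters=out_channels, **common)(y)
    return tf.keras.Model(inputs=x, outputs=y)

build_model_2layer().save("test.h5")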
But I do see some sxor_s1_s1 ops in the analysis report, so this may well be working fine!
ST Edge AI Core v1.0.0-19899
Created date : 2024-11-03 18:55:44
Parameters : analyze --target stm32h7 --name network -m /Users/xxx/weight_layer/test.h5 --compression lossless --verbosity 1 --allocate-inputs --allocate-outputs --custom /Users/xxx/custom_layers/custom_layers.json --workspace /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6138963726983337144002591594854705 --output /Users/xxx/.stm32cubemx/network_output
Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------------
model file : /Users/xxx/weight_layer/test.h5
type : keras
c_name : network
compression : lossless
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32h7
workspace dir : /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6138963726983337144002591594854705
output dir : /Users/xxx/.stm32cubemx/network_output
model_fmt : dqnn
model_name : test
model_hash : 0xca7b92ec5d5583ea6981154d98de8eca
params # : 2,448 items (9.56 KiB)
---------------------------------------------------------------------------------------------------------------------------
input 1/1 : 'input_1', int1(1x32x32x1), 4.00 KBytes, 1b-32bpacked, activations
output 1/1 : 'quant_conv2d_1', f32(1x32x32x16), 64.00 KBytes, activations
macc : 2,539,536
weights (ro) : 9,920 B (9.69 KiB) (1 segment) / +128(+1.3%) vs float model
activations (rw) : 70,400 B (68.75 KiB) (1 segment) *
ram (total) : 70,400 B (68.75 KiB) = 70,400 + 0 + 0
---------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer
Model name - test
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- ---------------------------
m_id layer (type,original) oshape param/size macc connected to | c_size c_macc c_type
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- ---------------------------
0 input_1 (Input, InputLayer) [b:1,h:32,w:32,c:1] |
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- ---------------------------
1 quant_conv2d_conv (Conversion, QuantConv2D) [b:1,h:32,w:32,c:1] 2,048 input_1 | +640(+100.0%) +178,176(+8700.0%) Conv2D_/Conversion_[0, 1]
quant_conv2d (Conv2D, QuantConv2D) [b:1,h:32,w:32,c:16] 144/576 147,456 quant_conv2d_conv | -576(-100.0%) -147,456(-100.0%)
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- ---------------------------
2 quant_conv2d_1_conv (Conversion, QuantConv2D) [b:1,h:32,w:32,c:16] 32,768 quant_conv2d | +9,280(+100.0%) +2,326,544(+7100.0%) Conv2D_[o][2]
quant_conv2d_1 (Conv2D, QuantConv2D) [b:1,h:32,w:32,c:16] 2,304/9,216 2,359,296 quant_conv2d_1_conv | -9,216(-100.0%) -2,359,296(-100.0%)
------ ----------------------------------------------- ---------------------- ------------- ----------- --------------------- --- ----------------- ---------------------- ---------------------------
model/c-model: macc=2,541,568/2,539,536 -2,032(-0.1%) weights=9,792/9,920 +128(+1.3%) activations=--/70,400 io=--/0
Generated C-graph summary
------------------------------------------------------------------------------------------------------------------------
model name : test
c-name : network
c-node # : 3
c-array # : 10
activations size : 70400 (1 segment)
weights size : 9920 (1 segment)
macc : 2539536
inputs : ['input_1_output']
outputs : ['quant_conv2d_1_output']
C-Arrays (10)
------ ---------------------------------- ------------- ------------------------- ------------- ---------
c_id name (*_array) item/size domain/mem-pool c-type comment
------ ---------------------------------- ------------- ------------------------- ------------- ---------
0 input_1_output 1024/4096 activations/**default** s1 /input
1 quant_conv2d_0_conversion_output 16384/65536 activations/**default** float
2 quant_conv2d_1_bias 16/64 weights/weights const float
3 quant_conv2d_1_output 16384/65536 activations/**default** float /output
4 quant_conv2d_1_scratch0 144/576 activations/**default** float
5 quant_conv2d_1_weights 2304/9216 weights/weights const float
6 quant_conv2d_output 16384/4096 activations/**default** s1
7 quant_conv2d_scratch0 18/72 activations/**default** float
8 quant_conv2d_threshold 16/64 weights/weights const float
9 quant_conv2d_weights 144/576 weights/weights const s1
------ ---------------------------------- ------------- ------------------------- ------------- ---------
C-Layers (3)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ----------------------
c_id name (*_layer) id layer_type macc rom tensors shape (array id)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ----------------------
0 quant_conv2d 1 Conv2D 147456 640 I: input_1_output int1(1x32x32x1) (0)
S: quant_conv2d_scratch0
W: quant_conv2d_weights int1(16x3x3x1) (9)
W: quant_conv2d_threshold f32(1x1x16x1) (8)
O: quant_conv2d_output int1(1x32x32x16) (6)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ----------------------
1 quant_conv2d_0_conversion 1 Conversion 32768 0 I: quant_conv2d_output int1(1x32x32x16) (6)
O: quant_conv2d_0_conversion_output f32(1x32x32x16) (1)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ----------------------
2 quant_conv2d_1 2 Conv2D 2359312 9280 I: quant_conv2d_0_conversion_output f32(1x32x32x16) (1)
S: quant_conv2d_1_scratch0
W: quant_conv2d_1_weights f32(16x3x3x16) (5)
W: quant_conv2d_1_bias f32(16) (2)
O: quant_conv2d_1_output f32(1x32x32x16) (3)
------ --------------------------- ---- ------------- --------- ------ ------------------------------------- ----------------------
Number of operations per c-layer
------- ------ ---------------------------------------- ----------- --------------
c_id m_id name (type) #op type
------- ------ ---------------------------------------- ----------- --------------
0 1 quant_conv2d (Conv2D) 147,456 sxor_s1_s1
1 1 quant_conv2d_0_conversion (Conversion) 32,768 smul_s1_f32
2 2 quant_conv2d_1 (Conv2D) 2,359,312 smul_f32_f32
------- ------ ---------------------------------------- ----------- --------------
total 2,539,536
Number of operation types
---------------- ----------- -----------
operation type # %
---------------- ----------- -----------
sxor_s1_s1 147,456 5.8%
smul_s1_f32 32,768 1.3%
smul_f32_f32 2,359,312 92.9%
Complexity report (model)
------ --------------------- ------------------------- ------------------------- --------
m_id name c_macc c_rom c_id
------ --------------------- ------------------------- ------------------------- --------
1 quant_conv2d_conv || 7.1% || 6.5% [0, 1]
2 quant_conv2d_1_conv |||||||||||||||| 92.9% |||||||||||||||| 93.5% [2]
------ --------------------- ------------------------- ------------------------- --------
macc=2,539,536 weights=9,920 act=70,400 ram_io=0
However, with a more complicated example (based on a larger model that is tricky to share here concisely), I'm finding that all of my 1-bit convs, used in a very similar way to the simple examples above, are either falling back to float32 conv2d or failing with errors.
I'm not sure if there are clearly defined lists/tables of what works and what doesn't, or if there are any ST-led workshops/calls where I might discuss this with an engineer, because I presume stedgeai is falling back to the float32 conv2d when some parameter isn't set in a supported way - and that reason isn't visible to me!
2024-11-12 06:39 AM
Hello @tiny-incy-wincy-weeny-ml ,
Sorry for the late answer...
Thank you very much for this detailed report. We are checking internally and will get back to you.
Have a good day,
Julian