Difference in inference time between ST Dev Cloud and edge device (STM32N6570-DK)

Hi,

I’m benchmarking and deploying a 33 MB half-INT8 model. The inference time measured in ST Dev Cloud was about 264 ms. However, after compiling the model locally with STEdgeAI and deploying it to the board, the inference time was 405 ms.

Additionally, the artifacts generated with the “Generate Code” option in ST Dev Cloud showed the same 405 ms inference time when deployed.
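
To put a number on the gap, here is a quick sketch that only does arithmetic on the two latencies quoted above. The variable names are mine, and nothing here comes from the tool output.

```python
# Latencies quoted in this post, in milliseconds.
cloud_ms = 264.0   # ST Dev Cloud benchmark (approximate)
device_ms = 405.0  # on the board: local build and "Generate Code" artifacts

delta_ms = device_ms - cloud_ms   # absolute gap
slowdown = device_ms / cloud_ms   # relative gap

print(f"delta    : {delta_ms:.0f} ms")
print(f"slowdown : {slowdown:.2f}x ({(slowdown - 1) * 100:.0f}% longer)")
# delta    : 141 ms
# slowdown : 1.53x (53% longer)
```

So the deployed model takes roughly 1.5 times as long as the cloud benchmark reported.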

I have verified that the weight and activation memory allocations match those used in ST Dev Cloud (see the logs below).

Could you please confirm whether this difference in inference time is expected, or if there is a known delta that should be addressed?

Logs:

ST Dev Cloud:

>>> stedgeai analyze --model arcface_halfint8.tflite --st-neural-art custom@/tmp/stm32ai_service/7369b431-60e1-4924-9783-5f76cbd6b229/profile-27ac5fbe-1304-4c95-9d4e-44268ad9580f.json --target stm32n6 --optimize.export_hybrid True --name network --workspace workspace --output output
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile custom are ignored: memory_desc
	> memory_desc is not a valid key anymore, use machine_desc instead
 >>>> EXECUTING NEURAL ART COMPILER
   atonn -i "/tmp/stm32ai_service/7369b431-60e1-4924-9783-5f76cbd6b229/output/arcface_halfint8_OE_3_3_0.onnx" --json-quant-file "/tmp/stm32ai_service/7369b431-60e1-4924-9783-5f76cbd6b229/output/arcface_halfint8_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "/app/stm32ai/Utilities/configs/stm32n6.mdesc" --load-mpool "/app/stm32ai/Utilities/linux/targets/stm32/resources/mpools/stm32n6.mpool" --save-mpool-file "/tmp/stm32ai_service/7369b431-60e1-4924-9783-5f76cbd6b229/workspace/neural_art__network/stm32n6.mpool" --out-dir-prefix "/tmp/stm32ai_service/7369b431-60e1-4924-9783-5f76cbd6b229/workspace/neural_art__network/" 
   --native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os --Oauto-sched --optimization 3 --enable-virtual-mem-pools --Oshuffle-dma --Ocache-opt --cache-maintenance --Oauto-sched --Omax-ca-pipe 4 --output-info-file "c_info.json"
 <<<< DONE EXECUTING NEURAL ART COMPILER
 Exec/report summary (analyze)
 -----------------------------------------------------------------------------------------------------------
 model file         :   arcface_halfint8.tflite   
 type               :   tflite                                                                              
 c_name             :   network                                                                             
 options            :   allocate-inputs, allocate-outputs                                                   
 optimization       :   balanced                                                                            
 target/series      :   stm32n6npu                                                                          
 workspace dir      :   workspace                 
 output dir         :   output                    
 model_fmt          :   ss/sa per tensor                                                                    
 model_name         :   arcface_halfint8                                                                    
 model_hash         :   0xce8ba0b48ca91dc6b30f9b3d2ba14615                                                  
 params #           :   34,129,728 items (32.57 MiB)                                                        
 -----------------------------------------------------------------------------------------------------------
 input 1/1          :   'Input_34_out_0', f32(1x112x112x3), 147.00 KBytes, activations                      
 output 1/1         :   'Dequantize_352_out_0', f32(1x512), 2.00 KBytes, activations                        
 macc               :   0                                                                                   
 weights (ro)       :   34,217,857 B (32.63 MiB) (1 segment) / -102,301,055(-74.9%) vs float model          
 activations (rw)   :   5,577,856 B (5.32 MiB) (5 segments) *                                               
 ram (total)        :   5,577,856 B (5.32 MiB) = 5,577,856 + 0 + 0                                          
 -----------------------------------------------------------------------------------------------------------
 (*) 'input'/'output' buffers are allocated in the activations buffer
Computing AI RT data/code size (target=stm32n6npu)..
Compilation details
   ---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments:  -i arcface_halfint8_OE_3_3_0.onnx --json-quant-file arcface_halfint8_OE_3_3_0_Q.json -g network.c --load-mdesc stm32n6.mdesc --load-mpool stm32n6.mpool --save-mpool-file stm32n6.mpool --out-dir-prefix neural_art__network/ --native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os --Oauto-sched --optimization 3 --enable-virtual-mem-pools --Oshuffle-dma --Ocache-opt --cache-maintenance --Oauto-sched --Omax-ca-pipe 4 --output-info-file c_info.json
====================================================================================
Memory usage information  (input/output buffers are included in activations)
   ---------------------------------------------------------------------------------
	flexMEM    [0x34000000 - 0x34000000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	cpuRAM1    [0x34064000 - 0x34064000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	cpuRAM2    [0x34100000 - 0x34200000]:      1.000 MB /      1.000 MB  (100.00 % used) -- weights:          0  B (  0.00 % used)  activations:      1.000 MB (100.00 % used)
	npuRAM3    [0x34200000 - 0x34270000]:    448.000 kB /    448.000 kB  (100.00 % used) -- weights:          0  B (  0.00 % used)  activations:    448.000 kB (100.00 % used)
	npuRAM4    [0x34270000 - 0x342E0000]:    392.000 kB /    448.000 kB  ( 87.50 % used) -- weights:          0  B (  0.00 % used)  activations:    392.000 kB ( 87.50 % used)
	npuRAM5    [0x342E0000 - 0x34350000]:    447.125 kB /    448.000 kB  ( 99.80 % used) -- weights:          0  B (  0.00 % used)  activations:    447.125 kB ( 99.80 % used)
	npuRAM6    [0x34350000 - 0x343C0000]:          0  B /    448.000 kB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
	octoFlash  [0x71000000 - 0x78000000]:     32.633 MB /    112.000 MB  ( 29.14 % used) -- weights:     32.633 MB ( 29.14 % used)  activations:          0  B (  0.00 % used)
	hyperRAM   [0x90000000 - 0x92000000]:      3.062 MB /     32.000 MB  (  9.57 % used) -- weights:          0  B (  0.00 % used)  activations:      3.062 MB (  9.57 % used)
Total:                                            37.952 MB                                  -- weights:     32.633 MB                  activations:      5.319 MB
====================================================================================
Used memory ranges
   ---------------------------------------------------------------------------------
	cpuRAM2    [0x34100000 - 0x34200000]: 0x34100000-0x34200000
	npuRAM3    [0x34200000 - 0x34270000]: 0x34200000-0x34270000
	npuRAM4    [0x34270000 - 0x342E0000]: 0x34270000-0x342D2000
	npuRAM5    [0x342E0000 - 0x34350000]: 0x342E0000-0x3434FC80
	octoFlash  [0x71000000 - 0x78000000]: 0x71000000-0x730A1F90
	hyperRAM   [0x90000000 - 0x92000000]: 0x90000000-0x90310000
====================================================================================
Epochs details
   ---------------------------------------------------------------------------------
Total number of epochs: 147 of which 2 implemented in software
epoch ID   HW/SW/EC Operation (SW only)
epoch 1       HW
epoch 2      -SW-   (   QuantizeLinear   )
epoch 3       HW
epoch 4       HW
epoch 5       HW
epoch 6       HW
epoch 7       HW
epoch 8       HW
epoch 9       HW
epoch 10      HW
epoch 11      HW
epoch 12      HW
epoch 13      HW
epoch 14      HW
epoch 15      HW
epoch 16      HW
epoch 17      HW
epoch 18      HW
epoch 19      HW
epoch 20      HW
epoch 21      HW
epoch 22      HW
epoch 23      HW
epoch 24      HW
epoch 25      HW
epoch 26      HW
epoch 27      HW
epoch 28      HW
epoch 29      HW
epoch 30      HW
epoch 31      HW
epoch 32      HW
epoch 33      HW
epoch 34      HW
epoch 35      HW
epoch 36      HW
epoch 37      HW
epoch 38      HW
epoch 39      HW
epoch 40      HW
epoch 41      HW
epoch 42      HW
epoch 43      HW
epoch 44      HW
epoch 45      HW
epoch 46      HW
epoch 47      HW
epoch 48      HW
epoch 49      HW
epoch 50      HW
epoch 51      HW
epoch 52      HW
epoch 53      HW
epoch 54      HW
epoch 55      HW
epoch 56      HW
epoch 57      HW
epoch 58      HW
epoch 59      HW
epoch 60      HW
epoch 61      HW
epoch 62      HW
epoch 63      HW
epoch 64      HW
epoch 65      HW
epoch 66      HW
epoch 67      HW
epoch 68      HW
epoch 69      HW
epoch 70      HW
epoch 71      HW
epoch 72      HW
epoch 73      HW
epoch 74      HW
epoch 75      HW
epoch 76      HW
epoch 77      HW
epoch 78      HW
epoch 79      HW
epoch 80      HW
epoch 81      HW
epoch 82      HW
epoch 83      HW
epoch 84      HW
epoch 85      HW
epoch 86      HW
epoch 87      HW
epoch 88      HW
epoch 89      HW
epoch 90      HW
epoch 91      HW
epoch 92      HW
epoch 93      HW
epoch 94      HW
epoch 95      HW
epoch 96      HW
epoch 97      HW
epoch 98      HW
epoch 99      HW
epoch 100     HW
epoch 101     HW
epoch 102     HW
epoch 103     HW
epoch 104     HW
epoch 105     HW
epoch 106     HW
epoch 107     HW
epoch 108     HW
epoch 109     HW
epoch 110     HW
epoch 111     HW
epoch 112     HW
epoch 113     HW
epoch 114     HW
epoch 115     HW
epoch 116     HW
epoch 117     HW
epoch 118     HW
epoch 119     HW
epoch 120     HW
epoch 121     HW
epoch 122     HW
epoch 123     HW
epoch 124     HW
epoch 125     HW
epoch 126     HW
epoch 127     HW
epoch 128     HW
epoch 129     HW
epoch 130     HW
epoch 131     HW
epoch 132     HW
epoch 133     HW
epoch 134     HW
epoch 135     HW
epoch 136     HW
epoch 137     HW
epoch 138     HW
epoch 139     HW
epoch 140     HW
epoch 141     HW
epoch 142     HW
epoch 143     HW
epoch 144     HW
epoch 145     HW
epoch 146     HW
epoch 147    -SW-   (  DequantizeLinear  )
====================================================================================
 Requested memory size by section - "stm32n6npu" target
 ------------------------------- -------- ------------ ------ -----------
 module                              text       rodata   data         bss
 ------------------------------- -------- ------------ ------ -----------
 network.o                         22,332      183,633      0           0
 NetworkRuntime1020_CM55_GCC.a      3,068            0      0           0
 ll_aton_reloc_network.o                0            0      0           0
 lib (toolchain)*                     896          624      0           0
 ll atonn runtime                   6,990        2,244      0          29
 ------------------------------- -------- ------------ ------ -----------
 RT total**                        33,286      186,501      0          29
 ------------------------------- -------- ------------ ------ -----------
 weights                                0   34,217,857      0           0
 activations                            0            0      0   5,577,856
 io                                     0            0      0           0
 ------------------------------- -------- ------------ ------ -----------
 TOTAL                             33,286   34,404,358      0   5,577,885
 ------------------------------- -------- ------------ ------ -----------
 *  toolchain objects (libm/libgcc*)
 ** RT AI runtime objects (kernels+infrastructure)
  Summary - "stm32n6npu" target
  --------------------------------------------------
               FLASH (ro)     %*    RAM (rw)      %
  --------------------------------------------------
  RT total        219,787   0.6%          29   0.0%
  --------------------------------------------------
  TOTAL        34,437,644          5,577,885
  --------------------------------------------------
  *  rt/total
Creating txt report file network_analyze_report.txt
elapsed time (analyze): 21.367s

 

Local STEdgeAI compilation:

C:\ST\STEdgeAI\2.2\Utilities\windows>stedgeai generate --model ./arcface_halfint8.tflite --target stm32n6 --optimize.export_hybrid True --st-neural-art default@user_neuralart.json
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile default are ignored: memory_desc
        > memory_desc is not a valid key anymore, use machine_desc instead
 >>>> EXECUTING NEURAL ART COMPILER
   C:/ST/STEdgeAI/2.2/Utilities/windows/atonn.exe -i "C:/ST/STEdgeAI/2.2/Utilities/windows/st_ai_output/arcface_halfint8_OE_3_3_0.onnx" --json-quant-file "C:/ST/STEdgeAI/2.2/Utilities/windows/st_ai_output/arcface_halfint8_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "C:/ST/STEdgeAI/2.2/Utilities/configs/stm32n6.mdesc" --load-mpool "C:/ST/STEdgeAI/2.2/Utilities/windows/targets/stm32/resources/mpools/stm32n6.mpool" --save-mpool-file "C:/ST/STEdgeAI/2.2/Utilities/windows/st_ai_ws/neural_art__network/stm32n6.mpool" --out-dir-prefix "C:/ST/STEdgeAI/2.2/Utilities/windows/st_ai_ws/neural_art__network/" --native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os --Oauto-sched --optimization 3 --enable-virtual-mem-pools --Oshuffle-dma --Ocache-opt --cache-maintenance --Oauto-sched --Omax-ca-pipe 4 --output-info-file "c_info.json"
 <<<< DONE EXECUTING NEURAL ART COMPILER

 Exec/report summary (generate)
 ----------------------------------------------------------------------------------------------------
 model file         :   C:\ST\STEdgeAI\2.2\Utilities\windows\arcface_halfint8.tflite
 type               :   tflite
 c_name             :   network
 options            :   allocate-inputs, allocate-outputs
 optimization       :   balanced
 target/series      :   stm32n6npu
 workspace dir      :   C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_ws
 output dir         :   C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output
 model_fmt          :   ss/sa per tensor
 model_name         :   arcface_halfint8
 model_hash         :   0xce8ba0b48ca91dc6b30f9b3d2ba14615
 params #           :   34,129,728 items (32.57 MiB)
 ----------------------------------------------------------------------------------------------------
 input 1/1          :   'Input_34_out_0', f32(1x112x112x3), 147.00 KBytes, activations
 output 1/1         :   'Dequantize_352_out_0', f32(1x512), 2.00 KBytes, activations
 macc               :   0
 weights (ro)       :   34,217,857 B (32.63 MiB) (1 segment) / -102,301,055(-74.9%) vs float model
 activations (rw)   :   5,577,856 B (5.32 MiB) (5 segments) *
 ram (total)        :   5,577,856 B (5.32 MiB) = 5,577,856 + 0 + 0
 ----------------------------------------------------------------------------------------------------
 (*) 'input'/'output' buffers are allocated in the activations buffer

Computing AI RT data/code size (target=stm32n6npu)..
 -> compiler "gcc:arm-none-eabi-gcc" is not in the PATH

Compilation details
   ---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments:  -i C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\arcface_halfint8_OE_3_3_0.onnx --json-quant-file C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\arcface_halfint8_OE_3_3_0_Q.json -g network.c --load-mdesc C:\ST\STEdgeAI\2.2\Utilities\configs\stm32n6.mdesc --load-mpool C:\ST\STEdgeAI\2.2\Utilities\windows\targets\stm32\resources\mpools\stm32n6.mpool --save-mpool-file C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_ws\neural_art__network\stm32n6.mpool --out-dir-prefix C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_ws\neural_art__network/ --native-float --mvei --cache-maintenance --Ocache-opt --enable-virtual-mem-pools --Os --Oauto-sched --optimization 3 --enable-virtual-mem-pools --Oshuffle-dma --Ocache-opt --cache-maintenance --Oauto-sched --Omax-ca-pipe 4 --output-info-file c_info.json
====================================================================================
Memory usage information  (input/output buffers are included in activations)
   ---------------------------------------------------------------------------------
        flexMEM    [0x34000000 - 0x34000000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
        cpuRAM1    [0x34064000 - 0x34064000]:          0  B /          0  B  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
        cpuRAM2    [0x34100000 - 0x34200000]:      1.000 MB /      1.000 MB  (100.00 % used) -- weights:          0  B (  0.00 % used)  activations:      1.000 MB (100.00 % used)
        npuRAM3    [0x34200000 - 0x34270000]:    448.000 kB /    448.000 kB  (100.00 % used) -- weights:          0  B (  0.00 % used)  activations:    448.000 kB (100.00 % used)
        npuRAM4    [0x34270000 - 0x342E0000]:    392.000 kB /    448.000 kB  ( 87.50 % used) -- weights:          0  B (  0.00 % used)  activations:    392.000 kB ( 87.50 % used)
        npuRAM5    [0x342E0000 - 0x34350000]:    447.125 kB /    448.000 kB  ( 99.80 % used) -- weights:          0  B (  0.00 % used)  activations:    447.125 kB ( 99.80 % used)
        npuRAM6    [0x34350000 - 0x343C0000]:          0  B /    448.000 kB  (  0.00 % used) -- weights:          0  B (  0.00 % used)  activations:          0  B (  0.00 % used)
        octoFlash  [0x70580000 - 0x72780000]:     32.633 MB /     34.000 MB  ( 95.98 % used) -- weights:     32.633 MB ( 95.98 % used)  activations:          0  B (  0.00 % used)
        hyperRAM   [0x90000000 - 0x91000000]:      3.062 MB /     16.000 MB  ( 19.14 % used) -- weights:          0  B (  0.00 % used)  activations:      3.062 MB ( 19.14 % used)

Total:                                            37.952 MB                                  -- weights:     32.633 MB                  activations:      5.319 MB
====================================================================================
Used memory ranges
   ---------------------------------------------------------------------------------
        cpuRAM2    [0x34100000 - 0x34200000]: 0x34100000-0x34200000
        npuRAM3    [0x34200000 - 0x34270000]: 0x34200000-0x34270000
        npuRAM4    [0x34270000 - 0x342E0000]: 0x34270000-0x342D2000
        npuRAM5    [0x342E0000 - 0x34350000]: 0x342E0000-0x3434FC80
        octoFlash  [0x70580000 - 0x72780000]: 0x70580000-0x72621F90
        hyperRAM   [0x90000000 - 0x91000000]: 0x90000000-0x90310000
====================================================================================
Epochs details
   ---------------------------------------------------------------------------------
Total number of epochs: 147 of which 2 implemented in software

epoch ID   HW/SW/EC Operation (SW only)
epoch 1       HW
epoch 2      -SW-   (   QuantizeLinear   )
epoch 3       HW
epoch 4       HW
epoch 5       HW
epoch 6       HW
epoch 7       HW
epoch 8       HW
epoch 9       HW
epoch 10      HW
epoch 11      HW
epoch 12      HW
epoch 13      HW
epoch 14      HW
epoch 15      HW
epoch 16      HW
epoch 17      HW
epoch 18      HW
epoch 19      HW
epoch 20      HW
epoch 21      HW
epoch 22      HW
epoch 23      HW
epoch 24      HW
epoch 25      HW
epoch 26      HW
epoch 27      HW
epoch 28      HW
epoch 29      HW
epoch 30      HW
epoch 31      HW
epoch 32      HW
epoch 33      HW
epoch 34      HW
epoch 35      HW
epoch 36      HW
epoch 37      HW
epoch 38      HW
epoch 39      HW
epoch 40      HW
epoch 41      HW
epoch 42      HW
epoch 43      HW
epoch 44      HW
epoch 45      HW
epoch 46      HW
epoch 47      HW
epoch 48      HW
epoch 49      HW
epoch 50      HW
epoch 51      HW
epoch 52      HW
epoch 53      HW
epoch 54      HW
epoch 55      HW
epoch 56      HW
epoch 57      HW
epoch 58      HW
epoch 59      HW
epoch 60      HW
epoch 61      HW
epoch 62      HW
epoch 63      HW
epoch 64      HW
epoch 65      HW
epoch 66      HW
epoch 67      HW
epoch 68      HW
epoch 69      HW
epoch 70      HW
epoch 71      HW
epoch 72      HW
epoch 73      HW
epoch 74      HW
epoch 75      HW
epoch 76      HW
epoch 77      HW
epoch 78      HW
epoch 79      HW
epoch 80      HW
epoch 81      HW
epoch 82      HW
epoch 83      HW
epoch 84      HW
epoch 85      HW
epoch 86      HW
epoch 87      HW
epoch 88      HW
epoch 89      HW
epoch 90      HW
epoch 91      HW
epoch 92      HW
epoch 93      HW
epoch 94      HW
epoch 95      HW
epoch 96      HW
epoch 97      HW
epoch 98      HW
epoch 99      HW
epoch 100     HW
epoch 101     HW
epoch 102     HW
epoch 103     HW
epoch 104     HW
epoch 105     HW
epoch 106     HW
epoch 107     HW
epoch 108     HW
epoch 109     HW
epoch 110     HW
epoch 111     HW
epoch 112     HW
epoch 113     HW
epoch 114     HW
epoch 115     HW
epoch 116     HW
epoch 117     HW
epoch 118     HW
epoch 119     HW
epoch 120     HW
epoch 121     HW
epoch 122     HW
epoch 123     HW
epoch 124     HW
epoch 125     HW
epoch 126     HW
epoch 127     HW
epoch 128     HW
epoch 129     HW
epoch 130     HW
epoch 131     HW
epoch 132     HW
epoch 133     HW
epoch 134     HW
epoch 135     HW
epoch 136     HW
epoch 137     HW
epoch 138     HW
epoch 139     HW
epoch 140     HW
epoch 141     HW
epoch 142     HW
epoch 143     HW
epoch 144     HW
epoch 145     HW
epoch 146     HW
epoch 147    -SW-   (  DequantizeLinear  )
====================================================================================

 Generated files (5)
 ------------------------------------------------------------------------------------
 C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\arcface_halfint8_OE_3_3_0.onnx
 C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\arcface_halfint8_OE_3_3_0_Q.json
 C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\network.c
 C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\network_atonbuf.xSPI2.raw
 C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\network.h

Creating txt report file C:\ST\STEdgeAI\2.2\Utilities\windows\st_ai_output\network_generate_report.txt
elapsed time (generate): 60.889s
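
To compare the two "Memory usage information" tables above without reading them by eye, the pool rows can be diffed with a small script. This is a minimal sketch: the regular expression and the helper names (`parse_pools`, `diff_pools`) are my own, and the two strings are shortened excerpts of rows copied from the logs above, not full reports.

```python
import re

# Matches rows such as:
#   octoFlash  [0x71000000 - 0x78000000]:     32.633 MB /    112.000 MB  (...)
POOL_RE = re.compile(
    r"^\s*(\w+)\s+\[0x([0-9A-Fa-f]+) - 0x([0-9A-Fa-f]+)\]:"
    r"\s+([\d.]+)\s+(\w+)\s+/\s+([\d.]+)\s+(\w+)"
)

def parse_pools(report: str) -> dict:
    """Map pool name -> (start, end, used, capacity) for each pool row."""
    pools = {}
    for line in report.splitlines():
        m = POOL_RE.match(line)
        if m:
            name, start, end, used, used_u, cap, cap_u = m.groups()
            pools[name] = (int(start, 16), int(end, 16),
                           f"{used} {used_u}", f"{cap} {cap_u}")
    return pools

def diff_pools(a: dict, b: dict) -> dict:
    """Return the pools present in both reports whose rows differ."""
    return {n: (a[n], b[n]) for n in a if n in b and a[n] != b[n]}

# Excerpts of the pool rows from the two logs (text after "% used)" dropped).
CLOUD = """
cpuRAM2    [0x34100000 - 0x34200000]:      1.000 MB /      1.000 MB  (100.00 % used)
octoFlash  [0x71000000 - 0x78000000]:     32.633 MB /    112.000 MB  ( 29.14 % used)
hyperRAM   [0x90000000 - 0x92000000]:      3.062 MB /     32.000 MB  (  9.57 % used)
"""
LOCAL = """
cpuRAM2    [0x34100000 - 0x34200000]:      1.000 MB /      1.000 MB  (100.00 % used)
octoFlash  [0x70580000 - 0x72780000]:     32.633 MB /     34.000 MB  ( 95.98 % used)
hyperRAM   [0x90000000 - 0x91000000]:      3.062 MB /     16.000 MB  ( 19.14 % used)
"""

diff = diff_pools(parse_pools(CLOUD), parse_pools(LOCAL))
for name, (cloud, local) in diff.items():
    print(f"{name}: cloud 0x{cloud[0]:08X}-0x{cloud[1]:08X} ({cloud[3]})"
          f" vs local 0x{local[0]:08X}-0x{local[1]:08X} ({local[3]})")
```

On these excerpts, the used sizes are identical in both reports, but the `octoFlash` and `hyperRAM` pool definitions differ in base address and capacity (112 MB vs 34 MB, and 32 MB vs 16 MB). Whether that has anything to do with the timing gap is not something the logs alone can confirm.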

 
