Artificial intelligence computing speed is too slow

Fyouj.1 · ‎2022-07-26

Dear ST Engineer,

Hello, I am working on sound classification (AER-CNN-KERAS). The hardware is an STM32H743IIT6 (400 MHz) with external SDRAM (W9825G6KH-6): 32 MB, 16-bit data width, running at 100 MHz.

After downloading the model to the STM32H743, recognizing one piece of audio takes about 40 seconds, which is unacceptably slow. What could be the reason for such a long run time?

My analysis suggests that the external SDRAM is the main reason: the model requires 9.31 MiB of space, all of it in external memory. But it should not be this slow. Is there any way to speed it up? Could it be that STM32CUBEMX_AI has not recognized that the SDRAM is a 16-bit memory and is using it as an 8-bit one? If I switch to 32-bit SDRAM, would the computation get faster?
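As a rough sanity check on the bus-width question, the peak transfer rate can be estimated from the figures in the post. The sketch below assumes single-data-rate SDRAM (one transfer per clock), ignores refresh and row-activation overhead, and treats "9.31 MiB" as the amount of data streamed once per pass; the function names are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* Peak bytes per second of an SDR SDRAM bus: one transfer per clock cycle. */
double sdram_peak_bytes_per_s(unsigned bus_bits, double clock_hz)
{
    assert(bus_bits % 8 == 0);
    return (bus_bits / 8) * clock_hz;
}

/* Time in seconds to stream `bytes` once at the theoretical peak rate. */
double stream_time_s(double bytes, unsigned bus_bits, double clock_hz)
{
    return bytes / sdram_peak_bytes_per_s(bus_bits, clock_hz);
}

/* Print the estimates for the setup described in the post. */
void print_estimates(void)
{
    const double model_bytes = 9.31 * 1024.0 * 1024.0; /* 9.31 MiB */
    const double clock_hz = 100e6;                     /* 100 MHz SDRAM clock */

    printf("16-bit bus: %.1f ms per pass\n",
           1e3 * stream_time_s(model_bytes, 16, clock_hz));
    printf("32-bit bus: %.1f ms per pass\n",
           1e3 * stream_time_s(model_bytes, 32, clock_hz));
}
```

Under these assumptions a single pass over 9.31 MiB on a 16-bit bus takes roughly 49 ms, so bus width alone is unlikely to account for a 40-second run, even if the data is read many times over.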

GLASS · ‎2022-07-27

Hi,

1) Make sure the code is executed from flash and that heavily/frequently used data is in DTCM RAM.

2) Make sure prefetch, the ART accelerator, the I-cache and the D-cache are activated.

3) MPU *must* be properly configured.

4) Be aware that using the D-cache means you must manage it correctly wherever DMA is used (FatFs, SD card, Ethernet...); some HAL or middleware code is prone to erratic behavior there. (I'm not from ST, only another STM32 user.)

5) Look for code that can be optimized. Long-running nested loops are good candidates for loop unrolling.
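To illustrate point 5, here is a minimal sketch of manual loop unrolling on a dot product, a typical inner loop in neural-network code. It is a generic example, not taken from the AI library; the function names are made up for illustration. Because the unrolled version sums in a different order, its floating-point result can differ from the plain loop in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one multiply-accumulate per iteration. */
float dot_plain(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += a[i] * b[i];
    return acc;
}

/* Unrolled by four with independent accumulators, so there are fewer
 * loop-control instructions and the multiply-accumulates do not all
 * depend on the same running sum. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        acc0 += a[i]     * b[i];
        acc1 += a[i + 1] * b[i + 1];
        acc2 += a[i + 2] * b[i + 2];
        acc3 += a[i + 3] * b[i + 3];
    }

    float acc = (acc0 + acc1) + (acc2 + acc3);

    /* Handle the 0 to 3 elements left over when n is not a multiple of 4. */
    for (; i < n; ++i)
        acc += a[i] * b[i];

    return acc;
}
```

An optimizing compiler may already do this at higher optimization levels, so it is worth measuring before and after.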

Fyouj.1 · ‎2022-07-27

Thank you for your answer.

After I set up the MPU, the runtime of the program was reduced from 40 seconds to 12 seconds.

But 12 seconds is still a long time. I tried to place the ai_run() function in ITCM and run it from there, but STM32CUBEMX_AI only generates a library.

How do I load ai_run() from NetworkRuntime700_cm7_keil.lib into ITCM to run it there?

And if I use 32-bit-wide SDRAM, will I gain some speed?

GLASS · ‎2022-07-27

Hi, knowledge is one of the rare things that grows when we share it!

So I'm happy that my information gave you a boost.

Maybe you can click the 'like' button on my post...

About the AI lib: I have never used it.

If you only have a lib and no access to the source code, it will be difficult to relocate the code into ITCM RAM...

Can the lib fit in ITCM?

Using 32-bit instead of 16-bit access to the SDRAM may speed up the run time. But now that you benefit from the cache, you certainly cannot expect it to be twice as fast...

With the MPU, did you enable the cache in write-back mode for the SDRAM?
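For reference, a write-back cacheable SDRAM region might be set up roughly as follows with the STM32H7 HAL. This is a sketch under several assumptions not stated in the thread: the SDRAM is on FMC SDRAM bank 1 (mapped at 0xC0000000), MPU region 0 is free, and the project already includes the HAL. It is target-only code and has to be adapted to the actual memory map.

```c
#include "stm32h7xx_hal.h"

/* Assumption: SDRAM on FMC SDRAM bank 1. Bank 2 would be 0xD0000000. */
#define SDRAM_BASE_ADDR 0xC0000000UL

void sdram_mpu_cache_config(void)
{
    MPU_Region_InitTypeDef region = {0};

    HAL_MPU_Disable();

    region.Enable           = MPU_REGION_ENABLE;
    region.Number           = MPU_REGION_NUMBER0;      /* assumed to be unused */
    region.BaseAddress      = SDRAM_BASE_ADDR;
    region.Size             = MPU_REGION_SIZE_32MB;    /* W9825G6KH-6 is 32 MB */
    region.SubRegionDisable = 0x00;
    region.AccessPermission = MPU_REGION_FULL_ACCESS;
    region.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;

    /* TEX=1, C=1, B=1: normal memory, write-back, read and write allocate. */
    region.TypeExtField     = MPU_TEX_LEVEL1;
    region.IsCacheable      = MPU_ACCESS_CACHEABLE;
    region.IsBufferable     = MPU_ACCESS_BUFFERABLE;
    region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;

    HAL_MPU_ConfigRegion(&region);
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);

    /* The caches themselves still have to be switched on. */
    SCB_EnableICache();
    SCB_EnableDCache();
}
```

If any DMA master reads or writes this region, the D-cache has to be cleaned or invalidated around those transfers.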

Are you sure that there is no access outside the allowed SDRAM space?

Make sure that speculative accesses do not try to fill a cache line (32-byte aligned) from an address near the end of the SDRAM.

A simple way to ensure that is to keep data at least 32 bytes away from the start and the end of the SDRAM.

Always keep data aligned to the cache line.
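The alignment and guard-band advice above can be checked with a couple of small helpers. This is a sketch only: the helper names are hypothetical, and the base address used in the usage note (0xC0000000) is an assumption about where the SDRAM is mapped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 32u /* cache line size in bytes, as stated in the post */

/* Round an address up to the next multiple of the cache line size. */
uint32_t align_up_cache_line(uint32_t addr)
{
    return (addr + (CACHE_LINE - 1u)) & ~(CACHE_LINE - 1u);
}

/* Return true if the buffer [addr, addr + len) starts on a cache line
 * boundary and stays at least one cache line away from both ends of the
 * memory [mem_base, mem_base + mem_size). */
bool buffer_placement_ok(uint32_t addr, uint32_t len,
                         uint32_t mem_base, uint32_t mem_size)
{
    assert(mem_size >= 2u * CACHE_LINE);

    /* 64-bit arithmetic avoids wrap-around near the top of the address space. */
    uint64_t lo  = (uint64_t)mem_base + CACHE_LINE;
    uint64_t hi  = (uint64_t)mem_base + mem_size - CACHE_LINE;
    uint64_t end = (uint64_t)addr + len;

    return (addr % CACHE_LINE == 0u) && (addr >= lo) && (end <= hi);
}
```

For a 32 MB SDRAM assumed at 0xC0000000, `buffer_placement_ok(0xC0000020u, 1024u, 0xC0000000u, 0x02000000u)` holds, while a buffer starting exactly at the base address is rejected.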

Another thing to try: make sure you have tuned the SDRAM controller to the optimal timing for the SDRAM model used (provided the PCB routing is well designed).

Changing the SDRAM timing means you need to write a RAM test (checking integrity, and maybe a benchmark...).
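A very small integrity test of the kind mentioned above could look like the sketch below. It is illustrative rather than exhaustive; the function name and the multiplier constant are arbitrary choices. On the target it should be run with the D-cache disabled for the tested region, otherwise reads may be served from the cache instead of the SDRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write an index-derived pattern and its inverse over `words` 32-bit words
 * starting at `base`, read each back, and return the number of mismatches. */
size_t ram_test_words(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);

    size_t errors = 0;

    /* Pass 1: a value that depends on the index, so that aliased or
     * stuck address lines show up as mismatches. */
    for (size_t i = 0; i < words; ++i)
        base[i] = (uint32_t)i * 0x9E3779B1u;
    for (size_t i = 0; i < words; ++i)
        if (base[i] != (uint32_t)i * 0x9E3779B1u)
            ++errors;

    /* Pass 2: the inverted pattern, so every bit is tested as 0 and as 1. */
    for (size_t i = 0; i < words; ++i)
        base[i] = ~((uint32_t)i * 0x9E3779B1u);
    for (size_t i = 0; i < words; ++i)
        if (base[i] != ~((uint32_t)i * 0x9E3779B1u))
            ++errors;

    return errors;
}
```

A return value of 0 after each timing change is the minimum to expect; the test destroys the contents of the tested range, so it belongs before the model data is loaded.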

Do you have other peripherals using the SDRAM (LCD display, etc.)? This may introduce bus contention on the SDRAM.

If so, try again with the other peripherals disabled...

And of course I also hope that you are not using heavy multitasking, especially if you call blocking HAL functions that tend to busy-wait on the CPU... ;>))

Did you benchmark with code compiled with optimization enabled?

Vahid Ajallooeian · ‎2022-07-27

Honestly, if you are working on image data, it would be best to switch to NVIDIA boards; expecting fast runs from CPUs that top out at 500 MHz is a waste of time.