I have to deploy an algorithm with a lightweight deep-learning-based approach on a low-power STM System-on-Chip based on an Arm Cortex-M-class MCU or Arm Cortex-A-class MPU. Is there a way to reduce cache misses, or a way to reduce the CPU stalls caused by cache misses? And is the situation different on an Arm Cortex-A-class MPU?
Any suggestions are welcome.
Read the Reference Manuals and Datasheets for your chosen STM32 families. They contain information about how the caches work.
I’m most familiar with the Cortex-M cores in the STM32F4 and STM32F7, where cache lines are strictly tied to the low address bits. So you might get some benefit from ensuring the most-executed routines do not share the same low address bits (i.e. do not alias into the same cache lines), which is easily achieved by placing them immediately adjacent in FLASH.
Data caching is a different problem, but at high optimisation settings the usual compilers will generally schedule instructions to allow for the time it takes to fetch the data.
You might get some benefit from executing code from RAM, since RAM has faster access than FLASH. But it could also make things worse: the STM32 automatically prefetches FLASH for non-branching code to avoid delays, and code running from RAM might slow data accesses if it shares the same bus path as the code fetches. It very much depends on your code.
But tweaks like these can only give marginal improvements over the “normal recommended” settings.