
Performance hit running code on STM32F7 vs STM32F4

KParv
Associate

Hi,

I'm implementing an encoder/decoder for a Forward Error Correcting code (specifically a Fountain Code) on an embedded system and optimizing it for speed. I'm using an STM32F429I-Discovery board and an STM32F769I-Disc to run the encoder and decoder respectively. The encoder takes a 1 MB block of data and encodes it into a nearly unlimited amount of output, depending on how much redundancy one wants. To accomplish this I'm using the external SDRAM on both devices, which I've set up and have working according to the example files in en.stm32cubef4 (and f7).

Initially I got everything working on the F4 and then ported the code over to the F7, but this is where I've run into a strange performance issue. The code always takes longer to run on the F7 than on the F4, but only under certain conditions: when optimisations are on and when I'm reading/writing to the SDRAM. When I turn off all optimisations (and still use the SDRAM), the F4 runs notably slower than the F7, as expected, but with -O3 on the result is always reversed. The exception is when I comment out any code that reads/writes to the SDRAM (and replace it with NOP instructions); in that case the F7 again performs better.

This is leading me to believe some configuration with the SDRAM is causing a performance hit. I have the F4 running at 180 MHz so that the SDRAM runs at 90 MHz and the F7 is running at 200 MHz to achieve an SDRAM clock of 100 MHz. Is it possible that the external memory on the F7 is just slower than that of the F4?
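To make the clock figures above easy to check, here is a minimal sketch of the arithmetic, assuming the SDRAM clock is derived from HCLK with a divider of 2 (which matches 180 MHz giving 90 MHz and 200 MHz giving 100 MHz). The function names `sdram_clock_hz` and `sdram_period_ns` are hypothetical helpers for illustration, not part of any ST library.

```c
#include <assert.h>
#include <stdint.h>

/* SDRAM clock derived from HCLK.
   Assumption: the memory controller divides HCLK by 2 or 3. */
uint32_t sdram_clock_hz(uint32_t hclk_hz, uint32_t sdclk_div)
{
    assert(sdclk_div == 2u || sdclk_div == 3u);
    return hclk_hz / sdclk_div;
}

/* Duration of one SDRAM clock cycle in nanoseconds. */
double sdram_period_ns(uint32_t sdclk_hz)
{
    assert(sdclk_hz != 0u);
    return 1e9 / (double)sdclk_hz;
}
```

With these numbers, one SDRAM cycle is about 11.1 ns on the F4 and 10 ns on the F7, so on raw clock rate alone the F7's external memory should be slightly faster, not slower. Any timing parameters that are programmed in cycles (rather than nanoseconds) would need to be compared separately for the two boards.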

I've read the advice offered in a similar user's case (https://community.st.com/s/question/0D50X00009XkY4uSAF/stm32f429-vs-stm32f767-process-speed) and I've applied the suggestions made there, such as:

a. Using the AXI interface with instruction and data caches turned on.

b. Using the TCM interface with ART and instruction prefetch on.

c. I saw it mentioned to run the code from the SRAM but I haven't attempted that yet.

But beyond that I'm not sure what else could be the problem. I'm using the DWT cycle counter to measure the time. The code is probabilistic, so it takes a different amount of time on each run, but depending on the settings and after averaging, the F4 takes roughly 0.82 seconds while the F7 can take upwards of 0.86 seconds.
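For reference, the conversion from DWT cycle counts to seconds can be sketched as below. This is only an illustration under stated assumptions: the counter is the free-running 32-bit `DWT->CYCCNT`, the core clocks are 180 MHz and 200 MHz as described above, and the helper names `dwt_elapsed_cycles` and `cycles_to_seconds` are hypothetical. The register reads are shown in a comment so the block compiles on a host as well.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
   Unsigned subtraction is modulo 2^32, so a single wrap-around between
   the two reads is handled correctly. */
uint32_t dwt_elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to seconds for a given core clock. */
double cycles_to_seconds(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0u);
    return (double)cycles / (double)core_hz;
}

/* On target (not compiled here) the two reads would look like:
 *   uint32_t t0 = DWT->CYCCNT;
 *   run_decoder();              <- hypothetical function under test
 *   uint32_t t1 = DWT->CYCCNT;
 *   double s = cycles_to_seconds(dwt_elapsed_cycles(t0, t1), 200000000u);
 */
```

Comparing cycle counts rather than seconds makes the gap clearer: 0.82 s at 180 MHz is about 147.6 million cycles, while 0.86 s at 200 MHz is 172 million cycles, so the F7 is spending roughly 17% more cycles on the same work. Note that a 32-bit counter at 200 MHz wraps after about 21.5 seconds, so the subtraction above is only valid for intervals shorter than that.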

Thanks for any help/insight.

Kian
