STM32F429 vs STM32F767, processing speed?

oguzhan demirci
Associate
Posted on March 20, 2017 at 13:31

Hi,

We are testing code generated from Simulink. The code operates on single-precision operands. When we compile it with Keil, selecting the STM32F767 and the SP FPU, the maximum time to complete one cycle of the algorithm is longer than the maximum time on the STM32F429 under the same conditions. In other words, the F429 completes the algorithm faster than the F767.

Based on the technical manual, we expected the FPU of the F767 to be faster than that of the F429. On the contrary, the algorithm takes longer on the F767 than on the F429.

The CPU clock of the F767 is 216 MHz and that of the F429 is 168 MHz. We selected the same optimization level in Keil before compiling for both.

The algorithm time is measured with a timer that is started at the beginning of the algorithm and stopped at the end.

The question is: why does the F767 run slower than the F429?

Thank you for your answers,

Oğuzhan Demirci

#stm32f767 #stm32f429 #fp
22 REPLIES 22
Kraal
Senior III
Posted on March 21, 2017 at 10:39

Hi,

I am no expert on F7 vs F4, but the problem you are facing might be related to the wait states needed for the Flash. Since the number of wait states grows in steps rather than linearly with clock frequency, the F4 might have faster Flash access than the F7 under these conditions.

To get the result you are looking for, you can either:

- Run the F4 and the F7 at the same clock speed, although I am not sure how useful that comparison would be.

- Run your algorithm routines from SRAM to get the maximum throughput.

GHITH.Abdelhamid
Associate II
Posted on March 21, 2017 at 14:21

Hello,

The exact test conditions and the firmware loop would help. Your observation is not in line with the products' performance.

What about the Cache/ART configuration?

  • If you are using the ITCM interface for code execution, you need to enable the ART accelerator and prefetch.
  • When fetching code over the AXI interface, you need to enable the instruction and data caches.
  • If your data is larger than the DTCM SRAM, you need to enable the data cache (in both of the previous configurations).
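
As a rough sketch of the three rules above, assuming the STM32F7 HAL and CMSIS headers are available on the target build: the type and function names (code_bus_t, accel_cfg_t, pick_accel_cfg, apply_accel_cfg) are made up for illustration, while the enable macros and SCB_Enable*Cache() calls are the ones the HAL and CMSIS provide. The target-only part is guarded so that the selection logic can also be compiled on a PC.

```c
#include <stdbool.h>

/* Which bus the code is fetched over (illustrative naming). */
typedef enum { CODE_ON_ITCM, CODE_ON_AXI } code_bus_t;

typedef struct {
    bool art_and_prefetch;  /* ART accelerator + flash prefetch */
    bool icache;            /* Cortex-M7 instruction cache */
    bool dcache;            /* Cortex-M7 data cache */
} accel_cfg_t;

/* Hypothetical helper that encodes the three rules listed above. */
accel_cfg_t pick_accel_cfg(code_bus_t bus, bool data_fits_dtcm)
{
    accel_cfg_t cfg = { false, false, false };

    if (bus == CODE_ON_ITCM) {
        cfg.art_and_prefetch = true;   /* code fetched through ITCM */
    } else {
        cfg.icache = true;             /* code fetched through AXI */
        cfg.dcache = true;
    }
    if (!data_fits_dtcm) {
        cfg.dcache = true;             /* data spills outside DTCM SRAM */
    }
    return cfg;
}

#if defined(STM32F767xx)   /* target build only */
#include "stm32f7xx_hal.h"

/* Apply the selection using the HAL macros and CMSIS cache functions. */
void apply_accel_cfg(accel_cfg_t cfg)
{
    if (cfg.art_and_prefetch) {
        __HAL_FLASH_ART_ENABLE();
        __HAL_FLASH_PREFETCH_BUFFER_ENABLE();
    }
    if (cfg.icache) { SCB_EnableICache(); }
    if (cfg.dcache) { SCB_EnableDCache(); }
}
#endif
```

On the target this would typically be called once, early in startup and before the code under test runs, for example apply_accel_cfg(pick_accel_cfg(CODE_ON_AXI, true)).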

What about the timer?

>>Algorithm time is measured by a timer as starting in the beginning of the algorithm and stopping at the end.

Can you check for timer overflow on the Cortex-M4? If the overflow is not handled, the period measured on the CM4 will appear shorter than it really is.
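
To illustrate the overflow concern, here is a minimal sketch assuming a free-running 16-bit up-counter whose wrap events are counted somewhere (for example in the update interrupt). The function names are hypothetical and no timer registers are touched, so it can be tried on a PC.

```c
#include <stdint.h>

/* Elapsed ticks of a free-running 16-bit up-counter.
 * overflows = number of wrap (update) events seen between the two reads. */
uint32_t elapsed_ticks16(uint16_t start, uint16_t end, uint32_t overflows)
{
    /* Unsigned 32-bit arithmetic: correct even when end < start. */
    return overflows * 65536u + end - start;
}

/* Time until an n-bit counter wraps, in microseconds. */
double wrap_period_us(unsigned bits, double timer_hz)
{
    return (double)(1ull << bits) * 1e6 / timer_hz;
}
```

If the timer were clocked at 168 MHz with no prescaler (an assumption, the real timer clock depends on the APB setup), a 16-bit counter would wrap roughly every 390 us. A run that starts at count 65000 and ends at count 100 after one wrap took 636 ticks, and every wrap that goes uncounted removes another 65536 ticks from the reported time.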

Br,

Abdelhamid.

Amel NASRI
ST Employee
Posted on March 21, 2017 at 14:47

Hi demirci.oguzhan,

If the configuration used for STM32F767 is aligned with the proposals of GHITH.Abdelhamid and you still see the issue initially described, please share:

  • The Simulink model and settings for both products
  • The generated projects for both products

This should help us make further checks on our side to identify the configuration issue leading to such wrong conclusions.

If you need more details on how to get better performance using STM32F7 devices, please refer to

http://www.st.com/content/st_com/en/products/embedded-software/mcus-embedded-software/stm32-embedded-software/stm32cube-expansion-software/x-cube-32f7perf.html

and its associated firmware.

-Amel


Posted on March 21, 2017 at 15:35

I would definitely check the caching and the memory used. If the FLASH is exposed directly, with no cache or accelerator in front of it, it is still going to be rather slow.

Unless I'm mistaken, the F767 has the FPU-D, not the FPU-S, so it is not identical to the Cortex-M4. You could likely compile the code as if it were targeting the CM4F to get a more apples-to-apples comparison. I would strongly suggest reviewing a disassembly of the code to determine whether different code or libraries are the source of the performance difference.

For benchmarking code, the use of DWT_CYCCNT is highly recommended, as it is a 32-bit counter clocked at core speed. On the CM7 you must unlock access to the DWT.
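
A minimal sketch of that approach, assuming the CMSIS device headers (stm32f7xx.h / stm32f4xx.h) on the target. The helper names cyccnt_start, cycles_between and cycles_to_us are made up for illustration. The register names and the 0xC5ACCE55 unlock key are the standard CMSIS/CoreSight ones. The hardware part is guarded so that the arithmetic can be checked on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap-safe difference of two 32-bit cycle counter reads
 * (valid while the measured interval is under 2^32 cycles). */
uint32_t cycles_between(uint32_t start, uint32_t end)
{
    return end - start;   /* unsigned arithmetic wraps modulo 2^32 */
}

/* Convert a cycle count to microseconds for a given core clock. */
double cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    assert(core_hz > 0u);
    return (double)cycles * 1e6 / (double)core_hz;
}

#if defined(STM32F767xx)
#include "stm32f7xx.h"
#elif defined(STM32F429xx)
#include "stm32f4xx.h"
#endif

#if defined(STM32F767xx) || defined(STM32F429xx)   /* target build only */
/* Enable and reset the DWT cycle counter. */
void cyccnt_start(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT block */
#if defined(STM32F767xx)
    DWT->LAR = 0xC5ACCE55u;                          /* unlock DWT on the CM7 */
#endif
    DWT->CYCCNT = 0u;
    DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             /* start counting */
}
#endif
```

On the target, read DWT->CYCCNT before and after the algorithm and pass both values to cycles_between(). Comparing cycles as well as time is useful here: the same 100000 cycles correspond to about 463 us at 216 MHz and about 595 us at 168 MHz.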

Mehmet Akif AYGUN
Associate
Posted on March 21, 2017 at 16:59

Hi,

I also faced the same problem. After searching the web, I found this link, which discusses the topic in detail:

https://community.arm.com/processors/f/discussions/5567/what-is-the-advantage-of-floating-point-of-cm7-versus-cm4

What I took from this discussion is that the following could be reasons why the Cortex-M4 is faster than the Cortex-M7:
  • 6-stage pipeline architecture (I could not find any detailed explanation of how the 6-stage pipeline works)
  • Register dependency

(I am not sure whether this happens because compilers are not generating code suited to the Cortex-M7 architecture or because there is a hardware problem in the M7.)

I tried a performance test that simply adds 2 float variables, timed with DWT_CYCCNT. The CM4 performed better than the CM7, and the assembly code generated for the addition is the same for both. I am using Keil uVision 5.20.0.0, with an STM32F767 Nucleo board running at 216 MHz for the Cortex-M7 and an STM32F429I Discovery board running at 168 MHz for the Cortex-M4.

I also asked people at ST about this topic. They said that many people are complaining about this, and that they are trying to find the cause (software or hardware) by looking at the source code.

When I have free time, I will run performance tests with a different, up-to-date compiler. If I still do not get the expected result, I will try coding in assembly to reach the expected performance.

In conclusion, it does not seem to be related to memory access time. If anyone has faced the same problem and found a solution, please share it with us.

Posted on March 21, 2017 at 16:32

Hi Clive,

Yes, I would definitely check caching, memory map and use a 32-bit timer for benchmarking.

Just a minor comment: F767 has both FPU-D and FPU-S.

If the compiler is configured for single precision only, it will generate single-precision instructions for 32-bit float data. For double-precision operations it will call C library functions.

When configured for double precision, the compiler will continue to generate single-precision instructions for 32-bit float data and will generate the appropriate instructions for double-precision operations.

In any case, when using double-precision data, code generated by a compiler configured for double precision will execute faster on the F767 than code generated with the FPU-S compiler configuration.
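
One thing that may be worth checking in a disassembly is whether any double-precision operation slips into code that is meant to be single precision. The sketch below (function names are made up, and nothing here is known about the actual Simulink output) shows how an unsuffixed literal promotes a float expression to double in standard C.

```c
/* Stays in single precision: both operands are float. */
float scale_sp(float x)
{
    return x * 0.5f;
}

/* 0.5 is a double literal, so x is promoted and the multiply is done in
 * double precision before the result is narrowed back to float. */
float scale_dp(float x)
{
    return (float)(x * 0.5);
}
```

With the compiler configured for single precision only, the second form would go through the C library for the double-precision multiply, as described above.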

Br,

Abdelhamid.

Posted on March 21, 2017 at 19:08

Hi,

As you said, the Cortex-M7 architecture is different from the Cortex-M4. It has a 6-stage dual-issue pipeline, allowing it to process 2 instructions in parallel. Unfortunately, only one floating-point pipe, supporting single- and double-precision operations, is available. So no dual-issuing is possible for FPU instructions.

FPU instructions will execute sequentially.

I agree with your observation: a performance test that just adds 2 float variables (with no compiler optimization and the same instructions generated) will result in a higher cycle count on the Cortex-M7 (longer pipeline plus sequential execution).

To see the Cortex-M7 performance improvement, I suggest mimicking a real application use case (loops, conditional processing, processing of different data types...). Such code will benefit from the enhancements on the CM7 (dual-issuing, optimized PFU with BTAC, caches...).

Also, in your tests I suppose data and code are mapped on AXI. Executing only two instructions will result in cache miss and fill overhead.

Br,

Abdelhamid.

Posted on March 22, 2017 at 00:38

>>6-stage pipeline architecture  (I could not find any detailed explanation on how 6-stage pipeline works)

It tends to mean the throughput is 1 cycle, but the latency is 6 cycles. A dependency limits dispatch, perhaps by up to 6 cycles, when the next operation depends on the result of the current one.
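
A small illustration of what such a dependency looks like in C, as a sketch only (function names are made up): the first loop forms one long chain where every add needs the previous result, while the second keeps four independent accumulators so consecutive adds do not wait on each other. Whether this helps on a given core and compiler has to be measured.

```c
#include <stddef.h>

/* Each add depends on the previous result: one long dependency chain. */
float sum_chain(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += v[i];
    }
    return acc;
}

/* Four independent accumulators: consecutive adds do not depend on
 * each other, so they can overlap in a pipelined FPU. */
float sum_split(const float *v, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) {   /* leftover elements */
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Note that the two versions add in a different order, so their results can differ in the last bits for general float data. That is also why a compiler normally will not make this change on its own.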

>>Unfortunately only one floating point pipe supporting single and double precision operations is available.

Sounds like the FPU-D can handle both 32-bit floats and 64-bit doubles using shared resources, rather than the chip containing both an FPU-D and an FPU-S.
Posted on March 22, 2017 at 08:28

Yes. One FPU supporting both single- and double-precision operations. It is more precise put like this.

Br

Abdelhamid