2017-03-20 5:31 AM
Hi,
We are testing code generated from Simulink. The code operates on single-precision operands. When we compile it with Keil, selecting the STM32F767 and the SP FPU, the maximum time to complete one cycle of the algorithm is longer than on the STM32F429 under the same conditions. In other words, the F429 completes the algorithm faster than the F767.
Based on the technical manual, we expected the FPU of the F767 to be faster than that of the F429; in contrast, the algorithm takes longer on the F767.
The CPU clock of the F767 is 216 MHz and that of the F429 is 168 MHz. We selected the same optimization level in Keil before compiling.
The algorithm time is measured with a timer that is started at the beginning of the algorithm and stopped at the end.
The question is: why and how does the F767 run slower than the F429?
Thank you for your answers,
Oğuzhan Demirci
#stm32f767 #stm32f429 #fp
2017-03-21 2:39 AM
Hi,
I am no expert regarding F7 vs F4; however, the problem you are facing might be related to the wait states needed for the Flash. Since the number of wait states increases in steps rather than linearly with the clock, the F4 might have faster Flash access than the F7 under these conditions.
In order to get the result you are looking for, you can either:
- Run the F4 and the F7 at the same speed, but I am not sure this will be interesting.
- Run your algorithm routines from SRAM to get the maximum throughput.
2017-03-21 6:21 AM
Hello,
Knowing the exact conditions and the FW loop would help. Your observation is not in line with the product's performance.
What about the Cache/ART configuration?
What about the timer?
>>Algorithm time is measured by a timer, starting at the beginning of the algorithm and stopping at the end.
Can you check for timer overflow on the Cortex-M4? If overflow is not handled, the CM4 can appear to have a shorter period.
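To illustrate the overflow point, here is a minimal sketch of wrap-safe elapsed-time arithmetic. The thread does not say which timer or counter width was used, so the 16-bit case and the function names are assumptions for illustration only.

```c
#include <stdint.h>

/* Elapsed ticks on a free-running 16-bit up-counter (assumed width).
 * Unsigned subtraction wraps modulo 2^16, so the result is correct as long
 * as the counter wrapped at most once between the two reads. If the measured
 * interval is longer than 65536 ticks, the result is silently too short. */
static uint16_t elapsed_ticks16(uint16_t start, uint16_t stop)
{
    return (uint16_t)(stop - start);
}

/* Same idea for a free-running 32-bit counter: wraps modulo 2^32. */
static uint32_t elapsed_ticks32(uint32_t start, uint32_t stop)
{
    return stop - start;
}
```

If the algorithm can run longer than one full counter period, a wider counter or a slower prescaler is needed; otherwise the longer run can look like the shorter one.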
Br,
Abdelhamid.
2017-03-21 6:47 AM
Hi demirci.oguzhan,
If the configuration used for the STM32F767 is aligned with the proposals of GHITH.Abdelhamid and you still see the issue initially described, please share:
This should help us make further checks on our side to identify the configuration issue leading to such wrong conclusions.
If you need more details on how to get better performance using STM32F7 devices, please refer to
and its associated firmware.
-Amel
2017-03-21 8:35 AM
I would definitely check the caching and the memory used. If its latency is exposed, the FLASH is still going to be rather slow.
Unless I'm mistaken, the F767 has the FPU-D, not the FPU-S, so it is not identical to the Cortex-M4. You could likely compile the code as if it were targeting the CM4F to get a more apples-to-apples comparison. I would strongly suggest reviewing a disassembly of the code to determine whether different code/libraries were the source of the performance differences.
For benchmarking code, the use of DWT_CYCCNT is highly recommended, being a 32-bit counter clocked at core speed. On the CM7 you must unlock access to the DWT.
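A minimal sketch of enabling the cycle counter, including the unlock step, could look like the following. The `cyc_regs_t` struct, the function names, and the `CYC_REGS_HW` macro are hypothetical helpers introduced here so the sequence can also be compiled off-target; in a real project the CMSIS device header normally provides these registers (`CoreDebug->DEMCR`, `DWT->LAR`, `DWT->CYCCNT`, `DWT->CTRL`). The addresses and the unlock key are the standard ARMv7-M debug values, but please check them against the reference manual for your part.

```c
#include <stdint.h>

/* Hypothetical register view; pointers let the same code run against
 * real hardware addresses or plain variables in a host-side test. */
typedef struct {
    volatile uint32_t *demcr;   /* Debug Exception and Monitor Control */
    volatile uint32_t *ctrl;    /* DWT control register */
    volatile uint32_t *cyccnt;  /* DWT cycle counter */
    volatile uint32_t *lar;     /* DWT lock access register */
} cyc_regs_t;

/* Initializer for the on-target addresses (only dereference on the MCU). */
#define CYC_REGS_HW { (volatile uint32_t *)0xE000EDFCu, \
                      (volatile uint32_t *)0xE0001000u, \
                      (volatile uint32_t *)0xE0001004u, \
                      (volatile uint32_t *)0xE0001FB0u }

#define DEMCR_TRCENA        (1u << 24)   /* enable trace/DWT block */
#define DWT_CTRL_CYCCNTENA  (1u << 0)    /* enable cycle counter */
#define DWT_LAR_KEY         0xC5ACCE55u  /* CoreSight unlock key */

static void cyccnt_enable(const cyc_regs_t *r)
{
    *r->demcr |= DEMCR_TRCENA;        /* power up the DWT */
    *r->lar = DWT_LAR_KEY;            /* unlock (needed on the CM7) */
    *r->cyccnt = 0u;                  /* start counting from zero */
    *r->ctrl |= DWT_CTRL_CYCCNTENA;   /* start the counter */
}

static uint32_t cyccnt_read(const cyc_regs_t *r)
{
    return *r->cyccnt;
}
```

On target, one would declare `static const cyc_regs_t hw = CYC_REGS_HW;`, call `cyccnt_enable(&hw)` once, then take `cyccnt_read(&hw)` before and after the code under test and subtract the two unsigned values.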
2017-03-21 8:59 AM
Hi,
I also faced the same problem. After searching the web, I found
this link, which discusses the topic in detail. What I summarized from this discussion could be reasons why the Cortex-M4 is faster than the M7. (But I am not sure whether this happens because compilers are not generating code suited to the Cortex-M7 architecture or because there is a hardware problem in the M7.)
I tried a performance test that just adds 2 float variables, using
DWT_CYCCNT. The CM4 performed better than the CM7, and the assembly code generated for the addition is the same for both. I am using KEIL uVision 5.20.0.0, with an STM32F767 Nucleo board running at 216 MHz for the Cortex-M7 and an STM32F429I Discovery-I board running at 168 MHz for the Cortex-M4.
I also asked members of ST about this topic. They said that many people are complaining about this and that they are trying to find what causes the problem (software or hardware) by looking at the source code.
When I have free time, I will repeat the performance tests with a different, more up-to-date compiler. If I still do not get the expected result, I will try coding in assembly to get the expected performance.
As a result, it seems it is not related to memory access time. If anyone has faced the same problem and found a solution, please share it with us.
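One thing worth keeping in mind when comparing DWT_CYCCNT readings between the two boards: the counter gives cycles, not time, and the cores run at different clocks (216 MHz and 168 MHz as stated above). A small sketch of the conversion follows; the function name is made up for illustration.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given core clock in Hz.
 * 64-bit intermediate avoids overflow for any 32-bit cycle count. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t core_hz)
{
    return ((uint64_t)cycles * 1000000000ull) / core_hz;
}
```

With these clocks the ratio is 216/168, about 1.29, so the F767 can take up to roughly 28% more cycles than the F429 and still finish sooner in wall-clock time. Comparing raw cycle counts alone answers a different question than comparing elapsed time.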
2017-03-21 9:32 AM
Hi Clive,
Yes, I would definitely check caching, memory map and use a 32-bit timer for benchmarking.
Just a minor comment: F767 has both FPU-D and FPU-S.
If the compiler is configured for only single precision, it will generate single-precision instructions for 32-bit float data. For double-precision operations it will call C library functions. When configured for double precision, the compiler will continue to generate single-precision instructions for 32-bit float data and will generate the appropriate instructions for double-precision operations.
In any case, when using double-precision data, code generated by a compiler configured for double precision will execute faster on the F767 than code generated with the FPU-S compiler configuration.
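As a side illustration of why the single/double distinction matters even in "float-only" code, here is a sketch of how a double can slip into a float expression in C. Whether the Simulink-generated code in question contains anything like this is not known; the function names are invented for the example.

```c
/* The literal 0.5 has type double, so x * 0.5 is a double multiply whose
 * result is converted back to float. With an FPU-S configuration this
 * can end up as a call into the C library instead of an FPU instruction. */
static float scale_promoted(float x)
{
    return (float)(x * 0.5);
}

/* The f suffix keeps the whole expression in single precision. */
static float scale_single(float x)
{
    return x * 0.5f;
}
```

Both functions return the same value here, but the work done to get it differs. A disassembly review, or GCC's `-Wdouble-promotion` warning if you build with that toolchain, is one way to spot such cases.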
Br,
Abdelhamid.
2017-03-21 12:08 PM
Hi,
As you said, the Cortex-M7 architecture is different from the Cortex-M4. It has a 6-stage dual-issue pipeline, allowing it to process 2 instructions in parallel. Unfortunately, only one floating-point pipe, supporting single- and double-precision operations, is available, so no dual-issuing is possible for FPU instructions.
FPU instructions will execute sequentially.
I agree with your observation: a performance test that just adds 2 float variables (with no optimization from the compiler and the same instructions generated) will result in a higher cycle count on the Cortex-M7 (longer pipeline + sequential execution).
To see the Cortex-M7 performance improvement, I suggest mimicking a real application use case (loops, conditional processing, processing of different data types...). Such code will benefit from the enhancements on the CM7 (dual-issuing, optimized PFU with BTAC, caches...).
Also, in your tests I suppose data and code are mapped on AXI. Executing only two instructions will result in cache miss and fill overhead.
Br,
Abdelhamid.
2017-03-21 5:38 PM
>>6-stage pipeline architecture (I could not find any detailed explanation on how 6-stage pipeline works)
It tends to mean the throughput is 1-cycle, but the latency is 6-cycles. The dependency limits the dispatch, perhaps up to 6 cycles, where the next operation is dependent on the result of the current operation.
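To make the dependency point concrete, here is a sketch of the same sum written two ways. It only shows the structure of the dependency chain; it makes no claim about actual cycle counts on either core, and the function names are invented for the example.

```c
#include <stddef.h>

/* One accumulator: every add needs the result of the previous add,
 * so each one has to wait out the latency of the one before it. */
static float sum_dependent(const float *v, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += v[i];
    }
    return acc;
}

/* Four accumulators: consecutive adds are independent of each other,
 * so a pipelined FPU can start one while earlier ones are in flight. */
static float sum_independent4(const float *v, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += v[i];
        a1 += v[i + 1];
        a2 += v[i + 2];
        a3 += v[i + 3];
    }
    for (; i < n; ++i) {   /* leftover elements */
        a0 += v[i];
    }
    return (a0 + a1) + (a2 + a3);
}
```

Note that floating-point addition is not associative, so the two versions can differ in the last bits for general data; that is also why a compiler will usually not make this transformation on its own without relaxed floating-point options.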
>>Unfortunately only one floating point pipe supporting single and double precision operations is available.
Sounds like the FPU-D can do both 32-bit floats and 64-bit doubles, sharing resources between them, rather than the part containing both an FPU-D and an FPU-S.
2017-03-22 1:28 AM
Yes. One FPU supporting both single- and double-precision operations. It is more precise like this.
Br,
Abdelhamid
