Odd Benchmark Results Using Different gcc Optimizations

Scott Gravenhorst · ‎2019-06-08

I've written a benchmark to compare the CPU execution time for float adds versus int32_t adds, and the results were surprising. The benchmark is very simple: an empty loop of 100 million iterations is timed so that its time can be subtracted from the other tests. The same loop is then timed with a float add, and again with an int32_t add. The resulting times were then output:

Optimization -O0: 22.5 nanoseconds for both float and int32_t

Optimization -O1: 5 nanoseconds for float and 2.5 nanoseconds for int32_t

Optimization -O2: 10 nanoseconds for both float and int32_t

Optimization -O3: 10 nanoseconds for both float and int32_t

I find this odd because I ran the same test on an STM32F746, where -O3 gave the fastest times and the float add and int32_t add had identical times at every optimization level.

Does anyone have an explanation for this result?
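The original benchmark source was not posted, so the following is only a minimal sketch of the method described above: time an empty loop, time the same loop with one add in it, and subtract. All function names are hypothetical, and `clock()` is used as a portable stand-in timer; on the actual MCU a hardware timer or cycle counter would presumably be used instead. The `volatile` qualifiers are an assumption added to stop gcc from deleting the loops at -O1 and above, and they force extra loads and stores that the original code may not have had.

```c
#include <stdint.h>
#include <time.h>

/* Portable stand-in timer (assumption; the original timer is unknown). */
double seconds_now(void) {
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Baseline: the loop overhead alone. */
double time_empty_loop(uint32_t n) {
    volatile uint32_t i;  /* volatile so the loop survives optimization */
    double t0 = seconds_now();
    for (i = 0; i < n; i++) { }
    return seconds_now() - t0;
}

/* Same loop with one float add per iteration. */
double time_float_add(uint32_t n) {
    volatile uint32_t i;
    volatile float acc = 0.0f;  /* stops growing past 2^24, but the add still runs */
    double t0 = seconds_now();
    for (i = 0; i < n; i++) { acc = acc + 1.0f; }
    return seconds_now() - t0;
}

/* Same loop with one int32_t add per iteration. */
double time_int32_add(uint32_t n) {
    volatile uint32_t i;
    volatile int32_t acc = 0;
    double t0 = seconds_now();
    for (i = 0; i < n; i++) { acc = acc + 1; }
    return seconds_now() - t0;
}

/* Per-operation cost in nanoseconds after removing the loop overhead. */
double ns_per_op(double loop_with_op_s, double empty_loop_s, uint32_t n) {
    return (loop_with_op_s - empty_loop_s) * 1e9 / (double)n;
}
```

With `n` set to 100 million, `ns_per_op(time_float_add(n), time_empty_loop(n), n)` gives the kind of per-add figure quoted in the post. Because the compiler is free to schedule the loop body differently at each optimization level, the subtraction only approximates the cost of the add itself.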

Tesla DeLorean · ‎2019-06-08

I guess you could start by comparing the generated assembler code for each build to see how they differ. See how instructions might pair, whether there are alignment issues, and where you're effectively stalling the pipeline.

Scott Gravenhorst · ‎2019-06-08

Thanks. Yeah, I've read a bit of the assembly listing for this particular program. I will look there for clues about why -O2 and -O3 are worse than -O1; for now, I'll live with -O1.

I understand the 2.5 nanosec (1 clock at 400 MHz) result for integer add, but I was surprised that the float add takes twice as long (5.0 nanosec or 2 clocks at 400 MHz).

Surprised only because the STM32F7xx at 200 MHz had single cycle float add as well as multiply. (IIRC)...
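The nanosecond-to-clock conversion used above can be written out as a small helper. This is just the arithmetic from the post; the function names are hypothetical.

```c
/* Number of core clock cycles in a duration, given the clock in MHz.
   1 MHz means 1 cycle per 1000 ns, hence the division by 1000. */
double ns_to_cycles(double ns, double clock_mhz) {
    return ns * clock_mhz / 1000.0;
}

/* Duration in nanoseconds of a number of cycles at a clock in MHz. */
double cycles_to_ns(double cycles, double clock_mhz) {
    return cycles * 1000.0 / clock_mhz;
}
```

At 400 MHz, 2.5 ns is 1 cycle and 5.0 ns is 2 cycles, matching the figures above. By the same conversion, a single-cycle add at 200 MHz takes 5 ns, so the 2-cycle float add at 400 MHz costs the same wall-clock time as a 1-cycle float add at 200 MHz.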

Anyway, this started because of an application (music) that has been developed for an STM32F746 which runs at 200 MHz supporting 32 voices.

The same code on the STM32H743 at 400 MHz supports 40 voices. 64 would be the expected maximum, but 40 seems low: a ratio of only 1.25 to 1 for double the clock.

I wanted to know why the H7 didn't support a few more voices and I think I see why. It's a lot of arithmetic using float type.

Singh.Harjit · ‎2019-06-09

If your code can work with integers/fixed point, you may be able to use the DSP instructions which are SIMD (single instruction, multiple data). This should work on the F7 and H7 parts. Actually, DSP instructions are supported on the F1, F2, F3, F4, etc. parts.
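To illustrate what "SIMD" means for these DSP instructions, here is a portable C sketch of a dual 16-bit add: two independent signed 16-bit lanes packed into one 32-bit word, added with no carry between lanes. On cores that have the DSP extension this corresponds to a single instruction (SADD16, exposed in CMSIS as the `__SADD16` intrinsic). The helper names `pack16` and `dual_add16` are hypothetical, and this emulation is for explanation only.

```c
#include <stdint.h>

/* Pack two signed 16-bit values into one 32-bit word (hi in bits 31..16). */
uint32_t pack16(int16_t hi, int16_t lo) {
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Add the two 16-bit lanes of a and b independently.
   Each lane wraps on overflow; no carry crosses from the low lane
   into the high lane. */
uint32_t dual_add16(uint32_t a, uint32_t b) {
    uint32_t lo = (a + b) & 0x0000FFFFu;          /* low lane, carry discarded */
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;  /* high lane, carry shifted out */
    return hi | lo;
}
```

Each lane carries only 16 bits, so this style of SIMD halves the per-sample precision compared with a 32-bit integer or float.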

MikeDB · ‎2019-06-10

The GCC compiler doesn't seem to understand the H7 properly, as I've seen a similar thing. And with a music program, too!

-O3 is unusable, whilst it's usually a toss-up between -O1 and -O2. Although the F7 is also dual issue, the H7 seems to have different memory timings that give it more gains.

But to be honest if you want more voices, go for one or more H750s - cheap, faster processor core, huge amount of RAM and enough Flash to store most music generation programs in. I'm getting over 200 voices per MCU, but obviously not all voices are created equal =)

Unfortunately the ARM SIMD instructions are useless for music - not enough resolution. But if you go for integer arithmetic a little assembly code can double the speed of the -O2 setting.
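As a rough idea of what 32-bit integer arithmetic for audio can look like, here is a sketch of a fixed-point multiply. The Q1.31 format (a 32-bit value representing a fraction in the range -1.0 to just under 1.0) is an assumption chosen for illustration, not something stated in the thread, and `q31_mul` is a hypothetical name. This is plain C, not the hand-written assembly mentioned above.

```c
#include <stdint.h>

/* Multiply two Q1.31 fixed-point values.
   The 64-bit product is in Q2.62; shifting right by 31 returns to Q1.31.
   Assumes an arithmetic right shift for negative values (true for gcc).
   The single overflow case, (-1.0) * (-1.0), is not handled here. */
int32_t q31_mul(int32_t a, int32_t b) {
    int64_t product = (int64_t)a * (int64_t)b;
    return (int32_t)(product >> 31);
}
```

For example, 0x40000000 represents 0.5, and multiplying it by itself yields 0x20000000, which represents 0.25. Unlike the 16-bit SIMD lanes, every sample keeps the full 32 bits.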

Piranha · ‎2019-06-10

The F1 and F2 have a Cortex-M3 core, which doesn't have the DSP instructions.

Singh.Harjit · ‎2019-06-10

My bad. Thanks for catching that.