2021-10-07 12:50 PM
On an 'F4, I was working on a lengthy piece of floating-point calculation and had to benchmark it, so I thought I'd share some of the results. Basically, this is a linear calculation with a moderate number of constants and quite a liberal amount of sinf, cosf, sqrtf and kin. I kept it strictly float (single-precision) - the FPU is an amazing piece of hardware in an mcu. A taster from the disasm:
800aee4: ed94 7a0d vldr s14, [r4, #52] ; 0x34
800aee8: ee29 0a00 vmul.f32 s0, s18, s0
800aeec: ee60 7a00 vmul.f32 s15, s0, s0
800aef0: ed84 0a0e vstr s0, [r4, #56] ; 0x38
800aef4: eee7 7a07 vfma.f32 s15, s14, s14
800aef8: eeb0 0a67 vmov.f32 s0, s15
800aefc: eef1 7ac0 vsqrt.f32 s15, s0
800af00: eef4 7a67 vcmp.f32 s15, s15
800af04: eef1 fa10 vmrs APSR_nzcv, fpscr
800af08: f040 81fc bne.w 800b304
800af0c: edd4 0a0d vldr s1, [r4, #52] ; 0x34
800af10: ed94 0a0e vldr s0, [r4, #56] ; 0x38
800af14: edc4 7a0f vstr s15, [r4, #60] ; 0x3c
800af18: f034 f8ac bl 803f074 <atan2f>
800af1c: edd4 8a05 vldr s17, [r4, #20]
800af20: ed84 0a10 vstr s0, [r4, #64] ; 0x40
800af24: eeb0 8a40 vmov.f32 s16, s0
800af28: eeb0 0a68 vmov.f32 s0, s17
800af2c: f033 ff6a bl 803ee04 <sinf>
800af30: eeb0 ba40 vmov.f32 s22, s0
800af34: eeb0 0a68 vmov.f32 s0, s17
800af38: f033 feec bl 803ed14 <cosf>
800af3c: edd4 7a07 vldr s15, [r4, #28]
800af40: 4f71 ldr r7, [pc, #452] ; (800b108)
800af42: f8df 81c8 ldr.w r8, [pc, #456] ; 800b10c
800af46: ee38 8a27 vadd.f32 s16, s16, s15
800af4a: eef0 ba40 vmov.f32 s23, s0
800af4e: eeb0 0a48 vmov.f32 s0, s16
800af52: f033 ff57 bl 803ee04 <sinf>
800af56: eeb0 9a40 vmov.f32 s18, s0
800af5a: eeb0 0a48 vmov.f32 s0, s16
800af5e: f033 fed9 bl 803ed14 <cosf>
800b108: 0806efac .word 0x0806efac
800b10c: 10001fa0 .word 0x10001fa0
I know this is not your typical embedded program, but the numbers are here, for what they are worth:
          (reset     (normal
          default)   full speed)
LATENCY   0WS        5WS          5WS     5WS     5WS     5WS
ICACHE    0          1            1       1       0       0
DCACHE    0          1            1       0       1       0
PREFETCH  0          1            0       1       1       0
cyc       45082      54097        57292   57536   74179   94830
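To make the table easier to read, here is a small host-side sketch that expresses each column as a percentage of extra cycles relative to the "normal full speed" column. The cycle counts are copied from the table; the names (overhead_percent, print_overheads, the configuration labels) are made up for illustration and are not part of the original measurement code.

```c
#include <stdio.h>

/* Labels are hypothetical; cycle counts are copied from the table, in column order. */
static const char *const config_name[] = {
    "0WS, all off (reset default)",
    "5WS, all on (full speed)",
    "5WS, prefetch off",
    "5WS, D-cache off",
    "5WS, I-cache off",
    "5WS, all off",
};
static const unsigned long config_cycles[] = {
    45082UL, 54097UL, 57292UL, 57536UL, 74179UL, 94830UL,
};
enum { CONFIG_COUNT = 6, BASELINE_INDEX = 1 };

/* Extra cycles relative to a baseline, in percent (negative means fewer cycles). */
double overhead_percent(unsigned long cycles, unsigned long baseline)
{
    return 100.0 * ((double)cycles - (double)baseline) / (double)baseline;
}

/* Print one line per configuration, relative to the full-speed column. */
void print_overheads(void)
{
    for (int i = 0; i < CONFIG_COUNT; ++i) {
        printf("%-30s %6lu cyc  %+6.1f%%\n", config_name[i], config_cycles[i],
               overhead_percent(config_cycles[i], config_cycles[BASELINE_INDEX]));
    }
}
```

Read this way, switching off prefetch or the D-cache alone costs roughly 6% each, switching off the I-cache costs roughly 37%, and switching off all three costs roughly 75%. The reset-default column needs fewer cycles, presumably because zero wait states are only usable at the low reset clock, so fewer cycles there does not mean less wall-clock time.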
JW
2021-10-07 01:41 PM
>>quite liberal amount of sinf, cosf, sqrtf and kin.
I miss the days when we had real FPUs that could do that stuff with high precision.
Imagine what it could do on current semi processes.
http://datasheets.chipdb.org/Motorola/68882/mc68882.pdf
2021-10-08 02:15 PM
The code mentioned above is littered with sinf()/cosf(), there must be at least three dozen of them, so naturally the "benchmarking" focused initially on those. I found that they are (to me) surprisingly fast, around 300 cycles, if the argument is normalized (within +-pi/2). Even with a small unnormalized argument it was only 400-something, but above a threshold - as I've found out, around 128*pi/2 - things got nasty and the cycle count suddenly jumped into the thousands. So, with some tweaking of the underlying algorithm, I managed to get rid of those cases and cut the total execution time almost by half (the above figures are for the final product, "optimized" code).
So I looked at the MC68882's User Manual to check the timing of sin/cos. And, to my surprise, it's somewhat above 300 cycles (sure, that's with extended-precision data, but that's the power of a hardware adder/multiplier for you). But that is probably to be expected; at the end of the day, the microcode in the MC68882 is no miracle. Normalization is said to take around 100 cycles, which is also similar (the nastiness of libc's handling of large arguments may just reflect some tradeoff between FLASH usage (tables) and utility; I don't intend to dig deep into this).
Sure, the difference is that while the external coprocessor runs, the main processor can continue to execute code which does not depend on the result. But this is often tricky to utilize - for example, as seen from the disasm above, my code is densely packed float, and I don't quite know how I would intersperse unrelated code other than by writing asm manually in the most tedious way.
So, yeah, maybe it would be relatively cheap to add microcode capability to the Cortex-M FPU (maybe even RAM-based), and it would surely be a cool addition, maybe worth it, even if hard to use practically (but then there are already examples of that, e.g. the SIMD instructions).
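For illustration only, one common way to keep sinf()/cosf() arguments below such a threshold is to wrap the angle back into roughly [-pi, pi] before the call. This is a minimal sketch and not necessarily the tweak used here; wrap_angle and the two constants are hypothetical names. It assumes the angle is small enough that the number of whole turns fits in an int, and it deliberately avoids library calls so that on a Cortex-M4 it should compile to a handful of FPU instructions.

```c
/* Hypothetical constants: one full turn and its reciprocal, single precision. */
#define TWO_PI_F     6.28318530717959f
#define INV_TWO_PI_F 0.159154943091895f

/*
 * Wrap an angle in radians into approximately [-pi, pi].
 * Assumption: |x| is small enough for the turn count to fit in an int;
 * NaN, infinity and huge values are not handled.
 */
float wrap_angle(float x)
{
    float t = x * INV_TWO_PI_F;                         /* angle in turns */
    /* Round to the nearest whole turn; the int cast truncates toward zero. */
    int turns = (int)(t >= 0.0f ? t + 0.5f : t - 0.5f);
    return x - (float)turns * TWO_PI_F;                 /* remove whole turns */
}
```

A call such as sinf(wrap_angle(phase)) then only ever sees a small argument. The subtraction is done in single precision, so the reduced angle loses accuracy as the input grows; for an incrementally updated phase that is wrapped on every step this should not matter much.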
OTOH, I'm quite certain ARM and ST are not interested in real-world applications anymore, surfing on that IoT/security nonsense wave.
JW
2021-10-10 01:31 PM
The IoT itself is real - the network printers, storage, cameras, payment terminals etc. It's the marketing "experts" who cannot imagine anything more than a "very important" wireless sensor with an LTE modem sending data to the cloud to be processed by AI... to switch the room lighting on/off.