ART Accelerator allowing 0-wait state execution

waclawek.jan · ‎2021-10-07

On an 'F4, I was working on a lengthy piece of floating-point calculation, and had to benchmark it, so I thought I'd share some of the results. Basically, this is a linear calculation with a moderate amount of constants and quite liberal amount of sinf, cosf, sqrtf and kin. I kept it strictly float (single-precision) - the FPU is an amazing piece of hardware in an mcu. A taster from the disasm:

800aee4:	ed94 7a0d 	vldr	s14, [r4, #52]	; 0x34
 800aee8:	ee29 0a00 	vmul.f32	s0, s18, s0
 800aeec:	ee60 7a00 	vmul.f32	s15, s0, s0
 800aef0:	ed84 0a0e 	vstr	s0, [r4, #56]	; 0x38
 800aef4:	eee7 7a07 	vfma.f32	s15, s14, s14
 800aef8:	eeb0 0a67 	vmov.f32	s0, s15
 800aefc:	eef1 7ac0 	vsqrt.f32	s15, s0
 800af00:	eef4 7a67 	vcmp.f32	s15, s15
 800af04:	eef1 fa10 	vmrs	APSR_nzcv, fpscr
 800af08:	f040 81fc 	bne.w	800b304 
 800af0c:	edd4 0a0d 	vldr	s1, [r4, #52]	; 0x34
 800af10:	ed94 0a0e 	vldr	s0, [r4, #56]	; 0x38
 800af14:	edc4 7a0f 	vstr	s15, [r4, #60]	; 0x3c
 800af18:	f034 f8ac 	bl	803f074 <atan2f>
 800af1c:	edd4 8a05 	vldr	s17, [r4, #20]
 800af20:	ed84 0a10 	vstr	s0, [r4, #64]	; 0x40
 800af24:	eeb0 8a40 	vmov.f32	s16, s0
 800af28:	eeb0 0a68 	vmov.f32	s0, s17
 800af2c:	f033 ff6a 	bl	803ee04 <sinf>
 800af30:	eeb0 ba40 	vmov.f32	s22, s0
 800af34:	eeb0 0a68 	vmov.f32	s0, s17
 800af38:	f033 feec 	bl	803ed14 <cosf>
 800af3c:	edd4 7a07 	vldr	s15, [r4, #28]
 800af40:	4f71      	ldr	r7, [pc, #452]	; (800b108)
 800af42:	f8df 81c8 	ldr.w	r8, [pc, #456]	; 800b10c
 800af46:	ee38 8a27 	vadd.f32	s16, s16, s15
 800af4a:	eef0 ba40 	vmov.f32	s23, s0
 800af4e:	eeb0 0a48 	vmov.f32	s0, s16
 800af52:	f033 ff57 	bl	803ee04 <sinf>
 800af56:	eeb0 9a40 	vmov.f32	s18, s0
 800af5a:	eeb0 0a48 	vmov.f32	s0, s16
 800af5e:	f033 fed9 	bl	803ed14 <cosf>
 
 800b108:	0806efac 	.word	0x0806efac
 800b10c:	10001fa0 	.word	0x10001fa0

I know this is not your typical embedded program, but the numbers are here, for what they are worth:

            (reset    (normal
            default)  full speed)
LATENCY     0WS       5WS         5WS      5WS     5WS     5WS
ICACHE      0         1           1        1       0       0
DCACHE      0         1           1        0       1       0
PREFETCH    0         1           0        1       1       0
cyc         45082     54097       57292    57536   74179   94830
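
For reference, the four rows of the table map onto fields of the FLASH_ACR register. Below is a minimal sketch of how such a configuration value could be assembled. The helper name flash_acr_value is hypothetical, and the bit positions (LATENCY in bits 2:0, PRFTEN bit 8, ICEN bit 9, DCEN bit 10) are my assumption for an STM32F405/407-class part; check them against the reference manual of the actual device.

```c
#include <stdint.h>

/* Assumed FLASH_ACR layout for an STM32F405/407-class device. */
#define ACR_LATENCY_MASK 0x7u        /* wait states, bits 2:0 */
#define ACR_PRFTEN       (1u << 8)   /* prefetch enable */
#define ACR_ICEN         (1u << 9)   /* instruction cache enable */
#define ACR_DCEN         (1u << 10)  /* data cache enable */

/* Hypothetical helper: build the FLASH_ACR value for one table column. */
static uint32_t flash_acr_value(unsigned latency, int icache,
                                int dcache, int prefetch)
{
    uint32_t acr = latency & ACR_LATENCY_MASK;
    if (icache)   acr |= ACR_ICEN;
    if (dcache)   acr |= ACR_DCEN;
    if (prefetch) acr |= ACR_PRFTEN;
    return acr;
}
```

On the target, the result would be written to FLASH->ACR (CMSIS naming) before running the code under test; for example, the "normal full speed" column corresponds to flash_acr_value(5, 1, 1, 1).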

JW

Tesla DeLorean · ‎2021-10-07

>>quite liberal amount of sinf, cosf, sqrtf and kin.

I miss the days when we had real FPUs that could do that stuff with high precision..

Imagine what it could do on current semi processes.

http://datasheets.chipdb.org/Motorola/68882/mc68882.pdf


waclawek.jan · ‎2021-10-08

The code mentioned above is littered with sinf()/cosf(), there must be at least three dozen of them, so naturally the "benchmarking" focused initially on those. I found out that they are (to me) surprisingly fast, around 300 cycles, if the argument is normalized (+-pi/2). Even with a small unnormalized argument it was down to 400something, but above a threshold - as I've found out, around 128*pi/2 - things got nasty and cycles suddenly jumped to thousands. So, with some tweaking of the underlying algorithm, I managed to get rid of those and cut the total execution time almost by half (the above figures are for the final product, "optimized" code).

So I looked at the MC68882's User Manual to check the timing of sin/cos. And, to my surprise, it's somewhat above 300 cycles (sure, extended precision data, but that's the power of a hardware adder/multiplier for you). But that is probably to be expected; at the end of the day, the microcode in the MC68882 is no miracle. Normalization is said to be around 100 cycles, which is also similar (the nastiness of the large argument's optimization in libc may be just that there's some tradeoff between FLASH usage (tables) and utility; I don't intend to dig deep into this).

Sure, the difference is that while the external coprocessor runs, the main processor can continue to execute code which does not depend on the result. But this is often tricky to utilize - for example, as seen from the above disasm, my code is densely packed float, and I don't quite know how I would intersperse unrelated code in any other way than writing asm manually in the most tedious way.

So, yeah, maybe it would be relatively cheap to add microcode capability to the Cortex-M FPU (maybe even RAM-based) and it would surely be a cool addition, maybe worth it, even if hard to use practically (but then there are already examples of that, e.g. the SIMD instructions).

OTOH, I'm quite certain ARM and ST are not interested in real-world applications anymore, surfing on that IoT/security nonsense wave.

JW

Piranha · ‎2021-10-10

The IoT itself is real - the network printers, storage, cameras, payment terminals etc. It's the marketing "experts", who cannot imagine anything more than a "very important" wireless sensor with an LTE modem sending data to the cloud to be processed by AI... to switch room lighting on/off.