2017-03-20 12:12 PM
Hi.
I'm reading from 1025-entry float lookup tables, where the extra last item is there so the data can loop. The input to the function is a 32-bit phase value: the top 10 bits are used as the table index and the remaining 22 bits are the fractional part used for interpolation.
This is the function
float readinterpolated(const uint32_t x, const float *datapt)
{
    uint32_t coarse = x >> 22;                    /* top 10 bits: table index */
    float fine = (x % 4194304) / 4194304.0f;      /* low 22 bits scaled to [0,1) */
    return datapt[coarse] + (datapt[coarse + 1] - datapt[coarse]) * fine;
}
and the generated assembly code is
lsrs r3, r0, #22 @ coarse, x,
ubfx r2, r0, #0, #22 @ D.12138, x,,
add r1, r1, r3, lsl #2 @ tmp127, datapt, coarse,
vmov s15, r2 @ int @ D.12138, D.12138
vldr.32 s0, [r1] @ D.12139, *_8
vcvt.f32.s32 s15, s15, #22 @ fine, D.12138,
vldr.32 s14, [r1, #4] @ *_12, *_12
vsub.f32 s14, s14, s0 @ D.12139, *_12, D.12139
vfma.f32 s0, s15, s14 @, fine, D.12139
bx lr @
Is this the fastest way of doing it? I'm using it in a DSP application to generate waveforms, and I need the best performance I can get, since the application is already running near full capacity without all of its features added yet.
2017-03-22 02:22 AM
Looks reasonably optimal. Using asm, inline or a full function, you could tweak the ordering of instructions, which, depending on the particular core (M4/M7), might result in better utilization of the parallelism between the integer and float execution units. You could also use a secondary table with precalculated differences (see the sketch below). The placement of the code and of the tables (FLASH/RAM/TCM, or cached memory on the M7) might make a difference too, and there may be some tweaking available outside of the code you presented.
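A rough, untested sketch of the difference-table idea (the function and table names are just placeholders):

/* Untested sketch: delta[i] holds a precomputed datapt[i+1] - datapt[i],
   so the subtraction disappears from the per-sample path, at the cost of
   another 1024-float table. */
float readinterpolated_delta(const uint32_t x, const float *datapt, const float *delta)
{
    uint32_t coarse = x >> 22;
    float fine = (x % 4194304) / 4194304.0f;
    return datapt[coarse] + delta[coarse] * fine;
}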
Expect hard work and no miracles, though.
JW
2017-03-22 04:55 AM
The dual table might help, or a 2D array with 2x1024 entries. Strikes me that a 64-bit load and interleaving the dependencies might buy a few cycles; something along the lines of the sketch below.
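An untested sketch of that, assuming a 1024-entry table of value/delta pairs built ahead of time:

/* Untested sketch: pair[i][0] = datapt[i], pair[i][1] = datapt[i+1] - datapt[i].
   Keeping the value and its delta adjacent lets the compiler (or hand-written
   asm) fetch both with a single 64-bit load and keeps them in one cache line. */
float readinterpolated_pair(const uint32_t x, const float (*pair)[2])
{
    uint32_t coarse = x >> 22;
    float fine = (x % 4194304) / 4194304.0f;
    return pair[coarse][0] + pair[coarse][1] * fine;
}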