stm32h732 matrix-vector multiply throughput

Eelco · ‎2024-05-10

I am trying to get a feel for how much throughput I should expect from this type of processor on 128x128 f32 matrix-vector multiplies (for the purpose of running small control neural networks on the m7 core).

As I understand the m7 core, it should be capable of executing one fused multiply-add per clock cycle as long as there is data in the core. I don't have a definite reference for that, so if anyone knows more I'd love to hear it.

What I find most difficult to appreciate is the memory system.

As I understand, the tightly-coupled-memory (TCM) can be read with zero-latency; I imagine that means it takes a single cycle and that latency can get hidden by the instruction pipeline.

The way I see it, I could page matrix entries from data-sram into dTCM using DMA.

Moreover, I could page matrix entries from flash into data-sram using DMA.
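To make the paging idea concrete, here is a minimal sketch of a ping-pong (double-buffered) matrix-vector product. Everything target-specific is an assumption: `start_transfer` and `wait_transfer` are hypothetical stand-ins for a DMA driver (a blocking `memcpy` here, so it runs on any host), `TILE_ROWS` is an arbitrary tile height, and the `tile` buffers are only imagined to live in dTCM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 128          /* matrix is N x N, as in the question */
#define TILE_ROWS 8    /* hypothetical tile height; tune to the dTCM budget */

/* Hypothetical stand-in for starting a DMA transfer. On the real part this
 * would program a DMA channel and return immediately; here it is a
 * blocking memcpy so the sketch runs anywhere. */
static void start_transfer(float *dst, const float *src, size_t count)
{
    memcpy(dst, src, count * sizeof(float));
}

/* Hypothetical stand-in for waiting on transfer completion (no-op here). */
static void wait_transfer(void)
{
}

/* y = A * x, streaming A through two tile buffers (ping-pong). */
void matvec_paged(const float *A, const float *x, float *y)
{
    static float tile[2][TILE_ROWS * N];  /* assumed to sit in fast memory */
    int cur = 0;

    assert(N % TILE_ROWS == 0);
    start_transfer(tile[cur], A, TILE_ROWS * N);

    for (size_t r0 = 0; r0 < N; r0 += TILE_ROWS) {
        wait_transfer();  /* tile[cur] is now valid */

        /* Kick off the next tile before computing on the current one. */
        if (r0 + TILE_ROWS < N)
            start_transfer(tile[cur ^ 1], A + (r0 + TILE_ROWS) * N,
                           TILE_ROWS * N);

        for (size_t r = 0; r < TILE_ROWS; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < N; ++c)
                acc += tile[cur][r * N + c] * x[c];
            y[r0 + r] = acc;
        }
        cur ^= 1;  /* swap buffers */
    }
}
```

With these numbers the two tiles take 8 KB. Whether the copy really overlaps with the compute depends entirely on the DMA and bus behaviour I am asking about, so this only shows the control flow, not the achievable speed.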

The questions I am most in the dark about are about bandwidth:

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from data-sram to dTCM?

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from flash to data-sram?
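To put numbers on why bandwidth matters here: in a matrix-vector product every matrix element is used exactly once, so sustaining one multiply-add per cycle needs one fresh 4-byte matrix element per cycle (the vector can stay resident). A back-of-envelope sketch, where the one-multiply-add-per-cycle rate is my unverified assumption from above and the clock frequency is just a parameter:

```c
#include <assert.h>
#include <stddef.h>

/* Multiply-accumulates in one rows x cols matrix-vector product. */
static size_t matvec_macs(size_t rows, size_t cols)
{
    return rows * cols;
}

/* Bytes of f32 matrix data streamed per product (each element used once). */
static size_t matvec_matrix_bytes(size_t rows, size_t cols)
{
    return rows * cols * sizeof(float);
}

/* Lower bound on the time per product in microseconds, for a given core
 * clock and an ASSUMED sustained multiply-accumulate rate per cycle. */
static double matvec_ideal_us(size_t rows, size_t cols,
                              double clock_hz, double macs_per_cycle)
{
    assert(clock_hz > 0.0 && macs_per_cycle > 0.0);
    return (double)matvec_macs(rows, cols) / macs_per_cycle / clock_hz * 1e6;
}
```

For 128x128 that is 16384 multiply-adds and 64 KiB of matrix data per product. At a hypothetical 480 MHz and one multiply-add per cycle, `matvec_ideal_us(128, 128, 480e6, 1.0)` gives a floor of roughly 34 microseconds; any transfer rate below one f32 per cycle raises that floor proportionally.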

And perhaps even more useful than these theoretical questions would be real-world benchmarks on this type of workload. If anyone knows of any, I'd love to see them. Experiences with similar processor models are also definitely welcome.

Eelco · ‎2024-05-11

Thanks; that does indeed go into some more detail: I like this part

`It is split into two DTCM-RAMs with 32-bit access each. Both memories are connected respectively to the D0TCM and D1TCM ports of the Cortex®-M7 (not represented in the figures) and can be used in parallel (for load/store operations) thanks to the Cortex®-M7 dual issue capability`

So it seems I could load an f32 matrix and vector component in a single cycle if both are stored in different DTCM banks. Which low-key contradicts what I read about the FMAC; but that might have been an FMAC specific limitation then.
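For reference, the inner loop I have in mind looks roughly like the sketch below: one load from the matrix row and one from the vector per multiply-add. The four independent accumulators are a generic trick to keep consecutive multiply-adds from waiting on each other; whether that helps on the M7, and how addresses map onto the two DTCM banks, is not something the quote tells me, so the sketch does not try to control placement at all. The function names are mine.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of one matrix row with the vector, using four independent
 * accumulators so consecutive multiply-adds do not depend on each other.
 * Requires n to be a multiple of 4. */
static float dot_unrolled4(const float *row, const float *x, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        acc0 += row[i + 0] * x[i + 0];
        acc1 += row[i + 1] * x[i + 1];
        acc2 += row[i + 2] * x[i + 2];
        acc3 += row[i + 3] * x[i + 3];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}

/* y = A * x for a row-major rows x cols matrix A. */
void matvec(const float *A, const float *x, float *y,
            size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        y[r] = dot_unrolled4(A + r * cols, x, cols);
}
```

Note that splitting the sum across four accumulators changes the order of the additions, so the result can differ from a plain sequential loop in the last bits.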

Piecing together all the tidbits of information I've found, it seems to me that someone who knows exactly what they are doing could get one fused-multiply-add per cycle out of the stm32h7... but I also suspect that it's going to be a lot of work to get there, manually orchestrating all the memory management from flash to ram to TCM to core. Writing a single matmul benchmark might not be too much work, but tying it together into an actual neural network would probably end up a little like writing your own cube.ai code generator.

Speaking of trying things out: I should certainly give cube.ai a spin, to see if it's anything nearly as bad as the paper linked above for RNNs. Maybe I'm lucky, and the autogenerated code will probably teach me a lot.

LCE · ‎2024-05-11

I have no real idea what this is about - but it looks interesting! ;)

Therefore, just 2 things:

- DMA has no access to TCM

- grab a H723 Nucleo and try, it's only about 30 $ / €

MasterT · ‎2024-05-11

"Piecing together all the tidbits of information ive found, it seems to me that someone who knows exactly what they are doing could get one fused-multiply-add per cycle out of the stm32h7.."

I've been in a similar situation, experimenting with FFT (butterfly operation, multiply-sum-accumulate in the core). I was not able to get any better than a 2 MSPS processing rate, about 240 instructions per sample on a 480 MHz uCPU. Even counting the multiply-add of two complex numbers, that's more than 20 cycles per single operation.

My understanding is that the Cortex-M7 belongs to a different kind of "big-farma", and the same applies to GCC, so ST is likely not the one to blame for such low performance.

Eelco · ‎2024-05-11

According to the STM32H7 documentation, the MDMA does have access to the DTCM; so that should be good?

As mentioned before, the $30 isn't the issue here. The issue is the many months it'd take me to convince myself I'd coded up a benchmark that's representative of what the chip is capable of.

Eelco · ‎2024-05-11

Thanks, that's interesting to know. Not sure what the big-farma refers to here though.

20 cycles per single operation indeed does not sound good (but it is in line with other real-world experience, like the paper I referred to previously).

That being said, an FFT is trickier than matrix-vector products, with a much more scattered memory access pattern. Certainly if you'd try an FFT that doesn't fit fully into TCM, I bet you are in trouble, from what I understand of this arch; but I think I can get away with working purely on sequential data in TCM, which should be a best-case scenario.

How far did you dive into these experiments? Did you try and use a lot of stm32h7 specifics like the TCM, or were you hoping that vanilla-C with GCC would mostly 'just work'?

LCE · ‎2024-05-11

Oops, when it comes to DMA I always think of the peripherals.

So yes, according to the reference manual, MDMA can access DTCM.

MasterT · ‎2024-05-11

Typo, I meant big-pharma: the uCPU architecture is defined by monsters in the shadows.

I stopped the research when I realized that even a 32k FFT (128 kbytes) in float math doesn't fit into fast memory, plus there is the memory segmentation in the H7.

Another unspecified term in the equation is caching, though data doesn't flow between MEM <-> ALU directly:

/* Enable the Cortex-M7 instruction and data caches (CMSIS core functions). */
static void CPU_CACHE_Enable(void)
{
    SCB_EnableICache();
    SCB_EnableDCache();
}

Eelco · ‎2024-05-11

I'm hoping default instruction caching will do for my use case; though as for data I guess it's manual layout and management all the way, if you want to get anywhere close to theoretical throughput.