stm32h732 matrix-vector multiply throughput

Eelco · ‎2024-05-10

I am trying to gain an appreciation of how much f32 128x128 element matrix-vector multiply compute I should expect from this type of processor (for the purpose of running small control neural networks on the m7 core).

As I understand the m7 core, it should be capable of executing one fused-multiply-add per clock cycle as long as there is data in the core. I don't have a definite reference on that, so if anyone knows more I'd love to hear it.
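For concreteness, the workload in question can be sketched in plain C as below. This is only a reference loop, not tuned code; the function names are made up for illustration, and whether the inner statement becomes a single fused multiply-add depends on compiler flags. Each inner iteration needs two loads (one matrix entry, one vector entry) per multiply-accumulate, which is why the memory system matters at least as much as the FPU.

```c
#include <assert.h>
#include <stddef.h>

/* y = W * x for a row-major rows x cols f32 matrix.
 * Each inner iteration loads one matrix entry and one vector entry
 * and performs one multiply-accumulate. */
void matvec_f32(const float *w, const float *x, float *y,
                size_t rows, size_t cols)
{
    assert(w && x && y);
    for (size_t r = 0; r < rows; ++r) {
        const float *row = w + r * cols;
        float acc = 0.0f;
        for (size_t c = 0; c < cols; ++c) {
            /* May or may not compile to a fused multiply-add,
             * depending on the compiler and its flags. */
            acc += row[c] * x[c];
        }
        y[r] = acc;
    }
}

/* Number of multiply-accumulates in one matrix-vector product. */
size_t matvec_macs(size_t rows, size_t cols)
{
    return rows * cols;
}
```

For the 128x128 case, matvec_macs(128, 128) gives 16384 multiply-accumulates per product, each consuming a matrix entry that is used exactly once.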

What I find most difficult to appreciate is the memory system.

As I understand, the tightly-coupled-memory (TCM) can be read with zero-latency; I imagine that means it takes a single cycle and that latency can get hidden by the instruction pipeline.

The way I see it, I could page matrix entries from data-sram into dTCM using DMA.

Moreover, I could page matrix entries from flash into data-sram using DMA.

The questions I am most in the dark about are about bandwidth:

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from dram to dTCM

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from flash to dram

And perhaps even more useful than these theoretical questions would be real world benchmarks on this type of workload. If anyone knows of those I'd love to see them. Experiences with similar models of processors are also definitely welcome.

Eelco · ‎2024-05-10

I was just reading up on the 16 bit FMAC functionality; and I saw that this has a maximum throughput of 1 fmad per two core clock cycles, since it takes two load instructions to fetch both arguments. The document concerning the 16 bit FMAC did not explicitly promise it was possible to sustain that one op per two cycles given possible constraints on keeping the local memory fed with relevant data; though it seemed implied at least.

I suppose that implies I'm not going to get more than 1 f32 fmad per two clock cycles either. Or does it? The FMAC is additional silicon intended to run without bothering the rest of the core; I suppose its limitations do not necessarily imply limitations of the full core?

In any case... going for 16 bit quantization likely is an option for me; so understanding what the 16 bit FMAC can do is also interesting in itself.
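To make the 16 bit option concrete, here is a minimal sketch of a Q15 fixed-point dot product in plain C, i.e. running on the core itself and not on the FMAC peripheral. The function name and the rounding/saturation choices are assumptions for illustration. The main attraction for a memory-bound workload is that each matrix entry is 2 bytes instead of 4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Dot product of two Q15 vectors (real value = raw / 32768).
 * Each product is Q30; products are summed in a 64-bit accumulator so
 * long rows cannot overflow, then rounded and saturated back to Q15.
 * Assumes arithmetic right shift of negative values (as on GCC/Clang). */
int16_t dot_q15(const int16_t *a, const int16_t *b, size_t n)
{
    assert(a && b);
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    acc = (acc + (1 << 14)) >> 15;          /* Q30 -> Q15 with rounding */
    if (acc > INT16_MAX) acc = INT16_MAX;   /* saturate high */
    if (acc < INT16_MIN) acc = INT16_MIN;   /* saturate low */
    return (int16_t)acc;
}
```

For example, with a = {0.5, 0.5} and b = {0.5, -0.25} in Q15 (raw 16384, 16384 and 16384, -8192) the result is raw 4096, i.e. 0.125.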

AScha.3 · ‎2024-05-10

To know what the real timing is, just try it.

A simple loop, 1000x transfer (1000 int32 or f32 with MDMA -> to DTCM), then you know.
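A rough sketch of the bookkeeping for such a test, assuming the cycle count comes from a free-running 32-bit counter such as the Cortex-M DWT cycle counter (DWT->CYCCNT in CMSIS). The two helper names are made up for illustration, and the DMA setup itself is deliberately left out because it depends on the HAL and configuration in use.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
 * Unsigned subtraction stays correct across one counter wrap-around. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Average number of 32-bit words moved per core clock cycle. */
double words_per_cycle(uint32_t words, uint32_t cycles)
{
    assert(cycles > 0);
    return (double)words / (double)cycles;
}

/* Intended use on target (assumption: the cycle counter is already enabled):
 *
 *   uint32_t t0 = DWT->CYCCNT;
 *   ... start the 1000-word DMA transfer and wait for completion ...
 *   uint32_t t1 = DWT->CYCCNT;
 *   double rate = words_per_cycle(1000, elapsed_cycles(t0, t1));
 */
```

Repeating the measurement with cache and MPU settings changed, and with the CPU idle versus busy on the same bus, would show how far the result is from the 1 word per clock best case.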

+

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from dram to dTCM

The AXI bus is clocked at core speed (afaik), so 1 clk -> 1 32b transfer (maximum, if "nobody" else is requesting the bus, e.g. the CPU or another DMA...).

+

* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from flash to dram

Depends on the size of the transfer; flash is accessed 256b wide, at 3WS (4 clks) at max. speed (550MHz);

so 4 clks for the first word, then 1 clk each for the next 7 x 32b words.

See the RM (reference manual).
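The arithmetic behind those flash numbers can be written out as a small helper. This is only the simple model from the post above (first word of each 256b line costs the full latency, the remaining words one clock each); it ignores caching, prefetch and bus contention, and the function name and parameters are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks needed to read `words` consecutive 32-bit words from flash when
 * each flash line holds `line_words` words, the first word of a line
 * costs `first_clks` clocks and every further word of that line costs one. */
uint32_t flash_read_clks(uint32_t words, uint32_t line_words,
                         uint32_t first_clks)
{
    assert(line_words > 0);
    uint32_t full_lines = words / line_words;
    uint32_t rest = words % line_words;
    uint32_t clks = full_lines * (first_clks + line_words - 1);
    if (rest > 0) {
        clks += first_clks + rest - 1;   /* partially used last line */
    }
    return clks;
}
```

With 8 words per line and 4 clks for the first word, one full line takes 11 clks, i.e. about 0.73 words per clk. Under the same model, the 16384 words of a 128x128 f32 matrix take 22528 clks.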

If D-cache enabled and cache line loaded, then maybe without any waitstates .

But anyway : try a simple test, then you know.


Eelco · ‎2024-05-10

Just found this paper with some rare numbers relevant to my use case: some 5M fmads/s on an 80 MHz STM32L4 processor. That's a ratio less than what I'd be hoping for... but that's on a processor without an FMAC, and using cube.ai autogenerated C code. From the examples it seems quite heavily geared to convolutional applications, not RNNs, where you are dealing with a single token at a time at inference and thus are more constrained by memory bandwidth...

I probably should take a look at that cube.ai autogenerated code, to see what that looks like and if it seems like it ought to come close to making good use of the hardware, for my intended use case.

Eelco · ‎2024-05-10

Thanks; those numbers you mention inspire confidence.

But I don't have that much confidence in the 'just try it' approach, considering I'm very new to stm32 programming; and I expect that all my synthetic benchmarking would prove is that I have no clue what I'm doing. So I'm hoping to teach myself a bit of an understanding of what is going on, independent from what any narrow benchmark might (seem to) show.

Note that in my last reply I quoted a paper where they only achieve about 1 fmad every 16 clock cycles, with code generated by cube.ai; perhaps that isn't the best generated code but I also wouldn't expect it to be the worst. That's for an stm32L4; but yeah...
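For reference, the 16 cycles figure is just the quoted 80 MHz clock divided by the quoted 5M fmads/s. The sketch below spells that out and extrapolates to a 128x128 product; the extrapolation assumes, purely for illustration, that the same cycles-per-fmad cost would carry over to a 550 MHz part, which nothing in the paper promises. Function names are made up.

```c
#include <assert.h>

/* Core clock cycles spent per multiply-accumulate. */
double cycles_per_mac(double clock_hz, double macs_per_s)
{
    assert(macs_per_s > 0.0);
    return clock_hz / macs_per_s;
}

/* Seconds for one rows x cols matrix-vector product, given a cost per
 * multiply-accumulate in cycles and a core clock in Hz. */
double matvec_seconds(double rows, double cols,
                      double cyc_per_mac, double clock_hz)
{
    assert(clock_hz > 0.0);
    return rows * cols * cyc_per_mac / clock_hz;
}
```

cycles_per_mac(80e6, 5e6) gives 16. At that cost, matvec_seconds(128, 128, 16, 550e6) comes out at roughly 0.48 ms per 128x128 product, compared with about 0.03 ms if one fmad per cycle could actually be sustained.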

Uwe Bonnes · ‎2024-05-10

But the layman's 'just try it' approach will at least give a lower bound...

AScha.3 · ‎2024-05-10

>about 1 fmad

I have no idea what this is, I just think: maybe float MAC, or so.

But anyway : try a simple test, then you know. :)

-> this CPU is really complex, so it's almost impossible to say: oh, this instruction executes in 1 cycle, + store cycle...

No, that depends on many "side effects": first, the compiler/optimizer settings, which can change speed by about 300..500%, up to the speed it can really go; but it also depends on how much is in cache, or not;

so, no joke, it's easier and faster to run a test on the real CPU and modify the settings for cache, MPU, etc.,

than to talk about it or try to simulate it...

See some speed tests :

https://www.eembc.org/coremark/scores.php


MasterT · ‎2024-05-10

Read:

AN4891
Application note
STM32H72x, STM32H73x, and single-core STM32H74x/75x
system architecture and performance

Eelco · ‎2024-05-10

True. From what I've read I'm willing to assume a roughly 1/10 ops per cycle lower bound... but the amount of work before I've created a relevant benchmark that narrows that down does seem like multiple weeks.... I think I'd rather keep searching around a bit more first.

It's interesting; I suppose it's just that the user base of these embedded chips is a lot thinner than, say, Intel or AMD... from the non-embedded world I'm used to questions like this having simple answers, but that's not the impression I'm getting from asking questions about embedded chips in various places.

Eelco · ‎2024-05-11

I'm not talking about trying to simulate; just looking for real world data / experience of people with similar workloads. Given that I know how hard synthetic benchmarking is on non-embedded systems even if you think you know them well, 'just try' seems a little pointless to me for a noob such as myself; I've got no idea what it takes to make my benchmark neither ridiculously optimistic nor pessimistic relative to a real world use scenario. The real question isn't 'can't I do this' but rather 'should I hire someone experienced to do this'... but before I do so I do feel obliged to gauge how much of a point there is to it.

Sadly compound benchmarks like coremark don't tell me very much. The scenario I'm looking at, of matrix-vector products specifically, is maximally memory-bandwidth-bound; every matrix entry needs to be hoisted into the core once, to be used once; and the vector entry as well. This thing could be a beast at anything happening in-core, but if the memory system won't keep up it's of no use.

The cube.ai website has a lot of performance figures, but they are all for convolutional networks, which have token-level parallelism, which drastically alters the ratio of compute to memory bandwidth. Also it seems that cube.ai has only the most barebones support for recurrent architectures; I don't really think I'd be able to use it and I'd have to write my own code anyway.
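The difference in compute-to-bandwidth ratio can be put into numbers with a small sketch. It assumes an idealised traffic model in which every matrix entry is fetched exactly once and every input and output element is moved exactly once; the function name and the "batch" parameter (number of input vectors sharing the same matrix, as in the token-parallel case) are made up for illustration.

```c
#include <assert.h>

/* Multiply-accumulates per byte of memory traffic when a rows x cols
 * matrix is applied to `batch` input vectors at once. Idealised model:
 * each matrix entry is fetched once, each input and output element is
 * moved once, and every element takes `bytes_per_elem` bytes. */
double macs_per_byte(double rows, double cols, double batch,
                     double bytes_per_elem)
{
    assert(bytes_per_elem > 0.0);
    double macs = rows * cols * batch;
    double elems = rows * cols + cols * batch + rows * batch;
    return macs / (elems * bytes_per_elem);
}
```

For a single f32 vector, macs_per_byte(128, 128, 1, 4) is a little under 0.25, i.e. roughly one 4-byte fetch per multiply-accumulate. With 64 vectors sharing the matrix, macs_per_byte(128, 128, 64, 4) is 8, which is why figures for convolutional networks say little about the single-token case.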