
Simultaneous DMAs?

turboscrew
Senior III

I've been trying to find out what exactly this means:

"The multi-layer bus matrix allows masters to perform data transfers concurrently as long as they are addressing different slave modules."

(AN4031 Using the STM32F2, STM32F4 and STM32F7 Series DMA controller, 2.1 Multi-layer bus matrix)

What I wonder is: are the DMA transfers between different slaves "invisible" to each other - are the transfers truly concurrent?

If there is a DMA from, say, internal SRAM to FMC-bus, how does that affect the core/queues?

(That is: are the queues in the core-side of the bus matrix?)

How about DMA transfer from external SRAM to external SRAM: does the DMA interfere with the core access to internal SRAM/internal FLASH?

Can continuous DMA transfer slow down the core altogether, or is the bus matrix a matrix of full buses instead of "arbitration matrix"? And how about the output FIFO of the Chrom-ART?

The manuals don't seem to be clear enough for me about that.

And why does it say "(Cybercom)" next to my nickname? I haven't worked there for almost 5 years now... How can I change that? I ask here, because I haven't found any "site help forum".

1 ACCEPTED SOLUTION

The company name is only visible to you; it comes from the CRM database, so you'd likely need to address that through the sales side.

I believe the buses/masters have multiple state machines; if they don't have to contend/arbitrate, then all actions that can occur concurrently will do so. Otherwise they will yield until those other operations complete.

You have finite bandwidth, bus activity takes multiple cycles.

The diagrams show the localized busing. The logic to pull it off is very complex, and I don't see ST explaining it to you beyond the block diagrams. You should perhaps also review all of ARM's documentation about the cores and their integration with other IP. There are many modes of utilization; you are expected to experiment and benchmark your own configurations. For instance, a very slow external memory is likely to have a huge impact on bandwidth if you're banging on it continuously.
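As a rough way to follow that benchmarking advice, here is a minimal sketch for timing a memory-to-memory copy with the Cortex-M DWT cycle counter. The DWT/CoreDebug register names come from the CMSIS core headers; the device header name, the STM32_TARGET guard, and the figures in the usage note are assumptions for illustration, not measured results.

```c
#include <stdint.h>
#include <string.h>

/* Pure helper: convert a cycle count into bytes/second at a given core
   clock.  Kept separate from the register access so it can be checked
   off-target. */
static uint64_t bandwidth_bps(uint64_t bytes, uint64_t cycles, uint64_t core_hz)
{
    return cycles ? (bytes * core_hz) / cycles : 0;
}

#ifdef STM32_TARGET            /* on-target part, needs the device header */
#include "stm32f4xx.h"         /* assumption: an F4 device; adapt as needed */

/* Time a memcpy with the DWT cycle counter (CMSIS register names). */
static uint32_t measure_copy_cycles(void *dst, const void *src, size_t len)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable DWT block  */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle count */
    uint32_t start = DWT->CYCCNT;
    memcpy(dst, src, len);
    return DWT->CYCCNT - start;                      /* elapsed cycles    */
}
#endif
```

For example, if copying 64 KB were measured at 40,000 cycles with a 168 MHz core clock, bandwidth_bps(65536, 40000, 168000000) would report 275,251,200 bytes/s, i.e. roughly 275 MB/s. Repeating the measurement with and without concurrent DMA traffic shows how much the streams actually interfere.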

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..


9 REPLIES


I mean, are there separate data lines such that data can be moved from, say, internal static RAM to external RAM (FMC) at the same time the core fetches an instruction from internal FLASH? That is, true concurrency, not cycle stealing.

It may be important (I haven't calculated yet) if the processor has a lot to do and there are two or three DMA transfers going on: the processor is doing time-critical computations, one DMA is reading ADCs, one DMA is keeping a QVGA TFT display happy by dumping frame-buffer data to it, and every now and then a DMA moves a new image into the frame buffer. I wonder if one shared bus can handle that, unless the bus is much faster than the masters.

Of course the accesses to the frame buffer need to be cycle stealing, but the processor should not be delayed noticeably.

I think a frame could be transferred to the frame buffer in about 1 ms, and that's barely quick enough. If cycle stealing takes every second cycle, the copying would take too long.
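To put rough numbers on the scenario above, here is a back-of-envelope sketch, assuming a QVGA (320x240) frame at 16 bits per pixel:

```c
#include <stdint.h>

/* Bytes in one frame of the given geometry and pixel depth. */
static uint32_t frame_bytes(uint32_t w, uint32_t h, uint32_t bits_per_px)
{
    return w * h * (bits_per_px / 8u);
}

/* Sustained bandwidth (bytes/s) needed to move `bytes` in t_us microseconds. */
static uint64_t needed_bps(uint32_t bytes, uint32_t t_us)
{
    return (uint64_t)bytes * 1000000u / t_us;
}
```

frame_bytes(320, 240, 16) is 153,600 bytes, so moving one frame in 1000 µs needs about 153.6 MB/s sustained; if cycle stealing halves the effective throughput, the copy time doubles to roughly 2 ms.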

turboscrew
Senior III

The more I've read about the bus matrix, the more I feel it is just multiplexing channels (control signals), as opposed to the old dedicated (hard-wired) channels, and the parallelism seems to mean cycle stealing (virtual parallelism, just as time-sharing does for processes). Please correct me if I'm wrong.

Buses are typically not bidirectional internally, so there is likely a wide output data bus and a wide input data bus, which can potentially operate at the same time. You've also got different buses which can act autonomously when the core isn't acting on them.

The core is going to prefer TCM (Tightly Coupled Memory) and the prefetch paths out of FLASH/ART. From FLASH into the ART the line width in many parts is 128-bit; on the F4, linear code execution is faster from the ART, as the first word takes a hit (if not present or preloaded) and the rest are 0-cycle (not 0-wait/1-cycle) into the prefetch.

The F7 also has architectural-level caches; these aren't connected to the TCM, as that's already fast and caching it would waste cache resources.

External memory is slow, and this can be compounded by data-bus width; several DISCO boards have 16-bit data buses, which is a serious drag on potential bandwidth.

Don't put the stack in external memory, use TCM/CCM memories.

With 800x600 at 16 bpp @ 60 Hz you're pretty close to the ceiling with the LTDC/DSI. In that context 320x240 is sadly pedestrian.
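A quick check of those display figures; this sketch ignores blanking intervals and bus overhead, so the real bandwidth demand is somewhat higher:

```c
#include <stdint.h>

/* Minimum frame-buffer read bandwidth (bytes/s) for a display mode,
   ignoring blanking intervals and bus arbitration overhead. */
static uint64_t ltdc_min_bps(uint32_t w, uint32_t h, uint32_t bytes_per_px,
                             uint32_t refresh_hz)
{
    return (uint64_t)w * h * bytes_per_px * refresh_hz;
}
```

ltdc_min_bps(800, 600, 2, 60) gives 57,600,000 bytes/s (about 57.6 MB/s) of continuous read traffic, while 320x240 at the same depth and refresh rate needs only about 9.2 MB/s.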

turboscrew
Senior III

Getting to see the IP-level documentation is pretty hard, and I think most of the documents are ARM's commercial products - no way to get them without a lot of money.

T J
Lead

We don't see your old company name; this is what we see of you:

I'd hope undergrad-level computer architecture courses would cover ARM SoC topics in 2018.

The F722 has a newer CM7 core than the F74x, and the F76x has larger caches.

If data is transient, don't cache it; use the MPU configuration to make things non-cacheable.

Use 32-bit wide external busing.
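A minimal sketch of the non-cacheable MPU advice above, assuming the STM32F7 HAL MPU driver; the region number, base address, and 32 KB size are placeholders to adapt to your memory map:

```c
/* Sketch: mark a 32 KB region (e.g. a DMA buffer area) as non-cacheable
   normal memory so transient data never pollutes the cache.
   Assumes the STM32F7 HAL; region number and size are placeholders. */
#include "stm32f7xx_hal.h"

void mpu_make_noncacheable(uint32_t base_addr)
{
    MPU_Region_InitTypeDef region = {0};

    HAL_MPU_Disable();                 /* MPU must be off while configuring */

    region.Enable           = MPU_REGION_ENABLE;
    region.Number           = MPU_REGION_NUMBER0;    /* placeholder region  */
    region.BaseAddress      = base_addr;
    region.Size             = MPU_REGION_SIZE_32KB;  /* placeholder size    */
    region.AccessPermission = MPU_REGION_FULL_ACCESS;
    region.TypeExtField     = MPU_TEX_LEVEL1;        /* normal memory       */
    region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
    region.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
    region.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
    region.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;

    HAL_MPU_ConfigRegion(&region);
    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);          /* re-enable, default
                                                        map for privileged */
}
```

Call this once at startup, before enabling the caches, so DMA buffers placed in that region stay coherent without manual cache maintenance.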


Thanks.

turboscrew
Senior III

Oh, I just looked into the Cortex-M System Design Kit TRM, and it really looks like all bus signals - including address and data lines - are routed. The bus matrix seems to offer actual parallelism after all!

And that document was freely accessible on the ARM Infocenter site.