Speeding up access to shared memory on dual-core STM32

cbcooper · ‎2025-06-26

On my dual-core STM32H747IGT6, the CM4 does data acquisition and processing and stores its output in shared memory (multiple possible destination buffers, some in SRAM1 and some in SRAM2) for the CM7 to pick up. It works fine, except that it takes longer than I would like, and I'm trying to figure out how to speed up the code.

To figure out where the slowdown was happening, I set up TIM3 to run at full speed (240 MHz) and read its counter before and after certain chunks of code. Here's the chunk I'm looking at:

            __disable_irq();  // So no IRQs fire, throwing off the timing
            start_timer = LL_TIM_GetCounter(TIM3);
            // buffer_state is in SRAM1 or SRAM2
            buffer_state->buffer->data[adc_index].samples[num_data] = adc_value;
            buffer_state->buffer->data[lpf_index].samples[num_data] = lpf_value;
            buffer_state->buffer->data[hpf_index].samples[num_data] = hpf_value;
            buffer_state->buffer->data[dac_index].samples[num_data] = dac_value;
            //
            end_timer = LL_TIM_GetCounter(TIM3);
            __enable_irq();

I've looked at the assembly generated for this C code and it's highly optimized; I probably couldn't do much better hand-rolling it.

What I noticed is that the time this piece of code takes is not consistent. It usually takes 34 ticks (142 ns) but on occasion will take up to 134 ticks (558 ns). It seems strange that code this simple wouldn't be more consistent, but then I read that memory access "takes as long as it takes" and now I'm suspecting that the reason for the variation is the time it takes to access the shared memory. It definitely makes sense that if the CM4 is trying to access the shared RAM at the same time as the CM7, something's gotta give.
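As a sanity check on those numbers, here is a minimal sketch of the tick arithmetic. It assumes TIM3 is a free-running 16-bit up-counter clocked at 240 MHz. The helper names (elapsed_ticks16, ticks_to_ns, timing_stats_t and its functions) are hypothetical and not part of the LL library.

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two reads of a free-running
 * 16-bit up-counter. Unsigned wraparound keeps the result correct even if
 * the counter overflowed once between the two reads. */
static inline uint16_t elapsed_ticks16(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Convert a tick count to nanoseconds, rounded to the nearest integer. */
static inline uint32_t ticks_to_ns(uint32_t ticks, uint32_t timer_hz)
{
    uint64_t scaled = (uint64_t)ticks * 1000000000u;
    return (uint32_t)((scaled + timer_hz / 2u) / timer_hz);
}

/* Running min/max/count, so outliers can be quantified over many samples
 * instead of being spotted by eye. */
typedef struct {
    uint32_t min;
    uint32_t max;
    uint32_t count;
} timing_stats_t;

static inline void timing_stats_init(timing_stats_t *s)
{
    s->min = UINT32_MAX;
    s->max = 0u;
    s->count = 0u;
}

static inline void timing_stats_add(timing_stats_t *s, uint32_t ticks)
{
    if (ticks < s->min) { s->min = ticks; }
    if (ticks > s->max) { s->max = ticks; }
    s->count++;
}
```

With the figures above, ticks_to_ns(34, 240000000u) gives 142 and ticks_to_ns(134, 240000000u) gives 558. In the measured chunk, the elapsed value would be elapsed_ticks16((uint16_t)start_timer, (uint16_t)end_timer), fed into timing_stats_add() after interrupts are re-enabled.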

1) Is there any documentation on how the shared RAM works? For example, if the shared RAM "locks" access in 4 KB blocks, then I could try to find a way to have the CM4 and CM7 not working in adjacent memory blocks. If the shared RAM locks its entire contents to one core and makes the other core wait until the first one is done, then this won't work.

2) Would it work to have the CM4 write to SRAM3 (currently unused in my system) and then have the MDMA copy the data from SRAM3 into SRAM1/SRAM2? Or am I going to run into the same problem where the CM4 is trying to write to SRAM3 but the MDMA has it locked? Can I set the MDMA to lower priority so its operation has minimal impact on the CM4 timing?

Tesla DeLorean · ‎2025-06-26

Not sure, but much of it is dual-ported and should be at byte resolution.

Check the bus matrix to understand the connectivity and paths, and whether the two cores are on different buses or contending for the same one.

The M7 side can be cached, at 32-byte (cache line) resolution and alignment.
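For what it's worth, a rough sketch of what the 32-byte alignment means in practice: if the CM7 has the D-cache enabled over the shared buffers, any cache maintenance on a buffer has to cover whole 32-byte lines. The helper below only does the address arithmetic. Its names (cache_align_range, cache_range_t, DCACHE_LINE_SIZE) are made up for illustration, and it assumes the result would then be handed to a CMSIS routine such as SCB_InvalidateDCache_by_Addr on the CM7.

```c
#include <stddef.h>
#include <stdint.h>

/* Cortex-M7 D-cache line size in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Hypothetical type: a start address and length that cover whole cache lines. */
typedef struct {
    uintptr_t start;
    size_t    length;
} cache_range_t;

/* Expand [addr, addr + size) outward to 32-byte cache line boundaries:
 * the start is rounded down and the end is rounded up. */
static inline cache_range_t cache_align_range(uintptr_t addr, size_t size)
{
    const uintptr_t mask = (uintptr_t)(DCACHE_LINE_SIZE - 1u);
    cache_range_t r;
    uintptr_t end = (addr + size + mask) & ~mask;

    r.start  = addr & ~mask;
    r.length = (size_t)(end - r.start);
    return r;
}
```

Because the expanded range can spill into neighbouring data, it is usually simpler to declare each shared buffer 32-byte aligned and pad it to a multiple of 32 bytes, so that no cache line is shared between a buffer the CM4 writes and anything else the CM7 touches.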
