Optimizing FMC Read / Writes for Audio Application STM32H7

EPala.2
Associate III

Hi, I am working on a real-time audio DSP product using the STM32H7 with an AS4C4M16SA-7BCN SDRAM chip for long delay-line memory. I am using the FMC controller with the settings in the attached photo:

EPala2_0-1718652361505.png

The product processes an incoming audio stream in real time, so this is a very timing-critical application. I have found that reads and writes to and from the delay memory on the SDRAM are by far the biggest drag on overall performance.

Currently I am just accessing the SDRAM through ordinary C++ variable accesses, declaring the delay line as follows and using it like any other array:
static float delay_line[48000 * 30] __attribute__((section(".sdram"))); //48000 sample rate * 30 seconds

I am wondering if there are any ways to optimize SDRAM reads and writes to get better performance, either through how I structure my code, or through settings in the CubeMX configurator. 

In particular, would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory rather than just accessing at random points based on my code behavior? Is there a vector style function that can quickly copy a block of data from the SDRAM to local memory? Would this approach be likely to provide a noticeable performance increase?

Please advise, thanks!

 

6 REPLIES
Pavel A.
Evangelist III

 would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory 

Yes. On the Cortex-M7 the Data cache does exactly this for you: once the SDRAM region is cacheable, accesses are fetched a cache line at a time in bursts.

 

For MCU access you can make the SDRAM cacheable; see the MPU_Config() examples in the STM32CubeH7 firmware package.
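As a rough sketch of what those MPU_Config() examples do (this assumes the standard STM32 HAL, an 8 MB SDRAM part, and the default FMC SDRAM bank address 0xC0000000 — check your own linker script and CubeMX mapping before copying any of it):

```cpp
// Sketch only: mark the FMC SDRAM region as normal, write-back cacheable
// memory so the D-cache turns sequential accesses into SDRAM bursts.
#include "stm32h7xx_hal.h"

static void MPU_Config(void)
{
    MPU_Region_InitTypeDef MPU_InitStruct = {0};

    HAL_MPU_Disable();

    MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
    MPU_InitStruct.Number           = MPU_REGION_NUMBER0;
    MPU_InitStruct.BaseAddress      = 0xC0000000;          // FMC SDRAM bank (board-dependent)
    MPU_InitStruct.Size             = MPU_REGION_SIZE_8MB; // match your SDRAM size
    MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
    MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;      // normal memory,
    MPU_InitStruct.IsCacheable      = MPU_ACCESS_CACHEABLE;   // write-back,
    MPU_InitStruct.IsBufferable     = MPU_ACCESS_BUFFERABLE;  // write-allocate
    MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
    MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
    HAL_MPU_ConfigRegion(&MPU_InitStruct);

    HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
}
```

Call it early in main(), before SCB_EnableDCache(), as the Cube examples do.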

DMA into memory won't go through the cache, so DMA'd buffers need cache maintenance (clean/invalidate) or a non-cacheable region.

You should be able to use MEM2MEM DMA modes to move data in the background, but that might add contention.

You'll have to benchmark to see how much performance you gain by doing the processing in on-chip RAM and then migrating data out to SDRAM. Generally, the fewer moves and the simpler the pipeline, the better.

On the F4, SDRAM was on the order of 6x slower than internal SRAM.

The DTCM is not cached; it is tightly coupled to the core and faster than any 0-wait-state memory outside it. If you can keep things small and fast, do that.

If you use the SDRAM as the dynamic memory pool (HEAP) and use pointers you can likely test and adapt things more quickly.

Don't use SDRAM for the STACK.


Thank you! What are the MPU_Config() examples that you are referring to? I tried googling but wasn't finding any clear results.

EPala.2
Associate III

In my application I am writing incoming audio to a very long delay line (30+ seconds at a 48 kHz sample rate) and executing a lot of reads from different points, which are then mixed together. Maybe it would be possible to execute the writes to SDRAM via DMA (since there is only one write per delay line happening per callback), and then do the reads as memcpy calls from SDRAM into local buffers the size of my audio callback buffer. I could store the local buffers in DTCM RAM for faster execution.

Does that sound like a good approach?

With unmanageably long / large amounts of data, my gut says move it into / out of SDRAM once. Least complicated, fewest moves.

If you're pre-processing, do it in the fastest memory first, and move/generate the results into the SDRAM, ideally directly.
