Optimizing FMC Reads / Writes for an Audio Application on STM32H7

EPala.2 · ‎2024-06-17

Hi, I am working on a real-time audio DSP product using the STM32H7 with an AS4C4M16SA-7BCN SDRAM chip for long delay-line memory. I am using the FMC controller with the settings in the attached photo:

The product processes an incoming audio stream in real time, so this is a very runtime critical application. I have found that reads and writes to and from the delay memory on the SDRAM are by far the biggest drag on overall performance.

Currently I am just accessing SDRAM memory automatically through the C++ compiler, declaring as follows and accessing as I would any other variable:
static float delay_line[48000 * 30] __attribute__((section(".sdram"))); //48000 sample rate * 30 seconds

I am wondering if there are any ways to optimize SDRAM reads and writes to get better performance, either through how I structure my code, or through settings in the CubeMX configurator.

In particular, would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory rather than just accessing at random points based on my code behavior? Is there a vector style function that can quickly copy a block of data from the SDRAM to local memory? Would this approach be likely to provide a noticeable performance increase?

Please advise, thanks!

Pavel A. · ‎2024-06-17

> would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory

Yes. This is called the Data cache on Cortex-M7.

Tesla DeLorean · ‎2024-06-17

For MCU access you can cache the SDRAM; see the MPU_Config() examples.

DMA transfers into memory bypass the cache.

You should be able to use MEM2MEM DMA modes to move data in the background, but that might add contention.

You'll have to benchmark to see how much performance you gain by doing the processing in on-chip memory and then migrating the results out to SDRAM. Generally, the fewer the moves and the simpler the pipeline, the better.

On the F4's the SDRAM was of the order of 6x slower than Internal-SRAM.

The DTCM is not cached, and is better than 0-wait state outside the core. If you can keep things small-and-fast, do that.

If you use the SDRAM as the dynamic memory pool (HEAP) and use pointers you can likely test and adapt things more quickly.

Don't use SDRAM for the stack.

Tips, Buy me a coffee, or three.. PayPal Venmo
Up vote any posts that you find helpful, it shows what's working..

EPala.2 · ‎2024-06-17

Thank you! What are the MPUConfig examples that you are referring to? I tried googling but wasn't seeing any clear results.

EPala.2 · ‎2024-06-17

In my application I am writing incoming audio to a very long delay line (30+ seconds, 48 kHz sample rate), and executing a lot of reads from different points which are then mixed together. Maybe it would be possible to execute the writes to SDRAM via DMA (since there is only one write per delay line happening per callback), and then do the reads as memcpy calls from SDRAM to local buffers the size of my audio callback buffer. I could also store the local buffers in DTCM RAM for faster execution.

Does that sound like a good approach?
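A minimal sketch of the block-copy idea described above, written so it also runs on a host machine. The helper names `write_block_wrapped` and `read_block_wrapped` are hypothetical (they are not from the thread or from any ST library), and the sketch uses plain memcpy for both directions rather than DMA. It assumes the start index is inside the delay line and that a block is never longer than the line. Splitting the copy at the wrap point is an alternative to mirroring the start of the buffer past its end.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: write `count` samples into a circular delay line
// starting at `write_pos`, using at most two contiguous memcpy calls.
// Returns the next write position. Assumes write_pos < line_len and
// count <= line_len.
template <typename T>
std::size_t write_block_wrapped(T* line, std::size_t line_len,
                                std::size_t write_pos,
                                const T* src, std::size_t count)
{
    const std::size_t room  = line_len - write_pos;          // samples before the wrap
    const std::size_t first = (count < room) ? count : room; // first contiguous chunk
    std::memcpy(line + write_pos, src, first * sizeof(T));
    if (first < count) {
        // Remainder continues from the start of the delay line.
        std::memcpy(line, src + first, (count - first) * sizeof(T));
    }
    std::size_t next = write_pos + count;
    if (next >= line_len) { next -= line_len; }
    return next;
}

// Hypothetical helper: read `count` samples starting at `start` into a
// local buffer (for example one placed in DTCM), again with at most two
// contiguous copies. Assumes start < line_len and count <= line_len.
template <typename T>
void read_block_wrapped(const T* line, std::size_t line_len,
                        std::size_t start, T* dst, std::size_t count)
{
    const std::size_t room  = line_len - start;
    const std::size_t first = (count < room) ? count : room;
    std::memcpy(dst, line + start, first * sizeof(T));
    if (first < count) {
        std::memcpy(dst + first, line, (count - first) * sizeof(T));
    }
}
```

Each grain would call `read_block_wrapped` once per audio callback with its own start index, so every SDRAM access is a contiguous run. Whether this beats per-sample access on the target still has to be measured.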

Tesla DeLorean · ‎2024-06-17

https://github.com/STMicroelectronics/STM32CubeH7/blob/master/Projects/STM32H743I-EVAL/Examples/FMC/FMC_SDRAM_DataMemory/Src/main.c#L273


Tesla DeLorean · ‎2024-06-17

For unmanageably long / large amounts of data, my gut says to move it into and out of SDRAM only once. That is the least complicated option, with the fewest moves.

If you're pre-processing, do it in the fastest memory first, and move/generate the results into the SDRAM, ideally directly.


EPala.2 · ‎2024-07-05

Hi all, I've found that this method of loading a whole buffer at a time is running faster than the original version:

void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){

	memcpy(&gfxl[write_ptr], in_b, sizeof(T) * PROC_BUFFER_SIZE);

	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //copy the beginning of delay memory to end to ensure contiguous reads
		memcpy(&gfxl[write_ptr + GRAIN_DELAY], in_b, sizeof(T) * PROC_BUFFER_SIZE);
	}

	write_ptr += PROC_BUFFER_SIZE;

	if(write_ptr >= GRAIN_DELAY){
		write_ptr -= GRAIN_DELAY;
	}

	for(int g = 0; g < num_sd_grains; ++g){
		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
			sd_grains[g].active = true;
			start_grain_cloud_sd(g);
		}
		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
	}

	for(int g = 0; g < num_sd_grains; ++g){
		memcpy(sd_grains[g].buffer, &gfxl[sd_grains[g].read_ptr], sizeof(T) * sd_grains[g].buffer_len);
	}

	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);

	for(int g = 0; g < num_sd_grains; ++g){
		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
			T env = sinf((sd_grains[g].pos / sd_grains[g].size) * M_PI) * 1.0f;
			if(env < 0){ env = 0; }
			if(env > 1){ env = 1; }
			out_b[i] += sd_grains[g].buffer[i] * env;
			sd_grains[g].pos += 1;
		}
	}
}

I do have an issue where, for a pitch-shifted octave-up process, I need to read every other sample from SDRAM memory (to double the playback speed of the sample).

Do y'all think it would be faster to do this via a for loop that reads every other sample from SDRAM? Or faster to just read a double length buffer with memcpy()?
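The two options in the question can be sketched side by side so they can be timed against each other on the target. The function names below are hypothetical, and `src` stands in for a pointer into the SDRAM delay line. One point worth keeping in mind: if the D-cache is enabled for the SDRAM region, the Cortex-M7 fetches whole 32-byte cache lines (eight floats), so a stride-2 loop still pulls in every line that the contiguous copy would. Which variant wins is something to benchmark rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Option 1: strided loop reading every other sample directly from the
// source (hypothetically the SDRAM delay line).
inline void read_every_other(const float* src, float* dst,
                             std::size_t out_count)
{
    for (std::size_t i = 0; i < out_count; ++i) {
        dst[i] = src[2 * i];
    }
}

// Option 2: one contiguous memcpy of 2 * out_count samples into a local
// scratch buffer, then pick every other sample from fast local memory.
// `scratch` must hold at least 2 * out_count floats.
inline void copy_then_decimate(const float* src, float* scratch,
                               float* dst, std::size_t out_count)
{
    std::memcpy(scratch, src, 2 * out_count * sizeof(float));
    for (std::size_t i = 0; i < out_count; ++i) {
        dst[i] = scratch[2 * i];
    }
}
```

Both functions produce the same output, so swapping one for the other inside the grain read loop only changes the memory access pattern, which makes an A/B timing comparison straightforward.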

AScha.3 · ‎2024-07-05

Hi,

Just wondering: why do you use float? > static float delay_line[48000 * 30]

For hi-fi quality, 16-bit would be OK; for top studio quality, 24-bit. If you want some extra headroom, 32-bit. (All integer.)

And how do you load the delay? As a circular buffer?

And why such a super long delay line? (30 sec corresponds to a room roughly 10 km across, more than enough even to simulate an open-air concert in a stadium.)


EPala.2 · ‎2024-07-05

16-bit is an approach we are probably going to take. Some other parts of the processing need float resolution, but for the granular process 16-bit will be enough. I just have not added that in yet, as I am focusing on the read / write functions for the time being.
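A rough sketch of what 16-bit storage could look like, assuming samples are normalized to the range -1.0 to 1.0. The function names are hypothetical. Storing int16_t instead of float halves the SDRAM traffic per sample, and the 30 second delay line shrinks from 48000 * 30 * 4 = 5.76 MB to 2.88 MB. CMSIS-DSP also provides arm_float_to_q15 and arm_q15_to_float for this kind of conversion; the plain loops below are only meant to show the idea.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Convert normalized float samples to 16-bit integers before writing them
// to the delay line. Values outside [-1, 1] are clamped; the scaled value
// is truncated toward zero by the cast.
inline void float_to_int16(const float* in, std::int16_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float x = in[i];
        if (x > 1.0f)  { x = 1.0f; }
        if (x < -1.0f) { x = -1.0f; }
        out[i] = static_cast<std::int16_t>(x * 32767.0f);
    }
}

// Convert 16-bit samples read from the delay line back to float for the
// rest of the processing chain.
inline void int16_to_float(const std::int16_t* in, float* out, std::size_t n)
{
    const float scale = 1.0f / 32767.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(in[i]) * scale;
    }
}
```

The conversion costs a few cycles per sample on the core, so it is worth checking on the target that the reduced memory traffic outweighs that cost.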

As for the delay memory, this is a granular process. It is a creative effect, different from a regular delay or reverb, and long memory is part of how that works.

Do you have any thoughts on the question of my octave up use case?

In this scenario I need every other sample played back in order to achieve a pitch shift effect.

Would it be faster to do this by using memcpy and reading a buffer of twice the length, or by using a for loop to read every other sample from an SDRAM segment to local memory?

Please advise.