Optimizing FMC Read / Writes for Audio Application STM32H7

EPala.2
Associate III

Hi, I am working on a real-time audio DSP product using the STM32H7 with an AS4C4M16SA-7BCN SDRAM chip for long delay-line memory. I am using the FMC controller with the settings in the attached photo:

(Attached image: EPala2_0-1718652361505.png — FMC/SDRAM configuration screenshot)

The product processes an incoming audio stream in real time, so this is a very runtime-critical application. I have found that reads and writes to the delay memory on the SDRAM are by far the biggest drag on overall performance.

Currently I am just accessing the SDRAM through ordinary C++ code, declaring the buffer as follows and accessing it as I would any other variable:
static float delay_line[48000 * 30] __attribute__((section(".sdram"))); //48000 sample rate * 30 seconds

I am wondering if there are any ways to optimize SDRAM reads and writes to get better performance, either through how I structure my code, or through settings in the CubeMX configurator. 

In particular, would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory rather than just accessing at random points based on my code behavior? Is there a vector style function that can quickly copy a block of data from the SDRAM to local memory? Would this approach be likely to provide a noticeable performance increase?

Please advise, thanks!

 

20 Replies

Faster? Then use 16-bit data! (At least ~2x the speed of float, since you move half the bytes.)

+

Your H7 runs at 400 MHz or so; the core is always faster than the (external) memory access, so every access needs wait states. Ways to make it faster:

1. Smaller data: float -> int16_t.

2. memcpy or a direct per-element read are about the same (either way the CPU does the moving and the waiting...).

3. Maybe (!) faster: have the DMA or MDMA copy the needed part of memory to internal RAM. "Maybe" because while the DMA transfer is running the internal bus is busy, so the CPU may have to wait for free bus access.

What is really fastest you have to try - and make a plan in advance for where the data goes and which bus is blocked when... So (I don't know what you're doing, you didn't give any details) one way could be to copy data by DMA into block B while the CPU works on the data in block A; when finished, the CPU works on B while A is loaded with new data by the DMA. (Circular DMA might be your friend... and the half/full-transfer callbacks.) See the sketch below.
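Roughly like this - just a sketch, untested; it assumes an MDMA memory-to-memory handle hmdma_mem2mem configured in CubeMX with its interrupt enabled (HAL_MDMA_IRQHandler called from the MDMA IRQ), and all names here are illustrative:

#include "stm32h7xx_hal.h"

#define BLOCK 256                               // samples per prefetch block

extern MDMA_HandleTypeDef hmdma_mem2mem;        // CubeMX-generated handle (assumption)
extern int16_t sdram_delay[];                   // delay line placed in SDRAM

static int16_t work[2][BLOCK];                  // ping-pong buffers in internal RAM
static volatile bool dma_done = false;

static void on_xfer_cplt(MDMA_HandleTypeDef*){ dma_done = true; }

void process_with_prefetch(uint32_t read_pos, int16_t* out, int n_blocks){
	hmdma_mem2mem.XferCpltCallback = on_xfer_cplt;

	int cur = 0;
	// kick off the first transfer: SDRAM -> work[0]
	HAL_MDMA_Start_IT(&hmdma_mem2mem, (uint32_t)&sdram_delay[read_pos],
			(uint32_t)work[cur], BLOCK * sizeof(int16_t), 1);

	for(int b = 0; b < n_blocks; ++b){
		while(!dma_done){ /* or do other useful work here */ }
		dma_done = false;
		// if work[] sits in cacheable RAM, invalidate before reading:
		// SCB_InvalidateDCache_by_Addr((uint32_t*)work[cur], BLOCK * sizeof(int16_t));

		int next = cur ^ 1;
		if(b + 1 < n_blocks){                   // start fetching the next block
			HAL_MDMA_Start_IT(&hmdma_mem2mem,
					(uint32_t)&sdram_delay[read_pos + (b + 1) * BLOCK],
					(uint32_t)work[next], BLOCK * sizeof(int16_t), 1);
		}
		for(int i = 0; i < BLOCK; ++i){         // process work[cur] while next block is in flight
			out[b * BLOCK + i] = work[cur][i];
		}
		cur = next;
	}
}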

If you feel a post has answered your question, please click "Accept as Solution".

@AScha.3 

To clarify, are you saying that memcpy and manual reads in a for loop will execute at the same speed?

Which of these two approaches do you think will be faster for reading from SDRAM for the octave-up case I've described?

for(int i = 0; i < BUFFER_SIZE; i += 2){
	//approach 1: read every other sample straight from SDRAM in a for loop
	local_buffer[i >> 1] = sdram_buffer[start_pos + i];
}

//approach 2: read the double-length span in one memcpy, then decimate locally
memcpy(local_buffer, &sdram_buffer[start_pos], sizeof(int16_t) * BUFFER_SIZE * 2);

This is the main question I'd like answered right now; other approaches I am already looking into.

Both use the CPU in a loop to move the data - so about the same. (To see whether there's a few % difference, you'd have to measure.)

Much more speed is in the optimizer settings: which one do you use?

If you feel a post has answered your question, please click "Accept as Solution".
EPala.2
Associate III
void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){

	int16_t cpy_buffer[PROC_BUFFER_SIZE];

	//convert the incoming float block to int16 for SDRAM storage
	for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
		cpy_buffer[i] = in_b[i] * FLOAT_TO_INT16;
	}

	memcpy(&gfxlsd[write_ptr], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);

	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //mirror the start of delay memory past its end so reads never wrap mid-block
		memcpy(&gfxlsd[write_ptr + GRAIN_DELAY], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);
	}

	write_ptr += PROC_BUFFER_SIZE;

	if(write_ptr >= GRAIN_DELAY){ //wrap the circular write pointer
		write_ptr -= GRAIN_DELAY;
	}

	//restart any grain that has finished, then derive its read pointer
	for(int g = 0; g < num_sd_grains; ++g){
		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
			sd_grains[g].active = true;
			start_grain_cloud_sd(g);
		}
		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
	}

	//fetch each grain's samples from SDRAM into its local buffer
	for(int g = 0; g < num_sd_grains; ++g){
		int pitch = sd_grains[g].pitch;
		uint32_t read = sd_grains[g].read_ptr;
		if(pitch == normal_speed){
			//straight block copy
			memcpy(sd_grains[g].buffer, &gfxlsd[read], sizeof(int16_t) * PROC_BUFFER_SIZE);
		}else if(pitch == double_speed){
			//octave up: every other sample from a double-length span
			for(int i = 0; i < PROC_BUFFER_SIZE * 2; i += 2){
				sd_grains[g].buffer[i>>1] = gfxlsd[read + i];
			}
			sd_grains[g].read -= PROC_BUFFER_SIZE;
		}else if(pitch == half_speed){
			//octave down: copy half a block (plus one sample) and interpolate between neighbors
			memcpy(cpy_buffer, &gfxlsd[read], sizeof(int16_t) * ((PROC_BUFFER_SIZE >> 1) + 1));
			for(int i = 0; i < PROC_BUFFER_SIZE; i += 2){
				int loc = i >> 1;
				sd_grains[g].buffer[i + 1] = (cpy_buffer[loc] >> 1) + (cpy_buffer[loc + 1] >> 1);
				sd_grains[g].buffer[i] = cpy_buffer[loc];
			}
			sd_grains[g].read += PROC_BUFFER_SIZE >> 1;
		}else if(pitch == reverse){
			//reversed playback: copy the block back to front
			for(int i = PROC_BUFFER_SIZE - 1; i >= 0; --i){
				int loc = PROC_BUFFER_SIZE - 1 - i;
				sd_grains[g].buffer[loc] = gfxlsd[read + i];
			}
			sd_grains[g].read += PROC_BUFFER_SIZE * 2;
		}
	}

	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);

	//apply a half-sine window to each grain and mix into the output
	for(int g = 0; g < num_sd_grains; ++g){
		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
			//cast guards against integer division if pos/size are integer types
			T env = sinf(((float)sd_grains[g].pos / (float)sd_grains[g].size) * (float)M_PI);
			if(env < 0){ env = 0; }
			if(env > 1){ env = 1; }
			out_b[i] += float(sd_grains[g].buffer[i]) * INT16_TO_FLOAT * env * sd_grains[g].vol;
			sd_grains[g].pos += 1;
		}
	}
}

Here's an updated version using int16_t for the granular delay memory. I'm only getting a marginal performance increase from doing this (+2 read pointers), perhaps because of all the extra multiplication needed to convert from int to float and back. Is there any way to further optimize this using vector functions for the multiplication/conversion or something similar? (Sketch below of the kind of thing I mean.)
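For example, would the CMSIS-DSP block converters be the right tool here? A sketch of what I mean (assuming CMSIS-DSP is linked in; q15_t is int16_t, and grain_f32 is a hypothetical scratch buffer in internal RAM):

#include "arm_math.h"   // CMSIS-DSP

//sketch: block conversion with CMSIS-DSP instead of per-sample loops
void convert_blocks(const float32_t* in_b, q15_t* cpy_buffer,
		const q15_t* grain_buf, float32_t* grain_f32, uint32_t n){
	arm_float_to_q15(in_b, cpy_buffer, n);      // float -> Q15, saturating, x32768
	arm_q15_to_float(grain_buf, grain_f32, n);  // Q15 -> float, x(1/32768)
}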

Ok, not much gain. 🙂

Basically an INT multiply takes one cycle, but on a CPU with an FPU a float multiply is just as fast (the H7 has a double-precision FPU); only loading the FPU registers costs extra clock cycles, which is why INT is a little faster. BUT if you have to do extra multiplications for every value to convert from float to int and back, then you lose that speed again - as happens here.

So doing it in INT will only be faster if everything stays in INT, without conversion.

>an incoming audio stream

This arrives as INT16, so keep it in INT16... without any conversion to float and then back to int and then to float, etc. For instance, something like the sketch below.
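Just a sketch of the idea (assumes the window and grain volume can be precomputed as Q15 gains, 0..32767; all names illustrative):

#include <stdint.h>

static inline int16_t q15_mul(int16_t a, int16_t b){
	return (int16_t)(((int32_t)a * b) >> 15);   // Q15 x Q15 -> Q15
}

//accumulate one grain into a 32-bit mix bus - no float anywhere
void mix_grain_q15(const int16_t* grain, const int16_t* env_q15,
		int16_t vol_q15, int32_t* acc, int n){
	for(int i = 0; i < n; ++i){
		int16_t gain = q15_mul(env_q15[i], vol_q15);  // per-sample gain
		acc[i] += ((int32_t)grain[i] * gain) >> 15;
	}
}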

And - what's your optimizer setting? (This has a strong effect on speed...) I use -O2, but try -Ofast as well.

If you feel a post has answered your question, please click "Accept as Solution".

My thinking with int16_t as the memory format was that the SDRAM read/write traffic is the slowest thing happening and therefore the biggest bottleneck. int16_t means half the amount of data moving to and from the SDRAM compared to floating point (2 bytes versus 4 bytes), but it seems the extra conversion overhead negates some of that advantage.

I need the rest of the system to be floating point.

Is there any way to implement something like a float16_t? That would be the best of both worlds, if the option exists.

Again...

And - what's your optimizer setting?

If you feel a post has answered your question, please click "Accept as Solution".

-Ofast

I realize cranking up optimization settings is liable to increase performance, but not in a very predictable way. I would also hope it isn't silently changing my variables to different types under the hood, which means the SDRAM bottleneck will remain regardless. That's why I'm most interested in discussing the parts of the code I can deterministically control in this thread.

Does float16_t seem like a potentially reasonable approach? Is there a fast way to implement it on the STM32H7? (A sketch of the kind of thing I mean is below.)
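Something like this is what I have in mind - purely hypothetical, and I'd need to verify it: it assumes arm-none-eabi-gcc with -mfp16-format=ieee, where __fp16 is a storage-only type that converts to and from float on load/store (ideally via the FPU's half-precision VCVT instructions on this target):

#include <stdint.h>

//__fp16 as a storage-only half format: 2 bytes in SDRAM, float math in the core
static __fp16 delay_line_h[48000 * 30] __attribute__((section(".sdram")));

float read_sample(uint32_t idx){
	return (float)delay_line_h[idx];    // 2-byte SDRAM read + convert to float
}

void write_sample(uint32_t idx, float v){
	delay_line_h[idx] = (__fp16)v;      // convert from float + 2-byte SDRAM write
}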

You're probably aware that SDRAM chips are optimized for burst access.  For anything else, the performance really depends both on the access pattern and on the specific way the (specific) FMC in your (specific) chip translates the access pattern into requests to the SDRAM. None of us have access to the FMC RTL, and the RM for the chip doesn't provide such details either (unlike say, the docs for a memory controller IP in an FPGA). So, aside from "maximize spatial locality", your best bet is to simply benchmark various approaches and see what works best. Or ask ST directly.

 

It's possible that the FMC would translate a +2 stride loop to exactly the same SDRAM access pattern as a sequential memcpy. That wastes cycles transferring data you don't need, but avoids the latency cost of more individual requests. ST has the details, we don't.
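If you go the benchmarking route, the DWT cycle counter makes it easy. A minimal sketch (copy_under_test is a placeholder for whichever approach you're timing):

#include "stm32h7xx.h"

static void dwt_init(void){
	CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace block
	DWT->LAR = 0xC5ACCE55;                           // unlock DWT (needed on Cortex-M7)
	DWT->CYCCNT = 0;
	DWT->CTRL |= DWT_CTRL_CYCCNTENA_Msk;             // start the cycle counter
}

uint32_t bench(void (*copy_under_test)(void)){
	dwt_init();
	uint32_t t0 = DWT->CYCCNT;
	copy_under_test();
	return DWT->CYCCNT - t0;                         // elapsed CPU cycles
}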

- If someone's post helped resolve your issue, thank them by clicking "Accept as Solution".
- Once you've solved your issue, please post an update with any further details.

>Does float16_t seem like a potentially reasonable approach? Is there a fast way to implement this on STM32H7?

No + yes.

No: for the arithmetic it's a useless idea, because it would all be done in software - slower than just using float with the FPU.

Yes: just use the CMSIS lib (afaik there is one with half-precision support), but it doesn't use the FPU for the math, so it's slower. Something like the sketch below.
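For example (a sketch only - check that your CMSIS-DSP version ships arm_math_f16.h and defines ARM_FLOAT16_SUPPORTED; arm_scale_f16 is one of the half-precision variants):

#include "arm_math_f16.h"   // CMSIS-DSP half-precision variants

//scale a block of float16 samples by 0.5 - done by the library, not the FPU
void gain_block_f16(const float16_t* src, float16_t* dst, uint32_t n){
	arm_scale_f16(src, (float16_t)0.5f, dst, n);
}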

If you feel a post has answered your question, please click "Accept as Solution".