
Mysterious memory behaviour in simple DSP test program

Associate III


I am currently developing a DSP application on an STM32F439ZI Nucleo board in combination with a Pmod I2S2 audio codec board. So far, basic testing and filtering worked fine until I started to try FFT processing. I isolated the issue in a simple test program (code attached). I am using the CMSIS DSP library for the FFT and IFFT.

This is what the code does:

Audio sample data is transferred over I2S using DMA in circular mode. The DMA buffer itself is rather small in order to have the TxRxHalfComplete and TxRxComplete ISR called every time after one audio sample from the left and right channel has been received. I need this because I have an FIR filter function that needs to be called for every sample transmitted and received (tested and works, but not used inside the test program). Inside the ISRs the incoming samples are converted from 24 bit signed PCM into float32 and copied into a bigger circular input buffer. As soon as one half of the buffer is filled with data a flag is set. Inside the main while loop the sample block is then transformed into the frequency domain using arm_rfft_fast_f32 and immediately transformed back into the time domain. The output buffer's current index is delayed to ensure there is enough time to perform the FFT calculations (verified).
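The conversion step described above could look roughly like the sketch below. It is not the attached code: it assumes each 24-bit sample has already been assembled into the low 24 bits of a 32-bit word, and the helper name pcm24_to_float is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: converts one 24-bit signed PCM sample to float.
 * Assumption: the sample sits in the low 24 bits of `raw`; the upper
 * 8 bits are ignored. The result is in the range [-1.0, 1.0). */
static float pcm24_to_float(uint32_t raw)
{
    /* Portable sign extension: the 23 magnitude bits minus the weight
     * of the sign bit (bit 23). */
    int32_t value = (int32_t)(raw & 0x7FFFFFu) - (int32_t)(raw & 0x800000u);

    /* Scale by 2^23 so that full scale maps to +/-1.0. */
    return (float)value / 8388608.0f;
}
```

Scaling to +/-1.0 is a design choice, not a requirement. The FFT/IFFT round trip works just as well on unscaled values, as long as the conversion back to PCM uses the same factor.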

Here's the issue:

It works perfectly fine for transform sizes of 1024 and 2048. If I make the transform / buffer size any larger or smaller, the audio output gets corrupted. The output buffer delay is adjusted appropriately. On top you can see the 1 kHz sine wave that is fed into the unit, below that is the corrupted output of the left channel, and below that is the output of the right channel (feeding samples straight through inside the ISR).


What I have already done trying to fix the bug:

I verified that the FFT and IFFT are working correctly by feeding a known array with a known output into the transform functions - works.

Simply copying the input buffer to the output buffer in the while loop works as well.

The really weird thing is, if I simply uncomment line 155 where I memcpy just the input buffer into another totally unused buffer, that has nothing to do with the transform, it suddenly works for all transform sizes.


So I assume this must be some kind of memory access issue. However, at this point I am absolutely clueless as to what could cause this behavior or how I could debug it. I would highly appreciate it if somebody could point me in the right direction!


Best regards


Accepted Solutions
Associate III

I found the issue. The 20 - 30 us pre-delay looked very much like the time between two samples at 44.1 kHz sampling rate. This made me realize I was doing the FFT/IFFT one sample too early. The array index gets updated at the end of the TxRxHalfCplt ISR and I was setting the dataReady flag at the same time, which is wrong, because the new sample has not been written to the buffer yet. This actually happens in the following TxRxCplt ISR. So moving this piece of code from TxRxHalfCplt to TxRxCplt fixed it:



	/* First half of the input buffer has just been completed */
	if(currentInputBufferIndex == AUDIO_BUFFER_SIZE_HALF-1) {
		dataReady = 1;
		inputBufferOffset = 0;
		outputBufferOffset = AUDIO_BUFFER_SIZE_HALF;
	}

	/* Second half of the input buffer has just been completed */
	if(currentInputBufferIndex == AUDIO_BUFFER_SIZE-1) {
		dataReady = 1;
		inputBufferOffset = AUDIO_BUFFER_SIZE_HALF;
		outputBufferOffset = 0;
	}


Thank you everyone for pointing me towards the solution!
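For readers who want to see the ordering spelled out, here is a minimal sketch of the idea behind the fix: write the sample first, then check whether a half is complete, then advance the index. It is not the poster's actual ISR. The helper store_sample and the tiny buffer size are assumptions made so the example can run on a PC; the variable names are taken from the snippet above.

```c
#include <stdint.h>

#define AUDIO_BUFFER_SIZE      8u   /* tiny demo size, not the real one */
#define AUDIO_BUFFER_SIZE_HALF (AUDIO_BUFFER_SIZE / 2u)

static float audioInputBuffer[AUDIO_BUFFER_SIZE];
static volatile uint32_t currentInputBufferIndex = 0;
static volatile uint32_t inputBufferOffset = 0;
static volatile uint32_t outputBufferOffset = 0;
static volatile uint8_t  dataReady = 0;

/* Hypothetical helper standing in for the body of the callback that
 * actually stores the new sample. */
static void store_sample(float sample)
{
    uint32_t idx = currentInputBufferIndex;

    /* 1. Write the sample before announcing anything. */
    audioInputBuffer[idx] = sample;

    /* 2. Only now check whether this write completed a half.
     *    The offsets are set before the flag, so a reader that sees
     *    dataReady == 1 also sees matching offsets. */
    if (idx == AUDIO_BUFFER_SIZE_HALF - 1u) {
        inputBufferOffset  = 0;
        outputBufferOffset = AUDIO_BUFFER_SIZE_HALF;
        dataReady = 1;
    } else if (idx == AUDIO_BUFFER_SIZE - 1u) {
        inputBufferOffset  = AUDIO_BUFFER_SIZE_HALF;
        outputBufferOffset = 0;
        dataReady = 1;
    }

    /* 3. Advance the circular index last. */
    currentInputBufferIndex = (idx + 1u) % AUDIO_BUFFER_SIZE;
}
```

With this ordering, the last sample of a block is guaranteed to be in the buffer by the time the main loop sees dataReady set.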


Associate III

Update: It is enough to just read from audioInputBuffer in the while loop before doing the FFT transform to make it work. No memcpy, nothing, just reading from it. If I don't do that, it's the same issue. What on earth is going on there?


Senior II

How does your design take into account cache coherency & data buffer alignment?



Associate III

It doesn't because I don't have to? I'm using an F4, which doesn't have an L1 cache. As far as I know it only has a D- and I-cache for flash memory access. Please correct me if I am wrong or if I didn't understand the concept.

The data buffer contains 32 bit floats. I double checked in the memory viewer during runtime that the buffers are aligned and not overlapping. I also checked for wrong array boundaries, but everything seems fine as well.


@JP_ama, I am afraid you are correct and therefore I am not. I should have known, as last week I was investigating STM32F407 SDIO: the D-cache is flash only, as stated in RM0090. A long time ago I had some problems with an STM32H735 where DMA did abnormal things from time to time, related to cache coherency.


Hi, I don't see any timing control in your code. Code that processes an audio sample stream can only work if the calculation is faster than the stream. It seems your code with float calculations is slower, and depending on the buffer size, corruption occurs on every Nth buffer...

Put simply, your half and full callbacks copy FFT arrays that have not finished yet...

Also check the volatile requirements for your definitions. An array that is not volatile may be optimized out.


The output buffer array index is running 512 samples behind the input buffer index (for 4096 transform size). That's about 11 ms. The FFT calculations take roughly 2 ms. I measured and verified this. Not sure if there are nicer ways to do this though.
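The numbers quoted here can be checked with a few lines of arithmetic. The sketch below assumes the 44.1 kHz sampling rate mentioned elsewhere in this thread; the function names samples_to_ms and fits_in_delay are made up for illustration.

```c
/* Time spanned by `samples` samples at `sample_rate_hz`, in milliseconds. */
static double samples_to_ms(unsigned samples, double sample_rate_hz)
{
    return 1000.0 * (double)samples / sample_rate_hz;
}

/* Returns 1 if the processing time is shorter than the time the output
 * index lags behind the input index, 0 otherwise. */
static int fits_in_delay(unsigned delay_samples, double sample_rate_hz,
                         double processing_ms)
{
    return processing_ms < samples_to_ms(delay_samples, sample_rate_hz);
}
```

For a 512-sample delay at 44.1 kHz, samples_to_ms gives roughly 11.6 ms, so a 2 ms FFT/IFFT pass fits with room to spare. A single sample period is roughly 22.7 us.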

I tried making the input buffer array volatile. However, the volatile qualifier gets discarded in the arm_rfft_fast_f32 function call (compiler warning), so I'm not sure if this is doing much. But I also suspect that something like this is causing the problem.

@Johi It makes perfect sense to investigate in this direction. I appreciate everyone taking the time to take a look into this. Thanks!

Senior III


Few notes with quick review:

Is it enough to keep a small delay, instead of reading from audioInputBuffer before the FFT, to get it working?

I see you have USB, Eth and Usart initializations in your code. Is it possible that those have a higher interrupt priority, which delays the I2S interrupts? (The code is abbreviated, I guess.)

Then you init HAL_I2SEx_TransmitReceive_DMA with a length of 4 but read the buffer with a length of 8. Is this intended? (Just a note, I'm not familiar with I2S.)

Br J.T

OK, so the 2 ms covers one or both FFTs, and this shows in your graphs. During these 2 ms you send intermediate FFT results to the output instead of real data, because you don't have any timing management. I also recommend swapping between double output buffers...

Chief II

The inputBufferOffset, outputBufferOffset and dataReady variables are all modified in interrupt context and are therefore unsynchronized for the normal thread context (the main loop). As a quick fix, disable the interrupts while you read/modify the set of those variables. But eventually it would be better to redesign the logic using the principles from these articles:
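As a sketch of the quick fix mentioned in this reply (not of the linked articles): the main loop can take a consistent snapshot of the three variables inside a short critical section. The names irq_disable, irq_enable, BlockInfo and take_block are hypothetical. The two irq functions are PC stand-ins; on the target they would presumably be replaced by the CMSIS intrinsics __disable_irq() and __enable_irq().

```c
#include <stdint.h>

/* PC stand-ins for the interrupt mask (assumption: on the target these
 * map to __disable_irq() / __enable_irq()). */
static int irq_masked = 0;
static void irq_disable(void) { irq_masked = 1; }
static void irq_enable(void)  { irq_masked = 0; }

/* Shared with the ISR, names as used in the thread. */
static volatile uint8_t  dataReady;
static volatile uint32_t inputBufferOffset;
static volatile uint32_t outputBufferOffset;

/* Hypothetical snapshot type for the main loop. */
typedef struct {
    uint32_t inputOffset;
    uint32_t outputOffset;
} BlockInfo;

/* Returns 1 and fills *out if a block is pending, 0 otherwise.
 * Flag and offsets are read and cleared as one unit, so the ISR cannot
 * change one of them halfway through. */
static int take_block(BlockInfo *out)
{
    int ready;

    irq_disable();
    ready = dataReady;
    if (ready) {
        out->inputOffset  = inputBufferOffset;
        out->outputOffset = outputBufferOffset;
        dataReady = 0;
    }
    irq_enable();

    return ready;
}
```

The main loop would then call take_block() and run the FFT/IFFT on the returned offsets only. Note that unconditionally re-enabling interrupts is only safe if they were enabled beforehand; saving and restoring PRIMASK (__get_PRIMASK() / __set_PRIMASK()) is the more defensive variant.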