STM32H743 realtime audio processing with DSP

psychegr · ‎2025-03-20

Hi there,

This is my first post here, and also the first time I have worked with STM32 processors.
A few months ago I decided to start a new project for realtime audio processing with DSP,
and the STM32H7 series seemed a good candidate for what I want to do.
So I bought a development board with an STM32H743, a couple of flash chips and a few CS4272 codecs.
I wired everything up and started writing the firmware.
Using a few YouTube videos as a guide, I soon had the base code.
The problem is that I may have some buffer synchronization issues, because I get distorted sound.
Here are some details about the project. The clock is set at 480 MHz; the CS4272 runs standalone in slave mode and is connected over I2S. The audio sampling frequency is 48 kHz and the data format is 24 bits in a 32-bit frame. I have verified that all clocks are correct. The levels of the audio signal are within spec. I use a circular buffer with DMA into an array of 32-bit words.
At first I tried to add some reverberation to the audio signal, and that worked better than I expected. The problem appears when I try to do an IR (impulse response) convolution on the signal. This is where I start to hear the distorted sound. I have tried reducing the buffer lengths and also the IR buffer, but nothing really changes.
Can someone guide me with troubleshooting this? I am using STM32CubeIDE with ST-LINK debugger.

Regards.

LCE · ‎2025-03-20

No specific problem description (means: no source code), so only general advice possible:

- do not only "hear" sound, check with scope and audio analyser (some freeware PC audio stuff is usually good enough)
- do not input complex audio, start with sine only (see above, you actually don't want to hear 1 kHz all day :D )
- make sure you cleanly get out what you put in without any signal processing (sine generator -> codec -> STM32 -> no DSP -> codec -> analyser)
- measure the time your convolution algorithm takes (use the ARM cycle counter), maybe the algorithm is too slow ***, compare to DMA buffer size x sampling period
- make sure you are using the floating point unit

*** the H7 is pretty powerful, but all the stuff people want from audio DSPs these days, I'd say some dedicated audio DSP with lots of "hardware accelerators" might be better for the job.
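
A minimal sketch of the cycle-counter measurement suggested in the list above. On the target the two timestamps would be read from the Cortex-M7 cycle counter (DWT->CYCCNT in CMSIS) once it has been enabled; here they are plain arguments so the arithmetic can be checked on a PC. The helper names elapsed_cycles and cycles_to_us are made up for illustration.

```c
#include <stdint.h>

/* Cycles between two reads of a free-running 32-bit cycle counter.
 * Unsigned subtraction still gives the right answer if the counter
 * wrapped around once between the two reads. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a given core clock in Hz. */
double cycles_to_us(uint32_t cycles, double cpu_hz)
{
    return (double)cycles * 1.0e6 / cpu_hz;
}

/* Intended use on the target (assumption: the CMSIS device header is
 * included and the cycle counter has been enabled beforehand):
 *
 *   uint32_t t0 = DWT->CYCCNT;
 *   leftProcessed = Calc_FIR(leftIn);
 *   uint32_t t1 = DWT->CYCCNT;
 *   uint32_t cost = elapsed_cycles(t0, t1);
 */
```

The result can then be compared with the time one DMA half-buffer lasts, which is the deadline for processing that half.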

psychegr · ‎2025-03-20

OK, here is some of the code I have so far.

#define FILTER_TAP_NUM          256
#define BUFFER_SIZE		2048
#define SAMPLING_FREQUENCY_HZ	48000.0f

__attribute__((aligned(32))) int32_t adcData[BUFFER_SIZE];
__attribute__((aligned(32))) int32_t dacData[BUFFER_SIZE];

static volatile __attribute__((aligned(32))) int32_t *inBufPtr;
static volatile __attribute__((aligned(32))) int32_t *outBufPtr;

static float firdata [FILTER_TAP_NUM];
static int firptr [FILTER_TAP_NUM];
static int fir_w_ptr = 0;

float Calc_FIR (float inSample) {
	float inSampleF = inSample;
	float outdata = 0;

	for (int i = 0; i < FILTER_TAP_NUM; i++) {
		outdata += (firdata[i]*cabinetIR[firptr[i]]);   // cabinetIR has the FFT 
		firptr[i]++;
	}

	firdata[fir_w_ptr] = inSampleF;
	firptr[fir_w_ptr] = 0;
	fir_w_ptr++;
	if (fir_w_ptr == FILTER_TAP_NUM) fir_w_ptr=0;

	return outdata;
}

void Process_HalfBuffer() {
	// Input samples
	static float leftIn		= 0.0f;
	static float leftProcessed	= 0.0f;

	// Loop through half of audio buffer (double buffering), convert int->float, apply processing, convert float->int, set output buffers
	for (uint16_t i = 0; i < (BUFFER_SIZE/2); i += 2) {

		/*
		 * Convert current input samples (24-bits) to floats (two I2S data lines, two channels per data line)
		 */
		// Extract 24-bits via bit mask
		inBufPtr[i]		&= 0xFFFFFF;
		inBufPtr[i + 1]	&= 0xFFFFFF;

		// Check if number is negative (sign bit)
		if (inBufPtr[i] & 0x800000) {
			inBufPtr[i] |= ~0xFFFFFF;
		}

		if (inBufPtr[i + 1] & 0x800000) {
			inBufPtr[i + 1] |= ~0xFFFFFF;
		}

		// Normalise to float (-1.0, +1.0)
		leftIn  = (float) inBufPtr[i] / (float) (0x7FFFFF);

		/*
		 * Apply processing
		 */
		//leftProcessed = leftIn;    // Passthru
		//leftProcessed = (1.0f - wet) * leftIn + wet * Do_Reverb(leftIn);    // Reverb
		leftProcessed = Calc_FIR(leftIn);
		leftProcessed *= 1.5f;     // Volume Adjust

		// Ensure output samples are within [-1.0,+1.0] range
		if (leftProcessed < -1.0f) {
			leftProcessed = -1.0f;
		} else if (leftProcessed > 1.0f) {
			leftProcessed =  1.0f;
		}

		// Scale to 24-bit signed integer and set output buffer
		outBufPtr[i]	   = (int32_t)(leftProcessed * 0x7FFFFF);
	}
	dataReadyFlag = 0;
}

void HAL_I2SEx_TxRxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
	inBufPtr  = &(adcData[0]);
	outBufPtr = &(dacData[0]);

	//Process_HalfBuffer();

	dataReadyFlag = 1;
}

void HAL_I2SEx_TxRxCpltCallback(I2S_HandleTypeDef *hi2s)
{
	inBufPtr  = &(adcData[BUFFER_SIZE/2]);
	outBufPtr = &(dacData[BUFFER_SIZE/2]);

	//Process_HalfBuffer();

	dataReadyFlag = 1;
}

int main(void)
{

  /* USER CODE BEGIN 1 */
	
  /* USER CODE END 1 */

  /* MPU Configuration--------------------------------------------------------*/
  MPU_Config();

  /* MCU Configuration--------------------------------------------------------*/

  /* Reset of all peripherals, Initializes the Flash interface and the Systick. */
  HAL_Init();

  /* USER CODE BEGIN Init */

  /* USER CODE END Init */

  /* Configure the system clock */
  SystemClock_Config();

  /* USER CODE BEGIN SysInit */

  /* USER CODE END SysInit */

  /* Initialize all configured peripherals */
  MX_GPIO_Init();
  MX_DMA_Init();
  MX_SPI4_Init();
  MX_I2S3_Init();
  /* USER CODE BEGIN 2 */
  
  /* USER CODE END 2 */

  /* Infinite loop */
  /* USER CODE BEGIN WHILE */
  
  CS4272_Init();
  HAL_I2SEx_TransmitReceive_DMA(&hi2s3,  (uint16_t *)dacData,  (uint16_t *)adcData, BUFFER_SIZE);
  while (1)
  {
	  if(dataReadyFlag) {
		  Process_HalfBuffer();
	  }

    /* USER CODE END WHILE */

    /* USER CODE BEGIN 3 */
  }
  /* USER CODE END 3 */
}

This is the cleaned-up code that does the audio processing. I still need to measure the "time cost" of the Calc_FIR() function. I am also attaching an oscilloscope photo taken with a 1 kHz sine wave. The yellow trace has some noise (maybe high frequency), but the thing to look at is the "glitches". It looks as if the buffer is misaligned or something. I can't understand why this happens. I even tried overlap-save code using the arm_copy_f32(), arm_rfft_fast_f32() and arm_cmplx_mult_cmplx_f32() functions from the CMSIS library, but had no luck.
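
One thing the posted code cannot report is a missed deadline: the callbacks overwrite inBufPtr/outBufPtr even while Process_HalfBuffer() is still running, and dataReadyFlag is only cleared at the end. The sketch below is one possible way to make that visible, not a confirmed diagnosis. It snapshots which half is pending, releases the slot before the slow work starts, and counts overruns. All names (on_half_ready, process_pending, HALF_WORDS, ...) are hypothetical, and the "DSP" is a plain pass-through.

```c
#include <stdint.h>
#include <stddef.h>

#define HALF_WORDS 4u            /* hypothetical: 32-bit words per half-buffer */
#define NO_HALF    0xFFu         /* marker: nothing waiting */

int32_t rx_buf[2u * HALF_WORDS]; /* stands in for adcData[] */
int32_t tx_buf[2u * HALF_WORDS]; /* stands in for dacData[] */

static volatile uint8_t  pending = NO_HALF; /* half waiting: 0, 1 or NO_HALF */
static volatile uint8_t  busy;              /* 1 while processing is running */
static volatile uint32_t overruns;          /* halves that arrived too early */

/* Call with 0 from the half-complete callback, 1 from the complete callback. */
void on_half_ready(uint8_t half)
{
    if (busy || pending != NO_HALF) {
        overruns++;              /* the previous half was not finished in time */
    }
    pending = half;
}

uint32_t overrun_count(void)
{
    return overruns;
}

/* Main-loop side. Returns 1 if a half was processed, 0 if none was waiting.
 * On the target, the snapshot-and-clear below should run with the DMA
 * interrupt briefly masked so a notification cannot slip in between. */
int process_pending(void)
{
    uint8_t half = pending;      /* snapshot: the callback may change it later */
    if (half == NO_HALF) {
        return 0;
    }
    pending = NO_HALF;           /* release the slot before the slow part */
    busy = 1;

    const int32_t *in  = &rx_buf[half * HALF_WORDS];
    int32_t       *out = &tx_buf[half * HALF_WORDS];
    for (size_t i = 0; i < HALF_WORDS; i++) {
        out[i] = in[i];          /* pass-through; the real DSP would go here */
    }

    busy = 0;
    return 1;
}
```

If overrun_count() ever becomes non-zero while audio is running, the processing is not finishing within one half-buffer, and glitches at the buffer boundaries are to be expected.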

psychegr · ‎2025-03-20

So I wrote a reply and it never appeared.
Anyway I will send the code again.

#define BUFFER_SIZE				8
#define FILTER_TAP_NUM 			256
#define SAMPLING_FREQUENCY_HZ	48000.0f

__attribute__((aligned(32))) int32_t adcData[BUFFER_SIZE];
__attribute__((aligned(32))) int32_t dacData[BUFFER_SIZE];

static volatile __attribute__((aligned(32))) int32_t *inBufPtr;
static volatile __attribute__((aligned(32))) int32_t *outBufPtr;

static float firdata [FILTER_TAP_NUM];
static int firptr [FILTER_TAP_NUM];
static int fir_w_ptr = 0;

float Calc_FIR (float inSample) {
	float inSampleF = inSample;
	float outdata = 0;

	for (int i = 0; i < FILTER_TAP_NUM; i++) {
		outdata += (firdata[i]*cabinetIR[firptr[i]]);  // cabinetIR is the FFT
		firptr[i]++;
	}

	firdata[fir_w_ptr] = inSampleF;
	firptr[fir_w_ptr] = 0;
	fir_w_ptr++;
	if (fir_w_ptr == FILTER_TAP_NUM) fir_w_ptr=0;

	return outdata;
}

void Process_HalfBuffer() {
	// Input samples
	static float leftIn			= 0.0f;
	static float leftProcessed	= 0.0f;
	
	// Loop through half of audio buffer (double buffering), convert int->float, apply processing, convert float->int, set output buffers
	for (uint16_t i = 0; i < (BUFFER_SIZE/2); i += 2) {

		/*
		 * Convert current input samples (24-bits) to floats (two I2S data lines, two channels per data line)
		 */
		// Extract 24-bits via bit mask
		inBufPtr[i]		&= 0xFFFFFF;
		inBufPtr[i + 1]	&= 0xFFFFFF;

		// Check if number is negative (sign bit)
		if (inBufPtr[i] & 0x800000) {
			inBufPtr[i] |= ~0xFFFFFF;
		}

		if (inBufPtr[i + 1] & 0x800000) {
			inBufPtr[i + 1] |= ~0xFFFFFF;
		}

		// Normalise to float (-1.0, +1.0)
		leftIn  = (float) inBufPtr[i] / (float) (0x7FFFFF);

		/*
		 * Apply processing
		 */
		//leftProcessed = leftIn;  // Passthru
		//leftProcessed = (1.0f - wet) * leftIn + wet * Do_Reverb(leftIn);  // Reverb
		//x = *DWT_CYCCNT;
		leftProcessed = Calc_FIR(leftIn);
		//y = *DWT_CYCCNT;
		//cycles = y - x;
		leftProcessed *= 1.5f;  // Volume

		// Ensure output samples are within [-1.0,+1.0] range
		if (leftProcessed < -1.0f) {
			leftProcessed = -1.0f;
		} else if (leftProcessed > 1.0f) {
			leftProcessed =  1.0f;
		}

		// Scale to 24-bit signed integer and set output buffer
		outBufPtr[i]	   = (int32_t)(leftProcessed * 0x7FFFFF);
	}
	dataReadyFlag = 0;
}

void HAL_I2SEx_TxRxHalfCpltCallback(I2S_HandleTypeDef *hi2s)
{
	inBufPtr  = &(adcData[0]);
	outBufPtr = &(dacData[0]);

	//Process_HalfBuffer();

	dataReadyFlag = 1;
}

void HAL_I2SEx_TxRxCpltCallback(I2S_HandleTypeDef *hi2s)
{
	inBufPtr  = &(adcData[BUFFER_SIZE/2]);
	outBufPtr = &(dacData[BUFFER_SIZE/2]);

	//Process_HalfBuffer();

	dataReadyFlag = 1;
}

int main(void)
{

  /* USER CODE BEGIN 1 */

  /* USER CODE END 1 */

  /* MPU Configuration--------------------------------------------------------*/
  MPU_Config();

  /* MCU Configuration--------------------------------------------------------*/

  /* Reset of all peripherals, Initializes the Flash interface and the Systick. */
  HAL_Init();

  /* USER CODE BEGIN Init */

  /* USER CODE END Init */

  /* Configure the system clock */
  SystemClock_Config();

  /* USER CODE BEGIN SysInit */

  /* USER CODE END SysInit */

  /* Initialize all configured peripherals */
  MX_GPIO_Init();
  MX_DMA_Init();
  MX_SPI4_Init();
  MX_I2S3_Init();
  /* USER CODE BEGIN 2 */
  /* USER CODE END 2 */

  /* Infinite loop */
  /* USER CODE BEGIN WHILE */
  CS4272_Init();
  HAL_I2SEx_TransmitReceive_DMA(&hi2s3,  (uint16_t *)&dacData[0],  (uint16_t *)&adcData[0], BUFFER_SIZE);
  
  while (1)
  {
	  if(dataReadyFlag) {
		  Process_HalfBuffer();
	  }
	  
  }
  /* USER CODE END 3 */
}

And here is a screenshot of the waveform.

psychegr · ‎2025-03-20

So I added a few lines to measure the cycles for the calculation, using DWT_CYCCNT.
I changed BUFFER_SIZE to 256 and FILTER_TAP_NUM to 256, and the Calc_FIR() function takes ~45700 cycles.

And this is the resulting waveform.

If I lower the BUFFER_SIZE to 8 then this is the resulting waveform. Cycles are ~50000.
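
A rough budget check with the numbers quoted so far. It assumes the ~45700-cycle figure is for a single Calc_FIR() call, that the core really runs at 480 MHz, and that the filter is called once per 48 kHz frame. The function names are hypothetical.

```c
#include <stdint.h>

/* CPU cycles available per audio frame: f_cpu / f_s. */
uint32_t cycles_per_frame(uint32_t cpu_hz, uint32_t fs_hz)
{
    return cpu_hz / fs_hz;
}

/* Load as a percentage of the per-frame budget (100 = exactly real time). */
uint32_t load_percent(uint32_t measured_cycles, uint32_t budget_cycles)
{
    return (uint32_t)(((uint64_t)measured_cycles * 100u) / budget_cycles);
}
```

With 480 MHz and 48 kHz the budget is 10000 cycles per frame, so 45700 cycles is about 4.6 times more than is available, and changing the buffer length cannot fix that. It also works out to roughly 178 cycles per tap for 256 taps, which is far more than a multiply-accumulate normally needs, so the optimizer and cache settings may be worth checking.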

AScha.3 · ‎2025-03-20

Hi,

1. What's your optimizer setting? (-O2 is fine, I always use it.)

2. Never use float in real-time calculation. What do you want to gain from float or double? Your DAC only really converts 16 or 18 bits anyway, so do everything as int16_t until it's working perfectly. Then maybe try getting the last 10 dB of S/N out of it, if the calculation is fast enough.

3. You cannot expect real "DSP" from a non-DSP CPU, just "close to" it, and only with integer calculations done in CCM RAM.

So try these basic changes and report back.
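
To make point 2 concrete, here is a minimal sketch of a FIR on int16_t samples with Q15 coefficients. It is an illustration under assumptions, not code from this thread: the names sat16 and fir_q15 are hypothetical, and the input history is passed as a plain array with the newest sample first.

```c
#include <stdint.h>
#include <stddef.h>

/* Saturate a 64-bit value to the int16_t range. */
static int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One output sample of an n-tap FIR.
 * x[0] is the newest input sample, x[k] the sample k frames ago.
 * h[] holds the coefficients in Q15 (32768 would represent 1.0). */
int16_t fir_q15(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;                            /* wide accumulator */
    for (size_t k = 0; k < n; k++) {
        acc += (int32_t)x[k] * (int32_t)h[k];   /* sample * Q15 -> Q15 */
    }
    return sat16(acc / 32768);                  /* scale back, then clip */
}
```

CMSIS-DSP also ships fixed-point FIR routines such as arm_fir_q15(), which would be worth timing against a hand-written loop.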

If you feel a post has answered your question, please click "Accept as Solution".

psychegr · ‎2025-03-20

Based on your point 3, that means I have chosen the wrong processor for what I want to do! Maybe the STM32F411 is a better candidate, because I just saw that it supports DSP while the STM32H743 doesn't (I only just noticed!).

Chris21 · ‎2025-03-20

That's not correct. The Cortex-M7 in the H743 has the DSP instructions too, just like the Cortex-M4 in the F411:

https://community.arm.com/cfs-file/__key/communityserver-blogs-components-weblogfiles/00-00-00-21-42/7563.ARM-white-paper-_2D00_-DSP-capabilities-of-Cortex_2D00_M4-and-Cortex_2D00_M7.pdf

Also, the H7's CPU can run at a higher clock rate.

AScha.3 · ‎2025-03-20

No, the H7 at 400 or 600 MHz (H7S3) will be much faster than an F411.

The difference from a "real" DSP is ...

https://en.wikipedia.org/wiki/Digital_signal_processor

...and just look at what's in the "CPU" (SoC: Snapdragon 636) of my 6-year-old $160 mobile phone: just for the audio processing there is a "small" DSP, the Hexagon 680, running at 500 MHz with a full 4 cores;

these could run a Linux system on their own, without the other main CPUs on the chip.

Just read a little about this "small" DSP,

https://www.anandtech.com/show/9552/qualcomm-details-hexagon-680-dsp-in-snapdragon-820-accelerated-imaging

...then you will know the difference between a "CPU with some DSP instructions" and a "DSP with VLIW instructions" and many (!) ALUs on each of its cores.

The more recent versions, e.g. the Hexagon 698 DSP, are capable of 15 trillion operations per second (TOPS).

Compare this to a fast CPU like the H7, doing 500 million OPS: its speed is about 0.003 % of the Hexagon 698.

And it's just the "helping co-processor" for audio etc., for the 8 ARM main cores at 2 GHz or so.


LCE · ‎2025-03-20

So, as I said in my first post, the H7 is not made for that.

50k cycles is just too many, but maybe you can optimize your code; maybe integer math is good enough.

But do you at least get out what you put in without any processing?
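
On the "maybe you can optimize your code" point, here is a minimal sketch of a direct-form FIR that keeps the input history in a circular buffer and walks it backwards, so a per-tap index array like firptr[] in the posted code, and its increments, are not needed. The names (fir_state, fir_step, NUM_TAPS) are hypothetical, and the tap count is assumed to be a power of two so the wrap-around is a bit mask.

```c
#include <stddef.h>

#define NUM_TAPS 8u                 /* hypothetical; must be a power of two */

typedef struct {
    float    hist[NUM_TAPS];        /* the last NUM_TAPS input samples */
    unsigned head;                  /* index of the newest sample */
} fir_state;

/* Push one input sample and return one output sample:
 *   y[n] = sum over k of h[k] * x[n-k],  k = 0 .. NUM_TAPS-1 */
float fir_step(fir_state *s, const float *h, float in)
{
    s->head = (s->head + 1u) & (NUM_TAPS - 1u);
    s->hist[s->head] = in;

    float acc = 0.0f;
    unsigned idx = s->head;
    for (unsigned k = 0; k < NUM_TAPS; k++) {
        acc += h[k] * s->hist[idx];
        idx = (idx - 1u) & (NUM_TAPS - 1u);   /* one sample back, wrapping */
    }
    return acc;
}
```

Feeding a single 1.0 followed by zeros should return the coefficients h[0], h[1], ... in order, which makes a quick sanity test before trying it with a real impulse response. CMSIS-DSP's arm_fir_f32() is another option worth timing.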