cancel
Showing results for 
Search instead for 
Did you mean: 

H7 OCTOSPI HyperRAM data throughput changing with compilation

LCE
Principal

Heyho,

I'm using the H733 (custom board) / H735 (eval kit) with Infineon's HyperRAM S70KL1281 / S70KL1282 at 100 MHz for some time now, all working great, except for one thing that is very annoying:

  • the data throughput from / to HyperRAM seems to depend on compilation, even though the OCTOSPI peripheral was not changed
  • after some compilations it's about 178 Mbyte / s, after another only 54 MB/s.
  • data throughput is constant for one compilation, no matter if I call the test function at MCU power up or while operating with all other peripherals running
  • no caching anywhere

I'm pretty sure that it's not "faulty" timing measurements, using the cycle counter and disabling all interrupt calls around the for loops.

  • Is there something wrong in my test function?
  • Is it maybe "only" how the for loop / iteration is compiled?
  • right now I can't get it back to the high speed, so no map / list file
  • my scope here is too old and slow to check the signal lines

Here's the test function, first writing to HyperRAM, then reading:

/* +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
/* OCTOSPI HyperRAM test
 */
#define HYPER_TEST_UART		1

uint32_t OspiHypRamTest(uint8_t u8CountDown)
{
	uint32_t i = 0;
	uint32_t u32Val = 0xFFFFFFFF;
	uint32_t u32MaxLen = (uint32_t)((uint32_t)OSPI_HYPERRAM_END_ADDR / 4);
	uint32_t u32Errors = 0;
	uint32_t u32Data = 0;
	uint32_t u32CycStart = 0;
	uint32_t u32Cycles = 0;
	float flClockMHz = (float)HAL_RCC_GetSysClockFreq() / 1E6;
	float flVal = 0.0f;
	uint32_t *pu32MemAddr = NULL;

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

#if HYPER_TEST_UART
	uart_printf("\n\r+++++++++++++++++++++++++++++++++++++++++++++++++\n\r");
	uart_printf("OCTOSPI HyperRAM test, memory mapped, IRQs OFF\n\rcounting ");
	if( 0 == u8CountDown ) uart_printf("UP, start with 0\n\r\n\r");
	else uart_printf("DOWN, start with %08lX\n\r\n\r", u32Val);

	uart_printf("writing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();
__disable_irq();

/* write complete HyperRAM */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = i;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = u32Val;
			u32Val--;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}

__enable_irq();
__DSB();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmWr = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmWr *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) WRITE\n\r\n\r", flOspiRamSpeedMBpsMmWr, (8.0f * flOspiRamSpeedMBpsMmWr));

	uart_printf("reading & comparing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

__disable_irq();
__DSB();

/* read complete HyperRAM and compare */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != i ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != (u32Val - i) ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
__enable_irq();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmRd = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmRd *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) READ\n\r", flOspiRamSpeedMBpsMmRd, (8.0f * flOspiRamSpeedMBpsMmRd));

	if( 0 == u32Errors ) uart_printf("\n\rNULL errors\n\r");
	else uart_printf("\n\r# ERR: u32Errors = %lu\n\r", u32Errors);
	uart_printf("-------------------------------------------------\n\r");
#endif

	return u32Errors;
}

Anybody any ideas?

Thanks in advance!

16 REPLIES 16
STOne-32
ST Employee

Dear @LCE ,

Thanks for the interesting use case. is that possible to detail the exact IDE/compiler environment so we can try to reproduce the same at our end ?   @KDJEM.1 and then analyze 

Ciao

STOne-32. 

LCE
Principal

I'm using

- H735 EVK or H733 custom board

- STM32CubeIDE Version: 1.10.1

- optimization FAST

- CPU clock 400 MHz

- OSPI 100 MHz

- HyperRAM setup via direct register access (doesn't make a difference to HAL setup)

LCE
Principal

I just got the "fast" version again, maybe there's some bus issues in the background, depending on the UART use:

UART 3 is used for debugging, in TX DMA mode.

The ouput function uart_printf() fills the TX DMA buffer, just waits at the beginning for previous transfers to finish by checking TC and other stuff with a function UartDbgDmaTxWait().

When I put UartDbgDmaTxWait() after each uart_printf() around OspiHypRamTest() I get the high speed - for now at least...

The question remains, before I did that, why sometimes fast / slow results, without changing anything concerning the OSCTOSPI peripheral and the test function?

 

 

LCE
Principal

I also compared the assembler in the list files, between slow / fast version:

the important loops reading / writing HyperRAM and comparing - while the interrupts are disabled - basically look the same

Bassett.David
Senior

Hello LCE,

Have you considered providing protection if

u32Cycles -= u32CycStart;

wraps around?  Perhaps that would account for the two consistent values...

Regards,

Dave

LCE
Principal

That's not necessary with (C's ?) unsigned integer math.

(I think I did that before, it didn't change anything.)

That would only explain the values at start-up, a rather defined time, but I also get the exact same timing values if I start OspiHypRamTest() by UART anytime the application is running.

And I checked also with the 1 ms SysTick, giving the same results.

Pavel A.
Evangelist III

So what is different in "compilation"? Debug vs Release? Optimization?

 

So what is different in "compilation"? Debug vs Release? Optimization?

That would be too easy and too obvious! ;)

No, that happens with a new compilation with no change of release / debug mode or optimization settings.

And even without any change of the relevant HyperRam init and test files.

 

So it can be only something happening in the background, using the same bus as OCTOSPI, my guess. 

The test is performed at start-up, the only stuff doing using busses "in the background" until then are:

  • ADC3 (AHB4) via DMA to SRAM4 (AHB)
  • UART 3 (APB1) - with TX DMA from AXI SRAM 

See above, the UART is my best guess for now, as it is using DMA and the AXI SRAM, where also OCTOSPI is connected. And as said above, waiting until UART3 TX DMA was finished already helped.

I'll keep an eye on this with my next compilations...

Dear @LCE ,

Are you able to localize the difference of the cycles counts at WRITE or READ or for Both operations. I’m suspecting a kind of code alignment from Cortex-M7 that is propagated to the AXI ( 64-bits)  width to the memory mapped OctalSPI . If possible to share the assembly generated on both cases ( high and slow)  for the small loops ( it should be 5 to 6 assembly lines each) . Thank you for help 

Cheers,

STOne-32