LCE
Principal II
October 16, 2024
Question

H7 OCTOSPI HyperRAM data throughput changing with compilation


Heyho,

I've been using the H733 (custom board) / H735 (eval kit) with Infineon's HyperRAM S70KL1281 / S70KL1282 at 100 MHz for some time now. Everything works great, except for one very annoying thing:

  • the data throughput from / to HyperRAM seems to depend on compilation, even though the OCTOSPI peripheral was not changed
  • after some compilations it's about 178 MB/s, after others only 54 MB/s
  • data throughput is constant for one compilation, no matter if I call the test function at MCU power up or while operating with all other peripherals running
  • no caching anywhere

I'm pretty sure that it's not "faulty" timing measurements, using the cycle counter and disabling all interrupt calls around the for loops.
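For readers who want to cross-check the numbers, the conversion from a cycle count to a throughput figure can be sketched as below. This is a minimal sketch, not the code from the test function: the name throughput_mb_per_s is hypothetical, and it assumes 1 MB = 10^6 bytes, which may differ from what MEGA_CORRECTION does in the original (its definition is not shown).

```c
#include <stdint.h>

/* Hypothetical helper: convert a measured cycle count into MB/s.
 * Assumption: 1 MB = 1e6 bytes (the original's MEGA_CORRECTION may differ). */
static double throughput_mb_per_s(uint32_t bytes, uint32_t cycles, uint32_t cpu_hz)
{
    if (cycles == 0u || cpu_hz == 0u) {
        return 0.0;                      /* avoid division by zero */
    }
    double seconds = (double)cycles / (double)cpu_hz;
    return ((double)bytes / seconds) / 1.0e6;
}
```

As an example, 16777216 bytes transferred in 40000000 cycles at 400 MHz take 0.1 s, which works out to roughly 167.8 MB/s.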

  • Is there something wrong in my test function?
  • Is it maybe "only" how the for loop / iteration is compiled?
  • right now I can't get it back to the high speed, so no map / list file
  • my scope here is too old and slow to check the signal lines

Here's the test function, first writing to HyperRAM, then reading:

/* +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
/* OCTOSPI HyperRAM test
 */
#define HYPER_TEST_UART		1

uint32_t OspiHypRamTest(uint8_t u8CountDown)
{
	uint32_t i = 0;
	uint32_t u32Val = 0xFFFFFFFF;
	uint32_t u32MaxLen = (uint32_t)((uint32_t)OSPI_HYPERRAM_END_ADDR / 4);
	uint32_t u32Errors = 0;
	uint32_t u32Data = 0;
	uint32_t u32CycStart = 0;
	uint32_t u32Cycles = 0;
	float flClockMHz = (float)HAL_RCC_GetSysClockFreq() / 1E6;
	float flVal = 0.0f;
	uint32_t *pu32MemAddr = NULL;

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

#if HYPER_TEST_UART
	uart_printf("\n\r+++++++++++++++++++++++++++++++++++++++++++++++++\n\r");
	uart_printf("OCTOSPI HyperRAM test, memory mapped, IRQs OFF\n\rcounting ");
	if( 0 == u8CountDown ) uart_printf("UP, start with 0\n\r\n\r");
	else uart_printf("DOWN, start with %08lX\n\r\n\r", u32Val);

	uart_printf("writing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();
__disable_irq();

/* write complete HyperRAM */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = i;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = u32Val;
			u32Val--;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}

__enable_irq();
__DSB();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmWr = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmWr *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) WRITE\n\r\n\r", flOspiRamSpeedMBpsMmWr, (8.0f * flOspiRamSpeedMBpsMmWr));

	uart_printf("reading & comparing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

__disable_irq();
__DSB();

/* read complete HyperRAM and compare */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != i ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != (u32Val - i) ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
__enable_irq();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmRd = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmRd *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) READ\n\r", flOspiRamSpeedMBpsMmRd, (8.0f * flOspiRamSpeedMBpsMmRd));

	if( 0 == u32Errors ) uart_printf("\n\rNULL errors\n\r");
	else uart_printf("\n\r# ERR: u32Errors = %lu\n\r", u32Errors);
	uart_printf("-------------------------------------------------\n\r");
#endif

	return u32Errors;
}

Anybody any ideas?

Thanks in advance!

13 replies

STOne-32
Technical Moderator
October 16, 2024

Dear @LCE ,

Thanks for the interesting use case. Would it be possible to detail the exact IDE / compiler environment, so that we can try to reproduce this on our end (@KDJEM.1) and then analyze it?

Ciao

STOne-32. 

LCE
Author
Principal II
October 16, 2024

I'm using

- H735 EVK or H733 custom board

- STM32CubeIDE Version: 1.10.1

- optimization FAST

- CPU clock 400 MHz

- OSPI 100 MHz

- HyperRAM setup via direct register access (doesn't make a difference to HAL setup)
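To put the reported figures in perspective, the theoretical ceiling of the interface can be estimated from the OSPI clock. The sketch below assumes an 8-bit HyperBus data path with double data rate (two transfers per clock) and ignores command/address and latency overhead; the function names are hypothetical.

```c
/* Hypothetical helper: raw peak rate of a DDR bus in MB/s (1 MB = 1e6 bytes).
 * Assumption: two transfers per clock, no command/latency overhead. */
static double hyperbus_peak_mb_per_s(double clock_mhz, unsigned bus_bits)
{
    return clock_mhz * 2.0 * (double)bus_bits / 8.0;
}

/* Fraction of the peak rate that a measured throughput reaches. */
static double bus_utilisation(double measured_mb_per_s, double peak_mb_per_s)
{
    if (peak_mb_per_s <= 0.0) {
        return 0.0;
    }
    return measured_mb_per_s / peak_mb_per_s;
}
```

Under these assumptions 100 MHz gives a peak of 200 MB/s, so the "fast" result is close to the ceiling, while the "slow" result is far below it.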

LCE
Author
Principal II
October 16, 2024

I just got the "fast" version again. Maybe there are some bus issues in the background, depending on the UART use:

UART 3 is used for debugging, in TX DMA mode.

The output function uart_printf() fills the TX DMA buffer; at the beginning it just waits for previous transfers to finish by checking TC and other stuff in a function UartDbgDmaTxWait().

When I put UartDbgDmaTxWait() after each uart_printf() around OspiHypRamTest() I get the high speed - for now at least...
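The idea of draining pending debug output before entering the timed section can be sketched generically as a bounded poll. This is not the implementation of UartDbgDmaTxWait(), which is not shown in the thread; wait_until_idle and fake_dma_busy are hypothetical names, and the fake busy function only stands in for a real check of the DMA / UART status flags.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback that reports whether a transfer is still in progress. */
typedef bool (*busy_fn)(void *ctx);

/* Poll until the transfer is idle, giving up after max_polls busy polls.
 * Returns true if idle was reached, false on timeout. */
static bool wait_until_idle(busy_fn is_busy, void *ctx, uint32_t max_polls)
{
    while (is_busy(ctx)) {
        if (max_polls == 0u) {
            return false;                /* still busy, budget exhausted */
        }
        max_polls--;
    }
    return true;
}

/* Stand-in for hardware: reports "busy" for the number of polls stored in ctx. */
static bool fake_dma_busy(void *ctx)
{
    uint32_t *remaining = (uint32_t *)ctx;
    if (*remaining == 0u) {
        return false;
    }
    (*remaining)--;
    return true;
}
```

Calling such a wait right before __disable_irq() would make sure no DMA traffic from earlier uart_printf() calls overlaps with the measurement, which matches the observation above.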

The question remains: before I did that, why did I sometimes get fast and sometimes slow results, without changing anything concerning the OCTOSPI peripheral or the test function?

 

 

LCE
Author
Principal II
October 16, 2024

I also compared the assembler in the list files, between slow / fast version:

the important loops reading / writing HyperRAM and comparing - while the interrupts are disabled - basically look the same

Bassett.David
Associate III
October 16, 2024

Hello LCE,

Have you considered providing protection if

u32Cycles -= u32CycStart;

wraps around?  Perhaps that would account for the two consistent values...

Regards,

Dave

LCE
LCEAuthor
Principal II
October 16, 2024

That's not necessary with C's unsigned integer math: the subtraction wraps modulo 2^32, so the difference is still correct after a single counter overflow.

(I think I did that before, it didn't change anything.)

That would only explain the values at start-up, a rather defined time, but I also get the exact same timing values if I start OspiHypRamTest() by UART anytime the application is running.

And I checked also with the 1 ms SysTick, giving the same results.
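The wrap-around behaviour mentioned above can be illustrated with a short sketch. The helper name elapsed_cycles is hypothetical; the point is only that unsigned subtraction in C is defined modulo 2^32, so the result is valid as long as the measured interval is shorter than one full counter period (about 10.7 s at 400 MHz).

```c
#include <stdint.h>

/* Elapsed cycles between two readings of a free-running 32-bit counter.
 * Unsigned arithmetic wraps modulo 2^32, so one overflow between the
 * two readings does not corrupt the result. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}
```

For example, a start value of 0xFFFFFFF0 and an end value of 0x00000010 yield 0x20 cycles, exactly as if the counter had not overflowed.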

Pavel A.
Super User
October 16, 2024

So what is different in "compilation"? Debug vs Release? Optimization?

 

LCE
LCEAuthor
Principal II
October 17, 2024

Pavel A. wrote: So what is different in "compilation"? Debug vs Release? Optimization?

That would be too easy and too obvious! ;)

No, that happens with a new compilation with no change of release / debug mode or optimization settings.

And even without any change of the relevant HyperRam init and test files.

 

So my guess is that it can only be something happening in the background that uses the same bus as OCTOSPI.

The test is performed at start-up; the only things using the busses "in the background" until then are:

  • ADC3 (AHB4) via DMA to SRAM4 (AHB)
  • UART 3 (APB1) - with TX DMA from AXI SRAM 

See above: the UART is my best guess for now, as it uses DMA and the AXI SRAM, which sits on the same AXI bus that OCTOSPI is connected to. And as said above, waiting until the UART3 TX DMA had finished already helped.

I'll keep an eye on this with my next compilations...

STOne-32
Technical Moderator
October 18, 2024

Dear @LCE ,

As a follow-up: can you change all of your variables and data buffers to be 64 bits wide instead of one word, i.e. use

uint64_t instead of uint32_t

and let us know if the throughput is now stable across compilations? The idea is to use the maximum width of the AXI bus to which the OCTOSPI is connected.
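What such a 64-bit variant of the write and compare loops might look like is sketched below. This is an illustration only, not ST's suggested code: the function names are hypothetical, and on the target the base pointer would be the memory-mapped HyperRAM address. Whether the compiler actually emits a single 64-bit bus access per element should be checked in the list file.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill 'count' 64-bit elements with an ascending pattern. */
static void fill_up64(volatile uint64_t *base, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        base[i] = (uint64_t)i;
    }
}

/* Read back 'count' 64-bit elements and count mismatches. */
static uint32_t verify_up64(const volatile uint64_t *base, size_t count)
{
    uint32_t errors = 0u;
    for (size_t i = 0; i < count; i++) {
        if (base[i] != (uint64_t)i) {
            errors++;
        }
    }
    return errors;
}
```

The timing code around the loops would stay the same; only the element count halves for the same number of bytes.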

Hope it helps ,

STOne-32

 

LCE
Author
Principal II
October 21, 2024

Hello @STOne-32 ,

thanks for taking a look at this!

So, right now I get these results:

counting UP:
Read = 144.61 MB/s
Write = 58.69 MB/s

 

counting DOWN:
Read = 54.50 MB/s
Write = 179.32 MB/s

I would have expected the results to be the other way round...

Because Write UP is simply:  for( i = 0; i < u32MaxLen; i++ ) pu32MemAddr[i] = i;
And in the list file that's only 4 lines of assembler (also only 4 lines for write DOWN)

Here's the list file part with the function OspiHypRamTest(uint8_t u8CountDown)


080523b0 <OspiHypRamTest>:
 80523b0:	b538 	push	{r3, r4, r5, lr}
 80523b2:	4604 	mov	r4, r0
 80523b4:	f01d fa74 	bl	806f8a0 <HAL_RCC_GetSysClockFreq>
 80523b8:	ee07 0a90 	vmov	s15, r0
 80523bc:	4d4c 	ldr	r5, [pc, #304]	; (80524f0 <OspiHypRamTest+0x140>)
 80523be:	4a4d 	ldr	r2, [pc, #308]	; (80524f4 <OspiHypRamTest+0x144>)
 80523c0:	eeb8 7a67 	vcvt.f32.u32	s14, s15
 80523c4:	682b 	ldr	r3, [r5, #0]
 80523c6:	ed9f 6b48 	vldr	d6, [pc, #288]	; 80524e8 <OspiHypRamTest+0x138>
 80523ca:	eeb7 7ac7 	vcvt.f64.f32	d7, s14
 80523ce:	4293 	cmp	r3, r2
 80523d0:	ee27 7b06 	vmul.f64	d7, d7, d6
 80523d4:	eeb7 7bc7 	vcvt.f32.f64	s14, d7
 80523d8:	d07f 	beq.n	80524da <OspiHypRamTest+0x12a>	; load address of OCTOSPI 1
 80523da:	f502 42a0 	add.w	r2, r2, #20480	; 0x5000
 80523de:	4293 	cmp	r3, r2
 80523e0:	bf0c 	ite	eq
 80523e2:	f04f 42e0 	moveq.w	r2, #1879048192	; 0x70000000 OCTOSPI 2 unused
 80523e6:	2200 	movne	r2, #0
 80523e8:	f3bf 8f4f 	dsb	sy
 
; WRITE loops in between cpsid / cpsie
 80523ec:	b672 	cpsid	i
 80523ee:	4b42 	ldr	r3, [pc, #264]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
 
 80523f0:	3a04 	subs	r2, #4
 80523f2:	6858 	ldr	r0, [r3, #4]
 80523f4:	4611 	mov	r1, r2
 80523f6:	2c00 	cmp	r4, #0
 80523f8:	d165 	bne.n	80524c6 <OspiHypRamTest+0x116>		; u8CountDown != 0, jump to write DOWN
 80523fa:	4623 	mov	r3, r4
 
; write UP loop
 80523fc:	f841 3f04 	str.w	r3, [r1, #4]!
 8052400:	3301 	adds	r3, #1
 8052402:	f5b3 0f80 	cmp.w	r3, #4194304	; 0x400000 memory size in 32b
 8052406:	d1f9 	bne.n	80523fc <OspiHypRamTest+0x4c>
 
 8052408:	f3bf 8f5f 	dmb	sy
 805240c:	f3bf 8f4f 	dsb	sy
 8052410:	4b39 	ldr	r3, [pc, #228]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
 8052412:	685b 	ldr	r3, [r3, #4]
 8052414:	b662 	cpsie	i
; WRITE end
 
 8052416:	f3bf 8f4f 	dsb	sy
 805241a:	1a1b 	subs	r3, r3, r0
 805241c:	ed9f 6a37 	vldr	s12, [pc, #220]	; 80524fc <OspiHypRamTest+0x14c>
 8052420:	eddf 6a37 	vldr	s13, [pc, #220]	; 8052500 <OspiHypRamTest+0x150>
 8052424:	ee07 3a90 	vmov	s15, r3
 8052428:	4936 	ldr	r1, [pc, #216]	; (8052504 <OspiHypRamTest+0x154>)
 805242a:	ee27 7a26 	vmul.f32	s14, s14, s13
 805242e:	eef8 7a67 	vcvt.f32.u32	s15, s15
 8052432:	eec6 6a27 	vdiv.f32	s13, s12, s15
 8052436:	ee66 7a87 	vmul.f32	s15, s13, s14
 805243a:	edc1 7a00 	vstr	s15, [r1]
 805243e:	f3bf 8f4f 	dsb	sy
 8052442:	492c 	ldr	r1, [pc, #176]	; (80524f4 <OspiHypRamTest+0x144>)
 8052444:	682b 	ldr	r3, [r5, #0]
 8052446:	428b 	cmp	r3, r1
 8052448:	d04a 	beq.n	80524e0 <OspiHypRamTest+0x130>
 805244a:	482f 	ldr	r0, [pc, #188]	; (8052508 <OspiHypRamTest+0x158>)
 805244c:	492f 	ldr	r1, [pc, #188]	; (805250c <OspiHypRamTest+0x15c>)
 805244e:	4283 	cmp	r3, r0
 8052450:	bf08 	it	eq
 8052452:	460a 	moveq	r2, r1

; READ & compare loops in between cpsid / cpsie
 8052454:	b672 	cpsid	i
 8052456:	f3bf 8f4f 	dsb	sy
 805245a:	bb1c 	cbnz	r4, 80524a4 <OspiHypRamTest+0xf4>		; u8CountDown != 0, jump to read DOWN
 805245c:	4b26 	ldr	r3, [pc, #152]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
 805245e:	4620 	mov	r0, r4
 8052460:	685c 	ldr	r4, [r3, #4]
 8052462:	4603 	mov	r3, r0
 
; read & compare UP loop
 8052464:	f852 1f04 	ldr.w	r1, [r2, #4]!
 8052468:	4299 	cmp	r1, r3
 805246a:	f103 0301 	add.w	r3, r3, #1
 805246e:	bf18 	it	ne
 8052470:	3001 	addne	r0, #1
 8052472:	f5b3 0f80 	cmp.w	r3, #4194304	; 0x400000 memory size in 32b
 8052476:	d1f5 	bne.n	8052464 <OspiHypRamTest+0xb4>
 
 8052478:	f3bf 8f5f 	dmb	sy
 805247c:	f3bf 8f4f 	dsb	sy
 8052480:	4b1d 	ldr	r3, [pc, #116]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
 8052482:	685b 	ldr	r3, [r3, #4]
 8052484:	b662 	cpsie	i
; READ & compare end
 
 8052486:	1b1b 	subs	r3, r3, r4
 8052488:	ed9f 6a1c 	vldr	s12, [pc, #112]	; 80524fc <OspiHypRamTest+0x14c>
 805248c:	4a20 	ldr	r2, [pc, #128]	; (8052510 <OspiHypRamTest+0x160>)
 805248e:	ee07 3a90 	vmov	s15, r3
 8052492:	eef8 7a67 	vcvt.f32.u32	s15, s15
 8052496:	eec6 6a27 	vdiv.f32	s13, s12, s15
 805249a:	ee26 7a87 	vmul.f32	s14, s13, s14
 805249e:	ed82 7a00 	vstr	s14, [r2]
 80524a2:	bd38 	pop	{r3, r4, r5, pc}
 
; read & compare prepare
80524a4:	4914 	ldr	r1, [pc, #80]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
 80524a6:	f04f 33ff 	mov.w	r3, #4294967295		; 0xFFFFFFFF
 80524aa:	2000 	movs	r0, #0
 80524ac:	f46f 0c80 	mvn.w	ip, #4194304		; 0x400000 memory size in 32b
 80524b0:	684c 	ldr	r4, [r1, #4]
 
; read & compare DOWN loop
 80524b2:	f852 1f04 	ldr.w	r1, [r2, #4]!
 80524b6:	4299 	cmp	r1, r3
 80524b8:	f103 33ff 	add.w	r3, r3, #4294967295	; 0xFFFFFFFF
 80524bc:	bf18 	it	ne
 80524be:	3001 	addne	r0, #1
 80524c0:	4563 	cmp	r3, ip
 80524c2:	d1f6 	bne.n	80524b2 <OspiHypRamTest+0x102>
 80524c4:	e7d8 	b.n	8052478 <OspiHypRamTest+0xc8>	; back to end of read
 
 80524c6:	f04f 33ff 	mov.w	r3, #4294967295		; 0xFFFFFFFF
 80524ca:	f46f 0c80 	mvn.w	ip, #4194304		; 0x400000 memory size in 32b
 
; write DOWN loop
 80524ce:	f841 3f04 	str.w	r3, [r1, #4]!
 80524d2:	3b01 	subs	r3, #1
 80524d4:	4563 	cmp	r3, ip
 80524d6:	d1fa 	bne.n	80524ce <OspiHypRamTest+0x11e>
 
 80524d8:	e796 	b.n	8052408 <OspiHypRamTest+0x58>	; back to end of write
 
 80524da:	f04f 4210 	mov.w	r2, #2415919104	; 0x90000000	OCTOSPI 1 with HyperRAM
 80524de:	e783 	b.n	80523e8 <OspiHypRamTest+0x38>		; back
 
 80524e0:	4a0c 	ldr	r2, [pc, #48]	; (8052514 <OspiHypRamTest+0x164>)
 80524e2:	e7b7 	b.n	8052454 <OspiHypRamTest+0xa4>		; back to 1st loop
 80524e4:	f3af 8000 	nop.w
 
 80524e8:	a0b5ed8d 	.word	0xa0b5ed8d
 80524ec:	3eb0c6f7 	.word	0x3eb0c6f7
 80524f0:	24002cbc 	.word	0x24002cbc
 80524f4:	52005000 	.word	0x52005000
 80524f8:	e0001000 	.word	0xe0001000
 80524fc:	4b800000 	.word	0x4b800000
 8052500:	3f742400 	.word	0x3f742400
 8052504:	24002bfc 	.word	0x24002bfc
 8052508:	5200a000 	.word	0x5200a000
 805250c:	6ffffffc 	.word	0x6ffffffc
 8052510:	24002bf8 	.word	0x24002bf8
 8052514:	8ffffffc 	.word	0x8ffffffc

 

LCE
Author
Principal II
October 21, 2024

... including some comments.

I didn't find the option to select "assembler" for source code posting.

LCE
Author
Principal II
October 21, 2024

I just found that I had set the alignment of the HyperRAM in the linker file to "ALIGN(8)" = 64 bit.

I changed it to ALIGN(4) - and it didn't change anything.
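One plausible reading of this result: the test function writes through a fixed pointer to the memory-mapped base address (0x90000000 for OCTOSPI 1 according to the list file), not through a linker-placed symbol, so the section alignment would not affect the addresses used. A small, hypothetical helper to check an address against an alignment could look like this:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if 'addr' is a multiple of 'alignment'.
 * 'alignment' must be a non-zero power of two. */
static bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    if (alignment == 0u || (alignment & (alignment - 1u)) != 0u) {
        return false;                    /* not a power of two */
    }
    return (addr & (alignment - 1u)) == 0u;
}
```

The base address 0x90000000 satisfies both the 4-byte and the 8-byte alignment, so either linker setting yields the same starting address for a section placed at the beginning of that region.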