
STM32U5A5: PSRAM (4MB external RAM) via QSPI (OCTOSPI1) - for you

tjaekel
Senior III

Not a question: I just want to share my experience and "tricks" for connecting a QSPI PSRAM to the STM32U5A5/STM32U575, and potentially other MCUs (from a FW point of view, not HW).

Background

I want to extend my own PCB, an STM32U5A5 (LQFP64), with an external RAM (PSRAM). The chip I use is the IS66WVS4M8ALL. It gives me an additional 32 Mbit = 4 MByte of RAM. I want to use it in Memory Mapped Mode (for read and write).

Using FMC, or OCTOSPI with 8-bit data lanes, is not possible for me on the 64-pin LQFP. So I use just OCTOSPI1 in QuadSPI (QSPI) mode.

The important parameters of the chip to bear in mind (they become important below):

  • 33 MHz normal (SPI) speed
  • 104 MHz fast (QSPI) speed
  • 1024-byte page wrap: keep this "feature" in mind

Clock Configuration

I use PLL2 (with PLL2N = 50, see the code below). The reason: the OCTOSPI can be clocked with up to 200 MHz, even though the MCU core clock (SYSCLK) is just 160 MHz.

When PLL2 is selected as the OCTOSPI1 clock, it can provide 200 MHz (but only for the OCTOSPI).

This is great, because:

  • SYSCLK = 160 MHz is too fast for the QSPI PSRAM (104 MHz max.)
  • the next possible setting (divider 2) gives just 80 MHz, which is not the full speed of the external chip
  • so I configure PLL2 for 200 MHz and with divider 2 I get 100 MHz for the external PSRAM:
    On the scope I see the data coming in bursts: a byte burst (2 clocks) runs at 125 MHz, the gap between bytes corresponds to 83 MHz. So the average is 104 MHz, exactly what the PSRAM chip supports.
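The divider choice in the list above is a plain ceiling division. Here is a minimal sketch of that arithmetic, assuming an integer divider as used with "ClockPrescaler" in this post. The helper names ospi_min_prescaler and ospi_bus_hz are hypothetical, not HAL functions:

```c
#include <assert.h>
#include <stdint.h>

/* Smallest integer divider that keeps kernel_hz / divider at or below max_hz. */
static uint32_t ospi_min_prescaler(uint32_t kernel_hz, uint32_t max_hz)
{
    assert(max_hz != 0u);
    uint32_t div = (kernel_hz + max_hz - 1u) / max_hz;   /* ceiling division */
    return (div == 0u) ? 1u : div;
}

/* Resulting bus clock for a given kernel clock and divider. */
static uint32_t ospi_bus_hz(uint32_t kernel_hz, uint32_t divider)
{
    assert(divider != 0u);
    return kernel_hz / divider;
}
```

With a 104 MHz limit, both a 160 MHz and a 200 MHz kernel clock need divider 2, which gives 80 MHz and 100 MHz respectively. That is why the 200 MHz source gets closer to the chip's limit.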

Bear in mind: the chip starts in normal SPI mode, so 33 MHz. There is a command to enable QSPI mode (4-4-4).
Only this mode runs at 104 MHz.

So I leave the clock slow, send the command to change from SPI to QSPI, and afterwards change the clock to 100 MHz (200 / 2).

Clock Config in "HAL_OSPI_MspInit()":

 

 

    PeriphClkInit.PeriphClockSelection = RCC_PERIPHCLK_OSPI;
    PeriphClkInit.OspiClockSelection = RCC_OSPICLKSOURCE_PLL2;
    PeriphClkInit.PLL2.PLL2Source = RCC_PLLSOURCE_MSI;		//HSE fails with TIME_OUT!!!!!
    PeriphClkInit.PLL2.PLL2M = 1;
    PeriphClkInit.PLL2.PLL2N = 50;		//40 = 160 MHz, 50 = 200 MHz - the max. for QSPI
    PeriphClkInit.PLL2.PLL2P = 2;
    PeriphClkInit.PLL2.PLL2Q = 1;
    PeriphClkInit.PLL2.PLL2R = 2;
    PeriphClkInit.PLL2.PLL2RGE = RCC_PLLVCIRANGE_0;
    PeriphClkInit.PLL2.PLL2FRACN = 0;
    PeriphClkInit.PLL2.PLL2ClockOut = RCC_PLL2_DIVQ;
    if (HAL_RCCEx_PeriphCLKConfig(&PeriphClkInit) != HAL_OK)
    {
       Error_Handler();
    }

 

 

In "MX_OCTOSPI1_Init":

 

 

  hospi1.Init.ClockPrescaler = 5;			//200 MHz / 5 = 40 MHz for the initial SPI mode
  /* PLL2 provides 200 MHz; the QSPI max. is 104 MHz, so 160 MHz would be too fast */

 

 

Change speed for QSPI

So, after I have sent the command to change from SPI to QSPI:

 

 

#define ENTER_QPI								0x35

 

 

I change the OCTOSPI clock again: the main change is divider 2 (200 MHz / 2 = 100 MHz).

This function looks like this:

 

 

void Change_QSPISpeed(void)
{
  /* OCTOSPI1 parameter configuration for faster speed*/
  HAL_OSPI_DeInit(&hospi1);

  hospi1.Instance = OCTOSPI1;
  hospi1.Init.FifoThreshold = 1;
  hospi1.Init.DualQuad = HAL_OSPI_DUALQUAD_DISABLE;
  hospi1.Init.MemoryType = HAL_OSPI_MEMTYPE_MICRON;
  hospi1.Init.DeviceSize = 24;		//number of address bits!
  hospi1.Init.ChipSelectHighTime = 1;
  hospi1.Init.FreeRunningClock = HAL_OSPI_FREERUNCLK_DISABLE;	//HAL_OSPI_FREERUNCLK_DISABLE;
  hospi1.Init.ClockMode = HAL_OSPI_CLOCK_MODE_0;
  hospi1.Init.WrapSize = HAL_OSPI_WRAP_NOT_SUPPORTED;	//or configure for HAL_OSPI_WRAP_32_BYTES?
  /* with HAL_OSPI_WRAP_32_BYTES the debugger disconnects ! */
  hospi1.Init.ClockPrescaler = 2;			//with PLL2N = 200 MHz = 100 MHz (104 MHz average) for QSPI PSRAM in QSPI mode
  hospi1.Init.SampleShifting = HAL_OSPI_SAMPLE_SHIFTING_NONE;
  hospi1.Init.DelayHoldQuarterCycle = HAL_OSPI_DHQC_ENABLE;
  hospi1.Init.DelayBlockBypass = HAL_OSPI_DELAY_BLOCK_USED;
#ifdef ENABLE_DCACHE
  /* with DCache: align with cache line size (32 bytes on U5A5; U575 is just 16 bytes) */
  hospi1.Init.ChipSelectBoundary = 8;		/* 4 is 4 words per chunk = 16 bytes: U575, 8 is 8 words per chunk = 32 bytes: U5A5 */
  hospi1.Init.MaxTran = 0;		/* only used if the other OCTOSPI is needed */
  hospi1.Init.Refresh = 256;
#else
  /* without DCache, but avoid page wrap - assuming AHB bus is 32-bit */
  hospi1.Init.ChipSelectBoundary = 4;	/* should it be 1 for a 32-bit bus, 4 bytes? */
  hospi1.Init.MaxTran = 0;
  hospi1.Init.Refresh = 256;
#endif
  if (HAL_OSPI_Init(&hospi1) != HAL_OK)
  {
    Error_Handler();
  }
}

 

 

Enable DQS for Write

This is very important! See the MCU errata:
STM32U575xx and STM32U585xx device errata - Errata sheet, paragraph 2.6.1, or this thread:

Solved: STM32U585 OSPI hard fault on memory-mapped write - STMicroelectronics Community

So, even though my PSRAM does not support DDR (DTR) mode, just SDR, I have to enable DQS for the Write command. Without it, memory-mapped writes do not work.

Also note: DQS is not enabled for the Read command. There seem to be other bugs, e.g.:

  • enabling DQS for a Read (in SDR mode) or
  • changing the WrapSize, e.g. to HAL_OSPI_WRAP_32_BYTES

Both result for me in a disconnect of the debugger (I assume the internal bus fabric hangs and locks up). With DQS enabled for a Read in SDR mode, the read transaction never finishes: the QSPI bus keeps clocking and OCTOSPI1 stays busy forever.

So, my configuration of the Memory Mapped commands is this:

int PSRAM_Init(void)
{
	OSPI_RegularCmdTypeDef sCommand = {0};

	  /* Enable Compensation cell */
	  EnableCompensationCell();
	  /* Delay block configuration ------------------------------------------------ */
	  if (HAL_OSPI_DLYB_GetClockPeriod(&hospi1,&dlyb_cfg) != HAL_OK)
	  {
		  return -1;
	  }

	  /* when DTR, PhaseSel is divided by 4 (empirical value) */
	  dlyb_cfg.PhaseSel /= 4;	//4

	  /* save the present configuration for check*/
	  dlyb_cfg_test = dlyb_cfg;

	  /*set delay block configuration*/
	  HAL_OSPI_DLYB_SetConfig(&hospi1, &dlyb_cfg);

	  /*check the set value*/
	  HAL_OSPI_DLYB_GetConfig(&hospi1, &dlyb_cfg);
	  if ((dlyb_cfg.PhaseSel != dlyb_cfg_test.PhaseSel) || (dlyb_cfg.Units != dlyb_cfg_test.Units))
	  {
		  return -1;
	  }

	  /*Configure QSPI mode: afterwards 4-4-4 */
	  sCommand.OperationType      = HAL_OSPI_OPTYPE_COMMON_CFG;	//HAL_OSPI_OPTYPE_WRITE_CFG;
	  sCommand.FlashId            = HAL_OSPI_FLASH_ID_1;
	  sCommand.Instruction        = ENTER_QPI;
	  sCommand.InstructionMode    = HAL_OSPI_INSTRUCTION_1_LINE;
	  sCommand.InstructionSize    = HAL_OSPI_INSTRUCTION_8_BITS;
	  sCommand.InstructionDtrMode = HAL_OSPI_INSTRUCTION_DTR_DISABLE;
	  sCommand.AddressMode        = HAL_OSPI_ADDRESS_NONE;
	  sCommand.AddressSize        = HAL_OSPI_ADDRESS_32_BITS;
	  sCommand.AddressDtrMode     = HAL_OSPI_ADDRESS_DTR_DISABLE;
	  sCommand.AlternateBytesMode = HAL_OSPI_ALTERNATE_BYTES_NONE;
	  sCommand.DataMode           = HAL_OSPI_DATA_NONE;
	  sCommand.DataDtrMode        = HAL_OSPI_DATA_DTR_DISABLE;
	  sCommand.DummyCycles        = DUMMY_CLOCK_CYCLES_WRITE;
	  sCommand.DQSMode            = HAL_OSPI_DQS_DISABLE;
	  sCommand.SIOOMode           = HAL_OSPI_SIOO_INST_EVERY_CMD;

	  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
	  {
		  return -1;
	  }

	  Change_QSPISpeed();		/* change to faster speed in QSPI mode, max. 104 MHz, we should have 100 MHz with PLL2N = 200 MHz */

	  /*Configure Memory Mapped mode*/
	  sCommand.OperationType      = HAL_OSPI_OPTYPE_WRITE_CFG;
	  sCommand.FlashId            = HAL_OSPI_FLASH_ID_1;
	  sCommand.Instruction        = WRITE_CMD;
	  sCommand.InstructionMode    = HAL_OSPI_INSTRUCTION_4_LINES;
	  sCommand.InstructionSize    = HAL_OSPI_INSTRUCTION_8_BITS;
	  sCommand.InstructionDtrMode = HAL_OSPI_INSTRUCTION_DTR_DISABLE;
	  sCommand.AddressMode        = HAL_OSPI_ADDRESS_4_LINES;
	  sCommand.AddressSize        = HAL_OSPI_ADDRESS_24_BITS;
	  sCommand.Address			  = 0x0;
	  sCommand.AddressDtrMode     = HAL_OSPI_ADDRESS_DTR_DISABLE;
	  sCommand.AlternateBytesMode = HAL_OSPI_ALTERNATE_BYTES_NONE;
	  sCommand.DataMode           = HAL_OSPI_DATA_4_LINES;
	  sCommand.DataDtrMode        = HAL_OSPI_DATA_DTR_DISABLE;
	  sCommand.DummyCycles        = DUMMY_CLOCK_CYCLES_WRITE;
	  ////sCommand.DQSMode            = HAL_OSPI_DQS_DISABLE;
	  /* VERY IMPORTANT! */
	  sCommand.DQSMode       	  = HAL_OSPI_DQS_ENABLE;			//ERRATA: DQS must be enabled for the memory-mapped Write, even in SDR mode
	  sCommand.SIOOMode           = HAL_OSPI_SIOO_INST_EVERY_CMD;

	  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
	  {
	    Error_Handler();
	  }

	  sCommand.OperationType = HAL_OSPI_OPTYPE_READ_CFG;
	  sCommand.Instruction   = READ_CMD;
	  sCommand.DummyCycles   = DUMMY_CLOCK_CYCLES_READ;
	  sCommand.DQSMode       = HAL_OSPI_DQS_DISABLE;
	  /* Remark: if you enable DQSMode here (Read in SDR mode), the debugger will disconnect! */

	  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
	  {
	    Error_Handler();
	  }

	  sMemMappedCfg.TimeOutActivation = HAL_OSPI_TIMEOUT_COUNTER_ENABLE;
	  sMemMappedCfg.TimeOutPeriod     = 0x34;

	  LED_Status(0);

	  if (HAL_OSPI_MemoryMapped(&hospi1, &sMemMappedCfg) != HAL_OK)
	  {
		  return -1;
	  }

	  return 0;
}

 

 

 

Page Wrap!

This is also very important to bear in mind: when a transaction (WRITE or READ) on the external PSRAM crosses a page boundary (the default page size is 1024 bytes, so for instance address 0x90000400), the address wraps back to the start of the SAME page!

You would not really notice this if you do not test carefully (see Memory Test below). If you write with a page wrap and the read does the same page wrap, all looks correct: everything you have written can be read back properly (whether there was a page wrap or not).

But if you read the page start again, you will see it is "corrupted" (overwritten with the "tail" of the write beyond the page size).
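To make the effect concrete, here is a small host-side model of the wrap behaviour described above. It is only a sketch: the array model_mem, the 4-page size and all function names are assumptions for illustration, not the real chip or the HAL. One burst is modelled as "nCS stays low, address wraps inside the starting page":

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PSRAM_PAGE_BYTES 1024u                     /* wrap size quoted in the post */
#define MODEL_BYTES      (4u * PSRAM_PAGE_BYTES)   /* small model, not the real 4 MByte */

static uint8_t model_mem[MODEL_BYTES];

/* One write burst: the address wraps inside the page the burst started in. */
static void model_burst_write(uint32_t addr, const uint8_t *src, uint32_t len)
{
    assert(addr < MODEL_BYTES);
    uint32_t page_base = addr & ~(PSRAM_PAGE_BYTES - 1u);
    uint32_t offset    = addr & (PSRAM_PAGE_BYTES - 1u);
    for (uint32_t i = 0; i < len; i++) {
        model_mem[page_base + ((offset + i) % PSRAM_PAGE_BYTES)] = src[i];
    }
}

/* One read burst with the same wrap rule. */
static void model_burst_read(uint32_t addr, uint8_t *dst, uint32_t len)
{
    assert(addr < MODEL_BYTES);
    uint32_t page_base = addr & ~(PSRAM_PAGE_BYTES - 1u);
    uint32_t offset    = addr & (PSRAM_PAGE_BYTES - 1u);
    for (uint32_t i = 0; i < len; i++) {
        dst[i] = model_mem[page_base + ((offset + i) % PSRAM_PAGE_BYTES)];
    }
}

/* Returns 1 if the read-back looks fine while the page start got clobbered. */
static int wrap_corrupts_page_start(void)
{
    const uint8_t marker[4] = { 0xAA, 0xAA, 0xAA, 0xAA };
    const uint8_t data[8]   = { 1, 2, 3, 4, 5, 6, 7, 8 };
    uint8_t back[8];
    uint8_t head[4];

    memset(model_mem, 0, sizeof model_mem);
    model_burst_write(0u, marker, 4u);                    /* known content at page start */
    model_burst_write(PSRAM_PAGE_BYTES - 4u, data, 8u);   /* burst crosses the page end */
    model_burst_read(PSRAM_PAGE_BYTES - 4u, back, 8u);    /* same wrap on read */
    model_burst_read(0u, head, 4u);                       /* re-read the page start */

    return (memcmp(back, data, 8) == 0) && (memcmp(head, marker, 4) != 0);
}
```

In this model the 8-byte read-back matches what was written, but the first four bytes of the page now hold the tail of the burst, and the next page is untouched.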

This is very tricky and I thought about how to "fix" this issue. The answer is:

  • There is a config parameter called "ChipSelectBoundary"

See code in my "Change_QSPISpeed()":

 

 

#ifdef ENABLE_DCACHE
  /* with DCache: align with cache line size (32 bytes on U5A5; U575 is just 16 bytes) */
  hospi1.Init.ChipSelectBoundary = 8;		/* 4 is 4 words per chunk = 16 bytes: U575, 8 is 8 words per chunk = 32 bytes: U5A5 */
  hospi1.Init.MaxTran = 0;		/* only used if the other OCTOSPI is needed */
  hospi1.Init.Refresh = 256;
#else
  /* without DCache, but avoid page wrap - assuming AHB bus is 32-bit */
  hospi1.Init.ChipSelectBoundary = 4;	/* should it be 1 for a 32-bit bus, 4 bytes? */
  hospi1.Init.MaxTran = 0;
  hospi1.Init.Refresh = 256;
#endif

 

 

What it does:

  • after a configured amount of data, the OCTOSPI ends the current transaction and generates a new one
  • it de-asserts nCS and starts again with a new command
  • so, when configured properly, a new transaction is generated at the latest on a page boundary, after 1024 bytes
  • this avoids continuing to write or read in the same page with a "Page Wrap" done by the external chip: there should not be a page wrap anymore

Also to bear in mind, esp. when testing:

  • the internal bus fabric (the AHB that OCTOSPI1 is connected to) is 32 bits wide
  • even if you do just a single byte read or write, you should always see a 32-bit word transaction (the correct byte is taken from the bus after reading an entire word)
  • So, the ChipSelectBoundary might be 4 (it worked for me, but potentially it should be 1, for one 32-bit word transaction)

There is also an optimization "trick" when DCache is used (see the following section).

DCache Optimization

The setting for "ChipSelectBoundary" depends on whether you enable DCache or not.

If you keep the setting that avoids a page wrap without DCache, and then enable DCache, you will not get the fastest performance, even though DCache is there (and speeds things up a bit). The keyword is "Cache Line Size":

  • every time the DCache has to update memory (write cache to memory, "eviction") or has to read from external PSRAM memory (update cache from memory, e.g. after the cache was invalidated), it writes and reads in larger chunks
  • the chunk size of the DCache is the "Cache Line Size":
    it depends on the MCU you are using: for instance, the U5A5 has a 32-byte Cache Line Size, the U575 just 16 bytes!
  • every single byte transferred from memory with the cache enabled can result in a burst transfer of Cache Line Size
  • if you have configured "ChipSelectBoundary" as, e.g., 4 or even 1, this 32-byte transfer is split into separate transactions
  • every transaction starts with CMD, ADDR, turnaround clocks (for READ), so a huge overhead: even with DCACHE enabled, the 32-byte transfer is split into N single transfers. Very inefficient.

Therefore, I tried to match "ChipSelectBoundary" with the Cache Line Size (here 32 bytes). This results in setting the parameter to 8, because 8 x 4 bytes is 32 bytes (the bus always reads 4 bytes, but the cache reads 32 bytes).
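A note on the value itself: the STM32U5 reference manual appears to describe the CSBOUND field behind "ChipSelectBoundary" as an exponent, i.e. nCS is released at each 2^CSBOUND byte boundary, rather than as a count of 32-bit words. Under that assumption (please verify it against the reference manual for your part), 8 would mean a 256-byte boundary, 5 would match a 32-byte cache line and 10 the 1024-byte PSRAM page. A minimal sketch of the conversion follows. The helper name csbound_from_bytes is hypothetical, not a HAL function:

```c
#include <assert.h>
#include <stdint.h>

/* Exponent n with 2^n == boundary_bytes; the boundary must be a power of two.
 * Assumption: the peripheral releases nCS at every 2^n byte boundary. */
static uint32_t csbound_from_bytes(uint32_t boundary_bytes)
{
    assert(boundary_bytes != 0u);
    assert((boundary_bytes & (boundary_bytes - 1u)) == 0u);

    uint32_t n = 0u;
    while ((1u << n) < boundary_bytes) {
        n++;
    }
    return n;
}
```

If the exponent reading is right, the values 4 and 8 used in this post still never cross the 1024-byte page (16 and 256 both divide 1024), which would explain why the memory test passes with them.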

Make sure to run a good memory test before you rely on "all is working". The "page wrap feature" of the external PSRAM was the trickiest part to deal with.

FIFO size - not clear

I am not sure how "FifoThreshold" should be set: I left it at 1. No clue whether increasing the FIFO threshold (I think the FIFO has 16 entries) has any meaning when DCache is enabled. It might be helpful only in "indirect mode".

In "memory mapped mode" I want all the words transferred into the DCache, not still sitting in the FIFO. So I assume that with DCache enabled the FIFO is not used, or it should be set to the lowest threshold. But maybe I am wrong: the DCache might read everything from the FIFO, no matter how the threshold is set.

Memory Test

I suggest implementing a memory test. The simplest way is to write the address itself as the value to the PSRAM (with DCache enabled). This should at least reveal if the "page wrap" kicks in: you would see "aliasing" (values written to a wrong address, due to a page wrap).

My memory test is very simple. It mainly checks for "aliasing" (not expecting bit errors):

 

 

ECMD_DEC_Status CMD_memt(TCMD_DEC_Results *res, EResultOut out)
{
	unsigned long *p;
	unsigned long i;
	unsigned long tick1, tick2;

	if ( !res->val[1] )
		return CMD_DEC_OK;

	//write memory:
	p = (unsigned long *)res->val[0];
	tick1 = HAL_GetTick();
	for (i = 0; i < res->val[1]; i++)
	{
		*p = (unsigned long)p;
		p++;
	}
	tick1 = HAL_GetTick() - tick1;
	//check memory:
	tick2 = HAL_GetTick();
	p = (unsigned long *)res->val[0];
	for (i = 0; i < res->val[1]; i++)
	{
		if (*p != (unsigned long)p)
		{
			hex_dump((uint8_t *)p, 16, 4, out);
			break;
		}
		p++;
	}
	tick2 = HAL_GetTick() - tick2;

	print_log(out, "WR: %d [ms] | RD: %d [ms]\r\n", (int)tick1, (int)tick2);

	return CMD_DEC_OK;
}
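The same idea can be split into two small helpers that are easier to reuse outside the command shell. This is only a sketch under assumptions: the names memtest_fill and memtest_verify are hypothetical, the pointers are declared volatile so the compiler keeps every access, and the address is truncated to 32 bits so the code also builds on a 64-bit host:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill 'words' 32-bit cells with their own (truncated) address. */
static void memtest_fill(volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++) {
        base[i] = (uint32_t)(uintptr_t)&base[i];
    }
}

/* Return the index of the first mismatching cell, or 'words' if all match. */
static size_t memtest_verify(const volatile uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++) {
        if (base[i] != (uint32_t)(uintptr_t)&base[i]) {
            return i;
        }
    }
    return words;
}
```

As in the test above, fill the whole range first and verify afterwards. A page wrap only shows up when an earlier cell is re-read after a later write has aliased onto it.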

 

 

Make sure to run a memory test, esp. if you want to use the PSRAM like a real RAM: I need RAM in order to store "Pico-C" script files on it. The entire memory should behave like a real RAM over the entire range, with no "page wraps".

If you use the PSRAM as storage for a RAM-based file system (FS), the page wrap should not be an issue: the FS handles everything as "pages" anyway (sectors, e.g. 512 bytes) and a page wrap would not really happen. It matters only for "linear memory" accesses that cross page boundaries (e.g. large text files with scripts directly in RAM).

Performance (results)

With this 4 MByte external PSRAM via QSPI, with DCache enabled and aligned with the "Cache Line Size", I get:

  • 46.4 MByte/sec for WRITE
  • 44.6 MByte/sec for READ

Pretty good: this is faster than 1/4 of SYSCLK (160 MHz). Any instruction fetch or data fetch from internal RAM might be similar (assuming a single C-code instruction also needs 4 clock cycles to be done).

I wish you success connecting a PSRAM via QSPI.

 

 

6 REPLIES 6
KDJEM.1
ST Employee

Hello @tjaekel ,

Thank you for sharing your experience that can help the community members 🙂.

Thank you for your contribution to the ST Community, and do not hesitate to create a post in the ST Community to share an idea, request, experience or question.

Kaouthar


LCE
Principal

Thanks for sharing! 

I just learned that the OCTOSPI peripheral can handle the page wrapping, nice!
When looking for HyperRam, that was one feature limiting the choice.

~45 MByte/s is pretty good, esp. that it includes the wrap handling.
Theoretical maximum at 100 MHz / 4bits / SDR is 50 MByte/s.

One question @tjaekel :

Have you tested the speed without DCACHE?

 

I will test/measure w/o DCache. But it will be way slower and depends on how I configure it:
to avoid the page wrap, every PSRAM transaction is very short, e.g. just 4 bytes (one word), and carries all the overhead (CMD, ADDR, turnaround) again on every transaction. So, 4 bytes of data plus 4 bytes of overhead (on WRITE; READ adds 6 turnaround clocks): it should go down to about 50% of the speed with the cache on.

If I leave the page wrap "in": 1024 bytes are possible as one transfer.

YES: my PSRAM runs at 100 MHz in SDR mode (and it does not have a DDR mode).
Yes, so the absolute max. would be 50 MByte/s. I think it makes sense to see a bit below 50 MByte/s with DCache and Cache Line Size chunks (32 bytes on U5A5).
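The overhead estimate can be written down as a small formula. This is a rough sketch under assumptions: quad SDR moves one byte per 2 clocks, the fixed overhead in 4-4-4 mode is taken as 8 clocks (2 for the command, 6 for the 24-bit address), a read adds the 6 turnaround clocks mentioned above, and nCS high time and bus idle gaps are ignored. The function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Payload throughput in bytes per second for back-to-back quad-SDR transactions:
 * each transaction costs overhead_clocks plus 2 clocks per payload byte. */
static uint32_t qspi_payload_bytes_per_s(uint32_t clk_hz,
                                         uint32_t payload_bytes,
                                         uint32_t overhead_clocks)
{
    assert(payload_bytes != 0u);
    uint64_t total_clocks = (uint64_t)overhead_clocks + 2u * (uint64_t)payload_bytes;
    return (uint32_t)(((uint64_t)clk_hz * payload_bytes) / total_clocks);
}
```

At 100 MHz, a 4-byte write with 8 overhead clocks comes out at 25 MByte/s, i.e. half of the 50 MByte/s ceiling, and a 4-byte read with 14 overhead clocks at about 18 MByte/s. A full 1024-byte burst without overhead gives the 50 MByte/s maximum.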

LCE
Principal

@tjaekel Thanks again for the info!

No need to test without cache, your explanation concerning page wrap makes sense!

shii
Associate II

Thanks for sharing your experience.

I'm also able to read and write to PSRAM using QSPI, but how can I use it as RAM? Please let me know.

Configure and enable Memory Mapped mode, maybe also DCache, and use it as RAM, starting at the base address of the external QSPI memory.

You can use it as RAM only after everything is initialized. So you cannot have this as RAM in your linker script. You can only initialize it, copy data to it, etc. during runtime (after Memory Mapped mode has been enabled).
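One simple way to hand out this memory at runtime is a bump allocator over the mapped region. The sketch below is an illustration only: the type psram_arena_t and both functions are hypothetical names, and on the target the base would be the memory-mapped address (0x90000000 here, going by the 0x90000400 page boundary mentioned in the post) with a size of 4 MByte:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *base;   /* start of the region, e.g. (uint8_t *)0x90000000 on target */
    size_t   size;   /* region size in bytes */
    size_t   used;   /* bytes handed out so far */
} psram_arena_t;

/* Call only after Memory Mapped mode has been enabled. */
static void psram_arena_init(psram_arena_t *a, void *base, size_t size)
{
    assert(a != NULL && base != NULL);
    a->base = (uint8_t *)base;
    a->size = size;
    a->used = 0u;
}

/* Hand out 'bytes' rounded up to a multiple of 4; NULL when the region is full. */
static void *psram_arena_alloc(psram_arena_t *a, size_t bytes)
{
    assert(a != NULL);
    size_t need = (bytes + 3u) & ~(size_t)3u;
    if (need < bytes || need > a->size - a->used) {
        return NULL;
    }
    void *p = a->base + a->used;
    a->used += need;
    return p;
}
```

Nothing is ever freed in this scheme, which is often enough for buffers that live for the whole runtime, such as a script or file buffer.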