2024-03-05 06:48 PM
Not a question: just to share with you my experience and "tricks" to connect a QSPI PSRAM on STM32U5A5/STM32U575 and potentially other MCUs (from a FW point of view, not HW).
Background
I want to extend my own PCB, an STM32U5A5 (LQFP64), with an external RAM (PSRAM). The chip I use is the IS66WVS4M8ALL. It gives me an additional 32 Mbit = 4 MByte of SRAM. I want to use it in Memory Mapped Mode (for Read and Write).
Using FMC or OCTOSPI with 8-bit data lanes is not possible for me on the 64-pin LQFP. So, just OCTOSPI1 in QuadSPI (QSPI) mode.
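The size arithmetic above can be checked with a few lines of host-side C. This is only a sketch: the helper names mbit_to_bytes() and address_bits_for() are made up for illustration and are not part of the HAL. It shows that 4 MByte needs 22 address bits; the code further down sets DeviceSize = 24, which matches the 24-bit address phase of the commands rather than the minimum needed to cover the device.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a density in megabits to bytes (1 Mbit = 2^20 bits). */
uint32_t mbit_to_bytes(uint32_t mbit)
{
    assert(mbit <= 4095u);              /* keep the product inside 32 bits */
    return (mbit * 1024u * 1024u) / 8u;
}

/* Smallest n such that 2^n >= size_bytes. */
unsigned address_bits_for(uint32_t size_bytes)
{
    unsigned bits = 0;

    assert(size_bytes > 0u);
    while (bits < 32u && (1ull << bits) < size_bytes)
    {
        bits++;
    }
    return bits;
}
```

For example, address_bits_for(mbit_to_bytes(32)) covers the IS66WVS4M8ALL.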
The important parameters of the chip to bear in mind (becomes important later/below):
Clock Configuration
I use PLL2N. The reason: the OCTOSPI can be clocked at up to 200 MHz, even though the MCU core clock (SYSCLK) is just 160 MHz.
When PLL2N is selected as the OCTOSPI1 clock, it can provide 200 MHz (but just for OCTOSPI).
This is great, because:
Bear in mind: the chip starts in normal SPI mode, so 33 MHz. There is a command to enable QSPI mode (as 4-4-4).
Only this mode runs at 104 MHz.
So, I leave the clock slow, send the command to change from SPI to QSPI, and afterwards change the clock to 100 MHz (200 / 2).
Clock Config in "HAL_OSPI_MspInit()":
PeriphClkInit.PeriphClockSelection = RCC_PERIPHCLK_OSPI;
PeriphClkInit.OspiClockSelection = RCC_OSPICLKSOURCE_PLL2;
PeriphClkInit.PLL2.PLL2Source = RCC_PLLSOURCE_MSI; //HSE fails with TIME_OUT!!!!!
PeriphClkInit.PLL2.PLL2M = 1;
PeriphClkInit.PLL2.PLL2N = 50; //40 = 160 MHz, 50 = 200 MHz - the max. for QSPI
PeriphClkInit.PLL2.PLL2P = 2;
PeriphClkInit.PLL2.PLL2Q = 1;
PeriphClkInit.PLL2.PLL2R = 2;
PeriphClkInit.PLL2.PLL2RGE = RCC_PLLVCIRANGE_0;
PeriphClkInit.PLL2.PLL2FRACN = 0;
PeriphClkInit.PLL2.PLL2ClockOut = RCC_PLL2_DIVQ;
if (HAL_RCCEx_PeriphCLKConfig(&PeriphClkInit) != HAL_OK)
{
  Error_Handler();
}
In "MX_OCTOSPI1_Init":
hospi1.Init.ClockPrescaler = 5; //200 MHz / 5 = 40 MHz for the initial SPI mode
/* PLL2N provides 200 MHz; the QSPI max. is 104 MHz (160 MHz is too fast!), so start with 40 MHz in SPI mode */
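The clock numbers above follow from simple divider arithmetic, which can be sketched on a host PC as below. The function names are hypothetical, and the 4 MHz MSI input is an assumption (it is the value that makes PLL2N = 50 come out at 200 MHz, as the comments in the post say). The sketch also assumes that ClockPrescaler is the plain divider value, which is how the post uses it (5 gives 40 MHz, 2 gives 100 MHz).

```c
#include <assert.h>
#include <stdint.h>

/* PLL2 Q output: input / M * N / Q. */
uint32_t pll2_q_hz(uint32_t src_hz, uint32_t m, uint32_t n, uint32_t q)
{
    assert(m != 0u && q != 0u);
    return (uint32_t)((((uint64_t)src_hz / m) * n) / q);
}

/* OCTOSPI bus clock: kernel clock divided by the prescaler. */
uint32_t ospi_sclk_hz(uint32_t kernel_hz, uint32_t prescaler)
{
    assert(prescaler >= 1u);
    return kernel_hz / prescaler;
}
```

With a 4 MHz input, M = 1, N = 50 and Q = 1 the kernel clock is 200 MHz; a prescaler of 5 then gives the slow start-up clock and a prescaler of 2 the 100 MHz used in QSPI mode.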
Change speed for QSPI
So, after I have sent the command to change from SPI to QSPI:
#define ENTER_QPI 0x35
I change the OCTOSPI clock again: the divider is now 2 (200 MHz / 2 = 100 MHz).
This function looks like this:
void Change_QSPISpeed(void)
{
  /* OCTOSPI1 parameter configuration for faster speed */
  HAL_OSPI_DeInit(&hospi1);
  hospi1.Instance = OCTOSPI1;
  hospi1.Init.FifoThreshold = 1;
  hospi1.Init.DualQuad = HAL_OSPI_DUALQUAD_DISABLE;
  hospi1.Init.MemoryType = HAL_OSPI_MEMTYPE_MICRON;
  hospi1.Init.DeviceSize = 24; //number of address bits!
  hospi1.Init.ChipSelectHighTime = 1;
  hospi1.Init.FreeRunningClock = HAL_OSPI_FREERUNCLK_DISABLE;
  hospi1.Init.ClockMode = HAL_OSPI_CLOCK_MODE_0;
  hospi1.Init.WrapSize = HAL_OSPI_WRAP_NOT_SUPPORTED; //or configure for HAL_OSPI_WRAP_32_BYTES?
  /* with HAL_OSPI_WRAP_32_BYTES the debugger disconnects! */
  hospi1.Init.ClockPrescaler = 2; //with PLL2N = 200 MHz: 200 / 2 = 100 MHz (max. is 104 MHz) for the PSRAM in QSPI mode
  hospi1.Init.SampleShifting = HAL_OSPI_SAMPLE_SHIFTING_NONE;
  hospi1.Init.DelayHoldQuarterCycle = HAL_OSPI_DHQC_ENABLE;
  hospi1.Init.DelayBlockBypass = HAL_OSPI_DELAY_BLOCK_USED;
#ifdef ENABLE_DCACHE
  /* with DCache: align with cache line size (32bytes: U5A5, U575 is just 16bytes) */
  hospi1.Init.ChipSelectBoundary = 8; /* 4 is 4 words per chunk = 16bytes: U575, 8 is 8 words per chunk = 32bytes: U5A5 */
  hospi1.Init.MaxTran = 0; /* only used if the other OCTOSPI is needed */
  hospi1.Init.Refresh = 256;
#else
  /* without DCache, but avoid page wrap - assuming AHB bus is 32bit */
  hospi1.Init.ChipSelectBoundary = 4; /* should it be 1 for 32bit bus, 4 bytes? */
  hospi1.Init.MaxTran = 0;
  hospi1.Init.Refresh = 256;
#endif
  if (HAL_OSPI_Init(&hospi1) != HAL_OK)
  {
    Error_Handler();
  }
}
Enable DQS for Write
This is very important! See the MCU Errata:
STM32U575xx and STM32U585xx device errata - Errata sheet paragraph 2.6.1 or this thread:
Solved: STM32U585 OSPI hard fault on memory-mapped write - STMicroelectronics Community
So, even though my PSRAM does not support DDR (DTR) mode, just SDR, I have to enable DQS for the Write command. Without it, memory-mapped writes do not work.
Also note: DQS is not enabled for the Read command. There seem to be other bugs, e.g. enabling DQS for a Read results for me in a disconnect of the debugger (I assume the internal bus fabric hangs and is locked up). With DQS enabled for a Read in SDR mode, the read transaction never finishes: the QSPI bus keeps clocking and OCTOSPI1 stays busy forever.
So, my configuration for the Memory Mapped commands is this:
int PSRAM_Init(void)
{
  OSPI_RegularCmdTypeDef sCommand = {0};

  /* Enable Compensation cell */
  EnableCompensationCell();

  /* Delay block configuration ------------------------------------------------ */
  if (HAL_OSPI_DLYB_GetClockPeriod(&hospi1, &dlyb_cfg) != HAL_OK)
  {
    return -1;
  }
  /* when DTR, PhaseSel is divided by 4 (empirical value) */
  dlyb_cfg.PhaseSel /= 4; //4
  /* save the present configuration for check */
  dlyb_cfg_test = dlyb_cfg;
  /* set delay block configuration */
  HAL_OSPI_DLYB_SetConfig(&hospi1, &dlyb_cfg);
  /* check the set value */
  HAL_OSPI_DLYB_GetConfig(&hospi1, &dlyb_cfg);
  if ((dlyb_cfg.PhaseSel != dlyb_cfg_test.PhaseSel) || (dlyb_cfg.Units != dlyb_cfg_test.Units))
  {
    return -1;
  }

  /* Configure QSPI mode: afterwards 4-4-4 */
  sCommand.OperationType = HAL_OSPI_OPTYPE_COMMON_CFG; //HAL_OSPI_OPTYPE_WRITE_CFG;
  sCommand.FlashId = HAL_OSPI_FLASH_ID_1;
  sCommand.Instruction = ENTER_QPI;
  sCommand.InstructionMode = HAL_OSPI_INSTRUCTION_1_LINE;
  sCommand.InstructionSize = HAL_OSPI_INSTRUCTION_8_BITS;
  sCommand.InstructionDtrMode = HAL_OSPI_INSTRUCTION_DTR_DISABLE;
  sCommand.AddressMode = HAL_OSPI_ADDRESS_NONE;
  sCommand.AddressSize = HAL_OSPI_ADDRESS_32_BITS;
  sCommand.AddressDtrMode = HAL_OSPI_ADDRESS_DTR_DISABLE;
  sCommand.AlternateBytesMode = HAL_OSPI_ALTERNATE_BYTES_NONE;
  sCommand.DataMode = HAL_OSPI_DATA_NONE;
  sCommand.DataDtrMode = HAL_OSPI_DATA_DTR_DISABLE;
  sCommand.DummyCycles = DUMMY_CLOCK_CYCLES_WRITE;
  sCommand.DQSMode = HAL_OSPI_DQS_DISABLE;
  sCommand.SIOOMode = HAL_OSPI_SIOO_INST_EVERY_CMD;
  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
  {
    return -1;
  }

  Change_QSPISpeed(); /* change to faster speed in QSPI mode, max. 104 MHz, we should have 100 MHz with PLL2N = 200 MHz */

  /* Configure Memory Mapped mode: Write command */
  sCommand.OperationType = HAL_OSPI_OPTYPE_WRITE_CFG;
  sCommand.FlashId = HAL_OSPI_FLASH_ID_1;
  sCommand.Instruction = WRITE_CMD;
  sCommand.InstructionMode = HAL_OSPI_INSTRUCTION_4_LINES;
  sCommand.InstructionSize = HAL_OSPI_INSTRUCTION_8_BITS;
  sCommand.InstructionDtrMode = HAL_OSPI_INSTRUCTION_DTR_DISABLE;
  sCommand.AddressMode = HAL_OSPI_ADDRESS_4_LINES;
  sCommand.AddressSize = HAL_OSPI_ADDRESS_24_BITS;
  sCommand.Address = 0x0;
  sCommand.AddressDtrMode = HAL_OSPI_ADDRESS_DTR_DISABLE;
  sCommand.AlternateBytesMode = HAL_OSPI_ALTERNATE_BYTES_NONE;
  sCommand.DataMode = HAL_OSPI_DATA_4_LINES;
  sCommand.DataDtrMode = HAL_OSPI_DATA_DTR_DISABLE;
  sCommand.DummyCycles = DUMMY_CLOCK_CYCLES_WRITE;
  /* VERY IMPORTANT! */
  sCommand.DQSMode = HAL_OSPI_DQS_ENABLE; //ERRATA: DQS must be enabled for the Write, even in SDR mode
  sCommand.SIOOMode = HAL_OSPI_SIOO_INST_EVERY_CMD;
  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
  {
    Error_Handler();
  }

  /* Configure Memory Mapped mode: Read command */
  sCommand.OperationType = HAL_OSPI_OPTYPE_READ_CFG;
  sCommand.Instruction = READ_CMD;
  sCommand.DummyCycles = DUMMY_CLOCK_CYCLES_READ;
  sCommand.DQSMode = HAL_OSPI_DQS_DISABLE;
  /* Remark: if you enable DQSMode here - the debugger will disconnect! */
  if (HAL_OSPI_Command(&hospi1, &sCommand, HAL_OSPI_TIMEOUT_DEFAULT_VALUE) != HAL_OK)
  {
    Error_Handler();
  }

  sMemMappedCfg.TimeOutActivation = HAL_OSPI_TIMEOUT_COUNTER_ENABLE;
  sMemMappedCfg.TimeOutPeriod = 0x34;
  LED_Status(0);
  if (HAL_OSPI_MemoryMapped(&hospi1, &sMemMappedCfg) != HAL_OK)
  {
    return -1;
  }
  return 0;
}
Page Wrap!
This is also very important to bear in mind: when a transaction (WRITE or READ) on the external PSRAM crosses a page boundary (the default page size is 1024 bytes, so for instance when address 0x90000400 is crossed), it wraps back to the start of the SAME page!
You would not really notice this if you do not test carefully (see Memory Test below). If you write with a page wrap and the read also does a page wrap, all looks correct: everything you have written can be read back properly (no matter whether there was a page wrap or not).
But if you read the page start again, you will see it is "corrupted" (overwritten with the "tail" of the write that went beyond the page size).
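The wrap behaviour described above can be modelled in a few lines of host-side C. This is only a sketch: psram_burst_addr() is a made-up helper, and the 1024-byte page size is the default mentioned in the post. It returns the address that the n-th byte of a single burst actually lands on when the device wraps inside the page where the burst started.

```c
#include <assert.h>
#include <stdint.h>

/* Address hit by byte number n of one burst that starts at 'start',
 * for a device that wraps at page boundaries (page_size is a power of two). */
uint32_t psram_burst_addr(uint32_t start, uint32_t n, uint32_t page_size)
{
    uint32_t page_base;

    assert(page_size != 0u && (page_size & (page_size - 1u)) == 0u);
    page_base = start & ~(page_size - 1u);
    return page_base | ((start + n) & (page_size - 1u));
}
```

For a burst starting at 0x900003FC, byte 4 should go to 0x90000400, but the model returns 0x90000000: the start of the same page gets overwritten.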
This is very tricky and I thought about how to "fix" this issue. The answer is:
See code in my "Change_QSPISpeed()":
#ifdef ENABLE_DCACHE
/* with DCache: align with cache line size (32bytes: U5A5, U575 is just 16bytes) */
hospi1.Init.ChipSelectBoundary = 8; /* 4 is 4 words per chunk = 16bytes: U575, 8 is 8 words per chunk = 32bytes: U5A5 */
hospi1.Init.MaxTran = 0; /* only used if the other OCTOSPI is needed */
hospi1.Init.Refresh = 256;
#else
/* without DCache, but avoid page wrap - assuming AHB bus is 32bit */
hospi1.Init.ChipSelectBoundary = 4; /* should it be 1 for 32bit bus, 4 bytes? */
hospi1.Init.MaxTran = 0;
hospi1.Init.Refresh = 256;
#endif
What it does:
Also to bear in mind, esp. when testing:
And see also, there is an optimization "trick" when DCache is used (see following).
DCache Optimization
The setting for the "ChipSelectBoundary" depends on: do you enable DCache or not?
If you keep the setting that works to avoid a page wrap without DCache, but then enable DCache, you will not get the fastest performance, even though DCache is there (and speeds things up a bit). The keyword is "Cache Line Size":
Therefore, I tried to match the "ChipSelectBoundary" with the "CacheLineSize" (here 32 bytes). This results in setting the parameter to 8, because 8 x 4 bytes is 32 bytes (the bus always reads 4 bytes, but the cache reads 32 bytes).
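One point worth double-checking against the reference manual: the OCTOSPI CSBOUND field is, to my understanding, an exponent (chip select is released at every 2^CSBOUND byte boundary, 0 disables the feature) rather than a count of 32-bit words. The sketch below works under that assumption; the helper names are hypothetical. If the assumption holds, a value of 8 means a 256-byte boundary, 4 means 16 bytes, and 5 would be the value matching a 32-byte cache line. Any of these still avoids the 1024-byte page wrap, because each boundary divides the page size.

```c
#include <assert.h>
#include <stdint.h>

/* Boundary in bytes, ASSUMING CSBOUND is an exponent (0 = feature off). */
uint32_t cs_boundary_bytes(uint32_t csbound)
{
    assert(csbound <= 31u);
    return (csbound == 0u) ? 0u : (1u << csbound);
}

/* A burst can never cross a page if the boundary divides the page size. */
int boundary_prevents_page_wrap(uint32_t csbound, uint32_t page_bytes)
{
    uint32_t b = cs_boundary_bytes(csbound);

    return (b != 0u) && (b <= page_bytes) && ((page_bytes % b) == 0u);
}
```

Under this reading, the measured speed-up with the value 8 would come from longer bursts (several cache lines per chip select), not from an exact match with one cache line.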
Make sure to run a good memory test before you rely on: "all is working". The "page wrap feature" in external PSRAM was the most tricky part to deal with.
FIFO size - not clear
I am not sure how the "FifoThreshold" should be set: I left it at 1. No clue if increasing FIFO size (I think it has 16 entries) has a meaning when DCache is enabled. It might be just helpful in "indirect mode".
In "memory mapped mode" I want to get all the words transferred into DCache, not sitting just in FIFO still. So, I assume, for DCache enabled, the FIFO is not used or it should be set to the lowest threshold. But maybe, I am wrong: the DCache might read all from FIFO, never mind how the threshold is set.
Memory Test
I suggest implementing a memory test. The simplest way is to write the address itself as the value to the PSRAM (with DCache enabled). This should at least reveal whether the "page wrap" kicks in: you would see "aliasing" (values written to a wrong address, due to a page wrap).
My memory test is very simple. It checks mainly for the "aliasing" (not expecting bit errors):
ECMD_DEC_Status CMD_memt(TCMD_DEC_Results *res, EResultOut out)
{
  unsigned long *p;
  unsigned long i;
  unsigned long tick1, tick2;

  if ( !res->val[1] )
    return CMD_DEC_OK;

  //write memory:
  p = (unsigned long *)res->val[0];
  tick1 = HAL_GetTick();
  for (i = 0; i < res->val[1]; i++)
  {
    *p = (unsigned long)p;
    p++;
  }
  tick1 = HAL_GetTick() - tick1;

  //check memory:
  tick2 = HAL_GetTick();
  p = (unsigned long *)res->val[0];
  for (i = 0; i < res->val[1]; i++)
  {
    if (*p != (unsigned long)p)
    {
      hex_dump((uint8_t *)p, 16, 4, out);
      break;
    }
    p++;
  }
  tick2 = HAL_GetTick() - tick2;

  print_log(out, "WR: %d [ms] | RD: %d [ms]\r\n", (int)tick1, (int)tick2);
  return CMD_DEC_OK;
}
Make sure to run a memory test, esp. if you want to use the PSRAM like a real RAM: I need RAM in order to store "Pico-C" script files on it. The entire memory should behave like a real RAM over the entire range, with no "page wraps".
If you use the PSRAM as storage for a RAM based file system (FS), the page wrap should not be an issue: the FS handles everything as "pages" anyway (sectors, e.g. 512 bytes), so a page wrap would not really happen. It only matters for a "linear memory" where accesses cross page boundaries (e.g. large text files with scripts directly in RAM).
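To see why the "address as value" pattern catches the page wrap, the effect can be reproduced on a host PC with a tiny simulated device. This is only a sketch under stated assumptions: the two-page simulated memory, the 1024-byte page size and all sim_* names are invented for illustration and have nothing to do with the HAL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_PAGE 1024u             /* assumed PSRAM page size in bytes */
#define SIM_SIZE (2u * SIM_PAGE)   /* tiny simulated device: two pages */

static uint8_t sim_mem[SIM_SIZE];

/* One write burst: the device wraps inside the page the burst started in. */
void sim_burst_write(uint32_t addr, const void *src, uint32_t len)
{
    const uint8_t *s = (const uint8_t *)src;
    uint32_t base = addr & ~(SIM_PAGE - 1u);

    for (uint32_t i = 0; i < len; i++)
    {
        sim_mem[base | ((addr + i) & (SIM_PAGE - 1u))] = s[i];
    }
}

/* Write "address as value" to [start, end) using bursts of 'burst' bytes. */
void sim_write_pattern(uint32_t start, uint32_t end, uint32_t burst)
{
    uint32_t words[64];            /* room for bursts of up to 256 bytes */

    assert(burst >= 4u && burst <= sizeof(words) && (burst % 4u) == 0u);
    assert((start % 4u) == 0u && (end % 4u) == 0u && end <= SIM_SIZE);
    for (uint32_t addr = start; addr < end; addr += burst)
    {
        uint32_t len = (end - addr < burst) ? (end - addr) : burst;

        for (uint32_t w = 0; w < len / 4u; w++)
        {
            words[w] = addr + 4u * w;
        }
        sim_burst_write(addr, words, len);
    }
}

/* Count words whose content is not their own address (aliasing). */
uint32_t sim_count_aliases(void)
{
    uint32_t bad = 0;

    for (uint32_t addr = 0; addr < SIM_SIZE; addr += 4u)
    {
        uint32_t v;

        memcpy(&v, &sim_mem[addr], sizeof(v));
        if (v != addr)
        {
            bad++;
        }
    }
    return bad;
}
```

Filling the device word by word and then writing one 16-byte burst that starts 8 bytes before the page end leaves two bad words at the page start; rewriting with bursts that never cross a page boundary repairs them.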
Performance (results)
With this 4MByte external PSRAM, via QSPI, enabling DCache and align with the "Cache Line Size" I get:
Pretty good: this is faster than 1/4 of SYSCLK (160 MHz). Any instruction fetch or data fetch from internal RAM might be similar (assuming a single C-code instruction also needs 4 clock cycles).
Have success with connecting PSRAM via QSPI.
2024-03-07 06:31 AM - edited 2024-03-07 06:32 AM
Hello @tjaekel ,
Thank you for sharing your experience that can help the community members :).
Thank you for your contribution in STCommunity and do not hesitate to create a post in the ST community for sharing an idea, request, experience or question.
Kaouthar
2024-03-07 07:54 AM
Thanks for sharing!
I just learned that the OCTOSPI peripheral can handle the page wrapping, nice!
When looking for HyperRam, that was one feature limiting the choice.
~45 MByte/s is pretty good, esp. that it includes the wrap handling.
Theoretical maximum at 100 MHz / 4bits / SDR is 50 MByte/s.
One question @tjaekel :
Have you tested the speed without DCACHE?
2024-03-07 09:09 AM
I will test/measure w/o DCache. But it will be way slower, and it depends on how I configure it:
To avoid the page wrap, every PSRAM transaction is very short, e.g. just 4 bytes (one word), and carries all the overhead (CMD, ADDR, turn around) again on every transaction. So, 4 bytes of data plus 4 bytes of overhead (on WRITE; a READ adds 6 clocks of turn around): it should go down to 50% of the speed with cache on.
If I leave the page wrap "in": 1024 bytes are possible as one transfer.
YES: my PSRAM runs with 100 MHz in SDR mode (and it does not have a DDR mode).
Yes, so, absolute max. would be 50MByte/s. I think it makes sense with DCache: Cache Line Size chunks (32bytes on U5A5), to see a bit below 50MByte/s.
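The estimate above can be put into numbers with a small host-side calculation. It is only a rough model, and the function names are made up: it assumes 4-4-4 mode (2 clocks for the 8-bit command, 6 clocks for the 24-bit address, 2 clocks per data byte), 0 dummy clocks for a write and the 6 turn-around clocks mentioned above for a read, and it ignores chip select high time and refresh.

```c
#include <assert.h>
#include <stdint.h>

/* Clocks for one burst in 4-4-4 mode: command + address + dummy + data. */
uint32_t qpi_burst_clocks(uint32_t data_bytes, uint32_t dummy_clocks)
{
    return 2u + 6u + dummy_clocks + 2u * data_bytes;
}

/* Payload throughput in bytes per second for back-to-back bursts. */
uint32_t qpi_throughput_bytes_per_s(uint32_t sclk_hz, uint32_t data_bytes,
                                    uint32_t dummy_clocks)
{
    assert(data_bytes > 0u);
    return (uint32_t)(((uint64_t)sclk_hz * data_bytes) /
                      qpi_burst_clocks(data_bytes, dummy_clocks));
}
```

At 100 MHz this gives 25 MByte/s for 4-byte writes (the 50% case), about 44 MByte/s for 32-byte writes and about 41 MByte/s for 32-byte reads, all below the 50 MByte/s bus limit.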
2024-03-07 09:37 AM
@tjaekel Thanks again for the info!
No need to test without cache, your explanation concerning page wrap makes sense!
2024-04-25 03:59 AM
Thanks for sharing your experience.
I'm also able to read and write to the PSRAM using QSPI, but how can I use it as RAM? Please let me know.
2024-04-25 09:10 AM
Configure and enable the Memory Mapped mode, maybe also DCache, and use it as RAM, starting at the base address for the external QSPI memory.
You can use it as RAM only after everything is initialized. So, you cannot have it as RAM in your linker script. You can only initialize it, copy data to it etc. during runtime (after Memory Mapped mode was enabled).
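One simple way to hand out this memory at runtime is a small bump allocator over the mapped window. The sketch below is not from the thread: ext_heap_t, ext_heap_init() and ext_heap_alloc() are hypothetical names, there is no free(), and allocations are simply rounded up to 4 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    uint8_t *base;   /* start of the external RAM window */
    size_t   size;   /* usable bytes in the window */
    size_t   used;   /* bytes handed out so far */
} ext_heap_t;

void ext_heap_init(ext_heap_t *h, void *base, size_t size)
{
    assert(h != NULL && base != NULL);
    h->base = (uint8_t *)base;
    h->size = size;
    h->used = 0;
}

/* Returns a block of at least 'bytes' bytes, or NULL if the window is full. */
void *ext_heap_alloc(ext_heap_t *h, size_t bytes)
{
    size_t rounded = (bytes + 3u) & ~(size_t)3u;
    void *p;

    assert(h != NULL);
    if (rounded < bytes || rounded > h->size - h->used)
    {
        return NULL;
    }
    p = h->base + h->used;
    h->used += rounded;
    return p;
}
```

On the target this would be set up once, after PSRAM_Init() has returned 0, for example with ext_heap_init(&heap, (void *)0x90000000, 4u * 1024u * 1024u). The base address 0x90000000 is an assumption here, inferred from the 0x90000400 example in the first post.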