2020-08-26 07:43 AM
Below is my benchmark of multiply-accumulate performance on contiguous memory blocks on the STM32F7508-DK board for three different types of memory (on-chip SRAM, external SDRAM managed by the FMC, and QSPI-connected NOR flash):
The horizontal axes give the size of the contiguous memory region operated on, and the vertical axes give the throughput in millions of multiply-accumulates per second.
One observation that makes sense to me is that performance in all cases drops markedly once the contiguous memory block grows beyond 2^12 B = 4 KiB, which is the cache size.
The primary thing I don't understand is why the external SDRAM performance is so much worse in the small-size region. Can someone elaborate on this?
The board, SDRAM and NOR flash are all initialized by the STM32CubeF7's BSP functions and templates for the STM32F7508-DISCO board.
While the absolute numbers differ, the overall qualitative behavior is the same across optimization levels from -O0 to -O3.
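The benchmark code itself was not posted, so the following is only a minimal sketch of what such a measurement could look like. The function names `mac_block` and `mmacs_per_second`, the 16-bit element type, and the two-operand form are all assumptions made for illustration, not the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: multiply-accumulate over two contiguous blocks
 * of n 16-bit samples, accumulating into a 32-bit result. */
int32_t mac_block(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }
    return acc;
}

/* Convert a measured cycle count into millions of MACs per second.
 * core_hz is the CPU clock in Hz; returns 0 if no cycles elapsed. */
double mmacs_per_second(uint64_t macs, uint64_t cycles, double core_hz)
{
    if (cycles == 0) {
        return 0.0;
    }
    return ((double)macs / (double)cycles) * core_hz / 1e6;
}
```

On the target, the cycle count would typically come from a hardware cycle counter read before and after the call to `mac_block`, with the block placed in SRAM, SDRAM, or memory-mapped NOR in turn. For example, 1000 MACs in 2000 cycles at an assumed 216 MHz core clock works out to 108 million MACs per second.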
2020-08-26 06:25 PM
The diagram looks to show SDRAM is an order of magnitude slower than SRAM, if I'm reading it right.
More likely it is something in the MPU settings as to whether the memory is bufferable/cacheable.
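One way to see why the MPU matters here is the default ARMv7-M memory map, which assigns memory attributes by address range when no MPU region overrides them. The sketch below classifies an address under that default map. The function name `default_mem_class` and the enum are hypothetical, and the example addresses (SDRAM at 0xC0000000, memory-mapped QSPI at 0x90000000) are assumptions based on the usual STM32F7 layout rather than something stated in this thread.

```c
#include <stdint.h>

typedef enum {
    MEM_NORMAL_CACHEABLE, /* Normal memory: cacheable when caches are on */
    MEM_DEVICE,           /* Device memory: never cached */
    MEM_SYSTEM            /* Private peripheral bus and vendor space */
} mem_class_t;

/* Classify an address under the ARMv7-M default memory map.
 * Each default region spans 512 MB, so the top 3 bits select it. */
mem_class_t default_mem_class(uint32_t addr)
{
    switch (addr >> 29) {
    case 0: /* 0x00000000: Code */
    case 1: /* 0x20000000: SRAM */
    case 3: /* 0x60000000: External RAM */
    case 4: /* 0x80000000: External RAM */
        return MEM_NORMAL_CACHEABLE;
    case 2: /* 0x40000000: Peripheral */
    case 5: /* 0xA0000000: External device */
    case 6: /* 0xC0000000: External device */
        return MEM_DEVICE;
    default: /* 0xE0000000 and above */
        return MEM_SYSTEM;
    }
}
```

Under these assumptions, an address at 0x90000000 falls in a cacheable external RAM range, while 0xC0000000 falls in a device range that bypasses the data cache unless an MPU region says otherwise. That would be consistent with SDRAM looking slower than NOR flash in the benchmark.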
2020-08-26 06:08 PM
Do you have instruction and data cache enabled? It could be that the larger sizes produce more cache misses. Can you share the actual code being tested?
2020-08-27 12:45 AM
Yes, I have caches enabled. If I leave out enabling them, performance drops by an order of magnitude across the board, as expected.
It makes sense that larger sizes produce more cache misses, and I think that explains the drop in performance above a certain size that we see in my benchmark. That is expected, and I'm not confused about that. What I am confused about is that both before and after that drop in performance, SDRAM appears so much slower than even the NOR.
I will see if I can share the code of the benchmark. Thanks for your feedback.
2020-08-27 12:46 AM
Exactly. And an order of magnitude slower than the NOR flash. This is indeed what's puzzling me.
I will look into the MPU settings about bufferable/cacheable memory. Thank you for the hint!
2020-08-27 06:39 AM
I configured the MPU as follows:
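/* Describe the 8 MB SDRAM window as normal, cacheable memory. */
MPU_Region_InitTypeDef MPU_InitStruct;
/* The MPU is disabled while the region is reconfigured. */
HAL_MPU_Disable();
MPU_InitStruct.Enable = MPU_REGION_ENABLE;
MPU_InitStruct.BaseAddress = SDRAM_DEVICE_ADDR;
MPU_InitStruct.Size = MPU_REGION_SIZE_8MB;
MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
/* TEX level 1 with bufferable and cacheable set selects write-back, write-allocate caching. */
MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;
MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
MPU_InitStruct.Number = MPU_REGION_NUMBER0;
MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
MPU_InitStruct.SubRegionDisable = 0x00;
/* Instruction fetches from this region are disabled. */
MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
HAL_MPU_ConfigRegion(&MPU_InitStruct);
/* Privileged accesses outside this region keep the default memory map. */
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);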
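To make the configuration above more concrete, here is a sketch of how such settings map onto the ARMv7-M MPU region attribute and size register (RASR). The helper names `mpu_size_field` and `mpu_rasr` are hypothetical and not part of the HAL. The numeric values used in the example (full access = 3, TEX level 1 = 1, 8 MB size field = 22) are what I believe the HAL constants encode, so treat them as assumptions.

```c
#include <stdint.h>

/* SIZE field encoding: region size is 2^(SIZE+1) bytes.
 * Returns 0xFF if bytes is not a power of two of at least 32. */
uint32_t mpu_size_field(uint32_t bytes)
{
    if (bytes < 32u || (bytes & (bytes - 1u)) != 0u) {
        return 0xFFu;
    }
    uint32_t shift = 0u;
    while ((bytes >> shift) != 1u) {
        ++shift; /* shift ends up as log2(bytes) */
    }
    return shift - 1u;
}

/* Pack the attribute fields into an ARMv7-M MPU RASR value. */
uint32_t mpu_rasr(uint32_t xn, uint32_t ap, uint32_t tex, uint32_t s,
                  uint32_t c, uint32_t b, uint32_t srd, uint32_t size)
{
    return ((xn   & 0x1u)  << 28) | /* execute never */
           ((ap   & 0x7u)  << 24) | /* access permission */
           ((tex  & 0x7u)  << 19) | /* type extension */
           ((s    & 0x1u)  << 18) | /* shareable */
           ((c    & 0x1u)  << 17) | /* cacheable */
           ((b    & 0x1u)  << 16) | /* bufferable */
           ((srd  & 0xFFu) << 8)  | /* subregion disable mask */
           ((size & 0x1Fu) << 1)  | /* region size field */
           0x1u;                    /* region enable */
}
```

With those assumed values, `mpu_rasr(1, 3, 1, 0, 1, 1, 0, mpu_size_field(8u * 1024u * 1024u))` packs to 0x130B002D: TEX = 1, C = 1, B = 1, not shareable, which corresponds to normal memory with write-back, write-allocate caching.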
Now my SDRAM performance is in line with expectations:
Thanks a lot!