2020-08-26 07:43 AM
Below is my benchmark of multiply-accumulate performance on contiguous memory blocks on the STM32F7508-DK board for three different types of memory (on-chip SRAM, external SDRAM managed by the FMC, and QSPI-connected NOR flash):
The horizontal axes give the size of the contiguous memory region operated on, and the vertical axes give the throughput in millions of multiply-accumulates per second.
One observation that makes sense to me is that performance in all cases drops markedly once the contiguous memory block grows beyond 2^12 B = 4 KiB, which is the cache size.
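For readers who want to reproduce this kind of measurement, here is a minimal sketch of what such a benchmark kernel could look like. The thread does not show the code that was actually benchmarked, so the function names `mac_block` and `mmacs_per_second` are hypothetical. On the target, the cycle count would typically come from a hardware cycle counter (for example the DWT cycle counter on Cortex-M cores); here it is simply passed in as a parameter so the sketch stays portable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical benchmark kernel (not the original poster's code):
 * multiply every word of a contiguous block by a coefficient and
 * accumulate the products. */
int64_t mac_block(const int32_t *block, size_t n, int32_t coeff)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        acc += (int64_t)block[i] * coeff;
    }
    return acc;
}

/* Convert a measured cycle count into millions of MACs per second.
 * core_hz is the CPU clock in Hz; the caller supplies the cycle count
 * from whatever counter is available on the target. */
double mmacs_per_second(uint64_t macs, uint64_t cycles, double core_hz)
{
    if (cycles == 0 || core_hz <= 0.0) {
        return 0.0; /* avoid division by zero on a bad measurement */
    }
    double seconds = (double)cycles / core_hz;
    return ((double)macs / seconds) / 1e6;
}
```

Running `mac_block` over blocks of increasing size placed in each memory region, and feeding the measured cycles into `mmacs_per_second`, would give curves of the kind described above. Returning the accumulator keeps the compiler from optimizing the loop away at higher optimization levels.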
The primary thing I don't understand is why the external SDRAM performance is so much worse in the small-size region. Can someone elaborate on this?
The board, SDRAM and NOR flash are all initialized by the STM32CubeF7's BSP functions and templates for the STM32F7508-DISCO board.
While the absolute numbers differ, the overall qualitative behavior is the same across optimization levels from -O0 to -O3.
2020-08-26 06:08 PM
Do you have instruction and data cache enabled? It could be that the larger sizes produce more cache misses. Can you share the actual code being tested?
2020-08-26 06:25 PM
2020-08-27 12:45 AM
Yes, I have caches enabled. If I leave out enabling them, performance drops by an order of magnitude across the board, as expected.
It makes sense that larger sizes produce more cache misses, and that fits the drop in performance above a certain size in my benchmark. That part is expected, and I'm not confused about it. What I am confused about is that both before and after that drop, SDRAM appears so much slower than even the NOR flash.
I will see if I can share the code of the benchmark. Thanks for your feedback.
2020-08-27 12:46 AM
Exactly. And an order of magnitude slower than the NOR flash. This is indeed what's puzzling me.
I will look into the MPU settings about bufferable/cacheable memory. Thank you for the hint!
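A likely reason the MPU hint matters: in the ARMv7-M default memory map, memory attributes depend only on which 512 MB block an address falls in, and the 0xA0000000 to 0xDFFFFFFF range is Device memory, which is not cached. The sketch below encodes that default map. It assumes the usual STM32F7 mapping, with SDRAM at 0xC0000000 and memory-mapped QSPI flash at 0x90000000; neither address is stated in this thread. The names `mem_type_t`, `default_mem_type` and `is_cacheable_by_default` are illustrative only.

```c
#include <stdint.h>

/* Default ARMv7-M memory types, one per 512 MB block of the address space. */
typedef enum {
    MEM_NORMAL_WT,   /* Normal memory, write-through cacheable      */
    MEM_NORMAL_WBWA, /* Normal memory, write-back, write-allocate   */
    MEM_DEVICE,      /* Device memory, never cached                 */
    MEM_SYSTEM       /* Private peripheral bus / vendor system area */
} mem_type_t;

/* Illustrative helper: classify an address by the default memory map
 * that applies when no MPU region overrides it. */
mem_type_t default_mem_type(uint32_t addr)
{
    switch (addr >> 29) {              /* top 3 bits select the block  */
    case 0:  return MEM_NORMAL_WT;     /* 0x00000000: Code             */
    case 1:  return MEM_NORMAL_WBWA;   /* 0x20000000: on-chip SRAM     */
    case 2:  return MEM_DEVICE;        /* 0x40000000: peripherals      */
    case 3:  return MEM_NORMAL_WBWA;   /* 0x60000000: external RAM     */
    case 4:  return MEM_NORMAL_WT;     /* 0x80000000: external RAM     */
    case 5:                            /* 0xA0000000: external device  */
    case 6:  return MEM_DEVICE;        /* 0xC0000000: external device  */
    default: return MEM_SYSTEM;        /* 0xE0000000: system           */
    }
}

/* Nonzero if the data cache can hold data from this address by default. */
int is_cacheable_by_default(uint32_t addr)
{
    mem_type_t t = default_mem_type(addr);
    return t == MEM_NORMAL_WT || t == MEM_NORMAL_WBWA;
}
```

Under these assumptions, `is_cacheable_by_default(0xC0000000u)` is 0 while `is_cacheable_by_default(0x90000000u)` is 1, which would be consistent with SDRAM being slower than the NOR flash until an MPU region marks it as cacheable Normal memory.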
2020-08-27 06:39 AM
I configured the MPU as follows:
MPU_Region_InitTypeDef MPU_InitStruct;
/* The MPU must be disabled while a region is reconfigured. */
HAL_MPU_Disable();
MPU_InitStruct.Enable = MPU_REGION_ENABLE;
MPU_InitStruct.BaseAddress = SDRAM_DEVICE_ADDR;
MPU_InitStruct.Size = MPU_REGION_SIZE_8MB;
MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
/* TEX level 1 with bufferable + cacheable: Normal memory, write-back, read and write allocate. */
MPU_InitStruct.IsBufferable = MPU_ACCESS_BUFFERABLE;
MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;
MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;
MPU_InitStruct.Number = MPU_REGION_NUMBER0;
MPU_InitStruct.TypeExtField = MPU_TEX_LEVEL1;
MPU_InitStruct.SubRegionDisable = 0x00;
/* The SDRAM holds data only, so instruction fetches stay disabled. */
MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;
HAL_MPU_ConfigRegion(&MPU_InitStruct);
/* Keep the default memory map as background for privileged accesses. */
HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
Now my SDRAM performance is in line with expectations:
Thanks a lot!
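To make the MPU settings above easier to check in a debugger, the sketch below shows how the same fields pack into the ARMv7-M MPU region attribute and size register (RASR). The bit positions follow the architecture's RASR layout; the struct `mpu_attr_t` and the functions `mpu_rasr_encode` and `sdram_region_rasr` are hypothetical helpers written for illustration, not part of the HAL.

```c
#include <stdint.h>

/* Illustrative container for the RASR fields (not a HAL type). */
typedef struct {
    uint8_t xn;     /* 1 = instruction fetches disabled               */
    uint8_t ap;     /* access permission, 3 = full access             */
    uint8_t tex;    /* type extension field                           */
    uint8_t s;      /* shareable                                      */
    uint8_t c;      /* cacheable                                      */
    uint8_t b;      /* bufferable                                     */
    uint8_t srd;    /* subregion disable mask                         */
    uint8_t size;   /* region size in bytes is 2^(size + 1)           */
    uint8_t enable; /* region enable                                  */
} mpu_attr_t;

/* Pack the fields at their ARMv7-M RASR bit positions. */
uint32_t mpu_rasr_encode(const mpu_attr_t *a)
{
    return ((uint32_t)(a->xn     & 0x01u) << 28) |
           ((uint32_t)(a->ap     & 0x07u) << 24) |
           ((uint32_t)(a->tex    & 0x07u) << 19) |
           ((uint32_t)(a->s      & 0x01u) << 18) |
           ((uint32_t)(a->c      & 0x01u) << 17) |
           ((uint32_t)(a->b      & 0x01u) << 16) |
           ((uint32_t)(a->srd    & 0xFFu) <<  8) |
           ((uint32_t)(a->size   & 0x1Fu) <<  1) |
           ((uint32_t)(a->enable & 0x01u) <<  0);
}

/* The settings from the post: 8 MB (2^23 bytes, so size = 22), full access,
 * TEX = 1, cacheable, bufferable, not shareable, execute-never. */
uint32_t sdram_region_rasr(void)
{
    mpu_attr_t a = { 1, 3, 1, 0, 1, 1, 0x00, 22, 1 };
    return mpu_rasr_encode(&a);
}
```

With these inputs the encoded value is 0x130B002D, which is what one would expect to read back from the RASR register for region 0 after the configuration has been applied.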