From the user manual UM2526 and the FP-AI-VISION1 examples, I understand that X-CUBE-AI is optimized for 32-bit floating-point models (the model data type I am currently using) whose weights are read from external memory. I am using X-CUBE-AI version 6.0.0, so I want to read data from external SDRAM in order to reduce inference time; my understanding is that this is faster than reading from the internal flash memory. The model I trained uses Keras 2.3.0, which is not compatible with X-CUBE-AI 5.0.0.
Purpose: To reduce the inference time, I want to store the initial weight-and-bias table in the external Q-SPI flash memory and move it to SDRAM, as in the example.
Question 1: I expected inference to be faster with the weights placed in external memory when the data type is float, but in Table 12 the internal flash memory shows faster results. I don’t understand why. (Reference: UM2611 manual)
Q2: In general, what is the read latency of the internal flash memory, in clock cycles?
Q3: If I use external SDRAM for read operations, can I expect the inference time to decrease, and by how much? The clock cycles per MACC figure is higher than I expected.
• To verify the steps mentioned above, I would like to know how to program the external QSPI memory and SDRAM and how to read/write the required data.
• I would like to know the full sequence of steps: setting up the pins in CubeMX, generating code for the IDE, running it in the IDE, and verifying that the memory is properly programmed.
Q1: Internal Flash memory is still faster (250ms) because of the optimized data path of internal memories and higher operating frequencies. The external SDRAM is configured to operate at 100MHz.
Q2: The internal Flash read latency depends on the number of wait states, which can be found in RM0399, Table 16, "FLASH recommended number of wait states and programming delay", according to the embedded Flash memory AXI interface clock frequency (sys_ck) and the internal voltage range of the device (Vcore).
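To turn a wait-state setting from that table into a rough read latency, a small helper like the one below can be used. This is only a sketch: it assumes one flash access costs (wait states + 1) cycles of the flash interface clock, the function name is made up for illustration, and the wait-state and clock values you pass in must come from RM0399 for your actual configuration.

```c
#include <stdint.h>

/* Approximate time of one flash read access in nanoseconds.
 * Assumption: an access takes (wait_states + 1) cycles of the
 * flash interface (AXI) clock. Caches and prefetch are ignored. */
double flash_access_ns(uint32_t wait_states, double axi_clk_hz)
{
    return (double)(wait_states + 1u) * 1.0e9 / axi_clk_hz;
}
```

For example, with a purely illustrative setting of 2 wait states at a 200 MHz interface clock, this gives 3 cycles of 5 ns each, i.e. 15 ns per access.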
Q3: In general, you should favour internal SRAM versus external SRAM. The external memory latencies are much greater. Refer to where you can view the performance of various internal memories(DTCM, AXI-RAM, SRAM1, ...).
As for the read-only memory, assuming you are not able to fit your weights & bias inside the 2MB internal Flash, loading them from external Quad-SPI Flash to external SDRAM is a great option.
This corresponds to comparing the Q-SPI external Flash memory (283ms) and the External SDRAM (267ms) configurations in Table 12. But of course, values will differ from one model to another. I do not have more specific numbers to share.
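To relate these inference times to a cycles/MACC figure, the arithmetic can be sketched as below. The function names are hypothetical, the 400 MHz core clock is an assumption about the example configuration, and the MACC count is a placeholder: use the complexity that X-CUBE-AI reports for your own model.

```c
#include <stdint.h>

/* Core clock cycles elapsed during an inference lasting time_ms. */
uint64_t inference_cycles(uint32_t time_ms, uint32_t core_clk_hz)
{
    return (uint64_t)time_ms * core_clk_hz / 1000u;
}

/* Average core cycles per multiply-accumulate for a model of `macc`
 * operations (macc must be non-zero). */
double cycles_per_macc(uint32_t time_ms, uint32_t core_clk_hz, uint64_t macc)
{
    return (double)inference_cycles(time_ms, core_clk_hz) / (double)macc;
}
```

With a placeholder model of 10,000,000 MACC at 400 MHz, the 250 ms, 267 ms and 283 ms results work out to 10.0, 10.68 and 11.32 cycles/MACC respectively, so the gap between the three memory placements is on the order of 5 to 13 percent.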
Lastly, to learn how to configure the QSPI Flash memory, I would recommend you take a look at the BSP configuration. Pins can be found in the board user manual (Schematics) UM2411.
Thank you for your answer.
I understand using internal SRAM is faster than using external SDRAM for intermediate inference computing.
UM2611 (page 24) says: "In order to speed up the inference, it is possible for the user to set flag WEIGHT_EXEC_EXTRAM to 1 (as demonstrated for instance in configurations STM32H747I-DISCO_FoodReco_Float_Split_Sdram and STM32H747I-DISCO_FoodReco_Float_Ext_Sdram) so that the weight-and-bias table gets copied from the external Q-SPI Flash memory into the external SDRAM memory at program initialization."
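The copy step that the manual describes can be pictured roughly as follows. This is not the FP-AI-VISION1 code, just a minimal sketch with a hypothetical function name: on the target, src would point at the memory-mapped Q-SPI Flash region holding the table and dst at a buffer in SDRAM, both already initialized and with addresses taken from your linker script or BSP.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy the weight-and-bias table from its storage location into RAM
 * and check that the copy matches the source.
 * Returns 0 on success, -1 on invalid arguments or a mismatch. */
int copy_weights_to_ram(uint8_t *dst, const uint8_t *src, size_t size)
{
    if (dst == NULL || src == NULL || size == 0u) {
        return -1;
    }
    memcpy(dst, src, size);
    /* A mismatch here usually means the destination memory is not
     * working (for example, SDRAM not initialized or mis-timed). */
    return (memcmp(dst, src, size) == 0) ? 0 : -1;
}
```

The network would then be initialized with the SDRAM copy as its weights pointer. The read-back comparison costs some start-up time but catches a misconfigured SDRAM before the first inference.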
==> I understood this to mean that when the weight-and-bias table is copied to the external SDRAM memory at program initialization, reading from SDRAM is faster than reading from internal flash memory. But in your answer to my question #1, you said that internal Flash memory is still faster (250ms) because of the optimized data path of internal memories and higher operating frequencies, and that the external SDRAM is configured to operate at 100MHz.
==> The example uses a 400MHz core clock, so the AHB/AXI clock can operate at 200MHz at most. In these circumstances, is it right that using external SDRAM is faster than using internal flash memory?
Also, on page 33: "The kernels of the v5.1.2 STM32Cube.AI (X-CUBE-AI) library are optimized to deal efficiently with the following situations:
• Quantized model with activations located in internal SRAM
• Float model with weights located in external memory
As a consequence:
• When a quantized model is used: if activations are in external SDRAM, it is recommended to use version 5.0.0 of the STM32Cube.AI (X-CUBE-AI) library to obtain optimized inference time
• When a float model is used with weights located in internal Flash memory: it is recommended to use version 5.0.0 of the STM32Cube.AI (X-CUBE-AI) library to obtain optimized inference time"
In my situation, my model is a float32 model and I use version 6.0.0 of STM32Cube.AI, because my model was trained with TensorFlow 2.2, which is not compatible with version 5.0.0 of STM32Cube.AI. As I understand it, the version 6.0.0 library is optimized to deal efficiently with float models whose weights are located in external memory. I want to verify whether inference is faster when the weight-and-bias table is placed in the external SDRAM at program initialization than when it is located in the internal Flash memory.
Yes, from what I've seen, internal flash is faster than external SDRAM (independently of the Cube.AI version). If you can keep your weights & bias internally, generally speaking, you will get better inference times than copying them to external SDRAM. This may vary from model to model and needs to be confirmed via benchmarking for your specific model, but it's what I've seen so far.
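For that kind of benchmarking, a common approach on Cortex-M is to read a free-running 32-bit cycle counter (for instance DWT->CYCCNT, once enabled) before and after the inference call. The helpers below are a sketch of the arithmetic only, with made-up names: the counter read itself is left out so the code stays target-independent, and the core clock you pass in is whatever your project is configured for.

```c
#include <stdint.h>

/* Cycles elapsed between two reads of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so one counter overflow
 * between the two reads is handled correctly. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to milliseconds for a given core clock. */
double cycles_to_ms(uint32_t cycles, uint32_t core_clk_hz)
{
    return (double)cycles * 1000.0 / (double)core_clk_hz;
}
```

At 400 MHz a 32-bit counter wraps about every 10.7 s, so a single inference in the 250 ms range fits comfortably. Run the same measurement once with the weights in internal Flash and once with them copied to SDRAM, and compare.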
In UM2611, the speedup proposition is for the case where you do not have enough internal memory:
"There are use cases where the table of weights and biases does not fit into the internal Flash memory (the size of which being 2 Mbytes, shared between read-only data and code). In such situation, the user has the possibility to store the weight-and-bias table in the external Flash memory."
Can you please tell me how to use the external Flash and SDRAM? I have an STM32F769I-DISC1 board, which I'm using to run a MobileNet-v1 image classification model. Through the STM32Cube.AI plugin, I have selected the option to use external flash (QSPI) and SDRAM. I have also selected the option of 'splitting the weights using linker script'.
Yet when I run the code, it gets stuck at 'HardFault'. Can you please guide me as to what I might be doing wrong? What changes would I need to make in the linker script?
Normally X-CUBE-AI will generate all the initialization code necessary to access the external memory. The code is in app_x-cube-ai.c and is using the BSP of the board.
If you have your own code, you need to copy what is in the init phase into your own init code.
Hard Faults are more indicative of the memory interface not being initialized, or not being initialized early enough.
For C++ type constructors, the SDRAM needs to be available early, i.e. initialized in SystemInit(), not much later in main().
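One way to check that the external memory really is usable before the network touches it is a simple write/read-back test, run right after the memory init code. The sketch below uses a hypothetical function name and is destructive (it overwrites the region), so it belongs before the weights are copied in. On the target, base would be the start of the SDRAM region from your linker script.

```c
#include <stddef.h>
#include <stdint.h>

/* Write an address-dependent pattern over `words` 32-bit words starting
 * at `base`, then read it back. Returns the number of mismatching words
 * (0 means the region passed). */
size_t ram_selftest(volatile uint32_t *base, size_t words)
{
    size_t errors = 0;

    for (size_t i = 0; i < words; ++i) {
        base[i] = 0xA5A50000u ^ (uint32_t)i;
    }
    for (size_t i = 0; i < words; ++i) {
        if (base[i] != (0xA5A50000u ^ (uint32_t)i)) {
            ++errors;
        }
    }
    return errors;
}
```

If this faults or reports errors, the problem is in the memory interface setup rather than in the linker script split. Note that with the data cache enabled the read-back may be served from cache, so it is safer to run the test before enabling the cache or over a region larger than the cache.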