2025-01-09 01:58 AM
Hello,
We are trying to make use of the ART accelerator of the STM32F7.
As a start we placed only a single part of the code in the address range starting at 0x00200000.
We also enabled the ART accelerator and code prefetch in CubeMX.
However, when we start the code with the debugger, it does not even reach the main(void) function. Stepping from the reset entry, the call to SystemInit works, but the branch to __main immediately results in a hard fault.
This is strange because this part of the code still resides at the standard flash address of 0x08000000. Single-stepping was not possible even in the disassembly window. (Normally you can see the scatter loader working here …)
The fault no longer occurs once we remove the code at 0x00200000 from the scatter file.
So just a simple question:
Could you please provide an example for any STM32 CPU that shows how to use the ART accelerator?
Thank you very much for your help
Andreas
2025-01-09 03:25 AM
Hello,
The issue is not very clear, but you can refer to the X-CUBE-32F7PERF package that comes with the application note AN4667 "STM32F7 Series system architecture and performance".
See the projects: 1 FlashITCM-RAM_DTCM and FlashITCM-RAM_SRAM1. The code executes from the FlashITCM, whose address range starts at 0x00200000.
The ART enable is done in HAL_Init():
HAL_StatusTypeDef HAL_Init(void)
{
  /* Configure Flash prefetch and Instruction cache through ART accelerator */
#if (ART_ACCLERATOR_ENABLE != 0)
  __HAL_FLASH_ART_ENABLE();
#endif /* ART_ACCLERATOR_ENABLE */

  /* Set Interrupt Group Priority */
  HAL_NVIC_SetPriorityGrouping(NVIC_PRIORITYGROUP_4);

  /* Use systick as time base source and configure 1ms tick (default clock after Reset is HSI) */
  HAL_InitTick(TICK_INT_PRIORITY);

  /* Init the low level hardware */
  HAL_MspInit();

  /* Return function status */
  return HAL_OK;
}
Note that this is an old application note and STM32CubeIDE was not available at that time; only IAR, KEIL and System Workbench (Eclipse-based) were used.
If you are executing from the FlashITCM, you need to enable the ART to increase performance. Enabling or disabling the ART has no impact on execution from the FlashAXI at 0x08000000.
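As a rough illustration of what enabling the ART amounts to at register level, here is a minimal sketch that resets the ART cache and then switches on ART and prefetch in FLASH_ACR. The bit positions (PRFTEN = bit 8, ARTEN = bit 9, ARTRST = bit 11) are my reading of the STM32F7 reference manual, and the macro names and the helper art_reset_and_enable() are hypothetical, not part of the HAL. It takes a pointer so it can be compiled and checked on a host; please verify the masks against your device header before using anything like it on target.

```c
#include <assert.h>
#include <stdint.h>

/* FLASH_ACR bit masks (assumed from the STM32F7 reference manual;
   local names, not the CMSIS ones). */
#define ACR_PRFTEN (1u << 8)   /* prefetch enable */
#define ACR_ARTEN  (1u << 9)   /* ART accelerator enable */
#define ACR_ARTRST (1u << 11)  /* ART accelerator reset */

/* Hypothetical helper: reset the ART cache, then enable ART and prefetch.
   On target 'acr' would point at FLASH->ACR; on a host it can point at
   an ordinary variable. Other bits (e.g. LATENCY) are left untouched. */
static void art_reset_and_enable(volatile uint32_t *acr)
{
    assert(acr != 0);

    *acr &= ~ACR_ARTEN;              /* ART is switched off before the reset */
    *acr |= ACR_ARTRST;              /* assert the ART reset */
    *acr &= ~ACR_ARTRST;             /* release the reset */
    *acr |= ACR_ARTEN | ACR_PRFTEN;  /* enable ART and prefetch */
}
```

On target the call would presumably be art_reset_and_enable(&FLASH->ACR), placed before any code runs from the 0x00200000 range.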
2025-01-09 07:14 AM
Hello SofLit,
Thank you for the example X-CUBE-32F7PERF
I saw in the example that the scatter file locates ALL of the flash code at addresses starting from 0x00200000, which is the TCM flash.
And the loader loaded it to address 0x00200000. But how is this possible? According to the reference manual, flash writing is only possible in the 0x08000000 domain.
#### Is the loader internally using 0x08000000 and just pretending to use 0x00200000? ###
Anyhow in this example the debugger reaches _main().
I tried the same scatter file in my project, but unfortunately, even with ART + prefetch enabled, the TCM flash is slower than FlashAXI with the instruction cache. So I do not understand the purpose of ART at all...
Thank you very much for your kind answer
Andreas
2025-01-09 07:21 AM
There are two addresses for the Flash: 0x00200000 over ITCM and 0x08000000 over AXI.
Please read AN4667 "STM32F7 Series system architecture and performance", especially section 1.5.1 Embedded Flash memory.
It's the same physical Flash, but it can be accessed over two address ranges.
@andywild2 wrote:
I tried the same scatterfile in my project, but unfortunately even with ART + Prefetch Enabled the TCM-Flash is slower than FlashAXI with Instruction Cache. So I do not understand the purpose of ART at all...
Thank you very much for your kind answer
Andreas
Again, please read AN4667: it describes the product architecture and provides some performance results from X-CUBE-32F7PERF.
2025-01-09 07:44 AM
Hello,
I am well aware of the two addresses for flash, 0x00200000 (TCM) and 0x08000000 (AXI).
In AN4667, on page 11 under "1.5.1 Embedded Flash memory", it says:
"A 64-bit ITCM interface:
It connects the embedded Flash memory to the Cortex-M7 via the ITCM bus (path1 in
Figure 2) and is used for the program execution and data read access for constants.
The write access to the Flash memory is not permitted via this bus."
So I repeat my question: how can the loader program the flash at addresses starting from 0x00200000??
Is the loader internally using 0x08000000 and just pretending to use 0x00200000?
Thanks a lot for explaining this to me
Andreas
2025-01-09 07:48 AM
I don't know how the loader was built.
Each loader has its own address. If you use the ITCM flash, you need to use the ITCM flash loader. If you use the AXI flash, you need to use the AXI flash loader.
2025-01-09 08:22 AM
ITCM does not accept writes to flash (see AN4667).
How can a loader write to flash via ITCM, then?
2025-01-09 10:32 AM
Which loader?
The 0x0419.stldr appears to deal only with the 0x08000000 mapping/shadow of the memory.
The CM7 core is not caching the TCM, as it's already tightly coupled. The ART is generally there to manage the very wide flash lines, which have load times of perhaps 35 ns: it deals with that on line misses and funnels subsequent words via the prefetch within the current cycle, i.e. faster than single-cycle or zero-wait-state.
For KEIL I'd have to look at the .FLM in the F7 pack.
I'd assume at some level the "loader" or the application calling the "loader" has some awareness as to which memory space to write the data, and maps accordingly.
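To make the "maps accordingly" idea concrete, here is a minimal sketch of how a loader could translate an ITCM alias address into the AXI address that flash programming actually needs. This is an assumption about what such a loader might do, not taken from any .FLM or .stldr source. The macro names and itcm_to_axi() are hypothetical, and the 2 MB size is assumed from the STM32F7xTCM_2048 loader name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FLASH_ITCM_ALIAS 0x00200000u  /* flash as seen over the ITCM bus */
#define FLASH_AXI_ALIAS  0x08000000u  /* same flash as seen over AXI */
#define FLASH_BYTES      0x00200000u  /* assumed 2 MB device */

/* Hypothetical helper: convert an address in the ITCM alias window to the
   matching AXI address. Returns false if 'addr' is outside the window. */
static bool itcm_to_axi(uint32_t addr, uint32_t *axi)
{
    assert(axi != 0);

    if (addr < FLASH_ITCM_ALIAS || addr - FLASH_ITCM_ALIAS >= FLASH_BYTES) {
        return false;
    }
    /* Same offset into the physical flash, different base address. */
    *axi = addr - FLASH_ITCM_ALIAS + FLASH_AXI_ALIAS;
    return true;
}
```

With this mapping, an image linked at 0x00200000 would be programmed through 0x08000000 and then read back and executed through the ITCM alias, since both ranges address the same physical flash.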
2025-01-09 01:00 PM
Hi Clive,
thank you for your response!
It is the Arm\Packs\Keil\STM32F7xx_DFP\2.16.0\CMSIS\Flash\STM32F7xTCM_2048.FLM loader.
And you are right, there must be some internal mapping in this loader, because writing directly to 0x00200000 is not possible.
I know the CM7 is not caching the TCM. I thought the 256-bit parallel fetch of the ART would give good performance, but unfortunately it does not. My benchmark is this project, which uses the JPEG codec:
https://github.com/STMicroelectronics/STM32CubeF7/tree/master/Projects/STM32F769I_EVAL/Examples/JPEG/JPEG_DecodingFromFLASH_DMA
So decoding the JPEG frames of emWin *.emf videos should use only a very limited amount of flash, which both the cache and the ART should easily cover.
But in fact, when mapping the code to 0x08000000 the time per frame is 39 ms; when mapping the SAME code to 0x00200000 it is 52 ms.
So in conclusion, the AXI interface with cache is much faster, provided the code fits in the cache without cache misses.
So I will not use the ART at all. This also has the advantage of allowing firmware updates, since writing to flash is then possible.
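For scale, the two frame times quoted above (39 ms over AXI, 52 ms over ITCM) work out to roughly 33 % more time per frame, or about 25.6 versus 19.2 frames per second. A small sketch of that arithmetic follows; the helper names are hypothetical and only the two measured times come from this thread.

```c
#include <assert.h>

/* Extra time of 'other_ms' relative to 'base_ms', in percent. */
static double slowdown_percent(double base_ms, double other_ms)
{
    assert(base_ms > 0.0);
    return (other_ms - base_ms) / base_ms * 100.0;
}

/* Frame rate corresponding to a given time per frame in milliseconds. */
static double frames_per_second(double frame_ms)
{
    assert(frame_ms > 0.0);
    return 1000.0 / frame_ms;
}
```

Calling slowdown_percent(39.0, 52.0) gives the relative penalty of the ITCM mapping in this particular benchmark; other workloads may of course behave differently.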
Thank you very much for your comments
Andreas