fp issue: vstm/vldm all 32 fp registers at once

picguy2
Associate II
Posted on August 12, 2016 at 01:09

A few details: both code and data caches are enabled. SysClk is 216 MHz. A timer ISR schedules a task every millisecond. SCB.SHCSR=1<<18 enables the usage fault ISR, which I use as the fp trap handler. All ISRs mentioned are set to the absolute minimum priority, as is the SVC used for task entry to the task scheduler.

In my own RTOS, selected tasks have access to all floating point registers. I elected not to provide fp access to ISRs. Every time any task is scheduled, floating point access is disabled:

mov32 r0,baseSCB+CPACR
movs r1,#0
str r1,[r0] ;disable fp
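For readers less familiar with CPACR, here is a minimal C sketch of the bit manipulation involved. It assumes the standard Cortex-M layout (CP10 in bits [21:20], CP11 in bits [23:22]); the function names are hypothetical and not part of the poster's code. Unlike the fragment above, which writes zero to the whole register, the sketch only touches the two FPU fields.

```c
#include <stdint.h>

/* Cortex-M CPACR: CP10 occupies bits [21:20] and CP11 bits [23:22].
   Both fields at 0b11 grant full FPU access; both at 0b00 deny it. */
#define CPACR_FPU_MASK (0xFu << 20)

/* Hypothetical helper: CPACR value with FPU access denied, other bits untouched. */
uint32_t cpacr_fp_disabled(uint32_t cpacr)
{
    return cpacr & ~CPACR_FPU_MASK;
}

/* Hypothetical helper: CPACR value with full FPU access granted. */
uint32_t cpacr_fp_enabled(uint32_t cpacr)
{
    return cpacr | CPACR_FPU_MASK;
}

/* Hypothetical helper: nonzero when both CP10 and CP11 allow full access. */
int cpacr_fp_is_enabled(uint32_t cpacr)
{
    return (cpacr & CPACR_FPU_MASK) == CPACR_FPU_MASK;
}
```

On a target using CMSIS this would typically be applied as SCB->CPACR = cpacr_fp_disabled(SCB->CPACR); followed by __DSB() and __ISB() so the change takes effect before the next instruction.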

Then when a task attempts to use floating point the code traps. After a sanity check fp access is enabled. The trap handler checks if the current fp registers “belong” to the task. If not, the existing fp context is copied to the proper task’s fp save area:

vmrs r0,FPSCR
stm r5!,{r0}
vstm r5!,{s0-s31}

Then the trap handler restores the saved fp registers for the task that initially caused the trap:

ldm r5!,{r0}
vldm r5!,{s0-s31}
vmsr FPSCR,r0
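The ownership logic described above can be modelled on a host machine. The following is only a sketch: fpu_model stands in for the real register file, and all type and function names are hypothetical rather than taken from the poster's RTOS. The memcpy calls play the role of the vstm/vldm instructions.

```c
#include <stdint.h>
#include <string.h>

#define FP_REGS 32

/* Per-task fp save area: s0-s31 plus FPSCR (hypothetical layout). */
typedef struct {
    uint32_t s[FP_REGS];
    uint32_t fpscr;
} fp_context;

/* Stand-in for the hardware registers, plus a record of whose values are live. */
typedef struct {
    uint32_t s[FP_REGS];
    uint32_t fpscr;
    fp_context *owner;   /* task whose context is in the registers, or NULL */
} fpu_model;

/* Model of the trap taken when `current` first touches the FPU.
   Returns 1 if a context swap was performed, 0 if the registers
   already belonged to `current`. */
int fp_trap(fpu_model *fpu, fp_context *current)
{
    if (fpu->owner == current)
        return 0;

    if (fpu->owner != NULL) {
        /* Save the live registers into the previous owner's area (vstm + FPSCR). */
        memcpy(fpu->owner->s, fpu->s, sizeof fpu->s);
        fpu->owner->fpscr = fpu->fpscr;
    }

    /* Load the trapping task's saved registers (vldm + FPSCR). */
    memcpy(fpu->s, current->s, sizeof fpu->s);
    fpu->fpscr = current->fpscr;
    fpu->owner = current;
    return 1;
}
```

The point of the design is that tasks which never touch the FPU never pay for a save or restore; the swap cost is only incurred when ownership actually changes.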

This code works on STM32F4 parts. It fails on the STM32F746 DISCOVERY board. The crash entered the fp trap handler with an error flag indicating an attempt to execute an instruction that makes illegal use of the EPSR. Possibly a bad EXC_RETURN? I suspected that the 32-word vstm/vldm might be a problem. For debug I changed the vstm/vldm to store/load only s0-s3. That worked in that it did not crash. But my RTOS needs to save and restore all fp registers. I’m software so this has to be a hardware problem. Is this a known deficiency? (It’s not in the latest STM32F74xxx STM32F75xxx Errata sheet. Ditto ARM’s errata.) Is there a workaround?

#stm32f7-floating-point

2 REPLIES

Posted on August 12, 2016 at 01:26

How does it encode? i.e. s0-s3 vs. s0-s31 (d0-d15 perhaps?)

What if you ensure r5 is 64-bit aligned? Perhaps by putting the r0 on the back end instead of the front.

picguy2
Associate II
Posted on August 14, 2016 at 18:56

Mea culpa. There were several little problems, all seemingly intent on conspiring with each other to make debugging any one of them difficult.

clive1 asked how does it encode. I assume “it” is vstm r5!,{s0-s31}. Unlike ldm/stm and their variants, which use a mask for the individual registers, vldm/vstm encode the first register and the number of registers. Assemble some code and look at a listing.
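To make the "first register plus count" point concrete, here is a small C sketch that builds the Thumb-2 encoding of VSTMIA Rn!,{Sx-Sy}. It follows my reading of encoding T2 in the ARMv7-M Architecture Reference Manual; the function name is hypothetical, and the result is worth checking against an assembler listing.

```c
#include <assert.h>
#include <stdint.h>

/* Encode  VSTMIA Rn!, {S<first> .. S<first+count-1>}  (Thumb-2, single precision).
   The first halfword is returned in bits [31:16], the second in bits [15:0]. */
uint32_t encode_vstmia_wb_single(unsigned rn, unsigned first, unsigned count)
{
    assert(rn < 15);                           /* PC is not a valid base with writeback */
    assert(count >= 1 && first + count <= 32); /* register range must stay within s0-s31 */

    unsigned vd = first >> 1;   /* upper four bits of the first register number */
    unsigned d  = first & 1u;   /* lowest bit of the first register number */

    /* Halfword 1: 1110 110 P U D W 0 Rn, with P=0, U=1, W=1
       (increment after, base register writeback). */
    uint32_t hw1 = 0xEC00u | (1u << 7) | (d << 6) | (1u << 5) | rn;

    /* Halfword 2: Vd 1010 imm8, where imm8 is the number of S registers. */
    uint32_t hw2 = ((uint32_t)vd << 12) | 0x0A00u | count;

    return (hw1 << 16) | hw2;
}
```

Under these assumptions vstm r5!,{s0-s31} comes out as ECA5 0A20 and vstm r5!,{s0-s3} as ECA5 0A04, so the two differ only in the count field.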

My troubles

* I failed to set FPCCR to zero. That made for serious confusion with stack pointers.

* When I wrote both 3-line save/restore fragments I was thinking stack not ldm/stm which both act like *r5++. Code now performs vldm/vstm first (on a double word boundary) followed by a simple ldr/str for the FPSCR.

* Had to recast my ‘M4 task table/stack area setup because the ‘M7 deprecated bit banding. That recast actually improved a few things but messed up when it came to the task table item that points to the floating point save restore area. (Gotta check my ‘M4 code.)

* I also had to remove the multiple pieces of debug code for my timer3 running fast problem. I really should have made a new project… C’est la vie.
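The change in save-area ordering from the second item can be illustrated with two C structs. This is a sketch with hypothetical type names; it only shows where the register block lands relative to a doubleword-aligned base and makes no claim that alignment alone caused the crash.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout implied by the original fragments: FPSCR stored first, then s0-s31. */
typedef struct {
    uint32_t fpscr;
    uint32_t s[32];
} fp_area_fpscr_first;

/* Layout described in the fix: s0-s31 first, FPSCR stored last. */
typedef struct {
    uint32_t s[32];
    uint32_t fpscr;
} fp_area_regs_first;

/* Offset of the register block from the start of each save area. */
size_t regs_offset_fpscr_first(void) { return offsetof(fp_area_fpscr_first, s); }
size_t regs_offset_regs_first(void)  { return offsetof(fp_area_regs_first, s); }

/* Nonzero when an offset from a doubleword-aligned base is itself doubleword aligned. */
int regs_doubleword_aligned(size_t offset) { return (offset % 8u) == 0; }
```

With FPSCR first, the 32-register block starts 4 bytes into the area and so sits on a word boundary only; with the registers first, it starts at offset 0 and inherits the alignment of the area itself.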

Using the M7 I timed the floating point “save other task’s registers” then “restore my task’s registers”. 106 clocks = 491 nanoseconds @ 216 MHz. Both caches were enabled. It’s likely that the register store writes remained in the data cache which may have helped the restore side of things. ISR overhead + my overhead + 33 RAM writes and 33 reads.

The exact same code (with a #define removing the M7-specific parts) took 140 clocks on an M4 running at 168 MHz. Still that’s only 619 nanoseconds for the fp register swap.
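The clock-to-time conversions are easy to check with a one-line helper. This is a sketch with a hypothetical function name. It reproduces the M7 figure (106 clocks at 216 MHz rounds to 491 ns), but 140 clocks at 168 MHz works out to about 833 ns, while 619 ns corresponds to about 104 clocks at that frequency, so one of the two M4 numbers quoted above may be a typo.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given core clock,
   rounded to the nearest nanosecond. clock_hz must be nonzero. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    uint64_t ns = ((uint64_t)cycles * 1000000000u + clock_hz / 2u) / clock_hz;
    return (uint32_t)ns;
}
```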

Along the way I recast the three SCB_[CleanInvalidate, Clean, Invalidate]_DCache_by_Addr functions into a clean assembly language function. I plan to use that code for DMA. BTW, the ARM provided functions will always miss the last block if the address%32!=0 and length%32==0.
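The missed-last-block case can be reproduced on a host with a model of the loop. The sketch below is not the CMSIS source; it assumes a loop of the shape described (start at the given address, step one 32-byte line at a time until the length is used up), and the function names are hypothetical.

```c
#include <stdint.h>

#define DCACHE_LINE 32u   /* Cortex-M7 data cache line size in bytes */

/* Base address of the last line operated on by a loop that starts at the
   unaligned address and steps one line per iteration. len must be nonzero. */
uint32_t last_line_naive(uint32_t addr, uint32_t len)
{
    uint32_t last = addr & ~(DCACHE_LINE - 1u);
    int32_t remaining = (int32_t)len;
    while (remaining > 0) {
        last = addr & ~(DCACHE_LINE - 1u);  /* the operation acts on the line containing addr */
        addr += DCACHE_LINE;
        remaining -= (int32_t)DCACHE_LINE;
    }
    return last;
}

/* Base address of the last line the byte range [addr, addr+len) touches. */
uint32_t last_line_required(uint32_t addr, uint32_t len)
{
    return (addr + len - 1u) & ~(DCACHE_LINE - 1u);
}

/* Same loop, but with the start rounded down to a line boundary and the
   length grown by the bytes skipped, so the final partial line is included. */
uint32_t last_line_fixed(uint32_t addr, uint32_t len)
{
    uint32_t offset = addr & (DCACHE_LINE - 1u);
    return last_line_naive(addr - offset, len + offset);
}
```

For example, with a buffer at 0x20000010 of length 64, the naive loop stops at the line at 0x20000020, while the buffer's last bytes live in the line at 0x20000040; the adjusted version reaches it.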