
X-CUBE-SBSFU : Bootloader is really slow (6s) (STM32F429)

Peeters.Bram
Senior

I ported the SBSFU to an STM32F429 and everything works, but it takes almost 6 s (!) before the bootloader exits and jumps to the main application.

My application is about 1.3 MB (the entire flash is 2 MB), so it is not exactly small, but a 6 s boot time is just unacceptable for my application.

After analysing where this time is lost, I see that SFU_IMG_VerifyActiveImg is called twice and takes 1.4 s to run each time, and SFU_IMG_VerifyActiveSlot is called once, which takes a whopping 2.8 s. Are these expected times? Why call SFU_IMG_VerifyActiveImg twice (not that it would help much, timing-wise, to do it only once)?

Apparently the sha256 calculation which is at the core of these is really slow.

Any suggestions for speeding this up? At the moment I am thinking of either replacing this with a simple CRC32 (if I recall correctly, I managed to do that in 500 ms) or just skipping these verification steps entirely.

10 REPLIES
AAgar.2
Associate III

Which clock are you using? What's the clock speed?

Jocelyn RICARD
ST Employee

Hello Peeter,

The SBSFU is authenticating the image in 2 steps:

1) First it is checking the signature of the header file. Checking signature involves

  • Computing the hash (SHA256) of the header (very fast as this is small chunk of data)
  • Checking the signature using ECDSA: this signature check is quite a complex algorithm and may take around 200 to 300 ms. I have to check this.

2) Once this signature check is done, the hash of the full firmware is computed and compared with the one provided in the header (which was authenticated in the first step). Here you are right: the size of the application makes this computation quite long.

Looking at the crypto lib performance (for GCC), I can see 4191 cycles to handle a 16-byte block. For 1.3 MB, this leads to around 1.9 seconds at 180 MHz.

The speed optimised version is not giving significant improvement here (4046 cycles)

So the complete crypto operations should take a bit more than 2.2 seconds (theoretically)
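As a rough sketch of the arithmetic behind this estimate: the hash time is approximately (number of blocks) × (cycles per block) ÷ (core clock). The function name and parameters below are hypothetical and not part of SBSFU; the cycle count and block size are simply the figures quoted above, and 1.3 MB is taken as 1.3 × 1024 × 1024 bytes.

```c
#include <stdint.h>

/* Estimated hash time in milliseconds for image_bytes of data, given a
 * per-block cycle cost and the core clock in Hz. Illustrative only:
 * it ignores flash wait states, RAM copies and any other overhead. */
uint32_t hash_time_ms(uint64_t image_bytes, uint32_t block_bytes,
                      uint32_t cycles_per_block, uint32_t hclk_hz)
{
    /* Round up so a partial trailing block still costs a full block. */
    uint64_t blocks = (image_bytes + block_bytes - 1u) / block_bytes;
    uint64_t cycles = blocks * cycles_per_block;
    return (uint32_t)((cycles * 1000u) / hclk_hz);
}
```

Calling hash_time_ms(1363149, 16, 4191, 180000000) gives just under 2 s, in line with the estimate above. Note that SHA-256 itself works on 64-byte blocks, so if the 4191-cycle figure were per 64-byte block, the result would be four times lower.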

The difference you obtain with the 6 s comes from:

1) the full authentication step1 + step2 is performed twice: This was added to increase robustness to attack. You can remove one.

2) the SBSFU is authenticating the header once more when you have the traces activated

The rest may be related to secureboot overhead. This needs to be checked.

I will make some tests next week to confirm this.

Best regards

Jocelyn

Peeters.Bram
Senior

Hi,

Thanks for the replies !

I am not running at 180 MHz. It is a battery-powered handheld device, so we have to be conservative with power consumption, and we also have an external SRAM which puts limits on the max clock speed.

My HCLK was supposed to be 48 MHz, but I just noticed I forgot to modify the define for the HSE input clock (it was still at 8 MHz from the code I started from but should be 16 MHz for our board).

After fixing that the timing of VerifyActiveImage remains more or less unchanged, but VerifyActiveSlot is now 1.2s.

Not sure how this is possible, but I don't know exactly how the wrong HSE value rippled through. My timings come from looking at the tick counter, so they might have been off too.

I have attached an image showing where the time is mainly spent with the current timings (1 tick = 1 ms).

I don't think I see much of the ECDSA time you are referring to. I only see significant time spent in the 2 functions I just mentioned: one of them is a SHA256 calculation over the firmware (VerifyActiveImage) and the other checks the region after the firmware for spurious code (VerifyActiveSlot). This is if I understand correctly what these functions are doing.

I think I will make a first attempt by removing one of the VerifyActiveImage operations and skipping VerifyActiveSlot entirely (I like my image intact, but I don't really care if there is some extra code behind it). That should put me in the 1.4 s range.

And then I will experiment with the clock speed. Since I don't need the external SRAM in the bootloader and it only runs for a short time, I could probably bump the bootloader clock speed to 180 MHz (and fall back to 48 MHz later in the main application), which should reduce it to <500 ms if everything scales with clock speed.

But then I really hope your calculations are wrong, because according to them it should be around 1.9 s :/.

So if that does not turn out to be enough to get <500 ms, I will try a CRC32 over the firmware image instead of SHA256, but that will take some more tinkering I assume.

Note: it would for sure be nice if the SBSFU code supported a few more options there out of the box, so you could make the trade-off between speed and integrity more easily. For a lot of products, cold boot startup time requirements are already pretty hard to meet, and a bootloader adding something in the range of seconds is not helping :).

If I am overlooking things let me know =)

Regards

Bram

0693W000001r0NPQAY.png

Edit: there is something really fishy going on with my tick measurements. Now it says the tick count is around 1400 by the time the application launches, but if I time it with a stopwatch it is more like 7 seconds. As a check, I added a 10 s delay and printed something before/after on the UART: that delay is indeed 10 s, with the tick counter increased to 10000, so basic tick timing is OK. Maybe ticks are not counted (IRQ disabled) during some parts?

Peeters.Bram
Senior

Ok so I looked with the scope and some prints

SFU_IMG_VerifyActiveImgMetadata: Start

[383ms]

SFU_IMG_VerifyActiveImg: Start

[4343ms]

SFU_IMG_HasValidActiveFirmware: Done

[383ms]

SFU_IMG_VerifyActiveImgMetadata: Start

[387ms]

SFU_IMG_VerifyActiveImgMetadata: Start

[388ms]

SFU_IMG_ControlActiveImgTag: Start

Seems like ticks (interrupts?) are indeed disabled for long periods during these calculations
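If the tick interrupt really is masked during these steps, one option is to time them with a free-running hardware counter instead. On a Cortex-M4 the DWT cycle counter keeps counting regardless of interrupt masking. The sketch below shows only the portable arithmetic (helper names are hypothetical); the CMSIS register accesses that would supply the samples on the target are listed in the comment and are not compiled here.

```c
#include <stdint.h>

/* Cycles elapsed between two 32-bit counter samples. Unsigned subtraction
 * stays correct across a single wrap of the counter. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to microseconds for a given core clock in Hz. */
uint32_t cycles_to_us(uint32_t cycles, uint32_t hclk_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / hclk_hz);
}

/* On the target (CMSIS names for Cortex-M4), the samples would come from:
 *
 *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable DWT
 *   DWT->CYCCNT = 0;
 *   DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start counting
 *   uint32_t t0 = DWT->CYCCNT;
 *   ... code under test ...
 *   uint32_t t1 = DWT->CYCCNT;
 */
```

The counter is 32 bits wide, so at 48 MHz it wraps after roughly 89 s, which is long enough for a single boot but worth keeping in mind for longer measurements.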

And I will get nowhere near 500ms using the SBSFU mechanisms.

Generic SHA256, doing 1MB on a STM32L4R9 at 120 MHz took 921 ms

ECDSA SECP192R2 signature verification 56 ms

Jocelyn RICARD
ST Employee

Hello Peeter,

I ported the SBSFU from F413 to F429.

I set the clocking at 180MHz.

I made an application that is 1316 KB and slot is 1984 KB

Here is the result I obtain:

Hash of the firmware with SHA256 481ms

ECDSA check : 131ms

Check non used space : 346 ms

First, the check of unused space really needs optimisation: this is done byte by byte. I need to check why. Making a quick check of 0xFF by dword, I get 13ms...
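For illustration, a word-wise check of an erased region could look like the sketch below. This is not the SBSFU code: the function name is hypothetical, "dword" is assumed to mean a 32-bit word, and the region is assumed to be 4-byte aligned with a length that is a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if every 32-bit word in [start, start + words) reads as erased
 * flash (0xFFFFFFFF), 0 otherwise. start must be 4-byte aligned. */
int region_is_erased(const uint32_t *start, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        if (start[i] != 0xFFFFFFFFu) {
            return 0;   /* found programmed data in the unused area */
        }
    }
    return 1;
}
```

Compared with a byte-by-byte loop, this does a quarter of the loads and compares for the same region.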

The rest is not really optimised, but this is a known issue that should be fixed in the next release.

The ECDSA check is done 4 times (1 is done in the trace to give the version so is removed if you deactivate debug).

3 others are :

1) Check user fw status

2) verify user fw signature

3) execute user fw

Also the hash is done twice

1) Check user fw status

2) Verify user fw signature

Now, with just the optimisation of the unused-space check, I get a total of 1.6 s for startup.

It will be difficult to reach 500 ms anyway. I guess this could be possible, but I am not 100% sure.

Replacing the hash with a CRC is a bad idea.

It is easy to forge a firmware that has a specific CRC. So you would lose all the security provided by SBSFU!
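To illustrate how little effort such a forgery takes, here is a minimal sketch. It assumes a word-wise, MSB-first CRC-32 with polynomial 0x04C11DB7 and no final XOR; that variant and all function names are chosen for illustration only, but the same idea applies to other CRC-32 variants. Because every CRC step is invertible, one extra 32-bit word is enough to steer the CRC of arbitrary data to any chosen value.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC32_POLY 0x04C11DB7u

/* Feed one 32-bit word through an MSB-first CRC-32 (no reflection). */
uint32_t crc32_word(uint32_t crc, uint32_t word)
{
    crc ^= word;
    for (int i = 0; i < 32; i++) {
        crc = (crc & 0x80000000u) ? (crc << 1) ^ CRC32_POLY : (crc << 1);
    }
    return crc;
}

/* CRC over an array of words, starting from 0xFFFFFFFF. */
uint32_t crc32_words(const uint32_t *data, size_t words)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < words; i++) {
        crc = crc32_word(crc, data[i]);
    }
    return crc;
}

/* Undo the 32 shift steps: returns x such that crc32_word(0, x) == crc.
 * After a forward step, bit 0 is set exactly when the polynomial was XORed. */
uint32_t crc32_unshift(uint32_t crc)
{
    for (int i = 0; i < 32; i++) {
        crc = (crc & 1u) ? ((crc ^ CRC32_POLY) >> 1) | 0x80000000u
                         : (crc >> 1);
    }
    return crc;
}

/* Word to append to data whose CRC is `current` so the total equals `target`. */
uint32_t crc32_patch_word(uint32_t current, uint32_t target)
{
    return current ^ crc32_unshift(target);
}
```

An attacker would compute the CRC of the modified image, call crc32_patch_word() with the CRC of the genuine image as the target, and place the returned word in otherwise unused space at the end. No comparable shortcut is known for SHA256.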

The only way to go is to try optimising this hash check. Today, the content is first copied to RAM and then sent to the hash algorithm.

Removing this RAM copy may gain some time, but I am not sure it will give that much!

Best regards

Jocelyn

Peeters.Bram
Senior

Hi Jocelyn,

Thank you very much for your effort to establish a lower bound, but my customer really cannot accept 1.6s extra time for the bootloader part.

I understand hash collisions with forged firmware are easy to realize if I only use a CRC32 check but for the kind of device I am working on this is an acceptable risk.

If a future project requires a higher level of security in combination with lower startup times, we will have to take this into account and look for a processor that can do a better check in hardware, so that the verification time can be kept in check (if it matters).

I modified the prepareimage tool to generate a CRC32 matching the one the hardware CRC block generates, and now the 'unsecure' verification runs in 100 ms @ 48 MHz, so that is OK.
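For anyone trying the same thing, the main pitfall is that the F4 CRC unit does not produce the common zlib-style CRC32. Below is a minimal host-side sketch of a matching software model, shown in C for illustration (it is not the actual prepareimage change, and the function names are hypothetical). It assumes the unit's fixed configuration: polynomial 0x04C11DB7, initial value 0xFFFFFFFF, one 32-bit word per write, MSB first, no reflection and no final XOR.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one 32-bit write to the CRC data register. */
uint32_t stm32_crc_step(uint32_t crc, uint32_t word)
{
    crc ^= word;
    for (int bit = 0; bit < 32; bit++) {
        crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u : (crc << 1);
    }
    return crc;
}

/* CRC of an image file as the firmware would compute it when reading flash
 * as uint32_t: bytes are packed into little-endian words first.
 * Trailing bytes beyond a multiple of 4 are ignored, so pad the image. */
uint32_t stm32_crc_image(const uint8_t *image, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;   /* value after a CRC unit reset */
    for (size_t i = 0; i + 4u <= len; i += 4u) {
        uint32_t word = (uint32_t)image[i]
                      | ((uint32_t)image[i + 1u] << 8)
                      | ((uint32_t)image[i + 2u] << 16)
                      | ((uint32_t)image[i + 3u] << 24);
        crc = stm32_crc_step(crc, word);
    }
    return crc;
}
```

A quick sanity check against the hardware is a single zero word after reset, which this model maps to 0xC704DD7B.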

I do still run into a very weird issue now: if I program the bootloader with a J-Link debugger everything is OK, but if I program it via a DFU file with the baked-in bootloader it goes wrong. When I read back the contents of the program flash where the bootloader is located (256 KB in my case; I had to add some other stuff, so I had to add the next 2 sectors too), they are 100% the same for both ways of programming. The option bytes are also the same (I disabled protections for now). Still, somewhere really early in the startup phase (before C code starts), the first instruction of __iar_zero_init3 for zeroing out memory, which is an LDR.W R1, [R0], #0x4 instruction, manages to result in a jump to 0x801bc3c (always the same address, no idea why). This happens even if I completely power down and restart the board (with the idea that maybe some interrupt was still armed from the baked-in bootloader). It does not happen in the J-Link version. I am still investigating (now comparing all CPU registers between the 2 situations when the instruction is executed). If I don't find it I will make a dedicated post for it, but if you happen to have experience with similar spooky behavior I am all ears 🙂

Regards

Bram

Jocelyn RICARD
ST Employee

Hello Peeters,

What I would do is flash with DFU and re-read the whole flash with the flasher. Then flash with the flasher and re-read the whole flash.

Should be exactly the same.

If they are exactly the same and the CPU behaves differently, I don't know what to say !

I know J-Link sets some registers so that it can keep control in case of a low-power transition.

But all these settings should be removed after a power on reset.

Best regards

Jocelyn

Eivor
Associate II

Hello Peeters,

Can you tell me how you ported the F413 SBSFU example to the F429?