2020-06-17 06:05 PM
I ported the SBSFU to an STM32F429 and everything works, but it takes almost 6 s (!) before the bootloader exits and jumps to the main application.
My application is about 1.3 MB (the entire flash is 2 MB), so it is not exactly small, but a 6 s boot time is just unacceptable for my application.
After analysing where this time is lost, I see that SFU_IMG_VerifyActiveImg is called twice and takes 1.4 s each time, and SFU_IMG_VerifyActiveSlot is called once and takes a whopping 2.8 s. Are these expected times? Why call SFU_IMG_VerifyActiveImg twice (not that doing it only once would help much timing-wise)?
Apparently the sha256 calculation which is at the core of these is really slow.
Any suggestions for speeding this up? At the moment I am thinking of either replacing this with a simple CRC32 (if I recall correctly, I managed to do that in 500 ms) or just skipping these verification steps entirely.
2020-06-17 11:23 PM
Which clock are you using? What's the clock speed?
2020-06-18 03:19 AM
Hello Peeter,
The SBSFU is authenticating the image in 2 steps:
1) First it is checking the signature of the header file. Checking signature involves
2) Once this signature check is done, the hash of the full firmware is computed and compared with the one provided in the header (and authenticated thanks to the first step). Here you are right: the size of the application makes this computation quite long.
Looking at the crypto lib performance (for GCC), I can see 4191 cycles to handle a 16-byte block. This leads to around 1.9 seconds at 180 MHz for 1.3 MB.
The speed-optimised version does not give a significant improvement here (4046 cycles).
So the complete crypto operations should take a bit more than 2.2 seconds (theoretically).
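The hash estimate above is just blocks times cycles per block divided by the core clock. A minimal sketch of that arithmetic follows; the function and parameter names are made up for illustration, and only the figures (4191 cycles per 16-byte block, 1.3 MB, 180 MHz) come from this post.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated hashing time in milliseconds.
 * image_bytes      : size of the firmware image
 * block_bytes      : block size the cycle figure refers to
 * cycles_per_block : cycles quoted for one block
 * clock_hz         : core clock in Hz
 * Hypothetical helper, not part of SBSFU or the crypto library. */
uint32_t hash_time_ms(uint64_t image_bytes, uint32_t block_bytes,
                      uint32_t cycles_per_block, uint32_t clock_hz)
{
    assert(block_bytes > 0 && clock_hz > 0);
    /* Round up: a partial last block still costs a full block. */
    uint64_t blocks = (image_bytes + block_bytes - 1u) / block_bytes;
    uint64_t cycles = blocks * cycles_per_block;
    return (uint32_t)((cycles * 1000u) / clock_hz);
}
```

With 1.3 MB taken as 1363149 bytes, `hash_time_ms(1363149, 16, 4191, 180000000)` gives 1983 ms, in line with the figure above. The same cycle count at 48 MHz gives 7438 ms, so the estimate depends heavily on the clock actually in use.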
The difference you obtain with the 6 s comes from:
1) the full authentication step1 + step2 is performed twice: This was added to increase robustness to attack. You can remove one.
2) the SBSFU is authenticating the header once more when you have the traces activated
The rest may be related to secureboot overhead. This needs to be checked.
I will make some tests next week to confirm this.
Best regards
Jocelyn
2020-06-18 06:56 AM
Hi,
Thanks for the replies !
I am not running at 180 MHz; it is a battery-powered handheld device, so we have to be conservative with power consumption, and we also have an external SRAM which limits the max clock speed.
My HCLK was supposed to be 48 MHz, but I just noticed I forgot to modify the define for the HSE input clock (it was still at 8 MHz from the code I started from but should be 16 MHz for our board).
After fixing that, the timing of VerifyActiveImage remains more or less unchanged, but VerifyActiveSlot is now 1.2 s.
Not sure how this is possible, but I don't know exactly how the wrong HSE value rippled through; my timings come from looking at the tick counter, so they might have been off too.
I have attached an image showing where the time is mainly spent with the current timings (1 tick = 1 ms).
I don't think I see much of the ECDSA time you are referring to. I only see significant time spent in the two functions I just mentioned, one of them being a SHA256 calculation over the firmware (VerifyActiveImage) and the other checking the region after the firmware for spurious code (VerifyActiveSlot). This is if I understand correctly what these functions are doing.
I think I will make a first attempt by removing one of the VerifyActiveImage operations and skipping VerifyActiveSlot entirely (I like my image intact, but I don't really care whether there is some extra code behind it). That should put me in the 1.4 s range.
Then I will experiment with the clock speed: since I don't need the external SRAM in the bootloader and it only runs for a short time, I could probably bump the bootloader clock speed to 180 MHz (and fall back to 48 MHz later in the main application), which should reduce it to <500 ms if everything scales with clock speed.
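The "<500 ms" figure follows from simple linear scaling. As a rough sketch, assuming the verification is purely CPU-bound so the cycle count stays the same (the helper name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Scale a measured duration from one core clock to another.
 * Assumes the same number of cycles at both clocks, which ignores
 * flash wait states and any peripheral-bound waiting. */
uint32_t scale_time_ms(uint32_t measured_ms, uint32_t measured_hz,
                       uint32_t target_hz)
{
    assert(target_hz > 0);
    return (uint32_t)(((uint64_t)measured_ms * measured_hz) / target_hz);
}
```

`scale_time_ms(1400, 48000000, 180000000)` gives 373 ms. This is a best case: flash wait states go up with the clock, so the real gain is likely to be somewhat smaller.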
But then I really hope your calculations are wrong, because according to them it should be in the ballpark of 1.9 s :/.
So if that does not turn out to be enough to get <500 ms, I will try a CRC32 over the firmware image instead of SHA256, but that will take some more tinkering, I assume.
Note: it would for sure be nice if the SBSFU code supported a few more options there out of the box so you could make the trade-off between speed and integrity more easily. For a lot of products, cold-boot startup time requirements are already pretty hard to meet, and a bootloader adding something in the range of seconds is not helping :).
If I am overlooking things let me know =)
Regards
Bram
Edit: there is something really fishy going on with my tick measurements. Now it says the tick count is around 1400 by the time the application launches, but if I time it with a stopwatch it is more like 7 seconds. I checked by adding a 10 s delay and printing something before/after on the UART, and that delay is indeed 10 s with the tick counter increased by 10000, so basic tick timing is OK. Maybe ticks are not counted (IRQs disabled) during some parts?
2020-06-18 08:07 AM
Ok so I looked with the scope and some prints
SFU_IMG_VerifyActiveImgMetadata: Start
[383ms]
SFU_IMG_VerifyActiveImg: Start
[4343ms]
SFU_IMG_HasValidActiveFirmware: Done
[383ms]
SFU_IMG_VerifyActiveImgMetadata: Start
[387ms]
SFU_IMG_VerifyActiveImgMetadata: Start
[388ms]
SFU_IMG_ControlActiveImgTag: Start
Seems like ticks (interrupts?) are indeed disabled for long periods during these calculations
And I will get nowhere near 500ms using the SBSFU mechanisms.
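Since the tick counter stops advancing while interrupts are masked, one alternative is the Cortex-M4 DWT cycle counter, which keeps counting regardless. A minimal sketch follows: the register names in the comment are the standard CMSIS ones, and the conversion helper below is hypothetical and written so it also runs on a PC.

```c
#include <assert.h>
#include <stdint.h>

/* On target (CMSIS core headers), the counter would be enabled with:
 *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;
 *   DWT->CYCCNT = 0;
 *   DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;
 * and sampled by reading DWT->CYCCNT before and after the call
 * being timed. */

/* Elapsed microseconds between two 32-bit cycle counter samples.
 * Unsigned subtraction stays correct across a single wrap-around. */
uint32_t cycles_to_us(uint32_t start, uint32_t end, uint32_t core_hz)
{
    assert(core_hz > 0);
    uint32_t delta = end - start;
    return (uint32_t)(((uint64_t)delta * 1000000u) / core_hz);
}
```

At 48 MHz the 32-bit counter wraps after roughly 89 s, so a single before/after pair is enough for boot-time measurements of a few seconds.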
2020-06-18 04:02 PM
Generic SHA256, doing 1MB on a STM32L4R9 at 120 MHz took 921 ms
ECDSA SECP192R2 signature verification 56 ms
2020-06-23 07:59 AM
Hello Peeter,
I ported the SBSFU from F413 to F429.
I set the clocking at 180MHz.
I made an application that is 1316 KB, and the slot is 1984 KB.
Here is the result I obtain:
Hash of the firmware with SHA256: 481 ms
ECDSA check: 131 ms
Check of unused space: 346 ms
First, the check of unused space really needs optimisation: this is done byte by byte. I need to check why. With a quick check for 0xFF by dword, I get 13 ms.
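For illustration, a word-wise check of the unused area could look like the sketch below. It assumes "dword" means 32-bit accesses and a 4-byte-aligned region; the function name is made up and this is not the SBSFU implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if every 32-bit word in [start, start + words) reads
 * as erased flash (0xFFFFFFFF). Hypothetical helper; the region is
 * assumed to be 4-byte aligned and a multiple of 4 bytes long. */
bool region_is_erased(const uint32_t *start, size_t words)
{
    assert(((uintptr_t)start & 3u) == 0u);
    for (size_t i = 0; i < words; i++) {
        if (start[i] != 0xFFFFFFFFu) {
            return false; /* stop at the first programmed word */
        }
    }
    return true;
}
```

Reading four bytes per iteration cuts the loop count to a quarter compared with the byte-by-byte version.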
The rest is not really optimised either, but this is a known issue that should be fixed in the next release.
The ECDSA check is done 4 times (1 is done in the trace to give the version so is removed if you deactivate debug).
The 3 others are:
1) Check user fw status
2) verify user fw signature
3) execute user fw
Also, the hash is done twice:
1) Check user fw status
2) Verify user fw signature
Now, with just the optimisation of the unused-space check, I get a total of 1.6 s for start-up.
It will be difficult to reach 500 ms anyway. I guess this could be possible, but I am not 100% sure.
Replacing the hash with a CRC is a bad idea.
It is easy to forge a firmware that has a specific CRC. So you will lose all the security provided by SBSFU!
The only way to go is to try optimising this hash check. Today, the content is first copied to RAM and then sent to the hash algorithm.
Removing this RAM copy may gain some time, but I am not sure it will gain that much!
Best regards
Jocelyn
2020-06-23 08:28 AM
Hi Jocelyn,
Thank you very much for your effort to establish a lower bound, but my customer really cannot accept 1.6s extra time for the bootloader part.
I understand hash collisions with forged firmware are easy to realize if I only use a CRC32 check but for the kind of device I am working on this is an acceptable risk.
If in a future project a higher level of security is required in combination with lower startup times, we will have to take this into account and look for a processor that can do a better check in HW so that the verification time can be kept in check (if it matters).
I modified the prepareimage tool to generate a CRC32 matching the one the hardware CRC block generates, and now the 'unsecure' verification runs in 100 ms @ 48 MHz, so that is OK.
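For anyone wanting to reproduce the matching CRC on the host side, here is a minimal software model of the STM32F4 CRC unit. It assumes the behaviour described in the reference manual (polynomial 0x04C11DB7, initial value 0xFFFFFFFF, whole 32-bit words fed MSB first, no reflection, no final XOR). The function name is made up, and this is a sketch, not the actual change made to prepareimage.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise model of the fixed STM32F4 CRC unit over 32-bit words.
 * The image is assumed to be padded to a multiple of 4 bytes and
 * read as little-endian words, as the CPU would feed them to CRC->DR. */
uint32_t crc32_stm32_words(const uint32_t *data, size_t words)
{
    assert(data != NULL || words == 0);
    uint32_t crc = 0xFFFFFFFFu;            /* reset value of the unit */
    for (size_t i = 0; i < words; i++) {
        crc ^= data[i];                    /* whole word XORed in at once */
        for (int bit = 0; bit < 32; bit++) {
            crc = (crc & 0x80000000u) ? ((crc << 1) ^ 0x04C11DB7u)
                                      : (crc << 1);
        }
    }
    return crc;
}
```

Because there is no final XOR, running the same function over the image followed by its own CRC word returns 0, which gives a cheap pass/fail check in the bootloader. As noted earlier in the thread, this only detects accidental corruption and offers no protection against a forged image.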
I do still run into a very weird issue now: if I program the bootloader with a Jlink debugger everything is OK, but if I program it via a DFU file with the baked-in bootloader it goes wrong. When reading back the contents of the program flash where the bootloader is located (256 KB in my case; I had to add some other stuff, so I had to add the next 2 sectors too), both ways of programming give 100% identical results. The option bytes are also the same (I disabled protections for now). Still, somewhere really early in the startup phase (before C code starts), the first instruction of __iar_zero_init3 for zeroing out memory, which is an LDR.W R1, [R0]#0x4 instruction, manages to result in a jump to 0x801bc3c (always the same address, no idea why). This happens even if I completely power down and restart the board (on the theory that maybe some interrupt was still armed from the baked-in bootloader). It does not happen in the Jlink version. I am still investigating (now comparing all CPU registers between the two situations when the instruction is executed). If I don't find it I will make a dedicated post for it, but if you happen to have experience with similar spooky behavior, I am all ears :)
Regards
Bram
2020-06-23 09:18 AM
Hello Peeters,
What I would do is flash with DFU and re-read all of the flash with the flasher. Then flash with the flasher and re-read the whole flash.
The two should be exactly the same.
If they are exactly the same and the CPU behaves differently, I don't know what to say!
I know JLink sets some registers to be able to keep control in case of a low-power transition.
But all these settings should be cleared after a power-on reset.
Best regards
Jocelyn
2021-09-13 02:13 AM
Hello Peeters,
Can you tell me how you ported the F413 SBSFU example to the F429?