2022-01-15 02:55 AM
We have encountered a problem with CRYPTO IP core on STM32MP157C.
Our setup is STM32MP157C-DK2 with latest image/SDK installed
(openstlinux-5.10-dunfell-mp1-21-11-17).
1. First of all, our IPsec solution based on strongSwan doesn't work at
all when stm32-cryp.ko is loaded: after processing several packets the
IPsec connection gets stuck. The only message we get in the kernel ring
buffer is:
```
[ 102.064269] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
```
It gets stuck no matter which cipher is selected or which settings are
used. However, if we don't use stm32-cryp at all (e.g. if we unload
this module), the IPsec connection works perfectly.
We haven't found a simple way to reproduce this bug without deploying
an IPsec infrastructure (which is very simple to do with AlgoVPN [1]),
so we can give you access to our test environment or provide more
details on request.
2. Moreover, we ran a performance test using cryptodev-tests
([2], but this package is available in the Yocto SDK too) and `openssl
speed`, and it looks like the software implementation is much faster
than the hardware-accelerated one.
The first test was performed with the userspace software implementation
(as evidence, the CPU spent almost the whole run in userspace:
18.02s of 18.38s):
```
root@stm32mp1:~# cat /proc/crypto | grep cbc
root@stm32mp1:~# time openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 1829060 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 548756 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 145037 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 36751 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 4614 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 2304 aes-256-cbc's in 3.00s
...
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 9754.99k 11706.79k 12376.49k 12544.34k 12599.30k 12582.91k
real 0m 18.38s
user 0m 18.02s
sys 0m 0.00s
```
The second one uses the hardware-accelerated algorithm:
```
root@stm32mp1:~# insmod stm32-cryp.ko
root@stm32mp1:~# cat /proc/crypto | grep cbc
name : cbc(des3_ede)
driver : stm32-cbc-des3
name : cbc(des)
driver : stm32-cbc-des
name : cbc(aes)
driver : stm32-cbc-aes
root@stm32mp1:~# time openssl speed -evp aes-256-cbc -elapsed
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-256-cbc for 3s on 16 size blocks: 32666 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 64 size blocks: 26338 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 256 size blocks: 15378 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 1024 size blocks: 5661 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 8192 size blocks: 818 aes-256-cbc's in 3.00s
Doing aes-256-cbc for 3s on 16384 size blocks: 408 aes-256-cbc's in 3.01s
...
type 16 bytes 64 bytes 256 bytes 1024 bytes 8192 bytes 16384 bytes
aes-256-cbc 174.22k 561.88k 1312.26k 1932.29k 2233.69k 2220.82k
real 0m 18.13s
user 0m 0.07s
sys 0m 2.59s
```
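The summary rows printed by `openssl speed` can be re-derived from the per-size counts, which makes the gap between the two runs easy to quantify. A minimal sketch in Python, using only numbers copied from the two runs above. The helper name `throughput_k` is made up for illustration, and it assumes the `k` suffix means 1000 bytes per second (which matches the printed figures):

```python
def throughput_k(operations: int, block_size: int, seconds: float) -> float:
    """Throughput in thousands of bytes per second, as in the summary row."""
    return operations * block_size / seconds / 1000.0


# 16384-byte blocks, software run: 2304 operations in 3.00s
sw = throughput_k(2304, 16384, 3.00)
# 16384-byte blocks, stm32-cryp run: 408 operations in 3.01s
hw = throughput_k(408, 16384, 3.01)

print(f"software: {sw:.2f}k")
print(f"hardware: {hw:.2f}k")
print(f"software/hardware ratio: {sw / hw:.1f}x")
```

For 16384-byte blocks this reproduces the 12582.91k and 2220.82k figures, a ratio of roughly 5.7. The same arithmetic on the 16-byte rows (9754.99k against 174.22k) gives a ratio of about 56, so the gap widens sharply as the block size shrinks.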
The speed test from the cryptodev source code gives very similar results:
```
root@stm32mp1:~# insmod stm32-cryp.ko
root@stm32mp1:~# time ./speed
Testing AES-128-CBC cipher:
Encrypting in chunks of 65536 bytes: done. 11.47 MB in 5.00 secs: 2.29 MB/sec
real 0m 5.00s
user 0m 0.00s
sys 0m 0.01s
```
So the question is: is this the real performance of the crypto IP core
(about 2 MB/s for chunks of 16 KB and larger), or is it an issue caused
by the drivers or some other hw/sw interaction problem? As far as we
know, we are not the only ones who have run into this issue ([3], last
answer).
It's worth noting that during the hardware-accelerated test the CPU was
heavily loaded (95.4%) in kernel space by the irq/60-54001000 task, so
this method can't even be used to reduce CPU load by offloading work to
the crypto IP.
P.S. We have added these lines into local.conf to build strongSwan
and OpenSSL with cryptodev support:
```
PACKAGECONFIG_append_pn-openssl = " cryptodev-linux"
IMAGE_INSTALL_append = " strongswan cryptodev-module cryptodev-tests"
```
Thank you in advance!
[1]: https://github.com/trailofbits/algo
[2]: https://github.com/cryptodev-linux/cryptodev-linux/tree/master/tests
[3]: https://community.st.com/s/question/0D50X0000C4POdo/crypto-api
2022-01-17 02:32 AM
Hello @dimax,
Thank you for your detailed message.
I will try to help you.
Regarding your first question, about the issue with stm32-cryp.ko:
you mention this error:
[ 102.064269] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
If you're seeing a few of those every now and then, nothing to be alarmed about. It can happen when the CPU is stressed and is normal. If you're constantly seeing it then you might want to consider either reducing the CPU load or disabling NOHZ.
But if you decide to do this, make sure you fully understand what it does by reading the kernel documentation: https://elixir.bootlin.com/linux/v5.10.10/source/Documentation/timers/no_hz.rst
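To see which softirq the message is complaining about, the `handler #08` field can be decoded. The sketch below assumes (from the kernel's tick-sched code, not from this thread) that the number is the pending-softirq bitmask printed in hexadecimal, and that the bit order follows the softirq enum in `include/linux/interrupt.h` for 5.10. The function name `decode_pending` is made up for illustration:

```python
# Softirq names in bit order (assumed to match the Linux 5.10 enum).
SOFTIRQ_NAMES = [
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK",
    "IRQ_POLL", "TASKLET", "SCHED", "HRTIMER", "RCU",
]


def decode_pending(mask_hex: str) -> list[str]:
    """Return the names of the softirqs whose bits are set in the hex mask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(SOFTIRQ_NAMES) if mask & (1 << bit)]


# "08" is the value from the log line quoted above.
print(decode_pending("08"))  # ['NET_RX']
```

Under those assumptions, `#08` means network receive processing was pending when the tick was about to stop, which would be consistent with the message appearing while IPsec traffic is being handled.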
To help you with the stuck connection, I need to have:
2. Cryptodev performance tests
Unfortunately, it is a known Linux issue that the cryptodev framework and other frameworks are not optimized for HW engines.
As a result, a full SW implementation performs better than the HW-accelerated framework
for crypto functions that require many repeated operations on small pieces of data (linked to key
size):
- The Linux framework uses work queues, which adds scheduling overhead
- DMA does not help (it takes more time to configure a transfer than to copy the data)
This is the same issue for any vendor.
Regarding that, I advise you to use the SW implementation instead of HW acceleration, if it is possible in your project.
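When both a software and a hardware implementation are registered, the kernel crypto API normally picks the one with the highest `priority` for a given algorithm name, so it can help to check which driver would win before deciding whether to unload stm32-cryp. A minimal sketch in Python that parses `/proc/crypto`-style text. The sample text, its priority values, and the driver name `example-sw-cbc-aes` are illustrative assumptions, not output from the board:

```python
def parse_proc_crypto(text: str) -> list[dict[str, str]]:
    """Split /proc/crypto-style text into one dict per registered algorithm."""
    entries = []
    current = {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry.
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


def preferred_driver(entries, name):
    """Return the driver with the highest priority for an algorithm name."""
    candidates = [e for e in entries if e.get("name") == name]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: int(e.get("priority", "0")))
    return best.get("driver")


# Illustrative sample only: priorities and the software driver name are made up.
SAMPLE = """\
name         : cbc(aes)
driver       : stm32-cbc-aes
priority     : 200

name         : cbc(aes)
driver       : example-sw-cbc-aes
priority     : 100
"""

print(preferred_driver(parse_proc_crypto(SAMPLE), "cbc(aes)"))
```

On a real system the text would come from reading `/proc/crypto` instead of `SAMPLE`.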
One more question:
In the first part of your post, you say that IPsec gets stuck when stm32-cryp.ko is loaded.
But during your performance tests in the second part, you run
insmod stm32-cryp.ko
and the crypto seems to work. Did you disable IPsec for this test?
Regards,
Kevin
2022-01-20 08:17 AM
Yes, IPsec was disabled during testing.
I went on to run the same test on an old x86 machine. There I get about a 5x improvement with HW acceleration.
How can you explain that?
→ ./speed
Testing AES-128-CBC cipher:
Encrypting in chunks of 65536 bytes: done. 6.13 GB in 5.00 secs: 1.23 GB/sec
→ sudo rmmod aesni_intel
→ ./speed
Testing AES-128-CBC cipher:
Encrypting in chunks of 65536 bytes: done. 1.23 GB in 5.00 secs: 0.25 GB/sec
2022-01-20 10:13 AM
And here are test results with an NXP part that show up to a 100x performance increase with the use of HW acceleration:
2022-02-16 03:45 AM
Hello,
ST's policy for Linux is to rely on the Linux frameworks plus HW driver adaptations to those frameworks, so that the overall solution can be upstreamed and maintained by the community (which in turn makes the ST solution more sustainable for customers).
Doing this for the Cryptodev framework gives the results you highlight here: in some cases, performance is very low compared to a full-CPU implementation, at the same CPU load.
A study is ongoing to check what could be done to improve this (depending on HW capability and the SW adaptation required on top of the HW drivers), but there is no short-term solution.