Can the STM32 CRC peripheral be made to work with the CRC-15_CAN polynomial?

DHase.1 · ‎2022-06-07

Can the CRC for the ADBMS1818 (and other Analog Devices BMS parts) be generated using the STM32 CRC peripheral?

The ADBMS1818 datasheet shows a 15 bit polynomial for the CRC as--

x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1

Other sources call this a CAN-15-CRC polynomial, e.g., the Wikipedia article Cyclic redundancy check.

Wikipedia lists this polynomial as "even", and the Ref Manual for the STM32L431 says that the CRC peripheral does not work for even polynomials. However, neither Wikipedia nor ST makes clear exactly what definition of "even" it is using, so it is not clear whether this polynomial can be handled by the 'L431 CRC.

However, if the problem is odd/even, there might be some tricks to make it work, e.g. reversal of the polynomial with a zero added, but I'm not sure if that is possible. I've made some attempts that have not been successful.

Finally, the datasheet has an example software routine for generating the CRC. It uses a polynomial representation of 0x4599. However, Wikipedia and its references show it as 0xC599 (16b!).
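To make the two hexadecimal forms concrete, here is a small sketch that builds the bit mask directly from the exponents of the datasheet polynomial. The helper names (poly_mask, pec15_poly_implicit, pec15_poly_explicit) are made up for illustration. Leaving the x^15 term implied gives 0x4599, writing it out as bit 15 gives 0xC599, and shifting the 15-bit form left by one bit gives 0x8B32.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: OR together one bit per exponent in the list. */
static uint32_t poly_mask(const unsigned *exps, size_t n)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < n; i++) {
        assert(exps[i] < 32u);              /* keep the shift well defined */
        mask |= (uint32_t)1u << exps[i];
    }
    return mask;
}

/* Exponents of x^15 + x^14 + x^10 + x^8 + x^7 + x^4 + x^3 + 1 */
static const unsigned pec15_exps[] = { 15, 14, 10, 8, 7, 4, 3, 0 };

/* Form with the leading x^15 term implied: skip the first exponent. */
static uint32_t pec15_poly_implicit(void)
{
    return poly_mask(&pec15_exps[1], 7);
}

/* Form with the x^15 term written out as bit 15. */
static uint32_t pec15_poly_explicit(void)
{
    return poly_mask(pec15_exps, 8);
}
```

So the two values describe the same polynomial and differ only in whether bit 15 is written down.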

For a software implementation the usual table lookup is probably satisfactory. However, if it is possible to make use of the 'L431 hardware, it would save the memory space for the table and provide a small improvement in computation time.
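As a reference point for any hardware attempt, a table-free software version is handy. The following is a minimal sketch, not the datasheet routine: it assumes the 0x4599 polynomial, an initial remainder of 0x10 (the PEC15 seed used later in this thread), MSB-first processing, and a 15-bit result shifted left by one to fill 16 bits. The function name pec15_bitwise is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEC15_POLY 0x4599u  /* x^15 term implied */
#define PEC15_SEED 0x0010u  /* assumed initial remainder */

/* Bit-by-bit CRC-15, MSB first, no table. Returns the 15-bit remainder
 * shifted left by one, so the 16-bit result has a zero LSB. */
static uint16_t pec15_bitwise(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint16_t rem = PEC15_SEED;
    for (size_t i = 0; i < len; i++) {
        /* Align the incoming byte with bits 14:7 of the remainder. */
        rem ^= (uint16_t)((uint16_t)data[i] << 7);
        for (int bit = 0; bit < 8; bit++) {
            if (rem & 0x4000u)
                rem = (uint16_t)((rem << 1) ^ PEC15_POLY);
            else
                rem = (uint16_t)(rem << 1);
            rem &= 0x7FFFu;                 /* keep 15 bits */
        }
    }
    return (uint16_t)(rem << 1);
}
```

For the bytes A5 02 03 04 05 FE this returns 0xEA4C, and running it over those six bytes followed by EA 4C returns 0, which is a quick way to check a received frame.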

DHase.1 · ‎2022-06-12

waclawek.jan,

>You can also consider the packing - I don't know why is the ST implementation inefficient but I am not going to investigate, I really don't care about Cube.

Unless I missed it, the Ref Manual doesn't mention that when the data consists of bytes and a 32b word is loaded into CRC_DR, the byte order within that word is backwards, i.e. big endian rather than the "natural" little endian. When there are 4 or more bytes, HAL_CRC_Calculate packs 4 bytes into a word, but it has to do that with a series of 4 shift/or operations to construct the word in the proper byte order. The compiled code shows that this takes as much time as just loading a series of single bytes into the CRC_DR register.

It took me a while to realize the byte order in the word when I first skimmed over the HAL_CRC_Calculate source code. I didn't catch that their shift|or was setting up the bytes in reverse order. The comments suggest the programmer thought this was optimizing the speed.

The other problem is that HAL_CRC_Calculate includes switch statements and branches to handle all the different ways the CRC peripheral might be configured.

It would have helped if the Ref Manual had included a few sentences clarifying the byte order for word and 1/2 word loading.

DHase.1 · ‎2022-06-12

PS: I just realized that the Cortex-M series processors have a "REV" instruction that will reverse the byte order in a word. That would provide a way for efficient packing.

waclawek.jan · ‎2022-06-13

> It would have helped if the Ref Manual had included a few sentences clarifying the byte order for word and 1/2 word loading.

Yes. Unfortunately, the ST documentation leaves out a lot of such details. This is the consequence of creating the documentation in the same way as they create the chips themselves: by slapping modules together with not much thought given to the "long proven" modules themselves or to properly documenting the interconnections and their consequences. In particular, I'd guess the CRC unit comes from a development around a naturally big-endian processor core (maybe the Power architecture on which the automotive SPC56 are based).

And this is one of my problems with Cube, too - instead of ST providing clean and documented examples, they hide this kind of problem inside Cube with little or no comment/explanation.

> Cortex-M series processors have a "REV" instruction

I guess whoever wrote that Cube implementation was unaware of this instruction (and REV16 for the 16-bit variant) and of the __REV() (__REV16()) CMSIS intrinsics (which you may not be aware of either, as you seem to avoid using the CMSIS-mandated device header and the symbols therein - I wonder why). But then, whoever wrote the CRC module implementation, and in particular its newer incarnations, which provide bit reversal (which has an instruction in the CM core, too) but not byte reversal, might have a similar mindset.

But at least we have an 8-bit scratch register available in CRC.

JW

DHase.1 · ‎2022-06-13

> I'd guess the CRC unit comes from a development around a naturally big-endian processor core (maybe the Power architecture on which the automotive SPC56 are based).

My guess is that ST's CRC was based on using "network order" rather than on a big endian processor. I think the networking concepts originated in the early 1960s on IBM machines that were what we now call big endian. The closest thing to networking I worked on in those days was a system using our own format on a slow 30 bps teletype network, with no thought of something so impractical as implementing a CRC with relays and discrete transistors. A lot of progress in the last 60 years.

> I guess whoever wrote that Cube implementation either was unaware of this instruction (and REV16 for the 16-bit variant) and the __REV() (__REV16())

I think one problem with inserting ASM instructions is that it is compiler/linker specific, and HAL is designed to work on several different compilers, so dealing with that might have been an issue.

> CMSIS intrinsics (which you may not be aware either, as you seem to avoid using the CMSIS-mandated device header and symbols from therein

I was quite aware of the CMSIS, but was having some difficulty with getting the pointer casting sorted out when mixed in with the other uncertainties and took the lazy/direct approach.

waclawek.jan · ‎2022-06-13

>> and the __REV() (__REV16()) CMSIS intrinsics

> I think one problem with inserting ASM instructions is that it is compiler/linker specific and HAL is designed to work on several different compilers, so dealing with that might have been an issue.

That's why I mentioned the CMSIS intrinsics. They are C functions (static inline) provided by ARM, designed to work with the 3 dominant translators.

JW

DHase.1 · ‎2022-06-14

Thanks. I see your point. I was not familiar with the term "intrinsics." The last time I embedded ASM in STM32 C code was 10 years ago and it was compiler (gcc) specific.

The following makes use of the ASM instructions REV and REV16 (through the CMSIS intrinsics) to reverse bytes and compute a 6 byte PEC15 inline. The initialization is called once (assuming STM32CubeMX isn't used, i.e. the CRC is not already activated). pec15_reg is a looping routine that loads bytes into the CRC one at a time.

/* *************************************************************************
 * void pec15_reg_init (void);
 *  @brief  : Initialize RCC and CRC registers for ADBMS1818 CRC-15 computation
 * *************************************************************************/
#define SEED 0x10 // PEC15 initial value
void pec15_reg_init (void)
{
  /* Bit 12 CRCEN: CRC clock enable */
    RCC->AHB1ENR |= RCC_AHB1ENR_CRCEN;
 
  /* Set CRC registers. */
  CRC->INIT = SEED*2;
  CRC->POL = 0x8B32; // CRC_POL: 0x4599 Polynomial * 2
 
  return;
}
 
/* *************************************************************************
 * uint16_t pec15_reg (uint8_t *pdata , int len);
 *  @brief  : Reset and compute CRC
 *  @param  : pdata = pointer to input bytes
 *  @param  : len = number of bytes
 *  @return : CRC-15 * 2 (ADBMS1818 16b format)
 * *************************************************************************/
uint16_t pec15_reg (uint8_t *pdata , int len)
{
  /* CRC_CR = 0x9: POLYSIZE = 01 (16-bit polynomial) plus RESET,
     which reloads the CRC from CRC_INIT. */
  CRC->CR = 0x9;
 
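  /* Byte-wide writes to CRC_DR, which sits at offset 0 from CRC_BASE.
     The do/while assumes len >= 1. */
  uint8_t* pend = pdata + len;
  do
  {
     *(__IO uint8_t*)CRC_BASE = *pdata++;
  } while (pdata < pend);
 
  return (uint16_t)CRC->DR;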
}
 
 // ###SNIP Six bytes: Word, 1/2 word ###
  /* Six byte PEC15 computation */
    CRC->CR = 0x9; // 16b poly, + reset
    *(__IO uint32_t*)CRC_BASE = (uint32_t)__REV (*(uint32_t*)&data[0]);
    *(__IO uint16_t*)CRC_BASE = (uint16_t)__REV16 (*(uint16_t*)&data[4]);
    p15H = CRC->DR;  // Store 1/2 word result
 
 
// ###SNIP Six bytes three 1/2 words ###
     /* Six byte PEC15 computation */
    CRC->CR = 0x9; // 16b poly, + reset
    *(__IO uint16_t*)CRC_BASE = (uint16_t)__REV16 (*(uint16_t*)&data[0]);
    *(__IO uint16_t*)CRC_BASE = (uint16_t)__REV16 (*(uint16_t*)&data[2]);
    *(__IO uint16_t*)CRC_BASE = (uint16_t)__REV16 (*(uint16_t*)&data[4]);
    p15E = CRC->DR; // Store 1/2 word result
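
The three ways of feeding the six bytes can also be compared on a host with a small software stand-in for the peripheral. This is only a sketch under assumptions: it models a 16-bit, non-reflected CRC with POL = 0x8B32 and INIT = 0x20, and it assumes, as reported earlier in this thread, that a word or half-word write is consumed starting from its most significant byte. All names (model_write and the pec15_model_* functions) are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_POLY 0x8B32u  /* 0x4599 << 1, as written to CRC->POL  */
#define MODEL_INIT 0x0020u  /* 0x10 << 1, as written to CRC->INIT   */

/* Shift the low `nbits` bits of `value` into the CRC, MSB first. */
static uint16_t model_write(uint16_t crc, uint32_t value, int nbits)
{
    assert(nbits == 8 || nbits == 16 || nbits == 32);
    for (int i = nbits - 1; i >= 0; i--) {
        unsigned in  = (unsigned)((value >> i) & 1u);
        unsigned top = (unsigned)((crc >> 15) & 1u);
        crc = (uint16_t)(crc << 1);
        if (in ^ top)
            crc = (uint16_t)(crc ^ MODEL_POLY);
    }
    return crc;
}

/* One byte per write. */
static uint16_t pec15_model_bytes(const uint8_t d[6])
{
    uint16_t crc = MODEL_INIT;
    for (int i = 0; i < 6; i++)
        crc = model_write(crc, d[i], 8);
    return crc;
}

/* One word plus one half-word, first byte in the top bits. */
static uint16_t pec15_model_word_half(const uint8_t d[6])
{
    uint32_t w = ((uint32_t)d[0] << 24) | ((uint32_t)d[1] << 16) |
                 ((uint32_t)d[2] << 8)  |  (uint32_t)d[3];
    uint32_t h = ((uint32_t)d[4] << 8)  |  (uint32_t)d[5];
    uint16_t crc = model_write(MODEL_INIT, w, 32);
    return model_write(crc, h, 16);
}

/* Three half-words. */
static uint16_t pec15_model_halves(const uint8_t d[6])
{
    uint16_t crc = MODEL_INIT;
    for (int i = 0; i < 6; i += 2)
        crc = model_write(crc, ((uint32_t)d[i] << 8) | d[i + 1], 16);
    return crc;
}
```

With the bytes A5 02 03 04 05 FE all three model functions return 0xEA4C.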

My test routine runs the various approaches on the same input data, so they should all produce the same result. In the output below, the first row shows the PEC15 result for each approach and the row beneath it shows the number of machine cycles taken.

A: FOR loop in main: one byte at a time to CRC DR

B: subroutine: pec15: 256 entry table lookup, 8b bytes

C: CRC_Handle_8 (routine is embedded in HAL_CRC_Calculate)

D: HAL_CRC_Calculate

E: inline: three 1/2 words

F: subroutine: pec15_nibble: 16 entry table lookup, 4b nibbles

G: subroutine: pec15_reg: loop one byte at a time to CRC DR

H: inline: 32b word + 16b 1/2word

POLY: 8B32 SEED: 20 SIZE: 6 DATA[]: A5 02 03 04 05 FE
  0 A:EA4C B:EA4C C:EA4C D:EA4C E:EA4C F:EA4C G:EA4C H:EA4C
      85    109     78    115     23    156     80     21

As expected the inline statements are noticeably faster, and the table lookup by nibbles the slowest.
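For readers who want to see what a nibble-wise lookup looks like, here is a minimal sketch. It is not the pec15_nibble routine that was timed above, which is not shown in this thread; it assumes the same 0x4599 polynomial and 0x10 seed, and the names nibble_table, pec15_nibble_init and pec15_nibble_sketch are illustrative. The table costs 16 half-words instead of 256, at the price of two lookups per byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEC15_POLY 0x4599u  /* x^15 term implied */
#define PEC15_SEED 0x0010u  /* assumed initial remainder */

static uint16_t nibble_table[16];

/* Entry i is the 15-bit remainder left after clocking the 4-bit value i
 * through the CRC. Call once before pec15_nibble_sketch(). */
static void pec15_nibble_init(void)
{
    for (unsigned i = 0; i < 16; i++) {
        uint16_t rem = (uint16_t)(i << 11);     /* nibble in bits 14:11 */
        for (int bit = 0; bit < 4; bit++) {
            if (rem & 0x4000u)
                rem = (uint16_t)((rem << 1) ^ PEC15_POLY);
            else
                rem = (uint16_t)(rem << 1);
            rem &= 0x7FFFu;
        }
        nibble_table[i] = rem;
    }
}

/* CRC-15 with two table lookups per byte, high nibble first.
 * Returns the remainder shifted left by one (16-bit format). */
static uint16_t pec15_nibble_sketch(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint16_t rem = PEC15_SEED;
    for (size_t i = 0; i < len; i++) {
        unsigned idx = ((unsigned)(rem >> 11) ^ (unsigned)(data[i] >> 4)) & 0x0Fu;
        rem = (uint16_t)(((rem << 4) ^ nibble_table[idx]) & 0x7FFFu);
        idx = ((unsigned)(rem >> 11) ^ (unsigned)data[i]) & 0x0Fu;
        rem = (uint16_t)(((rem << 4) ^ nibble_table[idx]) & 0x7FFFu);
    }
    return (uint16_t)(rem << 1);
}
```

After pec15_nibble_init(), the bytes A5 02 03 04 05 FE give 0xEA4C, the same value as in the results above.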

One item I have not investigated is whether the writes can outrun the CRC computation. The clock setup might have the bus divided, in which case inline instructions could overrun the CRC input. The Ref Manual says,

"An input buffer allows to immediately write a second data without waiting for any wait states due to the previous CRC calculation."

This suggests that if inline instructions overtake the CRC peripheral, wait states would be generated. When the bus is not divided, it looks like overrunning would not be possible. Here is a snip of the compiled code for the word + 1/2 word six byte CRC computation--

  CRC->CR = 0x9; // 16b poly, + reset
 8001efe:	2009      	movs	r0, #9
 8001f00:	6098      	str	r0, [r3, #8]
    *(__IO uint32_t*)CRC_BASE = (uint32_t)__REV (*(uint32_t*)&data[0]);
 8001f02:	4d51      	ldr	r5, [pc, #324]	; (8002048 <StartDefaultTask+0x22c>)
 8001f04:	682a      	ldr	r2, [r5, #0]
__STATIC_FORCEINLINE uint32_t __REV(uint32_t value)
 8001f06:	ba12      	rev	r2, r2
 8001f08:	601a      	str	r2, [r3, #0]
    *(__IO uint16_t*)CRC_BASE = (uint16_t)__REV16 (*(uint16_t*)&data[4]);
 8001f0a:	88aa      	ldrh	r2, [r5, #4]
__STATIC_FORCEINLINE uint32_t __REV16(uint32_t value)
  __ASM volatile ("rev16 %0, %1" : __CMSIS_GCC_OUT_REG (result) : __CMSIS_GCC_USE_REG (value) );
 8001f0c:	ba52      	rev16	r2, r2
 8001f0e:	b292      	uxth	r2, r2
 8001f10:	801a      	strh	r2, [r3, #0]
    p15H = *(uint32_t*)0x40023000;//hcrc.Instance->DR;//*(uint32_t*)(crcbase+0);
 8001f12:	681a      	ldr	r2, [r3, #0]
 8001f14:	920c      	str	r2, [sp, #48]	; 0x30

Loading both the word and 1/2 word takes 4 cycles. I don't think the uxth instruction is needed for the 1/2 word, but that is not important. When the clock setup has the bus running at the same rate as the system, the above instructions could not overrun the CRC computation.

waclawek.jan · ‎2022-06-15

> I was not familiar with the term "intrinsics."

It's ARM's term. I don't like it. I don't like ARM's terminology in general, but that's all I can do about it.

Thanks for sharing the code and results.

> table lookup by nibbles the slowest

Yet it's the most portable. A very nice illustration of the principle, where efficiency goes straight against flexibility (whatever both of these mean).

> This suggests that if inline instructions overtake the CRC peripheral, wait states would be generated.

Probably so, especially considering it in the context of the older incarnation of the CRC module, where there was no such buffer and waitstates were explicitly said to be generated. ST might've formulated this more clearly, but clear documentation is not ST's forte.

In pure asm, you could've tried to "overload" the CRC by loading several words of data into several registers beforehand, and then dump it on CRC at once. In the case I am talking about in that gotcha, the problem is exactly this: both values for control register and the first data are loaded into registers beforehand (by the C optimizer, given some previous sequence of events), and then dumped at CRC module immediately after each other. The waitstate mechanism in 'F4 obviously does not account for this. I wonder, whether in 'L4 this potential problem has been already solved.

JW