UTF-8 characters in text strings

EThom.3
Senior II

I am writing firmware for a device with an alphanumeric display. The display language must be changeable. For most languages, this means using characters outside the A-Z range. UTF-8 covers most of the necessary characters.

In the project properties, I found this setting and naively assumed that the UTF-8 selection also applied to the compiler:

[Screenshot: EThom3_1-1757993792408.png]

Well, obviously not, as it is GCC that needs to know that the character set is to be UTF-8. The code

sprintf(pString, "Français");

is translated to

sprintf(pString, "FranÃ§ais");

 (It might not be super-visible, but the ç in Français is a c with a cedilla, which has the UTF-8 code 0xE7.)

I tried using the flags -finput-charset=utf-8 and -fexec-charset=utf-8 in an attempt to tweak GCC to use UTF-8 for strings and characters, but that didn't make a difference. Also, I have a hunch that these flags only work for C++, but I'm not sure.
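For reference, both flags are accepted for plain C as well as C++, and both already default to UTF-8, which would explain why adding them changed nothing. What would change the stored bytes is a non-default execution charset, provided the toolchain was built with iconv support; the toolchain name below is only an example:

```shell
# Both charset flags default to UTF-8, so spelling them out stores the
# same bytes as before.  Choosing a Latin-1 execution charset instead
# makes GCC store the single byte 0xE7 for ç (requires a toolchain
# built with iconv support).
arm-none-eabi-gcc -finput-charset=UTF-8 -fexec-charset=ISO-8859-1 -c main.c
```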

Is what I'm trying to do even possible?

I do have the option to replace ç with \xE7, but with loads of text for several languages, and numerous "special" characters, this becomes rather cumbersome. I have enough trouble with the Polish characters as it is...


6 REPLIES
gbm
Principal

CubeIDE and GCC use UTF-8 by default. I guess that some other piece of software you use for editing your files doesn't recognize your source text as UTF-8 and uses some other extended ASCII encoding, so your Unicode non-ASCII chars are displayed as sequences of 2 or more characters.

My STM32 stuff on github - compact USB device stack and more: https://github.com/gbm-ii/gbmUSBdevice

EThom.3
Senior II

I don't use any other software for editing. STM32CubeIDE only.

I extracted this from the .hex file: 4672616EC3A7616973

This shows that the compiler interprets the ç as byte values 0xC3 0xA7 rather than 0xE7. This makes no sense to me, as these values have nothing to do with ç.

EThom.3
Senior II

Update: I find this interesting.

If I copy the text (on screen) from Cube into a hex editor, I get what I expected: 00000000h: 46 72 61 6E E7 61 69 73 ; Français

However, if I open the .c file itself in the hex editor, I get this: 0000021ah: 46 72 61 6E C3 A7 61 69 73 ; FranÃ§ais

In my opinion, this shifts the blame from GCC to Cube. The file isn't saved as UTF-8.

 

EThom.3
Senior II

Oh ***...

As it turns out, I didn't know enough about UTF-8 and simply relied on the tables I found. Digging a little deeper, however, turns up more complexity.

But that's perfectly fine. I already have a function that converts UTF-8 to the rather odd codepage of the display; I will just need to extend it to handle a few more conversions.

gbm
Principal

C3 A7 is the correct UTF-8 encoding of code point U+00E7. I cannot see a problem here.


The problem was that I was a ***.

I should have used ISO 8859-1 encoding rather than UTF-8.

Well, I learned something today. That can't be half bad.

 

P.S. Interestingly, the website won't let me speak badly about myself...