UTF-8 characters in text strings

EThom.3
Senior II

I am writing firmware for a device with an alphanumeric display. The display language must be changeable. For most languages, this means using characters outside the A-Z range. UTF-8 covers most of the necessary characters.

In the project properties, I found this setting and naively assumed that the UTF-8 selection also applied to the compiler:

[Screenshot: EThom3_1-1757993792408.png]

Well, obviously not, as it is GCC that needs to know that the character set is to be UTF-8. The code

sprintf(pString, "Français");

is translated to

sprintf(pString, "FranÃ§ais");

 (It might not be super-visible, but the ç in Français is a c with a cedilla, which has the UTF-8 code 0xE7.)

I tried using the flags -finput-charset=utf-8 and -fexec-charset=utf-8 in an attempt to tweak GCC to use UTF-8 for strings and characters, but that didn't make a difference. Also, I have a hunch that these flags only work for C++, but I'm not sure.
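For reference, both flags are accepted for plain C as well as C++, and both already default to UTF-8, which would explain why adding them changed nothing. What would change the stored bytes is a non-default execution charset, provided the toolchain was built with iconv support; the toolchain name below is only an example:

```shell
# Both charset flags default to UTF-8, so spelling them out stores the
# same bytes as before.  Choosing a Latin-1 execution charset instead
# makes GCC store the single byte 0xE7 for ç (requires a toolchain
# built with iconv support).
arm-none-eabi-gcc -finput-charset=UTF-8 -fexec-charset=ISO-8859-1 -c main.c
```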

Is what I'm trying to do even possible?

I do have the option to replace ç with \xE7, but with loads of text for several languages, and numerous "special" characters, this becomes rather cumbersome. I have enough trouble with the Polish characters as it is...


6 REPLIES
gbm
Principal

CubeIDE and GCC use UTF-8 by default. I guess that some other piece of software you use for editing your files doesn't recognize your source text as UTF-8 and uses some other extended ASCII encoding, so your Unicode non-ASCII chars are displayed as sequences of 2 or more characters.

My STM32 stuff on github - compact USB device stack and more: https://github.com/gbm-ii/gbmUSBdevice

EThom.3
Senior II

I don't use any other software for editing. STM32CubeIDE only.

I extracted this from the .hex file: 4672616EC3A7616973

This shows that the compiler interprets the ç as byte values 0xC3 0xA7 rather than 0xE7. This makes no sense to me, as these values have nothing to do with ç.

EThom.3
Senior II

Update: I find this interesting.

If I copy the text (on screen) from Cube into a hex editor, I get what I expected: 00000000h: 46 72 61 6E E7 61 69 73 ; Français

However, if I open the .c file itself in the hex editor, I get this: 0000021ah: 46 72 61 6E C3 A7 61 69 73 ; FranÃ§ais

In my opinion, this shifts the blame from GCC to Cube. The file isn't saved as UTF-8.

 

EThom.3
Senior II

Oh ***...

As it turns out, I didn't know enough about UTF-8 and simply relied on the tables I found. Digging a little deeper, however, turns up more complexity.

But that's perfectly fine. I already have a function that converts UTF-8 to the rather odd codepage of the display; I will just need to extend it to handle a few more conversions.

gbm
Principal

C3 A7 is the correct UTF-8 encoding of code point U+00E7. I cannot see a problem here.


The problem was that I was a ***.

I should have used ISO 8859-1 encoding rather than UTF-8.

Well, I learned something today. That can't be half bad.

 

P.S. Interestingly, the website won't let me speak badly about myself...