Anyone used fgets() to read from text files in UTF-8 format?

TeaRDoWN

To support more language-specific characters, I converted my text files from ASCII to UTF-8, but now fgets() returns odd garbage characters at the beginning of the string:

Instead of: "Hello world." I get: "ï»¿Hello world."

When I open the file in a text editor (Notepad, for example), it looks exactly the same as the old ASCII version.

Are there three "garbage characters" at the beginning of the file that I need to skip before the "real characters" that I want to read?

LennyLen

TeaRDoWN said:

Are there three "garbage characters" at the beginning of the file that I need to skip before the "real characters" that I want to read?

I don't know the answer to that, but it doesn't matter. Don't use any of the stdio functions with Unicode. Use one of the Unicode-based functions instead.

Even if you ignore the first three bytes, you're bound to get more garbage the moment your program encounters any other character that fgets() can't handle.

Mika Halttunen

TeaRDoWN said:

Are there three "garbage characters" at the beginning of the file that I need to skip before the "real characters" that I want to read?

Those "garbage characters" are actually the UTF-8 byte-order mark, which indicates that the text is encoded in UTF-8. See UTF-8 BOM for more info. You can ignore them, but as LennyLen said, you shouldn't be using the stdio functions here.
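
If you do decide to ignore the mark, a minimal sketch could look like the following. It assumes the line was read into an ordinary char buffer (for example by fgets()); skip_utf8_bom is a hypothetical helper name, not part of any library.

```c
#include <string.h>

/* Returns a pointer just past a leading UTF-8 byte-order mark
   (bytes EF BB BF), or the original pointer if the string does
   not start with one. Hypothetical helper for illustration. */
const char *skip_utf8_bom(const char *line)
{
    static const char bom[] = "\xEF\xBB\xBF";

    /* strncmp stops at the terminating NUL, so strings shorter
       than three bytes are handled safely. */
    if (strncmp(line, bom, 3) == 0)
        return line + 3;
    return line;
}
```

Only the first line of a file can carry the mark, so the helper would be applied once, to the buffer filled by the first fgets() call.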

CGamesPlay

I'd like to point out that fgets works fine with UTF-8; it's the function you're using to display the string that can't handle it. That Unicode byte order mark is a non-printable character, but it does belong in the string.

Peter Wang

Exactly. There's nothing wrong with using stdio functions for UTF-8 -- that's the point!

Matthew Leverton

Those three junk bytes (EF BB BF) are the UTF-8 encoding of a zero-width no-break space character (U+FEFF). It was probably inserted by some text editor that uses it for UTF-8 detection. You could safely remove it from the text file (though your editor may put it back in), but you shouldn't skip over those bytes if they are present.

As the others are saying, the standard input functions work fine with UTF-8. You just need to use UTF-8 aware output functions to display the high characters (greater than 0x7f).
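
To illustrate what "UTF-8 aware" means here, this is a rough sketch that counts characters rather than bytes. It assumes the input is valid UTF-8, and utf8_length is a hypothetical name. Every byte above 0x7f is part of a multibyte sequence, and the continuation bytes of such a sequence always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Counts Unicode code points in a NUL-terminated UTF-8 string by
   counting every byte that is NOT a continuation byte (10xxxxxx).
   Assumes valid UTF-8 input. Hypothetical helper for illustration. */
size_t utf8_length(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

A display routine has to group bytes the same way before it can draw a glyph; one that treats each byte as its own character produces output like the garbage in the first post.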

CGamesPlay

To clarify, UTF-16 and UTF-32 both use multibyte units for all characters. For code points below 256, the extra bytes are zero, so the character 'a' ends up stored as '\0', 'a' (in big-endian UTF-16). Because of that, you can't use the standard input functions to read UTF-16- or UTF-32-encoded data as C strings, because C string handling treats the first NUL byte as the end of the string.

UTF-8 was devised specifically so that a NUL byte never appears in the byte stream except to represent the NUL character itself. It uses a variable-length encoding that represents the ASCII characters as themselves ('a' is stored as 'a').
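
The variable-length scheme can be sketched as a small encoder. This is an illustration only: utf8_encode is a hypothetical name, and the code assumes the caller passes a valid code point (at most 0x10FFFF, not a surrogate) and a buffer of at least four bytes.

```c
#include <stddef.h>

/* Encodes one code point as UTF-8 into out and returns the number
   of bytes written (1 to 4). Assumes cp <= 0x10FFFF and that out
   has room for 4 bytes. Hypothetical helper for illustration. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {            /* ASCII: stored as itself */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {           /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Every byte of a multibyte sequence has its high bit set, so a zero byte can only come from code point 0. Encoding U+FEFF with this function gives EF BB BF, the three bytes from the first post.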

Thread #599849. Printed from Allegro.cc