Friday, February 11, 2011

Wide-character API of glibc

I've had to write an application that deals with text in virtually any language (at least, any Latin-based one). The source of the text is in files encoded in UTF-8, so the question was: "how do I deal with this new type of text data?".


Glibc (and other C libraries) provide the wide-character (wchar_t) data type, which is able to represent any character by using a fixed-width representation; on my platform it is 32 bits wide.


There are functions for manipulating text in wide-char format, and most of them are equivalents of the classic 8-bit string functions: fwprintf, swprintf, wcslen, fputwc, fgetwc, etc...


However, there are some tricks that must be taken into account. There is an "L" prefix used to write wide-char string constants, so the statement:


char   mystring[] = "hello";


in the wide-char form must be expressed as:


wchar_t  mystring[] = L"hello";


Note that the "format" argument used by all the wprintf functions is also a wide-char string, so it needs to be written with the "L" prefix as well.


Once the data is in wchar_t format, all operations are similar to their classic equivalents; the point now is how to convert to and from the classic string format. To convert to a wchar_t character, which has more bits than the standard 8-bit representation, some kind of encoding must be used: for example, UTF-8 and UTF-16 are methods of representing wide characters as sequences of 8-bit octets. It is important to note that the encoding used is independent of the code or the functions used in the code. All the code needs is a pair of functions to convert byte sequences to wide chars and back. The particular encoding is not considered here; it depends on the operating system's resources and their configuration, usually in the internationalization settings.

For example, the mbsrtowcs and wcsrtombs functions convert from multibyte sequences to wide-char strings, and from wide-char strings to multibyte sequences, respectively.

The setlocale function must be used to select the particular internationalization setting used to convert between wide-char strings and multibyte sequences, for example:


setlocale(LC_CTYPE, "en_US.UTF-8");


This setting is used by glibc to perform the conversion in the mbs/wcs functions. Note that the proper conversion files are required. On Linux, these are under the /usr/lib/locale directory, organized by locale name. In the example above, the files searched would be, in order of precedence:


/usr/lib/locale/en_US.UTF-8/LC_CTYPE
/usr/lib/locale/en_US.utf8/LC_CTYPE
/usr/lib/locale/en_US/LC_CTYPE


The files must be generated with the 'localedef' utility, provided by the glibc installation:


localedef -i en_US -f UTF-8 /tmp


generates the internationalization files for the locale in the example under the /tmp directory. Embedded-system developers should note that the LC_CTYPE file is 256 KB, which is too much for some systems. The problem is that, without these files, the mbs/wcs and fputwc/fgetwc functions simply don't work, failing with an error indicating that an invalid byte sequence cannot be converted.


In this case, one has to write one's own conversion functions, at the price of losing some portability. For example, it is cheaper in terms of disk space to write our own UTF-8 to/from wide-char conversion functions, but then we are limited to that one encoding.



