Friday, February 11, 2011

Wide-character API of glibc

I've had to write an application that deals with text in virtually any language, at least, any latin-based language. The source of the text is in files that are encoded in UTF-8, so the question was "how do I deal with this new type of text data?".


Glibc (and others) provide the wide char (wchar_t) data type, which is able to represent any character by using a fixed-width representation of a character, in my platform, 32 bits wide.


There are functions used to manipulate text data in wide char format, and most of them are equivalent to the classic 8-bit string functions: fwprintf, swprintf, wcslen, fputwc, fgetwc, etc...


However, there some tricks that must be taken into account. There is a new operator "L" used to generate wide-char text constants, so the statement:


char   mystring[] = "hello";


in the wide-char form must be expresed as:


wchar_t  mystring[] = L"hello";


Note that the "format" argument used in all wprintf functions is also a wide-char text, so it needs to be converted to wide-char text using the "L" operator as well.


Once the data is in wchar format, all operations are similar to the classic equivalents, but now the point is how to convert to/from the classic string format. Well, to be able to convert to a wchar character which has more bits than the standard 8-bit representation, some kind of encoding must be used. For example, UTF-8 or UTF-16 are methods to represent wide chars using 8-bit octets. It is important to note that the type of encoding used is independent of the code or the functions used in the code. All the code needs are a couple of functions to convert to/from byte sequences from/to wide chars. The particular type of encoding is not considered here and depends on the operating system's resources and their configuration, usually in the internationalization settings.

For example, the mbsrtowcs and wcsrtombs functions are used to convert from multibyte sequences to wide char strings, and from wide char strings to multibyte sequences, respectively. 

The setlocale function must be used to select a particular internationalizaton setting used to convert between wide-char and multibyte sequence strings, for example:


setlocale(LC_CTYPE, "en_US.UTF-8");


This setting is used by glibc to do the conversion in the mbs/wc functions. Note that the proper conversion files are required. In Linux, these are under the /usr/lib/locale directory, organized by locale name. In the example above, the files searched would be, in precedence order:


/usr/lib/locale/en_US.UTF-8/LC_CTYPE
/usr/lib/locale/en_US.utf8/LC_CTYPE
/usr/lib/locale/en_US/LC_CTYPE


The file must be generated with the 'localedef' utility, provded by the glibc installation:


localedef -i en_US -f UTF-8 /tmp


generates the internationalization files for the locale in the exmaple under the /tmp directory. Embedded system developers should note that the LC_CTYPE file is 256 KB big!, which is too much for some systems. The problem is that, without these files, the mbs/wc and fput/fget functions simply don't work, indicating that "an invalid byte sequence cannot be converted".


In this case, one has to write his own conversion functions at the price of losing some portability of the code. For example, it is cheaper in terms of disk space, to write our own UTF-8 to/from wide-char conversion functions, but we are limited to this type of encoding.




Generation of simple PDF files

The goal is to produce a very simple PDF file that contains unformatted text without using very heavy resources, such as PDF converters, graphics libraries, etc, and match the limited resources available on an embedded platform.

PDF generation in this simple way is quite straightforward. The Adobe's PDF specification provides a couple of examples of simple PDF files that can be used as a template for a very simple PDF rendering library.

Basically, a PDF file is a collection of objects linked from a global object table (xref) that lists the offsets where each object can be located in the file. Some objects have references to other objects as well.


Problems come when the text to be converted to PDF contains special characters, such as those used in the eastern european ones. The PDF specification states that all characters in the text must have 8 bits only, and PDF doesn't know anything about UTF nor Unicode encodings. For each 8-bit character in the text, a reader will fetch the corresponding glyph from the font file selected in the font-type PDF object. The fetch is done using the 8-bit value of the character and taking into account the encoding of the font. Only two types of font encodings are supported : WinAnsiEncoding and MacExpertEncoding, which are used in Windows and MAC, respectively. This encoding is defined in a "Font"  PDF object, or in an "Encoding" object referenced by a "Font" object. However, the windows font encoding is always fixed to the CP-1252 code table. Unfortunately, this encoding only covers some non-ASCII characters, but most of them are not. For example, most of the chars used in eastern european languages (polish, hungarian, etc) are not defined in CP-1252, but in CP-1250 code page.

With these limitations, it is not possible to use non-CP1252 chars in a text stream of a PDF file. There is, however, a workaround that may work in some cases. PDF provides the so-called "font encoding differences", which is an optional entry of the font encoding dictionary that allows the writer to map a given set of character codes in the text to a set of glyphs in the selected font. Glyphs are defined by a standard name, and a list of official glyph names can be retrieved from the Adobe Glyph name list. However, not all glyphs are available in all fonts, of course.


This is an example of this mapping:

10 0 obj
<
/Subtype /Type1
/Name /F1
/BaseFont /MyriadPro
/Encoding 11 0 R
>>
endobj
11 0 obj
<
/BaseEncoding /WinAnsiEncoding
/Differences [ 156 /sacute 230 /cacute
241 /nacute 179 /lslash
]
>>
endobj


In this example, there are two objects: a Font object (10) and an Encoding object (11). The font object uses the Encoding object to define the encoding of the font, which in turn includes the encoding differences. In this example, char code 156 is mapped to the "ś" character, code 230 is mapped to "ć", code 241 is mapped to "ń" and code 179 is mapped to "ł", which are some of the characters of the polish alphabet, and the codes are taken from the CP-1250 code table, which is NOT supported by PDF.

Note that the font type used (MyriadPro in this case) must contain the glyphs used in the Encoding object: sacute, cacute, nacute and lslash. Otherwise, the .notdef glyph would be displayed by the reader. For exmaple, the basic "Courier", "Arial" and "Helvetica" fonts I tried, did NOT contain these glyphs. In particular, I couldn't find any standard proportional font containing these glyphs.


So, for this method to work, it requires that the operating system where the PDF reader is running, provides the font defined in the PDF file, which is a non-standard font. A better solution would be embedding these fonts in the PDF file, but this is not a easy task. 
Unfortunately, this seems to be only safe solution, that is guaranteed to work in any system or environment.


The next step is to understand and learn how to embed font programs in PDF files, in an easy way.

USB File Storage Gadget

Today I have enabled the USB gadget support for file storage. The intention is to be able to export files via the USB device interface to a PC.

The file storage gadget must be enabled at the kernel config menu:
USB support -> Support for USB gadgets -> File-backed storage gadget

Note that only one USB gadget may be enabled at the same time. If multiple gadgets must be supported, all of them must be configured as modules, so I had to remove built-in support for ethernet gadget from the kernel. Switching the USB function requires removing and installing the proper modules.

The module for file storage is g_file_storage.o and is installed this way:
insmod g_file_storage.o file=/results.bin stall=0

The 'stall' argument is necessary for the USB disk to be properly detected by windows. Linux does not require this argument and the drive can be mounted without problem. If 'stall' is not set to zero and the gadget is connected to a windows PC, the following messages appear:
g_file_storage pxa2xx_udc: full speed config #1
udc: pxa2xx_ep_disable, ep1in-bulk not enabled
udc: pxa2xx_ep_disable, ep2out-bulk not enabled
udc: USB reset
udc: USB reset

repeating every few seconds.

For the 'stall' option to be available, it is necessary to enable the 'file-backed storage gadget in test mode' option in the kernel configuration.

Multiple files may be specified when the gadget module is installed, thus creating multiple drives visible to the remote host.


Any volume size may be created but it seems that Windows assigns floppy drive letters if the volume size is similar to a floppy device size. I have tested 720KB and 1440KB only.

The volume may be declared as read-only by using the "ro=1" parameter at the insmod.

The backend file may be either a disk partition or a file image.
An initial filesystem image can be created this way:

# dd if=/dev/zero of=results.bin bs=512 count=2880
# mkdosfs results.bin

Then, loop-mount the image file and populate the filsystem. Here is where problems came: if a process writes a new file to the loop filesystem, the host side of the USB connection (where the file browser runs) does not see the new file, even if the file browser is refreshed. The only workaround is to unplug/plug again the USB cable. This happens even if a 'sync' command is run on the tester device.



Also, some inconsistences happen if the USB host side writes to the device. The device doesn't see the new files, and vice-versa.


In conclusion, it is quite a good method to export only file from a Linux device, but with some limitations on "live" filesystems.