Friday, February 11, 2011

Generation of simple PDF files

The goal is to produce a very simple PDF file containing unformatted text without relying on heavy resources such as PDF converters or graphics libraries, so that it fits within the limited resources available on an embedded platform.

Generating a PDF in this simple way is quite straightforward. Adobe's PDF specification provides a couple of examples of simple PDF files that can be used as templates for a very simple PDF rendering library.

Basically, a PDF file is a collection of objects indexed by a global cross-reference table (xref) that lists the byte offset at which each object is located in the file. Some objects hold references to other objects as well.
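The object-plus-xref layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete writer: the function name build_pdf, the page size, the text position and the Helvetica font are all assumptions made for the example, and it handles plain ASCII text only.

```python
def build_pdf(text: str) -> bytes:
    """Build a one-page PDF showing `text` in Helvetica (ASCII only)."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = f"BT /F1 12 Tf 72 720 Td ({escaped}) Tj ET".encode("ascii")

    # Object bodies, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # every entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result of build_pdf("Hello") to a file in binary mode should give a document that a PDF reader can open. The only bookkeeping needed is the running byte count, which is what makes this approach attractive on a small platform.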


Problems arise when the text to be converted to PDF contains special characters, such as those used in Eastern European languages. With the simple fonts used here, every character in the text is a single 8-bit code, and the text string itself carries no UTF or Unicode information. For each 8-bit character in the text, a reader fetches the corresponding glyph from the font selected in the Font object, using the 8-bit value of the character and taking into account the encoding of the font. Only a few predefined encodings can be named, notably WinAnsiEncoding and MacRomanEncoding, which correspond to Windows and Mac OS, respectively (MacExpertEncoding also exists, but it is meant for "expert" fonts with special glyph sets, not for ordinary Mac text). The encoding is specified in a "Font" PDF object, either by name or through an "Encoding" object referenced by the "Font" object. However, WinAnsiEncoding is fixed to the CP-1252 code table. Unfortunately, this code page covers only some non-ASCII characters, and many others are missing. For example, most of the characters used in Eastern European languages (Polish, Hungarian, etc.) are not defined in CP-1252 but in the CP-1250 code page.
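A quick way to see the gap between the two code pages is to try encoding a sample text with Python's built-in cp1252 codec. This is only a sketch for checking input text on a development machine; the function name unencodable_in_cp1252 is made up for the example.

```python
def unencodable_in_cp1252(text: str) -> list:
    """Return the distinct characters of `text` that CP-1252 cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode("cp1252")
        except UnicodeEncodeError:
            # Keep the first occurrence only, in order of appearance.
            if ch not in missing:
                missing.append(ch)
    return missing


# A Polish pangram: only "ó" exists in CP-1252, the other accented
# letters are defined in CP-1250 but not in CP-1252.
print(unencodable_in_cp1252("Zażółć gęślą jaźń"))
# prints ['ż', 'ł', 'ć', 'ę', 'ś', 'ą', 'ź', 'ń']
```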

With these limitations, it is not possible to use non-CP1252 characters directly in a text stream of a PDF file. There is, however, a workaround that may work in some cases. PDF provides the so-called "font encoding differences" (the Differences array), an optional entry of the font encoding dictionary that allows the writer to map a given set of character codes in the text to a set of glyphs in the selected font. Glyphs are identified by a standard name, and the list of official glyph names can be retrieved from the Adobe Glyph List. However, not all glyphs are available in all fonts, of course.


This is an example of this mapping:

10 0 obj
<<
/Type /Font
/Subtype /Type1
/Name /F1
/BaseFont /MyriadPro
/Encoding 11 0 R
>>
endobj
11 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [ 156 /sacute 230 /cacute
241 /nacute 179 /lslash
]
>>
endobj


In this example, there are two objects: a Font object (10) and an Encoding object (11). The Font object references the Encoding object to define the encoding of the font, and the Encoding object in turn contains the encoding differences. Here, character code 156 is mapped to "ś", code 230 to "ć", code 241 to "ń" and code 179 to "ł", which are some of the letters of the Polish alphabet. The codes are taken from the CP-1250 code table, which is NOT directly supported by PDF.
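The mapping in the example above can also be generated rather than typed by hand. The sketch below assumes a small hand-made table, GLYPH_NAMES, holding only the four glyph names from the example (a real table would be filled in from the Adobe Glyph List), and the helper names differences_array and pdf_literal_string are invented for illustration. Characters missing from the table are not handled.

```python
# Glyph names for the four letters used in the example above; this table
# is an assumption of the sketch and would need extending for real text.
GLYPH_NAMES = {"ś": "sacute", "ć": "cacute", "ń": "nacute", "ł": "lslash"}


def differences_array(chars) -> str:
    """Format a /Differences entry mapping CP-1250 codes to glyph names."""
    entries = sorted((ch.encode("cp1250")[0], GLYPH_NAMES[ch]) for ch in chars)
    body = " ".join(f"{code} /{name}" for code, name in entries)
    return f"/Differences [ {body} ]"


def pdf_literal_string(text: str) -> bytes:
    """Encode `text` as CP-1250 bytes wrapped in a PDF literal string."""
    raw = text.encode("cp1250")
    # Backslash and parentheses must be escaped inside a literal string.
    raw = raw.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
    return b"(" + raw + b")"


print(differences_array("śćńł"))
# prints /Differences [ 156 /sacute 179 /lslash 230 /cacute 241 /nacute ]
```

The text in the content stream is then written with the same CP-1250 bytes, for instance pdf_literal_string("łoś") followed by the Tj operator, so that the codes in the stream match the codes listed in the Differences array.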

Note that the font used (MyriadPro in this case) must contain the glyphs named in the Encoding object: sacute, cacute, nacute and lslash. Otherwise, the reader displays the .notdef glyph instead. For example, the basic "Courier", "Arial" and "Helvetica" fonts I tried did NOT contain these glyphs. In particular, I couldn't find any standard proportional font containing them.


So, for this method to work, the operating system where the PDF reader is running must provide the font named in the PDF file, which is a non-standard font. A better solution would be to embed the font in the PDF file, but this is not an easy task.
Unfortunately, this seems to be the only safe solution, the only one guaranteed to work on any system or environment.


The next step is to learn how to embed font programs in PDF files in an easy way.
