Unicode: from text and font to display

I’ve always wanted to support all kinds of text in my Symbian bible. The problem is that Unicode support on Symbian is very limited, and for some languages non-existent. A few weeks ago someone asked me whether I could make Symbian bible support the Malayalam bible (using Malayalam script, of course). The Malayalam writing system was only recently supported by the major operating systems (such as Windows XP and OS X Leopard), and Symbian doesn’t support it yet (and even if it did, the support would only be enabled on phones sold in certain regions).

I know that there are several Unicode layout libraries, such as Pango, Qt, or ICU. All of them are too big for me to understand and port to Symbian, so my solution was to use the FreeType library together with HarfBuzz for font shaping. After using those two libraries on Symbian, I think I now understand Unicode. Here is my quick guide to understanding Unicode, from having some data and a font to displaying the data with that font.

The first thing that you need to understand is: Unicode defines a list of characters for many languages of the world.

Some languages don’t have the concept of lower or upper case. Some languages have their own numbering system. Every language has its own rules for ordering things. So whenever possible, use the functions provided by the library you are using for case conversion, number parsing, and sorting instead of writing your own.
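To make this concrete, here is a small sketch in Python (my choice for illustration; the post itself is not tied to any language). It only uses built-in string functions, and the sample strings are my own picks, not anything from the Symbian program.

```python
# Case conversion and digit parsing through the standard library,
# rather than hand-written ASCII arithmetic.

german = "straße"
malayalam_letter = "\u0D2E"         # MALAYALAM LETTER MA, from a script with no case
malayalam_twelve = "\u0D67\u0D68"   # MALAYALAM DIGIT ONE, MALAYALAM DIGIT TWO

upper_german = german.upper()                # one letter turns into two
upper_malayalam = malayalam_letter.upper()   # nothing to change
parsed = int(malayalam_twelve)               # int() accepts Unicode decimal digits

print(upper_german, len(german), len(upper_german))
print(upper_malayalam == malayalam_letter)
print(parsed)
```

Note that upper-casing can change the length of a string, which is one reason a hand-written loop over single characters tends to go wrong.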

The second thing to note is that there are over 100 thousand characters defined in Unicode. We need to store them efficiently.

Ideally, to represent the data in memory, you would use a 4-byte representation, because 1 or 2 bytes are not enough to hold all of the 100 thousand characters. In practice, most people will not use all 100 thousand characters in one document or application, and storing everything in 4 bytes wastes a lot of space. To solve that (and to add another problem), different Unicode encodings were created. There are several of them: 7-bit (UTF-7), 8-bit (UTF-8), and 16-bit (UTF-16) in Little Endian and Big Endian variants.
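A quick way to see the space trade-off is to encode the same text in several encodings and count the bytes. This is only a sketch in Python with a sample string I made up; the codec names are the ones Python's standard library uses.

```python
# Three characters: Latin A, MALAYALAM LETTER MA, and an emoji above U+FFFF.
text = "A\u0D2E\U0001F600"

sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le")
}

# Little Endian and Big Endian differ only in the byte order of each unit.
a_little = "A".encode("utf-16-le")
a_big = "A".encode("utf-16-be")

for name, size in sizes.items():
    print(name, size)
print(a_little, a_big)
```

The 4-byte encoding always spends 4 bytes per character, while the others spend fewer bytes on the common characters and more on the rare ones.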

Encoding a Unicode character in fewer than 32 bits works by using escape codes inside the string. For example, with a 16-bit encoding we can only store 65536 values per unit. If the character number to be stored is less than some number (say 60000), we just store the number; otherwise we write an escape code that says “for the next character, add 60000 to it” and then write “the number minus 60000”. This is a simplification: the numbering is not linear (it is divided into blocks and planes), and in real UTF-16 a character above 65535 is stored as a pair of special 16-bit values called a surrogate pair.
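For the curious, the real UTF-16 rule is short enough to write down. The function below is a sketch (`to_utf16_units` is a name I made up) that turns one character number into one or two 16-bit units, and the last lines compare it with Python's own encoder.

```python
def to_utf16_units(code_point):
    """Return the 16-bit units UTF-16 uses for one code point.

    Sketch only: it does not reject the reserved range 0xD800-0xDFFF
    or values above 0x10FFFF.
    """
    if code_point < 0x10000:
        return [code_point]                # fits in a single unit
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


units = to_utf16_units(0x1F600)
print([hex(u) for u in units])

# Cross-check against the built-in encoder (big endian, two bytes per unit).
expected = "\U0001F600".encode("utf-16-be")
packed = b"".join(u.to_bytes(2, "big") for u in units)
print(packed == expected)
```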

The third thing to understand is: Unicode encodings create a new problem.

As you can see from the previous explanation, if we have a string of 4 x 16 bits, the number of characters in it may not be 4; it could be 2, 3, or 4, depending on the data. You need to use special functions for string manipulation such as accessing the character at index x or finding the length of a string. Different operating systems use different encodings for Unicode, so you need to know which encoding you are dealing with.
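Here is that 4 x 16 bits case as a small Python sketch. The three sample strings are my own; `utf16_units` is a helper name I made up. Python strings count characters, so the number of 16-bit units has to be computed from the encoded bytes.

```python
def utf16_units(text):
    """Number of 16-bit units the text needs in UTF-16."""
    return len(text.encode("utf-16-le")) // 2


samples = [
    "ABCD",                      # four ordinary characters
    "A\U0001F600B",              # one character needs two units
    "\U0001F600\U0001F601",      # both characters need two units
]

counts = [(utf16_units(s), len(s)) for s in samples]
for units, chars in counts:
    print(units, "units,", chars, "characters")
```

All three strings occupy the same four 16-bit units, yet they hold 4, 3, and 2 characters.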

Another thing to remember is conversion between older encodings and Unicode. Before Unicode, many countries and languages had their own encodings, so character number 65 may not always be an “A”. You may need a library to do this kind of conversion.
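As an illustration, the sketch below decodes the same single byte with three older encodings using Python's built-in codecs. I picked byte 228 (0xE4) rather than 65 because it differs between encodings that are still widely seen; the choice of encodings is mine.

```python
raw = bytes([0xE4])

decoded = {
    codec: raw.decode(codec)
    for codec in ("latin-1", "cp1251", "iso8859_7")
}

for codec, char in decoded.items():
    print(codec, char, hex(ord(char)))

# Going back the other way needs the same knowledge of the target encoding.
round_trip = "\u0434".encode("cp1251")
print(round_trip == raw)
```

The byte never changes; only the table used to interpret it does, which is why the conversion library has to be told which old encoding the data came from.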

The fourth thing to understand is: a Unicode string is only a list of characters in logical order.

You can’t just paint each character and hope that the result will be fine. If you have the character “A” as the first character and COMBINING GRAVE ACCENT as the second, on display you will see only one “character” (actually one grapheme). Several algorithms also need to be aware of this logical ordering: for example, when you reverse a string, some character combinations must be kept intact.
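The reversing problem can be shown with a few lines of Python. This is a deliberately simplified sketch: `clusters` and `reverse_keeping_marks` are names I made up, and grouping by `unicodedata.combining` only catches marks with a non-zero combining class. Real grapheme segmentation follows Unicode's text segmentation rules, and many Indic vowel signs would not be handled by this shortcut.

```python
import unicodedata


def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch        # attach the mark to the previous base
        else:
            groups.append(ch)       # start a new group
    return groups


def reverse_keeping_marks(text):
    return "".join(reversed(clusters(text)))


word = "A\u0300B"                        # A, COMBINING GRAVE ACCENT, B
naive = word[::-1]                       # the accent ends up on the B
careful = reverse_keeping_marks(word)    # the accent stays on the A

# Some combinations also exist as a single precomposed character.
composed = unicodedata.normalize("NFC", "A\u0300")

print(len(word), len(clusters(word)), len(composed))
```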

The fifth thing to understand is: a font contains a list of glyphs representing characters.

Currently, no font has glyphs for all Unicode characters. A font library lets us load glyphs from a font and check whether the font contains the glyphs we need to display a character. Some fonts also contain predefined shapes for character combinations. Usually, the font library cannot tell us that a glyph represents a combination of other glyphs.

The sixth thing to understand is: you need a font shaping library to draw Unicode strings nicely.

A font shaper such as HarfBuzz will try to split a string into syllables and find the best representation of each syllable, either by looking up a glyph that already combines several characters or by placing several glyphs in the correct positions.
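To give a feel for the "look up a combined glyph" half of that job, here is a toy model in Python. It is not how HarfBuzz or FreeType work internally and uses none of their APIs: the two tables, the glyph names, and the `shape` function are all invented for illustration, and glyph positioning is left out entirely.

```python
# Toy font: both tables are invented for this example.
CMAP = {"f": "f", "i": "i", "n": "n"}    # character -> glyph name
LIGATURES = {("f", "i"): "f_i"}          # glyph pair -> one combined glyph


def shape(text):
    """Map characters to glyphs, then merge pairs the font has a glyph for."""
    # Step 1: the font-library part. Missing characters get a placeholder.
    glyphs = [CMAP.get(ch, ".notdef") for ch in text]

    # Step 2: the shaper part. Replace known pairs with the combined glyph.
    shaped = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            shaped.append(LIGATURES[pair])
            i += 2
        else:
            shaped.append(glyphs[i])
            i += 1
    return shaped


print(shape("fin"))
print(shape("fax"))
```

Three characters can come out as two glyphs, and characters the font does not cover come out as a placeholder, which is roughly what you see as empty boxes on screen.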

 

Editing Unicode text is more difficult, and I still don’t understand it well enough (my program doesn’t need it for now).

PS: the Malayalam version of Symbian bible is not released yet. I was just happy that I could display the characters correctly (matching the reference PDF), so I wanted to write this.

One thought on “Unicode: from text and font to display”

  1. Hi,

    Glad you found HarfBuzz useful. We’re working on a next-generation version. That won’t be available for quite a while, but when it is, it will drastically simplify the usage.

    Cheers,

    behdad
