Aug 17, 2014

Oh Unicode, Why Are You So Frustrating?

I do, you know, find Unicode to be truly frustrating. For a long time I've wanted to write a long, loud flame about everything I dislike about Unicode. But, I can't. You see, every time I run across something that frustrates me about Unicode I make the mistake of researching it. If you want to really hate something, you have to completely lack the intellectual ability, and honesty, to research that hateful thing and try to understand it. If you do the research, you just might find out that there are good reasons for what you hate and that you could not have done better. Jeez, that is frustrating.

I wish the folks who defined Unicode had realized that they were not going to get away with a 16-bit code. But, when they started, 23 years ago, 16 bits certainly seemed like enough. Now there is just too much old code, including a bunch of compilers, that has the built-in belief that a character is 16 bits. Unicode now requires a 21-bit code space. I wish the folks in charge of Unicode would just admit that it is going to be a 32-bit code some day so that the rest of us would be forced to design for a 32-bit character set. Could I have done better? Could I have avoided their mistakes? Hell no. I was perfectly happy with seven-bit US-ASCII. Back in the late '80s I was vaguely aware that Chinese used a lot of characters, enough that I was suspicious that they would not all fit in 16 bits, but I really didn't care. I'm pretty sure I would have focused on the languages I was most familiar with and ignored everything else.

Unicode has forced me to understand that I was (am?) a language bigot. Nobody really wants to come face to face with their character (sorry about that) flaws. I am a 12th generation American and I speak American English. In school I studied German, French, and Spanish. The only one of any use to me is Spanish. I once met a guy from Quebec who refused to speak English, but he sure understood it. I have never met anyone from Germany or France who did not speak English. But, living in Texas and the US Southwest, knowing a lot more Spanish than I do would be useful. I have met and worked with people from dozens of countries, but they all speak American English. Only folks from the UK seem to make a big deal about UK English, and they don't seem to care if they can be understood. Wow, that sure sounded like it was written by a language bigot, didn't it? Or maybe it just sounds like it was written by a typical monolingual American.

Unicode has taught me more about how the majority of humans communicate than anything else. It has forced me to face all those other scripts. It has forced me to think about characters in a completely new way. Until I tried to grok Unicode I had no idea that most scripts do not have upper and lower case letters. I did not know that some scripts have three cases: lower case, upper case, and title case. I always believed that every lower case letter has one, and only one, upper case equivalent. I did not know that there is not always a one-to-one relationship between upper and lower case characters. The old toupper() and tolower() functions I know and love are now pretty much meaningless in an international, multilingual environment. So if you want to ensure that “are”, “Are”, and “ARE” all test equal, you can't just convert them to upper or lower case and do the comparison. You have to do case folding according to a Unicode algorithm and then do the test.
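Here is roughly what that folding step looks like in Python, whose str.casefold() implements Unicode full case folding. Treat it as a sketch, not a complete comparison, since it skips the normalization step discussed further down:

    # Case-insensitive comparison via Unicode case folding.
    # str.casefold() implements full case folding, which a simple
    # toupper()/tolower() style mapping cannot reproduce.
    def fold_equal(a: str, b: str) -> bool:
        return a.casefold() == b.casefold()

    print(fold_equal("are", "ARE"))         # True
    print(fold_equal("Are", "aRe"))         # True
    # One lower case letter can fold to two characters: the German
    # sharp s folds to "ss", so these compare equal as well.
    print(fold_equal("Straße", "STRASSE"))  # True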

I hate modifier characters. (They are also known as combining characters.) The idea that there are two ways to represent the same character is enough to drive one to insanity. You can code “ö” as either a single character or as two characters, an “o” followed by a combining diaeresis (U+0308). Modifier characters of different sorts are used all through Unicode.
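A few lines of Python make the insanity concrete; both strings print as the same glyph but are different code point sequences:

    precomposed = "\u00F6"   # LATIN SMALL LETTER O WITH DIAERESIS
    combining = "o\u0308"    # "o" followed by COMBINING DIAERESIS

    print(precomposed, combining)            # ö ö  (identical on screen)
    print(len(precomposed), len(combining))  # 1 2
    print(precomposed == combining)          # False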
Why does Unicode have modifier characters? Why not just have unique encodings for each of the different combinations of characters with special decorations? It would make life so much easier! And, it is not like they do not have plenty of code points to spread around. Well, Unicode has a policy of being compatible with existing character encodings. That doesn't mean you get the same code points as in the old encoding. (Or, maybe it does! It does at least sometimes.) But, it does mean that you get the same glyphs in the same order so that it is easy to convert existing data to Unicode.

Making existing character sets fit into Unicode with no, or few, changes just makes too much sense. If you want people to use your encoding then make their existing data compliant by default or with a minimal amount of work. That is just too sensible to say it is wrong. Guess what? The old character encodings had modifier characters. Why? I do not know, but I can guess that it is because they were trying to fit them into a one byte character alongside US-ASCII. I mean, US-ASCII already has most of the letters used in European languages, so it was easier to add a few modifier characters than to add all the characters you can make with them. I mean, 8 bits is not much space. I can hate modifier characters all I want, but I can't fault them for leaving them in. And, once they are in, why not use the concept of modifier characters to solve other problems? Never waste a good concept.

How do modifiers make me a grumpy programmer? What do I have to do to compare “Gödel” and “Gödel” and make sure it comes out equal? I mean, one of those strings could be 5 characters long while the other is 6 characters long. (Well, no, they are both 5 characters long, it is just too bad that the representation of one of the characters may be one code point longer in one string than in the other.) So, first you have to convert the strings to one of the canonical forms defined by Unicode, do case folding, and then compare them. Since the strings might change length you may have to allocate fresh memory to store them. So what used to be a very simple operation now requires that two complex algorithms be run, possibly allocating dynamic memory, before you can do the comparison. Oh my...
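For the record, here is the whole dance as a Python sketch, using the standard unicodedata module and NFC, one of the canonical forms Unicode defines:

    import unicodedata

    def canonical_equal(a: str, b: str) -> bool:
        # Normalize to a canonical form, then case fold, then compare.
        # Either step can change the string's length, which is where
        # the fresh memory allocation comes in.
        return (unicodedata.normalize("NFC", a).casefold()
                == unicodedata.normalize("NFC", b).casefold())

    five = "G\u00F6del"   # precomposed ö: 5 code points
    six = "Go\u0308del"   # o plus combining diaeresis: 6 code points
    print(canonical_equal(five, six))  # True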

All that complexity just to do a string comparison. OK, so I had to face the fact that I am a cycle hoarder and a bit hoarder. I grew up back when a megabyte of RAM was called a Moby and a million cycles per second was a damn fast computer. I'm sitting here on a machine with 8 gigabytes of RAM and 8 multibillion-cycle-per-second cores (not to mention the 2 gigs of video RAM and thousands of GPU cores), and I am worried about the cost of doing a test for equality on a Unicode string. When you grow up with scarcity it can be hard to adapt to abundance. I saw that in the parents who grew up during the Great Depression (and a very depressing time it was). My mother made sure we did not tear wrapping paper when we unwrapped a present. She would save it and reuse it. I worry about using a few extra bytes of RAM and the cost of a call to malloc(). I am trying to get over that. Really, I am. Can't fault Unicode for my attitudes.

I wish that Unicode was finished. It isn't finished. It is a moving target. The first release was in October 1991; the most recent release (as I write this) was in June 2014. There have been a total of 25 releases in 23 years. The most recent release added a couple of thousand characters to support more than 20 new scripts. Among other things, after all this time they finally got around to adding the ruble currency sign. (If I were Russian I might feel a little insulted by that.) You would think they could finish this in 23 years, right? Wrong. It takes that long just to get everyone who will benefit from Unicode to hear about it, decide it is worth working on, and finally get around to working on it. It takes that long. It will take a lot longer. Get used to the fact that it may never be done. I hope that the creativity of human beings will force Unicode to add new characters forever.

Each release seems to “slightly” change the format of some files, so everyone who processes the data needs to do some rework. I have never been able to find a formal syntax specification for any of the Unicode data files. If you find one, please let me know. It would not be that hard to create an EBNF for the files.
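To be fair, the core file, UnicodeData.txt, is easy enough to pull apart even without a formal grammar: one code point per line, fifteen fields separated by semicolons. A Python sketch (it ignores the special First/Last range-pair lines, and the file path is just wherever your local copy lives):

    # Minimal parser for UnicodeData.txt. Each line looks like
    # codepoint;name;general_category;... (15 semicolon-separated fields).
    def parse_unicode_data(path):
        table = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split(";")
                table[int(fields[0], 16)] = {
                    "name": fields[1],
                    "category": fields[2],
                }
        return table

    table = parse_unicode_data("UnicodeData.txt")
    print(table[0x00F6])
    # {'name': 'LATIN SMALL LETTER O WITH DIAERESIS', 'category': 'Ll'}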

I do wish they would add Klingon and maybe Elvish. That is very unlikely to happen. If they let one constructed script in, they would have a hard time keeping others out. I can see people creating new languages with new scripts just to get them added to Unicode. People are nasty that way. Unicode does have a huge range of code points set aside for user defined characters. But that doesn't seem to be of much use for document exchange. There needs to be a way to register different uses of that character space.

I hate that I do not fully understand Unicode. There are characters in some scripts that seem to be there only to control how characters are merged while being printed. Ligatures I understand, but the intricacies of character linking in the Arabic script are something I will probably never understand. But, if you want to sort a file containing names in English, Chinese, and Arabic you had better understand how to treat those characters.

During the last 40 years I have learned a lot of cute tricks for dealing with characters in parsers. Pretty much none of them work with Unicode. Think about parsing a number. That is a pretty simple job if you can tell numeric characters from non-numeric characters. In US-ASCII there are 10 such characters, grouped together as a sequence of 10 contiguous codes. OK, fine. The Unicode database nicely flags characters as numeric or not. It sorts them into several numeric classes and gives their numeric values. Should be easy to parse a number, right? But there are over a hundred versions of the decimal digits in Unicode, including superscripts and subscripts. If I am parsing a decimal number, should I allow all of these characters? Can a number include old fashioned US-ASCII digits, Devanagari digits, and Bengali digits all in the same number? Should I allow the use of language specific number characters that stand for things like the number 40? Should I recognize and deal with the Tibetan half zero? How about the special character representations of the so-called “vulgar fractions”? Or, should I pick a script based on the current locale, or perhaps require the complete number to be in a single script? What do I do in a locale that does not have a zero but has the nine other digits?
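Python's standard unicodedata module exposes those numeric classes directly, which at least makes the questions concrete:

    import unicodedata

    # Unicode gives a numeric character up to three values: decimal
    # (a positional digit), digit (includes things like superscripts),
    # and numeric (anything with a value, including fractions).
    for ch in ["7", "\u0967", "\u09EB", "\u00B2", "\u00BD"]:
        print(f"U+{ord(ch):04X}",
              unicodedata.decimal(ch, None),
              unicodedata.digit(ch, None),
              unicodedata.numeric(ch, None))
    # U+0037  7     7     7.0   ASCII seven
    # U+0967  1     1     1.0   Devanagari one
    # U+09EB  5     5     5.0   Bengali five
    # U+00B2  None  2     2.0   superscript two: a digit, not a decimal
    # U+00BD  None  None  0.5   vulgar fraction one half: numeric only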

(I must admit to being truly amazed by the Tibetan half zero character. Half zero? OK, it may or may not, depending on who you read, mean a value of -1/2. But there seem to be no examples of it being used. And they left out the ruble sign until 2014?)

How about hexadecimal numbers? Well, according to the Unicode FAQ you can only use the Latin digits 0..9, the Latin letters a..f and A..F, and their fullwidth equivalents in hexadecimal numbers. I can use Devanagari for decimal numbers but I have to use Latin characters for hexadecimal numbers. That does not make a great deal of sense. This is an example of what would be called the founder effect in genetics: the genes of the founding population have a disproportionate effect on the genetics of the later population. English has been the dominant language in computer technology since its beginning and seems to be forcing the use of the English alphabet and language everywhere. What a mess.
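Python shows off the asymmetry nicely: its int() accepts any script's decimal digits, while the characters Unicode blesses for hexadecimal (the Hex_Digit property) are only ASCII 0-9, a-f, A-F and their fullwidth clones. The checker below hard-codes those ranges, so treat it as a sketch rather than the real UCD property:

    print(int("123"))                 # 123
    print(int("\u0967\u0968\u0969"))  # 123, from Devanagari digits

    # Characters with the Hex_Digit property: the ASCII forms plus the
    # fullwidth forms at U+FF10..FF19, U+FF21..FF26, U+FF41..FF46.
    HEX_DIGITS = (set("0123456789abcdefABCDEF")
                  | {chr(c) for c in range(0xFF10, 0xFF1A)}
                  | {chr(c) for c in range(0xFF21, 0xFF27)}
                  | {chr(c) for c in range(0xFF41, 0xFF47)})

    def is_unicode_hex(s):
        return len(s) > 0 and all(c in HEX_DIGITS for c in s)

    print(is_unicode_hex("DEADbeef"))  # True
    print(is_unicode_hex("\u0967"))    # False: no Devanagari in hex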

You run into similar problems with defining a programming language identifier name. Do you go with the locale? Do you go with the script of the first character? Or do you let people mix and match to their heart's content? I can see lots of fun obfuscating code by using a mixture of scripts in identifier names. If you go with the locale you could wind up with code that compiles in Delhi, but not in Austin, Beijing, or London. I think I have to write a whole blog post on this problem.
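Python, for what it is worth, went with mix and match (its identifiers follow Unicode's XID_Start/XID_Continue rules), which makes the obfuscation risk easy to demonstrate:

    # Python accepts identifiers from any script, in any mixture.
    print("gödel".isidentifier())  # True
    print("变量".isidentifier())    # True (Chinese)
    print("xα1".isidentifier())    # True (Latin plus Greek plus a digit)

    # These two names render identically but are different identifiers:
    # Latin small a versus Cyrillic small a (U+0430).
    latin, lookalike = "bank", "b\u0430nk"
    print(latin == lookalike)      # False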

I've used the word “locale” several times now without commenting on what it means in regard to Unicode. The idea of a locale is to provide a database of characters and formats to use based on the local language and customs. Take the currency symbol: the locale gives you “$” in the USA, “¥” in Japan, and “£” in the UK. Great, unless you have something like eBay.com that might want to list prices in dollars, yen, and pounds on a computer in the EU. Locale is for local data, not for the data stored in the database. You use the locale to decide how to display messages and dates and so on on the local computer. But it does not, and cannot, be used to control how data is stored in a database.
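In Python the locale database looks like this; note that locale names are platform specific, so the POSIX-style names below are an assumption that may not hold on your machine:

    import locale

    # Ask the locale database for local conventions such as the
    # currency symbol. setlocale() raises locale.Error if the named
    # locale is not installed on this system.
    for name in ("en_US.UTF-8", "ja_JP.UTF-8", "en_GB.UTF-8"):
        locale.setlocale(locale.LC_ALL, name)
        conv = locale.localeconv()
        print(name, conv["currency_symbol"])
    # en_US.UTF-8 $
    # ja_JP.UTF-8 ￥
    # en_GB.UTF-8 £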

This is a great collection of grumps, complaints, and whines. It was mostly about the difference between my ancient habits and patterns of thought and the world as it really is. Writing this has helped me come to grips with my problems with Unicode and helped me understand it better. The march of technology has exposed me, and I hope many others, to many languages and the cultures that created them. One of the worst traits of humans is the tendency to believe that the rest of the world is just like their backyard. Without even realizing how rare a backyard is!

12 comments:

  1. AFAIK Unicode code points are 21 bits, not 23. There are 17 planes of 65,536 code points each, or about 20.09 bits. So they won't be able to keep adding characters forever. Now why is it that .NET still doesn't have any representation for a true "code point" and offers functions like "IsDigit(char c)" that don't make sense beyond the BMP?

    1. Thank you for catching my error! I have corrected it.

  2. ‮.gnola emoc I !esrow yna teg t'ndluoc edocinU thguoht uoy nehw tsuJ

  3. Not looking to defend the mess that is Unicode, but maybe find another example. On 11 December 2013, the Central Bank of Russia approved the new ruble sign. Before that there was no sign. It was in the updated Unicode a few months later, not too bad really.

    1. Came here to say this. Yet another example of something that looks bad on the surface, but makes sense when you look into it.

  4. Spiffing blog, old man. I do love your patter, don't you know.

    Just one gripe however. It is the Queen's English after all, and Her Majesty does get ever so slightly miffed when some bloke called Google insists on telling her that she doesn't know how to spell "colour" and "realise", among others.

    Anyway, hope that colonial thing is working out for you. We'll take you back, you know, if you ask nicely.

    Cheerio and toodle-oo

    1. I do wish that you had identified which queen you are writing about. There are a lot of queens in the world. None of them seem to own any language in general or English in particular.

      As for the "colonial thing" we gave up being a colonial power quite a while ago. We just didn't seem to like being a colonial power after the middle of the 20th century.

  5. I think the word-encoding languages are taking advantage of a structure designed to handle one letter at a time. To get back to equality, we should define a new encoding so all languages have a word-by-word encoding. Use a short encoding for the shortest words and modifiers, and a longer encoding for longer words. Modifiers for first caps, all caps, italics, reverse italics, bold, underline, overline, strikethrough, and following non-blank punctuation.

  6. > If they let one constructed script in they would have a hard time keeping others out.

    Yet, APL.

  7. 1. Remember this: Use UTF-8 encoding whenever possible. And if it isn't possible, make it possible.

    2. Read this famous article by Joel Spolsky:
    http://www.joelonsoftware.com/articles/Unicode.html

    3. Read this clarification:
    http://stackoverflow.com/questions/20942469/clarification-on-joel-spolskys-unicode-article

  8. Regarding string comparisons, Unicode is just the tip of the iceberg.

    Collations are what should be used to compare text, and unlike Unicode, which has the ideal of supporting all possible characters in just one standard, you will never be able to do that with collations. There will never be "one unicollation".

    Why? Simple... because in some cultures considering character A equal to character B might be useful, while in others it might be insulting. Even inside the same culture we can have multiple rules for how to compare text.

    Regards.
