Aug 17, 2014

Oh Unicode, Why Are You So Frustrating?

I do, you know, find Unicode to be truly frustrating. For a long time I've wanted to write a long loud flame about everything I dislike about Unicode. But, I can't. You see, every time I run across something that frustrates me about Unicode I make the mistake of researching it. If you want to really hate something you have to completely lack the intellectual ability, and honesty, to research that hateful thing and try to understand it. If you do the research, you just might find out that there are good reasons for what you hate and that you could not have done better. Jeez, that is frustrating.

I wish the folks who defined Unicode had realized that they were not going to get away with a 16 bit code. But, when they started, 23 years ago, 16 bits certainly seemed like enough. Now there is just too much old code, including a bunch of compilers, that has the built-in belief that a character is 16 bits. Unicode now requires a 21 bit code space. I wish the folks in charge of Unicode would just admit that it is going to be a 32 bit code some day so that the rest of us would be forced to design for a 32 bit character set. Could I have done better? Could I have avoided their mistakes? Hell no. I was perfectly happy with seven bit US-ASCII. Back in the late '80s I was vaguely aware that Chinese used a lot of characters, enough that I was suspicious that they would not all fit in 16 bits, but I really didn't care. I'm pretty sure I would have focused on the languages I was most familiar with and ignored everything else.

Unicode has forced me to understand that I was (am?) a language bigot. Nobody really wants to come face to face with their character (sorry about that) flaws. I am a 12th generation American and I speak American English. In school I studied German, French, and Spanish. The only one of any use to me is Spanish. I once met a guy from Quebec who refused to speak English, but he sure understood it. I have never met anyone from Germany or France who did not speak English. But, living in Texas and the US southwest, knowing a lot more Spanish than I do would be useful. I have met and worked with people from dozens of countries, but they all speak American English. Only folks from the UK seem to make a big deal about UK English, and they don't seem to care if they can be understood. Wow, that sure sounded like it was written by a language bigot, didn't it? Or, maybe it just sounds like it was written by a typical monolingual American.

Unicode has taught me more about how the majority of humans communicate than anything else. It has forced me to face all those other scripts. It has forced me to think about characters in a completely new way. Until I tried to grok Unicode I had no idea that most scripts do not have upper and lower case letters. I did not know that some scripts have three cases: lower case, upper case, and title case. I always believed that every lower case letter has one, and only one, upper case equivalent. I did not know that there is not always a one to one relationship between upper and lower case characters. The old toupper() and tolower() functions I know and love are now pretty much meaningless in an international, multilingual environment. So if you want to ensure that “are”, “Are”, and “ARE” all test equal you can't just convert them to upper or lower case and do the comparison. You have to do case folding according to a Unicode algorithm and then do the test.
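Python happens to ship that case folding algorithm as str.casefold(), so a minimal sketch of the difference looks like this (the German sharp s is the classic case where plain upper/lower casing falls apart):

```python
# Upper/lower casing is not enough for caseless matching.
# Python's str.casefold() implements Unicode full case folding.
words = ["are", "Are", "ARE"]
assert len({w.casefold() for w in words}) == 1   # all three test equal

# The classic counterexample: the German sharp s has no single-character
# upper case form, so casing changes the length of the string.
assert "straße".upper() == "STRASSE"             # upper() grows the string
assert "STRASSE".lower() != "straße"             # lower() cannot undo it
assert "straße".casefold() == "STRASSE".casefold()  # folding still matches
```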

I hate modifier characters. (They are also known as combining characters.) The idea that there are two ways to represent the same character is enough to drive one to insanity. You can code “ö” as either a single character or as two characters, an “o” followed by a Unicode combining diaeresis (U+0308). Modifier characters of different sorts are used all through Unicode.
Why does Unicode have modifier characters? Why not just have unique encodings for each of the different combinations of characters with special decorations? It would make life so much easier! And, it is not like they do not have plenty of code points to spread around. Well, Unicode has a policy of being compatible with existing character encodings. That doesn't mean you get the same code points as in the old encoding. (Or, maybe it does! It does at least sometimes.) But, it does mean that you get the same glyphs in the same order so that it is easy to convert existing data to Unicode.
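The two representations of “ö” from above are easy to see directly in Python, whose unicodedata module also exposes the normalization forms that convert between them:

```python
import unicodedata

precomposed = "\u00F6"     # "ö" as one code point
combining   = "o\u0308"    # "o" followed by COMBINING DIAERESIS

# Both display as "ö", but as strings they are neither equal nor the same length.
assert precomposed != combining
assert (len(precomposed), len(combining)) == (1, 2)

# Unicode normalization converts between the two representations.
assert unicodedata.normalize("NFC", combining) == precomposed   # compose
assert unicodedata.normalize("NFD", precomposed) == combining   # decompose
```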

Making existing character sets fit into Unicode with no, or few, changes just makes too much sense. If you want people to use your encoding then make their existing data compliant by default or with a minimal amount of work. That is just too sensible to say it is wrong. Guess what? The old character encodings had modifier characters. Why? I do not know, but I can guess that it is because they were trying to fit them into a one byte character alongside US-ASCII. I mean, US-ASCII already has most of the letters used in European languages so it was easier to add a few modifier characters than to add all the characters you can make with them. I mean, 8 bits is not much space. I can hate modifier characters all I want but I can't fault them for leaving them in. And, once they are in, why not use the concept of modifier characters to solve other problems? Never waste a good concept.

How do modifiers make me a grumpy programmer? What do I have to do to compare “Gödel” and “Gödel” and make sure it comes out equal? I mean, one of those strings could be 5 characters long while the other is 6 characters long. (Well, no, they are both 5 characters long, it is just that the representation of one of the characters may be one code point longer in one string than in the other.) So, first you have to convert the strings to one of the canonical forms defined by Unicode, do case folding, and then compare them. Since the strings might change length you may have to allocate fresh memory to store them. So what used to be a very simple operation now requires running two complex algorithms, possibly allocating dynamic memory, before you can do the comparison. Oh my...
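In Python the whole procedure just described fits in a few lines, since the standard library supplies both pieces. This is a sketch of the common case; full Unicode caseless matching has a few more wrinkles than normalize-then-fold, but it covers the Gödel problem:

```python
import unicodedata

def caseless_equal(a: str, b: str) -> bool:
    """Normalize to a canonical form (NFC here), case fold, then compare.
    Python allocates the temporary strings behind the scenes."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return a.casefold() == b.casefold()

five = "G\u00F6del"    # "Gödel", 5 code points (precomposed ö)
six  = "Go\u0308del"   # "Gödel", 6 code points (o + combining diaeresis)

assert (len(five), len(six)) == (5, 6)
assert five != six                  # the naive comparison fails
assert caseless_equal(five, six)    # the Unicode-aware comparison succeeds
assert caseless_equal(six, "GÖDEL")
```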

All that complexity just to do a string comparison. OK, so I had to face the fact that I am a cycle hoarder and a bit hoarder. I grew up back when a megabyte of RAM was called a Moby and a million cycles per second was a damn fast computer. I'm sitting here on a machine with 8 gigabytes of RAM, 8 multibillion cycle per second cores (not to mention the 2 gigs of video RAM and 1000s of GPUs) and I am worried about the cost of doing a test for equality on a Unicode string. When you grow up with scarcity it can be hard to adapt to abundance. I saw that in the parents who grew up during the Great Depression (and a very depressing time it was). My mother made sure we did not tear wrapping paper when we unwrapped a present. She would save it and reuse it. I worry about using a few extra bytes of RAM and the cost of a call to malloc(). I am trying to get over that. Really, I am. Can't fault Unicode for my attitudes.

I wish that Unicode was finished. It isn't finished. It is a moving target. The first release was in October 1991, and the most recent release (as I write this) was in June 2014. There have been a total of 25 releases in 23 years. The most recent release added a couple of thousand characters to support more than 20 new scripts. Among other things, after all this time they finally got around to adding the ruble currency sign. (If I were Russian I might feel a little insulted by that.) You would think they could finish this in 23 years, right? Wrong. It takes that long just to get everyone who will benefit from Unicode to hear about it, decide it is worth working on, and finally get around to working on it. It takes that long. It will take a lot longer. Get used to the fact that it may never be done. I hope that the creativity of human beings will force Unicode to add new characters forever.

Each release seems to “slightly” change the format of some files, so everyone who processes the data needs to do some rework. I have never been able to find a formal syntax specification for any of the Unicode data files. If you find one please let me know. It would not be that hard to create an EBNF for the files.
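In the absence of an EBNF, the de facto format is at least simple: UAX #44 documents UnicodeData.txt as one record per line with 15 semicolon-separated fields. A hypothetical minimal parser (the field names are my own shorthand, not anything official):

```python
# UAX #44 documents UnicodeData.txt as one record per line with 15
# semicolon-separated fields. These field names are my own shorthand.
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "bidi_mirrored",
    "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
]

def parse_record(line: str) -> dict:
    values = line.rstrip("\n").split(";")
    assert len(values) == len(FIELDS), "malformed UnicodeData.txt record"
    return dict(zip(FIELDS, values))

# The real record for U+0041, LATIN CAPITAL LETTER A:
rec = parse_record("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
assert rec["general_category"] == "Lu"     # Letter, uppercase
assert rec["simple_lowercase"] == "0061"   # lower cases to U+0061 "a"
```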

I do wish they would add Klingon and maybe Elvish. That is very unlikely to happen. If they let one constructed script in they would have a hard time keeping others out. I can see people creating new languages with new scripts just to get them added to Unicode. People are nasty that way. Unicode does have a huge range of code points set aside for user defined characters. But that doesn't seem to be much use for document exchange. There needs to be a way to register different uses of that character space.

I hate that I do not fully understand Unicode. There are characters in some scripts that seem to only be there to control how characters are merged while being printed. Ligatures I understand, but the intricacies of character linking in the Arabic scripts are something I probably will never understand. But, if you want to sort a file containing names in English, Chinese, and Arabic you better understand how to treat those characters.

During the last 40 years I have learned a lot of cute tricks for dealing with characters in parsers. Pretty much none of them work with Unicode. Think about parsing a number. That is a pretty simple job if you can tell numeric characters from non-numeric characters. In US-ASCII there are 10 such characters. They are grouped together as a sequence of 10 characters. OK, fine. The Unicode database nicely flags characters as numeric or not. It classes them into several numeric classes and gives their numeric values. Should be easy to parse a number. But, there are over a hundred versions of the decimal digits in Unicode including superscripts and subscripts. If I am parsing a decimal number should I allow all of these characters in a number? Can a number include old fashioned US-ASCII digits, Devanagari digits, and Bengali digits all in the same number? Should I allow the use of language specific number characters that stand for things like the number 40? Should I recognize and deal with the Tibetan half zero? How about the special character representations of the so called “vulgar fractions”? Or, should I pick a script based on the current locale, or perhaps require the complete number to be in a single script? What do I do in a locale that does not have a zero but has the nine other digits?
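The database part, at least, is easy to get at. Python's unicodedata exposes the category and numeric value of each character, and Python's own int() is one existing answer to the questions above, and a very permissive one: it accepts any decimal digits (category Nd), even several scripts mixed in a single number:

```python
import unicodedata

# The database flags every decimal digit with category "Nd" and its value.
for seven in "7", "७", "٧":   # ASCII, Devanagari, and Arabic-Indic sevens
    assert unicodedata.category(seven) == "Nd"
    assert unicodedata.digit(seven) == 7

# Python's int() accepts any "Nd" digits, even mixed scripts in one number.
assert int("१२३") == 123      # all Devanagari
assert int("1٢३") == 123      # ASCII, Arabic-Indic, and Devanagari mixed

# Superscripts have numeric values but are category "No", not "Nd",
# so int() rejects them.
assert unicodedata.category("²") == "No"
assert unicodedata.digit("²") == 2
```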

(I must admit to being truly amazed by the Tibetan half zero character. Half zero? OK, it may or may not, depending on who you read, mean a value of -1/2. But, there seem to be no examples of it being used. And they left out the ruble sign until 2014?)

How about hexadecimal numbers? Well, according to the Unicode FAQ you can only use the Latin digits 0..9 and Latin letters a..f, A..F and their full width equivalents in hexadecimal numbers. I can use Devanagari for decimal numbers but I have to use Latin characters for hexadecimal numbers. That does not make a great deal of sense. This is an example of what would be called the founder effect in genetics. The genes of the founding population have a disproportionate effect on the genetics of the later population. English has been the dominant language in computer technology since its beginning and seems to be forcing the use of the English alphabet and language everywhere. What a mess.
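That FAQ rule corresponds to Unicode's Hex_Digit character property: exactly the ASCII forms plus their fullwidth equivalents. Python's unicodedata does not expose that property directly, so here is a hand-built sketch of it:

```python
# A hand-built version of Unicode's Hex_Digit property: the ASCII hex
# digits plus their fullwidth equivalents, and nothing else.
HEX_DIGITS = set("0123456789abcdefABCDEF")
HEX_DIGITS |= {chr(0xFF10 + i) for i in range(10)}  # fullwidth ０-９
HEX_DIGITS |= {chr(0xFF21 + i) for i in range(6)}   # fullwidth Ａ-Ｆ
HEX_DIGITS |= {chr(0xFF41 + i) for i in range(6)}   # fullwidth ａ-ｆ

def is_hex_number(s: str) -> bool:
    return len(s) > 0 and all(c in HEX_DIGITS for c in s)

assert is_hex_number("CAFE")
assert is_hex_number("ｃａｆｅ")   # fullwidth forms are allowed
assert not is_hex_number("१२")     # decimal-legal Devanagari digits are not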

You run into similar problems with defining a programming language identifier name. Do you go with the locale? Do you go with the script of the first character? Or do you let people mix and match to their hearts' content? I can see lots of fun obfuscating code by using a mixture of scripts in identifier names. If you go with the locale you could wind up with code that compiles in Delhi, but not in Austin, Beijing, or London. I think I have to write a whole blog on this problem.
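Python is one language that has already made this choice: following Unicode's identifier recommendation (UAX #31), it allows letters from any script in identifiers and does nothing to stop you mixing them. A small demonstration:

```python
# Python follows Unicode's identifier rules (UAX #31): letters from any
# script are legal in identifiers, and nothing stops you mixing scripts.
π = 3.14159            # Greek
radius = 2             # Latin
面积 = π * radius ** 2  # Chinese name; three scripts in one expression

assert abs(面积 - 12.56636) < 1e-5
# (Python also normalizes identifiers to NFKC, so two visually identical
# names that differ only in combining characters are the same variable.)
```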

I've used the word “locale” several times now without commenting on what it means in regard to Unicode. The idea of a locale is to provide a database of characters and formats to use based on the local language and customs. Take the currency symbol: the locale gives you “$” in the USA, “¥” in Japan, and “£” in the UK. Great, unless you have something like eBay.com that might want to list prices in dollars, yen, and pounds on a computer in the EU. Locale is for local data, not for the data stored in the database. You use the locale to decide how to display messages, dates, and so on on the local computer. But it does not, and cannot, control how data is stored in a database.

This is a great collection of grumps, complaints, and whines. It was mostly about the difference between my ancient habits and patterns of thought and the world as it really is. Writing this has helped me come to grips with my problems with Unicode and helped me understand it better. The march of technology has exposed me, and I hope many others, to many languages and the cultures that created them. One of the worst traits of humans is the tendency to believe that the rest of the world is just like their backyard. Without even realizing how rare a backyard is!

Aug 5, 2014

Ghost Drivers!

I've been thinking a lot about self driving vehicles recently. They are going to change society greatly. But recently it occurred to me that they may create a new kind of problem for law enforcement: The Ghost Driver!

This could happen today.

You can buy a car with cruise control to keep your speed constant. You can buy a car with a collision avoidance system so that even if you are reading a book it will not hit other cars. And, you can buy a car with automatic lane following. It keeps you in your lane even if you are inattentive and start to drift across lanes. (Collision avoidance and lane control seem to me to be the "Texting Driver" package.) Yes, the lane following systems available require you to keep your hands on the steering wheel. But, an article on Slashdot points out that taping a soda can to the steering wheel will fool one system into thinking you are holding on.

That all means that right now you can buy a car that, with the help of a can of Coke and some duct tape, will let you read a book as your car drives down the Interstate.

I was wondering if any of these cars can tell if the driver has died, or even passed out? You hear about that every so often. A driver has a massive stroke or heart attack, or just gets shit faced drunk and passes out while driving along the Interstate. This usually results in the car going out of control and wandering off the road, into oncoming traffic, or into other cars close to the no longer functioning driver.

So, the driver goes to the happy parking lot in the sky and his automatic everything car doesn't notice. The car keeps driving. And, driving. And driving. Finally the car runs out of gas. Do these automated cars know how to pull over to the side of the road before they run out of gas? Maybe call AAA or OnStar for help? I do not know.

The driver dies and the car keeps going. Say he has gas to travel 400 miles. The driver dies and 6 or 7 hours later his car comes to a complete stop in the middle lane of your favorite Interstate. In many parts of the US he will have crossed one, two, or even three state lines as he drives while dead. Here in central Texas he may well still be in Texas when the car stops, but not necessarily.

Just a quick question, is driving while dead illegal?

The cops get a call that a car has "just stopped" on the Interstate. The poor police get to the scene to find a corpse from another state strapped into the driver's seat. The first time this happens, and it will happen, the cops are going to be baffled. How in the world did he get there? It will get especially bizarre when the coroner tells the cops how long the poor driver has been dead.

My bet is that there will be a whole bunch of posts on the Internet claiming that the dead can drive and that "proves" a whole bunch of shit that is too weird for even me to imagine. (Didn't Stephen King write a book about a car that killed people?) How is the manufacturer going to react when people claim the CAR killed the driver and drove off with him? Will someone claim that the car had become sentient, had killed its slave master, and was making a break for freedom? Will the incident be seen as the start of a zombie apocalypse?

What happens if the car is on an ordinary highway instead of the Interstate? It won't slow down as it approaches every little town. It will not stop for traffic lights. Well, it might if there is traffic in the intersection. But, will it understand cross traffic and stay stopped, or will it try to bull its way through every intersection? How will small town cops respond to a car traveling at highway speeds blasting through town? Will they get in the weirdest chase ever with a car that just keeps going, and going, and going? Will they shoot the driver?

Heaven help us if they make these cars with sun roofs or any other easy way to leave them while they are running. (Oh, really.... you never bailed out of a car through a sun roof? You must have had a very sane childhood. Betcha' never played car tag either!) If it is easy to leave the car while it is moving you can count on people stealing these cars, starting them off on ghost journeys, and bailing from the sun roof to a nearby pickup. Who knows, maybe organized crime will adopt this as a way to get rid of bodies. Just make it look like Ghost Driver.