I do, you know, find Unicode to be truly frustrating. For a long time
I've wanted to write a long loud flame about everything I dislike
about Unicode. But, I can't. You see, every time I run across
something that frustrates me about Unicode I make the mistake of
researching it. If you want to really hate something you have to
completely lack the intellectual ability, and honesty,
to research that hateful thing and try to understand it. If you do
the research, you just might find out that there are good reasons for
what you hate and that you could not have done better. Jeez, that is
frustrating.
I wish the folks who defined Unicode had realized that they were not going to get away with a 16-bit code. But, when they started, 23 years ago, 16 bits certainly seemed like enough. Now there is just too much old code, including a bunch of compilers, that has the built-in belief that a character is 16 bits. Unicode now requires a 21-bit code space. I wish the folks in charge of Unicode would just admit that it is going to be a 32-bit code some day so that the rest of us would be forced to design for a 32-bit character set. Could I have done better? Could I have avoided their mistakes? Hell no. I was perfectly happy with seven-bit US-ASCII. Back in the late '80s I was vaguely aware that Chinese used a lot of characters, enough that I was suspicious that they would not all fit in 16 bits, but I really didn't care. I'm pretty sure I would have focused on the languages I was most familiar with and ignored everything else.
Unicode has forced me to understand that I was (am?) a language bigot. Nobody
really wants to come face to face with their character (sorry about
that) flaws. I am a 12th generation American and I speak
American English. In school I studied German, French, and Spanish.
The only one of any use to me is Spanish. I once met a guy from
Quebec who refused to speak English, but he sure understood it. I
have never met anyone from Germany or France who did not speak
English. But, living in Texas and the US southwest knowing a lot more
Spanish than I do would be useful. I have met and worked with people
from dozens of countries, but they all speak American English. Only
folks from the UK seem to make a big deal about UK English and they
don't seem to care if they can be understood. Wow, that sure sounded like it was written by a language bigot, didn't it? Or, maybe it just sounds like it was written by a typical monolingual American.
Unicode has taught me more about how the majority of humans communicate than
anything else. It has forced me to face all those other scripts. It
has forced me to think about characters in a completely new way.
Until I tried to grok Unicode I had no idea that most scripts do not
have upper and lower case letters. I did not know that some scripts
have three cases: lower case, upper case, and title case. I always believed that every lower case letter has one, and only one, upper case equivalent. I did not know that there is not always a one-to-one relationship between upper and lower case characters. The old toupper() and tolower() functions I know and love are now pretty much meaningless in an international, multilingual environment. So if you want to ensure that “are”, “Are”, and “ARE” all test equal you can't just convert them to upper or lower case and do the comparison. You have to do case folding according to a Unicode algorithm and then do the test.
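Here is what that looks like, sketched in Python, whose str.casefold() implements the Unicode case folding algorithm:

```python
# casefold() applies Unicode case folding; it is more aggressive than
# lower() and is what you want for caseless matching.
assert "are".casefold() == "Are".casefold() == "ARE".casefold()

# Case mappings are not one-to-one: German sharp s upper-cases to "SS",
assert "ß".upper() == "SS"
# ... and case folding maps it to "ss", so these compare equal:
assert "straße".casefold() == "STRASSE".casefold()

# A third case really does exist: U+01C5 is the titlecase letter
# sitting between upper case U+01C4 and lower case U+01C6.
assert "ǆ".title() == "ǅ"
```

No table-driven toupper() loop is going to reproduce that ß-to-SS expansion, which is exactly why the one-character-at-a-time functions fall over.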
I hate modifier characters. (They are also known as combining characters.) The idea that there are two ways to represent the same character is enough to drive one to insanity. You can code “ö” as either a single character or as two characters, an “o” followed by a Unicode combining diaeresis (U+0308). Modifier characters of different sorts are used all through Unicode.
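A quick Python sketch of the two spellings of “ö” and how Unicode normalization reconciles them:

```python
import unicodedata

composed = "\u00f6"      # ö as a single precomposed code point
decomposed = "o\u0308"   # o followed by COMBINING DIAERESIS

# As raw code point sequences they are different things:
assert composed != decomposed
assert len(composed) == 1 and len(decomposed) == 2

# Normalizing both to the same form makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed  # compose
assert unicodedata.normalize("NFD", composed) == decomposed  # decompose
```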
Why does Unicode have modifier characters? Why not just have unique
encodings for each of the different combinations of characters with
special decorations? It would make life so much easier! And, it is
not like they do not have plenty of code points to spread around.
Well, Unicode has a policy of being compatible with existing
character encodings. That doesn't mean you get the same code points
as in the old encoding. (Or, maybe it does! It does at least
sometimes.) But, it does mean that you get the same glyphs in the
same order so that it is easy to convert existing data to Unicode.
Making existing character sets fit into Unicode with no, or few, changes just makes too much sense. If you want people to use your encoding then make their existing data compliant by default or with a minimal amount of work. That is just too sensible to say it is wrong. Guess
what? The old character encodings had modifier characters. Why? I do not know, but I can guess that it is because they were trying to fit them into a one-byte character set alongside US-ASCII. I mean, US-ASCII already has most of the letters used in European languages, so it was easier to add a few modifier characters than to add all the characters you can make with them. I mean, 8 bits is not much space.
I can hate modifier characters all I want but I can't fault them for
leaving them in. And, once they are in, why not use the concept of
modifier characters to solve other problems? Never waste a good
concept.
How do modifiers make me a grumpy programmer? What do I have to do to
compare “Gödel” and “Gödel” and make sure it comes out
equal? I mean, one of those strings could be 5 characters long while
the other is 6 characters long. (Well, no, they are both 5 characters
long, it is just too bad that the representation of one of the
characters may be one code point longer in one string than in
the other one.) So, first you have to convert the strings to one of
the canonical forms defined by Unicode, do case folding, and then
compare them. Since the strings might change length you may have to
allocate fresh memory to store them. So what used to be a very simple
operation now requires running two complex algorithms, possibly allocating dynamic memory, before you can do the comparison. Oh
my...
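In Python the whole dance looks something like this (a sketch; NFC is one of the Unicode canonical forms, casefold() does the case folding, and unicode_equal is my own name for the helper):

```python
import unicodedata

def unicode_equal(a: str, b: str) -> bool:
    """Normalize both strings to NFC, case-fold, then compare."""
    return (unicodedata.normalize("NFC", a).casefold()
            == unicodedata.normalize("NFC", b).casefold())

godel_5 = "G\u00f6del"    # 5 code points, precomposed ö
godel_6 = "Go\u0308del"   # 6 code points, o + combining diaeresis

assert len(godel_5) == 5 and len(godel_6) == 6
assert godel_5 != godel_6               # naive comparison fails
assert unicode_equal(godel_5, godel_6)  # normalized comparison works
assert unicode_equal("GÖDEL", godel_6)  # case folding handles the rest
```

Note that normalize() hands back a new string, which is where that fresh memory allocation sneaks in.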
All that complexity just to do a string comparison. OK, so I had to face
the fact that I am a cycle hoarder and a bit hoarder. I grew up back
when a megabyte of RAM was called a Moby and a million cycles per
second was a damn fast computer. I'm sitting here on a machine with 8
gigabytes of RAM, 8 multibillion-cycle-per-second cores (not to mention the 2 gigs of video RAM and 1000s of GPU cores) and I am worried
about the cost of doing a test for equality on a Unicode string. When
you grow up with scarcity it can be hard to adapt to abundance. I saw
that in the parents who grew up during the Great Depression (and a
very depressing time it was). My mother made sure we did not tear
wrapping paper when we unwrapped a present. She would save it and
reuse it. I worry about using a few extra bytes of RAM and the
cost of a call to malloc(). I am trying to get over that. Really, I
am. Can't fault Unicode for my attitudes.
I wish that Unicode were finished. It isn't finished. It is a moving
target. The first release was in October 1991, the most recent
release (as I write this) was in June 2014. There have been a total
of 25 releases in 23 years. The most recent release added a couple of
thousand characters to support more than 20 new scripts. Among other things, after all this time they finally got around to adding the ruble
currency sign. (If I were Russian I might feel a little insulted by
that.) You would think they could finish this in 23 years, right?
Wrong. It takes that long just to get everyone who will benefit from
Unicode to hear about it, decide it is worth working on, and finally
get around to working on it. It takes that long. It will take a lot
longer. Get used to the fact that it may never be done. I hope that
the creativity of human beings will force Unicode to add new
characters forever.
Each release seems to “slightly” change the format of some files, so everyone who processes the data needs to do some rework. I have never been able to find a formal syntax specification for any of the Unicode data files. If you find one please let me know. It would not be that hard to create an EBNF for the files.
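The files are at least regular enough that an informal parser is short. A Python sketch for UnicodeData.txt (the 15-field layout comes from the prose documentation of the character database, not from any formal grammar, and parse_unicodedata_line is my own name):

```python
def parse_unicodedata_line(line: str) -> dict:
    """Split one UnicodeData.txt record into a few named fields.

    Each record is 15 semicolon-separated fields: code point, name,
    general category, ... , with the lowercase mapping in field 13.
    """
    fields = line.rstrip("\n").split(";")
    assert len(fields) == 15, "every record has exactly 15 fields"
    return {
        "code": int(fields[0], 16),
        "name": fields[1],
        "category": fields[2],
        "lowercase": int(fields[13], 16) if fields[13] else None,
    }

# A real record, the one for U+0041 LATIN CAPITAL LETTER A:
rec = parse_unicodedata_line(
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
assert rec["code"] == 0x41
assert rec["category"] == "Lu"       # uppercase letter
assert chr(rec["lowercase"]) == "a"  # its one-to-one lowercase mapping
```

Of course, a parser like this is exactly the kind of thing that breaks when a release “slightly” changes the format.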
I do wish they would add Klingon and maybe Elvish. That is very
unlikely to happen. If they let one constructed script in they would
have a hard time keeping others out. I can see people creating new
languages with new scripts just to get them added to Unicode. People
are nasty that way. Unicode does have a huge range of code points set
aside for user defined characters. But that doesn't seem to be of much use for document exchange. There needs to be a way to register
different uses of that character space.
I hate that I do not fully understand Unicode. There are characters in
some scripts that seem to only be there to control how characters are
merged while being printed. Ligatures I understand, but the
intricacies of character linking in the Arabic scripts are something
I probably will never understand. But, if you want to sort a file
containing names in English, Chinese, and Arabic you better
understand how to treat those characters.
During the last 40 years I have learned a lot of cute tricks for dealing
with characters in parsers. Pretty much none of them work with
Unicode. Think about parsing a number. That is a pretty simple job if
you can tell numeric characters from non-numeric characters. In
US-ASCII there are 10 such characters. They are grouped together as a
sequence of 10 characters. OK, fine. The Unicode database nicely
flags characters as numeric or not. It sorts them into several
numeric classes and gives their numeric values. Should be easy to
parse a number. But, there are over a hundred versions of the decimal
digits in Unicode, including superscripts and subscripts. If I am parsing a decimal number should I allow all of these characters in
a number? Can a number include old fashioned US-ASCII digits,
Devanagari digits, and Bengali digits all in the same number? Should
I allow the use of language specific number characters that stand for
things like the number 40? Should I recognize and deal with the
Tibetan half zero? How about the special character representations of
the so called “vulgar fractions”? Or, should I pick a script
based on the current locale, or perhaps require the complete number
to be in a single script? What do I do in a locale that does not have
a zero but has the nine other digits?
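Python makes the mechanics easy and leaves every one of those policy questions open, which is rather the point. A sketch (parse_decimal is my own toy routine):

```python
import unicodedata

# str.isdigit() is true for far more than ASCII 0-9:
samples = {
    "ASCII": "42",
    "Devanagari": "\u0968\u096b",  # digits for 2 and 5
    "Bengali": "\u09e8\u09eb",     # digits for 2 and 5
    "Fullwidth": "\uff12\uff15",   # digits for 2 and 5
}
for name, s in samples.items():
    assert s.isdigit(), name

# unicodedata.decimal() gives each decimal digit's numeric value, so a
# "parse any script's digits" routine is trivial. Deciding what to
# *reject* is the hard part, and nothing here decides it.
def parse_decimal(s: str) -> int:
    value = 0
    for ch in s:
        value = value * 10 + unicodedata.decimal(ch)  # raises on non-digits
    return value

assert parse_decimal("\u0968\u096b") == 25  # Devanagari 25
assert parse_decimal("2\u096b") == 25       # mixed scripts parse too!
```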
(I must admit to being truly amazed by the Tibetan half zero character. Half zero? OK, it may or may not, depending on who you read, mean a value of -1/2. But, there seem to be no examples of it being used.
And they left out the ruble sign until 2014?)
How about hexadecimal numbers? Well, according to the Unicode FAQ you can
only use the Latin digits 0..9 and Latin letters a..f, A..F and their
full width equivalents in hexadecimal numbers. I can use Devanagari
for decimal numbers but I have to use Latin characters for
hexadecimal numbers. That does not make a great deal of sense. This is an example of what would be called the founder effect in genetics. The genes of the founding population have a disproportionate effect on the genetics of the later population. English has been the
dominant language in computer technology since its beginning and
seems to be forcing the use of the English alphabet and language
everywhere. What a mess.
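Python happens to illustrate both halves of this. Its int() accepts any Unicode decimal digits in base 10, but the letter part of a hexadecimal number can only be spelled with Latin characters:

```python
# Any script's decimal digits parse fine in base 10:
assert int("\u0967\u0968") == 12  # Devanagari digits for 1 and 2

# But there is no Devanagari spelling of a-f; hex digits beyond 9
# have to be the Latin letters:
assert int("BAD", 16) == 2989
```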
You run into similar problems with defining a programming language identifier name. Do you go with the locale? Do you go with the script of the first character? Or do you let people mix and match to their heart's content? I can see lots of fun obfuscating code by using
a mixture of scripts in identifier names. If you go with the locale
you could wind up with code that compiles in Delhi, but not in
Austin, Beijing, or London. I think I have to write a whole blog on
this problem.
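Python is one concrete answer: it allows non-ASCII identifiers, lets you mix scripts freely, and quietly normalizes every identifier to NFKC (per PEP 3131), so two visually different spellings can be the very same name. A sketch:

```python
import unicodedata

# Non-ASCII identifiers are legal...
assert "naïve".isidentifier()
assert "\u2115ame".isidentifier()  # ℕame, with a double-struck N

# ...and NFKC folds the double-struck ℕ into a plain Latin N:
assert unicodedata.normalize("NFKC", "\u2115ame") == "Name"

# So assigning to 'ℕame' actually defines 'Name' -- lots of fun
# for the obfuscation-minded.
namespace = {}
exec("\u2115ame = 42", namespace)
assert namespace["Name"] == 42
```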
I've used the word “locale” several times now without commenting on what it means in regard to Unicode. The idea of a locale is to provide a database of characters and formats to use based on the local language and customs. Take the currency symbol: locale gives you “$” in the USA, “¥” in Japan, and “£” in the UK. Great, unless you have something
like eBay.com that might want to list prices in dollars, yen, and
pounds on a computer in the EU. Locale
is for local data, not for the data stored in the database. You use the locale to decide how to display messages and dates and so on, on the local computer. But it does not, and cannot, control how data is stored in a database.
This is a great collection of grumps, complaints, and whines. It was
mostly about the difference between my ancient habits and patterns of
thought and the world as it really is. Writing this has helped me
come to grips with my problems with Unicode and helped me understand
it better. The march of technology has exposed me, and I hope many
others, to many languages and the cultures that created them. One of the worst traits of humans is the tendency to believe that the rest of the
world is just like their backyard. Without even realizing how rare a
backyard is!
AFAIK Unicode code points are 21 bits, not 23. There are 17 planes of 16 bits each, or about 20.06 bits. So they won't be able to keep adding characters forever. Now why is it that .NET still doesn't have any representation for a true "code point" and offers functions like "IsDigit(char c)" that don't make sense beyond the BMP?
Thank you for catching my error! I have corrected it.
.gnola emoc I !esrow yna teg t'ndluoc edocinU thguoht uoy nehw tsuJ
Not looking to defend the mess that is unicode, but maybe find another example. 11 December 2013, the Central Bank of Russia approved the new ruble sign. Before that there was no sign. It was in the updated unicode 2 months later, not too bad really.
Came here to say this. Yet another example of something that looks bad on the surface, but makes sense when you look into it.
Spiffing blog, old man. I do love your patter, don't you know.
Just one gripe however. It is the Queen's English after all and her majesty does get ever so slightly miffed when some bloke called Google insists on telling her that she doesn't know how to spell "colour", "realise" among others.
Anyway hope that colonial thing is working out for you. We'll take you back you know, if you ask nicely
Cheerio and toodle-oo
I do wish that you had identified which queen you are writing about. There are a lot of queens in the world. None of them seem to own any language in general or English in particular.
As for the "colonial thing" we gave up being a colonial power quite a while ago. We just didn't seem to like being a colonial power after the middle of the 20th century.
I think the word encoding languages are taking advantage of a structure designed to handle one letter at a time. To get back to equality, we should define a new encoding so all languages have a word by word encoding. Use a short encoding for the shortest words and modifiers, and a longer encoding for longer words. Modifiers for first caps, all caps, italics, reverse italics, bold, underline, overline, strikethrough, following non-blank punctuation.
> If they let one constructed script in they would have a hard time keeping others out.
Yet, APL.
1. Remember this: Use utf-8 encoding whenever possible. And if it isn't possible, make it possible.
2. Read this famous article by Joel Spolsky:
http://www.joelonsoftware.com/articles/Unicode.html
3. Read this clarification:
http://stackoverflow.com/questions/20942469/clarification-on-joel-spolskys-unicode-article
Regarding string comparisons, Unicode is just the tip of the iceberg.
Collations are what should be used to compare text, and contrary to Unicode, which has the ideal of supporting all possible characters in just one standard, you will never be able to do that with collations. There will never be "one unicollation".
Why? Simple... because under some cultures considering that character A is equal to B might be useful, while in others it might be insulting. Even inside the same culture we can have multiple rules of how to compare text.
Regards.