Excerpts taken from the book "Structured Computer Organization", 4th edition, by Andrew S. Tanenbaum. (Prentice Hall Publishing):
"UNICODE is now supported by some programming language(e.g. Java), some operating systems and many applications. It is likely to become increasingly accepted as the computer industry goes global."
"The basic idea behid UNICODE is to assign every character and symbol, a unique, permanent 16-bit value, called a code point."
Which means that, theoretically speaking, we can represent every single character and symbol of all the world's languages using a single encoding scheme. Thus, we could create a font set that would allow us to write in any language.
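To make the 16-bit design concrete, here is a minimal Java sketch (Java being the language the book itself names as supporting UNICODE). A Java char is exactly one 16-bit code unit, so in this original design every character, from any alphabet zone, fits in a single char:

```java
// Minimal sketch: a Java char is a 16-bit code unit, matching the
// original "one character = one 16-bit code point" design quoted above.
public class CodePointDemo {
    public static void main(String[] args) {
        char latinA = 'A';         // U+0041, Latin alphabet zone
        char hiraganaA = '\u3042'; // U+3042, Japanese hiragana "a"
        char hangulGa = '\uAC00';  // U+AC00, first Hangul syllable "ga"

        System.out.printf("U+%04X U+%04X U+%04X%n",
                (int) latinA, (int) hiraganaA, (int) hangulGa);
        System.out.println("Bits per char: " + Character.SIZE); // 16
    }
}
```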
"...With 16-bit symbols, UNICODE has 65,546 code points. Since the world's languages collectively use about 200,000 symbols, code points are a scarce resource that must be allocated with great care."
"... Each major alphabet in UNICODE has a sequence of consecutive zones."
"... After these come the symbols needed for Chinese, Japanese and Korean. First are 1024 phonetic symbols (e.g. katakana and bopomofo) and then the unified Han ideographs (20,992) used in Chinese and Japanese, and the Korean Hangul syllables (11,156)."
"While UNICODE solves many problems associated with internationalization, it does not (attempt to) solve all the world's problems. For example, while the Latin alphabet is in order, the Han ideographs are not in dictionary order. A Japanese program needs external tables to figure out which of two symbols comes before the other in the dictionary."
"Adding new words in English does not require new code points. Adding them in Japanese does. In addition to new technical words, there is a demand for at least 20,000 new (mostly Chinese) personal and place names."
"UNICODE uses the same code point for characters that look almost identical but have different meanings or are written slightly differently in Japanese and Chinese...."
"Some people view this as an optimization to save scarce code points; others see it as Anglo-Saxon culutural imperialism (and you thought assigning 16-bit values to characters was not highly political?). To make matters worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only 20,992 code points available for Han ideographs, choices had to be made. Not all Japanese people think that a consortium of computer companies, even if a few of them are Japanese, is the ideal forum to make these choices."
After reading the whole article I ask myself: if the Unicode consortium already knew that about 200,000 code points are needed to represent every symbol in all the world's languages, why the hell did they decide to use only 16 bits to encode them?
Why didn't they use 24 bits? (Remember: a byte (8 bits) is the smallest addressable unit in a computer, so 16 + 8 = 24.)
18 bits would address 262,144 symbols. The other 6 could serve as an error-correcting code, such as a Hamming or Reed-Solomon code.
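As a back-of-the-envelope check on that split (a sketch of the arithmetic, not anything the consortium actually considered): the Hamming bound 2^r >= k + r + 1 gives r = 5 parity bits for k = 18 data bits, and one extra overall parity bit for single-error-correct/double-error-detect (SECDED) brings the total to exactly 24:

```java
// Sketch of the 24-bit budget: 18 data bits cover 2^18 = 262,144 code
// points; a Hamming code needs r parity bits with 2^r >= k + r + 1,
// plus one overall parity bit for SECDED.
public class EccBudget {
    static int hammingParityBits(int dataBits) {
        int r = 0;
        while ((1 << r) < dataBits + r + 1) r++;
        return r;
    }

    public static void main(String[] args) {
        int dataBits = 18;
        System.out.println("Addressable symbols: " + (1 << dataBits)); // 262144
        int r = hammingParityBits(dataBits);
        System.out.println("Hamming parity bits: " + r);               // 5
        System.out.println("SECDED total bits: " + (dataBits + r + 1)); // 24
    }
}
```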
Why do computer companies like to play stupid?
"UNICODE is now supported by some programming language(e.g. Java), some operating systems and many applications. It is likely to become increasingly accepted as the computer industry goes global."
"The basic idea behid UNICODE is to assign every character and symbol, a unique, permanent 16-bit value, called a code point."
Which means that we can represent, theoretically speaking, every single character and symbol of all world's idioms using a single pattern. Thus, we could create a fontset that would allow us to write in any language
"...With 16-bit symbols, UNICODE has 65,546 code points. Since the world's languages collectively use about 200,000 symbols, code points are a scarce resource that must be allocated with great care."
"... Each major alphabet in UNICODE has a sequence of consecutive zones."
"... After these come the symbols needed for Chinese, Japanese and Korean. First are 1024 phonetic symbols (e.g. katakana and bopomofo) and then the unified Han ideographs (20,992) used in Chinese and Japanese, and the Korean Hangul syllables (11,156)."
"While UNICODE solves many problems associated with internationalization, it does not (attempt to) solve all the world's problems. For example, while the Latin alphabet is in order, the Han ideographs are not in dictionary order. A Japanese program needs external tables to figure out which of two symbols comes before the other in the dictionary."
"Adding new words in English does not require new code points. Adding them in Japanese does. In addition to new technical words, there is a demand for at least 20,000 new (mostly Chinese) personal and place names."
"UNICODE uses the same code point for characters that look almost identical but have different meanings or are written slightly differently in Japanese and Chinese...."
"Some people view this as an optimization to save scarce code points; others see it as Anglo-Saxon culutural imperialism (and you thought assigning 16-bit values to characters was not highly political?). To make matters worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only 20,992 code points available for Han ideographs, choices had to be made. Not all Japanese people think that a consortium of computer companies, even if a few of them are Japanese, is the ideal forum to make these choices."
After reading the whole article I ask myself: If the Unicode consortium already knew that 200,000 code points are necessary to represent every single symbol concerning all languages, why the hell they decided to use 16 bits to encode them?
Why didn't they use 24 bits? (remember: a byte (8 bits) is the least addressable unit in a computer, so 16 + 8 = 24).
18 bits would address 262144 symbols. The other 6 could be used as Error-Correcting Code, like Hamming's or Reed-Solomon.
Why do computer companies like to play stupid?