Excerpts taken from the book "Structured Computer Organization", 4th edition, by Andrew S. Tanenbaum. (Prentice Hall Publishing):
"UNICODE is now supported by some programming language(e.g. Java), some operating systems and many applications. It is likely to become increasingly accepted as the computer industry goes global."
"The basic idea behid UNICODE is to assign every character and symbol, a unique, permanent 16-bit value, called a code point."
Which means that, theoretically speaking, we can represent every single character and symbol of all the world's languages using a single encoding scheme. Thus, we could create a font set that would allow us to write in any language.
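To make the 16-bit design concrete, here is a minimal Java sketch (Java being the language the book itself names as supporting UNICODE). A Java char is exactly one 16-bit code unit, so in this original design every character, from any alphabet zone, fits in a single char:

```java
// Minimal sketch: a Java char is a 16-bit code unit, matching the
// original "one character = one 16-bit code point" design quoted above.
public class CodePointDemo {
    public static void main(String[] args) {
        char latinA = 'A';         // U+0041, Latin alphabet zone
        char hiraganaA = '\u3042'; // U+3042, Japanese hiragana "a"
        char hangulGa = '\uAC00';  // U+AC00, first Hangul syllable "ga"

        System.out.printf("U+%04X U+%04X U+%04X%n",
                (int) latinA, (int) hiraganaA, (int) hangulGa);
        System.out.println("Bits per char: " + Character.SIZE); // 16
    }
}
```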
"...With 16-bit symbols, UNICODE has 65,546 code points. Since the world's languages collectively use about 200,000 symbols, code points are a scarce resource that must be allocated with great care."
"... Each major alphabet in UNICODE has a sequence of consecutive zones."
"... After these come the symbols needed for Chinese, Japanese and Korean. First are 1024 phonetic symbols (e.g. katakana and bopomofo) and then the unified Han ideographs (20,992) used in Chinese and Japanese, and the Korean Hangul syllables (11,156)."
"While UNICODE solves many problems associated with internationalization, it does not (attempt to) solve all the world's problems. For example, while the Latin alphabet is in order, the Han ideographs are not in dictionary order. A Japanese program needs external tables to figure out which of two symbols comes before the other in the dictionary."
"Adding new words in English does not require new code points. Adding them in Japanese does. In addition to new technical words, there is a demand for at least 20,000 new (mostly Chinese) personal and place names."
"UNICODE uses the same code point for characters that look almost identical but have different meanings or are written slightly differently in Japanese and Chinese...."
"Some people view this as an optimization to save scarce code points; others see it as Anglo-Saxon culutural imperialism (and you thought assigning 16-bit values to characters was not highly political?). To make matters worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only 20,992 code points available for Han ideographs, choices had to be made. Not all Japanese people think that a consortium of computer companies, even if a few of them are Japanese, is the ideal forum to make these choices."
After reading the whole article I ask myself: if the Unicode consortium already knew that about 200,000 code points are needed to represent every symbol in all the world's languages, why the hell did they decide to use only 16 bits to encode them?
Why didn't they use 24 bits? (Remember: a byte (8 bits) is the smallest addressable unit in a computer, so 16 + 8 = 24.)
18 bits would address 262,144 symbols. The other 6 could serve as an error-correcting code, such as a Hamming or Reed-Solomon code.
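As a back-of-the-envelope check on that split (a sketch of the arithmetic, not anything the consortium actually considered): the Hamming bound 2^r >= k + r + 1 gives r = 5 parity bits for k = 18 data bits, and one extra overall parity bit for single-error-correct/double-error-detect (SECDED) brings the total to exactly 24:

```java
// Sketch of the 24-bit budget: 18 data bits cover 2^18 = 262,144 code
// points; a Hamming code needs r parity bits with 2^r >= k + r + 1,
// plus one overall parity bit for SECDED.
public class EccBudget {
    static int hammingParityBits(int dataBits) {
        int r = 0;
        while ((1 << r) < dataBits + r + 1) r++;
        return r;
    }

    public static void main(String[] args) {
        int dataBits = 18;
        System.out.println("Addressable symbols: " + (1 << dataBits)); // 262144
        int r = hammingParityBits(dataBits);
        System.out.println("Hamming parity bits: " + r);               // 5
        System.out.println("SECDED total bits: " + (dataBits + r + 1)); // 24
    }
}
```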
Why do computer companies like to play stupid?
"UNICODE is now supported by some programming language(e.g. Java), some operating systems and many applications. It is likely to become increasingly accepted as the computer industry goes global."
"The basic idea behid UNICODE is to assign every character and symbol, a unique, permanent 16-bit value, called a code point."
Which means that we can represent, theoretically speaking, every single character and symbol of all world's idioms using a single pattern. Thus, we could create a fontset that would allow us to write in any language
"...With 16-bit symbols, UNICODE has 65,546 code points. Since the world's languages collectively use about 200,000 symbols, code points are a scarce resource that must be allocated with great care."
"... Each major alphabet in UNICODE has a sequence of consecutive zones."
"... After these come the symbols needed for Chinese, Japanese and Korean. First are 1024 phonetic symbols (e.g. katakana and bopomofo) and then the unified Han ideographs (20,992) used in Chinese and Japanese, and the Korean Hangul syllables (11,156)."
"While UNICODE solves many problems associated with internationalization, it does not (attempt to) solve all the world's problems. For example, while the Latin alphabet is in order, the Han ideographs are not in dictionary order. A Japanese program needs external tables to figure out which of two symbols comes before the other in the dictionary."
"Adding new words in English does not require new code points. Adding them in Japanese does. In addition to new technical words, there is a demand for at least 20,000 new (mostly Chinese) personal and place names."
"UNICODE uses the same code point for characters that look almost identical but have different meanings or are written slightly differently in Japanese and Chinese...."
"Some people view this as an optimization to save scarce code points; others see it as Anglo-Saxon culutural imperialism (and you thought assigning 16-bit values to characters was not highly political?). To make matters worse, a full Japanese dictionary has 50,000 kanji (excluding names), so with only 20,992 code points available for Han ideographs, choices had to be made. Not all Japanese people think that a consortium of computer companies, even if a few of them are Japanese, is the ideal forum to make these choices."
After reading the whole article I ask myself: If the Unicode consortium already knew that 200,000 code points are necessary to represent every single symbol concerning all languages, why the hell they decided to use 16 bits to encode them?
Why didn't they use 24 bits? (remember: a byte (8 bits) is the least addressable unit in a computer, so 16 + 8 = 24).
18 bits would address 262144 symbols. The other 6 could be used as Error-Correcting Code, like Hamming's or Reed-Solomon.
Why do computer companies like to play stupid?