Date: Fri, 20 Oct 2000 11:56:55 +0900
From: "Martin J. Duerst" <duerst(a)w3.org>
Reply-To: li18nux(a)li18nux.org
Subject: [li18nux:753] Fwd: RE: GB18030 summary and issues
To: li18nux(a)li18nux.org
With the permission of the author, I'm sending you a comment on
the GB18030 mapping table that appeared on this list
some time ago.
Regards, Martin.
X-UML-Sequence: 5977 (2000-10-17 00:36:44 GMT)
From: Kenneth Whistler <kenw(a)sybase.com>
Date: Mon, 16 Oct 2000 16:36:41 -0800 (GMT-0800)
Subject: RE: GB18030 summary and issues
I've taken a look at the GB18030.TXT you provided, and unfortunately,
as it stands, the mapping table has *major* problems.
Most of these problems really derive from the serious flaws in GB 18030-2000
itself, so I'm not sure exactly what implementers are going to
do about them, but so you can focus in on the issues, here is some
of what I turned up.
A. GB 18030's encoding and mapping of Annex B (p. 91) -- ideographic
variation indicator, and the ideographic description characters, is
flat-out wrong. The same thing applies to Annex C (p. 92), the CJK
radicals supplement. Essentially, the relevant Chinese committee rushed this
thing to publication without having determined where these characters
were encoded in 10646, *despite* the fact that GB 18030 then makes
normative mappings to the entirety of 10646-1:2000 (actually to
GB 13000.1, but that is just a pointer to 10646-1:2000, unless they
printed *that* wrong, too, in which case we are even more screwed up).
The result is just out-and-out errors. To wit:
1. U+303E (GB18030 A989) is mapped to U+E7E7 (user-defined)
The net result in GB18030.TXT is that GB A989 is mapped into private use,
even though in the chart it is shown as U+303E. But U+303E, as a *code
position*, is mapped to the 4-byte form 0x8139A634.
2. U+2FF0..U+2FFB (GB18030 A98A..A995) are mapped correctly in the
main tables of GB18030 (p. 82), but are mapped again incorrectly
in Annex B (U+E7E8..U+E7F3, user-defined).
The net result in GB18030.TXT is that all the ideographic description
characters are double-mapped.
3. U+2E80..U+2EF3, the CJK radicals supplement, are mapped haphazardly,
from an earlier draft, apparently: GB18030 FE50..FEA0 is mapped
to U+E815..U+E864, instead of the actual Unicode code points. In
addition, some of the characters in Annex C are actually in
Vertical Extension A, resulting in gapping in the tables.
The net result in GB18030.TXT is that all the CJK radicals and
other characters in Annex C are double-mapped.
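A minimal Python sketch of how such double mappings could be detected in a
table. It assumes "double-mapped" means that one GB code ends up paired with
two different Unicode values, and that the table has already been parsed into
(GB code, Unicode code point) pairs. The function name and the sample entries
are illustrative only, built from the ranges quoted in item 2; they are not
read from GB18030.TXT.

```python
from collections import defaultdict

def find_double_mappings(pairs):
    """Return {gb_code: sorted Unicode targets} for GB codes with more than one target.

    `pairs` is an iterable of (gb_code, unicode_code_point) integers.
    This helper is hypothetical and only illustrates the check.
    """
    targets = defaultdict(set)
    for gb, ucs in pairs:
        targets[gb].add(ucs)
    return {gb: sorted(t) for gb, t in targets.items() if len(t) > 1}

# Illustrative data from item 2: A98A..A995 paired with U+2FF0..U+2FFB
# (main tables) and again with U+E7E8..U+E7F3 (annex).
main_table = [(0xA98A + i, 0x2FF0 + i) for i in range(12)]
annex = [(0xA98A + i, 0xE7E8 + i) for i in range(12)]

doubles = find_double_mappings(main_table + annex)
for gb, ucs_list in sorted(doubles.items()):
    print(f"GB {gb:04X} -> " + ", ".join(f"U+{u:04X}" for u in ucs_list))
```

Swapping the order of each pair before calling the function checks the
opposite direction, that is, one Unicode code point reached from two GB codes.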
B. GB 18030 makes the mistake of trying to encode all code positions
in GB 13000.1 (= 10646-1:2000), regardless of their status. That
means, among other things, that all private use code positions
in Unicode on the BMP are given GB 18030 code assignments --
*regardless* of their status in GB 18030 as assigned characters or
not. This makes a complete hash, compounded by the fact that all the
characters mentioned in A above are erroneously assigned to private
use codes in Unicode. That renders the mapping of the rest of user
space trash.
C. As an extension of B., GB 18030 also maps surrogate code positions
to GB 18030 4-byte codes, *as if* they were characters. Thus U+D800
(a surrogate code point, not an unassigned character) is mapped to
0x8336C739, indifferently from U+D7FF (an unassigned character
position) being mapped to 0x8336C738.
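The distinction drawn in B and C can be sketched in Python with plain range
checks. The only facts assumed are the Unicode ranges themselves: surrogate
code points are U+D800..U+DFFF and the BMP private use area is
U+E000..U+F8FF. The function name is hypothetical.

```python
def classify_bmp(cp):
    """Classify a BMP code point for the purposes of points B and C.

    Returns "surrogate", "private use", or "other". "other" covers both
    assigned and unassigned characters; telling those apart needs the
    Unicode character database and is out of scope for this sketch.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("not a BMP code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if 0xE000 <= cp <= 0xF8FF:
        return "private use"
    return "other"

# Code points mentioned in the message.
for cp in (0x303E, 0xD7FF, 0xD800, 0xE7E7, 0xE815, 0xE864):
    print(f"U+{cp:04X}: {classify_bmp(cp)}")
```

A mapping table checker could use such a test to flag every entry whose
Unicode side is a surrogate, or a private use code standing in for a
character that has a real code point.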
Incidentally, there appears to be an off-by-one error in this area in
GB18030.TXT as well: GB18030.TXT shows 0x8336C830 = U+D800, whereas
the printed text of the GB18030-2000 standard itself shows
0x8336C739 = U+D800.
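To see that these two codes really are adjacent, the 4-byte form can be
turned into a linear index. The sketch below assumes the usual GB 18030
4-byte layout: first and third bytes in 0x81..0xFE (126 values each), second
and fourth bytes in 0x30..0x39 (10 values each). The function name is
hypothetical.

```python
def four_byte_index(code):
    """Return the linear index of a GB 18030 4-byte code given as an integer.

    The index counts 4-byte codes in byte order, starting from 0 at
    0x81308130. Raises ValueError if a byte is outside its range.
    """
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a valid 4-byte code: 0x{code:08X}")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

printed = four_byte_index(0x8336C739)   # value shown in the printed standard
in_table = four_byte_index(0x8336C830)  # value shown in GB18030.TXT
print(in_table - printed)
```

The difference is exactly one position, because the fourth byte wraps from
0x39 to 0x30 and carries into the third byte.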
I'm not sure what the solution here is, other than to encourage China
to fix its $@&#*^! standard. But if the tables you posted have in
fact already been rolled out in Linux implementations in China, then
we are all going to have to live with horrendous interoperability
problems resulting from bad mapping tables for bad standards.
Here it is the year 2000, and having lived with the yen/backslash
problem and the fullwidth tilde problem, and the not sign problem for
decades in East Asian implementations, I guess everybody has decided
that we should start off the new century with a brand-spanking new
set of ways to shoot ourselves in both feet at the same time for
Chinese implementations.
--Ken