Regarding GB18030: does anyone have a fairly complete understanding of it?
Date: Fri, 20 Oct 2000 11:56:55 +0900
From: "Martin J. Duerst" <duerst(a)w3.org>
Subject: [li18nux:753] Fwd: RE: GB18030 summary and issues
With the permission of the author, I'm sending you a comment on
the GB18030 mapping table that appeared on this list
some time ago.
>X-UML-Sequence: 5977 (2000-10-17 00:36:44 GMT)
>From: Kenneth Whistler <kenw(a)sybase.com>
>Date: Mon, 16 Oct 2000 16:36:41 -0800 (GMT-0800)
>Subject: RE: GB18030 summary and issues
>I've taken a look at the GB18030.TXT you provided, and unfortunately,
>as it stands, the mapping table has *major* problems.
>Most of these problems really derive from the serious flaws in GB 18030-2000
>itself, so I'm not sure exactly what implementers are going to
>do about them, but so you can focus in on the issues, here is some
>of what I turned up.
>A. GB 18030's encoding and mapping of Annex B (p. 91) -- ideographic
>variation indicator, and the ideographic description characters, is
>flat-out wrong. The same thing applies to Annex C (p. 92), the CJK
>radicals supplement. Essentially, the relevant Chinese committee rushed this
>thing to publication without having determined where these characters
>were encoded in 10646, *despite* the fact that GB 18030 then makes
>normative mappings to the entirety of 10646-1:2000 (actually to
>GB 13000.1, but that is just a pointer to 10646-1:2000, unless they
>printed *that* wrong, too, in which case we are even more screwed up).
>The result is just out-and-out errors. To wit:
> 1. U+303E (GB18030 A989) is mapped to U+E7E7 (user-defined)
>The net result in GB18030.TXT is that GB A989 is mapped into private use,
>even though in the chart it is shown as U+303E. But U+303E, as a *code
>position*, is mapped to the 4-byte form 0x8139A634.
> 2. U+2FF0..U+2FFB (GB18030 A98A..A995) are mapped correctly in the
> main tables of GB18030 (p. 82), but are mapped again incorrectly
> in Annex B (U+E7E8..U+E7F3, user-defined).
>The net result in GB18030.TXT is that all the ideographic description
>characters are double-mapped.
> 3. U+2E80..U+2EF3, the CJK radicals supplement, are mapped haphazardly,
> apparently from an earlier draft: GB18030 FE50..FEA0 is mapped
> to U+E815..U+E864, instead of the actual Unicode code points. In
> addition, some of the characters in Annex C are actually in
> Vertical Extension A, resulting in gaps in the tables.
>The net result in GB18030.TXT is that all the CJK radicals and
>other characters in Annex C are double-mapped.
>B. GB 18030 makes the mistake of trying to encode all code positions
>in GB 13000.1 (= 10646-1:2000), regardless of their status. That
>means, among other things, that all private use code positions
>in Unicode on the BMP are given GB 18030 code assignments --
>*regardless* of their status in GB 18030 as assigned characters or
>not. This makes a complete hash, compounded by the fact that all the
>characters mentioned in A above are erroneously assigned to private
>use codes in Unicode. That renders the mapping of the rest of user
>C. As an extension of B., GB 18030 also maps surrogate code positions
>to GB 18030 4-byte codes, *as if* they were characters. Thus U+D800
>(a surrogate code point, not an unassigned character) is mapped to
>0x8336C739, no differently from U+D7FF (an unassigned character
>position) being mapped to 0x8336C738.
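The distinction drawn in B and C (assigned character, private use, surrogate, unassigned) can be checked mechanically. The sketch below uses Python's standard `unicodedata` module; `code_point_status` is a hypothetical helper name. It reports the status according to the Unicode version bundled with the Python interpreter, which is an assumption here and not the same thing as 10646-1:2000.

```python
import unicodedata

def code_point_status(cp: int) -> str:
    """Classify a code point by its Unicode general category."""
    category = unicodedata.category(chr(cp))
    return {
        "Cs": "surrogate",    # U+D800..U+DFFF, never a character
        "Co": "private use",  # e.g. U+E000..U+F8FF on the BMP
        "Cn": "unassigned",   # reserved code points and noncharacters
    }.get(category, "assigned")

# Code points mentioned in the message above.
for cp in (0x303E, 0xE7E7, 0xD7FF, 0xD800):
    print(f"U+{cp:04X}: {code_point_status(cp)}")
```

A mapping-table audit could run every Unicode target through a function like this and treat surrogate targets as errors outright, while private use and unassigned targets get a closer look.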
>Incidentally, there appears to be an off-by-one error in this area in
>GB18030.TXT as well: GB18030.TXT shows 0x8336C830 = U+D800, whereas
>the printed text of the GB18030-2000 standard itself shows
>0x8336C739 = U+D800.
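The off-by-one can be confirmed with plain arithmetic on the four-byte structure: the first and third bytes range over 0x81..0xFE (126 values) and the second and fourth over 0x30..0x39 (10 values), so four-byte codes can be numbered consecutively. The following is a minimal sketch with hypothetical function names. It shows only that the two codes are adjacent, not which of the two sources is right.

```python
def gb4_to_linear(code: int) -> int:
    """Number a four-byte GB 18030 code, counting from 0x81308130 = 0."""
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a four-byte GB 18030 code: {code:#010X}")
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)

def linear_to_gb4(index: int) -> int:
    """Inverse of gb4_to_linear."""
    index, b4 = divmod(index, 10)
    index, b3 = divmod(index, 126)
    b1, b2 = divmod(index, 10)
    return int.from_bytes(bytes([b1 + 0x81, b2 + 0x30, b3 + 0x81, b4 + 0x30]), "big")

printed = gb4_to_linear(0x8336C739)   # U+D800 per the printed standard
in_table = gb4_to_linear(0x8336C830)  # U+D800 per GB18030.TXT
print(in_table - printed)             # the two codes are one position apart
print(f"{linear_to_gb4(printed + 1):08X}")
```

Because the fourth byte wraps from 0x39 to 0x30 while the third byte increments, 0x8336C739 and 0x8336C830 are neighbours even though they look far apart in hex.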
>I'm not sure what the solution here is, other than to encourage China
>to fix its $@&#*^! standard. But if the tables you posted have in
>fact already been rolled out in Linux implementations in China, then
>we are all going to have to live with horrendous interoperability
>problems resulting from bad mapping tables for bad standards.
>Here it is the year 2000, and having lived with the yen/backslash
>problem and the fullwidth tilde problem, and the not sign problem for
>decades in East Asian implementations, I guess everybody has decided
>that we should start off the new century with a brand-spanking new
>set of ways to shoot ourselves in both feet at the same time for