Hello everyone,
Regarding the GB18030 controversy, does anyone have a reasonably complete picture of it?
Danny
Date: Fri, 20 Oct 2000 11:56:55 +0900
From: "Martin J. Duerst" <duerst@w3.org>
Reply-To: li18nux@li18nux.org
Subject: [li18nux:753] Fwd: RE: GB18030 summary and issues
To: li18nux@li18nux.org
With the permission of the author, I'm sending you a comment on the GB18030 mapping table that appeared on this list some time ago.
Regards, Martin.
X-UML-Sequence: 5977 (2000-10-17 00:36:44 GMT)
From: Kenneth Whistler <kenw@sybase.com>
Date: Mon, 16 Oct 2000 16:36:41 -0800 (GMT-0800)
Subject: RE: GB18030 summary and issues
I've taken a look at the GB18030.TXT you provided, and unfortunately, as it stands, the mapping table has *major* problems.
Most of these problems really derive from the serious flaws in GB 18030-2000 itself, so I'm not sure exactly what implementers are going to do about them, but so you can focus in on the issues, here is some of what I turned up.
A. GB 18030's encoding and mapping of Annex B (p. 91) -- ideographic variation indicator, and the ideographic description characters, is flat-out wrong. The same thing applies to Annex C (p. 92), the CJK radicals supplement. Essentially, the relevant Chinese committee rushed this thing to publication without having determined where these characters were encoded in 10646, *despite* the fact that GB 18030 then makes normative mappings to the entirety of 10646-1:2000 (actually to GB 13000.1, but that is just a pointer to 10646-1:2000, unless they printed *that* wrong, too, in which case we are even more screwed up). The result is just out-and-out errors. To wit:
- U+303E (GB18030 A989) is mapped to U+E7E7 (user-defined)
The net result in GB18030.TXT is that GB A989 is mapped into private use, even though in the chart it is shown as U+303E. But U+303E, as a *code position*, is mapped to the 4-byte form 0x8139A634.
- U+2FF0..U+2FFB (GB18030 A98A..A995) are mapped correctly in the main tables of GB18030 (p. 82), but are mapped again incorrectly in Annex C (U+E7E8..U+E7F3, user-defined).
The net result in GB18030.TXT is that all the ideographic description characters are double-mapped.
- U+2E80..U+2EF3, the CJK radicals supplement, are mapped haphazardly, apparently from an earlier draft: GB18030 FE50..FEA0 is mapped to U+E815..U+E864, instead of the actual Unicode code points. In addition, some of the characters in Annex C are actually in Vertical Extension A, resulting in gaps in the tables.
The net result in GB18030.TXT is that all the CJK radicals and other characters in Annex C are double-mapped.
B. GB 18030 makes the mistake of trying to encode all code positions in GB 13000.1 (= 10646-1:2000), regardless of their status. That means, among other things, that all private use code positions in Unicode on the BMP are given GB 18030 code assignments -- *regardless* of their status in GB 18030 as assigned characters or not. This makes a complete hash, compounded by the fact that all the characters mentioned in A above are erroneously assigned to private use codes in Unicode. That renders the mapping of the rest of user space trash.
C. As an extension of B., GB 18030 also maps surrogate code positions to GB 18030 4-byte codes, *as if* they were characters. Thus U+D800 (a surrogate code point, not an unassigned character) is mapped to 0x8336C739, no differently from U+D7FF (an unassigned character position) being mapped to 0x8336C738.
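The distinction drawn in B and C can be sketched in a few lines of Python. This is only an illustration: `classify` is a hypothetical helper, not something from GB18030.TXT or any library. It relies on the fixed Unicode ranges for surrogates (U+D800..U+DFFF) and BMP private use (U+E000..U+F8FF), and lumps everything else together as "other".

```python
def classify(cp):
    """Classify a BMP code position by status (hypothetical helper)."""
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"      # not a character at all
    if 0xE000 <= cp <= 0xF8FF:
        return "private-use"    # BMP private use area
    return "other"              # assigned or unassigned character position


# Code positions mentioned in this message.
for cp in (0x303E, 0xD7FF, 0xD800, 0xE7E7, 0xE815, 0xE864):
    print(f"U+{cp:04X}: {classify(cp)}")
```

A mapping table that treats all three classes identically loses exactly the information that a check like this recovers.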
Incidentally, there appears to be an off-by-one error in this area in GB18030.TXT as well: GB18030.TXT shows 0x8336C830 = U+D800, whereas the printed text of the GB18030-2000 standard itself shows 0x8336C739 = U+D800.
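To see why this is literally an off-by-one, it may help to put the 4-byte codes in linear order. The sketch below assumes only the 4-byte structure of GB 18030 (bytes 1 and 3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39); `four_byte_ordinal` is a hypothetical name for illustration.

```python
def four_byte_ordinal(code):
    """Return the position of a 4-byte GB 18030 code in linear order."""
    b1, b2, b3, b4 = code.to_bytes(4, "big")
    if not (0x81 <= b1 <= 0xFE and 0x30 <= b2 <= 0x39
            and 0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
        raise ValueError(f"not a valid 4-byte code: 0x{code:08X}")
    # Mixed radix: 126 values for bytes 1 and 3, 10 values for bytes 2 and 4.
    return (((b1 - 0x81) * 10 + (b2 - 0x30)) * 126 + (b3 - 0x81)) * 10 + (b4 - 0x30)


standard = four_byte_ordinal(0x8336C739)  # printed standard: U+D800
table = four_byte_ordinal(0x8336C830)     # GB18030.TXT: U+D800
print(table - standard)
```

Because the last byte only runs from 0x30 to 0x39, 0x8336C830 is the code immediately following 0x8336C739, so the two sources disagree by exactly one position.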
I'm not sure what the solution here is, other than to encourage China to fix its $@&#*^! standard. But if the tables you posted have in fact already been rolled out in Linux implementations in China, then we are all going to have to live with horrendous interoperability problems resulting from bad mapping tables for bad standards.
Here it is the year 2000, and having lived with the yen/backslash problem and the fullwidth tilde problem, and the not sign problem for decades in East Asian implementations, I guess everybody has decided that we should start off the new century with a brand-spanking new set of ways to shoot ourselves in both feet at the same time for Chinese implementations.
--Ken