a couple of thoughts about codings in 21.4
jcb+xeb at jcbradfield.org
Fri Jul 17 16:58:02 EDT 2009
I've recently had all the non-UTF8 non-ASCII mail in my folders corrupted,
irrecoverably so (short of searching through many days' backups, which
I can't do myself). The cause of the corruption is bugs in VM, exposed
by my switching all my coding system defaults to utf-8. The reason
it's irrecoverable is the putrid pile of dingos' kidneys that is
mule-ucs, and in particular the way it does no validity checking at
all when it decodes alleged utf-8 (rather than copying the invalid
bytes into the buffer as Latin1, as the ISO2022, SJIS and Big5 methods do).
This caused me to observe:
(1) 21.4(.22) does have the necessary infrastructure to handle UTF8
itself for the BMP: it has UTF8 coding, it has mule-to-ucs-table
and ucs-to-mule-table and uses them in the C. So, with a fairly
small amount of work, plus the use of 9 private 2D charsets (for
which I had to lose chinese-isoir165 and ethiopic, which is frankly
no loss), one can implement UTF8 for the entire BMP in Lisp
without having to touch mule-ucs at all.
To me, this sounds like an improvement that could be shipped
with 21.4 to make it more robust. However, ...
(2) The C routine coding_decode_utf8 *also* doesn't do any validity
checking! Who's responsible for that, eh?
This should be fixed, which I will do instanter (I already wrote
the code for my (currently suspended) pure Unicode fork anyway).
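To make "validity checking" concrete, here is a minimal standalone sketch
of the kind of check I mean. It is not the actual coding_decode_utf8 code
and not a patch against it; the function names utf8_decode_one and
decode_utf8_lenient are made up for illustration. It rejects stray
continuation bytes, overlong forms, surrogates and values above U+10FFFF,
and passes each offending byte through as its Latin1 value, one byte at a
time, so that nothing is silently destroyed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the length (1..4) of the well-formed UTF-8
   sequence starting at p and store its scalar value in *cp, or return 0
   if the bytes at p are not valid UTF-8. */
size_t utf8_decode_one(const unsigned char *p, size_t avail, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    if (avail == 0)
        return 0;
    if (p[0] < 0x80) {                      /* plain ASCII */
        *cp = p[0];
        return 1;
    } else if (p[0] >= 0xC2 && p[0] <= 0xDF) {
        len = 2; c = p[0] & 0x1F; min = 0x80;
    } else if ((p[0] & 0xF0) == 0xE0) {
        len = 3; c = p[0] & 0x0F; min = 0x800;
    } else if (p[0] >= 0xF0 && p[0] <= 0xF4) {
        len = 4; c = p[0] & 0x07; min = 0x10000;
    } else {
        return 0;       /* stray continuation, 0xC0/0xC1, or 0xF5..0xFF */
    }
    if (avail < len)
        return 0;                           /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                       /* not a continuation byte */
        c = (c << 6) | (uint32_t)(p[i] & 0x3F);
    }
    if (c < min)
        return 0;                           /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                           /* UTF-16 surrogate */
    if (c > 0x10FFFF)
        return 0;                           /* beyond Unicode */
    *cp = c;
    return len;
}

/* Hypothetical decoder: valid sequences become code points; each invalid
   byte is emitted as the Latin1 character with the same value.
   dst must have room for at least n entries. Returns the count written. */
size_t decode_utf8_lenient(const unsigned char *src, size_t n, uint32_t *dst)
{
    size_t i = 0, out = 0;

    while (i < n) {
        uint32_t cp;
        size_t len = utf8_decode_one(src + i, n - i, &cp);
        if (len == 0) {
            dst[out++] = src[i];            /* keep the raw byte as Latin1 */
            i += 1;
        } else {
            dst[out++] = cp;
            i += len;
        }
    }
    return out;
}
```

A real decoder inside the editor would of course have to work on a stream
and carry state across buffer boundaries, which this sketch ignores.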
Any interest in having these in 21.4? (It is still the advertised
Secondly, I also find it essential nowadays (if I could keep my mail
uncorrupted) to handle GB18030. So does anybody in China. So I
implemented that in C, using a mapping table to Unicode.
Do you want that? (It should be almost the same in 21.5.)
On that topic, it's a sad truth that PRC-locale software
(especially that made by Microsoft) advertises text as GB2312 when in
fact it's GBK or even GB18030. This is just too big a fact to
ignore. So what I would like to do is arrange that my "gb2312" coding
system actually decodes GB18030 on read, but correctly only puts out
real GB2312 on write. I can't see any easy way to arrange this in
Lisp. Is there one?
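For anyone unsure what the distinction amounts to at the byte level, the
following rough sketch classifies a sequence by its byte ranges alone. It
does not answer the Lisp question, and it is not the C implementation
mentioned above: the names gb_classify and gb2312_writable are invented
for illustration, and a real coding system would consult the mapping
table rather than trust the ranges, since not every cell in the GB2312
range is assigned.

```c
#include <stddef.h>

enum gb_kind {
    GB_INVALID,       /* not a well-formed sequence */
    GB_ASCII,         /* single byte below 0x80 */
    GB_2312,          /* two bytes inside the GB2312 (EUC-CN) range */
    GB_GBK_EXT,       /* two bytes valid only in GBK/GB18030 */
    GB_18030_FOUR     /* GB18030 four-byte form */
};

/* Hypothetical helper: classify the sequence starting at p by byte
   ranges only, storing the number of bytes consumed in *len. */
enum gb_kind gb_classify(const unsigned char *p, size_t avail, size_t *len)
{
    unsigned char b1;

    if (avail == 0) {
        *len = 0;
        return GB_INVALID;
    }
    *len = 1;
    if (p[0] < 0x80)
        return GB_ASCII;
    if (p[0] < 0x81 || p[0] > 0xFE || avail < 2)
        return GB_INVALID;

    b1 = p[1];
    if (b1 >= 0x30 && b1 <= 0x39) {
        /* a digit as second byte can only start a four-byte sequence */
        if (avail >= 4 && p[2] >= 0x81 && p[2] <= 0xFE
            && p[3] >= 0x30 && p[3] <= 0x39) {
            *len = 4;
            return GB_18030_FOUR;
        }
        return GB_INVALID;
    }
    if (b1 < 0x40 || b1 == 0x7F || b1 > 0xFE)
        return GB_INVALID;

    *len = 2;
    /* GB2312 proper: lead byte 0xA1..0xF7, trail byte 0xA1..0xFE */
    if (p[0] >= 0xA1 && p[0] <= 0xF7 && b1 >= 0xA1)
        return GB_2312;
    return GB_GBK_EXT;
}

/* The asymmetry wanted: accept every kind on read, but allow only
   ASCII and real GB2312 on write. */
int gb2312_writable(enum gb_kind k)
{
    return k == GB_ASCII || k == GB_2312;
}
```

The point of gb2312_writable is just the asymmetry: the reader accepts
anything gb_classify recognises, while the writer refuses (or substitutes
for) anything outside ASCII and GB2312 proper.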