a couple of thoughts about codings in 21.4

Friday, 17 July 2009

I've recently had all the non-UTF8 non-ASCII mail in my folders corrupted,
irrecoverably so (short of searching through many days' backups, which
I can't do myself). The cause of the corruption is bugs in VM, exposed
by my switching all my coding system defaults to utf-8. The reason
it's irrecoverable is the putrid pile of dingos' kidneys that is
mule-ucs, and in particular the way it does no validity checking at
all when it decodes alleged utf-8 (rather than copying the invalid
bytes into the buffer as Latin1, as the ISO2022, SJIS and Big5 methods
do).

This caused me to observe:
(1) 21.4(.22) does have the necessary infrastructure to handle UTF8
    itself for the BMP: it has UTF8 coding, it has mule-to-ucs-table
    and ucs-to-mule-table and uses them in the C. So, with a fairly
    small amount of work, plus the use of 9 private 2D charsets (for
    which I had to lose chinese-isoir165 and ethiopic, which is frankly
    no loss), one can implement UTF8 for the entire BMP in Lisp
    without having to touch mule-ucs at all.
    To me, this sounds like an improvement that could be shipped
    with 21.4 to make it more robust. However, ...
(2) The C routine coding_decode_utf8 *also* doesn't do any validity
    checking! Who's responsible for that, eh?
    This should be fixed, which I will do instanter (I already wrote
    the code for my (currently suspended) pure Unicode fork anyway).
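
For illustration only, here is a minimal sketch of the kind of validity
checking meant in (2), with the Latin1-style fallback described above for
the other decoders. It is not the code from the fork or from XEmacs; the
function name utf8_decode_checked and its signature are hypothetical. It
rejects stray continuation bytes, truncated sequences, overlong forms,
surrogates and values above U+10FFFF, and on any failure consumes exactly
one byte and hands it back unchanged.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of XEmacs.
   Decodes one UTF-8 sequence from s (n bytes available).
   Returns the number of bytes consumed (0 only when n == 0).
   On success: *out is the code point, *valid is 1.
   On failure: one byte is consumed, *out is that byte's value
   (the Latin1 fallback), and *valid is 0. */
static size_t utf8_decode_checked(const unsigned char *s, size_t n,
                                  uint32_t *out, int *valid)
{
    /* Smallest code point that legitimately needs a sequence of each length. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    uint32_t cp;
    size_t len, i;

    *valid = 0;
    if (n == 0) {
        *out = 0;
        return 0;
    }
    *out = s[0];                      /* fallback value if anything fails */

    if (s[0] < 0x80) {                /* plain ASCII */
        *valid = 1;
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; cp = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; cp = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; cp = s[0] & 0x07;
    } else {
        return 1;                     /* stray continuation byte or 0xF8..0xFF */
    }

    if (len > n)
        return 1;                     /* truncated sequence */

    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 1;                 /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (cp < min_cp[len])
        return 1;                     /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 1;                     /* UTF-16 surrogate */
    if (cp > 0x10FFFF)
        return 1;                     /* beyond the Unicode range */

    *out = cp;
    *valid = 1;
    return len;
}
```

Consuming only one byte on failure means a Latin1 byte followed by valid
text does not swallow the text after it, so the original bytes stay
recoverable from the buffer.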

Any interest in having these in 21.4? (It is still the advertised
stable branch!)

Secondly, I also find it essential nowadays (if I could keep my mail
uncorrupted) to handle GB18030. So does anybody in China. So I
implemented that in C, using a mapping table to Unicode.
Do you want that? (It should be almost the same in 21.5.)

On that topic, it's a sad truth that PRC-locale software
(especially that made by Microsoft) advertises text as GB2312 when in
fact it's GBK or even GB18030. This is just too big a fact to
ignore. So what I would like to do is arrange that my "gb2312" coding
system actually decodes GB18030 on read, but correctly only puts out
real GB2312 on write. I can't see any easy way to arrange this in
Lisp. Is there one?
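
To make the mislabelling concrete, here is a rough sketch in C of how the
three encodings can be told apart by byte ranges alone. It does not answer
the Lisp question and is not an XEmacs coding-system hook; gb_classify and
the gb_level enum are hypothetical names. It assumes the usual structural
ranges: GB2312 (EUC-CN) lead 0xA1..0xF7 with trail 0xA1..0xFE, GBK lead
0x81..0xFE with trail 0x40..0xFE except 0x7F, and GB18030 four-byte
sequences with second and fourth bytes in 0x30..0x39. It checks structure
only, not whether a code point is actually assigned.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum gb_level {
    GB_INVALID = -1,   /* not structurally valid GB18030 */
    GB_ASCII   = 0,    /* only bytes below 0x80 */
    GB_2312    = 1,    /* fits in the GB2312 (EUC-CN) byte ranges */
    GB_K       = 2,    /* needs GBK two-byte ranges */
    GB_18030   = 3     /* needs GB18030 four-byte sequences */
};

/* Returns the smallest of the nested encodings whose byte ranges
   cover every sequence in s[0..n-1]. */
static enum gb_level gb_classify(const unsigned char *s, size_t n)
{
    enum gb_level level = GB_ASCII;
    size_t i = 0;

    while (i < n) {
        unsigned char b1 = s[i], b2;

        if (b1 < 0x80) {              /* ASCII */
            i += 1;
            continue;
        }
        if (b1 == 0x80 || b1 == 0xFF || i + 1 >= n)
            return GB_INVALID;        /* bad lead byte or truncated */

        b2 = s[i + 1];
        if (b2 >= 0x30 && b2 <= 0x39) {
            /* four-byte form: lead, digit, lead, digit */
            if (i + 3 >= n)
                return GB_INVALID;
            if (s[i + 2] < 0x81 || s[i + 2] > 0xFE ||
                s[i + 3] < 0x30 || s[i + 3] > 0x39)
                return GB_INVALID;
            level = GB_18030;         /* highest level, nothing can exceed it */
            i += 4;
        } else if (b2 >= 0x40 && b2 <= 0xFE && b2 != 0x7F) {
            /* two-byte form: GB2312 if both bytes are in its ranges */
            if (b1 >= 0xA1 && b1 <= 0xF7 && b2 >= 0xA1) {
                if (level < GB_2312)
                    level = GB_2312;
            } else if (level < GB_K) {
                level = GB_K;
            }
            i += 2;
        } else {
            return GB_INVALID;        /* bad trail byte */
        }
    }
    return level;
}
```

Because each encoding's ranges contain the previous one's, a decoder built
for GB18030 reads genuine GB2312 unchanged, while a writer could refuse
anything for which this check returns more than GB_2312.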

Julian.

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
