I think it's about time I came clean about why I've been finding bugs
and asking strange questions recently...

For a very long time, I've been almost but not quite perfectly happy with
XEmacs 21.4. There are two major bummers: the catastrophic slowness in
sequential access to large buffers; and the lack of true Unicode
support. (mule-ucs is pretty good, but requires some set-up, and loses
data if you get your set-up wrong, or find a file with a codepoint you
didn't know you had to deal with. mule-ucs is also really disgusting
to maintain. ccl, ack, splt.) The slowness I could now fix, by
backporting the 21.5.21 improvements to the bufbyte-charpos cache;
this message is about the other issue. (It might be worth saying that
I want extensive multi-lingual support because I write about languages
and non-English things, not because I actually do any non-trivial
document processing in anything other than English. So I have no need
to process a 100MB Japanese document efficiently, say.)

XEmacs 21.5 ought to be an improvement, given the massive changes made
in it, and the many MULE improvements, but for me it isn't enough.
The Unicode support is bolted on in a rather horrible fashion, albeit
much more robustly than via mule-ucs; and it doesn't allow one to
distinguish between Unicode and legacy stuff (and since I have some
sympathies with the philosophy behind UTF-2000, and specifically
sometimes want to talk about different hanzi shapes in CJK, that
matters to me). Moreover, 21.5 is (for the most part) slower, bigger,
and behaves differently in subtle ways that take a lot of finding in
my single most important application, VM. (My impression from this
list is that these problems are mostly VM's fault, but that doesn't
make them any quicker to fix.)

So for a long time I've been contemplating modifying XEmacs to be
natively UCS and UTF-8, while also preserving perfect backward
compatibility with the existing 21.4 way of doing things.

Three weeks ago I stopped contemplating and started wasting my
evenings (and my days off...) coding;-)

I chose to start from 21.4, both for the reasons above, and because
the 21.4 code base is somewhat less hairy than the 21.5 code base.
Having dealt with it for 21.4, doing the same for 21.5 should not be

What I've done so far:
* Changed the Emchar so that it is either a UCS codepoint, or an
old-style Mule character with a high bit set.
* Changed the Bufbyte encoding so that it is UTF-8, extended to
encode old-style Mule characters using some of the spare leading
bytes. (Since UCS was restricted to 17 planes, the bytes F5 to
FF are no longer valid leading bytes in UTF-8, so I'm using some
of them in an ad-hoc encoding for legacy characters. This means I'm
taking three bytes for mule official 1-D characters, and four bytes
for the rest, but I'm not too worried about that.)
* Introduced a new type of charset, the ucs charset, to represent
UCS characters. (I've stolen LEADING_BYTE_COMPOSITE for it, since I
don't believe XEmacs is ever going to support composite characters,
and I don't think it ever should.) For legacy compatibility, the
old charsets ascii, control-1 and iso-8859-1 are hard-wired-ly
identified with the first 256 points of ucs; all the other mule
charsets are completely distinct from the ucs charset.
* Made the necessary changes through the rest of the C code for this
to work "as expected". (The C code has far too many numerical
constants that should be #defines, by the way... Finding every
occurrence of 128 meaning MIN/NUM_LEADING_BYTES, or even worse,
4 meaning number of charset types, was acutely painful!)
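
To make the first two items above a little more concrete, here is a
minimal sketch in C of a tagged character type and an extended UTF-8
lead-byte classifier. It is an illustration only, not the actual patch:
the names emchar_t, LEGACY_FLAG, LEAD_LEGACY_3 and LEAD_LEGACY_4, the
choice of bit 30 as the flag, and the choice of F5 and F6 as the legacy
lead bytes are all assumptions made for the example. Surrogates are not
rejected, and the layout of the bytes following a legacy lead byte is
deliberately left out.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t emchar_t;          /* hypothetical stand-in for Emchar */

#define LEGACY_FLAG   0x40000000u   /* assumed tag bit for an old-style Mule character */
#define UCS_MAX       0x10FFFFu     /* last codepoint of the 17 UCS planes */
#define LEAD_LEGACY_3 0xF5u         /* assumed lead byte: 3-byte legacy sequence */
#define LEAD_LEGACY_4 0xF6u         /* assumed lead byte: 4-byte legacy sequence */

/* A character is "legacy" if the tag bit is set, otherwise a UCS codepoint. */
static int emchar_is_legacy(emchar_t c) { return (c & LEGACY_FLAG) != 0; }
static int emchar_is_ucs(emchar_t c)    { return c <= UCS_MAX; }

/* Length of the sequence introduced by a lead byte, or 0 if the byte
   cannot start a sequence (continuation byte or unused lead byte). */
static int seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;                    /* ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;   /* standard UTF-8 */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    if (lead == LEAD_LEGACY_3) return 3;          /* ad-hoc legacy extension */
    if (lead == LEAD_LEGACY_4) return 4;
    return 0;
}

/* Encode a UCS codepoint as standard UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if c is not a UCS codepoint. */
static size_t encode_ucs(emchar_t c, unsigned char *out)
{
    if (!emchar_is_ucs(c)) return 0;
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    if (c < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (c >> 12));
        out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (c & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (c >> 18));
    out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (c & 0x3F));
    return 4;
}
```

The point of the sketch is that any code which only needs to step over
characters can keep using the lead byte alone to find the sequence
length, whether the character is UCS or legacy.
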
Where I've got to now (or to be honest, where I will have got to
tomorrow - one or two things to fix up) is an XEmacs which (modulo bugs -
I haven't yet had the nerve to try it out in my actual main desktop)
is fully compatible with 21.4.21 at the Lisp and ccl level, as long as you
don't actually look at the integer representation of a character to
see the high bit flag - if you load mule-ucs, all your Unicode support
will happen the old way - but can also treat Unicode natively (in
which case it needs an iso10646-1 font).

What I still need to do to get to where I want to be:
* Embrace and extend some of the 21.5 chartab and Unicode code so
that I can do efficient automatic conversion between ucs and mule
characters as required (needed in particular for quail to work in
utf, since at present decoding an iso2022 file produces mule
characters, not ucs characters); and so that search works properly
with translation tables etc., as I believe it does in 21.5.
This will also allow ucs characters to be displayed in legacy fonts
via a display table; and allow search to identify characters that
are the same in ucs.
* Check performance. For obvious reasons, I have all error-checking
and debugging switched on, with asserts everywhere, and
I dump core at the slightest provocation. It's a bit slow like
that, and I need to check that I haven't made things much slower in
* Put back some search optimizations (I switched off boyer-moore and
fastmaps since I haven't thought how to deal with them; though
frankly I'm not convinced it's worth the effort these days!).
* Provide some ethnic cleansing options to force all mule characters
to be converted to ucs on sight; or not; or the other way round.
* (one day far in the future) bidi, combining characters, and all the
rest of it.
* all the things that are on the Mule wishlist - fixed width buffers etc.
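
As a rough illustration of the conversion item and the "convert on
sight; or not; or the other way round" options in the list above, here
is a toy sketch in C. Everything in it is hypothetical: the names
unify_policy, char_pair, toy_table and unify_char are invented for the
example, the legacy codes in the table are made up, and the linear scan
stands in for whatever char-table lookup a real implementation would
use.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t emchar_t;          /* hypothetical stand-in for Emchar */
#define LEGACY_FLAG 0x40000000u     /* assumed tag bit for old-style Mule characters */

/* The three behaviours: leave alone, force to ucs, force to legacy. */
enum unify_policy { UNIFY_NONE, UNIFY_TO_UCS, UNIFY_TO_LEGACY };

struct char_pair {
    emchar_t legacy;                /* old-style Mule character (made-up value) */
    emchar_t ucs;                   /* equivalent UCS codepoint */
};

/* Toy mapping; a real table would be built from charset mapping data. */
static const struct char_pair toy_table[] = {
    { LEGACY_FLAG | 0x0101u, 0x03B1u },   /* GREEK SMALL LETTER ALPHA */
    { LEGACY_FLAG | 0x0201u, 0x0430u },   /* CYRILLIC SMALL LETTER A */
};

/* Return c converted according to policy; characters with no known
   counterpart are returned unchanged rather than lost. */
static emchar_t unify_char(emchar_t c, enum unify_policy policy)
{
    size_t i;
    size_t n = sizeof toy_table / sizeof toy_table[0];

    if (policy == UNIFY_NONE)
        return c;
    for (i = 0; i < n; i++) {
        if (policy == UNIFY_TO_UCS && toy_table[i].legacy == c)
            return toy_table[i].ucs;
        if (policy == UNIFY_TO_LEGACY && toy_table[i].ucs == c)
            return toy_table[i].legacy;
    }
    return c;
}
```

Returning unmapped characters unchanged is a design choice made for the
sketch: it keeps the conversion lossless in both directions.
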
Now, this could be a purely private project; or it could be a way that
either SXEmacs or XEmacs might want to consider. I'd be interested in
the views of the XEmacs developers on this idea - especially the
reasons why it's a really bad idea and hasn't already been done!

I guess it'll be a couple more weeks before I'm really ready to let
other people have the code - if you'd like to see it at that point,
drop me a line.

P.S. Note new email address. My department has been forced to give up
its own mail service, and the Edinburgh University mail service is slow,
unreliable, and interferes with mail.

XEmacs-Beta mailing list