Moving to Unicode internally
Stephen J. Turnbull
stephen at xemacs.org
Mon Oct 4 23:22:43 EDT 2004
>>>>> "Ben" == Ben Wing <ben at 666.com> writes:
Ben> That means that a Mule build shouldn't go much slower, if at
Ben> all, than a non-Mule build when working only with Latin-1
Ben> characters. This is something we should do before worrying
Ben> about switching the internal format to UTF-anything. In fact
Ben> this should already be the case with pure 7-bit ASCII files
Ben> due to the simple caching in charbpos_to_bytebpos or
AFAICT there is no optimization that knows whether the file is
constant-width or pure ASCII. It is known that there are
important special cases where Mule still sucks badly (VM summary
generation is one), so presumably the caching is not good enough.
Things like font-lock trip on this, occasionally.
Steve Y AFAIK is still suffering from a terminal case of the slows.
Since he's on Linux, I have to suspect his homebrew environment and
penchant for cranking up gcc optimization have something to do with
it; I don't see the problems he does on Linux with slower hardware
(which is also running heavy stuff in the background). But he's not
imagining things, he's got timings that are horrifying. And they're
bad with vanilla build settings, too.
Is it really worth trying to improve caching?
IMO, we need the fixed-width buffer optimization. As you point out,
strings aren't so important.
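To make the point concrete, here is a minimal sketch (not XEmacs code; the function name comes from the thread, but the struct and the encoding logic are illustrative assumptions) of what a fixed-width fast path buys you: when a buffer is known to contain only single-byte characters, character-to-byte position conversion is a no-op rather than a scan or a cache lookup.

```c
#include <stddef.h>

/* Hypothetical buffer descriptor: text plus a flag recording whether
   every character in it is a single byte (pure ASCII / fixed width). */
struct buffer {
    const unsigned char *text;   /* internal-format bytes */
    size_t byte_len;
    int known_fixed_width;       /* 1 => charpos == bytepos */
};

/* Convert a 0-based character position to a byte position.  With the
   fixed-width flag set this is O(1); otherwise fall back to a linear
   scan over lead bytes (shown here for a UTF-8-like encoding). */
static size_t charbpos_to_bytebpos(const struct buffer *b, size_t charpos)
{
    if (b->known_fixed_width)
        return charpos;          /* fast path: no scan, no cache */

    size_t bytepos = 0, chars = 0;
    while (chars < charpos && bytepos < b->byte_len) {
        unsigned char c = b->text[bytepos];
        /* advance by the length implied by the lead byte */
        bytepos += (c < 0x80) ? 1 : (c < 0xE0) ? 2 : (c < 0xF0) ? 3 : 4;
        chars++;
    }
    return bytepos;
}
```

The point of the flag is that it can be maintained incrementally on insertion and only invalidated when a multi-byte character first appears, so pure-ASCII buffers never pay for Mule at all.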
Ben> Changing the default 8-bit format from Mule-internal to UTF-8
Ben> is orthogonal to the issue of changing the IByte *
Ben> representation. It's this issue that "Moving to Unicode
Ben> internally" signifies to me. As Stephen pointed out, I wrote
Ben> the Mule code so that the vast majority of XEmacs code
Ben> doesn't care about the internal representation. Hence, this
Ben> kind of format switch should not be a very big deal. (Didn't
Ben> someone already do this in fact in the UTF-2000 project?)
Yes. The big issue is converting back to font indices that the
system understands. I still haven't figured out how to get Xft to eat
Unicode, but for X11 fonts the functions from unicode.c are fine.
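For reference, the low-level operation such a format switch turns on is just this: an Ichar-to-IByte* layer that emits UTF-8 instead of Mule-internal. The sketch below is illustrative, not XEmacs code; it is the standard UTF-8 encoding of a code point, which is what the internal representation would store per character.

```c
/* Encode a Unicode code point (< 0x110000) as UTF-8 bytes.
   Returns the number of bytes written to out (1-4). */
static int encode_utf8(unsigned int cp, unsigned char out[4])
{
    if (cp < 0x80) {                     /* ASCII: identical bytes */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000) {                  /* 3 bytes */
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);          /* 4 bytes */
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}
```

Note that ASCII survives unchanged, which is why most of the existing byte-oriented code wouldn't care; the hard part, as above, is at the display boundary where code points must be mapped back to whatever indices the font system wants.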
Ben> However, this kind of change brings up the issue of how you
Ben> encode language-specific tagging. Currently this is done in
Ben> the charset (yuck), and doing it using extents would be
Ben> better, but a fair amount of work.
I think that recording undo information is (a) more important and (b)
easily generalizes to language tagging. But for the moment can't we
just say it's a YAGNI? The only people who really care would probably
like to use XEmacs/CHISE anyway (it has special functions for database
access to character property data, etc), except Olivier Galibert. ;-)
Ben> Besides, the tables aren't that big.
Well, they're not all that small, either. Under 1MB, with only the
CNS charsets we can't yet (?) represent left out. 5% < charsets < 10%
of total XEmacs (dumped) with all the trimmings. Optimized for size,
it's probably about 10%.
Still, most people can probably live without any of the Oriental
charsets, which would cut it to under 100KB, and it's easy enough to
customize by simply commenting out in the charset loader. I think
that's easy enough ... if you don't want space badly enough to edit
that list, you don't want it badly enough. ;-)
Ben> Sounds like something Jamie either made up or would claim to
Ben> have made up.
Now that's .sig material!
Institute of Policy and Planning Sciences