>>>> "Hrvoje" == Hrvoje Niksic
Hrvoje> We aren't developing crappy commercial software here, after all.
Right. That's why we should do UCS-4 exactly right, and avoid getting
into "my buffers are bigger" can-you-piss-over-this-line contests.
BTW: I don't care if Zed has .75GB buffers, or you do, or somebody is
simply trying to determine the period of (psychoanalyze-pinhead); I
think they're important. Just not as important to me, and I believe
to Mule-dependent people in general, as UCS-4.
Hrvoje> "Stephen J. Turnbull" <turnbull(a)sk.tsukuba.ac.jp> writes:
Hrvoje> What does direct UCS-4 support in characters buy me
> It buys you direct UCS-4 support in characters.
Hrvoje> And what does that buy me? Are we running in circles
Yes. UCS-4 is going to be the standard unified character type in the
near future. It can handle all foreseeable needs for plain text.
If you aren't excited about UCS-4 itself, then you aren't going to be
excited about direct support. I can't argue that.
But direct support is probably a good idea; otherwise we must use
different representations for redisplay text runes and Lisp
characters. Maybe that's not such a big deal, although it would sure
simplify string processing (text glyphs) to use UCS-4 everywhere.
> o XEmacs will never again (well, for a future longer than the
> history of electronic computers) need to change the abstract
> format for a Lisp character.
Hrvoje> This is true without UCS-4 too. It's only the
Hrvoje> implementation that changes.
Unfortunately, probably not true. Current Mule representation has
probably reached the end of the line; we're running out of code
space. Tomo has already appropriated 3 private leading bytes for Thai
and Vietnamese. Probably another has gone to Ethiopian. Several more
We need to change the internal encoding of Mule characters soon.
Eventually we will go to UCS-4; Unicode is not big enough, and as you
point out, XEmacs does things right when we know how. It will be
fixed width; multibyte representation of single characters is an
Since textual runes are basically Emchars, figuring out how to make
UCS-4 runes work would make the transition to either fixed-width UCS
(2 or 4) buffers (getting rid of all that difficult-to-maintain
caching that Mule must do) or UTF-8 buffers much easier.
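
To illustrate why fixed-width buffers would remove the need for position
caching, here is a minimal C sketch. It is not XEmacs code; both function
names are made up for the example. With UCS-4 the byte offset of the Nth
character is a multiplication, while with UTF-8 it takes a scan over the
text, which is the work a position cache exists to avoid repeating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width UCS-4: character index to byte offset in constant time. */
static size_t ucs4_byte_offset(size_t char_index)
{
    return char_index * sizeof(uint32_t);
}

/* UTF-8: walk the buffer, skipping continuation bytes (10xxxxxx), until
   char_index characters have been passed. Linear in the distance scanned.
   Assumes buf holds well-formed UTF-8 (an assumption of this sketch). */
static size_t utf8_byte_offset(const unsigned char *buf, size_t len,
                               size_t char_index)
{
    size_t off = 0;

    assert(buf != NULL || len == 0);
    while (char_index > 0 && off < len) {
        off++;                                   /* step over the lead byte */
        while (off < len && (buf[off] & 0xC0) == 0x80)
            off++;                               /* and its continuation bytes */
        char_index--;
    }
    return off;
}
```

Mule's variable-width internal format has the same problem as the UTF-8
case, so a UTF-8 buffer would still need some form of the scan or a cache;
only the fixed-width case drops it entirely.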
Let's do it now, let's do it right.
As a practical matter, the internal representation of Mule will be
nearly plain text; we could consider doing Mule autosaves as straight
buffer dumps or as true UCS-4 in about 3 lines of C code. No more
screwing around with coding-systems; that would be purely a service
for the user, and no longer a core Mule functionality (in the sense
that editing functions would no longer have to care about it, only I/O
functions---this is not true today; VM, for example, does not produce
correct autosave files, the ESC character in ISO-2022 sequences gets
duplicated for some subtle reason).
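
A rough sketch of what a "true UCS-4" dump could look like, assuming the
buffer text is already available as an array of 32-bit character codes.
The names encode_ucs4_be and dump_ucs4 are hypothetical, not XEmacs
functions, and the big-endian byte order is an assumption of the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Store one character code as four bytes, most significant byte first. */
static void encode_ucs4_be(uint32_t c, unsigned char out[4])
{
    out[0] = (unsigned char)((c >> 24) & 0xFF);
    out[1] = (unsigned char)((c >> 16) & 0xFF);
    out[2] = (unsigned char)((c >> 8) & 0xFF);
    out[3] = (unsigned char)(c & 0xFF);
}

/* Write n characters to out. Returns 0 on success, -1 on a write error.
   No coding-system is involved: the file is just the character codes. */
static int dump_ucs4(FILE *out, const uint32_t *chars, size_t n)
{
    unsigned char b[4];
    size_t i;

    assert(out != NULL && (chars != NULL || n == 0));
    for (i = 0; i < n; i++) {
        encode_ucs4_be(chars[i], b);
        if (fwrite(b, 1, sizeof b, out) != sizeof b)
            return -1;
    }
    return 0;
}
```

The loop body is the "about 3 lines"; everything else is error handling.
A straight buffer dump would skip even the byte-order step and write the
array with a single fwrite, at the cost of a host-dependent file.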
> o Mule backwards compatibility.
Hrvoje> Lisp code that relies on a particular representation of
Hrvoje> characters is broken in the first place.
Yup. So I see lots and lots of patches aimed at fixing the Lisp
engine, which is broken in the first place because it has default
dynamic scoping, and redisplay, which may not be broken but can't be
fixed in the first place. Not. It's the same issue; we have existing
Mule code and character sets which will need to be supported, and it
will take time to get all of them properly mapped to UCS. And some of
the character sets we support don't have official positions in UCS yet.
We need those bit-30-set private groups.
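
One way to picture the use of the bit-30-set space for charsets that have
no official UCS position yet is sketched below. The packing (a 14-bit
charset id above a 16-bit position) is purely hypothetical and chosen for
the example; it is not a Mule or ISO 10646 layout, and the standard's
actual private-use ranges are not modelled here.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ucs4_t;            /* one character, fixed width */

#define UCS4_PRIVATE_BIT ((ucs4_t)1 << 30)

/* True if the code has bit 30 set, i.e. lies in the region the text
   calls "bit-30-set private groups". */
static int ucs4_is_private(ucs4_t c)
{
    return (c & UCS4_PRIVATE_BIT) != 0;
}

/* Hypothetical packing: bit 30 | 14-bit charset id | 16-bit position. */
static ucs4_t ucs4_from_legacy(unsigned charset_id, unsigned position)
{
    assert(charset_id < (1u << 14));
    assert(position < (1u << 16));
    return UCS4_PRIVATE_BIT | ((ucs4_t)charset_id << 16) | (ucs4_t)position;
}

/* Recover the two fields from a private code. */
static unsigned ucs4_legacy_charset(ucs4_t c)
{
    assert(ucs4_is_private(c));
    return (unsigned)((c >> 16) & 0x3FFFu);
}

static unsigned ucs4_legacy_position(ucs4_t c)
{
    assert(ucs4_is_private(c));
    return (unsigned)(c & 0xFFFFu);
}
```

Under this scheme an unmapped charset keeps a stable, fixed-width code
until it gets an official position, at which point only the mapping table
changes and not the character type.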
> Sure, those may not excite you.
Hrvoje> They don't excite me at all. 700M buffers excite me.
Hrvoje> Whose 700M buffer is this? Zed's. Who's Zed? Zed's dead
Hrvoje> baby. Zed's dead...
Good. Oppose this change then, that's your right. If UCS-4 is
unneeded functionality for others, I'll maintain a private branch of
XEmacs until you all do something that completely breaks it, then I'll
do something else.
But this is the technically right and technically do-able direction for
Mule to go.
It'll make Windows users happy, too; the internal representation will
be the same as their system's alleged native encoding.
> You can (partially) implement a bignum type, and implement it
> where it helps somebody. Places where it has not yet been
> implemented will throw a type error automatically if you try to
> pass them in;
Hrvoje> Don't you get it? It breaks any hope of large buffers
Hrvoje> working with bignums. Integers get propagated all over
Hrvoje> the place. If you take the easy way out and throw type
Hrvoje> errors in XINT, you could as well not bother in the first
Hrvoje> place. That's why I never tried to add them.
You know the code better than I; so probably that's true. I know
integers get propagated all over the place; I don't know how
connective the graph is. It is not clear to me that we can't put up
Chinese walls between the usages within the C code, such as bignums
used for time values and bignums used for buffer sizes and bignums
used for something else that ought to be big and "generic" bignums for
Sure, that's ugly, that's not "right"---but doing bignums right is
hard. Doing this part of UCS-4 right is easy, and the migration path
for moving to a full UCS representation is pretty straightforward, I
think. (And we have volunteers for it - Olivier, me, and Tomo.) It
>>>> "Jan" == Jan Vroonhof
Jan> lrecord integers. Of course you could also argue for lrecord
Jan> chars, presumably the actual value of chars will not be used
Jan> that often, presumably less than integers. Did anybody ever
Jan> do profiling on this?
Jan> Maybe putting characters in lrecords for Mule is the way to
Jan> go. Maybe even with the ugly small-character/big-character
Jan> thing I proposed for unsigned/signed integers above.
Nope, uh-uh, sorry. There are Emchars in every textual rune. See
You could argue for splitting the character type into redisplay
representation and Lisp representation. I believe that Morioka-san
wants to do this so that Lisp characters can be first class objects
with their own property lists and so on. He is also opposed to the
scheme I propose because it would break code that uses
(char-to-int (... (int-to-char ...) ...))
(I actually consider that a good thing ;-)
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
What are those two straight lines for? "Free software rules."