Re: Patch: Promoting characters to 31 bits (for discussion)

Thursday, 1 July 1999

        ...
>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
    Hrvoje> We aren't developing crappy commercial software here, after all.
    Hrvoje> :-)

Right.  That's why we should do UCS-4 exactly right, and avoid getting 
into "my buffers are bigger" can-you-piss-over-this-line contests.

BTW: I don't care if Zed has .75GB buffers, or you do, or somebody is
simply trying to determine the period of (psychoanalyze-pinhead); I
think they're important.  Just not as important to me, and I believe
to Mule-dependent people in general, as UCS-4.

    Hrvoje> "Stephen J. Turnbull" <turnbull(a)sk.tsukuba.ac.jp> writes:
    Hrvoje> What does direct UCS-4 support in characters buy me
    Hrvoje> anyway?

...
> It buys you direct UCS-4 support in characters. 
    Hrvoje> And what does that buy me?  Are we running in circles
    Hrvoje> here?

Yes.  UCS-4 is going to be the standard unified character type in the
near future.  It can handle all foreseeable needs for plain text.

If you aren't excited about UCS-4 itself, then you aren't going to be
excited about direct support.  I can't argue that.

But direct support is probably a good idea; otherwise we must use
different representations for redisplay text runes and Lisp
characters.  Maybe that's not such a big deal, although it would sure
simplify string processing (text glyphs) to use UCS-4 everywhere.

...
> o XEmacs will never again (well, for a future longer than the
> history of electronic computers) need to change the abstract
> format for a Lisp character. 
    Hrvoje> This is true without UCS-4 too.  It's only the
    Hrvoje> implementation that changes.

Unfortunately, probably not true.  Current Mule representation has
probably reached the end of the line; we're running out of code
space.  Tomo has already appropriated 3 private leading bytes for Thai 
and Vietnamese.  Probably another has gone to Ethiopian.  Several more 
for Unicode.

We need to change the internal encoding of Mule characters soon.
Eventually we will go to UCS-4; Unicode is not big enough, and as you
point out, XEmacs does things right when we know how.  It will be
fixed width; multibyte representation of single characters is an
abomination.

Since textual runes are basically Emchars, figuring out how to make
UCS-4 runes work would make the transition to either fixed-width UCS
(2 or 4) buffers (getting rid of all that difficult-to-maintain
caching that Mule must do) or UTF-8 buffers much easier.
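
To make the fixed-width point concrete, here is a minimal sketch.  It
is not XEmacs code; the function names are hypothetical, and UTF-8
stands in for any variable-width encoding (Mule's current internal
format differs in detail).  With fixed-width characters, finding
character i is an array index; with a variable-width encoding it is a
scan from the start, which is the cost the position caches exist to
hide.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width UCS-4: character i is simply array element i. */
uint32_t ucs4_char_at(const uint32_t *buf, size_t i)
{
    return buf[i];
}

/* Variable-width text (UTF-8 used as a stand-in): to locate character i
   we must walk from the start, skipping continuation bytes (10xxxxxx).
   Returns the byte offset of character i.  Assumes well-formed,
   NUL-terminated input containing at least i + 1 characters. */
size_t utf8_offset_of_char(const unsigned char *buf, size_t i)
{
    size_t off = 0;
    while (i > 0) {
        off++;                               /* step past the lead byte */
        while ((buf[off] & 0xC0) == 0x80)    /* skip continuation bytes */
            off++;
        i--;
    }
    return off;
}
```

The first lookup is constant time; the second is linear in the
position unless something caches byte offsets.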

Let's do it now, let's do it right.

As a practical matter, the internal representation of Mule will be
nearly plain text; we could consider doing Mule autosaves as straight
buffer dumps or as true UCS-4 in about 3 lines of C code.  No more
screwing around with coding-systems; that would be purely a service
for the user, and no longer a core Mule functionality (in the sense
that editing functions would no longer have to care about it, only I/O
functions).  This is not true today: VM, for example, does not produce
correct autosave files; the ESC character in ISO-2022 sequences gets
duplicated for some subtle reason.
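
As a rough illustration of how little code a "true UCS-4" dump needs,
here is a sketch that assumes the buffer is already an array of 31-bit
character codes held in uint32_t.  The name ucs4_dump_be and the
in-memory output buffer are my assumptions for the example, not
anything in the XEmacs source; a real autosave would write to a file
and handle errors.

```c
#include <stddef.h>
#include <stdint.h>

/* Serialize n character codes as big-endian UCS-4, four bytes each.
   `out` must have room for 4 * n bytes.  Returns the bytes written. */
size_t ucs4_dump_be(const uint32_t *chars, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        out[4 * i + 0] = (unsigned char)(chars[i] >> 24);
        out[4 * i + 1] = (unsigned char)(chars[i] >> 16);
        out[4 * i + 2] = (unsigned char)(chars[i] >> 8);
        out[4 * i + 3] = (unsigned char)(chars[i]);
    }
    return 4 * n;
}
```

No coding-system machinery is involved; the loop body is the whole
conversion.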

...
> o Mule backwards compatibility. 
    Hrvoje> Lisp code that relies on a particular representation of
    Hrvoje> characters is broken in the first place.

Yup.  So I see lots and lots of patches aimed at fixing the Lisp
engine, which is broken in the first place because it has default
dynamic scoping, and redisplay, which may not be broken but can't be
fixed in the first place.  Not.  It's the same issue; we have existing
Mule code and character sets which will need to be supported, and it
will take time to get all of them properly mapped to UCS.  And some of
the character sets we support don't have official positions in UCS yet.

We need those bit-30-set private groups.
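
A sketch of what "bit 30 set" means in a 31-bit code space.  The macro
and function names are hypothetical, and packing a legacy Mule code
into the low 30 bits is only an assumption for illustration; the
actual layout of the private groups would have to follow the patch
and ISO 10646.

```c
#include <stdint.h>

#define UCS4_MASK     0x7FFFFFFFu   /* UCS-4 codes are 31 bits wide */
#define PRIVATE_FLAG  (1u << 30)    /* hypothetical: bit 30 marks a private code */

/* A code is valid if it fits in 31 bits (bit 31 clear). */
int char_is_valid(uint32_t c)
{
    return (c & ~UCS4_MASK) == 0;
}

/* A code is private, in this sketch, if it is valid and has bit 30 set. */
int char_is_private(uint32_t c)
{
    return char_is_valid(c) && (c & PRIVATE_FLAG) != 0;
}

/* Park a not-yet-mapped legacy code in the private space by setting
   bit 30 and keeping the low 30 bits of the legacy value. */
uint32_t make_private_char(uint32_t legacy)
{
    return PRIVATE_FLAG | (legacy & (PRIVATE_FLAG - 1));
}
```

Characters that already have official UCS positions stay below bit 30,
so the two populations cannot collide while the mapping work is
finished.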

...
> Sure, those may not excite you. 
    Hrvoje> They don't excite me at all.  700M buffers excite me.
    Hrvoje> Whose 700M buffer is this?  Zed's.  Who's Zed?  Zed's dead
    Hrvoje> baby.  Zed's dead...

Good.  Oppose this change then, that's your right.  If UCS-4 is
unneeded functionality for others, I'll maintain a private branch of
XEmacs until you all do something that completely breaks it, then I'll 
do something else.

But this is the technically right and technically do-able direction for 
Mule to go.

It'll make Windows users happy, too; the internal representation will
be the same as their system's alleged native encoding.

...
> You can (partially) implement a bignum type, and implement it
> where it helps somebody.  Places where it has not yet been
> implemented will throw a type error automatically if you try to
> pass them in; 
    Hrvoje> Don't you get it?  It breaks any hope of large buffers
    Hrvoje> working with bignums.  Integers get propagated all over
    Hrvoje> the place.  If you take the easy way out and throw type
    Hrvoje> errors in XINT, you could as well not bother in the first
    Hrvoje> place.  That's why I never tried to add them.

You know the code better than I; so probably that's true.  I know
integers get propagated all over the place; I don't know how
connective the graph is.  It is not clear to me that we can't put up
Chinese walls between the usages within the C code, such as bignums
used for time values and bignums used for buffer sizes and bignums
used for something else that ought to be big and "generic" bignums for
generic arithmetic.

Sure, that's ugly, that's not "right"---but doing bignums right is
hard.  Doing this part of UCS-4 right is easy, and the migration path
for moving to a full UCS representation is pretty straightforward, I
think.  (And we have volunteers for it - Olivier, me, and Tomo.)  It
will happen.

...
>>>> "Jan" == Jan Vroonhof <vroonhof(a)math.ethz.ch> writes:
    Jan> lrecord integers. Of course you could also argue for lrecord
    Jan> chars, presumably the actual value of chars will not be used
    Jan> that often, presumably less than integers. Did anybody ever
    Jan> do profiling on this?

    Jan> Maybe putting characters in lrecords for Mule is the way to
    Jan> go.  Maybe even with the ugly small-character/big-character
    Jan> thing I proposed for unsigned/signed integers above.

Nope, uh-uh, sorry.  There are Emchars in every textual rune.  See
redisplay.h.

You could argue for splitting the character type into redisplay
representation and Lisp representation.  I believe that Morioka-san
wants to do this so that Lisp characters can be first class objects
with their own property lists and so on.  He is also opposed to the
scheme I propose because it would break code that uses

	      (char-to-int (... (int-to-char ...) ...))

(I actually consider that a good thing ;-)

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
