Re: Patch: Promoting characters to 31 bits (for discussion)

Wednesday, 30 June 1999

        ...
>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
    Hrvoje> Note that no editor on 32-bit systems can grok a >2G
    Hrvoje> buffer because of limited pointer size.  Thus a 1G limit
    Hrvoje> is just one step below the "theoretical" limit.  You can
    Hrvoje> argue that being two steps below it is not much worse, but
    Hrvoje> I am not really convinced.

I do so argue, and I see no reason why you should be convinced.  But
maybe others will agree with me.

    Hrvoje> What does direct UCS-4 support in characters buy me
    Hrvoje> anyway?  Will it be another case of all-encompassing Mule
    Hrvoje> priorities breaking the neck of the rest of us?

It buys you direct UCS-4 support in characters.  What did you think?
:-)

  o XEmacs will never again (well, for a future longer than the
    history of electronic computers) need to change the abstract
    format for a Lisp character.  (Of course it will get reimplemented 
    as 64, 128, ... bits as time goes on.)

  o Mule backwards compatibility.  Packages that expect Mule
    character formats will be able to get it by putting the entire
    Mule character set into a private UCS-4 group.  All such groups
    live above the 30-bit boundary; single planes are available within
    30-bit code points, but they are too small.  This is the obvious
    migration path.  We definitely do not want the arrays of runes
    passed to redisplay methods to contain lrecords!  (They won't, see 
    below, but we will have to sacrifice transparency and half the
    UCS-4 code space.)

  o Source separation.  For Asian languages, it may be important to
    separate Chinese-origin characters from Japanese-origin
    characters, etc.  Unicode advocates disagree, but there may be
    some diehards out there.  Olivier thinks this is important, and
    plans to implement it.  The sensible place to do this is in a
    private UCS-4 group.

  o UCS-4 will come, maybe not immediately, but within a couple of
    years.  The basic multilingual plane (Unicode) is basically full;
    they're already filling up planes 1 and 2 (or was that 2 and 3?).
    You could fake that with < 31-bit characters but people who want
    to use large character sets (there is at least one font for more
    than 80,000 Asian ideographs already in the making) will need full 
    UCS-4 (until they are standardized they will be private-group
    applications).

  o You get to Do The Right Thing.

All of those except the last could be done with some inefficient or
non-standard-conformant hacks, but why complexify things?
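
To make the group arithmetic in the list above concrete, here is a
minimal sketch in C.  It assumes the ISO 10646 layout as I read it: a
31-bit code point split into a 7-bit group and 8-bit plane, row and
cell, with groups 60-7F reserved for private use.  The function and
struct names are made up for illustration; nothing like them exists in
the XEmacs sources.

```c
#include <stdint.h>

/* Hypothetical helper type: the four UCS-4 octets of one code point. */
struct ucs4_parts {
    unsigned group;  /* bits 24-30 (7 bits) */
    unsigned plane;  /* bits 16-23 */
    unsigned row;    /* bits 8-15 */
    unsigned cell;   /* bits 0-7 */
};

/* Split a 31-bit UCS-4 code point into group/plane/row/cell. */
struct ucs4_parts ucs4_split(uint32_t c)
{
    struct ucs4_parts p;
    p.group = (c >> 24) & 0x7Fu;
    p.plane = (c >> 16) & 0xFFu;
    p.row   = (c >> 8) & 0xFFu;
    p.cell  = c & 0xFFu;
    return p;
}

/* Nonzero if c lies in a private-use group (assumed to be 60-7F). */
int ucs4_in_private_group(uint32_t c)
{
    unsigned g = (c >> 24) & 0x7Fu;
    return c <= 0x7FFFFFFFu && g >= 0x60u;
}

/* Nonzero if c is representable in a 30-bit character. */
int ucs4_fits_in_30_bits(uint32_t c)
{
    return c < (1u << 30);
}
```

Under these assumptions every private-group code point fails
ucs4_fits_in_30_bits, which is the whole problem: the first private
group starts at 0x60000000.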

Sure, those may not excite you.  But one thing that worries me is
precisely the fact that the applications that absolutely positively
require 31-bit characters are probably a few years off.  So we do a
half-ass hack for the purpose of UCS-4 compliance while maintaining
Mule backwards compatibility.

E.g., fitting the private groups into 30 bits can be done pretty
easily by surgically removing bit 29 from the representation, and
collapsing bits 29-30 into one bit: 00 -> bit 29 = 0, 01 -> na,
10 -> na, and 11 -> bit 29 = 1, where na means not available in our
representation.  Finally, four years from now, somebody actually gets
around to implementing a library that requires private space
characters, and boom! XEmacs crashes.  But the code has bitrotted and
nobody knows how to fix it.
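
For the record, the hack described above would look roughly like the
following C.  This is only a sketch of the bit surgery, not a proposal
and not existing XEmacs code; fold_ucs4 and unfold_ucs4 are names I
invented for the example.

```c
#include <assert.h>
#include <stdint.h>

#define UCS4_LOW_MASK 0x1FFFFFFFu  /* bits 0-28, kept unchanged */
#define FOLDED_BIT29  (1u << 29)   /* stands in for bits 29-30 */

/* Fold a 31-bit UCS-4 code point into 30 bits.
   Bits 30-29 == 00 -> bit 29 = 0; 11 -> bit 29 = 1;
   01 and 10 are "na".  Returns 0 on success, -1 if not available. */
int fold_ucs4(uint32_t ucs4, uint32_t *folded)
{
    uint32_t top;

    if (ucs4 > 0x7FFFFFFFu)
        return -1;                 /* not a 31-bit code point */
    top = (ucs4 >> 29) & 0x3u;
    if (top == 0x0u)
        *folded = ucs4 & UCS4_LOW_MASK;
    else if (top == 0x3u)
        *folded = (ucs4 & UCS4_LOW_MASK) | FOLDED_BIT29;
    else
        return -1;                 /* Groups 20-5F: na */
    return 0;
}

/* Inverse mapping: expand a 30-bit folded value back to UCS-4. */
uint32_t unfold_ucs4(uint32_t folded)
{
    uint32_t low = folded & UCS4_LOW_MASK;

    assert(folded < (1u << 30));
    if (folded & FOLDED_BIT29)
        return low | (0x3u << 29);
    return low;
}
```

Note the two failure branches in fold_ucs4.  Those are exactly the
cases nobody exercises until the day they matter.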

Or something weird happens: the ISO assigns something to space in
Groups 20-5F, and boom!

Breaking necks?  No, I think not:

    Hrvoje> Implementing bignums is hard because if you want to do it

Agreed.  But...

    Hrvoje> right, you have to modify all the C code that relies on
    Hrvoje> Lisp_Object integers fitting in an integral type called
    Hrvoje> "EMACS_INT" (int or long).  And there's a *lot* of such
    Hrvoje> code, with possibly long integers propagating all over the
    Hrvoje> place.

...I don't see that it has to be done all at once.  You can
(partially) implement a bignum type, and implement it where it helps
somebody.  Places where it has not yet been implemented will throw a
type error automatically if you try to pass a bignum in; this will
automatically identify the problem.  It won't even be ambiguous, like
the Ebola warnings often are, e.g. in

(mapcar #'(lambda (x) (if (equal x ?A) (smile) (puke))) '(?A 1))

People who don't need BigBuffers[tm] or plan to die before 2004 can
leave the relevant bignum-check-and-convert code out with a compiler
switch.
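
A rough sketch in C of what I mean by "throw a type error
automatically" and "leave the check out with a compiler switch".
Everything here is hypothetical: struct lisp_num, check_fixnum and the
WITH_BIGNUMS macro are illustration only and do not correspond to the
real Lisp_Object machinery.

```c
#include <stdio.h>

/* Hypothetical build switch; define as 0 to compile the check out. */
#ifndef WITH_BIGNUMS
#define WITH_BIGNUMS 1
#endif

enum num_tag { NUM_FIXNUM, NUM_BIGNUM };

/* Toy stand-in for a tagged Lisp number. */
struct lisp_num {
    enum num_tag tag;
    long fixnum;     /* valid only when tag == NUM_FIXNUM */
    /* bignum payload omitted in this sketch */
};

/* Entry check for code that has not been converted to bignums.
   Returns 0 and stores the value on success, -1 on a type error. */
int check_fixnum(const struct lisp_num *n, long *out)
{
#if WITH_BIGNUMS
    if (n->tag != NUM_FIXNUM) {
        fprintf(stderr, "wrong-type-argument: fixnump\n");
        return -1;
    }
#endif
    *out = n->fixnum;
    return 0;
}
```

Unconverted code calls the check on entry and never sees a bignum;
converted code dispatches on the tag itself.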

Yes, this is going to be a big project, like de-ebolification.  But it
can be done more gradually, albeit with attendant risks of bignums
escaping from converted code to unconverted code somehow.  So we add
Ebola-like checks and regression tests.

    Hrvoje> Also, with bignums available, you would probably want to
    Hrvoje> provide compiler declarations so that reasonably efficient
    Hrvoje> code can be written, i.e. (declare (fixnum x)).

Sure.  But again, that's a project that can be put off until it
matters to somebody.  (Maybe; I realize that it's possible that checks 
for bignums will slow everything down and everybody will care.  But
then the --with-bignums=no option wins.)

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
__________________________________________________________________________
__________________________________________________________________________
What are those two straight lines for?  "Free software rules."
