RE: proposed Eistring interface

Monday, 24 April 2000

        ...
>>>> "Bill" == Bill Tutt <billtut(a)microsoft.com> writes:
    Bill> Why do you want more than one internal representation?

I proposed swapping the internal representation of Lisp characters
with Lisp integers so that characters could have 31 bits (limiting
integers to 30 bits), and was shot down because people were editing
files bigger than 2^29 characters and needed the extra bit in the Lisp
integer (it's signed) to represent the size of such files on 32-bit
machines.
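
To make the arithmetic concrete, here is a minimal sketch.  It is not
XEmacs code: the function names are made up for illustration, and it
assumes the figures above mean a 31-bit signed Lisp integer today and
a 30-bit signed one after the swap.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value of a signed two's-complement field that is `bits`
   wide, sign bit included: 2^(bits-1) - 1.
   (Hypothetical helper, for illustration only.) */
static int64_t signed_field_max(int bits)
{
    assert(bits >= 2 && bits <= 63);
    return ((int64_t)1 << (bits - 1)) - 1;
}

/* Nonzero if a buffer of `size` characters can be counted in a
   signed field of the given width. */
static int size_fits(int64_t size, int bits)
{
    return size >= 0 && size <= signed_field_max(bits);
}
```

Under those assumptions a 30-bit integer tops out at 2^29 - 1, so a
file of 2^29 characters no longer has a representable size, while a
31-bit integer still covers it.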

Obviously, a 1-byte representation for binary files and ISO-8859
character sets (among others) makes sense in that world.  Saving >2^29
bytes by doing "(convert-representation (current-buffer) 'unibyte)"
seems like a concept to me.

    Bill> My personal opinion is that you pick one internal
    Bill> representation and stick with it. It would definitely make
    Bill> life simpler.

Well, we're kinda committed to multiple representations, at least over
time: we currently have code to handle variable-width representation,
but many developers believe this to be responsible for massive
slowness in Mule.  (Ben disagrees, which suggests maybe not.)  On the
other hand, a fixed-width representation would make it possible for
journeyman programmers to maintain efficient algorithms. ;-)

    Bill> They're in plane-14 which certainly is encodable by UTF-16.

No.  I'm talking about Mule-leading-byte-like tags which allow you to
extract a character's charset or language without context, not a modal
encoding.
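
A rough sketch of what I mean by a context-free tag follows.  The
layout is purely hypothetical (it is NOT the actual Mule format): it
assumes a 31-bit character whose top 10 bits hold a charset tag and
whose low 21 bits hold the code point, and all names are invented for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, not the real Mule representation:
   bits 30..21 = charset tag, bits 20..0 = code point. */
#define CODE_BITS 21
#define CODE_MASK ((UINT32_C(1) << CODE_BITS) - 1)
#define TAG_MAX   ((UINT32_C(1) << 10) - 1)

/* Pack a tag and a code point into one 31-bit character. */
static uint32_t make_tagged_char(uint32_t tag, uint32_t code)
{
    assert(tag <= TAG_MAX && code <= CODE_MASK);
    return (tag << CODE_BITS) | code;
}

/* Each accessor looks only at the character itself; no scanning
   backwards for a mode switch is needed. */
static uint32_t char_tag(uint32_t c)  { return c >> CODE_BITS; }
static uint32_t char_code(uint32_t c) { return c & CODE_MASK; }
```

The point of the design is that the tag travels with every character,
whereas a modal scheme requires knowing what preceded it.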

...
> Unlike Henry Ford, we do not plan to allow people to use any
> character set they like "as long as it's black."

    Bill> I'm not sure what the point of this is. The encoding the
    Bill> source code is stored under doesn't mean that it doesn't
    Bill> make some sense to use only one internal encoding. I'm just
    Bill> saying that I think UTF-16 seems to make some sense as a
    Bill> candidate.

Not if one of the "character sets" you want to use (as many Japanese
apparently do) is "konjaku-mojikyo" which already has about 70,000
code points assigned, with new ones coming in at a fantastic rate.
This is not a standard character set, of course, and should be
unified---except that its users don't believe in unification, that's
why the set was created in the first place.

We can humor these users fairly easily without sacrificing standard
functionality (with a UTF-8 or UCS-4 internal representation); why not
do so?  It has "hack appeal."  But UTF-16 doesn't cut it.
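
One way to read the capacity argument, as a minimal sketch: UTF-16
stops at U+10FFFF, while the 31-bit UCS-4 code space and the
six-byte form of UTF-8 (RFC 2279) go up to 0x7FFFFFFF.  Parking a
private set above 0x10FFFF is my assumption for illustration, not an
agreed design, and the function names are invented.

```c
#include <assert.h>
#include <stdint.h>

#define UTF16_MAX 0x10FFFFu   /* highest code point reachable with surrogate pairs */
#define UCS4_MAX  0x7FFFFFFFu /* 31-bit UCS-4 code space */

/* Nonzero if cp can be encoded in UTF-16 (surrogate values
   themselves are not characters). */
static int fits_utf16(uint32_t cp)
{
    return cp <= UTF16_MAX && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Nonzero if cp fits in 31-bit UCS-4. */
static int fits_ucs4(uint32_t cp)
{
    return cp <= UCS4_MAX;
}

/* Bytes needed by the RFC 2279 form of UTF-8 (1 to 6),
   or 0 if cp is outside the 31-bit range. */
static int utf8_length(uint32_t cp)
{
    if (cp < 0x80)       return 1;
    if (cp < 0x800)      return 2;
    if (cp < 0x10000)    return 3;
    if (cp < 0x200000)   return 4;
    if (cp < 0x4000000)  return 5;
    if (cp <= UCS4_MAX)  return 6;
    return 0;
}
```

A code point such as 0x110000 is out of reach for UTF-16 but costs
only four bytes in this form of UTF-8 and one cell in UCS-4.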

    Bill> If you're referring to the Japanese Mule developers
    Bill> disliking having to deal with a UTF-16 internal
    Bill> representation then you kind of have a problem.

Precisely.  There are others of us who think that UTF-16 is an ugly
kludge, but the Japanese have a visceral dislike for it.  Remember,
one of the things that Japanese who dislike Unicode dislike _most_
about Unicode is that it accepted the JIS standard as the basis for
unification!  There are already mutterings that "we blew it AGAIN,"
referring to JIS X 0213.[1]

    Bill> Getting back to Ben's proposed Eistring interface. I don't
    Bill> think I saw any functions related to helping you iterate
    Bill> sequentially over characters in the internal encoding
    Bill> (whatever it is). Those would certainly be
    Bill> necessary if you were to use UTF-16, or indeed for some
    Bill> reason needed to change your internal encoding to take up
    Bill> even more space.

Byte position adjustment is trivial:

	  bp += character_positions*representation.bytewidth;

as long as we don't work with surrogates.
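
For contrast, a minimal sketch of both cases.  This is not the
Eistring interface or any existing XEmacs function; the names and
signatures are hypothetical.  The fixed-width case is the one-liner
above; with UTF-16 each step has to look at the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fixed-width representation: pure arithmetic, no data access. */
static size_t advance_fixed(size_t bp, size_t nchars, size_t bytewidth)
{
    assert(bytewidth > 0);
    return bp + nchars * bytewidth;
}

static int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/* UTF-16: advance over up to nchars characters in text[0..len),
   starting at code-unit index pos.  A well-formed surrogate pair
   counts as one character; an unpaired surrogate is stepped over
   as a single unit.  Returns the new index, never past len. */
static size_t advance_utf16(const uint16_t *text, size_t len,
                            size_t pos, size_t nchars)
{
    for (size_t i = 0; i < nchars && pos < len; i++) {
        if (is_high_surrogate(text[pos]) && pos + 1 < len
            && is_low_surrogate(text[pos + 1]))
            pos += 2;   /* surrogate pair: two code units, one character */
        else
            pos += 1;   /* BMP character or stray surrogate */
    }
    return pos;
}
```

The second function is linear in the number of characters skipped,
which is exactly the cost the fixed-width form avoids.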

But this is in general a hard problem; many of the instances where
iteration occurs require sophisticated understanding of the character
properties.  Font registries are one example: detecting that you have
a CJK character is easy, but that doesn't mean your font can display
it; that can only be determined via a mapping table.  Ben has
suggested a concept called "coding lstreams" which I suppose is
intended to address this among other issues.

Footnotes: 
[1]  Don't ask me to justify these; I'm just passing on the gossip I
hear.

-- 
University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
_________________  _________________  _________________  _________________
What are those straight lines for?  "XEmacs rules."
