>>>> "Bill" == Bill Tutt <billtut(a)microsoft.com> writes:
Bill> Why do you want more than one internal representation?
I proposed swapping the internal representation of Lisp characters
with Lisp integers so that characters could have 31 bits (limiting
integers to 30 bits), and was shot down because people were editing
files bigger than 2^29 characters and needed the extra bit in the Lisp
integer (it's signed) to represent the size of such files on 32-bit
machines.
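
To make the arithmetic concrete, here is a minimal sketch in C. It assumes only what the paragraph above says: a signed Lisp integer that is either 31 or 30 bits wide on a 32-bit machine. The helper names are made up for illustration and are not XEmacs internals.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest value representable by a signed integer
   that is `value_bits` wide (one of those bits is the sign bit). */
static int64_t max_signed(unsigned value_bits)
{
    assert(value_bits >= 2 && value_bits <= 63);
    return ((int64_t)1 << (value_bits - 1)) - 1;
}

/* Hypothetical helper: can a buffer of `size` characters be indexed by
   a signed Lisp integer that is `value_bits` wide? */
static int size_fits(int64_t size, unsigned value_bits)
{
    return size >= 0 && size <= max_signed(value_bits);
}
```

With 31 value bits, a 600,000,000-character file is still addressable (the limit is 2^30 - 1); with 30 bits the limit drops to 2^29 - 1 and it is not.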
Obviously, a 1-byte representation for binary files and ISO-8859
character sets (among others) makes sense in that world. Saving >2^29
bytes by doing "(convert-representation (current-buffer) 'unibyte)"
seems like a concept to me.
Bill> My personal opinion is that you pick one internal
Bill> representation and stick with it. It would definitely make
Bill> life simpler.
Well, we're kinda committed to multiple representations, at least over
time: we currently have code to handle a variable-width representation,
but many developers believe this to be responsible for massive
slowness in Mule. (Ben disagrees, which suggests maybe not.) On the
other hand, a fixed-width representation would make it possible for
journeyman programmers to maintain efficient algorithms. ;-)
Bill> They're in plane-14 which certainly is encodeable by UTF-16.
No. I'm talking about Mule-leading-byte-like tags which allow you to
extract a character's charset or language without context, not a modal
encoding.
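
A rough sketch of what I mean by a context-free tag, in C. The bit layout here (high bits for a charset tag, low 21 bits for the code point) is purely an assumption for illustration, not Mule's actual format, and all the names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical layout, for illustration only: the high bits of a 32-bit
   character hold a charset tag, the low 21 bits hold the code point
   within that charset. */
#define CODE_BITS 21u
#define CODE_MASK ((UINT32_C(1) << CODE_BITS) - 1)

typedef uint32_t tagged_char;

/* Pack a charset tag and a code point into one character value. */
static tagged_char make_tagged_char(uint32_t charset, uint32_t code)
{
    return (charset << CODE_BITS) | (code & CODE_MASK);
}

/* Both accessors look only at the character itself; no surrounding
   text needs to be scanned. */
static uint32_t char_charset(tagged_char c) { return c >> CODE_BITS; }
static uint32_t char_code(tagged_char c)    { return c & CODE_MASK; }
```

The point is that `char_charset` needs nothing but the character itself, whereas with a modal scheme you have to scan the surrounding text to find the tag currently in effect.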
> Unlike Henry Ford, we do not plan to allow people to use any
> character set they like "as long as it's black."
Bill> I'm not sure what the point of this is. The encoding the
Bill> source code is stored under doesn't mean that it doesn't
Bill> make some sense to only use one internal encoding. I'm just
Bill> saying that I think UTF-16 seems to make some sense as a
Bill> candidate.
Not if one of the "character sets" you want to use (as many Japanese
apparently do) is "konjaku-mojikyo" which already has about 70,000
code points assigned, with new ones coming in at a fantastic rate.
This is not a standard character set, of course, and should be
unified---except that its users don't believe in unification, which is
why the set was created in the first place.
We can humor these users fairly easily without sacrificing standard
functionality (with a UTF-8 or UCS-4 internal representation); why not
do so? It has "hack appeal." But UTF-16 doesn't cut it.
Bill> If you're referring to the Japanese Mule developers
Bill> disliking having to deal with a UTF-16 internal
Bill> representation then you kind of have a problem.
Precisely. There are others of us who think that UTF-16 is an ugly
kludge, but the Japanese have a visceral dislike for it. Remember,
one of the things that Japanese who dislike Unicode dislike _most_
about Unicode is that it accepted the JIS standard as the basis for
unification! There are already mutterings that "we blew it AGAIN,"
referring to JIS X 0213.[1]
Bill> Getting back to Ben's proposed Eistring interface. I don't
Bill> think I saw any functions related to helping you iterate
Bill> sequentially over characters in the internal
Bill> encoding. (whatever it is) Those would certainly be
Bill> necessary if you were to use UTF-16, or indeed for some
Bill> reason needed to change your internal encoding to take up
Bill> even more space.
Byte position adjustment is trivial:
    bp += character_positions * representation.bytewidth;
as long as we don't work with surrogates.
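
For contrast, here is a sketch of both cases in C. The fixed-width function is just the one-liner above wrapped up; the UTF-16 function shows the extra work surrogates would force on us. Function names and signatures are mine, for illustration only. Note that the UTF-16 version counts positions in 16-bit code units rather than bytes and does no validation beyond pairing surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-width case: advancing is a single multiply-and-add. */
static size_t advance_fixed(size_t bp, size_t nchars, size_t bytewidth)
{
    return bp + nchars * bytewidth;
}

/* UTF-16 case: advance `nchars` characters starting at code-unit index
   `pos`, stopping at `len`.  Every unit has to be inspected, because a
   high surrogate followed by a low surrogate is one character stored
   in two units. */
static size_t advance_utf16(const uint16_t *buf, size_t len,
                            size_t pos, size_t nchars)
{
    while (nchars > 0 && pos < len) {
        uint16_t u = buf[pos++];
        if (u >= 0xD800 && u <= 0xDBFF && pos < len &&
            buf[pos] >= 0xDC00 && buf[pos] <= 0xDFFF)
            pos++;              /* consume the low surrogate too */
        nchars--;
    }
    return pos;
}
```

The first is O(1); the second is O(n) in the number of characters skipped, which is the cost of any variable-width representation unless you cache positions somewhere.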
But this is in general a hard problem; many of the instances where
iteration occurs require sophisticated understanding of the character
properties (e.g., font registries -- just because you have a CJK
character -- easy to detect -- doesn't mean that your font can display
it; that can only be determined via a mapping table). Ben has
suggested a concept called "coding lstreams" which I suppose is
intended to address this among other issues.
Footnotes:
[1] Don't ask me to justify these; I'm just passing on the gossip I
hear.
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."