From: Stephen J. Turnbull [mailto:turnbull@sk.tsukuba.ac.jp]
>>>>> "Bill" == Bill Tutt <billtut(a)microsoft.com> writes:
Bill> You might indeed. I'm not exactly sure why you'd want to do
Bill> that, but that's your call. :) Seems like a waste of a large
Bill> chunk of memory if you make it your internal representation
Bill> without a fairly compelling reason....
_Default_ internal representation. UCS-2 is a massive waste of space
if an ISO-8859 set will do.
We plan to allow at least 1-byte and variable-width internal
representations, and then extend to 2-byte and 4-byte internal
representations. The default internal representation being discussed
here would be used for internal buffering etc., and would not (as
planned, anyway) be imposed on editing buffers or Lisp strings.
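To make that concrete, here is a rough sketch in C (the names and layout
are invented for illustration, not the actual XEmacs internals) of what a
width-tagged internal string might look like, picking the narrowest unit
size that holds every character:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum rep_width { REP_1BYTE = 1, REP_2BYTE = 2, REP_4BYTE = 4 };

struct tagged_string {
    enum rep_width width;   /* bytes per stored unit         */
    size_t len;             /* number of characters          */
    void *data;             /* len * width bytes of storage  */
};

/* Pick the narrowest representation that can hold every character. */
static enum rep_width pick_width(const uint32_t *chars, size_t len)
{
    enum rep_width w = REP_1BYTE;
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > 0xFFFF)
            return REP_4BYTE;
        if (chars[i] > 0xFF)
            w = REP_2BYTE;
    }
    return w;
}

static struct tagged_string make_string(const uint32_t *chars, size_t len)
{
    struct tagged_string s;
    s.width = pick_width(chars, len);
    s.len = len;
    s.data = malloc(len * s.width);
    for (size_t i = 0; i < len; i++) {
        switch (s.width) {
        case REP_1BYTE: ((uint8_t  *)s.data)[i] = (uint8_t)chars[i];  break;
        case REP_2BYTE: ((uint16_t *)s.data)[i] = (uint16_t)chars[i]; break;
        case REP_4BYTE: ((uint32_t *)s.data)[i] = chars[i];           break;
        }
    }
    return s;
}

int main(void)
{
    /* "cafe" with U+00E9 at the end: fits in one byte per character. */
    const uint32_t latin1[] = { 'c', 'a', 'f', 0xE9 };
    struct tagged_string s = make_string(latin1, 4);
    printf("width = %d byte(s) per character\n", (int)s.width);
    free(s.data);
    return 0;
}

The point is just that the tag travels with the string, so most text pays
one byte per character and only genuinely wide text pays for more.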
See my other message for further discussion.
From an earlier note of yours:
So what happens if you have
data that is not representable in the default internal
representation? Do we just tell those users to get lost?
Why do you want more than one internal representation? Is it so you can
handle displaying those CJK characters that Unicode still hasn't specified
slots for? (Family names, or company-specific characters come to mind.)
Why couldn't you simply define a mapping from the appropriate non-UTF-16
format into some part of the Unicode private use space until such a time as
the problem in Unicode is either fixed or Unicode encourages use of the
private use space for these characters?
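Roughly what I have in mind, as a sketch (the base offset and names are
made up for illustration, not any standard mapping):

#include <stdint.h>
#include <stdio.h>

#define PUA_BASE 0xE000u
#define PUA_LAST 0xF8FFu

/* Park the Nth character that Unicode has no slot for on a BMP
   private-use code point; returns 0 once the PUA is exhausted. */
static uint32_t gaiji_to_pua(unsigned index)
{
    uint32_t cp = PUA_BASE + index;
    return (cp <= PUA_LAST) ? cp : 0;
}

int main(void)
{
    uint32_t cp = gaiji_to_pua(42);
    printf("character #42 -> U+%04X\n", (unsigned)cp);
    return 0;
}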
It would be kind of weird if the default internal representation
that Eistrings dealt with was UCS-2 but a UTF-8 representation was
available in buffers, which you don't rule out.
My personal opinion is that you pick one internal representation and stick
with it. It would definitely make life simpler.
You may not consider it compelling, but looking at the history of Mule
over the last 12 years, I think it is nearly certain that some people,
probably including Ken'ichi Handa, will want access to a language-tag-
in-character representation.
I don't disagree. See http://www.unicode.org/unicode/reports/tr7/ for
where those proposed notations exist.
They're in Plane 14, which is certainly encodable in UTF-16 (via surrogate
pairs). I'm sure the folks working on MS Word will eventually want to do
something along those lines so they can spell/grammar check multi-language
documents.
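For what it's worth, the surrogate arithmetic is trivial; here's a quick
sketch (plain standard UTF-16 math, nothing MS Word specific), using
U+E0001 LANGUAGE TAG as the example:

#include <stdint.h>
#include <stdio.h>

/* Encode a supplementary-plane code point (U+10000..U+10FFFF) as a
   UTF-16 surrogate pair; returns 0 for code points outside that range. */
static int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 1;
}

int main(void)
{
    uint16_t hi, lo;
    if (to_surrogates(0xE0001, &hi, &lo))              /* LANGUAGE TAG */
        printf("U+E0001 -> %04X %04X\n", hi, lo);      /* DB40 DC01    */
    return 0;
}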
Unlike Henry Ford, we do not plan to
allow people to use any character set they like "as long as it's
black." I'm pretty sure that some of the people we would most like to
have using XEmacs (Japanese Mule developers) would be quite adamantly
opposed to UTF-16.
I'm not sure what the point of this is. Whatever encoding the source code
happens to be stored in, that doesn't mean it makes no sense to use only
one internal encoding. I'm just saying that I think UTF-16 seems to make
some sense as a candidate. If you're referring to the Japanese Mule
developers disliking having to deal with a UTF-16 internal representation,
then you kind of have a problem.
Getting back to Ben's proposed Eistring interface: I don't think I saw any
functions for iterating sequentially over characters in the internal
encoding (whatever it is). Those would certainly be necessary if you were
to use UTF-16, or indeed if you for some reason needed to change your
internal encoding to take up even more space.
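Something along these lines is what I mean (hypothetical names, not Ben's
actual Eistring API), assuming a UTF-16 internal encoding where stepping
one character may consume two 16-bit units:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Return the character at *pos and advance *pos past it, consuming a
   second 16-bit unit when it completes a valid surrogate pair. */
static uint32_t utf16_next(const uint16_t *s, size_t len, size_t *pos)
{
    uint16_t u = s[(*pos)++];
    if (u >= 0xD800 && u <= 0xDBFF && *pos < len) {
        uint16_t v = s[*pos];
        if (v >= 0xDC00 && v <= 0xDFFF) {
            (*pos)++;
            return 0x10000 + (((uint32_t)(u - 0xD800) << 10) | (v - 0xDC00));
        }
    }
    return u;   /* BMP character, or an unpaired surrogate passed through */
}

int main(void)
{
    /* 'A', U+3042 HIRAGANA LETTER A, then U+E0001 as a surrogate pair. */
    const uint16_t s[] = { 0x0041, 0x3042, 0xDB40, 0xDC01 };
    size_t len = sizeof s / sizeof s[0], pos = 0;
    while (pos < len)
        printf("U+%04lX\n", (unsigned long)utf16_next(s, len, &pos));
    return 0;
}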
Bill