From: Stephen J. Turnbull [mailto:turnbull@sk.tsukuba.ac.jp]
>>>>> "Bill" == Bill Tutt <billtut(a)microsoft.com> writes:
Bill> You might indeed. I'm not exactly sure why you'd want to do
Bill> that, but that's your call. :) Seems like a waste of a large
Bill> chunk of memory if you make it your internal representation
Bill> without a fairly compelling reason....
_Default_ internal representation. UCS-2 is a massive waste of space
if an ISO-8859 set will do.
We plan to allow at least 1-byte and variable-width internal
representations, and then extend to 2-byte and 4-byte internal
representations. The default internal representation being discussed
here would be used for internal buffering etc., and would not (as
planned, anyway) be imposed on editing buffers or Lisp strings.
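To make that concrete, here is a rough sketch in C (the names and layout
are invented for illustration, not the actual XEmacs internals) of what a
width-tagged internal string might look like, picking the narrowest unit
size that holds every character:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

enum rep_width { REP_1BYTE = 1, REP_2BYTE = 2, REP_4BYTE = 4 };

struct tagged_string {
    enum rep_width width;   /* bytes per stored unit         */
    size_t len;             /* number of characters          */
    void *data;             /* len * width bytes of storage  */
};

/* Pick the narrowest representation that can hold every character. */
static enum rep_width pick_width(const uint32_t *chars, size_t len)
{
    enum rep_width w = REP_1BYTE;
    for (size_t i = 0; i < len; i++) {
        if (chars[i] > 0xFFFF)
            return REP_4BYTE;
        if (chars[i] > 0xFF)
            w = REP_2BYTE;
    }
    return w;
}

static struct tagged_string make_string(const uint32_t *chars, size_t len)
{
    struct tagged_string s;
    s.width = pick_width(chars, len);
    s.len = len;
    s.data = malloc(len * s.width);
    for (size_t i = 0; i < len; i++) {
        switch (s.width) {
        case REP_1BYTE: ((uint8_t  *)s.data)[i] = (uint8_t)chars[i];  break;
        case REP_2BYTE: ((uint16_t *)s.data)[i] = (uint16_t)chars[i]; break;
        case REP_4BYTE: ((uint32_t *)s.data)[i] = chars[i];           break;
        }
    }
    return s;
}

int main(void)
{
    /* "cafe" with U+00E9 at the end: fits in one byte per character. */
    const uint32_t latin1[] = { 'c', 'a', 'f', 0xE9 };
    struct tagged_string s = make_string(latin1, 4);
    printf("width = %d byte(s) per character\n", (int)s.width);
    free(s.data);
    return 0;
}

The point is just that the tag travels with the string, so most text pays
one byte per character and only genuinely wide text pays for more.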
See my other message for further discussion.
From an earlier note of yours:
So what happens if you have
data that is not representable in the default internal
representation? Do we just tell those users to get lost?
Why do you want more than one internal representation? Is it so you can
handle displaying those CJK characters that Unicode still hasn't specified
slots for? (Family names, or company-specific characters come to mind.)
Why couldn't you simply define a mapping from the appropriate non-UTF-16
format into some part of the Unicode private use space until such a time as
the problem in Unicode is either fixed or Unicode encourages use of the
private use space for these characters?
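Roughly what I have in mind, as a sketch (the base offset and names are
made up for illustration, not any standard mapping):

#include <stdint.h>
#include <stdio.h>

#define PUA_BASE 0xE000u
#define PUA_LAST 0xF8FFu

/* Park the Nth character that Unicode has no slot for on a BMP
   private-use code point; returns 0 once the PUA is exhausted. */
static uint32_t gaiji_to_pua(unsigned index)
{
    uint32_t cp = PUA_BASE + index;
    return (cp <= PUA_LAST) ? cp : 0;
}

int main(void)
{
    uint32_t cp = gaiji_to_pua(42);
    printf("character #42 -> U+%04X\n", (unsigned)cp);
    return 0;
}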
It would be kind of weird if the default internal representation
that Eistrings dealt with was UCS-2 but a UTF-8 representation was
available in buffers, which you don't rule out.
My personal opinion is that you pick one internal representation and stick
with it. It would definitely make life simpler.
You may not consider it compelling, but looking at the history of Mule
over the last 12 years, I think it is nearly certain that some people,
probably including Ken'ichi Handa, will want access to a language-tag-
in-character representation.
I don't disagree. See http://www.unicode.org/unicode/reports/tr7/ for
where those proposed notations exist.
They're in Plane 14, which is certainly encodable in UTF-16 (via surrogate
pairs). I'm sure the folks working on MS Word will eventually want to do
something along those lines so they can spell/grammar check multi-language
documents.
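For what it's worth, the surrogate arithmetic is trivial; here's a quick
sketch (plain standard UTF-16 math, nothing MS Word specific), using
U+E0001 LANGUAGE TAG as the example:

#include <stdint.h>
#include <stdio.h>

/* Encode a supplementary-plane code point (U+10000..U+10FFFF) as a
   UTF-16 surrogate pair; returns 0 for code points outside that range. */
static int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return 0;
    cp -= 0x10000;
    *hi = (uint16_t)(0xD800 + (cp >> 10));
    *lo = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 1;
}

int main(void)
{
    uint16_t hi, lo;
    if (to_surrogates(0xE0001, &hi, &lo))              /* LANGUAGE TAG */
        printf("U+E0001 -> %04X %04X\n", hi, lo);      /* DB40 DC01    */
    return 0;
}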
Unlike Henry Ford, we do not plan to
allow people to use any character set they like "as long as it's
black." I'm pretty sure that some of the people we would most like to
have using XEmacs (Japanese Mule developers) would be quite adamantly
opposed to UTF-16.
I'm not sure what the point of this is. Whatever encoding the source code
happens to be stored in, that doesn't mean it makes no sense to use only
one internal encoding. I'm just saying that I think UTF-16 seems to make
some sense as a candidate. If you're referring to the Japanese Mule
developers disliking having to deal with a UTF-16 internal representation,
then you kind of have a problem.
Getting back to Ben's proposed Eistring interface: I don't think I saw any
functions for iterating sequentially over characters in the internal
encoding (whatever it is). Those would certainly be necessary if you were
to use UTF-16, or indeed if you for some reason needed to change your
internal encoding to take up even more space.
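Something along these lines is what I mean (hypothetical names, not Ben's
actual Eistring API), assuming a UTF-16 internal encoding where stepping
one character may consume two 16-bit units:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Return the character at *pos and advance *pos past it, consuming a
   second 16-bit unit when it completes a valid surrogate pair. */
static uint32_t utf16_next(const uint16_t *s, size_t len, size_t *pos)
{
    uint16_t u = s[(*pos)++];
    if (u >= 0xD800 && u <= 0xDBFF && *pos < len) {
        uint16_t v = s[*pos];
        if (v >= 0xDC00 && v <= 0xDFFF) {
            (*pos)++;
            return 0x10000 + (((uint32_t)(u - 0xD800) << 10) | (v - 0xDC00));
        }
    }
    return u;   /* BMP character, or an unpaired surrogate passed through */
}

int main(void)
{
    /* 'A', U+3042 HIRAGANA LETTER A, then U+E0001 as a surrogate pair. */
    const uint16_t s[] = { 0x0041, 0x3042, 0xDB40, 0xDC01 };
    size_t len = sizeof s / sizeof s[0], pos = 0;
    while (pos < len)
        printf("U+%04lX\n", (unsigned long)utf16_next(s, len, &pos));
    return 0;
}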
Bill