Re: proposed Eistring interface

Sunday, 23 April 2000

        I wrote this last night:

NOTE: One possible default internal representation that is compatible
with UTF-16 but allows all possible chars in UCS-4 would be to take an
unused range of 2048 chars (not from the private area, because Microsoft
actually uses up most or all of it with EUDC chars).  Let's say we picked
4000 - 47FF.  Then, we'd have:

0000 - FFFF    Simple chars (one unit, excluding the ranges used below)

D[8-B]xx D[C-F]xx  Surrogate char, represents 1M chars

4[0-7]xx D[C-F]xx D[C-F]xx   Surrogate char, represents 2G chars

The three-unit form alone covers 2^31 chars, exactly as many as UCS-4
handles, and the scheme has the same two properties as UTF-8 and
Mule-internal:

1. There are two disjoint groupings of units, one representing leading units
   and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to make
   up a valid char, irrespective of any other context.
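
The two properties above can be sketched in a few lines of Python.  This is
only an illustration of the proposal, not code from XEmacs: the names
LEAD3, LEAD2, TRAIL, is_trailing, char_length and split_chars are made up
for this example, and 4000 - 47FF is just the range picked above.  The
sketch classifies units and splits a unit sequence into chars; it does not
assign code points to the multi-unit forms, since no mapping is specified
here.

```python
# Hypothetical names throughout; ranges follow the proposal above.
LEAD3 = range(0x4000, 0x4800)   # proposed 3-unit leaders (2048 values)
LEAD2 = range(0xD800, 0xDC00)   # UTF-16 high surrogates (1024 values)
TRAIL = range(0xDC00, 0xE000)   # UTF-16 low surrogates (1024 values)


def is_trailing(unit):
    """True if the 16-bit unit may only appear in a non-leading position."""
    return unit in TRAIL


def char_length(lead):
    """Number of 16-bit units in the char that starts with `lead`.

    Property 2: the answer depends on the leading unit alone.
    """
    if is_trailing(lead):
        raise ValueError(f"{lead:04X} is not a leading unit")
    if lead in LEAD3:
        return 3
    if lead in LEAD2:
        return 2
    return 1


def split_chars(units):
    """Split a list of 16-bit units into one tuple per char."""
    chars = []
    i = 0
    while i < len(units):
        n = char_length(units[i])
        group = tuple(units[i:i + n])
        # Every unit after the leader must come from the trailing group.
        if len(group) != n or not all(is_trailing(u) for u in group[1:]):
            raise ValueError(f"truncated or malformed char at unit {i}")
        chars.append(group)
        i += n
    return chars


# Capacity of each multi-unit form, matching the figures in the table.
TWO_UNIT_CHARS = len(LEAD2) * len(TRAIL)                  # 1M
THREE_UNIT_CHARS = len(LEAD3) * len(TRAIL) * len(TRAIL)   # 2G
```

Because leaders and trailers never overlap (property 1), a scanner that
lands in the middle of a char can resynchronize by skipping trailing units
until it reaches the next leader.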

"Stephen J. Turnbull" wrote:

...
 I see I did get closer this time.  :-)

 >>>>> "Ben" == Ben Wing <ben(a)666.com> writes:

     >> (1) I still think it is harmless, at least for literals, and
     >> possibly useful to allow arbitrary bytes in ei{cat,cpy}_c().

     Ben> i still disagree because it can corrupt the innards unless
     Ben> you expect automatic binary conversion?  Then someone who
     Ben> feeds in JIS-encoded data this way will not get Japanese, but
     Ben> raw data, and perhaps should've used eicpy_ext().

 It can't corrupt the innards because it does do binary conversion.

 Of course Japanese should be forbidden there, as should _all_ encoded
 text.  Only when the programmer is willing to take responsibility for
 correctly handling the encoding at the raw octet level should the
 ei*_c APIs be used.  E.g., in mswindows_get_file() you handled a similar
 issue of correct formatting by checking for a trailing `\'.

 But when someone needs to insert ISO 2022 escape sequences by hand, or
 such like, the ei*_c APIs are convenient, and they also signal that
 the programmer is doing something potentially dangerous.  Not to
 XEmacs; but to the external system or to the user's data.  That's true
 of mistakes with ASCII, too.

 The following is an aside and not directly related to Eistrings
 anymore.

     Ben> by its nature, the default int. rep. must be able to
     Ben> represent all chars.  that would rule out utf16 if we have
     Ben> more than 1,000,000 and some chars.  but it doesn't rule out
     Ben> ucs4, or some utf16 extension that could encode gigs o'
     Ben> chars, etc.

 OK, I asked because I wondered if you'd thought about it in detail.
 Apparently not.  Here's how I see it.

 We do have that many character position codes if we admit UCS-4 or
 UTF-8.  But there will not be such a UTF-16 extension; Unicode is
 committed to staying within the UTF-16 character set for the
 foreseeable future.  There is already a UTF-32 specification for a
 32-bit wide Unicode representation; it is also committed to the
 17*65536 UTF-16 code space and will not be expanded beyond that for
 the foreseeable future, even though ISO-10646 compatibility (to which
 they are also committed) leaves room for 2^31 characters.
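
The sizes quoted in the paragraph above can be checked with a little
arithmetic.  This is a minimal sketch; the constant names and the
fits_in_utf16 helper are hypothetical and chosen only for illustration.

```python
PLANE_SIZE = 0x10000                # 65536 code positions per plane

# UTF-16 reaches the base plane directly, plus one position per
# (high surrogate, low surrogate) pair: 1024 * 1024 of them.
BMP_POSITIONS = PLANE_SIZE
SURROGATE_PAIRS = 0x400 * 0x400
UTF16_CODE_SPACE = BMP_POSITIONS + SURROGATE_PAIRS

# UCS-4 as defined by ISO-10646 allows 31-bit code positions.
UCS4_CODE_SPACE = 2 ** 31

# The surrogate mechanism adds exactly 16 planes to the base plane.
assert UTF16_CODE_SPACE == 17 * PLANE_SIZE == 0x110000


def fits_in_utf16(code_position):
    """True if the code position lies inside the 17-plane code space."""
    return 0 <= code_position < UTF16_CODE_SPACE
```

Anything at or above 0x110000 is representable in UCS-4 but has no UTF-16
form, which is the gap the rest of this discussion is about.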

 However, there are uses for the huge UCS-4 range.  Tomo and his
 buddies are already playing with the so-called "konjaku-mojikyo"
 pseudo-charsets, which have over 70,000 code points already assigned.
 These are quite popular with Japanese Windows users, too.  Anyway,
 they will eat up most of the UTF-16 private space if used in the
 obvious way.  Nor do we want to encourage "Japanese exceptionalists"
 to borrow not-yet-standardized parts of the UTF-16 space.

 Also, although _we_ should not support "language-tagged character"
 encodings (pace Olivier) by default, we should permit third party
 libraries and extension modules to do so.  This could be easily done
 using UCS-4 private space.

 Before Hrvoje jumps in, let me say that given that the Unicode
 standard will not expand out of the UTF-16 range, I don't think
 konjaku-mojikyo or other linguistic research purposes are good enough
 reasons for not using UCS-2 as a default internal representation.  I
 will think about how to handle that potential conflict in a reasonable
 way, to forestall the abuses that I would expect from Japanese
 programmers.  We should, of course, support

           --with-default-internal-representation={ucs4,utf8}

 for people who want those huge spaces.  The issue here is how to have
 XEmacs gracefully decline to handle them when the default internal
 representation is UCS-2, which is not such a big deal.

 --
 University of Tsukuba                Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 Institute of Policy and Planning Sciences       Tel/fax: +81 (298) 53-5091
 _________________  _________________  _________________  _________________
 What are those straight lines for?  "XEmacs rules." 
--
Ben

In order to save my hands, I am cutting back on my mail.  I also write
as succinctly as possible -- please don't be offended.  If you send me
mail, you _will_ get a response, but please be patient, especially for
XEmacs-related mail.  If you need an immediate response and it is not
apparent in your message, please say so.  Thanks for your understanding.

See also http://www.666.com/ben/typing.html.
