I wrote this last night:
NOTE: One possible default internal representation that is compatible
with UTF-16 but allows all possible chars in UCS-4 would be to take an
unused range of 2048 chars (not from the private area, because Microsoft
actually uses up most or all of it with EUDC chars). Let's say we picked
4000 - 47FF. Then, we'd have:
  0000 - FFFF                      Simple chars
  D[8-B]xx D[C-F]xx                Surrogate char, represents 1M chars
  4[0-7]xx D[C-F]xx D[C-F]xx       Surrogate char, represents 2G chars
This is exactly the same number of chars as UCS-4 handles, and it has the
same two properties as UTF-8 and Mule-internal:
1. There are two disjoint groupings of units, one representing leading units
and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to make
up a valid char, irrespective of any other context.
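To make the two properties concrete, here is a minimal Python sketch of the scheme above. The 4000 - 47FF range is the one picked in the note; the bit layout of the three-unit form (11 payload bits in the leading unit, 10 in each trailing unit, most significant first) and all function names are assumptions made for illustration, not anything XEmacs implements.

```python
# Sketch of the proposed UTF-16 extension. The bit layout and the names
# below are illustrative assumptions.
LEAD3_LO, LEAD3_HI = 0x4000, 0x47FF   # proposed lead of a three-unit char (2048 values)
HIGH_LO, HIGH_HI = 0xD800, 0xDBFF     # standard UTF-16 lead of a two-unit char
TRAIL_LO, TRAIL_HI = 0xDC00, 0xDFFF   # non-leading units (1024 values)


def is_trailing(unit):
    """Property 1: non-leading units form their own disjoint range."""
    return TRAIL_LO <= unit <= TRAIL_HI


def units_following(lead):
    """Property 2: the leading unit alone says how many units follow."""
    if is_trailing(lead):
        raise ValueError("not a leading unit: %#06x" % lead)
    if LEAD3_LO <= lead <= LEAD3_HI:
        return 2
    if HIGH_LO <= lead <= HIGH_HI:
        return 1
    return 0


def encode3(value):
    """Pack a 31-bit value into the three-unit form (assumed layout)."""
    if not 0 <= value < 1 << 31:
        raise ValueError("value does not fit in 31 bits")
    return [
        LEAD3_LO | (value >> 20),             # top 11 bits
        TRAIL_LO | ((value >> 10) & 0x3FF),   # middle 10 bits
        TRAIL_LO | (value & 0x3FF),           # low 10 bits
    ]


def decode3(units):
    """Inverse of encode3; rejects malformed sequences."""
    lead, t1, t2 = units
    if units_following(lead) != 2 or not (is_trailing(t1) and is_trailing(t2)):
        raise ValueError("not a valid three-unit sequence")
    return ((lead - LEAD3_LO) << 20) | ((t1 - TRAIL_LO) << 10) | (t2 - TRAIL_LO)
```

The payload is 11 + 10 + 10 = 31 bits, which is where the "2G chars" figure comes from. Which values would actually be routed to the three-unit form (rather than to a simple unit or an ordinary surrogate pair) is left open here, as it is in the note.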
"Stephen J. Turnbull" wrote:
I see I did get closer this time. :-)
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
>> (1) I still think it is harmless, at least for literals, and
>> possibly useful to allow arbitrary bytes in ei{cat,cpy}_c().
Ben> i still disagree because it can corrupt the innards unless
Ben> you expect automatic binary conversion? Then someone who
Ben> feeds in JIS-encoded data this way will not get Japanese, but
Ben> raw data, and perhaps should've used eicpy_ext().
It can't corrupt the innards because it does do binary conversion.
Of course Japanese should be forbidden there, as should _all_ encoded
text. Only when the programmer is willing to take responsibility for
correctly handling the encoding at the raw octet level should the
ei*_c APIs be used. E.g., in mswindows_get_file() you handled a similar
issue of correct formatting by checking for a trailing `\'.
But when someone needs to insert ISO 2022 escape sequences by hand, or
suchlike, the ei*_c APIs are convenient, and they also signal that
the programmer is doing something potentially dangerous: not to
XEmacs, but to the external system or to the user's data. That's true
of mistakes with ASCII, too.
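As a rough illustration of the binary-conversion point, here is a small Python sketch. Python's "latin-1" codec merely stands in for an octet-per-character binary conversion, and "iso2022_jp" for a real decoding step; none of this is the Eistring API, and the variable names are made up for the example.

```python
# Illustration only: Python codecs standing in for XEmacs conversions.
text = "\u65e5\u672c\u8a9e"              # three Japanese characters

jis_bytes = text.encode("iso2022_jp")    # JIS-encoded octets, with escape sequences

# "Binary conversion": every octet becomes exactly one character, so nothing
# is corrupted, but the result is raw data rather than Japanese.
as_binary = jis_bytes.decode("latin-1")

# Decoding through the proper coding system recovers the text.
as_japanese = jis_bytes.decode("iso2022_jp")
```

The binary form round-trips back to the original octets unchanged, which is why it cannot corrupt anything; it just does not interpret them.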
The following is an aside and not directly related to Eistrings
anymore.
Ben> by its nature, the default int. rep. must be able to
Ben> represent all chars. that would rule out utf16 if we have
Ben> more than 1,000,000 and some chars. but it doesn't rule out
Ben> ucs4, or some utf16 extension that could encode gigs o'
Ben> chars, etc.
OK, I asked because I wondered if you'd thought about it in detail.
Apparently not. Here's how I see it.
We do have that many character position codes if we admit UCS-4 or
UTF-8. But there will not be such a UTF-16 extension; Unicode is
committed to staying within the UTF-16 character set for the
foreseeable future. There is already a UTF-32 specification for a
32-bit wide Unicode representation; this is also committed to the
17*65536 UTF-16 code space, and will not be expanded beyond that for
the foreseeable future, despite the potential for 2^31 characters (they
are also committed to ISO-10646 compatibility).
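For reference, the sizes being compared work out as follows. This is just arithmetic on the standard UTF-16 surrogate ranges, written as a short Python sketch; the variable names are only for the example.

```python
# Size of the UTF-16 code space versus the 31-bit UCS-4 space.
bmp = 0x10000                          # 65536 code points reachable as single units
high = 0xDBFF - 0xD800 + 1             # 1024 leading surrogates
low = 0xDFFF - 0xDC00 + 1              # 1024 trailing surrogates
pairs = high * low                     # code points reachable via surrogate pairs

utf16_space = bmp + pairs              # the 17*65536 code space
ucs4_space = 2 ** 31                   # what a 31-bit UCS-4 value can address

print(pairs, utf16_space, ucs4_space)  # 1048576 1114112 2147483648
```

So the UTF-16 code space is a bit over a million code points, while UCS-4 has room for roughly 1900 times as many.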
However, there are uses for the huge UCS-4 range. Tomo and his
buddies are already playing with the so-called "konjaku-mojikyo"
pseudo-charsets, which have over 70,000 code points already assigned.
These are quite popular with Japanese Windows users, too. Anyway,
they will eat up most of the UTF-16 private space if used in the
obvious way. Nor do we want to encourage "Japanese exceptionalists"
to borrow not-yet-standardized parts of the UTF-16 space.
Also, although _we_ should not support "language-tagged character"
encodings (pace Olivier) by default, we should permit third party
libraries and extension modules to do so. This could be easily done
using UCS-4 private space.
Before Hrvoje jumps in, let me say that given that the Unicode
standard will not expand out of the UTF-16 range, I don't think
konjaku-mojikyo or other linguistic research purposes are good enough
reasons for not using UCS-2 as a default internal representation. I
will think about how to handle that potential conflict in a reasonable
way, to forestall the abuses that I would expect from Japanese
programmers. We should, of course, support
--with-default-internal-representation={ucs4,utf8}
for people who want those huge spaces. The issue here is how to have
XEmacs gracefully decline to handle them when the default internal
representation is UCS-2, which is not such a big deal.
--
Ben