I see I did get closer this time. :-)
>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
> (1) I still think it is harmless, at least for literals, and
> possibly useful to allow arbitrary bytes in ei{cat,cpy}_c().
Ben> i still disagree because it can corrupt the innards unless
Ben> you expect automatic binary conversion? Then someone who
Ben> feeds in JIS-encoded data this way will not get Japanese, but
Ben> raw data, and perhaps should've used eicpy_ext().
It can't corrupt the innards because it does do binary conversion.
Of course Japanese should be forbidden there, as should _all_ encoded
text. Only when the programmer is willing to take responsibility for
correctly handling the encoding at the raw octet level should the
ei*_c APIs be used. E.g., in mswindows_get_file() you handled a
similar issue of correct formatting by checking for a trailing `\'.
But when someone needs to insert ISO 2022 escape sequences by hand,
or the like, the ei*_c APIs are convenient, and they also signal that
the programmer is doing something potentially dangerous: not to
XEmacs, but to the external system or to the user's data. That's true
of mistakes with ASCII, too.
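
To make that concrete, here is a rough sketch of the kind of use I
have in mind. I'm assuming the Eistring API looks more or less like
the DECLARE_EISTRING / eicpy_c / eicat_c / eicpy_ext macros in the
current text.h; treat the exact names and signatures as illustrative:

    /* Sketch only: hand-built ISO 2022 designation via the _c API.
       The programmer takes responsibility for these raw octets.  */
    DECLARE_EISTRING (out);

    eicpy_c (out, "\033$B");      /* ESC $ B designates JIS X 0208 */
    eicat_c (out, "...");         /* more hand-assembled octets    */

    /* Encoded Japanese text, by contrast, should come in through
       the _ext variants with an explicit coding system, roughly:
       eicpy_ext (out, jis_data, coding_system);  */

The _c calls say "I know these are raw octets"; anything that is
really encoded text belongs to eicpy_ext() and friends.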
The following is an aside and not directly related to Eistrings
anymore.
Ben> by its nature, the default int. rep. must be able to
Ben> represent all chars. that would rule out utf16 if we have
Ben> more than 1,000,000 and some chars. but it doesn't rule out
Ben> ucs4, or some utf16 extension that could encode gigs o'
Ben> chars, etc.
OK, I asked because I wondered if you'd thought about it in detail.
Apparently not. Here's how I see it.
We do have that many character code positions if we admit UCS-4 or
UTF-8. But there will not be such a UTF-16 extension; Unicode is
committed to staying within the UTF-16 character set for the
foreseeable future. There is already a UTF-32 specification for a
32-bit-wide Unicode representation; it too is committed to the
17*65536 UTF-16 code space, and will not be expanded beyond that for
the foreseeable future, despite the potential for 2^31 characters
(they are also committed to ISO 10646 compatibility).
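
For concreteness, the arithmetic: 17 * 65536 = 1,114,112 code
positions, U+0000 through U+10FFFF, versus the 2^31 positions the
original UCS-4 space allows. A trivial check (plain C, nothing
XEmacs-specific) of what "representable in UTF-16" means:

    #include <stdio.h>
    #include <stdint.h>

    /* The UTF-16 (and hence UTF-32-as-specified) code space is
       17 * 65536 = 1,114,112 positions, U+0000..U+10FFFF.  */
    static int
    fits_utf16_code_space (uint32_t code)
    {
      return code <= 0x10FFFFu;
    }

    int
    main (void)
    {
      printf ("UTF-16 code space: %lu positions\n", 17UL * 65536UL);
      printf ("U+20000  fits: %d\n", fits_utf16_code_space (0x20000));
      printf ("U+200000 fits: %d\n", fits_utf16_code_space (0x200000));
      return 0;
    }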
However, there are uses for the huge UCS-4 range. Tomo and his
buddies are already playing with the so-called "konjaku-mojikyo"
pseudo-charsets, which have over 70,000 code points already assigned.
These are quite popular with Japanese Windows users, too. Anyway,
they will eat up most of the UTF-16 private space if used in the
obvious way. Nor do we want to encourage "Japanese exceptionalists"
to borrow not-yet-standardized parts of the UTF-16 space.
Also, although _we_ should not support "language-tagged character"
encodings (pace, Olivier) by default, we should permit third party
libraries and extension modules to do so. This could be easily done
using UCS-4 private space.
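
One way a third-party module might do that (a purely hypothetical
layout, invented here for illustration, not something I'm proposing
in detail): fold a small language tag into the bits above the 21 bits
Unicode actually uses, which a UCS-4-style internal representation
leaves free.

    #include <stdint.h>

    /* Hypothetical packing: bits 0..20 hold the UCS code position,
       bits 21..27 hold a small language tag.  All names invented.  */
    #define LANG_SHIFT 21
    #define LANG_MASK  0x7Fu

    static uint32_t
    tag_char (uint32_t ucs, unsigned lang)
    {
      return ucs | ((uint32_t) (lang & LANG_MASK) << LANG_SHIFT);
    }

    static uint32_t
    untag_char (uint32_t c)
    {
      return c & ((1u << LANG_SHIFT) - 1);
    }

    static unsigned
    char_lang (uint32_t c)
    {
      return (c >> LANG_SHIFT) & LANG_MASK;
    }

Tagged values land well outside the Unicode range, so they can never
collide with real characters in the default representation.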
Before Hrvoje jumps in, let me say that given that the Unicode
standard will not expand out of the UTF-16 range, I don't think
konjaku-mojikyo or other linguistic research purposes are good enough
reasons for not using UCS-2 as a default internal representation. I
will think about how to handle that potential conflict in a reasonable
way, to forestall the abuses that I would expect from Japanese
programmers. We should, of course, support
--with-default-internal-representation={ucs4,utf8}
for people who want those huge spaces. The issue here is how to have
XEmacs gracefully decline to handle them when the default internal
representation is UCS-2; that's not such a big deal.
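
By "gracefully decline" I mean something like the following sketch
(the function and the policy are invented for illustration, not
actual XEmacs code): with a UCS-2 internal representation, anything
above U+FFFF gets caught at the decoding boundary and replaced,
rather than silently truncated to 16 bits.

    #include <stdio.h>
    #include <stdint.h>

    /* Illustration only: map characters a UCS-2 representation
       cannot hold to U+FFFD (REPLACEMENT CHARACTER) and warn.  */
    static uint16_t
    to_ucs2_or_replacement (uint32_t ucs4)
    {
      if (ucs4 > 0xFFFFu)
        {
          fprintf (stderr, "U+%05lX not representable in UCS-2\n",
                   (unsigned long) ucs4);
          return 0xFFFDu;
        }
      return (uint16_t) ucs4;
    }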
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."