>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
Ben> "Stephen J. Turnbull" wrote:
> What are these "C strings where non-ASCII characters are
> illegal"? I thought a C string (as defined by <string.h>) was
> a char[], with the length determined by the position of the
> first ASCII NULL. Do you mean that, eg, trying to feed it an
> ISO-8859-1 string will cause an abort with error-checking? How
> about control characters (C1, obviously, but C0, too)? These
> are all legal characters in both the current leading-byte
> representation and any Unicode-based representation.
Ben> you are confused.
Oh.
自分でしょう。
Now please explain to me what good your check is going to do with
the absolutely no ASCII nowhere 100% Japanese string (which means,
appropriately, "on the contrary, aren't _you_ confused?") on the line
above.
The point is that there are _no_ characters representable in one byte
that can harm _Mule_ if, for practical purposes,
#define eicpy_c(ei, c_string) eicpy_ext(ei, c_string, Qbinary)
They can "only" cause mojibake and even external data corruption if
written back out under an inappropriate coding system. OTOH, if you
want to prevent mojibake and external data corruption, you need to
have a full (and 100% accurate) character set autodetector, or at
least a 100% reliable undo on all such operations.
There may be a way to make handling external-format text Mule-safe,
but this isn't it. External-format text is by definition just
infested with various ISO-sponsored diseases (and Microsoft toxins)
and needs to be hedged round with real error-checking, with sharp
teeth and thick hair on its chest. Not half-ass filters that won't
detect people doing many common kinds of Stupid Things[tm], while
preventing arguably legitimate and convenient usages like ISO-8859-1
literals. (Remember the actual implementation of Qbinary!)
If I had a veto, I'd use it on anything that allowed implicit typing
of any external format data.
Ben> This is defined only to make it easier to use literal ASCII
Ben> strings, which is going to be extremely common. If the
Ben> string has any non-ASCII characters in it, they need to
Ben> explicitly specify the encoding -- hence the restriction.
This is just plain evil. People are going to write code that is
allegedly Mule-safe because it uses your API, and have it randomly
abort in non-ASCII locales when somebody feeds it non-ASCII data.
And need I mention that such code will be 100% un-gettext-izable?
Apparently. That issue needs to be carefully thought through before
you allow literals to be handled by these APIs.
A better way of dealing with the issues with literals would simply be
to specify that all XEmacs source files must be encoded in UCS-4 with
the UTF-8 representation. (That means that they must respect the
reserved areas in UCS-4 if they want to go beyond the UTF-32 space.)
Then string literals in the source code can be treated specially (but
generic C strings[1] should not be, since you don't know where they
come from), and will not need a coding system associated with them.
(Obviously, we have to get a usable Unicode coding-system implemented
first so that this is transparent to developers and users.) For the
moment, we can just make that a restriction to ASCII, except in
mule-packages. This will be easily extensible to UTF-8 with zero fuss
as soon as we have the UTF-8 coding system. And this would be
transparent to gettext, should we decide that is the way to go.
Anyway the pro-ASCII bias is just not PC.
Footnotes:
[1] Hm. What about .so modules?
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."