On Mon, 23 Oct 2006, stephen@xemacs.org murmured woefully:
> Nix writes:
> > But, um, why aren't the codecs expecting UTF-8 when LANG is set to
> > a UTF-8 locale?
> They do by default in most situations, such as reading from files.
> But processes are a different matter, because they're not rewindable.
> So basically the detector needs to make a decision based on the first
> bufferful, which is short and very likely to be pure ASCII.  (Don't
> say "but ..." for a couple paragraphs, please.)

True :(  That's a bit of a sod, really.
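
You can see the problem interactively; a GNU-Emacs-flavoured example,
since XEmacs spells its detection API a little differently:

    (detect-coding-string "gcc -c foo.c\n")
    ;; => (undecided) or similar: pure ASCII is compatible with
    ;; nearly every coding system, so the detector can't commit.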

> In that situation, it's typically a bad idea to assume anything but
> binary.  The codecs don't know that they're dealing with GCC; they
> could easily be dealing with XEmacs, which has a habit of dumping byte
> code in its Lisp backtraces.  It turns out that in XEmacs 'binary is
> synonymous with 'iso-8859-1 (a backwards-compatibility hack), which
> had the misfeature of doing the right thing 90% of the time, keeping
> the Mule developers from working on pushing that to 99.4%.

Gah.

You could keep a list of which apps are well-behaved, but I suspect
this would be unreliable.

I see the problem.  Alas, I expect that *most* apps expect that when
the locale is UTF-8, they can emit UTF-8 output.  That's kind of a
large part of what the locale setting *means*.

> I don't know of any (working) resampling codecs that check to see if
> things have changed on the fly.  It probably makes sense in the case
> of synchronous processes, such as *compile* buffers, to treat them as
> "slow" files.  That is, buffer the whole output from the process, do
> detection on a large sample, and then convert the whole thing.

True, unless the compiler is *really* dog slow (not unheard of).
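
Something like this hand-waving sketch, perhaps (GNU-Emacs-flavoured
Lisp, untested; XEmacs's detection functions differ in detail):

    ;; Treat a synchronous process as a "slow" file: read its output
    ;; as raw octets, detect the coding system on the whole sample,
    ;; then decode everything in one go.
    (defun my-run-and-decode (command)
      (with-temp-buffer
        (let ((coding-system-for-read 'binary)) ; no on-the-fly decoding
          (call-process shell-file-name nil t nil
                        shell-command-switch command))
        (decode-coding-region (point-min) (point-max)
                              (detect-coding-region (point-min)
                                                    (point-max) t))
        (buffer-string)))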

> (There's an obvious space optimization of buffering only until you
> have so much you won't use any more, convert the detection sample,
> and use that codec for the rest on-the-fly.)  But it's not obvious
> how to make this work for asynchronous processes, such as *shell*
> buffers.

I think you'd need a crude high-speed estimator which triggers a full
check of probable coding system only when a character is emitted that
has both not been emitted recently and is not a frequently emitted
character in that coding system (so catting a binary by mistake would
trigger a check only a few times, but a sudden emission of a Unicode
quote would trigger a re-evaluation).

But this is probably totally impossible (as well as wildly impractical)
due to details of some obscure coding system I've never heard of :(
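
If it *were* possible, I imagine it looking vaguely like this
(untested sketch; my-frequent-char-p, the per-coding-system frequency
test, is entirely imaginary, which is rather the problem):

    (defvar my-recently-seen (make-hash-table)
      "Characters seen recently in this output stream.")

    (defun my-needs-redetection-p (string)
      "Return t if STRING contains a character suspicious enough to
    justify a full coding-system re-detection."
      (catch 'suspicious
        (dolist (ch (string-to-list string))
          (when (and (>= ch 128)                     ; non-ASCII
                     (not (gethash ch my-recently-seen))
                     (not (my-frequent-char-p ch)))  ; imaginary
            (throw 'suspicious t))
          (puthash ch t my-recently-seen))
        nil))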

> Do you really want to reconvert the buffer every time the process
> spits out more output?  (Maybe you do, but somebody's gonna have to
> code it and put it in front of users to see what they think.)

No way.  Reconversions should be as rare as possible, and no rarer.
(As I said, I have no idea how to actually make this happen :( )

> > LANG= exec real-gcc "$@"
> And what if "real-gcc" is a lie, and in fact it's a script
> LANG=Etruscan exec really-real-gcc "$@"

!

> ?  People (at least in the Japanese subset) actually do stuff like
> that.

Well, that's strange, but if they've asked for that, presumably they
expect the results (which in this case wouldn't include Unicode quote
marks, but I'll pretend you said Etruscan.UTF-8 ;) )

But I'll agree that perhaps a flag is wanted which arranges to emit
nothing outside the 7-bit subset unless absolutely necessary.  (Of
course that's too late for GCC 4.1 and 4.2 now.)

> Environment is for setting user defaults, but the user rarely
> interfaces directly with GCC these days; they interact with some kind
> of filter (aka IDE).

In that case it's the IDE's job to reset LANG to a non-UTF-8 value if
it's not willing to cope with UTF-8 output!
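
From the Emacs side that's nearly a one-liner, e.g. (a sketch; the
make invocation is just for illustration):

    ;; Run the compile with LANG forced to C, leaving the user's own
    ;; environment untouched.
    (let ((process-environment (cons "LANG=C" process-environment)))
      (compile "make"))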
"System-global" is incorrect as you
point out,
but LANG *is* process-global, which is inappropriate in multilingual
applications. (Suppose iconv obeyed LANG? Then in your environment
it would only be useful for converting UTF-8 to UTF-8! ;-)
iconv is a special case because its entire raison d'etre is encoding
conversion: of course it has to be capable of dealing with multiple
encodings simultaneously. GCC, so far, doesn't, so it uses LANG (like,
oh, just about every other noninteractive program out there other than
things that are part of the i18n infrastructure like iconv).

> > Hm.  I find myself wondering if perhaps the XEmacs shell modes
> > shouldn't arrange to reset LANG appropriately for the
> > process-coding-system: of course that doesn't help much if you
> > change it after the shell is started, but if you do *that*
> > presumably you knew you were doing it and could feed a LANG= to
> > the shell.
> Nope; it's a chicken-and-egg problem.  Sometimes you start with the
> chicken and sometimes with the egg.  There is no general way to do
> this; it's AI stuff.

Well, it could have a conversion table that says `if the process-coding
system is FOO, set LANG to BAR'.  (However, this is complicated by the
divergent locale names in many Unixes, argh.)
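
Something like this, say (a sketch; the locale names on the right are
exactly the platform-dependent part, so treat them as placeholders):

    (defvar my-coding-system-lang-alist
      '((utf-8       . "en_GB.UTF-8")       ; placeholder locale names:
        (euc-jp      . "ja_JP.eucJP")       ; they vary across Unixes
        (shift_jis   . "ja_JP.SJIS")
        (iso-2022-jp . "ja_JP.ISO-2022-JP"))
      "Map coding-system symbols to a plausible LANG value.")

    (defun my-lang-for-coding-system (cs)
      (or (cdr (assq cs my-coding-system-lang-alist)) "C"))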

> You need a heavy-duty multilingual natural language processing system
> to do what even the famed 7-bit null-lingual American programmer can
> do "by eye".

In the general case, you're right.  In a lot of useful special cases
it may be possible anyway.

> > (The Unix LANG/LC_* model isn't really designed for situations
> > where you're constantly changing applicable encoding, is it?)
> You're darn tootin' it isn't.  That's basically the issue that killed
> XIM.  setlocale() can take a noticeable amount of time in situations
> where you're switching back and forth between input modes all the
> time.  And don't even mention "multi-threading"!

I could make you scream by mentioning the vile localeconv().  (But I
won't.)

> It's a serious layering violation for GCC itself to be doing those
> translations.  gcc (the gcc controller app itself) should assume that
> Java code is native-endian UTF-16 (that's the standard, right?).

Yeah.

> Users should invoke gcc via a wrapper that handles the translations
> for them.

That means the users using KOI8-R would have to have a *really* smart
wrapper, that knows when to switch between that and UTF-16, and so on
and so forth... it's easier for the common case if GCC just uses
iconv() to convert things itself.  (This example was not plucked out
of the air.)

> > I still don't understand why.  Is it that LANG might not match the
> > encoding you're using *right now*?
> It's that in a multilingual environment the odds are very good that
> LANG inherited from a process's initialization doesn't match the
> encoding I'm using right now, yes.

Yeah, that's a bit of a swine.  I fear the only approach that might
work there would be to have a wrapper around GCC that used gnudoit to
query XEmacs for the current buffer's coding system (or for the
corresponding LANG if you'd rather do the translation in Lisp than in
the shell, and who wouldn't), and then set LANG accordingly.
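
The Lisp half could reuse the hypothetical table from above (untested;
XEmacs's buffer-file-coding-system holds a coding-system object, hence
the coding-system-name call):

    (defun my-current-buffer-lang ()
      "Return a LANG value matching this buffer's coding system,
    for a gnudoit-driven wrapper script to pick up."
      (my-lang-for-coding-system
       (coding-system-name buffer-file-coding-system)))

The shell half is then roughly
LANG=`gnudoit '(my-current-buffer-lang)'` exec real-gcc "$@", modulo
persuading gnuserv to evaluate that in the context of the right buffer.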

> > If so, then, well, this only applies to people who are changing
> > encodings all the time in shell buffers in which they're also
> > running compilations.  Is this really common?  (If they're changing
> > encodings so often, surely they can change encoding back?)
> In Japan it is; UTF-8-encoded Japanese text is very much a minority
> taste even today.

(Most Japanese correspondents I talk to in my financial-info-thumping
day job seem to use SJIS.)

> Of course you can change encodings back.  The issue is, why should I
> have to pay that tax for a *compiler's error output*??  The `'
> convention is perfectly usable though ugly, and can easily be
> beautified (eg, with a nearly harmless comint filter in Emacsen).

Well, unset LANG then, and GCC will default to 7-bit ASCII; that's
even easier :)
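
And that comint filter really is nearly harmless.  An untested,
GNU-Emacs-flavoured sketch (XEmacs spells replace-regexp-in-string as
replace-in-string, among other differences):

    ;; Turn GCC's U+2018/U+2019 quotes back into `' in comint buffers.
    (defun my-asciify-gcc-quotes (string)
      (replace-regexp-in-string
       "\u2018" "`"
       (replace-regexp-in-string "\u2019" "'" string)))

    (add-hook 'comint-preoutput-filter-functions
              #'my-asciify-gcc-quotes)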

> > No!  Stick with a LANG set to UTF-8 and everything should work.  I
> > can't understand why it isn't for you.
> Because I'm a resident of Japan, which has *5* major encodings
> (ISO-2022-JP, EUC-JP, Shift-JIS, UTF-8, and roman transliteration)

Wow.  I knew it had a lot, but not that many.  I guess I can see why
the original designers of MULE were Japanese: they had a *reason* to
want something so featureful...

(It's just a shame they didn't remain involved.  Does anyone really
understand CCL any longer?)

> (damn that broken Windows anyway, it's such a pane).

I hear a lot of people think it's smashing.

> I *can* stick to LANG=ja_JP.UTF-8, precisely because XEmacs ignores
> the "UTF-8".  The important part of that to XEmacs is "ja_JP",
> because it tells XEmacs to prefer Japanese fonts and encodings where
> characters are shared with Chinese and/or Korean.

So in other words you're saying `use UTF-8' and then relying on every
program you run regularly ignoring it (or so it seems to me, otherwise
you wouldn't be complaining about GCC using UTF-8 in that situation)?
That seems... brittle.

> Once I know it's Japanese, the statistical characteristics of the
> octet streams give "no defects in 10 years of heavy daily use"
> reliability in distinguishing the 4 "real" Japanese encodings from
> each other.

I knew MULE was good, but I didn't know it was that good.  That's an
incredibly low error rate for any estimation function.

> > In that case, said smart tool should have *no* trouble with a
> > couple of Unicode quotes coming out of GCC (and, indeed, for me, it
> > all works.  But that doesn't say much because if it didn't work for
> > me I'd have fixed it already.)
> XEmacs has no trouble decoding that, and even if it did, you could
> fix it with a simple comint filter.  What bothers me is that a useful
> protocol was changed without warning, from something that is simple

It was prominently mentioned in the GCC 4.0 release notes, along with
info on how to disable it, at <http://gcc.gnu.org/gcc-4.0/changes.html>,
plus a link to an article by Markus Kuhn on why using Unicode quotes
was a good idea dammit.

I can't really see any way of advertising it more widely.  We don't
have any rooftops to shout from.

> and robust and well-known even to legacy applications, to something
> that is less simple, demonstrably unreliable, and likely to cause bad
> interactions with smart applications that use code dating from before
> the protocol change.  Since the protocol was never formalized, GCC is
> certainly within its rights to say "the joke's on you for trying to
> do something useful with our past very regular behavior".  But I
> don't think that's very productive.

It was a major version bump.  Things change at major version bumps.
It's certainly less disruptive than a C++ ABI bump, and there've been
a good few of those.  (There've even been C ABI bumps on some
architectures, e.g. mips-sgi-irix.)

--
`When we are born we have plenty of Hydrogen but as we age our
Hydrogen pool becomes depleted.'
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta@xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta