Nix writes:
> So basically the detector needs to make a decision based on the
> first bufferful, which is short and very likely to be pure ASCII.
That's a bit of a sod, really.
Yeah. I have some ideas about doing it better, but I just don't
understand the lstream code well enough to do things right. I don't
feel terribly bad tho'; Uli Drepper admitted spending about a year
writing the gconv code that went into glibc, mostly on dealing with
the kinds of "multiply-buffered stream" that bedevil our lstreams.
We *really* *really* should expose lstreams to Lisp (but I don't know
how to do that, either). Python's codec API is a reasonable model.
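To sketch what I mean (purely hypothetical names -- none of this
exists yet), something modeled on Python's incremental codecs might
look like:

    ;; Hypothetical Lisp-level decoder object; the point is that state
    ;; for partial multibyte sequences lives in the decoder, not the
    ;; caller.
    (let ((dec (make-coding-decoder 'utf-8)))
      (coding-decoder-feed dec "\xe3\x81")  ; partial UTF-8 sequence -> ""
      (coding-decoder-feed dec "\x82")      ; completes HIRAGANA A
      (coding-decoder-finish dec))          ; flush, or error on leftovers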
Alas I expect that *most* apps expect that when the locale is
UTF-8, they can emit UTF-8 output. That's kind of a large part of what
the locale setting *means*.
That's *precisely* what it means. But there's nothing in the Unicode
standard that says that a *receiving* Unicode application must handle
the full repertoire, only that it not corrupt character streams being
passed to other conforming applications. Since we're not passing that
stream anywhere, we're still conforming. (It's a bug, it's a bug, but
it's not a standards non-conformance issue.)
> It probably makes sense in the case of synchronous processes,
> such as *compile* buffers, to treat them as "slow" files. That
> is, buffer the whole output from the process, do detection on a
> large sample, and then convert the whole thing.
True, unless the compiler is *really* dog slow (not unheard of).
Doesn't matter; a synchronous call-process doesn't come back until the
compiler exits.
I think you'd need a crude high-speed estimator which triggers a full
check of probable coding system only when a character is emitted that
has both not been emitted recently and that is not a frequently emitted
character in that coding system (so catting a binary by mistake would
check only a few times, but a sudden emission of a Unicode quote would
trigger a re-evaluation).
Oh, we can do a *lot* better than that, for not too much expense in
in-core data or computation. Quite practical, even if not infallible.
At least, I'd like to try! I'm just lstream-challenged. :-)
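To make the flavor concrete (a toy sketch, untested, names made up):
the cheap test only has to notice "an octet I haven't seen since the
last full detection pass", something like

    ;; Re-run the expensive detector only when a chunk contains a
    ;; non-ASCII octet we haven't seen since the last full pass.
    (defvar detector-octets-seen (make-vector 256 nil))

    (defun detector-chunk-suspicious-p (chunk)
      "Return non-nil if CHUNK contains a new non-ASCII octet."
      (let (suspicious)
        (dotimes (i (length chunk))
          (let ((octet (logand (aref chunk i) #xff)))
            (unless (aref detector-octets-seen octet)
              (aset detector-octets-seen octet t)
              (when (>= octet #x80)
                (setq suspicious t)))))
        suspicious))

and only when that fires do you pay for a real statistical check.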
> Do you really want to reconvert the buffer every time the process
> spits out more output? (Maybe you do, but somebody's gonna have to
> code it and put it in front of users to see what they think.)
No way. Reconversions should be as rare as possible, and no rarer.
Then you should think again. Conversions are fast, because they're O(n)
and done in C; even a multimegabyte buffer converts quickly on current
machines. How
about redisplay? No problem, because only the characters that
actually change get redisplayed. Of course we don't want to do this
in a tight loop, but "as rare as possible" just isn't a goal.
> ? People (at least in the Japanese subset) actually do stuff like
> that.
Well, that's strange, but if they've asked for that, presumably they
expect the results.
I thought you worked in the financial sector? On *this* Earth, people
expect to get what they *want*, not what they *asked for*. Traders
can get quite violent about it, I hear. :-p
> Environment is for setting user defaults, but the user rarely
> interfaces directly with GCC these days; they interact with some kind
> of filter (aka IDE).
In that case it's the IDE's job to reset LANG to a non-UTF-8 value
if it's not willing to cope with UTF-8 output!
My point is that the IDE *could* cope with GCC's UTF-8 output, but
smart IDE developers will Just Say No, and *will* set LANG to C to
turn off all bright ideas from the GCC error beautification committee.
They will then parse the traditional, bog-standard GCC error output,
*probably giving the same results that the G.E.B. committee's proposal
does*, but with the huge advantage that those results are available
with GCC 2.6.3 (ah, those were the days!).
iconv is a special case because its entire raison d'etre is encoding
conversion: of course it has to be capable of dealing with multiple
encodings simultaneously. GCC, so far, doesn't,
Excuse me? What about your "the Russians bitched" example? You claim
that that's not treatable with the iconv medicine.
so it uses LANG (like, oh, just about every other noninteractive
program out there other than things that are part of the i18n
infrastructure like iconv).
Noninteractive programs should not use LANG at all. That's my point.
A noninteractive program should put on a Big Attitude, and say "Now
you listen up, you morons! My input is UTF-8, and I will spit it
right back on your shoes if you give me anything else. My output is
UTF-8, and if you don't like that, put it where the sun don't shine
because I don't want to hear about it! But you'll probably get better
results by piping them through GNU recode."<wink>
Then the IDEs can promote their "recode-less development workflow."
Everybody happy (except for the now unemployed GCC error beautification
committee, who now must work in the HCI department of an IDE vendor).
Well, it could have a conversion table that says `if the process-coding
system is FOO, set LANG to BAR'. (However, this is complicated by the
divergent locale names in many Unixes, argh.)
You're missing the point. The stream that comes out of the process
(eg, on a network socket) may very well change encodings on the fly.
This happens in ESMTP, for example, if 8BIT is enabled (or if you
receive any Chinese spam, which doesn't care if 8BIT is enabled).
XEmacs should not consider its process-coding-system any more reliable
than the LANG it was started under (modulo the amount of time the
process runs, of course---if your XEmacs runs for a month, then of
course it's much more likely to switch encodings than a program that
runs for 250ms).
> You need a heavy-duty multilingual natural
> language processing system to do what even the famed 7-bit
> null-lingual American programmer can do "by eye".
In the general case, you're right. In a lot of useful special cases it
may be possible anyway.
Sure. The problem is identifying those cases a priori; it's typically
just as easy to look at the output and fix it (by reconverting).
> Users should invoke gcc via a wrapper that handles the translations
> for them.
That means the users using KOI8-R would have to have a *really* smart
wrapper, that knows when to switch between that and UTF-16, and so on
and so forth... it's easier for the common case if GCC just uses iconv()
to convert things itself.
Why? Though I wrote jokingly, I was not joking about the Big Attitude
Policy. If GCC simply declared that (1) Java program input is
bigendian UTF-16, with any string and character constants that wouldn't
have the right semantics if compiled verbatim as UTF-16 octet strings
being required to use octal/hex escapes, and (2) output is UTF-16, with any
octet sequences that aren't legit UTF-16 converted to corresponding
octal/hex (there's no reason for such to appear outside of string and
character literals), then
gcc -x c <(iconv -t utf-16 -f koi8-r foo.c) 2>&1 \
  | iconv -t koi8-r -f utf-16
should work fine for programs written in and for koi8-r only.
This would be a lot harder for C or C++, I admit, since they don't
mandate an internal text encoding, so literals in KOI8-R that got
translated to UTF-8 would blow up, since the programmers would
undoubtedly expect to just blast them out with printf. But you
could often just link with a special library that DTRTs with scanf,
printf, and friends, I bet.
Yeah, that's a bit of a swine. I fear the only approach that might work
there would be to have a wrapper around GCC that used gnudoit to query
XEmacs for the current buffer's coding system (or for the corresponding
LANG if you'd rather do the translation in Lisp than in the shell, and
who wouldn't), and then set LANG accordingly.
Nah. Just do (setenv "LANG" "C") early in startup.el, and then start
compiling a hitlist of apps that don't even respect that. This will
require work on our part (mostly in checking to see which of our Lisp
libraries call programs that produce output useful to XEmacs when LANG
is set) but it would seem that it's a lot easier than convincing the
rest of the world to put down the hammer it keeps whacking its head
with.
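Concretely, something like this (a sketch; the advice form is
GNU-Emacs-flavored and untested, and strictly you may want LC_ALL too):

    ;; The blunt version, early in startup.el:
    (setenv "LANG" "C")

    ;; Or, less drastically, force it only for compilations:
    (defadvice compile (around compile-in-posix-locale activate)
      "Run compilations with LANG=C so parsers see bog-standard output."
      (let ((process-environment (cons "LANG=C" process-environment)))
        ad-do-it))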
Since this is so easy, why should GCC produce non-ASCII, ever? Big
Attitude + Postel Principle on the output side is the way to go here,
I think.
Wow. I knew it had a lot, but not that many. I guess I can see why
the original designers of MULE were Japanese: they had a *reason*
to want something so featureful... (it's just a shame they didn't
remain involved.
What "they"? There only ever really was Ken'ichi Handa, who is quite
central over in the GNU camp. (I could also be provoked into saying
"What design?", but please don't put the obvious construction on that.
"True understanding" would require more explanation of Japanese
culture and Mule history than is appropriate here.)
Does anyone really understand CCL any longer?)
Ken Handa and Aidan Kehoe. Hisashi Miyashita probably does, but I
don't know if he's still doing Emacsen.
CCL is quite limited, and not very hard to understand, at least if you
ever programmed in assembler. The real problem is that it's mostly
undocumented, and what is documented isn't implemented very well.
I've always thought it a real shame that they didn't implement it as a
restricted Lisp rather than a completely different language.
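To give the flavor: the whole language is a little register machine
with a handful of ops. The identity transform, from memory (untested),
is about as complicated as the syntax ever gets:

    ;; Buffer magnification 1 (output no larger than input); copy each
    ;; input byte to the output unchanged.
    (define-ccl-program ccl-identity-example
      '(1
        ((loop
           (read r0)
           (write r0)
           (repeat))))
      "Trivial CCL program: pass input through unchanged.")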
> I *can* stick to LANG=ja_JP.UTF-8, precisely because XEmacs ignores
> the "UTF-8". The important part of that to XEmacs is "ja_JP", because
> it tells XEmacs to prefer Japanese fonts and encodings where
> characters are shared with Chinese and/or Korean.
So in other words you're saying `use UTF-8' and then relying on every
program you run regularly ignoring it (or so it seems to me, otherwise
you wouldn't be complaining about GCC using UTF-8 in that situation)?
Not at all. I'm saying that programs whose purpose is to *display*
localized output should pay attention to that. While programs that
produce output that can be usefully localized, but aren't in the
display business, should ignore LANG and produce localiz*able* output
(not localiz*ed*). In GCC's case, I really don't see why its error
output shouldn't be, say, XML with the mso namespace. :-)
N.B. I'm not really complaining about GCC using UTF-8. This requires
only a minor workaround for XEmacs (and could be even more minor than
it actually is, with a bit of investment on our part). I'm
complaining about that hammer you're whacking your own heads with.
But as long as y'all don't want to aim it at mine, be my guest.
> XEmacs has no trouble decoding that, and even if it did, you could fix
> it with a simple comint filter. What bothers me is that a useful
> protocol was changed without warning
[...]
I can't really see any way of advertising it more widely. We don't have
any rooftops to shout from.
Exactly! That's why something like that shouldn't be changed without
a better reason than "Markus Kuhn thinks jumping out of a 2d story
window was a good idea dammit." Markus Kuhn is a guy who has spent
several decades thinking about programs that read and display
localized content. He's not attuned to the issues of distributed
development of systems whose internal components mostly talk to each
other, and only incidentally to humans.
If I were a GCC developer, I would have
1. Written about 1 line of lex and 4 lines of yacc, and created a
simple post-processor you could pipe it to.
2. Found 3 developers, one each at Eclipse, vim, and GNU Emacs, to
fix their compile.el equivalents to translate `' to the typographically
correct characters. Really, it's *our* job, not GCC's.
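Something like this would about do it on our side (a sketch against
GNU Emacs's compilation-filter-hook; the XEmacs equivalent would hang
off the comint output filter, and I haven't tested either):

    ;; Prettify `...' in freshly inserted compiler output into
    ;; typographic quotes.
    (defun my-prettify-compiler-quotes ()
      (let ((end (point)))
        (save-excursion
          (goto-char compilation-filter-start)
          (while (re-search-forward "`\\([^`'\n]*\\)'" end t)
            (replace-match "\u2018\\1\u2019" t)))))

    (add-hook 'compilation-filter-hook #'my-prettify-compiler-quotes)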
It was a major version bump.
Now you're talking! Who needs to make "reasonable assumptions"? Just
say "This way leads to a better world in the long run" (which is
true), and "A major version bump is the right time to do it" (ditto).
BTW, I apologize for the troll; it wasn't intentional, but I knew I
was waiting for something. If I'd thought straight, I would have
realized "major version bump" were the words I wanted to hear, instead
of "reasonable assumption."