Nix writes:
> So basically the detector needs to make a decision based on the
> first bufferful, which is short and very likely to be pure ASCII.
That's a bit of a sod, really.
Yeah. I have some ideas about doing it better, but I just don't
understand the lstream code well enough to do things right. I don't
feel terribly bad tho'; Uli Drepper admitted spending about a year
writing the gconv code that went into glibc, mostly on dealing with
the kinds of "multiply-buffered stream" that bedevil our lstreams.
We *really* *really* should expose lstreams to Lisp (but I don't know
how to do that, either). Python's codec API is a reasonable model.
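To sketch what I mean (purely hypothetical names -- none of this
exists yet), something modeled on Python's incremental codecs might
look like:

    ;; Hypothetical Lisp-level decoder object; the point is that state
    ;; for partial multibyte sequences lives in the decoder, not the
    ;; caller.
    (let ((dec (make-coding-decoder 'utf-8)))
      (coding-decoder-feed dec "\xe3\x81")  ; partial UTF-8 sequence -> ""
      (coding-decoder-feed dec "\x82")      ; completes HIRAGANA A
      (coding-decoder-finish dec))          ; flush, or error on leftovers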
Alas I expect that *most* apps expect that when the locale is
UTF-8, they can emit UTF-8 output. That's kind of a large part of what
the locale setting *means*.
That's *precisely* what it means. But there's nothing in the Unicode
standard that says that a *receiving* Unicode application must handle
the full repertoire, only that it not corrupt character streams being
passed to other conforming applications. Since we're not passing that
stream anywhere, we're still conforming. (It's a bug, it's a bug, but
it's not a standards non-conformance issue.)
> It probably makes sense in the case of synchronous processes,
> such as *compile* buffers, to treat them as "slow" files. That
> is, buffer the whole output from the process, do detection on a
> large sample, and then convert the whole thing.
True, unless the compiler is *really* dog slow (not unheard of).
Doesn't matter; a synchronous call-process doesn't come back until the
compiler exits.
I think you'd need a crude high-speed estimator which triggers a full
check of probable coding system only when a character is emitted that
has both not been emitted recently and that is not a frequently emitted
character in that coding system (so catting a binary by mistake would
check only a few times, but a sudden emission of a Unicode quote would
trigger a re-evaluation).
Oh, we can do a *lot* better than that, for not too much expense in
in-core data or computation. Quite practical, even if not infallible.
At least, I'd like to try! I'm just lstream-challenged. :-)
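To make the flavor concrete (a toy sketch, untested, names made up):
the cheap test only has to notice "an octet I haven't seen since the
last full detection pass", something like

    ;; Re-run the expensive detector only when a chunk contains a
    ;; non-ASCII octet we haven't seen since the last full pass.
    (defvar detector-octets-seen (make-vector 256 nil))

    (defun detector-chunk-suspicious-p (chunk)
      "Return non-nil if CHUNK contains a new non-ASCII octet."
      (let (suspicious)
        (dotimes (i (length chunk))
          (let ((octet (logand (aref chunk i) #xff)))
            (unless (aref detector-octets-seen octet)
              (aset detector-octets-seen octet t)
              (when (>= octet #x80)
                (setq suspicious t)))))
        suspicious))

and only when that fires do you pay for a real statistical check.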
> Do you really want to reconvert the buffer every time the process
> spits out more output? (Maybe you do, but somebody's gonna have to
> code it and put it in front of users to see what they think.)
No way. Reconversions should be as rare as possible, and no rarer.
Then you should think again. Conversions are fast, because they're O(n)
and done in C; even a multimegabyte buffer converts quickly on current
machines. How
about redisplay? No problem, because only the characters that
actually change get redisplayed. Of course we don't want to do this
in a tight loop, but "as rare as possible" just isn't a goal.
> ? People (at least in the Japanese subset) actually do stuff like
> that.
Well, that's strange, but if they've asked for that, presumably they
expect the results.
I thought you worked in the financial sector? On *this* Earth, people
expect to get what they *want*, not what they *asked for*. Traders
can get quite violent about it, I hear. :-p
> Environment is for setting user defaults, but the user rarely
> interfaces directly with GCC these days; they interact with some kind
> of filter (aka IDE).
In that case it's the IDE's job to reset LANG to a non-UTF-8 value
if it's not willing to cope with UTF-8 output!
My point is that the IDE *could* cope with GCC's UTF-8 output, but
smart IDE developers will Just Say No, and *will* set LANG to C to
turn off all bright ideas from the GCC error beautification committee.
They will then parse the traditional, bog-standard GCC error output,
*probably giving the same results that the G.E.B. committee's proposal
does*, but with the huge advantage that those results are available
with GCC 2.6.3 (ah, those were the days!).
iconv is a special case because its entire raison d'etre is encoding
conversion: of course it has to be capable of dealing with multiple
encodings simultaneously. GCC, so far, doesn't,
Excuse me? What about your "the Russians bitched" example? You claim
that that's not treatable with the iconv medicine.
so it uses LANG (like, oh, just about every other noninteractive
program out there other than things that are part of the i18n
infrastructure like iconv).
Noninteractive programs should not use LANG at all. That's my point.
A noninteractive program should put on a Big Attitude, and say "Now
you listen up, you morons! My input is UTF-8, and I will spit it
right back on your shoes if you give me anything else. My output is
UTF-8, and if you don't like that, put it where the sun don't shine
because I don't want to hear about it! But you'll probably get better
results by piping them through GNU recode."<wink>
Then the IDEs can promote their "recode-less development workflow."
Everybody happy (except for the now unemployed GCC error beautification
committee, who now must work in the HCI department of an IDE vendor).
Well, it could have a conversion table that says `if the process-coding
system is FOO, set LANG to BAR'. (However, this is complicated by the
divergent locale names in many Unixes, argh.)
You're missing the point. The stream that comes out of the process
(eg, on a network socket) may very well change encodings on the fly.
This happens in ESMTP, for example, if 8BIT is enabled (or if you
receive any Chinese spam, which doesn't care if 8BIT is enabled).
XEmacs should not consider its process-coding-system any more reliable
than the LANG it was started under (modulo the amount of time the
process runs, of course---if your XEmacs runs for a month, then of
course it's much more likely to switch encodings than a program that
runs for 250ms).
> You need a heavy-duty multilingual natural
> language processing system to do what even the famed 7-bit
> null-lingual American programmer can do "by eye".
In the general case, you're right. In a lot of useful special cases it
may be possible anyway.
Sure. The problem is identifying those cases a priori; it's typically
just as easy to look at the output and fix it (by reconverting).
> Users should invoke gcc via a wrapper that handles the translations
> for them.
That means the users using KOI8-R would have to have a *really* smart
wrapper, that knows when to switch between that and UTF-16, and so on
and so forth... it's easier for the common case if GCC just uses iconv()
to convert things itself.
Why? Though I wrote jokingly, I was not joking about the Big Attitude
Policy. If GCC simply declared that (1) Java program input is
bigendian UTF-16, with any string and character constants that wouldn't
have the right semantics if compiled verbatim as UTF-16 octet strings
being required to use octal/hex escapes, and (2) output is UTF-16, with any
octet sequences that aren't legit UTF-16 converted to corresponding
octal/hex (there's no reason for such to appear outside of string and
character literals), then
gcc -x c <(iconv -t utf-16 -f koi8-r foo.c) 2>&1 \
  | iconv -t koi8-r -f utf-16
should work fine for programs written in and for koi8-r only.
This would be a lot harder for C or C++, I admit, since they don't
mandate an internal text encoding, so literals in KOI8-R that got
translated to UTF-8 would blow up, since the programmers would
undoubtedly expect to just blast them out with printf. But you
could often just link with a special library that DTRTs with scanf,
printf, and friends, I bet.
Yeah, that's a bit of a swine. I fear the only approach that might work
there would be to have a wrapper around GCC that used gnudoit to query
XEmacs for the current buffer's coding system (or for the corresponding
LANG if you'd rather do the translation in Lisp than in the shell, and
who wouldn't), and then set LANG accordingly.
Nah. Just do (setenv "LANG" "C") early in startup.el, and then start
compiling a hitlist of apps that don't even respect that. This will
require work on our part (mostly in checking to see which of our Lisp
libraries call programs that produce output useful to XEmacs when LANG
is set) but it would seem that it's a lot easier than convincing the
rest of the world to put down the hammer it keeps whacking its head
with.
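Concretely, something like this (a sketch; the advice form is
GNU-Emacs-flavored and untested, and strictly you may want LC_ALL too):

    ;; The blunt version, early in startup.el:
    (setenv "LANG" "C")

    ;; Or, less drastically, force it only for compilations:
    (defadvice compile (around compile-in-posix-locale activate)
      "Run compilations with LANG=C so parsers see bog-standard output."
      (let ((process-environment (cons "LANG=C" process-environment)))
        ad-do-it))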
Since this is so easy, why should GCC produce non-ASCII, ever? Big
Attitude + Postel Principle on the output side is the way to go here,
I think.
Wow. I knew it had a lot, but not that many. I guess I can see why
the original designers of MULE were Japanese: they had a *reason*
to want something so featureful... (it's just a shame they didn't
remain involved.
What "they"? There only ever really was Ken'ichi Handa, who is quite
central over in the GNU camp. (I could also be provoked into saying
"What design?", but please don't put the obvious construction on that.
"True understanding" would require more explanation of Japanese
culture and Mule history than is appropriate here.)
Does anyone really understand CCL any longer?)
Ken Handa and Aidan Kehoe. Hisashi Miyashita probably does, but I
don't know if he's still doing Emacsen.
CCL is quite limited, and not very hard to understand, at least if you
ever programmed in assembler. The real problem is that it's mostly
undocumented, and what is documented isn't implemented very well.
I've always thought it a real shame that they didn't implement it as a
restricted Lisp rather than a completely different language.
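To give the flavor: the whole language is a little register machine
with a handful of ops. The identity transform, from memory (untested),
is about as complicated as the syntax ever gets:

    ;; Buffer magnification 1 (output no larger than input); copy each
    ;; input byte to the output unchanged.
    (define-ccl-program ccl-identity-example
      '(1
        ((loop
           (read r0)
           (write r0)
           (repeat))))
      "Trivial CCL program: pass input through unchanged.")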
> I *can* stick to LANG=ja_JP.UTF-8, precisely because XEmacs ignores
> the "UTF-8". The important part of that to XEmacs is "ja_JP", because
> it tells XEmacs to prefer Japanese fonts and encodings where
> characters are shared with Chinese and/or Korean.
So in other words you're saying `use UTF-8' and then relying on every
program you run regularly ignoring it (or so it seems to me, otherwise
you wouldn't be complaining about GCC using UTF-8 in that situation)?
Not at all. I'm saying that programs whose purpose is to *display*
localized output should pay attention to that. While programs that
produce output that can be usefully localized, but aren't in the
display business, should ignore LANG and produce localiz*able* output
(not localiz*ed*). In GCC's case, I really don't see why its error
output shouldn't be, say, XML with the mso namespace. :-)
N.B. I'm not really complaining about GCC using UTF-8. This requires
only a minor workaround for XEmacs (and could be even more minor than
it actually is, with a bit of investment on our part). I'm
complaining about that hammer you're whacking your own heads with.
But as long as y'all don't want to aim it at mine, be my guest.
> XEmacs has no trouble decoding that, and even if it did, you could fix
> it with a simple comint filter. What bothers me is that a useful
> protocol was changed without warning
[...]
I can't really see any way of advertising it more widely. We don't have
any rooftops to shout from.
Exactly! That's why something like that shouldn't be changed without
a better reason than "Markus Kuhn thinks jumping out of a 2d story
window was a good idea dammit." Markus Kuhn is a guy who has spent
several decades thinking about programs that read and display
localized content. He's not attuned to the issues of distributed
development of systems whose internal components mostly talk to each
other, and only incidentally to humans.
If I were a GCC developer, I would have
1. Written about 1 line of lex and 4 lines of yacc, and created a
simple post-processor you could pipe it to.
2. Found 3 developers, one each at Eclipse, vim, and GNU Emacs, to
fix their compile.el equivalents to translate `' to the typographically
correct characters. Really, it's *our* job, not GCC's.
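Something like this would about do it on our side (a sketch against
GNU Emacs's compilation-filter-hook; the XEmacs equivalent would hang
off the comint output filter, and I haven't tested either):

    ;; Prettify `...' in freshly inserted compiler output into
    ;; typographic quotes.
    (defun my-prettify-compiler-quotes ()
      (let ((end (point)))
        (save-excursion
          (goto-char compilation-filter-start)
          (while (re-search-forward "`\\([^`'\n]*\\)'" end t)
            (replace-match "\u2018\\1\u2019" t)))))

    (add-hook 'compilation-filter-hook #'my-prettify-compiler-quotes)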
It was a major version bump.
Now you're talking! Who needs to make "reasonable assumptions"? Just
say "This way leads to a better world in the long run" (which is
true), and "A major version bump is the right time to do it" (ditto).
BTW, I apologize for the troll; it wasn't intentional, but I knew I
was waiting for something. If I'd thought straight, I would have
realized "major version bump" were the words I wanted to hear, instead
of "reasonable assumption."