Jan Rychter wrote:
>> I have a feeling this could be related to the resolver.
Glynn> How so?
I think I'm seeing two separate (but possibly related) issues here. One
is that XEmacs hangs. The other is the BadWindow messages.
Sorry, I was focused on the BadWindow errors. The hanging does appear
to be related to the resolver.
I've just spent a long time trying to reproduce the problems, which
seemed to be spurious. And I think I found a case that shows where to
look for the problems.
Tcpdump says:
192.168.1.82.32807 > 10.11.53.10.domain: 17472+ A? news-server.san.rr.com. (40) (DF)
[repeated 5 more times]
192.168.1.82.32809 > 10.11.53.10.domain: 17473+[|domain] (DF)
[repeated 5 more times]
192.168.1.82.32811 > 10.11.53.10.domain: 17474+ AAAA? news-server.san.rr.com. (40) (DF)
[repeated 5 more times]
192.168.1.82.32813 > 10.11.53.10.domain: 17475+[|domain] (DF)
[repeated 5 more times]
192.168.1.82.32815 > 10.11.53.10.domain: 17476+ A? news-server.san.rr.com. (40) (DF)
[repeated 5 more times]
192.168.1.82.32813 > 10.11.53.10.domain: 17477+[|domain] (DF)
[repeated 5 more times]
... at which point XEmacs becomes totally unresponsive and stays in
that state forever (as far as I can tell). There is no other network
activity occurring.
Now, I know _why_ it's trying to do a DNS lookup and failing. It seems
this particular XEmacs process was started when I was connected to some
Wi-Fi hotspot and the resolver took the (valid at the time) contents of
/etc/resolv.conf, remembered it and made it its Bible forever. Of
course I have no way to reach 10.11.53.10 now, and I also have no way to
tell XEmacs that resolv.conf has changed (is there any solution to that
problem?).
To fix the running process, you would need to attach a debugger to the
process and manually modify libc's internal state.
To eliminate the problem in future, run a caching-only named, and have
resolv.conf point to 127.0.0.1. Then, you can just restart named if
you change the DNS server.
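
For illustration, the resolv.conf side of that setup would look roughly
like the fragment below. It assumes the caching-only named is listening
on the loopback address; the named configuration itself is not shown.

```
# /etc/resolv.conf
# Send all queries to the local caching named.
nameserver 127.0.0.1
```

With this in place, processes that cached resolv.conf at startup keep
pointing at 127.0.0.1, which stays valid when the upstream DNS server
changes.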
What I don't understand is why the lookup doesn't just fail. XEmacs
shouldn't hang forever, no matter what happens.
Well, unless gethostbyname/getaddrinfo hang forever. AFAICT, XEmacs
relies upon those functions either succeeding or failing eventually;
it doesn't appear to try to pre-empt them.
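
To make "pre-empting" concrete, here is a minimal sketch of one way a
program could bound a blocking lookup: run it in a worker thread and
stop waiting after a deadline. This is not XEmacs code; the function
name lookup_with_timeout and its resolver parameter are hypothetical,
and Python is used only to keep the example short.

```python
import socket
import threading


def lookup_with_timeout(host, timeout, resolver=socket.gethostbyname):
    """Resolve host in a worker thread; return None on failure or timeout.

    resolver is a hypothetical hook (defaulting to socket.gethostbyname)
    so the blocking call can be swapped out, e.g. for testing.
    """
    result = {}

    def worker():
        try:
            result["value"] = resolver(host)
        except OSError:
            # socket.gaierror and socket.herror are OSError subclasses;
            # a failed lookup simply leaves no "value" behind.
            pass

    # Daemon thread: a lookup that never returns cannot keep the
    # process alive at exit.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # The resolver is still blocked; stop waiting for it.
        return None
    return result.get("value")
```

Note that the abandoned thread stays blocked inside the resolver; the
caller merely stops waiting for it. Whether that trade-off is acceptable
depends on the application.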
And I don't understand how the BadWindow messages are related, but they
seem to appear after I switch (using Alt-TAB in WindowMaker) to another
window (and possibly back).
I'm not sure that they are related. Unless it's due to a bug in the X
error handler. In general, error handling code is notorious for not
getting particularly well tested by typical usage.
So, it seems my hangs are directly related to the resolver (no DNS
servers being reachable) and should be easily reproducible.
I wouldn't be so sure. In particular, what happens after XEmacs calls
gethostbyname or getaddrinfo is likely to be highly system-specific
(i.e. it depends upon what's in nsswitch.conf, whether you're running
nscd, etc.).
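
When comparing systems, one quick check is which sources the "hosts"
line of nsswitch.conf lists. A small sketch follows; the helper name
hosts_sources is hypothetical, and it only reports what the file says,
not what an already-running process has cached.

```python
def hosts_sources(nsswitch_text):
    """Return the tokens listed on the 'hosts:' line of nsswitch.conf text."""
    for raw in nsswitch_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.startswith("hosts:"):
            return line[len("hosts:"):].split()
    # No hosts line found.
    return []
```

For example, passing it the contents of /etc/nsswitch.conf on a machine
whose hosts line reads "hosts: files dns" yields the list
["files", "dns"].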
--
Glynn Clements <glynn.clements(a)virgin.net>