>>>> "Jan" == Jan Rychter <jan(a)rychter.com> writes:
Jan> LANG! That is enlightening -- I somehow never thought XEmacs
Jan> would use the LANG setting to enforce coding systems for
Jan> files. In fact, I do have LANG set to en_US on this
Jan> machine. I was somehow convinced that XEmacs tried to stay
Jan> away from locales as much as possible.
I know of only two places where LANG is used. During initialization,
XEmacs uses it as the default for set-language-environment, which
sets the default buffer-file-coding-system, among other things. If
LANG is not set, the default coding system is iso-8859-1. The other
place is in determining the appropriate X Input Method.
Stephen> Or use latin-unity. (This doesn't apply to non-Latin
Stephen> users yet, but they mostly don't have these problems
Stephen> anyway.)
Jan> Does this solve all cases? I mean, are you sure that this
Jan> will trap all cases of data loss?
If used properly and it isn't buggy, yes. What is done is to hang a
function, latin-unity-sanity-check, on the hook that is used to
determine the coding system for writing.
The function goes through the buffer and determines all the charsets
used. If they are restricted to Latin sets, an auxiliary computation
is done to see if they are all compatible with a single ISO 8859
character set. If that is the same as the buffer default, the save
goes through immediately using that coding system. If they're not the
same, the compatible coding system is checked for membership in a
preapproved list; if it is on the list, the save goes through
immediately using that coding system. Optionally (user-configurable)
the default coding system for the buffer is updated.
Finally, if no luck yet, the best available coding system is computed
(it might be a Latin set but not preapproved, or it might be a
universal coding system), and recommended to the user. At that point
the user can choose any coding system (and thus shoot himself in the
foot if some characters can't be encoded), but the recommended system
is guaranteed not to lose data.
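
To make the order of those checks concrete, here is a rough sketch in
Python. It is not the latin-unity code: the function names, the list of
Latin codecs, and the use of Python codecs as stand-ins for Mule coding
systems are all assumptions made for illustration.

```python
# Illustrative sketch only; NOT the latin-unity implementation.
# Python codecs stand in for Mule coding systems, and every name
# below is hypothetical.

LATIN_CODECS = ["iso8859-1", "iso8859-2", "iso8859-3", "iso8859-4",
                "iso8859-9", "iso8859-15"]
UNIVERSAL = "utf-8"


def encodable(text, codec):
    """Return True if every character of text can be written in codec."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def choose_coding_system(text, buffer_default, preapproved):
    """Return (codec, needs_confirmation) for saving text."""
    # Step 1: the buffer default can represent everything, so the
    # save goes through immediately.
    if encodable(text, buffer_default):
        return buffer_default, False
    # Step 2: some single Latin set is compatible and preapproved,
    # so the save also goes through immediately.
    compatible = [c for c in LATIN_CODECS if encodable(text, c)]
    for codec in compatible:
        if codec in preapproved:
            return codec, False
    # Step 3: recommend the best lossless choice and ask the user.
    if compatible:
        return compatible[0], True
    return UNIVERSAL, True
```

For example, choose_coding_system("Łódź", "iso8859-1", ["iso8859-2"])
picks iso8859-2 without asking, while the same call with an empty
preapproved list still recommends iso8859-2 but flags it for
confirmation.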
What can go wrong: (1) Somebody hangs an unsafe procedure on the hook
with higher priority. (This is why 21.5 needs to have this facility
native, so we can make sure there's at least a warning.) (2) There's
one corner case where you can escape and use the old Mule method for
computing the coding system, but the user has to set
latin-unity-like-to-live-dangerously non-nil or the save fails. ;-)
Jan> Here's *exactly* what I did (XEmacs 21.5-b12):
Jan>  -- open a file containing 2022-7 with xemacs -vanilla using
Jan>     C-x C-f filename.txt
Jan>  -- C-u C-x C-w new-filename.txt RET iso-8859-2 RET
Jan> new-filename.txt still contains ISO-2022-7 where the
Jan> ISO-8859-2 characters should be. I'm not talking about
Jan> on-screen representation, I'm talking about the file contents
Jan> as viewed with less or vi (to be sure nothing messes with
Jan> display).
If you can, send me the file. I cannot reproduce this behavior.
Jan> Perhaps this functionality is also influenced by my LANG
Jan> setting?
I don't think so; I can't reproduce it here, as I say.
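
If it helps to narrow this down, the raw bytes can be checked without
any editor or pager in the way. The following is a minimal Python
sketch (the function name is made up for illustration). It relies only
on two facts: ISO 2022 escape sequences begin with the ESC byte, 0x1B,
and the non-ASCII characters of ISO 8859-2 are encoded as bytes with
the high bit set.

```python
ESC = 0x1B  # ISO 2022 escape sequences begin with this byte


def describe_bytes(data):
    """Report whether data contains ESC bytes or 8-bit bytes."""
    return {
        "has_escape": ESC in data,
        "has_8bit": any(b >= 0x80 for b in data),
    }
```

Reading new-filename.txt in binary mode and passing the bytes to
describe_bytes would show whether the file really still contains
escape sequences or was written as 8-bit ISO 8859-2.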
Jan> But one more question begs asking: what is the benefit of
Jan> having your characters reduced to tildes? I mean, what
Jan> purpose does it serve?
Many terminals and applications choke on ISO 2022 escape sequences. I
don't think there are many such terminals left, but I once fried
someone's terminal inadvertently. He had sent me a mail with VT100
escape sequences in it that caused (ASCII graphic) fireworks on my
display. I sent it back to him, not realizing that he'd ripped off
the fancy console from the machine room and didn't have a
VT-compatible. I don't know if he got literal fireworks, but the
terminal did blow a fuse!
In general, filtering illegal characters out of a text stream is a
legitimate function. That's why ASCII has the 0x1A SUBSTITUTE
character, and Unicode has the U+FFFD REPLACEMENT CHARACTER. Mule was
designed before Unicode, and you can't use SUB for its original
purpose on DOS because somebody decided to use SUB to mean END OF
TEXT. Thus the use of TILDE for this purpose. XEmacs is a general
text-processing application; it should be able to do such filtering.
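
As an illustration of that kind of filtering (in Python, not in Mule),
here is a small sketch that replaces every character the target coding
system cannot represent with a tilde. The handler name "tilde" and the
helper filter_for are invented for this example; codecs.register_error
is the standard Python hook for custom encoding error handlers.

```python
import codecs


def tilde_replace(error):
    """Codec error handler: emit one '~' per unencodable character."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # Return the replacement text and the position to resume from.
    return ("~" * (error.end - error.start), error.end)


# Make the handler available as errors="tilde".
codecs.register_error("tilde", tilde_replace)


def filter_for(text, codec):
    """Encode text in codec, turning unrepresentable characters into '~'."""
    return text.encode(codec, errors="tilde")
```

With this, filter_for("Łódź", "iso8859-1") keeps the characters Latin-1
can represent and substitutes tildes for the two it cannot. Python's
built-in errors="replace" does the same job with "?" as the
substitute.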
Jan> So, while I understand your explanations about the complexity
Jan> of the issues involved, I still don't understand the
Jan> opposition to just changing or removing the evil piece of
Jan> code that changes data to tildes.
It's not a question of "just removing [one] evil piece of code",
because you'd have to make sure that the characters that were left
alone could not escape into a buffer or string where they aren't
legal in the internal coding. That would lead to a crash. This is not easy
to do, since the translation code is _not_ I/O code, and cannot know
whether the internal object it is producing will ever be referenced
inside of XEmacs again.
So really, the filtering function should be available, it just should
be disabled for most purposes. Unfortunately, without a fair amount
of work, that's hard to do and still keep the internal data structures
consistent.
--
Institute of Policy and Planning Sciences
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.