Re: Build Reports

Tuesday, 29 April 2003

        ...
>>>> "Jan" == Jan Rychter <jan(a)rychter.com> writes:
    Jan> LANG! That is enlightening -- I somehow never thought XEmacs
    Jan> would use the LANG setting to enforce coding systems for
    Jan> files. In fact, I do have LANG set to en_US on this
    Jan> machine. I was somehow convinced that XEmacs tried to stay
    Jan> away from locales as much as possible.

I know of only two places where it is used.  During initialization,
XEmacs uses it as a default for set-language-environment, which sets
the default buffer-file-coding-system, among others.  Otherwise, the
default coding system is iso8859-1.  The other is to determine the
appropriate X Input Method.

    Stephen> Or use latin-unity.  (This doesn't apply to non-Latin
    Stephen> users yet, but they mostly don't have these problems
    Stephen> anyway.)

    Jan> Does this solve all cases? I mean, are you sure that this
    Jan> will trap all cases of data loss?

If used properly and it isn't buggy, yes.  What is done is to hang a
function, latin-unity-sanity-check, on the hook that is used to
determine the coding system for writing.

The function goes through the buffer and determines all charsets used.
If they are restricted to Latin sets, an auxiliary computation is done
to see if they are all compatible with a single ISO 8859 character
set.  If that is the same as the buffer default, the save goes through
immediately using that coding system.  If they're not the same, the
compatible coding system is checked for membership in a preapproved
list; if it is a member, the save goes through immediately using that
coding system.  Optionally (user-configurable) the default coding
system for the buffer is updated.

Finally, if no luck yet, the best available coding system is computed
(it might be a Latin set but not preapproved, or it might be a
universal coding system), and recommended to the user.  At that point
the user can choose any coding system (and thus shoot himself in the
foot if some characters can't be encoded), but the recommended system
is guaranteed not to lose data.
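
To make the order of the checks concrete, here is a rough sketch in
Python.  It is not the latin-unity code: Python codec names stand in
for XEmacs coding systems, the names choose_coding_system, can_encode,
LATIN_CODECS and UNIVERSAL_CODEC are made up for illustration, and the
charset analysis is simplified to "try each codec and see if it fits".

```python
# Hypothetical sketch of the save-time decision described above.
# Not the real latin-unity implementation; all names are illustrative.

LATIN_CODECS = ["iso-8859-1", "iso-8859-2", "iso-8859-3", "iso-8859-4",
                "iso-8859-9", "iso-8859-15"]
UNIVERSAL_CODEC = "utf-8"  # stand-in for "a universal coding system"


def can_encode(text, codec):
    """Return True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def choose_coding_system(text, buffer_default, preapproved):
    """Return (codec, ask_user) for saving text.

    ask_user is False when the save can go through immediately and
    True when the codec is only a recommendation for the user.
    """
    # 1. The buffer default already covers the text: save immediately.
    if can_encode(text, buffer_default):
        return buffer_default, False

    # 2. Some single Latin codec covers the text and is preapproved:
    #    save immediately with it.
    compatible = [c for c in LATIN_CODECS if can_encode(text, c)]
    for codec in compatible:
        if codec in preapproved:
            return codec, False

    # 3. No luck yet: recommend the best available codec.  It might be
    #    a Latin codec that is not preapproved, or the universal one.
    recommended = compatible[0] if compatible else UNIVERSAL_CODEC
    return recommended, True


if __name__ == "__main__":
    print(choose_coding_system("Łódź", "iso-8859-1", ["iso-8859-2"]))
    print(choose_coding_system("Łódź", "iso-8859-1", []))
```

In this sketch the recommended codec in step 3 can always encode the
whole text, which mirrors the "guaranteed not to lose data" property;
whatever the user picks instead is his own responsibility.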

What can go wrong:  (1) Somebody hangs an unsafe procedure on the hook
with higher priority.  (This is why 21.5 needs to have this facility
native, so we can make sure there's at least a warning.)  (2) There's
one corner case where you can escape and use the old Mule method for
computing the coding system, but the user has to set
latin-unity-like-to-live-dangerously non-nil or the save fails.  ;-)

    Jan> Here's *exactly* what I did (XEmacs 21.5-b12): -- open a file
    Jan> containing 2022-7 with xemacs -vanilla using C-x C-f
    Jan> filename.txt -- C-u C-x C-w new-filename.txt RET iso-8859-2
    Jan> RET

    Jan> new-filename.txt still contains ISO-2022-7 where the
    Jan> ISO-8859-2 characters should be. I'm not talking about
    Jan> on-screen representation, I'm talking about the file contents
    Jan> as viewed with less or vi (to be sure nothing messes with
    Jan> display).

If you can, send me the file.  I cannot reproduce this behavior.

    Jan> Perhaps this functionality is also influenced by my LANG
    Jan> setting?

I don't think so; I can't reproduce it here, as I say.

    Jan> But one more question begs asking: what is the benefit of
    Jan> having your characters reduced to tildes? I mean, what
    Jan> purpose does it serve?

Many terminals and applications choke on ISO 2022 escape sequences.  I
don't think there are many left, but I once inadvertently fried
someone's terminal.  He sent me a mail with VT100 escape sequences in
it that caused (ASCII graphic) fireworks on my display.  I sent it back
to him, not realizing that he'd ripped off the fancy console from the
machine room and didn't have a VT-compatible.  I don't know if he got
literal fireworks, but the terminal did blow a fuse!

In general, filtering illegal characters out of a text stream is a
legitimate function.  That's why ASCII has the 0x1A SUBSTITUTE
character, and Unicode has the U+FFFD REPLACEMENT CHARACTER.  Mule was
designed before Unicode, and you can't use SUB for its original
purpose on DOS because somebody decided to use SUB to mean END OF
FILE.  Thus the use of TILDE for this purpose.  XEmacs is a general
text-processing application; it should be able to do such filtering.
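
As a small illustration of that kind of filtering (in Python, not
Mule; the function name filter_unencodable is made up, and the tilde
default is only there to match the behavior under discussion):

```python
# Hypothetical sketch: replace every character the target codec cannot
# represent with a substitute character, leaving the rest untouched.

def filter_unencodable(text, codec, substitute="~"):
    """Return text with characters not encodable in codec replaced."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)       # probe one character at a time
            out.append(ch)
        except UnicodeEncodeError:
            out.append(substitute)
    return "".join(out)


if __name__ == "__main__":
    print(filter_unencodable("naïve Łódź", "iso-8859-1"))
```

Python's own codecs offer the same service via
text.encode(codec, errors="replace"), which substitutes "?" rather
than a tilde.  Either way the substitution is lossy by design, which
is exactly why it should be something the user asks for.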

    Jan> So, while I understand your explanations about the complexity
    Jan> of the issues involved, I still don't understand the
    Jan> opposition to just changing or removing the evil piece of
    Jan> code that changes data to tildes.

It's not a question of "just removing [one] evil piece of code",
because you'd have to make sure that the characters that were left
alone would not escape into a buffer or string where they weren't
legal internal coding.  That would lead to a crash.  This is not easy
to do, since the translation code is _not_ I/O code, and cannot know
whether the internal object it is producing will ever be referenced
inside of XEmacs again.

So really, the filtering function should be available; it just should
be disabled for most purposes.  Unfortunately, without a fair amount
of work, that's hard to do and still keep the internal data structures
consistent.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
