Re: catting random bytes causes character craziness

Monday, 4 November 2002

        Executive summary:  try

(set-coding-priority-list '(no-conversion))

before doing any I/O to buffers in .emacs.
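
A minimal sketch of how that might look at the top of ~/.emacs.  The
featurep/fboundp guards are an assumption added for illustration, not
part of the suggestion above; they only keep the call from signalling
an error in a build without Mule.

```elisp
;; Sketch only: place this before anything in .emacs reads a file
;; into a buffer or starts a subprocess.
;; Assumption: the guards below are illustrative additions.
(when (and (featurep 'mule)
           (fboundp 'set-coding-priority-list))
  ;; Put the no-conversion category first in the autodetection
  ;; priority list.
  (set-coding-priority-list '(no-conversion)))
```

Anything loaded before this form runs is still decoded under the old
priority list, which is why the ordering matters.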

...
>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)xemacs.org> writes:
    Hrvoje> OK.  But it shouldn't be that hard to change this?  Or am
    Hrvoje> I missing something?

Probably. :-)

    Hrvoje> The "locale support" might do several things:

    Hrvoje> a) Check the appropriate variables and set the language
    Hrvoje> environment accordingly.

It does.  I guess the next thing to try is

(set-coding-priority-list '(no-conversion))

by default.  This means that Mule elisp files will not be displayed
correctly by default in the C locale.  However, AFAIK all the Mule
elisp is maintained by Japanese, me, and Ben, so they'll have a
Japanese language environment by default or know what to do.

OK, done.

    Hrvoje> b) Set things up so that the LC_* settings really are
    Hrvoje> respected in most reasonable circumstances.

We already go out of our way to do exactly the opposite.  The
consistent reports of Dired bugs in European locales are due to the
fact that we don't break the legs of some of the LC_* settings.

The POSIX locale is just plain broken inside of Mule.  Emacs does too
much parsing of non-interactive subprocess output, assuming it's going
to get C locale output.  Ditto for lots of internal functions and
system call wrappers.  OTOH, in interactive process buffers, we want
what the user expects.  So the bottom line is that this is not going
to get fixed for a while, because we're basically limited to
heuristics about "what the user expects."  I doubt the non/interactive
distinction will be sufficient.

    Hrvoje>    For example, in a UTF-8 environment, Mule would treat
    Hrvoje> unknown input as UTF-8.

Bu-wha-ha-ha!  That means depending on Mule-UCS, which _we don't know
how to maintain_ and Himi doesn't seem to be maintaining.  Postpone
that specific case to 22.0.

    Hrvoje> Needless to say, ISO 2022 autodetection should be turned
    Hrvoje> off by default.

If that's needless to say, then you're definitely missing something.
First of all, it is ISO 2022 autodetection that detects all ISO 8859
coding systems.  I know that what you want is to turn off escape
sequences, but unfortunately that's not the way it works.  The closest
thing is the set-coding-priority-list hack.  But that means that
ISO-8859-X is also shut off, because the only no-conversion coding
system is iso-8859-1-unix and its aliases.

Second, if (in an 8-bit locale) you don't need the escape sequences,
then you don't need Mule.

    Hrvoje> c) Make especially sure that in single-byte language
    Hrvoje> environments (e.g. the "C" locale and iso-8859-* locales,
    Hrvoje> but not e.g. UTF-8) the conversions from external to
    Hrvoje> internal format and vice versa are reversible.

This is already the case for binary == ISO-8859-1.  I'm not sure
whether it can be done trivially for ISO-8859-X, X != 1.  (It _is_
also true for "true" ISO-8859-X files; as you know, the problem is the
escape processing.)  I've been postponing doing that in 21.5 until I
can put together a regression test for it.  Now that I understand
test-harness.el, this is high on my list.

    Hrvoje> d) Get rid of ISO 2022... use UTF-8 as the internal
    Hrvoje> representation...  Oops.

I hope to have that in 22.0 (experimental).  It's not that hard, and
Ben says we'll have Unicode font support, so even a naive
implementation can be efficient for most purposes.

However, this is completely irrelevant to your concerns.  The internal
representation really is internal (modulo autosave files, but those
aren't in internal representation either).

    Hrvoje> It can't work perfectly for everyone, but it could work
    Hrvoje> much more reasonably by default for most people.

Sure, but for these purposes I'm Japanese, remember.  In my own apps,
I'm happy with the defaults.  God be thanked for you and Jamie, I have
well-described bugs to work with.  But the only Mule programmers I
have available are the Japanese and me.  Getting the defaults right
in a way that doesn't just cause more breakage isn't easy.

    Hrvoje> I don't think what I proposed above really requires
    Hrvoje> rewriting Mule, except for the part about getting rid of
    Hrvoje> ISO 2022.

I was exaggerating.  But remember who is going to be doing that work,
at least in 21.4.  It's not Ben.  A lot of this is pretty major at my
current level of skills.

    Hrvoje> But remember that since Jamie is probably running in a
    Hrvoje> Latin 1 locale, and the `iso-8859-1' encoding is already
    Hrvoje> free of the ISO 2022 lossage, he'd be ok.

But he's not, and nothing you've suggested will change things for
Latin-1 locales.

Never-wrote-any-Code-That-Doesn't-Suck-ly y'rs

Steve

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
