Re: catting random bytes causes character craziness

Tuesday, 5 November 2002

        "Stephen J. Turnbull" <stephen(a)xemacs.org> writes:

...
 by default.  This means that Mule elisp files will not be displayed
 correctly by default in the C locale.  However, AFAIK all the Mule
 elisp is maintained by Japanese, me, and Ben, so they'll have a
 Japanese language environment by default or know what to do. 
Or, emacs-lisp mode could simply recognize the Mule elisp files and do
the right thing.  Catting them in a shell buffer wouldn't display
Japanese contents, sure, but that's a feature, not a bug.

...
     Hrvoje> For example, in a UTF-8 environment, Mule would treat
     Hrvoje> unknown input as UTF-8.

 Bu-wha-ha-ha!  That means depending on Mule-UCS, which _we don't know
 how to maintain_ and Himi doesn't seem to be maintaining. 
That was only an example, although a pretty important one, given that
Red Hat 8 defaults to UTF-8 European locales.

Doesn't Ben have a workspace with working Unicode?  Or is it too
alpha?

...
     Hrvoje> Needless to say, ISO 2022 autodetection should be turned
     Hrvoje> off by default.

 If that's needless to say, then you're definitely missing something.
 First of all, it is ISO 2022 autodetection that detects all ISO 8859
 coding systems. 
I'm not sure what you mean by this.  Why do "all ISO 8859 coding
systems" need special autodetection?  I'm arguing for all such
autodetection to be turned off.  Getting rid of ISO 2022 is orthogonal
to that effort, but is not contrary to it.

...
 But that means that ISO-8859-X is also shut off, because the only
 no-conversion coding system is iso-8859-1-unix and its aliases. 
In an iso-8859-2 locale, input bytes between 160 and 255 should be
considered Latin 2.
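
As a rough illustration outside Emacs, the sketch below uses Python's standard codecs as a stand-in for Mule's coding systems (Python is an assumption made for illustration; nothing in this thread uses it). It shows that the same high bytes map to different characters depending on which 8-bit charset the locale implies, which is why the locale has to pick the interpretation:

```python
# Illustration only: Python codecs stand in for Mule coding systems.
high_bytes = bytes(range(160, 256))  # the 96 byte values 0xA0..0xFF

as_latin1 = high_bytes.decode("iso8859_1")
as_latin2 = high_bytes.decode("iso8859_2")

# In either charset, each byte maps to exactly one character.
assert len(as_latin1) == len(as_latin2) == 96

# Byte 0xA1 (index 1 in high_bytes) is an inverted exclamation mark in
# Latin 1 but a capital A with ogonek in Latin 2.
byte_a1_latin1 = as_latin1[1]
byte_a1_latin2 = as_latin2[1]
print(hex(ord(byte_a1_latin1)), hex(ord(byte_a1_latin2)))
```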

...
 Second, if (in an 8-bit locale) you don't need the escape sequences,
 then you don't need Mule. 
That's not quite true.  I don't need escape sequences *by default*.
But for example, I want Latin 1 mail and news messages to be rendered
as Latin 1, and ditto for Japanese, etc.  Mail (MIME) is only one
example of a format that carries charset information with the message
stream; there are others.

...
     Hrvoje> c) Make especially sure that in single-byte language
     Hrvoje> environments (e.g. the "C" locale and iso-8859-* locales,
     Hrvoje> but not e.g. UTF-8) the conversions from external to
     Hrvoje> internal format and vice versa are reversible.

 This is already the case for binary == ISO-8859-1. 
Yup.  But that doesn't help non-ISO 8859-1 users.
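
The reversibility property in (c) can be stated as a small check. The following is a minimal sketch using Python's standard codecs, not Mule; the helper name `round_trips` is hypothetical. It only shows the target property (every byte survives external -> internal -> external); it says nothing about how Mule's escape processing currently behaves.

```python
def round_trips(codec: str) -> bool:
    """Return True if all 256 byte values survive decode followed by encode."""
    data = bytes(range(256))
    try:
        return data.decode(codec).encode(codec) == data
    except UnicodeError:
        # Some byte sequence is invalid or unmapped in this codec.
        return False


# Single-byte charsets are reversible here; UTF-8 is not, because
# arbitrary bytes are not necessarily valid UTF-8.
results = {name: round_trips(name) for name in ("iso8859_1", "iso8859_2", "utf_8")}
print(results)
```

This matches the exclusion of UTF-8 in the wish-list item: a multibyte encoding cannot promise a lossless round trip for random input unless invalid bytes are preserved some other way.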

...
 I'm not sure whether it can be done trivially for ISO-8859-X, X != 1.
 (It _is_ also true for "true" ISO-8859-X files; as you know, the
 problem is the escape processing.) 
As far as I can tell, the problem is that it's not easy to turn off
the escape processing.  A lot of internal stuff in Mule depends on
it.  But the a/b/c thing I wrote was a wish-list anyway.

...
 However, [using UTF-8 for internal representation] is completely
 irrelevant to your concerns. 
You're right.  It was just a "would be cool to have" item, but in no
way a requirement.

...
     Hrvoje> It can't work perfectly for everyone, but it could work
     Hrvoje> much more reasonably by default for most people.

 Sure, but for these purposes I'm Japanese, remember. 
True.  However, we needn't stick to the POSIX locale like crazy,
either.  In Japanese locale, we could enable auto-detection.  I have
no problem with "doing whatever the users expect, unless they expect
consistency."  Most users don't change their locales every day, after
all.
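
One way to picture the locale-driven defaults discussed in this message is the sketch below. It is an assumption-laden illustration in Python, not a proposal for actual Mule code: the function name `default_input_policy` and the returned dictionary shape are hypothetical, and the locale parsing is deliberately simplistic.

```python
def default_input_policy(locale_name: str) -> dict:
    """Hypothetical helper: map a POSIX-style locale name to an input policy."""
    # Split e.g. "hr_HR.ISO-8859-2@euro" into language part and charset part.
    base = locale_name.split("@", 1)[0]
    lang, _, charset = base.partition(".")
    charset = charset.lower().replace("-", "").replace("_", "")

    if lang.split("_")[0] == "ja":
        # Japanese users expect autodetection, so keep it on for them.
        return {"coding": None, "autodetect": True}
    if charset == "utf8":
        # UTF-8 environment: treat unknown input as UTF-8.
        return {"coding": "utf-8", "autodetect": False}
    if charset.startswith("iso8859"):
        # Single-byte locale: high bytes belong to that ISO 8859 part.
        part = charset[len("iso8859"):]
        return {"coding": "iso-8859-" + part, "autodetect": False}
    # "C", "POSIX", and anything unrecognized: treat bytes as binary (Latin 1).
    return {"coding": "iso-8859-1", "autodetect": False}


for name in ("C", "hr_HR.ISO-8859-2", "en_US.UTF-8", "ja_JP.eucJP"):
    print(name, default_input_policy(name))
```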

...
 God be thanked for you and Jamie, I have well-described bugs to work
 with. 
Yeah.  If you need a *very* specific one for the test harness, here's
one from me:

(Assert (> (length
            (decode-coding-string
             (encode-coding-string (string (make-char 'japanese-jisx0208 56 108))
                                   'iso-2022-jp)
             'iso-8859-2))
           1))

In other words, decoding an ISO 2022 string as "iso-8859-2" should not
produce a Japanese char.
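
For readers without an XEmacs build at hand, the expected behavior can be sketched with Python's standard codecs (a stand-in chosen for illustration, not part of the test harness). The byte stream is written out by hand on the assumption that ISO-2022-JP encodes the JIS X 0208 character (56, 108) as the escape sequence ESC $ B, the two bytes 56 and 108, and ESC ( B to return to ASCII:

```python
ESC = 0x1B

# Hand-built ISO-2022-JP stream: switch to JIS X 0208, one two-byte
# character with code (56, 108), then switch back to ASCII.
stream = bytes([ESC, ord("$"), ord("B"), 56, 108, ESC, ord("("), ord("B")])

as_iso2022jp = stream.decode("iso2022_jp")
as_latin2 = stream.decode("iso8859_2")

# Decoded as ISO-2022-JP, the escapes are interpreted: one character.
assert len(as_iso2022jp) == 1

# Decoded as a plain single-byte charset, every byte is its own
# character, escapes included, and none of them is Japanese.
assert len(as_latin2) == len(stream)
assert all(ord(ch) < 0x80 for ch in as_latin2)
```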
