[Bug: 21.5-b25] Problems with latin-unity and VM
Stephen J. Turnbull
stephen at xemacs.org
Wed Feb 7 08:16:07 EST 2007
Aidan Kehoe writes:
> That’s what the FSF have. We (XEmacs) don’t distinguish iso-8859-1
> and binary in your sense; they do.
Really? Interesting, I would have thought that would cause
interoperability problems, since XEmacs can't distinguish between
the coding system objects `iso-8859-1-unix' and `binary'.
> Neither version distinguishes ASCII and binary.
>> By definition, binary files consist only of octets and should be
>> written out as-is, without any attempt to change any octets.
[Aidan's explanation is perfectly correct; I've already written this
and it goes into somewhat more depth, so I'll send it. :-]
In fact, this statement makes no sense in normal use in either Emacs.
GNU Emacs has a concept of unibyte text (see the `string-as-multibyte'
and `string-as-unibyte' functions), and that is true binary, in the sense of
being an untyped sequence of octets. This historically has caused
them all manner of trouble, specifically the repeated regression of
the "\201 bug" that XEmacs has never had. (We had a different set of
troubles related to splitting a formerly unified type into two types.
Pay your money and take your choice; I'm happy with the XEmacs
approach.)
Otherwise, *all* buffers in both Emacsen are composed of *characters*,
which you may think of as being implemented as arbitrary nonnegative
integers. In GNU Emacs they actually are integers, in XEmacs the
integer and character types are disjoint (though often they are
implicitly coerced back and forth). Of course, there are
implementation restrictions: XEmacs characters are limited to 30 bits
(and in practice only 21 bits are used, IIRC), and there are "holes" in
the domain due to the way charset information is encoded into the integers.
It is true (just as in UTF-8) that internal characters that are ASCII
characters are represented in a single octet, by their ASCII codes.
However, C1 and Latin-1 characters are represented in *two* octets
(just as they would be in UTF-8, but the conversion is based on ISO
2022 rather than a "sane" algorithmic conversion). This means that
only ASCII buffers---which XEmacs doesn't actually know about---(and
GNU Emacs unibyte buffers) can be written out to files without
conversion. Since these three charsets exactly cover the range of
octets, binary is encoded using them.
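
To make the "three charsets exactly cover the range of octets" point
concrete, here is a rough model in Python (not Lisp, and not how either
Emacs implements it). The function names are made up for illustration,
and the 0x80 and 0xA0 boundaries assume the usual ASCII / C1 / Latin-1
split. The UTF-8 helper only backs up the analogy about one octet versus
two.

```python
# Illustrative model only: which charset would hold each octet of a file
# read as binary, following the description above. Names are hypothetical.

def charset_for_octet(octet: int) -> str:
    """Return the charset name that represents this octet in the buffer."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("not an octet: %r" % (octet,))
    if octet < 0x80:
        return "ascii"            # 0x00-0x7F
    if octet < 0xA0:
        return "control-1"        # 0x80-0x9F (C1 controls)
    return "latin-iso8859-1"      # 0xA0-0xFF


def utf8_width(octet: int) -> int:
    """Octets UTF-8 needs for the code point with this numeric value."""
    return len(chr(octet).encode("utf-8"))


# Every one of the 256 octets lands in exactly one of the three charsets.
covering = {charset_for_octet(o) for o in range(256)}
```

In this model only the "ascii" octets have a one-octet UTF-8 form; the
other two ranges need two octets, which mirrors the internal situation
described above.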
There's a second misconception implicit in your statement, which is
that the coding of the file is somehow reflected in the buffer. It is
not. Buffers are arrays (technically a gap array, but that's an
implementation detail) of characters with some associated data. Again
it may be useful to think of an Emacs buffer as containing a UTF-8
string, except perverted so that it's not Unicode (now are you
beginning to understand why we call it "Mule"?)
Among the associated data are a number of coding systems, but these
are *advisory* only. Almost all editing operations are conducted in
complete disregard for coding, and (IIRC) all I/O operations take a
coding system argument, which is normally defaulted to the value of
`buffer-file-coding-system'. However, it can be set explicitly (see
the function `universal-coding-system-argument'). It is this coding
system argument that determines the nature of the conversion on
output.
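
The "advisory default, explicit argument wins" behaviour can be sketched
as follows. This is a Python analogy under stated assumptions, not Emacs
code: the `Buffer` class, its `file_coding` attribute, and
`encode_for_write` are hypothetical names, and Python codec names stand
in for coding systems.

```python
from typing import Optional


class Buffer:
    """Toy buffer: characters plus an advisory file coding (hypothetical)."""

    def __init__(self, text: str, file_coding: str = "utf-8") -> None:
        self.text = text                # characters; no coding is baked in
        self.file_coding = file_coding  # advisory default, used only on I/O


def encode_for_write(buf: Buffer, coding: Optional[str] = None) -> bytes:
    """Encode the buffer for output.

    An explicit `coding` argument overrides the buffer's advisory default;
    the buffer contents themselves are never changed by the choice.
    """
    chosen = coding if coding is not None else buf.file_coding
    return buf.text.encode(chosen)


buf = Buffer("\u00e9", file_coding="latin-1")
default_bytes = encode_for_write(buf)            # uses the advisory default
explicit_bytes = encode_for_write(buf, "utf-8")  # explicit argument wins
```

The same characters produce different octets depending only on the
coding chosen at write time, which is the sense in which the coding is
not "reflected in the buffer".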
Unfortunately, many XEmacs coding systems are lossy; they will simply
substitute "~" for any character that they can't encode.
Specifically, that is true for binary. So binary also needs to be
protected by latin-unity.
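
A minimal sketch of why that protection matters, again as a Python
analogy rather than the actual XEmacs or latin-unity code. Both function
names are hypothetical, and "binary" is modelled here simply as "code
points below 256", which is an assumption made for the example.

```python
from typing import List


def lossy_encode_binary(text: str) -> bytes:
    """Encode like a lossy coding system: unencodable characters become '~'."""
    return bytes(ord(ch) if ord(ch) < 0x100 else ord("~") for ch in text)


def unencodable(text: str) -> List[str]:
    """Pre-write check in the spirit of latin-unity: list what would be lost."""
    return [ch for ch in text if ord(ch) >= 0x100]


safe = "caf\u00e9"       # every character maps to an octet
unsafe = "a\u0142b"      # U+0142 has no octet in this model

silently_damaged = lossy_encode_binary(unsafe)
problems = unencodable(unsafe)
```

Without the check, the write succeeds and the original character cannot
be recovered from the file; with it, the problem can be reported before
any data is lost.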
> But if the corresponding buffer contains characters that have no
> clear mapping to octets with the binary coding system, writing that
> buffer to disk will lose data.
This is a bug in the XEmacs coding system design.
> control-1 characters have a clear mapping to octets with the binary
> coding system. latin-unity not knowing about that mapping is the
> bug.
Right. There's a reason why latin-unity isn't in core. :-/
> In the correct course of events, VM will use MIME-encoding to
> generate a buffer where every character is a member of either
> ascii, control-1 or latin-iso8859-1, then write the buffer using
> 'binary.
Actually, it just reads in the mbox file, with the single exception,
to my knowledge, of an FCC.
> latin-unity will then look through the contents of the buffer, see
> that it can be encoded using 'binary, and all will be well.
:-)
> > -- all octets in binary files represent themselves.)
> All octets in binary files, when those files are read by XEmacs as
> binary, are represented by characters in the ascii, control-1 or
> latin-iso8859-1 character sets. This was a choice we made--an
> alternative would have been a combination of latin-jisx0201,
> control-1 and latin-iso8859-2. There are better reasons for the
> first choice, but it was still a choice.
The specific reason is that this choice is 100% upward compatible
from no-Mule. In fact, I believe that even the "font registry hack"
of using xemacs -font "-*-iso8859-2" works (i.e., binary files will be
displayed as Latin 2).