>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
Hrvoje> jareth(a)camelot-soft.com (P. E. Jareth Hein) writes:
> I put that in there for use with email systems or digital sigs.
> Visualize having a buffer in SJIS, (which usually is translated
There are no buffers in SJIS. There are only buffers in the Mule
internal encoding. Since SJIS is a complex encoding built from
several JIS standard character sets, as far as I can recall each character is
translated to the Mule internal coding according to the various JIS
character set leading bytes, and SJIS exists only in the
*-coding-system-* variables.
However, I believe that this implementation detail does not make the
feature unnecessary. The point being that the internal Mule encoding
isn't what the recipient gets, so checksumming the buffer as-is is
useless. Or worse, if the operation is an encryption.
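The point is quick to demonstrate in Python rather than Lisp (purely
illustrative; Python's codec names stand in for Mule coding systems):
the digest depends entirely on which external encoding goes into the
hash, so a checksum of the internal form tells the recipient nothing.

```python
import hashlib

# "nihongo" as a Unicode string -- the analogue of Mule's internal form.
text = "\u65e5\u672c\u8a9e"

# The two external representations produce unrelated digests, so a
# checksum of the internal buffer contents is useless on the wire.
sjis_digest = hashlib.md5(text.encode("shift_jis")).hexdigest()
jis_digest = hashlib.md5(text.encode("iso2022_jp")).hexdigest()
assert sjis_digest != jis_digest
```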
> to ISO-2022-JP for mailing) and wanting to put a digest on it.
> I figured that feature would come in handy...
>
> I am perfectly willing to be shouted down, however...
Hrvoje> No, the idea is fine. But is it the best thing to infest
Hrvoje> random primitives with CODING-SYSTEM arguments? Isn't
Hrvoje> there a more general variable that can be set? Please
Hrvoje> bear with me -- I know very little about coding systems
Hrvoje> and stuff, so maybe I'm just raving.
Sigh. You _should_ rave. Heaven knows, Erik Naggum does. In the
current implementation of Mule, coding systems are a property of
... you can't really be sure. And their semantics are ... rather ad
hoc.
In the particular case of Fmd5 and the uses of it that I know about,
it probably doesn't need it; proper care in implementation should have
Fmd5 use the output coding-system (but this must be specified by the
mailer, and most mailers do the translation to standard Internet mail
encodings as a post-processing step). But then, the original
implementor of md5 didn't consider that Steve Baur was going to use
XEmacs to md5 checksum packages' external representation; he thought
it would be OK as long as the buffer representation would be
checksummed correctly. So I wouldn't bet on it.
There are several problems here.
(1) Different XEmacs users want to display in different languages.
This problem is basically solved by flexible faces and 8/16-bit
clean buffers.
(2) Some XEmacs users want multi-lingual buffers.
The problem is that existing character codes have collisions at
some code points.
This problem is solved by mechanisms like X Compound Text and
ISO-2022.
Problems (1) and (2) are fairly well dealt with by the "more general
variables" you mention.
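The code-point collisions behind problem (2) are easy to see with any
codec library; in Python (illustration only, its ISO 8859 codecs
standing in for the character sets), the same byte is a legal
character in several sets, and a different character in each.

```python
# One byte, three ISO 8859 character sets, three different characters.
byte = b"\xe6"
latin1 = byte.decode("iso8859_1")    # Latin-1
cyrillic = byte.decode("iso8859_5")  # Cyrillic
greek = byte.decode("iso8859_7")     # Greek
assert len({latin1, cyrillic, greek}) == 3
```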
(3) Some languages have multiple encodings based on varying character
sets. This is of course true of the oriental ideographs, but is
also true closer to your (Hrvoje's) home: I'm told that there are
getting on for 20 character sets that can be (loosely) said to
implement Cyrillic, but they vary slightly in the characters they
contain and greatly in their assignment of code points.
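Point (3) in miniature, again in Python for illustration, with three
of the Cyrillic character sets in common use:

```python
# The same letter lands on a different code point in each character set.
zhe = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE
points = {name: zhe.encode(name) for name in ("koi8_r", "cp1251", "iso8859_5")}
# Three encodings, three different byte values for one character.
assert len(set(points.values())) == 3
```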
The problem is that the textual sources which you would like to
combine in a buffer often have varied encodings to start with. So
some choice of buffer representation and mechanisms for translation
are needed. Note that "textual source" not only means files that are
encoded differently: you may also have process and user input in yet
other encodings. These last are especially hairy, as it's very
difficult to predict what encodings will be used, and you cannot
seek forward to get enough input to make a good decision. (Cf. the
recent discussion about eterm.[1])
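The "choice of buffer representation" is exactly the
decode-everything-to-one-internal-form move. A minimal Python sketch
(illustrative; str plays the role of the Mule internal encoding):

```python
# Two sources arriving in different external encodings...
file_bytes = "Grüße".encode("iso8859_1")    # a Latin-1 file
proc_bytes = "Привет".encode("koi8_r")      # KOI8-R process output
# ...can share one buffer only after translation to a common internal form.
buffer_text = file_bytes.decode("iso8859_1") + " " + proc_bytes.decode("koi8_r")
```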
Mule coding systems are an abstraction that has grown over time to try
to handle the problems that have been observed implementing
multilingual text manipulation.
Since the same buffer may have multiple incompatibly-coded languages
(I do this every day) and multiple sources of text (shell buffers for
example), this really needs to be dealt with flexibly on every
operation that modifies buffers :-(
I have been vaguely thinking about a mechanism like the X selection
mechanism, where every text modification operation has a source and a
target. If we go in the direction of a UCS (a superset of Unicode[2])
representation for text, as Olivier Galibert and I would like to, then
all modifications (including as a special case output) of the buffer
or string would have UCS either as source or target, simplifying the
set of translations.
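That simplification can be sketched as a hub-and-spoke converter
(Python for illustration; the function name is mine, not anything in
Mule or the proposal):

```python
def convert(raw: bytes, source: str, target: str) -> bytes:
    """Translate between any two encodings via the UCS internal form."""
    ucs = raw.decode(source)    # source -> internal (UCS)
    return ucs.encode(target)   # internal -> target

# N encodings need N codecs to and from UCS, not N*N pairwise translators.
utf8 = convert("Šibenik".encode("iso8859_2"), "iso8859_2", "utf_8")
```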
This is more or less what current Mule implementations do, as well,
but instead of a standard set of fixed-width characters, Mule uses a
home-brewed internal multibyte representation. And as far as I can
tell, the ideas of source and target are not formally and explicitly
recognized.
The problems that we are currently discussing stem, in my opinion,
from the lack of a good extent abstraction in Mule. This means that
source and target encodings must be defined either as properties of
characters or as a property of a whole stream. The former is ugly,
time-consuming, and mistake-prone; the latter is simply not
fine-grained enough. So you end up with coding-system specifications
in operators. Yuck.
I've started[3] implementing an extent-based coding-system model. I'm
hoping that this will allow coding system stuff to be localized to a
few primitives, and in general hide coding systems from the user and
from higher-level code. On the other hand, the UCS representation
that Olivier is talking about allows round-trip conversion. This
means that in principle one should be able to take a buffer, with one
extent full of characters whose source encoding was mistaken, and
translate them in-place to the text that would result from the correct
specification of the source encoding, simply by changing a (for now
hypothetical) source-encoding specifier. It is possible to do this
kind of conversion in-place in current Mules, but it requires you to
know what lstreams are and how to use them. It's easier simply to
start over again from the original I/O operation....
Concretely, this would be like the current situation with no-mule
XEmacs where you can change a buffer displayed as junk to legible
Croatian by changing the face from a *-iso8859-1 to a *-iso8859-2
font.[4]
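The in-place retranslation has a familiar miniature in codec terms
(Python purely for illustration; "Nikšić" is just sample Croatian
text):

```python
original = "Nikšić"                     # proper ISO 8859-2 text
raw = original.encode("iso8859_2")      # the bytes that were actually read
garbled = raw.decode("iso8859_1")       # wrong source encoding assumed
# Correcting the (hypothetical) source-encoding specifier amounts to
# re-encoding with the mistaken coding system, then decoding with the
# right one -- possible because the round trip loses no bytes.
repaired = garbled.encode("iso8859_1").decode("iso8859_2")
assert repaired == original
```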
Note: Olivier has allowed that my extent-based implementation is
"interesting", and offered several useful comments, but otherwise
bears no responsibility for it.
Footnotes:
[1] Thread starting with a post by Glynn Clements
<13631.1811.130276.935080(a)cerise.sensei.co.uk>.
[2] Including UCS-4 for you nervous character chauvinists.
[3] This is pretty generous; basically I've got lisp-mnt.el
compatible shells of files with a few lonely defuns and defvars in
them at this point, nothing that can even be called proof-of-concept.
[4] The implementation would be more complicated, as the actual
characters in the buffer need to be translated in the scheme I'm
talking about. But such complicated actions are not unusual when you
change a specifier, eg a glyph; the necessary hooks are in place, I
think.