>>>> "Hrvoje" == Hrvoje Niksic <hniksic(a)srce.hr> writes:
"Stephen J. Turnbull" <turnbull(a)sk.tsukuba.ac.jp> writes:
> In the current implementation of Mule, coding systems are a
> property of ... you can't really be sure. And their semantics
> are ... rather ad hoc.
Hrvoje> This looks like a major problem, in major need of fixing.
Agreed. The issue that is not yet clear in my mind is "Can `fixing'
be done by documentation or must we reimplement the Mule?" Olivier's
work is orthogonal to this problem, and I don't think my extent-based
coding systems really would solve it. So I don't know if documenting
the way it does work currently is good enough, and I also don't know
how to reimplement to avoid the problems. :-( The people who should
know are busy doing other stuff like re-porting CCL....
> In the particular case of Fmd5 and the uses of it that I know
> about, it probably doesn't need it;
Hrvoje> For the record: I *do* note that md5 checksums are useful
Hrvoje> for things other than the internal buffer representation.
Hrvoje> But I don't think that `md5' *function* should have that
Hrvoje> additional CODING-SYSTEM argument. Couldn't it pick the
Hrvoje> coding system from a generally useful variable, like
Hrvoje> coding-system-for-write? Am I making any sense?
You're making a lot of sense; but let's reframe the question. Do you
know that md5 checksums are *not* useful for any arbitrary
representation? If so, then with some care in picking the variable
for the external representation to be checksummed, we should be able
to eliminate that argument from Fmd5. Whether it would be a big
improvement, I'm not sure. See below.
Hrvoje> Well, this is true, but the case of `md5' calculating a
Hrvoje> digest is surely no harder than the case of `C-x C-s'
Hrvoje> saving the buffer, if you see what I mean...
Well, yes, I see what you mean, but it's not quite true. md5 needs to
know that what you want the checksum for is save-buffer; only then can
it use coding-system-for-write.
But if you're actually going to stuff it into a TCP/IP stream via a
process buffer, you need to use coding-system-for-process-output (or
whatever the appropriate variable is named). These may very well be
different; eg, if the TCP/IP stream mentioned is actually SMTP, you
may very well want to use an ISO-2022-7 coding system, while when you
save the buffer to a local file as a log you can use 8-bit ISO-8859-2
without silly escape sequences, because as you say "Trust me, there's
nobody here that doesn't speak ISO-8859-2." Fmd5 cares whether you're
using its value as a check on the log or on the SMTP transmission.
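
To make that concrete, here is a minimal sketch, in Python rather than
Emacs Lisp purely as a neutral illustration. The sample text is
hypothetical, and UTF-8 is only an assumed stand-in for "some other
external encoding" such as the 7-bit one used on the wire.

```python
import hashlib

# Hypothetical sample text; any string with non-ASCII Latin-2 letters will do.
text = "čšž"

def digest(s, coding):
    """MD5 of the bytes that S would have when written out in CODING."""
    return hashlib.md5(s.encode(coding)).hexdigest()

log_digest = digest(text, "iso8859_2")   # what the local log file would hold
wire_digest = digest(text, "utf-8")      # stand-in for the encoding on the wire

# The two digests only agree when the encoded bytes agree.
print(log_digest != wire_digest)
```

For pure ASCII text the two encodings produce the same bytes, so the
digests coincide; the problem only shows up once the encodings differ.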
In consequence, you need a special coding-system-for-md5 variable and
you'll see lots of
  (let ((coding-system-for-md5 coding-system-for-write))
    (md5 (current-buffer))
    (save-buffer))
or so I can imagine. Yuck. Or you could create a suite of
*-with-accompanying-md5-checksum functions which basically wrap
function `*' with the let above. Double yuck. Or you can have Fmd5
consult the existing coding-system variable that is most likely to be
the right one (presumably coding-system-for-write), and you'll only
see lets like the above in the other cases. I don't know how frequent
those are, and it will be easy to make a mistake and not so easy to
debug. Not to mention the weirdness of setting
coding-system-for-write in a process context.
It seems to me that in any case where external and internal encodings
differ, Fmd5 needs to know how you plan to use its value. In a world
where the external encodings are chosen by oracle, you need an
argument to Fmd5. You can either make it explicit, optional, and
default it as Jareth has done, or you can make it implicit and use
dynamic lets to access it.
Hrvoje> That would be cool. Even cooler would be to be able to
Hrvoje> specify the default latin* character set, so that when I
Hrvoje> encounter the byte 0xA9 anywhere in a file, I want to see it
Hrvoje> as Š, not as the copyright symbol. Sort of "trust me pal, I
Hrvoje> know it's latin2."
I suspect that the reason you can't do that in Mule (the documentation
indicates that you can, but it doesn't work[1] in my environment for some
reason, although it does work for Steve Baur) is cultural. Most Mule
users are Japanese; Japanese have to deal with three different
encodings of (basically) the same character set on a daily basis.
User setting is mistake-prone, so a set of heuristics is used for
autorecognition. These heuristics, based on the presence or absence
of ISO-2022 escape sequences and the presence or absence of certain
code points in the text, are quite good for the case of Japanese
... and absolutely useless for distinguishing among members of the
ISO-8859 family.
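
A small sketch of why, again in Python as a neutral illustration (the
sample bytes and the Japanese sample string are my own choices): the
same byte is valid in both Latin-1 and Latin-2, so there is nothing in
the data for a heuristic to detect, whereas 7-bit ISO-2022 Japanese
text announces itself with escape sequences.

```python
# One byte, two equally valid readings: nothing in the data says which is meant.
raw = bytes([0xA9])

as_latin1 = raw.decode("iso8859_1")   # copyright sign
as_latin2 = raw.decode("iso8859_2")   # capital S with caron

# By contrast, ISO-2022-JP output contains ESC bytes that a detector can look for.
jp = "日本".encode("iso2022_jp")
has_escape = b"\x1b" in jp

print(as_latin1, as_latin2, has_escape)
```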
So it is extremely rare for the typical user to set the coding system
by hand and have it fail; if there's still a bug, it's going to be
hard to find.
And of course none of this addresses the performance hit; but
Olivier's work does that.
Footnotes:
[1] OK, last time I tried was somewhere around 20.3-b24. Didn't
work. I should try again, harder.