[21.5] Warn on data corruption when writing 'binary streams
Ben Wing
ben at 666.com
Thu Sep 30 21:56:02 EDT 2004
Hmm ...
[1] I really don't think the conversion should be getting delayed until
finalization. Finalization happens at a very touchy place in GC, and as
little as possible should be happening. Why isn't this happening at stream
close?
[2] I would suggest a flag, which when set forces an immediate error unless
gc_in_progress -- checking on this flag, and perhaps others, shouldn't be a
problem; I don't see why it is, and in any case after moving conversion out
of finalization this situation should occur less. Set this flag when saving
or doing anything else where inappropriate conversion is dangerous. [Perhaps
the flag should allow for error, warning, or no action? An error during
file loading would be very annoying, but a warning appropriate] BTW a flag,
and not always, because conversion happens in many situations, e.g. in
redisplay, when an error would be inappropriate.
[3] A last resort would be, rather than just a warning, set things up so
that an eval event is pushed, which brings up something that asks the user
what to do. But fix [1] and [2] instead.
I'll try to take a look at the code. At some point I may be able to
actually write code, but it'll take a while to get up and running again to
this point.
-----Original Message-----
From: Stephen J. Turnbull [mailto:turnbull at sk.tsukuba.ac.jp] On Behalf Of
Stephen J. Turnbull
Sent: Thursday, September 23, 2004 1:37 AM
To: xemacs-beta at xemacs.org; acs at xemacs.org
Cc: ben at xemacs.org
Subject: [21.5] Warn on data corruption when writing 'binary streams
Vin, this isn't ready for 21.4 yet---I'm just dipping my toes into this
little mudpuddle---but it's a very nasty design bug, and we should do
something by 21.4.17.
Mule users are probably aware that the binary coding system (aka
iso-8859-1) is likely to corrupt non-Latin-1 data without warning. I offer
the following patch for discussion, and for experienced Mule users I think
it offers a very substantial improvement over the current state of affairs.
After a little more experience I'll probably commit something like it
(generalized to some other coding
systems) to 21.5.
This patch buys us the warning, but I'm not sure it's correct. In
particular, I tried to arrange for an error, but that caused asserts because
the conversion functions always eventually get called in stream
finalization, which will typically occur in GC.
I tried guarding error calls by testing gc_in_progress, but that just
resulted in uninterpretable backtraces.
Handling the error correctly here is probably an awful lot of work, and I
actually think a redesign of coding streams is in order. Ben's notes
indicate he was thinking along the same line.
So the best we can do is write the stream and warn the user. If you manage
to get a buffer full of Japanese with a binary coding system (actually not
hard to do; just start XEmacs in POSIX locale and read in a Japanese file),
you lose everything but the newlines. But I don't see how to do the kind of
thing that latin-unity does, and walk the user through a recovery procedure,
with the current structure. (Of course we could emulate the coding systems
with a pre-check, as latin-unity does, but that's both inelegant and
unreliable if you get the emulation wrong, and doesn't help with
user-defined coding
systems.)
To see how it works, read in the HELLO file (C-h h), then use M-x
set-buffer-file-coding-system RET binary RET, edit, and save to a temp file
somewhere. With or without the patch, all non-ISO-8859-1 characters get
translated to '~' on write. Without the patch, it's completely silent.
With the patch, you get a humongous warning in your face.
Note that (as far as I can see) coding systems like Shift JIS are often even
worse, because they simply apply inappropriate algorithms or pass octets
unchanged or drop them when they encounter a charset they can't handle. ISO
2022 (including ISO 8859-N, N != 1) systems are safer in the sense that they
always can encode the characters using escapes, but if you need to work with
the file with a program other than Emacs, that might as well be corrupt.
Any wisdom on error handling in dangerous territory (GC, I/O) would be
greatly appreciated.
Index: src/ChangeLog
===================================================================
RCS file:
/Users/steve/Software/Repositories/cvs.xemacs.org/XEmacs/xemacs/src/ChangeLo
g,v
retrieving revision 1.743
diff -u -U0 -r1.743 ChangeLog
--- ChangeLog 2004/09/22 03:04:35 1.743
+++ ChangeLog 2004/09/23 08:00:50
@@ -0,0 +1,4 @@
+2004-09-23 Stephen J. Turnbull <stephen at xemacs.org>
+
+ * file-coding.c (no_conversion_convert): Warn on failed conversion.
+
Index: src/file-coding.c
===================================================================
RCS file:
/Users/steve/Software/Repositories/cvs.xemacs.org/XEmacs/xemacs/src/file-cod
ing.c,v
retrieving revision 1.40
diff -u -r1.40 file-coding.c
--- file-coding.c 2004/09/22 02:06:46 1.40
+++ file-coding.c 2004/09/23 01:32:17
@@ -3074,7 +3074,7 @@
/************************************************************************/
/* "No conversion"; used for binary files. We use quotes because there
- really is some conversion being applied (it does byte<->char
+ really is some conversion being applied (it does byte<->character
conversion), but it appears to the user as if the text is read in
without conversion.
@@ -3108,7 +3108,9 @@
}
else
{
-
+#ifdef MULE
+ Bytecount unconverted = 0; /* arg to emacs_sprintf, should be ok?
+*/ #endif
while (n--)
{
c = *src++;
@@ -3125,8 +3127,29 @@
c == LEADING_BYTE_CONTROL_1)
ch = c;
else
- /* #### This is just plain unacceptable. */
- Dynarr_add (dst, '~'); /* untranslatable character */
+ {
+ Lisp_Object name =
+ XCHARSET_SHORT_NAME (charset_by_leading_byte (c));
+
+ /* We'd like to error and force the user to do something,
+ but Fsignal will abort() in GC.
+ Simply conditioning on gc_in_progress failed; it's not
+ clear why, but eventually there is an abort. GCPROing
+ didn't help (but I haven't tried moving the list3 out
+ of the call into a variable and GCPROing that). */
+ if (0 == unconverted++)
+ /* A quick hack; we want to give the user information on
+ all unconvertible charsets present. Give her at
least
+ the first one. */
+ warn_when_safe_lispobj (Qtext_conversion_error,
+ Qerror,
+ list3 (build_msg_string (
+ "this type of stream can't encode that
charset"),
+ Qbinary,
+ name));
+ /* #### This is just plain unacceptable. */
+ Dynarr_add (dst, '~'); /* untranslatable character */
+ }
}
else
{
@@ -3144,6 +3167,25 @@
#endif /* MULE */
}
+#ifdef MULE
+ if (unconverted > 0)
+ {
+ CIbyte format[] =
+"%d inconvertible characters got written as '~'!\n"
+" You should change the coding system to one that can encode them,\n"
+" and retry the operation.\n" " A previous warning gave encoding and
+first unconvertible charset seen.";
+ CIbyte buffer[sizeof(format)+12]; /* 10 bytes for billions, one
for
+ terminator, one for good luck,
+ and throw in printf conversion
+ for good will */
+
+ emacs_sprintf (buffer, format, unconverted);
+ warn_when_safe (Qtext_conversion_error,
+ Qerror,
+ buffer);
+ }
+#endif
}
str->ch = ch;
--
Institute of Policy and Planning Sciences
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573
JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.
More information about the XEmacs-Beta
mailing list