>>>> "Yoshiki" == Yoshiki Hayashi
<t90553@m.ecc.u-tokyo.ac.jp> writes:
Yoshiki> "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp> writes:
> Being automatic is possible only when the user knows what he is
> doing. We need to be very careful that users do not get wedged
> into coding systems they don't know how to get out of. This is
> one of Hrvoje's prime complaints about Mule; it can and does
> destroy data because of coding-system wedging.
Yoshiki> Ben's idea is to autodetect every output/input so you
Yoshiki> won't end up in strange coding-system. I don't think
Yoshiki> your last statement is true since you can always repeat
Yoshiki> the command before in shell-mode.
First, Hrvoje's example is with respect to binary _files_. A paranoid
user will have multiple backups, in principle there need not be a
problem. But if you trust Mule, reading a binary file can, and often
does, result in a non-raw coding system due to autodetection. This
can definitely destroy data; I've seen it happen. If it can happen to
files, it will obviously be possible for volatile streams.
Second, the point of having a shell-mode is that the behavior of the
shell is volatile; you cannot count on repeating it.
Third, given that all 8-bit ISO-2022 codes have the same space, it is
quite possible for an unsuspecting user to end up in a "strange coding
system". Happens all the time on the (Japanese) Web, because you
never know when an EUC-JP page will link an ISO-8859-1 page. The
former are rarely correctly announced by the server, and the latter is
(unfortunately) allowed not to announce because it is the default.
(Fortunately, web browsers by their nature must do the buffering I
suggest.)
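To see concretely why the shared 8-bit space is a trap, here is a small illustration (plain Python, nothing to do with Mule internals): the very same bytes decode cleanly as both EUC-JP and Latin-1, so no byte-level check alone can distinguish them.

```python
# Illustration only: one byte sequence, two "valid" interpretations.
data = "日本語".encode("euc-jp")    # six bytes, all with the high bit set

as_euc = data.decode("euc-jp")      # the intended Japanese text
as_latin1 = data.decode("latin-1")  # also succeeds -- every byte is a
                                    # legal Latin-1 character (mojibake)

print(as_euc)      # 日本語
print(as_latin1)   # ÆüËÜ¸ì
```

This is exactly the EUC-JP page vs. ISO-8859-1 page situation on the Web: both decodings succeed, so only context (or an announcement) can settle it.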
Yoshiki> What we need is automatic detection and explicit
Yoshiki> specification of what coding-system to use.
I don't understand this. Looks like a contradiction to me, but I'm
sure I'm just missing your point.
> Remember, you can't do the equivalent of `C-x C-k RET C-u C-x
> C-f "file" RET "the-right-encoding" RET' on a terminal stream
> yet.
Yoshiki> Now we are discussing how to do that sensibly, aren't we?
Yoshiki> :-)
I thought we were discussing autodetection, not recovery from
autodetection failures?
Remember, the better the autodetection is, the more users trust it,
the less care they take, and the more surprised they are when it does
(inevitably) fail.
This is OK under the current regime, where Mule is an option. Ben
wants to make it a default. Then it is not OK. We need to think
about how to recover from failures.
> I think we should do something like buffer the first screenful,
> do autodetect on it, and `C-x C-m c' should (optionally?) offer
> a menu including coding systems and a line of sample text from
> the buffer to show the user what they are getting.
Yoshiki> This will fail if user accidentally output some amount of
Yoshiki> binary data.
Of course.
Yoshiki> And we need raw data to autodetect coding-system.
Of course.
Yoshiki> Text in the buffer is already code converted in most
Yoshiki> cases. What happens if iso-2022-jp, shift_jis and euc-jp
Yoshiki> are output in the same buffer?
Yoshiki> I think Ben's idea and my idea can coexist. This is a
Yoshiki> revised proposal.
Yoshiki> 1. Try to autodetect every input/output by resetting
Yoshiki> coding-system.
How do you define "every input/output"? Suppose the user does `cat
thisfile.euc thatfile.sjis' in a shell-mode?
Yoshiki> 2. If user specify explicitly what coding-system to use
Yoshiki> with C-x RET c, then use that. i.e. reset to that
Yoshiki> coding-system instead of auto-detection after every
Yoshiki> command.
Something more flexible is appropriate, I think. In particular, if
C-x C-m c is used to set the process coding system, then on
incompatible input (i.e., with a euc-jp default, the process sends a
high-bit-set/high-bit-clear pair of bytes) the autodetect mechanism
should still be used, but rather than setting the coding system it
should signal the user that the default is probably inappropriate (as
less does on encountering an apparently binary file).
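The "incompatible input" condition is easy to state concretely. As a rough sketch (my own illustration, not XEmacs code, and ignoring the SS2/SS3 escapes for simplicity): in EUC-JP a high-bit lead byte must be followed by another high-bit byte, so a high-bit-set/high-bit-clear pair is exactly the signal that the euc-jp default is wrong.

```python
def euc_jp_plausible(data: bytes) -> bool:
    """Rough EUC-JP sanity check: every high-bit lead byte must be
    followed by another high-bit byte.  (Deliberately ignores the
    SS2/SS3 details of half-width kana and JIS X 0212.)"""
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            if i + 1 >= len(data) or data[i + 1] < 0x80:
                return False        # high-bit-set/high-bit-clear pair
            i += 2
        else:
            i += 1
    return True
```

When this returns False, the mechanism described above would warn rather than silently re-decide the coding system.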
Yoshiki> 3. Implement a way to specify the coding-system used for
Yoshiki> only the next command. This can be the already existing
Yoshiki> command set-buffer-process-coding-system, since it will
Yoshiki> be reset after one command execution.
Be careful about backward compatibility here.
Yoshiki> 4. (Optional) Implement a way to change coding-system
Yoshiki> permanently.
I don't understand this.
By the way, I put forward an RFC a while ago concerning autodetection.
Ben wanted a revision, which I haven't been able to do yet. At the
time he generally approved, but that's no commitment on his part.
Still there are probably useful ideas here for you to work with. I'm
attaching first my message, then Ben's comments, without citation.
I have removed some mostly-irrelevant side comments; this was
submitted to a private CC group.
-------------------------------- my RFC --------------------------------
Let me give a formal proposal of what I would like to see in the
autodetection specification.
(1) Definitions
(a) *Autodetection* means detecting and making available to Mule
the external file's encoding. See (5), below. It doesn't
imply any specific actions based on that information.
(b) The *default* case is POSIX locale, and no environment
information in ~/.emacs.
N.B. This *will* cause breakage for all 1-byte users because
the default case can no longer assume Latin-1. You *may* be
able to use the TTY font or the Xt -font option to fake this,
and default to iso8859-1; I would hope that we would not use
such a kludge in the beta versions, although it might be
satisfactory for general use. In particular, encodings like
VISCII (Vietnamese) and I believe KOI-8 (Cyrillic) are not
ISO-2022-clean, but using C1 control characters as a heuristic
for detecting binary files is useful.
If we do allow it, I think that XEmacs should bitch and warn
that the practices of implicitly specifying language
environment by -font and defaulting on TTYs is deprecated and
likely to be obsoleted.
(c) The *European* case is any Latin-* locale, either implied by
setlocale() and friends or set in ~/.emacs. Latin-1 is
specifically not given precedence over other Latin-*, or
non-Latin or non-ISO-8859 for that matter. I suspect but am
not sure that this case extends to all ISO-8859 encodings, and
possibly to non-ISO-8859 single-byte encodings like KOI-8r (in
particular when combined in a class with ISO-8859 encodings).
(d) The *CJK* case is any CJK locale. Japanese is specifically
not given precedence over other Asian locales.
(e) For completeness, define the *Unicode* case (Unicode
unfortunately has lots of junk such as precomposed characters,
language tags, and directionality indicators in it; we
probably don't care yet, but we should also not claim
compliance) and the *general* case (which has a lot of
features similar to Unicode, but lacks the advantage of a
unified encoding). This proposal has no idea how to handle
the special features of these, or even if that matters. The
general case includes stuff that nobody here really knows how
it works, like Tibetan and Ethiopic.
Each of the following cases is given in the order of priority of
detection. I'm not sure I'm serious about the top priority given the
(optional) Unicode detection. This may be appropriate if Ben is
right that ISO-2022 is going to disappear, but possibly not until then
(two two-byte sequences out of 65536 is probably 1.99 too many). It
probably isn't too risky if (6)(c) is taken pretty seriously; a Unicode
file should contain _no_ private use characters unless the encoding is
explicitly specified, and that's a block of 1/10 of the code space,
which should help a lot in detecting binary files.
(2) Default locale
(a) Some Unicode (fixed width; maybe UTF-8, too?) may optionally
be detected by the byte-order-mark magic (if the first two
bytes are 0xFE 0xFF, the file is Unicode text; if 0xFF 0xFE,
it is wrong-endian Unicode; the UTF-8 signature, which is
byte-order independent, is 0xEF 0xBB 0xBF). This is probably
an optimization that should not be on by default yet.
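For concreteness, the byte-order-mark check might be sketched like this (illustrative only; the UTF-8 signature is the byte-order-independent 0xEF 0xBB 0xBF):

```python
def sniff_bom(head: bytes):
    """Return an encoding detected from a byte-order mark, or None."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"              # UTF-8 signature, no endianness
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"          # right-endian Unicode text
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"          # "wrong-endian" Unicode
    return None
```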
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets. This
means that many 7-bit ISO-2022 encodings would be detected
(eg, ISO-2022-JP), but EUC-JP and X Compound Text would not,
because they implicitly designate character sets.
N.B. Latin-1 will be detected as binary, as for any Latin-*.
N.B. An explicit ISO-2022 designation is semantically
equivalent to a Content-Type: header. It is more dangerous
because shorter, but I think we should recognize them by
default despite the slight risk; XEmacs is a text editor.
N.B. This is unlikely to be as dangerous as it looks at first
glance. Any file that includes an 8-bit-set byte before the
first valid designation should be detected as binary.
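A sketch of rule (b), using the common ISO-2022-JP designations as an example (illustrative only; a real detector would know the full designation grammar): accept the stream only if a valid designation appears before any 8-bit byte, which is also why the rule is less dangerous than it looks.

```python
# The four designations used by ISO-2022-JP (an assumption for this
# sketch; the real rule covers all explicit ISO-2022 designations).
DESIGNATIONS = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")

def iso2022_designated(data: bytes) -> bool:
    """True if an explicit designation occurs before any 8-bit byte --
    the condition under which (2)(b) detects ISO-2022 rather than
    falling through to the binary check."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return False            # 8-bit byte first: not 7-bit ISO-2022
        if b == 0x1B and any(data.startswith(d, i) for d in DESIGNATIONS):
            return True
    return False
```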
(c) Binary files will be detected (eg, presence of NULs, other
non-whitespace control characters, absurdly long lines, and
presence of bytes >127).
(d) Everything else is ASCII.
(e) Newlines will be detected in text files.
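Items (c) and (d) of the default case might be sketched as follows; the line-length threshold is an arbitrary choice of mine, not something the proposal fixes.

```python
def classify_default(data: bytes) -> str:
    """Default-locale classification per (2)(c)-(2)(d): NULs, other
    non-whitespace control characters, absurdly long lines, or any
    byte > 127 mean binary; everything else is ASCII."""
    if b"\x00" in data or any(b > 127 for b in data):
        return "binary"
    whitespace = (9, 10, 12, 13)                 # TAB, LF, FF, CR
    if any(b < 32 and b not in whitespace for b in data):
        return "binary"
    if any(len(line) > 1000 for line in data.split(b"\n")):
        return "binary"                          # "absurdly long" line
    return "ascii"
```

Note that, as the proposal warns, a Latin-1 file lands in the "binary" bucket here, since any byte above 127 disqualifies it in the default locale.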
(3) European locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of 1-byte character sets (eg,
'(Latin-1)) will be detected.
N.B. The reason for permitting a class is for cases like
Cyrillic where there are both ISO-8859 encodings and
incompatible encodings (KOI-8r) in common use. If you want to
write a Latin-1 v. Latin-2 detector, be my guest, but I don't
think it would be easy or accurate.
(d) Binary files will be detected per (2)(c), except that only
8-bit bytes out of the encoding's range imply binary.
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
(4) CJK locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of multi-byte and wide-character
encodings will be detected.
N.B. No 1-byte character sets (eg, Latin-1) will be detected.
The reason for a class is to allow the Japanese to let Mule do
the work of choosing EUC v. SJIS.
(d) Binary files will be detected per (3)(d).
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
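The point of the class in (4)(c) is that a detector can usually tell the members apart by byte patterns. A crude vote between EUC-JP and Shift_JIS (my own heuristic, not Mule's actual detector) might look like:

```python
def guess_cjk_class(data: bytes):
    """Crude EUC-JP vs Shift_JIS vote illustrating the class idea in
    (4)(c).  Ignores half-width kana and other corner cases."""
    euc = sjis = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1
            continue
        nxt = data[i + 1] if i + 1 < len(data) else 0
        if 0x81 <= b <= 0x9F:
            sjis += 1               # impossible as an EUC-JP lead byte
        elif 0xA1 <= b <= 0xFE and 0xA1 <= nxt <= 0xFE:
            euc += 1                # a well-formed EUC-JP pair
        elif 0x40 <= nxt <= 0x7E:
            sjis += 1               # high-bit lead + ASCII trail: Shift_JIS
        i += 2
    if euc == sjis == 0:
        return None                 # nothing to vote on (pure ASCII)
    return "euc-jp" if euc >= sjis else "shift_jis"
```

Real data makes the vote lopsided quickly, which is why letting Mule choose within a class is practical even though single pairs can be ambiguous.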
(5) Unicode and general locales; multilingual use
(a) Hopefully a system general enough to handle (2)--(4) will
handle these, too, but we should watch out for gotchas like
Unicode "plane 14" tags which (I think _both_ Ben and Olivier
will agree) have no place in the internal representation, and
thus must be treated as out-of-band control sequences. I
don't know if all such gotchas will be as easy to dispose of.
(b) An explicit coding system priority list will be provided to
allow multilingual users to autodetect both Shift JIS and Big
5, say, but this ability is not promised by Mule, since it
would involve (eg) heuristics like picking a set of code
points that are frequent in Shift JIS and uncommon in Big 5
and betting that a file containing many characters from that
set is Shift JIS.
(6) Relationship to decoding semantics
(a) Autodetection should be run on every input stream unless the
user explicitly disables it.
(b) The (conceptual) default procedure is
(i) Read the file into the buffer
(ii) Announce the result of autodetection to the user.
(iii) User may request decoding, with autodetected encoding(s)
given priority in a list of available encodings.
Optimizations (see (e) below) should avoid introducing data
corruption that this default procedure would avoid.
Obviously, it can't be perfect if any autodecoding is done;
users like Hrvoje should have an easily available option to
return to this default (or an optimized approximation which
doesn't actually read the whole file into a buffer) or simply
display everything as binary (with the "font" for binary files
being a user option).
(c) This implies that we should detect conditions in the tail of
the file which violate the implicit assumptions of the
autodetected coding system; such conditions (eg, illegal
UTF-8 sequences, including those corresponding to surrogates)
should raise a warning, and the buffer should probably be
made read-only and the user prompted.
This could be taken to extremes, like checking by table
whether all characters in a Japanese file are actually
legitimate JIS codes; that's insane (and would cause corporate
encodings to be recognized as binary). But we should think
about the idea that autodetection shouldn't mean XEmacs can't
change its mind.
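The UTF-8 example in (c) is cheap to check; in sketch form (illustrative, leaning on a strict decoder rather than a hand-written validator):

```python
def utf8_tail_violation(data: bytes) -> bool:
    """True if the data violates UTF-8's implicit assumptions --
    illegal sequences, including encoded surrogates -- the condition
    (6)(c) says should raise a warning."""
    try:
        data.decode("utf-8")        # strict mode rejects bad sequences
        return False
    except UnicodeDecodeError:
        return True
```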
(d) A flexible means should be provided for the user to delegate
the decision whether to decode (conditional on the result of
autodetection) to XEmacs or to a Lisp program (eg, via the
coding priority list and/or a file-coding-alist).
(e) Optimized operations (eg, the current lstreams) should be
provided, with the recognition that if they depend on sampling
the file they are risky.
(f) Mule should provide a reasonable set of default delegations
(as in (d) above) for as many locales as possible.
(7) Implementation
(a) I think all the decision logic suggested above can be
accomplished through a coding-priority-list and appropriate
initializations for different language environments, and a
file-coding-alist.
(b) Many of the tests on the file's tail shouldn't be very
expensive; in particular, all of the ones I've suggested are
O(n) although they might involve moderate-sized auxiliary
tables for efficiency (eg, 64kB for a single Unicode-oriented
test).
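The coding-priority-list idea in (7)(a) might be sketched as a driver that tries candidates in order, with a per-locale list (the names and list contents here are my own illustration, not the actual Mule variables):

```python
# Hypothetical per-locale priority lists for the (7)(a) sketch.
CODING_PRIORITY = {
    "ja_JP": ["iso-2022-jp", "euc-jp", "shift_jis", "binary"],
    "POSIX": ["ascii", "binary"],
}

def detect(data: bytes, locale: str) -> str:
    """Return the first candidate in the locale's priority list that
    decodes the data cleanly; "binary" is the always-valid fallback."""
    for coding in CODING_PRIORITY.get(locale, CODING_PRIORITY["POSIX"]):
        if coding == "binary":
            break                   # last resort: never fails
        try:
            data.decode(coding)
            return coding
        except UnicodeDecodeError:
            continue
    return "binary"
```

This also shows why initialization per language environment does most of the work: the same driver behaves very differently under the ja_JP and POSIX lists.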
Other comments:
It might be reasonable, given Hrvoje's objections, to require that any
autodetection that could cause data loss (any coding system that
involves escape sequences, and only those AFAIK: by design,
translation to Unicode is invertible) by default prompt the user in
the future (presumably with a novice-like ability to retain the
prompt, always default to binary, or always default to the
autodetected encoding), at least in locales that don't need it
(POSIX, Latin-any).
Ben thinks that we can remember the input data; I think it's going to
be hard to comprehensively test that a highly optimized version works.
Good design will help, but ISO-2022 is enormously complex, and there
are many encodings that violate even its lax assumptions. On the
other hand, memory is the only way to get non-rewindable streams right.
Hrvoje himself said he would like to have an XEmacs that distinguishes
between Latin-1 and Latin-2 text. Where it is possible to do that,
this is exactly what autodetection of ISO-2022 and Unicode gives you.
Many people would want that, even at some risk of binary corruption.
----------------------------- Ben's reply ------------------------------
I think it is a good start, and definitely moving in the direction I
would like to see things going. However, I have some suggestions for
cleaning this up:
You should try to make it more layered. For example, you might have one
section devoted to the workings of autodetection, which starts out like this
(the section numbers below are totally arbitrary):
Section 5
Autodetect() is a function whose arguments are (1) a readable stream, (2) some
hints indicating how the autodetection is to proceed, and (3) a value
indicating the maximum number of characters to examine at the beginning of the
stream. (Possibly, the value in (3) may be some special symbol indicating
that we only go as far as the next line, or a certain number of lines ahead;
this would be used as part of "continuous autodetection", e.g. we are decoding
the results of an interactive terminal session, where the user may
periodically switch encodings, line terminations, etc. as different programs
get run and/or telnet or similar sessions are entered into and exited.) We
assume the stream is rewindable; if not, insert a "rewinding" stream in front
of the non-rewinding stream; this kind of stream automatically buffers the
data as necessary.
[You can use pseudo-code terminology here. No need for straight C or ELisp.]
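Taking up the invitation to use pseudo-code, the interface described above might look like this (every name here is my guess at what Ben describes, with the detection body left as a stub):

```python
import io

def autodetect(stream, hints, limit):
    """Sketch of Ben's interface: examine up to `limit` bytes of a
    rewindable stream, guided by the locale-derived `hints`, and
    return (candidate encodings, rewound stream)."""
    if not stream.seekable():
        # The "rewinding" wrapper: buffer a non-rewindable stream.
        stream = io.BytesIO(stream.read())
    head = stream.read(limit)
    stream.seek(0)                  # rewind so the caller can decode
    candidates = []
    # ... per-encoding checks driven by hints would go here; a single
    # placeholder check stands in for the real logic:
    if head.startswith(b"\x1b"):
        candidates.append("iso-2022")
    return candidates, stream
```

The `limit` argument corresponds to the maximum-characters value in (3) above; the line-by-line variant for "continuous autodetection" would replace `read(limit)` with a bounded line read.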
[Then proceed to describe what the hints look like -- e.g. you could portray
it as a property list or whatever. The idea is that, for each locale, there
is a corresponding hints value that is used at least by default. The hints
structure also has to be set up to allow for two or more competing hints
specifications to be merged together. For example, the extension of a file
might provide an additional hint or hints about how to interpret the data of
that file, and the caller of autodetect(), when calling autodetect() on such a
file, would need to have a way of gracefully merging the default hints
corresponding to the locale with the more specific hints provided by the
extension. Furthermore, users like Hrvoje might well want to provide their
own hints to supplement and override parts of the generic hints -- e.g. "I
don't ever want to see non-European encodings decoded; treat them as binary
instead".]
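The merging described here (locale defaults, then file-extension hints, then user overrides) could be as simple as layered property-list updates; a sketch, since the real hints structure is left unspecified:

```python
def merge_hints(*layers):
    """Merge hints dictionaries; later layers override earlier ones,
    so a call site can write
    merge_hints(locale_hints, extension_hints, user_hints)."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Hypothetical example: a user override in the spirit of "treat
# non-European encodings as binary".
locale_hints = {"priority": ["euc-jp", "shift_jis"], "eol": "unix"}
user_hints = {"priority": ["binary"]}
merged = merge_hints(locale_hints, user_hints)
```

A real implementation would probably need per-key merge rules (e.g. prepending to a priority list rather than replacing it), but the layering idea is the same.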
[Then describe algorithmically how the autodetection works. First, you could
describe it more generally, i.e. presenting an algorithmic overview, then you
could discuss in detail exactly how autodetection of a particular type of
external encoding works -- e.g. "for iso2022, we first look for an escape
character, followed by a byte in this range [. ... .] etc."]
Section 6
This section describes the concept of a locale in XEmacs, and how it is
derived from the user's environment. A locale in XEmacs is a pair, a country
and a language, together determining the handling of locale-specific areas of
XEmacs. All locale-specific areas in XEmacs make use of this XEmacs locale,
and do not attempt to derive the locale from any other sources. The user is
free to change the current locale at any time; accessor and mutator functions
are provided to do this so that various locale-specific areas can optionally
be changed together with it.
[Then you describe how the XEmacs locale is extracted from .emacs, from
setlocale(), from the LANG environment variables, from -font, or wherever
else. All other sections assume this dirty work is done and never even
mention it]
Section 7
[Here you describe the default autodetect() hints value corresponding to each
possible locale. You should probably use a schematic description here, e.g.
an actual Lisp property list, liberally commented.]
Section 8 etc.
[Other sections cover anything I've missed. By being very careful to separate
out the layers, you simultaneously introduce more rigor (easier to catch bugs)
and make it easier for someone else to understand it completely.]
ben
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."