>>>> "Yoshiki" == Yoshiki Hayashi
<t90553@m.ecc.u-tokyo.ac.jp> writes:
Yoshiki> "Stephen J. Turnbull" <turnbull@sk.tsukuba.ac.jp> writes:
> Being automatic is possible only when the user knows what he is
> doing. We need to be very careful that users do not get wedged
> into coding systems they don't know how to get out of. This is
> one of Hrvoje's prime complaints about Mule; it can and does
> destroy data because of coding-system wedging.
Yoshiki> Ben's idea is to autodetect every output/input so you
Yoshiki> won't end up in strange coding-system. I don't think
Yoshiki> your last statement is true since you can always repeat
Yoshiki> the command before in shell-mode.
First, Hrvoje's example is with respect to binary _files_. A paranoid
user will have multiple backups, in principle there need not be a
problem. But if you trust Mule, reading a binary file can, and often
does, result in a non-raw coding system due to autodetection. This
can definitely destroy data; I've seen it happen. If it can happen to
files, it will obviously be possible for volatile streams.
Second, the point of having a shell-mode is that the behavior of the
shell is volatile; you cannot count on repeating it.
Third, given that all 8-bit ISO-2022 codes have the same space, it is
quite possible for an unsuspecting user to end up in a "strange coding
system". Happens all the time on the (Japanese) Web, because you
never know when an EUC-JP page will link an ISO-8859-1 page. The
former are rarely correctly announced by the server, and the latter is
(unfortunately) allowed not to announce because it is the default.
(Fortunately, web browsers by their nature must do the buffering I
suggest.)
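To see concretely why the shared 8-bit space is a trap, here is a small illustration (plain Python, nothing to do with Mule internals): the very same bytes decode cleanly as both EUC-JP and Latin-1, so no byte-level check alone can distinguish them.

```python
# Illustration only: one byte sequence, two "valid" interpretations.
data = "日本語".encode("euc-jp")    # six bytes, all with the high bit set

as_euc = data.decode("euc-jp")      # the intended Japanese text
as_latin1 = data.decode("latin-1")  # also succeeds -- every byte is a
                                    # legal Latin-1 character (mojibake)

print(as_euc)      # 日本語
print(as_latin1)   # ÆüËÜ¸ì
```

This is exactly the EUC-JP page vs. ISO-8859-1 page situation on the Web: both decodings succeed, so only context (or an announcement) can settle it.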
Yoshiki> What we need is automatic detection and explicit
Yoshiki> specification of what coding-system to use.
I don't understand this. Looks like a contradiction to me, but I'm
sure I'm just missing your point.
> Remember, you can't do the equivalent of `C-x C-k RET C-u C-x
> C-f "file" RET "the-right-encoding" RET' on a terminal stream
> yet.
Yoshiki> Now we are discussing how to do that sensibly, aren't we?
Yoshiki> :-)
I thought we were discussing autodetection, not recovery from
autodetection failures?
Remember, the better the autodetection is, the more users trust it,
the less care they take, and the more surprised they are when it does
(inevitably) fail.
This is OK under the current regime, where Mule is an option. Ben
wants to make it a default. Then it is not OK. We need to think
about how to recover from failures.
> I think we should do something like buffer the first screenful,
> do autodetect on it, and `C-x C-m c' should (optionally?) offer
> a menu including coding systems and a line of sample text from
> the buffer to show the user what they are getting.
Yoshiki> This will fail if user accidentally output some amount of
Yoshiki> binary data.
Of course.
Yoshiki> And we need raw data to autodetect coding-system.
Of course.
Yoshiki> Text in the buffer is already code converted in most
Yoshiki> cases. What happens if iso-2022-jp, shift_jis and euc-jp
Yoshiki> are output in the same buffer?
Yoshiki> I think Ben's idea and my idea can coexist. This is a
Yoshiki> revised proposal.
Yoshiki> 1. Try to autodetect every input/output by resetting
Yoshiki> coding-system.
How do you define "every input/output"? Suppose the user does `cat
thisfile.euc thatfile.sjis' in a shell-mode?
Yoshiki> 2. If user specify explicitly what coding-system to use
Yoshiki> with C-x RET c, then use that. i.e. reset to that
Yoshiki> coding-system instead of auto-detection after every
Yoshiki> command.
Something more flexible is appropriate, I think. In particular, if
C-x C-m c is used to set the process coding system, then on
incompatible input (i.e., with a euc-jp default, the process sends a
high-bit-set/high-bit-clear pair of bytes) the autodetect mechanism
should still be used, but rather than setting the coding system it
should signal the user that the default is probably inappropriate (as
less does on encountering an apparently binary file).
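The "incompatible input" condition is easy to state concretely. As a rough sketch (my own illustration, not XEmacs code, and ignoring the SS2/SS3 escapes for simplicity): in EUC-JP a high-bit lead byte must be followed by another high-bit byte, so a high-bit-set/high-bit-clear pair is exactly the signal that the euc-jp default is wrong.

```python
def euc_jp_plausible(data: bytes) -> bool:
    """Rough EUC-JP sanity check: every high-bit lead byte must be
    followed by another high-bit byte.  (Deliberately ignores the
    SS2/SS3 details of half-width kana and JIS X 0212.)"""
    i = 0
    while i < len(data):
        if data[i] >= 0x80:
            if i + 1 >= len(data) or data[i + 1] < 0x80:
                return False        # high-bit-set/high-bit-clear pair
            i += 2
        else:
            i += 1
    return True
```

When this returns False, the mechanism described above would warn rather than silently re-decide the coding system.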
Yoshiki> 3. Implement a way to specify the coding-system used for
Yoshiki> only the next command. This can be the already existing
Yoshiki> command set-buffer-process-coding-system, since it will
Yoshiki> be reset after one command execution.
Be careful about backward compatibility here.
Yoshiki> 4. (Optional) Implement a way to change coding-system
Yoshiki> permanently.
I don't understand this.
By the way, I put forward an RFC a while ago concerning autodetection.
Ben wanted a revision, which I haven't been able to do yet. At the
time he generally approved, but that's no commitment on his part.
Still there are probably useful ideas here for you to work with. I'm
attaching first my message, then Ben's comments, without citation.
I have removed some mostly-irrelevant side comments; this was
submitted to a private CC group.
-------------------------------- my RFC --------------------------------
Let me give a formal proposal of what I would like to see in the
autodetection specification.
(1) Definitions
(a) *Autodetection* means detecting and making available to Mule
the external file's encoding. See (5), below. It doesn't
imply any specific actions based on that information.
(b) The *default* case is POSIX locale, and no environment
information in ~/.emacs.
N.B. This *will* cause breakage for all 1-byte users because
the default case can no longer assume Latin-1. You *may* be
able to use the TTY font or the Xt -font option to fake this,
and default to iso8859-1; I would hope that we would not use
such a kludge in the beta versions, although it might be
satisfactory for general use. In particular, encodings like
VISCII (Vietnamese) and I believe KOI-8 (Cyrillic) are not
ISO-2022-clean, but using C1 control characters as a heuristic
for detecting binary files is useful.
If we do allow it, I think that XEmacs should bitch and warn
that the practices of implicitly specifying language
environment by -font and defaulting on TTYs is deprecated and
likely to be obsoleted.
(c) The *European* case is any Latin-* locale, either implied by
setlocale() and friends or set in ~/.emacs. Latin-1 is
specifically not given precedence over other Latin-*, or
non-Latin or non-ISO-8859 for that matter. I suspect but am
not sure that this case extends to all ISO-8859 encodings, and
possibly to non-ISO-8859 single-byte encodings like KOI-8r (in
particular when combined in a class with ISO-8859 encodings).
(d) The *CJK* case is any CJK locale. Japanese is specifically
not given precedence over other Asian locales.
(e) For completeness, define the *Unicode* case (Unicode
unfortunately has lots of junk such as precomposed characters,
language tags, and directionality indicators in it; we
probably don't care yet, but we should also not claim
compliance) and the *general* case (which has a lot of
features similar to Unicode, but lacks the advantage of a
unified encoding). This proposal has no idea how to handle
the special features of these, or even if that matters. The
general case includes stuff that nobody here really knows how
it works, like Tibetan and Ethiopic.
Each of the following cases is given in the order of priority of
detection. I'm not sure I'm serious about the top priority given the
(optional) Unicode detection. This may be appropriate if Ben is
right that ISO-2022 is going to disappear, but possibly not until then
(two two-byte sequences out of 65536 is probably 1.99 too many). It
probably isn't too risky if (6)(c) is taken pretty seriously; a Unicode
file should contain _no_ private use characters unless the encoding is
explicitly specified, and that's a block of 1/10 of the code space,
which should help a lot in detecting binary files.
(2) Default locale
(a) Some Unicode (fixed width; maybe UTF-8, too?) may optionally
be detected by the byte-order-mark magic (if the first two
bytes are 0xFE 0xFF, the file is Unicode text; if 0xFF 0xFE,
it is wrong-endian Unicode; the UTF-8 signature, which is
byte-order independent, is 0xEF 0xBB 0xBF). This is probably
an optimization that should not be on by default yet.
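For concreteness, the byte-order-mark check might be sketched like this (illustrative only; the UTF-8 signature is the byte-order-independent 0xEF 0xBB 0xBF):

```python
def sniff_bom(head: bytes):
    """Return an encoding detected from a byte-order mark, or None."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"              # UTF-8 signature, no endianness
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"          # right-endian Unicode text
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"          # "wrong-endian" Unicode
    return None
```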
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets. This
means that many 7-bit ISO-2022 encodings would be detected
(eg, ISO-2022-JP), but EUC-JP and X Compound Text would not,
because they implicitly designate character sets.
N.B. Latin-1 will be detected as binary, as for any Latin-*.
N.B. An explicit ISO-2022 designation is semantically
equivalent to a Content-Type: header. It is more dangerous
because shorter, but I think we should recognize them by
default despite the slight risk; XEmacs is a text editor.
N.B. This is unlikely to be as dangerous as it looks at first
glance. Any file that includes an 8-bit-set byte before the
first valid designation should be detected as binary.
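A sketch of rule (b), using the common ISO-2022-JP designations as an example (illustrative only; a real detector would know the full designation grammar): accept the stream only if a valid designation appears before any 8-bit byte, which is also why the rule is less dangerous than it looks.

```python
# The four designations used by ISO-2022-JP (an assumption for this
# sketch; the real rule covers all explicit ISO-2022 designations).
DESIGNATIONS = (b"\x1b$@", b"\x1b$B", b"\x1b(B", b"\x1b(J")

def iso2022_designated(data: bytes) -> bool:
    """True if an explicit designation occurs before any 8-bit byte --
    the condition under which (2)(b) detects ISO-2022 rather than
    falling through to the binary check."""
    for i, b in enumerate(data):
        if b >= 0x80:
            return False            # 8-bit byte first: not 7-bit ISO-2022
        if b == 0x1B and any(data.startswith(d, i) for d in DESIGNATIONS):
            return True
    return False
```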
(c) Binary files will be detected (eg, presence of NULs, other
non-whitespace control characters, absurdly long lines, and
presence of bytes >127).
(d) Everything else is ASCII.
(e) Newlines will be detected in text files.
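Items (c) and (d) of the default case might be sketched as follows; the line-length threshold is an arbitrary choice of mine, not something the proposal fixes.

```python
def classify_default(data: bytes) -> str:
    """Default-locale classification per (2)(c)-(2)(d): NULs, other
    non-whitespace control characters, absurdly long lines, or any
    byte > 127 mean binary; everything else is ASCII."""
    if b"\x00" in data or any(b > 127 for b in data):
        return "binary"
    whitespace = (9, 10, 12, 13)                 # TAB, LF, FF, CR
    if any(b < 32 and b not in whitespace for b in data):
        return "binary"
    if any(len(line) > 1000 for line in data.split(b"\n")):
        return "binary"                          # "absurdly long" line
    return "ascii"
```

Note that, as the proposal warns, a Latin-1 file lands in the "binary" bucket here, since any byte above 127 disqualifies it in the default locale.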
(3) European locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of 1-byte character sets (eg,
'(Latin-1)) will be detected.
N.B. The reason for permitting a class is for cases like
Cyrillic where there are both ISO-8859 encodings and
incompatible encodings (KOI-8r) in common use. If you want to
write a Latin-1 v. Latin-2 detector, be my guest, but I don't
think it would be easy or accurate.
(d) Binary files will be detected per (2)(c), except that only
8-bit bytes out of the encoding's range imply binary.
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
(4) CJK locales
(a) Unicode may optionally be detected by the byte-order-mark
magic.
(b) ISO-2022 encodings will be detected as long as they use
explicit designation of all non-ASCII character sets.
(c) A locale-specific class of multi-byte and wide-character
encodings will be detected.
N.B. No 1-byte character sets (eg, Latin-1) will be detected.
The reason for a class is to allow the Japanese to let Mule do
the work of choosing EUC v. SJIS.
(d) Binary files will be detected per (3)(d).
(e) Everything else is ASCII.
(f) Newlines will be detected in text files.
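The point of the class in (4)(c) is that a detector can usually tell the members apart by byte patterns. A crude vote between EUC-JP and Shift_JIS (my own heuristic, not Mule's actual detector) might look like:

```python
def guess_cjk_class(data: bytes):
    """Crude EUC-JP vs Shift_JIS vote illustrating the class idea in
    (4)(c).  Ignores half-width kana and other corner cases."""
    euc = sjis = 0
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1
            continue
        nxt = data[i + 1] if i + 1 < len(data) else 0
        if 0x81 <= b <= 0x9F:
            sjis += 1               # impossible as an EUC-JP lead byte
        elif 0xA1 <= b <= 0xFE and 0xA1 <= nxt <= 0xFE:
            euc += 1                # a well-formed EUC-JP pair
        elif 0x40 <= nxt <= 0x7E:
            sjis += 1               # high-bit lead + ASCII trail: Shift_JIS
        i += 2
    if euc == sjis == 0:
        return None                 # nothing to vote on (pure ASCII)
    return "euc-jp" if euc >= sjis else "shift_jis"
```

Real data makes the vote lopsided quickly, which is why letting Mule choose within a class is practical even though single pairs can be ambiguous.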
(5) Unicode and general locales; multilingual use
(a) Hopefully a system general enough to handle (2)--(4) will
handle these, too, but we should watch out for gotchas like
Unicode "plane 14" tags which (I think _both_ Ben and Olivier
will agree) have no place in the internal representation, and
thus must be treated as out-of-band control sequences. I
don't know if all such gotchas will be as easy to dispose of.
(b) An explicit coding system priority list will be provided to
allow multilingual users to autodetect both Shift JIS and Big
5, say, but this ability is not promised by Mule, since it
would involve (eg) heuristics like picking a set of code
points that are frequent in Shift JIS and uncommon in Big 5
and betting that a file containing many characters from that
set is Shift JIS.
(6) Relationship to decoding semantics
(a) Autodetection should be run on every input stream unless the
user explicitly disables it.
(b) The (conceptual) default procedure is
(i) Read the file into the buffer
(ii) Announce the result of autodetection to the user.
(iii) User may request decoding, with autodetected encoding(s)
given priority in a list of available encodings.
Optimizations (see (e) below) should avoid introducing data
corruption that this default procedure would avoid.
Obviously, it can't be perfect if any autodecoding is done;
users like Hrvoje should have an easily available option to
return to this default (or an optimized approximation which
doesn't actually read the whole file into a buffer) or simply
display everything as binary (with the "font" for binary files
being a user option).
(c) This implies that we should detect conditions in the tail of
the file which violate the implicit assumptions of the
autodetected coding system; such conditions (eg, illegal
UTF-8 sequences, including those corresponding to surrogates)
should raise a warning, and the buffer should probably be
made read-only and the user prompted.
This could be taken to extremes, like checking by table
whether all characters in a Japanese file are actually
legitimate JIS codes; that's insane (and would cause corporate
encodings to be recognized as binary). But we should think
about the idea that autodetection shouldn't mean XEmacs can't
change its mind.
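The UTF-8 example in (c) is cheap to check; in sketch form (illustrative, leaning on a strict decoder rather than a hand-written validator):

```python
def utf8_tail_violation(data: bytes) -> bool:
    """True if the data violates UTF-8's implicit assumptions --
    illegal sequences, including encoded surrogates -- the condition
    (6)(c) says should raise a warning."""
    try:
        data.decode("utf-8")        # strict mode rejects bad sequences
        return False
    except UnicodeDecodeError:
        return True
```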
(d) A flexible means should be provided for the user to delegate
the decision whether to decode (conditional on the result of
autodetection) to XEmacs or to a Lisp program (eg, via the
coding priority list and/or a file-coding-alist).
(e) Optimized operations (eg, the current lstreams) should be
provided, with the recognition that if they depend on sampling
the file they are risky.
(f) Mule should provide a reasonable set of default delegations
(as in (d) above) for as many locales as possible.
(7) Implementation
(a) I think all the decision logic suggested above can be
accomplished through a coding-priority-list and appropriate
initializations for different language environments, and a
file-coding-alist.
(b) Many of the tests on the file's tail shouldn't be very
expensive; in particular, all of the ones I've suggested are
O(n) although they might involve moderate-sized auxiliary
tables for efficiency (eg, 64kB for a single Unicode-oriented
test).
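The coding-priority-list idea in (7)(a) might be sketched as a driver that tries candidates in order, with a per-locale list (the names and list contents here are my own illustration, not the actual Mule variables):

```python
# Hypothetical per-locale priority lists for the (7)(a) sketch.
CODING_PRIORITY = {
    "ja_JP": ["iso-2022-jp", "euc-jp", "shift_jis", "binary"],
    "POSIX": ["ascii", "binary"],
}

def detect(data: bytes, locale: str) -> str:
    """Return the first candidate in the locale's priority list that
    decodes the data cleanly; "binary" is the always-valid fallback."""
    for coding in CODING_PRIORITY.get(locale, CODING_PRIORITY["POSIX"]):
        if coding == "binary":
            break                   # last resort: never fails
        try:
            data.decode(coding)
            return coding
        except UnicodeDecodeError:
            continue
    return "binary"
```

This also shows why initialization per language environment does most of the work: the same driver behaves very differently under the ja_JP and POSIX lists.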
Other comments:
It might be reasonable, given Hrvoje's objections, to require that any
autodetection that could cause data loss (any coding system that
involves escape sequences, and only those AFAIK: by design,
translation to Unicode is invertible) by default prompt the user in
the future (presumably with a novice-like ability to retain the
prompt, always default to binary, or always default to the
autodetected encoding), at least in locales that don't need it
(POSIX, Latin-any).
Ben thinks that we can remember the input data; I think it's going to
be hard to comprehensively test that a highly optimized version works.
Good design will help, but ISO-2022 is enormously complex, and there
are many encodings that violate even its lax assumptions. On the
other hand, memory is the only way to get non-rewindable streams right.
Hrvoje himself said he would like to have an XEmacs that distinguishes
between Latin-1 and Latin-2 text. Where it is possible to do that,
this is exactly what autodetection of ISO-2022 and Unicode gives you.
Many people would want that, even at some risk of binary corruption.
----------------------------- Ben's reply ------------------------------
I think it is a good start, and definitely moving in the direction I
would like to see things going. However, I have some suggestions for
cleaning this up:
You should try to make it more layered. For example, you might have one
section devoted to the workings of autodetection, which starts out like this
(the section numbers below are totally arbitrary):
Section 5
Autodetect() is a function whose arguments are (1) a readable stream, (2) some
hints indicating how the autodetection is to proceed, and (3) a value
indicating the maximum number of characters to examine at the beginning of the
stream. (Possibly, the value in (3) may be some special symbol indicating
that we only go as far as the next line, or a certain number of lines ahead;
this would be used as part of "continuous autodetection", e.g. we are decoding
the results of an interactive terminal session, where the user may
periodically switch encodings, line terminations, etc. as different programs
get run and/or telnet or similar sessions are entered into and exited.) We
assume the stream is rewindable; if not, insert a "rewinding" stream in front
of the non-rewinding stream; this kind of stream automatically buffers the
data as necessary.
[You can use pseudo-code terminology here. No need for straight C or ELisp.]
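Taking up the invitation to use pseudo-code, the interface described above might look like this (every name here is my guess at what Ben describes, with the detection body left as a stub):

```python
import io

def autodetect(stream, hints, limit):
    """Sketch of Ben's interface: examine up to `limit` bytes of a
    rewindable stream, guided by the locale-derived `hints`, and
    return (candidate encodings, rewound stream)."""
    if not stream.seekable():
        # The "rewinding" wrapper: buffer a non-rewindable stream.
        stream = io.BytesIO(stream.read())
    head = stream.read(limit)
    stream.seek(0)                  # rewind so the caller can decode
    candidates = []
    # ... per-encoding checks driven by hints would go here; a single
    # placeholder check stands in for the real logic:
    if head.startswith(b"\x1b"):
        candidates.append("iso-2022")
    return candidates, stream
```

The `limit` argument corresponds to the maximum-characters value in (3) above; the line-by-line variant for "continuous autodetection" would replace `read(limit)` with a bounded line read.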
[Then proceed to describe what the hints look like -- e.g. you could portray
it as a property list or whatever. The idea is that, for each locale, there
is a corresponding hints value that is used at least by default. The hints
structure also has to be set up to allow for two or more competing hints
specifications to be merged together. For example, the extension of a file
might provide an additional hint or hints about how to interpret the data of
that file, and the caller of autodetect(), when calling autodetect() on such a
file, would need to have a way of gracefully merging the default hints
corresponding to the locale with the more specific hints provided by the
extension. Furthermore, users like Hrvoje might well want to provide their
own hints to supplement and override parts of the generic hints -- e.g. "I
don't ever want to see non-European encodings decoded; treat them as binary
instead".]
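The merging described here (locale defaults, then file-extension hints, then user overrides) could be as simple as layered property-list updates; a sketch, since the real hints structure is left unspecified:

```python
def merge_hints(*layers):
    """Merge hints dictionaries; later layers override earlier ones,
    so a call site can write
    merge_hints(locale_hints, extension_hints, user_hints)."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Hypothetical example: a user override in the spirit of "treat
# non-European encodings as binary".
locale_hints = {"priority": ["euc-jp", "shift_jis"], "eol": "unix"}
user_hints = {"priority": ["binary"]}
merged = merge_hints(locale_hints, user_hints)
```

A real implementation would probably need per-key merge rules (e.g. prepending to a priority list rather than replacing it), but the layering idea is the same.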
[Then describe algorithmically how the autodetection works. First, you could
describe it more generally, i.e. presenting an algorithmic overview, then you
could discuss in detail exactly how autodetection of a particular type of
external encoding works -- e.g. "for iso2022, we first look for an escape
character, followed by a byte in this range [. ... .] etc."]
Section 6
This section describes the concept of a locale in XEmacs, and how it is
derived from the user's environment. A locale in XEmacs is a pair, a country
and a language, together determining the handling of locale-specific areas of
XEmacs. All locale-specific areas in XEmacs make use of this XEmacs locale,
and do not attempt to derive the locale from any other sources. The user is
free to change the current locale at any time; accessor and mutator functions
are provided to do this so that various locale-specific areas can optionally
be changed together with it.
[Then you describe how the XEmacs locale is extracted from .emacs, from
setlocale(), from the LANG environment variables, from -font, or wherever
else. All other sections assume this dirty work is done and never even
mention it]
Section 7
[Here you describe the default autodetect() hints value corresponding to each
possible locale. You should probably use a schematic description here, e.g.
an actual Lisp property list, liberally commented.]
Section 8 etc.
[Other sections cover anything I've missed. By being very careful to separate
out the layers, you simultaneously introduce more rigor (easier to catch bugs)
and make it easier for someone else to understand it completely.]
ben
--
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Institute of Policy and Planning Sciences Tel/fax: +81 (298) 53-5091
_________________ _________________ _________________ _________________
What are those straight lines for? "XEmacs rules."