Joachim Schrod <jschrod@acm.org> writes:
> Well, speaking of UTF-8: Since XEmacs is very happy to destroy lots
> of my files with its supposedly smart encoding detection -- and does
> so WITHOUT any warning -- your demand for more thought about error
> recovery is spot on. But there seem to be other areas that are in
> more severe need of that error handling than recovery from wrong
> coding cookies (namely, automatic encoding sniffing). I have had
> literally dozens of UTF-8 files with a single Latin-1 char in them
> that got re-encoded by XEmacs when I opened and saved them. (I.e.,
> when I didn't pay enough attention to the modeline while quickly
> modifying one or two words.) In all these cases, reliance on a
> user-supplied coding cookie would have saved me untold hours of work
> redoing the result of XEmacs' automatic encoding detection, which
> Really Really Really Sucks.
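The failure mode described there can be reproduced in miniature. A hedged Python sketch (a stand-in, not XEmacs internals) of why sniffing such a file goes wrong:

```python
# A file that is valid UTF-8 except for one stray Latin-1 byte.
data = "naïve text ".encode("utf-8") + b"\xe9"   # 0xE9 = Latin-1 'é'

# Strict UTF-8 decoding rejects the whole file ...
try:
    data.decode("utf-8")
    print("valid UTF-8")
except UnicodeDecodeError:
    print("not valid UTF-8")

# ... while Latin-1 "succeeds" on any byte sequence whatsoever, so a
# sniffer falling back to Latin-1 silently reads the legitimate UTF-8
# multibyte sequences as mojibake -- and a subsequent save re-encodes
# the entire file.
print(data.decode("latin-1"))
```

Since every byte sequence is "valid" Latin-1, one stray byte is enough to push a detector off UTF-8 for the whole file.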
In this case, I think a solution from the coding-system implementation
angle should eliminate the problem more reliably than approaching it
from the detection angle. All of the Latin-X codings as well as UTF-8
have the property that _valid_ characters have exactly one valid
external representation. If Emacs' internal buffer encoding offers a
way to encode "invalid byte with code xx", then loading and saving a
file without changing the encoding will preserve its contents. Indeed,
an "invalid byte with code xx" kind of character can't be saved as
Latin-1, since Latin-1 normally has no invalid bytes that such a
character could map back to.
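Python's "surrogateescape" error handler (PEP 383) implements essentially this "invalid byte with code xx" idea; a sketch of the round-trip property (a Python stand-in, not Emacs' actual raw-byte characters):

```python
# Each undecodable byte 0xXX maps to the lone surrogate U+DC00+0xXX,
# i.e. an "invalid byte with code xx" character.
raw = "UTF-8 text ".encode("utf-8") + b"\xe9"   # one stray Latin-1 byte

text = raw.decode("utf-8", errors="surrogateescape")

# The stray byte survives inside the string as U+DCE9 ...
assert "\udce9" in text

# ... and re-encoding reproduces the original file byte for byte.
assert text.encode("utf-8", errors="surrogateescape") == raw

# Encoding that character into plain Latin-1 fails, because Latin-1
# has no representation for "invalid byte 0xE9":
try:
    text.encode("latin-1")
except UnicodeEncodeError:
    print("cannot be saved as Latin-1 without losing information")
```

Load and save without changing the encoding preserve the contents exactly, while an attempt to save in a coding system that cannot represent the invalid byte is caught as an error.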
So there are a lot of possibilities for detecting incurable
inconsistencies and maintaining file coherency even across wrong
detections. I don't think the potential in that area is used to its
full extent. Automatic detection can't _always_ work right (random
bit patterns _always_ have the potential to look like a coding
cookie), but one can make pretty sure that the most-used encodings
will preserve all the information present in the source file.
Throwing an error if the coding system is not sufficient is much
preferable to the current state of affairs (arbitrarily choosing a
coding system that XEmacs thinks is right).
As long as the buffer can be uniquely encoded in the chosen coding
system, and the buffer uniquely represents the input encoding, the
damage is at least reversible.
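As an illustration of that preferred behaviour -- erroring out rather than silently substituting -- a hedged Python sketch, with strict encoding standing in for the save operation:

```python
# Buffer containing characters Latin-1 cannot represent.
buffer_text = "Grüße, 日本語"

# Saving with an insufficient coding system: a strict encoder raises
# instead of silently substituting characters, which is the behaviour
# argued for above.
try:
    buffer_text.encode("latin-1")
    print("saved as Latin-1")
except UnicodeEncodeError:
    print("refusing to save: Latin-1 cannot encode this buffer")

# A sufficient coding system round-trips losslessly, so even a wrong
# but unique choice leaves the damage reversible.
assert buffer_text.encode("utf-8").decode("utf-8") == buffer_text
```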
Personally, I find that Emacs 22 does a remarkable job of not
corrupting files, even though there are times (quite rare for me,
though as a Western European one is somewhat favored) when the
initial detection goes wrong.
I suppose that if escape-code-based encodings (where several
different legal input sequences are possible for the same sequence of
characters) were in my normal working set of encodings, I might get
hit by problems more often, but as things stand, I can't actually
complain.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta@xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta