Re: encoding of etc/HELLO

Saturday, 2 July 2011

 On the second of July, Stephen J. Turnbull wrote:

...
 Jeff Sparkes writes:

  > The new HELLO is based on the one in Emacs.  XEmacs opens the Emacs
  > HELLO and sets the buffer encoding to iso-2022-7.  Opening the
  > XEmacs HELLO gets the buffer encoding set to raw-text which doesn't
  > display properly.

 Right.  I haven't done a careful analysis, but I bet this is because
 the XEmacs coding systems are unable to detect this abominable mix.
 UTF-8-encoded segments do not conform to either the 7-bit ISO-2022
 format or the 8-bit ISO-2022 format.  We would need to special case
 these ISO UCS coded character set invocation sequences, and in general
 stop detection there. 
UTF-8-encoded segments *do* conform to ISO 2022. We happen to ignore that in
our detection algorithms, which is fine, since such text is basically not
encountered in the wild in contexts where we need to detect it.

...
 Aidan's choice of coding system doesn't make sense to me, I would just
 have said "to hell with the ISO-2022-JP rule, we're going UTF-8."[1]
 Alternatively, use the DOCS sequence ESC % G, which has a length
 specification in octets, making it reasonably easy to find the rest of
 the 3000 octets normally used for coding detection (although I'm not
 sure that we actually use that information in detection, it would be
 easier to implement than trying to deal with the many ISO-IR sequences
 for Unicode). 
(ISO IR 196 describes exactly the DOCS sequence ESC % G; that DOCS does not
have an associated length specification in octets.)
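
To make that concrete, here is a minimal Python sketch (not XEmacs code; the
function name split_docs_segments is hypothetical) of how a reader could find
the end of a UTF-8 segment. It assumes the segment is opened by ESC % G and
closed by the standard return sequence ESC % @; since there is no length
field, the end has to be found by scanning.

```python
ESC_UTF8 = b"\x1b%G"    # DOCS sequence selecting UTF-8 (ISO-IR 196)
ESC_RETURN = b"\x1b%@"  # DOCS sequence returning to ISO 2022


def split_docs_segments(data: bytes):
    """Split DATA into (kind, payload) pairs.

    KIND is "iso2022" for octets outside a DOCS segment and "utf-8" for
    octets between ESC % G and the next ESC % @ (or end of data).
    This is an illustrative sketch, not what the XEmacs detector does.
    """
    segments = []
    pos = 0
    while pos < len(data):
        start = data.find(ESC_UTF8, pos)
        if start == -1:
            # No further UTF-8 segment; the rest is ordinary ISO 2022 text.
            segments.append(("iso2022", data[pos:]))
            break
        if start > pos:
            segments.append(("iso2022", data[pos:start]))
        body = start + len(ESC_UTF8)
        # No octet count is available, so scan for the return sequence.
        end = data.find(ESC_RETURN, body)
        if end == -1:
            # Unterminated segment: treat everything that follows as UTF-8.
            segments.append(("utf-8", data[body:]))
            break
        segments.append(("utf-8", data[body:end]))
        pos = end + len(ESC_RETURN)
    return segments
```

The scan is safe in the sense that multi-byte UTF-8 sequences never contain
octets below 0x80, so ESC can only appear inside the segment as a literal
U+001B character.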

...
 Aidan, were there specific reasons not to just convert the file to
 UTF-8? 
Yes, the file distinguishes between the various national Han characters, and
it’s useful to have a file with these differences available as long as we
don’t unify them in our internal encoding.

...
  > Should the encoding be set in the variables at the end of the file?
  > And what would that be?  I've tried utf-8 and iso-2022-7[.]

 iso-2022-8 seems to work for me, but I think we need Aidan's input. 
iso-2022-8 and iso-2022-7 are both fine. (I’m not sure why the latter didn’t
work for Jeff; it worked fine for me.) But all the places where HELLO is
used programmatically specify its encoding explicitly--it would be helpful
to add the coding cookie, but not necessary.

...
 BTW, your knowledge may be meager, but your guesser is working
 fine. :-)  It just didn't quite suffice this time.

 Footnotes: 
 [1]  Note that that's a rule that I personally instituted before
 XEmacs could handle Unicode coding systems at all.  It made sense at
 that time, but now even no-Mule 21.5 XEmacsen can read UTF-8 files.
 Especially if we decide to use Unicode inside soon, this rule should
 be revisited (and in fact I think Ben Wing has already converted some
 files to UTF-8).

 The buffer is corrupted by no-Mule, though, because the information in
 the high bits is lost.  This is a bad bug, issue780. 
Trashing the user’s non-Latin-1 data is the fundamental design decision of
no-Mule; I don’t think it’s constructive to report that as a bug.

-- 
‘Iodine deficiency was endemic in parts of the UK until, through what has been
described as “an unplanned and accidental public health triumph”, iodine was
added to cattle feed to improve milk production in the 1930s.’
(EN Pearce, Lancet, June 2011)

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta
