On the second of July, Stephen J. Turnbull wrote:
Jeff Sparkes writes:
> The new HELLO is based on the one in Emacs. XEmacs opens the Emacs
> HELLO and sets the buffer encoding to iso-2022-7. Opening the
> XEmacs HELLO gets the buffer encoding set to raw-text which doesn't
> display properly.
Right. I haven't done a careful analysis, but I bet this is because
the XEmacs coding systems are unable to detect this abominable mix.
UTF-8-encoded segments do not conform to either the 7-bit ISO-2022
format or the 8-bit ISO-2022 format. We would need to special-case
these ISO UCS coded character set invocation sequences, and in general
stop detection there.
UTF-8-encoded segments *do* conform to ISO 2022. We happen to ignore that in
our detection algorithms, which is fine, since such text is basically not
encountered in the wild in contexts where we need to detect it.
Aidan's choice of coding system doesn't make sense to me; I would just
have said "to hell with the ISO-2022-JP rule, we're going UTF-8."[1]
Alternatively, use the DOCS sequence ESC % G, which has a length
specification in octets, making it reasonably easy to find the rest of
the 3000 octets normally used for coding detection (although I'm not
sure that we actually use that information in detection, it would be
easier to implement than trying to deal with the many ISO-IR sequences
for Unicode).
(ISO IR 196 describes exactly the DOCS sequence ESC % G; that DOCS does not
have an associated length specification in octets.)
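
As a rough illustration of the point above (not what the XEmacs detection
code does), here is a minimal Python sketch that splits a byte stream at
the ISO-IR 196 DOCS sequence ESC % G and the return sequence ESC % @.
Because there is no length field, the only way to find the end of a UTF-8
segment is to scan for the return sequence. The function name
split_docs_segments is hypothetical, and the sketch assumes only this one
DOCS pair occurs in the data.

```python
UTF8_START = b"\x1b%G"      # DOCS: switch to UTF-8 (ISO-IR 196)
ISO2022_RETURN = b"\x1b%@"  # DOCS: return to ISO 2022


def split_docs_segments(data: bytes):
    """Split data into (kind, bytes) pairs; kind is 'iso2022' or 'utf-8'.

    Hypothetical helper for illustration only. It assumes the stream
    starts in ISO 2022 mode and ignores every other escape sequence.
    """
    segments = []
    kind = "iso2022"
    pos = 0
    while pos < len(data):
        # The sequence that ends the current segment depends on the mode.
        marker = UTF8_START if kind == "iso2022" else ISO2022_RETURN
        nxt = data.find(marker, pos)
        if nxt == -1:
            # No further switch: the rest of the data is one segment.
            segments.append((kind, data[pos:]))
            break
        if nxt > pos:
            segments.append((kind, data[pos:nxt]))
        kind = "utf-8" if kind == "iso2022" else "iso2022"
        pos = nxt + len(marker)
    return segments
```

Scanning for ESC inside a UTF-8 segment is safe because every octet of a
multi-octet UTF-8 sequence has its high bit set, so 0x1B can only be a
real ESC character.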
Aidan, were there specific reasons not to just convert the file to
UTF-8?
Yes, the file distinguishes between the various national Han characters, and
it’s useful to have a file with these differences available as long as we
don’t unify them in our internal encoding.
> Should the encoding be set in the variables at the end of the file?
> And what would that be? I've tried utf-8 and iso-2022-7[.]
iso-2022-8 seems to work for me, but I think we need Aidan's input.
iso-2022-8 and iso-2022-7 are both fine. (I’m not sure why the latter didn’t
work for Jeff; it worked fine for me.) But all the places where HELLO is
used programmatically specify its encoding explicitly--it would be helpful
to add the coding cookie, but not necessary.
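
For anyone unfamiliar with the cookie, here is a small Python sketch of
reading one from the first line of a file, in the "-*- coding: NAME -*-"
form. It is a simplification for illustration: the function name
find_coding_cookie is hypothetical, it looks only at the first line, and
it does not handle a "Local Variables:" block at the end of the file.

```python
import re

# Matches "-*- ... coding: NAME ... -*-" on a single line.
COOKIE_RE = re.compile(rb"-\*-.*?\bcoding:\s*([A-Za-z0-9_.-]+).*?-\*-")


def find_coding_cookie(data: bytes):
    """Return the coding system named on the first line, or None.

    Hypothetical helper: a simplified stand-in for the editor's own
    file-variable parsing, which is more thorough than this.
    """
    first_line = data.split(b"\n", 1)[0]
    match = COOKIE_RE.search(first_line)
    return match.group(1).decode("ascii") if match else None
```

With a first line of "-*- coding: iso-2022-8 -*-", this returns the
string "iso-2022-8"; with no cookie it returns None.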
BTW, your knowledge may be meager, but your guesser is working
fine. :-) It just didn't quite suffice this time.
Footnotes:
[1] Note that that's a rule that I personally instituted before
XEmacs could handle Unicode coding systems at all. It made sense at
that time, but now even no-Mule 21.5 XEmacsen can read UTF-8 files.
Especially if we decide to use Unicode inside soon, this rule should
be revisited (and in fact I think Ben Wing has already converted some
files to UTF-8).
The buffer is corrupted by no-Mule, though, because the information in
the high bits is lost. This is a bad bug, issue780.
Trashing the user’s non-Latin-1 data is the fundamental design decision of
no-Mule; I don’t think it’s constructive to report that as a bug.
--
‘Iodine deficiency was endemic in parts of the UK until, through what has been
described as “an unplanned and accidental public health triumph”, iodine was
added to cattle feed to improve milk production in the 1930s.’
(EN Pearce, Lancet, June 2011)
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://lists.xemacs.org/mailman/listinfo/xemacs-beta