Re: [Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Thursday, 23 November 2006

 On the twenty-third of November, David Kastrup wrote:

...
 > I have a tentative plan to add a charset to XEmacs, 256 characters of
 > which reflect corrupt Unicode data. These 256 characters will be
 > generated by Unicode-oriented coding systems when they encounter
 > invalid data:
 >
 > (decode-coding-string "\x80\x80" 'utf-8) 
 > => "\200\200" ;; With funky redisplay properties once display tables
 > 	           ;; and char tables are integrated. Which, whee, is more
 > 	           ;; work.

 Here is what Emacs 22 returns:

 #("\xc2\x80\xc2\x80" 0 2 (display #("\\200" 0 4 (face escape-glyph))
help-echo utf-8-help-echo untranslated-utf-8 128) 2 4 (display #("\\200" 0 4
(face escape-glyph)) help-echo utf-8-help-echo untranslated-utf-8 128))

 > And will be ignored by them when writing: 
 >
 > (encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
 > => ""

 Here is what Emacs 22 returns:

 "\200\200"

Quite an old GNU Emacs 23.0.0 gives me this:

(encode-coding-string "\x80\x80" 'utf-8)
=> "\200\200"

(decode-coding-string "\x80\x80" 'utf-8)
=> "\200\200"

Savannah’s being unco-operative about allowing me to cvs update, otherwise
it would be worth reporting the former as a bug. Were it my implementation,
I would regard the latter as a bug too. 
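
For comparison, the round trip under discussion can be sketched outside Emacs. What follows is Python, not XEmacs code, and it uses Python's `surrogateescape` error handler as a stand-in for the proposed charset: each undecodable byte 0xNN becomes the placeholder code point U+DCNN (128 placeholders, rather than the 256 proposed). The three helper names are made up for illustration. Two encoders are shown because the choice is the point of contention: one restores the original bytes, the other drops the placeholders as the quoted plan describes.

```python
def decode_keeping_errors(data: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0xNN becomes U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_restoring_errors(text: str) -> bytes:
    """Encode to UTF-8, turning each placeholder back into its byte."""
    return text.encode("utf-8", errors="surrogateescape")


def encode_dropping_errors(text: str) -> bytes:
    """Encode to UTF-8, silently dropping the placeholder characters."""
    return text.encode("utf-8", errors="ignore")


decoded = decode_keeping_errors(b"\x80\x80")
print(ascii(decoded))
print(encode_restoring_errors(decoded))
print(encode_dropping_errors(decoded))
```

Whichever encoder is chosen, the decoded string keeps one character per bad byte, which is what lets fragmentary sequences be reassembled later.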

...
 Of course, the internal coding for Emacs 22 is emacs-mule, not utf-8
 based, so this is not completely relevant.  But maybe it is
 interesting, nevertheless.

It is, thank you.

...
 > This will allow applications like David Kastrup’s
 > reconstruct-utf-8-sequences-from-fragmentary-TeX-error-messages to be
 > possible, while
 > not contradicting the relevant Unicode standards. With Unicode as
 > the internal encoding, there’s no need to have a separate Mule
 > character set; we can stick their codes somewhere above the astral
 > planes. But we should maintain the same syntax code for them. Note
 > also that, as far as I can work out, these 256 codes will be
 > sufficient for representing error data for all the other
 > Unicode-oriented representations as well as UTF-8.

 Not just for "unicode-oriented".  The recipe should be workable for
 the iso-latin-* stuff as a file encoding, too, I think.

Hmm? _Are_ there invalid sequences for the ISO-8859-N file encodings?
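
One way to probe that question is to ask an existing codec implementation which single bytes it refuses to decode. The sketch below uses the `iso8859_N` codecs bundled with Python; `undecodable_bytes` is a name made up here. Whether a byte counts as invalid depends on the mapping table an implementation ships, so this reports what Python does and says nothing about what XEmacs does or should do. ISO-8859-1 assigns all 256 byte values, while some other parts, ISO-8859-3 for instance, leave a few positions unassigned.

```python
def undecodable_bytes(codec: str) -> list:
    """Return every byte value (0-255) that `codec` refuses to decode."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


# Report the refused bytes for a few parts of ISO 8859.
for n in (1, 3, 7):
    name = f"iso8859_{n}"
    print(name, [hex(b) for b in undecodable_bytes(name)])
```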

-- 
Santa Maradona, priez pour moi!

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
