Re: [Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Thursday, 23 November 2006

        Aidan Kehoe <kehoea(a)parhasard.net> writes:

...
  On the twenty-third day of November, David Kastrup wrote:

  > > I have a tentative plan to add a charset to XEmacs, 256
  > > characters of which reflect corrupt Unicode data. These 256
  > > characters will be generated by Unicode-oriented coding systems
  > > when they encounter invalid data:
  > >
  > > (decode-coding-string "\x80\x80" 'utf-8) 
  > > => "\200\200" ;; With funky redisplay properties once display tables
  > >            ;; and char tables are integrated. Which, whee, is more
  > >            ;; work.
  > 
  > Here is what Emacs 22 returns:
  > 
  > #("\xc2\x80\xc2\x80" 0 2 (display #("\\200" 0 4 (face escape-glyph))
  >   help-echo utf-8-help-echo untranslated-utf-8 128) 2 4 (display
  >   #("\\200" 0 4 (face escape-glyph)) help-echo utf-8-help-echo
  >   untranslated-utf-8 128))
  > 
  > > And will be ignored by them when writing: 
  > >
  > > (encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
  > > => ""
  > 
  > Here is what Emacs 22 returns:
  > 
  > "\200\200"

 Quite an old GNU Emacs 23.0.0 gives me this: 

 (encode-coding-string "\x80\x80" 'utf-8)
 => "\200\200"

 (decode-coding-string "\x80\x80" 'utf-8)
 => "\200\200"

 Savannah’s being unco-operative about allowing me to cvs update, otherwise
 it would be worth reporting the former as a bug. 

I am not sure.  It is round-trip, but of course it leaves bad bytes in
the buffer.  And some of them might combine badly with others.
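
The trade-off between round-tripping bad bytes and dropping them on
output can be sketched outside Emacs. The following is a minimal
illustration, assuming Python's "surrogateescape" error handler as a
stand-in for the error characters discussed above. It is not XEmacs or
GNU Emacs code, and the variable names are made up for the example.
Python maps each undecodable byte to a lone surrogate rather than to a
dedicated charset, so it is only an analogy.

```python
# Illustration only: Python's "surrogateescape" handler, not Emacs code.
raw = b"\x80\x80"

# Strict decoding rejects the input: 0x80 is a continuation byte
# with no lead byte in front of it.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
escaped = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes (round trip).
round_trip = escaped.encode("utf-8", errors="surrogateescape")

# Alternatively, drop the error characters when writing.
dropped = escaped.encode("utf-8", errors="ignore")

# Two fragments that are each invalid on their own can combine into a
# valid sequence once they are written out next to each other.
head = b"\xc3".decode("utf-8", errors="surrogateescape")
tail = b"\xa9".decode("utf-8", errors="surrogateescape")
joined = (head + tail).encode("utf-8", errors="surrogateescape")
combined = joined.decode("utf-8")  # the single character U+00E9
```

Here round_trip equals the original bytes and dropped is empty, which
mirrors the two behaviours contrasted in the quoted examples. The last
three lines show escaped bytes combining with their neighbours after a
round trip.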

...
 Were it my implementation, I would regard the latter as a bug too.

Yes, could be worth working over.  I suppose that "Emacs 23" will only
start getting a really solid beating once it is moved to HEAD.

...
  > > This will allow applications like David Kastrup’s
  > > reconstruct-utf-8-sequences-from-fragmentary-TeX-error-messages
  > > to be possible, while
  > > not contradicting the relevant Unicode standards. With Unicode as
  > > the internal encoding, there’s no need to have a separate Mule
  > > character set; we can stick their codes somewhere above the astral
  > > planes. But we should maintain the same syntax code for them. Note
  > > also that, as far as I can work out, these 256 codes will be
  > > sufficient for representing error data for all the other
  > > Unicode-oriented representations as well as UTF-8.
  > 
  > Not just for "unicode-oriented".  The recipe should be workable for
  > the iso-latin-* stuff as a file encoding, too, I think.

 Hmm? _Are_ there invalid sequences for the ISO-8859-N file
 encodings?  

If the file is in iso-8859-N, but the locale and/or the process
encoding (which might come from a master file in utf-8 that includes a
subfile in iso-*)...

The possibilities for complications are pretty much endless.  Polish
people, for example, tend to use Latin-2 encodings in their files, but
Latin-1 locales.  They just "know" which characters will be displayed
wrong and how, and in a Polish locale, more things go wrong than in an
English one.
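
A small sketch of that mismatch, again in Python purely for
illustration (the sample string and variable names are assumptions,
not taken from any real file): text written as Latin-2 and read as
Latin-1 raises no decoding error at all, it just shows some characters
wrong.

```python
# Illustration only: Latin-2 bytes interpreted under a Latin-1 assumption.
text = "ąęłó"

# Bytes as they would sit in a Latin-2 (ISO-8859-2) encoded file.
raw = text.encode("iso8859_2")

# The same bytes read as Latin-1 (ISO-8859-1).
misread = raw.decode("iso8859_1")

# Characters that come out unchanged despite the wrong assumption.
unchanged = [a for a, b in zip(text, misread) if a == b]

# Python's Latin-1 codec assigns a character to every byte value, so
# decoding never fails; the damage is silent.
all_bytes = bytes(range(256)).decode("iso8859_1")
```

With this input misread is "±ê³ó": three characters are displayed
wrong, and "ó" survives because it has the same code in both
encodings.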

You don't really want to know the number of idiocies I have to cater
for in connection with AUCTeX/TeX/LaTeX.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
