Re: [Q] Handle bytes in the range 0x80-0xC0 better when dealing with ISO-IR 196.

Thursday, 23 November 2006

 On the twenty-second day of November, Stephen J. Turnbull wrote:

...
  > +		/* ASCII, or the lower control characters.
  > +                   
  > +                   Perhaps we should signal an error if the character is in
  > +                   the range 0x80-0xc0; this is illegal UTF-8. */
  > +                Dynarr_add (dst, (c & 0x7f));

 Please do.  This is corrupting the data.

 I don't have a clue how to recover from it, but the user should at
 least be told.

I have a tentative plan to add a charset to XEmacs, 256 characters of which
represent corrupt Unicode data. These 256 characters will be generated by
Unicode-oriented coding systems when they encounter invalid data:

(decode-coding-string "\x80\x80" 'utf-8)
=> "\200\200" ;; With funky redisplay properties once display tables and
              ;; char tables are integrated. Which, whee, is more work.

They will be ignored by those coding systems when writing:

(encode-coding-string (decode-coding-string "\x80\x80" 'utf-8) 'utf-8)
=> ""
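
For illustration only, here is a minimal C sketch of the round trip shown
above. It is not the XEmacs coding-system code. The names ERROR_BASE,
utf8_sequence, decode_utf8_lossless and encode_utf8 are hypothetical, and
the base value 0x110000 is just one assumed place to put the 256 error
characters. Each byte that is not part of a well-formed UTF-8 sequence
decodes to ERROR_BASE plus the byte value, and the encoder skips such
characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base for the 256 "corrupt data" characters, assumed
   here to sit just past U+10FFFF.  Not an actual XEmacs value. */
#define ERROR_BASE 0x110000u

/* Parse one UTF-8 sequence at S (AVAIL >= 1 bytes available).  Returns
   its length and stores the code point in *OUT, or returns 0 if the
   bytes at S do not start a well-formed sequence. */
static size_t
utf8_sequence (const unsigned char *s, size_t avail, uint32_t *out)
{
  uint32_t c, min;
  size_t len, k;

  if (s[0] < 0x80) { *out = s[0]; return 1; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
  else return 0;                /* stray continuation byte, or 0xF8-0xFF */

  if (len > avail) return 0;    /* truncated sequence */
  for (k = 1; k < len; k++)
    {
      if ((s[k] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
      c = (c << 6) | (s[k] & 0x3F);
    }
  /* Reject overlong forms, surrogates and values past U+10FFFF. */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
  *out = c;
  return len;
}

/* Decode LEN bytes of SRC into DST, which must have room for LEN code
   points.  Each undecodable byte becomes ERROR_BASE + byte.  Returns
   the number of code points written. */
size_t
decode_utf8_lossless (const unsigned char *src, size_t len, uint32_t *dst)
{
  size_t i = 0, n = 0;
  while (i < len)
    {
      uint32_t c;
      size_t used = utf8_sequence (src + i, len - i, &c);
      if (used == 0) { dst[n++] = ERROR_BASE + src[i]; i++; }
      else { dst[n++] = c; i += used; }
    }
  return n;
}

/* Encode LEN code points of SRC as UTF-8 into DST, which must have room
   for 4 * LEN bytes.  Error characters (and anything else that is not a
   Unicode scalar value) are skipped.  Returns the bytes written. */
size_t
encode_utf8 (const uint32_t *src, size_t len, unsigned char *dst)
{
  size_t i, n = 0;
  for (i = 0; i < len; i++)
    {
      uint32_t c = src[i];
      if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        continue;               /* ignored when writing */
      if (c < 0x80)
        dst[n++] = (unsigned char) c;
      else if (c < 0x800)
        {
          dst[n++] = (unsigned char) (0xC0 | (c >> 6));
          dst[n++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else if (c < 0x10000)
        {
          dst[n++] = (unsigned char) (0xE0 | (c >> 12));
          dst[n++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          dst[n++] = (unsigned char) (0x80 | (c & 0x3F));
        }
      else
        {
          dst[n++] = (unsigned char) (0xF0 | (c >> 18));
          dst[n++] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
          dst[n++] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
          dst[n++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
  return n;
}
```

Fed the two bytes 0x80 0x80, this decoder yields two error characters, and
the encoder then writes nothing, which matches the "" result above.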

This will make applications like David Kastrup’s
reconstruct-utf-8-sequences-from-fragmentary-TeX-error-messages possible,
while not contradicting the relevant Unicode standards. With Unicode as the
internal encoding, there’s no need to have a separate Mule character set; we
can stick their codes somewhere above the astral planes (that is, past
U+10FFFF). But we should maintain the same syntax code for them. Note also
that, as far as I can work out, these 256 codes will be sufficient for
representing error data for all the other Unicode-oriented representations
as well as UTF-8.
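
As a rough illustration of why one error character per byte value could
also cover other encodings, here is a hedged C sketch for little-endian
UTF-16. The function decode_utf16le_lossless and the ERROR_BASE value are
hypothetical, not XEmacs code. It assumes that an unpaired surrogate is
preserved as two error characters (one per byte) and that an odd trailing
byte becomes one error character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base for the 256 error characters, assumed to sit just
   past U+10FFFF.  Not an actual XEmacs value. */
#define ERROR_BASE 0x110000u

/* Decode LEN bytes of little-endian UTF-16 from SRC into DST, which
   must have room for LEN code points.  Bytes that cannot be decoded
   become ERROR_BASE + byte.  Returns the number of code points. */
size_t
decode_utf16le_lossless (const unsigned char *src, size_t len, uint32_t *dst)
{
  size_t i = 0, n = 0;
  while (i < len)
    {
      if (len - i < 2)
        {
          /* Odd trailing byte: no complete 16-bit unit left. */
          dst[n++] = ERROR_BASE + src[i];
          i++;
          continue;
        }

      uint32_t u = src[i] | ((uint32_t) src[i + 1] << 8);
      if (u < 0xD800 || u > 0xDFFF)
        {
          dst[n++] = u;         /* ordinary BMP character */
          i += 2;
          continue;
        }

      if (u <= 0xDBFF && len - i >= 4)
        {
          uint32_t lo = src[i + 2] | ((uint32_t) src[i + 3] << 8);
          if (lo >= 0xDC00 && lo <= 0xDFFF)
            {
              /* Well-formed surrogate pair. */
              dst[n++] = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
              i += 4;
              continue;
            }
        }

      /* Unpaired surrogate: keep both of its bytes as error characters. */
      dst[n++] = ERROR_BASE + src[i];
      dst[n++] = ERROR_BASE + src[i + 1];
      i += 2;
    }
  return n;
}
```

In this sketch any byte value, including 0x00, can turn up as error data,
whereas in UTF-8 only bytes from 0x80 upward can be invalid. That is one
reason to reserve all 256 codes rather than only the upper half.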

-- 
Santa Maradona, priez pour moi!

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
