Re: #'query-coding-region and invalid Unicode sequences.

Friday, 16 January 2009

        "Stephen J. Turnbull" <stephen(a)xemacs.org> writes:

...
 Aidan Kehoe writes:

  > The attached file is UTF-16 with byte order mark, and has an invalid
  > sequence after the first ". ". Current GNU Emacs deals with it badly.
  > XEmacs deals with it a little better, but not in a particularly
  > stellar way right now either.

 There have been a couple of long threads on Python-Dev (or maybe
 Python-3000) about how to deal with these issues.  It's not obvious to
 me that there is a "stellar" way to deal with the problem. 
There is a transparent way, however.  Note that utf-8 is an encoding
scheme that can, even within 4-byte sequences, encode more than just
legal Unicode code points.  Pick a 256-entry code page from there
(either beyond the U+10FFFF limit, or, saving one byte but being more
obscure, in the range reserved for utf-16 surrogates and thus left free).
Now this is our XEmacs-internal code page.  _Any_ bytes that are not
part of valid codes in a particular encoding (and this _includes_
non-minimal code sequences in utf-8 and utf-16) are encoded using this
XEmacs-internal encoding into "bad byte of value xxx" and are displayed
as \xxx octal escapes byte by byte.  When writing out, such "byte" code
points get encoded back into single bytes.
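
A minimal sketch of this scheme, in Python rather than XEmacs internals.
It assumes Python's "surrogateescape" error handler (PEP 383), which
happens to pick the surrogate variant: each undecodable byte 0xNN becomes
the lone code point U+DC00 + 0xNN.  The function names are hypothetical,
and only the utf-8 case is covered.

```python
# Sketch only: "bad bytes" are parked in the surrogate range U+DC80..U+DCFF.

def read_preserving(raw: bytes) -> str:
    # Valid utf-8 decodes normally; every invalid byte (including the bytes
    # of non-minimal sequences) becomes one lone low surrogate.
    return raw.decode("utf-8", errors="surrogateescape")

def write_preserving(text: str) -> bytes:
    # Escaped "byte" code points are written back out as single bytes.
    return text.encode("utf-8", errors="surrogateescape")

def display(text: str) -> str:
    # Show escaped bytes as \xxx octal escapes, byte by byte.
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\%03o" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)

# 0xC0 0xAF is a non-minimal encoding of "/", 0xFF is never valid in utf-8.
raw = b"ok \xc0\xaf \xff caf\xc3\xa9"
text = read_preserving(raw)
print(display(text))                   # ok \300\257 \377 café
print(write_preserving(text) == raw)   # True
```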

Garbage in, garbage out.  Meticulously preserved garbage.  Note that
XEmacs-internal utf-8 in this case is not necessarily legal utf-8
because of those surrogate byte-code patterns not corresponding to legal
utf-8.

The advantage is that a _legal_ utf-8 file is represented as itself
XEmacs-internally.  And that even random byte streams can get read and
rewritten without modification.
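
The properties claimed above can be checked with a short sketch.  It
again assumes Python's "surrogateescape" handler as a stand-in for the
XEmacs-internal code page; the helper names are hypothetical.

```python
import random

def to_internal(raw: bytes) -> str:
    # Hypothetical stand-in for reading a file into the internal form.
    return raw.decode("utf-8", errors="surrogateescape")

def from_internal(text: str) -> bytes:
    # Hypothetical stand-in for writing the internal form back out.
    return text.encode("utf-8", errors="surrogateescape")

def is_legal_utf8_text(text: str) -> bool:
    # A strict encode fails if the text contains lone surrogates.
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# 1. A legal utf-8 file is represented as itself.
legal = "Grüße, 世界".encode("utf-8")
assert to_internal(legal) == legal.decode("utf-8")

# 2. Random byte streams are read and rewritten without modification.
rng = random.Random(2009)
garbage = bytes(rng.getrandbits(8) for _ in range(4096))
assert from_internal(to_internal(garbage)) == garbage

# 3. The internal form of garbage is not itself legal utf-8.
assert not is_legal_utf8_text(to_internal(b"\xff"))
```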

Is it stellar?  Dealing with garbage never is stellar.  But it certainly
is better than ignoring the problem or calling it somebody else's.

-- 
David Kastrup, Kriemhildstr. 15, 44793 Bochum

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
