Re: #'query-coding-region and invalid Unicode sequences.

Wednesday, 14 January 2009

 On the fourteenth day of January, Julian Bradfield wrote: 

...
 On 2009-01-14, Aidan Kehoe <kehoea(a)parhasard.net> wrote:
 >
 > At the moment, #'query-coding-region ignores invalid Unicode sequences,
 > it treats them as always encodable--which they are, it is clear what
 > they should correspond to when written to disk. But Unicode says they
 > are not encodable.

 What do you mean by "invalid Unicode sequence"? 

XEmacs characters recording that a Unicode coding system encountered an
invalid octet sequence on disk. E.g. the output of 

(decode-coding-string "\xd8\x00\x00\x01" 'utf-16-be) ;; Invalid surrogates

or 

(decode-coding-string "\xe4" 'utf-8) ;; Attempt to decode Latin 1 as utf-8

We produce them so that loading, for example, a koi8-r file as utf-8, making
a single modification, and saving it does not necessarily trash the
non-ASCII content. 
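
As a minimal sketch of that round trip, assuming a Mule-enabled build whose
utf-8 coding system preserves undecodable octets as described above, the
following form decodes one invalid octet and encodes the result again. The
variable names are illustrative only. If the round trip is lossless, the
form evaluates to t.

```elisp
;; Sketch only: assumes the utf-8 coding system keeps octets it cannot
;; decode, rather than dropping or replacing them.
(let* ((octets "\xe4")                                  ; one Latin-1 octet, not valid UTF-8
       (decoded (decode-coding-string octets 'utf-8))   ; kept as an invalid-sequence character
       (written (encode-coding-string decoded 'utf-8))) ; what saving the text would write
  ;; Compare the octets written out with the octets read in.
  (string= octets written))
```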

...
 How do they get into the buffer in the first place? 

unicode.c:1743 and the code that uses that macro.

GNU Emacs versions before 23 have the eight-bit-graphic character set, which
they used for this situation with UTF-8. Version 23 doesn’t seem to deal with
this situation well: as far as I can tell, I don’t get the option to write
the invalid sequence to disk. 

The attached file is UTF-16 with byte order mark, and has an invalid
sequence after the first ". ". Current GNU Emacs deals with it badly. XEmacs
deals with it a little better, but not in a particularly stellar way right
now either. 

-- 
Where might my nephew Yoghurtu Nghe be now, who had to flee the village
in a hurry because of the shortage of rhinoceroses?

_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
