"Stephen J. Turnbull" <stephen(a)xemacs.org> writes:
Aidan Kehoe writes:
> The attached file is UTF-16 with byte order mark, and has an invalid
> sequence after the first ". ". Current GNU Emacs deals with it badly. XEmacs
> deals with it a little better, but not in a particularly stellar way right
> now either.
There have been a couple of long threads on Python-Dev (or maybe
Python-3000) about how to deal with these issues. It's not obvious to
me that there is a "stellar" way to deal with the problem.
There is a transparent way, however. Note that utf-8 is an encoding
scheme that can, even within 4-byte sequences, encode more than just
legal Unicode code points. Pick a block of 256 code points from there
(either beyond the U+10FFFF limit, i.e. the 2^21-something threshold,
or, saving one byte but being more obscure, in the Unicode range
reserved for utf-16 surrogates and thus left free).
Now this is our XEmacs-internal code page. _Any_ bytes that are not
part of valid codes in a particular encoding (and this _includes_
non-minimal code sequences in utf-8 and utf-16) are encoded using this
XEmacs-internal encoding into "bad byte of value xxx" and are displayed
as \xxx octal escapes byte by byte. When writing out, such "byte" code
points get encoded back into single bytes.
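To make the scheme above concrete, here is a minimal sketch in Python
rather than in XEmacs itself. It assumes the surrogate-range variant and
leans on Python's built-in "surrogateescape" error handler (PEP 383),
which maps an undecodable byte 0xNN to the lone surrogate U+DCNN. The
helper names read_bytes, write_bytes and display are made up for this
illustration and are not XEmacs functions.

```python
def read_bytes(raw: bytes) -> str:
    # Invalid bytes (including overlong sequences) become U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def write_bytes(text: str) -> bytes:
    # Each "bad byte" code point is written back as the original single byte.
    return text.encode("utf-8", errors="surrogateescape")


def display(text: str) -> str:
    # Show "bad byte" code points as \xxx octal escapes, one per byte.
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("\\%03o" % (cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)


raw = b"ok \xff\xc0\x80 end"   # 0xff is never valid; 0xc0 0x80 is overlong
text = read_bytes(raw)
print(display(text))           # ok \377\300\200 end
assert write_bytes(text) == raw
```

The overlong pair 0xc0 0x80 is deliberately not folded into NUL; it is
kept as two separate bad bytes so that writing out reproduces the input.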
Garbage in, garbage out. Meticulously preserved garbage. Note that
XEmacs-internal utf-8 in this case is not necessarily legal utf-8
because of those surrogate byte-code patterns not corresponding to legal
utf-8.
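A small sketch of that last point, again in Python and again assuming
the surrogate-range variant: the internal byte form of a single bad byte
is a 3-byte pattern that a strict utf-8 decoder rejects. The names
to_internal and is_strict_utf8 are hypothetical, and "surrogatepass" is
Python's error handler for emitting lone surrogates, standing in for
whatever XEmacs would do internally.

```python
def to_internal(raw: bytes) -> bytes:
    # Decode with bad bytes mapped to U+DCxx, then serialize those lone
    # surrogates as ordinary 3-byte patterns.
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogatepass")


def is_strict_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


internal = to_internal(b"\xff")
print(internal)                  # b'\xed\xb3\xbf'  (U+DCFF)
print(is_strict_utf8(internal))  # False
```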
The advantage is that a _legal_ utf-8 file is represented as itself
XEmacs-internally. And that even random byte streams can get read and
rewritten without modification.
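Both properties can be checked with a short experiment. This is only a
sketch under the same assumptions as before (Python's "surrogateescape"
and "surrogatepass" handlers standing in for an XEmacs-internal
encoding; to_internal and from_internal are hypothetical names).

```python
import random


def to_internal(raw: bytes) -> bytes:
    text = raw.decode("utf-8", errors="surrogateescape")
    return text.encode("utf-8", errors="surrogatepass")


def from_internal(internal: bytes) -> bytes:
    text = internal.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


# A legal utf-8 file is represented as itself.
legal = "Grüße, κόσμε".encode("utf-8")
assert to_internal(legal) == legal

# Random byte streams survive a read/write cycle unmodified.
rng = random.Random(0)
for _ in range(1000):
    junk = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
    assert from_internal(to_internal(junk)) == junk
```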
Is it stellar? Dealing with garbage never is stellar. But it certainly
is better than ignoring the problem or calling it somebody else's.
--
David Kastrup, Kriemhildstr. 15, 44793 Bochum