Stephen J. Turnbull wrote:
> > > How do systems deal with the problem that in some encodings (any
> > > ISO2022 that allows general character sets) there are many
> > > octet-strings that encode the same abstract text string?
> >
> > For the most part, they don't. Unicode fans pretend that anything
> > which causes problems for Unicode doesn't exist (or is "obsolete";
> > apparently, they get to decide that this is the case).
> Well, yes, they do. If you want something else, feel free to start
> your own standards effort. Ken'ichi Handa will help, I'm sure. ;-)
> However, even for 99% of Han users, simply putting things into the
> appropriate font will work. The only people who really need to
> disambiguate Han are Buddhist scholars; even Japanese high school
> students read their Chinese poetry in Japanese fonts.
I was referring mainly to the technical issues, e.g. the
non-reversibility of encoding conversions.
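A minimal Python sketch of that non-reversibility (the EUC-JP bytes here are just an assumed example; any octets that are invalid in the target encoding behave the same way):

```python
# Decoding arbitrary bytes with errors="replace" substitutes U+FFFD,
# so re-encoding cannot recover the original octets.
raw = b"\xa4\xcb"          # valid EUC-JP, but not valid UTF-8
text = raw.decode("utf-8", errors="replace")
back = text.encode("utf-8")
print(text == "\ufffd\ufffd")   # True: both bytes became replacement chars
print(back == raw)              # False: the round trip destroyed the data
```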
The problem with Unicode isn't that it's inherently defective as an
encoding, but rather some of the "universalism" in the way that it's
often used, e.g. languages or libraries which insist that all "text"
is represented in Unicode, so that e.g. readdir() -> open() fails for
files whose names don't match a specific encoding.
Any such problems are then waved away with "use UTF-8 for all
filenames". No mention of how to handle filenames obtained from binary
data streams with no specified encoding (e.g. tar/zip/rar files, FTP),
or whether we're supposed to simply ditch customers who have other
ideas about which encodings to use for their data.
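For what it's worth, one answer that later emerged on the Python side (PEP 383's "surrogateescape" error handler) is to smuggle the undecodable bytes through the string type so that the exact octets survive a round trip; a minimal sketch:

```python
# Undecodable bytes are mapped to lone surrogates on the way in and
# restored exactly on the way out, so no filename data is lost.
raw = b"report-\xa4\xcb.txt"   # filename bytes in some unknown encoding
name = raw.decode("utf-8", errors="surrogateescape")
print(raw == name.encode("utf-8", errors="surrogateescape"))  # True
```

The catch, of course, is that `name` is no longer guaranteed to be displayable, or even valid Unicode.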
> > Anyone who actually needs to use such encodings typically avoids
> > Unicode like the plague (I've yet to see a Japanese game for
> > Windows which uses the Unicode API rather than the codepage-based
> > API).
> Use any example but Japanese, please. Japanese exceptionalism is
> alive and well throughout the society. I find it hard to believe that
> changing your fonts when you change your .mo files wouldn't work fine
> for games as it does almost everywhere else, except in truly
> multilingual text; I think that Japanese just enjoy being different.
The issue tends to apply to any language which isn't based upon the
Latin alphabet, although possibly to a lesser extent than for
Japanese. If a language is Latin-based, it's not too much of a stretch
to just stick to ASCII in situations where use of other encodings is
problematic.
So long as there are file formats and network protocols where
filenames are sequences of bytes with no encoding specified (or where
the specified encoding is often incorrect), there will be a strong
temptation for application programmers to make the encoding issue
Someone Else's Problem (TM) by passing the data to anything which is
willing to accept a string of bytes.
On Windows, that means using the legacy "A" API rather than the
Unicode "W" API. On Unix, that means passing the data directly to the
OS without bothering about conversions. In Unicode-everywhere
environments, it means either blindly accepting any built-in
conversions or, if an encoding is required, hunting for a function
(any function) which returns an encoding without requiring any arguments.
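On the Unix side, that pass-the-bytes-through route can be sketched in Python (POSIX-specific; with the bytes API the name is never interpreted as text at all):

```python
import os
import tempfile

# Sketch: passing bytes to os.listdir()/open() skips text decoding
# entirely, so a filename that is not valid UTF-8 can be created,
# listed, and reopened byte-for-byte.
d = os.fsencode(tempfile.mkdtemp())
raw_name = b"data-\xff.bin"              # not valid UTF-8
with open(os.path.join(d, raw_name), "wb") as f:
    f.write(b"payload")
entries = os.listdir(d)                  # bytes in -> bytes out
print(raw_name in entries)               # True on typical Unix filesystems
```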
> > If you want to retrieve a filename from the OS then pass it back at a
> > later point, you need to retain the raw data. If you can't get at the
> > raw data, you lose.
> That's exactly the conclusion the Python people just came to.
Which conclusion? "Retain the raw data" or "you lose"?
> > A more significant point is that Unicode strings aren't strings of
> > "characters", but of Unicode code points. The conversions between
> > Unicode and abstract characters suffer from many of the same problems
> > as with traditional encodings.
> No, they suffer from various forms of inefficiency, but since there
> are two canonical decompositions you just have to do like the Japanese
> and make sure all strings take off their muddy shoes at the door and
> put on canonicalized slippers before entering the house. This isn't
> possible with traditional encodings, and of course it does require a
> lot of programmer discipline to construct and use these interfaces.
Unfortunately, this canonicalisation frequently doesn't happen. It
isn't too surprising, given the way that Unicode is so often touted as
eliminating these sorts of problems.
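Concretely, the two canonical forms are NFC and NFD, and comparisons fail unless both strings have put on the same slippers; a minimal Python sketch:

```python
import unicodedata

composed = "caf\u00e9"         # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"      # 'e' + U+0301 COMBINING ACUTE ACCENT
print(composed == decomposed)  # False: same abstract text, different code points
# Normalising both to one canonical form (NFC here) restores equality.
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```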
--
Glynn Clements <glynn(a)gclements.plus.com>
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta