Glynn Clements writes:
Julian Bradfield wrote:
> How do systems deal with the problem that in some encodings (any
> ISO2022 that allows general character sets) there are many
> octet-strings that encode the same abstract text string?
For the most part, they don't. Unicode fans pretend that anything
which causes problems for Unicode doesn't exist (or is "obsolete";
apparently, they get to decide that this is the case).
Well, yes, they do. If you want something else, feel free to start
your own standards effort. Ken'ichi Handa will help, I'm sure. ;-)
However, even for 99% of Han users, simply putting things into the
appropriate font will work. The only people who really need to
disambiguate Han are Buddhist scholars; even Japanese high school
students read their Chinese poetry in Japanese fonts.
Anyone who actually needs to use such encodings typically avoids
Unicode like the plague (I've yet to see a Japanese game for
Windows which uses the Unicode API rather than the codepage-based
Use any example but Japanese, please. Japanese exceptionalism is
alive and well throughout the society. I find it hard to believe that
changing your fonts when you change your .mo files wouldn't work fine
for games as it does almost everywhere else, except in truly
multilingual text; I think that Japanese just enjoy being different.
If you want to retrieve a filename from the OS and then pass it back
at a later point, you need to retain the raw data. If you can't get at the
raw data, you lose.
That's exactly the conclusion the Python people just came to.
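
A minimal sketch of what "retaining the raw data" can look like in
Python, assuming the remark above refers to something like the
"surrogateescape" error handler (PEP 383). The file name below is made
up for illustration, and the codec is spelled out explicitly rather
than taken from the locale so the result does not depend on the host
system:

```python
# Octets that are not valid UTF-8 (0xE9 is "e acute" in ISO-8859-1).
# The file name is a hypothetical example.
raw = b"caf\xe9.txt"

# A strict decode refuses the name outright.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With surrogateescape the undecodable octet is smuggled through as a
# lone surrogate (U+DCE9) instead of being rejected or dropped.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler gives back the original octets, so
# the name can be handed to the OS again unchanged.
roundtrip = name.encode("utf-8", errors="surrogateescape")
```

The other option is to stay in bytes throughout: os.listdir() returns
bytes names when it is given a bytes path.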
> Come to that, how do UTF-8 based filesystems (Windows, Mac) behave
> when faced with a filename that is invalid - or are the OSes
> sufficiently well written to validate filenames on creation?
Mac OS X is not---it's just a Unix VFS---although HFS+ more or less is
a validating FS. But even on HFS+ it's not hard to bypass the
validation; see the comments on the ISO-8859-2 test in
tests/automated/mule-tests.el. And in general, any system that
supports mounting arbitrary file systems cannot guarantee validation.
Your Pyramid/Vax example is perfectly general.
In fact, typically GNOME and Windows applications simply silently drop
such file names when encountered on the system.
A more significant point is that Unicode strings aren't strings of
"characters", but of Unicode code points. The conversions between
Unicode and abstract characters suffer from many of the same problems
as with traditional encodings.
No, they suffer from various forms of inefficiency, but since there
are two canonical decompositions you just have to do like the Japanese
and make sure all strings take off their muddy shoes at the door and
put on canonicalized slippers before entering the house. This isn't
possible with traditional encodings, and of course it does require a
lot of programmer discipline to construct and use these interfaces.
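
The "slippers at the door" discipline might be sketched like this in
Python, assuming the two canonical forms meant above are NFC and NFD.
The helper name canonical() is hypothetical; unicodedata.normalize()
is the standard-library call doing the work:

```python
import unicodedata

def canonical(s, form="NFC"):
    """Normalize a string once, at the boundary where it enters the program."""
    return unicodedata.normalize(form, s)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The same abstract character, but two different code point sequences.
differ_raw = composed != decomposed

# After canonicalization the two spellings compare equal.
same_nfc = canonical(composed) == canonical(decomposed)
same_nfd = canonical(composed, "NFD") == canonical(decomposed, "NFD")
```

The point of doing this at the door is that everything inside the
house can then compare strings code point by code point.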
XEmacs-Beta mailing list