Julian Bradfield wrote:
How do systems deal with the problem that in some encodings (any
ISO2022 that allows general character sets) there are many
octet-strings that encode the same abstract text string?
For the most part, they don't. Unicode fans pretend that anything
which causes problems for Unicode doesn't exist (or is "obsolete";
apparently, they get to decide that this is the case). Anyone who
actually needs to use such encodings typically avoids Unicode like the
plague (I've yet to see a Japanese game for Windows which uses the
Unicode API rather than the codepage-based API).
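To make the original question concrete, here is a minimal sketch using Python's built-in "iso2022_jp" codec (my choice for illustration; any ISO 2022 variant with redundant designations would do). The escape sequence ESC ( B designates ASCII, which is already the initial state, so prefixing it changes the octets but not the text:

```python
# Two different octet strings, one abstract text string.
plain = b"abc"
redundant = b"\x1b(B" + b"abc"  # ESC ( B re-designates ASCII; a no-op here

# Both decode to the same text...
same_text = plain.decode("iso2022_jp") == redundant.decode("iso2022_jp")

# ...but decoding and re-encoding does not give back the original octets.
roundtrip = redundant.decode("iso2022_jp").encode("iso2022_jp")

print(same_text)               # the decoded strings compare equal
print(roundtrip == redundant)  # the original byte sequence is lost
```

Once the name has been through a decode/encode cycle, there is no way to tell which of the two octet strings you started with.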
If you want to retrieve a filename from the OS then pass it back at a
later point, you need to retain the raw data. If you can't get at the
raw data, you lose.
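As a rough illustration of why the raw data matters, the sketch below takes a hypothetical filename in Latin-1 (not valid UTF-8) and pushes it through two decode/encode round trips. The filename and the choice of error handlers are my assumptions; "surrogateescape" is the handler Python itself uses for filenames on POSIX systems, precisely so that the raw bytes survive.

```python
# A hypothetical filename as raw bytes: "caf\xe9.txt" is Latin-1, not valid UTF-8.
raw = b"caf\xe9.txt"

# Lossy path: the invalid byte is replaced with U+FFFD and cannot be recovered.
lossy_text = raw.decode("utf-8", errors="replace")
lossy_back = lossy_text.encode("utf-8")

# Lossless path: the invalid byte is smuggled through as a lone surrogate,
# so encoding with the same handler restores the original bytes.
safe_text = raw.decode("utf-8", errors="surrogateescape")
safe_back = safe_text.encode("utf-8", errors="surrogateescape")

print(lossy_back == raw)  # the name we would pass back is not the name we got
print(safe_back == raw)   # the raw bytes were retained
```

With the lossy path, passing the name back to the OS would refer to a different (probably nonexistent) file.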
Come to that, how do UTF-8 based filesystems (Windows, Mac) behave
when faced with a filename that is invalid - or are the OSes
sufficiently well written to validate filenames on creation?
A more significant point is that Unicode strings aren't strings of
"characters", but of Unicode code points. The conversions between
Unicode and abstract characters suffer from many of the same problems
as with traditional encodings.
E.g. an accented letter can often be represented either as a single
code point representing the composed character or as a sequence of the
base letter and a combining accent (Windows and Linux typically use the
former, while Mac OS X uses the latter).
NTFS will happily let you have files whose names represent identical
text but differ in the exact sequence of code points.
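A small sketch of the composed/decomposed problem, using Python's standard unicodedata module. The helper same_text() is a hypothetical name of my own; comparing after NFC normalization is one common approach, not something any particular filesystem is guaranteed to do.

```python
import unicodedata

composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two code points

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(composed), len(decomposed))    # different numbers of code points
print(composed == decomposed)            # a plain comparison sees two different strings
print(same_text(composed, decomposed))   # normalized comparison treats them as equal
```

A filesystem that compares names code point by code point, without normalizing, will treat these as two distinct files.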
(Many years ago, we had a Pyramid Unix system, which had a network
system interface to the Vaxen. This interface did so little checking
of filenames that it was possible, from a Vax, to create a Unix file
on the Pyramid with a '/' in its name! Of course, the only way to
remove it, or access it in any way, was from a Vax.)
A slightly similar situation exists on Windows, at least for registry
keys (I'm not sure about filenames). The "native" NT API represents
strings using an explicit length, while the Win32 API uses NUL
termination. Using the native API, you can create registry keys which
contain embedded NUL characters.
It's impossible to specify such keys via the Win32 API. They will show
up in RegEdit, minus the first NUL and anything following it.
Attempting to examine the key's subkeys or values will result in a "key
not found" error.
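The mismatch can be modelled without touching Windows at all. The sketch below is purely illustrative: it calls no registry or NT API, the key name is made up, and the two functions are hypothetical stand-ins for "a reader that trusts the explicit length" and "a reader that stops at the first NUL".

```python
# A made-up key name containing an embedded NUL character.
key_name = "Visible\x00Hidden"

def counted_view(name: str) -> str:
    """Model of a length-counted string: every character is part of the name."""
    return name[:len(name)]

def nul_terminated_view(name: str) -> str:
    """Model of a NUL-terminated string: everything from the first NUL on is dropped."""
    return name.split("\x00", 1)[0]

stored = counted_view(key_name)          # what the counted-string API created
looked_up = nul_terminated_view(key_name)  # what a NUL-terminated API can express

print(repr(stored))
print(repr(looked_up))
print(stored == looked_up)  # the lookup names a different key than the one stored
```

The truncated name is all the NUL-terminated side can ever ask for, which is consistent with the key being visible in a listing but failing with "key not found" when opened.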
Glynn Clements <glynn(a)gclements.plus.com>
XEmacs-Beta mailing list