On the fourteenth of January, Michael Sperber wrote:
> "Stephen J. Turnbull" <stephen(a)xemacs.org> writes:
> > Michael Sperber writes:
> > > Could you give a hint about detecting UTF-8? (I know what UTF-8 looks
> > > like, but not enough about the other coding systems to be able to say
> > > what distinguishes them.)
> > There are a lot of coding systems. But basically if you have as many
> > as 3 non-ASCII characters, the chance that any natural language text
> > "looks like" UTF-8 is vanishingly small. Except at the beginning and
> > end of the string, a single byte >= 0xC0 gives you information about
> > *at least* three other bytes: the preceding one may *not* be >= 0xC0,
> > the following N bytes must be in the range 0x80 to 0xBF, and the next
> > one after that must not be >= 0xC0.
> I'm not sure I understand: These are conditions which must hold true for
> UTF-8. Is the presence of a valid UTF-8 3-byte encoding in a byte
> sequence enough to be able to say that it is UTF-8? What about typical
> Latin-1 text, whose UTF-8 encodings will include only 2-byte encodings?
If there are three non-ASCII octets in a text, and they are positioned such
that the text can be interpreted as valid UTF-8, then the chance that the
text is anything but UTF-8 (or something like UTF-8 strings stored in a core
file) is vanishingly small. So Stephen’s statement also holds for Western
European text stored as UTF-8.
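To make that concrete, here is a rough Python sketch of the heuristic. It is only an illustration, not what XEmacs does internally; the names looks_like_utf8 and continuation_count are made up for this mail, and the threshold of three non-ASCII octets is simply the figure from the discussion above. It checks only the lead-byte/continuation-byte structure and does not reject overlong forms or surrogates.

```python
def continuation_count(lead: int) -> int:
    """Return how many continuation bytes a lead byte announces, or -1.

    Hypothetical helper: -1 covers stray continuation bytes (0x80-0xBF)
    and bytes that can never start a UTF-8 sequence (0xC0, 0xC1, 0xF5+).
    """
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return -1


def looks_like_utf8(data: bytes, min_non_ascii: int = 3) -> bool:
    """Guess whether data is UTF-8 from its byte structure alone.

    Returns True only if every non-ASCII byte is part of a structurally
    valid sequence AND at least min_non_ascii non-ASCII octets were seen.
    Pure ASCII therefore returns False: there is no evidence either way.
    """
    non_ascii = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:          # plain ASCII tells us nothing
            i += 1
            continue
        need = continuation_count(b)
        if need < 0 or i + need >= n:
            return False      # bad lead byte, or sequence cut off at the end
        for k in range(1, need + 1):
            if not 0x80 <= data[i + k] <= 0xBF:
                return False  # a continuation byte was expected here
        non_ascii += 1 + need
        i += 1 + need
    return non_ascii >= min_non_ascii


sample = "café déjà vu"
print(looks_like_utf8(sample.encode("utf-8")))    # structurally valid UTF-8
print(looks_like_utf8(sample.encode("latin-1")))  # 0xE9 followed by a space
```

The Latin-1 encoding of the same string fails at the first accented letter, because 0xE9 announces two continuation bytes and is followed by an ASCII space instead. That is the sense in which two-byte sequences are already strong evidence.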
On your original question: it’s laughably unlikely (outside of Cygwin, where
this code is not used) that ls will output file names in a coding system
that doesn’t reflect the octets stored in the directory entries. And on OS X,
file-name-coding-system (and, relatedly, the 'file-name coding system alias)
is unconditionally UTF-8, independent of the locale coding system. I would
suggest binding coding-system-for-read to 'file-name, not
(get-coding-system-from-locale (current-locale)). For future reference,
too, the coding system alias 'native is equivalent to and much faster than
Where can my nephew Yoghurtu Nghé be now, he who had to flee the village
in a hurry because of the shortage of rhinoceroses?
XEmacs-Beta mailing list