"Stephen J. Turnbull" <stephen(a)xemacs.org> writes:
Michael Sperber writes:
> Could you give a hint about detecting UTF-8? (I know what UTF-8 looks
> like, but not enough about the other coding systems to be able to say what
> distinguishes them.)
There are a lot of coding systems. But basically if you have as many
as 3 non-ASCII characters, the chance that any natural language text
"looks like" UTF-8 is vanishingly small. Except at the beginning and
end of the string, a single byte >= 0xC0 gives you information about
*at least* three other bytes: the preceding one may *not* be >= 0xC0,
the following N bytes must be in the range 0x80 to 0xBF, and the next
one after that must not be >= 0xC0.
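The byte-range constraints described above can be sketched as a small structural check. This is only an illustration, not XEmacs code: the function name `looks_like_utf8` is hypothetical, a sequence truncated at the end of the input is treated as a failure, and the check deliberately ignores overlong forms and surrogates, which a strict validator would also reject.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Structural check only: every lead byte >= 0xC0 must be followed by
    the right number of continuation bytes in the range 0x80..0xBF."""
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII carries no information either way.
            i += 1
            continue
        if 0xC0 <= b <= 0xDF:
            need = 1  # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2  # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF7:
            need = 3  # lead byte of a 4-byte sequence
        else:
            # Stray continuation byte (0x80..0xBF) or a byte in 0xF8..0xFF.
            return False
        tail = data[i + 1 : i + 1 + need]
        if len(tail) != need or any(not 0x80 <= t <= 0xBF for t in tail):
            return False
        # The next byte is examined as a fresh lead byte or ASCII,
        # so it cannot be an extra continuation byte.
        i += need + 1
    return True
```

With this sketch, `looks_like_utf8("Völkerverständigung".encode("utf-8"))` passes, while the same word encoded as Latin-1 fails at the first non-ASCII byte.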
I'm not sure I understand: These are conditions which must hold true for
UTF-8. Is the presence of a valid UTF-8 3-byte encoding in a byte
sequence enough to be able to say that it is UTF-8? What about typical
Latin-1 text, whose UTF-8 encodings will include only 2-byte encodings?
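To make the question concrete, here is a small sketch using Python's built-in strict UTF-8 decoder. It is only an illustration; the helper name `decodes_as_utf8` and the sample strings are chosen for this example. It shows the two cases being asked about: typical Latin-1 text does not decode as UTF-8 at all, whereas a single valid 2-byte sequence is also legal (if unlikely) Latin-1, so one sequence alone is evidence rather than proof.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if the whole byte string is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "Friede, Völkerverständigung und überhaupt"

# The same text in the two encodings under discussion.
latin1_bytes = sample.encode("latin-1")
utf8_bytes = sample.encode("utf-8")

# A lone 2-byte sequence that is valid in both encodings.
ambiguous = b"\xc3\xa4"

print(decodes_as_utf8(latin1_bytes))  # Latin-1 umlauts are not valid UTF-8
print(decodes_as_utf8(utf8_bytes))
print(ambiguous.decode("utf-8"), ambiguous.decode("latin-1"))
```

In the Latin-1 bytes each umlaut is a single byte >= 0xC0 followed by an ASCII letter, which breaks the continuation-byte rule immediately.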
--
Cheers =8-} Mike
Friede, Völkerverständigung und überhaupt blabla