Michael Sperber writes:
> "Stephen J. Turnbull" <stephen(a)xemacs.org> writes:
> > There are a lot of coding systems. But basically if you have as many
> > as 3 non-ASCII characters, the chance that any natural language text
> > "looks like" UTF-8 is vanishingly small. Except at the beginning and
> > end of the string, a single byte >= 0xC0 gives you information about
> > *at least* three other bytes: the preceding one may *not* be >= 0xC0,
> > the following N bytes must be in the range 0x80 to 0xBF, and the next
> > one after that must not be >= 0xC0.
> I'm not sure I understand: These are conditions which must hold true for
> UTF-8. Is the presence of a valid UTF-8 3-byte encoding in a byte
> sequence enough to be able to say that it is UTF-8?
No, it's not. For one thing, in a shell buffer you might be accessing
several remote systems or cat'ing files saved from MIME mail in the
specified encoding. Let's not worry about those, though.
Consider the popular Western European languages (English, Spanish,
French, and German). Suppose the text has three non-ASCII characters
you want to encode. The accented letters (acute, grave, tilde,
umlaut) almost always occur in isolation, surrounded by ASCII: that
can't be UTF-8, where high-bit-set bytes occur only in groups of two
or more. If they do
occur in groups, what is the chance that the first of the group has
high bits that encode the byte count of the UTF-8-like group it is in?
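
To make the structural argument concrete, here is a minimal sketch in
Python of the leading-byte/trailing-byte check. The function name
utf8_structure_ok is hypothetical, and the check is deliberately
incomplete: it looks only at the byte-count structure discussed here,
not at overlong forms or surrogates, so it is not a full UTF-8
validator. Note that after a complete group the next byte may be ASCII
or a new leading byte, but never a stray trailing byte.

```python
def utf8_structure_ok(data: bytes) -> bool:
    """Check only the leading/trailing byte structure of UTF-8.

    Illustrative sketch: overlong encodings and surrogates are not rejected.
    """
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            # Plain ASCII byte, always acceptable.
            i += 1
            continue
        if b < 0xC0:
            # Trailing byte (0x80-0xBF) with no leading byte before it.
            return False
        # Leading byte: its high bits encode the length of the group.
        if b < 0xE0:
            trail = 1
        elif b < 0xF0:
            trail = 2
        elif b < 0xF8:
            trail = 3
        else:
            # 0xF8-0xFF never start a UTF-8 sequence.
            return False
        group = data[i + 1 : i + 1 + trail]
        if len(group) != trail or any(not 0x80 <= t <= 0xBF for t in group):
            # Too few bytes left, or a follower outside 0x80-0xBF.
            return False
        i += 1 + trail
    return True


# An isolated accented letter in ISO-8859-1 breaks the structure at once.
latin1_sample = "caf\u00e9 au lait".encode("latin-1")
utf8_sample = "caf\u00e9 au lait".encode("utf-8")
print(utf8_structure_ok(latin1_sample), utf8_structure_ok(utf8_sample))
```

In the ISO-8859-1 sample the e-acute is the single byte 0xE9, which
claims to start a three-byte group but is followed by an ASCII space,
so the check fails on the first accented letter.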
Note that the probability that an ISO-8859 encoding of random
characters will put a UTF-8 trailing byte in the text is fairly low by
itself, since only the range 0xA0-0xBF (32 of the 256 byte values, or
1/8) satisfies both ISO-8859's avoidance of C1 controls and the UTF-8
trailing byte restriction. How likely it is given the constraints of
natural language vocabulary, I don't know.
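
The 1/8 figure can be checked by counting bytes. The sketch below is
only arithmetic over byte ranges; the constant names are made up for
illustration, and the second ratio (relative to the ISO-8859 graphic
bytes 0xA0-0xFF rather than all 256 values) is an extra figure not
claimed above. Neither ratio says anything about how often these bytes
occur in real text.

```python
from fractions import Fraction

# UTF-8 trailing bytes have the bit pattern 10xxxxxx.
UTF8_TRAILING = range(0x80, 0xC0)
# ISO-8859 leaves 0x80-0x9F to the C1 controls; graphics start at 0xA0.
ISO8859_GRAPHIC = range(0xA0, 0x100)

# Bytes that are both an ISO-8859 graphic character and a UTF-8 trailing byte.
overlap = [b for b in ISO8859_GRAPHIC if b in UTF8_TRAILING]

share_of_all_bytes = Fraction(len(overlap), 256)
share_of_graphic_bytes = Fraction(len(overlap), len(ISO8859_GRAPHIC))

print(f"overlap: 0x{overlap[0]:02X}-0x{overlap[-1]:02X}, {len(overlap)} bytes")
print("share of all byte values:", share_of_all_bytes)
print("share of ISO-8859 graphic bytes:", share_of_graphic_bytes)
```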
Spanish, however, has those inverted punctuation characters used at
the beginning of a sentence, and it does have words that begin with an
accented character. Do we need to worry? No! What will precede the
punctuation character? Most likely an ASCII whitespace character,
possibly NO-BREAK SPACE (NBSP). But what precedes NBSP? Probably
whitespace or ASCII punctuation. In ISO-8859-1, NBSP is 0xA0, which
falls in the UTF-8 trailing-byte range, so that can't be a UTF-8
character: no leading byte. How about the inverted punctuation? We
got lucky: they're both trailing bytes (0xA1 and 0xBF), so the same
argument holds. I believe the same analysis holds for French
guillemets, currency symbols, etc.
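
A quick way to check the punctuation argument, sketched in Python and
assuming ISO-8859-1 (Latin-1) specifically; other ISO-8859 parts place
some of these characters elsewhere. The helper names and the sample
sentence are made up for illustration. Python's own strict UTF-8
decoder stands in for "does this look like UTF-8".

```python
def is_utf8_trailing(byte: int) -> bool:
    """True if the byte has the UTF-8 trailing-byte pattern 10xxxxxx."""
    return 0x80 <= byte <= 0xBF


def decodes_as_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Punctuation and symbols discussed above, by Unicode code point.
PUNCTUATION = {
    "NO-BREAK SPACE": "\u00a0",
    "INVERTED EXCLAMATION MARK": "\u00a1",
    "INVERTED QUESTION MARK": "\u00bf",
    "LEFT GUILLEMET": "\u00ab",
    "RIGHT GUILLEMET": "\u00bb",
    "POUND SIGN": "\u00a3",
    "CURRENCY SIGN": "\u00a4",
    "YEN SIGN": "\u00a5",
}

# Each of these is a single byte in ISO-8859-1.
latin1_bytes = {name: ch.encode("latin-1")[0] for name, ch in PUNCTUATION.items()}

for name, byte in latin1_bytes.items():
    print(f"0x{byte:02X} trailing={is_utf8_trailing(byte)} {name}")

# A Spanish sentence that opens with inverted punctuation.
sample = "\u00bfD\u00f3nde est\u00e1 la \u00abplaza\u00bb?".encode("latin-1")
print("sample decodes as UTF-8:", decodes_as_utf8(sample))
```

Since the sample begins with 0xBF, a trailing byte with no leading
byte before it, it is rejected at the very first byte.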
So once you've got 3 non-ASCII characters encoded in ISO-8859 Latin,
the chances that some UTF-8 restriction isn't violated seem rather
slim. It's all very hand-wavy and heuristic, but seems pretty solid
to me.
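
As a rough sketch of how such a heuristic might be written down (this
is not how Mule's detection works; the function name guess_encoding,
the return labels, and the default threshold of 3 are assumptions taken
from the argument above): accept UTF-8 only when the bytes decode
strictly and contain at least three non-ASCII characters, and stay
undecided when there is too little evidence.

```python
def guess_encoding(data: bytes, min_non_ascii: int = 3) -> str:
    """Hypothetical three-way guess: 'utf-8', 'not-utf-8', or 'undecided'."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        # Some UTF-8 restriction is violated, so it is some other coding system.
        return "not-utf-8"
    # Count decoded characters outside ASCII; each one took 2+ bytes.
    non_ascii = sum(1 for ch in text if ord(ch) > 0x7F)
    if non_ascii >= min_non_ascii:
        return "utf-8"
    # Valid but with too few non-ASCII characters to be convincing.
    return "undecided"


german = "F\u00fc\u00dfe \u00fcber \u00d6sterreich"
print(guess_encoding(german.encode("latin-1")))
print(guess_encoding(german.encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("utf-8")))
```

The "undecided" answer matters because pure ASCII, or text with a
single accented character, is consistent with many coding systems.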
The only reason we even care about this, by the way, is that Mule
decoding is a lossy transformation and we don't provide any way for
the user to recover the original and try again.
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta(a)xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta