>>>> "sjt" == Stephen J Turnbull
>>>> "Joachim" == Joachim Schrod
Joachim> Lutz posted a three-line change to
Joachim> more than twice the amount of odd runs appear than even
Joachim> runs, coding category iso_8_1 is set to
Joachim> `somewhat-likely'. See
Joachim> This change works and makes auto-detection work for all
Joachim> German files that I tried.
sjt> It will also probably break auto-detection for Shift JIS and
I have no files with any of those encodings to test. Can you (or
anybody else) please send me a few by PM?
But just from looking at the code, Big5 will surely be a problem.
I wouldn't characterize that as `breaking', merely as `making a
conflict explicit that is already there'.
There are actually two cases. (Below I use the term `GR octet' to mean
an octet in the range 0xA0..0xFE.)
1) There are only single GR octets in the file (no runs), and these
single octets are not at the end of words.
This is representative of many West-European languages (e.g.,
French, German, Danish), and is also a valid BIG5 encoding.
Autodetection represents that conflict. Both big5 and iso_8_1 are
rated as somewhat-likely, and the coding-priority list decides
which one is taken.
2) There are mainly single GR octets in the file, those single octets
are not at the end of words, and there are some two-byte runs of GR octets.
This is still representative of the European languages named above.
This is _as well_ a valid Big5 encoding, just like the case above.
Even though this is the same conflict situation as case 1,
autodetect suddenly decides differently here. It rates big5 as
somewhat-likely (which is correct) and it rates iso_8_1 as
somewhat-unlikely (which is definitely *not* correct, as this
*is* a likely iso_8_1 encoding).
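To illustrate why such text also passes as Big5, here is a minimal
Python sketch of a purely structural check. The byte ranges (lead byte
0xA1..0xFE, trail byte 0x40..0x7E or 0xA1..0xFE) are my assumption about
the Big5 layout, and `looks_like_big5' is a hypothetical helper, not the
detector's actual code.

```python
def looks_like_big5(data: bytes) -> bool:
    """Structural check only: every high octet must start a valid pair.

    Assumed ranges: lead 0xA1..0xFE, trail 0x40..0x7E or 0xA1..0xFE.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Plain ASCII octet, stands alone.
            i += 1
            continue
        # A high octet must be a lead byte with a trail byte after it.
        if not (0xA1 <= lead <= 0xFE) or i + 1 >= len(data):
            return False
        trail = data[i + 1]
        if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


# Case 1: single GR octet followed by a letter (valid trail byte).
print(looks_like_big5("für".encode("latin-1")))
# Case 2: a two-byte run of GR octets inside a word.
print(looks_like_big5("Grüße".encode("latin-1")))
# GR octet at the end of a word: the following space is not a trail byte.
print(looks_like_big5("café au lait".encode("latin-1")))
```

Under these assumptions the first two inputs are accepted and the third
is rejected, which is why the condition `not at the end of words'
matters in both cases.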
My argument boils down to the point that these two cases represent the
same situation, but are handled differently. It might be that the
decision between iso_8_1 and big5 should be made by the
coding-priority list, and not by autodetection.
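A rough Python sketch of the run counting may make this concrete. It
assumes my reading of Lutz's rule (iso_8_1 becomes somewhat-likely when
odd runs outnumber even runs by more than two to one); the function
names are made up for illustration, and the fallback rating is simply
what the current code gives in case 2.

```python
from itertools import groupby


def is_gr(octet: int) -> bool:
    # `GR octet' as defined above: 0xA0..0xFE.
    return 0xA0 <= octet <= 0xFE


def gr_run_lengths(data: bytes) -> list:
    """Lengths of all maximal runs of consecutive GR octets."""
    return [len(list(group)) for flag, group in groupby(data, key=is_gr) if flag]


def rate_iso_8_1(data: bytes) -> str:
    """Hypothetical rating, following my reading of Lutz's rule."""
    runs = gr_run_lengths(data)
    odd = sum(1 for n in runs if n % 2 == 1)
    even = len(runs) - odd
    if odd > 2 * even:
        return "somewhat-likely"
    return "somewhat-unlikely"


case1 = "für schöne Bäume".encode("latin-1")        # single GR octets only
case2 = "Grüße für schöne Bäume".encode("latin-1")  # plus one two-byte run

print(gr_run_lengths(case1), rate_iso_8_1(case1))
print(gr_run_lengths(case2), rate_iso_8_1(case2))
```

With this rule both cases get the same iso_8_1 rating, so the choice
between iso_8_1 and big5 is left to the coding-priority list in either
case.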
OTOH, I don't know enough about the statistical distribution of big5
characters in typical BIG5-encoded texts; that might make my argument
moot. Maybe a high number of even runs of GR octets should cause a
quite-probable rating for BIG5, and not just a somewhat-likely?!
sjt> On consideration, I don't see how to test that issue without, well,
sjt> testing it.
OK, let's do so. I'll see if I can improve the heuristic from Lutz (which
is quite rough currently) and will post a patch against 21.5.25 later
Joachim Schrod Email: jschrod(a)acm.org