Re: [Bug: 21.5-b24] Problems with coding systems autodetect

Monday, 27 February 2006

        ...
>>>> "sjt" == Stephen J Turnbull <stephen(a)xemacs.org> writes:
...
>>>> "Joachim" == Joachim Schrod <jschrod(a)acm.org> writes:

Joachim> Lutz posted a three-line change to mule-coding.c; if
Joachim> more than twice the amount of odd runs appear than even
Joachim> runs, coding category iso_8_1 is set to
Joachim> `somewhat-likely'. See

Joachim> http://list-archive.xemacs.org/xemacs-beta/200601/msg00083.html

Joachim> This change works and makes auto-detection work for all
Joachim> German files that I tried.

sjt> It will also probably break auto-detection for Shift JIS and
sjt> Big5.

I have no files with any of those encodings to test. Can you (or
anybody else) please send me a few by PM?

But just from looking at the code, Big5 will surely be a problem.
I wouldn't characterize that as `breaking', merely as `making a
conflict explicit that is already there'.

There are actually two cases. (Below I use the term `GR octet' to mean
an octet in the range 0xA0..0xFE.)

1) There are only single GR octets in the file (no runs), and these
   single octets are not at the end of words.

   This is representative of many West European languages (e.g.,
   French, German, Danish), and is also a valid Big5 encoding.

   Autodetection represents that conflict. Both big5 and iso_8_1 are
   rated as somewhat-likely, and the coding-priority list decides
   which one is taken.

2) There are mainly single GR octets in the file, those single octets
   are not at the end of words, and there are some two-byte runs of GR
   octets.

   This is still representative of the European languages named above.
   It is _also_ a valid Big5 encoding, just like the case above.

   Even though this is the same conflict situation as case 1,
   autodetect suddenly decides differently here. It rates big5 as
   somewhat-likely (which is correct) and it rates iso_8_1 as
   somewhat-unlikely (which is definitely *not* correct, as this
   *is* a likely iso_8_1 encoding).
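To make the two cases concrete, here is a rough sketch that lists the
lengths of all GR-octet runs in a byte string. It is only an
illustration in Python, not the code in mule-coding.c; the function
name and the sample sentences are made up for this example.

```python
import re

# "GR octet" as defined above: an octet in the range 0xA0..0xFE.
GR_RUN = re.compile(rb"[\xa0-\xfe]+")

def gr_run_lengths(data):
    """Return the length of every maximal run of GR octets in data."""
    return [len(m.group()) for m in GR_RUN.finditer(data)]

# Case 1: only isolated GR octets, none of them at the end of a word.
case1 = "für schöne Bücher".encode("latin-1")

# Case 2: mostly isolated GR octets plus one two-octet run ("öß").
case2 = "für größere Bücher".encode("latin-1")

print(gr_run_lengths(case1))
print(gr_run_lengths(case2))
```

The first sample yields only runs of length 1; the second differs only
by a single run of length 2, which is all it takes to move from case 1
to case 2.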

My argument boils down to the point that these two cases represent the
same situation, but are handled differently. It might be that the
decision between iso_8_1 and big5 should be made by the
coding-priority list, and not by autodetection.
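As a rough sketch of what I mean, assuming the rule from Lutz's change
as quoted at the top (more than twice as many odd runs as even runs)
and assuming that ties between equally rated categories go to the
coding-priority list: the Python below is an illustration only, all
function names are made up, and the big5 rating is hard-coded from the
description of case 2 rather than computed.

```python
from itertools import groupby

# Only the two ratings named in this thread; higher number is better.
RANK = {"somewhat-unlikely": 0, "somewhat-likely": 1}

def odd_even_gr_runs(data):
    """Count maximal runs of GR octets (0xA0..0xFE) of odd and even length."""
    odd = even = 0
    for is_gr, run in groupby(data, key=lambda o: 0xA0 <= o <= 0xFE):
        if is_gr:
            if sum(1 for _ in run) % 2:
                odd += 1
            else:
                even += 1
    return odd, even

def choose(ratings, priority):
    """Best rating wins; ties go to whichever comes first in priority."""
    return max(priority, key=lambda c: (RANK[ratings[c]], -priority.index(c)))

text = "für größere Bücher über Käse".encode("latin-1")
odd, even = odd_even_gr_runs(text)

# Ratings as described for case 2 (hard-coded, not computed).
ratings = {"big5": "somewhat-likely", "iso_8_1": "somewhat-unlikely"}
if odd > 2 * even:  # the rule from the quoted change
    ratings["iso_8_1"] = "somewhat-likely"

print(choose(ratings, ["iso_8_1", "big5"]))
```

With the rule applied, both categories end up at the same rating and
the order of the priority list alone decides; without it, big5 wins
regardless of the list.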

OTOH, I don't know enough about the statistical distribution of Big5
characters in typical Big5-encoded texts, and that might make my
argument moot. Maybe a high number of even runs of GR octets should
cause a quite-probable rating for Big5, and not just a somewhat-likely?!

Anyhow,

sjt> On consideration, I don't see how to test that issue without, well,
sjt> testing it.

OK, let's do so. I'll see if I can improve Lutz's heuristic (which
is quite rough currently) and will post a patch against 21.5.25 later
today.

	Joachim

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod				Email: jschrod(a)acm.org
Roedermark, Germany
