Coding-detection failure on recent xemacs-beta (Mule)

Thursday, 5 June 2003

        Hello guys,

I think every recent XEmacs since version 21.5.6 or 21.5.7 has a
problem in the MULE coding-detection code: detection fails for some
multi-byte character encodings in the iso_8_2 category, e.g. EUC-JP.
Because of this problem, I have stayed on version 21.5.5 for quite a
long time.

Recently I found time to dig into the problem and worked out what
happens when coding detection fails.  I describe the problem below.
Please fix it.

*PROBLEM*
Coding detection fails to guess the coding system under some conditions.

*DESCRIPTION*
Since MULE coding detection is designed to find, as quickly as
possible, the coding system most appropriate for a given text in the
user's current environment, the logic that guesses the coding system
analyzes ONLY A SMALL FRAGMENT OF THE TARGET TEXT.  But the
implementation of the iso_8_2 detection logic is so strict that it
tolerates no irregularity at all, and it forgets the fact that it only
analyzes a small fragment of the text.

In coding systems of the iso_8_2 category, e.g. EUC-JP, texts usually
mix iso_8_2 (multi-byte) characters with ASCII (1-byte) characters.
Even pure Japanese text in EUC-JP usually includes 1-byte characters
such as LF (0x0a), and of course most texts we write include not only
control characters but also ASCII words such as 'XEmacs'.  Because of
this, the fragment of target text handed to the coding-detection logic
can be an incomplete string: the first byte and the last byte of the
fragment can each belong to an incomplete multi-byte character, and
the logic must take this into account.
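To illustrate the effect, here is a minimal standalone sketch.  It is
not the XEmacs detector: it only assumes that a "high byte group" is a
maximal run of bytes with the high bit set, and it borrows the field
names odd_high_byte_groups and even_high_byte_groups from the patch.
The struct and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical tally; the field names follow the ones in the patch. */
struct high_byte_tally
{
  int odd_high_byte_groups;
  int even_high_byte_groups;
};

/* Count maximal runs of bytes that have the high bit set, and classify
   each run by whether its length is odd or even.  Assumption: this is
   roughly what the detector's counters mean. */
static struct high_byte_tally
tally_high_byte_groups (const unsigned char *buf, size_t len)
{
  struct high_byte_tally t = { 0, 0 };
  size_t run = 0;
  size_t i;

  /* Iterate one step past the end so a run that reaches the end of
     the fragment is flushed as well. */
  for (i = 0; i <= len; i++)
    {
      if (i < len && (buf[i] & 0x80))
        run++;
      else if (run > 0)
        {
          if (run % 2)
            t.odd_high_byte_groups++;
          else
            t.even_high_byte_groups++;
          run = 0;
        }
    }
  return t;
}
```

Feeding it two EUC-JP characters, LF, and then one and a half
characters (the fragment ends in the middle of a 2-byte character)
yields one even group and one odd group, although the underlying text
is perfectly valid.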

*SOLUTION*
I made a very simple patch that relaxes the iso_8_2 detection
condition to allow for sampling error.  Of course there may be a more
complete and smarter way, but I have not dug into XEmacs deeply enough
to write such a solution.

--- src/mule-coding.c.orig	Mon Jan 13 17:46:44 2003
+++ src/mule-coding.c	Thu Jun  5 04:49:22 2003
@@ -2935,8 +2935,12 @@
       else
 	DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
     }
+#if 1
+  else if (data->even_high_byte_groups > 0)
+#else
   else if (data->odd_high_byte_groups == 0 &&
 	   data->even_high_byte_groups > 0)
+#endif
     {
 #if 0
       SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);

*For a more complete solution*
The logic that counts data->odd_high_byte_groups must take into
account the fact that 'a text fragment (of DEFINITELY VALID TEXT
represented in a coding system of the iso_8_2 category) can have
incomplete multi-byte characters at its first and last bytes'.
In particular, the current logic increments data->odd_high_byte_groups
even when the last byte of the TEXT FRAGMENT is really part of an
(incomplete) multi-byte character.
If (data->odd_high_byte_groups != 0), the current logic never guesses
that the text can be in the iso_8_2 category.
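
One way to express this boundary rule is sketched below.  It is only
an illustration under my own assumptions, not code from
mule-coding.c: the function name is hypothetical, and a "group" is
again taken to be a maximal run of bytes with the high bit set.  An
odd group that touches the first or last byte of the fragment is
treated as a truncated character rather than as evidence against
iso_8_2.

```c
#include <stddef.h>

/* Hypothetical boundary-aware count.  An odd-length run of high bytes
   is only counted as odd when it lies strictly inside the fragment.
   An odd run touching either edge is assumed to have lost one byte to
   truncation; whatever remains is counted as an even group. */
static void
count_groups_ignoring_edges (const unsigned char *buf, size_t len,
                             int *odd_groups, int *even_groups)
{
  size_t run = 0;
  size_t start = 0;
  size_t i;

  *odd_groups = 0;
  *even_groups = 0;

  for (i = 0; i <= len; i++)
    {
      if (i < len && (buf[i] & 0x80))
        {
          if (run == 0)
            start = i;          /* remember where this run began */
          run++;
        }
      else if (run > 0)
        {
          int at_edge = (start == 0 || i == len);

          if (run % 2 == 0)
            (*even_groups)++;
          else if (!at_edge)
            (*odd_groups)++;    /* a real irregularity inside the text */
          else if (run > 1)
            (*even_groups)++;   /* truncated run with whole pairs left */
          run = 0;
        }
    }
}
```

With this rule, a fragment that is cut in the middle of its last
2-byte character reports no odd groups, so the original
odd_high_byte_groups == 0 test could stay in place.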

Another way is to calculate the error rate of
data->odd_high_byte_groups for iso_8_2 and to ignore a small value of
data->odd_high_byte_groups as sampling error.
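
A rough sketch of such an error-rate test follows.  The function name
and the 5 percent threshold are my own placeholders, not values taken
from XEmacs; a real threshold would have to be chosen by experiment.

```c
/* Hypothetical tolerance: the share of odd groups, in percent of all
   high-byte groups, that is still accepted as sampling error. */
#define ODD_GROUP_TOLERANCE_PERCENT 5

/* Return 1 when the counts are still compatible with iso_8_2, i.e.
   there is at least one even group and the odd groups make up no more
   than the tolerated share of all groups.  Return 0 otherwise. */
static int
odd_groups_within_tolerance (int odd_groups, int even_groups)
{
  int total = odd_groups + even_groups;

  if (even_groups <= 0)
    return 0;
  return odd_groups * 100 <= total * ODD_GROUP_TOLERANCE_PERCENT;
}
```

The relaxed condition in the patch corresponds to an unlimited
tolerance; a check like this one sits between the original strict test
and the patched one.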

Anyway, I believe the description above may help the maintainer fix
the problem.

Thanks in advance,

--
Fuyuhiko MARUYAMA <fuyuhik8(a)is.titech.ac.jp>
Matsuoka laboratory,
Department of Mathematical and Computing Sciences,
Graduate School of Information Science and Engineering,
Tokyo Institute of Technology.
