Hello guys,
I think all recent XEmacs versions, since 21.5.6 or 21.5.7, have a
problem in the MULE coding-detection code: detection fails for some
kinds of multi-byte character encodings in the iso_8_2 category,
e.g. EUC-JP. Because of this problem, I have been using version 21.5.5
for quite a long time.
Recently I had time to dig into the problem and found what happens
when coding-detection fails. Here I try to describe the problem. Please
fix it.
*PROBLEM*
Coding-detection fails to guess the coding-system under some conditions.
*DESCRIPTION*
Since MULE coding-detection is designed to find, as quickly as
possible, which coding-system is most appropriate for a given text in
the user's current environment, the logic that guesses the
coding-system is implemented to analyze ONLY A SMALL FRAGMENT OF THE
TARGET TEXT. But the implementation of the iso_8_2 detection logic
seems to be so strict that it tolerates no mis-detection at all, and
it forgets the fact that it analyzes only a small fragment of the
text. In coding-systems of the iso_8_2 category, e.g. EUC-JP, texts
usually include both iso_8_2 (multi-byte) characters and ASCII
(1-byte) characters. Even pure Japanese text in EUC-JP usually
includes 1-byte characters such as LF (0x0a), and of course most texts
we write include not only control characters but also ASCII words
such as 'XEmacs'. Because of this, the fragment of the target text
given to the coding-detection logic can be an incomplete string,
i.e. the first byte and the last byte of the fragment can each belong
to an incomplete multi-byte character, and the logic must take this
fact into account.
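To illustrate the point, here is a minimal sketch of how counting runs
of high bytes goes wrong on a truncated fragment. It is NOT the actual
code in mule-coding.c: count_high_byte_groups() and struct
high_byte_stats are hypothetical names, only the two counter names are
borrowed from the real code, and I assume that any byte with the high
bit set belongs to a 2-byte character (single shifts are ignored).

```c
#include <stddef.h>

/* Hypothetical, simplified counters. The field names mirror those used
   in mule-coding.c, but this is not the XEmacs implementation. */
struct high_byte_stats {
    int odd_high_byte_groups;   /* runs of high bytes with odd length  */
    int even_high_byte_groups;  /* runs of high bytes with even length */
};

/* Count maximal runs of high bytes (bit 7 set) in buf[0..len).
   Assumption: every high byte is half of a 2-byte character. */
static struct high_byte_stats
count_high_byte_groups(const unsigned char *buf, size_t len)
{
    struct high_byte_stats s = {0, 0};
    size_t run = 0;

    /* Iterate one step past the end so the final run is closed too. */
    for (size_t i = 0; i <= len; i++) {
        int high = (i < len) && (buf[i] & 0x80);
        if (high) {
            run++;
        } else if (run > 0) {
            if (run & 1)
                s.odd_high_byte_groups++;
            else
                s.even_high_byte_groups++;
            run = 0;
        }
    }
    return s;
}
```

For the valid EUC-JP bytes C6 FC CB DC 0A B8 EC 0A (three 2-byte
characters and two LFs), the whole text gives no odd group. If the
fragment starts at the second byte, or stops after the sixth byte, one
odd group appears although the text itself is perfectly valid.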
*SOLUTION*
I made a very simple patch that relaxes the condition of the iso_8_2
detection to allow for sampling error. Of course I know there may be a
more complete and smarter way, but I have not dug into XEmacs deeply
enough to write such a smart solution.
--- src/mule-coding.c.orig Mon Jan 13 17:46:44 2003
+++ src/mule-coding.c Thu Jun 5 04:49:22 2003
@@ -2935,8 +2935,12 @@
else
DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
}
+#if 1
+ else if (data->even_high_byte_groups > 0)
+#else
else if (data->odd_high_byte_groups == 0 &&
data->even_high_byte_groups > 0)
+#endif
{
#if 0
SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
*FOR A MORE COMPLETE SOLUTION*
The logic that counts data->odd_high_byte_groups must take into
account the fact that 'a text fragment (of DEFINITELY VALID TEXT
represented in a coding-system of the iso_8_2 category) can have
incomplete multi-byte characters at its first and last bytes'.
In particular, the current logic increments
data->odd_high_byte_groups even if the last byte of the TEXT FRAGMENT
is *in fact* a part of an (incomplete) multi-byte character.
If (data->odd_high_byte_groups != 0), the current logic never guesses
that the text can be in the iso_8_2 category.
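As a rough sketch of what I mean (again NOT the real mule-coding.c
code; count_groups_boundary_aware() and struct boundary_stats are
hypothetical names, and I assume every high byte is half of a 2-byte
character), the counting could forgive one stray byte in a run that
touches the first or the last byte of the fragment:

```c
#include <stddef.h>

/* Hypothetical counters; only the field names are borrowed from
   mule-coding.c. */
struct boundary_stats {
    int odd_high_byte_groups;
    int even_high_byte_groups;
};

/* Count runs of high bytes in buf[0..len), but treat an odd run that
   touches either edge of the fragment as a character split by the
   sampling, not as evidence against iso_8_2. */
static struct boundary_stats
count_groups_boundary_aware(const unsigned char *buf, size_t len)
{
    struct boundary_stats s = {0, 0};
    size_t run = 0;    /* length of the current run of high bytes */
    size_t start = 0;  /* index where the current run began       */

    for (size_t i = 0; i <= len; i++) {
        int high = (i < len) && (buf[i] & 0x80);
        if (high) {
            if (run == 0)
                start = i;
            run++;
            continue;
        }
        if (run == 0)
            continue;

        /* The run covers buf[start..i). */
        int touches_edge = (start == 0) || (i == len);
        if ((run & 1) && touches_edge)
            run--;  /* drop the half character cut off by the fragment */

        if (run > 0) {
            if (run & 1)
                s.odd_high_byte_groups++;
            else
                s.even_high_byte_groups++;
        }
        run = 0;
    }
    return s;
}
```

With this variant, an odd run in the middle of the fragment is still
counted as odd, so text that really is not in the iso_8_2 category
keeps producing odd groups. Only the two edges are forgiven.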
Another way would be to calculate the error rate of
data->odd_high_byte_groups for iso_8_2 and to ignore a small value of
data->odd_high_byte_groups as sampling error.
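A minimal sketch of such an error-rate check could look like the
following. The function iso_8_2_plausible() and its max_odd_ratio
parameter are hypothetical, and I have not measured what threshold
would be a good value:

```c
/* Hypothetical helper, not part of XEmacs: decide whether the counted
   groups are still compatible with the iso_8_2 category when a small
   share of odd groups is tolerated as sampling error.
   max_odd_ratio is an assumed tuning knob, e.g. 0.05 for 5%. */
static int
iso_8_2_plausible(int odd_groups, int even_groups, double max_odd_ratio)
{
    int total = odd_groups + even_groups;

    /* Without any even group there is no positive evidence at all. */
    if (even_groups <= 0)
        return 0;

    /* Accept when the odd groups are at most the tolerated share. */
    return (double) odd_groups <= max_odd_ratio * (double) total;
}
```

Note that with a small ratio, a very short fragment containing a
single split character is still rejected, so this approach works only
when the fragment contains enough groups.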
Anyway, I believe the above description may help the maintainers to
fix the problem.
Thanks in advance,
--
Fuyuhiko MARUYAMA <fuyuhik8(a)is.titech.ac.jp>
Matsuoka laboratory,
Department of Mathematical and Computing Sciences,
Graduate School of Information Science and Engineering,
Tokyo Institute of Technology.