Hi Alain,
I had the same problem and believe I found the correct cause and an
acceptable solution -- at least what I describe below has worked for
me for a long time.
I assume your file is in some encoding like ISO-8859-1? (If it is
UTF-8, I can't explain this behaviour.)
You may already have noticed that the problem goes away if the file
does not contain consecutive occurrences of characters outside ASCII,
that is, no words like "Europäësch"?
First the solution: In the file "src/mule-coding.c", near the end
of the function "iso2022_detect", replace the clause
    else if (data->odd_high_byte_groups > 0 &&
             data->even_high_byte_groups > 0)
      SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
with
    else if (data->odd_high_byte_groups > 0 &&
             data->even_high_byte_groups > 0)
      {
        SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
        if (data->odd_high_byte_groups >= 2 * data->even_high_byte_groups)
          DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
      }
and recompile XEmacs.
Second, the explanation: This part of XEmacs is responsible for a kind
of statistical recognition of file encodings. Among other things, it
looks at runs of consecutive bytes >= 0xA0 in the first few KB of the
file and counts how many of these runs have odd length and how many
have even length.
If the file is encoded in something like ISO-8859-1 most such runs will
be one byte long (e.g. "Compétitivitéit" contains two such runs), but
sometimes there will be some of length two, too, e.g. in "Europäësch".
If both occur, the clause above will trigger and erroneously reduce the
likelihood of encodings like ISO-8859-1 below the default level, leaving
Zh/Big5 as the most likely encoding.
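To make the counting concrete, here is a minimal sketch in C. It is
not the XEmacs code: the struct "run_counts" and the function
"count_high_byte_runs" are names made up for this illustration, and
the real detector keeps its counters in its own state while scanning.
The sketch only shows what "runs of odd and even length" means.

```c
#include <stddef.h>

/* Hypothetical illustration, not taken from src/mule-coding.c. */
struct run_counts
{
  int odd;   /* runs of bytes >= 0xA0 with odd length  */
  int even;  /* runs of bytes >= 0xA0 with even length */
};

static struct run_counts
count_high_byte_runs (const unsigned char *buf, size_t len)
{
  struct run_counts rc = { 0, 0 };
  size_t run = 0;  /* length of the run currently being scanned */
  size_t i;

  /* Iterate one step past the end so a run that reaches the end of
     the buffer is still closed and counted. */
  for (i = 0; i <= len; i++)
    {
      if (i < len && buf[i] >= 0xA0)
        run++;
      else if (run > 0)
        {
          if (run % 2)
            rc.odd++;
          else
            rc.even++;
          run = 0;
        }
    }
  return rc;
}
```

Fed the ISO-8859-1 bytes of "Compétitivitéit", this counts two odd
runs and no even run; for "Europäësch" it counts one even run and no
odd run. A file containing both words therefore has both counters
above zero, which is the situation the clause tests for.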
The changed version above simply checks that the relative frequency
of occurrence of runs of even and odd lengths is compatible with the
assumption that the language/encoding in question uses bytes >= 0xA0
mostly singly, but sometimes in pairs, and increases the likelihood of
encodings like ISO-8859-1 in this case.
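The effect of the changed clause on the ISO-8859-1 category alone can
be sketched as a small stand-alone function. This is a simplification
under assumptions: the enum and the name "latin1_likelihood" are
invented here, the other branches of "iso2022_detect" are not
modelled (they are represented by "unchanged"), and the factor 2 is
simply the threshold from the patch.

```c
/* Hypothetical stand-ins for the detector's likelihood levels. */
enum likelihood
{
  LIKELIHOOD_SOMEWHAT_UNLIKELY,
  LIKELIHOOD_UNCHANGED,        /* clause did not trigger */
  LIKELIHOOD_SOMEWHAT_LIKELY
};

static enum likelihood
latin1_likelihood (int odd_runs, int even_runs)
{
  if (odd_runs > 0 && even_runs > 0)
    {
      /* Mostly single high bytes with a few pairs mixed in: this
         looks like a single-byte encoding such as ISO-8859-1. */
      if (odd_runs >= 2 * even_runs)
        return LIKELIHOOD_SOMEWHAT_LIKELY;

      /* Otherwise keep the original behaviour of the clause. */
      return LIKELIHOOD_SOMEWHAT_UNLIKELY;
    }
  return LIKELIHOOD_UNCHANGED;
}
```

With two odd runs and one even run the result is "somewhat likely";
with one of each it stays "somewhat unlikely", as it was before the
change.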
Hope this helps.
Yours,
Lutz Euler