Re: [Bug: 21.5-b18] certain Text Files appear garbled

Monday, 9 January 2006

        Hi Alain,

I had the same problem and believe I have found the cause and an
acceptable solution -- at least, what I describe below has worked for
me for a long time.

I assume your file is in some encoding like ISO-8859-1? (If it is
UTF-8, I can't explain this behaviour.)

You may already have noticed that the problem goes away if the file
does not contain consecutive occurrences of characters outside ASCII,
that is, no words like "Europäësch"?

First the solution: In the file "src/mule-coding.c", near the end
of the function "iso2022_detect", replace the clause

  else if (data->odd_high_byte_groups > 0 &&
	   data->even_high_byte_groups > 0)
    SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);

with

  else if (data->odd_high_byte_groups > 0 &&
	   data->even_high_byte_groups > 0)
    {
      SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
      if (data->odd_high_byte_groups >= 2 * data->even_high_byte_groups)
        DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
    }

and recompile XEmacs.

Second, the explanation: This part of XEmacs performs a kind of
statistical recognition of file encodings. Among other things, it looks
at runs of consecutive bytes >= 0xA0 in the first few KB of the file and
counts separately how many of these runs have even and odd length.

If the file is encoded in something like ISO-8859-1, most such runs will
be one byte long (e.g. "Compétitivitéit" contains two such runs), but
occasionally there will be some of length two as well, e.g. in
"Europäësch". If both kinds occur, the original clause triggers and
erroneously reduces the likelihood of encodings like ISO-8859-1 below
the default level, leaving Zh/Big5 as the most likely encoding.
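
To make the counting concrete, here is a minimal standalone sketch of
how such runs could be tallied. It is not the XEmacs code: the struct
and the function count_high_byte_runs are hypothetical, and only the two
field names are borrowed from the clause quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tally, not the XEmacs detection struct; only the two
   field names are taken from the clause quoted in this mail. */
struct high_byte_runs
{
  int odd_high_byte_groups;   /* runs of length 1, 3, 5, ... */
  int even_high_byte_groups;  /* runs of length 2, 4, 6, ... */
};

/* Count maximal runs of consecutive bytes >= 0xA0 in buf[0..len-1],
   separately for odd and even run lengths. */
static struct high_byte_runs
count_high_byte_runs (const unsigned char *buf, size_t len)
{
  struct high_byte_runs r = { 0, 0 };
  size_t run = 0;   /* length of the run currently being scanned */
  size_t i;

  assert (buf != NULL || len == 0);

  /* Go one step past the end so that a run touching the end of the
     buffer is closed as well. */
  for (i = 0; i <= len; i++)
    {
      if (i < len && buf[i] >= 0xA0)
        run++;
      else if (run > 0)
        {
          if (run % 2)
            r.odd_high_byte_groups++;
          else
            r.even_high_byte_groups++;
          run = 0;
        }
    }
  return r;
}
```

Fed the ISO-8859-1 bytes of "Compétitivitéit", this reports two odd
runs and no even one; "Europäësch" gives a single even run. A file
containing both words therefore has both counters above zero, which is
exactly the situation the original clause penalizes.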

The changed version above simply checks that the relative frequency
of occurrence of runs of even and odd lengths is compatible with the
assumption that the language/encoding in question uses bytes >= 0xA0
mostly singly, but sometimes in pairs, and increases the likelihood of
encodings like ISO-8859-1 in this case.
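
Stripped of the XEmacs macros, the decision in the changed clause can be
sketched as below. The enum and the function iso_8_1_likelihood are
hypothetical stand-ins for illustration; the real code uses the DET_*
levels and works on the detection state "st", and the other clauses of
iso2022_detect are not modelled here.

```c
#include <assert.h>

/* Hypothetical stand-ins for the detector's likelihood levels. */
enum likelihood
{
  NOT_DECIDED_HERE,    /* some other clause of the detector applies */
  SOMEWHAT_UNLIKELY,
  SOMEWHAT_LIKELY
};

/* Likelihood assigned to ISO-8859-1-like encodings by the changed
   clause, given the counts of odd- and even-length runs. */
static enum likelihood
iso_8_1_likelihood (int odd_groups, int even_groups)
{
  assert (odd_groups >= 0 && even_groups >= 0);

  if (odd_groups > 0 && even_groups > 0)
    {
      /* High bytes occur mostly singly, sometimes in pairs: this
         looks like a single-byte encoding, so raise it again. */
      if (odd_groups >= 2 * even_groups)
        return SOMEWHAT_LIKELY;
      /* Otherwise keep the penalty that the original clause applies
         to all ISO 2022 style encodings. */
      return SOMEWHAT_UNLIKELY;
    }
  return NOT_DECIDED_HERE;
}
```

With, say, ten odd runs and one even run the result is SOMEWHAT_LIKELY,
while one odd and one even run still yields SOMEWHAT_UNLIKELY. The
factor of two is simply the threshold I chose; it is a heuristic, not
something derived from the encodings themselves.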

Hope this helps

Yours

Lutz Euler
