Resend: [PATCH] Autodetection of Latin-1 text with high-byte chars in a row

Tuesday, 28 February 2006

[Sorry, the code indentation got garbled. This is a resend.]

Hello,

As discussed this morning on xemacs-beta, here is a patch to support
Latin-1 encoded files that have several GR chars in a row.

I couldn't test interference with Shift-JIS and Big5 detection, as I
don't have test files in these encodings.

	Joachim

src/ChangeLog addition:

2006-02-27  Joachim Schrod  <jschrod(a)acm.org>

	* mule-coding.c (iso2022_detect): Handle Latin-1 encoded files
	that have several high-byte chars in a row.

XEmacs 21.5 source patch:
Diff command:   cvs -q -f diff -u
Files affected: src/mule-coding.c

Index: src/mule-coding.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/mule-coding.c,v
retrieving revision 1.36
diff -u -r1.36 mule-coding.c
--- src/mule-coding.c   2005/11/22 07:19:32     1.36
+++ src/mule-coding.c   2006/02/28 00:49:58
@@ -2927,7 +2927,20 @@
     }
   else if (data->odd_high_byte_groups > 0 &&
           data->even_high_byte_groups > 0)
-    SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
+    {
+      /* Well, this could be a Latin-1 text, with most high-byte
+        characters single, but sometimes two are together, though
+        this happens not as often. This is common for Western
+        European languages like German, French, Danish, Swedish, etc.
+        Then we would either have a rather small file and
+        even_high_byte_groups would be low.
+        Or we would have a larger file and the ratio of odd to even
+        groups would be very high. */
+      SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
+      if (data->even_high_byte_groups <= 3 ||
+         data->odd_high_byte_groups >= 10 * data->even_high_byte_groups)
+       DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
+    }
   else
     SET_DET_RESULTS (st, iso2022, DET_AS_LIKELY_AS_UNLIKELY);
 }      
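
For readers who want to try the heuristic outside of XEmacs, here is a
minimal standalone sketch. It is not part of the patch. The struct,
count_high_byte_groups() and looks_like_latin1() are hypothetical names
made up for illustration; only the two counter names and the thresholds
(3 and 10) are taken from the patch. The sketch assumes a "group" is a
maximal run of bytes >= 0xA0, which simplifies what iso2022_detect
really tracks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container; the field names mirror the counters
   used in the patch. */
struct high_byte_stats {
  int odd_high_byte_groups;   /* runs of GR bytes with odd length */
  int even_high_byte_groups;  /* runs of GR bytes with even length */
};

/* Count maximal runs of bytes >= 0xA0, split by the parity of the run
   length.  Assumption: this approximates the bookkeeping done by the
   real detector; it is not copied from mule-coding.c. */
static void
count_high_byte_groups (const unsigned char *buf, size_t len,
                        struct high_byte_stats *s)
{
  size_t run = 0;
  size_t i;

  assert (buf != NULL || len == 0);
  s->odd_high_byte_groups = 0;
  s->even_high_byte_groups = 0;

  /* Iterate one step past the end so that a trailing run is flushed. */
  for (i = 0; i <= len; i++)
    {
      int high = (i < len) && buf[i] >= 0xA0;

      if (high)
        run++;
      else if (run > 0)
        {
          if (run % 2)
            s->odd_high_byte_groups++;
          else
            s->even_high_byte_groups++;
          run = 0;
        }
    }
}

/* The condition added by the patch: both kinds of groups occur, and
   either there are very few even groups (small file) or odd groups
   outnumber even groups at least ten to one (larger file). */
static int
looks_like_latin1 (const struct high_byte_stats *s)
{
  return s->odd_high_byte_groups > 0
    && s->even_high_byte_groups > 0
    && (s->even_high_byte_groups <= 3
        || s->odd_high_byte_groups >= 10 * s->even_high_byte_groups);
}
```

Feeding it a short German phrase in Latin-1 with one double and one
single high-byte character should give one even and one odd group, so
the Latin-1 branch is taken. Text consisting only of two-byte pairs
gives no odd groups and never reaches this branch.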

-- 
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod				Email: jschrod(a)acm.org
Roedermark, Germany
