Hello,
As discussed this morning on xemacs-beta, here is a patch to support
Latin-1 encoded files that have several GR chars in a row.
I couldn't test whether this interferes with Shift-JIS and Big5
detection, as I don't have test files in these encodings.
Joachim
src/ChangeLog addition:
2006-02-27 Joachim Schrod <jschrod(a)acm.org>
* mule-coding.c (iso2022_detect): Handle Latin-1 encoded files
that have several high-byte chars in a row.
XEmacs 21.5 source patch:
Diff command: cvs -q -f diff -b -u
Files affected: src/mule-coding.c
Index: src/mule-coding.c
===================================================================
RCS file: /pack/xemacscvs/XEmacs/xemacs/src/mule-coding.c,v
retrieving revision 1.36
diff -b -u -r1.36 mule-coding.c
--- src/mule-coding.c 2005/11/22 07:19:32 1.36
+++ src/mule-coding.c 2006/02/27 21:30:53
@@ -2927,7 +2927,20 @@
}
else if (data->odd_high_byte_groups > 0 &&
data->even_high_byte_groups > 0)
+ {
+ /* This could be Latin-1 text in which most high-byte characters
+ occur singly and only occasionally two occur together.
+ This is common for Western European languages like German,
+ French, Danish, Swedish, etc.
+ In that case, either the file is rather small and
+ even_high_byte_groups is low,
+ or the file is larger and the ratio of odd to even groups
+ is very high. */
SET_DET_RESULTS (st, iso2022, DET_SOMEWHAT_UNLIKELY);
+ if (data->even_high_byte_groups <= 3 ||
+ data->odd_high_byte_groups >= 10 * data->even_high_byte_groups)
+ DET_RESULT (st, iso_8_1) = DET_SOMEWHAT_LIKELY;
+ }
else
SET_DET_RESULTS (st, iso2022, DET_AS_LIKELY_AS_UNLIKELY);
}
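
To make the heuristic in the hunk above concrete, here is a minimal standalone sketch in C. It is not part of the patch: `struct high_byte_stats`, `count_high_byte_groups` and `looks_like_latin_1` are hypothetical names, and the counting loop assumes that a "group" is a maximal run of bytes >= 0x80, which the hunk itself does not show. Only the final condition is taken from the patch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the two detector counters used in the patch. */
struct high_byte_stats
{
  int odd_high_byte_groups;
  int even_high_byte_groups;
};

/* Count maximal runs of bytes >= 0x80, split by odd or even run length.
   Assumption: this is what the two counters mean in the real detector. */
static struct high_byte_stats
count_high_byte_groups (const unsigned char *buf, size_t len)
{
  struct high_byte_stats s = { 0, 0 };
  size_t run = 0;
  size_t i;

  /* Iterate one step past the end so a trailing run is also counted. */
  for (i = 0; i <= len; i++)
    {
      if (i < len && buf[i] >= 0x80)
        run++;
      else if (run > 0)
        {
          if (run % 2)
            s.odd_high_byte_groups++;
          else
            s.even_high_byte_groups++;
          run = 0;
        }
    }
  return s;
}

/* Same condition as in the patch.  In the patch it is only evaluated
   when both counters are greater than zero. */
static int
looks_like_latin_1 (const struct high_byte_stats *s)
{
  return s->even_high_byte_groups <= 3
    || s->odd_high_byte_groups >= 10 * s->even_high_byte_groups;
}
```

With this sketch, a short German text containing one doubled high-byte pair passes through the first branch of the condition (few even groups), while a larger file passes only if odd groups outnumber even groups by at least ten to one.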
--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Joachim Schrod Email: jschrod(a)acm.org
Roedermark, Germany