Re: Mule autodetection is crap

Thursday, 2 January 2003

        ...
>>>> "Alexey" == Alexey Mahotkin <alexm(a)hsys.msk.ru> writes:
    Alexey> I can say that there is an extremely good scheme for
    Alexey> statistical detection of various Russian (really Russian,
    Alexey> not Cyrillic) encodings, done by S. V. Znamensky.  I tried
    Alexey> it, and it works really wonderfully, allowing even
    Alexey> "twice-encoded" text which is seen occasionally.

That would be a very nice example from my point of view, even though
it is limited to Russian.  If it also happened to be able to reject
(say) KOI8 Ukrainian and ISO 8859-7 Greek, that would be a wonderful
showcase for the feature.
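
To make the idea concrete, here is a minimal Python sketch of
frequency-based detection.  It is not Znamensky's scheme (whose details
are not given here); the candidate list, the set of "common" letters,
the threshold, and the function names are all assumptions chosen for
illustration.

```python
# Hypothetical sketch: guess which 8-bit encoding makes the bytes look
# most like Russian.  Candidates, letter set, and threshold are assumed.

CANDIDATES = ["koi8_r", "cp1251", "iso8859_5", "cp866"]

# Roughly the most frequent lowercase Russian letters (an assumption).
COMMON = set("оеаинтсрвл")


def score(data: bytes, encoding: str) -> float:
    """Fraction of non-ASCII characters that are common Russian letters."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError:
        return 0.0
    high = [ch for ch in text if ord(ch) > 127]
    if not high:
        return 0.0  # pure ASCII: nothing to judge
    return sum(1 for ch in high if ch in COMMON) / len(high)


def guess_russian_encoding(data: bytes, threshold: float = 0.4):
    """Return the best-scoring candidate, or None if nothing looks Russian."""
    best = max(CANDIDATES, key=lambda enc: score(data, enc))
    return best if score(data, best) >= threshold else None
```

A score below the threshold yields None, which is how such a detector
could decline Greek or plain ASCII input.  Telling Russian apart from
KOI8 Ukrainian would presumably need finer statistics than this.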

    Alexey> I thought of adding something like this to XEmacs.  Now if
    Alexey> there is a common infrastructure for this, I'd be glad to
    Alexey> help in that area.

Well, AFAIK Ben is monoscriptal in ISO-8859-1 for practical purposes.
So I don't know if the current infrastructure will necessarily support
existing statistical detectors.  But I'll take a close look and try to
come up with some docs.  I'm pretty sure Ben is interested enough to
be responsive to requests for enhancement of the mechanism.

    Alexey> I'm now playing with current XEmacs-beta.  It recognizes
    Alexey> my ~/.xemacs/init.el as UTF-16,

This is probably a priority bug.

    Alexey> and does not let me change the encoding with "C-x RET f
    Alexey> koi8-r RET"

This is probably not.

    Alexey> (but "C-x RET c koi8-r C-x C-f" works).  The file itself
    Alexey> is mostly ASCII, with two strings in Russian inside (near
    Alexey> the end of the file).

    Alexey> Are you interested in such bug reports, and if yes, should
    Alexey> I send the file or what?  Other files it at least detects
    Alexey> as "Raw".

Most definitely.  Especially from Cyrillic and Japanese users, who are
the roughest tests on autodetection (except for maybe Buddhist
scholars) because of the multiple encodings in daily use, plus the
need to handle ASCII, ISO-8859-1, and ISO-8859-15 for programming etc.

The report alone is probably enough for priority bugs.  However, if
you have a file you can send, that would be very nice.  As usual, the
shorter the better.  The very best would be a test library in
test-harness.el format (see tests/automated/test-harness.el and the
"Regression Testing" node in the Internals Info manual).

    Alexey> set-language-environment Cyrillic-KOI8 does not help at
    Alexey> all.

This doesn't entirely surprise me as I know Ben started a synch of the
language environment stuff to GNU 21.x, but in the process broke stuff
for Japanese at least.  I wouldn't be surprised if something similar
happened in Cyrillic.

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
