Stephen J. Turnbull wrote:
>>>>>"Ben" == Ben Wing <ben(a)666.com>
writes:
>>>>>
>>>>>
Ben> I am currently taking a class in statistical machine learning
Ben> and it occurs to me that there is a well-understood method
Ben> for doing autodetection robustly.
Yes.
But you're fighting the last war with modern weapons. Most of the
stuff you mention below is what Mule calls a "coding category", and
we can already do a pretty good job on that.
What we really want to do is _language_ recognition in Unicode and
unibyte text. I would suggest a structure that goes:
(1) Pick the best adequate match among the Unicode UTFs. Convert to
canonical form (i.e., read into an Emacs buffer).
(2) Detect the language (using a statistical classifier). If the
classifier's confidence is high enough, show the result to the user.
(3) If (1) or (2) fails, fall back to a different coding category.
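To make the shape of that concrete, here is a minimal Python sketch of
steps (1)-(3). Everything in it is illustrative -- the names, the
confidence threshold, and the stub fallback are assumptions, not
anything Mule actually provides:

    UTF_CANDIDATES = ["utf-8", "utf-16", "utf-32"]

    def fallback_detect(raw_bytes):
        # (3) Stand-in for the existing coding-category priority list
        # (iso-2022, shift_jis, big5, ...).
        return "binary", None

    def detect(raw_bytes, classifier, threshold=0.9):
        # (1) Pick the best adequate match among the Unicode UTFs:
        # strict decoding rejects byte sequences invalid for that UTF.
        for enc in UTF_CANDIDATES:
            try:
                text = raw_bytes.decode(enc)
            except UnicodeDecodeError:
                continue
            # (2) Detect the language with a statistical classifier.
            language, confidence = classifier(text)
            if confidence >= threshold:
                return enc, language
        return fallback_detect(raw_bytes)

    # Toy usage: a "classifier" that always answers English, confidently.
    print(detect("résumé".encode("utf-8"), lambda text: ("en", 1.0)))

The point of the confidence check in (2) is that nearly any byte string
decodes as _something_ under some charset; a classifier that refuses to
commit is what lets us fall back sanely in (3).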
Ben> [1] create a corpus of correctly classified examples --
Ben> basically, a whole bunch of pieces of text, hopefully as
Ben> varied as possible, along with the correct encoding (manually
Ben> determined);
This is the hard part, and will basically take years. Although we
don't have to worry about spammers deliberately changing the
signatures of their documents, we (I hope) will be getting new users:
there are already lots of Koreans and Chinese online, though we don't
have many of them yet, and South-East Asians, South Asians, and
Africans are coming online.
Also, there will be new formats invented or handled in Emacs (e.g.,
XML applications, tar files, compression, encryption) that we would
like to recognize automatically. So we SHOULD think in terms of a
continuing process of developing the corpus and reoptimizing.
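The file-format half of that is the easier part, since most such
formats announce themselves with magic bytes near the start. A small
sketch -- the signature table is illustrative, nowhere near exhaustive:

    SIGNATURES = [
        (b"\x1f\x8b", "gzip"),
        (b"BZh", "bzip2"),
        (b"%PDF", "pdf"),
        (b"<?xml", "xml"),
    ]

    def sniff_format(raw_bytes):
        for magic, name in SIGNATURES:
            if raw_bytes.startswith(magic):
                return name
        # tar has no leading magic; "ustar" sits at offset 257.
        if raw_bytes[257:262] == b"ustar":
            return "tar"
        return None

Encrypted data without a header is the hard case: good ciphertext is
statistically flat, so it gets recognized mostly by _failing_ to match
anything else.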
Well, collecting a decent-sized corpus isn't too hard. Just get all the
computer users you know who work with non-ASCII documents to start
collecting their documents. In this case, figuring out the coding by
hand isn't too difficult, and we can get a computer to make a first
guess (that's usually how big corpora of parsed English sentences are
created, though there the task is much harder, since you're not just
putting something into a single category but creating a tree that
covers every word in the sentence).
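For the computer's first guess, something as simple as this would do.
It leans on the chardet library purely as an example of an existing
detector -- my choice for illustration, not something we'd ship:

    import chardet

    def first_guess(raw_bytes):
        # Machine-generated label for a human annotator to confirm or
        # correct -- the same bootstrapping used for parsed corpora.
        result = chardet.detect(raw_bytes)
        return result["encoding"], result["confidence"]

Even a mediocre first guess saves most of the hand labor, because the
annotator only has to veto mistakes rather than classify from scratch.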
I suggest we set up a place where people can email documents -- esp.
documents that XEmacs doesn't recognize properly -- and indicate what
the encoding is.
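Once submissions accumulate, even a very simple model trained on them
becomes useful. Here's a sketch of a byte-bigram naive Bayes detector
over such a labeled corpus (all the names are mine, purely
illustrative):

    from collections import Counter, defaultdict
    import math

    def train(corpus):
        # corpus: a list of (raw_bytes, encoding_label) pairs as
        # submitted, with the label supplied by the sender.
        counts = defaultdict(Counter)
        for raw, label in corpus:
            counts[label].update(raw[i:i + 2] for i in range(len(raw) - 1))
        return counts

    def classify(counts, raw):
        # Score each label by the smoothed log-likelihood of the
        # document's byte bigrams; return the best-scoring label.
        best, best_score = None, float("-inf")
        for label, bigrams in counts.items():
            total = sum(bigrams.values())
            vocab = len(bigrams) + 1
            score = sum(
                math.log((bigrams[raw[i:i + 2]] + 1) / (total + vocab))
                for i in range(len(raw) - 1)
            )
            if score > best_score:
                best, best_score = label, score
        return best

Byte bigrams are crude, but for separating, say, EUC-JP from Shift-JIS
from Big5 they already carry a lot of signal, and retraining as new
documents arrive gives exactly the continuing reoptimization Stephen
asks for.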
ben