>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
Ben> I am currently taking a class in statistical machine learning
Ben> and it occurs to me that there is a well-understood method
Ben> for doing autodetection robustly.
Yes.
But you're fighting the last war with modern weapons. Most of the
stuff that you mention below is what Mule calls "coding category", and
we can already do a pretty good job on that.
What we really want to do is _language_ recognition in Unicode and
unibyte. I would suggest a structure that goes:
(1) Pick the best adequate match among Unicode UTFs. Convert to
canonical (AKA read into an Emacs buffer).
(2) Detect language (using a statistical classifier; see the sketch
after this list). If the classifier's confidence is significantly
high, show the user the result.
(3) If (1) or (2) failed, fall back to a different category.
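
Very roughly, and just to make the shape concrete, something like the
following throwaway Python sketch -- not a proposal for the actual
Mule implementation; the UTF list, the character-bigram features, and
the confidence margin are all assumptions:

    import math
    from collections import Counter

    UTF_CANDIDATES = ["utf-8", "utf-16", "utf-32"]   # step (1) candidates

    def decode_to_canonical(raw):
        """Return (text, utf_name) for the first UTF that decodes cleanly."""
        for name in UTF_CANDIDATES:
            try:
                return raw.decode(name), name
            except UnicodeDecodeError:
                continue
        return None, None        # nothing adequate -> step (3) fallback

    def bigrams(text):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    class NaiveBayesLanguageGuesser:
        """Character-bigram Naive Bayes with add-one smoothing: step (2)."""

        def __init__(self):
            self.counts = {}     # language -> Counter of bigrams
            self.totals = {}     # language -> total bigram count

        def train(self, text, language):
            c = self.counts.setdefault(language, Counter())
            c.update(bigrams(text))
            self.totals[language] = sum(c.values())

        def score(self, text, language):
            c, total = self.counts[language], self.totals[language]
            vocab = len(c) + 1
            return sum(math.log((c[b] + 1) / (total + vocab))
                       for b in bigrams(text))

        def classify(self, text, margin=20.0):
            """Best language, or None when the win isn't significantly high."""
            scored = sorted(((self.score(text, lang), lang)
                             for lang in self.counts), reverse=True)
            if not scored:
                return None
            if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
                return None      # too close to call -> step (3) fallback
            return scored[0][1]

The margin test in classify() is where "show the user only when the
classification is confident" would live; in real life step (1) would
of course consult the existing coding-category machinery rather than
a hard-coded list of UTFs.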
Ben> [1] create a corpus of correctly classified examples --
Ben> basically, a whole bunch of pieces of text, hopefully as
Ben> varied as possible, along with the correct encoding (manually
Ben> determined);
This is the hard part, and will basically take years. Although we
don't have to worry about spammers who will deliberately change the
signatures of their documents, we (I hope) will be getting new users:
there are already lots of Koreans and Chinese but we don't have many
of them yet; South-East Asians, South Asians, and Africans are coming
on line.
Also, there will be new formats invented or handled in Emacs (e.g., XML
applications, tar files, compression, encryption) that we would like
to automatically recognize. So we SHOULD think in terms of a
continuing process of developing the corpus and reoptimizing.
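
To make the "continuing process" concrete: the corpus could simply be
a pile of sample files plus an index of manually determined labels,
and "reoptimizing" just means rerunning training over it whenever new
labelled samples arrive. A hypothetical sketch, reusing the guesser
above (the file layout and names are made up):

    from pathlib import Path

    # Hypothetical index format, one tab-separated line per sample, e.g.:
    #     samples/mail-0001.txt    japanese
    #     samples/news-0002.xml    korean

    def retrain(index_file):
        """Rebuild the classifier from the whole labelled corpus."""
        guesser = NaiveBayesLanguageGuesser()
        for line in Path(index_file).read_text().splitlines():
            if not line.strip():
                continue
            sample_path, label = line.split("\t")
            text, _utf = decode_to_canonical(Path(sample_path).read_bytes())
            if text is not None:
                guesser.train(text, label)
        return guesser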
Ben> In fact, I may look into implementing something like this for
Ben> a class project, e.g. the final project for my
Ben> machine-learning class.
That would be cool!
--
School of Systems and Information Engineering   http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.