>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
Ben> I am currently taking a class in statistical machine learning
Ben> and it occurs to me that there is a well-understood method
Ben> for doing autodetection robustly.
Yes.
But you're fighting the last war with modern weapons. Most of the
stuff that you mention below is what Mule calls "coding category", and
we can already do a pretty good job on that.
What we really want to do is _language_ recognition in Unicode and
unibyte. I would suggest a structure that goes:
(1) Pick the best adequate match among Unicode UTFs. Convert to
canonical (AKA read into an Emacs buffer).
(2) Detect language (using a statistical classifier; see the sketch
after this list). If the classifier's confidence is significantly
high, show the user the result.
(3) If (1) or (2) failed, fall back to a different category.
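
Very roughly, and just to make the shape concrete, something like the
following throwaway Python sketch -- not a proposal for the actual
Mule implementation; the UTF list, the character-bigram features, and
the confidence margin are all assumptions:

    import math
    from collections import Counter

    UTF_CANDIDATES = ["utf-8", "utf-16", "utf-32"]   # step (1) candidates

    def decode_to_canonical(raw):
        """Return (text, utf_name) for the first UTF that decodes cleanly."""
        for name in UTF_CANDIDATES:
            try:
                return raw.decode(name), name
            except UnicodeDecodeError:
                continue
        return None, None        # nothing adequate -> step (3) fallback

    def bigrams(text):
        return [text[i:i + 2] for i in range(len(text) - 1)]

    class NaiveBayesLanguageGuesser:
        """Character-bigram Naive Bayes with add-one smoothing: step (2)."""

        def __init__(self):
            self.counts = {}     # language -> Counter of bigrams
            self.totals = {}     # language -> total bigram count

        def train(self, text, language):
            c = self.counts.setdefault(language, Counter())
            c.update(bigrams(text))
            self.totals[language] = sum(c.values())

        def score(self, text, language):
            c, total = self.counts[language], self.totals[language]
            vocab = len(c) + 1
            return sum(math.log((c[b] + 1) / (total + vocab))
                       for b in bigrams(text))

        def classify(self, text, margin=20.0):
            """Best language, or None when the win isn't significantly high."""
            scored = sorted(((self.score(text, lang), lang)
                             for lang in self.counts), reverse=True)
            if not scored:
                return None
            if len(scored) > 1 and scored[0][0] - scored[1][0] < margin:
                return None      # too close to call -> step (3) fallback
            return scored[0][1]

The margin test in classify() is where "show the user only when the
classification is confident" would live; in real life step (1) would
of course consult the existing coding-category machinery rather than
a hard-coded list of UTFs.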
Ben> [1] create a corpus of correctly classified examples --
Ben> basically, a whole bunch of pieces of text, hopefully as
Ben> varied as possible, along with the correct encoding (manually
Ben> determined);
This is the hard part, and will basically take years. Although we
don't have to worry about spammers who will deliberately change the
signatures of their documents, we (I hope) will be getting new users:
there are already lots of Koreans and Chinese but we don't have many
of them yet; South-East Asians, South Asians, and Africans are coming
on line.
Also, there will be new formats invented or handled in Emacs (e.g., XML
applications, tar files, compression, encryption) that we would like
to automatically recognize. So we SHOULD think in terms of a
continuing process of developing the corpus and reoptimizing.
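
To make the "continuing process" concrete: the corpus could simply be
a pile of sample files plus an index of manually determined labels,
and "reoptimizing" just means rerunning training over it whenever new
labelled samples arrive. A hypothetical sketch, reusing the guesser
above (the file layout and names are made up):

    from pathlib import Path

    # Hypothetical index format, one tab-separated line per sample, e.g.:
    #     samples/mail-0001.txt    japanese
    #     samples/news-0002.xml    korean

    def retrain(index_file):
        """Rebuild the classifier from the whole labelled corpus."""
        guesser = NaiveBayesLanguageGuesser()
        for line in Path(index_file).read_text().splitlines():
            if not line.strip():
                continue
            sample_path, label = line.split("\t")
            text, _utf = decode_to_canonical(Path(sample_path).read_bytes())
            if text is not None:
                guesser.train(text, label)
        return guesser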
Ben> In fact, I may look into implementing something like this for
Ben> a class project, e.g. the final project for my
Ben> machine-learning class.
That would be cool!
--
School of Systems and Information Engineering   http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.