Stephen J. Turnbull wrote:
>>>>>"Ben" == Ben Wing <ben(a)666.com>
writes:
>>>>>
>>>>>
Ben> I am currently taking a class in statistical machine learning
Ben> and it occurs to me that there is a well-understood method
Ben> for doing autodetection robustly.
Yes.
But you're fighting the last war with modern weapons. Most of the
stuff you mention below is what Mule calls a "coding category", and
we can already do a pretty good job on that.
What we really want to do is _language_ recognition in Unicode and
unibyte text. I would suggest a structure that goes:
(1) Pick the best adequate match among the Unicode UTFs. Convert to
canonical form (i.e., read into an Emacs buffer).
(2) Detect the language (using a statistical classifier). If the
classifier's confidence is high enough, show the result to the user.
(3) If (1) or (2) fails, fall back to a different coding category.
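To make the shape of that concrete, here is a minimal Python sketch of
steps (1)-(3). Everything in it is illustrative -- the names, the
confidence threshold, and the stub fallback are assumptions, not
anything Mule actually provides:

    UTF_CANDIDATES = ["utf-8", "utf-16", "utf-32"]

    def fallback_detect(raw_bytes):
        # (3) Stand-in for the existing coding-category priority list
        # (iso-2022, shift_jis, big5, ...).
        return "binary", None

    def detect(raw_bytes, classifier, threshold=0.9):
        # (1) Pick the best adequate match among the Unicode UTFs:
        # strict decoding rejects byte sequences invalid for that UTF.
        for enc in UTF_CANDIDATES:
            try:
                text = raw_bytes.decode(enc)
            except UnicodeDecodeError:
                continue
            # (2) Detect the language with a statistical classifier.
            language, confidence = classifier(text)
            if confidence >= threshold:
                return enc, language
        return fallback_detect(raw_bytes)

    # Toy usage: a "classifier" that always answers English, confidently.
    print(detect("résumé".encode("utf-8"), lambda text: ("en", 1.0)))

The point of the confidence check in (2) is that nearly any byte string
decodes as _something_ under some charset; a classifier that refuses to
commit is what lets us fall back sanely in (3).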
Ben> [1] create a corpus of correctly classified examples --
Ben> basically, a whole bunch of pieces of text, hopefully as
Ben> varied as possible, along with the correct encoding (manually
Ben> determined);
This is the hard part, and will basically take years. Although we
don't have to worry about spammers deliberately changing the
signatures of their documents, we (I hope) will be getting new users:
there are already lots of Koreans and Chinese online, though we don't
have many of them yet, and South-East Asians, South Asians, and
Africans are coming online.
Also, there will be new formats invented or handled in Emacs (e.g.,
XML applications, tar files, compression, encryption) that we would
like to recognize automatically. So we SHOULD think in terms of a
continuing process of developing the corpus and reoptimizing.
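The file-format half of that is the easier part, since most such
formats announce themselves with magic bytes near the start. A small
sketch -- the signature table is illustrative, nowhere near exhaustive:

    SIGNATURES = [
        (b"\x1f\x8b", "gzip"),
        (b"BZh", "bzip2"),
        (b"%PDF", "pdf"),
        (b"<?xml", "xml"),
    ]

    def sniff_format(raw_bytes):
        for magic, name in SIGNATURES:
            if raw_bytes.startswith(magic):
                return name
        # tar has no leading magic; "ustar" sits at offset 257.
        if raw_bytes[257:262] == b"ustar":
            return "tar"
        return None

Encrypted data without a header is the hard case: good ciphertext is
statistically flat, so it gets recognized mostly by _failing_ to match
anything else.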
Well, collecting a decent-sized corpus isn't too hard. Just get all the
computer users you know who work with non-ASCII documents to start
collecting their documents. In this case, figuring out the coding by
hand isn't too difficult, and we can get a computer to make a first
guess (that's usually how big corpora of parsed English sentences are
created, though there the task is much harder, since you're not just
putting something into a single category but creating a tree that
covers every word in the sentence).
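For the computer's first guess, something as simple as this would do.
It leans on the chardet library purely as an example of an existing
detector -- my choice for illustration, not something we'd ship:

    import chardet

    def first_guess(raw_bytes):
        # Machine-generated label for a human annotator to confirm or
        # correct -- the same bootstrapping used for parsed corpora.
        result = chardet.detect(raw_bytes)
        return result["encoding"], result["confidence"]

Even a mediocre first guess saves most of the hand labor, because the
annotator only has to veto mistakes rather than classify from scratch.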
I suggest we set up a place where people can email documents -- esp.
documents that XEmacs doesn't recognize properly -- and indicate what
the encoding is.
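Once submissions accumulate, even a very simple model trained on them
becomes useful. Here's a sketch of a byte-bigram naive Bayes detector
over such a labeled corpus (all the names are mine, purely
illustrative):

    from collections import Counter, defaultdict
    import math

    def train(corpus):
        # corpus: a list of (raw_bytes, encoding_label) pairs as
        # submitted, with the label supplied by the sender.
        counts = defaultdict(Counter)
        for raw, label in corpus:
            counts[label].update(raw[i:i + 2] for i in range(len(raw) - 1))
        return counts

    def classify(counts, raw):
        # Score each label by the smoothed log-likelihood of the
        # document's byte bigrams; return the best-scoring label.
        best, best_score = None, float("-inf")
        for label, bigrams in counts.items():
            total = sum(bigrams.values())
            vocab = len(bigrams) + 1
            score = sum(
                math.log((bigrams[raw[i:i + 2]] + 1) / (total + vocab))
                for i in range(len(raw) - 1)
            )
            if score > best_score:
                best, best_score = label, score
        return best

Byte bigrams are crude, but for separating, say, EUC-JP from Shift-JIS
from Big5 they already carry a lot of signal, and retraining as new
documents arrive gives exactly the continuing reoptimization Stephen
asks for.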
ben