I am currently taking a class in statistical machine learning and it
occurs to me that there is a well-understood method for doing
autodetection robustly. In particular, it is an instance of what is
known as a classification problem: Given a bunch of input (patient
symptoms, properties of the text, etc.), correctly classify the input into
one of a set of classes (the patient's disease, the encoding of the
text, etc.). There are general methods that learn how to optimally do
this, given a set of preclassified inputs. In our case, we'd
[1] create a corpus of correctly classified examples -- basically, a
whole bunch of pieces of text, hopefully as varied as possible, along
with the correct encoding (manually determined);
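Just to make [1] concrete, here's one possible shape for such a corpus,
with a loader sketched in Python. The file layout, the manifest name,
and load_corpus itself are all invented for illustration:

    import csv
    import os

    def load_corpus(corpus_dir, manifest="labels.csv"):
        """Read (raw bytes, encoding label) pairs from a hand-built corpus.

        Assumes a manifest CSV whose rows look like
            samples/ja-001.txt,shift_jis
        where the label is the manually determined encoding.
        """
        examples = []
        with open(os.path.join(corpus_dir, manifest), newline="") as f:
            for filename, label in csv.reader(f):
                with open(os.path.join(corpus_dir, filename), "rb") as sample:
                    examples.append((sample.read(), label))
        return examples

The only important property is that each example pairs the raw,
undecoded bytes with a hand-verified label.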
[2] pick a bunch of features that we think might be relevant. There
could potentially be a large number of them -- in principle, you could
just use the entire text itself as a set of features, one per byte, but
you'd run into problems with data sparseness. In practice these
features might be (I'm just guessing here; it doesn't totally matter
what you choose, and I'm sure there has been work done on how you
choose features for document classification, which is exactly what this
is):

- the byte values of each of the first 8 or 10 bytes of the text;
- the total number and/or proportion of printing ASCII bytes, of ESC
  characters, of ISO 2022 designation sequences, and of invalid ISO ESC
  sequences;
- the number of 80-9F bytes, of sequences of two 80-9F bytes, and of
  sequences of four 80-9F bytes;
- the number of sequences of the form XA, XAXA, XAXAXA, etc., where
  X = 80-9F and A = non-80-9F, and the length of the longest such
  sequence of XA's;
- the number of null bytes, of sequences of two nulls, and of
  sequences of null and non-null;
- the number of sequences of XA, XAXA, etc., just as above but where
  X = null and A = non-null;
- the length of the longest sequence of A0-FF bytes, and the number of
  sequences of an odd (or even) number of A0-FF bytes;
- the number of tabs, CR's, LF's, CRLF sequences, and spaces;
- the number of whitespace and of non-whitespace bytes in the 00-1F
  range, of 00-1F bytes used in ISO 2022 (SI, SO, ...), and of 00-1F
  bytes neither used in ISO 2022 nor whitespace;
- the number of bytes that are legal in base-64-encoded text and of
  bytes that aren't;
- the number of = signs, of sequences of = followed by two hex digits,
  and of sequences of = not followed by two hex digits;

etc., etc. The point is, you want to choose features that will be
applicable to the
various encodings: for example, Shift-JIS encodes Kanji as a byte in
the 80-9F range followed by a second byte, so you're likely to see lots
of 80-9F bytes alternating with non-80-9F bytes; most other encodings
won't have much 80-9F at all, except binary data. Unicode (i.e. UTF-16),
similarly, tends to have null bytes alternating with non-null bytes,
whereas most other encodings have few or no nulls. The byte values of
the first 8 or 10 bytes might contain magic
sequences indicating the type of encoding. The idea here is that we
just throw a bunch of features at the data, and the classification
algorithm learns which features are relevant and which ones aren't, and
how best to combine them to get a function that does a near-optimal job
at correct classification. These things do a *far* *far* better job
than pretty much any hand-tuned classifier that anyone can construct,
not to mention the ad-hoc piece of crap we have now.
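To make the feature idea concrete, here's a rough sketch in Python of
computing a handful of the features above directly from the raw bytes
(the feature names are invented; a real feature set would be much
bigger):

    import re

    def extract_features(data):
        """Compute a few of the candidate features from a raw byte string."""
        n = max(len(data), 1)
        feats = {}
        # Byte values of the first 8 bytes (may contain magic sequences).
        for i in range(8):
            feats["byte_%d" % i] = data[i] if i < len(data) else -1
        # Proportion of printing ASCII, and count of ESC characters.
        feats["printable_ascii"] = sum(0x20 <= b < 0x7F for b in data) / n
        feats["esc_count"] = data.count(0x1B)
        # 80-9F bytes, and 80-9F alternating with non-80-9F (Shift-JIS-ish).
        hi = [0x80 <= b <= 0x9F for b in data]
        feats["hi_count"] = sum(hi)
        feats["hi_alternations"] = sum(
            a and not b for a, b in zip(hi, hi[1:]))
        # Null bytes, and null alternating with non-null (UTF-16-ish).
        nul = [b == 0 for b in data]
        feats["null_count"] = sum(nul)
        feats["null_alternations"] = sum(
            a and not b for a, b in zip(nul, nul[1:]))
        # Line-ending and whitespace statistics.
        feats["cr"] = data.count(0x0D)
        feats["lf"] = data.count(0x0A)
        feats["crlf"] = data.count(b"\r\n")
        feats["tab"] = data.count(0x09)
        # Quoted-printable signature: = followed by two hex digits.
        feats["qp_escapes"] = len(re.findall(rb"=[0-9A-Fa-f]{2}", data))
        return feats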
[3] run the trainer -- once -- on all the data. It may take a while,
but this only happens once, and afterwards we should have a damn good
encoding autodetection process. Note that you can throw all sorts of
things in there -- not just I18N encodings, but also whether it's
base64-encoded, quoted-printable-encoded, an EXE program, a shell
script, pure text with one or another line feed ending, text that
doesn't wrap (i.e. CR is only a paragraph separator), random binary data
(of course, you could have it look for different kinds of such data,
e.g. gzipped, bzipped, etc.), email, C code, Lisp code, etc. In some
cases it might make sense to have a tree of classifiers; e.g. the basic
one determines it's binary and then we have another one that just
classifies binary data using a corpus of classified binary data and its
own set of features.
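As a sketch of what the one-time training run might look like, assuming
the load_corpus and extract_features sketches above, and assuming
scikit-learn as the learning package (any maxent or naive-Bayes
implementation would do as well; multinomial logistic regression is the
same model a maxent package fits):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def train_detector(examples):
        """One-time training pass over the whole labeled corpus."""
        dicts = [extract_features(data) for data, _ in examples]
        labels = [label for _, label in examples]
        vec = DictVectorizer()
        X = vec.fit_transform(dicts)
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, y_train)
        # Held-out accuracy: a sanity check that the features are
        # actually informative, before wiring anything in.
        print("held-out accuracy:", clf.score(X_test, y_test))
        return vec, clf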
[4] if you want to get more sophisticated, you allow it to continue to
learn as it sees new items. This is what SpamAssassin does
(SpamAssassin is also basically just a classifier, one that classifies
into only two categories, spam or non-spam, and uses the same sorts of
machine-learning algorithms); in its case it combines statistical
learning with some hand-coded rules (although fewer and fewer of them
over time) plus ancillary knowledge (e.g. whether the mail arrived
through a blacklisted relay or whether the mail is listed in one of the
dynamic spam databases). If it's very sure that something is or isn't
spam, it feeds that back into its learner. You can also feed it stuff
by hand; it's especially useful to do this on stuff it got wrong. In
the case of XEmacs, you could imagine doing that semi-automatically with
a menu or other command that basically means "this encoding is wrong,
let's choose another one"; it might have various kinds of options --
"pick the next best" (since the classifier will usually have an N-best
list of possibilities) or "pick the next best from this group of
encodings" (e.g. i know it's supposed to be CJK or maybe even just
Japanese, but i don't know what the right encoding is) or just "pick
this particular encoding". Whenever the user makes such a correction,
the classifier can learn from it.
You could also imagine creating specialized corpora (pl. of corpus) and
learning functions for particular tasks, like decoding known CJK text,
if this is important enough.
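One hypothetical way to sketch [4] on top of the above: train with a
model that supports incremental updates, e.g. scikit-learn's
SGDClassifier with logistic loss in place of the one-shot
LogisticRegression, so that user corrections can be fed straight back
in; the N-best list then falls out of the per-class scores:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Trained once as in [3], but with an online-capable model:
    #   clf = SGDClassifier(loss="log_loss").fit(X_train, y_train)

    def correct_guess(clf, vec, data, right_label):
        """The user said the guess was wrong; learn from the correction.

        The corrected label must be one of the classes seen at
        training time.
        """
        X = vec.transform([extract_features(data)])
        clf.partial_fit(X, [right_label])

    def n_best(clf, vec, data, n=3):
        """Ranked candidate encodings, for a "pick the next best" command."""
        X = vec.transform([extract_features(data)])
        scores = clf.decision_function(X)[0]  # one score per class
        order = np.argsort(scores)[::-1][:n]
        return [clf.classes_[i] for i in order]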
Of course, [4] is "pie-in-the-sky" stuff that might never get done,
but [1]-[3] are conceptually not too hard, and this stuff has been
extremely well studied; lots of high-quality, free code already exists
(e.g. the Maxent package for maximum-entropy learning, or Weka, a
large, well-developed package with all sorts of machine-learning
algorithms).
In fact, I may look into implementing something like this for a class
project, e.g. the final project for my machine-learning class.
ben