I am currently taking a class in statistical machine learning and it
occurs to me that there is a well-understood method for doing
autodetection robustly. In particular, it is an instance of what is
known as a classification problem: Given a bunch of input (patient
symptoms, properties of the text, etc.), correctly classify the input into
one of a set of classes (the patient's disease, the encoding of the
text, etc.). There are general methods that learn how to optimally do
this, given a set of preclassified inputs. In our case, we'd
[1] create a corpus of correctly classified examples -- basically, a
whole bunch of pieces of text, hopefully as varied as possible, along
with the correct encoding (manually determined);
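Just to make [1] concrete, here's one possible shape for such a corpus,
with a loader sketched in Python. The file layout, the manifest name,
and load_corpus itself are all invented for illustration:

    import csv
    import os

    def load_corpus(corpus_dir, manifest="labels.csv"):
        """Read (raw bytes, encoding label) pairs from a hand-built corpus.

        Assumes a manifest CSV whose rows look like
            samples/ja-001.txt,shift_jis
        where the label is the manually determined encoding.
        """
        examples = []
        with open(os.path.join(corpus_dir, manifest), newline="") as f:
            for filename, label in csv.reader(f):
                with open(os.path.join(corpus_dir, filename), "rb") as sample:
                    examples.append((sample.read(), label))
        return examples

The only important property is that each example pairs the raw,
undecoded bytes with a hand-verified label.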
[2] pick a bunch of features that we think might be relevant. There
could potentially be a large number of them -- in principle, you could
just use the entire text itself as a set of features, one per byte, but
you'd run into problems with data sparseness. In practice these
features might be (I'm just guessing here; it doesn't totally matter
what you choose, and I'm sure there has been work done on how you
choose features for document classification, which is exactly what this
is):

- the byte values of each of the first 8 or 10 bytes of the text;
- the total number and/or proportion of printing ASCII bytes, of ESC
  characters, of ISO 2022 designation sequences, and of invalid ISO ESC
  sequences;
- the number of 80-9F bytes, of sequences of two 80-9F bytes, and of
  sequences of four 80-9F bytes;
- the number of sequences of the form XA, XAXA, XAXAXA, etc., where
  X = 80-9F and A = non-80-9F, and the length of the longest such
  sequence of XA's;
- the number of null bytes, of sequences of two nulls, and of
  sequences of null and non-null;
- the number of sequences of XA, XAXA, etc., just as above but where
  X = null and A = non-null;
- the length of the longest sequence of A0-FF bytes, and the number of
  sequences of an odd (or even) number of A0-FF bytes;
- the number of tabs, CR's, LF's, CRLF sequences, and spaces;
- the number of whitespace and of non-whitespace bytes in the 00-1F
  range, of 00-1F bytes used in ISO 2022 (SI, SO, ...), and of 00-1F
  bytes neither used in ISO 2022 nor whitespace;
- the number of bytes that are legal in base-64-encoded text and of
  bytes that aren't;
- the number of = signs, of sequences of = followed by two hex digits,
  and of sequences of = not followed by two hex digits;

etc., etc. The point is, you want to choose features that will be
applicable to the
various encodings: for example, Shift-JIS encodes Kanji as a byte in
the 80-9F range followed by a second byte, so you're likely to see lots
of 80-9F bytes alternating with non-80-9F bytes; most other encodings
won't have much 80-9F at all, except binary data. Unicode (i.e. UTF-16),
similarly, tends to have null bytes alternating with non-null bytes,
whereas most other encodings have few or no nulls. The byte values of
the first 8 or 10 bytes might contain magic
sequences indicating the type of encoding. The idea here is that we
just throw a bunch of features at the data, and the classification
algorithm learns which features are relevant and which ones aren't, and
how best to combine them to get a function that does a near-optimal job
at correct classification. These things do a *far* *far* better job
than pretty much any hand-tuned classifier that anyone can construct,
not to mention the ad-hoc piece of crap we have now.
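To make the feature idea concrete, here's a rough sketch in Python of
computing a handful of the features above directly from the raw bytes
(the feature names are invented; a real feature set would be much
bigger):

    import re

    def extract_features(data):
        """Compute a few of the candidate features from a raw byte string."""
        n = max(len(data), 1)
        feats = {}
        # Byte values of the first 8 bytes (may contain magic sequences).
        for i in range(8):
            feats["byte_%d" % i] = data[i] if i < len(data) else -1
        # Proportion of printing ASCII, and count of ESC characters.
        feats["printable_ascii"] = sum(0x20 <= b < 0x7F for b in data) / n
        feats["esc_count"] = data.count(0x1B)
        # 80-9F bytes, and 80-9F alternating with non-80-9F (Shift-JIS-ish).
        hi = [0x80 <= b <= 0x9F for b in data]
        feats["hi_count"] = sum(hi)
        feats["hi_alternations"] = sum(
            a and not b for a, b in zip(hi, hi[1:]))
        # Null bytes, and null alternating with non-null (UTF-16-ish).
        nul = [b == 0 for b in data]
        feats["null_count"] = sum(nul)
        feats["null_alternations"] = sum(
            a and not b for a, b in zip(nul, nul[1:]))
        # Line-ending and whitespace statistics.
        feats["cr"] = data.count(0x0D)
        feats["lf"] = data.count(0x0A)
        feats["crlf"] = data.count(b"\r\n")
        feats["tab"] = data.count(0x09)
        # Quoted-printable signature: = followed by two hex digits.
        feats["qp_escapes"] = len(re.findall(rb"=[0-9A-Fa-f]{2}", data))
        return feats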
[3] run the trainer -- once -- on all the data. It may take a while,
but this only happens once, and afterwards we should have a damn good
encoding autodetection process. Note that you can throw all sorts of
things in there -- not just I18N encodings, but also whether it's
base64-encoded, quoted-printable-encoded, an EXE program, a shell
script, pure text with one or another line feed ending, text that
doesn't wrap (i.e. CR is only a paragraph separator), random binary data
(of course, you could have it look for different kinds of such data,
e.g. gzipped, bzipped, etc.), email, C code, Lisp code, etc. In some
cases it might make sense to have a tree of classifiers; e.g. the basic
one determines it's binary and then we have another one that just
classifies binary data using a corpus of classified binary data and its
own set of features.
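As a sketch of what the one-time training run might look like, assuming
the load_corpus and extract_features sketches above, and assuming
scikit-learn as the learning package (any maxent or naive-Bayes
implementation would do as well; multinomial logistic regression is the
same model a maxent package fits):

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    def train_detector(examples):
        """One-time training pass over the whole labeled corpus."""
        dicts = [extract_features(data) for data, _ in examples]
        labels = [label for _, label in examples]
        vec = DictVectorizer()
        X = vec.fit_transform(dicts)
        X_train, X_test, y_train, y_test = train_test_split(
            X, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, y_train)
        # Held-out accuracy: a sanity check that the features are
        # actually informative, before wiring anything in.
        print("held-out accuracy:", clf.score(X_test, y_test))
        return vec, clf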
[4] if you want to get more sophisticated, you allow it to continue to
learn as it sees new items. This is what SpamAssassin does
(SpamAssassin is also basically just a classifier, one that classifies
into only two categories, spam or non-spam, and uses the same sorts of
machine-learning algorithms); in its case it combines statistical
learning with some hand-coded rules (although fewer and fewer of them
over time) plus ancillary knowledge (e.g. whether the mail arrived
through a blacklisted relay or whether the mail is listed in one of the
dynamic spam databases). If it's very sure that something is or isn't
spam, it feeds that back into its learner. You can also feed it stuff
by hand; it's especially useful to do this on stuff it got wrong. In
the case of XEmacs, you could imagine doing that semi-automatically with
a menu or other command that basically means "this encoding is wrong,
let's choose another one"; it might have various kinds of options --
"pick the next best" (since the classifier will usually have an N-best
list of possibilities) or "pick the next best from this group of
encodings" (e.g. i know it's supposed to be CJK or maybe even just
Japanese, but i don't know what the right encoding is) or just "pick
this particular encoding". Whenever the user makes such a correction,
the classifier can learn from it.
You could also imagine creating specialized corpora (pl. of corpus) and
learning functions for particular tasks, like decoding known CJK text,
if this is important enough.
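One hypothetical way to sketch [4] on top of the above: train with a
model that supports incremental updates, e.g. scikit-learn's
SGDClassifier with logistic loss in place of the one-shot
LogisticRegression, so that user corrections can be fed straight back
in; the N-best list then falls out of the per-class scores:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Trained once as in [3], but with an online-capable model:
    #   clf = SGDClassifier(loss="log_loss").fit(X_train, y_train)

    def correct_guess(clf, vec, data, right_label):
        """The user said the guess was wrong; learn from the correction.

        The corrected label must be one of the classes seen at
        training time.
        """
        X = vec.transform([extract_features(data)])
        clf.partial_fit(X, [right_label])

    def n_best(clf, vec, data, n=3):
        """Ranked candidate encodings, for a "pick the next best" command."""
        X = vec.transform([extract_features(data)])
        scores = clf.decision_function(X)[0]  # one score per class
        order = np.argsort(scores)[::-1][:n]
        return [clf.classes_[i] for i in order]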
Of course, [4] is "pie-in-the-sky" stuff that might never get done,
but [1]-[3] are conceptually not too hard, and this stuff has been
extremely well studied; lots of high-quality, free code already exists
(e.g. the Maxent package for maximum-entropy learning, or Weka, a
large, well-developed package with all sorts of machine-learning
algorithms).
In fact, I may look into implementing something like this for a class
project, e.g. the final project for my machine-learning class.
ben