>>>> "Alexey" == Alexey Mahotkin
<alexm(a)hsys.msk.ru> writes:
Alexey> I can say that there is an extremely good scheme for
Alexey> statistical detection of various Russian (really Russian,
Alexey> not Cyrillic) encodings, done by S. V. Znamensky. I tried
Alexey> it, and it works really wonderfully, even handling
Alexey> "twice-encoded" text, which is seen occasionally.
That would be a very nice example from my point of view, even though
it is limited to Russian. If it also happened to be able to reject
(say) KOI8 Ukrainian and ISO 8859-7 Greek, that would be a wonderful
showcase for the feature.
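
To make "detect Russian, reject everything else" concrete, here is a rough Python sketch. It is emphatically not Znamensky's scheme, just about the simplest letter-frequency test possible: decode the bytes under each candidate encoding and see what fraction of the non-ASCII characters are among the ten most common Russian letters. The names (`CANDIDATES`, `COMMON`, `THRESHOLD`, `score`, `detect_russian`), the choice of ten letters, and the 0.5 cutoff are all assumptions made for illustration.

```python
from typing import Optional

# Candidate single-byte encodings for Russian text (an assumed, incomplete list).
CANDIDATES = ("koi8_r", "cp1251", "cp866", "iso8859_5")

# The ten most frequent lowercase Russian letters (assumed feature set).
COMMON = frozenset("оеаинтсрвл")

# Arbitrary cutoff: below this, no candidate is accepted as Russian.
THRESHOLD = 0.5


def score(data: bytes, encoding: str) -> float:
    """Fraction of non-ASCII decoded characters that are common Russian letters."""
    text = data.decode(encoding, errors="replace")
    high = [ch for ch in text if ord(ch) > 127]
    if not high:
        return 0.0
    return sum(ch in COMMON for ch in high) / len(high)


def detect_russian(data: bytes) -> Optional[str]:
    """Return the best-scoring candidate encoding, or None to reject the text."""
    best = max(CANDIDATES, key=lambda enc: score(data, enc))
    return best if score(data, best) >= THRESHOLD else None
```

The wrong decodings score badly because they turn common lowercase letters into uppercase letters, rare letters, or box-drawing characters. A real detector would want much richer statistics (bigrams, say), and telling KOI8-R from KOI8 Ukrainian needs more than this, since the two differ in only a handful of code points.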
Alexey> I thought of adding something like this to XEmacs. Now if
Alexey> there is a common infrastructure for this, I'd be glad to
Alexey> help in that area.
Well, AFAIK Ben is monoscriptal in ISO-8859-1 for practical purposes.
So I don't know if the current infrastructure will necessarily support
existing statistical detectors. But I'll take a close look and try to
come up with some docs. I'm pretty sure Ben is interested enough to
be responsive to requests for enhancement of the mechanism.
Alexey> I'm now playing with current XEmacs-beta. It recognizes
Alexey> my ~/.xemacs/init.el as UTF-16,
This is probably a priority bug.
Alexey> and does not let me change the encoding with "C-x RET f
Alexey> koi8-r RET"
This is probably not.
Alexey> (but "C-x RET c koi8-r C-x C-f" works). The file itself
Alexey> is mostly ASCII, with two strings in Russian inside (near
Alexey> the end of the file).
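
For anyone who wants to poke at this without the original file, a file of that shape is easy to put together. A sketch in Python follows; the Lisp content and the names `build_sample` and `first_high_byte` are made up for illustration, the Russian strings are assumed to be KOI8-R encoded, and whether such a file triggers the same misdetection is untested.

```python
def build_sample(encoding: str = "koi8_r") -> bytes:
    """Return a mostly-ASCII Lisp file with two Russian strings near the end."""
    # Hypothetical ASCII body standing in for the bulk of an init file.
    ascii_part = (
        b";; hypothetical minimal init file\n"
        + b"(setq fill-column 72)\n" * 20
    )
    # Two Russian strings, encoded in the assumed 8-bit encoding.
    russian_part = (
        '(setq greeting "Привет")\n'
        '(setq farewell "Пока")\n'
    ).encode(encoding)
    return ascii_part + russian_part


def first_high_byte(data: bytes) -> int:
    """Index of the first byte >= 0x80, or -1 if the data is pure ASCII."""
    return next((i for i, b in enumerate(data) if b >= 0x80), -1)
```

Writing the result of `build_sample()` to disk in binary mode gives a small file whose only non-ASCII bytes sit in the last two lines.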
Alexey> Are you interested in such bug reports, and if yes, should
Alexey> I send the file or what? Other files it at least detects
Alexey> as "Raw".
Most definitely. Especially from Cyrillic and Japanese users, who give
autodetection its toughest tests (except maybe for Buddhist scholars)
because of the multiple encodings in daily use, plus the need to
handle ASCII, ISO-8859-1, and ISO-8859-15 for programming etc.
The report alone is probably enough for priority bugs. However, if
you have a file you can send, that would be very nice. As usual, the
shorter the better. The very best would be a test library in
test-harness.el format (see tests/automated/test-harness.el and the
"Regression Testing" node in the Internals Info manual).
Alexey> set-language-environment Cyrillic-KOI8 does not help at
Alexey> all.
This doesn't entirely surprise me, as I know Ben started a synch of the
language environment stuff to GNU Emacs 21.x, but in the process broke
stuff for Japanese at least. I wouldn't be surprised if something
similar happened in Cyrillic.
--
Institute of Policy and Planning Sciences
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.