>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
Ben> it's a coded character set, no?
Not technically, any more than EUC-JP is. It's a transformation
format that would be equally applicable to any 2 X 94^1 + 94^2 coded
character set. The character set is JIS X 0208.
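(To make the distinction concrete, the whole transformation is a few
lines of arithmetic; the function below is my illustration, not
anything in the tree, and it ignores JIS X 0201 and the user-defined
rows.)

    /* Illustrative only: recover the JIS X 0208 code point (two bytes in
       0x21..0x7E) that a Shift JIS two-byte sequence encodes.  */
    static int
    sjis_pair_to_jis0208 (unsigned char s1, unsigned char s2,
                          unsigned char *j1, unsigned char *j2)
    {
      if (!((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF)))
        return 0;                       /* not a JIS X 0208 lead byte */
      if (s2 < 0x40 || s2 > 0xFC || s2 == 0x7F)
        return 0;                       /* not a valid trail byte */
      s1 -= (s1 <= 0x9F) ? 0x71 : 0xB1;
      s1 = s1 * 2 + 1;
      if (s2 > 0x7F) s2--;
      if (s2 >= 0x9E) { s2 -= 0x7D; s1++; }
      else s2 -= 0x1F;
      *j1 = s1; *j2 = s2;
      return 1;
    }
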
Ben> it's treated that way in other applications.
That's not unreasonable if you are implementing a monolingual Japanese
single buffered application. We got any of those around here? :-)
Ben> there is currently no other method for declaring a coded
Ben> charset than using the `charset' object (that's its point, in
Ben> fact).
No, charset objects have a whole bunch of ISO 2022 crap in them. A
coded character set is a registered (eg, national standard) map from
the integers to a repertoire of characters. No directionality, no
column width, no CCL, etc.
Ben> also, are you *sure* that shift-jis is *identical* to
Ben> jisx0208-anything? might it not have extra characters in it?
In practice, yes, Shift JIS does include characters that aren't in JIS
X 0208. There is an agreed-upon standard space for user-defined
characters which will never be used by a character mapped from JIS X
0208, and that's where the corporate extensions go. BTW, there are
somewhere around 10 of them, and AFAIK the only thing that is nested
is JIS X 0208 in the corporate sets. You can arbitrarily pick
"Microsoft" and be right most of the time, I guess.
Ben> the current system is extremely bogus.
The current system is the best you can do without using the
statistical classifier approach. It is extremely successful at
distinguishing the various encodings used by Japanese, ditto Korean
and Chinese. It obviously sucks for unibyte, though.
Ben> in general, there needs to be a separation of detectors from
Ben> categories; and in fact categories should disappear entirely.
Don't you mean a separation of categories from coding systems?
Categories are very strong indicators. Shift JIS and Big5 are
monolingual; ISO 7-bit escapes are very strong signs of East Asian
ideographs, etc.
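(A rough sketch of that kind of evidence-gathering, nothing like our
actual detection code, and with deliberately simplified byte ranges:)

    /* Sketch: one pass over the octets, looking for near-conclusive
       features.  The ranges are simplified; real detection has to cope
       with EUC SS2/SS3, Big5/GBK extensions, and so on.  */
    #include <stddef.h>

    enum rough_category { ROUGH_UNKNOWN, ROUGH_ISO7, ROUGH_SJIS, ROUGH_BIG5_OR_EUC };

    static enum rough_category
    rough_guess (const unsigned char *p, size_t len)
    {
      size_t i;
      for (i = 0; i + 1 < len; i++)
        {
          if (p[i] == 0x1B && (p[i+1] == '$' || p[i+1] == '('))
            return ROUGH_ISO7;        /* ISO 2022 designation sequence */
          if (p[i] >= 0x81 && p[i] <= 0x9F && p[i] != 0x8E && p[i] != 0x8F)
            return ROUGH_SJIS;        /* lead byte illegal in EUC-JP */
          if (p[i] >= 0xA1 && p[i] <= 0xFE
              && p[i+1] >= 0xA1 && p[i+1] <= 0xFE)
            return ROUGH_BIG5_OR_EUC; /* needs more evidence to split */
        }
      return ROUGH_UNKNOWN;
    }
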
Ben> a simple mbcs detector, for example, would check the
Ben> characters that go by, for each existing mbcs coding system,
How do you lex a character out of the octet stream without a category?
Ben> to see whether they are defined in that coding system, and
Ben> based on this, assign some likelihood to all mbcs coding
Ben> systems. we could then break ties based on some preferred
Ben> list of coding systems.
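Concretely, I take the proposal to be something like this (the
candidate table and scoring are invented for the sake of the example,
and note that it assumes the characters have already been lexed
somehow):

    /* Sketch of "check validity, assign likelihood": each candidate
       coding system is scored by the fraction of characters it defines.
       Hypothetical interfaces throughout.  */
    #include <stddef.h>

    struct candidate
    {
      const char *name;
      int (*char_defined) (unsigned int ch);  /* 1 if ch exists in the set */
      double score;
    };

    static void
    score_candidates (struct candidate *cands, int ncands,
                      const unsigned int *chars, size_t nchars)
    {
      int c;
      for (c = 0; c < ncands; c++)
        {
          size_t i, defined = 0;
          for (i = 0; i < nchars; i++)
            if (cands[c].char_defined (chars[i]))
              defined++;
          cands[c].score = nchars ? (double) defined / nchars : 0.0;
        }
      /* ties would then be broken by a preferred list of coding systems */
    }
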
That's simply a slight generalization of coding categories. If you
take your plan literally (which I don't for a minute think you do, by
the way), we're back where we started for ISO Latin (most of which are
complete 94+96 sets), which is where it matters. We need the
statistical detector which looks at frequencies of characters, not
just presence or absence.
BTW, according to a colleague studying data mining, summarizing
"likelihoods" for each coding system independently is not flexible
enough to do a good job, and a full-blown statistical classifier is no
more computationally burdensome. Evidently you can get a lot of
leverage from "priors": eg, if you look at LANG and see it's ja_JP,
you can give a lot more weight to certain features because the chance
of confusion with KOI8 is much smaller, and so on.
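(To make the "priors" point concrete, here is a toy scorer along
naive-Bayes lines; the features, the probabilities, and the idea of
bumping the prior from LANG are all illustrative, not measured.)

    /* Toy sketch: score = log-prior + sum of per-feature log-likelihoods,
       and the winner is the encoding with the largest score.  Everything
       here is invented for illustration.  */
    #define NFEATURES 3   /* e.g. SJIS-only lead byte, 0xA1-0xFE pair, ESC $ */

    struct encoding_model
    {
      const char *name;
      double log_prior;                  /* bumped when LANG says ja_JP    */
      double log_likelihood[NFEATURES];  /* log P(feature | this encoding) */
    };

    static double
    log_posterior (const struct encoding_model *m,
                   const unsigned long counts[NFEATURES])
    {
      double lp = m->log_prior;
      int f;
      for (f = 0; f < NFEATURES; f++)
        lp += counts[f] * m->log_likelihood[f];
      return lp;
    }
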
Ben> well, it doesn't hurt to use dynarrs.
OK. I still find them very hard to read; it's a barrier to new developers.
Ben> l2r is the default; likewise the default for the columns is
Ben> the dimension of the charset. i'm not nuking these
Ben> properties but just eliminating unnecessary arguments to
Ben> simplify the code.
Ah, OK.
Ben> since all valid character values are non-negative, a negative
Ben> number is a good choice for "no such character", but 0xFFFF
Ben> or any other positive number isn't, even if 0xFFFF is
Ben> theoretically "not a character" under Unicode.
What "theoretically"? 0xFFFF is _guaranteed_ not to be a character,
or it's not Unicode. Given that, it works as a uint16_t, it works as
a uint32_t or int32_t, it works as a uint64_t or int64_t. It
probably works as an int16_t. Anybody familiar with Unicode will
immediately understand it.
I see the argument for negative numbers, but now you have two "not a
character" values.
The stuff about ranges spanning charsets in regex.c pinned my
bogometer. Ranges don't even make sense within charsets any more,
because collating order varies with locale in POSIX, and different
languages collate the same charsets in different ways.
Ben> sure, collating order varies, but regexps are about
Ben> searching, not sorting.
You know the difference. I know the difference. Do our users? Worse
yet, do other programmers? The glob "[a-z]" on Debian GNU/Linux in
some locales gives you the regexp [a-zA-Y]. (Surprised the heck out
of me when I first saw it.)
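(If you want to see it yourself: bracket ranges in fnmatch(3) follow
the current locale's collating sequence, so something like this
answers yes or no depending on the installed locale data; I'm not
promising any particular locale behaves one way or the other.)

    /* Sketch: does "[a-z]" match an upper-case letter in your locale? */
    #include <stdio.h>
    #include <locale.h>
    #include <fnmatch.h>

    int
    main (void)
    {
      setlocale (LC_ALL, "");   /* honor LANG / LC_COLLATE from environment */
      printf ("[a-z] matches \"B\": %s\n",
              fnmatch ("[a-z]", "B", 0) == 0 ? "yes" : "no");
      return 0;
    }
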
Ben> a range over some part of a national charset specifies a
Ben> well-defined set of characters.
And Unicode implements only a few of them fully: ASCII, Hangul, the
Japanese and Chinese syllabaries, Russian Cyrillic (but not Ukrainian
IIRC), Hebrew, Arabic. However, I misdoubt that ISO Latin languages
in general get the national order for any national standard.
And are ranges going to behave consistently across Emacs
implementations? I hope not! That would mean implementing the Mule
ordering in unicode-internal.
Ben> there's no reason to deprecate ranges.
Maybe not for users, but they're bad practice for programmers.
--
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.