>>>> "Ben" == Ben Wing <ben(a)666.com>
writes:
Ben> it's a coded character set, no?
Not technically, any more than EUC-JP is. It's a transformation
format that would be equally applicable to any 2 X 94^1 + 94^2 coded
character set. The character set is JIS X 0208.
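(To make the distinction concrete, the whole transformation is a few
lines of arithmetic; the function below is my illustration, not
anything in the tree, and it ignores JIS X 0201 and the user-defined
rows.)

    /* Illustrative only: recover the JIS X 0208 code point (two bytes in
       0x21..0x7E) that a Shift JIS two-byte sequence encodes.  */
    static int
    sjis_pair_to_jis0208 (unsigned char s1, unsigned char s2,
                          unsigned char *j1, unsigned char *j2)
    {
      if (!((s1 >= 0x81 && s1 <= 0x9F) || (s1 >= 0xE0 && s1 <= 0xEF)))
        return 0;                       /* not a JIS X 0208 lead byte */
      if (s2 < 0x40 || s2 > 0xFC || s2 == 0x7F)
        return 0;                       /* not a valid trail byte */
      s1 -= (s1 <= 0x9F) ? 0x71 : 0xB1;
      s1 = s1 * 2 + 1;
      if (s2 > 0x7F) s2--;
      if (s2 >= 0x9E) { s2 -= 0x7D; s1++; }
      else s2 -= 0x1F;
      *j1 = s1; *j2 = s2;
      return 1;
    }
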
Ben> it's treated that way in other applications.
That's not unreasonable if you are implementing a monolingual Japanese
single buffered application. We got any of those around here? :-)
Ben> there is currently no other method for declaring a coded
Ben> charset than using the `charset' object (that's its point, in
Ben> fact).
No, charset objects have a whole bunch of ISO 2022 crap in them. A
coded character set is a registered (eg, national standard) map from
the integers to a repertoire of characters. No directionality, no
column width, no CCL, etc.
Ben> also, are you *sure* that shift-jis is *identical* to
Ben> jisx0208-anything? might it not have extra characters in it?
In practice, yes, Shift JIS does include characters that aren't in JIS
X 0208. There is an agreed-upon standard space for user-defined
characters which will never be used by a character mapped from JIS X
0208, and that's where the corporate extensions go. BTW, there are
somewhere around 10 of them, and AFAIK the only thing that is nested
is JIS X 0208 in the corporate sets. You can arbitrarily pick
"Microsoft" and be right most of the time, I guess.
Ben> the current system is extremely bogus.
The current system is the best you can do without using the
statistical classifier approach. It is extremely successful at
distinguishing the various encodings used by Japanese, ditto Korean
and Chinese. It obviously sucks for unibyte, though.
Ben> in general, there needs to be a separation of detectors from
Ben> categories; and in fact categories should disappear entirely.
Don't you mean a separation of categories from coding systems?
Categories are very strong indicators. Shift JIS and Big5 are
monolingual; ISO 7-bit escapes are very strong signs of East Asian
ideographs, etc.
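(A rough sketch of that kind of evidence-gathering, nothing like our
actual detection code, and with deliberately simplified byte ranges:)

    /* Sketch: one pass over the octets, looking for near-conclusive
       features.  The ranges are simplified; real detection has to cope
       with EUC SS2/SS3, Big5/GBK extensions, and so on.  */
    #include <stddef.h>

    enum rough_category { ROUGH_UNKNOWN, ROUGH_ISO7, ROUGH_SJIS, ROUGH_BIG5_OR_EUC };

    static enum rough_category
    rough_guess (const unsigned char *p, size_t len)
    {
      size_t i;
      for (i = 0; i + 1 < len; i++)
        {
          if (p[i] == 0x1B && (p[i+1] == '$' || p[i+1] == '('))
            return ROUGH_ISO7;        /* ISO 2022 designation sequence */
          if (p[i] >= 0x81 && p[i] <= 0x9F && p[i] != 0x8E && p[i] != 0x8F)
            return ROUGH_SJIS;        /* lead byte illegal in EUC-JP */
          if (p[i] >= 0xA1 && p[i] <= 0xFE
              && p[i+1] >= 0xA1 && p[i+1] <= 0xFE)
            return ROUGH_BIG5_OR_EUC; /* needs more evidence to split */
        }
      return ROUGH_UNKNOWN;
    }
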
Ben> a simple mbcs detector, for example, would check the
Ben> characters that go by, for each existing mbcs coding system,
How do you lex a character out of the octet stream without a category?
Ben> to see whether they are defined in that coding system, and
Ben> based on this, assign some likelihood to all mbcs coding
Ben> systems. we could then break ties based on some preferred
Ben> list of coding systems.
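Concretely, I take the proposal to be something like this (the
candidate table and scoring are invented for the sake of the example,
and note that it assumes the characters have already been lexed
somehow):

    /* Sketch of "check validity, assign likelihood": each candidate
       coding system is scored by the fraction of characters it defines.
       Hypothetical interfaces throughout.  */
    #include <stddef.h>

    struct candidate
    {
      const char *name;
      int (*char_defined) (unsigned int ch);  /* 1 if ch exists in the set */
      double score;
    };

    static void
    score_candidates (struct candidate *cands, int ncands,
                      const unsigned int *chars, size_t nchars)
    {
      int c;
      for (c = 0; c < ncands; c++)
        {
          size_t i, defined = 0;
          for (i = 0; i < nchars; i++)
            if (cands[c].char_defined (chars[i]))
              defined++;
          cands[c].score = nchars ? (double) defined / nchars : 0.0;
        }
      /* ties would then be broken by a preferred list of coding systems */
    }
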
That's simply a slight generalization of coding categories. If you
take your plan literally (which I don't for a minute think you do, by
the way), we're back where we started for ISO Latin (most of which are
complete 94+96 sets), which is where it matters. We need the
statistical detector which looks at frequencies of characters, not
just presence or absence.
BTW, according to a colleague studying data mining, summarizing
"likelihoods" for each coding system independently is not flexible
enough to do a good job, and a full-blown statistical classifier is no
more computationally burdensome. Evidently you can get a lot of
leverage from "priors": eg, if you look at LANG and see it's ja_JP,
you can give a lot more weight to certain features because the chance
of confusion with KOI8 is much smaller, and so on.
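(To make the "priors" point concrete, here is a toy scorer along
naive-Bayes lines; the features, the probabilities, and the idea of
bumping the prior from LANG are all illustrative, not measured.)

    /* Toy sketch: score = log-prior + sum of per-feature log-likelihoods,
       and the winner is the encoding with the largest score.  Everything
       here is invented for illustration.  */
    #define NFEATURES 3   /* e.g. SJIS-only lead byte, 0xA1-0xFE pair, ESC $ */

    struct encoding_model
    {
      const char *name;
      double log_prior;                  /* bumped when LANG says ja_JP    */
      double log_likelihood[NFEATURES];  /* log P(feature | this encoding) */
    };

    static double
    log_posterior (const struct encoding_model *m,
                   const unsigned long counts[NFEATURES])
    {
      double lp = m->log_prior;
      int f;
      for (f = 0; f < NFEATURES; f++)
        lp += counts[f] * m->log_likelihood[f];
      return lp;
    }
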
Ben> well, it doesn't hurt to use dynarrs.
OK. I still find them very hard to read; it's a barrier to new developers.
Ben> l2r is the default; likewise the default for the columns is
Ben> the dimension of the charset. i'm not nuking these
Ben> properties but just eliminating unnecessary arguments to
Ben> simplify the code.
Ah, OK.
Ben> since all valid character values are non-negative, a negative
Ben> number is a good choice for "no such character", but 0xFFFF
Ben> or any other positive number isn't, even if 0xFFFF is
Ben> theoretically "not a character" under Unicode.
What "theoretically"? 0xFFFF is _guaranteed_ not to be a character,
or it's not Unicode. Given that, it works as a uint16_t, it works as
a uint32_t or int32_t, it works as a uint64_t or int64_t. It
probably works as an int16_t. Anybody familiar with Unicode will
immediately understand it.
I see the argument for negative numbers, but now you have two "not a
character" values.
The stuff about ranges spanning charsets in regex.c pinned my
bogometer. Ranges don't even make sense within charsets any more,
because collating order varies with locale in POSIX, and different
languages collate the same charsets in different ways.
Ben> sure, collating order varies, but regexps are about
Ben> searching, not sorting.
You know the difference. I know the difference. Do our users? Worse
yet, do other programmers? The glob "[a-z]" on Debian GNU/Linux in
some locales gives you the regexp [a-zA-Y]. (Surprised the heck out
of me when I first saw it.)
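(If you want to see it yourself: bracket ranges in fnmatch(3) follow
the current locale's collating sequence, so something like this
answers yes or no depending on the installed locale data; I'm not
promising any particular locale behaves one way or the other.)

    /* Sketch: does "[a-z]" match an upper-case letter in your locale? */
    #include <stdio.h>
    #include <locale.h>
    #include <fnmatch.h>

    int
    main (void)
    {
      setlocale (LC_ALL, "");   /* honor LANG / LC_COLLATE from environment */
      printf ("[a-z] matches \"B\": %s\n",
              fnmatch ("[a-z]", "B", 0) == 0 ? "yes" : "no");
      return 0;
    }
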
Ben> a range over some part of a national charset specifies a
Ben> well-defined set of characters.
And Unicode implements only a few of them fully: ASCII, Hangul, the
Japanese and Chinese syllabaries, Russian Cyrillic (but not Ukrainian
IIRC), Hebrew, Arabic. However, I misdoubt that ISO Latin languages
in general get the national order for any national standard.
And are ranges going to behave consistently across Emacs
implementations? I hope not! That would mean implementing the Mule
ordering in unicode-internal.
Ben> there's no reason to deprecate ranges.
Maybe not for users, but they're bad practice for programmers.
--
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.