Re: changing the values of iso-8859-* charsets

Tuesday, 8 November 2005

        ...
>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> the only thing that's really "quite clear" is that you
    Ben> haven't bothered to actually read any of the code i've
    Ben> written.

You're absolutely right: it was a _major_ bother to read it.  In the
future, check it in to CVS or I'm not going to read it.

    Ben> please go do that, then come back to me with *specific*
    Ben> issues referring to *actual* code, and i'll be glad to
    Ben> discuss them.

Vcharset_japanese_shift_jis.  Shift JIS is not a character set, and
this object is only used in the corresponding coding system.  It
should not be visible from Lisp.

I don't understand what the problem with not-a-character values for
256^N character sets is.  There aren't any.  I would say one way to
deal with that problem (if it really is a problem in practice) is to
pick an arbitrary code point that is unlikely to be a character or
likely to be rarely used (such as ASCII SUB, 0x1A or ASCII NUL), and
use that as the not-a-character value.  If you encounter it, then look
it up in the inverse table (i.e., to Unicode).  If the "Unicode" value
is 0xFFFF, then it's not-a-character.
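
A minimal sketch of that check, to make the idea concrete.  All the
names here (toy_charset, SENTINEL_CODE, toy_is_not_a_character) are
made up for illustration and are not taken from the actual code; it
assumes a one-byte charset with a 256-entry inverse table.

```c
#include <stdint.h>

/* Hypothetical names throughout; not the XEmacs API. */
#define SENTINEL_CODE   0x1A    /* ASCII SUB, an arbitrary rarely-used code */
#define UNICODE_NONCHAR 0xFFFF  /* "no mapping" marker in the inverse table */

/* Toy one-byte charset: inverse table from code point to Unicode. */
typedef struct {
    uint16_t to_unicode[256];
} toy_charset;

/* The forward tables would return SENTINEL_CODE for "no such character".
   Because SUB may also be a real character, the sentinel is only trusted
   after the inverse table confirms there is no Unicode mapping for it. */
static int toy_is_not_a_character(const toy_charset *cs, unsigned char code)
{
    if (code != SENTINEL_CODE)
        return 0;               /* ordinary code point, nothing to check */
    return cs->to_unicode[code] == UNICODE_NONCHAR;
}
```

The extra lookup only happens when the sentinel is actually seen, so
the common path stays a single comparison.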

There's a remark about needing an mbcs detector.  That's not possible
in the coding category framework AFAICS; mbcs as a category is
indistinguishable from binary (although instances such as Shift JIS
and KOI8 can be detected).  Even when you go to the (statistical)
classifier framework, it looks to me like you're going to need a
different detector for _every_ mbcs coding system.  Common operations
like range checking could be factored out, but (for example) KOI8 and
EUC-JP (with uniform ranges) will do it differently from Shift JIS and
Big5 (whose ranges vary according to whether a byte is single-byte,
first-of-two, or second-of-two).  So each mbcs will have to have some
code to make its idiosyncratic checks.
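
To illustrate the difference, here is a rough sketch of two such
checks sharing one range helper.  The function names are hypothetical,
the byte ranges are the commonly documented ones (Shift JIS first
bytes taken up to 0xFC to admit vendor extensions), and the EUC-JP
check covers only the two-byte JIS X 0208 part, ignoring SS2/SS3
sequences.  It is a plausibility test, not a real detector.

```c
#include <stddef.h>

/* The "common operation" that can be factored out. */
static int in_range(unsigned char b, unsigned char lo, unsigned char hi)
{
    return b >= lo && b <= hi;
}

/* EUC-JP (JIS X 0208 part only): both bytes of a pair use one range. */
static int eucjp_plausible(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        if (s[i] < 0x80) { i++; continue; }        /* ASCII */
        if (i + 1 >= n
            || !in_range(s[i], 0xA1, 0xFE)
            || !in_range(s[i + 1], 0xA1, 0xFE))
            return 0;
        i += 2;
    }
    return 1;
}

/* Shift JIS: the legal range depends on the role of the byte. */
static int sjis_plausible(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char b = s[i];
        if (b < 0x80 || in_range(b, 0xA1, 0xDF)) { /* single byte */
            i++;
            continue;
        }
        if (!(in_range(b, 0x81, 0x9F) || in_range(b, 0xE0, 0xFC)))
            return 0;                              /* bad first byte */
        if (i + 1 >= n)
            return 0;                              /* truncated pair */
        b = s[i + 1];
        if (!(in_range(b, 0x40, 0x7E) || in_range(b, 0x80, 0xFC)))
            return 0;                              /* bad second byte */
        i += 2;
    }
    return 1;
}
```

Note that the EUC-JP bytes 0xA4 0xA2 also pass the Shift JIS check (as
two half-width katakana), which is why range checks alone cannot
settle the question.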

ISTR that there was a comment about KOI8 being assigned the ISO8
category although it's really mbcs; it might be better to make it
binary.  Hard to say.  Shift JIS and Big5, OTOH, clearly should be
conflated with binary IMO.  Would it be hard to make a no-C0-controls
category, which only allows the usual whitespace characters from the
0x00-0x1F,0x7F set?  All of the mbcs encodings I know of would fit
into that.  Even EBCDIC almost does!
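
Something like the following is what I have in mind, as a sketch only.
The function names are invented, and "usual whitespace" is assumed
here to mean TAB, LF, VT, FF and CR (0x09-0x0D); the real set would
need to be agreed on.

```c
#include <stddef.h>

/* Assumption: the "usual whitespace" is TAB, LF, VT, FF, CR. */
static int is_usual_whitespace(unsigned char b)
{
    return b >= 0x09 && b <= 0x0D;
}

/* Returns 1 if the buffer contains no C0 control or DEL other than
   the usual whitespace; bytes 0x80 and above are not examined. */
static int no_c0_controls(const unsigned char *s, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned char b = s[i];
        if ((b < 0x20 || b == 0x7F) && !is_usual_whitespace(b))
            return 0;
    }
    return 1;
}
```

As written this rejects ESC (0x1B) too, so ISO 2022 style encodings
would fall outside the category.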

Why use Dynarrs for the charset list?  An mbcs needs two charsets,
plus the minimum Dynarr overhead of 5 words.  That's the space
required by 7 charsets.  Seems rather implausible that you'd need more
than that.  Wouldn't an application with 8 or more charsets be better
served by coalescing the non-overlapping charsets in the list into a
single charset?

In configure.ac you should force Mule on if Unicode is requested, not
force Unicode off if Mule is not present.  (Barfing and saying "make
up your mind" would also be reasonable, but I really think that people
who want Unicode _really want Unicode_.)

I don't understand the logic for nuking the columns and direction
properties of charsets.  That's just going to introduce GNU
incompatibilities in table formatting and the like.

Shouldn't the no-such-character value be (Ichar) 0xFFFF, rather than
(Ichar) (-1)?  I guess your tables are shorts, but still it's ugly,
and in code where they're mixed with ints could cause trouble.
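
The kind of trouble I mean, sketched under the assumption that table
entries are 16 bits wide and get compared against an int-sized
constant (the helper names are hypothetical):

```c
#include <stdint.h>

#define NO_CHAR_FFFF 0xFFFF

/* Entry stored as a signed 16-bit value holding -1. */
static int signed_entry_matches_ffff(int16_t entry)
{
    int widened = entry;     /* -1 sign-extends to -1, not 65535 */
    return widened == NO_CHAR_FFFF;
}

/* Entry stored as an unsigned 16-bit value holding 0xFFFF. */
static int unsigned_entry_matches_ffff(uint16_t entry)
{
    int widened = entry;     /* 0xFFFF stays 65535 */
    return widened == NO_CHAR_FFFF;
}
```

With the signed form, a test against 0xFFFF silently fails once the
value has been widened, whereas the unsigned form compares equal in
both widths.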

char_table subtables need to become Lisp objects, or Marcus's new
allocator will break.

The stuff about ranges spanning charsets in regex.c pinned my
bogometer.  Ranges don't even make sense within charsets any more,
because collating order varies with locale in POSIX, and different
languages collate the same charsets in different ways.  I agree that
ranges are a very convenient UI, but we should deprecate them,
document that they do not have a universal meaning, and document what
the meaning is in XEmacs (probably To Be Announced, I think we should
avoid committing).

Is Ichar_charset_obsolete_me_baby_please gone yet?

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
