Stephen J. Turnbull wrote:
>>>>>"Ben" == Ben Wing <ben(a)666.com>
writes:
>>>>>
>>>>>
Ben> the only thing that's really "quite clear" is that you
Ben> haven't bothered to actually read any of the code i've
Ben> written.
You're absolutely right: it was a _major_ bother to read it. In the
future, check it in to CVS or I'm not going to read it.
in time.
Ben> please go do that, then come back to me with *specific*
Ben> issues referring to *actual* code, and i'll be glad to
Ben> discuss them.
Vcharset_japanese_shift_jis. Shift JIS is not a character set, and
this object is only used in the corresponding coding system. It
should not be visible from Lisp.
it's a coded character set, no? it's treated that way in other
applications. if i want to create an mbcs coding system for shift-jis,
i need to combine the shift-jis coded charset with half-width katakana
and ascii. either i hard-code it into C or i do it from lisp, at which
point you need the shift-jis charset visible. there is currently no
other method for declaring a coded charset than using the `charset'
object (that's its point, in fact). also, are you *sure* that shift-jis
is *identical* to jisx0208-anything? might it not have extra characters
in it?
I don't understand what the problem with not-a-character values for
256^N character sets is. There aren't any. I would say one way to
deal with that problem (if it really is a problem in practice) is to
pick an arbitrary code point that is unlikely to be a character or
likely to be rarely used (such as ASCII SUB, 0x1A or ASCII NUL), and
use that as the not-a-character value. If you encounter it, then look
it up in the inverse table (ie, to Unicode). If the "Unicode" value
is 0xFFFF, then it's not-a-character.
well, i did pick such a value, 0x0001 i think. i'll see about
implementing the inverse lookup trick.
There's a remark about needing an mbcs detector. That's not possible
in the coding category framework AFAICS; mbcs as a category is
indistinguishable from binary (although instances such as Shift JIS
and KOI8 can be detected). Even when you go to the (statistical)
classifier framework, it looks to me like you're going to need a
different detector for _every_ mbcs coding system. Common operations
like range checking could be factored out, but (for example) KOI8 and
EUC-JP (with uniform ranges) will do it differently from Shift JIS and
Big5 (which have ranges which vary according to whether they are
single-byte, first-of-two, or second-of-two). So each mbcs will have
to have some code to make its idiosyncratic checks.
right. in general, every coding system needs its own detector (mbcs or
not). the current system is extremely bogus.
ISTR that there was a comment about KOI8 being assigned the ISO8
category although it's really mbcs; it might be better to make it
binary. Hard to say. Shift JIS and Big5, OTOH, clearly should be
conflated with binary IMO. Would it be hard to make a no-C0-controls
category, which only allows the usual whitespace characters from the
0x00-0x1F,0x7F set? All of the mbcs encodings I know of would fit
into that. Even EBCDIC almost does!
well, that comment was left over from the previous implementation of
koi8 as ccl; it used to say "this should really be in the ccl category
but xemacs doesn't have one". of course, a ccl category is totally
bogus, as bogus as an mbcs category.
it would indeed be possible to make a no-c0-controls category.
actually, there's already code doing checks like this in some
of the detectors, to distinguish them from binary.
in general, there needs to be a separation of detectors from categories;
and in fact categories should disappear entirely. instead we should
just have one "category" per coding system, and each detector assigns
some likelihood to all the coding systems that it knows how to detect.
a simple mbcs detector, for example, would check the characters that go
by, for each existing mbcs coding system, to see whether they are
defined in that coding system, and based on this, assign some likelihood
to all mbcs coding systems. we could then break ties based on some
preferred list of coding systems.
Why use Dynarrs for the charset list? An mbcs needs two charsets,
plus the minimum Dynarr overhead of 5 words. That's the space
required by 7 charsets. Seems rather implausible that you'd need more
than that. Wouldn't an application with 8 or more charsets be better
served by coalescing the non-overlapping charsets in the list into a
single charset?
well, it doesn't hurt to use dynarrs. there won't be anywhere near enough mbcs
coding systems for the space used by them to make any difference, and
i'd rather not build in arbitrary limits if there's no need to.
In configure.ac you should force Mule on if Unicode is requested, not
force Unicode off if Mule is not present. (Barfing and saying "make
up your mind" would also be reasonable, but I really think that people
who want Unicode _really want Unicode_.)
will do.
I don't understand the logic for nuking the columns and direction
properties of charsets. That's just going to introduce GNU
incompatibilities in table formatting and the like.
l2r is the default; likewise the default for the columns is the
dimension of the charset. i'm not nuking these properties but just
eliminating unnecessary arguments to simplify the code.
Shouldn't the no-such-character value be (Ichar) 0xFFFF, rather than
(Ichar) (-1)? I guess your tables are shorts, but still it's ugly,
and in code where they're mixed with ints could cause trouble.
you're confusing Ichars with the indices on the to-charset tables. the
to-charset tables use unsigned shorts to represent the two octets in a
charset codepoint. Ichars represent actual characters (unicode or
old-mule), and are ints. since all valid character values are
non-negative, a negative number is a good choice for "no such
character", but 0xFFFF or any other positive number isn't, even if
0xFFFF is theoretically "not a character" under Unicode.
char_table subtables need to become Lisp objects, or Marcus's new
allocator will break.
right, i did this last night.
The stuff about ranges spanning charsets in regex.c pinned my
bogometer. Ranges don't even make sense within charsets any more,
because collating order varies with locale in POSIX, and different
languages collate the same charsets in different ways. I agree that
ranges are a very convenient UI, but we should deprecate them,
document that they do not have a universal meaning, and document what
the meaning is in XEmacs (probably To Be Announced, I think we should
avoid committing).
sure, collating order varies, but regexps are about searching, not
sorting. a range over some part of a national charset specifies a
well-defined set of characters. a range over different charsets makes
no sense since charsets themselves are unordered wrt each other. a
range over any subset of unicode is also well-defined. there's no
reason to deprecate ranges.
Is Ichar_charset_obsolete_me_baby_please gone yet?
not until i rewrite the font-handling code. that won't happen till
after i get the current code working and checked in.
ben