Stephen J. Turnbull wrote:
>>>>> "Ben" == Ben Wing <ben@666.com> writes:
Ben> why is it a waste of time? took me half an hour or so.
Because other people (e.g., me) could do it in 45 minutes the first
time, and the same half-hour it takes you for the next one.
no, actually it took me half an hour to add *all* the windows charsets
-- at least, all the ones that we already had tables for in
etc/unicode. adding new ones will take a minute or so. it was useful,
among other things, because it clarified what needs to be done with
national character sets.
Ben> mbcs is used for implementing things like the koi8-r coding
Ben> system under unicode-internal.
The KOI8 coded character sets are unibyte and have no mode shifts, and
at most 256 characters. Why not simply have one table?
it will, at some point. but that requires some more work to remove
assumptions about ascii-compatibility while still making ascii
conversion efficient.
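to illustrate what `one table' means here -- a sketch in c, with
made-up names; the real table would be generated from the mappings in
etc/unicode:

#include <stddef.h>

typedef unsigned int uchar32;   /* stand-in for our internal char type */

/* only 0x80-0xff matter; the ascii half never gets looked up because
   of the fast path below.  just a couple of (real) koi8-r entries
   shown; the full table would be generated from etc/unicode. */
static const uchar32 koi8_r_to_unicode[256] = {
  [0xC0] = 0x044E,              /* CYRILLIC SMALL LETTER YU */
  [0xC1] = 0x0430,              /* CYRILLIC SMALL LETTER A */
  /* ... */
};

static void
decode_unibyte (const unsigned char *src, size_t n, uchar32 *dst)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned char c = src[i];
      /* ascii fast path: no table lookup */
      dst[i] = (c < 0x80) ? c : koi8_r_to_unicode[c];
    }
}

the fast path is exactly where the ascii-compatibility assumption
creeps in: you want to skip the table for the 7-bit case without
baking that assumption into every coding system.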
Ben> it can also replace the hand-coded big5 and shift-jis coding
Ben> systems, in unicode-internal.
Sure. But as you say, we've already got implementations of those, and
they're going to go away over time.
and there are others that work this way; johab, for example.
Ben> any ideas? (if this doesn't work, i'm sure there are gpl-ed
Ben> utf-7 implementations available.)
I doubt the implementation in the Unicode book is efficient or robust,
and there's no error handling in it. I'm sure there's one in Emacs
and another in gconv (glibc's implementation of iconv). Python has
one. Surely Perl and Ruby do.
actually, the unicode book's implementation of utf-8 is highly
optimized and i took some ideas from it; i'd take more if i knew it
was kosher. their utf-7 implementation does handle, or at least
report, errors; besides, we have no effective error-handling scheme.
(btw python doesn't do much better, i don't think: it either throws
an error on any problem or ignores it and copes as best it can. we
could do the same, but just throwing an error and aborting isn't very
helpful. actually, in python the situation is worse, since we at
least have call-with-condition-handler, which allows continuing from
a thrown error; i don't think python has any equivalent.)
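to make the two strategies concrete, here's a rough sketch in c (not
our actual api -- the names are made up, and the utf-8 validation is
simplified; it doesn't reject overlongs or surrogates):

#include <stddef.h>

enum error_mode { ERROR_SIGNAL, ERROR_REPLACE };

/* returns the number of code points written, or (size_t) -1 if
   mode == ERROR_SIGNAL and the input was malformed. */
static size_t
decode_utf8 (const unsigned char *src, size_t n,
             unsigned int *dst, enum error_mode mode)
{
  size_t i = 0, out = 0;
  while (i < n)
    {
      unsigned char c = src[i];
      unsigned int cp;
      size_t len, j;

      len = (c < 0x80) ? 1
        : (c & 0xE0) == 0xC0 ? 2
        : (c & 0xF0) == 0xE0 ? 3
        : (c & 0xF8) == 0xF0 ? 4 : 0;
      if (len == 0 || i + len > n)
        goto bad;
      cp = (len == 1) ? c : (unsigned int) (c & (0x7F >> len));
      for (j = 1; j < len; j++)
        {
          if ((src[i + j] & 0xC0) != 0x80)
            goto bad;
          cp = (cp << 6) | (unsigned int) (src[i + j] & 0x3F);
        }
      dst[out++] = cp;
      i += len;
      continue;
    bad:
      if (mode == ERROR_SIGNAL)
        return (size_t) -1;     /* caller signals a lisp error */
      dst[out++] = 0xFFFD;      /* substitute replacement character */
      i++;
    }
  return out;
}

with call-with-condition-handler, the ERROR_SIGNAL case could even
offer to continue from the bad byte instead of just aborting.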
Ben> i can implement this if you can tell me the names and
Ben> encodings that are typically used in these segments. the x
Ben> standard only defines the general format of extended segments
Ben> and doesn't say what is actually encoded in them.
According to the standard, anything with an agreed name that isn't in
the list (i.e., iso8859-14 and iso8859-15 violate XF86's own standard,
and UTF-8 should be in there).
The elegant way to implement it would be to treat it as a buffer and
translate it using a new lstream, parsing the name out of the extended
segment header and using that to determine the coding system.
i think that's more or less what handa does. but it seems low-priority
to me.
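for the record, the header parsing itself is easy. a sketch in c,
going by the extended-segment format in the compound text spec
(ESC % / F M L <name> STX <data>, where F is '0'..'4' and the payload
length has its two bytes' high bits set):

#include <stddef.h>
#include <string.h>

/* returns the length of <data> and fills in the name/data pointers,
   or -1 on a malformed segment.  src must point at the ESC. */
static long
parse_extended_segment (const unsigned char *src, size_t n,
                        const unsigned char **name, size_t *name_len,
                        const unsigned char **data)
{
  if (n < 6 || src[0] != 0x1B || src[1] != '%' || src[2] != '/'
      || src[3] < '0' || src[3] > '4'
      || !(src[4] & 0x80) || !(src[5] & 0x80))
    return -1;
  /* payload length covers name + STX + data */
  size_t payload = (size_t) (((src[4] & 0x7F) << 7) | (src[5] & 0x7F));
  if (payload > n - 6)
    return -1;
  const unsigned char *p = src + 6;
  const unsigned char *stx = memchr (p, 0x02, payload);
  if (!stx)
    return -1;
  *name = p;
  *name_len = (size_t) (stx - p);   /* e.g. "big5-0" or "iso8859-15" */
  *data = stx + 1;
  return (long) (payload - *name_len - 1);
}

the caller would then look up a coding system by that name and push
the data through it via an lstream, as you describe.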
Ben> we already have a gzip coding system. we also have base64
Ben> functions but not yet converted to a coding system (not too
Ben> hard to do, though). internally, i already generalized
Ben> coding systems (some time ago, in fact) to be typed for
Ben> either bytes or characters at either end; there's also a
Ben> `chain' coding system for stringing multiple coding systems
Ben> together.
Yeah, I'm aware of all that, but again it's mostly stuff that somebody
else can do, except that it would be really nice if the lstreams and
chain coding systems were exposed to LISP somehow.
the chain coding system *is* exposed to lisp. when you say you want
lstreams exposed, is this so people can write their own coding systems? there's
a proposal that i made for doing this more-or-less efficiently at the
lisp level; it's sitting around somewhere in internals.texi or text.c.
ideally, this would be as efficient as ccl, if the lisp is coded
reasonably carefully.
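conceptually a chain is just staged transforms. a toy byte-level
illustration in c (nothing like the real api -- real coding systems
are typed bytes-or-characters at each end, and lstreams process data
in chunks rather than whole buffers):

#include <stddef.h>

typedef size_t (*transform_fn) (const unsigned char *in, size_t in_len,
                                unsigned char *out, size_t out_cap);

struct chain
{
  transform_fn *stages;         /* e.g. base64-decode, then gunzip */
  size_t n_stages;
};

/* run every stage in order, ping-ponging between two scratch
   buffers; returns a pointer to the final output and its length. */
static const unsigned char *
chain_decode (const struct chain *c,
              const unsigned char *in, size_t in_len,
              unsigned char *buf_a, unsigned char *buf_b,
              size_t cap, size_t *out_len)
{
  const unsigned char *src = in;
  unsigned char *dst = buf_a;
  size_t len = in_len;
  size_t i;
  for (i = 0; i < c->n_stages; i++)
    {
      len = c->stages[i] (src, len, dst, cap);
      src = dst;                        /* this stage's output feeds the next */
      dst = (dst == buf_a) ? buf_b : buf_a;
    }
  *out_len = len;
  return src;
}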
Ben> keep in mind that i've already done most of the work you're
Ben> describing here. i think we're talking past each other; at
Ben> any rate, you seem to think i'm more confused than i am.
I'm sure you know exactly what you're doing, in the small. I will
look at the code asap, but your verbal descriptions do not inspire
confidence that what I will find is going to be a GNU-beater in
practice. GNU has more than one person seriously working on their
Mule implementation, and any of the senior developers is reasonably
comfortable trying to diagnose and even fix bugs. That's simply not
true for current XEmacs, and you're emphasizing backward
compatibility. Call it back-seat driving if you like, but somebody
needs to tell you about that tree looming in front of the windshield.
well, take a look. my verbal descriptions are often elliptical because
it's difficult for me to type long, detailed emails. but i keep feeling
like you are worried about issues i'm already dealing with.
Ben> actually, what would really help is if you could take a look
Ben> at emacs-unicode-2, figure out what their api is, and
Ben> summarize it. this would be extremely useful to me.
I was afraid you'd say something like that. Maybe somebody left some
notes in Japanese....
well, do your best :)
ben