Stephen J. Turnbull wrote:
>>>>> "Ben" == Ben Wing <ben@666.com> writes:
Ben> why is it a waste of time? took me half an hour or so.
Because other people (e.g., me) could do it in 45 minutes the first
time, and the same half-hour it takes you for the next one.
no, actually it took me half an hour to add *all* the windows charsets
-- at least, all the ones that we already had tables for in
etc/unicode. adding new ones will take a minute or so. it was useful,
among other things, because it clarified what needs to be done with
national character sets.
Ben> mbcs is used for implementing things like the koi8-r coding
Ben> system under unicode-internal.
The KOI8 coded character sets are unibyte and have no mode shifts, and
at most 256 characters. Why not simply have one table?
it will, at some point. but that requires some more work to remove
assumptions about ascii-compatibility while still making ascii
conversion efficient.
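to illustrate what `one table' means here -- a sketch in c, with
made-up names; the real table would be generated from the mappings in
etc/unicode:

#include <stddef.h>

typedef unsigned int uchar32;   /* stand-in for our internal char type */

/* only 0x80-0xff matter; the ascii half never gets looked up because
   of the fast path below.  just a couple of (real) koi8-r entries
   shown; the full table would be generated from etc/unicode. */
static const uchar32 koi8_r_to_unicode[256] = {
  [0xC0] = 0x044E,              /* CYRILLIC SMALL LETTER YU */
  [0xC1] = 0x0430,              /* CYRILLIC SMALL LETTER A */
  /* ... */
};

static void
decode_unibyte (const unsigned char *src, size_t n, uchar32 *dst)
{
  size_t i;
  for (i = 0; i < n; i++)
    {
      unsigned char c = src[i];
      /* ascii fast path: no table lookup */
      dst[i] = (c < 0x80) ? c : koi8_r_to_unicode[c];
    }
}

the fast path is exactly where the ascii-compatibility assumption
creeps in: you want to skip the table for the 7-bit case without
baking that assumption into every coding system.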
Ben> it can also replace the hand-coded big5 and shift-jis coding
Ben> systems, in unicode-internal.
Sure. But as you say, we've already got implementations of those, and
they're going to go away over time.
and there are others that work this way; johab, for example.
Ben> any ideas? (if this doesn't work, i'm sure there are gpl-ed
Ben> utf-7 implementations available.)
I doubt the implementation in the Unicode book is efficient or robust,
and there's no error handling in it. I'm sure there's one in Emacs
and another in gconv (glibc's implementation of iconv). Python has
one. Surely Perl and Ruby do.
actually, the unicode book's implementation of utf-8 is highly
optimized and i took some ideas from it; i'd take more if i knew it
was kosher. their utf-7 implementation does handle, or at least
report, errors; besides, we have no effective error-handling scheme.
(btw python doesn't do much better, i don't think: it either throws
an error on any problem or ignores it and copes as best it can. we
could do the same, but just throwing an error and aborting isn't very
helpful. actually, in python the situation is worse, since we at
least have call-with-condition-handler, which allows continuing from
a thrown error; i don't think python has any equivalent.)
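to make the two strategies concrete, here's a rough sketch in c (not
our actual api -- the names are made up, and the utf-8 validation is
simplified; it doesn't reject overlongs or surrogates):

#include <stddef.h>

enum error_mode { ERROR_SIGNAL, ERROR_REPLACE };

/* returns the number of code points written, or (size_t) -1 if
   mode == ERROR_SIGNAL and the input was malformed. */
static size_t
decode_utf8 (const unsigned char *src, size_t n,
             unsigned int *dst, enum error_mode mode)
{
  size_t i = 0, out = 0;
  while (i < n)
    {
      unsigned char c = src[i];
      unsigned int cp;
      size_t len, j;

      len = (c < 0x80) ? 1
        : (c & 0xE0) == 0xC0 ? 2
        : (c & 0xF0) == 0xE0 ? 3
        : (c & 0xF8) == 0xF0 ? 4 : 0;
      if (len == 0 || i + len > n)
        goto bad;
      cp = (len == 1) ? c : (unsigned int) (c & (0x7F >> len));
      for (j = 1; j < len; j++)
        {
          if ((src[i + j] & 0xC0) != 0x80)
            goto bad;
          cp = (cp << 6) | (unsigned int) (src[i + j] & 0x3F);
        }
      dst[out++] = cp;
      i += len;
      continue;
    bad:
      if (mode == ERROR_SIGNAL)
        return (size_t) -1;     /* caller signals a lisp error */
      dst[out++] = 0xFFFD;      /* substitute replacement character */
      i++;
    }
  return out;
}

with call-with-condition-handler, the ERROR_SIGNAL case could even
offer to continue from the bad byte instead of just aborting.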
Ben> i can implement this if you can tell me the names and
Ben> encodings that are typically used in these segments. the x
Ben> standard only defines the general format of extended segments
Ben> and doesn't say what is actually encoded in them.
According to the standard, anything with an agreed name that isn't in
the list (i.e., iso8859-14 and iso8859-15 violate XF86's own standard,
and UTF-8 should be in there).
The elegant way to implement it would be to treat it as a buffer and
translate it using a new lstream, parsing the name out of the extended
segment header and using that to determine the coding system.
i think that's more or less what handa does. but it seems low-priority
to me.
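for the record, the header parsing itself is easy. a sketch in c,
going by the extended-segment format in the compound text spec
(ESC % / F M L <name> STX <data>, where F is '0'..'4' and the payload
length has its two bytes' high bits set):

#include <stddef.h>
#include <string.h>

/* returns the length of <data> and fills in the name/data pointers,
   or -1 on a malformed segment.  src must point at the ESC. */
static long
parse_extended_segment (const unsigned char *src, size_t n,
                        const unsigned char **name, size_t *name_len,
                        const unsigned char **data)
{
  if (n < 6 || src[0] != 0x1B || src[1] != '%' || src[2] != '/'
      || src[3] < '0' || src[3] > '4'
      || !(src[4] & 0x80) || !(src[5] & 0x80))
    return -1;
  /* payload length covers name + STX + data */
  size_t payload = (size_t) (((src[4] & 0x7F) << 7) | (src[5] & 0x7F));
  if (payload > n - 6)
    return -1;
  const unsigned char *p = src + 6;
  const unsigned char *stx = memchr (p, 0x02, payload);
  if (!stx)
    return -1;
  *name = p;
  *name_len = (size_t) (stx - p);   /* e.g. "big5-0" or "iso8859-15" */
  *data = stx + 1;
  return (long) (payload - *name_len - 1);
}

the caller would then look up a coding system by that name and push
the data through it via an lstream, as you describe.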
Ben> we already have a gzip coding system. we also have base64
Ben> functions but not yet converted to a coding system (not too
Ben> hard to do, though). internally, i already generalized
Ben> coding systems (some time ago, in fact) to be typed for
Ben> either bytes or characters at either end; there's also a
Ben> `chain' coding system for stringing multiple coding systems
Ben> together.
Yeah, I'm aware of all that, but again it's mostly stuff that somebody
else can do, except that it would be really nice if the lstreams and
chain coding systems were exposed to LISP somehow.
the chain coding system *is* exposed to lisp. when you say you want
lstreams exposed, is this so people can write their own coding systems? there's
a proposal that i made for doing this more-or-less efficiently at the
lisp level; it's sitting around somewhere in internals.texi or text.c.
ideally, this would be as efficient as ccl, if the lisp is coded
reasonably carefully.
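conceptually a chain is just staged transforms. a toy byte-level
illustration in c (nothing like the real api -- real coding systems
are typed bytes-or-characters at each end, and lstreams process data
in chunks rather than whole buffers):

#include <stddef.h>

typedef size_t (*transform_fn) (const unsigned char *in, size_t in_len,
                                unsigned char *out, size_t out_cap);

struct chain
{
  transform_fn *stages;         /* e.g. base64-decode, then gunzip */
  size_t n_stages;
};

/* run every stage in order, ping-ponging between two scratch
   buffers; returns a pointer to the final output and its length. */
static const unsigned char *
chain_decode (const struct chain *c,
              const unsigned char *in, size_t in_len,
              unsigned char *buf_a, unsigned char *buf_b,
              size_t cap, size_t *out_len)
{
  const unsigned char *src = in;
  unsigned char *dst = buf_a;
  size_t len = in_len;
  size_t i;
  for (i = 0; i < c->n_stages; i++)
    {
      len = c->stages[i] (src, len, dst, cap);
      src = dst;                        /* this stage's output feeds the next */
      dst = (dst == buf_a) ? buf_b : buf_a;
    }
  *out_len = len;
  return src;
}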
Ben> keep in mind that i've already done most of the work you're
Ben> describing here. i think we're talking past each other; at
Ben> any rate, you seem to think i'm more confused than i am.
I'm sure you know exactly what you're doing, in the small. I will
look at the code asap, but your verbal descriptions do not inspire
confidence that what I will find is going to be a GNU-beater in
practice. GNU has more than one person seriously working on their
Mule implementation, and any of the senior developers is reasonably
comfortable trying to diagnose and even fix bugs. That's simply not
true for current XEmacs, and you're emphasizing backward
compatibility. Call it back-seat driving if you like, but somebody
needs to tell you about that tree looming in front of the windshield.
well, take a look. my verbal descriptions are often elliptical because
it's difficult for me to type long, detailed emails. but i keep feeling
like you are worried about issues i'm already dealing with.
Ben> actually, what would really help is if you could take a look
Ben> at emacs-unicode-2, figure out what their api is, and
Ben> summarize it. this would be extremely useful to me.
I was afraid you'd say something like that. Maybe somebody left some
notes in Japanese....
well, do your best :)
ben