Re: changing the values of iso-8859-* charsets

Wednesday, 2 November 2005

        ...
>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> why is it a waste of time?  took me half an hour or so.

Because other people (eg, me) could do it in 45 minutes the first
time, and the same half-hour it takes you for the next one.

    Ben> mbcs is used for implementing things like the koi8-r coding
    Ben> system under unicode-internal.

The KOI8 coded character sets are unibyte and have no mode shifts, and
at most 256 characters.  Why not simply have one table?
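
A minimal sketch of the one-table idea, in Python for brevity rather
than anything taken from XEmacs. It assumes Python's built-in "koi8_r"
codec as the source of the mapping data; the names DECODE, ENCODE,
decode_koi8r and encode_koi8r are made up for illustration.

```python
# One 256-entry table is enough for a unibyte charset with no mode shifts.
# Assumption: Python's built-in "koi8_r" codec supplies the mapping data.
DECODE = tuple(bytes([b]).decode("koi8_r") for b in range(256))

# The reverse mapping is derived from the same table, so the two directions
# cannot drift apart.
ENCODE = {ch: b for b, ch in enumerate(DECODE)}


def decode_koi8r(data: bytes) -> str:
    """Decode by indexing the table once per input byte."""
    return "".join(DECODE[b] for b in data)


def encode_koi8r(text: str) -> bytes:
    """Encode by reverse lookup; reject characters outside the charset."""
    try:
        return bytes(ENCODE[ch] for ch in text)
    except KeyError as exc:
        raise ValueError(f"not representable in KOI8-R: {exc.args[0]!r}") from None
```

Decoding is one index per byte and encoding is one lookup per
character, with no state carried between them.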

    Ben> it can also replace the hand-coded big5 and shift-jis coding
    Ben> systems, in unicode-internal.

Sure.  But as you say, we've already got implementations of those, and
they're going to go away over time.

    Ben> any ideas? (if this doesn't work, i'm sure there are gpl-ed
    Ben> utf-7 implementations available.)

I doubt the implementation in the Unicode book is efficient or robust,
and there's no error handling in it.  I'm sure there's one in Emacs
and another in gconv (glibc's implementation of iconv).  Python has
one.  Surely Perl and Ruby do.

    Ben> i can implement this if you can tell me the names and
    Ben> encodings that are typically used in these segments.  the x
    Ben> standard only defines the general format of extended segments
    Ben> and doesn't say what is actually encoded in them.

According to the standard, anything with an agreed name that isn't in
the list (ie, iso8859-14 and iso8859-15 violate XF86's own standard,
and UTF-8 should be in there).

The elegant way to implement it would be to treat it as a buffer and
translate it using a new lstream, parsing the name out of the extended
segment header and using that to determine the coding system.
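
Roughly, the header parsing could look like the sketch below (Python,
not lstream code). It assumes the Compound Text framing as I understand
it: ESC % / F M L, where F is 0x30-0x34, M and L carry the length of
the rest of the segment as (M-128)*128 + (L-128), and the name is
Latin-1 text terminated by STX. The CODECS table is hypothetical; the
real set of agreed names is exactly the open question.

```python
ESC_EXT = b"\x1b%/"   # ESC 0x25 0x2F introduces an extended segment
STX = b"\x02"         # terminates the encoding name

# Hypothetical mapping from segment names to Python codec names.
CODECS = {"iso8859-14": "iso8859_14", "iso8859-15": "iso8859_15"}


def parse_extended_segment(buf: bytes, pos: int = 0):
    """Return (name, payload, next_pos) for the segment starting at pos."""
    if buf[pos:pos + 3] != ESC_EXT or len(buf) < pos + 6:
        raise ValueError("not an extended segment")
    f, m, l = buf[pos + 3], buf[pos + 4], buf[pos + 5]
    if not 0x30 <= f <= 0x34 or m < 0x80 or l < 0x80:
        raise ValueError("malformed extended segment header")
    length = (m - 0x80) * 128 + (l - 0x80)  # covers name + STX + payload
    start = pos + 6
    end = start + length
    if end > len(buf):
        raise ValueError("truncated extended segment")
    name, sep, payload = buf[start:end].partition(STX)
    if not sep:
        raise ValueError("missing STX after encoding name")
    return name.decode("latin-1"), payload, end


def decode_extended_segment(buf: bytes, pos: int = 0):
    """Pick the codec from the parsed name and decode the payload."""
    name, payload, end = parse_extended_segment(buf, pos)
    codec = CODECS.get(name.lower())
    if codec is None:
        raise LookupError(f"no coding system agreed for {name!r}")
    return payload.decode(codec), end
```

Returning the end offset lets the caller resume ordinary parsing right
after the segment.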

    Ben> we already have a gzip coding system.  we also have base64
    Ben> functions but not yet converted to a coding system (not too
    Ben> hard to do, though).  internally, i already generalized
    Ben> coding systems (some time ago, in fact) to be typed for
    Ben> either bytes or characters at either end; there's also a
    Ben> `chain' coding system for stringing multiple coding systems
    Ben> together.

Yeah, I'm aware of all that, but again it's mostly stuff that somebody
else can do, except that it would be really nice if the lstreams and
chain coding systems were exposed to LISP somehow.
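
To make "exposed somehow" concrete, here is a sketch of the kind of
composition I have in mind, written in Python because that is quick to
read; it is not the XEmacs API. Coding, chain, UTF8, GZIP and BASE64
are names invented for the example, built on Python's standard gzip and
base64 modules.

```python
import base64
import gzip
from typing import Callable, NamedTuple


class Coding(NamedTuple):
    """A coding system as a named pair of inverse functions."""
    name: str
    encode: Callable
    decode: Callable


# Characters at one end, bytes at the other.
UTF8 = Coding("utf-8", lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
# Bytes at both ends.
GZIP = Coding("gzip", gzip.compress, gzip.decompress)
BASE64 = Coding("base64", base64.b64encode, base64.b64decode)


def chain(*codings: Coding) -> Coding:
    """String several coding systems together into one."""
    def encode(value):
        for coding in codings:            # apply left to right
            value = coding.encode(value)
        return value

    def decode(value):
        for coding in reversed(codings):  # undo right to left
            value = coding.decode(value)
        return value

    return Coding("+".join(c.name for c in codings), encode, decode)
```

With that, chain(UTF8, GZIP, BASE64) is itself a Coding, so chains nest
and can be passed anywhere a single coding system is accepted.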

    Ben> keep in mind that i've already done most of the work you're
    Ben> describing here.  i think we're talking past each other; at
    Ben> any rate, you seem to think i'm more confused than i am.

I'm sure you know exactly what you're doing, in the small.  I will
look at the code asap, but your verbal descriptions do not inspire
confidence that what I will find is going to be a GNU-beater in
practice.  GNU has more than one person seriously working on their
Mule implementation, and any of the senior developers is reasonably
comfortable trying to diagnose and even fix bugs.  That's simply not
true for current XEmacs, and you're emphasizing backward
compatibility.  Call it back-seat driving if you like, but somebody
needs to tell you about that tree looming in front of the windshield.

    Ben> actually, what would really help is if you could take a look
    Ben> at emacs-unicode-2, figure out what their api is, and
    Ben> summarize it.  this would be extremely useful to me.

I was afraid you'd say something like that.  Maybe somebody left some
notes in Japanese....

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
