>>>>> "Ben" == Ben Wing <ben@666.com> writes:
Ben> i really don't understand your sarcastic attitude, or what
Ben> point you're trying to make.
What sarcastic attitude, aside from the comment about rants? Even
that, I'm sorry you took it that way, but I think it's a huge waste of
your time to be defining "latin-windows-1252" character sets at this
point. We should provide an API that allows any reasonably
experienced Emacs user to do that, no? And ISO 8859-1 is close enough
for prototyping purposes. In fact, it has the "advantage" that people
_will_ use directed quotes and suchlike, and so we'll get a natural
test of error handling.
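For what it's worth, such an API could be as thin as a byte-to-code-point
table.  A Python sketch of the idea (the real thing would of course be
Lisp-visible; the names here are made up, and undefined bytes are simply
passed through rather than signalling an error):

```python
# windows-1252 is latin-1 except in the 0x80-0x9F range, where Microsoft
# placed printable characters: the euro sign, directed quotes, and so on.
# So "defining" it is nothing more than writing down this override table.
CP1252_OVERRIDES = {
    0x80: 0x20AC,  # EURO SIGN
    0x82: 0x201A, 0x83: 0x0192, 0x84: 0x201E, 0x85: 0x2026,
    0x86: 0x2020, 0x87: 0x2021, 0x88: 0x02C6, 0x89: 0x2030,
    0x8A: 0x0160, 0x8B: 0x2039, 0x8C: 0x0152, 0x8E: 0x017D,
    0x91: 0x2018, 0x92: 0x2019,  # directed single quotes
    0x93: 0x201C, 0x94: 0x201D,  # directed double quotes
    0x95: 0x2022, 0x96: 0x2013, 0x97: 0x2014, 0x98: 0x02DC,
    0x99: 0x2122, 0x9A: 0x0161, 0x9B: 0x203A, 0x9C: 0x0153,
    0x9E: 0x017E, 0x9F: 0x0178,
}

def decode_cp1252(data: bytes) -> str:
    """Table-driven decode: latin-1 identity plus the overrides.
    Bytes undefined in cp1252 (0x81, 0x8D, ...) fall through unchanged,
    which is where the error-handling test mentioned above would bite."""
    return "".join(chr(CP1252_OVERRIDES.get(b, b)) for b in data)
```

The point being that the hard part is not the table, it is deciding the
error-handling policy for the holes in it.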
Ben> "charset" as i have defined it is a set of characters,
Ben> indexed by one or two bytes. the indices are as defined in
Ben> the unicode translation tables.
Please don't call it "charset" if all it is is a table. The word
"charset" has lots of associations, some (like "94N") that we would
really like to imprison in a very small area of the implementation,
and some that we really want gone, period (like "leading byte"). So
you are guaranteed to get frequent communication failures with other
Emacs developers if you insist on using that word.
I also think that your statements about what you're doing are totally
confusing. Specifically:
Ben> i see no purpose in creating a totally new concept rather
Ben> than extending the current `charset' concept.
But that's my point! According to your descriptions, you _are_
creating a totally new concept. You're _not_ extending the Mule
charset concept, you're substituting a _completely different
definition_. Specifically, there's no way that what you're talking
about can be used in legacy Mule, right? It lacks the leading byte
and graphic properties, for one, and the indices you describe are
presumably platform integers, which are unsuitable for working with ISO
2022 multibyte character sets because they are always big-endian.
Ben> there will be a new `mbcs' coding system type that just
Ben> encodes one or more charsets using their indices, in the
Ben> obvious fashion;
There's a lot that's unobvious about such a thing. Most important,
what good is it? Where will it be used?
As I see it, internally there is Unicode, an unordered collection of
abstract characters.  Where is there a role for `mbcs'?
Externally, the "or more" part is unusable until you define a protocol
for switching "new-style charsets". The "in the new world, you can
have an arbitrary number of charsets, with any characters you want in
them" aspect of "new-style charsets" is unusable until you define a
protocol for defining them on the wire. But this is starting to sound
like ISO 2022.
*****
How would I do this?
As far as I can see there are basically five kinds of external coding
systems we will want to deal with. Unicode transformation formats,
Unicode subsets with an associated invertible coding table, nontrivial
versions of ISO 2022, other text, and non-text.
(1) *Unicode transformation formats* are algorithmic and well-defined.
We have most of the code we need (AFAIK the only Unicode-standard UTF
missing is UTF-7, although there are a couple of weirdos out there,
e.g., from China).  The tedious parts will be getting endianness and
Unicode signatures right, although AFAIK that's well-enough done in
current XEmacs Mule.
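Getting signatures right is mostly a matter of checking the BOMs in the
correct order.  A Python sketch of the detection step (function name
invented; note that the UTF-32 little-endian signature begins with the
UTF-16 one, so order matters):

```python
import codecs

def sniff_unicode_signature(data: bytes):
    """Return (codec_name, signature_length) for a Unicode signature.
    UTF-32 signatures must be tested before UTF-16, because
    BOM_UTF32_LE (FF FE 00 00) starts with BOM_UTF16_LE (FF FE)."""
    for bom, name in ((codecs.BOM_UTF32_LE, "utf-32-le"),
                      (codecs.BOM_UTF32_BE, "utf-32-be"),
                      (codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return name, len(bom)
    return "utf-8", 0  # no signature; a plausible default
```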
(2) *Unicode subsets* (with associated tables) include all the members
of the ISO 8859 family, KOI8, the IBM and Microsoft character sets,
and Shift JIS. IIRC there are some dupes in Big5, but also AFAIK Big5
can in practice be subsetted to the invertible part. It's possible
that some EUC coding systems can actually be treated as Unicode
subsets (if they are defined or used in such a way that ISO 2022
character set designations and invocations are never used).
Compressing and sharing tables may be worth some effort, but these are
optimizations; abstractly tables are trivial. The most important
thing will be to take care with endianness issues, since with the
exception of the non-endian and specifically little-endian Unicode
formats, all external coding systems that I know of are big-endian.
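The invertibility requirement can be made concrete.  A Python sketch
(`make_codec' is a made-up name, not an existing interface): the encoder
is just the decoding table run in reverse, which only works if the table
has no duplicate code points, which is exactly the Big5 caveat.

```python
def make_codec(table: dict[int, int]):
    """Build (decode, encode) from an external-index -> code-point table.
    Raises if the table is not invertible (duplicate code points)."""
    inverse = {cp: idx for idx, cp in table.items()}
    if len(inverse) != len(table):
        raise ValueError("table is not invertible; subset it first")

    def decode(data: bytes) -> str:
        return "".join(chr(table[b]) for b in data)

    def encode(text: str) -> bytes:
        return bytes(inverse[ord(c)] for c in text)

    return decode, encode
```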
(3) *ISO 2022* is basically hopeless for full implementation, but
fortunately most of the important versions are already well-defined
and implemented in XEmacs. The only important problem I know of is
extended segments in X Compound Text. And this stuff should (except
for cut and paste on X) rapidly decrease in importance (except
possibly in Japan, where it may take longer).
(4) *Other text*.  I don't know of any important ones offhand,
unless Microsoft, Apple, or IBM provide non-ISO-2022 ways of changing
code pages in a file. But they're all very much on board the Unicode
bandwagon---non-Unicode multilingual text should diminish rapidly for
those platforms, too. So the main issue that doesn't fit into the
above is private characters.
(5) *Non-text*. Binary transformation formats, like deflate and
BASE64. These are octet-stream to octet-stream transformations, so
the implementation will be different from the codecs. Otherwise I
don't suppose there's much commonality here.
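To illustrate why these live outside the codec framework, a minimal
Python sketch (function names invented): the transforms are
bytes-to-bytes functions and compose as such, with nothing text-like
about them.

```python
import base64
import zlib

def encode_binary(data: bytes) -> bytes:
    """Octet stream -> octet stream: deflate, then BASE64."""
    return base64.b64encode(zlib.compress(data))

def decode_binary(data: bytes) -> bytes:
    """Inverse: un-BASE64, then inflate."""
    return zlib.decompress(base64.b64decode(data))
```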
I would start by implementing (1) and (2), and maybe an API for (5).
(3) can be implemented ad hoc by others: just port the Mule algorithms
to the new framework, perhaps in less generality if the lstream
interfaces and the like change, or maybe just steal them wholesale if
it's just a matter of marshalling arguments. Except for Microsoft
code pages (if there's a way to switch them midstream) (4) can wait
until we see a need.
Of course (2) requires named tables. Don't we already have those?
Tables have a very simple API. With a little care we should be able
to implement that API so that if we need something that's more space
efficient we can do that later without changing the API, right?
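As a sketch of how simple that API could be, and how the representation
could be swapped later without touching callers (every name here is
invented for the example):

```python
# A named-table registry: three entry points, nothing more.
_tables: dict[str, dict[int, int]] = {}

def define_table(name: str, mapping: dict[int, int]) -> None:
    """Register a table mapping external index -> Unicode code point."""
    _tables[name] = dict(mapping)

def table_to_unicode(name: str, index: int) -> int:
    """External index -> code point."""
    return _tables[name][index]

def table_from_unicode(name: str, cp: int) -> int:
    """Code point -> external index.  A linear scan for now; a packed
    reverse index could replace it later without changing this API."""
    for idx, mapped in _tables[name].items():
        if mapped == cp:
            return idx
    raise KeyError(cp)
```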
--
School of Systems and Information Engineering
http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.