Re: changing the values of iso-8859-* charsets

Wednesday, 2 November 2005

        Stephen J. Turnbull wrote:

...
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:

    Ben> i really don't understand your sarcastic attitude, or what
    Ben> point you're trying to make.

What sarcastic attitude, aside from the comment about rants?  Even
that, I'm sorry you took it that way, but I think it's a huge waste of
your time to be defining "latin-windows-1252" character sets at this
point.  We should provide an API that allows any reasonably
experienced Emacs user to do that, no?  And ISO 8859-1 is close enough
for prototyping purposes.  In fact, it has the "advantage" that people
_will_ use directed quotes and suchlike, and so we'll get a natural
test of error handling.

 why is it a waste of time?  took me half an hour or so.

...
    Ben> "charset" as i have defined it is a set of characters,
    Ben> indexed by one or two bytes.  the indices are as defined in
    Ben> the unicode translation tables.

Please don't call it "charset" if all it is is a table.  The word
"charset" has lots of associations, some (like "94N") that we would
really like to imprison in a very small area of the implementation,
and some that we really want gone, period (like "leading byte").  So
you are guaranteed to get frequent communication failures with other
Emacs developers if you insist on using that word.

 such charsets can work under old-mule, too.  in fact, they do, 
currently; it's just that you can't encode a character in a buffer using 
them.  but you can convert back and forth to unicode, just like for 
other charsets.  see my next paragraph.

...
I also think that your statements about what you're doing are totally
confusing.  Specifically:

    Ben> i see no purpose in creating a totally new concept rather
    Ben> than extending the current `charset' concept.

But that's my point!  According to your descriptions, you _are_
creating a totally new concept.  You're _not_ extending the Mule
charset concept, you're substituting a _completely different
definition_.  Specifically, there's no way that what you're talking
about can be used in legacy Mule, right?  It lacks the leading byte
and graphic properties, for one, and the indices you describe are
presumably platform integers, which are unsuitable for working with ISO
2022 multibyte character sets because those are always big-endian.

    Ben> there will be a new `mbcs' coding system type that just
    Ben> encodes one or more charsets using their indices, in the
    Ben> obvious fashion;

There's a lot that's unobvious about such a thing.  Most important,
what good is it?  Where will it be used?

As I see it, internally there is Unicode, an unordered collection of
abstract characters.  Where is there a role for `mbcs'?

Externally, the "or more" part is unusable until you define a protocol
for switching "new-style charsets".  The "in the new world, you can
have an arbitrary number of charsets, with any characters you want in
them" aspect of "new-style charsets" is unusable until you define a
protocol for defining them on the wire.  But this is starting to sound
like ISO 2022.

 mbcs is used for implementing things like the koi8-r coding system under 
unicode-internal.  it can also replace the hand-coded big5 and shift-jis 
coding systems, in unicode-internal.  it could replace them under 
old-mule, too, if i add a property on charsets indicating what other 
charsets they should preferentially be converted to. (we could do 
without this but then we might not get the expected charset; e.g. big5 
chars might end up as gb-2312 instead of big5-1 or big5-2.)
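
A minimal sketch of the idea described above, assuming a charset is just
a table from a one- or two-byte index to a Unicode code point and that an
mbcs coding system is an ordered list of such charsets.  The names
Charset, mbcs_decode and mbcs_encode are hypothetical and are not XEmacs
functions; the tables are toy data, not real charset definitions.

```python
# Hypothetical sketch; Charset, mbcs_decode and mbcs_encode are
# illustrative names, not XEmacs functions.
from dataclasses import dataclass, field


@dataclass
class Charset:
    name: str
    width: int          # bytes per index: 1 or 2
    to_unicode: dict    # index -> Unicode code point
    from_unicode: dict = field(init=False)

    def __post_init__(self):
        # Reverse table, built once; assumes the forward table is invertible.
        self.from_unicode = {cp: idx for idx, cp in self.to_unicode.items()}


def mbcs_decode(data, charsets):
    """Decode bytes by trying each charset, in list order, at every position."""
    out, i = [], 0
    while i < len(data):
        for cs in charsets:
            chunk = data[i:i + cs.width]
            idx = int.from_bytes(chunk, "big")  # indices read big-endian
            if len(chunk) == cs.width and idx in cs.to_unicode:
                out.append(chr(cs.to_unicode[idx]))
                i += cs.width
                break
        else:
            raise ValueError(f"no charset covers byte 0x{data[i]:02x} at offset {i}")
    return "".join(out)


def mbcs_encode(text, charsets):
    """Encode text; list order decides which charset wins for a character."""
    out = bytearray()
    for ch in text:
        for cs in charsets:
            idx = cs.from_unicode.get(ord(ch))
            if idx is not None:
                out += idx.to_bytes(cs.width, "big")
                break
        else:
            raise ValueError(f"no charset contains U+{ord(ch):04X}")
    return bytes(out)


# Toy tables: full ASCII plus a single two-byte entry (0xA440 -> U+4E00).
ascii_cs = Charset("ascii", 1, {i: i for i in range(0x80)})
toy_two_byte = Charset("toy-two-byte", 2, {0xA440: 0x4E00})
charsets = [ascii_cs, toy_two_byte]
```

In this sketch the order of the list plays the role of the "preferred
charset" property mentioned above: when two charsets contain the same
character, the earlier one is used for encoding.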

...
*****

How would I do this?

As far as I can see there are basically five kinds of external coding
systems we will want to deal with:  Unicode transformation formats,
Unicode subsets with an associated invertible coding table, nontrivial
versions of ISO 2022, other text, and non-text.

(1) *Unicode transformation formats* are algorithmic and well-defined.
We have most of the code we need (AFAIK the only Unicode-standard UTF
missing is UTF-7, although there are a couple of weirdos out there,
eg, from China).  The tedious parts will be getting endianness and
Unicode signatures right, although AFAIK that's well-enough done in
current XEmacs Mule.

 there's a utf-7 implementation in the unicode book, but i don't know its 
licensing.  any ideas? (if this doesn't work, i'm sure there are gpl-ed 
utf-7 implementations available.)
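
To illustrate the "endianness and Unicode signatures" part of (1), here
is a minimal sketch of signature sniffing using the byte-order-mark
constants from Python's standard codecs module.  The function names and
the fallback to UTF-8 when no signature is present are assumptions made
for the example, not a description of what XEmacs does.

```python
import codecs

# Longest signatures first: the UTF-32 LE signature (FF FE 00 00) starts
# with the UTF-16 LE one (FF FE), so the order of this list matters.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_signature(data):
    """Return (encoding name, signature length), or (None, 0) if none found."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_utf(data, default="utf-8"):
    """Strip a signature if present and decode; `default` is an assumption."""
    name, skip = sniff_signature(data)
    return data[skip:].decode(name or default)
```

Note that FF FE 00 00 could in principle also be UTF-16 LE text that
begins with U+0000; the sketch simply prefers the UTF-32 reading.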

...
(2) *Unicode subsets* (with associated tables) include all the members
of the ISO 8859 family, KOI8, the IBM and Microsoft character sets,
and Shift JIS.  IIRC there are some dupes in Big5, but also AFAIK Big5
can in practice be subsetted to the invertible part.  It's possible
that some EUC coding systems can actually be treated as Unicode
subsets (if they are defined or used in such a way that ISO 2022
character set designations and invocations are never used).
Compressing and sharing tables may be worth some effort, but these are
optimizations; abstractly tables are trivial.  The most important
thing will be to take care with endianness issues, since with the
exception of the non-endian and specifically little-endian Unicode
formats, all external coding systems that I know of are big-endian.
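
The "subset to the invertible part" step in (2) can be sketched as
follows.  This is only an illustration: the policy of keeping the
numerically smallest index for a duplicated code point is an assumption,
and the table below is made-up toy data, not real Big5 content.

```python
def invertible_part(table):
    """Split an index -> code point table into (kept, dropped).

    `kept` maps every code point from exactly one index.  Assumed policy:
    when several indices map to the same code point, the numerically
    smallest index is kept and the others are dropped.
    """
    kept, dropped, seen = {}, {}, set()
    for idx in sorted(table):
        cp = table[idx]
        if cp in seen:
            dropped[idx] = cp
        else:
            seen.add(cp)
            kept[idx] = cp
    return kept, dropped


def index_to_bytes(idx, width):
    """Serialize an index big-endian, regardless of the host byte order."""
    return idx.to_bytes(width, "big")


# Toy two-byte table with one deliberate duplicate (made-up values).
toy = {0x2121: 0x3000, 0x2122: 0x3001, 0x7E7E: 0x3001}
kept, dropped = invertible_part(toy)
reverse = {cp: idx for idx, cp in kept.items()}  # safe: kept is invertible
```

Writing indices with an explicit byte order, as index_to_bytes does, is
one way to keep platform integer layout out of the external format.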

(3) *ISO 2022* is basically hopeless for full implementation, but
fortunately most of the important versions are already well-defined
and implemented in XEmacs.  The only important problem I know of is
extended segments in X Compound Text.  And this stuff should (except
for cut and paste on X) rapidly decrease in importance (except
possibly in Japan, where it may take longer).

 i can implement this if you can tell me the names and encodings that are 
typically used in these segments.  the x standard only defines the 
general format of extended segments and doesn't say what is actually 
encoded in them.  emacs-unicode only has big5, iso8859-14 and iso8859-15 
in them. (plus there's the xfree86 utf-8 extension.)

...
(4) *Other text*.  I don't know of any important major ones, offhand,
unless Microsoft, Apple, or IBM provide non-ISO-2022 ways of changing
code pages in a file.  But they're all very much on board the Unicode
bandwagon---non-Unicode multilingual text should diminish rapidly for
those platforms, too.  So the main issue that doesn't fit into the
above is private characters.

(5) *Non-text*.  Binary transformation formats, like deflate and
BASE64.  These are octet-stream to octet-stream transformations, so
the implementation will be different from the codecs.  Otherwise I
don't suppose there's much commonality here.

I would start by implementing (1) and (2), and maybe an API for (5).
(3) can be implemented ad hoc by others, just port the Mule algorithms
to the new framework, perhaps in less generality if the lstream
interfaces and the like change, or maybe just steal them wholesale if
it's just a matter of marshalling arguments.  Except for Microsoft
code pages (if there's a way to switch them midstream) (4) can wait
until we see a need.

 we already have a gzip coding system.  we also have base64 functions but 
not yet converted to a coding system (not too hard to do, though).  
internally, i already generalized coding systems (some time ago, in 
fact) to be typed for either bytes or characters at either end; there's 
also a `chain' coding system for stringing multiple coding systems 
together.  crap like utf8-mac is internally implemented as a chain of 
two coding systems, one to handle utf8, the other to handle the eol 
transformation.  externally you can't specify or see the fact that the 
source and sink of coding systems are typed for bytes or characters; 
rather, an intermediate coding system to convert between the two is 
automatically inserted, as necessary. (i think, at least; it's been 
awhile since i did this work and it's not perfectly documented in 
`make-coding-system'.)
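
A rough sketch of the chaining idea described above, assuming each stage
declares whether it consumes and produces bytes or characters and that a
bridging stage is inserted wherever neighbouring types disagree.  Stage,
make_chain and the stage names are hypothetical, the choice of UTF-8 as
the bridge is an assumption, and only the decoding direction is shown.

```python
import gzip
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stage:
    name: str
    source: str        # "bytes" or "chars": type consumed when decoding
    sink: str          # "bytes" or "chars": type produced when decoding
    decode: Callable


def make_chain(stages, bridges):
    """Return (expanded stage list, decode function).

    `bridges` maps a (sink, source) type pair to the stage that converts
    between them; it is consulted only when two neighbours mismatch.
    """
    full = []
    for stage in stages:
        if full and full[-1].sink != stage.source:
            full.append(bridges[(full[-1].sink, stage.source)])
        full.append(stage)

    def run(data):
        for stage in full:
            data = stage.decode(data)
        return data

    return full, run


gunzip = Stage("gunzip", "bytes", "bytes", gzip.decompress)
cr_to_lf = Stage("cr-to-lf", "chars", "chars", lambda s: s.replace("\r", "\n"))
utf8 = Stage("utf-8", "bytes", "chars", lambda b: b.decode("utf-8"))

# The caller names only two stages; the utf-8 bridge is added automatically.
full, run = make_chain([gunzip, cr_to_lf], {("bytes", "chars"): utf8})
```

Here the caller never mentions byte or character typing; the expanded
list ends up as gunzip, utf-8, cr-to-lf.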

...
Of course (2) requires named tables.  Don't we already have those?
Tables have a very simple API.  With a little care we should be able
to implement that API so that if we need something that's more space
efficient we can do that later without changing the API, right?
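
One way to read "a very simple API" is two lookups, with the storage
hidden behind them.  The sketch below is hypothetical: DictTable and
DenseTable are invented names, and the dense variant assumes a one-byte
charset whose code points all fit in 16 bits.  The two sample entries
are KOI8-R 0xC1 and 0xC2 (Cyrillic small a and be).

```python
from array import array


class DictTable:
    """Straightforward backend: two dictionaries."""

    def __init__(self, pairs):
        self._fwd = dict(pairs)
        self._rev = {cp: idx for idx, cp in self._fwd.items()}

    def to_unicode(self, idx):
        return self._fwd.get(idx)

    def from_unicode(self, cp):
        return self._rev.get(cp)


class DenseTable:
    """Same API, smaller storage: one flat 256-entry array of 16-bit values."""

    _UNMAPPED = 0xFFFF  # U+FFFF is a noncharacter, used here as "no mapping"

    def __init__(self, pairs):
        self._fwd = array("H", [self._UNMAPPED] * 256)
        for idx, cp in pairs:
            self._fwd[idx] = cp

    def to_unicode(self, idx):
        if not 0 <= idx < 256:
            return None
        cp = self._fwd[idx]
        return None if cp == self._UNMAPPED else cp

    def from_unicode(self, cp):
        # Linear scan: trades lookup speed for space in this sketch.
        if cp == self._UNMAPPED:
            return None
        try:
            return self._fwd.index(cp)
        except ValueError:
            return None


sample = [(0xC1, 0x0430), (0xC2, 0x0431)]
tables = {"dict": DictTable(sample), "dense": DenseTable(sample)}
```

Callers that use only to_unicode and from_unicode cannot tell the two
backends apart, which is the property asked for above.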

 keep in mind that i've already done most of the work you're describing 
here.  i think we're talking past each other; at any rate, you seem to 
think i'm more confused than i am.  what would help more is if you could 
look at the code i sent you.  if there's something specific in my 
implementation that you don't like, i'll be happy to discuss it with 
you; nothing is fixed.

actually, what would really help is if you could take a look at 
emacs-unicode-2, figure out what their api is, and summarize it.  this 
would be extremely useful to me.

ben
