Aidan Kehoe wrote:
On the twenty-eighth day of October, Ben Wing wrote:
> > Ben> currently we use 32-127 for the values of the chars in the
> > Ben> iso-8859 charsets. maybe that was needed under old-mule, but
> > Ben> in unicode-internal charsets can have values in any arbitrary
> > Ben> interval or rectangle in 256 or 256x256 space. shouldn't we
> > Ben> use 160-255? this would only matter in the output of
> > Ben> `split-char'; `make-char' already goes either way.
> >
> >No. This is gratuitous incompatibility with ISO 2022, legacy X11 font
> >indexing, and other Emacsen. Why buy trouble changing a public API?
>
> It's the other way around: the current situation is incompatible with
> the X11 fonts, so we have to hack the values using the bogus `graphic'
> characteristic.
It remains incompatible with ISO 2022 and other Emacsen, and will break
existing code.
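For reference, the behavior described in the quoted exchange (`make-char' accepting either octet form, `split-char' reporting the 32-127 form) can be sketched in XEmacs Lisp. The example character and the commented return value follow the thread's description; they have not been verified against a running XEmacs:

```lisp
;; Sketch based on the thread's description, not tested output.
;; ISO 8859-1 e-acute is octet #xE9; its 32-127 "position code"
;; is #x69 (105).
(make-char 'latin-iso8859-1 #xe9)  ; high-bit form
(make-char 'latin-iso8859-1 #x69)  ; low form; per Ben, both work
;; `split-char' reports the 32-127 indexing under discussion:
;; (split-char (make-char 'latin-iso8859-1 #xe9))
;;   => (latin-iso8859-1 105)
```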
> Note that in the new world, charsets can have values > 127 in any case;
> cf. Big5, Shift-JIS, etc.
>
> So when I'm creating a new charset like `latin-windows-1252', [...]
Create another API for character sets in that 256-char or
256x256-char space, then. Breaking the old API doesn’t buy a
whole lot.
> which is compatible with iso-8859-1 but has extra chars in the range
> 128-159, do I do the right thing and have its chars in the range 128-255
> be indexed as 128-255 (and hence be inconsistent with the
> `latin-iso8859-1' charset), or do I do the wrong thing and move its range
> down to 0-127? And then it appears to have ASCII control chars in the
> range 0-31, but they aren't control chars: value 10 is not linefeed,
> value 13 is not CR, etc.?
Another reason to create another API: it would be possible to implement
EBCDIC character sets with one that didn’t assume ASCII compatibility, as
the existing API does. With an API that didn’t suck,
(make-char-alternative code-page-037 #x40)
and
(make-char-alternative code-page-037 #xc0)
could and should mean different things (in code page 037, #x40 is SPACE
and #xc0 is '{').
I don't want two different kinds of charsets. But we can create all
charsets with the proper indices, have `split-char' and `char-octet'
generate the old, wrong indices, and have a new function do it right.

Suggestions for a new API to replace `make-char' and `split-char'?
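One possible shape for such a replacement, sketched here with hypothetical names (`make-char-exact', `split-char-exact'). Nothing below exists in any Emacs; it is only meant to make the proposal concrete:

```lisp
;; Hypothetical API sketch -- names and signatures are invented.
;;
;; (make-char-exact CHARSET INDEX1 &optional INDEX2)
;;   INDEX1 (and INDEX2 for 256x256 charsets) are the charset's true
;;   indices, with no ASCII offset or `graphic' hacking: 160-255 for
;;   `latin-iso8859-1', 128-255 for a `latin-windows-1252', the full
;;   0-255 for an EBCDIC code page.
;;
;; (split-char-exact CHAR)
;;   Inverse: returns (CHARSET INDEX1 [INDEX2]) in the same true
;;   indices, leaving the old `split-char' to keep emitting the
;;   compatibility values.
;;
;; E.g., under this sketch:
;; (make-char-exact 'latin-iso8859-1 #xe9)
;; (split-char-exact <that char>) => (latin-iso8859-1 233)
```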
BTW, unicode-internal now compiles (and crashes at startup, naturally).
I still need to add the translation tables for koi8-r and friends,
implement surrogates, and (the biggest current issue) redo font handling
to eliminate the concept of one-font-per-charset. Also, I need to add a
concept of "language" and introduce it appropriately in the
Unicode/charset conversion functions. (Neither of these last two will
make it into the first version of unicode-internal to be integrated into
the mainline.)
There are currently only about 64 `#ifdef UNICODE_INTERNAL's, and almost
all are localized to text.h, text.c and charset.h. (These are the only
files that know anything about the actual encoding of characters.
chartab.c, for example, knows only that its hashing function must be
different; mule-coding.c knows only that the bogus split Big5 charsets
don't exist under UNICODE_INTERNAL.) charset.h is totally rewritten and
might go away entirely. chartab.c is drastically changed and now uses the
same basic format as the Unicode translation tables. Also, we
unfortunately can only implement 30-bit chars, even though the full
UCS-4 space theoretically allows 31 bits (Unicode proper stops at
#x10FFFF, so nothing in actual Unicode is lost).
ben