Re: changing the values of iso-8859-* charsets

Monday, 31 October 2005

 On the thirty-first of October, Ben Wing wrote: 

...
 i don't want two different kinds of charsets.  but we can create all 
 charsets with the proper indices and have `split-char' and `char-octet' 
 generate the old, wrong indices and a new function do it right.  
 suggestions for a new api to replace `make-char' and `split-char'? 
If the internal encoding is Unicode, the charset (apart from, say, 'ucs)
isn’t as trivially available as it is with the iso-2022-oriented encoding. 
Are you suggesting implementing something like the extant translation in
unicode.c? 

GNU’s API of (decode-char 'ucs #x20ac) and (encode-char ?\  'ucs), used with
other symbols (which they don’t allow, and which our compatibility
implementation doesn’t allow) could work well in that context. 
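
As a rough illustration of what a charset-aware decode-char/encode-char pair
could look like, here is a minimal C sketch. It is not the code in unicode.c;
the struct, the function names and the string charset names are assumptions
made for the example, and only the well-known positions where ISO 8859-15
differs from ISO 8859-1 are tabulated.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table entry: a charset position and its Unicode code point. */
struct mapping { uint8_t code; uint32_t ucs; };

/* The eight positions where ISO 8859-15 differs from ISO 8859-1. */
static const struct mapping latin9_diffs[] = {
  {0xA4, 0x20AC}, {0xA6, 0x0160}, {0xA8, 0x0161}, {0xB4, 0x017D},
  {0xB8, 0x017E}, {0xBC, 0x0152}, {0xBD, 0x0153}, {0xBE, 0x0178},
};
#define N_DIFFS (sizeof latin9_diffs / sizeof latin9_diffs[0])

/* Charset position -> Unicode code point, or -1 if there is no mapping.
   Only the upper half (0xA0..0xFF) of the 8859 charsets is handled. */
long decode_char(const char *charset, unsigned code)
{
  size_t i;
  if (strcmp(charset, "ucs") == 0)
    return code < 0x110000 ? (long) code : -1;
  if (code < 0xA0 || code > 0xFF)
    return -1;
  if (strcmp(charset, "latin-iso8859-1") == 0)
    return (long) code;            /* Latin-1 maps straight onto Unicode */
  if (strcmp(charset, "latin-iso8859-15") == 0) {
    for (i = 0; i < N_DIFFS; i++)
      if (latin9_diffs[i].code == code)
        return (long) latin9_diffs[i].ucs;
    return (long) code;            /* everything else matches Latin-1 */
  }
  return -1;                       /* unknown charset */
}

/* Unicode code point -> charset position, or -1 if not representable. */
long encode_char(uint32_t ucs, const char *charset)
{
  size_t i;
  if (strcmp(charset, "ucs") == 0)
    return ucs < 0x110000 ? (long) ucs : -1;
  if (strcmp(charset, "latin-iso8859-1") == 0)
    return (ucs >= 0xA0 && ucs <= 0xFF) ? (long) ucs : -1;
  if (strcmp(charset, "latin-iso8859-15") == 0) {
    for (i = 0; i < N_DIFFS; i++)
      if (latin9_diffs[i].ucs == ucs)
        return (long) latin9_diffs[i].code;
    if (ucs < 0xA0 || ucs > 0xFF)
      return -1;
    for (i = 0; i < N_DIFFS; i++)  /* position was replaced in Latin-9 */
      if (latin9_diffs[i].code == ucs)
        return -1;
    return (long) ucs;
  }
  return -1;
}
```

With this sketch, decode_char("latin-iso8859-15", 0xA4) gives 0x20AC, while
encode_char(0x20AC, "latin-iso8859-1") gives -1 because Latin-1 has no euro
sign.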

...
 btw unicode-internal now compiles (and crashes at startup, naturally).  
 i still need to add the translation tables for koi8-r and friends, 
 implement surrogates and (the biggest current issue) redo font handling 
 to eliminate the concept of one-font-per-charset.  
Excellent.

...
 also, add a concept of "language" and introduce it appropriately in the
 unicode/charset conversion functions. (neither of these last two will
 make it into the first version of unicode-internal to be integrated into
 the mainline.) 
Neither the Unicode conversion functions nor the charset conversion
functions will make it in? That doesn’t seem very practical; I’m sure you
mean something else there. 

...
 currently only about 64 ifdef UNICODE_INTERNAL's, and almost all
 localized to text.h, text.c and charset.h. (these are the only files 
 that know anything about the actual encoding of characters.  chartab.c, 
 for example, knows only that its hashing function must be different.  
 mule-coding.c knows only that the bogus split big5 charsets don't exist 
 under UNICODE_INTERNAL.)  
That file is full of assumptions about our internal string format which need
to be changed if you’re changing that format. I haven’t seen you mention
that you are, but I find it hard to imagine supporting a 21-bit space with
the existing format, let alone a 30-bit space, given that you’d have to
abandon most of the leading byte architecture. 
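
For comparison, one string format that does cover a 21-bit space without a
charset leading byte is a UTF-8-style variable-length encoding. The sketch
below is only an illustration of that idea, not a claim about what the patch
does; the function name utf8_encode is made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into buf (which must hold 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char *buf)
{
  if (cp < 0x80) {                        /* 7 bits: one byte */
    buf[0] = (unsigned char) cp;
    return 1;
  }
  if (cp < 0x800) {                       /* 11 bits: two bytes */
    buf[0] = (unsigned char) (0xC0 | (cp >> 6));
    buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                     /* 16 bits: three bytes */
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return 0;                           /* surrogates are not characters */
    buf[0] = (unsigned char) (0xE0 | (cp >> 12));
    buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp < 0x110000) {                    /* up to 21 bits: four bytes */
    buf[0] = (unsigned char) (0xF0 | (cp >> 18));
    buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;                               /* beyond the Unicode range */
}
```

Here the first byte encodes only the sequence length, not a charset, which
is why the leading-byte assumptions would not carry over to such a format.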

But, I’m sure I’ll understand more of the details when I see the patch. 

...
 charset.h is totally rewritten and might go away entirely. chartab.c is
 drastically changed and now uses the same basic format as the unicode
 translation tables. 
...
 also, we unfortunately can only implement 30-bit chars, even though
 Unicode theoretically allows 31 bits. 
They introduced a limit in 3.0 of 0x110000. Every code point approved by the
standard will be below that. 
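
The 0x110000 limit is exactly what UTF-16 surrogate pairs can reach
(0x10000 plus 2^20 further code points), so it fits in 21 bits, well inside
a 30-bit character type. A small C sketch of the arithmetic, with made-up
function names:

```c
#include <stdint.h>

#define UNICODE_LIMIT 0x110000u   /* first value past the last code point */

/* Nonzero if cp is inside the Unicode code space. */
int is_valid_code_point(uint32_t cp)
{
  return cp < UNICODE_LIMIT;
}

/* Split a code point above the BMP into a UTF-16 surrogate pair.
   Returns 1 on success, 0 if cp does not need (or cannot have) a pair. */
int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
  if (cp < 0x10000 || cp >= UNICODE_LIMIT)
    return 0;
  cp -= 0x10000;                            /* now a 20-bit value */
  *hi = (uint16_t) (0xD800 | (cp >> 10));   /* top 10 bits */
  *lo = (uint16_t) (0xDC00 | (cp & 0x3FF)); /* bottom 10 bits */
  return 1;
}

/* Combine a high and a low surrogate back into a code point.
   The caller must pass a valid pair (0xD800..0xDBFF, 0xDC00..0xDFFF). */
uint32_t from_surrogates(uint16_t hi, uint16_t lo)
{
  return 0x10000 + (((uint32_t) (hi - 0xD800) << 10)
                    | (uint32_t) (lo - 0xDC00));
}
```

For example, U+1D11E splits into the pair 0xD834, 0xDD1E, and the largest
pair 0xDBFF, 0xDFFF combines to 0x10FFFF.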

-- 
„Frauen achten mehr aufs Herz und weniger auf Dummheiten. Darum leben sie
länger.“ (C.R. Zafón -- Übertragung von Peter Schwaar.)
