Re: changing the values of iso-8859-* charsets

Monday, 31 October 2005

        Aidan Kehoe wrote:

...
 On the thirty-first day of October, Ben Wing wrote:

 > i don't want two different kinds of charsets.  but we can create all 
 > charsets with the proper indices and have `split-char' and `char-octet' 
 > generate the old, wrong indices and a new function do it right.  
 > suggestions for a new api to replace `make-char' and `split-char'?

If the internal encoding is Unicode, the charset (apart from, say, 'ucs)
isn’t as trivially available as it is with the iso-2022-oriented encoding. 
Are you suggesting implementing something like the extant translation in
unicode.c ? 

 right.  these functions take an extra precedence-list argument (a list 
of charsets), which when nil does something reasonable as a default.  
essentially i removed unicode-to-char and integrated it into make-char; 
but i'm open to other suggestions.
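
As a rough illustration of the precedence-list idea (not the actual XEmacs code), the sketch below looks up a Unicode codepoint in a list of charsets and returns the first charset that can represent it. The table layout, the `DEFAULT_PRECEDENCE` value, and the Python function signature are assumptions made for the example; only the function name follows the thread.

```python
# Toy translation tables: charset name -> {charset codepoint: Unicode codepoint}.
# The individual mappings are real, but the table layout is illustrative only.
TO_UNICODE = {
    "latin-iso8859-1": {0xA4: 0x00A4, 0xE9: 0x00E9},
    "latin-iso8859-15": {0xA4: 0x20AC, 0xE9: 0x00E9},
}

# Hypothetical default used when the caller passes no precedence list (nil).
DEFAULT_PRECEDENCE = ["latin-iso8859-1", "latin-iso8859-15"]


def unicode_to_charset_codepoint(code, precedence_list=None):
    """Return (charset, codepoint) from the first charset in the precedence
    list that can represent `code`, or None if no charset can."""
    for charset in precedence_list or DEFAULT_PRECEDENCE:
        for codepoint, unicode_value in TO_UNICODE[charset].items():
            if unicode_value == code:
                return charset, codepoint
    return None
```

With the default list, U+00E9 resolves to latin-iso8859-1; passing a list that puts latin-iso8859-15 first changes the answer, and U+20AC is only ever found in latin-iso8859-15.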

...
GNU’s API of (decode-char 'ucs #x20ac) and (encode-char ?\ 'ucs), used with
other symbols (which they don’t allow, and which our compatibility
implementation doesn’t allow) could work well in that context. 

 well, basically i want some api that says "convert a charset and 
codepoints into a Lisp char" and vice-versa "get a charset and 
codepoints from a Lisp char".  keep in mind that we will be running both 
unicode-internal and old-mule for a while, so the api's must work with 
both.  for this reason, i also introduced other levels of conversion, 
similar but a bit different; e.g. charset-codepoint-to-unicode and 
unicode-to-charset-codepoint unilaterally convert between a unicode 
codepoint and a charset codepoint, regardless of the representation of a 
char.  on the other hand, int-to-char and char-to-int always give you 
the actual int that makes up the character; not very portable.  if you 
want unicode, char-to-unicode.  if you want charset codepoints, it's 
currently `split-char'.  but i suppose i could create 
`char-to-charset-codepoint', which would follow the others, and 
`charset-codepoint-to-char', and reintroduce `unicode-to-char', and 
deprecate `make-char' and `split-char'.  then you'd have a very 
symmetrical api.

suggestions for less verbose names are welcome.
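
To make the proposed symmetry concrete, here is a minimal model of the six conversions under two internal representations. The class names, the two-charset table, and the old-mule integer layout (charset id in the high bits, octet in the low byte) are assumptions for the example and do not describe the real encodings; only the function names come from the message above.

```python
CHARSETS = ["ascii", "latin-iso8859-1"]  # hypothetical charset ids 0 and 1


def charset_codepoint_to_unicode(charset, codepoint):
    # Both toy charsets coincide with Unicode, so the mapping is the identity.
    return codepoint


def unicode_to_charset_codepoint(code):
    if code < 0x80:
        return "ascii", code
    if 0xA0 <= code <= 0xFF:
        return "latin-iso8859-1", code
    raise ValueError("no toy charset covers U+%04X" % code)


class OldMule:
    """Toy model: a char is (charset id << 8) | octet."""

    def charset_codepoint_to_char(self, charset, codepoint):
        return (CHARSETS.index(charset) << 8) | codepoint

    def char_to_charset_codepoint(self, ch):
        return CHARSETS[ch >> 8], ch & 0xFF

    def unicode_to_char(self, code):
        return self.charset_codepoint_to_char(*unicode_to_charset_codepoint(code))

    def char_to_unicode(self, ch):
        return charset_codepoint_to_unicode(*self.char_to_charset_codepoint(ch))

    def char_to_int(self, ch):
        return ch  # the raw representation; differs between the two models


class UnicodeInternal:
    """Toy model: a char is its Unicode codepoint."""

    def charset_codepoint_to_char(self, charset, codepoint):
        return charset_codepoint_to_unicode(charset, codepoint)

    def char_to_charset_codepoint(self, ch):
        return unicode_to_charset_codepoint(ch)

    def unicode_to_char(self, code):
        return code

    def char_to_unicode(self, ch):
        return ch

    def char_to_int(self, ch):
        return ch
```

Code written against the charset and Unicode conversions gives the same answers in both models; only `char_to_int` exposes the difference, which is why it is the non-portable one.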

...
 > btw unicode-internal now compiles (and crashes at startup, naturally).  
 > i still need to add the translation tables for koi8-r and friends, 
 > implement surrogates and (the biggest current issue) redo font handling 
 > to eliminate the concept of one-font-per-charset. 

Excellent.

 > also, add a concept of "language" and introduce it appropriately in the
 > unicode/charset conversion functions. (neither of these last two will
 > make it into the first version of unicode-internal to be integrated into
 > the mainline.)

Neither the Unicode conversion functions nor the charset conversion
functions will make it in? That doesn’t seem very practical; I’m sure you
mean something else there. 

 no, neither the redone font handling nor the concept of a language 
introduced into the unicode conversion functions will make it in.  the 
font handling will still bogusly be in terms of charsets, with the same 
bogus hack currently there (under windows at least) to look harder 
through various fonts to find a font that can display a char, when 
necessary.  i know what needs to be done to change this but it will be a 
pervasive change and i don't want to bite off too much at a time right 
now.  similarly, the idea of introducing a language and tracking the 
language of text using extent properties will take some doing.  first 
step would be to create a language object and set up properties on it, 
such as the charset precedence list for unicode translation.  then there 
is a buffer-local `current-language' variable (maybe the language 
environments can be made to work without too much effort).  so we need 
to pass around some sort of object from which the charset precedence 
list can be derived -- the list itself, a language object (maybe just a 
symbol, who knows), a buffer, etc.  currently most functions don't 
bother with this, so it requires a fair amount of refactoring.
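
One way to picture "some sort of object from which the charset precedence list can be derived" is a single helper that accepts any of the sources listed above. Everything in this sketch is hypothetical: the `Language` and `Buffer` classes, the `LANGUAGES` registry, the default list, and the helper name are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Language:
    name: str
    charset_precedence: list


@dataclass
class Buffer:
    # Stands in for a buffer-local `current-language' variable.
    current_language: Language


# Hypothetical registry so that a bare symbol (here a string) can name a language.
LANGUAGES = {
    "japanese": Language("japanese", ["japanese-jisx0208", "chinese-gb2312"]),
    "chinese": Language("chinese", ["chinese-gb2312", "japanese-jisx0208"]),
}

DEFAULT_PRECEDENCE = ["latin-iso8859-1"]  # assumed fallback for nil


def precedence_list_from(source=None):
    """Derive a charset precedence list from nil, an explicit list,
    a language symbol, a Language object, or a Buffer."""
    if source is None:
        return list(DEFAULT_PRECEDENCE)
    if isinstance(source, (list, tuple)):
        return list(source)
    if isinstance(source, str):
        return list(LANGUAGES[source].charset_precedence)
    if isinstance(source, Language):
        return list(source.charset_precedence)
    if isinstance(source, Buffer):
        return list(source.current_language.charset_precedence)
    raise TypeError("cannot derive a precedence list from %r" % (source,))
```

Conversion functions would then take `source` instead of a bare list, which is the kind of signature change that makes the refactoring pervasive.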

...
 > currently only about 64 ifdef UNICODE_INTERNAL's, and almost all 
 > localized to text.h, text.c and charset.h. (these are the only files 
 > that know anything about the actual encoding of characters.  chartab.c, 
 > for example, knows only that its hashing function must be different.  
 > mule-coding.c knows only that the bogus split big5 charsets don't exist 
 > under UNICODE_INTERNAL.) 

That file is full of assumptions about our internal string format which need
to be changed if you’re changing that format. I haven’t seen you mention
that you are, but I find it hard to imagine supporting a 21-bit space with
the existing format, let alone a 30-bit space, given that you’d have to
abandon most of the leading byte architecture. 

 which file?  chartab.c got radically reworked to support 32-bit 
characters using a page-table-style lookup, as for unicode translation 
tables.  same system is used for old-mule as well.
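
For readers unfamiliar with the term, a page-table-style lookup splits the character code into fixed-width slices and walks one level of a sparse tree per slice, allocating pages only where values are stored. The sketch below assumes four levels of 8 bits each; the actual level count and page sizes in chartab.c are not given in this thread, and the class name is invented.

```python
class PageTable:
    """Sparse map from 32-bit codes to values, using four 256-entry levels."""

    PAGE_SIZE = 256
    SHIFTS = (24, 16, 8, 0)

    def __init__(self, default=None):
        self.default = default
        self.root = None  # pages are allocated lazily

    def _indices(self, code):
        if not 0 <= code < (1 << 32):
            raise ValueError("code out of 32-bit range")
        return [(code >> shift) & 0xFF for shift in self.SHIFTS]

    def get(self, code):
        node = self.root
        for index in self._indices(code):
            if node is None:  # missing page: nothing stored in this region
                return self.default
            node = node[index]
        return self.default if node is None else node

    def put(self, code, value):
        if self.root is None:
            self.root = [None] * self.PAGE_SIZE
        node = self.root
        *upper, last = self._indices(code)
        for index in upper:
            if node[index] is None:
                node[index] = [None] * self.PAGE_SIZE
            node = node[index]
        node[last] = value
```

A lookup costs a fixed number of array indexings regardless of how large the code space is, and untouched regions of the space cost no memory.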

mule-coding.c did not need so much rewriting, but it now works in terms 
of generic functions.  e.g. when iso2022 encodes to external, it uses 
macros to accumulate a whole character, then calls 
itext_to_charset_codepoint() -- an inline function -- to get an 
appropriate charset and codepoint.  similarly, once it's gotten a 
charset and codepoints, it calls charset_codepoint_to_dynarr() -- 
another inline function -- to write it out.  these inliners work in the 
old and new system, conditionalized appropriately.  similar stuff goes 
everywhere else; i abstracted out "leading bytes" and all specific 
knowledge of our internal representation whenever possible.
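
The pattern described here can be sketched as a coding-system loop that never inspects the internal text itself and instead goes through two helpers. The helper names echo the ones mentioned above, but their bodies, the toy old-mule byte layout (leading byte 0x81 followed by one octet), and the use of UTF-8 for the new format are assumptions for illustration; in C the choice between the two readers would be made with an ifdef rather than a parameter.

```python
def itext_to_charset_codepoint_old_mule(itext, pos):
    """Toy old-mule reader: ASCII bytes stand alone, 0x81 leads a Latin-1 octet."""
    byte = itext[pos]
    if byte < 0x80:
        return "ascii", byte, pos + 1
    if byte == 0x81:  # hypothetical leading byte for latin-iso8859-1
        return "latin-iso8859-1", itext[pos + 1], pos + 2
    raise ValueError("unknown leading byte 0x%02X" % byte)


def itext_to_charset_codepoint_unicode(itext, pos):
    """Toy unicode-internal reader: the internal text is assumed to be UTF-8."""
    byte = itext[pos]
    length = 1 if byte < 0x80 else 2 if byte < 0xE0 else 3 if byte < 0xF0 else 4
    code = ord(itext[pos:pos + length].decode("utf-8"))
    if code < 0x80:
        return "ascii", code, pos + length
    if 0xA0 <= code <= 0xFF:
        return "latin-iso8859-1", code, pos + length
    raise ValueError("no toy charset covers U+%04X" % code)


def charset_codepoint_to_dynarr(charset, codepoint, out):
    """Stand-in writer: append one external byte for the given codepoint."""
    if charset not in ("ascii", "latin-iso8859-1"):
        raise ValueError("cannot encode charset %s" % charset)
    out.append(codepoint)


def encode_latin1(itext, to_charset_codepoint):
    """Encoder loop that is independent of the internal representation."""
    out = bytearray()
    pos = 0
    while pos < len(itext):
        charset, codepoint, pos = to_charset_codepoint(itext, pos)
        charset_codepoint_to_dynarr(charset, codepoint, out)
    return bytes(out)
```

Feeding the same word in either toy internal format through `encode_latin1` yields identical external bytes, which is the property the abstraction is meant to guarantee.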

...
But, I’m sure I’ll understand more of the details when I see the
patch. 

 > charset.h is totally rewritten and might go away entirely. chartab.c is
 > drastically changed and now uses the same basic format as the unicode
 > translation tables.

 > also, we unfortunately can only implement 30-bit chars, even though
 > Unicode theoretically allows 31 bits.

They introduced a limit in 3.0 of 0x110000. Every code point approved by the
standard will be below that. 
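
The three limits mentioned in this exchange nest as follows. This is only a sketch of the arithmetic; the function name and category labels are invented for the example.

```python
UNICODE_LIMIT = 0x110000  # every assigned code point is below this
CHAR_LIMIT = 1 << 30      # 30-bit characters, as stated in the quoted text
UCS4_LIMIT = 1 << 31      # the theoretical 31-bit space


def classify_codepoint(code):
    """Say which of the ranges discussed above a code falls into."""
    if code < 0 or code >= UCS4_LIMIT:
        raise ValueError("outside the 31-bit space")
    if code < UNICODE_LIMIT:
        return "unicode"
    if code < CHAR_LIMIT:
        return "beyond-unicode"   # representable as a char, never assigned
    return "unrepresentable"      # needs the 31st bit
```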

 yeah, i've heard, but UCS-4 is still "theoretically" 31 bits, but all
private use ... oh well.  we'll just say the high half of the space is 
"private and reserved", heh heh :)

ben
