Re: changing the values of iso-8859-* charsets

Monday, 31 October 2005

        Stephen J. Turnbull wrote:

...
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:

    Ben> it's the other way around.  the current situation is
    Ben> incompatible with the X11 fonts, so we have to hack the
    Ben> values using the bogus `graphic' characteristic.

Nothing bogus about it, it's perfectly (ISO 2022) standard.  The X
fonts' use of character codes as font indexes where they fit into the
appropriate space is perfectly reasonable (although I've never
actually looked closely at an ISO-8859 font, thus the confusion on my
part, mea culpa), but is not of any particular interest here.  Note
that multibyte fonts (like Japanese) do assume font indices in the
32-127 range (usually 33-126, actually).

 yeah, i forgot about its iso-2022 connection.
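
As a rough illustration of the ISO 2022 point above (a sketch using Python's standard codecs, not XEmacs code): the same 7-bit index can be invoked in the left half (GL) or, with the high bit set, in the right half (GR). The variable names are made up for this example.

```python
# Sketch only: Python's stdlib codecs stand in for the ISO 2022 machinery.
ch = "\u3042"  # HIRAGANA LETTER A; JIS X 0208 row 4, cell 2

gl = ch.encode("iso2022_jp")  # ESC $ B designates JIS X 0208, bytes stay in GL
gr = ch.encode("euc_jp")      # same character set, invoked in GR (high bit set)

# Skip the 3-byte escape sequence to get the two GL index bytes.
gl_index = gl[3:5]
# Clearing the high bit of the GR bytes gives the same index pair.
gr_as_index = bytes(b & 0x7F for b in gr)

# The same trick on a one-byte ISO 8859-1 right-half character:
# byte 0xE9 (e with acute) corresponds to index 0x69 in the 32-127 range.
latin1_index = "\u00e9".encode("latin-1")[0] & 0x7F

print(gl_index.hex(), gr_as_index.hex(), hex(latin1_index))
```

Both index bytes of the Japanese character fall in the 33-126 range mentioned above.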

...
    Ben> note that in the new world, charsets can have values > 127
    Ben> in any case.  cf. big5, shift-jis, etc.

Please stop abusing the word "charset" for an object that is only
well-defined in a workspace I have no access to.  It's confusing you,
too, it would seem.

    Ben> so when i'm creating a new charset like `latin-windows-1252',
    Ben> which is compatible with iso-8859-1 but has extra chars in
    Ben> the range 128-159, do i do the right thing and have its chars
    Ben> in the range 128-255 be indexed as 128-255 (and hence be
    Ben> inconsistent with the `latin-iso8859-1' charset), or do i do
    Ben> the wrong thing and move its range down to 0-127?  and then
    Ben> it appears to have ascii control chars in the range 0-31, but
    Ben> they aren't control chars, value 10 is not linefeed, value 13
    Ben> is not cr, etc.?

You're thinking in terms of Mule charsets.  Don't, it's no help.
Those values should _never_ _ever_ appear in a context where they
could be confused with characters.

We don't need named coded character sets internally, we don't need to
associate random octets with charsets to make characters.  (Except for
backward compatibility, where backward is spelled P E R V E R S E.)
We only need subsets of Unicode.  Abstractly, characters from internal
text (LISP characters, strings, and buffers) should only ever be
mapped to their Unicode values, and then from Unicode to external
coding systems for I/O.
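
A minimal sketch of that Unicode-as-pivot idea, assuming Python's standard codecs as stand-ins for external coding systems (the `transcode` helper is hypothetical, not an XEmacs function):

```python
# Sketch only: external bytes -> Unicode text -> external bytes.
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode `data` from coding system `src`, re-encode it as `dst`."""
    text = data.decode(src)   # internal text is just Unicode
    return text.encode(dst)   # the coding system is chosen only at the I/O edge

# Octet 0x80 means different things in the two "compatible" encodings.
euro = b"\x80".decode("cp1252")      # EURO SIGN, U+20AC
c1_ctrl = b"\x80".decode("latin-1")  # a C1 control character, U+0080

# In the range 160-255 the two encodings agree.
same = b"\xe9".decode("cp1252") == b"\xe9".decode("latin-1")

# Once the text is Unicode, the source octet value no longer matters.
as_latin9 = transcode(b"\x80", "cp1252", "iso8859_15")

print(hex(ord(euro)), hex(ord(c1_ctrl)), same, as_latin9)
```

Here the euro sign arrives as octet 0x80 and leaves as octet 0xA4, with no charset index surviving in between.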

If you're worrying about the practical problems of mapping Unicode
characters to font indices, please don't bother.  It's a practical
problem, yes, but you aren't going to enforce sanity on fonts by
perpetuating the charset bogosity.  For now, _any old hack_ will do,
just get glyphs on the screen for the fonts you use.  As long as the
API looks like a table and handles two-byte indices, we can
generalize and optimize the internals for space later, if we even need
to.

I agree, we'll need named font index tables (a la Cmaps).  We've
already got the tables in etc/unicode.  Give them their Unicode names,
provide an aliasing mechanism, add an xemacs vendor directory, and put
anything we need that we don't already have in there.

Footnotes: 
[1]  If you read this as an anti-Microsoft rant, you're missing the
point.

 i really don't understand your sarcastic attitude, or what point you're 
trying to make. "charset" as i have defined it is a set of characters, 
indexed by one or two bytes.  the indices are as defined in the unicode 
translation tables.  you can certainly see my code if you want.  the 
main purpose of charsets in the new world is to interface with external 
encodings, and maybe secondarily for font indexing under X.  i see no 
purpose in creating a totally new concept rather than extending the 
current `charset' concept.  in the new world, you can have an arbitrary 
number of charsets, with any characters you want in them; the only way 
that the code knows what's in the charset is by the appropriate unicode 
translations that have been provided.  i've created new charsets 
`japanese-shift-jis' and `latin-windows-1252' and various others; 
`cyrillic-koi8-r' and such will be coming soon.  there will be a new 
`mbcs' coding system type that just encodes one or more charsets using 
their indices, in the obvious fashion; this will replace the need for ccl.
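
One possible reading of the description above, as a sketch: a charset is a table from index bytes to Unicode code points, and an `mbcs'-style encoder emits each character's index from the first charset that contains it. The `Charset` class, `one_byte_charset`, and `mbcs_encode` are hypothetical names for illustration, not the actual implementation, and the tables are borrowed from Python's codecs rather than from etc/unicode.

```python
from dataclasses import dataclass


@dataclass
class Charset:
    """Hypothetical model: a named table of index bytes -> code point."""
    name: str
    to_unicode: dict

    def __post_init__(self):
        # Reverse table, used when encoding.
        self.from_unicode = {cp: idx for idx, cp in self.to_unicode.items()}


def one_byte_charset(name, codec, lo, hi):
    """Build a one-byte charset for indices lo..hi from a Python codec."""
    table = {}
    for i in range(lo, hi + 1):
        try:
            table[bytes([i])] = ord(bytes([i]).decode(codec))
        except UnicodeDecodeError:
            pass  # index not assigned in this charset
    return Charset(name, table)


def mbcs_encode(text, charsets):
    """Encode each character as its index in the first charset holding it."""
    out = bytearray()
    for ch in text:
        for cs in charsets:
            idx = cs.from_unicode.get(ord(ch))
            if idx is not None:
                out += idx
                break
        else:
            raise ValueError(f"no charset contains U+{ord(ch):04X}")
    return bytes(out)


ascii_cs = one_byte_charset("ascii", "ascii", 0, 127)
win1252 = one_byte_charset("latin-windows-1252", "cp1252", 128, 255)

encoded = mbcs_encode("caf\u00e9 \u20ac", [ascii_cs, win1252])
print(encoded)
```

With the windows-1252 indices kept in 128-255, the output matches the bytes of the external encoding directly.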

ben
