Re: changing the values of iso-8859-* charsets

Monday, 31 October 2005

        ...
>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> it's the other way around.  the current situation is
    Ben> incompatible with the X11 fonts, so we have to hack the
    Ben> values using the bogus `graphic' characteristic.

Nothing bogus about it, it's perfectly (ISO 2022) standard.  The X
fonts' use of character codes as font indexes where they fit into the
appropriate space is perfectly reasonable (although I've never
actually looked closely at an ISO-8859 font, thus the confusion on my
part, mea culpa), but is not of any particular interest here.  Note
that multibyte fonts (like Japanese) do assume font indices in the
32-127 range (usually 33-126, actually).
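
To make the ISO 2022 relationship concrete, here is a minimal Python
sketch (illustrative only, not XEmacs code; the function names are
made up for this example).  It assumes the usual 8-bit arrangement in
which a graphic set's positions 32-127 appear as octets 160-255 when
the set is invoked into GR.

```python
def to_gr(position: int) -> int:
    """Octet for a graphic-set position (32-127) when invoked into GR."""
    if not 32 <= position <= 127:
        raise ValueError("graphic-set positions lie in 32-127")
    return position | 0x80


def to_gl(octet: int) -> int:
    """Graphic-set position for an octet in the GR range (160-255)."""
    if not 160 <= octet <= 255:
        raise ValueError("GR octets lie in 160-255")
    return octet & 0x7F


# U+00E9 (e with acute) is octet 0xE9 in ISO 8859-1, which is
# position 0x69 of the right-hand graphic set.
e_acute_octet = "\u00e9".encode("iso-8859-1")[0]
e_acute_position = to_gl(e_acute_octet)
```

An 8-bit font indexed by character code would use the octet; a table
built on 7-bit positions would use the position.  The two differ only
in the high bit.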

    Ben> note that in the new world, charsets can have values > 127 in
    Ben> any case.  cf. big5, shift-jis, etc.

Please stop abusing the word "charset" for an object that is only
well-defined in a workspace I have no access to.  It's confusing you,
too, it would seem.

    Ben> so when i'm creating a new charset like `latin-windows-1252',
    Ben> which is compatible with iso-8859-1 but has extra chars in
    Ben> the range 128-159, do i do the right thing and have its chars
    Ben> in the range 128-255 be indexed as 128-255 (and hence be
    Ben> inconsistent with the `latin-iso8859-1' charset), or do i do
    Ben> the wrong thing and move its range down to 0-127?  and then
    Ben> it appears to have ascii control chars in the range 0-31, but
    Ben> they aren't control chars, value 10 is not linefeed, value 13
    Ben> is not cr, etc.?

You're thinking in terms of Mule charsets.  Don't, it's no help.
Those values should _never_ _ever_ appear in a context where they
could be confused with characters.

We don't need named coded character sets internally, we don't need to
associate random octets with charsets to make characters.  (Except for
backward compatibility, where backward is spelled P E R V E R S E.)
We only need subsets of Unicode.  Abstractly, characters from internal
text (LISP characters, strings, and buffers) should only ever be
mapped to their Unicode values, and then from Unicode to external
coding systems for I/O.
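
As a rough illustration of that pipeline, the following Python sketch
uses Python's own codecs as a stand-in for coding systems (it is not
XEmacs code, and `transcode' is a hypothetical helper).  External
octets are decoded to Unicode, and only Unicode is ever encoded back
out.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert octets between two external encodings via Unicode."""
    text = data.decode(source)   # external octets -> Unicode text
    return text.encode(target)   # Unicode text -> external octets


# Octet 0x80 in windows-1252 is the euro sign, U+20AC.
euro_text = b"\x80".decode("cp1252")

# ISO 8859-15 has the same character at a different octet.
euro_8859_15 = transcode(b"\x80", "cp1252", "iso-8859-15")

# ISO 8859-1 has no euro sign, so encoding fails instead of
# silently producing a C1 control octet.
try:
    euro_text.encode("iso-8859-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False
```

At no point does the octet value 0x80 stand for a character on its
own; it only has meaning together with the encoding it was read from.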

If you're worrying about the practical problems of mapping Unicode
characters to font indices, please don't bother.  It's a practical
problem, yes, but you aren't going to enforce sanity on fonts by
perpetuating the charset bogosity.  For now, _any old hack_ will do,
just get glyphs on the screen for the fonts you use.  As long as the
API looks like a table and handles two-byte indices, we can
generalize and optimize the internals for space later, if we even need
to.
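
One possible shape for such a table, sketched in Python rather than C
and under the assumption that a plain dictionary is good enough as a
first internal representation.  The class and table names here are
hypothetical, and the JIS X 0208 indices are derived from Python's
EUC-JP codec purely for the demonstration.

```python
from typing import Dict, Optional


class FontIndexTable:
    """Maps Unicode code points to font indices of up to two bytes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._map: Dict[int, int] = {}

    def add(self, code_point: int, font_index: int) -> None:
        if not 0 <= font_index <= 0xFFFF:
            raise ValueError("font index must fit in two bytes")
        self._map[code_point] = font_index

    def lookup(self, char: str) -> Optional[int]:
        """Return the font index for char, or None if the font lacks it."""
        return self._map.get(ord(char))


# One-byte font: ISO 8859-1 glyphs indexed by their 8-bit codes.
latin1 = FontIndexTable("iso8859-1")
for octet in range(0xA0, 0x100):
    latin1.add(ord(bytes([octet]).decode("iso-8859-1")), octet)

# Two-byte font: JIS X 0208 indices, obtained from the EUC-JP octets
# by clearing the high bit of each, so both bytes fall in 33-126.
jisx0208 = FontIndexTable("jisx0208")
for ch in "\u3042\u3044\u3046":  # hiragana a, i, u
    hi, lo = ch.encode("euc_jp")
    jisx0208.add(ord(ch), ((hi & 0x7F) << 8) | (lo & 0x7F))
```

Callers only ever see `lookup', so the dictionary could later be
swapped for a more compact structure without touching them.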

I agree, we'll need named font index tables (a la Cmaps).  We've
already got the tables in etc/unicode.  Give them their Unicode names,
provide an aliasing mechanism, add an xemacs vendor directory, and put
anything we need that we don't already have in there.
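
The naming and aliasing part could look something like the sketch
below.  This is Python for illustration only: the registry class, the
canonical name "8859-1", and the placeholder table contents are all
assumptions, with `latin-iso8859-1' standing in for a legacy name that
should keep working.

```python
from typing import Dict


class TableRegistry:
    """Named tables with aliases that resolve to a canonical name."""

    def __init__(self) -> None:
        self._tables: Dict[str, dict] = {}
        self._aliases: Dict[str, str] = {}

    def register(self, name: str, table: dict) -> None:
        self._tables[name.lower()] = table

    def alias(self, alias: str, canonical: str) -> None:
        if canonical.lower() not in self._tables:
            raise KeyError(f"no table named {canonical!r}")
        self._aliases[alias.lower()] = canonical.lower()

    def get(self, name: str) -> dict:
        key = name.lower()
        return self._tables[self._aliases.get(key, key)]


registry = TableRegistry()
# Placeholder contents: one Unicode code point -> font index entry.
registry.register("8859-1", {0x00E9: 0xE9})
registry.alias("latin-iso8859-1", "8859-1")

same_table = registry.get("latin-iso8859-1") is registry.get("8859-1")
```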

Footnotes: 
[1]  If you read this as an anti-Microsoft rant, you're missing the
point.

-- 
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
