Stephen J. Turnbull wrote:
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
Ben> it's the other way around. the current situation is
Ben> incompatible with the X11 fonts, so we have to hack the
Ben> values using the bogus `graphic' characteristic.
Nothing bogus about it, it's perfectly (ISO 2022) standard. The X
fonts' use of character codes as font indexes where they fit into the
appropriate space is perfectly reasonable (although I've never
actually looked closely at an ISO-8859 font, thus the confusion on my
part, mea culpa), but is not of any particular interest here. Note
that multibyte fonts (like Japanese) do assume font indices in the
32-127 range (usually 33-126, actually).
yeah, i forgot about its iso-2022 connection.
Ben> note that in the new world, charsets can have values > 127
Ben> in any case. cf. big5, shift-jis, etc.
Please stop abusing the word "charset" for an object that is only
well-defined in a workspace I have no access to. It's confusing you,
too, it would seem.
Ben> so when i'm creating a new charset like `latin-windows-1252',
Ben> which is compatible with iso-8859-1 but has extra chars in
Ben> the range 128-159, do i do the right thing and have its chars
Ben> in the range 128-255 be indexed as 128-255 (and hence be
Ben> inconsistent with the `latin-iso8859-1' charset), or do i do
Ben> the wrong thing and move its range down to 0-127? and then
Ben> it appears to have ascii control chars in the range 0-31, but
Ben> they aren't control chars, value 10 is not linefeed, value 13
Ben> is not cr, etc.?
You're thinking in terms of Mule charsets. Don't, it's no help.
Those values should _never_ _ever_ appear in a context where they
could be confused with characters.
We don't need named coded character sets internally, we don't need to
associate random octets with charsets to make characters. (Except for
backward compatibility, where backward is spelled P E R V E R S E.)
We only need subsets of Unicode. Abstractly, characters from internal
text (LISP characters, strings, and buffers) should only ever be
mapped to their Unicode values, and then from Unicode to external
coding systems for I/O.
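
As a rough illustration of that two-step mapping (external octets to Unicode, then Unicode to some other external coding system), here is a minimal Python sketch. It uses Python's built-in codecs as stand-ins for coding systems. The helper name `transcode` is hypothetical and does not correspond to anything in XEmacs.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Convert octets between two external encodings, pivoting on Unicode.

    `src` and `dst` are Python codec names, used here only as stand-ins
    for coding systems. No charset-specific index arithmetic is involved.
    """
    text = data.decode(src)   # external octets -> Unicode characters
    return text.encode(dst)   # Unicode characters -> external octets


# Shift-JIS to EUC-JP without ever looking at charset indices.
sjis = "日本語".encode("shift_jis")
euc = transcode(sjis, "shift_jis", "euc_jp")
assert euc.decode("euc_jp") == "日本語"

# Octet 0x80 is the euro sign in windows-1252; as Unicode it is U+20AC.
assert transcode(b"\x80", "cp1252", "utf-8") == "\u20ac".encode("utf-8")
```

The point of the sketch is only that the internal representation never needs to know which external charset a character came from.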
If you're worrying about the practical problems of mapping Unicode
characters to font indices, please don't bother. It's a practical
problem, yes, but you aren't going to enforce sanity on fonts by
perpetuating the charset bogosity. For now, _any old hack_ will do,
just get glyphs on the screen for the fonts you use. As long as the
API looks like a table and handles two-byte indices, we can
generalize and optimize the internals for space later, if we even need
to.
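
One way to picture "looks like a table and handles two-byte indices" is the Python sketch below. The class `FontIndexTable` and the function `jisx0208_table` are hypothetical names invented for illustration. The table contents are derived from Python's `euc_jp` codec on the assumption that stripping the high bit from each EUC-JP byte gives the JIS X 0208 row and cell, which is the 33-126 range that multibyte fonts index by.

```python
from typing import Dict, Optional


class FontIndexTable:
    """Hypothetical table: Unicode code point -> one- or two-byte font index."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._map: Dict[int, int] = {}

    def add(self, codepoint: int, index: int) -> None:
        if not 0 <= index <= 0xFFFF:
            raise ValueError("font index must fit in two bytes")
        self._map[codepoint] = index

    def lookup(self, char: str) -> Optional[int]:
        """Return the font index for `char`, or None if the font lacks it."""
        return self._map.get(ord(char))


def jisx0208_table() -> FontIndexTable:
    """Build a table for a JIS X 0208 font (assumption: via the euc_jp codec)."""
    table = FontIndexTable("jisx0208")
    for row in range(0x21, 0x7F):          # rows 33..126
        for cell in range(0x21, 0x7F):     # cells 33..126
            euc = bytes([row | 0x80, cell | 0x80])
            try:
                char = euc.decode("euc_jp")
            except UnicodeDecodeError:
                continue                   # unassigned position
            if len(char) == 1:
                table.add(ord(char), (row << 8) | cell)
    return table


table = jisx0208_table()
print(hex(table.lookup("あ")))   # 0x2422 (row 0x24, cell 0x22)
print(table.lookup("A"))         # None: ASCII "A" is not in JIS X 0208
```

A plain dictionary is the "any old hack" version; a more compact internal layout could replace it later without changing `lookup`.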
I agree, we'll need named font index tables (a la Cmaps). We've
already got the tables in etc/unicode. Give them their Unicode names,
provide an aliasing mechanism, add an xemacs vendor directory, and put
anything we need that we don't already have in there.
Footnotes:
[1] If you read this as an anti-Microsoft rant, you're missing the
point.
i really don't understand your sarcastic attitude, or what point you're
trying to make. "charset" as i have defined it is a set of characters,
indexed by one or two bytes. the indices are as defined in the unicode
translation tables. you can certainly see my code if you want. the
main purpose of charsets in the new world is to interface with external
encodings, and maybe secondarily for font indexing under X. i see no
purpose in creating a totally new concept rather than extending the
current `charset' concept. in the new world, you can have an arbitrary
number of charsets, with any characters you want in them; the only way
that the code knows what's in the charset is by the appropriate unicode
translations that have been provided. i've created new charsets
`japanese-shift-jis' and `latin-windows-1252' and various others;
`cyrillic-koi8-r' and such will be coming soon. there will be a new
`mbcs' coding system type that just encodes one or more charsets using
their indices, in the obvious fashion; this will replace the need for ccl.
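
a minimal sketch of the idea in Python, not the actual code: a charset is just a name, a dimension (one or two bytes per index) and a table of Unicode translations, and an mbcs-style decoder walks the octets and tries each charset in turn. `Charset`, `charset_from_codec` and `mbcs_decode` are hypothetical names, the tables are borrowed from Python's own codecs rather than from etc/unicode, and the "try charsets in order" rule is an assumption about what "the obvious fashion" means.

```python
from typing import Dict, List


class Charset:
    """Hypothetical charset: one- or two-byte indices mapped to Unicode."""

    def __init__(self, name: str, dimension: int, to_unicode: Dict[int, str]) -> None:
        self.name = name
        self.dimension = dimension
        self.to_unicode = to_unicode


def charset_from_codec(name: str, codec: str, dimension: int) -> Charset:
    """Fill the translation table from a Python codec (illustration only)."""
    start, limit = (0, 0x100) if dimension == 1 else (0x8000, 0x10000)
    table: Dict[int, str] = {}
    for index in range(start, limit):
        try:
            char = index.to_bytes(dimension, "big").decode(codec)
        except UnicodeDecodeError:
            continue                      # index not assigned in this charset
        if len(char) == 1:                # skip pairs that are really two chars
            table[index] = char
    return Charset(name, dimension, table)


def mbcs_decode(data: bytes, charsets: List[Charset]) -> str:
    """Decode octets by trying each charset, in order, at every position."""
    out = []
    pos = 0
    while pos < len(data):
        for cs in charsets:
            chunk = data[pos:pos + cs.dimension]
            index = int.from_bytes(chunk, "big")
            if len(chunk) == cs.dimension and index in cs.to_unicode:
                out.append(cs.to_unicode[index])
                pos += cs.dimension
                break
        else:
            raise ValueError(f"no charset covers byte {data[pos]:#04x} at offset {pos}")
    return "".join(out)


# indices 128-159 keep their windows-1252 meaning: 0x80 is the euro sign.
win1252 = charset_from_codec("latin-windows-1252", "cp1252", 1)
assert win1252.to_unicode[0x80] == "\u20ac"

# shift-jis text as a two-byte charset plus ascii.
sjis = charset_from_codec("japanese-shift-jis", "shift_jis", 2)
ascii_cs = charset_from_codec("ascii", "ascii", 1)
assert mbcs_decode("日本語 ok".encode("shift_jis"), [sjis, ascii_cs]) == "日本語 ok"
```

nothing in the decoder knows what is in a charset beyond its translation table, which is the property described above.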
ben