Aidan Kehoe wrote:
On the twenty-eighth day of October, Ben Wing wrote:
> > Ben> currently we use 32-127 for the values of the chars in the
> > Ben> iso-8859 charsets. maybe that was needed under old-mule, but
> > Ben> in unicode-internal charsets can have values in any arbitrary
> > Ben> interval or rectangle in 256 or 256x256 space. shouldn't we
> > Ben> use 160-255? this would only matter in the output of
> > Ben> `split-char'; `make-char' already goes either way.
> >
> >No. This is gratuitous incompatibility with ISO 2022, legacy X11 font
> >indexing, and other Emacsen. Why buy trouble changing a public API?
>
> It's the other way around: the current situation is incompatible with
> the X11 fonts, so we have to hack the values using the bogus `graphic'
> characteristic.
It remains incompatible with ISO 2022 and other Emacsen, and will break
existing code.
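For reference, the behavior described in the quoted exchange (`make-char' accepting either octet form, `split-char' reporting the 32-127 form) can be sketched in XEmacs Lisp. The example character and the commented return value follow the thread's description; they have not been verified against a running XEmacs:

```lisp
;; Sketch based on the thread's description, not tested output.
;; ISO 8859-1 e-acute is octet #xE9; its 32-127 "position code"
;; is #x69 (105).
(make-char 'latin-iso8859-1 #xe9)  ; high-bit form
(make-char 'latin-iso8859-1 #x69)  ; low form; per Ben, both work
;; `split-char' reports the 32-127 indexing under discussion:
;; (split-char (make-char 'latin-iso8859-1 #xe9))
;;   => (latin-iso8859-1 105)
```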
> Note that in the new world, charsets can have values > 127 in any case;
> cf. Big5, Shift-JIS, etc.
>
> So when I'm creating a new charset like `latin-windows-1252', [...]
Create another API for character sets in that 256-char or
256x256-char space, then. Breaking the old API doesn’t buy a
whole lot.
> which is compatible with iso-8859-1 but has extra chars in the range
> 128-159, do I do the right thing and have its chars in the range 128-255
> be indexed as 128-255 (and hence be inconsistent with the
> `latin-iso8859-1' charset), or do I do the wrong thing and move its range
> down to 0-127? And then it appears to have ASCII control chars in the
> range 0-31, but they aren't control chars: value 10 is not linefeed,
> value 13 is not CR, etc.?
Another reason to create another API: it would be possible to implement
EBCDIC character sets with one that didn’t assume ASCII compatibility, as
the existing API does. With an API that didn’t suck,
(make-char-alternative code-page-037 #x40)
and
(make-char-alternative code-page-037 #xc0)
could and should mean different things (in code page 037, #x40 is SPACE
and #xc0 is '{').
I don't want two different kinds of charsets. But we can create all
charsets with the proper indices, have `split-char' and `char-octet'
generate the old, wrong indices, and have a new function do it right.

Suggestions for a new API to replace `make-char' and `split-char'?
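One possible shape for such a replacement, sketched here with hypothetical names (`make-char-exact', `split-char-exact'). Nothing below exists in any Emacs; it is only meant to make the proposal concrete:

```lisp
;; Hypothetical API sketch -- names and signatures are invented.
;;
;; (make-char-exact CHARSET INDEX1 &optional INDEX2)
;;   INDEX1 (and INDEX2 for 256x256 charsets) are the charset's true
;;   indices, with no ASCII offset or `graphic' hacking: 160-255 for
;;   `latin-iso8859-1', 128-255 for a `latin-windows-1252', the full
;;   0-255 for an EBCDIC code page.
;;
;; (split-char-exact CHAR)
;;   Inverse: returns (CHARSET INDEX1 [INDEX2]) in the same true
;;   indices, leaving the old `split-char' to keep emitting the
;;   compatibility values.
;;
;; E.g., under this sketch:
;; (make-char-exact 'latin-iso8859-1 #xe9)
;; (split-char-exact <that char>) => (latin-iso8859-1 233)
```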
BTW, unicode-internal now compiles (and crashes at startup, naturally).
I still need to add the translation tables for koi8-r and friends,
implement surrogates, and (the biggest current issue) redo font handling
to eliminate the concept of one-font-per-charset. Also, I need to add a
concept of "language" and introduce it appropriately in the
Unicode/charset conversion functions. (Neither of these last two will
make it into the first version of unicode-internal to be integrated into
the mainline.)
There are currently only about 64 `#ifdef UNICODE_INTERNAL's, and almost
all are localized to text.h, text.c and charset.h. (These are the only
files that know anything about the actual encoding of characters.
chartab.c, for example, knows only that its hashing function must be
different; mule-coding.c knows only that the bogus split Big5 charsets
don't exist under UNICODE_INTERNAL.) charset.h is totally rewritten and
might go away entirely. chartab.c is drastically changed and now uses the
same basic format as the Unicode translation tables. Also, we
unfortunately can only implement 30-bit chars, even though the full
UCS-4 space theoretically allows 31 bits (Unicode proper stops at
#x10FFFF, so nothing in actual Unicode is lost).
ben