On the thirty-first of October, Ben Wing wrote:
i don't want two different kinds of charsets. but we can create all
charsets with the proper indices and have `split-char' and `char-octet'
generate the old, wrong indices and a new function do it right.
suggestions for a new api to replace `make-char' and `split-char'?
If the internal encoding is Unicode, the charset (apart from, say, 'ucs)
isn’t as trivially available as it is with the iso-2022-oriented encoding.
Are you suggesting implementing something like the extant translation in
unicode.c?
GNU’s API of (decode-char 'ucs #x20ac) and (encode-char ?\ 'ucs), used with
other symbols (which they don’t allow, and which our compatibility
implementation doesn’t allow) could work well in that context.
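To make the idea concrete, here is a minimal C sketch of a table-driven decode/encode pair keyed on a charset name. Everything in it is an assumption for illustration: the function names, the struct, and the three-entry table are hypothetical and are not taken from unicode.c or from the patch under discussion. Only the mappings themselves (ISO 8859-15 0xA4 to U+20AC, KOI8-R 0xC1 and 0xE1 to U+0430 and U+0410) are real.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: one (charset, code) pair and its Unicode value. */
struct sketch_mapping {
    const char *charset;  /* charset name, e.g. "koi8-r" */
    long        code;     /* code point within that charset */
    long        ucs;      /* corresponding Unicode code point */
};

/* A few well-known mappings, for illustration only. */
static const struct sketch_mapping sketch_table[] = {
    { "latin-iso8859-15", 0xA4, 0x20AC },  /* EURO SIGN */
    { "koi8-r",           0xC1, 0x0430 },  /* CYRILLIC SMALL LETTER A */
    { "koi8-r",           0xE1, 0x0410 },  /* CYRILLIC CAPITAL LETTER A */
};

#define SKETCH_TABLE_LEN (sizeof sketch_table / sizeof sketch_table[0])

/* Return the Unicode code point for CODE in CHARSET, or -1 if unmapped. */
long sketch_decode_char(const char *charset, long code)
{
    size_t i;

    assert(charset != NULL);
    if (strcmp(charset, "ucs") == 0)
        return code;  /* "ucs" is treated as the identity mapping */
    for (i = 0; i < SKETCH_TABLE_LEN; i++)
        if (sketch_table[i].code == code
            && strcmp(sketch_table[i].charset, charset) == 0)
            return sketch_table[i].ucs;
    return -1;
}

/* Return the code for Unicode code point UCS in CHARSET, or -1 if unmapped. */
long sketch_encode_char(long ucs, const char *charset)
{
    size_t i;

    assert(charset != NULL);
    if (strcmp(charset, "ucs") == 0)
        return ucs;
    for (i = 0; i < SKETCH_TABLE_LEN; i++)
        if (sketch_table[i].ucs == ucs
            && strcmp(sketch_table[i].charset, charset) == 0)
            return sketch_table[i].code;
    return -1;
}
```

A real implementation would presumably index per-charset tables rather than scan a flat list; the linear search here only keeps the sketch short.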
btw unicode-internal now compiles (and crashes at startup,
naturally).
i still need to add the translation tables for koi8-r and friends,
implement surrogates and (the biggest current issue) redo font handling
to eliminate the concept of one-font-per-charset.
Excellent.
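For reference, the arithmetic behind surrogates is small. The following C sketch shows the standard UTF-16 surrogate-pair conversion in both directions; the function names are hypothetical and say nothing about how the unicode-internal code will actually structure this.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary-plane code point (U+10000..U+10FFFF) into a
   UTF-16 high/low surrogate pair. */
void sketch_to_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                               /* 20 bits remain */
    *high = (uint16_t) (0xD800 + (cp >> 10));    /* top 10 bits */
    *low  = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* bottom 10 bits */
}

/* Combine a high/low surrogate pair back into a code point. */
uint32_t sketch_from_surrogates(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000
        + (((uint32_t) (high - 0xD800) << 10) | (uint32_t) (low - 0xDC00));
}
```

For example, U+1F600 splits into 0xD83D and 0xDE00, and combining those two values gives U+1F600 back.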
also, add a concept of "language" and introduce it appropriately in the
unicode/charset conversion functions. (neither of these last two will
make it into the first version of unicode-internal to be integrated into
the mainline.)
Neither the Unicode conversion functions nor the charset conversion
functions will make it in? That doesn’t seem very practical; I’m sure you
mean something else there.
currently only about 64 ifdef UNICODE_INTERNAL's, and almost all
localized to text.h, text.c and charset.h. (these are the only files
that know anything about the actual encoding of characters. chartab.c,
for example, knows only that its hashing function must be different.
mule-coding.c knows only that the bogus split big5 charsets don't exist
under UNICODE_INTERNAL.)
That file is full of assumptions about our internal string format which need
to be changed if you’re changing that format. I haven’t seen you mention
that you are, but I find it hard to imagine supporting a 21-bit space with
the existing format, let alone a 30-bit space, given that you’d have to
abandon most of the leading byte architecture.
But, I’m sure I’ll understand more of the details when I see the patch.
charset.h is totally rewritten and might go away entirely. chartab.c is
drastically changed and now uses the same basic format as the unicode
translation tables.
also, we unfortunately can only implement 30-bit chars, even though
Unicode theoretically allows 31 bits.
They introduced a limit in 3.0 of 0x110000. Every code point approved by the
standard will be below that.
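The arithmetic is easy to check. The C sketch below, with hypothetical names throughout, shows that the largest value under the 0x110000 limit needs 21 bits, so it fits comfortably inside a 30-bit character type.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNICODE_LIMIT  UINT32_C(0x110000)  /* first value past U+10FFFF */
#define SKETCH_INTERNAL_BITS  30                  /* width discussed above */
#define SKETCH_INTERNAL_LIMIT (UINT32_C(1) << SKETCH_INTERNAL_BITS)

/* Number of bits needed to represent C (a value of 0 counts as 1 bit). */
int sketch_bits_needed(uint32_t c)
{
    int bits = 1;

    while ((c >>= 1) != 0)
        bits++;
    return bits;
}

/* Nonzero if C fits in the 30-bit internal representation. */
int sketch_fits_internal(uint32_t c)
{
    return c < SKETCH_INTERNAL_LIMIT;
}

/* Nonzero if C lies below the 0x110000 limit. */
int sketch_below_unicode_limit(uint32_t c)
{
    /* The standard's range must itself fit in the internal type. */
    assert(SKETCH_UNICODE_LIMIT <= SKETCH_INTERNAL_LIMIT);
    return c < SKETCH_UNICODE_LIMIT;
}
```

Here sketch_bits_needed(0x10FFFF) is 21, which matches the 21-bit space mentioned earlier.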
--
“Women pay more attention to the heart and less to foolishness. That is why
they live longer.” (C.R. Zafón, translated by Peter Schwaar.)