Aidan Kehoe wrote:
On the thirty-first of October, Ben Wing wrote:
> i don't want two different kinds of charsets. but we can create all
> charsets with the proper indices and have `split-char' and `char-octet'
> generate the old, wrong indices and a new function do it right.
> suggestions for a new api to replace `make-char' and `split-char'?
If the internal encoding is Unicode, the charset (apart from, say, 'ucs)
isn’t as trivially available as it is with the iso-2022-oriented encoding.
Are you suggesting implementing something like the extant translation in
unicode.c ?
right. these functions take an extra precedence-list argument (a list
of charsets), which when nil does something reasonable as a default.
essentially i removed unicode-to-char and integrated it into make-char;
but i'm open to other suggestions.
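A minimal sketch of the precedence-list idea described above, in C. All names here (the `sketch_` identifiers, the toy charset ids, the default list) are hypothetical and are not the actual XEmacs code. It assumes the "proper" 8-bit indices for the Latin charsets and models only the euro sign for ISO 8859-15.

```c
#include <stddef.h>

/* Hypothetical charset ids; real charset objects carry far more state. */
enum sketch_charset_id {
  SKETCH_CS_NONE = 0,
  SKETCH_CS_ASCII,
  SKETCH_CS_LATIN_ISO8859_1,
  SKETCH_CS_LATIN_ISO8859_15
};

struct sketch_charset_codepoint {
  enum sketch_charset_id charset; /* SKETCH_CS_NONE when nothing matched */
  int c1;                         /* codepoint within that charset, or -1 */
};

/* Toy reverse mapping: codepoint of UCS in charset CS, or -1 if CS
   cannot represent it.  Deliberately incomplete (assumption). */
static int
sketch_unicode_to_codepoint (enum sketch_charset_id cs, unsigned long ucs)
{
  switch (cs)
    {
    case SKETCH_CS_ASCII:
      return ucs < 0x80 ? (int) ucs : -1;
    case SKETCH_CS_LATIN_ISO8859_1:
      return (ucs >= 0xA0 && ucs <= 0xFF) ? (int) ucs : -1;
    case SKETCH_CS_LATIN_ISO8859_15:
      /* Only the euro sign is modelled in this toy table. */
      return ucs == 0x20AC ? 0xA4 : -1;
    default:
      return -1;
    }
}

/* Walk the precedence list in order and return the first charset that
   can represent UCS.  A NULL list selects a default, mirroring the
   "nil does something reasonable" behaviour described above. */
struct sketch_charset_codepoint
sketch_unicode_to_charset_codepoint (unsigned long ucs,
                                     const enum sketch_charset_id *precedence,
                                     size_t n)
{
  static const enum sketch_charset_id default_precedence[] =
    { SKETCH_CS_ASCII, SKETCH_CS_LATIN_ISO8859_1 };
  struct sketch_charset_codepoint result = { SKETCH_CS_NONE, -1 };
  size_t i;

  if (precedence == NULL)
    {
      precedence = default_precedence;
      n = sizeof default_precedence / sizeof default_precedence[0];
    }
  for (i = 0; i < n; i++)
    {
      int cp = sketch_unicode_to_codepoint (precedence[i], ucs);
      if (cp >= 0)
        {
          result.charset = precedence[i];
          result.c1 = cp;
          break;
        }
    }
  return result;
}
```

With the default list, U+20AC finds no charset; with a list that names ISO 8859-15 it resolves to codepoint 0xA4. That is the reason the list has to be passed in (or derived from something) rather than hard-wired.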
GNU’s API of (decode-char 'ucs #x20ac) and (encode-char ?€ 'ucs), used with
other symbols (which they don’t allow, and which our compatibility
implementation doesn’t allow) could work well in that context.
well, basically i want some api that says "convert a charset and
codepoints into a Lisp char" and vice-versa "get a charset and
codepoints from a Lisp char". keep in mind that we will be running both
unicode-internal and old-mule for a while, so the apis must work with
both. for this reason, i also introduced other levels of conversion,
similar but a bit different; e.g. charset-codepoint-to-unicode and
unicode-to-charset-codepoint unilaterally convert between a unicode
codepoint and a charset codepoint, regardless of the representation of a
char. on the other hand, int-to-char and char-to-int always give you
the actual int that makes up the character; not very portable. if you
want unicode, char-to-unicode. if you want charset codepoints, it's
currently `split-char'. but i suppose i could create
`char-to-charset-codepoint', which would follow the others, and
`charset-codepoint-to-char', and reintroduce `unicode-to-char', and
deprecate `make-char' and `split-char'. then you'd have a very
symmetrical api.
suggestions for less verbose names are welcome.
> btw unicode-internal now compiles (and crashes at startup, naturally).
> i still need to add the translation tables for koi8-r and friends,
> implement surrogates and (the biggest current issue) redo font handling
> to eliminate the concept of one-font-per-charset.
Excellent.
> also, add a concept of "language" and introduce it appropriately in the
> unicode/charset conversion functions. (neither of these last two will
> make it into the first version of unicode-internal to be integrated into
> the mainline.)
Neither the Unicode conversion functions nor the charset conversion
functions will make it in? That doesn’t seem very practical; I’m sure you
mean something else there.
no, neither the redone font handling nor the concept of a language
introduced into the unicode conversion functions will make it in. the
font handling will still bogusly be in terms of charsets, with the same
bogus hack currently there (under windows at least) to look harder
through various fonts to find a font that can display a char, when
necessary. i know what needs to be done to change this but it will be a
pervasive change and i don't want to bite off too much at a time right
now. similarly, the idea of introducing a language and tracking the
language of text using extent properties will take some doing. first
step would be to create a language object and set up properties on it,
such as the charset precedence list for unicode translation. then there
is a buffer-local `current-language' variable (maybe the language
environments can be made to work without too much effort). so we need
to pass around some sort of object from which the charset precedence
list can be derived -- the list itself, a language object (maybe just a
symbol, who knows), a buffer, etc. currently most functions don't
bother with this, so it requires a fair amount of refactoring.
> currently only about 64 ifdef UNICODE_INTERNAL's, and almost all
> localized to text.h, text.c and charset.h. (these are the only files
> that know anything about the actual encoding of characters. chartab.c,
> for example, knows only that its hashing function must be different.
> mule-coding.c knows only that the bogus split big5 charsets don't exist
> under UNICODE_INTERNAL.)
That file is full of assumptions about our internal string format which need
to be changed if you’re changing that format. I haven’t seen you mention
that you are, but I find it hard to imagine supporting a 21-bit space with
the existing format, let alone a 30-bit space, given that you’d have to
abandon most of the leading byte architecture.
which file? chartab.c got radically reworked to support 32-bit
characters using a page-table-style lookup, as for unicode translation
tables. same system is used for old-mule as well.
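For readers unfamiliar with a page-table-style lookup, here is a rough C sketch of the general technique. It is not the chartab.c code: the four-level, 8-bits-per-level split, the `int` values, and every `sketch_` name are assumptions made for illustration. Pages are allocated only when a value is stored, so sparse tables stay small. Freeing is omitted for brevity.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BITS_PER_LEVEL 8
#define SKETCH_FANOUT (1u << SKETCH_BITS_PER_LEVEL) /* 256 slots per page */
#define SKETCH_LEVELS 4                             /* 4 * 8 = 32 bits */

struct sketch_chartab {
  int default_value; /* returned for characters never stored */
  void *root;        /* NULL until the first store */
};

/* Slot index for CH at LEVEL: level 0 uses the top 8 bits,
   level 3 the bottom 8 bits. */
static unsigned
sketch_index (uint32_t ch, int level)
{
  int shift = (SKETCH_LEVELS - 1 - level) * SKETCH_BITS_PER_LEVEL;
  return (ch >> shift) & (SKETCH_FANOUT - 1);
}

int
sketch_chartab_get (const struct sketch_chartab *ct, uint32_t ch)
{
  void *node = ct->root;
  int level;

  /* Descend through the interior pages; a missing page means default. */
  for (level = 0; level < SKETCH_LEVELS - 1; level++)
    {
      if (node == NULL)
        return ct->default_value;
      node = ((void **) node)[sketch_index (ch, level)];
    }
  if (node == NULL)
    return ct->default_value;
  return ((int *) node)[sketch_index (ch, SKETCH_LEVELS - 1)];
}

/* Returns 0 on success, -1 if memory allocation fails. */
int
sketch_chartab_put (struct sketch_chartab *ct, uint32_t ch, int value)
{
  void **slot = &ct->root;
  unsigned i;
  int level;

  for (level = 0; level < SKETCH_LEVELS - 1; level++)
    {
      if (*slot == NULL)
        {
          void **page = malloc (SKETCH_FANOUT * sizeof (void *));
          if (page == NULL)
            return -1;
          for (i = 0; i < SKETCH_FANOUT; i++)
            page[i] = NULL;
          *slot = page;
        }
      slot = &((void **) *slot)[sketch_index (ch, level)];
    }
  if (*slot == NULL)
    {
      /* New leaf page: every entry starts out as the default. */
      int *leaf = malloc (SKETCH_FANOUT * sizeof (int));
      if (leaf == NULL)
        return -1;
      for (i = 0; i < SKETCH_FANOUT; i++)
        leaf[i] = ct->default_value;
      *slot = leaf;
    }
  ((int *) *slot)[sketch_index (ch, SKETCH_LEVELS - 1)] = value;
  return 0;
}
```

Because nothing in the lookup depends on how a character's integer value was derived, the same structure can serve both an old-mule and a Unicode-internal build, which is consistent with the remark above that only the hashing differs.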
mule-coding.c did not need so much rewriting, but it now works in terms
of generic functions. e.g. when iso2022 encodes to external, it uses
macros to accumulate a whole character, then calls
itext_to_charset_codepoint() -- an inline function -- to get an
appropriate charset and codepoint. similarly, once it's gotten a
charset and codepoints, it calls charset_codepoint_to_dynarr() --
another inline function -- to write it out. these inliners work in the
old and new system, conditionalized appropriately. similar stuff goes
everywhere else; i abstracted out "leading bytes" and all specific
knowledge of our internal representation whenever possible.
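The pattern of inline functions that are "conditionalized appropriately" might look roughly like the following C sketch. This is an illustration only: the bit layout in the non-Unicode branch is a toy, not the real old-mule encoding, the Unicode branch covers only U+0000..U+00FF, and all `sketch_` names are invented. Compile with `-DUNICODE_INTERNAL` to select the other branch.

```c
typedef unsigned int sketch_char;

/* Hypothetical charset ids for the toy example. */
enum sketch_charset { SKETCH_ASCII = 0, SKETCH_LATIN1 = 1 };

/* Callers see one interface; only this file knows the representation. */
static inline void
sketch_char_to_charset_codepoint (sketch_char ch,
                                  enum sketch_charset *cs, int *cp)
{
#ifdef UNICODE_INTERNAL
  /* The char is a Unicode value; toy mapping for U+0000..U+00FF only. */
  *cs = ch < 0x80 ? SKETCH_ASCII : SKETCH_LATIN1;
  *cp = (int) ch;
#else
  /* Toy layout: charset id above bit 8, codepoint in the low 8 bits. */
  *cs = (enum sketch_charset) (ch >> 8);
  *cp = (int) (ch & 0xFF);
#endif
}

static inline sketch_char
sketch_charset_codepoint_to_char (enum sketch_charset cs, int cp)
{
#ifdef UNICODE_INTERNAL
  (void) cs; /* the Unicode value alone identifies the character here */
  return (sketch_char) cp;
#else
  return ((sketch_char) cs << 8) | (sketch_char) (cp & 0xFF);
#endif
}
```

A coding system written against these two functions round-trips a charset and codepoint the same way in either build, without mentioning leading bytes anywhere.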
But, I’m sure I’ll understand more of the details when I see the
patch.
> charset.h is totally rewritten and might go away entirely. chartab.c is
> drastically changed and now uses the same basic format as the unicode
> translation tables.
> also, we unfortunately can only implement 30-bit chars, even though
> Unicode theoretically allows 31 bits.
They introduced a limit in 3.0 of 0x110000. Every code point approved by the
standard will be below that.
yeah, i've heard, but UCS-4 is still "theoretically" 31 bits, but all
private use ... oh well. we'll just say the high half of the space is
"private and reserved", heh heh :)
ben
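The three ranges mentioned in the last exchange (the 0x110000 limit, a 30-bit character, and the 31-bit UCS-4 space) can be compared with a few lines of C. The macro and function names below are invented for this sketch and do not come from the XEmacs sources.

```c
#include <stdint.h>

#define SKETCH_UNICODE_MAX 0x10FFFFu   /* last code point below 0x110000 */
#define SKETCH_CHAR30_MAX  0x3FFFFFFFu /* largest 30-bit value */
#define SKETCH_UCS4_MAX    0x7FFFFFFFu /* largest 31-bit value */

/* Number of bits needed to represent V (0 for V == 0). */
int
sketch_bits_needed (uint32_t v)
{
  int bits = 0;
  while (v != 0)
    {
      bits++;
      v >>= 1;
    }
  return bits;
}

/* Nonzero if C fits in a 30-bit character. */
int
sketch_fits_30_bits (uint32_t c)
{
  return c <= SKETCH_CHAR30_MAX;
}

/* Nonzero if C is below the 0x110000 limit. */
int
sketch_is_standard_code_point (uint32_t c)
{
  return c <= SKETCH_UNICODE_MAX;
}
```

Everything below 0x110000 needs at most 21 bits, so it fits in a 30-bit character with room to spare; only values in the upper half of the 31-bit space (0x40000000 and above) are unrepresentable.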