Stephen J. Turnbull wrote:

> >>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
>
>     Ben> char-charset, char-octet and split-char will remain in a
>     Ben> unicode world
>
> Why? Isn't this the Ebola thing all over again?
what do you mean? especially, what do you mean about ebola? ebola to
me means "char-int confoundance".
the idea is to maintain these for backward compatibility. these
functions take an optional precedence-list argument; if you just call
them as before, you get something useful. keep in mind also that we
have code that will have to work with both the old Mule-internal and
the new Unicode-internal rep for quite a while, so we can't just get
rid of these functions.
in a unicode world, (split-char ch &optional charsets) is basically
equivalent to (unicode-to-charset-codepoint (char-to-unicode ch)
charsets), where (char-to-unicode ch) does very little. (char-charset
ch &optional charsets), likewise, is more or less (car (split-char
ch charsets)). the idea is that the functions should provide consistent
semantics, as much as this is possible, and do the most
efficient/data-preserving thing, regardless of internal representation.
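in elisp terms, roughly this (just a sketch; the names and return
values aren't settled):

    ;; sketch only: assumes `unicode-to-charset-codepoint' returns a
    ;; list of the form (CHARSET OCTET1 [OCTET2]), and
    ;; `char-to-unicode' returns the unicode codepoint of CH.
    (defun split-char (ch &optional charsets)
      (unicode-to-charset-codepoint (char-to-unicode ch) charsets))

    (defun char-charset (ch &optional charsets)
      ;; the charset is just the first element of the split
      (car (split-char ch charsets)))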
> I think we should design the API under the assumption that we're going
> to get rid of those functions. Instead, there should be translation
> functions (ie, codecs) that take an explicit coded character set
> argument for those cases where it's needed (eg, X11 font registries).
if you get rid of the funs, then presumably all code needs to condition
on (featurep 'unicode-internal)?
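i.e. every caller ends up with something like this (just a sketch,
using the conversion functions i proposed):

    (if (featurep 'unicode-internal)
        ;; new world: go through unicode explicitly
        (unicode-to-charset-codepoint (char-to-unicode ch) charsets)
      ;; old mule world: the traditional call
      (split-char ch charsets))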
can you sketch out your proposal of what these functions should look like?
if you want, include your view on my proposed functions
charset-codepoint-to-unicode and unicode-to-charset-codepoint.
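for concreteness, here's how i imagine calling them (these signatures
are just my guess at this point):

    ;; U+3042 (HIRAGANA LETTER A) is kuten 04-02 in jisx0208, i.e.
    ;; octets 36 34 (0x24 0x22):
    (charset-codepoint-to-unicode 'japanese-jisx0208 36 34)
        ; => #x3042
    (unicode-to-charset-codepoint #x3042 '(japanese-jisx0208))
        ; => (japanese-jisx0208 36 34)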
>     Ben> internally, some concept like "leading byte" will remain but
>     Ben> will simply be an arbitrary charset index.
>
> Again, I don't see what this is for in principle. We might want to
> cache something like that for efficiency in redisplay (and that only
> on core X11 displays, even Xft doesn't need it), but that should be
> encapsulated in the implementation.
well, right now i'm rewriting charsets so that they're not limited to
covering 94, 94x94, 96 or 96x96, and eliminating the requirement that
all charsets have an iso2022 final byte. this is so we can add
size-128 charsets like windows-1252 and larger ones like big5 and
johab. specifically, a charset has 1 or 2 dimensions, and each
dimension has a size of up to 256 (indices 0 through 255); in
addition, you can specify an offset for each dimension, so that e.g.
the allowed indices of Big5 (treated as a single charset) are [0x21 to
0x7e] and [0x40 to 0xfd] -- i.e. size (94 190) and offset (33 64).
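under the new scheme, defining big5 as a single charset might look
something like this (hypothetical -- the `offset' property is new,
`chars' taking a per-dimension list is new, and none of the property
names are final):

    ;; sketch of the proposed `make-charset' extensions; note that no
    ;; iso2022 `final' property is required any more.
    (make-charset
     'chinese-big5 "Big5 as a single charset"
     '(dimension 2
       chars (94 190)    ; size of each dimension
       offset (33 64)))  ; lowest valid index per dimension: 0x21, 0x40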
you won't be able to create a character using these charsets under the
old mule-internal rep. (there are two issues. one, if either dimension
is larger than 96, the charset won't fit into an Ichar without radical
surgery like what was done to big5. two, if there is no iso2022 final,
you can't write the char out in iso2022, meaning we can't autosave a
file containing it. that could potentially be fixed by creating an
extension to the `escape-quoted' format, but it's probably not worth
it, as we will certainly make unicode-internal the default as soon as
it stabilizes.)
but even for charsets you can't instantiate to chars (in old-Mule), you
can still use them to convert between unicode and such charsets, and you
can use them as part of coding systems. probably i will add a new
coding-system type called `mbcs' which simply takes a precedence list of
charsets and reads/writes a very simple representation where each
character is just represented by its octets. so the `shift-jis' coding
system is just an mbcs coding system with the list (jis-roman,
shift-jis, katakana-jisx0201). i'll have to see how this overlaps in
functionality with mswindows-multibyte, which implements something similar.
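to sketch the mbcs idea (the `mbcs' type doesn't exist yet, and the
property names are made up):

    ;; define shift-jis as a simple multi-byte coding system over a
    ;; charset precedence list; each character is read/written as its
    ;; raw octets in the first charset that can represent it.  here
    ;; `shift-jis' in the list names the two-byte charset whose
    ;; codepoints are the raw shift-jis octets, not the coding system.
    (make-coding-system
     'shift-jis 'mbcs
     "Shift-JIS as a simple multi-byte coding system"
     '(charsets (jis-roman shift-jis katakana-jisx0201)))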
the idea -- correct me if i'm wrong, stephen -- is that our `charset'
should be a coded character set, similar to what's used in MIME, and
that we should support, as much as is reasonable, all the charsets
defined there. what's tricky is that the `charset' attribute as used in
MIME actually conflates charset and encoding, e.g. "charset=utf-8" or
"charset=iso-2022-jp". (of course, even "coded character set" conflates
some ideas, namely the distinction between a simple "set of characters"
and an "assignment of octet values to characters", so that e.g.
`shift-jis' and `jisx0208' are two different coded character sets
containing the same characters.) note that Windows makes this same
conflation as MIME -- early on, its "code pages" were basically coded
character sets, but now they have "utf8" code pages and such.
stephen, when you have a chance, could you look at the intro mule
documentation on character sets, encodings and such? i have a feeling
that the concept of a `charset' as a `coded character set', and its
conflation of "character set" and "assignment of codes", is something
that could use some clarification; certainly also the fact that other
things that apparently name a coded charset (e.g. the MIME charset
property, code pages) may actually name an encoding.
btw, leading bytes/ids/whatever may well just go away entirely. in
fact, they probably should; then i can use their removal as a check to
find any places in the code that need changing for unicode-internal.
> Of course in a second stage we'll probably want Mule emulation for
> backward GNU compatibility (eg for MUAs), but I think we should see
> how far we can go without "old Mule" baggage.
elaborate?
stephen, what do you think of the rest of what i suggested?