Stephen J. Turnbull wrote:
>>>>> "Ben" == Ben Wing <ben(a)666.com> writes:
    Ben> char-charset, char-octet and split-char will remain in a
    Ben> unicode world
Why?  Isn't this the Ebola thing all over again?
what do you mean?  especially, what do you mean about ebola?  ebola to 
me means "char-int confoundance".
the idea is to maintain these for backward compatibility.  these 
functions take an optional precedence-list argument; if you just call 
them as before, you get something useful.  keep in mind also that we have 
code that will have to work with both the old Mule-internal and the new 
Unicode-internal rep for quite a while, so we can't just get rid of these 
functions.
in a unicode world, (split-char ch &optional charsets) is basically 
equivalent to (unicode-to-charset-codepoint (char-to-unicode ch) 
charsets), where (char-to-unicode ch) does very little.  (char-charset 
ch &optional charsets), likewise, is more or less like (car (split-char 
ch charsets)).  the idea is that these functions should provide consistent 
semantics, as much as possible, and do the most efficient/data-preserving 
thing, regardless of internal representation.
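to make the equivalence concrete, here's a rough sketch -- note that 
`char-to-unicode' and `unicode-to-charset-codepoint' are the proposed 
names, not functions that exist yet:

```lisp
;; sketch only: `char-to-unicode' and `unicode-to-charset-codepoint'
;; are proposed functions, not existing API.
(defun split-char (ch &optional charsets)
  ;; under unicode-internal, char-to-unicode does very little
  (unicode-to-charset-codepoint (char-to-unicode ch) charsets))

(defun char-charset (ch &optional charsets)
  ;; the charset is just the car of the split
  (car (split-char ch charsets)))
```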
I think we should design the API under the assumption that we're going
to get rid of those functions.  Instead, there should be translation
functions (ie, codecs) that take an explicit coded character set
argument for those cases where it's needed (eg, X11 font registries).
if you get rid of the funs, then presumably all code needs to condition 
on (featurep 'unicode-internal)?
can you sketch out your proposal of what these functions should look like?
if you want, include your view on my proposed functions 
charset-codepoint-to-unicode and unicode-to-charset-codepoint.
    Ben> internally, some concept like "leading byte" will remain but
    Ben> will simply be an arbitrary charset index.
Again, I don't see what this is for in principle.  We might want to
cache something like that for efficiency in redisplay (and that only
on core X11 displays, even Xft doesn't need it), but that should be
encapsulated in the implementation.
well, right now i'm rewriting charsets so that they're not limited to 
covering 94, 94x94, 96, 96x96 and eliminating the requirement that all 
charsets have an iso2022 final byte given.  this is so we can add 
128-byte charsets like windows-1252 and 94x158 charsets like big5 and 
johab.  specifically, a charset has 1 or 2 dimensions, and each dimension 
is of size 0 through 255; in addition, you can specify an offset for 
each dimension, so that [e.g.] the allowed indices of Big5 are [0x21 to 
0x7e] and [0x40 to 0xfd].  this gives size (94 158) and offset (33 64).
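in other words, the validity check per dimension is just arithmetic; a 
minimal sketch (the function name is invented for illustration):

```lisp
;; minimal sketch: an octet is a valid index in a dimension iff
;; offset <= octet < offset + size.  (function name invented here.)
(defun charset-octet-valid-p (octet size offset)
  (and (>= octet offset)
       (< octet (+ offset size))))

;; with size 94 and offset 33 (#x21) for the first dimension:
(charset-octet-valid-p #x21 94 33)  ; => t   (lower bound)
(charset-octet-valid-p #x7e 94 33)  ; => t   (upper bound, 33 + 94 - 1)
(charset-octet-valid-p #x7f 94 33)  ; => nil
```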
you won't be able to create a character using these charsets under old 
mule-internal.  (there are two issues.  one, if either dimension > 96, 
the charset won't fit into an Ichar without radical surgery like what 
was done to big5.  two, if there is no iso2022 final byte, you can't save 
out in iso2022, meaning we can't autosave a file with such a char.  that 
could potentially be fixed by creating an extension to the 
`escape-quoted' format, but it's probably not worth it, as we will 
certainly make unicode-internal the default as soon as it stabilizes.)
but even for charsets you can't instantiate to chars (in old-Mule), you 
can still use them to convert between unicode and such charsets, and you 
can use them as part of coding systems; probably i will add a new 
coding-system type called `mbcs' which simply takes a precedence list of 
charsets and reads/writes a very simple representation where each 
character is just represented by its octets.  so the `shift-jis' coding 
system is just an mbcs coding system with charset list (jis-roman, 
shift-jis, katakana-jisx0201).  i'll have to see how this overlaps in 
functionality with mswindows-multibyte, which implements something similar.
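something like this, purely hypothetically -- `mbcs' doesn't exist yet, 
and this make-coding-system call is a sketch of the intended interface, 
not real API:

```lisp
;; hypothetical sketch: `mbcs' is the proposed coding-system type;
;; this signature is illustrative, not the real make-coding-system API.
(make-coding-system
 'shift-jis 'mbcs
 "Shift-JIS as a simple multi-byte representation"
 '(charsets (jis-roman shift-jis katakana-jisx0201)))
```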
the idea -- correct me if i'm wrong, stephen -- is that our `charset' 
should be a coded charset, similar to what's used in MIME, and that, as 
far as is reasonable, we should support all the charsets defined there.  
what's tricky is that the `charset' attribute as used in MIME actually 
conflates charset and encoding; e.g. "charset=utf8" or 
"charset=iso-2022-jp".  (of course, even "coded charset" conflates some
ideas, namely the distinction of simple "set of characters" and 
"assignment of octet values to characters", so that e.g. `shift-jis' and 
`jisx0208' are two different coded character sets containing the same 
characters.) note that Windows makes this same conflation as MIME -- 
early on its "code pages" were basically coded character sets, but now 
they have "utf8" code pages and such.
stephen, when you have a chance, could you look at the intro mule 
documentation on character sets, encodings and such?  i have a feeling 
that the concept of a `charset' as a `coded character set', and its 
conflation of character set and assignment of codes, is something that 
could use some clarification; so, certainly, could the fact that other 
things that apparently name a coded charset (e.g. the MIME charset 
property, code pages) may actually name an encoding.
btw, leading bytes/ids/whatever may well just go away entirely.  in 
fact, probably they should; then i can use this as a check to find any 
places in the code that need changing for unicode-internal.
Of course in a second stage we'll probably want Mule emulation for
backward GNU compatibility (eg for MUAs), but I think we should see
how far we can go without "old Mule" baggage.
elaborate?
stephen, what do you think of the rest of what i suggested?