sorry if i've missed some of the discussions on switching to unicode
internally. i've been thinking of what needs to be done to implement
this. besides simply changing the internal representation, we need:
-- [1] at some point, use extent properties to track the language of a
text. this is well-recognized.
-- [2] charsets should be generalized so that they encompass an
arbitrary set of unicode chars.
-- [3] we should add unicode-compatible charsets. the names should be
chosen so that they map programmatically onto the perl-compatible
charset names used with regexp \p (see [4]).
-- [4] the perl regexp \p syntax should be adopted for referencing
charsets. (char categories just suck.) for that matter, we should move
in the direction of being as perl-compatible as possible with our
regexps, since that is where the world is going. (cf java, python, ruby,
c#, ...) the big problem here is \( and (, whose meanings are reversed
between emacs and perl regexps. the only reasonable solutions i can see
are [a] a global variable to control which kind of regexp is used; [b] a
second set of all the functions that take regexps. comments?
-- [5] char tables need to be changed. their current implementation is
heavily tied to the current mule character structure. we will also need
to change `map-char-table'. unfortunately, this will be incompatible
with its current workings. fortunately, only three packages use
map-char-table, and from looking at these three, it's not clear anything
will break. also, the new map-char-table will work like the one in GNU Emacs.
-- [6] at some point, font objects should be changed to include a char
table so that different charsets can have different fonts. it is easy
to maintain backward compatibility here -- the "old" way of doing things
just maps all chars to the same font. and let's please use font
vectors, not XLFD-style crap.
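to make the \( vs ( problem in [4] concrete, here is a rough sketch in python of what "backwards" means mechanically. it is only an illustration, not a proposed implementation: the function name `emacs_to_perl` is made up, it only flips the operators ( ) { } |, and it copies bracket expressions through untouched, which is only right when they contain no backslash and no posix class. other emacs-only escapes (\s-, \<, \_< and friends) are passed through unchanged and would need their own handling.

```python
# characters whose escaped/unescaped meanings are swapped between the two syntaxes
SWAPPED = set("(){}|")

def emacs_to_perl(pattern):
    """flip \\( \\) \\{ \\} \\| <-> ( ) { } | ; everything else is copied as-is."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            # escaped operator in emacs syntax becomes a bare operator;
            # any other escape pair is kept exactly as written
            out.append(nxt if nxt in SWAPPED else c + nxt)
            i += 2
        elif c == "[":
            # copy a bracket expression verbatim (simplification, see above)
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":  # a leading ] is a literal
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif c in SWAPPED:
            # bare ( ) { } | are literals in emacs syntax, so escape them
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

for example, the emacs-style pattern `\(foo\|bar\)+(x)` comes out as `(foo|bar)+\(x\)`. whichever of [a] or [b] is chosen, something has to know about this swap for every regexp that crosses the boundary.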
as for [5], map-char-table needs to pass the mapping function either a
single char or a range; *not* any of the other crap it currently passes.
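a minimal sketch of that char-or-range contract, in python rather than elisp. the table representation (a list of `(first, last, value)` triples) and the name `map_char_table` are assumptions made for illustration, not the actual XEmacs or GNU Emacs data structures.

```python
def map_char_table(ranges, fn):
    """call fn(key, value) once per entry.

    key is either a single code point (int) or a (first, last) tuple;
    nothing else is ever passed.
    """
    for first, last, value in ranges:
        key = first if first == last else (first, last)
        fn(key, value)


# hypothetical table: A-Z share one value, underscore has its own
table = [(0x41, 0x5A, "upper"), (0x5F, 0x5F, "symbol")]
seen = []
map_char_table(table, lambda key, value: seen.append((key, value)))
# seen is now [((0x41, 0x5A), "upper"), (0x5F, "symbol")]
```

the point is that a caller only ever has to handle two key shapes, so a simple "is it a range?" test is enough.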
also, in [5] we have an implementation choice. either we use sorted
ranges or we use page-table-style lookups, indexed successively on each
byte of the char value. both are already implemented in XEmacs ("range
tables" and "unicode translation tables"). this is a standard
speed-vs-space issue. range tables are O(log n) per lookup but compact;
page tables are O(1) but potentially large. (although it depends on how full
the tables are. a sparse table referencing a few chars all over the
unicode space could get very large with the page-table style; but a
dense table with high locality might actually be smaller.) should we
allow the user to control this, with range tables the default?
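the two options can be sketched roughly as follows, in python for brevity. these are toy versions written for this message, not the existing XEmacs "range tables" or "unicode translation tables"; the class names, the `slots` counter, and the no-overlap assumption in `put_range` are all made up for illustration.

```python
import bisect


class RangeTable:
    """sorted, non-overlapping inclusive ranges; lookup is a binary search."""

    def __init__(self):
        self.starts = []  # sorted first code point of each range
        self.ends = []    # matching last code point (inclusive)
        self.values = []

    def put_range(self, first, last, value):
        # simplification: assumes the new range overlaps nothing already stored
        i = bisect.bisect_left(self.starts, first)
        self.starts.insert(i, first)
        self.ends.insert(i, last)
        self.values.insert(i, value)

    def get(self, ch, default=None):
        i = bisect.bisect_right(self.starts, ch) - 1
        if i >= 0 and ch <= self.ends[i]:
            return self.values[i]
        return default


class PageTable:
    """three levels of 256-entry pages, indexed on successive bytes of the char."""

    def __init__(self):
        self.root = [None] * 256
        self.slots = 256  # total slots allocated so far, as a rough size measure

    def put(self, ch, value):
        hi, mid, lo = (ch >> 16) & 0xFF, (ch >> 8) & 0xFF, ch & 0xFF
        if self.root[hi] is None:
            self.root[hi] = [None] * 256
            self.slots += 256
        page = self.root[hi]
        if page[mid] is None:
            page[mid] = [None] * 256
            self.slots += 256
        page[mid][lo] = value

    def get(self, ch, default=None):
        page = self.root[(ch >> 16) & 0xFF]
        if page is None:
            return default
        leaf = page[(ch >> 8) & 0xFF]
        if leaf is None:
            return default
        value = leaf[ch & 0xFF]
        return default if value is None else value
```

in this toy version, storing just three scattered chars (0x41, 0x4E2D, 0x1F600) in the page table allocates 1536 slots, against three entries in the range table. going the other way, 256 contiguous chars with 256 different values fit in a single leaf page (768 slots total) while the range table needs 256 separate entries of three fields each, which is the "dense table with high locality" case.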