unicode internal

Tuesday, 4 October 2005

        sorry if i've missed some of the discussions on switching to unicode 
internally.  i've been thinking of what needs to be done to implement 
this.  besides simply changing the internal representation, we need:

-- [1] at some point, use extent properties to track the language of a 
text.  this is well-recognized.
-- [2] charsets should be generalized so that they encompass an 
arbitrary set of unicode chars.
-- [3] we should add unicode-compatible charsets.  the names should be 
such that they programmatically map onto perl-compatible (used with 
regexp \p, see below) charset names.
-- [4] the perl regexp \p syntax should be adopted for referencing 
charsets. (char categories just suck.) for that matter, we should move 
in the direction of being as perl-compatible as possible with our 
regexps, since that is where the world is going. (cf java, python, ruby, 
c#, ...) the big problem here is \( and (, whose meanings are reversed 
between our regexps and perl's.  the only reasonable solutions i can see 
are [a] a global variable to control 
which kinds of regexps are used; [b] a double set of all functions that 
take regexps.  comments?
-- [5] char tables need to be changed.  their current implementation is 
heavily tied to the current mule character structure.  we will also need 
to change `map-char-table'.  unfortunately, this will be incompatible 
with its current workings.  fortunately, only three packages use 
map-char-table, and from looking at these three, it's not clear anything 
will break.  also, the new map-char-table will work like the one in GNU Emacs.
-- [6] at some point, font objects should be changed to include a char 
table so that different charsets can have different fonts.  it is easy 
to maintain backward compatibility here -- the "old" way of doing things 
just maps all chars to the same font.  and let's please use font 
vectors, not XLFD-style crap.
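
to make the \( vs ( problem in [4] concrete, here is a rough sketch in python (not XEmacs code) of what a translator between the two syntaxes would have to do.  the function name `emacs_to_perl' is made up for illustration, and the sketch only covers the grouping, alternation and interval operators; it assumes bracket expressions are simple and ignores every other difference between the two regexp dialects.

```python
# Hypothetical helper, for illustration only: swap the escaped/unescaped
# meaning of the operators that differ between Emacs-style and Perl-style
# regexps.  Not a complete translator.

SWAPPED = set("(){}|")  # special when escaped in Emacs, when bare in Perl


def emacs_to_perl(pattern):
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in SWAPPED:
                out.append(nxt)              # \( becomes (
            else:
                out.append(c + nxt)          # other escapes pass through
            i += 2
        elif c in SWAPPED:
            out.append("\\" + c)             # literal ( becomes \(
            i += 1
        elif c == "[":
            # Copy a bracket expression as a unit.  Emacs has no escapes
            # inside [...], so a backslash there is literal and must be
            # doubled for Perl.  Nested classes such as [[:alpha:]] are
            # only handled approximately.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1                       # a leading ] is literal
            while j < n and pattern[j] != "]":
                j += 1
            out.append(pattern[i:j + 1].replace("\\", "\\\\"))
            i = j + 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


print(emacs_to_perl(r"\(foo\|bar\)"))   # (foo|bar)
print(emacs_to_perl("f(x)"))            # f\(x\)
```

a translator like this is mechanical, but it only helps if the code knows which dialect a given string is written in, which is why the choice between [a] and [b] still has to be made.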

as for [5], map-char-table needs to hand its function either a single 
char or a range; *not* any of the other crap it currently passes.
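
a minimal sketch of that calling convention, in python rather than lisp or C.  the table layout (a sorted list of inclusive `(start, end, value)' triples) and the function name are assumptions made for the example, not the actual XEmacs data structure.

```python
# Illustrative only: a mapping function whose callback receives either a
# single code point (int) or an inclusive (start, end) range, never
# anything else.

def map_char_table(fn, table):
    """Call fn(key, value) for each entry of table.

    table is assumed to be a sorted list of non-overlapping
    (start, end, value) triples with inclusive bounds.
    """
    for start, end, value in table:
        key = start if start == end else (start, end)
        fn(key, value)


def show(key, value):
    if isinstance(key, tuple):
        print("range U+%04X..U+%04X -> %s" % (key[0], key[1], value))
    else:
        print("char  U+%04X -> %s" % (key, value))


table = [
    (0x41, 0x41, "latin-a"),        # one character
    (0x4E00, 0x9FFF, "han"),        # a contiguous range
]
map_char_table(show, table)
```

with only two possible key shapes, a caller needs a single type check instead of a case for each kind of key.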

also, in [5] we have an implementation choice.  either we use sorted 
ranges or we use page-table-style lookups, indexed successively on each 
byte of the char value.  both are already implemented in XEmacs ("range 
tables" and "unicode translation tables").  this is a standard 
speed-vs-space issue.  the former have O(log n) lookup but are compact; 
the latter have O(1) lookup but are potentially large. (although it depends on how full 
the tables are.  a sparse table referencing a few chars all over the 
unicode space could get very large with the page-table style; but a 
dense table with high locality might actually be smaller.) should we 
allow the user to control this, with range tables the default?
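
the two choices can be sketched side by side.  this is python written for illustration, not the existing XEmacs "range table" or "unicode translation table" code; the class names, the three-level split on the bytes of the code point, and the `pages' counter are all assumptions of the sketch.

```python
import bisect


class RangeTable:
    """Sorted, non-overlapping inclusive ranges; lookup by binary search."""

    def __init__(self, ranges):
        # ranges: iterable of (start, end, value) triples
        self.ranges = sorted(ranges, key=lambda r: r[0])
        self.starts = [r[0] for r in self.ranges]

    def get(self, ch, default=None):
        # Find the last range whose start is <= ch: O(log n).
        i = bisect.bisect_right(self.starts, ch) - 1
        if i >= 0:
            start, end, value = self.ranges[i]
            if ch <= end:
                return value
        return default


class PageTable:
    """Three-level table indexed on successive bytes of the code point."""

    def __init__(self):
        self.root = [None] * 256
        self.pages = 0              # number of 256-slot pages allocated

    def set(self, ch, value):
        hi, mid, lo = (ch >> 16) & 0xFF, (ch >> 8) & 0xFF, ch & 0xFF
        level2 = self.root[hi]
        if level2 is None:
            level2 = self.root[hi] = [None] * 256
            self.pages += 1
        leaf = level2[mid]
        if leaf is None:
            leaf = level2[mid] = [None] * 256
            self.pages += 1
        leaf[lo] = value

    def get(self, ch, default=None):
        # A fixed number of index operations regardless of table size.
        level2 = self.root[(ch >> 16) & 0xFF]
        if level2 is None:
            return default
        leaf = level2[(ch >> 8) & 0xFF]
        if leaf is None:
            return default
        value = leaf[ch & 0xFF]
        return default if value is None else value


# Three characters scattered across the code space.
sparse = PageTable()
for ch in (0x41, 0x4E2D, 0x1F600):
    sparse.set(ch, "x")
print(sparse.pages)     # 5 pages of 256 slots for 3 entries

ranges = RangeTable([(0x41, 0x41, "x"), (0x4E2D, 0x4E2D, "x"),
                     (0x1F600, 0x1F600, "x")])
print(ranges.get(0x4E2D), ranges.get(0x4E2E))   # x None
```

in the sparse case the page table allocates five 256-slot pages to hold three entries, while the range table stores three triples.  a table covering one dense block would reverse the comparison only if its entries did not collapse into a few ranges.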
