NOTE: One possible default internal representation that is compatible
with UTF-16 but allows all possible chars in UCS-4 would be to set aside
a more-or-less unused range of 2048 chars as extra leading units (not
from the private-use area, because Microsoft actually uses up most or
all of it with EUDC chars).  Let's say we picked A400 - ABFF.  Then we'd
have:

0000 - FFFF                   Simple chars
D[8-B]xx D[C-F]xx             Surrogate char, represents 1M chars
A[4-B]xx D[C-F]xx D[C-F]xx    Surrogate char, represents 2G chars

This is exactly the same number of chars as UCS-4 handles, and it has the
same properties as UTF-8 and Mule-internal:

1. There are two disjoint groupings of units, one representing leading
   units and one representing non-leading units.
2. Given a leading unit, you immediately know how many units follow to
   make up a valid char, irrespective of any other context.
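
The layout above can be sketched in a few lines of Python.  This is only
an illustration, not code from XEmacs: the names (unit_count, encode,
decode) are made up, the bit layout of the three-unit form (11 + 10 + 10
bits of the raw code point) is an assumption, and so is the choice to
push BMP code points that collide with the reserved unit ranges into the
three-unit form.

```python
# Sketch of the proposed 16-bit-unit representation (assumed layout).
LEAD3_BASE, LEAD3_END = 0xA400, 0xABFF   # 2048 leads for the three-unit form
HI_BASE, HI_END = 0xD800, 0xDBFF         # UTF-16 high surrogates (two-unit leads)
TRAIL_BASE, TRAIL_END = 0xDC00, 0xDFFF   # the only non-leading units


def unit_count(lead):
    """Property 2: the leading unit alone gives the length of the char."""
    if TRAIL_BASE <= lead <= TRAIL_END:
        raise ValueError("not a leading unit: %04X" % lead)
    if LEAD3_BASE <= lead <= LEAD3_END:
        return 3
    if HI_BASE <= lead <= HI_END:
        return 2
    return 1


def encode(cp):
    """Return the list of 16-bit units for a code point below 2**31."""
    if not 0 <= cp < 1 << 31:
        raise ValueError("code point out of UCS-4 range")
    # Assumption: values that collide with reserved units are escaped
    # through the three-unit form instead of being stored directly.
    reserved = LEAD3_BASE <= cp <= LEAD3_END or HI_BASE <= cp <= TRAIL_END
    if cp < 0x10000 and not reserved:
        return [cp]
    if 0x10000 <= cp <= 0x10FFFF:
        v = cp - 0x10000                  # ordinary UTF-16 surrogate pair
        return [HI_BASE + (v >> 10), TRAIL_BASE + (v & 0x3FF)]
    return [LEAD3_BASE + (cp >> 20),
            TRAIL_BASE + ((cp >> 10) & 0x3FF),
            TRAIL_BASE + (cp & 0x3FF)]


def decode(units, i=0):
    """Decode one char starting at index i; return (code point, next index)."""
    lead = units[i]
    n = unit_count(lead)
    trail = units[i + 1:i + n]
    if len(trail) != n - 1 or any(not TRAIL_BASE <= t <= TRAIL_END for t in trail):
        raise ValueError("truncated or malformed char")
    if n == 1:
        return lead, i + 1
    if n == 2:
        return 0x10000 + ((lead - HI_BASE) << 10) + (trail[0] - TRAIL_BASE), i + 2
    return (((lead - LEAD3_BASE) << 20)
            | ((trail[0] - TRAIL_BASE) << 10)
            | (trail[1] - TRAIL_BASE)), i + 3
```

Because every non-leading unit lies in DC00 - DFFF and every other unit
is a leading unit, a scanner can resynchronize from any position by
skipping units in that range, just as with UTF-8 continuation bytes.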

Note that A4xx is actually currently assigned to Yi.  Since this is an
internal representation, we could just move these elsewhere.

An alternative is to pick two disjoint ranges, e.g. 2D00 - 2DFF (256
units) and A500 - ABFF (1792 units), which together give the same 2048
leading units.

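
With two disjoint ranges, the 11-bit lead value has to be mapped onto
whichever range holds it.  A minimal sketch, assuming the two example
ranges are simply concatenated in order (the function names are
hypothetical):

```python
# Map an 11-bit lead index (0..2047) onto two disjoint unit ranges.
RANGES = [(0x2D00, 0x2DFF), (0xA500, 0xABFF)]   # 256 + 1792 = 2048 units


def lead_from_index(k):
    """Return the leading unit for lead index k."""
    if not 0 <= k < 2048:
        raise ValueError("lead index out of range")
    for lo, hi in RANGES:
        size = hi - lo + 1
        if k < size:
            return lo + k
        k -= size
    raise AssertionError("ranges do not cover 2048 indices")


def index_from_lead(unit):
    """Inverse mapping: recover the lead index from a leading unit."""
    base = 0
    for lo, hi in RANGES:
        if lo <= unit <= hi:
            return base + (unit - lo)
        base += hi - lo + 1
    raise ValueError("not a three-unit leading unit: %04X" % unit)
```

Property 2 still holds under this variant: a unit in either range
announces a three-unit char, so only the range test changes.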
Bill Tutt wrote:
> From: Ben Wing [mailto:ben@666.com]
>
> I wrote this last night:
>
>
> NOTE: One possible default internal representation that was compatible
> with UTF16 but allowed all possible chars in UCS4 would be to take an
> unused range of 2048 chars (not from the private area because Microsoft
> actually uses up most or all of it with EUDC chars). Let's say we picked
> 4000 - 47FF. Then, we'd have:
>
> 0000 - FFFF Simple chars
> D[8-B]xx D[C-F]xx Surrogate char, represents 1M chars
> 4[0-7]xx D[C-F]xx D[C-F]xx Surrogate char, represents 2G chars
>
> This is exactly the same number of chars as UCS-4 handles, and it follows
> the same property as UTF8 and Mule-internal:
>
> 1. There are two disjoint groupings of units, one representing leading
>    units and one representing non-leading units.
> 2. Given a leading unit, you immediately know how many units follow to
>    make up a valid char, irrespective of any other context.
>
>
There isn't an empty block of 2048 code points in the BMP at the moment.
See
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2213.pdf
(dated 2000-03-28)

The biggest open block I noticed is U+0000A500-U+0000ABFF.
The next biggest open block looks like U+00010900-U+00010FFF.
After that it's U+00011200-U+00011FFF. Both of these are in Plane 1.

Plane 1 Roadmap:
http://anubis.dkuug.dk/jtc1/sc2/wg2/docs/n2214.pdf

By "open" I mean that there isn't even a submitted proposal about what
should actually be encoded there.

Bill
--
Ben